7. Functions

This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Minimalist Data Wrangling with Python [27] too.

R is a functional language, i.e., one where functions play first fiddle. Each action we perform reduces itself to a call to some function or a combination thereof.

So far, we have been tinkering with dozens of available functions which were mostly part of base R. They constitute the essential vocabulary that everyone must be able to speak fluently.

Any operation, be it sum, sqrt, or paste, when fed with a number of arguments, generates a (hopefully fruitful) return value.

sum(1:10)  # invoking `sum` on a specific argument
## [1] 55

From a user’s perspective, each function is merely a tool. To achieve a goal at hand, we do not have to care about what is going on under its bonnet, i.e., how the inputs are being transformed so that, after a couple of nanoseconds or hours, we can relish what has been bred. This is very convenient: all we need to know is the function’s specification which can be stated, for example, informally, in plain Polish or Malay, on its help page.

In this chapter, we will learn how to write our own functions. Using this skill is a good development practice when we expect that the same operations will need to be executed many times but perhaps on different data.

Also, some functions invoke other procedures, for instance, on every element in a list or every section of a data frame grouped by a qualitative variable. Thus, it is advisable to learn how we can specify a custom operation to be propagated thereover.

Example 7.1

Given some objects (whatever):

x1 <- runif(16)
x2 <- runif(32)
x3 <- runif(64)

assume we want to apply the same action on different data, say, compute the root mean square. Then, instead of retyping almost identical expressions (or a bunch of them) over and over again:

sqrt(mean(x1^2))  # very fresh
## [1] 0.6545
sqrt(mean(x2^2))  # the same second time; borderline okay
## [1] 0.56203
sqrt(mean(x3^2))  # third time the same; tedious, barbarous, and error-prone
## [1] 0.57206

we can generalise the operation to any object like x:

rms <-                   # bind the name `rms` to...
    function(x)          # a function that takes one parameter, `x`
        sqrt(mean(x^2))  # transforming the input to yield output this way

and then reuse it on different concrete data instances:

rms(x1)
## [1] 0.6545
rms(x2)
## [1] 0.56203
rms(x3)
## [1] 0.57206

or even combine it with other function calls:

rms(sqrt(c(x1, x2, x3)))^2
## [1] 0.50824

Thus, custom functions are very useful.

Important

Does writing our own functions equal reinventing the wheel? Can everything be found online these days (including on Stack Overflow, GitHub, or CRAN)? Luckily, it is not the case. Otherwise, data analysts’, researchers’, and developers’ lives would be monotonous, dreary, and uninspiring. What is more, we might be able to compose a function from scratch much more quickly than to get through the whole garbage dump called the internet from where, only occasionally, we can dig out some pearls. Let’s remember that we advocate for minimalism in this book. We will reflect on such issues in Chapter 9. There is also the personal growth side: we become more skilled programmers by crunching those exercises.

7.1. Creating and invoking functions

7.1.1. Anonymous functions

Functions are usually created through the following notation:

function(args) body

First, args is a (possibly empty) list of comma-separated parameter names which act as input variables.

Second, body is a single R expression that is evaluated when the function is called. The value this expression yields will constitute the function’s output.

For example, here is a definition of a function that takes no inputs and generates a constant output:

function() 1
## function() 1

We thus created a function object. However, as we have not used it at all, it disappeared immediately thereafter.

Any function f can be invoked, i.e., evaluated on concrete data, using the syntax f(arg1, ..., argn). Here, arg1, …, argn are expressions passed as arguments to f.

(function() 1)()  # invoking f like f(); here, no arguments are expected
## [1] 1

Only now have we obtained a return value.

Note

(*) Calling typeof on a function object will report "closure" (user-defined functions), "builtin", or "special" (built-in, base ones) for the reasons that we explain in more detail in Section 9.4.3 and Section 16.3.2. In our case:

typeof(function() 1)
## [1] "closure"
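
For comparison, here is a quick check on a few functions from base R. The choice of mean, sum, and `if` is ours and merely illustrative; is.primitive tells us whether a function is implemented directly in compiled code.

```r
types <- c(
    typeof(mean),  # an ordinary function written in R
    typeof(sum),   # implemented in C; arguments are evaluated before the call
    typeof(`if`)   # implemented in C; arguments are not evaluated up front
)
print(types)
## [1] "closure" "builtin" "special"
is_prim_sum <- is.primitive(sum)    # "builtin" and "special" are both primitive
is_prim_mean <- is.primitive(mean)  # closures are not
print(c(is_prim_sum, is_prim_mean))
## [1]  TRUE FALSE
```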

7.1.2. Named functions

Names can be bound to function objects. This way, we can refer to them multiple times:

one <- function() 1  # one <- (function() 1)

We created an object named one (we use bold font to indicate that it is of the type function for functions are so crucial in R). We are very familiar with such a notation, as we have long been used to writing “x <- 1”, etc.

Invoking one, which can be done by writing one(), will generate a return value:

one()  # (function() 1)()
## [1] 1

This output can be used in further computations. For instance:

0:2 - one()  # 0:2 - (function() 1)(), i.e., 0:2 - 1
## [1] -1  0  1

7.1.3. Passing arguments to functions

Functions with no arguments are kind of boring. Thus, let’s distil a more highbrow operation:

concat <- function(x, y) paste(x, y, sep="")

We created a mapping whose aim is to concatenate two objects using a specialised call to paste. Yours faithfully pleads guilty to multiplying entities needlessly: it should not be a problem for anyone to write paste(x, y, sep="") each time. Yet, ’tis merely an illustration.

The concat function has two parameters, x and y. Hence, calling it will require the provision of two arguments, which we put within round brackets and separate from each other by commas.

u <- 1:5
concat("spam", u)  # i.e., concat(x="spam", y=1:5)
## [1] "spam1" "spam2" "spam3" "spam4" "spam5"

Important

Notice the distinction: parameters (formal arguments) are abstract, general, or symbolic; “something, anything that will be put in place of x when the function is invoked”. Contrastingly, arguments (actual parameters) are concrete, specific, and real.

During the above call, x in the function’s body is precisely "spam" and nothing else. Also, the u object from the caller’s environment can be accessed via y in concat. Most of the time (yet, see Section 16.3), it is best to think of the function as being fed not with u per se but the value that u is bound to, i.e., 1:5.

Also:

x <- 1:5
y <- "spam"
concat(y, x)  # concat(x="spam", y=1:5)
## [1] "spam1" "spam2" "spam3" "spam4" "spam5"

This call is equivalent to concat(x=y, y=x). The argument x is assigned the value of y from the calling environment, "spam". Let’s stress that one x is not the same as the other x; which is which is unambiguously defined by the context.
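
To see why it is convenient to think of a function as being fed with a value rather than with the caller's variable itself, consider the following minimal sketch. The helper bump_first is hypothetical and serves only as an illustration: it modifies its parameter, yet the object in the calling environment stays as it was.

```r
bump_first <- function(y)
{
    y[1] <- 100L  # changes the local `y` only
    y
}

u <- 1:5
v <- bump_first(u)
print(v)
## [1] 100   2   3   4   5
print(u)  # the caller's object is left intact
## [1] 1 2 3 4 5
```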

Exercise 7.2

Write a function standardise that takes a numeric vector x as argument and returns its standardised version, i.e., from each element in x, subtract the sample arithmetic mean and then divide it by the standard deviation.

Note

Section 2.1.3 mentioned that, syntactically speaking, the following are perfectly valid alternatives to the positionally-matched call concat("spam", u):

concat(x="spam", y=u)
concat(y=u, x="spam")
concat("spam", y=u)
concat(u, x="spam")
concat(x="spam", u)
concat(y=u, "spam")

However, we recommend avoiding the last two for the sake of the readers’ sanity. It is best to provide positionally-matched arguments before the keyword-based ones; see Section 15.4.4 for more details.

Also, Section 10.4 introduces the (overused) forward pipe operator, `|>`, which will enable us to rewrite the above as “"spam" |> concat(u)”.

7.1.4. Grouping expressions with curly braces, `{`

We have been informed that a function’s body is a single R expression whose evaluated value is passed to the user as its output. This may sound restrictive and in contrast with what we have experienced so far. Seldom are we faced with such simple computing tasks, and we have already seen R functions performing quite sophisticated operations.

Grammatically, a single R expression can be arbitrarily complex (Chapter 15). We can use curly braces to group many calls that are to be evaluated one after another. For instance:

{
    cat("first expression\n")
    cat("second expression\n")
    # ...
    cat("last expression\n")
}
## first expression
## second expression
## last expression

We used four spaces to visually indent the constituents for greater readability (some developers prefer tabs over spaces, others find two or three spaces more urbane, but we do not). This single (compound) expression can now play the role of a function’s body.

Important

The last expression evaluated in a curly-braces delimited block will be considered its output value.

x <- {
    1
    2
    3  # <--- last expression: will be taken as the output value
}
print(x)
## [1] 3

This code block can also be written more concisely by replacing newlines with semicolons, albeit with perhaps some loss in readability:

{1; 2; 3}
## [1] 3

Section 9.3 will give a few more details about `{`.

Example 7.3

Here is a version of our concat function that guarantees a more Chapter 2-style missing values’ propagation:

concat <- function(a, b)
{
    z <- paste(a, b, sep="")
    z[is.na(a) | is.na(b)] <- NA_character_
    z  # last expression in the block – return value
}

Example calls:

concat("a", 1:3)
## [1] "a1" "a2" "a3"
concat(NA_character_, 1:3)
## [1] NA NA NA
concat(1:6, c("a", NA_character_, "c"))
## [1] "1a" NA   "3c" "4a" NA   "6c"

Let’s appreciate the fact that we could keep the code brief thanks to paste’s and `|`’s implementing the recycling rule.

Exercise 7.4

Write a function normalise that takes a numeric vector x and returns its version shifted and scaled to the [0, 1] interval. To do so, subtract the sample minimum from each element, and then divide it by the range, i.e., the difference between the maximum and the minimum. Avoid computing min(x) twice.

Exercise 7.5

Write a function that applies the robust standardisation of a numeric vector: subtract the median and divide it by the median absolute deviation, 1.4826 times the median of the absolute differences between the values and their median.

Note

R is an open-source (free, libre) project distributed under the terms of the GNU General Public License version 2. Therefore, we are not only encouraged to run the software for whatever purpose, but also study and modify its source code without restrictions. To facilitate this, we can display all function definitions:

print(concat)  # the code of the above procedure
## function(a, b)
## {
##     z <- paste(a, b, sep="")
##     z[is.na(a) | is.na(b)] <- NA_character_
##     z  # last expression in the block – return value
## }
print(union)  # a built-in function
## function (x, y)
## {
##     u <- as.vector(x)
##     v <- as.vector(y)
##     unique(c(u, v))
## }
## <environment: namespace:base>

Nevertheless, some functionality might be implemented in compiled programming languages such as C, C++, or Fortran; notice a call to .Internal in the source code of paste, .Primitive in list, or .Call in runif. Therefore, we will sometimes have to dig a bit deeper to access the underlying definition; see Chapter 14 for more details.

7.2. Functional programming

R is a functional programming language. As such, it shares several features with other languages that emphasise the role of function manipulation in software development (e.g., Common Lisp, Scheme, OCaml, Haskell, Clojure, F#). Let’s explore these commonalities now.

7.2.1. Functions are objects

R functions were given the right to a fair go; they are what we refer to as first-class citizens. In other words, our interaction with them is not limited to their invocation; we treat them as any other language object.

  • They can be stored inside list objects, which can embrace R objects of any kind:

    list(identity, NROW, sum)  # a list storing three functions
    ## [[1]]
    ## function (x)
    ## x
    ## <environment: namespace:base>
    ##
    ## [[2]]
    ## function (x)
    ## if (length(d <- dim(x))) d[1L] else length(x)
    ## <environment: namespace:base>
    ##
    ## [[3]]
    ## function (..., na.rm = FALSE)  .Primitive("sum")
    
  • They can be created and then called inside another function’s body:

    euclidean_distance <- function(x, y)
    {
        square <- function(z) z^2  # auxiliary/internal/helper function
        sqrt(sum(square(x-y)))     # square root of the sum of squares
    }
    
    euclidean_distance(c(0, 1), c(1, 0))  # example call
    ## [1] 1.4142
    

    This is why we tend to classify functions as representatives of recursive types (compare is.recursive).

  • They can be passed as arguments to other operations:

    # Replaces missing values with a given aggregate
    # of all non-missing elements:
    fill_na <- function(x, filler_fun)
    {
        missing_ones <- is.na(x)  # otherwise, we'd have to call is.na twice
        replacement_value <- filler_fun(x[!missing_ones])
        x[missing_ones] <- replacement_value
        x
    }
    
    fill_na(c(0, NA_real_, NA_real_, 2, 3, 7, NA_real_), mean)
    ## [1] 0 3 3 2 3 7 3
    fill_na(c(0, NA_real_, NA_real_, 2, 3, 7, NA_real_), median)
    ## [1] 0.0 2.5 2.5 2.0 3.0 7.0 2.5
    

    Procedures like this are called higher-order functions.

Note

More advanced techniques, which we will discuss in the third part of the book, will let the functions be:

  • returned as other function’s outputs,

  • equipped with auxiliary data,

  • generated programmatically on the fly,

  • modified at runtime.

Let’s review the most essential higher-order functions, including do.call and Map.

7.2.2. Calling on precomputed arguments with do.call

Notation like f(arg1, ..., argn) has no monopoly over how we call a function on a specific sequence of arguments. The list of actual parameters does not have to be hardcoded.

Here is an alternative. We can first prepare a number of objects to be passed as f’s inputs, wrap them in a list l, and then invoke do.call(f, l) to get the same result.

words <- list(
    c("spam",      "bacon",  "eggs"),
    c("buckwheat", "quinoa", "barley"),
    c("ham",       "spam",   "spam")
)
do.call(paste, words)  # paste(words[[1]], words[[2]], words[[3]])
## [1] "spam buckwheat ham" "bacon quinoa spam"  "eggs barley spam"
do.call(cbind, words)  # column-bind; returns a matrix (explained later)
##      [,1]    [,2]        [,3]
## [1,] "spam"  "buckwheat" "ham"
## [2,] "bacon" "quinoa"    "spam"
## [3,] "eggs"  "barley"    "spam"
do.call(rbind, words)  # row-bind (explained later)
##      [,1]        [,2]     [,3]
## [1,] "spam"      "bacon"  "eggs"
## [2,] "buckwheat" "quinoa" "barley"
## [3,] "ham"       "spam"   "spam"

The length and content of the list passed as the second argument of do.call can be arbitrary (possibly unknown at the time of writing the code). See Section 12.1.2 for more use cases, e.g., ways to concatenate a list of data frames (perhaps produced by some complex chain of commands) into a single data frame.

If elements of the list are named, they will be matched to the corresponding keyword arguments.

x <- 2^(seq(-2, 2, length.out=101))
plot_opts <- list(col="red", lty="dashed", type="l")
do.call(plot, c(list(x, log2(x), xlab="x", ylab="log2(x)"), plot_opts))
## (plot display suppressed)

Notice that our favourite plot_opts can now be reused in further calls to graphics functions. This is very convenient as it avoids repetitions.
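
The same mechanism can be tried without a graphics device. In the sketch below, parts and paste_opts are names made up for this illustration; the element named sep is matched to paste's keyword argument of the same name, whereas the unnamed elements are passed positionally.

```r
parts <- list(
    c("spam", "bacon", "eggs"),
    c("ham",  "spam",  "spam")
)
paste_opts <- list(sep="-")  # a named element: matched to the `sep` argument
res <- do.call(paste, c(parts, paste_opts))  # paste(parts[[1]], parts[[2]], sep="-")
print(res)
## [1] "spam-ham"   "bacon-spam" "eggs-spam"
```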

7.2.3. Common higher-order functions

There is an important class of higher-order functions that permit us to apply custom operations on consecutive elements of sequences without relying on loop-like statements, at least explicitly. They can be found in all functional programming languages (e.g., Lisp, Haskell, Scala) and have been ported to various add-on libraries (functools in Python, more recent versions of the C++ Standard Library, etc.) or frameworks (Apache Spark and the like). Their presence reflects the obvious truth that certain operations occur more frequently than others. In particular:

  • Map calls a function on each element of a sequence in order to transform:

    • their individual components (just like sqrt, round, or the unary `!` operator in R), or

    • the corresponding elements of many sequences so as to vectorise a given operation elementwisely (compare the binary `+` or paste),

  • Reduce (also called accumulate) applies a binary operation to combine consecutive elements in a sequence, e.g., to generate the aggregates, like, totally (compare sum, prod, all, max) or cumulatively (compare cumsum, cummin),

  • Filter creates a subset of a sequence that is comprised of elements that enjoy a given property (which we typically achieve in R by means of the `[` operator),

  • Find locates the first element that fulfils some logical condition (compare which).

Below we will only focus on the Map function. The inspection of the remaining ones is left as an exercise. This is because, oftentimes, we can be better off with their more R-ish versions (e.g., using the subsetting operator, `[`).
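
Without pre-empting that exercise, here is a brief sketch of how the three remaining functions can be called. The vector v and the predicate gt3 are arbitrary choices made for this illustration; the comments point to the more R-ish counterparts.

```r
v <- c(3, 8, 1, 6, 4)
gt3 <- function(e) e > 3  # a predicate: is the element greater than 3?

total <- Reduce(`+`, v)                     # ((((3+8)+1)+6)+4); compare sum(v)
running <- Reduce(`+`, v, accumulate=TRUE)  # compare cumsum(v)
kept <- Filter(gt3, v)                      # compare v[v > 3]
first <- Find(gt3, v)                       # compare v[v > 3][1]

print(total)
## [1] 22
print(running)
## [1]  3 11 12 18 22
print(kept)
## [1] 8 6 4
print(first)
## [1] 8
```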

7.2.4. Vectorising functions with Map

In data-centric computing, we are frequently faced with tasks that involve processing each vector element independently, one after another. Such use cases can benefit from vectorised operations like those discussed in Chapter 2, Chapter 3, and Chapter 6.

Unfortunately, most of the functions that we introduced so far cannot be applied on lists. For instance, if we try calling sqrt on a generic vector, we will get an error, even if it is a list of numeric sequences only. One way to compute the square root of all elements would be to invoke sqrt(unlist(...)). It is a go-to approach if we want to treat all the list’s elements as one sequence. However, this comes at the price of losing the list’s structure.
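
A small illustration of both points follows; the list l is an arbitrary example, and we intercept the error with tryCatch only so that the snippet runs to completion.

```r
l <- list(c(1, 4), c(9, 16, 25))  # a list of two numeric vectors
res <- tryCatch(
    sqrt(l),                       # sqrt does not accept lists...
    error=function(e) "error"      # ...so this handler is triggered
)
print(res)
## [1] "error"
flat <- sqrt(unlist(l))  # works, but the grouping into two vectors is gone
print(flat)
## [1] 1 2 3 4 5
```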

We have also discussed a few operations that are not vectorised with respect to all their arguments, even though they could have been designed this way, e.g., grepl.

The Map function[1] applies an operation on each element in a vector or the corresponding elements in a number of vectors. In many situations, it may be used as a more elegant alternative to for loops that we will introduce in the next chapter.

First[2], a call to Map(f, x) yields a list whose \(i\)-th element is equal to f(x[[i]]) (recall that `[[` works on atomic vectors too). For example:

x <- list(  # an example named list
    x1=1:3,
    x2=seq(0, 1, by=0.25),
    x3=c(1, 0, NA_real_, 0, 0, 1, NA_real_)
)
Map(sqrt, x)  # x is named, hence the result will be named as well
## $x1
## [1] 1.0000 1.4142 1.7321
##
## $x2
## [1] 0.00000 0.50000 0.70711 0.86603 1.00000
##
## $x3
## [1]  1  0 NA  0  0  1 NA
Map(length, x)
## $x1
## [1] 3
##
## $x2
## [1] 5
##
## $x3
## [1] 7
unlist(Map(mean, x))  # compute three aggregates, convert to an atomic vector
##  x1  x2  x3
## 2.0 0.5  NA
Map(function(n) round(runif(n, -1, 1), 1), c(2, 4, 6))  # x is atomic now
## [[1]]
## [1] 0.4 0.8
##
## [[2]]
## [1]  0.5  0.8 -0.1 -0.7
##
## [[3]]
## [1] -0.3  0.0  0.5  1.0 -0.9 -0.7

Next, we can vectorise a given function over several parameters. A call to, e.g., Map(f, x, y, z) breeds a list whose \(i\)-th element is equal to f(x[[i]], y[[i]], z[[i]]). Like in the case of, e.g., paste, the recycling rule will be applied if necessary.

For example, the following generates list(seq(1, 6), seq(11, 13), seq(21, 29)):

Map(seq, c(1, 11, 21), c(6, 13, 29))
## [[1]]
## [1] 1 2 3 4 5 6
##
## [[2]]
## [1] 11 12 13
##
## [[3]]
## [1] 21 22 23 24 25 26 27 28 29

Moreover, we can get list(seq(1, 40, length.out=10), seq(11, 40, length.out=5), seq(21, 40, length.out=10), seq(31, 40, length.out=5)) by calling:

Map(seq, c(1, 11, 21, 31), 40, length.out=c(10, 5))
## [[1]]
##  [1]  1.0000  5.3333  9.6667 14.0000 18.3333 22.6667 27.0000 31.3333
##  [9] 35.6667 40.0000
##
## [[2]]
## [1] 11.00 18.25 25.50 32.75 40.00
##
## [[3]]
##  [1] 21.000 23.111 25.222 27.333 29.444 31.556 33.667 35.778 37.889 40.000
##
## [[4]]
## [1] 31.00 33.25 35.50 37.75 40.00
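
The same idea lets us vectorise an operation such as grepl, which does not process its pattern argument elementwise. Below is a minimal sketch in which the patterns and strings are made up for illustration; the i-th pattern is matched against the i-th string only. We drop the names that Map derives from its first character argument.

```r
patterns <- c("^s", "^a", "gg")
strings  <- c("spam", "bacon", "eggs")
res <- unlist(Map(grepl, patterns, strings), use.names=FALSE)
print(res)  # grepl("^s", "spam"), grepl("^a", "bacon"), grepl("gg", "eggs")
## [1]  TRUE FALSE  TRUE
```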

Note

If we have some additional arguments to be passed to the function applied (which it does not have to be vectorised over), we can wrap them inside a separate list and toss it via the MoreArgs argument (à la do.call).

unlist(Map(mean, x, MoreArgs=list(na.rm=TRUE)))  # mean(..., na.rm=TRUE)
##  x1  x2  x3
## 2.0 0.5 0.4

Alternatively, we can always construct a custom anonymous function:

unlist(Map(function(xi) mean(xi, na.rm=TRUE), x))
##  x1  x2  x3
## 2.0 0.5 0.4

Exercise 7.6

Here is an example list of files (see our teaching data repository) with daily Forex rates:

file_names <- c(
    "euraud-20200101-20200630.csv",
    "eurgbp-20200101-20200630.csv",
    "eurusd-20200101-20200630.csv"
)

Call Map to read them with scan. Determine each series’ minimal, mean, and maximal value.

Exercise 7.7

Implement your version of the Filter function based on a call to Map.

7.3. Accessing third-party functions

When we indulge in the writing of a software piece, a few questions naturally arise. Is the problem we are facing fairly complex? Has it already been successfully addressed in its entirety? If not, can it, or its parts, be split into manageable chunks? Can it be constructed based on some readily available nontrivial components?

A smart developer is independent but knows when to stand on the shoulders to cry on. Let’s explore a few ways to reuse the existing function libraries.

7.3.1. Using R packages

Most contributed R extensions come in the form of add-on packages, which can include:

  • reusable code (e.g., new functions),

  • data (which we can exercise on),

  • documentation (manuals, vignettes, etc.);

see Section 9.2.2 for more and Writing R Extensions [65] for all the details.

Most packages are published in the moderated repository that is part of the Comprehensive R Archive Network (CRAN). However, there are also other popular sources such as Bioconductor which specialises in bioinformatics.

We call install.packages("pkg") to fetch a package pkg from a repository (CRAN by default; see, however, the repos argument).

A call to library("pkg") loads an indicated package and makes the exported objects available to the user (i.e., attaches it to the search path; see Section 16.2.6).

For instance, in one of the previous chapters, we have mentioned the gsl package:

# call install.packages("gsl") first
library("gsl")  # load the package
poch(10, 3:6)   # calls gsl_sf_poch() from GNU GSL
## [1]    1320   17160  240240 3603600

Here, poch is an object exported by package gsl. If we did not call library("gsl"), trying to access the former would raise an error.

We could have also accessed the preceding function without attaching it to the search path using the pkg::object syntax, namely, gsl::poch.
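
The pkg::object notation can be tried on a package that ships with every R installation. In this sketch, the use of stats::median and the guard based on requireNamespace (which reports whether a package is installed without attaching it) are our illustrative choices.

```r
m <- stats::median(c(3, 1, 2))  # works whether or not stats is attached
print(m)
## [1] 2
# the call below is attempted only if gsl happens to be installed:
if (requireNamespace("gsl", quietly=TRUE)) print(gsl::poch(10, 3))
```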

Note

For more information about any R extension, call help(package="pkg"). Also, it is advisable to visit the package’s CRAN entry at an address like https://CRAN.R-project.org/package=pkg to access additional information, e.g., vignettes. Why waste our time and energy by querying a web search engine that will likely lead us to a dodgy middleman when we can acquire authoritative knowledge directly from the source?

Moreover, it is worth exploring various CRAN Task Views that group the packages into topics such as Genetics, Graphics, and Optimisation. They are curated by experts in their relevant fields.

Important

Frequently, R packages are written in their respective authors’ free time, many of whom are volunteers. They neither get paid for this nor do it as part of their so-called job. Yes, not everyone is driven by money or fame.

Someday, when we come up with something valuable for the community, we will become one of them. Before this happens, we can show appreciation for their generosity by, e.g., spreading the word about their software by citing it in publications (see citation(package="pkg")), talking about them during lunchtime, or mentioning them in (un)social media. We can also help them improve the existing code base by reporting bugs, polishing documentation, proposing new features, or cleaning up the redundant fragments of their APIs.

7.3.1.1. Default packages

The base package is omnipresent. It provides us with the most crucial functions such as the vector addition, c, Map, and library. Certain other extensions are also loaded by default:

getOption("defaultPackages")
## [1] "datasets"  "utils"     "grDevices" "graphics"  "stats"
## [6] "methods"

In this book, we assume that they are always attached (even though this list can, theoretically, be changed[3]). Due to this, in Section 2.4.5, there was no need to call, for example, library("stats") before referring to the var and sd functions.

On a side note, grDevices and graphics will be discussed in Chapter 13. methods will be mentioned in Section 10.5. datasets brings a few example R objects on which we can exercise our skills. The functions from utils, graphics, and stats already appeared here and there.

Exercise 7.8

Use the find function to determine which packages define mean, var, find, and Map. Recall from Section 1.4 where such information can be found in these objects’ manual pages.

7.3.1.2. Source vs binary packages (*)

R is an open environment. Therefore, its packages are published primarily in the source form. This way, anyone can study how they work and improve them or reuse parts thereof in different projects.

If we call install.packages("path", repos=NULL, type="source"), we should be able to install a package from sources: path can be pinpointing either a directory or a source tarball (most often as a compressed pkg_version.tar.gz file; see help("untar")).

Note that type="source" is the default unless one is on a W****ws or m**OS box; see getOption("pkgType"). This is because these two operating systems require additional build tools, especially if a package relies on C or C++ code; see Chapter 14 and Section C.3 of [67].

These systems are less developer-orientated. Thus, as a courtesy to their users, CRAN also distributes the platform-specific binary versions of the packages (.zip or .tgz files). install.packages will try to fetch them by default.

Example 7.9

It is very easy to retrieve a package’s source directly from GitLab and GitHub, which are popular hosting platforms. The relevant links are, respectively:

  • https://gitlab.com/user/repo/-/archive/branch/repo-branch.zip,

  • https://github.com/user/repo/archive/branch.zip.

For example, to download the contents of the master branch in the GitHub repository rpackagedemo owned by gagolews, we can call:

f <- tempfile()  # download destination: a temporary file name
download.file("https://github.com/gagolews/rpackagedemo/archive/master.zip",
    destfile=f)

Next, the contents can be extracted with unzip:

t <- tempdir()  # temporary directory for extracted files
(d <- unzip(f, exdir=t))  # returns extracted file paths

The path where the files were extracted can be passed to install.packages:

install.packages(dirname(d)[1], repos=NULL, type="source")
file.remove(c(f, d))  # clean up

Exercise 7.10

Use the git2r package to clone the git repository located at https://github.com/gagolews/rpackagedemo.git and install the package published therein.

7.3.1.3. Managing dependencies (*)

By calling update.packages, all installed add-on packages may be upgraded to their most recent versions available on CRAN or another indicated repository.

As a general rule, the more experienced we become, the less excited we get about the new. Sure, bug fixes and well-thought-out additional features are usually welcome. Still, just wait until someone updates a package’s API for the \(n\)-th time, \(n \ge 2\), breaking our so-far flawless program.

Hence, when designing software projects (see Chapter 9 for more details), we must ask ourselves the ultimate question: do we really need to import that package with lots of dependencies from which we will use only about 3–5 functions? Wouldn’t it be better to write our own version of some functionality (and learn something new, exercise our brain, etc.), or call a mature terminal-based tool?

Otherwise, as all the historical versions of the packages are archived on CRAN, simple software dependency management can be conducted by storing different releases of packages in different directories. This way, we can create an isolated environment for the add-ons. To fetch the locations where packages are sought (in this very order), we call:

.libPaths()
## [1] "/home/gagolews/R/x86_64-pc-linux-gnu-library/4.3"
## [2] "/usr/local/lib/R/site-library"
## [3] "/usr/lib/R/site-library"
## [4] "/usr/lib/R/library"

The same function can add new folders to the search path; see also the environment variable R_LIBS_USER that we can set using Sys.setenv. The install.packages function will honour them as target directories; see its lib parameter for more details. Note that only one version of a package can be loaded at a time, though.
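
For instance, an isolated library folder could be set up along the following lines. This is only a sketch: proj_lib is a hypothetical location (here, inside the session's temporary directory), and the installation step is commented out so that nothing gets downloaded.

```r
old_paths <- .libPaths()                         # remember the current setting
proj_lib <- file.path(tempdir(), "project-lib")  # hypothetical project-local folder
dir.create(proj_lib, showWarnings=FALSE)         # the folder must exist
.libPaths(c(proj_lib, old_paths))                # prepend: it will be searched first
new_paths <- .libPaths()
# install.packages("pkg", lib=proj_lib)          # would install into proj_lib only
.libPaths(old_paths)                             # restore the previous search path
```

Placing the new folder at the beginning of the vector means that a package version stored there takes precedence over the ones installed elsewhere.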

Moreover, the packages may deposit auxiliary data on the user’s machine. Therefore, it might be worthwhile to set the following directories (via the corresponding environment variables) relative to the current project:

tools::R_user_dir("pkg", "data")    # R_USER_DATA_DIR
## [1] "/home/gagolews/.local/share/R/pkg"
tools::R_user_dir("pkg", "config")  # R_USER_CONFIG_DIR
## [1] "/home/gagolews/.config/R/pkg"
tools::R_user_dir("pkg", "cache")   # R_USER_CACHE_DIR
## [1] "/home/gagolews/.cache/R/pkg"
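
As a sketch of such a redirection, we can point R_USER_DATA_DIR at a project-specific folder and query the location again. The proj_data path is hypothetical; the previous value of the variable is restored afterwards.

```r
proj_data <- file.path(tempdir(), "project-data")  # hypothetical project folder
old <- Sys.getenv("R_USER_DATA_DIR", unset=NA)     # NA if the variable was not set
Sys.setenv(R_USER_DATA_DIR=proj_data)
d <- tools::R_user_dir("pkg", "data")              # now resolved under proj_data
if (is.na(old)) Sys.unsetenv("R_USER_DATA_DIR") else Sys.setenv(R_USER_DATA_DIR=old)
print(startsWith(d, proj_data))
## [1] TRUE
```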

7.3.2. Calling external programs

Many tasks can be accomplished by calling external programs. Such an approach is particularly natural on UNIX-like systems, which classically follow modular, minimalist design patterns. There are many tools at a developer’s hand and each tool is specialised in solving a single, well-defined problem.

Apart from the many standard UNIX commands, we may consider:

  • pandoc converts documents between markup formats, e.g., Markdown, HTML, reStructuredText, and LaTeX and can generate LibreOffice Writer documents, EPUB or PDF files, or slides;

  • jupyter-nbconvert converts Jupyter notebooks (see Section 1.2.5) to other formats such as LaTeX, HTML, Markdown, etc.;

  • convert (from ImageMagick) applies various operations on bitmap graphics (scaling, cropping, conversion between formats);

  • graphviz and PlantUML draw graphs and diagrams;

  • python, perl, … can be called to perform tasks that can be expressed more easily in languages other than R.

The good news is that we are not limited to calling R from the system shell in the interactive or batch mode; see Section 1.2. Our environment serves well as a glue language too.

The system2 function invokes a system command. Communication between such programs may be done using, e.g., intermediate text, JSON, CSV, XML, or any other files. The stdin, stdout, and stderr arguments control the redirection of the standard I/O streams.

system2("pandoc", "-s input.md -o output.html")
system2("bash", "-c 'for i in `seq 1 2 10`; do echo $i; done'", stdout=TRUE)
## [1] "1" "3" "5" "7" "9"
system2("python3", "-", stdout=TRUE,
    input=c(
    "import numpy as np",
    "print(repr(np.arange(5)))"
    ))
## [1] "array([0, 1, 2, 3, 4])"

On a side note, the current working directory can be read and changed through a call to getwd and setwd, respectively. By default, it is the directory where the current R session was started.
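
The above examples require pandoc, bash, and python3. One program that is present wherever R is installed is R itself. The sketch below assumes that the Rscript executable resides in the directory reported by R.home("bin"); it launches a separate R process and captures its standard output as a character vector.

```r
rscript <- file.path(R.home("bin"), "Rscript")  # belongs to the running R installation
out <- system2(rscript,
    c("--vanilla", "-e", shQuote("writeLines(format(6*7))")),  # code for the child process
    stdout=TRUE)  # capture the output instead of printing it
print(out)
## [1] "42"
```

Here, shQuote protects the R expression from being interpreted by the system shell.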

Important

Relying on system2 assumes that the commands it refers to are available on the target platform. Hence, it might not be portable unless additional assumptions are made, e.g., that a user runs a UNIX-like system and that certain libraries are available. We strongly recommend GNU/Linux or FreeBSD for both software development and production use, as they are free, open, developer-friendly, user-loving, reliable, ethical, and sustainable. Users of other operating systems are missing out on so many good features.
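One possible way to fail early when a required program is missing is to consult Sys.which before calling system2. The wrapper below, run_if_available, is a hypothetical helper written for this illustration; it is not part of base R.

```r
# run_if_available is a made-up name; it wraps system2 with a prior check
run_if_available <- function(command, args=character(0), ...)
{
    path <- unname(Sys.which(command))  # "" if not found on the search path
    if (!nzchar(path))
        stop(sprintf("command '%s' is not available on this system", command))
    system2(path, args, ...)
}

# usage: yields pandoc's version banner or an informative error message
out <- tryCatch(
    run_if_available("pandoc", "--version", stdout=TRUE),
    error=function(e) conditionMessage(e)
)
```

Such a check does not make the code portable, but at least it reports the unmet assumption explicitly.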

7.3.3. Interfacing C, C++, Fortran, Python, Java, etc. (**)#

Most standalone data processing algorithms are implemented in compiled, slightly lower-level programming languages. This usually makes them faster and more reusable in other environments. For instance, an industry-standard library might be written in very portable C, C++, or Fortran and define bindings for easier access from within R, Python, Julia, etc. It is the case with FFTW, LIBSVM, mlpack, OpenBLAS, ICU, and GNU GSL, amongst many others. Chapter 14 explains basic ways to refer to such compiled code.

Also, the rJava package can dynamically create JVM objects and access their fields and methods. Similarly, reticulate can be used to access Python objects, including numpy arrays and pandas data frames (but see also the rpy2 package for Python).

Important

We should not feel obliged to use R in all parts of a data processing pipeline. Some activities can be expressed more naturally in other languages or environments (e.g., parse raw data and create a SQL database in Python but visualise it in R).

7.4. Exercises#

Exercise 7.11

Answer the following questions.

  • What is the result of “{x <- "x"; x <- function(x) x; x(x)}”?

  • How to compose a function that returns two objects?

  • What is a higher-order function?

  • What are the use cases of do.call?

  • Why is a call to Map redundant in the expression Map(paste, x, y, z)?

  • What is the difference between Map(mean, x, na.rm=TRUE) and Map(mean, x, MoreArgs=list(na.rm=TRUE))?

  • What do we mean when we write stringx::sprintf?

  • How to get access to the vignettes (tutorials, FAQs, etc.) of the data.table and dplyr packages? Why perhaps 95% of R users would just google it, and what is suboptimal about this strategy?

  • What is the difference between a source and a binary package?

  • How to update the base package?

  • How to ensure that we will always run an R session with only specific versions of a set of packages?

Exercise 7.12

Write a function that computes the Gini index of a vector of positive integers x, which, assuming \(x_1\le x_2\le\dots\le x_n\), is equal to:

\[G(x_1,\dots,x_n) = \frac{ \sum_{i=1}^{n} (n-2i+1) x_{n-i+1} }{ (n-1) \sum_{i=1}^n x_i }.\]

Exercise 7.13

Implement a function between(x, a, b) that verifies whether each element in x is in the [a, b] interval. Return a logical vector of the same length as x. Ensure the function is correctly vectorised with respect to all the arguments and handles missing data correctly.

Exercise 7.14

Write your version of the strrep function called dup.

dup <- ...to.do...
dup(c("a", "b", "c"), c(1, 3, 5))
## [1] "a"     "bbb"   "ccccc"
dup("a", 1:3)
## [1] "a"   "aa"  "aaa"
dup(c("a", "b", "c"), 4)
## [1] "aaaa" "bbbb" "cccc"

Exercise 7.15

Given a list x, generate its sublist with all the elements equal to NULL removed.

Exercise 7.16

Implement your version of the sequence function.

Exercise 7.17

Using Map, how can we generate window indexes like below?

## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 2 3 4
##
## [[3]]
## [1] 3 4 5
##
## [[4]]
## [1] 4 5 6

Write a function windows(k, n) that yields index windows of length \(k\) with elements between \(1\) and \(n\) (the above example is for \(k=3\) and \(n=6\)).

Exercise 7.18

Write a function to extract all \(q\)-grams, \(q \ge 1\), from a given character vector. Return a list of character vectors. For example, bigrams (2-grams) in "abcd" are: "ab", "bc", and "cd".

Exercise 7.19

Implement a function movstat(f, x, k) that computes, using Map, a given aggregate f of each \(k\) consecutive elements in x. For instance:

movstat <- ...to.do...
x <- c(1, 3, 5, 10, 25, -25)  # example data
movstat(mean, x, 3)           # 3-moving mean
## [1]  3.0000  6.0000 13.3333  3.3333
movstat(median, x, 3)         # 3-moving median
## [1]  3  5 10 10

Exercise 7.20

Recode a character vector with a small number of distinct values to a vector where each unique code is assigned a positive integer from \(1\) to \(k\). Here are example calls and the corresponding expected results:

recode <- ...to.do...
recode(c("a", "a", "a", "b", "b"))
## [1] 1 1 1 2 2
recode(c("x", "z", "y", "x", "y", "x"))
## [1] 1 3 2 1 2 1

Exercise 7.21

Implement a function that returns the number of occurrences of each unique element in a given atomic vector. The return value should be a numeric vector equipped with the names attribute. Hint: use match and tabulate.

count <- ...to.do...
count(c(5, 5, 5, 5, 42, 42, 954))
##   5  42 954
##   4   2   1
count(c("x", "z", "y", "x", "y", "x", "w", "x", "x", "y", NA_character_))
##    w    x    y    z <NA>
##    1    5    3    1    1

Exercise 7.22

Extend the built-in duplicated function. For each vector element, indicate which occurrence of a repeated value it is (starting from the beginning of the vector).

duplicatedn <- ...to.do...
duplicatedn(c("a", "a", "a", "b", "b"))
## [1] 1 2 3 1 2
duplicatedn(c("x", "z", "y", "x", "y", "x", "w", "x", "x", "y", "z"))
##  [1] 1 1 1 2 2 3 1 4 5 3 2

Exercise 7.23

Based on a call to Map, implement your version of split that takes two atomic vectors as arguments. Then, extend it to handle the second argument being a list of the form list(y1, y2, ...) representing the product of many levels. If the \(y\)s are of different lengths, apply the recycling rule.

Exercise 7.24

Implement my_unsplit being your version of unsplit. For any x and g of the same lengths, ensure that my_unsplit(split(x, g), g) is equal to x.

Exercise 7.25

Write a function that takes as arguments: (a) an integer \(n\), (b) a numeric vector x of length \(k\) and no duplicated elements, (c) a vector of probabilities p of length \(k\). Verify that \(p_i\ge 0\) for all \(i\) and \(\sum_{i=1}^k p_i \simeq 1\). Based on a random number generator from the uniform distribution on the unit interval, generate \(n\) independent realisations of a random variable \(X\) such that \(\Pr(X=x_i)=p_i\) for \(i=1,\dots,k\). To obtain a single value:

  1. generate \(u\in[0, 1]\),

  2. find \(m\in\{1,\dots,k\}\) such that \(u\in\left(\sum_{j=1}^{m-1} p_{j}, \sum_{j=1}^m p_{j}\right]\),

  3. the result is then \(x_m\).

Exercise 7.26

Write a function that takes as arguments: (a) an increasingly sorted vector x of length \(n\), (b) any vector y of length \(n\), (c) a vector z of length \(k\) and elements in \([x_1,x_n)\). Let \(f\) be the piecewise linear spline that interpolates the points \((x_1,y_1),\dots,(x_n,y_n)\). Return a vector w of length \(k\) such that \(w_i=f(z_i)\).

Exercise 7.27

(*) Write functions dpareto, ppareto, qpareto, and rpareto that implement the functions related to the Pareto distribution; compare Section 2.3.4.