7. Functions

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of Chapters 1–12 are already complete, but there will be more. In the meantime, any bug/typos reports/fixes are appreciated. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [20].

R is a functional language, where functions play first fiddle. Each action we perform reduces itself to a call to some function, or a combination thereof.

So far we have been tinkering with dozens of available functions which are part of base R, with only a few exceptions. They constitute the essential vocabulary that everyone must be able to speak fluently.

Any operation, be it sum, sqrt, or paste, when fed with a number of arguments, generates some (hopefully useful) return value.

sum(1:10)  # invoking `sum` on a specific argument
## [1] 55

From a user’s perspective, each function is merely a tool. To achieve a goal at hand, we do not really have to care about what is going on under its hood, i.e., how the inputs are actually being transformed so that, after a couple of nanoseconds or hours, we can enjoy what has been yielded. This is very convenient: all we need to know is the function’s specification which can be stated, for example, informally, in plain Polish or Malay, in its help page.

In this chapter, we will learn how to write our own functions. The use of this skill is a good development practice when we expect that some operations are to be executed many times but perhaps on different data.

Also, some R functions are meant to invoke other functions, for instance on every element in a list or every section of a data frame grouped by a qualitative variable, so it is good to learn how we can specify a custom operation to be propagated thereover.

Example 7.1

Given some objects (whatever):

x1 <- runif(16)
x2 <- runif(32)
x3 <- runif(64)

when we want to apply the same action on different data, say, compute the root mean square, instead of re-typing almost identical expressions (or a bunch of them) over and over again:

sqrt(mean(x1^2))
## [1] 0.6545
sqrt(mean(x2^2))  # the same second time - borderline okay
## [1] 0.56203
sqrt(mean(x3^2))  # tedious, barbarous, and error-prone
## [1] 0.57206

we can generalise the operation to any object like x:

rms <-                   # bound what follows to name `rms`
    function(x)          # a function that takes one parameter, `x`
        sqrt(mean(x^2))  # expression to transform the input to yield output

and then re-use it on different concrete data instances:

rms(x1)
## [1] 0.6545
rms(x2)
## [1] 0.56203
rms(x3)
## [1] 0.57206

or even combine it with other function calls:

rms(sqrt(c(x1, x2, x3)))^2
## [1] 0.50824

Important

Does writing your own functions equal reinventing the wheel? Can everything be found on the internet these days (including on Stack Overflow, GitHub, or CRAN)?

Luckily, this is not the case. Otherwise, data analysts’, researchers’, and developers’ lives could be considered monotonous, dreary, and uninspiring. Plus, sometimes it is much quicker to write a function from scratch than to get through the whole garbage dump from where, only occasionally, we can dig out some pearls. Not to mention the self-educative side: we become better programmers by crunching those exercises. We are advocating for minimalism here, remember?

This and many other important issues in function design will be reflected upon in Chapter 9.

7.1. Creating and Invoking Functions

7.1.1. Anonymous Functions

Functions are usually created by means of the following notation:

function(args) body

First, args is a (possibly empty) list of comma-separated parameter names which are supposed to act as input variables.

Second, body is a single R expression which will be evaluated when the function is called. The value that this expression yields will constitute the function’s output.

For example, here is a definition of a function which takes no inputs and generates a constant output:

function() 1
## function() 1

We thus created a function object. However, it has disappeared immediately thereafter, as we have not used it at all.

Any function, say, f can be invoked, i.e., evaluated on concrete data, by using the notation f(arg1, ..., argn), where “arg1, ..., argn” are the arguments to be passed to f.

(function() 1)()  # invoking f like f(); here, no arguments are expected
## [1] 1

Only now have we obtained a return value.
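The same notation works when a function expects some inputs. Here is a minimal sketch, where the parameter names a and b are chosen purely for illustration, of an anonymous function that is defined and invoked within a single expression:

```r
# define an anonymous two-parameter function and call it immediately
res <- (function(a, b) a + 2*b)(1, 10)  # a is 1, b is 10
print(res)
## [1] 21
```

The round brackets around the definition are needed so that the trailing (1, 10) is applied to the whole function object and not merely to its body.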

Note

(*) Calling typeof on a function object will report "closure" (for user-defined functions), or "builtin" or "special" (for some built-in, base ones), for the reasons that we explain in more detail[1] in Section 9.5.3:

typeof(function() 1)
## [1] "closure"
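As a quick check, assuming a standard R installation, we can compare what typeof reports for a user-defined function and for two routines implemented internally in base R (sum and `if` were picked only as convenient representatives):

```r
types <- c(
    typeof(function() 1),  # a user-defined function
    typeof(sum),           # internal; arguments are evaluated before the call
    typeof(`if`)           # internal; arguments are not necessarily evaluated
)
print(types)
## [1] "closure" "builtin" "special"
is.primitive(sum)  # both "builtin" and "special" objects are primitives
## [1] TRUE
```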

7.1.2. Named Functions

Function objects can be bound to names so that they can be referred to multiple times:

one <- function() 1  # one <- (function() 1)

We created an object named one (we use bold font to indicate that it is of type function, because functions are so important in R). We are very familiar with such a notation, as we have long been used to writing “x <- 1” etc.

Invoking one, which can be done by writing one(), will yield a return value:

one()  # (function() 1)()
## [1] 1

This output can be used in further computations, for instance:

0:2 - one()  # 0:2 - (function() 1)(), i.e., 0:2 - 1
## [1] -1  0  1

7.1.3. Passing Arguments To Functions

Functions with no arguments are kind of boring, thus let us distil a more serious operation:

concat <- function(x, y) paste(x, y, sep="")

Here we have created a mapping whose aim is to concatenate two objects by means of a specialised call to paste. Yours faithfully pleads guilty to multiplying entities needlessly, because it should not be a problem for anyone to write paste(x, y, sep="") each time. Yet, ‘tis merely an illustration.

The concat function has two parameters, “x” and “y”. Hence, calling it will require the provision of two arguments, which we put within round brackets and separate from each other by commas.

u <- 1:5
concat("spam", u)  # i.e., concat(x="spam", y=1:5)
## [1] "spam1" "spam2" "spam3" "spam4" "spam5"

Important

Notice the distinction: parameters (also called formal arguments) are abstract, general, or symbolic; “something, anything that will be put in place of x when the function is invoked”. By contrast, arguments (a.k.a. actual parameters) are concrete, specific, and real.

During the above call, x in the function’s body is precisely "spam", and nothing else. Also, the u object from the caller’s environment is seen under the name y there. Most of the time (however, see Section 16.4), it is best to think of the function as being fed not with u per se, but the value that u is bound to, i.e., “1:5”.

Also:

x <- 1:5
y <- "spam"
concat(y, x)  # concat(x="spam", y=1:5)
## [1] "spam1" "spam2" "spam3" "spam4" "spam5"

This call is equivalent to concat(x=y, y=x). The argument x is assigned the value of y from the calling environment, "spam". Yes, one x is not the same as the other x, and which is which is unambiguously defined by the context. Understanding and being able to manipulate such abstractions is basic logic and common sense that everyone should master.

Exercise 7.2

Write a function called standardise that takes a numeric vector x as argument and returns its standardised version, i.e., from each element in x subtract the sample arithmetic mean and then divide it by the standard deviation.

Note

Recall from Section 2.1.3 that, syntactically speaking, the following are perfectly valid alternatives to the positionally-matched call concat("spam", u).

concat(x="spam", y=u)
concat(y=u, x="spam")
concat("spam", y=u)
concat(u, x="spam")
concat(x="spam", u)
concat(y=u, "spam")

However, the last two should particularly be avoided, for the sake of the readers’ sanity. It is best to provide positionally-matched arguments before the keyword-based ones.

Also, in Section 10.5, we introduce the (overused) forward-pipe operator, `|>`, which enables the above to be written as “"spam" |> concat(u)”.
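The equivalence of the call variants listed in this note can be verified directly. The sketch below redefines concat exactly as above so that it is self-contained; the names results and all_same are introduced only for this illustration:

```r
concat <- function(x, y) paste(x, y, sep="")  # as defined earlier
u <- 1:5
results <- list(
    concat("spam", u),      # positional matching only
    concat(x="spam", y=u),
    concat(y=u, x="spam"),
    concat("spam", y=u),
    concat(u, x="spam"),    # x is matched by name, so u becomes y
    concat(x="spam", u),
    concat(y=u, "spam")
)
all_same <- (length(unique(results)) == 1)  # only one distinct result?
print(all_same)
## [1] TRUE
```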

7.1.4. Grouping Expressions with Curly Braces, `{`

We have been informed that a function’s body is a single R expression whose evaluated value is passed to the user as its output. This may sound restrictive and contrast with what we have experienced so far. Rarely are we faced with such simple computing tasks and we have already seen R functions performing quite sophisticated operations.

It turns out that, grammatically, a single R expression can be arbitrarily complex (Chapter 15); we can use curly braces to group many calls that are to be evaluated one after another.

For instance:

{
    cat("first expression\n")
    cat("second expression\n")
    # ...
    cat("last expression\n")
}
## first expression
## second expression
## last expression

Note that we used four spaces to visually indent the constituents for greater readability (some developers prefer tabs over spaces, others find two or three spaces more urbane, but we do not). This single (compound) expression can now play the role of a function’s body.

Important

The last expression evaluated in a curly-braces delimited block will be considered its output value.

x <- {
    1
    2
    3  # <--- last expression: will be taken as the output value
}
print(x)
## [1] 3

Note

(*) The above code block can also be written more concisely by replacing newlines with semicolons, although with perhaps some loss in readability:

{1; 2; 3}
## [1] 3

In Section 9.4, we will give a few more details about `{`.

Example 7.3

Here is a version of the above concat function which propagates missing values in the manner described in Chapter 2:

concat <- function(a, b)
{
    z <- paste(a, b, sep="")
    z[is.na(a) | is.na(b)] <- NA_character_
    z  # last expression in the block – return value
}

Example calls:

concat("a", 1:3)
## [1] "a1" "a2" "a3"
concat(NA_character_, 1:3)
## [1] NA NA NA
concat(1:6, c("a", NA_character_, "c"))
## [1] "1a" NA   "3c" "4a" NA   "6c"

Let us appreciate the fact that we could keep the code brief thanks to paste and `|` implementing the recycling rule.

Exercise 7.4

Write a function called normalise that takes a numeric vector x and returns its version shifted and scaled to the [0, 1] interval. To do so, from each element subtract the sample minimum and then divide it by the range, i.e., the difference between the maximum and the minimum. Avoid computing min(x) twice.

Exercise 7.5

Write a function that applies the robust standardisation of a numeric vector: subtract the median and divide it by the median absolute deviation, 1.4826 times the median of the absolute differences between the values and their median.

Note

R is an open-source (free, libre) project – users are not only encouraged to run the software for whatever purpose, but also to study and modify its source code without any restrictions. This applies both to functions that we have authored ourselves:

print(concat)
## function(a, b)
## {
##     z <- paste(a, b, sep="")
##     z[is.na(a) | is.na(b)] <- NA_character_
##     z  # last expression in the block – return value
## }
## <bytecode: 0x55f963fbb130>

and to the routines that are part of base R or any other extension packages:

print(union)
## function (x, y) 
## {
##     u <- as.vector(x)
##     v <- as.vector(y)
##     unique(c(u, v))
## }
## <bytecode: 0x55f9652d01a8>
## <environment: namespace:base>

Nevertheless, some functionality might be implemented in a compiled programming language such as C, C++, or Fortran; notice a call to .Internal in the source code of paste, .Primitive in list, or .Call in runif. Therefore, we will sometimes have to dig a little bit deeper to access the underlying source code; see Chapter 14 for more details.
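The following sketch shows one way of checking the above claims without reading the printed sources by eye. The helper mentions is a hypothetical name introduced only here; it searches the deparsed body of a function for a given piece of text:

```r
# hypothetical helper: does the source code of f's body contain `what`?
mentions <- function(f, what)
    any(grepl(what, deparse(body(f)), fixed=TRUE))

checks <- c(
    paste_internal = mentions(paste, ".Internal"),
    runif_call     = mentions(runif, ".Call"),
    list_primitive = is.primitive(list)  # primitives have no R-level body
)
print(all(checks))
## [1] TRUE
```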

7.2. Functional Programming

R is a functional programming language. As such, it shares a number of common features with other languages that emphasise the role of function manipulation in software development (e.g., Common Lisp, Scheme, OCaml, Haskell, Clojure, F#). Let us explore them now.

7.2.1. Functions are Objects

R functions were given the right to a fair go; they are what we refer to as first-class citizens. In other words, our interaction with them is not limited to their invocation; we treat them as any other language objects. Namely, they can be:

  • stored inside list objects:

    list(identity, nrow, sum)  # a list with three elements of type function
    ## [[1]]
    ## function (x) 
    ## x
    ## <bytecode: 0x55f963829560>
    ## <environment: namespace:base>
    ## 
    ## [[2]]
    ## function (x) 
    ## dim(x)[1L]
    ## <bytecode: 0x55f964915fc0>
    ## <environment: namespace:base>
    ## 
    ## [[3]]
    ## function (..., na.rm = FALSE)  .Primitive("sum")
    

    This is possible owing to the fact that lists, as we recall, can embrace R objects of any kind.

  • created and then called inside another function’s body:

    euclidean_distance <- function(x, y)
    {
        square <- function(z) z^2  # auxiliary/internal/helper function
        sqrt(sum(square(x-y)))     # square root of the sum of squares
    }
    
    euclidean_distance(c(0, 1), c(1, 0))  # example call
    ## [1] 1.4142
    

    This is why we tend to classify functions as representatives of recursive types (compare is.recursive).

  • passed as arguments to other operations:

    # Replaces missing values with a given aggregate
    # of all non-missing elements:
    fill_na <- function(x, filler_fun)
    {
        missing_ones <- is.na(x)  # otherwise, we'd call is.na twice
        replacement_value <- filler_fun(x[!missing_ones])
        x[missing_ones] <- replacement_value
        x
    }
    
    fill_na(c(0, NA_real_, NA_real_, 2, 3, 7, NA_real_), mean)
    ## [1] 0 3 3 2 3 7 3
    fill_na(c(0, NA_real_, NA_real_, 2, 3, 7, NA_real_), median)
    ## [1] 0.0 2.5 2.5 2.0 3.0 7.0 2.5
    

    We call these higher-order functions.
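To complement the first item in the above list, here is a small sketch of how functions stored in a list can be extracted and then invoked like any other function. The list aggregates and the vector v are made up for this illustration:

```r
aggregates <- list(min=min, mean=mean, max=max)  # a named list of functions
v <- c(5, 1, 6)
m <- aggregates[["mean"]](v)  # extract an element, then call it
print(m)
## [1] 4
aggregates$max(v)             # the same with the `$` operator
## [1] 6
```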

Note

More advanced techniques, which we will discuss later (i.e., closures, lazy evaluation, metaprogramming, etc.), will let the functions be:

  • returned as other functions’ outputs (sec:to-do),

  • equipped with auxiliary data (sec:to-do),

  • generated programmatically on the fly (sec:to-do), and

  • modified at runtime (sec:to-do).

Below we review some noteworthy higher-order functions, in particular: do.call and Map. Many other ones will be introduced in due course or are left as an educative exercise.

7.2.2. Calling on Precomputed Arguments with do.call

The notation like f(arg1, ..., argn) has no monopoly over how we are supposed to call a function on a specific sequence of comma-delimited arguments: the latter do not have to be hardcoded.

Here is an alternative. We can first prepare a number of objects to be passed as f’s inputs, wrap them in a list l, and then invoke do.call(f, l) to get the same result.

words <- list(
    c("spam",      "bacon",  "eggs"),
    c("buckwheat", "quinoa", "barley"),
    c("ham",       "spam",   "spam")
)
do.call(paste, words)  # paste(words[[1]], words[[2]], words[[3]])
## [1] "spam buckwheat ham" "bacon quinoa spam"  "eggs barley spam"
do.call(cbind, words)  # column-bind; returns a matrix (explained later)
##      [,1]    [,2]        [,3]  
## [1,] "spam"  "buckwheat" "ham" 
## [2,] "bacon" "quinoa"    "spam"
## [3,] "eggs"  "barley"    "spam"
do.call(rbind, words)  # row-bind (explained later)
##      [,1]        [,2]     [,3]    
## [1,] "spam"      "bacon"  "eggs"  
## [2,] "buckwheat" "quinoa" "barley"
## [3,] "ham"       "spam"   "spam"

Note that the length and content of the list passed as the 2nd argument of do.call can be arbitrary (possibly unknown at the time of writing the code). See Section 12.1.2 for more use cases, e.g., ways to concatenate a list of data frames (perhaps produced by some complex chain of commands) into a single data frame.

If elements of the list are named, they will be matched to the corresponding keyword arguments.

x <- 2^(seq(-2, 2, length.out=101))
plot_opts <- list(col="red", lty="dashed", type="l")
do.call(plot, c(list(x, log2(x), xlab="x", ylab="log2(x)"), plot_opts))
## (the displaying of the plot has been suppressed)

Note that, e.g., plot_opts can now be reused in further calls to graphical functions. This is very convenient as it avoids repetitions.
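The matching of named list elements to keyword arguments can also be observed without any graphics. In the sketch below, which relies only on paste, the object names args1 and opts are illustrative assumptions:

```r
args1 <- list("spam", "eggs", sep="-")   # two positional items, one named
r1 <- do.call(paste, args1)              # paste("spam", "eggs", sep="-")
print(r1)
## [1] "spam-eggs"

opts <- list(sep="", collapse="+")       # reusable keyword arguments
r2 <- do.call(paste, c(list(c("a", "b"), 1:2), opts))
print(r2)                                # paste(c("a", "b"), 1:2, sep="", collapse="+")
## [1] "a1+b2"
```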

7.2.3. Common Higher-Order Functions

There is an important class of higher-order functions that allow us to apply custom operations on consecutive elements of sequences without relying on loop-like statements, at least explicitly. They can be found in all functional programming languages (e.g., Lisp, Haskell, Scala) and have been ported to various add-on libraries (functools in Python, more recent versions of the C++ Standard Library, etc.) or frameworks (Apache Spark and the like). Their presence reflects the obvious truth that some kinds of operations occur more frequently than other ones.

In particular:

  • Map calls a function on each element of a sequence in order to transform:

    • their individual components (just like sqrt, round, or the unary `!` operator in R), or

    • the corresponding elements of many sequences so as to vectorise a given operation elementwisely (compare the binary `+` or paste),

  • Reduce (also called accumulate) applies a binary operation to combine consecutive elements in a sequence, e.g., to generate the aggregates, like, totally (compare sum, prod, all, max) or cumulatively (compare cumsum, cummin),

  • Filter creates a subset of a sequence that is comprised of elements that enjoy a given property (which we typically achieve in R by means of the `[` operator),

  • Find locates the first element that fulfils some logical condition (compare which),

and so forth.

Below we will only focus on the Map function. The inspection of the remaining ones is left as an exercise. This is because, oftentimes, we can be better off with their more R-ish versions (e.g., using the subsetting operator, `[`).
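For a first orientation before that exercise, here is a brief sketch of base R's Reduce, Filter, and Find applied to a made-up numeric vector v, each paired in a comment with its more R-ish counterpart:

```r
v <- c(3, 8, 1, 6, 4)
total   <- Reduce(`+`, v)                     # compare sum(v)
running <- Reduce(`+`, v, accumulate=TRUE)    # compare cumsum(v)
evens   <- Filter(function(e) e %% 2 == 0, v) # compare v[v %% 2 == 0]
first   <- Find(function(e) e > 5, v)         # compare v[which(v > 5)[1]]
print(total)
## [1] 22
print(running)
## [1]  3 11 12 18 22
print(evens)
## [1] 8 6 4
print(first)
## [1] 8
```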

7.2.4. Vectorising Functions with Map

In data-centric computing, we are frequently faced with tasks that involve processing each and every element in a sequence independently, one after another. Such use cases can benefit from vectorised operations like those discussed in Chapter 2, Chapter 3, and Chapter 6.

Most of the functions that we have introduced in the preceding parts, unfortunately, cannot be applied on lists. For instance, if we try calling sqrt on a list, we will get an error, even if it is a list of numeric vectors only. One way to compute the square root of all elements would be to invoke sqrt(unlist(...)). It is a go-to approach if we wish to treat all the list’s elements as one sequence. But this comes at a price of losing the list’s structure.
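The behaviour described above can be reproduced as follows. This is only a sketch: the list l is made up, and try is used merely to capture the error so that the script can continue.

```r
l <- list(a=c(1, 4), b=9)
res <- try(sqrt(l), silent=TRUE)     # sqrt does not accept lists
failed <- inherits(res, "try-error")
print(failed)
## [1] TRUE
flat <- sqrt(unlist(l))              # works, but yields one atomic vector
names(flat)                          # the two-element structure is gone
## [1] "a1" "a2" "b"
```

The values in flat are 1, 2, and 3, but nothing indicates any more that the first two of them came from a separate list element.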

We have also discussed some operations that are not vectorised with respect to all their arguments, even though they could have been designed this way, e.g., grepl.

The Map function[2] applies an operation on each element in a vector or the corresponding elements in a number of vectors. In many situations, it may be used as a more elegant alternative to for loops that we will introduce in the next chapter.

First[3], a call to Map(f, x) yields a list whose i-th element is equal to f(x[[i]]) (recall that `[[` works on atomic vectors too).

For example:

x <- list(  # an example named list
    x1=1:3,
    x2=seq(0, 1, by=0.25),
    x3=c(1, 0, NA_real_, 0, 0, 1, NA_real_)
)
Map(sqrt, x)  # x is named, hence the result will be named too
## $x1
## [1] 1.0000 1.4142 1.7321
## 
## $x2
## [1] 0.00000 0.50000 0.70711 0.86603 1.00000
## 
## $x3
## [1]  1  0 NA  0  0  1 NA
Map(length, x)
## $x1
## [1] 3
## 
## $x2
## [1] 5
## 
## $x3
## [1] 7
unlist(Map(mean, x))  # compute three aggregates, convert to an atomic vector
##  x1  x2  x3 
## 2.0 0.5  NA
Map(function(n) round(runif(n, -1, 1), 1), c(2, 4, 6))  # x is atomic now
## [[1]]
## [1] 0.4 0.8
## 
## [[2]]
## [1]  0.5  0.8 -0.1 -0.7
## 
## [[3]]
## [1] -0.3  0.0  0.5  1.0 -0.9 -0.7

Next, we can vectorise a given function over a number of parameters. A call to, e.g., Map(f, x, y, z) results in a list whose i-th element is equal to f(x[[i]], y[[i]], z[[i]]). Just as in the case of, e.g., paste, the recycling rule will be applied if necessary.

For example, the following generates list(seq(1, 6), seq(11, 13), seq(21, 29)):

Map(seq, c(1, 11, 21), c(6, 13, 29))
## [[1]]
## [1] 1 2 3 4 5 6
## 
## [[2]]
## [1] 11 12 13
## 
## [[3]]
## [1] 21 22 23 24 25 26 27 28 29

Moreover, we can get list(seq(1, 40, length.out=10), seq(11, 40, length.out=5), seq(21, 40, length.out=10), seq(31, 40, length.out=5)) by calling:

Map(seq, c(1, 11, 21, 31), 40, length.out=c(10, 5))
## [[1]]
##  [1]  1.0000  5.3333  9.6667 14.0000 18.3333 22.6667 27.0000 31.3333
##  [9] 35.6667 40.0000
## 
## [[2]]
## [1] 11.00 18.25 25.50 32.75 40.00
## 
## [[3]]
##  [1] 21.000 23.111 25.222 27.333 29.444 31.556 33.667 35.778 37.889 40.000
## 
## [[4]]
## [1] 31.00 33.25 35.50 37.75 40.00

Note

If we have some additional arguments to be passed to the function applied (which the function does not have to be vectorised over), we can wrap them inside a separate list and toss it via the MoreArgs argument (à la do.call).

unlist(Map(mean, x, MoreArgs=list(na.rm=TRUE)))  # mean(..., na.rm=TRUE)
##  x1  x2  x3 
## 2.0 0.5 0.4

Alternatively, we can always construct a custom anonymous function:

unlist(Map(function(xi) mean(xi, na.rm=TRUE), x))
##  x1  x2  x3 
## 2.0 0.5 0.4

Exercise 7.6

Here is an example list of files (see our teaching data repository) with daily Forex rates:

file_names <- c(
    "euraud-20200101-20200630.csv",
    "eurgbp-20200101-20200630.csv",
    "eurusd-20200101-20200630.csv"
)

Call Map to read each dataset with scan and determine the minimal, mean, and maximal value in each series.

Exercise 7.7

Implement your own version of the Filter function based on a call to Map.

7.3. Accessing Third-Party Functions

When we indulge in the writing of a software piece, a few questions naturally arise. Is the problem we are facing fairly complex? Has it already been successfully addressed in its entirety? If not, can it, or its parts, be split into manageable chunks? Can it be constructed based on some readily available nontrivial components?

A smart developer is independent, but knows when to stand on the shoulders to cry on. Let us explore some ways in which we can reuse the existing function libraries.

7.3.1. Using R Packages

Most contributed R extensions come in the form of the so-called add-on packages, which can include:

  • reusable code (e.g., new functions),

  • data (which we can exercise on),

  • documentation (manuals, vignettes, etc.);

see Section 9.3.2 for some more and [47] for all the details.

Most packages are published in the moderated repository that is part of the Comprehensive R Archive Network (CRAN). However, there are also other popular sources such as Bioconductor which specialises in bioinformatics.

To fetch a package pkg from a repository (CRAN by default; see, however, the repos argument), we call install.packages("pkg").

A call to library("pkg") loads an indicated package and makes its exported objects available to the user (i.e., attaches it on the search list; see sec:to-do).

For instance, in one of the previous chapters, we have mentioned the gsl package:

# call install.packages("gsl") first
library("gsl")  # load the package
poch(10, 3:6)   # calls gsl_sf_poch() from GNU GSL
## [1]    1320   17160  240240 3603600

Here, poch is an object exported by package gsl. If we did not call library("gsl"), trying to access the former would result in an error.

We could also have accessed the above function without attaching the package to the object search list by using the pkg::object syntax, i.e., gsl::poch.
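The pkg::object syntax can be tried out without installing anything. The sketch below assumes only the tools package, which is shipped with R but is not attached by default; the file name is made up. The second part guards an optional dependency with requireNamespace, which is one possible way of checking whether an add-on is installed:

```r
ext <- tools::file_ext("rates.csv")  # no library("tools") call needed
print(ext)
## [1] "csv"

# an optional add-on: use it only if it is installed
if (requireNamespace("gsl", quietly=TRUE)) {
    print(gsl::poch(10, 3))
}
```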

Exercise 7.8

Use the find function to determine which packages define the following objects: mean, var, find, and Map. Recall from Section 1.4 where such information can be found in these objects’ manual pages.

Note

For more information about any R extension, call help(package="pkg"). Also, it is a good idea to visit the package’s CRAN entry at an address like https://CRAN.R-project.org/package=pkg to access some additional information (e.g., vignettes; see also vignette(package="pkg")). Why waste our time and energy by querying a web search engine that will lead us to some (usually low-quality) middleman when we can acquire authoritative knowledge directly from the source?

Moreover, it is worth exploring various CRAN Task Views that group the packages into topics such as Genetics, Graphics, and Optimisation. These are edited by experts in their relevant fields.

Important

Frequently, R packages are written in their respective authors’ free time, many of whom are volunteers/public servants/enthusiasts who are neither paid for doing this nor is it part of their so-called job. You can show appreciation for their generosity by, e.g., spreading the word about their software by citing them in publications (see citation(package="pkg")), talking about them during lunch time, or mentioning them in (un)social media. You can also help them improve the existing code base by reporting bugs, polishing documentation, proposing new features, or cleaning up the redundant fragments of their APIs. Some readers will become one of them someday (when they come up with something useful for our community).

7.3.1.1. Default Packages

Note that the always-on package base is a must-have that provides us with the most crucial functions (vector addition, c, Map, library). Certain other packages are also loaded by default:

getOption("defaultPackages")
## [1] "datasets"  "utils"     "grDevices" "graphics"  "stats"    
## [6] "methods"

Although this list can – technically speaking – be changed, in this book we assume that the above are always attached, because it is reasonable to do so. This is why in Section 2.4.5, there was no need to call, for example, library("stats") before referring to the var and sd functions.
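Assuming the default configuration described above, we can ask R where a given function lives. In this sketch, sd and head are just two convenient examples and the name homes is made up:

```r
homes <- c(
    sd   = environmentName(environment(sd)),    # defined in package stats
    head = environmentName(environment(head))   # defined in package utils
)
print(homes)
##      sd    head 
## "stats" "utils"
```

Both packages are attached on startup, which is why the bare names sd and head can be used right away.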

On a side note, grDevices and graphics will be discussed in sec:to-do and methods will be mentioned in sec:to-do. datasets brings a few example R objects that we can exercise our skills on. Functions from utils, graphics, and stats, on the other hand, already appeared here and there.

7.3.1.2. Source vs Binary Packages (*)

R is a free and open project, therefore its packages are published primarily in the source form – so that anyone can study how they work and improve them or reuse parts thereof in different projects.

If we call install.packages("path", repos=NULL, type="source"), we should be able to install a package from sources: path can point to either a directory or a source tarball (most often a compressed pkg_version.tar.gz file; see help("untar")).

Note that type="source" is the default unless one is on W****ws or some m**OS boxes; see getOption("pkgType"). This is because these two might require additional build tools to be present in the system, especially if a package features C, C++, or Fortran code; see Chapter 14 and Section C.3 of [49].

Because of these systems’ being less developer-oriented, as a courtesy to their users, CRAN also distributes the platform-specific binary versions of the packages (.zip or .tgz files). install.packages will try to fetch them by default.

Example 7.9

It is very easy to fetch a package’s source directly from GitLab or GitHub, which are quite popular hosting platforms these days. At the time of writing this, the relevant links were, respectively:

  • https://gitlab.com/user/repo/-/archive/branch/repo-branch.zip

  • https://github.com/user/repo/archive/branch.zip

For example, to download the contents of the master branch in the repository rpackagedemo owned by gagolews, we can call:

f <- tempfile()  # temporary file name - download destination
download.file("https://github.com/gagolews/rpackagedemo/archive/master.zip",
    destfile=f)

Next, the contents can be extracted with unzip:

t <- tempdir()  # temporary directory to extract the files to
(d <- unzip(f, exdir=t))  # returns extracted file paths

The path where the files were extracted can be passed to install.packages:

install.packages(dirname(d)[1], repos=NULL, type="source")
file.remove(c(f, d))  # clean up

Exercise 7.10

Use the git2r package to clone the git repository located at https://github.com/gagolews/rpackagedemo.git and install the package published therein from within the current R session.

7.3.2. Managing Dependencies (*)

The currently-installed add-on packages may be upgraded to their most recent versions available on CRAN (or another indicated repository) by calling update.packages.

As a general rule, the more experienced developers we become, the less excited we get about the new. Sure, bug fixes and some well-thought-out additional features are usually welcome, but just wait until an updated package API, for the n-th time, \(n\ge 2\), breaks our program that used to work flawlessly for so long.

Hence, when designing software projects (see Chapter 9 for more details), it is essential that we ask ourselves the ultimate question: do we really need to import that package with lots of dependencies from which we will use only about 3–5 functions? Wouldn’t it be better to write our own version of some functionality (and learn something new, exercise our brain, etc.) or call a mature terminal-based tool?

Otherwise, as all the historical versions of all the packages are archived on CRAN, some software dependency management can easily be conducted by storing different versions of packages in different directories (only one version of a package can be loaded at a time though). This way, we can create some sort of an isolated environment for the add-ons.

To fetch the locations where packages are sought (in this very order), call:

.libPaths()
## [1] "/home/gagolews/R/x86_64-pc-linux-gnu-library/4.2"
## [2] "/usr/local/lib/R/site-library"                   
## [3] "/usr/lib/R/site-library"                         
## [4] "/usr/lib/R/library"

The same function can be used to add new folders to the search path; see also the environment variable R_LIBS_USER (e.g., help("Sys.setenv")). The install.packages function will honour them as target directories, see its lib parameter for more details.
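Here is a minimal sketch of such a setup. The directory name is hypothetical and a temporary folder stands in for a real project; the install.packages call is left commented out so that nothing is downloaded:

```r
proj_lib <- file.path(tempdir(), "project-library")  # hypothetical location
dir.create(proj_lib, showWarnings=FALSE, recursive=TRUE)

old_paths <- .libPaths()            # remember the current search path
.libPaths(c(proj_lib, old_paths))   # prepend the new folder
first_path <- .libPaths()[1]        # now the default installation target
# install.packages("pkg", lib=proj_lib)  # would install into proj_lib

.libPaths(old_paths)                # restore the previous setting
```

Note that .libPaths silently ignores folders that do not exist, hence the call to dir.create.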

Moreover, the packages may deposit some auxiliary data on the user’s machine. Therefore, it might be a good idea to set the following directories (via the corresponding environment variables) as relative to the current project:

tools::R_user_dir("pkg", "data")   # R_USER_DATA_DIR
## [1] "/home/gagolews/.local/share/R/pkg"
tools::R_user_dir("pkg", "config") # R_USER_CONFIG_DIR
## [1] "/home/gagolews/.config/R/pkg"
tools::R_user_dir("pkg", "cache")  # R_USER_CACHE_DIR
## [1] "/home/gagolews/.cache/R/pkg"
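For instance, the data directory could be redirected as sketched below. This assumes R 4.0 or later (where tools::R_user_dir is available); the project path is hypothetical, with a temporary folder standing in for a real project root:

```r
proj_dir <- file.path(tempdir(), "myproject")       # hypothetical project root
old <- Sys.getenv("R_USER_DATA_DIR", unset=NA)      # remember the old setting

Sys.setenv(R_USER_DATA_DIR=file.path(proj_dir, "data"))
data_dir <- tools::R_user_dir("pkg", "data")        # now points inside proj_dir
in_project <- startsWith(data_dir, file.path(proj_dir, "data"))
print(in_project)
## [1] TRUE

# restore the previous state of the environment variable
if (is.na(old)) Sys.unsetenv("R_USER_DATA_DIR") else Sys.setenv(R_USER_DATA_DIR=old)
```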

7.3.3. Calling External Programs

Many tasks can naturally be accomplished by calling external programs. Such an approach is particularly natural on Unix-like systems, which classically follow modular, minimalist design patterns: there are many tools at a developer’s hand and each tool is specialised in solving a single, well-defined problem.

Apart from the many standard Unix commands, we can consider, for example:

  • pandoc converts documents between markup formats, e.g., Markdown, reStructuredText, LaTeX, LibreOffice Writer, EPUB;

  • pdflatex, xelatex, and lualatex compile LaTeX documents to PDF;

  • convert (from ImageMagick) applies various operations on bitmap graphics (scaling, cropping, conversion between formats);

  • graphviz and PlantUML can be used to create various graphs and diagrams;

  • jupyter-nbconvert converts Jupyter notebooks (see Section 1.2.5) to other formats such as LaTeX, HTML, Markdown, etc.;

  • python, perl, … can be called to perform tasks that can be expressed more easily in languages other than R;

and so forth.

The good news is that R can not only be called from the shell (in an interactive or batch mode; see Section 1.2), but can also serve well as a glue language itself.

The system2 function can be used to invoke any system command. Communication between such programs can be done by means of, e.g., intermediate text, JSON, CSV, XML, or any other files. The stdin, stdout, and stderr arguments can be used to control the redirection of the standard I/O streams.

system2("pandoc", "-s input.md -o output.html")
system2("bash", "-c 'for i in `seq 1 2 10`; do echo $i; done'", stdout=TRUE)
## [1] "1" "3" "5" "7" "9"
system2("python3", "-", stdout=TRUE,
    input=c(
    "import numpy as np",
    "print(repr(np.arange(5)))"
    ))
## [1] "array([0, 1, 2, 3, 4])"
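
To illustrate how the value returned by system2 depends on the stdout argument, here is a minimal sketch that starts another R process, as this is the one program we can safely assume to be installed. We assume that the Rscript executable resides in the directory given by R.home("bin"), which should be the case in standard installations.

```r
rscript <- file.path(R.home("bin"), "Rscript")  # path to the Rscript program

# stdout=TRUE captures the standard output as a character vector:
out <- system2(rscript,
    c("--vanilla", "-e", shQuote("writeLines(format(6*7))")),
    stdout=TRUE)

# by default, the exit status is returned (0 conventionally means success):
status <- system2(rscript,
    c("--vanilla", "-e", shQuote("quit(status=3)")))
```

Here, out should be the string "42" and status should be equal to 3. The shQuote function takes care of quoting an argument so that the shell passes it on as a single string.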

Note that the current working directory can be read and changed by means of a call to getwd and setwd, respectively. By default, it is the directory from which the current R session was started.
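
As the working directory is a global setting, it is usually a good idea to restore it once we are done. Below is a minimal sketch, where with_dir is a hypothetical helper name and on.exit registers an expression to be evaluated when the function terminates.

```r
with_dir <- function(dir, f, ...)
{
    old <- setwd(dir)    # setwd returns the previous working directory
    on.exit(setwd(old))  # restore it even if f fails
    f(...)
}

before <- getwd()
inside <- with_dir(tempdir(), getwd)  # getwd is called in the new directory
after <- getwd()                      # the same as `before`
```

External programs started via system2 inherit the current working directory, so the same helper could wrap a call that expects relative file paths.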

Important

Relying on system2 assumes that the commands referred to are available on the target platform. Hence, it might not be portable, unless additional assumptions are made (e.g., that a user runs some Unix system, that certain libraries are installed therein). We strongly recommend GNU/Linux or FreeBSD for both software development and production use, as they are free, open, developer-friendly, user-loving, reliable, ethical, and sustainable.
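
One way to make such assumptions explicit is to verify that a program can be located before trying to run it. Below is a sketch based on Sys.which, which returns an empty string for commands that cannot be found on the search path. The name has_command is a hypothetical helper introduced only for this illustration.

```r
has_command <- function(cmd)
    unname(Sys.which(cmd)) != ""  # TRUE if cmd was found on the system PATH

if (has_command("pandoc")) {
    message("pandoc found at ", Sys.which("pandoc"))
} else {
    message("pandoc is not available; skipping the conversion")
}
```

Failing early with an informative message tends to be more helpful to the user than an obscure error emitted by the shell.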

7.3.4. A Note on Interfacing C, C++, Python, Java, etc. (*)

Most stand-alone data processing algorithms are implemented in compiled, slightly lower-level programming languages. This usually makes them faster and more re-usable in other environments. For instance, it is often the case that an industry-standard library is written in very portable C, C++, or Fortran and has some bindings available for easier access from within R, Python, Julia, etc. This is the case with FFTW, LIBSVM, mlpack, OpenBLAS, ICU, and GNU GSL, amongst many others.

For basic ways to interact with such compiled code, see Chapter 14.

Also, the rJava package can be used to dynamically create JVM objects and access their fields and methods. Similarly, reticulate can be used to access Python objects, including numpy arrays and pandas data frames (but see also the rpy2 package for Python).

Important

We should not feel obliged to use R in all the parts of a data processing pipeline. Some activities can be expressed more naturally in other languages/environments (e.g., parse raw data and create an SQL database in Python, but visualise it in R). Any suitable tool (including R, Python, or Bash) can serve as the glue language that will steer the data flow in the right direction.

7.4. Exercises

Exercise 7.11

Answer the following questions:

  • What is the result of “x <- 2; x <- function(x) x^2; x(x)”?

  • How to write a function that returns two objects?

  • What is a higher-order function?

  • What are the use cases of do.call?

  • Why is a call to Map not necessary in the expression “Map(paste, x, y, z)”?

  • What is the difference between Map(mean, x, na.rm=TRUE) and Map(mean, x, MoreArgs=list(na.rm=TRUE))?

  • What do we mean when we write stringx::sprintf?

  • How to get access to the vignettes (tutorials, FAQs, etc.) of the data.table and dplyr packages? Why would perhaps 95% of R users just google it and what is sub-optimal about this strategy?

  • What is the difference between a source and a binary package?

  • How to update the base package?

  • How to assure that we will always run an R session with only specific versions of a set of packages?

Exercise 7.12

Write a function that computes the Gini index of a vector of positive integers x, which, assuming \(x_1\le x_2\le\dots\le x_n\), is equal to:

\[G(x_1,\dots,x_n) = \frac{ \sum_{i=1}^{n} (n-2i+1) x_{n-i+1} }{ (n-1) \sum_{i=1}^n x_i }.\]

Exercise 7.13

Implement a function between(x, a, b) that verifies whether each element in x is in the [a, b] interval or not. Return a logical vector of the same length as x. Make sure the function is correctly vectorised with respect to all the arguments and handles missing data correctly.

Exercise 7.14

Write your own version of the strrep function called dup.

dup <- ...to.do...
dup(c("a", "b", "c"), c(1, 3, 5))
## [1] "a"     "bbb"   "ccccc"
dup("a", 1:3)
## [1] "a"   "aa"  "aaa"
dup(c("a", "b", "c"), 4)
## [1] "aaaa" "bbbb" "cccc"

Exercise 7.15

Given a list x, generate its sublist with all the elements equal to NULL removed.

Exercise 7.16

Implement your own version of the built-in sequence function.

Exercise 7.17

Using Map, how can we generate window indexes like:

## [[1]]
## [1] 1 2 3
## 
## [[2]]
## [1] 2 3 4
## 
## [[3]]
## [1] 3 4 5
## 
## [[4]]
## [1] 4 5 6

Write a function windows(k, n) that yields index windows of length k with elements between 1 and n (the above example is for k=3 and n=6).

Exercise 7.18

Implement a function movstat(f, x, k) that computes, using Map, a given aggregate f of each k consecutive elements in x. For instance:

movstat <- ...to.do...
x <- c(1, 3, 5, 10, 25, -25)  # example data
movstat(mean, x, 3)           # 3-moving mean
## [1]  3.0000  6.0000 13.3333  3.3333
movstat(median, x, 3)         # 3-moving median
## [1]  3  5 10 10

Exercise 7.19

Write a function to extract all q-grams, q ≥ 1, from a given character vector. Return a list of character vectors. For example, 2-grams (bigrams) in "abcd" are: "ab", "bc", "cd".

Exercise 7.20

Recode a character vector with a small number of distinct values to a vector where each unique code is assigned a positive integer from 1 to k. Example calls and the corresponding expected results:

recode <- ...to.do...
recode(c("a", "a", "a", "b", "b"))
## [1] 1 1 1 2 2
recode(c("x", "z", "y", "x", "y", "x"))
## [1] 1 3 2 1 2 1

Exercise 7.21

Implement a function that returns the number of occurrences of each unique element in a given atomic vector. The return value should be a numeric vector equipped with a names attribute.

count <- ...to.do...
count(c(5, 5, 5, 5, 42, 42, 954))
##   5  42 954 
##   4   2   1
count(c("x", "z", "y", "x", "y", "x", "w", "x", "x", "y", NA_character_))
##    w    x    y    z <NA> 
##    1    5    3    1    1

Hint: use match and tabulate.

Exercise 7.22

Implement a function that extends upon the built-in duplicated, indicating which occurrence (starting from the beginning of the vector) of a repeated value a given value constitutes.

duplicatedn <- ...to.do...
duplicatedn(c("a", "a", "a", "b", "b"))
## [1] 1 2 3 1 2
duplicatedn(c("x", "z", "y", "x", "y", "x", "w", "x", "x", "y", "z"))
##  [1] 1 1 1 2 2 3 1 4 5 3 2

Exercise 7.23

Based on a call to Map, implement a function my_split such that, given a vector x and an atomic vector y of the same length as x, my_split(x, y) yields the same result as split(x, y).

Exercise 7.24

Extend my_split to handle the second argument being a list of the form list(y1, y2, ...) that represents the product of many levels. If the ys are of different lengths, apply the recycling rule.

Exercise 7.25

Implement my_unsplit, your own version of the built-in unsplit. Make sure that my_unsplit(split(x, g), g) == x holds for x and g of the same lengths.

Exercise 7.26

Write a function that takes as arguments: (a) an integer n, (b) a numeric vector x of length k and no duplicated elements, (c) a vector of probabilities p of length k; verify that \(p_i\ge 0\) for all \(i\) and \(\sum_{i=1}^k p_i \simeq 1\). Based on a random number generator from the uniform distribution on the unit interval, generate n independent realisations of a random variable \(X\) such that \(\Pr(X=x_i)=p_i\) for \(i=1,\dots,k\). Hint: to obtain a single value:

  1. generate \(u\in[0,1]\),

  2. find \(m\in\{1,\dots,k\}\) such that \(u\in\left(\sum_{j=1}^{m-1} p_{j}, \sum_{j=1}^m p_{j}\right]\),

  3. the result is then \(x_m\).

Exercise 7.27

Write a function that takes as arguments: (a) an increasingly sorted vector x of length n, (b) any vector y of length n, (c) a vector z of length k and elements in \([x_1,x_n)\). Let \(f\) be the piecewise linear spline that interpolates the points \((x_1,y_1),\dots,(x_n,y_n)\). Return a vector w of length k such that \(w_i=f(z_i)\).

Exercise 7.28

(*) Write functions dpareto, ppareto, qpareto, and rpareto that implement the basic functions related to the Pareto distribution; compare Section 2.3.4.