7. Functions

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug/typos reports/fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [26].

R is a functional language where functions play first fiddle. Each action we perform reduces itself to a call to some function or a combination thereof.

So far, we have been tinkering with dozens of available functions which are part of base R, with only a few exceptions. They constitute the essential vocabulary that everyone must be able to speak fluently.

Any operation, be it sum, sqrt, or paste, when fed with a number of arguments, generates some (hopefully fruitful) return value.

sum(1:10)  # invoking `sum` on a specific argument
## [1] 55

From a user’s perspective, each function is merely a tool. To achieve a goal at hand, we do not have to care about what is going on under its hood, i.e., how the inputs are being transformed so that, after a couple of nanoseconds or hours, we can relish what has been bred. This is very convenient: all we need to know is the function’s specification which can be stated, for example, informally, in plain Polish or Malay, on its help page.

In this chapter, we will learn how to write our own functions. Using this skill is a good development practice when we expect that some operations are to be executed many times but perhaps on different data.

Also, some R functions are meant to invoke other functions, for instance, on every element in a list or every section of a data frame grouped by a qualitative variable. Thus, it is advisable to learn how we can specify a custom operation to be propagated thereover.

Example 7.1

Given some objects (whatever):

x1 <- runif(16)
x2 <- runif(32)
x3 <- runif(64)

when we want to apply the same action on different data, say, compute the root mean square, instead of retyping almost identical expressions (or a bunch of them) over and over again:

sqrt(mean(x1^2))
## [1] 0.6545
sqrt(mean(x2^2))  # the same second time - borderline okay
## [1] 0.56203
sqrt(mean(x3^2))  # tedious, barbarous, and error-prone
## [1] 0.57206

we can generalise the operation to any object like x:

rms <-                   # bound what follows to name `rms`
    function(x)          # a function that takes one parameter, `x`
        sqrt(mean(x^2))  # expression to transform the input to yield output

and then reuse it on different concrete data instances:

rms(x1)
## [1] 0.6545
rms(x2)
## [1] 0.56203
rms(x3)
## [1] 0.57206

or even combine it with other function calls:

rms(sqrt(c(x1, x2, x3)))^2
## [1] 0.50824

Important

Does writing your own functions equal reinventing the wheel? Can everything be found online these days (including on Stack Overflow, GitHub, or CRAN)?

Luckily, it is not the case. Otherwise, data analysts’, researchers’, and developers’ lives would be monotonous, dreary, and uninspiring. Plus, sometimes it is much quicker to compose a function from scratch than to get through the whole garbage dump from where, only occasionally, we can dig out some pearls. Not to mention the self-educative side: we become better programmers by crunching those exercises. We are advocating for minimalism here, remember?

This and many other vital issues in function design will be reflected upon in Chapter 9.

7.1. Creating and invoking functions

7.1.1. Anonymous functions

Functions are usually created through the following notation:

function(args) body

First, args is a (possibly empty) list of comma-separated parameter names which are supposed to act as input variables.

Second, body is a single R expression that will be evaluated when the function is called. The value this expression yields will constitute the function’s output.

For example, here is a definition of a function that takes no inputs and generates a constant output:

function() 1
## function() 1

We thus created a function object. However, it disappeared immediately thereafter, as we have not used it at all.

Any function, say, f can be invoked, i.e., evaluated on concrete data, using the notation f(arg1, ..., argn), where “arg1, ..., argn” are the arguments to be passed to f.

(function() 1)()  # invoking f like f(); here, no arguments are expected
## [1] 1

Only now have we obtained a return value.

Note

(*) Calling typeof on a function object will report "closure" (for user-defined functions), "builtin", or "special" (for some built-in, base ones), for the reasons that we explain in more detail[1] in Section 9.5.3:

typeof(function() 1)
## [1] "closure"
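
To see these type labels side by side, here is a minimal sketch. The functions inspected (paste, sum, and `if`) are merely our picks for illustration; any other base function could be queried likewise.

```r
typeof(function() 1)  # a user-defined function
## [1] "closure"
typeof(paste)         # part of base R, but written in R itself
## [1] "closure"
typeof(sum)           # implemented in compiled code
## [1] "builtin"
typeof(`if`)          # implemented in compiled code, too
## [1] "special"
is.function(sum) && is.function(paste)  # they are all functions, though
## [1] TRUE
```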

7.1.2. Named functions

Function objects can be bound with names so that they can be referred to multiple times:

one <- function() 1  # one <- (function() 1)

We created an object named one (we use bold font to indicate that it is of the type function, for functions are so crucial in R). We are very familiar with such a notation, as we have long been used to writing “x <- 1”, etc.

Invoking one, which can be done by writing one(), will generate a return value:

one()  # (function() 1)()
## [1] 1

This output can be used in further computations, for instance:

0:2 - one()  # 0:2 - (function() 1)(), i.e., 0:2 - 1
## [1] -1  0  1

7.1.3. Passing arguments to functions

Functions with no arguments are kind of boring. Thus, let us distil a more highbrow operation:

concat <- function(x, y) paste(x, y, sep="")

Here we have created a mapping whose aim is to concatenate two objects using a specialised call to paste. Yours faithfully pleads guilty to multiplying entities needlessly: it should not be a problem for anyone to write paste(x, y, sep="") each time. Yet, ’tis merely an illustration.

The concat function has two parameters, “x” and “y”. Hence, calling it will require the provision of two arguments, which we put within round brackets and separate from each other by commas.

u <- 1:5
concat("spam", u)  # i.e., concat(x="spam", y=1:5)
## [1] "spam1" "spam2" "spam3" "spam4" "spam5"

Important

Notice the distinction: parameters (formal arguments) are abstract, general, or symbolic; “something, anything that will be put in place of x when the function is invoked”. Contrastingly, arguments (actual parameters) are concrete, specific, and real.

During the above call, x in the function’s body is precisely "spam" and nothing else. Also, the u object from the caller’s environment can be accessed via y in concat. Most of the time (yet, see Section 16.3), it is best to think of the function as being fed not with u per se but the value that u is bound to, i.e., “1:5”.

Also:

x <- 1:5
y <- "spam"
concat(y, x)  # concat(x="spam", y=1:5)
## [1] "spam1" "spam2" "spam3" "spam4" "spam5"

This call is still equivalent to concat(x=y, y=x). The parameter x is assigned the value of y from the calling environment, "spam". Yes, one x is not the same as the other x, and which is which is unambiguously determined by the context. Understanding and being able to manipulate such abstractions is basic logic and common sense that everyone should master.
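
To illustrate that a function operates on the value its argument is bound to, and not on the caller's variable itself, consider the following sketch. The double_first function is a made-up helper introduced only for this illustration.

```r
double_first <- function(y)  # hypothetical helper, not part of base R
{
    y[1] <- 2*y[1]  # modifies the function's own `y` only
    y
}

u <- 1:5
double_first(u)
## [1] 2 2 3 4 5
u  # the object in the caller's environment is left intact
## [1] 1 2 3 4 5
```

If we wish to keep the modified version, we need to bind the return value to a name ourselves, e.g., by writing “u <- double_first(u)”.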

Exercise 7.2

Write a function standardise that takes a numeric vector x as argument and returns its standardised version, i.e., from each element in x, subtract the sample arithmetic mean and then divide it by the standard deviation.

Note

Recall from Section 2.1.3 that, syntactically speaking, the following are perfectly valid alternatives to the positionally-matched call concat("spam", u); see Section 15.4.4 for more details.

concat(x="spam", y=u)
concat(y=u, x="spam")
concat("spam", y=u)
concat(u, x="spam")
concat(x="spam", u)
concat(y=u, "spam")

However, we recommend avoiding the last two for the sake of the readers’ sanity. It is best to provide positionally-matched arguments before the keyword-based ones.

Also, in Section 10.5, we will introduce the (overused) forward-pipe operator, `|>`, which enables the above to be written as “"spam" |> concat(u)”.

7.1.4. Grouping expressions with curly braces, `{`

We have been informed that a function’s body is a single R expression whose evaluated value is passed to the user as its output. This may sound restrictive and in contrast with what we have experienced so far. Rarely are we faced with such simple computing tasks, and we have already seen R functions performing quite sophisticated operations.

It turns out that, grammatically, a single R expression can be arbitrarily complex (Chapter 15); we can use curly braces to group many calls that are to be evaluated one after another.

For instance:

{
    cat("first expression\n")
    cat("second expression\n")
    # ...
    cat("last expression\n")
}
## first expression
## second expression
## last expression

We used four spaces to visually indent the constituents for greater readability (some developers prefer tabs over spaces, others find two or three spaces more urbane, but we do not). This single (compound) expression can now play the role of a function’s body.

Important

The last expression evaluated in a curly-braces delimited block will be considered its output value.

x <- {
    1
    2
    3  # <--- last expression: will be taken as the output value
}
print(x)
## [1] 3

Note

(*) The above code block can also be written more concisely by replacing newlines with semicolons, albeit with perhaps some loss in readability:

{1; 2; 3}
## [1] 3

Section 9.4 will give a few more details about `{`.

Example 7.3

Here is a version of the above concat function, which propagates missing values in the style of the operations from Chapter 2:

concat <- function(a, b)
{
    z <- paste(a, b, sep="")
    z[is.na(a) | is.na(b)] <- NA_character_
    z  # last expression in the block – return value
}

Example calls:

concat("a", 1:3)
## [1] "a1" "a2" "a3"
concat(NA_character_, 1:3)
## [1] NA NA NA
concat(1:6, c("a", NA_character_, "c"))
## [1] "1a" NA   "3c" "4a" NA   "6c"

Let us appreciate the fact that we could keep the code brief thanks to paste and `|` implementing the recycling rule.
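
The recycling at play can be inspected step by step. The sketch below merely replays the two building blocks of concat on the inputs from the last example call; the names a and b mirror the function's parameters.

```r
a <- 1:6
b <- c("a", NA_character_, "c")
paste(a, b, sep="")  # `b` is recycled; missing values become "NA" strings
## [1] "1a"  "2NA" "3c"  "4a"  "5NA" "6c"
is.na(a) | is.na(b)  # the shorter logical vector is recycled as well
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
```

The logical vector then indicates which elements of the pasted result are to be replaced with NA_character_.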

Exercise 7.4

Write a function normalise that takes a numeric vector x and returns its version shifted and scaled to the [0, 1] interval. To do so, subtract the sample minimum from each element, and then divide it by the range, i.e., the difference between the maximum and the minimum. Avoid computing min(x) twice.

Exercise 7.5

Write a function that applies the robust standardisation of a numeric vector: subtract the median and divide it by the median absolute deviation, 1.4826 times the median of the absolute differences between the values and their median.

Note

R is an open-source (free, libre) project. Users are not only encouraged to run the software for whatever purpose, but also study and modify its source code without any restrictions. This applies both to functions that we have authored ourselves:

print(concat)
## function(a, b)
## {
##     z <- paste(a, b, sep="")
##     z[is.na(a) | is.na(b)] <- NA_character_
##     z  # last expression in the block – return value
## }

and to the routines that are part of base R or any other extension packages:

print(union)
## function (x, y)
## {
##     u <- as.vector(x)
##     v <- as.vector(y)
##     unique(c(u, v))
## }
## <environment: namespace:base>

Nevertheless, some functionality might be implemented in compiled programming languages such as C, C++, or Fortran; notice a call to .Internal in the source code of paste, .Primitive in list, or .Call in runif. Therefore, we will sometimes have to dig a bit deeper to access the underlying source code; see Chapter 14 for more details.
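
Without printing the whole definitions, we can check for such calls programmatically. This is only a rough sketch: it relies on deparse to convert a function to a character vector holding its source code and on a plain-text search therein.

```r
is.primitive(list)   # implemented entirely in compiled code
## [1] TRUE
is.primitive(paste)  # an R-level wrapper...
## [1] FALSE
any(grepl(".Internal", deparse(paste), fixed=TRUE))  # ...around .Internal
## [1] TRUE
any(grepl(".Call", deparse(stats::runif), fixed=TRUE))
## [1] TRUE
```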

7.2. Functional programming

R is a functional programming language. As such, it shares several common features with other languages that emphasise the role of function manipulation in software development (e.g., Common Lisp, Scheme, OCaml, Haskell, Clojure, F#). Let us explore them now.

7.2.1. Functions are objects

R functions were given the right to a fair go; they are what we refer to as first-class citizens. In other words, our interaction with them is not limited to their invocation; we treat them as any other language object.

  • They can be stored inside list objects:

    list(identity, NROW, sum)  # a list storing three functions
    ## [[1]]
    ## function (x)
    ## x
    ## <environment: namespace:base>
    ##
    ## [[2]]
    ## function (x)
    ## if (length(d <- dim(x))) d[1L] else length(x)
    ## <environment: namespace:base>
    ##
    ## [[3]]
    ## function (..., na.rm = FALSE)  .Primitive("sum")
    

    This is possible owing to the fact that lists, as we recall, can embrace R objects of any kind.

  • They can be created and then called inside another function’s body:

    euclidean_distance <- function(x, y)
    {
        square <- function(z) z^2  # auxiliary/internal/helper function
        sqrt(sum(square(x-y)))     # square root of the sum of squares
    }
    
    euclidean_distance(c(0, 1), c(1, 0))  # example call
    ## [1] 1.4142
    

    This is why we tend to classify functions as representatives of recursive types (compare is.recursive).

  • They can be passed as arguments to other operations:

    # Replaces missing values with a given aggregate
    # of all non-missing elements:
    fill_na <- function(x, filler_fun)
    {
        missing_ones <- is.na(x)  # otherwise, we'd call is.na twice
        replacement_value <- filler_fun(x[!missing_ones])
        x[missing_ones] <- replacement_value
        x
    }
    
    fill_na(c(0, NA_real_, NA_real_, 2, 3, 7, NA_real_), mean)
    ## [1] 0 3 3 2 3 7 3
    fill_na(c(0, NA_real_, NA_real_, 2, 3, 7, NA_real_), median)
    ## [1] 0.0 2.5 2.5 2.0 3.0 7.0 2.5
    

    We call these higher-order functions.
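
The above properties can be combined. Below is a small sketch where functions are extracted from a list and passed on to another function; apply_to is a hypothetical helper that we define only for this illustration.

```r
ops <- list(min, mean, max)  # an unnamed list of three functions
x <- c(5, 1, 3)
ops[[2]](x)                  # extract the second function and call it
## [1] 3

apply_to <- function(f, x) f(x)  # hypothetical helper: calls `f` on `x`
apply_to(ops[[3]], x)
## [1] 5
apply_to(function(z) z^2, x)     # an anonymous function works too
## [1] 25  1  9
```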

Note

More advanced techniques, which we will discuss in the third part of the book, will let the functions be:

  • returned as other function’s outputs,

  • equipped with auxiliary data,

  • generated programmatically on the fly,

  • modified at runtime.

Below we review the most basic higher-order functions, including do.call and Map.

7.2.2. Calling on precomputed arguments with do.call

Notation like f(arg1, ..., argn) has no monopoly over how we are supposed to call a function on a specific sequence of comma-delimited arguments. The list of actual parameters does not have to be hardcoded.

Here is an alternative. We can first prepare a number of objects to be passed as f’s inputs, wrap them in a list l, and then invoke do.call(f, l) to get the same result.

words <- list(
    c("spam",      "bacon",  "eggs"),
    c("buckwheat", "quinoa", "barley"),
    c("ham",       "spam",   "spam")
)
do.call(paste, words)  # paste(words[[1]], words[[2]], words[[3]])
## [1] "spam buckwheat ham" "bacon quinoa spam"  "eggs barley spam"
do.call(cbind, words)  # column-bind; returns a matrix (explained later)
##      [,1]    [,2]        [,3]
## [1,] "spam"  "buckwheat" "ham"
## [2,] "bacon" "quinoa"    "spam"
## [3,] "eggs"  "barley"    "spam"
do.call(rbind, words)  # row-bind (explained later)
##      [,1]        [,2]     [,3]
## [1,] "spam"      "bacon"  "eggs"
## [2,] "buckwheat" "quinoa" "barley"
## [3,] "ham"       "spam"   "spam"

The length and content of the list passed as the second argument of do.call can be arbitrary (possibly unknown at the time of writing the code). See Section 12.1.2 for more use cases, e.g., ways to concatenate a list of data frames (perhaps produced by some complex chain of commands) into a single data frame.

If elements of the list are named, they will be matched to the corresponding keyword arguments.

x <- 2^(seq(-2, 2, length.out=101))
plot_opts <- list(col="red", lty="dashed", type="l")
do.call(plot, c(list(x, log2(x), xlab="x", ylab="log2(x)"), plot_opts))
## (plot display suppressed)

Notice that, e.g., plot_opts can now be reused in further calls to graphics functions. This is very convenient as it avoids repetitions.
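
The same mechanism can be tried without a graphics device. In the sketch below, paste_opts is our own example of a reusable list of keyword arguments.

```r
words <- list(
    c("spam",      "bacon",  "eggs"),
    c("buckwheat", "quinoa", "barley")
)
paste_opts <- list(sep="-")  # named element: matched to paste's `sep`
do.call(paste, c(words, paste_opts))  # paste(words[[1]], words[[2]], sep="-")
## [1] "spam-buckwheat" "bacon-quinoa"   "eggs-barley"
do.call(paste, c(list("a", 1:3), paste_opts))  # the options are reused
## [1] "a-1" "a-2" "a-3"
```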

7.2.3. Common higher-order functions

There is an important class of higher-order functions that allow us to apply custom operations on consecutive elements of sequences without relying on loop-like statements, at least explicitly. They can be found in all functional programming languages (e.g., Lisp, Haskell, Scala) and have been ported to various add-on libraries (functools in Python, more recent versions of the C++ Standard Library, etc.) or frameworks (Apache Spark and the like). Their presence reflects the obvious truth that some operations occur more frequently than others.

In particular:

  • Map calls a function on each element of a sequence in order to transform:

    • their individual components (just like sqrt, round, or the unary `!` operator in R), or

    • the corresponding elements of many sequences so as to vectorise a given operation elementwisely (compare the binary `+` or paste),

  • Reduce (also called accumulate) applies a binary operation to combine consecutive elements in a sequence, e.g., to generate the aggregates, either totally (compare sum, prod, all, max) or cumulatively (compare cumsum, cummin),

  • Filter creates a subset of a sequence that is comprised of elements that enjoy a given property (which we typically achieve in R by means of the `[` operator),

  • Find locates the first element that fulfils some logical condition (compare which),

and so forth.

Below we will only focus on the Map function. The inspection of the remaining ones is left as an exercise. This is because, oftentimes, we can be better off with their more R-ish versions (e.g., using the subsetting operator, `[`).
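
As a starting point for that inspection, here is a brief sketch of how Reduce, Filter, and Find can be called, together with their more R-ish counterparts. The example vector is arbitrary.

```r
x <- c(3, 1, 4, 1, 5)
Reduce(`+`, x)                   # ((((3+1)+4)+1)+5); compare sum(x)
## [1] 14
Reduce(`+`, x, accumulate=TRUE)  # compare cumsum(x)
## [1]  3  4  8  9 14
Filter(function(e) e > 2, x)     # compare x[x > 2]
## [1] 3 4 5
Find(function(e) e > 3, x)       # compare x[which(x > 3)[1]]
## [1] 4
```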

7.2.4. Vectorising functions with Map

In data-centric computing, we are frequently faced with tasks that involve processing each element in a sequence independently, one after another. Such use cases can benefit from vectorised operations like those discussed in Chapter 2, Chapter 3, and Chapter 6.

Unfortunately, most of the functions that we introduced so far cannot be applied on lists. For instance, if we try calling sqrt on a list, we will get an error, even if it is a list of numeric vectors only. One way to compute the square root of all elements would be to invoke sqrt(unlist(...)). It is a go-to approach if we wish to treat all the list’s elements as one sequence. However, this comes at the price of losing the list’s structure.
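
Here is a small sketch of both behaviours. We wrap the failing call in try so that the script does not halt; the list l is an arbitrary example.

```r
l <- list(a=c(1, 4), b=9)  # an example list of numeric vectors
res <- try(sqrt(l), silent=TRUE)  # sqrt(l) alone would halt with an error
inherits(res, "try-error")
## [1] TRUE
sqrt(unlist(l))  # works, but the result is a flat atomic vector
## a1 a2  b
##  1  2  3
```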

We also discussed some operations that are not vectorised with respect to all their arguments, even though they could have been designed this way, e.g., grepl.

The Map function[2] applies an operation on each element in a vector or the corresponding elements in a number of vectors. In many situations, it may be used as a more elegant alternative to for loops that we will introduce in the next chapter.

First[3], a call to Map(f, x) yields a list whose \(i\)-th element is equal to f(x[[i]]) (recall that `[[` works on atomic vectors too).

For example:

x <- list(  # an example named list
    x1=1:3,
    x2=seq(0, 1, by=0.25),
    x3=c(1, 0, NA_real_, 0, 0, 1, NA_real_)
)
Map(sqrt, x)  # x is named, hence the result will be named as well
## $x1
## [1] 1.0000 1.4142 1.7321
##
## $x2
## [1] 0.00000 0.50000 0.70711 0.86603 1.00000
##
## $x3
## [1]  1  0 NA  0  0  1 NA
Map(length, x)
## $x1
## [1] 3
##
## $x2
## [1] 5
##
## $x3
## [1] 7
unlist(Map(mean, x))  # compute three aggregates, convert to an atomic vector
##  x1  x2  x3
## 2.0 0.5  NA
Map(function(n) round(runif(n, -1, 1), 1), c(2, 4, 6))  # x is atomic now
## [[1]]
## [1] 0.4 0.8
##
## [[2]]
## [1]  0.5  0.8 -0.1 -0.7
##
## [[3]]
## [1] -0.3  0.0  0.5  1.0 -0.9 -0.7

Next, we can vectorise a given function over several parameters. A call to, e.g., Map(f, x, y, z) breeds a list whose \(i\)-th element is equal to f(x[[i]], y[[i]], z[[i]]). Like in the case of, e.g., paste, the recycling rule will be applied if necessary.

For example, the following generates list(seq(1, 6), seq(11, 13), seq(21, 29)):

Map(seq, c(1, 11, 21), c(6, 13, 29))
## [[1]]
## [1] 1 2 3 4 5 6
##
## [[2]]
## [1] 11 12 13
##
## [[3]]
## [1] 21 22 23 24 25 26 27 28 29

Moreover, we can get list(seq(1, 40, length.out=10), seq(11, 40, length.out=5), seq(21, 40, length.out=10), seq(31, 40, length.out=5)) by calling:

Map(seq, c(1, 11, 21, 31), 40, length.out=c(10, 5))
## [[1]]
##  [1]  1.0000  5.3333  9.6667 14.0000 18.3333 22.6667 27.0000 31.3333
##  [9] 35.6667 40.0000
##
## [[2]]
## [1] 11.00 18.25 25.50 32.75 40.00
##
## [[3]]
##  [1] 21.000 23.111 25.222 27.333 29.444 31.556 33.667 35.778 37.889 40.000
##
## [[4]]
## [1] 31.00 33.25 35.50 37.75 40.00
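
Map can likewise vectorise an operation such as grepl over an argument that it does not handle elementwisely by itself. The following is only a sketch; the patterns and strings are made up for illustration.

```r
patterns <- c("^s", "a", "z")  # one pattern per string (our example)
strings  <- c("spam", "bacon", "eggs")
unlist(Map(grepl, patterns, strings))  # grepl(patterns[[i]], strings[[i]])
##    ^s     a     z
##  TRUE  TRUE FALSE
```

As the first vector passed is of the type character and has no names attribute, its elements were used to label the result.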

Note

If we have some additional arguments to be passed to the function applied (which it does not have to be vectorised over), we can wrap them inside a separate list and toss it via the MoreArgs argument (à la do.call).

unlist(Map(mean, x, MoreArgs=list(na.rm=TRUE)))  # mean(..., na.rm=TRUE)
##  x1  x2  x3
## 2.0 0.5 0.4

Alternatively, we can always construct a custom anonymous function:

unlist(Map(function(xi) mean(xi, na.rm=TRUE), x))
##  x1  x2  x3
## 2.0 0.5 0.4

Exercise 7.6

Here is an example list of files (see our teaching data repository) with daily Forex rates:

file_names <- c(
    "euraud-20200101-20200630.csv",
    "eurgbp-20200101-20200630.csv",
    "eurusd-20200101-20200630.csv"
)

Call Map to read each dataset with scan and determine each series’ minimal, mean, and maximal value.

Exercise 7.7

Implement your version of the Filter function based on a call to Map.

7.3. Accessing third-party functions

When we indulge in the writing of a software piece, a few questions naturally arise. Is the problem we are facing fairly complex? Has it already been successfully addressed in its entirety? If not, can it, or its parts, be split into manageable chunks? Can it be constructed based on some readily available nontrivial components?

A smart developer is independent but knows when to stand on the shoulders to cry on. Let us explore some ways to reuse the existing function libraries.

7.3.1. Using R packages

Most contributed R extensions come in the form of the so-called add-on packages, which can include:

  • reusable code (e.g., new functions),

  • data (which we can exercise on),

  • documentation (manuals, vignettes, etc.);

see Section 9.3.2 for more and [62] for all the details.

Most packages are published in the moderated repository that is part of the Comprehensive R Archive Network (CRAN). However, there are also other popular sources such as Bioconductor which specialises in bioinformatics.

We call install.packages("pkg") to fetch a package pkg from a repository (CRAN by default; see, however, the repos argument).

A call to library("pkg") loads an indicated package and makes the exported objects available to the user (i.e., attaches it to the search path; see Section 16.2.6).

For instance, in one of the previous chapters, we have mentioned the gsl package:

# call install.packages("gsl") first
library("gsl")  # load the package
poch(10, 3:6)   # calls gsl_sf_poch() from GNU GSL
## [1]    1320   17160  240240 3603600

Here, poch is an object exported by package gsl. If we did not call library("gsl"), trying to access the former would raise an error.

We could also have accessed the above function without attaching it to the search path using the pkg::object syntax, i.e., gsl::poch.
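
The pkg::object syntax can be tried without installing anything, as some packages are shipped with every R installation. Below is a minimal sketch using tools and stats; the file name is a made-up example.

```r
tools::file_ext("rates.csv")  # no need to call library("tools") first
## [1] "csv"
stats::median(c(3, 1, 2))     # works for already attached packages too
## [1] 2
```

Such fully qualified names also make it explicit where a given function comes from.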

Exercise 7.8

Use the find function to determine which packages define mean, var, find, and Map. Recall from Section 1.4 where such information can be found in these objects’ manual pages.

Note

For more information about any R extension, call help(package="pkg"). Also, it is advisable to visit the package’s CRAN entry at an address like https://CRAN.R-project.org/package=pkg to access some additional information (e.g., vignettes; see also vignette(package="pkg")). Why waste our time and energy by querying a web search engine that will lead us to some (usually low-quality) middleman when we can acquire authoritative knowledge directly from the source?

Moreover, it is worth exploring various CRAN Task Views that group the packages into topics such as Genetics, Graphics, and Optimisation. They are curated by experts in their relevant fields.

Important

Frequently, R packages are written in their respective authors’ free time, many of whom are volunteers/public servants/enthusiasts who are neither paid for doing this nor do it as part of their so-called job. You can show appreciation for their generosity by, e.g., spreading the word about their software by citing them in publications (see citation(package="pkg")), talking about them during lunchtime, or mentioning them in (un)social media. You can also help them improve the existing code base by reporting bugs, polishing documentation, proposing new features, or cleaning up the redundant fragments of their APIs. Some readers will become one of them someday (when they come up with something valuable for our community).

7.3.1.1. Default packages

The always-on package base is a must-have. It provides us with the most crucial functions (vector addition, c, Map, library). Certain other packages are also loaded by default:

getOption("defaultPackages")
## [1] "datasets"  "utils"     "grDevices" "graphics"  "stats"
## [6] "methods"

This list can, theoretically, be changed[4]. However, in this book, we assume that the above are always attached because it is reasonable to do so. This is why in Section 2.4.5, there was no need to call, for example, library("stats") before referring to the var and sd functions.

On a side note, grDevices and graphics will be discussed in Chapter 13. methods will be mentioned in Section 11.5. datasets brings a few example R objects on which we can exercise our skills. On the other hand, the functions from utils, graphics, and stats already appeared here and there.

7.3.1.2. Source vs binary packages (*)

R is a free and open project. Therefore, its packages are published primarily in the source form. This way, anyone can study how they work and improve them or reuse parts thereof in different projects.

If we call install.packages("path", repos=NULL, type="source"), we should be able to install a package from sources: path can pinpoint either a directory or a source tarball (most often a compressed pkg_version.tar.gz file; see help("untar")).

Note that type="source" is the default unless one is on W****ws or some m**OS boxes; see getOption("pkgType"). This is because these two require additional build tools to be present in the system, especially if a package features C or C++ code; see Chapter 14 and Section C.3 of [64].

Because these systems are less developer-oriented, as a courtesy to their users, CRAN also distributes the platform-specific binary versions of the packages (.zip or .tgz files). install.packages will try to fetch them by default.

Example 7.9

GitLab and GitHub are quite popular hosting platforms. It is very easy to fetch a package’s source directly from them. At the time of writing this, the relevant links were, respectively:

  • https://gitlab.com/user/repo/-/archive/branch/repo-branch.zip

  • https://github.com/user/repo/archive/branch.zip

For example, to download the contents of the master branch in the repository rpackagedemo owned by gagolews, we can call:

f <- tempfile()  # temporary file name - download destination
download.file("https://github.com/gagolews/rpackagedemo/archive/master.zip",
    destfile=f)

Next, the contents can be extracted with unzip:

t <- tempdir()  # temporary directory to extract the files to
(d <- unzip(f, exdir=t))  # returns extracted file paths

The path where the files were extracted can be passed to install.packages:

install.packages(dirname(d)[1], repos=NULL, type="source")
file.remove(c(f, d))  # clean up

Exercise 7.10

Use the git2r package to clone the git repository located at https://github.com/gagolews/rpackagedemo.git and install the package published therein from the current R session.

7.3.1.3. Managing dependencies (*)

All installed add-on packages may be upgraded to their most recent versions available on CRAN (or other indicated repository) by calling update.packages.

As a general rule, the more experienced developers we become, the less excited we get about the new. Sure, bug fixes and some well-thought-of additional features are usually welcome. Still, just wait until someone updates the package API for the \(n\)-th time, \(n \ge 2\), which will break our program that used to work flawlessly for so long.

Hence, when designing software projects (see Chapter 9 for more details), we must ask ourselves the ultimate question: do we really need to import that package with lots of dependencies from which we will just use only about 3–5 functions? Wouldn’t it be better to write our own version of some functionality (and learn something new, exercise our brain, etc.), or call a mature terminal-based tool?

Otherwise, as all the historical versions of all the packages are archived on CRAN, some software dependency management can easily be conducted by storing different releases of packages in different directories (only one version of a package can be loaded at a time though). This way, we can create an isolated environment for the add-ons.

To fetch the locations where packages are sought (in this very order), call:

.libPaths()
## [1] "/home/gagolews/R/x86_64-pc-linux-gnu-library/4.3"
## [2] "/usr/local/lib/R/site-library"
## [3] "/usr/lib/R/site-library"
## [4] "/usr/lib/R/library"

The same function can be used to add new folders to the search path; see also the environment variable R_LIBS_USER (e.g., help("Sys.setenv")). The install.packages function will honour them as target directories; see its lib parameter for more details.
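
A project-specific library could be set up along the following lines. This is a sketch under our own assumptions: the folder name is hypothetical, and we place it in the session's temporary directory so that nothing is left behind.

```r
proj_lib <- file.path(tempdir(), "project-library")  # hypothetical location
dir.create(proj_lib, showWarnings=FALSE)  # nonexistent folders are ignored

old_paths <- .libPaths()           # remember the current setting
.libPaths(c(proj_lib, old_paths))  # prepend the new folder
first <- .libPaths()[1]            # it is now searched first
basename(first)
## [1] "project-library"
# install.packages("pkg", lib=proj_lib)  # would install `pkg` therein
.libPaths(old_paths)               # restore the previous setting
```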

Moreover, the packages may deposit some auxiliary data on the user’s machine. Therefore, it might be worthwhile to set the following directories (via the corresponding environment variables) relative to the current project:

tools::R_user_dir("pkg", "data")   # R_USER_DATA_DIR
## [1] "/home/gagolews/.local/share/R/pkg"
tools::R_user_dir("pkg", "config") # R_USER_CONFIG_DIR
## [1] "/home/gagolews/.config/R/pkg"
tools::R_user_dir("pkg", "cache")  # R_USER_CACHE_DIR
## [1] "/home/gagolews/.cache/R/pkg"
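
For instance, the data directory can be redirected as in the sketch below. The target folder is a made-up example, and we assume R 4.0 or later, where tools::R_user_dir is available.

```r
proj_data <- file.path(tempdir(), "myproject", "data")  # hypothetical folder
Sys.setenv(R_USER_DATA_DIR=proj_data)  # affects the current session only
d <- tools::R_user_dir("pkg", "data")
startsWith(d, proj_data)  # the directory is now project-specific
## [1] TRUE
Sys.unsetenv("R_USER_DATA_DIR")  # revert to the default behaviour
```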

7.3.2. Calling external programs

Many tasks can naturally be accomplished by calling external programs. Such an approach is particularly natural on UNIX-like systems, which classically follow modular, minimalist design patterns. There are many tools at a developer’s disposal, and each of them is specialised in solving a single, well-defined problem.

Apart from the many standard UNIX commands, we can consider, for example:

  • pandoc converts documents between markup formats, e.g., Markdown, reStructuredText, LaTeX, LibreOffice Writer, EPUB;

  • pdflatex, xelatex, and lualatex compile LaTeX documents to PDF;

  • convert (from ImageMagick) applies various operations on bitmap graphics (scaling, cropping, conversion between formats);

  • graphviz and PlantUML can be used to create various graphs and diagrams;

  • jupyter-nbconvert converts Jupyter notebooks (see Section 1.2.5) to other formats such as LaTeX, HTML, Markdown, etc.;

  • python, perl, … can be called to perform tasks that can be expressed more easily in languages other than R;

and so forth.

The good news is that R can not only be called from the system shell (in an interactive or batch mode; see Section 1.2). It can also serve well as a glue language.

The system2 function can be used to invoke any system command. Communication between such programs can be done using, e.g., intermediate text, JSON, CSV, XML, or any other files. The stdin, stdout, and stderr arguments can control the redirection of the standard I/O streams.

system2("pandoc", "-s input.md -o output.html")
system2("bash", "-c 'for i in `seq 1 2 10`; do echo $i; done'", stdout=TRUE)
## [1] "1" "3" "5" "7" "9"
system2("python3", "-", stdout=TRUE,
    input=c(
    "import numpy as np",
    "print(repr(np.arange(5)))"
    ))
## [1] "array([0, 1, 2, 3, 4])"
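
To illustrate the communication via intermediate files and the role of exit codes, here is a rough sketch that uses a second R process as the external program. It assumes that the `Rscript` executable resides in `R.home("bin")` and that a UNIX-like shell is used. The names `infile` and `child_code` are introduced only for the sake of this example.

```r
rscript <- file.path(R.home("bin"), "Rscript")  # assumed location of Rscript

# pass the input data to the child process via an intermediate CSV file:
infile <- tempfile(fileext=".csv")
write.csv(data.frame(x=c(1, 3, 5)), infile, row.names=FALSE)

# code to be run by the child: sum the first column of the file
# whose name is given as the first command-line argument
child_code <- "cat(sum(read.csv(commandArgs(TRUE)[1])[[1]]))"

# shQuote protects the arguments from being interpreted by the shell:
out <- system2(rscript,
    c("--vanilla", "-e", shQuote(child_code), shQuote(infile)),
    stdout=TRUE)  # capture the standard output as a character vector
print(out)
## [1] "9"

# when the output is not captured, system2 returns the exit status:
status <- system2(rscript, c("--vanilla", "-e", shQuote("quit(status=3)")))
print(status)
## [1] 3
```

A nonzero exit status conventionally signals a failure, so it is worth inspecting before any results are used further.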

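On a side note, the current working directory can be read and changed through a call to getwd and setwd, respectively. By default, it is the directory where the current R session was started. Relative file paths, including those passed to external programs, are resolved with respect to it.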
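
For instance, below is a minimal sketch of a helper that changes the working directory only for the duration of a single call. The name `in_dir` is hypothetical. The sketch relies on the fact that setwd invisibly returns the previous working directory.

```r
in_dir <- function(dir, f)  # hypothetical helper, not part of base R
{
    old <- setwd(dir)    # switch to `dir`, remember the previous directory
    on.exit(setwd(old))  # restore it when the function exits, even on error
    f()                  # call `f` with `dir` as the working directory
}

start <- getwd()
res <- in_dir(tempdir(), function() basename(getwd()))
identical(getwd(), start)  # the original working directory is back
## [1] TRUE
```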

Important

Relying on system2 assumes that the commands it refers to are available on the target platform. Hence, it might not be portable unless additional assumptions are made (e.g., that a user runs some UNIX-like system and that certain libraries are installed therein). We strongly recommend GNU/Linux or FreeBSD for both software development and production use, as they are free, open, developer-friendly, user-loving, reliable, ethical, and sustainable.
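
One way to make such assumptions explicit is to test for the presence of a command before calling it. The following is only a sketch: `run_if_available` is a hypothetical helper, and the command name used in the example is deliberately made up.

```r
run_if_available <- function(command, args=character(0))
{
    path <- unname(Sys.which(command))  # "" if the command cannot be found
    if (!nzchar(path)) {
        warning(sprintf("command '%s' is not available", command))
        return(NULL)
    }
    system2(path, args, stdout=TRUE)  # capture the standard output
}

# a made-up command name that is not expected to exist on any system:
res <- suppressWarnings(run_if_available("surely-no-such-command-xyz"))
is.null(res)
## [1] TRUE
```

Returning NULL with a warning is just one possible design. Calling stop might be more appropriate when the external tool is indispensable.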

7.3.3. Interfacing C, C++, Fortran, Python, Java, etc. (**)#

Most stand-alone data processing algorithms are implemented in compiled, slightly lower-level programming languages. This usually makes them faster and more reusable in other environments. For instance, an industry-standard library might be written in very portable C, C++, or Fortran and have some bindings available for easier access from within R, Python, Julia, etc. It is the case with FFTW, LIBSVM, mlpack, OpenBLAS, ICU, and GNU GSL, amongst many others. Chapter 14 explains basic ways to refer to such compiled code.

Also, the rJava package can dynamically create JVM objects and access their fields and methods. Similarly, reticulate can be used to access Python objects, including numpy arrays and pandas data frames (but see also the rpy2 package for Python).

Important

We should not feel obliged to use R in all parts of a data processing pipeline. Some activities can be expressed more naturally in other languages or environments (e.g., parse raw data and create a SQL database in Python but visualise it in R). Any suitable tool (including R, Python, or Bash) can then serve as the glue language that steers the data flow in the right direction.

R is an effective glue language: it is suitable for implementing data wrangling pipelines, visualisation, and developing prototypes of data analysis algorithms. In other words, it makes connecting larger building blocks very easy.

Nevertheless, for performance reasons[5], we should move the more computing-intensive tasks to the C or C++ level. Chapter 14 demonstrates that R works very well as a user-friendly interface to compiled code written in these languages[6].

7.4. Exercises#

Exercise 7.11

Answer the following questions:

  • What is the result of “x <- 2; x <- function(x) x^2; x(x)”?

  • How to compose a function that returns two objects?

  • What is a higher-order function?

  • What are the use cases of do.call?

  • Why is a call to Map not necessary in the expression “Map(paste, x, y, z)”?

  • What is the difference between Map(mean, x, na.rm=TRUE) and Map(mean, x, MoreArgs=list(na.rm=TRUE))?

  • What do we mean when we write stringx::sprintf?

  • How to get access to the vignettes (tutorials, FAQs, etc.) of the data.table and dplyr packages? Why would perhaps 95% of R users just google it, and what is sub-optimal about this strategy?

  • What is the difference between a source and a binary package?

  • How to update the base package?

  • How to ensure that we will always run an R session with only specific versions of a set of packages?

Exercise 7.12

Write a function that computes the Gini index of a vector of positive integers x, which, assuming \(x_1\le x_2\le\dots\le x_n\), is equal to:

\[G(x_1,\dots,x_n) = \frac{ \sum_{i=1}^{n} (n-2i+1) x_{n-i+1} }{ (n-1) \sum_{i=1}^n x_i }.\]
Exercise 7.13

Implement a function between(x, a, b) that verifies whether each element in x is in the [a, b] interval. Return a logical vector of the same length as x. Ensure the function is correctly vectorised with respect to all the arguments and handles missing data correctly.

Exercise 7.14

Write your version of the strrep function called dup.

dup <- ...to.do...
dup(c("a", "b", "c"), c(1, 3, 5))
## [1] "a"     "bbb"   "ccccc"
dup("a", 1:3)
## [1] "a"   "aa"  "aaa"
dup(c("a", "b", "c"), 4)
## [1] "aaaa" "bbbb" "cccc"
Exercise 7.15

Given a list x, generate its sublist with all the elements equal to NULL removed.

Exercise 7.16

Implement your version of the built-in sequence function.

Exercise 7.17

Using Map, how can we generate window indexes like below?

## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 2 3 4
##
## [[3]]
## [1] 3 4 5
##
## [[4]]
## [1] 4 5 6

Write a function windows(k, n) that yields all index windows of length \(k\) with elements between \(1\) and \(n\) (the above example is for \(k=3\) and \(n=6\)).

Exercise 7.18

Implement a function movstat(f, x, k) that computes, using Map, a given aggregate f of each \(k\) consecutive elements in x. For instance:

movstat <- ...to.do...
x <- c(1, 3, 5, 10, 25, -25)  # example data
movstat(mean, x, 3)           # 3-moving mean
## [1]  3.0000  6.0000 13.3333  3.3333
movstat(median, x, 3)         # 3-moving median
## [1]  3  5 10 10
Exercise 7.19

Write a function to extract all \(q\)-grams, \(q \ge 1\), from a given character vector. Return a list of character vectors. For example, 2-grams (bigrams) in "abcd" are: "ab", "bc", and "cd".

Exercise 7.20

Recode a character vector with a small number of distinct values to a vector where each unique code is assigned a positive integer from \(1\) to \(k\). Example calls and the corresponding expected results:

recode <- ...to.do...
recode(c("a", "a", "a", "b", "b"))
## [1] 1 1 1 2 2
recode(c("x", "z", "y", "x", "y", "x"))
## [1] 1 3 2 1 2 1
Exercise 7.21

Implement a function that returns the number of occurrences of each unique element in a given atomic vector. The return value should be a numeric vector equipped with a names attribute.

count <- ...to.do...
count(c(5, 5, 5, 5, 42, 42, 954))
##   5  42 954
##   4   2   1
count(c("x", "z", "y", "x", "y", "x", "w", "x", "x", "y", NA_character_))
##    w    x    y    z <NA>
##    1    5    3    1    1

Hint: use match and tabulate.

Exercise 7.22

Extend the built-in duplicated function. For each vector element, indicate which occurrence of a repeated value it is (counting from the beginning of the vector).

duplicatedn <- ...to.do...
duplicatedn(c("a", "a", "a", "b", "b"))
## [1] 1 2 3 1 2
duplicatedn(c("x", "z", "y", "x", "y", "x", "w", "x", "x", "y", "z"))
##  [1] 1 1 1 2 2 3 1 4 5 3 2
Exercise 7.23

Based on a call to Map, implement a function my_split such that, given a vector x and an atomic vector y of the same length as x, my_split(x, y) yields the same result as split(x, y).

Exercise 7.24

Extend my_split to handle the second argument being a list of the form list(y1, y2, ...) representing the product of many levels. If the \(y\)s are of different lengths, apply the recycling rule.

Exercise 7.25

Implement my_unsplit, your version of the built-in unsplit. Ensure that my_unsplit(split(x, g), g) == x holds for x and g of the same lengths.

Exercise 7.26

Write a function that takes as arguments: (a) an integer \(n\), (b) a numeric vector x of length \(k\) and no duplicated elements, (c) a vector of probabilities p of length \(k\). Verify that \(p_i\ge 0\) for all \(i\) and \(\sum_{i=1}^k p_i \simeq 1\). Based on a random number generator from the uniform distribution on the unit interval, generate \(n\) independent realisations of a random variable \(X\) such that \(\Pr(X=x_i)=p_i\) for \(i=1,\dots,k\). Hint: to obtain a single value:

  1. generate \(u\in[0, 1]\),

  2. find \(m\in\{1,\dots,k\}\) such that \(u\in\left(\sum_{j=1}^{m-1} p_{j}, \sum_{j=1}^m p_{j}\right]\),

  3. the result is then \(x_m\).

Exercise 7.27

Write a function that takes as arguments: (a) an increasingly sorted vector x of length \(n\), (b) any vector y of length \(n\), (c) a vector z of length \(k\) and elements in \([x_1,x_n)\). Let \(f\) be the piecewise linear spline that interpolates the points \((x_1,y_1),\dots,(x_n,y_n)\). Return a vector w of length \(k\) such that \(w_i=f(z_i)\).

Exercise 7.28

(*) Write functions dpareto, ppareto, qpareto, and rpareto that implement the basic functions related to the Pareto distribution; compare Section 2.3.4.