9. Designing functions#

In Chapter 7, we learnt how to compose simple functions. This skill is vital to enforcing the good development practice of avoiding code repetition: running the same command sequence on different data.

This chapter is devoted to designing reusable methods so that they are easier to use, test, and maintain. We also provide more technical details about functions. They were not of the highest importance during our first exposure to this topic but are crucial to our better understanding of how R works.

9.1. Managing data flow#

A function, most of the time, can and should be treated as a black box. Its callers do not have to care what it hides inside. After all, they are supposed to use it. Given some inputs, they expect well-defined outputs that are explained in detail in the function’s manual.

9.1.1. Checking input data integrity and argument handling#

A function takes R objects of any kind as arguments, but that does not mean that feeding it with just anything is healthy for its guts.

When designing functions, it is best to handle the inputs in a manner similar to base R’s behaviour. This will make our contributions easier to work with.

Lamentably, base functions frequently do not process arguments of a similar kind fully consistently. Such variability might be due to many reasons and, in essence, is not necessarily bad. Usually, there might be many possible behaviours and choosing one over another would make a few users unhappy anyway. Some choices might not be optimal, but they are for historical compatibility (e.g., with S). Of course, it might also happen that something is poorly designed or there is a bug (but the likelihood is low).

This is why we should rather keep our vocabulary restricted. Even if there are exceptions to the general rules, with fewer functions, they are easier to remember. We advocate for such minimalism in this book.

Consider the following case study, illustrating that even the extremely simple scenario dealing with a single positive integer is not necessarily straightforward.

Exercise 9.1

In mathematical notation, we usually denote the number of objects in a collection by the famous “\(n\)”. It is implicitly assumed that such \(n\) is a single natural number (albeit whether this includes 0 or not should be specified at some point). The functions runif, sample, seq, rep, strrep, and class::knn take it as an argument. Nonetheless, nothing stops us from trying to challenge them by passing:

  • 2.5, -1, 0, 1-1e-16 (non-positive numbers, non-integers);

  • NA_real_, Inf (not finite);

  • 1:5 (not of length 1; after all, there are no scalars in R);

  • numeric(0) (an empty vector);

  • TRUE, NA, c(TRUE, FALSE, NA), "1", c("1", "2", "3") (non-numeric, but coercible to);

  • list(1), list(1, 2, 3), list(1:3, 4) (non-atomic);

  • "Spanish Inquisition" (unexpected nonsense);

  • as.matrix(1), factor(7), factor(c(3, 4, 2, 3)), etc. (compound types; Chapter 10).

Read the aforementioned functions’ reference manuals and call them on different inputs. Notice how differently they handle such atypical arguments.

Sometimes we will rely on other functions to check data integrity for us.

Example 9.2

Consider a function that generates \(n\) pseudorandom numbers from the unit interval rounded to \(d\) decimal digits. We strongly believe, or at least hope (the good faith and high competence assumption), that its author knew what he was doing when he wrote:

round_rand <- function(n, d)
{
    x <- runif(n)  # runif will check if `n` makes sense
    round(x, d)    # round will determine the appropriateness of `d`
}

What constitutes correct \(n\) and \(d\) and how the function behaves when not provided with positive integers is determined by the two underlying functions, runif and round:

round_rand(4, 1)  # the expected use case
## [1] 0.3 0.8 0.4 0.9
round_rand(4.8, 1.9)  # 4, 2
## [1] 0.94 0.05 0.53 0.89
round_rand(4, NA)
## [1] NA NA NA NA
round_rand(0, 1)
## numeric(0)

Some design choices can be defended if they are well thought out and adequately documented. Certain programmers will opt for high uniformity/compatibility across numerous tools, but there are also cases where diversity does more good than harm.

Our functions might become part of a more complicated data flow pipeline. Let’s consider what happens when another procedure generates a value that we did not expect (due to a bug or because we did not study its manual). The problem arises when this unthinkable value is passed to our function. In our case, this would correspond to the said \(n\)’s or \(d\)’s being determined programmatically.

Example 9.3

Continuing the previous example, the following might be somewhat challenging with regard to our being flexible and open-minded:

round_rand(c(100, 42, 63, 30), 1)  # n=length(c(...))
## [1] 0.7 0.6 0.1 0.9
round_rand("4", 1)  # n=as.numeric("4")
## [1] 0.2 0.0 0.3 1.0

Sure, it is convenient. Nevertheless, it might lead to problems that are hard to diagnose.

Also, note the not-so-informative error messages in cases like:

round_rand(NA, 1)
## Error in runif(n): invalid arguments
round_rand(4, "1")
## Error in round(x, d): non-numeric argument to mathematical function

Defensive design strategies are always welcome, especially if they lead to constructive error messages.

Important

stopifnot gives a convenient means to assert the enjoyment of our expectations about a function’s arguments (or intermediate values). A call to stopifnot(cond1, cond2, ...) is more or less equivalent to:

if (!(is.logical(cond1) && !any(is.na(cond1)) && all(cond1)))
    stop("`cond1` are not all TRUE")
if (!(is.logical(cond2) && !any(is.na(cond2)) && all(cond2)))
    stop("`cond2` are not all TRUE")
...

Thus, if all the elements in the given logical vectors are TRUE, nothing happens. We can move on with certainty.

Example 9.4

We can rewrite the preceding function as:

round_rand2 <- function(n, d)
{
    stopifnot(
        is.numeric(n), length(n) == 1,
        is.finite(n), n > 0, n == floor(n),
        is.numeric(d), length(d) == 1,
        is.finite(d), d > 0, d == floor(d)
    )
    x <- runif(n)
    round(x, d)
}

round_rand2(5, 1)
## [1] 0.7 0.7 0.5 0.6 0.3
round_rand2(5.4, 1)
## Error in round_rand2(5.4, 1): n == floor(n) is not TRUE
round_rand2(5, "1")
## Error in round_rand2(5, "1"): is.numeric(d) is not TRUE

It is the strictest possible test for “a single positive integer”. In the case of any violation of the underlying condition, we get a very informative error message.

Example 9.5

At other times, we might be interested in a more liberal yet still foolproof argument checking like:

if (!is.numeric(n))
    n <- as.numeric(n)
if (length(n) > 1) {
    warning("only the first element will be used")
    n <- n[1]
}
n <- floor(n)
stopifnot(is.finite(n), n > 0)

This way, "4" and c(4.9, 100) will both be accepted as 4[1].
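
For instance, a more forgiving (yet still foolproof) version of our example function might look as follows; this is only a sketch, with the name round_rand3 and the exact warning message being of our choosing:

round_rand3 <- function(n, d)
{
    if (!is.numeric(n))
        n <- as.numeric(n)  # non-coercible inputs become NAs (with a warning)
    if (length(n) > 1) {
        warning("only the first element of `n` will be used")
        n <- n[1]
    }
    n <- floor(n)
    stopifnot(is.finite(n), n > 0)
    x <- runif(n)
    round(x, d)  # `d` is still handled by round itself
}

Now calls like round_rand3("4", 1) and round_rand3(c(4.9, 100), 1) proceed as if n were equal to 4, the latter with a warning.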

We see that there is always a tension between being generous/flexible and precise/restrictive. Also, because of their particular use cases, for certain functions, it will be better to behave differently from the others. Excessive uniformity is as bad as chaos. We are always expected to rely on common sense. Let’s not be boring bureaucrats.

Still, it is our duty to be explicit about all the assumptions we make or exceptions we tolerate (by writing comprehensive documentation; see Section 9.2.2.3).

Note

(*) Example exercises related to improving the consistency of base R’s argument handling in different domains include the vctrs and stringx packages. Can these contributions be justified?

Exercise 9.6

Reflect on how you would respond to miscellaneous boundary cases in the following scenarios (and how base R and other packages or languages you know deal with them):

  • a vectorised mathematical function (empty vector? non-numeric input? what if it is equipped with the names attribute? what if it has other ones?);

  • an aggregation function (what about missing values? empty vector?);

  • a function vectorised with regard to two arguments (elementwise vectorisation? recycling rule? only scalar vs vector, or vector vs vector of the same length allowed? what if one argument is a row vector and the other is a column vector?);

  • a function vectorised with respect to all arguments (really all? maybe some exceptions are necessary?);

  • a function vectorised with respect to the first argument but not the second (why such a restriction? when?).

Find a few functions that match each case.
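
To give a taste of the considerations behind the third scenario, below is one way (merely a sketch, not the definitive approach; vadd is a hypothetical function) to implement elementwise vectorisation with a base-R-like recycling rule:

vadd <- function(x, y)
{
    stopifnot(is.numeric(x), is.numeric(y))
    if (length(x) == 0 || length(y) == 0)
        return(numeric(0))  # a design choice; base `+` behaves like this too
    n <- max(length(x), length(y))
    if (n %% length(x) != 0 || n %% length(y) != 0)
        warning("longer object length is not a multiple of shorter object length")
    rep_len(x, n) + rep_len(y, n)  # recycle both to the common length
}

vadd(1:6, c(-1, 1))  # the second argument is recycled
## [1] 0 3 2 5 4 7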

9.1.2. Putting outputs into context#

Our functions do not exist in a vacuum. We should put them into a much broader context: how can they be combined with other tools?

As a general rule, we ought to generate outputs of a predictable kind. This way, we can easily deduce what will happen in the code chunks that utilise them.

Example 9.7

Some base R functions do not adhere to this rule for the sake of (questionable) users’ convenience. We will meet a few of them in Chapter 11 and Chapter 12. In particular, sapply and the underlying simplify2array can return a list, an atomic vector, or a matrix.

simplify2array(list(1, 3:4))    # list
## [[1]]
## [1] 1
##
## [[2]]
## [1] 3 4
simplify2array(list(1, 3))      # vector
## [1] 1 3
simplify2array(list(1:2, 3:4))  # matrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Further, the index operator with drop=TRUE, which is the default, may output an atomic vector. However, it may as well yield a matrix or a data frame.

(A <- matrix(1:6, nrow=3))  # an example matrix
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
A[1, ]    # vector
## [1] 1 4
A[1:2, ]  # matrix
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
A[1, , drop=FALSE]  # matrix with 1 row
##      [,1] [,2]
## [1,]    1    4

We proclaim that, if there are many options, a function’s default behaviour should be to return an object of the most generic kind possible, even when it is not the most convenient form. Then, either:

  • we equip the function with a further argument which must be explicitly set if we really want to simplify the output, or

  • we ask the user to call a simplifier explicitly after the function call; in this case, if the simplifier cannot neaten the object, it should probably fail by issuing an error or at least try to apply some brute force solution (e.g., “fill the gaps” somehow itself, preferably with a warning).

For instance:

as.numeric(A[1:2, ])  # always returns a vector
## [1] 1 2 4 5
stringi::stri_list2matrix(list(1, 3:4))  # fills the gaps with NAs
##      [,1] [,2]
## [1,] "1"  "3"
## [2,] NA   "4"

Ideally, a function is expected to perform one (and only one) well-defined task. If it tends to generate objects of different kinds, depending on the arguments provided, it might be better to compose two or more separate procedures instead.

Exercise 9.8

Functions such as rep, seq, and sample do not perform a single task. Or do they?
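
As a hint, compare the following calls; the behaviour of seq and sample depends quite heavily on the number and form of the arguments provided:

seq(5)  # like seq_len(5)
## [1] 1 2 3 4 5
seq(5, 10)  # an arithmetic progression from 5 to 10
## [1]  5  6  7  8  9 10
seq(5, 10, 2)  # ... now with step 2
## [1] 5 7 9
sample(5)        # a random permutation of 1:5
sample(c(5, 7))  # a random permutation of the two-element set {5, 7}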

Note

(*) In a purely functional programming language, we can assume the so-called referential transparency: a call to a pure function can always be replaced with the value it generates. If this is true, then for the same set of argument values, the output is always the same. Furthermore, there are no side effects. In R, it is not exactly the case:

  • a call can introduce/modify/delete variables in other environments (see Chapter 16), e.g., the state of the random number generator,

  • due to lazy evaluation, functions are free to interpret the argument forms (passed expressions, i.e., not only: values) however they like; see Section 9.4.7, Section 12.3.9, and Section 17.5,

  • printing, plotting, file writing, and database access have apparent consequences with regard to the state of certain external devices or resources.

Important

Each function must return a value. However, in several instances (e.g., plotting, printing) this does not necessarily make sense. In such a case, we may consider returning invisible(NULL), a NULL whose first printing will be suppressed. Compare the following:

f <- function() invisible(NULL)
f()  # printing suppressed
x <- f()  # by the way, assignment also returns an invisible value
print(x)  # no longer invisible
## NULL

9.2. Organising and maintaining functions#

9.2.1. Function libraries#

Definitions of frequently-used functions or datasets can be emplaced in separate source files (.R extension) for further reference.

Such libraries can be executed by calling:

source("path_to_file.R")
Exercise 9.9

Create a source file (script) named mylib.R, where you define a function called nlargest which returns a few largest elements in a given atomic vector.

From within another script, call source("mylib.R"); note that relative paths refer to the current working directory (Section 2.1.6). Then, write a few lines of code where you test nlargest on some example inputs.

9.2.2. Writing R packages (*)#

When a function library grows substantially, when there is a need for equipping its contents with the relevant help pages, or when we wish to rely on compiled code, turning it into an R package might be worth considering.

Important

Packages can be written just for our own or a small team’s purposes. We do not have to publish them on CRAN[2]. Let’s have mercy on the busy CRAN maintainers and not contribute to the information overload unless we have come up with something potentially of service[3] to other R users. Packages can always be hosted on and installed from GitLab or GitHub.

9.2.2.1. Package structure (*)#

A source package is a directory containing the following special files and subdirectories:

  • DESCRIPTION – a text file that gives the name of the project, its version, authors, dependencies on other packages, license, etc.;

  • NAMESPACE – a text file containing directives stating which objects are available to the package users and which names are imported from other packages;

  • R – a directory with R scripts (.R files), which define, e.g., functions, example datasets, etc.;

  • man – a directory with R documentation files (.Rd), describing at least all the exported objects (Section 9.2.2.3);

  • src – optional; compiled code (Chapter 14);

  • tests – optional; tests to run on the package check (Section 9.2.4.2).

See Section 1 of Writing R Extensions [65] for more details and other options. We do not need to repeat the information from the official manual as all readers can read it themselves.
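
For illustration, a minimal DESCRIPTION file might look like the following (all the values are, naturally, made up):

Package: mypkg
Type: Package
Title: An Example R Package
Version: 0.1.0
Author: Spam Spamington
Maintainer: Spam Spamington <spam@example.org>
Description: Solutions to some programming exercises.
License: GPL (>= 2)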

Exercise 9.10

Inspect the source code of the example package available for download from https://github.com/gagolews/rpackagedemo.

9.2.2.2. Building and installing (*)#

Recall from Section 7.3.1.2 that a source package can be built and installed by calling:

install.packages("pkg_directory", repos=NULL, type="source")

Then it can be used as any other R package (Section 7.3.1). In particular, it can be loaded and attached to the search path (Section 16.2.6) via a call to:

library("pkg")

All the exported objects mentioned in its NAMESPACE file are now available to the user; see also Section 16.3.5.
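
For example, a simple NAMESPACE file might consist of directives such as (a hypothetical sketch; see [65] for the complete list):

export(nlargest)           # make `nlargest` visible to the package users
importFrom(stats, median)  # use `median` from the stats package
S3method(print, myclass)   # register a print method for the class `myclass`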

Exercise 9.11

Create a package mypkg with the solutions to the exercises listed in the previous chapter. When in doubt, refer to the official manual [65].

Note

(*) The building and installing of packages can also be done from the command line:

R CMD build pkg_directory  # creates a distributable source tarball (.tar.gz)
R CMD INSTALL pkg-version.tar.gz
R CMD INSTALL --build pkg_directory

Also, some users may benefit from authoring Makefiles that help automate the processes of building, testing, checking, etc.

9.2.2.3. Documenting (*)#

Documenting functions and commenting code thoroughly is critical, even if we just write for ourselves. Most programmers will sooner or later notice that they find it hard to determine what a piece of code is doing after taking a longer break from it. In some sense, we always communicate with external audiences, which includes our future selves.

The help system is one of the stronger assets of the R environment. By now, we have most likely interacted with many documentation pages and got a general idea of what constitutes an informative documentation piece.

From the technical side, documentation (.Rd) files are located in the man subdirectory of a source package. All exported objects (e.g., functions) should be described clearly. Additional topics can be covered too.

During the package installation, the .Rd files are converted to various output formats, e.g., HTML or plain text, and displayed on a call to the well-known help function.

Documentation files use a LaTeX-like syntax, which looks obscure to an untrained eye. The relevant commands are explained in detail in Section 2 of [65].

Note

The process of writing .Rd files by hand might be tedious, especially keeping track of the changes to the \usage and \arguments commands. Rarely do we recommend using external packages, for base R facilities are usually sufficient. But roxygen2 might be worth a try because it makes the developers’ lives easier. Most importantly, it allows the documentation to be specified alongside the functions’ definitions, which is much more natural.
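
For instance, with roxygen2, the forthcoming .Rd file is generated from specially formatted comments placed directly above an object’s definition (a minimal sketch):

#' Root Mean Square
#'
#' Computes the quadratic mean of a numeric vector.
#'
#' @param x a numeric vector
#' @return A single number.
#' @export
rms <- function(x) sqrt(mean(x^2))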

Exercise 9.12

Add a few manual pages to your example R package.

9.2.3. Writing standalone programs (**)#

Section 7.3.2 mentioned how to call external programs using system2.

On UNIX-like operating systems, it is easy to turn our R scripts into standalone tools that can be run from the terminal. We have already touched upon this topic in Section 1.2.3.

The commandArgs function returns the list of arguments passed from the command line to our script in the form of a character vector. Whatever we do with them is up to us. Moreover, q can terminate a script, yielding any integer return code. By convention, anything other than 0 indicates an error.

Example 9.13

Say we have the following script named testfiles in the current directory:

#!/bin/env -S Rscript --vanilla

argv <- commandArgs(trailingOnly=TRUE)
cat("commandArgs:\n")
print(argv)

if (length(argv) == 0) {
    cat("Usage: testfiles file1 file2 ...\n")
    q(save="no", status=1)  # exit with code 1
}

if (!all(file.exists(argv))) {
    cat("Some files do not exist.\n")
    q(save="no", status=2)  # exit with code 2
}

cat("All files exist.\n")

# exits with code 0 (success)

Example interactions with this program from a UNIX-like shell (bash):

chmod u+x testfiles  # add permission to execute
./testfiles
## commandArgs:
## character(0)
## Usage: testfiles file1 file2 ...
./testfiles spanish_inquisition
## commandArgs:
## [1] "spanish_inquisition"
## Some files do not exist.
./testfiles spam bacon eggs spam
## commandArgs:
## [1] "spam"  "bacon" "eggs"  "spam"
## All files exist.

stdin, stdout, and stderr represent the always-open connections mapped to the standard input (“keyboard”), as well as the normal and error output. They can be read from or written to using functions such as scan or cat.

During run time, we can redirect stdout and stderr to different files or even strings using sink.
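
For instance:

sink("output.txt")  # redirect the standard output stream to a file
cat("this message ends up in the file, not on the console\n")
sink()              # restore the previous output stream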

9.2.4. Assuring quality code#

Below we mention some good development practices related to maintaining quality code. This is an important topic, but writing about them is tedious to the same extent that reading about them is dull. It is the more artistic part of software engineering as such heuristics are learnt best by observing and mimicking what more skilled programmers are doing (the coming exercises aim to make up for our not having them at hand at the moment).

9.2.4.1. Managing changes and working collaboratively#

We recommend employing a source code version control system, such as git, to keep track of the changes made to the software.

Note

It is worth investing time and effort to learn how to use git from the command line; see https://git-scm.com/doc.

There are a few hosting providers for git repositories, with GitLab and GitHub being particularly popular among open-source software developers. They support working collaboratively on the projects and are equipped with additional tools for reporting bugs, suggesting feature requests, etc.

Exercise 9.14

Find the source code of your favourite R packages or other projects. Explore the corresponding repositories, feature trackers, wikis, discussion boards, etc. Each community is different and is governed by varied, sometimes contrasting guidelines; after all, we come from all corners of the world.

9.2.4.2. Test-driven development and continuous integration#

It is often hygienic to follow some principles of test-driven development.

Exercise 9.15

Assume that, for some reason, we were asked to compose a function to compute the root mean square (quadratic mean) of a given numeric vector. Before implementing the actual routine, we need to reflect upon what we want to achieve, especially how we want our function to behave in certain boundary cases.

stopifnot gives simple means to ensure that a given assertion is fulfilled. If that is the case, it will move forward without fuss.

Say we have come up with the following set of expectations:

stopifnot(all.equal(rms(1), 1))
stopifnot(all.equal(rms(1:100), 58.16786054171151931769))
stopifnot(all.equal(rms(rep(pi, 10)), pi))
stopifnot(all.equal(rms(numeric(0)), 0))

Write a function rms that fulfils these assertions.
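
For the record, here is one candidate implementation that passes all the above checks; note that returning 0 for an empty vector is merely our design decision, which the last assertion makes explicit (mean(numeric(0)) yields NaN):

rms <- function(x)
{
    stopifnot(is.numeric(x))
    if (length(x) == 0)
        return(0)  # a design decision; see the last assertion above
    sqrt(mean(x^2))
}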

Exercise 9.16

Implement your version of the sample function (assuming replace=TRUE), using calls to runif. Start by writing a few unit tests.

A couple of R packages support writing and executing unit tests, including testthat, tinytest, RUnit, or realtest. However, in the most typical use cases, relying on stopifnot is powerful enough.

Exercise 9.17

(*) Consult the Writing R Extensions manual [65] about where and how to include unit tests in your example package.

Note

(*) R can check a couple of code quality areas: running R CMD check pkg_directory from the command line (preferably using the most recent version of the environment) will suggest several improvements.

Also, it is possible to use various continuous integration techniques that are automatically triggered when pushing changes to our software repositories; see GitLab CI or GitHub Actions. For instance, a package build, install, and check process can be run on every git commit. Moreover, CRAN deploys continuous integration services, including checking the package on various platforms.

9.2.4.3. Debugging#

For all his life, the current author has been debugging his programs primarily by manually printing the state of the suspicious variables (printf and the like) in different code areas. This is old-school but uncannily efficient.

R has an interactive debugger; see the browser function and Section 9 of [69] for more details. Some IDEs (e.g., RStudio) also support this feature; see their corresponding documentation.

9.2.4.4. Profiling#

Typically, a program spends a relatively long time executing only a small portion of code. The Rprof function can be a helpful tool to identify which chunks might need a rewrite, for instance, using a compiled language (Chapter 14).

Please remember, though, that bottlenecks are not only formed by using algorithms with high computational complexity, but also data input and output (such as reading files from disk, printing messages on the console, querying Web APIs, etc.).
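
A basic profiling session might be structured as follows (a sketch; the output file name is arbitrary):

Rprof("profiler_data.out")  # start collecting profiling data
# ... code to be profiled ...
Rprof(NULL)                 # stop the profiler
summaryRprof("profiler_data.out")  # report the time spent in each function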

9.3. Special functions: Syntactic sugar#

Some functions, such as `*`, are somewhat special. They can be referred to using infix syntax which, for obvious reasons, most of us accepted as the default one. However, it turns out that “5 * 9” reduces to an ordinary function call:

`*`(5, 9)  # a call to `*` with two arguments, equivalent to 5 * 9
## [1] 45

9.3.1. Backticks#

In Section 2.2, we mentioned that via `<-` we can assign syntactically valid names to our objects. Most identifiers comprised of letters, digits, dots, and underscores can be used directly in R code.

Nevertheless, it is possible to label our objects however we like. Not syntactically valid (nonstandard) identifiers just need to be enclosed in backticks (back quotes, grave accents):

`42 a quite peculiar name :O` <- c(a=1, `b c`=2, `42`=3, `!`=4)
1/(1+exp(-`42 a quite peculiar name :O`))
##       a     b c      42       !
## 0.73106 0.88080 0.95257 0.98201

Such names are less convenient but backticks allow us to refer to them in any setting.

9.3.2. Dollar, `$` (*)#

The dollar operator, `$`, can be an alternative accessor to a single element in a named list[4]. If a label is a syntactically valid name, then x$label does the same job as x[["label"]] (saving five keystrokes: such a burden!).

x <- list(spam="a", eggs="b", `eggs and spam`="c", best.spam.ever="d")
x$eggs
## [1] "b"
x$best.spam.ever  # recall that a dot has no special meaning in most contexts
## [1] "d"

Nonstandard names must still be enclosed in backticks (or quotes):

x$`eggs and spam`  # x[["eggs and spam"]] is okay as usual
## [1] "c"

We are minimalist by design here. Hence, we will avoid this operator, for it does not increase the expressive power of our function repertoire. Also, it works neither on atomic vectors nor on matrices. Furthermore, it does not support names that are generated programmatically:

what <- "spam"
x$what  # the same as x[["what"]]; we do not want this
## NULL
x[[what]]  # works fine
## [1] "a"

The support for the partial matching of element names has been added to provide users working in interactive programming sessions with some relief in the case where they find typing the whole label daunting:

x$s
## Warning in x$s: partial match of 's' to 'spam'
## [1] "a"

Compare:

x[["s"]]  # no warning here...
## NULL
x[["s", exact=FALSE]]
## [1] "a"

Partial matching is generally a rubbishy programming practice. The result depends on the names of other items in x (which might change later) and can decrease code readability. The only reason we obtained a warning message at all is that this book enforces the options(warnPartialMatchDollar=TRUE) setting, which, sadly, is not the default.

Note the behaviour on an ambiguous partial match:

x$egg  # ambiguous resolution
## NULL

as well as on an element assignment:

x$s <- "e"
str(x)
## List of 5
##  $ spam          : chr "a"
##  $ eggs          : chr "b"
##  $ eggs and spam : chr "c"
##  $ best.spam.ever: chr "d"
##  $ s             : chr "e"

It did not modify spam but added a new element, s. Confusing? Let’s just not use the dollar operator, and we will have one less thing to worry about.

9.3.3. Curly braces, `{`#

A block of statements grouped with curly braces, `{`, corresponds to a function call. When we write:

{
    print(TRUE)
    cat("two")
    3
}
## [1] TRUE
## two
## [1] 3

The parser translates it to a call to:

`{`(print(TRUE), cat("two"), 3)
## [1] TRUE
## two
## [1] 3

When it is executed, every argument to `{` is evaluated one by one. Then, the last value is returned as the result of that call.

9.3.4. `if`#

if is a function too. As mentioned in Section 8.1, it returns the value corresponding to the expression that is evaluated conditionally. Hence, we may write:

if (runif(1) < 0.5) "head" else "tail"
## [1] "head"

but also:

`if`(runif(1) < 0.5, "head", "tail")
## [1] "head"

Note

A call like `if`(test, what_if_true, what_if_false) can only work correctly because of the lazy evaluation of function arguments; see Chapter 17.

On a side note, while, for, and repeat can also be called this way; however, they return invisible(NULL).

9.3.5. Operators are functions#

9.3.5.1. Calling built-in operators as functions#

Every arithmetic, logical, and relational operator is translated to a call to the corresponding function. For instance:

`<`(`+`(2, `*`(`-`(3), 4)), 5)  # 2+(-3)*4 < 5
## [1] TRUE

Also, x[i] is equivalent to `[`(x, i) and x[[i]] maps to `[[`(x, i).

Knowing this will not only enable us to manipulate unevaluated R code (Chapter 15) or access the corresponding manual pages (see, e.g., help("[")), but also verbalise certain operations more concisely. For instance:

x <- list(1:5, 11:17, 21:23)
unlist(Map(`[`, x, 1))  # 1 is a further argument passed to `[`
## [1]  1 11 21

is equivalent to a call to Map(function(e) e[1], x).

Note

Unsurprisingly, the assignment operator, `<-`, is also a function. It returns the assigned value invisibly.

`<-` binds right to left (compare help("Syntax")). Thus, the expression “a <- b <- 1” assigns 1 to both b and a. It is equivalent to `<-`("a", `<-`("b", 1)) and `<-`("b", 1) returns 1.

Owing to the pass-by-value-like semantics (Section 9.4.1), we can also expect that we will be assigning a copy of the value on the right side of the operator (with the exception of environments; Chapter 16).

x <- 1:6
y <- x  # makes a copy (but delayed, on demand, for performance reasons)
y[c(TRUE, FALSE)] <- NA_real_  # modify every second element
print(y)
## [1] NA  2 NA  4 NA  6
print(x)  # state of x has not changed; x and y are different objects
## [1] 1 2 3 4 5 6

This is especially worth pointing out to Python (amongst others) programmers, where the preceding assignment would mean that x and y both refer to the same (shared) object in the computer’s memory.

However, with no harm done to semantics, copying x is postponed until absolutely necessary (Section 16.1.4). This is efficient both time- and memory-wisely.
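
In fact, we can observe the moment when the actual copying happens by calling tracemem (provided that R was built with memory profiling enabled, which is typically the case for the official binary distributions):

x <- runif(1e6)
tracemem(x)  # ask R to report whenever the traced object is duplicated
y <- x       # no copy is made here yet
y[1] <- 0    # only now are the data duplicated (a tracemem message appears)
untracemem(x)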

9.3.5.2. Defining binary operators#

We can also introduce custom binary operators named like `%myopname%`:

`%:)%` <- function(e1, e2) (e1+e2)/2
5 %:)% 1:10
##  [1] 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5

Recall that `%%`, `%/%`, and `%in%` are built-in operators denoting division remainder, integer division, and testing for set inclusion. Also, in Chapter 11, we will learn about `%*%`, which implements matrix multiplication.

Note

Chapter 10 notes that most existing operators can be overloaded for objects of custom types.

9.3.6. Replacement functions#

Functions generally do not change the state of their arguments. However, there is some syntactic sugar that permits us to replace objects or their parts with new content. We call them replacement functions.

For instance, three of the following calls replace the input x with its modified version:

x <- 1:5  # example input
x[3] <- 0  # replace the third element with 0
length(x) <- 7  # "replace" length
names(x) <- LETTERS[seq_along(x)]  # replace the names attribute
print(x)  # `x` is now different
##  A  B  C  D  E  F  G
##  1  2  0  4  5 NA NA

9.3.6.1. Creating replacement functions#

A replacement function is a mapping named like `f<-` with at least two parameters:

  • x (the object to be modified),

  • ... (possible further arguments),

  • value (as the last parameter; the object on the right-hand side of the `<-` operator).

We will most often interact with existing replacement functions rather than create our own. But knowing how to do the latter is vital to understanding this language feature. For example:

`add<-` <- function(x, where=TRUE, value)
{
    x[where] <- x[where] + value
    x  # the modified object that will replace the original one
}

This function aims to add a value to a subset of the input vector x (by default, to each element therein). Then, it returns its altered version.

y <- 1:5           # example vector
add(y) <- 10       # calls y <- `add<-`(y, value=10)
print(y)           # y has changed
## [1] 11 12 13 14 15
add(y, 3) <- 1000  # calls y <- `add<-`(y, 3, value=1000)
print(y)           # y has changed again
## [1]   11   12 1013   14   15

Thus, invoking “add(y, w) <- v” is equivalent to “y <- `add<-`(y, w, value=v)”.

Note

(*) According to [69], a call “add(y, 3) <- 1000” is syntactic sugar precisely for:

`*tmp*` <- y  # temporary substitution
y <- `add<-`(`*tmp*`, 3, value=1000)
rm("*tmp*")  # remove the named object from the current scope

This has at least two implications. First, in the unlikely event that a variable `*tmp*` existed before the call to the replacement function, it will be no more, it will cease to be. It will be an ex-variable. Second, the temporary substitution guarantees that y must exist before the call (due to lazy evaluation, a function’s body does not have to refer to all the arguments passed).

9.3.6.2. Substituting parts of vectors#

The replacement versions of the index-like operators are named as follows:

  • `[<-` is used in substitutions like “x[i] <- value”,

  • `[[<-` is called when we perform “x[[i]] <- value”,

  • `$<-` is used whilst calling “x$i <- value”.

x <- 1:5
`[<-`(x, c(3, 5), NA_real_)  # returns a new object
## [1]  1  2 NA  4 NA
print(x)  # does not change the original input
## [1] 1 2 3 4 5

Exercise 9.18

Write a function `extend<-`, which pushes new elements at the end of a given vector, modifying it in place.

`extend<-` <- function(x, value) ...to.do...

Example use:

x <- 1
extend(x) <- 2     # push 2 at the back
extend(x) <- 3:10  # add 3, 4, ..., 10
print(x)
##  [1]  1  2  3  4  5  6  7  8  9 10

9.3.6.3. Replacing attributes#

There are many replacement functions to reset object attributes (Section 4.4). In particular, each special attribute has its replacement procedure, e.g., `names<-`, `class<-`, `dim<-`, `levels<-`, etc.

x <- 1:3
names(x) <- c("a", "b", "c")  # change the `names` attribute
print(x)  # x has been altered
## a b c
## 1 2 3

Individual (arbitrary, including non-special ones) attributes can be set using `attr<-`, and all of them can be established via a single call to `attributes<-`.

x <- "spam"
attributes(x) <- list(shape="oval", smell="meaty")
attributes(x) <- c(attributes(x), taste="umami")
attr(x, "colour") <- "rose"
print(x)
## [1] "spam"
## attr(,"shape")
## [1] "oval"
## attr(,"smell")
## [1] "meaty"
## attr(,"taste")
## [1] "umami"
## attr(,"colour")
## [1] "rose"

Also, setting an attribute to NULL results, by convention, in its removal:

attr(x, "taste") <- NULL  # it is tasteless now
print(x)
## [1] "spam"
## attr(,"shape")
## [1] "oval"
## attr(,"smell")
## [1] "meaty"
## attr(,"colour")
## [1] "rose"
attributes(x) <- NULL  # remove all
print(x)
## [1] "spam"

This can be worthwhile in contexts such as:

x <- structure(c(a=1, b=2, c=3), some_attrib="value")
y <- `attributes<-`(x, NULL)

y is a version of x with metadata removed. The latter remains unchanged.

9.3.6.4. Compositions of replacement functions (*)#

Updating only selected names like:

x <- c(a=1, b=2, c=3)
names(x)[2] <- "spam"
print(x)
##    a spam    c
##    1    2    3

is possible due to the fact that “names(x)[i] <- v” is equivalent to:

old_names <- names(x)
new_names <- `[<-`(old_names, i, value=v)
x <- `names<-`(x, value=new_names)

Important

More generally, a composition of replacement calls “g(f(x, a), b) <- y” yields a result equivalent to “x <- `f<-`(x, a, value=`g<-`(f(x, a), b, value=y))”. Both f and `f<-` need to be defined, but having g is not necessary.

Exercise 9.19

(*) What is “h(g(f(x, a), b), c) <- y” equivalent to?

Exercise 9.20

Write a (convenient!) function `recode<-` which replaces specific elements in a character vector with other ones, allowing the following interface:

`recode<-` <- function(x, value) ...to.do...
x <- c("spam", "bacon", "eggs", "spam", "eggs")
recode(x) <- c(eggs="best spam", bacon="yummy spam")
print(x)
## [1] "spam"       "yummy spam" "best spam"  "spam"       "best spam"

We see that the named character vector gives a few from="to" pairs, e.g., all eggs are to be replaced by best spam. Determine which calls are equivalent to the following:

x <- c(a=1, b=2, c=3)
recode(names(x)) <- c(c="z", b="y")  # or equivalently = ... ?
print(x)
## a y z
## 1 2 3
y <- list(c("spam", "bacon", "spam"), c("spam", "eggs", "cauliflower"))
recode(y[[2]]) <- c(cauliflower="broccoli")  # or = ... ?
print(y)
## [[1]]
## [1] "spam"  "bacon" "spam"
##
## [[2]]
## [1] "spam"     "eggs"     "broccoli"

Exercise 9.21

(*) Consider the `recode<-` function from the previous exercise.

Here is an example matrix with the dimnames attribute whose names attribute is set (more details in Chapter 11):

(x <- Titanic["Crew", , "Adult", ])
##         Survived
## Sex       No Yes
##   Male   670 192
##   Female   3  20
recode(names(dimnames(x))) <- c(Sex="sex", Survived="survived")
print(x)
##         survived
## sex       No Yes
##   Male   670 192
##   Female   3  20

This changes the x object. For each of the following subtasks, compose a single call that, without modifying x in place, returns a recoded copy of:

  • names(dimnames(x)),

  • dimnames(x),

  • x.

Exercise 9.22

(*) Consider the `recode<-` function again but now let an example object be a data frame with a column of the factor class:

x <- iris[c(1, 2, 51, 101), ]
recode(levels(x[["Species"]])) <- c(
    setosa="SET", versicolor="VER", virginica="VIR"
)
print(x)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1            5.1         3.5          1.4         0.2     SET
## 2            4.9         3.0          1.4         0.2     SET
## 51           7.0         3.2          4.7         1.4     VER
## 101          6.3         3.3          6.0         2.5     VIR

How to change levels(x[["Species"]]) and return an altered copy of:

  • levels(x[["Species"]]),

  • x[["Species"]],

  • x

without modifying x in place?

9.4. Arguments and local variables#

9.4.1. Call by “value”#

As a general rule, functions cannot change the state of their arguments[5]. We can think of them as being passed by value, i.e., as if their copy was made.

test_change <- function(y)
{
    y[1] <- 7
    y
}

x <- 1:5
test_change(x)
## [1] 7 2 3 4 5
print(x)  # same
## [1] 1 2 3 4 5

If the preceding statement were not true, the state of x would have changed after the call.

9.4.2. Variable scope#

Function arguments and any other variables we create inside a function’s body are relative to each call to that function.

test_change <- function(x)
{
    x <- x+1
    z <- -x
    z
}

x <- 1:5
test_change(x*10)
## [1] -11 -21 -31 -41 -51
print(x)  # x in the function's body was a different x
## [1] 1 2 3 4 5
print(z)  # z was local
## Error in eval(expr, envir, enclos): object 'z' not found

Both x and z are local variables. They only live whilst our function is being executed. The former temporarily masks[6] the object of the same name from the caller’s context.

Important

It is a good development practice to refrain from referring to objects not created within the current function, especially to “global” variables. We can always pass an object as an argument explicitly.

Note

It is a function call as such, not curly braces per se, that forms a local scope. When we run “x <- { y <- 1; y + 1 }”, y is not a temporary variable. It is an ordinary named object created alongside x.

On the other hand, in “x <- (function() { z <- 1; z + 1 })()”, z will not be available thereafter.

9.4.3. Closures (*)#

Most user-defined functions are, in fact, instances of the so-called closures; see Section 16.3.2 and [1]. They not only consist of an R expression to evaluate but also can carry auxiliary data.

For instance, given two numeric vectors x and y of the same length, a call to approxfun(x, y) returns a function that linearly interpolates between the consecutive points \((x_1, y_1)\), \((x_2, y_2)\), etc., so that a corresponding \(y\) can be determined for any \(x\).

x <- seq(0, 1, length.out=11)
f1 <- approxfun(x, x^2)
f2 <- approxfun(x, x^3)
f1(0.75)  # check that it is close to the true 0.75^2
## [1] 0.565
f2(0.75)  # compare with 0.75^3
## [1] 0.4275

Let’s inspect the source code of the above functions:

print(f1)
## function (v)
## .approxfun(x, y, v, method, yleft, yright, f, na.rm)
## <environment: 0x55cb82dc7748>
print(f2)
## function (v)
## .approxfun(x, y, v, method, yleft, yright, f, na.rm)
## <environment: 0x55cb835701d0>

We might wonder how they can produce different results: their source codes are evidently identical. It turns out, however, that they internally store additional data that are referred to when they are called:

environment(f1)[["y"]]
##  [1] 0.00 0.01 0.04 0.09 0.16 0.25 0.36 0.49 0.64 0.81 1.00
environment(f2)[["y"]]
##  [1] 0.000 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1.000
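
We can create closures ourselves too. For example, here is a hypothetical function factory whose products differ only by the data stored in their enclosing environments:

make_power <- function(p)
    function(x) x^p  # `p` is retained in the new function's environment

square <- make_power(2)
cube <- make_power(3)
square(4)
## [1] 16
cube(4)
## [1] 64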

We will explore these concepts in detail in the third part of this book.

9.4.4. Default arguments#

We often need to find a sweet spot between being generous, mindful of the diverse needs of our users, and making the API neither overwhelming nor oversimplistic. We have established that it is best if a function performs a single, well-specified task. However, we are always delighted when it also lets us tweak its behaviour should we wish to do so. The use of default arguments can facilitate this principle.

For instance, log computes logarithms, by default, the natural ones.

log(2.718)  # the same as log(2.718, base=exp(1)), i.e., default base, e
## [1] 0.9999
log(4, base=2)  # different base
## [1] 2

Exercise 9.23

Study the documentation of the following functions and note the default values they define: round, hist, grep, and download.file.

Let’s create a function equipped with such recommended settings:

test_default <- function(x=1) x

test_default()   # use default
## [1] 1
test_default(2)  # use something else
## [1] 2

Most often, default arguments are just constants, e.g., 1. Generally, though, they can be any R expressions, also ones that include a reference to other arguments passed to the same function; see Section 17.2.
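
For example (a contrived illustration of the latter):

test_default2 <- function(x=1, y=x+1) c(x, y)  # `y`'s default depends on `x`

test_default2()       # x=1, y=2
## [1] 1 2
test_default2(10)     # x=10, y=11
## [1] 10 11
test_default2(10, 1)  # both given explicitly
## [1] 10  1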

Default arguments usually appear at the end of the parameter list, but see Section 9.3.6 (on replacement functions) for a well-justified exception.

9.4.5. Lazy vs eager evaluation#

In some languages, function arguments are always evaluated prior to a call. In R, though, they are only computed when actually needed. We call it lazy or delayed evaluation. Recall that in Section 8.1.4, we introduced the short-circuit evaluation operators `||` (or) and `&&` (and). They can do their job precisely thanks to this mechanism.

Example 9.24

In the following example, we do not use the function’s argument at all:

lazy_test1 <- function(x) 1  # x is not used

lazy_test1({cat("and now for something completely different!"); 7})
## [1] 1

Otherwise, we would see a message being printed out on the console.

Example 9.25

Next, let’s use x amidst other expressions in a function’s body:

lazy_test2 <- function(x)
{
    cat("it's... ")
    y <- x+x  # using x twice
    cat(" a man with two noses")
    y
}

lazy_test2({cat("and now for something completely different!"); 7})
## it's... and now for something completely different! a man with two noses
## [1] 14

An argument is evaluated once, and its value is stored for further reference. If that were not the case, we would see the “and now...” message printed twice.

We will elaborate on this in Chapter 17.

9.4.6. Ellipsis, `...`#

We will start with an exercise.

Exercise 9.26

Notice the presence of `...` in the parameter list of c, list, structure, cbind, rbind, cat, Map (and the underlying mapply), lapply (a specialised version of Map), optimise, optim, uniroot, integrate, outer, aggregate. What purpose does it serve, according to these functions’ documentation pages?

We can create a variadic function by including `...` (dot-dot-dot, ellipsis; see help("dots")) somewhere in its parameter list. The ellipsis serves as a placeholder for all objects passed to the function but not matched by any formal (named) parameters.

The easiest way to process arguments passed via `...` programmatically (see also Section 17.3) is by redirecting them to list.

test_dots <- function(...)
    list(...)

test_dots(1, a=2)
## [[1]]
## [1] 1
##
## $a
## [1] 2

Such a list can be processed just like… any other generic vector. What we can do with these arguments is only limited by our creativity (in particular, recall from Section 7.2.2 the very powerful do.call function). There are two primary use cases of the ellipsis[7]:

  • create a new object by combining an arbitrary number of other objects:

    c(1, 2, 3)   # three arguments
    ## [1] 1 2 3
    c(1:5, 6:7)  # two arguments
    ## [1] 1 2 3 4 5 6 7
    structure("spam")  # no additional arguments
    ## [1] "spam"
    structure("spam", color="rose", taste="umami")  # two further arguments
    ## [1] "spam"
    ## attr(,"color")
    ## [1] "rose"
    ## attr(,"taste")
    ## [1] "umami"
    cbind(1:2, 3:4)  # two
    ##      [,1] [,2]
    ## [1,]    1    3
    ## [2,]    2    4
    cbind(1:2, 3:4, 5:6, 7:8)  # four
    ##      [,1] [,2] [,3] [,4]
    ## [1,]    1    3    5    7
    ## [2,]    2    4    6    8
    sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 42)  # twelve
    ## [1] 108
    
  • pass further arguments (as-is) to other methods:

    lapply(list(c(1, NA, 3), 4:9), mean, na.rm=TRUE)  # mean(x, na.rm=TRUE)
    ## [[1]]
    ## [1] 2
    ##
    ## [[2]]
    ## [1] 6.5
    integrate(dbeta, 0, 1,
        shape1=2.5, shape2=0.5)  # dbeta(x, shape1=2.5, shape2=0.5)
    ## 1 with absolute error < 1.2e-05
    
Example 9.27

The documentation of lapply states that this function is defined like lapply(X, FUN, ...). Here, the ellipsis is a placeholder for a number of optional arguments that can be passed to FUN. Hence, if we denote the \(i\)-th element of a vector X by X[[i]], calling lapply(X, FUN, ...) will return a list whose \(i\)-th element will be equal to FUN(X[[i]], ...).

Exercise 9.28

Using a single call to lapply, generate a list with three numeric vectors of lengths 3, 9, and 7, respectively, drawn from the uniform distribution on the unit interval. Then, upgrade your code to get numbers sampled from the interval \([-1, 1]\).

Example 9.29

Chapter 4 mentioned that concatenating a mix of lists and atomic vectors with c, unfortunately, unrolls the latter:

str(c(u=list(1:2), v=list(a=3:4, b=5:6), w=7:8))
## List of 5
##  $ u  : int [1:2] 1 2
##  $ v.a: int [1:2] 3 4
##  $ v.b: int [1:2] 5 6
##  $ w1 : int 7
##  $ w2 : int 8

Let’s implement a fix:

as.list2 <- function(x) if (is.list(x)) x else list(x)
clist <- function(...) do.call(c, lapply(list(...), as.list2))
str(clist(u=list(1:2), v=list(a=3:4, b=5:6), w=7:8))
## List of 4
##  $ u  : int [1:2] 1 2
##  $ v.a: int [1:2] 3 4
##  $ v.b: int [1:2] 5 6
##  $ w  : int [1:2] 7 8

9.4.7. Metaprogramming (*)#

We can access expressions passed as a function’s arguments without evaluating them. In particular, a call to the composition of deparse and substitute converts them to a character vector.

test_deparse_substitute <- function(x)
    deparse(substitute(x))  # does not evaluate whatever is behind `x`

test_deparse_substitute(testing+1+2+3)
## [1] "testing + 1 + 2 + 3"
test_deparse_substitute(spam & spam^2 & bacon | grilled(spam))
## [1] "spam & spam^2 & bacon | grilled(spam)"

Exercise 9.30

Check out the y-axis label generated by plot.default((1:100)^2). Inspect its source code. Notice a call to the two aforementioned functions.

Similarly, call shapiro.test(log(rlnorm(100))) and take note of the “data:” field.

A function is free to do with such an expression whatever it likes. For instance, it can modify the expression and then evaluate it in a very different context. Such a language feature allows certain operations to be expressed much more compactly. In theory, it is a potent tool. Alas, it is easy to find many practical examples where it was over/misused and made learning or using R confusing.

Example 9.31

(*) In Section 12.3.9 and Section 17.5, we explain that subset and transform use metaprogramming techniques to specify basic data frame transformations. For instance:

transform(
    subset(
        iris,
        Sepal.Length>=7.7 & Sepal.Width >= 3.0,  # huh?
        select=c(Species, Sepal.Length:Sepal.Width)  # le what?
    ),
    Sepal.Length.mm=Sepal.Length/10  # pardon my French, but pardon?
)
##       Species Sepal.Length Sepal.Width Sepal.Length.mm
## 118 virginica          7.7         3.8            0.77
## 132 virginica          7.9         3.8            0.79
## 136 virginica          7.7         3.0            0.77

None of the arguments (except iris) makes sense outside of the function’s call. In particular, neither Sepal.Length nor Sepal.Width exists as a standalone variable.

The two functions took the liberty to interpret the arguments passed how they felt. They created their own virtual reality within our well-defined world. The reader must refer to their documentation to discover the meaning of such special syntax.

Note

(*) Some functions have rather bizarre default arguments. For instance, in the manual page of prop.test, we read that the alternative parameter defaults to c("two.sided", "less", "greater"). However, if a user does not set this argument explicitly, alternative="two.sided" (the first element in the above vector), will actually be assumed.

If we call print(prop.test), we will find the code line responsible for this odd behaviour: “alternative <- match.arg(alternative)”. Consider the following example:

test_match_arg <- function(x=c("a", "b", "c")) match.arg(x)

test_match_arg()  # missing argument; choose first
## [1] "a"
test_match_arg("c")  # one of the predefined options
## [1] "c"
test_match_arg("d")  # unexpected setting
## Error in match.arg(x): 'arg' should be one of "a", "b", "c"

In the current context, match.arg only allows an actual parameter from a given set of choices. However, if the argument is missing, it selects the first option.

Unfortunately, we have to learn this behaviour by heart, because the above source code is far from self-explanatory. If such an expression were normally evaluated, we would use either the default argument or whatever the user passed as x (but then the function would not know the range of possible choices). A call to match.arg(x, c("a", "b", "c")) could guarantee the desired functionality and would be much more readable. Instead, metaprogramming techniques enabled match.arg to access the enclosing function’s default argument list without explicitly referring to it.

One may ask: why is it so? The only sensible answer to this will be “because its programmer decided it must be this way”. Let’s contemplate this for a while. In cases like these, we are not dealing with some base R language design choice that we might like or dislike, but which we should just accept as an inherent feature. Instead, we are struggling intellectually because of some programmers’ (mis)use (in good faith…) of R’s flexibility itself. They have introduced a slang/dialect on top of our mother tongue, whose meaning is valid only within this function. Blame the middleman, not the environment, please.

This is why we generally advocate for avoiding metaprogramming-based techniques wherever possible. We shall elaborate on this topic in the third part of this book.

9.5. Principles of sustainable design (*)#

Fine design is more art than science. As usual in real life, we will need to make many compromises. This is because improving things with regard to one criterion sometimes makes them worse with respect to other aspects[8] (also those that we are not aware of). Moreover, not everything that counts can or will be counted.

We do not want to be considered heedless enablers who say that if anything is possible, it should be done. Therefore, below we serve some food for thought. However, as there is no accounting for taste, the kind readers might as well decide to skip this spicy meal.

9.5.1. To write or abstain#

Our functions can often be considered merely creative combinations of the building blocks available in base R or a few high-quality add-on packages. Some are simpler than others. Thus, there is a question of whether a new operation should be introduced at all, or whether we would be multiplying entities without necessity.

On the one hand, the DRY (don’t repeat yourself) principle tells us that the most frequently used code chunks (say, called at least thrice) should be generalised in the form of a new function. As far as complex operations are concerned, this is definitely a correct approach.

On the other hand, not every generalisation is necessarily welcome. Let’s say we are tired of writing g(f(x)) for the \(n\)-th time, \(n\ge 2\). Why not introduce h defined as a combination of g and f? This might seem like a clever idea, but let’s not take it for granted. Being tired might be an indication that we need a rest. Being lazy can be a call for more self-discipline (not an overly popular word these days, but still, an endearing trait).

Example 9.32

paste0 is a specialised version of paste, but has the sep argument hardcoded to an empty string.

  • Even if this might be the most often applied use case, is the introduction of a new function justifiable? Is it so hard to write sep="" each time?

  • Would changing paste’s default argument be better? That, of course, would harm backward compatibility, but what strategies could we apply to make the transition as smooth as possible?

  • What about introducing a new version of paste with sep defaulting to "", and informing the users that the old version is deprecated and will be removed in, say, two years? (or maybe one month is preferable? or five?)

Example 9.33

R 4.0 defined a new function called deparse1. It is nothing but a combination of deparse and paste:

print(deparse1)
## function (expr, collapse = " ", width.cutoff = 500L, ...)
## paste(deparse(expr, width.cutoff, ...), collapse = collapse)
## <environment: namespace:base>

Let’s say this covers 90% of use cases: was introducing it a justified idea then? What if that number was 99%? Might it lead to new users’ not knowing that the more primitive operations are available?

Overall, more functions contribute to information overload. We do not want our users to be overwhelmed by unreasonably many choices. Luckily, nothing is cemented once and for all. Had we made bad design choices resulting in our API’s being bloated, we could always cancel those that no longer spark joy.

9.5.2. To pamper or challenge#

We should think about the kind of audience we would like to serve: is it our team only, students, professionals, certain client groups, etc.? Do they have mathematical, programming, engineering, or scientific background?

Not everything appropriate for one cohort will be valuable for another.

Not everything pleasing some now will benefit them in the long run: people (their skills, attitudes, etc.) change.

Example 9.34

Assume we are writing a friendly package for novices who would like to grasp the rudiments of data analysis as quickly as possible. Without much effort, it could enable them to solve 80–95% of the most common, easy problems.

Think of introducing the students to a function that returns the five largest observations in a given vector. Let’s call it nlargest. So pleasant. It makes the students feel empowered and improves their retention[9].

However, when faced with the remaining 5–20% of tasks, they will have to learn another, more advanced, generic, and capable tool anyway (in our case, the base R itself). Are they determined and skilled enough to do that? Some might, unfortunately, say: “it is not my problem, I made sure everyone was happy at that time”. Due to this shortsightedness, it is our problem now.

Recall that it took us some time to arrive at order and subsetting via `[`. Assuming that we read this book from the beginning to the end and solve all the exercises, which we should, we are now able to author the said nlargest (and lots of other functions) ourselves, using a single line of code. This will also pay off in many scenarios that we will be facing in the future, e.g., when we consider matrices and data frames.

Yes, everyone will be reinventing their own nlargest this way. But this constitutes a great exercise: by our being immoderately nice (spoonfeeding), some might have lost an opportunity to learn a new, more universal skill.

Although most users would love to minimise the effort put into all their activities, ultimately, they sometimes need to learn new things. Let’s thus not be afraid to teach them stuff.

Furthermore, we do not want to discourage experts (or experts to-be) by presenting them with overly simplified solutions that keep their hands tied when something more ambitious needs to be done.

9.5.3. To build or reuse#

The fail-fast philosophy encourages us to build applications using prefabricated components. This is fantastic at the early stage of their life cycles. Nonetheless, if we construct something uncomplicated or whose only purpose is to illustrate an idea, educate, or show off, let’s be explicit about it so that other users do not feel obliged to treat our product (exercise) seriously.

In the (not so likely, probabilistically speaking) event of its becoming successful, we are expected to start thinking about the project’s long-term stability and sustainability. After all, relying on third-party functions, packages, or programs makes our software projects less… independent. This may be problematic because:

  • the dependencies might not be available on every platform or may behave differently across various system configurations,

  • they may be huge (and can depend on other external software too),

  • their APIs may be altered, which can cause our code to break,

  • their functionality can change, which can lead to unexpected behaviour.

Hence, it might be better to rewrite some parts from scratch on our own.

Exercise 9.35

Identify a few R packages on CRAN with many dependencies. See what functions they import from other packages. How often do they only borrow a few lines of code?

The UNIX philosophy emphasises building and using minimalist yet nontrivial, single-purpose, high-quality pieces of software that can work as parts of more complex pipelines. R serves as a glue language very well.

In the long run, our software project might converge to such a tool. Thus, we might have to standardise its API (e.g., make it available from the command line; Section 1.2) so that the users of other languages can benefit from our work.

Important

If our project is merely a modified interface/front-end to a standalone program developed by others, we should be humble about it. We should strive to ensure we are not the ones who get all the credit for other people’s work. Also, we must clearly state how the original tools can be used to achieve the same goals, e.g., when working from the command line. In other words, let’s not be selfish jerks.

9.5.4. To revolt or evolve#

The wise, gradual improving of things is generally welcome. It gives everyone time to adjust.

Some projects, however, are governed in a compulsive way, reinforced by neurotic thinking that “stakeholders need to be kept engaged or we’re going to lose popularity”. It is not a sustainable strategy. Less is better, even though slightly more challenging. Put good engineering first.

It might even happen that we realise that “everything so far was wrong and we need a global reset”. But if we become very successful, we will cause a divide in the community. Especially when we decide to duplicate the existing, base functionality, we should note that some users will be introduced to the system through the supplementary interface and they will not be familiar with the classic one. Others will have to learn the added syntax to be able to communicate with the former group. This gives rise to a whole new set of issues (how to make all the functions interoperable with each other seamlessly, etc.). Such moves are sometimes necessary, but let’s not treat them lightly; it is a great responsibility.

9.6. Exercises#

Exercise 9.36

Answer the following questions.

  • Will stopifnot(1) stop? What about stopifnot(NA), stopifnot(TRUE, FALSE), and stopifnot(c(TRUE, TRUE, NA))?

  • What does the `if` function return?

  • Does `attributes<-`(x, NULL) modify x?

  • When can we be interested in calling `[` and `[<-` as functions (and not as operators) directly?

  • How to define a new binary operator? Can it be equipped with default arguments?

  • What are the main use cases of the ellipsis?

  • What is wrong with transform, subset, and match.arg?

  • When is a call like f(-1, do_something_that_takes_a_million_years()) not necessarily a regrettable action?

  • What is the difference between “names(x)[1] <- new_name” and “names(x[1]) <- new_name”?

  • What might be the form of x if it is legit to call it like x[[c(1, 2)]]()()()[[1]]()()?

Exercise 9.37

Consider a function:

f <- function(x)
    for (e in x)
        print(e)

What is the return value of a call to f(list(1, 2, 3))? Is it NULL, invisible(NULL), x[[length(x)]], or invisible(x[[length(x)]])? Does it change relative to whether x is empty or not?

Exercise 9.38

The split function also has its replacement version. Study its documentation to learn how it works.

Exercise 9.39

A call to ls(envir=baseenv()) returns all objects defined in the base package (see Chapter 16). List the names corresponding to replacement functions.

Important

Apply the principle of test-driven development when solving the remaining exercises.

Exercise 9.40

Implement your version of the Position and Find functions. Evaluation should stop as soon as the first element fulfilling a given predicate has been found.

Exercise 9.41

Implement your version of the Reduce function.

Exercise 9.42

Write a function slide(f, x, k, ...) which returns a list y with length(x)-k+1 elements such that y[[i]] = f(x[i:(i+k-1)], ...).

unlist(slide(sum, 1:5, 1))
## [1] 1 2 3 4 5
unlist(slide(sum, 1:5, 3))
## [1]  6  9 12
unlist(slide(sum, 1:5, 5))
## [1] 15

Exercise 9.43

Using slide defined above, write another function that counts how many increasing pairs of numbers are in a given numeric vector. For instance, in (0, 2, 1, 1, 0, 1, 6, 0), there are three such pairs: (0, 2), (0, 1), (1, 6).

Exercise 9.44

(*) Write your version of tools::package_dependencies with reverse=TRUE based on information extracted by calling utils::available.packages.

Exercise 9.45

(**) Write a standalone program which can be run from the system shell and which computes the total size of all the files in directories given as the script’s arguments (via commandArgs).