9. Designing functions#
The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug/typo reports and fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [26].
In Chapter 7, we learnt how to compose our own functions. This skill is vital to enforcing the good development practice of avoiding code repetition: running the same command sequence on different data.
This chapter is devoted to designing such reusable modules to make them easier to use, test, and maintain. We also provide some more technical details. They were not of the highest importance during our first exposure to this topic but are crucial to our better understanding of how R works.
9.1. Principles of sustainable design#
Fine design is more art than science. As usual in real life, we will need to make many compromises. This is because improving things with regard to one criterion sometimes makes them worse with respect to other aspects[1] (also those that we are not aware of). Moreover, not everything that counts can or will be counted.
Below we provide some observations, ideas, and food for thought.
9.1.1. To write or abstain#
Our functions can often be considered merely creative combinations of the building blocks available in base R or a few high-quality add-on packages[2]. Some are simpler than others. Thus, there is a question if a new operation should be introduced at all: whether we are faced with the case of multiplying entities without necessity.
On the one hand, the DRY (don’t repeat yourself) principle tells us that the most frequently used code chunks (say, called at least thrice) should be generalised in the form of a new function. As far as complex operations are concerned, this is definitely a correct approach.
On the other hand, not every generalisation is necessarily welcome.
Let us say we are tired of writing g(f(x)) for the \(n\)-th time, \(n\ge 2\). Why don’t we introduce h defined as a combination of g and f? This might seem like a clever idea, but let us not take it for granted: being tired might be an indication of our body and mind needing a rest; being lazy can be a call for more self-discipline (not an overly popular word these days, but still, an endearing trait).
paste0 is a specialised version of paste, but has the sep argument hardcoded to an empty string. Even if this might be the most often applied use case, is the introduction of a new function justifiable? Is it so hard to write “sep=""” each time? Would changing paste’s default argument be better? That, of course, would harm backward compatibility, but what strategies could we apply to make the transition as smooth as possible? Would it be better to introduce a new version of paste with sep defaulting to "", and inform the users that the old version is deprecated and to be removed in, say, two years? (Or maybe one year is better? Or five?)
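For context, compare the three calls below (outputs as produced by base R’s paste and paste0):

```r
paste("spam", "ham")          # sep=" " is the default
## [1] "spam ham"
paste("spam", "ham", sep="")  # overriding the separator each time
## [1] "spamham"
paste0("spam", "ham")         # the specialised shorthand
## [1] "spamham"
```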
In R 4.0, deparse1 has been introduced: it is merely a combination of deparse (see below) and paste:
print(deparse1)
## function (expr, collapse = " ", width.cutoff = 500L, ...)
## paste(deparse(expr, width.cutoff, ...), collapse = collapse)
## <environment: namespace:base>
Let us say this covers 90% of use cases: was introducing it a justified idea then? What if that number was 99%?
Overall, more functions contribute to information overload. We do not want our users to be overwhelmed by too many choices. Luckily, nothing is cemented once and for all. Had we made bad design choices resulting in our API’s being bloated, we could always clean up those that no longer spark joy.
9.1.2. To pamper or challenge#
Think about the kind of audience we would like to serve: is it our team only, students, professionals, certain client groups, etc.? Do they have mathematical, programming, engineering, or scientific background? Not everything appropriate for one cohort will be valuable for another. And not everything pleasing some now will benefit them in the long run: people (their skills, attitudes, etc.) change.
Assume we are writing a friendly and inclusive package for novices who would like to grasp the basics of data analysis as quickly[3] as possible. Without much effort, it would enable them to solve 80–95% of the most common, easy problems.
Think of introducing the students to a function that returns the five largest observations in a given vector. Let us call it nlargest: so pleasant to use. It makes the students feel empowered quickly.
Still, when faced with the remaining 5–20% of tasks, they will have to learn another, more advanced, generic, and powerful tool anyway (in our case, the base R itself). Are they determined and skilled enough to do that? Time will tell. The least we can do is to be explicit about it.
Recall that it took us some time to arrive at order and subsetting via `[`. Assuming that we read this book from the beginning to the end and solve all the exercises, which we should, we are now able to author the said nlargest (and lots of other functions) ourselves, using a single line of code. This will also pay off in many scenarios that we will be facing in the future, e.g., when we consider matrices and data frames.
Yes, everyone will be reinventing their own nlargest this way. But this constitutes a great exercise: by our being too nice, some might have lost an opportunity to learn a new, more universal skill.
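For the record, here is one possible single-line formulation (a sketch only; it assumes x is an atomic vector with no missing values and at least n elements):

```r
nlargest <- function(x, n=5)
    x[order(x, decreasing=TRUE)[seq_len(n)]]  # pick the n top-ranked elements

nlargest(c(10, 40, 20, 50, 30, 60), 3)
## [1] 60 50 40
```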
Although most users would love to minimise the effort put into all their activities, ultimately, they sometimes need to learn new things. Let us thus not be afraid to teach them stuff.
Furthermore, we do not want to discourage experts (or experts to-be) by presenting them with overly simplified solutions that keep their hands tied when something more ambitious needs to be done.
9.1.3. To build or reuse#
In the short term, the fail-fast philosophy encourages us to build our applications using prefabricated components. This is fantastic at the early stage of its life cycle. If we build something really simple or whose purpose is merely to illustrate an idea, educate, or show off how “awesome” we are, let us be explicit about it so that other users do not feel obliged to treat our product (exercise) seriously.
In the (not so likely, probabilistically speaking) event of its becoming successful, we are expected to start thinking about the project’s long-term stability and sustainability. After all, relying on third-party functions, packages, or programs makes our software projects less… independent. This may be problematic since:
the dependencies might not be available on every platform or may behave differently across various system configurations,
they may be huge (and can depend on other external software too),
their APIs may be altered, which can cause our code to break,
their functionality can change, which can lead to some unexpected behaviours.
Hence, it might be better to rewrite some parts from scratch on our own.
Identify some R packages on CRAN with many dependencies. See what functions they import from other packages. How often is it just a few lines of code?
The UNIX philosophy emphasises building and using minimalist yet nontrivial, single-purpose, high-quality pieces of software that can work as parts of more complex pipelines. R serves as a glue language quite well.
In the long run, some of our software projects might converge to such a tool. Thus, we might have to standardise our API (e.g., make it available from the command line; Section 1.2) so that the users of other languages can benefit from our work.
Important
If our project is merely a modified interface/front-end to a standalone program developed by others, we should be humble about it. We should strive to ensure we are not the ones who get all the credit for other people’s work.
Also, we must clearly state how the original tools can be used to achieve the same goals, e.g., when working from the command line.
9.2. Managing data flow#
A function, most of the time, can and should be treated as a black box: its callers do not have to care what it hides inside. After all, they are supposed to use it: given some inputs, they expect well-defined outputs (explained in detail in the function’s manual; see Section 9.3.2.3).
9.2.1. Checking input data integrity and argument handling#
A function takes R objects of any kind as arguments, but it does not mean feeding it with every- or any-thing is healthy for its guts.
When designing functions, it is best to handle the inputs in a manner similar to base R’s behaviour. This will make our contributions easier to work with.
Lamentably, base R functions frequently do not process arguments of a similar kind fully consistently. Such variability might be due to many reasons and, in essence, is not necessarily bad. Usually, there might be many possible behaviours and choosing one over another will make a few users unhappy anyway. Some choices might not be optimal, but they are for historical compatibility (e.g., with S). Of course, it might also happen (but the probability is low) that there is a bug or something is poorly designed.
This is why it is better to keep the vocabulary quite restricted. Even if there are exceptions to the general rules, with fewer functions, they are easier to remember. We advocate for such minimalism in this book.
Consider the following case study, illustrating that even the extremely simple scenario dealing with a single positive integer is not necessarily straightforward.
In mathematical notation, we usually denote the number of objects in a collection with the famous “\(n\)”.
It is implicitly assumed that such \(n\) is a single natural number
(albeit whether this includes 0 or not should be specified at some point).
The functions runif, sample, seq, rep, strrep, and class::knn take it as an argument. Nonetheless, nothing prevents their users from trying to challenge them by passing:
2.5, -1, 0, 1-1e-16 (non-positive numbers, non-integers);
NA_real_, Inf (not finite);
1:5 (not of length 1; after all, there are no scalars in R);
numeric(0) (an empty vector);
TRUE, NA, c(TRUE, FALSE, NA), "1", c("1", "2", "3") (non-numeric, but coercible to);
list(1), list(1, 2, 3), list(1:3, 4) (non-atomic);
"spam" (utter nonsense);
as.matrix(1), factor(7), factor(c(3, 4, 2, 3)), etc. (compound types; see Chapter 10).
Read the aforementioned functions’ reference manuals and call them on different inputs. Notice how differently they handle such atypical arguments.
Sometimes we will rely on other functions to check data integrity for us.
Let us consider the following function that generates \(n\) pseudorandom numbers from the unit interval rounded to \(d\) decimal digits. We strongly believe or hope (good faith and high competence assumption) that its authors knew what they were doing when they wrote:
round_rand <- function(n, d)
{
x <- runif(n) # runif will check if `n` makes sense
round(x, d) # round will determine the appropriateness of `d`
}
What constitutes correct \(n\) and \(d\) and how the function behaves when not provided with positive integers is determined by the two underlying functions, runif and round:
round_rand(4, 1) # the expected use case
## [1] 0.3 0.8 0.4 0.9
round_rand(4.8, 1.9) # 4, 2
## [1] 0.94 0.05 0.53 0.89
round_rand(4, NA)
## [1] NA NA NA NA
round_rand(0, 1)
## numeric(0)
Many such design choices can be defended if they are well thought-out and adequately documented. Some programmers will opt for high uniformity/compatibility across numerous tools, but there are cases where some exceptions/diversity do more good than harm.
Yet, our functions might be part of a more complicated data flow pipeline. It might happen that some other procedure generates a value that we did not expect (because of a bug therein or because we did not study its manual). The problem arises when this unthinkable value is passed to our function. In our case, this would correspond to the said \(n\)’s or \(d\)’s being determined programmatically.
Continuing the previous example, the following might be somewhat challenging with regard to our being flexible and open-minded:
round_rand(c(100, 42, 63, 30), 1) # length(c(...)), 1)
## [1] 0.7 0.6 0.1 0.9
round_rand("4", 1) # as.numeric(...), 1
## [1] 0.2 0.0 0.3 1.0
Sure, it is quite convenient. Nevertheless, it might lead to problems that are hard to diagnose.
Also, note the not so informative error messages in cases like:
round_rand(NA, 1)
## Error in runif(n): invalid arguments
round_rand(4, "1")
## Error in round(x, d): non-numeric argument to mathematical function
Hence, some defensive design mechanisms are not a bad idea, especially if they lead to generating an informative error message.
Important
stopifnot gives a convenient means to assert the enjoyment of our expectations about a function’s arguments (or some intermediate values). A call to stopifnot(cond1, cond2, ...) is more or less equivalent to:
if (!(is.logical(cond1) && !any(is.na(cond1)) && all(cond1)))
stop("`cond1` are not all TRUE")
if (!(is.logical(cond2) && !any(is.na(cond2)) && all(cond2)))
stop("`cond2` are not all TRUE")
...
Thus, if all the elements in the given logical vectors are TRUE, nothing happens. We can move on with certainty.
We can rewrite the above function as follows:
round_rand2 <- function(n, d)
{
stopifnot(
is.numeric(n), length(n) == 1,
is.finite(n), n > 0, n == floor(n),
is.numeric(d), length(d) == 1,
is.finite(d), d > 0, d == floor(d)
)
x <- runif(n) # runif will check if n makes sense
round(x, d) # round will determine the appropriateness of d
}
round_rand2(5, 1)
## [1] 0.7 0.7 0.5 0.6 0.3
round_rand2(5.4, 1)
## Error in round_rand2(5.4, 1): n == floor(n) is not TRUE
round_rand2(5, "1")
## Error in round_rand2(5, "1"): is.numeric(d) is not TRUE
It is the strictest test for “a single positive integer” possible. In the case of any violation of the underlying condition, we get a very informative error message.
At other times, we might be interested in the argument checking like:
if (!is.numeric(n))
n <- as.numeric(n)
if (length(n) > 1) {
warning("only the first element will be used")
n <- n[1]
}
n <- floor(n)
stopifnot(is.finite(n), n > 0)
This way, "4" and c(4.9, 100) will both be accepted as 4[4].
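The above fragment can be wrapped in a small reusable helper; below is a sketch (the name as_count is ours, not part of base R):

```r
as_count <- function(n)
{
    if (!is.numeric(n))
        n <- as.numeric(n)  # e.g., "4" becomes 4; "spam" becomes NA
    if (length(n) > 1) {
        warning("only the first element will be used")
        n <- n[1]
    }
    n <- floor(n)
    stopifnot(is.finite(n), n > 0)  # this also rejects the NA case
    n
}
as_count("4")  # returns 4
```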
We see that there is always a tension between being generous/flexible and precise/restrictive. Also, because of their particular use cases, for some functions, it will be better to behave differently than the others. Too much uniformity is as bad as chaos. We are expected to rely on common sense, but adding lightweight foolproof mechanisms is always welcome.
It is our duty to be explicit about all the assumptions we make or exceptions we allow (by writing comprehensive documentation; see Section 9.3.2.3).
We will revisit this topic in Section 10.4.
Note
Example exercises related to improving the consistency of base R’s argument handling in different domains include the vctrs and stringx packages[5]. Can these contributions be justified?
Reflect on how you would act in the following scenarios (and how base R and other packages or languages you know deal with them):
a vectorised mathematical function (empty vectors? non-numeric inputs? what if it is equipped with the names attribute? what if it has other ones?);
an aggregation function (what about missing values? empty vectors?);
a function vectorised with regard to two arguments (elementwise vectorisation? recycling rule? only scalar vs vector or vector vs vector of the same length allowed? what if one argument is a row vector and the other is a column vector);
a function vectorised with respect to all arguments (really all? maybe some exceptions are necessary?);
a function vectorised with respect to the first argument but not the second (why such a restriction? when?).
Find a few functions that match each case.
9.2.2. Putting outputs into context#
Our functions do not exist in a vacuum. We should put them into a much broader context: how can they be combined with other tools?
As a general rule, we ought to generate outputs of a predictable kind. This way, we can easily deduce what will happen in the code chunks that utilise them.
Some base R functions do not adhere to this rule for the sake of (questionable) users’ convenience. We will meet a few of them in Chapter 11 and Chapter 12. In particular, sapply, and the underlying simplify2array, can return a list, an atomic vector, or a matrix.
simplify2array(list(1, 3:4)) # list
## [[1]]
## [1] 1
##
## [[2]]
## [1] 3 4
simplify2array(list(1, 3)) # vector
## [1] 1 3
simplify2array(list(1:2, 3:4)) # matrix
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
Further, the index operator with drop=TRUE, which is the default, may output an atomic vector. However, it may as well yield a matrix or a data frame.
(A <- matrix(1:6, nrow=3)) # an example matrix
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
A[1, ] # vector
## [1] 1 4
A[1:2, ] # matrix
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
A[1, , drop=FALSE] # matrix with 1 row
## [,1] [,2]
## [1,] 1 4
We proclaim that the default functions’ behaviour should be to return the object of the most generic kind possible (if there are other options). Then:
either we equip the function with a further argument which must be explicitly set if we really wish to simplify the output,
or we ask the user to call a simplifier explicitly after the function call. In this case, if the simplifier cannot neaten the object, it should probably fail by issuing an error or at least try to apply some brute force solution (e.g., “fill the gaps” somehow itself, preferably with a warning).
For instance:
as.numeric(A[1:2, ]) # always returns a vector
## [1] 1 2 4 5
stringi::stri_list2matrix(list(1, 3:4)) # fills the gaps with NAs
## [,1] [,2]
## [1,] "1" "3"
## [2,] NA "4"
Ideally, a function is expected to perform one (and only one) well-defined task. If it tends to generate objects of different kinds, depending on the arguments provided, it might be better to compose two or more separate procedures instead.
Functions such as rep, seq, and sample do not perform a single task. Or do they?
Note
(*) In a purely functional programming language, we can assume the so-called referential transparency: a call to a pure function can always be replaced with the value it generates. If this is true, then for the same set of argument values, the output is always the same. Furthermore, there are no side effects. In R, it is not exactly the case:
a call can introduce/modify/delete variables in other environments (see Chapter 16), e.g., the state of the random number generator,
due to lazy evaluation, functions are free to interpret the argument forms (the passed expressions, i.e., not only their values) in whatever way they like (see Section 9.5.7, Section 12.3.9, and Section 17.5),
printing, plotting, file reading, and database access have apparent consequences with regard to the state of some external resources.
Important
Each function must return some value, but there are several instances (e.g., plotting, printing) where this does not make sense. In such a case, we may consider returning invisible(NULL), a NULL whose first printing will be suppressed.
Compare the following:
(function() NULL)() # anonymous function, called instantly
## NULL
(function() invisible(NULL))() # printing suppressed
x <- (function() invisible(NULL))()
print(x) # no longer invisible
## NULL
Take a look at the return value of the built-in cat.
9.3. Organising and maintaining functions#
9.3.1. Function libraries#
Definitions of frequently-used functions or datasets can be emplaced in separate source files (.R extension) for further reference.
Such libraries can be executed by calling:
source("path_to_file.R")
Create a source file (script) named mylib.R, where you define a function called nlargest which returns a few largest elements in a given atomic vector. From within another script, call source("mylib.R") (note that relative paths refer to the current working directory; compare Section 2.1.6). Then, write a few lines of code where you test nlargest on some example inputs.
9.3.2. Writing R packages (*)#
When a function library grows substantially, or when there is a need for equipping it with the relevant manual pages[6] (Section 9.3.2.3) or compiled code (Chapter 14), turning it into an R package (Section 7.3.1) might be worth considering. This is the case even if it is only for our own or small team’s purpose.
Important
You do not have to publish your package on CRAN[7]. Many users are tempted to submit whatever they have been tinkering around with for a while. Have mercy on the busy CRAN maintainers and do not contribute to the information overload unless you have come up with something potentially of service for other R users (make it less about you and more about the community; thank you in advance). R packages can always be hosted on and installed from, for instance, GitLab or GitHub.
9.3.2.1. Package structure (*)#
A source package is merely a directory containing some special files and subdirectories:
DESCRIPTION – a text file that gives the name of the package, its version, authors, dependencies on other packages, license, etc.;
NAMESPACE – a text file containing directives stating which objects are to be exported so that they are available to the package users and which names are to be imported from other packages;
R – a directory with R scripts (.R files), which define, e.g., functions, example datasets, etc.;
man – a directory with R documentation files (.Rd), describing at least all the exported objects; see Section 9.3.2.3;
src – optional; compiled code, see Chapter 14;
tests – optional; tests to run on the package check, see Section 9.3.4.2.
See Section 1 of [62] for more details and other options. We do not need to repeat the information from the official manual as all readers can read it themselves.
Inspect the source code of the example package available for download from https://github.com/gagolews/rpackagedemo/.
9.3.2.2. Building and installing (*)#
Recall from Section 7.3.1.2 that a source package can be built and installed by calling:
install.packages("pkg_directory", repos=NULL, type="source")
Then it can be used as any other R package (Section 7.3.1). In particular, it can be loaded and attached to the search path (Section 16.2.6) via a call to:
library("pkg")
This makes all the objects marked as exportable in its NAMESPACE file available to the user; see also Section 16.3.5.
Create your own package mypkg featuring the solutions to the exercises you have solved whilst studying the material in the previous chapters. When in doubt, refer to the official manual [62].
Note
(*) The building and installing of packages can also be done from the command line:
R CMD build pkg_directory # creates a distributable source tarball (.tar.gz)
R CMD INSTALL pkg-version.tar.gz
R CMD INSTALL --build pkg_directory
Also, some users could potentially benefit from creating their own Makefiles that help automate the processes of building, testing, checking, etc.
9.3.2.3. Documenting R packages (*)#
Documenting functions and commenting code thoroughly is critical, even if we just write for ourselves. Most programmers sooner or later will notice that they find it hard to determine what a piece of code is doing after they take a break from it. In some sense, we always communicate with external audiences, which includes our future selves.
The help system is one of the stronger assets of the R environment. By now, we most likely have interacted with many documentation pages and have a general idea of what constitutes an informative documentation piece.
From the technical side, R Documentation (.Rd) files are located in the man subdirectory of a source package. All exported objects (e.g., functions) should be described clearly. Additional topics can be covered too.
During the package installation, the .Rd files are converted to various output formats, e.g., HTML or plain text, and displayed on a call to the well-known help function.
Documentation files use a LaTeX-like syntax, which looks quite obscure to an untrained eye. The relevant commands are explained in detail in Section 2 of Writing R Extensions [62].
Note
The process of writing .Rd files by hand might be tedious, especially keeping track of the changes to the \usage and \arguments commands. Rarely do we recommend the use of third-party packages: base R facilities are usually sufficient. But roxygen2 might be worth a try because it makes the developers’ lives easier. Most importantly, it allows the documentation to be specified alongside the functions’ definitions, which is much more natural.
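For illustration, a roxygen2-style block is a specially formatted comment placed directly above the function it documents; a sketch (the function itself is a made-up example):

```r
#' Add Two Vectors Elementwise
#'
#' @param x a numeric vector
#' @param y a numeric vector of a compatible length
#'
#' @return The elementwise sum of `x` and `y`.
#' @export
vadd <- function(x, y)
    x + y
```

A call to roxygen2::roxygenise() in the package directory then translates such blocks into the corresponding .Rd files (and NAMESPACE directives).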
Add a few manual pages to your example R package.
9.3.3. Writing standalone programs (**)#
Section 7.3.2 mentioned how to call external programs using system2.
On UNIX-like operating systems, it is easy to turn our R scripts into standalone tools that can be run from the shell. We already touched upon this topic in Section 1.2.3.
The commandArgs function returns the list of arguments passed from the command line to our script in the form of a character vector. Whatever we do with them is up to us. Moreover, q can terminate a script, yielding any integer return code. By convention, anything other than 0 indicates an error.
Say we have the following script named testfiles in the current directory:
#!/bin/env -S Rscript --vanilla
argv <- commandArgs(trailingOnly=TRUE)
cat("commandArgs:\n")
print(argv)
if (length(argv) == 0) {
cat("Usage: testfiles file1 file2 ...\n")
q(save="no", status=1) # exit with code 1
}
if (!all(file.exists(argv))) {
cat("Some files do not exist.\n")
q(save="no", status=2) # exit with code 2
}
cat("All files exist.\n")
# exits with code 0 (success)
Example interactions with this program from the UNIX-like terminal (bash):
chmod u+x testfiles # add permission to execute
./testfiles
## commandArgs:
## character(0)
## Usage: testfiles file1 file2 ...
./testfiles spanish_inquisition
## commandArgs:
## [1] "spanish_inquisition"
## Some files do not exist.
./testfiles spam bacon eggs spam
## commandArgs:
## [1] "spam" "bacon" "eggs" "spam"
## All files exist.
The stdin, stdout, and stderr represent the always-open connections mapped to the standard input (“keyboard”), and normal and error output. They can be read from or written to using functions such as scan or cat.
During run time, we can redirect stdout and stderr to different files or even strings using sink.
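For instance, a quick sketch of redirecting the normal output to a file via sink:

```r
f <- tempfile()            # a temporary file to collect the output
sink(f)                    # redirect stdout
cat("All files exist.\n")  # goes to the file, not to the console
sink()                     # restore the default output stream
readLines(f)
## [1] "All files exist."
```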
9.3.4. Assuring quality code#
Below we mention some good development practices related to maintaining quality code. This is an important topic, but writing about them is tedious to the same extent that reading about them is boring: it is the more-artistic part of software engineering. After all, they are some heuristics that are learnt best by observing and mimicking what the others are doing (and hence the exercises below will encourage us to do so).
9.3.4.1. Managing changes and working collaboratively#
We are recommended to employ some source code version control system, such as git, to keep track of the changes made to the software.
Note
It is worth investing time and effort to learn how to use git from the command line; see https://git-scm.com/doc.
There are a few hosting providers for git repositories, with GitLab and GitHub being particularly popular among open-source software developers. They support working collaboratively on the projects and are equipped with additional tools for reporting bugs, suggesting feature requests, etc.
Find where the source code of your favourite R packages or other open-source projects is hosted. Explore the corresponding repositories, feature trackers, wikis, discussion boards, etc. Each community is different and is governed by varied, sometimes contrasting guidelines: after all, we come from all corners of the world.
9.3.4.2. Test-driven development and continuous integration#
It is often hygienic to apply some principles of test-driven development when writing our own functions.
Assume that, for some reason, we were asked to compose a function to compute the root mean square (quadratic mean) of a given numeric vector. Before implementing the actual routine, we need to reflect upon what we want to achieve, especially how we want our function to behave in certain boundary cases.
stopifnot gives simple means to ensure a given assertion is fulfilled. If that is the case, it will move forward quietly.
Let us say we have come up with the following set of expectations:
stopifnot(all.equal(rms(1), 1))
stopifnot(all.equal(rms(1:100), 58.16786054171151931769))
stopifnot(all.equal(rms(rep(pi, 10)), pi))
stopifnot(all.equal(rms(numeric(0)), 0))
Write a function rms that fulfils the above assertions.
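(Spoiler alert.) One possible function that passes all four assertions, with the empty-vector case handled explicitly:

```r
rms <- function(x)
{
    stopifnot(is.numeric(x))
    if (length(x) == 0) 0 else sqrt(mean(x^2))  # the quadratic mean
}
rms(1:100)
## [1] 58.16786
```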
Implement your version of the sample function (assuming replace=TRUE), using calls to runif. Start by writing a few unit tests.
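A possible starting point (a sketch assuming replace=TRUE and equal probabilities; ceiling maps runif’s open-interval output onto the indices 1, …, length(x)):

```r
sample2 <- function(x, size)
    x[ceiling(runif(size)*length(x))]  # size indices drawn uniformly

y <- sample2(c("spam", "bacon", "eggs"), 10)
stopifnot(length(y) == 10, all(y %in% c("spam", "bacon", "eggs")))
```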
A couple of R packages support writing and executing unit tests, including testthat, tinytest, RUnit, or realtest. However, in the most typical use cases, relying on stopifnot is powerful enough.
(*) Consult the Writing R Extensions manual [62] about where and how to include unit tests in your example package.
Note
(*) R includes a built-in mechanism to check a couple of code quality areas: running R CMD check pkg_directory from the command line (preferably using the most recent version of R) can suggest several improvements.
Also, it is possible to use various continuous integration techniques that are automatically triggered when pushing changes to our software repositories; see GitLab CI or GitHub Actions. For instance, running a package build, install, and check process is possible on every git commit. Also, CRAN features some continuous integration services, including checking the package on various platforms.
9.3.4.3. Debugging#
For all his life, the current author has been debugging his programs primarily by manually printing the state of the suspicious variables (printf and the like) in different code areas. How old school! Yet, weirdly efficient.
R has an interactive debugger; see the browser function. Also, refer to Section 9 of [66] for more details.
Some IDEs (e.g., RStudio) support this feature, too; see their corresponding documentation.
9.3.4.4. Profiling#
Typically, a program spends a relatively long time executing only a small portion of code. The Rprof function can be a helpful tool to identify which chunks might need a rewrite, for instance, using a compiled language (Chapter 14).
Please remember, though, that bottlenecks are not only formed by using algorithms with high computational complexity, but also data input and output (such as reading files from disk, printing messages on the console, querying Web APIs, etc.).
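A basic profiling session might look like this (a sketch; summaryRprof aggregates the samples that Rprof collected):

```r
Rprof(f <- tempfile())         # start sampling the call stack
x <- runif(1e6)
for (i in 1:20) y <- sort(x)   # some code under scrutiny
Rprof(NULL)                    # stop profiling
head(summaryRprof(f)$by.self)  # where did the time go?
```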
9.4. Special functions: Syntactic sugar#
Some functions, such as `*`, are somewhat special. They can be referred to using an alternative syntax which, for some reason, most of us accepted as the default one. Below we will reveal, amongst others, that “5 * 9” reduces to an ordinary function call:
`*`(5, 9) # a call to `*` with 2 arguments, equivalent to 5 * 9
## [1] 45
9.4.1. Backticks#
In Section 2.2, we have mentioned that we can assign (as in `<-`) syntactically valid names to our objects. Most identifiers comprised of letters, digits, dots, and underscores can be used directly in R code.
Nevertheless, it is possible to label our objects however we like: not syntactically valid (nonstandard) identifiers just need to be enclosed in backticks (back quotes, grave accents):
`42 a quite peculiar name :O tlollolll` <- c(a=1, `b c`=2, `42`=3, `!`=4)
1/(1+exp(-`42 a quite peculiar name :O tlollolll`))
## a b c 42 !
## 0.73106 0.88080 0.95257 0.98201
Of course, such names are less convenient. However, backticks let us refer to them in any context.
9.4.2. Dollar, `$` (*)#
The dollar operator, `$`, can be an alternative accessor to a single element in a named list[8].
If the label is a syntactically valid name, then x$label does the same job as x[["label"]] (saving five keystrokes: such a burden!).
x <- list(spam="a", eggs="b", `eggs and spam`="c", best.spam.ever="d")
x$eggs
## [1] "b"
x$best.spam.ever # recall that a dot has no special meaning in most contexts
## [1] "d"
Nonstandard names must still be enclosed in backticks:
x$`eggs and spam` # x[["eggs and spam"]] is okay as usual
## [1] "c"
We are minimalist by design here. Thence, we will avoid this operator, as it does not increase the expressive power of our function repertoire. Also, it works on neither atomic vectors nor matrices.
Furthermore, it does not work with names that are generated programmatically:
what <- "spam"
x$what # the same as x[["what"]] – we don't want this
## NULL
x[[what]] # works fine
## [1] "a"
Support for the partial matching of element names was added to give users in quick-and-dirty interactive sessions some relief when they find typing the whole label too burdensome:
x$s
## Warning in x$s: partial match of 's' to 'spam'
## [1] "a"
Compare:
x[["s"]] # no warning here...
## NULL
x[["s", exact=FALSE]]
## [1] "a"
It is generally a bad programming practice. The result depends on the names of other items in x (which might change later) and can decrease code readability. The only reason we obtained a warning message is that this book enforces the options(warnPartialMatchDollar=TRUE) setting, which, sadly, is not the default.
Note the behaviour on an ambiguous partial match:
x$egg # ambiguous resolution
## NULL
as well as on an element assignment:
x$s <- "e"
str(x)
## List of 5
## $ spam : chr "a"
## $ eggs : chr "b"
## $ eggs and spam : chr "c"
## $ best.spam.ever: chr "d"
## $ s : chr "e"
This did not modify spam: it added a new element, s.
9.4.3. Curly braces, `{`#
A block of statements grouped with curly braces, `{`, corresponds to a function call. When we write:
{
print(TRUE)
cat("two")
3
}
## [1] TRUE
## two
## [1] 3
The parser translates it to a call to:
`{`(print(TRUE), cat("two"), 3)
## [1] TRUE
## two
## [1] 3
When the above is executed, every argument, one by one, is evaluated. Then, the last value is returned as the result of that call.
9.4.4. `if`#
if is a function, too. As mentioned in Section 8.1, it returns the value corresponding to the expression that is evaluated conditionally. Hence, we may write:
if (runif(1) < 0.5) "head" else "tail"
## [1] "head"
but also:
`if`(runif(1) < 0.5, "head", "tail")
## [1] "head"
Note
A call like `if`(test, what_if_true, what_if_false) can only work correctly because of the lazy evaluation of function arguments; see Chapter 17.
On a side note, while, for, and repeat can also be called that way; however, they return invisible(NULL).
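The following sketch confirms the latter: a loop's value is NULL and, being returned invisibly, it is not auto-printed (assigning it is unusual but legal):

```r
res <- for (i in 1:3) i^2  # the loop's value, not the last i^2
print(res)
## NULL
```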
9.4.5. Operators are functions#
9.4.5.1. Calling built-in operators as functions#
Every arithmetic, logical, and relational operator is translated to a call to the corresponding function. For instance:
`<`(`+`(2, `*`(`-`(3), 4)), 5) # 2+(-3)*4 < 5
## [1] TRUE
Also, x[i] is equivalent to `[`(x, i), and x[[i]] maps to `[[`(x, i).
Knowing this will not only enable us to manipulate unevaluated R code (Chapter 15) or access the corresponding manual pages (see, e.g., help("[")), but also to verbalise some expressions more concisely.
For instance,
x <- list(1:5, 11:17, 21:23)
unlist(Map(`[`, x, 1)) # 1 is a further argument passed to `[`
## [1] 1 11 21
is equivalent to a call to Map(function(e) e[1], x).
Note
Unsurprisingly, the assignment operator, `<-`, is also a function. It returns the assigned value invisibly.
`<-` binds right to left (compare help("Syntax")). Thus, the expression “a <- b <- 1” assigns 1 to both b and a. It is equivalent to “`<-`("a", `<-`("b", 1))”, as “`<-`("b", 1)” returns 1.
Owing to the pass-by-value semantics (Section 9.5.1), we can also expect that we will always be assigning a copy of the value on the right-hand side (with the exception of environments; Chapter 16).
x <- 1:6
y <- x # makes a copy (but delayed, on demand, for performance reasons)
y[c(TRUE, FALSE)] <- NA_real_ # modify every 2nd element
print(y)
## [1] NA 2 NA 4 NA 6
print(x) # state of x has not changed — x and y are different objects
## [1] 1 2 3 4 5 6
This is especially worth pointing out to programmers coming from Python (amongst other languages), where the above assignment would mean that x and y both refer to the same (shared) object in the computer’s memory. However, with no harm done to the semantics, copying x is postponed until absolutely necessary (Section 16.1.4). This is efficient both time- and memory-wise.
9.4.5.2. Creating own binary operators#
We can also introduce our own binary operators named like `%myopname%`:
`%:)%` <- function(e1, e2) (e1+e2)/2
5 %:)% 1:10
## [1] 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5
Recall that `%%` and `%/%` are built-in operators denoting division remainder and integer division. Also, in Chapter 11, we will learn about `%*%`, which implements matrix multiplication.
Note
Chapter 10 notes that most existing operators can be overloaded for objects of different types.
9.4.6. Replacement functions#
Functions generally do not change the state of their arguments. However, there is some syntactic sugar that allows us to replace objects or their parts with new content. We call them replacement functions.
For instance, three of the following calls replace the input x with its modified version:
x <- 1:5 # example input
x[3] <- 0 # replace the 3rd element with 0
length(x) <- 7 # "replace" length
names(x) <- LETTERS[seq_along(x)] # replace the names attribute
print(x) # `x` is now different
## A B C D E F G
## 1 2 0 4 5 NA NA
9.4.6.1. Creating replacement functions#
A replacement function is a mapping named like `name<-` with at least two parameters:
- x (the object to be modified),
- `...` (possible further arguments),
- value (as the last parameter; the object on the right-hand side of the `<-` operator).
We will most often interact with existing replacement functions, not create our own ones. But knowing how to do the latter is vital to understanding this language feature.
For example:
`add<-` <- function(x, where=TRUE, value)
{
x[where] <- x[where] + value
x # the modified object that will replace the original one
}
The above aims to add some value to a subset of the input vector x (by default, to each element therein). Then, it returns its altered version.
y <- 1:5 # example vector
add(y) <- 10 # calls `add<-`(y, value=10)
print(y) # y has changed
## [1] 11 12 13 14 15
add(y, 3) <- 1000 # calls `add<-`(y, 3, value=1000)
print(y) # y has changed again
## [1] 11 12 1013 14 15
We see that calling “add(y, w) <- v” works as if we had called “y <- `add<-`(y, w, value=v)”.
Note
(*) According to [66], a call “add(y, 3) <- 1000” is syntactic sugar precisely for:
`*tmp*` <- y # temporary substitution
y <- `add<-`(`*tmp*`, 3, value=1000)
rm("*tmp*") # remove the named object from the current scope
This has at least two implications.
First, in the unlikely event that a variable `*tmp*` existed before the call to the replacement function, it will be no more, it will cease to be. It will be an ex-variable. Second, the temporary substitution guarantees that y must exist before the call (a function’s body does not have to refer to all the arguments passed; because of lazy evaluation (Chapter 17), we could get away with it otherwise).
9.4.6.2. Substituting parts of vectors#
The replacement versions of the subsetting operators are named as follows:
- `[<-` is used in substitutions like “x[i] <- value”,
- `[[<-` is called when we perform “x[[i]] <- value”,
- `$<-` is used whilst calling “x$i <- value”.
Here is a use case:
x <- 1:5
`[<-`(x, c(3, 5), NA_real_) # returns a new object
## [1] 1 2 NA 4 NA
print(x) # does not change the original input
## [1] 1 2 3 4 5
On a side note, `length<-` can be used to expand or shorten a given vector by calling “length(x) <- new_length”; see also Section 5.3.3.
x <- 1:5
x[7] <- 7
length(x) <- 10
print(x)
## [1] 1 2 3 4 5 NA 7 NA NA NA
length(x) <- 3
print(x)
## [1] 1 2 3
Semantically speaking, calling `[<-` breeds a new vector (a modified version of the original one). Luckily, we may expect some performance optimisations behind the scenes.
Write a function `extend<-`, which pushes new elements at the end of a given vector, modifying it in place.
`extend<-` <- function(x, value) ...to.do...
Example use:
x <- 1
extend(x) <- 2 # push 2 at the back
extend(x) <- 3:10 # add 3, 4, ..., 10
print(x)
## [1] 1 2 3 4 5 6 7 8 9 10
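One possible solution sketch (a plain concatenation; as with all replacement functions, the caller’s x is in fact rebound to the returned result rather than literally modified in place):

```r
`extend<-` <- function(x, value)
    c(x, value)  # append the new elements at the end

x <- 1
extend(x) <- 2     # push 2 at the back
extend(x) <- 3:10  # add 3, 4, ..., 10
print(x)
## [1] 1 2 3 4 5 6 7 8 9 10
```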
9.4.6.3. Replacing attributes#
There are many replacement functions to reset object attributes (Section 4.4). In particular, each special attribute has its replacement procedure, e.g., `names<-`, `class<-`, `dim<-`, `levels<-`, etc.
x <- 1:3
names(x) <- c("a", "b", "c") # change the `names` attribute
print(x) # x has been altered
## a b c
## 1 2 3
Individual (arbitrary, including non-special ones) attributes can be set using `attr<-`, and all of them can be established via a single call to `attributes<-`.
x <- "spam"
attributes(x) <- list(shape="oval", smell="meaty")
attributes(x) <- c(attributes(x), taste="umami")
attr(x, "colour") <- "rose"
print(x)
## [1] "spam"
## attr(,"shape")
## [1] "oval"
## attr(,"smell")
## [1] "meaty"
## attr(,"taste")
## [1] "umami"
## attr(,"colour")
## [1] "rose"
Also, setting an attribute to NULL results, by convention, in its removal:
attr(x, "taste") <- NULL # this is tasteless now
print(x)
## [1] "spam"
## attr(,"shape")
## [1] "oval"
## attr(,"smell")
## [1] "meaty"
## attr(,"colour")
## [1] "rose"
attributes(x) <- NULL # remove all
print(x)
## [1] "spam"
This can be worthwhile in contexts such as:
x <- structure(c(a=1, b=2, c=3), some_attrib="value")
y <- `attributes<-`(x, NULL)
Here, x retains its attributes, and y is a version of x with metadata removed.
9.4.6.4. Compositions of replacement functions#
Updating only selected names like:
x <- c(a=1, b=2, c=3)
names(x)[2] <- "spam"
print(x)
## a spam c
## 1 2 3
is possible due to the fact that “names(x)[i] <- v” is equivalent to:
old_names <- names(x)
new_names <- `[<-`(old_names, i, value=v)
x <- `names<-`(x, value=new_names)
Important
More generally, a composition of replacement calls “g(f(x, a), b) <- y” yields a result equivalent to “x <- `f<-`(x, a, value=`g<-`(f(x, a), b, value=y))”. Both f and `f<-` need to be defined, but having g is not necessary.
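As a sanity check (a sketch using only base R), the rule can be verified on the names example above, where f is names and g is `[`:

```r
x1 <- x2 <- c(a=1, b=2, c=3)
names(x1)[2] <- "spam"  # the sugared composition of replacement calls
x2 <- `names<-`(x2, value=`[<-`(names(x2), 2, value="spam"))  # desugared
identical(x1, x2)
## [1] TRUE
```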
(*) What is “h(g(f(x, a), b), c) <- y” equivalent to?
Write a (convenient!) function `recode<-` which replaces specific elements in a character vector with some other ones, allowing the following interface:
`recode<-` <- function(x, value) ...to.do...
x <- c("spam", "bacon", "eggs", "spam", "eggs")
recode(x) <- c(eggs="best spam", bacon="yummy spam")
print(x)
## [1] "spam" "yummy spam" "best spam" "spam" "best spam"
We see that the named character vector gives a few from="to" pairs, e.g., all eggs are to be replaced by best spam.
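One possible solution sketch (based on match; elements without a corresponding from= label are left as they are):

```r
`recode<-` <- function(x, value)
{
    m <- match(x, names(value))          # positions of matching from= labels
    x[!is.na(m)] <- value[m[!is.na(m)]]  # substitute the matched elements
    x
}

x <- c("spam", "bacon", "eggs", "spam", "eggs")
recode(x) <- c(eggs="best spam", bacon="yummy spam")
print(x)
## [1] "spam" "yummy spam" "best spam" "spam" "best spam"
```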
Now, determine which calls are equivalent to the following:
x <- c(a=1, b=2, c=3)
recode(names(x)) <- c(c="z", b="y") # or equivalently = ... ?
print(x)
## a y z
## 1 2 3
y <- list(c("spam", "bacon", "spam"), c("spam", "eggs", "cauliflower"))
recode(y[[2]]) <- c(cauliflower="broccoli") # or = ... ?
print(y)
## [[1]]
## [1] "spam" "bacon" "spam"
##
## [[2]]
## [1] "spam" "eggs" "broccoli"
(*) Consider the `recode<-` function from the previous exercise. Here is an example matrix with the dimnames attribute whose names attribute is set (more details in Chapter 11):
(x <- Titanic["Crew", "Male", , ])
## Survived
## Age No Yes
## Child 0 0
## Adult 670 192
recode(names(dimnames(x))) <- c(Age="age", Survived="survived")
print(x)
## survived
## age No Yes
## Child 0 0
## Adult 670 192
This changes the x object. For each of the following subtasks, compose a single call that alters names(dimnames(x)) without modifying x in place but returning a recoded copy of:
- names(dimnames(x)),
- dimnames(x),
- x.
(*) Consider the `recode<-` function once again, but now let an example object be a data frame featuring a column of the factor class:
x <- iris[c(1, 2, 51, 101), ]
recode(levels(x[["Species"]])) <- c(
setosa="SET", versicolor="VER", virginica="VIR"
)
print(x)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 SET
## 2 4.9 3.0 1.4 0.2 SET
## 51 7.0 3.2 4.7 1.4 VER
## 101 6.3 3.3 6.0 2.5 VIR
How to change levels(x[["Species"]]) and return an altered copy of:
- levels(x[["Species"]]),
- x[["Species"]],
- x,
without modifying x in place?
9.5. Arguments and local variables#
9.5.1. Pass by “value”#
As a general rule, functions cannot change the state of their arguments[9]. We can think of them as being passed by value, i.e., as if their copy was made.
test_change <- function(y)
{
y[1] <- 7
y
}
x <- 1:5
test_change(x)
## [1] 7 2 3 4 5
print(x) # same
## [1] 1 2 3 4 5
If the above were not the case, the state of x would have changed after the call.
9.5.2. Variable scope#
Function arguments and any other variables we create inside a function’s body are relative to each call to that function.
test_change <- function(x)
{
x <- x+1
z <- -x
z
}
x <- 1:5
test_change(x*10)
## [1] -11 -21 -31 -41 -51
print(x) # x in the function's body was a different x
## [1] 1 2 3 4 5
print(z) # z was local
## Error in eval(expr, envir, enclos): object 'z' not found
Both x and z are local variables. They only live whilst our function is being executed. The former temporarily masks[10] the object of the same name from the caller’s context.
Important
It is a good development practice to refrain from referring to objects not created within the current function, especially to “global” variables. We can always pass an object as an argument explicitly.
Note
It is a function call as such, not curly braces per se, that forms a local scope. When we write “x <- { y <- 1; y + 1 }”, y is not an auxiliary variable; it is an ordinary named object created alongside x. On the other hand, in “x <- (function() { z <- 1; z + 1 })()”, z will not be available thereafter.
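The following sketch (assuming a fresh session where neither y nor z exists yet) demonstrates the difference:

```r
x <- { y <- 1; y + 1 }  # curly braces alone: y lands in the current scope
exists("y")
## [1] TRUE
x <- (function() { z <- 1; z + 1 })()  # a function call: z is local
exists("z")
## [1] FALSE
```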
9.5.3. Closures (*)#
Most user-defined functions are, in fact, representatives of the so-called closures; see Section 16.3.2 and [1]. They not only consist of an R expression to evaluate but also can carry some auxiliary data.
For instance, given two equal-length numeric vectors x and y, a call to approxfun(x, y) returns a function that linearly interpolates between the consecutive points \((x_1, y_1)\), \((x_2, y_2)\), and so forth, so that a corresponding \(y\) can be determined for any \(x\).
x <- seq(0, 1, length.out=11)
f1 <- approxfun(x, x^2)
f2 <- approxfun(x, x^3)
f1(0.75) # check that it is quite close to the true 0.75^2
## [1] 0.565
f2(0.75) # compare with 0.75^3
## [1] 0.4275
Inspecting, however, the source codes of the above functions:
print(f1)
## function (v)
## .approxfun(x, y, v, method, yleft, yright, f, na.rm)
## <environment: 0x55d9258999f8>
print(f2)
## function (v)
## .approxfun(x, y, v, method, yleft, yright, f, na.rm)
## <environment: 0x55d924a01548>
we might wonder how they can produce different results: it is evident that they are identical. It turns out, however, that they internally store some additional data that are referred to when they are called:
environment(f1)[["y"]]
## [1] 0.00 0.01 0.04 0.09 0.16 0.25 0.36 0.49 0.64 0.81 1.00
environment(f2)[["y"]]
## [1] 0.000 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1.000
We will explore these concepts in detail in the third part of this book.
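We can create such closures ourselves. Here is a minimal sketch (make_power is our own illustrative name, not a base R function):

```r
make_power <- function(p)
    function(x) x^p  # p survives in the new function's environment

square <- make_power(2)
cube <- make_power(3)
square(4)
## [1] 16
environment(cube)[["p"]]  # the auxiliary datum carried by the closure
## [1] 3
```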
9.5.4. Default arguments#
We have already mentioned above that when designing functions performing complex tasks we will sometimes be faced with a design problem: how to find a sweet spot between being generous/mindful of the diverse needs of our users and making the API neither overwhelming nor oversimplistic.
We know that it is best if a function performs a single well-specified task. Moreover, it is nice if it also allows its behaviour to be tweaked if one wishes to do so. The use of default arguments can facilitate this principle.
For instance, log computes logarithms, by default, the natural ones.
log(2.718) # the same as log(2.718, base=exp(1)) — default base
## [1] 0.9999
log(4, base=2) # different base
## [1] 2
Study the documentation of the following functions and note the default values they define: round, hist, grep, and download.file.
We can easily define our own functions equipped with such recommended settings:
test_default <- function(x=1) x
test_default() # use default
## [1] 1
test_default(2) # use something else
## [1] 2
Most often, default arguments are just constants, e.g., 1. Generally, though, they can be any R expressions, including ones that refer to other arguments passed to the same function; see Section 17.2.
Default arguments most often appear at the end of the parameter list, but see Section 9.4.6 (on replacement functions) for a well-justified exception.
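For illustration, consider the following sketch (clip is a hypothetical function of ours) in which the defaults refer to another parameter, x:

```r
clip <- function(x, lo=min(x), hi=max(x))
    x[x >= lo & x <= hi]  # hypothetical: keep only values in [lo, hi]

clip(c(-2, 0.5, 1, 3), lo=0, hi=2)
## [1] 0.5 1.0
clip(c(-2, 0.5, 1, 3))  # defaults cover the whole range: nothing is dropped
## [1] -2.0 0.5 1.0 3.0
```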
9.5.5. Lazy vs eager evaluation#
In some languages, function arguments are always evaluated prior to a call. In R, though, they are only computed when actually needed. We call it lazy or delayed evaluation. Recall that in Section 8.1.4, we introduced the short-circuit evaluation operators `||` (or) and `&&` (and). They can do their job precisely thanks to this mechanism.
In the following example, we do not use the function’s argument at all:
lazy_test1 <- function(x) 1 # x not used at all
lazy_test1({cat("and now for something completely different!"); 7})
## [1] 1
Otherwise, we would see a message being printed out on the console.
Next, let us use x amidst other expressions in the body:
lazy_test2 <- function(x)
{
cat("it's... ")
y <- x+x # using x twice
cat(" a man with two noses")
y
}
lazy_test2({cat("and now for something completely different!"); 7})
## it's... and now for something completely different! a man with two noses
## [1] 14
An argument is evaluated once, and its value is stored for further reference. If that were not the case, we would see the and now... message printed twice.
We will elaborate on this in Chapter 17.
9.5.6. Ellipsis, `...`#
Let us start with an exercise.
Notice the presence of `...` in the parameter list of c, list, structure, cbind, rbind, cat, Map (and the underlying mapply), lapply (a specialised version of Map), optimise, optim, uniroot, integrate, outer, and aggregate. What purpose does it serve, according to these functions’ manual pages?
We can create a variadic function by including `...` (dot-dot-dot, ellipsis; see help("dots")) somewhere in its parameter list. The ellipsis serves as a placeholder for all objects passed to the function but not matched by any formal (named) parameters. The easiest way to process arguments passed via `...` programmatically (see also Section 17.3) is to redirect them to list.
test_dots <- function(...)
list(...)
test_dots(1, a=2)
## [[1]]
## [1] 1
##
## $a
## [1] 2
Such a list can be processed just like… any other generic vector. What we can do with these arguments is only limited by our creativity (in particular, recall from Section 7.2.2 the very powerful do.call function). There are two primary use cases of the ellipsis[11]:
create a new object by combining an arbitrary number of other objects:
c(1, 2, 3) # 3 arguments
## [1] 1 2 3
c(1:5, 6:7) # 2 arguments
## [1] 1 2 3 4 5 6 7
structure("spam") # 0 additional arguments
## [1] "spam"
structure("spam", color="rose", taste="umami") # 2 further arguments
## [1] "spam"
## attr(,"color")
## [1] "rose"
## attr(,"taste")
## [1] "umami"
cbind(1:2, 3:4)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
cbind(1:2, 3:4, 5:6, 7:8)
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 42)
## [1] 108
pass further arguments (as-is) to other methods:
lapply(list(c(1, NA, 3), 4:9), mean, na.rm=TRUE) # mean(x, na.rm=TRUE)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 6.5
integrate(dbeta, 0, 1, shape1=2.5, shape2=0.5) # dbeta(x, shape1=2.5, shape2=0.5)
## 1 with absolute error < 1.2e-05
For more details, see Section 17.3.
The documentation of lapply (let us call help("lapply") now) states that this function is defined as lapply(X, FUN, ...). Here, the ellipsis is a placeholder for a number of optional arguments that can be passed to FUN. Hence, if we denote the \(i\)-th element of a vector X by X[[i]], calling lapply(X, FUN, ...) will return a list whose \(i\)-th element will be equal to FUN(X[[i]], ...).
Using a single call to lapply, generate a list with three numeric vectors of lengths 3, 9, and 7, respectively, drawn from the uniform distribution on the unit interval. Then, upgrade your code to get numbers sampled from the interval \([-1, 1]\).
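A possible solution sketch; the min and max arguments reach runif through the ellipsis:

```r
x <- lapply(c(3, 9, 7), runif)  # three vectors, uniform on [0, 1]
lengths(x)
## [1] 3 9 7
y <- lapply(c(3, 9, 7), runif, min=-1, max=1)  # now uniform on [-1, 1]
```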
9.5.7. Metaprogramming (*)#
In the third part of this book, we will learn that we can access the expressions passed as functions’ arguments programmatically. In particular, a call to the composition of deparse and substitute can convert them to a character vector:
test_deparse_substitute <- function(x)
deparse(substitute(x))
test_deparse_substitute(testing+1+2+3)
## [1] "testing + 1 + 2 + 3"
test_deparse_substitute(spam & spam^2 & bacon | grilled(spam))
## [1] "spam & spam^2 & bacon | grilled(spam)"
Check out the y-axis label generated by plot.default((1:100)^2). Inspect its source code. Notice a call to the two aforementioned functions. Similarly, call shapiro.test(log(rlnorm(100))) and take note of the data: field.
A function is free to do with such an expression whatever it likes. For instance, it can modify the expression and then evaluate it in a very different context. Such a language feature allows certain operations to be expressed much more compactly. In theory, it is a potent tool. Unfortunately, it is easy to find many practical examples where it was over/misused and made learning or using R confusing.
(*) The built-in subset and transform use metaprogramming techniques to specify basic data frame transformations (see Section 12.3.9 and Section 17.5). For instance:
transform(
subset(
iris,
Sepal.Length>=7.7 & Sepal.Width >= 3.0,
select=c(Species, Sepal.Length:Sepal.Width)
),
Sepal.Length.mm=Sepal.Length/10
)
## Species Sepal.Length Sepal.Width Sepal.Length.mm
## 118 virginica 7.7 3.8 0.77
## 132 virginica 7.9 3.8 0.79
## 136 virginica 7.7 3.0 0.77
None of the arguments (except iris) makes sense outside of the function call context. In particular, neither the Sepal.Length nor the Sepal.Width variable exists.
The two functions took the liberty to interpret the arguments passed how they felt. They created their own virtual reality within our well-defined world. The reader must refer to their documentation to discover the meaning of such special syntax.
Note
(*) Some functions have rather peculiar default arguments. For instance, in the manual page of prop.test, we read that the alternative parameter defaults to c("two.sided", "less", "greater") but that "two.sided" is actually the default one. If we call print(prop.test), we will find the code line responsible for this behaviour: “alternative <- match.arg(alternative)”.
Consider the following example:
test_match_arg <- function(x=c("a", "b", "c")) match.arg(x)
test_match_arg() # missing argument — choose 1st
## [1] "a"
test_match_arg("c") # one of the predefined options
## [1] "c"
test_match_arg("d") # unexpected setting
## Error in match.arg(x): 'arg' should be one of "a", "b", "c"
In this setting, match.arg only allows an actual parameter from a given set of choices but selects the first option if the argument is missing.
Unfortunately, we have to learn this behaviour by heart because looking at the above source code gives us no clue about this being possible. If such an expression were evaluated normally, we would use either the default argument or whatever the user passed as x (but then the function would not know the range of possible choices). A call to “match.arg(x, c("a", "b", "c"))” would guarantee the desired functionality and would be much more readable. Instead, metaprogramming techniques allow match.arg to access the enclosing function’s default argument list without explicitly referring to it.
One may ask: why is it so? The only sensible answer to this will be “because its programmer decided it must be this way”. Let us contemplate this for a while. In cases like these, we are not dealing with some base R language design choice that we might like or dislike, but which we should just accept as an inherent feature. Instead, we are struggling intellectually because of some programmer’s (mis)use (in good faith…) of R’s flexibility itself. They have introduced a slang/dialect on top of our mother tongue, whose meaning is valid only within this function. Blame the middleman, not the environment, please.
This is why we generally advocate for avoiding relying on metaprogramming-based techniques wherever possible. We shall elaborate on this in the third part of this book.
9.6. Exercises#
Answer the following questions:
- Will “stopifnot(1)” stop? What about “stopifnot(NA)”, “stopifnot(TRUE, FALSE)”, and “stopifnot(c(TRUE, TRUE, NA))”?
- What does the `if` function return?
- Does `attributes<-`(x, NULL) modify x?
- When can we be interested in calling `[` and `[<-` as functions (and not as operators) directly?
- How to define our own binary operator? Can it have some default arguments?
- What are the main use cases of `...`?
- What is wrong with transform, subset, and match.arg?
- When does a call like “f(-1, do_something_that_takes_a_million_years())” not necessarily have to be a bad idea?
- What is the difference between “names(x)[1] <- "new name"” and “names(x[1]) <- "new name"”?
- What might be the form of x if it is legit to call it like x[[c(1, 2)]]()()()[[1]]()()?
What is the return value of a call to “f(list(1, 2, 3))”?
f <- function(x)
{
for (e in x) {
print(e)
}
}
Is it NULL, invisible(NULL), x[[length(x)]], or invisible(x[[length(x)]])?
The split function also has its replacement version. Study its documentation to learn how it works.
A call to ls(envir=baseenv()) returns all objects defined in the base package (see Chapter 16). List the names corresponding to some replacement functions.
Important
Apply the principle of test-driven development when solving the remaining exercises (or those you have skipped intentionally).
Implement your version of the Position and Find functions. Evaluation should stop as soon as the first element fulfilling a given predicate has been found.
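A possible solution sketch (named position2 and find2 here so as not to mask the built-ins; note that base Position returns NA_integer_ on failure by default, whereas we return NULL):

```r
position2 <- function(f, x)
{
    for (i in seq_along(x))
        if (isTRUE(f(x[[i]]))) return(i)  # stop at the first hit
    NULL  # no element fulfils the predicate
}

find2 <- function(f, x)
{
    i <- position2(f, x)
    if (is.null(i)) NULL else x[[i]]
}

position2(function(e) e > 2, c(0, 5, 10))
## [1] 2
find2(function(e) e > 2, c(0, 5, 10))
## [1] 5
```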
Implement your version of the Reduce function.
Write a function slide(f, x, k, ...) which returns a list y of size length(x)-k+1 such that y[[i]] = f(x[i:(i+k-1)], ...).
unlist(slide(sum, 1:5, 1))
## [1] 1 2 3 4 5
unlist(slide(sum, 1:5, 3))
## [1] 6 9 12
unlist(slide(sum, 1:5, 5))
## [1] 15
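A possible implementation sketch consistent with the examples above:

```r
slide <- function(f, x, k, ...)
    lapply(
        seq_len(length(x)-k+1),
        function(i) f(x[i:(i+k-1)], ...)  # f applied to each length-k window
    )

unlist(slide(sum, 1:5, 3))
## [1] 6 9 12
```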
Using slide defined above, write another function that counts how many increasing pairs of numbers are featured in a given numeric vector. For instance, in c(0, 2, 1, 1, 0, 1, 6, 0), there are three such pairs: (0, 2), (0, 1), (1, 6).
(*) Write your version of tools::package_dependencies with reverse=TRUE, based on information extracted by calling utils::available.packages.
(**) Write a standalone program (that can be run from the system shell) that computes the total size of all the files in given directories given as the script’s arguments (via commandArgs).