7. Functions
The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone's enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of Chapters 1–12 are already complete, but there will be more. In the meantime, any bug or typo reports and fixes are appreciated. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [20].
R is a functional language, where functions play first fiddle. Each action we perform reduces itself to a call to some function, or a combination thereof.
So far we have been tinkering with dozens of available functions which are part of base R, with only a few exceptions. They constitute the essential vocabulary that everyone must be able to speak fluently.
Any operation, be it sum, sqrt, or paste, when fed with a number of arguments, generates some (hopefully useful) return value.
sum(1:10) # invoking `sum` on a specific argument
## [1] 55
From a user's perspective, each function is merely a tool. To achieve a goal at hand, we do not really have to care about what is going on under its hood, i.e., how the inputs are actually being transformed so that, after a couple of nanoseconds or hours, we can enjoy what has been yielded. This is very convenient: all we need to know is the function's specification which can be stated, for example, informally, in plain Polish or Malay, in its help page.
In this chapter, we will learn how to write our own functions. The use of this skill is a good development practice when we expect that some operations are to be executed many times but perhaps on different data.
Also, some R functions are meant to invoke other functions, for instance on every element in a list or every section of a data frame grouped by a qualitative variable, so it is good to learn how we can specify a custom operation to be propagated thereover.
Given some objects (whatever):
x1 <- runif(16)
x2 <- runif(32)
x3 <- runif(64)
when we want to apply the same action on different data, say, compute the root mean square, instead of re-typing almost identical expressions (or a bunch of them) over and over again:
sqrt(mean(x1^2))
## [1] 0.6545
sqrt(mean(x2^2)) # the same second time - borderline okay
## [1] 0.56203
sqrt(mean(x3^2)) # tedious, barbarous, and error-prone
## [1] 0.57206
we can generalise the operation to any object like x:
rms <-                   # bind what follows to the name `rms`
    function(x)          # a function that takes one parameter, `x`
        sqrt(mean(x^2))  # expression to transform the input to yield output
and then re-use it on different concrete data instances:
rms(x1)
## [1] 0.6545
rms(x2)
## [1] 0.56203
rms(x3)
## [1] 0.57206
or even combine it with other function calls:
rms(sqrt(c(x1, x2, x3)))^2
## [1] 0.50824
Important
Does writing your own functions equal reinventing the wheel? Can everything be found on the internet these days (including on Stack Overflow, GitHub, or CRAN)?
Luckily, this is not the case. Otherwise, data analysts', researchers', and developers' lives could be considered monotonous, dreary, and uninspiring. Plus, sometimes it is much quicker to write a function from scratch than to get through the whole garbage dump from where, only occasionally, we can dig out some pearls. Not to mention the self-educative side: we become better programmers by crunching those exercises. We are advocating for minimalism here, remember?
This and many more other important issues in function design will be reflected upon in Chapter 9.
7.1. Creating and Invoking Functions
7.1.1. Anonymous Functions
Functions are usually created by means of the following notation:
function(args) body
First, args is a (possibly empty) list of comma-separated parameter names which are supposed to act as input variables. Second, body is a single R expression which will be evaluated when the function is called. The value that this expression yields will constitute the function's output.
For example, here is a definition of a function which takes no inputs and generates a constant output:
function() 1
## function() 1
We thus created a function object. However, it has disappeared immediately thereafter, as we have not used it at all.
Any function, say, f can be invoked, i.e., evaluated on concrete data, by using the notation f(arg1, ..., argn), where arg1, ..., argn are the arguments to be passed to f.
(function() 1)() # invoking f like f(); here, no arguments are expected
## [1] 1
Only now have we obtained a return value.
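The same pattern works for anonymous functions that do take parameters. The sketch below is illustrative only (the particular expressions are our own choice, not part of the original example): it creates two function objects and calls each of them on the spot.

```r
# an anonymous function with one parameter, called immediately:
(function(x) x^2)(1:4)
## [1]  1  4  9 16

# two parameters; the value of the body's expression becomes the output:
(function(x, y) x + 2*y)(1, 10)
## [1] 21
```

In both cases, the function object ceases to exist once the call has been evaluated, because it was not bound to any name.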
Note
(*) Calling typeof on a function object will report "closure" (for user-defined functions), "builtin", or "special" (for some built-in, base ones), for the reasons that we explain in more detail[1] in Section 9.5.3:
typeof(function() 1)
## [1] "closure"
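As a quick check, here is what typeof reports for three functions from base R. This is only a sketch: the choice of paste, sum, and `if` is ours. Note that is.primitive yields TRUE for functions of both the "builtin" and the "special" kind.

```r
typeof(paste)  # written in R itself
## [1] "closure"
typeof(sum)    # a built-in function that evaluates all its arguments
## [1] "builtin"
typeof(`if`)   # a built-in function that does not necessarily do so
## [1] "special"
is.primitive(sum) && is.primitive(`if`)
## [1] TRUE
```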
7.1.2. Named Functions
Function objects can be bound with names so that they can be referred to multiple times:
one <- function() 1 # one <- (function() 1)
We created an object named one (we use bold font to indicate that it is of type function, because functions are so important in R). We are very familiar with such a notation, as we have long been used to writing 'x <- 1', etc.
Invoking one, which can be done by writing one(), will yield a return value:
one() # (function() 1)()
## [1] 1
This output can be used in further computations, for instance:
0:2 - one() # 0:2 - (function() 1)(), i.e., 0:2 - 1
## [1] -1 0 1
7.1.3. Passing Arguments To Functions
Functions with no arguments are kind of boring, thus let us distil a more serious operation:
concat <- function(x, y) paste(x, y, sep="")
Here we have created a mapping whose aim is to concatenate two objects by means of a specialised call to paste. Yours faithfully pleads guilty to multiplying entities needlessly, because it should not be a problem for anyone to write paste(x, y, sep="") each time. Yet, 'tis merely an illustration.
The concat function has two parameters, 'x' and 'y'. Hence, calling it will require the provision of two arguments, which we put within round brackets and separate from each other by commas.
u <- 1:5
concat("spam", u) # i.e., concat(x="spam", y=1:5)
## [1] "spam1" "spam2" "spam3" "spam4" "spam5"
Important
Notice the distinction: parameters (also called formal arguments) are abstract, general, or symbolic; "something, anything that will be put in place of x when the function is invoked". By contrast, arguments (a.k.a. actual parameters) are concrete, specific, and real.
During the above call, x in the function's body is precisely "spam", and nothing else. Also, the u object from the caller's environment is seen under the name y there. Most of the time (however, see Section 16.4), it is best to think of the function as being fed not with u per se, but the value that u is bound to, i.e., 1:5.
Also:
x <- 1:5
y <- "spam"
concat(y, x) # concat(x="spam", y=1:5)
## [1] "spam1" "spam2" "spam3" "spam4" "spam5"
This call is still equivalent to concat(x=y, y=x). The parameter x is assigned the value of y from the calling environment, "spam". Yes, one x is not the same as the other x, and which is which is unambiguously defined by the context.
Understanding and being able to manipulate such abstractions is basic logic and common sense that everyone should master.
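To see that a function is fed with the value an argument is bound to rather than with the caller's variable itself, consider the sketch below. The helper set_first is hypothetical and introduced here for illustration only.

```r
set_first <- function(y)  # hypothetical helper, for illustration only
{
    y[1] <- 100L  # modifies the function's own `y`
    y
}

u <- 1:5
set_first(u)
## [1] 100   2   3   4   5
u  # the caller's object is left intact
## [1] 1 2 3 4 5
```

The assignment inside the body affects only what the name y refers to within the function; the object named u in the calling environment remains as it was.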
Write a function called standardise that takes a numeric vector x as argument and returns its standardised version, i.e., from each element in x subtract the sample arithmetic mean and then divide it by the standard deviation.
Note
Recall from Section 2.1.3 that, syntactically speaking, the following are perfectly valid alternatives to the positionally-matched call concat("spam", u):
concat(x="spam", y=u)
concat(y=u, x="spam")
concat("spam", y=u)
concat(u, x="spam")
concat(x="spam", u)
concat(y=u, "spam")
However, the last two should particularly be avoided, for the sake of the readers' sanity. It is best to provide positionally-matched arguments before the keyword-based ones.
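The equivalence of these six calls can be verified directly. The following is a minimal sketch; the names reference and alternatives are ours, and concat is redefined so that the block is self-contained.

```r
concat <- function(x, y) paste(x, y, sep="")  # as defined above
u <- 1:5
reference <- concat("spam", u)  # the positionally-matched call
alternatives <- list(
    concat(x="spam", y=u),
    concat(y=u, x="spam"),
    concat("spam", y=u),
    concat(u, x="spam"),
    concat(x="spam", u),
    concat(y=u, "spam")
)
identical(alternatives, rep(list(reference), 6))  # do all of them agree?
## [1] TRUE
```

Keyword arguments are matched first; whatever remains is then assigned to the still-unmatched parameters in the order of appearance.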
Also, in Section 10.5, we introduce the (overused) forward-pipe operator, `|>`, which enables the above to be written as '"spam" |> concat(u)'.
7.1.4. Grouping Expressions with Curly Braces, `{`
We have been informed that a function's body is a single R expression whose evaluated value is passed to the user as its output. This may sound restrictive and contrast with what we have experienced so far. Rarely are we faced with such simple computing tasks and we have already seen R functions performing quite sophisticated operations.
It turns out that, grammatically, a single R expression can be arbitrarily complex (Chapter 15); we can use curly braces to group many calls that are to be evaluated one after another.
For instance:
{
    cat("first expression\n")
    cat("second expression\n")
    # ...
    cat("last expression\n")
}
## first expression
## second expression
## last expression
Note that we used four spaces to visually indent the constituents for greater readability (some developers prefer tabs over spaces, others find two or three spaces more urbane, but we do not). This single (compound) expression can now play the role of a function's body.
Important
The last expression evaluated in a curly-braces delimited block will be considered its output value.
x <- {
    1
    2
    3  # <--- last expression: will be taken as the output value
}
print(x)
## [1] 3
Note
(*) The above code block can also be written more concisely by replacing newlines with semicolons, although with perhaps some loss in readability:
{1; 2; 3}
## [1] 3
In Section 9.4, we will give a few more details about `{`.
Here is a version of the above concat function which takes care of a more Chapter 2-style missing values' propagation:
concat <- function(a, b)
{
    z <- paste(a, b, sep="")
    z[is.na(a) | is.na(b)] <- NA_character_
    z  # last expression in the block - return value
}
Example calls:
concat("a", 1:3)
## [1] "a1" "a2" "a3"
concat(NA_character_, 1:3)
## [1] NA NA NA
concat(1:6, c("a", NA_character_, "c"))
## [1] "1a" NA "3c" "4a" NA "6c"
Let us appreciate the fact that we could keep the code brief thanks to paste and `|` implementing the recycling rule.
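To make the role of the recycling rule explicit, we can inspect the two intermediate results that the function's body relies on. This is a sketch using the inputs from the last example call above:

```r
a <- 1:6
b <- c("a", NA_character_, "c")
paste(a, b, sep="")  # b is recycled; note that NA becomes the string "NA"
## [1] "1a"  "2NA" "3c"  "4a"  "5NA" "6c"
is.na(a) | is.na(b)  # the shorter logical vector is recycled too
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE
```

The logical vector has the same length as the output of paste, which is why it can be used directly as an index to mark the elements to be replaced with NA_character_.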
Write a function called normalise that takes a numeric vector x and returns its version shifted and scaled to the [0, 1] interval. To do so, from each element subtract the sample minimum and then divide it by the range, i.e., the difference between the maximum and the minimum. Avoid computing min(x) twice.
Write a function that applies the robust standardisation of a numeric vector: subtract the median and divide it by the median absolute deviation, 1.4826 times the median of the absolute differences between the values and their median.
Note
R is an open-source (free, libre) project: users are not only encouraged to run the software for whatever purpose, but also to study and modify its source code without any restrictions. This applies both to functions that we have authored ourselves:
print(concat)
## function(a, b)
## {
##     z <- paste(a, b, sep="")
##     z[is.na(a) | is.na(b)] <- NA_character_
##     z  # last expression in the block - return value
## }
## <bytecode: 0x55f963fbb130>
and to the routines that are part of base R or any other extension packages:
print(union)
## function (x, y)
## {
## u <- as.vector(x)
## v <- as.vector(y)
## unique(c(u, v))
## }
## <bytecode: 0x55f9652d01a8>
## <environment: namespace:base>
Nevertheless, some functionality might be implemented in a compiled programming language such as C, C++, or Fortran; notice a call to .Internal in the source code of paste, .Primitive in list, or .Call in runif. Therefore, we will sometimes have to dig a little bit deeper to access the underlying source code; see Chapter 14 for more details.
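One way to tell whether a function is implemented entirely in compiled code is to call is.primitive. The sketch below uses two of the functions mentioned above; the printed body of paste may vary between R versions, hence no output is shown for it.

```r
is.primitive(list)   # implemented entirely in compiled code
## [1] TRUE
is.primitive(paste)  # an ordinary R function...
## [1] FALSE
body(paste)          # ...whose body hands the work over to compiled code
```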
7.2. Functional Programming
R is a functional programming language. As such, it shares a number of common features with other languages that emphasise the role of function manipulation in software development (e.g., Common Lisp, Scheme, OCaml, Haskell, Clojure, F#). Let us explore them now.
7.2.1. Functions are Objects
R functions were given the right to a fair go; they are what we refer to as first-class citizens. In other words, our interaction with them is not limited to their invocation; we treat them as any other language objects. Namely, they can be:
stored inside list objects:
list(identity, nrow, sum)  # a list with three elements of type function
## [[1]]
## function (x)
## x
## <bytecode: 0x55f963829560>
## <environment: namespace:base>
##
## [[2]]
## function (x)
## dim(x)[1L]
## <bytecode: 0x55f964915fc0>
## <environment: namespace:base>
##
## [[3]]
## function (..., na.rm = FALSE) .Primitive("sum")
This is possible owing to the fact that lists, as we recall, can embrace R objects of any kind.
created and then called inside another function's body:
euclidean_distance <- function(x, y)
{
    square <- function(z) z^2  # auxiliary/internal/helper function
    sqrt(sum(square(x-y)))  # square root of the sum of squares
}
euclidean_distance(c(0, 1), c(1, 0))  # example call
## [1] 1.4142
This is why we tend to classify functions as representatives of recursive types (compare is.recursive).
passed as arguments to other operations:
# Replaces missing values with a given aggregate
# of all non-missing elements:
fill_na <- function(x, filler_fun)
{
    missing_ones <- is.na(x)  # otherwise, we'd call is.na twice
    replacement_value <- filler_fun(x[!missing_ones])
    x[missing_ones] <- replacement_value
    x
}
fill_na(c(0, NA_real_, NA_real_, 2, 3, 7, NA_real_), mean)
## [1] 0 3 3 2 3 7 3
fill_na(c(0, NA_real_, NA_real_, 2, 3, 7, NA_real_), median)
## [1] 0.0 2.5 2.5 2.0 3.0 7.0 2.5
We call these higher-order functions.
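The three properties listed above combine naturally. In the following sketch (the names aggregates and summarise are hypothetical), functions are stored in a named list, extracted from it, and passed on to another function:

```r
aggregates <- list(min=min, median=median, max=max)  # functions stored in a list
x <- c(5, 1, 3, 8, 2)
aggregates[["median"]](x)  # extract a function, then call it
## [1] 3
summarise <- function(x, f) f(x)  # hypothetical higher-order helper
summarise(x, aggregates[["max"]])
## [1] 8
```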
Note
More advanced techniques, which we will discuss later (closures, lazy evaluation, metaprogramming, etc.), will let the functions be:
returned as other functions' outputs (sec:to-do),
equipped with auxiliary data (sec:to-do),
generated programmatically on the fly (sec:to-do), and
modified at runtime (sec:to-do).
Below we review some noteworthy higher-order functions, in particular: do.call and Map. Many other ones will be introduced in due course or are left as an educative exercise.
7.2.2. Calling on Precomputed Arguments with do.call
The notation like f(arg1, ..., argn) has no monopoly over how we are supposed to call a function on a specific sequence of comma-delimited arguments: the latter do not have to be hardcoded.
Here is an alternative. We can first prepare a number of objects to be passed as f's inputs, wrap them in a list l, and then invoke do.call(f, l) to get the same result.
words <- list(
    c("spam", "bacon", "eggs"),
    c("buckwheat", "quinoa", "barley"),
    c("ham", "spam", "spam")
)
do.call(paste, words) # paste(words[[1]], words[[2]], words[[3]])
## [1] "spam buckwheat ham" "bacon quinoa spam" "eggs barley spam"
do.call(cbind, words) # column-bind; returns a matrix (explained later)
##      [,1]    [,2]        [,3]
## [1,] "spam"  "buckwheat" "ham"
## [2,] "bacon" "quinoa"    "spam"
## [3,] "eggs"  "barley"    "spam"
do.call(rbind, words) # row-bind (explained later)
##      [,1]        [,2]     [,3]
## [1,] "spam"      "bacon"  "eggs"
## [2,] "buckwheat" "quinoa" "barley"
## [3,] "ham"       "spam"   "spam"
Note that the length and content of the list passed as the 2nd argument of do.call can be arbitrary (possibly unknown at the time of writing the code). See Section 12.1.2 for more use cases, e.g., ways to concatenate a list of data frames (perhaps produced by some complex chain of commands) into a single data frame.
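As a small preview of the data frame use case, here is a sketch in which a list of three toy data frames (made up for this illustration; data frames themselves are covered in a later chapter) is row-bound into one:

```r
chunks <- list(  # e.g., produced by some earlier processing steps
    data.frame(id=1:2, label=c("a", "b")),
    data.frame(id=3L, label="c"),
    data.frame(id=4:5, label=c("d", "e"))
)
combined <- do.call(rbind, chunks)  # rbind(chunks[[1]], chunks[[2]], chunks[[3]])
combined
##   id label
## 1  1     a
## 2  2     b
## 3  3     c
## 4  4     d
## 5  5     e
```

The call works unchanged regardless of how many data frames the list contains, provided that they all have the same columns.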
If elements of the list are named, they will be matched to the corresponding keyword arguments.
x <- 2^(seq(-2, 2, length.out=101))
plot_opts <- list(col="red", lty="dashed", type="l")
do.call(plot, c(list(x, log2(x), xlab="x", ylab="log2(x)"), plot_opts))
## (the displaying of the plot has been suppressed)
Note that, e.g., plot_opts can now be reused in further calls to graphical functions. This is very convenient as it avoids repetitions.
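The same idea can be tried without a graphics device. In the sketch below (common_opts is a name of our own choosing), a list of keyword arguments is prepared once and then reused in two calls to paste:

```r
common_opts <- list(sep="-", collapse=" | ")  # reusable keyword arguments
# equivalent to paste("a", 1:3, sep="-", collapse=" | "):
do.call(paste, c(list("a", 1:3), common_opts))
## [1] "a-1 | a-2 | a-3"
do.call(paste, c(list(c("x", "y"), "z"), common_opts))
## [1] "x-z | y-z"
```

Unnamed elements of the combined list are matched positionally, whereas the named ones are matched to the parameters sep and collapse.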
7.2.3. Common Higher-Order Functions
There is an important class of higher-order functions that allow us to apply custom operations on consecutive elements of sequences without relying on loop-like statements, at least explicitly. They can be found in all functional programming languages (e.g., Lisp, Haskell, Scala) and have been ported to various add-on libraries (functools in Python, more recent versions of the C++ Standard Library, etc.) or frameworks (Apache Spark and the like). Their presence reflects the obvious truth that some kinds of operations occur more frequently than other ones.
In particular:
Map calls a function on each element of a sequence in order to transform:
their individual components (just like sqrt, round, or the unary `!` operator in R), or
the corresponding elements of many sequences so as to vectorise a given operation elementwisely (compare the binary `+` or paste),
Reduce (also called accumulate) applies a binary operation to combine consecutive elements in a sequence, e.g., to generate aggregates, either totally (compare sum, prod, all, max) or cumulatively (compare cumsum, cummin),
Filter creates a subset of a sequence that is comprised of elements that enjoy a given property (which we typically achieve in R by means of the `[` operator),
Find locates the first element that fulfils some logical condition (compare which),
and so forth.
Below we will only focus on the Map function. The inspection of the remaining ones is left as an exercise. This is because, oftentimes, we can be better off with their more R-ish versions (e.g., using the subsetting operator, `[`).
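As a starting point for that exercise, here are minimal calls to Reduce, Filter, and Find from base R, each accompanied by a more R-ish counterpart in a comment. The inputs are arbitrary and chosen only for illustration.

```r
Reduce(`+`, 1:5)  # ((((1+2)+3)+4)+5), like sum(1:5)
## [1] 15
Reduce(`+`, 1:5, accumulate=TRUE)  # like cumsum(1:5)
## [1]  1  3  6 10 15
Filter(function(v) v %% 2 == 0, 1:10)  # like (1:10)[(1:10) %% 2 == 0]
## [1]  2  4  6  8 10
Find(function(v) v > 3, c(1, 5, 2, 7))  # the first element greater than 3
## [1] 5
```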
7.2.4. Vectorising Functions with Map
In data-centric computing, we are frequently faced with tasks that involve processing each and every element in a sequence independently, one after another. Such use cases can benefit from vectorised operations like those discussed in Chapter 2, Chapter 3, and Chapter 6.
Most of the functions that we have introduced in the preceding parts, unfortunately, cannot be applied on lists. For instance, if we try calling sqrt on a list, we will get an error, even if it is a list of numeric vectors only. One way to compute the square root of all elements would be to invoke sqrt(unlist(...)). It is a go-to approach if we wish to treat all the list's elements as one sequence. But this comes at the price of losing the list's structure.
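Both behaviours can be observed in the sketch below. It uses tryCatch only to intercept the error so that the script can proceed; the example list is made up.

```r
x <- list(c(1, 4), c(9, 16, 25))
res <- tryCatch(sqrt(x), error=function(e) "error")  # sqrt does not accept lists
res
## [1] "error"
sqrt(unlist(x))  # works, but the grouping into two vectors is gone
## [1] 1 2 3 4 5
```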
We have also discussed some operations that are not vectorised with respect to all their arguments, even though they could have been designed this way, e.g., grepl.
The Map function[2] applies an operation on each element in a vector or the corresponding elements in a number of vectors. In many situations, it may be used as a more elegant alternative to for loops that we will introduce in the next chapter.
First[3], a call to Map(f, x) yields a list whose i-th element is equal to f(x[[i]]) (recall that `[[` works on atomic vectors too).
For example:
x <- list( # an example named list
    x1=1:3,
    x2=seq(0, 1, by=0.25),
    x3=c(1, 0, NA_real_, 0, 0, 1, NA_real_)
)
Map(sqrt, x) # x is named, hence the result will be named too
## $x1
## [1] 1.0000 1.4142 1.7321
##
## $x2
## [1] 0.00000 0.50000 0.70711 0.86603 1.00000
##
## $x3
## [1] 1 0 NA 0 0 1 NA
Map(length, x)
## $x1
## [1] 3
##
## $x2
## [1] 5
##
## $x3
## [1] 7
unlist(Map(mean, x)) # compute three aggregates, convert to an atomic vector
## x1 x2 x3
## 2.0 0.5 NA
Map(function(n) round(runif(n, -1, 1), 1), c(2, 4, 6)) # x is atomic now
## [[1]]
## [1] 0.4 0.8
##
## [[2]]
## [1] 0.5 0.8 -0.1 -0.7
##
## [[3]]
## [1] -0.3 0.0 0.5 1.0 -0.9 -0.7
Next, we can vectorise a given function over a number of parameters. A call to, e.g., Map(f, x, y, z) results in a list whose i-th element is equal to f(x[[i]], y[[i]], z[[i]]). Just like in the case of, e.g., paste, the recycling rule will be applied if necessary.
For example, the following generates list(seq(1, 6), seq(11, 13), seq(21, 29)):
Map(seq, c(1, 11, 21), c(6, 13, 29))
## [[1]]
## [1] 1 2 3 4 5 6
##
## [[2]]
## [1] 11 12 13
##
## [[3]]
## [1] 21 22 23 24 25 26 27 28 29
Moreover, we can get list(seq(1, 40, length.out=10), seq(11, 40, length.out=5), seq(21, 40, length.out=10), seq(31, 40, length.out=5)) by calling:
Map(seq, c(1, 11, 21, 31), 40, length.out=c(10, 5))
## [[1]]
## [1] 1.0000 5.3333 9.6667 14.0000 18.3333 22.6667 27.0000 31.3333
## [9] 35.6667 40.0000
##
## [[2]]
## [1] 11.00 18.25 25.50 32.75 40.00
##
## [[3]]
## [1] 21.000 23.111 25.222 27.333 29.444 31.556 33.667 35.778 37.889 40.000
##
## [[4]]
## [1] 31.00 33.25 35.50 37.75 40.00
Note
If we have some additional arguments to be passed to the function applied (which the function does not have to be vectorised over), we can wrap them inside a separate list and toss it via the MoreArgs argument (à la do.call).
unlist(Map(mean, x, MoreArgs=list(na.rm=TRUE))) # mean(..., na.rm=TRUE)
## x1 x2 x3
## 2.0 0.5 0.4
Alternatively, we can always construct a custom anonymous function:
unlist(Map(function(xi) mean(xi, na.rm=TRUE), x))
## x1 x2 x3
## 2.0 0.5 0.4
Here is an example list of files (see our teaching data repository) with daily Forex rates:
file_names <- c(
    "euraud-20200101-20200630.csv",
    "eurgbp-20200101-20200630.csv",
    "eurusd-20200101-20200630.csv"
)
Call Map to read each dataset with scan and determine the minimal, mean, and maximal value in each series.
Implement your own version of the Filter function based on a call to Map.
7.3. Accessing Third-Party Functions
When we indulge in the writing of a software piece, a few questions naturally arise. Is the problem we are facing fairly complex? Has it already been successfully addressed in its entirety? If not, can it, or its parts, be split into manageable chunks? Can it be constructed based on some readily available nontrivial components?
A smart developer is independent, but knows when to stand on the shoulders to cry on. Let us explore some ways in which we can reuse the existing function libraries.
7.3.1. Using R Packages
Most contributed R extensions come in the form of the so-called add-on packages, which can include:
reusable code (e.g., new functions),
data (which we can exercise on),
documentation (manuals, vignettes, etc.);
see Section 9.3.2 for some more and [47] for all the details.
Most packages are published in the moderated repository that is part of the Comprehensive R Archive Network (CRAN). However, there are also other popular sources such as Bioconductor which specialises in bioinformatics.
To fetch a package pkg from a repository (CRAN by default; see, however, the repos argument), we call install.packages("pkg").
A call to library("pkg") loads an indicated package and makes its exported objects available to the user (i.e., attaches it on the search list; see sec:to-do).
For instance, in one of the previous chapters, we have mentioned the gsl package:
# call install.packages("gsl") first
library("gsl") # load the package
poch(10, 3:6) # calls gsl_sf_poch() from GNU GSL
## [1] 1320 17160 240240 3603600
Here, poch is an object exported by package gsl. If we did not call library("gsl"), trying to access the former would result in an error.
We could also have accessed the above function without attaching it onto the object search list by using the pkg::object syntax, i.e., gsl::poch.
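The pkg::object notation can be tried on a package that ships with R. The sketch below also guards the call to gsl::poch with requireNamespace, so that it does not fail on machines where gsl has not been installed; no output is shown for that part as it depends on the local setup.

```r
stats::median(c(3, 1, 2))  # works regardless of whether stats is attached
## [1] 2

# use an add-on package only if it is installed, without attaching it:
if (requireNamespace("gsl", quietly=TRUE)) {
    print(gsl::poch(10, 3))
} else {
    message("gsl is not installed; call install.packages(\"gsl\") first")
}
```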
Use the find function to determine which packages define the following objects: mean, var, find, and Map. Recall from Section 1.4 where such information can be found in these objects' manual pages.
Note
For more information about any R extension, call help(package="pkg"). Also, it is a good idea to visit the package's CRAN entry at an address like https://CRAN.R-project.org/package=pkg to access some additional information (e.g., vignettes; see also vignette(package="pkg")). Why waste our time and energy by querying a web search engine that will lead us to some (usually low-quality) middleman when we can acquire authoritative knowledge directly from the source?
Moreover, it is worth exploring various CRAN Task Views that group the packages into topics such as Genetics, Graphics, and Optimisation. These are edited by experts in their relevant fields.
Important
Frequently, R packages are written in their respective authors' free time, many of whom are volunteers/public servants/enthusiasts who are neither paid for doing this nor is it part of their so-called job.
You can show appreciation for their generosity by, e.g., spreading the word about their software by citing them in publications (see citation(package="pkg")), talking about them during lunch time, or mentioning them in (un)social media. You can also help them improve the existing code base by reporting bugs, polishing documentation, proposing new features, or cleaning up the redundant fragments of their APIs.
Some readers will become one of them someday (when they come up with something useful for our community).
7.3.1.1. Default Packages
Note that the always-on package base is a must-have that provides us with the most crucial functions (vector addition, c, Map, library). Certain other packages are also loaded by default:
getOption("defaultPackages")
## [1] "datasets" "utils" "grDevices" "graphics" "stats"
## [6] "methods"
Although this list can, technically speaking, be changed, in this book we assume that the above are always attached, because it is reasonable to do so. This is why in Section 2.4.5, there was no need to call, for example, library("stats") before referring to the var and sd functions.
On a side note, grDevices and graphics will be discussed in sec:to-do and methods will be mentioned in sec:to-do. datasets brings a few example R objects that we can exercise our skills on. Functions from utils, graphics, and stats, on the other hand, already appeared here and there.
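Whether a package is currently attached can be checked by inspecting the search list. The sketch below assumes that the default set of packages has not been altered; the output of search() itself is omitted, because it depends on the session.

```r
search()  # the current search list (session-dependent)
"package:stats" %in% search()  # is stats attached?
## [1] TRUE
environmentName(environment(sd))  # the package where sd is defined
## [1] "stats"
```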
7.3.1.2. Source vs Binary Packages (*)
R is a free and open project, therefore its packages are published primarily in the source form, so that anyone can study how they work and improve them or reuse parts thereof in different projects.
If we call install.packages("path", repos=NULL, type="source"), we should be able to install a package from sources: path can either be pinpointing a directory or a source tarball (most often a compressed pkg_version.tar.gz file; see help("untar")).
Note that type="source" is the default unless one is on W****ws or some m**OS boxes; see getOption("pkgType"). This is because these two might require additional build tools to be present in the system, especially if a package features C, C++, or Fortran code; see Chapter 14 and Section C.3 of [49]:
Rtools on W****ws,
Xcode Command Line Tools on m**OS.
Because of these systems' being less developer-oriented, as a courtesy to their users, CRAN also distributes the platform-specific binary versions of the packages (.zip or .tgz files). install.packages will try to fetch them by default.
It is very easy to fetch a package's source directly from GitLab or GitHub, which are quite popular hosting platforms these days. At the time of writing this, the relevant links were, respectively:
https://gitlab.com/user/repo/-/archive/branch/repo-branch.zip
https://github.com/user/repo/archive/branch.zip
For example, to download the contents of the master branch in the repository rpackagedemo owned by gagolews, we can call:
f <- tempfile() # temporary file name - download destination
download.file("https://github.com/gagolews/rpackagedemo/archive/master.zip",
    destfile=f)
Next, the contents can be extracted with unzip:
t <- tempdir() # temporary directory to extract the files to
(d <- unzip(f, exdir=t)) # returns extracted file paths
The path where the files were extracted can be passed to install.packages:
install.packages(dirname(d)[1], repos=NULL, type="source")
file.remove(c(f, d)) # clean up
Use the git2r package to clone the git repository located at https://github.com/gagolews/rpackagedemo.git and install the package published therein from within the current R session.
7.3.2. Managing Dependencies (*)
The currently-installed add-on packages may be upgraded to their most recent versions available on CRAN (or other indicated repository) by calling update.packages.
As a general rule, the more experienced developers we become, the less excited we get about the new. Sure, bug fixes and some well-thought-out additional features are usually welcome, but just wait until an updated package API, for the n-th time, \(n\ge 2\), breaks our program that used to work flawlessly for so long.
Hence, when designing software projects (see Chapter 9 for more details), it is essential that we ask ourselves the ultimate question: do we really need to import that package with lots of dependencies from which we will use only about 3–5 functions? Wouldn't it be better to write our own version of some functionality (and learn something new, exercise our brain, etc.) or call a mature terminal-based tool?
Otherwise, as all the historical versions of all the packages are archived on CRAN, some software dependency management can easily be conducted by storing different versions of packages in different directories (only one version of a package can be loaded at a time though). This way, we can create some sort of an isolated environment for the add-ons.
To fetch the locations where packages are sought (in this very order), call:
.libPaths()
## [1] "/home/gagolews/R/x86_64-pc-linux-gnu-library/4.2"
## [2] "/usr/local/lib/R/site-library"
## [3] "/usr/lib/R/site-library"
## [4] "/usr/lib/R/library"
The same function can be used to add new folders to the search path; see also the environment variable R_LIBS_USER (e.g., help("Sys.setenv")). The install.packages function will honour them as target directories; see its lib parameter for more details.
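Below is a minimal sketch of how a project-specific library directory could be prepended to the search path. The directory name is hypothetical; a subfolder of the session's temporary directory is used so that nothing is left behind.

```r
project_lib <- file.path(tempdir(), "project-library")  # hypothetical location
dir.create(project_lib, showWarnings=FALSE)

old_paths <- .libPaths()
.libPaths(c(project_lib, old_paths))  # prepend: it will be searched first
first_path <- .libPaths()[1]
basename(first_path)
## [1] "project-library"

# install.packages("pkg", lib=project_lib) would now target this directory

.libPaths(old_paths)  # restore the previous setting
```

Note that .libPaths silently ignores directories that do not exist, which is why the folder is created beforehand.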
Moreover, the packages may deposit some auxiliary data on the user's machine. Therefore, it might be a good idea to set the following directories (via the corresponding environment variables) as relative to the current project:
tools::R_user_dir("pkg", "data") # R_USER_DATA_DIR
## [1] "/home/gagolews/.local/share/R/pkg"
tools::R_user_dir("pkg", "config") # R_USER_CONFIG_DIR
## [1] "/home/gagolews/.config/R/pkg"
tools::R_user_dir("pkg", "cache") # R_USER_CACHE_DIR
## [1] "/home/gagolews/.cache/R/pkg"
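The effect of one of these environment variables can be sketched as follows. The target directory is hypothetical (a subfolder of the session's temporary directory stands in for a project folder), and the code assumes a version of R in which tools::R_user_dir is available.

```r
project_data <- file.path(tempdir(), "project-data")  # hypothetical location
old_value <- Sys.getenv("R_USER_DATA_DIR", unset=NA)  # remember the old state

Sys.setenv(R_USER_DATA_DIR=project_data)
data_dir <- tools::R_user_dir("pkg", "data")  # now points inside project_data
startsWith(data_dir, project_data)
## [1] TRUE

# restore the previous state of the environment variable:
if (is.na(old_value)) Sys.unsetenv("R_USER_DATA_DIR") else
    Sys.setenv(R_USER_DATA_DIR=old_value)
```

R_USER_CONFIG_DIR and R_USER_CACHE_DIR can be handled in the same manner.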
7.3.3. Calling External Programs
Many tasks can naturally be accomplished by calling external programs. Such an approach is particularly natural on Unix-like systems, which classically follow modular, minimalist design patterns: there are many tools at a developer's hand and each tool is specialised at solving a single, well-defined problem.
Apart from the many standard Unix commands, we can consider, for example:
pandoc converts documents between markup formats, e.g., Markdown, reStructuredText, LaTeX, LibreOffice Writer, EPUB;
pdflatex, xelatex, and lualatex compile LaTeX documents to PDF;
convert (from ImageMagick) applies various operations on bitmap graphics (scaling, cropping, conversion between formats);
graphviz and PlantUML can be used to create various graphs and diagrams;
jupyter-nbconvert converts Jupyter notebooks (see Section 1.2.5) to other formats such as LaTeX, HTML, Markdown, etc.;
python, perl, … can be called to perform tasks that can be expressed more easily in languages other than R;
and so forth.
The good news is that R can not only be called from the shell (in an interactive or batch mode; see Section 1.2), but can also serve well as a glue language itself.
The system2 function can be used to invoke any system command. Communication between such programs can be done by means of, e.g., intermediate text, JSON, CSV, XML, or any other files. The stdin, stdout, and stderr arguments can be used to control the redirection of the standard I/O streams.
system2("pandoc", "-s input.md -o output.html")
system2("bash", "-c 'for i in `seq 1 2 10`; do echo $i; done'", stdout=TRUE)
## [1] "1" "3" "5" "7" "9"
system2("python3", "-", stdout=TRUE,
input=c(
"import numpy as np",
"print(repr(np.arange(5)))"
))
## [1] "array([0, 1, 2, 3, 4])"
Note that the current working directory can be read and changed by means of a call to getwd and setwd, respectively. By default, it is the directory from which the current R session was started.
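The file-based communication mentioned above can be sketched as follows. Here, a second R process launched via Rscript merely stands in for an arbitrary external tool. The file names come from tempfile, so the example does not depend on the current working directory. The column names x and y are made up for illustration.

```r
# A stand-in for an external tool: another R process run via Rscript.
rscript <- file.path(R.home("bin"), "Rscript")

# Intermediate files used to pass the data back and forth.
infile  <- tempfile(fileext=".csv")
outfile <- tempfile(fileext=".csv")
write.csv(data.frame(x=1:5), infile, row.names=FALSE)

# The code to be executed by the child process; deparse quotes the paths.
code <- sprintf(
    "d <- read.csv(%s); d$y <- d$x^2; write.csv(d, %s, row.names=FALSE)",
    deparse(infile), deparse(outfile)
)

# shQuote protects the expression from being interpreted by the shell.
status <- system2(rscript, c("--vanilla", "-e", shQuote(code)))

res <- read.csv(outfile)  # read the results produced by the child process
print(res)
```

The value returned by system2 (when the output is not captured) is the exit status of the command; by convention, zero indicates success.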
Important
Relying on system2 assumes that the commands referred to are available on the target platform. Hence, it might not be portable, unless additional assumptions are made (e.g., that a user runs some Unix system, that certain libraries are installed therein). We strongly recommend GNU/Linux or FreeBSD for both software development and production use, as they are free, open, developer-friendly, user-loving, reliable, ethical, and sustainable.
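One way to soften the portability issue is to check whether a command exists before invoking it. The following is only a sketch: run_if_available is a hypothetical helper, not a part of base R, whereas Sys.which is a base R function that returns an empty string for commands it cannot find.

```r
# Hypothetical helper: call a command only if it is present on the search path.
run_if_available <- function(cmd, args=character(0))
{
    if (!nzchar(Sys.which(cmd))) {
        warning(sprintf("command '%s' not found; skipping", cmd))
        return(NULL)
    }
    system2(cmd, args, stdout=TRUE)  # capture the standard output
}

out <- run_if_available("pandoc", "--version")
if (!is.null(out)) print(out[1])  # the first line of the output, if any
```

Whether falling back to NULL with a warning is appropriate depends on the application; in a batch pipeline, stopping with an error might be the safer choice.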
7.3.4. A Note on Interfacing C, C++, Python, Java, etc. (*)
Most stand-alone data processing algorithms are implemented in compiled, slightly lower-level programming languages. This usually makes them faster and more re-usable in other environments. For instance, it is often the case that an industry-standard library is written in very portable C, C++, or Fortran and has some bindings available for easier access from within R, Python, Julia, etc. This is the case with FFTW, LIBSVM, mlpack, OpenBLAS, ICU, and GNU GSL, amongst many others.
For basic ways to interact with such compiled code, see Chapter 14.
Also, the rJava package can be used to dynamically create JVM objects and access their fields and methods. Similarly, reticulate can be used to access Python objects, including numpy arrays and pandas data frames (but see also the rpy2 package for Python).
Important
We should not feel obliged to use R in all the parts of a data processing pipeline. Some activities can be expressed more naturally in other languages/environments (e.g., parse raw data and create an SQL database in Python, but visualise it in R). We can use other tools as the glue language (including R, Python, or Bash) that will steer the data flow in the right direction.
7.4. Exercises
Answer the following questions:
What is the result of "x <- 2; x <- function(x) x^2; x(x)"?
How to write a function that returns two objects?
What is a higher-order function?
What are the use cases of do.call?
Why is a call to Map not necessary in the expression "Map(paste, x, y, z)"?
What is the difference between Map(mean, x, na.rm=TRUE) and Map(mean, x, MoreArgs=list(na.rm=TRUE))?
What do we mean when we write stringx::sprintf?
How to get access to the vignettes (tutorials, FAQs, etc.) of the data.table and dplyr packages? Why would perhaps 95% of R users just google it, and what is sub-optimal about this strategy?
What is the difference between a source and a binary package?
How to update the base package?
How to ensure that we will always run an R session with only specific versions of a set of packages?
Write a function that computes the Gini index of a vector of positive integers x, which, assuming \(x_1\le x_2\le\dots\le x_n\), is equal to:
Implement a function between(x, a, b) that verifies whether each element in x is in the [a, b] interval or not. Return a logical vector of the same length as x. Make sure the function is correctly vectorised with respect to all the arguments and handles missing data correctly.
Write your own version of the strrep function called dup.
dup <- ...to.do...
dup(c("a", "b", "c"), c(1, 3, 5))
## [1] "a" "bbb" "ccccc"
dup("a", 1:3)
## [1] "a" "aa" "aaa"
dup(c("a", "b", "c"), 4)
## [1] "aaaa" "bbbb" "cccc"
Given a list x, generate its sublist with all the elements equal to NULL removed.
Implement your own version of the built-in sequence function.
Using Map, how can we generate window indexes like:
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 2 3 4
##
## [[3]]
## [1] 3 4 5
##
## [[4]]
## [1] 4 5 6
Write a function windows(k, n) that yields index windows of length k with elements between 1 and n (the above example is for k=3 and n=6).
Implement a function movstat(f, x, k) that computes, using Map, a given aggregate f of each k consecutive elements in x. For instance:
movstat <- ...to.do...
x <- c(1, 3, 5, 10, 25, -25) # example data
movstat(mean, x, 3) # 3-moving mean
## [1] 3.0000 6.0000 13.3333 3.3333
movstat(median, x, 3) # 3-moving median
## [1]  3  5 10 10
Write a function to extract all q-grams, q ≥ 1, from a given character vector. Return a list of character vectors. For example, 2-grams (bigrams) in "abcd" are: "ab", "bc", "cd".
Recode a character vector with a small number of distinct values to a vector where each unique code is assigned a positive integer from 1 to k. Example calls and the corresponding expected results:
recode <- ...to.do...
recode(c("a", "a", "a", "b", "b"))
## [1] 1 1 1 2 2
recode(c("x", "z", "y", "x", "y", "x"))
## [1] 1 3 2 1 2 1
Implement a function that returns the number of occurrences of each unique element in a given atomic vector. The return value should be a numeric vector equipped with a names attribute.
count <- ...to.do...
count(c(5, 5, 5, 5, 42, 42, 954))
## 5 42 954
## 4 2 1
count(c("x", "z", "y", "x", "y", "x", "w", "x", "x", "y", NA_character_))
## w x y z <NA>
## 1 5 3 1 1
Hint: use match and tabulate.
Implement a function that extends upon the built-in duplicated, indicating which occurrence (starting from the beginning of the vector) of a repeated value a given value constitutes.
duplicatedn <- ...to.do...
duplicatedn(c("a", "a", "a", "b", "b"))
## [1] 1 2 3 1 2
duplicatedn(c("x", "z", "y", "x", "y", "x", "w", "x", "x", "y", "z"))
## [1] 1 1 1 2 2 3 1 4 5 3 2
Based on a call to Map, implement a function my_split such that, given a vector x and an atomic vector y of the same length as x, my_split(x, y) yields the same result as split(x, y).
Extend my_split to handle the second argument being a list of the form list(y1, y2, ...) that represents the product of many levels. If the ys are of different lengths, apply the recycling rule.
Implement my_unsplit, your own version of the built-in unsplit. Make sure that my_unsplit(split(x, g), g) == x holds for x and g of the same length.
Write a function that takes as arguments: (a) an integer n, (b) a numeric vector x of length k with no duplicated elements, (c) a vector of probabilities p of length k; verify that \(p_i\ge 0\) for all \(i\) and \(\sum_{i=1}^k p_i \simeq 1\). Based on a random number generator from the uniform distribution on the unit interval, generate n independent realisations of a random variable \(X\) such that \(\Pr(X=x_i)=p_i\) for \(i=1,\dots,k\).
Hint: to obtain a single value:
generate \(u\in[0,1]\),
find \(m\in\{1,\dots,k\}\) such that \(u\in\left(\sum_{j=1}^{m-1} p_{j}, \sum_{j=1}^m p_{j}\right]\),
the result is then \(x_m\).
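To illustrate the hint on made-up numbers (they are not part of the exercise): take \(k=3\), \(p=(0.2, 0.5, 0.3)\), and suppose that the generator returned \(u=0.65\). The cumulative sums of \(p\) are:

\[
\sum_{j=1}^{1} p_j = 0.2, \qquad \sum_{j=1}^{2} p_j = 0.7, \qquad \sum_{j=1}^{3} p_j = 1.0.
\]

As \(0.65\in(0.2, 0.7]\), we get \(m=2\), and the generated value is \(x_2\).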
Write a function that takes as arguments: (a) an increasingly sorted vector x of length n, (b) any vector y of length n, (c) a vector z of length k and elements in \([x_1,x_n)\). Let \(f\) be the piecewise linear spline that interpolates the points \((x_1,y_1),\dots,(x_n,y_n)\). Return a vector w of length k such that \(w_i=f(z_i)\).
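As a reminder of what such an interpolant looks like (the index \(j\) is introduced here only for illustration): if \(z_i\in[x_j, x_{j+1})\) for some \(j\in\{1,\dots,n-1\}\), then the standard linear interpolation formula gives:

\[
f(z_i) = y_j + \frac{y_{j+1}-y_j}{x_{j+1}-x_j}\,(z_i - x_j).
\]

In particular, \(f(x_j)=y_j\), and the value moves linearly towards \(y_{j+1}\) as \(z_i\) approaches \(x_{j+1}\).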
(*) Write functions dpareto, ppareto, qpareto, and rpareto that implement the basic functions related to the Pareto distribution; compare Section 2.3.4.