14. Interfacing compiled code (**)#
The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug/typos reports/fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [26].
R is an effective glue language. It is suitable for composing whole data wrangling pipelines: from data import through processing, analysis, and visualisation to export. It makes using and connecting larger building blocks very convenient.
R is also a competent tool for developing quick and dirty prototypes of standalone, general-purpose algorithms, especially if they are of numerical nature. Nevertheless, for performance reasons, we may[1] consider moving computing-intensive tasks to the C or C++ level[2]. It is particularly the case when we need a method that:
has higher memory or time complexity than its straightforward implementation when it is programmed using vectorised R functions,
has an iterative or recursive nature, e.g., involving unvectorisable for or while loops,
relies on complicated dynamic data structures (e.g., hash maps, linked lists, or trees),
needs methods provided elsewhere and not available in R (e.g., other C or C++ libraries).
In the current chapter, we will demonstrate that R works very well as a user-friendly interface to compiled code.
This topic is overall very technical. The definitive reference is [62]; but see also Chapter 11 of [10]. Furthermore, R’s source code provides many working examples of how to deal with R objects in C.
Here, we will only cover the most important basics.
We will focus on writing or interfacing portable[3]
function libraries that only rely on simple[4] data structures
(e.g., arrays of the type double
and int
).
Thanks to this, we will be able to reuse
them in other environments such as Python (e.g., via Cython) or Julia.
Let us remember that R is one of many languages out there.
We assume basic knowledge of the C language; see [38]. The reader can skip this chapter now and return to it later. The remaining material is not contingent on the current one.
From now on, we take for granted that our environment can successfully build a source package with C code, as mentioned in Section 7.3.1.2. In particular, W****ws and m**OS should have, respectively, RTools and Xcode installed.
Note
To avoid ambiguity, in the main text, calls to C functions will be prefixed
with “C::
”; e.g.,
C::
spanish_inquisition()
.
14.1. C and C++ code in R#
14.1.1. Source files for compiled code in R packages#
Perhaps the most versatile way to interact with portable C code is via standalone R packages. For the purpose of the current chapter, we created a demo project available at https://github.com/gagolews/cpackagedemo.
Inspect the structure of cpackagedemo.
Note that C source files are located in the src/
subdirectory.
Build and install the package. Call my_sum in R on some
numeric vector.
The package provides an R interface to one function,
C::
my_c_sum, written in the
most portable fashion possible.
Its declaration is included in the src/cfuns.h
file:
#ifndef __CFUNS_H
#define __CFUNS_H
#include <stddef.h>
double my_c_sum(const double* x, size_t n);
#endif
The function accepts a pointer to the start of a numeric sequence
and its size, which is a standard[5] way of representing
an array of double
s.
Its definition is given in src/cfuns.c
. We see that it is
nothing more than a simple sum of all the elements in an array:
#include "cfuns.h"
/* computes the sum of all elements in an array x of size n */
double my_c_sum(const double* x, size_t n)
{
double s = 0.0;
for (size_t i = 0; i < n; ++i) {
/* this code does not treat potential missing values specially
(they are kinds of NaNs); to fix this, add:
if (ISNA(x[i])) return NA_REAL; // #include <R.h> */
s += x[i];
}
return s;
}
To make C::
my_c_sum available in R,
we will have to introduce a wrapper around it that works
with the data structures from the first part of this jolly book.
We know that an R function accepts objects of any kind
on input and yields anything as a result.
In the next section, we will explain that we get access
to R objects via special pointers of the type SEXP
(S expressions).
And so we declare our R-callable wrapper in src/rfuns.h
:
#ifndef __RFUNS_H
#define __RFUNS_H
#include <R.h>
#include <Rinternals.h>
#include <Rmath.h>
SEXP my_c_sum_wrapper(SEXP x);
#endif
The actual definition is included in src/rfuns.c
:
#include "rfuns.h"
#include "cfuns.h"
/* a wrapper around my_c_sum callable from R */
SEXP my_c_sum_wrapper(SEXP x)
{
double s;
if (!Rf_isReal(x)) {
/* the caller is expected to prepare the arguments
(doing it at the C level is tedious work) */
Rf_error("`x` should be a vector of type 'double'");
}
/* Exercise: consider missing value (NA) checking
as my_c_sum doesn't do it */
s = my_c_sum(REAL(x), (size_t)XLENGTH(x));
return Rf_ScalarReal(s);
}
The arguments can be, technically speaking, prepared at the C level.
For instance, if `x
` turned
out to be an integer vector, we could convert it to the double
one
(these are two different types; see Section 6.4.1).
Nevertheless, overall, it is very burdensome. It is better to use pure R code
to ensure that the arguments are of the correct form and to beautify
the outputs.
This explains why we only assert the enjoyment of
C::
Rf_isReal(x)
.
It guarantees that the C::
REAL and
C::
XLENGTH
functions correctly return the pointer to the start of the sequence
and its length, respectively.
Once C::
my_c_sum is called,
we must convert it to an R object. Here, it is a newly allocated
numeric vector of length one. It can be done easily by calling
C::
Rf_ScalarReal.
Although optional (see Section 5.4 of [62]),
we will register C::
my_c_sum_wrapper
as a callable function explicitly. This way, R will not struggle to find
the specific entry point in the resulting dynamically linked library (DLL).
We do this in src/cpackagedemo.c
:
#include <R_ext/Rdynload.h>
#include "rfuns.h"
/* the list of functions available in R via a call to .Call():
each entry is like {exported_name, fun_pointer, number_of_arguments} */
static const R_CallMethodDef cCallMethods[] = {
{"my_c_sum_wrapper", (DL_FUNC)&my_c_sum_wrapper, 1},
{NULL, NULL, 0} // the end of the list
};
/* registers the list of callable functions */
void R_init_cpackagedemo(DllInfo *dll)
{
R_registerRoutines(dll, NULL, cCallMethods, NULL, NULL);
R_useDynamicSymbols(dll, FALSE);
}
The function can be invoked from R using .Call.
We do it in R/my_sum.R
:
my_sum <- function(x)
{
# prepare input data:
if (!is.double(x))
x <- as.double(x)
s <- .Call("my_c_sum_wrapper", x, PACKAGE="cpackagedemo")
# some rather random postprocessing:
attr(s, "what") <- deparse(substitute(x))
s
}
And, finally, here is the package NAMESPACE
file responsible
for registering the exported R names and indicating the DLL to use:
export(my_sum)
useDynLib(cpackagedemo)
Once the package is built and installed
(e.g., by running “R CMD INSTALL <pkgdir>
” in the
terminal), we can test it by calling:
library("cpackagedemo")
my_sum(runif(100)/100)
## [1] 0.49856
## attr(,"what")
## [1] "runif(100)/100"
Extend the package (locally) by adding a function to compute the index of the greatest element in a numeric vector. Note that C uses 0-based array indexing, whereas, in R, the first element is at index 1. Compare its run time against which.max using proc.time.
14.1.2. R CMD SHLIB
#
The “R CMD SHLIB <files>
” command compiles one or more source files
without the need for turning them into standalone packages;
see [62]. Then, dyn.load loads the resulting DLL.
(*) Compile src/cfuns.c
and src/rfuns.c
from our
demo package using “R CMD SHLIB <files>
”. Call dyn.load.
Write an R function that uses .Call to invoke
C::
my_c_sum_wrapper from the second source file.
The direct SHLIB
approach is convenient for learning C programming,
including running simple examples. We will thus use it
for didactic reasons in this chapter.
The inst/examples/csource.R
file in our demo package
includes the implementation of an R function called csource.
It compiles a given C source file, loads the resulting DLL, and
extracts and executes a designated R code chunk
(preferably defining a function that refers to .Call).
Here is an example source file, inst/examples/helloworld.c
in the cpackagedemo source code repository:
// the necessary header files are automatically included by `csource`
SEXP C_hello()
{
Rprintf("The mill's closed. There's no more work. We're destitute.\n"
"I'm afraid I've no choice but to sell you all "
"for scientific experiments.\n");
return R_NilValue;
}
/* R
# this chunk will be extracted and executed by `csource`.
hello <- function()
invisible(.Call("C_hello", PACKAGE="helloworld"))
R */
Let us compile it and call the R function defined above.
source("~/R/cpackagedemo/inst/examples/csource.R") # defines csource
csource("~/R/cpackagedemo/inst/examples/helloworld.c")
hello()
## The mill's closed. There's no more work. We're destitute.
## I'm afraid I've no choice but to sell you all for scientific experiments.
(*)
C++ is also supported. It can be thought of as a superset of the C language,
but the devil is in the detail. Change the name of the above file to
helloworld2.cpp
, add “extern "C"
before the function
declaration, and compile it.
(*) Verify that C and C++ source files can coexist in R packages.
(*) It might be very educative to study the implementation of csource. By now, we should be able to author such functions ourselves (a few hours’ worth of work), let alone read with understanding.
# compiles a C or C++ source file using R CMD SHLIB,
# loads the resulting DLL, and executes the embedded R code
csource <- function(
fname,
libname=regmatches(basename(fname),
regexpr("[^.]*(?=\\..*)", basename(fname), perl=TRUE)),
shlibargs=character(),
headers=paste0(
"#include <R.h>\n",
"#include <Rinternals.h>\n",
"#include <Rmath.h>\n"
),
R=file.path(R.home(), "bin/R")
) {
stopifnot(file.exists(fname))
stopifnot(is.character(libname), length(libname) == 1)
stopifnot(is.character(shlibargs))
stopifnot(is.character(headers))
stopifnot(is.character(R), length(R) == 1)
# read the source file:
f <- paste(readLines(fname), collapse="\n")
# set up output file names:
tmpdir <- normalizePath(tempdir(), winslash="/") # tempdir on Win uses \
dynlib_ext <- .Platform[["dynlib.ext"]]
libpath <- file.path(tmpdir, sprintf("%s%s", libname, dynlib_ext))
cfname <- file.path(tmpdir, basename(fname))
rfname <- sub("\\..*?$", ".R", cfname, perl=TRUE)
# separate the /* R ... <R code> ... R */ chunk from the source file:
rpart <- regexec("(?smi)^/\\* R\\s?(.*)R \\*/$", f, perl=TRUE)[[1]]
rpart_start <- rpart
rpart_len <- attr(rpart, "match.length")
if (rpart_start[1] < 0 || rpart_len[1] < 0)
stop("enclose R code between /* R ... and ... R */")
rcode <- substr(f, rpart_start[2], rpart_start[2]+rpart_len[2]-1)
cat(rcode, file=rfname, append=FALSE)
# write the C/C++ file:
ccode <- paste(
headers,
substr(f, 1, rpart_start[1]-1),
substr(f, rpart_start[1]+rpart_len[1], nchar(f)),
collapse="\n"
)
cat(ccode, file=cfname, append=FALSE)
# prepare the "R CMD SHLIB ..." command:
shlibargs <- c(
"CMD", "SHLIB",
sprintf("-o %s", libpath),
cfname,
shlibargs
)
# compile and load the DLL, run the extracted R script:
retval <- FALSE
oldwd <- setwd(tmpdir)
tryCatch({
if (libpath %in% sapply(getLoadedDLLs(), `[[`, "path"))
dyn.unload(libpath)
stopifnot(system2(R, shlibargs) == 0) # 0 == success
dyn.load(libpath)
source(rfname)
retval <- TRUE
}, error=function(e) {
cat(as.character(e), file=stderr())
})
setwd(oldwd)
if (!retval) stop("error compiling file")
invisible(TRUE)
}
14.2. Handling basic types#
14.2.1. SEXPTYPE
s#
All R objects are stored as instances of the C language structure
SEXPREC
. Usually, we access them via
pointers, which are of the type SEXP
(S expression).
C functions referred to via .Call
take the very generic SEXP
s on input. They output some SEXP
.
One of the said structure’s fields represents
the actual R object type (SEXPTYPE
numbers);
see Table 14.1 for a selection.
|
Type in R (typeof) |
Test in C |
---|---|---|
|
|
Rf_isNull |
|
|
TYPEOF |
|
|
Rf_isLogical |
|
|
Rf_isInteger |
|
|
Rf_isReal |
|
|
Rf_isComplex |
|
|
Rf_isString |
|
|
Rf_isVectorList |
|
|
TYPEOF |
|
|
TYPEOF |
To illustrate that any R object is available as a SEXP
, let us consider
inst/examples/sexptype.c
from cpackagedemo:
SEXP C_test_sexptype(SEXP x)
{
Rprintf("type of x: %s (SEXPTYPE=%d)\n",
Rf_type2char(TYPEOF(x)),
(int)TYPEOF(x)
);
return R_NilValue;
}
/* R
test_sexptype <- function(x)
invisible(.Call("C_test_sexptype", x, PACKAGE="sexptype"))
R */
Example calls:
csource("~/R/cpackagedemo/inst/examples/sexptype.c")
test_sexptype(1:10)
## type of x: integer (SEXPTYPE=13)
test_sexptype(NA)
## type of x: logical (SEXPTYPE=10)
test_sexptype("spam")
## type of x: character (SEXPTYPE=16)
We should refer to particular SEXPTYPE
s
via their descriptive names (constants; e.g., STRSXP
),
not their numeric identifiers (e.g., 16);
see Section 1.1 of [65] for the complete list[6].
14.2.2. Accessing elements in simple atomic vectors#
Table 14.2 gives the most important vector-like
SEXPTYPE
s (atomic and generic), their corresponding C
types and the functions to access the underlying array pointers.
A call to C::
XLENGTH returns
the length of a given sequence.
We have already seen an example function that processes a numeric vector;
see C::
my_c_sum_wrapper above.
Let us stress that writing functions that accept only int
and double
array pointers and their lengths makes them easily reusable in other
environments. In many data analysis applications, we do not need much more.
|
Array element type |
Pointer access |
---|---|---|
|
|
RAW |
|
|
LOGICAL |
|
|
INTEGER |
|
|
REAL |
|
|
COMPLEX |
|
|
(not directly) |
|
|
(not directly) |
|
|
CHAR |
Important
With raw, logical, integer, floating-point, and complex vectors,
we get direct access to data that might be shared amongst many objects
(compare Section 16.1.4).
SEXPREC
s are simply passed by pointers (since SEXP
s are
exactly them). We must thus refrain[7]
from modifying the objects passed as function parameters.
See below for ways to create new vectors, e.g., for storing auxiliary
or return values.
Consider inst/examples/sharedmem.c
:
SEXP C_test_sharedmem(SEXP x)
{
if (!Rf_isReal(x) || XLENGTH(x) == 0)
Rf_error("`x` should be a non-empty vector of type 'double'");
REAL(x)[0] = REAL(x)[0]+1; // never do it; always make a copy;
// the underlying array `x` may be shared by many objects
return R_NilValue;
}
/* R
test_sharedmem <- function(x)
invisible(.Call("C_test_sharedmem", x, PACKAGE="sharedmem"))
R */
And now, calling the above function on an example vector:
csource("~/R/cpackagedemo/inst/examples/sharedmem.c")
y <- 1
z <- y
test_sharedmem(y)
print(z)
## [1] 2
modifies y
and z
in place! It is not the same semantics as
the one we are used to when interacting with R. Hence, we
must always make a copy.
14.2.3. Representation of missing values#
Most languages do not support the notion of missing values out of the box. Therefore, in R, they have to be emulated. Table 14.3 lists the relevant constants and the conventional ways for testing for missingness.
|
Missing value |
Testing |
---|---|---|
|
(none) |
(none) |
|
|
|
|
|
|
|
|
ISNA |
|
a pair of |
ISNA |
|
|
|
In logical and integer vectors, NA
s are represented
as the smallest 32-bit signed integer. Thus, we need to be careful
when performing any operations on these types:
testing for missingness must be performed first.
The case of double
s is slightly less irksome, for a missing
value is represented as a special not-a-number.
Many arithmetic operations on NaN
s
return NaN
s as well, albeit there is no guarantee[8]
that they will be of precisely the same type as NA_REAL
. Thus, manual
testing for the missingness is also advised.
The inst/examples/mean_naomit.c
file defines a function
to compute the arithmetic mean of an int
or a double
vector:
SEXP C_mean_naomit(SEXP x)
{
double ret = 0.0;
size_t k = 0;
if (Rf_isInteger(x)) {
const int* xp = INTEGER(x);
size_t n = XLENGTH(x);
for (size_t i=0; i<n; ++i)
if (xp[i] != NA_INTEGER) { // NOT: ISNA(xp[i])
ret += (double)xp[i];
k++;
}
}
else if (Rf_isReal(x)) {
const double* xp = REAL(x);
size_t n = XLENGTH(x);
for (size_t i=0; i<n; ++i)
if (!ISNA(xp[i])) { // NOT: xp[i] == NA_REAL
ret += xp[i];
k++;
}
}
else
Rf_error("`x` should be a numeric vector");
return Rf_ScalarReal((k>0)?(ret/(double)k):NA_REAL);
}
/* R
mean_naomit <- function(x)
{
if (!is.numeric(x)) # neither integer nor double
x <- as.numeric(x) # convert to double
.Call("C_mean_naomit", x, PACKAGE="mean_naomit")
}
R */
There is some inherent code duplication: int
and double
are
distinct types, and they need to be handled separately. Some tests:
csource("~/R/cpackagedemo/inst/examples/mean_naomit.c")
mean_naomit(c(1L, NA_integer_, 3L, NA_integer_, 5L))
## [1] 3
mean_naomit(rep(NA_real_, 10))
## [1] NA
Implement all and any in C.
14.2.4. Memory allocation#
R features a simple yet effective garbage collector that relies on reference counting. Occasionally[9], memory blocks deemed no longer reachable are either freed or marked as reusable.
To allocate a new vector of length one and set
its only element, we can call
C::
ScalarLogical,
C::
ScalarInteger,
C::
ScalarReal, etc.
We have already used these functions for returning
R “scalars”.
Vectors of arbitrary lengths can be created using
C::
Rf_allocVector(sexptype, size)
.
Note that this function does not initialise the elements of
logical and numeric sequences (amongst others). They will need
to be set manually after creation.
Important
We must protect all allocated vectors from garbage
collection when we use them inside our functions.
For this purpose, R maintains a stack[10] of objects guarded
against premature annihilation.
C::
PROTECT(sexp)
pushes a given object pointer onto the top of the list.
C::
UNPROTECT(n)
pops the last n elements therefrom (in a last-in-first-out manner).
At the end of a .Call, R checks if the number
of protects matches that of un-protects. If it is not the case,
a warning is generated.
Protection is not needed:
for arguments to functions referred to by .Call, as they are already in use and hence protected;
for objects assigned as list or character vectors’ elements using C
::
SET_VECTOR_ELT and C::
SET_STRING_ELT (see below); when the container is protected, so are its components;when we return the allocated vector to R immediately after creating it (like in return Rf_ScalarReal
(val)
in a C function invoked by .Call).
Here is a function to compute the square of each element in a numeric vector. Note that the new vector must be protected from garbage collection while data are being prepared.
SEXP C_square1(SEXP x)
{
// no need to call PROTECT(x), it is already in use
if (!Rf_isReal(x)) Rf_error("`x` should be a vector of type 'double'");
size_t n = XLENGTH(x);
const double* xp = REAL(x);
SEXP y = PROTECT(Rf_allocVector(REALSXP, n)); // won't be GC'd
double* yp = REAL(y);
for (size_t i=0; i<n; ++i) {
if (ISNA(xp[i])) yp[i] = xp[i];
else yp[i] = xp[i]*xp[i];
}
UNPROTECT(1); // pops 1 object from the protect stack;
// does not trigger garbage collection, so we can return `y` now
return y;
}
/* R
square1 <- function(x)
{
if (!is.double(x)) x <- as.double(x)
.Call("C_square1", x, PACKAGE="square1")
}
R */
Alternatively, in this case, we may use
C::
Rf_duplicate:
SEXP C_square2(SEXP x)
{
if (!Rf_isReal(x)) Rf_error("`x` should be a vector of type 'double'");
x = PROTECT(Rf_duplicate(x)); // OK; it's a pointer
size_t n = XLENGTH(x);
double* xp = REAL(x);
for (size_t i=0; i<n; ++i)
if (!ISNA(xp[i])) xp[i] = xp[i]*xp[i];
UNPROTECT(1);
return x;
}
/* R
square2 <- function(x)
{
if (!is.double(x)) x <- as.double(x)
.Call("C_square2", x, PACKAGE="square2")
}
R */
Some tests:
csource("~/R/cpackagedemo/inst/examples/square1.c")
square1(c(-2, -1, 0, 1, 2, 3, 4, NA_real_))
## [1] 4 1 0 1 4 9 16 NA
csource("~/R/cpackagedemo/inst/examples/square2.c")
square2(c(-2, -1, 0, 1, 2, 3, 4, NA_real_))
## [1] 4 1 0 1 4 9 16 NA
We can claim some auxiliary memory from the heap
during a function’s runtime using the well-known
C::
malloc (or new in C++).
We are of course fully responsible for releasing it via
C::
free (or delete).
There is also C::
R_alloc(n, size)
which allocates n*size
bytes. The memory will automatically be
garbage-collected at the end of a .Call.
Here is our version of the which function.
SEXP C_which1(SEXP x)
{
if (!Rf_isLogical(x)) Rf_error("`x` should be of type 'logical'");
size_t n = XLENGTH(x), i, k;
const int* xp = LOGICAL(x);
size_t* d = (size_t*)malloc(n*sizeof(size_t)); // conservative size
if (!d) Rf_error("memory allocation error");
for (i=0, k=0; i<n; ++i)
if (xp[i] != NA_LOGICAL && xp[i])
d[k++] = i;
SEXP y = PROTECT(Rf_allocVector(REALSXP, k));
double* yp = REAL(y); // yes, type is double - ready for long vectors
for (i=0; i<k; ++i)
yp[i] = (double)d[i]+1; // R uses 1-based indexing
UNPROTECT(1);
free(d); // we don't want any memory leaks!
return y;
}
/* R
which1 <- function(x)
{
if (!is.logical(x)) x <- as.logical(x)
.Call("C_which1", x, PACKAGE="which1")
}
R */
Some tests:
csource("~/R/cpackagedemo/inst/examples/which1.c")
which1(c(TRUE, FALSE, TRUE, NA, TRUE))
## [1] 1 3 5
Note that R’s which returns either an int
or a double
vector, depending on the size of the input vector
(whether it is shorter than \(2^{31}-1\)). As an exercise,
rewrite the above to take that into account: integer arithmetic is faster.
(*) If we do not like that we are potentially wasting too much
memory in the case of sparse logical vectors, we can rely on
dynamically growable arrays. Below is a C++ rewrite of the
above function using deque
(double-ended queue)
from the language’s standard library.
#include <deque>
extern "C" SEXP C_which2(SEXP x)
{
if (!Rf_isLogical(x)) Rf_error("`x` should be of type 'logical'");
size_t n = XLENGTH(x), i;
const int* xp = LOGICAL(x);
std::deque<size_t> d;
for (i=0; i<n; ++i)
if (xp[i] != NA_LOGICAL && xp[i])
d.push_back(i);
SEXP y = PROTECT(Rf_allocVector(REALSXP, d.size()));
double* yp = REAL(y); // ready for long vectors
i=0;
for (size_t k : d)
yp[i++] = (double)k+1; // R uses 1-based indexing
UNPROTECT(1);
return y;
}
/* R
which2 <- function(x)
{
if (!is.logical(x)) x <- as.logical(x)
.Call("C_which2", x, PACKAGE="which2")
}
R */
Example calls:
csource("~/R/cpackagedemo/inst/examples/which2.cpp")
x <- (runif(10) > 0.5)
stopifnot(which(x) == which1(x))
stopifnot(which(x) == which2(x))
Alternatively, we could use C::
realloc
to extend an initially small buffer allocated using
C::
malloc by, say, 50%
whenever it is about to overflow.
14.2.5. Lists#
For safety reasons[11], we do not get access to the underlying pointers
in lists and character vectors. List items can be read by calling
C::
VECTOR_ELT(x, index)
and can be set with
C::
SET_VECTOR_ELT(x, index, newval)
.
Note that lists (VECSXP
s) are comprised of SEXP
s of any type.
Hence, after extracting an element, its SEXPTYPE
needs to be tested
using one of the functions listed in Table 14.1.
This is tiresome.
Here is a rather useless function that fetches the first and the last element in a given numeric vector or a list. However, if the latter case, we apply the function recursively on all its elements.
SEXP C_firstlast(SEXP x)
{
if (!Rf_isVector(x) || XLENGTH(x) == 0)
Rf_error("`x` must be a non-empty vector (atomic or generic)");
else if (Rf_isReal(x)) {
SEXP y = PROTECT(Rf_allocVector(REALSXP, 2));
REAL(y)[0] = REAL(x)[0]; // first
REAL(y)[1] = REAL(x)[XLENGTH(x)-1]; // last
UNPROTECT(1);
return y;
}
else if (Rf_isVectorList(x)) {
SEXP y = PROTECT(Rf_allocVector(VECSXP, 2));
// VECTOR_ELT(x, i) is PROTECTed by the container;
// SET_VECTOR_ELT does not trigger GC, so no need to call PROTECT
// on the result of C_firstlast(...) in this context
SET_VECTOR_ELT(y, 0, C_firstlast(VECTOR_ELT(x, 0)));
SET_VECTOR_ELT(y, 1, C_firstlast(VECTOR_ELT(x, XLENGTH(x)-1)));
UNPROTECT(1);
return y;
}
else
Rf_error("other cases left as an exercise");
return R_NilValue; // avoid compiler warning
}
/* R
firstlast <- function(x)
.Call("C_firstlast", x, PACKAGE="firstlast")
R */
Testing:
csource("~/R/cpackagedemo/inst/examples/firstlast.c")
firstlast(c(1, 2, 3))
## [1] 1 3
firstlast(list(c(1, 2, 3), c(4, 5), 6))
## [[1]]
## [1] 1 3
##
## [[2]]
## [1] 6 6
firstlast(list(c(1, 2, 3), 4, 5, list(6, c(7, 8), c(9, 10, 11))))
## [[1]]
## [1] 1 3
##
## [[2]]
## [[2]][[1]]
## [1] 6 6
##
## [[2]][[2]]
## [1] 9 11
Implement a C function that returns the longest vector in a given list.
Use C::
Rf_isVector
to check whether a given object is an atomic or a generic vector,
and hence if C::
XLENGTH
can be called thereon.
Inscribe your version of unlist. Consider scanning the input list twice. First, determine the size of the output vector. Second, fill the return object with un-listed values.
Write a C function that takes a list of numeric vectors of identical lengths on input. Return their elementwise sum: the first element of the output should be the sum of the first elements in every vector, and so forth.
14.2.6. Character vectors and individual strings (*)#
Character vectors (STRSXP
s) are similar to VECSXP
s
except that they only feature individual strings. The latter are
represented using a separate data type, CHARSXP
.
C::
STRING_ELT(x, index)
and
C::
SET_STRING_ELT(x, index, newval)
work as the element getters and setters.
Important
If we are not interested in text processing but rather in handling categorical data or defining grouping variables, we should consider converting character vectors to factors before issuing a .Call. Comparing small integers is much faster than strings; see below for more details.
Because of R’s string cache, there are no duplicate strings
in the memory. However, this feature could only be guaranteed
by making data in CHARSXP
s read-only. We can access
the underlying const char*
pointer by calling
C::
CHAR(s)
.
As typical in C, a string is terminated by byte 0.
Note
R strings may be of different encodings; compare Section 6.1.1. For portability and peace of mind, it is best to preprocess the arguments to .Call using enc2utf8, which converts all strings to UTF-8[12].
Despite being the most universal encoding, UTF-8 does not
represent a single code point using a fixed number of bytes.
For instance, computing the string length
requires iterating over all its elements. For CHARSXP
s,
C::
XLENGTH returns the number
of bytes, not including the trailing 0.
It is thus best to leave the processing of strings to the dedicated libraries, e.g., ICU or rely on functions from the stringi package [27] at the R level.
C strings can be converted to CHARSXP
s by calling
C::
Rf_mkCharCE(stringbuf, CE_UTF8)
or
C::
Rf_mkCharLenCE(stringbuf, buflen, CE_UTF8)
.
If we are sure that a string is in ASCII (a subset of UTF-8),
we can also call C::
Rf_mkChar(stringbuf)
.
We should never return CHARSXP
s as results to R. They are
for internal use only. They must be wrapped inside a character vector,
e.g., using C::
Rf_ScalarString.
14.2.7. Calling R functions from C (**)#
Section 5.11 of [62] discusses ways to call R functions in C. To understand them, we will first need to study the material from the remaining chapters of our book, i.e., environments and the related evaluation model. They can be useful, e.g., when calling optimisation algorithms implemented in C on objective functions written in R.
14.2.8. External pointers (**)#
There is a separate basic R type for storing arbitrary
C pointers, externalptr
(SEXPTYPE
of EXTPTRSXP
);
see Section 5.13 of [62] for more details.
We can use them to maintain dynamic data structures or resource handlers between calls to R functions. The problem with these is that pointers are passed as… pointers. They can easily break R’s pass-by-value semantics, where changes to the state of the referenced object will be visible outside the call.
Furthermore, pointers are not serialisable. They cannot be saved for use in another R session.
(**) inst/examples/stack.cpp
provides a C++ implementation
of the stack data structure, being a last-in-first-out container
of arbitrary R objects:
#include <deque>
class S : public std::deque<SEXP>
{
public: ~S()
{ // destructor: release all SEXPs so that they can be GC'd
while (!this->empty()) {
SEXP obj = this->front();
this->pop_front();
R_ReleaseObject(obj);
}
}
};
S* get_stack_pointer(SEXP s, bool check_zero=true) // internal function
{
if (TYPEOF(s) != EXTPTRSXP)
Rf_error("not an external pointer");
SEXP tag = R_ExternalPtrTag(s); // our convention, this can be anything
if (TYPEOF(tag) != CHARSXP || strcmp(CHAR(tag), "stack") !=0 )
Rf_error("not a stack");
S* sp = (S*)R_ExternalPtrAddr(s);
if (check_zero && !sp)
Rf_error("address is 0");
return sp;
}
void stack_finaliser(SEXP s) // internal function
{
// called during garbage collection
S* sp = get_stack_pointer(s, false);
if (sp) {
delete sp; // destruct S, release SEXPs
R_ClearExternalPtr(s);
}
}
extern "C" SEXP C_stack_create()
{
S* sp = new S(); // stack pointer
SEXP s = PROTECT(
R_MakeExternalPtr((void*)sp, /*tag*/mkChar("stack"), R_NilValue)
);
R_RegisterCFinalizerEx(s, stack_finaliser, TRUE); // auto-called on GC
UNPROTECT(1);
return s;
}
extern "C" SEXP C_stack_empty(SEXP s)
{
S* sp = get_stack_pointer(s);
return Rf_ScalarLogical(sp->empty());
}
extern "C" SEXP C_stack_push(SEXP s, SEXP obj)
{
S* sp = get_stack_pointer(s);
R_PreserveObject(obj);
sp->push_front(obj);
return R_NilValue;
}
extern "C" SEXP C_stack_pop(SEXP s)
{
S* sp = get_stack_pointer(s);
if (sp->empty())
Rf_error("stack is empty");
SEXP obj = sp->front();
sp->pop_front();
R_ReleaseObject(obj);
return obj;
}
/* R
stack_create <- function()
.Call("C_stack_create", PACKAGE="stack")
stack_empty <- function(s)
.Call("C_stack_empty", s, PACKAGE="stack")
stack_push <- function(s, obj)
.Call("C_stack_push", s, obj, PACKAGE="stack")
stack_pop <- function(s)
.Call("C_stack_pop", s, PACKAGE="stack")
R */
Note how we preserve R objects from garbage collection.
Some tests:
csource("~/R/cpackagedemo/inst/examples/stack.cpp")
s <- stack_create()
print(s)
## <pointer: 0x55fbff912210>
typeof(s)
## [1] "externalptr"
for (i in c("one", "two", "Spanish Inquisition"))
stack_push(s, i)
while (!stack_empty(s))
print(stack_pop(s))
## [1] "Spanish Inquisition"
## [1] "two"
## [1] "one"
14.3. Dealing with compound types#
14.3.1. Reading and setting attributes#
From Chapter 10, we know that compound types such as matrices, factors, or data frames are represented using basic data structures. Usually, they are atomic vectors of lists organised in a predefined manner.
C::
Rf_getAttrib(x, attrname)
and
C::
Rf_setAttrib(x, attrname, newval)
gets and sets specific attributes of an object `x
`.
Their second argument, `attrname
`, should be a one-element STRSXP
.
For convenience, the constants
R_ClassSymbol
,
R_DimNamesSymbol
,
R_DimSymbol
,
R_LevelsSymbol
,
R_NamesSymbol
, and
R_RowNamesSymbol
can be used instead of
the STRSXP
versions of the
"class"
,
"dimnames"
,
"dim"
,
"levels"
,
"names"
, and
"row.names"
strings.
Consider a function for testing whether a given object is of a specific class:
#include <string.h>
SEXP C_isofclass(SEXP x, SEXP class)
{
if (!Rf_isString(class) && XLENGTH(class) != 1)
Rf_error("`class` must be a single string");
if (!OBJECT(x)) // is the class attribute set at all?
return Rf_ScalarLogical(FALSE);
SEXP xclass = Rf_getAttrib(x, R_ClassSymbol); // STRSXP (guaranteed)
const char* c = CHAR(STRING_ELT(class, 0)); // class arg as a C string
size_t n = XLENGTH(xclass);
for (size_t i=0; i<n; ++i)
if (strcmp(CHAR(STRING_ELT(xclass, i)), c) == 0)
return Rf_ScalarLogical(TRUE);
return Rf_ScalarLogical(FALSE);
}
/* R
isofclass <- function(x, class)
.Call("C_isofclass", x, class, PACKAGE="isofclass")
R */
Some tests:
csource("~/R/cpackagedemo/inst/examples/isofclass.c")
isofclass(Sys.time(), "POSIXct")
## [1] TRUE
isofclass(cbind(1:5, 11:15), "matrix")
## [1] FALSE
Note that a matrix has an implicit class (reported by
the class function), but its class
attribute does
not have to be set. Hence the negative result.
Write a function that fetches a particular named element in a list.
14.3.2. Factors#
Factors (Section 10.3.2) are represented as integer vectors
with elements in the set {1, 2, …, k, NA_integer_
}
for some k. They are equipped with the levels
attribute,
being a character vector of length k. Their class
attribute is set to "factor"
.
An example implementation of a function to compute the number of occurrences of each factor level is given below.
SEXP C_table1(SEXP x)
{
if (!Rf_isFactor(x)) Rf_error("`x` is not an object of class 'factor'");
size_t n = XLENGTH(x);
const int* xp = INTEGER(x); // x is INTSXP
SEXP levels = Rf_getAttrib(x, R_LevelsSymbol); // levels is STRSXP
size_t k = XLENGTH(levels);
SEXP y = PROTECT(Rf_allocVector(REALSXP, k));
double* yp = REAL(y);
for (size_t i=0; i<k; ++i)
yp[i] = 0.0;
for (size_t j=0; j<n; ++j) {
if (xp[j] != NA_INTEGER) {
if (xp[j] < 1 || xp[j] > k)
Rf_error("malformed factor"); // better safe than sorry
yp[xp[j]-1] += 1.0; // levels are 1..k, but C needs 0..k-1
}
}
Rf_setAttrib(y, R_NamesSymbol, levels); // set names attribute
UNPROTECT(1);
return y;
}
/* R
table1 <- function(x)
{
if (!is.factor(x)) x <- as.factor(x)
.Call("C_table1", x, PACKAGE="table1")
}
R */
Testing:
csource("~/R/cpackagedemo/inst/examples/table1.c")
table1(c("spam", "bacon", NA, "spam", "eggs", "bacon", "spam", "spam"))
## bacon eggs spam
## 2 1 4
Create a function to compute the most frequently occurring value (mode) in a given factor. Return a character vector. If a mode is ambiguous, return all the possible candidates.
14.3.3. Matrices#
Matrices (Chapter 11) are flat atomic vectors or lists with
the dim
attribute being a vector of length two.
The class
attribute does not have to be set
(but the class function returns "matrix"
and "array"
).
Matrices are so important in data analysis that they have been blessed
with a few dedicated functions available at the C level.
C::
Rf_isMatrix tests if a given
object meets the criteria mentioned above.
R relies on the Fortran order of matrix elements:
they are aligned columnwisely. Let A
be a matrix
with n rows and m columns
(compare C::
Rf_nrows
and C::
Rf_ncols).
Then, the element in the i-th row
and the j-th column is at A[i+n*j]
.
C::
Rf_allocMatrix(sexptype, n, m)
allocates a new matrix. The dimnames
attributes must be handled
manually, though.
Here is a function to compute the transpose of a given numeric matrix:
SEXP C_transpose(SEXP x)
{
if (!Rf_isMatrix(x) || !Rf_isReal(x))
Rf_error("`x` must be a real matrix");
size_t n = Rf_nrows(x);
size_t m = Rf_ncols(x);
const double* xp = REAL(x);
SEXP y = PROTECT(Rf_allocMatrix(REALSXP, m, n));
double* yp = REAL(y);
for (size_t i=0; i<n; ++i)
for (size_t j=0; j<m; ++j)
yp[j+m*i] = xp[i+n*j];
SEXP dimnames = Rf_getAttrib(x, R_DimNamesSymbol);
if (!Rf_isNull(dimnames)) {
SEXP tdimnames = PROTECT(Rf_allocVector(VECSXP, 2));
SET_VECTOR_ELT(tdimnames, 0, VECTOR_ELT(dimnames, 1));
SET_VECTOR_ELT(tdimnames, 1, VECTOR_ELT(dimnames, 0));
Rf_setAttrib(y, R_DimNamesSymbol, tdimnames); // set dimnames
UNPROTECT(1);
// dimnames might have the names attribute too (left as an exercise)
}
UNPROTECT(1);
return y;
}
/* R
transpose <- function(x)
{
if (!is.matrix(x)) x <- as.matrix(x)
.Call("C_transpose", x, PACKAGE="transpose")
}
R */
Testing:
csource("~/R/cpackagedemo/inst/examples/transpose.c")
transpose(cbind(c(1, 2, 3, 4), c(5, 6, 7, 8)))
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
transpose(Titanic[, "Male", "Adult", ])
## 1st 2nd 3rd Crew
## No 118 154 387 670
## Yes 57 14 75 192
Author a C function named table2 that computes a two-way contingency table.
14.3.4. Data frames#
Data frames (Chapter 12) are lists of m vectors
of identical lengths n or matrices of n rows for some n and m.
The character vectors stored in the row.names
and names
attributes
give the n row and m column labels.
They are objects of S3 class "data.frame"
.
We process data frames as ordinary lists. However, in many applications, it is enough to write functions that accept ordinary numeric matrices instead. If element grouping is required, they can be accompanied by a factor or a list of factor variables. Do not complicate our lives beyond necessity.
14.4. Using existing function libraries#
14.4.1. Checking for user interrupts#
Long computations may lead to R’s becoming unresponsive. For instance, the user may request to cancel the evaluation of the current expression by pressing Ctrl+C.
To process the event queue, we should occasionally call
C::
R_CheckUserInterrupt()
,
e.g., in every iteration of a complex for
loop.
Note that R might decide never to return to our function.
Thus, we have to prevent memory leaks, e.g.,
by preferring C::
R_alloc
over C::
malloc.
14.4.2. Generating pseudorandom numbers#
C::
unif_rand returns
a single pseudorandom deviate from the uniform distribution
on the unit interval. It is the basis for generating numbers
from all other supported distributions (Section 6.7.1 of [62]).
It uses the same pseudorandom generator as we described
in Section 2.1.5. To read and memorise
its seed (the `.Random.seed
` object in the global environment),
we have to call
C::
GetRNGstate()
and
C::
PutRNGstate()
at the beginning and the end of our function, respectively.
Below is a function to generate a pseudorandom bit sequence:
SEXP C_randombits(SEXP n)
{
if (!Rf_isInteger(n) || XLENGTH(n) != 1)
Rf_error("`n` should be a single integer");
size_t _n = INTEGER(n)[0];
if (_n == NA_INTEGER || _n < 1)
Rf_error("incorrect `n`");
SEXP y = PROTECT(Rf_allocVector(INTSXP, _n));
int* yp = INTEGER(y);
GetRNGstate();
for (size_t i=0; i<_n; ++i)
yp[i] = (int)(unif_rand()>0.5);
PutRNGstate();
UNPROTECT(1);
return y;
}
/* R
randombits <- function(n)
{
if (!is.integer(n)) n <- as.integer(n)
.Call("C_randombits", n, PACKAGE="randombits")
}
R */
Let us play with the above:
csource("~/R/cpackagedemo/inst/examples/randombits.c")
set.seed(123); randombits(10)
## [1] 0 1 0 1 1 0 1 1 1 0
randombits(10)
## [1] 1 0 1 1 0 1 0 0 0 1
set.seed(123); randombits(20)
## [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1
set.seed(123); as.integer(runif(20)>0.5) # it's the same "algorithm"
## [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1
Create a function to compute the most frequently occurring value (mode) in a given factor object. In the case of ambiguity, return a randomly chosen candidate.
14.4.3. Mathematical functions from the R API#
Section 6.7 of [62] lists the available statistical distribution functions, mathematical routines and constants, and other numerical utilities.
14.4.4. Header files from other R packages (*)#
A package may use header files from another package.
For this to be possible, it must include the dependency name
in the LinkingTo
field of its DESCRIPTION
file;
see [62] for discussion.
The BH package on CRAN gives access to Boost,
which are header-only C++ libraries defining many algorithms
and data structures. Create an R package that calls
C::
boost::
math::
gcd after issuing
the #include <boost/math/common_factor.hpp>
directive.
14.4.5. Specifying compiler and linker flags (**)#
We can pass arbitrary flags to the compiler or linker, e.g., to use any library installed on our system.
Basic configuration can be set via Makevars
(or Makevars.win
on W****ws), e.g.,
by setting PKG_CFLAGS
or PKG_LIBS
variables.
For maximum portability across different platforms,
which is overall challenging to ensure if we do not wish to exclude
W****ws users, we might be required to write our own
configure
(and configure.win
) scripts.
For more information, see [62]. In particular, it discusses how to use OpenMP[13] in our projects.
14.5. Exercises#
Answer the following questions:
What are the most common
SEXPTYPE
s?How are missing values represented?
How can we check if an integer is a missing value? What about a floating-point variable?
How to prevent
SEXP
s from being garbage-collected?How are character vectors represented? What is the difference between a
CHARSXP
and aSTRSXP
?Why is it better to process factor objects than raw character vectors if we merely would like to define grouping variables?
How are R matrices handled in C? Does R use the C or Fortran order of matrix elements?
How are R data frames handled in C?
Implement the C versions of the rep, seq, rle, match, findInterval, sample, order, unique, and split functions.
(*) Read [62] in its entirety.
(*) Download R’s source code from CRAN
or its Subversion (SVN)
repository. Explore the header files in
the src/include/
subdirectory. They are part of the callable API.