14. Interfacing compiled code (**)¶
This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Minimalist Data Wrangling with Python [27], too.
R is an effective glue language. It is suitable for composing whole data wrangling pipelines: from data import through processing, analysis, and visualisation to export. It makes using and connecting larger building blocks very convenient. R is also a competent tool for developing usable implementations of standalone, general-purpose algorithms, especially if they are of numerical nature. Nevertheless, for performance reasons, we may consider rewriting computing-intensive tasks in C or C++[1]. Such a move can be beneficial if we need a method that:
has higher memory or time complexity when programmed using vectorised R functions than its straightforward implementation,
has an iterative or recursive nature, e.g., involving unvectorisable for or while loops,
relies on complicated dynamic data structures (e.g., hash maps, linked lists, or trees),
needs methods provided elsewhere and not available in R (e.g., other C or C++ libraries).
In the current chapter, we will demonstrate that R works very well as a user-friendly interface to compiled code.
As the topic is overall very technical,
we will only cover the most important rudiments.
We will focus on writing or interfacing portable[2]
function libraries that only rely on simple[3] data structures
(e.g., arrays of the type double
and int
).
Thanks to this, we will be able to reuse
them in other environments such as Python (e.g., via Cython) or Julia
(remember that R is one of many languages out there).
For those who are interested in specifics,
the definitive reference is the Writing R Extensions manual [65],
but see also Chapter 11 of [10]. Furthermore, R’s source
code provides many working examples of how to deal with R objects in C.
We assume some knowledge of the C language; see [38]. The reader can skip this chapter now and return to it later: the remaining material is not contingent on the current one. Otherwise, from now on, we take for granted that our environment can successfully build a source package with C code, as mentioned in Section 7.3.1.2. In particular, Wi***ws and m**OS users should install, respectively, RTools and Xcode.
Note
To avoid ambiguity, in the main text, calls to C functions will be
denoted by the “C::
” prefix, e.g.,
C::
spanish_inquisition()
.
14.1. C and C++ code in R¶
14.1.1. Source files for compiled code in R packages¶
Perhaps the most versatile way to interact with portable C code is via standalone R packages; compare Section 7.3.1 and Section 9.2.2. For the purpose of the current chapter, we created a demo project available at https://github.com/gagolews/cpackagedemo.
Inspect the structure of cpackagedemo.
Note that C source files are located in the src/
subdirectory.
Build and install the package using install.packages
or the “R CMD INSTALL
” command. Then, load the package in R and call
my_sum defined there on some numeric vector.
The package provides an R interface to one C function,
C::
my_c_sum, written in the
most portable fashion possible.
Its declaration is included in the src/cfuns.h
file:
#ifndef __CFUNS_H
#define __CFUNS_H
#include <stddef.h>
double my_c_sum(const double* x, size_t n);
#endif
The function accepts a pointer to the start of a numeric sequence
and its size, which is a standard[4] way of representing
an array of double
s.
Its definition is given in src/cfuns.c
. We see that it is
nothing more than a simple sum of all the elements in an array:
#include "cfuns.h"
/* computes the sum of all elements in an array x of size n */
double my_c_sum(const double* x, size_t n)
{
double s = 0.0;
for (size_t i = 0; i < n; ++i) {
/* this code does not treat potential missing values specially
(they are kinds of NaNs); to fix this, add:
if (ISNA(x[i])) return NA_REAL; // #include <R.h> */
s += x[i];
}
return s;
}
To make C::
my_c_sum available in R,
we have to introduce a wrapper around it that works
with the data structures from the first part of this jolly book.
We know that an R function accepts objects of any kind
as input and yields anything as a result.
In the next section, we will explain that we get access
to R objects via special pointers of the type SEXP
(S expressions).
Thus, let us declare our R-callable wrapper in src/rfuns.h
:
#ifndef __RFUNS_H
#define __RFUNS_H
#include <R.h>
#include <Rinternals.h>
#include <Rmath.h>
SEXP my_c_sum_wrapper(SEXP x);
#endif
The actual definition is included in src/rfuns.c
:
#include "rfuns.h"
#include "cfuns.h"
/* a wrapper around my_c_sum callable from R */
SEXP my_c_sum_wrapper(SEXP x)
{
double s;
if (!Rf_isReal(x)) {
/* the caller is expected to prepare the arguments
(doing it at the C level is tedious work) */
Rf_error("`x` should be a vector of the type 'double'");
}
s = my_c_sum(REAL(x), (size_t)XLENGTH(x));
return Rf_ScalarReal(s);
}
The arguments could be, technically speaking, prepared at the C level.
For instance, if x
turned out to be an integer vector,
we could have converted it to the double
one
(they are two different types; see Section 6.4.1).
Nevertheless, overall, it is very burdensome. It is easier to use pure R code
to ensure that the arguments are of the correct form as well as to beautify
the outputs.
This explains why we only assert the enjoyment of
C::
Rf_isReal(x)
.
It guarantees that the C::
REAL and
C::
XLENGTH
functions correctly return the pointer to the start of the sequence
and its length, respectively.
Once C::
my_c_sum is called,
we must convert it to an R object so that it can be returned to our
environment. Here, it is a newly allocated
numeric vector of length one. We did this by calling
C::
Rf_ScalarReal.
Although optional (see Section 5.4 of [65]),
we will register C::
my_c_sum_wrapper
as a callable function explicitly. This way, R will not be struggling
to find the specific entry point in the resulting dynamically linked library
(DLL). We do this in src/cpackagedemo.c
:
#include <R_ext/Rdynload.h>
#include "rfuns.h"
/* the list of functions available in R via a call to .Call():
each entry is like {exported_name, fun_pointer, number_of_arguments} */
static const R_CallMethodDef cCallMethods[] = {
{"my_c_sum_wrapper", (DL_FUNC)&my_c_sum_wrapper, 1},
{NULL, NULL, 0} // the end of the list (sentinel)
};
/* registers the list of callable functions */
void R_init_cpackagedemo(DllInfo *dll)
{
R_registerRoutines(dll, NULL, cCallMethods, NULL, NULL);
R_useDynamicSymbols(dll, FALSE);
}
The function can be invoked from R using .Call.
Here are the contents of R/my_sum.R
:
my_sum <- function(x)
{
# prepare input data:
if (!is.double(x))
x <- as.double(x)
s <- .Call("my_c_sum_wrapper", x, PACKAGE="cpackagedemo")
# some rather random postprocessing:
attr(s, "what") <- deparse(substitute(x))
s
}
And, finally, here is the package NAMESPACE
file responsible
for registering the exported R names and indicating the DLL to use:
export(my_sum)
useDynLib(cpackagedemo)
Once the package is built and installed
(e.g., by running “R CMD INSTALL <pkgdir>
” in the
terminal or calling install.packages),
we can test it by calling:
library("cpackagedemo")
my_sum(runif(100)/100)
## [1] 0.49856
## attr(,"what")
## [1] "runif(100)/100"
Extend the package by adding a function to compute the index of the greatest element in a numeric vector. Note that C uses 0-based array indexing whereas in R, the first element is at index 1. Compare its run time against which.max using proc.time.
14.1.2. R CMD SHLIB
¶
The “R CMD SHLIB <files>
” shell command compiles one or more source files
without the need for turning them into standalone packages;
see [65]. Then, dyn.load loads the resulting DLL.
(*) Compile src/cfuns.c
and src/rfuns.c
from our
demo package using “R CMD SHLIB
”. Call dyn.load.
Write an R function that uses .Call to invoke
C::
my_c_sum_wrapper from the second source file.
The direct SHLIB
approach is convenient for learning C programming,
including running simple examples. We will thus use it
for didactic reasons in this chapter.
The inst/examples/csource.R
file in our demo package
includes the implementation of an R function called csource.
It compiles a given C source file, and loads the resulting DLL.
It also extracts and executes a designated R code chunk
preferably defining a function that refers to .Call.
Here is an example source file, inst/examples/helloworld.c
in the cpackagedemo source code repository:
// the necessary header files are automatically included by `csource`
SEXP C_hello()
{
Rprintf("The mill's closed. There's no more work. We're destitute.\n"
"I'm afraid I've no choice but to sell you all "
"for scientific experiments.\n");
return R_NilValue;
}
/* R
# this chunk will be extracted and executed by `csource`.
hello <- function()
invisible(.Call("C_hello", PACKAGE="helloworld"))
R */
Let’s compile it and call the aforementioned R function.
source("~/R/cpackagedemo/inst/examples/csource.R") # defines csource
csource("~/R/cpackagedemo/inst/examples/helloworld.c")
hello()
## The mill's closed. There's no more work. We're destitute.
## I'm afraid I've no choice but to sell you all for scientific experiments.
(*)
C++, which can be thought of as a superset of the C language
(but the devil is in the detail), is also supported.
Change the name of the aforementioned file to helloworld2.cpp
,
add extern "C"
before the function declaration,
pass PACKAGE="helloworld2"
to .Call,
and run csource on the new file.
(*) Verify that C and C++ source files can coexist in R packages.
(*) It might be very educative to study the implementation of csource. We should be able to author such functions ourselves now (a few hours’ worth of work), let alone read with understanding.
# compiles a C or C++ source file using R CMD SHLIB,
# loads the resulting DLL, and executes the embedded R code
csource <- function(
fname,
libname=NULL, # defaults to the base name of `fname` without extension
shlibargs=character(),
headers=paste0(
"#include <R.h>\n",
"#include <Rinternals.h>\n",
"#include <Rmath.h>\n"
),
R=file.path(R.home(), "bin/R")
)
{
stopifnot(file.exists(fname))
stopifnot(is.character(shlibargs))
stopifnot(is.character(headers))
stopifnot(is.character(R), length(R) == 1)
if (is.null(libname))
libname <- regmatches(basename(fname),
regexpr("[^.]*(?=\\..*)", basename(fname), perl=TRUE))
stopifnot(is.character(libname), length(libname) == 1)
# read the source file:
f <- paste(readLines(fname), collapse="\n")
# set up output file names:
tmpdir <- normalizePath(tempdir(), winslash="/") # tempdir on Win uses \
dynlib_ext <- .Platform[["dynlib.ext"]]
libpath <- file.path(tmpdir, sprintf("%s%s", libname, dynlib_ext))
cfname <- file.path(tmpdir, basename(fname))
rfname <- sub("\\..*?$", ".R", cfname, perl=TRUE) # .R extension
# separate the /* R ... <R code> ... R */ chunk from the source file:
rpart <- regexec("(?smi)^/\\* R\\s?(.*)R \\*/$", f, perl=TRUE)[[1]]
rpart_start <- rpart
rpart_len <- attr(rpart, "match.length")
if (rpart_start[1] < 0 || rpart_len[1] < 0)
stop("enclose R code between /* R ... and ... R */")
rcode <- substr(f, rpart_start[2], rpart_start[2]+rpart_len[2]-1)
cat(rcode, file=rfname, append=FALSE)
# write the C/C++ file:
ccode <- paste(
headers,
substr(f, 1, rpart_start[1]-1),
substr(f, rpart_start[1]+rpart_len[1], nchar(f)),
collapse="\n"
)
cat(ccode, file=cfname, append=FALSE)
# prepare the "R CMD SHLIB ..." command:
shlibargs <- c(
"CMD", "SHLIB",
sprintf("-o %s", libpath),
cfname,
shlibargs
)
# compile and load the DLL, run the extracted R script:
retval <- FALSE
oldwd <- setwd(tmpdir)
tryCatch({
if (libpath %in% sapply(getLoadedDLLs(), `[[`, "path"))
dyn.unload(libpath)
stopifnot(system2(R, shlibargs) == 0) # 0 == success
dyn.load(libpath)
source(rfname)
retval <- TRUE
}, error=function(e) {
cat(as.character(e), file=stderr())
})
setwd(oldwd)
if (!retval) stop("error compiling file or executing R code therein")
invisible(TRUE)
}
14.2. Handling basic types¶
14.2.1. SEXPTYPE
s¶
All R objects are stored as instances of the C language structure
SEXPREC
. Usually, we access them via
pointers, which are of the type SEXP
(S expression).
A C function referred to via .Call takes the very generic
SEXP
s as input. It outputs another SEXP
. Importantly,
one of the said structure’s fields represents
the actual R object type (SEXPTYPE
numbers);
see Table 14.1 for a selection.
|
Type in R (typeof) |
Test in C |
---|---|---|
|
|
Rf_isNull |
|
|
TYPEOF |
|
|
Rf_isLogical |
|
|
Rf_isInteger |
|
|
Rf_isReal |
|
|
Rf_isComplex |
|
|
Rf_isString |
|
|
Rf_isVectorList |
|
|
TYPEOF |
|
|
TYPEOF |
To illustrate that any R object is available as a SEXP
, consider
the inst/examples/sexptype.c
file from cpackagedemo:
SEXP C_test_sexptype(SEXP x)
{
Rprintf("type of x: %s (SEXPTYPE=%d)\n",
Rf_type2char(TYPEOF(x)),
(int)TYPEOF(x)
);
return R_NilValue;
}
/* R
test_sexptype <- function(x)
invisible(.Call("C_test_sexptype", x, PACKAGE="sexptype"))
R */
Example calls:
csource("~/R/cpackagedemo/inst/examples/sexptype.c")
test_sexptype(1:10)
## type of x: integer (SEXPTYPE=13)
test_sexptype(NA)
## type of x: logical (SEXPTYPE=10)
test_sexptype("spam")
## type of x: character (SEXPTYPE=16)
We should refer to particular SEXPTYPE
s
via their descriptive names (constants; e.g., STRSXP
),
not their numeric identifiers (e.g., 16);
see Section 1.1 of [68] for the complete list[5].
14.2.2. Accessing elements in simple atomic vectors¶
We have already seen an example function that processes a numeric vector;
see C::
my_c_sum_wrapper above.
Table 14.2 gives other important vector-like
SEXPTYPE
s (atomic and generic), the C types of their elements,
and the functions to access the underlying array pointers.
It is also worth knowing that as call to
C::
XLENGTH returns
the length of a given sequence.
Let’s stress that writing functions that accept int
and double
array pointers and their lengths makes them easily
reusable in other programming environments. In many data analysis
applications, we do not need much more.
|
Array element type |
Pointer access |
---|---|---|
|
|
RAW |
|
|
LOGICAL |
|
|
INTEGER |
|
|
REAL |
|
|
COMPLEX |
|
|
(not directly) |
|
|
(not directly) |
|
|
CHAR |
Important
With raw, logical, integer, floating-point, and complex vectors,
we get direct access to data that might be shared amongst many objects
(compare Section 16.1.4).
SEXPREC
s are simply passed by pointers (since SEXP
s are
exactly them). We must thus refrain[6]
from modifying objects passed as function arguments.
Ways to create new vectors, e.g., for storing auxiliary
or return values, are discussed below.
Consider inst/examples/sharedmem.c
:
SEXP C_test_sharedmem(SEXP x)
{
if (!Rf_isReal(x) || XLENGTH(x) == 0)
Rf_error("`x` should be a non-empty vector of the type 'double'");
REAL(x)[0] = REAL(x)[0]+1; // never do it; always make a copy;
// the underlying array `x` may be shared by many objects
return R_NilValue;
}
/* R
test_sharedmem <- function(x)
invisible(.Call("C_test_sharedmem", x, PACKAGE="sharedmem"))
R */
Calling the foregoing function on an example vector:
csource("~/R/cpackagedemo/inst/examples/sharedmem.c")
y <- 1
z <- y
test_sharedmem(y)
print(c(y, z))
## [1] 2 2
modifies y
and z
in place. Hence, to maintain the compatibility
with the classic R semantics, we must always make a copy.
14.2.3. Representation of missing values¶
Most languages do not support the notion of missing values out of the box. Therefore, in R, they have to be emulated. Table 14.3 lists the relevant constants and the conventional ways for testing for missingness.
|
Missing value |
Testing |
---|---|---|
|
(none) |
(none) |
|
|
|
|
|
|
|
|
ISNA |
|
a pair of |
ISNA |
|
|
|
In logical and integer vectors, NA
s are represented
as the smallest 32-bit signed integer. Thus, we need to be careful
when performing any operations on these types:
testing for missingness must be performed first.
The case of double
s is slightly less irksome, for a missing
value is represented as a special not-a-number.
Many arithmetic operations on NaN
s
return NaN
s as well, albeit there is no guarantee[7]
that they will be of precisely the same type as NA_REAL
. Thus, manual
testing for missingness is also advised.
The inst/examples/mean_naomit.c
file defines a function
to compute the arithmetic mean of an int
or a double
vector:
SEXP C_mean_naomit(SEXP x)
{
double ret = 0.0;
size_t k = 0;
if (Rf_isInteger(x)) {
const int* xp = INTEGER(x);
size_t n = XLENGTH(x);
for (size_t i=0; i<n; ++i)
if (xp[i] != NA_INTEGER) { // NOT: ISNA(xp[i])
ret += (double)xp[i];
k++;
}
}
else if (Rf_isReal(x)) {
const double* xp = REAL(x);
size_t n = XLENGTH(x);
for (size_t i=0; i<n; ++i)
if (!ISNA(xp[i])) { // NOT: xp[i] == NA_REAL
ret += xp[i];
k++;
}
}
else
Rf_error("`x` should be a numeric vector");
return Rf_ScalarReal((k>0)?(ret/(double)k):NA_REAL);
}
/* R
mean_naomit <- function(x)
{
if (!is.numeric(x)) # neither integer nor double
x <- as.numeric(x) # convert to double (the same as as.double)
.Call("C_mean_naomit", x, PACKAGE="mean_naomit")
}
R */
There is some inherent code duplication but int
and double
are
distinct types. Thus, they need to be handled separately
(we could have convert them to double
s at the R level too).
Some tests:
csource("~/R/cpackagedemo/inst/examples/mean_naomit.c")
mean_naomit(c(1L, NA_integer_, 3L, NA_integer_, 5L))
## [1] 3
mean_naomit(rep(NA_real_, 10))
## [1] NA
Implement all and any in C.
Add the na.rm
argument.
14.2.4. Memory allocation¶
To allocate a new vector of length one and set
its only element, we can call
C::
ScalarLogical,
C::
ScalarInteger,
C::
ScalarReal, etc.
We have already used these functions for returning R “scalars”.
Vectors of arbitrary lengths can be created using
C::
Rf_allocVector(sexptype, size)
.
Note that this function does not initialise the elements of
logical and numeric sequences (amongst others). They will need
to be set manually after creation.
Important
R implements a simple yet effective garbage collector that relies on reference counting. Occasionally[8], memory blocks that can no longer be reached are either freed or marked as reusable.
All allocated vectors must be manually protected from garbage collection.
To guard against premature annihilation, R maintains a stack[9]
of objects. C::
PROTECT(sexp)
pushes a given object pointer onto the top of the list.
C::
UNPROTECT(n)
pops the last n elements from it in the last-in-first-out manner.
At the end of a .Call, R checks if the number
of protects matches that of unprotects and generates a warning
if there is a stack imbalance.
Protection is not needed:
for arguments to functions referred to by .Call, as they are already in use and hence protected;
for objects assigned as list or character vectors’ elements using C
::
SET_VECTOR_ELT and C::
SET_STRING_ELT (see below); when the container is protected, so are its components;when we return the allocated vector to R immediately after creating it (like in return Rf_ScalarReal
(val)
in a C function invoked by .Call).
Here is a function to compute the square of each element in a numeric vector. Note that the new vector must be protected from garbage collection while data are being prepared.
SEXP C_square1(SEXP x)
{
// no need to call PROTECT(x), it is already in use
if (!Rf_isReal(x)) Rf_error("`x` should be of the type 'double'");
size_t n = XLENGTH(x);
const double* xp = REAL(x);
SEXP y = PROTECT(Rf_allocVector(REALSXP, n)); // won't be GC'd
double* yp = REAL(y);
for (size_t i=0; i<n; ++i) {
if (ISNA(xp[i])) yp[i] = xp[i]; // NA_REAL
else yp[i] = xp[i]*xp[i];
}
UNPROTECT(1); // pops one object from the protect stack;
// does not trigger garbage collection, so we can return `y` now
return y; // R will retrieve and protect it
}
/* R
square1 <- function(x)
{
if (!is.double(x)) x <- as.double(x)
.Call("C_square1", x, PACKAGE="square1")
}
R */
As an alternative, in this case, we may use
C::
Rf_duplicate:
SEXP C_square2(SEXP x)
{
if (!Rf_isReal(x)) Rf_error("`x` should be of the type 'double'");
x = PROTECT(Rf_duplicate(x)); // OK; just replaces the pointer (address)
size_t n = XLENGTH(x);
double* xp = REAL(x);
for (size_t i=0; i<n; ++i)
if (!ISNA(xp[i])) xp[i] = xp[i]*xp[i];
UNPROTECT(1);
return x;
}
/* R
square2 <- function(x)
{
if (!is.double(x)) x <- as.double(x)
.Call("C_square2", x, PACKAGE="square2")
}
R */
Some tests:
csource("~/R/cpackagedemo/inst/examples/square1.c")
square1(c(-2, -1, 0, 1, 2, 3, 4, NA_real_))
## [1] 4 1 0 1 4 9 16 NA
csource("~/R/cpackagedemo/inst/examples/square2.c")
square2(c(-2, -1, 0, 1, 2, 3, 4, NA_real_))
## [1] 4 1 0 1 4 9 16 NA
We can claim auxiliary memory from the heap
during a function’s runtime using the well-known
C::
malloc (or new in C++).
We are of course fully responsible for releasing it via
C::
free (or delete).
Here is our version of the which function.
SEXP C_which1(SEXP x)
{
if (!Rf_isLogical(x)) Rf_error("`x` should be of the type 'logical'");
size_t n = XLENGTH(x), i, k;
const int* xp = LOGICAL(x);
size_t* d = (size_t*)malloc(n*sizeof(size_t)); // conservative size
if (!d) Rf_error("memory allocation error");
for (i=0, k=0; i<n; ++i)
if (xp[i] != NA_LOGICAL && xp[i])
d[k++] = i;
// Rf_allocVector can longjmp, memory leak possible...
SEXP y = PROTECT(Rf_allocVector(REALSXP, k));
double* yp = REAL(y); // yes, the type is double; ready for long vectors
for (i=0; i<k; ++i)
yp[i] = (double)d[i]+1; // R uses 1-based indexing
free(d);
UNPROTECT(1);
return y;
}
/* R
which1 <- function(x)
{
if (!is.logical(x)) x <- as.logical(x)
.Call("C_which1", x, PACKAGE="which1")
}
R */
Some tests:
csource("~/R/cpackagedemo/inst/examples/which1.c")
which1(c(TRUE, FALSE, TRUE, NA, TRUE))
## [1] 1 3 5
R’s which returns either an int
or a double
vector, depending
on the size of the input vector (whether it is shorter than \(2^{31}-1\)).
Rewrite the above to take that into account: integer arithmetic is
slightly faster.
Note
(*)
R’s exception handling uses a
long jump.
Therefore, when calling C::
Rf_error
(whether directly or not) normal stack unwinding will not occur.
This is particularly important when using C++ objects which deallocate memory
in their destructors as they might not be invoked whatsoever.
In the preceding example, a call to
C::
Rf_allocVector may trigger a long jump,
e.g., if we run out of available memory. In such a case, d
will
not be freed.
Thus, care should be taken to make sure there are no memory leaks.
We can sometimes switch to
C::
R_alloc(n, size)
which allocates n*size
bytes. The memory it requests will automatically be
garbage-collected at the end of a .Call.
Otherwise, we should ensure that blocks relying on manual memory
allocation are not mixed with the calls to R API functions.
In our C::
which1, it would be better
to determine the desired size of y
and allocate it before
calling C::
malloc.
(*) If we do not like that we are potentially wasting memory
in the case of sparse logical vectors, we can rely on
dynamically growable arrays. Below is a C++ rewrite of the
foregoing function using deque
(double-ended queue)
from the language’s standard library.
#include <deque>
extern "C" SEXP C_which2(SEXP x)
{
if (!Rf_isLogical(x)) Rf_error("`x` should be of the type 'logical'");
size_t n = XLENGTH(x), i, k=0;
const int* xp = LOGICAL(x);
// precompute k, Rf_allocVector can do a longjmp
for (i=0; i<n; ++i) if (xp[i] != NA_LOGICAL && xp[i]) k++;
SEXP y = PROTECT(Rf_allocVector(REALSXP, k));
double* yp = REAL(y); // ready for long vectors
std::deque<size_t> d; // allocates memory
for (i=0; i<n; ++i)
if (xp[i] != NA_LOGICAL && xp[i])
d.push_back(i);
i=0;
for (size_t e : d)
yp[i++] = (double)e+1; // R uses 1-based indexing
UNPROTECT(1);
return y; // d's destructor will be called automatically
}
/* R
which2 <- function(x)
{
if (!is.logical(x)) x <- as.logical(x)
.Call("C_which2", x, PACKAGE="which2")
}
R */
Example calls:
csource("~/R/cpackagedemo/inst/examples/which2.cpp")
x <- (runif(10) > 0.5)
stopifnot(which(x) == which1(x))
stopifnot(which(x) == which2(x))
Alternatively, we could have used C::
realloc
to extend an initially small buffer created using
C::
malloc by, say, 50%
whenever it is about to overflow.
14.2.5. Lists¶
For safety reasons[10], we do not get access to the underlying pointers
in lists and character vectors. List items can be read by calling
C::
VECTOR_ELT(x, index)
and can be set with
C::
SET_VECTOR_ELT(x, index, newval)
.
Note that lists (VECSXP
s) are comprised of SEXP
s of any type.
Hence, after extracting an element, its SEXPTYPE
needs to be tested
using one of the functions listed in Table 14.1.
This can be tiresome.
Here is a rather useless function that fetches the first and the last element in a given numeric vector or a list. However, if the latter case, we apply the function recursively on all its elements.
SEXP C_firstlast(SEXP x)
{
if (!Rf_isVector(x) || XLENGTH(x) == 0)
Rf_error("`x` must be a non-empty vector (atomic or generic)");
else if (Rf_isReal(x)) {
SEXP y = PROTECT(Rf_allocVector(REALSXP, 2));
REAL(y)[0] = REAL(x)[0]; // first
REAL(y)[1] = REAL(x)[XLENGTH(x)-1]; // last
UNPROTECT(1);
return y;
}
else if (Rf_isVectorList(x)) {
SEXP y = PROTECT(Rf_allocVector(VECSXP, 2));
// VECTOR_ELT(x, i) is PROTECTed by the container;
// SET_VECTOR_ELT does not trigger GC; no need to call PROTECT
// on the result of C_firstlast(...) in this context
SET_VECTOR_ELT(y, 0, C_firstlast(VECTOR_ELT(x, 0)));
SET_VECTOR_ELT(y, 1, C_firstlast(VECTOR_ELT(x, XLENGTH(x)-1)));
UNPROTECT(1);
return y;
}
else
Rf_error("other cases left as an exercise");
return R_NilValue; // avoid compiler warning
}
/* R
firstlast <- function(x)
.Call("C_firstlast", x, PACKAGE="firstlast")
R */
Testing:
csource("~/R/cpackagedemo/inst/examples/firstlast.c")
firstlast(c(1, 2, 3))
## [1] 1 3
firstlast(list(c(1, 2, 3), c(4, 5), 6))
## [[1]]
## [1] 1 3
##
## [[2]]
## [1] 6 6
firstlast(list(c(1, 2, 3), 4, 5, list(6, c(7, 8), c(9, 10, 11))))
## [[1]]
## [1] 1 3
##
## [[2]]
## [[2]][[1]]
## [1] 6 6
##
## [[2]][[2]]
## [1] 9 11
Implement a C function that returns the longest vector in a given list.
Use C::
Rf_isVector
to check whether a given object is an atomic or a generic vector,
and hence if C::
XLENGTH
can be called thereon.
Inscribe your version of unlist. Consider scanning the input list twice. First, determine the size of the output vector. Second, fill the return object with the un-listed values.
Write a C function that takes a list of numeric vectors of identical lengths. Return their elementwise sum: the first element of the output should be the sum of the first elements in every input vector, and so forth.
14.2.6. Character vectors and individual strings (*)¶
Character vectors (STRSXP
s) are similar to VECSXP
s
except that they only carry individual strings which are
represented using a separate data type, CHARSXP
.
C::
STRING_ELT(x, index)
and
C::
SET_STRING_ELT(x, index, newval)
play the role of the element getters and setters.
Important
If we are not interested in text processing but rather in handling categorical data or defining grouping variables, we should consider converting character vectors to factors before issuing a .Call. Comparing small integers is much faster than strings; see below for more details.
Because of R’s string cache, there are no duplicate strings
in the memory. However, this feature could only be guaranteed
by making data in CHARSXP
s read-only. We can access
the underlying const char*
pointer by calling
C::
CHAR(s)
.
As typical in C, a string is terminated by byte 0.
Note
R strings may be of different encodings; compare Section 6.1.1. For portability and peace of mind, it is best to preprocess the arguments to .Call using enc2utf8, which converts all strings to UTF-8[11].
Despite being the most universal encoding, UTF-8 does not
represent each code point using a fixed number of bytes.
For instance, computing the string length
requires iterating over all its elements. For CHARSXP
s,
C::
XLENGTH returns the number
of bytes, not including the trailing 0.
It is thus best to leave the processing of strings to the dedicated libraries, e.g., ICU or rely on functions from the stringi package [26] at the R level.
C strings can be converted to CHARSXP
s by calling
C::
Rf_mkCharCE(stringbuf, CE_UTF8)
or
C::
Rf_mkCharLenCE(stringbuf, buflen, CE_UTF8)
.
If we are sure that a string is in ASCII (a subset of UTF-8),
we can also call C::
Rf_mkChar(stringbuf)
.
We should never return CHARSXP
s as results to R. They are
for internal use only. They must be wrapped inside a character vector,
e.g., using C::
Rf_ScalarString.
14.2.7. Calling R functions from C (**)¶
Section 5.11 of [65] discusses ways to call R functions in C. To understand them, we will first need to study the material from the remaining chapters of our book, i.e., environments and the related evaluation model. They can be useful, e.g., when calling optimisation algorithms implemented in C on objective functions written in R.
14.2.8. External pointers (**)¶
For storing arbitrary C pointers, there is a separate basic R type
named externalptr
(SEXPTYPE
of EXTPTRSXP
);
see Section 5.13 of [65] for more details.
We can use them to maintain dynamic data structures
or resource handlers between calls to R functions.
The problem with these is that pointers are passed as… pointers.
They can easily break R’s pass-by-value-like semantics,
where changes to the state of the referenced object will
be visible outside the call.
(**) inst/examples/stack.cpp
provides a C++ implementation
of the stack data structure, being a last-in-first-out container
of arbitrary R objects:
#include <deque>
class S : public std::deque<SEXP>
{
public: ~S()
{ // destructor: release all SEXPs so that they can be GC'd
while (!this->empty()) {
SEXP obj = this->front();
this->pop_front();
R_ReleaseObject(obj);
}
}
};
S* get_stack_pointer(SEXP s, bool check_zero=true) // internal function
{
if (TYPEOF(s) != EXTPTRSXP)
Rf_error("not an external pointer");
SEXP tag = R_ExternalPtrTag(s); // our convention, this can be anything
if (TYPEOF(tag) != CHARSXP || strcmp(CHAR(tag), "stack") != 0)
Rf_error("not a stack");
S* sp = (S*)R_ExternalPtrAddr(s);
if (check_zero && !sp)
Rf_error("address is 0");
return sp;
}
void stack_finaliser(SEXP s) // internal function
{
// called during garbage collection
S* sp = get_stack_pointer(s, false);
if (sp) {
delete sp; // destruct S, release SEXPs
R_ClearExternalPtr(s);
}
}
extern "C" SEXP C_stack_create()
{
S* sp = new S(); // stack pointer
SEXP s = PROTECT(
R_MakeExternalPtr((void*)sp, /*tag*/mkChar("stack"), R_NilValue)
);
R_RegisterCFinalizerEx(s, stack_finaliser, TRUE); // auto-called on GC
UNPROTECT(1);
return s;
}
extern "C" SEXP C_stack_empty(SEXP s)
{
S* sp = get_stack_pointer(s);
return Rf_ScalarLogical(sp->empty());
}
extern "C" SEXP C_stack_push(SEXP s, SEXP obj)
{
S* sp = get_stack_pointer(s);
R_PreserveObject(obj);
sp->push_front(obj);
return R_NilValue;
}
extern "C" SEXP C_stack_pop(SEXP s)
{
S* sp = get_stack_pointer(s);
if (sp->empty())
Rf_error("stack is empty");
SEXP obj = sp->front();
sp->pop_front();
R_ReleaseObject(obj);
return obj;
}
/* R
stack_create <- function()
.Call("C_stack_create", PACKAGE="stack")
stack_empty <- function(s)
.Call("C_stack_empty", s, PACKAGE="stack")
stack_push <- function(s, obj)
.Call("C_stack_push", s, obj, PACKAGE="stack")
stack_pop <- function(s)
.Call("C_stack_pop", s, PACKAGE="stack")
R */
Note how we preserve R objects from garbage collection. Some tests:
csource("~/R/cpackagedemo/inst/examples/stack.cpp")
s <- stack_create()
print(s)
## <pointer: 0x560958a90550>
typeof(s)
## [1] "externalptr"
for (i in c("one", "two", "Spanish Inquisition"))
stack_push(s, i)
while (!stack_empty(s))
print(stack_pop(s))
## [1] "Spanish Inquisition"
## [1] "two"
## [1] "one"
Note that pointers are not serialisable. They cannot be saved for use in another R session.
14.3. Dealing with compound types¶
14.3.1. Reading and setting attributes¶
From Chapter 10, we know that compound types such as matrices, factors, or data frames are represented using basic data structures. Usually, they are atomic vectors or lists organised in a predefined manner.
C::
Rf_getAttrib(x, attrname)
and
C::
Rf_setAttrib(x, attrname, newval)
gets and sets specific attributes of an object x
.
Their second argument, attrname
, should be a one-element STRSXP
.
For convenience, the
R_ClassSymbol
,
R_DimNamesSymbol
,
R_DimSymbol
,
R_NamesSymbol
,
R_RowNamesSymbol
,
and
R_LevelsSymbol
constants
can be used instead of the STRSXP
versions of the
"class"
,
"dimnames"
,
"dim"
,
"names"
,
"row.names"
,
and
"levels"
strings.
Consider a function for testing whether an object is of a given class:
#include <string.h>
SEXP C_isofclass(SEXP x, SEXP class)
{
if (!Rf_isString(class) && XLENGTH(class) != 1)
Rf_error("`class` must be a single string");
if (!OBJECT(x)) // is the class attribute set at all?
return Rf_ScalarLogical(FALSE);
SEXP xclass = Rf_getAttrib(x, R_ClassSymbol); // STRSXP (guaranteed)
const char* c = CHAR(STRING_ELT(class, 0)); // class arg as a C string
size_t n = XLENGTH(xclass);
for (size_t i=0; i<n; ++i)
if (strcmp(CHAR(STRING_ELT(xclass, i)), c) == 0)
return Rf_ScalarLogical(TRUE);
return Rf_ScalarLogical(FALSE);
}
/* R
isofclass <- function(x, class)
.Call("C_isofclass", x, class, PACKAGE="isofclass")
R */
Some tests:
csource("~/R/cpackagedemo/inst/examples/isofclass.c")
isofclass(Sys.time(), "POSIXct")
## [1] TRUE
isofclass(cbind(1:5, 11:15), "matrix")
## [1] FALSE
Note that a matrix has an implicit class (reported by
the class function), but its class
attribute does
not have to be set. Hence the negative result.
Write a function that fetches a particular named element in a list.
14.3.2. Factors¶
Factors (Section 10.3.2) are represented as integer vectors
with elements in the set {1, 2, …, k, NA_integer_
}
for some k. They are equipped with the levels
attribute,
being a character vector of length k. Their class
attribute is set to factor
.
An example implementation of a function to compute the number of occurrences of each factor level is given below.
SEXP C_table1(SEXP x)
{
if (!Rf_isFactor(x)) Rf_error("`x` is not a 'factor' object");
size_t n = XLENGTH(x);
const int* xp = INTEGER(x); // `x` is INTSXP
SEXP levels = Rf_getAttrib(x, R_LevelsSymbol); // `levels` is a STRSXP
size_t k = XLENGTH(levels);
SEXP y = PROTECT(Rf_allocVector(REALSXP, k));
double* yp = REAL(y);
for (size_t i=0; i<k; ++i)
yp[i] = 0.0;
for (size_t j=0; j<n; ++j) {
if (xp[j] != NA_INTEGER) {
if (xp[j] < 1 || xp[j] > k)
Rf_error("malformed factor"); // better safe than sorry
yp[xp[j]-1] += 1.0; // levels are 1..k, but C needs 0..k-1
}
}
Rf_setAttrib(y, R_NamesSymbol, levels); // set names attribute
UNPROTECT(1);
return y;
}
/* R
table1 <- function(x)
{
if (!is.factor(x)) x <- as.factor(x)
.Call("C_table1", x, PACKAGE="table1")
}
R */
Testing:
csource("~/R/cpackagedemo/inst/examples/table1.c")
table1(c("spam", "bacon", NA, "spam", "eggs", "bacon", "spam", "spam"))
## bacon eggs spam
## 2 1 4
Create a function to compute the most frequently occurring value (mode) in a given factor. Return a character vector. If a mode is ambiguous, return all the possible candidates.
14.3.3. Matrices¶
Matrices (Chapter 11) are flat atomic vectors or lists with
the dim
attribute being an integer vector of length two.
The class
attribute does not have to be set
(but the class function returns matrix
and array
).
Matrices are so important in data analysis that they have been blessed
with a few dedicated functions available at the C level.
C::
Rf_isMatrix tests if a given
object meets the criteria mentioned above.
C::
Rf_allocMatrix(sexptype, nrows, ncols)
allocates a new matrix.
R relies on the Fortran order of matrix elements, i.e.,
it uses the column-major alignment. Let A
be a matrix
with n rows and m columns
(compare C::
Rf_nrows
and C::
Rf_ncols).
Then, the element in the i-th row
and the j-th column is at A[i+n*j]
.
The dimnames
attributes must be handled
manually, though.
Here is a function to compute the transpose of a numeric matrix:
SEXP C_transpose(SEXP x)
{
if (!Rf_isMatrix(x) || !Rf_isReal(x))
Rf_error("`x` must be a real matrix");
size_t n = Rf_nrows(x);
size_t m = Rf_ncols(x);
const double* xp = REAL(x);
SEXP y = PROTECT(Rf_allocMatrix(REALSXP, m, n));
double* yp = REAL(y);
for (size_t i=0; i<n; ++i)
for (size_t j=0; j<m; ++j)
yp[j+m*i] = xp[i+n*j];
SEXP dimnames = Rf_getAttrib(x, R_DimNamesSymbol);
if (!Rf_isNull(dimnames)) {
SEXP tdimnames = PROTECT(Rf_allocVector(VECSXP, 2));
SET_VECTOR_ELT(tdimnames, 0, VECTOR_ELT(dimnames, 1));
SET_VECTOR_ELT(tdimnames, 1, VECTOR_ELT(dimnames, 0));
Rf_setAttrib(y, R_DimNamesSymbol, tdimnames); // set dimnames
UNPROTECT(1);
// dimnames may have the names attribute too (left as an exercise)
}
UNPROTECT(1);
return y;
}
/* R
transpose <- function(x)
{
if (!is.matrix(x)) x <- as.matrix(x)
if (!is.double(x)) x[] <- as.double(x) # preserves attributes
.Call("C_transpose", x, PACKAGE="transpose")
}
R */
Testing:
csource("~/R/cpackagedemo/inst/examples/transpose.c")
transpose(cbind(c(1, 2, 3, 4), c(5, 6, 7, 8)))
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
transpose(Titanic[, "Male", "Adult", ])
## 1st 2nd 3rd Crew
## No 118 154 387 670
## Yes 57 14 75 192
Author a C function named table2 that computes a two-way contingency table.
14.3.4. Data frames¶
Data frames (Chapter 12) are lists
of the S3 class data.frame
featuring, for some n and m,
m vectors of identical lengths n or matrices of n rows.
The character vectors stored in the row.names
and names
attributes
give the n row and m column labels.
We process data frames as ordinary lists. However, assuming we only want to process numeric data, we can extract the quantitative columns and put them inside a matrix at the R level. If element grouping is required, they can be accompanied by a factor or a list of factor variables. In many applications, this strategy is good enough.
14.4. Using existing function libraries¶
14.4.1. Checking for user interrupts¶
Long computations may lead to R’s becoming unresponsive.
The user may always request to cancel the evaluation of
the current expression by pressing Ctrl+C.
To allow the processing of the event queue, we should call
C::
R_CheckUserInterrupt()
occasionally, e.g., in every iteration of a complex for
loop.
Note that R might decide never to return to our function.
Thus, we have to prevent memory leaks, e.g.,
by preferring C::
R_alloc
over C::
malloc.
14.4.2. Generating pseudorandom numbers¶
C::
unif_rand returns
a single pseudorandom deviate from the uniform distribution
on the unit interval. It is the basis for generating numbers
from all other supported distributions (Section 6.7.1 of [65]).
It uses the same pseudorandom generator as we described
in Section 2.1.5. To read and memorise
its seed (the `.Random.seed
` object in the global environment),
we have to call
C::
GetRNGstate()
and
C::
PutRNGstate()
at the beginning and the end of our function, respectively.
Below is a function to generate a pseudorandom bit sequence:
SEXP C_randombits(SEXP n)
{
if (!Rf_isInteger(n) || XLENGTH(n) != 1)
Rf_error("`n` should be a single integer");
int _n = INTEGER(n)[0];
if (_n == NA_INTEGER || _n < 1)
Rf_error("incorrect `n`");
SEXP y = PROTECT(Rf_allocVector(INTSXP, _n));
int* yp = INTEGER(y);
GetRNGstate();
for (int i=0; i<_n; ++i)
yp[i] = (int)(unif_rand()>0.5); // not the best way to sample bits
PutRNGstate();
UNPROTECT(1);
return y;
}
/* R
randombits <- function(n)
{
if (!is.integer(n)) n <- as.integer(n)
.Call("C_randombits", n, PACKAGE="randombits")
}
R */
Let’s play around with it:
csource("~/R/cpackagedemo/inst/examples/randombits.c")
set.seed(123); randombits(10)
## [1] 0 1 0 1 1 0 1 1 1 0
randombits(10)
## [1] 1 0 1 1 0 1 0 0 0 1
set.seed(123); randombits(20)
## [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1
set.seed(123); as.integer(runif(20)>0.5) # it's the same "algorithm"
## [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1
Create a function to compute the most frequently occurring value (mode) in a given factor object. In the case of ambiguity, return a randomly chosen candidate.
14.4.3. Mathematical functions from the R API¶
Section 6.7 of [65] lists the available statistical distribution functions, mathematical routines and constants, and other numerical utilities.
14.4.4. Header files from other R packages (*)¶
A package may use header files from another package.
For this to be possible, it must include the dependency name
in the LinkingTo
field of its DESCRIPTION
file;
see [65] for discussion.
The BH package on CRAN gives access to Boost,
the header-only C++ libraries that define many useful algorithms
and data structures. Create an R package that calls
C++::
boost::
math::
gcd after issuing
the #include <boost/math/common_factor.hpp>
directive.
14.4.5. Specifying compiler and linker flags (**)¶
We can pass arbitrary flags to the compiler or linker,
e.g., to use any library installed on our system.
Basic configuration can be tweaked via Makevars
(or Makevars.win
on Wi***ws), e.g.,
by setting PKG_CFLAGS
or PKG_LIBS
variables.
For maximum portability across different platforms,
which is overall challenging to ensure when we do not wish to exclude
Wi***ws users, we might be required to author custom
configure
(and configure.win
) scripts.
For more information, see [65] which discusses,
amongst others, how to use OpenMP[12] in our projects.
14.5. Exercises¶
Answer the following questions about the C language API for R.
What are the most common
SEXPTYPE
s?How are missing values represented?
How can we check if an
int
is a missing value? What about adouble
?How to prevent
SEXP
s from being garbage-collected?How are character vectors represented? What is the difference between a
CHARSXP
and aSTRSXP
?Why is it better to handle factor objects rather than raw character vectors if we merely would like to define grouping variables?
How are R matrices represented in C? Does R use the C or Fortran order of matrix elements?
How are R data frames handled in C?
Implement the C versions of the rep, seq, rle, match, findInterval, sample, order, unique, and split functions.
(*) Read Writing R Extensions [65] in its entirety.
(*) Download R’s source code from CRAN
or its Subversion (SVN)
repository. Explore the header files in
the src/include
subdirectory. They are part of the callable API.