14. Interfacing compiled code (**)#

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug/typos reports/fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [26].

R is an effective glue language. It is suitable for composing whole data wrangling pipelines: from data import through processing, analysis, and visualisation to export. It makes using and connecting larger building blocks very convenient.

R is also a competent tool for developing quick and dirty prototypes of standalone, general-purpose algorithms, especially if they are of numerical nature. Nevertheless, for performance reasons, we may[1] consider moving computing-intensive tasks to the C or C++ level[2]. It is particularly the case when we need a method that:

  • has higher memory or time complexity than its straightforward implementation when it is programmed using vectorised R functions,

  • has an iterative or recursive nature, e.g., involving unvectorisable for or while loops,

  • relies on complicated dynamic data structures (e.g., hash maps, linked lists, or trees),

  • needs methods provided elsewhere and not available in R (e.g., other C or C++ libraries).

In the current chapter, we will demonstrate that R works very well as a user-friendly interface to compiled code.

This topic is overall very technical. The definitive reference is [62]; but see also Chapter 11 of [10]. Furthermore, R’s source code provides many working examples of how to deal with R objects in C.

Here, we will only cover the most important basics. We will focus on writing or interfacing portable[3] function libraries that only rely on simple[4] data structures (e.g., arrays of the type double and int). Thanks to this, we will be able to reuse them in other environments such as Python (e.g., via Cython) or Julia. Let us remember that R is one of many languages out there.

We assume basic knowledge of the C language; see [38]. The reader can skip this chapter now and return to it later. The remaining material is not contingent on the current one.

From now on, we take for granted that our environment can successfully build a source package with C code, as mentioned in Section 7.3.1.2. In particular, W****ws and m**OS should have, respectively, RTools and Xcode installed.

Note

To avoid ambiguity, in the main text, calls to C functions will be prefixed with “C::”; e.g., C::spanish_inquisition().

14.1. C and C++ code in R#

14.1.1. Source files for compiled code in R packages#

Perhaps the most versatile way to interact with portable C code is via standalone R packages. For the purpose of the current chapter, we created a demo project available at https://github.com/gagolews/cpackagedemo.

Exercise 14.1

Inspect the structure of cpackagedemo. Note that C source files are located in the src/ subdirectory. Build and install the package. Call my_sum in R on some numeric vector.

The package provides an R interface to one function, C::my_c_sum, written in the most portable fashion possible. Its declaration is included in the src/cfuns.h file:

#ifndef __CFUNS_H
#define __CFUNS_H
#include <stddef.h>

double my_c_sum(const double* x, size_t n);

#endif

The function accepts a pointer to the start of a numeric sequence and its size, which is a standard[5] way of representing an array of doubles.

Its definition is given in src/cfuns.c. We see that it is nothing more than a simple sum of all the elements in an array:

#include "cfuns.h"

/* computes the sum of all elements in an array x of size n */
double my_c_sum(const double* x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* this code does not treat potential missing values specially
           (they are kinds of NaNs); to fix this, add:
        if (ISNA(x[i])) return NA_REAL;  // #include <R.h>  */
        s += x[i];
    }
    return s;
}

To make C::my_c_sum available in R, we will have to introduce a wrapper around it that works with the data structures from the first part of this jolly book. We know that an R function accepts objects of any kind on input and yields anything as a result. In the next section, we will explain that we get access to R objects via special pointers of the type SEXP (S expressions).

And so we declare our R-callable wrapper in src/rfuns.h:

#ifndef __RFUNS_H
#define __RFUNS_H
#include <R.h>
#include <Rinternals.h>
#include <Rmath.h>

SEXP my_c_sum_wrapper(SEXP x);

#endif

The actual definition is included in src/rfuns.c:

#include "rfuns.h"
#include "cfuns.h"

/* a wrapper around my_c_sum callable from R */
SEXP my_c_sum_wrapper(SEXP x)
{
    double s;

    if (!Rf_isReal(x)) {
        /* the caller is expected to prepare the arguments
           (doing it at the C level is tedious work) */
        Rf_error("`x` should be a vector of type 'double'");
    }

    /* Exercise: consider missing value (NA) checking
       as my_c_sum doesn't do it */

    s = my_c_sum(REAL(x), (size_t)XLENGTH(x));

    return Rf_ScalarReal(s);
}

The arguments can be, technically speaking, prepared at the C level. For instance, if `x` turned out to be an integer vector, we could convert it to the double one (these are two different types; see Section 6.4.1). Nevertheless, overall, it is very burdensome. It is better to use pure R code to ensure that the arguments are of the correct form and to beautify the outputs.

This explains why we only assert the enjoyment of C::Rf_isReal(x). It guarantees that the C::REAL and C::XLENGTH functions correctly return the pointer to the start of the sequence and its length, respectively.

Once C::my_c_sum is called, we must convert it to an R object. Here, it is a newly allocated numeric vector of length one. It can be done easily by calling C::Rf_ScalarReal.

Although optional (see Section 5.4 of [62]), we will register C::my_c_sum_wrapper as a callable function explicitly. This way, R will not struggle to find the specific entry point in the resulting dynamically linked library (DLL). We do this in src/cpackagedemo.c:

#include <R_ext/Rdynload.h>
#include "rfuns.h"

/* the list of functions available in R via a call to .Call():
   each entry is like {exported_name, fun_pointer, number_of_arguments} */
static const R_CallMethodDef cCallMethods[] = {
    {"my_c_sum_wrapper", (DL_FUNC)&my_c_sum_wrapper, 1},
    {NULL, NULL, 0}  // the end of the list
};

/* registers the list of callable functions */
void R_init_cpackagedemo(DllInfo *dll)
{
    R_registerRoutines(dll, NULL, cCallMethods, NULL, NULL);
    R_useDynamicSymbols(dll, FALSE);
}

The function can be invoked from R using .Call. We do it in R/my_sum.R:

my_sum <- function(x)
{
    # prepare input data:
    if (!is.double(x))
        x <- as.double(x)

    s <- .Call("my_c_sum_wrapper", x, PACKAGE="cpackagedemo")

    # some rather random postprocessing:
    attr(s, "what") <- deparse(substitute(x))
    s
}

And, finally, here is the package NAMESPACE file responsible for registering the exported R names and indicating the DLL to use:

export(my_sum)
useDynLib(cpackagedemo)

Once the package is built and installed (e.g., by running “R CMD INSTALL <pkgdir>” in the terminal), we can test it by calling:

library("cpackagedemo")
my_sum(runif(100)/100)
## [1] 0.49856
## attr(,"what")
## [1] "runif(100)/100"
Exercise 14.2

Extend the package (locally) by adding a function to compute the index of the greatest element in a numeric vector. Note that C uses 0-based array indexing, whereas, in R, the first element is at index 1. Compare its run time against which.max using proc.time.

14.1.2. R CMD SHLIB#

The “R CMD SHLIB <files>” command compiles one or more source files without the need for turning them into standalone packages; see [62]. Then, dyn.load loads the resulting DLL.

Exercise 14.3

(*) Compile src/cfuns.c and src/rfuns.c from our demo package using “R CMD SHLIB <files>”. Call dyn.load. Write an R function that uses .Call to invoke C::my_c_sum_wrapper from the second source file.

The direct SHLIB approach is convenient for learning C programming, including running simple examples. We will thus use it for didactic reasons in this chapter. The inst/examples/csource.R file in our demo package includes the implementation of an R function called csource. It compiles a given C source file, loads the resulting DLL, and extracts and executes a designated R code chunk (preferably defining a function that refers to .Call).

Here is an example source file, inst/examples/helloworld.c in the cpackagedemo source code repository:

// the necessary header files are automatically included by `csource`

SEXP C_hello()
{
    Rprintf("The mill's closed. There's no more work. We're destitute.\n"
        "I'm afraid I've no choice but to sell you all "
        "for scientific experiments.\n");
    return R_NilValue;
}

/* R
# this chunk will be extracted and executed by `csource`.

hello <- function()
    invisible(.Call("C_hello", PACKAGE="helloworld"))

R */

Let us compile it and call the R function defined above.

source("~/R/cpackagedemo/inst/examples/csource.R")  # defines csource
csource("~/R/cpackagedemo/inst/examples/helloworld.c")
hello()
## The mill's closed. There's no more work. We're destitute.
## I'm afraid I've no choice but to sell you all for scientific experiments.
Exercise 14.4

(*) C++ is also supported. It can be thought of as a superset of the C language, but the devil is in the detail. Change the name of the above file to helloworld2.cpp, add “extern "C" before the function declaration, and compile it.

Exercise 14.5

(*) Verify that C and C++ source files can coexist in R packages.

Example 14.6

(*) It might be very educative to study the implementation of csource. By now, we should be able to author such functions ourselves (a few hours’ worth of work), let alone read with understanding.

# compiles a C or C++ source file using R CMD SHLIB,
# loads the resulting DLL, and executes the embedded R code
csource <- function(
    fname,
    libname=regmatches(basename(fname),
        regexpr("[^.]*(?=\\..*)", basename(fname), perl=TRUE)),
    shlibargs=character(),
    headers=paste0(
        "#include <R.h>\n",
        "#include <Rinternals.h>\n",
        "#include <Rmath.h>\n"
    ),
    R=file.path(R.home(), "bin/R")
) {
    stopifnot(file.exists(fname))
    stopifnot(is.character(libname), length(libname) == 1)
    stopifnot(is.character(shlibargs))
    stopifnot(is.character(headers))
    stopifnot(is.character(R), length(R) == 1)

    # read the source file:
    f <- paste(readLines(fname), collapse="\n")

    # set up output file names:
    tmpdir <- normalizePath(tempdir(), winslash="/")  # tempdir on Win uses \
    dynlib_ext <- .Platform[["dynlib.ext"]]
    libpath <- file.path(tmpdir, sprintf("%s%s", libname, dynlib_ext))
    cfname <- file.path(tmpdir, basename(fname))
    rfname <- sub("\\..*?$", ".R", cfname, perl=TRUE)

    # separate the /* R ... <R code> ... R */ chunk from the source file:
    rpart <- regexec("(?smi)^/\\* R\\s?(.*)R \\*/$", f, perl=TRUE)[[1]]
    rpart_start <- rpart
    rpart_len <- attr(rpart, "match.length")
    if (rpart_start[1] < 0 || rpart_len[1] < 0)
        stop("enclose R code between /* R ... and ... R */")

    rcode <- substr(f, rpart_start[2], rpart_start[2]+rpart_len[2]-1)
    cat(rcode, file=rfname, append=FALSE)

    # write the C/C++ file:
    ccode <- paste(
        headers,
        substr(f, 1, rpart_start[1]-1),
        substr(f, rpart_start[1]+rpart_len[1], nchar(f)),
        collapse="\n"
    )
    cat(ccode, file=cfname, append=FALSE)

    # prepare the "R CMD SHLIB ..." command:
    shlibargs <- c(
        "CMD", "SHLIB",
        sprintf("-o %s", libpath),
        cfname,
        shlibargs
    )

    # compile and load the DLL, run the extracted R script:
    retval <- FALSE
    oldwd <- setwd(tmpdir)
    tryCatch({
        if (libpath %in% sapply(getLoadedDLLs(), `[[`, "path"))
            dyn.unload(libpath)
        stopifnot(system2(R, shlibargs) == 0)  # 0 == success
        dyn.load(libpath)
        source(rfname)
        retval <- TRUE
    }, error=function(e) {
        cat(as.character(e), file=stderr())
    })
    setwd(oldwd)

    if (!retval) stop("error compiling file")
    invisible(TRUE)
}

14.2. Handling basic types#

14.2.1. SEXPTYPEs#

All R objects are stored as instances of the C language structure SEXPREC. Usually, we access them via pointers, which are of the type SEXP (S expression).

C functions referred to via .Call take the very generic SEXPs on input. They output some SEXP. One of the said structure’s fields represents the actual R object type (SEXPTYPE numbers); see Table 14.1 for a selection.

Table 14.1 Basic R types in C#

SEXPTYPE

Type in R (typeof)

Test in C

NILSXP

NULL

Rf_isNull(x) (true for R_NilValue only)

RAWSXP

raw

TYPEOF(x) == RAWSXP

LGLSXP

logical

Rf_isLogical(x)

INTSXP

integer

Rf_isInteger(x)

REALSXP

double

Rf_isReal(x)

CPLXSXP

complex

Rf_isComplex(x)

STRSXP

character

Rf_isString(x)

VECSXP

list

Rf_isVectorList(x)

CHARSXP

char (scalar string; internal)

TYPEOF(x) == CHARSXP

EXTPTRSXP

externalptr (internal)

TYPEOF(x) == EXTPTRSXP

Example 14.7

To illustrate that any R object is available as a SEXP, let us consider inst/examples/sexptype.c from cpackagedemo:

SEXP C_test_sexptype(SEXP x)
{
    Rprintf("type of x: %s (SEXPTYPE=%d)\n",
        Rf_type2char(TYPEOF(x)),
        (int)TYPEOF(x)
    );
    return R_NilValue;
}

/* R
test_sexptype <- function(x)
    invisible(.Call("C_test_sexptype", x, PACKAGE="sexptype"))
R */

Example calls:

csource("~/R/cpackagedemo/inst/examples/sexptype.c")
test_sexptype(1:10)
## type of x: integer (SEXPTYPE=13)
test_sexptype(NA)
## type of x: logical (SEXPTYPE=10)
test_sexptype("spam")
## type of x: character (SEXPTYPE=16)

We should refer to particular SEXPTYPEs via their descriptive names (constants; e.g., STRSXP), not their numeric identifiers (e.g., 16); see Section 1.1 of [65] for the complete list[6].

14.2.2. Accessing elements in simple atomic vectors#

Table 14.2 gives the most important vector-like SEXPTYPEs (atomic and generic), their corresponding C types and the functions to access the underlying array pointers. A call to C::XLENGTH returns the length of a given sequence.

We have already seen an example function that processes a numeric vector; see C::my_c_sum_wrapper above.

Let us stress that writing functions that accept only int and double array pointers and their lengths makes them easily reusable in other environments. In many data analysis applications, we do not need much more.

Table 14.2 Basic array-like R types and their elements in C#

SEXPTYPE

Array element type

Pointer access

RAWSXP

typedef unsigned char Rbyte;

RAW(x)

LGLSXP

int (use the FALSE, TRUE, and NA_LOGICAL constants)

LOGICAL(x)

INTSXP

int

INTEGER(x)

REALSXP

double

REAL(x)

CPLXSXP

typedef struct { double r; double i; } Rcomplex;

COMPLEX(x)

STRSXP

SEXP (array of SEXPs of type CHARSXP)

(not directly)

VECSXP

SEXP (array of SEXPs of any SEXPTYPE)

(not directly)

CHARSXP

const char* (read-only; trailing 0; check encoding)

CHAR(x)

Important

With raw, logical, integer, floating-point, and complex vectors, we get direct access to data that might be shared amongst many objects (compare Section 16.1.4). SEXPRECs are simply passed by pointers (since SEXPs are exactly them). We must thus refrain[7] from modifying the objects passed as function parameters. See below for ways to create new vectors, e.g., for storing auxiliary or return values.

Example 14.8

Consider inst/examples/sharedmem.c:

SEXP C_test_sharedmem(SEXP x)
{
    if (!Rf_isReal(x) || XLENGTH(x) == 0)
        Rf_error("`x` should be a non-empty vector of type 'double'");

    REAL(x)[0] = REAL(x)[0]+1;  // never do it; always make a copy;
            // the underlying array `x` may be shared by many objects

    return R_NilValue;
}

/* R
test_sharedmem <- function(x)
    invisible(.Call("C_test_sharedmem", x, PACKAGE="sharedmem"))
R */

And now, calling the above function on an example vector:

csource("~/R/cpackagedemo/inst/examples/sharedmem.c")
y <- 1
z <- y
test_sharedmem(y)
print(z)
## [1] 2

modifies y and z in place! It is not the same semantics as the one we are used to when interacting with R. Hence, we must always make a copy.

14.2.3. Representation of missing values#

Most languages do not support the notion of missing values out of the box. Therefore, in R, they have to be emulated. Table 14.3 lists the relevant constants and the conventional ways for testing for missingness.

Table 14.3 Representation of missing values#

SEXPTYPE

Missing value

Testing

RAWSXP

(none)

(none)

LGLSXP

NA_LOGICAL (equal to INT_MIN)

el == NA_LOGICAL

INTSXP

NA_INTEGER (equal to INT_MIN)

el == NA_INTEGER

REALSXP

NA_REAL (a special NaN)

ISNA(el)

CPLXSXP

a pair of NA_REALs

ISNA(el.r)

STRSXP

NA_STRING (a CHARSXP object)

el == NA_STRING

In logical and integer vectors, NAs are represented as the smallest 32-bit signed integer. Thus, we need to be careful when performing any operations on these types: testing for missingness must be performed first.

The case of doubles is slightly less irksome, for a missing value is represented as a special not-a-number. Many arithmetic operations on NaNs return NaNs as well, albeit there is no guarantee[8] that they will be of precisely the same type as NA_REAL. Thus, manual testing for the missingness is also advised.

Example 14.9

The inst/examples/mean_naomit.c file defines a function to compute the arithmetic mean of an int or a double vector:

SEXP C_mean_naomit(SEXP x)
{
    double ret = 0.0;
    size_t k = 0;

    if (Rf_isInteger(x)) {
        const int* xp = INTEGER(x);
        size_t n = XLENGTH(x);
        for (size_t i=0; i<n; ++i)
            if (xp[i] != NA_INTEGER) {  // NOT: ISNA(xp[i])
                ret += (double)xp[i];
                k++;
            }
    }
    else if (Rf_isReal(x)) {
        const double* xp = REAL(x);
        size_t n = XLENGTH(x);
        for (size_t i=0; i<n; ++i)
            if (!ISNA(xp[i])) {  // NOT: xp[i] == NA_REAL
                ret += xp[i];
                k++;
            }
    }
    else
        Rf_error("`x` should be a numeric vector");

    return Rf_ScalarReal((k>0)?(ret/(double)k):NA_REAL);
}

/* R
mean_naomit <- function(x)
{
    if (!is.numeric(x))  # neither integer nor double
        x <- as.numeric(x)  # convert to double
    .Call("C_mean_naomit", x, PACKAGE="mean_naomit")
}
R */

There is some inherent code duplication: int and double are distinct types, and they need to be handled separately. Some tests:

csource("~/R/cpackagedemo/inst/examples/mean_naomit.c")
mean_naomit(c(1L, NA_integer_, 3L, NA_integer_, 5L))
## [1] 3
mean_naomit(rep(NA_real_, 10))
## [1] NA
Exercise 14.10

Implement all and any in C.

14.2.4. Memory allocation#

R features a simple yet effective garbage collector that relies on reference counting. Occasionally[9], memory blocks deemed no longer reachable are either freed or marked as reusable.

To allocate a new vector of length one and set its only element, we can call C::ScalarLogical, C::ScalarInteger, C::ScalarReal, etc. We have already used these functions for returning R “scalars”.

Vectors of arbitrary lengths can be created using C::Rf_allocVector(sexptype, size). Note that this function does not initialise the elements of logical and numeric sequences (amongst others). They will need to be set manually after creation.

Important

We must protect all allocated vectors from garbage collection when we use them inside our functions. For this purpose, R maintains a stack[10] of objects guarded against premature annihilation. C::PROTECT(sexp) pushes a given object pointer onto the top of the list. C::UNPROTECT(n) pops the last n elements therefrom (in a last-in-first-out manner). At the end of a .Call, R checks if the number of protects matches that of un-protects. If it is not the case, a warning is generated.

Protection is not needed:

  • for arguments to functions referred to by .Call, as they are already in use and hence protected;

  • for objects assigned as list or character vectors’ elements using C::SET_VECTOR_ELT and C::SET_STRING_ELT (see below); when the container is protected, so are its components;

  • when we return the allocated vector to R immediately after creating it (like in return Rf_ScalarReal(val) in a C function invoked by .Call).

Example 14.11

Here is a function to compute the square of each element in a numeric vector. Note that the new vector must be protected from garbage collection while data are being prepared.

SEXP C_square1(SEXP x)
{
    // no need to call PROTECT(x), it is already in use
    if (!Rf_isReal(x)) Rf_error("`x` should be a vector of type 'double'");

    size_t n = XLENGTH(x);
    const double* xp = REAL(x);

    SEXP y = PROTECT(Rf_allocVector(REALSXP, n));  // won't be GC'd
    double* yp = REAL(y);

    for (size_t i=0; i<n; ++i) {
        if (ISNA(xp[i])) yp[i] = xp[i];
        else             yp[i] = xp[i]*xp[i];
    }

    UNPROTECT(1);  // pops 1 object from the protect stack;
        // does not trigger garbage collection, so we can return `y` now
    return y;
}

/* R
square1 <- function(x)
{
    if (!is.double(x)) x <- as.double(x)
    .Call("C_square1", x, PACKAGE="square1")
}
R */

Alternatively, in this case, we may use C::Rf_duplicate:

SEXP C_square2(SEXP x)
{
    if (!Rf_isReal(x)) Rf_error("`x` should be a vector of type 'double'");

    x = PROTECT(Rf_duplicate(x));  // OK; it's a pointer

    size_t n = XLENGTH(x);
    double* xp = REAL(x);
    for (size_t i=0; i<n; ++i)
        if (!ISNA(xp[i])) xp[i] = xp[i]*xp[i];

    UNPROTECT(1);

    return x;
}

/* R
square2 <- function(x)
{
    if (!is.double(x)) x <- as.double(x)
    .Call("C_square2", x, PACKAGE="square2")
}
R */

Some tests:

csource("~/R/cpackagedemo/inst/examples/square1.c")
square1(c(-2, -1, 0, 1, 2, 3, 4, NA_real_))
## [1]  4  1  0  1  4  9 16 NA
csource("~/R/cpackagedemo/inst/examples/square2.c")
square2(c(-2, -1, 0, 1, 2, 3, 4, NA_real_))
## [1]  4  1  0  1  4  9 16 NA

We can claim some auxiliary memory from the heap during a function’s runtime using the well-known C::malloc (or new in C++). We are of course fully responsible for releasing it via C::free (or delete).

There is also C::R_alloc(n, size) which allocates n*size bytes. The memory will automatically be garbage-collected at the end of a .Call.

Example 14.12

Here is our version of the which function.

SEXP C_which1(SEXP x)
{
    if (!Rf_isLogical(x)) Rf_error("`x` should be of type 'logical'");

    size_t n = XLENGTH(x), i, k;
    const int* xp = LOGICAL(x);

    size_t* d = (size_t*)malloc(n*sizeof(size_t));  // conservative size
    if (!d) Rf_error("memory allocation error");

    for (i=0, k=0; i<n; ++i)
        if (xp[i] != NA_LOGICAL && xp[i])
            d[k++] = i;

    SEXP y = PROTECT(Rf_allocVector(REALSXP, k));
    double* yp = REAL(y);  // yes, type is double - ready for long vectors
    for (i=0; i<k; ++i)
        yp[i] = (double)d[i]+1;  // R uses 1-based indexing
    UNPROTECT(1);

    free(d);  // we don't want any memory leaks!

    return y;
}

/* R
which1 <- function(x)
{
    if (!is.logical(x)) x <- as.logical(x)
    .Call("C_which1", x, PACKAGE="which1")
}
R */

Some tests:

csource("~/R/cpackagedemo/inst/examples/which1.c")
which1(c(TRUE, FALSE, TRUE, NA, TRUE))
## [1] 1 3 5

Note that R’s which returns either an int or a double vector, depending on the size of the input vector (whether it is shorter than \(2^{31}-1\)). As an exercise, rewrite the above to take that into account: integer arithmetic is faster.

Example 14.13

(*) If we do not like that we are potentially wasting too much memory in the case of sparse logical vectors, we can rely on dynamically growable arrays. Below is a C++ rewrite of the above function using deque (double-ended queue) from the language’s standard library.

#include <deque>

extern "C" SEXP C_which2(SEXP x)
{
    if (!Rf_isLogical(x)) Rf_error("`x` should be of type 'logical'");

    size_t n = XLENGTH(x), i;
    const int* xp = LOGICAL(x);

    std::deque<size_t> d;
    for (i=0; i<n; ++i)
        if (xp[i] != NA_LOGICAL && xp[i])
            d.push_back(i);

    SEXP y = PROTECT(Rf_allocVector(REALSXP, d.size()));
    double* yp = REAL(y);  // ready for long vectors
    i=0;
    for (size_t k : d)
        yp[i++] = (double)k+1;  // R uses 1-based indexing
    UNPROTECT(1);

    return y;
}

/* R
which2 <- function(x)
{
    if (!is.logical(x)) x <- as.logical(x)
    .Call("C_which2", x, PACKAGE="which2")
}
R */

Example calls:

csource("~/R/cpackagedemo/inst/examples/which2.cpp")
x <- (runif(10) > 0.5)
stopifnot(which(x) == which1(x))
stopifnot(which(x) == which2(x))

Alternatively, we could use C::realloc to extend an initially small buffer allocated using C::malloc by, say, 50% whenever it is about to overflow.

14.2.5. Lists#

For safety reasons[11], we do not get access to the underlying pointers in lists and character vectors. List items can be read by calling C::VECTOR_ELT(x, index) and can be set with C::SET_VECTOR_ELT(x, index, newval).

Note that lists (VECSXPs) are comprised of SEXPs of any type. Hence, after extracting an element, its SEXPTYPE needs to be tested using one of the functions listed in Table 14.1. This is tiresome.

Example 14.14

Here is a rather useless function that fetches the first and the last element in a given numeric vector or a list. However, if the latter case, we apply the function recursively on all its elements.

SEXP C_firstlast(SEXP x)
{
    if (!Rf_isVector(x) || XLENGTH(x) == 0)
        Rf_error("`x` must be a non-empty vector (atomic or generic)");
    else if (Rf_isReal(x)) {
        SEXP y = PROTECT(Rf_allocVector(REALSXP, 2));
        REAL(y)[0] = REAL(x)[0];             // first
        REAL(y)[1] = REAL(x)[XLENGTH(x)-1];  // last
        UNPROTECT(1);
        return y;
    }
    else if (Rf_isVectorList(x)) {
        SEXP y = PROTECT(Rf_allocVector(VECSXP, 2));
        // VECTOR_ELT(x, i) is PROTECTed by the container;
        // SET_VECTOR_ELT does not trigger GC, so no need to call PROTECT
        // on the result of C_firstlast(...) in this context
        SET_VECTOR_ELT(y, 0, C_firstlast(VECTOR_ELT(x, 0)));
        SET_VECTOR_ELT(y, 1, C_firstlast(VECTOR_ELT(x, XLENGTH(x)-1)));
        UNPROTECT(1);
        return y;
    }
    else
        Rf_error("other cases left as an exercise");

    return R_NilValue;  // avoid compiler warning
}

/* R
firstlast <- function(x)
    .Call("C_firstlast", x, PACKAGE="firstlast")
R */

Testing:

csource("~/R/cpackagedemo/inst/examples/firstlast.c")
firstlast(c(1, 2, 3))
## [1] 1 3
firstlast(list(c(1, 2, 3), c(4, 5), 6))
## [[1]]
## [1] 1 3
##
## [[2]]
## [1] 6 6
firstlast(list(c(1, 2, 3), 4, 5, list(6, c(7, 8), c(9, 10, 11))))
## [[1]]
## [1] 1 3
##
## [[2]]
## [[2]][[1]]
## [1] 6 6
##
## [[2]][[2]]
## [1]  9 11
Exercise 14.15

Implement a C function that returns the longest vector in a given list. Use C::Rf_isVector to check whether a given object is an atomic or a generic vector, and hence if C::XLENGTH can be called thereon.

Exercise 14.16

Inscribe your version of unlist. Consider scanning the input list twice. First, determine the size of the output vector. Second, fill the return object with un-listed values.

Exercise 14.17

Write a C function that takes a list of numeric vectors of identical lengths on input. Return their elementwise sum: the first element of the output should be the sum of the first elements in every vector, and so forth.

14.2.6. Character vectors and individual strings (*)#

Character vectors (STRSXPs) are similar to VECSXPs except that they only feature individual strings. The latter are represented using a separate data type, CHARSXP. C::STRING_ELT(x, index) and C::SET_STRING_ELT(x, index, newval) work as the element getters and setters.

Important

If we are not interested in text processing but rather in handling categorical data or defining grouping variables, we should consider converting character vectors to factors before issuing a .Call. Comparing small integers is much faster than strings; see below for more details.

Because of R’s string cache, there are no duplicate strings in the memory. However, this feature could only be guaranteed by making data in CHARSXPs read-only. We can access the underlying const char* pointer by calling C::CHAR(s). As typical in C, a string is terminated by byte 0.

Note

R strings may be of different encodings; compare Section 6.1.1. For portability and peace of mind, it is best to preprocess the arguments to .Call using enc2utf8, which converts all strings to UTF-8[12].

Despite being the most universal encoding, UTF-8 does not represent a single code point using a fixed number of bytes. For instance, computing the string length requires iterating over all its elements. For CHARSXPs, C::XLENGTH returns the number of bytes, not including the trailing 0.

It is thus best to leave the processing of strings to the dedicated libraries, e.g., ICU or rely on functions from the stringi package [27] at the R level.

C strings can be converted to CHARSXPs by calling C::Rf_mkCharCE(stringbuf, CE_UTF8) or C::Rf_mkCharLenCE(stringbuf, buflen, CE_UTF8). If we are sure that a string is in ASCII (a subset of UTF-8), we can also call C::Rf_mkChar(stringbuf).

We should never return CHARSXPs as results to R. They are for internal use only. They must be wrapped inside a character vector, e.g., using C::Rf_ScalarString.

14.2.7. Calling R functions from C (**)#

Section 5.11 of [62] discusses ways to call R functions in C. To understand them, we will first need to study the material from the remaining chapters of our book, i.e., environments and the related evaluation model. They can be useful, e.g., when calling optimisation algorithms implemented in C on objective functions written in R.

14.2.8. External pointers (**)#

There is a separate basic R type for storing arbitrary C pointers, externalptr (SEXPTYPE of EXTPTRSXP); see Section 5.13 of [62] for more details.

We can use them to maintain dynamic data structures or resource handlers between calls to R functions. The problem with these is that pointers are passed as… pointers. They can easily break R’s pass-by-value semantics, where changes to the state of the referenced object will be visible outside the call.

Furthermore, pointers are not serialisable. They cannot be saved for use in another R session.

Example 14.18

(**) inst/examples/stack.cpp provides a C++ implementation of the stack data structure, being a last-in-first-out container of arbitrary R objects:

#include <deque>

class S : public std::deque<SEXP>
{
    public: ~S()
    {   // destructor: release all SEXPs so that they can be GC'd
        while (!this->empty()) {
            SEXP obj = this->front();
            this->pop_front();
            R_ReleaseObject(obj);
        }
    }
};

S* get_stack_pointer(SEXP s, bool check_zero=true)  // internal function
{
    if (TYPEOF(s) != EXTPTRSXP)
        Rf_error("not an external pointer");

    SEXP tag = R_ExternalPtrTag(s);  // our convention, this can be anything
    if (TYPEOF(tag) != CHARSXP || strcmp(CHAR(tag), "stack") !=0 )
        Rf_error("not a stack");

    S* sp = (S*)R_ExternalPtrAddr(s);
    if (check_zero && !sp)
        Rf_error("address is 0");

    return sp;
}

void stack_finaliser(SEXP s)  // internal function
{
    // called during garbage collection
    S* sp = get_stack_pointer(s, false);
    if (sp) {
        delete sp;  // destruct S, release SEXPs
        R_ClearExternalPtr(s);
    }
}

extern "C" SEXP C_stack_create()
{
    S* sp = new S();  // stack pointer
    SEXP s = PROTECT(
        R_MakeExternalPtr((void*)sp, /*tag*/mkChar("stack"), R_NilValue)
    );
    R_RegisterCFinalizerEx(s, stack_finaliser, TRUE);  // auto-called on GC
    UNPROTECT(1);
    return s;
}

extern "C" SEXP C_stack_empty(SEXP s)
{
    S* sp = get_stack_pointer(s);
    return Rf_ScalarLogical(sp->empty());
}

extern "C" SEXP C_stack_push(SEXP s, SEXP obj)
{
    S* sp = get_stack_pointer(s);
    R_PreserveObject(obj);
    sp->push_front(obj);
    return R_NilValue;
}

extern "C" SEXP C_stack_pop(SEXP s)
{
    S* sp = get_stack_pointer(s);
    if (sp->empty())
        Rf_error("stack is empty");
    SEXP obj = sp->front();
    sp->pop_front();
    R_ReleaseObject(obj);
    return obj;
}

/* R
stack_create <- function()
    .Call("C_stack_create", PACKAGE="stack")

stack_empty <- function(s)
    .Call("C_stack_empty", s, PACKAGE="stack")

stack_push <- function(s, obj)
    .Call("C_stack_push", s, obj, PACKAGE="stack")

stack_pop <- function(s)
    .Call("C_stack_pop", s, PACKAGE="stack")
R */

Note how we preserve R objects from garbage collection.

Some tests:

csource("~/R/cpackagedemo/inst/examples/stack.cpp")
s <- stack_create()
print(s)
## <pointer: 0x55fbff912210>
typeof(s)
## [1] "externalptr"
for (i in c("one", "two", "Spanish Inquisition"))
    stack_push(s, i)
while (!stack_empty(s))
    print(stack_pop(s))
## [1] "Spanish Inquisition"
## [1] "two"
## [1] "one"

14.3. Dealing with compound types#

14.3.1. Reading and setting attributes#

From Chapter 10, we know that compound types such as matrices, factors, or data frames are represented using basic data structures. Usually, they are atomic vectors of lists organised in a predefined manner.

C::Rf_getAttrib(x, attrname) and C::Rf_setAttrib(x, attrname, newval) gets and sets specific attributes of an object `x`. Their second argument, `attrname`, should be a one-element STRSXP. For convenience, the constants R_ClassSymbol, R_DimNamesSymbol, R_DimSymbol, R_LevelsSymbol, R_NamesSymbol, and R_RowNamesSymbol can be used instead of the STRSXP versions of the "class", "dimnames", "dim", "levels", "names", and "row.names" strings.

Example 14.19

Consider a function for testing whether a given object is of a specific class:

#include <string.h>

SEXP C_isofclass(SEXP x, SEXP class)
{
    if (!Rf_isString(class) && XLENGTH(class) != 1)
        Rf_error("`class` must be a single string");

    if (!OBJECT(x))  // is the class attribute set at all?
        return Rf_ScalarLogical(FALSE);

    SEXP xclass = Rf_getAttrib(x, R_ClassSymbol);  // STRSXP (guaranteed)
    const char* c = CHAR(STRING_ELT(class, 0));  // class arg as a C string
    size_t n = XLENGTH(xclass);
    for (size_t i=0; i<n; ++i)
        if (strcmp(CHAR(STRING_ELT(xclass, i)), c) == 0)
            return Rf_ScalarLogical(TRUE);

    return Rf_ScalarLogical(FALSE);
}

/* R
isofclass <- function(x, class)
    .Call("C_isofclass", x, class, PACKAGE="isofclass")
R */

Some tests:

csource("~/R/cpackagedemo/inst/examples/isofclass.c")
isofclass(Sys.time(), "POSIXct")
## [1] TRUE
isofclass(cbind(1:5, 11:15), "matrix")
## [1] FALSE

Note that a matrix has an implicit class (reported by the class function), but its class attribute does not have to be set. Hence the negative result.

Example 14.20

Write a function that fetches a particular named element in a list.

14.3.2. Factors#

Factors (Section 10.3.2) are represented as integer vectors with elements in the set {1, 2, …, k, NA_integer_} for some k. They are equipped with the levels attribute, being a character vector of length k. Their class attribute is set to "factor".

Example 14.21

An example implementation of a function to compute the number of occurrences of each factor level is given below.

SEXP C_table1(SEXP x)
{
    if (!Rf_isFactor(x)) Rf_error("`x` is not an object of class 'factor'");

    size_t n = XLENGTH(x);
    const int* xp = INTEGER(x);  // x is INTSXP

    SEXP levels = Rf_getAttrib(x, R_LevelsSymbol);  // levels is STRSXP
    size_t k = XLENGTH(levels);

    SEXP y = PROTECT(Rf_allocVector(REALSXP, k));
    double* yp = REAL(y);
    for (size_t i=0; i<k; ++i)
        yp[i] = 0.0;
    for (size_t j=0; j<n; ++j) {
        if (xp[j] != NA_INTEGER) {
            if (xp[j] < 1 || xp[j] > k)
                Rf_error("malformed factor");  // better safe than sorry
            yp[xp[j]-1] += 1.0;  // levels are 1..k, but C needs 0..k-1
        }
    }

    Rf_setAttrib(y, R_NamesSymbol, levels);  // set names attribute
    UNPROTECT(1);
    return y;
}

/* R
table1 <- function(x)
{
    if (!is.factor(x)) x <- as.factor(x)
    .Call("C_table1", x, PACKAGE="table1")
}
R */

Testing:

csource("~/R/cpackagedemo/inst/examples/table1.c")
table1(c("spam", "bacon", NA, "spam", "eggs", "bacon", "spam", "spam"))
## bacon  eggs  spam
##     2     1     4
Exercise 14.22

Create a function to compute the most frequently occurring value (mode) in a given factor. Return a character vector. If a mode is ambiguous, return all the possible candidates.

14.3.3. Matrices#

Matrices (Chapter 11) are flat atomic vectors or lists with the dim attribute being a vector of length two. The class attribute does not have to be set (but the class function returns "matrix" and "array").

Matrices are so important in data analysis that they have been blessed with a few dedicated functions available at the C level. C::Rf_isMatrix tests if a given object meets the criteria mentioned above.

R relies on the Fortran order of matrix elements: they are aligned columnwisely. Let A be a matrix with n rows and m columns (compare C::Rf_nrows and C::Rf_ncols). Then, the element in the i-th row and the j-th column is at A[i+n*j].

C::Rf_allocMatrix(sexptype, n, m) allocates a new matrix. The dimnames attributes must be handled manually, though.

Example 14.23

Here is a function to compute the transpose of a given numeric matrix:

SEXP C_transpose(SEXP x)
{
    if (!Rf_isMatrix(x) || !Rf_isReal(x))
        Rf_error("`x` must be a real matrix");

    size_t n = Rf_nrows(x);
    size_t m = Rf_ncols(x);
    const double* xp = REAL(x);

    SEXP y = PROTECT(Rf_allocMatrix(REALSXP, m, n));
    double* yp = REAL(y);
    for (size_t i=0; i<n; ++i)
        for (size_t j=0; j<m; ++j)
            yp[j+m*i] = xp[i+n*j];

    SEXP dimnames = Rf_getAttrib(x, R_DimNamesSymbol);
    if (!Rf_isNull(dimnames)) {
        SEXP tdimnames = PROTECT(Rf_allocVector(VECSXP, 2));
        SET_VECTOR_ELT(tdimnames, 0, VECTOR_ELT(dimnames, 1));
        SET_VECTOR_ELT(tdimnames, 1, VECTOR_ELT(dimnames, 0));
        Rf_setAttrib(y, R_DimNamesSymbol, tdimnames);  // set dimnames
        UNPROTECT(1);
        // dimnames might have the names attribute too (left as an exercise)
    }
    UNPROTECT(1);
    return y;
}

/* R
transpose <- function(x)
{
    if (!is.matrix(x)) x <- as.matrix(x)
    .Call("C_transpose", x, PACKAGE="transpose")
}
R */

Testing:

csource("~/R/cpackagedemo/inst/examples/transpose.c")
transpose(cbind(c(1, 2, 3, 4), c(5, 6, 7, 8)))
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
transpose(Titanic[, "Male", "Adult", ])
##     1st 2nd 3rd Crew
## No  118 154 387  670
## Yes  57  14  75  192
Exercise 14.24

Author a C function named table2 that computes a two-way contingency table.

14.3.4. Data frames#

Data frames (Chapter 12) are lists of m vectors of identical lengths n or matrices of n rows for some n and m. The character vectors stored in the row.names and names attributes give the n row and m column labels. They are objects of S3 class "data.frame".

We process data frames as ordinary lists. However, in many applications, it is enough to write functions that accept ordinary numeric matrices instead. If element grouping is required, they can be accompanied by a factor or a list of factor variables. Do not complicate our lives beyond necessity.

14.4. Using existing function libraries#

14.4.1. Checking for user interrupts#

Long computations may lead to R’s becoming unresponsive. For instance, the user may request to cancel the evaluation of the current expression by pressing Ctrl+C.

To process the event queue, we should occasionally call C::R_CheckUserInterrupt(), e.g., in every iteration of a complex for loop. Note that R might decide never to return to our function. Thus, we have to prevent memory leaks, e.g., by preferring C::R_alloc over C::malloc.

14.4.2. Generating pseudorandom numbers#

C::unif_rand returns a single pseudorandom deviate from the uniform distribution on the unit interval. It is the basis for generating numbers from all other supported distributions (Section 6.7.1 of [62]).

It uses the same pseudorandom generator as we described in Section 2.1.5. To read and memorise its seed (the `.Random.seed` object in the global environment), we have to call C::GetRNGstate() and C::PutRNGstate() at the beginning and the end of our function, respectively.

Example 14.25

Below is a function to generate a pseudorandom bit sequence:

SEXP C_randombits(SEXP n)
{
    if (!Rf_isInteger(n) || XLENGTH(n) != 1)
        Rf_error("`n` should be a single integer");

    size_t _n = INTEGER(n)[0];
    if (_n == NA_INTEGER || _n < 1)
        Rf_error("incorrect `n`");

    SEXP y = PROTECT(Rf_allocVector(INTSXP, _n));
    int* yp = INTEGER(y);

    GetRNGstate();
    for (size_t i=0; i<_n; ++i)
        yp[i] = (int)(unif_rand()>0.5);
    PutRNGstate();

    UNPROTECT(1);

    return y;
}

/* R
randombits <- function(n)
{
    if (!is.integer(n)) n <- as.integer(n)
    .Call("C_randombits", n, PACKAGE="randombits")
}
R */

Let us play with the above:

csource("~/R/cpackagedemo/inst/examples/randombits.c")
set.seed(123); randombits(10)
##  [1] 0 1 0 1 1 0 1 1 1 0
randombits(10)
##  [1] 1 0 1 1 0 1 0 0 0 1
set.seed(123); randombits(20)
##  [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1
set.seed(123); as.integer(runif(20)>0.5)  # it's the same "algorithm"
##  [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1
Exercise 14.26

Create a function to compute the most frequently occurring value (mode) in a given factor object. In the case of ambiguity, return a randomly chosen candidate.

14.4.3. Mathematical functions from the R API#

Section 6.7 of [62] lists the available statistical distribution functions, mathematical routines and constants, and other numerical utilities.

14.4.4. Header files from other R packages (*)#

A package may use header files from another package. For this to be possible, it must include the dependency name in the LinkingTo field of its DESCRIPTION file; see [62] for discussion.

Exercise 14.27

The BH package on CRAN gives access to Boost, which are header-only C++ libraries defining many algorithms and data structures. Create an R package that calls C::boost::math::gcd after issuing the #include <boost/math/common_factor.hpp> directive.

14.4.5. Specifying compiler and linker flags (**)#

We can pass arbitrary flags to the compiler or linker, e.g., to use any library installed on our system.

Basic configuration can be set via Makevars (or Makevars.win on W****ws), e.g., by setting PKG_CFLAGS or PKG_LIBS variables.

For maximum portability across different platforms, which is overall challenging to ensure if we do not wish to exclude W****ws users, we might be required to write our own configure (and configure.win) scripts.

For more information, see [62]. In particular, it discusses how to use OpenMP[13] in our projects.

14.5. Exercises#

Exercise 14.28

Answer the following questions:

  • What are the most common SEXPTYPEs?

  • How are missing values represented?

  • How can we check if an integer is a missing value? What about a floating-point variable?

  • How to prevent SEXPs from being garbage-collected?

  • How are character vectors represented? What is the difference between a CHARSXP and a STRSXP?

  • Why is it better to process factor objects than raw character vectors if we merely would like to define grouping variables?

  • How are R matrices handled in C? Does R use the C or Fortran order of matrix elements?

  • How are R data frames handled in C?

Exercise 14.29

Implement the C versions of the rep, seq, rle, match, findInterval, sample, order, unique, and split functions.

Exercise 14.30

(*) Read [62] in its entirety.

Exercise 14.31

(*) Download R’s source code from CRAN or its Subversion (SVN) repository. Explore the header files in the src/include/ subdirectory. They are part of the callable API.