14. Interfacing compiled code (**)

This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Minimalist Data Wrangling with Python [27], too.

R is an effective glue language. It is suitable for composing whole data wrangling pipelines: from data import through processing, analysis, and visualisation to export. It makes using and connecting larger building blocks very convenient. R is also a competent tool for developing usable implementations of standalone, general-purpose algorithms, especially if they are of numerical nature. Nevertheless, for performance reasons, we may consider rewriting computing-intensive tasks in C or C++[1]. Such a move can be beneficial if we need a method that:

  • has higher memory or time complexity when programmed using vectorised R functions than its straightforward implementation,

  • has an iterative or recursive nature, e.g., involving unvectorisable for or while loops,

  • relies on complicated dynamic data structures (e.g., hash maps, linked lists, or trees),

  • needs methods provided elsewhere and not available in R (e.g., other C or C++ libraries).

In the current chapter, we will demonstrate that R works very well as a user-friendly interface to compiled code.

As the topic is overall very technical, we will only cover the most important rudiments. We will focus on writing or interfacing portable[2] function libraries that only rely on simple[3] data structures (e.g., arrays of the type double and int). Thanks to this, we will be able to reuse them in other environments such as Python (e.g., via Cython) or Julia (remember that R is one of many languages out there). For those who are interested in specifics, the definitive reference is the Writing R Extensions manual [65], but see also Chapter 11 of [10]. Furthermore, R’s source code provides many working examples of how to deal with R objects in C.

We assume some knowledge of the C language; see [38]. The reader can skip this chapter now and return to it later: the remaining material is not contingent on the current one. Otherwise, from now on, we take for granted that our environment can successfully build a source package with C code, as mentioned in Section 7.3.1.2. In particular, Wi***ws and m**OS users should install, respectively, RTools and Xcode.

Note

To avoid ambiguity, in the main text, calls to C functions will be denoted by the “C::” prefix, e.g., C::spanish_inquisition().

14.1. C and C++ code in R

14.1.1. Source files for compiled code in R packages

Perhaps the most versatile way to interact with portable C code is via standalone R packages. For the purpose of the current chapter, we created a demo project available at https://github.com/gagolews/cpackagedemo.

Exercise 14.1

Inspect the structure of cpackagedemo. Note that C source files are located in the src/ subdirectory. Build and install the package using install.packages or the “R CMD INSTALL” command. Then, load the package in R and call my_sum defined there on some numeric vector.

The package provides an R interface to one C function, C::my_c_sum, written in the most portable fashion possible. Its declaration is included in the src/cfuns.h file:

#ifndef __CFUNS_H
#define __CFUNS_H
#include <stddef.h>

double my_c_sum(const double* x, size_t n);

#endif

The function accepts a pointer to the start of a numeric sequence and its size, which is a standard[4] way of representing an array of doubles. Its definition is given in src/cfuns.c. We see that it is nothing more than a simple sum of all the elements in an array:

#include "cfuns.h"

/* computes the sum of all elements in an array x of size n */
double my_c_sum(const double* x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* this code does not treat potential missing values specially
           (they are kinds of NaNs); to fix this, add:
        if (ISNA(x[i])) return NA_REAL;  // #include <R.h>  */
        s += x[i];
    }
    return s;
}

To make C::my_c_sum available in R, we have to introduce a wrapper around it that works with the data structures from the first part of this jolly book. We know that an R function accepts objects of any kind on input and yields anything as a result. In the next section, we will explain that we get access to R objects via special pointers of the type SEXP (S expressions). Thus, let us declare our R-callable wrapper in src/rfuns.h:

#ifndef __RFUNS_H
#define __RFUNS_H
#include <R.h>
#include <Rinternals.h>
#include <Rmath.h>

SEXP my_c_sum_wrapper(SEXP x);

#endif

The actual definition is included in src/rfuns.c:

#include "rfuns.h"
#include "cfuns.h"

/* a wrapper around my_c_sum callable from R */
SEXP my_c_sum_wrapper(SEXP x)
{
    double s;

    if (!Rf_isReal(x)) {
        /* the caller is expected to prepare the arguments
           (doing it at the C level is tedious work) */
        Rf_error("`x` should be a vector of the type 'double'");
    }

    s = my_c_sum(REAL(x), (size_t)XLENGTH(x));

    return Rf_ScalarReal(s);
}

The arguments could be, technically speaking, prepared at the C level. For instance, if x turned out to be an integer vector, we could have converted it to the double one (they are two different types; see Section 6.4.1). Nevertheless, overall, it is very burdensome. It is easier to use pure R code to ensure that the arguments are of the correct form as well as to beautify the outputs. This explains why we only assert the enjoyment of C::Rf_isReal(x). It guarantees that the C::REAL and C::XLENGTH functions correctly return the pointer to the start of the sequence and its length, respectively.

Once C::my_c_sum is called, we must convert it to an R object so that it can be returned to our environment. Here, it is a newly allocated numeric vector of length one. We did this by calling C::Rf_ScalarReal.

Although optional (see Section 5.4 of [65]), we will register C::my_c_sum_wrapper as a callable function explicitly. This way, R will not be struggling to find the specific entry point in the resulting dynamically linked library (DLL). We do this in src/cpackagedemo.c:

#include <R_ext/Rdynload.h>
#include "rfuns.h"

/* the list of functions available in R via a call to .Call():
   each entry is like {exported_name, fun_pointer, number_of_arguments} */
static const R_CallMethodDef cCallMethods[] = {
    {"my_c_sum_wrapper", (DL_FUNC)&my_c_sum_wrapper, 1},
    {NULL, NULL, 0}  // the end of the list (sentinel)
};

/* registers the list of callable functions */
void R_init_cpackagedemo(DllInfo *dll)
{
    R_registerRoutines(dll, NULL, cCallMethods, NULL, NULL);
    R_useDynamicSymbols(dll, FALSE);
}

The function can be invoked from R using .Call. Here are the contents of R/my_sum.R:

my_sum <- function(x)
{
    # prepare input data:
    if (!is.double(x))
        x <- as.double(x)

    s <- .Call("my_c_sum_wrapper", x, PACKAGE="cpackagedemo")

    # some rather random postprocessing:
    attr(s, "what") <- deparse(substitute(x))
    s
}

And, finally, here is the package NAMESPACE file responsible for registering the exported R names and indicating the DLL to use:

export(my_sum)
useDynLib(cpackagedemo)

Once the package is built and installed (e.g., by running “R CMD INSTALL <pkgdir>” in the terminal or calling install.packages), we can test it by calling:

library("cpackagedemo")
my_sum(runif(100)/100)
## [1] 0.49856
## attr(,"what")
## [1] "runif(100)/100"
Exercise 14.2

Extend the package by adding a function to compute the index of the greatest element in a numeric vector. Note that C uses 0-based array indexing whereas in R, the first element is at index 1. Compare its run time against which.max using proc.time.

14.1.2. R CMD SHLIB

The “R CMD SHLIB <files>” shell command compiles one or more source files without the need for turning them into standalone packages; see [65]. Then, dyn.load loads the resulting DLL.

Exercise 14.3

(*) Compile src/cfuns.c and src/rfuns.c from our demo package using “R CMD SHLIB”. Call dyn.load. Write an R function that uses .Call to invoke C::my_c_sum_wrapper from the second source file.

The direct SHLIB approach is convenient for learning C programming, including running simple examples. We will thus use it for didactic reasons in this chapter. The inst/examples/csource.R file in our demo package includes the implementation of an R function called csource. It compiles a given C source file, and loads the resulting DLL. It also extracts and executes a designated R code chunk preferably defining a function that refers to .Call.

Here is an example source file, inst/examples/helloworld.c in the cpackagedemo source code repository:

// the necessary header files are automatically included by `csource`

SEXP C_hello()
{
    Rprintf("The mill's closed. There's no more work. We're destitute.\n"
        "I'm afraid I've no choice but to sell you all "
        "for scientific experiments.\n");
    return R_NilValue;
}

/* R
# this chunk will be extracted and executed by `csource`.

hello <- function()
    invisible(.Call("C_hello", PACKAGE="helloworld"))

R */

Let’s compile it and call the aforementioned R function.

source("~/R/cpackagedemo/inst/examples/csource.R")  # defines csource
csource("~/R/cpackagedemo/inst/examples/helloworld.c")
hello()
## The mill's closed. There's no more work. We're destitute.
## I'm afraid I've no choice but to sell you all for scientific experiments.
Exercise 14.4

(*) C++, which can be thought of as a superset of the C language (but the devil is in the detail), is also supported. Change the name of the aforementioned file to helloworld2.cpp, add extern "C" before the function declaration, pass PACKAGE="helloworld2" to .Call, and run csource on the new file.

Exercise 14.5

(*) Verify that C and C++ source files can coexist in R packages.

Example 14.6

(*) It might be very educative to study the implementation of csource. We should be able to author such functions ourselves now (a few hours’ worth of work), let alone read with understanding.

# compiles a C or C++ source file using R CMD SHLIB,
# loads the resulting DLL, and executes the embedded R code

csource <- function(
    fname,
    libname=NULL,  # defaults to the base name of `fname` without extension
    shlibargs=character(),
    headers=paste0(
        "#include <R.h>\n",
        "#include <Rinternals.h>\n",
        "#include <Rmath.h>\n"
    ),
    R=file.path(R.home(), "bin/R")
)
{
    stopifnot(file.exists(fname))
    stopifnot(is.character(shlibargs))
    stopifnot(is.character(headers))
    stopifnot(is.character(R), length(R) == 1)

    if (is.null(libname))
        libname <- regmatches(basename(fname),
            regexpr("[^.]*(?=\\..*)", basename(fname), perl=TRUE))

    stopifnot(is.character(libname), length(libname) == 1)

    # read the source file:
    f <- paste(readLines(fname), collapse="\n")

    # set up output file names:
    tmpdir <- normalizePath(tempdir(), winslash="/")  # tempdir on Win uses \
    dynlib_ext <- .Platform[["dynlib.ext"]]
    libpath <- file.path(tmpdir, sprintf("%s%s", libname, dynlib_ext))
    cfname <- file.path(tmpdir, basename(fname))
    rfname <- sub("\\..*?$", ".R", cfname, perl=TRUE)  # .R extension

    # separate the /* R ... <R code> ... R */ chunk from the source file:
    rpart <- regexec("(?smi)^/\\* R\\s?(.*)R \\*/$", f, perl=TRUE)[[1]]
    rpart_start <- rpart
    rpart_len <- attr(rpart, "match.length")
    if (rpart_start[1] < 0 || rpart_len[1] < 0)
        stop("enclose R code between /* R ... and ... R */")

    rcode <- substr(f, rpart_start[2], rpart_start[2]+rpart_len[2]-1)
    cat(rcode, file=rfname, append=FALSE)

    # write the C/C++ file:
    ccode <- paste(
        headers,
        substr(f, 1, rpart_start[1]-1),
        substr(f, rpart_start[1]+rpart_len[1], nchar(f)),
        collapse="\n"
    )
    cat(ccode, file=cfname, append=FALSE)

    # prepare the "R CMD SHLIB ..." command:
    shlibargs <- c(
        "CMD", "SHLIB",
        sprintf("-o %s", libpath),
        cfname,
        shlibargs
    )

    # compile and load the DLL, run the extracted R script:
    retval <- FALSE
    oldwd <- setwd(tmpdir)
    tryCatch({
        if (libpath %in% sapply(getLoadedDLLs(), `[[`, "path"))
            dyn.unload(libpath)
        stopifnot(system2(R, shlibargs) == 0)  # 0 == success
        dyn.load(libpath)
        source(rfname)
        retval <- TRUE
    }, error=function(e) {
        cat(as.character(e), file=stderr())
    })
    setwd(oldwd)

    if (!retval) stop("error compiling file or executing R code therein")
    invisible(TRUE)
}

14.2. Handling basic types

14.2.1. SEXPTYPEs

All R objects are stored as instances of the C language structure SEXPREC. Usually, we access them via pointers, which are of the type SEXP (S expression).

A C function referred to via .Call takes the very generic SEXPs on input. It outputs another SEXP. Importantly, one of the said structure’s fields represents the actual R object type (SEXPTYPE numbers); see Table 14.1 for a selection.

Table 14.1 Basic R types in C.

SEXPTYPE

Type in R (typeof)

Test in C

NILSXP

NULL

Rf_isNull(x) (true for R_NilValue only)

RAWSXP

raw

TYPEOF(x) == RAWSXP

LGLSXP

logical

Rf_isLogical(x)

INTSXP

integer

Rf_isInteger(x)

REALSXP

double

Rf_isReal(x)

CPLXSXP

complex

Rf_isComplex(x)

STRSXP

character

Rf_isString(x)

VECSXP

list

Rf_isVectorList(x)

CHARSXP

char (scalar string; internal)

TYPEOF(x) == CHARSXP

EXTPTRSXP

externalptr (internal)

TYPEOF(x) == EXTPTRSXP

Example 14.7

To illustrate that any R object is available as a SEXP, consider the inst/examples/sexptype.c file from cpackagedemo:

SEXP C_test_sexptype(SEXP x)
{
    Rprintf("type of x: %s (SEXPTYPE=%d)\n",
        Rf_type2char(TYPEOF(x)),
        (int)TYPEOF(x)
    );
    return R_NilValue;
}

/* R
test_sexptype <- function(x)
    invisible(.Call("C_test_sexptype", x, PACKAGE="sexptype"))
R */

Example calls:

csource("~/R/cpackagedemo/inst/examples/sexptype.c")
test_sexptype(1:10)
## type of x: integer (SEXPTYPE=13)
test_sexptype(NA)
## type of x: logical (SEXPTYPE=10)
test_sexptype("spam")
## type of x: character (SEXPTYPE=16)

We should refer to particular SEXPTYPEs via their descriptive names (constants; e.g., STRSXP), not their numeric identifiers (e.g., 16); see Section 1.1 of [68] for the complete list[5].

14.2.2. Accessing elements in simple atomic vectors

We have already seen an example function that processes a numeric vector; see C::my_c_sum_wrapper above. Table 14.2 gives other important vector-like SEXPTYPEs (atomic and generic), the C types of their elements, and the functions to access the underlying array pointers. It is also worth knowing that as call to C::XLENGTH returns the length of a given sequence. Let’s stress that writing functions that accept int and double array pointers and their lengths makes them easily reusable in other programming environments. In many data analysis applications, we do not need much more.

Table 14.2 Basic array-like R types and their elements in C.

SEXPTYPE

Array element type

Pointer access

RAWSXP

typedef unsigned char Rbyte;

RAW(x)

LGLSXP

int (use the FALSE, TRUE, and NA_LOGICAL constants)

LOGICAL(x)

INTSXP

int

INTEGER(x)

REALSXP

double

REAL(x)

CPLXSXP

typedef struct { double r; double i; } Rcomplex;

COMPLEX(x)

STRSXP

SEXP (array of SEXPs of the type CHARSXP)

(not directly)

VECSXP

SEXP (array of SEXPs of any SEXPTYPE)

(not directly)

CHARSXP

const char* (read-only; trailing 0; check encoding)

CHAR(x)

Important

With raw, logical, integer, floating-point, and complex vectors, we get direct access to data that might be shared amongst many objects (compare Section 16.1.4). SEXPRECs are simply passed by pointers (since SEXPs are exactly them). We must thus refrain[6] from modifying objects passed as function arguments. Ways to create new vectors, e.g., for storing auxiliary or return values, are discussed below.

Example 14.8

Consider inst/examples/sharedmem.c:

SEXP C_test_sharedmem(SEXP x)
{
    if (!Rf_isReal(x) || XLENGTH(x) == 0)
        Rf_error("`x` should be a non-empty vector of the type 'double'");

    REAL(x)[0] = REAL(x)[0]+1;  // never do it; always make a copy;
            // the underlying array `x` may be shared by many objects

    return R_NilValue;
}

/* R
test_sharedmem <- function(x)
    invisible(.Call("C_test_sharedmem", x, PACKAGE="sharedmem"))
R */

Calling the foregoing function on an example vector:

csource("~/R/cpackagedemo/inst/examples/sharedmem.c")
y <- 1
z <- y
test_sharedmem(y)
print(c(y, z))
## [1] 2 2

modifies y and z in place. Hence, to maintain the compatibility with the classic R semantics, we must always make a copy.

14.2.3. Representation of missing values

Most languages do not support the notion of missing values out of the box. Therefore, in R, they have to be emulated. Table 14.3 lists the relevant constants and the conventional ways for testing for missingness.

Table 14.3 Representation of missing values.

SEXPTYPE

Missing value

Testing

RAWSXP

(none)

(none)

LGLSXP

NA_LOGICAL (equal to INT_MIN)

el == NA_LOGICAL

INTSXP

NA_INTEGER (equal to INT_MIN)

el == NA_INTEGER

REALSXP

NA_REAL (a special NaN)

ISNA(el)

CPLXSXP

a pair of NA_REALs

ISNA(el.r)

STRSXP

NA_STRING (a CHARSXP object)

el == NA_STRING

In logical and integer vectors, NAs are represented as the smallest 32-bit signed integer. Thus, we need to be careful when performing any operations on these types: testing for missingness must be performed first.

The case of doubles is slightly less irksome, for a missing value is represented as a special not-a-number. Many arithmetic operations on NaNs return NaNs as well, albeit there is no guarantee[7] that they will be of precisely the same type as NA_REAL. Thus, manual testing for missingness is also advised.

Example 14.9

The inst/examples/mean_naomit.c file defines a function to compute the arithmetic mean of an int or a double vector:

SEXP C_mean_naomit(SEXP x)
{
    double ret = 0.0;
    size_t k = 0;

    if (Rf_isInteger(x)) {
        const int* xp = INTEGER(x);
        size_t n = XLENGTH(x);
        for (size_t i=0; i<n; ++i)
            if (xp[i] != NA_INTEGER) {  // NOT: ISNA(xp[i])
                ret += (double)xp[i];
                k++;
            }
    }
    else if (Rf_isReal(x)) {
        const double* xp = REAL(x);
        size_t n = XLENGTH(x);
        for (size_t i=0; i<n; ++i)
            if (!ISNA(xp[i])) {  // NOT: xp[i] == NA_REAL
                ret += xp[i];
                k++;
            }
    }
    else
        Rf_error("`x` should be a numeric vector");

    return Rf_ScalarReal((k>0)?(ret/(double)k):NA_REAL);
}

/* R
mean_naomit <- function(x)
{
    if (!is.numeric(x))  # neither integer nor double
        x <- as.numeric(x)  # convert to double (the same as as.double)
    .Call("C_mean_naomit", x, PACKAGE="mean_naomit")
}
R */

There is some inherent code duplication but int and double are distinct types. Thus, they need to be handled separately (we could have convert them to doubles at the R level too). Some tests:

csource("~/R/cpackagedemo/inst/examples/mean_naomit.c")
mean_naomit(c(1L, NA_integer_, 3L, NA_integer_, 5L))
## [1] 3
mean_naomit(rep(NA_real_, 10))
## [1] NA
Exercise 14.10

Implement all and any in C. Add the na.rm argument.

14.2.4. Memory allocation

To allocate a new vector of length one and set its only element, we can call C::ScalarLogical, C::ScalarInteger, C::ScalarReal, etc. We have already used these functions for returning R “scalars”. Vectors of arbitrary lengths can be created using C::Rf_allocVector(sexptype, size). Note that this function does not initialise the elements of logical and numeric sequences (amongst others). They will need to be set manually after creation.

Important

R implements a simple yet effective garbage collector that relies on reference counting. Occasionally[8], memory blocks that can no longer be reached are either freed or marked as reusable.

All allocated vectors must be manually protected from garbage collection. To guard against premature annihilation, R maintains a stack[9] of objects. C::PROTECT(sexp) pushes a given object pointer onto the top of the list. C::UNPROTECT(n) pops the last n elements from it in the last-in-first-out manner. At the end of a .Call, R checks if the number of protects matches that of unprotects and generates a warning if there is a stack imbalance.

Protection is not needed:

  • for arguments to functions referred to by .Call, as they are already in use and hence protected;

  • for objects assigned as list or character vectors’ elements using C::SET_VECTOR_ELT and C::SET_STRING_ELT (see below); when the container is protected, so are its components;

  • when we return the allocated vector to R immediately after creating it (like in return Rf_ScalarReal(val) in a C function invoked by .Call).

Example 14.11

Here is a function to compute the square of each element in a numeric vector. Note that the new vector must be protected from garbage collection while data are being prepared.

SEXP C_square1(SEXP x)
{
    // no need to call PROTECT(x), it is already in use
    if (!Rf_isReal(x)) Rf_error("`x` should be of the type 'double'");

    size_t n = XLENGTH(x);
    const double* xp = REAL(x);

    SEXP y = PROTECT(Rf_allocVector(REALSXP, n));  // won't be GC'd
    double* yp = REAL(y);

    for (size_t i=0; i<n; ++i) {
        if (ISNA(xp[i])) yp[i] = xp[i];  // NA_REAL
        else             yp[i] = xp[i]*xp[i];
    }

    UNPROTECT(1);  // pops one object from the protect stack;
        // does not trigger garbage collection, so we can return `y` now
    return y;  // R will retrieve and protect it
}

/* R
square1 <- function(x)
{
    if (!is.double(x)) x <- as.double(x)
    .Call("C_square1", x, PACKAGE="square1")
}
R */

As an alternative, in this case, we may use C::Rf_duplicate:

SEXP C_square2(SEXP x)
{
    if (!Rf_isReal(x)) Rf_error("`x` should be of the type 'double'");

    x = PROTECT(Rf_duplicate(x));  // OK; just replaces the pointer (address)

    size_t n = XLENGTH(x);
    double* xp = REAL(x);
    for (size_t i=0; i<n; ++i)
        if (!ISNA(xp[i])) xp[i] = xp[i]*xp[i];

    UNPROTECT(1);
    return x;
}

/* R
square2 <- function(x)
{
    if (!is.double(x)) x <- as.double(x)
    .Call("C_square2", x, PACKAGE="square2")
}
R */

Some tests:

csource("~/R/cpackagedemo/inst/examples/square1.c")
square1(c(-2, -1, 0, 1, 2, 3, 4, NA_real_))
## [1]  4  1  0  1  4  9 16 NA
csource("~/R/cpackagedemo/inst/examples/square2.c")
square2(c(-2, -1, 0, 1, 2, 3, 4, NA_real_))
## [1]  4  1  0  1  4  9 16 NA

We can claim auxiliary memory from the heap during a function’s runtime using the well-known C::malloc (or new in C++). We are of course fully responsible for releasing it via C::free (or delete).

Example 14.12

Here is our version of the which function.

SEXP C_which1(SEXP x)
{
    if (!Rf_isLogical(x)) Rf_error("`x` should be of the type 'logical'");

    size_t n = XLENGTH(x), i, k;
    const int* xp = LOGICAL(x);

    size_t* d = (size_t*)malloc(n*sizeof(size_t));  // conservative size
    if (!d) Rf_error("memory allocation error");

    for (i=0, k=0; i<n; ++i)
        if (xp[i] != NA_LOGICAL && xp[i])
            d[k++] = i;

    // Rf_allocVector can longjmp, memory leak possible...
    SEXP y = PROTECT(Rf_allocVector(REALSXP, k));
    double* yp = REAL(y);  // yes, the type is double; ready for long vectors
    for (i=0; i<k; ++i)
        yp[i] = (double)d[i]+1;  // R uses 1-based indexing

    free(d);
    UNPROTECT(1);
    return y;
}

/* R
which1 <- function(x)
{
    if (!is.logical(x)) x <- as.logical(x)
    .Call("C_which1", x, PACKAGE="which1")
}
R */

Some tests:

csource("~/R/cpackagedemo/inst/examples/which1.c")
which1(c(TRUE, FALSE, TRUE, NA, TRUE))
## [1] 1 3 5
Exercise 14.13

R’s which returns either an int or a double vector, depending on the size of the input vector (whether it is shorter than \(2^{31}-1\)). Rewrite the above to take that into account: integer arithmetic is slightly faster.

Note

(*) R’s exception handling uses a long jump. Therefore, when calling C::Rf_error (whether directly or not) normal stack unwinding will not occur. This is particularly important when using C++ objects which deallocate memory in their destructors as they might not be invoked whatsoever.

In the preceding example, a call to C::Rf_allocVector may trigger a long jump, e.g., if we run out of available memory. In such a case, d will not be freed.

Thus, care should be taken to make sure there are no memory leaks. We can sometimes switch to C::R_alloc(n, size) which allocates n*size bytes. The memory it requests will automatically be garbage-collected at the end of a .Call.

Otherwise, we should ensure that blocks relying on manual memory allocation are not mixed with the calls to R API functions. In our C::which1, it would be better to determine the desired size of y and allocate it before calling C::malloc.

Example 14.14

(*) If we do not like that we are potentially wasting memory in the case of sparse logical vectors, we can rely on dynamically growable arrays. Below is a C++ rewrite of the foregoing function using deque (double-ended queue) from the language’s standard library.

#include <deque>

extern "C" SEXP C_which2(SEXP x)
{
    if (!Rf_isLogical(x)) Rf_error("`x` should be of the type 'logical'");

    size_t n = XLENGTH(x), i, k=0;
    const int* xp = LOGICAL(x);

    // precompute k, Rf_allocVector can do a longjmp
    for (i=0; i<n; ++i) if (xp[i] != NA_LOGICAL && xp[i]) k++;
    SEXP y = PROTECT(Rf_allocVector(REALSXP, k));
    double* yp = REAL(y);  // ready for long vectors

    std::deque<size_t> d;  // allocates memory
    for (i=0; i<n; ++i)
        if (xp[i] != NA_LOGICAL && xp[i])
            d.push_back(i);

    i=0;
    for (size_t e : d)
        yp[i++] = (double)e+1;  // R uses 1-based indexing

    UNPROTECT(1);
    return y;  // d's destructor will be called automatically
}

/* R
which2 <- function(x)
{
    if (!is.logical(x)) x <- as.logical(x)
    .Call("C_which2", x, PACKAGE="which2")
}
R */

Example calls:

csource("~/R/cpackagedemo/inst/examples/which2.cpp")
x <- (runif(10) > 0.5)
stopifnot(which(x) == which1(x))
stopifnot(which(x) == which2(x))

Alternatively, we could have used C::realloc to extend an initially small buffer created using C::malloc by, say, 50% whenever it is about to overflow.

14.2.5. Lists

For safety reasons[10], we do not get access to the underlying pointers in lists and character vectors. List items can be read by calling C::VECTOR_ELT(x, index) and can be set with C::SET_VECTOR_ELT(x, index, newval). Note that lists (VECSXPs) are comprised of SEXPs of any type. Hence, after extracting an element, its SEXPTYPE needs to be tested using one of the functions listed in Table 14.1. This can be tiresome.

Example 14.15

Here is a rather useless function that fetches the first and the last element in a given numeric vector or a list. However, if the latter case, we apply the function recursively on all its elements.

SEXP C_firstlast(SEXP x)
{
    if (!Rf_isVector(x) || XLENGTH(x) == 0)
        Rf_error("`x` must be a non-empty vector (atomic or generic)");
    else if (Rf_isReal(x)) {
        SEXP y = PROTECT(Rf_allocVector(REALSXP, 2));
        REAL(y)[0] = REAL(x)[0];             // first
        REAL(y)[1] = REAL(x)[XLENGTH(x)-1];  // last
        UNPROTECT(1);
        return y;
    }
    else if (Rf_isVectorList(x)) {
        SEXP y = PROTECT(Rf_allocVector(VECSXP, 2));
        // VECTOR_ELT(x, i) is PROTECTed by the container;
        // SET_VECTOR_ELT does not trigger GC; no need to call PROTECT
        // on the result of C_firstlast(...) in this context
        SET_VECTOR_ELT(y, 0, C_firstlast(VECTOR_ELT(x, 0)));
        SET_VECTOR_ELT(y, 1, C_firstlast(VECTOR_ELT(x, XLENGTH(x)-1)));
        UNPROTECT(1);
        return y;
    }
    else
        Rf_error("other cases left as an exercise");

    return R_NilValue;  // avoid compiler warning
}

/* R
firstlast <- function(x)
    .Call("C_firstlast", x, PACKAGE="firstlast")
R */

Testing:

csource("~/R/cpackagedemo/inst/examples/firstlast.c")
firstlast(c(1, 2, 3))
## [1] 1 3
firstlast(list(c(1, 2, 3), c(4, 5), 6))
## [[1]]
## [1] 1 3
##
## [[2]]
## [1] 6 6
firstlast(list(c(1, 2, 3), 4, 5, list(6, c(7, 8), c(9, 10, 11))))
## [[1]]
## [1] 1 3
##
## [[2]]
## [[2]][[1]]
## [1] 6 6
##
## [[2]][[2]]
## [1]  9 11
Exercise 14.16

Implement a C function that returns the longest vector in a given list. Use C::Rf_isVector to check whether a given object is an atomic or a generic vector, and hence if C::XLENGTH can be called thereon.

Exercise 14.17

Inscribe your version of unlist. Consider scanning the input list twice. First, determine the size of the output vector. Second, fill the return object with the un-listed values.

Exercise 14.18

Write a C function that takes a list of numeric vectors of identical lengths. Return their elementwise sum: the first element of the output should be the sum of the first elements in every input vector, and so forth.

14.2.6. Character vectors and individual strings (*)

Character vectors (STRSXPs) are similar to VECSXPs except that they only carry individual strings which are represented using a separate data type, CHARSXP. C::STRING_ELT(x, index) and C::SET_STRING_ELT(x, index, newval) play the role of the element getters and setters.

Important

If we are not interested in text processing but rather in handling categorical data or defining grouping variables, we should consider converting character vectors to factors before issuing a .Call. Comparing small integers is much faster than strings; see below for more details.

Because of R’s string cache, there are no duplicate strings in the memory. However, this feature could only be guaranteed by making data in CHARSXPs read-only. We can access the underlying const char* pointer by calling C::CHAR(s). As typical in C, a string is terminated by byte 0.

Note

R strings may be of different encodings; compare Section 6.1.1. For portability and peace of mind, it is best to preprocess the arguments to .Call using enc2utf8, which converts all strings to UTF-8[11].

Despite being the most universal encoding, UTF-8 does not represent each code point using a fixed number of bytes. For instance, computing the string length requires iterating over all its elements. For CHARSXPs, C::XLENGTH returns the number of bytes, not including the trailing 0.

It is thus best to leave the processing of strings to the dedicated libraries, e.g., ICU or rely on functions from the stringi package [26] at the R level.

C strings can be converted to CHARSXPs by calling C::Rf_mkCharCE(stringbuf, CE_UTF8) or C::Rf_mkCharLenCE(stringbuf, buflen, CE_UTF8). If we are sure that a string is in ASCII (a subset of UTF-8), we can also call C::Rf_mkChar(stringbuf).

We should never return CHARSXPs as results to R. They are for internal use only. They must be wrapped inside a character vector, e.g., using C::Rf_ScalarString.

14.2.7. Calling R functions from C (**)

Section 5.11 of [65] discusses ways to call R functions in C. To understand them, we will first need to study the material from the remaining chapters of our book, i.e., environments and the related evaluation model. They can be useful, e.g., when calling optimisation algorithms implemented in C on objective functions written in R.

14.2.8. External pointers (**)

For storing arbitrary C pointers, there is a separate basic R type named externalptr (SEXPTYPE of EXTPTRSXP); see Section 5.13 of [65] for more details. We can use them to maintain dynamic data structures or resource handlers between calls to R functions. The problem with these is that pointers are passed as… pointers. They can easily break R’s pass-by-value-like semantics, where changes to the state of the referenced object will be visible outside the call.

Example 14.19

(**) inst/examples/stack.cpp provides a C++ implementation of the stack data structure, being a last-in-first-out container of arbitrary R objects:

#include <deque>

class S : public std::deque<SEXP>
{
    public: ~S()
    {   // destructor: release all SEXPs so that they can be GC'd
        while (!this->empty()) {
            SEXP obj = this->front();
            this->pop_front();
            R_ReleaseObject(obj);
        }
    }
};

S* get_stack_pointer(SEXP s, bool check_zero=true)  // internal function
{
    if (TYPEOF(s) != EXTPTRSXP)
        Rf_error("not an external pointer");

    SEXP tag = R_ExternalPtrTag(s);  // our convention, this can be anything
    if (TYPEOF(tag) != CHARSXP || strcmp(CHAR(tag), "stack") != 0)
        Rf_error("not a stack");

    S* sp = (S*)R_ExternalPtrAddr(s);
    if (check_zero && !sp)
        Rf_error("address is 0");

    return sp;
}

void stack_finaliser(SEXP s)  // internal function
{
    // called during garbage collection
    S* sp = get_stack_pointer(s, false);
    if (sp) {
        delete sp;  // destruct S, release SEXPs
        R_ClearExternalPtr(s);
    }
}

extern "C" SEXP C_stack_create()
{
    S* sp = new S();  // stack pointer
    SEXP s = PROTECT(
        R_MakeExternalPtr((void*)sp, /*tag*/mkChar("stack"), R_NilValue)
    );
    R_RegisterCFinalizerEx(s, stack_finaliser, TRUE);  // auto-called on GC
    UNPROTECT(1);
    return s;
}

extern "C" SEXP C_stack_empty(SEXP s)
{
    S* sp = get_stack_pointer(s);
    return Rf_ScalarLogical(sp->empty());
}

extern "C" SEXP C_stack_push(SEXP s, SEXP obj)
{
    S* sp = get_stack_pointer(s);
    R_PreserveObject(obj);
    sp->push_front(obj);
    return R_NilValue;
}

extern "C" SEXP C_stack_pop(SEXP s)
{
    S* sp = get_stack_pointer(s);
    if (sp->empty())
        Rf_error("stack is empty");
    SEXP obj = sp->front();
    sp->pop_front();
    R_ReleaseObject(obj);
    return obj;
}

/* R
stack_create <- function()
    .Call("C_stack_create", PACKAGE="stack")

stack_empty <- function(s)
    .Call("C_stack_empty", s, PACKAGE="stack")

stack_push <- function(s, obj)
    .Call("C_stack_push", s, obj, PACKAGE="stack")

stack_pop <- function(s)
    .Call("C_stack_pop", s, PACKAGE="stack")
R */

Note how we preserve R objects from garbage collection. Some tests:

csource("~/R/cpackagedemo/inst/examples/stack.cpp")
s <- stack_create()
print(s)
## <pointer: 0x560958a90550>
typeof(s)
## [1] "externalptr"
for (i in c("one", "two", "Spanish Inquisition"))
    stack_push(s, i)
while (!stack_empty(s))
    print(stack_pop(s))
## [1] "Spanish Inquisition"
## [1] "two"
## [1] "one"

Note that pointers are not serialisable. They cannot be saved for use in another R session.

14.3. Dealing with compound types

14.3.1. Reading and setting attributes

From Chapter 10, we know that compound types such as matrices, factors, or data frames are represented using basic data structures. Usually, they are atomic vectors or lists organised in a predefined manner.

C::Rf_getAttrib(x, attrname) and C::Rf_setAttrib(x, attrname, newval) gets and sets specific attributes of an object x. Their second argument, attrname, should be a one-element STRSXP. For convenience, the R_ClassSymbol, R_DimNamesSymbol, R_DimSymbol, R_NamesSymbol, R_RowNamesSymbol, and R_LevelsSymbol constants can be used instead of the STRSXP versions of the "class", "dimnames", "dim", "names", "row.names", and "levels" strings.

Example 14.20

Consider a function for testing whether an object is of a given class:

#include <string.h>

SEXP C_isofclass(SEXP x, SEXP class)
{
    if (!Rf_isString(class) && XLENGTH(class) != 1)
        Rf_error("`class` must be a single string");

    if (!OBJECT(x))  // is the class attribute set at all?
        return Rf_ScalarLogical(FALSE);

    SEXP xclass = Rf_getAttrib(x, R_ClassSymbol);  // STRSXP (guaranteed)
    const char* c = CHAR(STRING_ELT(class, 0));  // class arg as a C string
    size_t n = XLENGTH(xclass);
    for (size_t i=0; i<n; ++i)
        if (strcmp(CHAR(STRING_ELT(xclass, i)), c) == 0)
            return Rf_ScalarLogical(TRUE);

    return Rf_ScalarLogical(FALSE);
}

/* R
isofclass <- function(x, class)
    .Call("C_isofclass", x, class, PACKAGE="isofclass")
R */

Some tests:

csource("~/R/cpackagedemo/inst/examples/isofclass.c")
isofclass(Sys.time(), "POSIXct")
## [1] TRUE
isofclass(cbind(1:5, 11:15), "matrix")
## [1] FALSE

Note that a matrix has an implicit class (reported by the class function), but its class attribute does not have to be set. Hence the negative result.

Example 14.21

Write a function that fetches a particular named element in a list.

14.3.2. Factors

Factors (Section 10.3.2) are represented as integer vectors with elements in the set {1, 2, …, k, NA_integer_} for some k. They are equipped with the levels attribute, being a character vector of length k. Their class attribute is set to factor.

Example 14.22

An example implementation of a function to compute the number of occurrences of each factor level is given below.

SEXP C_table1(SEXP x)
{
    if (!Rf_isFactor(x)) Rf_error("`x` is not a 'factor' object");

    size_t n = XLENGTH(x);
    const int* xp = INTEGER(x);  // `x` is INTSXP

    SEXP levels = Rf_getAttrib(x, R_LevelsSymbol);  // `levels` is a STRSXP
    size_t k = XLENGTH(levels);

    SEXP y = PROTECT(Rf_allocVector(REALSXP, k));
    double* yp = REAL(y);
    for (size_t i=0; i<k; ++i)
        yp[i] = 0.0;
    for (size_t j=0; j<n; ++j) {
        if (xp[j] != NA_INTEGER) {
            if (xp[j] < 1 || xp[j] > k)
                Rf_error("malformed factor");  // better safe than sorry
            yp[xp[j]-1] += 1.0;  // levels are 1..k, but C needs 0..k-1
        }
    }

    Rf_setAttrib(y, R_NamesSymbol, levels);  // set names attribute
    UNPROTECT(1);
    return y;
}

/* R
table1 <- function(x)
{
    if (!is.factor(x)) x <- as.factor(x)
    .Call("C_table1", x, PACKAGE="table1")
}
R */

Testing:

csource("~/R/cpackagedemo/inst/examples/table1.c")
table1(c("spam", "bacon", NA, "spam", "eggs", "bacon", "spam", "spam"))
## bacon  eggs  spam
##     2     1     4
Exercise 14.23

Create a function to compute the most frequently occurring value (mode) in a given factor. Return a character vector. If a mode is ambiguous, return all the possible candidates.

14.3.3. Matrices

Matrices (Chapter 11) are flat atomic vectors or lists with the dim attribute being a vector of length two. The class attribute does not have to be set (but the class function returns matrix and array). Matrices are so important in data analysis that they have been blessed with a few dedicated functions available at the C level. C::Rf_isMatrix tests if a given object meets the criteria mentioned above. C::Rf_allocMatrix(sexptype, n, m) allocates a new matrix.

R relies on the Fortran order of matrix elements, i.e., it uses the column-major alignment. Let A be a matrix with n rows and m columns (compare C::Rf_nrows and C::Rf_ncols). Then, the element in the i-th row and the j-th column is at A[i+n*j]. The dimnames attributes must be handled manually, though.

Example 14.24

Here is a function to compute the transpose of a numeric matrix:

SEXP C_transpose(SEXP x)
{
    if (!Rf_isMatrix(x) || !Rf_isReal(x))
        Rf_error("`x` must be a real matrix");

    size_t n = Rf_nrows(x);
    size_t m = Rf_ncols(x);
    const double* xp = REAL(x);

    SEXP y = PROTECT(Rf_allocMatrix(REALSXP, m, n));
    double* yp = REAL(y);
    for (size_t i=0; i<n; ++i)
        for (size_t j=0; j<m; ++j)
            yp[j+m*i] = xp[i+n*j];

    SEXP dimnames = Rf_getAttrib(x, R_DimNamesSymbol);
    if (!Rf_isNull(dimnames)) {
        SEXP tdimnames = PROTECT(Rf_allocVector(VECSXP, 2));
        SET_VECTOR_ELT(tdimnames, 0, VECTOR_ELT(dimnames, 1));
        SET_VECTOR_ELT(tdimnames, 1, VECTOR_ELT(dimnames, 0));
        Rf_setAttrib(y, R_DimNamesSymbol, tdimnames);  // set dimnames
        UNPROTECT(1);
        // dimnames may have the names attribute too (left as an exercise)
    }

    UNPROTECT(1);
    return y;
}

/* R
transpose <- function(x)
{
    if (!is.matrix(x)) x <- as.matrix(x)
    if (!is.double(x)) x[] <- as.double(x)  # preserves attributes
    .Call("C_transpose", x, PACKAGE="transpose")
}
R */

Testing:

csource("~/R/cpackagedemo/inst/examples/transpose.c")
transpose(cbind(c(1, 2, 3, 4), c(5, 6, 7, 8)))
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
transpose(Titanic[, "Male", "Adult", ])
##     1st 2nd 3rd Crew
## No  118 154 387  670
## Yes  57  14  75  192
Exercise 14.25

Author a C function named table2 that computes a two-way contingency table.

14.3.4. Data frames

Data frames (Chapter 12) are lists of the S3 class data.frame featuring, for some n and m, m vectors of identical lengths n or matrices of n rows. The character vectors stored in the row.names and names attributes give the n row and m column labels.

We process data frames as ordinary lists. However, assuming we only want to process numeric data, we can extract the quantitative columns and put them inside a matrix at the R level. If element grouping is required, they can be accompanied by a factor or a list of factor variables. In many applications, this strategy is good enough.

14.4. Using existing function libraries

14.4.1. Checking for user interrupts

Long computations may lead to R’s becoming unresponsive. The user may always request to cancel the evaluation of the current expression by pressing Ctrl+C. To allow the processing of the event queue, we should call C::R_CheckUserInterrupt() occasionally, e.g., in every iteration of a complex for loop. Note that R might decide never to return to our function. Thus, we have to prevent memory leaks, e.g., by preferring C::R_alloc over C::malloc.

14.4.2. Generating pseudorandom numbers

C::unif_rand returns a single pseudorandom deviate from the uniform distribution on the unit interval. It is the basis for generating numbers from all other supported distributions (Section 6.7.1 of [65]). It uses the same pseudorandom generator as we described in Section 2.1.5. To read and memorise its seed (the `.Random.seed` object in the global environment), we have to call C::GetRNGstate() and C::PutRNGstate() at the beginning and the end of our function, respectively.

Example 14.26

Below is a function to generate a pseudorandom bit sequence:

SEXP C_randombits(SEXP n)
{
    if (!Rf_isInteger(n) || XLENGTH(n) != 1)
        Rf_error("`n` should be a single integer");

    int _n = INTEGER(n)[0];
    if (_n == NA_INTEGER || _n < 1)
        Rf_error("incorrect `n`");

    SEXP y = PROTECT(Rf_allocVector(INTSXP, _n));
    int* yp = INTEGER(y);

    GetRNGstate();
    for (int i=0; i<_n; ++i)
        yp[i] = (int)(unif_rand()>0.5);  // not the best way to sample bits
    PutRNGstate();

    UNPROTECT(1);

    return y;
}

/* R
randombits <- function(n)
{
    if (!is.integer(n)) n <- as.integer(n)
    .Call("C_randombits", n, PACKAGE="randombits")
}
R */

Let’s play around with it:

csource("~/R/cpackagedemo/inst/examples/randombits.c")
set.seed(123); randombits(10)
##  [1] 0 1 0 1 1 0 1 1 1 0
randombits(10)
##  [1] 1 0 1 1 0 1 0 0 0 1
set.seed(123); randombits(20)
##  [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1
set.seed(123); as.integer(runif(20)>0.5)  # it's the same "algorithm"
##  [1] 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 1
Exercise 14.27

Create a function to compute the most frequently occurring value (mode) in a given factor object. In the case of ambiguity, return a randomly chosen candidate.

14.4.3. Mathematical functions from the R API

Section 6.7 of [65] lists the available statistical distribution functions, mathematical routines and constants, and other numerical utilities.

14.4.4. Header files from other R packages (*)

A package may use header files from another package. For this to be possible, it must include the dependency name in the LinkingTo field of its DESCRIPTION file; see [65] for discussion.

Exercise 14.28

The BH package on CRAN gives access to Boost, the header-only C++ libraries that define many useful algorithms and data structures. Create an R package that calls C++::boost::math::gcd after issuing the #include <boost/math/common_factor.hpp> directive.

14.4.5. Specifying compiler and linker flags (**)

We can pass arbitrary flags to the compiler or linker, e.g., to use any library installed on our system. Basic configuration can be tweaked via Makevars (or Makevars.win on Wi***ws), e.g., by setting PKG_CFLAGS or PKG_LIBS variables. For maximum portability across different platforms, which is overall challenging to ensure when we do not wish to exclude Wi***ws users, we might be required to author custom configure (and configure.win) scripts. For more information, see [65] which discusses, amongst others, how to use OpenMP[12] in our projects.

14.5. Exercises

Exercise 14.29

Answer the following questions about the C language API for R.

  • What are the most common SEXPTYPEs?

  • How are missing values represented?

  • How can we check if an int is a missing value? What about a double?

  • How to prevent SEXPs from being garbage-collected?

  • How are character vectors represented? What is the difference between a CHARSXP and a STRSXP?

  • Why is it better to handle factor objects rather than raw character vectors if we merely would like to define grouping variables?

  • How are R matrices represented in C? Does R use the C or Fortran order of matrix elements?

  • How are R data frames handled in C?

Exercise 14.30

Implement the C versions of the rep, seq, rle, match, findInterval, sample, order, unique, and split functions.

Exercise 14.31

(*) Read Writing R Extensions [65] in its entirety.

Exercise 14.32

(*) Download R’s source code from CRAN or its Subversion (SVN) repository. Explore the header files in the src/include subdirectory. They are part of the callable API.