10. S3 classes#

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug/typos reports/fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [26].

Let x be a randomly generated matrix with 1 000 000 rows and 1 000 columns, y be a data frame with results from the latest survey indicating that things are not the way most people (no matter the side of the many political spectra) think they are, and and z be another matrix, this time with many zeroes.

Human brain is not capable of dealing with too much information which is too specific. This is why we have a natural tendency to group different entities based on their similarities so as to form some more abstract classes thereof.

Also, many of us are inherently lazy. Thus, oftentimes we will take shortcuts to minimise energy (at a price to be paid later).

Printing out a matrix, a data frame, and a time series are all instances of the displaying of things, although they undoubtedly differ in detail.

By now, we probably have forgotten which objects are hidden behind x, y, and z. Being able to simply call “print(y)” without having to recall that, yes, y is a data frame, might seem quite appealing.

This chapter introduces the so-called S3 classes [13]. They provide a lightweight object-oriented programming (OOP) approach for automated dispatching of calls to generics of the type “print(y)” to concrete methods such as “print.data.frame(y)”, based on the class of the object they are invoked on.

S3 classes in their essence are beautifully simple[1]. They are inspired[2] by the well-thought-through concepts present in other functional programming languages (such as the Common Lisp Object System; see below). Ultimately, those generics and methods are ordinary R functions (Chapter 7) and classes are merely additional object attributes (Section 4.4).

Of course, this does not mean that wrapping our heads around them will be effortless. However, unlike other “class systems”[3], S3 is ubiquitous in R programming. Suffice it to say that factors, matrices, and data frames discussed in the following chapters are quite straightforward, S3-based extensions of the concepts we introduce below.

10.1. Object type vs class#

Recall that typeof (introduced in Section 4.1) returns the internal type of any R object. So far, the basic types we covered were mostly atomic and generic vectors (compare Figure 1 in the Preface).

typeof(NULL)
## [1] "NULL"
typeof(c(TRUE, FALSE, NA))
## [1] "logical"
typeof(c(1, 2, 3, NA_real_))
## [1] "double"
typeof(c("a", "b", NA_character_))
## [1] "character"
typeof(list(list(1, 2, 3), LETTERS))
## [1] "list"
typeof(function(x) x)
## [1] "closure"

The number of admissible types is small[4], but they open the world of endless possibilities[5]. We will learn that they provide a basis for more complex data structures. This is thanks to the fact that they can be equipped with arbitrary attributes (Section 4.4).

Most compound types constructed using the mechanisms discussed in this chapter[6] only pretend they are something different from what they actually are. Still, they do their job very well most of the time. By knowing what is under their hood, we will demystify them and become able to manipulate their state outside of the prescribed use cases.

Important

Setting the class attribute might make some objects behave differently in certain scenarios.

Example 10.1

Let us consider two identical objects equipped with different class attributes.

xt <- structure(123, class="POSIXct")  # POSIX calendar time
xd <- structure(123, class="Date")

Both objects are represented using numeric vectors:

c(typeof(xt), typeof(xd))
## [1] "double" "double"

However, when printed, they are decoded quite differently:

print(xt)
## [1] "1970-01-01 10:02:03 AEST"
print(xd)
## [1] "1970-05-04"

In the former case, 123 is understood as the number of seconds since the so-called UNIX epoch, 1970-01-01T00:00:00+0000. The latter is deciphered as the number of days since the said timestamp.

Hence, we expect that there exists some underlying mechanism that calls a version of print dependent on an object’s virtual class.

That this only relies on the class attribute, which might be set, unset, or reset quite freely, is emphasised below.

attr(xt, "class") <- "Date"  # change class from POSIXct to Date
print(xt)  # same 123, but now interpreted as Date
## [1] "1970-05-04"
as.numeric(xt)  # drops all attributes
## [1] 123
unclass(xd)  # drops the class attribute; `attr<-`(xd, "class", NULL)
## [1] 123

We are having so much fun that one more illustration can only increase our joy.

Example 10.2

Consider an example data frame:

x <- iris[1:3, 1:2]  # a subset of a built-in example data frame
print(x)
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2

It is an object of the following class (an object whose class attribute is set to):

attr(x, "class")
## [1] "data.frame"

Some may say, and they are absolutely right, that we have not covered data frames yet. After all, they are the topic of Chapter 12, which is still ahead of us. However, from the current perspective, we should know that R data frames are merely lists (Chapter 4) of vectors of the same lengths equipped with names and row.names attributes.

typeof(x)
## [1] "list"
attr(x, "class") <- NULL  # or x <- unclass(x)
print(x)
## $Sepal.Length
## [1] 5.1 4.9 4.7
##
## $Sepal.Width
## [1] 3.5 3.0 3.2
##
## attr(,"row.names")
## [1] 1 2 3

Important

Revealing how x is actually represented, enables us to process it using the extensive skill set that we have already[7] developed by studying the material covered in the previous part of our book (including all the exercises). This fact is particularly noteworthy, especially bearing in mind that some (built-in or third-party) data types are not particularly well-designed.

Let us underline again that attributes are simple additions to R objects. However, as we said in Section 4.4.3, certain attributes are special, and class is one of them.

In particular, we can set class to be only a character vector (possibly of length greater than one; see Section 10.2.5).

x <- 12345
attr(x, "class") <- 1  # character vectors only
## Error in attr(x, "class") <- 1: attempt to set invalid 'class' attribute

Furthermore, the class function can read the value of the class attribute. Its replacement version is also available.

class(x) <- "Date"  # set; the same as attr(x, "class") <- "Date"
class(x)  # get; the same as attr(x, "class")
## [1] "Date"

Important

The class function always yields some value, even if the corresponding attribute is not set. We call it an implicit class. Compare the following and the outputs generated by typeof:

class(NULL)  # no `class` set because NULL cannot have attributes at all
## [1] "NULL"
class(c(TRUE, FALSE, NA))  # no attributes so class is implicit (= typeof)
## [1] "logical"
class(c(1, 2, 3, NA_real_))  # typeof returns "double"
## [1] "numeric"
class(c("a", "b", NA_character_))
## [1] "character"
class(list(list(1, 2, 3), LETTERS))
## [1] "list"
class(function(x) x)  # typeof gives "closure"
## [1] "function"

Also, Chapter 11 will explain that any object equipped with the dim attribute also has an implicit class:

(x <- as.matrix(c(1, 2, 3)))
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
attributes(x)  # `class` is not amongst the attributes
## $dim
## [1] 3 1
class(x)  # implicit class
## [1] "matrix" "array"
typeof(x)  # it is still a numeric vector (under the hood)
## [1] "double"

10.2. Generics and method dispatching#

10.2.1. Generics, default, and custom methods#

Let us inspect the source code of the print function:

print(print)  # sic!
## function (x, ...)
## UseMethod("print")
## <environment: namespace:base>

Any function like the above[8] we will call from now on a generic (S3 generic, from S version 3). Its only job is to invoke UseMethod("print"). It dispatches the control flow to another function, referred to as a method, based on the class of the first argument.

Important

Even though it cannot be implied by reading the above source code, all arguments passed to the generic will also be available[9] in the dispatched method.

For example, let us define an object of the class categorical (a name that we have just come up with; we could have called it cat, CATEGORICAL, or SpanishInquisition as well), which will be our own version of the famous built-in factor type that we discuss later.

x <- structure(
    c(1, 3, 2, 1, 1, 1, 3),
    levels=c("a", "b", "c"),
    class="categorical"
)

We assume that such an object is a sequence of small positive integers (codes). It is equipped with a character vector of length no less than the maximum of the said integers, stored as its levels attribute. For example, the first category will be used to decipher the meaning of code “1”. Hence, the above vector represents a sequence a, c, b, a, a, a, c.

We have not defined any special method for printing objects of the categorical class. Hence, when we call print, the default (fallback) method will be called:

print(x)
## [1] 1 3 2 1 1 1 3
## attr(,"levels")
## [1] "a" "b" "c"
## attr(,"class")
## [1] "categorical"

It is the standard function for displaying numeric vectors that we are all well familiar with. Its name is print.default, and we can always call it directly:

print.default(x)  # the default print method
## [1] 1 3 2 1 1 1 3
## attr(,"levels")
## [1] "a" "b" "c"
## attr(,"class")
## [1] "categorical"

However, we can introduce our own method for printing categorical objects. Its name must precisely be print.categorical:

print.categorical <- function(x, ...)
{
    x_character <- attr(x, "levels")[unclass(x)]
    print(x_character)  # calls `print.default`
    cat(sprintf("Categories: %s\n",
        paste(attr(x, "levels"), collapse=", ")))
    invisible(x)  # this is what all print methods do; see help("print")
}

Now, calling print automatically dispatches the control flow to the above method:

print(x)
## [1] "a" "c" "b" "a" "a" "a" "c"
## Categories: a, b, c

Of course, the default method can still be called; calling print.default(x) directly will output the same result as before.

Note

print.categorical has been equipped with the dot-dot-dot attribute since the generic print had one too. We are expected to ensure consistency ourselves[10].

10.2.2. Creating own generics#

Introducing new S3 generics is as straightforward as defining a function that calls UseMethod.

For instance, here is a dispatcher which allows for creating new objects of the categorical class based on other objects:

as.categorical <- function(x, ...)
    UseMethod("as.categorical")  # synonym: UseMethod("as.categorical", x)

We always need to define the default method:

as.categorical.default <- function(x, ...)
{
    x <- as.character(x)
    xu <- unique(sort(x))  # drops NAs
    structure(
        match(x, xu),
        class="categorical",
        levels=xu
    )
}

Testing:

as.categorical(c("a", "c", "a", "a", "d", "c"))
## [1] "a" "c" "a" "a" "d" "c"
## Categories: a, c, d
as.categorical(c(3, 6, 4, NA, 9, 9, 6, NA, 3))
## [1] "3" "6" "4" NA  "9" "9" "6" NA  "3"
## Categories: 3, 4, 6, 9

Here, print.categorical was invoked twice. The above is quite flexible already: it relies on the generic (Section 10.2.3) as.character, which handles a wide variety of data types.

Example 10.3

For instance, we might want to forbid the conversion from lists because this does not necessarily make sense:

as.categorical.list <- function(x, ...)
    stop("conversion of lists to categorical is not supported")

The users can always be instructed in the method’s documentation that they are solely responsible for an explicit conversion of list objects to something different prior to a call to as.categorical. Whether this was a justified design choice, time will tell.

Example 10.4

The default method deals with logical vectors perfectly fine:

as.categorical(c(TRUE, FALSE, NA, NA, FALSE))  # as.categorical.default
## [1] "TRUE"  "FALSE" NA      NA      "FALSE"
## Categories: FALSE, TRUE

However, we might still want to introduce its specialised version. This is because we know a slightly more efficient algorithm (and we have nothing better to do) based on the fact that FALSE and TRUE converted to numeric yield 0 and 1, respectively:

as.categorical.logical <- function(x, ...)
{
    x <- as.logical(x)  # or stopifnot(is.logical(x)) ?
    structure(
        x + 1,  # only 1, 2, and NAs will be generated
        class="categorical",
        levels=c("FALSE", "TRUE")
    )
}

It spawns the same result but is slightly faster:

as.categorical(c(TRUE, FALSE, NA, NA, FALSE))  # as.categorical.logical
## [1] "TRUE"  "FALSE" NA      NA      "FALSE"
## Categories: FALSE, TRUE

We performed some argument validation at the beginning: a user is always able to call a method directly on an R object of any kind (which is a good thing; see Section 10.2.4). In other words, there is no guarantee that the argument x must be of type logical.

10.2.3. Built-in generics#

Many functions and operators we have introduced so far are, in fact, S3 generics: print, head, `[`, `+`, `<=`, as.character, as.list, round, log, sum, c, and na.omit, to name a few.

Some of them might not even call UseMethod explicitly; dispatching can be done internally at the C language level[11]. Overall, generating the list of all S3 generics is somewhat tricky[12]. Luckily, at least the internal ones are enumerated in help("InternalMethods") and help("groupGeneric").

Example 10.5

Let us overload the as.character method. The default one does not make much sense for the objects of our custom type:

as.character(x)
## [1] "1" "3" "2" "1" "1" "1" "3"

So:

as.character.categorical <- function(x, ...)
    attr(x, "levels")[unclass(x)]

And now:

as.character(x)
## [1] "a" "c" "b" "a" "a" "a" "c"
Exercise 10.6

Overload the unique and rep methods for objects of the class categorical.

Example 10.7

New types ought to be designed carefully. For instance, if we forget to consider overloading the to-numeric converter, we might end up with some users being puzzled[13] when they see:

(x <- as.categorical(c(4, 9, 100, 9, 9, 100, 42, 666, 4)))
## [1] "4"   "9"   "100" "9"   "9"   "100" "42"  "666" "4"
## Categories: 100, 4, 42, 666, 9
as.double(x)  # as.double.default(x)
## [1] 2 5 1 5 5 1 3 4 2

Hence, we might want to introduce:

as.double.categorical <- function(x, ...)
{
    # as.double.default(as.character.categorical(x))
    as.double(as.character(x))
}

Which now yields:

as.double(x)  # as.double.categorical(x)
## [1]   4   9 100   9   9 100  42 666   4

Note

We can still use unclass to fetch the codes:

unclass(x)
## [1] 2 5 1 5 5 1 3 4 2
## attr(,"levels")
## [1] "100" "4"   "42"  "666" "9"

It is because the above returns a class-free object, which is now guaranteed to be processed by the default methods (print, subsetting, as.character, etc.).

Exercise 10.8

What would happen if we used as.numeric instead of unclass in print.categorical and as.character.categorical?

Exercise 10.9

Update the above methods so that we can also create named objects of the class categorical (i.e., equipped with the names attribute).

Exercise 10.10

The levels of x are sorted lexicographically, not numerically. Introduce a single method that would make the above code (when rerun without any alterations) generate a more natural result.

10.2.4. First-argument dispatch and calling S3 methods directly#

With S3, the dispatching is done most often based on the class of only one[14] argument: by default, the first one from the parameter list.

For example, the c function is a generic that dispatches on the first argument’s class. Let us overload it for categorical objects. In other words, let us create a function dispatched by the generic when called on a series of objects whose first element is of the said class.

c.categorical <- function(...)
    as.categorical(
        unlist(
            lapply(list(...), as.character)
        )
    )

It converts each argument to a character vector (relying on the generic as.character to take care of the details). It works because unlist converts a list of such atomic vectors to a single sequence of strings.

Calling c with the first argument being of the class categorical dispatches to the above method:

x <- c(9, 5, 7, 7, 2)
xc <- as.categorical(x)
c(xc, x)  # c.categorical
##  [1] "9" "5" "7" "7" "2" "9" "5" "7" "7" "2"
## Categories: 2, 5, 7, 9

However, if the first argument is, say, unclassed, the default method will be consulted:

c(x, xc)  # default c
##  [1] 9 5 7 7 2 4 2 3 3 1

This method ignores the class attribute and sees xc as it is, a barebone numeric vector:

`attributes<-`(xc, NULL)  # the underlying codes
## [1] 4 2 3 3 1

It is not a bug. It is a well-documented (and now explained) behaviour. After all, compound types (classed objects) are merely emulated through the basic ones.

Important

In most cases, S3 methods can be called directly to get the desired outcome:

c.categorical(x, xc)  # force a call to the specific method
##  [1] "9" "5" "7" "7" "2" "9" "5" "7" "7" "2"
## Categories: 2, 5, 7, 9

We said in most cases for some methods can be:

  • hardcoded at the C language level (for instance, there is no c.default defined at all[15]),

  • hidden (defined in a package’s namespace but not exported); see Section 16.3.6,

  • overloaded as a group; see Section 10.2.6 and help("groupGeneric").

Example 10.11

Just for fun, let us find a partition of the iris dataset into three clusters using the \(k\)-means algorithm:

res <- kmeans(iris[-5], centers=3, nstart=10)
print(res)
## K-means clustering with 3 clusters of sizes 50, 62, 38
##
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1       5.0060      3.4280       1.4620      0.2460
## 2       5.9016      2.7484       4.3935      1.4339
## 3       6.8500      3.0737       5.7421      2.0711
##
## Clustering vector:
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [ reached getOption("max.print") -- omitted 51 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 15.151 39.821 23.879
##  (between_SS / total_SS =  88.4 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"
## [5] "tot.withinss" "betweenss"    "size"         "iter"
## [9] "ifault"

The above is an object of the class:

class(res)
## [1] "kmeans"

which, in fact, is a:

typeof(res)
## [1] "list"

The underlying list looks like:

unclass(res)
## $cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [ reached getOption("max.print") -- omitted 51 entries ]
##
## $centers
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1       5.0060      3.4280       1.4620      0.2460
## 2       5.9016      2.7484       4.3935      1.4339
## 3       6.8500      3.0737       5.7421      2.0711
##
## $totss
## [1] 681.37
##
## $withinss
## [1] 15.151 39.821 23.879
##
## $tot.withinss
## [1] 78.851
##
## $betweenss
## [1] 602.52
##
## $size
## [1] 50 62 38
##
## $iter
## [1] 2
##
## $ifault
## [1] 0

We already know that the above was displayed in a fancy way only because there is a print method overloaded for objects of the kmeans class.

But is there?

print.kmeans
## Error in eval(expr, envir, enclos): object 'print.kmeans' not found

Even though the method is hidden (internal) in the stats package’s namespace, from Section 16.3.6 we will learn that it can be accessed by calling getS3method("print", "kmeans") or referring to stats:::print.kmeans (note the triple colon).

10.2.5. Multi-class-ness#

The class attribute can be instantiated as a character vector of any length. For example:

(t1 <- Sys.time())
## [1] "2023-05-18 19:41:59 AEST"
(t2 <- strptime("2021-08-15T12:59:59+1000", "%Y-%m-%dT%H:%M:%S%z"))
## [1] "2021-08-15 12:59:59"

Let us inspect the classes of these two objects:

class(t1)
## [1] "POSIXct" "POSIXt"
class(t2)
## [1] "POSIXlt" "POSIXt"

Section 10.3.1 will discuss date-time classes in more detail. It will highlight that the former is represented as a numeric vector, while the latter is a list. Thus, these two should primarily be seen as instances of two distinct types.

However, both of them have a lot in common. Hence, it was a wise design choice to allow them to be seen also as the representatives of the same generic category of POSIX time objects.

Important

When calling a generic function[16] f on an object x of the classes[17] class1, class2, …, classK (in this order), UseMethod(f, x) dispatches to the method determined as follows:

  1. if f.class1 is available[18], call it;

  2. otherwise, if f.class2 is available, call this one;

  3. …;

  4. otherwise, if f.classK is available, invoke it;

  5. otherwise, refer to the fallback f.default.

Example 10.12

There is a method diff for objects of the class POSIXt featuring a statement:

r <- if (inherits(x, "POSIXlt")) as.POSIXct(x) else x

This way, we can process both POSIXct and POSIXlt instances using the same procedure.

Let us see no magic in this simple scheme. It is nothing more than what we described above: a way to determine which method to call for a particular R object. It can be used as a mechanism to mimic the idea of inheritance in object-oriented programming languages. However, the S3 system does not allow for defining classes in any formal manner.

For example, we cannot say that objects of the class POSIXct inherit from POSIXt. Neither can we say that each object of the class POSIXct is also an instance of POSIXt. The class attribute can still be set arbitrarily on a per-object basis. We can create ones whose class is simply POSIXct (without the POSIXt part) or even c("POSIXt", "POSIXct") (in this order).

Note

In any method, it is possible to call the method corresponding to the next class by calling NextMethod.

For instance, if we are in f.class1, a call to NextMethod(f) will try invoking f.class2. If such a method does not exist, further methods in the search chain will be tried, falling back to the default method if necessary. An example will be given below.

10.2.6. Operator overloading#

Operators are ordinary functions (Section 9.4.5). Even though what follows can partially be implied by what we have said above, as usual in R, there will be some oddities.

For example, let us overload the index operator for objects of the class categorical. Looking at help("["), we see that the default method has two arguments: x (the categorical object being sliced) and i (the indexer). Ours will have the same interface then:

`[.categorical` <- function(x, i)
{
    structure(
        unclass(x)[i],  # `[`(unclass(x), i)
        class="categorical",
        levels=attr(x, "levels")  # the same levels as input
    )
}

However, the default S3 method, `[.default`, is hardcoded at the C language level. Therefore, we cannot refer to it directly. We have to call unclass instead. Alternatively, we can also call NextMethod:

`[.categorical` <- function(x, i)
{
    structure(
        NextMethod("["),  # call default method, passing `x` and `i`
        class="categorical",
        levels=attr(x, "levels")  # the same levels as input
    )
}

We can also introduce the replacement version of this operator:

`[<-.categorical` <- function(x, i, value)
{
    levels <- attr(x, "levels")
    value <- match(value, levels)  # integer codes corresponding to levels
    structure(
        NextMethod("[<-"),  # call default method, passing `x`, `i`, `values`
        class="categorical",
        levels=levels  # same levels as input
    )

    # # or, equivalently:
    # structure(
    #     `[<-`(unclass(x), i, value=match(value, attr(x, "levels"))),
    #     class="categorical",
    #     levels=attr(x, "levels")
    # )
}

Testing:

x <- as.categorical(c(3, 6, 4, NA, 9, 9, 6, NA, 3))
x[1:4]
## [1] "3" "6" "4" NA
## Categories: 3, 4, 6, 9
x[1:4] <- c("6", "7")
print(x)
## [1] "6" NA  "6" NA  "9" "9" "6" NA  "3"
## Categories: 3, 4, 6, 9

Notice how we handled the case of nonexistent levels and that the recycling rule has been automagically inherited (amongst other features) from the default index operator.

Exercise 10.13

Do these two operators preserve the names attribute of x? Is indexing with negative integers or logical vectors supported as well? Why is that/is that not the case?

Furthermore, let us overload the `==` operator. Assume[19] that we would like two categorical objects to be compared based on the actual labels they encode, in an elementwise manner:

`==.categorical` <- function(e1, e2)
    as.character(e1) == as.character(e2)

We are feeling lucky: by not performing any type checking, we rely on the particular as.character methods corresponding to the types of e1 and e2. Also, assuming that as.character always[20] returns a character object, we dispatch to the default method for `==` (which handles atomic vectors).

Some examples:

as.categorical(c(1, 3, 5, 1)) == as.categorical(c(1, 3, 1, 1))
## [1]  TRUE  TRUE FALSE  TRUE
as.categorical(c(1, 3, 5, 1)) == c(1, 3, 1, 1)
## [1]  TRUE  TRUE FALSE  TRUE
c(1, 3, 5, 1) == as.categorical(c(1, 3, 1, 1))
## [1]  TRUE  TRUE FALSE  TRUE

Important

In the case of binary operators, dispatching is done based on the classes of both arguments. In all three example calls above, we call `==.categorical`, regardless of whether the classed object is the first or the second operand.

If two operands are classed, and different methods are overloaded for both, a warning will be generated, and the default internal method will be called.

`==.A` <- function(e1, e2) "A"
`==.B` <- function(e1, e2) "B"
structure(c(1, 2, 3), class="A") == structure(c(2, NA, 3), class="B")
## Warning: Incompatible methods ("==.A", "==.B") for "=="
## [1] FALSE    NA  TRUE

Note

(*) By defining a single Ops method, we can define the meaning of all binary operators at once.

Ops.categorical <- function(e1, e2)
{
    if (!(.Generic %in% c("<", ">", "<=", ">=", "==", "!=")))
        stop(sprintf("%s not defined for 'categorical' objects", .Generic))
    e1 <- as.character(e1)
    e2 <- as.character(e2)
    NextMethod(.Generic)  # dispatch to the default method (for character)
}

as.categorical(c(1, 3, 5, 1)) > c(1, 2, 4, 2)
## [1] FALSE  TRUE  TRUE FALSE

Here .Generic is a variable representing the name of the operator (generic) being invoked; see sec:use-method.

Other group generics are: Summary (including functions such as min, sum, and all), Math (abs, log, round, etc.), and Complex (e.g., Re, Im); see help("groupGeneric") for more details.

Sometimes we must rely on registerS3method to force R to recognise a custom method related to such generics.

10.3. Common built-in S3 classes#

Let us discuss some noteworthy built-in classes, including those representing date-time information and factors (ordered or not).

Classes for representing tabular data will be dealt with in separate parts of this textbook, owing to their importance and ubiquity. Namely, matrices and other arrays are covered in Chapter 11, and data frames are discussed in Chapter 12.

Inspecting other[21] interesting classes is left as a simple exercise for the studious reader.

10.3.1. Date, time, etc.#

The Date class can be used to represent… dates.

(x <- c(Sys.Date(), as.Date(c("1969-12-31", "1970-01-01", "2023-02-29"))))
## [1] "2023-05-18" "1969-12-31" "1970-01-01" NA
class(x)
## [1] "Date"

Complex types are built on basic ones. Underneath, what we deal with here is:

typeof(x)
## [1] "double"
unclass(x)
## [1] 19495    -1     0    NA

which is the number of days since the UNIX epoch, 1970-01-01T00:00:00+0000 (midnight GMT/UTC).

The POSIXct (calendar time) class can be used to represent date-time objects:

(x <- Sys.time())
## [1] "2023-05-18 19:41:59 AEST"
class(x)
## [1] "POSIXct" "POSIXt"
typeof(x)
## [1] "double"
unclass(x)
## [1] 1684402920

Underneath, it is the number of seconds since the UNIX epoch. By default, whilst printing, the current default timezone is used (see Sys.timezone). However, such objects can be equipped with the tzone attribute.

structure(1, class=c("POSIXct", "POSIXt"))  # using current default timezone
## [1] "1970-01-01 10:00:01 AEST"
structure(1, class=c("POSIXct", "POSIXt"), tzone="UTC")
## [1] "1970-01-01 00:00:01 UTC"

In both cases, the time is 1 second after the beginning of the UNIX epoch. On the author’s PC, the former is displayed in the current local timezone, though.

Exercise 10.14

Use ISOdatetime to inspect how midnights are displayed in different timezones.

The POSIXlt (local time) class is represented using a list of atomic vectors[22].

(x <- as.POSIXlt(c(a="1970-01-01 00:00:00", b="2030-12-31 23:59:59")))
##                          a                          b
## "1970-01-01 00:00:00 AEST" "2030-12-31 23:59:59 AEDT"
class(x)
## [1] "POSIXlt" "POSIXt"
typeof(x)
## [1] "list"
str(unclass(x))  # calling str instead of print to make display more compact
## List of 11
##  $ sec   : num [1:2] 0 59
##  $ min   : int [1:2] 0 59
##  $ hour  : int [1:2] 0 23
##  $ mday  : int [1:2] 1 31
##  $ mon   : int [1:2] 0 11
##  $ year  : Named int [1:2] 70 130
##   ..- attr(*, "names")= chr [1:2] "a" "b"
##  $ wday  : int [1:2] 4 2
##  $ yday  : int [1:2] 0 364
##  $ isdst : int [1:2] 0 1
##  $ zone  : chr [1:2] "AEST" "AEDT"
##  $ gmtoff: int [1:2] NA NA
##  - attr(*, "tzone")= chr [1:3] "" "AEST" "AEDT"
##  - attr(*, "balanced")= logi TRUE
Exercise 10.15

Read about the meaning of each named element, especially mon and year; see help("DateTimeClasses").

The manual states that POSIXlt is supposedly closer to human-readable forms than POSIXct, but it is a matter of taste. Some R functions return the former, and some output the latter type.

Exercise 10.16

The two main functions for date formatting and parsing, strftime and strptime, use special field formatters (similar to sprintf). Read about them in the R manual. What type of inputs do they accept? What outputs do they produce?

There are several methods overloaded for objects of the said classes. In fact, the first call in this section already involved the use of c.Date.

Exercise 10.17

Play around with the overloaded versions of seq, rep, and as.character.

A specific number of days or seconds can be added to or subtracted from a date or time, respectively. However, `-` (see also diff) can also be applied on two date-time objects, which yields an object of the class difftime.

Sys.Date() - (Sys.Date() - 1)
## Time difference of 1 days
Sys.time() - (Sys.time() - 1)
## Time difference of 1 secs
Exercise 10.18

Check out how objects of the class difftime are internally represented.

Applying other arithmetic operations on date-time objects raises an error. Because date-time objects are just numbers, they can be compared to each other using binary operators[23]. Also, methods such as sort and order[24] could be applied on them.

Exercise 10.19

Check out the stringx package, which replaces the base R date-time processing functions with their more portable counterparts.

Exercise 10.20

proc.time can be used to measure the time to execute a given expression:

t0 <- proc.time()  # timer start
# ... to do - something time-consuming ...
sum(runif(1e7))  # whatever, just testing
## [1] 4999488
print(proc.time() - t0)  # elapsed time
##    user  system elapsed
##   0.243   0.024   0.266

The function returns an object of the class proc_time. Inspect how it is represented internally.

10.3.2. Factors#

The factor class is often used to represent categorical (qualitative) data, e.g., species, groups, types. In fact, categorical (the example class that we played with above) was inspired by the built-in factor.

(x <- c("spam", "spam", "bacon", "sausage", "spam", "bacon"))
## [1] "spam"    "spam"    "bacon"   "sausage" "spam"    "bacon"
(f <- factor(x))
## [1] spam    spam    bacon   sausage spam    bacon
## Levels: bacon sausage spam

Note how factors are printed. There are no double quote characters around the labels. The list of levels is given at the end.

Internally, such objects are represented as integer vectors (Section 6.4.1) with elements between \(1\) and \(k\). They are equipped with the special (as in Section 4.4.3) levels attribute, which is a character vector of length \(k\)[25].

class(f)
## [1] "factor"
typeof(f)
## [1] "integer"
unclass(f)
## [1] 3 3 1 2 3 1
## attr(,"levels")
## [1] "bacon"   "sausage" "spam"
attr(f, "levels")  # also: levels(f)
## [1] "bacon"   "sausage" "spam"

Factors are often used instead of character vectors defined over a small number of unique labels[26], where there is a need to manipulate their levels easily.

attr(f, "levels") <- c("a", "b", "c")  # also levels(f) <- c(....new...)
print(f)
## [1] c c a b c a
## Levels: a b c

The underlying codes remain the same.

Certain operations on vectors of small integers are relatively easy to express, especially those concerning element grouping: splitting, counting, and plotting (e.g., Figure 13.17). It is because the integer codes can naturally be used whilst indexing other vectors. Section 5.4 mentioned a few functions related to this, such as match, split, findInterval, and tabulate. Specifically, the latter can be implemented like “for each i, increase count[factor_codes[i]] by one”.

Exercise 10.21

Study the source code of the factor function. Note the use of as.character, unique, order, and match.

Exercise 10.22

Implement a simplified version of table based on tabulate. It should work for objects of the class factor and return a named numeric vector.

Exercise 10.23

Implement your version of cut based on findInterval.

Important

The as.numeric method has not been overloaded for factors. Therefore, when we call the generic, the default method is used: it returns the underlying integer codes as-is. This can surprise unaware users when they play with factors that feature levels consisting of strings representing integer numbers:

(g <- factor(c(11, 15, 16, 11, 13, 4, 15)))  # converts numbers to strings
## [1] 11 15 16 11 13 4  15
## Levels: 4 11 13 15 16
as.numeric(g)  # the underlying codes
## [1] 2 4 5 2 3 1 4
as.numeric(as.character(g))  # to get the numbers en-coded
## [1] 11 15 16 11 13  4 15

Unfortunately, support for factors is often hardcoded at the C language level, which will make this class behave less predictably (from the R perspective). In particular, the manual overloading of methods for factor objects might have no effect.

Important

If f is a factor, then x[f] does not behave like x[as.character(f)] (indexing by labels, using the names attribute). Instead, we get x[as.numeric(f)] (the underlying codes will determine the positions).

h <- factor(c("a", "b", "a", "c", "a", "c"))
levels(h)[h]  # the same as c("a", "b", "c")[c(1, 2, 1, 3, 1, 3)]
## [1] "a" "b" "a" "c" "a" "c"
c(b="x", c="y", a="z")[h]  # names are not used whilst indexing
##   b   c   b   a   b   a
## "x" "y" "x" "z" "x" "z"
c(b="x", c="y", a="z")[as.character(h)]  # names are used now
##   a   b   a   c   a   c
## "z" "x" "z" "y" "z" "y"

More often than not, indexing by factors will happen “accidentally”, leaving us slightly puzzled. In particular, factors look much like character vectors when they are featured in data frames:

(df <- data.frame(A=c("x", "y", "z"), B=factor(c("x", "y", "z"))))
##   A B
## 1 x x
## 2 y y
## 3 z z
class(df[["A"]])
## [1] "character"
class(df[["B"]])
## [1] "factor"

(*) Up until R 4.0, many functions (including data.frame and read.csv) had the stringsAsFactors option (see help("options")) set to TRUE, which resulted in all character vectors’ being automatically converted to factors when, e.g., creating data frames (compare Chapter 12). Luckily, this is no longer the case, but they can still be encountered sporadically: for instance, the built-in iris dataset has the fifth column of the class:

class(iris[["Species"]])
## [1] "factor"

Important

Be careful when combining factors and not-factors:

x <- factor(c("A", "B", "A"))
c(x, "C")
## [1] "1" "2" "1" "C"
c(x, factor("C"))
## [1] A B A C
## Levels: A B C
Exercise 10.24

When subsetting a factor object, the result will inherit the levels attribute as-is:

f[c(1, 2)]  # drop=FALSE
## [1] c c
## Levels: a b c

However:

f[c(1, 2), drop=TRUE]
## [1] c c
## Levels: c

Implement your version of the droplevels function, which removes the unused attributes.

Exercise 10.25

The replacement version of the index operator does not automatically add new levels to the modified object:

x <- factor(c("A", "B", "A"))
`[<-`(x, 4, value="C")  # like in x[4] <- "C"
## Warning in `[<-.factor`(x, 4, value = "C"): invalid factor level, NA
##     generated
## [1] A    B    A    <NA>
## Levels: A B

Implement your version of `[<-.factor]`, which is capable of doing so.

10.3.3. Ordered factors#

When creating factors, we can enforce a particular ordering and the number of levels:

x <- c("spam", "spam", "bacon", "sausage", "spam", "bacon")
factor(x, levels=c("eggs", "bacon", "sausage", "spam"))
## [1] spam    spam    bacon   sausage spam    bacon
## Levels: eggs bacon sausage spam

If we want the arrangement of the levels to define a linear ordering relation over the set of labels, we can call:

(f <- factor(x, levels=c("eggs", "bacon", "sausage", "spam"), ordered=TRUE))
## [1] spam    spam    bacon   sausage spam    bacon
## Levels: eggs < bacon < sausage < spam
class(f)
## [1] "ordered" "factor"

It yields an ordered factor, which enables comparisons like:

f[f >= "bacon"]  # what's not worse than bacon?
## [1] spam    spam    bacon   sausage spam    bacon
## Levels: eggs < bacon < sausage < spam

How is that possible? Well, based on information provided in this chapter, it will come as no surprise that it is because… someone has created a relational operator for objects of the class ordered.

10.3.4. Formulae (*)#

Formulae are created using the `~` operator. Some R users employ them to specify widely-conceived statistical models in functions such as lm (e.g., linear regression), glm (generalised linear models: logistic regression etc.), aov (analysis of variance), wilcox.test (the two-sample Mann–Whitney–Wilcoxon test), aggregate (computing aggregates within data groups), boxplot (box-and-whisker plots for a variable split by a combination of factors), or plot (scatter plots); see also Chapter 11 of [56].

For instance, they can be used to describe symbolic relationships such as:

  • y as a linear combination of x1, x2, and x3”,

  • y grouped/split by a combination of x1 and x2”,

where y, x1, etc., are, for example, column names in some data frame.

Formulae are interpreted by the corresponding functions, and not the R language itself. Thus, programmers are free to assign them any meaning. As their syntax is quite esoteric, beginners might find them confusing. Hence, we will discuss them much later: in Section 17.6.

Luckily, the use of formulae can usually quite easily be avoided[27].

10.4. Argument checking revisited#

Recall that anything can be passed as a function’s input. Here are some additions to the topic we touched upon in Section 9.2.1.

Compound objects are internally represented merely through basic types (such as numeric vectors, lists, or combinations thereof) and attributes. However, unless we really know better (which, by the way, this book is all about), we recommend relying on the hopefully well-thought-out methods developed by the designer of the class.

Ideally, when checking arguments passed to a function, determining if an object is of a desired type should be solely done by means of the generics like is.class. If that is not the case, a call to as.class may be used to ensure we will deal with an object of the desired type.

If a conversion is impossible, either because a specific method is unavailable or because its designer decided that this must be the case, whatever the consequences are is not necessarily our problem anymore.

We should explain to the user how the input type assurance is performed. In the case where they get any surprising results, they are expected to check/redefine the underlying is.class or as.class themselves.

This scheme is not watertight, and there will be users complaining that they get unexpected or confusing (in their opinion) outputs. With infinitely many potential types, however, we cannot respond to every possible situation.

Example 10.26

As an illustration, here is a function that counts the number of occurrences of items in a numerised (digitised?) version of a given object:

numtable <- function(x)
{
    if (!is.numeric(x)) x <- as.numeric(x)  # two generics!

    u <- unique(x)
    structure(
        tabulate(match(x, u)),
        names=as.character(u)
    )
}

Let us assume that the user has been informed (in the corresponding documentation page) that x must be a numeric vector (as in is.numeric) or an object coercible to (by means of as.numeric).

The callers will be stress-testing our function in many different ways:

numtable(c(1, 3, 5, 5, 1, 5))
## 1 3 5
## 2 1 3

It is an intended behaviour.

numtable(c("1", "3", "5", "5", "1", "5"))
## 1 3 5
## 2 1 3

It makes sense, too: a character vector consisting of number-strings has been fed on input.

numtable(c("a", "e", "z", "z", "a", "z"))
## Warning in numtable(c("a", "e", "z", "z", "a", "z")): NAs introduced by
##     coercion
## <NA>
##    6

Does the output make no sense? Of course, it does. They have just passed something not easily coercible to a numeric vector. Notice the warning that suggests there is something wrong. The users need to correct their possible mistakes by themselves.

numtable(list(1, 2, 3:10, 2))
## Error in numtable(list(1, 2, 3:10, 2)): 'list' object cannot be coerced to
##     type 'double'

Again, it makes sense. “But I think that this function should apply unlist automatically…” Well, if you want such behaviour, why don’t you call numtable(unlist(...)) yourself? It is not so difficult.

numtable(factor(c(1, 3, 5, 5, 1, 5)))
## 1 2 3
## 2 1 3

Is this confusing? No, this is a well-documented behaviour of as.numeric on objects of the type factor (designed by another developer). A user is expected to know that in this case, as.character should be called first (but we can remind them about it in the documentation).

Of course, sometimes users might discover bugs or unexpected behaviours, especially related to boundary cases we have not been considerate enough to inspect. We are, of course, the ones to blame for the following:

numtable(numeric(0))  # bug: this should be corrected
## <NA>
##    0

10.5. (Over)using the forward-pipe operator, `|>` (*)#

The object-oriented programming paradigm is utile when we wish to define a new data type, perhaps even a hierarchy of types. Many development teams find it an efficient tool to organise larger pieces of software. Yet, in the broad data science and numerical computing domains, we are often the consumers of object orientation rather than class designers.

Thanks to the discussed method dispatch mechanism, our language is easily extensible. Something that mimics a new data type can easily be introduced. Most importantly, methods can be added or removed during run-time, e.g., when importing external packages.

However, R is still a functional programming language, where functions are not just first-class citizens: they are privileged. Of course, there are some inherent limitations stemming from the ingenious simplicity of S3: method dispatch is usually based only on the type of the first function argument, classes cannot be defined formally (but see Section 11.5), and there is no real encapsulation (we cannot actually hide data from a user[28]). However, overall the whole concept has proven quite versatile.

In functional programming, the emphasis is on operations (verbs), not data (nouns). It leads to a very readable syntax. For example, assuming that square, x, and y are sensibly defined, the mean squared error can be written as:

mean(square(x-y))

It is very user-centric. However, when implementing more complex data processing pipelines, a programmer thinks: “first, I need to do this, then I need to do that, and afterwards…”. When they write it down, there can be some pressing of Home and End keys necessary. It should not be a problem for most programmers.

finally(thereafter(then(first(x))))

However, some people are inherently lazy, always complaining, or compulsively trying to “optimise”[29] things.

Example 10.27

Base R is extremely flexible. We can introduce new vocabulary as we please. In Chapter 12, we will study an example where we define:

  • group_by (a function that splits a data frame with respect to a combination of levels in given named columns and returns a list of data frames with class list_dfs),

  • aggregate.list_dfs (which applies a given aggregation function on each column of each data frame in a given list), and

  • mean.list_dfs (a specialised version of the former that calls mean).

The specifics do not matter now. Let us just consider the notation we use when the operations are chained:

# select a few rows and columns from the `iris` data frame:
iris_subset <- iris[51:150, c("Sepal.Width", "Petal.Length", "Species")]
# compute the averages of all variables grouped by Species:
mean(group_by(iris_subset, "Species"))
##      Species            x  Mean
## 1 versicolor  Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3  virginica  Sepal.Width 2.974
## 4  virginica Petal.Length 5.552

It is quite readable. We compute the mean in groups defined by Species in a subset of the iris data frame. All verbs appear on the left-hand side of the expression, with the last (the most important?) operation being listed first.

By the way, self-explanatory variable names and rich comments are priceless.

In more traditional object-oriented programming languages, either the method list is sealed inside[30] the class’ definition (like in C++), or some peculiar patches must be applied to inject a method (like in Python)[31]. There, it is the objects that are told what to do. They are treated as black boxes.

Some popular languages rely on message-passing syntax, where operations are propagated (and written) left-to-right instead of inside-out. For instance, in C++ and Python (amongst many others), “obj.method1().method2()” means “call method1 on obj and then call method2 on the result.

Since R 4.1.0, there is a pipe operator[32], `|>`. It is merely a syntactic sugar for translating between the message-passing and function-centric notion. In a nutshell, writing:

x |> f() |> g(y) |> h()
(x-y) |> square() |> mean()

is equivalent, respectively, to:

h(g(f(x), y))
mean(square(x-y))

This syntax is developer-centric. It emphasises the order in which the operations are executed, something that could always be achieved with the function-centric form and perhaps a few auxiliary variables.

The placeholder `_` can be used on the right-hand side of the pipe operator (only once) to indicate that the left-hand side must be matched to a specific named argument of the function to be called. Otherwise, the left side always gets passed as the first argument.

Therefore, the two following expressions are equivalent:

x |> median() |> `-`(e1=x, e2=_) |> abs() |> median()
median(abs(x-median(x)))
Example 10.28

In the above example, a pipe operator version of the iris aggregation exercise would look like:

iris_subset |> group_by("Species") |> mean()

Expressions on the right-hand side of the pipe operator must always be proper calls. Therefore, the use of round brackets is obligatory. Thus, when passing anonymous functions, we must write:

runif(10) |> (function(x) mean((x-mean(x))^2))()  # note the "()" at the end
## [1] 0.078184

Peculiarly, in R 4.1.0, a “shorthand” notation for creating functions was introduced. We can save seven keystrokes and scribble “\(...) expr” instead of “function(...) expr”.

runif(10) |> (\(x) mean((x-mean(x))^2))()  # again: "()" at the end
## [1] 0.078184

There is nothing that cannot be achieved without the pipe operator. As this book is minimalist by design, we will be refraining[33] ourselves from using it.

Note

When writing code interactively, we may sometimes benefit from using the rightward `->` operator. Suffice it to say that “name <- value” and “value -> name” are synonymous.

This way, we can write some lengthy code, store the result in an intermediate variable, and then continue to the next line (possibly referring to that auxiliary value more than once). In the long run, multiplying entities without necessity is unsustainable.

For instance:

runif(10) -> .; mean((.-mean(.))^2)
## [1] 0.078184

or:

iris[, c("Sepal.Width", "Petal.Length", "Species")] -> .
.[ .[, "Species"] %in% c("versicolor", "virginica"), ] -> .
mean(group_by(., "Species"))
##      Species            x  Mean
## 1 versicolor  Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3  virginica  Sepal.Width 2.974
## 4  virginica Petal.Length 5.552

`.` is as valid a variable name as any other one.

10.6. Exercises#

Exercise 10.29

Answer the following questions:

  • How to display the source code of the default methods for head and tail?

  • Can there be, at the same time, one object of the class c("A", "B") and another one of the class c("B", "A")?

  • If f is a factor, what are the relationships between as.character(f), as.numeric(f), as.character(as.numeric(f)), and as.numeric(as.character(f))?

  • If x is a named vector and f is a factor, is x[f] equivalent to x[as.character(f)] or rather x[as.numeric(f)]?

Exercise 10.30

A user calls:

plot(x, y, col="red", ylim=c(1, max(x)), log="y")

where x and y are numeric vectors. Consult help("plot") for the meaning of the ylim and log arguments. Was that straightforward?

Exercise 10.31

Explain why the two following calls return significantly different results.

c(Sys.Date(), "1970-01-01")
## [1] "2023-05-18" "1970-01-01"
c("1970-01-01", Sys.Date())
## [1] "1970-01-01" "19495"

Propose a workaround.

Exercise 10.32

Write methods head and tail for our example categorical class.

Exercise 10.33

(*) Write an R package that defines S3 class categorical. Add a few methods for this class. Note the need to use the S3method directive in the NAMESPACE file; see [62].

Exercise 10.34

Inspect the result of a call to binom.test(79, 100). Find the method responsible for such objects’ pretty printing.

Exercise 10.35

Inspect the result of a call to rle(c(1, 1, 1, 4, 3, 3, 3, 3, 3, 2, 2)). Find the method responsible for such objects’ pretty printing.

Exercise 10.36

Read more about the connection class. In particular, see the Value section in help("connections").

Exercise 10.37

Read about the subsetting operators overloaded for the package_version class; see help("numeric_version").

Exercise 10.38

There are xtfrm methods overloaded for classes such as numeric_version, difftime, Date, and factor. Find out how they work and where they might be of service (especially in relation to order and sort; see also Section 12.3.1).

Exercise 10.39

Give an example where split(x, list(y1, y2)) (with default arguments) will fail to generate the correct result.

Exercise 10.40

Write a function that determines the mode, i.e., the most frequently occurring value in a given object of the class factor. If the mode is not unique, return a randomly chosen one (each with the same probability).

Exercise 10.41

Implement your version of the gl function.

Exercise 10.42

Check out which built-in date-time functions the stringx package replaces with more portable ones.