# 10. S3 Classes

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of Chapters 1–12 are already complete, but there will be more. In the meantime, any bug/typos reports/fixes are appreciated. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [20].

Let x be a randomly generated matrix with 1,000,000 rows and 1,000 columns, y be a data frame with results from the latest survey indicating that things are not the way most people (no matter the side of the many political spectra) think they are, and and z be another matrix, this time with many zeroes.

Human brain is not capable of handling too much information which is too specific. This is why we have a natural tendency to group different entities based on their similarities so as to form some more abstract classes thereof.

Also, many of us are inherently lazy. Thus, oftentimes we will take shortcuts to minimise energy (at a price to be paid later).

Printing out a matrix, a data frame, and a time series are all still instances of the displaying of things, although they surely differ in detail. Now that ad probably forgotten which objects are hidden behind x, y, and z, being able to simply call “print(y)” without having to recall that, yes, y is a data frame, might seem quite appealing.

This chapter introduces the so-called S3 classes [8], which provide a lightweight object oriented programming (OOP) approach for automated dispatching of calls to generics of the type “print(y)” to concrete methods such as “print.data.frame(y)”, based on the class of the object they are invoked upon.

S3 classes in their essence are beautifully simple[1]. They are inspired[2] by the well-thought-through concepts present in other functional programming languages (such as the Common Lisp Object System; see below). Ultimately, those generics and methods are ordinary R functions (Chapter 7) and classes are merely additional object attributes (Section 4.4).

Of course this does not mean that wrapping our heads around them will be effortless. However, unlike other “class systems”[3], S3 is ubiquitous in R programming: suffice it to say that factors, matrices, and data frames discussed in the next chapters are quite straightforward, S3-based extensions of the concepts we have introduced so far.

## 10.1. Object Type vs Class

Recall that typeof (introduced in Section 4.1) returns the internal type of any R object. Even though there are only few admissible cases thereof[4], they open the world of endless possibilities[5].

The basic types we covered so far (mostly atomic and generic vectors; compare Figure 1) provide a basis for more complex data structures. This is thanks to the fact that they can be equipped with arbitrary attributes (Section 4.4).

typeof(NULL)
## [1] "NULL"
typeof(c(TRUE, FALSE, NA))
## [1] "logical"
typeof(c(1, 2, 3, NA_real_))
## [1] "double"
typeof(c("a", "b", NA_character_))
## [1] "character"
typeof(list(list(1, 2, 3), LETTERS))
## [1] "list"
typeof(function(x) x)
## [1] "closure"


The interesting fact is that most compound types, whose most prevalent instances are constructed using the mechanisms discussed in this chapter[6], only pretend they are something different from what they actually are. They are often quite good at doing their job, though, and hence might be useful. By knowing what is under their hood we will demystify them and become able to manipulate their state outside of the prescribed use cases.

Important

Setting the class attribute might make some objects behave differently in certain scenarios.

Example 10.1

Let us consider two identical objects equipped with different class attributes.

xt <- structure(123, class="POSIXct")  # POSIX calendar time
xd <- structure(123, class="Date")


Despite that both objects are being represented using numeric vectors:

c(typeof(xt), typeof(xd))
## [1] "double" "double"


When printed, they are decoded quite differently:

print(xt)
## [1] "1970-01-01 10:02:03 AEST"
print(xd)
## [1] "1970-05-04"


In the former case, 123 is treated as the number of seconds since the so-called UNIX epoch, 1970-01-01T00:00:00+0000. The latter is deciphered as the number of days since the said (quite widely used in computer systems by the way) timestamp.

We may hence suspect, and we are absolutely right, that there exists some underlying mechanism that actually calls a version of print that is dependent on an object’s virtual class.

That this only depends on the class attribute, which might be set, unset, or reset quite freely, is emphasised below:

attr(xt, "class") <- "Date"  # change class from POSIXct to Date
print(xt)  # same 123, but now interpreted as Date
## [1] "1970-05-04"
as.numeric(xt)  # drops all attributes
## [1] 123
unclass(xd)  # drops the class attribute; attr<-(xd, "class", NULL)
## [1] 123


We are having so much fun that one more illustration can only add to joy.

Example 10.2

Consider an example data frame:

x <- iris[1:3, 1:2]  # a subset of a built-in example data frame
print(x)
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2


This is an object of the following class (an object whose class attribute is set to):

attr(x, "class")
## [1] "data.frame"


Some may say, and they are absolutely right, that we have not covered data frames yet: this is the topic of Chapter 12, which is still ahead of us. However, from the current perspective, we are interested in the fact that an R data frame is merely a list (Chapter 4) of vectors of the same lengths equipped with names and row.names attributes.

typeof(x)
## [1] "list"
attr(x, "class") <- NULL  # or x <- unclass(x)
print(x)
## $Sepal.Length ## [1] 5.1 4.9 4.7 ## ##$Sepal.Width
## [1] 3.5 3.0 3.2
##
## attr(,"row.names")
## [1] 1 2 3


Important

Revealing how x is actually represented, enables us to process it (although perhaps not in the most convenient or efficient manner) using the extensive skill set that we have already[7] developed by studying the material covered in the previous part of our book (including solving all the exercises). This can be particularly useful, especially bearing in mind that some (built-in or third-party) data types are not particularly well-designed.

Note again that attributes are simple additions to R objects. However, as we said in Section 4.4.3, certain attributes are special, and class is one of them.

In particular, we can set class to be only a character vector (possibly of length greater than one; see Section 10.2.5).

x <- 12345
attr(x, "class") <- 1  # character vectors only
## Error in attr(x, "class") <- 1: attempt to set invalid 'class' attribute


Furthermore, there exists the class function that can read the value of the class attribute. Its replacement version is also available.

class(x) <- "Date"  # set; the same as attr(x, "class") <- "Date"
class(x)  # get; the same as attr(x, "class")
## [1] "Date"


Important

The class function always yields a value, even if the corresponding attribute is not set. We call it an implicit class. Compare between the following and the outputs generated by typeof:

class(NULL)  # no class set, because NULL cannot have attributes at all
## [1] "NULL"
class(c(TRUE, FALSE, NA))  # no attributes, so class is implicit (= typeof)
## [1] "logical"
class(c(1, 2, 3, NA_real_))  # typeof yields "double"
## [1] "numeric"
class(c("a", "b", NA_character_))
## [1] "character"
class(list(list(1, 2, 3), LETTERS))
## [1] "list"
class(function(x) x)  # typeof yields "closure"
## [1] "function"


Also, in Chapter 11, we will note that any object equipped with the dim attribute also has an implicit class:

(x <- as.matrix(c(1, 2, 3)))
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
attributes(x)  # class is not amongst the attributes
## $dim ## [1] 3 1 class(x) # implicit class ## [1] "matrix" "array" typeof(x) # it is still a numeric vector (under the hood) ## [1] "double"  ## 10.2. Generics and Method Dispatching ### 10.2.1. Generics, Default, and Custom Methods Let us inspect the source code of the print function: print(print) # sic! ## function (x, ...) ## UseMethod("print") ## <bytecode: 0x55ecc88904e0> ## <environment: namespace:base>  Any function like the above[8] we will call from now on an S3 (S version 3) generic. Its only job is to invoke UseMethod("print")[9]. This dispatches the control flow to another function, referred to as method, based on the class of the first argument. For example, let us define an object of class categorical (a name that we have just come up with; we could have called it cat, CtGrCl, or SpanishInquisition as well), which will be our own version of the famous built-in factor type that we discuss later. x <- structure( c(1, 3, 2, 1, 1, 1, 3), levels=c("a", "b", "c"), class="categorical" )  We assume that such an object is a vector of small positive integers (codes) equipped with the levels attribute being a character vector of length no less than the maximum of the said integers. The first category will be used to decipher the meaning of code “1”, for example. Hence, the above vector represents a sequence a, c, b, a, a, a, c. We have not defined any special method for the printing of objects of class categorical. Hence, when we call print, the default (fallback) method will be called: print(x) ## [1] 1 3 2 1 1 1 3 ## attr(,"levels") ## [1] "a" "b" "c" ## attr(,"class") ## [1] "categorical"  This is the standard function for displaying numeric vectors that we are all well familiar with. Its name is print.default, and we can always call it directly: print.default(x) # the default print method ## [1] 1 3 2 1 1 1 3 ## attr(,"levels") ## [1] "a" "b" "c" ## attr(,"class") ## [1] "categorical"  We can, however, introduce our own method for the custom printing of objects of class categorical, whose name must precisely be print.categorical: print.categorical <- function(x, ...) { x_character <- attr(x, "levels")[unclass(x)] print(x_character) # calls print.default cat(sprintf("Categories: %s\n", paste(attr(x, "levels"), collapse=", "))) invisible(x) # this is what all print methods do; see help("print") }  Now, calling print automatically dispatches the control flow to the above method: print(x) ## [1] "a" "c" "b" "a" "a" "a" "c" ## Categories: a, b, c  Of course, the default method can still be called; calling print.default(x) directly will output the same result as before. Note print.categorical has been equipped with the dot-dot-dot attribute, because the generic print had one too; we should always ensure consistency ourselves[10]. ### 10.2.2. Creating Own Generics Introducing new S3 generics is as straightforward as defining a function that calls UseMethod. For instance, here is a dispatcher which allows for creating new objects of class categorical based on other objects: as.categorical <- function(x, ...) UseMethod("as.categorical")  We always need to define the default method: as.categorical.default <- function(x, ...) { x <- as.character(x) xu <- unique(sort(x)) # drops NAs structure( match(x, xu), class="categorical", levels=xu ) }  Testing: as.categorical(c("a", "c", "a", "a", "d", "c")) ## [1] "a" "c" "a" "a" "d" "c" ## Categories: a, c, d as.categorical(c(3, 6, 4, NA, 9, 9, 6, NA, 3)) ## [1] "3" "6" "4" NA "9" "9" "6" NA "3" ## Categories: 3, 4, 6, 9  Note that print.categorical has been invoked twice here. The above is quite flexible already, because it relies on the generic (Section 10.2.3) as.character, which handles a wide variety of data types. Of course, it does not mean we cannot be more precise about some particular ones. Example 10.3 For instance, we might want to forbid the conversion from lists, because this does not necessarily make sense: as.categorical.list <- function(x, ...) stop("conversion of lists to categorical is not supported")  The users can always be instructed in the method’s documentation that they are the ones responsible for an explicit conversion of list objects to something different prior to a call to as.categorical. Whether this was a good design choice, time will tell. Example 10.4 Note that the default method deals with logical vectors perfectly fine: as.categorical(c(TRUE, FALSE, NA, NA, FALSE)) # as.categorical.default ## [1] "TRUE" "FALSE" NA NA "FALSE" ## Categories: FALSE, TRUE  However, we might still want to introduce a specialised version, because we know a slightly more efficient algorithm (and we have nothing better to do) based on the fact that FALSE and TRUE converted to numeric yield 0 and 1, respectively: as.categorical.logical <- function(x, ...) { x <- as.logical(x) # or stopifnot(is.logical(x)) ? structure( x + 1, # only 1, 2, and NAs will be generated class="categorical", levels=c("FALSE", "TRUE") ) }  This yields the same result, but is a bit faster: as.categorical(c(TRUE, FALSE, NA, NA, FALSE)) # as.categorical.logical ## [1] "TRUE" "FALSE" NA NA "FALSE" ## Categories: FALSE, TRUE  Note that we have performed some argument validation at the beginning, because a user is always able to call a method directly on an R object of any kind (which is a good thing!; see Section 10.2.4). In other words, there is no guarantee that the argument x must be of type logical. ### 10.2.3. Built-in Generics Many functions and operators we have introduced so far are in fact S3 generics: print, head, [, +, <=, as.character, as.list, round, log, sum, c, and na.omit, to name a few. Some of them might not even call UseMethod explicitly; dispatching can be done internally, at the C language level[11]. Overall, the list of all S3 generics is somewhat difficult to generate[12], but at least the internal ones are enumerated in help("InternalMethods") and help("groupGeneric"). Example 10.5 Let us overload the as.character method. The default one does not make much sense for the objects of our custom type: as.character(x) ## [1] "1" "3" "2" "1" "1" "1" "3"  So: as.character.categorical <- function(x, ...) attr(x, "levels")[unclass(x)]  And now: as.character(x) ## [1] "a" "c" "b" "a" "a" "a" "c"  Exercise 10.6 Overload the unique method for objects of class categorical. Exercise 10.7 Overload the rep method for objects of class categorical. Example 10.8 New types should be designed carefully. For instance, if we forget to consider overloading the to-numeric converter, we might end up with some users being puzzled[13] when they see: (x <- as.categorical(c(4, 9, 100, 9, 9, 100, 42, 666, 4))) ## [1] "4" "9" "100" "9" "9" "100" "42" "666" "4" ## Categories: 100, 4, 42, 666, 9 as.double(x) # as.double.default(x) ## [1] 2 5 1 5 5 1 3 4 2  Hence, we might want to introduce: as.double.categorical <- function(x, ...) { # as.double.default(as.character.categorical(x)) as.double(as.character(x)) }  Which now yields: as.double(x) # as.double.categorical(x) ## [1] 4 9 100 9 9 100 42 666 4  Note We can still use unclass to fetch the codes: unclass(x) ## [1] 2 5 1 5 5 1 3 4 2 ## attr(,"levels") ## [1] "100" "4" "42" "666" "9"  This is because the above returns a class-free object, which is now guaranteed to be handled by the default methods (print, subsetting, as.character, etc.). Exercise 10.9 What would happen if we used as.numeric instead of unclass in print.categorical and as.character.categorical? Exercise 10.10 Update the above methods in such a way that we can also create named objects of class categorical (i.e., equipped with the names attribute). Exercise 10.11 Note that the levels of x are sorted lexicographically, not numerically. Introduce a single method that would make the above code (when re-run without any alterations) generate a more natural result. ### 10.2.4. Dispatching Only on One Argument and Calling S3 Methods Directly With S3, the dispatching is done based on the class of only one[14] argument: by default, the first one from the parameter list. For example, the c function is a generic which dispatches on the class of the first argument. Let us overload it for categorical objects (or, more precisely, create a function that will be dispatched to when the generic is called upon a series of objects such that the first element is of the said class). c.categorical <- function(...) as.categorical( unlist( lapply(list(...), as.character) ) )  It converts each argument to a character vector (relying on the generic as.character to take care of the details) and makes use of the fact that unlist converts a list of such atomic vectors to a single sequence of strings. Calling c with the first argument being of class categorical dispatches to the above method: x <- c(9, 5, 7, 7, 2) xc <- as.categorical(x) c(xc, x) # c.categorical ## [1] "9" "5" "7" "7" "2" "9" "5" "7" "7" "2" ## Categories: 2, 5, 7, 9  However, if the first argument is, say, unclassed, the default method will be consulted: c(x, xc) # default c ## [1] 9 5 7 7 2 4 2 3 3 1  This method ignores the class attribute and sees xc as-it-is, a barebone numeric vector: attributes<-(xc, NULL) # the underlying codes ## [1] 4 2 3 3 1  This is not a bug! This is a well-documented (and now explained) behaviour. After all, compound types (classed objects) are merely emulated through the basic ones. Important In most cases, S3 methods can be called directly to get the desired outcome: c.categorical(x, xc) # force a call to the specific method ## [1] "9" "5" "7" "7" "2" "9" "5" "7" "7" "2" ## Categories: 2, 5, 7, 9  Note We said “in most cases”, because some methods can be: • hardcoded at the C language level (for instance, there is no c.default defined at all[15]), • hidden (defined in a package namespace but not exported); see Section 16.4.5, • overloaded as a group; see Section 16.4.6. Example 10.12 Just for fun, let us find a partition of the iris dataset into three clusters using the k-means algorithm: res <- kmeans(iris[-5], centers=3, nstart=10) print(res) ## K-means clustering with 3 clusters of sizes 50, 62, 38 ## ## Cluster means: ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 5.0060 3.4280 1.4620 0.2460 ## 2 5.9016 2.7484 4.3935 1.4339 ## 3 6.8500 3.0737 5.7421 2.0711 ## ## Clustering vector: ## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## [ reached getOption("max.print") -- omitted 51 entries ] ## ## Within cluster sum of squares by cluster: ## [1] 15.151 39.821 23.879 ## (between_SS / total_SS = 88.4 %) ## ## Available components: ## ## [1] "cluster" "centers" "totss" "withinss" ## [5] "tot.withinss" "betweenss" "size" "iter" ## [9] "ifault"  The above is an object of class: class(res) ## [1] "kmeans"  which in fact is a: typeof(res) ## [1] "list"  The underlying list looks like: unclass(res) ##$cluster
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [ reached getOption("max.print") -- omitted 51 entries ]
##
## $centers ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 5.0060 3.4280 1.4620 0.2460 ## 2 5.9016 2.7484 4.3935 1.4339 ## 3 6.8500 3.0737 5.7421 2.0711 ## ##$totss
## [1] 681.37
##
## $withinss ## [1] 15.151 39.821 23.879 ## ##$tot.withinss
## [1] 78.851
##
## $betweenss ## [1] 602.52 ## ##$size
## [1] 50 62 38
##
## $iter ## [1] 2 ## ##$ifault
## [1] 0


and we already know that the above is displayed in a fancy way only because there is a print method overloaded for objects of class kmeans.

But is there really?

print.kmeans


Even though the method is hidden in the stats package’s namespace, from Section 16.4.5 we will learn that it can be accessed by calling getS3method("print", "kmeans") or referring to stats:::print.kmeans (note the triple colon).

### 10.2.5. Multi-class-ness

The class attribute can be instantiated as a character vector of any length. For example:

(t1 <- Sys.time())
## [1] "2023-01-18 09:42:09 AEDT"
(t2 <- strptime("2021-08-15T12:59:59+1000", "%Y-%m-%dT%H:%M:%S%z"))
## [1] "2021-08-15 12:59:59"


Let us inspect the two objects’ classes:

class(t1)
## [1] "POSIXct" "POSIXt"
class(t2)
## [1] "POSIXlt" "POSIXt"


When we discuss date-time classes in more detail later, we will take note that the former is represented as a numeric vector, whilst the latter is a list. Hence, primarily, these two should be seen as instances of two distinct types. However, both of them have a lot in common, hence it was a wise design choice to also allow them to be seen as the representatives of the same generic category of POSIX time objects.

Important

When calling a generic function[16] f on an object x of classes[17] class1, class2, …, classK (in this order), UseMethod(f, x) dispatches to the method determined as follows:

1. if f.class1 is available[18], call it;

2. otherwise, if f.class2 is available, call this one;

3. …;

4. otherwise, if f.classK is available, invoke it;

5. otherwise, refer to the fallback f.default.

Example 10.13

There is a method diff for objects of class POSIXt featuring a statement:

r <- if (inherits(x, "POSIXlt")) as.POSIXct(x) else x


This way, we can be handling both POSIXct and POSIXlt instances via the same procedure.

Let us see in this simple scheme any magic. It is nothing more than what was described above: a way of determining which method should be called for a particular R object. It can of course be used as a mechanism to mimic (and certainly it was inspired by) the idea of inheritance in object-oriented programming languages, but note that the S3 system does not allow for defining classes in any formal manner.

For example, we cannot say that objects of class POSIXct inherit from POSIXt or each object of class POSIXct is also an instance of POSIXt. The class attribute can still be set arbitrarily on an per-object basis: we can create ones whose class is simply POSIXct (without the POSIXt part) or even c("POSIXt", "POSIXct") (in this very order).

Operators are ordinary functions (Section 9.4.5). Even though what follows can partially be implied by what we have said above, as usual in R, there will be some oddities.

For example, let us overload the index operator for objects of class categorical. Looking at help("["), we see that the default method[19] has two arguments: x (the categorical object being sliced) and i (the indexer). Ours will have the same interface then:

[.categorical <- function(x, i)
{
structure(
unclass(x)[i],  # [(unclass(x), i)
class="categorical",
levels=attr(x, "levels")  # the same levels as input
)
}


We can also introduce the replacement version of this operator:

[<-.categorical <- function(x, i, value)
{
levels <- attr(x, "levels")
codes <- match(value, levels)  # integer codes corresponding to levels
x <- unclass(x)
x[i] <- codes  # default method for the replacement version of [
structure(
x,
class="categorical",
levels=levels  # same levels as input
)

# # or, equivalently:
# structure(
#     [<-(unclass(x), i, value=match(value, attr(x, "levels"))),
#     class="categorical",
#     levels=attr(x, "levels")
# )
}


Testing:

x <- as.categorical(c(3, 6, 4, NA, 9, 9, 6, NA, 3))
x[1:4]
## [1] "3" "6" "4" NA
## Categories: 3, 4, 6, 9
x[1:4] <- c("6", "7")
print(x)
## [1] "6" NA  "6" NA  "9" "9" "6" NA  "3"
## Categories: 3, 4, 6, 9


Note how we handled the case of non-existing levels and that the recycling rule has been automagically inherited (amongst other features) from the default index operator.

Exercise 10.14

Do these two operators preserve the names attribute of x? Is indexing with negative integers or logical vectors supported as well? Why is that/is that not the case?

Furthermore, let us overload the == operator. Assume[20] that we would like two categorical objects be compared based on the actual labels they encode, in an elementwise manner:

==.categorical <- function(e1, e2)
as.character(e1) == as.character(e2)


We are feeling lucky: by not performing any type checking, we rely on the particular as.character methods corresponding to the types of e1 and e2. Also, assuming that as.character always[21] returns an object of type character, we dispatch to the default method for == (which handles atomic vectors).

Some examples:

as.categorical(c(1, 3, 5, 1)) == as.categorical(c(1, 3, 1, 1))
## [1]  TRUE  TRUE FALSE  TRUE
as.categorical(c(1, 3, 5, 1)) == c(1, 3, 1, 1)
## [1]  TRUE  TRUE FALSE  TRUE
c(1, 3, 5, 1) == as.categorical(c(1, 3, 1, 1))
## [1]  TRUE  TRUE FALSE  TRUE


Important

In the case of binary operators, dispatching is done based on the classes of both arguments. In all three example calls above, we call ==.categorical, regardless of whether the classed object is the first or the second operand. If two operands are classed and different methods are overloaded for both of them, a warning will be generated and the default internal method will be called.

==.A <- function(e1, e2) "A"
==.B <- function(e1, e2) "B"
structure(c(1, 2, 3), class="A") == structure(c(2, NA, 3), class="B")
## Warning: Incompatible methods ("==.A", "==.B") for "=="
## [1] FALSE    NA  TRUE


Note

In Section 16.4.6, we will mention that operators as well as certain groups of functions (including min, sum, and all or abs, log, and round) can be overloaded all at once; see also help("groupGeneric").

## 10.3. Common Built-in S3 Classes

Let us discuss some noteworthy built-in classes, including the ones that represent date/time information and factors (ordered or not).

Classes for representing tabular data will be dealt with in separate parts of this textbook, owing to their importance and ubiquity. Namely, matrices and other arrays are covered in Chapter 11, and data frames in Chapter 12.

The inspecting of other[22] interesting classes is left as a simple exercise to the kind reader.

### 10.3.1. Date, Time, etc.

The Date class can be used to represent… dates.

(x <- c(Sys.Date(), as.Date(c("1969-12-31", "1970-01-01", "2023-02-29"))))
## [1] "2023-01-18" "1969-12-31" "1970-01-01" NA
class(x)
## [1] "Date"


Complex types are built upon basic ones; underneath, what we deal with is:

typeof(x)
## [1] "double"
unclass(x)
## [1] 19375    -1     0    NA


which is the number of days since the so called UNIX epoch, 1970-01-01T00:00:00+0000 (midnight GMT/UTC).

The POSIXct (calendar time) class can be used to represent date-time objects:

(x <- Sys.time())
## [1] "2023-01-18 09:42:09 AEDT"
class(x)
## [1] "POSIXct" "POSIXt"
typeof(x)
## [1] "double"
unclass(x)
## [1] 1.674e+09


Underneath, it is the number of seconds since the UNIX epoch. By default, whilst printing, the current default timezone is used (see Sys.timezone). However, such objects can be equipped with the tzone attribute.

structure(1, class=c("POSIXct", "POSIXt"))  # using current default timezone
## [1] "1970-01-01 10:00:01 AEST"
structure(1, class=c("POSIXct", "POSIXt"), tzone="UTC")
## [1] "1970-01-01 00:00:01 UTC"


In both cases, the time is 1 second after the beginning of UNIX epoch. In the former, it is displayed in the current local timezone, though (on the author’s PC).

Exercise 10.15

Use ISOdatetime to inspect how midnights are displayed in different timezones.

There is also the POSIXlt (local time) class, which is represented using a list of atomic vectors[23].

(x <- as.POSIXlt(c(a="1970-01-01 00:00:00", b="2030-12-31 23:59:59")))
##                          a                          b
## "1970-01-01 00:00:00 AEST" "2030-12-31 23:59:59 AEDT"
class(x)
## [1] "POSIXlt" "POSIXt"
typeof(x)
## [1] "list"
str(unclass(x))  # calling str instead of print to make display more compact
## List of 11
##  $sec : num [1:2] 0 59 ##$ min   : int [1:2] 0 59
##  $hour : int [1:2] 0 23 ##$ mday  : int [1:2] 1 31
##  $mon : int [1:2] 0 11 ##$ year  : Named int [1:2] 70 130
##   ..- attr(*, "names")= chr [1:2] "a" "b"
##  $wday : int [1:2] 4 2 ##$ yday  : int [1:2] 0 364
##  $isdst : int [1:2] 0 1 ##$ zone  : chr [1:2] "AEST" "AEDT"
##  \$ gmtoff: int [1:2] NA NA

Exercise 10.16

Read about the meaning of each named element, especially mon and year; see help("DateTimeClasses").

The manual states that POSIXlt is supposedly closer to human-readable forms than POSIXct, but it is a matter of taste. Some R functions return the former, and some yield the latter type.

Exercise 10.17

The two main functions for date formatting and parsing, strftime and strptime, use special field formatters (similar to those used by sprintf). Read about them in the R manual. What type of inputs do they accept? What outputs do they produce?

There is a number of methods overloaded for objects of the said classes. In fact, the first call in this section already involved the use of c.Date.

Exercise 10.18

Play around with the overloaded versions of seq, rep, and as.character.

Note that a specific number of days or seconds can be added to or subtracted from a date or time, respectively. However, - (see also diff) can also be applied on two date-time objects, which yields an object of class difftime.

Sys.Date() - (Sys.Date() - 1)
## Time difference of 1 days
Sys.time() - (Sys.time() - 1)
## Time difference of 1 secs

Exercise 10.19

Check out how objects of class difftime are internally represented.

Applying other arithmetic operations on date-time objects yields an error. Also note that because date-time objects are just numbers, they can be compared to each other using binary operators[24] and methods such as sort and order[25].

Exercise 10.20

Check out the stringx package [22] which replaces the base R date-time processing functions with their more portable counterparts.

Exercise 10.21

system.time can be used to measure the time to execute a given expression:

system.time({
sum(runif(1e7))  # whatever, just testing
})
##    user  system elapsed
##   0.225   0.019   0.245


The function returns an object of class proc_time. Inspect how it is represented internally.

### 10.3.2. Formulas (*)

Formulas (created by means of ~) are quite advanced language constructs and hence they will be discussed much further: in Section 16.5.

Some R users refer to them in functions such as lm, aggregate, t.test, boxplot, or plot to specify models or queries such as “y as a function of x1, x2, and x3” and “y grouped/split by a combination of x1 and x2” where y, x1, etc. are for example column names in a data frame or named items in a list.

There is no single standard governing how a function should interpret a formula’s terms. In fact, each procedure is free to introduce its own meaning (a micro-language built on top of R). Due to this, yours truly discourages[26] their use (especially by beginners).

### 10.3.3. Factors

The factor class is often used to represent categorical (qualitative) data, e.g., species, groups, types. In fact, the example categorical class that we played with above has been inspired by the built-in factor.

(x <- c("spam", "spam", "bacon", "sausage", "spam", "bacon"))
## [1] "spam"    "spam"    "bacon"   "sausage" "spam"    "bacon"
(f <- factor(x))
## [1] spam    spam    bacon   sausage spam    bacon
## Levels: bacon sausage spam


Take note of how factors are printed: there are no double quote characters around the labels and the list of levels is given at the end.

Internally, such objects are represented as integer vectors (Section 6.4.1) with elements between 1 and k with the special (as in Section 4.4.3) levels attribute being a character vector of length k[27].

class(f)
## [1] "factor"
typeof(f)
## [1] "integer"
unclass(f)
## [1] 3 3 1 2 3 1
## attr(,"levels")
## [1] "bacon"   "sausage" "spam"
attr(f, "levels")  # also: levels(f)
## [1] "bacon"   "sausage" "spam"


Factors are often used instead of character vectors defined over a small number of unique labels[28], where there is a need to manipulate their levels easily.

attr(f, "levels") <- c("a", "b", "c")  # also levels(f) <- c(....new...)
print(f)
## [1] c c a b c a
## Levels: a b c


The underlying codes remain the same.

Certain operations on vectors of small integers are relatively easy to implement, especially those concerning element grouping: splitting, counting, plotting (e.g., Figure 13.1). It is because the integer codes can naturally be used whilst indexing other vectors. In Section 5.4, we mentioned a few functions related to this, such as match, split, findInterval, and tabulate. Specifically, the latter can be implemented like “for each i, increase count[factor_codes[i]] by one”.

Exercise 10.22

Study the source code of the factor function. Note the use of as.character, unique, order, and match.

Exercise 10.23

Implement a simplified version of table based on tabulate. It should work for objects of class factor and return a named numeric vector.

Exercise 10.24

Implement your own version of cut based on findInterval.

Important

The as.numeric method has not been overloaded for factors. Therefore, when we call the generic, the default method is used: it returns the underlying integer codes as-is. This can surprise the unaware users when they play with factors that feature levels consisting of strings representing integer numbers:

(g <- factor(c(11, 15, 16, 11, 13, 4, 15)))  # converts numbers to strings
## [1] 11 15 16 11 13 4  15
## Levels: 4 11 13 15 16
as.numeric(g)  # the underlying codes
## [1] 2 4 5 2 3 1 4
as.numeric(as.character(g))  # to get the numbers en-coded
## [1] 11 15 16 11 13  4 15


Unfortunately, support for factors is often hardcoded at the C language level, which will make this class behave less predictably (from the R perspective). In particular, the manual overloading of methods for factor objects might have no effect.

Important

If f is a factor, then x[f] does not behave like x[as.character(f)] (indexing by labels, using the names attribute). Instead, we get x[as.numeric(f)] (the underlying codes will determine the positions).

h <- factor(c("a", "b", "a", "c", "a", "c"))
levels(h)[h]  # the same as c("a", "b", "c")[c(1, 2, 1, 3, 1, 3)]
## [1] "a" "b" "a" "c" "a" "c"
c(b="x", c="y", a="z")[h]  # names are not used whilst indexing
##   b   c   b   a   b   a
## "x" "y" "x" "z" "x" "z"
c(b="x", c="y", a="z")[as.character(h)]  # names are used now
##   a   b   a   c   a   c
## "z" "x" "z" "y" "z" "y"


More often than not, indexing by factors will happen “accidentally”, leading to our being slightly puzzled. In particular, factors look much like character vectors when they are featured in data frames:

(df <- data.frame(A=c("x", "y", "z"), B=factor(c("x", "y", "z"))))
##   A B
## 1 x x
## 2 y y
## 3 z z
class(df[["A"]])
## [1] "character"
class(df[["B"]])
## [1] "factor"


(*) Up until R 4.0, many functions (including data.frame and read.csv) had the stringsAsFactors option (see help("options")) set to TRUE, which resulted in all character vectors’ being automatically converted to factors when, e.g., creating data frames (compare Chapter 12). Luckily, this is no longer the case, but they can still be encountered sporadically: for instance, the built-in iris dataset has the fifth column of class:

class(iris[["Species"]])
## [1] "factor"


Important

Be careful when combining factors and not-factors:

x <- factor(c("A", "B", "A"))
c(x, "C")
## [1] "1" "2" "1" "C"
c(x, factor("C"))
## [1] A B A C
## Levels: A B C

Exercise 10.25

Note that when subsetting a factor object, the result will have the levels attribute inherited as-is.

f[c(1, 2)]
## [1] c c
## Levels: a b c


Implement your own version of the droplevels function which removes the unused attributes.

Exercise 10.26

The replacement version of the index operator does not automatically add new levels to the modified object:

x <- factor(c("A", "B", "A"))
[<-(x, 4, value="C")  # like in x[4] <- "C"
## Warning in [<-.factor(x, 4, value = "C"): invalid factor level, NA
## generated
## [1] A    B    A    <NA>
## Levels: A B


Implement your own version of [<-.factor] which is capable of doing so.

### 10.3.4. Ordered Factors

Note that when creating factors, we can enforce a particular ordering and the number of levels:

x <- c("spam", "spam", "bacon", "sausage", "spam", "bacon")
factor(x, levels=c("eggs", "bacon", "sausage", "spam"))
## [1] spam    spam    bacon   sausage spam    bacon
## Levels: eggs bacon sausage spam


If we want the arrangement of the levels to define a linear ordering relation over set of the labels, we can call:

(f <- factor(x, levels=c("eggs", "bacon", "sausage", "spam"), ordered=TRUE))
## [1] spam    spam    bacon   sausage spam    bacon
## Levels: eggs < bacon < sausage < spam
class(f)
## [1] "ordered" "factor"


This yields an ordered factor, which enables comparisons like:

f[f >= "bacon"]  # what's not worse than bacon?
## [1] spam    spam    bacon   sausage spam    bacon
## Levels: eggs < bacon < sausage < spam


How is that possible? Well, based on information provided in this chapter it will come as no surprise that it is because… someone has implemented a comparison operator for objects of class ordered.

## 10.4. Argument Checking Revisited

Recall that anything can be passed as a function’s input. Here are some additions to the topic we touched upon in Section 9.2.1.

Despite that compound objects are internally represented through basic types (such as numeric vectors, lists, or combinations thereof) and attributes, unless we really know better (which, by the way, this book is all about), we should be relying on the hopefully well-thought-out methods developed by the class’ designer.

Ideally, when checking arguments passed to a function, determining if an object is of a desired type should be solely done by means of the generics like is.class. If that is not the case, a call to as.class should be used to make sure we will be dealing with an object of the desired type.

If a conversion is not possible, either because a specific method is unavailable or because its designer decided that this must be the case, whatever the consequences are is not necessarily our problem anymore.

We should explain to the user that the input type assurance is done via this very mechanism and, in case they get any surprising results, they should check/redefine the underlying is.class or as.class themselves.

This is of course not watertight, and there will be users complaining that they get unexpected or confusing (in their opinion) outputs. With infinitely many potential types, however, we cannot respond to every possible situation.

Example 10.27

As an illustration, here is a function that counts the number of occurrences of items in a numerised (digitised?) version of a given object:

numtable <- function(x)
{
if (!is.numeric(x)) x <- as.numeric(x)  # two generics!

u <- unique(x)
structure(
tabulate(match(x, u)),
names=as.character(u)
)
}


Let us assume that the user has been informed (in the corresponding documentation page) that x must be a numeric vector (as in is.numeric) or an object coercible to (by means of as.numeric).

The callers will be stress-testing our function in many different ways:

numtable(c(1, 3, 5, 5, 1, 5))
## 1 3 5
## 2 1 3


This is an intended behaviour.

numtable(c("1", "3", "5", "5", "1", "5"))
## 1 3 5
## 2 1 3


This makes sense too, a character vector consisting of number-strings has been fed on input.

numtable(c("a", "e", "z", "z", "a", "z"))
## Warning in numtable(c("a", "e", "z", "z", "a", "z")): NAs introduced by
## coercion
## <NA>
##    6


Does the output make no sense? Of course, it does, they have just passed something not easily coercible to a numeric vector. Note the warning that suggests there is something wrong. The user needs to correct their possible mistake by themself.

numtable(list(1, 2, 3:10, 2))
## Error in numtable(list(1, 2, 3:10, 2)): 'list' object cannot be coerced to type 'double'


Again, makes sense. ‘But I think that this function should apply unlist automatically’ – well, if you want such a behaviour, why don’t you call numtable(unlist(...)) yourself? It is not so difficult.

numtable(factor(c(1, 3, 5, 5, 1, 5)))
## 1 2 3
## 2 1 3


Is this confusing? No; this is a well-documented behaviour of as.numeric on objects of type factor (which was designed by another developer). A user should know (but we can remind them about it in the documentation) that in this case, as.character should rather be called first.

Of course, sometimes users might discover bugs or unexpected behaviours, especially related to boundary cases we have not been considerate enough to inspect. We are of course the ones to blame for the following:

numtable(numeric(0))  # bug: this should be corrected
## <NA>
##    0


## 10.5. (Over)using the Forward-pipe Operator, |> (*)

The object-oriented programming paradigm is useful when we wish to define a new data type, perhaps even a hierarchy of types. Many development teams find it an efficient tool to organise larger pieces of software. Yet, in the broad data science and numerical computing domains, we are more often consumers of OOP rather than class designers.

Thanks to the discussed method dispatch mechanism, our language is easily extensible and something that mimics a new data type can easily be introduced. Most importantly, methods can be added or removed during run-time, e.g., when importing external packages.

However, R is still a functional programming language, where functions not only are first-class citizens; they are privileged. Of course, there are some inherent limitations stemming from the ingenious simplicity of S3: method dispatch is usually based only on the type of the first function argument, classes cannot be defined formally (but see Section 11.5) and that there is no real encapsulation (we cannot actually hide data from a user[29]). However, overall the whole concept has proven quite versatile.

In functional programming, emphasis is on operations (verbs), not data (nouns). This leads to a very readable syntax, for example (assuming that square, x, and y are sensibly defined), the mean squared error can be written as:

mean(square(x-y))


This is very user-centric. However, when implementing more complex data processing pipelines, a programmer thinks “first, I need to do this, then I need to do that, and afterwards…”. When they write it down, there can be some pressing of HOME and END keys on the keyboard involved. This should not be a problem for most programmers.

finally(thereafter(then(first(x))))


However, some people are inherently lazy, always complaining and/or always trying to “optimise”[30] things.

Example 10.28

Base R is of course extremely flexible and we can introduce new vocabulary as we please. In Chapter 12, we study an example, where we define:

• group_by (a function that splits a data frame with respect to a combination of levels in given named columns and returns a list of data frames with class list_dfs),

• aggregate.list_dfs (which applies a given aggregation function on each column of each data frame in a given list), and

• mean.list_dfs (a specialised version of the former that calls mean).

The specifics do not really matter now, let us just consider the notation we use when the operations are chained:

# select a few rows and columns from the iris data frame:
iris_subset <- iris[51:150, c("Sepal.Width", "Petal.Length", "Species")]
# compute the averages of all variables grouped by Species:
mean(group_by(iris_subset, "Species"))
##      Species            x  Mean
## 1 versicolor  Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3  virginica  Sepal.Width 2.974
## 4  virginica Petal.Length 5.552


This is quite readable: we compute the mean in groups defined by Species in a subset of the iris data frame. All verbs appear on the lefthand side of the expression, with the last (the most important?) operation being listed first.

By the way, self-explanatory variable names and rich comments are priceless.

In more traditional object-oriented programming languages, either the method list is sealed inside[31] the class’ definition (like in C++), or some peculiar patches must be applied to inject a method (like in Python)[32]. There, it is the objects that are told what to do: they are treated as black boxes.

Some popular languages rely on the message-passing syntax, where operations are propagated (and written) left-to-right instead of inside-out. For instance, in C++ and Python (amongst many others), “obj.method1().method2()” means “call method1 on obj and then call method2 on the result.

Since R 4.1.0, there is a pipe operator[33], |>, which is merely a syntactic sugar for translating between the message-passing and function-centric notion. In a nutshell, writing:

x |> f() |> g(y) |> h()
(x-y) |> square() |> mean()


is equivalent to:

h(g(f(x), y))
mean(square(x-y))


This syntax is developer-centric: it emphasises on the order in which the operations are executed, something that could always be achieved with the function-centric form and perhaps a few auxiliary variables.

Example 10.29

In the above example, a pipe operator version of the iris aggregation exercise would look like:

iris_subset |> group_by("Species") |> mean()


This book is minimalist by design and there is nothing that cannot be achieved without the pipe operator, hence we will be refraining[34] ourselves from using it.

Note

When writing code interactively, we may sometimes benefit from the use of the the rightward -> operator. Suffice to say that “name <- value” and “value -> name” are synonymous.

This way we can write some lengthy code, store the result in an intermediate variable, and then continue on in the next line (possibly referring to that auxiliary value more than once). In the long run, multiplying entities without necessity is unsustainable.

For instance:

iris[, c("Sepal.Width", "Petal.Length", "Species")] -> .
.[ .[, "Species"] %in% c("versicolor", "virginica"), ] -> .
mean(group_by(., "Species"))
##      Species            x  Mean
## 1 versicolor  Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3  virginica  Sepal.Width 2.974
## 4  virginica Petal.Length 5.552


.” is as good a variable name as any other one.

## 10.6. Exercises

Exercise 10.30

• How to display the source code of the default methods for head and tail?

• Can there be, at the same time, one object of class c("A", "B") and another one of class c("B", "A")?

• If f is a factor, what are the relationships between as.character(f), as.numeric(f), as.character(as.numeric(f)), and as.numeric(as.character(f))?

• If x is a named vector and f is a factor, is x[f] equivalent to x[as.character(f)] or rather x[as.numeric(f)]?

Exercise 10.31

A user calls:

plot(x, y, col="red", ylim=c(1, max(x)), log="y")


where x and y are numeric vectors. Consult help("plot") for the meaning of the ylim and log arguments. Was that straightforward?

Exercise 10.32

Explain why the two following calls yield significantly different results and present a workaround:

c(Sys.Date(), "1970-01-01")
## [1] "2023-01-18" "1970-01-01"
c("1970-01-01", Sys.Date())
## [1] "1970-01-01" "19375"

Exercise 10.33

Write methods head and tail for our example categorical class.

Exercise 10.34

(*) Write an R package that defines S3 class categorical and a couple of methods therefor. Note the need for the use of the S3method directive NAMESPACE; see [47].

Exercise 10.35

Inspect the result of a call to binom.test(79, 100). Find the method responsible for the pretty-printing of such objects.

Exercise 10.36

Inspect the result of a call to rle(c(1, 1, 1, 4, 3, 3, 3, 3, 3,, 2, 2)). Find the method responsible for the pretty-printing of such objects.

Exercise 10.37

Read more about the connection class; see the Value section in help("connections").

Exercise 10.38

Read about the subsetting operators overloaded for the package_version class; see help("numeric_version").

Exercise 10.39

There are xtfrm methods overloaded for classes such as numeric_version, difftime, Date, and factor. Find out how they work and where they might be useful (especially in relation to order and sort; see also Section 12.3.1).

Exercise 10.40

Give an example where split(x, list(y1, y2)) (with default arguments) will fail to generate the correct result.

Exercise 10.41

Write a function that determines the mode, i.e., the most frequently occurring value in a given object of class factor. If the mode is not unique, return a randomly chosen one (each with the same probability).

Exercise 10.42

Implement your own version of the gl function.

Exercise 10.43

Check out which built-in date-time functions the stringx package replaces with more portable ones.