10. S3 Classesď
The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyoneâs enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of Chapters 1â12 are already complete, but there will be more. In the meantime, any bug/typos reports/fixes are appreciated. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [20].
Let x
be a randomly generated matrix with 1,000,000 rows and 1,000 columns,
y
be a data frame with results from the latest survey
indicating that things are not the way most people (no matter the side
of the many political spectra) think they are, and
and z
be another matrix, this time with many zeroes.
Human brain is not capable of handling too much information which is too specific. This is why we have a natural tendency to group different entities based on their similarities so as to form some more abstract classes thereof.
Also, many of us are inherently lazy. Thus, oftentimes we will take shortcuts to minimise energy (at a price to be paid later).
Printing out a matrix, a data frame, and a time series
are all still instances of the displaying of things,
although they surely differ in detail. Now that ad probably forgotten
which objects are hidden behind x
, y
, and z
,
being able to simply call âprint(y)
â
without having to recall that, yes, y
is a data frame,
might seem quite appealing.
This chapter introduces the so-called S3 classes [8],
which provide a lightweight object oriented programming (OOP) approach
for automated dispatching of calls to generics
of the type âprint(y)
â to concrete methods
such as âprint.data.frame(y)
â,
based on the class of the object they are invoked upon.
S3 classes in their essence are beautifully simple[1]. They are inspired[2] by the well-thought-through concepts present in other functional programming languages (such as the Common Lisp Object System; see below). Ultimately, those generics and methods are ordinary R functions (Chapter 7) and classes are merely additional object attributes (Section 4.4).
Of course this does not mean that wrapping our heads around them will be effortless. However, unlike other âclass systemsâ[3], S3 is ubiquitous in R programming: suffice it to say that factors, matrices, and data frames discussed in the next chapters are quite straightforward, S3-based extensions of the concepts we have introduced so far.
10.1. Object Type vs Classď
Recall that typeof (introduced in Section 4.1) returns the internal type of any R object. Even though there are only few admissible cases thereof[4], they open the world of endless possibilities[5].
The basic types we covered so far (mostly atomic and generic vectors; compare Figure 1) provide a basis for more complex data structures. This is thanks to the fact that they can be equipped with arbitrary attributes (Section 4.4).
typeof(NULL)
## [1] "NULL"
typeof(c(TRUE, FALSE, NA))
## [1] "logical"
typeof(c(1, 2, 3, NA_real_))
## [1] "double"
typeof(c("a", "b", NA_character_))
## [1] "character"
typeof(list(list(1, 2, 3), LETTERS))
## [1] "list"
typeof(function(x) x)
## [1] "closure"
The interesting fact is that most compound types, whose most prevalent instances are constructed using the mechanisms discussed in this chapter[6], only pretend they are something different from what they actually are. They are often quite good at doing their job, though, and hence might be useful. By knowing what is under their hood we will demystify them and become able to manipulate their state outside of the prescribed use cases.
Important
Setting the class
attribute might make some objects
behave differently in certain scenarios.
Let us consider two identical objects equipped with different
class
attributes.
xt <- structure(123, class="POSIXct") # POSIX calendar time
xd <- structure(123, class="Date")
Despite that both objects are being represented using numeric vectors:
c(typeof(xt), typeof(xd))
## [1] "double" "double"
When printed, they are decoded quite differently:
print(xt)
## [1] "1970-01-01 10:02:03 AEST"
print(xd)
## [1] "1970-05-04"
In the former case, 123 is treated as the number of seconds since the so-called UNIX epoch, 1970-01-01T00:00:00+0000. The latter is deciphered as the number of days since the said (quite widely used in computer systems by the way) timestamp.
We may hence suspect, and we are absolutely right, that there exists some underlying mechanism that actually calls a version of print that is dependent on an objectâs virtual class.
That this only depends on the class
attribute,
which might be set, unset, or reset quite freely,
is emphasised below:
attr(xt, "class") <- "Date" # change class from POSIXct to Date
print(xt) # same 123, but now interpreted as Date
## [1] "1970-05-04"
as.numeric(xt) # drops all attributes
## [1] 123
unclass(xd) # drops the class attribute; `attr<-`(xd, "class", NULL)
## [1] 123
We are having so much fun that one more illustration can only add to joy.
Consider an example data frame:
x <- iris[1:3, 1:2] # a subset of a built-in example data frame
print(x)
## Sepal.Length Sepal.Width
## 1 5.1 3.5
## 2 4.9 3.0
## 3 4.7 3.2
This is an object of the following class
(an object whose class
attribute is set to):
attr(x, "class")
## [1] "data.frame"
Some may say, and they are absolutely right, that
we have not covered data frames yet: this is the topic of
Chapter 12, which is still ahead of us.
However, from the current perspective, we are interested
in the fact that an R data frame is merely
a list (Chapter 4) of vectors of the same
lengths equipped with names
and row.names
attributes.
typeof(x)
## [1] "list"
attr(x, "class") <- NULL # or x <- unclass(x)
print(x)
## $Sepal.Length
## [1] 5.1 4.9 4.7
##
## $Sepal.Width
## [1] 3.5 3.0 3.2
##
## attr(,"row.names")
## [1] 1 2 3
Important
Revealing how x
is actually represented, enables us to process it
(although perhaps not in the most convenient or efficient manner)
using the extensive skill set that we have already[7]
developed by studying the material covered in the previous part of our book
(including solving all the exercises).
This can be particularly useful, especially bearing in mind that some
(built-in or third-party) data types are not particularly well-designed.
Note again that attributes are simple additions to R objects.
However, as we said in Section 4.4.3,
certain attributes are special, and class
is one of them.
In particular, we can set class
to be only a character vector
(possibly of length greater than one; see Section 10.2.5).
x <- 12345
attr(x, "class") <- 1 # character vectors only
## Error in attr(x, "class") <- 1: attempt to set invalid 'class' attribute
Furthermore, there exists the class function
that can read the value of the class
attribute.
Its replacement version is also available.
class(x) <- "Date" # set; the same as attr(x, "class") <- "Date"
class(x) # get; the same as attr(x, "class")
## [1] "Date"
Important
The class function always yields a value, even if the corresponding attribute is not set. We call it an implicit class. Compare between the following and the outputs generated by typeof:
class(NULL) # no `class` set, because NULL cannot have attributes at all
## [1] "NULL"
class(c(TRUE, FALSE, NA)) # no attributes, so class is implicit (= typeof)
## [1] "logical"
class(c(1, 2, 3, NA_real_)) # typeof yields "double"
## [1] "numeric"
class(c("a", "b", NA_character_))
## [1] "character"
class(list(list(1, 2, 3), LETTERS))
## [1] "list"
class(function(x) x) # typeof yields "closure"
## [1] "function"
Also, in Chapter 11, we will note that any object
equipped with the dim
attribute also has an implicit class:
(x <- as.matrix(c(1, 2, 3)))
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
attributes(x) # `class` is not amongst the attributes
## $dim
## [1] 3 1
class(x) # implicit class
## [1] "matrix" "array"
typeof(x) # it is still a numeric vector (under the hood)
## [1] "double"
10.2. Generics and Method Dispatchingď
10.2.1. Generics, Default, and Custom Methodsď
Let us inspect the source code of the print function:
print(print) # sic!
## function (x, ...)
## UseMethod("print")
## <bytecode: 0x55ecc88904e0>
## <environment: namespace:base>
Any function like the above[8] we will call from now on
an S3 (S version 3) generic.
Its only job is to invoke UseMethod("print")
[9].
This dispatches the control flow to another function,
referred to as method, based on the class of the first argument.
For example, let us define an object of class categorical
(a name that we have just come up with; we could have called it
cat
, CtGrCl
, or SpanishInquisition
as well), which
will be our own version of the famous built-in factor
type that we discuss
later.
x <- structure(
c(1, 3, 2, 1, 1, 1, 3),
levels=c("a", "b", "c"),
class="categorical"
)
We assume that such an object is a vector of small positive integers (codes)
equipped with the levels
attribute being a character vector
of length no less than the maximum of the said integers.
The first category will be used to decipher the meaning of code â1â,
for example. Hence, the above vector represents a sequence
a, c, b, a, a, a, c.
We have not defined any special method for the printing of
objects of class categorical
. Hence, when we call print,
the default (fallback) method will be called:
print(x)
## [1] 1 3 2 1 1 1 3
## attr(,"levels")
## [1] "a" "b" "c"
## attr(,"class")
## [1] "categorical"
This is the standard function for displaying numeric vectors that we are all well familiar with. Its name is print.default, and we can always call it directly:
print.default(x) # the default print method
## [1] 1 3 2 1 1 1 3
## attr(,"levels")
## [1] "a" "b" "c"
## attr(,"class")
## [1] "categorical"
We can, however, introduce our own method for the custom printing
of objects of class categorical
, whose name
must precisely be print.categorical:
print.categorical <- function(x, ...)
{
x_character <- attr(x, "levels")[unclass(x)]
print(x_character) # calls `print.default`
cat(sprintf("Categories: %s\n",
paste(attr(x, "levels"), collapse=", ")))
invisible(x) # this is what all print methods do; see help("print")
}
Now, calling print automatically dispatches the control flow to the above method:
print(x)
## [1] "a" "c" "b" "a" "a" "a" "c"
## Categories: a, b, c
Of course, the default method can still be called;
calling print.default(x)
directly
will output the same result as before.
Note
print.categorical has been equipped with the dot-dot-dot attribute, because the generic print had one too; we should always ensure consistency ourselves[10].
10.2.2. Creating Own Genericsď
Introducing new S3 generics is as straightforward as defining a function that calls UseMethod.
For instance, here is a dispatcher which allows for
creating new objects of class categorical
based on
other objects:
as.categorical <- function(x, ...)
UseMethod("as.categorical")
We always need to define the default method:
as.categorical.default <- function(x, ...)
{
x <- as.character(x)
xu <- unique(sort(x)) # drops NAs
structure(
match(x, xu),
class="categorical",
levels=xu
)
}
Testing:
as.categorical(c("a", "c", "a", "a", "d", "c"))
## [1] "a" "c" "a" "a" "d" "c"
## Categories: a, c, d
as.categorical(c(3, 6, 4, NA, 9, 9, 6, NA, 3))
## [1] "3" "6" "4" NA "9" "9" "6" NA "3"
## Categories: 3, 4, 6, 9
Note that print.categorical has been invoked twice here. The above is quite flexible already, because it relies on the generic (Section 10.2.3) as.character, which handles a wide variety of data types. Of course, it does not mean we cannot be more precise about some particular ones.
For instance, we might want to forbid the conversion from lists, because this does not necessarily make sense:
as.categorical.list <- function(x, ...)
stop("conversion of lists to categorical is not supported")
The users can always be instructed in the methodâs documentation that they are the ones responsible for an explicit conversion of list objects to something different prior to a call to as.categorical. Whether this was a good design choice, time will tell.
Note that the default method deals with logical vectors perfectly fine:
as.categorical(c(TRUE, FALSE, NA, NA, FALSE)) # as.categorical.default
## [1] "TRUE" "FALSE" NA NA "FALSE"
## Categories: FALSE, TRUE
However, we might still want to introduce a specialised version,
because we know a slightly more efficient algorithm
(and we have nothing better to do)
based on the fact that FALSE
and TRUE
converted to
numeric yield 0 and 1, respectively:
as.categorical.logical <- function(x, ...)
{
x <- as.logical(x) # or stopifnot(is.logical(x)) ?
structure(
x + 1, # only 1, 2, and NAs will be generated
class="categorical",
levels=c("FALSE", "TRUE")
)
}
This yields the same result, but is a bit faster:
as.categorical(c(TRUE, FALSE, NA, NA, FALSE)) # as.categorical.logical
## [1] "TRUE" "FALSE" NA NA "FALSE"
## Categories: FALSE, TRUE
Note that we have performed some argument validation
at the beginning, because a user is always able to
call a method directly on an R object of any kind
(which is a good thing!; see Section 10.2.4).
In other words, there is no guarantee that the argument x
must be of
type logical.
10.2.3. Built-in Genericsď
Many functions and operators we have introduced so far are in fact S3 generics: print, head, `[`, `+`, `<=`, as.character, as.list, round, log, sum, c, and na.omit, to name a few.
Some of them might not even call UseMethod
explicitly; dispatching can be done internally, at the C language
level[11].
Overall, the list of all S3 generics is somewhat difficult
to generate[12],
but at least the internal ones are enumerated in
help("InternalMethods")
and
help("groupGeneric")
.
Let us overload the as.character method. The default one does not make much sense for the objects of our custom type:
as.character(x)
## [1] "1" "3" "2" "1" "1" "1" "3"
So:
as.character.categorical <- function(x, ...)
attr(x, "levels")[unclass(x)]
And now:
as.character(x)
## [1] "a" "c" "b" "a" "a" "a" "c"
Overload the unique method for objects of class categorical
.
Overload the rep method for objects of class categorical
.
New types should be designed carefully. For instance, if we forget to consider overloading the to-numeric converter, we might end up with some users being puzzled[13] when they see:
(x <- as.categorical(c(4, 9, 100, 9, 9, 100, 42, 666, 4)))
## [1] "4" "9" "100" "9" "9" "100" "42" "666" "4"
## Categories: 100, 4, 42, 666, 9
as.double(x) # as.double.default(x)
## [1] 2 5 1 5 5 1 3 4 2
Hence, we might want to introduce:
as.double.categorical <- function(x, ...)
{
# as.double.default(as.character.categorical(x))
as.double(as.character(x))
}
Which now yields:
as.double(x) # as.double.categorical(x)
## [1] 4 9 100 9 9 100 42 666 4
Note
We can still use unclass to fetch the codes:
unclass(x)
## [1] 2 5 1 5 5 1 3 4 2
## attr(,"levels")
## [1] "100" "4" "42" "666" "9"
This is because the above returns a class-free object, which is now guaranteed to be handled by the default methods (print, subsetting, as.character, etc.).
What would happen if we used as.numeric instead of unclass in print.categorical and as.character.categorical?
Update the above methods in such a way that
we can also create named objects of class categorical
(i.e., equipped with the names
attribute).
Note that the levels of x
are sorted lexicographically,
not numerically. Introduce a single method
that would make the above code (when re-run without any alterations)
generate a more natural result.
10.2.4. Dispatching Only on One Argument and Calling S3 Methods Directlyď
With S3, the dispatching is done based on the class of only one[14] argument: by default, the first one from the parameter list.
For example, the c function is a generic which
dispatches on the class of the first argument.
Let us overload it for categorical
objects
(or, more precisely, create a function that will be dispatched to
when the generic is called upon a series of objects such that
the first element is of the said class).
c.categorical <- function(...)
as.categorical(
unlist(
lapply(list(...), as.character)
)
)
It converts each argument to a character vector (relying on the generic as.character to take care of the details) and makes use of the fact that unlist converts a list of such atomic vectors to a single sequence of strings.
Calling c with the first argument being of class categorical
dispatches to the above method:
x <- c(9, 5, 7, 7, 2)
xc <- as.categorical(x)
c(xc, x) # c.categorical
## [1] "9" "5" "7" "7" "2" "9" "5" "7" "7" "2"
## Categories: 2, 5, 7, 9
However, if the first argument is, say, unclassed, the default method will be consulted:
c(x, xc) # default c
## [1] 9 5 7 7 2 4 2 3 3 1
This method ignores the class
attribute and sees xc
as-it-is,
a barebone numeric vector:
`attributes<-`(xc, NULL) # the underlying codes
## [1] 4 2 3 3 1
This is not a bug! This is a well-documented (and now explained) behaviour. After all, compound types (classed objects) are merely emulated through the basic ones.
Important
In most cases, S3 methods can be called directly to get the desired outcome:
c.categorical(x, xc) # force a call to the specific method
## [1] "9" "5" "7" "7" "2" "9" "5" "7" "7" "2"
## Categories: 2, 5, 7, 9
Note
We said âin most casesâ, because some methods can be:
hardcoded at the C language level (for instance, there is no c.default defined at all[15]),
hidden (defined in a package namespace but not exported); see Section 16.4.5,
overloaded as a group; see Section 16.4.6.
Just for fun, let us find a partition of the iris
dataset
into three clusters using the k-means algorithm:
res <- kmeans(iris[-5], centers=3, nstart=10)
print(res)
## K-means clustering with 3 clusters of sizes 50, 62, 38
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.0060 3.4280 1.4620 0.2460
## 2 5.9016 2.7484 4.3935 1.4339
## 3 6.8500 3.0737 5.7421 2.0711
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [ reached getOption("max.print") -- omitted 51 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 15.151 39.821 23.879
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
The above is an object of class:
class(res)
## [1] "kmeans"
which in fact is a:
typeof(res)
## [1] "list"
The underlying list looks like:
unclass(res)
## $cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [ reached getOption("max.print") -- omitted 51 entries ]
##
## $centers
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.0060 3.4280 1.4620 0.2460
## 2 5.9016 2.7484 4.3935 1.4339
## 3 6.8500 3.0737 5.7421 2.0711
##
## $totss
## [1] 681.37
##
## $withinss
## [1] 15.151 39.821 23.879
##
## $tot.withinss
## [1] 78.851
##
## $betweenss
## [1] 602.52
##
## $size
## [1] 50 62 38
##
## $iter
## [1] 2
##
## $ifault
## [1] 0
and we already know that the above is displayed in a fancy way
only because there is a print method overloaded
for objects of class kmeans
.
But is there really?
print.kmeans
## Error in eval(expr, envir, enclos): object 'print.kmeans' not found
Even though the method is hidden in the stats packageâs
namespace, from Section 16.4.5 we will learn that
it can be accessed by calling
getS3method("print", "kmeans")
or referring to stats:::
print.kmeans
(note the triple colon).
10.2.5. Multi-class-nessď
The class
attribute can be instantiated as a character vector
of any length. For example:
(t1 <- Sys.time())
## [1] "2023-01-18 09:42:09 AEDT"
(t2 <- strptime("2021-08-15T12:59:59+1000", "%Y-%m-%dT%H:%M:%S%z"))
## [1] "2021-08-15 12:59:59"
Let us inspect the two objectsâ classes:
class(t1)
## [1] "POSIXct" "POSIXt"
class(t2)
## [1] "POSIXlt" "POSIXt"
When we discuss date-time classes in more detail later, we will take note that the former is represented as a numeric vector, whilst the latter is a list. Hence, primarily, these two should be seen as instances of two distinct types. However, both of them have a lot in common, hence it was a wise design choice to also allow them to be seen as the representatives of the same generic category of POSIX time objects.
Important
When calling a generic function[16] f on an object x
of classes[17]
class1
, class2
, âŚ, classK
(in this order),
UseMethod(f, x)
dispatches to the method
determined as follows:
if
f.class1
is available[18], call it;otherwise, if
f.class2
is available, call this one;âŚ;
otherwise, if
f.classK
is available, invoke it;otherwise, refer to the fallback
f.default
.
There is a method diff for objects of class
POSIXt
featuring a statement:
r <- if (inherits(x, "POSIXlt")) as.POSIXct(x) else x
This way, we can be handling both POSIXct
and POSIXlt
instances
via the same procedure.
Let us see in this simple scheme any magic. It is nothing more than what was described above: a way of determining which method should be called for a particular R object. It can of course be used as a mechanism to mimic (and certainly it was inspired by) the idea of inheritance in object-oriented programming languages, but note that the S3 system does not allow for defining classes in any formal manner.
For example, we cannot say that objects of class POSIXct
inherit
from POSIXt
or each object of class POSIXct
is also an instance of
POSIXt
. The class attribute can still be set arbitrarily on an
per-object basis: we can create ones whose class is
simply POSIXct
(without the POSIXt
part)
or even c("POSIXt", "POSIXct")
(in this very order).
10.2.6. Operator Overloadingď
Operators are ordinary functions (Section 9.4.5). Even though what follows can partially be implied by what we have said above, as usual in R, there will be some oddities.
For example, let us overload the index operator for
objects of class categorical
.
Looking at help("[")
, we see that the default method[19]
has two arguments: x
(the categorical
object being sliced)
and i
(the indexer).
Ours will have the same interface then:
`[.categorical` <- function(x, i)
{
structure(
unclass(x)[i], # `[`(unclass(x), i)
class="categorical",
levels=attr(x, "levels") # the same levels as input
)
}
We can also introduce the replacement version of this operator:
`[<-.categorical` <- function(x, i, value)
{
levels <- attr(x, "levels")
codes <- match(value, levels) # integer codes corresponding to levels
x <- unclass(x)
x[i] <- codes # default method for the replacement version of `[`
structure(
x,
class="categorical",
levels=levels # same levels as input
)
# # or, equivalently:
# structure(
# `[<-`(unclass(x), i, value=match(value, attr(x, "levels"))),
# class="categorical",
# levels=attr(x, "levels")
# )
}
Testing:
x <- as.categorical(c(3, 6, 4, NA, 9, 9, 6, NA, 3))
x[1:4]
## [1] "3" "6" "4" NA
## Categories: 3, 4, 6, 9
x[1:4] <- c("6", "7")
print(x)
## [1] "6" NA "6" NA "9" "9" "6" NA "3"
## Categories: 3, 4, 6, 9
Note how we handled the case of non-existing levels and that the recycling rule has been automagically inherited (amongst other features) from the default index operator.
Do these two operators preserve the names
attribute of x
?
Is indexing with negative integers or logical vectors supported as well?
Why is that/is that not the case?
Furthermore, let us overload the `==` operator.
Assume[20] that we would like two categorical
objects
be compared based on the actual labels they encode, in an elementwise
manner:
`==.categorical` <- function(e1, e2)
as.character(e1) == as.character(e2)
We are feeling lucky: by not performing any type checking,
we rely on the particular as.character methods
corresponding to the types of e1
and e2
.
Also, assuming that as.character always[21]
returns an object of type character,
we dispatch to the default method for `==`
(which handles atomic vectors).
Some examples:
as.categorical(c(1, 3, 5, 1)) == as.categorical(c(1, 3, 1, 1))
## [1] TRUE TRUE FALSE TRUE
as.categorical(c(1, 3, 5, 1)) == c(1, 3, 1, 1)
## [1] TRUE TRUE FALSE TRUE
c(1, 3, 5, 1) == as.categorical(c(1, 3, 1, 1))
## [1] TRUE TRUE FALSE TRUE
Important
In the case of binary operators, dispatching is done based on the classes of both arguments. In all three example calls above, we call `==.categorical`, regardless of whether the classed object is the first or the second operand. If two operands are classed and different methods are overloaded for both of them, a warning will be generated and the default internal method will be called.
`==.A` <- function(e1, e2) "A"
`==.B` <- function(e1, e2) "B"
structure(c(1, 2, 3), class="A") == structure(c(2, NA, 3), class="B")
## Warning: Incompatible methods ("==.A", "==.B") for "=="
## [1] FALSE NA TRUE
Note
In Section 16.4.6, we will mention
that operators as well as certain groups of functions
(including min, sum, and all
or abs, log, and round)
can be overloaded all at once; see also help("groupGeneric")
.
10.3. Common Built-in S3 Classesď
Let us discuss some noteworthy built-in classes, including the ones that represent date/time information and factors (ordered or not).
Classes for representing tabular data will be dealt with in separate parts of this textbook, owing to their importance and ubiquity. Namely, matrices and other arrays are covered in Chapter 11, and data frames in Chapter 12.
The inspecting of other[22] interesting classes is left as a simple exercise to the kind reader.
10.3.1. Date, Time, etc.ď
The Date
class can be used to represent⌠dates.
(x <- c(Sys.Date(), as.Date(c("1969-12-31", "1970-01-01", "2023-02-29"))))
## [1] "2023-01-18" "1969-12-31" "1970-01-01" NA
class(x)
## [1] "Date"
Complex types are built upon basic ones; underneath, what we deal with is:
typeof(x)
## [1] "double"
unclass(x)
## [1] 19375 -1 0 NA
which is the number of days since the so called UNIX epoch, 1970-01-01T00:00:00+0000 (midnight GMT/UTC).
The POSIXct
(calendar time) class
can be used to represent date-time objects:
(x <- Sys.time())
## [1] "2023-01-18 09:42:09 AEDT"
class(x)
## [1] "POSIXct" "POSIXt"
typeof(x)
## [1] "double"
unclass(x)
## [1] 1.674e+09
Underneath, it is the number of seconds since the UNIX epoch.
By default, whilst printing, the current default timezone
is used (see Sys.timezone).
However, such objects can be equipped with the tzone
attribute.
structure(1, class=c("POSIXct", "POSIXt")) # using current default timezone
## [1] "1970-01-01 10:00:01 AEST"
structure(1, class=c("POSIXct", "POSIXt"), tzone="UTC")
## [1] "1970-01-01 00:00:01 UTC"
In both cases, the time is 1 second after the beginning of UNIX epoch. In the former, it is displayed in the current local timezone, though (on the authorâs PC).
Use ISOdatetime to inspect how midnights are displayed in different timezones.
There is also the POSIXlt
(local time) class,
which is represented using a list of atomic vectors[23].
(x <- as.POSIXlt(c(a="1970-01-01 00:00:00", b="2030-12-31 23:59:59")))
## a b
## "1970-01-01 00:00:00 AEST" "2030-12-31 23:59:59 AEDT"
class(x)
## [1] "POSIXlt" "POSIXt"
typeof(x)
## [1] "list"
str(unclass(x)) # calling str instead of print to make display more compact
## List of 11
## $ sec : num [1:2] 0 59
## $ min : int [1:2] 0 59
## $ hour : int [1:2] 0 23
## $ mday : int [1:2] 1 31
## $ mon : int [1:2] 0 11
## $ year : Named int [1:2] 70 130
## ..- attr(*, "names")= chr [1:2] "a" "b"
## $ wday : int [1:2] 4 2
## $ yday : int [1:2] 0 364
## $ isdst : int [1:2] 0 1
## $ zone : chr [1:2] "AEST" "AEDT"
## $ gmtoff: int [1:2] NA NA
Read about the meaning of each named element,
especially mon
and year
; see help("DateTimeClasses")
.
The manual states that POSIXlt
is supposedly
closer to human-readable forms than POSIXct
,
but it is a matter of taste.
Some R functions return the former, and some yield the latter type.
The two main functions for date formatting and parsing, strftime and strptime, use special field formatters (similar to those used by sprintf). Read about them in the R manual. What type of inputs do they accept? What outputs do they produce?
There is a number of methods overloaded for objects of the said classes. In fact, the first call in this section already involved the use of c.Date.
Play around with the overloaded versions of seq, rep, and as.character.
Note that a specific number of days or seconds can be added to or
subtracted from a date or time, respectively.
However, - (see also diff)
can also be applied on two date-time objects,
which yields an object of class difftime
.
Sys.Date() - (Sys.Date() - 1)
## Time difference of 1 days
Sys.time() - (Sys.time() - 1)
## Time difference of 1 secs
Check out how objects of class difftime
are internally represented.
Applying other arithmetic operations on date-time objects yields an error. Also note that because date-time objects are just numbers, they can be compared to each other using binary operators[24] and methods such as sort and order[25].
Check out the stringx package [22] which replaces the base R date-time processing functions with their more portable counterparts.
system.time can be used to measure the time to execute a given expression:
system.time({
sum(runif(1e7)) # whatever, just testing
})
## user system elapsed
## 0.225 0.019 0.245
The function returns an object of class proc_time
.
Inspect how it is represented internally.
10.3.2. Formulas (*)ď
Formulas (created by means of `~`) are quite advanced language constructs and hence they will be discussed much further: in Section 16.5.
Some R users refer to them in functions such as
lm, aggregate, t.test,
boxplot, or plot to specify models or queries
such as ây
as a function of x1
, x2
, and x3
â and
ây
grouped/split by a combination of x1
and x2
â
where y
, x1
, etc. are for example
column names in a data frame or named items in a list.
There is no single standard governing how a function should interpret a formulaâs terms. In fact, each procedure is free to introduce its own meaning (a micro-language built on top of R). Due to this, yours truly discourages[26] their use (especially by beginners).
10.3.3. Factorsď
The factor
class is often used to represent categorical (qualitative) data,
e.g., species, groups, types.
In fact, the example categorical
class that we played with above has been
inspired by the built-in factor
.
(x <- c("spam", "spam", "bacon", "sausage", "spam", "bacon"))
## [1] "spam" "spam" "bacon" "sausage" "spam" "bacon"
(f <- factor(x))
## [1] spam spam bacon sausage spam bacon
## Levels: bacon sausage spam
Take note of how factors are printed: there are no double quote characters around the labels and the list of levels is given at the end.
Internally, such objects are represented as integer vectors
(Section 6.4.1) with elements between 1 and k with the special
(as in Section 4.4.3) levels
attribute being a character vector of length k[27].
class(f)
## [1] "factor"
typeof(f)
## [1] "integer"
unclass(f)
## [1] 3 3 1 2 3 1
## attr(,"levels")
## [1] "bacon" "sausage" "spam"
attr(f, "levels") # also: levels(f)
## [1] "bacon" "sausage" "spam"
Factors are often used instead of character vectors defined over a small number of unique labels[28], where there is a need to manipulate their levels easily.
attr(f, "levels") <- c("a", "b", "c") # also levels(f) <- c(....new...)
print(f)
## [1] c c a b c a
## Levels: a b c
The underlying codes remain the same.
Certain operations on vectors of small integers are relatively easy to
implement, especially those concerning element grouping:
splitting, counting, plotting (e.g., Figure 13.1).
It is because the integer codes can naturally be used whilst indexing
other vectors. In Section 5.4, we mentioned a few functions
related to this, such as
match,
split,
findInterval, and
tabulate.
Specifically, the latter can be implemented like
âfor each i
, increase count[factor_codes[i]]
by oneâ.
Study the source code of the factor function. Note the use of as.character, unique, order, and match.
Implement a simplified version of table
based on tabulate. It should work
for objects of class factor
and return a named numeric vector.
Implement your own version of cut based on findInterval.
Important
The as.numeric method has not been overloaded for factors. Therefore, when we call the generic, the default method is used: it returns the underlying integer codes as-is. This can surprise the unaware users when they play with factors that feature levels consisting of strings representing integer numbers:
(g <- factor(c(11, 15, 16, 11, 13, 4, 15))) # converts numbers to strings
## [1] 11 15 16 11 13 4 15
## Levels: 4 11 13 15 16
as.numeric(g) # the underlying codes
## [1] 2 4 5 2 3 1 4
as.numeric(as.character(g)) # to get the numbers en-coded
## [1] 11 15 16 11 13 4 15
Unfortunately, support for factors is often hardcoded at the C language level, which will make this class behave less predictably (from the R perspective). In particular, the manual overloading of methods for factor objects might have no effect.
Important
If f
is a factor, then x[f]
does not behave like
x[
as.character(f)]
(indexing by labels, using the names
attribute). Instead,
we get x[
as.numeric(f)]
(the underlying codes will determine the positions).
h <- factor(c("a", "b", "a", "c", "a", "c"))
levels(h)[h] # the same as c("a", "b", "c")[c(1, 2, 1, 3, 1, 3)]
## [1] "a" "b" "a" "c" "a" "c"
c(b="x", c="y", a="z")[h] # names are not used whilst indexing
## b c b a b a
## "x" "y" "x" "z" "x" "z"
c(b="x", c="y", a="z")[as.character(h)] # names are used now
## a b a c a c
## "z" "x" "z" "y" "z" "y"
More often than not, indexing by factors will happen âaccidentallyâ, leading to our being slightly puzzled. In particular, factors look much like character vectors when they are featured in data frames:
(df <- data.frame(A=c("x", "y", "z"), B=factor(c("x", "y", "z"))))
## A B
## 1 x x
## 2 y y
## 3 z z
class(df[["A"]])
## [1] "character"
class(df[["B"]])
## [1] "factor"
(*) Up until R 4.0, many functions (including data.frame
and read.csv) had the stringsAsFactors
option
(see help("options")
) set to TRUE
, which resulted in all character
vectorsâ being automatically converted to factors when, e.g.,
creating data frames (compare Chapter 12).
Luckily, this is no longer the case, but they can still be
encountered sporadically: for instance,
the built-in iris
dataset has the fifth column of class:
class(iris[["Species"]])
## [1] "factor"
Important
Be careful when combining factors and not-factors:
x <- factor(c("A", "B", "A"))
c(x, "C")
## [1] "1" "2" "1" "C"
c(x, factor("C"))
## [1] A B A C
## Levels: A B C
Note that when subsetting a factor object,
the result will have the levels
attribute inherited as-is.
f[c(1, 2)]
## [1] c c
## Levels: a b c
Implement your own version of the droplevels function which removes the unused attributes.
The replacement version of the index operator does not automatically add new levels to the modified object:
x <- factor(c("A", "B", "A"))
`[<-`(x, 4, value="C") # like in x[4] <- "C"
## Warning in `[<-.factor`(x, 4, value = "C"): invalid factor level, NA
## generated
## [1] A B A <NA>
## Levels: A B
Implement your own version of `[<-.factor]` which is capable of doing so.
10.3.4. Ordered Factorsď
Note that when creating factors, we can enforce a particular ordering and the number of levels:
x <- c("spam", "spam", "bacon", "sausage", "spam", "bacon")
factor(x, levels=c("eggs", "bacon", "sausage", "spam"))
## [1] spam spam bacon sausage spam bacon
## Levels: eggs bacon sausage spam
If we want the arrangement of the levels to define a linear ordering relation over set of the labels, we can call:
(f <- factor(x, levels=c("eggs", "bacon", "sausage", "spam"), ordered=TRUE))
## [1] spam spam bacon sausage spam bacon
## Levels: eggs < bacon < sausage < spam
class(f)
## [1] "ordered" "factor"
This yields an ordered factor, which enables comparisons like:
f[f >= "bacon"] # what's not worse than bacon?
## [1] spam spam bacon sausage spam bacon
## Levels: eggs < bacon < sausage < spam
How is that possible? Well, based on information provided in this chapter
it will come as no surprise that it is because⌠someone has
implemented a comparison operator for objects of class ordered
.
10.4. Argument Checking Revisitedď
Recall that anything can be passed as a functionâs input. Here are some additions to the topic we touched upon in Section 9.2.1.
Despite that compound objects are internally represented through basic types (such as numeric vectors, lists, or combinations thereof) and attributes, unless we really know better (which, by the way, this book is all about), we should be relying on the hopefully well-thought-out methods developed by the classâ designer.
Ideally, when checking arguments passed to a function, determining if an object is of a desired type should be solely done by means of the generics like is.class. If that is not the case, a call to as.class should be used to make sure we will be dealing with an object of the desired type.
If a conversion is not possible, either because a specific method is unavailable or because its designer decided that this must be the case, whatever the consequences are is not necessarily our problem anymore.
We should explain to the user that the input type assurance is done via this very mechanism and, in case they get any surprising results, they should check/redefine the underlying is.class or as.class themselves.
This is of course not watertight, and there will be users complaining that they get unexpected or confusing (in their opinion) outputs. With infinitely many potential types, however, we cannot respond to every possible situation.
As an illustration, here is a function that counts the number of occurrences of items in a numerised (digitised?) version of a given object:
numtable <- function(x)
{
if (!is.numeric(x)) x <- as.numeric(x) # two generics!
u <- unique(x)
structure(
tabulate(match(x, u)),
names=as.character(u)
)
}
Let us assume that the user has been informed (in the corresponding
documentation page) that x
must be a numeric vector
(as in is.numeric
) or an object coercible to
(by means of as.numeric
).
The callers will be stress-testing our function in many different ways:
numtable(c(1, 3, 5, 5, 1, 5))
## 1 3 5
## 2 1 3
This is an intended behaviour.
numtable(c("1", "3", "5", "5", "1", "5"))
## 1 3 5
## 2 1 3
This makes sense too, a character vector consisting of number-strings has been fed on input.
numtable(c("a", "e", "z", "z", "a", "z"))
## Warning in numtable(c("a", "e", "z", "z", "a", "z")): NAs introduced by
## coercion
## <NA>
## 6
Does the output make no sense? Of course, it does, they have just passed something not easily coercible to a numeric vector. Note the warning that suggests there is something wrong. The user needs to correct their possible mistake by themself.
numtable(list(1, 2, 3:10, 2))
## Error in numtable(list(1, 2, 3:10, 2)): 'list' object cannot be coerced to type 'double'
Again, makes sense. âBut I think that this function should
apply unlist automaticallyâ â well, if
you want such a behaviour, why donât you
call numtable(
unlist(...))
yourself? It is not so difficult.
numtable(factor(c(1, 3, 5, 5, 1, 5)))
## 1 2 3
## 2 1 3
Is this confusing? No; this is a well-documented behaviour
of as.numeric on objects of type factor
(which was designed by another developer).
A user should know (but we can remind them about it in the documentation)
that in this case, as.character should rather be called first.
Of course, sometimes users might discover bugs or unexpected behaviours, especially related to boundary cases we have not been considerate enough to inspect. We are of course the ones to blame for the following:
numtable(numeric(0)) # bug: this should be corrected
## <NA>
## 0
10.5. (Over)using the Forward-pipe Operator, `|>` (*)ď
The object-oriented programming paradigm is useful when we wish to define a new data type, perhaps even a hierarchy of types. Many development teams find it an efficient tool to organise larger pieces of software. Yet, in the broad data science and numerical computing domains, we are more often consumers of OOP rather than class designers.
Thanks to the discussed method dispatch mechanism, our language is easily extensible and something that mimics a new data type can easily be introduced. Most importantly, methods can be added or removed during run-time, e.g., when importing external packages.
However, R is still a functional programming language, where functions not only are first-class citizens; they are privileged. Of course, there are some inherent limitations stemming from the ingenious simplicity of S3: method dispatch is usually based only on the type of the first function argument, classes cannot be defined formally (but see Section 11.5) and that there is no real encapsulation (we cannot actually hide data from a user[29]). However, overall the whole concept has proven quite versatile.
In functional programming, emphasis is on operations (verbs),
not data (nouns).
This leads to a very readable syntax, for example
(assuming that square, x
, and y
are sensibly defined),
the mean squared error can be written as:
mean(square(x-y))
This is very user-centric. However, when implementing more complex data processing pipelines, a programmer thinks âfirst, I need to do this, then I need to do that, and afterwardsâŚâ. When they write it down, there can be some pressing of HOME and END keys on the keyboard involved. This should not be a problem for most programmers.
finally(thereafter(then(first(x))))
However, some people are inherently lazy, always complaining and/or always trying to âoptimiseâ[30] things.
Base R is of course extremely flexible and we can introduce new vocabulary as we please. In Chapter 12, we study an example, where we define:
group_by (a function that splits a data frame with respect to a combination of levels in given named columns and returns a list of data frames with class
list_dfs
),aggregate.list_dfs (which applies a given aggregation function on each column of each data frame in a given list), and
mean.list_dfs (a specialised version of the former that calls mean).
The specifics do not really matter now, let us just consider the notation we use when the operations are chained:
# select a few rows and columns from the `iris` data frame:
iris_subset <- iris[51:150, c("Sepal.Width", "Petal.Length", "Species")]
# compute the averages of all variables grouped by Species:
mean(group_by(iris_subset, "Species"))
## Species x Mean
## 1 versicolor Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3 virginica Sepal.Width 2.974
## 4 virginica Petal.Length 5.552
This is quite readable: we compute the mean
in groups defined by Species
in a subset of the iris
data frame.
All verbs appear on the lefthand side of the expression,
with the last (the most important?) operation
being listed first.
By the way, self-explanatory variable names and rich comments are priceless.
In more traditional object-oriented programming languages, either the method list is sealed inside[31] the classâ definition (like in C++), or some peculiar patches must be applied to inject a method (like in Python)[32]. There, it is the objects that are told what to do: they are treated as black boxes.
Some popular languages rely on the message-passing syntax,
where operations are propagated (and written) left-to-right
instead of inside-out.
For instance, in C++ and Python (amongst many others),
âobj.
method1().
method2()
â
means âcall method1 on obj
and
then call method2 on the result.
Since R 4.1.0, there is a pipe operator[33], `|>`, which is merely a syntactic sugar for translating between the message-passing and function-centric notion. In a nutshell, writing:
x |> f() |> g(y) |> h()
(x-y) |> square() |> mean()
is equivalent to:
h(g(f(x), y))
mean(square(x-y))
This syntax is developer-centric: it emphasises on the order in which the operations are executed, something that could always be achieved with the function-centric form and perhaps a few auxiliary variables.
In the above example, a pipe operator version
of the iris
aggregation exercise would look like:
iris_subset |> group_by("Species") |> mean()
This book is minimalist by design and there is nothing that cannot be achieved without the pipe operator, hence we will be refraining[34] ourselves from using it.
Note
When writing code interactively, we may sometimes benefit from
the use of the the rightward `->` operator.
Suffice to say that âname <- value
â and âvalue -> name
â are synonymous.
This way we can write some lengthy code, store the result in an intermediate variable, and then continue on in the next line (possibly referring to that auxiliary value more than once). In the long run, multiplying entities without necessity is unsustainable.
For instance:
iris[, c("Sepal.Width", "Petal.Length", "Species")] -> .
.[ .[, "Species"] %in% c("versicolor", "virginica"), ] -> .
mean(group_by(., "Species"))
## Species x Mean
## 1 versicolor Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3 virginica Sepal.Width 2.974
## 4 virginica Petal.Length 5.552
â.
â is as good a variable name as any other one.
10.6. Exercisesď
Answer the following questions:
How to display the source code of the default methods for head and tail?
Can there be, at the same time, one object of class c
("A", "B")
and another one of class c("B", "A")
?If
f
is a factor, what are the relationships between as.character(f)
, as.numeric(f)
, as.character(
as.numeric(f))
, and as.numeric(
as.character(f))
?If
x
is a named vector andf
is a factor, isx[f]
equivalent tox[
as.character(f)]
or ratherx[
as.numeric(f)]
?
A user calls:
plot(x, y, col="red", ylim=c(1, max(x)), log="y")
where x
and y
are numeric vectors. Consult help("plot")
for the meaning of the ylim
and log
arguments.
Was that straightforward?
Explain why the two following calls yield significantly different results and present a workaround:
c(Sys.Date(), "1970-01-01")
## [1] "2023-01-18" "1970-01-01"
c("1970-01-01", Sys.Date())
## [1] "1970-01-01" "19375"
Write methods head and tail
for our example categorical
class.
(*) Write an R package that defines S3 class categorical
and a couple of methods therefor.
Note the need for the use of the S3method
directive NAMESPACE
;
see [47].
Inspect the result of a call to
binom.test(79, 100)
.
Find the method responsible for the pretty-printing of such objects.
Inspect the result of a call to
rle(
c(1, 1, 1, 4, 3, 3, 3, 3, 3,, 2, 2))
.
Find the method responsible for the pretty-printing of such objects.
Read more about the connection
class; see the Value section
in help("connections")
.
Read about the subsetting
operators overloaded for the package_version
class;
see help("numeric_version")
.
There are xtfrm methods
overloaded for classes such as numeric_version
,
difftime
, Date
, and factor
. Find out how they work and
where they might be useful (especially in relation to
order and sort; see also Section 12.3.1).
Give an example where
split(x,
list(y1, y2))
(with default arguments) will fail to generate the correct result.
Write a function that determines the mode, i.e., the most frequently
occurring value in a given object of class factor
. If the mode is not unique,
return a randomly chosen one (each with the same probability).
Implement your own version of the gl function.
Check out which built-in date-time functions the stringx package replaces with more portable ones.