4. Lists and attributes

This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Minimalist Data Wrangling with Python [27], too.

After two brain-teasing chapters, it is time to cool it down a little. In this more technical part, we will introduce lists, which serve as universal containers for R objects of any size and type. Moreover, we will also show that each R object can be equipped with a number of optional attributes. Thanks to them, we will be able to label elements in any vector, and, in Chapter 10, introduce new complex data types such as matrices and data frames.

4.1. Type hierarchy and conversion

So far, we have been playing with three types of atomic vectors:

  1. logical (Chapter 3),

  2. numeric (Chapter 2),

  3. character (which we have barely touched upon yet, but rest assured that they will be covered in detail very soon; see Chapter 6).

To determine the type of an object programmatically, we can call the typeof function.

typeof(c(1, 2, 3))
## [1] "double"
typeof(c(TRUE, FALSE, TRUE, NA))
## [1] "logical"
typeof(c("spam", "spam", "bacon", "eggs", "spam"))
## [1] "character"

We can easily convert between these types, either on our explicit demand (type casting) or on-the-fly (coercion, when we perform an operation that expects something different from the kind of input it was fed with).

Note

(*) Numeric vectors are reported as being either of the type double (double-precision floating-point numbers) or integer (32-bit; it is a subset of double); see Section 6.4.1. In most practical cases, this is a technical detail that we can risklessly ignore; compare also the mode function.
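The distinction between the two numeric subtypes can be inspected directly. Below is a minimal sketch; the variable names x_dbl and x_int are ours, chosen for illustration only.

```r
x_dbl <- c(1, 2, 3)     # numeric constants are doubles by default
x_int <- c(1L, 2L, 3L)  # the L suffix requests integers
typeof(x_dbl)
## [1] "double"
typeof(x_int)
## [1] "integer"
typeof(1:3)  # the colon operator yields integers here
## [1] "integer"
mode(x_dbl)  # `mode` does not distinguish between the two
## [1] "numeric"
mode(x_int)
## [1] "numeric"
```

In both cases, is.numeric returns TRUE, which is why the difference rarely matters in day-to-day computations.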

4.1.1. Explicit type casting

We can use functions such as as.logical, as.numeric, and as.character to convert given objects to the corresponding types.

as.numeric(c(TRUE, FALSE, NA, TRUE, NA, FALSE))  # synonym: as.double
## [1]  1  0 NA  1 NA  0
as.logical(c(-2, -1, 0, 1, 2, 3, NA_real_, -Inf, NaN))
## [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE    NA  TRUE    NA

Important

The rules are:

  • TRUE \(\rightarrow\) 1,

  • FALSE \(\rightarrow\) 0,

  • NA \(\rightarrow\) NA_real_,

and:

  • 0 \(\rightarrow\) FALSE,

  • NA_real_ and NaN \(\rightarrow\) NA,

  • anything else \(\rightarrow\) TRUE.

The distinction between zero and non-zero is commonly applied in other programming languages as well.

Moreover, in the case of the conversion involving character strings, we have:

as.character(c(TRUE, FALSE, NA, TRUE, NA, FALSE))
## [1] "TRUE"  "FALSE" NA      "TRUE"  NA      "FALSE"
as.character(c(-2, -1, 0, 1, 2, 3, NA_real_, -Inf, NaN))
## [1] "-2"   "-1"   "0"    "1"    "2"    "3"    NA     "-Inf" "NaN"
as.logical(c("TRUE", "True", "true", "T",
             "FALSE", "False", "false", "F",
             "anything other than these", NA_character_))
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE    NA    NA
as.numeric(c("0", "-1.23e4", "pi", "2+2", "NaN", "-Inf", NA_character_))
## Warning: NAs introduced by coercion
## [1]      0 -12300     NA     NA    NaN   -Inf     NA

4.1.2. Implicit conversion (coercion)

Recall that we referred to the three vector types as atomic ones. They can only be used to store elements of the same type. If we make an attempt at composing an object of mixed types with c, the common type will be determined in such a way that data are stored without information loss:

c(-1, FALSE, TRUE, 2, "three", NA)
## [1] "-1"    "FALSE" "TRUE"  "2"     "three" NA
c("zero", TRUE, NA)
## [1] "zero" "TRUE" NA
c(-1, FALSE, TRUE, 2, NA)
## [1] -1  0  1  2 NA

Hence, we see that logical is the most specialised of the three, whereas character is the most general.

Note

The logical NA is converted to NA_real_ and NA_character_ in the preceding examples. R users tend to rely on implicit type conversion when they write c(1, 2, NA, 4) rather than c(1, 2, NA_real_, 4). In most cases, this is fine, but it might make us less vigilant.

However, occasionally, it will be wiser to be more unequivocal. For instance, rep(NA_real_, 1e9) preallocates a long numeric vector instead of a logical one.
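The effect of choosing a typed missing value can be checked with typeof. This is only a sketch: n is deliberately tiny here, and the same reasoning applies to much longer vectors.

```r
n <- 5  # a small length, for illustration only
typeof(rep(NA, n))  # the plain NA is logical
## [1] "logical"
typeof(rep(NA_real_, n))
## [1] "double"
typeof(rep(NA_character_, n))
## [1] "character"
typeof(c(1, 2, NA, 4))  # here, NA was coerced to NA_real_ on the fly
## [1] "double"
```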

Some functions that expect vectors of specific types can apply coercion by themselves (or act as if they do so):

c(NA, FALSE, TRUE) + 10  # implicit conversion logical -> numeric
## [1] NA 10 11
c(-1, 0, 1) & TRUE  # implicit conversion numeric -> logical
## [1]  TRUE FALSE  TRUE
sum(c(TRUE, TRUE, FALSE, TRUE, FALSE))  # same as sum(as.numeric(...))
## [1] 3
cumsum(c(TRUE, TRUE, FALSE, TRUE, FALSE))
## [1] 1 2 2 3 3
cummin(c(TRUE, TRUE, FALSE, TRUE, FALSE))
## [1] 1 1 0 0 0

Exercise 4.1

In Exercise 3.6, we computed the cross-entropy loss between a logical vector \(\boldsymbol{y}\in\{0, 1\}^n\) and a numeric vector \(\boldsymbol{p}\in(0, 1)^n\). This measure can be equivalently defined as:

\[ \mathcal{L}(\boldsymbol{p}, \boldsymbol{y}) = -\frac{1}{n}\left(\sum_{i=1}^n y_i\log(p_i)+ (1-y_i)\log(1-p_i) \right). \]

Implement this formula using vectorised operations, but not relying on ifelse this time. Then, compute the cross-entropy loss between, for instance, “y <- sample(c(FALSE, TRUE), n, replace=TRUE)” and “p <- runif(n)” for some n. Note how seamlessly we translate between FALSE/TRUEs and 0/1s in the above equation (in particular, where \(1-y_i\) means the logical negation of \(y_i\)).

4.2. Lists

Lists are generalised vectors. They can comprise R objects of any kind, including other lists. This is why we classify them as recursive (and not atomic) objects. They are especially useful wherever there is a need to handle some multitude as a single entity.
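The atomic/recursive classification can be queried programmatically with the base functions is.atomic and is.recursive. A small sketch follows; the objects v and l are made up for the purpose of this illustration.

```r
v <- c(1, 2, 3)               # an atomic vector
l <- list(1, "two", list(3))  # a list featuring another list
is.atomic(v)
## [1] TRUE
is.recursive(v)
## [1] FALSE
is.atomic(l)
## [1] FALSE
is.recursive(l)
## [1] TRUE
typeof(l)
## [1] "list"
length(l)  # only the top-level elements are counted
## [1] 3
```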

4.2.1. Creating lists

The most straightforward way to create a list is by means of the list function:

list(1, 2, 3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3

Notice that it is not the same as c(1, 2, 3). We got a sequence that wraps three numeric vectors, each of length one. More examples:

list(1:3, 4, c(TRUE, FALSE, NA, TRUE), "and so forth")  # different types
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 4
##
## [[3]]
## [1]  TRUE FALSE    NA  TRUE
##
## [[4]]
## [1] "and so forth"
list(list(c(TRUE, FALSE, NA, TRUE), letters), list(1:3))  # a list of lists
## [[1]]
## [[1]][[1]]
## [1]  TRUE FALSE    NA  TRUE
##
## [[1]][[2]]
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
##
##
## [[2]]
## [[2]][[1]]
## [1] 1 2 3

The display of lists is (un)pretty bloated. However, the str function prints any R object in a more concise fashion:

str(list(list(c(TRUE, FALSE, NA, TRUE), letters), list(1:3)))
## List of 2
##  $ :List of 2
##   ..$ : logi [1:4] TRUE FALSE NA TRUE
##   ..$ : chr [1:26] "a" "b" "c" "d" ...
##  $ :List of 1
##   ..$ : int [1:3] 1 2 3

Note

In Section 4.1, we said that the c function, when fed with arguments of mixed types, tries to determine the common type that retains the sense of data. If coercion to an atomic vector is not possible, the result will be a list.

c(1, "two", identity)  # `identity` is an object of the type "function"
## [[1]]
## [1] 1
##
## [[2]]
## [1] "two"
##
## [[3]]
## function (x)
## x
## <environment: namespace:base>

Thus, the c function can also be used to concatenate lists:

c(list(1), list(2), list(3))  # three lists -> one list
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3

Lists can be repeated using rep:

rep(list(1:11, LETTERS), 2)
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11
##
## [[2]]
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
##
## [[3]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11
##
## [[4]]
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

4.2.2. Converting to and from lists

The conversion of an atomic vector to a list of vectors of length one can be done via a call to as.list:

as.list(c(1, 2, 3))  # vector of length 3 -> list of 3 vectors of length 1
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3

Unfortunately, calling, say, as.numeric on a list can raise an error, even if it is comprised of numeric vectors only. We can try flattening it to an atomic sequence by calling unlist:

unlist(list(list(1, 2), list(3, list(4:8)), 9))
## [1] 1 2 3 4 5 6 7 8 9
unlist(list(list(1, 2), list(3, list(4:8)), "spam"))
## [1] "1"    "2"    "3"    "4"    "5"    "6"    "7"    "8"    "spam"
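To see the contrast between a failed cast and a successful flattening side by side, we can intercept the error with tryCatch. This is a sketch only: the list x and the "error" marker string are our own choices, and the exact wording of the error message may differ between R versions, hence it is not reproduced here.

```r
x <- list(1:2, 3, list(4, 5))  # a nested list of numeric vectors
res <- tryCatch(as.numeric(x), error=function(e) "error")
res  # the cast failed, and our handler returned the marker string
## [1] "error"
unlist(x)  # flattening works
## [1] 1 2 3 4 5
typeof(unlist(x))  # integers and doubles were coerced to a common type
## [1] "double"
```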

Note

(*) Chapter 11 will mention the simplify2array function, which generalises unlist in a way that can sometimes give rise to a matrix.

4.3. NULL

NULL, being the one and only instance of the eponymous type, can be used as a placeholder for an R object or designate the absence of any entities whatsoever.

list(NULL, NULL, month.name)
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
##  [1] "January"   "February"  "March"     "April"     "May"
##  [6] "June"      "July"      "August"    "September" "October"
## [11] "November"  "December"

NULL is different from a vector of length zero because the latter has a type. However, NULL sometimes behaves like a zero-length vector. In particular, length(NULL) returns 0. Also, c called with no arguments returns NULL.

Testing for NULL-ness can be done with a call to is.null.
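The statements above can be verified with a few one-liners. The following sketch relies on base R functions only.

```r
is.null(NULL)
## [1] TRUE
is.null(numeric(0))  # a zero-length vector is not NULL
## [1] FALSE
typeof(NULL)
## [1] "NULL"
typeof(numeric(0))  # a zero-length vector still has a type
## [1] "double"
length(NULL)
## [1] 0
is.null(c())  # `c` called with no arguments
## [1] TRUE
is.null(list())  # an empty list is not NULL either
## [1] FALSE
```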

Important

NULL is not the same as NA. The former cannot be emplaced in an atomic vector.

c(1, NA, 3, NULL, 5)  # here, NULL behaves like a zero-length vector
## [1]  1 NA  3  5

They have very distinct semantics (no value vs a missing value).

Later we will see that some functions return NULL invisibly when they have nothing interesting to report. This is the case of cat or plot, which are called because of their side effects (printing and plotting).

Furthermore, in certain contexts, replacing content with NULL will actually result in its removal, e.g., when subsetting a list.
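As a quick preview (list indexing is only covered in the next chapter, so the syntax below should be treated as a hedged sketch), here is how assigning NULL to a list element removes it, and how NULL can be stored as an element instead. The lists x and y are illustrative.

```r
x <- list(1, "two", 3)
x[[2]] <- NULL  # removes the second element
length(x)
## [1] 2
y <- list(1, "two", 3)
y[2] <- list(NULL)  # stores NULL as the second element instead
length(y)
## [1] 3
is.null(y[[2]])
## [1] TRUE
```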

4.4. Object attributes

Lists can embrace many entities in the form of a single item sequence. Attributes, on the other hand, give means to inject extra data into an object. They are unordered key=value pairs, where key is a single string, and value is any R object except NULL. We can introduce them by calling, amongst others, the structure function:

x_simple <- 1:10
x <- structure(
    x_simple,  # the object to be equipped with attributes
    attribute1="value1",
    attribute2=c(6, 100, 324)
)

4.4.1. Developing perceptual indifference to most attributes

Let’s see how the foregoing x is reported on the console:

print(x)
##  [1]  1  2  3  4  5  6  7  8  9 10
## attr(,"attribute1")
## [1] "value1"
## attr(,"attribute2")
## [1]   6 100 324

The object of concern, 1:10, was displayed first. We need to get used to that. Most of the time, we suggest treating the “attr…” parts of the display as if they were printed in tiny font.

Equipping an object with attributes does not usually change its nature; see, however, Chapter 10 for a few exceptions. The above x is still treated as an ordinary sequence of numbers by most functions:

sum(x)   # the same as sum(1:10); `sum` does not care about any attributes
## [1] 55
typeof(x)  # just a numeric vector, but with some perks
## [1] "integer"

Important

Attributes are generally ignored by most functions unless they have specifically been programmed to pay attention to them.

4.4.2. But there are a few use cases

Some R functions add attributes to the return value to sneak extra information that might be useful, just in case. For instance, na.omit, whose main aim is to remove missing values from an atomic vector, yields:

y <- c(10, 20, NA, 40, 50, NA, 70)
(y_na_free <- na.omit(y))
## [1] 10 20 40 50 70
## attr(,"na.action")
## [1] 3 6
## attr(,"class")
## [1] "omit"

We can enjoy the NA-free version of y in further computations:

mean(y_na_free)
## [1] 38

Additionally, the na.action attribute indicates the former whereabouts of the missing observations:

attr(y_na_free, "na.action")  # read the attribute value
## [1] 3 6
## attr(,"class")
## [1] "omit"

We ignore the class part until Chapter 10.

As another example, gregexpr discussed in Chapter 6 searches for a given pattern in a character vector:

needle <- "spam|durian"  # pattern to search for: spam OR durian
haystack <- c("spam, bacon, and durian-flavoured spam", "spammer")  # text
(pos <- gregexpr(needle, haystack, perl=TRUE))
## [[1]]
## [1]  1 18 35
## attr(,"match.length")
## [1] 4 6 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

We sought all occurrences of the pattern within two character strings. As their number may vary from string to string, wrapping the results in a list was a good design choice. Each list element gives the starting positions where matches can be found: there are three and one match(es), respectively. Moreover, every vector of positions has a designated match.length attribute (amongst others), in case we need it.
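For instance, the starting positions and the match.length attribute can be combined to extract the matched substrings from the first string. The snippet below is a minimal sketch: it repeats the definitions from above so that it is self-contained, the names starts and lens are ours, and the “[[” operator used to fetch a single list element is only discussed in the next chapter.

```r
needle <- "spam|durian"
haystack <- c("spam, bacon, and durian-flavoured spam", "spammer")
pos <- gregexpr(needle, haystack, perl=TRUE)
starts <- as.integer(pos[[1]])  # positions only; attributes are dropped
lens <- attr(pos[[1]], "match.length")
substring(haystack[1], starts, starts + lens - 1)
## [1] "spam"   "durian" "spam"
```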

Exercise 4.2

Create a list with EUR/AUD, EUR/GBP, and EUR/USD exchange rates read from the euraud-*.csv, eurgbp-*.csv, and eurusd-*.csv files in our data repository. Each of its three elements should be a numeric vector storing the currency exchange rates. Furthermore, equip them with currency_from, currency_to, date_from, and date_to attributes. For example:

##  [1]     NA 1.6006 1.6031     NA     NA 1.6119 1.6251 1.6195 1.6193 1.6132
## [11]     NA     NA 1.6117 1.6110 1.6188 1.6115 1.6122     NA
## attr(,"currency_from")
## [1] "EUR"
## attr(,"currency_to")
## [1] "AUD"
## attr(,"date_from")
## [1] "2020-01-01"
## attr(,"date_to")
## [1] "2020-06-30"

Such an additional piece of information could be stored in a few separate variables (other vectors), but then it would not be as convenient to use as the above representation.

4.4.3. Special attributes

Attributes have great potential which is somewhat wasted, for R users rarely know:

  • that attributes exist (pessimistic scenario), or

  • how to handle them (realistic scenario).

But we know now.

What is more, certain attributes have been predestined to play a unique role in R. Namely, the most prevalent amongst the special attributes are:

  • names, row.names, and dimnames are used to label the elements of atomic and generic vectors (see below) as well as rows and columns in matrices (Chapter 11) and data frames (Chapter 12),

  • dim turns flat vectors into matrices and other tensors (Chapter 11),

  • levels labels the underlying integer codes in factor objects (Section 10.3.2),

  • class can be used to bring forth new complex data structures based on basic types (Chapter 10).

We call them special because:

  • they cannot be assigned arbitrary values; for instance, we will soon see that names can accept a character vector of a specific length,

  • they can be accessed via designated functions, e.g., names, class, dim, dimnames, levels, etc.,

  • they are widely recognised by many other functions.

However, in spite of the above, special attributes can still be managed as ordinary ones.
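The following sketch illustrates both points using the names attribute: it can be read like any ordinary attribute, yet it does not accept arbitrary values. The error is intercepted with tryCatch, and the "error" marker string is our own choice.

```r
x <- structure(c(13, 2, 6), names=c("spam", "sausage", "celery"))
identical(attr(x, "names"), names(x))  # two routes to the same attribute
## [1] TRUE
res <- tryCatch(
    structure(c(13, 2, 6), names=c("a", "b", "c", "d")),  # too many labels
    error=function(e) "error"
)
res  # the assignment was rejected
## [1] "error"
```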

Exercise 4.3

comment is perhaps the most rarely used special attribute. Create an object (whatever) equipped with the comment attribute. Verify that assigning to it anything other than a character vector leads to an error. Read its value by calling the comment function. Display the object equipped with this attribute. Note that the print function ignores its existence whatsoever: this is how special it is.

Important

(*) The accessor functions such as names or class might return meaningful values, even if the corresponding attribute is not set explicitly; see, e.g., Section 11.1.5 for an example.

4.4.4. Labelling vector elements with the names attribute

The special attribute names labels atomic vectors’ and lists’ elements.

(x <- structure(c(13, 2, 6), names=c("spam", "sausage", "celery")))
##    spam sausage  celery
##      13       2       6

The labels may improve the expressivity and readability of our code and data.

Exercise 4.4

Verify that the above x is still an ordinary numeric vector by calling typeof and sum on it.

Let’s stress that we can ignore the names attribute altogether. If we apply any operation discussed in Chapter 2, we will garner the same result regardless of whether such extra information is present or not.

It is just the print function that changed its behaviour slightly. After all, it is a special attribute. Instead of reporting:

## [1] 13  2  6
## attr(,"names")
## [1] "spam"    "sausage" "celery"

we got a nicely formatted table-like display. Non-special attributes are still printed in the standard way:

structure(x, additional_attribute=1:10)
##    spam sausage  celery
##      13       2       6
## attr(,"additional_attribute")
##  [1]  1  2  3  4  5  6  7  8  9 10

Note

Chapter 5 will also mention that some operations (such as indexing) gain superpowers in the presence of the names attribute.

This attribute can be read by calling:

attr(x, "names")  # just like any other attribute
## [1] "spam"    "sausage" "celery"
names(x)  # because it is so special
## [1] "spam"    "sausage" "celery"

Named vectors can be easily created with the c and list functions as well:

c(a=1, b=2)
## a b
## 1 2
list(a=1, b=2)
## $a
## [1] 1
##
## $b
## [1] 2
c(a=c(x=1, y=2), b=3, c=c(z=4))  # this is smart
## a.x a.y   b c.z
##   1   2   3   4

Let’s contemplate how a named list is printed on the console. Again, it is still a list, but with some extras.

Exercise 4.5

A whole lot of functions return named vectors. Evaluate the following expressions and read the corresponding pages in their documentation:

  • quantile(runif(100)),

  • hist(runif(100), plot=FALSE),

  • options() (take note of digits, scipen, max.print, and width),

  • capabilities().

Note

(*) Most of the time, lists are used merely as containers for other R objects. This is a dull yet essential role. However, let’s just mention here that every data frame is, in fact, a generic vector (see Chapter 12). Each column corresponds to a named list element:

(df <- head(iris))  # some data frame
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
typeof(df)  # it is just a list (with extras that will be discussed later)
## [1] "list"
unclass(df)  # how it is represented exactly (without the extras)
## $Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4
##
## $Sepal.Width
## [1] 3.5 3.0 3.2 3.1 3.6 3.9
##
## $Petal.Length
## [1] 1.4 1.4 1.3 1.5 1.4 1.7
##
## $Petal.Width
## [1] 0.2 0.2 0.2 0.2 0.2 0.4
##
## $Species
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
##
## attr(,"row.names")
## [1] 1 2 3 4 5 6

Therefore, the functions we discuss in this chapter are of use in processing such structured data, too.

4.4.5. Altering and removing attributes

We know that a single attribute can be read by calling attr. Their whole list is generated with a call to attributes.

(x <- structure(c("some", "object"), names=c("X", "Y"),
    attribute1="value1", attribute2="value2", attribute3="value3"))
##        X        Y
##   "some" "object"
## attr(,"attribute1")
## [1] "value1"
## attr(,"attribute2")
## [1] "value2"
## attr(,"attribute3")
## [1] "value3"
attr(x, "attribute1")  # reads a single attribute, returns NULL if unset
## [1] "value1"
attributes(x)  # returns a named list with all attributes of an object
## $names
## [1] "X" "Y"
##
## $attribute1
## [1] "value1"
##
## $attribute2
## [1] "value2"
##
## $attribute3
## [1] "value3"

We can alter an attribute’s value or add further attributes by referring to the structure function once again. Moreover, setting an attribute’s value to NULL gets rid of it completely.

structure(x, attribute1=NULL, attribute4="added", attribute3="modified")
##        X        Y
##   "some" "object"
## attr(,"attribute2")
## [1] "value2"
## attr(,"attribute3")
## [1] "modified"
## attr(,"attribute4")
## [1] "added"

As far as the names attribute is concerned, we may generate an unnamed copy of an object by calling:

unname(x)
## [1] "some"   "object"
## attr(,"attribute1")
## [1] "value1"
## attr(,"attribute2")
## [1] "value2"
## attr(,"attribute3")
## [1] "value3"

In Section 9.3.6, we will introduce replacement functions. They will enable us to modify or remove an object’s attribute by calling “attr(x, "some_attribute") <- new_value”.
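Without going into how replacement functions work, here is a short sketch of this syntax in action; the attribute names and values are made up for illustration. Unlike structure, which returns a modified copy, these calls update the variable x itself.

```r
x <- structure(c("some", "object"), attribute1="value1")
attr(x, "attribute1") <- "modified"  # alter an existing attribute
attr(x, "attribute2") <- "added"     # add a new one
attr(x, "attribute1")
## [1] "modified"
attr(x, "attribute1") <- NULL        # remove an attribute
attr(x, "attribute1")
## NULL
attr(x, "attribute2")
## [1] "added"
names(x) <- c("X", "Y")              # special attributes have their own
names(x)
## [1] "X" "Y"
```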

Moreover, Section 5.5 highlights that certain operations (such as vector indexing, elementwise arithmetic operations, and coercion) might not preserve all attributes of the objects that were given as their inputs.
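A brief sketch of what this can look like in practice is given below. The vector x is illustrative; the behaviour shown is that of base R's arithmetic operators, the “[” operator, and as.numeric.

```r
x <- structure(c(a=1, b=2, c=3), attribute1="value1")
attr(x * 10, "attribute1")  # here, elementwise arithmetic kept it
## [1] "value1"
attr(x[1:2], "attribute1")  # indexing dropped it...
## NULL
names(x[1:2])  # ...but retained the names
## [1] "a" "b"
attributes(as.numeric(x))  # explicit casting dropped everything
## NULL
```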

4.5. Exercises

Exercise 4.6

Provide an answer to the following questions.

  • What is the meaning of c(TRUE, FALSE)*1:10?

  • What does sum(as.logical(x)) compute when x is a numeric vector?

  • We said that atomic vectors of the type character are the most general ones. Therefore, is as.numeric(as.character(x)) the same as as.numeric(x), regardless of the type of x?

  • What is the meaning of as.logical(x+y) if x and y are logical vectors? What about as.logical(x*y), as.logical(1-x), and as.logical(x!=y)?

  • What is the meaning of the following when x is a logical vector?

    • cummin(x) and cummin(!x),

    • cummax(x) and cummax(!x),

    • cumsum(x) and cumsum(!x),

    • cumprod(x) and cumprod(!x).

  • Let x be a named numeric vector, e.g., “x <- quantile(runif(100))”. What is the result of 2*x, mean(x), and round(x, 2)?

  • What is the meaning of x == NULL?

  • Give two ways to create a named character vector.

  • Give two ways (discussed above; there are more) to remove the names attribute from an object.

Exercise 4.7

There are a few peculiarities when joining or coercing lists. Compare the results generated by the following pairs of expressions:

# 1)
as.character(list(list(1, 2), list(3, list(4)), 5))
as.character(unlist(list(list(1, 2), list(3, list(4)), 5)))
# 2)
as.numeric(list(list(1, 2), list(3, list(4)), 5))
as.numeric(unlist(list(list(1, 2), list(3, list(4)), 5)))
# 3)
unlist(list(list(1, 2), sd))
list(1, 2, sd)
# 4)
c(list(c(1, 2), 3), 4, 5)
c(list(c(1, 2), 3), c(4, 5))

Exercise 4.8

Given numeric vectors x, y, z, and w, how to combine x, y, and list(z, w) so as to obtain list(x, y, z, w)? More generally, given a set of atomic vectors and lists of atomic vectors, how to combine them to obtain a single list of atomic vectors (not a list of atomic vectors and lists, not atomic vectors unwound, etc.)?

Exercise 4.9

saveRDS serialises R objects and writes their snapshots to disk so that they can be restored via a call to readRDS at a later time. Verify that these functions preserve object attributes. Also, check out dput and dget, which work with objects’ textual representation in the form of executable R code.

Exercise 4.10

(*) Use jsonlite::fromJSON to read a JSON file in the form of a named list.

In the extremely unlikely event of our finding the current chapter boring, let’s rejoice: some of the exercises and remarks that we will encounter in the next part, which is devoted to vector indexing, will definitely be mouthwatering.