4. Lists and attributes
After two brain-teasing chapters, it is time to cool it down a little. In this more technical part, we will introduce lists, which serve as universal containers for R objects of any size and type. Moreover, we will also show that each R object can be equipped with a number of optional attributes. Thanks to them, we will be able to label elements in any vector, and, in Chapter 10, introduce new complex data types such as matrices and data frames.
4.1. Type hierarchy and conversion
So far, we have been playing with three types of atomic vectors:
logical (Chapter 3),
numeric (Chapter 2),
character (which we have barely touched upon yet, but rest assured that they will be covered in detail very soon; see Chapter 6).
To determine the type of an object programmatically, we can call the typeof function.
typeof(c(1, 2, 3))
## [1] "double"
typeof(c(TRUE, FALSE, TRUE, NA))
## [1] "logical"
typeof(c("spam", "spam", "bacon", "eggs", "spam"))
## [1] "character"
We can easily convert between these types, either on our explicit demand (type casting) or on-the-fly (coercion, when we perform an operation that expects something different from the kind of input it was fed with).
Note
(*) Numeric vectors are reported as being either of the type double (double-precision floating-point numbers) or integer (32-bit; it is a subset of double); see Section 6.4.1. In most practical cases, this is a technical detail that we can safely ignore; compare also the mode function.
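As a brief illustration of the above note, here is a minimal sketch comparing typeof and mode. It assumes the L suffix for integer constants, which has not been introduced in the text so far.

```r
typeof(1)    # a numeric constant is stored as a double by default
## [1] "double"
typeof(1L)   # the L suffix requests an integer
## [1] "integer"
typeof(1:3)  # the colon operator yields integers here
## [1] "integer"
mode(1)      # mode does not distinguish between the two...
## [1] "numeric"
mode(1L)     # ...both are simply numeric
## [1] "numeric"
```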
4.1.1. Explicit type casting
We can use functions such as as.logical, as.numeric[1], and as.character to convert given objects to the corresponding types.
as.numeric(c(TRUE, FALSE, NA, TRUE, NA, FALSE)) # synonym: as.double
## [1] 1 0 NA 1 NA 0
as.logical(c(-2, -1, 0, 1, 2, 3, NA_real_, -Inf, NaN))
## [1] TRUE TRUE FALSE TRUE TRUE TRUE NA TRUE NA
Important
The rules are:
TRUE \(\rightarrow\) 1, FALSE \(\rightarrow\) 0, NA \(\rightarrow\) NA_real_,
and:
0 \(\rightarrow\) FALSE, NA_real_ and NaN \(\rightarrow\) NA, anything else \(\rightarrow\) TRUE.
The distinction between zero and non-zero is commonly applied in other programming languages as well.
Moreover, in the case of the conversion involving character strings, we have:
as.character(c(TRUE, FALSE, NA, TRUE, NA, FALSE))
## [1] "TRUE" "FALSE" NA "TRUE" NA "FALSE"
as.character(c(-2, -1, 0, 1, 2, 3, NA_real_, -Inf, NaN))
## [1] "-2" "-1" "0" "1" "2" "3" NA "-Inf" "NaN"
as.logical(c("TRUE", "True", "true", "T",
"FALSE", "False", "false", "F",
"anything other than these", NA_character_))
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE NA NA
as.numeric(c("0", "-1.23e4", "pi", "2+2", "NaN", "-Inf", NA_character_))
## Warning: NAs introduced by coercion
## [1] 0 -12300 NA NA NaN -Inf NA
4.1.2. Implicit conversion (coercion)
Recall that we referred to the three vector types as atomic ones. They can only be used to store elements of the same type. If we make an attempt at composing an object of mixed types with c, the common type will be determined in such a way that data are stored without information loss:
c(-1, FALSE, TRUE, 2, "three", NA)
## [1] "-1" "FALSE" "TRUE" "2" "three" NA
c("zero", TRUE, NA)
## [1] "zero" "TRUE" NA
c(-1, FALSE, TRUE, 2, NA)
## [1] -1 0 1 2 NA
Hence, we see that logical is the most specialised of the three, whereas character is the most general.
Note
The logical NA is converted to NA_real_ and NA_character_ in the preceding examples. R users tend to rely on implicit type conversion when they write c(1, 2, NA, 4) rather than c(1, 2, NA_real_, 4). In most cases, this is fine, but it might make us less vigilant. However, occasionally, it will be wiser to be more unequivocal. For instance, rep(NA_real_, 1e9) preallocates a long numeric vector instead of a logical one.
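The following short sketch shows how the type of a missing value depends on its context. It relies only on typeof, c, and rep, and uses a much shorter vector than the one mentioned in the note.

```r
typeof(NA)                # the plain NA is a logical constant
## [1] "logical"
typeof(c(1, 2, NA, 4))    # NA is coerced to NA_real_
## [1] "double"
typeof(c("spam", NA))     # NA is coerced to NA_character_
## [1] "character"
typeof(rep(NA, 3))        # a logical vector
## [1] "logical"
typeof(rep(NA_real_, 3))  # a numeric vector from the start
## [1] "double"
```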
Some functions that expect vectors of specific types can apply coercion by themselves (or act as if they do so):
c(NA, FALSE, TRUE) + 10 # implicit conversion logical --> numeric
## [1] NA 10 11
c(-1, 0, 1) & TRUE # implicit conversion numeric --> logical
## [1] TRUE FALSE TRUE
sum(c(TRUE, TRUE, FALSE, TRUE, FALSE)) # same as sum(as.numeric(...))
## [1] 3
cumsum(c(TRUE, TRUE, FALSE, TRUE, FALSE))
## [1] 1 2 2 3 3
cummin(c(TRUE, TRUE, FALSE, TRUE, FALSE))
## [1] 1 1 0 0 0
In Exercise 3.6, we computed the cross-entropy loss between a logical vector \(\boldsymbol{y}\in\{0, 1\}^n\) and a numeric vector \(\boldsymbol{p}\in(0, 1)^n\). This measure can be equivalently defined as:
\[ \mathcal{L}(\boldsymbol{p}, \boldsymbol{y}) = -\frac{1}{n} \sum_{i=1}^n \left( y_i \log(p_i) + (1-y_i) \log(1-p_i) \right). \]
Implement this formula using vectorised operations, but not relying on ifelse this time. Then, compute the cross-entropy loss between, for instance, “y <- sample(c(FALSE, TRUE), n, replace=TRUE)” and “p <- runif(n)” for some n. Note how seamlessly we translate between FALSE/TRUEs and 0/1s in the above equation (in particular, where \(1-y_i\) means the logical negation of \(y_i\)).
4.2. Lists
Lists are generalised vectors. They can be composed of R objects of any kind, including other lists. This is why we classify them as recursive (and not atomic) objects. They are especially useful wherever there is a need to handle some multitude as a single entity.
4.2.1. Creating lists
The most straightforward way to create a list is by means of the list function:
list(1, 2, 3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
Notice that it is not the same as c(1, 2, 3). We got a sequence that wraps three numeric vectors, each of length one.
More examples:
list(1:3, 4, c(TRUE, FALSE, NA, TRUE), "and so forth") # different types
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] TRUE FALSE NA TRUE
##
## [[4]]
## [1] "and so forth"
list(list(c(TRUE, FALSE, NA, TRUE), letters), list(1:3)) # a list of lists
## [[1]]
## [[1]][[1]]
## [1] TRUE FALSE NA TRUE
##
## [[1]][[2]]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
##
##
## [[2]]
## [[2]][[1]]
## [1] 1 2 3
The display of lists is (un)pretty bloated. However, the str function prints any R object in a more concise fashion:
str(list(list(c(TRUE, FALSE, NA, TRUE), letters), list(1:3)))
## List of 2
## $ :List of 2
## ..$ : logi [1:4] TRUE FALSE NA TRUE
## ..$ : chr [1:26] "a" "b" "c" "d" ...
## $ :List of 1
## ..$ : int [1:3] 1 2 3
Note
In Section 4.1, we said that the c function, when fed with arguments of mixed types, tries to determine the common type that retains the sense of data. If coercion to an atomic vector is not possible, the result will be a list.
c(1, "two", identity) # `identity` is an object of the type "function"
## [[1]]
## [1] 1
##
## [[2]]
## [1] "two"
##
## [[3]]
## function (x)
## x
## <environment: namespace:base>
Thus, the c function can also be used to concatenate lists:
c(list(1), list(2), list(3)) # three lists --> one list
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
Lists can be repeated using rep:
rep(list(1:11, LETTERS), 2)
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11
##
## [[2]]
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
##
## [[3]]
## [1] 1 2 3 4 5 6 7 8 9 10 11
##
## [[4]]
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
4.2.2. Converting to and from lists
The conversion of an atomic vector to a list of vectors of length one can be done via a call to as.list:
as.list(c(1, 2, 3)) # vector of length 3 --> list of 3 vectors of length 1
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
Unfortunately, calling, say, as.numeric on a list will often raise an error, even if it is composed of numeric vectors only (the conversion succeeds only when every element is an atomic vector of length one). We can try flattening it to an atomic sequence by calling unlist:
unlist(list(list(1, 2), list(3, list(4:8)), 9))
## [1] 1 2 3 4 5 6 7 8 9
unlist(list(list(1, 2), list(3, list(4:8)), "spam"))
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "spam"
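To see when a direct conversion of a list fails, consider the sketch below. It assumes the tryCatch function and an anonymous error handler, neither of which has been discussed so far. They are used here only to capture the error instead of halting the script.

```r
as.numeric(list(1, 2, 3))  # every element is an atomic vector of length one
## [1] 1 2 3
res <- tryCatch(
    as.numeric(list(1:2, 3)),  # the first element is of length two
    error=function(e) "error"  # return a marker string if an error occurs
)
res
## [1] "error"
unlist(list(1:2, 3))       # flattening works regardless
## [1] 1 2 3
```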
Note
(*) Chapter 11 will mention the simplify2array function, which generalises unlist in a way that can sometimes give rise to a matrix.
4.3. NULL
NULL, being the one and only instance of the eponymous type, can be used as a placeholder for an R object or designate the absence of any entities whatsoever.
list(NULL, NULL, month.name)
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## [1] "January" "February" "March" "April" "May"
## [6] "June" "July" "August" "September" "October"
## [11] "November" "December"
NULL is different from a vector of length zero because the latter has a type. However, NULL sometimes behaves like a zero-length vector. In particular, length(NULL) returns 0. Also, c called with no arguments returns NULL.
Testing for NULL-ness can be done with a call to is.null.
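These properties can be checked directly. The sketch below additionally assumes the numeric function, which creates a numeric vector of a given length (here, zero).

```r
typeof(NULL)
## [1] "NULL"
typeof(numeric(0))   # a zero-length vector still has a type
## [1] "double"
length(NULL)
## [1] 0
is.null(c())         # c called with no arguments yields NULL
## [1] TRUE
is.null(numeric(0))  # an empty numeric vector is not NULL
## [1] FALSE
```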
Important
NULL is not the same as NA. The former cannot be placed in an atomic vector.
c(1, NA, 3, NULL, 5) # here, NULL behaves like a zero-length vector
## [1] 1 NA 3 5
They have very distinct semantics (no value vs a missing value).
Later we will see that some functions return NULL invisibly when they have nothing interesting to report. This is the case of cat or plot, which are called because of their side effects (printing and plotting). Furthermore, in certain contexts, replacing content with NULL will actually result in its removal, e.g., when subsetting a list.
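Both behaviours can be previewed with the following sketch. It assumes the cat function (whose documented return value is an invisible NULL) and the “[[” replacement syntax, which is only covered in Chapter 5.

```r
res <- cat("spam\n")  # prints its argument; the return value is NULL
## spam
is.null(res)
## [1] TRUE
x <- list(1, "two", 3)
x[[2]] <- NULL        # removes the second element instead of storing NULL
length(x)
## [1] 2
```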
4.4. Object attributes
Lists can embrace many entities in the form of a single item sequence. Attributes, on the other hand, give means to inject extra data into an object. They are unordered key=value pairs, where key is a single string, and value is any R object except NULL. We can introduce them by calling, amongst others[2], the structure function:
x_simple <- 1:10
x <- structure(
x_simple, # the object to be equipped with attributes
attribute1="value1",
attribute2=c(6, 100, 324)
)
4.4.1. Developing perceptual indifference to most attributes
Let’s see how the foregoing x is reported on the console:
print(x)
## [1] 1 2 3 4 5 6 7 8 9 10
## attr(,"attribute1")
## [1] "value1"
## attr(,"attribute2")
## [1] 6 100 324
The object of concern, 1:10, was displayed first. We need to get used to that. Most of the time, we suggest treating the “attr…” parts of the display as if they were printed in tiny font.
Equipping an object with attributes does not usually change its nature; see, however, Chapter 10 for a few exceptions. The above x is still treated as an ordinary sequence of numbers by most functions:
sum(x) # the same as sum(1:10); `sum` does not care about any attributes
## [1] 55
typeof(x) # just a numeric vector, but with some perks
## [1] "integer"
Important
Attributes are generally ignored by most functions unless they have specifically been programmed to pay attention to them.
4.4.2. But there are a few use cases
Some R functions add attributes to the return value to sneak extra information that might be useful, just in case. For instance, na.omit, whose main aim is to remove missing values from an atomic vector, yields:
y <- c(10, 20, NA, 40, 50, NA, 70)
(y_na_free <- na.omit(y))
## [1] 10 20 40 50 70
## attr(,"na.action")
## [1] 3 6
## attr(,"class")
## [1] "omit"
We can enjoy the NA-free version of y in further computations:
mean(y_na_free)
## [1] 38
Additionally, the na.action attribute indicates the former whereabouts of the missing observations:
attr(y_na_free, "na.action") # read the attribute value
## [1] 3 6
## attr(,"class")
## [1] "omit"
We ignore the class part until Chapter 10.
As another example, gregexpr discussed in Chapter 6 searches for a given pattern in a character vector:
needle <- "spam|durian" # pattern to search for: spam OR durian
haystack <- c("spam, bacon, and durian-flavoured spam", "spammer") # text
(pos <- gregexpr(needle, haystack, perl=TRUE))
## [[1]]
## [1] 1 18 35
## attr(,"match.length")
## [1] 4 6 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
We sought all occurrences of the pattern within two character strings. As their number may vary from string to string, wrapping the results in a list was a good design choice. Each list element gives the starting positions where matches can be found: there are three and one match(es), respectively. Moreover, every vector of positions has a designated match.length attribute (amongst others), in case we need it.
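Should we need the match lengths, they can be read like any other attribute. The sketch below recreates the pos object and assumes the “[[” operator for extracting a single list element, which is covered in Chapter 5.

```r
needle <- "spam|durian"
haystack <- c("spam, bacon, and durian-flavoured spam", "spammer")
pos <- gregexpr(needle, haystack, perl=TRUE)
first <- pos[[1]]            # match positions in the first string
attr(first, "match.length")  # lengths of the three matches
## [1] 4 6 4
```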
Create a list with EUR/AUD, EUR/GBP, and EUR/USD exchange rates read from the euraud-*.csv, eurgbp-*.csv, and eurusd-*.csv files in our data repository. Each of its three elements should be a numeric vector storing the currency exchange rates. Furthermore, equip them with currency_from, currency_to, date_from, and date_to attributes. For example:
## [1] NA 1.6006 1.6031 NA NA 1.6119 1.6251 1.6195 1.6193 1.6132
## [11] NA NA 1.6117 1.6110 1.6188 1.6115 1.6122 NA
## attr(,"currency_from")
## [1] "EUR"
## attr(,"currency_to")
## [1] "AUD"
## attr(,"date_from")
## [1] "2020-01-01"
## attr(,"date_to")
## [1] "2020-06-30"
Such an additional piece of information could be stored in a few separate variables (other vectors), but then it would not be as convenient to use as the above representation.
4.4.3. Special attributes
Attributes have great potential which is somewhat wasted, for R users rarely know:
that attributes exist (pessimistic scenario), or
how to handle them (realistic scenario).
But we know now.
What is more, certain attributes have been predestined to play a unique role in R. Namely, the most prevalent amongst the special attributes are:
names, row.names, and dimnames are used to label the elements of atomic and generic vectors (see below) as well as rows and columns in matrices (Chapter 11) and data frames (Chapter 12),
dim turns flat vectors into matrices and other tensors (Chapter 11),
levels labels the underlying integer codes in factor objects (Section 10.3.2),
class can be used to bring forth new complex data structures based on basic types (Chapter 10).
We call them special because:
they cannot be assigned arbitrary values; for instance, we will soon see that names can only accept a character vector of a specific length,
they can be accessed via designated functions, e.g., names, class, dim, dimnames, levels, etc.,
they are widely recognised by many other functions.
However, in spite of the above, special attributes can still be managed as ordinary ones.
comment is perhaps the most rarely used special attribute. Create an object (whatever) equipped with the comment attribute. Verify that assigning to it anything other than a character vector leads to an error. Read its value by calling the comment function. Display the object equipped with this attribute. Note that the print function completely ignores its existence: this is how special it is.
Important
(*) The accessor functions such as names or class might return meaningful values, even if the corresponding attribute is not set explicitly; see, e.g., Section 11.1.5 for an example.
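A simple case of this behaviour (a different one from the example in Section 11.1.5, chosen here only for illustration) involves the class of a plain integer vector:

```r
x <- 1:3
attr(x, "class")  # the class attribute is not set
## NULL
class(x)          # yet, the accessor reports an implicit class
## [1] "integer"
```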
4.4.4. Labelling vector elements with the names attribute
The special attribute names labels atomic vectors’ and lists’ elements.
(x <- structure(c(13, 2, 6), names=c("spam", "sausage", "celery")))
## spam sausage celery
## 13 2 6
The labels may improve the expressivity and readability of our code and data.
Verify that the above x is still an ordinary numeric vector by calling typeof and sum on it.
Let’s stress that we can ignore the names attribute altogether. If we apply any operation discussed in Chapter 2, we will garner the same result regardless of whether such extra information is present or not.
It is just the print function that changed its behaviour slightly. After all, it is a special attribute. Instead of reporting:
## [1] 13 2 6
## attr(,"names")
## [1] "spam" "sausage" "celery"
we got a nicely formatted table-like display. Non-special attributes are still printed in the standard way:
structure(x, additional_attribute=1:10)
## spam sausage celery
## 13 2 6
## attr(,"additional_attribute")
## [1] 1 2 3 4 5 6 7 8 9 10
Note
Chapter 5 will also mention that some operations (such as indexing) gain superpowers in the presence of the names attribute.
This attribute can be read by calling:
attr(x, "names") # just like any other attribute
## [1] "spam" "sausage" "celery"
names(x) # because it is so special
## [1] "spam" "sausage" "celery"
Named vectors can be easily created with the c and list functions as well:
c(a=1, b=2)
## a b
## 1 2
list(a=1, b=2)
## $a
## [1] 1
##
## $b
## [1] 2
c(a=c(x=1, y=2), b=3, c=c(z=4)) # this is smart
## a.x a.y b c.z
## 1 2 3 4
Let’s contemplate how a named list is printed on the console. Again, it is still a list, but with some extras.
A whole lot of functions return named vectors. Evaluate the following expressions and read the corresponding pages in their documentation:
quantile(runif(100)),
hist(runif(100), plot=FALSE),
options() (take note of digits, scipen, max.print, and width),
capabilities().
Note
(*) Most of the time, lists are used merely as containers for other R objects. This is a dull yet essential role. However, let’s just mention here that every data frame is, in fact, a generic vector (see Chapter 12). Each column corresponds to a named list element:
(df <- head(iris)) # some data frame
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
typeof(df) # it is just a list (with extras that will be discussed later)
## [1] "list"
unclass(df) # how it is represented exactly (without the extras)
## $Sepal.Length
## [1] 5.1 4.9 4.7 4.6 5.0 5.4
##
## $Sepal.Width
## [1] 3.5 3.0 3.2 3.1 3.6 3.9
##
## $Petal.Length
## [1] 1.4 1.4 1.3 1.5 1.4 1.7
##
## $Petal.Width
## [1] 0.2 0.2 0.2 0.2 0.2 0.4
##
## $Species
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
##
## attr(,"row.names")
## [1] 1 2 3 4 5 6
Therefore, the functions we discuss in this chapter are of use in processing such structured data, too.
4.4.5. Altering and removing attributes
We know that a single attribute can be read by calling attr. Their whole list is generated with a call to attributes.
(x <- structure(c("some", "object"), names=c("X", "Y"),
attribute1="value1", attribute2="value2", attribute3="value3"))
## X Y
## "some" "object"
## attr(,"attribute1")
## [1] "value1"
## attr(,"attribute2")
## [1] "value2"
## attr(,"attribute3")
## [1] "value3"
attr(x, "attribute1") # reads a single attribute, returns NULL if unset
## [1] "value1"
attributes(x) # returns a named list with all attributes of an object
## $names
## [1] "X" "Y"
##
## $attribute1
## [1] "value1"
##
## $attribute2
## [1] "value2"
##
## $attribute3
## [1] "value3"
We can alter an attribute’s value or add further attributes by referring to the structure function once again. Moreover, setting an attribute’s value to NULL gets rid of it completely.
structure(x, attribute1=NULL, attribute4="added", attribute3="modified")
## X Y
## "some" "object"
## attr(,"attribute2")
## [1] "value2"
## attr(,"attribute3")
## [1] "modified"
## attr(,"attribute4")
## [1] "added"
As far as the names attribute is concerned, we may generate an unnamed copy of an object by calling:
unname(x)
## [1] "some" "object"
## attr(,"attribute1")
## [1] "value1"
## attr(,"attribute2")
## [1] "value2"
## attr(,"attribute3")
## [1] "value3"
In Section 9.3.6, we will introduce replacement functions. They will enable us to modify or remove an object’s attribute by calling “attr(x, "some_attribute") <- new_value”.
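As a preview, here is a minimal sketch of the replacement syntax. The attribute named unit is a made-up example, and the “names(x) <- NULL” form is assumed to work in the same manner for the special names attribute.

```r
x <- structure(c(13, 2, 6), names=c("spam", "sausage", "celery"))
attr(x, "unit") <- "kg"  # adds a (hypothetical) attribute named unit
attr(x, "unit")
## [1] "kg"
attr(x, "unit") <- NULL  # removes it again
names(x) <- NULL         # removes the names attribute
attributes(x)            # nothing is left
## NULL
```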
Moreover, Section 5.5 highlights that certain operations (such as vector indexing, elementwise arithmetic operations, and coercion) might not preserve all attributes of the objects that were given as their inputs.
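The sketch below illustrates this on a small named vector with one additional, arbitrarily named attribute. It assumes the “[” indexing operator from Chapter 5.

```r
x <- structure(c(a=1, b=2, c=3), attribute1="value1")
attributes(x[1:2])         # indexing keeps names but drops attribute1
## $names
## [1] "a" "b"
attr(2*x, "attribute1")    # this arithmetic operation preserves it
## [1] "value1"
attributes(as.numeric(x))  # this conversion strips all attributes
## NULL
```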
4.5. Exercises
Provide an answer to the following questions.
What is the meaning of c(TRUE, FALSE)*1:10?
What does sum(as.logical(x)) compute when x is a numeric vector?
We said that atomic vectors of the type character are the most general ones. Therefore, is as.numeric(as.character(x)) the same as as.numeric(x), regardless of the type of x?
What is the meaning of as.logical(x+y) if x and y are logical vectors? What about as.logical(x*y), as.logical(1-x), and as.logical(x!=y)?
What is the meaning of the following when x is a logical vector?
cummin(x) and cummin(!x),
cummax(x) and cummax(!x),
cumsum(x) and cumsum(!x),
cumprod(x) and cumprod(!x).
Let x be a named numeric vector, e.g., “x <- quantile(runif(100))”. What is the result of 2*x, mean(x), and round(x, 2)?
What is the meaning of x == NULL?
Give two ways to create a named character vector.
Give two ways (discussed above; there are more) to remove the names attribute from an object.
There are a few peculiarities when joining or coercing lists. Compare the results generated by the following pairs of expressions:
# 1)
as.character(list(list(1, 2), list(3, list(4)), 5))
as.character(unlist(list(list(1, 2), list(3, list(4)), 5)))
# 2)
as.numeric(list(list(1, 2), list(3, list(4)), 5))
as.numeric(unlist(list(list(1, 2), list(3, list(4)), 5)))
# 3)
unlist(list(list(1, 2), sd))
list(1, 2, sd)
# 4)
c(list(c(1, 2), 3), 4, 5)
c(list(c(1, 2), 3), c(4, 5))
Given numeric vectors x, y, z, and w, how can we combine x, y, and list(z, w) so as to obtain list(x, y, z, w)? More generally, given a set of atomic vectors and lists of atomic vectors, how can we combine them to obtain a single list of atomic vectors (not a list of atomic vectors and lists, not atomic vectors unwound, etc.)?
saveRDS serialises R objects and writes their snapshots to disk so that they can be restored via a call to readRDS at a later time. Verify that this function preserves object attributes. Also, check out dput and dget which work with objects’ textual representations in the form of executable R code.
(*) Use jsonlite::fromJSON to read a JSON file in the form of a named list.
In the extremely unlikely event of our finding the current chapter boring, let’s rejoice: some of the exercises and remarks that we will encounter in the next part, which is devoted to vector indexing, will definitely be mouthwatering.