12. Data Frames

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of Chapters 1–12 are already complete, but there will be more. In the meantime, any bug/typos reports/fixes are appreciated. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [20].

Matrices are built on top of atomic vectors and hence only allow items of the same type to be arranged into rows and columns. Data frames (objects of S3 class data.frame, first introduced in [8]), on the other hand, are collections of vectors of identical lengths or matrices with identical row counts, hence allowing us to represent structured[1] data of possibly heterogeneous types, for instance:

class(iris)  # `iris` is an example built-in data frame
## [1] "data.frame"
iris[c(1, 51, 101), ]  # 3 chosen rows from `iris`
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 101          6.3         3.3          6.0         2.5  virginica

is a mix of numeric and factor-type data.

The good news is that not only are data frames built upon named lists (e.g., to extract a column, we can refer to `[[`), but also many functions recognise them as matrix-like objects (e.g., to select specific rows and columns, two indexers can be passed to `[`, like in the example above). Hence, it will soon turn out that we already know a lot about how to perform basic data wrangling activities, even if we do not fully realise it yet.
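As a quick illustration of this duality (both access styles are explored in detail in Section 12.2):

```r
head(iris[["Sepal.Length"]], 3)  # list-like: extract a column by name
## [1] 5.1 4.9 4.7
iris[1:2, c("Sepal.Length", "Species")]  # matrix-like: two indexers
##   Sepal.Length Species
## 1          5.1  setosa
## 2          4.9  setosa
```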

Important

Some of us will approach this chapter biased by what we know from elsewhere, including our experience with some popular third-party packages for data frame processing. The art is to filter out that information as noise (at least, for the time being). We will show how powerful the base R vocabulary is and how much can be implied from the material covered in the preceding chapters. And yes, this book is like a good thriller/drama/love story: it is meant to be read from beginning to end, so please go back to the start of this comprehensive course if you happened to pop in here by accident or driven by “but I need to know now”. Good morning.

12.1. Creating Data Frames

12.1.1. data.frame and as.data.frame

Most frequently, we create data frames based on a series of logical, numeric, or character vectors of identical lengths. The data.frame function is particularly useful in such a scenario:

(x <- data.frame(
    a=c(TRUE, FALSE),
    b=1:6,
    c=runif(6),
    d=c("spam", "spam", "eggs")
))
##       a b       c    d
## 1  TRUE 1 0.77437 spam
## 2 FALSE 2 0.19722 spam
## 3  TRUE 3 0.97801 eggs
## 4 FALSE 4 0.20133 spam
## 5  TRUE 5 0.36124 spam
## 6 FALSE 6 0.74261 eggs

Note that shorter vectors were recycled. The fact that the diverse column types were retained and no coercion was made can be verified, e.g., by calling:

str(x)
## 'data.frame':	6 obs. of  4 variables:
##  $ a: logi  TRUE FALSE TRUE FALSE TRUE FALSE
##  $ b: int  1 2 3 4 5 6
##  $ c: num  0.774 0.197 0.978 0.201 0.361 ...
##  $ d: chr  "spam" "spam" "eggs" "spam" ...

We can also fetch the class of each column directly by calling (compare Section 12.3):

sapply(x, class)  # the same as unlist(Map(class, x))
##           a           b           c           d 
##   "logical"   "integer"   "numeric" "character"

Important

For many reasons (see, e.g., Section 12.1.5 and Section 12.1.6), we recommend always checking the type of each column, for instance by calling the str function.

Many objects, such as matrices, can easily be coerced to data frames using particular as.data.frame methods.

Here is an example matrix:

(A <- matrix(1:6, nrow=3,
    dimnames=list(
        NULL,        # no row labels
        c("u", "v")  # some column labels
)))
##      u v
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6

Let us convert it to a data frame:

as.data.frame(A)  # as.data.frame.matrix
##   u v
## 1 1 4
## 2 2 5
## 3 3 6

Note that a matrix with no row labels is printed slightly differently from a data frame equipped with (as it will soon turn out) the default row.names.

Named lists are amongst the other candidates for a meaningful conversion. Consider an example list, where each element is a vector of the same length as the others:

(l <- Map(
    function(x) {
        c(Min=min(x), Median=median(x), Mean=mean(x), Max=max(x))
    },
    split(iris[["Sepal.Length"]], iris[["Species"]])
))
## $setosa
##    Min Median   Mean    Max 
##  4.300  5.000  5.006  5.800 
## 
## $versicolor
##    Min Median   Mean    Max 
##  4.900  5.900  5.936  7.000 
## 
## $virginica
##    Min Median   Mean    Max 
##  4.900  6.500  6.588  7.900

Each list element will be turned to a separate column:

as.data.frame(l)  # as.data.frame.list
##        setosa versicolor virginica
## Min     4.300      4.900     4.900
## Median  5.000      5.900     6.500
## Mean    5.006      5.936     6.588
## Max     5.800      7.000     7.900

Sadly, as.data.frame.list is not particularly fond of lists of vectors of incompatible lengths:

as.data.frame(list(a=1, b=11:12, c=21:23))
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 2, 3

The above vectors could have been recycled with a warning, but they were not: recycling is only applied when the length of the longest vector is a multiple of the lengths of all the others.

as.data.frame(list(a=1:4, b=11:12, c=21))  # recycling rule okay
##   a  b  c
## 1 1 11 21
## 2 2 12 21
## 3 3 11 21
## 4 4 12 21

The method for the S3 class table (mentioned in Chapter 11) can be helpful as well. Here is an example contingency table together with its unstacked version.

(t <- table(mtcars[["vs"]], mtcars[["cyl"]]))
##    
##      4  6  8
##   0  1  3 14
##   1 10  4  0
as.data.frame(t)  # as.data.frame.table; see the stringsAsFactors note below!
##   Var1 Var2 Freq
## 1    0    4    1
## 2    1    4   10
## 3    0    6    3
## 4    1    6    4
## 5    0    8   14
## 6    1    8    0

Actually, as.data.frame.table is so useful that we might want to call it directly on any array. This way, we can convert it from the so-called wide format to the long one; see Section 12.3.6 for more details.
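For instance, calling this method directly on a small matrix (a two-dimensional array equipped with dimnames) stacks all the cells into a single column, one row per combination of levels; Var1, Var2, and responseName below are the method's defaults and an actual parameter, not anything we defined ourselves:

```r
(A <- matrix(1:4, nrow=2, dimnames=list(c("x", "y"), c("u", "v"))))
##   u v
## x 1 3
## y 2 4
as.data.frame.table(A, responseName="Value")  # wide to long
##   Var1 Var2 Value
## 1    x    u     1
## 2    y    u     2
## 3    x    v     3
## 4    y    v     4
```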

Note

The above method is based on expand.grid, which determines all combinations of a given series of vectors.

expand.grid(1:2, c("a", "b", "c"))  # see the stringsAsFactors note below!
##   Var1 Var2
## 1    1    a
## 2    2    a
## 3    1    b
## 4    2    b
## 5    1    c
## 6    2    c

Overall, many classes of objects can be included[2] in a data frame; the popular choices include Date, POSIXct, and factor. It is worth noting that the data.frame function calls the corresponding as.data.frame method, and format is used on printing.

Example 12.1

Here are two custom methods for what we would like to call from now on an S3 class spam:

as.data.frame.spam <- function(x, ...)
    structure(
        list(x),
        class="data.frame",
        row.names=seq_along(x)
    )
format.spam <- function(x, ...)
    paste0("*", x, "*")

Testing data frame creation and printing:

data.frame(
    a=structure(c("a", "b", "c"), class="spam"),
    b=factor(c("spam", "bacon", "spam")),
    c=Sys.Date()+1:3
)
##     a     b          c
## 1 *a*  spam 2023-01-15
## 2 *b* bacon 2023-01-16
## 3 *c*  spam 2023-01-17

12.1.2. cbind.data.frame and rbind.data.frame

There are data frame-specific versions of cbind or rbind (which we discussed in the context of stacking matrices in Section 11.1.2). They are used quite eagerly: help("cbind") states that they will be referred to if at least[3] one of the arguments is a data frame and the others are atomic vectors or lists (possibly with the dim attribute).

For example:

x <- iris[c(1, 51, 101), c("Sepal.Length", "Species")]  # whatever
cbind(Yummy=c(TRUE, FALSE, TRUE), x)
##     Yummy Sepal.Length    Species
## 1    TRUE          5.1     setosa
## 51  FALSE          7.0 versicolor
## 101  TRUE          6.3  virginica

added a new column to the data frame x. Moreover:

rbind(x, list(42, "virginica"))
##     Sepal.Length    Species
## 1            5.1     setosa
## 51           7.0 versicolor
## 101          6.3  virginica
## 11          42.0  virginica

added a new row. Note that the columns are of different types; hence, the values to row-bind were provided as a generic vector. The list can also be named, and it can consist of vectors of length greater than one, given in any order:

rbind(x, list(
    Species=c("virginica", "setosa"),
    Sepal.Length=c(42, 7)
))
##     Sepal.Length    Species
## 1            5.1     setosa
## 51           7.0 versicolor
## 101          6.3  virginica
## 11          42.0  virginica
## 2            7.0     setosa

Sometimes referring to these methods directly will be necessary. Consider an example list of atomic vectors:

x <- list(a=1:3, b=11:13, c=21:23)

First, we call the generic which dispatches to the default method:

do.call(cbind, x)
##      a  b  c
## [1,] 1 11 21
## [2,] 2 12 22
## [3,] 3 13 23

If we want to make sure we obtain a data frame as a result, we need to write:

do.call(cbind.data.frame, x)
##   a  b  c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23

This is particularly useful in the context of fetching outputs from Map and its friends, which are wrapped inside a list. For instance:

l <- unname(Map(
    function(x) list(
        Sepal.Length=mean(x[["Sepal.Length"]]),
        Sepal.Width=mean(x[["Sepal.Width"]]),
        Species=x[["Species"]][1]
    ),
    split(iris, iris[["Species"]])  # split.data.frame; see below
))
str(l)
## List of 3
##  $ :List of 3
##   ..$ Sepal.Length: num 5.01
##   ..$ Sepal.Width : num 3.43
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1
##  $ :List of 3
##   ..$ Sepal.Length: num 5.94
##   ..$ Sepal.Width : num 2.77
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2
##  $ :List of 3
##   ..$ Sepal.Length: num 6.59
##   ..$ Sepal.Width : num 2.97
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 3

This was nothing more than a fancy way to obtain an illustrative list, which we may now turn into a data frame by calling:

do.call(rbind.data.frame, l)
##   Sepal.Length Sepal.Width    Species
## 1        5.006       3.428     setosa
## 2        5.936       2.770 versicolor
## 3        6.588       2.974  virginica

On the other hand, do.call(rbind, l) does not return a particularly friendly object type:

do.call(rbind, l)
##      Sepal.Length Sepal.Width Species   
## [1,] 5.006        3.428       setosa    
## [2,] 5.936        2.77        versicolor
## [3,] 6.588        2.974       virginica

Despite the pretty face, it is a matrix… over a list:

str(do.call(rbind, l))
## List of 9
##  $ : num 5.01
##  $ : num 5.94
##  $ : num 6.59
##  $ : num 3.43
##  $ : num 2.77
##  $ : num 2.97
##  $ : Factor w/ 3 levels "setosa","versicolor",..: 1
##  $ : Factor w/ 3 levels "setosa","versicolor",..: 2
##  $ : Factor w/ 3 levels "setosa","versicolor",..: 3
##  - attr(*, "dim")= int [1:2] 3 3
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:3] "Sepal.Length" "Sepal.Width" "Species"

12.1.3. Reading Data Frames

Structured data can be imported from external sources, such as CSV/TSV (comma/tab-separated values) or HDF5 files, relational databases supporting SQL (see Section 12.1.4), web APIs (e.g., through the curl and jsonlite packages), spreadsheets [48], and so on.

In particular, read.csv and the like fetch data from plain text files consisting of records where fields are separated by commas, semicolons, tabs, etc.

For instance:

x <- data.frame(a=runif(3), b=c(TRUE, FALSE, TRUE))  # example data frame
f <- tempfile()  # temporary file name
write.csv(x, f, row.names=FALSE)  # export

This created a CSV file which looks like:

cat(readLines(f), sep="\n")  # print file contents
## "a","b"
## 0.287577520124614,TRUE
## 0.788305135443807,FALSE
## 0.4089769218117,TRUE

The above can be read by calling:

read.csv(f)
##         a     b
## 1 0.28758  TRUE
## 2 0.78831 FALSE
## 3 0.40898  TRUE
Exercise 12.2

Check out help("read.table") for a long list of tunable parameters, especially: sep, dec, quote, header, comment.char, and row.names. Further, note that reading from compressed files is supported directly.
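To give a taste of two of these parameters: in some locales, CSV files use “;” as the field separator and “,” as the decimal mark; read.csv2 is a convenience wrapper that assumes exactly this convention. A minimal sketch:

```r
f <- tempfile()
writeLines(c("a;b", "1,5;spam", "2,25;eggs"), f)  # ';'-separated, decimal commas
read.table(f, header=TRUE, sep=";", dec=",")  # or simply: read.csv2(f)
##      a    b
## 1 1.50 spam
## 2 2.25 eggs
```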

Important

CSV is by far the most portable and user-friendly format for exchanging matrix-like objects between different programs and computing languages (e.g., Python, Julia, LibreOffice Calc, etc.). Such files can be opened in any text editor.

Note

As mentioned in Section 8.3.5, it is possible to process data frames on a chunk-by-chunk basis, which is beneficial especially when data do not fit into memory (compare the nrows argument to read.csv).
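A minimal sketch of this idiom follows: reading from an already-open connection continues where the previous call finished. For simplicity, we assume that the total row count is not a multiple of the chunk size; otherwise, the final call would signal an error that a production version should catch.

```r
f <- tempfile()
write.csv(data.frame(x=1:7), f, row.names=FALSE)
con <- file(f, open="r")
invisible(readLines(con, 1))  # skip the header line
total <- 0
repeat {
    chunk <- read.csv(con, header=FALSE, col.names="x", nrows=3)
    total <- total + sum(chunk[["x"]])  # process the current chunk
    if (nrow(chunk) < 3) break  # fewer rows than requested: we are done
}
close(con)
print(total)
## [1] 28
```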

12.1.4. Interfacing Relational Databases and Querying with SQL (*)

The DBI package provides a universal interface for particular database management systems whose drivers are implemented in additional add-ons such as RSQLite, RMariaDB, RPostgreSQL, etc., or, more generally, RODBC or odbc. For more details, see Section 4 of [48].

Example 12.3

Let us play with an in-memory (volatile) instance of an SQLite database.

library("DBI")
con <- dbConnect(RSQLite::SQLite(), ":memory:")

This returns an object representing a database connection which we can refer to in further communication.

An easy way to create a database table is to call:

dbWriteTable(con, "mtcars", mtcars)  # `mtcars` is a toy built-in data frame

Alternatively, dbExecute could have been used to send SQL statements such as CREATE TABLE ... followed by a series of INSERT INTO ....

Some data retrieval can now follow:

dbGetQuery(con, "
    SELECT cyl, vs, AVG(mpg) AS mpg_ave, AVG(hp) AS hp_ave
    FROM mtcars
    GROUP BY cyl, vs
")
##   cyl vs mpg_ave hp_ave
## 1   4  0  26.000  91.00
## 2   4  1  26.730  81.80
## 3   6  0  20.567 131.67
## 4   6  1  19.125 115.25
## 5   8  0  15.100 209.21

This gives us an ordinary R data frame which we can process in the same fashion as any other object of this kind.

At the end, the database connection must be closed.

dbDisconnect(con)
Exercise 12.4

Database passwords should never be stored in plain text files, let alone in R scripts in version-controlled repositories. Consider a few ways for fetching credentials programmatically:

  • using environment variables (see help("Sys.getenv")),

  • using the keyring package,

  • calling system2 (Section 7.3.3) to retrieve it from the system keyring (e.g., the keyring package for Python provides a platform-independent command-line utility).
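For instance, the first option might look as follows, where DB_PASSWORD is a made-up variable name that we would export in the shell, never hard-code in a script:

```r
password <- Sys.getenv("DB_PASSWORD", unset=NA)  # NA if the variable is not set
if (is.na(password))
    cat("DB_PASSWORD is not set; refusing to connect\n")
```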

12.1.5. Strings as Factors?

The following is so critical that we will devote a separate subsection to discuss it, so that we always remain vigilant (such is life: maintaining some level of mindfulness is often a good idea).

Important

Some functions related to data frames automatically convert character vectors to factors. This behaviour is frequently controlled by the stringsAsFactors argument thereto.

This is particularly problematic because, when printed, factor and character columns look identical:

(x <- data.frame(a=factor(c("U", "V")), b=c("U", "V")))
##   a b
## 1 U U
## 2 V V

We recall from Section 10.3.3 that factors can be nasty. For example, passing factors as indexers in `[` or converting them with as.numeric might give counterintuitive (for the uninformed) results. Also, new factor levels must be added manually when we want to extend them with more diverse data. This can cause some unexpected behaviour in contexts such as:

rbind(x, c("W", "W"))
## Warning in `[<-.factor`(`*tmp*`, ri, value = "W"): invalid factor level,
## NA generated
##      a b
## 1    U U
## 2    V V
## 3 <NA> W

It is therefore a good habit always to check the data types, for instance:

str(x)
## 'data.frame':	2 obs. of  2 variables:
##  $ a: Factor w/ 2 levels "U","V": 1 2
##  $ b: chr  "U" "V"

Before R 4.0, a number of functions, including data.frame and read.csv, had the stringsAsFactors argument defaulting to TRUE. This is no longer the case for many of them.

However, exceptions to this rule still exist, e.g., as.data.frame.table and expand.grid. Besides, some built-in example data frames have factor-typed columns inherited from the old days, e.g.:

class(iris[["Species"]])
## [1] "factor"

We observe that the Species column in iris is not of type character. Hence, adding a new variety requires extra care:

iris2 <- iris[c(1, 51, 101), ]  # example subset
levels(iris2[["Species"]]) <- c(levels(iris2[["Species"]]), "croatica")
rbind(iris2, c(6, 3, 3, 2, "croatica"))
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51             7         3.2          4.7         1.4 versicolor
## 101          6.3         3.3            6         2.5  virginica
## 4              6           3            3           2   croatica

Alternatively, we could have simply converted the Species column to character.
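For instance (a sketch of that alternative, continuing the example above):

```r
iris2 <- iris[c(1, 51, 101), ]
iris2[["Species"]] <- as.character(iris2[["Species"]])  # factor -> character
rbind(iris2, list(6, 3, 3, 2, "croatica"))  # no invalid factor levels now
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 4            6.0         3.0          3.0         2.0   croatica
```

Note that the numeric columns remain numeric this time, because the new row was given as a list, not an atomic character vector.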

12.1.6. Internal Representation

Objects of S3 class data.frame are built upon lists of vectors of the same length or matrices with identical row counts, which define consecutive columns thereof. Apart from class, they must be equipped with the following special attributes:

  • names – a character vector (as usual in any named list) labelling the columns or their groups,

  • row.names – a character or integer vector with no duplicates or missing values, doing what its name suggests.

Therefore, a data frame can be created from scratch by calling, for example:

structure(
    list(a=11:13, b=21:23),  # sets the `names` attribute already
    row.names=1:3,
    class="data.frame"
)
##    a  b
## 1 11 21
## 2 12 22
## 3 13 23

Here is a data frame based on a length-5 list, a matrix with five rows, and a length-5 numeric vector, with some fancy row names on top:

structure(
    list(
        a=list(1, 1:2, 1:3, numeric(0), -(4:1)),
        b=cbind(u=11:15, v=21:25),
        c=runif(5)
    ),
    row.names=c("spam", "bacon", "eggs", "ham", "aubergine"),
    class="data.frame"
)
##                        a b.u b.v       c
## spam                   1  11  21 0.28758
## bacon               1, 2  12  22 0.78831
## eggs             1, 2, 3  13  23 0.40898
## ham                       14  24 0.88302
## aubergine -4, -3, -2, -1  15  25 0.94047

In general, the columns of type list can contain anything, e.g., other lists or R functions. Including atomic vectors of varying lengths just like above allows for creating something à la ragged arrays – a pretty handy scenario.

The issue with matrix entries, on the other hand, is that they appear as if they were many, but – as it will turn out in the sequel – they are often treated as a single complex column, e.g., by the index operator (see Section 12.2). Therefore, from this perspective, the above data frame has three columns, not four. Such objects can be output by aggregate (see Section 12.3), amongst others. Nevertheless, they can be very useful too, forming natural column groups which can be easily accessed and batch-processed in the same way.

Important

Unfortunately, data frames with list or matrix columns cannot normally be created with the data.frame or cbind functions, which might explain why they are less popular. This behaviour is dictated by the underlying as.data.frame methods that both of them call. As a curiosity, see help("I") though.
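To wit, wrapping a list in a call to I (“as is”) protects it from the usual conversion, letting data.frame accept it as a single list-typed column:

```r
(x <- data.frame(a=1:3, b=I(list(1, 1:2, 1:3))))  # I() protects the list
##   a       b
## 1 1       1
## 2 2    1, 2
## 3 3 1, 2, 3
sapply(x[["b"]], length)  # batch-process the ragged column
## [1] 1 2 3
```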

Exercise 12.5

Verify that for a data frame featuring a matrix column, the latter does not require its column names (the second element of dimnames) to be set.

The names and row.names attributes are special in the sense of Section 4.4.3. In particular, they can be accessed or modified by the corresponding functions.

It is worth noting that row.names(df) always returns a character vector, even when attr(df, "row.names") is an integer vector. Further, setting row.names(df) <- NULL will re-set[4] this attribute to the most commonly desired case of consecutive natural numbers, for example:

(x <- iris[c(1, 51, 101), ])  # comes with some sad row names
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
row.names(x) <- NULL  # reset to seq_len(NROW(x))
print(x)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          5.1         3.5          1.4         0.2     setosa
## 2          7.0         3.2          4.7         1.4 versicolor
## 3          6.3         3.3          6.0         2.5  virginica
Exercise 12.6

What is the name of the replacement version of the row.names method for the data.frame class?

Exercise 12.7

Implement your own version of expand.grid.

Exercise 12.8

Implement your own version of xtabs, but which does not rely on a formula interface. Allow three parameters: a data frame, the name of the “counts” column and the names of the cross-classifying variables. Hence, my_xtabs(x, "Freq", c("Var1", "Var2")) should be equivalent to xtabs(Freq~Var1+Var2, x).

12.2. Data Frame Subsetting

12.2.1. Data Frames are Lists

Data frames are named lists, where each element represents an individual column. Therefore[5], length yields the number of columns and names gives their respective labels.

Let us play with the following data frame:

(x <- data.frame(
    a=runif(6),
    b=rnorm(6),
    c=LETTERS[1:6],
    d1=c(FALSE, TRUE, FALSE, NA, FALSE, NA),
    d2=c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
))
##          a         b c    d1    d2
## 1 0.287578  0.070508 A FALSE FALSE
## 2 0.788305  0.129288 B  TRUE  TRUE
## 3 0.408977  1.715065 C FALSE FALSE
## 4 0.883017  0.460916 D    NA  TRUE
## 5 0.940467 -1.265061 E FALSE FALSE
## 6 0.045556 -0.686853 F    NA  TRUE
typeof(x)  # each data frame is a list
## [1] "list"
length(x)  # the number of columns
## [1] 5
names(x)   # column labels
## [1] "a"  "b"  "c"  "d1" "d2"

The one-argument versions of extract and index operators behave as expected. `[[` fetches (looks inside) the contents of a given column:

x[["a"]]  # or x[[1]]
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556

and `[` returns a data frame (a list with extras) comprised of the specified elements:

x["a"]  # or x[1]
##          a
## 1 0.287578
## 2 0.788305
## 3 0.408977
## 4 0.883017
## 5 0.940467
## 6 0.045556
x[c(TRUE, TRUE, FALSE, TRUE, FALSE)]
##          a         b    d1
## 1 0.287578  0.070508 FALSE
## 2 0.788305  0.129288  TRUE
## 3 0.408977  1.715065 FALSE
## 4 0.883017  0.460916    NA
## 5 0.940467 -1.265061 FALSE
## 6 0.045556 -0.686853    NA

Just like with lists, the replacement versions of the said operators can be used to add new or replace existing columns.

y <- head(x, 1)  # for a more compact display
y[["a"]] <- round(y[["a"]], 1)  # replaces the column with new content
y[["b"]] <- NULL  # removes the column, like, totally
y[["e"]] <- 10*y[["a"]]^2  # adds a new column at the end
print(y)
##     a c    d1    d2   e
## 1 0.3 A FALSE FALSE 0.9
Example 12.9

Some spam for thought to show how much we already know; here are a few common use cases of indexing and vectorised functions:

y <- head(x, 1)  # for a more compact display

Move column a to the end:

y[unique(c(names(y), "a"), fromLast=TRUE)]
##          b c    d1    d2       a
## 1 0.070508 A FALSE FALSE 0.28758

Remove columns a and c:

y[-match(c("a", "c"), names(y))]
##          b    d1    d2
## 1 0.070508 FALSE FALSE

All columns between a and c:

y[match("a", names(y)):match("c", names(y))]
##         a        b c
## 1 0.28758 0.070508 A

Names starting with d:

y[grep("^d", names(y))]
##      d1    d2
## 1 FALSE FALSE

Change name of column c to z:

names(y)[names(y) == "c"] <- "z"  # in-place
print(y)
##         a        b z    d1    d2
## 1 0.28758 0.070508 A FALSE FALSE

Change names: d1 to u and d2 to v:

names(y)[match(c("d2", "d1"), names(y))] <- c("v", "u")  # in-place
print(y)
##         a        b z     u     v
## 1 0.28758 0.070508 A FALSE FALSE

Note

Some R users might prefer the `$` operator over `[[`, but we do not. By default, the former supports partial matching of column names, which might be appealing when R is used interactively. However, it does not work on matrices, nor does it allow for programmatically generated names. It is also trickier to use with non-syntactically valid labels; compare Section 9.4.1.
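To see the partial matching in action (val below being a deliberately abbreviated column name):

```r
x <- data.frame(value=1:3)
x$val       # partial matching: silently finds `value`
## [1] 1 2 3
x[["val"]]  # exact matching by default: no such column
## NULL
```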

Exercise 12.10

Write a function names_replace that changes the names of a data frame's columns based on a translation table given in a from=to fashion, for instance:

names_replace <- function(x, ...) ...to.do...
x <- data.frame(a=1, b=2, c=3)
names_replace(x, c="new_c", a="new_a")
##   new_a b new_c
## 1     1 2     3

12.2.2. Data Frames are Matrix-like

Data frames can be considered “generalised” matrices. They store data of any kind (possibly mixed) organised in a tabular fashion. Some functions mentioned in the previous chapter will hence be overloaded for the data frame case. These include: dim (despite the lack of the dim attribute), NROW, NCOL, and dimnames (which is of course based on row.names and names).

For example:

(x <- data.frame(
    a=runif(6),
    b=rnorm(6),
    c=LETTERS[1:6],
    d1=c(FALSE, TRUE, FALSE, NA, FALSE, NA),
    d2=c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
))
##          a         b c    d1    d2
## 1 0.287578  0.070508 A FALSE FALSE
## 2 0.788305  0.129288 B  TRUE  TRUE
## 3 0.408977  1.715065 C FALSE FALSE
## 4 0.883017  0.460916 D    NA  TRUE
## 5 0.940467 -1.265061 E FALSE FALSE
## 6 0.045556 -0.686853 F    NA  TRUE
dim(x)  # the number of rows and columns
## [1] 6 5
dimnames(x)  # it is not a matrix, but a matrix-like object
## [[1]]
## [1] "1" "2" "3" "4" "5" "6"
## 
## [[2]]
## [1] "a"  "b"  "c"  "d1" "d2"

In addition to the list-like behaviour, which only allows for dealing with particular columns or groups thereof, the `[` operator was also equipped with the ability to take two indexers:

x[1:2, ]  # first two rows
##         a        b c    d1    d2
## 1 0.28758 0.070508 A FALSE FALSE
## 2 0.78831 0.129288 B  TRUE  TRUE
x[x[["a"]] >= 0.3  &  x[["a"]] <= 0.8, -2]  # or use x[, "a"]
##         a c    d1    d2
## 2 0.78831 B  TRUE  TRUE
## 3 0.40898 C FALSE FALSE

Recall the drop argument to `[` and its effects on matrix indexing. In the current case, its behaviour will be similar with regard to the operations on individual columns:

x[, 1]              # synonym: x[[1]], because drop=TRUE
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556
x[, 1, drop=FALSE]  # synonym: x[1]
##          a
## 1 0.287578
## 2 0.788305
## 3 0.408977
## 4 0.883017
## 5 0.940467
## 6 0.045556

Also, note that when we extract a single row and more than one column, drop does not really apply. This is because columns (unlike in matrices) can potentially be of different types:

x[1, 1:2]  # two numeric columns, but the result is still a data frame
##         a        b
## 1 0.28758 0.070508

However:

x[1, 1]
## [1] 0.28758
x[1, 1, drop=FALSE]
##         a
## 1 0.28758

Note

Once again let us take note of logical indexing featuring missing values:

x[x[["d1"]], ]
##            a       b    c   d1   d2
## 2    0.78831 0.12929    B TRUE TRUE
## NA        NA      NA <NA>   NA   NA
## NA.1      NA      NA <NA>   NA   NA
x[which(x[["d1"]]), ]  # drops missing values
##         a       b c   d1   d2
## 2 0.78831 0.12929 B TRUE TRUE

The default behaviour is consistent with many other R functions: it explicitly indicates that something is missing (we are selecting a “don’t know”; hence, the result is “don’t know” as well). Unfortunately, this comes with no warning. As we rarely check manually for missing values in the outputs, our absent-mindedness can lead to code bugs.

By now, we might have already noticed that the index operator adjusts (not: resets) the row.names attribute. For instance:

(xs <- x[head(order(x[["a"]], decreasing=TRUE), 3), ])
##         a        b c    d1    d2
## 5 0.94047 -1.26506 E FALSE FALSE
## 4 0.88302  0.46092 D    NA  TRUE
## 2 0.78831  0.12929 B  TRUE  TRUE

It is a version of x comprised of only the top three values in the a column. Indexing by means of character vectors will refer to row.names and names:

xs["5", c("a", "b")]
##         a       b
## 5 0.94047 -1.2651

Note that this is not the same as “xs[5, c("a", "b")]”, despite the fact that row.names is formally an integer vector here.

Note

If a data frame features a matrix, we need to use the index/extract operator twice in order to access a specific sub-column:

(x <- aggregate(iris[1], iris[5], function(x) c(Min=min(x), Max=max(x))))
##      Species Sepal.Length.Min Sepal.Length.Max
## 1     setosa              4.3              5.8
## 2 versicolor              4.9              7.0
## 3  virginica              4.9              7.9
x[["Sepal.Length"]][, "Min"]
## [1] 4.3 4.9 4.9

In other words, neither “x[["Sepal.Length.Min"]]” nor “x[, "Sepal.Length.Min"]” works.

As far as the replacement version of the index operator is concerned, it is quite a flexible tool: it allows the new content to be a vector, a data frame, a list, or even a matrix.
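For example (a small demonstration of each variant on a throwaway data frame):

```r
y <- data.frame(a=1:3, b=11:13)
y[1:2, "a"] <- c(100, 200)            # new content given as a vector
y[3, ] <- list(42, NA)                # ... as a list (one element per column)
y[, "b"] <- data.frame(b=c(7, 8, 9))  # ... as a data frame
print(y)
##     a b
## 1 100 7
## 2 200 8
## 3  42 9
```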

Exercise 12.11

Write two replacement functions[6]. First, set_row_names which replaces the row.names of a data frame with the contents of a specific column, for example:

(x <- aggregate(iris[1], iris[5], mean))  # some data frame
##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588
set_row_names(x) <- "Species"
print(x)
##            Sepal.Length
## setosa            5.006
## versicolor        5.936
## virginica         6.588

Second, reset_row_names which converts row.names to a standalone column of a given name, for instance:

reset_row_names(x) <- "Type"
print(x)
##   Sepal.Length       Type
## 1        5.006     setosa
## 2        5.936 versicolor
## 3        6.588  virginica

These two functions may be handy as they allow for writing “x[something, ]” instead of “x[x[["column"]] %in% something, ]”.

12.3. Common Operations

Below we review the most commonly applied operations related to data frame wrangling. There are a few dedicated functions and methods overloaded for the data.frame class; however, we have already mastered the skills necessary to deal with objects of this kind, in particular by solving the exercises in the preceding chapters. Let us repeat: data frames are just lists exhibiting matrix-like behaviour.

12.3.1. Ordering Rows

Ordering rows in a data frame with respect to different criteria can be easily achieved by means of the order function and the two-argument version of `[`.

For instance, here are the top six cars in terms of the time (in seconds) to complete a 402-metre race:

mtcars6 <- mtcars[order(mtcars[["qsec"]])[1:6], ]
mtcars6[["model"]] <- row.names(mtcars6)
row.names(mtcars6) <- NULL
print(mtcars6)
##    mpg cyl disp  hp drat   wt  qsec vs am gear carb          model
## 1 15.8   8  351 264 4.22 3.17 14.50  0  1    5    4 Ford Pantera L
## 2 15.0   8  301 335 3.54 3.57 14.60  0  1    5    8  Maserati Bora
## 3 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4     Camaro Z28
## 4 19.7   6  145 175 3.62 2.77 15.50  0  1    5    6   Ferrari Dino
## 5 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4     Duster 360
## 6 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4      Mazda RX4

order uses a stable sorting algorithm; therefore, sorting with respect to a different criterion will not break the relative ordering of qsec within groups of tied rows:

mtcars6[order(mtcars6[["cyl"]]), ]
##    mpg cyl disp  hp drat   wt  qsec vs am gear carb          model
## 4 19.7   6  145 175 3.62 2.77 15.50  0  1    5    6   Ferrari Dino
## 6 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4      Mazda RX4
## 1 15.8   8  351 264 4.22 3.17 14.50  0  1    5    4 Ford Pantera L
## 2 15.0   8  301 335 3.54 3.57 14.60  0  1    5    8  Maserati Bora
## 3 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4     Camaro Z28
## 5 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4     Duster 360
Example 12.12

Notice the difference between ordering by cyl and gear vs gear and cyl:

mtcars6[order(mtcars6[["cyl"]], mtcars6[["gear"]]), ]
##    mpg cyl disp  hp drat   wt  qsec vs am gear carb          model
## 6 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4      Mazda RX4
## 4 19.7   6  145 175 3.62 2.77 15.50  0  1    5    6   Ferrari Dino
## 3 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4     Camaro Z28
## 5 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4     Duster 360
## 1 15.8   8  351 264 4.22 3.17 14.50  0  1    5    4 Ford Pantera L
## 2 15.0   8  301 335 3.54 3.57 14.60  0  1    5    8  Maserati Bora
mtcars6[order(mtcars6[["gear"]], mtcars6[["cyl"]]), ]
##    mpg cyl disp  hp drat   wt  qsec vs am gear carb          model
## 3 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4     Camaro Z28
## 5 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4     Duster 360
## 6 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4      Mazda RX4
## 4 19.7   6  145 175 3.62 2.77 15.50  0  1    5    6   Ferrari Dino
## 1 15.8   8  351 264 4.22 3.17 14.50  0  1    5    4 Ford Pantera L
## 2 15.0   8  301 335 3.54 3.57 14.60  0  1    5    8  Maserati Bora

Note

Mixing increasing and decreasing orderings is tricky, as the decreasing argument to order does not currently accept multiple flags in all contexts. Perhaps the easiest way to change the ordering direction is to apply the unary minus operator on the column(s) to be sorted decreasingly.

mtcars6[order(mtcars6[["gear"]], -mtcars6[["cyl"]]), ]
##    mpg cyl disp  hp drat   wt  qsec vs am gear carb          model
## 3 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4     Camaro Z28
## 5 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4     Duster 360
## 6 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4      Mazda RX4
## 1 15.8   8  351 264 4.22 3.17 14.50  0  1    5    4 Ford Pantera L
## 2 15.0   8  301 335 3.54 3.57 14.60  0  1    5    8  Maserati Bora
## 4 19.7   6  145 175 3.62 2.77 15.50  0  1    5    6   Ferrari Dino

For factor and character columns, xtfrm can be used to convert them to sort keys first.

mtcars6[order(mtcars6[["cyl"]], -xtfrm(mtcars6[["model"]])), ]
##    mpg cyl disp  hp drat   wt  qsec vs am gear carb          model
## 6 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4      Mazda RX4
## 4 19.7   6  145 175 3.62 2.77 15.50  0  1    5    6   Ferrari Dino
## 2 15.0   8  301 335 3.54 3.57 14.60  0  1    5    8  Maserati Bora
## 1 15.8   8  351 264 4.22 3.17 14.50  0  1    5    4 Ford Pantera L
## 5 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4     Duster 360
## 3 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4     Camaro Z28

Both of the above behave like decreasing=c(FALSE, TRUE).

Exercise 12.13

Write a method sort.data.frame that orders a data frame with respect to a given set of columns.

sort.data.frame <- function(x, decreasing=FALSE, cols) ...to.do...
sort(mtcars6, cols=c("cyl", "model"))
##    mpg cyl disp  hp drat   wt  qsec vs am gear carb          model
## 4 19.7   6  145 175 3.62 2.77 15.50  0  1    5    6   Ferrari Dino
## 6 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4      Mazda RX4
## 3 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4     Camaro Z28
## 5 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4     Duster 360
## 1 15.8   8  351 264 4.22 3.17 14.50  0  1    5    4 Ford Pantera L
## 2 15.0   8  301 335 3.54 3.57 14.60  0  1    5    8  Maserati Bora

Unfortunately, the requirements that decreasing be of length one and that it be the second argument of the method are imposed by the sort S3 generic.
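
Here is one way the placeholder could be filled in (a rough sketch, not the official solution, and not tested thoroughly; it assumes cols is a character vector naming existing columns):

sort.data.frame <- function(x, decreasing=FALSE, cols)
{
    stopifnot(is.character(cols), all(cols %in% names(x)))
    keys <- lapply(x[cols], xtfrm)  # convert each column to a numeric sort key
    if (decreasing)
        keys <- lapply(keys, `-`)   # flip all keys (a single flag only)
    x[do.call(order, keys), ]
}

Thanks to xtfrm, character and factor columns are handled alongside numeric ones.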

12.3.2. Handling Duplicated Rows

duplicated, anyDuplicated, and unique have methods overloaded for the data.frame class. They can be used to indicate, get rid of, or replace the repeating rows.

sum(duplicated(iris))  # how many duplicated rows are there?
## [1] 1
iris[duplicated(iris), ]  # show the duplicated rows
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 143          5.8         2.7          5.1         1.9 virginica

12.3.3. Joining (Merging) Data Frames

The merge function can perform the JOIN operation that some readers might know from SQL[7]. It matches the items in the columns that the two given data frames share, and then returns their combination.

Example 12.14

Two calls to merge could be used to match data on programmers (each identified by developer_id and giving such details as their name, location, main skill, etc.) with the information about the open-source projects (each identified by project_id and informing us about its title, scope, web site, and so forth) they are engaged in (based on a third data frame featuring developer_id and project_id pairs).

As a simple illustration, consider the following two objects:

A <- data.frame(
    u=c("b0", "b1", "b2", "b3"),
    v=c("a0", "a1", "a2", "a3")
)

B <- data.frame(
    v=c("a0", "a2", "a2", "a4"),
    w=c("c0", "c1", "c2", "c3")
)

The two common columns, i.e., storing data of similar nature (a-something strings), are both named v.

First, the inner (natural) join, where we list only the matching pairs:

merge(A, B)  # x=A, y=B, by="v", all.x=FALSE, all.y=FALSE
##    v  u  w
## 1 a0 b0 c0
## 2 a2 b2 c1
## 3 a2 b2 c2

Note that the common column (or, more generally, columns) is included only once in the result.

The left join guarantees that all elements in the first data frame will be included in the result:

merge(A, B, all.x=TRUE)  # by="v", all.y=FALSE
##    v  u    w
## 1 a0 b0   c0
## 2 a1 b1 <NA>
## 3 a2 b2   c1
## 4 a2 b2   c2
## 5 a3 b3 <NA>

The right join includes all records in the second argument:

merge(A, B, all.y=TRUE)  # by="v", all.x=FALSE
##    v    u  w
## 1 a0   b0 c0
## 2 a2   b2 c1
## 3 a2   b2 c2
## 4 a4 <NA> c3

And the full outer join is their set-theoretic union:

merge(A, B, all.x=TRUE, all.y=TRUE)  # by="v"
##    v    u    w
## 1 a0   b0   c0
## 2 a1   b1 <NA>
## 3 a2   b2   c1
## 4 a2   b2   c2
## 5 a3   b3 <NA>
## 6 a4 <NA>   c3
Exercise 12.15

Show how match (Section 5.4.1) can be used to implement a very basic version of merge.
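
For instance, a minimal inner-join sketch (an illustration only: it relies on match, which returns just the first hit, so duplicated keys in y are not fully supported):

merge2 <- function(x, y, by)
{
    idx <- match(x[[by]], y[[by]])  # for each row in x, its (first) match in y
    keep <- !is.na(idx)             # inner join: drop the non-matching rows
    cbind(
        x[keep, , drop=FALSE],
        y[idx[keep], setdiff(names(y), by), drop=FALSE]
    )
}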

12.3.4. Aggregating and Transforming Columns

Let us discuss how to aggregate data and engineer new features. Even though we already know how to access individual columns with `[` and process them using the many vectorised functions, there are still a few interesting things to add on this matter.

It would be tempting to try implementing such operations with apply. Unfortunately, this function currently coerces its argument to a matrix. Hence, we should refrain from applying it on data frames whose columns are of mixed types[8].

However, taking into account that data frames are special lists, we can always call Map and its relatives.

Example 12.16

Given an example data frame:

(iris_sample <- iris[sample(NROW(iris), 6), ])
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 28           5.2         3.5          1.5         0.2     setosa
## 80           5.7         2.6          3.5         1.0 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 133          6.4         2.8          5.6         2.2  virginica

To get the class of each column, we can call:

sapply(iris_sample, class)  # or unlist(Map(class, iris))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

Next, here is a way to compute some aggregates of the numeric columns:

unlist(Map(mean, Filter(is.numeric, iris_sample)))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##       6.0667       3.1333       4.5500       1.7167

or:

sapply(iris_sample[sapply(iris_sample, is.numeric)], mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##       6.0667       3.1333       4.5500       1.7167

We can also fetch more than a single summary of each column:

as.data.frame(Map(
    function(x) c(Min=min(x), Max=max(x)),
    Filter(is.numeric, iris_sample)
))
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min          5.2         2.6          1.5         0.2
## Max          6.5         3.5          6.0         2.5

or:

sapply(iris_sample[sapply(iris_sample, is.numeric)], quantile, c(0, 1))
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0%            5.2         2.6          1.5         0.2
## 100%          6.5         3.5          6.0         2.5

Note that the latter calls simplify2array automatically; thus, the result is a matrix.

On the other hand, standardisation of all the numeric features can be performed, e.g., via a call:

iris_sample[] <- Map(function(x) {
    if (!is.numeric(x)) x else (x-mean(x))/sd(x)
}, iris_sample)
print(iris_sample)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 28      -1.70405     1.03024     -1.76004    -1.65318     setosa
## 80      -0.72094    -1.49854     -0.60591    -0.78117 versicolor
## 101      0.45878     0.46829      0.83674     0.85384  virginica
## 111      0.85202     0.18732      0.31738     0.30884  virginica
## 137      0.45878     0.74927      0.60591     0.74484  virginica
## 133      0.65540    -0.93659      0.60591     0.52684  virginica

12.3.5. Handling Missing Values

The is.na method for objects of class data.frame returns a logical matrix of the same dimensionality[9] indicating whether the corresponding items are missing or not. Of course, this function can still be called on individual columns as well.

Further, na.omit can be used to get rid of rows with missing values.

Exercise 12.17

Given a data frame, use is.na and other functions such as apply, approx, etc., to:

  1. remove all rows that feature at least one missing value,

  2. remove all rows that only consist of missing values,

  3. remove all columns that feature at least one missing value,

  4. for each column, replace all missing values with the column averages,

  5. for each column, replace all missing values with values that linearly interpolate between the preceding and succeeding well-defined observations (which is useful on time series), e.g., the blanks in c(0.60, 0.62, NA, 0.64, NA, NA, 0.58) should be filled so as to obtain c(0.60, 0.62, 0.63, 0.64, 0.62, 0.60, 0.58).
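
Sketches of possible approaches to tasks 1, 4, and 5 follow (illustrative only; the function names are of our own choosing):

drop_incomplete <- function(df)  # task 1; compare with na.omit
    df[!apply(is.na(df), 1, any), , drop=FALSE]

impute_means <- function(df)     # task 4; assumes all columns are numeric
{
    df[] <- Map(function(x) `[<-`(x, is.na(x), value=mean(x, na.rm=TRUE)), df)
    df
}

na_interpolate <- function(x)    # task 5, for a single numeric vector
    approx(seq_along(x), x, xout=seq_along(x))[["y"]]

na_interpolate(c(0.60, 0.62, NA, 0.64, NA, NA, 0.58))
## [1] 0.60 0.62 0.63 0.64 0.62 0.60 0.58

By default, approx skips the pairs with missing values, hence the blanks are interpolated from the surrounding well-defined observations.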

12.3.6. Reshaping Data Frames

Consider an example matrix:

A <- matrix(round(runif(6), 2), nrow=3,
    dimnames=list(
        c("X", "Y", "Z"),  # row labels
        c("u", "v")        # column labels
))
names(dimnames(A)) <- c("Row", "Col")
print(A)
##    Col
## Row    u    v
##   X 0.29 0.88
##   Y 0.79 0.94
##   Z 0.41 0.05

The as.data.frame method for the table class can be called directly on any array:

as.data.frame.table(A, responseName="Val")
##   Row Col  Val
## 1   X   u 0.29
## 2   Y   u 0.79
## 3   Z   u 0.41
## 4   X   v 0.88
## 5   Y   v 0.94
## 6   Z   v 0.05

This is an instance of reshaping an array, and more precisely, stacking: converting from a wide (okay, in this example, not so wide, as we have only two columns) to a long format.

This can be also achieved by means of the reshape function which is more flexible and operates directly on data frames (but is harder to use):

(df <- `names<-`(
    data.frame(row.names(A), A, row.names=NULL),
    c("Row", "Col.u", "Col.v")))
##   Row Col.u Col.v
## 1   X  0.29  0.88
## 2   Y  0.79  0.94
## 3   Z  0.41  0.05
(stacked <- reshape(df, varying=2:3, direction="long"))
##     Row time  Col id
## 1.u   X    u 0.29  1
## 2.u   Y    u 0.79  2
## 3.u   Z    u 0.41  3
## 1.v   X    v 0.88  1
## 2.v   Y    v 0.94  2
## 3.v   Z    v 0.05  3

Maybe the default column names are not superb, but we can always adjust them manually afterwards.

The reverse operation is called unstacking:

reshape(stacked, idvar="Row", timevar="time", drop="id", direction="wide")
##     Row Col.u Col.v
## 1.u   X  0.29  0.88
## 2.u   Y  0.79  0.94
## 3.u   Z  0.41  0.05
Exercise 12.18

Given a named numeric vector, convert it to a data frame with two columns, for instance:

convert <- function(x) ...to.do...
x <- c(spam=42, eggs=7, bacon=3)
convert(x)
##     key value
## 1  spam    42
## 2  eggs     7
## 3 bacon     3
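
One possible implementation (a sketch):

convert <- function(x)
{
    stopifnot(is.numeric(x), !is.null(names(x)))
    data.frame(key=names(x), value=unname(x))
}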
Exercise 12.19

Reshape (stack) the built-in WorldPhones dataset. Then, reshape (unstack) the stacked WorldPhones dataset. Further, unstack the stacked set but first remove[10] five random rows from it, and then randomly permute all the remaining rows. Fill the missing entries with NAs.

Exercise 12.20

Implement a basic version of as.data.frame.table manually (using rep etc.). Also, write a function as.table.data.frame that implements its reverse. Make sure both functions are compatible with each other.
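
For instance, the two-dimensional case of the former might be sketched as follows (an assumption: the array has full, named dimnames; note that, unlike the original, this version does not produce factor columns):

as_df_table2 <- function(A, responseName="Freq")
{
    dn <- dimnames(A)
    stopifnot(!is.null(dn), !is.null(names(dn)))
    `names<-`(
        data.frame(
            rep(dn[[1]], times=ncol(A)),  # row labels vary fastest...
            rep(dn[[2]], each=nrow(A)),   # ...as arrays are column-major
            as.vector(A)
        ),
        c(names(dn), responseName)
    )
}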

Exercise 12.21

The built-in Titanic is a four-dimensional array. Convert it to a long data frame.

Exercise 12.22

Perform what follows on the data frame defined below:

  1. convert the second column from character to a list of character vectors (split at ",");

  2. extract first elements from each of the vectors;

  3. extract last elements;

  4. (*) unstack the data frame;

  5. (*) stack it back to a data frame featuring a list;

  6. convert the list back to a character column (concatenate with "," as separator).

(x <- data.frame(
    name=c("Kat", "Ron", "Jo", "Mary"),
    food=c("buckwheat", "spam,bacon,spam", "", "eggs,spam,spam,lollipops")
))
##   name                     food
## 1  Kat                buckwheat
## 2  Ron          spam,bacon,spam
## 3   Jo                         
## 4 Mary eggs,spam,spam,lollipops
Exercise 12.23

Write a function that converts all matrix-based columns in a given data frame to separate, atomic columns. Also, write a function that does the opposite: one that groups all columns sharing a common prefix and turns them into matrices.

12.3.7. Aggregating Data in Groups

We can straightforwardly apply various transforms on data groups determined by a factor-like variable or a combination thereof thanks to the split.data.frame method, which returns a list of data frames.

For example:

x <- data.frame(
    a=c(    10,     20,     30,     40,     50),
    u=c("spam", "spam", "eggs", "spam", "eggs"),
    v=c(     1,      2,      1,      1,      1)
)
split(x, x["u"])  # i.e., split.data.frame(x, x["u"]) or x[["u"]]
## $eggs
##    a    u v
## 3 30 eggs 1
## 5 50 eggs 1
## 
## $spam
##    a    u v
## 1 10 spam 1
## 2 20 spam 2
## 4 40 spam 1

This splits x with respect to the u column serving as the grouping variable. On the other hand:

split(x, x[c("u", "v")])  # sep="."
## $eggs.1
##    a    u v
## 3 30 eggs 1
## 5 50 eggs 1
## 
## $spam.1
##    a    u v
## 1 10 spam 1
## 4 40 spam 1
## 
## $eggs.2
## [1] a u v
## <0 rows> (or 0-length row.names)
## 
## $spam.2
##    a    u v
## 2 20 spam 2

partitioned with respect to a combination of two factor-like sequences. Note that a non-existing level pair (eggs, 2) results in an empty data frame.

Exercise 12.24

split.data.frame (when called explicitly) can also be used to break a matrix into a list of matrices (row-wise). Given a matrix, perform its train-test split: allocate, say, 70% of the rows at random into one matrix and the remaining 30% into another.
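
A possible solution sketch (the 70% proportion and the example matrix are arbitrary):

A <- matrix(rnorm(30), nrow=10)  # an example matrix
wh <- rep(FALSE, NROW(A))
wh[sample(NROW(A), round(0.7*NROW(A)))] <- TRUE  # mark ca. 70% of the rows
sets <- split.data.frame(A, wh)  # a list of two matrices
train <- sets[["TRUE"]]
test  <- sets[["FALSE"]]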

If the aggregation of grouped data in numeric columns is needed, sapply is quite convenient. To recall, it is a combination of lapply (one-vector version of Map) and simplify2array (Section 11.1.3).

sapply(split(iris[1:2], iris[5]), sapply, mean)
##              setosa versicolor virginica
## Sepal.Length  5.006      5.936     6.588
## Sepal.Width   3.428      2.770     2.974

If the function being applied returns more than a single value, sapply will not give a particularly informative result by default: the list of matrices will be converted to a matrix whose row names are not set. As a workaround, we can either call simplify2array explicitly or pass simplify="array" to sapply:

(res <- sapply(
    split(iris[1:2], iris[5]),
    sapply,
    function(x) c(Min=min(x), Max=max(x)),
    simplify="array"
))  # or simplify2array(lapply or Map etc.)
## , , setosa
## 
##     Sepal.Length Sepal.Width
## Min          4.3         2.3
## Max          5.8         4.4
## 
## , , versicolor
## 
##     Sepal.Length Sepal.Width
## Min          4.9         2.0
## Max          7.0         3.4
## 
## , , virginica
## 
##     Sepal.Length Sepal.Width
## Min          4.9         2.2
## Max          7.9         3.8

This yields a three-dimensional array which is particularly handy if we now would like to access specific results by name:

res[, "Sepal.Length", "setosa"]
## Min Max 
## 4.3 5.8

Also, the previously mentioned as.data.frame.table method works like a charm on it (up to the column names):

as.data.frame.table(res)
##    Var1         Var2       Var3 Freq
## 1   Min Sepal.Length     setosa  4.3
## 2   Max Sepal.Length     setosa  5.8
## 3   Min  Sepal.Width     setosa  2.3
## 4   Max  Sepal.Width     setosa  4.4
## 5   Min Sepal.Length versicolor  4.9
## 6   Max Sepal.Length versicolor  7.0
## 7   Min  Sepal.Width versicolor  2.0
## 8   Max  Sepal.Width versicolor  3.4
## 9   Min Sepal.Length  virginica  4.9
## 10  Max Sepal.Length  virginica  7.9
## 11  Min  Sepal.Width  virginica  2.2
## 12  Max  Sepal.Width  virginica  3.8

Note

If the grouping (by) variable is a list of two or more factors, the combined levels will be concatenated to a single string:

as.data.frame.table(as.array(sapply(
    split(ToothGrowth["len"], ToothGrowth[c("supp", "dose")]),
    sapply,
    mean
)))
##         Var1  Freq
## 1 OJ.0.5.len 13.23
## 2 VC.0.5.len  7.98
## 3   OJ.1.len 22.70
## 4   VC.1.len 16.77
## 5   OJ.2.len 26.06
## 6   VC.2.len 26.14

Also, the name of the aggregated column (len) has been included. This behaviour yields a result that may be deemed convenient in some contexts, but not necessarily in others.

Exercise 12.25

Many aggregation functions are idempotent: when they are fed a vector whose elements are all identical, the result is exactly that unique element; min, mean, median, and max all behave this way.

Overload the mean and median methods for character vectors and factors so that they return the unique value when all elements are the same and NA otherwise.

mean.character <- function(x, na.rm=FALSE, ...) ...to.do...
mean.factor <- function(x, na.rm=FALSE, ...) ...to.do...
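
For example (a sketch; zero-length and all-NA inputs are not handled thoroughly):

mean.character <- function(x, na.rm=FALSE, ...)
{
    if (na.rm) x <- x[!is.na(x)]
    if (length(unique(x)) == 1) x[1] else NA_character_
}
mean.factor <- function(x, na.rm=FALSE, ...)
    mean(as.character(x), na.rm=na.rm)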

Thanks to this, we can also aggregate the grouping variables conveniently:

do.call(rbind.data.frame,
    lapply(split(ToothGrowth, ToothGrowth[c("supp", "dose")]), lapply, mean))
##          len supp dose
## OJ.0.5 13.23   OJ  0.5
## VC.0.5  7.98   VC  0.5
## OJ.1   22.70   OJ  1.0
## VC.1   16.77   VC  1.0
## OJ.2   26.06   OJ  2.0
## VC.2   26.14   VC  2.0

The built-in aggregate method can assist us in a situation where a single function is to be applied on all columns in a data frame.

aggregate(iris[-5], iris[5], mean)  # not: ...[[5]]
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], mean)
##   supp dose   len
## 1   OJ  0.5 13.23
## 2   VC  0.5  7.98
## 3   OJ  1.0 22.70
## 4   VC  1.0 16.77
## 5   OJ  2.0 26.06
## 6   VC  2.0 26.14

Note that the second argument, by, must be list-like (hence, a data frame is also accepted), and not a factor or an atomic vector. Also, if the function being applied returns many values, they will be wrapped into a matrix column:

(x <- aggregate(iris[2], iris[5], function(x) c(Min=min(x), Max=max(x))))
##      Species Sepal.Width.Min Sepal.Width.Max
## 1     setosa             2.3             4.4
## 2 versicolor             2.0             3.4
## 3  virginica             2.2             3.8
class(x[["Sepal.Width"]])
## [1] "matrix" "array"
x[["Sepal.Width"]]  # not: Sepal.Width.Max, etc.
##      Min Max
## [1,] 2.3 4.4
## [2,] 2.0 3.4
## [3,] 2.2 3.8

It is actually handy, because by referring to x[["Sepal.Width"]] we have access to all the stats for this column. Further, if many columns are being aggregated at the same time, we can process all the summaries in the same way.

Exercise 12.26

Check out the built-in by function which supports some basic split-apply-bind use cases. Note the particularly peculiar behaviour of the print method for the by class.

The most flexible scenario involves applying a custom function returning any set of aggregates in the form of a list and then row-binding the results to obtain a data frame.

Example 12.27

The following implements an R version of what we would express in SQL as:

SELECT supp, dose, AVG(len) AS ave_len, COUNT(*) AS count
FROM ToothGrowth
GROUP BY supp, dose

Ad rem:

do.call(rbind.data.frame, lapply(
    split(ToothGrowth, ToothGrowth[c("supp", "dose")]),
    function(df) list(
        supp=df[1, "supp"],
        dose=df[1, "dose"],
        ave_len=mean(df[["len"]]),
        count=NROW(df)
    )
))
##        supp dose ave_len count
## OJ.0.5   OJ  0.5   13.23    10
## VC.0.5   VC  0.5    7.98    10
## OJ.1     OJ  1.0   22.70    10
## VC.1     VC  1.0   16.77    10
## OJ.2     OJ  2.0   26.06    10
## VC.2     VC  2.0   26.14    10
Example 12.28

As an exercise, let us study a function that takes a named list x (can be a data frame) and a sequence of col=f pairs and applies the function f (or each function from a list of functions f) on the named element col in x:

napply <- function(x, ...)
{
    fs <- list(...)
    stopifnot(is.list(x), !is.null(names(x)))
    stopifnot(all(names(fs) %in% names(x)))
    do.call(
        c,  # concatenates lists
        lapply(
            structure(seq_along(fs), names=names(fs)),
            function(i) {  # always returns a list
                y <- x[[ names(fs)[i] ]]
                if (is.function(fs[[i]]))
                    list(fs[[i]](y))
                else
                    lapply(fs[[i]], function(f) f(y))
            }
        )
    )
}

For example:

first <- function(x, ...) head(x, n=1L, ...)  # we use it below
napply(ToothGrowth,
    supp=first, dose=first, len=list(ave=mean, count=length)
)
## $supp
## [1] VC
## Levels: OJ VC
## 
## $dose
## [1] 0.5
## 
## $len.ave
## [1] 18.813
## 
## $len.count
## [1] 60

applies first on both ToothGrowth[["supp"]] and ToothGrowth[["dose"]] as well as mean and length on ToothGrowth[["len"]]. The list names are there for dramatic effect.

And now:

do.call(
    rbind.data.frame,
    lapply(
        split(ToothGrowth, ToothGrowth[c("supp", "dose")]),
        napply,
        supp=first, dose=first, len=list(ave=mean, count=length)
    )
)
##        supp dose len.ave len.count
## OJ.0.5   OJ  0.5   13.23        10
## VC.0.5   VC  0.5    7.98        10
## OJ.1     OJ  1.0   22.70        10
## VC.1     VC  1.0   16.77        10
## OJ.2     OJ  2.0   26.06        10
## VC.2     VC  2.0   26.14        10

or even:

aaaggg <- function(x, by, ...)
    do.call(rbind.data.frame, lapply(split(x, x[by]), napply, ...))

so that:

aaaggg(iris, "Species", Species=first, Sepal.Length=mean)
##               Species Sepal.Length
## setosa         setosa        5.006
## versicolor versicolor        5.936
## virginica   virginica        6.588

This brings fun back to R programming in the sad times when many things are given to us on a plate. By the way, the above has not been tested thoroughly; it is merely a proof of concept. As usual, testing, debugging, and extending it is left as an exercise to the reader.

Example 12.29

In Section 10.5, we considered an example where we used our own group_by function together with an aggregation method overloaded for the class of the object it returns.

Here is the function that splits a data frame into a list of data frames with respect to a combination of levels in given named columns:

group_by <- function(df, by)
{
    stopifnot(is.character(by), is.data.frame(df))
    df <- droplevels(df)  # in case there are factors with empty levels
    structure(
        split(df, df[names(df) %in% by]),
        class="list_dfs",
        by=by
    )
}

The next function applies a set of aggregates on every column of each data frame in a given list (two nested lapplys plus some cosmetic additions):

aggregate.list_dfs <- function(x, FUN, ...)
{
    aggregates <- lapply(x, function(df) {
        is_by <- names(df) %in% attr(x, "by")
        res <- lapply(df[!is_by], FUN, ...)
        res_mat <- do.call(rbind, res)
        if (is.null(dimnames(res_mat)[[2]]))
            dimnames(res_mat)[[2]] <- paste0("f", seq_len(NCOL(res_mat)))
        cbind(
            `row.names<-`(df[1, is_by, drop=FALSE], NULL),
            x=row.names(res_mat),
            `row.names<-`(res_mat, NULL)
        )
    })
    combined_aggregates <- do.call(rbind.data.frame, aggregates)
    `row.names<-`(combined_aggregates, NULL)
}
aggregate(group_by(ToothGrowth, c("supp", "dose")), range)
##   supp dose   x   f1   f2
## 1   OJ  0.5 len  8.2 21.5
## 2   VC  0.5 len  4.2 11.5
## 3   OJ  1.0 len 14.5 27.3
## 4   VC  1.0 len 13.6 22.5
## 5   OJ  2.0 len 22.4 30.9
## 6   VC  2.0 len 18.5 33.9

We really want our API to be bloated; hence, let us introduce a convenience function that is a specialised version of the above:

mean.list_dfs <- function(x, ...)
    aggregate.list_dfs(x, function(y) c(Mean=mean(y, ...)))
mean(group_by(iris[51:150, c(2, 3, 5)], "Species"))
##      Species            x  Mean
## 1 versicolor  Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3  virginica  Sepal.Width 2.974
## 4  virginica Petal.Length 5.552

12.3.8. Transforming Data in Groups

Some variables will sometimes need to be transformed relative to what is happening in subsets of a dataset. This is the case, e.g., when we decide that missing values should be replaced by the corresponding within-group averages, or when we want to compute the relative ranks or z-scores.

If losing the original ordering of rows is not an issue, the standard split-apply-bind approach will suffice.

An example data frame:

(x <- data.frame(
    a=c( 10,   1,  NA,  NA,  NA,   4),
    b=c( -1,  10,  40,  30,   1,  20),
    c=runif(6),
    d=c("v", "u", "u", "u", "v", "u")
))
##    a  b       c d
## 1 10 -1 0.52811 v
## 2  1 10 0.89242 u
## 3 NA 40 0.55144 u
## 4 NA 30 0.45661 u
## 5 NA  1 0.95683 v
## 6  4 20 0.45333 u

Some operations:

fill_na <- function(x) `[<-`(x, is.na(x), value=mean(x[!is.na(x)]))
standardise <- function(x) (x-mean(x))/sd(x)

And now:

do.call(rbind.data.frame, lapply(
    split(x, x["d"]),
    function(df) {
        df[["a"]] <- fill_na(df[["a"]])
        df[["b"]] <- rank(df[["b"]])
        df[["c"]] <- standardise(df[["c"]])
        df
    }
))
##        a b        c d
## u.2  1.0 1  1.46357 u
## u.3  2.5 4 -0.17823 u
## u.4  2.5 3 -0.63478 u
## u.6  4.0 2 -0.65057 u
## v.1 10.0 1 -0.70711 v
## v.5 10.0 2  0.70711 v

Note that only the relative ordering of rows within groups has been retained. Overall, the rows are in a different order.

If this is an issue, we can use the unsplit function:

unsplit(
    lapply(
        split(x, x["d"]),
        function(df) {
            df[["a"]] <- fill_na(df[["a"]])
            df[["b"]] <- rank(df[["b"]])
            df[["c"]] <- standardise(df[["c"]])
            df
        }
    ),
    x["d"]
)
##      a b        c d
## 1 10.0 1 -0.70711 v
## 2  1.0 1  1.46357 u
## 3  2.5 4 -0.17823 u
## 4  2.5 3 -0.63478 u
## 5 10.0 2  0.70711 v
## 6  4.0 2 -0.65057 u
Exercise 12.30

Show how we can do the above also via the replacement version of split.
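
In other words (a sketch relying on the replacement version, `split<-`, which writes the processed groups back into their original positions):

y <- x
split(y, y["d"]) <- lapply(
    split(x, x["d"]),
    function(df) {
        df[["a"]] <- fill_na(df[["a"]])
        df[["b"]] <- rank(df[["b"]])
        df[["c"]] <- standardise(df[["c"]])
        df
    }
)
print(y)  # rows remain in their original order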

Example 12.31

Reverting to the previous ordering can be done manually too. This is because the split operation behaves as if we first ordered the data frame with respect to the grouping variable(s) (using a stable sorting algorithm).

Here is some transformation of a sample data frame split by a combination of two factors:

(x <- `row.names<-`(ToothGrowth[sample(NROW(ToothGrowth), 10), ], NULL))
##     len supp dose
## 1  23.0   OJ  2.0
## 2  23.3   OJ  1.0
## 3  29.4   OJ  2.0
## 4  14.5   OJ  1.0
## 5  11.2   VC  0.5
## 6  20.0   OJ  1.0
## 7  24.5   OJ  2.0
## 8  10.0   OJ  0.5
## 9   9.4   OJ  0.5
## 10  7.0   VC  0.5
(y <- do.call(rbind.data.frame, lapply(
    split(x, x[c("dose", "supp")]),  # two grouping variables
    function(df) {
        df[["len"]] <- df[["len"]] * 100^df[["dose"]] *  # whatever
            ifelse(df[["supp"]] == "OJ", -1, 1)          # do not overthink it
        df
    }
)))
##               len supp dose
## 0.5.OJ.8     -100   OJ  0.5
## 0.5.OJ.9      -94   OJ  0.5
## 1.OJ.2      -2330   OJ  1.0
## 1.OJ.4      -1450   OJ  1.0
## 1.OJ.6      -2000   OJ  1.0
## 2.OJ.1    -230000   OJ  2.0
## 2.OJ.3    -294000   OJ  2.0
## 2.OJ.7    -245000   OJ  2.0
## 0.5.VC.5      112   VC  0.5
## 0.5.VC.10      70   VC  0.5

In Section 5.4.4, we mentioned that by calling order, we can determine the inverse of a given permutation. Hence, we can call:

y[order(order(x[["supp"]], x[["dose"]])), ]  # not: dose, supp
##               len supp dose
## 2.OJ.1    -230000   OJ  2.0
## 1.OJ.2      -2330   OJ  1.0
## 2.OJ.3    -294000   OJ  2.0
## 1.OJ.4      -1450   OJ  1.0
## 0.5.VC.5      112   VC  0.5
## 1.OJ.6      -2000   OJ  1.0
## 2.OJ.7    -245000   OJ  2.0
## 0.5.OJ.8     -100   OJ  0.5
## 0.5.OJ.9      -94   OJ  0.5
## 0.5.VC.10      70   VC  0.5

Additionally, we can manually restore the original row.names, et voilà.

12.3.9. Metaprogramming-Based Techniques (*)

In Section 9.5.7, we have mentioned that due to R’s being equipped with the ability to write programs that manipulate unevaluated expressions, some functions can provide us with quite weird interfaces to a few common operations. These include transform, subset, with, and basically every procedure accepting a formula. Also, the popular data.table and dplyr packages that we briefly mention in Section 12.3.10 fall into this class.

In some contexts, they all may be found convenient[11].

However, overall, each of these methods must be carefully studied separately. This is because they can arbitrarily interpret the form of the arguments passed thereto, without taking into account their real meaning.

We try to avoid[12] them in this course, as we can do perfectly well without them. However, they are not only interesting on their own, but also quite popular in other users’ code, hence the honourable mention. Learning them in more detail is left to the kind reader as an optional exercise. Later in this book, we will return to these functions, as they serve as a very interesting illustration of how to implement our own procedures that rely on metaprogramming techniques.

Example 12.32

For instance, let us consider an example call to the subset function:

subset(iris, Sepal.Length>7.5, -(Sepal.Width:Petal.Width))
##     Sepal.Length   Species
## 106          7.6 virginica
## 118          7.7 virginica
## 119          7.7 virginica
## 123          7.7 virginica
## 132          7.9 virginica
## 136          7.7 virginica

Neither Sepal.Length>7.5 nor -(Sepal.Width:Petal.Width) makes sense as a standalone R expression, because we have not defined the named variables used therein:

Sepal.Length>7.5            # utter nonsense
## Error in eval(expr, envir, enclos): object 'Sepal.Length' not found
-(Sepal.Width:Petal.Width)  # gibberish
## Error in eval(expr, envir, enclos): object 'Sepal.Width' not found

Only from help("subset") can we learn that this tool generously decides that the second expression plays the role of a row selector, and that the third one removes all the columns between the two given ones.

In our course, we pay attention to developing transferable skills. Assuming that R is not the only language we are going to learn during our long and happy lives, it is much more likely that in the next environment, we will rather be writing something of the more basic form:

between <- function(x, from, to) (which(from == x):which(to == x))
iris[iris[["Sepal.Length"]]>7.5,
    -between(names(iris), "Sepal.Width", "Petal.Width")]
##     Sepal.Length   Species
## 106          7.6 virginica
## 118          7.7 virginica
## 119          7.7 virginica
## 123          7.7 virginica
## 132          7.9 virginica
## 136          7.7 virginica

Let us stress again that this is a book on how to become a great chef who proudly uses produce from sustainable sources, and not how to order ultra-processed food from DeliverNoodles.com.

Example 12.33

transform can be used to add, modify, and remove columns in a data frame with the possibility of referring to existing features as if they were ordinary variables:

head(transform(mtcars, log_hp=log(hp), am=2*am-1, hp=NULL))
##                    mpg cyl disp drat    wt  qsec vs am gear carb log_hp
## Mazda RX4         21.0   6  160 3.90 2.620 16.46  0  1    4    4 4.7005
## Mazda RX4 Wag     21.0   6  160 3.90 2.875 17.02  0  1    4    4 4.7005
## Datsun 710        22.8   4  108 3.85 2.320 18.61  1  1    4    1 4.5326
## Hornet 4 Drive    21.4   6  258 3.08 3.215 19.44  1 -1    3    1 4.7005
## Hornet Sportabout 18.7   8  360 3.15 3.440 17.02  0 -1    3    2 5.1648
## Valiant           18.1   6  225 2.76 3.460 20.22  1 -1    3    1 4.6540

Similarly, attach adds any named list to the search path (see Chapter 16) so that its elements can be accessed by name. This, however, does not allow any alterations thereof to be performed. As an alternative, with and within may be relied on if writing df[["..."]] each time is too difficult for us (it should not be):

within(head(mtcars), {
    log_hp <- log(hp)
    fuel_economy <- 235/mpg
    am <- factor(am, levels=c(0, 1), labels=c("no", "yes"))
    rm(list=c("mpg", "hp", "vs", "qsec"))
})
##                   cyl disp drat    wt  am gear carb fuel_economy log_hp
## Mazda RX4           6  160 3.90 2.620 yes    4    4       11.190 4.7005
## Mazda RX4 Wag       6  160 3.90 2.875 yes    4    4       11.190 4.7005
## Datsun 710          4  108 3.85 2.320 yes    4    1       10.307 4.5326
## Hornet 4 Drive      6  258 3.08 3.215  no    3    1       10.981 4.7005
## Hornet Sportabout   8  360 3.15 3.440  no    3    2       12.567 5.1648
## Valiant             6  225 2.76 3.460  no    3    1       12.983 4.6540
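Note that, unlike within, with merely computes a value in an environment where the columns are visible as variables; it does not return a modified data frame. A small sketch using the built-in mtcars:

```r
# with evaluates its second argument with the data frame's columns in scope
with(mtcars, mean(mpg[am == 1]))            # mean mpg of manual-transmission cars
mean(mtcars[["mpg"]][mtcars[["am"]] == 1])  # the equivalent basic form
```

Both calls yield identical results; the second one is the style we generally prefer in this book.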

Example 12.34

As mentioned in Section 10.3.2 (see Section 16.5 for more details), formulas are special objects that consist of two unevaluated expressions separated by a tilde (`~`).

Functions can support formulas and do what they please with them, but a popular approach is to allow them to express “something grouped by something else” or “one thing as a function of other things”.
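Before proceeding, it may be instructive to peek at a formula object directly; nothing on either side of the tilde is evaluated:

```r
f <- mpg ~ am + cyl  # neither mpg, am, nor cyl needs to be defined
class(f)             # "formula"
length(f)            # 3: the `~` operator and its two operands
```

It is entirely up to the receiving function to decide what these two operands signify.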

do.call(rbind.data.frame, lapply(split(ToothGrowth, ~supp+dose), head, 1))
##         len supp dose
## OJ.0.5 15.2   OJ  0.5
## VC.0.5  4.2   VC  0.5
## OJ.1   19.7   OJ  1.0
## VC.1   16.5   VC  1.0
## OJ.2   25.5   OJ  2.0
## VC.2   23.6   VC  2.0
aggregate(cbind(mpg, log_hp=log(hp))~am:cyl, mtcars, mean)
##   am cyl    mpg log_hp
## 1  0   4 22.900 4.4186
## 2  1   4 28.075 4.3709
## 3  0   6 19.125 4.7447
## 4  1   6 20.567 4.8552
## 5  0   8 15.050 5.2553
## 6  1   8 15.400 5.6950
head(model.frame(mpg+hp~log(hp)+I(1/qsec), mtcars))
##                   mpg + hp log(hp)    I(1/qsec)
## Mazda RX4            131.0  4.7005 0.060753....
## Mazda RX4 Wag        131.0  4.7005 0.058754....
## Datsun 710           115.8  4.5326 0.053734....
## Hornet 4 Drive       131.4  4.7005 0.051440....
## Hornet Sportabout    193.7  5.1648 0.058754....
## Valiant              123.1  4.6540 0.049455....
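For the record, the first of the above calls needs no formula at all: split also accepts a list of grouping factors, e.g., a subset of the data frame’s columns. A sketch of the equivalent basic form:

```r
# group by the supp and dose columns passed explicitly as a list of factors,
# then fetch the first row in each group
do.call(rbind.data.frame,
    lapply(split(ToothGrowth, ToothGrowth[c("supp", "dose")]), head, 1))
```

This relies only on indexing rules we already know from earlier chapters.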

If these seem esoteric, it is because they are. We need to consult the corresponding functions’ manuals to understand what they do. And, as we do not recommend their use, we are not going to explain them here.

Exercise 12.35

In the last example, the peculiar printing of the last column is caused by the overloading of which method?

12.3.10. A Note on the dplyr (tidyverse) and data.table Packages (*)

The popular third-party packages data.table and dplyr implement the most common data frame wrangling procedures. Moreover, some of the operations may be much faster for larger data sets.

They both introduce a completely new API for the operations we already know well how to perform. Furthermore, they are heavily based on metaprogramming (nonstandard evaluation). A good way to learn them is by solving some of the exercises listed below.

Note that dplyr is part of a huge system of interdependent packages called tidyverse, which tends to do things its own way and which has become quite invasive over recent years. Nevertheless, R programmers should remember that not only can they do without these packages, they sometimes must, e.g., when the processing of other prominent data structures is required, such as fancy lists and matrices. Base R always comes first as the more fundamental layer.

Important

Some functions we may find useful will (annoyingly to base R users) return objects of class tibble (tbl_df) (e.g., haven::read.xpt that reads SAS data files). However, those are in fact data.frame subclasses and we can always use as.data.frame to get our favourite objects back.

Also, we cannot stress enough that it is SQL that we recommend learning as perhaps the most powerful interface to more considerable amounts of data, and one that yields skills which can be reused later in other programming environments.

We should remember that base R proved a long time ago to be a versatile tool for rapid prototyping, calling specialised procedures written in C or Java, and wrangling data that fit into memory. For larger problems, techniques for working with batches of data, sampling methods, or aggregating data stored elsewhere are often the way to go, especially when building machine learning models or when visualisation[13] is required. Usually, the most recent data will be stored in normalised databases, and we will need to join a few tables in order to fetch something of interest in the current analysis context.
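To illustrate the batch-processing idea mentioned above, here is a minimal sketch on synthetic data; the file and the chunk size of 300 rows are arbitrary placeholders:

```r
# generate a small CSV file, then process it chunk by chunk
# via a file connection and the nrows argument to read.table
path <- tempfile(fileext=".csv")
write.csv(data.frame(x=runif(1000)), path, row.names=FALSE)
con <- file(path, open="r")
invisible(readLines(con, 1))  # consume the header line
total_rows <- 0
repeat {
    chunk <- read.table(con, nrows=300, sep=",", col.names="x")
    total_rows <- total_rows + nrow(chunk)  # or: aggregate the chunk here
    if (nrow(chunk) < 300) break  # the last, partial chunk
}
close(con)
unlink(path)
total_rows  # 1000: every row was visited exactly once
```

Only one chunk resides in memory at a time, which is precisely what makes this pattern applicable to files larger than the available RAM (compare Exercise 12.42).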

12.4. Exercises

Exercise 12.36

Answer the following questions:

  • What attributes must a data frame be equipped with?

  • If row.names is an integer vector, how to access rows labelled 1, 7, and 42?

  • How to create a data frame that features a column that is a list of character vectors of different lengths?

  • How to create a data frame that includes a matrix column?

  • How to convert all numeric columns in a data frame to a numeric matrix?

  • Assuming that x is an atomic vector, what is the difference between “as.data.frame(x)” vs “as.data.frame(as.list(x))” vs “as.data.frame(list(a=x))” vs “data.frame(a=x)”?

Exercise 12.37

Assuming that x is a data frame, what is the meaning of/difference between the following:

  • “x["u"]” vs “x[["u"]]” vs “x[, "u"]”?

  • “x["u"][1]” vs “x[["u"]][1]” vs “x[1, "u"]” vs “x[1, "u", drop=FALSE]”?

  • “x[which(x[[1]] > 0), ]” vs “x[x[[1]] > 0, ]”?

  • “x[grep("^foo", names(x))]”?

Exercise 12.38

Assume we have a data frame with columns named like: ID (character), checked (logical, possibly with missing values), category (factor), x0, … x9 (ten separate numeric columns), y0, … y9 (ten separate numeric columns), coords (numeric matrix with two columns named lat and long), and features (list of character vectors of different lengths).

  • How to extract the rows where checked is TRUE?

  • How to extract a subset comprised only of ID and x-something columns?

  • How to extract the rows for which ID is like 3 letters and then 5 digits (e.g., XYZ12345)?

  • How to select all the numeric columns in one go?

  • Assuming that the IDs are like three letters and then five digits, how to add two columns: ID3 (the letters) and ID5 (the five digits)?

  • How to get rid of all the columns between x3 and y7?

  • How to check where both lat and long in coords are positive?

  • How to add a column indicating the number of features?

  • How to extract the rows where "spam" is amongst the features?

  • How to convert it to a long data frame with two columns: ID and feature (individual strings)?

  • How to change the name of the ID column to id?

  • How to make the y-foo columns appear before the x-bar ones?

  • How to order the rows with respect to checked (FALSE first, then TRUE) and IDs (decreasingly)?

  • How to remove rows with duplicate IDs?

  • How to determine how many entries correspond to each category?

  • How to compute the average lat and long in each category?

  • How to compute the average lat and long for each category and checked combined?

Exercise 12.39

Consider the flights dataset. Give some ways to select all rows between March and October (regardless of the year).

Exercise 12.40

In this task, you will be working with a version of a dataset on 70k+ Melbourne trees (urban_forest). Before proceeding any further, read the dataset’s description.

  1. Load the downloaded dataset by calling the read.csv function.

  2. Fetch the IDs (CoM.ID) and trunk diameters (Diameter.Breast.Height) of five horse chestnuts with the smallest diameters at breast height. The output data frame must be sorted with respect to Diameter.Breast.Height, decreasingly.

  3. Create a new data frame that gives the number of trees planted in each year.

  4. Compute the average age (in years, based on Year.Planted; using aggregate) of the trees of genera (each genus separately): Eucalyptus, Platanus, Ficus, Acer, and Quercus. Depict the sorted data with barplot.

Exercise 12.41

(*) Consider the historic data dumps of https://travel.stackexchange.com/ available at https://github.com/gagolews/teaching-data/tree/master/travel_stackexchange_com_2017.

Export the CSV files located therein to an SQLite database. Then, write some R code that corresponds to the following SQL queries (use dbGetQuery to verify your results):

--- 1)
SELECT
    Users.DisplayName,
    Users.Age,
    Users.Location,
    SUM(Posts.FavoriteCount) AS FavoriteTotal,
    Posts.Title AS MostFavoriteQuestion,
    MAX(Posts.FavoriteCount) AS MostFavoriteQuestionLikes
FROM Posts
JOIN Users ON Users.Id=Posts.OwnerUserId
WHERE Posts.PostTypeId=1
GROUP BY OwnerUserId
ORDER BY FavoriteTotal DESC
LIMIT 10
--- 2)
SELECT
    Posts.ID,
    Posts.Title,
    Posts2.PositiveAnswerCount
FROM Posts
JOIN (
        SELECT
            Posts.ParentID,
            COUNT(*) AS PositiveAnswerCount
        FROM Posts
        WHERE Posts.PostTypeID=2 AND Posts.Score>0
        GROUP BY Posts.ParentID
    ) AS Posts2
    ON Posts.ID=Posts2.ParentID
ORDER BY Posts2.PositiveAnswerCount DESC
LIMIT 10
--- 3)
SELECT
    Posts.Title,
    UpVotesPerYear.Year,
    MAX(UpVotesPerYear.Count) AS Count
FROM (
        SELECT
            PostId,
            COUNT(*) AS Count,
            STRFTIME('%Y', Votes.CreationDate) AS Year
        FROM Votes
        WHERE VoteTypeId=2
        GROUP BY PostId, Year
    ) AS UpVotesPerYear
JOIN Posts ON Posts.Id=UpVotesPerYear.PostId
WHERE Posts.PostTypeId=1
GROUP BY Year
--- 4)
SELECT
    Questions.Id,
    Questions.Title,
    BestAnswers.MaxScore,
    Posts.Score AS AcceptedScore,
    BestAnswers.MaxScore-Posts.Score AS Difference
FROM (
        SELECT Id, ParentId, MAX(Score) AS MaxScore
        FROM Posts
        WHERE PostTypeId==2
        GROUP BY ParentId
    ) AS BestAnswers
JOIN (
        SELECT * FROM Posts
        WHERE PostTypeId==1
    ) AS Questions
    ON Questions.Id=BestAnswers.ParentId
JOIN Posts ON Questions.AcceptedAnswerId=Posts.Id
WHERE Difference>50
ORDER BY Difference DESC
--- 5)
SELECT
    Posts.Title,
    CmtTotScr.CommentsTotalScore
FROM (
        SELECT
            PostID,
            UserID,
            SUM(Score) AS CommentsTotalScore
        FROM Comments
        GROUP BY PostID, UserID
) AS CmtTotScr
JOIN Posts ON Posts.ID=CmtTotScr.PostID
    AND Posts.OwnerUserId=CmtTotScr.UserID
WHERE Posts.PostTypeId=1
ORDER BY CmtTotScr.CommentsTotalScore DESC
LIMIT 10
--- 6)
SELECT DISTINCT
    Users.Id,
    Users.DisplayName,
    Users.Reputation,
    Users.Age,
    Users.Location
FROM (
        SELECT
            Name, UserID
        FROM Badges
        WHERE Name IN (
            SELECT
                Name
            FROM Badges
            WHERE Class=1
            GROUP BY Name
            HAVING COUNT(*) BETWEEN 2 AND 10
        )
        AND Class=1
    ) AS ValuableBadges
JOIN Users ON ValuableBadges.UserId=Users.Id
--- 7)
SELECT
    Posts.Title,
    VotesByAge2.OldVotes
FROM Posts
JOIN (
    SELECT
        PostId,
        MAX(CASE WHEN VoteDate = 'new' THEN Total ELSE 0 END) NewVotes,
        MAX(CASE WHEN VoteDate = 'old' THEN Total ELSE 0 END) OldVotes,
        SUM(Total) AS Votes
    FROM (
        SELECT
            PostId,
            CASE STRFTIME('%Y', CreationDate)
                WHEN '2017' THEN 'new'
                WHEN '2016' THEN 'new'
                ELSE 'old'
                END VoteDate,
            COUNT(*) AS Total
        FROM Votes
        WHERE VoteTypeId=2
        GROUP BY PostId, VoteDate
    ) AS VotesByAge
    GROUP BY VotesByAge.PostId
    HAVING NewVotes=0
) AS VotesByAge2 ON VotesByAge2.PostId=Posts.ID
WHERE Posts.PostTypeId=1
ORDER BY VotesByAge2.OldVotes DESC
LIMIT 10
Exercise 12.42

(*) Generate a CSV file featuring some random data arranged in a few columns of the size at least two times larger than your available RAM. Then, export the CSV file to an SQLite database. Use file connections (Section 8.3.5) and the nrow argument to read.table to be able to process it on a chunk-by-chunk basis.

Determine whether setting colClasses in read.table speeds up the reading of large CSV files significantly or not.

Exercise 12.43

(*) Export the whole XML data dump of StackOverflow published at https://archive.org/details/stackexchange (see also https://data.stackexchange.com/) to an SQLite database.

Thus ends the second part of our course.