# 12. Data frames¶

This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Minimalist Data Wrangling with Python [27], too.

Most matrices are built on top of atomic vectors. Hence, only items of the same type can be arranged into rows and columns. On the other hand, data frames (objects of the S3 class data.frame, first introduced in [13]) are collections of vectors of the same lengths or matrices with identical row counts. Hence, they represent structured[1] data of possibly heterogeneous types. For instance:

class(iris)  # iris is an example data frame
## [1] "data.frame"
iris[c(1, 51, 101), ]  # three chosen rows from iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 101          6.3         3.3          6.0         2.5  virginica


It is a mix of numeric and factor-type data.

The good news is that not only are data frames built on named lists (e.g., to extract a column, we can refer to [[), but also many functions consider them matrix-like (e.g., to select specific rows and columns, two indexes can be passed to [ like in the preceding example). Hence, it will soon turn out that we already know a lot about performing basic data wrangling activities, even if we do not fully realise it now.

## 12.1. Creating data frames¶

### 12.1.1. data.frame and as.data.frame¶

Most frequently, we create data frames based on a series of logical, numeric, or character vectors of identical lengths. In such a scenario, the data.frame function is particularly worthwhile.

(x <- data.frame(
a=c(TRUE, FALSE),
b=1:6,
c=runif(6),
d=c("spam", "spam", "eggs")
))
##       a b       c    d
## 1  TRUE 1 0.77437 spam
## 2 FALSE 2 0.19722 spam
## 3  TRUE 3 0.97801 eggs
## 4 FALSE 4 0.20133 spam
## 5  TRUE 5 0.36124 spam
## 6 FALSE 6 0.74261 eggs


The shorter vectors were recycled. We can verify that the diverse column types were retained and no coercion was made by calling:

str(x)
## 'data.frame':	6 obs. of  4 variables:
##  $a: logi TRUE FALSE TRUE FALSE TRUE FALSE ##$ b: int  1 2 3 4 5 6
##  $c: num 0.774 0.197 0.978 0.201 0.361 ... ##$ d: chr  "spam" "spam" "eggs" "spam" ...


Important

For many reasons (see, e.g., Section 12.1.5 and Section 12.1.6), we recommend having the type of each column always checked, e.g., by calling the str function.

Many objects, such as matrices, can easily be coerced to data frames using particular as.data.frame methods. Here is an example matrix:

(A <- matrix(1:6, nrow=3,
dimnames=list(
NULL,        # no row labels
c("u", "v")  # some column labels
)))
##      u v
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6


Let’s convert it to a data frame:

as.data.frame(A)  # as.data.frame.matrix
##   u v
## 1 1 4
## 2 2 5
## 3 3 6


Note that a matrix with no row labels is printed slightly differently than a data frame with (as it will soon turn out) the default row.names.

Named lists are amongst other aspirants to a meaningful conversion. Consider an example list where all elements are vectors of the same length:

(l <- Map(
function(x) {
c(Min=min(x), Median=median(x), Mean=mean(x), Max=max(x))
},
split(iris[["Sepal.Length"]], iris[["Species"]])
))
## $setosa ## Min Median Mean Max ## 4.300 5.000 5.006 5.800 ## ##$versicolor
##    Min Median   Mean    Max
##  4.900  5.900  5.936  7.000
##
## $virginica ## Min Median Mean Max ## 4.900 6.500 6.588 7.900  Each list element will be turned into a separate column: as.data.frame(l) # as.data.frame.list ## setosa versicolor virginica ## Min 4.300 4.900 4.900 ## Median 5.000 5.900 6.500 ## Mean 5.006 5.936 6.588 ## Max 5.800 7.000 7.900  Sadly, as.data.frame is not particularly fond of lists of vectors of incompatible lengths: as.data.frame(list(a=1, b=11:12, c=21:23)) ## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names ## = TRUE, : arguments imply differing number of rows: 1, 2, 3  These vectors could have been recycled with a warning. But they were not. as.data.frame(list(a=1:4, b=11:12, c=21)) # recycling rule okay ## a b c ## 1 1 11 21 ## 2 2 12 21 ## 3 3 11 21 ## 4 4 12 21  The method for the S3 class table (mentioned in Chapter 11) can be helpful as well. Here is an example contingency table together with its unstacked (wide) version. (t <- table(mtcars[["vs"]], mtcars[["cyl"]])) ## ## 4 6 8 ## 0 1 3 14 ## 1 10 4 0 as.data.frame(t) # as.data.frame.table; see the stringsAsFactors note below! ## Var1 Var2 Freq ## 1 0 4 1 ## 2 1 4 10 ## 3 0 6 3 ## 4 1 6 4 ## 5 0 8 14 ## 6 1 8 0  as.data.frame.table is so handy that we might want to call it directly on any array. This way, we can convert it from the wide format to the long (tall) one; see Section 12.3.6 for more details. Note The aforementioned method is based on expand.grid, which determines all combinations of a given series of vectors. expand.grid(1:2, c("a", "b", "c")) # see the stringsAsFactors note below! ## Var1 Var2 ## 1 1 a ## 2 2 a ## 3 1 b ## 4 2 b ## 5 1 c ## 6 2 c  Overall, many classes of objects can be included[2] in a data frame. The popular choices include Date, POSIXct, and factor. Example 12.1 It is worth noting that format is used whilst printing the columns. Here is its custom method for what we would like to call from now on the S3 class spam: format.spam <- function(x, ...) paste0("<", x, ">")  Testing data frame printing: data.frame( a=structure(c("lovely", "yummy", "delicious"), class="spam"), b=factor(c("spam", "bacon", "spam")), c=Sys.Date()+1:3 ) ## a b c ## 1 <lovely> spam 2024-08-27 ## 2 <yummy> bacon 2024-08-28 ## 3 <delicious> spam 2024-08-29  ### 12.1.2. cbind.data.frame and rbind.data.frame¶ There are data frame-specific versions of cbind or rbind (which we discussed in the context of stacking matrices; Section 11.1.2). They are used quite eagerly: help("cbind") states that they will be referred to if at least[3] one of its arguments is a data frame, and the other arguments are atomic vectors or lists (possibly with the dim attribute). For example: x <- iris[c(1, 51, 101), c("Sepal.Length", "Species")] # whatever cbind(Yummy=c(TRUE, FALSE, TRUE), x) ## Yummy Sepal.Length Species ## 1 TRUE 5.1 setosa ## 51 FALSE 7.0 versicolor ## 101 TRUE 6.3 virginica  It added a new column to a data frame x. Moreover: rbind(x, list(42, "virginica")) ## Sepal.Length Species ## 1 5.1 setosa ## 51 7.0 versicolor ## 101 6.3 virginica ## 11 42.0 virginica  It added a new row. Note that columns are of different types. Hence, the values to row-bind had to be provided as a list. The generic vector used as a new row specifier can also be named. It can consist of sequences of length greater than one that are given in any order: rbind(x, list( Species=c("virginica", "setosa"), Sepal.Length=c(42, 7) )) ## Sepal.Length Species ## 1 5.1 setosa ## 51 7.0 versicolor ## 101 6.3 virginica ## 11 42.0 virginica ## 2 7.0 setosa  A direct referral to cbind.data.frame and rbind.data.frame will sometimes be necessary. Consider an example list of atomic vectors: x <- list(a=1:3, b=11:13, c=21:23)  First, we call the generic, which dispatches to the default method: do.call(cbind, x) ## a b c ## [1,] 1 11 21 ## [2,] 2 12 22 ## [3,] 3 13 23  It created a matrix. If we want to ensure we garner a data frame, we need to write: do.call(cbind.data.frame, x) ## a b c ## 1 1 11 21 ## 2 2 12 22 ## 3 3 13 23  This is useful for fetching outputs from Map et al., as they are wrapped inside a list. Here is a fancy way to obtain an illustrative list: l <- unname(Map( function(x) list( # objects are of different types, hence a list Sepal.Length=mean(x[["Sepal.Length"]]), Sepal.Width=mean(x[["Sepal.Width"]]), Species=x[["Species"]][1] # all are the same, so the first will do ), split(iris, iris[["Species"]]) # split.data.frame; see below )) str(l) ## List of 3 ##$ :List of 3
##   ..$Sepal.Length: num 5.01 ## ..$ Sepal.Width : num 3.43
##   ..$Species : Factor w/ 3 levels "setosa","versicolor",..: 1 ##$ :List of 3
##   ..$Sepal.Length: num 5.94 ## ..$ Sepal.Width : num 2.77
##   ..$Species : Factor w/ 3 levels "setosa","versicolor",..: 2 ##$ :List of 3
##   ..$Sepal.Length: num 6.59 ## ..$ Sepal.Width : num 2.97
##   ..$Species : Factor w/ 3 levels "setosa","versicolor",..: 3  We may now turn it into a data frame by calling: do.call(rbind.data.frame, l) ## Sepal.Length Sepal.Width Species ## 1 5.006 3.428 setosa ## 2 5.936 2.770 versicolor ## 3 6.588 2.974 virginica  On the other hand, do.call(rbind, l) does not return an amiable object type: do.call(rbind, l) ## Sepal.Length Sepal.Width Species ## [1,] 5.006 3.428 setosa ## [2,] 5.936 2.77 versicolor ## [3,] 6.588 2.974 virginica  Despite the pretty face, it is a matrix… over a list: str(do.call(rbind, l)) ## List of 9 ##$ : num 5.01
##  $: num 5.94 ##$ : num 6.59
##  $: num 3.43 ##$ : num 2.77
##  $: num 2.97 ##$ : Factor w/ 3 levels "setosa","versicolor",..: 1
##  $: Factor w/ 3 levels "setosa","versicolor",..: 2 ##$ : Factor w/ 3 levels "setosa","versicolor",..: 3
##  - attr(*, "dim")= int [1:2] 3 3
##  - attr(*, "dimnames")=List of 2
##   ..$: NULL ## ..$ : chr [1:3] "Sepal.Length" "Sepal.Width" "Species"


Structured data can be imported from external sources, such as CSV/TSV (comma/tab-separated values) or HDF5 files, relational databases supporting SQL (see Section 12.1.4), web APIs (e.g., through the curl and jsonlite packages), spreadsheets [66], and so on. In particular, read.csv and the like fetch data from plain text files consisting of records, where commas, semicolons, tabs, etc. separate the fields. For instance:

x <- data.frame(a=runif(3), b=c(TRUE, FALSE, TRUE))  # example data frame
f <- tempfile()  # temporary file name
write.csv(x, f, row.names=FALSE)  # export


It created a CSV file that looks like:

cat(readLines(f), sep="\n")  # print file contents
## "a","b"
## 0.287577520124614,TRUE
## 0.788305135443807,FALSE
## 0.4089769218117,TRUE


which can be read by calling:

read.csv(f)
##         a     b
## 1 0.28758  TRUE
## 2 0.78831 FALSE
## 3 0.40898  TRUE

Exercise 12.2

Check out help("read.table") for a long list of tunable parameters, especially: sep, dec, quote, header, comment.char, and row.names. Further, note that reading from compressed files and interned URLs is supported directly.

Important

CSV is the most portable and user-friendly format for exchanging matrix-like objects between different programs and computing languages (Python, Julia, LibreOffice Calc, etc.). Such files can be opened in any text editor.

Also, as mentioned in Section 8.3.5, we can process data frames chunk by chunk. This is beneficial especially when data do not fit into memory (compare the nrows argument to read.csv).

### 12.1.4. Interfacing relational databases and querying with SQL (*)¶

The DBI package provides a universal interface for many database management systems whose drivers are implemented in add-ons such as RSQLite, RMariaDB, RPostgreSQL, etc., or, more generally, RODBC or odbc. For more details, see Section 4 of [66].

Example 12.3

Let’s play with an in-memory (volatile) instance of an SQLite database.

library("DBI")
con <- dbConnect(RSQLite::SQLite(), ":memory:")


It returned an object representing a database connection which we can refer to in further communication. An easy way to create a database table is to call:

dbWriteTable(con, "mtcars", mtcars)  # mtcars is a toy data frame


Alternatively, we could have called dbExecute to send SQL statements such as “CREATE TABLE ...” followed by a series of “INSERT INTO ...”. We can now retrieve some data:

dbGetQuery(con, "
SELECT cyl, vs, AVG(mpg) AS mpg_ave, AVG(hp) AS hp_ave
FROM mtcars
GROUP BY cyl, vs
")
##   cyl vs mpg_ave hp_ave
## 1   4  0  26.000  91.00
## 2   4  1  26.730  81.80
## 3   6  0  20.567 131.67
## 4   6  1  19.125 115.25
## 5   8  0  15.100 209.21


It gave us an ordinary R data frame. We can process it in the same fashion as any other object of this kind.

At the end, the database connection must be closed.

dbDisconnect(con)

Exercise 12.4

Database passwords must never be stored in plain text files, let alone in R scripts in version-controlled repositories. Consider a few ways to fetch credentials programmatically:

• using environment variables (see help("Sys.getenv")),

• using the keyring package,

• calling system2 (Section 7.3.2) to retrieve it from the system keyring (e.g., the keyring package for Python provides a platform-independent command-line utility).

### 12.1.5. Strings as factors?¶

Some functions related to data frames automatically convert character vectors to factors. This behaviour is frequently controlled by an argument named stringsAsFactors. It can be particularly problematic because, when printed, factor and character columns look identical:

(x <- data.frame(a=factor(c("U", "V")), b=c("U", "V")))
##   a b
## 1 U U
## 2 V V


We recall from Section 10.3.2 that factors can be nasty. For example, passing factors as indexers in [ or converting them with as.numeric might give counterintuitive results. Also, when we want to extend factors by previously unobserved data, new levels must be added manually. This can cause unexpected behaviour in contexts such as:

rbind(x, c("W", "W"))
## Warning in [<-.factor(*tmp*, ri, value = "W"): invalid factor level,
##     NA generated
##      a b
## 1    U U
## 2    V V
## 3 <NA> W


Therefore, always having the data types checked is a praiseworthy habit. For instance:

str(x)
## 'data.frame':	2 obs. of  2 variables:
##  $a: Factor w/ 2 levels "U","V": 1 2 ##$ b: chr  "U" "V"


Before R 4.0, certain functions, including data.frame and read.csv had the stringsAsFactors argument defaulting to TRUE. It is no longer the case. However, exceptions to this rule still exist, e.g., including as.data.frame.table and expand.grid. Besides, some example data frames continue to enjoy factor-typed columns, e.g.:

class(iris[["Species"]])
## [1] "factor"


In particular, adding a new flower variety might be oblique:

iris2 <- iris[c(1, 101), ]  # example subset
rbind(iris2, c(6, 3, 3, 2, "croatica"))
## Warning in [<-.factor(*tmp*, ri, value = "croatica"): invalid factor
##     level, NA generated
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1            5.1         3.5          1.4         0.2    setosa
## 101          6.3         3.3            6         2.5 virginica
## 3              6           3            3           2      <NA>


Compare it to:

levels(iris2[["Species"]])[nlevels(iris2[["Species"]])+1] <- "croatica"
rbind(iris2, c(6, 3, 3, 2, "croatica"))
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1            5.1         3.5          1.4         0.2    setosa
## 101          6.3         3.3            6         2.5 virginica
## 3              6           3            3           2  croatica


### 12.1.6. Internal representation¶

Objects of the S3 class data.frame are erected on lists of vectors of the same length or matrices with identical row counts. Each list element defines a column or column group. Apart from class, data frames must be equipped with the following special attributes:

• names – a character vector (as usual in any named list) that gives the column labels,

• row.names – a character or integer vector with no duplicates nor missing values, doing what is advertised.

Therefore, a data frame can be created from scratch by calling, for example:

structure(
list(a=11:13, b=21:23),  # sets the names attribute
row.names=1:3,
class="data.frame"
)
##    a  b
## 1 11 21
## 2 12 22
## 3 13 23


Here is a data frame based on a list of length five, a matrix with five rows, and a numeric vector with five items. We added some fancy row names on top:

structure(
list(
a=list(1, 1:2, 1:3, numeric(0), -(4:1)),
b=cbind(u=11:15, v=21:25),
c=runif(5)
),
row.names=c("spam", "bacon", "eggs", "ham", "aubergine"),
class="data.frame"
)
##                        a b.u b.v       c
## spam                   1  11  21 0.28758
## bacon               1, 2  12  22 0.78831
## eggs             1, 2, 3  13  23 0.40898
## ham                       14  24 0.88302
## aubergine -4, -3, -2, -1  15  25 0.94047


In general, the columns of the type list can contain anything, e.g., other lists or R functions. Including atomic vectors of varying lengths, just like above, permits us to create something à la ragged arrays.

The issue with matrix entries, on the other hand, is that they appear as if they were many columns. Still, as it will turn out in the sequel, they are often treated as a single complex column, e.g., by the index operator (see Section 12.2). Therefore, from this perspective, the aforementioned data frame has three columns, not four. Such compound columns can be output by aggregate (see Section 12.3), amongst others. They are valuable in certain contexts: the column groups can be easily accessed as a whole and batch-processed in the same way.

Important

Alas, data frames with list or matrix columns cannot be created with the data.frame nor cbind functions. This might explain why they are less popular. This behaviour is dictated by the underlying as.data.frame methods, which they both call. As a curiosity, see help("I"), though.

Exercise 12.5

Verify that if a data frame carries a matrix column, this matrix does not need to have any column names (the second element of dimnames).

The names and row.names attributes are special in the sense of Section 4.4.3. In particular, they can be accessed or modified via the dedicated functions.

It is worth noting that row.names(df) always returns a character vector, even when attr(df, "row.names") is integer. Further, calling “row.names(df) <- NULL” will reset[4] this attribute to the most commonly desired case of consecutive natural numbers. For example:

(x <- iris[c(1, 51, 101), ])  # comes with some sad row names
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
row.names<-(x, NULL)  # reset to seq_len(NROW(x))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          5.1         3.5          1.4         0.2     setosa
## 2          7.0         3.2          4.7         1.4 versicolor
## 3          6.3         3.3          6.0         2.5  virginica

Exercise 12.6

Exercise 12.7

Write a version of xtabs that does not rely on a formula interface (compare Section 10.3.4). Allow three parameters: a data frame, the name of the “counts” column, and the names of the cross-classifying factors. Hence, my_xtabs(x, "Freq", c("Var1", "Var2")) should be equivalent to xtabs(Freq~Var1+Var2, x).

## 12.2. Data frame subsetting¶

### 12.2.1. Data frames are lists¶

A data frame is a named list whose elements represent individual columns. Therefore[5], length yields the number of columns and names gives their respective labels.

Let’s play around with this data frame:

(x <- data.frame(
a=runif(6),
b=rnorm(6),
c=LETTERS[1:6],
d1=c(FALSE, TRUE, FALSE, NA, FALSE, NA),
d2=c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
))
##          a         b c    d1    d2
## 1 0.287578  0.070508 A FALSE FALSE
## 2 0.788305  0.129288 B  TRUE  TRUE
## 3 0.408977  1.715065 C FALSE FALSE
## 4 0.883017  0.460916 D    NA  TRUE
## 5 0.940467 -1.265061 E FALSE FALSE
## 6 0.045556 -0.686853 F    NA  TRUE
typeof(x)  # each data frame is a list
## [1] "list"
length(x)  # the number of columns
## [1] 5
names(x)   # column labels
## [1] "a"  "b"  "c"  "d1" "d2"


The one-argument versions of extract and index operators behave as expected. [[ fetches (looks inside) the contents of a given column:

x[["a"]]  # or x[[1]]
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556


[ returns a data frame (a list with extras):

x["a"]  # or x[1]; a data frame with one column
##          a
## 1 0.287578
## 2 0.788305
## 3 0.408977
## 4 0.883017
## 5 0.940467
## 6 0.045556
x[c(TRUE, TRUE, FALSE, TRUE, FALSE)]
##          a         b    d1
## 1 0.287578  0.070508 FALSE
## 2 0.788305  0.129288  TRUE
## 3 0.408977  1.715065 FALSE
## 4 0.883017  0.460916    NA
## 5 0.940467 -1.265061 FALSE
## 6 0.045556 -0.686853    NA


Just like with lists, the replacement versions of these operators can add new columns or modify existing ones.

(y <- head(x, 1))  # example data frame
##         a        b c    d1    d2
## 1 0.28758 0.070508 A FALSE FALSE
y[["a"]] <- round(y[["a"]], 1)  # replaces the column with new content
y[["b"]] <- NULL  # removes the column, like, totally
y[["e"]] <- 10*y[["a"]]^2  # adds a new column at the end
print(y)
##     a c    d1    d2   e
## 1 0.3 A FALSE FALSE 0.9

Example 12.8

Some spam for thought to show how much we already know. Here are a few common scenarios involving indexing.

(y <- head(x, 1))  # example data frame
##         a        b c    d1    d2
## 1 0.28758 0.070508 A FALSE FALSE


Move the column a to the end:

y[unique(c(names(y), "a"), fromLast=TRUE)]
##          b c    d1    d2       a
## 1 0.070508 A FALSE FALSE 0.28758


Remove the columns a and c:

y[-match(c("a", "c"), names(y))]  # or y[setdiff(names(y), c("a", "c"))]
##          b    d1    d2
## 1 0.070508 FALSE FALSE


Select all columns between a and c:

y[match("a", names(y)):match("c", names(y))]
##         a        b c
## 1 0.28758 0.070508 A


Fetch the columns with names starting with d:

y[grep("^d", names(y), perl=TRUE)]
##      d1    d2
## 1 FALSE FALSE


Change the name of column c to z:

names(y)[names(y) == "c"] <- "z"
print(y)  # names<-(y, [<-(names(y), names(y) == "c", "z"))
##         a        b z    d1    d2
## 1 0.28758 0.070508 A FALSE FALSE


Change names: d2 to u and d1 to v:

names(y)[match(c("d2", "d1"), names(y))] <- c("v", "u")
print(y)
##         a        b z     u     v
## 1 0.28758 0.070508 A FALSE FALSE


Note

Some users prefer the $ operator over [[, but we do not. By default, the former supports partial matching of column names which might be appealing when R is used interactively. Nonetheless, it does not work on matrices nor it allows for programmatically generated names. It is also trickier to use on not syntactically valid labels; compare Section 9.3.1. Exercise 12.9 Write a function rename that changes the names of columns based on a translation table given in a from=to fashion (we have already solved a similar exercise in Chapter 9). For instance: rename <- function(x, ...) ...to.do... rename(head(x, 1), c="new_c", a="new_a") ## new_a b new_c d1 d2 ## 1 0.28758 0.070508 A FALSE FALSE  ### 12.2.2. Data frames are matrix-like¶ Data frames can be considered “generalised” matrices. They store data of any kind (possibly mixed) organised in a tabular fashion. A few functions mentioned in the previous chapter are overloaded for the data frame case. They include: dim (despite the lack of the dim attribute), NROW, NCOL, and dimnames (which is, of course, based on row.names and names). For example: (x <- data.frame( a=runif(6), b=rnorm(6), c=LETTERS[1:6], d1=c(FALSE, TRUE, FALSE, NA, FALSE, NA), d2=c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE) )) ## a b c d1 d2 ## 1 0.287578 0.070508 A FALSE FALSE ## 2 0.788305 0.129288 B TRUE TRUE ## 3 0.408977 1.715065 C FALSE FALSE ## 4 0.883017 0.460916 D NA TRUE ## 5 0.940467 -1.265061 E FALSE FALSE ## 6 0.045556 -0.686853 F NA TRUE dim(x) # the number of rows and columns ## [1] 6 5 dimnames(x) # row and column labels ## [[1]] ## [1] "1" "2" "3" "4" "5" "6" ## ## [[2]] ## [1] "a" "b" "c" "d1" "d2"  In addition to the list-like behaviour, which only allows for dealing with particular columns or their groups, the [ operator can also take two indexers: x[1:2, ] # first two rows ## a b c d1 d2 ## 1 0.28758 0.070508 A FALSE FALSE ## 2 0.78831 0.129288 B TRUE TRUE x[x[["a"]] >= 0.3 & x[["a"]] <= 0.8, -2] # or use x[, "a"] ## a c d1 d2 ## 2 0.78831 B TRUE TRUE ## 3 0.40898 C FALSE FALSE  Recall the drop argument to [ and its effects on matrix indexing (Section 11.2.4). In the current case, its behaviour will be similar with regard to the operations on individual columns: x[, 1] # synonym: x[[1]] because drop=TRUE ## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556 x[, 1, drop=FALSE] # synonym: x[1] ## a ## 1 0.287578 ## 2 0.788305 ## 3 0.408977 ## 4 0.883017 ## 5 0.940467 ## 6 0.045556  When we extract a single row and more than one column, drop does not apply. It is because columns (unlike in matrices) can potentially be of different types: x[1, 1:2] # two numeric columns but the result is still a numeric ## a b ## 1 0.28758 0.070508  However: x[1, 1] # a single value ## [1] 0.28758 x[1, 1, drop=FALSE] # a data frame with one row and one column ## a ## 1 0.28758  Note Once again, let’s take note of logical indexing in the presence of missing values: x[x[["d1"]], ] # d1 is of the type logical ## a b c d1 d2 ## 2 0.78831 0.12929 B TRUE TRUE ## NA NA NA <NA> NA NA ## NA.1 NA NA <NA> NA NA x[which(x[["d1"]]), ] # which drops missing values ## a b c d1 d2 ## 2 0.78831 0.12929 B TRUE TRUE  The default behaviour is consistent with many other R functions. It explicitly indicates that something is missing. After all, when we select a “don’t know”, the result is unknown as well. Regretfully, this comes with no warning. As we seldom check missing values in the outputs manually, our absent-mindedness can lead to code bugs. By far, we might have already noted that the index operator adjusts (not: resets) the row.names attribute. For instance: (xs <- x[order(x[["a"]], decreasing=TRUE)[1:3], ]) ## a b c d1 d2 ## 5 0.94047 -1.26506 E FALSE FALSE ## 4 0.88302 0.46092 D NA TRUE ## 2 0.78831 0.12929 B TRUE TRUE  It is a version of x comprised of the top three values in the a column. Indexing by means of character vectors will refer to row.names and names: xs["5", c("a", "b")] ## a b ## 5 0.94047 -1.2651  It is not the same as xs[5, c("a", "b")], even though row.names is formally an integer vector here. Regarding the replacement version of the two-indexer variant of the [ operator, it is a flexible tool. It permits the new content to be a vector, a data frame, a list, or even a matrix. Verifying this is left as an exercise. Note If a data frame carries a matrix, to access a specific sub-column, we need to use the index/extract operator twice: (x <- aggregate(iris[1], iris[5], function(x) c(Min=min(x), Max=max(x)))) ## Species Sepal.Length.Min Sepal.Length.Max ## 1 setosa 4.3 5.8 ## 2 versicolor 4.9 7.0 ## 3 virginica 4.9 7.9 x[["Sepal.Length"]][, "Min"] ## [1] 4.3 4.9 4.9  In other words, neither x[["Sepal.Length.Min"]] nor x[, "Sepal.Length.Min"] works. Exercise 12.10 Write two replacement functions[6]. First, author set_row_names which replaces the row.names of a data frame with the contents of a specific column. For example: (x <- aggregate(iris[1], iris[5], mean)) # an example data frame ## Species Sepal.Length ## 1 setosa 5.006 ## 2 versicolor 5.936 ## 3 virginica 6.588 set_row_names(x) <- "Species" print(x) ## Sepal.Length ## setosa 5.006 ## versicolor 5.936 ## virginica 6.588  Second, implement reset_row_names which converts row.names to a standalone column of a given name. For instance: reset_row_names(x) <- "Type" print(x) ## Sepal.Length Type ## 1 5.006 setosa ## 2 5.936 versicolor ## 3 6.588 virginica  These two functions may be handy for they enable writing x[something, ] instead of x[x[["column"]] %in% something, ]. ## 12.3. Common operations¶ Below we review the most commonly applied operations related to data frame wrangling. We have a few dedicated functions or methods overloaded for the data.frame class. However, we have already mastered most skills to deal with such objects effectively. Let’s repeat: data frames are just lists exhibiting matrix-like behaviour. ### 12.3.1. Ordering rows¶ Ordering rows in a data frame with respect to different criteria can be easily achieved through the order function and the two-indexer version of [. For instance, here are the six fastest cars from mtcars in terms of the time (in seconds) to complete a 402-metre race: mtcars6 <- mtcars[order(mtcars[["qsec"]])[1:6], c("qsec", "cyl", "gear")] (mtcars6 <- row.names<-(cbind(model=row.names(mtcars6), mtcars6), NULL)) ## model qsec cyl gear ## 1 Ford Pantera L 14.50 8 5 ## 2 Maserati Bora 14.60 8 5 ## 3 Camaro Z28 15.41 8 3 ## 4 Ferrari Dino 15.50 6 5 ## 5 Duster 360 15.84 8 3 ## 6 Mazda RX4 16.46 6 4  order uses a stable sorting algorithm. Therefore, any sorting with respect to a different criterion will not break the relative ordering of qsec in row groups with ties: mtcars6[order(mtcars6[["cyl"]]), ] ## model qsec cyl gear ## 4 Ferrari Dino 15.50 6 5 ## 6 Mazda RX4 16.46 6 4 ## 1 Ford Pantera L 14.50 8 5 ## 2 Maserati Bora 14.60 8 5 ## 3 Camaro Z28 15.41 8 3 ## 5 Duster 360 15.84 8 3  qsec is still increasing in each of the two cyl groups. Example 12.11 Notice the difference between ordering by cyl and gear: mtcars6[order(mtcars6[["cyl"]], mtcars6[["gear"]]), ] ## model qsec cyl gear ## 6 Mazda RX4 16.46 6 4 ## 4 Ferrari Dino 15.50 6 5 ## 3 Camaro Z28 15.41 8 3 ## 5 Duster 360 15.84 8 3 ## 1 Ford Pantera L 14.50 8 5 ## 2 Maserati Bora 14.60 8 5  vs gear and cyl: mtcars6[order(mtcars6[["gear"]], mtcars6[["cyl"]]), ] ## model qsec cyl gear ## 3 Camaro Z28 15.41 8 3 ## 5 Duster 360 15.84 8 3 ## 6 Mazda RX4 16.46 6 4 ## 4 Ferrari Dino 15.50 6 5 ## 1 Ford Pantera L 14.50 8 5 ## 2 Maserati Bora 14.60 8 5  Note Mixing increasing and decreasing ordering is tricky as the decreasing argument to order currently does not accept multiple flags in all the contexts. Perhaps the easiest way to change the ordering direction is to use the unary minus operator on the column(s) to be sorted decreasingly. mtcars6[order(mtcars6[["gear"]], -mtcars6[["cyl"]]), ] ## model qsec cyl gear ## 3 Camaro Z28 15.41 8 3 ## 5 Duster 360 15.84 8 3 ## 6 Mazda RX4 16.46 6 4 ## 1 Ford Pantera L 14.50 8 5 ## 2 Maserati Bora 14.60 8 5 ## 4 Ferrari Dino 15.50 6 5  For factor and character columns, xtfrm can convert them to sort keys first. mtcars6[order(mtcars6[["cyl"]], -xtfrm(mtcars6[["model"]])), ] ## model qsec cyl gear ## 6 Mazda RX4 16.46 6 4 ## 4 Ferrari Dino 15.50 6 5 ## 2 Maserati Bora 14.60 8 5 ## 1 Ford Pantera L 14.50 8 5 ## 5 Duster 360 15.84 8 3 ## 3 Camaro Z28 15.41 8 3  Both statements act like the unsupported decreasing=c(FALSE, TRUE). Exercise 12.12 Write a method sort.data.frame that orders a data frame with respect to a given set of columns. sort.data.frame <- function(x, decreasing=FALSE, cols) ...to.do... sort(mtcars6, cols=c("cyl", "model")) ## model qsec cyl gear ## 4 Ferrari Dino 15.50 6 5 ## 6 Mazda RX4 16.46 6 4 ## 3 Camaro Z28 15.41 8 3 ## 5 Duster 360 15.84 8 3 ## 1 Ford Pantera L 14.50 8 5 ## 2 Maserati Bora 14.60 8 5  Unfortunately, that decreasing must be of length one and be placed as the second argument is imposed by the sort S3 generic. ### 12.3.2. Handling duplicated rows¶ duplicated, anyDuplicated, and unique have methods overloaded for the data.frame class. They can be used to indicate, get rid of, or replace the repeating rows. sum(duplicated(iris)) # how many duplicated rows are there? ## [1] 1 iris[duplicated(iris), ] # show the duplicated rows ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 143 5.8 2.7 5.1 1.9 virginica  ### 12.3.3. Joining (merging) data frames¶ The merge function can perform the join operation that some readers might know from SQL[7]. It matches the items in the columns that two given data frames somewhat share. Then, it returns the combination of the corresponding rows. Example 12.13 Two calls to merge could be used to match data on programmers (each identified by developer_id and giving such details as their name, location, main skill, etc.) with the information about the open-source projects (each identified by project_id and informing us about its title, scope, website, and so forth) they are engaged in (based on a third data frame defining developer_id and project_id pairs). As a simple illustration, consider two objects: A <- data.frame( u=c("b0", "b1", "b2", "b3"), v=c("a0", "a1", "a2", "a3") ) B <- data.frame( v=c("a0", "a2", "a2", "a4"), w=c("c0", "c1", "c2", "c3") )  The two common columns, i.e., storing data of similar nature (a-something strings), are both named v. First is the inner (natural) join, where we list only the matching pairs: merge(A, B) # x=A, y=B, by="v", all.x=FALSE, all.y=FALSE ## v u w ## 1 a0 b0 c0 ## 2 a2 b2 c1 ## 3 a2 b2 c2  The common column is included in the result only once. Next, the left join guarantees that all elements in the first data frame will be included in the result: merge(A, B, all.x=TRUE) # by="v", all.y=FALSE ## v u w ## 1 a0 b0 c0 ## 2 a1 b1 <NA> ## 3 a2 b2 c1 ## 4 a2 b2 c2 ## 5 a3 b3 <NA>  The right join includes all records in the second argument: merge(A, B, all.y=TRUE) # by="v", all.x=FALSE ## v u w ## 1 a0 b0 c0 ## 2 a2 b2 c1 ## 3 a2 b2 c2 ## 4 a4 <NA> c3  Lastly, the full outer join is their set-theoretic union: merge(A, B, all.x=TRUE, all.y=TRUE) # by="v" ## v u w ## 1 a0 b0 c0 ## 2 a1 b1 <NA> ## 3 a2 b2 c1 ## 4 a2 b2 c2 ## 5 a3 b3 <NA> ## 6 a4 <NA> c3  Joining on more than one common column is also supported. Exercise 12.14 Show how match (Section 5.4.1) can help author a very basic version of merge. Exercise 12.15 Implement a version of match that allows the x and table arguments to be data frames with the same number of columns so that also the matching of pairs, triples, etc. is possible. ### 12.3.4. Aggregating and transforming columns¶ It might be tempting to try aggregating data frames with apply. Sadly, currently, this function coerces its argument to a matrix. Hence, we should refrain from applying it on data frames whose columns are of mixed types. However, taking into account that data frames are special lists, we can always call Map and its relatives. Example 12.16 Consider an example data frame: (iris_sample <- iris[sample(NROW(iris), 6), ]) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 28 5.2 3.5 1.5 0.2 setosa ## 80 5.7 2.6 3.5 1.0 versicolor ## 101 6.3 3.3 6.0 2.5 virginica ## 111 6.5 3.2 5.1 2.0 virginica ## 137 6.3 3.4 5.6 2.4 virginica ## 133 6.4 2.8 5.6 2.2 virginica  To get the class of each column, we can call: sapply(iris_sample, class) # or unlist(Map(class, iris)) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## "numeric" "numeric" "numeric" "numeric" "factor"  Next, here is a way to compute some aggregates of the numeric columns: unlist(Map(mean, Filter(is.numeric, iris_sample))) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 6.0667 3.1333 4.5500 1.7167  or: sapply(iris_sample[sapply(iris_sample, is.numeric)], mean) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 6.0667 3.1333 4.5500 1.7167  We can also fetch more than a single summary of each column: as.data.frame(Map( function(x) c(Min=min(x), Max=max(x)), Filter(is.numeric, iris_sample) )) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min 5.2 2.6 1.5 0.2 ## Max 6.5 3.5 6.0 2.5  or: sapply(iris_sample[sapply(iris_sample, is.numeric)], quantile, c(0, 1)) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## 0% 5.2 2.6 1.5 0.2 ## 100% 6.5 3.5 6.0 2.5  The latter called simplify2array automatically. Thus, the result is a matrix. On the other hand, the standardisation of all numeric features can be performed, e.g., via a call: iris_sample[] <- Map(function(x) { if (!is.numeric(x)) x else (x-mean(x))/sd(x) }, iris_sample) print(iris_sample) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 28 -1.70405 1.03024 -1.76004 -1.65318 setosa ## 80 -0.72094 -1.49854 -0.60591 -0.78117 versicolor ## 101 0.45878 0.46829 0.83674 0.85384 virginica ## 111 0.85202 0.18732 0.31738 0.30884 virginica ## 137 0.45878 0.74927 0.60591 0.74484 virginica ## 133 0.65540 -0.93659 0.60591 0.52684 virginica  ### 12.3.5. Handling missing values¶ The is.na method for objects of the class data.frame returns a logical matrix of the same dimensionality[8], indicating whether the corresponding items are missing or not. Of course, the default method can still be called on individual columns. Further, na.omit gets rid of rows with missing values. Exercise 12.17 Given a data frame, use is.na and other functions like apply or approx to: 1. remove all rows that bear at least one missing value, 2. remove all rows that only consist of missing values, 3. remove all columns that carry at least one missing value, 4. for each column, replace all missing values with the column averages, 5. for each column, replace all missing values with values that linearly interpolate between the preceding and succeeding well-defined observations (which is of use in time series processing), e.g., the blanks in c(0.60, 0.62, NA, 0.64, NA, NA, 0.58) should be filled to obtain c(0.60, 0.62, 0.63, 0.64, 0.62, 0.60, 0.58). ### 12.3.6. Reshaping data frames¶ Consider an example matrix: A <- matrix(round(runif(6), 2), nrow=3, dimnames=list( c("X", "Y", "Z"), # row labels c("u", "v") # column labels )) names(dimnames(A)) <- c("Row", "Col") print(A) ## Col ## Row u v ## X 0.29 0.88 ## Y 0.79 0.94 ## Z 0.41 0.05  The as.data.frame method for the table class can be called directly on any array-like object: as.data.frame.table(A, responseName="Val", stringsAsFactors=FALSE) ## Row Col Val ## 1 X u 0.29 ## 2 Y u 0.79 ## 3 Z u 0.41 ## 4 X v 0.88 ## 5 Y v 0.94 ## 6 Z v 0.05  It is an instance of array reshaping. More precisely, we call it stacking. We converted from a wide (okay, in this example, not so wide, as we only have two columns) to a long (tall) format. The above can also be achieved by means of the reshape function which is more flexible and operates directly on data frames (but is harder to use): (df <- names<-( data.frame(row.names(A), A, row.names=NULL), c("Row", "Col.u", "Col.v"))) ## Row Col.u Col.v ## 1 X 0.29 0.88 ## 2 Y 0.79 0.94 ## 3 Z 0.41 0.05 (stacked <- reshape(df, varying=2:3, direction="long")) ## Row time Col id ## 1.u X u 0.29 1 ## 2.u Y u 0.79 2 ## 3.u Z u 0.41 3 ## 1.v X v 0.88 1 ## 2.v Y v 0.94 2 ## 3.v Z v 0.05 3  Maybe the default column names are not superb, but we can adjust them manually afterwards. The reverse operation is called unstacking: reshape(stacked, idvar="Row", timevar="time", drop="id", direction="wide") ## Row Col.u Col.v ## 1.u X 0.29 0.88 ## 2.u Y 0.79 0.94 ## 3.u Z 0.41 0.05  Exercise 12.18 Given a named numeric vector, convert it to a data frame with two columns. For instance: convert <- function(x) ...to.do... x <- c(spam=42, eggs=7, bacon=3) convert(x) ## key value ## 1 spam 42 ## 2 eggs 7 ## 3 bacon 3  Exercise 12.19 Stack the WorldPhones dataset. Then, unstack it back. Furthermore, unstack the stacked set after removing[9] five random rows from it and randomly permuting all the remaining rows. Fill in the missing entries with NAs. Exercise 12.20 Implement a basic version of as.data.frame.table manually (using rep etc.). Also, write a function as.table.data.frame that computes its reverse. Make sure both functions are compatible with each other. Exercise 12.21 Titanic is a four-dimensional array. Convert it to a long data frame. Exercise 12.22 Perform what follows on the undermentioned data frame: 1. convert the second column to a list of character vectors (split at ","); 2. extract the first elements from each of such vectors; 3. extract the last elements; 4. (*) unstack the split data frame; 5. (*) stack it back to a data frame that carries a list; 6. convert the list back to a character column (concatenate with "," as separator). (x <- data.frame( name=c("Kat", "Ron", "Jo", "Mary"), food=c("buckwheat", "spam,bacon,spam", "", "eggs,spam,spam,lollipops") )) ## name food ## 1 Kat buckwheat ## 2 Ron spam,bacon,spam ## 3 Jo ## 4 Mary eggs,spam,spam,lollipops  Exercise 12.23 Write a function that converts all matrix-based columns in a given data frame to separate atomic columns. Furthermore, author a function that does the opposite, i.e., groups all columns with similar prefixes and turns them into matrices. ### 12.3.7. Aggregating data in groups¶ We can straightforwardly apply various transforms on data groups determined by a factor-like variable or their combination thanks to the split.data.frame method, which returns a list of data frames. For example: x <- data.frame( a=c( 10, 20, 30, 40, 50), u=c("spam", "spam", "eggs", "spam", "eggs"), v=c( 1, 2, 1, 1, 1) ) split(x, x["u"]) # i.e., split.data.frame(x, x["u"]) or x[["u"]] ##$eggs
##    a    u v
## 3 30 eggs 1
## 5 50 eggs 1
##
## $spam ## a u v ## 1 10 spam 1 ## 2 20 spam 2 ## 4 40 spam 1  It split x with respect to the u column, which served as the grouping variable. On the other hand: split(x, x[c("u", "v")]) # sep="." ##$eggs.1
##    a    u v
## 3 30 eggs 1
## 5 50 eggs 1
##
## $spam.1 ## a u v ## 1 10 spam 1 ## 4 40 spam 1 ## ##$eggs.2
## [1] a u v
## <0 rows> (or 0-length row.names)
##
## $spam.2 ## a u v ## 2 20 spam 2  It partitioned with respect to a combination of two factor-like sequences. A nonexistent level pair (eggs, 2) resulted in an empty data frame. Exercise 12.24 split.data.frame (when called directly) can also be used to break a matrix into a list of matrices (rowwisely). Given a matrix, perform its train-test split: allocate, say, 70% of the rows at random into one matrix and the remaining 30% into another. sapply is convenient if we need to aggregate grouped numeric data. To recall, it is a combination of lapply (one-vector version of Map) and simplify2array (Section 11.1.3). sapply(split(iris[1:2], iris[5]), sapply, mean) ## setosa versicolor virginica ## Sepal.Length 5.006 5.936 6.588 ## Sepal.Width 3.428 2.770 2.974  If the function to apply returns more than a single value, sapply will not return too informative a result. The list of matrices converted to a matrix will not have the row.names argument set: MinMax <- function(x) c(Min=min(x), Max=max(x)) sapply(split(iris[1:2], iris[5]), sapply, MinMax) ## setosa versicolor virginica ## [1,] 4.3 4.9 4.9 ## [2,] 5.8 7.0 7.9 ## [3,] 2.3 2.0 2.2 ## [4,] 4.4 3.4 3.8  As a workaround, we either call simplify2array explicitly, or pass simplify="array" to sapply: (res <- sapply( split(iris[1:2], iris[5]), sapply, MinMax, simplify="array" )) # or simplify2array(lapply(...) or Map(...) etc.) ## , , setosa ## ## Sepal.Length Sepal.Width ## Min 4.3 2.3 ## Max 5.8 4.4 ## ## , , versicolor ## ## Sepal.Length Sepal.Width ## Min 4.9 2.0 ## Max 7.0 3.4 ## ## , , virginica ## ## Sepal.Length Sepal.Width ## Min 4.9 2.2 ## Max 7.9 3.8  It produced a three-dimensional array, which is particularly handy if we now wish to access specific results by name: res[, "Sepal.Length", "setosa"] ## Min Max ## 4.3 5.8  The previously mentioned as.data.frame.table method will work on it like a charm (up to the column names, which we can change): as.data.frame.table(res, stringsAsFactors=FALSE) ## Var1 Var2 Var3 Freq ## 1 Min Sepal.Length setosa 4.3 ## 2 Max Sepal.Length setosa 5.8 ## 3 Min Sepal.Width setosa 2.3 ## 4 Max Sepal.Width setosa 4.4 ## 5 Min Sepal.Length versicolor 4.9 ## 6 Max Sepal.Length versicolor 7.0 ## 7 Min Sepal.Width versicolor 2.0 ## 8 Max Sepal.Width versicolor 3.4 ## 9 Min Sepal.Length virginica 4.9 ## 10 Max Sepal.Length virginica 7.9 ## 11 Min Sepal.Width virginica 2.2 ## 12 Max Sepal.Width virginica 3.8  Note If the grouping (by) variable is a list of two or more factors, the combined levels will be concatenated to a single string. This behaviour yields a result that may be deemed convenient in some contexts but not necessarily so in other ones. as.data.frame.table(as.array(sapply( split(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], sep="_"), sapply, # but check also: function(...) as.matrix(sapply(...)), mean )), stringsAsFactors=FALSE) ## Var1 Freq ## 1 OJ_0.5.len 13.23 ## 2 VC_0.5.len 7.98 ## 3 OJ_1.len 22.70 ## 4 VC_1.len 16.77 ## 5 OJ_2.len 26.06 ## 6 VC_2.len 26.14  The name of the aggregated column (len) has been included, because sapply simplifies the result to a flat vector too eagerly. aggregate can assist us when a single function is to be applied on all columns in a data frame: aggregate(iris[-5], iris[5], mean) # neither iris[[5]] nor iris[, 5] ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## 1 setosa 5.006 3.428 1.462 0.246 ## 2 versicolor 5.936 2.770 4.260 1.326 ## 3 virginica 6.588 2.974 5.552 2.026 aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], mean) ## supp dose len ## 1 OJ 0.5 13.23 ## 2 VC 0.5 7.98 ## 3 OJ 1.0 22.70 ## 4 VC 1.0 16.77 ## 5 OJ 2.0 26.06 ## 6 VC 2.0 26.14  The second argument, by, must be list-like (this includes data frames). Neither a factor nor an atomic vector is acceptable. Also, if the function being applied returns many values, they will be wrapped into a matrix column: (x <- aggregate(iris[2], iris[5], function(x) c(Min=min(x), Max=max(x)))) ## Species Sepal.Width.Min Sepal.Width.Max ## 1 setosa 2.3 4.4 ## 2 versicolor 2.0 3.4 ## 3 virginica 2.2 3.8 class(x[["Sepal.Width"]]) ## [1] "matrix" "array" x[["Sepal.Width"]] # not: Sepal.Width.Max, etc. ## Min Max ## [1,] 2.3 4.4 ## [2,] 2.0 3.4 ## [3,] 2.2 3.8  It is actually handy: by referring to x[["Sepal.Width"]], we access all the stats for this column. Further, if many columns are being aggregated simultaneously, we can process all the summaries in the same way. Exercise 12.25 Check out the by function, which supports basic split-apply-bind use cases. Note the particularly odd behaviour of the print method for the by class. The most flexible scenario involves applying a custom function returning any set of aggregates in the form of a list and then row-binding the results to obtain a data frame. Example 12.26 The following implements an R version of what we would express in SQL as: SELECT supp, dose, AVG(len) AS ave_len, COUNT(*) AS count FROM ToothGrowth GROUP BY supp, dose  Ad rem: do.call(rbind.data.frame, lapply( split(ToothGrowth, ToothGrowth[c("supp", "dose")]), function(df) list( supp=df[1, "supp"], dose=df[1, "dose"], ave_len=mean(df[["len"]]), count=NROW(df) ) )) ## supp dose ave_len count ## OJ.0.5 OJ 0.5 13.23 10 ## VC.0.5 VC 0.5 7.98 10 ## OJ.1 OJ 1.0 22.70 10 ## VC.1 VC 1.0 16.77 10 ## OJ.2 OJ 2.0 26.06 10 ## VC.2 VC 2.0 26.14 10  Exercise 12.27 Many aggregation functions are idempotent, which means that when they are fed with a vector with all the elements being identical, the result is exactly that unique element: min, mean, median, and max behave this way. Overload the mean and median methods for character vectors and factors. They should return NA and give a warning for sequences where not all elements are the same. Otherwise, they are expected to output the unique value. mean.character <- function(x, na.rm=FALSE, ...) ...to.do... mean.factor <- function(x, na.rm=FALSE, ...) ...to.do...  This way, we can also aggregate the grouping variables conveniently: do.call(rbind.data.frame, lapply(split(ToothGrowth, ToothGrowth[c("supp", "dose")]), lapply, mean)) ## len supp dose ## OJ.0.5 13.23 OJ 0.5 ## VC.0.5 7.98 VC 0.5 ## OJ.1 22.70 OJ 1.0 ## VC.1 16.77 VC 1.0 ## OJ.2 26.06 OJ 2.0 ## VC.2 26.14 VC 2.0  Example 12.28 Let’s study a function that takes a named list x (can be a data frame) and a sequence of col=f pairs, and applies the function f (or each function from a list of functions f) on the element named col in x: napply <- function(x, ...) { fs <- list(...) cols <- names(fs) stopifnot(is.list(x), !is.null(names(x))) stopifnot(all(cols %in% names(x))) do.call( c, # concatenates lists lapply( structure(seq_along(fs), names=cols), function(i) { # always returns a list y <- x[[ cols[i] ]] if (is.function(fs[[i]])) list(fs[[i]](y)) else lapply(fs[[i]], function(f) f(y)) } ) ) }  For example: first <- function(x, ...) head(x, n=1L, ...) # helper function napply(ToothGrowth, supp=first, dose=first, len=list(ave=mean, count=length) ) ##$supp
## [1] VC
## Levels: OJ VC
##
## $dose ## [1] 0.5 ## ##$len.ave
## [1] 18.813
##
## \$len.count
## [1] 60


It applied first on both ToothGrowth[["supp"]] and ToothGrowth[["dose"]] as well as mean and length on ToothGrowth[["len"]]. We included list names for a more dramatic effect. And now:

do.call(
rbind.data.frame,
lapply(
split(ToothGrowth, ToothGrowth[c("supp", "dose")]),
napply,
supp=first, dose=first, len=list(ave=mean, count=length)
)
)
##        supp dose len.ave len.count
## OJ.0.5   OJ  0.5   13.23        10
## VC.0.5   VC  0.5    7.98        10
## OJ.1     OJ  1.0   22.70        10
## VC.1     VC  1.0   16.77        10
## OJ.2     OJ  2.0   26.06        10
## VC.2     VC  2.0   26.14        10


or even:

gapply <- function(x, by, ...)
do.call(rbind.data.frame, lapply(
split(x, x[by]),
function(x, ...)
do.call(napply, c(  # add all by=first calls
x=list(x),
names<-(rep(list(first), length(by)), by),
list(...)
)),
...
))


And now:

gapply(iris, "Species", Sepal.Length=mean, Sepal.Width=list(min, max))
##               Species Sepal.Length Sepal.Width1 Sepal.Width2
## setosa         setosa        5.006          2.3          4.4
## versicolor versicolor        5.936          2.0          3.4
## virginica   virginica        6.588          2.2          3.8
gapply(ToothGrowth, c("supp", "dose"), len=list(ave=mean, count=length))
##        supp dose len.ave len.count
## OJ.0.5   OJ  0.5   13.23        10
## VC.0.5   VC  0.5    7.98        10
## OJ.1     OJ  1.0   22.70        10
## VC.1     VC  1.0   16.77        10
## OJ.2     OJ  2.0   26.06        10
## VC.2     VC  2.0   26.14        10


This brings fun back to R programming in the sad times when many things are given to us on a plate (the thorough testing of the above is left as an exercise).

Example 12.29

In Section 10.4, we mentioned (without giving the implementation) the group_by function returning a list of the class list_dfs. It splits a data frame into a list of data frames with respect to a combination of levels in given named columns:

group_by <- function(df, by)
{
stopifnot(is.character(by), is.data.frame(df))
df <- droplevels(df)  # factors may have unused levels
structure(
split(df, df[names(df) %in% by]),
class="list_dfs",
by=by
)
}


The next function applies a set of aggregates on every column of each data frame in a given list (two nested lapplys plus cosmetic additions):

aggregate.list_dfs <- function(x, FUN, ...)
{
aggregates <- lapply(x, function(df) {
is_by <- names(df) %in% attr(x, "by")
res <- lapply(df[!is_by], FUN, ...)
res_mat <- do.call(rbind, res)
if (is.null(dimnames(res_mat)[[2]]))
dimnames(res_mat)[[2]] <- paste0("f", seq_len(NCOL(res_mat)))
cbind(
row.names<-(df[1, is_by, drop=FALSE], NULL),
x=row.names(res_mat),
row.names<-(res_mat, NULL)
)
})
combined_aggregates <- do.call(rbind.data.frame, aggregates)
row.names<-(combined_aggregates, NULL)
}
aggregate(group_by(ToothGrowth, c("supp", "dose")), range)
##   supp dose   x   f1   f2
## 1   OJ  0.5 len  8.2 21.5
## 2   VC  0.5 len  4.2 11.5
## 3   OJ  1.0 len 14.5 27.3
## 4   VC  1.0 len 13.6 22.5
## 5   OJ  2.0 len 22.4 30.9
## 6   VC  2.0 len 18.5 33.9


We really want our API to be bloated, so let’s introduce a convenience function, which is a specialised version of the above:

mean.list_dfs <- function(x, ...)
aggregate.list_dfs(x, function(y) c(Mean=mean(y, ...)))
mean(group_by(iris[51:150, c(2, 3, 5)], "Species"))
##      Species            x  Mean
## 1 versicolor  Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3  virginica  Sepal.Width 2.974
## 4  virginica Petal.Length 5.552


### 12.3.8. Transforming data in groups¶

Variables will sometimes need to be transformed relative to what is happening in a dataset’s subsets. This is the case, e.g., where we decide that missing values should be replaced by the corresponding within-group averages or want to compute the relative ranks or z-scores.

If the loss of the original ordering of rows is not an issue, the standard split-apply-bind will suffice. Here is an example data frame:

(x <- data.frame(
a=c( 10,   1,  NA,  NA,  NA,   4),
b=c( -1,  10,  40,  30,   1,  20),
c=runif(6),
d=c("v", "u", "u", "u", "v", "u")
))
##    a  b       c d
## 1 10 -1 0.52811 v
## 2  1 10 0.89242 u
## 3 NA 40 0.55144 u
## 4 NA 30 0.45661 u
## 5 NA  1 0.95683 v
## 6  4 20 0.45333 u


Some operations:

fill_na <- function(x) [<-(x, is.na(x), value=mean(x[!is.na(x)]))
standardise <- function(x) (x-mean(x))/sd(x)


And now:

x_groups <- lapply(
split(x, x["d"]),
function(df) {
df[["a"]] <- fill_na(df[["a"]])
df[["b"]] <- rank(df[["b"]])
df[["c"]] <- standardise(df[["c"]])
df
}
)
do.call(rbind.data.frame, x_groups)
##        a b        c d
## u.2  1.0 1  1.46357 u
## u.3  2.5 4 -0.17823 u
## u.4  2.5 3 -0.63478 u
## u.6  4.0 2 -0.65057 u
## v.1 10.0 1 -0.70711 v
## v.5 10.0 2  0.70711 v


Only the relative ordering of rows within groups has been retained. Overall, the rows are in a different order. If this is an issue, we can use the unsplit function:

unsplit(x_groups, x["d"])
##      a b        c d
## 1 10.0 1 -0.70711 v
## 2  1.0 1  1.46357 u
## 3  2.5 4 -0.17823 u
## 4  2.5 3 -0.63478 u
## 5 10.0 2  0.70711 v
## 6  4.0 2 -0.65057 u

Exercise 12.30

Show how we can perform the above also via the replacement version of split.

Example 12.31

(*) Recreating the previous ordering can be done manually, too. It is because the split operation behaves as if we first ordered the data frame with respect to the grouping variable(s) (using a stable sorting algorithm). Here is a transformation of an example data frame split by a combination of two factors:

(x <- row.names<-(ToothGrowth[sample(NROW(ToothGrowth), 10), ], NULL))
##     len supp dose
## 1  23.0   OJ  2.0
## 2  23.3   OJ  1.0
## 3  29.4   OJ  2.0
## 4  14.5   OJ  1.0
## 5  11.2   VC  0.5
## 6  20.0   OJ  1.0
## 7  24.5   OJ  2.0
## 8  10.0   OJ  0.5
## 9   9.4   OJ  0.5
## 10  7.0   VC  0.5
(y <- do.call(rbind.data.frame, lapply(
split(x, x[c("dose", "supp")]),  # two grouping variables
function(df) {
df[["len"]] <- df[["len"]] * 100^df[["dose"]] *  # whatever
ifelse(df[["supp"]] == "OJ", -1, 1)          # do not overthink it
df
}
)))
##               len supp dose
## 0.5.OJ.8     -100   OJ  0.5
## 0.5.OJ.9      -94   OJ  0.5
## 1.OJ.2      -2330   OJ  1.0
## 1.OJ.4      -1450   OJ  1.0
## 1.OJ.6      -2000   OJ  1.0
## 2.OJ.1    -230000   OJ  2.0
## 2.OJ.3    -294000   OJ  2.0
## 2.OJ.7    -245000   OJ  2.0
## 0.5.VC.5      112   VC  0.5
## 0.5.VC.10      70   VC  0.5


Section 5.4.4 mentioned that by calling order, we can determine the inverse of a given permutation. Hence, we can call:

y[order(order(x[["supp"]], x[["dose"]])), ]  # not: dose, supp
##               len supp dose
## 2.OJ.1    -230000   OJ  2.0
## 1.OJ.2      -2330   OJ  1.0
## 2.OJ.3    -294000   OJ  2.0
## 1.OJ.4      -1450   OJ  1.0
## 0.5.VC.5      112   VC  0.5
## 1.OJ.6      -2000   OJ  1.0
## 2.OJ.7    -245000   OJ  2.0
## 0.5.OJ.8     -100   OJ  0.5
## 0.5.OJ.9      -94   OJ  0.5
## 0.5.VC.10      70   VC  0.5


Additionally, we can manually restore the original row.names, et voilà.

### 12.3.9. Metaprogramming-based techniques (*)¶

Section 9.4.7 mentioned a few functions that provide convenient interfaces to some common data frame operations. These include transform, subset, with, and basically every procedure accepting a formula. The popular data.table and dplyr packages also belong to this class (Section 12.3.10).

Unfortunately, each method relying on metaprogramming must be studied separately because it is free to interpret the form of the passed arguments arbitrarily, without taking into account their real meaning. As we are concerned with developing a more universal skill set, we avoid[10] them in this course. They do not offer anything more than what we have learnt so far.

Withal, they are thought-provoking on their own. Furthermore, they are popular in other users’ code. Thus, after all, they deserve the honourable mention.

Example 12.32

Consider an example call to the subset function:

subset(iris, Sepal.Length<4.5, -(Sepal.Width:Petal.Width))
##    Sepal.Length Species
## 9           4.4  setosa
## 14          4.3  setosa
## 39          4.4  setosa
## 43          4.4  setosa


Neither the second nor the third argument makes sense as a standalone R expression. We have not defined the named variables used there:

Sepal.Length<4.5            # utter nonsense
-(Sepal.Width:Petal.Width)  # gibberish


Only from help("subset") we can learn that this tool assumes that the expression passed as the second argument plays the role of a row selector. Moreover, the third one is meant to remove all the columns between the two given ones.

In our course, we pay attention to developing transferable skills. We believe that R is not the only language we will learn during our long and happy lives. It is much more likely that in the next environment, we will become used to writing something of the more basic form:

between <- function(x, from, to) match(from, x):match(to, x)
iris[iris[["Sepal.Length"]]<4.5,
-between(names(iris), "Sepal.Width", "Petal.Width")]
##    Sepal.Length Species
## 9           4.4  setosa
## 14          4.3  setosa
## 39          4.4  setosa
## 43          4.4  setosa

Example 12.33

With transform, we can add, modify, and remove columns in a data frame. Existing features can be referred to as if they were ordinary variables:

(mtcars4 <- mtcars[sample(seq_len(NROW(mtcars)), 4), c("hp", "am", "mpg")])
##                     hp am  mpg
## Maserati Bora      335  1 15.0
## Cadillac Fleetwood 205  0 10.4
## Honda Civic         52  1 30.4
## Merc 450SLC        180  0 15.2
transform(mtcars4, log_hp=log(hp), am=2*am-1, hp=NULL, fcon=235/mpg)
##                    am  mpg log_hp    fcon
## Maserati Bora       1 15.0 5.8141 15.6667
## Cadillac Fleetwood -1 10.4 5.3230 22.5962
## Honda Civic         1 30.4 3.9512  7.7303
## Merc 450SLC        -1 15.2 5.1930 15.4605


Similarly, attach adds any named list to the search path (see Section 16.2.6) but it does not support altering their contents. As an alternative, within may be called:

within(mtcars4, {
log_hp <- log(hp)
fcon <- 235/mpg
am <- factor(am, levels=c(0, 1), labels=c("no", "yes"))
hp <- NULL
})
##                     am  mpg    fcon log_hp
## Maserati Bora      yes 15.0 15.6667 5.8141
## Cadillac Fleetwood  no 10.4 22.5962 5.3230
## Honda Civic        yes 30.4  7.7303 3.9512
## Merc 450SLC         no 15.2 15.4605 5.1930


Those who find writing mtcars4[["name"]] instead of name too exhausting, can save a few keystrokes.

Example 12.34

As mentioned in Section 10.3.4 (see Section 17.6 for more details), formulae are special objects that consist of two unevaluated expressions separated by a tilde, ~. Functions can support formulae and do what they please with them. However, a popular approach is to allow them to express “something grouped by something else” or “one thing as a function of other things”.

do.call(rbind.data.frame, lapply(split(ToothGrowth, ~supp+dose), head, 1))
##         len supp dose
## OJ.0.5 15.2   OJ  0.5
## VC.0.5  4.2   VC  0.5
## OJ.1   19.7   OJ  1.0
## VC.1   16.5   VC  1.0
## OJ.2   25.5   OJ  2.0
## VC.2   23.6   VC  2.0
aggregate(cbind(mpg, log_hp=log(hp))~am:cyl, mtcars, mean)
##   am cyl    mpg log_hp
## 1  0   4 22.900 4.4186
## 2  1   4 28.075 4.3709
## 3  0   6 19.125 4.7447
## 4  1   6 20.567 4.8552
## 5  0   8 15.050 5.2553
## 6  1   8 15.400 5.6950
##                   mpg + hp log(hp)    I(1/qsec)
## Mazda RX4            131.0  4.7005 0.060753....
## Mazda RX4 Wag        131.0  4.7005 0.058754....
## Datsun 710           115.8  4.5326 0.053734....
## Hornet 4 Drive       131.4  4.7005 0.051440....
## Hornet Sportabout    193.7  5.1648 0.058754....
## Valiant              123.1  4.6540 0.049455....


If these examples seem esoteric, it is because it is precisely the case. We need to consult the corresponding functions’ manuals to discover what they do. And, as we do not recommend their use to beginner programmers, we will not explain them here. Don’t trip.

Exercise 12.35

In the last example, the peculiar printing of the last column is due to which method’s being overloaded?

In the third part of this book, we will return to these functions for they will serve as an amusing illustration of how to indite our own procedures that rely on metaprogramming techniques.

### 12.3.10. A note on the dplyr (tidyverse) and data.table packages (*)¶

data.table and dplyr are very popular packages that implement common data frame transformations. In particular, the latter is part of an immerse system of interdependent extensions called tidyverse which became quite invasive over the last few years. They both heavily rely on metaprogramming and introduce entirely new APIs featuring hundreds of functions for the operations we already know well how to perform (the calamity of superabundance).

Still, their users must remember that they will need to rely on base functions when the processing of other prominent data structures is required, e.g., of fancy lists and matrices. Base R (and its predecessor, S) has long ago proven to be a versatile tool for rapid prototyping, calling specialised procedures written in C or Java, and wrangling data that fit into memory. Even though some operations from the mentioned packages may be much faster for larger datasets, the speed is less often an issue in practice than what most users might think.

For larger problems, techniques for working with batches of data, sampling methods, or aggregating data stored elsewhere are often the way to go, especially when building machine learning models or visualisation[11] is required. Usually, the most recent data will be stored in external, normalised databases, and we will need to join a few tables to fetch something valuable from the perspective of the current task’s context.

Thus, we cannot stress enough that, in many situations, SQL, not the other tools, is the most powerful interface to more considerable amounts of data. Learning it will give us the skills we can use later in other programming environments.

Note

Of course, certain functions from tidyverse and related packages we will find very helpful after all. Quite annoyingly, they tend to return objects of the class tibble (tbl_df) (e.g., haven::read.xpt that reads SAS data files). Luckily, they are subclasses of data.frame; we can always use as.data.frame to get our favourite objects back.

## 12.4. Exercises¶

Exercise 12.36

• What attributes a data frame is equipped with?

• If row.names is an integer vector, how to access rows labelled 1, 7, and 42?

• How to create a data frame that carries a column that is a list of character vectors of different lengths?

• How to create a data frame that includes a matrix column?

• How to convert all numeric columns in a data frame to a numeric matrix?

• Assuming that x is an atomic vector, what is the difference between as.data.frame(x), as.data.frame(as.list(x)), as.data.frame(list(a=x)), and data.frame(a=x)?

Exercise 12.37

Assuming that x is a data frame, what is the meaning of/difference between the following:

• x["u"] vs x[["u"]] vs x[, "u"]?

• x["u"][1] vs x[["u"]][1] vs x[1, "u"] vs x[1, "u", drop=FALSE]?

• x[which(x[[1]] > 0), ] vs x[x[[1]] > 0, ]?

• x[grep("^foo", names(x))]?

Exercise 12.38

We have a data frame with columns named like: ID (character), checked (logical, possibly with missing values), category (factor), x0, …, x9 (ten separate numeric columns), y0, …, y9 (ten separate numeric columns), coords (numeric matrix with two columns named lat and long), and features (list of character vectors of different lengths).

• How to extract the rows where checked is TRUE?

• How to extract the rows for which ID is like three letters and then five digits (e.g., XYZ12345)?

• How to select all the numeric columns in one go?

• How to extract a subset comprised only of the ID and x-something columns?

• How to get rid of all the columns between x3 and y7?

• Assuming that the IDs are like three letters and then five digits, how to add two columns: ID3 (the letters) and ID5 (the five digits)?

• How to check where both lat and long in coords are negative?

• How to add the column indicating the number of features?

• How to extract the rows where "spam" is amongst the features?

• How to convert it to a long data frame with two columns: ID and feature (individual strings)?

• How to change the name of the ID column to id?

• How to make the y-foo columns appear before the x-bar ones?

• How to order the rows with respect to checked (FALSE first, then TRUE) and IDs (decreasingly)?

• How to remove rows with duplicate IDs?

• How to determine how many entries correspond to each category?

• How to compute the average lat and long in each category?

• How to compute the average lat and long for each category and checked combined?

Exercise 12.39

Consider the flights dataset. Give some ways to select all rows between March and October (regardless of the year).

Exercise 12.40

In this task, you will be working with a version of a dataset on 70k+ Melbourne trees (urban_forest).

2. Fetch the IDs (CoM.ID) and trunk diameters (Diameter.Breast.Height) of the horse chestnuts with five smallest diameters at breast height. The output data frame must be sorted with respect to Diameter.Breast.Height, decreasingly.

3. Create a new data frame that gives the number of trees planted in each year.

4. Compute the average age (in years, based on Year.Planted) of the trees of genera (each genus separately): Eucalyptus, Platanus, Ficus, Acer, and Quercus.

Exercise 12.41

(*) Consider the historic data dumps of Stack Exchange available here. Export these CSV files to an SQLite database. Then, write some R code that corresponds to the following SQL queries. Use dbGetQuery to verify your results.

First:

SELECT
Users.DisplayName,
Users.Age,
Users.Location,
SUM(Posts.FavoriteCount) AS FavoriteTotal,
Posts.Title AS MostFavoriteQuestion,
MAX(Posts.FavoriteCount) AS MostFavoriteQuestionLikes
FROM Posts
WHERE Posts.PostTypeId=1
GROUP BY OwnerUserId
ORDER BY FavoriteTotal DESC
LIMIT 10


Second:

SELECT
Posts.ID,
Posts.Title,
FROM Posts
JOIN (
SELECT
Posts.ParentID,
FROM Posts
WHERE Posts.PostTypeID=2 AND Posts.Score>0
GROUP BY Posts.ParentID
) AS Posts2
ON Posts.ID=Posts2.ParentID
LIMIT 10


Third:

SELECT
Posts.Title,
FROM (
SELECT
PostId,
COUNT(*) AS Count,
WHERE VoteTypeId=2
GROUP BY PostId, Year
WHERE Posts.PostTypeId=1
GROUP BY Year


Fourth:

SELECT
Questions.Id,
Questions.Title,
Posts.Score AS AcceptedScore,
FROM (
SELECT Id, ParentId, MAX(Score) AS MaxScore
FROM Posts
WHERE PostTypeId==2
GROUP BY ParentId
JOIN (
SELECT * FROM Posts
WHERE PostTypeId==1
) AS Questions
WHERE Difference>50
ORDER BY Difference DESC


Fifth:

SELECT
Posts.Title,
FROM (
SELECT
PostID,
UserID,
GROUP BY PostID, UserID
) AS CmtTotScr
JOIN Posts ON Posts.ID=CmtTotScr.PostID
AND Posts.OwnerUserId=CmtTotScr.UserID
WHERE Posts.PostTypeId=1
LIMIT 10


Sixth:

SELECT DISTINCT
Users.Id,
Users.DisplayName,
Users.Reputation,
Users.Age,
Users.Location
FROM (
SELECT
Name, UserID
WHERE Name IN (
SELECT
Name
WHERE Class=1
GROUP BY Name
HAVING COUNT(*) BETWEEN 2 AND 10
)
AND Class=1


Seventh:

SELECT
Posts.Title,
FROM Posts
JOIN (
SELECT
PostId,
MAX(CASE WHEN VoteDate = 'new' THEN Total ELSE 0 END) NewVotes,
MAX(CASE WHEN VoteDate = 'old' THEN Total ELSE 0 END) OldVotes,
FROM (
SELECT
PostId,
CASE STRFTIME('%Y', CreationDate)
WHEN '2017' THEN 'new'
WHEN '2016' THEN 'new'
ELSE 'old'
END VoteDate,
COUNT(*) AS Total
WHERE VoteTypeId=2
GROUP BY PostId, VoteDate

(*) Generate a CSV file that stores some random data arranged in a few columns of a size at least two times larger than your available RAM. Then, export the CSV file to an SQLite database. Use file connections (Section 8.3.5) and the nrow argument to read.table to process it chunk by chunk. Determine whether setting colClasses in read.table speeds up the reading of large CSV files significantly or not.