# 12. Data Frames

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of Chapters 1–12 are already complete, but there will be more. In the meantime, any bug/typos reports/fixes are appreciated. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [20].

Most matrices are built on top of atomic vectors and hence allow items of the same type to be arranged into rows and columns. Data frames (objects of S3 class data.frame, first introduced in [8]), on the other hand, are collections of vectors of identical lengths or matrices with identical row counts, hence allowing us to represent structured[1] data of possibly heterogeneous types, for instance:

class(iris)  # iris is an example built-in data frame
## [1] "data.frame"
iris[c(1, 51, 101), ]  # 3 chosen rows from iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 101          6.3         3.3          6.0         2.5  virginica


is a mix of numeric and factor-type data.

The good news is that not only are data frames built upon named lists (e.g., to extract a column, we can refer to [[), but also many functions recognise them as matrix-like (e.g., to select specific rows and columns, two indexers can be passed to [, as in the example above). Hence, it will soon turn out that we already know a lot about how to perform basic data wrangling activities, even if we do not fully realise it now.

Important

Some of us will approach this chapter biased by what we know from elsewhere, including our experience with some popular third-party packages for data frame processing. The art is to filter out that information as noise (at least, for the time being). We will show how powerful base R vocabulary is and how much can be implied from the material covered in the preceding chapters. And yes, this book is like a good thriller/drama/love story: it is meant to be read from the beginning to end, so please go back to the start of this comprehensive course if you happened to pop in here by accident or driven by “but I need to know now”. Good morning.

## 12.1. Creating Data Frames

### 12.1.1. data.frame and as.data.frame

Most frequently, we create data frames based on a series of logical, numeric, or character vectors of identical lengths. The data.frame function is particularly useful in such a scenario:

(x <- data.frame(
    a=c(TRUE, FALSE),
    b=1:6,
    c=runif(6),
    d=c("spam", "spam", "eggs")
))
##       a b       c    d
## 1  TRUE 1 0.77437 spam
## 2 FALSE 2 0.19722 spam
## 3  TRUE 3 0.97801 eggs
## 4 FALSE 4 0.20133 spam
## 5  TRUE 5 0.36124 spam
## 6 FALSE 6 0.74261 eggs


Note that the shorter vectors were recycled. That the diverse column types were retained (no coercion was made) can be verified, e.g., by calling:

str(x)
## 'data.frame':	6 obs. of  4 variables:
##  $ a: logi  TRUE FALSE TRUE FALSE TRUE FALSE
##  $ b: int  1 2 3 4 5 6
##  $ c: num  0.774 0.197 0.978 0.201 0.361 ...
##  $ d: chr  "spam" "spam" "eggs" "spam" ...


We can also fetch the class of each column directly by calling (compare Section 12.3):

sapply(x, class)  # the same as unlist(Map(class, x))
##           a           b           c           d
##   "logical"   "integer"   "numeric" "character"


Important

For many reasons (see, e.g., Section 12.1.5 and Section 12.1.6), we recommend always checking the type of each column, for instance by calling the str function.

Many objects, such as matrices, can easily be coerced to data frames using particular as.data.frame methods.

Here is an example matrix:

(A <- matrix(1:6, nrow=3,
    dimnames=list(
        NULL,        # no row labels
        c("u", "v")  # some column labels
    )))
##      u v
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6


Let us convert it to a data frame:

as.data.frame(A)  # as.data.frame.matrix
##   u v
## 1 1 4
## 2 2 5
## 3 3 6


Note that a matrix with no row labels is printed slightly differently from a data frame equipped with (as it will soon turn out) the default row.names.

Named lists are amongst other candidates for a meaningful conversion. Consider an example list, where each element is a vector of the same length as the other ones:

(l <- Map(
    function(x) {
        c(Min=min(x), Median=median(x), Mean=mean(x), Max=max(x))
    },
    split(iris[["Sepal.Length"]], iris[["Species"]])
))
## $setosa
##    Min Median   Mean    Max
##  4.300  5.000  5.006  5.800
##
## $versicolor
##    Min Median   Mean    Max
##  4.900  5.900  5.936  7.000
##
## $virginica
##    Min Median   Mean    Max
##  4.900  6.500  6.588  7.900


Each list element will be turned into a separate column:

as.data.frame(l)  # as.data.frame.list
##        setosa versicolor virginica
## Min     4.300      4.900     4.900
## Median  5.000      5.900     6.500
## Mean    5.006      5.936     6.588
## Max     5.800      7.000     7.900


Sadly, as.data.frame.list is not particularly fond of lists of vectors of incompatible lengths:

as.data.frame(list(a=1, b=11:12, c=21:23))
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 2, 3

The above vectors could have been recycled with a warning, but they were not.

as.data.frame(list(a=1:4, b=11:12, c=21))  # recycling rule okay
##   a  b  c
## 1 1 11 21
## 2 2 12 21
## 3 3 11 21
## 4 4 12 21


The method for the S3 class table (mentioned in Chapter 11) can be helpful as well. Here is an example contingency table together with its unstacked version:

(t <- table(mtcars[["vs"]], mtcars[["cyl"]]))
##
##      4  6  8
##   0  1  3 14
##   1 10  4  0
as.data.frame(t)  # as.data.frame.table; see the stringsAsFactors note below!
##   Var1 Var2 Freq
## 1    0    4    1
## 2    1    4   10
## 3    0    6    3
## 4    1    6    4
## 5    0    8   14
## 6    1    8    0


Actually, as.data.frame.table is so useful that we might want to call it directly on any array. This way, we can convert data from the so-called wide format to the long one; see Section 12.3.6 for more details.

Note

The above method is based on expand.grid, which determines all combinations of a given series of vectors.

expand.grid(1:2, c("a", "b", "c"))  # see the stringsAsFactors note below!
##   Var1 Var2
## 1    1    a
## 2    2    a
## 3    1    b
## 4    2    b
## 5    1    c
## 6    2    c


Overall, many classes of objects can be included[2] in a data frame; the popular choices include Date, POSIXct, and factor. It is worth noting that the data.frame function calls the corresponding as.data.frame method, and format is used on printing.
Example 12.1

Here are two custom methods for what we would like to call from now on an S3 class spam:

as.data.frame.spam <- function(x, ...)
    structure(
        list(x),
        class="data.frame",
        row.names=seq_along(x)
    )

format.spam <- function(x, ...)
    paste0("*", x, "*")


Testing data frame creation and printing:

data.frame(
    a=structure(c("a", "b", "c"), class="spam"),
    b=factor(c("spam", "bacon", "spam")),
    c=Sys.Date()+1:3
)
##     a     b          c
## 1 *a*  spam 2023-01-15
## 2 *b* bacon 2023-01-16
## 3 *c*  spam 2023-01-17


### 12.1.2. cbind.data.frame and rbind.data.frame

There are data frame-specific versions of cbind and rbind (which we discussed in the context of stacking matrices in Section 11.1.2). They are used quite eagerly: help("cbind") states that they will be referred to if at least[3] one of the arguments is a data frame and the other arguments are atomic vectors or lists (possibly with the dim attribute). For example:

x <- iris[c(1, 51, 101), c("Sepal.Length", "Species")]  # whatever
cbind(Yummy=c(TRUE, FALSE, TRUE), x)
##     Yummy Sepal.Length    Species
## 1    TRUE          5.1     setosa
## 51  FALSE          7.0 versicolor
## 101  TRUE          6.3  virginica


added a new column to the data frame x. Moreover:

rbind(x, list(42, "virginica"))
##     Sepal.Length    Species
## 1            5.1     setosa
## 51           7.0 versicolor
## 101          6.3  virginica
## 11          42.0  virginica


added a new row. Note that the columns are of different types; hence, the values to row-bind were provided as a generic vector. The list can also be named. It can consist of vectors of length greater than one, given in any order:

rbind(x, list(
    Species=c("virginica", "setosa"),
    Sepal.Length=c(42, 7)
))
##     Sepal.Length    Species
## 1            5.1     setosa
## 51           7.0 versicolor
## 101          6.3  virginica
## 11          42.0  virginica
## 2            7.0     setosa


Sometimes referring to these methods directly will be necessary.
Consider an example list of atomic vectors:

x <- list(a=1:3, b=11:13, c=21:23)


First, we call the generic, which dispatches to the default method:

do.call(cbind, x)
##      a  b  c
## [1,] 1 11 21
## [2,] 2 12 22
## [3,] 3 13 23


If we want to make sure we garner a data frame as the result, we need to write:

do.call(cbind.data.frame, x)
##   a  b  c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23


This is particularly useful in the context of fetching outputs from Map and its friends, which are wrapped inside a list. For instance:

l <- unname(Map(
    function(x) list(
        Sepal.Length=mean(x[["Sepal.Length"]]),
        Sepal.Width=mean(x[["Sepal.Width"]]),
        Species=x[["Species"]][1]
    ),
    split(iris, iris[["Species"]])  # split.data.frame; see below
))
str(l)
## List of 3
##  $ :List of 3
##   ..$ Sepal.Length: num 5.01
##   ..$ Sepal.Width : num 3.43
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1
##  $ :List of 3
##   ..$ Sepal.Length: num 5.94
##   ..$ Sepal.Width : num 2.77
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2
##  $ :List of 3
##   ..$ Sepal.Length: num 6.59
##   ..$ Sepal.Width : num 2.97
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 3


This was nothing more than a fancy way to obtain an illustrative list, which we may now turn into a data frame by calling:

do.call(rbind.data.frame, l)
##   Sepal.Length Sepal.Width    Species
## 1        5.006       3.428     setosa
## 2        5.936       2.770 versicolor
## 3        6.588       2.974  virginica


On the other hand, do.call(rbind, l) does not return a particularly friendly object type:

do.call(rbind, l)
##      Sepal.Length Sepal.Width Species
## [1,] 5.006        3.428       setosa
## [2,] 5.936        2.77        versicolor
## [3,] 6.588        2.974       virginica


Despite the pretty face, it is a matrix… over a list:

str(do.call(rbind, l))
## List of 9
##  $ : num 5.01
##  $ : num 5.94
##  $ : num 6.59
##  $ : num 3.43
##  $ : num 2.77
##  $ : num 2.97
##  $ : Factor w/ 3 levels "setosa","versicolor",..: 1
##  $ : Factor w/ 3 levels "setosa","versicolor",..: 2
##  $ : Factor w/ 3 levels "setosa","versicolor",..: 3
##  - attr(*, "dim")= int [1:2] 3 3
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:3] "Sepal.Length" "Sepal.Width" "Species"


### 12.1.3. Reading Data Frames

Structured data can be imported from external sources, such as CSV/TSV (comma-/tab-separated values) or HDF5 files, relational databases supporting SQL (see Section 12.1.4), web APIs (e.g., through the curl and jsonlite packages), spreadsheets [48], and so on.

In particular, read.csv and the like fetch data from plain text files consisting of records where fields are separated by commas, semicolons, tabs, etc.

For instance:

x <- data.frame(a=runif(3), b=c(TRUE, FALSE, TRUE))  # example data frame
f <- tempfile()  # temporary file name
write.csv(x, f, row.names=FALSE)  # export


This created a CSV file which looks like:

cat(readLines(f), sep="\n")  # print file contents
## "a","b"
## 0.287577520124614,TRUE
## 0.788305135443807,FALSE
## 0.4089769218117,TRUE


The above can be read by calling:

read.csv(f)
##         a     b
## 1 0.28758  TRUE
## 2 0.78831 FALSE
## 3 0.40898  TRUE

Exercise 12.2

Check out help("read.table") for a long list of tunable parameters, especially: sep, dec, quote, header, comment.char, and row.names. Further, note that reading from compressed files is supported directly.
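For instance, a semicolon-separated file with decimal commas (a format popular in some European locales) can be read by tuning the sep and dec parameters; the file created below is only illustrative (a sketch):

```r
f2 <- tempfile()  # a throwaway file for this illustration
writeLines(c("a;b", "0,25;spam", "1,50;eggs"), f2)
(y <- read.table(f2, sep=";", dec=",", header=TRUE))
##      a    b
## 1 0.25 spam
## 2 1.50 eggs
```

Note that the decimal commas were correctly parsed as numeric values.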

Important

CSV is by far the most portable and user-friendly format for exchanging matrix-like objects between different programs and computing languages (e.g., Python, Julia, LibreOffice Calc, etc.). Such files can be opened in any text editor.

Note

As mentioned in Section 8.3.5, it is possible to process data frames on a chunk-by-chunk basis, which is beneficial especially when data do not fit into memory (compare the nrows argument to read.csv).
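A minimal sketch of such chunk-wise processing follows; we keep the connection open so that consecutive calls to read.csv continue from where the previous one stopped (the file and chunk size are only illustrative):

```r
f <- tempfile()  # an illustrative CSV file with 10 rows
write.csv(data.frame(a=1:10, b=runif(10)), f, row.names=FALSE)

con <- file(f, open="r")
col_names <- gsub('"', "", strsplit(readLines(con, 1), ",")[[1]])  # header line
total <- 0
repeat {
    # read (at most) 4 data rows at a time from the open connection:
    chunk <- read.csv(con, header=FALSE, nrows=4, col.names=col_names)
    total <- total + sum(chunk[["a"]])  # aggregate each chunk separately
    if (nrow(chunk) < 4) break  # an incomplete chunk marks the end of the file
}
close(con)
total  # sum of column a, computed without loading the whole file at once
## [1] 55
```

The same pattern scales to files much larger than the available memory, as only one chunk resides in memory at a time.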

### 12.1.4. Interfacing Relational Databases and Querying with SQL (*)

The DBI package provides a universal interface for particular database management systems whose drivers are implemented in additional add-ons such as RSQLite, RMariaDB, RPostgreSQL, etc., or, more generally, RODBC or odbc. For more details, see Section 4 of [48].

Example 12.3

Let us play with an in-memory (volatile) instance of an SQLite database.

library("DBI")
con <- dbConnect(RSQLite::SQLite(), ":memory:")


This returns an object representing a database connection which we can refer to in further communication.

An easy way to create a database table is to call:

dbWriteTable(con, "mtcars", mtcars)  # mtcars is a toy built-in data frame


Alternatively, dbExecute could have been referred to in order to send SQL statements such as CREATE TABLE ... followed by a series of INSERT INTO ....

Some data retrieval can now follow:

dbGetQuery(con, "
    SELECT cyl, vs, AVG(mpg) AS mpg_ave, AVG(hp) AS hp_ave
    FROM mtcars
    GROUP BY cyl, vs
")
##   cyl vs mpg_ave hp_ave
## 1   4  0  26.000  91.00
## 2   4  1  26.730  81.80
## 3   6  0  20.567 131.67
## 4   6  1  19.125 115.25
## 5   8  0  15.100 209.21


This gives us an ordinary R data frame which we can process in the same fashion as any other object of this kind.

At the end, the database connection must be closed.

dbDisconnect(con)

Exercise 12.4

Database passwords should never be stored in plain text files, let alone in R scripts in version-controlled repositories. Consider a few ways for fetching credentials programmatically:

• using environment variables (see help("Sys.getenv")),

• using the keyring package,

• calling system2 (Section 7.3.3) to retrieve it from the system keyring (e.g., the keyring package for Python provides a platform-independent command-line utility).
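The first option can be sketched as follows; the variable name DB_PASSWORD is hypothetical, and in practice it would be set outside of R (e.g., in the shell or in a .Renviron file), not via Sys.setenv as we do here for illustration only:

```r
Sys.setenv(DB_PASSWORD="s3cret!")  # for illustration only; normally set externally
password <- Sys.getenv("DB_PASSWORD", unset=NA)  # NA if the variable is not set
if (is.na(password)) stop("DB_PASSWORD is not set")
```

This way, the secret never appears in the script itself.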

### 12.1.5. Strings as Factors?

The following is so critical that we will devote a separate subsection to discuss it, so that we always remain vigilant (such is life: maintaining some level of mindfulness is often a good idea).

Important

Some functions related to data frames automatically convert character vectors to factors. This behaviour is frequently controlled by the stringsAsFactors argument thereto.

This is particularly problematic due to the fact that, when printed, factor and character columns look identical:

(x <- data.frame(a=factor(c("U", "V")), b=c("U", "V")))
##   a b
## 1 U U
## 2 V V
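A quick inspection reveals the difference; we can also restore the pre-R 4.0 behaviour on demand (a sketch):

```r
x <- data.frame(a=factor(c("U", "V")), b=c("U", "V"))
sapply(x, class)  # printing x alone does not reveal the difference
##           a           b
##    "factor" "character"
y <- data.frame(b=c("U", "V"), stringsAsFactors=TRUE)  # force the old default
class(y[["b"]])  # a factor, not a character vector
## [1] "factor"
```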


We recall from Section 10.3.3 that factors can be nasty. For example, passing factors as indexers in [ or converting them with as.numeric might give counterintuitive (for the uninformed) results. Also, new factor levels must be added manually when we want to extend them with more diverse data. This can cause some unexpected behaviour in contexts such as:

rbind(x, c("W", "W"))
## Warning in [<-.factor(*tmp*, ri, value = "W"): invalid factor level,
## NA generated
##      a b
## 1    U U
## 2    V V
## 3 <NA> W


It is therefore a good habit always to check the data types, for instance:

str(x)
## 'data.frame':	2 obs. of  2 variables:
##  $ a: Factor w/ 2 levels "U","V": 1 2
##  $ b: chr  "U" "V"


Before R 4.0, a number of functions, including data.frame and read.csv, had the stringsAsFactors argument defaulting to TRUE. This is no longer the case for many of them.

However, exceptions to this rule still exist, e.g., as.data.frame.table and expand.grid. Besides, some built-in example data frames have factor-typed columns inherited from the old days, e.g.:

class(iris[["Species"]])
## [1] "factor"


We observe that the Species column in iris is not of type character. Hence, adding a new variety requires some care:

iris2 <- iris[c(1, 51, 101), ]  # example subset
levels(iris2[["Species"]]) <- c(levels(iris2[["Species"]]), "croatica")
rbind(iris2, c(6, 3, 3, 2, "croatica"))
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51             7         3.2          4.7         1.4 versicolor
## 101          6.3         3.3            6         2.5  virginica
## 4              6           3            3           2   croatica


Alternatively, we could have simply converted the Species column to character.
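A minimal sketch of this alternative follows; note that providing the new row as a list (rather than an atomic character vector) prevents the numeric columns from being coerced to character:

```r
iris2 <- iris[c(1, 51, 101), ]
iris2[["Species"]] <- as.character(iris2[["Species"]])  # no factor levels to worry about
(iris2 <- rbind(iris2, list(6, 3, 3, 2, "croatica")))  # a list preserves column types
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 4            6.0         3.0          3.0         2.0   croatica
```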

### 12.1.6. Internal Representation

Objects of S3 class data.frame are built upon lists of vectors of the same length or matrices with identical row counts, which define consecutive columns thereof. Apart from class, they must be equipped with the following special attributes:

• names – a character vector (as usual in any named list) labelling the columns or their groups,

• row.names – a character or integer vector with no duplicates nor missing values, serving as the row labels.

Therefore, a data frame can be created from scratch by calling, for example:

structure(
    list(a=11:13, b=21:23),  # sets the names attribute already
    row.names=1:3,
    class="data.frame"
)
##    a  b
## 1 11 21
## 2 12 22
## 3 13 23


Here is a data frame based on a length-5 list, a matrix with five rows, and a length-5 numeric vector, with some fancy row names on top:

structure(
    list(
        a=list(1, 1:2, 1:3, numeric(0), -(4:1)),
        b=cbind(u=11:15, v=21:25),
        c=runif(5)
    ),
    row.names=c("spam", "bacon", "eggs", "ham", "aubergine"),
    class="data.frame"
)
##                        a b.u b.v       c
## spam                   1  11  21 0.28758
## bacon               1, 2  12  22 0.78831
## eggs             1, 2, 3  13  23 0.40898
## ham                       14  24 0.88302
## aubergine -4, -3, -2, -1  15  25 0.94047


In general, the columns of type list can contain anything, e.g., other lists or R functions. Including atomic vectors of varying lengths just like above allows for creating something à la ragged arrays – a pretty handy scenario.
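A minimal sketch of such a list column in action; the names id and vals are only illustrative:

```r
y <- structure(
    list(id=1:3, vals=list(1:2, 7, 1:4)),  # vals is a ragged list column
    row.names=1:3,
    class="data.frame"
)
sapply(y[["vals"]], length)  # each cell can be of a different length
## [1] 2 1 4
y[["vals"]][[3]]  # access a single cell with the extract operator
## [1] 1 2 3 4
```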

The issue with matrix entries, on the other hand, is that they appear as if they were many, but – as it will turn out in the sequel – they are often treated as a single complex column, e.g., by the index operator (see Section 12.2). Therefore, from this perspective, the above data frame has three columns, not four. Such objects can be output by aggregate (see Section 12.3), amongst others. Nevertheless, they can be very useful too, forming natural column groups which can be easily accessed and batch-processed in the same way.

Important

Unfortunately, data frames with list or matrix columns cannot normally be created with the data.frame or cbind functions, which might explain why they are less popular. This behaviour is dictated by the underlying as.data.frame methods, which both of them call. As a curiosity, see help("I") though.

Exercise 12.5

Verify that for a data frame featuring a matrix column, the latter does not need to have its column names (the second element of dimnames) set.

The names and row.names attributes are special in the sense of Section 4.4.3. In particular, they can be accessed or modified by the corresponding functions.

It is worth noting that row.names(df) always returns a character vector, even when attr(df, "row.names") is an integer vector. Further, setting row.names(df) <- NULL will re-set[4] this attribute to the most commonly desired case of consecutive natural numbers, for example:

(x <- iris[c(1, 51, 101), ])  # comes with some sad row names
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
row.names(x) <- NULL  # reset to seq_len(NROW(x))
print(x)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          5.1         3.5          1.4         0.2     setosa
## 2          7.0         3.2          4.7         1.4 versicolor
## 3          6.3         3.3          6.0         2.5  virginica
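The distinction between the stored and the returned form of row.names can be sketched as follows:

```r
d <- data.frame(a=11:12)
typeof(attr(d, "row.names"))  # stored as integers here...
## [1] "integer"
row.names(d)  # ...but row.names always returns a character vector
## [1] "1" "2"
```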

Exercise 12.6

What is the name of the replacement version of the row.names method for the data.frame class?

Exercise 12.7

Implement your own version of expand.grid.

Exercise 12.8

Implement your own version of xtabs, but which does not rely on a formula interface. Allow three parameters: a data frame, the name of the “counts” column and the names of the cross-classifying variables. Hence, my_xtabs(x, "Freq", c("Var1", "Var2")) should be equivalent to xtabs(Freq~Var1+Var2, x).

## 12.2. Data Frame Subsetting

### 12.2.1. Data Frames are Lists

Data frames are named lists, where each element represents an individual column. Therefore[5], length yields the number of columns and names gives their respective labels.

Let us play with the following data frame:

(x <- data.frame(
    a=runif(6),
    b=rnorm(6),
    c=LETTERS[1:6],
    d1=c(FALSE, TRUE, FALSE, NA, FALSE, NA),
    d2=c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
))
##          a         b c    d1    d2
## 1 0.287578  0.070508 A FALSE FALSE
## 2 0.788305  0.129288 B  TRUE  TRUE
## 3 0.408977  1.715065 C FALSE FALSE
## 4 0.883017  0.460916 D    NA  TRUE
## 5 0.940467 -1.265061 E FALSE FALSE
## 6 0.045556 -0.686853 F    NA  TRUE
typeof(x)  # each data frame is a list
## [1] "list"
length(x)  # the number of columns
## [1] 5
names(x)   # column labels
## [1] "a"  "b"  "c"  "d1" "d2"


The one-argument versions of the extract and index operators behave as expected. [[ fetches (looks inside) the contents of a given column:

x[["a"]]  # or x[[1]]
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556


and [ returns a data frame (a list with extras) comprised of the specified elements:

x["a"]  # or x[1]
##          a
## 1 0.287578
## 2 0.788305
## 3 0.408977
## 4 0.883017
## 5 0.940467
## 6 0.045556
x[c(TRUE, TRUE, FALSE, TRUE, FALSE)]
##          a         b    d1
## 1 0.287578  0.070508 FALSE
## 2 0.788305  0.129288  TRUE
## 3 0.408977  1.715065 FALSE
## 4 0.883017  0.460916    NA
## 5 0.940467 -1.265061 FALSE
## 6 0.045556 -0.686853    NA


Just like with lists, the replacement versions of the said operators can be used to add new or replace existing columns.

y <- head(x, 1)  # for a more compact display
y[["a"]] <- round(y[["a"]], 1)  # replaces the column with new content
y[["b"]] <- NULL  # removes the column, like, totally
y[["e"]] <- 10*y[["a"]]^2  # adds a new column at the end
print(y)
##     a c    d1    d2   e
## 1 0.3 A FALSE FALSE 0.9

Example 12.9

Some spam for thought to show how much we already know; below are some common use cases of indexing and vectorised functions:

y <- head(x, 1)  # for a more compact display


Move column a to the end:

y[unique(c(names(y), "a"), fromLast=TRUE)]
##          b c    d1    d2       a
## 1 0.070508 A FALSE FALSE 0.28758


Remove column a and c:

y[-match(c("a", "c"), names(y))]
##          b    d1    d2
## 1 0.070508 FALSE FALSE


All columns between a and c:

y[match("a", names(y)):match("c", names(y))]
##         a        b c
## 1 0.28758 0.070508 A


Names starting with d:

y[grep("^d", names(y))]
##      d1    d2
## 1 FALSE FALSE


Change name of column c to z:

names(y)[names(y) == "c"] <- "z"  # in-place
print(y)
##         a        b z    d1    d2
## 1 0.28758 0.070508 A FALSE FALSE


Change names: d2 to u and d1 to v:

names(y)[match(c("d2", "d1"), names(y))] <- c("v", "u")  # in-place
print(y)
##         a        b z     u     v
## 1 0.28758 0.070508 A FALSE FALSE


Note

Some R users might prefer the $ operator over [[, but we do not. By default, the former supports partial matching of column names, which might be appealing when R is used interactively. However, it does not work on matrices, nor does it allow for programmatically generated names. It is also trickier to use on non-syntactically valid labels; compare Section 9.4.1.

Exercise 12.10

Write a function names_replace that changes the names of a data frame's columns based on a translation table given in a from=to fashion, for instance:

names_replace <- function(x, ...) ...to.do...
x <- data.frame(a=1, b=2, c=3)
names_replace(x, c="new_c", a="new_a")
##   new_a b new_c
## 1     1 2     3


### 12.2.2. Data Frames are Matrix-like

Data frames can be considered “generalised” matrices. They store data of any kind (possibly mixed), organised in a tabular fashion. Some functions mentioned in the previous chapter will hence be overloaded for the data frame case. These include: dim (despite the lack of the dim attribute), NROW, NCOL, and dimnames (which is, of course, based on row.names and names).
For example:

(x <- data.frame(
    a=runif(6),
    b=rnorm(6),
    c=LETTERS[1:6],
    d1=c(FALSE, TRUE, FALSE, NA, FALSE, NA),
    d2=c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
))
##          a         b c    d1    d2
## 1 0.287578  0.070508 A FALSE FALSE
## 2 0.788305  0.129288 B  TRUE  TRUE
## 3 0.408977  1.715065 C FALSE FALSE
## 4 0.883017  0.460916 D    NA  TRUE
## 5 0.940467 -1.265061 E FALSE FALSE
## 6 0.045556 -0.686853 F    NA  TRUE
dim(x)  # the number of rows and columns
## [1] 6 5
dimnames(x)  # it is not a matrix, but a matrix-like object
## [[1]]
## [1] "1" "2" "3" "4" "5" "6"
##
## [[2]]
## [1] "a"  "b"  "c"  "d1" "d2"


In addition to the list-like behaviour, which only allows for dealing with particular columns or groups thereof, the [ operator was also equipped with the ability to accept two indexers:

x[1:2, ]  # first two rows
##         a        b c    d1    d2
## 1 0.28758 0.070508 A FALSE FALSE
## 2 0.78831 0.129288 B  TRUE  TRUE
x[x[["a"]] >= 0.3 & x[["a"]] <= 0.8, -2]  # or use x[, "a"]
##         a c    d1    d2
## 2 0.78831 B  TRUE  TRUE
## 3 0.40898 C FALSE FALSE


Recall the drop argument to [ and its effects on matrix indexing. In the current case, its behaviour will be similar with regard to the operations on individual columns:

x[, 1]  # synonym: x[[1]], because drop=TRUE
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556
x[, 1, drop=FALSE]  # synonym: x[1]
##          a
## 1 0.287578
## 2 0.788305
## 3 0.408977
## 4 0.883017
## 5 0.940467
## 6 0.045556


Also, note that when we extract a single row and more than one column, drop does not really apply.
It is because columns (unlike in matrices) can potentially be of different types:

x[1, 1:2]  # two numeric columns, but the result is a data frame, not a vector
##         a        b
## 1 0.28758 0.070508


However:

x[1, 1]
## [1] 0.28758
x[1, 1, drop=FALSE]
##         a
## 1 0.28758


Note

Once again, let us take note of logical indexing featuring missing values:

x[x[["d1"]], ]
##           a       b    c   d1   d2
## 2   0.78831 0.12929    B TRUE TRUE
## NA       NA      NA <NA>   NA   NA
## NA.1     NA      NA <NA>   NA   NA
x[which(x[["d1"]]), ]  # drops missing values
##         a       b c   d1   d2
## 2 0.78831 0.12929 B TRUE TRUE


The default behaviour is consistent with many other R functions: it explicitly indicates that something is missing (we are selecting a “don’t know”; hence, the result is a “don’t know” as well). Unfortunately, this comes with no warning. As we rarely check manually for missing values in the outputs, our absent-mindedness can lead to code bugs.

By now, we might have already noted that the index operator adjusts (not: resets) the row.names attribute. For instance:

(xs <- x[head(order(x[["a"]], decreasing=TRUE), 3), ])
##         a        b c    d1    d2
## 5 0.94047 -1.26506 E FALSE FALSE
## 4 0.88302  0.46092 D    NA  TRUE
## 2 0.78831  0.12929 B  TRUE  TRUE


It is a version of x comprised of only the top three values in the a column. Indexing by means of character vectors will refer to row.names and names:

xs["5", c("a", "b")]
##         a       b
## 5 0.94047 -1.2651


Note that this is not the same as “xs[5, c("a", "b")]”, despite the fact that row.names is formally an integer vector here.

Note

If a data frame features a matrix, we need to use the index/extract operator twice in order to access a specific sub-column:

(x <- aggregate(iris[1], iris[5], function(x) c(Min=min(x), Max=max(x))))
##      Species Sepal.Length.Min Sepal.Length.Max
## 1     setosa              4.3              5.8
## 2 versicolor              4.9              7.0
## 3  virginica              4.9              7.9
x[["Sepal.Length"]][, "Min"]
## [1] 4.3 4.9 4.9


In other words, neither “x[["Sepal.Length.Min"]]” nor “x[, "Sepal.Length.Min"]” works.
As far as the replacement version of the index operator is concerned, it is quite a flexible tool, allowing the new content to be a vector, a data frame, a list, or even a matrix.

Exercise 12.11

Write two replacement functions[6]. First, set_row_names, which replaces the row.names of a data frame with the contents of a specific column, for example:

(x <- aggregate(iris[1], iris[5], mean))  # some data frame
##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588
set_row_names(x) <- "Species"
print(x)
##            Sepal.Length
## setosa            5.006
## versicolor        5.936
## virginica         6.588


Second, reset_row_names, which converts row.names to a standalone column of a given name, for instance:

reset_row_names(x) <- "Type"
print(x)
##   Sepal.Length       Type
## 1        5.006     setosa
## 2        5.936 versicolor
## 3        6.588  virginica


These two functions may be handy, as they allow for writing “x[something, ]” instead of “x[x[["column"]] %in% something, ]”.

## 12.3. Common Operations

Below we review the most commonly applied operations related to data frame wrangling. There are a few dedicated functions or methods overloaded for the data.frame class. However, we have already mastered the necessary skills to deal with objects of this kind through our hard work, in particular involving the solving of the exercises in the preceding chapters. Let us repeat: data frames are just lists exhibiting matrix-like behaviour.

### 12.3.1. Ordering Rows

Ordering rows in a data frame with respect to different criteria can easily be achieved by means of the order function and the two-argument version of [.
For instance, here are the top six cars in terms of the time (in seconds) to complete a 402-metre race: mtcars6 <- mtcars[order(mtcars[["qsec"]])[1:6], ] mtcars6[["model"]] <- row.names(mtcars6) row.names(mtcars6) <- NULL print(mtcars6) ## mpg cyl disp hp drat wt qsec vs am gear carb model ## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L ## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora ## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28 ## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino ## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360 ## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4  order uses a stable sorting algorithm, therefore sorting with respect to a different criterion will not break the relative ordering of qsec in row groups with ties: mtcars6[order(mtcars6[["cyl"]]), ] ## mpg cyl disp hp drat wt qsec vs am gear carb model ## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino ## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4 ## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L ## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora ## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28 ## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360  Example 12.12 Notice the difference between ordering by cyl and gear vs gear and cyl: mtcars6[order(mtcars6[["cyl"]], mtcars6[["gear"]]), ] ## mpg cyl disp hp drat wt qsec vs am gear carb model ## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4 ## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino ## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28 ## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360 ## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L ## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora mtcars6[order(mtcars6[["gear"]], mtcars6[["cyl"]]), ] ## mpg cyl disp hp drat wt qsec vs am gear carb model ## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28 ## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360 ## 6 21.0 6 160 110 3.90 
2.62 16.46 0 1 4 4 Mazda RX4 ## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino ## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L ## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora  Note Mixing a increasing and decreasing ordering is tricky as the decreasing argument to order currently does not accept multiple flags in all the contexts. Perhaps the easiest way to change the ordering direction is to use the unary minus operator on the column(s) to be sorted decreasingly. mtcars6[order(mtcars6[["gear"]], -mtcars6[["cyl"]]), ] ## mpg cyl disp hp drat wt qsec vs am gear carb model ## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28 ## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360 ## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4 ## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L ## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora ## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino  For factor and character columns, xtfrm can be used to convert them to sort keys first. mtcars6[order(mtcars6[["cyl"]], -xtfrm(mtcars6[["model"]])), ] ## mpg cyl disp hp drat wt qsec vs am gear carb model ## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4 ## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino ## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora ## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L ## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360 ## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28  Both of the above behave like decreasing=c(FALSE, TRUE). Exercise 12.13 Write a method sort.data.frame that orders a data frame with respect to a given set of columns. sort.data.frame <- function(x, decreasing=FALSE, cols) ...to.do... 
sort(mtcars6, cols=c("cyl", "model"))
##    mpg cyl disp  hp drat   wt  qsec vs am gear carb          model
## 4 19.7   6  145 175 3.62 2.77 15.50  0  1    5    6   Ferrari Dino
## 6 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4      Mazda RX4
## 3 13.3   8  350 245 3.73 3.84 15.41  0  0    3    4     Camaro Z28
## 5 14.3   8  360 245 3.21 3.57 15.84  0  0    3    4     Duster 360
## 1 15.8   8  351 264 4.22 3.17 14.50  0  1    5    4 Ford Pantera L
## 2 15.0   8  301 335 3.54 3.57 14.60  0  1    5    8  Maserati Bora

Unfortunately, the requirement that decreasing be of length one and that it be placed as the second method argument is imposed by the sort S3 generic.

### 12.3.2. Handling Duplicated Rows

duplicated, anyDuplicated, and unique have methods overloaded for the data.frame class. They can be used to indicate, get rid of, or replace the repeating rows.

sum(duplicated(iris))  # how many duplicated rows are there?
## [1] 1
iris[duplicated(iris), ]  # show the duplicated rows
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 143          5.8         2.7          5.1         1.9 virginica

### 12.3.3. Joining (Merging) Data Frames

The merge function can perform the JOIN operation that some readers might know from SQL[7]. It matches the items in the columns that two given data frames somewhat share, and then returns their combination.

Example 12.14

Two calls to merge could be used to match data on programmers (each identified by developer_id and giving such details as their name, location, main skill, etc.) with the information about the open-source projects (each identified by project_id and informing us about its title, scope, web site, and so forth) they are engaged in (based on a third data frame featuring developer_id and project_id pairs).

As a simple illustration, consider the two following objects:

A <- data.frame(
    u=c("b0", "b1", "b2", "b3"),
    v=c("a0", "a1", "a2", "a3")
)
B <- data.frame(
    v=c("a0", "a2", "a2", "a4"),
    w=c("c0", "c1", "c2", "c3")
)

The two common columns, i.e., those storing data of similar nature (a-something strings), are both named v.
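As an aside, the key columns need not bear the same name in both data frames: merge's by.x and by.y parameters specify them separately. A minimal sketch on two hypothetical frames (A2 and B2 are not used elsewhere in this chapter):

```r
A2 <- data.frame(key=c("a0", "a1"), u=c("b0", "b1"))
B2 <- data.frame(v=c("a0", "a4"), w=c("c0", "c3"))
merge(A2, B2, by.x="key", by.y="v")  # match A2[["key"]] against B2[["v"]]
```

The matched column appears in the result under the name used in the first data frame (here: key).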
First, the inner (natural) join, where we list only the matching pairs:

merge(A, B)  # x=A, y=B, by="v", all.x=FALSE, all.y=FALSE
##    v  u  w
## 1 a0 b0 c0
## 2 a2 b2 c1
## 3 a2 b2 c2

Note that the common column (or, more generally, columns) is included only once in the result.

The left join guarantees that all elements in the first data frame will be included in the result:

merge(A, B, all.x=TRUE)  # by="v", all.y=FALSE
##    v  u    w
## 1 a0 b0   c0
## 2 a1 b1 <NA>
## 3 a2 b2   c1
## 4 a2 b2   c2
## 5 a3 b3 <NA>

The right join includes all records in the second argument:

merge(A, B, all.y=TRUE)  # by="v", all.x=FALSE
##    v    u  w
## 1 a0   b0 c0
## 2 a2   b2 c1
## 3 a2   b2 c2
## 4 a4 <NA> c3

And the full outer join is their set-theoretic union:

merge(A, B, all.x=TRUE, all.y=TRUE)  # by="v"
##    v    u    w
## 1 a0   b0   c0
## 2 a1   b1 <NA>
## 3 a2   b2   c1
## 4 a2   b2   c2
## 5 a3   b3 <NA>
## 6 a4 <NA>   c3

Exercise 12.15

Show how match (Section 5.4.1) can be used to implement a very basic version of merge.

### 12.3.4. Aggregating and Transforming Columns

Let us discuss how to perform data aggregation and feature engineering. Even though we already know how to access individual columns with [ and process them using the many vectorised functions, we still have something interesting to add on the said matter.

It would be tempting to try implementing such operations with apply. Unfortunately, this function currently coerces its argument to a matrix. Hence, we should refrain from applying it on data frames whose columns are of mixed types[8]. However, taking into account that data frames are special lists, we can always call Map and its relatives.
Example 12.16

Given an example data frame:

(iris_sample <- iris[sample(NROW(iris), 6), ])
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 28           5.2         3.5          1.5         0.2     setosa
## 80           5.7         2.6          3.5         1.0 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 133          6.4         2.8          5.6         2.2  virginica

To get the class of each column, we can call:

sapply(iris_sample, class)  # or unlist(Map(class, iris))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"

Next, here is a way to compute some aggregates of the numeric columns:

unlist(Map(mean, Filter(is.numeric, iris_sample)))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##       6.0667       3.1333       4.5500       1.7167

or:

sapply(iris_sample[sapply(iris_sample, is.numeric)], mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##       6.0667       3.1333       4.5500       1.7167

We can also fetch more than a single summary of each column:

as.data.frame(Map(
    function(x) c(Min=min(x), Max=max(x)),
    Filter(is.numeric, iris_sample)
))
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min          5.2         2.6          1.5         0.2
## Max          6.5         3.5          6.0         2.5

or:

sapply(iris_sample[sapply(iris_sample, is.numeric)], quantile, c(0, 1))
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0%            5.2         2.6          1.5         0.2
## 100%          6.5         3.5          6.0         2.5

Note that the latter called simplify2array automatically; thus, the result is a matrix.
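As an aside, vapply is a stricter cousin of sapply that could serve here too: we declare the expected shape of each result up front, and the call fails loudly on a mismatch. A sketch (on the full iris):

```r
# each mean must be a single numeric value, as declared by numeric(1)
vapply(Filter(is.numeric, iris), mean, numeric(1))
```

This is a useful habit in programmatic (non-interactive) contexts, where sapply's flexible simplification rules can be a source of subtle bugs.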
On the other hand, standardisation of all the numeric features can be performed, e.g., via a call:

iris_sample[] <- Map(function(x) {
    if (!is.numeric(x)) x
    else (x-mean(x))/sd(x)
}, iris_sample)
print(iris_sample)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 28      -1.70405     1.03024     -1.76004    -1.65318     setosa
## 80      -0.72094    -1.49854     -0.60591    -0.78117 versicolor
## 101      0.45878     0.46829      0.83674     0.85384  virginica
## 111      0.85202     0.18732      0.31738     0.30884  virginica
## 137      0.45878     0.74927      0.60591     0.74484  virginica
## 133      0.65540    -0.93659      0.60591     0.52684  virginica

### 12.3.5. Handling Missing Values

The is.na method for objects of class data.frame returns a logical matrix of the same dimensionality[9] indicating whether the corresponding items are missing or not. Of course, this function can still be called on individual columns as well. Further, na.omit can be used to get rid of rows with missing values.

Exercise 12.17

Given a data frame, use is.na and other functions such as apply, approx, etc., to:

1. remove all rows that feature at least one missing value,
2. remove all rows that only consist of missing values,
3. remove all columns that feature at least one missing value,
4. for each column, replace all missing values with the column averages,
5. for each column, replace all missing values with values that linearly interpolate between the preceding and succeeding well-defined observations (which is useful for time series), e.g., the blanks in c(0.60, 0.62, NA, 0.64, NA, NA, 0.58) should be filled so as to obtain c(0.60, 0.62, 0.63, 0.64, 0.62, 0.60, 0.58).

### 12.3.6. Reshaping Data Frames

Consider an example matrix:

A <- matrix(round(runif(6), 2), nrow=3,
    dimnames=list(
        c("X", "Y", "Z"),  # row labels
        c("u", "v")        # column labels
    ))
names(dimnames(A)) <- c("Row", "Col")
print(A)
##    Col
## Row    u    v
##   X 0.29 0.88
##   Y 0.79 0.94
##   Z 0.41 0.05

The as.data.frame method for the table class can be called directly on any array:

as.data.frame.table(A, responseName="Val")
##   Row Col  Val
## 1   X   u 0.29
## 2   Y   u 0.79
## 3   Z   u 0.41
## 4   X   v 0.88
## 5   Y   v 0.94
## 6   Z   v 0.05

This is an instance of reshaping an array, and more precisely, stacking: converting from a wide (okay, in this example, not so wide, as we only have two columns) to a long format. The same can also be achieved by means of the reshape function, which is more flexible and operates directly on data frames (but is harder to use):

(df <- `names<-`(
    data.frame(row.names(A), A, row.names=NULL),
    c("Row", "Col.u", "Col.v")))
##   Row Col.u Col.v
## 1   X  0.29  0.88
## 2   Y  0.79  0.94
## 3   Z  0.41  0.05
(stacked <- reshape(df, varying=2:3, direction="long"))
##     Row time  Col id
## 1.u   X    u 0.29  1
## 2.u   Y    u 0.79  2
## 3.u   Z    u 0.41  3
## 1.v   X    v 0.88  1
## 2.v   Y    v 0.94  2
## 3.v   Z    v 0.05  3

Maybe the default column names are not superb, but we can always adjust them manually afterwards.

The reverse operation is called unstacking:

reshape(stacked, idvar="Row", timevar="time", drop="id", direction="wide")
##     Row Col.u Col.v
## 1.u   X  0.29  0.88
## 2.u   Y  0.79  0.94
## 3.u   Z  0.41  0.05

Exercise 12.18

Given a named numeric vector, convert it to a data frame with two columns, for instance:

convert <- function(x) ...to.do...
x <- c(spam=42, eggs=7, bacon=3)
convert(x)
##     key value
## 1  spam    42
## 2  eggs     7
## 3 bacon     3

Exercise 12.19

Reshape (stack) the built-in WorldPhones dataset. Then, reshape (unstack) the stacked WorldPhones dataset. Further, unstack the stacked set, but first remove[10] five random rows from it and randomly permute all the remaining rows. Fill the missing entries with NAs.
Exercise 12.20

Implement a basic version of as.data.frame.table manually (using rep etc.). Also, write a function as.table.data.frame that implements its reverse. Make sure both functions are compatible with each other.

Exercise 12.21

The built-in Titanic is a four-dimensional array. Convert it to a long data frame.

Exercise 12.22

Perform what follows on the data frame defined below:

1. convert the second column from character to a list of character vectors (split at ",");
2. extract the first elements from each of the vectors;
3. extract the last elements;
4. (*) unstack the data frame;
5. (*) stack it back to a data frame featuring a list;
6. convert the list back to a character column (concatenate with "," as separator).

(x <- data.frame(
    name=c("Kat", "Ron", "Jo", "Mary"),
    food=c("buckwheat", "spam,bacon,spam", "", "eggs,spam,spam,lollipops")
))
##   name                     food
## 1  Kat                buckwheat
## 2  Ron          spam,bacon,spam
## 3   Jo
## 4 Mary eggs,spam,spam,lollipops

Exercise 12.23

Write a function that converts all matrix-based columns in a given data frame to separate, atomic columns. Also, write a function that does the opposite: one that groups all columns with similar prefixes and turns them into matrices.

### 12.3.7. Aggregating Data in Groups

We can straightforwardly apply various transforms on data groups determined by a factor-like variable or a combination thereof thanks to the split.data.frame method, which returns a list of data frames. For example:

x <- data.frame(
    a=c( 10,     20,     30,     40,     50),
    u=c("spam", "spam", "eggs", "spam", "eggs"),
    v=c(  1,      2,      1,      1,      1)
)
split(x, x["u"])  # i.e., split.data.frame(x, x["u"]) or x[["u"]]
## $eggs
##    a    u v
## 3 30 eggs 1
## 5 50 eggs 1
##
## $spam
##    a    u v
## 1 10 spam 1
## 2 20 spam 2
## 4 40 spam 1

This split x with respect to the u column serving as the grouping variable. On the other hand:

split(x, x[c("u", "v")])  # sep="."
## $eggs.1
##    a    u v
## 3 30 eggs 1
## 5 50 eggs 1
##
## $spam.1
##    a    u v
## 1 10 spam 1
## 4 40 spam 1
##
## $eggs.2
## [1] a u v
## <0 rows> (or 0-length row.names)
##
## $spam.2
##    a    u v
## 2 20 spam 2

partitioned it with respect to a combination of two factor-like sequences. Note that a non-existing level pair (eggs, 2) results in an empty data frame.

Exercise 12.24

split.data.frame (when called explicitly) can also be used to break a matrix into a list of matrices (rowwise). Given a matrix, perform its train-test split: allocate, say, 70% of the rows at random into one matrix and the remaining 30% into another one.

If an aggregation of grouped data in numeric columns is needed, sapply is quite convenient. To recall, it is a combination of lapply (a one-vector version of Map) and simplify2array (Section 11.1.3).

sapply(split(iris[1:2], iris[5]), sapply, mean)
##              setosa versicolor virginica
## Sepal.Length  5.006      5.936     6.588
## Sepal.Width   3.428      2.770     2.974

If the function being applied returns more than a single value, sapply will not return a too-informative result by default: the list of matrices converted to a matrix will not have the row.names argument set. As a workaround, we can either call simplify2array explicitly or pass simplify="array" to sapply:

(res <- sapply(
    split(iris[1:2], iris[5]),
    sapply,
    function(x) c(Min=min(x), Max=max(x)),
    simplify="array"
))  # or simplify2array(lapply or Map etc.)
## , , setosa
##
##     Sepal.Length Sepal.Width
## Min          4.3         2.3
## Max          5.8         4.4
##
## , , versicolor
##
##     Sepal.Length Sepal.Width
## Min          4.9         2.0
## Max          7.0         3.4
##
## , , virginica
##
##     Sepal.Length Sepal.Width
## Min          4.9         2.2
## Max          7.9         3.8

This yields a three-dimensional array, which is particularly handy if we would now like to access specific results by name:

res[, "Sepal.Length", "setosa"]
## Min Max
## 4.3 5.8

Also, the previously mentioned as.data.frame.table method works like a charm on it (up to the column names):

as.data.frame.table(res)
##    Var1         Var2       Var3 Freq
## 1   Min Sepal.Length     setosa  4.3
## 2   Max Sepal.Length     setosa  5.8
## 3   Min  Sepal.Width     setosa  2.3
## 4   Max  Sepal.Width     setosa  4.4
## 5   Min Sepal.Length versicolor  4.9
## 6   Max Sepal.Length versicolor  7.0
## 7   Min  Sepal.Width versicolor  2.0
## 8   Max  Sepal.Width versicolor  3.4
## 9   Min Sepal.Length  virginica  4.9
## 10  Max Sepal.Length  virginica  7.9
## 11  Min  Sepal.Width  virginica  2.2
## 12  Max  Sepal.Width  virginica  3.8

Note

If the grouping (by) variable is a list of two or more factors, the combined levels will be concatenated into a single string:

as.data.frame.table(as.array(sapply(
    split(ToothGrowth["len"], ToothGrowth[c("supp", "dose")]),
    sapply, mean
)))
##         Var1  Freq
## 1 OJ.0.5.len 13.23
## 2 VC.0.5.len  7.98
## 3   OJ.1.len 22.70
## 4   VC.1.len 16.77
## 5   OJ.2.len 26.06
## 6   VC.2.len 26.14

Also, the name of the aggregated column (len) has been included. This behaviour yields a result that may be deemed convenient in some contexts, but not necessarily so in other ones.

Exercise 12.25

Many aggregation functions are idempotent, which means that when they are fed with a vector whose elements are all identical, the result is exactly that unique element: min, mean, median, and max behave this way. Overload the mean and median methods for character vectors and factors so that they return NA if not all the elements are the same, and the unique value otherwise.
mean.character <- function(x, na.rm=FALSE, ...) ...to.do...
mean.factor <- function(x, na.rm=FALSE, ...) ...to.do...

This way, we can also aggregate the grouping variables in a convenient manner:

do.call(rbind.data.frame,
    lapply(split(ToothGrowth, ToothGrowth[c("supp", "dose")]), lapply, mean))
##          len supp dose
## OJ.0.5 13.23   OJ  0.5
## VC.0.5  7.98   VC  0.5
## OJ.1   22.70   OJ  1.0
## VC.1   16.77   VC  1.0
## OJ.2   26.06   OJ  2.0
## VC.2   26.14   VC  2.0

The built-in aggregate method can assist us in a situation where a single function is to be applied on all columns in a data frame:

aggregate(iris[-5], iris[5], mean)  # not: ...[[5]]
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], mean)
##   supp dose   len
## 1   OJ  0.5 13.23
## 2   VC  0.5  7.98
## 3   OJ  1.0 22.70
## 4   VC  1.0 16.77
## 5   OJ  2.0 26.06
## 6   VC  2.0 26.14

Note that the second argument, by, must be list-like (therefore, a data frame is also accepted), not a factor nor an atomic vector. Also, if the function being applied returns many values, they will be wrapped into a matrix column:

(x <- aggregate(iris[2], iris[5], function(x) c(Min=min(x), Max=max(x))))
##      Species Sepal.Width.Min Sepal.Width.Max
## 1     setosa             2.3             4.4
## 2 versicolor             2.0             3.4
## 3  virginica             2.2             3.8
class(x[["Sepal.Width"]])
## [1] "matrix" "array"
x[["Sepal.Width"]]  # not: Sepal.Width.Max, etc.
##      Min Max
## [1,] 2.3 4.4
## [2,] 2.0 3.4
## [3,] 2.2 3.8

This is actually handy: by referring to x[["Sepal.Width"]], we have access to all the stats for this column. Further, if many columns are aggregated at the same time, we can process all the summaries in the same way.

Exercise 12.26

Check out the built-in by function, which supports some basic split-apply-bind use cases. Note the particularly peculiar behaviour of the print method for the by class.
The most flexible scenario involves applying a custom function returning any set of aggregates in the form of a list and then row-binding the results to obtain a data frame.

Example 12.27

The following implements an R version of what we would express in SQL as:

SELECT supp, dose, AVG(len) AS ave_len, COUNT(*) AS count
FROM ToothGrowth
GROUP BY supp, dose

Ad rem:

do.call(rbind.data.frame, lapply(
    split(ToothGrowth, ToothGrowth[c("supp", "dose")]),
    function(df) list(
        supp=df[1, "supp"],
        dose=df[1, "dose"],
        ave_len=mean(df[["len"]]),
        count=NROW(df)
    )
))
##        supp dose ave_len count
## OJ.0.5   OJ  0.5   13.23    10
## VC.0.5   VC  0.5    7.98    10
## OJ.1     OJ  1.0   22.70    10
## VC.1     VC  1.0   16.77    10
## OJ.2     OJ  2.0   26.06    10
## VC.2     VC  2.0   26.14    10

Example 12.28

As an exercise, let us study a function that takes a named list x (which can be a data frame) and a sequence of col=f pairs, and applies the function f (or each function from a list of functions f) on the named element col of x:

napply <- function(x, ...)
{
    fs <- list(...)
    stopifnot(is.list(x), !is.null(names(x)))
    stopifnot(all(names(fs) %in% names(x)))
    do.call(
        c,  # concatenates lists
        lapply(
            structure(seq_along(fs), names=names(fs)),
            function(i) {  # always returns a list
                y <- x[[ names(fs)[i] ]]
                if (is.function(fs[[i]])) list(fs[[i]](y))
                else lapply(fs[[i]], function(f) f(y))
            }
        )
    )
}

For example:

first <- function(x, ...) head(x, n=1L, ...)  # we use it below
napply(ToothGrowth,
    supp=first, dose=first, len=list(ave=mean, count=length)
)
## $supp
## [1] VC
## Levels: OJ VC
##
## $dose
## [1] 0.5
##
## $len.ave
## [1] 18.813
##
## $len.count
## [1] 60


applies first on both ToothGrowth[["supp"]] and ToothGrowth[["dose"]], as well as mean and length on ToothGrowth[["len"]]. List names are there for some dramatic effect.

And now:

do.call(
    rbind.data.frame,
    lapply(
        split(ToothGrowth, ToothGrowth[c("supp", "dose")]),
        napply,
        supp=first, dose=first, len=list(ave=mean, count=length)
    )
)
##        supp dose len.ave len.count
## OJ.0.5   OJ  0.5   13.23        10
## VC.0.5   VC  0.5    7.98        10
## OJ.1     OJ  1.0   22.70        10
## VC.1     VC  1.0   16.77        10
## OJ.2     OJ  2.0   26.06        10
## VC.2     VC  2.0   26.14        10


or even:

aaaggg <- function(x, by, ...)
    do.call(rbind.data.frame, lapply(split(x, x[by]), napply, ...))


so that:

aaaggg(iris, "Species", Species=first, Sepal.Length=mean)
##               Species Sepal.Length
## setosa         setosa        5.006
## versicolor versicolor        5.936
## virginica   virginica        6.588


This brings fun back to R programming in the sad times when many things are given to us on a plate. By the way, the above has not been tested thoroughly; it is merely a proof of concept. As usual, testing, debugging, and extending it is left as an exercise to the reader.

Example 12.29

In Section 10.5, we considered an example where we used our own group_by function together with an aggregation method overloaded for the class of objects it returns.

Here is the function that splits a data frame into a list of data frames with respect to a combination of levels in given named columns:

group_by <- function(df, by)
{
    stopifnot(is.character(by), is.data.frame(df))
    df <- droplevels(df)  # in case there are factors with empty levels
    structure(
        split(df, df[names(df) %in% by]),
        class="list_dfs",
        by=by
    )
}


The next function applies a set of aggregates on every column of each data frame in a given list (two nested lapplys plus some cosmetic additions):

aggregate.list_dfs <- function(x, FUN, ...)
{
    aggregates <- lapply(x, function(df) {
        is_by <- names(df) %in% attr(x, "by")
        res <- lapply(df[!is_by], FUN, ...)
        res_mat <- do.call(rbind, res)
        if (is.null(dimnames(res_mat)[[2]]))
            dimnames(res_mat)[[2]] <- paste0("f", seq_len(NCOL(res_mat)))
        cbind(
            `row.names<-`(df[1, is_by, drop=FALSE], NULL),
            x=row.names(res_mat),
            `row.names<-`(res_mat, NULL)
        )
    })
    combined_aggregates <- do.call(rbind.data.frame, aggregates)
    `row.names<-`(combined_aggregates, NULL)
}

aggregate(group_by(ToothGrowth, c("supp", "dose")), range)
##   supp dose   x   f1   f2
## 1   OJ  0.5 len  8.2 21.5
## 2   VC  0.5 len  4.2 11.5
## 3   OJ  1.0 len 14.5 27.3
## 4   VC  1.0 len 13.6 22.5
## 5   OJ  2.0 len 22.4 30.9
## 6   VC  2.0 len 18.5 33.9


We really want our API to be bloated; hence, let us introduce a convenience function being a specialised version of the above:

mean.list_dfs <- function(x, ...)
    aggregate.list_dfs(x, function(y) c(Mean=mean(y, ...)))

mean(group_by(iris[51:150, c(2, 3, 5)], "Species"))
##      Species            x  Mean
## 1 versicolor  Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3  virginica  Sepal.Width 2.974
## 4  virginica Petal.Length 5.552


### 12.3.8. Transforming Data in Groups

Some variables will sometimes need to be transformed relative to what is happening in subsets of a dataset. This is the case, e.g., where we decide that missing values should be replaced by the corresponding within-group averages, or want to compute the relative ranks or z-scores.

If losing the original ordering of rows is not an issue, the standard split-apply-bind approach will suffice.

An example data frame:

(x <- data.frame(
    a=c( 10,   1,  NA,  NA,  NA,   4),
    b=c( -1,  10,  40,  30,   1,  20),
    c=runif(6),
    d=c("v", "u", "u", "u", "v", "u")
))
##    a  b       c d
## 1 10 -1 0.52811 v
## 2  1 10 0.89242 u
## 3 NA 40 0.55144 u
## 4 NA 30 0.45661 u
## 5 NA  1 0.95683 v
## 6  4 20 0.45333 u


Some operations:

fill_na <- function(x) `[<-`(x, is.na(x), value=mean(x[!is.na(x)]))
standardise <- function(x) (x-mean(x))/sd(x)


And now:

do.call(rbind.data.frame, lapply(
    split(x, x["d"]),
    function(df) {
        df[["a"]] <- fill_na(df[["a"]])
        df[["b"]] <- rank(df[["b"]])
        df[["c"]] <- standardise(df[["c"]])
        df
    }
))
##        a b        c d
## u.2  1.0 1  1.46357 u
## u.3  2.5 4 -0.17823 u
## u.4  2.5 3 -0.63478 u
## u.6  4.0 2 -0.65057 u
## v.1 10.0 1 -0.70711 v
## v.5 10.0 2  0.70711 v


Note that only the relative ordering of rows within groups has been retained. Overall, the rows are in a different order.

If this is an issue, we can use the unsplit function:

unsplit(
    lapply(
        split(x, x["d"]),
        function(df) {
            df[["a"]] <- fill_na(df[["a"]])
            df[["b"]] <- rank(df[["b"]])
            df[["c"]] <- standardise(df[["c"]])
            df
        }
    ),
    x["d"]
)
##      a b        c d
## 1 10.0 1 -0.70711 v
## 2  1.0 1  1.46357 u
## 3  2.5 4 -0.17823 u
## 4  2.5 3 -0.63478 u
## 5 10.0 2  0.70711 v
## 6  4.0 2 -0.65057 u
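As an aside, when a single numeric column is to be transformed within groups, base R's ave applies a function group-wise while preserving the original row order by itself. A small sketch on a standalone copy (x2) of the b and d columns above:

```r
x2 <- data.frame(
    b=c(-1, 10, 40, 30, 1, 20),
    d=c("v", "u", "u", "u", "v", "u")
)
x2[["b_rank"]] <- ave(x2[["b"]], x2[["d"]], FUN=rank)  # within-group ranks
```

No explicit split/unsplit is needed; the result vector is already aligned with the input rows.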

Exercise 12.30

Show how we can do the above also via the replacement version of split.

Example 12.31

Reverting to the previous ordering can be done manually too. It is because the split operation behaves as if we first ordered the data frame with respect to the grouping variable(s) (using a stable sorting algorithm).

Here is some transformation of a sample data frame split by a combination of two factors:

(x <- `row.names<-`(ToothGrowth[sample(NROW(ToothGrowth), 10), ], NULL))
##     len supp dose
## 1  23.0   OJ  2.0
## 2  23.3   OJ  1.0
## 3  29.4   OJ  2.0
## 4  14.5   OJ  1.0
## 5  11.2   VC  0.5
## 6  20.0   OJ  1.0
## 7  24.5   OJ  2.0
## 8  10.0   OJ  0.5
## 9   9.4   OJ  0.5
## 10  7.0   VC  0.5
(y <- do.call(rbind.data.frame, lapply(
    split(x, x[c("dose", "supp")]),  # two grouping variables
    function(df) {
        df[["len"]] <- df[["len"]] * 100^df[["dose"]] *  # whatever
            ifelse(df[["supp"]] == "OJ", -1, 1)          # do not overthink it
        df
    }
)))
##               len supp dose
## 0.5.OJ.8     -100   OJ  0.5
## 0.5.OJ.9      -94   OJ  0.5
## 1.OJ.2      -2330   OJ  1.0
## 1.OJ.4      -1450   OJ  1.0
## 1.OJ.6      -2000   OJ  1.0
## 2.OJ.1    -230000   OJ  2.0
## 2.OJ.3    -294000   OJ  2.0
## 2.OJ.7    -245000   OJ  2.0
## 0.5.VC.5      112   VC  0.5
## 0.5.VC.10      70   VC  0.5


In Section 5.4.4, we mentioned that by calling order, we can determine the inverse of a given permutation. Hence, we can call:

y[order(order(x[["supp"]], x[["dose"]])), ]  # not: dose, supp
##               len supp dose
## 2.OJ.1    -230000   OJ  2.0
## 1.OJ.2      -2330   OJ  1.0
## 2.OJ.3    -294000   OJ  2.0
## 1.OJ.4      -1450   OJ  1.0
## 0.5.VC.5      112   VC  0.5
## 1.OJ.6      -2000   OJ  1.0
## 2.OJ.7    -245000   OJ  2.0
## 0.5.OJ.8     -100   OJ  0.5
## 0.5.OJ.9      -94   OJ  0.5
## 0.5.VC.10      70   VC  0.5


Additionally, we can manually restore the original row.names, et voilà.
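Continuing the example, restoring the original row names is then a matter of reusing those of x once the rows are back in their initial order. A sketch:

```r
y2 <- y[order(order(x[["supp"]], x[["dose"]])), ]  # undo the split's reordering
row.names(y2) <- row.names(x)  # back to "1", "2", ..., "10"
```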

### 12.3.9. Metaprogramming-Based Techniques (*)

In Section 9.5.7, we have mentioned that due to R’s being equipped with the ability to write programs that manipulate unevaluated expressions, some functions can provide us with quite weird interfaces to a few common operations. These include transform, subset, with, and basically every procedure accepting a formula. Also, the popular data.table and dplyr packages that we briefly mention in Section 12.3.10 fall into this class.

In some contexts, they all may be found convenient[11].

However, overall, each of these methods must be carefully studied separately. This is because they can arbitrarily interpret the form of the arguments passed thereto, without taking into account their real meaning.

We try to avoid[12] them in this course, as we can do perfectly well without them. However, they are not only interesting on their own, but also quite popular in other users' code, hence the honourable mention. Learning them in more detail is left to the kind reader as an optional exercise. Later in this book, we will return to these functions, as they serve as a very interesting illustration of how to implement our own procedures that rely on metaprogramming techniques.

Example 12.32

For instance, let us consider an example call to the subset function:

subset(iris, Sepal.Length>7.5, -(Sepal.Width:Petal.Width))
##     Sepal.Length   Species
## 106          7.6 virginica
## 118          7.7 virginica
## 119          7.7 virginica
## 123          7.7 virginica
## 132          7.9 virginica
## 136          7.7 virginica


Neither Sepal.Length>7.5 nor -(Sepal.Width:Petal.Width) makes sense as a standalone R expression, because we have not defined the named variables used therein:

Sepal.Length>7.5            # utter nonsense
-(Sepal.Width:Petal.Width)  # gibberish


Only from help("subset") can we learn that this tool generously decides that the second expression plays the role of a row selector, and that the third one removes all the columns between the two given ones.

In our course, we pay attention to developing transferable skills. Assuming that R is not the only language we are going to learn during our long and happy lives, it is much more likely that in the next environment, we will be writing something of the more basic form:

between <- function(x, from, to) (which(from == x):which(to == x))
iris[iris[["Sepal.Length"]] > 7.5,
    -between(names(iris), "Sepal.Width", "Petal.Width")]
##     Sepal.Length   Species
## 106          7.6 virginica
## 118          7.7 virginica
## 119          7.7 virginica
## 123          7.7 virginica
## 132          7.9 virginica
## 136          7.7 virginica


Let us stress again that this is a book on how to become a great chef who proudly uses produce from sustainable sources, and not how to order ultra-processed food from DeliverNoodles.com.

Example 12.33

transform can be used to add, modify, and remove columns in a data frame with the possibility of referring to existing features as if they were ordinary variables:

head(transform(mtcars, log_hp=log(hp), am=2*am-1, hp=NULL))
##                    mpg cyl disp drat    wt  qsec vs am gear carb log_hp
## Mazda RX4         21.0   6  160 3.90 2.620 16.46  0  1    4    4 4.7005
## Mazda RX4 Wag     21.0   6  160 3.90 2.875 17.02  0  1    4    4 4.7005
## Datsun 710        22.8   4  108 3.85 2.320 18.61  1  1    4    1 4.5326
## Hornet 4 Drive    21.4   6  258 3.08 3.215 19.44  1 -1    3    1 4.7005
## Hornet Sportabout 18.7   8  360 3.15 3.440 17.02  0 -1    3    2 5.1648
## Valiant           18.1   6  225 2.76 3.460 20.22  1 -1    3    1 4.6540


Similarly, attach adds any named list to the search path (see Chapter 16) so that the columns can be accessed by name. This, however, does not allow any alterations thereof to be performed. As an alternative, with and within may be referred to if writing df[["..."]] each time is too difficult for us (it should not be):

within(head(mtcars), {
    log_hp <- log(hp)
    fuel_economy <- 235/mpg
    am <- factor(am, levels=c(0, 1), labels=c("no", "yes"))
    rm(list=c("mpg", "hp", "vs", "qsec"))
})
##                   cyl disp drat    wt  am gear carb fuel_economy log_hp
## Mazda RX4           6  160 3.90 2.620 yes    4    4       11.190 4.7005
## Mazda RX4 Wag       6  160 3.90 2.875 yes    4    4       11.190 4.7005
## Datsun 710          4  108 3.85 2.320 yes    4    1       10.307 4.5326
## Hornet 4 Drive      6  258 3.08 3.215  no    3    1       10.981 4.7005
## Hornet Sportabout   8  360 3.15 3.440  no    3    2       12.567 5.1648
## Valiant             6  225 2.76 3.460  no    3    1       12.983 4.6540
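By contrast, with merely evaluates a single expression in an environment built from a given data frame and returns its value, modifying nothing. A minimal sketch:

```r
# columns of mtcars are visible as ordinary variables inside the expression:
# the difference in mean fuel efficiency between manual and automatic cars
with(mtcars, mean(mpg[am == 1]) - mean(mpg[am == 0]))
```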

Example 12.34

As mentioned in Section 10.3.2 (see Section 16.5 for more details), formulas are special objects that consist of two unevaluated expressions separated by a tilde (~).

Functions can support formulas and do what they please with them, but a popular approach is to allow them to express “something grouped by something else” or “one thing as a function of other things”.

do.call(rbind.data.frame, lapply(split(ToothGrowth, ~supp+dose), head, 1))
##         len supp dose
## OJ.0.5 15.2   OJ  0.5
## VC.0.5  4.2   VC  0.5
## OJ.1   19.7   OJ  1.0
## VC.1   16.5   VC  1.0
## OJ.2   25.5   OJ  2.0
## VC.2   23.6   VC  2.0
aggregate(cbind(mpg, log_hp=log(hp))~am:cyl, mtcars, mean)
##   am cyl    mpg log_hp
## 1  0   4 22.900 4.4186
## 2  1   4 28.075 4.3709
## 3  0   6 19.125 4.7447
## 4  1   6 20.567 4.8552
## 5  0   8 15.050 5.2553
## 6  1   8 15.400 5.6950
head(model.frame(mpg+hp~log(hp)+I(1/qsec), mtcars))
##                   mpg + hp log(hp)    I(1/qsec)
## Mazda RX4            131.0  4.7005 0.060753....
## Mazda RX4 Wag        131.0  4.7005 0.058754....
## Datsun 710           115.8  4.5326 0.053734....
## Hornet 4 Drive       131.4  4.7005 0.051440....
## Hornet Sportabout    193.7  5.1648 0.058754....
## Valiant              123.1  4.6540 0.049455....


If these seem esoteric, it is because that is exactly the case. We need to consult the corresponding functions’ manuals to be able to understand what they do. And, as we do not recommend their use, we are not going to explain them here.

Exercise 12.35

In the last example, the peculiar printing of the last column is due to which method being overloaded?

### 12.3.10. A Note on the dplyr (tidyverse) and data.table Packages (*)

The popular third-party packages data.table and dplyr implement the most common data frame wrangling procedures. Moreover, some of the operations may be much faster for larger data sets.

They both introduce a completely new API for the operations we already know well how to perform. Furthermore, they are heavily based on metaprogramming (nonstandard evaluation). A good way to learn them is by solving some of the exercises listed below.

Note that dplyr is part of a huge system of interdependent packages called tidyverse, which tends to do things its own way and which has become quite invasive over the last few years. Nevertheless, R programmers should remember that they are not only able to do without these packages; they also need to when the processing of other prominent data structures is required, e.g., of fancy lists and matrices. Base R always comes first as the more fundamental layer.
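For orientation only, here is how one of the grouped aggregations computed earlier in base R might look in dplyr; a sketch assuming the package is installed:

```r
library(dplyr)  # assumes the dplyr package is installed
iris |>
    group_by(Species) |>
    summarise(mean_sepal_length = mean(Sepal.Length))  # returns a tibble
```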

Important

Some functions we may find useful will (annoyingly to base R users) return objects of class tibble (tbl_df) (e.g., haven::read.xpt that reads SAS data files). However, those are in fact data.frame subclasses and we can always use as.data.frame to get our favourite objects back.

Also, we cannot stress enough that it is SQL that we recommend learning as perhaps the most powerful interface to more considerable amounts of data, and one that gives skills that can be reused at a later time in other programming environments.

We should remember that base R proved a long time ago to be a versatile tool for rapid prototyping, calling specialised procedures written in C or Java, and wrangling data sets that fit into memory. For larger problems, techniques for working with batches of data, sampling methods, or aggregating data stored elsewhere are often the way to go, especially when machine learning models are to be built or visualisation[13] is required. Usually, the most recent data will be stored in normalised databases, and we will need to join a few tables in order to fetch something of interest in the current analysis context.
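In base R, such joins can be performed with merge. Here is a minimal sketch on two made-up tables (the names customers and orders and the key id are illustrative only):

```r
# two toy tables sharing a common key column, `id`:
customers <- data.frame(id=c(1, 2, 3), name=c("A", "B", "C"))
orders    <- data.frame(id=c(1, 1, 3), amount=c(10, 20, 5))

# an inner join (SQL: SELECT ... FROM customers JOIN orders USING (id)):
merge(customers, orders, by="id")  # 3 rows: id=1 twice, id=3 once

# all=TRUE would give a full outer join; all.x=TRUE a left join, etc.
```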

## 12.4. Exercises

Exercise 12.36

• What attributes must a data frame be equipped with?

• If row.names is an integer vector, how to access rows labelled 1, 7, and 42?

• How to create a data frame that features a column that is a list of character vectors of different lengths?

• How to create a data frame that includes a matrix column?

• How to convert all numeric columns in a data frame to a numeric matrix?

• Assuming that x is an atomic vector, what is the difference between “as.data.frame(x)” vs “as.data.frame(as.list(x))” vs “as.data.frame(list(a=x))” vs “data.frame(a=x)”?

Exercise 12.37

Assuming that x is a data frame, what is the meaning of/difference between the following:

• “x["u"]” vs “x[["u"]]” vs “x[, "u"]”?

• “x["u"][1]” vs “x[["u"]][1]” vs “x[1, "u"]” vs “x[1, "u", drop=FALSE]”?

• “x[which(x[[1]] > 0), ]” vs “x[x[[1]] > 0, ]”?

• “x[grep("^foo", names(x))]”?

Exercise 12.38

Assume we have a data frame with columns named like: ID (character), checked (logical, possibly with missing values), category (factor), x0, … x9 (ten separate numeric columns), y0, … y9 (ten separate numeric columns), coords (numeric matrix with two columns named lat and long), and features (list of character vectors of different lengths).

• How to extract the rows where checked is TRUE?

• How to extract a subset comprising only the ID and x-something columns?

• How to extract the rows for which ID is like 3 letters and then 5 digits (e.g., XYZ12345)?

• How to select all the numeric columns in one go?

• Assuming that the IDs are like three letters and then five digits, how to add two columns: ID3 (the letters) and ID5 (the five digits)?

• How to get rid of all the columns between x3 and y7?

• How to check where both lat and long in coords are positive?

• How to add a column giving the number of features?

• How to extract the rows where "spam" is amongst the features?

• How to convert it to a long data frame with two columns: ID and feature (individual strings)?

• How to change the name of the ID column to id?

• How to make the y-foo columns appear before the x-bar ones?

• How to order the rows with respect to checked (FALSE first, then TRUE) and IDs (decreasingly)?

• How to remove rows with duplicate IDs?

• How to determine how many entries correspond to each category?

• How to compute the average lat and long in each category?

• How to compute the average lat and long for each category and checked combined?

Exercise 12.39

Consider the flights dataset. Give some ways to select all rows between March and October (regardless of the year).

Exercise 12.40

In this task, you will be working with a version of a dataset on 70k+ Melbourne trees (urban_forest). Before proceeding any further, read the dataset’s description available here.

2. Fetch the IDs (CoM.ID) and trunk diameters (Diameter.Breast.Height) of five horse chestnuts with the smallest diameters at breast height. The output data frame must be sorted with respect to Diameter.Breast.Height, decreasingly.

3. Create a new data frame that gives the number of trees planted in each year.

4. Compute the average age (in years, based on Year.Planted; using aggregate) of the trees of genera (each genus separately): Eucalyptus, Platanus, Ficus, Acer, and Quercus. Depict the sorted data with barplot.

Exercise 12.41

(*) Consider the historic data dumps of https://travel.stackexchange.com/ available at https://github.com/gagolews/teaching-data/tree/master/travel_stackexchange_com_2017.

Export the CSV files located therein to an SQLite database. Then, write some R code that corresponds to the following SQL queries (use dbGetQuery to verify your results):

--- 1)
SELECT
Users.DisplayName,
Users.Age,
Users.Location,
SUM(Posts.FavoriteCount) AS FavoriteTotal,
Posts.Title AS MostFavoriteQuestion,
MAX(Posts.FavoriteCount) AS MostFavoriteQuestionLikes
FROM Posts
JOIN Users ON Users.Id=Posts.OwnerUserId
WHERE Posts.PostTypeId=1
GROUP BY OwnerUserId
ORDER BY FavoriteTotal DESC
LIMIT 10

--- 2)
SELECT
Posts.ID,
Posts.Title,
Posts2.PositiveAnswerCount
FROM Posts
JOIN (
SELECT
Posts.ParentID,
COUNT(*) AS PositiveAnswerCount
FROM Posts
WHERE Posts.PostTypeID=2 AND Posts.Score>0
GROUP BY Posts.ParentID
) AS Posts2
ON Posts.ID=Posts2.ParentID
ORDER BY Posts2.PositiveAnswerCount DESC
LIMIT 10

--- 3)
SELECT
Posts.Title,
UpVotesPerYear.Year,
MAX(UpVotesPerYear.Count) AS Count
FROM (
SELECT
PostId,
COUNT(*) AS Count,
STRFTIME('%Y', Votes.CreationDate) AS Year
FROM Votes
WHERE VoteTypeId=2
GROUP BY PostId, Year
) AS UpVotesPerYear
JOIN Posts ON Posts.Id=UpVotesPerYear.PostId
WHERE Posts.PostTypeId=1
GROUP BY Year

--- 4)
SELECT
Questions.Id,
Questions.Title,
BestAnswers.MaxScore,
Posts.Score AS AcceptedScore,
BestAnswers.MaxScore-Posts.Score AS Difference
FROM (
SELECT Id, ParentId, MAX(Score) AS MaxScore
FROM Posts
WHERE PostTypeId==2
GROUP BY ParentId
) AS BestAnswers
JOIN (
SELECT * FROM Posts
WHERE PostTypeId==1
) AS Questions
ON Questions.Id=BestAnswers.ParentId
JOIN Posts ON Questions.AcceptedAnswerId=Posts.Id
WHERE Difference>50
ORDER BY Difference DESC

--- 5)
SELECT
Posts.Title,
CmtTotScr.CommentsTotalScore
FROM (
SELECT
PostID,
UserID,
SUM(Score) AS CommentsTotalScore
FROM Comments
GROUP BY PostID, UserID
) AS CmtTotScr
JOIN Posts ON Posts.ID=CmtTotScr.PostID
AND Posts.OwnerUserId=CmtTotScr.UserID
WHERE Posts.PostTypeId=1
ORDER BY CmtTotScr.CommentsTotalScore DESC
LIMIT 10

--- 6)
SELECT DISTINCT
Users.Id,
Users.DisplayName,
Users.Reputation,
Users.Age,
Users.Location
FROM (
SELECT
Name, UserID
FROM Badges
WHERE Name IN (
SELECT
Name
FROM Badges
WHERE Class=1
GROUP BY Name
HAVING COUNT(*) BETWEEN 2 AND 10
)
AND Class=1
) AS ValuableBadges
JOIN Users ON ValuableBadges.UserId=Users.Id

--- 7)
SELECT
Posts.Title,
VotesByAge2.OldVotes
FROM Posts
JOIN (
SELECT
PostId,
MAX(CASE WHEN VoteDate = 'new' THEN Total ELSE 0 END) NewVotes,
MAX(CASE WHEN VoteDate = 'old' THEN Total ELSE 0 END) OldVotes,
SUM(Total) AS Votes
FROM (
SELECT
PostId,
CASE STRFTIME('%Y', CreationDate)
WHEN '2017' THEN 'new'
WHEN '2016' THEN 'new'
ELSE 'old'
END VoteDate,
COUNT(*) AS Total
FROM Votes
WHERE VoteTypeId=2
GROUP BY PostId, VoteDate
) AS VotesByAge
GROUP BY PostId
HAVING NewVotes=0
) AS VotesByAge2 ON Posts.ID=VotesByAge2.PostId
WHERE Posts.PostTypeId=1
ORDER BY VotesByAge2.OldVotes DESC
LIMIT 10

(*) Generate a CSV file featuring some random data arranged in a few columns, of size at least two times larger than your available RAM. Then, export the CSV file to an SQLite database. Use file connections (Section 8.3.5) and the nrows argument to read.table to be able to process it on a chunk-by-chunk basis.
Determine whether setting colClasses in read.table speeds up the reading of large CSV files significantly or not.
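The chunk-by-chunk reading scheme can be sketched as follows; to keep the example self-contained, we first generate a small temporary file (the chunk size of 10 rows and the column layout are arbitrary):

```r
# create a small demo CSV file (stand-in for the huge one):
tmp <- tempfile(fileext=".csv")
write.csv(data.frame(a=1:25, b=seq_len(25)/10), tmp,
    row.names=FALSE, quote=FALSE)

con <- file(tmp, open="r")  # keep the connection open between reads
header <- strsplit(readLines(con, n=1), ",")[[1]]  # column names

total <- 0
repeat {
    chunk <- tryCatch(
        read.table(con, nrows=10, sep=",", col.names=header,
            colClasses=c("integer", "numeric")),
        error=function(e) NULL)  # signalled when no lines are left
    if (is.null(chunk)) break
    total <- total + nrow(chunk)  # ...process/export the chunk here...
}
close(con)
total  # 25 rows processed, in chunks of at most 10
```

In the actual exercise, the body of the loop would append each chunk to the SQLite table, e.g., via DBI::dbWriteTable with append=TRUE.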