12. Data frames
The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug/typos reports/fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [26].
Most matrices are built on top of atomic vectors. Hence, they only allow items of the same type to be arranged into rows and columns. Data frames (objects of S3 class data.frame, first introduced in [13]), on the other hand, are collections of vectors of the same length or matrices with identical row counts, hence allowing us to represent structured[1] data of possibly heterogeneous types. For instance:
class(iris) # `iris` is an example built-in data frame
## [1] "data.frame"
iris[c(1, 51, 101), ] # 3 chosen rows from `iris`
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
It is a mix of numeric and factor-type data.
The good news is that not only are data frames built on named lists (e.g., to extract a column, we can refer to `[[`), but also many functions recognise them to be matrix-like (e.g., to select specific rows and columns, two indexes can be passed to `[` like in the example above). Hence, it will soon turn out that we already know a lot about how to perform basic data wrangling activities, even if we do not fully realise it now.
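For instance, both access styles already work on the iris data frame:

```r
iris[["Sepal.Length"]][1:3]              # list-like: `[[` extracts a column
## [1] 5.1 4.9 4.7
iris[1:3, c("Sepal.Length", "Species")]  # matrix-like: two indexers
##   Sepal.Length Species
## 1          5.1  setosa
## 2          4.9  setosa
## 3          4.7  setosa
```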
Important
Some readers will approach this chapter biased by what they know from elsewhere, including their experience with some popular third-party packages for data frame processing. The art is to filter out that information as noise (for the time being, at least). We will show how robust base R vocabulary is and how much can be implied from the material covered in the preceding chapters.
Yes, this book is like a breathtaking thriller/drama/love story: it is meant to be read from the beginning to the end. Please go back to the start of this comprehensive course if you happened to pop in here by accident or driven by “but I need to know now”. Good morning.
12.1. Creating data frames
12.1.1. data.frame and as.data.frame
Most frequently, data frames are based on a series of logical, numeric, or character vectors of identical lengths. In such a scenario, the data.frame function is particularly worthwhile.
(x <- data.frame(
a=c(TRUE, FALSE),
b=1:6,
c=runif(6),
d=c("spam", "spam", "eggs")
))
## a b c d
## 1 TRUE 1 0.77437 spam
## 2 FALSE 2 0.19722 spam
## 3 TRUE 3 0.97801 eggs
## 4 FALSE 4 0.20133 spam
## 5 TRUE 5 0.36124 spam
## 6 FALSE 6 0.74261 eggs
The shorter vectors were recycled. We can verify that the diverse column types were retained (no coercion was made) by calling:
str(x)
## 'data.frame': 6 obs. of 4 variables:
## $ a: logi TRUE FALSE TRUE FALSE TRUE FALSE
## $ b: int 1 2 3 4 5 6
## $ c: num 0.774 0.197 0.978 0.201 0.361 ...
## $ d: chr "spam" "spam" "eggs" "spam" ...
We can also fetch the class of each column directly by calling (compare Section 12.3):
sapply(x, class) # the same as unlist(Map(class, x))
## a b c d
## "logical" "integer" "numeric" "character"
Important
For many reasons (see, e.g., Section 12.1.5 and Section 12.1.6), we recommend always checking the type of each column, e.g., by calling the str function.
Many objects, such as matrices, can easily be coerced to data frames using particular as.data.frame methods.
Here is an example matrix:
(A <- matrix(1:6, nrow=3,
dimnames=list(
NULL, # no row labels
c("u", "v") # some column labels
)))
## u v
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
Let us convert it to a data frame:
as.data.frame(A) # as.data.frame.matrix
## u v
## 1 1 4
## 2 2 5
## 3 3 6
Note that a matrix with no row labels is printed slightly differently than a data frame with (as it will soon turn out) the default row.names.
Named lists are amongst other candidates for a meaningful conversion. Consider an example list where each element is a vector of the same length as the other ones:
(l <- Map(
function(x) {
c(Min=min(x), Median=median(x), Mean=mean(x), Max=max(x))
},
split(iris[["Sepal.Length"]], iris[["Species"]])
))
## $setosa
## Min Median Mean Max
## 4.300 5.000 5.006 5.800
##
## $versicolor
## Min Median Mean Max
## 4.900 5.900 5.936 7.000
##
## $virginica
## Min Median Mean Max
## 4.900 6.500 6.588 7.900
Each list element will be turned into a separate column:
as.data.frame(l) # as.data.frame.list
## setosa versicolor virginica
## Min 4.300 4.900 4.900
## Median 5.000 5.900 6.500
## Mean 5.006 5.936 6.588
## Max 5.800 7.000 7.900
Sadly, as.data.frame.list is not particularly fond of lists of vectors of incompatible lengths:
as.data.frame(list(a=1, b=11:12, c=21:23))
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names
## = TRUE, : arguments imply differing number of rows: 1, 2, 3
The above vectors could have been recycled with a warning, but they were not: the longest length, 3, is not a multiple of the shorter ones. With compatible lengths, recycling proceeds silently:
as.data.frame(list(a=1:4, b=11:12, c=21)) # recycling rule okay
## a b c
## 1 1 11 21
## 2 2 12 21
## 3 3 11 21
## 4 4 12 21
The method for the S3 class table (mentioned in Chapter 11) can be helpful as well. Here is an example contingency table together with its unstacked version.
(t <- table(mtcars[["vs"]], mtcars[["cyl"]]))
##
## 4 6 8
## 0 1 3 14
## 1 10 4 0
as.data.frame(t) # as.data.frame.table; see the stringsAsFactors note below!
## Var1 Var2 Freq
## 1 0 4 1
## 2 1 4 10
## 3 0 6 3
## 4 1 6 4
## 5 0 8 14
## 6 1 8 0
as.data.frame.table is so handy that we might want to call it directly on any array. This way, we can convert it from the so-called wide format to the long one; see Section 12.3.6 for more details.
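For example, applying it to a small three-dimensional array (contents arbitrary) gives one row per cell:

```r
A <- array(1:8, dim=c(2, 2, 2),
    dimnames=list(c("a", "b"), c("u", "v"), c("x", "y")))
as.data.frame.table(A, responseName="value")  # from wide to long
##   Var1 Var2 Var3 value
## 1    a    u    x     1
## 2    b    u    x     2
## 3    a    v    x     3
## 4    b    v    x     4
## 5    a    u    y     5
## 6    b    u    y     6
## 7    a    v    y     7
## 8    b    v    y     8
```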
Note
The above method is based on expand.grid, which determines all combinations of a given series of vectors.
expand.grid(1:2, c("a", "b", "c")) # see the stringsAsFactors note below!
## Var1 Var2
## 1 1 a
## 2 2 a
## 3 1 b
## 4 2 b
## 5 1 c
## 6 2 c
Overall, many classes of objects can be included[2] in a data frame; the popular choices include Date, POSIXct, and factor. It is worth noting that the data.frame function calls the corresponding as.data.frame method. Also, format is used whilst printing the columns. Here are two custom methods for what we would like to call from now on an S3 class spam:
as.data.frame.spam <- function(x, ...)
structure(
list(x),
class="data.frame",
row.names=seq_along(x)
)
format.spam <- function(x, ...)
paste0("*", x, "*")
Testing data frame creation and printing:
data.frame(
a=structure(c("a", "b", "c"), class="spam"),
b=factor(c("spam", "bacon", "spam")),
c=Sys.Date()+1:3
)
## a b c
## 1 *a* spam 2023-05-20
## 2 *b* bacon 2023-05-21
## 3 *c* spam 2023-05-22
12.1.2. cbind.data.frame and rbind.data.frame#
There are data frame-specific versions of cbind and rbind (which we discussed in the context of stacking matrices; Section 11.1.2). They are used quite eagerly: help("cbind") states that they will be referred to if at least[3] one of the arguments is a data frame, and the other arguments are atomic vectors or lists (possibly with the dim attribute). For example:
For example:
x <- iris[c(1, 51, 101), c("Sepal.Length", "Species")] # whatever
cbind(Yummy=c(TRUE, FALSE, TRUE), x)
## Yummy Sepal.Length Species
## 1 TRUE 5.1 setosa
## 51 FALSE 7.0 versicolor
## 101 TRUE 6.3 virginica
It added a new column to the data frame x. Moreover:
rbind(x, list(42, "virginica"))
## Sepal.Length Species
## 1 5.1 setosa
## 51 7.0 versicolor
## 101 6.3 virginica
## 11 42.0 virginica
It added a new row. Note that columns are of different types. Hence, the values to row-bind were provided as a generic vector. The list can also be named. It can consist of vectors of length greater than one, given in any order:
rbind(x, list(
Species=c("virginica", "setosa"),
Sepal.Length=c(42, 7)
))
## Sepal.Length Species
## 1 5.1 setosa
## 51 7.0 versicolor
## 101 6.3 virginica
## 11 42.0 virginica
## 2 7.0 setosa
Sometimes a direct referral to these methods will be necessary. Consider an example list of atomic vectors:
x <- list(a=1:3, b=11:13, c=21:23)
First, we call the generic, which dispatches to the default method:
do.call(cbind, x)
## a b c
## [1,] 1 11 21
## [2,] 2 12 22
## [3,] 3 13 23
If we want to ensure we garner a data frame as a result, we need to write:
do.call(cbind.data.frame, x)
## a b c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
This is particularly useful for fetching outputs from Map and its friends, which are wrapped inside a list. For instance:
l <- unname(Map(
function(x) list(
Sepal.Length=mean(x[["Sepal.Length"]]),
Sepal.Width=mean(x[["Sepal.Width"]]),
Species=x[["Species"]][1]
),
split(iris, iris[["Species"]]) # split.data.frame; see below
))
str(l)
## List of 3
## $ :List of 3
## ..$ Sepal.Length: num 5.01
## ..$ Sepal.Width : num 3.43
## ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1
## $ :List of 3
## ..$ Sepal.Length: num 5.94
## ..$ Sepal.Width : num 2.77
## ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 2
## $ :List of 3
## ..$ Sepal.Length: num 6.59
## ..$ Sepal.Width : num 2.97
## ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 3
It is nothing more than a fancy way to obtain an illustrative list, which we may now turn into a data frame by calling:
do.call(rbind.data.frame, l)
## Sepal.Length Sepal.Width Species
## 1 5.006 3.428 setosa
## 2 5.936 2.770 versicolor
## 3 6.588 2.974 virginica
On the other hand, do.call(rbind, l) does not return an amiable object type:
do.call(rbind, l)
## Sepal.Length Sepal.Width Species
## [1,] 5.006 3.428 setosa
## [2,] 5.936 2.77 versicolor
## [3,] 6.588 2.974 virginica
Despite the pretty face, it is a matrix… over a list:
str(do.call(rbind, l))
## List of 9
## $ : num 5.01
## $ : num 5.94
## $ : num 6.59
## $ : num 3.43
## $ : num 2.77
## $ : num 2.97
## $ : Factor w/ 3 levels "setosa","versicolor",..: 1
## $ : Factor w/ 3 levels "setosa","versicolor",..: 2
## $ : Factor w/ 3 levels "setosa","versicolor",..: 3
## - attr(*, "dim")= int [1:2] 3 3
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:3] "Sepal.Length" "Sepal.Width" "Species"
12.1.3. Reading data frames
Structured data can be imported from external sources, such as CSV/TSV (comma/tab-separated values) or HDF5 files, relational databases supporting SQL (see Section 12.1.4), web APIs (e.g., through the curl and jsonlite packages), spreadsheets [63], and so on.
In particular, read.csv and the like fetch data from plain text files consisting of records, where commas, semicolons, tabs, etc. separate the fields.
For instance:
x <- data.frame(a=runif(3), b=c(TRUE, FALSE, TRUE)) # example data frame
f <- tempfile() # temporary file name
write.csv(x, f, row.names=FALSE) # export
It created a CSV file that looks like:
cat(readLines(f), sep="\n") # print file contents
## "a","b"
## 0.287577520124614,TRUE
## 0.788305135443807,FALSE
## 0.4089769218117,TRUE
The above can be read by calling:
read.csv(f)
## a b
## 1 0.28758 TRUE
## 2 0.78831 FALSE
## 3 0.40898 TRUE
Check out help("read.table") for a long list of tunable parameters, especially: sep, dec, quote, header, comment.char, and row.names. Further, note that reading from compressed files is supported directly.
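For instance (a quick sketch; the file is temporary and its contents arbitrary):

```r
x <- data.frame(a=1:3, b=c("spam", "eggs", "ham"))
f <- tempfile(fileext=".csv.gz")
write.csv(x, gzfile(f), row.names=FALSE)  # write via a gzip connection
read.csv(f)                               # decompressed on the fly
##   a    b
## 1 1 spam
## 2 2 eggs
## 3 3  ham
```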
Important
CSV is the most portable and user-friendly format for exchanging matrix-like objects between different programs and computing languages (Python, Julia, LibreOffice Calc, etc.). Such files can be opened in any text editor.
Note
As mentioned in Section 8.3.5,
we can process data frames on a chunk-by-chunk basis,
This is beneficial especially when data do not fit into memory
(compare the nrows
argument to read.csv).
12.1.4. Interfacing relational databases and querying with SQL (*)
The DBI package provides a universal interface for particular database management systems whose drivers are in additional add-ons such as RSQLite, RMariaDB, RPostgreSQL, etc., or, more generally, RODBC or odbc. For more details, see Section 4 of [63].
Let us play with an in-memory (volatile) instance of an SQLite database.
library("DBI")
con <- dbConnect(RSQLite::SQLite(), ":memory:")
It returned an object representing a database connection which we can refer to in further communication.
An easy way to create a database table is to call:
dbWriteTable(con, "mtcars", mtcars) # `mtcars` is a toy built-in data frame
Alternatively, dbExecute could have been called to send SQL statements such as CREATE TABLE ... followed by a series of INSERT INTO ....
Some data retrieval can now follow:
dbGetQuery(con, "
SELECT cyl, vs, AVG(mpg) AS mpg_ave, AVG(hp) AS hp_ave
FROM mtcars
GROUP BY cyl, vs
")
## cyl vs mpg_ave hp_ave
## 1 4 0 26.000 91.00
## 2 4 1 26.730 81.80
## 3 6 0 20.567 131.67
## 4 6 1 19.125 115.25
## 5 8 0 15.100 209.21
It gave us an ordinary R data frame. We can process it in the same fashion as any other object of this kind.
At the end, the database connection must be closed.
dbDisconnect(con)
Database passwords must never be stored in plain text files, let alone in R scripts in version-controlled repositories. Consider a few ways to fetch credentials programmatically:
- using environment variables (see help("Sys.getenv")),
- using the keyring package,
- calling system2 (Section 7.3.2) to retrieve them from the system keyring (e.g., the keyring package for Python provides a platform-independent command-line utility).
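For instance, a sketch of the first approach (DB_PASSWORD is an assumed variable name; it would be set outside of R, e.g., in the shell or in a .Renviron file):

```r
password <- Sys.getenv("DB_PASSWORD")  # yields "" if the variable is unset
if (!nzchar(password))
    message("DB_PASSWORD is not set; see help(\"Sys.setenv\")")
```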
12.1.5. Strings as factors?
The following is so critical that we devote a separate section to discuss it so we always remain vigilant (such is life: maintaining some level of mindfulness is often a good idea).
Important
Some functions related to data frames automatically convert character vectors to factors. This behaviour is frequently controlled by the stringsAsFactors argument.
This is particularly problematic because, when printed, factor and character columns look identical:
(x <- data.frame(a=factor(c("U", "V")), b=c("U", "V")))
## a b
## 1 U U
## 2 V V
We recall from Section 10.3.2 that factors can be nasty. For example, passing factors as indexers in `[` or converting them with as.numeric might give counterintuitive (for the uninformed) results. Also, when we want to extend factors by previously unobserved data, new levels must manually be added. This can cause some unexpected behaviour in contexts such as:
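To recall the as.numeric pitfall:

```r
f <- factor(c(4, 6, 8))
as.numeric(f)                # the underlying level codes, not the data!
## [1] 1 2 3
as.numeric(as.character(f))  # the usual workaround
## [1] 4 6 8
```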
rbind(x, c("W", "W"))
## Warning in `[<-.factor`(`*tmp*`, ri, value = "W"): invalid factor level,
## NA generated
## a b
## 1 U U
## 2 V V
## 3 <NA> W
Therefore, always having the data types checked is a praiseworthy habit. For instance:
str(x)
## 'data.frame': 2 obs. of 2 variables:
## $ a: Factor w/ 2 levels "U","V": 1 2
## $ b: chr "U" "V"
Before R 4.0, some functions, including data.frame and read.csv, had the stringsAsFactors argument defaulting to TRUE. It is no longer the case for many of them. However, exceptions to this rule still exist, including as.data.frame.table and expand.grid. Besides, some built-in example data frames have factor-typed columns inherited from the old days, e.g.:
class(iris[["Species"]])
## [1] "factor"
We observe that the Species column in iris is not of the type character. Thence, adding a new variety might be tricky:
iris2 <- iris[c(1, 51, 101), ] # example subset
levels(iris2[["Species"]]) <- c(levels(iris2[["Species"]]), "croatica")
rbind(iris2, c(6, 3, 3, 2, "croatica"))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 51 7 3.2 4.7 1.4 versicolor
## 101 6.3 3.3 6 2.5 virginica
## 4 6 3 3 2 croatica
Alternatively, we could have simply converted the Species column to the type character.
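For example, after such a conversion, no manual level bookkeeping is needed:

```r
iris2 <- iris[c(1, 51, 101), ]  # example subset
iris2[["Species"]] <- as.character(iris2[["Species"]])
rbind(iris2, list(6, 3, 3, 2, "croatica"))  # a list keeps the numeric columns numeric
```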
12.1.6. Internal representation
Objects of S3 class data.frame are built on lists of vectors of the same length or matrices with identical row counts. Such lists define their consecutive columns. Apart from class, they must be equipped with the following special attributes:
- names – a character vector (as usual in any named list) labelling the columns or their groups,
- row.names – a character or integer vector with no duplicates nor missing values, doing what is advertised.
Therefore, a data frame can be created from scratch by calling, for example:
structure(
list(a=11:13, b=21:23), # sets the `names` attribute already
row.names=1:3,
class="data.frame"
)
## a b
## 1 11 21
## 2 12 22
## 3 13 23
Here is a data frame based on a length-5 list, a matrix with five rows, and a length-5 numeric vector with some fancy row names on top:
structure(
list(
a=list(1, 1:2, 1:3, numeric(0), -(4:1)),
b=cbind(u=11:15, v=21:25),
c=runif(5)
),
row.names=c("spam", "bacon", "eggs", "ham", "aubergine"),
class="data.frame"
)
## a b.u b.v c
## spam 1 11 21 0.28758
## bacon 1, 2 12 22 0.78831
## eggs 1, 2, 3 13 23 0.40898
## ham 14 24 0.88302
## aubergine -4, -3, -2, -1 15 25 0.94047
In general, the columns of the type list
can contain anything, e.g., other
lists or R functions. Including atomic vectors
of varying lengths, just like above, allows for creating something à la
ragged arrays – a pretty handy scenario.
The issue with matrix entries, on the other hand, is that they appear as if they were many. Still, as it will turn out in the sequel, they are often treated as a single complex column, e.g., by the index operator (see Section 12.2). Therefore, from this perspective, the above data frame has three columns, not four. Such compound columns can be output by aggregate (see Section 12.3), amongst others. They are valuable in certain contexts: the column groups can be easily accessed as a whole and batch-processed in the same way.
Important
Unfortunately, data frames with list or matrix columns cannot normally be created with the data.frame or cbind functions. This might explain why they are less popular. This behaviour is dictated by the underlying as.data.frame methods, which they both call. As a curiosity, see help("I"), though.
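In brief, wrapping an object with I (inhibit interpretation) makes data.frame include it as a single column as-is:

```r
x <- data.frame(a=I(list(1, 1:2, 1:3)), b=11:13)
str(x[["a"]])  # a list column storing vectors of varying lengths
```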
Verify that if a data frame has a matrix column, the matrix does not need to have any column names (the second element of dimnames).
The names and row.names attributes are special in the sense of Section 4.4.3. In particular, they can be accessed or modified by the corresponding functions. It is worth noting that row.names(df) always returns a character vector, even when attr(df, "row.names") is an integer vector. Further, calling "row.names(df) <- NULL" will reset[4] this attribute to the most commonly desired case of consecutive natural numbers, for example:
(x <- iris[c(1, 51, 101), ]) # comes with some sad row names
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
row.names(x) <- NULL # reset to seq_len(NROW(x))
print(x)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 7.0 3.2 4.7 1.4 versicolor
## 3 6.3 3.3 6.0 2.5 virginica
What is the name of the replacement version of the row.names method for the data.frame class?
Implement your version of expand.grid.
Write a version of xtabs that does not rely on a formula interface (compare Section 10.3.4). Allow three parameters: a data frame, the name of the "counts" column, and the names of the cross-classifying variables. Hence, my_xtabs(x, "Freq", c("Var1", "Var2")) should be equivalent to xtabs(Freq~Var1+Var2, x).
12.2. Data frame subsetting
12.2.1. Data frames are lists
Data frames are named lists, where each element represents an individual column. Therefore[5], length yields the number of columns and names gives their respective labels.
Let us play with the following data frame:
(x <- data.frame(
a=runif(6),
b=rnorm(6),
c=LETTERS[1:6],
d1=c(FALSE, TRUE, FALSE, NA, FALSE, NA),
d2=c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
))
## a b c d1 d2
## 1 0.287578 0.070508 A FALSE FALSE
## 2 0.788305 0.129288 B TRUE TRUE
## 3 0.408977 1.715065 C FALSE FALSE
## 4 0.883017 0.460916 D NA TRUE
## 5 0.940467 -1.265061 E FALSE FALSE
## 6 0.045556 -0.686853 F NA TRUE
typeof(x) # each data frame is a list
## [1] "list"
length(x) # the number of columns
## [1] 5
names(x) # column labels
## [1] "a" "b" "c" "d1" "d2"
The one-argument versions of extract and index operators behave as expected. `[[` fetches (looks inside) the contents of a given column:
x[["a"]] # or x[[1]]
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556
`[` returns a data frame (a list with extras) comprised of the specified elements:
x["a"] # or x[1]
## a
## 1 0.287578
## 2 0.788305
## 3 0.408977
## 4 0.883017
## 5 0.940467
## 6 0.045556
x[c(TRUE, TRUE, FALSE, TRUE, FALSE)]
## a b d1
## 1 0.287578 0.070508 FALSE
## 2 0.788305 0.129288 TRUE
## 3 0.408977 1.715065 FALSE
## 4 0.883017 0.460916 NA
## 5 0.940467 -1.265061 FALSE
## 6 0.045556 -0.686853 NA
Just like with lists, the replacement versions of the said operators can be used to add new or replace existing columns.
y <- head(x, 1) # for a more compact display
y[["a"]] <- round(y[["a"]], 1) # replaces the column with new content
y[["b"]] <- NULL # removes the column, like, totally
y[["e"]] <- 10*y[["a"]]^2 # adds a new column at the end
print(y)
## a c d1 d2 e
## 1 0.3 A FALSE FALSE 0.9
Some spam for thought to show how much we already know. Here are a few common use cases of indexing using some popular vectorised functions.
y <- head(x, 1) # example data frame (more compact display)
Move the column a to the end:
y[unique(c(names(y), "a"), fromLast=TRUE)]
## b c d1 d2 a
## 1 0.070508 A FALSE FALSE 0.28758
Remove the columns a and c:
y[-match(c("a", "c"), names(y))]
## b d1 d2
## 1 0.070508 FALSE FALSE
Select all columns between a and c:
y[match("a", names(y)):match("c", names(y))]
## a b c
## 1 0.28758 0.070508 A
Fetch the columns with names starting with d:
y[grep("^d", names(y))]
## d1 d2
## 1 FALSE FALSE
Change the name of column c to z:
names(y)[names(y) == "c"] <- "z"
print(y)
## a b z d1 d2
## 1 0.28758 0.070508 A FALSE FALSE
Change names: d2 to u and d1 to v:
names(y)[match(c("d2", "d1"), names(y))] <- c("v", "u")
print(y)
## a b z u v
## 1 0.28758 0.070508 A FALSE FALSE
Note
Some R users might prefer the `$` operator over `[[`, but we do not. By default, the former supports partial matching of column names, which might be appealing when R is used interactively. Nonetheless, it does not work on matrices, nor does it allow for programmatically generated names. It is also trickier to use on labels that are not syntactically valid; compare Section 9.4.1.
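To see the partial matching in action:

```r
x <- data.frame(value=c(10, 20, 30))
x$val       # partial matching: silently fetches the `value` column
## [1] 10 20 30
x[["val"]]  # exact matching: there is no such column
## NULL
```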
Write a function names_replace that changes the names of columns based on a translation table given in a from=to fashion, for instance:
names_replace <- function(x, ...) ...to.do...
x <- data.frame(a=1, b=2, c=3)
names_replace(x, c="new_c", a="new_a")
## new_a b new_c
## 1 1 2 3
12.2.2. Data frames are matrix-like
Data frames can be considered "generalised" matrices. They store data of any kind (possibly mixed) organised in a tabular fashion. Some functions mentioned in the previous chapter will be overloaded for the data frame case. These include: dim (despite the lack of the dim attribute), NROW, NCOL, and dimnames (which is, of course, based on row.names and names). For example:
(x <- data.frame(
a=runif(6),
b=rnorm(6),
c=LETTERS[1:6],
d1=c(FALSE, TRUE, FALSE, NA, FALSE, NA),
d2=c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
))
## a b c d1 d2
## 1 0.287578 0.070508 A FALSE FALSE
## 2 0.788305 0.129288 B TRUE TRUE
## 3 0.408977 1.715065 C FALSE FALSE
## 4 0.883017 0.460916 D NA TRUE
## 5 0.940467 -1.265061 E FALSE FALSE
## 6 0.045556 -0.686853 F NA TRUE
dim(x) # the number of rows and columns
## [1] 6 5
dimnames(x) # it is not a matrix, but a matrix-like object
## [[1]]
## [1] "1" "2" "3" "4" "5" "6"
##
## [[2]]
## [1] "a" "b" "c" "d1" "d2"
In addition to the list-like behaviour, which only allows for dealing with particular columns or groups thereof, the `[` operator was also equipped with the ability to take two indexers:
x[1:2, ] # first two rows
## a b c d1 d2
## 1 0.28758 0.070508 A FALSE FALSE
## 2 0.78831 0.129288 B TRUE TRUE
x[x[["a"]] >= 0.3 & x[["a"]] <= 0.8, -2] # or use x[, "a"]
## a c d1 d2
## 2 0.78831 B TRUE TRUE
## 3 0.40898 C FALSE FALSE
Recall the drop argument to `[` and its effects on matrix indexing. In the current case, its behaviour will be similar with regard to the operations on individual columns:
x[, 1] # synonym: x[[1]] as drop=TRUE
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556
x[, 1, drop=FALSE] # synonym: x[1]
## a
## 1 0.287578
## 2 0.788305
## 3 0.408977
## 4 0.883017
## 5 0.940467
## 6 0.045556
When we extract a single row and more than one column, drop does not apply. It is because the columns (unlike in matrices) can potentially be of different types:
x[1, 1:2] # two numeric columns, but the result is still a data frame
## a b
## 1 0.28758 0.070508
However:
x[1, 1]
## [1] 0.28758
x[1, 1, drop=FALSE]
## a
## 1 0.28758
Note
Once again, let us take note of logical indexing featuring missing values:
x[x[["d1"]], ]
## a b c d1 d2
## 2 0.78831 0.12929 B TRUE TRUE
## NA NA NA <NA> NA NA
## NA.1 NA NA <NA> NA NA
x[which(x[["d1"]]), ] # drops missing values
## a b c d1 d2
## 2 0.78831 0.12929 B TRUE TRUE
The default behaviour is consistent with many other R functions: it explicitly indicates that something is missing (we are selecting a “don’t know”; hence, the result is “don’t know” as well). Regretfully, this comes with no warning. As we rarely manually check missing values in the outputs, our absent-mindedness can lead to code bugs.
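If missing values are to be treated as "do not select", %in% (which never yields NA) is a concise alternative to which:

```r
y <- data.frame(a=1:3, d1=c(FALSE, TRUE, NA))
y[y[["d1"]] %in% TRUE, ]            # NA %in% TRUE gives FALSE
##   a   d1
## 2 2 TRUE
y[!is.na(y[["d1"]]) & y[["d1"]], ]  # equivalent, more explicit
##   a   d1
## 2 2 TRUE
```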
By now, we might have already noted that the index operator adjusts (not: resets) the row.names attribute. For instance:
(xs <- x[head(order(x[["a"]], decreasing=TRUE), 3), ])
## a b c d1 d2
## 5 0.94047 -1.26506 E FALSE FALSE
## 4 0.88302 0.46092 D NA TRUE
## 2 0.78831 0.12929 B TRUE TRUE
It is a version of x comprised of only the top three values in the a column. Indexing by means of character vectors will refer to row.names and names:
xs["5", c("a", "b")]
## a b
## 5 0.94047 -1.2651
This is not the same as "xs[5, c("a", "b")]", even though row.names is formally an integer vector here.
Note
If a data frame features a matrix, we need to use the index/extract operator twice in order to access a specific sub-column:
(x <- aggregate(iris[1], iris[5], function(x) c(Min=min(x), Max=max(x))))
## Species Sepal.Length.Min Sepal.Length.Max
## 1 setosa 4.3 5.8
## 2 versicolor 4.9 7.0
## 3 virginica 4.9 7.9
x[["Sepal.Length"]][, "Min"]
## [1] 4.3 4.9 4.9
In other words, neither "x[["Sepal.Length.Min"]]" nor "x[, "Sepal.Length.Min"]" works.
Regarding the replacement version of the index operator, it is quite a flexible tool. It allows the new content to be a vector, a data frame, a list, or even a matrix.
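A short sketch (column names and values arbitrary):

```r
y <- data.frame(a=1:3, b=c("u", "v", "w"))
y[2, ] <- list(200L, "V")       # new content given as a list (mixed types)
y[y[["b"]] == "w", "a"] <- 0L   # ... or as an atomic vector
print(y)
##     a b
## 1   1 u
## 2 200 V
## 3   0 w
```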
Write two replacement functions[6]. First, set_row_names, which replaces the row.names of a data frame with the contents of a specific column, for example:
(x <- aggregate(iris[1], iris[5], mean)) # some data frame
## Species Sepal.Length
## 1 setosa 5.006
## 2 versicolor 5.936
## 3 virginica 6.588
set_row_names(x) <- "Species"
print(x)
## Sepal.Length
## setosa 5.006
## versicolor 5.936
## virginica 6.588
Second, reset_row_names, which converts row.names to a standalone column of a given name, for instance:
reset_row_names(x) <- "Type"
print(x)
## Sepal.Length Type
## 1 5.006 setosa
## 2 5.936 versicolor
## 3 6.588 virginica
These two functions may be handy as they allow for writing "x[something, ]" instead of "x[x[["column"]] %in% something, ]".
12.3. Common operations
Below we review the most commonly applied operations
related to data frame wrangling. We have a few dedicated functions
or methods overloaded for the data.frame
class. However,
we have already mastered the necessary skills to deal with
such objects through our hard work: in particular, by solving
all the exercises listed in the preceding chapters.
Let us repeat: data frames are just lists exhibiting matrix-like behaviour.
12.3.1. Ordering rows
Ordering rows in a data frame with respect to different criteria can be easily achieved through the order function and the two-argument version of `[`.
For instance, here are the top six cars in terms of the time (in seconds) to complete a 402-metre race:
mtcars6 <- mtcars[order(mtcars[["qsec"]])[1:6], ]
mtcars6[["model"]] <- row.names(mtcars6)
row.names(mtcars6) <- NULL
print(mtcars6)
## mpg cyl disp hp drat wt qsec vs am gear carb model
## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L
## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora
## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28
## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino
## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360
## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4
order uses a stable sorting algorithm. Therefore,
any sorting with respect to a different criterion will not break
the relative ordering of qsec
in row groups with ties:
mtcars6[order(mtcars6[["cyl"]]), ]
## mpg cyl disp hp drat wt qsec vs am gear carb model
## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino
## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4
## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L
## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora
## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28
## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360
Notice the difference between ordering by cyl and gear vs by gear and cyl:
mtcars6[order(mtcars6[["cyl"]], mtcars6[["gear"]]), ]
## mpg cyl disp hp drat wt qsec vs am gear carb model
## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4
## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino
## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28
## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360
## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L
## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora
mtcars6[order(mtcars6[["gear"]], mtcars6[["cyl"]]), ]
## mpg cyl disp hp drat wt qsec vs am gear carb model
## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28
## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360
## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4
## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino
## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L
## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora
Note
Mixing increasing and decreasing ordering is tricky
as the decreasing
argument to order
currently does not accept multiple flags
in all the contexts. Perhaps the easiest way to change the ordering
direction is to use the unary minus operator on the column(s)
to be sorted decreasingly.
mtcars6[order(mtcars6[["gear"]], -mtcars6[["cyl"]]), ]
## mpg cyl disp hp drat wt qsec vs am gear carb model
## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28
## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360
## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4
## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L
## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora
## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino
For factor and character columns, xtfrm can convert them to sort keys first.
mtcars6[order(mtcars6[["cyl"]], -xtfrm(mtcars6[["model"]])), ]
## mpg cyl disp hp drat wt qsec vs am gear carb model
## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4
## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino
## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora
## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L
## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360
## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28
Both of the above behave like decreasing=c(FALSE, TRUE).
Write a method sort.data.frame that orders a data frame with respect to a given set of columns.
sort.data.frame <- function(x, decreasing=FALSE, cols) ...to.do...
sort(mtcars6, cols=c("cyl", "model"))
## mpg cyl disp hp drat wt qsec vs am gear carb model
## 4 19.7 6 145 175 3.62 2.77 15.50 0 1 5 6 Ferrari Dino
## 6 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 Mazda RX4
## 3 13.3 8 350 245 3.73 3.84 15.41 0 0 3 4 Camaro Z28
## 5 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 Duster 360
## 1 15.8 8 351 264 4.22 3.17 14.50 0 1 5 4 Ford Pantera L
## 2 15.0 8 301 335 3.54 3.57 14.60 0 1 5 8 Maserati Bora
Unfortunately, the requirement that decreasing be of length one and placed as the second argument is imposed by the sort S3 generic.
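With these constraints in mind, here is one possible sketch of a solution (not the only way to go about it; we also assume that none of the chosen columns is named like one of order's formal arguments):

```r
sort.data.frame <- function(x, decreasing=FALSE, cols)
{
    stopifnot(is.data.frame(x), is.character(cols), cols %in% names(x))
    keys <- lapply(x[cols], xtfrm)   # numeric sort keys, one per column
    if (decreasing)
        keys <- lapply(keys, `-`)    # flip all the keys at once
    x[do.call(order, keys), ]        # order by all keys, then subset
}
```

xtfrm makes the method applicable to factor and character columns as well, as discussed above.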
12.3.2. Handling duplicated rows#
duplicated, anyDuplicated, and unique
have methods overloaded for the data.frame
class.
They can be used to indicate, get rid of, or replace
the repeating rows.
sum(duplicated(iris)) # how many duplicated rows are there?
## [1] 1
iris[duplicated(iris), ] # show the duplicated rows
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 143 5.8 2.7 5.1 1.9 virginica
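For instance, to get rid of the repeating rows (keeping the first occurrence of each), we can index with the negated mask or simply call unique:

```r
iris2 <- iris[!duplicated(iris), ]  # equivalently: unique(iris)
NROW(iris2)                         # one row fewer than the original 150
## [1] 149
```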
12.3.3. Joining (merging) data frames#
The merge function can perform the JOIN operation that some readers might know from SQL[7]. It matches the items in the columns that two given data frames somewhat share and then returns their combination.
Two calls to merge could be used to
match data on programmers (each identified by developer_id
and giving such details as their name, location, main skill, etc.)
with the information about the open-source projects
(each identified by project_id
and informing us about
its title, scope, web site, and so forth) they are engaged in
(based on a third data frame featuring developer_id
and project_id
pairs).
As a simple illustration, consider the two following objects:
A <- data.frame(
u=c("b0", "b1", "b2", "b3"),
v=c("a0", "a1", "a2", "a3")
)
B <- data.frame(
v=c("a0", "a2", "a2", "a4"),
w=c("c0", "c1", "c2", "c3")
)
The two common columns, i.e., those storing data of similar nature (a-something strings), are both named v.
First, the inner (natural) join, where we list only the matching pairs:
merge(A, B) # x=A, y=B, by="v", all.x=FALSE, all.y=FALSE
## v u w
## 1 a0 b0 c0
## 2 a2 b2 c1
## 3 a2 b2 c2
The common column (or, more generally, columns) is included only once in the result.
The left join guarantees that all elements in the first data frame will be included in the result:
merge(A, B, all.x=TRUE) # by="v", all.y=FALSE
## v u w
## 1 a0 b0 c0
## 2 a1 b1 <NA>
## 3 a2 b2 c1
## 4 a2 b2 c2
## 5 a3 b3 <NA>
The right join includes all records in the second argument:
merge(A, B, all.y=TRUE) # by="v", all.x=FALSE
## v u w
## 1 a0 b0 c0
## 2 a2 b2 c1
## 3 a2 b2 c2
## 4 a4 <NA> c3
The full outer join is their set-theoretic union:
merge(A, B, all.x=TRUE, all.y=TRUE) # by="v"
## v u w
## 1 a0 b0 c0
## 2 a1 b1 <NA>
## 3 a2 b2 c1
## 4 a2 b2 c2
## 5 a3 b3 <NA>
## 6 a4 <NA> c3
Show how match (Section 5.4.1) can be used to author a very basic version of merge.
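As a hint, here is a (hypothetical) helper covering only the many-to-one inner join case. Unlike merge, match finds at most the first matching row in the second data frame, so duplicated keys in y are not handled:

```r
merge_basic <- function(x, y, by)
{
    idx <- match(x[[by]], y[[by]])  # first match in y for each row of x
    keep <- !is.na(idx)             # inner join: drop the unmatched rows
    cbind(
        x[keep, , drop=FALSE],
        y[idx[keep], setdiff(names(y), by), drop=FALSE]
    )
}
merge_basic(A, B, "v")  # compare with merge(A, B): the a2-c2 pair is lost
```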
12.3.4. Aggregating and transforming columns#
Let us discuss how to perform data aggregation or engineer features. Although we already know how to access individual columns with `[` and process them using the many vectorised functions, we still have something interesting to add about the said matter.
It would be tempting to try composing such operations with apply. Unfortunately, currently, this function coerces its argument to a matrix. Hence, we should refrain from applying it on data frames whose columns are of mixed types[8].
However, taking into account that data frames are special lists, we can always call Map and its relatives.
Given an example data frame:
(iris_sample <- iris[sample(NROW(iris), 6), ])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 28 5.2 3.5 1.5 0.2 setosa
## 80 5.7 2.6 3.5 1.0 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 133 6.4 2.8 5.6 2.2 virginica
To get the class of each column, we can call:
sapply(iris_sample, class) # or unlist(Map(class, iris))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "numeric" "numeric" "numeric" "numeric" "factor"
Next, here is a way to compute some aggregates of the numeric columns:
unlist(Map(mean, Filter(is.numeric, iris_sample)))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 6.0667 3.1333 4.5500 1.7167
or:
sapply(iris_sample[sapply(iris_sample, is.numeric)], mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 6.0667 3.1333 4.5500 1.7167
We can also fetch more than a single summary of each column:
as.data.frame(Map(
function(x) c(Min=min(x), Max=max(x)),
Filter(is.numeric, iris_sample)
))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min 5.2 2.6 1.5 0.2
## Max 6.5 3.5 6.0 2.5
or:
sapply(iris_sample[sapply(iris_sample, is.numeric)], quantile, c(0, 1))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0% 5.2 2.6 1.5 0.2
## 100% 6.5 3.5 6.0 2.5
The latter called simplify2array automatically. Thus, the result is a matrix.
On the other hand, the standardisation of all the numeric features can be performed, e.g., via a call:
iris_sample[] <- Map(function(x) {
if (!is.numeric(x)) x else (x-mean(x))/sd(x)
}, iris_sample)
print(iris_sample)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 28 -1.70405 1.03024 -1.76004 -1.65318 setosa
## 80 -0.72094 -1.49854 -0.60591 -0.78117 versicolor
## 101 0.45878 0.46829 0.83674 0.85384 virginica
## 111 0.85202 0.18732 0.31738 0.30884 virginica
## 137 0.45878 0.74927 0.60591 0.74484 virginica
## 133 0.65540 -0.93659 0.60591 0.52684 virginica
12.3.5. Handling missing values#
The is.na method for objects of the class data.frame
returns a logical matrix of the same dimensionality[9],
indicating whether the corresponding items are missing or not.
Of course, this function can still be called on individual
columns as well.
Further, na.omit can be used to get rid of rows with missing values.
Given a data frame, use is.na and other functions such as apply or approx to:
remove all rows that feature at least one missing value,
remove all rows that only consist of missing values,
remove all columns that feature at least one missing value,
for each column, replace all missing values with the column averages,
for each column, replace all missing values with values that linearly interpolate between the preceding and succeeding well-defined observations (of use in time series), e.g., the blanks in c(0.60, 0.62, NA, 0.64, NA, NA, 0.58) should be filled to obtain c(0.60, 0.62, 0.63, 0.64, 0.62, 0.60, 0.58).
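By way of illustration, the two imputation subtasks might be sketched as follows (impute_means and fill_interp are made-up names; the latter assumes the endpoints are not missing):

```r
impute_means <- function(df)  # column averages in place of missing values
{
    df[] <- lapply(df, function(x) {
        if (is.numeric(x))
            x[is.na(x)] <- mean(x, na.rm=TRUE)
        x
    })
    df
}

fill_interp <- function(x)  # linear interpolation of the interior blanks
{
    ok <- !is.na(x)
    approx(which(ok), x[ok], xout=seq_along(x))[["y"]]
}
fill_interp(c(0.60, 0.62, NA, 0.64, NA, NA, 0.58))
## [1] 0.60 0.62 0.63 0.64 0.62 0.60 0.58
```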
12.3.6. Reshaping data frames#
Consider an example matrix:
A <- matrix(round(runif(6), 2), nrow=3,
dimnames=list(
c("X", "Y", "Z"), # row labels
c("u", "v") # column labels
))
names(dimnames(A)) <- c("Row", "Col")
print(A)
## Col
## Row u v
## X 0.29 0.88
## Y 0.79 0.94
## Z 0.41 0.05
The as.data.frame method for the table
class can be called directly on any array:
as.data.frame.table(A, responseName="Val")
## Row Col Val
## 1 X u 0.29
## 2 Y u 0.79
## 3 Z u 0.41
## 4 X v 0.88
## 5 Y v 0.94
## 6 Z v 0.05
It is an instance of array reshaping. More precisely, we call it stacking: we converted from a wide (okay, in this example, not so wide, as we have only two columns) to a long format.
This can be also achieved by means of the reshape function which is more flexible and operates directly on data frames (but is harder to use):
(df <- `names<-`(
data.frame(row.names(A), A, row.names=NULL),
c("Row", "Col.u", "Col.v")))
## Row Col.u Col.v
## 1 X 0.29 0.88
## 2 Y 0.79 0.94
## 3 Z 0.41 0.05
(stacked <- reshape(df, varying=2:3, direction="long"))
## Row time Col id
## 1.u X u 0.29 1
## 2.u Y u 0.79 2
## 3.u Z u 0.41 3
## 1.v X v 0.88 1
## 2.v Y v 0.94 2
## 3.v Z v 0.05 3
Maybe the default column names are not superb, but we can adjust them manually afterwards.
The reverse operation is called unstacking:
reshape(stacked, idvar="Row", timevar="time", drop="id", direction="wide")
## Row Col.u Col.v
## 1.u X 0.29 0.88
## 2.u Y 0.79 0.94
## 3.u Z 0.41 0.05
Given a named numeric vector, convert it to a data frame with two columns, for instance:
convert <- function(x) ...to.do...
x <- c(spam=42, eggs=7, bacon=3)
convert(x)
## key value
## 1 spam 42
## 2 eggs 7
## 3 bacon 3
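One conceivable solution relies on names and unname, which reproduces the output above:

```r
convert <- function(x)
    data.frame(key=names(x), value=unname(x))
```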
Reshape (stack) the built-in WorldPhones dataset. Then, reshape (unstack) the stacked WorldPhones dataset. Further, unstack the stacked set but first remove[10] five random rows from it and then randomly permute all the remaining rows. Fill in the missing entries with NAs.
Implement a basic version of as.data.frame.table manually (using rep etc.). Also, write a function as.table.data.frame that computes its reverse. Make sure both functions are compatible with each other.
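As a hint for the first task: an array's values are stored in the column-major order, so the i-th dimension's labels repeat with a rhythm equal to the product of the preceding dimensions' sizes. A rough sketch (stack_array is a made-up name; dimnames are assumed to be present and named):

```r
stack_array <- function(a, responseName="Freq")  # cf. as.data.frame.table
{
    d  <- dim(a)
    dn <- dimnames(a)  # assumed non-NULL and fully named
    each <- cumprod(c(1, d))[seq_along(d)]  # label repetition rhythm
    res <- as.data.frame(Map(
        function(l, e) factor(rep(l, each=e, length.out=length(a)), levels=l),
        dn, each
    ))
    res[[responseName]] <- as.vector(a)
    res
}
```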
The built-in Titanic
is a four-dimensional array.
Convert it to a long data frame.
Perform what follows on the data frame defined below:
convert the second column from character to a list of character vectors (split at ",");
extract the first elements from each of the vectors;
extract the last elements;
(*) unstack the data frame;
(*) stack it back to a data frame featuring a list;
convert the list back to a character column (concatenate with "," as separator).
(x <- data.frame(
name=c("Kat", "Ron", "Jo", "Mary"),
food=c("buckwheat", "spam,bacon,spam", "", "eggs,spam,spam,lollipops")
))
## name food
## 1 Kat buckwheat
## 2 Ron spam,bacon,spam
## 3 Jo
## 4 Mary eggs,spam,spam,lollipops
Write a function that converts all matrix-based columns in a given data frame to separate atomic columns. Furthermore, author a function that does the opposite: one that groups all columns with similar prefixes and turns them into matrices.
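The first function could be drafted along the following lines (a sketch only; we assume that each matrix column has non-empty colnames to derive the new labels from, and that the remaining columns are atomic):

```r
flatten_df <- function(df)
{
    parts <- Map(function(x, n) {
        if (is.matrix(x))  # expand into several columns, e.g., coords.lat
            `names<-`(as.data.frame(x), paste(n, colnames(x), sep="."))
        else
            `names<-`(as.data.frame(x), n)
    }, df, names(df))
    `row.names<-`(do.call(cbind, unname(parts)), row.names(df))
}
```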
12.3.7. Aggregating data in groups#
We can straightforwardly apply various transforms on data groups determined by a factor-like variable or a combination thereof thanks to the split.data.frame method, which returns a list of data frames.
For example:
x <- data.frame(
a=c( 10, 20, 30, 40, 50),
u=c("spam", "spam", "eggs", "spam", "eggs"),
v=c( 1, 2, 1, 1, 1)
)
split(x, x["u"]) # i.e., split.data.frame(x, x["u"]) or x[["u"]]
## $eggs
## a u v
## 3 30 eggs 1
## 5 50 eggs 1
##
## $spam
## a u v
## 1 10 spam 1
## 2 20 spam 2
## 4 40 spam 1
It split x
with respect to the u
column, which served as the grouping
variable. On the other hand:
split(x, x[c("u", "v")]) # sep="."
## $eggs.1
## a u v
## 3 30 eggs 1
## 5 50 eggs 1
##
## $spam.1
## a u v
## 1 10 spam 1
## 4 40 spam 1
##
## $eggs.2
## [1] a u v
## <0 rows> (or 0-length row.names)
##
## $spam.2
## a u v
## 2 20 spam 2
It partitioned with respect to a combination of two factor-like
sequences. A nonexistent level pair (eggs, 2)
resulted in an empty data frame.
split.data.frame (when called explicitly) can also be used to break a matrix into a list of matrices (rowwisely). Given a matrix, perform its train-test split: allocate, say, 70% of the rows at random into one matrix and the remaining 30% into another.
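For example (just a sketch; the 70/30 proportion and the set labels are arbitrary):

```r
train_test <- function(X, train_frac=0.7)
{
    i <- sample(NROW(X), floor(train_frac * NROW(X)))  # random row indices
    g <- rep("test", NROW(X))
    g[i] <- "train"
    split.data.frame(X, g)  # the explicit call splits matrices rowwisely too
}
sets <- train_test(as.matrix(mtcars))
sapply(sets, NROW)  # 10 rows in the test set, 22 in the train set
```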
If we need to aggregate grouped data in numeric columns, sapply is quite convenient. To recall, it is a combination of lapply (one-vector version of Map) and simplify2array (Section 11.1.3).
sapply(split(iris[1:2], iris[5]), sapply, mean)
## setosa versicolor virginica
## Sepal.Length 5.006 5.936 6.588
## Sepal.Width 3.428 2.770 2.974
If the function to apply returns more than a single value,
sapply will not return too informative a result:
the list of matrices converted to a matrix will not have the row.names
argument set. As a workaround, we either call simplify2array
explicitly or pass simplify="array"
to sapply:
(res <- sapply(
split(iris[1:2], iris[5]),
sapply,
function(x) c(Min=min(x), Max=max(x)),
simplify="array"
)) # or simplify2array(lapply or Map etc.)
## , , setosa
##
## Sepal.Length Sepal.Width
## Min 4.3 2.3
## Max 5.8 4.4
##
## , , versicolor
##
## Sepal.Length Sepal.Width
## Min 4.9 2.0
## Max 7.0 3.4
##
## , , virginica
##
## Sepal.Length Sepal.Width
## Min 4.9 2.2
## Max 7.9 3.8
It produced a three-dimensional array, which is particularly handy if we now wish to access specific results by name:
res[, "Sepal.Length", "setosa"]
## Min Max
## 4.3 5.8
Also, the previously mentioned as.data.frame.table method works on it like a charm (up to the column names):
as.data.frame.table(res)
## Var1 Var2 Var3 Freq
## 1 Min Sepal.Length setosa 4.3
## 2 Max Sepal.Length setosa 5.8
## 3 Min Sepal.Width setosa 2.3
## 4 Max Sepal.Width setosa 4.4
## 5 Min Sepal.Length versicolor 4.9
## 6 Max Sepal.Length versicolor 7.0
## 7 Min Sepal.Width versicolor 2.0
## 8 Max Sepal.Width versicolor 3.4
## 9 Min Sepal.Length virginica 4.9
## 10 Max Sepal.Length virginica 7.9
## 11 Min Sepal.Width virginica 2.2
## 12 Max Sepal.Width virginica 3.8
Note
If the grouping (by) variable is a list of two or more factors, the combined levels will be concatenated to a single string:
as.data.frame.table(as.array(sapply(
split(ToothGrowth["len"], ToothGrowth[c("supp", "dose")]),
sapply,
mean
)))
## Var1 Freq
## 1 OJ.0.5.len 13.23
## 2 VC.0.5.len 7.98
## 3 OJ.1.len 22.70
## 4 VC.1.len 16.77
## 5 OJ.2.len 26.06
## 6 VC.2.len 26.14
Also, the name of the aggregated column (len) has been included. This behaviour yields a result that may be deemed convenient in some contexts but not necessarily so in others.
Many aggregation functions are idempotent, which means that when they are fed with a vector with all the elements being identical, the result is exactly that unique element: min, mean, median, and max behave this way.
Overload the mean and median methods for character vectors and factors. They should return NA when fed with a sequence where not all elements are the same. Otherwise, they are expected to output the unique value.
mean.character <- function(x, na.rm=FALSE, ...) ...to.do...
mean.factor <- function(x, na.rm=FALSE, ...) ...to.do...
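One conceivable solution sketch follows (median.character and median.factor can be defined analogously):

```r
mean.character <- function(x, na.rm=FALSE, ...)
{
    if (na.rm) x <- x[!is.na(x)]
    if (length(unique(x)) == 1) x[1] else NA_character_
}

mean.factor <- function(x, na.rm=FALSE, ...)
{
    if (na.rm) x <- x[!is.na(x)]
    if (length(unique(x)) == 1) x[1] else x[NA_integer_]  # NA, same levels
}
mean(c("spam", "spam"))  # "spam"
mean(c("spam", "eggs"))  # NA
```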
This way, we can also aggregate the grouping variables conveniently:
do.call(rbind.data.frame,
lapply(split(ToothGrowth, ToothGrowth[c("supp", "dose")]), lapply, mean))
## len supp dose
## OJ.0.5 13.23 OJ 0.5
## VC.0.5 7.98 VC 0.5
## OJ.1 22.70 OJ 1.0
## VC.1 16.77 VC 1.0
## OJ.2 26.06 OJ 2.0
## VC.2 26.14 VC 2.0
The built-in aggregate method can assist us when a single function is to be applied on all columns in a data frame.
aggregate(iris[-5], iris[5], mean) # not: ...[[5]]
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], mean)
## supp dose len
## 1 OJ 0.5 13.23
## 2 VC 0.5 7.98
## 3 OJ 1.0 22.70
## 4 VC 1.0 16.77
## 5 OJ 2.0 26.06
## 6 VC 2.0 26.14
The second argument, by, must be list-like (this includes data frames). Neither a factor nor an atomic vector is acceptable.
Also, if the function being applied returns many values, they
will be wrapped into a matrix column:
(x <- aggregate(iris[2], iris[5], function(x) c(Min=min(x), Max=max(x))))
## Species Sepal.Width.Min Sepal.Width.Max
## 1 setosa 2.3 4.4
## 2 versicolor 2.0 3.4
## 3 virginica 2.2 3.8
class(x[["Sepal.Width"]])
## [1] "matrix" "array"
x[["Sepal.Width"]] # not: Sepal.Width.Max, etc.
## Min Max
## [1,] 2.3 4.4
## [2,] 2.0 3.4
## [3,] 2.2 3.8
It is actually handy: by referring to x[["Sepal.Width"]], we have access to all the stats for this column. Further, if many columns are being aggregated simultaneously, we can process all the summaries in the same way.
Check out the built-in by function,
which supports some basic split-apply-bind use cases.
Note the particularly peculiar behaviour of the
print method for the by
class.
The most flexible scenario involves applying a custom function returning any set of aggregates in the form of a list and then row-binding the results to obtain a data frame.
The following implements an R version of what we would express in SQL as:
SELECT supp, dose, AVG(len) AS ave_len, COUNT(*) AS count
FROM ToothGrowth
GROUP BY supp, dose
Ad rem:
do.call(rbind.data.frame, lapply(
split(ToothGrowth, ToothGrowth[c("supp", "dose")]),
function(df) list(
supp=df[1, "supp"],
dose=df[1, "dose"],
ave_len=mean(df[["len"]]),
count=NROW(df)
)
))
## supp dose ave_len count
## OJ.0.5 OJ 0.5 13.23 10
## VC.0.5 VC 0.5 7.98 10
## OJ.1 OJ 1.0 22.70 10
## VC.1 VC 1.0 16.77 10
## OJ.2 OJ 2.0 26.06 10
## VC.2 VC 2.0 26.14 10
As an exercise, let us study a function that takes a named list x (can be a data frame) and a sequence of col=f pairs and applies the function f (or each function from a list of functions f) on the named element col in x:
napply <- function(x, ...)
{
fs <- list(...)
stopifnot(is.list(x), !is.null(names(x)))
stopifnot(all(names(fs) %in% names(x)))
do.call(
c, # concatenates lists
lapply(
structure(seq_along(fs), names=names(fs)),
function(i)
{ # always returns a list
y <- x[[ names(fs)[i] ]]
if (is.function(fs[[i]]))
list(fs[[i]](y))
else
lapply(fs[[i]], function(f) f(y))
}
)
)
}
For example:
first <- function(x, ...) head(x, n=1L, ...) # we use it below
napply(ToothGrowth,
supp=first, dose=first, len=list(ave=mean, count=length)
)
## $supp
## [1] VC
## Levels: OJ VC
##
## $dose
## [1] 0.5
##
## $len.ave
## [1] 18.813
##
## $len.count
## [1] 60
The preceding call applies first on both ToothGrowth[["supp"]] and ToothGrowth[["dose"]], as well as mean and length on ToothGrowth[["len"]]. List names are there for a more dramatic effect.
And now:
do.call(
rbind.data.frame,
lapply(
split(ToothGrowth, ToothGrowth[c("supp", "dose")]),
napply,
supp=first, dose=first, len=list(ave=mean, count=length)
)
)
## supp dose len.ave len.count
## OJ.0.5 OJ 0.5 13.23 10
## VC.0.5 VC 0.5 7.98 10
## OJ.1 OJ 1.0 22.70 10
## VC.1 VC 1.0 16.77 10
## OJ.2 OJ 2.0 26.06 10
## VC.2 VC 2.0 26.14 10
or even:
aaaggg <- function(x, by, ...)
do.call(rbind.data.frame, lapply(split(x, x[by]), napply, ...))
so that:
aaaggg(iris, "Species", Species=first, Sepal.Length=mean)
## Species Sepal.Length
## setosa setosa 5.006
## versicolor versicolor 5.936
## virginica virginica 6.588
This brings fun back to R programming in the sad times when many things are given to us on a plate.
By the way, the above has not been tested thoroughly. It is a proof of concept. As usual, testing, debugging, and extending are left as an exercise for the reader.
Section 10.5 considered an example where we used our own group_by function and an aggregation method overloaded for the object’s class it returns.
Here is the function that splits a data frame into a list of data frames with respect to a combination of levels in given named columns:
group_by <- function(df, by)
{
stopifnot(is.character(by), is.data.frame(df))
df <- droplevels(df) # factors might have unused levels
structure(
split(df, df[names(df) %in% by]),
class="list_dfs",
by=by
)
}
The next function applies a set of aggregates on every column of each data frame in a given list (two nested lapplys plus some cosmetic additions):
aggregate.list_dfs <- function(x, FUN, ...)
{
aggregates <- lapply(x, function(df) {
is_by <- names(df) %in% attr(x, "by")
res <- lapply(df[!is_by], FUN, ...)
res_mat <- do.call(rbind, res)
if (is.null(dimnames(res_mat)[[2]]))
dimnames(res_mat)[[2]] <- paste0("f", seq_len(NCOL(res_mat)))
cbind(
`row.names<-`(df[1, is_by, drop=FALSE], NULL),
x=row.names(res_mat),
`row.names<-`(res_mat, NULL)
)
})
combined_aggregates <- do.call(rbind.data.frame, aggregates)
`row.names<-`(combined_aggregates, NULL)
}
aggregate(group_by(ToothGrowth, c("supp", "dose")), range)
## supp dose x f1 f2
## 1 OJ 0.5 len 8.2 21.5
## 2 VC 0.5 len 4.2 11.5
## 3 OJ 1.0 len 14.5 27.3
## 4 VC 1.0 len 13.6 22.5
## 5 OJ 2.0 len 22.4 30.9
## 6 VC 2.0 len 18.5 33.9
We really want our API to be bloated. Hence, let us introduce a convenience function, which is a specialised version of the above:
mean.list_dfs <- function(x, ...)
aggregate.list_dfs(x, function(y) c(Mean=mean(y, ...)))
mean(group_by(iris[51:150, c(2, 3, 5)], "Species"))
## Species x Mean
## 1 versicolor Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3 virginica Sepal.Width 2.974
## 4 virginica Petal.Length 5.552
12.3.8. Transforming data in groups#
Some variables will sometimes need to be transformed relative to what is happening in a dataset’s subsets. This is the case, e.g., where we decide that missing values should be replaced by the corresponding within-group averages or want to compute the relative ranks or z-scores.
If the loss of the original ordering of rows is not an issue, the standard split-apply-bind will suffice.
An example data frame:
(x <- data.frame(
a=c( 10, 1, NA, NA, NA, 4),
b=c( -1, 10, 40, 30, 1, 20),
c=runif(6),
d=c("v", "u", "u", "u", "v", "u")
))
## a b c d
## 1 10 -1 0.52811 v
## 2 1 10 0.89242 u
## 3 NA 40 0.55144 u
## 4 NA 30 0.45661 u
## 5 NA 1 0.95683 v
## 6 4 20 0.45333 u
Some operations:
fill_na <- function(x) `[<-`(x, is.na(x), value=mean(x[!is.na(x)]))
standardise <- function(x) (x-mean(x))/sd(x)
And now:
do.call(rbind.data.frame, lapply(
split(x, x["d"]),
function(df) {
df[["a"]] <- fill_na(df[["a"]])
df[["b"]] <- rank(df[["b"]])
df[["c"]] <- standardise(df[["c"]])
df
}
))
## a b c d
## u.2 1.0 1 1.46357 u
## u.3 2.5 4 -0.17823 u
## u.4 2.5 3 -0.63478 u
## u.6 4.0 2 -0.65057 u
## v.1 10.0 1 -0.70711 v
## v.5 10.0 2 0.70711 v
Only the relative ordering of rows within groups has been retained. Overall, the rows are in a different order.
If this is an issue, we can use the unsplit function:
unsplit(
lapply(
split(x, x["d"]),
function(df) {
df[["a"]] <- fill_na(df[["a"]])
df[["b"]] <- rank(df[["b"]])
df[["c"]] <- standardise(df[["c"]])
df
}
),
x["d"]
)
## a b c d
## 1 10.0 1 -0.70711 v
## 2 1.0 1 1.46357 u
## 3 2.5 4 -0.17823 u
## 4 2.5 3 -0.63478 u
## 5 10.0 2 0.70711 v
## 6 4.0 2 -0.65057 u
Show how we can do the above also via the replacement version of split.
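As a hint, the replacement version, split<-, injects the modified chunks back at the positions they came from, so the original row order is retained:

```r
y <- x  # operate on a copy of the example data frame
split(y, y["d"]) <- lapply(
    split(x, x["d"]),
    function(df) {
        df[["a"]] <- fill_na(df[["a"]])
        df[["b"]] <- rank(df[["b"]])
        df[["c"]] <- standardise(df[["c"]])
        df
    }
)
```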
Reverting to the previous ordering can be done manually too. It is because the split operation behaves as if we first ordered the data frame with respect to the grouping variable(s) (using a stable sorting algorithm).
Here is some transformation of a sample data frame split by a combination of two factors:
(x <- `row.names<-`(ToothGrowth[sample(NROW(ToothGrowth), 10), ], NULL))
## len supp dose
## 1 23.0 OJ 2.0
## 2 23.3 OJ 1.0
## 3 29.4 OJ 2.0
## 4 14.5 OJ 1.0
## 5 11.2 VC 0.5
## 6 20.0 OJ 1.0
## 7 24.5 OJ 2.0
## 8 10.0 OJ 0.5
## 9 9.4 OJ 0.5
## 10 7.0 VC 0.5
(y <- do.call(rbind.data.frame, lapply(
split(x, x[c("dose", "supp")]), # two grouping variables
function(df) {
df[["len"]] <- df[["len"]] * 100^df[["dose"]] * # whatever
ifelse(df[["supp"]] == "OJ", -1, 1) # do not overthink it
df
}
)))
## len supp dose
## 0.5.OJ.8 -100 OJ 0.5
## 0.5.OJ.9 -94 OJ 0.5
## 1.OJ.2 -2330 OJ 1.0
## 1.OJ.4 -1450 OJ 1.0
## 1.OJ.6 -2000 OJ 1.0
## 2.OJ.1 -230000 OJ 2.0
## 2.OJ.3 -294000 OJ 2.0
## 2.OJ.7 -245000 OJ 2.0
## 0.5.VC.5 112 VC 0.5
## 0.5.VC.10 70 VC 0.5
Section 5.4.4 mentioned that by calling order, we can determine the inverse of a given permutation. Hence, we can call:
y[order(order(x[["supp"]], x[["dose"]])), ] # not: dose, supp
## len supp dose
## 2.OJ.1 -230000 OJ 2.0
## 1.OJ.2 -2330 OJ 1.0
## 2.OJ.3 -294000 OJ 2.0
## 1.OJ.4 -1450 OJ 1.0
## 0.5.VC.5 112 VC 0.5
## 1.OJ.6 -2000 OJ 1.0
## 2.OJ.7 -245000 OJ 2.0
## 0.5.OJ.8 -100 OJ 0.5
## 0.5.OJ.9 -94 OJ 0.5
## 0.5.VC.10 70 VC 0.5
Additionally, we can manually restore the original row.names, et voilà.
12.3.9. Metaprogramming-based techniques (*)#
Section 9.5.7 mentioned that R is equipped with the ability to compose programs that manipulate unevaluated expressions. As a consequence, some functions provide us with convenient[11] interfaces to a few common operations. These include transform, subset, with, and basically every procedure accepting a formula.
The popular data.table and dplyr packages
also fall into this class (Section 12.3.10).
However, we have already noted that each method must be studied separately. It can arbitrarily interpret the form of the passed arguments without taking into account their real meaning. Hence, we avoid[12] them in this course, as we can do perfectly without them.
Withal, they are thought-provoking on their own. Furthermore, they are quite popular in other users’ code. Hence the honourable mention. Learning them in more detail is left to the earnest reader as an optional exercise.
For instance, let us consider an example call to the subset function:
subset(iris, Sepal.Length>7.5, -(Sepal.Width:Petal.Width))
## Sepal.Length Species
## 106 7.6 virginica
## 118 7.7 virginica
## 119 7.7 virginica
## 123 7.7 virginica
## 132 7.9 virginica
## 136 7.7 virginica
Neither Sepal.Length>7.5
nor -(Sepal.Width:Petal.Width)
make sense
as standalone R expressions. We have not defined
the named variables used therein:
Sepal.Length>7.5 # utter nonsense
## Error in eval(expr, envir, enclos): object 'Sepal.Length' not found
-(Sepal.Width:Petal.Width) # gibberish
## Error in eval(expr, envir, enclos): object 'Sepal.Width' not found
Only from help("subset") can we learn that this tool generously decides that the second expression plays the role of a row selector. Moreover, the third one removes all the columns between the two given ones.
In our course, we pay attention to developing transferable skills. Assuming that R is not the only language we are going to learn during our long and happy lives, it is much more likely that in the next environment, we will be writing something of the more basic form instead:
between <- function(x, from, to) (which(from == x):which(to == x))
iris[iris[["Sepal.Length"]]>7.5,
-between(names(iris), "Sepal.Width", "Petal.Width")]
## Sepal.Length Species
## 106 7.6 virginica
## 118 7.7 virginica
## 119 7.7 virginica
## 123 7.7 virginica
## 132 7.9 virginica
## 136 7.7 virginica
Let us stress again that this is a book on how to become a great chef who proudly uses sustainably-sourced produce, not how to order ultra-processed food from DeliverNoodlesQuickly.com.
transform can be used to add, modify, and remove columns in a data frame with the possibility of referring to existing features as if they were ordinary variables:
head(transform(mtcars, log_hp=log(hp), am=2*am-1, hp=NULL))
## mpg cyl disp drat wt qsec vs am gear carb log_hp
## Mazda RX4 21.0 6 160 3.90 2.620 16.46 0 1 4 4 4.7005
## Mazda RX4 Wag 21.0 6 160 3.90 2.875 17.02 0 1 4 4 4.7005
## Datsun 710 22.8 4 108 3.85 2.320 18.61 1 1 4 1 4.5326
## Hornet 4 Drive 21.4 6 258 3.08 3.215 19.44 1 -1 3 1 4.7005
## Hornet Sportabout 18.7 8 360 3.15 3.440 17.02 0 -1 3 2 5.1648
## Valiant 18.1 6 225 2.76 3.460 20.22 1 -1 3 1 4.6540
Similarly, attach adds any named list to the search path
(see Section 16.2.6) so that the columns can be accessed
by name. Nevertheless, we cannot alter their contents.
As an alternative, with and within may be referred
to if writing df[["..."]]
each time is so difficult for us
(it should not be):
within(head(mtcars), {
log_hp <- log(hp)
fuel_economy <- 235/mpg
am <- factor(am, levels=c(0, 1), labels=c("no", "yes"))
rm(list=c("mpg", "hp", "vs", "qsec"))
})
## cyl disp drat wt am gear carb fuel_economy log_hp
## Mazda RX4 6 160 3.90 2.620 yes 4 4 11.190 4.7005
## Mazda RX4 Wag 6 160 3.90 2.875 yes 4 4 11.190 4.7005
## Datsun 710 4 108 3.85 2.320 yes 4 1 10.307 4.5326
## Hornet 4 Drive 6 258 3.08 3.215 no 3 1 10.981 4.7005
## Hornet Sportabout 8 360 3.15 3.440 no 3 2 12.567 5.1648
## Valiant 6 225 2.76 3.460 no 3 1 12.983 4.6540
As mentioned in Section 10.3.4 (see Section 17.6 for more details), formulae are special objects that consist of two unevaluated expressions separated by a tilde (`~`).
Functions can support formulae and do what they please with them, but a popular approach is to allow them to express “something grouped by something else” or “one thing as a function of other things”.
do.call(rbind.data.frame, lapply(split(ToothGrowth, ~supp+dose), head, 1))
## len supp dose
## OJ.0.5 15.2 OJ 0.5
## VC.0.5 4.2 VC 0.5
## OJ.1 19.7 OJ 1.0
## VC.1 16.5 VC 1.0
## OJ.2 25.5 OJ 2.0
## VC.2 23.6 VC 2.0
aggregate(cbind(mpg, log_hp=log(hp))~am:cyl, mtcars, mean)
## am cyl mpg log_hp
## 1 0 4 22.900 4.4186
## 2 1 4 28.075 4.3709
## 3 0 6 19.125 4.7447
## 4 1 6 20.567 4.8552
## 5 0 8 15.050 5.2553
## 6 1 8 15.400 5.6950
head(model.frame(mpg+hp~log(hp)+I(1/qsec), mtcars))
## mpg + hp log(hp) I(1/qsec)
## Mazda RX4 131.0 4.7005 0.060753....
## Mazda RX4 Wag 131.0 4.7005 0.058754....
## Datsun 710 115.8 4.5326 0.053734....
## Hornet 4 Drive 131.4 4.7005 0.051440....
## Hornet Sportabout 193.7 5.1648 0.058754....
## Valiant 123.1 4.6540 0.049455....
If these seem esoteric, it is because that is precisely the case. We need to consult the corresponding functions’ manuals to understand what they do. And, as we do not recommend their use by beginner programmers, we will not explain them here.
In the last example, the peculiar printing of the last column is due to which method’s being overloaded?
In the third part of this book, we will return to these functions as they will serve as an amusing illustration of how to indite our own procedures that rely on metaprogramming techniques.
12.3.10. A note on the dplyr (tidyverse) and data.table packages (*)#
The popular third-party packages data.table and dplyr implement the most common data frame wrangling procedures. Some operations may be much faster for larger datasets.
They both introduce an entirely new API for the operations we already know well how to perform. Furthermore, they are heavily based on metaprogramming (lazy evaluation). A nice way to learn them is by solving some of the exercises listed below.
dplyr is part of a vast system of interdependent packages called tidyverse, which tend to do things their own way. They became quite invasive over the last few years.
Still, R programmers ought to remember that they are able to do without them. What is more, they also need to rely on other solutions when processing other prominent data structures is required, e.g., of fancy lists and matrices. Base R always comes first as the most fundamental layer.
Important
Some functions we may find helpful (annoyingly to base R users) return objects of the class tibble (tbl_df); e.g., haven::read_xpt, which reads SAS data files. However, these are, in fact, subclasses of data.frame. We can always use as.data.frame to get our favourite objects back.
Also, we cannot stress enough that it is SQL that we recommend learning as perhaps the most powerful interface to more considerable amounts of data. It will give us the skills that can be used later in other programming environments.
Base R (and its predecessor, S) has long ago proven to be a versatile tool for rapid prototyping, calling specialised procedures written in C or Java, and wrangling data that fit into memory. For larger problems, techniques for working with batches of data, sampling methods, or aggregating data stored elsewhere are often the way to go, especially when building machine learning models or visualisation[13] is required. Usually, the most recent data will be stored in normalised databases, and we will need to join a few tables to fetch something interesting from the perspective of the current analysis context.
12.4. Exercises#
Answer the following questions:
What attributes must a data frame be equipped with?
If row.names is an integer vector, how to access the rows labelled 1, 7, and 42?
How to create a data frame that features a column that is a list of character vectors of different lengths?
How to create a data frame that includes a matrix column?
How to convert all numeric columns in a data frame to a numeric matrix?
Assuming that x is an atomic vector, what is the difference between “as.data.frame(x)” vs “as.data.frame(as.list(x))” vs “as.data.frame(list(a=x))” vs “data.frame(a=x)”?
Assuming that x is a data frame, what is the meaning of/difference between the following:
“x["u"]” vs “x[["u"]]” vs “x[, "u"]”?
“x["u"][1]” vs “x[["u"]][1]” vs “x[1, "u"]” vs “x[1, "u", drop=FALSE]”?
“x[which(x[[1]] > 0), ]” vs “x[x[[1]] > 0, ]”?
“x[grep("^foo", names(x))]”?
Assume we have a data frame with columns named like: `ID` (character), `checked` (logical, possibly with missing values), `category` (factor), `x0`, …, `x9` (ten separate numeric columns), `y0`, …, `y9` (ten separate numeric columns), `coords` (a numeric matrix with two columns named `lat` and `long`), and `features` (a list of character vectors of different lengths).

- How to extract the rows where `checked` is `TRUE`?
- How to extract a subset comprised only of the `ID` and `x`-something columns?
- How to extract the rows for which `ID` is like three letters and then five digits (e.g., `XYZ12345`)?
- How to select all the numeric columns in one go?
- Assuming that the `ID`s are like three letters and then five digits, how to add two columns: `ID3` (the letters) and `ID5` (the five digits)?
- How to get rid of all the columns between `x3` and `y7`?
- How to check where both `lat` and `long` in `coords` are positive?
- How to add a column giving the number of `features`?
- How to extract the rows where `"spam"` is amongst the `features`?
- How to convert it to a long data frame with two columns: `ID` and `feature` (individual strings)?
- How to change the name of the `ID` column to `id`?
- How to make the `y`-foo columns appear before the `x`-bar ones?
- How to order the rows with respect to `checked` (`FALSE` first, then `TRUE`) and `ID`s (decreasingly)?
- How to remove the rows with duplicate `ID`s?
- How to determine how many entries correspond to each `category`?
- How to compute the average `lat` and `long` in each `category`?
- How to compute the average `lat` and `long` for each combination of `category` and `checked`?
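As a starting point for experimenting with the indexing questions above, consider a toy data frame (the column names and values below are chosen arbitrarily):

```r
# a sketch for self-study; column names are arbitrary
x <- data.frame(u=c(-1, 2, 3), v=c("a", "b", "c"))
class(x["u"])      # still a data frame (a one-column subset)
class(x[["u"]])    # the underlying atomic vector
class(x[, "u"])    # also the vector, as drop=TRUE by default
x[x[["u"]] > 0, ]  # the rows where column `u` is positive
```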
Consider the `flights` dataset. Give some ways to select all rows between March and October (regardless of the year).
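For instance, assuming that `flights` features a numeric `month` column (as in, e.g., the nycflights13 package), one possibility is:

```r
# a sketch only; `month` is assumed to store integers in 1:12
flights[flights[["month"]] >= 3 & flights[["month"]] <= 10, ]
subset(flights, month %in% 3:10)  # an equivalent alternative
```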
In this task, you will be working with a version of a dataset on 70k+ Melbourne trees (`urban_forest`).

- Load the downloaded dataset by calling the `read.csv` function.
- Fetch the IDs (`CoM.ID`) and trunk diameters (`Diameter.Breast.Height`) of the five horse chestnuts with the smallest diameters at breast height. The output data frame must be sorted with respect to `Diameter.Breast.Height`, decreasingly.
- Create a new data frame that gives the number of trees planted each year.
- Compute the average age (in years, based on `Year.Planted`; using `aggregate`) of the trees of each of the genera: Eucalyptus, Platanus, Ficus, Acer, and Quercus. Depict the sorted data with `barplot`.
(*) Consider the historic data dumps of https://travel.stackexchange.com/ available here. Export the CSV files located therein to an SQLite database. Then, write some R code that corresponds to the following SQL queries (use dbGetQuery to verify your results):
--- 1)
SELECT
Users.DisplayName,
Users.Age,
Users.Location,
SUM(Posts.FavoriteCount) AS FavoriteTotal,
Posts.Title AS MostFavoriteQuestion,
MAX(Posts.FavoriteCount) AS MostFavoriteQuestionLikes
FROM Posts
JOIN Users ON Users.Id=Posts.OwnerUserId
WHERE Posts.PostTypeId=1
GROUP BY OwnerUserId
ORDER BY FavoriteTotal DESC
LIMIT 10
--- 2)
SELECT
Posts.ID,
Posts.Title,
Posts2.PositiveAnswerCount
FROM Posts
JOIN (
SELECT
Posts.ParentID,
COUNT(*) AS PositiveAnswerCount
FROM Posts
WHERE Posts.PostTypeID=2 AND Posts.Score>0
GROUP BY Posts.ParentID
) AS Posts2
ON Posts.ID=Posts2.ParentID
ORDER BY Posts2.PositiveAnswerCount DESC
LIMIT 10
--- 3)
SELECT
Posts.Title,
UpVotesPerYear.Year,
MAX(UpVotesPerYear.Count) AS Count
FROM (
SELECT
PostId,
COUNT(*) AS Count,
STRFTIME('%Y', Votes.CreationDate) AS Year
FROM Votes
WHERE VoteTypeId=2
GROUP BY PostId, Year
) AS UpVotesPerYear
JOIN Posts ON Posts.Id=UpVotesPerYear.PostId
WHERE Posts.PostTypeId=1
GROUP BY Year
--- 4)
SELECT
Questions.Id,
Questions.Title,
BestAnswers.MaxScore,
Posts.Score AS AcceptedScore,
BestAnswers.MaxScore-Posts.Score AS Difference
FROM (
SELECT Id, ParentId, MAX(Score) AS MaxScore
FROM Posts
WHERE PostTypeId==2
GROUP BY ParentId
) AS BestAnswers
JOIN (
SELECT * FROM Posts
WHERE PostTypeId==1
) AS Questions
ON Questions.Id=BestAnswers.ParentId
JOIN Posts ON Questions.AcceptedAnswerId=Posts.Id
WHERE Difference>50
ORDER BY Difference DESC
--- 5)
SELECT
Posts.Title,
CmtTotScr.CommentsTotalScore
FROM (
SELECT
PostID,
UserID,
SUM(Score) AS CommentsTotalScore
FROM Comments
GROUP BY PostID, UserID
) AS CmtTotScr
JOIN Posts ON Posts.ID=CmtTotScr.PostID
AND Posts.OwnerUserId=CmtTotScr.UserID
WHERE Posts.PostTypeId=1
ORDER BY CmtTotScr.CommentsTotalScore DESC
LIMIT 10
--- 6)
SELECT DISTINCT
Users.Id,
Users.DisplayName,
Users.Reputation,
Users.Age,
Users.Location
FROM (
SELECT
Name, UserID
FROM Badges
WHERE Name IN (
SELECT
Name
FROM Badges
WHERE Class=1
GROUP BY Name
HAVING COUNT(*) BETWEEN 2 AND 10
)
AND Class=1
) AS ValuableBadges
JOIN Users ON ValuableBadges.UserId=Users.Id
--- 7)
SELECT
Posts.Title,
VotesByAge2.OldVotes
FROM Posts
JOIN (
SELECT
PostId,
MAX(CASE WHEN VoteDate = 'new' THEN Total ELSE 0 END) NewVotes,
MAX(CASE WHEN VoteDate = 'old' THEN Total ELSE 0 END) OldVotes,
SUM(Total) AS Votes
FROM (
SELECT
PostId,
CASE STRFTIME('%Y', CreationDate)
WHEN '2017' THEN 'new'
WHEN '2016' THEN 'new'
ELSE 'old'
END VoteDate,
COUNT(*) AS Total
FROM Votes
WHERE VoteTypeId=2
GROUP BY PostId, VoteDate
) AS VotesByAge
GROUP BY VotesByAge.PostId
HAVING NewVotes=0
) AS VotesByAge2 ON VotesByAge2.PostId=Posts.ID
WHERE Posts.PostTypeId=1
ORDER BY VotesByAge2.OldVotes DESC
LIMIT 10
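The export step might be sketched as follows (the file and table names are assumptions, and the DBI and RSQLite packages must be installed):

```r
# a sketch, not a complete solution; file names are hypothetical
con <- DBI::dbConnect(RSQLite::SQLite(), "travel.db")
for (t in c("Posts", "Users", "Votes", "Comments", "Badges")) {
    df <- read.csv(paste0(t, ".csv"))  # hypothetical file names
    DBI::dbWriteTable(con, t, df)      # create a table of the same name
}
DBI::dbGetQuery(con, "SELECT COUNT(*) AS n FROM Posts WHERE PostTypeId=1")
DBI::dbDisconnect(con)
```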
(*) Generate a CSV file featuring some random data arranged in a few columns, of a size at least two times larger than your available RAM. Then, export the CSV file to an SQLite database. Use file connections (Section 8.3.5) and the `nrows` argument to `read.table` to process it on a chunk-by-chunk basis.
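A chunk-by-chunk processing loop might be sketched like this (the file name and the chunk size are arbitrary assumptions):

```r
# a sketch; "huge.csv" and the chunk size of 100000 are assumptions
f <- file("huge.csv", open="r")
h <- strsplit(readLines(f, n=1), ",")[[1]]  # fetch the column names
repeat {
    chunk <- tryCatch(
        read.table(f, sep=",", nrows=100000, col.names=h),
        error=function(e) NULL  # read.table errors out on an exhausted input
    )
    if (is.null(chunk)) break
    # ... process `chunk` here, e.g., append it to an SQLite table ...
    if (nrow(chunk) < 100000) break
}
close(f)
```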
Determine whether setting `colClasses` in `read.table` speeds up the reading of large CSV files significantly or not.
(*) Export the whole XML data dump of StackOverflow published at https://archive.org/details/stackexchange (see also https://data.stackexchange.com/) to an SQLite database.