12. Data frames¶
This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Minimalist Data Wrangling with Python [27], too.
Most matrices are built on top of atomic vectors. Hence, only
items of the same type can be arranged into rows and columns.
On the other hand, data frames (objects of the S3 class data.frame
,
first introduced in [13])
are collections of vectors of the same lengths or matrices with identical
row counts. Hence, they represent structured[1]
data of possibly heterogeneous types. For instance:
class(iris) # `iris` is an example data frame
## [1] "data.frame"
iris[c(1, 51, 101), ] # three chosen rows from `iris`
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
It is a mix of numeric and factor-type data.
The good news is that not only are data frames built on named lists (e.g., to extract a column, we can refer to `[[`), but also many functions consider them matrix-like (e.g., to select specific rows and columns, two indexes can be passed to `[` like in the preceding example). Hence, it will soon turn out that we already know a lot about performing basic data wrangling activities, even if we do not fully realise it now.
12.1. Creating data frames¶
12.1.1. data.frame and as.data.frame¶
Most frequently, we create data frames based on a series of logical, numeric, or character vectors of identical lengths. In such a scenario, the data.frame function is particularly worthwhile.
(x <- data.frame(
a=c(TRUE, FALSE),
b=1:6,
c=runif(6),
d=c("spam", "spam", "eggs")
))
## a b c d
## 1 TRUE 1 0.77437 spam
## 2 FALSE 2 0.19722 spam
## 3 TRUE 3 0.97801 eggs
## 4 FALSE 4 0.20133 spam
## 5 TRUE 5 0.36124 spam
## 6 FALSE 6 0.74261 eggs
The shorter vectors were recycled. We can verify that the diverse column types were retained and no coercion was made by calling:
str(x)
## 'data.frame': 6 obs. of 4 variables:
## $ a: logi TRUE FALSE TRUE FALSE TRUE FALSE
## $ b: int 1 2 3 4 5 6
## $ c: num 0.774 0.197 0.978 0.201 0.361 ...
## $ d: chr "spam" "spam" "eggs" "spam" ...
Important
For many reasons (see, e.g., Section 12.1.5 and Section 12.1.6), we recommend having the type of each column always checked, e.g., by calling the str function.
Many objects, such as matrices, can easily be coerced to data frames using particular as.data.frame methods. Here is an example matrix:
(A <- matrix(1:6, nrow=3,
dimnames=list(
NULL, # no row labels
c("u", "v") # some column labels
)))
## u v
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
Let’s convert it to a data frame:
as.data.frame(A) # as.data.frame.matrix
## u v
## 1 1 4
## 2 2 5
## 3 3 6
Note that a matrix with no row labels is printed slightly differently
than a data frame with (as it will soon turn out) the default
row.names
.
Named lists are amongst other aspirants to a meaningful conversion. Consider an example list where all elements are vectors of the same length:
(l <- Map(
function(x) {
c(Min=min(x), Median=median(x), Mean=mean(x), Max=max(x))
},
split(iris[["Sepal.Length"]], iris[["Species"]])
))
## $setosa
## Min Median Mean Max
## 4.300 5.000 5.006 5.800
##
## $versicolor
## Min Median Mean Max
## 4.900 5.900 5.936 7.000
##
## $virginica
## Min Median Mean Max
## 4.900 6.500 6.588 7.900
Each list element will be turned into a separate column:
as.data.frame(l) # as.data.frame.list
## setosa versicolor virginica
## Min 4.300 4.900 4.900
## Median 5.000 5.900 6.500
## Mean 5.006 5.936 6.588
## Max 5.800 7.000 7.900
Sadly, as.data.frame is not particularly fond of lists of vectors of incompatible lengths:
as.data.frame(list(a=1, b=11:12, c=21:23))
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names
## = TRUE, : arguments imply differing number of rows: 1, 2, 3
These vectors could have been recycled with a warning. But they were not.
as.data.frame(list(a=1:4, b=11:12, c=21)) # recycling rule okay
## a b c
## 1 1 11 21
## 2 2 12 21
## 3 3 11 21
## 4 4 12 21
The method for the S3 class table
(mentioned in Chapter 11)
can be helpful as well. Here is an example contingency table together
with its unstacked (wide) version.
(t <- table(mtcars[["vs"]], mtcars[["cyl"]]))
##
## 4 6 8
## 0 1 3 14
## 1 10 4 0
as.data.frame(t) # as.data.frame.table; see the stringsAsFactors note below!
## Var1 Var2 Freq
## 1 0 4 1
## 2 1 4 10
## 3 0 6 3
## 4 1 6 4
## 5 0 8 14
## 6 1 8 0
as.data.frame.table is so handy that we might want to call it directly on any array. This way, we can convert it from the wide format to the long (tall) one; see Section 12.3.6 for more details.
Note
The aforementioned method is based on expand.grid, which determines all combinations of a given series of vectors.
expand.grid(1:2, c("a", "b", "c")) # see the stringsAsFactors note below!
## Var1 Var2
## 1 1 a
## 2 2 a
## 3 1 b
## 4 2 b
## 5 1 c
## 6 2 c
Overall, many classes of objects can be included[2]
in a data frame. The popular choices include
Date
, POSIXct
, and factor
.
It is worth noting that format is used whilst printing
the columns. Here is its custom method for what we would like to
call from now on the S3 class spam
:
format.spam <- function(x, ...)
paste0("<", x, ">")
Testing data frame printing:
data.frame(
a=structure(c("lovely", "yummy", "delicious"), class="spam"),
b=factor(c("spam", "bacon", "spam")),
c=Sys.Date()+1:3
)
## a b c
## 1 <lovely> spam 2025-01-16
## 2 <yummy> bacon 2025-01-17
## 3 <delicious> spam 2025-01-18
12.1.2. cbind.data.frame and rbind.data.frame¶
There are data frame-specific versions of
cbind or rbind (which we discussed
in the context of stacking matrices; Section 11.1.2).
They are used quite eagerly: help("cbind")
states that
they will be referred to if at least[3]
one of its arguments is a data frame, and the other arguments
are atomic vectors or lists (possibly with the dim
attribute).
For example:
x <- iris[c(1, 51, 101), c("Sepal.Length", "Species")] # whatever
cbind(Yummy=c(TRUE, FALSE, TRUE), x)
## Yummy Sepal.Length Species
## 1 TRUE 5.1 setosa
## 51 FALSE 7.0 versicolor
## 101 TRUE 6.3 virginica
It added a new column to a data frame x
. Moreover:
rbind(x, list(42, "virginica"))
## Sepal.Length Species
## 1 5.1 setosa
## 51 7.0 versicolor
## 101 6.3 virginica
## 11 42.0 virginica
It added a new row. Note that columns are of different types. Hence, the values to row-bind had to be provided as a list.
The generic vector used as a new row specifier can also be named. It can consist of sequences of length greater than one that are given in any order:
rbind(x, list(
Species=c("virginica", "setosa"),
Sepal.Length=c(42, 7)
))
## Sepal.Length Species
## 1 5.1 setosa
## 51 7.0 versicolor
## 101 6.3 virginica
## 11 42.0 virginica
## 2 7.0 setosa
A direct referral to cbind.data.frame and rbind.data.frame will sometimes be necessary. Consider an example list of atomic vectors:
x <- list(a=1:3, b=11:13, c=21:23)
First, we call the generic, which dispatches to the default method:
do.call(cbind, x)
## a b c
## [1,] 1 11 21
## [2,] 2 12 22
## [3,] 3 13 23
It created a matrix. If we want to ensure we garner a data frame, we need to write:
do.call(cbind.data.frame, x)
## a b c
## 1 1 11 21
## 2 2 12 22
## 3 3 13 23
This is useful for fetching outputs from Map et al., as they are wrapped inside a list. Here is a fancy way to obtain an illustrative list:
l <- unname(Map(
function(x) list( # objects are of different types, hence a list
Sepal.Length=mean(x[["Sepal.Length"]]),
Sepal.Width=mean(x[["Sepal.Width"]]),
Species=x[["Species"]][1] # all are the same, so the first will do
),
split(iris, iris[["Species"]]) # split.data.frame; see below
))
str(l)
## List of 3
## $ :List of 3
## ..$ Sepal.Length: num 5.01
## ..$ Sepal.Width : num 3.43
## ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1
## $ :List of 3
## ..$ Sepal.Length: num 5.94
## ..$ Sepal.Width : num 2.77
## ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 2
## $ :List of 3
## ..$ Sepal.Length: num 6.59
## ..$ Sepal.Width : num 2.97
## ..$ Species : Factor w/ 3 levels "setosa","versicolor",..: 3
We may now turn it into a data frame by calling:
do.call(rbind.data.frame, l)
## Sepal.Length Sepal.Width Species
## 1 5.006 3.428 setosa
## 2 5.936 2.770 versicolor
## 3 6.588 2.974 virginica
On the other hand, do.call(rbind, l)
does not return
an amiable object type:
do.call(rbind, l)
## Sepal.Length Sepal.Width Species
## [1,] 5.006 3.428 setosa
## [2,] 5.936 2.77 versicolor
## [3,] 6.588 2.974 virginica
Despite the pretty face, it is a matrix… over a list:
str(do.call(rbind, l))
## List of 9
## $ : num 5.01
## $ : num 5.94
## $ : num 6.59
## $ : num 3.43
## $ : num 2.77
## $ : num 2.97
## $ : Factor w/ 3 levels "setosa","versicolor",..: 1
## $ : Factor w/ 3 levels "setosa","versicolor",..: 2
## $ : Factor w/ 3 levels "setosa","versicolor",..: 3
## - attr(*, "dim")= int [1:2] 3 3
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:3] "Sepal.Length" "Sepal.Width" "Species"
12.1.3. Reading data frames¶
Structured data can be imported from external sources, such as CSV/TSV (comma/tab-separated values) or HDF5 files, relational databases supporting SQL (see Section 12.1.4), web APIs (e.g., through the curl and jsonlite packages), spreadsheets [66], and so on. In particular, read.csv and the like fetch data from plain text files consisting of records, where commas, semicolons, tabs, etc. separate the fields. For instance:
x <- data.frame(a=runif(3), b=c(TRUE, FALSE, TRUE)) # example data frame
f <- tempfile() # temporary file name
write.csv(x, f, row.names=FALSE) # export
It created a CSV file that looks like:
cat(readLines(f), sep="\n") # print file contents
## "a","b"
## 0.287577520124614,TRUE
## 0.788305135443807,FALSE
## 0.4089769218117,TRUE
which can be read by calling:
read.csv(f)
## a b
## 1 0.28758 TRUE
## 2 0.78831 FALSE
## 3 0.40898 TRUE
Check out help("read.table")
for a long list of tunable parameters,
especially: sep
, dec
, quote
, header
, comment.char
,
and row.names
. Further, note that reading from compressed files
and interned URLs is supported directly.
Important
CSV is the most portable and user-friendly format for exchanging matrix-like objects between different programs and computing languages (Python, Julia, LibreOffice Calc, etc.). Such files can be opened in any text editor.
Also, as mentioned in Section 8.3.5,
we can process data frames chunk by chunk.
This is beneficial especially when data do not fit into memory
(compare the nrows
argument to read.csv).
12.1.4. Interfacing relational databases and querying with SQL (*)¶
The DBI package provides a universal interface for many database management systems whose drivers are implemented in add-ons such as RSQLite, RMariaDB, RPostgreSQL, etc., or, more generally, RODBC or odbc. For more details, see Section 4 of [66].
Let’s play with an in-memory (volatile) instance of an SQLite database.
library("DBI")
con <- dbConnect(RSQLite::SQLite(), ":memory:")
It returned an object representing a database connection which we can refer to in further communication. An easy way to create a database table is to call:
dbWriteTable(con, "mtcars", mtcars) # `mtcars` is a toy data frame
Alternatively, we could have called dbExecute
to send SQL statements such as “CREATE TABLE ...
” followed by a series
of “INSERT INTO ...
”. We can now retrieve some data:
dbGetQuery(con, "
SELECT cyl, vs, AVG(mpg) AS mpg_ave, AVG(hp) AS hp_ave
FROM mtcars
GROUP BY cyl, vs
")
## cyl vs mpg_ave hp_ave
## 1 4 0 26.000 91.00
## 2 4 1 26.730 81.80
## 3 6 0 20.567 131.67
## 4 6 1 19.125 115.25
## 5 8 0 15.100 209.21
It gave us an ordinary R data frame. We can process it in the same fashion as any other object of this kind.
At the end, the database connection must be closed.
dbDisconnect(con)
Database passwords must never be stored in plain text files, let alone in R scripts in version-controlled repositories. Consider a few ways to fetch credentials programmatically:
using environment variables (see help
("Sys.getenv")
),using the keyring package,
calling system2 (Section 7.3.2) to retrieve it from the system keyring (e.g., the keyring package for Python provides a platform-independent command-line utility).
12.1.5. Strings as factors?¶
Some functions related to data frames automatically convert character
vectors to factors. This behaviour is frequently controlled by
an argument named stringsAsFactors
.
It can be particularly problematic because, when printed,
factor and character columns look identical:
(x <- data.frame(a=factor(c("U", "V")), b=c("U", "V")))
## a b
## 1 U U
## 2 V V
We recall from Section 10.3.2 that factors can be nasty. For example, passing factors as indexers in `[` or converting them with as.numeric might give counterintuitive results. Also, when we want to extend factors by previously unobserved data, new levels must be added manually. This can cause unexpected behaviour in contexts such as:
rbind(x, c("W", "W"))
## Warning in `[<-.factor`(`*tmp*`, ri, value = "W"): invalid factor level,
## NA generated
## a b
## 1 U U
## 2 V V
## 3 <NA> W
Therefore, always having the data types checked is a praiseworthy habit. For instance:
str(x)
## 'data.frame': 2 obs. of 2 variables:
## $ a: Factor w/ 2 levels "U","V": 1 2
## $ b: chr "U" "V"
Before R 4.0, certain functions,
including data.frame and read.csv
had the stringsAsFactors
argument defaulting to TRUE
.
It is no longer the case.
However, exceptions to this rule still exist, e.g.,
including as.data.frame.table and expand.grid.
Besides, some example data frames continue to enjoy factor-typed columns, e.g.:
class(iris[["Species"]])
## [1] "factor"
In particular, adding a new flower variety might be oblique:
iris2 <- iris[c(1, 101), ] # example subset
rbind(iris2, c(6, 3, 3, 2, "croatica"))
## Warning in `[<-.factor`(`*tmp*`, ri, value = "croatica"): invalid factor
## level, NA generated
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 101 6.3 3.3 6 2.5 virginica
## 3 6 3 3 2 <NA>
Compare it to:
levels(iris2[["Species"]])[nlevels(iris2[["Species"]])+1] <- "croatica"
rbind(iris2, c(6, 3, 3, 2, "croatica"))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 101 6.3 3.3 6 2.5 virginica
## 3 6 3 3 2 croatica
12.1.6. Internal representation¶
Objects of the S3 class data.frame
are erected on lists
of vectors of the same length or matrices with identical row counts.
Each list element defines a column or column group.
Apart from class
, data frames must be equipped with the following
special attributes:
names
– a character vector (as usual in any named list) that gives the column labels,row.names
– a character or integer vector with no duplicates nor missing values, doing what is advertised.
Therefore, a data frame can be created from scratch by calling, for example:
structure(
list(a=11:13, b=21:23), # sets the `names` attribute
row.names=1:3,
class="data.frame"
)
## a b
## 1 11 21
## 2 12 22
## 3 13 23
Here is a data frame based on a list of length five, a matrix with five rows, and a numeric vector with five items. We added some fancy row names on top:
structure(
list(
a=list(1, 1:2, 1:3, numeric(0), -(4:1)),
b=cbind(u=11:15, v=21:25),
c=runif(5)
),
row.names=c("spam", "bacon", "eggs", "ham", "aubergine"),
class="data.frame"
)
## a b.u b.v c
## spam 1 11 21 0.28758
## bacon 1, 2 12 22 0.78831
## eggs 1, 2, 3 13 23 0.40898
## ham 14 24 0.88302
## aubergine -4, -3, -2, -1 15 25 0.94047
In general, the columns of the type list
can contain anything, e.g., other
lists or R functions. Including atomic vectors of varying lengths, just like
above, permits us to create something à la ragged arrays.
The issue with matrix entries, on the other hand, is that they appear as if they were many columns. Still, as it will turn out in the sequel, they are often treated as a single complex column, e.g., by the index operator (see Section 12.2). Therefore, from this perspective, the aforementioned data frame has three columns, not four. Such compound columns can be output by aggregate (see Section 12.3), amongst others. They are valuable in certain contexts: the column groups can be easily accessed as a whole and batch-processed in the same way.
Important
Alas, data frames with list or matrix columns cannot be
created with the data.frame nor cbind
functions. This might explain why they are less popular.
This behaviour is dictated by the underlying
as.data.frame methods, which they both call.
As a curiosity, see help("I")
, though.
Verify that if a data frame carries a matrix column,
this matrix does not need to have any column names (the second
element of dimnames
).
The names
and row.names
attributes are special in the sense
of Section 4.4.3. In particular, they
can be accessed or modified via the dedicated functions.
It is worth noting that row.names(df)
always returns
a character vector, even when attr(df, "row.names")
is integer.
Further, calling “row.names(df) <- NULL
” will
reset[4] this attribute to the most commonly desired case
of consecutive natural numbers. For example:
(x <- iris[c(1, 51, 101), ]) # comes with some sad row names
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
`row.names<-`(x, NULL) # reset to seq_len(NROW(x))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 7.0 3.2 4.7 1.4 versicolor
## 3 6.3 3.3 6.0 2.5 virginica
Implement your version of expand.grid.
Write a version of xtabs that does not rely on a formula interface
(compare Section 10.3.4). Allow three parameters: a data frame,
the name of the “counts” column, and the names of the cross-classifying
factors. Hence,
my_xtabs(x, "Freq",
c("Var1", "Var2"))
should be equivalent to
xtabs(Freq~Var1+Var2, x)
.
12.2. Data frame subsetting¶
12.2.1. Data frames are lists¶
A data frame is a named list whose elements represent individual columns. Therefore[5], length yields the number of columns and names gives their respective labels.
Let’s play around with this data frame:
(x <- data.frame(
a=runif(6),
b=rnorm(6),
c=LETTERS[1:6],
d1=c(FALSE, TRUE, FALSE, NA, FALSE, NA),
d2=c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
))
## a b c d1 d2
## 1 0.287578 0.070508 A FALSE FALSE
## 2 0.788305 0.129288 B TRUE TRUE
## 3 0.408977 1.715065 C FALSE FALSE
## 4 0.883017 0.460916 D NA TRUE
## 5 0.940467 -1.265061 E FALSE FALSE
## 6 0.045556 -0.686853 F NA TRUE
typeof(x) # each data frame is a list
## [1] "list"
length(x) # the number of columns
## [1] 5
names(x) # column labels
## [1] "a" "b" "c" "d1" "d2"
The one-argument versions of extract and index operators behave as expected. `[[` fetches (looks inside) the contents of a given column:
x[["a"]] # or x[[1]]
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556
`[` returns a data frame (a list with extras):
x["a"] # or x[1]; a data frame with one column
## a
## 1 0.287578
## 2 0.788305
## 3 0.408977
## 4 0.883017
## 5 0.940467
## 6 0.045556
x[c(TRUE, TRUE, FALSE, TRUE, FALSE)]
## a b d1
## 1 0.287578 0.070508 FALSE
## 2 0.788305 0.129288 TRUE
## 3 0.408977 1.715065 FALSE
## 4 0.883017 0.460916 NA
## 5 0.940467 -1.265061 FALSE
## 6 0.045556 -0.686853 NA
Just like with lists, the replacement versions of these operators can add new columns or modify existing ones.
(y <- head(x, 1)) # example data frame
## a b c d1 d2
## 1 0.28758 0.070508 A FALSE FALSE
y[["a"]] <- round(y[["a"]], 1) # replaces the column with new content
y[["b"]] <- NULL # removes the column, like, totally
y[["e"]] <- 10*y[["a"]]^2 # adds a new column at the end
print(y)
## a c d1 d2 e
## 1 0.3 A FALSE FALSE 0.9
Some spam for thought to show how much we already know. Here are a few common scenarios involving indexing.
(y <- head(x, 1)) # example data frame
## a b c d1 d2
## 1 0.28758 0.070508 A FALSE FALSE
Move the column a
to the end:
y[unique(c(names(y), "a"), fromLast=TRUE)]
## b c d1 d2 a
## 1 0.070508 A FALSE FALSE 0.28758
Remove the columns a
and c
:
y[-match(c("a", "c"), names(y))] # or y[setdiff(names(y), c("a", "c"))]
## b d1 d2
## 1 0.070508 FALSE FALSE
Select all columns between a
and c
:
y[match("a", names(y)):match("c", names(y))]
## a b c
## 1 0.28758 0.070508 A
Fetch the columns with names starting with d
:
y[grep("^d", names(y), perl=TRUE)]
## d1 d2
## 1 FALSE FALSE
Change the name of column c
to z
:
names(y)[names(y) == "c"] <- "z"
print(y) # `names<-`(y, `[<-`(names(y), names(y) == "c", "z"))
## a b z d1 d2
## 1 0.28758 0.070508 A FALSE FALSE
Change names: d2
to u
and d1
to v
:
names(y)[match(c("d2", "d1"), names(y))] <- c("v", "u")
print(y)
## a b z u v
## 1 0.28758 0.070508 A FALSE FALSE
Note
Some users prefer the `$` operator over `[[`, but we do not. By default, the former supports partial matching of column names which might be appealing when R is used interactively. Nonetheless, it does not work on matrices nor it allows for programmatically generated names. It is also trickier to use on not syntactically valid labels; compare Section 9.3.1.
Write a function rename that changes
the names of columns based on a translation table
given in a from=to
fashion (we have already solved a similar exercise
in Chapter 9). For instance:
rename <- function(x, ...) ...to.do...
rename(head(x, 1), c="new_c", a="new_a")
## new_a b new_c d1 d2
## 1 0.28758 0.070508 A FALSE FALSE
12.2.2. Data frames are matrix-like¶
Data frames can be considered “generalised” matrices. They store data
of any kind (possibly mixed) organised in a tabular fashion.
A few functions mentioned in the previous chapter are overloaded for
the data frame case. They include: dim (despite the lack of the
dim
attribute), NROW, NCOL, and
dimnames (which is, of course, based on row.names
and names
). For example:
(x <- data.frame(
a=runif(6),
b=rnorm(6),
c=LETTERS[1:6],
d1=c(FALSE, TRUE, FALSE, NA, FALSE, NA),
d2=c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
))
## a b c d1 d2
## 1 0.287578 0.070508 A FALSE FALSE
## 2 0.788305 0.129288 B TRUE TRUE
## 3 0.408977 1.715065 C FALSE FALSE
## 4 0.883017 0.460916 D NA TRUE
## 5 0.940467 -1.265061 E FALSE FALSE
## 6 0.045556 -0.686853 F NA TRUE
dim(x) # the number of rows and columns
## [1] 6 5
dimnames(x) # row and column labels
## [[1]]
## [1] "1" "2" "3" "4" "5" "6"
##
## [[2]]
## [1] "a" "b" "c" "d1" "d2"
In addition to the list-like behaviour, which only allows for dealing with particular columns or their groups, the `[` operator can also take two indexers:
x[1:2, ] # first two rows
## a b c d1 d2
## 1 0.28758 0.070508 A FALSE FALSE
## 2 0.78831 0.129288 B TRUE TRUE
x[x[["a"]] >= 0.3 & x[["a"]] <= 0.8, -2] # or use x[, "a"]
## a c d1 d2
## 2 0.78831 B TRUE TRUE
## 3 0.40898 C FALSE FALSE
Recall the drop
argument to `[` and its effects
on matrix indexing (Section 11.2.4). In the current case,
its behaviour will be similar with regard to the
operations on individual columns:
x[, 1] # synonym: x[[1]] because drop=TRUE
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556
x[, 1, drop=FALSE] # synonym: x[1]
## a
## 1 0.287578
## 2 0.788305
## 3 0.408977
## 4 0.883017
## 5 0.940467
## 6 0.045556
When we extract a single row and more than one column,
drop
does not apply. It is because columns (unlike in matrices)
can potentially be of different types:
x[1, 1:2] # two numeric columns but the result is still a numeric
## a b
## 1 0.28758 0.070508
However:
x[1, 1] # a single value
## [1] 0.28758
x[1, 1, drop=FALSE] # a data frame with one row and one column
## a
## 1 0.28758
Note
Once again, let’s take note of logical indexing in the presence of missing values:
x[x[["d1"]], ] # `d1` is of the type logical
## a b c d1 d2
## 2 0.78831 0.12929 B TRUE TRUE
## NA NA NA <NA> NA NA
## NA.1 NA NA <NA> NA NA
x[which(x[["d1"]]), ] # `which` drops missing values
## a b c d1 d2
## 2 0.78831 0.12929 B TRUE TRUE
The default behaviour is consistent with many other R functions. It explicitly indicates that something is missing. After all, when we select a “don’t know”, the result is unknown as well. Regretfully, this comes with no warning. As we seldom check missing values in the outputs manually, our absent-mindedness can lead to code bugs.
By far, we might have already noted that
the index operator adjusts (not: resets) the row.names
attribute. For instance:
(xs <- x[order(x[["a"]], decreasing=TRUE)[1:3], ])
## a b c d1 d2
## 5 0.94047 -1.26506 E FALSE FALSE
## 4 0.88302 0.46092 D NA TRUE
## 2 0.78831 0.12929 B TRUE TRUE
It is a version of x
comprised of the top three values in
the a
column. Indexing by means of character vectors will refer to
row.names
and names
:
xs["5", c("a", "b")]
## a b
## 5 0.94047 -1.2651
It is not the same as xs[5,
c("a", "b")]
,
even though row.names
is formally an integer vector here.
Regarding the replacement version of the two-indexer variant of the `[` operator, it is a flexible tool. It permits the new content to be a vector, a data frame, a list, or even a matrix. Verifying this is left as an exercise.
Note
If a data frame carries a matrix, to access a specific sub-column, we need to use the index/extract operator twice:
(x <- aggregate(iris[1], iris[5], function(x) c(Min=min(x), Max=max(x))))
## Species Sepal.Length.Min Sepal.Length.Max
## 1 setosa 4.3 5.8
## 2 versicolor 4.9 7.0
## 3 virginica 4.9 7.9
x[["Sepal.Length"]][, "Min"]
## [1] 4.3 4.9 4.9
In other words, neither x[["Sepal.Length.Min"]]
nor x[, "Sepal.Length.Min"]
works.
Write two replacement functions[6].
First, author
set_row_names which replaces the row.names
of a data frame
with the contents of a specific column. For example:
(x <- aggregate(iris[1], iris[5], mean)) # an example data frame
## Species Sepal.Length
## 1 setosa 5.006
## 2 versicolor 5.936
## 3 virginica 6.588
set_row_names(x) <- "Species"
print(x)
## Sepal.Length
## setosa 5.006
## versicolor 5.936
## virginica 6.588
Second, implement reset_row_names which converts row.names
to
a standalone column of a given name. For instance:
reset_row_names(x) <- "Type"
print(x)
## Sepal.Length Type
## 1 5.006 setosa
## 2 5.936 versicolor
## 3 6.588 virginica
These two functions may be handy for they enable writing
x[something, ]
instead of
x[x[["column"]]
%in% something, ]
.
12.3. Common operations¶
Below we review the most commonly applied operations
related to data frame wrangling. We have a few dedicated functions
or methods overloaded for the data.frame
class. However,
we have already mastered most skills to deal with such objects effectively.
Let’s repeat: data frames are just lists exhibiting matrix-like behaviour.
12.3.1. Ordering rows¶
Ordering rows in a data frame with respect to different
criteria can be easily achieved through the order function
and the two-indexer version of `[`.
For instance, here are the six fastest cars from mtcars
in terms of the
time (in seconds) to complete a 402-metre race:
mtcars6 <- mtcars[order(mtcars[["qsec"]])[1:6], c("qsec", "cyl", "gear")]
(mtcars6 <- `row.names<-`(cbind(model=row.names(mtcars6), mtcars6), NULL))
## model qsec cyl gear
## 1 Ford Pantera L 14.50 8 5
## 2 Maserati Bora 14.60 8 5
## 3 Camaro Z28 15.41 8 3
## 4 Ferrari Dino 15.50 6 5
## 5 Duster 360 15.84 8 3
## 6 Mazda RX4 16.46 6 4
order uses a stable sorting algorithm. Therefore,
any sorting with respect to a different criterion will not break
the relative ordering of qsec
in row groups with ties:
mtcars6[order(mtcars6[["cyl"]]), ]
## model qsec cyl gear
## 4 Ferrari Dino 15.50 6 5
## 6 Mazda RX4 16.46 6 4
## 1 Ford Pantera L 14.50 8 5
## 2 Maserati Bora 14.60 8 5
## 3 Camaro Z28 15.41 8 3
## 5 Duster 360 15.84 8 3
qsec
is still increasing in each of the two cyl
groups.
Notice the difference between ordering by
cyl
and gear
:
mtcars6[order(mtcars6[["cyl"]], mtcars6[["gear"]]), ]
## model qsec cyl gear
## 6 Mazda RX4 16.46 6 4
## 4 Ferrari Dino 15.50 6 5
## 3 Camaro Z28 15.41 8 3
## 5 Duster 360 15.84 8 3
## 1 Ford Pantera L 14.50 8 5
## 2 Maserati Bora 14.60 8 5
vs gear
and cyl
:
mtcars6[order(mtcars6[["gear"]], mtcars6[["cyl"]]), ]
## model qsec cyl gear
## 3 Camaro Z28 15.41 8 3
## 5 Duster 360 15.84 8 3
## 6 Mazda RX4 16.46 6 4
## 4 Ferrari Dino 15.50 6 5
## 1 Ford Pantera L 14.50 8 5
## 2 Maserati Bora 14.60 8 5
Note
Mixing increasing and decreasing ordering is tricky
as the decreasing
argument to order
currently does not accept multiple flags
in all the contexts. Perhaps the easiest way to change the ordering
direction is to use the unary minus operator on the column(s)
to be sorted decreasingly.
mtcars6[order(mtcars6[["gear"]], -mtcars6[["cyl"]]), ]
## model qsec cyl gear
## 3 Camaro Z28 15.41 8 3
## 5 Duster 360 15.84 8 3
## 6 Mazda RX4 16.46 6 4
## 1 Ford Pantera L 14.50 8 5
## 2 Maserati Bora 14.60 8 5
## 4 Ferrari Dino 15.50 6 5
For factor and character columns, xtfrm can convert them to sort keys first.
mtcars6[order(mtcars6[["cyl"]], -xtfrm(mtcars6[["model"]])), ]
## model qsec cyl gear
## 6 Mazda RX4 16.46 6 4
## 4 Ferrari Dino 15.50 6 5
## 2 Maserati Bora 14.60 8 5
## 1 Ford Pantera L 14.50 8 5
## 5 Duster 360 15.84 8 3
## 3 Camaro Z28 15.41 8 3
Both statements act like the unsupported
decreasing=
c(FALSE, TRUE)
.
Write a method sort.data.frame that orders a data frame with respect to a given set of columns.
sort.data.frame <- function(x, decreasing=FALSE, cols) ...to.do...
sort(mtcars6, cols=c("cyl", "model"))
## model qsec cyl gear
## 4 Ferrari Dino 15.50 6 5
## 6 Mazda RX4 16.46 6 4
## 3 Camaro Z28 15.41 8 3
## 5 Duster 360 15.84 8 3
## 1 Ford Pantera L 14.50 8 5
## 2 Maserati Bora 14.60 8 5
Unfortunately, that decreasing
must be of length one and
be placed as the second argument is imposed by the
sort S3 generic.
12.3.2. Handling duplicated rows¶
duplicated, anyDuplicated, and unique
have methods overloaded for the data.frame
class.
They can be used to indicate, get rid of, or replace
the repeating rows.
sum(duplicated(iris)) # how many duplicated rows are there?
## [1] 1
iris[duplicated(iris), ] # show the duplicated rows
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 143 5.8 2.7 5.1 1.9 virginica
12.3.3. Joining (merging) data frames¶
The merge function can perform the join operation that some readers might know from SQL[7]. It matches the items in the columns that two given data frames somewhat share. Then, it returns the combination of the corresponding rows.
Two calls to merge could be used to
match data on programmers (each identified by developer_id
and giving such details as their name, location, main skill, etc.)
with the information about the open-source projects
(each identified by project_id
and informing us about
its title, scope, website, and so forth) they are engaged in
(based on a third data frame defining developer_id
and project_id
pairs).
As a simple illustration, consider two objects:
A <- data.frame(
u=c("b0", "b1", "b2", "b3"),
v=c("a0", "a1", "a2", "a3")
)
B <- data.frame(
v=c("a0", "a2", "a2", "a4"),
w=c("c0", "c1", "c2", "c3")
)
The two common columns, i.e., storing
data of similar nature (a
-something strings), are both named v
.
First is the inner (natural) join, where we list only the matching pairs:
merge(A, B) # x=A, y=B, by="v", all.x=FALSE, all.y=FALSE
## v u w
## 1 a0 b0 c0
## 2 a2 b2 c1
## 3 a2 b2 c2
The common column is included in the result only once. Next, the left join guarantees that all elements in the first data frame will be included in the result:
merge(A, B, all.x=TRUE) # by="v", all.y=FALSE
## v u w
## 1 a0 b0 c0
## 2 a1 b1 <NA>
## 3 a2 b2 c1
## 4 a2 b2 c2
## 5 a3 b3 <NA>
The right join includes all records in the second argument:
merge(A, B, all.y=TRUE) # by="v", all.x=FALSE
## v u w
## 1 a0 b0 c0
## 2 a2 b2 c1
## 3 a2 b2 c2
## 4 a4 <NA> c3
Lastly, the full outer join is their set-theoretic union:
merge(A, B, all.x=TRUE, all.y=TRUE) # by="v"
## v u w
## 1 a0 b0 c0
## 2 a1 b1 <NA>
## 3 a2 b2 c1
## 4 a2 b2 c2
## 5 a3 b3 <NA>
## 6 a4 <NA> c3
Joining on more than one common column is also supported.
Show how match (Section 5.4.1) can help author a very basic version of merge.
Implement a version of match that allows the x
and table
arguments to be data frames with the same number of columns
so that also the matching of pairs, triples, etc. is possible.
12.3.4. Aggregating and transforming columns¶
It might be tempting to try aggregating data frames with apply. Sadly, currently, this function coerces its argument to a matrix. Hence, we should refrain from applying it on data frames whose columns are of mixed types. However, taking into account that data frames are special lists, we can always call Map and its relatives.
Consider an example data frame:
(iris_sample <- iris[sample(NROW(iris), 6), ])
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 28 5.2 3.5 1.5 0.2 setosa
## 80 5.7 2.6 3.5 1.0 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 133 6.4 2.8 5.6 2.2 virginica
To get the class of each column, we can call:
sapply(iris_sample, class) # or unlist(Map(class, iris))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## "numeric" "numeric" "numeric" "numeric" "factor"
Next, here is a way to compute some aggregates of the numeric columns:
unlist(Map(mean, Filter(is.numeric, iris_sample)))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 6.0667 3.1333 4.5500 1.7167
or:
sapply(iris_sample[sapply(iris_sample, is.numeric)], mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 6.0667 3.1333 4.5500 1.7167
We can also fetch more than a single summary of each column:
as.data.frame(Map(
function(x) c(Min=min(x), Max=max(x)),
Filter(is.numeric, iris_sample)
))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min 5.2 2.6 1.5 0.2
## Max 6.5 3.5 6.0 2.5
or:
sapply(iris_sample[sapply(iris_sample, is.numeric)], quantile, c(0, 1))
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0% 5.2 2.6 1.5 0.2
## 100% 6.5 3.5 6.0 2.5
The latter called simplify2array automatically. Thus, the result is a matrix.
On the other hand, the standardisation of all numeric features can be performed, e.g., via a call:
iris_sample[] <- Map(function(x) {
if (!is.numeric(x)) x else (x-mean(x))/sd(x)
}, iris_sample)
print(iris_sample)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 28 -1.70405 1.03024 -1.76004 -1.65318 setosa
## 80 -0.72094 -1.49854 -0.60591 -0.78117 versicolor
## 101 0.45878 0.46829 0.83674 0.85384 virginica
## 111 0.85202 0.18732 0.31738 0.30884 virginica
## 137 0.45878 0.74927 0.60591 0.74484 virginica
## 133 0.65540 -0.93659 0.60591 0.52684 virginica
12.3.5. Handling missing values¶
The is.na method for objects of the class data.frame
returns a logical matrix of the same dimensionality[8],
indicating whether the corresponding items are missing or not.
Of course, the default method can still be called on individual columns.
Further, na.omit gets rid of rows with missing values.
Given a data frame, use is.na and other functions like apply or approx to:
remove all rows that bear at least one missing value,
remove all rows that only consist of missing values,
remove all columns that carry at least one missing value,
for each column, replace all missing values with the column averages,
for each column, replace all missing values with values that linearly interpolate between the preceding and succeeding well-defined observations (which is of use in time series processing), e.g., the blanks in c
(0.60, 0.62, NA, 0.64, NA, NA, 0.58)
should be filled to obtain c(0.60, 0.62, 0.63, 0.64, 0.62, 0.60, 0.58)
.
12.3.6. Reshaping data frames¶
Consider an example matrix:
A <- matrix(round(runif(6), 2), nrow=3,
dimnames=list(
c("X", "Y", "Z"), # row labels
c("u", "v") # column labels
))
names(dimnames(A)) <- c("Row", "Col")
print(A)
## Col
## Row u v
## X 0.29 0.88
## Y 0.79 0.94
## Z 0.41 0.05
The as.data.frame method for the table
class can be called directly on any array-like object:
as.data.frame.table(A, responseName="Val", stringsAsFactors=FALSE)
## Row Col Val
## 1 X u 0.29
## 2 Y u 0.79
## 3 Z u 0.41
## 4 X v 0.88
## 5 Y v 0.94
## 6 Z v 0.05
It is an instance of array reshaping. More precisely, we call it stacking. We converted from a wide (okay, in this example, not so wide, as we only have two columns) to a long (tall) format.
The above can also be achieved by means of the reshape function which is more flexible and operates directly on data frames (but is harder to use):
(df <- `names<-`(
data.frame(row.names(A), A, row.names=NULL),
c("Row", "Col.u", "Col.v")))
## Row Col.u Col.v
## 1 X 0.29 0.88
## 2 Y 0.79 0.94
## 3 Z 0.41 0.05
(stacked <- reshape(df, varying=2:3, direction="long"))
## Row time Col id
## 1.u X u 0.29 1
## 2.u Y u 0.79 2
## 3.u Z u 0.41 3
## 1.v X v 0.88 1
## 2.v Y v 0.94 2
## 3.v Z v 0.05 3
Maybe the default column names are not superb, but we can adjust them manually afterwards. The reverse operation is called unstacking:
reshape(stacked, idvar="Row", timevar="time", drop="id", direction="wide")
## Row Col.u Col.v
## 1.u X 0.29 0.88
## 2.u Y 0.79 0.94
## 3.u Z 0.41 0.05
Given a named numeric vector, convert it to a data frame with two columns. For instance:
convert <- function(x) ...to.do...
x <- c(spam=42, eggs=7, bacon=3)
convert(x)
## key value
## 1 spam 42
## 2 eggs 7
## 3 bacon 3
Stack the WorldPhones
dataset. Then, unstack it back.
Furthermore, unstack the stacked set after removing[9]
five random rows from it and randomly permuting all the remaining rows.
Fill in the missing entries with NA
s.
Implement a basic version of as.data.frame.table manually (using rep etc.). Also, write a function as.table.data.frame that computes its reverse. Make sure both functions are compatible with each other.
Titanic
is a four-dimensional array. Convert it to a long data frame.
Perform what follows on the undermentioned data frame:
convert the second column to a list of character vectors (split at
","
);extract the first elements from each of such vectors;
extract the last elements;
(*) unstack the split data frame;
(*) stack it back to a data frame that carries a list;
convert the list back to a character column (concatenate with
","
as separator).
(x <- data.frame(
name=c("Kat", "Ron", "Jo", "Mary"),
food=c("buckwheat", "spam,bacon,spam", "", "eggs,spam,spam,lollipops")
))
## name food
## 1 Kat buckwheat
## 2 Ron spam,bacon,spam
## 3 Jo
## 4 Mary eggs,spam,spam,lollipops
Write a function that converts all matrix-based columns in a given data frame to separate atomic columns. Furthermore, author a function that does the opposite, i.e., groups all columns with similar prefixes and turns them into matrices.
12.3.7. Aggregating data in groups¶
We can straightforwardly apply various transforms on data groups determined by a factor-like variable or their combination thanks to the split.data.frame method, which returns a list of data frames. For example:
x <- data.frame(
a=c( 10, 20, 30, 40, 50),
u=c("spam", "spam", "eggs", "spam", "eggs"),
v=c( 1, 2, 1, 1, 1)
)
split(x, x["u"]) # i.e., split.data.frame(x, x["u"]) or x[["u"]]
## $eggs
## a u v
## 3 30 eggs 1
## 5 50 eggs 1
##
## $spam
## a u v
## 1 10 spam 1
## 2 20 spam 2
## 4 40 spam 1
It split x
with respect to the u
column, which served as the grouping
variable. On the other hand:
split(x, x[c("u", "v")]) # sep="."
## $eggs.1
## a u v
## 3 30 eggs 1
## 5 50 eggs 1
##
## $spam.1
## a u v
## 1 10 spam 1
## 4 40 spam 1
##
## $eggs.2
## [1] a u v
## <0 rows> (or 0-length row.names)
##
## $spam.2
## a u v
## 2 20 spam 2
It partitioned with respect to a combination of two factor-like
sequences. A nonexistent level pair (eggs, 2)
resulted in an empty data frame.
split.data.frame (when called directly) can also be used to break a matrix into a list of matrices (rowwisely). Given a matrix, perform its train-test split: allocate, say, 70% of the rows at random into one matrix and the remaining 30% into another.
sapply is convenient if we need to aggregate grouped numeric data. To recall, it is a combination of lapply (one-vector version of Map) and simplify2array (Section 11.1.3).
sapply(split(iris[1:2], iris[5]), sapply, mean)
## setosa versicolor virginica
## Sepal.Length 5.006 5.936 6.588
## Sepal.Width 3.428 2.770 2.974
If the function to apply returns more than a single value,
sapply will not return too informative a result.
The list of matrices converted to a matrix will not have the row.names
argument set:
MinMax <- function(x) c(Min=min(x), Max=max(x))
sapply(split(iris[1:2], iris[5]), sapply, MinMax)
## setosa versicolor virginica
## [1,] 4.3 4.9 4.9
## [2,] 5.8 7.0 7.9
## [3,] 2.3 2.0 2.2
## [4,] 4.4 3.4 3.8
As a workaround, we either call simplify2array
explicitly, or pass simplify="array"
to sapply:
(res <- sapply(
split(iris[1:2], iris[5]),
sapply,
MinMax,
simplify="array"
)) # or simplify2array(lapply(...) or Map(...) etc.)
## , , setosa
##
## Sepal.Length Sepal.Width
## Min 4.3 2.3
## Max 5.8 4.4
##
## , , versicolor
##
## Sepal.Length Sepal.Width
## Min 4.9 2.0
## Max 7.0 3.4
##
## , , virginica
##
## Sepal.Length Sepal.Width
## Min 4.9 2.2
## Max 7.9 3.8
It produced a three-dimensional array, which is particularly handy if we now wish to access specific results by name:
res[, "Sepal.Length", "setosa"]
## Min Max
## 4.3 5.8
The previously mentioned as.data.frame.table method will work on it like a charm (up to the column names, which we can change):
as.data.frame.table(res, stringsAsFactors=FALSE)
## Var1 Var2 Var3 Freq
## 1 Min Sepal.Length setosa 4.3
## 2 Max Sepal.Length setosa 5.8
## 3 Min Sepal.Width setosa 2.3
## 4 Max Sepal.Width setosa 4.4
## 5 Min Sepal.Length versicolor 4.9
## 6 Max Sepal.Length versicolor 7.0
## 7 Min Sepal.Width versicolor 2.0
## 8 Max Sepal.Width versicolor 3.4
## 9 Min Sepal.Length virginica 4.9
## 10 Max Sepal.Length virginica 7.9
## 11 Min Sepal.Width virginica 2.2
## 12 Max Sepal.Width virginica 3.8
Note
If the grouping (by
) variable is a list of two or more factors,
the combined levels will be concatenated to a single string.
This behaviour yields a result that may be deemed convenient
in some contexts but not necessarily so in other ones.
as.data.frame.table(as.array(sapply(
split(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], sep="_"),
sapply, # but check also: function(...) as.matrix(sapply(...)),
mean
)), stringsAsFactors=FALSE)
## Var1 Freq
## 1 OJ_0.5.len 13.23
## 2 VC_0.5.len 7.98
## 3 OJ_1.len 22.70
## 4 VC_1.len 16.77
## 5 OJ_2.len 26.06
## 6 VC_2.len 26.14
The name of the aggregated column (len
) has been included,
because sapply simplifies the result to a flat vector
too eagerly.
aggregate can assist us when a single function is to be applied on all columns in a data frame:
aggregate(iris[-5], iris[5], mean) # neither iris[[5]] nor iris[, 5]
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
aggregate(ToothGrowth["len"], ToothGrowth[c("supp", "dose")], mean)
## supp dose len
## 1 OJ 0.5 13.23
## 2 VC 0.5 7.98
## 3 OJ 1.0 22.70
## 4 VC 1.0 16.77
## 5 OJ 2.0 26.06
## 6 VC 2.0 26.14
The second argument, by
, must be list-like (this includes data frames).
Neither a factor nor an atomic vector is acceptable.
Also, if the function being applied returns many values, they
will be wrapped into a matrix column:
(x <- aggregate(iris[2], iris[5], function(x) c(Min=min(x), Max=max(x))))
## Species Sepal.Width.Min Sepal.Width.Max
## 1 setosa 2.3 4.4
## 2 versicolor 2.0 3.4
## 3 virginica 2.2 3.8
class(x[["Sepal.Width"]])
## [1] "matrix" "array"
x[["Sepal.Width"]] # not: Sepal.Width.Max, etc.
## Min Max
## [1,] 2.3 4.4
## [2,] 2.0 3.4
## [3,] 2.2 3.8
It is actually handy: by referring to
x[["Sepal.Width"]]
, we access all the stats for this column.
Further, if many columns are being aggregated simultaneously,
we can process all the summaries in the same way.
Check out the by function,
which supports basic split-apply-bind use cases.
Note the particularly odd behaviour of the
print method for the by
class.
The most flexible scenario involves applying a custom function returning any set of aggregates in the form of a list and then row-binding the results to obtain a data frame.
The following implements an R version of what we would express in SQL as:
SELECT supp, dose, AVG(len) AS ave_len, COUNT(*) AS count
FROM ToothGrowth
GROUP BY supp, dose
Ad rem:
do.call(rbind.data.frame, lapply(
split(ToothGrowth, ToothGrowth[c("supp", "dose")]),
function(df) list(
supp=df[1, "supp"],
dose=df[1, "dose"],
ave_len=mean(df[["len"]]),
count=NROW(df)
)
))
## supp dose ave_len count
## OJ.0.5 OJ 0.5 13.23 10
## VC.0.5 VC 0.5 7.98 10
## OJ.1 OJ 1.0 22.70 10
## VC.1 VC 1.0 16.77 10
## OJ.2 OJ 2.0 26.06 10
## VC.2 VC 2.0 26.14 10
Many aggregation functions are idempotent, which means
that when they are fed with a vector with all the elements
being identical, the result is exactly that unique element:
min, mean, median, and max
behave this way. Overload the mean and median methods
for character vectors and factors. They should return NA
and give a warning for sequences where not all elements are the same.
Otherwise, they are expected to output the unique value.
mean.character <- function(x, na.rm=FALSE, ...) ...to.do...
mean.factor <- function(x, na.rm=FALSE, ...) ...to.do...
This way, we can also aggregate the grouping variables conveniently:
do.call(rbind.data.frame,
lapply(split(ToothGrowth, ToothGrowth[c("supp", "dose")]), lapply, mean))
## len supp dose
## OJ.0.5 13.23 OJ 0.5
## VC.0.5 7.98 VC 0.5
## OJ.1 22.70 OJ 1.0
## VC.1 16.77 VC 1.0
## OJ.2 26.06 OJ 2.0
## VC.2 26.14 VC 2.0
Let’s study a function that takes a named list x
(can be a data frame) and a sequence of col=f
pairs, and applies
the function f
(or each function from a list of functions f
)
on the element named col
in x
:
napply <- function(x, ...)
{
fs <- list(...)
cols <- names(fs)
stopifnot(is.list(x), !is.null(names(x)))
stopifnot(all(cols %in% names(x)))
do.call(
c, # concatenates lists
lapply(
structure(seq_along(fs), names=cols),
function(i)
{ # always returns a list
y <- x[[ cols[i] ]]
if (is.function(fs[[i]]))
list(fs[[i]](y))
else
lapply(fs[[i]], function(f) f(y))
}
)
)
}
For example:
first <- function(x, ...) head(x, n=1L, ...) # helper function
napply(ToothGrowth,
supp=first, dose=first, len=list(ave=mean, count=length)
)
## $supp
## [1] VC
## Levels: OJ VC
##
## $dose
## [1] 0.5
##
## $len.ave
## [1] 18.813
##
## $len.count
## [1] 60
It applied first on both
ToothGrowth[["supp"]]
and ToothGrowth[["dose"]]
as well as mean and length on
ToothGrowth[["len"]]
. We included list names for a more dramatic effect.
And now:
do.call(
rbind.data.frame,
lapply(
split(ToothGrowth, ToothGrowth[c("supp", "dose")]),
napply,
supp=first, dose=first, len=list(ave=mean, count=length)
)
)
## supp dose len.ave len.count
## OJ.0.5 OJ 0.5 13.23 10
## VC.0.5 VC 0.5 7.98 10
## OJ.1 OJ 1.0 22.70 10
## VC.1 VC 1.0 16.77 10
## OJ.2 OJ 2.0 26.06 10
## VC.2 VC 2.0 26.14 10
or even:
gapply <- function(x, by, ...)
do.call(rbind.data.frame, lapply(
split(x, x[by]),
function(x, ...)
do.call(napply, c( # add all by=first calls
x=list(x),
`names<-`(rep(list(first), length(by)), by),
list(...)
)),
...
))
And now:
gapply(iris, "Species", Sepal.Length=mean, Sepal.Width=list(min, max))
## Species Sepal.Length Sepal.Width1 Sepal.Width2
## setosa setosa 5.006 2.3 4.4
## versicolor versicolor 5.936 2.0 3.4
## virginica virginica 6.588 2.2 3.8
gapply(ToothGrowth, c("supp", "dose"), len=list(ave=mean, count=length))
## supp dose len.ave len.count
## OJ.0.5 OJ 0.5 13.23 10
## VC.0.5 VC 0.5 7.98 10
## OJ.1 OJ 1.0 22.70 10
## VC.1 VC 1.0 16.77 10
## OJ.2 OJ 2.0 26.06 10
## VC.2 VC 2.0 26.14 10
This brings fun back to R programming in the sad times when many things are given to us on a plate (the thorough testing of the above is left as an exercise).
In Section 10.4, we mentioned (without giving the implementation)
the group_by function returning a list of the class list_dfs
.
It splits a data frame into a list of data frames with respect to
a combination of levels in given named columns:
group_by <- function(df, by)
{
stopifnot(is.character(by), is.data.frame(df))
df <- droplevels(df) # factors may have unused levels
structure(
split(df, df[names(df) %in% by]),
class="list_dfs",
by=by
)
}
The next function applies a set of aggregates on every column of each data frame in a given list (two nested lapplys plus cosmetic additions):
aggregate.list_dfs <- function(x, FUN, ...)
{
aggregates <- lapply(x, function(df) {
is_by <- names(df) %in% attr(x, "by")
res <- lapply(df[!is_by], FUN, ...)
res_mat <- do.call(rbind, res)
if (is.null(dimnames(res_mat)[[2]]))
dimnames(res_mat)[[2]] <- paste0("f", seq_len(NCOL(res_mat)))
cbind(
`row.names<-`(df[1, is_by, drop=FALSE], NULL),
x=row.names(res_mat),
`row.names<-`(res_mat, NULL)
)
})
combined_aggregates <- do.call(rbind.data.frame, aggregates)
`row.names<-`(combined_aggregates, NULL)
}
aggregate(group_by(ToothGrowth, c("supp", "dose")), range)
## supp dose x f1 f2
## 1 OJ 0.5 len 8.2 21.5
## 2 VC 0.5 len 4.2 11.5
## 3 OJ 1.0 len 14.5 27.3
## 4 VC 1.0 len 13.6 22.5
## 5 OJ 2.0 len 22.4 30.9
## 6 VC 2.0 len 18.5 33.9
We really want our API to be bloated, so let’s introduce a convenience function, which is a specialised version of the above:
mean.list_dfs <- function(x, ...)
aggregate.list_dfs(x, function(y) c(Mean=mean(y, ...)))
mean(group_by(iris[51:150, c(2, 3, 5)], "Species"))
## Species x Mean
## 1 versicolor Sepal.Width 2.770
## 2 versicolor Petal.Length 4.260
## 3 virginica Sepal.Width 2.974
## 4 virginica Petal.Length 5.552
12.3.8. Transforming data in groups¶
Variables will sometimes need to be transformed relative to what is happening in a dataset’s subsets. This is the case, e.g., where we decide that missing values should be replaced by the corresponding within-group averages or want to compute the relative ranks or z-scores.
If the loss of the original ordering of rows is not an issue, the standard split-apply-bind will suffice. Here is an example data frame:
(x <- data.frame(
a=c( 10, 1, NA, NA, NA, 4),
b=c( -1, 10, 40, 30, 1, 20),
c=runif(6),
d=c("v", "u", "u", "u", "v", "u")
))
## a b c d
## 1 10 -1 0.52811 v
## 2 1 10 0.89242 u
## 3 NA 40 0.55144 u
## 4 NA 30 0.45661 u
## 5 NA 1 0.95683 v
## 6 4 20 0.45333 u
Some operations:
fill_na <- function(x) `[<-`(x, is.na(x), value=mean(x[!is.na(x)]))
standardise <- function(x) (x-mean(x))/sd(x)
And now:
x_groups <- lapply(
split(x, x["d"]),
function(df) {
df[["a"]] <- fill_na(df[["a"]])
df[["b"]] <- rank(df[["b"]])
df[["c"]] <- standardise(df[["c"]])
df
}
)
do.call(rbind.data.frame, x_groups)
## a b c d
## u.2 1.0 1 1.46357 u
## u.3 2.5 4 -0.17823 u
## u.4 2.5 3 -0.63478 u
## u.6 4.0 2 -0.65057 u
## v.1 10.0 1 -0.70711 v
## v.5 10.0 2 0.70711 v
Only the relative ordering of rows within groups has been retained. Overall, the rows are in a different order. If this is an issue, we can use the unsplit function:
unsplit(x_groups, x["d"])
## a b c d
## 1 10.0 1 -0.70711 v
## 2 1.0 1 1.46357 u
## 3 2.5 4 -0.17823 u
## 4 2.5 3 -0.63478 u
## 5 10.0 2 0.70711 v
## 6 4.0 2 -0.65057 u
Show how we can perform the above also via the replacement version of split.
(*) Recreating the previous ordering can be done manually, too. It is because the split operation behaves as if we first ordered the data frame with respect to the grouping variable(s) (using a stable sorting algorithm). Here is a transformation of an example data frame split by a combination of two factors:
(x <- `row.names<-`(ToothGrowth[sample(NROW(ToothGrowth), 10), ], NULL))
## len supp dose
## 1 23.0 OJ 2.0
## 2 23.3 OJ 1.0
## 3 29.4 OJ 2.0
## 4 14.5 OJ 1.0
## 5 11.2 VC 0.5
## 6 20.0 OJ 1.0
## 7 24.5 OJ 2.0
## 8 10.0 OJ 0.5
## 9 9.4 OJ 0.5
## 10 7.0 VC 0.5
(y <- do.call(rbind.data.frame, lapply(
split(x, x[c("dose", "supp")]), # two grouping variables
function(df) {
df[["len"]] <- df[["len"]] * 100^df[["dose"]] * # whatever
ifelse(df[["supp"]] == "OJ", -1, 1) # do not overthink it
df
}
)))
## len supp dose
## 0.5.OJ.8 -100 OJ 0.5
## 0.5.OJ.9 -94 OJ 0.5
## 1.OJ.2 -2330 OJ 1.0
## 1.OJ.4 -1450 OJ 1.0
## 1.OJ.6 -2000 OJ 1.0
## 2.OJ.1 -230000 OJ 2.0
## 2.OJ.3 -294000 OJ 2.0
## 2.OJ.7 -245000 OJ 2.0
## 0.5.VC.5 112 VC 0.5
## 0.5.VC.10 70 VC 0.5
Section 5.4.4 mentioned that by calling order, we can determine the inverse of a given permutation. Hence, we can call:
y[order(order(x[["supp"]], x[["dose"]])), ] # not: dose, supp
## len supp dose
## 2.OJ.1 -230000 OJ 2.0
## 1.OJ.2 -2330 OJ 1.0
## 2.OJ.3 -294000 OJ 2.0
## 1.OJ.4 -1450 OJ 1.0
## 0.5.VC.5 112 VC 0.5
## 1.OJ.6 -2000 OJ 1.0
## 2.OJ.7 -245000 OJ 2.0
## 0.5.OJ.8 -100 OJ 0.5
## 0.5.OJ.9 -94 OJ 0.5
## 0.5.VC.10 70 VC 0.5
Additionally, we can manually restore the original row.names
,
et voilà.
12.3.9. Metaprogramming-based techniques (*)¶
Section 9.4.7 mentioned a few functions that provide convenient interfaces to some common data frame operations. These include transform, subset, with, and basically every procedure accepting a formula. The popular data.table and dplyr packages also belong to this class (Section 12.3.10).
Unfortunately, each method relying on metaprogramming must be studied separately because it is free to interpret the form of the passed arguments arbitrarily, without taking into account their real meaning. As we are concerned with developing a more universal skill set, we avoid[10] them in this course. They do not offer anything more than what we have learnt so far.
Withal, they are thought-provoking on their own. Furthermore, they are popular in other users’ code. Thus, after all, they deserve the honourable mention.
Consider an example call to the subset function:
subset(iris, Sepal.Length<4.5, -(Sepal.Width:Petal.Width))
## Sepal.Length Species
## 9 4.4 setosa
## 14 4.3 setosa
## 39 4.4 setosa
## 43 4.4 setosa
Neither the second nor the third argument makes sense as a standalone R expression. We have not defined the named variables used there:
Sepal.Length<4.5 # utter nonsense
## Error in eval(expr, envir, enclos): object 'Sepal.Length' not found
-(Sepal.Width:Petal.Width) # gibberish
## Error in eval(expr, envir, enclos): object 'Sepal.Width' not found
Only from help("subset")
we can learn that this tool assumes
that the expression passed as the second argument
plays the role of a row selector. Moreover, the third one is meant
to remove all the columns between the two given ones.
In our course, we pay attention to developing transferable skills. We believe that R is not the only language we will learn during our long and happy lives. It is much more likely that in the next environment, we will become used to writing something of the more basic form:
between <- function(x, from, to) match(from, x):match(to, x)
iris[iris[["Sepal.Length"]]<4.5,
-between(names(iris), "Sepal.Width", "Petal.Width")]
## Sepal.Length Species
## 9 4.4 setosa
## 14 4.3 setosa
## 39 4.4 setosa
## 43 4.4 setosa
With transform, we can add, modify, and remove columns in a data frame. Existing features can be referred to as if they were ordinary variables:
(mtcars4 <- mtcars[sample(seq_len(NROW(mtcars)), 4), c("hp", "am", "mpg")])
## hp am mpg
## Maserati Bora 335 1 15.0
## Cadillac Fleetwood 205 0 10.4
## Honda Civic 52 1 30.4
## Merc 450SLC 180 0 15.2
transform(mtcars4, log_hp=log(hp), am=2*am-1, hp=NULL, fcon=235/mpg)
## am mpg log_hp fcon
## Maserati Bora 1 15.0 5.8141 15.6667
## Cadillac Fleetwood -1 10.4 5.3230 22.5962
## Honda Civic 1 30.4 3.9512 7.7303
## Merc 450SLC -1 15.2 5.1930 15.4605
Similarly, attach adds any named list to the search path (see Section 16.2.6) but it does not support altering their contents. As an alternative, within may be called:
within(mtcars4, {
log_hp <- log(hp)
fcon <- 235/mpg
am <- factor(am, levels=c(0, 1), labels=c("no", "yes"))
hp <- NULL
})
## am mpg fcon log_hp
## Maserati Bora yes 15.0 15.6667 5.8141
## Cadillac Fleetwood no 10.4 22.5962 5.3230
## Honda Civic yes 30.4 7.7303 3.9512
## Merc 450SLC no 15.2 15.4605 5.1930
Those who find writing mtcars4[["name"]]
instead of name
too exhausting,
can save a few keystrokes.
As mentioned in Section 10.3.4 (see Section 17.6 for more details), formulae are special objects that consist of two unevaluated expressions separated by a tilde, `~`. Functions can support formulae and do what they please with them. However, a popular approach is to allow them to express “something grouped by something else” or “one thing as a function of other things”.
do.call(rbind.data.frame, lapply(split(ToothGrowth, ~supp+dose), head, 1))
## len supp dose
## OJ.0.5 15.2 OJ 0.5
## VC.0.5 4.2 VC 0.5
## OJ.1 19.7 OJ 1.0
## VC.1 16.5 VC 1.0
## OJ.2 25.5 OJ 2.0
## VC.2 23.6 VC 2.0
aggregate(cbind(mpg, log_hp=log(hp))~am:cyl, mtcars, mean)
## am cyl mpg log_hp
## 1 0 4 22.900 4.4186
## 2 1 4 28.075 4.3709
## 3 0 6 19.125 4.7447
## 4 1 6 20.567 4.8552
## 5 0 8 15.050 5.2553
## 6 1 8 15.400 5.6950
head(model.frame(mpg+hp~log(hp)+I(1/qsec), mtcars))
## mpg + hp log(hp) I(1/qsec)
## Mazda RX4 131.0 4.7005 0.060753....
## Mazda RX4 Wag 131.0 4.7005 0.058754....
## Datsun 710 115.8 4.5326 0.053734....
## Hornet 4 Drive 131.4 4.7005 0.051440....
## Hornet Sportabout 193.7 5.1648 0.058754....
## Valiant 123.1 4.6540 0.049455....
If these examples seem esoteric, it is because it is precisely the case. We need to consult the corresponding functions’ manuals to discover what they do. And, as we do not recommend their use to beginner programmers, we will not explain them here. Don’t trip.
In the last example, the peculiar printing of the last column is due to which method’s being overloaded?
In the third part of this book, we will return to these functions for they will serve as an amusing illustration of how to indite our own procedures that rely on metaprogramming techniques.
12.3.10. A note on the dplyr (tidyverse) and data.table packages (*)¶
data.table and dplyr are very popular packages that implement common data frame transformations. In particular, the latter is part of an immerse system of interdependent extensions called tidyverse which became quite invasive over the last few years. They both heavily rely on metaprogramming and introduce entirely new APIs featuring hundreds of functions for the operations we already know well how to perform (the calamity of superabundance).
Still, their users must remember that they will need to rely on base functions when the processing of other prominent data structures is required, e.g., of fancy lists and matrices. Base R (and its predecessor, S) has long ago proven to be a versatile tool for rapid prototyping, calling specialised procedures written in C or Java, and wrangling data that fit into memory. Even though some operations from the mentioned packages may be much faster for larger datasets, the speed is less often an issue in practice than what most users might think.
For larger problems, techniques for working with batches of data, sampling methods, or aggregating data stored elsewhere are often the way to go, especially when building machine learning models or visualisation[11] is required. Usually, the most recent data will be stored in external, normalised databases, and we will need to join a few tables to fetch something valuable from the perspective of the current task’s context.
Thus, we cannot stress enough that, in many situations, SQL, not the other tools, is the most powerful interface to more considerable amounts of data. Learning it will give us the skills we can use later in other programming environments.
Note
Of course, certain functions from tidyverse and related packages
we will find very helpful after all. Quite annoyingly,
they tend to return objects of the class tibble
(tbl_df
)
(e.g., haven::
read.xpt that
reads SAS
data files).
Luckily, they are subclasses of data.frame
;
we can always use as.data.frame to get our
favourite objects back.
12.4. Exercises¶
Answer the following questions.
What attributes a data frame is equipped with?
If
row.names
is an integer vector, how to access rows labelled1
,7
, and42
?How to create a data frame that carries a column that is a list of character vectors of different lengths?
How to create a data frame that includes a matrix column?
How to convert all numeric columns in a data frame to a numeric matrix?
Assuming that
x
is an atomic vector, what is the difference between as.data.frame(x)
, as.data.frame(
as.list(x))
, as.data.frame(
list(a=x))
, and data.frame(a=x)
?
Assuming that x
is a data frame, what is the meaning of/difference between
the following:
x["u"]
vsx[["u"]]
vsx[, "u"]
?x["u"][1]
vsx[["u"]][1]
vsx[1, "u"]
vsx[1, "u", drop=FALSE]
?x[
which(x[[1]] > 0), ]
vsx[x[[1]] > 0, ]
?x[
grep("^foo",
names(x))]
?
We have a data frame with columns named like:
ID
(character),
checked
(logical, possibly with missing values),
category
(factor),
x0
, …, x9
(ten separate numeric columns),
y0
, …, y9
(ten separate numeric columns),
coords
(numeric matrix with two columns named lat
and long
),
and
features
(list of character vectors of different lengths).
How to extract the rows where
checked
isTRUE
?How to extract the rows for which
ID
is like three letters and then five digits (e.g.,XYZ12345
)?How to select all the numeric columns in one go?
How to extract a subset comprised only of the
ID
andx
-something columns?How to get rid of all the columns between
x3
andy7
?Assuming that the
ID
s are like three letters and then five digits, how to add two columns:ID3
(the letters) andID5
(the five digits)?How to check where both
lat
andlong
incoords
are negative?How to add the column indicating the number of
features
?How to extract the rows where
"spam"
is amongst thefeatures
?How to convert it to a long data frame with two columns:
ID
andfeature
(individual strings)?How to change the name of the
ID
column toid
?How to make the
y
-foo columns appear before thex
-bar ones?How to order the rows with respect to
checked
(FALSE
first, thenTRUE
) andID
s (decreasingly)?How to remove rows with duplicate IDs?
How to determine how many entries correspond to each
category
?How to compute the average
lat
andlong
in eachcategory
?How to compute the average
lat
andlong
for eachcategory
andchecked
combined?
Consider the flights
dataset.
Give some ways to select all rows between March and October
(regardless of the year).
In this task, you will be working with a version of a
dataset on 70k+ Melbourne trees (urban_forest
).
Load the downloaded dataset by calling the read.csv function.
Fetch the IDs (
CoM.ID
) and trunk diameters (Diameter.Breast.Height
) of the horse chestnuts with five smallest diameters at breast height. The output data frame must be sorted with respect toDiameter.Breast.Height
, decreasingly.Create a new data frame that gives the number of trees planted in each year.
Compute the average age (in years, based on
Year.Planted
) of the trees of genera (each genus separately): Eucalyptus, Platanus, Ficus, Acer, and Quercus.
(*) Consider the historic data dumps of Stack Exchange available here. Export these CSV files to an SQLite database. Then, write some R code that corresponds to the following SQL queries. Use dbGetQuery to verify your results.
First:
SELECT
Users.DisplayName,
Users.Age,
Users.Location,
SUM(Posts.FavoriteCount) AS FavoriteTotal,
Posts.Title AS MostFavoriteQuestion,
MAX(Posts.FavoriteCount) AS MostFavoriteQuestionLikes
FROM Posts
JOIN Users ON Users.Id=Posts.OwnerUserId
WHERE Posts.PostTypeId=1
GROUP BY OwnerUserId
ORDER BY FavoriteTotal DESC
LIMIT 10
Second:
SELECT
Posts.ID,
Posts.Title,
Posts2.PositiveAnswerCount
FROM Posts
JOIN (
SELECT
Posts.ParentID,
COUNT(*) AS PositiveAnswerCount
FROM Posts
WHERE Posts.PostTypeID=2 AND Posts.Score>0
GROUP BY Posts.ParentID
) AS Posts2
ON Posts.ID=Posts2.ParentID
ORDER BY Posts2.PositiveAnswerCount DESC
LIMIT 10
Third:
SELECT
Posts.Title,
UpVotesPerYear.Year,
MAX(UpVotesPerYear.Count) AS Count
FROM (
SELECT
PostId,
COUNT(*) AS Count,
STRFTIME('%Y', Votes.CreationDate) AS Year
FROM Votes
WHERE VoteTypeId=2
GROUP BY PostId, Year
) AS UpVotesPerYear
JOIN Posts ON Posts.Id=UpVotesPerYear.PostId
WHERE Posts.PostTypeId=1
GROUP BY Year
Fourth:
SELECT
Questions.Id,
Questions.Title,
BestAnswers.MaxScore,
Posts.Score AS AcceptedScore,
BestAnswers.MaxScore-Posts.Score AS Difference
FROM (
SELECT Id, ParentId, MAX(Score) AS MaxScore
FROM Posts
WHERE PostTypeId==2
GROUP BY ParentId
) AS BestAnswers
JOIN (
SELECT * FROM Posts
WHERE PostTypeId==1
) AS Questions
ON Questions.Id=BestAnswers.ParentId
JOIN Posts ON Questions.AcceptedAnswerId=Posts.Id
WHERE Difference>50
ORDER BY Difference DESC
Fifth:
SELECT
Posts.Title,
CmtTotScr.CommentsTotalScore
FROM (
SELECT
PostID,
UserID,
SUM(Score) AS CommentsTotalScore
FROM Comments
GROUP BY PostID, UserID
) AS CmtTotScr
JOIN Posts ON Posts.ID=CmtTotScr.PostID
AND Posts.OwnerUserId=CmtTotScr.UserID
WHERE Posts.PostTypeId=1
ORDER BY CmtTotScr.CommentsTotalScore DESC
LIMIT 10
Sixth:
SELECT DISTINCT
Users.Id,
Users.DisplayName,
Users.Reputation,
Users.Age,
Users.Location
FROM (
SELECT
Name, UserID
FROM Badges
WHERE Name IN (
SELECT
Name
FROM Badges
WHERE Class=1
GROUP BY Name
HAVING COUNT(*) BETWEEN 2 AND 10
)
AND Class=1
) AS ValuableBadges
JOIN Users ON ValuableBadges.UserId=Users.Id
Seventh:
SELECT
Posts.Title,
VotesByAge2.OldVotes
FROM Posts
JOIN (
SELECT
PostId,
MAX(CASE WHEN VoteDate = 'new' THEN Total ELSE 0 END) NewVotes,
MAX(CASE WHEN VoteDate = 'old' THEN Total ELSE 0 END) OldVotes,
SUM(Total) AS Votes
FROM (
SELECT
PostId,
CASE STRFTIME('%Y', CreationDate)
WHEN '2017' THEN 'new'
WHEN '2016' THEN 'new'
ELSE 'old'
END VoteDate,
COUNT(*) AS Total
FROM Votes
WHERE VoteTypeId=2
GROUP BY PostId, VoteDate
) AS VotesByAge
GROUP BY VotesByAge.PostId
HAVING NewVotes=0
) AS VotesByAge2 ON VotesByAge2.PostId=Posts.ID
WHERE Posts.PostTypeId=1
ORDER BY VotesByAge2.OldVotes DESC
LIMIT 10
(*)
Generate a CSV file that stores some random data arranged in
a few columns of a size at least two times larger than your available RAM.
Then, export the CSV file to an SQLite database.
Use file connections (Section 8.3.5) and the
nrow
argument to read.table to process it chunk by chunk.
Determine whether setting colClasses
in read.table
speeds up the reading of large CSV files significantly or not.
(*) Export the whole XML data dump of StackOverflow published at https://archive.org/details/stackexchange (see also https://data.stackexchange.com/) to an SQLite database.