# 5. Vector indexing

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug or typo reports and fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python.

We now know plenty of ways to process vectors in their entirety, but how to extract and replace their specific parts? We will be collectively referring to such activities as indexing. This is because they are often performed through the index operator, [.

Let us begin with something more lightweight, though. The head function can be used to fetch a few elements from the beginning of a vector.

x <- 1:10
head(x)      # head(x, 6)
## [1] 1 2 3 4 5 6
head(x, 3)   # get the first three
## [1] 1 2 3
head(x, -3)  # skip the last three
## [1] 1 2 3 4 5 6 7


Similarly, tail extracts a few elements from the end of a sequence.

tail(x)  # tail(x, 6)
## [1]  5  6  7  8  9 10
tail(x, 3)   # get the last three
## [1]  8  9 10
tail(x, -3)  # skip the first three
## [1]  4  5  6  7  8  9 10


Both functions work on lists, too. They are useful, e.g., when we wish to preview the contents of a big object.
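As a minimal sketch of the list case, consider the snippet below. The list cfg and its element names are hypothetical and serve only as an illustration.

```r
# A toy list; `cfg` is a made-up name used only in this sketch.
cfg <- list(alpha=1, beta="two", gamma=3:5, delta=TRUE)
first_two <- head(cfg, 2)     # a list with the first two elements (names kept)
last_one  <- tail(cfg, 1)     # a list with the last element only
all_but_last <- head(cfg, -1) # everything but the last element
str(first_two)                # a compact preview of the result
```

Note that the results are still lists, even when only one element is requested.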

## 5.2. Subsetting and extracting from vectors

Given a vector x, “x[i]” returns its subset comprised of elements indicated by the indexer i, which can be a single vector of:

• nonnegative integers (gives the positions of elements to retrieve),

• negative integers (gives the positions to omit),

• logical values (states whether the corresponding element should be fetched or skipped),

• character strings (locates the elements with specific names).
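As a quick preview of the four kinds of indexers, here is a minimal sketch applied to one object. The named vector v is a hypothetical example; each case is discussed in detail in the subsections that follow.

```r
v <- c(a=10, b=20, c=30, d=40)         # hypothetical named vector
r1 <- v[c(1, 3)]                       # nonnegative integers: positions to retrieve
r2 <- v[-c(1, 3)]                      # negative integers: positions to omit
r3 <- v[c(TRUE, FALSE, FALSE, TRUE)]   # logical values: fetch or skip
r4 <- v[c("b", "d")]                   # character strings: locate by name
print(list(r1, r2, r3, r4))
```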

### 5.2.1. Nonnegative indexes

Consider the following example vectors:

(x <- seq(10, 100, 10))
##  [1]  10  20  30  40  50  60  70  80  90 100
(y <- list(1, 11:12, 21:23))
## [[1]]
## [1] 1
##
## [[2]]
## [1] 11 12
##
## [[3]]
## [1] 21 22 23


The first element in a vector is at index 1. Hence:

x[1]          # the first element
## [1] 10
x[length(x)]  # the last element
## [1] 100


Important

We might have wondered why “[1]” is being displayed each time we print out an atomic vector on the console:

print((1:51)*10)
##  [1]  10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160 170
## [18] 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340
## [35] 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510


It is merely a visual hint indicating which vector element we output first in each line.

Vectorisation is a universal feature of R. It comes as no surprise that the indexer can also be of length greater than one.

x[c(1, length(x))]  # the first and the last
## [1]  10 100
x[1:3]  # the first three
## [1] 10 20 30


Take note of some boundary cases:

x[c(1, 2, 1, 0, 3, NA_real_, 1, 11)]  # repeated, 0, missing, out of bound
## [1] 10 20 10 30 NA 10 NA
x[c()]  # indexing by an empty vector
## numeric(0)


Important

Subsetting with [ yields an object of the same type.

When applied on lists, the index operator always returns a list as well, even if we ask for a single element:

y[2]  # a list that includes the 2nd element
## [[1]]
## [1] 11 12
y[c(1, 3)]  # not the same as x[1, 3] (a different story)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 21 22 23


If we wish to extract a component, i.e., to dig into what is inside a list at a specific location, we can refer to [[:

y[[2]]  # extract the 2nd element
## [1] 11 12


This is exactly why R displays “[[1]]”, “[[2]]”, etc. when printing out lists on the console.

Note

Calling “x[[i]]” on an atomic vector, where i is a single value, has almost the same effect as “x[i]”. However, [[ generates an error if the subscript is out of bounds.
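The following sketch contrasts the two operators on hypothetical atomic vectors v and w. The out-of-bounds error is captured with tryCatch so that the snippet runs to completion. The other visible difference illustrated here is that [[ drops the element's name.

```r
v <- c(10, 20, 30)   # hypothetical example vector
a <- v[2]            # 20
b <- v[[2]]          # 20 as well
oob <- v[4]          # NA: `[` tolerates an out-of-bounds position
# `[[` signals an error instead; we catch it to inspect the message:
msg <- tryCatch(v[[4]], error=function(e) conditionMessage(e))
print(msg)

w <- c(a=1, b=2)     # hypothetical named vector
w[2]                 # the label "b" is kept
w[[2]]               # the label is dropped
```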

Note

(*) [[ also supports multiple indexers.

y[[c(1, 3)]]
## Error in y[[c(1, 3)]]: subscript out of bounds


Its meaning is different from y[c(1, 3)], though; we are about to extract a single value, remember? Here, indexing is applied recursively. Namely, the above is equivalent to y[[1]][[3]]. We got an error because y[[1]] is of a length smaller than three.

More examples:

y[[c(3, 1)]]  # y[[3]][[1]]
## [1] 21
list(list(7))[[c(1, 1)]]  # 7, not list(7)
## [1] 7


Important

Let us reflect on the operators’ behaviour in the case of nonexistent items:

c(1, 2, 3)[4]
## [1] NA
list(1, 2, 3)[4]
## [[1]]
## NULL
c(1, 2, 3)[[4]]
## Error in c(1, 2, 3)[[4]]: subscript out of bounds
list(1, 2, 3)[[4]]
## Error in list(1, 2, 3)[[4]]: subscript out of bounds


### 5.2.2. Negative indexes

The indexer can also be a vector of negative integers. This way, we can exclude the elements at given positions:

y[-1]  # all but the first
## [[1]]
## [1] 11 12
##
## [[2]]
## [1] 21 22 23
x[-(1:3)]
## [1]  40  50  60  70  80  90 100
x[-c(1, 0, 2, 1, 1, 8:100)]  # 0, repeated, out of bound indexes
## [1] 30 40 50 60 70


Note

Negative and positive indexes cannot be mixed.

x[-1:3]  # recall that -1:3 == (-1):3
## Error in x[-1:3]: only 0's may be mixed with negative subscripts


Also, NA indexes are not allowed amongst negative ones.
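A minimal sketch of the last point, on a hypothetical vector v: the failing call is wrapped in tryCatch so that we can inspect the outcome without interrupting the script. Filtering out the missing values first is one possible fix, not the only one.

```r
v <- seq(10, 50, 10)   # hypothetical example vector
idx <- c(-1, NA, -2)   # an indexer that mixes negative values with NA
# Subsetting with it fails; tryCatch captures the error object:
res <- tryCatch(v[idx], error=function(e) e)
print(inherits(res, "error"))
# One possible fix: discard the missing values before indexing.
cleaned <- v[idx[!is.na(idx)]]
print(cleaned)
```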

### 5.2.3. Logical indexer

A vector can also be subsetted by means of a logical vector. If they both are of identical lengths, the consecutive elements in the latter indicate whether the corresponding elements of the indexed vector are supposed to be selected (TRUE) or omitted (FALSE).

#   1***  2      3      4      5***  6***  7      8***  9?   10***
x[c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, NA,  TRUE)]
## [1]  10  50  60  80  NA 100


In other words, x[l], where l is a logical vector, returns all x[i] with i such that l[i] is TRUE. Above, we extracted the elements at indexes 1, 5, 6, 8, and 10.

Important

Let us be careful: if the element selector is NA, the selected element will be set to a missing value (for atomic vectors) or NULL (for lists).

c("one", "two", "three")[c(NA, TRUE, FALSE)]
## [1] NA    "two"
list("one", "two", "three")[c(NA, TRUE, FALSE)]
## [[1]]
## NULL
##
## [[2]]
## [1] "two"


This, lamentably, comes with no warning, which might be problematic when indexers are generated programmatically.

As a remedy, we sometimes pass the logical indexer to the which function first. It returns the indexes of the elements equal to TRUE, ignoring the missing ones.

which(c(NA, TRUE, FALSE))
## [1] 2
c("one", "two", "three")[which(c(NA, TRUE, FALSE))]
## [1] "two"


Recall that in Chapter 3, we discussed ample vectorised operations that generate logical vectors. Anything that yields a logical vector of the same length as x can be passed as an indexer.

x > 60  # yes, it is a perfect indexer candidate
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
x[x > 60]  # select elements in x that are greater than 60
## [1]  70  80  90 100
x[x < 30 | 70 < x]  # elements not between 30 and 70
## [1]  10  20  80  90 100
x[x < mean(x)]  # elements smaller than the mean
## [1] 10 20 30 40 50
x[x^2 > 7777 | log10(x) <= 1.6]  # indexing via a transformed version of x
## [1]  10  20  30  90 100
(z <- round(runif(length(x)), 2))  # ten pseudorandom numbers
##  [1] 0.29 0.79 0.41 0.88 0.94 0.05 0.53 0.89 0.55 0.46
x[z <= 0.5]  # indexing based on z, not x: not a problem
## [1]  10  30  60 100


The indexer is always evaluated first and then passed to the subsetting operation. The index operator does not care how an indexer is generated.

Furthermore, the recycling rule is applied when necessary:

x[c(FALSE, TRUE)]  # every second element
## [1]  20  40  60  80 100
y[c(TRUE, FALSE)]  # interestingly, there is no warning here
## [[1]]
## [1] 1
##
## [[2]]
## [1] 21 22 23

Exercise 5.1

Consider a simple database about six people, their favourite dishes, and birth years.

name <- c("Graham", "John", "Terry", "Eric",  "Michael", "Terry")
food <- c("bacon",  "spam", "spam",  "eggs",  "spam",    "beans")
year <- c( 1941,     1939,   1942,    1943,    1943,      1940  )


The consecutive elements in different vectors correspond to each other, e.g., Graham was born in 1941, and his go-to food was bacon.

• List the names of people born in 1941 or 1942.

• List the names of those who like spam.

• List the names of those who like spam and were born after 1940.

• Compute the average birth year of the lovers of spam.

• Give the average age, in 1969, of those who didn’t find spam utmostly delicious.

The answers to the above must be provided programmatically, i.e., we do not just write "Eric" and "Graham". The code must be generic enough so that it works in the case of any other database of this kind, no matter its size.

Exercise 5.2

Remove missing values from a given vector without referring to the na.omit function.

### 5.2.4. Character indexer

Consider a vector equipped with the names attribute, such as this one:

x <- structure(x, names=letters[1:10])  # add names
print(x)
##   a   b   c   d   e   f   g   h   i   j
##  10  20  30  40  50  60  70  80  90 100


These labels can be referred to for the purpose of extracting the elements. To do this, we use an indexer that is a character vector:

x[c("a", "f", "a", "g", "z")]
##    a    f    a    g <NA>
##   10   60   10   70   NA


Important

We have said that special object attributes add extra functionality on top of the existing ones. Therefore, indexing by means of positive, negative, and logical vectors is still available:

x[1:3]
##  a  b  c
## 10 20 30
x[-(1:5)]
##   f   g   h   i   j
##  60  70  80  90 100
x[x > 70]
##   h   i   j
##  80  90 100


Lists can also be subsetted this way.

(y <- structure(y, names=c("first", "second", "third")))
## $first
## [1] 1
##
## $second
## [1] 11 12
##
## $third
## [1] 21 22 23
y[c("first", "second")]
## $first
## [1] 1
##
## $second
## [1] 11 12
y["third"]  # result is a list
## $third
## [1] 21 22 23
y[["third"]]  # result is the specific element unwrapped
## [1] 21 22 23


Important

Labels do not have to be unique. When we have repeated names, the first matching element is extracted:

structure(1:3, names=c("a", "b", "a"))["a"]
## a
## 1


Unlike with negative integer indexers, there is no direct way to select all but the given names. For a workaround, see Section 5.4.1.
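One possible workaround can be sketched with a logical indexer built from the names attribute. The vector v and the variable drop are hypothetical, and we assume the base R %in% operator here; we do not claim this is the exact approach taken in the referenced section.

```r
v <- c(a=10, b=20, c=30, d=40)   # hypothetical named vector
drop <- c("b", "d")              # labels to exclude
kept <- v[!(names(v) %in% drop)] # TRUE for labels not listed in `drop`
print(kept)
# By contrast, v[-c("b", "d")] fails: unary minus is undefined for strings.
```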

Exercise 5.3

Rewrite the solution to the above spam-lovers exercise, assuming that we have the three features wrapped inside a list: (notice that Steve has now joined the group; hello, Steve):

(people <- list(
    Name=c("Graham", "John", "Terry", "Eric",  "Michael", "Terry", "Steve"),
    Food=c("bacon",  "spam", "spam",  "eggs",  "spam",    "beans", "spam"),
    Year=c( 1941,     1939,   1942,    1943,    1943,      1940,   NA_real_)
))
## $Name
## [1] "Graham"  "John"    "Terry"   "Eric"    "Michael" "Terry"   "Steve"
##
## $Food
## [1] "bacon" "spam"  "spam"  "eggs"  "spam"  "beans" "spam"
##
## $Year
## [1] 1941 1939 1942 1943 1943 1940   NA


Do not refer to name, food, and year directly. Instead, use the full people[["Name"]] etc. accessors. There is no need to pout: it is just a tiny bit of extra work.

## 5.3. Replacing elements

### 5.3.1. Modifying atomic vectors

There are also replacement versions of the above indexing schemes. They allow us to substitute some new content for the old one.

(x <- 1:12)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
x[length(x)] <- 42  # modify the last element
print(x)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 42


The principles of vectorisation, recycling rule, and implicit coercion are all in place:

x[c(TRUE, FALSE)] <- c("a", "b", "c")
print(x)
##  [1] "a"  "2"  "b"  "4"  "c"  "6"  "a"  "8"  "b"  "10" "c"  "42"


Long story long: first, to ensure that the new content can be poured into the old wineskin, R needed to convert the numeric vector to a character one; compare Section 4.1. Then, every second element therein, a total of six items, was replaced by a recycled version of the replacement sequence of length 3. Finally, the name “x” was rebound to such a brought-forth object and the previous one became forgotten.

Note

For more details on replacement functions in general, see Section 9.4.6. Such operations alter the state of the object they are called on (quite a rare behaviour in functional languages).

Exercise 5.4

Replace missing values in a given numeric vector with the arithmetic mean of well-defined observations therein.

### 5.3.2. Modifying lists

List contents can be altered as well. For modifying individual elements, the safest option is to use the replacement version of the [[ operator:

y <- list(a=1, b=1:2, c=1:3)
y[[1]] <- 100:110
y[["c"]] <- -y[["c"]]
print(y)
## $a
##  [1] 100 101 102 103 104 105 106 107 108 109 110
##
## $b
## [1] 1 2
##
## $c
## [1] -1 -2 -3


The replacement version of [ modifies a whole sub-list:

y[1:3] <- list(1, c("a", "b", "c"), c(TRUE, FALSE))
print(y)
## $a
## [1] 1
##
## $b
## [1] "a" "b" "c"
##
## $c
## [1]  TRUE FALSE


Moreover:

y[1] <- list(1:10)  # replace one element with one object
y[-1] <- 10:11      # replace two elements with two vectors of length 1
print(y)
## $a
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## $b
## [1] 10
##
## $c
## [1] 11


Note

Let idx be a vector of positive indexes of elements to be modified. Overall, calling “y[idx] <- z” behaves as if we wrote:

1. y[[idx[1]]] <- z[[1]],

2. y[[idx[2]]] <- z[[2]],

3. y[[idx[3]]] <- z[[3]],

and so forth.

Furthermore, z (but not idx) will be recycled if necessary, i.e., we take z[[(j-1) %% length(z) + 1]] for consecutive js from 1 to the length of idx.
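The element-by-element reading of “y[idx] <- z” can be checked with a small sketch. All object names below (y1, y2, idx, z) are hypothetical. The recycled position is computed as (j - 1) %% length(z) + 1 so that it always stays between 1 and length(z). We assume that z contains no NULLs, as assigning NULL via [[ would remove an element instead.

```r
y1 <- list(1, 2, 3, 4, 5)   # hypothetical list to be modified
y2 <- y1                    # a copy for the element-by-element version
idx <- c(1, 2, 4, 5)        # positions to overwrite
z <- list("a", "b")         # shorter than idx, hence recycled

y1[idx] <- z                # vectorised replacement

for (j in seq_along(idx))   # the same, one element at a time
    y2[[idx[j]]] <- z[[(j - 1) %% length(z) + 1]]

print(identical(y1, y2))
```

The length of idx is a multiple of the length of z here; otherwise, R would additionally emit a warning in the vectorised version.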

Exercise 5.5

Reflect on the results of the following expressions:

• y[1] <- c("a", "b", "c"),

• y[[1]] <- c("a", "b", "c"),

• y[[1]] <- list(c("a", "b", "c")),

• y[1:3] <- c("a", "b", "c"),

• y[1:3] <- list(c("a", "b", "c")),

• y[1:3] <- "a",

• y[1:3] <- list("a"),

• y[c(1, 2, 1)] <- c("a", "b", "c").

Important

Setting a list item to NULL removes it from the list completely.

y <- list(1, 2, 3, 4)
y[1] <- NULL        # removes the 1st element (i.e., 1)
y[[1]] <- NULL      # removes the 1st element (i.e., now 2)
y[1] <- list(NULL)  # sets the 1st element (i.e., now 3) to NULL
print(y)
## [[1]]
## NULL
##
## [[2]]
## [1] 4


The same notation convention is used for dropping object attributes; see Section 9.4.6.

### 5.3.3. Inserting new elements

New elements can be pushed at the end of the vector quite easily.

(x <- 1:5)
## [1] 1 2 3 4 5
x[length(x)+1] <- 6  # insert at the end
print(x)
## [1] 1 2 3 4 5 6
x[10] <- 10  # insert at the end but add more items
print(x)
##  [1]  1  2  3  4  5  6 NA NA NA 10


The elements to be inserted can be named as well:

x["a"] <- 11  # still inserts at the end
x["z"] <- 12
x["c"] <- 13
x["z"] <- 14  # z is already there; replace
print(x)
##                                a  z  c
##  1  2  3  4  5  6 NA NA NA 10 11 14 13


Note that x was not equipped with the names attribute before. The unlabelled elements were assigned blank labels (empty strings).

Note

It is not possible to insert new elements at the beginning or in the middle of a sequence, at least not with the index operator. By writing “x[3:4] <- 1:5” we do not replace two elements in the middle with five other ones. However, we can always use the c function to slice parts of the vector and intertwine them with some new content:

x <- seq(10, 100, 10)
x <- c(x[1:2], 1:5, x[5:7])
print(x)
##  [1] 10 20  1  2  3  4  5 50 60 70
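

As a side note, base R also offers the append function, which wraps this slice-and-concatenate pattern when we only wish to insert (not replace) items. Below is a minimal sketch on a hypothetical vector v.

```r
v <- seq(10, 100, 10)                  # hypothetical example vector
ins <- append(v, 1:5, after=2)         # insert 1:5 after the 2nd element
pre <- append(v, 0, after=0)           # insert at the very beginning
print(ins)
print(pre)
# The first call corresponds to the manual version:
print(identical(ins, c(v[1:2], 1:5, v[3:10])))
```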


## 5.5. Preserving and losing attributes

As attributes are conceived as extra data, it is up to a function’s authors what they will decide to do with them. Generally, it is safe to assume that much thought has been put into the design of base R functions. Oftentimes, they behave quite reasonably. This is why we are going to spend some time now exploring their approaches to the handling of attributes.

Namely, for functions and operators that aim at transforming vectors passed as their inputs, the assumed strategy may be to:

• ignore the input attributes completely,

• equip the output object with the same set of attributes, or

• take care only of some special attributes, such as names, if that makes sense.

Below we explore some common patterns; see also Section 1.3 of .

### 5.5.1. c

First, c drops all attributes except names:

(x <- structure(1:4, names=c("a", "b", "c", "d"), attrib1="<3"))
## a b c d
## 1 2 3 4
## attr(,"attrib1")
## [1] "<3"
c(x)  # only names are preserved
## a b c d
## 1 2 3 4


We can therefore end up calling this function chiefly for this nice side effect. Also, recall that unname drops the labels.

unname(x)
## [1] 1 2 3 4
## attr(,"attrib1")
## [1] "<3"


### 5.5.2. as.something

as.vector, as.numeric, and similar functions drop all attributes in the case where the output is an atomic vector, but they might not necessarily do so in other cases (because they are S3 generics; see Chapter 10).

as.vector(x)  # drops all attributes if x is atomic
## [1] 1 2 3 4
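

To illustrate the “other cases”, the sketch below compares as.vector on an atomic vector and on a list; the objects v and l are hypothetical. As far as we can tell from help("as.vector"), attributes are stripped only when the result is atomic.

```r
v <- structure(1:2, names=c("a", "b"), attrib1="<3")            # atomic
l <- structure(list(1, "two"), names=c("a", "b"), attrib1="<3") # list
attributes(as.vector(v))   # NULL: everything was dropped
attributes(as.vector(l))   # names and attrib1 are still there
```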


### 5.5.3. Subsetting

Subsetting with [ (except where the indexer is not given) drops all attributes but names (as well as dim and dimnames; see Chapter 11), which are adjusted accordingly:

x[1]    # subset of labels
## a
## 1
x[[1]]  # this always drops the labels
## [1] 1


The replacement version of the index operator can be used to modify the values in an existing vector whilst preserving all the attributes. In particular, skipping the indexer will allow us to replace all the elements:

y <- x
y[] <- c("u", "v")  # note that c("u", "v") has no attributes at all
print(y)
##   a   b   c   d
## "u" "v" "u" "v"
## attr(,"attrib1")
## [1] "<3"


### 5.5.4. Vectorised functions

Vectorised unary functions tend to copy all the attributes.

round(x)
## a b c d
## 1 2 3 4
## attr(,"attrib1")
## [1] "<3"


Binary operations are expected to get the attributes from the longer input. If they are of equal sizes, the first argument is preferred to the second.

y <- structure(c(1, 10), names=c("f", "g"), attrib1=":|", attrib2=":O")
y * x  # x is longer
##  a  b  c  d
##  1 20  3 40
## attr(,"attrib1")
## [1] "<3"
y[c("h", "i")] <- c(100, 1000)  # add two new elements at the end
y * x
##    f    g    h    i
##    1   20  300 4000
## attr(,"attrib1")
## [1] ":|"
## attr(,"attrib2")
## [1] ":O"
x * y
##    a    b    c    d
##    1   20  300 4000
## attr(,"attrib1")
## [1] "<3"
## attr(,"attrib2")
## [1] ":O"


Also, refer to Section 9.4.6 for a way to copy all the attributes from one object to another.
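
For the impatient, one way to do so can be sketched with the replacement version of the attributes function. The objects src and dst are hypothetical, and we do not claim this is the exact technique presented in the referenced section. Both vectors are assumed to be of the same length so that the names fit.

```r
src <- structure(1:3, names=c("a", "b", "c"), attrib1="<3")  # donor
dst <- c(10, 20, 30)                 # a plain vector without attributes
attributes(dst) <- attributes(src)   # copy names and attrib1 in one go
print(dst)
```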

Important

Even in base R, the above rules are not enforced strictly. We consider them inconsistencies that should be, for the time being, treated as features (with which we need to learn to live as they have not been fixed for years, but hope springs eternal).

As far as third-party extension packages are concerned, suffice it to say that a lot of R programmers do not know what attributes are at all! It is always best to refer to the documentation, perform some experiments, and/or manually ensure the preservation of the data we care about.

Exercise 5.16

Check what attributes are preserved by ifelse.

## 5.6. Exercises

Exercise 5.17

Answer the following questions (contemplate first, then use R to find the answer):

• What is the result of “x[c()]”? Is it the same as “x[]”?

• Is “x[c(1, 1, 1)]” equivalent to “x[1]”?

• Is “x[1]” equivalent to “x["1"]”?

• Is “x[c(-1, -1, -1)]” equivalent to “x[-1]”?

• What does “x[c(0, 1, 2, NA)]” do?

• What does “x[0]” return?

• What does “x[1, 2, 3]” do?

• What about “x[c(0, -1, -2)]” and “x[c(-1, -2, NA)]”?

• Why is “x[NA]” so significantly different from “x[c(1, NA)]”?

• What is “x[c(FALSE, TRUE, 2)]”?

• What will we obtain by calling “x[x<min(x)]”?

• What about “x[length(x)+1]”?

• Why is “x[min(y)]” probably a mistake? What could it mean? How can it be fixed?

• Why can we not mix indexes of different types and write “x[c(1, "b", "c", 4)]”? Or can we?

• Why would we call “as.vector(na.omit(x))” instead of just na.omit(x)?

• What is the difference between sort and order?

• What is the type and the length of the object returned by a call to “split(a, u)”? What about “split(a, c(u, v))”?

• How to get rid of the seventh element from a list of ten elements?

• How to get rid of the seventh, eighth, and ninth elements from a list with ten elements?

• How to get rid of the seventh element from an atomic vector of ten elements?

• If y is a list, by how many elements will “y[c(length(y)+1, length(y)+1, length(y)+1)] <- list(1, 2, 3)” extend it?

• What is the difference between “x[x>0]” and “x[which(x>0)]”?

Exercise 5.18

If x is an atomic vector of length n, “x[5:n]” obviously extracts everything from the fifth element to the end. Does it, though? Check what happens when x is of length less than five, including 0. List different ways to correct this expression so that it makes (some) sense in the case of shorter vectors.

Exercise 5.19

Similarly, “x[length(x) + 1 - 5:1]” is supposed to return the last five elements in x. Propose a few alternatives that are correct also for short xs.

Exercise 5.20

Given a numeric vector, fetch its five largest elements. Ensure the code works for vectors of length less than five.

Exercise 5.21

We can compute a trimmed mean of some x by setting the trim argument of the mean function. Compute a similar robust estimator of location – the $$p$$-winsorised mean, $$p\in[0, 0.5]$$, defined as the arithmetic mean of all elements in x clipped to the $$[Q_p, Q_{1-p}]$$ interval, where $$Q_p$$ is the vector’s $$p$$-quantile; see quantile. For example, if x is (8, 5, 2, 9, 7, 4, 6, 1, 3), we have $$Q_{0.25}=3$$ and $$Q_{0.75}=7$$ and hence the $$0.25$$-winsorised mean will be equal to the arithmetic mean of (7, 5, 3, 7, 7, 4, 6, 3, 3).

Exercise 5.22

Let x and y be two vectors of the same length, $$n$$, and no ties. Compute the Spearman rank correlation coefficient given by:

$\varrho(\mathbf{x},\mathbf{y}) = 1-\frac{6 \sum_{i=1}^n d_i^2}{n (n^2-1)},$

where $$d_i=r_i-s_i$$, $$i=1,\dots,n$$, and $$r_i$$ and $$s_i$$ denote the rank of $$x_i$$ and $$y_i$$, respectively. See also the built-in cor.

Exercise 5.23

(*) Given two vectors x and y of the same length $$n$$, a call to approx(x, y, ...) can be used to interpolate linearly between the points $$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$$. We can use it whenever we wish to generate new $$y$$s for previously unobserved $$x$$s (somewhere “in-between” the data we already have). Moreover, spline(x, y, ...) can perform a cubic spline interpolation, which is smoother; see Figure 5.2.

x <- c(1, 3,  5, 7, 10)
y <- c(1, 15, 25, 6,  0)
x_new <- seq(1, 10, by=0.25)
y_new1 <- approx(x, y, xout=x_new)[["y"]]
y_new2 <- spline(x, y, xout=x_new)[["y"]]
plot(x, y, ylim=c(-10, 30))  # the points to interpolate between
lines(x_new, y_new1, col="darkred", lty=2)  # linear interpolation
lines(x_new, y_new2, col="navy", lty=4)  # cubic interpolation
legend("topright", legend=c("linear", "cubic"),
    lty=c(2, 4), col=c("darkred", "navy"), bg="white")


Using a call to one of the above, impute missing data in euraud-20200101-20200630.csv, e.g., the blanks in (0.60, 0.62, NA, 0.64, NA, NA, 0.58) should be filled so as to obtain (0.60, 0.62, 0.63, 0.64, 0.62, 0.60, 0.58).

Exercise 5.24

Given some $$1 \le \mathit{from} \le \mathit{to} \le n$$, use findInterval to generate a logical vector of length n with TRUE elements only at indexes between from and to, inclusive.

Exercise 5.25

Implement expressions that give rise to the same results as calls to which, which.min, which.max, and rev functions. What is the difference between x[x>y] and x[which(x>y)]? What about which.min(x) vs which(x == min(x))?

Exercise 5.26

Given two equal-length vectors x and y, fetch the value from the former that corresponds to the smallest value in the latter. Write three versions of such an expression, each dealing with potential ties in y differently, for example:

x <- c("a", "b", "c", "d", "e", "f")
y <- c(  3,   1,   2,   1,   1,   4)


should choose either the first ("b"), last ("e"), or random ("b", "d", "e" with equal probability) element from x fulfilling the above property. Make sure your code works for x being of the type character or numeric as well as an empty vector.

Exercise 5.27

Implement an expression that yields the same result as duplicated(x) for a numeric vector x, but using diff and order.

Exercise 5.28

Based on match and unique, implement your versions of union(x, y), intersect(x, y), setdiff(x, y), is.element(x, y), and setequal(x, y) for x and y being nonempty numeric vectors.