# 6. Character Vectors

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of Chapters 1–12 are already complete, but there will be more. In the meantime, any bug/typos reports/fixes are appreciated. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [20].

Text is a universal, portable, economic, and efficient means of interacting between humans and computers as well as exchanging data between programs or APIs. This book is 99% made of text. And, wow, how much useful knowledge is in it, innit?

## 6.1. Creating Character Vectors

### 6.1.1. Inputting Individual Strings

Specific character strings are delimited either by a pair of double quotes or a pair of single quotes (apostrophes).

"a string"
## [1] "a string"
'another string'  # and of course neither 'like this" nor "like this'
## [1] "another string"

The only difference between these two lies in the fact that we cannot directly include, e.g., an apostrophe in a single quote-delimited string. On the other hand, "'tis good ol' spam" and 'I "love" bacon' are both okay.

However, we may always use escape sequences to include characters whose direct input might otherwise be difficult or impossible.

R uses the backslash, “\”, as the escape character, in particular:

• \" inputs the double quote character,

• \' – single quote,

• \\ – backslash,

• \n – new line.

(x <- "I \"love\" bacon\n\\\"/")
## [1] "I \"love\" bacon\n\\\"/"

The print function (which was implicitly called to display the above object) does not reveal the special meaning of the escape sequences. Rather, print outputs strings in the very way which we ourselves would follow when inputting them. The number of characters in x is 18, and not 23:

nchar(x)
## [1] 18

To display the string as-it-really-is, we call:

cat(x)
## I "love" bacon
## \"/

Raw character constants, where the backslash character’s special meaning is disabled, can be entered using the notation like r"(...)", r"{...}", r"[...]", r"----(...)----", etc.; see help("Quotes"). These can be useful when inputting regular expressions (see below).

x <- r"(spam\n\\\"maps)"
print(x)
## [1] "spam\\n\\\\\\\"maps"
cat(x)
## spam\n\\\"maps
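
As a quick sanity check, here is a small sketch comparing an ordinary string (with doubled backslashes) against a raw constant. The Windows-style path is made up for illustration, and raw constants assume R 4.0 or newer.

```r
# A hypothetical path, entered in two equivalent ways:
escaped <- "C:\\Users\\spam\\new.txt"  # each backslash must be doubled
raw_str <- r"(C:\Users\spam\new.txt)"  # raw constant: backslashes are literal

identical(escaped, raw_str)  # both denote the very same string
## [1] TRUE
nchar(escaped)               # actual characters, not keystrokes
## [1] 21
```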

… and of course the string version of the missing value marker is “NA_character_”.

Note

(*) Some output devices may support the following codes that control the position of the caret (text cursor):

• \b – backspace (move cursor one column to the left),

• \t – tab (advance to the next tab stop, e.g., a multiple of 8),

• \r – carriage return (move to the beginning of the current line).

cat("abc\bd\tef\rg\nhij")
## gbd     ef
## hij

These can be used on unbuffered outputs (e.g., stderr; see Section 8.3.5) to display the status of the current operation (a simple “animated” progress bar, the print-out of the ETA, or the % completed).

Further, certain terminals can also understand the ECMA-48/ANSI-X3.64 escape sequences of the form “\u001b[...” to further control the cursor’s position, text colour, and even style. For example, “\u001b[1;31m” outputs red bold text and “\u001b[0m” resets the settings to default. Give, e.g., “cat("\u001b[1;31mspam\u001b[0m")” or “cat("\u001b[5;36m\u001b[Abacon\u001b[Espam\u001b[0m")” a try.

Note

(*) The Unicode standard 15.0 (version dated September 2022) defines 149,186 characters, i.a., letters from different scripts, mathematical symbols, and emojis. Each of them is assigned a unique numeric identifier; see the Unicode Character Code Charts. For example, the inverted exclamation mark (see the Latin-1 Supplement section therein) has been mapped to hexadecimal code 0xA1 (or 161 decimally). Knowing this magic number allows us to specify a Unicode code point using one of the following escape sequences:

• \uxxxx – codes using four hexadecimal digits,

• \Uxxxxxxxx – codes using eight hexadecimal digits.

For instance:

cat("!\u00a1!\U000000a1!")
## !¡!¡!
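
The mapping between characters and their numeric identifiers can also be inspected from within R. The sketch below relies on the base functions utf8ToInt and intToUtf8, which are not discussed in this chapter; see their help pages for details.

```r
cp <- utf8ToInt("\u00a1")          # code point of the inverted exclamation mark
cp
## [1] 161
s <- intToUtf8(c(0xa1, 0x2665))    # from code points back to a single string
nchar(s)                           # two code points...
## [1] 2
nchar("\u00a1", type="bytes")      # ...but 0xA1 takes two bytes in UTF-8
## [1] 2
```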

All R installations allow for working with Unicode strings (more precisely, UTF-8) – a super-encoding which is native to most Unix-like boxes (including GNU/Linux and m**OS). Other operating systems may use some 8-bit encoding as the system one (e.g., latin1 or cp1252), but they can be mixed with Unicode seamlessly. See help("Encoding"), help("iconv"), and [21] for discussion.

Nevertheless, certain output devices (web browsers, LaTeX renderers, text terminals) might be unable to display each and every Unicode character, e.g., due to some fonts missing. As far as the processing of character data is concerned, though, this does not matter: R does it with its eyes closed.

For example, in the PDF version of this joyful book, none of the following Unicode glyphs are displayed properly, because yours cordially did not care about installing appropriate fonts in his XeLaTeX distribution. However, its HTML variant, which is generated from exactly the same source files as the former, will likely be rendered by the kind reader’s web browser as intended.

cat("\U0001f642\u2665\u0bb8\U0001f923\U0001f60d\u2307")
## 🙂♥ஸ🤣😍⌇

### 6.1.2. Many Strings, One Object

Less trivial character vectors (meaning, of length greater than one) can be constructed by means of, e.g., c or rep[1].

(x <- c(rep("spam", 3), "bacon", NA_character_, "spam"))
## [1] "spam"  "spam"  "spam"  "bacon" NA      "spam"

Thus, a character vector is in fact a sequence of sequences of characters. The total number of strings can be fetched, as usual, with the length function. However, the length of each individual string may be read via the vectorised nchar.

length(x)  # how many strings?
## [1] 6
nchar(x)   # the number of code points in each string
## [1]  4  4  4  5 NA  4

### 6.1.3. Concatenating Character Vectors

paste can be used to concatenate (join) the corresponding elements of two or more character vectors:

paste(c("a", "b", "c"), c("1", "2", "3"))  # sep=" " by default
## [1] "a 1" "b 2" "c 3"
paste(c("a", "b", "c"), c("1", "2", "3"), sep="")  # no separator
## [1] "a1" "b2" "c3"

The function is deeply vectorised:

paste(c("a", "b", "c"), 1:6, c("!", "?"))  # implicit coercion of numbers
## [1] "a 1 !" "b 2 ?" "c 3 !" "a 4 ?" "b 5 !" "c 6 ?"

We can also collapse (flatten, aggregate) a sequence of strings into a single string:

paste(c("a", "b", "c", "d"), collapse=",")
## [1] "a,b,c,d"
paste(c("a", "b", "c", "d"), 1:2, sep="", collapse="")
## [1] "a1b2c1d2"
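
Elementwise joining and collapsing are often combined. Below is a minimal sketch that assembles a single line of text from a vector of items; the column and table names are purely hypothetical, and paste0 is the base R shorthand for paste(..., sep="").

```r
cols <- c("id", "name", "price")  # hypothetical column names
query <- paste0("SELECT ", paste(cols, collapse=", "), " FROM items")
query
## [1] "SELECT id, name, price FROM items"
```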

Unfortunately (perhaps for the so-called convenience), paste does not treat missing values the way most other vectorised elementwise functions do; instead of propagating them, it converts them to the string "NA":

paste(c("A", NA_character_, "B"), "!", sep="")
## [1] "A!"  "NA!" "B!"

### 6.1.4. Formatting Objects

Strings can also come into being by turning other R objects into text. For example, the quite customisable (see Chapter 10) format can be used for pretty-printing data in dynamically generated reports.

x <- c(123456.789, -pi, NaN)
format(x)
## [1] "123456.7890" "    -3.1416" "        NaN"
cat(format(x, digits=8, scientific=FALSE, drop0trailing=TRUE), sep="\n")
## 123456.789
##     -3.1415927
##            NaN

Moreover, sprintf is a workhorse for turning possibly many atomic vectors to strings. The numbers’ precision, strings’ widths and justification, etc., can be fully controlled. Its first argument is a format string; special escape sequences starting with percent sign, “%”, serve as placeholders for the actual values. For instance, “%s” is meant to be replaced with a corresponding string and “%f” with a floating point value. Additional options are available, e.g., “%10.2f” is a number that, when converted to text, will occupy ten text columns[2], with two decimal digits of precision. Also, e.g., “%1$s”, “%2$s”, … will insert the 1st, 2nd, … argument as text.

sprintf("%.5f", pi)
## [1] "3.14159"
sprintf("%s%s", "a", c("X", "Y", "Z"))  # like paste(...)
## [1] "aX" "aY" "aZ"
sprintf("key=%s, value=%.1f", c("spam", "eggs"), c(100000, 0))
## [1] "key=spam, value=100000.0" "key=eggs, value=0.0"
sprintf("%.*f", 1:5, pi)  # variable precision
## [1] "3.1"     "3.14"    "3.142"   "3.1416"  "3.14159"
sprintf("%1$s, %2$s, %1$s, and %1$s", "spam", "bacon")  # numbered argument
## [1] "spam, bacon, spam, and spam"

See help("sprintf") for more details. I recommend. Marek Gagolewski.
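
Width and justification specifiers are handy when lining up columns in a plain-text report. The following sketch uses made-up data: "%-10s" left-justifies a string in ten columns, whereas "%8.2f" right-justifies a number in eight.

```r
item  <- c("spam", "bacon")   # made-up data
price <- c(3.14159, 1234.5)
rows <- sprintf("%-10s|%8.2f", item, price)
cat(rows, sep="\n")
## spam      |    3.14
## bacon     | 1234.50
```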

### 6.1.5. Reading Text Data from Files

Given a raw text file, readLines can load it into memory so that it is represented as a character vector, with each line stored in a separate string.

head(readLines("…"))
## [1] "# [Marek](https://www.gagolewski.com)'s Teaching and Training Data"
## [2] ""
## [3] "> *See the comment lines within the files themselves for"
## [4] "> a detailed description of each dataset.*"
## [5] ""
## [6] "*Good* datasets are actually hard to find!"

writeLines is its counterpart. There is also an option to read or write parts of files at a time, which we mention in Section 8.3.5. Also, cat(..., append=TRUE) can be used to create a text file incrementally.
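
A minimal round trip might look as follows. The sketch writes to a temporary file (created with tempfile, so that nothing important gets overwritten), appends one more line with cat, and reads everything back.

```r
f <- tempfile(fileext=".txt")              # a throwaway file name
writeLines(c("spam", "bacon", "eggs"), f)  # one string per line
cat("more spam\n", file=f, append=TRUE)    # add a line incrementally
lines_read <- readLines(f)
unlink(f)                                  # clean up
lines_read
## [1] "spam"      "bacon"     "eggs"      "more spam"
```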

## 6.2. Pattern Searching

### 6.2.1. Comparing Whole Strings

We have already reviewed a couple of ways to compare strings as a whole. For instance, the == operator implements elementwise testing:

c("spam", "spam", "bacon", "eggs") == c("spam", "eggs")  # recycling rule
## [1]  TRUE FALSE FALSE  TRUE

Moreover, in Section 5.4.1, we have introduced the match function and its derivative, the %in% operator, which are vectorised in a different way:

match(c("spam", "spam", "bacon", "eggs"), c("spam", "eggs"))
## [1]  1  1 NA  2
c("spam", "spam", "bacon", "eggs") %in% c("spam", "eggs")
## [1]  TRUE  TRUE FALSE  TRUE
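
To see how the two are related, we can express the membership test in terms of match, as in the sketch below. The indexes returned by match can also be used to look up the matched values.

```r
x <- c("spam", "spam", "bacon", "eggs")
y <- c("spam", "eggs")
found <- !is.na(match(x, y))  # gives the same result as x %in% y
found
## [1]  TRUE  TRUE FALSE  TRUE
y[match(x, y)]                # NA where there is no match
## [1] "spam" "spam" NA     "eggs"
```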

Note

We should stress that these are simple, bytewise comparisons of the corresponding code points and as such they might not be valid in, for example, natural language processing activities; compare [13]. In particular, the German word groß is not deemed equal to gross, although we expect that should be the case, at least in a German language setting. Moreover, in the rare situations where we read Unicode-unnormalised data (say, not in the NFC form; see [12]), canonically equivalent strings may be considered as different.

### 6.2.2. Partial Matching

When only a consideration of the initial part of each string is required, we can call:

startsWith(c("s", "spam", "spamtastic", "spontaneous", "spoon"), "spam")
## [1] FALSE  TRUE  TRUE FALSE FALSE

Both the above and endsWith are applied elementwisely in case of many search prefixes/suffixes, just like in ==.

Partial matching of strings can be performed with charmatch. This is an each-vs-all version of startsWith:

charmatch(c("s", "sp", "spam", "spams", "eggs", "bacon"), c("spam", "eggs"))
## [1]  1  1  1 NA  2 NA
charmatch(c("s", "sp", "spam", "spoo", "spoof"), c("spam", "spoon"))
## [1]  0  0  1  2 NA

Note that 0 designates that there was an ambiguity in the matching of a string to a given table.

Note

(*) In sec:to-do, we discuss the more-advanced match.arg which is (unfortunately) frequently called from within other R functions, and in Section 9.4.2 and sec:to-do, we mention the (discouraged) partial matching of list labels and argument names in function calls.

### 6.2.3. Matching Anywhere Within a String

Fixed patterns can also be searched for anywhere within character strings using grepl:

x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
grepl("spam", x, fixed=TRUE)  # fixed patterns, as opposed to regexes below
## [1]  TRUE  TRUE FALSE FALSE

Important

Note that the order of arguments is like grepl(needle, haystack), not the other way around. Also, this function is not vectorised with respect to the first argument.
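
Since only one pattern is considered at a time, searching for any of a few needles requires separate calls whose results are then combined. One possible approach, sketched here for two fixed patterns, is to join the logical vectors with the | operator:

```r
x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
either <- grepl("spam", x, fixed=TRUE) | grepl("SPAM", x, fixed=TRUE)
either
## [1]  TRUE  TRUE  TRUE FALSE
```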

Exercise 6.1

Determine how calls to grep(y, x, value=FALSE) and grep(y, x, value=TRUE) can be implemented based on grepl and other operations that we are already familiar with.

Note

As a curiosity, agrepl performs approximate matching based on Levenshtein’s edit distance. This can account for a small number of “typos”.

agrepl("spam", x)
## [1]  TRUE  TRUE FALSE  TRUE
agrepl("ham", x, ignore.case=TRUE)
## [1] TRUE TRUE TRUE TRUE

### 6.2.4. Using Regular Expressions (*)

Setting perl=TRUE allows for identifying occurrences of patterns specified by the PCRE2 regular expressions (regexes).

grepl("^spam", x, perl=TRUE)  # strings that begin with spam
## [1]  TRUE FALSE FALSE FALSE
grepl("(?i)^spam|spam$", x, perl=TRUE)  # begin or end; case ignored
## [1]  TRUE  TRUE  TRUE FALSE

Note

For more details on regular expressions in general, see, e.g., [18]. The ultimate reference for PCRE2 pattern syntax is the man page pcre2pattern(3).

R also gives access to the ERE-like TRE library (see help("regex")), which is the default one. However, we discourage its use, because it is feature-poorer.

Exercise 6.2

The list.files function generates the list of file names in a given directory that match a given regular expression. For instance, the following gives all CSV files in some directory.

list.files("../../Projects/teaching-data/r/", r"(\.csv$)")  # or "\\.csv$"
## [1] "air_quality_1973.csv" "anscombe.csv"         "iris.csv"
## [4] "titanic.csv"          "tooth_growth.csv"     "trees.csv"
## [7] "world_phones.csv"

Write a single regular expression that matches file names ending with “.csv” or “.csv.gz”. Also, write a regex that matches CSV files whose names do not begin with “eurusd”.

### 6.2.5. Locating Pattern Occurrences

regexpr finds the first occurrence of a pattern in each string:

regexpr("spam", x, fixed=TRUE)
## [1]  1  3 -1 -1
## attr(,"match.length")
## [1]  4  4 -1 -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

In particular, there is a pattern occurrence starting at the 3rd code point of the 2nd string in x. Moreover, there is no pattern match in the last two strings (denoted with -1).

The match.length attribute is generally more worthwhile when searching with regular expressions.

To locate all the matches, i.e., globally, we use gregexpr:

# spam followed by 0 or more letters, case insensitively
gregexpr("(?i)spam\\p{L}*", x, perl=TRUE)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1]  3 12
## attr(,"match.length")
## [1] 8 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] 7
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[4]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

As we have noted in Section 4.4.2, wrapping the results in a list was a clever choice as the number of matches can obviously vary between strings.

In Section 7.2, we will take a look at the Map function, which, along with substring introduced below, can aid in getting the most out of such data. Meanwhile, let us just mention that regmatches extracts the matching substrings:

regmatches(x, gregexpr("(?i)spam\\p{L}*", x, perl=TRUE))
## [[1]]
## [1] "spam"
##
## [[2]]
## [1] "spammite" "spam"
##
## [[3]]
## [1] "SPAM"
##
## [[4]]
## character(0)
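
For comparison, here is what regmatches yields when it is fed with the output of regexpr (first matches only). Note that the result is then a plain character vector and that strings with no match are simply omitted from it.

```r
x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
first <- regmatches(x, regexpr("(?i)spam\\p{L}*", x, perl=TRUE))
first  # only three items: nothing was found in "sram"
## [1] "spam"     "spammite" "SPAM"
```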

Note

(*) Let us consider what happens when a regular expression contains parenthesised subexpressions (capture groups).

r <- "(?<basename>[^. ]+)\\.(?<extension>[^ ]*)"

The above regex consists of two such parts. The first one is labelled “basename” and is comprised of a number of arbitrary characters except for the space and the dot. The second group, named “extension”, is a substring of anything but the space. These two are separated by a dot.

Such a pattern can be used for unpacking space-delimited lists of file names.

z <- "dataset.csv.gz something_else.txt spam"
regexpr(r, z, perl=TRUE)
## [1] 1
## attr(,"match.length")
## [1] 14
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
##      basename extension
## [1,]        1         9
## attr(,"capture.length")
##      basename extension
## [1,]        7         6
## attr(,"capture.names")
## [1] "basename"  "extension"
gregexpr(r, z, perl=TRUE)
## [[1]]
## [1]  1 16
## attr(,"match.length")
## [1] 14 18
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
##      basename extension
## [1,]        1         9
## [2,]       16        31
## attr(,"capture.length")
##      basename extension
## [1,]        7         6
## [2,]       14         3
## attr(,"capture.names")
## [1] "basename"  "extension"

The capture.* attributes give us access to the matches to the individual capture groups, i.e., the basename and the extension.
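
One way to turn these attributes into actual substrings is sketched below. It assumes a single input string and uses substring (discussed later in this chapter), with the end positions computed as start + length - 1.

```r
r <- "(?<basename>[^. ]+)\\.(?<extension>[^ ]*)"
z <- "dataset.csv.gz something_else.txt spam"
m <- gregexpr(r, z, perl=TRUE)[[1]]  # matches within the 1st (and only) string
start <- attr(m, "capture.start")    # a matrix: one row per match
len   <- attr(m, "capture.length")
base <- substring(z, start[, "basename"],
    start[, "basename"] + len[, "basename"] - 1)
ext <- substring(z, start[, "extension"],
    start[, "extension"] + len[, "extension"] - 1)
base
## [1] "dataset"        "something_else"
ext
## [1] "csv.gz" "txt"
```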

Exercise 6.3

(*) Check out the difference between the results generated by regexec and regexpr as well as gregexec and gregexpr.

### 6.2.6. Replacing Pattern Occurrences

sub and gsub replace, respectively, the first and all matches to a pattern:

x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
sub("spam", "ham", x, fixed=TRUE)
## [1] "ham"            "y hammite spam" "yummy SPAM"     "sram"
gsub("spam", "ham", x, fixed=TRUE)
## [1] "ham"           "y hammite ham" "yummy SPAM"    "sram"

Note

(*) If a regex features some capture groups, matches thereto can be mentioned not only in the pattern itself, but also in the replacement string:

gsub("(\\p{L})\\p{L}\\1", "\\1", "aha egg gag NaN spam", perl=TRUE)
## [1] "a egg g N spam"

The above matches a letter (it is a capture group), another letter, and the former letter again. Each such palindrome of length 3 is replaced with just the repeated letter.
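
Back-references in the replacement string are also convenient for reordering parts of a match. As an illustration (the input strings are made up), the sketch below rewrites dates from the year-month-day to the day/month/year format; strings that do not match are left intact.

```r
dates <- c("2023-01-31", "no date")  # made-up inputs
out <- sub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", dates, perl=TRUE)
out
## [1] "31/01/2023" "no date"
```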

Exercise 6.4

(*) Display the source code of glob2rx by calling print(glob2rx) and study how this function converts wildcards such as file???.* or *.csv to regular expressions that can be passed to, e.g., list.files.

### 6.2.7. Splitting Strings into Tokens

strsplit divides each string in a character vector into chunks. This time, though, the search pattern, specifying the token delimiter, is given as the second argument:

strsplit(c("spam;spam;eggs;;bacon", "spam"), ";", fixed=TRUE)
## [[1]]
## [1] "spam"  "spam"  "eggs"  ""      "bacon"
##
## [[2]]
## [1] "spam"
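
The delimiter does not have to be a fixed string. In the sketch below, tokens are separated by a comma or a semicolon, optionally surrounded by whitespace, which we express with a regular expression (perl=TRUE):

```r
tokens <- strsplit("spam, eggs;bacon ,  ham", "\\s*[,;]\\s*", perl=TRUE)
tokens
## [[1]]
## [1] "spam"  "eggs"  "bacon" "ham"
```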

## 6.3. Other String Operations

### 6.3.1. Extracting Substrings

substring extracts parts of strings between given character position ranges.

substring("spammity spam", 1, 4)  # from 1st to 4th character
## [1] "spam"
substring("spammity spam", 10)  # from 10th to end
## [1] "spam"
substring("spammity spam", c(1, 10), c(4, 14))  # vectorisation
## [1] "spam" "spam"
substring(c("spammity spam", "bacon and eggs"), 1, c(4, 5))
## [1] "spam"  "bacon"
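
Thanks to vectorisation, a single string can be decomposed into individual characters or into all of its prefixes. A small sketch (seq_len(n) generates the sequence 1, 2, ..., n):

```r
s <- "spam"
i <- seq_len(nchar(s))
chars <- substring(s, i, i)     # i-th to i-th character
chars
## [1] "s" "p" "a" "m"
prefixes <- substring(s, 1, i)  # 1st to i-th character
prefixes
## [1] "s"    "sp"   "spa"  "spam"
```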

Note

There is also a replacement (compare Section 9.4.6) version of the above:

x <- "spam, spam, bacon, and spam"
substring(x, 7, 11) <- "eggs"
print(x)
## [1] "spam, eggs, bacon, and spam"

Unfortunately, the number of characters in the replacement string should not exceed the length of the part being substituted (try “chickpeas” instead of “eggs”). However, substring replacement can be written as a composition of substring extraction and concatenation:

paste(substring(x, 1, 6), "chickpeas", substring(x, 11), sep="")
## [1] "spam, chickpeas, bacon, and spam"

Exercise 6.5

Take the output generated by regexpr and apply substring to extract the pattern occurrences. If there is no match in some string, the corresponding output should be NA.

### 6.3.2. Translating Characters

tolower and toupper can be used to convert between lower and upper case:

toupper("spam")
## [1] "SPAM"

Note

Just like many other string operations in base R, these functions perform very simple character substitutions and they might not be valid in natural language processing tasks. For instance, groß is not converted to GROSS, which is the correct case folding in German.

Moreover, chartr translates individual characters:

chartr("\\", "/", "c:\\windows\\system\\cmd.exe")  # chartr(old, new, x)
## [1] "c:/windows/system/cmd.exe"
chartr("([S", ")]*", ":( :S :[")
## [1] ":) :* :]"

In the first line, we replace each backslash with a slash. The second example replaces “(”, “[”, and “S” with “)”, “]”, and “*”, respectively.
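
As a slightly longer illustration, chartr can implement the ROT13 substitution cipher (a technique not discussed in this book, used here merely as an example), where each English letter is replaced with the one 13 positions further in the alphabet. The translation tables are built from the built-in letters and LETTERS vectors.

```r
old <- paste(c(letters, LETTERS), collapse="")
new <- paste(c(letters[c(14:26, 1:13)], LETTERS[c(14:26, 1:13)]), collapse="")
y <- chartr(old, new, "Spam and Eggs")
y
## [1] "Fcnz naq Rttf"
chartr(old, new, y)  # applying it again restores the input
## [1] "Spam and Eggs"
```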

### 6.3.3. Ordering Strings

We have previously mentioned that operators such as < and >= as well as functions like sort, order, rank, but also xtfrm[3] are based on the lexicographic ordering of strings.

It is worth noting that the ordering is dependent on the currently selected locale, see Sys.getlocale("LC_COLLATE"). For instance, in the Slovak language setting, we would obtain "hardy" < "hladný" < "chladný" < "chłodny".
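
When a locale-independent result is needed, one option is sort with method="radix", which, according to help("sort"), always compares strings as in the C locale, i.e., code point by code point. A small sketch:

```r
x <- c("b", "A", "a", "B")
s <- sort(x, method="radix")  # C locale: all upper case letters come first
s
## [1] "A" "B" "a" "b"
sort(x)  # locale-dependent; the result may differ between systems
```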

Note

Many “structured” data items can be displayed or transmitted as human-readable strings. In particular, we know that as.numeric can be used to convert a string to a number. Moreover, in Section 10.3.1 we will discuss date-time objects such as "1970-01-01 00:00:00 GMT". We will be processing them with specialised functions such as strptime and strftime.

Important

(*) As we have mentioned, many string operations in base R are not necessarily portable. The stringx package [22] defines drop-in, “fixed” replacements therefor. They are based on the International Components for Unicode (ICU) library, which is a de facto standard for the processing of Unicode text, and the R package stringi; see [21].

library("stringx")  # call install.packages("stringx") first
toupper("gro\u00DF")  # compare base::toupper("gro\u00DF")
## [1] "GROSS"

## 6.4. Other Atomic Vector Types (*)

We have discussed four vector types: logical, double, character, and list (the latter being a generic-recursive vector). To get the complete picture of the sequence-like types in R, let us briefly mention integer, complex, and raw atomic types, so that we are not surprised when we encounter them.

### 6.4.1. Integer Vectors (*)

Integer scalars can be input manually by using the L suffix:

(x <- c(1L, 2L, -1L, NA_integer_))  # looks like numeric
## [1]  1  2 -1 NA
typeof(x)  # but is integer
## [1] "integer"

Some functions return them in certain contexts[4]:

typeof(1:10)  # seq(1, 10) as well, but not seq(1, 10, 1)
## [1] "integer"
as.integer(c(-1.1, 0, 1.9, 2.1))  # truncate/round towards 0
## [1] -1  0  1  2

In the vast majority of expressions, integer vectors behave like numeric ones, and are silently coerced to double if need be, so there is no real practical reason to distinguish between them (they are of internal interest, e.g., when writing C/C++ extensions; see Chapter 14). For example:

1L/2L  # like 1/2 == 1.0/2.0
## [1] 0.5
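
Should we be curious about which type a given operation yields, typeof will tell us. A few quick checks:

```r
typeof(1L + 2L)    # both operands are integer
## [1] "integer"
typeof(1L + 2)     # mixed: coerced to double
## [1] "double"
typeof(5L %/% 2L)  # integer division
## [1] "integer"
typeof(1L / 2L)    # ordinary division always gives a double
## [1] "double"
```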

Note

(*) R integers are 32-bit signed types. The double type can represent more integers exactly: the largest contiguously representable integer is $$2^{53}$$ for doubles versus $$2^{31}-1$$ for integers (see Section 3.2.3):

as.integer(2^31-1) + 1L  # 32-bit integer overflow
## Warning in as.integer(2^31 - 1) + 1L: NAs produced by integer overflow
## [1] NA
as.integer(2^31-1) + 1 == 2^31 # integer+double == double – OK
## [1] TRUE
(2^53 - 1) + 1 == 2^53  # OK
## [1] TRUE
(2^53 + 1) - 1 == 2^53  # lost due to FP rounding, left result is 2^53 - 1
## [1] FALSE

Note

Since R 3.0, there is support for vectors longer than $$2^{31}-1$$ elements. As there are no 64-bit integers in R, these are indexed by doubles anyway (as we have been doing all this time). Interestingly, x[1.9] is the same as x[1] and x[-1.9] means x[-1] (truncation of the fractional part). This is why the notation like x[length(x)*0.2] works regardless of whether the length of x is a multiple of 5 or not, which is neat.
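
The truncation of fractional indexes can be observed on a small example (the data are arbitrary):

```r
x <- c(10, 20, 30, 40, 50, 60, 70)  # length 7, not a multiple of 5
x[length(x) * 0.2]                  # 7*0.2 is 1.4, truncated to 1
## [1] 10
x[1.9]                              # same as x[1]
## [1] 10
x[-1.9]                             # same as x[-1]
## [1] 20 30 40 50 60 70
```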

### 6.4.2. Raw Vectors (*)

Vectors of type raw can store bytes, i.e., unsigned 8-bit integers, whose range is 0-255 (there are no raw NAs). For example:

as.raw(c(-1, 0, 1, 2, 0xc0, 254, 255, 256, NA))
## Warning: out-of-range values treated as 0 in coercion to raw
## [1] 00 00 01 02 c0 fe ff 00 00

They are displayed as two-digit hexadecimal (base-16) numbers. Also note that we may enter such numbers using the “0x” prefix.

There are only a few functions that deal with such vectors: e.g., readBin, charToRaw, and rawToChar.
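
For instance, the following sketch converts an ASCII string to its underlying bytes and back:

```r
b <- charToRaw("spam")
b              # bytes displayed in hexadecimal
## [1] 73 70 61 6d
as.integer(b)  # the same values in decimal
## [1] 115 112  97 109
rawToChar(rev(b))
## [1] "maps"
```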

### 6.4.3. Complex Vectors (*)

We can also play with vectors of type complex, with “1i” representing the imaginary unit, $$\sqrt{-1}$$. Complex numbers appear in quite a few engineering or scientific applications, e.g., in physics, electronics, or signal processing and are (at least: should be) part of many university-level subjects or textbooks in mathematics[5].

c(0, 1i, pi+pi*1i, NA_complex_)
## [1] 0.0000+0.0000i 0.0000+1.0000i 3.1416+3.1416i             NA

Apart from the basic operators, mathematical and aggregation functions, procedures like fft, solve, qr, or svd can be fed with or produce such data. For more details, see help("complex") and some matrix examples in Chapter 11.
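
A few basic operations on complex numbers are sketched below; Re, Im, Mod, and Conj are all documented in help("complex").

```r
z <- 3+4i
c(Re(z), Im(z), Mod(z))  # real part, imaginary part, modulus
## [1] 3 4 5
Conj(z)                  # complex conjugate
## [1] 3-4i
sqrt(as.complex(-1))     # whereas sqrt(-1) yields NaN with a warning
## [1] 0+1i
```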

## 6.5. Exercises

Exercises marked with (*) might require tinkering with regular expressions or third-party R packages.

Exercise 6.6

• How many characters are there in the string "ab\n\\\t\\\\\""? What about "-{ab\n\\\t\\\\\"-)}-"?

• What is the result of calling “paste(NA, 1:5, collapse="")”?

• What is the meaning of the following sprintf format strings: "%s", "%20s", "%-20s", "%f", "%g", "%e", "%5f", "%5.2f%%", "%.2f", "%0+5f", and "[%+-5.2f]"?

• What is the difference between regexpr and gregexpr? What does “g” in the latter name stand for?

• What is the result of a call to “grepl(c("spam", "spammity spam", "aubergines"), "spam")”?

• Is it always the case that “"Aaron" < "Zorro"”?

• If x is a character vector, is “x == x” always equal to TRUE?

• If x and y are character vectors of lengths n and m, respectively, what is the length of the output of “match(x, y)”?

• If x is a named vector, why is there a difference between “x[NA]” and “x[NA_character_]”?

• What is the difference between “x == y” and “x %in% y”?

Exercise 6.7

Let x, y, and z be atomic vectors and a and b be single strings. Generate the same results as “pastena(x, collapse=b)”, “pastena(x, y, sep=a)”, “pastena(x, y, sep=a, collapse=b)”, “pastena(x, y, z, sep=a)”, “pastena(x, y, z, sep=a, collapse=b)”, assuming that pastena is a version of paste (which we do not have) that handles missing data in a way consistent with most other functions.

Exercise 6.8

Based on list.files and glob2rx, generate the list of all PDFs on your computer. Then, using file.size, filter out the files smaller than 10 MiB.

Exercise 6.9

Read a text file that stores a long paragraph of some banal prose. Concatenate all the lines to form a single, long string. Using strwrap and cat, output the paragraph on the console, nicely formatted to fit an aesthetic width, say, 60 text columns.

Exercise 6.10

(*) Implement your own, simplified version of basename and dirname.

Exercise 6.11

(*) Implement an operation similar to trimws using the functions introduced in this chapter.

Exercise 6.12

(*) Write a regex that extracts all words from each string in a given character vector.

Exercise 6.13

(*) Write a regex that extracts, from each string in a character vector, all:

• integer numbers (signed or unsigned),

• floating-point numbers,

• numbers of any kind (including those in scientific notation),

• #hashtags,

• hyperlinks of the form http://… and https://….

Exercise 6.14

(*) What do 42i, 42L, and 0x42 stand for?

Exercise 6.15

(*) Check out stri_sort in the stringi package (or sort.character in stringx) for a way to obtain an ordering like "a1" < "a2" < "a10" < "a11" < "a100".

Exercise 6.16

(*) In sprintf, the formatter "%20s" means that if a string is less than 20 bytes long, the remaining bytes will be replaced with spaces. Only for ASCII characters (English letters, digits, some punctuation marks, etc.) it is true that one character is represented by 1 byte. Other Unicode code points can take up between 2 and 4 bytes.

cat(sprintf("..%6s..", c("abc", "1!<", "aßc", "ąß©")), sep="\n")  # aligned?
## ..   abc..
## ..   1!<..
## ..  aßc..