6. Character vectors

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug/typos reports/fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [26].

Text is a universal, portable, economical, and efficient means of interacting between humans and computers as well as exchanging data between programs or APIs. This book is 99% made of text. And, wow, how much valuable knowledge is in it, innit?

6.1. Creating character vectors

6.1.1. Inputting individual strings

Specific character strings are delimited either by a pair of double or single quotes (apostrophes).

"a string"
## [1] "a string"
'another string'  # and, of course, neither 'like this" nor "like this'
## [1] "another string"

The only difference between these two is that we cannot directly include, e.g., an apostrophe in a single quote-delimited string. On the other hand, "'tis good ol' spam" and 'I "love" bacon' are both okay.

However, to embrace characters whose inclusion might otherwise be difficult or impossible, we may always employ the so-called escape sequences.

R uses the backslash, “\”, as the escape character, in particular:

  • \" inputs the double quote character,

  • \' – single quote,

  • \\ – backslash,

  • \n – new line.

(x <- "I \"love\" bacon\n\\\"/")
## [1] "I \"love\" bacon\n\\\"/"

The print function (which was implicitly called to display the above object) does not reveal the special meaning of the escape sequences. Instead, print outputs strings in the same form in which we would input them. The number of characters in x is 18, and not 23:

nchar(x)
## [1] 18

To display the string as-it-really-is, we call:

cat(x, sep="\n")
## I "love" bacon
## \"/

Raw character constants, where the backslash character’s special meaning is disabled, can be entered using the notation like r"(...)", r"{...}", r"[...]", r"----(...)----", etc.; see help("Quotes"). These can be useful when inputting regular expressions (see below).

x <- r"(spam\n\\\"maps)"
print(x)
## [1] "spam\\n\\\\\\\"maps"
cat(x, sep="\n")
## spam\n\\\"maps
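As a small illustration (a sketch using a made-up Windows-style path; raw character constants require R 4.0.0 or later), a raw constant and its escaped counterpart denote the very same string:

```r
a <- r"(C:\Users\spam)"   # raw: backslashes are taken literally
b <- "C:\\Users\\spam"    # ordinary: each backslash must be doubled
identical(a, b)
## [1] TRUE
nchar(a)
## [1] 13
```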

… and, of course, the string version of the missing value marker is “NA_character_”.


(*) Some output devices may support the following codes that control the position of the caret (text cursor):

  • \b – backspace (move cursor one column to the left),

  • \t – tab (advance to the next tab stop, e.g., a multiple of 8),

  • \r – carriage return (move to the beginning of the current line).

cat("abc\bd\tef\rg\nhij", sep="\n")
## gbd     ef
## hij

These can be used on unbuffered outputs (see, e.g., help("stderr")) to display the status of the current operation (a simple “animated” progress bar, the print-out of the ETA, or the percentage of work completed).
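For instance, the carriage return lets us overwrite the same line over and over. Below is a minimal sketch: show_progress is a hypothetical helper of ours (not part of base R), Sys.sleep merely stands in for some real work, and the effect is only visible on terminals that honour “\r”.

```r
show_progress <- function(n, delay=0.05) {
    for (i in seq_len(n)) {
        # \r moves the caret back to the beginning of the current line
        cat(sprintf("\rWorking: %3d%%", round(100*i/n)), file=stderr())
        Sys.sleep(delay)  # a placeholder for a time-consuming computation
    }
    cat("\n", file=stderr())  # finish the line once we are done
    invisible(n)
}
show_progress(20)
```

We write to stderr here because, as mentioned above, it is unbuffered: each update appears immediately.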

Further, certain terminals can also understand the ECMA-48/ANSI-X3.64 escape sequences of the form “\u001b[...” to control further the cursor’s position, text colour, and even style. For example, “\u001b[1;31m” outputs bold red text and “\u001b[0m” resets the settings to default. Give, e.g., “cat("\u001b[1;31mspam\u001b[0m")” or “cat("\u001b[5;36m\u001b[Abacon\u001b[Espam\u001b[0m")” a try.


(*) The Unicode standard 15.0 (version dated September 2022) defines 149 186 characters, i.a., letters from different scripts, mathematical symbols, and emojis. Each is assigned a unique numeric identifier; see the Unicode Character Code Charts. For example, the inverted exclamation mark (see the Latin-1 Supplement section therein) has been mapped to hexadecimal code 0xA1 (or 161 decimally). Knowing this magic number allows us to specify a Unicode code point using one of the following escape sequences:

  • \uxxxx – codes using four hexadecimal digits,

  • \Uxxxxxxxx – codes using eight hexadecimal digits.

For instance:

cat("!\u00a1!\U000000a1!", sep="\n")
## !¡!¡!
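Going in the other direction is possible too. The sketch below relies on the base R functions utf8ToInt and intToUtf8, which this chapter does not otherwise cover, to convert between a string and its code points:

```r
utf8ToInt("\u00a1")            # the code point of the inverted exclamation mark
## [1] 161
intToUtf8(c(161, 33))          # code points back to a single string
## [1] "¡!"
nchar("\u00a1", type="bytes")  # one code point, but two bytes in UTF-8
## [1] 2
```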

All R installations allow for working with Unicode strings (more precisely, UTF-8: a super-encoding that is native to most UNIX-like boxes, including GNU/Linux and m**OS). Other operating systems may use some 8-bit encoding as the system one (e.g., latin1 or cp1252), but they can be mixed with Unicode seamlessly. See help("Encoding"), help("iconv"), and [27] for discussion.

Nevertheless, certain output devices (web browsers, LaTeX renderers, text terminals) might be unable to display every possible Unicode character, e.g., due to some fonts’ being missing. However, as far as the processing of character data is concerned, this does not matter: R does it with its eyes closed.

For example, in the PDF version of this adorable book, none of the following Unicode glyphs are correctly displayed. Yours cordially did not care about installing appropriate fonts in his XeLaTeX distribution. However, its HTML variant, generated from the same source files as the former, will likely be rendered by the reader’s web browser as intended.

cat("\U0001f642\u2665\u0bb8\U0001f923\U0001f60d\u2307", sep="\n")
## 🙂♥ஸ🤣😍⌇

6.1.2. Many strings, one object

Less trivial character vectors (meaning, of length greater than one) can be constructed by means of, e.g., c or rep[1].

(x <- c(rep("spam", 3), "bacon", NA_character_, "spam"))
## [1] "spam"  "spam"  "spam"  "bacon" NA      "spam"

Thus, a character vector is, in fact, a sequence of sequences of characters[2]. As usual, the total number of strings can be fetched via the length function. The length of each string, in turn, may be read with the vectorised nchar.

length(x)  # how many strings?
## [1] 6
nchar(x)   # the number of code points in each string
## [1]  4  4  4  5 NA  4

6.1.3. Concatenating character vectors

paste can be used to concatenate (join) the corresponding elements of two or more character vectors:

paste(c("a", "b", "c"), c("1", "2", "3"))  # sep=" " by default
## [1] "a 1" "b 2" "c 3"
paste(c("a", "b", "c"), c("1", "2", "3"), sep="")  # see also paste0
## [1] "a1" "b2" "c3"

The function is deeply vectorised:

paste(c("a", "b", "c"), 1:6, c("!", "?"))  # implicit coercion of numbers
## [1] "a 1 !" "b 2 ?" "c 3 !" "a 4 ?" "b 5 !" "c 6 ?"

We can also collapse (flatten, aggregate) a sequence of strings into a single string:

paste(c("a", "b", "c", "d"), collapse=",")
## [1] "a,b,c,d"
paste(c("a", "b", "c", "d"), 1:2, sep="", collapse="")
## [1] "a1b2c1d2"
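The two modes combine naturally: an inner call prepares the individual items and an outer one glues them together. Here is a small sketch with made-up variable names:

```r
terms <- paste("x", 1:3, sep="")            # "x1" "x2" "x3"
paste("y ~", paste(terms, collapse=" + "))  # collapse first, then join
## [1] "y ~ x1 + x2 + x3"
```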

Unfortunately (perhaps for the so-called convenience), paste treats missing values differently from most other vectorised functions:

paste(c("A", NA_character_, "B"), "!", sep="")
## [1] "A!"  "NA!" "B!"

6.1.4. Formatting objects

Strings can also come into being by turning other R objects into text. For example, the quite customisable (see Chapter 10) format function can be used to pretty-print data in dynamically generated reports.

x <- c(123456.789, -pi, NaN)
format(x)
## [1] "123456.7890" "    -3.1416" "        NaN"
cat(format(x, digits=8, scientific=FALSE, drop0trailing=TRUE), sep="\n")
## 123456.789
##     -3.1415927
##            NaN

Moreover, sprintf is a workhorse for turning possibly many atomic vectors into strings. The numbers’ precision, strings’ widths and justification, etc., can be fully controlled. Its first argument is a format string; special escape sequences starting with the per cent sign, “%”, serve as placeholders for the actual values. For instance, “%s” is meant to be replaced with a corresponding string and “%f” with a floating point value. Additional options are available, e.g., “%10.2f” is a number that, when converted to text, will occupy ten text columns[3], with two decimal digits of precision. Also, e.g., “%1$s”, “%2$s”, … will insert the 1st, 2nd, … argument as text.

sprintf("%.5f", pi)
## [1] "3.14159"
sprintf("%s%s", "a", c("X", "Y", "Z"))  # like paste(...)
## [1] "aX" "aY" "aZ"
sprintf("key=%s, value=%.1f", c("spam", "eggs"), c(100000, 0))
## [1] "key=spam, value=100000.0" "key=eggs, value=0.0"
sprintf("%.*f", 1:5, pi)  # variable precision
## [1] "3.1"     "3.14"    "3.142"   "3.1416"  "3.14159"
sprintf("%1$s, %2$s, %1$s, and %1$s", "spam", "bacon")  # numbered argument
## [1] "spam, bacon, spam, and spam"
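Field widths and justification come in handy when printing aligned columns. The following sketch (with made-up data) left-justifies the labels in eight text columns and right-justifies the numbers in another eight:

```r
items <- c("spam", "eggs")
prices <- c(1.5, 12.25)
cat(sprintf("%-8s|%8.2f", items, prices), sep="\n")
## spam    |    1.50
## eggs    |   12.25
```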

See help("sprintf") for more details.

6.1.5. Reading text data from files

Given a raw text file, readLines can load it into memory to represent it as a character vector, with each line stored in a separate string.

f <- readLines(
## [1] "# Dr [Marek](https://www.gagolewski.com)'s Data for Teaching"
## [2] ""
## [3] "> *See the comment lines within the files themselves for"
## [4] "> a detailed description of each dataset.*"
## [5] ""
## [6] "*Good* datasets are actually hard to find!"

writeLines is its counterpart. There is also an option to read or write parts of files at a time, which we mention in Section 8.3.5. Also, cat(..., append=TRUE) can be used to create a text file incrementally.
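A round trip through a temporary file might look as follows. This is only a sketch: the file name is generated by tempfile, so nothing of ours gets overwritten.

```r
f <- tempfile(fileext=".txt")               # a throwaway file name
writeLines(c("spam", "bacon and eggs"), f)  # one string per line
cat("more spam\n", file=f, append=TRUE)     # add a line at the end
(y <- readLines(f))
## [1] "spam"           "bacon and eggs" "more spam"
unlink(f)  # remove the temporary file
```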

6.2. Pattern searching

6.2.1. Comparing whole strings

We have already reviewed a couple of ways to compare strings as a whole. For instance, the `==` operator implements elementwise testing:

c("spam", "spam", "bacon", "eggs") == c("spam", "eggs")  # recycling rule
## [1]  TRUE FALSE FALSE  TRUE

In Section 5.4.1, we introduced the match function and its derivative, the `%in%` operator. They are vectorised in a different way:

match(c("spam", "spam", "bacon", "eggs"), c("spam", "eggs"))
## [1]  1  1 NA  2
c("spam", "spam", "bacon", "eggs") %in% c("spam", "eggs")
## [1]  TRUE  TRUE FALSE  TRUE


match relies on simple, bytewise comparisons of the corresponding code points. It might not be valid in, for example, natural language processing activities; compare [18]. In particular, the German word groß is not deemed equal to gross, although we expect that should be the case, at least in a German language setting. Moreover, in the rare situations where we read Unicode-unnormalised data (say, not in the NFC form; see [17]), canonically equivalent strings may be considered different.
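To see the problem in action, consider the following sketch. The strings are entered with \u escapes so that the example does not depend on how this file is encoded:

```r
"gro\u00df" == "gross"  # the sharp s is not the same as "ss"
## [1] FALSE
a <- "\u00e9"   # e with acute as a single code point (NFC)
b <- "e\u0301"  # e followed by a combining acute accent (NFD)
a == b          # they look alike on screen, but the code points differ
## [1] FALSE
nchar(c(a, b))
## [1] 1 2
```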

6.2.2. Partial matching

When only a consideration of the initial part of each string is required, we can call:

startsWith(c("s", "spam", "spamtastic", "spontaneous", "spoon"), "spam")
## [1] FALSE  TRUE  TRUE FALSE FALSE

Both the above and endsWith are applied elementwise in the case of many search prefixes/suffixes, just like in `==`.

Partial matching of strings can be performed with charmatch, which is an each-vs-all version of startsWith:

charmatch(c("s", "sp", "spam", "spams", "eggs", "bacon"), c("spam", "eggs"))
## [1]  1  1  1 NA  2 NA
charmatch(c("s", "sp", "spam", "spoo", "spoof"), c("spam", "spoon"))
## [1]  0  0  1  2 NA

Note that 0 designates that there was an ambiguity in matching a string to a given table.


(*) In Section 9.5.7, we discuss the very-advanced match.arg, which is frequently called by other R functions. It assists in selecting a value from a range of possible choices. Furthermore, Section 9.4.2 and Section 15.4.4 mention the (discouraged) partial matching of list labels and function argument names.

6.2.3. Matching anywhere within a string

Fixed patterns can also be searched for anywhere within character strings using grepl:

x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
grepl("spam", x, fixed=TRUE)  # fixed patterns, as opposed to regexes below
## [1]  TRUE  TRUE FALSE FALSE


The order of arguments is like grepl(needle, haystack), not the other way around. Also, this function is not vectorised with respect to the first argument.

Exercise 6.1

Determine how the calls to grep(y, x, value=FALSE) and grep(y, x, value=TRUE) can be implemented based on grepl and other operations we are already familiar with.


As a curiosity, agrepl performs approximate matching based on Levenshtein’s edit distance, which can account for a small number of “typos”.

agrepl("spam", x)
agrepl("ham", x, ignore.case=TRUE)

6.2.4. Using regular expressions (*)

Setting perl=TRUE allows for identifying occurrences of patterns specified by the PCRE2 regular expressions (regexes).

grepl("^spam", x, perl=TRUE)  # strings that begin with `spam`
## [1]  TRUE FALSE FALSE FALSE
grepl("(?i)^spam|spam$", x, perl=TRUE)  # begin or end; case ignored
## [1]  TRUE  TRUE  TRUE FALSE


For more details on regular expressions in general, see, e.g., [24]. The ultimate reference for PCRE2 pattern syntax is the man page pcre2pattern(3). R also gives access to the ERE-like TRE library (see help("regex")), which is the default one. However, we discourage its use because it is feature-poorer.
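To give a flavour of the syntax, here is a brief sketch featuring anchors, character classes, and quantifiers. The test vector y is made up, and the second pattern is written as a raw string to avoid doubling the backslashes:

```r
y <- c("spam12", "spam1", "12spam", "Spam123")
grepl("^[a-z]+\\d{2,}$", y, perl=TRUE)  # lower case letters, then 2+ digits
## [1]  TRUE FALSE FALSE FALSE
grepl(r"(^\p{L}+\d+$)", y, perl=TRUE)   # any letters, then 1+ digits
## [1]  TRUE  TRUE FALSE  TRUE
```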

Exercise 6.2

The list.files function generates the list of file names in a given directory that match a given regular expression. For instance, the following gives all CSV files in some directory.

list.files("../../Projects/teaching-data/r/", r"(\.csv$)")  # or "\\.csv$"
## [1] "air_quality_1973.csv" "anscombe.csv"         "iris.csv"
## [4] "titanic.csv"          "tooth_growth.csv"     "trees.csv"
## [7] "world_phones.csv"

Write a single regular expression that matches file names ending with “.csv” or “.csv.gz”. Also, scribble a regex that matches CSV files whose names do not begin with “eurusd”.

6.2.5. Locating pattern occurrences

regexpr finds the first occurrence of a pattern in each string:

regexpr("spam", x, fixed=TRUE)
## [1]  1  3 -1 -1
## attr(,"match.length")
## [1]  4  4 -1 -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

In particular, there is a pattern occurrence starting at the 3rd code point of the 2nd string in x. Moreover, the last two strings have no pattern match (denoted with -1).

The match.length attribute is generally more informative when searching with regular expressions.

To locate all the matches, i.e., globally, we use gregexpr:

# `spam` followed by 0 or more letters, case insensitively
gregexpr("(?i)spam\\p{L}*", x, perl=TRUE)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[2]]
## [1]  3 12
## attr(,"match.length")
## [1] 8 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[3]]
## [1] 7
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## [[4]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

As we noted in Section 4.4.2, wrapping the results in a list was a clever choice, as the number of matches can obviously vary between strings.

In Section 7.2, we will look at the Map function, which, along with substring introduced below, can aid in getting the most out of such data. Meanwhile, let us just mention that regmatches extracts the matching substrings:

regmatches(x, gregexpr("(?i)spam\\p{L}*", x, perl=TRUE))
## [[1]]
## [1] "spam"
## [[2]]
## [1] "spammite" "spam"
## [[3]]
## [1] "SPAM"
## [[4]]
## character(0)
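regmatches also accepts the output of regexpr. A word of caution, illustrated by the sketch below: in that case, it returns an ordinary character vector from which the strings without a match are dropped, so the result can be shorter than the input.

```r
x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
regmatches(x, regexpr("(?i)spam\\p{L}*", x, perl=TRUE))  # first matches only
## [1] "spam"     "spammite" "SPAM"
```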


(*) Let us consider what happens when a regular expression contains parenthesised subexpressions (capture groups).

r <- "(?<basename>[^. ]+)\\.(?<extension>[^ ]*)"

The above regex consists of two such parts. The first one is labelled “basename”. It comprises several arbitrary characters except for the space and the dot. The second group, named “extension”, is a substring consisting of anything but the space. A dot separates these two groups.

Such a pattern can be used for unpacking space-delimited lists of file names.

z <- "dataset.csv.gz something_else.txt spam"
regexpr(r, z, perl=TRUE)
## [1] 1
## attr(,"match.length")
## [1] 14
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
##      basename extension
## [1,]        1         9
## attr(,"capture.length")
##      basename extension
## [1,]        7         6
## attr(,"capture.names")
## [1] "basename"  "extension"
gregexpr(r, z, perl=TRUE)
## [[1]]
## [1]  1 16
## attr(,"match.length")
## [1] 14 18
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
##      basename extension
## [1,]        1         9
## [2,]       16        31
## attr(,"capture.length")
##      basename extension
## [1,]        7         6
## [2,]       14         3
## attr(,"capture.names")
## [1] "basename"  "extension"

The capture.* attributes give us access to the matches to the individual capture groups, i.e., the basename and the extension.
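One possible way to put these attributes to use is sketched below. We fetch the start positions and lengths from the gregexpr output and pass them to substring, which is introduced later in this chapter. The auxiliary names m, s, and l are ours, and selecting matrix columns by name is only discussed properly in Chapter 11.

```r
r <- "(?<basename>[^. ]+)\\.(?<extension>[^ ]*)"
z <- "dataset.csv.gz something_else.txt spam"
m <- gregexpr(r, z, perl=TRUE)[[1]]  # all matches in the only string
s <- attr(m, "capture.start")        # one row per match, one column per group
l <- attr(m, "capture.length")
substring(z, s[, "basename"], s[, "basename"] + l[, "basename"] - 1)
## [1] "dataset"        "something_else"
substring(z, s[, "extension"], s[, "extension"] + l[, "extension"] - 1)
## [1] "csv.gz" "txt"
```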

Exercise 6.3

(*) Check out the difference between the results generated by regexec and regexpr as well as gregexec and gregexpr.

6.2.6. Replacing pattern occurrences

sub and gsub replace, respectively, the first and all matches to a pattern:

x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
sub("spam", "ham", x, fixed=TRUE)
## [1] "ham"            "y hammite spam" "yummy SPAM"     "sram"
gsub("spam", "ham", x, fixed=TRUE)
## [1] "ham"           "y hammite ham" "yummy SPAM"    "sram"


(*) If a regex features some capture groups, matches thereto can be mentioned not only in the pattern itself but also in the replacement string:

gsub("(\\p{L})\\p{L}\\1", "\\1", "aha egg gag NaN spam", perl=TRUE)
## [1] "a egg g N spam"

The above matches in the following order: a letter (it is a capture group), another letter, and the former letter again. Each such palindrome of length three is replaced with just the repeated letter.
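Several capture groups can be rearranged in the same manner. In the following sketch (the date is made up), the year, month, and day are captured separately and then output in reverse order:

```r
sub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", "2023-01-31", perl=TRUE)
## [1] "31/01/2023"
```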

Exercise 6.4

(*) Display the source code of glob2rx by calling print(glob2rx) and study how this function converts wildcards such as file???.* or *.csv to regular expressions that can be passed to, e.g., list.files.

6.2.7. Splitting strings into tokens

strsplit divides each string in a character vector into chunks. This time, though, the search pattern specifying the token delimiter is given as the second argument:

strsplit(c("spam;spam;eggs;;bacon", "spam"), ";", fixed=TRUE)
## [[1]]
## [1] "spam"  "spam"  "eggs"  ""      "bacon"
## [[2]]
## [1] "spam"
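The delimiter can be a regular expression as well. The sketch below splits at commas or semicolons together with any surrounding whitespace. It also shows a documented quirk: a delimiter at the very end of a string does not yield a trailing empty token.

```r
strsplit("spam,  eggs ;bacon", "\\s*[,;]\\s*", perl=TRUE)
## [[1]]
## [1] "spam"  "eggs"  "bacon"
strsplit("a;b;", ";", fixed=TRUE)[[1]]  # no "" at the end
## [1] "a" "b"
```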

6.3. Other string operations

6.3.1. Extracting substrings

substring extracts parts of strings between given character position ranges.

substring("spammity spam", 1, 4)  # from 1st to 4th character
## [1] "spam"
substring("spammity spam", 10)  # from 10th to end
## [1] "spam"
substring("spammity spam", c(1, 10), c(4, 14))  # vectorisation
## [1] "spam" "spam"
substring(c("spammity spam", "bacon and eggs"), 1, c(4, 5))
## [1] "spam"  "bacon"


There is also a replacement (compare Section 9.4.6) version of the above:

x <- "spam, spam, bacon, and spam"
substring(x, 7, 11) <- "eggs"
print(x)
## [1] "spam, eggs, bacon, and spam"

Unfortunately, the number of characters in the replacement string should not exceed the length of the part being substituted (try “chickpeas” instead of “eggs”). However, substring replacement can be written as a composition of substring extraction and concatenation:

paste(substring(x, 1, 6), "chickpeas", substring(x, 11), sep="")
## [1] "spam, chickpeas, bacon, and spam"

Exercise 6.5

Take the output generated by regexpr and apply substring to extract the pattern occurrences. If there is no match in some string, the corresponding output should be NA.

6.3.2. Translating characters

tolower and toupper can be used to convert between lower and upper case:

toupper("spam")
## [1] "SPAM"


Like many other string operations in base R, these functions perform very simple character substitutions. They might not be valid in natural language processing tasks. For instance, groß is not converted to GROSS, the correct case folding in German.

Moreover, chartr translates individual characters:

chartr("\\", "/", "c:\\windows\\system\\cmd.exe")  # chartr(old, new, x)
## [1] "c:/windows/system/cmd.exe"
chartr("([S", ")]*", ":( :S :[")
## [1] ":) :* :]"

In the first line, we replace each backslash with a slash. The second example replaces “(”, “[”, and “S” with “)”, “]”, and “*”, respectively.
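According to help("chartr"), ranges of characters are supported in both specifications. As an illustration, here is a sketch of the classic ROT13 cipher, where rot13 is a helper name of our own choosing:

```r
rot13 <- function(x) chartr("a-zA-Z", "n-za-mN-ZA-M", x)  # shift letters by 13
rot13("Spam and Eggs")
## [1] "Fcnz naq Rttf"
rot13(rot13("Spam and Eggs"))  # applying it twice restores the input
## [1] "Spam and Eggs"
```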

6.3.3. Ordering strings

We have previously mentioned that operators and functions such as `<`, `>=`, sort, order, rank, and xtfrm[4] are based on the lexicographic ordering of strings.

sort(c("chłodny", "hardy", "chladný", "hladný"))
## [1] "chladný" "chłodny" "hardy"   "hladný"

It is worth noting that the ordering depends on the currently selected locale; see Sys.getlocale("LC_COLLATE"). For instance, in the Slovak language setting, we would obtain "hardy" < "hladný" < "chladný" < "chłodny".
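When a locale-independent result is needed, one option (a sketch relying on the behaviour described in help("sort")) is the radix method, which orders character vectors as in the C locale, i.e., by the underlying codes. Note that all upper case ASCII letters then precede the lower case ones:

```r
sort(c("b", "A", "a", "B"), method="radix")  # C-locale ordering
## [1] "A" "B" "a" "b"
```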


Many “structured” data items can be displayed or transmitted as human-readable strings. In particular, we know that as.numeric can convert a string to a number. Moreover, Section 10.3.1 will discuss date-time objects such as "1970-01-01 00:00:00 GMT". We will be processing them with specialised functions such as strptime and strftime.


(*) Many string operations in base R are not necessarily portable. The stringx package defines drop-in, “fixed” replacements therefor. They are based on the International Components for Unicode (ICU) library, a de facto standard for processing Unicode text, and the R package stringi; see [27].

# call install.packages("stringx") first
suppressPackageStartupMessages(library("stringx"))  # load the package
sort(c("chłodny", "hardy", "chladný", "hladný"), locale="sk_SK")
## [1] "hardy"   "hladný"  "chladný" "chłodny"
toupper("gro\u00DF")  # compare base::toupper("gro\u00DF")
## [1] "GROSS"
detach("package:stringx")  # unload the package

6.4. Other atomic vector types (*)

We have discussed four vector types: logical, double, character, and list (the latter being a generic-recursive vector). To get the complete picture of the sequence-like types in R, let us briefly mention integer, complex, and raw atomic types so that we are not surprised when we encounter them.

6.4.1. Integer vectors (*)

Integer scalars can be input manually by using the L suffix:

(x <- c(1L, 2L, -1L, NA_integer_))  # looks like numeric
## [1]  1  2 -1 NA
typeof(x)  # but is integer
## [1] "integer"

Some functions return them in a few contexts[5]:

typeof(1:10)  # seq(1, 10) as well, but not seq(1, 10, 1)
## [1] "integer"
as.integer(c(-1.1, 0, 1.9, 2.1))  # truncate/round towards 0
## [1] -1  0  1  2

In most expressions, integer vectors behave like numeric ones. They are silently coerced to double if need be. Therefore, there is no real practical reason to distinguish between them (they are of internal interest, e.g., when writing C/C++ extensions; see Chapter 14). For example:

1L/2L  # like 1/2 == 1.0/2.0
## [1] 0.5
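A few more checks of which operations preserve the integer type are sketched below; `.Machine$integer.max` stores the largest representable integer:

```r
typeof(1L + 2L)    # integer arithmetic stays integer
## [1] "integer"
typeof(1L + 2)     # mixing with a double yields a double
## [1] "double"
typeof(7L %/% 2L)  # integer division of integers is integer too
## [1] "integer"
typeof(1L / 2L)    # but ordinary division always gives a double
## [1] "double"
.Machine$integer.max == 2^31 - 1
## [1] TRUE
```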


(*) R integers are 32-bit signed types. The double type can represent a wider range of integers exactly: all integers up to \(2^{53}\) in magnitude are contiguously representable, whereas the largest value of the integer type is \(2^{31}-1\); see Section 3.2.3:

as.integer(2^31-1) + 1L  # 32-bit integer overflow
## Warning in as.integer(2^31 - 1) + 1L: NAs produced by integer overflow
## [1] NA
as.integer(2^31-1) + 1 == 2^31 # integer+double == double – OK
## [1] TRUE
(2^53 - 1) + 1 == 2^53  # OK
## [1] TRUE
(2^53 + 1) - 1 == 2^53  # lost due to FP rounding, left result is 2^53 - 1
## [1] FALSE


Since R 3.0, there is support for vectors longer than \(2^{31}-1\) elements. As there are no 64-bit integers in R, these are indexed by doubles anyway (as we have been doing all this time). Interestingly, x[1.9] is the same as x[1], and x[-1.9] means x[-1] (a truncation of the fractional part). This is why the notation like x[length(x)*0.2] works regardless of whether the length of x is a multiple of 5 or not, which is neat.
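A quick sketch of this truncation at work, with a made-up vector whose length is not a multiple of 5:

```r
x <- c("a", "b", "c", "d", "e", "f", "g")  # length 7
x[length(x)*0.2]  # 7*0.2 is 1.4, which is truncated to 1
## [1] "a"
x[c(1.9, 2.1)]
## [1] "a" "b"
x[-1.9]  # same as x[-1]
## [1] "b" "c" "d" "e" "f" "g"
```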

6.4.2. Raw vectors (*)

Vectors of the type raw can store bytes, i.e., unsigned 8-bit integers, whose range is 0–255 (there are no raw NAs). For example:

as.raw(c(-1, 0, 1, 2, 0xc0, 254, 255, 256, NA))
## Warning: out-of-range values treated as 0 in coercion to raw
## [1] 00 00 01 02 c0 fe ff 00 00

They are displayed as two-digit hexadecimal (base-16) numbers. We may enter such numbers using the “0x” prefix.

Only a few functions deal with such vectors: e.g., readBin, charToRaw, and rawToChar.
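For example, we can peek at the bytes behind an ASCII string and convert them back. This is merely a sketch; for non-ASCII text, the bytes depend on the string's encoding.

```r
(b <- charToRaw("spam"))  # the bytes that encode the string
## [1] 73 70 61 6d
as.integer(b)             # the same values in base 10
## [1] 115 112  97 109
rawToChar(rev(b))         # bytes in reverse order, converted back
## [1] "maps"
```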

6.4.3. Complex vectors (*)

We can also play with vectors of the type complex, with “1i” representing the imaginary unit, \(\sqrt{-1}\). Complex numbers appear in quite a few engineering or scientific applications, e.g., in physics, electronics, or signal processing. They are (at least: ought to be) part of introductory subjects or textbooks in university-level mathematics, including the statistics/machine learning-oriented ones because of their heavy use of numerical computing; see e.g., [19, 30].

c(0, 1i, pi+pi*1i, NA_complex_)
## [1] 0.0000+0.0000i 0.0000+1.0000i 3.1416+3.1416i             NA

Apart from the basic operators, mathematical and aggregation functions, procedures like fft, solve, qr, or svd can be fed with or produce such data. For more details, see help("complex") and some matrix examples in Chapter 11.
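A few of the basic operations are sketched below. Note that we need to convert the argument explicitly to obtain a complex square root of a negative number:

```r
z <- 3+4i
c(Re(z), Im(z), Mod(z))  # real part, imaginary part, modulus
## [1] 3 4 5
Conj(z)                  # complex conjugate
## [1] 3-4i
sqrt(as.complex(-1))     # sqrt(-1), with a double argument, would yield NaN
## [1] 0+1i
```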

6.5. Exercises

Exercises marked with (*) might require tinkering with regular expressions or third-party R packages.

Exercise 6.6

Answer the following questions:

  • How many characters are there in the string "ab\n\\\t\\\\\""? What about "-{ab\n\\\t\\\\\"-)}-"?

  • What is the result of calling “paste(NA, 1:5, collapse="")”?

  • What is the meaning of the following sprintf format strings: "%s", "%20s", "%-20s", "%f", "%g", "%e", "%5f", "%5.2f%%", "%.2f", "%0+5f", and "[%+-5.2f]"?

  • What is the difference between regexpr and gregexpr? What does “g” in the latter name stand for?

  • What is the result of a call to “grepl(c("spam", "spammity spam", "aubergines"), "spam")”?

  • Is it always the case that “"Aaron" < "Zorro"”?

  • Why may “x < "10"” and “x < 10” return different results?

  • If x is a character vector, is “x == x” always equal to TRUE?

  • If x and y are character vectors of lengths \(n\) and \(m\), respectively, what is the length of the output of “match(x, y)”?

  • If x is a named vector, why is there a difference between “x[NA]” and “x[NA_character_]”?

  • What is the difference between “x == y” and “x %in% y”?

Exercise 6.7

Let x, y, and z be atomic vectors and a and b be single strings. Generate the same results as “pastena(x, collapse=b)”, “pastena(x, y, sep=a)”, “pastena(x, y, sep=a, collapse=b)”, “pastena(x, y, z, sep=a)”, “pastena(x, y, z, sep=a, collapse=b)”, assuming that pastena is a version of paste (which we do not have) that handles missing data in a way consistent with most other functions.

Exercise 6.8

Based on list.files and glob2rx, generate the list of all PDFs on your computer. Then, using file.size, filter out the files smaller than 10 MiB.

Exercise 6.9

Read a text file that stores a long paragraph of some banal prose. Concatenate all the lines to form a single, long string. Using strwrap and cat, output the paragraph on the console, nicely formatted to fit an aesthetic width, say, 60 text columns.

Exercise 6.10

(*) Implement your own simplified version of basename and dirname.

Exercise 6.11

(*) Implement an operation similar to trimws using the functions introduced in this chapter.

Exercise 6.12

(*) Write a regex that extracts all words from each string in a given character vector.

Exercise 6.13

(*) Write a regex that extracts, from each string in a character vector, all:

  • integer numbers (signed or unsigned),

  • floating-point numbers,

  • numbers of any kind (including those in scientific notation),

  • #hashtags,

  • email@address.es,

  • hyperlinks of the form http://… and https://….

Exercise 6.14

(*) What do 42i, 42L, and 0x42 stand for?

Exercise 6.15

(*) Check out stri_sort in the stringi package (or sort.character in stringx) for a way to obtain an ordering like "a1" < "a2" < "a10" < "a11" < "a100".

Exercise 6.16

(*) In sprintf, the formatter "%20s" means that if a string is less than 20 bytes long, the remaining bytes will be replaced with spaces. Only for ASCII characters (English letters, digits, some punctuation marks, etc.), it is true that one character is represented by 1 byte. Other Unicode code points can take up between 2 and 4 bytes.

cat(sprintf("..%6s..", c("abc", "1!<", "aßc", "ąß©")), sep="\n")  # aligned?
## ..   abc..
## ..   1!<..
## ..  aßc..
## ..ąß©..

Use the stri_pad function from the stringi package to align the strings aesthetically. Alternatively, check out sprintf from stringx.

Exercise 6.17

(*) Implement an operation similar to stri_pad from stringi using the functions introduced in this chapter.