6. Character vectors

This open-access textbook is, and will remain, freely available for everyone’s enjoyment (also in PDF; a paper copy can also be ordered). It is a non-profit project. Although available online, it is a whole course, and should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Any bug/typo reports/fixes are appreciated. Make sure to check out Minimalist Data Wrangling with Python [27] too.

Text is a universal, portable, economical, and efficient means of interacting between humans and computers as well as exchanging data between programs or APIs. This book is 99% made of text. And, wow, how much valuable knowledge is in it, innit?

6.1. Creating character vectors

6.1.1. Inputting individual strings

Specific character strings are delimited by a pair of either double or single quotes (apostrophes).

"a string"
## [1] "a string"
'another string'  # and, of course, neither 'like this" nor "like this'
## [1] "another string"

The only difference between these two is that we cannot directly include, e.g., an apostrophe in a single quote-delimited string. On the other hand, "'tis good ol' spam" and 'I "love" bacon' are both okay.

However, to embrace characters whose inclusion might otherwise be difficult or impossible, we may always employ the so-called escape sequences.

R uses the backslash, “\”, as the escape character. In particular:

  • \" inputs a double quote,

  • \' generates a single quote,

  • \\ includes a backslash,

  • \n endows a new line.

(x <- "I \"love\" bacon\n\\\"/")
## [1] "I \"love\" bacon\n\\\"/"

The print function (which was implicitly called to display the above object) does not reveal the special meaning of the escape sequences. Instead, print displays strings in the same form in which we would input them. The number of characters in x is 18, and not 23:

nchar(x)
## [1] 18

To display the string as-it-really-is, we call cat:

cat(x, sep="\n")
## I "love" bacon
## \"/

In raw character constants, the backslash character’s special meaning is disabled. They can be entered using the notation like r"(...)", r"{...}", or r"[...]"; see help("Quotes"). These can be useful when inputting regular expressions (Section 6.2.4).

x <- r"(spam\n\\\"maps)"   # also: r"-(...)-", r"--(...)--", etc.
print(x)
## [1] "spam\\n\\\\\\\"maps"
cat(x, sep="\n")
## spam\n\\\"maps

Furthermore, the string version of the missing value marker is NA_character_.
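The missing string should not be confused with the ordinary two-character string "NA". A minimal sketch (the vector y is ours, for illustration only):

```r
y <- c("spam", NA_character_, "NA")
is.na(y)  # only the second element is missing
## [1] FALSE  TRUE FALSE
typeof(NA_character_)
## [1] "character"
```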

Note

(*) The Unicode standard 15.0 (version dated September 2022) defines 149 186 characters, i.a., letters from different scripts, mathematical symbols, and emojis. Each is assigned a unique numeric identifier; see the Unicode Character Code Charts. For example, the inverted exclamation mark (see the Latin-1 Supplement section therein) has been mapped to the hexadecimal code 0xA1 (or 161 decimally). Knowing this magic number permits us to specify a Unicode code point using one of the following escape sequences:

  • \uxxxx – codes using four hexadecimal digits,

  • \Uxxxxxxxx – codes using eight hexadecimal digits.

For instance:

cat("!\u00a1!\U000000a1!", sep="\n")
## !¡!¡!
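The mapping between characters and their numeric identifiers can also be inspected from within R. A small sketch based on the base functions utf8ToInt and intToUtf8 (the printed glyphs assume a UTF-8-capable output device):

```r
utf8ToInt("\u00a1")                 # code point of the inverted exclamation mark
## [1] 161
sprintf("%X", utf8ToInt("\u00a1"))  # the same number in hexadecimal
## [1] "A1"
intToUtf8(c(0x21, 0xa1))            # from code points back to a single string
## [1] "!¡"
```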

All R installations allow for working with Unicode strings. More precisely, they support dealing with UTF-8, being a super-encoding that is native to most UNIX-like boxes, including GNU/Linux and m**OS. Other operating systems may use some 8-bit encoding as the system one (e.g., latin1 or cp1252), but they can be mixed with Unicode seamlessly; see help("Encoding"), help("iconv"), and [26] for discussion.

Nevertheless, certain output devices (web browsers, LaTeX renderers, text terminals) might be unable to display every possible Unicode character, e.g., due to some fonts’ being missing. However, as far as processing character data is concerned, this does not matter because R does it with its eyes closed. For example:

cat("\U0001f642\u2665\u0bb8\U0001f923\U0001f60d\u2307", sep="\n")
## 🙂♥ஸ🤣😍⌇

In the PDF version of this adorable book, the Unicode glyphs are not rendered correctly for some reason. However, its HTML variant, generated from the same source files, should be displayed by most web browsers properly.

Note

(*) Some output devices may support the following codes that control the position of the caret (text cursor):

  • \b inserts a backspace (moves cursor one column to the left),

  • \t implants a tabulator (advances to the next tab stop, e.g., a multiple of four or eight text columns),

  • \r injects a carriage return (moves to the beginning of the current line).

cat("abc\bd\tef\rg\nhij", sep="\n")
## gbd     ef
## hij

These can be used on unbuffered outputs like stderr to display the status of the current operation, for instance, an animated textual progress bar, the print-out of the ETA, or the percentage of work completed.
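As an illustration, here is a rough sketch of a textual progress indicator. The number of steps, n, and the Sys.sleep call that stands in for actual work are our assumptions. Whether the line is overwritten in place depends on the output device.

```r
n <- 5
for (i in seq_len(n)) {
    Sys.sleep(0.1)  # pretend that some work is being done
    # "\r" moves back to the start of the line: the old message is overwritten
    cat(sprintf("\r%3d%% done", as.integer(round(100*i/n))), file=stderr())
}
cat("\n", file=stderr())  # finish the line
```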

Further, certain terminals can also understand the ECMA-48/ANSI-X3.64 escape sequences of the form \u001b[... to control the cursor’s position, text colour, and even style. For example, \u001b[1;31m outputs red text in bold font and \u001b[0m resets the settings to default. We recommend giving, e.g., cat("\u001b[1;31mspam\u001b[0m") or cat("\u001b[5;36m\u001b[Abacon\u001b[Espam\u001b[0m") a try.

6.1.2. Many strings, one object

Less trivial character vectors (meaning, of length greater than one) can be constructed by means of, e.g., c or rep[1].

(x <- c(rep("spam", 3), "bacon", NA_character_, "spam"))
## [1] "spam"  "spam"  "spam"  "bacon" NA      "spam"

Thus, a character vector is, in fact, a sequence of sequences of characters[2]. As usual, the total number of strings can be fetched via the length function. However, the length of each string may be read with the vectorised nchar.

length(x)  # how many strings?
## [1] 6
nchar(x)   # the number of code points in each string
## [1]  4  4  4  5 NA  4

6.1.3. Concatenating character vectors

paste can be used to concatenate (join) the corresponding elements of two or more character vectors:

paste(c("a", "b", "c"), c("1", "2", "3"))  # sep=" " by default
## [1] "a 1" "b 2" "c 3"
paste(c("a", "b", "c"), c("1", "2", "3"), sep="")  # see also paste0
## [1] "a1" "b2" "c3"

The function is deeply vectorised:

paste(c("a", "b", "c"), 1:6, c("!", "?"))  # coercion of numeric to character
## [1] "a 1 !" "b 2 ?" "c 3 !" "a 4 ?" "b 5 !" "c 6 ?"
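The recycling that paste performs here can be spelled out explicitly. A small sketch (the names a, b, and p are ours, for illustration only):

```r
a <- c("a", "b", "c"); b <- 1:6; p <- c("!", "?")
identical(
    paste(a, b, p),
    paste(rep_len(a, 6), b, rep_len(p, 6))  # explicit recycling to length 6
)
## [1] TRUE
```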

We can also collapse (flatten, aggregate) a sequence of strings into a single string:

paste(c("a", "b", "c", "d"), collapse=",")
## [1] "a,b,c,d"
paste(c("a", "b", "c", "d"), 1:2, sep="", collapse="")
## [1] "a1b2c1d2"

Perhaps for convenience, alas, paste treats missing values differently from most other vectorised functions:

paste(c("A", NA_character_, "B"), "!", sep="")
## [1] "A!"  "NA!" "B!"

6.1.4. Formatting objects

Strings can also arise by converting other-typed R objects into text. For example, the quite customisable (see Chapter 10) format function prepares data for display in dynamically generated reports.

x <- c(123456.789, -pi, NaN)
format(x)
## [1] "123456.7890" "    -3.1416" "        NaN"
cat(format(x, digits=8, scientific=FALSE, drop0trailing=TRUE), sep="\n")
## 123456.789
##     -3.1415927
##            NaN

Moreover, sprintf is a workhorse for turning possibly many atomic vectors into strings. Its first argument is a format string. Special escape sequences starting with the per cent sign, “%”, serve as placeholders for the actual values. For instance, “%s” is replaced with a string and “%f” with a floating point value taken from further arguments.

sprintf("%s%s", "a", c("X", "Y", "Z"))  # like paste(...)
## [1] "aX" "aY" "aZ"
sprintf("key=%s, value=%f", c("spam", "eggs"), c(100000, 0))
## [1] "key=spam, value=100000.000000" "key=eggs, value=0.000000"

The numbers’ precision, strings’ widths and justification, etc., can be customised, e.g., “%6.2f” is a number that, when converted to text, will occupy six text columns[3], with two decimal digits of precision.

sprintf("%10s=%6.2f%%", "rate", 2/3*100)  # "%%" renders the per cent sign
## [1] "      rate= 66.67%"
sprintf("%.*f", 1:5, pi)  # variable precision
## [1] "3.1"     "3.14"    "3.142"   "3.1416"  "3.14159"
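Width specifiers are also handy for generating zero-padded labels. A brief sketch, where the file name pattern is made up for illustration:

```r
sprintf("file%03d.csv", c(1L, 20L, 300L))  # integers padded with zeros to width 3
## [1] "file001.csv" "file020.csv" "file300.csv"
```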

Also, e.g., “%1$s”, “%2$s”, … inserts the first, second, … argument as text.

sprintf("%1$s, %2$s, %1$s, and %1$s", "spam", "bacon")  # numbered argument
## [1] "spam, bacon, spam, and spam"

Exercise 6.1

Read help("sprintf") (highly recommended!).

6.1.5. Reading text data from files

Given a raw text file, readLines loads it into memory and represents it as a character vector, with each line stored in a separate string.

head(readLines(
    "https://github.com/gagolews/teaching-data/raw/master/README.md"
))
## [1] "# Dr [Marek](https://www.gagolewski.com)'s Data for Teaching"
## [2] ""
## [3] "> *See the comment lines within the files themselves for"
## [4] "> a detailed description of each dataset.*"
## [5] ""
## [6] "*Good* datasets are actually hard to find!"

writeLines is its counterpart. There is also an option to read or write parts of files at a time using file connections, which we mention in Section 8.3.5. Moreover, cat(..., append=TRUE) can be used to create a text file incrementally.
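A minimal round trip might look as follows. The temporary file f and the strings written to it are only an illustration:

```r
f <- tempfile(fileext=".txt")              # a throwaway file name
writeLines(c("spam", "bacon", "eggs"), f)  # one string per line
cat("more spam\n", file=f, append=TRUE)    # add a line at the end
(y <- readLines(f))
## [1] "spam"      "bacon"     "eggs"      "more spam"
unlink(f)                                  # clean up
```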

6.2. Pattern searching

6.2.1. Comparing whole strings

We have already reviewed a couple of ways to compare strings as a whole. For instance, the `==` operator implements elementwise testing:

c("spam", "spam", "bacon", "eggs") == c("spam", "eggs")  # recycling rule
## [1]  TRUE FALSE FALSE  TRUE

In Section 5.4.1, we introduced the match function and its derivative, the `%in%` operator. They are vectorised in a different way:

match(c("spam", "spam", "bacon", "eggs"), c("spam", "eggs"))
## [1]  1  1 NA  2
c("spam", "spam", "bacon", "eggs") %in% c("spam", "eggs")
## [1]  TRUE  TRUE FALSE  TRUE

Note

(*) match relies on a simple, bytewise comparison of the corresponding code points. It might not be valid in natural language processing activities, e.g., where the German word groß should be equivalent to gross [18]. Moreover, in the rare situations where we read Unicode-unnormalised data, canonically equivalent strings may be considered different; see [17].

6.2.2. Partial matching

When only a consideration of the initial part of each string is required, we can call:

startsWith(c("s", "spam", "spamtastic", "spontaneous", "spoon"), "spam")
## [1] FALSE  TRUE  TRUE FALSE FALSE

If we provide many prefixes, the above function will be applied elementwisely, just like the `==` operator.
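For instance, with one prefix per string (the inputs below are chosen arbitrarily); endsWith is the analogous base R function for suffixes:

```r
startsWith(c("spam", "bacon", "eggs"), c("sp", "sp", "eg"))  # elementwise
## [1]  TRUE FALSE  TRUE
endsWith(c("iris.csv", "iris.csv.gz"), ".csv")  # the same idea for suffixes
## [1]  TRUE FALSE
```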

On the other hand, charmatch performs a partial matching of strings. It is an each-vs-all version of startsWith:

charmatch(c("s", "sp", "spam", "spams", "eggs", "bacon"), c("spam", "eggs"))
## [1]  1  1  1 NA  2 NA
charmatch(c("s", "sp", "spam", "spoo", "spoof"), c("spam", "spoon"))
## [1]  0  0  1  2 NA

Note that 0 designates that there was an ambiguous match.

Note

(*) In Section 9.4.7, we discuss match.arg, which a few R functions rely on when they need to select a value from a range of possible choices. Furthermore, Section 9.3.2 and Section 15.4.4 mention the (discouraged) partial matching of list labels and function argument names.

6.2.3. Matching anywhere within a string

Fixed patterns can also be searched for anywhere within character strings using grepl:

x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
grepl("spam", x, fixed=TRUE)  # fixed patterns, as opposed to regexes below
## [1]  TRUE  TRUE FALSE FALSE

Important

The order of arguments is like grepl(needle, haystack), not vice versa. Also, this function is not vectorised with respect to the first argument.

Exercise 6.2

How can the calls to grep(y, x, value=FALSE) and grep(y, x, value=TRUE) be implemented based on grepl and other operations we are already familiar with?

Note

(*) As a curiosity, agrepl performs approximate matching, which can account for a smöll nmber of tpyos.

agrepl("spam", x)
## [1]  TRUE  TRUE FALSE  TRUE
agrepl("ham", x, ignore.case=TRUE)
## [1] TRUE TRUE TRUE TRUE

It is based on Levenshtein’s edit distance that measures the number of character insertions, deletions, or substitutions required to turn one string into another.

6.2.4. Using regular expressions (*)

Setting perl=TRUE allows for identifying occurrences of patterns specified by regular expressions (regexes).

grepl("^spam", x, perl=TRUE)  # strings that begin with `spam`
## [1]  TRUE FALSE FALSE FALSE
grepl("(?i)^spam|spam$", x, perl=TRUE)  # begin or end; case ignored
## [1]  TRUE  TRUE  TRUE FALSE
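Because the backslash is special both in ordinary R strings and in regexes, patterns often require double escaping. Raw character constants avoid this. A short sketch, assuming R 4.0 or later (the file names are made up):

```r
identical("\\.csv$", r"(\.csv$)")  # two ways to input the same pattern
## [1] TRUE
grepl(r"(\.csv$)", c("iris.csv", "iris.csv.gz", "csv"), perl=TRUE)
## [1]  TRUE FALSE FALSE
```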

Note

For more details on regular expressions in general, see, e.g., [24]. The ultimate reference on the PCRE2 pattern syntax is the Unix man page pcre2pattern(3). From now on, we assume that the reader is familiar with it.

Apart from the Perl-compatible regexes, R also gives access to the TRE library (ERE-like), which is the default one; see help("regex"). However, we discourage its use because it is feature-poorer.

Exercise 6.3

The list.files function generates the list of file names in a given directory that match a given regular expression. For instance, the following gives all CSV files in a folder:

list.files("~/Projects/teaching-data/r/", "\\.csv$")
## [1] "air_quality_1973.csv" "anscombe.csv"         "iris.csv"
## [4] "titanic.csv"          "tooth_growth.csv"     "trees.csv"
## [7] "world_phones.csv"

Write a single regular expression that matches file names ending with “.csv” or “.csv.gz”. Also, scribble a regex that matches CSV files whose names do not begin with “eurusd”.

6.2.5. Locating pattern occurrences

regexpr finds the first occurrence of a pattern in each string:

regexpr("spam", x, fixed=TRUE)
## [1]  1  3 -1 -1
## attr(,"match.length")
## [1]  4  4 -1 -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

In particular, there is a pattern occurrence starting at the third code point of the second string in x. Moreover, the last two strings have no pattern match, which is denoted by -1.

The match.length attribute is generally more informative when searching with regular expressions.

To locate all the matches, i.e., globally, we use gregexpr:

# `spam` followed by 0 or more letters, case insensitively
gregexpr("(?i)spam\\p{L}*", x, perl=TRUE)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1]  3 12
## attr(,"match.length")
## [1] 8 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] 7
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[4]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

As we noted in Section 4.4.2, wrapping the results in a list was a clever choice because the number of matches can obviously vary between strings.

In Section 7.2, we will look at the Map function, which, along with substring introduced below, can aid in getting the most out of such data. Meanwhile, let’s just mention that regmatches extracts the matching substrings:

regmatches(x, gregexpr("(?i)spam\\p{L}*", x, perl=TRUE))
## [[1]]
## [1] "spam"
##
## [[2]]
## [1] "spammite" "spam"
##
## [[3]]
## [1] "SPAM"
##
## [[4]]
## character(0)
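Two further uses of regmatches can be sketched as follows (the variable m is ours). Note that, when fed with the output of regexpr, strings without a match are dropped from the result:

```r
x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
m <- gregexpr("(?i)spam\\p{L}*", x, perl=TRUE)
lengths(regmatches(x, m))  # how many matches are there in each string?
## [1] 1 2 1 0
regmatches(x, regexpr("(?i)spam\\p{L}*", x, perl=TRUE))  # first matches only
## [1] "spam"     "spammite" "SPAM"
```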

Note

(*) Consider what happens when a regular expression contains parenthesised subexpressions (capture groups).

r <- "(?<basename>[^. ]+)\\.(?<extension>[^ ]*)"

This regex consists of two capture groups separated by a dot. The first one is labelled “basename”. It comprises one or more arbitrary characters except for spaces and dots. The second group, named “extension”, is a (possibly empty) substring consisting of anything but spaces.

Such a pattern can be used for unpacking space-delimited lists of file names.

z <- "dataset.csv.gz something_else.txt spam"
regexpr(r, z, perl=TRUE)
## [1] 1
## attr(,"match.length")
## [1] 14
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
##      basename extension
## [1,]        1         9
## attr(,"capture.length")
##      basename extension
## [1,]        7         6
## attr(,"capture.names")
## [1] "basename"  "extension"
gregexpr(r, z, perl=TRUE)
## [[1]]
## [1]  1 16
## attr(,"match.length")
## [1] 14 18
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
## attr(,"capture.start")
##      basename extension
## [1,]        1         9
## [2,]       16        31
## attr(,"capture.length")
##      basename extension
## [1,]        7         6
## [2,]       14         3
## attr(,"capture.names")
## [1] "basename"  "extension"

The capture.* attributes give us access to the matches to the individual capture groups, i.e., the basename and the extension.
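One possible way to extract them, sketched with substring (Section 6.3.1); the helper variables m, cs, and cl are ours:

```r
r <- "(?<basename>[^. ]+)\\.(?<extension>[^ ]*)"
z <- "dataset.csv.gz something_else.txt spam"
m <- gregexpr(r, z, perl=TRUE)[[1]]
cs <- attr(m, "capture.start")   # matrices with one row per match
cl <- attr(m, "capture.length")
substring(z, cs[, "basename"], cs[, "basename"] + cl[, "basename"] - 1)
## [1] "dataset"        "something_else"
substring(z, cs[, "extension"], cs[, "extension"] + cl[, "extension"] - 1)
## [1] "csv.gz" "txt"
```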

Exercise 6.4

(*) Check out the difference between the results generated by regexec and regexpr as well as between the outputs of gregexec and gregexpr.

6.2.6. Replacing pattern occurrences

sub and gsub replace the first match and all matches to a pattern, respectively:

x <- c("spam", "y spammite spam", "yummy SPAM", "sram")
sub("spam", "ham", x, fixed=TRUE)
## [1] "ham"            "y hammite spam" "yummy SPAM"     "sram"
gsub("spam", "ham", x, fixed=TRUE)
## [1] "ham"           "y hammite ham" "yummy SPAM"    "sram"
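The distinction between fixed patterns and regexes matters whenever the pattern includes characters that are special in regular expressions. A quick sketch with a dot (the input string is arbitrary):

```r
gsub(".", "-", "a.b", fixed=TRUE)   # a literal dot
## [1] "a-b"
gsub(".", "-", "a.b", perl=TRUE)    # regex: a dot matches any character
## [1] "---"
gsub("\\.", "-", "a.b", perl=TRUE)  # regex: an escaped dot is a literal dot
## [1] "a-b"
```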

Note

(*) If a regex defines capture groups, matches thereto can be mentioned not only in the pattern itself but also in the replacement string:

gsub("(\\p{L})\\p{L}\\1", "\\1", "aha egg gag NaN spam", perl=TRUE)
## [1] "a egg g N spam"

Matched are, in the following order: a letter (it is a capture group), another letter, and the former letter again. Each such palindrome of length three is replaced with just the repeated letter.

Exercise 6.5

(*) Display the source code of glob2rx by calling print(glob2rx) and study how this function converts wildcards such as file???.* or *.csv to regular expressions that can be passed to, e.g., list.files.

6.2.7. Splitting strings into tokens

strsplit divides each string in a character vector into chunks.

strsplit(c("spam;spam;eggs;;bacon", "spam"), ";", fixed=TRUE)
## [[1]]
## [1] "spam"  "spam"  "eggs"  ""      "bacon"
##
## [[2]]
## [1] "spam"

Note that this time the search pattern specifying the token delimiter is given as the second argument (an inconsistency).
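The delimiter may also be a regular expression, and an empty pattern splits a string into individual characters. A short sketch (the inputs and the variable chars are ours):

```r
strsplit("spam, bacon ;eggs", "\\s*[,;]\\s*", perl=TRUE)[[1]]  # regex delimiter
## [1] "spam"  "bacon" "eggs"
chars <- strsplit("spam", "")[[1]]  # individual characters
paste(rev(chars), collapse="")      # the string reversed
## [1] "maps"
```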

6.3. Other string operations

6.3.1. Extracting substrings

substring extracts parts of strings between given character position ranges.

substring("spammity spam", 1, 4)  # from the first to the fourth character
## [1] "spam"
substring("spammity spam", 10)  # from the tenth to end
## [1] "spam"
substring("spammity spam", c(1, 10), c(4, 14))  # vectorisation
## [1] "spam" "spam"
substring(c("spammity spam", "bacon and eggs"), 1, c(4, 5))
## [1] "spam"  "bacon"
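Combined with nchar, the same function can fetch, e.g., the last few characters of each string. A small sketch (the vector y is ours):

```r
y <- c("spammity spam", "bacon and eggs")
substring(y, nchar(y)-3, nchar(y))  # the last four characters of each string
## [1] "spam" "eggs"
substring("spam", 1:4, 1:4)         # each character separately
## [1] "s" "p" "a" "m"
```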

Note

There is also a replacement (compare Section 9.3.6) version of the foregoing:

x <- "spam, spam, bacon, and spam"
substring(x, 7, 11) <- "eggs"
print(x)
## [1] "spam, eggs, bacon, and spam"

Unfortunately, the number of characters in the replacement string should not exceed the length of the part being substituted (try "chickpeas" instead of "eggs"). However, substring replacement can be written as a composition of substring extraction and concatenation:

paste(substring(x, 1, 6), "chickpeas", substring(x, 11), sep="")
## [1] "spam, chickpeas, bacon, and spam"

Exercise 6.6

Take the output generated by regexpr and apply substring to extract the pattern occurrences. If there is no match in a string, the corresponding output should be NA.

6.3.2. Translating characters

tolower and toupper convert between lower and upper case:

toupper("spam")
## [1] "SPAM"

Note

Like many other string operations in base R, these functions perform very simple character substitutions. They might not be valid in natural language processing tasks. For instance, groß is not converted to GROSS, being the correct case folding in German.

Moreover, chartr translates individual characters:

chartr("\\", "/", "c:\\windows\\system\\cmd.exe")  # chartr(old, new, x)
## [1] "c:/windows/system/cmd.exe"
chartr("([S", ")]*", ":( :S :[")
## [1] ":) :* :]"

In the first line, we replace each backslash with a slash. The second example replaces “(”, “[”, and “S” with “)”, “]”, and “*”, respectively.
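As a further illustration, here is a sketch of the ROT13 substitution cipher expressed via chartr. The translation tables old and new are constructed by us from the built-in letters and LETTERS constants:

```r
old <- paste(c(letters, LETTERS), collapse="")
new <- paste(c(letters[c(14:26, 1:13)], LETTERS[c(14:26, 1:13)]), collapse="")
(y <- chartr(old, new, "Spam"))  # each letter shifted by 13 positions
## [1] "Fcnz"
chartr(old, new, y)              # applying the cipher twice restores the input
## [1] "Spam"
```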

6.3.3. Ordering strings

We have previously mentioned that operators and functions such as `<`, `>=`, sort, order, rank, and xtfrm[4] are based on the lexicographic ordering of strings.

sort(c("chłodny", "hardy", "chladný", "hladný"))
## [1] "chladný" "chłodny" "hardy"   "hladný"

It is worth noting that the ordering depends on the currently selected locale; see Sys.getlocale("LC_COLLATE"). For instance, in the Slovak language setting, we would obtain "hardy" < "hladný" < "chladný" < "chłodny".
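For a locale-independent alternative, we may consider sort with method="radix", which, according to help("sort"), does not respect the locale when collating strings. A quick sketch with ASCII letters only, where all upper case letters precede the lower case ones:

```r
sort(c("b", "A", "a", "B"), method="radix")  # by code point, not by locale
## [1] "A" "B" "a" "b"
```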

Note

Many “structured” data items can be displayed or transmitted as human-readable strings. In particular, we know that as.numeric can convert a string to a number. Moreover, Section 10.3.1 will discuss date-time objects such as "1970-01-01 00:00:00 GMT". We will be processing them with specialised functions such as strptime and strftime.

Important

(*) Many string operations in base R are not necessarily portable. The stringx package defines drop-in, “fixed” replacements therefor. They are based on the International Components for Unicode (ICU) library, a de facto standard for processing Unicode text, and the R package stringi; see [26].

# call install.packages("stringx") first
suppressPackageStartupMessages(library("stringx"))  # load the package
sort(c("chłodny", "hardy", "chladný", "hladný"), locale="sk_SK")
## [1] "hardy"   "hladný"  "chladný" "chłodny"
toupper("gro\u00DF")  # compare base::toupper("gro\u00DF")
## [1] "GROSS"
detach("package:stringx")  # remove the package from the search path

6.4. Other atomic vector types (*)

We have discussed four vector types: logical, double, character, and list. To get a more complete picture of the sequence-like types in R, let’s briefly mention integer, complex, and raw atomic types so that we are not surprised when we encounter them.

6.4.1. Integer vectors (*)

Integer scalars can be input manually by using the L suffix:

(x <- c(1L, 2L, -1L, NA_integer_))  # looks like numeric
## [1]  1  2 -1 NA
typeof(x)  # but is integer
## [1] "integer"

Some functions return them in a few contexts[5]:

typeof(1:10)  # seq(1, 10) as well, but not seq(1, 10, 1)
## [1] "integer"
as.integer(c(-1.1, 0, 1.9, 2.1))  # truncate/round towards 0
## [1] -1  0  1  2

In most expressions, integer vectors behave like numeric ones. They are silently coerced to double if need be. Usually, there is no practical[6] reason to distinguish between them. For example:

1L/2L  # like 1/2 == 1.0/2.0
## [1] 0.5
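The following sketch shows when the integer type is preserved and when a coercion to double takes place:

```r
typeof(1L + 2L)    # integer arithmetic stays integer
## [1] "integer"
typeof(1L + 2)     # mixing with a double yields a double
## [1] "double"
typeof(1L / 2L)    # the division operator always returns a double
## [1] "double"
typeof(7L %/% 2L)  # but integer division does not
## [1] "integer"
identical(1L, 1)   # different types, hence not identical
## [1] FALSE
1L == 1            # yet equal in value
## [1] TRUE
```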

Note

(*) R integers are 32-bit signed types. In the double type, we can store more of them. The maximal contiguously representable integer is \(2^{31}-1\) and \(2^{53}\), respectively; see Section 3.2.3:

as.integer(2^31-1) + 1L  # 32-bit integer overflow
## Warning in as.integer(2^31 - 1) + 1L: NAs produced by integer overflow
## [1] NA
as.integer(2^31-1) + 1 == 2^31 # integer+double == double – OK
## [1] TRUE
(2^53 - 1) + 1 == 2^53  # OK
## [1] TRUE
(2^53 + 1) - 1 == 2^53  # lost due to FP rounding; left side equals 2^53 - 1
## [1] FALSE

Note

Since R 3.0, there is support for vectors longer than \(2^{31}-1\) elements. As there are no 64-bit integers in R, long vectors are indexed by doubles (as we have been doing all this time). In particular, x[1.9] is the same as x[1], and x[-1.9] means x[-1], i.e., the fractional part is truncated. This is why the notation like x[length(x)*0.2] works, whether the length of x is a multiple of five or not.
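A quick check of the truncation of fractional indexes, using a short example vector of our own:

```r
x <- c(10, 20, 30, 40, 50, 60, 70)
x[1.9]            # same as x[1]
## [1] 10
x[-1.9]           # same as x[-1]
## [1] 20 30 40 50 60 70
x[length(x)*0.2]  # 7*0.2 is about 1.4, which is truncated to 1
## [1] 10
```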

6.4.2. Raw vectors (*)

Vectors of the type raw can store bytes, i.e., unsigned 8-bit integers, whose range is 0–255. For example:

as.raw(c(-1, 0, 1, 2, 0xc0, 254, 255, 256, NA))
## Warning: out-of-range values treated as 0 in coercion to raw
## [1] 00 00 01 02 c0 fe ff 00 00

They are displayed as two-digit hexadecimal (base-16) numbers. There are no raw NAs.

Only a few functions deal with such vectors: e.g., readBin, charToRaw, and rawToChar.
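For example, the bytes that constitute an ASCII string can be inspected and turned back into text as follows (the variable b is ours):

```r
(b <- charToRaw("spam"))  # the bytes that make up the string
## [1] 73 70 61 6d
as.integer(b)             # the same values in decimal
## [1] 115 112  97 109
rawToChar(rev(b))         # from bytes back to a string
## [1] "maps"
```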

Interestingly, the meaning of the logical operators differs for raw vectors; they denote bitwise operations. See also bitwAnd, bitwOr, etc., which work on integer vectors.

xor(as.raw(0xf0), as.raw(0x0f))
## [1] ff
bitwXor(0x0fff0f00, 0x0f00f0ff)
## [1] 16777215

Example 6.7

(*) One use case of bitwise operations is for representing a selection of items in a small set of possible values. This can be useful for communicating with routines implemented in C/C++. For instance, let’s define three flags:

HAS_SPAM  <- 0x01  # binary 00000001
HAS_BACON <- 0x02  # binary 00000010
HAS_EGGS  <- 0x04  # binary 00000100

Now a particular subset can be created using the bitwise OR:

dish <- bitwOr(HAS_SPAM, HAS_EGGS)  # {spam, eggs}

Testing for inclusion is done via the bitwise AND:

as.logical(bitwAnd(dish, c(HAS_SPAM, HAS_BACON, HAS_EGGS)))
## [1]  TRUE FALSE  TRUE
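Continuing the above idea, a subset can be decoded into a readable form and then modified. In this sketch, the named vector flags is our own helper, and the three constants are restated to keep the snippet self-contained:

```r
HAS_SPAM <- 0x01; HAS_BACON <- 0x02; HAS_EGGS <- 0x04
flags <- c(spam=HAS_SPAM, bacon=HAS_BACON, eggs=HAS_EGGS)
dish <- bitwOr(HAS_SPAM, HAS_EGGS)
names(flags)[as.logical(bitwAnd(dish, flags))]  # which items are included?
## [1] "spam" "eggs"
dish <- bitwOr(dish, HAS_BACON)  # add bacon
dish <- bitwXor(dish, HAS_SPAM)  # toggle spam (here: remove it)
names(flags)[as.logical(bitwAnd(dish, flags))]
## [1] "bacon" "eggs"
```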

6.4.3. Complex vectors (*)

We can also play with vectors of the type complex, with “1i” representing the imaginary unit, \(\sqrt{-1}\). Complex numbers appear in quite a few engineering or scientific applications, e.g., in physics, electronics, or signal processing. They are (at least: ought to be) part of introductory subjects or textbooks in university-level mathematics, including the statistics- and machine learning-orientated ones because of their heavy use of numerical computing; see, e.g., [19, 30].

c(0, 1i, pi+pi*1i, NA_complex_)
## [1] 0.0000+0.0000i 0.0000+1.0000i 3.1416+3.1416i             NA

Apart from the basic operators, mathematical and aggregation functions, procedures like fft, solve, qr, or svd can be fed with or produce such data. For more details, see help("complex") and some matrix examples in Chapter 11.
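A few of the basic operations on complex numbers, sketched on an arbitrarily chosen value z:

```r
z <- 3+4i
c(Re(z), Im(z), Mod(z))  # real part, imaginary part, modulus
## [1] 3 4 5
Conj(z)                  # complex conjugate
## [1] 3-4i
sqrt(as.complex(-1))     # sqrt(-1) alone would give NaN with a warning
## [1] 0+1i
```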

6.5. Exercises

Exercises marked with (*) might require tinkering with regular expressions or third-party R packages.

Exercise 6.8

Answer the following questions.

  • How many characters are there in the string "ab\n\\\t\\\\\""? What about r"-{ab\n\\\t\\\\\"-)}-"?

  • What is the result of a call to paste(NA, 1:5, collapse="")?

  • What is the meaning of the following sprintf format strings: “%s”, “%20s”, “%-20s”, “%f”, “%g”, “%e”, “%5f”, “%5.2f%%”, “%.2f”, “%0+5f”, and “[%+-5.2f]”?

  • What is the difference between regexpr and gregexpr? What does “g” in the latter function name stand for?

  • What is the result of a call to grepl(c("spam", "spammity spam", "aubergines"), "spam")?

  • Is it always the case that “"Aaron" < "Zorro"”?

  • Why may “x < "10"” and “x < 10” return different results?

  • If x is a character vector, is “x == x” always equal to TRUE?

  • If x and y are character vectors of lengths \(n\) and \(m\), respectively, what is the length of the output of match(x, y)?

  • If x is a named vector, why is there a difference between x[NA] and x[NA_character_]?

  • What is the difference between “x == y” and “x %in% y”?

Exercise 6.9

Let x, y, and z be atomic vectors and a and b be single strings. Generate the same results as pastena(x, collapse=b), pastena(x, y, sep=a), pastena(x, y, sep=a, collapse=b), pastena(x, y, z, sep=a), pastena(x, y, z, sep=a, collapse=b), assuming that pastena is a version of paste (which we do not have) that handles missing data in a way consistent with most other functions.

Exercise 6.10

Based on list.files and glob2rx, generate the list of all PDFs on your computer. Then, use file.size to filter out the files smaller than 10 MiB.

Exercise 6.11

Read a text file that stores a long paragraph of some banal prose. Concatenate all the lines to form a single, long string. Using strwrap and cat, output the paragraph on the console, nicely formatted to fit a block of text of an aesthetic width, say, 60 columns.

Exercise 6.12

(*) Implement a simplified version of basename and dirname.

Exercise 6.13

(*) Implement an operation similar to trimws using the functions introduced in this chapter.

Exercise 6.14

(*) Write a regex that extracts all words from each string in a given character vector.

Exercise 6.15

(*) Write a regex that extracts, from each string in a character vector, all:

  • integer numbers (signed or unsigned),

  • floating-point numbers,

  • numbers of any kind (including those in scientific notation),

  • #hashtags,

  • email@address.es,

  • hyperlinks of the form http://… and https://….

Exercise 6.16

(*) What do 42i, 42L, and 0x42 stand for?

Exercise 6.17

(*) Check out stri_sort in the stringi package (or sort.character in stringx) for a way to obtain an ordering like "a1" < "a2" < "a10" < "a11" < "a100".

Exercise 6.18

(*) In sprintf, the formatter "%20s" means that if a string is less than 20 bytes long, it will be padded with spaces up to that width. Only for ASCII characters (English letters, digits, some punctuation marks, etc.) is it true that one character is represented by one byte. Other Unicode code points can take up between two and four bytes.

cat(sprintf("..%6s..", c("abc", "1!<", "aßc", "ąß©")), sep="\n")  # aligned?
## ..   abc..
## ..   1!<..
## ..  aßc..
## ..ąß©..

Use the stri_pad function from the stringi package to align the strings aesthetically. Alternatively, check out sprintf from stringx.

Exercise 6.19

(*) Implement an operation similar to stri_pad from stringi using the functions introduced in this chapter.