# 2. Numeric vectors#

The open-access textbookDeep R Programmingby Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project.This book is still a work in progress.Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime,any bug/typos reports/fixes are appreciated.Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks.Also, check out my other book,Minimalist Data Wrangling with Python[26].

In this chapter, we discuss the uttermost common operations
on numeric vectors. They are so fundamental that we will also find them
in other scientific computing environments, including Python
with **NumPy** or **TensorFlow**,
Julia, MATLAB, GNU Octave, or Scilab.

At first blush, the number of functions we are going to explore may seem quite large. Still, the reader is kindly asked to place some trust (a rare thing these days) in yours truly. It is because our selection is comprised only of the most representative and educational amongst the plethora of possible choices. More complex building blocks can often be reduced to a creative combination of the former. Also, sometimes they may easily found in a number of additional packages or libraries (e.g., the GNU GSL [28]).

A solid understanding of base R programming is crucial for dealing with
popular packages (such as **data.table**, **dplyr**,
or **caret**). Most importantly, base R’s API is *stable*.
Hence, the code we compose today will most likely work the same way in
ten years. It is often not the case when we rely on third-party add-ons.

In the sequel, we will be advocating a minimalist, keep-it-simple approach to the art of programming data processing pipelines, one that is a healthy balance between “doing it all by oneself”, “minimising the information overload”, “being lazy”, and “standing on the shoulders of giants”.

Note

The exercises that we suggest below are all self-contained unless explicitly
stated otherwise. The use of language constructs that are yet to be formally
introduced (in particular, **if**, **for**,
and **while** explained in Chapter 8)
is not just unnecessary: it is discouraged.
Moreover, we recommend against taking shortcuts
by looking up partial solutions on the internet.
Rather, to get the most out of this course,
the reader should be seeking relevant information within the current and
preceding chapters as well as the R help system.

## 2.1. Creating numeric vectors#

### 2.1.1. Numeric constants#

The simplest numeric vectors are those of length one:

```
-3.14
## [1] -3.14
1.23e-4
## [1] 0.000123
```

The latter is in what we call *scientific notation*,
which is a convenient means of entering numbers of very
large or small orders of magnitude.
Here, “`e`

” stands for “… times 10 to the power of…”.
Therefore, `1.23e-4`

is equal to \(1.23\times 10^{-4}=0.000123\).
In other words, given \(1.23\), we move the decimal separator by
4 digits towards the left.

In real life, some information items
may be inherently or temporarily missing, unknown, or `N`

ot `A`

vailable.
R is data processing-oriented. Hence, it is equipped with
a special indicator:

```
NA_real_ # numeric NA (missing value)
## [1] NA
```

It is similar to the *Null* marker in database query languages
such as SQL. Note that `NA_real_`

is displayed simply as “`NA`

”, chiefly for readability.

Moreover, `Inf`

denotes infinity (\(\infty\); a value that is
larger than the largest representable double-precision
floating point number, i.e., 64 bit).
Also, `NaN`

stands for *not-a-number* (it is returned as the result
of some illegal operations, e.g., \(0/0\) or \(\infty-\infty\)).

### 2.1.2. Concatenating vectors with **c**#

Let us provide some ways to create numeric vectors with possibly more than one element.

First, the **c** function that we introduced in the previous chapter
can be used to *c*ombine (concatenate) many numeric vectors, each
of any length, so as to form a single object:

```
c(1, 2, 3) # 3 vectors of length 1 -> 1 vector of length 3
## [1] 1 2 3
c(1, c(2, NA_real_, 4), 5, c(6, c(7, Inf)))
## [1] 1 2 NA 4 5 6 7 Inf
```

Note

Running **help**`("c")`

, we will see that its usage is like
“**c**`(...)`

”. In the current context,
this means that the **c** function
takes an arbitrary number of arguments.
In Section 9.5.6, we will study
the dot-dot-dot (ellipsis) parameter in more detail.

### 2.1.3. Repeating entries with **rep**#

Second, **rep** *rep*licates the elements
in a given vector a given number of times.

```
rep(1, 5)
## [1] 1 1 1 1 1
rep(c(1, 2, 3), 4)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3
```

In the second case, the whole vector (1, 2, 3) has been *recycled*
(tiled) four times. Interestingly, if the second argument was a vector
of the same length as the first one, the behaviour would be quite different:

```
rep(c(1, 2, 3), c(2, 1, 4))
## [1] 1 1 2 3 3 3 3
rep(c(1, 2, 3), c(4, 4, 4))
## [1] 1 1 1 1 2 2 2 2 3 3 3 3
```

Here, *each* element is repeated the *corresponding* number of times.

If we call **help**`("rep")`

, we will come across the notion like
“**rep**`(x, ...)`

” in the *Usage* section.
It is rather peculiar. However, reading further,
we discover the dot-dot-dot stands for one of the following
further parameters (see the *Arguments* section):

`times`

,`length.out`

,`each`

.

So far, we have been playing with `times`

,
which is listed second in the parameter list (after `x`

– the vector
whose elements are to be repeated).

Important

It turns out that the following function calls are all equivalent:

```
rep(c(1, 2, 3), 4) # positional matching of arguments: `x`, then `times`
rep(c(1, 2, 3), times=4) # `times` is the second argument
rep(x=c(1, 2, 3), times=4) # keyword arguments of the form name=value
rep(times=4, x=c(1, 2, 3)) # keyword arguments can be given in any order
rep(times=4, c(1, 2, 3)) # mixed positional and keyword arguments
```

We can also pass `each`

or `length.out`

(a dot has no special meaning in R; see Section 2.2),
but their names must be mentioned explicitly:

```
rep(c(1, 2, 3), length.out=7)
## [1] 1 2 3 1 2 3 1
rep(c(1, 2, 3), each=3)
## [1] 1 1 1 2 2 2 3 3 3
rep(c(1, 2, 3), length.out=7, each=3)
## [1] 1 1 1 2 2 2 3
```

Note

Whether it was a good programming practice to actually
implement a range of varied behaviours inside a single
function is a matter of taste.
On the one hand, in all of the examples above, we do repeat the input elements
somehow, so remembering just one function name is really convenient.
Nevertheless, a drastic change in the repetition pattern
depending, e.g., on the length of the `times`

argument can be bug-prone.
Anyway, we have been warned[1].

Zero-length vectors are possible, too:

```
rep(c(1, 2, 3), 0)
## numeric(0)
```

Even though their handling might be a little tricky (compare Chapter 9), we will see later that they are indispensable in contexts like “create an empty data frame with a specific column structure”.

Also, note that R often allows for partial matching of named arguments, but its use is a bad programming practice; see Section 15.4.4 for more details.

```
rep(c(1, 2, 3), len=7) # not recommended (see later)
## Warning in rep(c(1, 2, 3), len = 7): partial argument match of 'len' to
## 'length.out'
## [1] 1 2 3 1 2 3 1
```

The only reason we see the warning message is that
we have manually set **options**`(warnPartialMatchArgs=TRUE)`

in our environment. It is not set by default.

### 2.1.4. Generating arithmetic progressions with **seq** and `**:**`#

Third, we can call the **seq** function
to create a sequence of equally-spaced numbers
(on a linear scale, i.e., an arithmetic progression).

```
seq(1, 15, 2)
## [1] 1 3 5 7 9 11 13 15
```

From the function’s help page, we discover that it features
the `from`

, `to`

, `by`

, `length.out`

parameters, amongst others.

Thus, the above call is equivalent to:

```
seq(from=1, to=15, by=2)
## [1] 1 3 5 7 9 11 13 15
```

Note that `to`

actually means “up to”:

```
seq(from=1, to=16, by=2)
## [1] 1 3 5 7 9 11 13 15
```

We can also pass `length.out`

instead of `by`

.
In such a case, the increments or decrements will be computed via the formula
`((to - from)/(length.out - 1))`

; this *default value*
is reported in the *Usage* section in **help**`("seq")`

.

```
seq(1, 0, length.out=5)
## [1] 1.00 0.75 0.50 0.25 0.00
```

Also, this:

```
seq(length.out=5) # default `from` is 1
## [1] 1 2 3 4 5
```

Arithmetic progressions with steps equal to 1 or -1
can also be generated via the `**:**` operator.

```
1:10 # seq(1, 10) or seq(1, 10, 1)
## [1] 1 2 3 4 5 6 7 8 9 10
-1:10 # seq(-1, 10) or seq(-1, 10, 1)
## [1] -1 0 1 2 3 4 5 6 7 8 9 10
-1:-10 # seq(-1, -10) or seq(-1, -10, -1)
## [1] -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
```

Let us highlight the order of precedence of this operator:
“`-1:10`

” means “`(-1):10`

” and not “`-(1:10)`

”;
compare Section 2.4.3.

Take a look at the manual page of **seq_along** and
**seq_len**. Determine whether they can easily be done without
having **seq**[2] at hand.

### 2.1.5. Generating pseudorandom numbers#

We can also generate sequences drawn independently from a range of univariate probability distributions.

```
runif(7) # uniform U(0, 1)
## [1] 0.287578 0.788305 0.408977 0.883017 0.940467 0.045556 0.528105
rnorm(7) # normal N(0, 1)
## [1] 1.23950 -0.10897 -0.11724 0.18308 1.28055 -1.72727 1.69018
```

These correspond to seven pseudorandom deviates following the uniform distribution on the unit interval (i.e., \((0, 1)\)) and the standard normal distribution (i.e., with expectation 0 and standard deviation 1), respectively; compare Figure 2.3.

For more *named* distribution classes (frequently occurring in various
real-world statistical modelling exercises), see
Section 2.3.4.

Another worthwhile function samples a number of values from a given vector, either with or without replacement:

```
sample(1:10, 20, replace=TRUE) # 20 with replacement (allow repetitions)
## [1] 3 3 10 2 6 5 4 6 9 10 5 3 9 9 9 3 8 10 7 10
sample(1:10, 5, replace=FALSE) # 5 without replacement (do not repeat)
## [1] 9 3 4 6 1
```

The distribution of the sampled values does not need to be uniform;
the `prob`

argument may be fed with a vector of the corresponding
probabilities. For example, here are 20 independent realisations
of the random variable \(X\) such that
\(\Pr(X=0)=0.9\) (the probability that we obtain 0 is equal to 90%)
and \(\Pr(X=1)=0.1\):

```
sample(0:1, 20, replace=TRUE, prob=c(0.9, 0.1))
## [1] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1
```

Note

If `n`

is a single number (a numeric vector of length 1), then
**sample**`(n, ...)`

is equivalent to
**sample**`(1:n, ...)`

.
Similarly, **seq**`(n)`

is a synonym for **seq**`(1, n)`

or **seq**`(1, `

**length**`(n))`

,
depending on the length of `n`

.
This is a dangerous behaviour that can occasionally backfire
and lead to bugs (check what happens when `n`

is, e.g., 0).
Nonetheless, we have been warned. From now on, we are going to
be extra careful (but are we really?).
Read more at **help**`("sample")`

and **help**`("seq")`

.

Let us stress that the numbers we obtain are merely *pseudo*random
because they are generated algorithmically.
R uses the Mersenne-Twister MT19937 method [46] by default;
see **help**`("RNG")`

and [21, 29, 42].
By setting the *seed* of the random number generator, i.e., resetting
its state to a given one, we can obtain results
that are *reproducible*.

```
set.seed(12345) # seeds are specified with integers
sample(1:10, 5, replace=TRUE) # a,b,c,d,e
## [1] 3 10 8 10 8
sample(1:10, 5, replace=TRUE) # f,g,h,i,j
## [1] 2 6 6 7 10
```

Setting the seed to the one used previously gives:

```
set.seed(12345)
sample(1:10, 5, replace=TRUE) # a,b,c,d,e
## [1] 3 10 8 10 8
```

We did not(?) expect that! And now for something completely different:

```
set.seed(12345)
sample(1:10, 10, replace=TRUE) # a,b,c,d,e,f,g,h,i,j
## [1] 3 10 8 10 8 2 6 6 7 10
```

Reproducibility is a crucial feature of each truly scientific experiment. The same initial condition (here: the same seed) leads to exactly the same outcomes.

Note

Some claim that the only unsuspicious seed is 42,
but each programmer can have their own picks.
Yours truly, for example, uses 123, 1234, and 12345 as well.
When performing many runs of Monte Carlo experiments,
it may be a clever idea to call **set.seed**`(i)`

in the \(i\)-th iteration of a simulation we are trying to program.

Anyhow, we should ensure that our seed settings are applied consistently across all our scripts. Otherwise, we might be accused of tampering with evidence. For instance, here is the ultimate proof that we are very lucky today:

```
set.seed(1679619) # totally unsuspicious, right?
sample(0:1, 20, replace=TRUE) # so random
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
```

This is exactly why reproducible scripts and auxiliary data should be published alongside all research reports or papers. Only open, transparent science can be fully trustworthy.

If **set.seed** is not called explicitly,
and the random state is not restored from the previously saved R session
(see Chapter 16), then the random generator is initialised
based on the current wall time and the identifier of the running R instance
(PID). This may give the impression that the numbers we generate are
surprising.

To understand the “pseudo” part of the said randomness better, in Section 8.3, we will build a very simple random generator ourselves.

### 2.1.6. Reading data with **scan**#

The example text file named `euraud-20200101-20200630.csv`

gives the EUR to AUD exchange rates
(how many Australian Dollars can one buy for 1 Euro)
from 1 January to 30 June 2020 (remember COVID-19?). Let us preview
the first couple of lines:

```
# EUR/AUD Exchange Rates
# Source: Statistical Data Warehouse of the European Central Bank System
# https://www.ecb.europa.eu/stats/policy_and_exchange_rates/
# (provided free of charge)
NA
1.6006
1.6031
NA
```

The four first lines that begin with “`#`

” merely serve as
comments for us humans. They should be ignored by the interpreter.
The first “real” value, `NA`

, corresponds to 1 January
(Wednesday, New Year’s Day; Forex markets were closed,
hence a missing observation).

The **scan** function can be used to read all the inputs
and convert them to a single numeric vector:

```
scan(paste0("https://github.com/gagolews/teaching-data/raw/",
"master/marek/euraud-20200101-20200630.csv"), comment.char="#")
## [1] NA 1.6006 1.6031 NA NA 1.6119 1.6251 1.6195 1.6193 1.6132
## [11] NA NA 1.6117 1.6110 1.6188 1.6115 1.6122 NA NA 1.6154
## [21] 1.6177 1.6184 1.6149 1.6127 NA NA 1.6291 1.6290 1.6299 1.6412
## [31] 1.6494 NA NA 1.6521 1.6439 1.6299 1.6282 1.6417 NA NA
## [41] 1.6373 1.6260 1.6175 1.6138 1.6151 NA NA 1.6129 1.6195 1.6142
## [51] 1.6294 1.6363 NA NA 1.6384 1.6442 1.6565 1.6672 1.6875 NA
## [61] NA 1.6998 1.6911 1.6794 1.6917 1.7103 NA NA 1.7330 1.7377
## [71] 1.7389 1.7674 1.7684 NA NA 1.8198 1.8287 1.8568 1.8635 1.8226
## [81] NA NA 1.8586 1.8315 1.7993 1.8162 1.8209 NA NA 1.8021
## [91] 1.7967 1.8053 1.7970 1.8004 NA NA 1.7790 1.7578 1.7596
## [ reached getOption("max.print") -- omitted 83 entries ]
```

We used the **paste0** function to concatenate two
long (too long to fit a single line of code)
strings to form a single URL; see Section 6.1.3.

We can also read the files located on our computer, for example:

```
scan("~/Projects/teaching-data/marek/euraud-20200101-20200630.csv",
comment.char="#")
```

uses an absolute file path that starts at the user’s home
directory, denoted “`~`

”: yours truly’s case is `/home/gagolews/`

.

Note

For portability reasons, we suggest slashes, “`/`

”, as path
separators (see also **help**`("file.path")`

and **help**`(".Platform")`

).
These recognised by all UNIX-like boxes as well by other
popular operating systems.
Note that URLs (such as https://www.r-project.org/)
consist of slashes as well.

Paths can also be relative to the current working directory,
denoted “`.`

”. It can be read via a call to **getwd**.
Usually, it is the directory from where the R session has been started.

For instance, if the working directory was
`/home/gagolews/Projects/teaching-data/marek`

,
we could have written the file path equivalently as
`./euraud-20200101-20200630.csv`

or even
`euraud-20200101-20200630.csv`

.

On as side note, `../`

would mark the parent directory
of the current working directory. For instance,
`../r/iris.csv`

would be equivalent
to `/home/gagolews/Projects/teaching-data/r/iris.csv`

.

Read the help page about **scan**. Take note of the following
formal arguments and their meaning: `dec`

, `sep`

, `what`

,
`comment.char`

, and `na.strings`

.

Later we will discuss the **read.table** and **read.csv**,
which are wrappers around **scan** that can be used to read
tabular data. Also, **write** can be used to export an atomic
vector’s contents to a text file.

Figure 2.1 shows the graph of the aforementioned exchange rates, which was generated by calling:

```
plot(scan(paste0("https://github.com/gagolews/teaching-data/raw/",
"master/marek/euraud-20200101-20200630.csv"), comment.char="#"),
xlab="Day", ylab="EUR/AUD")
```

Somewhat misleadingly (and for reasons that will
become apparent later), the documentation of **plot**
can be accessed by calling **help**`("plot.default")`

.
Read about, and experiment with, different values of the
`main`

, `xlab`

, `ylab`

, `type`

, `col`

, `pch`

, `cex`

, `lty`

, and `lwd`

arguments. More plotting routines will be discussed in
Chapter 13.

## 2.2. Creating named objects#

Often, the objects we bring forth will need to be memorised so that
they can be referred to in further computations. The assignment operator,
`**<-**`, can be used for this very purpose:

```
x <- 1:3 # creates a numeric vector and binds the name `x` to it
```

The now-named object can be recalled[3] and dealt with as we please:

```
print(x) # or just `x` in the R console
## [1] 1 2 3
sum(x) # example operation: compute the sum of all elements in `x`
## [1] 6
```

Important

In R, all names are *case-sensitive*.
Hence, `x`

and `X`

can coexist peacefully: when set,
they refer to two different objects.
Also, if we tried to call `Print(x)`

above, we would get an error.

Typically, we will be using what we refer to as *syntactic names*
(see Section 9.4.1 for an exception).
In the R help system (see **help**`("make.names")`

and also **help**`("Quotes")`

), we read:
*A syntactically valid name consists of letters, numbers and the
dot or underline characters and starts with a letter or the dot
not followed by a number. Names such as .2way are not valid,
and neither are the reserved words.* For the list of the latter,
see

**help**

`("Reserved")`

.A fine name is self-explanatory and thus reader-friendly:
`patients`

, `mean`

, and `average_scores`

are way better (if they
are what they claim they are) than `xyz123`

, `crap`

, or `spam`

.
Also, it might not be such a bad idea to get used to denoting:

vectors with

`x`

,`y`

,`z`

,matrices (and matrix-like objects) with

`A`

,`B`

, …,`X`

,`Y`

,`Z`

,integer indexes with letters

`i`

,`j`

,`k`

,`l`

,object sizes with

`n`

,`m`

,`d`

,`p`

or`nx`

,`ny`

, etc.,

especially when they are only of temporary nature (for storing some auxiliary results, iterating over collections of objects, etc.).

There are numerous naming conventions that we can adopt,
but most often they are a matter of taste;
`snake_case`

, `lowerCamelCase`

, `UpperCamelCase`

, `flatcase`

, or
`dot.case`

are equally sound as long as they are used coherently
(for instance, some use `snake_case`

for vectors
and `UpperCamelCase`

for functions). It may even be the case
that we have little choice but to adhere to the naming conventions
of the project we are about to contribute to.

Note

Let us stress that a dot, “`.`

”, has no special meaning (however, see
Chapter 10 and Chapter 16 for some asterisks);
`na.omit`

is as appropriate a name
as `na_omit`

, `naOmit`

, `NAOMIT`

, `naomit`

, and `NaOmit`

.
Users coming from some other (C, C++, Java, Python, etc.) programming
languages will need to habituate themselves to this convention.

R, as a dynamic language, allows for introducing new variables at any time. Moreover, existing names can be rebound to new values. For instance:

```
(y <- c(1, 10, 100)) # bracketed expression - printing not suppressed
## [1] 1 10 100
x <- y
print(x)
## [1] 1 10 100
```

Now `x`

refers to a verbatim copy of `y`

.

Note

Objects are automatically destroyed when there are
no more names bound to them. By now, the
*garbage collector* is likely to have got rid of the `1:3`

vector
begotten above (to which the name `x`

was bound previously). See
Section 14.2.4 for more details on memory management.

## 2.3. Vectorised mathematical functions#

Mathematically, we will be denoting a given vector \(\boldsymbol{x}\) of length \(n\) with \((x_1, x_2, \dots, x_n)\). In other words, its \(i\)-th element is equal to \(x_i\).

Let us review some ubiquitous operations in numerical computing.

### 2.3.1. **abs** and **sqrt**#

R implements *vectorised* versions of the most popular
mathematical functions, e.g., **abs** (absolute value, \(|x|\)) and
**sqrt** (square root, \(\sqrt{x}\)).

```
abs(c(2, -1, 0, -3, NA_real_))
## [1] 2 1 0 3 NA
```

Here, *vectorised* means that instead of being defined to act on a single
numeric value, the function of interest is applied on each element
in a vector. The \(i\)-th resulting item
is a transformed version of the \(i\)-th input.
If an input is a missing value, the corresponding output will be
marked as “don’t know” as well.

Another example:

```
x <- c(4, 2, -1)
(y <- sqrt(x))
## Warning in sqrt(x): NaNs produced
## [1] 2.0000 1.4142 NaN
```

To attract our attention to the fact that computing the square root of a negative value yields a not-a-number, R generated an informative warning. A warning is not an error: the result is being reckoned as usual.

Also the fact that the irrational \(\sqrt{2}\) is *displayed*[4]
as `1.4142`

does not mean that it is such a crude approximation
to \(1.414213562373095048801688724209698\dots\);
it is only rounded when printing for aesthetic reasons.
In fact, in Section 3.2.3, we will point out that the computer’s
floating-point arithmetic allows for roughly 16 decimal digits precision
(but we shall see that the devil is in the detail).

```
print(y, digits=16) # display more significant figures
## [1] 2.000000000000000 1.414213562373095 NaN
```

### 2.3.2. Rounding#

The following functions get rid of all or portions of fractional parts of numbers:

**floor**`(x)`

(rounds down to the nearest integer, denoted \(\lfloor x\rfloor\)),**ceiling**`(x)`

(rounds up, denoted \(\lceil x\rceil\)),**trunc**`(x)`

(rounds towards zero), and**round**`(x, digits=0)`

(rounds to the nearest number with`digits`

decimal digits).

For instance:

```
x <- c(7.0001, 6.9999, -4.3149, -5.19999, 123.4567, -765.4321, 0.5, 1.5, 2.5)
floor(x)
## [1] 7 6 -5 -6 123 -766 0 1 2
ceiling(x)
## [1] 8 7 -4 -5 124 -765 1 2 3
trunc(x)
## [1] 7 6 -4 -5 123 -765 0 1 2
```

Note

If we call **help**`("round")`

, we will read that its usage is like
`round(x, digits=0)`

, which means that the `digits`

parameter
is equipped with the *default value* of 0.
In other words, if rounding to 0 decimal digits is what we need,
the second argument can be omitted.

```
round(x) # the same as round(x, 0)
## [1] 7 7 -4 -5 123 -765 0 2 2
round(x, 1)
## [1] 7.0 7.0 -4.3 -5.2 123.5 -765.4 0.5 1.5 2.5
round(x, -2)
## [1] 0 0 0 0 100 -800 0 0 0
```

### 2.3.3. Natural exponential function and logarithm#

Moreover:

`exp`

`(x)`

outputs the natural exponential function, \(e^x\), where the Euler’s number \(e\simeq 2.718\),`log`

`(x, base=exp(1))`

computes, by default, the natural logarithm of \(x\), \(\log_e x\) (which is most often denoted simply as \(\log x\)).

Recall that if \(x=e^y\), then \(\log_e x = y\), i.e., one is the inverse of the other.

```
log(c(0, 1, 2.7183, 7.3891, 20.0855)) # grows slowly
## [1] -Inf 0 1 2 3
exp(c(0, 1, 2, 3)) # grows fast
## [1] 1.0000 2.7183 7.3891 20.0855
```

Note

These functions enjoy a number of very valuable identities and inequalities, including:

\(\log(x\cdot y)=\log x + \log y\),

\(\log(x^y) = y\log x\),

\(e^{x+y} = e^x\cdot e^y\).

For more properties like these, take a glance at Chapter 4 of the freely available handbook [50].

For the logarithm to a different base, say, \(\log_{10} x\), we can call:

```
log(c(0, 1, 10, 100, 1000, 1e10), 10) # or log(..., base=10)
## [1] -Inf 0 1 2 3 10
```

Note that if \(\log_{b} x=y\), then \(x=b^y\), for any \(1\neq b>0\).

Note

Commonly, a logarithmic scale is used for variables that grow rapidly when expressed as functions of each other; see Figure 2.2.

```
x <- seq(0, 10, length.out=1001)
par(mfrow=c(1, 2)) # two plots in one figure (1 row, 2 columns)
plot(x, exp(x), type="l")
plot(x, exp(x), type="l", log="y") # log-scale on the y-axis
```

Let us highlight that \(e^x\) on the log-scale is just a straight line. Also, keep in mind that such a transformation of the axes can only be applied in the case of values strictly greater than 0.

### 2.3.4. Probability distributions (*)#

It should come as no surprise that R offers extensive support for many univariate probability distribution families, including:

continuous distributions, which take values being arbitrary real numbers (over the whole possible range or in some interval):

`*unif`

(uniform),`*norm`

(normal),`*exp`

(exponential),`*gamma`

(gamma, \(\Gamma\)),`*beta`

(beta, \(\mathrm{B}\)),`*lnorm`

(log-normal),`*t`

(Student),`*cauchy`

(Cauchy–Lorentz),`*chisq`

(chi-squared, \(\chi^2\)),`*f`

(Snedecor–Fisher),`*weibull`

(Weibull);

with the prefix “

`*`

” being one of:“

`d`

” (probability density function, PDF),“

`p`

” (cumulative distribution function, CDF; or survival function, SF),“

`q`

” (quantile function, being the inverse of the CDF),“

`r`

” (generation of random deviates; already mentioned);

discrete distributions, i.e., those whose possible outcomes can be easily enumerated (e.g., some integers).

`*binom`

(binomial),`*geom`

(geometric),`*pois`

(Poisson),`*hyper`

(hypergeometric),`*nbinom`

(negative binomial);

here, prefixes “

`p`

” and “`r`

” have the same meaning as above, however:“

`d`

” now gives the probability*mass*function (PMF),“

`q`

” brings about the quantile function, defined as a*generalised*inverse of the CDF.

Each distribution is characterised by a set of underlying parameters.
For instance, a normal distribution \(\mathrm{N}(\mu, \sigma)\)
can be pinpointed by setting its expected value
\(\mu\in\mathbb{R}\) and standard deviation
\(\sigma>0\).
In R, these two have been named `mean`

and `sd`

, respectively;
see **help**`("dnorm")`

.

Note

The parametrisations assumed in R can be subtly different
from what we know from statistical textbooks or probability courses.
For example, the normal distribution can be parameterised
based on either standard deviation or variance,
and the exponential distribution can be defined via its expected
value or the reciprocal thereof.
We thus advise the reader to study carefully the documentation
of **help**`("dnorm")`

, **help**`("dunif")`

, **help**`("dexp")`

,
**help**`("dbinom")`

,
and the like.

It is also worth knowing the typical use cases of each of the distributions listed, e.g., a Poisson distribution can describe the probability of observing the number of independent events in a fixed time interval (e.g., the number of users downloading a copy of R from CRAN per hour), and an exponential distribution can model the time between such events; compare [23].

A call to **hist**`(x)`

draws a histogram,
which can serve as an estimator of the underlying continuous
probability density function of a given
sample; see Figure 2.3 for an illustration.

```
par(mfrow=c(1, 2)) # 2 plots in 1 figure
# Uniform U(0, 1)
hist(runif(10000, 0, 1), col="white", probability=TRUE, main="")
x <- seq(0, 1, length.out=101)
lines(x, dunif(x, 0, 1), lwd=2) # draw the true density function (PDF)
# Normal N(0, 1)
hist(rnorm(10000, 0, 1), col="white", probability=TRUE, main="")
x <- seq(-4, 4, length.out=101)
lines(x, dnorm(x, 0, 1), lwd=2) # draw the PDF
```

Draw a histogram of some random samples of different sizes `n`

from the following distributions:

`rnorm(n, µ, σ)`

– normal \(\mathrm{N}(\mu, \sigma)\) with expected values \(\mu\in\{-1, 0, 5\}\) (i.e., \(\mu\) being equal to either \(-1\), \(0\), or \(5\); read “\(\in\)” as “belongs to the given set” or “in”) and standard deviations \(\sigma\in\{0.5, 1, 5\}\);`runif(n, a, b)`

– uniform \(\mathrm{U}(a, b)\) on the interval \((a, b)\) with \(a=0\) and \(b=1\) as well as \(a=-1\) and \(b=1\);`rbeta(n, α, β)`

– beta \(\mathrm{B}(\alpha, \beta)\) with \(\alpha,\beta\in\{0.5, 1, 2\}\);`rexp(n, λ)`

– exponential \(\mathrm{E}(\lambda)\) with rates \(\lambda\in\{0.5, 1, 10\}\);

Moreover, read about and play with the
`breaks`

, `main`

, `xlab`

, `ylab`

, `xlim`

, `ylim`

, and `col`

parameters; see **help**`("hist")`

.

We roll a six-sided dice twelve times.
Let \(C\) be a random variable denoting the number of
cases where the “`1`

” face is thrown.
\(C\) follows a binomial distribution \(\mathrm{Bin}(n, p)\)
with parameters \(n=12\) (the number of Bernoulli trials)
and \(p=1/6\) (the probability of *success* in a single roll).

The probabilities that the number of “`1`

”s rolled will be equal to
\(0, 1, \dots, 4\), i.e., \(P(C=0)\), \(P(C=1)\), …, \(P(C=4)\), respectively,
can be computed based on the probability mass function (**dbinom**):

```
dbinom(0:4, 12, 1/6) # probability mass function at 5 different points
## [1] 0.112157 0.269176 0.296094 0.197396 0.088828
```

On the other hand, the probability that we throw
more than three “`1`

”s, \(P(C > 3) = 1-P(C\le 3)\),
can be determined by means of the cumulative distribution function
(**pbinom**) or survival function (`pbinom(..., lower.tail=FALSE)`

):

```
1-pbinom(3, 12, 1/6) # pbinom(3, 12, 1/6, lower.tail=FALSE)
## [1] 0.12518
```

The smallest \(c\) such that \(P(C \le c)\ge 0.95\) can be computed based on the quantile function:

```
qbinom(0.5, 12, 1/6)
## [1] 2
pbinom(3:4, 12, 1/6) # for comparison: 0.95 is in-between
## [1] 0.87482 0.96365
```

In other words, at least 95% of the time, we will be observing no more than four successes.

Also, here are some pseudorandom realisations of \(C\) –
the number of “`1`

”s in 30 simulations of 12 independent dice rolls each:

```
rbinom(30, 12, 1/6)
## [1] 1 3 2 4 4 0 2 4 2 2 4 2 3 2 0 4 1 0 1 4 4 3 2 6 2 3 2 2 1 1
```

### 2.3.5. Special functions (*)#

Within mathematical formulae and across assorted application areas, certain functions appear more frequently than others. Hence, for the sake of notational brevity and computational precision, many of them have been assigned special names. For instance, the following may be mentioned in the definitions related to some of the probability distributions listed above:

**gamma**`(x)`

for \(x>0\) computes \(\Gamma(x) = \int_0^\infty t^{x-1} e^{-t} \,dt\),**beta**`(a, b)`

for \(a,b>0\) yields \(B(a, b) = \frac{\Gamma(a) \Gamma(b)}{\Gamma(a+b)} = \int_0^1 t^{a-1} (1-t)^{b-1}\,dt\).

Why do we have **beta** if it is merely a mix of **gamma**s?
A specific, tailored function is expected to be faster and more precise
than its DIY version; its underlying implementation
does not have to involve any calls to **gamma** at all.

```
beta(0.25, 250) # okay
## [1] 0.91213
gamma(0.25)*gamma(250)/gamma(250.25) # not okay
## [1] NaN
```

The \(\Gamma\) function grows so rapidly that already
**gamma**`(172)`

gives rise to `Inf`

.
It is due to the fact that a computer’s arithmetic
is not infinitely precise; compare Section 3.2.3.

Special functions are plentiful; see the open-access [50]
for one of the most definitive references
(and also [2] for its predecessor).
R package **gsl** [33] provides a vectorised interface
to the GNU GSL [28] library, which implements
many of such routines.

The Pochhammer symbol, \((a)_x = \Gamma(a+x)/\Gamma(a)\),
can be computed via a call to
**gsl**`::`

**poch**`(a, x)`

(i.e., the **poch** function from the **gsl** package;
see Section 7.3.1):

```
# call install.packages("gsl") first
library("gsl") # load the package
poch(10, 3:6) # calls gsl_sf_poch() from GNU GSL
## [1] 1320 17160 240240 3603600
```

Read the documentation of the corresponding **gsl_sf_poch**
function in the GNU GSL manual available
here.

And since you are there, do not hesitate to go through the list of all the other functions, including those related to statistics, permutations, combinations, and so forth.

Many functions also have their logarithm-of versions;
see, e.g., **lgamma** and **lbeta**.
Also, for instance, **dnorm** and **dbeta** have the `log`

parameter. Its classical use case is the (numerical) maximum likelihood
estimation, which involves the sums of the *logarithms* of densities.

## 2.4. Arithmetic operations#

### 2.4.1. Vectorised arithmetic operators#

R features the following arithmetic operators:

`

**+**` (addition) and `**-**` (subtraction),`

*****` (multiplication) and `**/**` (division),`

**%/%**` (integer division) and `**%%**` (modulo, division remainder),`

**^**` (exponentiation; synonym: `******`).

They are all *vectorised*: they take two
vectors on input and produce another vector on output.

```
c(1, 2, 3) * c(10, 100, 1000)
## [1] 10 200 3000
```

The multiplication was performed in an *elementwise* fashion.
The first element in the left vector was multiplied
by the *corresponding* element in the right vector,
and the result was stored in the first element of the output
Then, the second element in the left… all right, we get the point.

Other operators are vectorised in the same manner:

```
0:10 + seq(0, 1, 0.1)
## [1] 0.0 1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9 11.0
0:7 / rep(3, length.out=8) # division by 3
## [1] 0.00000 0.33333 0.66667 1.00000 1.33333 1.66667 2.00000 2.33333
0:7 %/% rep(3, length.out=8) # integer division
## [1] 0 0 0 1 1 1 2 2
0:7 %% rep(3, length.out=8) # division remainder
## [1] 0 1 2 0 1 2 0 1
```

Operations involving missing values also yield `NA`

s:

```
c(1, NA_real_, 3, NA_real_) + c(NA_real_, 2, 3, NA_real_)
## [1] NA NA 6 NA
```

### 2.4.2. Recycling rule#

Some of the above statements can be written more concisely.
When the operands are of different lengths, the shorter one
is *recycled* (think:
**rep**`(y, length.out=`

**length**`(x))`

)
as many times as necessary.

```
0:7 / 3
## [1] 0.00000 0.33333 0.66667 1.00000 1.33333 1.66667 2.00000 2.33333
1:10 * c(-1, 1)
## [1] -1 2 -3 4 -5 6 -7 8 -9 10
2 ^ (0:10)
## [1] 1 2 4 8 16 32 64 128 256 512 1024
```

We call this the *recycling rule*.

If an operand cannot be recycled in its entirety, a warning[5] is generated, but the output is still available.

```
c(1, 10, 100) * 1:8
## Warning in c(1, 10, 100) * 1:8: longer object length is not a multiple of
## shorter object length
## [1] 1 20 300 4 50 600 7 80
```

Note

Some functions are also deeply vectorised, i.e., with respect to multiple arguments. For example,

```
runif(3, c(10, 20, 30), c(11, 22, 33))
## [1] 10.288 21.577 31.227
```

generates three random numbers uniformly distributed over the intervals \((10, 11)\), \((20, 22)\), and \((30, 33)\), respectively.

Also, **pmin** and **pmax** return the *parallel* minimum
and maximum of the corresponding elements of the input vectors:

```
pmin(c(1, 2, 3, 4), c(4, 2, 3, 1))
## [1] 1 2 3 1
pmin(3, 1:5)
## [1] 1 2 3 3 3
pmax(0, pmin(1, c(0.25, -2, 5, -0.5, 0, 1.3, 0.99))) # clipping to [0, 1]
## [1] 0.25 0.00 1.00 0.00 0.00 1.00 0.99
```

Note

Vectorisation and the recycling rule are perhaps most fruitful
when applying binary operators on sequences of identical lengths
or when performing vector-scalar (i.e., a sequence vs a single value)
operations. However, there is much more:
schemes like “every \(k\)-th element” appear in
Taylor series expansions (multiply by **c**`(-1, 1)`

),
\(k\)-fold cross-validation, etc.;
see also Section 11.3.4 for
use cases in matrix/tensor processing.

### 2.4.3. Operator precedence#

Apart from the seven binary arithmetic operators,
other noteworthy, already mentioned ones include:
the unary `**-**` (change of sign),
`**:**` (sequence generation),
and `**<-**` (assignment).

Expressions involving multiple operations need a set of rules
governing the order of computations (unless we enforce it
using round brackets). We have said that “`-1:10`

”
means “`(-1):10`

” rather than “`-(1:10)`

”.
But what about, say, “`1+1+1+1+1*0`

” or “`3*2^0:5+10`

”?

Let us list the aforementioned operators in their order of precedence,
from the least to the most binding (see also **help**`("Syntax")`

):

`

**<-**` (right-to-left),`

**+**` and `**-**`,`

*****` and `**/**`,`

**%%**` and `**%/%**`,`

**:**`,`

**+**` and `**-**` (unary),`

**^**` (right-to-left).

Hence, “`-2^2/3+3*4`

” means “`((-(2^2))/3)+(3*4)`

” and not, for example,
`-((2^(2/(3+3)))*4)`

.

Notice that `**+**` and `**-**`,
`*****` and `**/**`,
as well as `**%%**` and `**%/%**`
have the same priority. Expressions involving a series of
operations in the same group
are evaluated left to right, with the exception of `**^**`
and `**<-**`,
which are performed from right to left.

Therefore:

“

`2*3/4*5`

” is equivalent to “`((2*3)/4)*5`

”,“

`2^3^4`

” is the same as “`2^(3^4)`

” (which, mathematically, we would write as \(2^{3^4}=2^{81}\)),“

`x <- y <- 4*3%%8/2`

” binds both`y`

and`x`

with`6`

and not`x`

with the previous value of`y`

.

And let us remember: when in doubt, we can always bracket a subexpression to ensure it is executed in the intended order (which can also increase the readability of the code).

### 2.4.4. Accumulating#

The `**+**` and `*****` operators,
as well as the **pmin** and **pmax** functions,
implement elementwise operations that are applied on the corresponding
elements taken from two given vectors:

However, we can also scan through all the values
in a *single* vector and combine the successive elements that we inspect
using the corresponding operation:

**cumsum**`(x)`

gives the cumulative sum of the elements in a vector,**cumprod**`(x)`

computes the cumulative product,**cummin**`(x)`

yields the cumulative minimum,**cummax**`(x)`

breeds the cumulative maximum.

The \(i\)-th element in the output vector will consist of the sum/product/min/max of the first \(i\) inputs:

For example:

```
cumsum(1:8)
## [1] 1 3 6 10 15 21 28 36
cumprod(1:8)
## [1] 1 2 6 24 120 720 5040 40320
cummin(c(3, 2, 4, 5, 1, 6, 0))
## [1] 3 2 2 2 1 1 0
cummax(c(3, 2, 4, 5, 1, 6, 0))
## [1] 3 3 4 5 5 6 6
```

If we are interested only in the last cumulant, summarising all the inputs, we have the following functions at our disposal:

**sum**`(x)`

computes the sum of elements in a vector, \(\sum_{i=1}^n x_i=x_1+x_2+\cdots+x_n\),**prod**`(x)`

outputs the product of all elements, \(\prod_{i=1}^n x_i=x_1 x_2 \cdots x_n\),**min**`(x)`

computes the minimum,**max**`(x)`

reckons the greatest value.

For example:

```
sum(1:8)
## [1] 36
prod(1:8)
## [1] 40320
min(c(3, 2, 4, 5, 1, 6, 0))
## [1] 0
max(c(3, 2, 4, 5, 1, 6, 0))
## [1] 6
```

Note

Chapter 7 will discuss the **Reduce** function,
which generalises the above by allowing any binary operation
to be propagated over a given vector.

**diff** can be considered an inverse to **cumsum**: it computes
the iterated difference. Namely, it subtracts the first two elements,
then the second from the third one, the third from the fourth, and so on.
In other words, **diff**`(x)`

gives \(\boldsymbol{y}\) such that \(y_i=x_{i+1}-x_i\).

```
x <- c(-2, 3, 6, 2, 15)
diff(x)
## [1] 5 3 -4 13
cumsum(diff(x))
## [1] 5 8 4 17
cumsum(c(-2, diff(x))) # recreates x
## [1] -2 3 6 2 15
```

Thanks to **diff**, we can compute the daily changes
to the EUR/AUD forex rates; see Figure 2.4.

```
aud <- scan(paste0("https://github.com/gagolews/teaching-data/raw/",
"master/marek/euraud-20200101-20200630.csv"), comment.char="#")
aud_all <- na.omit(aud) # remove all missing values
plot(diff(aud_all), type="s", ylab="Daily change [EUR/AUD]") # "steps"
abline(h=0, lty="dotted") # draw a horizontal line at y=0
```

### 2.4.5. Aggregating#

The above functions form the basis for some popular summary statistics[6] (sample aggregates), such as:

**mean**`(x)`

gives the arithmetic mean,**sum**`(x)/`

**length**`(x)`

,**var**`(x)`

yields the (unbiased) sample variance,**sum**`((x-`

**mean**`(x))^2)/(`

**length**`(x)-1)`

,**sd**`(x)`

is the standard deviation,**sqrt**`(`

**var**`(x))`

,**median**`(x)`

computes the sample median, i.e., the middle value in the sorted version of`x`

.

For instance[7]:

```
x <- runif(1000)
c(min(x), mean(x), median(x), max(x), sd(x))
## [1] 0.00046535 0.49727780 0.48995025 0.99940453 0.28748391
```

Let \(\boldsymbol{x}\) be any vector of length \(n\) with positive elements. Compute its geometric and harmonic mean, which are given by, respectively,

When solving exercises like this one,
it does not matter on what data you apply these functions
(see, however, Section 9.3.4.2 for discussion).
We are being abstract in the sense that
the \(\boldsymbol{x}\) vector can be anything:
from the one that features very accurate financial predictions
that will help minimise inequality and make this world less miserable,
through the data you have been collecting for the last ten years
in relation to your definitely-super-important PhD research,
whatever your company asked you to crunch today,
to something related to the hobby project that you enjoy doing after hours.
Therefore, just test the above on something
like “`x <- `

**runif**`(10)`

”, and move on.

All the aforementioned functions return a missing value if any
of the input elements is unavailable.
Luckily, they are equipped with the `na.rm`

parameter,
on behalf of which we can request the removal of `NA`

s.

```
aud <- scan(paste0("https://github.com/gagolews/teaching-data/raw/",
"master/marek/euraud-20200101-20200630.csv"), comment.char="#")
c(min(aud), mean(aud), max(aud))
## [1] NA NA NA
c(min(aud, na.rm=TRUE), mean(aud, na.rm=TRUE), max(aud, na.rm=TRUE))
## [1] 1.6006 1.6775 1.8635
```

Note

In the documentation, we read that the usage
of some of the aforementioned functions is like
**sum**`(..., na.rm=FALSE)`

.
**prod**, **min**, and **max** are defined similarly.
They accept any number of input vectors: each of them can be of
arbitrary length.
Therefore, **min**`(1, 2, 3)`

,
**min**`(`

**c**`(1, 2, 3))`

as well as
**min**`(`

**c**`(1, 2), 3)`

all return the same result.

However, we also read that we have
**mean**`(x, trim=0, na.rm=FALSE, ...)`

.
This time, only one vector can be aggregated, and any further arguments
(except `trim`

and `na.rm`

) are ignored.

The extra flexibility (which we do not have to rely on, ever)
of the former group is due to their being
associative operations: it holds, e.g., \((2+3)+4 = 2+(3+4)\). Hence,
the operations can be performed in any order, in any group.
Note that they are more primitive operations:
it is **mean** that is based on **sum**, not vice versa.

## 2.5. Exercises#

Answer the following questions:

What is the meaning of the dot-dot-dot parameter in the definition of the

**c**function?We say that the

**round**function is*vectorised*; what does that mean?What is wrong with a call to

**c**`(`

**sqrt**`(1),`

**sqrt**`(2),`

**sqrt**`(3))`

?What do we mean by saying that multiplication operates element-by-element?

How does the recycling rule work when applying `

**+**`?How to (and why) set the seed of the pseudorandom number generator?

What is the difference between

`NA_real_`

and`NaN`

?How are default arguments specified in the manual of, e.g., the

**round**function?Is a call to

**rep**`(times=4, x=1:5)`

” equivalent to**rep**`(4, 1:5)`

?List a few ways to generate a sequence like (-1, -0.75, -0.5, …, 0.75, 1).

Is “

`-3:5`

” the same as`"-(3:5)"`

? What about the precedence of operators in expressions such as “`2^3/4*5^6`

”, “`5*6+4/17%%8`

”, and “`1+-2^3:4-1`

”?If

`x`

is a numeric vector of length \(n\) (for some \(n\ge 0\)), how many values will**sample**`(x)`

output?Does

**scan**support reading directly from compressed archives, e.g.,`.csv.gz`

files?

When in doubt, refer back to the material discussed in this chapter and/or the R manual.

The following code generates an example graph of arcsine and arccosine, whose preparation – thanks to vectorisation – is quite straightforward.

```
x <- seq(-1, 1, length.out=11) # increase length.out for a smoother curve
plot(x, asin(x), # asin() computed for 11 points
type="l", # lines
ylim=c(-pi/2, pi), # y axis limits like c(y_min, y_max)
ylab="asin(x), acos(x)") # y axis label
lines(x, acos(x), col="red", lty="dashed") # adds to the current plot
legend("topright", c("asin(x)", "acos(x)"),
lty=c("solid", "dashed"), col=c("black", "red"), bg="white")
```

Inspired by the above, plot the following functions:
\(|\sin x^2|\), \(\left|\sin |x|\right|\), \(\sqrt{\lfloor x\rfloor}\),
and \(1/(1+e^{-x})\).
Recall that the documentation of **plot** can be viewed
by calling **help**`("plot.default")`

.

It can be shown that:

slowly converges to \(\pi\) as \(n\) approaches \(\infty\). Compute the above for \(n=1{,}000{,}000\) and \(n=1{,}000{,}000{,}000\) using the vectorised functions and operators discussed in this chapter, making use of the recycling rule as much as possible.

Let `x`

and `y`

be two vectors of identical lengths \(n\),
say:

```
x <- rnorm(100)
y <- 2*x+10+rnorm(100, 0, 0.5)
```

Compute the Pearson linear correlation coefficient given by:

To make sure you have come up with a correct implementation,
compare your result to a call to the built-in **cor**`(x, y)`

.

(*) Find an R package providing a function to compute moving (rolling) averages and medians of a given vector. Apply them on the EUR/AUD currency exchange data. Draw thus obtained smoothened versions of the time series.

(**)
Use a call to **convolve**`(..., type="filter")`

to compute the \(k\)-moving average of a numeric vector.

In the next chapter, we will study operations that involve logical values.