# 8. Flow of Execution

The open-access textbookDeep R Programmingby Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project.This book is still a work in progress. Beta versions of Chapters 1–12 are already complete, but there will be more. In the meantime, any bug/typos reports/fixes are appreciated.Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks.Also, check out my other book,Minimalist Data Wrangling with Python[20].

The **ifelse** and **Map** functions are very powerful,
but they allow us to process only the consecutive elements
in a vector.

Thus, let us (finally!) discuss different ways to alter a program’s control flow manually, based on some criterion, and to evaluate the same expression a number of times, but perhaps on different data. Before proceeding any further, let us, however, contemplate on the fact that we have managed to do without them for such a long time – and the data processing exercises we learnt to solve were far from trivial.

## 8.1. Conditional Evaluation

Life is full of surprises, so we would be nice if we were able to adapt to whatever the circumstances are going to be.

The following evaluates a given `expression`

*if and only if*
a logical `condition`

is true.

```
if (condition) expression
```

When performing some `other_expression`

is preferred
rather than doing nothing in the case of the `condition`

’s being false,
we can write:

```
if (condition) expression else other_expression
```

For instance:

```
(x <- runif(1)) # to spice things up
## [1] 0.28758
if (x > 0.5) cat("head") else cat("tail")
## tail
```

Many expressions can of course be grouped with curly braces, “`{`

”

```
if (x > 0.5) {
cat("head")
x <- 1
} else {
cat("tail")
x <- 0
}
## tail
print(x)
## [1] 0
```

Important

At the top level, we should not put a new line before **else**,
otherwise we will get an error like `Error: unexpected 'else' in "else"`

.
This is because the interpreter enthusiastically executes the statements
been read line by line as soon as it regards them
as stand-alone expressions.
In this case, we first get an **if** without **else**,
and then, separately, a *dangling* **else** without the preceding
**if**.

This does not happen when a conditional statement is part of an expression group, because the latter is read in its entirety.

```
function (x)
{ # opening bracket – start
if (x > 0.5)
cat("head")
else # not dandling, because {...} is read as a whole
cat("tail")
} # closing bracket – expression ends
```

As an exercise, try removing the curly braces and see what happens.

### 8.1.1. Return Value

`**if**` is a function (compare Section 9.4),
hence has a return value – the result of evaluating the
conditional expression.

```
(x <- runif(1))
## [1] 0.28758
y <- if (x > 0.5) "head" # no else
print(y)
## NULL
y <- if (x > 0.5) "head" else "tail"
print(y)
## [1] "tail"
```

This is particularly useful when a call to `**if**`
is the last expression in the code block constituting a function’s body.

```
mint <- function(x)
{
if (x > 0.5) # the last expression (actually, the only one)
"head" # this can be the return value...
else
"tail" # or this one, depending on the condition
}
mint(x)
## [1] "tail"
unlist(Map(mint, runif(5)))
## [1] "tail" "head" "tail" "head" "head"
```

Add-on packages can also be loaded using **requireNamespace**.
Contrary to **library**, the former does not fail
when a package is not available. Also, it does not attach
it on the search list; see `sec:to-do`

.

Instead, it returns a logical value indicating if the package is available for use. This can be useful inside other functions where the availability of some additional features depends on the user environment’s configuration:

```
process_data <- function(x)
{
if (requireNamespace("some_extension_package", quietly=TRUE))
some_extension_package::very_fast_method(x)
else
normal_method(x)
}
```

### 8.1.2. Nested **if**s

If more than two test cases are possible, i.e.,
when we need to go beyond either `condition`

or
`!condition`

, then we can use the following construction:

```
if (a) {
expression_a
} else if (b) {
expression_b
} else if (c) {
expression_c
} else {
expression_else
}
```

This evaluates all conditions `a`

, `b`

, … (in this order)
until the first positive case is found,
and then executes the corresponding `expression`

.

Note that the above is nothing else than a series of nested
**if** statements:

```
if (a) {
expression_a
} else {
if (b) {
expression_b
} else {
if (c) {
expression_c
} else {
expression_else
}
}
}
```

but written in a less readable[1] manner.

Write a function named **sign** that determines
if a given numeric value is `"positive"`

, `"negative"`

, or `"zero"`

.

### 8.1.3. Condition: Either True of False

**if** expects a `condition`

that is a single, well-defined logical
value, either `TRUE`

or `FALSE`

.
Thence, problems may arise when this is not the case.

If the `condition`

is of length not equal to one,
we get an error:

```
if (c(TRUE, FALSE)) cat("spam")
## Error in if (c(TRUE, FALSE)) cat("spam"): the condition has length > 1
if (logical(0)) cat("bacon")
## Error in if (logical(0)) cat("bacon"): argument is of length zero
```

We cannot pass a missing value either:

```
if (NA) cat("ham")
## Error in if (NA) cat("ham"): missing value where TRUE/FALSE needed
```

Important

If we think that we are absolutely immune to the writing of
code violating the above constraints, just we
wait until the `condition`

becomes a function of data
for which there is no sanity-checking in place.

```
mint <- function(x)
if (x > 0.5) "H" else "T"
mint(0.25)
## [1] "T"
mint(runif(5))
## Error in if (x > 0.5) "H" else "T": the condition has length > 1
mint(log(rnorm(1))) # not obvious, only triggered sometimes
## Warning in log(rnorm(1)): NaNs produced
## Error in if (x > 0.5) "H" else "T": missing value where TRUE/FALSE needed
```

In Chapter 9, we will be particularly interested in ways to assure input data integrity, so that situations such as above will either fail gracefully or succeed bombastically.

Here, we should probably make sure that `x`

is a single finite numeric value.
Alternatively, we had rather test whether
**all**`(x > 0.5, na.rm=TRUE)`

.

Interestingly, objects other that logical are accepted: they will be coerced if needed.

```
x <- 1:5
if (length(x)) # i.e., length(x) != 0, but way less readable
cat("length is not zero")
## length is not zero
```

Recall that coercion of numeric to logical yields `FALSE`

if and
only if the original value is zero.

### 8.1.4. Short-Circuit Evaluation

Specially for formulating logical conditions
in **if** and **while** (see below),
we have the *scalar* `**||**` (alternative) and
`**&&**` (conjunction) operators.

```
FALSE || TRUE
## [1] TRUE
NA || TRUE
## [1] TRUE
```

Contrary to their vectorised counterparts (`**|**`
and `**&**`),
the scalar operators are lazy (Section 9.5.5) in the sense
that they evaluate the first operand and then determine if the computing
of the second one is necessary
(because, e.g., `FALSE && whatever`

is always `FALSE`

anyway).

Therefore,

```
if (a && b)
expression
```

is equivalent to:

```
if (a) {
if (b) { # compute b only if a is TRUE
expression
}
}
```

and:

```
if (a || b)
expression
```

corresponds to:

```
if (a) {
expression
} else if (b) { # compute b only if a is FALSE
expression
}
```

For instance,
“**is.vector**`(x) && `

**length**`(x) > 0 && x[[1]] > 0`

” is a safe test that takes into account
that “`x[[1]]`

” has only the desired meaning for objects that
are not non-empty vectors.

Some other examples (recall that the expressions within the curly braces are evaluated one after another and that the result is determined by the last value in the series):

```
{cat("spam"); FALSE} || {cat("ham"); TRUE} || {cat("cherries"); FALSE}
## spamham
## [1] TRUE
{cat("spam"); TRUE} && {cat("ham"); FALSE} && {cat("cherries"); TRUE}
## spamham
## [1] FALSE
```

Study the source code of
**isTRUE** and **isFALSE**
and determine if these functions could be useful in
formulating the conditions within the **if** expressions.

## 8.2. Exception Handling

Exceptions are exceptional, but they may happen and break things. For instance, when we try to download a file and the internet connection drops. Or an optimisation algorithm fails to converge. Or we just have a bug in our code. Or:

```
read.csv("/path/to/a/file/that/does/not/exist")
## Warning in file(file, "rt"): cannot open file '/path/to/a/file/that/does/
## not/exist': No such file or directory
## Error in file(file, "rt"): cannot open the connection
```

Three types of *conditions* are frequently observed:

errors – they stop the flow of execution,

warnings – non critical, but can be turned into errors (see

`warn`

in**option**),messages – they transmit some diagnostic information.

These can be manually triggered by means of
**stop**, **warning**, and **message** functions.

Errors (but warnings too) can be handled by means of the
**tryCatch** function, amongst others.

```
tryCatch({ # block of expressions to execute, until an error occurs
cat("a\n")
stop("b") # error – breaks the linear control flow
cat("c\n")
},
error = function(e) { # executed immediately upon an error
cat(sprintf("error: %s\n", e[["message"]]))
},
finally = { # always executed at the end, regardless of error occurrence
cat("finally!\n")
}
)
## a
## error: b
## finally!
```

The two other conditions can be ignored by calling
**suppressWarnings** and
**suppressMessages**.

```
log(-1)
## Warning in log(-1): NaNs produced
## [1] NaN
suppressWarnings(log(-1)) # yeah, yeah, we know what we're doing
## [1] NaN
```

At the time of writing of this book,
the **data.table** package emits a message upon attachment.
Call **suppressMessages** to silence it.
Note that consecutive calls to **library** do not
reload an already loaded package, therefore the message will only
be seen once per R session.

Related functions include **stopifnot** discussed
in Section 9.2 and **on.exit** mentioned
in `sec:to-do`

; see also Section 9.3.4
for some code debugging tips.

## 8.3. Repeated Evaluation

And now for something completely different… time for the elephant in the room!

We have been able to do without loops so far
(and will be quite all right in the second part of the book too),
because many data processing tasks can be written in terms
of vectorised operations such as
`**+**`,
**sqrt**,
**sum**,
`**[**`,
**Map**, and
**Reduce**.
Oftentimes, compared to their
loop-based counterparts,
they are not only much more readable but also more efficient.
We will explore this in the exercises below.

However, at times, using an explicit **while** or **for** loop
might be the only natural way of solving a problem,
for instance, when processing chunks of data streams.
Also, an explicitly “looped” algorithm may occasionally
have better[2] time or memory complexity.

### 8.3.1. **while**

**if** considers a given logical condition and thus determines
whether to execute a given statement.
On the other hand,

```
while (condition) # single TRUE or FALSE, as in `if`
expression
```

evaluates a given `expression`

*as long as* the logical `condition`

is true.
Therefore, it is advisable to make the `condition`

dependent upon
some variable that can be modified by the `expression`

.

```
i <- 1
while (i <= 3) {
cat(sprintf("%d, ", i))
i <- i + 1
}
## 1, 2, 3,
```

Nested loops are of course possible too:

```
i <- 1
while (i <= 2) {
j <- 1
while (j <= 3) {
cat(sprintf("%d %d, ", i, j))
j <- j + 1
}
cat("\n")
i <- i + 1
}
## 1 1, 1 2, 1 3,
## 2 1, 2 2, 2 3,
```

Implement a simple linear congruential pseudorandom number generator that, given some seed \(X_0\in [0, m)\), outputs a sequence \((X_1,X_2,\dots)\) defined by:

with, e.g., \(a=75\), \(c=74\), and \(m=2^{16}+1\)
(here, *mod* is the division reminder, `**%%**`).
Note that this generator has poor statistical properties
and should not be used in practice.
In particular, after some number of operations \(k\),
we will find a cycle such that \(X_k=X_1, X_{k+1}=X_2, \dots\).

### 8.3.2. **for**

The for-each loop:

```
for (name in vector)
expression
```

takes each element, from the beginning to the end,
in a given `vector`

, assigns it some `name`

, and evaluates the `expression`

.

Example:

```
fridge <- c("spam", "spam", "bacon", "eggs")
for (food in fridge)
cat(sprintf("%s, ", food))
## spam, spam, bacon, eggs,
```

One more:

```
for (i in 1:length(fridge)) # better: seq_along(fridge), see below
cat(sprintf("%s, ", fridge[i]))
## spam, spam, bacon, eggs,
```

Just one more, promise:

```
for (i in 1:2) {
for (j in 1:3)
cat(sprintf("%d %d, ", i, j))
cat("\n")
}
## 1 1, 1 2, 1 3,
## 2 1, 2 2, 2 3,
```

Note that the iterator still exists after the loop’s watch has ended:

```
print(i)
## [1] 2
print(j)
## [1] 3
```

Important

Writing:

```
for (i in 1:length(x))
print(x[i])
```

is not necessarily safe, because if `x`

is an empty vector, then:

```
x <- logical(0)
for (i in 1:length(x)) print(x[i])
## [1] NA
## logical(0)
```

Recall from Chapter 5 that
`x[1]`

tries to access an out of bounds element here
and `x[0]`

returns nothing.

We generally suggest replacing
`1:`

**length**`(x)`

with **seq_along**`(x)`

or
**seq_len**`(`

**length**`(x))`

.
wherever possible.

Note

The model **for** loop above is roughly equivalent to:

```
name <- NULL
tmp_vector <- vector
tmp_iter <- 1
while (tmp_iter <= length(tmp_vector)) {
name <- tmp_vector[[tmp_iter]]
expression
tmp_iter <- tmp_iter + 1
}
```

Note that `tmp_vector`

is determined before the loop itself.
Hence, any changes to `vector`

will not influence the execution flow.
Also note that due to the use of `**[[**`,
the loop can be applied on lists as well.

Let `x`

be a list and **f** be a function.
The following code generates the same result
as **Map**`(`

**f**`, x)`

:

```
n <- length(x)
ret <- vector("list", n) # a new list of length `n`
for (i in seq_len(n))
ret[[i]] <- f(x[[i]])
```

Let `x`

and `y`

be two lists and **f** be a function.
Here is the most basic version of
**Map**`(`

**f**`, x, y)`

.
Note that `x`

and `y`

might be of different lengths.

```
nx <- length(x)
ny <- length(y)
n <- max(nx, ny)
ret <- vector("list", n)
for (i in seq_len(n))
ret[[i]] <- f(x[[((i-1)%%nx)+1]], y[[((i-1)%%ny)+1]])
```

Feel free to upgrade the above by adding a warning like
*the longer argument is not a multiple of the length of the shorter one*.
Also, rewrite it without the use of the modulo operators, `**%%**`.

### 8.3.3. **break** and **next**

**break** can be used to escape the current loop.
**next** skips the remaining expressions and advances
to the next iteration (to where the testing of
the logical condition occurs).

Here is a rather random example:

```
x <- runif(1000)
s <- 0
for (e in x) {
if (e > 0.1)
next
print(e)
if (e < 0.01)
break
s <- s + e
}
## [1] 0.045556
## [1] 0.04206
## [1] 0.024614
## [1] 0.045831
## [1] 0.094841
## [1] 0.00062477
print(s)
## [1] 0.2529
```

Computes the sum of the elements in `x`

that are less than or equal to 0.1 from the beginning,
stopping at the first element less than 0.01.

Note that we have used the frequently occurring design pattern:

```
for (e in x) {
if (condition)
next
many_statements...
}
```

which is equivalent to:

```
for (e in x) {
if (!condition) {
many_statements...
}
}
```

but avoids introducing a nested block of expressions.

Note

(*) There is a third loop type,

```
repeat
expression
```

which is a shorthand for

```
while (TRUE)
expression
```

i.e., it is a possibly infinite loop.
Such loops are useful when implementing situations such as
*do*-stuff-*until*-a-thing-happens,
e.g., when we want to execute a command at least once.

```
i <- 1
repeat { # while (TRUE)
# simulate dice casting until we throw "1"
if (runif(1) < 1/6) break # not an infinite loop after all
i <- i+1
}
print(i)
## [1] 6
```

What is wrong with the following code?

```
j <- 1
while (j <= 10) {
if (j %% 2 == 0) next
print(j)
j <- j + 1
}
```

What about this one?

```
j <- 1
while (j <= 10);
j <- j + 1
```

### 8.3.4. **return**

**return**, when called from within a function,
immediately yields a specified value and goes back to the caller.

For example, here is a simple recursive function that flattens a given list:

```
my_unlist <- function(x)
{
if (is.atomic(x))
return(x)
# so if we are here, x is definitely not atomic
out <- NULL
for (e in x)
out <- c(out, my_unlist(e))
out # or return(out); it's the last expression anyway, so not necessary
}
my_unlist(list(list(list(1, 2), 3), list(4, list(5, list(6, 7:10)))))
## [1] 1 2 3 4 5 6 7 8 9 10
```

Note that **return** is a function: the round brackets
are obligatory,

### 8.3.5. A Note on Time and Space Complexity of Algorithms (*)

Analysis of algorithms (e.g., [9, 35]),
can give us a rough estimate of their run times
or memory consumption as a function of the input data size,
especially for *big* data.

In scientific computing and data science, we
most often deal with vectors (sequences)
or matrices/data frames (tabular data).
Therefore, we might be interested in determining how many
*primitive operations* need to be performed as a function
of their length *n* or the number of rows *n* and columns *p*,
respectively.

The *O* (Big-Oh) notation, for instance, can express
the upper bounds for time/resource consumption in asymptotic cases.
For instance, we say that the time complexity is \(O(n^2)\),
if for large \(n\), the number of operations to perform
will be proportional to *at most* the square of
the vector size (more precisely, there exists \(m\) and \(C>0\)
such that for all \(n>m\), the number of operations is \(\le Cn^2\)).

Therefore, if we have two algorithms that solve the same task, one that has \(O(n^2)\) time complexity, and other of \(O(n^3)\), it is better to choose the former, because for large problem sizes we expect it to be faster.

Moreover, whether time grows proportionally to \(\log n\), \(\sqrt{n}\), \(n\), \(n\log n\), \(n^2\), \(n^3\), or \(2^n\), can be useful in predicting how big the data can be if we have a fixed deadline or not too much space left on the disk.

The **hclust** function determines a hierarchical clustering
of a dataset. It is fed with an object that stores the distance
between all the pairs of input points.
There are \(n(n-1)/2\) (i.e., \(O(n^2)\)) unique point pairs for any given *n*.
One numeric scalar (`double`

type) takes 8 bytes of storage.
If you have 16 GB or RAM, what is the largest dataset that you
can cluster on your machine using this function?

Oftentimes, we can learn about the time or memory complexity
of the functions we use from their documentation;
see, e.g., **help**`("findInterval")`

.

A course in data structures in algorithms, which this one is not, will give us plenty of opportunities to implement many algorithms ourselves. This way, we can gain a lot of insights and intuitions.

For instance, this is a \(O(n)\)-time algorithm:

```
for (i in seq_len(n))
expression
```

and this is one runs in \(O(n^2)\) time:

```
for (i in seq_len(n))
for (j in seq_len(n))
expression
```

as long as, of course, the `expression`

is rather primitive (e.g.,
operations on scalar variables).

R is a very expressive language and hence quite complex and lengthy operations can look pretty innocent (it is a glue language for rapid prototyping, after all).

For example:

```
for (i in seq_len(n))
for (j in seq_len(n))
z <- z + x[[i]] + y[[j]]
```

can be seen as \(O(n^3)\) if each element in the lists `x`

and `y`

as well as `z`

itself are atomic vectors of length *n*.

Similarly,

```
Map(mean, x)
```

is \(O(n^2)\) if `x`

is a list of *n* atomic vectors each of length *n*.

Note

A quite common statistical scenario involves the generation of a data buffer of a fixed size:

```
ret <- c()
for (i in n)
ret[[i]] <- generate_data(i) # here: ret[[length(ret)+1]] <- ...
```

This notation, however, involves the growing of the `ret`

array
in each iteration. Luckily, since R version 3.4.0,
each such size extension has amortised \(O(1)\) time
due to the fact that some more memory is internally reserved for its
prospective growth (see, e.g., Chapter 17 of [9]).

However, it would still be better to pre-allocate the output vector and grant it the desired, final size already upon creation.

Note that we can construct vectors of specific lengths and types
in an efficient way (more efficient than with **rep**) by calling:

```
numeric(3)
## [1] 0 0 0
numeric(0)
## numeric(0)
logical(5)
## [1] FALSE FALSE FALSE FALSE FALSE
character(2)
## [1] "" ""
vector("numeric", 8)
## [1] 0 0 0 0 0 0 0 0
vector("list", 2)
## [[1]]
## NULL
##
## [[2]]
## NULL
```

Note

Not all data fit into memory, but it does not mean that we should start installing Apache Hadoop and Spark immediately. Some datasets can be processed on a chunk-by-chunk basis.

R enables data stream handling (some can be of infinite length) through file connections, for example:

```
f <- file(paste0("https://raw.githubusercontent.com/gagolews/teaching-data/",
"master/README.md"), open="r") # a big file, the biggest file ever
i <- 0
while (TRUE) {
few_lines <- readLines(f, n=4) # read only four lines at a time
if (length(few_lines) == 0) break
i <- i + length(few_lines)
}
close(f)
print(i) # the number of lines
## [1] 93
```

Many functions support reading from/writing to already
established connections of different types, e.g., **file**,
**gzfile**, **textConnection**, batch-by-batch.

A frequent scenario involves reading a very large CSV, JSON, or XML file only thousands of lines/records at a time, parsing and cleansing them, and exporting to SQL databases (which we will exercise in Chapter 12).

Also note that the always-open text connections
**stdout** and **stderr** (for writing),
and **stdin** (for reading) are by default mapped to
the “terminal/console” and “keyboard”, respectively.
Call **scan**, **cat**, and **stop**
to interact with such sources/targets.

## 8.4. Exercises

Note that, from now on, we should stay alert.
Many, if not all, of the following tasks can still be implemented
without the explicit use of the R loops, but
based only on the operations covered in the previous chapters.
If this is the case, try implementing both the looped
and loop-free version.
Use **microbenchmark**`::`

**microbenchmark**
or **proc.time** to compare the run-times[3].

Answer the following questions:

Let

`x`

be a numeric vector. When does**if**`(x > 0) ...`

yield a warning? When does it give an error? How to prevent this?What is the dangling

**else**?What happens if you put

**if**as the last expression in a curly braces block within a function’s body?Why do we say that `

**&&**` and `**||**` are lazy? What are their use cases?What is the difference between `

**&&**` and `**&**`?Can

**while**always be replaced with**for**? What about the other way around?

Verify which of the following can be safely used as logical conditions
in **if** statements. If that is not the case for all `x`

, `y`

, …,
determine the additional conditions that should be imposed
in order to make them valid.

`x == 0`

,`x[y] > 0`

,**any**`(x>0)`

,**match**`(x, y)`

,**any**`(x %in% y)`

.

What can go wrong in the following code chunk, depending on
the type and form of `x`

? Consider as many scenarios as possible.

```
count <- 0
for (i in 1:length(x))
if (x[i] > 0)
count <- count + 1
```

Implement **shift_left**`(x, n)`

and
**shift_right**`(x, n)`

.
The former function gets rid of the first *n* observations in `x`

and adds *n* missing values at the end of the resulting vector,
e.g., **shift_left**`(`

**c**`(1, 2, 3, 4, 5), 2)`

is **c**`(3, 4, 5, NA, NA)`

.
On the other hand,
**shift_right**`(`

**c**`(1, 2, 3, 4, 5), 2)`

is
**c**`(NA, NA, 1, 2, 3)`

.

Implement your own version of **diff**.

Write a function that determines the longest increasing trend
in a given numeric vector, i.e.,
the length of the longest subsequence of consecutive elements
that are increasing.
For example, the input **c**`(1, 2, 3, 2, 1, 2, 3, 4, 3)`

should yield 4.

Implement the functions that round down and round up, to a number of decimal digits, each element in a numeric vector.

This concludes the first part of this magnificent book.