8. Flow of execution#

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug/typos reports/fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [26].

The ifelse and Map functions are potent. However, they allow us to process only the consecutive elements in a vector.

Thus, let us (finally!) discuss different ways to alter a program’s control flow manually, based on some criterion, and to evaluate the same expression many times, but perhaps on different data. Before proceeding any further, let us, however, contemplate the fact that we have managed to do without them for such a long time, despite the fact that the data processing exercises we learnt to solve were far from trivial.

8.1. Conditional evaluation#

Life is full of surprises, so it would be nice if we could adapt to whatever the circumstances are going to be.

The following evaluates a given expression if and only if a logical condition is true.

if (condition) expression

When performing some other_expression is preferred rather than doing nothing in the case of the condition’s being false, we can write:

if (condition) expression else other_expression

For instance:

(x <- runif(1))  # to spice things up
## [1] 0.28758
if (x > 0.5) cat("head\n") else cat("tail\n")
## tail

Many expressions can, of course, be grouped with curly braces, `{`.

if (x > 0.5) {
    cat("head\n")
    x <- 1
} else {
    cat("tail\n")
    x <- 0
}
## tail
print(x)
## [1] 0

Important

At the top level, we should not put a new line before else. Otherwise, we will get an error like Error: unexpected 'else' in "else". This is because the interpreter enthusiastically executes the statements read line by line as soon as it regards them as stand-alone expressions. In this case, we first get an if without else, and then, separately, a dangling else without the preceding if.

This does not happen when a conditional statement is part of an expression group as the latter is read in its entirety.

function (x)
{  # opening bracket – start
    if (x > 0.5)
        cat("head\n")
    else              # not dandling because {...} is read as a whole
        cat("tail\n")
}  # closing bracket – expression ends

As an exercise, try removing the curly braces and see what happens.

8.1.1. Return value#

`if` is a function (compare Section 9.4). Hence, it has a return value: the result of evaluating the conditional expression.

(x <- runif(1))
## [1] 0.28758
y <- if (x > 0.5) "head"  # no else
print(y)
## NULL
y <- if (x > 0.5) "head" else "tail"
print(y)
## [1] "tail"

This is particularly useful when a call to `if` is the last expression in the code block constituting a function’s body.

mint <- function(x)
{
    if (x > 0.5)  # the last expression (actually, the only one)
        "head"        # this can be the return value...
    else
        "tail"        # or this one, depending on the condition
}

mint(x)
## [1] "tail"
unlist(Map(mint, runif(5)))
## [1] "tail" "head" "tail" "head" "head"
Example 8.1

Add-on packages can be loaded using requireNamespace. Contrary to library, the former does not fail when a package is not available. Also, it does not attach it to the search path; see Section 16.2.6.

Instead, it returns a logical value indicating if the package is available for use. This can be helpful inside other functions where the availability of some additional features depends on the user environment’s configuration:

process_data <- function(x)
{
    if (requireNamespace("some_extension_package", quietly=TRUE))
        some_extension_package::very_fast_method(x)
    else
        normal_method(x)
}

8.1.2. Nested ifs#

If more than two test cases are possible, i.e., when we need to go beyond either condition or !condition, then we can use the following construction:

if (a) {
    expression_a
} else if (b) {
    expression_b
} else if (c) {
    expression_c
} else {
    expression_else
}

This evaluates all conditions a, b, … (in this order) until the first positive case is found and then executes the corresponding expression.

It is worth stressing that the above is nothing else than a series of nested if statements but written in a more readable[1] manner:

if (a) {
    expression_a
} else {
    if (b) {
    expression_b
    } else {
        if (c) {
            expression_c
        } else {
            expression_else
        }
    }
}
Exercise 8.2

Write a function named sign that determines if a given numeric value is "positive", "negative", or "zero".

8.1.3. Condition: Either TRUE or FALSE#

if expects a condition that is a single, well-defined logical value, either TRUE or FALSE. Thence, problems may arise when this is not the case.

If the condition is of length not equal to one, we get an error:

if (c(TRUE, FALSE)) cat("spam\n")
## Error in if (c(TRUE, FALSE)) cat("spam\n"): the condition has length > 1
if (logical(0)) cat("bacon\n")
## Error in if (logical(0)) cat("bacon\n"): argument is of length zero

We cannot pass a missing value either:

if (NA) cat("ham\n")
## Error in if (NA) cat("ham\n"): missing value where TRUE/FALSE needed

Important

If we think that we are immune to writing code violating the above constraints, just we wait until the condition becomes a function of data for which there is no sanity-checking in place.

mint <- function(x)
    if (x > 0.5) "H" else "T"

mint(0.25)
## [1] "T"
mint(runif(5))
## Error in if (x > 0.5) "H" else "T": the condition has length > 1
mint(log(rnorm(1)))  # not obvious, only triggered sometimes
## Warning in log(rnorm(1)): NaNs produced
## Error in if (x > 0.5) "H" else "T": missing value where TRUE/FALSE needed

In Chapter 9, we will be particularly interested in ways to ensure input data integrity so that situations such as above will either fail gracefully or succeed bombastically.

Here, we should probably ensure that x is a single finite numeric value. Alternatively, we can test whether all(x > 0.5, na.rm=TRUE).

Interestingly, objects other that logical are accepted: they will be coerced if needed.

x <- 1:5
if (length(x))  # i.e., length(x) != 0, but way less readable
    cat("length is not zero\n")
## length is not zero

Recall that coercion of numeric to logical yields FALSE if and only if the original value is zero.

8.1.4. Short-circuit evaluation#

Especially for formulating logical conditions in if and while (see below), we have the scalar `||` (alternative) and `&&` (conjunction) operators.

FALSE || TRUE
## [1] TRUE
NA || TRUE
## [1] TRUE

Contrary to their vectorised counterparts (`|` and `&`), the scalar operators are lazy (Chapter 17) in the sense that they evaluate the first operand and then determine if the computing of the second one is necessary (because, e.g., FALSE && whatever is always FALSE anyway).

Therefore,

if (a && b)
    expression

is equivalent to:

if (a) {
    if (b) {  # compute b only if a is TRUE
        expression
    }
}

and:

if (a || b)
    expression

corresponds to:

if (a) {
    expression
} else if (b) {  # compute b only if a is FALSE
    expression
}

For instance, “is.vector(x) && length(x) > 0 && x[[1]] > 0” is a risk-free test that takes into account that “x[[1]]” has only the desired meaning for objects that are not nonempty vectors.

Some other examples:

{cat("spam"); FALSE} || {cat("ham"); TRUE} || {cat("cherries"); FALSE}
## spamham
## [1] TRUE
{cat("spam"); TRUE} && {cat("ham"); FALSE} && {cat("cherries"); TRUE}
## spamham
## [1] FALSE

Recall that the expressions within the curly braces are evaluated one after another and that the result is determined by the last value in the series.

Exercise 8.3

Study the source code of isTRUE and isFALSE and determine if these functions could be useful in formulating the conditions within the if expressions.

8.2. Exception handling#

Exceptions are exceptional, but they may happen and break things. For instance, when the internet connection drops while we try to download a file, an optimisation algorithm fails to converge, we just have a bug in our code, or:

read.csv("/path/to/a/file/that/does/not/exist")
## Warning in file(file, "rt"): cannot open file '/path/to/a/file/that/does/
##     not/exist': No such file or directory
## Error in file(file, "rt"): cannot open the connection

Three types of conditions are frequently observed:

  • errors – they stop the flow of execution,

  • warnings – not critical, but can be turned into errors (see warn in option),

  • messages – they transmit some diagnostic information.

They can be manually triggered using the stop, warning, and message functions.

Errors (but warnings too) can be handled by means of the tryCatch function, amongst others.

tryCatch({           # block of expressions to execute, until an error occurs
        cat("a\n")
        stop("b")    # error – breaks the linear control flow
        cat("c\n")
    },
    error = function(e) {  # executed immediately on an error
        cat(sprintf("error: %s\n", e[["message"]]))
    },
    finally = {  # always executed at the end, regardless of error occurrence
        cat("finally!\n")
    }
)
## a
## error: b
## finally!

The two other conditions can be ignored by calling suppressWarnings and suppressMessages.

log(-1)
## Warning in log(-1): NaNs produced
## [1] NaN
suppressWarnings(log(-1))  # yeah, yeah, we know what we're doing
## [1] NaN
Exercise 8.4

At the time of writing of this book, when the data.table package is attached, it emits a message. Call suppressMessages to silence it. However, consecutive calls to library do not reload an already loaded package. Therefore, the message will only be seen once per R session.

Related functions include stopifnot discussed in Section 9.2 and on.exit mentioned in Section 17.4; see Section 9.3.4 for some code debugging tips.

8.3. Repeated evaluation#

And now for something completely different… time for the elephant in the room!

We have been able to do without loops so far (and will be quite all right in the second part of the book, too). This is because many data processing tasks can be written in terms of vectorised operations such as `+`, sqrt, sum, `[`, Map, and Reduce. Oftentimes, compared to their loop-based counterparts, they are more readable and efficient. We will explore this in the exercises below.

However, at times, using an explicit while or for loop might be the only natural way of solving a problem, for instance, when processing chunks of data streams. Also, an explicitly “looped” algorithm may occasionally have better[2] time or memory complexity.

8.3.1. while#

if considers a given logical condition and thus determines whether to execute a given statement. On the other hand,

while (condition)  # single TRUE or FALSE, as in `if`
    expression

evaluates a given expression as long as the logical condition is true. Therefore, it is advisable to make the condition dependent on some variable that the expression can modify.

i <- 1
while (i <= 3) {
    cat(sprintf("%d, ", i))
    i <- i + 1
}
## 1, 2, 3,

Nested loops are possible, too:

i <- 1
while (i <= 2) {
    j <- 1
    while (j <= 3) {
        cat(sprintf("%d %d, ", i, j))
        j <- j + 1
    }
    cat("\n")
    i <- i + 1
}
## 1 1, 1 2, 1 3,
## 2 1, 2 2, 2 3,
Example 8.5

Implement a simple linear congruential pseudorandom number generator that, given some seed \(X_0\in [0, m)\), outputs a sequence \((X_1,X_2,\dots)\) defined by:

\[ X_i = (a X_{i-1} + c) \mod m, \]

with, e.g., \(a=75\), \(c=74\), and \(m=2^{16}+1\) (here, mod is the division reminder, `%%`). This generator has poor statistical properties and its use in practice is discouraged. In particular, after some number of operations \(k\), we will find a cycle such that \(X_k=X_1, X_{k+1}=X_2, \dots\).

8.3.2. for#

The for-each loop:

for (name in vector)
    expression

takes each element, from the beginning to the end, in a given vector, assigns it some name, and evaluates the expression.

Example:

fridge <- c("spam", "spam", "bacon", "eggs")
for (food in fridge)
    cat(sprintf("%s, ", food))
## spam, spam, bacon, eggs,

Another example:

for (i in 1:length(fridge))  # better: seq_along(fridge), see below
    cat(sprintf("%s, ", fridge[i]))
## spam, spam, bacon, eggs,

One more:

for (i in 1:2) {
    for (j in 1:3)
        cat(sprintf("%d %d, ", i, j))
    cat("\n")
}
## 1 1, 1 2, 1 3,
## 2 1, 2 2, 2 3,

The iterator still exists after the loop’s watch has ended:

print(i)
## [1] 2
print(j)
## [1] 3

Important

Writing:

for (i in 1:length(x))
    print(x[i])

is reckless. If x is an empty vector, then:

x <- logical(0)
for (i in 1:length(x)) print(x[i])
## [1] NA
## logical(0)

Recall from Chapter 5 that x[1] tries to access an out-of-bounds element here, and x[0] returns nothing.

We generally suggest replacing 1:length(x) with seq_along(x) or seq_len(length(x)). wherever possible.

Note

The model for loop above is roughly equivalent to:

name <- NULL
tmp_vector <- vector
tmp_iter <- 1
while (tmp_iter <= length(tmp_vector)) {
    name <- tmp_vector[[tmp_iter]]
    expression
    tmp_iter <- tmp_iter + 1
}

Note that the tmp_vector is determined before the loop itself. Hence, any changes to the vector will not influence the execution flow. Also, the loop can be applied on lists as well due to the use of `[[`.

Example 8.6

Let x be a list and f be a function. The following code generates the same result as Map(f, x):

n <- length(x)
ret <- vector("list", n)  # a new list of length `n`
for (i in seq_len(n))
    ret[[i]] <- f(x[[i]])
Example 8.7

Let x and y be two lists and f be a function. Here is the most basic version of Map(f, x, y).

nx <- length(x)
ny <- length(y)
n <- max(nx, ny)
ret <- vector("list", n)
for (i in seq_len(n))
    ret[[i]] <- f(x[[((i-1)%%nx)+1]], y[[((i-1)%%ny)+1]])

Note that x and y might be of different lengths. Feel free to upgrade the above code by adding a warning like the longer argument is not a multiple of the length of the shorter one. Also, rewrite it without using the modulo operator, `%%`.

8.3.3. break and next#

break can be used to escape the current loop. next skips the remaining expressions and advances to the next iteration (where the testing of the logical condition occurs).

Here is a rather random example:

x <- runif(1000)
s <- 0
for (e in x) {
    if (e > 0.1)
        next

    print(e)
    if (e < 0.01)
        break

    s <- s + e
}
## [1] 0.045556
## [1] 0.04206
## [1] 0.024614
## [1] 0.045831
## [1] 0.094841
## [1] 0.00062477
print(s)
## [1] 0.2529

Computes the sum of the elements in x that are less than or equal to 0.1 from the beginning, stopping at the first element less than 0.01.

We have used a frequently occurring design pattern:

for (e in x) {
    if (condition)
        next

    many_statements...
}

which is equivalent to:

for (e in x) {
    if (!condition) {
        many_statements...
    }
}

but which avoids introducing a nested block of expressions.

Note

(*) There is a third loop type,

repeat
    expression

which is a shorthand for

while (TRUE)
    expression

i.e., it is a possibly infinite loop. Such loops are invaluable when expressing situations such as do-stuff-until-a-thing-happens, e.g., when we want to execute a command at least once.

i <- 1
repeat {  # while (TRUE)
    # simulate dice casting until we throw "1"
    if (runif(1) < 1/6) break  # not an infinite loop after all
    i <- i+1
}
print(i)
## [1] 6
Exercise 8.8

What is wrong with the following code?

j <- 1
while (j <= 10) {
    if (j %% 2 == 0) next
    print(j)
    j <- j + 1
}
Exercise 8.9

What about this one?

j <- 1
while (j <= 10);
    j <- j + 1

8.3.4. return#

return, when called from within a function, immediately yields a specified value and goes back to the caller.

For example, here is a simple recursive function that flattens a given list:

my_unlist <- function(x)
{
    if (is.atomic(x))
        return(x)

    # so if we are here, x is definitely not atomic
    out <- NULL
    for (e in x)
        out <- c(out, my_unlist(e))

    out  # or return(out); it's the last expression anyway, so not necessary
}

my_unlist(list(list(list(1, 2), 3), list(4, list(5, list(6, 7:10)))))
##  [1]  1  2  3  4  5  6  7  8  9 10

return is a function: the round brackets are obligatory.

8.3.5. Time and space complexity of algorithms (*)#

Analysis of algorithms can give us a rough estimate of their run times or memory consumption as a function of the input data size, especially for big data (e.g., [14, 43]).

In scientific computing and data science, we often deal with vectors (sequences) or matrices/data frames (tabular data). Therefore, we might be interested in determining how many primitive operations need to be performed as a function of their length \(n\) or the number of rows \(n\) and columns \(m\), respectively.

For instance, the \(O\) (Big-Oh) notation can express the upper bounds for time/resource consumption in asymptotic cases. For instance, we say that the time complexity is \(O(n^2)\), if for large \(n\), the number of operations to perform will be proportional to at most the square of the vector size (more precisely, there exists \(m\) and \(C>0\) such that for all \(n>m\), the number of operations is \(\le Cn^2\)).

Therefore, if we have two algorithms that solve the same task, one that has \(O(n^2)\) time complexity, and other of \(O(n^3)\), it is better to choose the former. For large problem sizes, we expect it to be faster.

Moreover, whether time grows proportionally to \(\log n\), \(\sqrt{n}\), \(n\), \(n\log n\), \(n^2\), \(n^3\), or \(2^n\), can be informative in predicting how big the data can be if we have a fixed deadline or not too much space left on the disk.

Exercise 8.10

The hclust function determines a hierarchical clustering of a dataset. It is fed with an object that stores the distance between all the pairs of input points. There are \(n(n-1)/2\) (i.e., \(O(n^2)\)) unique point pairs for any given \(n\). One numeric scalar (double type) takes 8 bytes of storage. If you have 16 GB of RAM, what is the largest dataset that you can cluster on your machine using this function?

Oftentimes, we can learn about the time or memory complexity of the functions we use from their documentation; see, e.g., help("findInterval").

Example 8.11

A course in data structures in algorithms, which this one is not, will give us plenty of opportunities to implement many algorithms ourselves. This way, we can gain a lot of insights and intuitions.

For instance, this is a \(O(n)\)-time algorithm:

for (i in seq_len(n))
    expression

and this is one runs in \(O(n^2)\) time:

for (i in seq_len(n))
    for (j in seq_len(n))
        expression

as long as, of course, the expression is rather primitive (e.g., operations on scalar variables).

R is a very expressive language. Hence, quite complex and lengthy operations can look pretty innocent. After all, it is a glue language for rapid prototyping.

For example:

for (i in seq_len(n))
    for (j in seq_len(n))
        z <- z + x[[i]] + y[[j]]

can be seen as \(O(n^3)\) if each element in the lists x and y as well as z itself are atomic vectors of length \(n\).

Similarly,

Map(mean, x)

is \(O(n^2)\) if x is a list of \(n\) atomic vectors, each of length \(n\).

Note

A quite common statistical scenario involves generating a data buffer of a fixed size:

ret <- c()
for (i in n)
    ret[[i]] <- generate_data(i)  # here: ret[[length(ret)+1]] <- ...

This notation, however, involves growing the ret array in each iteration. Luckily, since R version 3.4.0, each such size extension has amortised \(O(1)\) time as some more memory is internally reserved for its prospective growth (dynamic arrays; see, e.g., Chapter 17 of [14]).

However, it is better to pre-allocate the output vector of the desired final size. We can construct vectors of specific lengths and types in an efficient way (more efficient than with rep) by calling:

numeric(3)
## [1] 0 0 0
numeric(0)
## numeric(0)
logical(5)
## [1] FALSE FALSE FALSE FALSE FALSE
character(2)
## [1] "" ""
vector("numeric", 8)
## [1] 0 0 0 0 0 0 0 0
vector("list", 2)
## [[1]]
## NULL
##
## [[2]]
## NULL

Note

Not all data fit into memory, but it does not mean that we should start installing Apache Hadoop and Spark immediately. Some datasets can be processed on a chunk-by-chunk basis.

R enables data stream handling (some can be of infinite length) through file connections, for example:

f <- file("https://github.com/gagolews/teaching-data/raw/master/README.md",
    open="r")  # a big file, the biggest file ever
i <- 0
while (TRUE) {
    few_lines <- readLines(f, n=4)  # reads only four lines at a time
    if (length(few_lines) == 0) break
    i <- i + length(few_lines)
}
close(f)
print(i)  # the number of lines
## [1] 90

Many functions support reading from/writing to already established connections of different types, e.g., file, gzfile, textConnection, batch-by-batch.

A frequent scenario involves reading a very large CSV, JSON, or XML file only by thousands of lines/records at a time, parsing and cleansing them, and exporting them to SQL databases (which we will exercise in Chapter 12).

8.4. Exercises#

From now on, we must stay alert. Many, if not all, of the following tasks, can still be implemented without the explicit use of the R loops but based only on the operations covered in the previous chapters. If this is the case, try composing both the looped and loop-free versions. Use microbenchmark::microbenchmark or proc.time to compare their run times[3].

Exercise 8.12

Answer the following questions:

  • Let x be a numeric vector. When does if(x > 0) ... yield a warning? When does it give an error? How to prevent this?

  • What is the dangling else?

  • What happens if you put if as the last expression in a curly braces block within a function’s body?

  • Why do we say that `&&` and `||` are lazy? What are their use cases?

  • What is the difference between `&&` and `&`?

  • Can while always be replaced with for? What about the other way around?

  • What is wrong with “return (1+2)*3”?

Exercise 8.13

Verify which of the following can be safely used as logical conditions in if statements. If that is not the case for all x, y, …, determine the additional conditions that must be imposed to make them valid.

  • x == 0,

  • x[y] > 0,

  • any(x>0),

  • match(x, y),

  • any(x %in% y).

Exercise 8.14

What can go wrong in the following code chunk, depending on the type and form of x? Consider as many scenarios as possible.

count <- 0
for (i in 1:length(x))
    if (x[i] > 0)
        count <- count + 1
Exercise 8.15

Implement shift_left(x, n) and shift_right(x, n). The former function gets rid of the first \(n\) observations in x and adds \(n\) missing values at the end of the resulting vector, e.g., shift_left(c(1, 2, 3, 4, 5), 2) is c(3, 4, 5, NA, NA). On the other hand, shift_right(c(1, 2, 3, 4, 5), 2) is c(NA, NA, 1, 2, 3).

Exercise 8.16

Implement your version of diff.

Exercise 8.17

Write a function that determines the longest ascending trend in a given numeric vector, i.e., the length of the longest subsequence of consecutive increasing elements. For example, the input c(1, 2, 3, 2, 1, 2, 3, 4, 3) should yield 4.

Exercise 8.18

Implement the functions that round down and round up each element in a numeric vector to a number of decimal digits.

This concludes the first part of this magnificent book.