16. 🚧🚧 Environments and Evaluation (**)

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of Chapters 1–12 are already complete, but there will be more. In the meantime, any bug/typos reports/fixes are appreciated. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [20].

In the first part of our book, we have discussed quite a few basic object types: numeric, logical, and character vectors, lists (generic vectors), and functions.

Environments, which we introduce in this chapter, just like lists, are instances of recursive types (compare the diagram in Figure 16.1).

Even though we rarely interact with them directly (unless we need a hash table-like data structure with quick by-name element look-up), they are crucial for R itself: they form the basis of the environment model of evaluation which governs how expressions are computed; see Section 16.2.

Important

Each object of type environment consists of:

  • a frame[1] (Section 16.1) – stores a set of bindings, which associate variable names with their corresponding values; it can be thought of as a container of named R objects of any type;

  • a pointer to an enclosing environment[2] (Section 16.3).

🚧 This chapter is under construction. Please come back later.

16.1. Frames: Environments as Object Containers

To create a new, empty environment, we can call the new.env function:

e1 <- new.env()
typeof(e1)
## [1] "environment"

In this section, we treat environments merely as containers for named objects of any kind, i.e., we deal with the frame part thereof.

Let us insert some elements into e1:

e1[["x"]] <- "x in e1"
e1[["y"]] <- 1:3

The `[[` operator provides some named list-like look-and-feel also in the case of element extraction:

e1[["x"]]
## [1] "x in e1"
e1[["spam"]]  # does not exist
## NULL
e1[["y"]] <- e1[["y"]]*10  # replace with new content
e1[["z"]] <- NULL  # unlike in the case of lists, creates a new element

16.1.1. Printing

Let us note that the printing of an environment is quite awkward:

print(e1)  # same with str(e1)
## <environment: 0x55b724cb5dd8>

Later we will mention that this is the address where e1 is stored in computer’s memory.

As we have said, these objects are rather of internal interest, so nobody cared[3] about making the interaction therewith particularly smooth.

However, we can easily get the list of objects stored within the container using names[4]:

names(e1)  # compare attr(e1, "names") – it is not set
## [1] "x" "y" "z"

Also:

length(e1)
## [1] 3

gives the number of objects in the container.

16.1.2. Environments vs Named Lists

Environment frames, in some sense, can be thought of as named lists, but the set of admissible operations is severely restricted. In particular, we cannot extract more than one element at the same time with the index operator:

e1[c("x", "y")]  # but see the mget function
## Error in e1[c("x", "y")]: object of type 'environment' is not subsettable

nor can we refer to the elements by position:

e1[[1]] <- "bad key"
## Error in e1[[1]] <- "bad key": wrong args for environment subassignment
Exercise 16.1

Check if lapply and Map can be applied directly on environments. Also, can we iterate over their elements using a for loop?

Actually, named lists can be converted to environments and vice versa using as.list and as.environment.

as.list(e1)
## $x
## [1] "x in e1"
## 
## $y
## [1] 10 20 30
## 
## $z
## NULL
as.environment(list(u=42, whatever="it's not gonna be printed anyway"))
## <environment: 0x55b724d09b38>
as.list(as.environment(list(x=1, y=2, x=3)))  # no duplicates allowed
## $y
## [1] 2
## 
## $x
## [1] 3

16.1.3. Hash Maps: Fast Element Look-up by Name

Environment frames are internally implemented using hash tables (hash maps; see, e.g., [9, 33]) with string keys.

Important

A hash table is a data structure that allows for a very quick[5] lookup and insertion of individual elements by name.

This comes at a price, including what we have already observed above:

  • the elements are not particularly unordered: they cannot be referred to by a numeric index;

  • all element names must be unique.

Note

A list may be considered a sequence, but an environment frame is only set (a bag) of key–value pairs. In the majority of numerical computing applications, we would rather store, iterate over, and process all the elements in order, hence the greater popularity of the former. Lists still allow for an element look-up by name, even though this is slightly slower[6]. Hence, they are much more universal.

Example 16.2

A natural use case of manually-created environment frames deals with grouping a series of objects identified by character string keys.

Consider a simple pseudocode for counting the number of occurrences of objects in a given container:

for (key in some_container) {
    if (!is.null(counter[["key"]]))
        counter[["key"]] <- counter[["key"]]+1
    else
        counter[["key"]] <- 1
}

If some_container is large, say, of size \(n\) (let us say it is generated on the fly by reading some data stream), then the run-time of the above will depend on the type of the data structure used. If counter is a list, then, theoretically, the worst-case performance will be \(O(n^2)\) (if all keys are unique). However, for environments, it is an order of magnitude faster: down to amortised \(O(n)\).

Exercise 16.3

(*) Implement the above pseudocode and benchmark the two data structures using proc.time on some example data:

t0 <- proc.time()  # timer start
# ... to do - something time-consuming ...
print(proc.time() - t0)  # elapsed time
Exercise 16.4

(*) Determine the number of unique text lines in a very large file (assuming that the set of unique text lines fits into memory, but the file itself does not). Also, determine the five most frequently occurring text lines.

16.1.4. Pass-by-Value, Copy on Demand – Not for Environments

Given any R object, say, x, when we issue:

y <- x

its copy[7] is made so that y and x are independent of each other. In other words, any change in the state of x (or y) is not reflected in the state of y (or x).

For instance:

x <- list(a=1)
y <- x
y[["a"]] <- y[["a"]]+1
print(y)
## $a
## [1] 2
print(x)  # not affected – `x` and `y` are independent
## $a
## [1] 1

The same happens with arguments that we feed to the functions:

mod <- function(y, key)  # it's like: local_y <- passed_argument
{
    y[[key]] <- y[[key]]+1
    y
}

mod(x, "a")  # returns a modified copy of `x`
## $a
## [1] 2
print(x)
## $a
## [1] 1

We thus say that R has pass-by-value semantics.

Important

Environments are the only[8] R objects that have an assign- and pass-by-reference semantics.

In other words, if we perform:

x <- as.environment(x)
y <- x

then the names x and y are bound with exactly the same objects in computer’s memory.

y[["a"]] <- y[["a"]]+1
print(y[["a"]])
## [1] 2
print(x[["a"]])  # `x` is `y`, `y` is `x`
## [1] 2

This is exactly seen when we print the address where the environments are located:

print(x)
## <environment: 0x55b7240e8418>
print(y)
## <environment: 0x55b7240e8418>

The same when we pass them to a function:

mod(y, "a")  # pass-by-reference (`y` is `x`, remember?)
## <environment: 0x55b7240e8418>
x[["a"]]   # `x` has changed (names `y` and `x` are bound to the same object)
## [1] 3

Thus, any changes we make to an environment passed as an argument to a function will be visible “outside” the call. This minimises time and memory use in certain situations. If this sounds complicated, we should rather be avoiding these data structures whatsoever. As we have said, an R user can live without them.

Note

(*) Actually, for efficiency reasons, when we write “y <- x” , a copy of x (unless it is an environment) is only created if absolutely necessary.

Here is some benchmarking of the copy-on-demand mechanism.

n <- 100000000  # like, a lot

Creation of a new large numeric vector:

t0 <- proc.time();           x <- numeric(n);                proc.time() - t0
##    user  system elapsed 
##   0.123   0.189   0.312

Creation of a (delayed) copy:

t0 <- proc.time();           y <- x;                         proc.time() - t0
##    user  system elapsed 
##   0.001   0.000   0.000

This was instant. Thus, we definitely did not clone the n data cells.

Copy-on-demand is implemented using some quite simple reference counting; compare Section 14.4. That – temporarily – x and y point to the same address in memory can be inspected by calling:

.Internal(inspect(x))
## @7fe7e58a0010 14 REALSXP g0c7 [REF(4)] (len=100000000, tl=0) 0,0,0,0,0,...
.Internal(inspect(y))
## @7fe7e58a0010 14 REALSXP g0c7 [REF(5)] (len=100000000, tl=0) 0,0,0,0,0,...

The real copying is only triggered when we try to modify x or y. This is when they need to be separated.

t0 <- proc.time();           y[1] <- 1;                      proc.time() - t0
##    user  system elapsed 
##   0.153   0.207   0.360

Now x and y are different objects.

.Internal(inspect(x))
## @7fe7e58a0010 14 REALSXP g0c7 [MARK,REF(5)] (len=100000000, tl=0) 0,0,0,0,0,...
.Internal(inspect(y))
## @7fe7b5daf010 14 REALSXP g0c7 [MARK,REF(1)] (len=100000000, tl=0) 1,0,0,0,0,...

Note that the elapsed time is similar to that needed to create x from scratch.

Further modifications can already be fast:

t0 <- proc.time();           y[2] <- 2;                      proc.time() - t0
##    user  system elapsed 
##   0.213   0.212   0.424

16.1.5. 🚧 A Note on Reference Classes (*)

16.2. 🚧 The Environment Model of Evaluation

16.3. 🚧 Enclosing Environments

16.4. 🚧 Evaluating Functions

16.4.1. 🚧 Evaluation of Default Arguments

16.4.2. 🚧 Not All Arguments Need to Be Evaluated

Calls such as `if`(test, ifyes, ifno) or `&&`(mustbe, maybe) do not have to evaluate all their arguments.

{cat(" first "); FALSE} && {cat(" second "); FALSE}
##  first
## [1] FALSE
{cat(" first "); TRUE} && {cat(" Spanish Inquisition "); FALSE}
##  first  Spanish Inquisition
## [1] FALSE

This is also a kind of nonstandard evaluation.

We can write such functions too.

test <- function(a, b, c) a + c  # b is unused

test({cat(" spam "); 1}, {cat(" eggs "); 10}, {cat(" bacon "); 100})
##  spam  bacon
## [1] 101

Here, the second argument was not used, therefore it did not have to be evaluated. And it was not.

A more advanced example:

test <- function(a, b, c)
{
    cat("Arguments passed to the functions (expressions): \n")
    cat("a = ", deparse(substitute(a)), "\n")
    cat("b = ", deparse(substitute(b)), "\n")
    cat("c = ", deparse(substitute(c)), "\n")
    cat("Using a: ")
    a  # we don't even have to be particularly creative
    cat("Using a and c: ")
    retval <- a + c  # a is reused, b is unused
    cat("Cheers! ")
    retval
}

test({cat(" spam "); 1}, {cat(" eggs "); MeAn(egGs)}, {cat(" bacon "); 100})
## Arguments passed to the functions (expressions): 
## a =  {     cat(" spam ")     1 } 
## b =  {     cat(" eggs ")     MeAn(egGs) } 
## c =  {     cat(" bacon ")     100 } 
## Using a:  spam Using a and c:  bacon Cheers!
## [1] 101

Notice that:

  • either the evaluation of an argument does not happen or it is triggered only once (in which case the result is cached);

  • evaluation is delayed until the very first request for the underlying value;

  • regardless whether we request it or not, we still have access to the underlying unevaluated expression passed as an argument.

As a consequence, we can pass arbitrary gibberish as the second argument (and we did above).

In other words, the evaluation of arguments can be postponed arbitrarily, possibly forever.

16.4.3. 🚧 Matching of Argument Names (TODO: MOVE)

16.4.4. 🚧 Ellipsis Revisited

16.4.5. 🚧 S3 Method Lookup by UseMethod

16.4.6. 🚧 Overloading S3 Group Generics

16.4.7. 🚧 Package Namespaces

16.5. 🚧 Formulas, `~` (*)

16.6. 🚧 Exercises

Exercise 16.5

Answer the following questions:

  • What is the role of a frame in an environment?

  • What is the role of an enclosing environment?

  • What is the difference between a named list and an environment?

  • What functions and operators work on named lists but cannot be applied on environments?

  • What do we mean by saying that environments are not passed by value to R functions?

  • What do we mean by saying that sometimes objects are copied on demand?

  • 🚧 TODO…

Exercise 16.6

(*) 🚧 TODO: assign consecutive, unique numeric IDs for a large list of items

pkg <- available.packages()
pkg[, "Package"]  # a list of the names of available packages
pkg[, "Depends"]  # dependencies

Convert the dependency lists to a list of character vectors (preferably using a regular expression).

Then, generate a list of reverse dependencies:

What packages depend on each given package.

This will be much faster using an environment.

16.7. 🚧 Outro

This is the end.

Recall our first approximation to the classification of R Data Types that we presented in the Preface.

As a summary of what we have covered, Figure 16.1 gives a much broader picture.

../_images/data-types-full.png

Figure 16.1 R data types

Now that we have reached the end of this course, we might be interested in reading the following guides:

  • R Language Definition [51],

  • R Internals [50],

  • Writing R Extensions [47].

Also, the NEWS files available at https://cran.r-project.org/doc/manuals/r-release/ will keep us up to date with new features, bug fixes, and deprecated functionality.

Good luck.