1. Introduction

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug or typo reports and fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [26].

1.1. Hello, world!

Traditionally, every programming journey starts with the printing of a “Hello, World”-like greeting. Let’s then get it over with asap:

cat("My hovercraft is full of eels.\n")  # `\n` == newline
## My hovercraft is full of eels.

By calling (invoking) the cat function, we printed out a given character string that we enclosed in double-quote characters.

Documenting code is a good development practice. It is thus worth knowing that any text following a hash sign (that is not part of a string) is a comment. It is ignored by the interpreter.

# This is a comment.
# This is another comment.
cat("I cannot wait", "till lunchtime.\n")  # two arguments (another comment)
## I cannot wait till lunchtime.
cat("# I will not buy this record.\n# It is scratched.\n")
## # I will not buy this record.
## # It is scratched.

By convention, in this book, the textual outputs generated by R itself are always preceded by two hashes. This makes copy-pasting all code chunks easier in the case where the diligent readers would like to experiment with them by themselves (which is always highly encouraged).

Whenever a call to some function is to be made, the round brackets are obligatory. All objects within the parentheses (they are separated by commas) constitute the input data to be consumed by the operation. Thus, the syntax is: some_function_to_be_called(argument1, argument2, etc.).
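To illustrate the call syntax above, here is a minimal sketch that uses two base R functions not discussed in this section, toupper and max. They are chosen only as examples of functions taking one and three arguments, respectively.

```r
toupper("spam")  # one argument inside the round brackets
## [1] "SPAM"
max(3, 1, 2)     # three arguments, separated by commas
## [1] 3
```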

1.2. Setting up the development environment

1.2.1. Installing R

It is quite natural to pine for the ability to execute the above code ourselves – we cannot learn programming without getting our hands dirty.

The official precompiled binary distributions of R can be downloaded from https://cran.r-project.org/.

For serious programming work[1], we recommend, sooner rather than later, switching to[2] one of the UNIX-like operating systems. This includes the free, open-source (== good) variants of GNU/Linux, amongst others, or the proprietary (== very far from good) m**OS. The users thereof might employ their favourite package manager (e.g., apt, dnf, pacman, or Homebrew) to install R.

Users of other operating systems (such as Wi***ws) might consider installing Anaconda or Miniconda if they require some level of interoperability with the Python environment, e.g., they would like to work with the Jupyter environment (Section 1.2.5).

Below we review several ways in which we can write and execute R code. It is up to the benign readers to research, set up, and learn the development environment that suits their needs. As usual in real life, there is no single universal approach that always works best in all scenarios.

1.2.2. Interactive mode

R’s read-eval-print loop (REPL) can give us instant gratification whenever we would like to compute something quickly, e.g., determine basic aggregates of a few numbers entered by hand or evaluate a mathematical expression like “2+2”.
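For instance, a quick interactive session might resemble the sketch below. The three numbers are made up for illustration, and the c function, which combines them into one sequence, is only introduced in Section 1.3.

```r
2+2                # evaluate a mathematical expression
## [1] 4
mean(c(1, 5, 9))   # the arithmetic mean of a few numbers entered by hand
## [1] 5
range(c(1, 5, 9))  # the smallest and the largest value
## [1] 1 9
```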

How to start the R console varies from system to system, e.g., users of UNIX-like boxes can simply execute R from the terminal (shell). Wi***ws folks can fire up the RGui from the Start menu.


When working interactively, the default[3] command prompt, “>”, means: I am awaiting orders. Moreover, “+” denotes: Please continue. In such a case, we should either complete the unfinished expression or cancel the operation by pressing ESC or CTRL+C (depending on the operating system).

> cat("And now
+ for something
+ completely different
+ it is an unfinished expression...
+ awaiting another double quote character and then the closing bracket...
+ press ESC or CTRL+C to abort input

For readability, we never print out the command prompt characters in this book.

1.2.3. Batch mode: Working with R scripts (**)

The interactive mode of operation is unsuitable for more complicated tasks, though.

The users of UNIX-like operating systems will be interested in another extreme, which involves writing standalone R scripts that can be executed line by line without any user intervention.

To do so, in the terminal (command line, shell), we can invoke:

Rscript file.R

where file.R is the path to some source file.
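As a rough sketch, the contents of such a source file could be as simple as the three lines below. The file name hello.R and the messages are hypothetical; if the file were saved under that name, it would be run with Rscript hello.R.

```r
# hello.R: a hypothetical standalone script
cat("Starting.\n")
cat("2+2 equals", 2+2, "\n")  # cat accepts numbers as well as strings
cat("Done.\n")
```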

Exercise 1.1

(**) In your favourite text editor (e.g., Notepad++, Kate, vi, Emacs, RStudio, or VSCodium), create a file named test.R. Write a few calls to the cat function. Then, execute this script from the terminal by invoking the Rscript program.

1.2.4. Weaving: Automatic report generation (**)

Reproducible data analysis[4] requires us to keep the results (text, tables, plots, auxiliary files) synchronised with their generating code and data.

utils::Sweave (the Sweave function from the utils package) and knitr [61] are two example template processors that evaluate R code chunks within documents written in LaTeX, HTML, or other markup languages. The chunks are replaced by the outputs they yield.

This book is a showcase of such an approach: all the results, including Figure 2.3 and the above “Hello, World”, were generated programmatically. Thanks to its being written in the highly universal Markdown language, it could be easily converted to a single PDF document as well as the whole website. Tools like pandoc and docutils facilitate such operations.

Exercise 1.2

(**) Install the knitr package by calling install.packages("knitr") from within an R session. Then, create a text file named test.Rmd with the following content:

# Hello, Markdown!

This is my first automatically generated report,
where I print messages and stuff.


Thank you for your attention.
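Note that the template as displayed above contains no code for knitr to evaluate. In R Markdown, such code sits in a chunk: a line consisting of three backticks followed by {r}, then the R code, then a closing line of three backticks. Below is a minimal sketch of what a chunk body placed between the two paragraphs of test.Rmd might contain. The two calls are illustrative assumptions, not necessarily the listing originally intended here.

```r
# A hypothetical chunk body for test.Rmd.
print("Hello, knitr!")
print(2+2)
```

After knitting, test.md should show these calls together with the output they generate.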

Assuming that the file is located in the current working directory (compare Section 7.3.2), call knitr::knit("test.Rmd") from the R console or run the following in the terminal:

Rscript -e 'knitr::knit("test.Rmd")'

Then, inspect the generated Markdown file, test.md.

Furthermore, if you have the pandoc tool installed, to generate a standalone HTML file, execute in the terminal:

pandoc test.md --standalone -o test.html

Alternatively, for ways to call external programs from R, see Section 7.3.2.
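One possible way of doing so is the base R function system2, which runs a command with a given vector of arguments. The sketch below assumes that pandoc is on the search path and that test.md is in the current working directory; it does nothing otherwise.

```r
# Call pandoc from within R, but only if it is installed and test.md exists.
if (nzchar(Sys.which("pandoc")) && file.exists("test.md")) {
    system2("pandoc", c("test.md", "--standalone", "-o", "test.html"))
}
```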

1.2.5. Semi-interactive modes (Jupyter Notebooks, sending code to the associated R console, etc.)

The nature of the most frequent use cases of R encourages a semi-interactive workflow, where we prototype quickly by trial and error.

In this mode, we compose a series of short code fragments inside a standalone R script.

Each fragment implements a simple, well-defined task, such as the loading of data files, data cleansing, feature visualisation, computations of some information aggregates, etc.

Importantly, any code chunk can be sent to the associated R console and executed therein. This way, we can inspect the results it generates at any time. If we are not happy with the outcome, we can apply any corrections that are necessary.

There are quite a few integrated development environments (IDEs; sometimes requiring additional plugins) that enable such a workflow, including JupyterLab, Emacs, RStudio, and VSCodium.

Executing an individual code line or a whole text selection is usually done by pressing (configurable) keyboard shortcuts such as Ctrl+Enter or Shift+Enter.

Exercise 1.3

(*) JupyterLab is a development environment that runs in a web browser. It was programmed in Python, but supports many programming languages. Thanks to IRkernel, we can use it with R.

  1. Install JupyterLab and IRkernel (for instance, if you use Anaconda, run conda install -c r r-essentials).

  2. From the File menu, select Create a new R source file and save it as, e.g., test.R.

  3. From the File menu, select Create a new console for the editor running the R kernel.

  4. Type a few calls that print “Hello, World”-like messages.

  5. Press Shift+Enter (whilst working in the editor) to send different code fragments onto the console and execute them. Inspect the results.

See Figure 1.1 for an illustration.


Figure 1.1 JupyterLab: a source file editor and the associated R console, where we can run arbitrary code fragments

Example 1.4

(*) JupyterLab is part of the Jupyter project. It supports the handling of dedicated Notebooks, where editable and executable code chunks and results they generate can be kept together in a single .ipynb (JSON) file; see Figure 1.2 for an illustration and Chapter 1 of [26] for a quick introduction (from the Python language kernel perspective).

This environment is quite convenient for live coding (e.g., for teachers) or performing exploratory data analyses. However, for more serious programming work, the code can get quite messy (luckily, there is always an option to export a notebook to an executable, plain text R script).


Figure 1.2 An example Jupyter Notebook, where we can keep the code and the results together

1.3. Atomic vectors at a glance

After the printing of the “Hello, World” message, a typical programming course would normally proceed with the discussion on basic data types for storing individual numeric or logical values. Next, we would be introduced to arithmetic and relational operations on such scalars, followed by the definition of whole arrays or other collections of such values, complemented by the methods to iterate over them, one element after another.

In R, no separate types representing individual values have been defined. Instead, what seems to be a single datum is already a vector (sequence, array) of length 1.

2.71828          # input a number (here: the same as print(2.71828))
## [1] 2.7183
length(2.71828)  # it is a vector featuring one element
## [1] 1

To create a vector of any length, we can call the c function, which combines given arguments into a single sequence:

c(1, 2, 3)  # three vectors of length 1  ->  one vector of length 3
## [1] 1 2 3
length(c(1, 2, 3))
## [1] 3
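Since the arguments to c are vectors themselves, they need not be of length 1. A brief sketch showing that longer inputs are concatenated into one flat sequence:

```r
c(c(1, 2), 3, c(4, 5))  # vectors of lengths 2, 1, and 2  ->  one vector of length 5
## [1] 1 2 3 4 5
length(c(c(1, 2), 3, c(4, 5)))
## [1] 5
```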

In Chapter 2, Chapter 3, and Chapter 6, we will discuss the most prevalent types of atomic vectors: numeric, logical, and character ones, respectively.

c(0, 1, -3.14159, 12345.6)           # four numbers
## [1]     0.0000     1.0000    -3.1416 12345.6000
c(TRUE, FALSE)                       # two logical values
## [1]  TRUE FALSE
c("spam", "spam", "bacon and spam")  # three character strings
## [1] "spam"           "spam"           "bacon and spam"

We call them atomic for they can only group together values of the same type. Lists, which we will discuss in Chapter 4, are, on the other hand, referred to as generic vectors. They can be used for storing items of mixed types, including other lists.
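The difference can be previewed with the short sketch below. It relies on the base R functions typeof and list, both of which are covered properly in later chapters; the values are arbitrary.

```r
typeof(c(1, "spam", TRUE))     # atomic: all items end up as character strings
## [1] "character"
typeof(list(1, "spam", TRUE))  # generic: a list, each item keeps its own type
## [1] "list"
length(list(1, "spam", TRUE))
## [1] 3
```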


Not having separate scalar types greatly simplifies the programming of numerical computing tasks. Vectors are prevalent in our main areas of interest – statistics, simulations, data science, machine learning, and all other data-oriented computing. For example, columns and rows in tables (values of different features describing clients, ratings of items given by users) or time series (stock market prices, readings from temperature sensors) are all best represented by means of such sequences.

Moreover, the fact that vectors are the core part of the R language makes their use very natural – as opposed to the languages that require special add-ons for vector processing, e.g., numpy for Python [34]. By learning different ways to process them as a whole (instead of one element at a time), we will ensure that our ideas can quickly be turned into working code (rapid prototyping). For instance, computing summary statistics such as, say, the mean absolute deviation of some sequence x, will be as effortless as writing mean(abs(x-mean(x))). Such code is not only easy to read and maintain, but it is also fast to run.
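To see how the mean absolute deviation formula works step by step, consider the sketch below. The data in x are made up, and the assignment operator `<-` is only discussed later in the book.

```r
x <- c(1, 2, 3, 4, 10)  # hypothetical data
mean(x)                 # the arithmetic mean
## [1] 4
abs(x-mean(x))          # absolute deviations from the mean, all at once
## [1] 3 2 1 0 6
mean(abs(x-mean(x)))    # their average
## [1] 2.4
```

Note that no explicit loop over the elements of x was needed.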

1.4. Getting help

Our aim is to become independent, advanced R programmers.

Independent, however, does not mean omniscient. The R help system is the authoritative source of knowledge about specific functions or more general topics. To open a help page, we call:

help("topic")  # equivalently: ?"topic"

Exercise 1.5

Skim (without going into detail) the manual on the length function by calling help("length"). Note that most help pages are structured as follows:

  1. Header: “package:base” means that the function is a base one (see Section 7.3.1 for more details on the R package system);

  2. Title;

  3. Description: a short description of what the function does;

  4. Usage: the list of formal arguments (parameters) to the function;

  5. Arguments: the meaning of each formal argument explained;

  6. Details: technical information;

  7. Value: return value explained;

  8. References: further reading;

  9. See Also: links to other help pages;

  10. Examples: R code that is worth running and studying by yourself.
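The code from the Examples section does not have to be copied by hand. As a small sketch, the base R function example (not mentioned elsewhere in this chapter) runs it for a given topic:

```r
help("length")     # display the help page
example("length")  # run the code from its Examples section
```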

We can also search within all the installed help pages by calling:

help.search("vague topic")  # equivalently: ??"vague topic"

This way, we will be able to find answers to our questions more reliably than when asking DuckDuckGo or G**gle, which commonly feature many low-quality, irrelevant, or distracting results from splogs. We do not want to lose the sacred code writer’s flow!


All code chunks, including code comments and textual outputs, form an integral part of this book’s text. They should not be skipped by the reader. On the contrary, they must become objects of our intense reflection and thorough investigation.

For instance, whenever we introduce a few functions, it may be a clever idea to look them up in the help system. Moreover, playing with the presented code (running, modifying, experimenting, etc.) is also very beneficial. We should develop the habit of asking ourselves questions like “What would happen if…”, and then finding the answers on our own.

We are now ready to discuss the most significant operations on numeric vectors, which constitute the main theme of the next chapter. See you there.

1.5. Exercises

Exercise 1.6

What are the three most important types of atomic vectors?

Exercise 1.7

According to the classification of the R data types we introduced in the previous chapter, are atomic vectors basic or compound types?