Preface#
The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of all chapters are already available (proofreading and copyediting pending). In the meantime, any bug reports, typo reports, and fixes are appreciated. Although available online, this is a whole course. It should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [26].
To R, or not to R#
R [67] has been named the eleventh most dreaded programming language in the 2022 Stack Overflow Developer Survey.
Also, it is a free app, so there must be something wrong with it, right?
But whatever, R is deprecated anyway; the “modern” way is to use tidyverse.
Or we should all just switch to Python.
Well, not really[1].
R (GNU S) as a language and an environment#
Let us get one thing straight: R is not just a statistical package. It is a general-purpose, high-level programming language that happens to be very powerful for numerical, data-intense computing activities of any kind. It offers extensive support for statistical, machine learning, data analysis, data wrangling, and data visualisation applications, but there is much more.
Initially, R[2] was written for statisticians, by statisticians. Therefore, it is a free, yet more capable alternative to Stata, SAS, SPSS, Statistica, Minitab, Weka, etc. (and without any strings attached). Unlike in some of them, in R, a spreadsheet-like GUI is not the main gateway for performing computations on data. Here, a user must write code to get things done. Although the learning curve is a little steeper for non-programmers, in the long run, R empowers its users because they are not limited to the most common scenarios. If some functionality is missing or does not suit their needs, they can easily implement it themselves.
It is thus very convenient for rapid prototyping. It helps turn our ideas into operational code that can be tested, extended, polished, run in production, and otherwise enjoyed. As an interpreted language, it can not only be executed in an interactive read-eval-print loop (command–result, question–answer, …), but also in batch mode (running whole, standalone scripts).
Thus, we would rather position R amongst such tools/languages for numerical or scientific computing as Python with the NumPy ecosystem, Julia, GNU Octave, Scilab, MATLAB, etc. However, it is more specialised in data science applications than any of them. Hence, it provides a much smoother experience. This is why, over the years, R has become the de facto standard in statistics and many other related fields.
Important
R is a whole ecosystem (environment). It consists not only of the R language interpreter, but also features:
advanced graphics capabilities (see Chapter 13),
a consistent, well-integrated help system (Section 1.4),
ways for convenient interfacing with compiled code (Chapter 14),
a package system and centralised package repositories (such as CRAN and Bioconductor; Section 7.3.1),
a lively community of users and developers – curious and passionate people, like you and me.
Note
R is a free, open-source (licensed under the GNU General Public License v2) variation or dialect of the very popular S system designed in the mid-1970s by Rick A. Becker, John M. Chambers, and Allan R. Wilks at Bell Labs; see [3, 4, 5, 6] and its later revisions [7, 9, 13, 55].
Quoting [4]:
The design goal for S is, most broadly stated, to enable and encourage good data analysis, that is, to provide users with specific facilities and a general environment that helps them quickly and conveniently look at many displays, summaries, and models for their data, and to follow the kind of iterative, exploratory path that most often leads to a thorough analysis. The system is designed for interactive use with simple but general expressions for the user to type, and immediate, informative feedback from the system, including graphic output on any of a variety of graphical devices.
S became popular because it offered greater flexibility than the standalone statistical packages. It was praised for its high interactivity and array orientation that was known from APL, the familiar syntax of the C language that involves the use of {curly braces}, the ability to treat code as data known from Lisp (Chapter 15), the notion of lazy arguments (Chapter 17), and the ease of calling external C and Fortran routines (Chapter 14). Its newer versions were also somewhat object-oriented (Chapter 10).
However, S was a commercial system. To address this, R (GNU S) was developed in the mid-1990s[3] by Robert Gentleman and Ross Ihaka of the Statistics Department, University of Auckland, and many contributors; see [12, 37] for some historical notes. In essence, R was supposed to be backwards-compatible with S, but some design choices led to their evaluation models’ being slightly different: R’s design was heavily inspired by Scheme (with its environment model of evaluation; see [1] and Chapter 16 for more details).
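Three of the S traits mentioned above (array orientation, code as data, and lazy arguments) carry over to R and can be previewed in a few lines. This is only a rough sketch: the names `x`, `y`, `e`, `r`, `f`, and `z` are made up for illustration, and the later chapters treat each topic properly.

```r
# Array orientation: arithmetic acts on whole vectors at once.
x <- c(1, 2, 3)
y <- x * 10 + 1                   # 11 21 31

# Code as data: an unevaluated call can be stored and run later.
e <- quote(a + b)                 # nothing is computed yet
r <- eval(e, list(a = 2, b = 3))  # 5

# Lazy arguments: an argument is evaluated only when it is actually used.
f <- function(u, v) u             # v is never touched inside f
z <- f(1, stop("never evaluated"))  # returns 1; stop() is not triggered
```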
Aims, scope, and design philosophy#
Many users were introduced to R by means of some very advanced operations involving data frames, formulae, and functions that rely on nonstandard evaluation (metaprogramming), like:
lm(
    Ozone~Solar.R+Temp,
    data=subset(airquality, Temp>60, select=-(Month:Day))
) |> summary()
This is horrible.
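To see how much is packed into that single expression, the same model can be fitted in smaller steps that avoid the nonstandard evaluation in subset and the pipe operator. This is merely a sketch: it assumes the built-in airquality data frame, and the names `rows`, `cols`, `aq`, and `fit` are made up for illustration.

```r
# airquality is a data frame that ships with R (package datasets)
rows <- which(airquality$Temp > 60)                  # the Temp>60 condition
cols <- !(names(airquality) %in% c("Month", "Day"))  # select=-(Month:Day)
aq <- airquality[rows, cols]                         # ordinary indexing

# the formula still tells lm which columns play which role
fit <- lm(Ozone ~ Solar.R + Temp, data = aq)
summary(fit)
```

Each intermediate object can now be inspected on its own, which is precisely what the one-liner hides from a newcomer.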
Another group was isolated from base R through a thick layer of third-party packages that feature an overwhelming number of functions (every operation, regardless of its complexity, has a different name), often duplicate the core functionality, and sometimes are quite incompatible with our traditional system.
Both user families ought to be fine, as long as they limit themselves to solving only the simplest and most common data processing problems.
But we yearn for more. We do not want hundreds of prefabricated recipes for popular dishes that we can mindlessly apply without much understanding.
Our aim is to learn base R, which is supposed to be the common language (lingua franca) for all R users. We want to be able to indite code that everybody should understand; code that will work without modifications ten years from now (no slang!).
We want to be able to tackle any data-intense problem. Furthermore, we want to develop transferable skills so that learning new tools such as Python with NumPy and Pandas (e.g., [26]) or Julia will be much easier later: R is not the only notable environment out there.
Anyway, enough preaching. This graduate[4]-level textbook is for independent readers who:
do not mind a slightly steeper learning curve at the beginning,
can appreciate more cohesively and comprehensively[5] organised material,
would like to experience the joy of solving problems by programming,
do not want to be made obsolete by artificial “intelligence” in the future.
Some will benefit from it as a first introduction to R (yet, without all the pampering). For others[6], this will be a fine course from intermediate to advanced (do not skip the first chapters, though).
Either way, do not forget to solve all the prescribed exercises.
Good luck.
Classification of R data types and book structure#

Figure 1 An overview of the most prevalent R data types; see Figure 17.2 for a more comprehensive list#
The most commonly used R data types can be classified as follows; see also Figure 1.
Basic types – which we discuss in the first part of this book – internal or built-in types, upon which more complex ones are hinged:
atomic vectors – represent whole sequences of values, where every element is of the same type:
logical (Chapter 3) – includes items that are TRUE (“yes”, “present”), FALSE (“no”, “absent”), or NA (“not available”, “missing”);
numeric (Chapter 2) – features real numbers, such as 1, 3.14, -0.0000001, etc.;
character (Chapter 6) – contains strings of characters, e.g., "groß", "123", or "Добрий день";
function (Chapter 7) – used to group a series of expressions (code lines) so that they can be applied on different kinds of input data to generate the (hopefully) desired outcomes, for instance, cat, print, plot, sample, and sum;
list (generic vector; Chapter 4) – can store elements of mixed types;
The above will be complemented with a discussion on vector indexing (Chapter 5) and ways to control the program flow (Chapter 8).
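As a quick preview, the basic types listed above can be told apart with typeof and the is.* family of functions. The snippet below is only a sketch with made-up variable names; note that typeof reports numeric vectors as "double".

```r
l <- c(TRUE, FALSE, NA)
typeof(l)        # "logical"

n <- c(1, 3.14, -0.0000001)
typeof(n)        # "double"
is.numeric(n)    # TRUE

s <- c("groß", "123")
typeof(s)        # "character"

is.function(sum) # TRUE

m <- list(1, "a", TRUE)  # elements of mixed types
typeof(m)        # "list"
```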
Compound types – discussed in the second part – wrappers around objects of basic types that might behave differently from the underlying primitives thanks to the dedicated operations overloaded for them. They are:
factor (Section 10.3.2) – a vector-like object that represents qualitative data (on a nominal or an ordered scale);
matrix (Chapter 11) – stores tabular data, i.e., arranged into rows and columns, where each cell is usually of the same type;
data.frame (Chapter 12) – also used for depositing tabular data, but this time such that each column can be of a different type;
and many more, which we can arbitrarily define using, amongst others, the principles of S3-style object-oriented programming (Chapter 10).
In this part of the book, we also discuss the principles of sustainable coding (Chapter 9) as well as introduce the basic ways to prepare publication-quality graphics (Chapter 13).
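That compound types are wrappers around basic ones can be glimpsed by comparing what class and typeof report. The following is a minimal sketch; the objects `f`, `m`, and `d` are invented for illustration.

```r
f <- factor(c("spam", "ham", "spam"))
class(f)       # "factor"
typeof(f)      # "integer": integer codes underneath
levels(f)      # "ham" "spam"

m <- matrix(1:6, nrow = 2)
typeof(m)      # "integer": an atomic vector...
dim(m)         # 2 3     ...equipped with a dim attribute

d <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(d)       # "data.frame"
typeof(d)      # "list": a list of equal-length columns
```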
Some more advanced material is discussed in the third part. In most cases, we can (and often should) easily do without it, but it is still essential to gain a complete understanding of and control over our environment. This includes the following data types:
symbol (name), call, expression (Chapter 15) – objects representing unevaluated R expressions that can be freely manipulated and executed if needed;
environment (Chapter 16) – hashmaps that store named objects and which form the basis of the environment model of evaluation;
formula (Section 17.6) – used by some functions to specify supervised learning models or define operations to be performed within data subgroups, amongst others.
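For the curious, objects of these types can be created and inspected as sketched below. All variable names are made up for illustration; the details are deferred to the chapters referenced above.

```r
s  <- quote(x)                  # a symbol (name)
cl <- quote(x + y)              # a call
ex <- expression(x + y, x * y)  # an expression vector
class(s)    # "name"
class(cl)   # "call"
class(ex)   # "expression"

env <- new.env()                # an environment maps names to objects
assign("x", 10, envir = env)
assign("y", 5, envir = env)
eval(cl, env)                   # 15: the call is evaluated within env

fm <- y ~ x                     # a formula; nothing is computed here
class(fm)   # "formula"
all.vars(fm)  # "y" "x"
```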
We should not be surprised that we did not list any data types defined by a few trendy[7] third-party packages. We will later see that we can most often do without them. If that is not the case, we will become skilled enough to learn them quickly ourselves.
Acknowledgements#
R and its predecessor S are the result of a collaborative effort of many programmers. Without their generous intellectual contributions, the landscape of data analysis would not be as beautiful as it is now. R is distributed under the terms of the GNU General Public License version 2, and we occasionally display fragments of its source code for didactic purposes.
We describe and use R version 4.3.0 (2023-04-21). However, we expect 99.9% of the material covered here to be valid in future releases (consider filing a bug report if you discover this is not the case).
Deep R Programming is based on the author’s experience as an R user (since ~2003), developer of open-source packages (mentioned above), tutor/lecturer (since ~2008), and an author of a quite successful Polish textbook Programowanie w języku R (see [25]) which was published by PWN (1st ed. 2014, 2nd ed. 2016). Even though the current book is an entirely different work, its predecessor served as an excellent testbed for many ideas conveyed here.
In particular, the teaching style exercised in this book has proven successful in many similar courses that yours truly has been responsible for, including at Warsaw University of Technology, Data Science Retreat (Berlin), and Deakin University (Melbourne). I thank all my students and colleagues for the feedback given over the last 15-odd years.
However, this book received no funding, administrative, technical, or editorial support from Deakin University, Warsaw University of Technology, Polish Academy of Sciences, or any other source.
This book was prepared in a Markdown superset called MyST, Sphinx, and TeX (XeLaTeX). Code chunks were processed with the R package knitr [61]. All figures were plotted with the low-level graphics package using the author’s own style template. A little help from Makefiles, custom shell scripts, and Sphinx plugins (sphinxcontrib-bibtex, sphinxcontrib-proof) dotted the j’s and crossed the f’s. The Ubuntu Mono font is used for the display of code. Typesetting of the main text relies on the Alegreya and Lato typefaces.
You can make this book better#
Open, non-profit projects such as this have to rely on the generosity of the readers’ community when it comes to quality assurance.
If you find a typo, a bug, or a passage that could be rewritten or extended for better readability/clarity, do not hesitate to report it via the Issues tracker available at https://github.com/gagolews/deepr/issues/. New feature requests are welcome as well.
Please consider “starring” the book’s GitHub repository. Some people (weirdly) use the number of “stars” as a proxy for quality.
Spread the news about this book by sharing https://deepr.gagolewski.com/ with your mates, peers, or students. You may want to cite it in your publications. Thank you.