Preface

The open-access textbook Deep R Programming by Marek Gagolewski is, and will remain, freely available for everyone’s enjoyment (also in PDF). It is a non-profit project. This book is still a work in progress. Beta versions of Chapters 1–12 are already complete, but there will be more. In the meantime, any bug or typo reports and fixes are appreciated. Although available online, it is a whole course; it should be read from the beginning to the end. Refer to the Preface for general introductory remarks. Also, check out my other book, Minimalist Data Wrangling with Python [20].

To R, or not to R

R [52] has been named the eleventh most dreaded programming language in the 2022 StackOverflow Developer Survey.

Also, it is a free app, so there must be something wrong with it, right?

But whatever, R is deprecated anyway; the “modern” way is to use tidyverse. Or we should all just switch to Python.

Well, not really[1].

R as a Language and an Environment

Let us get one thing straight: R is not just a statistical package. It is a general-purpose, high-level programming language that just happens to be very powerful for any kind of numerical, data-intense computing. It offers extensive support for statistical, machine learning, data analysis, data wrangling, and data visualisation applications, but there is a lot more.

Initially, R was written “for statisticians by statisticians”. Therefore, it may be thought of as a free yet more capable alternative to Stata, SAS, SPSS, Statistica, Minitab, Weka, etc. Unlike in some of them, however, a spreadsheet-like GUI is not the main gateway for performing computations on data. In R, a user must write code to get things done. Although the learning curve is a little steeper for non-programmers, in the long run, R empowers its users because they are not limited to the most common scenarios. If some functionality is missing or does not suit their needs, they can implement it themselves.

It is thus very convenient for rapid prototyping. It helps turn our ideas into operational code that can be tested, extended, polished, run in production, and otherwise enjoyed overall. As an interpreted language, it can be run not only in an interactive read-eval-print loop (command–result, question–answer, …), but also in batch mode (running whole, standalone scripts).

Thus, we would rather position R amongst such tools/languages for numerical or scientific computing as Python with the NumPy ecosystem, Julia, GNU Octave, Scilab, MATLAB, etc. However, it is more specialised in data science applications than all of them. Hence, it provides a much smoother experience. This is why, over the years, R has become the de facto standard in statistics and many other related fields.

Important

R is a whole ecosystem (environment). It consists not only of the R language interpreter but also of:

  • advanced graphical capabilities (see Chapter 13),

  • a help system (Section 1.4),

  • ways for convenient interfacing with compiled code (Chapter 14),

  • a package system and centralised package repositories (such as CRAN and Bioconductor; Section 7.3.1),

  • a lively community of users and developers – curious and passionate people, just like you and me.

Note

R’s predecessor is the popular S system designed in the 1980s by John M. Chambers and his colleagues at Bell Labs; see [3, 4, 8, 42]. R is called GNU S: a free, open-source version of its commercial counterpart. It was developed in the mid-1990s[2] by Robert Gentleman and Ross Ihaka of the Statistics Department, University of Auckland, and a large number of contributors; see [7, 31] for some historical notes.

R has a C language-like syntax that involves the use of {curly braces}. Still, in principle, it is a beautiful, functional programming language: its design was heavily inspired by Scheme (see [1] and Chapter 16 for more details). It is also somewhat object-oriented (Chapter 10).
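To make the mix of brace-based syntax and functional style a little more tangible, here is a minimal sketch that uses base R only. The names square, make_power, and cube are ours and purely illustrative.

```r
# Curly braces group expressions into a block;
# a function returns the value of the last expression evaluated.
square <- function(x) {
    x^2
}

# Functions are ordinary objects: they can be passed to other functions...
squares <- sapply(1:5, square)

# ...and returned by them (the inner function remembers `n`).
make_power <- function(n) {
    function(x) x^n
}
cube <- make_power(3)

# Map applies `cube` to each element; Reduce folds the results with `+`.
total <- Reduce(`+`, Map(cube, 1:3))

print(squares)  # 1 4 9 16 25
print(total)    # 36
```

We will meet such constructs again, in a more orderly fashion, in the chapters on functions and environments.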

Aims, Scope, and Design Philosophy

Many users have been introduced to R by means of some very advanced operations involving data frames, formulas, and functions that rely on nonstandard evaluation (metaprogramming), like:

lm(
    Ozone~Solar.R+Temp,
    data=subset(airquality, Temp>60, select=-(Month:Day))
) |> summary()

This is horrible.
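To see how much is packed into that single call, we can unroll it into explicit steps. The following is only a sketch of roughly equivalent base R code, not a recommended recipe; the names keep_rows, keep_cols, aq, f, and model are ours.

```r
# `airquality` is a data frame shipped with base R (package datasets).

# 1. Row selection: subset() evaluates `Temp>60` inside the data frame
#    and drops the rows where the condition is missing.
keep_rows <- !is.na(airquality$Temp) & airquality$Temp > 60

# 2. Column selection: `select=-(Month:Day)` removes all the columns
#    from Month to Day (here, these are just the two named ones).
keep_cols <- !(names(airquality) %in% c("Month", "Day"))

aq <- airquality[keep_rows, keep_cols]

# 3. The formula is an unevaluated object describing the model.
f <- Ozone ~ Solar.R + Temp

# 4. Model fitting and reporting, without the pipe operator.
model <- lm(f, data=aq)
print(summary(model))
```

Each step relies on a different mechanism: logical indexing, operations on names, formulas, and nonstandard evaluation.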

Another group has been isolated from base R through a thick layer of third-party packages that feature an overwhelming number of functions (every operation, regardless of its complexity, has a different name), often duplicating the core functionality, and sometimes being quite incompatible with our traditional system.

Both families should be fine — as long as they limit themselves to solving only the simplest and most common data processing problems.

But we yearn for more. We do not want hundreds of prefabricated recipes for popular dishes that we can mindlessly apply without much understanding.

Our aim is to learn base R, which is supposed to be the common language (lingua franca) of all R users. We want to be able to write code that everybody should be able to understand, and which will be likely to work without modifications ten years from now (no slang!).

We want to be able to tackle any data-intense problem. Furthermore, we want to develop skills that are transferable, so that learning new tools such as Julia or Python with NumPy and Pandas will be much easier later (because R is not the only notable environment out there).

Anyway, enough preaching. This graduate[3]-level textbook is for independent readers who do not mind a slightly steeper learning curve at the beginning, but who are able to appreciate more cohesively and comprehensively[4] organised material.

Some will benefit from it as a first introduction to R (but without all the pampering). For others[5], this will be a good course from intermediate to advanced (do not skip the first chapters, though).

Either way, do not forget to solve all the prescribed exercises.

Good luck.

🚧 Classification of R Data Types and Book Structure

Figure 1 An overview of the most prevalent R data types (see Figure 16.1 for a more comprehensive list)

The most commonly used R data types can be classified as follows; see also Figure 1.

  1. Basic types – which we discuss in the first part of this book – internal or built-in types, upon which more complex ones are hinged:

    • atomic vectors – represent whole sequences of values, where every element is of the same type:

      • logical (Chapter 3) – includes items that are TRUE (“yes”, “present”), FALSE (“no”, “absent”), or NA (“not available”, “missing”);

      • numeric (Chapter 2) – features real numbers, such as 1, 3.14, -0.0000001, etc.;

      • character (Chapter 6) – contains strings of characters, e.g., "groß", "123", or "Добрий день";

    • function (Chapter 7) – used to group a series of expressions (code lines) so that they can be applied on different kinds of input data to generate the (hopefully) desired outcomes, for instance, cat, print, plot, sample, and sum;

    • list (Chapter 4) a.k.a. a generic vector – can store elements of mixed types;

    The above will be complemented with a discussion on vector indexing (Chapter 5) and ways to control the program flow (Chapter 8).

  2. Compound types – discussed in the second part – wrappers around objects of basic types that might behave differently from the underlying primitives thanks to the dedicated operations overloaded for them. They include:

    • factor (Section 10.3.3) – a vector-like object that represents qualitative data (on a nominal or an ordered scale);

    • matrix (Chapter 11) – stores tabular data, i.e., arranged into rows and columns, where each cell is usually of the same type;

    • data.frame (Chapter 12) – also used for depositing tabular data, but this time such that each column can be of a different type;

    • and many more, which we or third parties can define arbitrarily using, amongst others, the principles of S3-style object-oriented programming (Chapter 10).

    In this part of the book, we also discuss the principles of sustainable coding (Chapter 9) as well as introduce the basic ways to prepare publication-quality graphics (Chapter 13).

  3. 🚧 Some more advanced material that, in most cases, we can (and often should) easily do without, but which is still essential to gain a full understanding of and control over the environment, is discussed in the third part. This includes, amongst others, the following data types:

    • externalptr (Section 14.2);

    • environment (Chapter 16);

    • symbol (name), call, expression (Chapter 15);

    • formula (Section 16.5) – used by some functions to specify supervised learning models or define operations to be performed within data subgroups, amongst others;

    🚧 Also, we will discuss other things, but this is an early draft of this book, so right now, we only provide a placeholder therefor (sec:to-do). Please come back later.
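The classification above can be probed interactively. Below is a minimal sketch in base R; the outputs given in the comments assume R 4.0 or later, and all the variable names are ours. Note that typeof reports the internal (basic) type, whereas class also reveals what a compound type is wrapped as.

```r
# Basic types: atomic vectors, lists, and functions.
x_logical   <- c(TRUE, FALSE, NA)
x_numeric   <- c(1, 3.14, -0.0000001)
x_character <- c("groß", "123", "Добрий день")
x_list      <- list(TRUE, 3.14, "mixed", sum)

typeof(x_logical)    # "logical"
typeof(x_numeric)    # "double" (numeric vectors are stored as doubles)
typeof(x_character)  # "character"
typeof(x_list)       # "list"
is.function(sum)     # TRUE

# Compound types: wrappers around objects of basic types.
fct <- factor(c("low", "high", "low"))
typeof(fct)          # "integer"
class(fct)           # "factor"
as.integer(fct)      # 2 1 2 (codes of the levels "high" and "low")

mat <- matrix(1:6, nrow=2)
typeof(mat)          # "integer"
class(mat)           # "matrix" "array"

df <- data.frame(a=1:2, b=c("x", "y"))
typeof(df)           # "list"
class(df)            # "data.frame"

# Some of the more advanced types.
typeof(globalenv())        # "environment"
typeof(quote(x))           # "symbol"
typeof(quote(g(x)))        # "language" (a call)
typeof(expression(x + 1))  # "expression"
class(Ozone ~ Temp)        # "formula"
```

In particular, a factor is an integer vector in disguise, and a data frame is a list underneath.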

Note

The above classification is just a first approximation to the complete type classification that we give in Figure 16.1.

Also, we should not be surprised that above we do not see any of the data types defined by a few very popular[6] third-party packages. We will later see that we can most often do without them. If that is not the case, we will become skilled enough to learn them easily ourselves.

About the Author

I, Marek Gagolewski (pronounced like Ma’rek Gong-olive-ski), am currently a Senior Lecturer in Applied AI at Deakin University in Melbourne, VIC, Australia and an Associate Professor in Data Science at the Systems Research Institute of the Polish Academy of Sciences.

My research interests are related to data science, in particular: modelling complex phenomena, developing usable, general-purpose algorithms, studying their analytical properties, and finding out how people use, misuse, understand, and misunderstand methods of data analysis in research, commercial, and decision-making settings. I’m an author of 90+ publications, including journal papers in outlets such as Proceedings of the National Academy of Sciences (PNAS), Information Fusion, International Journal of Forecasting, Statistical Modelling, Journal of Statistical Software, Information Sciences, Knowledge-Based Systems, IEEE Transactions on Fuzzy Systems, and Journal of Informetrics.

In my “spare” time, I write books for my students (also check out my Minimalist Data Wrangling with Python [20]) and develop open-source (libre) data analysis software, such as stringi (one of the most often downloaded R packages), genieclust (a fast and robust clustering algorithm in both Python and R), and many others.

Acknowledgements

Deep R Programming is based on my experience as an R user (since ~2003), developer of open-source packages (see above), tutor/lecturer (since ~2008), and author of a quite successful Polish textbook, Programowanie w języku R [19], which was published by PWN (1st ed. 2014, 2nd ed. 2016). Even though the current book is an entirely different work, its predecessor served as an excellent testbed for many ideas conveyed here.

In particular, the teaching style exercised in this book has proven successful in many similar courses that yours truly has been responsible for, including at Warsaw University of Technology, Data Science Retreat (Berlin), and Deakin University (Melbourne). I thank all my students and colleagues for the feedback given over the last 15-odd years.

We describe R version 4.2.2 Patched (2022-11-10 r83330). However, we expect 99.9% of the material covered here to be valid in future releases (consider filing a bug report if you discover that this is not the case).

This book was prepared using MyST (a Markdown superset), Sphinx, and TeX (XeLaTeX). Code chunks were processed with the R package knitr [46]. All figures were plotted with the low-level graphics package using the author’s own style template. A little help from Makefiles, custom shell scripts, and Sphinx plugins (sphinxcontrib-bibtex, sphinxcontrib-proof) dotted the j’s and crossed the f’s. The Ubuntu Mono font is used for the display of code. Typesetting of the main text relies upon the Alegreya and Lato typefaces.

This book received no funding, administrative, technical, or editorial support from Deakin University, Warsaw University of Technology, Polish Academy of Sciences, or any other source.