What's new in R 4.4.0?
R 4.4.0 (“Puppy Cup”) was released on the 24th April 2024 and it is a
beauty. In time-honoured tradition, here we summarise some of the
changes that caught our eyes. R 4.4.0 introduces some cool features (one
of which is experimental) and makes one of our favourite {rlang}
operators available in base R. There are a few things you might need to
be aware of regarding handling NULL and complex values.
The full changelog can be found at the r-release ‘NEWS’ page and if you want to keep up to date with developments in base R, have a look at the r-devel ‘NEWS’ page.
A tail-recursive tale
Years ago, before I’d caused my first stack overflow, my Grandad used to tell me a daft tale:
It was on a dark and stormy night,
And the skipper of the yacht said to Antonio,
"Antonio, tell us a tale",
So Antonio started as follows...
It was on a dark and stormy night,
And the skipper of the yacht .... [ad infinitum]
The tale carried on in this way forever. Or at least it would until you were finally asleep.
At around the same age, I was toying with BASIC programming and could knock out classics such as
>10 PRINT "Ali stinks!"
>20 GOTO 10
Burn! Infinite burn!
Those were two examples of processes that demonstrate recursion. Antonio’s tale quotes itself recursively, and my older brother will be repeatedly mocked unless someone intervenes.
Recursion is an elegant approach to many programming problems - this usually takes the form of a function that can call itself. You would use it when you know how to get closer to a solution, but not necessarily how to get directly to that solution. And unlike the un-ending examples above, when we write recursive solutions to computational problems, we include a rule for stopping.
An example from mathematics would be finding zeros for a continuous function. The sine function provides a typical example:
We can see that when x = π, there is a zero for sin(x), but the
computer doesn’t know that.
One recursive solution to finding the zeros of a function, f(), is the
bisection method, which iteratively narrows a range until it finds a
point where f(x) is close enough to zero. Here’s a quick implementation
of that algorithm.
If you need to perform root-finding in R, please don’t use the following
function. stats::uniroot() is much more robust…
bisect = function(f, interval, tolerance, iteration = 1, verbose = FALSE) {
  if (verbose) {
    msg = glue::glue(
      "Iteration {iteration}: Interval [{interval[1]}, {interval[2]}]"
    )
    message(msg)
  }
  # Evaluate 'f' at either end of the interval and return
  # any endpoint where f() is close enough to zero
  lhs = interval[1]; rhs = interval[2]
  f_left = f(lhs); f_right = f(rhs)
  if (abs(f_left) <= tolerance) {
    return(lhs)
  }
  if (abs(f_right) <= tolerance) {
    return(rhs)
  }
  stopifnot(sign(f_left) != sign(f_right))
  # Bisect the interval and rerun the algorithm
  # on the half-interval where y=0 is crossed
  midpoint = (lhs + rhs) / 2
  f_mid = f(midpoint)
  new_interval = if (sign(f_mid) == sign(f_left)) {
    c(midpoint, rhs)
  } else {
    c(lhs, midpoint)
  }
  bisect(f, new_interval, tolerance, iteration + 1, verbose)
}
We know that π is somewhere between 3 and 4, so we can find the zero
of sin(x)
as follows:
bisect(sin, interval = c(3, 4), tolerance = 1e-4, verbose = TRUE)
#> Iteration 1: Interval [3, 4]
#> Iteration 2: Interval [3, 3.5]
#> Iteration 3: Interval [3, 3.25]
#> Iteration 4: Interval [3.125, 3.25]
#> Iteration 5: Interval [3.125, 3.1875]
#> Iteration 6: Interval [3.125, 3.15625]
#> Iteration 7: Interval [3.140625, 3.15625]
#> Iteration 8: Interval [3.140625, 3.1484375]
#> Iteration 9: Interval [3.140625, 3.14453125]
#> Iteration 10: Interval [3.140625, 3.142578125]
#> Iteration 11: Interval [3.140625, 3.1416015625]
#> [1] 3.141602
It takes 11 iterations to get to a point where sin(x) is within
10^-4 of zero. If we tightened the tolerance, had a more
complicated function, or had a less precise starting range, it might
take many more iterations to approximate a zero.
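For comparison, here is roughly how the same zero could be found with stats::uniroot(), the function recommended above. This is only a minimal sketch: the `tol` value is an illustrative choice, and uniroot()'s `tol` controls the accuracy of x rather than the size of f(x), so it isn't directly equivalent to the `tolerance` argument of our bisect() function.

```r
# Find the zero of sin(x) between 3 and 4 using base R's root finder
root = stats::uniroot(sin, interval = c(3, 4), tol = 1e-8)

# The estimated location of the zero (should be very close to pi)
root$root

# The number of iterations that uniroot() needed
root$iter
```

The result is a list, so the location of the zero has to be pulled out with `$root`.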
Importantly, this is a recursive algorithm - in the last statement of
the bisect() function body, we call bisect() again. The initial call
to bisect() (with interval = c(3, 4)) has to wait until the second
call to bisect() (interval = c(3, 3.5)) completes before it can
return (which in turn has to wait for the third call to return). So we
have to wait for 11 calls to bisect() to complete before we get our
result.
Those function calls get placed on a computational object named the
call stack. For each function call, this stores details about how the
function was called and where from. While waiting for the first call to
bisect() to complete, the call stack grows to include the details about
11 calls to bisect().
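If you want to see the call stack growing for yourself, base R's sys.nframe() reports how many frames deep the current function call is. The helper below (`stack_depth()` is a made-up name for this sketch, not something from R itself) recurses `n` times and then reports the depth at the bottom of the recursion.

```r
# Hypothetical helper: recurse n times, then report how deep the call stack is
stack_depth = function(n) {
  if (n == 0) {
    return(sys.nframe())
  }
  stack_depth(n - 1)
}

# Ten extra recursive calls mean ten extra frames on the call stack
extra_frames = stack_depth(10) - stack_depth(0)
extra_frames
```

Taking the difference between two depths means the answer doesn't depend on where the code is run from (the console, a script or a knitted document).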
Imagine our algorithm didn’t just take 11 function calls to complete, but thousands, or millions. The call stack would get really full and this would lead to a “stack overflow” error.
We can demonstrate a stack-overflow in R quite easily:
blow_up = function(n, max_iter) {
  if (n >= max_iter) {
    return("Finished!")
  }
  blow_up(n + 1, max_iter)
}
The recursive function behaves nicely when we only use a small number of iterations:
blow_up(1, max_iter = 100)
#> [1] "Finished!"
But the call stack gets too large and the function fails when we attempt to use too many iterations. Note that R raises an error about the size of the call stack before we actually reach its limit, so the R process can continue after exploding the call stack.
blow_up(1, max_iter = 1000000)
# Error: C stack usage 7969652 is too close to the limit
In R 4.4, we are getting (experimental) support for tail-call recursion. This allows us (in many situations) to write recursive functions that won’t explode the size of the call stack.
How can that work? In our bisect() example, we still need to make 11
calls to bisect() to get a result that is close enough to zero, and
those 11 calls will still need to be put on the call stack.
Remember the first call to bisect()? It called bisect() as the very
last statement in its function body. So the value returned by the
second call to bisect() was returned to the user without modification
by the first call. So we could return the second call’s value directly
to the user, instead of returning it via the first bisect() call;
indeed, we could remove the first call to bisect() from the call stack
and put the second call in its place. This would prevent the call stack
from expanding with recursive calls.
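One way to picture "replacing the current call with the next one" is to write the same algorithm as a loop, where each pass overwrites the interval instead of making a new function call. The following is only an illustrative sketch (`bisect_loop()` is a name invented here, and this is not how R implements tail calls internally), but it follows the same steps as bisect() without ever adding to the call stack.

```r
# Loop-based sketch of the bisection method: the interval is updated
# in place, so no recursive calls are made
bisect_loop = function(f, interval, tolerance) {
  lhs = interval[1]
  rhs = interval[2]
  repeat {
    f_left = f(lhs)
    f_right = f(rhs)
    # Return any endpoint where f() is close enough to zero
    if (abs(f_left) <= tolerance) {
      return(lhs)
    }
    if (abs(f_right) <= tolerance) {
      return(rhs)
    }
    stopifnot(sign(f_left) != sign(f_right))
    # Keep the half-interval where y=0 is crossed
    midpoint = (lhs + rhs) / 2
    if (sign(f(midpoint)) == sign(f_left)) {
      lhs = midpoint
    } else {
      rhs = midpoint
    }
  }
}

loop_result = bisect_loop(sin, interval = c(3, 4), tolerance = 1e-4)
loop_result
```

Because it makes the same decisions at each step, this should land on the same endpoint as the recursive version did for sin(x).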
The key to this (in R) is to use the new Tailcall() function. That
tells R “you can remove me from the call stack, and put this call on
instead”. Our final line in bisect() should look like this:
bisect = function(...) {
  # ... snip ...
  Tailcall(bisect, f, new_interval, tolerance, iteration + 1, verbose)
}
Note that you are passing the name of the recursively-called function
into Tailcall(), rather than a call to that function (bisect rather
than bisect(...)).
To illustrate that the stack no longer blows up when tail-call recursion
is used, let’s rewrite our blow_up() function:
# R 4.4.0
blow_up = function(n, max_iter) {
  if (n >= max_iter) {
    return("Finished!")
  }
  Tailcall(blow_up, n + 1, max_iter)
}
We can still successfully use a small number of iterations:
blow_up(1, 100)
#> [1] "Finished!"
But now, even a million iterations of the recursive function can be performed:
blow_up(1, 1000000)
#> [1] "Finished!"
Note that the tail-call optimisation only works here because the
recursive call was made as the very last step in the function body. If
your function needs to modify the value after the recursive call, you
may not be able to use Tailcall().
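Sometimes a function that modifies the value after the recursive call can be reshaped so that the recursive call does come last, typically by carrying a running result along as an extra argument. Here is a hedged sketch of that idea (it needs R >= 4.4.0 for Tailcall(); `sum_to()`, `sum_to_tail()` and the `total` argument are names made up for this example).

```r
# Not tail-recursive: after the recursive call returns, we still add 'n'
sum_to = function(n) {
  if (n == 0) {
    return(0)
  }
  n + sum_to(n - 1)
}

# Tail-recursive rewrite: the running total travels with the call,
# so nothing is left to do once the recursive call returns
sum_to_tail = function(n, total = 0) {
  if (n == 0) {
    return(total)
  }
  # Do the arithmetic before handing over to the next call
  new_total = total + n
  Tailcall(sum_to_tail, n - 1, new_total)
}

sum_to(10)
big_sum = sum_to_tail(100000)
big_sum == 100000 * 100001 / 2
```

Computing `new_total` before the Tailcall() line is a deliberate choice in this sketch: it means each step's addition is done straight away, rather than being left as an unevaluated argument.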
Rejecting the NULL
Missing values are everywhere.
In a typical dataset you might have missing values encoded as NA (if
you’re lucky) and invalid numbers encoded as NaN; you might have
implicitly missing rows (for example, a specific date missing from a
time series) or factor levels that aren’t present in your table. You
might even have empty vectors, or data-frames with no rows, to contend
with. When you write functions and data-science workflows whose input
data may change over time, programming defensively and handling these
kinds of edge-cases means your code will throw up fewer surprises in the
long run. You don’t want a critical report to fail because a
mathematical function you wrote couldn’t handle a missing value.
When programming defensively with R, there is another important form of missingness to be cautious of …
The NULL object.
NULL is an actual object. You can assign it to a variable, combine it
with other values, index into it, pass it into (and return it from) a
function. You can also test whether a value is NULL.
# Assignment
my_null = NULL
my_null
#> NULL
# Use in functions
my_null[1]
#> NULL
c(NULL, 123)
#> [1] 123
c(NULL, NULL)
#> NULL
toupper(NULL)
#> character(0)
# Testing NULL-ness
is.null(my_null)
#> [1] TRUE
is.null(1)
#> [1] FALSE
identical(my_null, NULL)
#> [1] TRUE
# Note that the equality operator shouldn't be used to
# test NULL-ness:
NULL == NULL
#> logical(0)
R functions that are solely called for their side-effects (write.csv()
or message(), for example) often return a NULL value. Other
functions may return NULL as a valid value - one intended for
subsequent use. For example, list-indexing (which is a function call,
under the surface) will return NULL if you attempt to access an
undefined value:
config = list(user = "Russ")
# When the index is present, the associated value is returned
config$user
#> [1] "Russ"
# But when the index is absent, a `NULL` is returned
config$url
#> NULL
Similarly, you can end up with a NULL output from an incomplete stack
of if / else clauses:
language = "Polish"
greeting = if (language == "English") {
  "Hello"
} else if (language == "Hawaiian") {
  "Aloha"
}
greeting
#> NULL
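One defensive fix for the example above is to finish the stack with a plain `else` clause, so that every possible input produces a value. The sketch below falls back to the English greeting, which is just an assumption made for illustration; raising an error with stop() might suit your code better.

```r
language = "Polish"

greeting = if (language == "English") {
  "Hello"
} else if (language == "Hawaiian") {
  "Aloha"
} else {
  # Fallback for any language we haven't handled explicitly
  # (the choice of default here is purely illustrative)
  "Hello"
}

is.null(greeting)
```

With the final `else` in place, `greeting` can no longer silently end up as NULL.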
A common use for NULL is as a default argument in a function
signature. A NULL default is often used for parameters that aren’t
critical to function evaluation. For example, the function signature for
matrix() is as follows:
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
The dimnames parameter isn’t really needed to create a matrix, but
when a non-NULL value for dimnames is provided, the values are used
as the row and column names of the created matrix.
matrix(1:4, nrow = 2)
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4
matrix(1:4, nrow = 2, dimnames = list(c("2023", "2024"), c("Jan", "Feb")))
#>      Jan Feb
#> 2023   1   3
#> 2024   2   4
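You can use the same pattern in your own functions: give an optional parameter a NULL default, and only act on it when the caller has supplied something. The function below is a made-up example (`summarise_values()` and its `label` argument are not from any package), sketched to show the shape of the idea.

```r
# Hypothetical function with an optional 'label' argument
summarise_values = function(x, label = NULL) {
  result = mean(x)
  # Only name the result when a label was actually provided
  if (!is.null(label)) {
    names(result) = label
  }
  result
}

# Without a label, we get a bare number back
summarise_values(c(1, 2, 3))

# With a label, the result is named
summarise_values(c(1, 2, 3), label = "average")
```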
R 4.4 introduces the %||% operator to help when handling variables
that are potentially NULL. When working with variables that could be
NULL, you might have written code like this:
# Remember there is no 'url' field in our `config` list
# Set a default value for the 'url' if one isn't defined in
# the config
my_url = if (is.null(config$url)) {
  "https://www.jumpingrivers.com/blog/"
} else {
  config$url
}
my_url
#> [1] "https://www.jumpingrivers.com/blog/"
Assuming config is a list:
- when the url entry is absent from config (or is itself NULL), then config$url will be NULL and the variable my_url will be set to the default value;
- but when the url entry is found within config (and isn’t NULL) then that value will be stored in my_url.
That code can now be rewritten as follows:
# R 4.4.0
my_url = config$url %||% "https://www.jumpingrivers.com/blog/"
my_url
#> [1] "https://www.jumpingrivers.com/blog/"
Note that the left-hand value must evaluate to NULL for the right-hand
side to be evaluated, and that empty vectors aren’t NULL:
# R 4.4.0
NULL %||% 1
#> [1] 1
c() %||% 1
#> [1] 1
numeric(0) %||% 1
#> numeric(0)
This operator has been available in the {rlang} package for eight
years and is implemented in exactly the same way. So if you have been
using %||% in your code already, the base-R version of this operator
should work without any problems, though you may want to wait until you
are certain all your users are using R >= 4.4 before switching from
{rlang} to the base-R version of %||%.
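If you'd like to use %||% in a script without depending on {rlang} or on R >= 4.4, one possible stop-gap is to define the operator yourself only when base R doesn't already have it. This is a sketch of that approach rather than an official recommendation; the one-line definition mirrors the behaviour described above.

```r
# Define %||% only when base R doesn't already provide it (R < 4.4)
if (!exists("%||%", envir = baseenv(), inherits = FALSE)) {
  `%||%` = function(x, y) if (is.null(x)) y else x
}

config = list(user = "Russ")

# 'url' is absent, so the default on the right-hand side is used
config$url %||% "https://www.jumpingrivers.com/blog/"

# 'user' is present, so its value is kept
config$user %||% "unknown user"
```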
Any other business
A shorthand hexadecimal format (common in web-programming) for specifying RGB colours has been introduced. So, rather than writing the 6-digit hexcode for a colour “#112233”, you can use “#123”. This only works for those 6-digit hexcodes where the digits are repeated in pairs.
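A quick way to convince yourself that the two spellings mean the same colour is to convert both to their red, green and blue components with grDevices::col2rgb(). This sketch assumes you are running R >= 4.4.0; on older versions the short form would be expected to fail.

```r
# Requires R >= 4.4.0: convert both spellings to their RGB components
short_form = grDevices::col2rgb("#123")
long_form = grDevices::col2rgb("#112233")

# Compare the components of the two colours
all(short_form == long_form)
```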
Parsing and formatting of complex numbers has been improved. For
example, as.complex("1i") now returns the complex number 0 + 1i;
previously it returned NA.
There are a few other changes related to handling NULL that have been
introduced in R 4.4. The changes highlight that NULL is quite
different from an empty vector. Empty vectors contain nothing, whereas
NULL represents nothing. For example, whereas an empty numeric vector
is considered to be an atomic (unnestable) data structure, NULL is no
longer atomic. Also, NCOL(NULL) (the number of columns in a matrix
formed from NULL) is now 0, whereas it was formerly 1.
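These differences are easy to check at the console. The comments in the sketch below describe the behaviour stated above for R >= 4.4.0; if you run it on an older version of R, expect the NULL-related lines to give the former results instead.

```r
# An empty numeric vector is an atomic data structure
empty_is_atomic = is.atomic(numeric(0))

# NULL is no longer atomic in R >= 4.4.0 (it was before)
null_is_atomic = is.atomic(NULL)

# The number of columns in a matrix formed from NULL:
# 0 in R >= 4.4.0, 1 in earlier versions
null_columns = NCOL(NULL)

c(empty_is_atomic, null_is_atomic)
null_columns
```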
sort_by() is a new function for sorting objects based on values in a
separate object. This can be used to sort a data.frame based on its
columns (they should be specified as a formula):
mtcars |> sort_by(~ list(cyl, mpg)) |> head()
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
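If you need the same ordering on a version of R before 4.4, the long-standing approach is to index the data.frame with order(). The sketch below should put the rows in the same order as the sort_by() call above (`sorted` is just a variable name chosen for this example).

```r
# Sort mtcars by 'cyl', then by 'mpg' within each 'cyl' group
sorted = mtcars[order(mtcars$cyl, mtcars$mpg), ]

# Inspect the first few rows
head(sorted)
```

The sort_by() version saves you from repeating the name of the data.frame for each sorting column, which is handy in the middle of a pipeline.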
Try the latest version out for yourself
To take away the pain of installing the latest development version of R,
you can use docker. To use the devel version of R, you can use the
following commands:
docker pull rstudio/r-base:devel-jammy
docker run --rm -it rstudio/r-base:devel-jammy
Once R 4.4 is the released version of R and the r-docker repository
has been updated, you should use the following commands to test out R
4.4.
docker pull rstudio/r-base:4.4-jammy
docker run --rm -it rstudio/r-base:4.4-jammy
See also
The R 4.x versions have introduced a wealth of interesting changes. These have been summarised in our earlier blog posts: