R can be considered a functional programming language: it focuses on the creation and manipulation of functions and has what is known as first-class functions.
In computer science, functional programming is a programming paradigm, a style of building the structure and elements of computer programs, that treats computation as the evaluation of mathematical functions and avoids state and mutable data. Functional programming emphasizes functions whose results depend only on their inputs and not on the program state: calling a function f() twice with the same value for an argument x will produce the same result f(x) both times. In this sense, R is clearly a functional programming language.
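As a minimal illustration of this idea (the functions f() and g() below are purely illustrative):

```r
# A pure function: its result depends only on its input
f <- function(x) x^2
f(2)   # 4
f(2)   # 4: same input, same output, always

# A function that reads and modifies external state is not pure:
# the same call may return different results
i <- 0
g <- function(x) {
  i <<- i + 1   # side effect: modifies an object outside the function
  x + i
}
g(2)   # 3
g(2)   # 4: same input, different output
```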
Understanding the functional nature of R
may help to improve clarity and avoid redundancy.
We will examine first-class functions, closures and their use as function factories, functions that keep state, anonymous functions, and lists of functions.
First-class functions are a key component of functional programming style. A programming language is said to have first-class functions when functions can be passed as arguments to other functions, returned as values from other functions, and stored in data structures; R indeed has first-class functions.
In this example we pass the function identity() as an argument to the function lapply():
lapply(0, identity)
[[1]]
[1] 0
Here we make use of an anonymous function:
(function(x) sd(x)/mean(x))(x = 1:5)
[1] 0.5270463
We can easily define a function that returns a function:
f <- function(){
  function(x) sd(x)/mean(x)
}
Finally, we store functions within a list:
function_list <- list(mean, sd)
A function closure or closure is a function together with a referencing environment.
Almost all functions in R
are closures, as they remember the environment where they were created. Generally, but not always, this is the global environment:
f <- function(x) 0
environment(f)
<environment: R_GlobalEnv>
or the package environment
environment(mean)
<environment: namespace:base>
Functions that cannot be classified as closures, and therefore do not have a referencing environment, are known as primitives. These are internal R functions that call the underlying C code directly. sum() and c() are good cases in point:
environment(sum)
NULL
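Base R also offers is.primitive() to test this directly; for instance:

```r
is.primitive(sum)    # TRUE: sum() calls C code directly
is.primitive(c)      # TRUE
is.primitive(mean)   # FALSE: mean() is an ordinary closure
environment(sum)     # NULL: primitives carry no referencing environment
```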
As functions remember the environments where they were created, the following example does not return any error:
y <- 1
f <- function(x){x+y}
f(1)
[1] 2
This is possible as f() is declared within the global environment and therefore f() remembers all objects bound to that environment (the referencing environment), y included.
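Note that y is looked up when f() is called, not when f() is defined: rebinding y in the referencing environment changes the result of later calls. A quick check:

```r
y <- 1
f <- function(x) {x + y}
f(1)     # 2
y <- 10  # rebind y in the global (referencing) environment
f(1)     # 11: f() finds the new value of y at call time
```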
When we call a function, a new environment is created to hold the function's execution and, normally, that environment is destroyed when the function exits. But if we define a function g() that returns a function f(), the environment where f() is created is the execution environment of g(); that is, the execution environment of g() is the referencing environment of f(). As a consequence, the execution environment of g() is not destroyed when g() exits but persists as long as f() exists. Finally, as f() remembers all objects bound to its referencing environment, f() remembers all objects bound to the execution environment of g().

With this idea in mind, we can use the referencing environment of f(), that is the execution environment of g(), to hold any object, and these objects will be available to f().
g <- function(){
  y <- 1
  function(x) {x+y}
}
f1 <- g()
f1(3)
[1] 4
Moreover, as f() is created within g(), any argument passed to g() will be available to f() in its later executions.
g <- function(y) {
  function(x) x+y
}
f1 <- g(1)
f1(3)
[1] 4
As a proof of concept, we may temporarily modify function g() in order to print its execution environment:
g_tmp <- function(y){
  print(environment())
  function(x) {x+y}
}
Then use g_tmp() to produce f1():
f1 <- g_tmp(1)
<environment: 0x56068b663e00>
and finally ask R for the environment associated with f1():
environment(f1)
<environment: 0x56068b663e00>
As we can see, the execution environment of g_tmp()
corresponds to the environment associated to f()
.
Finally,
get("y", envir = environment(f1))
[1] 1
shows where y
is stored.
Notice that each call to g()
returns a function with its own referencing environment:
f2 <- g(1)
environment(f1)
<environment: 0x56068b663e00>
environment(f2)
<environment: 0x56068ab2d1f0>
The referencing environments for f1() and f2() are different, even though f1() and f2() are both returned by g().
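To make this independence concrete, a small sketch reusing the factory g() defined above:

```r
g <- function(y){
  function(x) x + y
}
f1 <- g(1)
f2 <- g(100)
f1(3)   # 4: f1() reads y = 1 from its own referencing environment
f2(3)   # 103: f2() reads y = 100 from a distinct environment
```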
In practice, we can use closures to write specific functions that, in turn, can be used to generate new functions. This allows us to have a double layer of development: a first layer that is used to do all the complex work in common to all functions and a second layer that defines the details of each function.
We can think of a simple function add(x, i) that adds the value i to the value x. We could define this function as:
add <- function(x, i){
  x+i
}
Alternatively, we may consider a set of functions, say f1(x), f2(x), ..., fi(x), ..., fn(x), that add 1, 2, ..., i, ..., n to the x argument. Clearly, we do not want to define all these functions one by one; instead, we want to define a unique function f(i):
f <- function(i){
  function(x) {x+i}
}
capable of generating all fi(x)
functions:
f1 <- f(1)
f1(3)
[1] 4
f2 <- f(2)
f2(4)
[1] 6
In this simple example, this approach shows no benefit and possibly increases the complexity of our code but, for more structured cases, it is definitely worthwhile.
As a more structured example, we may consider the development of a set of functions, say llnorm(x), lweibull(x), ..., that compute maximum likelihood estimates for those distributions given a vector of data x:
new_estimate <- function(dist){
  estimate <- function(x, theta){
    # negative log-likelihood of `dist` evaluated at parameters theta
    neglik <- function(theta, x, log = TRUE){
      args <- c(list(x), as.list(theta), as.list(log))
      -sum(do.call(dist, args))
    }
    # minimize the negative log-likelihood starting from theta
    optim(par = theta, fn = neglik, x = x)$par
  }
  estimate
}
new_estimate
returns a second function: estimate()
whose body depends on the argument dist
passed to new_estimate()
.
Within estimate()
we first define a third function neglik()
and then minimize it with optim().
The returned function: estimate()
can be used as a generator of maximum likelihood estimation functions for any distribution as long as its corresponding ddist()
exists in R
.
Once we have new_estimate()
, we can use it to define any MLE estimation function as long as its density function is defined. That is, we can now write a llnorm()
that computes log-normal maximum likelihood estimates as simply as:
llnorm <- new_estimate("dlnorm")
x <- rlnorm(100, 7 , 1)
llnorm(x, theta = c(mean(log(x)), sd(log(x))))
[1] 6.972047 1.005196
and similarly:
lweibull <- new_estimate("dweibull")
w <- rweibull(100, 2 , 1)
lweibull(w, theta = c(mean(w), sd(w)))
[1] 1.868930 1.008663
As a further example of function factories, we may consider a function moving(f) that returns moving_f(), where f() could be mean(), median(), or any other statistical function as long as it returns a single value.
As a first step we may consider a simple function g()
that returns the value of f()
for any backward window of length n
starting at i
:
g <- function(i, x, n, f, ...) f(x[(i-n+1):i], ...)
g(i = 5, x = 1:10, n = 3, f = mean)
[1] 4
g(i = 5, x = 1:10, n = 3, f = sd)
[1] 1
Note that g() takes, among its inputs, a second function f() and applies it to the window [(i-n+1):i] of x.
As a second step, we define a function moving(f) that takes any function f() as input and defines function g() within its body.
moving <- function(f){
  # g() applies f() to the backward window of length n ending at i
  g <- function(i, x, n, f, ...) f(x[(i-n+1):i], ...)
  # h() applies g() to every window of x via vapply()
  h <- function(x, n, ...) {
    N <- length(x)
    vapply(n:N, g, x, n, f, FUN.VALUE = numeric(1), ...)
  }
  return(h)
}
Function moving() returns function h() that, in turn, can be used to generate any moving_f() function. (Function vapply() within h() is a functional, used as a for-loop replacement, that will be fully explored when discussing functionals.) For instance:
moving_average <- moving(mean)
moving_average(x = rpois(10, 10), n = 3)
[1] 9.666667 10.333333 13.666667 12.000000 11.666667 8.000000
[7] 9.666667 8.333333
If needed, the ... argument can be used to pass extra arguments to f():
moving_average(x = rpois(10, 10), n = 3, trim = .5)
[1] 7 11 11 11 8 8 10 10
If necessary, moving() can be used in the form of an anonymous function:
moving(sd)(rpois(10, 10), n = 5)
[1] 3.563706 3.563706 3.114482 3.962323 4.159327 3.114482
Finally:
x <- 1:100
y <- seq(along = x, from = 0 , to = 1)+rnorm(length(x), 0 , .1)
plot(x, y)
lines(x[10:length(x)], moving(mean)(y, 10), col = "red", lwd = 2)
lines(x[10:length(x)], moving(median)(y, 10), col = "green", lwd = 2)
Density functions in R are usually named with the prefix d followed by a standard suffix for each distribution: dnorm(), dlnorm(), dweibull(), etc. Therefore, we usually write:
x <- c(7,10,13)
dlnorm(x , meanlog = 2, sdlog = 1)
[1] 0.05690844 0.03810909 0.02616136
in order to get density values at x
from a lognormal distribution with parameters 2
and 1
.
In case we need values from a truncated distribution, as far as we know, we need to load an extra package such as truncdist. The package itself works perfectly. In fact, assuming that a dlnorm() function exists, we can get density values from a left-truncated lognormal distribution with parameters meanlog = 2 and sdlog = 1 by simply writing:
require(truncdist)
dtrunc(x, spec = "lnorm", a = 5)
[1] 0.15962754 0.05237882 0.02127653
where a = 5 represents the left threshold for truncation. Nevertheless, the above command requires a change in our programming style.
In principle, we would like to be able to write:
tdlnorm(x, meanlog = 2, sdlog = 1, L = 5)
where L = 5 represents the left threshold for truncation, so that we could keep the same programming style, just with different parameters, for both truncated and non-truncated distributions.
Within this framework, when tdlnorm() is called with L and U set to their default values it behaves as stats::dlnorm():
tdlnorm(x, meanlog = 2, sdlog = 1)
but when called with different settings for L and U, such as:
tdlnorm(x, meanlog = 2, sdlog = 1, L = 5, U = 20)
it behaves as a lognormal density left truncated at L=5
and right truncated at U=20
.
This goal could be achieved by writing tdlnorm() as:
tdlnorm <- function (x, meanlog = 0, sdlog = 1, L = 0, U = Inf)
{
  # rescale the density by the probability mass between L and U
  density <-
    stats::dlnorm(x, meanlog = meanlog, sdlog = sdlog)/
    (
      stats::plnorm(U, meanlog = meanlog, sdlog = sdlog)-
      stats::plnorm(L, meanlog = meanlog, sdlog = sdlog)
    )
  return(density)
}
This returns the same results as truncdist::dtrunc():
tdlnorm(x, 1, 2, L = 5, U = 20)
[1] 0.11523518 0.07297029 0.05109291
dtrunc(x, spec = "lnorm", a = 5, b = 20, meanlog = 1, sdlog = 2)
[1] 0.11523518 0.07297029 0.05109291
As this function clearly works, the next step could be to write something similar for other distributions such as weibull, gumbel or gamma. We have to admit that all of this may become quite time consuming.
A different approach could be to define a different function, dtruncate()
, taking the name of a density distribution as an argument and returning a second function that computes density values for the truncated distribution:
dtruncate <- function (dist, pkg = stats){
  dist <- deparse(substitute(dist))
  envir <- as.environment(paste("package", as.character(substitute(pkg)), sep = ":"))
  ddist <- paste("d", dist, sep = "")
  pdist <- paste("p", dist, sep = "")
  # gets density function
  ddist <- get(ddist, mode = "function", envir = envir)
  # gets arguments of density function
  dargs <- formals(ddist)
  # gets probability function
  pdist <- get(pdist, mode = "function", envir = envir)
  # gets arguments of probability function
  pargs <- formals(pdist)
  # output function starts here
  density <- function ()
  {
    # checks L and U
    if (L > U) stop("U must be greater than or equal to L")
    # gets density arguments
    call <- as.list(match.call())[-1]
    # all unique arguments belonging to density and ddist
    dargs <- c(dargs[!is.element(names(dargs), names(call))], call[is.element(names(call), names(dargs))])
    # all unique arguments belonging to probability and pdist
    pargs <- c(pargs[!is.element(names(pargs), names(call))], call[is.element(names(call), names(pargs))])
    # selects x only where defined by L and U
    dargs$x <- x[x > L & x <= U]
    # defines arguments for pdist in L and U
    pUargs <- pLargs <- pargs
    pUargs$q <- U
    pLargs$q <- L
    # initializes output
    density <- numeric(length(x))
    # standard method for computing density values for truncated distributions
    density[x > L & x <= U] <- do.call("ddist", as.list(dargs)) /
      (do.call("pdist", as.list(pUargs)) - do.call("pdist", as.list(pLargs)))
    # returns density values for the truncated distribution
    return(density)
  }
  # adds formals L and U, with their default values, to the density function
  formals(density) <- c(formals(ddist), alist(L = -Inf, U = Inf))
  # returns the density function
  return(density)
}
where the d- and p-functions, such as dlnorm() and plnorm(), are looked up in the package passed as the pkg argument.

We can now define a new tdlnorm() as:
tdlnorm <- dtruncate(dist = lnorm)
and use it as:
p <- ppoints(1000)
x <- qlnorm(p, meanlog = 1, sdlog = 1)
d <- tdlnorm(x, meanlog = 1, sdlog = 1)
dt <- tdlnorm(x, meanlog = 1, sdlog = 1, L= 5, U = 10)
plot(x, dt, type = "n", xlab = "Quantile", ylab = "Density")
points(x, dt, type = "s", col = "red", lwd = 2)
points(x, d, type = "s", col = "darkblue", lwd = 2)
title("Truncated and not-truncated log-normal")
grid()
Clearly, tdlnorm() returns the same results as truncdist::dtrunc(), apart from the boundary convention: tdlnorm() sets the density to zero at x = L, while dtrunc() includes the left endpoint:
dtrunc(x = 5:8, spec = "lnorm", a = 5, b = 10, meanlog = 1, sdlog = 1)
[1] 0.3791833 0.2780953 0.2084880 0.1593537
tdlnorm(x = 5:8, meanlog = 1, sdlog = 1, L = 5, U = 10)
[1] 0.0000000 0.2780953 0.2084880 0.1593537
Moreover, our newly created tdlnorm() function takes as arguments meanlog, sdlog and log, just as stats::dlnorm() does, even though these arguments were not mentioned when calling dtruncate().
Now that we have dtruncate(), the same exercise can be replicated, at no extra programming effort, with any density function:
dweibull <- dtruncate(dist = weibull)
dgpd <- dtruncate(gpd, pkg = evd)
When talking about closures, we used the referencing environment of f() to hold any value passed by g(). Similarly, we can use the same environment to keep state across multiple executions of f().
We may consider a function that simply returns the current date but tracks how many times it has been called:
g <- function(){
  i <- 0
  f <- function(){
    i <<- i+1
    cat("this function has been called ", i, " times", "\n")
    date()
  }
}
f <- g()
#first call
f()
this function has been called 1 times
[1] "Sat Oct 10 01:15:39 2020"
#second call
f()
this function has been called 2 times
[1] "Sat Oct 10 01:15:39 2020"
#third call
f()
this function has been called 3 times
[1] "Sat Oct 10 01:15:39 2020"
Note that we used the <<- operator, which assigns in the enclosing environment. This is equivalent to:
assign("i", i + 1, envir = parent.env(environment()))
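The equivalence can be checked with a small sketch (the names g1() and g2() are illustrative):

```r
# Counter based on `<<-`
g1 <- function() {
  i <- 0
  function() {
    i <<- i + 1   # assigns to i in the enclosing environment
    i
  }
}

# The same counter written with assign() and an explicit environment
g2 <- function() {
  i <- 0
  function() {
    e <- parent.env(environment())   # the referencing environment
    assign("i", get("i", envir = e) + 1, envir = e)
    get("i", envir = e)
  }
}

c1 <- g1()
c2 <- g2()
c1(); c1()   # 1, then 2
c2(); c2()   # 1, then 2
```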
We can use the referencing environment of a function to keep previous returned values of the same function. By using this idea, we could try to avoid re-calculating previously computed values.
Suppose we want a function that takes n as an argument and returns all primes less than or equal to n. This function already exists in the pracma package:
library(pracma)
primes(n = 9)
[1] 2 3 5 7
In order to keep previous results we can define a function makefprime()
that, when called, returns a second function with an environment .env
attached:
makefprime <- function () {
  .env <- new.env()
  f <- function(n) {
    # key under which the result for this n is cached
    symbol <- paste("p", n, sep = ".")
    if (exists(symbol, envir = .env)) {
      # reuse the previously computed result
      prime <- get(symbol, envir = .env)
    } else {
      # compute and cache the result
      prime <- primes(n = n)
      assign(symbol, prime, envir = .env)
    }
    prime
  }
  f
}
We can now create a function, named for instance fprimes(), by calling makefprime(); it returns identical results to primes():
fprimes <- makefprime()
fprimes(10)
[1] 2 3 5 7
Now suppose we need to compute prime numbers several times within a working session or a for loop. When n is large, this computation may require a substantial amount of time.
system.time({p1 = fprimes(n = 10^7)})
user system elapsed
0.338 0.041 0.383
Nevertheless, because of the way we defined fprimes(), the second time this function is called with n = 10^7 the computing time is practically zero, as the function reuses the previously computed results stored in environment .env.
system.time({p2 = fprimes(n = 10^7)})
user system elapsed
0 0 0
As a last example, we may want a function that adds a point to an existing plot any time a new observation becomes available. Using the same mechanism, we can define a new_plot() function that creates a new plot the first time it is called:
new_plot <- function(){
  xx <- NULL
  yy <- NULL
  function(x, y, ...) {
    xx <<- c(xx, x)
    yy <<- c(yy, y)
    plot(xx, yy, ...)
  }
}
this_plot <- new_plot()
and add to the same plot at each next call:
this_plot(1:4, c(2, 3, 1, 5), type = "b")
this_plot(5, 3, type = "b")
this_plot(6, 3, type = "b", col = "red")
In R, we usually assign functions to variable names. Nevertheless, functions can exist without being assigned to a symbol. Functions that don't have a name are called anonymous functions.
We can call anonymous functions directly, as we do with named functions, but the code looks a little unusual as we have to use parentheses both to enclose the whole function definition and to pass arguments to the function:
(function(x) x + 3)(10)
[1] 13
Note that this is exactly the same as:
f <- function(x) x + 3
f(10)
[1] 13
We use anonymous functions when it's not worth the effort of assigning a function to a name. For instance, we could integrate a function s(x) by:
s <- function(x) sin(x)/sqrt(x)
integrate(s, 0, 4)
1.609553 with absolute error < 4.5e-11
or alternatively by:
integrate(function(x) sin(x)/sqrt(x), 0, 4)
1.609553 with absolute error < 4.5e-11
In this case, function(x) sin(x)/sqrt(x) is an example of an anonymous function.
Finally, anonymous functions are, by all rights, normal R functions: they have formals(), a body(), and an environment():
formals(function(x) x+1)
$x
body(function(x) x+1)
x + 1
environment(function(x) x+1)
<environment: R_GlobalEnv>
Functions, as any type of R
object, can be stored in a list.
fun_list <- list(m = mean, s = sd)
This makes it easier to work with groups of related functions.
Functions defined within a list are accessible in at least three different ways:
using function with()
with(fun_list, m(x = 1:10))
[1] 5.5
by using the $
operator
fun_list$m(x = 1:10)
[1] 5.5
by attaching the list:
attach(fun_list)
m(x = 1:10)
[1] 5.5
detach(fun_list)
Lists of functions can be most useful when we want to apply all functions of the list to the same set of data.
We can achieve this goal in two logical steps.
We first define a function
fun <- function(f, ...){f(...)}
that takes a function f() as an argument, along with any other arguments passed through ..., and returns f(...). In practice:
fun(mean, x = 1:10, na.rm = TRUE)
[1] 5.5
Secondly, we apply function fun() to the list of functions. Arguments required by the functions stored in the list are passed via the ... argument:
lapply(fun_list, fun, x = 1:10)
$m
[1] 5.5
$s
[1] 3.02765
Under almost all circumstances, equivalent results can be achieved by using function do.call()
within a call to lapply()
:
lapply(fun_list, do.call, list(x = 1:10, na.rm = T))
$m
[1] 5.5
$s
[1] 3.02765
the only difference being that arguments to functions within the list must be enclosed in a list too.
As a simple example we may want to compare the results of four Anderson-Darling type tests from the truncgof
package applied to the same data.
We can define a list that holds these four functions and store it in the global environment:
require(truncgof, quietly = TRUE)
nor_test <- list(ad2.test = ad2.test, ad2up.test = ad2up.test, ad.test = ad.test,
adup.test = adup.test)
and, afterwards, apply function fun() to each element of this list:
x <- rnorm(100, 10, 1)
m <- mean(x)
s <- sd(x)
lapply(nor_test, fun, x , distn = "pnorm", list(mean = m, sd = s), sim = 100)
$ad2.test
Quadratic Class Anderson-Darling Test
data: x
AD2 = 0.77067, p-value = 0.03
treshold = -Inf, simulations: 100
$ad2up.test
Quadratic Class Anderson-Darling Upper Tail Test
data: x
AD2up = 2.7084, p-value = 0.61
treshold = -Inf, simulations: 100
$ad.test
Supremum Class Anderson-Darling Test
data: x
AD = 1.6421, p-value = 0.7
alternative hypothesis: two.sided
treshold = -Inf, simulations: 100
$adup.test
Anderson-Darling Upper Tail Test
data: x
ADup = 10, p-value = 0.51
alternative hypothesis: two.sided
treshold = -Inf, simulations: 100
We may want to define a function that returns some specific statistics for a given set of variables in the form of a data.frame
.
this_summary <- as.data.frame(rbind(
  vapply(trees, mean, FUN.VALUE = numeric(1)),
  vapply(trees, sd, FUN.VALUE = numeric(1)),
  vapply(trees, function(x, ...) diff(range(x)), FUN.VALUE = numeric(1))
))
row.names(this_summary) <- c("mean", "sd", "range")
this_summary
Girth Height Volume
mean 13.248387 76.000000 30.17097
sd 3.138139 6.371813 16.43785
range 12.300000 24.000000 66.80000
We may achieve the same result by writing a more general function that will work with any kind of statistics as long as they return a single value:
my_summary <- function(x, flist) {
  f <- function(f, ...) f(...)
  g <- function(x, flist) {
    vapply(flist, f, x, FUN.VALUE = numeric(1))
  }
  df <- as.data.frame(lapply(x, g, flist))
  row.names(df) <- names(flist)
  df
}
my_summary(cars, flist = list(
  mean = mean,
  stdev = sd,
  cv = function(x, ...) sd(x, ...)/mean(x, ...)
))
speed dist
mean 15.4000000 42.9800000
stdev 5.2876444 25.7693775
cv 0.3433535 0.5995667
Working with this mindset, we may even define a function fapply() that applies all functions of a list to the same set of arguments:
fapply <- function(X, FUN, ...){
  lapply(FUN, function(f, ...){f(...)}, X, ...)
}
and use it as:
basic_stat <- list(mean = mean, median = median, sd = sd)
fapply(1:10, basic_stat)
$mean
[1] 5.5
$median
[1] 5.5
$sd
[1] 3.02765
For attribution, please cite this work as
Spano (2020, Oct. 10). andreaspano blog: Functional Programming with R. Retrieved from https://andreaspano.github.io/posts/2020-10-07-functional-programming-with-r/
BibTeX citation
@misc{spano2020functional,
  author = {Spano, Andrea},
  title = {andreaspano blog: Functional Programming with R},
  url = {https://andreaspano.github.io/posts/2020-10-07-functional-programming-with-r/},
  year = {2020}
}