andreaspano blog: Functional Programming with R

Andrea Spano

R can be considered as a functional programming language as it focuses on the creation and manipulation of functions and has what’s known as first class functions.

In computer science, functional programming is a programming paradigm, a style of building the structure and elements of computer programs, that treats computation as the evaluation of mathematical functions and avoids state and mutable data.

Functional programming emphasizes functions that produce results that depend only on their inputs and not on the program state

In functional code, the output value of a function depends only on the arguments that are input to the function, so calling a function f() twice with the same value for an argument x will produce the same result f(x) both times. R is clearly a functional programming language.

Understanding the functional nature of R may help to improve clarity and avoid redundancy.

We will examine:

First Class Functions
Functions Closures
Functions Factories
Anonymous Functions
Lists of Functions
Functionals

First class functions

First-class functions are a key component of functional programming style.

A programming language is said to have first-class functions when the language supports:

passing functions as arguments to other functions
creating anonymous functions
returning functions as the values from other functions
storing functions in data structures.

and R has has first-class functions indeed.

In this example we pass function: identity() as argument to function lapply()


lapply(0, identity)


[[1]]
[1] 0

Here we make use of an anonymous function:


(function(x) sd(x)/mean(x))(x = 1:5)


[1] 0.5270463

We can easily define a function that return a function


f <- function(){
  function(x) sd(x)/mean(x)
}

Finaly we store functions within a list:


function_list <- list(mean , sd)

Functions closures

A function closure or closure is a function together with a referencing environment.

Almost all functions in R are closures as they remember the environment where they were created. Generally, but not always, the global environment:


f <- function(x) 0
environment(f)


<environment: R_GlobalEnv>

or the package environment


environment(mean)


<environment: namespace:base>

Functions that cannot be classified as closures, and therefore do not have a referencing environment, are know as primitives. These are internal R function calling the underlying C code. sum() and c() are good cases in point:


environment(sum)


NULL

As functions remember the environments where they were created, the following example does not return any error:


y <- 1 
f <- function(x){x+y}
f(1)


[1] 2

This is possible as f() is declared within the global environment and therefore f() remembers all objects bounded to that environment (the referencing environment), y included.

When we call a function, a new environment is created to hold the function’s execution and, normally, that environment is destroyed when the function exits. But, if we define a function g() that returns a function f(), the environment where f() is created is the execution environment of g(), that is, the execution environment of g() is the referencing environment of f(). As a consequence, the execution environment of g() is not destroyed as g() exits but it persists as long as f() exists. Finally, as f() remembers all objects bounded to its referencing environment, f() remembers all objects bounded to the execution environment of g()

With this idea in mind, we can use the referencing environment of f(), that is the execution environment of g(), to hold any object and these objects will be available to f().


g <- function(){
  y <- 1
  function(x) {x+y}
}
f1 <- g()
f1(3)


[1] 4

Moreover, as f() is created within g() any argument passed to g() will be available to f() in its later executions.


g <- function (y) {
    function(x) x+y
}
f1 <- g(1)
f1(3)


[1] 4

As a proof of concept, we may temporaly modify function g() in order to print the execution environment of g()


g_tmp <- function(y){
  print(environment())
  function(x) {x+y}
}

Than use g() to produce f()


f1 <- g_tmp(1)


<environment: 0x56068b663e00>

and finaly ask R for the environment associated with f()


environment(f1)


<environment: 0x56068b663e00>

As we can see, the execution environment of g_tmp() corresponds to the environment associated to f().

Finally,


get("y" , env = environment(f1))


[1] 1

shows where y is stored.

Notice that each call to g() returns a function with its own referencing environment:


f2 <- g(1)
environment(f1)


<environment: 0x56068b663e00>


environment(f2)


<environment: 0x56068ab2d1f0>

The referencing environments for f1() and f2() are different, despite that f1() and f2() are both returned by g().

Functions Factories

In practice, we can use closures to write specific functions that, in turn, can be used to generate new functions. This allows us to have a double layer of development: a first layer that is used to do all the complex work in common to all functions and a second layer that defines the details of each function.

Example: A basic case in point

We can think of a simple function add(x,i) that add the value i to the value x. We could define this function as:


add <- function(x, i){
  x+i
}

Alternatively, we may consider a set of functions, say f1(x), f2(x), ..., fi(x), ..., fn{x} that add 1,2,...,i,...,n to the x argument. Clearly, we do not want to define all these functions but we want to define a unique function f(i):


f <- function(i){
  function(x) {x+i}
}

capable of generating all fi(x) functions:


f1 <-  f(1)
f1(3)


[1] 4


f2 <- f(2)
f2(4)


[1] 6

In this simple example, this approach shows no benefit and possibly increases the complexity of our codes but, for more structured cases, it is definitely worth.

Example: MLE functions

As a more structured example, we may consider the development of a set of functions: lnorm(x), lweibull(x), ... that compute max likelihood estimates for those distributions given a vector of data x:


new_estimate <-  function(dist){
  estimate <-  function(x, theta){   
    neglik <-  function(theta = theta , x = x, log = T){
      args <-  c(list(x), as.list(theta), as.list(log))
      neglik <-  -sum(do.call(dist,  args))
      neglik
    }
    optim(par = theta, fn = neglik , x = x)$par
  }
estimate
}

new_estimate returns a second function: estimate() whose body depends on the argument dist passed to new_estimate().

Within estimate() we first define a third function neglik() and secondly, we minimize it within optim().

The returned function: estimate() can be used as a generator of maximum likelihood estimation functions for any distribution as long as its corresponding ddist() exists in R.

Once we have new_estimate(), we can use it to define any MLE estimation function as long as its density function is defined. That is, we can now write a llnorm() that computes log-normal maximum likelihood estimates as simply as:


llnorm <- new_estimate("dlnorm")
x <- rlnorm(100, 7 , 1)
llnorm(x, theta = c(mean(log(x)), sd(log(x))))


[1] 6.972047 1.005196

and similarly:


lweibull <- new_estimate("dweibull")
w <- rweibull(100, 2 , 1)
lweibull(w, theta = c(mean(w), sd(w)))


[1] 1.868930 1.008663

Example: moving statistics

As a further example of functions factories we may consider a function moving(f) that returns moving_f() where f() could be mean(), median() or any other statistical function as long it returns a single value.

As a first step we may consider a simple function g() that returns the value of f() for any backward window of length n starting at i:


g <- function(i , x, n , f, ...) f(x[(i-n+1):i], ...)
g(i = 5 , x = 1:10,n = 3  , f= mean)


[1] 4


g(i = 5 , x = 1:10,n = 3  , f= sd)


[1] 1

Note that g() takes, among its inputs, a second function f() and apply it to the window [(i-n+1):i] of x.

As a second step we define a function moving(f) that takes any function f() as an input and define function g() within its body.


moving <- function(f){
  g <- function(i , x, n , f, ...) f(x[(i-n+1):i], ...)
  h <- function(x, n, ...) {
    N <- length(x)
    vapply(n:N, g, x , n , f, FUN.VALUE = numeric(1), ...)
  }
return(h)  
}

Function moving() returns function h() that, in turn can be used to generate any moving_f() functions:

Function vapply() within h() is a functional used as a for loop replacement that will be fully explored when discussing functionals.


moving_average <- moving(mean)  
moving_average(x = rpois(10, 10), n = 3)


[1]  9.666667 10.333333 13.666667 12.000000 11.666667  8.000000
[7]  9.666667  8.333333

Eventually, argument ’‘...’’ can be used to pass extra arguments to f().


moving_average(x = rpois(10, 10), n = 3, trim = .5)


[1]  7 11 11 11  8  8 10 10

If necessary, function moving() ca be used in the form of anonymous function:


moving(sd)(rpois(10, 10), n = 5)


[1] 3.563706 3.563706 3.114482 3.962323 4.159327 3.114482

Finally:


x <- 1:100
y <- seq(along = x, from = 0 , to = 1)+rnorm(length(x), 0 , .1)
plot(x, y)
lines(x[10:length(x)], moving(mean)(y, 10), col = "red", lwd = 2)
lines(x[10:length(x)], moving(median)(y, 10), col = "green", lwd = 2)

Figure 1: Plot of moving average and median

Example: Truncated density function

Density function in R are usually specified by the prefixes d followed by a standard suffix for each ditribution. dnorm(), dlnorm(), dweibull(), etc …

Therefore, we use to write:


x <- c(7,10,13)
dlnorm(x , meanlog = 2, sdlog = 1)


[1] 0.05690844 0.03810909 0.02616136

in order to get density values at x from a lognormal distribution with parameters 2 and 1.

In case we need value from a truncated distribution, as far as we know, we need to load an extra package such as truncdist. The package itself works perfectly. In fact, assuming that a dlnorm() function exists, we can get density values from a left truncated lognormal distribution with parameters meanlog = 2 and sdlog = 1 by simply writing:


require(truncdist)
dtrunc(x, spec = "lnorm", a = 5)


[1] 0.15962754 0.05237882 0.02127653

where a = 5 represents the left threshold for truncation

Nevertheless, the above command require a change in our programming style.

In principle, we would like to be able to write:


tdlnorm(x, meanlog = 2, sdlog = 1, L = 5)

where L = 5 represents the left threshold for truncation

so that we could have the same programming style, just with different parameters, for both truncated and not truncated distribution.

Within this frame, when tdlnorm() is called with L and U set to their default values it behaves as stats::dlnorm()


tdlnorm(x, meanlog = 2, sdlog = 1)

but when called with different settings for L and U; such as:

tdlnorm(x, meanlog = 2, sdlog = 1, L = 5, U = 20)

it behaves as a lognormal density left truncated at L=5 and right truncated at U=20.

This goal could be achieved by writing a tdlnorm() as:


tdlnorm <-  function (x, meanlog = 0, sdlog = 1,  L = 0,  H = Inf) 
 {
  
  density <-  
     stats::dlnorm(x, meanlog=meanlog, sdlog=sdlog)/
        (
        stats::plnorm(H, meanlog=meanlog, sdlog=sdlog)-  
        stats::plnorm(L, meanlog=meanlog, sdlog=sdlog)  
          )
              
   return(density)
 }

That returns the same results as function truncdist::dtrunc()


tdlnorm(x, 1, 2, L= 5, H = 20)


[1] 0.11523518 0.07297029 0.05109291


dtrunc(x, spec = "lnorm", a = 5, b = 20, meanlog = 1, sdlog = 2)


[1] 0.11523518 0.07297029 0.05109291

As this function clearly works, next step could be to write something similar for other distributions such as weibull, gumbel or gamma. We have to admit that all of this may become as quite time consuming.

A different approach could be to define a different function, dtruncate(), taking the name of a density distribution as an argument and returning a second function that computes density values for the truncated distribution:


dtruncate <-
  function (dist, pkg = stats){ 
    
    dist <- deparse(substitute(dist))
    envir <- as.environment(paste("package", as.character(substitute(pkg)), sep = ":"))
    
    ddist=paste("d", dist, sep = "") 
    pdist=paste("p", dist, sep = "")
        
    #gets density function                    
    ddist <-  get(ddist, mode = "function", envir = envir)
    #gets argument of density function
    dargs <- formals(ddist)
   
    #gets probability function                    
    pdist <- get(pdist, mode = "function", envir = envir)
    #gets argument of probability function
    pargs <- formals(pdist)
        
    #Output function starts here
    density <- function () 
    {
      #check L U 
      if (L > U) stop("U must be greater than or equal to L")
      
      #gets density arguments
      call <- as.list(match.call())[-1]
      
      #all unique arguments belonging to density and ddist 
      dargs <- c(dargs[!is.element(names(dargs), names(call))], call[is.element(names(call), names(dargs))])
      
      #all unique arguments belonging to probability and pdist 
      pargs <- c(pargs[!is.element(names(pargs), names(call))], call[is.element(names(call), names(pargs))])
      
      #select x only where defined by L and U
      dargs$x <- x[x > L & x <= U]
      
      #define arguments for pdist in L and U
      pUargs <-  pLargs <- pargs 
      pUargs$q <- U
      pLargs$q <- L
      
      #initialize output
      density <- numeric(length(x))
      
      #standard method for computing density values for truncated distributions
      density[x > L & x <= U] <-  do.call("ddist", as.list(dargs)) / (do.call("pdist", as.list(pUargs)) - do.call("pdist", as.list(pLargs)))
      
      #returns density values for truncated distributions
      return(density)
      
    }
    
    #add to density function formals L and U with values as passed with dtruncate
    formals(density) <-  c(formals(ddist), eval(substitute(alist(L=-Inf, U=Inf))))
    #return density function
    return(density)
  }

with:

envir: the environment plnorm() and dlnorm() belong to

We can now define a new tdlnorm() as:


tdlnorm <- dtruncate(dist = lnorm)

and use it as:


p <- ppoints(1000)
x <- qlnorm(p, meanlog = 1, sdlog = 1)
d <- tdlnorm(x, meanlog = 1, sdlog = 1)
dt <- tdlnorm(x, meanlog = 1, sdlog = 1, L= 5, U = 10)
plot(x, dt, type = "n", xlab = "Quantile", ylab = "Density")
points(x, dt, type = "s", col = "red", lwd = 2)
points(x, d, type = "s", col = "darkblue", lwd = 2)
title("Truncated and not-truncated log-normal")
grid()

clearly, tdlnorm() returns the same results as truncdist::dtrunc():


dtrunc(x = 5:8, spec = "lnorm", a = 5, b = 10, meanlog = 1, sdlog = 1)


[1] 0.3791833 0.2780953 0.2084880 0.1593537


tdlnorm(x = 5:8, meanlog = 1, sdlog = 1, L = 5, U = 10)


[1] 0.0000000 0.2780953 0.2084880 0.1593537

Moreover, our newly created tdlnorm() function takes as argument meanlog and sdlog, as well as lower.tail = TRUE, log.p = FALSE, as stats::plnorm() does, despite these arguments were not mentioned when calling dtruncate().

Now that we have dtruncate(), the same exercise can be replicate, at no extra programming effort, to any density function:


dweibull <-  dtruncate(dist = weibull)
dgpd <- dtruncate(gpd, pkg = evd)

Functions with memory

When talking about clousures, we used the referencing environment of f() to hold any value passed by g(). Similarly, we can use the same environment to keep a state across multiple executions of f().

Example: Track how many times a function is called

We may consider a function that simply returns the current date but tracks how many times it has ben called:


g <- function(){
 i <- 0
 f <- function(){
    i <<- i+1
    cat("this function has been called ", i, " times", "\n")
    date()  
}}

f <- g()
#first call
f()


this function has been called  1  times


[1] "Sat Oct 10 01:15:39 2020"


#second call
f()


this function has been called  2  times


[1] "Sat Oct 10 01:15:39 2020"


#third call
f()


this function has been called  3  times


[1] "Sat Oct 10 01:15:39 2020"

Note that, we used the <<- operator that assigns in the parent environment. This is equivalent to:


assign("i", i+1, envir = parent.env(environment())):

Example: Avoid re-calculate previous results

We can use the referencing environment of a function to keep previous returned values of the same function. By using this idea, we could try to avoid re-calculating previously computed values.

Supose we want a function that takes n as argument and returns all primes less or equal to n. This function already exists within library pracma:


library(pracma)
primes(n = 9)


[1] 2 3 5 7

In order to keep previous results we can define a function makefprime() that, when called, returns a second function with an environment .env attached:


makefprime = function () {
  .env = new.env()
  f = function(n) {
    symbol = paste("p", n, sep = ".")
    if (exists(symbol, envir = .env)){
      prime = get(symbol, envir = .env)
    } 
    else {prime = primes(n = n)
      assign(symbol , prime, envir = .env)
    }
    prime
   }  
f
}

We can now create a function named for instance fprimes() by calling function makefprime() which returns identical results when compared with primes().


fprimes = makefprime()
fprimes(10)


[1] 2 3 5 7

Now suppose we need to compute prime numbers several time within a working session or a for loop. When n is large, this computation may require a substancial ammount of time.


system.time({p1 = fprimes(n = 10^7)})


   user  system elapsed 
  0.338   0.041   0.383

Nevertheless, because of the way we defined fprimes(), second time this function is called with n = 10^7 computing time is practicaly zero as the function reuse previously computed results as stored in environment .env.


system.time({p2 = fprimes(n = 10^7)})


   user  system elapsed 
      0       0       0

Example: Add to an existing plot

As a last example, we may want to have a function that add to an existing plot any time a new observation becomes available, using the same mechanism, we can define a new_plot() function that instances a new plot the first time it is called:


new_plot = function(){
  xx = NULL
  yy = NULL
  function(x, y, ...) {
  xx <<- c(xx, x)
  yy <<- c(yy, y)
  plot(xx, yy, ...)
}}

this_plot <- new_plot()

and add to the same plot at each next call:


this_plot (1:4, c(2, 3, 1, 5), type = "b")

Figure 2: first call


this_plot(5, 3, type = "b")

Figure 3: second call


this_plot(6, 3, type = "b", col = "red")

Figure 4: third call

Anonymous functions

In R, we usually assign functions to variable names. Nevertheless, functions can exists without been assigned to symbol. Functions that don’t have a name are called anonymous functions.

We can call anonymous functions directly, as we do with named functions, but the code is a little unusual as we have to use brackets both to include the whole function definition and to pass arguments to the function:


(function(x) x + 3)(10)


[1] 13

Note that this is exactly the same as:


f <- function(x) x + 3
f(10)


[1] 13

We use anonymous functions when it’s not worth the effort of assigning functions to a name. We could plot a function s(x) by:


s <- function(x) sin(x)/sqrt(x)
integrate(s,  0, 4)


1.609553 with absolute error < 4.5e-11

or alternatively by:


integrate(function(x) sin(x)/sqrt(x),  0, 4)


1.609553 with absolute error < 4.5e-11

in this case function(x) sin(x)/sqrt(x) is an example of anonymous function.

Finally, anonymous functions are, by all rights, normal R functions as they have formals(), a body(), and a parent environment():


formals(function(x) x+1)

$x


body(function(x) x+1)


x + 1


environment(function(x) x+1)


<environment: R_GlobalEnv>

Lists of functions

Functions, as any type of R object, can be stored in a list.


fun_list <- list(m = mean , s = sd)

This makes it easier to work with groups of related functions.

Functions defined within a list are still accessible at least in three different ways:

using function with()


with (fun_list, m(x = 1:10))


[1] 5.5

by using the $ operator


fun_list$m( x = 1:10)


[1] 5.5

by attaching the list:


attach(fun_list)
m( x = 1:10)


[1] 5.5


detach(fun_list)

Lists of functions can be most useful when we want to apply all functions of the list to the same set of data.

We can achieve this goal in two logical steps.

We first define a function


fun <- function(f, ...){f(...)}

that takes a function f() as argument along with any other arguments ’‘...’’ and returns f(...). In practice:


fun(mean, x = 1:10, na.rm = TRUE)


[1] 5.5

Secondly, we apply function fun() to the list of functions. Arguments required by the functions stored in the list are passed by the ’‘...’’ argument:


lapply(fun_list, fun, x = 1:10)


$m
[1] 5.5

$s
[1] 3.02765

Under almost all circumstances, equivalent results can be achieved by using function do.call() within a call to lapply():


lapply(fun_list, do.call, list(x = 1:10, na.rm = T))


$m
[1] 5.5

$s
[1] 3.02765

the only difference being that arguments to functions within the list must be enclosed in a list too.

Example: Multiple Anderson-Darling tests

As a simple example we may want to compare the results of four Anderson-Darling type tests from the truncgof package applied to the same data.

We can define a list that holds these four functions and store it in the global environment:


require(truncgof, quietly = TRUE)
nor_test <- list(ad2.test = ad2.test, ad2up.test = ad2up.test, ad.test = ad.test, 
    adup.test = adup.test)

and, afterword, apply function fun() to each element of this list:


x <- rnorm(100, 10, 1)
m <-  mean(x)
s <- sd(x)
lapply(nor_test, fun, x , distn = "pnorm", list(mean = m, sd = s), sim = 100)


$ad2.test

    Quadratic Class Anderson-Darling Test

data:  x
AD2 = 0.77067, p-value = 0.03

treshold = -Inf, simulations: 100


$ad2up.test

    Quadratic Class Anderson-Darling Upper Tail Test

data:  x
AD2up = 2.7084, p-value = 0.61

treshold = -Inf, simulations: 100


$ad.test

    Supremum Class Anderson-Darling Test

data:  x
AD = 1.6421, p-value = 0.7
alternative hypothesis: two.sided

treshold = -Inf, simulations: 100


$adup.test

    Anderson-Darling Upper Tail Test

data:  x
ADup = 10, p-value = 0.51
alternative hypothesis: two.sided

treshold = -Inf, simulations: 100

Example: Summary statistics

We may want to define a function that returns some specific statistics for a given set of variables in the form of a data.frame.


this_summary <- as.data.frame(rbind(vapply(trees, mean, FUN.VALUE = numeric(1)), 
    vapply(trees, sd, FUN.VALUE = numeric(1)), vapply(trees, function(x, 
        ...) {
        diff(range(x))
    }, FUN.VALUE = numeric(1))))

row.names(this_summary) <- c("mean", "sd", "range")
this_summary


          Girth    Height   Volume
mean  13.248387 76.000000 30.17097
sd     3.138139  6.371813 16.43785
range 12.300000 24.000000 66.80000

We may achieve the same result by writing a more general function that will work with any kind of statistics as long as they return a single value:


my_summary <- function(x, flist) {
    f <- function(f, ...) f(...)
    g <- function(x, flist) {
        vapply(flist, f, x, FUN.VALUE = numeric(1))
    }
    df <- as.data.frame(lapply(x, g, flist))
    row.names(df) <- names(flist)
    df
}

my_summary(cars, flist = list(mean = mean, stdev = sd, cv = function(x, 
    ...) {
    sd(x, ...)/mean(x, ...)
}))


           speed       dist
mean  15.4000000 42.9800000
stdev  5.2876444 25.7693775
cv     0.3433535  0.5995667

fapply

Working with this mind set we may even define a function fapply() that applies all functions of a list to the same set of arguments


fapply <- function(X, FUN, ...){
  lapply(FUN, function(f, ...){f(...)}, X, ...)
}