Code Optimisation

Timing R commands

The function system.time returns the computing time of R commands.

In [1]:
system.time({x = qt(runif(1e4), df = 0.5)}) # Compute quantiles of the t dist'n
   user  system elapsed 
   0.14    0.00    0.14 

Array pre-allocation

In the above example, the variable x is a vector of 1e4 elements. When calling the function qt, some space in the computer memory is pre-allocated to hold these 1e4 elements. This takes time as can be seen in the example below.

In [2]:
n = 3e4
u = runif(n)

Pre-allocate memory

In [3]:
system.time({
  x1 = numeric(n) # Here we pre-allocate memory
  for (i in 1:n) x1[i] = qt(u[i], df = 0.5)
})
   user  system elapsed 
   0.49    0.01    0.50 

Without memory pre-allocation

In [4]:
system.time({
  x2 = c() # x2 is initially empty
  for (i in 1:n) x2 = c(x2, qt(u[i], df = 0.5)) # Join x2 with the most recent calculation
})
   user  system elapsed 
   1.73    0.00    1.73 

In the case for x2, the memory is allocated within the for loop so there are n such operations. This slowed down the code.

Vectorising code

The for loop

Preferably loops should be avoided when possible. Consider the following example.

In [5]:
n = 5e4
u = runif(n)
qf = qp = numeric(n) # Placeholder for storing the output of the commands below

## Loop version
system.time({
  for (i in 1:n) {
    qf[i] = qt(u[i], df = 0.5)
  }
})

## vectorised version
system.time({
  qp = qt(u, df = 0.5)
})
   user  system elapsed 
   0.81    0.00    0.81 
   user  system elapsed 
    0.7     0.0     0.7 

The reason for the slower code in the loop is the following. Every line of code must first be translated into native computer language. In the parallel version, the translation only occurs once. In the the loopy version it occurs n times. This costs time.

Using apply and lapply

One benefit of for loops is that the code is sometimes easier to read. In these cases it is not obvious how to vectorise the code. Consider for example the following:

In [6]:
# Code to compute the average length for each variable and species using a loop.

data(iris3) # An array of dimension 50 by 4 by 3
avglen_f = matrix(0, 4, 3)
for (i in 1:4) {
  for (j in 1:3) {
    avglen_f[i, j] = mean(iris3[, i, j])
  }
}

The above code can be vectorised (and simplified) with the aid of the function apply as follows

In [7]:
### Code to compute the average length for each variable and species using apply.
avglen_v = apply(iris3, c(2, 3), mean)

In words the above code says "apply the function mean cycling through the dimensions 2 and 3 of the array iris3." The function mean will be called 6 times without using R loops. The input to the function each time will be the vector iris3[, 1, 1], iris3[, 1, 2], etc. Note that the function to be applied need not have a name, e.g., here is the same result using an anonymous function

In [8]:
### Code to compute the average length for each variable and species by applying an anonymous function.
avglen_p = apply(iris3, c(2, 3), function(x) {sum(x)/length(x)})

In the above x denotes the vector input to the anonymous function. The anonymous function is quite flexible in the sense that the variables used in the body of the function can come from the current environment. In the following example the variable p is not inputted to the function but its value is taken from the current environment.

In [9]:
n = 15
k = 4
x = matrix(rnorm(n*k), n, k) # A matrix of k different standard normal samples of size n
p = c(.10, .50, .90) # A vector of probabilities
z = apply(x, 2, function(xi) quantile(xi, probs = p)) # Compute the quantiles for each sample. Note that p is not an argument to the function.

The function apply operates on arrays, i.e., objects whose dimension is not NULL and returns an array or a vector. The function lapply (and derivatives) operate on lists, or objects which can be coerced to lists. The loop cycles through each component of the list.

In [10]:
### Return the type of each variable in the data frame
lapply(ToothGrowth, class)
n = 15
k = 4
x = matrix(rnorm(n*k), n, k) # A matrix of k different standard normal samples of size n
p = c(.10, .50, .90) # A vector of probabilities
apply(x, 2, function(xi) quantile(xi, probs = p))
as.list(1:k)
lapply(1:k, function(i) quantile(x[, i], probs = p))
sapply(1:k, function(i) quantile(x[, i], probs = p)) # a neater version of lapply that displays the output as a single array
$len
'numeric'
$supp
'factor'
$dose
'numeric'
10%-0.8102628-0.5512272-1.7612170-1.277195
50% 0.4979250 0.1577357 0.1119588-0.369030
90% 1.4027805 0.8512106 1.0646683 1.936443
  1. 1
  2. 2
  3. 3
  4. 4
  1. 10%
    -0.810262807188327
    50%
    0.497924999393728
    90%
    1.40278047949047
  2. 10%
    -0.551227174222887
    50%
    0.157735670608364
    90%
    0.851210599591786
  3. 10%
    -1.76121700108509
    50%
    0.111958787802228
    90%
    1.06466834066878
  4. 10%
    -1.27719456718345
    50%
    -0.369030035064022
    90%
    1.93644332438826
10%-0.8102628-0.5512272-1.7612170-1.277195
50% 0.4979250 0.1577357 0.1119588-0.369030
90% 1.4027805 0.8512106 1.0646683 1.936443

1

Consider the ToothGrowth data.

  1. Compute using apply, lapply etc the average growth (variable len) for the two different supplements (variable supp).
  2. Compute using apply, lapply etc the average growth (variable len) for each combination of supplement (variable supp) and dose (variable dose).

Solution

2

Consider the topo data provided by the package MASS containing the altitude z at 52 coordinates x and y.

  1. Use for loops efficiently to compute the matrix of pairwise distances between the locations given by the x, y coordinates. Compute only the lower triangular part of the matrix.
  2. The spherical correlation structure is given by the function $$r(d; \theta) = \begin{cases} 1 - 1.5 \dfrac{d}{\theta} + 0.5 \biggl( \dfrac{d}{\theta} \biggr)^3 & \text{if $d < \theta$,} \\ 0 & \text{otherwise.} \end{cases}$$ where $d$ is the distance between two coordinates and $\theta$ is a parameter that controls the strength of the correlation. Write efficient parallel code to compute the correlation matrix for the sampling coordinates of the topo data set at $\theta = 3$.

Solution

Further reading

  • See ?Rprof for profiling R commands.
  • Use the parallel package for running commands trully in parallel on a computer with multiple cores.
  • Read the R interface to C and FORTRAN document for using compiled code in those languages within R.