25  The apply() family in R

R offers a suite of functions known as the apply() family, designed to streamline iterative tasks that would otherwise require explicit loops. Code written with these functions is usually more concise and, in some cases, faster than the equivalent loop.

The apply() family is part of base R and includes functions for manipulating slices of data from matrices, arrays, lists, and data frames in a repetitive way. They let you traverse the data without writing explicit loop constructs. Each function takes an input list, matrix, or array and applies a specified function to it, optionally passing along one or more extra arguments.

These functions are the foundation of more complex operations and enable the execution of tasks with minimal code. Specifically, the family comprises the apply(), lapply(), sapply(), vapply(), mapply(), rapply(), and tapply() functions.
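
To make the contrast with explicit loops concrete, here is a minimal sketch that squares a few numbers twice, once with a for loop and once with sapply(). The vector nums and the helper square() are made up for this illustration and are not part of the chapter's data.

```r
# Hypothetical input and helper, used only for this illustration
nums <- c(2, 4, 6)
square <- function(v) v^2

# Loop version: pre-allocate the result, then fill it element by element
loop_result <- numeric(length(nums))
for (i in seq_along(nums)) {
    loop_result[i] <- square(nums[i])
}

# apply-family version: one call, no index bookkeeping
sapply_result <- sapply(nums, square)

# Both approaches give the same values
identical(loop_result, sapply_result)
```

The loop needs a pre-allocated result and an index variable, while the sapply() call expresses the same idea in a single line.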

25.1 apply()

  • Input class: matrix, array, data.frame (coerced to a matrix)
  • Output class: vector, matrix or array, list (when the results differ in length)

This function gives the apply() family its name. It applies a function over margin 1 (rows), margin 2 (columns), or both, c(1, 2), in which case the function is applied to each individual cell.

# Create a 3x5 matrix with normally distributed random variables generated by rnorm()
set.seed(123)  # Set seed for reproducibility
m <- matrix(rnorm(n = 15, mean = 100, sd = 30), nrow = 3, ncol = 5)
m
          [,1]     [,2]      [,3]      [,4]      [,5]
[1,]  83.18573 102.1153 113.82749  86.63014 112.02314
[2,]  93.09468 103.8786  62.04816 136.72245 103.32048
[3,] 146.76125 151.4519  79.39441 110.79441  83.32477

With the apply() function, we can find the mean of each row in m:

apply(X = m, MARGIN = 1, FUN = mean)
[1]  99.55635  99.81288 114.34536

We can verify this result using the rowMeans() function.

rowMeans(m)
[1]  99.55635  99.81288 114.34536

Similarly, to calculate the mean of each column, we can use the apply() function as follows.

apply(X = m, MARGIN = 2, FUN = mean)
[1] 107.68055 119.14861  85.09002 111.38234  99.55613

Verify this result using colMeans():

colMeans(m)
[1] 107.68055 119.14861  85.09002 111.38234  99.55613
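
The examples above use only MARGIN = 1 and MARGIN = 2 with mean(). The following sketch, built on a small made-up matrix called scores (an assumption for illustration, not the matrix m), shows three further patterns: MARGIN = c(1, 2), passing an extra argument on to the function, and a function that returns more than one value.

```r
# Hypothetical 2x3 matrix with one missing value
scores <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 2)

# MARGIN = c(1, 2): the function is called once per cell
scaled <- apply(scores, c(1, 2), function(v) v * 10)

# Extra arguments after FUN are passed on to it (here: na.rm = TRUE)
row_means <- apply(scores, 1, mean, na.rm = TRUE)

# A function returning two values per column yields a 2-row matrix
col_ranges <- apply(scores, 2, range, na.rm = TRUE)
```

Here scaled keeps the 2x3 shape of scores, row_means is a plain vector with one value per row, and col_ranges holds the minimum of each column in its first row and the maximum in its second.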

Exercise E

Q1

Use apply() to calculate the standard deviation of each column in m.

Q2

Use apply() to compute the mean of each column in the mtcars dataset.

25.2 lapply()

  • Input class: list, vector
  • Output class: list

It applies a function to each element of a list or vector. First, let us examine how it is used with vectors.

# Create a character vector with city names
cities <- c("New York", "Philadelphia", "Boston")

# Apply the nchar() function to each element in cities to count the number of characters
lapply(X = cities, FUN = nchar)
[[1]]
[1] 8

[[2]]
[1] 12

[[3]]
[1] 6

To convert the returned values into a vector, use unlist().

unlist(lapply(X = cities, FUN = nchar))
[1]  8 12  6

Now, let us examine how it works with lists.

# create a list
l <- list(a = c(1:3), b = c(4:6), c = c(7:9))
l
$a
[1] 1 2 3

$b
[1] 4 5 6

$c
[1] 7 8 9

Calculate the sum for each element in the list:

# apply the sum function to each element in the list 'l'
lapply(X = l, FUN = sum)
$a
[1] 6

$b
[1] 15

$c
[1] 24

We can also define our own functions to be applied.

# select the first element from each list item 
lapply(l, function(z) { z[1] })
$a
[1] 1

$b
[1] 4

$c
[1] 7
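
Because a data frame is itself a list of columns, lapply() can also be used column by column. The sketch below assumes a small invented data frame, measures, and shows how extra arguments are forwarded to the applied function, as well as how the subsetting operator "[" can stand in for an anonymous function.

```r
# Hypothetical data frame with missing values
measures <- data.frame(height = c(150, 160, NA), weight = c(50, NA, 70))

# A data frame is a list of columns, so lapply() visits each column;
# na.rm = TRUE is forwarded to mean()
col_means <- lapply(measures, mean, na.rm = TRUE)

# The subsetting operator "[" can be passed as the function:
# this picks the first element of each column
first_values <- lapply(measures, "[", 1)
```

Both results are lists named after the columns, height and weight.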

Exercise F

Q1

Calculate the mean for each vector in the list l.

25.3 sapply()

  • Input class: list, vector
  • Output class: vector, matrix, list (when the results cannot be simplified to either of the first two forms)

sapply() works like lapply(), but it tries to simplify the result into a vector or matrix, which is usually easier to read and work with.

# create a list of temperature measurements over 5 days
temp <- list( 
    c(3, 7, 9, 6, -1), # temperatures for day 1
    c(6, 9, 12, 13, 5), # temperatures for day 2
    c(4, 8, 3, -1, -3), # temperatures for day 3
    c(1, 4, 7, 2, -2), # temperatures for day 4
    c(5, 7, 9, 4, 2) # temperatures for day 5
)
temp
[[1]]
[1]  3  7  9  6 -1

[[2]]
[1]  6  9 12 13  5

[[3]]
[1]  4  8  3 -1 -3

[[4]]
[1]  1  4  7  2 -2

[[5]]
[1] 5 7 9 4 2

Let us determine the minimum temperature for each day.

# determine the minimum temperature for each day (returns a vector)
sapply(X = temp, FUN = min)
[1] -1  5 -3 -2  2

Compare the outputs of sapply() and lapply():

# determine the minimum temperature for each day (returns a list)
lapply(X = temp, FUN = min)
[[1]]
[1] -1

[[2]]
[1] 5

[[3]]
[1] -3

[[4]]
[1] -2

[[5]]
[1] 2

To obtain a vector from the output of lapply(), we need to employ unlist().

unlist(lapply(X = temp, FUN = min))
[1] -1  5 -3 -2  2

What if the function supplied to sapply() returns more than one value?

# define a function that returns the minimum and maximum values of a vector
getMinMax <- function(x) { 
    return(c(min = min(x), max = max(x)))
}

# apply getMinMax to each element of 'temp' using sapply to return a well-formatted matrix
sapply(X = temp, FUN = getMinMax)
    [,1] [,2] [,3] [,4] [,5]
min   -1    5   -3   -2    2
max    9   13    8    7    9

If we use lapply():

lapply(X = temp, FUN = getMinMax)
[[1]]
min max 
 -1   9 

[[2]]
min max 
  5  13 

[[3]]
min max 
 -3   8 

[[4]]
min max 
 -2   7 

[[5]]
min max 
  2   9 

After unlist():

unlist(lapply(X = temp, FUN = getMinMax))
min max min max min max min max min max 
 -1   9   5  13  -3   8  -2   7   2   9 

With lapply(), getMinMax() yields a list in which each element holds the minimum and maximum of one vector in temp. The list keeps all the information but is harder to read at a glance. Applying unlist() flattens it into a single vector, at the cost of losing the grouping by day.

sapply(), on the other hand, directly returns a neatly formatted matrix: each row holds either the minimum or the maximum, and each column corresponds to one vector in temp. This layout is more convenient for quick analysis and visualization.
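
sapply() cannot always simplify. The sketch below uses an invented list, readings, to show when a vector comes back, when sapply() falls back to a list, and how simplify = FALSE switches the simplification off.

```r
# Hypothetical readings with a different number of values per element
readings <- list(1:3, 1:5, 1:2)

# Every result has length 1: simplified to a vector
lens <- sapply(readings, length)

# Results of different lengths cannot be simplified: a list comes back
big <- sapply(readings, function(v) v[v > 1])

# simplify = FALSE always returns a list, like lapply()
lens_list <- sapply(readings, length, simplify = FALSE)
```

Since the shape of the result depends on the data, it is worth checking with is.list() or str() when the input is not fully under your control.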

Exercise G

Q1

Obtain the average temperature on each day in temp.

25.4 vapply()

  • Input class: list, vector
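  • Output class: vector, matrix or array (an error is raised if a result does not match FUN.VALUE)

It works like sapply(), but you must declare the type and length of each result through the FUN.VALUE argument. R checks every result against this template and stops with an error on a mismatch, which makes vapply() safer in scripts and functions. It can also be slightly faster, because R does not have to work out the output format afterwards.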

# tells vapply we're expecting 1 number for each element
vapply(X = cities, FUN = nchar, FUN.VALUE = numeric(1))
    New York Philadelphia       Boston 
           8           12            6 

# each result is a named numeric vector of length 2 (minimum and maximum)
vapply(X = temp, FUN = getMinMax, FUN.VALUE = c(min = 0, max = 0))
    [,1] [,2] [,3] [,4] [,5]
min   -1    5   -3   -2    2
max    9   13    8    7    9
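
The practical difference between sapply() and vapply() shows up when the results are not what was expected. The following sketch uses an invented list, mixed, and wraps the failing call in tryCatch() purely so that the example runs to completion.

```r
# Hypothetical mixed input: the second element is not numeric
mixed <- list(1:3, "a")

# sapply() silently returns a character vector here
loose <- sapply(mixed, function(v) v[1])

# vapply() stops with an error because "a" is not numeric;
# tryCatch() turns the error into a message for demonstration
checked <- tryCatch(
    vapply(mixed, function(v) v[1], FUN.VALUE = numeric(1)),
    error = function(e) "vapply() refused the result"
)

# With empty input, sapply() returns an empty list,
# while vapply() still returns a vector of the declared type
empty_s <- sapply(list(), length)
empty_v <- vapply(list(), length, FUN.VALUE = integer(1))
```

The empty-input case is a common source of surprises inside functions, since code that expects a vector may receive a list from sapply().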

25.5 mapply()

  • Input class: list, vector
  • Output class: vector, matrix, list (when the results cannot be simplified)

mapply() is a multivariate version of sapply(): the function you supply (the first argument of mapply()) takes several arguments, and mapply() calls it with the first elements of each input, then the second elements, and so on.

mapply(FUN = sum, 1:5, 5:1, -5:-1)
[1] 1 2 3 4 5

This calls sum() five times with three arguments each time: sum(1, 5, -5), sum(2, 4, -4), …, sum(5, 1, -1).

Let us see another example:

mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

This is equivalent to calling rep() four times and collecting the results in a list:

list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4
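
Two further arguments of mapply() are often useful: MoreArgs, for arguments that should be the same in every call, and SIMPLIFY, which controls whether the result is simplified. The sketch below uses small anonymous functions invented for this illustration.

```r
# Element-wise pairing: FUN is called as FUN(1, 4), FUN(2, 5), FUN(3, 6)
pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# MoreArgs holds arguments that stay the same in every call
shifted <- mapply(function(x, y, offset) x + y + offset,
                  1:3, 4:6, MoreArgs = list(offset = 100))

# SIMPLIFY = FALSE keeps the results as a list even when a
# vector or matrix would be possible
as_list <- mapply(function(x, y) x + y, 1:3, 4:6, SIMPLIFY = FALSE)
```

Note that the argument names are upper case (FUN, MoreArgs, SIMPLIFY), unlike the simplify argument of sapply().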

25.6 rapply()

  • Input class: list
  • Output class: vector (the default, how = "unlist"), list (how = "list" or how = "replace")

It applies a function recursively to each element of a nested list. It is used less often than the other members of the family. For example,

l1 <- list(a = list("A", "B", "C"), b = c(1, 100), c = list("Hey"))
l1
$a
$a[[1]]
[1] "A"

$a[[2]]
[1] "B"

$a[[3]]
[1] "C"


$b
[1]   1 100

$c
$c[[1]]
[1] "Hey"

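We apply a function that appends "!" to each text string and adds 1000 to each number in this list. Because rapply() flattens its result into a single vector by default, the numbers appear as character strings in the output.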

# create a custom function to supply to rapply()
addSomething <- function(x) {
    if (is.character(x)) {
        # character element: append "!"
        return(paste0(x, "!"))
    } else {
        # any other element: add 1000
        return(x + 1000)
    }
}

rapply(object = l1, f = addSomething)
    a1     a2     a3     b1     b2      c 
  "A!"   "B!"   "C!" "1001" "1100" "Hey!" 

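By default rapply() flattens its result into a single vector, which forces every value to a common type. The how and classes arguments offer an alternative that keeps the nested structure. The sketch below uses an invented list, nested, rather than l1, so that it can run on its own.

```r
# Hypothetical nested list mixing character and numeric elements
nested <- list(a = list("A", "B"), b = c(1, 100))

# Only numeric elements are changed; everything else is left alone,
# and the nested list structure is preserved
bumped <- rapply(nested, function(x) x + 1000,
                 classes = "numeric", how = "replace")

# how = "list" also keeps the structure, but elements whose class
# does not match are replaced by deflt (here NA)
shouted <- rapply(nested, function(x) paste0(x, "!"),
                  classes = "character", deflt = NA, how = "list")
```

With how = "replace", bumped$b stays numeric, so no type-checking branch is needed inside the applied function.
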
25.7 tapply()

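  • Input class: vector
  • Output class: array

It splits a vector into groups defined by one or more grouping variables (factors) and applies a function to each group.

# create vector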
x <- c(1, 2, 3, 10, 20, 30, 100, 200, 300)

#create grouping variable (3 groups)
groups <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")

tapply(X = x, INDEX = groups, FUN = mean)
  a   b   c 
  2  20 200 

When working with a data frame:

head(ChickWeight)
# There are four types of diets (1, 2, 3, 4)
table(ChickWeight$Diet)

  1   2   3   4 
220 120 120 118 

# mean weight for each diet
tapply(ChickWeight$weight, ChickWeight$Diet, mean)
       1        2        3        4 
102.6455 122.6167 142.9500 135.2627
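
tapply() also accepts several grouping variables at once when they are supplied as a list, in which case the result is a table with one dimension per grouping variable. The vectors sales, region, and year below are invented for this sketch.

```r
# Hypothetical sales figures with two grouping variables
sales  <- c(10, 20, 30, 40, 50, 60)
region <- c("east", "east", "west", "west", "east", "west")
year   <- c("2023", "2024", "2023", "2024", "2023", "2024")

# One cell per region/year combination (rows: region, columns: year)
by_region_year <- tapply(sales, list(region, year), mean)

# Individual cells can be looked up by name
by_region_year["east", "2023"]
```

Combinations that do not occur in the data would show up as NA in the resulting matrix. In this example all four combinations are present.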