Functions + Iteration

October 7 + 9 + 16, 2024

Jo Hardin

Agenda 10/7/24

  1. R functions
  2. The map() function

Functions

Function components

Here is a simple function to compute the absolute value.

my_abs <- function(x){
  return(ifelse(x >= 0, x, -1*x))
}

my_abs(-3)
[1] 3
my_abs(c(-2, 5))
[1] 2 5
  • name: my_abs
  • arguments: x
  • body: everything inside the { }
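The three components can also be inspected from R itself. This is a small sketch using base R's formals() and body(); it redefines my_abs without return() only so that the block runs on its own.

```r
# Same function as on the slide, redefined so this block is self-contained
my_abs <- function(x) {
  ifelse(x >= 0, x, -1 * x)
}

names(formals(my_abs))  # the argument names
body(my_abs)            # the code inside the { }
```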

Ordering and arguments

my_power <- function(x, y){
  return(x^y)
}

my_power(x = 2, y = 3)
[1] 8
my_power(y = 3, x = 2)
[1] 8
my_power(2, 3)
[1] 8
my_power(3, 2)
[1] 9
  • When calling the function, if you don’t name the arguments, R assumes that you passed them in the order defined inside the function.
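Named and unnamed arguments can be mixed in one call. As a sketch of how that plays out: R matches the named arguments first and then fills the remaining arguments by position.

```r
my_power <- function(x, y) {
  x^y
}

# y is matched by name; the unnamed 2 fills the remaining argument, x
my_power(y = 3, 2)

# x is matched by position; y by name
my_power(2, y = 3)
```

Both calls compute 2^3.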

Function defaults

my_power <- function(x, y){
  return(x^y)
}

What will happen when I run the following code?

my_power(3)
Error in my_power(3): argument "y" is missing, with no default

Function defaults

my_power <- function(x, y = 2){
  return(x^y)
}

What will happen when I run the following code?

my_power(3)
[1] 9

Function defaults

my_power <- function(x, y = 2){
  return(x^y)
}

What will happen when I run the following code?

my_power(2, 3)
[1] 8

Function defaults

my_power <- function(x = 2, y = 3){
  return(x^y)
}

What will happen when I run the following code?

my_power()
[1] 8

Returning a value

average1 <- function(x, remove_nas) {
    sum(x, na.rm = remove_nas)/sum(!is.na(x))
}

average2 <- function(x, remove_nas) {
    return(sum(x, na.rm = remove_nas)/sum(!is.na(x)))
}

average3 <- function(x, remove_nas = TRUE) {
    sum(x, na.rm = remove_nas)/sum(!is.na(x))
}
some_data <- c(3, NA, 2, 13, 2, NA, 47)

average1(some_data)
Error in average1(some_data): argument "remove_nas" is missing, with no default
average1(some_data, remove_nas = TRUE)
[1] 13.4
average2(some_data)
Error in average2(some_data): argument "remove_nas" is missing, with no default
average2(some_data, remove_nas = TRUE)
[1] 13.4
average3(some_data)
[1] 13.4

Returning a value

  • without return(): the function returns the value of the last expression it evaluates. (If that last expression is an assignment using <-, the value is still returned, but it is not printed.)

  • with return(): the function returns the object given in the return() call and stops immediately. (Note: if you (accidentally?) have two return() calls, the function will return the object from the first return() call that is reached.)
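A short sketch of both behaviors. The function names double_quietly and sign_label are made up for illustration.

```r
# Last expression is an assignment: the value comes back, but invisibly
double_quietly <- function(x) {
  y <- x * 2
}

double_quietly(3)        # prints nothing at the console
z <- double_quietly(3)   # the value can still be captured
z

# return() exits the function as soon as it is reached
sign_label <- function(x) {
  if (x < 0) {
    return("negative")   # the function stops here when x < 0
  }
  "non-negative"         # otherwise the last expression is returned
}

sign_label(-1)
sign_label(2)
```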

Control flow

Often in functions, you will want to execute code conditionally. Consider the if-else if-else structure.

if (logical_condition) {
    # some code
} else if (other_logical_condition) {
    # some other code
} else {
    # yet more code
}
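As a concrete instance of the template, here is a minimal sketch with a made-up function, describe_number. Note that if () expects a single TRUE or FALSE, so the function is written for one number at a time.

```r
describe_number <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else {
    "positive"
  }
}

describe_number(-4)
describe_number(0)
describe_number(7)
```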

Control flow

middle <- function(x) {
    mean_x <- mean(x, na.rm = TRUE)
    median_x <- median(x, na.rm = TRUE)
    seems_skewed <- (mean_x > 1.5*median_x) | (mean_x < (1/1.5)*median_x)
    if (seems_skewed) {
        median_x
    } else {
        mean_x
    }
}

Note that (mean_x > 1.5*median_x) | (mean_x < (1/1.5)*median_x) is a TRUE or FALSE question.

some_data <- c(3, NA, 2, 13, 2, NA, 47)


mean(some_data, na.rm = TRUE)
[1] 13.4
median(some_data, na.rm = TRUE)
[1] 3
middle(some_data)
[1] 3

Functions in the tidyverse

Don’t collapse

Functions that return the same number of rows as the original data frame are good to use inside mutate() and filter(). For example, you might want to capitalize the first word of every string:

first_upper <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}

first_upper(c("hello", "goodbye"))
[1] "Hello"   "Goodbye"
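Inside a pipeline, the same function slots into mutate(). This is a sketch on a small made-up tibble (greetings is not part of the course data); it assumes dplyr and stringr are installed.

```r
library(dplyr)
library(stringr)

first_upper <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}

# Hypothetical data for illustration
greetings <- tibble(word = c("hello", "goodbye", "ciao"))

# One output value per input row, so mutate() is the right verb
capitalized <- greetings |>
  mutate(word_cap = first_upper(word))

capitalized
```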

Functions in the tidyverse

Collapse

Functions that collapse into a single value will work well in the summarize() step of the pipeline. For example, you may want to calculate the coefficient of variation, which is the standard deviation divided by the mean.

cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(runif(100, min = 0, max = 50))
[1] 0.5983797
cv(runif(100, min = 0, max = 500))
[1] 0.5812336
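To see cv() in the summarize() step, here is a sketch on a small made-up tibble (scores is hypothetical); it assumes dplyr is installed.

```r
library(dplyr)

cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

# Hypothetical data for illustration
scores <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 10, 20, NA)
)

# One value per group, so summarize() is the right verb
cv_by_group <- scores |>
  group_by(group) |>
  summarize(cv_value = cv(value, na.rm = TRUE))

cv_by_group
```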

Functions summary

  • Functions can be used to avoid repeating code
  • Arguments allow us to specify the inputs when we call a function
  • If inputs are not named when calling the function, R uses the ordering from the function definition
  • All arguments without a default value must be specified when calling a function
  • Default arguments can be specified when the function is defined
  • The input to a function can be a function!
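The last point is the one that makes iteration with map() possible. A minimal sketch, using a made-up function summarize_with() that takes another function as its second argument:

```r
# f is itself a function; ... passes any extra arguments along to f
summarize_with <- function(x, f, ...) {
  f(x, ...)
}

some_data <- c(3, NA, 2)

summarize_with(some_data, mean, na.rm = TRUE)
summarize_with(some_data, max, na.rm = TRUE)
```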

Iterating functions

There will be times when you will need to iterate a function multiple times.

purrr for functional programming

functionals are functions that take a function as input and return a vector / list / data frame as output.

  • alternatives to loops

  • a functional is better than a for loop is better than while is better than repeat (in terms of clearly communicating intent; a functional is not necessarily faster than a well-written for loop)

Benefits

  • encourages function logic to be separated from iteration logic

  • can collapse results into vectors/data frames easily
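To make the contrast concrete, here is a sketch that computes the same thing with a for loop and with a functional. The list values is made up; purrr is assumed to be installed.

```r
library(purrr)

# Hypothetical input: a list of numeric vectors
values <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# for loop: the bookkeeping (allocate, index, assign) surrounds the real work
out_loop <- numeric(length(values))
for (i in seq_along(values)) {
  out_loop[i] <- mean(values[[i]])
}

# functional: only the "what" (mean) is stated; map_dbl() handles the iteration
out_map <- map_dbl(values, mean)

out_loop
out_map
```

The two results hold the same numbers; map_dbl() also carries over the names of values.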

Map

map() has (at least) two arguments, a data object and a function. It performs the function on each element of the object and returns a list. We can also pass additional arguments into the function.
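As a sketch of passing additional arguments: anything after the function in the map() call is handed to that function on every iteration. The list samples is made up.

```r
library(purrr)

# Hypothetical input with a missing value
samples <- list(c(1, NA, 3), c(4, 5, 6))

# na.rm = TRUE is passed along to mean() for each element
means <- map(samples, mean, na.rm = TRUE)
means
```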

variations of map_ functions

The map functions are named by the output they produce. For example:

  • map(.x, .f) is the main mapping function and returns a list

  • map_dbl(.x, .f) returns a numeric (double) vector

  • map_chr(.x, .f) returns a character vector

  • map_lgl(.x, .f) returns a logical vector

Note that the first argument is always the data object and the second argument is always the function you want to iteratively apply to each element in the input object.

map() in practice

map() variants (output)

triple <- function(x) x * 3
map(.x = c(1:3), .f = triple)
[[1]]
[1] 3

[[2]]
[1] 6

[[3]]
[1] 9
map_dbl(.x = c(1:3), .f = triple)
[1] 3 6 9
map_lgl(.x = c(1:3), .f = triple)
Error in `map_lgl()`:
ℹ In index: 1.
Caused by error:
! Can't coerce from a number to a logical.
map_lgl(.x = c(1, NA, 3), .f = is.na)
[1] FALSE  TRUE FALSE
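The character variant follows the same pattern. A short sketch with a made-up list, things; map_int() is another purrr variant (integer output) not listed on the earlier slide.

```r
library(purrr)

# Hypothetical input: three objects of different types
things <- list(1:3, letters, TRUE)

classes <- map_chr(things, class)    # one string per element
sizes   <- map_int(things, length)   # one integer per element

classes
sizes
```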

Agenda 10/9/24

  1. The map() function
  2. Iterating functions

fastfood dataset from openintro

library(openintro)
fastfood
# A tibble: 515 × 17
   restaurant item      calories cal_fat total_fat sat_fat trans_fat cholesterol
   <chr>      <chr>        <dbl>   <dbl>     <dbl>   <dbl>     <dbl>       <dbl>
 1 Mcdonalds  Artisan …      380      60         7       2       0            95
 2 Mcdonalds  Single B…      840     410        45      17       1.5         130
 3 Mcdonalds  Double B…     1130     600        67      27       3           220
 4 Mcdonalds  Grilled …      750     280        31      10       0.5         155
 5 Mcdonalds  Crispy B…      920     410        45      12       0.5         120
 6 Mcdonalds  Big Mac        540     250        28      10       1            80
 7 Mcdonalds  Cheesebu…      300     100        12       5       0.5          40
 8 Mcdonalds  Classic …      510     210        24       4       0            65
 9 Mcdonalds  Double C…      430     190        21      11       1            85
10 Mcdonalds  Double Q…      770     400        45      21       2.5         175
# ℹ 505 more rows
# ℹ 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>, sugar <dbl>,
#   protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>, salad <chr>

From TidyTuesday Fast food entree data on September 4, 2018.

Anonymous functions and shortcuts

  • use ~ to set a formula (when the function is neither a single name nor defined by function(...) { ...})
  • use .x inside the formula to refer to the current element of the input, as in map(.x = ..., .f = ~ ...)
map_dbl(.x = fastfood, .f = function(dog) mean(dog, na.rm = TRUE))
  restaurant         item     calories      cal_fat    total_fat      sat_fat 
          NA           NA  530.9126214  238.8135922   26.5902913    8.1533981 
   trans_fat  cholesterol       sodium   total_carb        fiber        sugar 
   0.4650485   72.4563107 1246.7378641   45.6640777    4.1371769    7.2621359 
     protein        vit_a        vit_c      calcium        salad 
  27.8910506   18.8571429   20.1704918   24.8524590           NA 
map_dbl(.x = fastfood,  .f = ~mean(.x, na.rm = TRUE))
  restaurant         item     calories      cal_fat    total_fat      sat_fat 
          NA           NA  530.9126214  238.8135922   26.5902913    8.1533981 
   trans_fat  cholesterol       sodium   total_carb        fiber        sugar 
   0.4650485   72.4563107 1246.7378641   45.6640777    4.1371769    7.2621359 
     protein        vit_a        vit_c      calcium        salad 
  27.8910506   18.8571429   20.1704918   24.8524590           NA 
map_dbl(.x = fastfood, .f = mean, na.rm = TRUE)
  restaurant         item     calories      cal_fat    total_fat      sat_fat 
          NA           NA  530.9126214  238.8135922   26.5902913    8.1533981 
   trans_fat  cholesterol       sodium   total_carb        fiber        sugar 
   0.4650485   72.4563107 1246.7378641   45.6640777    4.1371769    7.2621359 
     protein        vit_a        vit_c      calcium        salad 
  27.8910506   18.8571429   20.1704918   24.8524590           NA 
map_dbl(.x = fastfood, .f = mean)
  restaurant         item     calories      cal_fat    total_fat      sat_fat 
          NA           NA  530.9126214  238.8135922   26.5902913    8.1533981 
   trans_fat  cholesterol       sodium   total_carb        fiber        sugar 
   0.4650485   72.4563107 1246.7378641   45.6640777           NA    7.2621359 
     protein        vit_a        vit_c      calcium        salad 
          NA           NA           NA           NA           NA 
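Since R 4.1, base R also has a backslash shorthand for anonymous functions, \(x), which can be used in place of the ~ formula. The sketch below runs the three spellings on a small made-up data frame (measurements is hypothetical) to show they agree.

```r
library(purrr)

# Hypothetical data with missing values
measurements <- data.frame(
  height = c(150, NA, 170),
  weight = c(60, 70, NA)
)

full_fn   <- map_dbl(measurements, function(col) mean(col, na.rm = TRUE))
formula   <- map_dbl(measurements, ~ mean(.x, na.rm = TRUE))
backslash <- map_dbl(measurements, \(col) mean(col, na.rm = TRUE))

full_fn
```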

Agenda 10/16/24

  1. The map() function
  2. Iterating functions

The same thing, many ways

Note that .x is the name of the first argument in map() (.f is the name of the second argument).

# the task
map_dbl(fastfood, function(x) length(unique(x)))
 restaurant        item    calories     cal_fat   total_fat     sat_fat 
          8         505         113         117          80          40 
  trans_fat cholesterol      sodium  total_carb       fiber       sugar 
         10          52         197         103          19          31 
    protein       vit_a       vit_c     calcium       salad 
         71          22          24          27           1 
map_dbl(fastfood, function(unicorn) length(unique(unicorn)))
 restaurant        item    calories     cal_fat   total_fat     sat_fat 
          8         505         113         117          80          40 
  trans_fat cholesterol      sodium  total_carb       fiber       sugar 
         10          52         197         103          19          31 
    protein       vit_a       vit_c     calcium       salad 
         71          22          24          27           1 
map_dbl(fastfood, ~length(unique(.x)))
 restaurant        item    calories     cal_fat   total_fat     sat_fat 
          8         505         113         117          80          40 
  trans_fat cholesterol      sodium  total_carb       fiber       sugar 
         10          52         197         103          19          31 
    protein       vit_a       vit_c     calcium       salad 
         71          22          24          27           1 
map_dbl(fastfood, ~length(unique(..1)))
 restaurant        item    calories     cal_fat   total_fat     sat_fat 
          8         505         113         117          80          40 
  trans_fat cholesterol      sodium  total_carb       fiber       sugar 
         10          52         197         103          19          31 
    protein       vit_a       vit_c     calcium       salad 
         71          22          24          27           1 
map_dbl(fastfood, ~length(unique(.)))
 restaurant        item    calories     cal_fat   total_fat     sat_fat 
          8         505         113         117          80          40 
  trans_fat cholesterol      sodium  total_carb       fiber       sugar 
         10          52         197         103          19          31 
    protein       vit_a       vit_c     calcium       salad 
         71          22          24          27           1 
# not the task
map_dbl(fastfood, length)
 restaurant        item    calories     cal_fat   total_fat     sat_fat 
        515         515         515         515         515         515 
  trans_fat cholesterol      sodium  total_carb       fiber       sugar 
        515         515         515         515         515         515 
    protein       vit_a       vit_c     calcium       salad 
        515         515         515         515         515 
#error
map_dbl(fastfood, ~length)
Error in `map_dbl()`:
ℹ In index: 1.
ℹ With name: restaurant.
Caused by error:
! Can't coerce from a primitive function to a double.
#error
map_dbl(fastfood, length(unique()))
Error in unique.default(): argument "x" is missing, with no default
#error
map_dbl(fastfood, ~length(unique(x)))
Error in `map_dbl()`:
ℹ In index: 1.
ℹ With name: restaurant.
Caused by error in `.f()`:
! object 'x' not found

mapping to a data frame

It would be great if the results were a data frame! If the function outputs a data frame, then we can use list_rbind() and list_cbind() to create a data frame as the final map() output.

  • results as rows: map() |> list_rbind()
  • results as columns: map() |> list_cbind()
col_stats <- function(n) {
  head(fastfood, n) |> 
    select(calories, protein, vit_c) |> 
    summarise_all(mean, na.rm = TRUE) |> 
    mutate(n = paste("N =", n))
}
out1 <- map(c(10,20), col_stats)

out1
[[1]]
# A tibble: 1 × 4
  calories protein vit_c n     
     <dbl>   <dbl> <dbl> <chr> 
1      657    39.5  12.3 N = 10

[[2]]
# A tibble: 1 × 4
  calories protein vit_c n     
     <dbl>   <dbl> <dbl> <chr> 
1     582.    34.6  12.2 N = 20
out2 <- map(c(10,20), col_stats) |> list_rbind()

out2
# A tibble: 2 × 4
  calories protein vit_c n     
     <dbl>   <dbl> <dbl> <chr> 
1     657     39.5  12.3 N = 10
2     582.    34.6  12.2 N = 20
out3 <- map(c(10,20), col_stats) |> list_cbind()

out3
# A tibble: 1 × 8
  calories...1 protein...2 vit_c...3 n...4  calories...5 protein...6 vit_c...7
         <dbl>       <dbl>     <dbl> <chr>         <dbl>       <dbl>     <dbl>
1          657        39.5      12.3 N = 10         582.        34.6      12.2
# ℹ 1 more variable: n...8 <chr>
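When the input to map() has names, list_rbind() can store them in a column through its names_to argument, which helps keep track of which row came from which input. A minimal sketch, assuming purrr 1.0.0 or later; the helper sum_to() and the vector sizes are made up.

```r
library(purrr)

# Hypothetical helper: a one-row data frame summarizing the integers 1..n
sum_to <- function(n) {
  data.frame(n = n, total = sum(seq_len(n)))
}

# The names of the input become the values of the identifier column
sizes <- c(small = 2, large = 4)

out <- map(sizes, sum_to) |>
  list_rbind(names_to = "size")

out
```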

Two+ arguments to map()

map_*() variants (input)

We’ve already described how the map_*() variants differ in their output (list, double, character, logical). Now we cover how they differ in their input (one vector, two vectors, or many).

map2_*()

  • map2_*() has two data arguments, .x and .y, plus the function .f
  • raise each value of .x to the power of 2
map_dbl(
  .x = c(1:5), 
  .f = function(x) x ^ 2
)
[1]  1  4  9 16 25
  • raise each value of .x to the power of the corresponding value of .y
map2_dbl(
  .x = c(1:5), 
  .y = c(2:6), 
  .f = ~ (.x ^ .y)
)
[1]     1     8    81  1024 15625
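One detail worth knowing: map2() pairs the elements of .x and .y position by position, and a length-1 .y is recycled across all of .x. A short sketch:

```r
library(purrr)

# Element-wise pairing: 1^2, 2^3, 3^4, 4^5, 5^6
paired <- map2_dbl(.x = 1:5, .y = 2:6, .f = ~ .x ^ .y)

# A length-1 .y is reused for every element of .x
squared <- map2_dbl(.x = 1:5, .y = 2, .f = ~ .x ^ .y)

paired
squared
```

For simple arithmetic like this, the vectorized expression (1:5)^(2:6) gives the same numbers; map2() earns its keep when the function is not vectorized.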

imap()

  • imap() is like map2() except that .y is derived from names(.x) if .x has names.

  • If not, .y is 1, 2, 3, …, \(n\), where \(n\) is the number of items in .x. (For a data frame, \(n\) is the number of columns.)

  • These two calls produce the same result

imap_chr(.x = fastfood, 
         .f = ~ paste(.y, "has a mean of", round(mean(.x), 1))) |> 
head()
                    restaurant                           item 
 "restaurant has a mean of NA"        "item has a mean of NA" 
                      calories                        cal_fat 
"calories has a mean of 530.9"  "cal_fat has a mean of 238.8" 
                     total_fat                        sat_fat 
"total_fat has a mean of 26.6"    "sat_fat has a mean of 8.2" 
map2_chr(.x = fastfood, 
         .y = names(fastfood),
         .f = ~ paste(.y, "has a mean of", round(mean(.x), 1))) |> 
head()
                    restaurant                           item 
 "restaurant has a mean of NA"        "item has a mean of NA" 
                      calories                        cal_fat 
"calories has a mean of 530.9"  "cal_fat has a mean of 238.8" 
                     total_fat                        sat_fat 
"total_fat has a mean of 26.6"    "sat_fat has a mean of 8.2" 

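The fastfood columns all have names, so the example above only shows the first case. Here is a sketch of the second case, where .x has no names and .y falls back to the position; the vector fruits is made up.

```r
library(purrr)

# Hypothetical unnamed input
fruits <- c("apple", "banana", "cherry")

# .y is the position (1, 2, 3) because fruits has no names
labelled <- imap_chr(fruits, ~ paste0(.y, ": ", .x))
labelled
```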
pmap()

  • you can pass a named list or dataframe as arguments to a function

  • for example runif() has the parameters n, min and max
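The slides print params without showing how it was created. One way it could be built is sketched below, assuming tibble(); the values are copied from the printed output. pmap() matches the column names to runif()'s argument names, so each row becomes one call such as runif(n = 2, min = 10, max = 100).

```r
library(tibble)
library(purrr)

# Assumed construction; the column names must match runif()'s arguments
params <- tibble(
  n   = c(1, 2, 3),
  min = c(1, 10, 100),
  max = c(10, 100, 1000)
)

# One call to runif() per row
draws <- pmap(params, runif)

lengths(draws)  # how many values each row produced
```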

params
# A tibble: 3 × 3
      n   min   max
  <dbl> <dbl> <dbl>
1     1     1    10
2     2    10   100
3     3   100  1000
pmap(params, runif)
[[1]]
[1] 1.522878

[[2]]
[1] 60.40849 62.27485

[[3]]
[1] 393.0755 733.5081 142.7339

Or use the pipe into pmap():

params |> 
  pmap(runif)
[[1]]
[1] 6.228173

[[2]]
[1] 21.28932 12.86177

[[3]]
[1] 900.8545 594.6754 274.8844

An aside…

Interestingly, runif() will take either a scalar or a vector as its first argument. If the first argument is a vector, runif() will return N random uniforms, where N is the length of the vector.

runif(n = 3)
[1] 0.3460550 0.3684597 0.7622530
runif(n = c(1,3))
[1] 0.2994728 0.7781749
runif(n = c(10000,12321412424))
[1] 0.7561568 0.4648677
runif(n = c("rainbow", "unicorn"))
[1] 0.9089102 0.3279757

pmap() vs map()

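Because each column of params is a vector, the pmap() code will “work” in map() as well. But map() iterates over the columns, not the rows: each column is passed to runif() as n, so every call returns 3 values between 0 and 1.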

params
# A tibble: 3 × 3
      n   min   max
  <dbl> <dbl> <dbl>
1     1     1    10
2     2    10   100
3     3   100  1000
params |> 
  pmap(runif)
[[1]]
[1] 1.734303

[[2]]
[1] 54.19798 10.06920

[[3]]
[1] 587.4556 944.9725 889.0832
params |> 
  map(runif)
$n
[1] 0.63462172 0.08050289 0.62253413

$min
[1] 0.5753892 0.1718571 0.8298359

$max
[1] 0.8211994 0.6053252 0.2518072
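A quick check of what map() is doing here, as a sketch. It rebuilds params as a plain data frame (an assumption, so that the block runs on its own) and confirms that each column produces 3 draws from the default range of 0 to 1.

```r
library(purrr)

# Assumed reconstruction of params from the printed values
params <- data.frame(
  n   = c(1, 2, 3),
  min = c(1, 10, 100),
  max = c(10, 100, 1000)
)

# map() walks over columns: runif(n = params$n), runif(n = params$min), ...
by_column <- map(params, runif)

lengths(by_column)        # one entry per column, each of length 3
range(unlist(by_column))  # min and max are never passed, so draws stay in [0, 1]
```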

pmap() with expand_grid()

  • I like to use expand_grid() when I want all possible parameter combinations.
expand_grid(n = c(1, 2, 3), 
            min = c(1, 10),
            max = c(10, 100)) 
# A tibble: 12 × 3
       n   min   max
   <dbl> <dbl> <dbl>
 1     1     1    10
 2     1     1   100
 3     1    10    10
 4     1    10   100
 5     2     1    10
 6     2     1   100
 7     2    10    10
 8     2    10   100
 9     3     1    10
10     3     1   100
11     3    10    10
12     3    10   100
expand_grid(n = c(1, 2, 3), 
            min = c(1, 10),
            max = c(10, 100)) |> 
pmap(runif)
[[1]]
[1] 1.08188

[[2]]
[1] 11.49043

[[3]]
[1] 10

[[4]]
[1] 39.97649

[[5]]
[1] 5.169109 4.533750

[[6]]
[1] 46.15350 74.83895

[[7]]
[1] 10 10

[[8]]
[1] 78.80365 13.85192

[[9]]
[1] 1.959881 2.481792 1.705433

[[10]]
[1] 61.762033 48.758773  3.483168

[[11]]
[1] 10 10 10

[[12]]
[1] 96.07411 57.56174 46.62570
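With 12 unnamed list elements it is hard to tell which draws came from which parameters. One option, sketched here, is to store the pmap() result as a list-column next to the parameters; the column name draws is made up, and tidyr and purrr are assumed to be installed.

```r
library(tidyr)
library(purrr)

grid <- expand_grid(
  n   = c(1, 2, 3),
  min = c(1, 10),
  max = c(10, 100)
)

# pmap() runs on the three parameter columns before draws is attached
grid$draws <- pmap(grid, runif)

grid
```

Rows where min equals max (both 10) return the constant 10, which matches the output above.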