Variable Types

September 23 + 25, 2024

Jo Hardin

Variable Types

Some new variable types:

  • character strings
  • factor variables
  • dates
  • numeric
  • logical

A variable’s type determines the values that the variable can take on and the operations that can be performed on it. Specifying variable types helps protect the data’s integrity and can improve performance.

Agenda 9/23/24

  1. Character strings
  2. str_*() functions
  3. Factor variables

Character strings

When working with character strings, we might want to detect, replace, or extract certain patterns.

Strings are objects of the character class (abbreviated as <chr> in tibbles). When you print out strings, they display with double quotes:

some_string <- "banana"
some_string
[1] "banana"

Creating strings

Creating strings by hand is useful for testing out regular expressions.

To create a string, type any text in either double quotes " or single quotes '. Using double or single quotes doesn’t matter unless your string itself has single or double quotes.

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

string1
[1] "This is a string"
string2
[1] "If I want to include a \"quote\" inside a string, I use single quotes"
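
The backslashes in the printed output above are escapes. As a small sketch (the names string3 and string4 are made up for illustration), the same string can be written with double quotes on the outside by escaping the inner quotes, and writeLines() shows the raw contents:

```r
# Escape a double quote with a backslash to keep it inside a double-quoted string
string3 <- "If I want to include a \"quote\", I can also escape it"
string4 <- 'If I want to include a "quote", I can also escape it'

identical(string3, string4)  # TRUE: both spellings create the same string
writeLines(string3)          # prints the raw contents, without the backslashes
```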

str_view()

We can view these strings “naturally” (without the opening and closing quotes) with str_view():

str_view(string1)
[1] │ This is a string
str_view(string2)
[1] │ If I want to include a "quote" inside a string, I use single quotes

str_c()

str_c() glues strings together, much like paste0(), but works well in a tidy pipeline. Missing values stay missing: an NA input gives an NA result.

df <- tibble(name = c("Flora", "David", "Terra", NA))
df |> mutate(greeting = str_c("Hi ", name, "!"))
# A tibble: 4 × 2
  name  greeting 
  <chr> <chr>    
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA>  <NA>     
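
The last row of the output is where str_c() and paste0() differ. A minimal sketch, using a made-up two-element vector and assuming a replacement word of "you", compares the two and shows one way to fill in missing names with str_replace_na():

```r
library(stringr)

name <- c("Flora", NA)

paste0("Hi ", name, "!")   # "Hi Flora!" "Hi NA!"  (NA becomes the text "NA")
str_c("Hi ", name, "!")    # "Hi Flora!" NA        (NA stays missing)

# Replace missing values first if a greeting is wanted for every row
str_c("Hi ", str_replace_na(name, "you"), "!")  # "Hi Flora!" "Hi you!"
```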

str_sub()

str_sub(string, start, end) will extract parts of a string, where start and end are the (inclusive) positions where the substring starts and ends. Negative positions count backward from the end of the string.

fruits <- c("Apple", "Banana", "Pear")
str_sub(fruits, 1, 3)
[1] "App" "Ban" "Pea"
str_sub(fruits, -3, -1)
[1] "ple" "ana" "ear"

str_sub() won’t fail if the string is too short; it returns as much of the string as it can.

str_sub(fruits, 1, 5)
[1] "Apple" "Banan" "Pear" 

str_sub() in a pipeline

We can use the str_*() functions inside the mutate() function.

titanic |> 
  mutate(class1 = str_sub(Class, 1, 1))
   Class    Sex   Age Survived Freq class1
1    1st   Male Child       No    0      1
2    2nd   Male Child       No    0      2
3    3rd   Male Child       No   35      3
4   Crew   Male Child       No    0      C
5    1st Female Child       No    0      1
6    2nd Female Child       No    0      2
7    3rd Female Child       No   17      3
8   Crew Female Child       No    0      C
9    1st   Male Adult       No  118      1
10   2nd   Male Adult       No  154      2
11   3rd   Male Adult       No  387      3
12  Crew   Male Adult       No  670      C
13   1st Female Adult       No    4      1
14   2nd Female Adult       No   13      2
15   3rd Female Adult       No   89      3
16  Crew Female Adult       No    3      C
17   1st   Male Child      Yes    5      1
18   2nd   Male Child      Yes   11      2
19   3rd   Male Child      Yes   13      3
20  Crew   Male Child      Yes    0      C
21   1st Female Child      Yes    1      1
22   2nd Female Child      Yes   13      2
23   3rd Female Child      Yes   14      3
24  Crew Female Child      Yes    0      C
25   1st   Male Adult      Yes   57      1
26   2nd   Male Adult      Yes   14      2
27   3rd   Male Adult      Yes   75      3
28  Crew   Male Adult      Yes  192      C
29   1st Female Adult      Yes  140      1
30   2nd Female Adult      Yes   80      2
31   3rd Female Adult      Yes   76      3
32  Crew Female Adult      Yes   20      C

str_replace*()

str_replace() replaces the first match of a pattern. str_replace_all() replaces all the matches of a pattern.

fruits
[1] "Apple"  "Banana" "Pear"  
str_replace(fruits, "a", "x")
[1] "Apple"  "Bxnana" "Pexr"  
str_replace_all(fruits, "a", "x")
[1] "Apple"  "Bxnxnx" "Pexr"  
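
The pattern argument is interpreted as a regular expression by default. The sketch below (the vowel pattern and the "a.b.c" string are illustrative choices, not from the examples above) shows a character class, a case-insensitive match with regex(), and a literal match with fixed():

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# A character class matches any one of the listed letters (lowercase only)
str_replace_all(fruits, "[aeiou]", "-")
# "Appl-" "B-n-n-" "P--r"

# regex() with ignore_case = TRUE also matches the capital "A"
str_replace_all(fruits, regex("[aeiou]", ignore_case = TRUE), "-")
# "-ppl-" "B-n-n-" "P--r"

# fixed() treats the pattern literally, so "." means a period, not "any character"
str_replace_all("a.b.c", fixed("."), "-")
# "a-b-c"
```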

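str_detect()

str_detect() returns TRUE or FALSE for each string, depending on whether it contains the pattern. Matching is case sensitive, so the "A" in "Apple" does not match "a".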

str_detect(fruits, "a")
[1] FALSE  TRUE  TRUE
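
Because the result is a logical vector, it can be summarized directly. A short sketch, reusing the fruits vector, counts the matches and also shows a case-insensitive version using regex():

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

# Ignore case so that the capital "A" in "Apple" also counts as a match
str_detect(fruits, regex("a", ignore_case = TRUE))  # TRUE TRUE TRUE

# TRUE counts as 1 and FALSE as 0
sum(str_detect(fruits, "a"))   # 2 strings contain a lowercase "a"
mean(str_detect(fruits, "a"))  # proportion of strings that match (2 out of 3)
```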

str_detect() in pipeline

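str_detect() returns a logical vector, so it can be used inside filter() to keep only the rows that match a pattern. Note that films is a list column: each entry is a character vector of film titles.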

starwars |> 
  select(name, films) |> 
  str() 
tibble [87 × 2] (S3: tbl_df/tbl/data.frame)
 $ name : chr [1:87] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
 $ films:List of 87
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:7] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith"
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:3] "A New Hope" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:3] "A New Hope" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "A New Hope"
  ..$ : chr "A New Hope"
  ..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "A New Hope" "Revenge of the Sith"
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Force Awakens"
  ..$ : chr "A New Hope"
  ..$ : chr [1:3] "A New Hope" "Return of the Jedi" "The Phantom Menace"
  ..$ : chr [1:3] "A New Hope" "The Empire Strikes Back" "Return of the Jedi"
  ..$ : chr "A New Hope"
  ..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
  ..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
  ..$ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "Attack of the Clones"
  ..$ : chr "The Empire Strikes Back"
  ..$ : chr "The Empire Strikes Back"
  ..$ : chr [1:2] "The Empire Strikes Back" "Return of the Jedi"
  ..$ : chr "The Empire Strikes Back"
  ..$ : chr [1:2] "Return of the Jedi" "The Force Awakens"
  ..$ : chr "Return of the Jedi"
  ..$ : chr "Return of the Jedi"
  ..$ : chr "Return of the Jedi"
  ..$ : chr "Return of the Jedi"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "Return of the Jedi"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
  ..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
  ..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
  ..$ : chr "The Phantom Menace"
  ..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "Attack of the Clones"
  ..$ : chr "Attack of the Clones"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "Revenge of the Sith"
  ..$ : chr "Revenge of the Sith"
  ..$ : chr [1:2] "A New Hope" "Revenge of the Sith"
  ..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
  ..$ : chr "Revenge of the Sith"
  ..$ : chr "The Force Awakens"
  ..$ : chr "The Force Awakens"
  ..$ : chr "The Force Awakens"
  ..$ : chr "The Force Awakens"
  ..$ : chr "The Force Awakens"
starwars |> 
  filter(str_detect(films, "Empire")) |> 
  select(name, films) |> 
  str()
tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
 $ name : chr [1:16] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
 $ films:List of 16
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:7] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith"
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
  ..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
  ..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Force Awakens"
  ..$ : chr [1:3] "A New Hope" "The Empire Strikes Back" "Return of the Jedi"
  ..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
  ..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
  ..$ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "Attack of the Clones"
  ..$ : chr "The Empire Strikes Back"
  ..$ : chr "The Empire Strikes Back"
  ..$ : chr [1:2] "The Empire Strikes Back" "Return of the Jedi"
  ..$ : chr "The Empire Strikes Back"
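
Since films is a list column, another way to write this filter is to check each character vector of titles explicitly. The following is one possible sketch, assuming the purrr package is available; the object name empire is made up for illustration:

```r
library(dplyr)
library(purrr)
library(stringr)

# For each character, ask whether any of their film titles contains "Empire"
empire <- starwars |> 
  filter(map_lgl(films, \(f) any(str_detect(f, "Empire")))) |> 
  select(name, films)

nrow(empire)
```

map_lgl() returns one TRUE or FALSE per row, which is what filter() expects.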

stringr functions

The stringr package within tidyverse contains lots of functions to help process strings. Letting x be a string variable…

function             arguments                  returns
str_replace()        x, pattern, replacement    a modified string
str_replace_all()    x, pattern, replacement    a modified string
str_to_lower()       x                          a modified string
str_to_upper()       x                          a modified string
str_sub()            x, start, end              a modified string
str_length()         x                          a number
str_detect()         x, pattern                 TRUE/FALSE
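
A quick sketch of the functions in the table that have not appeared yet, again on the fruits vector. The final pipeline is only an illustration of chaining several str_*() calls:

```r
library(stringr)

fruits <- c("Apple", "Banana", "Pear")

str_length(fruits)    # 5 6 4
str_to_upper(fruits)  # "APPLE" "BANANA" "PEAR"
str_to_lower(fruits)  # "apple" "banana" "pear"

# Functions that return a modified string can be chained
fruits |> 
  str_to_lower() |> 
  str_replace_all("a", "x") |> 
  str_sub(1, 3)
# "xpp" "bxn" "pex"
```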

Use the stringr cheatsheet.

Agenda 9/25/24

  1. Factor variables
  2. Time and date objects

Factor variables

Factor variables (abbreviated as <fct> in tibbles) look like character strings, but they are used for categorical data. The computer actually stores them as integers (?!?!!?), with a label attached to each integer.

  • categorical variable
  • represented in discrete levels with an ordering
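
To see the integers behind the labels, here is a small sketch with a made-up vector x. By default the levels are sorted alphabetically, which is rarely the order we want:

```r
x <- factor(c("low", "high", "medium", "low"))

levels(x)      # "high" "low" "medium"  (alphabetical by default)
as.integer(x)  # 2 1 3 2               (position of each value in levels(x))
typeof(x)      # "integer"
class(x)       # "factor"
```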

Where do we order?

The ordering of the factor variable comes out in:

  • plots (e.g., barplots)
  • tables (e.g., group_by())
  • modeling (e.g., the baseline level in a linear regression)

Order matters

SurveyUSA poll from 2012 on views of the DREAM Act.

What is off about the data viz part of the report?

openintro::dream
# A tibble: 910 × 2
   ideology     stance
   <fct>        <fct> 
 1 Conservative Yes   
 2 Conservative Yes   
 3 Conservative Yes   
 4 Conservative Yes   
 5 Conservative Yes   
 6 Conservative Yes   
 7 Conservative Yes   
 8 Conservative Yes   
 9 Conservative Yes   
10 Conservative Yes   
# ℹ 900 more rows
dream |> 
  ggplot(aes(x = ideology, fill = stance)) + 
  geom_bar()

dream |> 
  pull(ideology) |>  # levels() works only on vectors, not data frames
  levels()
[1] "Conservative" "Liberal"      "Moderate"    

Change the order

We can fix the order of the ideology variable.

dream |> 
  mutate(ideology = fct_relevel(ideology, 
                                c("Liberal", "Moderate", "Conservative"))) |> 
  ggplot(aes(x = ideology, fill = stance)) + 
  geom_bar()
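
The effect of fct_relevel() is easier to see on a small vector. In this sketch the ideology vector is a made-up stand-in for the column in dream:

```r
library(forcats)

ideology <- factor(c("Conservative", "Liberal", "Moderate", "Liberal"))
levels(ideology)  # "Conservative" "Liberal" "Moderate"

# Give the full order explicitly
reordered <- fct_relevel(ideology, "Liberal", "Moderate", "Conservative")
levels(reordered)  # "Liberal" "Moderate" "Conservative"

# Naming only some levels moves them to the front and keeps the rest in place
levels(fct_relevel(ideology, "Moderate"))  # "Moderate" "Conservative" "Liberal"

# Tables follow the level order too
table(reordered)
```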

Factor and character variables

starbucks |> 
  select(item, type, calories)
# A tibble: 77 × 3
   item                          type   calories
   <chr>                         <fct>     <int>
 1 "8-Grain Roll"                bakery      350
 2 "Apple Bran Muffin"           bakery      350
 3 "Apple Fritter"               bakery      420
 4 "Banana Nut Loaf"             bakery      490
 5 "Birthday Cake Mini Doughnut" bakery      130
 6 "Blueberry Oat Bar"           bakery      370
 7 "Blueberry Scone"             bakery      460
 8 "Bountiful Blueberry Muffin"  bakery      370
 9 "Butter Croissant "           bakery      310
10 "Cheese Danish"               bakery      420
# ℹ 67 more rows

Reorder according to another variable

Let’s say that we wanted to order the type of food item based on the average number of calories in that food.

starbucks |> 
  mutate(type = fct_reorder(type, calories, .fun = "mean", .desc = TRUE)) |> 
  ggplot(aes(x = type, y = calories)) + 
  geom_point() + 
  labs(x = "type of food",
       y = "",
       title = "Calories for food items at Starbucks")
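
A toy version of the same idea, with made-up values, shows what fct_reorder() does to the levels. Note that the default summary function is the median, so .fun is set explicitly to get the mean:

```r
library(forcats)

type     <- factor(c("a", "a", "b", "b", "c"))
calories <- c(100, 200, 400, 500, 50)
# group means: a = 150, b = 450, c = 50

levels(fct_reorder(type, calories, .fun = mean))                # "c" "a" "b"
levels(fct_reorder(type, calories, .fun = mean, .desc = TRUE))  # "b" "a" "c"
```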

Change character to factor

starbucks
# A tibble: 77 × 7
   item                          calories   fat  carb fiber protein type  
   <chr>                            <int> <dbl> <int> <int>   <int> <fct> 
 1 "8-Grain Roll"                     350     8    67     5      10 bakery
 2 "Apple Bran Muffin"                350     9    64     7       6 bakery
 3 "Apple Fritter"                    420    20    59     0       5 bakery
 4 "Banana Nut Loaf"                  490    19    75     4       7 bakery
 5 "Birthday Cake Mini Doughnut"      130     6    17     0       0 bakery
 6 "Blueberry Oat Bar"                370    14    47     5       6 bakery
 7 "Blueberry Scone"                  460    22    61     2       7 bakery
 8 "Bountiful Blueberry Muffin"       370    14    55     0       6 bakery
 9 "Butter Croissant "                310    18    32     0       5 bakery
10 "Cheese Danish"                    420    25    39     0       7 bakery
# ℹ 67 more rows
starbucks |> 
  mutate(item = as.factor(item))
# A tibble: 77 × 7
   item                          calories   fat  carb fiber protein type  
   <fct>                            <int> <dbl> <int> <int>   <int> <fct> 
 1 "8-Grain Roll"                     350     8    67     5      10 bakery
 2 "Apple Bran Muffin"                350     9    64     7       6 bakery
 3 "Apple Fritter"                    420    20    59     0       5 bakery
 4 "Banana Nut Loaf"                  490    19    75     4       7 bakery
 5 "Birthday Cake Mini Doughnut"      130     6    17     0       0 bakery
 6 "Blueberry Oat Bar"                370    14    47     5       6 bakery
 7 "Blueberry Scone"                  460    22    61     2       7 bakery
 8 "Bountiful Blueberry Muffin"       370    14    55     0       6 bakery
 9 "Butter Croissant "                310    18    32     0       5 bakery
10 "Cheese Danish"                    420    25    39     0       7 bakery
# ℹ 67 more rows
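
as.factor() sorts the levels alphabetically. When a different order is wanted at the moment of conversion, two options are sketched below on a made-up vector of sizes: factor() with an explicit levels argument, and fct_inorder(), which uses the order of first appearance:

```r
library(forcats)

sizes <- c("tall", "grande", "venti", "tall")

levels(as.factor(sizes))                                       # "grande" "tall" "venti"
levels(factor(sizes, levels = c("tall", "grande", "venti")))   # "tall" "grande" "venti"
levels(fct_inorder(factor(sizes)))                             # "tall" "grande" "venti"
```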

forcats functions

The forcats package within tidyverse contains lots of functions to help process factor variables. Use the forcats cheatsheet. We’ll focus on the most common functions.

  • functions for changing the order of factor levels
    • fct_relevel() = manually reorder levels
    • fct_reorder() = reorder levels according to values of another variable
    • fct_infreq() = order levels from highest to lowest frequency
    • fct_rev() = reverse the current order
  • functions for changing the labels or values of factor levels
    • fct_recode() = manually change levels
    • fct_lump() = group together least common levels
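
The functions listed above that have not been demonstrated yet can be tried on a small made-up factor. This is only a sketch; the label "apple" is an arbitrary choice:

```r
library(forcats)

f <- factor(c("a", "a", "a", "b", "c", "c"))  # counts: a = 3, b = 1, c = 2

levels(fct_infreq(f))               # "a" "c" "b"   most frequent first
levels(fct_rev(f))                  # "c" "b" "a"   reverse of the current order
levels(fct_recode(f, apple = "a"))  # "apple" "b" "c"   new label = "old label"

# Keep the single most common level and lump the rest together
lumped <- fct_lump(f, n = 1)
table(lumped)
```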

Time and date objects

If the variable is formatted as a time or date object, you will find that there are very convenient ways to access, wrangle, and plot the information.

There are three types of date/time data that refer to an instant in time:

  • A date. Tibbles print this as <date>.
  • A time within a day. Tibbles print this as <time>.
  • A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Base R calls these POSIXct, but that doesn’t exactly trip off the tongue.
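
The classes behind these three types can be checked directly. A brief sketch, assuming the lubridate and hms packages are installed (hms is the package that supplies the <time> column type):

```r
library(lubridate)

class(today())  # "Date"
class(now())    # "POSIXct" "POSIXt"

# A time within a day, with no date attached
lunch <- hms::as_hms("12:27:16")
class(lunch)    # "hms" "difftime"
```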

Formatting time variables

image credit: https://xkcd.com/1179/

What time is it?

today()
[1] "2024-09-25"
now()
[1] "2024-09-25 12:27:16 PDT"

Creating dates

ymd() and friends create dates

ymd("2024-09-25")
[1] "2024-09-25"
mdy("September 25th, 2024")
[1] "2024-09-25"
dmy("25-Sep-2024")
[1] "2024-09-25"
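
Two related details, sketched here with illustrative inputs: dates can also be built from numeric pieces with make_date(), and a string that is not a real date parses to NA with a warning (suppressed below so that the example runs quietly):

```r
library(lubridate)

make_date(2024, 9, 25)  # "2024-09-25", built from year, month, day
ymd(20240925)           # "2024-09-25", unquoted numbers also parse

# February 30th does not exist, so the result is NA
bad <- suppressWarnings(ymd("2024-02-30"))
is.na(bad)              # TRUE
```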

… with times

To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:

ymd_hms("2024-09-25 11:45:59", tz = "America/Los_Angeles")
[1] "2024-09-25 11:45:59 PDT"
mdy_hm("09/25/2024 15:01")  # default is UTC = GMT
[1] "2024-09-25 15:01:00 UTC"
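
Once a date-time has a time zone, there are two different ways to change it. A small sketch using lubridate's with_tz() and force_tz(); the object name t_la is made up for illustration:

```r
library(lubridate)

t_la <- ymd_hms("2024-09-25 11:45:59", tz = "America/Los_Angeles")

# Same instant, shown on a different clock (PDT is 7 hours behind UTC)
with_tz(t_la, "UTC")   # "2024-09-25 18:45:59 UTC"

# Same clock reading, but a different instant
force_tz(t_la, "UTC")  # "2024-09-25 11:45:59 UTC"

# The two instants are 7 hours apart
difftime(t_la, force_tz(t_la, "UTC"), units = "hours")
```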

More information about time zones in R.

lubridate

lubridate is another R package meant for data wrangling!

In particular, lubridate makes it very easy to work with days, times, and dates. The base idea is to start with dates in a ymd (year month day) format and transform the information into whatever you want.

Example from the lubridate vignette.

If anyone drove a time machine, they would crash

The lengths of months and years change so often that doing arithmetic with them can be unintuitive.

Consider a simple operation: January 31st + one month.

Should the answer be:

  1. February 31st (which doesn’t exist)?
  2. March 3rd (31 days after January 31, in a year that is not a leap year)?
  3. February 28th (assuming it’s not a leap year)?

If anyone drove a time machine, they would crash

A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date. Wickham wants to make lubridate as consistent as possible by invoking the following rule: if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA.

If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special %m+% and %m-% operators. %m+% and %m-% automatically roll dates back to the last day of the month, should that be necessary.
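
The three candidate answers can be checked in a year that is not a leap year. This is a sketch using 2023 as an illustrative year; the last line shows that rolling back to the end of the month gives up the a + b - b = a property:

```r
library(lubridate)

jan31 <- ymd("2023-01-31")  # 2023 is not a leap year

jan31 + months(1)     # NA: February 31st is not a valid date
jan31 + days(31)      # "2023-03-03": exactly 31 days later
jan31 %m+% months(1)  # "2023-02-28": rolled back to the end of February

# Adding and then subtracting a month does not return to January 31st
(jan31 %m+% months(1)) %m-% months(1)  # "2023-01-28"
```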

basics in lubridate

library(lubridate)
rightnow <- now()
rightnow
[1] "2024-09-25 12:27:16 PDT"
day(rightnow)
[1] 25
week(rightnow)
[1] 39
month(rightnow, label=FALSE)
[1] 9
month(rightnow, label=TRUE)
[1] Sep
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
year(rightnow)
[1] 2024

basics in lubridate

minute(rightnow)
[1] 27
hour(rightnow)
[1] 12
yday(rightnow)
[1] 269
mday(rightnow)
[1] 25
wday(rightnow, label=FALSE)
[1] 4
wday(rightnow, label=TRUE)
[1] Wed
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

Working with a date object

jan31 <- ymd("2024-01-31")
jan31 + months(0:11)
 [1] "2024-01-31" NA           "2024-03-31" NA           "2024-05-31"
 [6] NA           "2024-07-31" "2024-08-31" NA           "2024-10-31"
[11] NA           "2024-12-31"
floor_date(jan31, "month")
[1] "2024-01-01"
floor_date(jan31, "month") + months(0:11) + days(31)
 [1] "2024-02-01" "2024-03-03" "2024-04-01" "2024-05-02" "2024-06-01"
 [6] "2024-07-02" "2024-08-01" "2024-09-01" "2024-10-02" "2024-11-01"
[11] "2024-12-02" "2025-01-01"
jan31 + months(0:11) + days(31)
 [1] "2024-03-02" NA           "2024-05-01" NA           "2024-07-01"
 [6] NA           "2024-08-31" "2024-10-01" NA           "2024-12-01"
[11] NA           "2025-01-31"
jan31 %m+% months(0:11)
 [1] "2024-01-31" "2024-02-29" "2024-03-31" "2024-04-30" "2024-05-31"
 [6] "2024-06-30" "2024-07-31" "2024-08-31" "2024-09-30" "2024-10-31"
[11] "2024-11-30" "2024-12-31"

NYC flights

library(nycflights13)
names(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     

NYC flights

Creating a date object from variables.

flightsWK <- flights |>  
   mutate(ymdday = ymd(paste(year, month, day, sep = "-"))) |> 
   mutate(weekdy = wday(ymdday, label = TRUE), 
          whichweek = week(ymdday)) 

flightsWK |>  select(year, month, day, ymdday, weekdy, whichweek, 
                     dep_time, arr_time, air_time) 
# A tibble: 336,776 × 9
    year month   day ymdday     weekdy whichweek dep_time arr_time air_time
   <int> <int> <int> <date>     <ord>      <dbl>    <int>    <int>    <dbl>
 1  2013     1     1 2013-01-01 Tue            1      517      830      227
 2  2013     1     1 2013-01-01 Tue            1      533      850      227
 3  2013     1     1 2013-01-01 Tue            1      542      923      160
 4  2013     1     1 2013-01-01 Tue            1      544     1004      183
 5  2013     1     1 2013-01-01 Tue            1      554      812      116
 6  2013     1     1 2013-01-01 Tue            1      554      740      150
 7  2013     1     1 2013-01-01 Tue            1      555      913      158
 8  2013     1     1 2013-01-01 Tue            1      557      709       53
 9  2013     1     1 2013-01-01 Tue            1      557      838      140
10  2013     1     1 2013-01-01 Tue            1      558      753      138
# ℹ 336,766 more rows
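
An alternative to pasting the pieces into a string is lubridate's make_date(), which builds the date from the numeric columns directly. The sketch below uses a tiny made-up tibble (toy) in place of flights so that it runs without nycflights13:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the year, month, and day columns of flights
toy <- tibble(year  = c(2013L, 2013L, 2013L),
              month = c(1L, 1L, 2L),
              day   = c(1L, 2L, 14L))

toyWK <- toy |> 
  mutate(ymdday    = make_date(year, month, day),
         weekdy    = wday(ymdday, label = TRUE),  # ordered factor: Sun < Mon < ...
         weekdynum = wday(ymdday),                # 1 = Sunday by default
         whichweek = week(ymdday))

toyWK
```

Because weekdy is an ordered factor, a follow-up such as count(weekdy) lists the days in calendar order rather than alphabetically.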