[1] "banana"
September 23 + 25, 2024
Some new variable types:
A variable’s type determines the values that the variable can take on and the operations that can be performed on it. Specifying variable types ensures the data’s integrity and increases performance.
str_*() functions
When working with character strings, we might want to detect, replace, or extract certain patterns.
Strings are objects of the character class (abbreviated as <chr> in tibbles). When you print out strings, they display with double quotes:
Creating strings by hand is useful for testing out regular expressions.
To create a string, type any text in either double quotes " or single quotes '. Using double or single quotes doesn’t matter unless your string itself has single or double quotes.
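As a small sketch of the quoting rule (the strings are made up for illustration), a string that contains one kind of quote can be wrapped in the other kind, or the inner quote can be escaped with a backslash:

```r
# Made-up strings for illustration
s1 <- "It's a string with a single quote inside"
s2 <- 'She said "hello" to the class'
s3 <- "She said \"hello\" to the class"  # escaping the inner quotes also works

# s2 and s3 hold exactly the same characters
identical(s2, s3)

# writeLines() shows the raw contents, without the escapes
writeLines(s3)
```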
str_view()
We can view these strings “naturally” (without the opening and closing quotes) with str_view():
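A minimal sketch, assuming stringr 1.5.0 or later (where str_view() can be called without a pattern); the vector x is invented for the example:

```r
library(stringr)

# Made-up strings; the second one contains double quotes
x <- c("apple", 'She said "hello"')

print(x)     # printed with surrounding quotes and escaped inner quotes
str_view(x)  # shows the contents of each string without the outer quotes
```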
str_c()
Similar to paste() (gluing strings together), but works well in a tidy pipeline.
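A short sketch of str_c() on its own and inside mutate(); the people table is hypothetical and only serves the example:

```r
library(dplyr)
library(stringr)

# Glue separate strings together with a separator
two_words <- str_c("data", "science", sep = " ")

# Collapse a vector into a single string
collapsed <- str_c(c("a", "b", "c"), collapse = "-")

# Hypothetical table: build a full name column in a pipeline
people <- tibble(first = c("Ada", "Grace"), last = c("Lovelace", "Hopper"))
people_full <- people |>
  mutate(full = str_c(first, last, sep = " "))
people_full
```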
str_sub()
str_sub(string, start, end) will extract parts of a string, where start and end are the positions where the substring starts and ends.
[1] "App" "Ban" "Pea"
[1] "ple" "ana" "ear"
Won’t fail if the string is too short.
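The code that produced the output above is not shown. A sketch that is consistent with it, assuming the input was a vector like the fruits below (the name and contents are a guess), might look like this:

```r
library(stringr)

# Assumed input vector; not taken from the original slides
fruits <- c("Apple", "Banana", "Pear")

first_three <- str_sub(fruits, 1, 3)    # "App" "Ban" "Pea"
last_three  <- str_sub(fruits, -3, -1)  # "ple" "ana" "ear"

# Asking for more characters than exist returns what is available
too_short <- str_sub("a", 1, 5)         # "a"
```

Negative positions count backwards from the end of the string, which is how the last three characters are picked out.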
str_sub() in a pipeline
We can use the str_*() functions inside the mutate() function.
Class Sex Age Survived Freq class1
1 1st Male Child No 0 1
2 2nd Male Child No 0 2
3 3rd Male Child No 35 3
4 Crew Male Child No 0 C
5 1st Female Child No 0 1
6 2nd Female Child No 0 2
7 3rd Female Child No 17 3
8 Crew Female Child No 0 C
9 1st Male Adult No 118 1
10 2nd Male Adult No 154 2
11 3rd Male Adult No 387 3
12 Crew Male Adult No 670 C
13 1st Female Adult No 4 1
14 2nd Female Adult No 13 2
15 3rd Female Adult No 89 3
16 Crew Female Adult No 3 C
17 1st Male Child Yes 5 1
18 2nd Male Child Yes 11 2
19 3rd Male Child Yes 13 3
20 Crew Male Child Yes 0 C
21 1st Female Child Yes 1 1
22 2nd Female Child Yes 13 2
23 3rd Female Child Yes 14 3
24 Crew Female Child Yes 0 C
25 1st Male Adult Yes 57 1
26 2nd Male Adult Yes 14 2
27 3rd Male Adult Yes 75 3
28 Crew Male Adult Yes 192 C
29 1st Female Adult Yes 140 1
30 2nd Female Adult Yes 80 2
31 3rd Female Adult Yes 76 3
32 Crew Female Adult Yes 20 C
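The pipeline behind this table is not shown. One way to get a table of this shape, assuming it starts from the built-in Titanic dataset, is sketched below; the as.character() call is an extra safety step of mine, not necessarily part of the original code:

```r
library(dplyr)
library(stringr)

# Titanic is a built-in contingency table; as.data.frame() gives one row per cell
titanic_df <- as.data.frame(Titanic) |>
  # Keep only the first character of the class label ("1", "2", "3", or "C")
  mutate(class1 = str_sub(as.character(Class), 1, 1))

head(titanic_df, 4)
```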
str_replace*()
str_replace() replaces the first match of a pattern. str_replace_all() replaces all the matches of a pattern.
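A quick sketch of the difference, using a made-up input string:

```r
library(stringr)

fruit <- "banana"

first_only <- str_replace(fruit, "a", "o")      # only the first "a" changes: "bonana"
every_one  <- str_replace_all(fruit, "a", "o")  # every "a" changes: "bonono"
```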
str_detect()
str_detect() in pipeline
str_detect() used in a filter() pipeline.
tibble [87 × 2] (S3: tbl_df/tbl/data.frame)
$ name : chr [1:87] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
$ films:List of 87
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:7] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith"
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:3] "A New Hope" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:3] "A New Hope" "Attack of the Clones" "Revenge of the Sith"
..$ : chr "A New Hope"
..$ : chr "A New Hope"
..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "A New Hope" "Revenge of the Sith"
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Force Awakens"
..$ : chr "A New Hope"
..$ : chr [1:3] "A New Hope" "Return of the Jedi" "The Phantom Menace"
..$ : chr [1:3] "A New Hope" "The Empire Strikes Back" "Return of the Jedi"
..$ : chr "A New Hope"
..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
..$ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "Attack of the Clones"
..$ : chr "The Empire Strikes Back"
..$ : chr "The Empire Strikes Back"
..$ : chr [1:2] "The Empire Strikes Back" "Return of the Jedi"
..$ : chr "The Empire Strikes Back"
..$ : chr [1:2] "Return of the Jedi" "The Force Awakens"
..$ : chr "Return of the Jedi"
..$ : chr "Return of the Jedi"
..$ : chr "Return of the Jedi"
..$ : chr "Return of the Jedi"
..$ : chr "The Phantom Menace"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr "The Phantom Menace"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
..$ : chr "The Phantom Menace"
..$ : chr "Return of the Jedi"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr "The Phantom Menace"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
..$ : chr [1:2] "The Phantom Menace" "Revenge of the Sith"
..$ : chr "The Phantom Menace"
..$ : chr [1:3] "The Phantom Menace" "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "The Phantom Menace" "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr "Attack of the Clones"
..$ : chr "Attack of the Clones"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr "Revenge of the Sith"
..$ : chr "Revenge of the Sith"
..$ : chr [1:2] "A New Hope" "Revenge of the Sith"
..$ : chr [1:2] "Attack of the Clones" "Revenge of the Sith"
..$ : chr "Revenge of the Sith"
..$ : chr "The Force Awakens"
..$ : chr "The Force Awakens"
..$ : chr "The Force Awakens"
..$ : chr "The Force Awakens"
..$ : chr "The Force Awakens"
tibble [16 × 2] (S3: tbl_df/tbl/data.frame)
$ name : chr [1:16] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
$ films:List of 16
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:7] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith"
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:6] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" ...
..$ : chr [1:5] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "Revenge of the Sith" ...
..$ : chr [1:4] "A New Hope" "The Empire Strikes Back" "Return of the Jedi" "The Force Awakens"
..$ : chr [1:3] "A New Hope" "The Empire Strikes Back" "Return of the Jedi"
..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
..$ : chr [1:5] "The Empire Strikes Back" "Return of the Jedi" "The Phantom Menace" "Attack of the Clones" ...
..$ : chr [1:3] "The Empire Strikes Back" "Return of the Jedi" "Attack of the Clones"
..$ : chr "The Empire Strikes Back"
..$ : chr "The Empire Strikes Back"
..$ : chr [1:2] "The Empire Strikes Back" "Return of the Jedi"
..$ : chr "The Empire Strikes Back"
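The code that filtered the 87 characters down to 16 is not shown. As a simpler sketch of the same idea, here is str_detect() on a plain vector and then inside filter(); the characters table is a hypothetical stand-in with an ordinary character column, not the list column above:

```r
library(dplyr)
library(stringr)

# str_detect() returns one TRUE/FALSE per string
fruits <- c("apple", "banana", "pear")
has_an <- str_detect(fruits, "an")  # FALSE TRUE FALSE

# Hypothetical mini table (not the starwars data)
characters <- tibble(
  name = c("Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader")
)

# Keep only the rows whose name contains a hyphen
hyphenated <- characters |>
  filter(str_detect(name, "-"))
hyphenated
```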
The stringr package within tidyverse contains lots of functions to help process strings. Letting x be a string variable…
str function | arguments | returns |
---|---|---|
str_replace() | x, pattern, replacement | a modified string |
str_replace_all() | x, pattern, replacement | a modified string |
str_to_lower() | x | a modified string |
str_to_upper() | x | a modified string |
str_sub() | x, start, end | a modified string |
str_length() | x | a number |
str_detect() | x, pattern | TRUE/FALSE |
Use the stringr cheatsheet.
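For the functions in the table that have no example elsewhere in these notes, a brief sketch on a made-up vector x:

```r
library(stringr)

x <- c("Apple", "Banana", "Pear")

lower <- str_to_lower(x)  # "apple" "banana" "pear"
upper <- str_to_upper(x)  # "APPLE" "BANANA" "PEAR"
len   <- str_length(x)    # 5 6 4
```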
Factor variables are a special type of character string. The computer actually stores them as integers (?!?!!?) with a label (abbreviated as <fct> in tibbles).
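To see the integers-with-labels storage directly, here is a small base R sketch with a made-up factor:

```r
# Made-up factor for illustration
sizes <- factor(c("low", "high", "medium", "low"))

lvls  <- levels(sizes)      # labels, alphabetical by default: "high" "low" "medium"
codes <- as.integer(sizes)  # underlying integer codes: 2 1 3 2
kind  <- typeof(sizes)      # "integer"
```

The default alphabetical ordering of the levels is what usually needs fixing before plotting or summarizing.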
The ordering of the factor variable comes out in, for example, group_by() output.
SurveyUSA poll from 2012 on views of the DREAM Act.
What is off about the data viz part of the report?
We can fix the order of the ideology variable.
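The poll data are not shown here, so the sketch below uses a hypothetical ideology vector with three assumed categories. It illustrates one way to set the order by hand with fct_relevel() from forcats:

```r
library(forcats)

# Hypothetical responses; the real poll categories may differ
ideology <- factor(c("Moderate", "Conservative", "Liberal", "Moderate"))
before <- levels(ideology)  # alphabetical: "Conservative" "Liberal" "Moderate"

# Put the levels in a meaningful left-to-right order
ideology_fixed <- fct_relevel(ideology, "Liberal", "Moderate", "Conservative")
after <- levels(ideology_fixed)
```

Only the level order changes; the values stored for each respondent stay the same.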
# A tibble: 77 × 3
item type calories
<chr> <fct> <int>
1 "8-Grain Roll" bakery 350
2 "Apple Bran Muffin" bakery 350
3 "Apple Fritter" bakery 420
4 "Banana Nut Loaf" bakery 490
5 "Birthday Cake Mini Doughnut" bakery 130
6 "Blueberry Oat Bar" bakery 370
7 "Blueberry Scone" bakery 460
8 "Bountiful Blueberry Muffin" bakery 370
9 "Butter Croissant " bakery 310
10 "Cheese Danish" bakery 420
# ℹ 67 more rows
Let’s say that we wanted to order the type of food item based on the average number of calories in that food.
# A tibble: 77 × 7
item calories fat carb fiber protein type
<chr> <int> <dbl> <int> <int> <int> <fct>
1 "8-Grain Roll" 350 8 67 5 10 bakery
2 "Apple Bran Muffin" 350 9 64 7 6 bakery
3 "Apple Fritter" 420 20 59 0 5 bakery
4 "Banana Nut Loaf" 490 19 75 4 7 bakery
5 "Birthday Cake Mini Doughnut" 130 6 17 0 0 bakery
6 "Blueberry Oat Bar" 370 14 47 5 6 bakery
7 "Blueberry Scone" 460 22 61 2 7 bakery
8 "Bountiful Blueberry Muffin" 370 14 55 0 6 bakery
9 "Butter Croissant " 310 18 32 0 5 bakery
10 "Cheese Danish" 420 25 39 0 7 bakery
# ℹ 67 more rows
# A tibble: 77 × 7
item calories fat carb fiber protein type
<fct> <int> <dbl> <int> <int> <int> <fct>
1 "8-Grain Roll" 350 8 67 5 10 bakery
2 "Apple Bran Muffin" 350 9 64 7 6 bakery
3 "Apple Fritter" 420 20 59 0 5 bakery
4 "Banana Nut Loaf" 490 19 75 4 7 bakery
5 "Birthday Cake Mini Doughnut" 130 6 17 0 0 bakery
6 "Blueberry Oat Bar" 370 14 47 5 6 bakery
7 "Blueberry Scone" 460 22 61 2 7 bakery
8 "Bountiful Blueberry Muffin" 370 14 55 0 6 bakery
9 "Butter Croissant " 310 18 32 0 5 bakery
10 "Cheese Danish" 420 25 39 0 7 bakery
# ℹ 67 more rows
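The code behind the tables above is not shown. A sketch of ordering a factor by average calories, using fct_reorder() on a small made-up menu (not the 77-row data), could look like this:

```r
library(dplyr)
library(forcats)

# Made-up menu for illustration
menu <- tibble(
  type     = factor(c("bakery", "bakery", "salad", "salad", "sandwich", "sandwich")),
  calories = c(350, 420, 80, 120, 300, 400)
)

# Reorder the levels of type by the mean calories within each type
menu_ordered <- menu |>
  mutate(type = fct_reorder(type, calories, .fun = mean))

levels(menu_ordered$type)  # lowest mean first: "salad" "sandwich" "bakery"
```

Without the .fun argument, fct_reorder() uses the median rather than the mean.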
The forcats package within tidyverse contains lots of functions to help process factor variables. Use the forcats cheatsheet. We’ll focus on the most common functions.
fct_relevel() = manually reorder levels
fct_reorder() = reorder levels according to values of another variable
fct_infreq() = order levels from highest to lowest frequency
fct_rev() = reverse the current order
fct_recode() = manually change levels
fct_lump() = group together least common levels
If the variable is formatted as a time or date object, you will find that there are very convenient ways to access, wrangle, and plot the information.
There are three types of date/time data that refer to an instant in time:
A date. Tibbles print this as <date>.
A time within a day. Tibbles print this as <time>.
A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Base R calls these POSIXct, but that doesn’t exactly trip off the tongue.
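A minimal sketch of the date and date-time classes, using lubridate's today() and now(); time-of-day values are left out of this example:

```r
library(lubridate)

d  <- today()  # the current date
dt <- now()    # the current date-time

class(d)   # "Date"
class(dt)  # "POSIXct" "POSIXt"
```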
ymd() and friends create dates
To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:
[1] "2024-09-25 11:45:59 PDT"
[1] "2024-09-25 15:01:00 UTC"
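The calls that produced the two lines of output are not shown. The sketch below is one set of inputs that would print the same values; the input strings and the time zone name are assumptions:

```r
library(lubridate)

# The letters in the function name give the order of the pieces in the input
d1 <- ymd("2024-09-25")
d2 <- mdy("09/25/2024")
d3 <- dmy("25-09-2024")

# Add _hms or _hm to parse a time as well; tz sets the time zone (UTC by default)
dt1 <- ymd_hms("2024-09-25 11:45:59", tz = "America/Los_Angeles")
dt2 <- mdy_hm("09/25/2024 15:01")

dt1
dt2
```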
lubridate is another R package meant for data wrangling!
In particular, lubridate makes it very easy to work with days, times, and dates. The base idea is to start with dates in a ymd (year month day) format and transform the information into whatever you want.
Example from the lubridate vignette.
The lengths of months and years change so often that doing arithmetic with them can be unintuitive.
Consider a simple operation: January 31st + one month.
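Should the answer be:
1. February 31st (which doesn’t exist),
2. March 4th (31 days after January 31), or
3. February 28th (assuming it’s not a leap year)?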
A basic property of arithmetic is that a + b - b = a. Only solution 1 obeys the mathematical property, but it is an invalid date. Wickham wants to make lubridate as consistent as possible by invoking the following rule: if adding or subtracting a month or a year creates an invalid date, lubridate will return an NA.
If you thought solution 2 or 3 was more useful, no problem. You can still get those results with clever arithmetic, or by using the special %m+% and %m-% operators. %m+% and %m-% automatically roll dates back to the last day of the month, should that be necessary.
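A short sketch of the three behaviors for a single date. It uses 2024, a leap year, so "31 days later" lands on March 2 and the rolled-back date is February 29:

```r
library(lubridate)

jan31 <- ymd("2024-01-31")

plus_month  <- jan31 + months(1)     # NA, because February 31st does not exist
plus_31days <- jan31 + days(31)      # "2024-03-02"
rolled      <- jan31 %m+% months(1)  # "2024-02-29", last day of February

# Rolling back does not satisfy a + b - b = a
round_trip <- rolled %m-% months(1)  # "2024-01-29", not January 31st
```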
lubridate
[1] "2024-01-31" NA "2024-03-31" NA "2024-05-31"
[6] NA "2024-07-31" "2024-08-31" NA "2024-10-31"
[11] NA "2024-12-31"
[1] "2024-01-01"
[1] "2024-02-01" "2024-03-03" "2024-04-01" "2024-05-02" "2024-06-01"
[6] "2024-07-02" "2024-08-01" "2024-09-01" "2024-10-02" "2024-11-01"
[11] "2024-12-02" "2025-01-01"
[1] "2024-03-02" NA "2024-05-01" NA "2024-07-01"
[6] NA "2024-08-31" "2024-10-01" NA "2024-12-01"
[11] NA "2025-01-31"
[1] "2024-01-31" "2024-02-29" "2024-03-31" "2024-04-30" "2024-05-31"
[6] "2024-06-30" "2024-07-31" "2024-08-31" "2024-09-30" "2024-10-31"
[11] "2024-11-30" "2024-12-31"
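The code for the output above is not shown. The sketch below is one set of calls consistent with several of the printed results; the variable names are my own:

```r
library(lubridate)

jan31 <- ymd("2024-01-31")

# Adding whole months gives NA whenever the 31st does not exist
naive <- jan31 + months(0:11)

# Start from the first of the month instead, then add a fixed 31 days
first_of_month <- floor_date(jan31, "month")
plus31 <- first_of_month + months(0:11) + days(31)

# %m+% rolls back to the last valid day of each month
rolled <- jan31 %m+% months(0:11)

naive
plus31
rolled
```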
Creating a date object from variables.
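library(dplyr)
library(lubridate)
library(nycflights13)  # provides the flights data

# Build a date from the year, month, and day columns, then derive weekday and week number
flightsWK <- flights |>
  mutate(ymdday = ymd(paste(year, month, day, sep = "-"))) |>
  mutate(weekdy = wday(ymdday, label = TRUE),
         whichweek = week(ymdday))

flightsWK |>
  select(year, month, day, ymdday, weekdy, whichweek,
         dep_time, arr_time, air_time)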
# A tibble: 336,776 × 9
year month day ymdday weekdy whichweek dep_time arr_time air_time
<int> <int> <int> <date> <ord> <dbl> <int> <int> <dbl>
1 2013 1 1 2013-01-01 Tue 1 517 830 227
2 2013 1 1 2013-01-01 Tue 1 533 850 227
3 2013 1 1 2013-01-01 Tue 1 542 923 160
4 2013 1 1 2013-01-01 Tue 1 544 1004 183
5 2013 1 1 2013-01-01 Tue 1 554 812 116
6 2013 1 1 2013-01-01 Tue 1 554 740 150
7 2013 1 1 2013-01-01 Tue 1 555 913 158
8 2013 1 1 2013-01-01 Tue 1 557 709 53
9 2013 1 1 2013-01-01 Tue 1 557 838 140
10 2013 1 1 2013-01-01 Tue 1 558 753 138
# ℹ 336,766 more rows
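As an alternative sketch, lubridate's make_date() builds the date directly from numeric columns without pasting strings first. The toy table is a hypothetical three-row stand-in for the flights data:

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for the flights data
toy <- tibble(
  year  = c(2013L, 2013L, 2013L),
  month = c(1L, 2L, 12L),
  day   = c(1L, 14L, 31L)
)

toy_dates <- toy |>
  mutate(ymdday      = make_date(year, month, day),
         weekday_num = wday(ymdday),   # 1 = Sunday by default
         whichweek   = week(ymdday))
toy_dates
```

Leaving out label = TRUE in wday() returns the day of the week as a number, which does not depend on the language settings of the machine.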