The Active Duty data are not tidy! What are the cases? How are the data not tidy? What might the data look like in tidy form? Suppose that the case was “an individual in the armed forces.” What variables would you use to capture the information in the following table?
Tidying data
How would you wrangle each data example into a tidy format?
the emails in your inbox
social media texts
images
videos
audio files
Tidy packages: the tidyverse
image credit: https://www.tidyverse.org/.
Verbs
Most data wrangling happens with a set of data verbs. Verbs are functions that act on data frames.
The first argument of each data verb is the data frame.
Some Basic Verbs
filter()∗
arrange()
select()∗
distinct()
mutate()∗
summarize()∗
sample_n()
group_by()∗
The most used verbs are denoted with ∗.
(out of) NYC, flights data (2013)
library(nycflights13)flights
ABCDEFGHIJ0123456789
year
<int>
month
<int>
day
<int>
dep_time
<int>
sched_dep_time
<int>
2013
1
1
517
515
2013
1
1
533
529
2013
1
1
542
540
2013
1
1
544
545
2013
1
1
554
600
2013
1
1
554
558
2013
1
1
555
600
2013
1
1
557
600
2013
1
1
557
600
2013
1
1
558
600
filter()
Allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame, the following arguments are the filters that you’d like to apply
For all flights on January 1st:
filter(flights, month ==1, day ==1)
ABCDEFGHIJ0123456789
year
<int>
month
<int>
day
<int>
dep_time
<int>
sched_dep_time
<int>
2013
1
1
517
515
2013
1
1
533
529
2013
1
1
542
540
2013
1
1
544
545
2013
1
1
554
600
2013
1
1
554
558
2013
1
1
555
600
2013
1
1
557
600
2013
1
1
557
600
2013
1
1
558
600
Constructing filters
Filters are constructed of logical operators: <, >, <=, >=, ==, != (and some others).
Adding them one by one to filter() is akin to saying “this AND that”. To say “this OR that OR both”, use |.
filter(flights, month ==1| month ==2)
ABCDEFGHIJ0123456789
year
<int>
month
<int>
day
<int>
dep_time
<int>
sched_dep_time
<int>
2013
1
1
517
515
2013
1
1
533
529
2013
1
1
542
540
2013
1
1
544
545
2013
1
1
554
600
2013
1
1
554
558
2013
1
1
555
600
2013
1
1
557
600
2013
1
1
557
600
2013
1
1
558
600
Practice
Construct filters to isolate:
Flights that left on St. Patrick’s Day.
Flights that were destined for Chicago’s primary airport.
Flights that were destined for Chicago’s primary airport and were operated by United Airlines.
Flights with flight times more than 2000 miles or that were in the air more than 5 hours.
Solution
Flights that left on St. Patrick’s Day.
Flights that were destined for Chicago’s primary airport.
Flights that were destined for Chicago’s primary airport and were operated by United Airlines.
Flights with flight times more than 2000 miles or that were in the air more than 5 hours.
filter(flights, month ==3, day ==17)filter(flights, dest =="ORD")filter(flights, dest =="ORD", carrier =="UA")filter(flights, distance >2000| air_time >5*60)
arrange()
arrange() reorders the rows: It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)
ABCDEFGHIJ0123456789
year
<int>
month
<int>
day
<int>
dep_time
<int>
sched_dep_time
<int>
2013
1
1
517
515
2013
1
1
533
529
2013
1
1
542
540
2013
1
1
544
545
2013
1
1
554
600
2013
1
1
554
558
2013
1
1
555
600
2013
1
1
557
600
2013
1
1
557
600
2013
1
1
558
600
Use desc() to sort in descending order.
arrange(flights, desc(dep_delay))
ABCDEFGHIJ0123456789
year
<int>
month
<int>
day
<int>
dep_time
<int>
sched_dep_time
<int>
2013
1
9
641
900
2013
6
15
1432
1935
2013
1
10
1121
1635
2013
9
20
1139
1845
2013
7
22
845
1600
2013
4
10
1100
1900
2013
3
17
2321
810
2013
6
27
959
1900
2013
7
22
2257
759
2013
12
5
756
1700
select()
Often you work with large datasets with many columns where only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:
select(flights, year, month, day)
ABCDEFGHIJ0123456789
year
<int>
month
<int>
day
<int>
2013
1
1
2013
1
1
2013
1
1
2013
1
1
2013
1
1
2013
1
1
2013
1
1
2013
1
1
2013
1
1
2013
1
1
You can exclude columns using - and specify a range using :.
select(flights, -(year:day))
ABCDEFGHIJ0123456789
dep_time
<int>
sched_dep_time
<int>
dep_delay
<dbl>
arr_time
<int>
sched_arr_time
<int>
517
515
2
830
819
533
529
4
850
830
542
540
2
923
850
544
545
-1
1004
1022
554
600
-6
812
837
554
558
-4
740
728
555
600
-5
913
854
557
600
-3
709
723
557
600
-3
838
846
558
600
-2
753
745
distinct()
A common use of select() is to find out which values a set of variables takes. This is particularly useful in conjunction with the distinct() verb which only returns the unique values in a table.
What do the following data correspond to?
distinct(select(flights, origin, dest))
ABCDEFGHIJ0123456789
origin
<chr>
dest
<chr>
EWR
IAH
LGA
IAH
JFK
MIA
JFK
BQN
LGA
ATL
EWR
ORD
EWR
FLL
LGA
IAD
JFK
MCO
LGA
ORD
mutate()
As well as selecting from the set of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():
select(mutate(flights, gain = dep_delay - arr_delay), flight, dep_delay, arr_delay, gain)
ABCDEFGHIJ0123456789
flight
<int>
dep_delay
<dbl>
arr_delay
<dbl>
gain
<dbl>
1545
2
11
-9
1714
4
20
-16
1141
2
33
-31
725
-1
-18
17
461
-6
-25
19
1696
-4
12
-16
507
-5
19
-24
5708
-3
-14
11
79
-3
-8
5
301
-2
8
-10
summarize() and sample_n()
summarize() collapses a data frame to a single row based on some function. It’s not very useful yet, but it will be.
sample_n() provides you with a random sample of rows.
summarize() and sample_n() are even more powerful when combined with the idea of “group by”, repeating the operation separately on groups of observations within the dataset.
The group_by() function describes how to break a dataset down into groups of rows.
group_by() does not change anything about the data frame. Rather, it holds the rows of the data frame separately so that any future verbs applied to the data frame are done separately per group being held.
group_by()
Find the fastest airplanes in the bunch, measured as the average speed per airplane.
Instead of applying each verb step-by-step, we can chain them into a single data pipeline, connected with the |> operator. You start the pipeline with a data frame and then pass it to each function in turn.
The pipe syntax (|>) takes a data frame and sends it to the argument of a function. The mapping goes to the first available argument in the function. For example:
x |> f() is the same as f(x)
x |> f(y) is the same as f(x, y)
Mornings
me_step1 <-dress(me, what = sports) me_step2 <-exercise(me_step1, how = running) me_step3 <-eat(me_step2, choice = cereal) me_step4 <-dress(me_step3, what = school) me_step5 <-commute(me_step4, transportation = bike)
Mornings
commute(dress(eat(exercise(dress(me, what = sports), how = running), choice = cereal), what = school), transportation = bike)
Morning
(better??)
commute(dress(eat(exercise(dress(me, what = sports), how = running), choice = cereal), what = school), transportation = bike)
Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays?
Practice
Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays?
Say you’re curious about the relationship between the number of flights that each plane made in 2013, the mean distance that each of those planes flew, and the mean arrival delay. You also want to exclude the edge cases from your analysis, so focus on the planes that have logged more than 20 flights and flown an average distance of less than 2000 miles. Please form the chain that creates this dataset.