Data Wrangling

February 5 + 10, 2025

Jo Hardin

What are data?

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007

year <int>	month <int>	day <int>	dep_time <int>	sched_dep_time <int>
2013	1	1	517	515
2013	1	1	533	529
2013	1	1	542	540
2013	1	1	544	545
2013	1	1	554	600
2013	1	1	554	558
2013	1	1	555	600
2013	1	1	557	600
2013	1	1	557	600
2013	1	1	558	600

year <int>	month <int>	day <int>	dep_time <int>	sched_dep_time <int>
2013	1	9	641	900
2013	6	15	1432	1935
2013	1	10	1121	1635
2013	9	20	1139	1845
2013	7	22	845	1600
2013	4	10	1100	1900
2013	3	17	2321	810
2013	6	27	959	1900
2013	7	22	2257	759
2013	12	5	756	1700

year <int>	month <int>	day <int>
2013	1	1
2013	1	1
2013	1	1
2013	1	1
2013	1	1
2013	1	1
2013	1	1
2013	1	1
2013	1	1
2013	1	1

dep_time <int>	sched_dep_time <int>	dep_delay <dbl>	arr_time <int>	sched_arr_time <int>
517	515	2	830	819
533	529	4	850	830
542	540	2	923	850
544	545	-1	1004	1022
554	600	-6	812	837
554	558	-4	740	728
555	600	-5	913	854
557	600	-3	709	723
557	600	-3	838	846
558	600	-2	753	745

origin <chr>	dest <chr>
EWR	IAH
LGA	IAH
JFK	MIA
JFK	BQN
LGA	ATL
EWR	ORD
EWR	FLL
LGA	IAD
JFK	MCO
LGA	ORD

flight <int>	dep_delay <dbl>	arr_delay <dbl>	gain <dbl>
1545	2	11	-9
1714	4	20	-16
1141	2	33	-31
725	-1	-18	17
461	-6	-25	19
1696	-4	12	-16
507	-5	19	-24
5708	-3	-14	11
79	-3	-8	5
301	-2	8	-10
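
In the table above, `gain` appears to equal `dep_delay` minus `arr_delay` (in the first row, 2 - 11 = -9). A minimal sketch with dplyr's `mutate()`, assuming a small hand-typed tibble, hypothetically named `flights_small`, that copies the first three rows of that table:

```r
library(dplyr)

# Hypothetical three-row excerpt typed in from the table above
flights_small <- tibble(
  flight    = c(1545, 1714, 1141),
  dep_delay = c(2, 4, 2),
  arr_delay = c(11, 20, 33)
)

# Assumption: gain is the departure delay minus the arrival delay
flights_gain <- flights_small |>
  mutate(gain = dep_delay - arr_delay)

flights_gain$gain
```

A negative `gain` under this definition means the flight lost more time in the air than it had already lost at departure.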

delay <dbl>
12.63907

year <int>	month <int>	day <int>	dep_time <int>	sched_dep_time <int>
2013	12	30	1835	1804
2013	2	13	555	600
2013	2	4	601	600
2013	4	14	950	955
2013	11	14	1253	1300
2013	8	30	2121	2122
2013	9	23	1353	1400
2013	5	1	1106	1115
2013	12	21	33	2359
2013	4	10	2232	2032

tailnum <chr>	count <int>	avg_speed <dbl>
N228UA	1	500.8163
N315AS	1	498.6851
N654UA	1	498.5821
N819AW	1	490.3448
N382HA	26	485.6026
N388HA	36	484.3891
N391HA	21	484.0645
N777UA	1	483.3645
N385HA	28	482.8947
N392HA	13	482.2468

tailnum <chr>	number <int>	avg_speed <dbl>
N228UA	1	500.8163
N315AS	1	498.6851
N654UA	1	498.5821
N819AW	1	490.3448
N382HA	26	485.6026
N388HA	36	484.3891
N391HA	21	484.0645
N777UA	1	483.3645
N385HA	28	482.8947
N392HA	13	482.2468

carrier <chr>	avg_delay <dbl>
F9	20.215543
EV	19.955390
YV	18.996330
FL	18.726075
WN	17.711744
9E	16.725769
B6	13.022522
VX	12.869421
OO	12.586207
UA	12.106073

Data Wrangling
Agenda 2/5/25
What are data?
Tidy data
Definition of datum
Not tidy – Active Duty Military
Tidying data
Tidy packages: the tidyverse
Verbs
Some Basic Verbs
(out of) NYC, flights data (2013)
filter()
Constructing filters
Practice
Solution
arrange()
select()
distinct()
mutate()
summarize() and sample_n()
Practice
Practice
group_by()
group_by()
Chaining
Mornings
Mornings
Morning
Mornings
Little Bunny Foo Foo
Little Bunny Foo Foo
Little Bunny Foo Foo
Little Bunny Foo Foo
Little Bunny Foo Foo
Flights
Practice
Practice
Practice again
Solution
Visualizing the data

