Data Wrangling

September 10, 2025

Jo Hardin

Agenda 9/10/25

Tidy data
Data verbs

What are data?

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007

Tidy data

each row = a unit of observation (here, a penguin)
each column = a measure on some variable of interest, either quantitative (numbers with units) or categorical (discrete possibilities or categories)
each entry contains a single data value; no analysis, summaries, footnotes, comments, etc, and only one value per cell

Definition of `datum`

The definition of datum can be much broader:

Definition of datum from the Oxford English Dictionary

Not tidy – Active Duty Military

The Active Duty data are not tidy! What are the cases? How are the data not tidy? What might the data look like in tidy form? Suppose that the case was “an individual in the armed forces.” What variables would you use to capture the information in the following table?

Tidying data

How would you wrangle each data example into a tidy format?

the emails in your inbox
social media texts
images
videos
audio files

Tidy packages: the tidyverse

Image of hex stickers for the eight core tidyverse packages including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.

image credit: https://www.tidyverse.org/.

Verbs

Most data wrangling happens with a set of data verbs. Verbs are functions that act on data frames.

The first argument of each data verb is the data frame.

The output of a data verb is a data frame.

Some Basic Verbs

filter()
arrange()
select()
mutate()
summarize()
group_by()

(out of) NYC, flights data (2013)

library(nycflights13)
flights

`filter()`

Allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame, the following arguments are the filters that you’d like to apply

For all flights on January 1st:

filter(flights, month == 1, day == 1)

Constructing filters

Filters are constructed of logical operators: <, >, <=, >=, ==, != (and some others).

Adding them one by one to filter() is akin to saying “this AND that”. To say “this OR that OR both”, use |.

filter(flights, month == 1 | month == 2)

Practice

Construct filters to isolate:

Flights that left on St. Patrick’s Day.
Flights that were destined for Chicago’s primary airport.
Flights that were destined for Chicago’s primary airport and were operated by United Airlines.
Flights with flight times more than 2000 miles or that were in the air more than 5 hours.

Solution

Flights that left on St. Patrick’s Day.
Flights that were destined for Chicago’s primary airport.
Flights that were destined for Chicago’s primary airport and were operated by United Airlines.
Flights with flight times more than 2000 miles or that were in the air more than 5 hours.

filter(flights, month == 3, day == 17)
filter(flights, dest == "ORD")
filter(flights, dest == "ORD", carrier == "UA")
filter(flights, distance > 2000 | air_time > 5*60)

`arrange()`

arrange() reorders the rows: It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)

`arrange()`

Use desc() to sort in descending order.

arrange(flights, desc(dep_delay))

`select()`

Often you work with large datasets with many columns where only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions:

select(flights, year, month, day)

`select()`

You can exclude columns using - and specify a range using :.

select(flights, -(year:day))

`mutate()`

As well as selecting from the set of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of mutate():

select(mutate(flights, gain = dep_delay - arr_delay), 
       flight, dep_delay, arr_delay, gain)

`summarize()`

summarize() collapses a data frame to a single row based on some function. It’s not very useful yet, but it will be.

summarize(flights, delay = mean(dep_delay, na.rm = TRUE))

Practice

Mutate the data to create a new column that contains the average speed traveled by the plane for each flight.

Select the new variable and save it, along with tailnum, as a new data frame object.

Practice

Mutate the data to create a new column that contains the average speed traveled by the plane for each flight.

Select the new variable and save it, along with tailnum, as a new data frame object.

flights2 <- mutate(flights, speed = distance/(air_time/60))
speed_data <- select(flights2, tailnum, speed)

`group_by()`

summarize() is even more powerful when combined with the idea of “group by”, repeating the operation separately on groups of observations within the dataset.

The group_by() function describes how to break a dataset down into groups of rows.

group_by() does not change anything about the data frame. Rather, it holds the rows of the data frame separately so that any future verbs applied to the data frame are done separately per group being held.

`group_by()`

Find the fastest airplanes in the bunch, measured as the average speed per airplane.

by_tailnum <- group_by(speed_data, tailnum)
avg_speed <- summarize(by_tailnum, 
                       count = n(), 
                       avg_speed = mean(speed, na.rm = TRUE))
arrange(avg_speed, desc(avg_speed))

Chaining

Instead of applying each verb step-by-step, we can chain them into a single data pipeline, connected with the |> operator. You start the pipeline with a data frame and then pass it to each function in turn.

The pipe syntax (|>) takes a data frame and sends it to the argument of a function. The mapping goes to the first available argument in the function. For example:

x |> f() is the same as f(x)

x |> f(y) is the same as f(x, y)

Mornings

me_step1 <- dress(me, what = sports)  
me_step2 <- exercise(me_step1, how = running)  
me_step3 <- eat(me_step2, choice = cereal)  
me_step4 <- dress(me_step3, what = school)  
me_step5 <- commute(me_step4, transportation = bike)

Mornings

commute(dress(eat(exercise(dress(me, what = sports), how = running), choice = cereal), what = school), transportation = bike)

Morning

(better??)

commute(
  dress(
    eat(
      exercise(
        dress(me, 
              what = sports), 
        how = running), 
      choice = cereal), 
    what = school), 
  transportation = bike)

Mornings

me |> 
  dress(what = sports) |> 
  exercise(how = running) |> 
  eat(choice = cereal) |> 
  dress(what = school) |> 
  commute(transportation = bike)

flights2 <- mutate(flights, speed = distance/(air_time/60))
tail_speed <- select(flights2, tailnum, speed)
tail_speed_grp <- group_by(tail_speed, tailnum)
tail_ave <- summarize(tail_speed_grp, number = n(),
                      avg_speed = mean(speed, na.rm = TRUE))
arrange(tail_ave, desc(avg_speed))

flights |> 
  mutate(speed = distance / (air_time/60)) |> 
  select(tailnum, speed) |> 
  group_by(tailnum) |>  
  summarize(number = n(), 
            avg_speed = mean(speed, na.rm = TRUE)) |> 
  arrange(desc(avg_speed))

Practice

Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays?

Practice

Form a chain that creates a data frame containing only carrier and the mean departure delay time. Which carriers have the highest and lowest mean delays?

flights |> 
  group_by(carrier) |> 
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) |> 
  arrange(desc(avg_delay))

Practice again

Say you’re curious about the relationship between the number of flights that each plane made in 2013, the mean distance that each of those planes flew, and the mean arrival delay. You also want to exclude the edge cases from your analysis, so focus on the planes that have logged more than 20 flights and flown an average distance of less than 2000 miles. Please form the chain that creates this dataset.

Solution

delay_data <- flights |> 
  group_by(tailnum) |> 
  summarize(number = n(),
            dist = mean(distance, na.rm = TRUE), 
            delay = mean(arr_delay, na.rm = TRUE)) |> 
  filter(number > 20, dist < 2000)

Visualizing the data

Viz
Code

delay_data |> 
 ggplot(aes(dist, delay)) +
 geom_point(aes(size = number), 
            alpha = 1/2) +
 geom_smooth() +
 scale_size_area()

When scale_size_area is used, the default behavior is to scale the area of points to be proportional to the value.