September 3 + 8, 2025
Important
Before Monday, read: Tufte. 1997. Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)
What was Hilary trying to answer in her data collection?
Name two of Hilary’s main hurdles in gathering accurate data.
Which is better: high touch (manual) or low touch (automatic) data collection? Why?
What additional covariates are needed / desired? Any problems with them?
How much data does she need?
Are there any ethical considerations to think about?
Based on https://www.effectivedatastorytelling.com/post/a-deeper-dive-into-lego-bricks-and-data-stories, original source: https://www.linkedin.com/learning/instructors/bill-shander
Yau (2013) gives us nine visual cues, and Wickham (2014) translates them into a language using ggplot2.
Visual Cues: the aspects of the figure where we should focus.
Position (numerical) where in relation to other things?
Length (numerical) how big (in one dimension)?
Angle (numerical) how wide? parallel to something else?
Direction (numerical) at what slope? In a time series, going up or down?
Shape (categorical) belonging to what group?
Area (numerical) how big (in two dimensions)? Beware of improper scaling!
Volume (numerical) how big (in three dimensions)? Beware of improper scaling!
Shade (either) to what extent? how severely?
Color (either) to what extent? how severely? Beware of red/green color blindness.
Coordinate System: rectangular, polar, geographic, etc.
Scale: numeric (linear? logarithmic?), categorical (ordered?), time
Context: in comparison to what (think back to ideas from Tufte)
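As a rough illustration of how these cues, scales, and coordinate systems might be expressed in ggplot2, here is a minimal sketch. The data frame `births_head` and the column `day_type` are hypothetical names introduced only for this example; the six rows are typed in from the 2015 births data used later in these notes.

```r
library(ggplot2)

# Hypothetical stand-in data: six rows of the 2015 births data, typed in by hand.
births_head <- data.frame(
  date   = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                     "2015-01-04", "2015-01-05", "2015-01-06")),
  births = c(8068, 10850, 8328, 7065, 11892, 12425),
  wday   = c("Thu", "Fri", "Sat", "Sun", "Mon", "Tue")
)
# Hypothetical helper column: weekend vs. weekday (categorical).
births_head$day_type <- ifelse(births_head$wday %in% c("Sat", "Sun"),
                               "weekend", "weekday")

p_cues <- ggplot(births_head,
                 aes(x = date, y = births,   # position (numerical)
                     color = births,         # shade/color (numerical here)
                     shape = day_type)) +    # shape (categorical)
  geom_point(size = 3) +
  scale_y_log10() +      # scale: logarithmic rather than linear
  coord_cartesian()      # coordinate system: rectangular (the default)
p_cues
```

Swapping `coord_cartesian()` for `coord_polar()`, or dropping `scale_y_log10()`, changes the coordinate system or scale without touching the cues themselves.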
Attributes can focus your reader’s attention.
ggplot2
What I will try to do
give a tour of ggplot2
explain how to think about plots the ggplot2 way
prepare/encourage you to learn more later
What I can’t do in one session
show every bell and whistle
make you an expert at using ggplot2
One of the best ways to get started with ggplot is to Google what you want to do along with the word ggplot, then look through the images that come up. More often than not, the associated code is there. There are also ggplot galleries of images; one of them is here: https://plot.ly/ggplot2/
See the end of this presentation and the syllabus for more help options.
ggplot
geom: the geometric “shape” used to display data
aesthetic: an attribute controlling how geom is displayed with respect to variables
guide: helps user convert visual data back into raw data (legends, axes)
stat: a transformation applied to data before geom gets it
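To make the four terms concrete, here is a minimal sketch that labels each one in a single plot. The data frame `births_head` and the column `day_type` are hypothetical names for this example only; the six rows are typed in from the 2015 births data used in these notes.

```r
library(ggplot2)

# Hypothetical stand-in data: six rows of the 2015 births data, typed in by hand.
births_head <- data.frame(
  date   = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                     "2015-01-04", "2015-01-05", "2015-01-06")),
  births = c(8068, 10850, 8328, 7065, 11892, 12425),
  wday   = c("Thu", "Fri", "Sat", "Sun", "Mon", "Tue")
)
births_head$day_type <- ifelse(births_head$wday %in% c("Sat", "Sun"),
                               "weekend", "weekday")

p_terms <- ggplot(births_head,
                  aes(x = day_type, fill = day_type)) +  # aesthetics: day_type -> x and fill
  geom_bar()  # geom: bars; its default stat ("count") tallies rows per day_type first
# guides: the axes and the fill legend are added automatically

p_terms
layer_data(p_terms)$count  # what the geom receives after the stat has run
```

The bar heights come from the stat, not from any column in the data, which is why no `y` aesthetic is mapped here.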
| date | births | wday | year | month | day_of_year | day_of_month | day_of_week |
|---|---|---|---|---|---|---|---|
| 2015-01-01 | 8068 | Thu | 2015 | 1 | 1 | 1 | 5 |
| 2015-01-02 | 10850 | Fri | 2015 | 1 | 2 | 2 | 6 |
| 2015-01-03 | 8328 | Sat | 2015 | 1 | 3 | 3 | 7 |
| 2015-01-04 | 7065 | Sun | 2015 | 1 | 4 | 4 | 1 |
| 2015-01-05 | 11892 | Mon | 2015 | 1 | 5 | 5 | 2 |
| 2015-01-06 | 12425 | Tue | 2015 | 1 | 6 | 6 | 3 |
Obtained from the National Center for Health Statistics, National Vital Statistics System, Natality, 2015 data.
Two Questions:
What do we want R to do? (What is the goal?)
What does R need to know?
Goal: scatterplot = a plot with points
What does R need to know?
data source: Births2015
aesthetics:
date -> x
births -> y
points
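A minimal sketch of how these answers might be handed to R. The data frame `births_head` is a hypothetical stand-in holding only the six rows shown in the table above; with the full `Births2015` data loaded (assumed here to come from the mosaicData package), the same call should work with `Births2015` in its place.

```r
library(ggplot2)

# Hypothetical stand-in: the six rows from the table above, typed in by hand.
births_head <- data.frame(
  date   = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                     "2015-01-04", "2015-01-05", "2015-01-06")),
  births = c(8068, 10850, 8328, 7065, 11892, 12425),
  wday   = c("Thu", "Fri", "Sat", "Sun", "Mon", "Tue")
)

# data source -> data argument; date -> x and births -> y inside aes();
# points -> geom_point()
p_scatter <- ggplot(data = births_head, aes(x = date, y = births)) +
  geom_point()
p_scatter
```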
What has changed?
Now there are two layers: one with points and one with lines
The layers are placed one on top of the other: the points are below and the lines are above.
data and aes specified in ggplot() affect all geoms
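The two-layer plot described above could be written along these lines. This is a sketch, not necessarily the exact call used in class; `births_head` is a hypothetical six-row stand-in for `Births2015`.

```r
library(ggplot2)

# Hypothetical stand-in: six rows of the births data, typed in by hand.
births_head <- data.frame(
  date   = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                     "2015-01-04", "2015-01-05", "2015-01-06")),
  births = c(8068, 10850, 8328, 7065, 11892, 12425)
)

# data and aes() given in ggplot() are inherited by both geoms.
p_layers <- ggplot(births_head, aes(x = date, y = births)) +
  geom_point() +  # first layer: drawn first, so it sits underneath
  geom_line()     # second layer: drawn afterwards, on top of the points
p_layers
```

Reversing the order of `geom_point()` and `geom_line()` reverses which layer is drawn on top.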
This is mapping the color aesthetic to a new variable with only one value (“navy”).
So all the dots get set to the same color, but it’s not navy.
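One way the mapping described here could be written, as a sketch with a hypothetical six-row stand-in `births_head` for `Births2015`. The call to `layer_data()` is added only to inspect which color ggplot2 actually uses.

```r
library(ggplot2)

# Hypothetical stand-in: six rows of the births data, typed in by hand.
births_head <- data.frame(
  date   = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                     "2015-01-04", "2015-01-05", "2015-01-06")),
  births = c(8068, 10850, 8328, 7065, 11892, 12425)
)

# color = "navy" INSIDE aes(): "navy" is treated as data (a one-level variable),
# and ggplot2 picks a color for that level from its default palette.
p_mapped <- ggplot(births_head, aes(x = date, y = births)) +
  geom_point(aes(color = "navy"))
p_mapped

unique(layer_data(p_mapped)$colour)  # the color actually drawn (not "navy")
```

The plot also gains a legend with a single entry labeled "navy", another sign that the string was treated as a data value rather than a color.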
If we want to set the color to be navy for all of the dots, we do it outside the aes() designation:
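A sketch of the setting version, again using a hypothetical six-row stand-in `births_head` for `Births2015`:

```r
library(ggplot2)

# Hypothetical stand-in: six rows of the births data, typed in by hand.
births_head <- data.frame(
  date   = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                     "2015-01-04", "2015-01-05", "2015-01-06")),
  births = c(8068, 10850, 8328, 7065, 11892, 12425)
)

# color = "navy" OUTSIDE aes(): a fixed setting applied to every point.
p_set <- ggplot(births_head, aes(x = date, y = births)) +
  geom_point(color = "navy")
p_set

unique(layer_data(p_set)$colour)  # inspect the color actually drawn
```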
color = "navy" is now outside of the aesthetics list. That’s how ggplot2 distinguishes between mapping and setting.
ggplot() establishes the default data and aesthetics for the geoms, but each geom may change these defaults.
good practice: put into ggplot() the things that affect all (or most) of the layers; rest in geom_XXXX()
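As a sketch of a geom changing the defaults, the example below keeps the shared data and aesthetics in `ggplot()` and lets one layer swap in its own data. `births_head` and `weekend_rows` are hypothetical names; the six rows are typed in from the births table.

```r
library(ggplot2)

# Hypothetical stand-in: six rows of the births data, typed in by hand.
births_head <- data.frame(
  date   = as.Date(c("2015-01-01", "2015-01-02", "2015-01-03",
                     "2015-01-04", "2015-01-05", "2015-01-06")),
  births = c(8068, 10850, 8328, 7065, 11892, 12425),
  wday   = c("Thu", "Fri", "Sat", "Sun", "Mon", "Tue")
)
weekend_rows <- subset(births_head, wday %in% c("Sat", "Sun"))

p_defaults <- ggplot(births_head, aes(x = date, y = births)) +  # defaults for all layers
  geom_line() +                                  # uses the default data and aes
  geom_point(data = weekend_rows,                # overrides the data for this layer only
             color = "navy", size = 3)           # settings, so outside aes()
p_defaults
```

The point layer still inherits `x = date` and `y = births` from `ggplot()`; only its data source differs.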
Information gets passed to the plot via:
map the variable information inside the aes (aesthetic) command
set the non-variable information outside the aes (aesthetic) command
Make the data stand out
Facilitate comparison
Add information
(Nolan & Perrett, 2016)
Tufte lists two main motivational steps to working with graphics as part of an argument.
“An essential analytic task in making decisions based on evidence is to understand how things work.”
“Making decisions based on evidence requires the appropriate display of that evidence.”
Tufte (1997) Visual and Statistical Thinking: Displays of Evidence for Making Decisions. (Use Google to find it.)
How many aspects of this graph can you point out which are relevant to figuring out that cholera infection was coming from a single pump? Are there any distracting aspects?
Why would the outbreak already have begun to decline before the pump handle was removed?
One of the graphics that was particularly unconvincing in trying to explain that O-rings fail in the cold.