Web Scraping

November 4 + 6, 2024

Jo Hardin

Agenda 11/4/24

web scraping
selector gadget
rvest

Important tool

Our approach to web scraping relies on the Chrome browser and an extension called SelectorGadget. Download them here:

Data Acquisition

Based on https://www.effectivedatastorytelling.com/post/a-deeper-dive-into-lego-bricks-and-data-stories, original source: https://www.linkedin.com/learning/instructors/bill-shander

Reading The Student Life

How often do you read The Student Life?
a. Every day
b. 3-5 times a week
c. Once a week
d. Rarely

Reading The Student Life

What do you think is the most common word in the titles of The Student Life opinion pieces?

Analyzing The Student Life

Reading The Student Life

How do you think the sentiments in opinion pieces in The Student Life compare across authors?
Roughly the same?
Wildly different?
Somewhere in between?

Analyzing The Student Life

All of this analysis is done in R!

(mostly) with tools you already know!

Common words in The Student Life titles

Code for the earlier plot:

data(stop_words)  # from tidytext
tsl_opinion_titles |>
  tidytext::unnest_tokens(word, title) |>
  anti_join(stop_words) |>
  count(word, sort = TRUE) |>
  slice_head(n = 20) |>
  mutate(word = fct_reorder(word, n)) |>
  ggplot(aes(y = word, x = n, fill = log(n))) +
  geom_col(show.legend = FALSE) +
  theme_minimal(base_size = 16) +
  labs(
    x = "Number of mentions",
    y = "Word",
    title = "The Student Life - Opinion pieces",
    subtitle = "Common words in the 500 most recent opinion pieces",
    caption = "Source: Data scraped from The Student Life on Nov 4, 2024"
  ) +
  theme(
    plot.title.position = "plot",
    plot.caption = element_text(color = "gray30")
  )

Avg sentiment scores of first paragraph

Code for the earlier plot:

afinn_sentiments <- get_sentiments("afinn")  # need tidytext and textdata
tsl_opinion_titles |>
  tidytext::unnest_tokens(word, first_p) |>
  anti_join(stop_words) |>
  left_join(afinn_sentiments) |> 
  group_by(authors, title) |>
  summarize(total_sentiment = sum(value, na.rm = TRUE), .groups = "drop") |>
  group_by(authors) |>
  summarize(
    n_articles = n(),
    avg_sentiment = mean(total_sentiment, na.rm = TRUE),
  ) |>
  filter(n_articles > 1 & !is.na(authors)) |>
  arrange(desc(avg_sentiment)) |>
  slice(c(1:10, 69:78)) |>
  mutate(
    authors = fct_reorder(authors, avg_sentiment),
    neg_pos = if_else(avg_sentiment < 0, "neg", "pos"),
    label_position = if_else(neg_pos == "neg", 0.25, -0.25)
  ) |>
  ggplot(aes(y = authors, x = avg_sentiment)) +
  geom_col(aes(fill = neg_pos), show.legend = FALSE) +
  geom_text(
    aes(x = label_position, label = authors, color = neg_pos),
    hjust = c(rep(1,10), rep(0, 10)),
    show.legend = FALSE,
    fontface = "bold"
  ) +
  geom_text(
    aes(label = round(avg_sentiment, 1)),
    hjust = c(rep(1.25,10), rep(-0.25, 10)),
    color = "white",
    fontface = "bold"
  ) +
  scale_fill_manual(values = c("neg" = "#4d4009", "pos" = "#FF4B91")) +
  scale_color_manual(values = c("neg" = "#4d4009", "pos" = "#FF4B91")) +
  scale_x_continuous(breaks = -5:5, minor_breaks = NULL) +
  scale_y_discrete(breaks = NULL) +
  coord_cartesian(xlim = c(-5, 5)) +
  labs(
    x = "negative  ←     Average sentiment score (AFINN)     →  positive",
    y = NULL,
    title = "The Student Life - Opinion pieces\nAverage sentiment scores of first paragraph by author",
    subtitle = "Top 10 average positive and negative scores",
    caption = "Source: Data scraped from The Student Life on Nov 4, 2024"
  ) +
  theme_void(base_size = 16) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, margin = unit(c(0.5, 0, 1, 0), "lines")),
    axis.text.y = element_blank(),
    plot.caption = element_text(color = "gray30")
  )

Where is the data coming from?

https://tsl.news/category/opinions/

Where is the data coming from?

tsl_opinion_titles

# A tibble: 500 × 4
   title                                     authors date                first_p
   <chr>                                     <chr>   <dttm>              <chr>  
 1 Elon Musk’s million-dollar-a-day rewards… Celest… 2024-11-01 16:27:00 have y…
 2 The politics behind apolitical acts       Eric Lu 2024-11-01 16:21:00 while …
 3 In Defense of the Pomona College Judicia… Henri … 2024-11-01 16:15:00 former…
 4 ‘Yakking’ isn’t a canon event, party res… Kabir … 2024-11-01 16:10:00 whirri…
 5 The ‘if he wanted to, he would’ mentalit… Tess M… 2024-11-01 16:01:00 ladies…
 6 You can’t silence us: A united front aga… Outbac… 2024-10-25 11:23:00 in the…
 7 We will not tolerate collective punishme… Suspen… 2024-10-25 09:04:00 in the…
 8 A guide to ballot propositions            Akshay… 2024-10-25 06:37:00 are yo…
 9 Pomona will protest or perish             Maggie… 2024-10-23 06:00:00 as pom…
10 GUEST EDITORIAL: Pomona’s culpability un… Mike W… 2024-10-11 08:23:00 mike w…
# ℹ 490 more rows

Web scraping

Scraping the web: what? why?

An increasing amount of data is available on the web
These data are provided in an unstructured format: you can always copy & paste, but it’s time-consuming and prone to errors
Web scraping is the process of extracting this information automatically and transforming it into a structured dataset
Two different scenarios:
- Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
- Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
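
The two scenarios can be contrasted with a small sketch. The HTML snippet and the JSON string below are made up for illustration, and the jsonlite package is an assumption (it is not used elsewhere in these slides).

```r
library(rvest)
library(jsonlite)  # assumed to be installed; only used in this sketch

# Scenario 1: screen scraping. The values sit inside HTML markup and have to
# be located with a selector before they can be used. (Made-up snippet.)
page <- minimal_html('
  <ul>
    <li class="title">First post</li>
    <li class="title">Second post</li>
  </ul>
')
scraped <- page |> html_elements("li.title") |> html_text2()

# Scenario 2: web API. The server would return structured JSON, which parses
# directly into a data frame. (Made-up response string, no request is sent.)
api_response <- '[{"title": "First post"}, {"title": "Second post"}]'
from_api <- fromJSON(api_response)

scraped
from_api$title
```

Both routes end with the same two titles; the difference is how much work it takes to locate them.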

Hypertext Markup Language

Most of the data on the web is available as HTML. While HTML is structured (hierarchical), it is often not in a form useful for analysis (flat / tidy).

<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br/>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>

Some HTML elements

<html>: start of the HTML page
<head>: header information (metadata about the page)
<body>: everything that is on the page
<p>: paragraphs
<b>: bold
<table>: table
<div>: a container to group content together

rvest

The rvest package makes basic processing and manipulation of HTML data straightforward
It is designed to work with pipelines built with |>
rvest.tidyverse.org

library(rvest)

rvest

Core functions:

read_html() - read HTML data from a url or character string.
html_elements() - select specified elements from the HTML document using CSS selectors.
html_element() - select a single element from the HTML document using CSS selectors.
html_table() - parse an HTML table into a data frame.
html_text() / html_text2() - extract element’s text content.
html_name() - extract an element’s name(s).
html_attrs() - extract all attributes.
html_attr() - extract attribute value(s) by name.
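
The difference between html_elements() and html_element() shows up most clearly when a selector matches several elements or none. A minimal sketch, assuming a toy document built with minimal_html(); the `.phone` class is hypothetical and deliberately absent:

```r
library(rvest)

doc <- minimal_html('
  <div class="name">John</div>
  <div class="name">Doe</div>
')

# html_elements() returns every match
all_names <- doc |> html_elements(".name") |> html_text2()

# html_element() returns only the first match
first_name <- doc |> html_element(".name") |> html_text2()

# With no match, html_element() yields a missing node (text is NA),
# while html_elements() yields an empty node set (text has length 0)
no_match_one <- doc |> html_element(".phone") |> html_text2()
no_match_all <- doc |> html_elements(".phone") |> html_text2()

all_names
first_name
no_match_one
length(no_match_all)
```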

html & rvest

html <- 
'<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br/>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>'

read_html(html)

{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n    <p align="center">Hello world!</p>\n    <br><div class="name" ...

Selecting elements

read_html(html) |> html_elements("p")

{xml_nodeset (1)}
[1] <p align="center">Hello world!</p>

read_html(html) |> html_elements("p") |> html_text()

[1] "Hello world!"

read_html(html) |> html_elements("p") |> html_name()

[1] "p"

read_html(html) |> html_elements("p") |> html_attrs()

[[1]]
   align 
"center"

read_html(html) |> html_elements("p") |> html_attr("align")

[1] "center"

More selecting elements

read_html(html) |> html_elements("div")

{xml_nodeset (7)}
[1] <div class="name" id="first">John</div>
[2] <div class="name" id="last">Doe</div>
[3] <div class="contact">\n      <div class="home">555-555-1234</div>\n       ...
[4] <div class="home">555-555-1234</div>
[5] <div class="home">555-555-2345</div>
[6] <div class="work">555-555-9999</div>
[7] <div class="fax">555-555-8888</div>

read_html(html) |> html_elements("div") |> html_text()

[1] "John"                                                                                  
[2] "Doe"                                                                                   
[3] "\n      555-555-1234\n      555-555-2345\n      555-555-9999\n      555-555-8888\n    "
[4] "555-555-1234"                                                                          
[5] "555-555-2345"                                                                          
[6] "555-555-9999"                                                                          
[7] "555-555-8888"

CSS selectors

We will use a tool called SelectorGadget to help us identify the HTML elements of interest by constructing a CSS selector which can be used to subset the HTML document.

Some examples of basic selector syntax are below:

Selector	Example	Description
.class	`.title`	Select all elements with class=“title”
#id	`#name`	Select all elements with id=“name”
element	`p`	Select all <p> elements
element element	`div p`	Select all <p> elements inside a <div> element
element>element	`div > p`	Select all <p> elements with <div> as a direct parent
[attribute]	`[class]`	Select all elements with a class attribute
[attribute=value]	`[class=title]`	Select all elements with class=“title”
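
As a rough illustration, each row of the table can be tried against a shortened version of the sample document from the earlier slides. The selectors chosen here are examples only, not the ones needed for any particular website.

```r
library(rvest)

doc <- minimal_html('
  <p align="center">Hello world!</p>
  <div class="name" id="first">John</div>
  <div class="name" id="last">Doe</div>
  <div class="contact">
    <div class="home">555-555-1234</div>
    <div class="work">555-555-9999</div>
  </div>
')

# One selector per syntax form, each returning the matched text
results <- list(
  by_class      = doc |> html_elements(".name") |> html_text2(),        # .class
  by_id         = doc |> html_elements("#last") |> html_text2(),        # #id
  by_element    = doc |> html_elements("p") |> html_text2(),            # element
  by_descendant = doc |> html_elements("div div") |> html_text2(),      # element element
  by_attribute  = doc |> html_elements("[align]") |> html_text2(),      # [attribute]
  by_attr_value = doc |> html_elements("[class=work]") |> html_text2()  # [attribute=value]
)

results
```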

Agenda 11/6/24

rvest continues
example
ethics

CSS classes and ids

class and id are used to style elements (e.g., change their color!). They are special types of attributes.
class can be applied to multiple different elements (class is identified with ., for example .name)
id is unique to each element (id is identified with #, for example, #first)

read_html(html) |> html_elements(".name")

{xml_nodeset (2)}
[1] <div class="name" id="first">John</div>
[2] <div class="name" id="last">Doe</div>

read_html(html) |> html_elements("div.name")

{xml_nodeset (2)}
[1] <div class="name" id="first">John</div>
[2] <div class="name" id="last">Doe</div>

read_html(html) |> html_elements("#first")

{xml_nodeset (1)}
[1] <div class="name" id="first">John</div>

Text with `html_text()` vs. `html_text2()`

html = read_html(
  "<p>  
    This is the first sentence in the paragraph.
    This is the second sentence that should be on the same line as the first sentence.<br>This third sentence should start on a new line.
  </p>"
)

html |> html_text()

[1] "  \n    This is the first sentence in the paragraph.\n    This is the second sentence that should be on the same line as the first sentence.This third sentence should start on a new line.\n  "

html |> html_text2()

[1] "This is the first sentence in the paragraph. This is the second sentence that should be on the same line as the first sentence.\nThis third sentence should start on a new line."

HTML tables with `html_table()`

html_table = 
'<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <table>
      <tr> <th>a</th> <th>b</th> <th>c</th> </tr>
      <tr> <td>1</td> <td>2</td> <td>3</td> </tr>
      <tr> <td>2</td> <td>3</td> <td>4</td> </tr>
      <tr> <td>3</td> <td>4</td> <td>5</td> </tr>
    </table>
  </body>
</html>'

read_html(html_table) |>
  html_elements("table") |> 
  html_table()

[[1]]
# A tibble: 3 × 3
      a     b     c
  <int> <int> <int>
1     1     2     3
2     2     3     4
3     3     4     5
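
The output above is a list because html_elements() can match several tables. A small sketch of the alternative, assuming only one table is wanted: html_element() (singular) followed by html_table() gives the tibble directly.

```r
library(rvest)

page <- minimal_html('
  <table>
    <tr> <th>a</th> <th>b</th> </tr>
    <tr> <td>1</td> <td>2</td> </tr>
    <tr> <td>2</td> <td>3</td> </tr>
  </table>
')

# html_elements() + html_table(): a list with one tibble per matched table
tables <- page |> html_elements("table") |> html_table()

# html_element() + html_table(): the first matched table as a single tibble
one_table <- page |> html_element("table") |> html_table()

length(tables)
one_table
```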

`html_attr()`

extracts data from attributes:

html <- minimal_html("
  <p><a href='https://en.wikipedia.org/wiki/Cat'>cats</a></p>
  <p><a href='https://en.wikipedia.org/wiki/Dog'>dogs</a></p>
")
html

{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<p><a href="https://en.wikipedia.org/wiki/Cat">cats</a></p>\n  <p ...

html |> 
html_attr("href")

[1] NA

html |> 
html_elements("[href]") |> 
html_attr("href")

[1] "https://en.wikipedia.org/wiki/Cat" "https://en.wikipedia.org/wiki/Dog"

html |> 
  html_elements("p")

{xml_nodeset (2)}
[1] <p><a href="https://en.wikipedia.org/wiki/Cat">cats</a></p>
[2] <p><a href="https://en.wikipedia.org/wiki/Dog">dogs</a></p>

html |> 
  html_elements("p") |> 
  html_element("a")

{xml_nodeset (2)}
[1] <a href="https://en.wikipedia.org/wiki/Cat">cats</a>
[2] <a href="https://en.wikipedia.org/wiki/Dog">dogs</a>

html |> 
  html_elements("p") |> 
  html_element("a") |> 
  html_attr("href")

[1] "https://en.wikipedia.org/wiki/Cat" "https://en.wikipedia.org/wiki/Dog"

html |> 
  html_element("a") |> 
  html_attr("href")

[1] "https://en.wikipedia.org/wiki/Cat"

html |> 
  html_elements("a") |> 
  html_attr("href")

[1] "https://en.wikipedia.org/wiki/Cat" "https://en.wikipedia.org/wiki/Dog"

html_attr() always returns a string, so if you’re extracting numbers or dates, you’ll need to do some post-processing.
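
A minimal sketch of that post-processing step. The `data-views` and `data-posted` attributes are hypothetical, invented for this example:

```r
library(rvest)

# Made-up element carrying a number and a date as attribute values
page <- minimal_html('
  <p data-views="200" data-posted="2024-11-04">A post</p>
')

# Both values come back as character strings
views_chr  <- page |> html_element("p") |> html_attr("data-views")
posted_chr <- page |> html_element("p") |> html_attr("data-posted")

# Convert them to the types needed for analysis
views  <- as.numeric(views_chr)
posted <- as.Date(posted_chr)

class(views_chr)
views
posted
```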

`div p` vs `div > p`

div p selects all <p> elements within <div>, regardless of depth.
div > p selects only direct child <p> elements of <div>.

<div>
  <p>This will be selected by both `div p` and `div > p`.</p> 
  <section>
    <p>This will be selected only by `div p`, not by `div > p`.</p>
  </section>
</div>
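
The same structure can be checked with rvest. This is a sketch on a toy document with shortened paragraph text:

```r
library(rvest)

doc <- minimal_html('
  <div>
    <p>direct child</p>
    <section>
      <p>nested deeper</p>
    </section>
  </div>
')

# Descendant selector: any <p> inside a <div>, at any depth
descendants <- doc |> html_elements("div p") |> html_text2()

# Child selector: only <p> elements whose parent is the <div>
children <- doc |> html_elements("div > p") |> html_text2()

descendants
children
```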

SelectorGadget

SelectorGadget (selectorgadget.com) is a JavaScript-based tool that helps you interactively build an appropriate CSS selector for the content you are interested in.

Recap

Use SelectorGadget to identify the elements you want to grab
Use the rvest R package to first read the entire page into R and then parse the resulting object down to the elements you’re interested in
Put the components together in a data frame (a tibble) and analyze it like you analyze any other data

Plan

Read in the entire page
Scrape opinion title and save as title
Scrape author and save as author
Scrape date and save as date
Create a new data frame called tsl_opinion with variables title, author, and date

Read in the entire page

tsl_page <- read_html("https://tsl.news/category/opinions/")
tsl_page

{html_document}
<html lang="en-US">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="archive category category-opinions category-2244 custom-back ...

typeof(tsl_page)

[1] "list"

class(tsl_page)

[1] "xml_document" "xml_node"

We need to convert this object into something more familiar, like a data frame.

Scrape title and save as `title`

tsl_page |> 
html_elements(".entry-title a")

{xml_nodeset (10)}
 [1] <a href="https://tsl.news/opinion-in-trumps-america-the-future-is-dire-f ...
 [2] <a href="https://tsl.news/opinions-catfished-by-a-scripps-sponsored-inte ...
 [3] <a href="https://tsl.news/opinion-how-maternal-healthcare-fails-black-wo ...
 [4] <a href="https://tsl.news/opinion-how-ranked-choice-voting-better-serves ...
 [5] <a href="https://tsl.news/opinion-elon-musks-million-dollar-a-day-reward ...
 [6] <a href="https://tsl.news/opinion-the-politics-behind-apolitical-acts/"  ...
 [7] <a href="https://tsl.news/opinion-in-defense-of-the-pomona-college-judic ...
 [8] <a href="https://tsl.news/yakking-isnt-a-cannon-event/" title="OPINION:  ...
 [9] <a href="https://tsl.news/opinion-the-if-he-wanted-to-he-would-mentality ...
[10] <a href="https://tsl.news/you-cant-silence-us-a-united-front-against-pom ...

title <- tsl_page |> 
html_elements(".entry-title a") |> 
html_text()
title

 [1] "OPINION: In Trump’s America, the future is dire for women"               
 [2] "OPINIONS: Catfished by a Scripps-sponsored internship"                   
 [3] "OPINION: How maternal healthcare fails Black women"                      
 [4] "OPINION: How ranked choice voting better serves us all"                  
 [5] "OPINION: Elon Musk’s million-dollar-a-day rewards are undemocratic"      
 [6] "OPINION: The politics behind apolitical acts"                            
 [7] "OPINION: In Defense of the Pomona College Judicial Council"              
 [8] "OPINION: ‘Yakking’ isn’t a canon event, party responsibly"               
 [9] "OPINION: The ‘if he wanted to, he would’ mentality is holding women back"
[10] "You can’t silence us: A united front against Pomona’s repression"

title <- title |> 
str_remove("OPINION: ")

title

 [1] "In Trump’s America, the future is dire for women"                
 [2] "OPINIONS: Catfished by a Scripps-sponsored internship"           
 [3] "How maternal healthcare fails Black women"                       
 [4] "How ranked choice voting better serves us all"                   
 [5] "Elon Musk’s million-dollar-a-day rewards are undemocratic"       
 [6] "The politics behind apolitical acts"                             
 [7] "In Defense of the Pomona College Judicial Council"               
 [8] "‘Yakking’ isn’t a canon event, party responsibly"                
 [9] "The ‘if he wanted to, he would’ mentality is holding women back" 
[10] "You can’t silence us: A united front against Pomona’s repression"
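
Entry 2 keeps its prefix because it reads "OPINIONS: " rather than "OPINION: ". One possible fix, offered as a sketch and not as what the slides ran, is a regular expression that anchors at the start of the string and treats the trailing S as optional:

```r
library(stringr)

# Three of the scraped titles, copied from the output above
title_raw <- c(
  "OPINION: How maternal healthcare fails Black women",
  "OPINIONS: Catfished by a Scripps-sponsored internship",
  "You can’t silence us: A united front against Pomona’s repression"
)

# "^" anchors the match at the start; "S?" matches zero or one "S"
title_clean <- str_remove(title_raw, "^OPINIONS?: ")

title_clean
```

Titles without the prefix, like the third one, pass through unchanged.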

Scrape author and save as `author`

author <- tsl_page |> 
html_elements("span.author") |> 
html_text()
author

 [1] "By Tania Azhang"                                                        
 [2] "By Jada Shavers"                                                        
 [3] "By Chloe Gill"                                                          
 [4] "By Alex Benach"                                                         
 [5] "By Celeste Cariker"                                                     
 [6] "By Eric Lu"                                                             
 [7] "By Henri Prevost"                                                       
 [8] "By Kabir Raina"                                                         
 [9] "By Tess McHugh"                                                         
[10] "By Outback Editors, Undercurrents Editors and The Scripps Voice Editors"

author <- author |> 
str_replace("By ", "")

author

 [1] "Tania Azhang"                                                        
 [2] "Jada Shavers"                                                        
 [3] "Chloe Gill"                                                          
 [4] "Alex Benach"                                                         
 [5] "Celeste Cariker"                                                     
 [6] "Eric Lu"                                                             
 [7] "Henri Prevost"                                                       
 [8] "Kabir Raina"                                                         
 [9] "Tess McHugh"                                                         
[10] "Outback Editors, Undercurrents Editors and The Scripps Voice Editors"

Scrape date and save as `date`

date <- tsl_page |> 
html_elements(".published") |> 
html_text()
date

 [1] "November 8, 2024 12:45 am" "November 7, 2024 10:39 pm"
 [3] "November 7, 2024 10:38 pm" "November 7, 2024 10:33 pm"
 [5] "November 1, 2024 9:27 am"  "November 1, 2024 9:21 am" 
 [7] "November 1, 2024 9:15 am"  "November 1, 2024 9:10 am" 
 [9] "November 1, 2024 9:01 am"  "October 25, 2024 4:23 am"

date <- date |> 
lubridate::mdy_hm(tz = "America/Los_Angeles")
date

 [1] "2024-11-08 00:45:00 PST" "2024-11-07 22:39:00 PST"
 [3] "2024-11-07 22:38:00 PST" "2024-11-07 22:33:00 PST"
 [5] "2024-11-01 09:27:00 PDT" "2024-11-01 09:21:00 PDT"
 [7] "2024-11-01 09:15:00 PDT" "2024-11-01 09:10:00 PDT"
 [9] "2024-11-01 09:01:00 PDT" "2024-10-25 04:23:00 PDT"

Create a new data frame

tsl_opinion <- tibble(
    title,
    author,
    date
)

tsl_opinion

# A tibble: 10 × 3
   title                                              author date               
   <chr>                                              <chr>  <dttm>             
 1 In Trump’s America, the future is dire for women   Tania… 2024-11-08 00:45:00
 2 OPINIONS: Catfished by a Scripps-sponsored intern… Jada … 2024-11-07 22:39:00
 3 How maternal healthcare fails Black women          Chloe… 2024-11-07 22:38:00
 4 How ranked choice voting better serves us all      Alex … 2024-11-07 22:33:00
 5 Elon Musk’s million-dollar-a-day rewards are unde… Celes… 2024-11-01 09:27:00
 6 The politics behind apolitical acts                Eric … 2024-11-01 09:21:00
 7 In Defense of the Pomona College Judicial Council  Henri… 2024-11-01 09:15:00
 8 ‘Yakking’ isn’t a canon event, party responsibly   Kabir… 2024-11-01 09:10:00
 9 The ‘if he wanted to, he would’ mentality is hold… Tess … 2024-11-01 09:01:00
10 You can’t silence us: A united front against Pomo… Outba… 2024-10-25 04:23:00

Opinion titles

#| eval: false
library(tidyverse)  # stringr, purrr, tibble
library(rvest)

# Scrape one page of the opinions archive and return one row per article
tsl_opinions <- function(i) {
  tsl_page <- read_html(paste0("https://tsl.news/category/opinions/page/", i))

  title <- tsl_page |>
    html_elements(".entry-title a") |>
    html_text() |>
    str_remove("OPINION: ")

  authors <- tsl_page |>
    html_elements("span.author") |>
    html_text() |>
    str_replace("By ", "")

  date <- tsl_page |>
    html_elements(".published") |>
    html_text() |>
    lubridate::mdy_hm(tz = "America/Los_Angeles")

  first_p <- tsl_page |>
    html_elements(".entry-content p") |>
    html_text() |>
    tolower()

  tibble(title, authors, date, first_p)
}

# Scrape archive pages 1 through 50 and stack the results into one tibble
tsl_opinion_titles <- 1:50 |>
  purrr::map(tsl_opinions) |>
  list_rbind()

Web scraping considerations

Check if you are allowed!

library(robotstxt)
paths_allowed("https://tsl.news/category/opinions/")

[1] TRUE

paths_allowed("http://www.facebook.com")

[1] FALSE

Ethics: “Can you?” vs “Should you?”

“Can you?” vs “Should you?”

Challenges: Unreliable formatting
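
One common form of unreliable formatting is a field that is missing for some items, so that separately scraped vectors no longer line up. A hedged sketch on a made-up listing (the `article` class and the writer names are hypothetical): select each item first, then pull one element per item, so missing fields become NA.

```r
library(rvest)
library(tibble)

# Made-up listing: the second item has no author element
page <- minimal_html('
  <div class="article"><h2 class="entry-title">First piece</h2><span class="author">By A. Writer</span></div>
  <div class="article"><h2 class="entry-title">Second piece</h2></div>
  <div class="article"><h2 class="entry-title">Third piece</h2><span class="author">By B. Writer</span></div>
')

# Scraping each field from the whole page gives vectors of different lengths
n_titles  <- length(page |> html_elements(".entry-title"))
n_authors <- length(page |> html_elements("span.author"))

# Selecting the items first and then one element within each keeps rows aligned
articles <- page |> html_elements("div.article")
listing <- tibble(
  title  = articles |> html_element(".entry-title") |> html_text2(),
  author = articles |> html_element("span.author") |> html_text2()
)

c(n_titles, n_authors)
listing
```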

Challenges: Data broken into many pages

More ethics: graphics

Consider the following image. What do you think is wrong? (Hint: examine the y-axis carefully)

Reproduction of a data graphic reporting the number of gun deaths in Florida over time. The original image was published by Reuters. (Baumer, Kaplan, and Horton 2021)

More ethics: graphics

May 10, 2020, Georgia Department of Health, COVID-19 cases for 5 counties across time. https://dph.georgia.gov/covid-19-daily-status-report

May 17, 2020, Georgia Department of Health, COVID-19 cases for 5 counties across time. https://dph.georgia.gov/covid-19-daily-status-report

More ethics: graphics

A few weeks later, the Georgia Department of Health published the following two plots. Despite cases skyrocketing, the two visuals look nearly identical.

July 2, 2020, Georgia Department of Health, COVID-19 cases per 100K https://dph.georgia.gov/covid-19-daily-status-report

July 17, 2020, Georgia Department of Health, COVID-19 cases per 100K https://dph.georgia.gov/covid-19-daily-status-report

More ethics: algorithms

disparate treatment: the differential treatment is intentional
disparate impact: the differential treatment is unintentional or implicit (some examples include advancing mortgage credit, employment selection, predictive policing)

More ethics: COMPAS

Dylan Fugett had three subsequent arrests for drug possession. Bernard Parker had no subsequent offenses.

More ethics: COMPAS

DYLAN FUGETT	BERNARD PARKER
Prior Offense	Prior Offense
1 attempted burglary	1 resisting arrest without violence
LOW RISK - 3	HIGH RISK - 10
Subsequent Offenses	Subsequent Offenses
3 drug possessions	None

More ethics: COMPAS

False positive and false negative rates broken down by race.
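
To make the two rates concrete, here is a small sketch of how they could be computed per group. The data frame is invented purely for illustration and is NOT the COMPAS data; the group labels and column names are hypothetical.

```r
# Toy data, made up for this example only (not the COMPAS data)
toy <- data.frame(
  group      = c("A", "A", "A", "A", "B", "B", "B", "B"),
  high_risk  = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  reoffended = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

error_rates <- function(d) {
  # False positive rate: share labeled high risk among those who did not reoffend
  fpr <- mean(d$high_risk[!d$reoffended])
  # False negative rate: share labeled low risk among those who did reoffend
  fnr <- mean(!d$high_risk[d$reoffended])
  c(fpr = fpr, fnr = fnr)
}

# One column per group, one row per error rate
rates <- sapply(split(toy, toy$group), error_rates)
rates
```

Comparing the columns of `rates` is the kind of breakdown the slide refers to: the same tool can make different kinds of errors at different rates for different groups.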

References

Baumer, Ben, Daniel Kaplan, and Nicholas Horton. 2021. Modern Data Science with R. CRC Press. https://mdsr-book.github.io/mdsr2e/.
