Web Scraping

November 3, 2025

Jo Hardin

Agenda 11/03/25

  1. HTML / web scraping
  2. CSS selectors
  3. rvest

Important tool

Our approach to web scraping relies on the Chrome browser and an extension called SelectorGadget. Download them here:

Data Acquisition

Based on https://www.effectivedatastorytelling.com/post/a-deeper-dive-into-lego-bricks-and-data-stories, original source: https://www.linkedin.com/learning/instructors/bill-shander

Reading The Student Life

How often do you read The Student Life?
a. Every day
b. 3-5 times a week
c. Once a week
d. Rarely

Reading The Student Life

What do you think is the most common word in the titles of The Student Life opinion pieces?

Analyzing The Student Life

Reading The Student Life

How do you think the sentiments in opinion pieces in The Student Life compare across authors?
Roughly the same?
Wildly different?
Somewhere in between?

Analyzing The Student Life

All of this analysis is done in R!

(mostly) with tools you already know!

Common words in The Student Life titles

Code for the earlier plot:

library(tidyverse)  # dplyr, ggplot2, forcats, ...
library(tidytext)

data(stop_words)  # from tidytext
tsl_opinion_titles |>
  tidytext::unnest_tokens(word, title) |>
  anti_join(stop_words) |>
  count(word, sort = TRUE) |>
  slice_head(n = 20) |>
  mutate(word = fct_reorder(word, n)) |>
  ggplot(aes(y = word, x = n, fill = log(n))) +
  geom_col(show.legend = FALSE) +
  theme_minimal(base_size = 16) +
  labs(
    x = "Number of mentions",
    y = "Word",
    title = "The Student Life - Opinion pieces",
    subtitle = "Common words in the 500 most recent opinion pieces",
    caption = "Source: Data scraped from The Student Life on November 2, 2025"
  ) +
  theme(
    plot.title.position = "plot",
    plot.caption = element_text(color = "gray30")
  )

Avg sentiment scores of first paragraph

Code for the earlier plot:

afinn_sentiments <- get_sentiments("afinn")  # need tidytext and textdata
tsl_opinion_titles |>
  tidytext::unnest_tokens(word, first_p) |>
  anti_join(stop_words) |>
  left_join(afinn_sentiments) |> 
  group_by(authors, title) |>
  summarize(total_sentiment = sum(value, na.rm = TRUE), .groups = "drop") |>
  group_by(authors) |>
  summarize(
    n_articles = n(),
    avg_sentiment = mean(total_sentiment, na.rm = TRUE)
  ) |>
  filter(n_articles > 1 & !is.na(authors)) |>
  arrange(desc(avg_sentiment)) |>
  slice(c(1:10, 69:78)) |>  # top 10 and bottom 10 after sorting
  mutate(
    authors = fct_reorder(authors, avg_sentiment),
    neg_pos = if_else(avg_sentiment < 0, "neg", "pos"),
    label_position = if_else(neg_pos == "neg", 0.25, -0.25)
  ) |>
  ggplot(aes(y = authors, x = avg_sentiment)) +
  geom_col(aes(fill = neg_pos), show.legend = FALSE) +
  geom_text(
    aes(x = label_position, label = authors, color = neg_pos),
    hjust = c(rep(1,10), rep(0, 10)),
    show.legend = FALSE,
    fontface = "bold"
  ) +
  geom_text(
    aes(label = round(avg_sentiment, 1)),
    hjust = c(rep(1.25,10), rep(-0.25, 10)),
    color = "white",
    fontface = "bold"
  ) +
  scale_fill_manual(values = c("neg" = "#4d4009", "pos" = "#FF4B91")) +
  scale_color_manual(values = c("neg" = "#4d4009", "pos" = "#FF4B91")) +
  scale_x_continuous(breaks = -5:5, minor_breaks = NULL) +
  scale_y_discrete(breaks = NULL) +
  coord_cartesian(xlim = c(-5, 5)) +
  labs(
    x = "negative  ←     Average sentiment score (AFINN)     →  positive",
    y = NULL,
    title = "The Student Life - Opinion pieces\nAverage sentiment scores of first paragraph by author",
    subtitle = "Top 10 average positive and negative scores",
    caption = "Source: Data scraped from The Student Life on November 2, 2025"
  ) +
  theme_void(base_size = 16) +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, margin = unit(c(0.5, 0, 1, 0), "lines")),
    axis.text.y = element_blank(),
    plot.caption = element_text(color = "gray30")
  )

Where is the data coming from?

tsl_opinion_titles
# A tibble: 500 × 4
   title                           authors date                first_p
   <chr>                           <chr>   <dttm>              <chr>  
 1 Stop buying your books          Sarah … 2025-04-04 08:03:00 from b…
 2 The case for fleeing the count… Alex B… 2025-04-04 07:27:00 when t…
 3 Tolerate thy neighbor           Parker… 2025-04-04 07:22:00 it’s s…
 4 Confronting furry hate          Xavier… 2025-04-04 07:16:00 furrie…
 5 Shame on the governor: Gavin N… Akshay… 2025-03-28 06:56:00 gavin …
 6 Accessibility at the 5Cs requi… Zena A… 2025-03-28 06:42:00 althou…
 7 Your spring break destination … Nicole… 2025-03-15 04:44:00 spring…
 8 Pomona College’s Merritt Field… Katie … 2025-03-15 03:03:00 with l…
 9 Seminars should be tech-free s… Elias … 2025-03-14 09:15:00 we hav…
10 The bitter truth to the bitter… Daniel… 2025-03-14 09:13:00 have y…
# ℹ 490 more rows

HTML / Web scraping

Scraping the web: what? why?

  • Increasing amount of data is available on the web

  • These data are provided in an unstructured format: you can always copy & paste, but it’s time-consuming and prone to errors

  • Web scraping is the process of extracting information automatically and transforming it into a structured dataset

  • Two different scenarios:

    • Screen scraping: extract data from source code of website, with html parser (easy) or regular expression matching (less easy).

    • Web APIs (application programming interface): website offers a set of structured http requests that return JSON or XML files.

Hypertext Markup Language

Much of the data on the web is available as HTML - while it is structured (hierarchical), often it is not immediately available in a form useful for analysis (flat / tidy).

<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>

Some HTML elements

  • <html>: start of the HTML page
  • <head>: header information (metadata about the page)
  • <body>: everything that is on the page
  • <p>: paragraph
  • <b>: bold
  • <table>: table
  • <div>: a container used to group other HTML elements together (“division”)
  • <a>: the “anchor” element that creates a hyperlink

HTML attribute

An attribute in HTML is a name–value pair that gives extra information about an element. It sits inside the opening tag and modifies the element’s behavior, appearance, identity, or data.

Think of the attribute as the argument to the element (which would be the function in this analogy).

rvest

  • The rvest package makes basic processing and manipulation of HTML data straightforward
  • It is designed to work with pipelines built with |>
  • rvest.tidyverse.org
library(rvest)

rvest hex logo

rvest

Core functions:

  • read_html() - read HTML data from a url or character string.

  • html_elements() - select specified elements from the HTML document using CSS selectors.

  • html_element() - select a single element from the HTML document using CSS selectors.

  • html_table() - parse an HTML table into a data frame.

  • html_text() / html_text2() - extract element’s text content.

  • html_name() - extract an element’s name(s).

  • html_attrs() - extract all of an element’s attributes.

  • html_attr() - extract attribute value(s) by name.

html & rvest

html <- 
'<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>'
read_html(html)
{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n    <p align="center">Hello world!</p>\n    <br><div class="name" ...

Selecting elements

read_html(html) |> html_elements("p")
{xml_nodeset (1)}
[1] <p align="center">Hello world!</p>
read_html(html) |> html_elements("p") |> html_text()
[1] "Hello world!"
read_html(html) |> html_elements("p") |> html_name()
[1] "p"
read_html(html) |> html_elements("p") |> html_attrs()
[[1]]
   align 
"center" 
read_html(html) |> html_elements("p") |> html_attr("align")
[1] "center"

More selecting elements

read_html(html) |> html_elements("div")
{xml_nodeset (7)}
[1] <div class="name" id="first">John</div>
[2] <div class="name" id="last">Doe</div>
[3] <div class="contact">\n      <div class="home">555-555-1234</div>\n       ...
[4] <div class="home">555-555-1234</div>
[5] <div class="home">555-555-2345</div>
[6] <div class="work">555-555-9999</div>
[7] <div class="fax">555-555-8888</div>
read_html(html) |> html_elements("div") |> html_text()
[1] "John"                                                                                  
[2] "Doe"                                                                                   
[3] "\n      555-555-1234\n      555-555-2345\n      555-555-9999\n      555-555-8888\n    "
[4] "555-555-1234"                                                                          
[5] "555-555-2345"                                                                          
[6] "555-555-9999"                                                                          
[7] "555-555-8888"                                                                          

CSS selectors

  • We will use a tool called SelectorGadget to help us identify the HTML elements of interest by constructing a CSS selector which can be used to subset the HTML document.
  • Some examples of basic selector syntax are shown below:

Selector           Example        Description
.class             .title         Select all elements with class=“title”
#id                #name          Select the element with id=“name”
element            p              Select all <p> elements
element element    div p          Select all <p> elements inside a <div> element
element>element    div > p        Select all <p> elements whose direct parent is a <div>
[attribute]        [class]        Select all elements with a class attribute
[attribute=value]  [class=title]  Select all elements with class=“title”
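These selectors can be tried directly with rvest. A small sketch (the HTML fragment, class names, and id below are invented for illustration):

```r
library(rvest)

page <- minimal_html('
  <div class="contact">
    <div class="home" id="primary">555-555-1234</div>
    <p>Office hours: MWF</p>
  </div>
  <p>Outside the div</p>
')

page |> html_elements(".home") |> html_text()     # .class selector
page |> html_elements("#primary") |> html_text()  # #id selector
page |> html_elements("div p") |> html_text()     # descendant selector
```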

CSS classes and ids

  • class and id are used to style elements (e.g., change their color!). They are special types of attributes.

  • class can be applied to multiple different elements (class is identified with ., for example .name)

  • id is unique to each element (id is identified with #, for example, #first)

read_html(html) |> html_elements(".name")
{xml_nodeset (2)}
[1] <div class="name" id="first">John</div>
[2] <div class="name" id="last">Doe</div>
read_html(html) |> html_elements("div.name")
{xml_nodeset (2)}
[1] <div class="name" id="first">John</div>
[2] <div class="name" id="last">Doe</div>
read_html(html) |> html_elements("#first")
{xml_nodeset (1)}
[1] <div class="name" id="first">John</div>

Text with html_text() vs. html_text2()

  • The two functions handle whitespace differently:
html <- read_html("<p>  Hello,\n   world! </p>")

html |>  html_element("p") |>  html_text()
[1] "  Hello,\n   world! "
html |>  html_element("p") |>  html_text2()
[1] "Hello, world!"

Text with html_text() vs. html_text2()

html <- read_html(
  "<p>  
    First sentence in the paragraph.
    Second sentence that follows a literal line break. <br>Third sentence will start on a new line in the rendered html, but doesn't have a line break as part of the code.
  </p>"
)
html |> html_text()
[1] "  \n    First sentence in the paragraph.\n    Second sentence that follows a literal line break. Third sentence will start on a new line in the rendered html, but doesn't have a line break as part of the code.\n  "
html |> html_text2()
[1] "First sentence in the paragraph. Second sentence that follows a literal line break.\nThird sentence will start on a new line in the rendered html, but doesn't have a line break as part of the code."

html_text2() collapses any white space (including \n) into a single space, and turns <br> into \n.

HTML tables with html_table()

html_tbl <- 
'<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <table>
      <tr> <th>a</th> <th>b</th> <th>c</th> </tr>
      <tr> <td>1</td> <td>2</td> <td>3</td> </tr>
      <tr> <td>2</td> <td>3</td> <td>4</td> </tr>
      <tr> <td>3</td> <td>4</td> <td>5</td> </tr>
    </table>
  </body>
</html>'
read_html(html_tbl) |>
  html_elements("table") |> 
  html_table()
[[1]]
# A tibble: 3 × 3
      a     b     c
  <int> <int> <int>
1     1     2     3
2     2     3     4
3     3     4     5

html_attr()

extracts data from attributes:

(n.b., the <a> tag refers to “anchor” and is used to create hyperlinks)

html <- minimal_html("
  <p><a href='https://en.wikipedia.org/wiki/Cat'>cats</a></p>
  <p><a href='https://en.wikipedia.org/wiki/Dog'>dogs</a></p>
")
html
{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n<p><a href="https://en.wikipedia.org/wiki/Cat">cats</a></p>\n  <p ...
html |> 
  html_attr("href")
[1] NA
html |> 
  html_elements("[href]") |> 
  html_attr("href")
[1] "https://en.wikipedia.org/wiki/Cat" "https://en.wikipedia.org/wiki/Dog"
html |> 
  html_elements("p") 
{xml_nodeset (2)}
[1] <p><a href="https://en.wikipedia.org/wiki/Cat">cats</a></p>
[2] <p><a href="https://en.wikipedia.org/wiki/Dog">dogs</a></p>
html |> 
  html_elements("p") |> 
  html_element("a") 
{xml_nodeset (2)}
[1] <a href="https://en.wikipedia.org/wiki/Cat">cats</a>
[2] <a href="https://en.wikipedia.org/wiki/Dog">dogs</a>
html |> 
  html_elements("p") |> 
  html_element("a") |> 
  html_attr("href")
[1] "https://en.wikipedia.org/wiki/Cat" "https://en.wikipedia.org/wiki/Dog"
html |> 
  html_elements("p a") |> 
  html_attr("href")
[1] "https://en.wikipedia.org/wiki/Cat" "https://en.wikipedia.org/wiki/Dog"
html |> 
  html_element("a") |> 
  html_attr("href")
[1] "https://en.wikipedia.org/wiki/Cat"
html |> 
  html_elements("a") |> 
  html_attr("href")
[1] "https://en.wikipedia.org/wiki/Cat" "https://en.wikipedia.org/wiki/Dog"
  • html_attr() always returns a string, so if you’re extracting numbers or dates, you’ll need to do some post-processing.
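For example, a numeric-looking attribute comes back as character and needs conversion (the <img> tag here is made up for illustration):

```r
library(rvest)

img <- minimal_html('<img src="cat.png" width="400">')

img |> html_element("img") |> html_attr("width")                  # "400" -- a string
img |> html_element("img") |> html_attr("width") |> as.numeric()  # 400 -- a number
```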

div p vs div > p

  • div p selects all <p> elements within <div>, regardless of depth.
  • div > p selects only direct child <p> elements of <div>.
<div>
  <p>This will be selected by both `div p` and `div > p`.</p> 
  <section>
   <p>This will be selected only by `div p`, not by `div > p`, because it is nested inside the <section> tag.</p>
  </section>
</div>
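A quick check of the two selectors with rvest, using the fragment above as a string:

```r
library(rvest)

page <- minimal_html('
  <div>
    <p>This will be selected by both `div p` and `div > p`.</p>
    <section>
      <p>This will be selected only by `div p`.</p>
    </section>
  </div>
')

page |> html_elements("div p") |> length()    # 2 -- matches at any depth
page |> html_elements("div > p") |> length()  # 1 -- direct children only
```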

SelectorGadget

SelectorGadget (selectorgadget.com) is a JavaScript-based tool that helps you interactively build an appropriate CSS selector for the content you are interested in.

Recap

  • Use SelectorGadget to identify the elements you want to grab
  • Use the rvest R package to first read the entire page into R and then parse the resulting object down to the elements you’re interested in
  • Put the components together in a data frame (a tibble) and analyze it like you analyze any other data

Plan

  1. Read in the entire page
  2. Scrape opinion title and save as title
  3. Scrape author and save as author
  4. Scrape date and save as date
  5. Create a new data frame called tsl_opinion with variables title, author, and date

Read in the entire page

tsl_page <- read_html("https://tsl.news/category/opinions/")
tsl_page
{html_document}
<html lang="en-US">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="archive category category-opinions category-2244 custom-back ...
typeof(tsl_page)
[1] "list"
class(tsl_page)
[1] "xml_document" "xml_node"    
  • We need to convert it into something more familiar, like a data frame

Scrape title and save as title

tsl_page |> 
  html_elements(".entry-title a") 
{xml_nodeset (10)}
 [1] <a href="https://tsl.news/opinion-life-is-a-fedora-why-i-wear-fedoras-an ...
 [2] <a href="https://tsl.news/opinion-its-time-to-start-enjoying-your-coffee ...
 [3] <a href="https://tsl.news/opinion-do-you-love-sneaking-into-malott-i-hav ...
 [4] <a href="https://tsl.news/opinion-we-need-a-serious-caring-approach-to-s ...
 [5] <a href="https://tsl.news/opinion-the-anonymity-epidemic-rages-on/" titl ...
 [6] <a href="https://tsl.news/opinion-before-rebuilding-gaza-the-world-must- ...
 [7] <a href="https://tsl.news/opinion-ai-can-take-the-credit-for-this-new-cu ...
 [8] <a href="https://tsl.news/opinion-what-democrats-can-learn-from-mexicos- ...
 [9] <a href="https://tsl.news/opinion-aesthetic-feminism-is-super-anti-femin ...
[10] <a href="https://tsl.news/opinion-trumps-authoritarianism-doesnt-listen- ...
title <- tsl_page |> 
  html_elements(".entry-title a") |> 
  html_text()

title
 [1] "OPINION: Life is a fedora: why I wear fedoras, and why you should too"                
 [2] "OPINION: It’s time to start enjoying your coffee without a side of homework"          
 [3] "OPINION: Do you love sneaking into Malott? I have a better alternative"               
 [4] "OPINION: We need a serious, caring approach to sexual health"                         
 [5] "OPINION: The anonymity epidemic rages on"                                             
 [6] "OPINION: Before Rebuilding Gaza, the World Must Confront Who Destroyed It"            
 [7] "OPINION: AI can take the credit for this new cultural normalization of cheating"      
 [8] "OPINION: What Democrats can learn from Mexico’s governing party"                      
 [9] "OPINION: Aesthetic feminism is super anti-feminist"                                   
[10] "OPINION: Trump’s authoritarianism doesn’t listen to your No Kings Day cardboard signs"
title <- title |> 
  str_remove("OPINION: ")

title
 [1] "Life is a fedora: why I wear fedoras, and why you should too"                
 [2] "It’s time to start enjoying your coffee without a side of homework"          
 [3] "Do you love sneaking into Malott? I have a better alternative"               
 [4] "We need a serious, caring approach to sexual health"                         
 [5] "The anonymity epidemic rages on"                                             
 [6] "Before Rebuilding Gaza, the World Must Confront Who Destroyed It"            
 [7] "AI can take the credit for this new cultural normalization of cheating"      
 [8] "What Democrats can learn from Mexico’s governing party"                      
 [9] "Aesthetic feminism is super anti-feminist"                                   
[10] "Trump’s authoritarianism doesn’t listen to your No Kings Day cardboard signs"

Scrape author and save as author

author <- tsl_page |> 
  html_elements("span.author") |> 
  html_text()
author
 [1] "By Nicholas Steinman"         "By Ansley Kang"              
 [3] "By Nicole Teh"                "By Ezra Levinson"            
 [5] "By Madeleine Farr"            "By Leili Kamali"             
 [7] "By Ansley Kang"               "By Rafael Hernandez Guerrero"
 [9] "By Ansley Kang"               "By Jason Murillo"            
author <- author |> 
  str_replace("By ", "")

author
 [1] "Nicholas Steinman"         "Ansley Kang"              
 [3] "Nicole Teh"                "Ezra Levinson"            
 [5] "Madeleine Farr"            "Leili Kamali"             
 [7] "Ansley Kang"               "Rafael Hernandez Guerrero"
 [9] "Ansley Kang"               "Jason Murillo"            

Scrape date and save as date

date <- tsl_page |> 
  html_elements(".published") |> 
  html_text()
date
 [1] "November 13, 2025 11:56 pm" "November 13, 2025 11:53 pm"
 [3] "November 13, 2025 11:51 pm" "November 13, 2025 11:04 pm"
 [5] "November 13, 2025 10:53 pm" "November 7, 2025 12:44 am" 
 [7] "November 7, 2025 12:41 am"  "November 7, 2025 12:05 am" 
 [9] "October 30, 2025 7:48 pm"   "October 30, 2025 7:22 pm"  
date <- date |> 
  lubridate::mdy_hm(tz = "America/Los_Angeles")
date
 [1] "2025-11-13 23:56:00 PST" "2025-11-13 23:53:00 PST"
 [3] "2025-11-13 23:51:00 PST" "2025-11-13 23:04:00 PST"
 [5] "2025-11-13 22:53:00 PST" "2025-11-07 00:44:00 PST"
 [7] "2025-11-07 00:41:00 PST" "2025-11-07 00:05:00 PST"
 [9] "2025-10-30 19:48:00 PDT" "2025-10-30 19:22:00 PDT"

Create a new data frame

tsl_opinion <- tibble(
    title,
    author,
    date
)

tsl_opinion
# A tibble: 10 × 3
   title                                    author date               
   <chr>                                    <chr>  <dttm>             
 1 Life is a fedora: why I wear fedoras, a… Nicho… 2025-11-13 23:56:00
 2 It’s time to start enjoying your coffee… Ansle… 2025-11-13 23:53:00
 3 Do you love sneaking into Malott? I hav… Nicol… 2025-11-13 23:51:00
 4 We need a serious, caring approach to s… Ezra … 2025-11-13 23:04:00
 5 The anonymity epidemic rages on          Madel… 2025-11-13 22:53:00
 6 Before Rebuilding Gaza, the World Must … Leili… 2025-11-07 00:44:00
 7 AI can take the credit for this new cul… Ansle… 2025-11-07 00:41:00
 8 What Democrats can learn from Mexico’s … Rafae… 2025-11-07 00:05:00
 9 Aesthetic feminism is super anti-femini… Ansle… 2025-10-30 19:48:00
10 Trump’s authoritarianism doesn’t listen… Jason… 2025-10-30 19:22:00

Opinion titles

tsl_opinions <- function(i){
  tsl_page <- rvest::read_html(paste0("https://tsl.news/category/opinions/page/", i))
  
  title <- tsl_page |> 
    html_elements(".entry-title a") |> 
    html_text() |> 
    str_remove("OPINION: ")
  
  authors <- tsl_page |> 
    html_elements("span.author") |> 
    html_text() |> 
    str_remove("By ")
  
  date <- tsl_page |> 
    html_elements(".published") |> 
    html_text() |> 
    lubridate::mdy_hm(tz = "America/Los_Angeles")
  
  first_p <- tsl_page |> 
    html_elements(".entry-content p:nth-child(1)") |> 
    html_text() |> 
    tolower()
  
  tibble(
    title,
    authors,
    date,
    first_p
  )  
}

tsl_opinion_titles <- 1:50 |> 
  purrr::map(tsl_opinions) |> 
  purrr::list_rbind()
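When mapping a scraper over many pages, it is courteous to pause between requests. One way to do that is purrr::slowly(); this sketch uses a hypothetical stand-in for the page scraper:

```r
library(purrr)

# Hypothetical scraper standing in for a real page-scraping function
scrape_page <- function(i) data.frame(page = i)

# slowly() wraps a function so calls are rate-limited; rate_delay(1) pauses 1 second
scrape_politely <- slowly(scrape_page, rate = rate_delay(1))

results <- map(1:3, scrape_politely) |> list_rbind()
```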

Web scraping considerations

Check if you are allowed!

library(robotstxt)
paths_allowed("https://tsl.news/category/opinions/")
[1] TRUE
paths_allowed("http://www.facebook.com")
[1] FALSE

Ethics: “Can you?” vs “Should you?”

“Can you?” vs “Should you?”

Challenges: Unreliable formatting

Challenges: Data broken into many pages