Worksheet 16 - rvest

Wednesday, November 6, 2024

Author

DS 002R - Jo Hardin

Name: __________________________________

Names of people you worked with: __________________________________

How are you holding up this morning?

Task:

  1. What function is used to read in the HTML code (usually it will be from a webpage)?
  2. In the R code, which HTML element is being targeted for text extraction?
  3. To extract only the list items within <div> elements, how would you modify the R code?
  4. To extract all of the list items (from anywhere), how would you modify the R code?
html_sample <- '<html>
<head>
    <title>Sample Webpage</title>
</head>
<body>
    <h1>Welcome to the Sample Page</h1>
    <p>This is a paragraph introducing the content.</p>
    <section>
        <h2>Main Features</h2>
        <ul>
            <li>Feature 1: High quality</li>
            <li>Feature 2: Affordable price</li>
            <li>Feature 3: Great customer service</li>
        </ul>
    </section>
    <div>
        <h2>Additional Information</h2>
        <ul>
            <li>Extra 1: Fast shipping</li>
            <li>Extra 2: Easy returns</li>
        </ul>
    </div>
    <footer>
        <p>Contact us at support@example.com</p>
    </footer>
</body>
</html>'
library(rvest)

page <- read_html(html_sample)

section_items <- page |> html_elements("section li") |> html_text()

section_items
[1] "Feature 1: High quality"           "Feature 2: Affordable price"      
[3] "Feature 3: Great customer service"

The rendered HTML code looks like this:

Sample Webpage

Welcome to the Sample Page

This is a paragraph introducing the content.

Main Features

  • Feature 1: High quality
  • Feature 2: Affordable price
  • Feature 3: Great customer service

Additional Information

  • Extra 1: Fast shipping
  • Extra 2: Easy returns

Contact us at support@example.com

Solution:

  1. read_html(html_sample) reads in the HTML content, usually from a URL (or in this case, from the HTML code I wrote above).
  2. The code targets the <li> elements (list items in HTML) that sit inside a <section> element (a section in HTML). In the selector "section li", the space means "nested anywhere inside", so html_elements("section li") selects every li node that is a descendant of a section node in the webpage’s HTML. html_text() then extracts the text from each selected node.
  3. To extract all text from list items within the <div> elements, you would change the code to:
div_items <- page |> html_elements("div li") |> html_text()

div_items
[1] "Extra 1: Fast shipping" "Extra 2: Easy returns" 
  4. If you want all of the list items (from anywhere), you would change the code to:
items <- page |> html_elements("li") |> html_text()

items
[1] "Feature 1: High quality"           "Feature 2: Affordable price"      
[3] "Feature 3: Great customer service" "Extra 1: Fast shipping"           
[5] "Extra 2: Easy returns"
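
To see how the three selectors relate to one another, here is a minimal sketch on a trimmed-down copy of the sample page. The names html_small, page_small, section_li, div_li, all_li, grouped_li, and direct_li are made up for this illustration. The comma-grouped selector and the ">" child selector are standard CSS selector syntax that the worksheet itself does not use, so treat them as optional extras rather than part of the solution.

```r
library(rvest)

# A shortened stand-in for the worksheet's sample page (same nesting, fewer items).
html_small <- '<html><body>
  <section><ul><li>Feature 1: High quality</li><li>Feature 2: Affordable price</li></ul></section>
  <div><ul><li>Extra 1: Fast shipping</li></ul></div>
</body></html>'

page_small <- read_html(html_small)

# Descendant selectors: <li> elements nested anywhere inside the named parent.
section_li <- page_small |> html_elements("section li") |> html_text()
div_li     <- page_small |> html_elements("div li") |> html_text()

# Bare tag selector: every <li> in the document.
all_li <- page_small |> html_elements("li") |> html_text()

# A comma groups selectors, so this matches <li> inside either parent.
grouped_li <- page_small |> html_elements("section li, div li") |> html_text()

# "section > li" requires <li> to be a direct child of <section>.
# Every <li> here sits inside a <ul>, so nothing matches.
direct_li <- page_small |> html_elements("section > li") |> html_text()

length(section_li)
length(div_li)
length(all_li)
length(direct_li)
```

Because every list item on this small page lives in either the section or the div, the "li" selector returns the same items as the two narrower selectors put together. The empty result for "section > li" is a reminder that the space in "section li" means "anywhere inside", not "directly inside".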