Worksheet 16 - rvest

Wednesday, November 6, 2024

Author

DS 002R - Jo Hardin

Name: __________________________________

Names of people you worked with: __________________________________

How are you holding up this morning?

Task:

What function is used to read in the HTML code (usually it will be from a webpage)?
In the R code, which HTML element is being targeted for text extraction?
To extract only the list items within <div> elements, how would you modify the R code?
To extract all of the list items (from anywhere), how would you modify the R code?

html_sample <- '<html>
<head>
    <title>Sample Webpage</title>
</head>
<body>
    <h1>Welcome to the Sample Page</h1>
    <p>This is a paragraph introducing the content.</p>
    <section>
        <h2>Main Features</h2>
        <ul>
            <li>Feature 1: High quality</li>
            <li>Feature 2: Affordable price</li>
            <li>Feature 3: Great customer service</li>
        </ul>
    </section>
    <div>
        <h2>Additional Information</h2>
        <ul>
            <li>Extra 1: Fast shipping</li>
            <li>Extra 2: Easy returns</li>
        </ul>
    </div>
    <footer>
        <p>Contact us at support@example.com</p>
    </footer>
</body>
</html>'

library(rvest)

page <- read_html(html_sample)

section_items <- page |> html_elements("section li") |> html_text()

section_items

[1] "Feature 1: High quality"           "Feature 2: Affordable price"      
[3] "Feature 3: Great customer service"

When rendered in a browser, the HTML code looks like this:

Sample Webpage

Welcome to the Sample Page

This is a paragraph introducing the content.

Main Features

Feature 1: High quality
Feature 2: Affordable price
Feature 3: Great customer service

Additional Information

Extra 1: Fast shipping
Extra 2: Easy returns

Solution:

read_html() reads in the HTML content. Usually it is given the URL of a webpage, but in this case read_html(html_sample) reads the HTML code I wrote above, which is stored as a character string.
The code targets the <li> elements (list items) that sit inside a <section> element (a section of the page). The CSS selector "section li" passed to html_elements() selects every li node that is a descendant of a section node in the webpage's HTML; html_text() then extracts the text from each selected node.
To extract all text from list items within the <div> elements, you would change the code to:

div_items <- page |> html_elements("div li") |> html_text()

div_items

[1] "Extra 1: Fast shipping" "Extra 2: Easy returns"
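
The space in selectors like "section li" and "div li" means "any descendant", not only direct children. Here is a minimal sketch of the difference, using a small made-up HTML snippet (the snippet and the object names nested, descendants, and direct are hypothetical, not part of the worksheet). The ">" combinator restricts the match to direct children.

```r
library(rvest)

# Hypothetical snippet: one list directly inside <section>,
# and a second list nested one level deeper inside a <div>.
nested <- read_html('<section>
  <ul><li>Direct list item</li></ul>
  <div><ul><li>Nested list item</li></ul></div>
</section>')

# "section li": every <li> anywhere inside a <section>
descendants <- nested |> html_elements("section li") |> html_text()

# "section > ul > li": only <li> whose parent <ul> is a direct child of <section>
direct <- nested |> html_elements("section > ul > li") |> html_text()

descendants
direct
```

With this snippet, descendants should hold both list items, while direct should hold only "Direct list item".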

If you want all of the list items (from anywhere), you would change the code to:

items <- page |> html_elements("li") |> html_text()

items

[1] "Feature 1: High quality"           "Feature 2: Affordable price"      
[3] "Feature 3: Great customer service" "Extra 1: Fast shipping"           
[5] "Extra 2: Easy returns"
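
On the sample page, every list item happens to live in either the <section> or the <div>, so "li" grabs everything. If a page had list items elsewhere that you did not want, a comma-separated selector list is one way to name only the containers you care about. The sketch below assumes a made-up page with an extra list in the <footer> (the HTML and the object names small_page, chosen_items, and all_items are hypothetical).

```r
library(rvest)

# Hypothetical page: list items in a <section>, a <div>, and a <footer>
small_page <- read_html('<section><ul><li>Feature 1</li></ul></section>
<div><ul><li>Extra 1</li></ul></div>
<footer><ul><li>Footer link</li></ul></footer>')

# A comma means "or": <li> inside a <section> OR inside a <div>
chosen_items <- small_page |>
  html_elements("section li, div li") |>
  html_text()

# A bare "li" matches list items anywhere, including the footer
all_items <- small_page |> html_elements("li") |> html_text()

chosen_items
all_items
```

Here chosen_items should leave out the footer item, while all_items should include all three.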