html_sample <- '<html>
<head>
<title>Sample Webpage</title>
</head>
<body>
<h1>Welcome to the Sample Page</h1>
<p>This is a paragraph introducing the content.</p>
<section>
<h2>Main Features</h2>
<ul>
<li>Feature 1: High quality</li>
<li>Feature 2: Affordable price</li>
<li>Feature 3: Great customer service</li>
</ul>
</section>
<div>
<h2>Additional Information</h2>
<ul>
<li>Extra 1: Fast shipping</li>
<li>Extra 2: Easy returns</li>
</ul>
</div>
<footer>
<p>Contact us at support@example.com</p>
</footer>
</body>
</html>'
Worksheet 16 - rvest
Wednesday, November 6, 2024
Name: __________________________________
Names of people you worked with: __________________________________
How are you holding up this morning?
Task:
- What function is used to read in the HTML code (usually it will be from a webpage)?
- In the R code, which HTML element is being targeted for text extraction?
- To extract only the list items within <div> elements, how would you modify the R code?
- To extract all of the list items (from anywhere), how would you modify the R code?
library(rvest)
page <- read_html(html_sample)
section_items <- page |> html_elements("section li") |> html_text()
section_items
[1] "Feature 1: High quality" "Feature 2: Affordable price"
[3] "Feature 3: Great customer service"
The rendered HTML code looks like this:
Welcome to the Sample Page
This is a paragraph introducing the content.
Main Features
- Feature 1: High quality
- Feature 2: Affordable price
- Feature 3: Great customer service
Additional Information
- Extra 1: Fast shipping
- Extra 2: Easy returns
Solution:
- read_html(html_sample) reads in the HTML content. The input is usually a webpage URL; in this case, it is the HTML code I wrote above.
- The code first targets the <section> element, which represents a section in HTML, and then the <li> elements, which represent list items, inside it. The selector "section li" in html_elements("section li") matches every <li> element nested anywhere inside a <section> element, and html_text() then extracts the text of each match.
- To extract all text from list items within the <div> elements, you would change the code to:
div_items <- page |> html_elements("div li") |> html_text()
div_items
[1] "Extra 1: Fast shipping" "Extra 2: Easy returns"
- If you want all of the list items (from anywhere), you would change the code to:
items <- page |> html_elements("li") |> html_text()
items
[1] "Feature 1: High quality" "Feature 2: Affordable price"
[3] "Feature 3: Great customer service" "Extra 1: Fast shipping"
[5] "Extra 2: Easy returns"
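The three selectors used in the solution can be compared side by side. The sketch below is not part of the original worksheet: html_small and page_small are hypothetical names, and the HTML is a trimmed-down copy of the sample page so that the block runs on its own. It also tries two variations that the worksheet does not cover, a comma-separated selector and a selector that matches nothing.

```r
library(rvest)

# A trimmed-down copy of the worksheet's sample HTML (hypothetical name),
# kept short so this block is self-contained.
html_small <- '<html><body>
<section><ul><li>Feature 1: High quality</li></ul></section>
<div><ul><li>Extra 1: Fast shipping</li></ul></div>
<footer><p>Contact us at support@example.com</p></footer>
</body></html>'

page_small <- read_html(html_small)

# Descendant selector: every <li> nested anywhere inside a <section>.
section_only <- page_small |> html_elements("section li") |> html_text()

# A comma joins two selectors, so an <li> inside either a <section>
# or a <div> is matched.
both <- page_small |> html_elements("section li, div li") |> html_text()

# A selector with no matches gives an empty character vector, not an error.
none <- page_small |> html_elements("table td") |> html_text()

section_only
both
length(none)
```

On this small page, "section li, div li" and plain "li" pick out the same items, because every list item sits inside either the <section> or the <div>. On a page with list items elsewhere (for example in a navigation bar), the two selectors would differ.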