Project 2

Completing a full text analysis

Overview

You will find a dataset containing string / character data. The data could be newspaper articles, tweets, songs, scripts, movie reviews, or anything else you can imagine. Then you will answer questions of interest and tell a story about your data using the string and regular expression skills you have developed.

Your analysis must contain the following elements:

at least 3 str_*() functions (3 different functions)
at least 3 regular expressions
at least 1 lookaround
at least 2 illustrative, well-labeled plots or tables
a description of what insights can be gained from your plots and tables
a reference / documentation of the data source.
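
To make the first three requirements concrete, here is a minimal sketch of how several str_*() functions, regular expressions, and a lookaround might fit together. The toy `lines` vector and all variable names are hypothetical stand-ins for a real text column; it assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical toy data standing in for a real text column
lines <- c(
  "ROMEO: But soft, what light through yonder window breaks?",
  "JULIET: O Romeo, Romeo, wherefore art thou Romeo?",
  "NURSE: Madam!"
)

# str_extract() with a lookahead: leading capital letters
# that are immediately followed by a colon (the colon is not captured)
speaker <- str_extract(lines, "^[A-Z]+(?=:)")

# str_remove(): drop the speaker prefix and any whitespace after it
dialogue <- str_remove(lines, "^[A-Z]+:\\s*")

# str_count(): count mentions of "romeo", ignoring case
romeo_mentions <- str_count(dialogue, regex("romeo", ignore_case = TRUE))

# str_detect(): flag lines that end in a question mark
is_question <- str_detect(dialogue, "\\?$")

data.frame(speaker, romeo_mentions, is_question)
```

A real analysis would apply the same functions to a column of the chosen dataset, typically inside mutate(), rather than to a hand-typed vector.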

Logistics:

please include all the code used in your analysis (but feel free to use code folding¹).
make sure that all graphs are well-labeled (including x and y axes, title of the graph, and accurate and succinct labels for color and fill).
do not include error or warning messages (see HW YAML for code).
include a few sentences describing each of your plots or tables. That is, tell the reader what they see when they look at the plot. Your narrative description should be in the text part of the qmd file, not as a comment in an R chunk.
please include the source of the data, which might include, for example, both the link to the data source (e.g., the TidyTuesday page) and the original source of the data (e.g., NYT).
if you are working with a (local) copy of the .csv file (as opposed to, for example, a link to the dataset on TidyTuesday’s GitHub site), then the .csv file should live in the GitHub repository for your website, and you should read the data in from that local copy. That is, the dataset should not live in your Downloads folder.
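
As one illustration of the labeling requirement above, the sketch below labels the x and y axes, the title, and the fill legend with labs(). The `question_counts` data frame and its numbers are made up purely for illustration; it assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical summary table with made-up numbers, for illustration only
question_counts <- data.frame(
  speaker = c("ROMEO", "JULIET", "NURSE"),
  n_questions = c(12, 9, 4),
  role = c("lead", "lead", "supporting")
)

p <- ggplot(question_counts, aes(x = speaker, y = n_questions, fill = role)) +
  geom_col() +
  labs(
    title = "Questions asked per speaker",  # title of the graph
    x = "Speaker",                          # x-axis label
    y = "Number of questions",              # y-axis label
    fill = "Role"                           # succinct legend label for fill
  )

p
```

The same labs() call accepts `color =` when a plot maps a variable to color instead of fill.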

Some potential places to find text data

I’ve gathered some potential datasets for you to work with. All of the datasets below contain some or a lot of text.

Shakespeare Dialogue
Netflix titles
synopses data frame for Broadway Weekly Grosses
Friends dialogue
Federalist Papers
- Install the package in R using

install.packages("devtools")
devtools::install_github("nicholasjhorton/FederalistPapers")

All of Emily Dickinson’s poems
- Install the package in R using

devtools::install_github("Amherst-Statistics/DickinsonPoems")

Every line of The Office
All of Barack Obama’s tweets archived by the National Archives.
The “Dear Abby” stories underlying The Pudding’s 30 Years of American Anxieties article
- See the data on The Pudding’s GitHub site
- Load the data in using

library(readr)  # provides read_csv()
read_csv("https://raw.githubusercontent.com/the-pudding/data/master/dearabby/raw_da_qs.csv")

Other articles from The Pudding
NY Times headlines from the RTextTools package (see below)

library(RTextTools)
library(tibble)  # provides as_tibble()
data(NYTimes)
as_tibble(NYTimes)

the options are endless – be resourceful and creative!

Timeline

Project 2 must be submitted on Canvas (not Gradescope) by 11:59 PM on Wednesday, March 5. You will add a tab to your Quarto webpage and submit the new page’s URL. [Remember, you should continue to work in your website Rproj. Do not start a new R Project.]

Footnotes

¹ Code folding hides the code by default but lets the reader see it if they want to: https://quarto.org/docs/output-formats/html-code.html#folding-code

Reuse

CC BY 4.0