install.packages("devtools")
devtools::install_github("nicholasjhorton/FederalistPapers")
Project 2
Completing a full text analysis
Overview
You will find a data set containing string / character data. The data could be on newspaper articles, tweets, songs, scripts, movie reviews, or anything else you can imagine. Then you will answer questions of interest and tell a story about your data using the string and regular expression skills you have developed.
Your analysis must contain the following elements:
- at least 3 str_*() functions (3 different functions)
- at least 3 regular expressions
- at least 1 lookaround
- at least 2 illustrative, well-labeled plots or tables
- a description of what insights can be gained from your plots and tables
- a reference / documentation of the data source.
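As a rough illustration of the string requirements above, here is a minimal sketch that uses three different str_*() functions, three regular expressions, and two lookarounds. The `headlines` vector is made-up toy data, not one of the suggested datasets, and the patterns are only examples of the kind of thing you might write for your own text.

```r
library(stringr)

# Hypothetical toy data standing in for a real text column
headlines <- c(
  "Prices rise 5% in March",
  "Senate passes $40 billion bill",
  "Storm hits coast; 12 injured"
)

# str_detect(): which headlines mention a dollar amount?
has_dollars <- str_detect(headlines, "\\$\\d+")

# str_extract() with a lookbehind: the digits that directly follow a "$"
dollar_amount <- str_extract(headlines, "(?<=\\$)\\d+")

# str_extract() with a lookahead: the digits that directly precede a "%"
percent <- str_extract(headlines, "\\d+(?=%)")

# str_count(): number of runs of word characters in each headline
n_words <- str_count(headlines, "\\w+")
```

The lookarounds let the pattern require the "$" or "%" without including it in the extracted text, so the results can be passed to as.numeric() without further cleaning.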
Logistics:
- please include all your code used in the analysis (but feel free to use code folding).
- make sure that all graphs are well-labeled (including x and y axes, title of the graph, and accurate and succinct labels for color and fill).
- do not include error or warning messages (see HW YAML for code).
- include a few sentences describing each of your plots or tables. That is, tell the reader what they see when they look at the plot. Your narrative description should be in the text part of the qmd file, not as a comment in an R chunk.
- please include the source of the data (which might include, for example, both the link to the data source (e.g., TidyTuesday page) + the original source of the data (e.g., NYT)).
- if you are working with a (local) copy of the .csv file (as opposed to, for example, a link to the dataset on TidyTuesday’s GitHub site), then the .csv file should live in your GitHub repository for your website. And you should read the data in from that local copy. That is, the dataset should not live in your Downloads.
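To make the labeling requirement above concrete, here is a small sketch of a plot with a title, axis labels, and a readable fill legend. The `word_counts` data frame and all of its values are invented for illustration only; your own plot would be built from the dataset you choose.

```r
library(ggplot2)

# Hypothetical summary table; the numbers are made up for illustration
word_counts <- data.frame(
  source = c("Headlines", "Tweets", "Reviews"),
  avg_words = c(8, 21, 54),
  type = c("News", "Social media", "Social media")
)

p <- ggplot(word_counts, aes(x = source, y = avg_words, fill = type)) +
  geom_col() +
  labs(
    title = "Average length of a text entry, by source",
    x = "Source of the text",
    y = "Average number of words",
    fill = "Type of source"
  )

p
```

Setting every label inside a single labs() call keeps the title, axes, and legend wording in one place, which makes it easier to check that nothing is left with a raw variable name.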
Some potential places to find text data
I’ve gathered some potential datasets for you to work with. All of the datasets below contain some or a lot of text.
- synopses data frame for Broadway Weekly Grosses
- The FederalistPapers package
  - Load the package into R using the install_github() command at the top of this page
- All of Emily Dickinson’s poems
  - Load the package into R using
devtools::install_github("Amherst-Statistics/DickinsonPoems")
- All of Barack Obama’s tweets archived by the National Archives
- The “Dear Abby” stories underlying The Pudding’s 30 Years of American Anxieties article
  - See data on The Pudding’s GitHub site
  - Load the data in using
read_csv("https://raw.githubusercontent.com/the-pudding/data/master/dearabby/raw_da_qs.csv")
- NY Times headlines from the RTextTools package (see below)
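library(RTextTools)
library(tibble)  # provides as_tibble()

data(NYTimes)       # load the NYTimes data frame from RTextTools
as_tibble(NYTimes)  # view it as a tibble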
- the options are endless – be resourceful and creative!
Timeline
Project 2 must be submitted on Canvas (not Gradescope) by 11:59 PM on Wednesday, March 5. You will add a tab to your Quarto webpage and submit the new page’s URL. [Remember, you should continue to work in your website Rproj. Do not start a new R Project.]