Mini-Project 2

Completing a full text analysis

Overview

You will find a data set containing string data. The data could come from newspaper articles, tweets, songs, plays, movie reviews, or anything else you can imagine. Then you will answer questions of interest and tell a story about your data using the string and regular expression skills you have developed.

Your analysis must contain the following elements:

  • at least 3 str_*() functions
  • at least 3 regular expressions
  • at least 2 illustrative, well-labeled plots or tables
  • a description of what insights can be gained from your plots and tables
  • a reference / documentation of the data source.
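As a rough illustration of how str_*() functions and regular expressions can be combined, here is a minimal sketch. The `headlines` vector is made-up toy data (an assumption for this example, not one of the suggested datasets), and the particular functions and patterns are only one possible choice.

```r
library(stringr)

# Hypothetical toy data: three made-up headlines, not from any real source
headlines <- c(
  "Senate passes budget bill in 2019",
  "Local team wins 3-2 in overtime",
  "Study links sleep and memory"
)

# str_detect() with a regex: does the headline contain any digit?
has_digit <- str_detect(headlines, "\\d")

# str_extract() with a regex: pull out a standalone four-digit number
# (a plausible year); returns NA when there is no match
year <- str_extract(headlines, "\\b\\d{4}\\b")

# str_count() with a regex: count runs of non-whitespace, i.e. words
n_words <- str_count(headlines, "\\S+")
```

Results like `has_digit`, `year`, and `n_words` could then be added as columns to a tibble and summarized in a plot or table.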

Logistics:

  • please include all your code used in the analysis.
  • make sure that all graphs are well-labeled (including x and y axes, title of the graph, and accurate and succinct labels for color and fill).
  • do not include superfluous error or warning messages.
  • include a few sentences describing each of your plots or tables. That is, tell the reader what they see when they look at the plot. Your narrative description should be in the text part of the qmd file, not as a comment in an R chunk.
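The labeling and warning-message requirements above might look something like the following sketch. The `word_counts` data frame and all label text are hypothetical, invented only for this example. The `#|` lines are Quarto chunk options that hide warnings and messages in the rendered page; outside a qmd chunk they are ordinary comments.

```r
#| warning: false
#| message: false
library(ggplot2)

# Hypothetical toy data: word counts for three made-up documents
word_counts <- data.frame(
  source  = c("Document A", "Document B", "Document C"),
  n_words = c(120, 85, 240),
  genre   = c("news", "review", "news")
)

# Bar chart with a title, both axis labels, and a legend label for fill
p <- ggplot(word_counts, aes(x = source, y = n_words, fill = genre)) +
  geom_col() +
  labs(
    title = "Word counts by document",
    x     = "Document",
    y     = "Number of words",
    fill  = "Genre"
  )
p
```

The sentences describing what the reader should notice in the plot would then go in the text of the qmd file, directly below the chunk.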

Some potential places to find text data

I’ve gathered some potential datasets for you to work with. All of the datasets below contain some or a lot of text.

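# Federalist Papers (installed from GitHub, which requires devtools)
install.packages("devtools")
devtools::install_github("nicholasjhorton/FederalistPapers")
# Dickinson poems (also installed from GitHub)
devtools::install_github("Amherst-Statistics/DickinsonPoems")
# Dear Abby questions, read directly from The Pudding's GitHub repository
library(readr)
read_csv("https://raw.githubusercontent.com/the-pudding/data/master/dearabby/raw_da_qs.csv")
# NYTimes data set bundled with the RTextTools package
library(RTextTools)
library(tibble)
data(NYTimes)
as_tibble(NYTimes)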
  • the options are endless – be resourceful and creative!

Timeline

Mini-Project 2 must be submitted on Canvas (not Gradescope) by 11:59 PM on Wednesday, October 2. You will add a tab to your Quarto webpage for Mini-Project 2 and submit the new page’s URL.
