Introduction to DS002R

Foundations of Data Science

Jo Hardin

2024-08-26

Agenda 8/26/24

  1. Syllabus
  2. What is Data Science?
  3. Tools

Important

Before Wednesday, listen to the full episode of the Not So Standard Deviations podcast titled "Compromised Shoe Situation".

Course structure

  • bi-weekly HW (to GitHub + Gradescope)
  • bi-weekly quizzes
  • mini-projects
  • in-class activities / clickers
  • ethical considerations

Additional details

  • Canvas has all the links
    1. course website – almost everything
    2. class notes
    3. Canvas page – solutions and assignments
  • no computers (tablets fine)
  • good communication
  • TidyTuesday

Syllabus

  • office hours
  • mentor sessions
  • anonymous feedback
  • dates for assignments
  • links to resources
  • HW grading
  • project information

Important

I need your GitHub user name - please email it to me.

What is Data Science?

Data science lives at the intersection of statistics, computer science, and discipline knowledge. It is generally the process by which we gain insight from data.

V1.0 - Drew Conway

2010: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

V2.0 - Steve Geringer

2014: http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html

V3.0 - Writuparna Banerjee

2020: https://medium.com/@writuparnabanerjee/the-difference-in-the-career-options-in-data-science-data-scientist-vs-data-engineer-vs-data-33209d0ac880

V4.0 - Joel Grus

2013: https://posit.co/wp-content/themes/Posit/public/markdown-blogs/role-of-the-data-scientist/index.html?wtime=%7Bseek_to_second_number%7D

Data Science Overview

Based on https://www.effectivedatastorytelling.com/post/a-deeper-dive-into-lego-bricks-and-data-stories, original source: https://www.linkedin.com/learning/instructors/bill-shander

Data Science in DS 002R

DS workflow        | in DS002R                               | beyond DS002R
data acquisition   | web scraping, relational databases      | APIs
data exploration   | wrangling, strings, regular expressions | natural language processing
data visualization | grammar of graphics                     | animations
data conclusions   | iteration, permutation tests            | predictive modeling, machine learning, AI
data communication | yes!                                    | yes!

Data Science in the Wild

Data science extracts knowledge from within a particular domain of inquiry. Examples from Pomona!

  • Shannon Burns (Psychological Science and Neuroscience) uses data to understand brain processes of social communication.
  • Charlotte Chang uses data to study and improve earth stewardship.
  • Anthony (Tony) Clark (Computer Science) uses data to improve the safety and reliability of mobile robots.
  • Manisha Goel (Economics) uses data to understand how people’s identities shape the fortunes of businesses where they work.
  • Jun Lang (Asian Languages and Literatures) uses data to analyze (1) the intersection of language, gender, and society, and (2) second language acquisition and pedagogy.
  • Frank Pericolosi (Physical Education) uses data to improve his team’s chances on the field.
  • Ami Radunskaya (Mathematics) uses data to model tumor growth and treatment.
  • Yuqing Zhu (Neuroscience) uses data to figure out how a jumble of neurons becomes smart.

Learning goals

By the end of the course, you will be able to…

  • gain insight from data
  • gain insight from data, reproducibly
  • gain insight from data, reproducibly, using modern programming tools and techniques
  • gain insight from data, reproducibly (with literate programming and version control), using modern programming tools and techniques

Activity

  1. What problem or task would you like to investigate using data?
  2. What would be hard about executing the project?
  3. What are the potential ethical frameworks to consider?
  4. How would you define success?

Toolkit: Computing

We use tools to do the things. But the tools are not the things.

Reproducible data analysis

Reproducibility checklist

What does it mean for a data analysis to be “reproducible”?

Short-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done?

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Toolkit for reproducibility

  • Scriptability → R
  • Literate programming (code, narrative, output in one place) → Quarto
  • Version control → Git / GitHub
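
As a rough illustration of the long-term goals above, wrapping an analysis step in a function lets the same script run unchanged on different data. The sketch below uses the mtcars and iris data sets that ship with R; the helper name summarize_column is hypothetical and is not taken from the course materials.

```r
# Hypothetical helper: summarize one numeric column of any data frame.
summarize_column <- function(data, column) {
  values <- data[[column]]
  c(n = length(values), mean = mean(values), sd = sd(values))
}

# The same code runs on two different built-in data sets.
mpg_summary <- summarize_column(mtcars, "mpg")
sepal_summary <- summarize_column(iris, "Sepal.Length")

print(mpg_summary)
print(sepal_summary)
```

Because the data set and column are arguments rather than hard-coded values, extending the analysis to new data means changing one line, not rewriting the script.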

R and RStudio


R logo

  • R is an open-source statistical programming language
  • R is also an environment for statistical computing and graphics
  • It’s easily extensible with packages

RStudio logo

  • RStudio is a convenient interface for R called an IDE (integrated development environment), e.g. “I write R code in the RStudio IDE”
  • RStudio is not a requirement for programming with R, but it’s very commonly used by R programmers and data scientists

R vs. RStudio

On the left: a car engine. On the right: a car dashboard. The engine is labelled R. The dashboard is labelled RStudio.

R packages

  • Packages: Fundamental units of reproducible R code, including reusable R functions, the documentation that describes how to use them, and sample data

  • As of August 26, 2024, there are 21,145 R packages available on CRAN (the Comprehensive R Archive Network)

  • We’re going to work with a small (but important) subset of these!

Tour: R + RStudio

Tour recap: R + RStudio

A short list (for now) of R essentials

  • Functions are (most often) verbs, followed by what they will be applied to in parentheses:
do_this(to_this)
do_that(to_this, to_that, with_those)
  • Packages are installed with the install.packages() function (only once per computer) and loaded with the library() function (once per session):
install.packages("package_name")
library(package_name)
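
To make the pattern concrete, here is a small sketch using functions that come with R. The stats package is chosen only because it is part of every R installation, so no install.packages() call is needed here; the packages used in the course will differ.

```r
# A function (verb) applied to one argument.
values <- c(2, 4, 9)
avg <- mean(values)

# A function applied to several arguments, one of them named.
rounded <- round(avg, digits = 1)

# Load a package for the current session. 'stats' ships with R,
# so it does not need to be installed first.
library(stats)
middle <- median(values)

print(avg)
print(rounded)
print(middle)
```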

R essentials (continued)

  • Columns (variables) in data frames are accessed with $:
dataframe$var_name
  • Object documentation can be accessed with ?
?mean
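
A short sketch of both ideas, assuming the built-in mtcars data frame stands in for whatever data you are working with:

```r
# mtcars is a data frame that ships with R.
mpg <- mtcars$mpg          # pull out one column as a vector

# A column is an ordinary vector, so functions apply to it directly.
n_cars <- length(mpg)
avg_mpg <- mean(mpg)

print(n_cars)
print(avg_mpg)

# In an interactive session, ?mtcars opens the help page for the data set;
# help("mtcars") is the equivalent function call.
```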

tidyverse

Hex logos for dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr

tidyverse.org

  • The tidyverse is an opinionated collection of R packages designed for data science
  • All packages share an underlying philosophy and a common grammar
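
One way to see the common grammar is a short dplyr pipeline in which each verb takes a data frame and returns a data frame. This is only a sketch: it assumes dplyr is installed and R is version 4.1 or later (for the |> pipe), and it uses the built-in mtcars data rather than course data.

```r
# Assumes dplyr is installed; the guard skips the example otherwise.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  # Read as: take mtcars, then group by cylinders, then summarize.
  mpg_by_cyl <- mtcars |>
    group_by(cyl) |>
    summarize(mean_mpg = mean(mpg))

  print(mpg_by_cyl)
}
```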

Quarto


  • Fully reproducible reports – each time you Render, the analysis is run from the beginning
  • Code goes in chunks
  • Narrative goes outside of chunks

Tour: Quarto

Tour recap: Quarto

RStudio IDE with a Quarto document, source code on the left and output on the right. Annotated to show the YAML, a link, a header, and a code chunk.

Environments

Important

The environment of your Quarto document is separate from the Console!

Remember this, and expect it to bite you a few times as you’re learning to work with Quarto!


First, run the following in the console:

x <- 2
x * 3


All looks good, eh?

Then, add the following in an R chunk in your Quarto document

x * 3


What happens? Why the error?
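
The same behavior can be imitated in plain R. This is an analogy rather than what Quarto does internally: two environment objects stand in for the Console and for the rendered document, and the names console_env and document_env are invented for the sketch.

```r
# Stand-in for the Console: x is defined here.
console_env <- new.env()
assign("x", 2, envir = console_env)
console_result <- eval(quote(x * 3), envir = console_env)

# Stand-in for a rendered document: a separate environment that has
# never seen x. Its parent is baseenv() so the * operator is still found.
document_env <- new.env(parent = baseenv())
document_result <- tryCatch(
  eval(quote(x * 3), envir = document_env),
  error = function(e) conditionMessage(e)
)

print(console_result)
print(document_result)  # an error message, because x is not defined here
```

The fix in a real document is the same as in the sketch: define x inside a chunk of the Quarto document itself, before the chunk that uses it.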

How will we use Quarto?

  • Every assignment is a Quarto document.
  • You’ll always have a template Quarto document to start with.
  • The amount of scaffolding in the template will decrease over the semester.

Toolkit: Version control and collaboration

Git and GitHub

Git logo

  • Git is a version control system – like “Track Changes” features from Microsoft Word, on steroids
  • It’s not the only version control system, but it’s a very popular one

GitHub logo

  • GitHub is the home for your Git-based projects on the internet – like Dropbox but much, much better

  • We will use GitHub as a platform for web hosting and collaboration (and as our course management system!)

Versioning - done badly

Versioning - done better

Versioning - done even better

with human readable messages

How will we use Git and GitHub?


Git and GitHub tips

  • There are millions of git commands – ok, that’s an exaggeration, but there are a lot of them – and very few people know them all. 99% of the time you will use git to add, commit, push, and pull.
  • We will be doing Git things and interfacing with GitHub through RStudio, but if you google for help you might come across methods for doing these things in the command line – skip that and move on to the next resource unless you feel comfortable trying it out.
  • There is a great resource for working with git and R: happygitwithr.com. Some of the content in there is beyond the scope of this course, but it’s a good place to look for help.

Tour: Git + GitHub