syllabus
Foundations of Data Science in R
DS002R, Fall 2024
Jo Hardin, 2351 Estella, jo.hardin@pomona.edu
Office Hours: Mon 3:30-4:30pm, Tues 10-11am & 2:30-3:30pm, Thurs 10am-noon
Mentor Sessions:
Mon 8-10pm, Tues 8-10pm
Estella 2099
Mentors: Federica Domecq Lacroze and Z Skigen
The course
Foundations of Data Science in R is a first course in data science. Data play an increasingly important role in many fields. Being able to understand data and the ethical implications of data-driven decisions is paramount to being an informed member of society. As an introduction to data science with R, this course will introduce students to basic data science concepts. Prerequisite: CSCI004 or CSCI005 or CSCI051 or equivalent experience in programming.
Student Learning Outcomes.
By the end of the term, students will be able to:
- scrape, process, and clean data from the web
- wrangle data in a variety of formats
- contextualize variation in data
- construct point and interval estimates using resampling techniques
- design accurate, clear and appropriate data graphics
- query large relational databases (using SQL)
- work fluently with regular expressions
- communicate data-driven decisions
Inclusion Goals
In an ideal world, science would be objective. However, much of science is subjective and is historically built on a small subset of privileged voices. In this class, we will make an effort to recognize how science (and data science!) has played a role both in understanding diversity and in promoting systems of power and privilege. I acknowledge that there may be both overt and covert biases in the material due to the lens through which it was written, even though the material is primarily of a scientific nature. Integrating a diverse set of experiences is important for a more comprehensive understanding of science. I would like to discuss issues of diversity in statistics as part of the course from time to time.
Please contact me if you have any suggestions to improve the quality of the course materials.
Furthermore, I would like to create a learning environment for my students that supports a diversity of thoughts, perspectives and experiences, and honors your identities (including race, gender, class, sexuality, religion, ability, etc.). To help accomplish this:
- If you have a name and/or set of pronouns that differ from those that appear in your official records, please let me know!
- If you feel like your performance in the class is being impacted by your experiences outside of class, please don’t hesitate to come and talk with me. You can also relay information to me via your mentors. I want to be a resource for you.
I (like many people) am still in the process of learning about diverse perspectives and identities. If something is said in class (by anyone) that makes you feel uncomfortable, please talk to me about it. As a participant in course discussions, you should also strive to honor the diversity of your classmates.
Technical Details
Text:
Modern Data Science with R, 3rd edition by Baumer, Kaplan, and Horton.
R for Data Science, 2nd edition by Wickham, Çetinkaya-Rundel, and Grolemund.
Links to resources:
Git resources
- Best and most comprehensive Git help: http://happygitwithr.com/
- Version control with Git
- More on Git
- Online Git book with lots of info
R resources
- A fantastic ggplot2 tutorial
- Great tutorials through the Coding Club
- Google for R
- Incredibly helpful cheatsheets from RStudio.
SQL resources
- W3 schools Introduction to SQL
- W3 schools SQL Exercises, Practice, Solution
- R packages for working with databases
- Introduction to dbplyr
Regular expression resources
- stringr vignette
- stringr package
- Jenny Bryan et al.’s STAT 545 notes
- regexpal
- RegExr
- RegexOne
HW Grading
Homework assignments will be graded out of 5 points, which are based on a combination of accuracy and effort. Below are rough guidelines for grading.
[5] All problems completed with detailed solutions provided and 75% or more of the problems are fully correct. Additionally, there are no extraneous messages, warnings, or printed lists of numbers.
[4] All problems completed with detailed solutions and 50-75% correct; OR close to all problems completed and 75-100% correct; OR all problems completed but with extraneous messages, warnings, or printed lists of numbers.
[3] Close to all problems completed with less than 75% correct.
[2] More than half but fewer than all problems completed and > 75% correct.
[1] More than half but fewer than all problems completed and < 75% correct; OR less than half of problems completed.
[0] No work submitted, OR half or less than half of the problems submitted and without any detail/work shown to explain the solutions. You will get a zero if your file is not compiled and submitted on GitHub.
Projects:
There will be 5 mini-projects (due roughly every other week). You will also compile the projects, reflect on the process, and present your work to your classmates. Project information is available here: DS 002R Projects
Computing:
GitHub will be used as a way to practice reproducible and collaborative science. There may be a slight learning curve, but knowing Git will be an extremely useful skill as you venture beyond this class.
R will be used for all homework assignments. R is freely available at http://www.r-project.org/ and is already installed on college computers. Additionally, you need to install RStudio (https://posit.co/downloads/) in order to use Quarto. If you are not already familiar with R, please work through some of the materials provided ASAP.
You are welcome to use Pomona’s RStudio server at https://rstudio.campus.pomona.edu/ (or https://rstudio.pomona.edu if you are off campus). If you use the server, you can connect directly to your Git account without installing Git locally on your own computer. [If you are not a Pomona student, you will need to get an account from Pomona’s ITS. Go to ITS, tell them that you are taking a Pomona course, and ask for an account for using RStudio.]
Engagement:
This class will be interactive, and your engagement is expected (every day in class). Although notes will be posted, your engagement is an integral part of the in-class learning process.
In class: after answering one question, wait until 5 other people have spoken before answering another question. [Feel free to ask as many questions as often as you like!]
Academic Honesty:
You are on your honor to present only your own work as part of your course assessments. Below, I’ve provided Pomona’s academic honesty policy. But before the policy, I’ve given some thoughts on cheating which I have taken from Nick Ball’s CHEM 147 Collective (thank you, Prof Ball!). Prof Ball gives us all something to think about when we are learning in a classroom as well as on our journey to become scientists and professionals:
There are many known reasons why we may feel the need to “cheat” on problem sets or exams:
- An academic environment that values grades above learning.
- Financial aid that is critical for remaining in school, which places undue pressure on maintaining a high GPA.
- Navigating school, work, and/or family obligations that have diverted focus from class.
- Challenges balancing coursework and mental health.
- Balancing academic, family, peer, or personal issues.
Being accused of cheating – whether it has occurred or not – can be devastating for students. The college requires me to respond to potential academic dishonesty with a process that is very long and damaging. As your instructor, I care about you and want to offer alternatives to prevent us from having to go through this process. If you find yourself in a situation where “cheating” seems like the only option:
Please come talk to me. We will figure this out together.
Pomona College is an academic community, all of whose members are expected to abide by ethical standards both in their conduct and in their exercise of responsibilities toward other members of the community. The college expects students to understand and adhere to basic standards of honesty and academic integrity. These standards include, but are not limited to, the following:
- In projects and assignments prepared independently, students never represent the ideas or the language of others as their own.
- Students do not destroy or alter either the work of other students or the educational resources and materials of the College.
- Students neither give nor receive assistance in examinations.
- Students do not take unfair advantage of fellow students by representing work completed for one course as original work for another or by deliberately disregarding course rules and regulations.
- In laboratory or research projects involving the collection of data, students accurately report data observed and do not alter these data for any reason.
Advice:
Please email me and/or set up a time to talk if you have any questions about or difficulty with the material, the computing, or the course. Talk to me as soon as possible if you find yourself struggling. The material will build on itself, so it will be much easier to catch up if the concepts get clarified earlier rather than later. This semester is going to be fun. Let’s do it.