Project 4
ethics: data and power
Project 4 examines ethics and power in the data science context. You will investigate a particular data science ethical quandary and the power dynamics within it. Our work is grounded in the following two quotes from Data Feminism. The first asks about who, and the second asks about why.
[Examine] how power operates in the world today. This consists of asking who questions about data science: Who does the work (and who is pushed out)? Who benefits (and who is neglected or harmed)? Whose priorities get turned into products (and whose are overlooked)?
How did we get to the point where data science is used almost exclusively in the service of profit (for a few), surveillance (of the minoritized), and efficiency (amidst scarcity)?
Your task
1. First, find an ethical dilemma with a data science component. Please take on only one ethical dilemma; it is too difficult to compare multiple dilemmas in one short blog entry. There are many examples below, and for whichever example you choose you should start with the reference provided and find at least one more article (possibly from another angle, or go find the privacy policy / user agreement if there is one!) to expand your understanding of the topic. It should be clear from your report which information came from which article. Feel free to choose an example different from those below.
2. Describe the example / scenario as if to someone who is not at all familiar with the setting. In particular, make clear both what the data science component is and what the ethical dilemma is.
3. Respond to at least 4 of the items below (from the list of questions or the Data Values and Principles Manifesto): write four separate paragraphs, each explaining both the issue (e.g., consent) and how that issue played out in your data science example. Note: it is totally fine if some of the items were handled well in your example.
4. Given what you described in #3 (above), summarize by explaining why it matters. Who benefits? Who is neglected or harmed? Were the ethical violations in the interest of profit? Surveillance? Power?
Questions to respond to
- What is the permission structure for using the data? Was it followed?
- What was the consent structure for recruiting participants? Were the participants aware of the ways their data would be used for research? Was informed consent possible? Can you provide informed consent for applications that are not yet foreseen?
- What was the data collection process? Were the observations collected ethically? Are there missing observations?
- Were the data made publicly available? Why? How? On what platform?
- Is the data identifiable? All of it? Some of it? In what way? Are the data sufficiently anonymized or old to be free of ethical concerns? Is anonymity guaranteed?
- How were the variables collected? Were they accurately recorded? Is there any missing data?
- Who was measured? Are those individuals representative of the people to whom we’d like to generalize / apply the algorithm? Should we analyze data if we do not know how the data were collected?
- Are the data being used in ways unintended by the original study?
- Should race be used as a variable? Is it a proxy for something else (e.g., amount of melanin in the skin, stress of navigating microaggressions, zip code, etc.)? What about gender?
As data teams, we aim to…
- Use data to improve life for our users, customers, organizations, and communities.
- Create reproducible and extensible work.
- Build teams with diverse ideas, backgrounds, and strengths.
- Prioritize the continuous collection and availability of discussions and metadata.
- Clearly identify the questions and objectives that drive each project and use them to guide both planning and refinement.
- Be open to changing our methods and conclusions in response to new knowledge.
- Recognize and mitigate bias in ourselves and in the data we use.
- Present our work in ways that empower others to make better-informed decisions.
- Consider carefully the ethical implications of choices we make when using data, and the impacts of our work on individuals and society.
- Respect and invite fair criticism while promoting the identification and open discussion of errors, risks, and unintended consequences of our work.
- Protect the privacy and security of individuals represented in our data.
- Help others to understand the most useful and appropriate applications of data to solve real-world problems.
Logistics
- Work in your website .Rproj; do not start a new R Project.
- Create a new Quarto file (just like you did for previous projects), and type words into it, even though there is no code. A minimal skeleton is sketched after this list.
- No code is expected. If, for some reason, you include code, follow the same practices as in previous projects: explain what you are doing, show your code, suppress messages and warnings, etc.
- Include the full citations for your references (of which there should be at least two), not just the hyperlinks. If you do not know how to create or format a citation, ask me or ChatGPT.
- Make it clear which information came from which resource.
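For reference, here is a minimal sketch of what the new Quarto file might look like. The file name, title, and section headings are placeholders; adapt them to your own website and your chosen example.

```markdown
---
title: "Project 4: Data and Power"
author: "Your Name"
---

## The scenario

Describe the example as if to a reader who has never heard of it: what is the
data science component, and what is the ethical dilemma?

## Four issues

One paragraph per item (e.g., consent, anonymization, representativeness).

## Why it matters

Who benefits? Who is neglected or harmed? Profit, surveillance, or power?

## References

Full citations for at least two sources, not just hyperlinks.
```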
Timeline
Project 4 must be submitted on Canvas (not Gradescope) by 11:59 PM on Wednesday, April 16. You will add a tab to your Quarto website (see the sketch below) and submit the new page’s URL. [Remember, you should continue to work in your website .Rproj. Do not start a new R Project.]
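If it helps, adding the tab usually amounts to listing the new .qmd file in the navbar section of your site’s _quarto.yml. A minimal sketch is below; the file names and labels are placeholders, and your existing navbar entries will look different.

```yaml
# _quarto.yml (sketch): add the new page to the website navbar
website:
  navbar:
    left:
      - href: index.qmd      # existing home page
        text: Home
      - href: project4.qmd   # new Project 4 page (placeholder file name)
        text: Project 4
```

After re-rendering the site, the new tab should appear in the navigation bar, and the rendered page’s URL is what you submit on Canvas.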
Potential examples
- Biobank samples from the Havasupai lawsuit (informed consent, among other things). Van Assche, K., Gutwirth, S., and Sterckx, S. (2013), Protecting Dignitary Interests of Biobank Research Participants: Lessons from Havasupai Tribe v Arizona Board of Regents, Law, Innovation and Technology, 5, 54–84.
- Facebook emotional contagion experiment (participant consent, among other things). Kramer, A. D., Guillory, J. E., and Hancock, J. T. (2014), Experimental Evidence of Massive-Scale Emotional Contagion through Social Networks, Proceedings of the National Academy of Sciences of the United States of America, 111, 8788–8790.
- OkCupid data release (privacy and publicly available data). Xiao, T., and Ma, Y. (2021), A Letter to the Journal of Statistics and Data Science Education — A Call for Review of ‘OkCupid Data for Introductory Statistics and Data Science Courses’ by Albert Y. Kim and Adriana Escobedo-Land, Journal of Statistics and Data Science Education, 29, 214–215.
- Taxi dataset with 173 million cab rides (poorly anonymized data). Goodin, D. (2014), Poorly Anonymized Logs Reveal NYC Cab Drivers’ Detailed Whereabouts, Ars Technica.
- Netflix data “de-anonymized” by linking to IMDb (poorly anonymized data). Leetaru, K. (2016), The Big Data Era of Mosaicked Deidentification: Can We Anonymize Data Anymore?, Forbes.
- COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) (biased algorithm). Angwin, J., Larson, J., Mattu, S., and Kirchner, L. (2016), Machine Bias, ProPublica.
- Target (what if the algorithm is good at predicting something you don’t want predicted?). Duhigg, C. (2012), How Companies Learn Your Secrets, The New York Times Magazine.
- Amazon hiring algorithm (who has moral responsibility?). Dastin, J. (2018), Amazon Scraps Secret AI Recruiting Tool that Showed Bias Against Women, Reuters.
- Airlines respond differently, depending on who is tweeting (should the algorithm be used?). Gunarathne, P., Rui, H., and Seidmann, A. (2022), Racial Bias in Customer Service: Evidence from Twitter, Information Systems Research, 33, 43–54. N.B.: log into the Claremont Colleges Library and search for the title; the library has the digital copy available.
- The Allegheny Family Screening Tool is specifically designed to predict the risk that a child will be placed in foster care in the two years after being investigated (what if the marginalized community is overrepresented?). Ho, S., and Burke, G. (2022), An Algorithm That Screens for Child Neglect Raises Concerns, Pulitzer Center.
- Training facial recognition software from publicly available data (what if the marginalized community is underrepresented?). Buolamwini, J., and Gebru, T. (2018), Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, Proceedings of Machine Learning Research, 81, 77–91.