1C Introduction to Data Science and Programming in R

Please note: This course will be taught in hybrid mode. Hybrid delivery of courses will include synchronous live sessions during which on campus and online students will be taught simultaneously.

Chris Fariss is a Professor in the Department of Political Science and Research Professor in the Center for Political Studies at the University of Michigan. He is also an Affiliated Scholar at the Security and Political Economy (SPEC) Lab at the University of Southern California. His core research focuses on the politics and measurement of human rights, discrimination, violence, and repression. He uses computational methods to understand why governments around the world torture, maim, and kill individuals within their jurisdiction and the processes monitors use to observe and document these abuses. Other projects cover a broad array of themes but share a focus on computationally intensive methods and research design. These methodological tools, essential for analyzing data at massive scale, open up new insights into the micro-foundations of state repression and the politics of measurement.

Course Description
This course focuses on the research design and data analysis tools used to explore and understand social media and text data using computational and simulation-based methods in R. The fundamentals of research design are the same throughout the social sciences; however, the topical focus of this class is on computationally intensive data generating processes and the research designs used to understand and manipulate such data at scale.

By massive or large scale, I mean one of three things: there are lots of subjects/connections/units/rows in the data (e.g., social network data like the kind available from Twitter); there are lots of variables/items/columns in the data (e.g., image or text data with many thousands of columns that represent the words in the document corpus); or the selected analytical tool is a computationally complex algorithm (e.g., a Bayesian simulation for modeling a latent variable, a random forest model for exploratory data analysis, or a neural network for automatically classifying new observations). Some combination of these three issues is also common. The course will provide students with the tools to design observational studies and experimental interventions into large and unstructured data sets at increasingly massive scales and at different degrees of computational complexity.
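To make the "many columns" case concrete, here is a minimal base R sketch of how text becomes a matrix with one column per word. The three toy documents and all object names (docs, tokens, vocab, dtm) are made up for illustration and are not the course datasets. With a real corpus the vocabulary, and therefore the number of columns, quickly runs into the thousands.

```r
# Toy corpus: three short "documents" (invented for illustration only).
docs <- c(
  d1 = "the state monitors the report",
  d2 = "monitors document abuses",
  d3 = "the report documents the abuses"
)

# Split each document into lowercase word tokens on whitespace.
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary is the set of unique words across all documents.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word appears in each document.
# vapply() returns words in rows and documents in columns, so transpose.
counts <- vapply(
  tokens,
  function(words) tabulate(match(words, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
dtm <- t(counts)
colnames(dtm) <- vocab

dim(dtm)  # rows = documents, columns = unique words
dtm
```

Even in this tiny example there are more columns (unique words) than rows (documents), which is the typical shape of text data.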

How will we go about learning these tools? In this class, we will learn to program and program to learn. What do I mean? First, we will use the R programming environment to learn the building blocks of programming. These skills are essential for managing the increasingly large and complex datasets of interest to social scientists (e.g., image data, text data).

As we develop programming skills in R, we will use them to help us understand how different types of data analysis tools work. For example, by the end of the course, students will be able to program and evaluate their own neural network or structural topic model from scratch.

We will start very small and learn how to scale up. In the beginning of the course, we will not make use of many packages other than the base packages available by default in R. As we proceed, we will learn how models for data work before investigating the functions that exist in the large, ever-growing catalogue of packages available for you to use in R. The development of new functions in R is advancing rapidly. The tools you learn in this class will help you improve as a programmer and a data scientist by learning how to program and by using your programming skills to learn how to analyze data.
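As a rough illustration of "program to learn", the sketch below estimates a linear regression by hand in base R and then checks the result against the built-in lm() function. The simulated data, the seed, and the object names are assumptions chosen for this example; it is not taken from the course materials.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Simulate data from a known linear model: y = 2 + 3x + noise.
n <- 200
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Estimate the coefficients by hand with the normal equations:
# beta_hat = (X'X)^{-1} X'y, where X includes a column of ones.
X <- cbind(intercept = 1, x = x)
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Compare with the packaged function that does the same job.
fit <- lm(y ~ x)
cbind(by_hand = drop(beta_hat), lm = coef(fit))
```

The two columns should agree up to numerical precision. Writing the estimator once by hand makes it clearer what the packaged function computes.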

Course Objectives
Students will learn how to design models for data that take advantage of the wealth of information contained in new massive-scale online datasets such as data available from Twitter, images, and the many newly digitized document corpora now available online. The focus of the course is on learning to program in R with special attention paid to designing studies in such a way as to maximize the validity of inferences obtained from these complex datasets.

Learn to program models in R at a small scale using the base package and a minimal number of other packages.
Use the tools from research design to assist in model development.
Validate models of observational data in comparison to an appropriate baseline model.
Develop simulation-based models for large-scale observational data.
Develop and validate measurement models (e.g., latent variable models, structural topic models) and classification models (e.g., neural networks) of text- and image-based data.
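One simple way to read "validate in comparison to an appropriate baseline model" is to compare out-of-sample prediction error against an intercept-only model. The sketch below does this with simulated data in base R. The variables, the 70/30 split, and the rmse() helper are hypothetical choices for illustration, not the course's prescribed procedure.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated observational data (hypothetical variables, not course data).
n <- 500
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Hold out 30% of the rows for out-of-sample evaluation.
test_rows <- sample(n, size = round(0.3 * n))
train <- dat[-test_rows, ]
test  <- dat[test_rows, ]

# Baseline: intercept-only model (predicts the training mean of y).
baseline <- lm(y ~ 1, data = train)
# Candidate: adds the predictor x.
candidate <- lm(y ~ x, data = train)

# Root mean squared error of a model's predictions on new data.
rmse <- function(model, newdata) {
  sqrt(mean((newdata$y - predict(model, newdata = newdata))^2))
}

c(baseline = rmse(baseline, test), candidate = rmse(candidate, test))
```

If the candidate model does not beat the baseline on held-out data, the added complexity is not buying better predictions.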

Course Prerequisites
Students should have some familiarity with concepts from research design and statistics. Generally, exposure to these concepts occurs during first-year coursework in a typical PhD program in political science. Students should also have familiarity with the R computing environment. The more familiarity with R the better.

Representative Background Reading
There are no required books in this course. Rather, I will make reference to material listed in the course outline in rough proportion to the suggested books > additional suggested books > other related readings. There is also a set of applied articles, listed in the other related readings subsections, that I will reference. Think of these as useful references and places to find examples. The primary course content will be the R programming lessons.

Course Details

We will begin each class period with a “programming challenge” (approximately 20-25 minutes).
I will then give a short lecture on the class material (approximately 30-45 minutes).
On the first day of class, I will introduce students to two large-scale datasets. Students will use these data for applied examples over the 10 days of the course.
The remaining portion of class (approximately 1.5-2 hours) will be devoted to hands-on learning with R, simulated data, and the large-scale datasets provided by the instructor.

Install R and RStudio

R: https://cran.r-project.org/
RStudio: https://posit.co/download/rstudio-desktop/

Required Readings
There are no required books in this course. Rather, I will make reference to material listed below in rough proportion to the suggested books > additional suggested books > other related readings. There is also a set of applied articles, listed in the other related readings subsections, that I will reference. Think of these as useful references and places to find examples. The primary course content will be the R programming lessons.

Suggested Books

Jones, Owen, Robert Maillardet, and Andrew Robinson. 2014. Introduction to Scientific Programming and Simulation Using R. Second Edition. CRC Press. https://nyu-cdsc.github.io/learningr/assets/simulation.pdf
Matloff, Norman. 2011. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press. https://nostarch.com/artofr.htm
Bolker, Ben. 2007. Ecological Models and Data in R. Princeton, NJ: Princeton University Press. https://press.princeton.edu/books/hardcover/9780691125220/ecological-models-and-data-in-r
Davies, Tilman M. 2016. The Book of R: A First Course in Programming and Statistics. No Starch Press. https://nostarch.com/bookofr
Efron, Bradley and Trevor Hastie. 2016. Computer Age Statistical Inference. Cambridge University Press. https://web.stanford.edu/~hastie/CASI/
Gelman, Andrew and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press. http://www.stat.columbia.edu/~gelman/arm/
Matloff, Norman. 2024. The Art of Machine Learning: A Hands-On Guide to Machine Learning with R. No Starch Press. https://nostarch.com/art-machine-learning

Suggested User Guides and Reference Manuals

Stan Development Team. 2024. Stan Modeling Language: User’s Guide and Reference Manual. Version 2.35. https://mc-stan.org/docs/stan-users-guide/index.html
Wickham, Hadley. “The tidyverse style guide.” https://style.tidyverse.org
R graph gallery. http://r-graph-gallery.com/
