I am currently an Assistant Professor in the Department of Political Science and Faculty Associate in the Center for Political Studies at the University of Michigan. Prior to beginning these appointments, I was the Jeffrey L. Hyde and Sharon D. Hyde and Political Science Board of Visitors Early Career Professor in Political Science in the Department of Political Science at Penn State University. I am also an Affiliated Scholar at the Security and Political Economy (SPEC) Lab at the University of Southern California. In June 2013, I graduated with a Ph.D. in political science from the University of California, San Diego. I also studied at the University of North Texas, where I graduated with an M.S. in political science (2007), a B.F.A. in drawing and painting (2005), and a B.A. in political science (2005).
My core research focuses on the politics and measurement of human rights, discrimination, violence, and repression. I use computational methods to understand why governments around the world torture, maim, and kill individuals within their jurisdiction and the processes monitors use to observe and document these abuses. Other projects cover a broad array of themes but share a focus on computationally intensive methods and research design. These methodological tools, essential for analyzing data at massive scale, open up new insights into the micro-foundations of state repression and the politics of measurement.
This course focuses on the research design and data analysis tools used to explore and understand social media and text data. The fundamentals of research design are the same throughout the social sciences; however, the topical focus of this class is on computationally intensive data generating processes and the research designs used to understand and manipulate such data at scale.
By massive or large scale, I mean one of three things, or some combination of them: (1) there are many subjects/connections/units/rows in the data (e.g., social network data like the kind available from Twitter); (2) there are many variables/items/columns in the data (e.g., image or text data with many thousands of columns that represent the words in the document corpus); or (3) the selected analytical tool is a computationally complex algorithm (e.g., a Bayesian simulation for modeling a latent variable, a random forest model for exploratory data analysis, or a neural network for automatically classifying new observations). The course will provide students with the tools to design observational studies and experimental interventions into large and unstructured data sets at increasingly massive scales and at different degrees of computational complexity.
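To make the "many columns" case concrete, here is a minimal sketch of how a small text corpus becomes a document-term matrix with one column per distinct word. The course itself works in R; this sketch uses Python purely for illustration, and the function name `document_term_matrix` and the three example documents are hypothetical, not course materials.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # One column per distinct word across the whole corpus.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # a missing term counts as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical three-document corpus.
docs = [
    "the state represses the press",
    "monitors document abuses by the state",
    "the press reports abuses",
]
vocabulary, dtm = document_term_matrix(docs)
print(len(dtm), "rows x", len(vocabulary), "columns")
```

Even three short documents produce nine columns. In a real corpus the number of columns grows with the vocabulary, which is how text data reaches many thousands of columns.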
How will we go about learning these tools? In this class, we will learn to program and program to learn. What do I mean? First, we will use the R programming environment to learn the building blocks of programming. These skills are essential for managing the increasingly large and complex datasets of interest to social scientists (e.g., image data, text data).
As we develop programming skills in R, we will use them to help us understand how different types of data analysis tools work. For example, by the end of the course, students will be able to program and evaluate their own neural network or structural topic model from scratch.
We will start very small and learn how to scale up. At the beginning of the course, we will not make use of many packages other than the base packages available by default in R. As we proceed, we will learn how models for data work before investigating the functions that exist in the large, ever-growing catalogue of packages available for you to use in R. The development of new functions in R is advancing rapidly. The tools you learn in this class will help you improve as a programmer and a data scientist by learning how to program and using your programming skills to learn how to analyze data.
Students will learn how to design models for data that take advantage of the wealth of information contained in new massive-scale online datasets such as data available from Twitter, images, and the many newly digitized document corpora now available online. The focus of the course is on learning to program in R, with special attention paid to designing studies in such a way as to maximize the validity of inferences obtained from these complex datasets.
• Learn to program models in R at a small scale using the base package and a minimal number of other packages
• Use the tools from research design to assist in model development
• Validate models of observational data in comparison to an appropriate baseline model
• Develop simulation-based models for large-scale, observational data
• Develop and validate measurement (e.g., latent variable models, structural topic models) and classification models (e.g., neural networks) of text and image-based data
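The objective of validating a model against an appropriate baseline can be sketched as follows. This is a minimal illustration under stated assumptions: the data are simulated, the baseline is a simple mean-only prediction, and the helper names `fit_line` and `mse` are hypothetical. The course itself works in R; Python is used here only to show the logic.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(predicted, actual):
    """Mean squared prediction error."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Simulated data (an assumption): y = 1 + 2x + standard normal noise.
random.seed(1)
x = [random.uniform(0, 10) for _ in range(200)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

# Hold out the last 50 observations to evaluate out-of-sample fit.
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

# Baseline model: always predict the training-set mean of y.
baseline_pred = [sum(y_train) / len(y_train)] * len(y_test)

# Candidate model: linear regression fit on the training set only.
intercept, slope = fit_line(x_train, y_train)
model_pred = [intercept + slope * xi for xi in x_test]

baseline_mse = mse(baseline_pred, y_test)
model_mse = mse(model_pred, y_test)
print("baseline MSE:", round(baseline_mse, 2), "model MSE:", round(model_mse, 2))
```

The comparison matters only out of sample: a model earns its added complexity if it predicts held-out observations better than the baseline does.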
Students should have some familiarity with concepts from research design and statistics. Generally, exposure to these concepts occurs during the first-year course at a typical Ph.D. program in political science. Students should have at least some exposure to the R computing environment. The more familiarity with R, the better.
Required Reading Material
1. Matloff, Norman. 2011. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.
2. Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.
Background knowledge required
OLS = moderate
Maximum Likelihood = moderate
R = moderate
• I will begin each class day with a short lecture on the class material (approximately 45-60 minutes).
• After each lecture, students will discuss one or two articles as they relate to the lecture (approximately 30-45 minutes).
• On the first day of class, I will introduce students to two large-scale datasets. Students will use these data for applied examples over the 10 days of the course.
• The remaining portion of class (approximately 1.5-2 hours) will be devoted to hands-on learning with R, simulated data, and the large-scale datasets provided by the instructor. Days 7 and 9 will consist entirely of in-class lab work.
• The course schedule below provides more detail about the topic of the lecture for each class day, citations for the discussion readings, and chapter entries from the textbooks for the lab portions of the class.
Day 1: Methods of Observation and Inference
Introduction to Exploratory Data Analysis, Visualization, and Validation
• Adcock, Robert, and David Collier. 2001. “Measurement Validity: A Shared Standard for Qualitative and Quantitative Research.” American Political Science Review 95(3):529–546.
• Lazer, David, Alex (Sandy) Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James H. Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5919):721-723.
• Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25(3):289-310.
Note: Time permitting, we will get started with the lab material scheduled for day 2.
Day 2: Programming Social Media and “Big Data”
Introduction to Parallel Programming in R for “Big Data”.
• Lecture and lab material are drawn from chapters in the Art of R Programming textbook.
Day 3: Experimental Design
Designing and Implementing Randomized Manipulations of Large Scale Data Generating Processes
• Bond, Robert M., Christopher J. Fariss, Jason J. Jones, Adam D. I. Kramer, Cameron Marlow, Jaime E. Settle, and James H. Fowler. 2012. “A 61-Million-Person Experiment in Social Influence and Political Mobilization.” Nature 489(7415):295-298.
Day 4: Quasi-Experimental Designs
Exploiting Exogenous Shocks to Large Scale Data Generating Processes to Improve the Validity of Inferences
• Settle, Jaime E., Robert M. Bond, Lorenzo Coviello, Christopher J. Fariss, James H. Fowler, Jason J. Jones, Adam D. I. Kramer, and Cameron Marlow. 2015. “From Posting to Voting: The Effects of Political Competition on Online Political Engagement.” Political Science Research and Methods (Forthcoming).
Day 5: Measurement Theory: Data, Validity, and Reliability
Construct Validity and Models for Reducing High Dimensional Data for Visualization and Analysis
• Barberá, Pablo. 2015. “Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data.” Political Analysis 23(1):76-91.
Day 6: Text as Data Part 1
Introduction to Text as Data
• Roberts, Margaret E., Brandon Stewart, and Dustin Tingley. “Navigating the Local Modes of Big Data: The Case of Topic Models.” In Data Analytics in Social Science, Government, and Industry, New York: Cambridge University Press.
Day 7: Text as Data Part 2
Text as Data Lab
Day 8: Social Network Analysis Part 1
Introduction to Social Network Data and Analysis
• Christakis, Nicholas A. and James H. Fowler. 2012. “Social contagion theory: examining dynamic social networks and human behavior.” Statistics in Medicine 32(4):556-577.
Day 10: Ethical Responsibilities for the Social Data Scientist
Issues Relating to Transparency, Anonymity, Replication, and Reproduction in the Analysis of “Big Data”
• Driscoll, Jesse. “Prison States & Games of Chicken” working paper.
• Fariss, Christopher J. and Zachary M. Jones. “Enhancing Validity in Observational Settings When Replication is Not Possible” working paper.
• Jones, Jason J., Robert M. Bond, Christopher J. Fariss, Jaime E. Settle, Adam D. I. Kramer, Cameron Marlow, and James H. Fowler. 2013. “Yahtzee: An Anonymized Group Level Matching Procedure.” PLoS ONE 8(2):e55760.
This syllabus is based on several courses that I have taken and designed over the last several years. Some of the material is based on the Research Design (PL SC 501) course that I developed at Pennsylvania State University when I began teaching there in the fall of 2013, which itself is based on a similar course developed by David Lake and Mathew McCubbins at the University of California, San Diego. It is also based on material from a graduate measurement theory class (PL SC 597) and an undergraduate Social Data Analysis and Design class (SO DA 308), both of which I developed at Pennsylvania State University. Elements of the syllabus and other class materials created for this class are also based in part on the Bayesian Statistics class offered by Seth Hill at the University of California, San Diego and the Measurement class offered by Keith Poole at the University of California, San Diego and now the University of Georgia.