This course is now full, and we are operating a waiting list. Please complete an application form if you would like to be added to the waiting list.

Please note: This course will be taught online only. In-person study is not available for this course.

Douglas Rice is Associate Professor of Political Science and Legal Studies at the University of Massachusetts Amherst, where he is also a faculty affiliate of the Computational Social Science Institute and the Data Analytics and Computational Social Science (DACSS) graduate program. His research examines judicial policymaking in American politics, with a particular interest in the power of courts in the American policymaking context and the implications of according policymaking power to judicial institutions in a democratic political system. His work has appeared in The Journal of Politics, Political Research Quarterly, The Journal of Law, Economics, and Organization, Political Science Research and Methods, The Journal of Law and Courts, American Politics Research, and other journals, as well as in a book, Lighting the Way: Federal Courts, Civil Rights, and Public Policy, published by the University of Virginia Press.

Course content:

With the recent explosion in the availability of digitized text and expanded access to computing power, social scientists are increasingly leveraging advanced computational tools for the analysis of text as data. In this course, students will explore the application of many advanced approaches for text-as-data research in the social sciences.

The course will begin with an overview of text-as-data research for social scientists, orienting students to the general area and contextualizing the advanced approaches we will explore in the class. Then, we will begin to extend our text-as-data work beyond the “bag of words” to models that better represent the richness of text.
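As a baseline for the richer representations the course moves toward, the familiar bag-of-words model can be sketched in a few lines of Python. The documents below are made-up toy examples; each becomes a vector of word counts, discarding word order entirely:

```python
from collections import Counter

# Two hypothetical toy documents.
docs = [
    "the court ruled on the case",
    "the case was ruled unconstitutional",
]

# Build a shared vocabulary across all documents.
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(doc):
    """Map a document to a count vector over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bag_of_words(d) for d in docs]
print(vocab)
print(vectors)
```

Note that the two vectors record only how often each word appears; any model built on them treats “the court ruled on the case” and a scrambled version of it identically, which is exactly the limitation richer representations address.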

Next, the course will turn to embedding-based representations of texts and the underlying distributional theory. We will begin with static embedding models like word2vec and GloVe, and will discuss the benefits and utility of embedding-based representations for social science research.  
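The distributional theory underlying word2vec and GloVe, namely that words occurring in similar contexts should receive similar vectors, can be illustrated with a toy co-occurrence matrix. The corpus below is invented for illustration; real embedding models learn dense, low-dimensional vectors from far larger corpora:

```python
import numpy as np

# Hypothetical mini-corpus of short "sentences".
corpus = [
    "judge issued ruling",
    "court issued ruling",
    "judge issued opinion",
    "court issued opinion",
    "dog chased ball",
]

# Build a word-by-word co-occurrence matrix within each sentence.
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j, c in enumerate(words):
            if i != j:
                M[idx[w], idx[c]] += 1

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "judge" and "court" share contexts, so their vectors are similar;
# "judge" and "dog" share none, so theirs are not.
print(cosine(M[idx["judge"]], M[idx["court"]]))
print(cosine(M[idx["judge"]], M[idx["dog"]]))
```

Here “judge” and “court” end up with identical context vectors because they appear with exactly the same neighboring words, while “judge” and “dog” share no contexts at all; word2vec and GloVe arrive at the same intuition through learned dense vectors rather than raw counts.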

We will then further our work on embeddings by transitioning to contextual embeddings. To inform our understanding of pretrained contextual embedding models like ELMo and BERT, we will explore neural networks and deep learning in NLP, and will learn how to develop and deploy our own deep learning models. In doing so, we will cover feedforward neural networks, recurrent neural networks, and transformers. Then, we will explore transfer learning, or how to leverage pretrained models for application in our own specific domains.
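As a preview of the feedforward networks covered in this part of the course, a single-hidden-layer forward pass can be written directly in NumPy. The dimensions and random weights below are arbitrary placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """Elementwise rectified linear unit."""
    return np.maximum(0, x)

def softmax(x):
    """Convert scores to probabilities; subtract max for stability."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical dimensions: a 4-dim input (e.g. a tiny document vector),
# 8 hidden units, 3 output classes.
W1 = rng.normal(size=(8, 4)); b1 = np.zeros(8)
W2 = rng.normal(size=(3, 8)); b2 = np.zeros(3)

def forward(x):
    h = relu(W1 @ x + b1)        # hidden layer
    return softmax(W2 @ h + b2)  # class probabilities

probs = forward(np.array([1.0, 0.0, 2.0, 0.5]))
print(probs)  # three nonnegative probabilities summing to 1
```

Recurrent networks and transformers stack and rearrange exactly these ingredients (linear maps, nonlinearities, and normalized outputs), which is why the course builds up from the feedforward case.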

Finally, we will explore an area of increasing interest at the confluence of NLP and social science research: causal inference with text. In this section, we’ll explore how and where text is being used as part of causal research designs, with a focus on efforts to leverage embedding-based representations in those designs.

Most days will be split into roughly 2 hours of lecture and 1.5 hours of computing tutorials. Where possible, computing examples will be demonstrated in both Python and R. Students should be aware that some modern NLP models are extremely computationally intensive, requiring GPUs and/or hours/days for realistic examples to be completed. In these cases, tutorials will be limited to “toy examples” or will be demonstrated only partially live in class.

Course objectives:

Students will gain an understanding of important concepts and tools at the leading edge of text-as-data research and how they can be applied in social science text-as-data research. In so doing, the course will equip students as knowledgeable consumers of advanced text-as-data research and provide them with the tools to design and complete more advanced approaches leveraging text in their own work.

Course prerequisites:

Participants are assumed to have completed a course in text-as-data / quantitative text analysis covering basic text processing, supervised learning (e.g., classification), and unsupervised learning (e.g., scaling, topic modeling), such as 1B. Some facility with Python and/or R is assumed.

Background Knowledge

Maths

Calculus = Elementary

Linear Regression = Elementary

Statistics

OLS = Elementary

Maximum Likelihood = Elementary

Software

R = Moderate

Python = Moderate

Day 1. Introduction

 

Kenneth Benoit. 2019. “Text as Data: An Overview.” The SAGE Handbook of Research Methods in Political Science and International Relations.

 

Day 2. Language Models

 

Jacob Eisenstein. 2021. Chapter 1, Natural Language Processing. https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf

 

Daniel Jurafsky & James Martin. 2020. Chapter 2, Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/2.pdf

 

Day 3. Embedding Representations

 

Daniel Jurafsky & James Martin. 2020. Chapter 6, Speech and Language Processing, “Vector Semantics and Embeddings.” https://web.stanford.edu/~jurafsky/slp3/

 

Day 4. Intro to Neural Networks 

 

Daniel Jurafsky & James Martin. 2020. Chapter 7, Speech and Language Processing, “Neural Networks and Neural Language Models.” https://web.stanford.edu/~jurafsky/slp3/

 

Jay Alammar. 2016. “A Visual and Interactive Guide to the Basics of Neural Networks.” https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/

 

Jay Alammar. 2016. “A Visual and Interactive Look at Basic Neural Network Math.” https://jalammar.github.io/feedforward-neural-networks-visual-interactive/

 

Day 5. Neural Networks for NLP

 

Jay Alammar. 2019. “The Illustrated GPT-2 (Visualizing Transformer Language Models).” http://jalammar.github.io/illustrated-gpt2/

 

Day 6. Intro to Contextual Embeddings

 

Noah Smith. 2019. “Contextual Word Vectors: A Contextual Introduction.” https://arxiv.org/abs/1902.06006

 

Jay Alammar. 2018. “The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning).” http://jalammar.github.io/illustrated-bert/

 

Jay Alammar. 2019. “A Visual Guide to Using BERT for the First Time.” http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

 

Day 7. Contextual Embeddings, Continued

 

Shufan Wang, Laure Thompson, and Mohit Iyyer. 2021. “Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration.” https://arxiv.org/abs/2109.06304

 

Day 8. Causal Inference and Text-as-Data: Text as Treatment

 

Naoki Egami, Christian Fong, Justin Grimmer, Margaret Roberts, and Brandon Stewart. 2018. “How to Make Causal Inferences Using Texts.” https://arxiv.org/abs/1802.02163

 

Day 9. Causal Inference and Text-as-Data: Text as Confounder

 

Margaret Roberts, Brandon Stewart, and Richard Nielsen. 2020. “Adjusting for Confounding with Text Matching.” American Journal of Political Science.

 

Reagan Mozer, Luke Miratrix, Aaron Kaufman, and Jason Anastasopoulos. 2020. “Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and Measuring Match Quality.” Political Analysis 28(4): 445-468.

 

Day 10. Causal Inference and Text-as-Data: Text as Outcome

 

Dhanya Sridhar and Lise Getoor. 2019. “Estimating Causal Effects of Tone in Online Debates.” https://arxiv.org/abs/1906.04177

 

Michael Gill and Andrew Hall. “How Judicial Identity Changes the Text of Legal Rulings.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2620781

 

This course outline was heavily influenced by the excellent resources put together by Burt Monroe (here) and the organizers of the Causal Inference & NLP Workshop (here).