3H Advanced Methods for Text as Data: Natural Language Processing

Please note: This course will be taught online only. In-person study is not available for this course.

Nicole Baerg is a Senior Lecturer in the Department of Government at the University of Essex. Previously, she was a Senior Researcher in Data Science at the Bank of England. She is also a Senior Visiting Fellow at the Data Science Institute, London School of Economics and Political Science. Her research areas include Comparative Politics & International Relations, Political Institutions, Central Banking, Political Text Analysis, and Computational Social Science.

Course content:

With the recent explosion in the availability of digitized text and expanded access to computing power, social scientists are increasingly leveraging advanced computational tools for the analysis of text as data. In this course, students will explore the application of many advanced approaches for text-as-data research in the social sciences.

The course will begin with an overview of norms in reproducible research, including how to write reproducible code and reproducibility issues with large language models. We will also discuss issues related to bias and fairness in text and NLP.

Then, we will have a short overview of the “bag of words” approach including traditional machine learning models like Naive Bayes and SVM as well as topic and scaling methods. We will then think about static embedding models like word2vec and GloVe.
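As a rough illustration of the "bag of words" idea and a Naive Bayes classifier, the following minimal sketch uses only the Python standard library. The toy documents, the "hawkish"/"dovish" labels, and the function names (tokenize, train_naive_bayes, predict) are invented for illustration and are not course materials; in practice one would more likely rely on an established package in R or Python.

```python
import math
from collections import Counter, defaultdict

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def train_naive_bayes(docs, labels):
    """Collect per-class word counts (the bag-of-words representation)."""
    class_doc_counts = Counter(labels)   # number of documents per class
    word_counts = defaultdict(Counter)   # word frequencies per class
    vocab = set()
    for doc, label in zip(docs, labels):
        counts = Counter(tokenize(doc))  # word order is discarded here
        word_counts[label].update(counts)
        vocab.update(counts)
    return class_doc_counts, word_counts, vocab

def predict(model, text, alpha=1.0):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    class_doc_counts, word_counts, vocab = model
    n_docs = sum(class_doc_counts.values())
    scores = {}
    for label, n in class_doc_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(n / n_docs)     # log prior
        for word in tokenize(text):
            if word not in vocab:        # ignore words never seen in training
                continue
            numerator = word_counts[label][word] + alpha
            denominator = total + alpha * len(vocab)
            score += math.log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus, invented purely for illustration.
docs = [
    "The committee raised interest rates to fight inflation",
    "Inflation pressures require higher rates",
    "The committee cut rates to support growth",
    "Weak growth calls for lower rates and stimulus",
]
labels = ["hawkish", "hawkish", "dovish", "dovish"]

model = train_naive_bayes(docs, labels)
print(predict(model, "inflation is rising"))
print(predict(model, "growth is weak"))
```

The classifier sees only word counts, never word order. That is the simplifying assumption the embedding models introduced later in the course relax.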

We will then further our understanding of labelling text for content as well as contextual embeddings. We will also explore transfer learning, or how to leverage pretrained models for application in our own specific domains, directly comparing applications to sentiment and emotion in texts.

Finally, we will explore causal inference with text.

Most days will be split into roughly 2 hours of lecture and 1.5 hours of computing tutorials. Where possible, computing examples will be demonstrated in R though some things may be discussed and shown in Python.

Course objectives:

Students will gain an understanding of important concepts and tools at the leading edge of text-as-data research and how they can be applied in the social sciences. In so doing, the course will equip students as knowledgeable consumers of advanced text-as-data research and provide them with the tools to design and complete more advanced approaches leveraging text in their own work.

Course prerequisites:

Participants are assumed to have completed a course in text-as-data / quantitative text analysis covering basic text processing, supervised learning (e.g., classification), and unsupervised learning (e.g., scaling, topic modeling), such as 1B. Some facility with R is assumed.

Background Knowledge

Statistics

OLS = Moderate

Maximum Likelihood = Moderate

Software

R = Elementary

Python = Elementary

Class 1: Where is the discipline headed? Concepts

– Clark, William Roberts, and Matt Golder. “Big data, causal inference, and formal theory: Contradictory trends in political science?: Introduction.” PS: Political Science & Politics 48.1 (2015): 65-70.

– Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. “Machine learning for social science: An agnostic approach.” Annual Review of Political Science 24 (2021): 395-419.

– De Marchi, S., Stewart, B., Curini, L., & Franzese, R. Wrestling with complexity in computational social science: Theory, estimation and representation (2020): p. 17. SAGE Publications.

– Grimmer, Justin. “We are all social scientists now: How big data, machine learning, and causal inference work together.” PS: Political Science & Politics 48.1 (2015): 80-83.

Class 2: Where is the discipline headed? Practicalities

– Ofosu, George K., and Daniel N. Posner. “Pre-analysis plans: An early stocktaking.” Perspectives on Politics (2021): 1-17.

– Rodrigues, B. Building reproducible analytical pipelines with R (1st ed.). (2023). https://raps-with-r.dev [Section 1].

– Palmer, Alexis, Noah A. Smith, and Arthur Spirling. “Using proprietary language models in academic research requires explicit justification.” Nature Computational Science (2023): 1-2.

Class 3: Bias and Fairness

– Wallach, H. Computational social science ≠ computer science + social data. Communications of the ACM, 61(3), (2018): 42-44.

– Tannen, Deborah. “The power of talk: Who gets heard and why.” Harvard Business Review 73 (1995): 138-138.

– Hovy, Dirk, and Shrimai Prabhumoye. “Five sources of bias in natural language processing.” Language and Linguistics Compass 15.8 (2021): e12432.

Class 4: The Old Fashioned?

– Baerg, N., & Lowe, W. A textual Taylor rule: estimating central bank preferences combining topic and scaling methods. Political Science Research and Methods, 8(1), (2020):106-122.

– Baerg, Nicole Rae, and Colin Krainin. “Divided committees and strategic vagueness.” European Journal of Political Economy 74 (2022): 102240.

– De Vries, Erik, Martijn Schoonvelde, and Gijs Schumacher. “No longer lost in translation: Evidence that Google Translate works for comparative bag-of-words text applications.” Political Analysis 26.4 (2018): 417-430.

Class 5: Textual Analysis and Scaling

– Lowe, Will, Kenneth Benoit, Slava Mikhaylov, and Michael Laver. “Scaling policy preferences from coded political texts.” Legislative Studies Quarterly 36.1 (2011): 123-155.

– Slapin, Jonathan B., and Sven-Oliver Proksch. “A scaling model for estimating time-series party positions from texts.” American Journal of Political Science 52.3 (2008): 705-722.

– Watanabe, Kohei. “Latent semantic scaling: A semisupervised text analysis technique for new domains and languages.” Communication Methods and Measures 15.2 (2021): 81-102.

Class 6: Word Embeddings

– Ash, Elliott, and Stephen Hansen. “Text algorithms in economics.” Annual Review of Economics 15 (2023): 659-688.

– Watanabe, Kohei, and Marius Sältzer. “Semantic temporality analysis: A computational approach to time in English and German texts.” Research & Politics 10.3 (2023).

– Rodriguez, P. L., & Spirling, A. (2022). Word embeddings: What works, what doesn’t, and how to tell the difference for applied research. The Journal of Politics, 84(1), 101-115.

Class 7: Embeddings, (distil)BERT, and Transfer Learning

– Jay Alammar. 2018. “The Illustrated BERT, ELMo and Co. (How NLP Cracked Transfer Learning).” http://jalammar.github.io/illustrated-bert/

– Jay Alammar. 2019. “A Visual Guide to Using BERT for the First Time.” http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

– Pfeifer, Moritz, and Vincent P. Marohl. “CentralBankRoBERTa: A fine-tuned large language model for central bank communications.” The Journal of Finance and Data Science 9 (2023): 100114.

Class 8: Text Extraction and NLP

– Munday, Tim, and James Brookes. “Mark my words: the transmission of central bank communication to the general public via the print media.” (2021).

– Shufan Wang, Laure Thompson, and Mohit Iyyer. 2021. “Phrase-BERT: Improved Phrase Embeddings from BERT with an application to Corpus Exploration”. https://arxiv.org/abs/2109.06304

Class 9: Causality and Text as Data

– Mozer, R., Miratrix, L., Kaufman, A. R., & Anastasopoulos, L. J. (2020). Matching with text data: An experimental evaluation of methods for matching documents and of measuring match quality. Political Analysis, 28(4), 445-468.

– Naoki Egami, Christian Fong, Justin Grimmer, Margaret Roberts, and Brandon Stewart. “How to Make Causal Inferences Using Texts”. https://arxiv.org/abs/1802.02163

– Margaret Roberts, Brandon Stewart, and Richard Nielsen. 2020. “Adjusting for Confounding with Text Matching.” American Journal of Political Science

Class 10: Student Projects & Presentations