Burt Monroe is Liberal Arts Professor of Political Science, Social Data Analytics, and Informatics at Pennsylvania State University, where he also serves as Director of Penn State’s Center for Social Data Analytics (C-SoDA) and Head of the Program in Social Data Analytics (SoDA). His research spans comparative politics, where he examines political communication and the impact of electoral and legislative institutions on political behavior and outcomes, and methodology, especially “text-as-data” and other data-intensive and computationally intensive settings at the intersection of data science and social science. His research has been published in the American Political Science Review, American Journal of Political Science, Political Analysis, and elsewhere. Prior to joining the Department of Political Science at Penn State, he was a faculty member at Indiana University and Michigan State University, and a postdoctoral researcher at Harvard University.

 

Course content:

This course explores the application of state-of-the-art developments in the field of Natural Language Processing (NLP) for social scientific text-as-data problems.

We will start with an overview of typical objectives in NLP, as a scientific field and as an area of data science practice, and compare these with the typical goals (often measurement) of text-as-data research in social science. Throughout the course, we will focus on connecting ideas and tools from NLP to familiar document classification tasks like sentiment analysis and familiar unsupervised learning tasks like scaling and topic modeling.
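
To fix ideas, here is a minimal Python sketch of the kind of supervised document classification pipeline these familiar tasks rely on (using scikit-learn; the toy texts and labels are placeholders, not course data):

    # Minimal bag-of-words sentiment classifier: count features + logistic regression.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["great speech, inspiring vision",
             "weak arguments and empty promises",
             "a hopeful and optimistic address",
             "dull, evasive, and disappointing"]
    labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

    clf = make_pipeline(CountVectorizer(), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["an inspiring and hopeful message"]))  # expect [1]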

We will then turn to a discussion of different ways of representing text beyond the “bag of words.” We will first discuss the use of NLP end products for extracting other textual features, like parts of speech or named entities. We will begin to explore how “language models” are developed in NLP to account for context, sequence, and syntax in these tasks.
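
As a preview of what these NLP end products look like in practice, the following sketch tags parts of speech and extracts named entities with spaCy (it assumes the small English model has been installed via "python -m spacy download en_core_web_sm"; the example sentence is made up):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Angela Merkel addressed the Bundestag in Berlin on Tuesday.")

    for token in doc:
        print(token.text, token.pos_, token.dep_)  # part of speech, dependency relation
    for ent in doc.ents:
        print(ent.text, ent.label_)                # named entities, e.g. PERSON, GPE, DATE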

Next we will get our first introduction to representing text with word embeddings and closely related ideas. Applications of word embeddings and related representations have exploded in recent years. We will discuss ways of interpreting word embeddings and how they are being applied in social science. We will discuss both the use of “pretrained” embeddings and estimation of your own embeddings.
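
As a preview, the sketch below contrasts the two approaches using gensim (the "glove-wiki-gigaword-50" pretrained vectors are available through gensim's downloader and are fetched on first use; the tiny training corpus is a placeholder):

    import gensim.downloader as api
    from gensim.models import Word2Vec

    # (a) pretrained embeddings: load GloVe vectors and query nearest neighbors
    glove = api.load("glove-wiki-gigaword-50")
    print(glove.most_similar("parliament", topn=5))

    # (b) your own embeddings: train word2vec on a (toy) tokenized corpus
    corpus = [["the", "minister", "addressed", "parliament"],
              ["the", "opposition", "criticized", "the", "minister"]]
    model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)
    print(model.wv.most_similar("minister", topn=2))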

We will then move into the topic that will occupy most of our attention in this class: neural networks and deep learning in NLP. This is a vast, complicated subject, and the knowledge frontier is expanding at an amazing pace. I will aim to give you intuition for core concepts and practical advice sufficient for you to begin developing and using your own deep learning models for tasks like document classification, and for you to engage with the rapidly changing state of the art going forward. We will begin with feedforward networks, and then we will focus on two classes of deep learning models that have been particularly successful in NLP: recurrent neural nets and transformers. In particular, we will focus on pretrained models like ELMo and BERT that can be used for transfer learning. Transfer learning allows us to leverage models trained on massive amounts of text and, for example, build a classifier for our own datasets using much less training data than would have been conventionally required.
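
To give a sense of how little code transfer learning can require in practice, here is a sketch that applies an off-the-shelf pretrained transformer to document classification with the Hugging Face transformers library (the DistilBERT checkpoint named below is one publicly available sentiment model; fine-tuning the same class of model on your own labeled documents follows a similar interface):

    from transformers import pipeline

    # Load a transformer already fine-tuned for sentiment classification.
    classifier = pipeline("sentiment-analysis",
                          model="distilbert-base-uncased-finetuned-sst-2-english")

    print(classifier(["The committee's report was thorough and persuasive.",
                      "The hearing was a chaotic waste of time."]))
    # e.g., [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]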

In the final sessions, we will look at two social scientific application areas for NLP techniques: multilingual text analysis and political event data. The last day will be reserved for other topics according to student interest.

Most days will be split into roughly 2 hours of lecture and 1.5 hours of computing tutorials. Where possible, computing examples will be demonstrated in both Python and R. Students should be aware that some modern NLP models are extremely computationally intensive, requiring GPUs and/or hours/days for realistic examples to be completed. In these cases, tutorials will be limited to “toy examples” or will be demonstrated only partially live in class (like a cooking show that doesn’t make you wait two hours while something bakes in the oven).

 

Course objectives:

Students will gain an understanding of important concepts and tools arising from contemporary NLP and how they can be applied in social science text-as-data research. These tools can improve familiar text-as-data document classification tasks like sentiment analysis and familiar unsupervised learning tasks like scaling and topic modeling, as well as applications that require more language-aware models, like multilingual text analysis or creation of political event data.

Course prerequisites:

Participants are assumed to have completed a course in text-as-data / quantitative text analysis covering basic text processing, supervised learning (e.g., classification), and unsupervised learning (e.g., scaling, topic modeling), such as 1B. Some facility with Python and/or R is assumed.

Course outline:

 

Day 1. Introduction / overview. NLP & social science text-as-data. Software overview.

Kenneth Benoit. 2020. “Text as Data: An Overview.” In Robert Franzese and Luigi Curini, eds. SAGE Handbook of Research Methods in Political Science and International Relations. http://dx.doi.org/10.4135/9781526486387.n29

Jacob Eisenstein. 2018. “Introduction.” Natural Language Processing.

Day 2. Beyond the bag of words: language models, part-of-speech tagging, named entities, and parsing.

Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapters 3, 8, 14. “N-gram Language Models,” “Sequence Labeling for Parts of Speech and Named Entities,” “Dependency Parsing.” https://web.stanford.edu/~jurafsky/slp3/

Day 3. Word embeddings

Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapter 6, “Vector Semantics and Embeddings.” https://web.stanford.edu/~jurafsky/slp3/

Pedro Rodriguez and Arthur Spirling. Forthcoming. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” Journal of Politics.

 

Day 4 & 5. Neural networks and deep learning for NLP

Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapter 7, “Neural Networks and Neural Language Models.” https://web.stanford.edu/~jurafsky/slp3/

Jay Alammar. 2016. “A Visual and Interactive Guide to the Basics of Neural Networks.” https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/

Jay Alammar. 2016. “A Visual and Interactive Look at Basic Neural Network Math.”

Christopher Olah. 2014. “Deep Learning, NLP, and Representations.” http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

Kakia Chatsiou and Slava Jankin Mikhaylov. 2020. “Deep Learning for Political Science.” In Robert Franzese and Luigi Curini, eds. SAGE Handbook of Research Methods in Political Science and International Relations. http://dx.doi.org/10.4135/9781526486387.n58

 

Day 6. Recurrent neural nets and transformers

Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapter 9, “Deep Learning Architectures for Sequence Processing.” https://web.stanford.edu/~jurafsky/slp3/

Jay Alammar. 2019. “The Illustrated GPT-2 (Visualizing Transformer Language Models).” http://jalammar.github.io/illustrated-gpt2/

Andrew Halterman. 2019. “Geolocating Political Events.”

Han Zhang and Jennifer Pan. 2019. “CASM: A Deep-Learning Approach for Identifying Collective Action Events from Text and Image Data.” Sociological Methodology 49(1): 1-57. http://dx.doi.org/10.1177/0081175019860244.

 

Day 7. Contextual embeddings, pretrained language models, and transfer learning

Noah Smith. 2019. “Contextual Word Representations: A Contextual Introduction.”

Jay Alammar. 2018. “The Illustrated BERT, ELMo and Co. (How NLP Cracked Transfer Learning).” http://jalammar.github.io/illustrated-bert/

Jay Alammar. 2019. “A Visual Guide to Using BERT for the First Time.” http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

Zhanna Terechshenko, Fridolin Linder, Vishakh Padmakumar, Michael Liu, Jonathan Nagler, Joshua A. Tucker, and Richard Bonneau. 2020. “A Comparison of Methods in Political Science Text Classification: Transfer Learning Language Models for Politics.” https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3724644

 

Day 8. Multilingual text-as-data

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2019. “A Survey of Cross-lingual Word Embedding Models.” Journal of Artificial Intelligence Research 65: 569-631. https://doi.org/10.1613/jair.1.11640

Mitchell Goist and Burt L. Monroe. 2020. “Scaling the Tower of Babel: Common-Space Analysis of Political Text in Multiple Languages.”

Leah C. Windsor, James G. Cupit, and Alistair J. Windsor. 2019. “Automated content analysis across six languages.” PLOS ONE 14(11): e0224425. https://doi.org/10.1371/journal.pone.0224425

 

Day 9. Political event data.

Dan Jurafsky & James Martin (2020), Speech & Language Processing (3rd edition draft). Chapter 17, “Information Extraction.” https://web.stanford.edu/~jurafsky/slp3/

John Beieler, Patrick T. Brandt, Andrew Halterman, Philip A. Schrodt, and Erin M. Simpson. 2016. “Generating Political Event Data in Near Real Time.” In R. Michael Alvarez, ed. Computational Social Science. Cambridge: Cambridge University Press.

 

Day 10. Further topics and applications to suit student interests.