Burt Monroe is Liberal Arts Professor of Political Science, Social Data Analytics, and Informatics at Pennsylvania State University, where he also serves as Director of Penn State’s Center for Social Data Analytics (C-SoDA) and Head of the Program in Social Data Analytics (SoDA). His research spans comparative politics, where he examines political communication and the impact of electoral and legislative institutions on political behavior and outcomes, and methodology, especially “text-as-data” and other data-intensive and computationally intensive settings at the intersection of data science and social science. His research has been published in the American Political Science Review, American Journal of Political Science, Political Analysis, and elsewhere. Prior to joining the Department of Political Science at Penn State, he was a faculty member at Indiana University and Michigan State University, and a postdoctoral researcher at Harvard University.
This course explores the application of state-of-the-art developments in the field of Natural Language Processing (NLP) for social scientific text-as-data problems.
We will start with an overview of typical objectives in NLP, as a scientific field and as an area of data science practice, and compare these with the typical goals (often measurement) of text-as-data research in social science. Throughout the course, we will focus on connecting ideas and tools from NLP to familiar document classification tasks like sentiment analysis and familiar unsupervised learning tasks like scaling and topic modeling.
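As a concrete point of reference for the kind of document classification we will build on, the sketch below fits a bag-of-words Naive Bayes sentiment classifier with scikit-learn. The tiny labeled “corpus” is purely hypothetical and for illustration only:

```python
# A minimal sentiment-classification sketch, assuming scikit-learn is
# installed; the tiny labeled corpus below is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "great wonderful speech",
    "terrible awful debate",
    "wonderful result",
    "awful terrible outcome",
]
labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words counts feeding a Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)

prediction = clf.predict(["a wonderful speech"])[0]
```

Note that the bag-of-words representation here discards all word order and context, which is exactly the limitation the NLP tools in this course address.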
We will then turn to a discussion of different ways of representing text beyond the “bag of words.” We will first discuss using the outputs of NLP pipelines to extract additional textual features, like parts of speech or named entities. We will then begin to explore how “language models” are developed in NLP to account for context, sequence, and syntax in these tasks.
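To make the idea of a language model concrete: the simplest such model estimates the probability of each word given the preceding word from bigram counts. A minimal sketch, using a hypothetical pre-tokenized toy corpus:

```python
from collections import Counter

# Hypothetical toy corpus, pre-tokenized for simplicity.
tokens = "the party won the vote . the party lost the vote .".split()

# Count bigrams and the unigram contexts they condition on.
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bigram_counts[(w1, w2)] / context_counts[w1]

# "party" follows "the" in 2 of the 4 contexts where "the" appears.
bigram_prob("the", "party")  # 0.5
```

Modern neural language models replace these raw counts with learned representations, but the underlying task, predicting words from context, is the same.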
Next we will get our first introduction to representing text with word embeddings and closely related ideas. Applications of word embeddings and related representations have exploded in recent years. We will discuss ways of interpreting word embeddings and how they are being applied in social science, covering both the use of “pretrained” embeddings and the estimation of your own.
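The core intuition is that words become points in a vector space, with similarity measured by the angle between vectors. The sketch below illustrates this with cosine similarity; the 4-dimensional vectors are hypothetical toy values, not real pretrained embeddings:

```python
import numpy as np

# Hypothetical 4-dimensional toy embeddings; real pretrained vectors
# (e.g., GloVe) typically have 50-300 dimensions.
embeddings = {
    "senate":   np.array([0.9, 0.8, 0.1, 0.0]),
    "congress": np.array([0.8, 0.9, 0.2, 0.1]),
    "banana":   np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine(u, v):
    """Cosine similarity: the cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words should sit closer in the space.
sim_related = cosine(embeddings["senate"], embeddings["congress"])
sim_unrelated = cosine(embeddings["senate"], embeddings["banana"])
```

With real embeddings estimated from large corpora, these geometric relationships are learned from co-occurrence patterns rather than assigned by hand.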
We will then move into the topic that will occupy most of our attention in this class: neural networks and deep learning in NLP. This is a vast, complicated subject, and the knowledge frontier is expanding at an amazing pace. I will aim to give you intuition for core concepts and practical advice sufficient for you to begin developing and using your own deep learning models for tasks like document classification, and to engage with the rapidly changing state of the art going forward. We will begin with feedforward networks, and then focus on two classes of deep learning models that have been particularly successful in NLP: recurrent neural networks and transformers. In particular, we will focus on pretrained models like ELMo and BERT that can be used for transfer learning. Transfer learning allows us to leverage models trained on massive amounts of text and, for example, build a classifier for our own datasets using much less training data than would conventionally be required.
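As a preview of the feedforward building block, the sketch below trains a one-hidden-layer network by gradient descent on the classic XOR problem, which no linear classifier can solve. This is a toy NumPy illustration of the forward/backward machinery, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: the classic pattern a linear classifier cannot fit.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 8 tanh units, one sigmoid output unit.
W1 = rng.normal(0.0, 1.0, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0.0, 1.0, (8, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
losses = []
for step in range(5000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    # Backward pass: hand-derived gradients of the cross-entropy loss.
    dz2 = (p - y) / len(X)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

Recurrent networks and transformers are built from this same forward-pass/backpropagation machinery, with architectures designed to handle sequences.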
In the final sessions, we will look at two social scientific application areas for NLP techniques: multilingual text analysis and political event data. The last day will be reserved for other topics according to student interest.
Most days will be split into roughly 2 hours of lecture and 1.5 hours of computing tutorials. Where possible, computing examples will be demonstrated in both Python and R. Students should be aware that some modern NLP models are extremely computationally intensive, requiring GPUs and/or hours or days of run time for realistic examples. In these cases, tutorials will be limited to “toy examples” or will be demonstrated only partially live in class (like a cooking show that doesn’t make you wait two hours while something bakes in the oven).
Students will gain an understanding of important concepts and tools arising from contemporary NLP and how they can be applied in social science text-as-data research. These tools can improve familiar text-as-data document classification tasks like sentiment analysis and familiar unsupervised learning tasks like scaling and topic modeling, as well as applications that require more language-aware models, like multilingual text analysis or creation of political event data.
Participants are assumed to have completed a course in text-as-data / quantitative text analysis covering basic text processing, supervised learning (e.g., classification), and unsupervised learning (e.g., scaling, topic modeling), such as 1B. Some facility with Python and/or R is assumed.