Nicole Rae Baerg, Ph.D., is a Senior Lecturer in the Government Department at the University of Essex. Previously, she was an Assistant Professor at the University of Mannheim (2014–2016) and a research intern at the Federal Reserve Bank of Atlanta (2008–2013). She is the author of Crafting Consensus: Why Central Bankers Change their Speech and How Speech Changes the Economy, published by Oxford University Press (2020). She is a political economist working on political text analysis, monetary and financial politics, and fiscal policy.
The course surveys methods for systematically extracting quantitative information from text for social scientific purposes. It starts with classical content analysis and dictionary-based methods, moves on to classification methods, and ends with state-of-the-art scaling methods and topic models that estimate quantities from text using statistical techniques. The course lays a theoretical foundation for text analysis but takes a mainly practical and applied approach, so that students learn how to apply these methods in actual research.
All of the methods covered can be reduced to a three-step process: first, identifying texts and units of text for analysis; second, extracting quantitatively measured features from the texts (such as coded content categories, word counts, word types, dictionary counts, or parts of speech) and converting these into a quantitative matrix; and third, using quantitative or statistical methods to analyse this matrix in order to generate inferences about the texts or their authors. The course surveys these methods in a logical progression and takes a hands-on approach: each technique will be applied to real texts in lab sessions using appropriate software.
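As a rough illustration of the three steps, the sketch below builds a small document-feature matrix from word counts. It is written in plain Python rather than the R and quanteda code used in the labs, and the three toy texts and all names in it are invented for the example.

```python
import re
from collections import Counter

# Step 1: identify the texts (three toy documents, invented for illustration).
texts = {
    "doc1": "Taxes should fall and spending should fall.",
    "doc2": "Spending on health should rise.",
    "doc3": "Taxes fund health spending.",
}

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Step 2: extract features (word counts) and arrange them in a matrix
# with one row per document and one column per feature.
counts = {name: Counter(tokenize(text)) for name, text in texts.items()}
features = sorted(set().union(*counts.values()))
dfm = [[counts[name][f] for f in features] for name in texts]

# Step 3: analyse the matrix, here by totalling each feature across documents.
feature_totals = {f: sum(row[j] for row in dfm) for j, f in enumerate(features)}

for name, row in zip(texts, dfm):
    print(name, dict(zip(features, row)))
```

Every later method in the course can be read as a different choice of features in step 2 or a different analysis in step 3.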
The course also covers many fundamental issues in quantitative text analysis, such as inter-coder agreement, reliability, validation, accuracy, and precision. It focuses on methods for converting texts into quantitative matrices of features and then analysing those features using statistical methods. The course briefly covers the qualitative technique of human coding and annotation (classical content analysis), but the main focus is on more automated approaches: dictionary construction and application, classification and machine learning, scaling models, and topic models.
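To make inter-coder agreement concrete, here is a minimal sketch of one common chance-corrected measure, Cohen's kappa, for two coders. The choice of kappa is an assumption made for illustration (the readings may favour other measures, such as Krippendorff's alpha), and the codings are invented.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders on the same units."""
    n = len(coder_a)
    # Share of units on which the two coders assigned the same category.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's category frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both coders used a single identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codings of six text units into two categories.
coder_a = ["econ", "econ", "welfare", "welfare", "econ", "welfare"]
coder_b = ["econ", "econ", "welfare", "econ", "econ", "welfare"]
kappa = cohens_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

Raw agreement here is 5 out of 6, but half of that would be expected by chance given how often each coder uses each category, so kappa is noticeably lower than the raw figure.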
Students in this course should have prior knowledge in the following areas:
1) An understanding of probability and statistics at the level of an intermediate postgraduate social science course. An understanding of regression analysis is presumed, and some basic understanding of maximum likelihood would be useful. The course is not heavily mathematical or statistical, but students without the prerequisite level of quantitative experience will find the second week (in particular) difficult to follow.
2) Basic familiarity with the R statistical language. The lab sessions will use R together with a customized R library designed by Ken Benoit. The library is in development and available from http://github.com/kbenoit/quanteda.
Representative Background Reading
James, G., Witten, D., & Hastie, T. (2014). An Introduction to Statistical Learning: With Applications in R. Taylor and Francis.
Background knowledge required
OLS = moderate
Maximum Likelihood = moderate
R = elementary
Classes will meet for ten sessions. Approximately 2/3 of the time will be devoted to lectures, and the other 1/3 will consist of “lab” sessions where we will work through exercises in class.
Computer Software
Computer-based exercises will feature prominently in the course, especially in the lab sessions. The use of all software tools will be explained in the sessions, including how to download and install them. We will be working primarily in R, using the “quanteda” package.
Recommended Texts
There is no really good single textbook that covers computerized or quantitative text analysis. A good general reference to content analysis is:
• Krippendorff, K. (2013). Content Analysis: An Introduction to Its Methodology. Sage, Thousand Oaks, CA, 3rd edition.
Another good general reference to content analysis that you might find useful as a supplement is:
• Neuendorf, K. A. (2002). The Content Analysis Guidebook. Sage, Thousand Oaks, CA.
A good general statistics reference for machine learning in R is:
• James, G., Witten, D., & Hastie, T. (2014). An Introduction to Statistical Learning: With Applications in R. Taylor and Francis.
A good book on annotation software for annotating or marking up your own texts is:
• Pustejovsky, J., & Stubbs, A. (2012). Natural Language Annotation for Machine Learning. O’Reilly Media, Inc.
Detailed Course Schedule
Day 1: Quantitative text analysis overview and fundamentals
This topic will introduce the goals and logistics of the course, provide an overview of the topics to be covered, and preview the software to be used. It will also introduce traditional (non-computer-assisted) content analysis and distinguish it from computer-assisted methods and quantitative text analysis. We will cover the conceptual foundations of content analysis and quantitative content analysis, and discuss the objectives, the approach to knowledge, and the particular view of texts involved in quantitative analysis. We will also work through some published examples.
Required Reading: Krippendorff (2013, Ch. 1, 2, 4); Wallach (2014)
Recommended Reading: Grimmer and Stewart (2013); Roberts (2000)
Lab session: Exercise 1: Working with Texts in quanteda
Day 2: Textual Data, Units of Analysis, Definitions of Documents and Features
Textual data comes in many forms. Here we discuss those formats and how to prepare texts for processing. The issues covered include where to obtain textual data; formatting and working with text files; indexing and meta-data; units of analysis; and definitions of features and measures commonly extracted from texts, including stemming, stop-words, and feature weighting.
Required Reading: Jivani (2011); Wikipedia entry on stop words, http://en.wikipedia.org/wiki/Stop_words; Manning, Raghavan and Schütze (2008, 117–120)
Recommended Reading: Denny and Spirling (2016); Krippendorff (2013, Ch. 5, 6, 7); Wikipedia entry on character encoding, http://en.wikipedia.org/wiki/Text_encoding; the different text file formats listed at http://www.fileinfo.com/filetypes/text
Lab session: Exercise 2: Extracting features from texts
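Two of the feature-processing ideas from this day, stop-word removal and feature weighting, can be sketched as follows. This is an illustration in plain Python, not the quanteda workflow used in the lab; the stop-word list and documents are invented, and tf-idf with a natural-log inverse document frequency is only one of several weighting schemes.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is", "on"}  # tiny illustrative list

def extract_features(text):
    """Tokenize, lowercase, and drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

docs = [
    "The budget is a statement of priorities",
    "The budget debate focused on the deficit",
    "Priorities of the opposition",
]
doc_counts = [Counter(extract_features(d)) for d in docs]

def tf_idf(term, counts, all_counts):
    """Term frequency multiplied by log inverse document frequency."""
    df = sum(1 for c in all_counts if term in c)
    if df == 0:
        return 0.0
    return counts[term] * math.log(len(all_counts) / df)

for term in sorted(doc_counts[0]):
    print(term, round(tf_idf(term, doc_counts[0], doc_counts), 3))
```

A term that appears in only one document ("statement") receives a higher weight than one shared across documents ("budget"), which is the point of the weighting.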
Day 3: Descriptive statistical methods for textual analysis
Here we focus on quantitative methods for describing texts, in particular summary measures that highlight specific characteristics of documents and allow these to be compared. These methods include characterizing texts through concordances, co-occurrences, and keywords in context; complexity and readability measures; and an in-depth discussion of text types, tokens, and equivalencies.
Required Reading: Spirling (2016)
Recommended Reading: DuBay (2004); Krippendorff (2013, Ch. 10)
Lab session: Exercise 3: Descriptive summaries of texts
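Two of the descriptive tools from this day, keywords in context and the distinction between types and tokens, can be sketched in a few lines. This is a plain Python illustration with an invented sentence, not the lab's quanteda code; the type-token ratio is used here as one simple lexical-diversity measure.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

tokens = tokenize("The bank raised rates because the bank feared inflation")
for left, key, right in kwic(tokens, "bank"):
    print(f"{left:>12} [{key}] {right}")
print(round(type_token_ratio(tokens), 3))
```

The sentence has nine tokens but only seven types, because "the" and "bank" each occur twice.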
Day 4: Quantitative methods for comparing texts
This topic covers quantitative methods for comparing texts through concordances and keyword identification, dissimilarity measures, association models, and vector-space models.
Required Reading: Lowe et al. (2011); Manning, Raghavan and Schütze (2008, Section 6.3)
Recommended Reading: Krippendorff (2013, Ch. 11)
Lab session: Exercise 4: Document similarity and resampling texts
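In the vector-space view, each document is a vector of feature counts, and one widely used similarity measure is the cosine of the angle between two such vectors. The sketch below is a plain Python illustration with invented documents, not the lab's quanteda code.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc_a = Counter("tax cuts and spending cuts".split())
doc_b = Counter("spending cuts and tax rises".split())
doc_c = Counter("football results and fixtures".split())

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

The two budget texts score far higher than the budget and football pair, whose only shared feature is the word "and"; this is one reason stop-words are often removed before comparing documents.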
Day 5: Automated dictionary methods
Automatic dictionary-based methods associate pre-defined word lists with particular quantitative values assigned by the researcher for some characteristic of interest. This topic covers the design model behind dictionary construction, including guidelines for testing and refining dictionaries. We will also review a variety of text pre-processing issues and textual data concepts such as word types, tokens, and equivalencies, including word stemming and trimming of words based on term and/or document frequency.
Required Reading: Laver and Garry (2000); Rooduijn and Pauwels (2011); Loughran and McDonald (2011)
Recommended Reading: Pennebaker and Chung (2008)
Assignment: Exercise 5: Applying dictionary coding
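Applying a dictionary amounts to counting how many tokens in a text fall into each pre-defined category. The sketch below is a plain Python illustration, not the lab's quanteda code; the two-category dictionary and the net-tone measure are hypothetical and far smaller and cruder than the published dictionaries in the readings.

```python
import re
from collections import Counter

# Hypothetical two-category dictionary, invented for illustration.
DICTIONARY = {
    "positive": {"growth", "improve", "strong"},
    "negative": {"crisis", "decline", "weak"},
}

def apply_dictionary(text, dictionary):
    """Count tokens matching each category; also return the token total."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for tok in tokens:
        for category, words in dictionary.items():
            if tok in words:
                counts[category] += 1
    return {cat: counts[cat] for cat in dictionary}, len(tokens)

def net_tone(text, dictionary):
    """(positive - negative) matches as a share of all tokens."""
    counts, n = apply_dictionary(text, dictionary)
    return (counts["positive"] - counts["negative"]) / n if n else 0.0

text = "Strong growth followed the crisis"
category_counts, n_tokens = apply_dictionary(text, DICTIONARY)
print(category_counts, n_tokens, net_tone(text, DICTIONARY))
```

Dividing by the number of tokens keeps scores comparable across documents of different lengths. Exact matching also shows why stemming matters: "growing" would not match "growth" here.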
Day 6: Document classifiers and Supervised Learning
Supervised classification methods permit the automatic classification of texts in a test set following machine learning from a training set. We will introduce machine learning methods for classifying documents, including one of the most popular classifiers, the Naive Bayes model, as well as k-nearest neighbour and support vector machines (SVMs). The topic also introduces validation and reporting methods for classifiers and discusses where these methods are applicable, as well as the pitfalls and problems with each method.
Required Reading: James et al. (2013, Ch. 4, 5, 8, 9); Evans et al. (2007); Statsoft, “Naive Bayes Classifier Introductory Overview,” http://www.statsoft.com/textbook/naive-bayes-classifier/
Recommended Reading: An online article by Paul Graham on classifying spam e-mail, http://www.paulgraham.com/spam.html; Bionicspirit.com, 9 Feb 2012, “How to Build a Naive Bayes Classifier,” http://bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html; Yu, Kaufmann and Diermeier (2008)
Assignment: Exercise 6: Examining movie reviews (http://www.cs.cornell.edu/People/pabo/movie-review-data/) and political speeches by author (the 10th Republican Presidential candidate debate), using R and the quanteda package.
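The logic of a Naive Bayes text classifier can be sketched in a few lines: learn word frequencies per class from a labelled training set, then assign a new text to the class under which its words are most probable. The version below is a multinomial model with add-one (Laplace) smoothing, written in plain Python as an illustration only; the four training "reviews" are invented and it is not the quanteda implementation used in the assignment.

```python
import math
from collections import Counter, defaultdict

def train_nb(labelled_docs):
    """Collect class sizes, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labelled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(tokens, model):
    """Return the class with the highest log posterior score."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)  # log prior
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

training = [
    ("a brilliant and moving film".split(), "pos"),
    ("brilliant acting and a great script".split(), "pos"),
    ("a dull and boring film".split(), "neg"),
    ("boring script and terrible acting".split(), "neg"),
]
model = train_nb(training)
print(predict_nb("a brilliant script".split(), model))
print(predict_nb("dull and boring".split(), model))
```

In practice the classifier would be evaluated on held-out test documents rather than on a couple of hand-picked phrases.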
Day 7: Document classifiers and Unsupervised Learning
Unlike supervised learning, unsupervised learning works with input data only; there are no corresponding labeled output variables. The goal of unsupervised learning is therefore to model the underlying structure of the data. Clustering algorithms, including hierarchical clustering, are used to try to discover and present interesting structure in the data.
Required Reading: James et al. (2013, Ch.10)
Recommended Reading: Quinn et al. (2010)
Assignment: Exercise 7: Examining UK political parties and immigration speeches using R and the quanteda package.
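One way to see how hierarchical clustering finds structure without labels is a bare-bones agglomerative procedure: start with every document in its own cluster and repeatedly merge the two closest clusters. The sketch below uses single linkage and Euclidean distance, both of which are assumptions chosen for brevity; it is plain Python with invented count vectors, not the code used in the assignment.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering; returns lists of row indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical counts of four features (immigration, border, tax, budget)
# in four speeches.
speeches = [
    [3, 2, 0, 0],
    [2, 3, 0, 1],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
]
groups = agglomerative(speeches, n_clusters=2)
print(groups)
```

The procedure recovers the two topical groups without being told that they exist. Stopping at different numbers of clusters gives the nested structure usually drawn as a dendrogram.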
Day 8: Scaling Texts
This topic introduces methods for placing documents on continuous dimensions or “scales”. It presents the major non-parametric methods for scaling documents and discusses the situations in which scaling methods are appropriate. Building on the Naive Bayes classifier, we introduce the “Wordscores” method of Laver, Benoit and Garry (2003) and show the link between classification and scaling. We also discuss its similarities to and differences from other non-parametric scaling models such as correspondence analysis. We then look at parametric approaches to scaling texts, which model features as Bernoulli or Poisson distributed, contrast these methods with the alternatives, and critically examine the assumptions on which such models rely.
Required Reading: Laver, Benoit and Garry (2003); Slapin and Proksch (2008); Lowe and Benoit (2013)
Recommended Reading: Martin and Vanberg (2007); Benoit and Laver (2007); Lowe (2008); Clinton, Jackman and Rivers (2004)
Assignment: Exercise 8: Wordscoring and “Wordfish” (Requires R).
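The core of the Wordscores idea can be sketched briefly: each word is scored as the average of the reference texts' known positions, weighted by how strongly the word is associated with each reference text, and a new ("virgin") text is scored as the frequency-weighted mean of its word scores. The sketch below is a simplified plain Python illustration with invented texts and positions; it omits the rescaling of virgin text scores and the uncertainty estimates, and it is not the quanteda implementation used in the assignment.

```python
from collections import Counter

def relative_freqs(tokens):
    """Relative frequency of each word within one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def word_scores(reference_texts, positions):
    """Score each word by the reference positions, weighted by P(text | word)."""
    freqs = {name: relative_freqs(toks) for name, toks in reference_texts.items()}
    vocab = set().union(*freqs.values())
    scores = {}
    for w in vocab:
        denom = sum(f.get(w, 0.0) for f in freqs.values())
        scores[w] = sum(freqs[r].get(w, 0.0) / denom * positions[r] for r in freqs)
    return scores

def score_text(tokens, scores):
    """Mean word score of a new text, using only words that were scored."""
    scored = [t for t in tokens if t in scores]
    if not scored:
        return None  # no overlap with the reference vocabulary
    return sum(f * scores[w] for w, f in relative_freqs(scored).items())

# Hypothetical reference texts with assumed positions on a left-right scale.
references = {
    "left": "welfare welfare tax".split(),
    "right": "tax tax markets".split(),
}
positions = {"left": -1.0, "right": 1.0}

scores = word_scores(references, positions)
virgin_score = score_text("welfare tax tax markets spending".split(), scores)
print({w: round(s, 3) for w, s in sorted(scores.items())}, round(virgin_score, 3))
```

Words used by only one reference text take that text's position, while a shared word such as "tax" lands in between. The unscored word "spending" is simply dropped when scoring the new text.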
Day 9: Clustering methods and topic models
An introduction to hierarchical clustering for textual data and to parametric topic models such as Latent Dirichlet Allocation (LDA). We also look at recent work that tries to combine scaling and topic modeling in a “semi-supervised” approach.
Required Reading: Blei (2012); Blei, Ng and Jordan (2003); Baerg and Lowe (2016); Roberts et al. (2014)
Assignment: Exercise 9: Using LDA to estimate document topics.
Day 10: Text and Causal Inference
We will discuss when and why QTA can lead us astray, paying particular attention to concerns of reliability and validity.
Mozer, Reagan, et al. “Matching with text data: An experimental evaluation of methods for matching documents and of measuring match quality.” Political Analysis 28.4 (2020): 445-468.
Roberts, Margaret E., Brandon M. Stewart, and Richard A. Nielsen. “Adjusting for confounding with text matching.” American Journal of Political Science 64.4 (2020): 887-903.
Fong, Christian, and Justin Grimmer. “Discovery of treatments from text corpora.” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016.