Please note: This course will be taught online only. In-person study is not available.

Martijn Schoonvelde is Assistant Professor in European Politics & Society at the University of Groningen. His main research and teaching interests are political communication, political rhetoric, and quantitative text analysis.

Course Content
With the massive availability of text data on the web, social scientists increasingly recognize automated text analysis (or “text as data”) as a promising approach for analyzing various kinds of social and political phenomena. This module introduces participants to a variety of its methods and tools. We discuss the underlying theoretical assumptions, substantive applications of these methods, and their implementation in the R statistical programming language. The meetings – which combine lectures and coding sessions in the RStudio Cloud platform – will be hands-on, dealing with practical issues in each step of the research process.

Course Objectives
Participants will understand fundamental issues in quantitative text analysis research design such as inter-coder agreement, reliability, validation, accuracy, and precision. Participants will learn to convert texts into informative feature matrices and to analyze those matrices using statistical methods. Participants will learn to apply these methods to a text corpus in support of a substantive research question. Furthermore, participants will be able to critically evaluate (social science) research that uses automated text analysis methods.

Course Prerequisites
Familiarity with basic research design and statistical analysis is expected, and familiarity with the R statistical programming language is strongly encouraged.

Background Reading
Benoit, K. (2020). Text as data: An overview. In L. Curini & R. Franzese (Eds.), Handbook of Research Methods in Political Science and International Relations (pp. 461–497). Thousand Oaks: Sage.
Welbers, K., Van Atteveldt, W., & Benoit, K. (2017). Text analysis in R. Communication Methods and Measures, 11(4), 245–265.

Background knowledge required:
Computer background: R (elementary)

Day 1

Lecture: What is quantitative text analysis? What will you learn in this course?

Lab: Working in RStudio Cloud. Working with libraries in R. Working with RMarkdown.

Day 2

Lecture: Core assumptions in quantitative text analysis. Considering issues of measurement and validation.

Lab: Regular expressions. Working with strings. Creating a dataframe.

Day 3

Lecture: Going from text to data. Preprocessing and feature selection. Deciding on the unit of observation and unit of analysis.

Lab: Importing textual data into R. Creating a corpus of documents and adding metadata to it. Creating a document-feature matrix.

Day 4

Lecture: Comparing documents in a corpus. Combining linguistic features and social science theories.

Lab: Estimating similarity, readability and complexity of documents.

Day 5

Lecture: What are dictionaries and how can we validate them? Sensitivity and specificity.

Lab: Creating a dictionary and applying it to a document-feature matrix.

Day 6

Lecture: Human coding and document classification using supervised machine learning.

Lab: Binary classification of documents using a Naïve Bayes classifier.

Day 7

Lecture: Supervised, semi-supervised and unsupervised approaches to placing texts on an underlying (political) dimension.

Lab: Wordfish, Wordscores and Latent Semantic Scaling.

Day 8

Lecture: Understanding topic models. Discussing their pros and cons.

Lab: Latent Dirichlet Allocation (LDA) and Structural Topic Models (STM).

Day 9

Lecture: New developments in data. Images as data. Automated speech recognition. Machine translation.

Lab: Using APIs for generating useful textual data.

Day 10

Lecture: Word embeddings. Concluding remarks.

Lab: Exploring a pre-trained word embeddings model.