Please note: This course will be taught online only. In-person study is not available for this course.

Martijn Schoonvelde is Assistant Professor in European Politics & Society at the University of Groningen. His main research and teaching interests are in political communication, EU politics, political rhetoric, and computational text analysis.

Course Content

With the explosion of digital text data, social scientists are increasingly relying on quantitative text analysis techniques to learn from these data. This course provides a comprehensive introduction to foundational and cutting-edge methods and tools in the field of “text-as-data”. We discuss substantive applications of these methods, their theoretical assumptions, and their implementation in the R statistical programming language. The meetings, which combine lectures and coding sessions, will be hands-on and deal with practical issues in each step of a text-as-data project.

Course Objectives

Participants will understand fundamental issues in quantitative text analysis research design such as different types of textual representations; measurement versus prediction; multilingualism; and how to think about validation. Participants will learn to convert texts into informative representations and to analyze those representations using statistical methods. Participants will learn to apply these methods to a text corpus in support of a substantive research question. Furthermore, participants will be able to critically evaluate (social science) research that uses automated text analysis methods.

Course Prerequisites

Familiarity with basic research design and statistical analysis is expected, and familiarity with the R statistical programming language is encouraged.

Required Text (this will be provided by ESS)

Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart: Text as Data: A New Framework for Machine Learning and the Social Sciences, Princeton University Press, ISBN: 9780691207544

 

Day 1

  • Lecture: What is quantitative text analysis? What will you learn in this course? Developing a corpus.
  • Lab: Working in RStudio. Working with packages in R. Working with Quarto.

 

Day 2

  • Lecture: Core assumptions in quantitative text analysis. Representations of text. Preprocessing and feature selection.
  • Lab: Working with string variables. Regular expressions. Cleaning a string vector. Creating a document-feature matrix.
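
As a rough illustration of what the Day 2 lab steps involve, the sketch below cleans two texts with regular expressions and counts features per document. It is a minimal sketch only: the labs themselves use R, Python is used here purely for illustration, and the toy texts and the `clean` helper are invented for this example rather than taken from the course materials.

```python
import re
from collections import Counter

# Hypothetical toy corpus; these texts are invented for illustration.
texts = [
    "The EU's budget was debated, again!",
    "Budget talks: the Council debated the budget.",
]

def clean(text):
    """Lowercase, turn every non-letter into a space, then squeeze whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Tokenize each cleaned document on whitespace.
tokens = [clean(t).split() for t in texts]

# The feature set is the sorted vocabulary across all documents.
features = sorted({word for doc in tokens for word in doc})

# Document-feature matrix: one row per document, one column per feature,
# each cell holding the count of that feature in that document.
dfm = [[Counter(doc)[f] for f in features] for doc in tokens]

print(features)
for row in dfm:
    print(row)
```

Note that this naive cleaning splits "EU's" into the tokens "eu" and "s", which is the kind of preprocessing decision that affects the resulting features.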

 

Day 3

  • Lecture: Advanced text representations. Word embeddings.
  • Lab: Training a word embeddings model and inspecting document vectors using text2vec (Selivanov et al., 2022).

 

Day 4

  • Lecture: What can we do with dictionaries and how can we validate them? Sensitivity and specificity.
  • Lab: Categorizing texts using off-the-shelf and home-made dictionaries.
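
To make the two validation measures from the Day 4 lecture concrete, here is a minimal sketch that compares dictionary-based labels against human-coded labels. Sensitivity is the share of truly positive documents the dictionary flags, and specificity is the share of truly negative documents it leaves unflagged. The labels are invented for illustration, the function name `sensitivity_specificity` is hypothetical, and Python is used only for illustration since the labs use R.

```python
def sensitivity_specificity(gold, predicted):
    """Return (sensitivity, specificity) for two equal-length lists of booleans."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    tn = sum(1 for g, p in pairs if not g and not p)  # true negatives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels: human coding (gold) versus dictionary output (predicted).
gold      = [True, True, True,  False, False, False, False, False]
predicted = [True, True, False, True,  False, False, False, False]

sens, spec = sensitivity_specificity(gold, predicted)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

In this toy case the dictionary finds two of the three positive documents and wrongly flags one of the five negative ones.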

 

Day 5

  • Lecture: Human coding (or machine coding) and document classification using supervised machine learning. Evaluating a classifier.
  • Lab: Binary classification of documents using traditional machine learning classifiers.

 

Day 6

  • Lecture: Supervised, semi-supervised and unsupervised approaches to place text on an underlying dimension.
  • Lab: Wordfish, Wordscores and Latent Semantic Scaling.

 

Day 7

  • Lecture: Understanding unsupervised and semi-supervised topic models. Discussing their pros and cons.
  • Lab: Latent Dirichlet Allocation (LDA), structural topic models (STM), and semi-supervised topic models.

 

Day 8

  • Lecture: New developments in data. Machine translation. Automated speech recognition. LLMs as coding tools.
  • Lab: Linguistic preprocessing of text. POS tagging and lemmatizing using udpipe (Wijffels, 2022).

 

Day 9

  • Lecture: New developments in supervised machine learning. Weak supervision. Transfer learning.
  • Lab: Training a word embeddings model and inspecting document vectors using text2vec (Selivanov et al., 2022).

 

Day 10

  • Lecture: Concluding remarks.
  • Lab: Exploring LLMs using the quanteda.llm (Maerz and Benoit, 2025), ellmer (Wickham et al., 2025), and ollamar (Lin and Safi, 2025) packages.