Please note: This course will be taught online only. In-person study is not available for this course.

Martijn Schoonvelde is Assistant Professor in European Politics & Society at the University of Groningen. His main research and teaching interests are political communication, political rhetoric, and quantitative text analysis.

Course Content

With the explosion of digital text data, social scientists are increasingly using quantitative text analysis techniques to learn from these data. This course provides a comprehensive introduction to current methods and tools in the field of “text-as-data”. We discuss substantive applications of these methods, their theoretical assumptions, and their implementation in the R statistical programming language. The meetings, which combine lectures and coding sessions, will be hands-on, dealing with practical issues in each step of a text-as-data project.

Course Objectives

Participants will understand fundamental issues in quantitative text analysis research design such as different types of textual representations; measurement versus prediction; multilingualism; and how to think about validation. Participants will learn to convert texts into informative feature matrices and to analyze those matrices using statistical methods. Participants will learn to apply these methods to a text corpus in support of a substantive research question. Furthermore, participants will be able to critically evaluate (social science) research that uses automated text analysis methods.

Course Prerequisites

Familiarity with basic research design and statistical analysis is expected, and familiarity with the R statistical programming language is encouraged.

Required Text (this will be provided by ESS)

Grimmer, J., Roberts, M.E. and Stewart, B.M., 2022. Text as data: A new framework for machine learning and the social sciences. Princeton University Press. ISBN: 9780691207551

Background knowledge required:

Computer Background
R = elementary

Maths
Calculus = elementary

 

Day 1

  • Lecture: What is quantitative text analysis? What will you learn in this course? Developing a corpus.
  • Lab: Working in RStudio. Working with packages in R. Working with Quarto. Working with string variables. Regular expressions.

 

Day 2

  • Lecture: Core assumptions in quantitative text analysis. Representations of text. Preprocessing and (risks of) feature selection.
  • Lab: Creating a document-feature matrix. Importing textual data into R. Introduction to quanteda (Benoit et al., 2018). Inspecting and visualizing a corpus.

 

Day 3

  • Lecture: Advanced text representations. Word embeddings.
  • Lab: Training a word embedding model and inspecting document vectors using text2vec (Selivanov et al., 2022).

 

Day 4

  • Lecture: What can we do with dictionaries and how can we validate them? Sensitivity and specificity.
  • Lab: Categorizing texts using off-the-shelf and home-made dictionaries.

 

Day 5

  • Lecture: Human coding and document classification using supervised machine learning. Evaluating a classifier.
  • Lab: Binary classification of documents using traditional machine learning classifiers.

 

Day 6

  • Lecture: Supervised, semi-supervised and unsupervised approaches to place text on an underlying dimension.
  • Lab: Wordfish, Wordscores and Latent Semantic Scaling.

 

Day 7

  • Lecture: Understanding unsupervised and semi-supervised topic models. Discussing their pros and cons.
  • Lab: Latent Dirichlet Allocation (LDA), structural topic models (STM) and semi-supervised topic models.

  

Day 8

  • Lecture: New developments in data. Machine translation. Automated speech recognition. LLMs as coding tools. Images as data.
  • Lab: Linguistic preprocessing of text. POS tagging and lemmatizing using udpipe (Wijffels, 2022).

  

Day 9

  • Lecture: Deep learning. Transfer Learning. Concluding remarks.
  • Lab: Exploring the text package (Kjell et al., 2023) and the grafzahl package (Chan, 2023).