Please note: This course will be taught in hybrid mode. Hybrid delivery includes synchronous live sessions in which on-campus and online students are taught simultaneously.

Martijn Schoonvelde is Assistant Professor in European Politics & Society at the University of Groningen. His main research and teaching interests are political communication, political rhetoric, and quantitative text analysis.

Course Content

With the massive and ever-increasing availability of digital text data, social scientists increasingly use automated text analysis (or “text as data”) to examine various kinds of social and political phenomena. This module introduces participants to a variety of text analysis methods and tools. We discuss the theoretical assumptions behind these methods, their substantive applications, and their implementation in the R statistical programming language. The meetings – which combine lectures and coding sessions in the RStudio Cloud platform – will be hands-on, dealing with practical issues in each step of a text-as-data project.

Course Objectives

Participants will understand fundamental issues in quantitative text analysis research design such as textual representations, measurement reliability and validation, and prediction accuracy. Participants will learn to convert texts into informative feature matrices and to analyse those matrices using statistical methods. Participants will learn to apply these methods to a text corpus in support of a substantive research question. Furthermore, participants will be able to critically evaluate (social science) research that uses automated text analysis methods.

Course Prerequisites

Familiarity with basic research design and statistical analysis is expected, and familiarity with the R statistical programming language is strongly encouraged.

Background Reading

  • Benoit, K. (2020). “Text as Data: An Overview”. Handbook of Research Methods in Political Science and International Relations. Ed. by L. Curini and R. Franzese. Thousand Oaks: Sage: 461–497.
  • Welbers, K., Van Atteveldt, W., & Benoit, K. (2017). Text analysis in R. Communication Methods and Measures, 11(4), 245–265.

 

Background knowledge required:

Computer Background: R (elementary)

Day 1

  • Lecture: What is quantitative text analysis? What will you learn in this course? Developing a corpus.
  • Lab: Working in RStudio Cloud. Working with libraries in R. Working with RMarkdown.

 

Day 2

  • Lecture: Core assumptions in quantitative text analysis. Representations of text. Preprocessing and feature selection.
  • Lab: Working with string variables. Regular expressions. Cleaning a string vector.

 

Day 3

  • Lecture: Advanced text representations. Risks of feature selection with unsupervised models.
  • Lab: Importing textual data into R. Introduction to quanteda (Benoit et al., 2018). Creating a corpus of documents and adding metadata. Creating a document-feature matrix.
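
To make the idea of a document-feature matrix concrete, here is a minimal, language-neutral sketch in Python. It is not course material: the lab itself uses R and quanteda, and the function names `tokenize` and `document_feature_matrix` as well as the two example sentences are invented for illustration. Each row is a document, each column a feature (here, a lowercased word), and each cell a count.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def document_feature_matrix(docs):
    """Return (features, matrix): a sorted vocabulary and one row of counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    # The vocabulary is the union of all tokens seen in any document.
    features = sorted(set().union(*counts))
    # A Counter returns 0 for tokens a document does not contain.
    matrix = [[c[f] for f in features] for c in counts]
    return features, matrix


# Toy corpus, invented for this example.
docs = ["The parliament debated the budget.", "The budget passed."]
features, matrix = document_feature_matrix(docs)

print(features)
for row in matrix:
    print(row)
```

Preprocessing choices such as lowercasing or removing punctuation change which columns exist, which is why they are treated as substantive decisions rather than housekeeping.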

 

Day 4

  • Lecture: Comparing documents in a corpus. Generating insights by combining linguistic features with social science theories.
  • Lab: Examining similarity and complexity of documents.
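
One common way to compare documents is cosine similarity between their word-count vectors. The sketch below is an illustrative Python version under simple assumptions (whitespace tokenization, raw counts); the lab uses R, and the helper name `cosine_similarity` and the three toy documents are made up for this example.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two token-count mappings (0.0 if either is empty)."""
    # Only tokens present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents, invented for this example.
doc_a = Counter("the budget passed".split())
doc_b = Counter("the budget failed".split())
doc_c = Counter("rain fell overnight".split())

sim_ab = cosine_similarity(doc_a, doc_b)  # shares two of three words
sim_ac = cosine_similarity(doc_a, doc_c)  # shares no words
print(round(sim_ab, 3), sim_ac)
```

Because the measure normalizes by vector length, a long and a short document with the same relative word use still score as similar.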

 

Day 5

  • Lecture: What can we do with dictionaries and how can we validate them? Sensitivity and specificity.
  • Lab: Categorizing texts using off-the-shelf and home-made dictionaries.
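
Validating a dictionary usually means comparing its output against human coding. The following Python sketch illustrates sensitivity and specificity for a tiny, hypothetical dictionary; the term list, the six sentences, and their hand-coded labels are all invented, and the lab itself works in R.

```python
import re

# Hypothetical "negative tone" dictionary, invented for this example.
NEGATIVE_TERMS = {"crisis", "fail", "threat"}


def dictionary_label(text, terms):
    """Label a text as positive for the category if it contains any dictionary term."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return bool(tokens & terms)


def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for two equal-length lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Toy texts with invented human codes (True = negative tone).
coded = [
    ("The crisis deepened.", True),
    ("Talks fail again.", True),
    ("A bleak outlook for growth.", True),   # missed: "bleak" is not in the dictionary
    ("The summit was a success.", False),
    ("No threat was found.", False),         # false alarm: negation is ignored
    ("Growth continued.", False),
]
predicted = [dictionary_label(text, NEGATIVE_TERMS) for text, _ in coded]
actual = [label for _, label in coded]
sens, spec = sensitivity_specificity(predicted, actual)
print(round(sens, 3), round(spec, 3))
```

The two errors in the toy data show the typical failure modes of dictionaries: missing vocabulary lowers sensitivity, while ignoring context such as negation lowers specificity.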

 

Day 6

  • Lecture: Human coding and document classification using supervised machine learning. Evaluating a classifier.
  • Lab: Binary classification of documents using a Naïve Bayes classifier.
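
As a rough illustration of what a Naïve Bayes text classifier does, here is a minimal multinomial version with Laplace smoothing in Python. It is a sketch under stated assumptions, not the implementation used in the lab (which works in R): the function names `train_naive_bayes` and `predict`, the smoothing value, and the four training documents are invented for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for counts in word_counts.values() for t in counts}
    model = {}
    for label, n_docs in class_docs.items():
        # Laplace smoothing: add alpha to every word count in every class.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_lik": {
                t: math.log((word_counts[label][t] + alpha) / total) for t in vocab
            },
        }
    return model


def predict(model, tokens):
    """Return the class with the highest posterior log score; unseen words are skipped."""
    scores = {
        label: p["log_prior"]
        + sum(p["log_lik"][t] for t in tokens if t in p["log_lik"])
        for label, p in model.items()
    }
    return max(scores, key=scores.get)


# Toy training set with invented human codes.
train_docs = [
    "tax cuts boost growth".split(),
    "budget deficit and tax reform".split(),
    "border policy and asylum rules".split(),
    "asylum seekers cross border".split(),
]
train_labels = ["economy", "economy", "migration", "migration"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "tax reform budget".split()))
print(predict(model, "asylum border".split()))
```

In practice the classifier is evaluated on held-out documents that were not used for training, using measures such as precision and recall.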

 

Day 7

  • Lecture: Supervised, semi-supervised and unsupervised approaches to placing texts on an underlying dimension.
  • Lab: Wordfish, Wordscores and Latent Semantic Scaling.

 

Day 8

  • Lecture: Understanding topic models. Discussing their pros and cons.
  • Lab: Latent Dirichlet Allocation (LDA) and Structural Topic Models (STM).

 

Day 9

  • Lecture: New developments in data. Machine translation. Automated speech recognition. Images as data.
  • Lab: Linguistic preprocessing of text. POS tagging and lemmatizing using udpipe (Wijffels, 2022).

 

Day 10

  • Lecture: Word embeddings. Concluding remarks.
  • Lab: Training a word embeddings model and inspecting document vectors using text2vec (Selivanov et al., 2022).