Please note: This course will be taught online only. In-person study is not available.

Raymond Hicks has been working with Columbia University’s History Lab since 2017. While at History Lab, he helped build pipelines to run topic modeling and Named Entity Recognition on more than 4 million declassified government documents. His work at Columbia bridges data science and historical research, focusing on improving access to and analysis of government archives through natural language processing and machine learning methods.

Before starting at Columbia, he worked as the Statistical Programmer for the Niehaus Center for Globalization and Governance at Princeton University. His research interests include monetary policy, trade policy, and statistics, and his work has appeared in the Journal of Politics, International Organization, and the British Journal of Political Science, among other journals.

Relevant publications:
“New evidence and new methods for analyzing the Iranian revolution as an intelligence failure” (with Matthew Connelly, Robert Jervis, and Arthur Spirling), 2021, Intelligence and National Security, 36(6): 781-806.

“Diplomatic documents data for international relations: the Freedom of Information Archive Database” (with Matthew Connelly, Robert Jervis, Arthur Spirling, and Clara Suong), 2021, Conflict Management and Peace Science, 38(6): 762-781.

Course Content

The increasing digitization of social life has made text data a central resource for social science research. From social media posts and political speeches to survey responses and historical archives, vast amounts of textual material are now available for systematic analysis. This course introduces participants to the fundamental concepts and methods of computational text analysis using R, with a particular emphasis on how Large Language Models (LLMs) can assist in both coding and interpretation.

The first part of the course introduces the R environment for text analysis, covering how to import, clean, and structure textual data. Students will learn to work with key R packages such as tidytext and quanteda, gaining hands-on experience in creating document-feature matrices, tokenizing text, and applying descriptive methods like word frequency and co-occurrence analysis.
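
As a rough illustration of the workflow described above, the sketch below tokenizes three invented sentences with quanteda, builds a document-feature matrix, and lists the most frequent features. The example texts and object names are made up for illustration and are not course materials.

```r
library(quanteda)

# Toy documents invented for this illustration (not course data)
texts <- c(
  doc1 = "The senator gave a speech about trade policy.",
  doc2 = "Trade policy shaped the debate in parliament.",
  doc3 = "The archive holds every speech the senator gave."
)

corp <- corpus(texts)

# Tokenize, drop punctuation, lowercase, and remove English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Document-feature matrix: one row per document, one column per feature
dfmat <- dfm(toks)

# Most frequent features across the whole corpus
top_terms <- topfeatures(dfmat, n = 5)
print(top_terms)
```

Each step produces its own object (corpus, tokens, matrix), which makes it easy to inspect intermediate results and to document every preprocessing decision.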

Building on this foundation, the course explores more advanced techniques for analyzing text as data, including topic modeling, sentiment analysis, and Named Entity Recognition (NER). Students will learn how these tools can be applied to research questions in political science, sociology, communication, and related fields.

A distinctive component of the course is the discussion of LLM-assisted (“vibe”) coding: the use of generative AI tools such as GPT or Claude to assist with qualitative and semi-quantitative text analysis. The course will demonstrate how LLMs can accelerate exploratory coding and thematic identification, while also examining their limitations, biases, and reproducibility challenges. Students who wish to experiment with LLM-assisted methods will need an API key for a platform such as OpenAI or Anthropic; participation in this component is optional.

Course Objectives

By the end of the course, participants will have a practical toolkit for conducting transparent, replicable text analyses that integrate traditional computational methods with emerging AI-assisted approaches. Students will understand the logic and workflow of computational text analysis in R, from importing, cleaning, and structuring text data to conducting preprocessing, tokenization, and feature extraction using packages such as tidytext and quanteda. They will learn to apply intermediate and advanced analytical techniques, including topic modeling and named entity recognition (NER), while critically assessing the role of large language models (LLMs) in social science research. Emphasis will be placed on both the methodological and ethical dimensions of using LLMs, as well as on how to incorporate AI-assisted tools into reproducible, well-documented research pipelines.

Course Prerequisites

No formal prerequisites are required. Basic familiarity with R will be helpful but not essential, as the course begins with a short refresher on R fundamentals, including installing and loading packages, writing and executing code, and handling data structures.

Students entirely new to R are encouraged to explore introductory materials before the course.

Familiarity with basic social science research design and textual data sources will be useful but is not required.

Students interested in experimenting with LLM-assisted analysis should obtain an API key (e.g., for OpenAI or Claude) before the relevant sessions.

Background knowledge:

Maths:

Linear Regression: Elementary

Statistics:

OLS: Elementary

Maximum Likelihood: Elementary

Software:

R: Elementary

Course Information

This course offers a broad overview of text analysis in R: importing textual data, analyzing it with quanteda, and applying more sophisticated tools such as Named Entity Recognition and topic modeling. We will also explore the utility and pitfalls of LLM-assisted coding, or vibe coding. Using LLM-assisted coding is optional; participants who want to try it will find it useful to have an API key for OpenAI or Claude.

Course Outline

Day 1: Introduction to Data Science & Research Design

  • Topics:
    • What is data science for social scientists?
    • Text as data: from research question to measurement
    • Introduction to R and RStudio
  • Lab:
    • Setting up R projects
  • Readings:
    • Grimmer & Stewart, “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts”
    • Trevor Owens, “Defining Data for Humanists: Text, Artifact, Information or Evidence?”
    • Grimmer, Roberts, & Stewart, Text as Data, Introduction

Day 2: Formatting and Loading Text Data (Part 1)

  • Topics:
    • Structured text data formats (CSV, TXT, SQL)
    • Data wrangling and cleaning in R (tidyverse)
    • Reproducibility and documentation best practices
  • Lab:
    • Load and explore structured text data
  • Readings:
    • Kenneth Benoit, Text Analysis in R (Ch. 1–2)
    • Matt Connelly et al., “New Evidence and New Methods for Analyzing the Iranian Revolution as an Intelligence Failure”

Day 3: Working with Unstructured and Tagged Data

  • Topics:
    • Web scraping and APIs (HTML, JSON)
    • Ethical considerations in web scraping
    • OCR quality and digitized archives
  • Lab:
    • Using rvest for scraping
    • Parsing JSON data
  • Readings:
    • rvest: Introduction and Web Scraping Examples
    • Bender et al., “On the Dangers of Stochastic Parrots” (optional ethics reading)

Day 4: Exploring and Preparing Text Data

  • Topics:
    • String functions and regular expressions
    • Tokenization, stopword removal, and stemming
    • Bag-of-words representation
  • Lab:
    • Preprocessing text for analysis
    • Using stringr and tidytext
  • Readings:
    • None (in-class demos)
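
The Day 4 topics can be previewed with a minimal base R sketch, shown here so it runs without extra packages (the lab itself uses stringr and tidytext). The sentence and the three-word stopword list are invented for illustration, and stemming is left out.

```r
# One invented sentence used only for illustration
text <- "In 1979, the Embassy sent 3 cables; the cables were declassified in 2014!"

# Regular expressions: lowercase, then replace anything that is not
# a lowercase letter or a space with a space
clean <- tolower(text)
clean <- gsub("[^a-z ]", " ", clean)

# Tokenization: split on runs of spaces and drop empty strings
tokens <- strsplit(clean, " +")[[1]]
tokens <- tokens[tokens != ""]

# Stopword removal with a tiny hand-written list (real lists are much longer)
stopwords_small <- c("in", "the", "were")
tokens <- tokens[!tokens %in% stopwords_small]

# Bag-of-words: token counts with word order discarded
bag <- table(tokens)
print(bag)
```

Note that the regular expression also discards the years and the cable count; whether numbers should be kept is a preprocessing decision that depends on the research question.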

Day 5: Introduction to Quanteda

  • Topics:
    • Introduction to the quanteda package
    • Creating document-feature matrices (DFMs)
    • Text normalization and weighting
  • Lab:
    • Load text data into quanteda
    • Explore term frequencies and contexts
  • Readings:
    • Quanteda Quickstart
    • Silge & Robinson, Text Mining with R, Ch. 1–3 (optional)

Day 6: Descriptive Text Analysis

  • Topics:
    • Summary statistics, keyword-in-context (KWIC), collocations
    • Named Entity Recognition (NER) basics
  • Lab:
    • Compute key terms and co-occurrences
    • Run NER on historical text
  • Readings:
    • None
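
Keyword-in-context from the Day 6 topics can be sketched with quanteda's kwic() function. The three short texts and their document names are invented for illustration; collocations and NER need additional packages and are not shown.

```r
library(quanteda)

# Toy documents invented for this illustration (not course data)
texts <- c(
  cable1 = "The treaty was signed in Paris after long talks.",
  cable2 = "Congress debated the treaty for several weeks.",
  cable3 = "The embassy sent a cable about the election."
)

toks <- tokens(texts, remove_punct = TRUE)

# Keyword-in-context: every match of "treaty" with up to three tokens
# of context on either side
kw <- kwic(toks, pattern = "treaty", window = 3)
print(kw)
```

Reading the matches in context is a quick check on whether a term means what a frequency count assumes it means.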

Day 7: Data Coding and Unsupervised Learning

  • Topics:
    • Quantitative vs qualitative coding
    • Clustering (k-means, hierarchical)
    • Topic modeling (LDA and STM)
  • Lab:
    • Fit and interpret topic models in stm
  • Readings:
    • Benoit et al., “Crowd-Sourced Text Analysis”
    • Grimmer, Roberts, & Stewart, Text as Data, Ch. 6 (optional)

Day 8: Supervised Learning for Text Classification

  • Topics:
    • Regression and classification models for text
    • Feature selection, train/test splits, and model validation
    • Naive Bayes, SVMs, and logistic regression
  • Lab:
    • Author attribution using Federalist Papers
  • Readings:
    • D’Orazio et al., “Separating the Wheat from the Chaff”
    • Kosuke Imai, Quantitative Social Science, Ch. 5

Day 9: Sentiment Analysis and Word Embeddings

  • Topics:
    • Sentiment lexicons and validation
    • Introduction to word embeddings (Word2Vec, GloVe)
    • Measuring semantic similarity
  • Lab:
    • Conduct sentiment analysis on political or historical texts
    • Explore word embeddings with pre-trained vectors
  • Readings:
    • Lucas et al., “Computer-Assisted Text Analysis for Comparative Politics”
    • Rodriguez, Spirling, & Stewart, “Embedding Regression: Models for Context-Specific Description and Inference”

Day 10: Beyond Bag-of-Words — Transformers and LLMs

  • Topics:
    • Contextual embeddings (BERT, Llama)
    • Large Language Models and prompt engineering
    • Critical discussion: opportunities and limitations of LLMs for social science research
    • Interpreting model outputs and connecting back to research design
  • Lab:
    • Compare embeddings from transformer models
  • Readings:
    • Grimmer, Roberts, & Stewart, Text as Data, Ch. 10
    • Nelson (2020), “Measuring Culture with Large-Scale Text Analysis”