Please note: This course will be taught online only. In-person study is not available.

Raymond Hicks has been working with Columbia University’s History Lab since 2017. While at History Lab, he helped build pipelines to run topic modeling and Named Entity Recognition on more than 4 million declassified government documents. His work at Columbia bridges data science and historical research, focusing on improving access to and analysis of government archives through natural language processing and machine learning methods.
Before starting at Columbia, he worked as the Statistical Programmer for the Niehaus Center for Globalization and Governance at Princeton University. His research interests include monetary policy, trade policy, and statistics, and his work has appeared in the Journal of Politics, International Organization, and the British Journal of Political Science, among other journals.
Relevant publications:
“New evidence and new methods for analyzing the Iranian revolution as an intelligence failure” (with Matthew Connelly, Robert Jervis, and Arthur Spirling), 2021, Intelligence and National Security, 36(6): 781-806.
“Diplomatic documents data for international relations: the Freedom of Information Archive Database” (with Matthew Connelly, Robert Jervis, Arthur Spirling, and Clara Suong), 2021, Conflict Management and Peace Science, 38(6): 762-781.
Course Content
The increasing digitization of social life has made text data a central resource for social science research. From social media posts and political speeches to survey responses and historical archives, vast amounts of textual material are now available for systematic analysis. This course introduces participants to the fundamental concepts and methods of computational text analysis using R, with a particular emphasis on how Large Language Models (LLMs) can assist in both coding and interpretation.
The first part of the course introduces the R environment for text analysis, covering how to import, clean, and structure textual data. Students will learn to work with key R packages such as tidytext and quanteda, gaining hands-on experience in creating document-feature matrices, tokenizing text, and applying descriptive methods like word frequency and co-occurrence analysis.
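As a rough illustration of the workflow described above, the following sketch tokenizes two invented sentences, removes English stopwords, and builds a document-feature matrix with quanteda. The example texts and object names are assumptions made for illustration and are not taken from the course materials; the sketch also assumes the quanteda package is already installed.

```r
# Minimal sketch, assuming quanteda is installed: install.packages("quanteda")
library(quanteda)

# Two invented example documents (hypothetical, not course data)
docs <- c(
  doc1 = "The committee approved the trade agreement.",
  doc2 = "The trade agreement faced criticism in the committee."
)

# Build a corpus, then split each document into word tokens
corp <- corpus(docs)
toks <- tokens(corp, remove_punct = TRUE)

# Lowercase the tokens and drop common English stopwords
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Document-feature matrix: one row per document, one column per term
dfmat <- dfm(toks)

# Descriptive summary: the most frequent terms across both documents
top_terms <- topfeatures(dfmat, n = 3)
print(top_terms)
```

The same steps could also be expressed with tidytext, for example by calling unnest_tokens() on a data frame of texts; the choice between the two packages is largely a matter of preferred data structure.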
Building on this foundation, the course explores more advanced techniques for analyzing text as data, including topic modeling, sentiment analysis, and Named Entity Recognition (NER). Students will learn how these tools can be applied to research questions in political science, sociology, communication, and related fields.
A distinctive component of the course is the discussion of LLM-assisted (“vibe”) coding: the use of generative AI tools such as GPT or Claude to assist with qualitative and semi-quantitative text analysis. The course will demonstrate how LLMs can accelerate exploratory coding and thematic identification, while also examining their limitations, biases, and reproducibility challenges. Students who wish to experiment with LLM-assisted methods will need an API key for a platform such as OpenAI or Anthropic, though participation in this component is optional.
Course Objectives
By the end of the course, participants will have a practical toolkit for conducting transparent, replicable text analyses that integrate traditional computational methods with emerging AI-assisted approaches. Students will understand the logic and workflow of computational text analysis in R, from importing, cleaning, and structuring text data to conducting preprocessing, tokenization, and feature extraction using packages such as tidytext and quanteda. They will learn to apply intermediate and advanced analytical techniques, including topic modeling and named entity recognition (NER), while critically assessing the role of large language models (LLMs) in social science research. Emphasis will be placed on both the methodological and ethical dimensions of using LLMs, as well as on how to incorporate AI-assisted tools into reproducible, well-documented research pipelines.
Course Prerequisites
No formal prerequisites are required. Basic familiarity with R will be helpful but not essential, as the course begins with a short refresher on R fundamentals, including installing and loading packages, writing and executing code, and handling data structures.
Students entirely new to R are encouraged to explore introductory materials prior to the course:
- w3schools R Tutorial and Quiz: https://www.w3schools.com/r
- DataCamp R Courses: beginner-friendly introductions to R syntax and data handling.
- Swirl R Package: interactive tutorials within RStudio — see https://swirlstats.com/students.html
- Example: A Very Short Introduction to R
Familiarity with basic social science research design and textual data sources will be useful but is not required.
Students interested in experimenting with LLM-assisted analysis should obtain an API key (e.g., for OpenAI or Claude) before the relevant sessions.
Background knowledge:
Maths:
- Linear Regression: Elementary
Statistics:
- OLS: Elementary
- Maximum Likelihood: Elementary
Software:
- R: Elementary


