Rochelle Terman is an Assistant Professor of Political Science at University of Chicago. Her research examines international norms, gender, and human rights using a mix of quantitative, qualitative, and computational methods. She teaches computational social science in a variety of capacities.

Note: this class has been moved to the afternoon

Course Content

This course teaches students to acquire, process, and analyze data from the Internet using the R statistical programming language. The first portion of the class introduces tools to clean, transform, and wrangle data using `tidyverse` packages. We will also review key programming concepts and techniques to make the best use of R. In the second portion of the course, students will learn how to collect internet data in a variety of ways, including through application programming interfaces (APIs) and by scraping the open web. The third portion of the class focuses on analyzing the data we’ve collected, introducing the basics of text analysis and visualization.
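
As a preview of the first portion of the course, here is a minimal sketch of the kind of `tidyverse` wrangling we will practice, using R's built-in `mtcars` data (the particular variables and conversions are illustrative only):

```r
library(dplyr)

# Filter, transform, and summarize the built-in mtcars data frame
mtcars %>%
  filter(cyl == 4) %>%                      # keep only 4-cylinder cars
  mutate(kpl = mpg * 0.425144) %>%          # convert miles per gallon to km per liter
  group_by(gear) %>%                        # group by number of forward gears
  summarize(mean_kpl = mean(kpl), n = n())  # average efficiency and count per group
```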

Course Objectives
This course is geared towards social scientists who are interested in extracting, processing, and analyzing data from the internet. By the end of the course, participants will:
1. Understand basic legal and ethical issues surrounding web scraping.
2. Collect data via RESTful APIs:
   a. Master key principles and concepts of RESTful APIs.
   b. Use plug-and-play R packages for popular APIs such as Twitter, Google Translate, and others.
   c. Write a custom API query to extract data from RESTful APIs, such as the New York Times Article API.
3. Collect data via web scraping:
   a. Understand how HTML and CSS work to display a website.
   b. Inspect a website using Google Developer Tools and SelectorGadget to understand its underlying structure and identify elements.
   c. Write a program that scrapes multiple webpages using R (a minimal sketch follows this list).
   d. Scrape JavaScript-heavy and interactive sites using Selenium.
4. Clean, transform, and wrangle data using `tidyverse` packages.
5. Be introduced to the main methods and techniques involved in modern computational text analysis.
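
To illustrate objective 3c, here is a minimal sketch of a program that scrapes the same element from several pages with `rvest`; the URLs and the `.headline` CSS selector are placeholders rather than real targets covered in class:

```r
library(rvest)
library(purrr)

# Placeholder URLs; replace with the pages you actually want to scrape
urls <- c("https://example.com/page1", "https://example.com/page2")

# Scrape an assumed ".headline" element from a single page
scrape_headlines <- function(url) {
  read_html(url) %>%
    html_elements(".headline") %>%
    html_text2()
}

# Apply the scraper to every page and combine the results
headlines <- unlist(map(urls, scrape_headlines))
```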

Course Prerequisites
Participants must have basic computer skills and be familiar with their computer’s file system (e.g., file paths). We will assume students have basic knowledge of R and RStudio. Participants with no prior experience with R are encouraged to complete this brief tutorial (requiring 2-3 hours) to learn the basics of R before the course.

Representative Background Reading

Justin Grimmer and Brandon Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3): 267-297.

Background Knowledge Required
Computer background: R (elementary)

Day 1: Review of R and Tidyverse.

Day 2: Review of R programming (functions, iteration, conditionals).

Day 3: Introduction to Web APIs. Collecting Twitter Data with rtweet.
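
A sketch of the kind of plug-and-play API access `rtweet` provides (authentication with a Twitter developer token is assumed and not shown):

```r
library(rtweet)

# Search recent tweets containing a hashtag; requires an authorized token
tweets <- search_tweets("#rstats", n = 100, include_rts = FALSE)
```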

Day 4: Writing API Queries. The New York Times Article API.
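
A sketch of a hand-written query against the New York Times Article Search API using `httr` and `jsonlite`; you will need your own API key, and the exact response fields may differ from what is assumed here:

```r
library(httr)
library(jsonlite)

# Send a GET request with query parameters to the Article Search endpoint
resp <- GET(
  "https://api.nytimes.com/svc/search/v2/articlesearch.json",
  query = list(q = "climate change", `api-key` = "YOUR_KEY_HERE")
)

# Parse the JSON response and pull out (assumed) headline fields
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)
headlines <- parsed$response$docs$headline.main
```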

Day 5: Introduction to Web Scraping. How Websites Work.

Day 6: Web Scraping with rvest.
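
A sketch of the basic `rvest` workflow covered on Day 6: parse a page, select elements with a CSS selector (e.g., one found via SelectorGadget), and extract text and attributes. The URL is a placeholder:

```r
library(rvest)

page  <- read_html("https://example.com")   # placeholder URL
links <- html_elements(page, "a")           # select all anchor tags
texts <- html_text2(links)                  # visible link text
hrefs <- html_attr(links, "href")           # link targets
```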

Day 7: Scraping JavaScript-Heavy Sites with Selenium.
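
A sketch of the Day 7 approach, assuming a working local Selenium setup started via `RSelenium::rsDriver()`; the URL and the `h2` selector are placeholders:

```r
library(RSelenium)
library(rvest)

# Start a browser controlled by Selenium (assumes a local driver is available)
driver <- rsDriver(browser = "firefox", port = 4545L)
client <- driver$client

client$navigate("https://example.com/dynamic")   # placeholder URL
Sys.sleep(2)                                     # wait for JavaScript to render

# Grab the rendered HTML and parse it with rvest as usual
html   <- client$getPageSource()[[1]]
titles <- html_text2(html_elements(read_html(html), "h2"))

client$close()
driver$server$stop()
```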

Day 8: Text Analysis 1: Preprocessing, Dictionary Methods, Distinctive Words.
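
A sketch of the preprocessing and dictionary steps from Text Analysis 1, using the `quanteda` package (one of several possible packages; the toy corpus and dictionary are invented for illustration):

```r
library(quanteda)

# A toy two-document corpus
docs <- c(d1 = "Human rights norms are contested.",
          d2 = "States contest international norms.")

# Preprocess: tokenize, lowercase, remove punctuation and stopwords
toks <- tokens(docs, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Build a document-feature matrix and apply a small dictionary
dfm_mat <- dfm(toks)
dict    <- dictionary(list(contestation = c("contest*")))
dfm_lookup(dfm_mat, dict)
```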

Day 9: Text Analysis 2: Supervised and Unsupervised Learning.

Day 10: Visualization and Special Topics.