Please note: This course will be taught in hybrid mode. Hybrid delivery includes synchronous live sessions in which on-campus and online students are taught simultaneously.

Michael Jankowski is a postdoctoral researcher and lecturer at the University of Oldenburg, where he also received his Ph.D. in 2017. His research focuses on party politics, elections, populism, and representation. Many of his research projects rely on data collected from the internet using web scraping techniques. His work has been published in journals such as Political Analysis, European Journal of Political Research, European Political Science Review, and Party Politics.

Marius Sältzer is an Assistant Professor for Computational Social Science at the University of Oldenburg. He obtained his Ph.D. from the University of Mannheim. His research relies on large-scale data collection projects in party politics and legislative behavior. His work has been published in, among others, Party Politics, Information, Communication and Society, and the Journal of Open Source Software.

Course Content

The course provides an introduction to the automated collection of data from the internet as well as methods for efficient data management. The first week covers methods for collecting data from the internet. We start with simple exercises in collecting data from static web pages and then move on to automating these processes. We will also cover how to collect data from dynamic web pages. Finally, the first week will cover how to communicate with Application Programming Interfaces (APIs) to collect data. The second week focuses on questions of data management and storage. This includes working with non-rectangular data and creating SQL databases for efficient data storage. The last sessions will cover the automation of data collection and storage using web servers. All methods are taught using R. Throughout the course we will demonstrate each method using various examples. We strongly encourage students to bring their own data collection projects to the course, and we will provide space to discuss them.

Course Objectives

The internet has dramatically increased the availability of data for social science research. Online platforms and social media networks have provided vast amounts of data on human behavior and interactions. Researchers can now access data on a scale that was previously unimaginable, including information on individuals’ opinions, behaviors, and relationships. Additionally, advances in data analysis techniques and machine learning have made it possible to extract valuable insights from these large data sets. However, despite the abundance of data available on the internet, accessing and effectively utilizing it can prove to be challenging. How does one go about collecting information from web pages and online databases? And what is the best way to manage and store the massive amount of data that can be collected from the internet? These are the themes that this course will cover. Firstly, it offers a comprehensive introduction to automatic data collection techniques for both static and interactive web pages, as well as APIs. Secondly, it provides valuable insights into efficient data management, including different data formats, and working with SQL databases for efficient and flexible data storage.

Course Prerequisites

The only prerequisite is basic working experience with R. You do not need to be an expert in R, as we will provide detailed explanations of all R code. However, we cannot provide an introduction to R. For example, you should be familiar with installing and loading packages and with writing and executing code in R (such as creating objects). Some familiarity with internet technologies like HTML and HTTP is helpful but not required.

Representative Background Reading

You are not required to do any prior reading.


Course Outline

Day 1: Welcome, ethical and legal issues of web scraping, first examples

Day 2: HTML parsing, regular expressions, and collecting data from static web pages

Day 3: Functional programming and automated data collection from static web pages

Day 4: Collecting data from interactive web pages using RSelenium

Day 5: Working with APIs – Part I

Day 6: Reading and storing treelike data 

Day 7: Linking and joining data

Day 8: Basic SQL 

Day 9: Storing data at scale: Server-based data collection and storage 

Day 10: Planning a data project