2V Introduction to Web Scraping and Data Management for Social Scientists

Please note: This course will be taught in hybrid mode. Hybrid delivery of courses will include synchronous live sessions during which on-campus and online students will be taught simultaneously.

Johannes B. Gruber is a Postdoctoral Researcher at the Department of Communication Science at the University of Amsterdam. Within the NEWSFLOWS project, his research explores the flow of information through media systems containing news, social and alternative media. Previously, he worked at the Vrije Universiteit Amsterdam developing open-source research software within the OPTED project. He has developed or worked on software packages for web scraping (traktok, paperboy, cookiemonster), text analysis (spacyr, quanteda.textmodels, stringdist, rwhatsapp, LexisNexisTools), and data storage (amcat4r).

Course Content

The social sciences are increasingly transforming into data-driven disciplines. The main reasons for this development are the abundance of high-quality data available on the internet and the growing prevalence of skills in collecting, processing and storing these data so that they become efficiently usable resources for high-quality research.

The course provides an introduction to the automated collection of data from the internet, as well as methods for efficient and safe data management. The first week covers different modes of storing data for individual researchers and research teams, and the theory and good practices behind them. This includes working with non-rectangular data and the creation of SQL and Elasticsearch databases for efficient data storage. The goal is to leave participants with a few good options to choose from and sufficient knowledge to make an informed choice for their next projects based on their needs and data structures. The first week also serves as an R refresher where the need arises.

The second week focuses on methods for collecting different kinds of data from the internet. Specifically, we cover how to scrape static web pages, how to use public Application Programming Interfaces (APIs), how to discover and employ hidden APIs, and how to scrape dynamic web pages with browser automation tools. Along the way, the course covers the core technologies and data wrangling strategies involved in scraping and how to automate the processes to build large data collections over time.

Course Objectives

The internet has dramatically increased the availability of data for social science research. Online platforms and social media networks have provided vast amounts of data on human behaviour and interactions. Researchers can now access data on a scale that was previously unimaginable, including information on individuals’ opinions, behaviours, and relationships, which is often more truthful and detailed than what participants would or could disclose in surveys. In addition, advances in data analysis techniques and machine learning have made it possible to extract valuable insights from these large data sets. However, despite the abundance of data available on the internet, accessing and using it effectively can be challenging. How does one go about collecting information from web pages and online databases? And what is the best way to manage and store the massive amounts of data that can be collected from the internet? These are the themes that this course will cover.

Course Prerequisites

Students are expected to have a working knowledge of R at the beginning of the course. For example, you should be familiar with installing and loading packages and with writing and executing code in R (such as creating objects). You do not need to be an expert in R, as we will provide detailed explanations of all R code upon request. However, we cannot provide an introduction to R. If you want to refresh your knowledge of R, we recommend the freely accessible R for Data Science (2nd edition) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund (https://r4ds.hadley.nz; a print version is also available). Familiarity with internet technologies like HTML and HTTP is helpful, but not required.

Representative Background Reading

You are not required to do any prior reading. Papers with interesting or creative use cases for web scraping might be distributed closer to the course.

Course Outline

Day 1: Welcome, ethical and legal issues of web scraping, first examples

Day 2: HTML parsing, regular expressions, and collecting data from static web pages

Day 3: Functional programming and automated data collection from static web pages

Day 4: Collecting data from interactive web pages using RSelenium

Day 5: Working with APIs – Part I

Day 6: Reading and storing treelike data

Day 7: Linking and joining data

Day 8: Basic SQL

Day 9: Storing data at scale: Server-based data collection and storage

Day 10: Planning a data project