Matt Loftis is an Assistant Professor of Political Science at Aarhus University in Aarhus, Denmark. He received his PhD from Rice University in Houston, Texas. His substantive research interests are governance, bureaucracy, and party politics, and all of his work draws on the vast amounts of government data freely available on the Internet to build large original data sets.

Course description

This short course prepares students to acquire and process data from the Internet in the R statistical programming language. The course provides principles and a toolkit for several aspects of the process. We begin with tools for accessing web data in a variety of forms, from the open web to varieties of application programming interfaces (APIs). We also cover principles for archiving and cleaning web data and advanced tools for data storage.

Course objectives:

• Understand pitfalls and challenges to acquiring and processing Internet data
• Gain experience accessing data on the open web and via API calls
• Build a toolkit and best practices for your own research

Students will be able to:

• Acquire and process information on the open web using R
• Select appropriate tools for accessing and processing open web data
• Access and process information via API calls using R
• Process, archive, and munge many types of Internet data using R
• Store and work with data in advanced data structures such as relational databases

Homework:

Each day includes a homework assignment. Students are strongly encouraged to make their best attempt at each one, because the assignments serve as launching points for the next day’s class discussion. They also confront students with real-life problems that arise when scraping the open web.

Day 1 – Scraping the open web
• HTML basics
• File downloads
• Archiving and data management principles

Readings
• Chapters 1, 2, & 5: Munzert, Simon, Christian Rubba, Peter Meissner, and Dominic Nyhuis. 2015. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley.
• Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software. Vol. 59. Issue 10.

Day 2 – Scraping the open web
• XPath and other content extraction tools
• Light text processing

Readings
• Chapter 4: Munzert et al.

Day 3 – Advanced web scraping
• AJAX and dynamic pages
• Selenium
• How not to get into trouble

Readings
• Chapter 6: Munzert et al.

Day 4 – Application programming interfaces & more text processing
• Application programming interfaces (APIs)
• Popular APIs:
  o Twitter
  o Facebook
  o Google Translate
• Other examples:
  o EUR-Lex, etc.
• Text processing tools in R

Readings
• Various R package documentation

Day 5 – Advanced data structures
• Relational databases
• Choosing data representations

Readings
• Chapters 3 & 7: Munzert et al.