Matt W. Loftis is an Assistant Professor of Political Science at Aarhus University in Aarhus, Denmark. He earned a BA in journalism from Texas A&M University, an MA in comparative politics from the University of Bucharest in Bucharest, Romania, and a Ph.D. in political science from Rice University. His substantive research interests are in governance, bureaucracy, and party politics, but all of his work involves using the vast amounts of government data freely available on the internet to build large original data sets.

Course description
This course prepares students to acquire and process data from the Internet using the R statistical programming language. The course provides principles and a toolkit for several stages of the process. We begin with tools for accessing web data in a variety of forms, from the open web to varieties of application programming interfaces (APIs). We also cover principles and plenty of hands-on practice for archiving and cleaning web data, an introduction to advanced tools for data storage, and a peek at text analytic tools for using web data.

Course objectives:
• Understand pitfalls and challenges to acquiring and processing Internet data
• Gain experience accessing data on the open web and via API calls
• Build a toolkit and set of best practices for your own research
• Practice some nuts and bolts of data cleaning and basic analysis tools

Students will be able to:
• Acquire and process information on the open web using R
• Select appropriate tools for accessing and processing open web data
• Access and process information via API calls using R
• Process, archive, and munge many types of Internet data using R
• Store and work with data in certain types of advanced data structures
• Apply a few basic text analytic tools

Homework:
Each day includes a homework assignment. Students are strongly encouraged to make their best attempt at the daily homework: assignments are the launching points for the next day’s class discussion, and they confront students with real-life problems they will encounter when scraping the open web.

Background knowledge required
Statistics
OLS = e
Maximum Likelihood = e

Computer Background
R = e
HTML = e

e = elementary, m = moderate, s = strong

Day 1 – Scraping the open web
• HTML basics
• File downloads
• Data cleaning

Readings
• Chapters 1, 2, & 5: Munzert, Simon, Christian Rubba, Peter Meissner, & Dominic Nyhuis. 2015. Automated Data Collection with R. Wiley.
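
Example code
A minimal sketch of the kind of task covered on this day, assuming the rvest package; the URL and file name are purely illustrative.

    library(rvest)

    url <- "https://www.r-project.org/"       # illustrative page
    page <- read_html(url)                    # parse the page's HTML

    ## Extract all link text and do a little cleaning
    links <- html_nodes(page, "a")
    link_text <- trimws(html_text(links))     # strip stray whitespace
    head(link_text)

    ## File downloads: base R can save a remote file directly to disk
    download.file(url, destfile = "r-project.html")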

Day 2 – Scraping the open web
• Archiving & data management principles
• XPath & other content extraction tools
• Light text processing

Readings
• Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software. Vol. 59. Issue 10.
• Chapter 4: Munzert et al.
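
Example code
A minimal sketch of XPath extraction and archiving, assuming the xml2 package; the URL, the XPath expression, and the archive file name are illustrative.

    library(xml2)

    page <- read_html("https://www.r-project.org/")

    ## XPath: select every second-level heading on the page
    headings <- xml_find_all(page, "//h2")
    xml_text(headings)

    ## Archive the raw HTML so the scrape can be reproduced later
    write_html(page, file = "r-project_archive.html")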

Day 3 – Scraping the open web
• XPath & other content extraction tools
• Light text processing

Readings
• Chapter 4: Munzert et al.
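
Example code
A minimal sketch of light text processing with base R regular expressions; the example strings are invented.

    ## Invented example strings
    raw <- c("  Price: 1,200 kr.  ", "Price: 950 kr.")

    ## Strip everything except digits, then convert to numeric
    as.numeric(gsub("[^0-9]", "", raw))   # returns 1200 950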

Day 4 – Advanced web scraping
• AJAX & dynamic pages
• Selenium
• How not to get into trouble

Readings
• Chapter 6: Munzert et al.
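
Example code
A minimal sketch of driving a browser from R, assuming the RSelenium package and a locally available Selenium driver; the URL and XPath are illustrative.

    library(RSelenium)

    ## Start a Selenium server and browser session
    driver <- rsDriver(browser = "firefox", verbose = FALSE)
    remDr  <- driver$client

    remDr$navigate("https://example.com")   # a page that loads content via AJAX
    Sys.sleep(2)                            # crude wait for dynamic content to render

    ## Extract an element from the rendered page
    elem <- remDr$findElement(using = "xpath", "//h1")
    elem$getElementText()

    remDr$close()
    driver$server$stop()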

Day 5 – Scraping & data cleaning practice
• Troubleshooting
• Guided practice
• Data reshaping & merging

Readings
• Chapter 6: Munzert et al.
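
Example code
A minimal sketch of reshaping and merging with base R; the data frames are invented for illustration.

    ## Invented wide-format data
    wide <- data.frame(country = c("DK", "SE"),
                       y2014   = c(1.2, 2.1),
                       y2015   = c(1.4, 2.3))

    ## Reshape from wide to long format
    long <- reshape(wide, direction = "long",
                    varying = c("y2014", "y2015"), v.names = "value",
                    timevar = "year", times = c(2014, 2015), idvar = "country")

    ## Merge with a second table on the shared key
    info <- data.frame(country = c("DK", "SE"), region = "Scandinavia")
    merge(long, info, by = "country")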

Day 6 – Application Programming Interfaces & more text processing
• Application programming interfaces (APIs)
• Popular APIs: Twitter, Facebook, Google Translate

Readings
• Various R package documentation
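
Example code
A minimal sketch of a generic API call using the httr package; the endpoint and query parameters are hypothetical placeholders, not a real service.

    library(httr)

    resp <- GET("https://api.example.com/v1/search",
                query = list(q = "european parliament", per_page = 10))

    stop_for_status(resp)                    # fail loudly on HTTP errors
    result <- content(resp, as = "parsed")   # parse the JSON response into R lists
    str(result, max.level = 1)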

Day 7 – APIs & more text processing
• Building your own calls
• Other examples: EurLex, Sunlight Foundation, Data.gov, etc.
• Text processing tools in R

Readings
• Various R package documentation
• Various API documentation
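
Example code
A minimal sketch of hand-building an API call and parsing the JSON response, assuming the jsonlite package; the base URL and query parameters are hypothetical.

    library(jsonlite)

    ## Paste the endpoint and query parameters together by hand
    base_url <- "https://api.example.org/documents"
    query    <- "?year=2014&lang=en&page=1"

    docs <- fromJSON(paste0(base_url, query))   # fetch and parse in one step
    str(docs, max.level = 1)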

Day 8 – Supplementary tools & review
• Job scheduling
• Text extraction tools
• Adding bots to your work flow
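
Example code
A minimal sketch of scheduling a scraper, assuming the cronR package on a Unix-like system; the script path and schedule are illustrative.

    library(cronR)

    ## Wrap an existing R script in a shell command and schedule it nightly
    cmd <- cron_rscript("/home/user/projects/scrape_news.R")
    cron_add(command = cmd, frequency = "daily", at = "03:00",
             id = "nightly_scrape", description = "Nightly news scrape")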

Day 9 – Using web data
• Light text analysis applications
• Building measurements

Readings
• Various R package documentation
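
Example code
A minimal sketch of building a document-term matrix as a simple text measurement, assuming the tm package; the example texts are invented.

    library(tm)

    ## Invented example documents
    texts <- c("The committee approved the budget",
               "The budget was rejected by parliament")

    ## Build a cleaned corpus and a document-term matrix
    corp <- VCorpus(VectorSource(texts))
    corp <- tm_map(corp, content_transformer(tolower))
    corp <- tm_map(corp, removePunctuation)

    dtm <- DocumentTermMatrix(corp)
    inspect(dtm)    # rows are documents, columns are word counts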

Day 10 – Advanced data structures
• Relational databases
• Choosing data representations

Readings
• Chapters 3 & 7: Munzert et al.
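
Example code
A minimal sketch of storing scraped data in a relational database, assuming the DBI and RSQLite packages; the table and data are invented.

    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), "scraped_data.sqlite")

    ## Invented table of scraped articles
    articles <- data.frame(id    = 1:2,
                           title = c("Budget passes", "Election called"),
                           date  = c("2014-06-01", "2014-09-15"))

    dbWriteTable(con, "articles", articles, overwrite = TRUE)
    dbGetQuery(con, "SELECT title FROM articles WHERE date > '2014-07-01'")
    dbDisconnect(con)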