Matt W. Loftis is an Associate Professor of Political Science at Aarhus University in Aarhus, Denmark. His research focuses on political control of bureaucracy, political agenda setting, and digital and statistical methods.

Course Content: This course will cover automated data collection from a variety of internet-accessible data sources, including a wide range of websites and Application Programming Interfaces (APIs). Students will learn to write software to automate the data-collection process. Data collected automatically over the Internet are generally unstructured, so students will also learn to write software to clean, structure, and visualize their data in preparation for analysis.

Analytical problems covered will relate to the pre-analysis stages of research: making decisions about how data are selected and structured and considering how early data-collection decisions have an impact on later measurement and analysis decisions.

Course objectives: Participants will gain a broad toolkit aimed at (arguably) the most time-consuming part of the research process: data collection, cleaning, and structuring. The course prepares students to execute reliable and efficient large-scale data-collection projects and to think through the implications of their data-collection decisions for their entire research design. Researchers who plan to work with a wide variety of observational data touching both the public and private spheres will find these tools applicable in their future work. These tools can support a variety of research designs, from large-scale quantitative studies to descriptive qualitative projects focusing on few cases.

Course prerequisites: Participants will find it very useful to have experience using the R statistical software. Some familiarity with Internet technologies like HTML and HTTP is also helpful but not required.

Background knowledge required

Computer Background
R = m

e = elementary, m = moderate, s = strong

  • Day 1: Intro to automated data collection & collecting from simple, static web pages
  • Day 2: Regular expressions & collection from simple, static web pages
  • Day 3: Managing big archives & advanced HTML parsing
  • Day 4: Large-scale data collection from web pages
  • Day 5: Open-source intelligence tools for social science research
  • Day 6: Collecting data from dynamic web pages
  • Day 7: Understanding application programming interfaces (APIs)
  • Day 8: Writing your own API interface
  • Day 9: Review, guided practice, & advanced topics
  • Day 10: Exercises with students’ own projects