Matt W. Loftis is an Assistant Professor of Political Science at Aarhus University in Aarhus, Denmark. He earned a BA in journalism from Texas A&M University, an MA in comparative politics from the University of Bucharest in Bucharest, Romania, and a Ph.D. in political science from Rice University. His substantive research interests are in governance, bureaucracy, and party politics, and all of his work draws on the vast amounts of government data freely available on the Internet to build large original data sets.
This course prepares students to acquire and process data from the Internet in the R statistical programming language. The course provides principles and a toolkit for several aspects of the process. We begin with tools for accessing web data in a variety of forms, from the open web to varieties of application programming interfaces (APIs). We also cover principles and plenty of hands-on practice for archiving and cleaning web data, an introduction to advanced tools for data storage and data cleaning, and a peek into text analytic tools for using web data.
•Understand pitfalls and challenges to acquiring and processing Internet data
•Gain experience accessing data on the open web and via API calls
•Provide a toolkit and best practices for your own research
•Practice some nuts and bolts of data cleaning and basic analysis tools
Students will be able to:
•Acquire and process information on the open web using R
•Select appropriate tools for accessing and processing open web data
•Access and process information via API calls using R
•Process, archive, and munge many types of Internet data using R
•Store and work with data in certain types of advanced data structures
•Apply a few basic text analytic tools
Each day includes a homework assignment. Students are strongly encouraged to make their best attempt at each day’s homework. The assignments serve as launching points for the next day’s class discussion, and they confront students with real-life problems they will encounter when scraping the open web.
Background knowledge required
OLS = e
Maximum Likelihood = e
R = e
HTML = e
e = elementary, m = moderate, s = strong