Lucas Leemann is Reader in political science at the University of Essex. He obtained his PhD from Columbia University where he majored in comparative politics and minored in quantitative methodology. His research in comparative politics focuses on institutional origins and direct democratic institutions. In data science he is interested in both measurement (IRT, MrP) and modeling (mostly hierarchical). His articles have been published or are forthcoming in the American Political Science Review, the American Journal of Political Science, the Journal of Politics, Political Analysis, Electoral Studies, and the Swiss Political Science Review.
This course will introduce participants to a fascinating field of statistics. We will see how we can rely on statistical models to gain a deeper understanding of data. This often involves finding optimal predictions and classifications. Machine Learning (also known as Statistical Learning) is developing quickly and is being applied in fields such as business analytics, political science, and sociology.
This course aims to provide an introduction to the data science approach to the quantitative analysis of data using the methods of statistical learning, an approach blending classical statistical methods with recent advances in computational and machine learning. The course will cover the main analytical methods from this field with hands-on applications using example datasets. This will allow students to gain experience with and confidence in using the methods we cover.
Students will know how to successfully apply a number of tools and models for supervised and unsupervised learning. After a short probability refresher, students will learn how to evaluate various methods based on cross-validation. We will then see how we can create optimal prediction models. Creating a good prediction model requires choosing an optimal set of explanatory variables. To this end, we will rely on subset selection and on shrinkage methods such as the lasso and ridge regression. Classification is another prominent topic and we will use decision trees and random forests to solve such problems. Finally, in terms of data reduction, we will rely on principal component analysis. All these tools provide the foundation for students to then solve real-world problems, potentially by combining these various approaches. The focus of this class is on giving students sufficient practical training so that they can fruitfully apply these methods in their own work.
Students are expected to have a solid understanding of linear regression models and should preferably be familiar with models for binary outcomes. Some prior exposure to statistical software is beneficial but not required. The course will also provide a short introduction to RStudio at the beginning. More important than prior training will be a willingness to engage with the topics of the class.