Please note: This course will be delivered in person at the Colchester campus. Online study is not available for this course.

Marco Steenbergen has been a professor of political methodology at the University of Zurich since 2011. Prior to that, he held appointments at the University of Bern, the University of North Carolina at Chapel Hill, and Carnegie Mellon University. Marco’s research covers methodology, as well as political psychology. He has published several books and articles in these areas. His current research focuses on electoral consideration sets, cleavages and identities, and new forms of political participation.

Course content

This course introduces methods of machine learning for social scientists. The broad objective of machine learning is to uncover patterns in data, either as an exploratory device or to make predictions. The course covers a variety of topics, including supervised, unsupervised, and ensemble learning. We discuss the general principles of machine learning, as well as specific algorithms. The choice of technique, as well as application and interpretation, take center stage in the course. Specific algorithms that will be discussed include artificial neural networks, bagging, boosting, classification and regression trees, clustering, decision rules, k-nearest neighbors, principal components, probabilistic learning, random forests, regression, and support vector machines. General principles include cross-validation, global and local interpretation, loss functions, optimization, regularization, and variable importance.

Course Objectives

Machine learning is of ever greater importance in the social sciences, both inside and outside of academia. The ultimate goal of this course is to make you conversant with the most important techniques and ideas of machine learning. This means that you have a good overview of the field and its relevance for social scientific research. It also means that you have sufficient background knowledge to allow you to study further. This is important because a two-week course can only scratch the surface of machine learning, which evolves quickly. Being conversant with machine learning also means that you understand how to implement these methods, which we shall do in R. Note that the examples will be relatively small, with an eye on minimizing computation time. Where necessary, we shall discuss how to engage in big data analysis.

Course Prerequisites

This is an introductory course, meaning that prior familiarity with machine learning is not expected. However, you should be familiar with basic probability theory and with R. In terms of R, you should know how to read and manipulate data, as well as generate plots using tidyverse.

Required text – this text will be provided by ESS:

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer. ISBN: 978-1-4614-7138-7.

Background texts

For a cursory introduction to many of the topics, you might consult: Lantz, Brett. 2019. Machine Learning with R: Expert Techniques for Predictive Modeling. 3rd edition. Packt Publishing.

For an introduction to statistical concepts and R, you might want to consult Learning Statistics with R.

Background knowledge required

Maths

Calculus = elementary

Linear Regression = elementary

Statistics

OLS = elementary

Computer Background

R = moderate

To participate in this course, students are required to bring their own laptops.

General Information

Description

This course is a comprehensive overview of machine learning with a focus on the social sciences. Machine learning comes in several flavours, including (1) unsupervised machine learning; (2) supervised machine learning; and (3) ensemble learning. In this course, we shall touch on all those aspects. The general goal is to give you a broad overview of the field and a strong sense of how these powerful techniques can be put to good social scientific use.

The course alternates between lectures and exercises, so that there is never much time between the explanation of a concept or technique and you using it. We shall do all programming in R. While many courses of this kind focus heavily on predictive performance, we place a great deal of emphasis on interpretation. A socially responsible use of predictive learning requires that we can assess and explain the predictions that are being made, even in highly complex algorithms. Great strides have been made in this area and we shall cover them extensively.

Goals

At the end of the course, you should have mastered the following milestones:

  1. Understand the different uses of machine learning in the social sciences.
  2. Be aware of machine learning errors and their remedies.
  3. Know how to engage in cross-validation, including the ability to select appropriate methods.
  4. Know how to interpret machine learning outputs, including global and local interpretations, as well as variable importance.
  5. Understand a variety of machine learning algorithms in detail, including their implementation in R.

Prerequisites

It is not necessary that you have seen machine learning before. It is useful if you have used the linear regression model before, as it is a starting point for much of the course. A basic knowledge of probability theory is indispensable, as is a working understanding of R. In R, you should know: (1) how to access various data sources; (2) the basic objects of the language; (3) basic operations; (4) how to compute descriptive statistics and create graphs; and (5) the basics of tidyverse.
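As a rough self-check of the R prerequisites listed above, the following sketch touches each of the five points. It is only an illustration, not course material: it assumes the tidyverse package is installed, and it uses the built-in mtcars data set as a stand-in for a real data source. The object names (cars, cars_kpl, summary_by_cyl, scatter) are hypothetical.

```r
# Self-check sketch; assumes the tidyverse package is installed.
library(tidyverse)

# (1) Access a data source. mtcars ships with R and stands in for a real
#     file here; for a CSV on disk you would typically call read_csv().
cars <- as_tibble(mtcars, rownames = "model")

# (2) Basic objects: a tibble (data frame), a numeric vector, a list.
mpg_vector <- cars$mpg
info <- list(n_rows = nrow(cars), n_cols = ncol(cars))

# (3) Basic operations: vectorised arithmetic and logical subsetting.
kpl_vector <- mpg_vector * 0.425144        # miles per gallon to km per litre
efficient  <- mpg_vector[mpg_vector > 25]  # keep only values above 25

# (5) tidyverse basics: create a column, then group and summarise.
cars_kpl <- cars %>%
  mutate(kpl = mpg * 0.425144)

# (4) Descriptive statistics by number of cylinders.
summary_by_cyl <- cars_kpl %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg), sd_mpg = sd(mpg))

# (4) A simple graph: fuel economy against weight, coloured by cylinders.
scatter <- ggplot(cars_kpl, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

print(summary_by_cyl)
print(scatter)
```

If each step reads as familiar, your R background is probably at the level this course assumes.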

Requirements

It is possible to take a single-course exam for this course. Details can be obtained from the Summer School Office.

Course Materials

Required Materials

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. New York: Springer. ISBN: 978-1-4614-7138-7.

Software

You should install R and RStudio on your computer. If you have an older computer, you can also perform the analyses in RStudio Cloud. Details will be provided on the first day of the course.

 

| Day | Topic | Reading | Exercises |
|---|---|---|---|
| Day 1 | What Is Machine Learning? | James et al., Chapters 1 & 2 | To know your data is always the first step! |
| Day 2 | Unsupervised Machine Learning | James et al., Chapter 10 | Principal component analysis, hierarchical cluster analysis, and K-means clustering. |
| Day 3 | A First Foray into Supervised Machine Learning | James et al., Chapters 3 & 5 | Regression analysis, stochastic gradient descent, resampling, and effect plots. |
| Day 4 | Preventing the Dreadful Overfit | James et al., Chapter 6 | Regularization, principal component regression, partial least squares, and model selection. |
| Day 5 | The World Is Not (Always) Linear | James et al., Chapter 7 | Polynomials, splines, and generalized additive models. |
| Days 6-7 | Classifiers | James et al., Chapter 4 | Nearest neighbors, rule-based algorithms, logistic regression, naïve Bayes, and discriminant analysis. |
| Day 8 | Support Vector Machines | James et al., Chapter 9 | Maximum margin classifiers, soft margins, and support vector regression. |
| Day 9 | An Introduction to Neural Networks | Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge: MIT Press. Chapter 6. | Feed forward networks; regularization in ANNs. |
| Day 10 | Tree-based Algorithms and Ensemble Learners | James et al., Chapter 8 | Classification and regression trees, bagging, boosting, stacking, and random forests. |