Please note: This course will be taught in hybrid mode. Hybrid delivery of courses will include synchronous live sessions during which on-campus and online students will be taught simultaneously.

Chris Hare

Christopher Hare is Assistant Professor of Political Science at the University of California, Davis. His research focuses on ideology, polarization, and the application of measurement models and supervised machine learning methods to the study of voter behaviour. His work has been published in the American Journal of Political Science, Political Analysis, and the British Journal of Political Science. He is also co-author of the book Analyzing Spatial Models of Choice and Judgment (second edition).

Course description

Tabular data, in which the individual units of observation are arranged in rows and their attributes in columns, remain central to social science research. These data may represent public opinion surveys, administrative or medical records, financial data, demographic statistics, or countless other sources of information crucial to social scientists. This course focuses on a suite of popular machine learning tools, based on the concept of decision trees, that are ideally suited to the analysis of tabular data. We will cover the theory and mechanics behind tree-based methods and the ways in which we can interpret their results, whether as a means of fine-tuning the model, ensuring fairness in black-box decisions, or understanding components of the population data-generating process. The course will provide an overview of tree-specific and model-agnostic interpretability methods, spanning both global and local techniques.
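
As a brief illustrative preview (not part of the assigned materials), the sketch below grows a single classification tree in R using the rpart package on R's built-in iris data; the data set and tuning values are arbitrary choices for demonstration.

    # Grow a single classification tree (illustrative sketch)
    library(rpart)

    fit <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(minsplit = 20,  # fewest observations in a node to attempt a split
                                         cp = 0.01))     # complexity parameter governing pruning

    print(fit)    # the fitted recursive partitions, printed as a rule hierarchy
    printcp(fit)  # the cost-complexity table used to choose a pruning level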

Course objectives

By the end of the course, students will:

  1. Appreciate the distinction between tree-based and traditional regression modelling strategies for analysing tabular data.
  2. Appreciate the connection between recursive partitioning, rule-based algorithms, and decision trees.
  3. Understand the general process by which decision trees are grown and how specific components of the tree-growing process can be customized by the researcher.
  4. Be comfortable with the concept of parameter tuning in machine learning algorithms.
  5. Understand the theoretical benefits of combining multiple decision trees specifically and ensemble learning models more generally.
  6. Understand (and appreciate the distinctions between) the mechanisms of bagged decision trees, random forests, and boosted decision trees; a brief R sketch of the ensembling idea follows this list.
  7. Be familiar with interpretable machine learning and the most popular strategies for interpreting output from tree-based and other “black-box” algorithms.
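
To make objectives 5 and 6 concrete, here is a minimal R sketch of an ensemble of trees, assuming the randomForest package is installed; the data and settings are illustrative only.

    # Fit a random forest: many decision trees grown on bootstrap samples,
    # each split considering only a random subset of predictors
    library(randomForest)

    set.seed(42)  # bootstrap sampling is random, so fix the seed for reproducibility
    rf <- randomForest(Species ~ ., data = iris,
                       ntree = 500,  # number of trees in the ensemble
                       mtry = 2)     # predictors sampled as split candidates at each node

    # Each tree is evaluated on the observations left out of its bootstrap
    # sample, yielding the out-of-bag (OOB) estimate of generalisation error
    print(rf)

Setting mtry equal to the total number of predictors turns the random forest into plain bagged trees, which is one way to see the two methods as points on a continuum.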


Texts (these will be provided by ESS)

Boehmke, Brad, and Brandon Greenwell. Hands-On Machine Learning with R. New York: Chapman and Hall/CRC, 2019. https://doi.org/10.1201/9780367816377.
Molnar, Christoph. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2nd ed., 2022. christophm.github.io/interpretable-ml-book/.

Course prerequisites

Prior knowledge of Python is not required; the course will teach the essential skills. However, prior familiarity with analysing data (preferably using R) is a must. Additionally, prior exposure to some of the basic concepts behind machine learning (such as resampling and cross-validation strategies) is highly recommended.

Background knowledge

Maths

Linear Regression = elementary

Statistics

OLS = elementary

Computer background

R = moderate

Course outline

Day 1: Learning theory; rule-based algorithms and single decision trees
Day 2: Aggregating trees (bagging)
Day 3: Random forests
Day 4: Boosting
Day 5: Other flavours of random forests and boosting (XGBoost, CatBoost, etc.)
Day 6: Tree-specific interpretability methods
Day 7: Interpreting black box models using the Monte Carlo principle
Day 8: Global interpretability methods I (permutation feature importance, partial dependence plots, accumulated local effects, and individual conditional expectation plots); a hand-rolled example follows this outline
Day 9: Global interpretability methods II (Shapley values)
Day 10: Local interpretability methods (LIME)
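
To preview the model-agnostic ideas of Days 7 and 8, the sketch below computes a crude permutation feature importance by hand: shuffle one predictor at a time and record how much predictive accuracy drops. It assumes the randomForest package; polished implementations of this and related diagnostics (partial dependence, ALE, ICE, Shapley values) are available in packages such as iml and pdp.

    # Hand-rolled permutation importance (illustrative sketch)
    library(randomForest)

    set.seed(42)
    rf <- randomForest(Species ~ ., data = iris)

    # Baseline accuracy; computed on the training data for brevity, though in
    # practice one would use held-out data or out-of-bag predictions
    baseline <- mean(predict(rf, iris) == iris$Species)

    # Permute each predictor in turn: a large accuracy drop means the model
    # relies heavily on that feature
    predictors <- setdiff(names(iris), "Species")
    importance <- sapply(predictors, function(p) {
      shuffled <- iris
      shuffled[[p]] <- sample(shuffled[[p]])
      baseline - mean(predict(rf, shuffled) == iris$Species)
    })

    print(sort(importance, decreasing = TRUE))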