Martin Elff is Professor of Political Sociology at Zeppelin University in Friedrichshafen, Germany. He is a political scientist with major research interests in the fields of political behaviour, party competition, and political methodology. He has published in the European Journal of Political Research, Perspectives on Politics, Electoral Studies and Political Analysis and has authored the R packages ‘memisc’, ‘mclogit’, and ‘munfold’, available from the ‘Comprehensive R Archive Network’ (http://cran.r-project.org).

Course Content: The module introduces the practical analysis of quantitative social science data using R. Consequently, the module is less a theoretical presentation of concepts such as probability, expectation, regression, and statistical significance than an opportunity for participants to “road-test” such concepts with the help of appropriate software, in particular the open source software package R.

Topics covered in the module are (1) basic concepts of data analysis with R; (2) data management – working with variables and data frames; (3) summarising data using tables and graphics; (4) linear regression – model construction and interpretation; (5) generalised linear models for categorical responses, counts, and survival times; (6) advanced statistical graphics – trellis graphics and maps; (7) random variables, random numbers and Monte Carlo simulations; (8) linear algebra with R and regression in matrix form; (9) principal components, factor analysis, and structural equations; (10) special and advanced topics of quantitative data analysis with R (causal inference, multilevel models, or programming techniques – depending on participants’ interests).

Course Objectives: Participants who successfully complete this module will have a solid understanding of the general principles of data analysis and how to put them into practice. They will also have an understanding of the issues and main techniques of multivariate statistical analysis. While a two-week course can hardly cover everything in depth, successful participants will at least be able to identify which of these techniques are appropriate for their research. Further, they will be able to graph their data and conduct their data analysis with the free statistical software system R.

Course Prerequisites: The module introduces a variety of techniques of data analysis and therefore has only few prerequisites. In order to be able to follow the course, participants should have a solid understanding of descriptive statistics and regression. They should also have a certain level of “computer literacy”, that is, they should not be afraid of command-line oriented (as opposed to menu-driven) software and of writing short command scripts. The ability to do that is not presupposed, but the motivation to learn such things is.

Representative Background Reading:
Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer.
Fox, John 2008. Applied Regression Analysis and Generalized Linear Models. (2nd ed.) Thousand Oaks: Sage.
Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage.
Gill, Jeff 2006. Essential Mathematics for Political and Social Research. Cambridge: Cambridge University Press.
Venables, W.N., and Ripley, B.D. 2002. Modern Applied Statistics with S. (4th ed.) New York: Springer.

The course has two major parts. Apart from a basic introduction to the R environment for data analysis, the first week mainly deals with the main steps in the workflow from data acquisition to data analysis. The second week deals with various topics that allow participants to use R to gain a deeper understanding of the foundations of data analysis and statistical inference, and with some advanced aspects of statistical graphics with R. There are also a couple of strictly optional modules on particular or advanced aspects of using R, which will be covered if requested by course participants.

There is no textbook for this course. However, some further reading is suggested for those who want to delve deeper into the topics of the course. Learning to use R is best done practically rather than by reading books. For this reason, participants are provided with exercises that are done in class, the solutions of which will also be discussed during class sessions.

Week 1

Monday – Basic Concepts: Data Objects, Basic Computation and Programming
This session introduces the basic concepts of R: how R differs from other statistical software, how data are represented in R, how elementary computations can be done in R (such as using R like a pocket calculator), and how repetitive computational tasks can be automated by user-defined functions and control structures. The aim of this session is to provide users with an orientation in the R environment for data analysis.
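
To give a flavour of these basics, here is a minimal sketch in R. The object and function names (`ages`, `standardise`, `running_total`) are invented for illustration and are not taken from the course materials.

```r
# Use R like a pocket calculator
(3 + 4) * 2          # 14

# Data are held in objects, for example a numeric vector
ages <- c(23, 35, 47, 51, 62)

# A user-defined function that standardises a numeric vector
standardise <- function(x) {
  (x - mean(x)) / sd(x)
}
z_ages <- standardise(ages)

# A control structure: a loop that accumulates a running total
running_total <- 0
for (a in ages) {
  running_total <- running_total + a
}
running_total        # same value as sum(ages)
```

In practice, vectorised functions such as `sum()` usually replace explicit loops; the loop is shown only to illustrate the control structure.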

Suggested further reading:
Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapter 1.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, sections 1.1-1.3, 8.1, 8.3.

Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapter 1.

Tuesday – Data Management: Variables, Data Frames and Data Manipulation
This session deals with the crucial steps that precede any serious statistical data analysis in R. Before data can be analysed, they must be prepared appropriately. First of all, data provided by social science data archives often come in the binary formats of statistical packages other than R, and in other cases data providers deliver the data in some tabular format. So this session starts with importing data from such “foreign” sources. Second, data are not always structured in a way that is appropriate for the intended analysis. Therefore another topic of this session is data recoding, labelling, and the handling of missing values. Furthermore, the session discusses how to merge and append data sets and how to recast data sets from a wide into a long format for repeated-measures analysis.
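
The data preparation steps named here could look roughly like the following sketch, which uses only base R. The data set, the variable names, and the assumption that the code 9 marks a missing answer are all hypothetical.

```r
# Hypothetical survey data in wide format: one row per respondent
wide <- data.frame(
  id      = 1:3,
  vote_w1 = c(1, 2, 9),   # assumption: 9 is the code for "no answer"
  vote_w2 = c(1, 1, 2)
)

# Handle missing values: turn the code 9 into NA
wide$vote_w1[wide$vote_w1 == 9] <- NA

# Recode and label: numeric codes become a labelled factor
wide$party_w1 <- factor(wide$vote_w1, levels = c(1, 2),
                        labels = c("Party A", "Party B"))

# Merge with a second (hypothetical) data set on the respondent id
background <- data.frame(id = 1:3, female = c(TRUE, FALSE, TRUE))
merged <- merge(wide, background, by = "id")

# Recast from wide to long format: one row per respondent and wave
long <- reshape(
  merged[, c("id", "female", "vote_w1", "vote_w2")],
  direction = "long",
  varying   = c("vote_w1", "vote_w2"),
  v.names   = "vote",
  timevar   = "wave",
  times     = 1:2,
  idvar     = "id"
)
long
```

Importing data from other statistical packages typically relies on add-on packages and is therefore left out of this sketch.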

Suggested further reading:
Elff, Martin, 2008. Analysing the American National Election Study of 1948 using the memisc package. R-package vignette, http://cran.r-project.org/web/packages/memisc/

Fox, John, 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 2.

Spector, Phil. 2008. Data Manipulation with R. New York: Springer.

Wednesday – Summarising Data: Tables, Descriptive Statistics and Basic Graphics
Preliminary research questions can often be addressed by descriptive statistics, such as contingency tables or conditional means. This session therefore discusses how to create tables of frequencies and other descriptive statistics. Further, structures in the data can often be elucidated by statistical graphics. Since R offers many facilities for creating such graphics, and since these capabilities are a main point of attraction for many users of R, the creation of diagrams, scatter plots, bar charts, and mosaic displays is discussed extensively in this session.
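
As a small illustration, the following sketch builds a contingency table, conditional means, and two basic plots with base R functions. The eight-respondent data set is made up for the example.

```r
# Hypothetical mini survey
survey <- data.frame(
  gender = factor(c("f", "m", "f", "f", "m", "m", "f", "m")),
  vote   = factor(c("A", "B", "A", "B", "B", "A", "A", "B")),
  age    = c(23, 45, 36, 52, 61, 29, 48, 33)
)

# Contingency table of gender by vote, and row percentages
tab <- table(gender = survey$gender, vote = survey$vote)
tab
prop.table(tab, margin = 1)

# Conditional means: average age within each gender group
cond_means <- tapply(survey$age, survey$gender, mean)
cond_means

# Basic graphics: a grouped bar chart and a mosaic display of the table
barplot(tab, beside = TRUE, legend.text = TRUE)
mosaicplot(tab, main = "Vote by gender")
```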

Suggested further reading:
Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapter 3.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 3.

Thursday – Linear Regression with R: Model Construction and Interpretation
Linear regression and its generalisations are widely used tools for data analysis in the social sciences. For this reason this session discusses how to construct and estimate linear regression models in R. It covers the construction of regression models for metric dependent and independent variables, regression models with dummy variables for categorical independent variables, and regression models with interaction effects. Various aspects of graphical model diagnostics are also discussed, as is the presentation of regression estimates in the format required by many publishers.
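
A minimal sketch of such models, assuming simulated data with invented variable names (`income`, `educ`, `region`), might look as follows. R creates the dummy variable for the factor `region` automatically, and `educ * region` expands to both main effects plus their interaction.

```r
set.seed(1)

# Simulated (hypothetical) data: a metric outcome, a metric predictor,
# and a categorical predictor with two levels
n      <- 200
educ   <- rnorm(n, mean = 12, sd = 3)
region <- factor(sample(c("North", "South"), n, replace = TRUE))
south  <- as.numeric(region == "South")
income <- 10 + 2 * educ + 5 * south + 0.5 * educ * south + rnorm(n, sd = 2)
dat    <- data.frame(income, educ, region)

# Additive model: the factor enters as a dummy variable
fit_main <- lm(income ~ educ + region, data = dat)

# Model with an interaction between education and region
fit_int <- lm(income ~ educ * region, data = dat)
summary(fit_int)

# F-test comparing the two nested models
anova(fit_main, fit_int)

# Standard graphical diagnostics (residuals, Q-Q plot, leverage)
par(mfrow = c(2, 2))
plot(fit_int)
```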

Suggested further reading:
Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapter 3.

Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapters 5, 9-10.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 4.

Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapters 5-7.

Friday – Steps Beyond Linear Regression: Models for Categorical Responses, Counts, and Survival Times
Linear regression requires the dependent variable to be metric (interval or ratio scaled). Yet the variables that social scientists want to analyse are often categorical, or involve frequencies and durations, for which the classical linear regression model is not appropriate. This session shows how to construct and estimate models for data of these kinds in R. Some challenges in the interpretation of such models will be addressed, as well as some tricks for the graphical presentation of their implications. Topics covered are (a) linear vs generalised linear models – a review; (b) logit and probit regression models; (c) Poisson regression; (d) models for polychotomous dependent variables; (e) duration and hazard models.
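
For binary and count responses, a sketch along these lines can be written with `glm()` from base R. The data are simulated and the variable names (`turnout`, `contacts`) are hypothetical; models for polychotomous responses and survival times usually need add-on packages and are not shown.

```r
set.seed(2)

# Simulated (hypothetical) data: a binary and a count response
n        <- 500
x        <- rnorm(n)
turnout  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
contacts <- rpois(n, lambda = exp(0.3 + 0.4 * x))

# Logit and probit models for the binary response
logit_fit  <- glm(turnout ~ x, family = binomial(link = "logit"))
probit_fit <- glm(turnout ~ x, family = binomial(link = "probit"))

# Poisson regression for the count response
pois_fit <- glm(contacts ~ x, family = poisson)

# Coefficients are on the link scale, so predicted probabilities
# over a grid of x values are often easier to interpret and to plot
grid <- data.frame(x = seq(-2, 2, by = 0.5))
grid$p_hat <- predict(logit_fit, newdata = grid, type = "response")
plot(grid$x, grid$p_hat, type = "l",
     xlab = "x", ylab = "Predicted probability")
```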

Suggested further reading:
Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapters 4, 5, 6.

Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapters 11, 12.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 5.

Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapter 8.

Week 2

Monday – Advanced Graphics: Tailor-made Diagrams, Lattice Graphics and Maps
R not only provides a set of standard statistical graphics but also makes it quite easy to combine graphical elements, such as lines, dots, and rectangles, as building blocks of tailor-made graphics for one’s own particular purposes of representing data summaries or estimation results. In this session we will discuss how to create new types of diagrams out of these basic graphical elements. In addition we will discuss so-called lattice graphics, which are a great tool for comparing relations between variables in different groups or under different conditions, and thus for visualising interaction effects. Finally we will explore how to create maps in R, which allows statistical summaries or model predictions with geographical referents to be represented.

Suggested further reading:
Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 7.

Murrell, Paul. 2005. R Graphics. Boca Raton: Chapman & Hall/CRC.

Tuesday – Exploring the Foundations of Statistical Inference: Random Variables and Distributions in R, Random Numbers and Monte Carlo Simulations
Most models used in conventional statistical data analysis are probability models. Thus a grasp of the concept of probability is essential for a solid understanding of the fundamentals of statistical inference. R provides many facilities that help in gaining such an understanding. It provides density, probability mass, and cumulative distribution functions for many common statistical distributions, as well as excellent random number generators for these distributions. In this session we will make use of these facilities to gain an understanding of how point and interval estimates and statistical hypothesis tests “work”. We will also use simulation studies to explore the consequences of violating the assumptions on which many “off-the-shelf” statistical procedures rest.
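
A simulation of this kind could be sketched as follows. The helper function `coverage()` is hypothetical and written only for this illustration; it estimates how often a nominal 95% t-interval contains the true mean, first when the normality assumption holds and then for a small sample from a skewed distribution.

```r
set.seed(123)

# Distribution functions follow a common naming scheme:
# d = density, p = cumulative distribution, q = quantile, r = random numbers
dnorm(0)        # standard normal density at 0
pnorm(1.96)     # P(Z <= 1.96)
qnorm(0.975)    # 97.5% quantile of the standard normal
rnorm(5)        # five standard normal random numbers

# Hypothetical helper: Monte Carlo estimate of the coverage rate of the
# 95% t-interval, where draw() generates one sample
coverage <- function(draw, true_mean, n_sims = 2000) {
  hits <- replicate(n_sims, {
    y  <- draw()
    ci <- t.test(y, conf.level = 0.95)$conf.int
    ci[1] <= true_mean && true_mean <= ci[2]
  })
  mean(hits)
}

# Assumptions met: normal data, n = 30
cov_normal <- coverage(function() rnorm(30, mean = 5, sd = 2), true_mean = 5)

# Assumptions violated: skewed (exponential) data, n = 10
cov_skewed <- coverage(function() rexp(10, rate = 1), true_mean = 1)

c(normal = cov_normal, skewed = cov_skewed)
```

Comparing the two estimated coverage rates with the nominal 0.95 shows how sensitive the procedure is to the violated assumption.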

Suggested further reading:
Chihara, Laura and Tim Hesterberg. 2011. Mathematical Statistics with Resampling and R. Hoboken, NJ: Wiley, chapters 3, 4, 6, 7, 8, appendices A, B.

Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapters 2-4.

Wednesday – Linear Algebra and the Geometry of Linear Regression with R
This session introduces fundamental concepts of linear algebra, such as vectors and arrays, that are necessary to grasp several advanced topics in multivariate data analysis and statistical inference. Furthermore, it discusses arithmetic operations on vectors and matrices, and equations involving matrices and vectors. These concepts are, however, discussed in a hands-on manner using R rather than in the abstract, in order to give participants an intuitive understanding of how they can be put to good use. Thus the topics of this lesson are: (a) vectors, matrices, and arrays; (b) linear systems and matrix inverses; (c) linear regression in matrix form; (d) the geometry of least-squares solutions; (e) matrix algebra and model-based inference.
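
As an illustration of topics (c) and (d), the following sketch computes least-squares estimates from the normal equations, that is, by solving (X'X) b = X'y, and compares them with the result of `lm()`. The data are simulated and all object names are chosen for this example only.

```r
set.seed(42)

# Simulated data with two predictors
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix with a column of ones for the intercept
X <- cbind(1, x1, x2)

# Solve the normal equations (X'X) b = X'y for b
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Same estimates as lm(), up to numerical precision
fit <- lm(y ~ x1 + x2)
all.equal(as.vector(beta_hat), unname(coef(fit)))

# Geometry: the hat matrix projects y onto the column space of X,
# so the residuals are orthogonal to every column of X
H <- X %*% solve(crossprod(X)) %*% t(X)
e <- y - H %*% y
crossprod(X, e)   # numerically zero
```

Note that `lm()` itself uses a QR decomposition rather than the normal equations, which is numerically more stable; the explicit formula is shown for its didactic value.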

Suggested further reading:
Gill, Jeff 2006. Essential Mathematics for Political and Social Research. Cambridge: Cambridge University Press, chapters 2, 3, and 4.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, sections 2.3 and 8.4.

Thursday – Principal Components, Factor Analysis, and Structural Equations
Regression models rest on the distinction between dependent and independent variables, or between responses and regressors/predictors. There are, however, also research questions for which such distinctions do not make much sense and researchers are more interested in patterns and structures residing in the data. These research questions are addressed by the methods discussed in this and the following lesson. This session focusses on methods that emphasize relations between variables and that distinguish between latent and manifest variables, that is, principal component analysis and factor analysis. More specifically, the topics of this lesson are: (a) foundations and applications of principal component analysis; (b) the general factor model and confirmatory factor analysis; (c) systems of simultaneous equations; (d) structural equation models with latent variables; (e) latent variable models with binary and ordinal indicators.
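
A first impression of topics (a) and (b) can be had with the base R functions `prcomp()` and `factanal()`. The sketch below assumes six simulated indicators driven by a single latent variable; note that `factanal()` fits an exploratory rather than a confirmatory factor model, and that confirmatory and structural equation models require add-on packages.

```r
set.seed(7)

# Simulated (hypothetical) data: six manifest indicators of one latent variable
n      <- 300
latent <- rnorm(n)
items  <- sapply(1:6, function(j) 0.8 * latent + rnorm(n, sd = 0.6))
colnames(items) <- paste0("item", 1:6)

# Principal component analysis on the standardised indicators
pca <- prcomp(items, scale. = TRUE)
summary(pca)          # proportion of variance per component

# Maximum-likelihood (exploratory) factor analysis with one factor
fa <- factanal(items, factors = 1)
fa$loadings
```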

Suggested further reading:
Bartholomew, David J., Fiona Steele, Irini Moustaki, and Jane I. Galbraith 2008. Analysis of Multivariate Social Science Data. (2nd ed.) Boca Raton: Chapman & Hall/CRC, chapters 5 and 7.

Venables, W.N., and Ripley, B.D. 2002. Modern Applied Statistics with S. (4th ed.) New York: Springer, section 11.3.

Friday – Special and Advanced Topics of Data Analysis with R
This last session will be used to address the specific needs of the participants, to review topics that need a more thorough discussion, or to introduce advanced topics in which participants may be interested. Consequently, there are no pre-determined topics for this lesson, although several possible topics may be considered. These include: (a) cluster analysis; (b) multidimensional scaling and unfolding; (c) time series analysis; (d) linear and generalised linear mixed-effects models; (e) models with instrumental variables; (f) non-linear and semi-parametric extensions of the (generalised) linear model; (g) numeric optimisation and general maximum likelihood; (h) parametric and non-parametric bootstrapping; (i) design-based causal inference; (j) textual data and computational content analysis; (k) advanced programming concepts: classes and methods, parallel computations.

Suggested further reading:
Adler, Joseph. 2010. R in a Nutshell: A Desktop Quick Reference. Sebastopol, CA: O’Reilly.

Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapters 8, 9.

Braun, W. John and Duncan J. Murdoch. 2007. A First Course in Statistical Programming with R. Cambridge: Cambridge University Press.

Chihara, Laura and Tim Hesterberg. 2011. Mathematical Statistics with Resampling and R. Hoboken, NJ: Wiley, chapters 5, 10, 11.

Chambers, John M. 2008. Software for Data Analysis: Programming in R. New York: Springer.

Gelman, Andrew and Jennifer Hill. 2007. Data Analysis using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.

Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics with R. New York: Springer.

Lumley, Thomas, 2010. Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ: Wiley.

Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapters 9-14.

Ritz, Christian and Jens Carl Streibig 2009. Nonlinear Regression with R. New York: Springer.

Venables, W.N., and Ripley, B.D. 2002. Modern Applied Statistics with S. (4th ed.) New York: Springer, chapter 8.

Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Boca Raton, FL: Chapman & Hall/CRC.