Martin Elff is Professor of Political Sociology at Zeppelin University in Friedrichshafen, Germany. He is a political scientist with research interests in the fields of political behaviour, party competition, and political methodology. He has published in the European Journal of Political Research, Perspectives on Politics, Electoral Studies and Political Analysis and has authored the R packages ‘memisc’, ‘mclogit’, and ‘munfold’, published on the ‘Comprehensive R Archive Network’ (http://cran.r-project.org).

Course Content
The module introduces participants to the practical analysis of quantitative social science data using R. Consequently, the module is not so much a theoretical presentation of concepts such as probability, expectation, regression, statistical significance etc. but rather emphasizes enabling participants to “road-test” such concepts with the help of appropriate software, in particular the open source software package R.

This module covers at least the following topics: (1) basic concepts of data analysis with R; (2) data management – working with variables and data frames; (3) summarising data using tables and graphics; (4) linear regression – model construction and interpretation; (5) testing statistical hypotheses in R; (6) generalised linear models for categorical responses, counts, and survival times; (7) advanced statistical graphics; (8) multivariate data analysis – principal components, factor analysis, and structural equations. In addition to these, a few more topics are optionally covered if time permits and depending on participants’ interests, such as random variables, random numbers and Monte Carlo simulations; linear algebra with R and regression in matrix form; multilevel models; or programming techniques.


Course Objectives

Participants who successfully complete this module will have a solid understanding of the general principles of data analysis and how to put them into practice. They will also have an understanding of the issues and main techniques of multivariate statistical analysis. While a two-week course can hardly cover all of these in depth, successful participants will at least be able to identify which of these techniques are appropriate for their research. Further, they will be able to graph their data and conduct their data analysis with the free statistical software system R.

Course Prerequisites

Representative Background Reading
Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer.

Fox, John 2008. Applied Regression Analysis and Generalized Linear Models. (2nd ed.) Thousand Oaks: Sage.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage.

Gill, Jeff 2006. Essential Mathematics for Political and Social Research. Cambridge: Cambridge University Press.

Venables, W.N., and Ripley, B.D. 2002. Modern Applied Statistics with S. (4th ed.) New York: Springer.

Background knowledge required
Statistics
OLS = m
Maximum Likelihood = e

e = elementary, m = moderate, s = strong

The course has two major parts. Apart from a basic introduction to the R environment for data analysis, the first week mainly deals with the main steps in the workflow from data acquisition to data analysis. The second week deals with various topics that allow participants to use R to gain a deeper understanding of the foundations of data analysis and statistical inference, and with some advanced aspects of statistical graphics with R. There are also a couple of strictly optional modules on particular and/or advanced aspects of using R, which will be covered if requested by course participants.

There is no textbook for this course. However, some further reading is suggested for those who want to delve deeper into the topics in the course. Rather than by reading books, learning to use R is best done practically. For this reason, participants are provided with exercises that are done in class, the solutions of which will also be discussed during class sessions.

The following outline gives a rough overview of the topics that are covered in the day-to-day sessions. Since the course is intended to be responsive to the participants’ needs and interests, the actual course may deviate from this outline because some topics may take longer than planned, while others may be skipped or shortened.

Week 1

Monday – Basic Introduction
In this section we discuss the basic usage of R and of one of its main user interfaces, RStudio. We will further look at how variables are defined and how objects can be saved (stored in a computer file) and restored. Finally, we discuss how extension packages can be installed and activated.
Numeric data, computations and logical operations
Numbers and numeric vectors are discussed as well as basic arithmetic and statistical computations. Furthermore, we look at comparisons between numbers and how they can be used to make data selections. Also, we will take a first look at the definition of arithmetic functions based on formulae.
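As a rough illustration of the kind of material this part covers, here is a minimal sketch. The data values and object names are invented for illustration and are not taken from the course materials.

```r
# Hypothetical example data: ages of six survey respondents
age <- c(23, 35, 41, 58, 62, 19)

# Basic arithmetic and statistical computations on a numeric vector
mean_age <- mean(age)
centred  <- age - mean_age

# A comparison yields a logical vector ...
over_40 <- age > 40

# ... which can be used to select the matching elements
age[over_40]

# An arithmetic function defined from a formula: z = (x - mean) / sd
standardise <- function(x) (x - mean(x)) / sd(x)
z <- standardise(age)
```

After standardising, `z` has mean 0 and standard deviation 1, up to floating-point error.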
The variety of data types in R
We will look at data beyond numbers – logical data, character data, factors, strings etc. – and how to make good use of them in data analysis. Lists and how they can be used to construct more complex data types are also discussed.
Suggested further reading:
– Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapter 1.
– Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, sections 1.1-1.3, 8.1, 8.3.
– Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapter 1

Tuesday
Exercises
Data sets in R (data frames)

In this section we will take a closer look at how data sets (as common in social science data analysis) are constructed and used in R.
Basic programming ideas
R is not only software for (advanced) data analysis; it is also a programming language. This section discusses basic programming concepts such as branches and loops, formal arguments of functions, and local and global variables – and how these concepts can be put to work to avoid (or automate) repetitive computational tasks.
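The concepts named above can be sketched in a few lines. The data frame `survey` and the function `describe` are hypothetical names made up for this illustration.

```r
# Hypothetical survey data, invented for illustration
survey <- data.frame(
  age   = c(23, 35, 41, 58),
  party = factor(c("A", "B", "A", "A"))
)

# A function with a formal argument 'digits' that has a default value;
# 'result' is a local variable that exists only inside the function
describe <- function(x, digits = 1) {
  if (is.numeric(x)) {
    # Branch for numeric variables: rounded mean
    result <- round(mean(x), digits)
  } else {
    # Branch for everything else: the most frequent category
    result <- names(which.max(table(x)))
  }
  result
}

# A loop applies the same function to every variable in turn
for (v in names(survey)) {
  cat(v, ":", format(describe(survey[[v]])), "\n")
}

# The same task without an explicit loop
summaries <- lapply(survey, describe)
```

Using `lapply()` instead of the `for` loop returns the results as a list, which is often more convenient when the summaries are needed for further computation.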
Suggested further reading:
– Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapter 1.
– Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, sections 1.1-1.3, 8.1, 8.3.
– Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapter 1

Wednesday
Exercises
Summarising Data: Tables, Descriptive Statistics and Basic Graphics

Often preliminary research questions can be addressed by descriptive statistics, such as contingency tables or conditional means. This session therefore discusses how to create tables of frequencies and other descriptive statistics. Further, structures in the data can often be elucidated by statistical graphics. Since R offers many facilities for creating such graphics, and since these capabilities are a main attraction for many users of R, the creation of diagrams, scatter plots, bar charts and mosaic displays is discussed extensively in this session.
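A minimal sketch of a contingency table and of conditional means follows. It uses the `mtcars` data set that ships with R, purely because it is always available; it is not social science data and is not part of the course materials. Graphics are left out of the sketch.

```r
# Contingency table of two categorical variables
tab <- table(cylinders = mtcars$cyl, gears = mtcars$gear)

# Row percentages: distribution of gears within each cylinder group
row_pct <- round(100 * prop.table(tab, margin = 1), 1)

# Conditional means: average fuel efficiency by number of cylinders
mean_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)

tab
row_pct
mean_mpg
```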
Suggested further reading:
– Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapter 3.
– Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 3.

Thursday
Exercises
Data management: preparing data for analysis

This session deals with the crucial steps that precede any serious statistical data analysis in R. Before data can be analysed, it must be prepared appropriately. First of all, data provided by social science data archives often comes in the binary formats of statistical packages other than R, and in other cases data providers deliver the data in some tabular format. So this session starts with importing data from such “foreign” sources. Second, data is not always structured in a way that is appropriate for the intended analysis. Therefore another topic of this session is data recoding, labelling and the handling of missing values. Furthermore, we discuss how to merge and append data sets and how to recast data sets from a wide into a long format for repeated-measures analysis.
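The recoding, merging and reshaping steps might look roughly as follows in base R. All data, variable names and the missing-value code 99 are assumptions made for this illustration; the course itself may use other tools, such as the memisc package.

```r
# Hypothetical wide-format panel data: one row per respondent
wide <- data.frame(
  id      = 1:3,
  trust_1 = c(4, 99, 2),   # 99 is assumed to be a missing-value code
  trust_2 = c(5, 3, 99)
)

# Recode the assumed missing-value code to NA
wide$trust_1[wide$trust_1 == 99] <- NA
wide$trust_2[wide$trust_2 == 99] <- NA

# Merge with respondent attributes from a second (hypothetical) data set
attrs  <- data.frame(id = 1:3,
                     region = factor(c("North", "South", "North")))
merged <- merge(wide, attrs, by = "id")

# Recast from wide to long format: one row per respondent and wave
long <- reshape(merged,
                direction = "long",
                varying   = c("trust_1", "trust_2"),
                v.names   = "trust",
                timevar   = "wave",
                times     = 1:2,
                idvar     = "id")
```

The long data set has one row per respondent and wave, which is the layout that repeated-measures models usually expect.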
Suggested further reading:
– Elff, Martin, 2008. Analysing the American National Election Study of 1948 using the memisc package. R-package vignette, http://cran.r-project.org/web/packages/memisc/
– Fox, John, 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 2.
– Spector, Phil. 2008. Data Manipulation with R. New York: Springer.

Friday
Exercises
Linear Regression with R: Model Construction and Interpretation

Linear regression and its generalisations are widely used tools for data analysis in the social sciences. For this reason this session discusses how to construct and estimate linear regression models in R. It covers the construction of regression models for metric dependent and independent variables, regression models with dummy variables for categorical independent variables, and regression models with interaction effects. Also, various aspects of graphical model diagnostics are discussed, as is the formatting of regression estimates in the form required by many publishers.
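The three kinds of model mentioned above can be sketched with R's formula interface. The built-in `mtcars` data are used here only as a convenient stand-in for real survey data, and the object names are made up for illustration.

```r
# Turn the 0/1 transmission indicator into a labelled factor
cars <- transform(mtcars,
                  am = factor(am, labels = c("automatic", "manual")))

# Metric dependent and independent variable
m1 <- lm(mpg ~ wt, data = cars)

# Adding a categorical predictor: R creates the dummy variable itself
m2 <- lm(mpg ~ wt + am, data = cars)

# Interaction effect: 'wt * am' expands to wt + am + wt:am
m3 <- lm(mpg ~ wt * am, data = cars)

# Coefficient table and a comparison of the nested models
summary(m3)
anova(m1, m2, m3)
```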
Suggested further reading:
– Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapter 3.
– Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapters 5, 9-10.
– Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 4.
– Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapters 5-7.

Week 2

Monday
Exercises
Hypothesis Testing and Statistical Significance

Results of data analyses are hardly publishable if they do not involve “statistically significant” findings. The aim of this lesson is to give participants an understanding of what statistical significance actually means and how the concept of significance is connected to the testing of statistical hypotheses. Again, R is used not only to compute some examples but also to demonstrate the concepts involved with the help of computer simulations. The following topics are covered in this lesson: (a) construction of hypothesis tests, type I and type II errors and levels of statistical significance; (b) testing differences in means; (c) testing stochastic independence; (d) testing hypotheses about regression models; (e) confidence intervals and their relation to hypothesis tests.
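One way a simulation can make the meaning of a significance level concrete is sketched below. The seed, sample size and number of replications are arbitrary choices for this illustration, not values taken from the course.

```r
set.seed(42)      # arbitrary seed, only for reproducibility

n_sims <- 2000    # number of simulated studies
n      <- 30      # observations per group

# Both groups are drawn from the same population, so the null
# hypothesis of equal means is true in every simulated study
p_values <- replicate(n_sims, {
  x <- rnorm(n)
  y <- rnorm(n)
  t.test(x, y)$p.value
})

# Share of studies that reject at the 5% level: the type I error rate
type1_rate <- mean(p_values < 0.05)
type1_rate
```

Because the null hypothesis holds by construction, the rejection rate should come out close to the nominal level of 0.05.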
Suggested further reading:
– Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapter 2.
– Casella, George and Roger L. Berger 2002. Statistical Inference. (2nd ed.) Pacific Grove, CA: Duxbury, chapters 6-10.
– Cox, D.R. 2006. Principles of Statistical Inference. Cambridge: Cambridge University Press.
– Fox, John 2008. Applied Regression Analysis and Generalized Linear Models. (2nd ed.) Thousand Oaks: Sage, chapter 6.
– Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, sections 4.3-4.6.

Tuesday
Exercises
Beyond linear regression: Models for binary responses, counts, and durations (Generalised linear models)

Linear regression requires the dependent variable to be metric (interval or ratio scaled). Yet often variables that social scientists want to analyse are categorical or involve frequencies and durations for which the classical linear regression model is not appropriate. This session shows how to construct and estimate models for data of these kinds in R. Some challenges in the interpretation of such models will be addressed, as well as some tricks for the graphical presentation of their implications. Topics covered are: (a) linear vs generalised linear models – a review; (b) logit and probit regression models; (c) Poisson regression; (d) models for polychotomous dependent variables; (e) duration and hazard models.
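Topics (b) and (c) can be sketched with `glm()`. The built-in `mtcars` data serve only as a stand-in here: the binary variable `am` and the count variable `carb` are not social science measures, and the object names are made up for illustration.

```r
# Logit and probit models for a binary response
logit_fit  <- glm(am ~ wt, data = mtcars,
                  family = binomial(link = "logit"))
probit_fit <- glm(am ~ wt, data = mtcars,
                  family = binomial(link = "probit"))

# Poisson regression for a count response
pois_fit <- glm(carb ~ hp, data = mtcars, family = poisson)

# Coefficients are on the link scale and hard to read directly;
# predicted probabilities for chosen values are easier to interpret
newdata <- data.frame(wt = c(2, 3, 4))
p_hat   <- predict(logit_fit, newdata = newdata, type = "response")
p_hat
```

Plotting such predicted probabilities against the predictor is one common way to present the implications of a model graphically.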
Suggested further reading:
– Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapters 4, 5, 6.
– Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapters 11, 12.
– Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 5.
– Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapter 8.

Wednesday
Exercises
Advanced statistical graphics

R not only provides a set of standard statistical graphics but also makes it quite easy to combine graphical elements, such as lines, dots, and rectangles, as building blocks of tailor-made graphics for one’s own particular purposes of representing data summaries or estimation results. In this session we will discuss how to create new types of diagrams out of these basic graphical elements. In addition, we will discuss so-called lattice graphics, which are a great tool for comparing relations between variables in different groups or under different conditions, and thus for the visualisation of interaction effects. Finally, we will explore how to create maps in R, which makes it possible to represent statistical summaries or model predictions with geographical referents.
Suggested further reading:
– Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 7.
– Murrell, Paul. 2005. R Graphics. Boca Raton: Chapman & Hall/CRC.

Thursday
Principal Components, Factor Analysis, and Structural Equations

While linear and generalised linear models focus on the influence of (usually several) independent variables on a single dependent variable, some research questions require the analysis of several dependent variables simultaneously, while others do not make a distinction between dependent and independent variables. In this section we deal with models that are applicable to such research questions. Specifically, the topics of this lesson are: (a) principal component analysis; (b) the general factor model and confirmatory factor analysis; (c) systems of simultaneous equations; (d) structural equation models with latent variables.
Suggested further reading:
– Bartholomew, David J., Fiona Steele, Irini Moustaki, and Jane I. Galbraith 2008. Analysis of Multivariate Social Science Data. (2nd ed.) Boca Raton: Chapman & Hall/CRC, chapters 5 and 7.
– Venables, W.N., and Ripley, B.D. 2002. Modern Applied Statistics with S. (4th ed.) New York: Springer, section 11.3.

Friday
Special and Advanced Topics of Data Analysis with R

This last session will be used to address the specific interests of the participants, to review some topics that need a more thorough discussion, or to introduce some advanced topics in which participants may be interested. Consequently, there are no pre-determined topics for this session, but several topics are possible. These include: (a) multilevel analysis with linear and generalised linear mixed-effects models; (b) non-linear and semi-parametric extensions of the (generalised) linear model; (c) numeric optimisation and general maximum likelihood; (d) matrix algebra in R; (e) advanced programming concepts: classes and methods.
Suggested further reading:
– Adler, Joseph. 2010. R in a Nutshell: A Desktop Quick Reference. Sebastopol, CA: O’Reilly.
– Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapters 8, 9.
– Braun, W. John and Duncan J. Murdoch. 2007. A First Course in Statistical Programming with R. Cambridge: Cambridge University Press.
– Chihara, Laura and Tim Hesterberg. 2011. Mathematical Statistics with Resampling and R. Hoboken, NJ: Wiley, chapters 5, 10, 11.
– Chambers, John M. 2008. Software for Data Analysis: Programming in R. New York: Springer.
– Gelman, Andrew and Jennifer Hill. 2007. Data Analysis using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.
– Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics with R. New York: Springer.
– Lumley, Thomas, 2010. Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ: Wiley.
– Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapters 9-14.
– Ritz, Christian and Jens Carl Streibig 2009. Nonlinear Regression with R. New York: Springer.
– Venables, W.N., and Ripley, B.D. 2002. Modern Applied Statistics with S. (4th ed.) New York: Springer, chapter 8.
– Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Boca Raton, FL: Chapman & Hall/CRC.