2B Quantitative Data Analysis and Statistical Graphics with R

Please note: This course will be delivered in person at the Colchester campus only. Online study is not available for this course.

Martin Elff is Professor at Zeppelin University in Friedrichshafen, Germany. He is a political scientist with research interests in political behaviour, comparative politics, and political methodology. His research has appeared in various journals, including the British Journal of Political Science and Political Analysis. He has also published the R packages ‘memisc’, ‘mclogit’, and ‘munfold’ on the ‘Comprehensive R Archive Network’ (http://cran.r-project.org), as well as the book Data Management in R: A Guide for Social Scientists with SAGE.

Course content

The course introduces quantitative social science data analysis using R. It focuses on the practical aspects of data analysis, including the management of social science data. It also shows how patterns within data can be visualised and how statistical models can be illustrated using appropriate diagrams. Consequently, the course cannot introduce basic or advanced statistical concepts, but such concepts are reviewed as appropriate.

As far as possible, the contents of the course will be adapted to the existing statistical knowledge of the participants. However, it covers at least the following topics: (1) basic concepts of data analysis with R; (2) elementary programming techniques in R; (3) data management – working with variables and data frames; (4) summarising data using tables and graphics; (5) linear regression – model construction and interpretation; (6) generalised linear models for categorical responses, counts, and survival times; (7) advanced statistical graphics. In addition to these, a few more topics are optionally covered as time permits and depending on participants’ interests, such as principal components and factor analysis; structural equation models; random numbers and Monte Carlo simulations; linear algebra with R and regression in matrix form; multilevel models; and advanced programming techniques.

Course objectives

Participants who successfully complete this module will be able to bring their knowledge about statistical concepts and techniques to fruition in practical analyses. They will also know how to prepare data in a way suitable for this purpose, how to explore data with statistical graphics and how to present analysis results by means of tables and diagrams.

Course prerequisites

In order to gain the most from the course, participants ideally have an understanding of what kind of analysis they intend to conduct and require guidance on how to apply the relevant techniques using the open source software R. While the course may be helpful in acquiring new statistical concepts, it cannot teach them. Therefore participants should have a good understanding of linear regression and at least a basic understanding of any other method they want to put into practice with R.

A certain level of “computer literacy” will certainly be helpful. That is, participants should not be afraid of command-line oriented (as opposed to menu-driven) software and of writing short command scripts. The ability to do this is not a prerequisite, but the motivation to learn it is.

Representative Background Reading

Long, JD, and Teetor, Paul. R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics. O’Reilly Media, 2019. ISBN: 9781492040682 – (this text will be provided by ESS)

Elff, Martin. Data Management in R: A Guide for Social Scientists, Sage, 2021. ISBN: 9781526459978 – (this text will be provided by ESS)

Background knowledge required

Statistics

OLS = elementary
Maximum Likelihood = elementary

Maths

Calculus = elementary
Linear Regression = elementary

Course Outline

The course has two major parts. Apart from a basic introduction to the R environment for data analysis, the first week mainly deals with the main steps in the workflow from data acquisition to data analysis. The second part deals with various topics that allow participants to use R to gain a deeper understanding of the foundations of data analysis and statistical inference, and with some advanced aspects of statistical graphics with R. There are also a couple of strictly optional modules that deal with particular and/or advanced aspects of using R; these will be covered if requested by course participants.

There is no textbook for this course. However, some further reading is suggested for those who want to delve deeper into the topics of the course. Learning to use R is best done practically rather than by reading books. For this reason, participants are provided with exercises that are done in class, the solutions of which will also be discussed during class sessions.

The following outline gives a rough overview of the topics that are covered in the day-to-day sessions. Since the course is intended to be responsive to the participants’ needs and interests, the actual course may deviate from this outline because some topics may take longer than planned, while others may be skipped or shortened.

Week 1

Monday

Basic introduction

In this section we discuss the basic usage of R and of one of its main user interfaces, RStudio. We will further look at how variables are defined and how objects can be saved (stored in a computer file) and restored. Finally, we discuss how extension packages can be installed and activated.

Numeric data, computations and logical operations

Numbers and numeric vectors are discussed as well as basic arithmetic and statistical computations. Furthermore, we look at comparisons between numbers and how they can be used to make data selections. Also, we will take a first look at the definition of arithmetic functions based on formulae.
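
As a rough illustration of these ideas (a sketch with made-up values, not taken from the course materials), numeric vectors, comparisons, selections and a formula-based function might look as follows in base R. The object names and the function `standardise` are hypothetical.

```r
# A small numeric vector (illustrative values, not course data)
x <- c(2, 4, 6, 8, 10)

# Basic arithmetic and statistical computations
x_mean <- mean(x)   # arithmetic mean
x_sd   <- sd(x)     # sample standard deviation
x_sq   <- x^2       # element-wise squaring

# A comparison between numbers yields a logical vector ...
above_mean <- x > x_mean

# ... which can be used to select elements of the data
x_selected <- x[above_mean]

# An arithmetic function based on a formula: z = (v - mean(v)) / sd(v)
standardise <- function(v) (v - mean(v)) / sd(v)
z <- standardise(x)
```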

The variety of data types in R

We will look at data beyond numbers – logical data, character data, factors, strings etc. – and how to make good use of them in data analysis. Lists and how they can be used to construct more complex data types are also discussed.
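
A minimal sketch of these data types, using invented survey answers (the names `answers`, `opinion` and `respondent` are assumptions for illustration only):

```r
# Character data (strings): illustrative survey answers
answers <- c("agree", "disagree", "agree", "neutral")

# Logical data: the result of a comparison
is_agree <- answers == "agree"

# A factor stores categorical data with a fixed set of ordered levels
opinion <- factor(answers, levels = c("disagree", "neutral", "agree"))
counts <- table(opinion)   # frequencies per level, in level order

# A list can combine elements of different types into a more complex object
respondent <- list(id = 1L, opinion = opinion[1], scores = c(3.5, 4.0))
```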

Suggested further reading:

Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapter 1.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, sections 1.1-1.3, 8.1, 8.3.

Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapter 1

Tuesday

Exercises

Data sets in R (data frames)

In this section we will take a closer look at how data sets (as common in social science data analysis) are constructed and used in R.
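
For orientation, a data frame can be sketched as follows. The data are invented for this example, and the variable names are hypothetical.

```r
# A small data frame: one row per respondent, one column per variable
survey <- data.frame(
  id    = 1:4,
  age   = c(23, 45, 31, 60),
  voted = c(TRUE, FALSE, TRUE, TRUE)
)

n_rows   <- nrow(survey)       # number of observations
mean_age <- mean(survey$age)   # a column is accessed with $

# Select the rows for which a condition holds
voters <- survey[survey$voted, ]

# Add a new variable derived from an existing one
survey$age_group <- ifelse(survey$age >= 40, "40+", "under 40")
```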

Basic programming ideas

R is not only software for (advanced) data analysis, it is also a programming language. This section discusses basic programming concepts such as branches and loops, formal arguments of functions and local and global variables – and how these concepts can be put to work to avoid (or automate) repetitive computational tasks.
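
The following sketch shows what these concepts could look like in practice. It is an illustration under our own assumptions; the functions `describe_sign` and `scale_by` are hypothetical and not part of R or of the course materials.

```r
# Branches: choose a result depending on a condition
describe_sign <- function(x) {
  if (x > 0) "positive" else if (x < 0) "negative" else "zero"
}

# Loops: repeat a computation for each element of a sequence
running_total <- 0
for (i in 1:5) {
  running_total <- running_total + i
}

# Formal arguments with a default value, and a local variable
scale_by <- function(x, multiplier = 2) {
  result <- x * multiplier   # 'result' exists only inside the function
  result
}

result <- "global value"      # a global variable with the same name
scaled <- scale_by(c(1, 2, 3))  # does not change the global 'result'

# Avoiding repetition: apply a function to each element of a vector
signs <- sapply(c(-1, 0, 3), describe_sign)
```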

Suggested further reading:

Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapter 1.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, sections 1.1-1.3, 8.1, 8.3.

Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapter 1

Wednesday

Exercises

Summarising Data: Tables, Descriptive Statistics and Basic Graphics

Often preliminary research questions can be addressed by descriptive statistics, such as contingency tables or conditional means. This session therefore discusses how to create tables of frequencies and other descriptive statistics. Further, structures in the data can often be elucidated by statistical graphics. Since there are many facilities in R to create such graphics and since these capabilities are a main point of attraction for many users of R, the creation of diagrams, scatter plots, bar charts and mosaic displays is extensively discussed in this session.
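
As a hedged illustration, the following sketch uses the `mtcars` data set that ships with R (chosen here only for convenience; it is not necessarily used in the course) to produce frequency tables, conditional means and three basic graphics.

```r
data(mtcars)   # example data shipped with R

# Frequency table and contingency table
freq  <- table(mtcars$cyl)
cross <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# Conditional means: mean fuel economy by number of cylinders
mean_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Basic graphics: bar chart, scatter plot and mosaic display
barplot(freq, xlab = "Cylinders", ylab = "Frequency")
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
mosaicplot(cross, main = "Cylinders by gears")
```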

Suggested further reading:

Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapter 3.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 3.

Thursday

Exercises

Data management: preparing data for analysis

This session deals with the crucial steps that precede any serious statistical data analysis in R. Before data can be analysed, it must be prepared appropriately. First of all, data provided by social science data archives often comes in the binary formats of statistical packages other than R, and in other cases data providers deliver the data in some tabular format. So this session starts with importing data from such “foreign” sources. Second, data is not always structured in a way that is appropriate for the intended analysis. Therefore another topic of this session is data recoding, labelling and the handling of missing values. Furthermore, we discuss how to merge and append data sets and how to recast data sets from a wide into a long format for repeated-measures analysis.
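
The recoding, merging and reshaping steps can be sketched in base R as follows. The data and all variable names are invented, and the code 9 for "no answer" is an assumption made for this example; the course itself may use other tools for these tasks.

```r
# Invented two-wave data in wide format
respondents <- data.frame(
  id       = 1:3,
  educ     = c(1, 2, 9),    # 9 is assumed to code "no answer"
  trust_w1 = c(4, 2, 5),
  trust_w2 = c(3, 2, NA)
)

# Handling missing values and labelling: recode 9 to NA, then label the codes
respondents$educ[respondents$educ == 9] <- NA
respondents$educ <- factor(respondents$educ,
                           levels = c(1, 2),
                           labels = c("lower", "higher"))

# Merging: add context information from a second data set by a key variable
regions <- data.frame(id = 1:3, region = c("North", "South", "North"))
merged  <- merge(respondents, regions, by = "id")

# Recasting from wide to long format: one row per respondent and wave
long <- reshape(merged,
                direction = "long",
                varying   = c("trust_w1", "trust_w2"),
                v.names   = "trust",
                timevar   = "wave",
                times     = 1:2,
                idvar     = "id")
```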

Suggested further reading:

Elff, Martin, 2008. Analysing the American National Election Study of 1948 using the memisc package. R-package vignette, http://cran.r-project.org/web/packages/memisc/

Fox, John, 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 2.

Spector, Phil. 2008. Data Manipulation with R. New York: Springer.

Friday

Exercises

Linear Regression with R: Model Construction and Interpretation

Linear regression and its generalisations are widely used tools for data analysis in the social sciences. For this reason this session discusses how to construct and estimate linear regression models in R. It covers the construction of regression models for metric dependent and independent variables, regression models with dummy variables for categorical independent variables, and regression models with interaction effects. Various aspects of graphical model diagnostics are also discussed, as is the presentation of regression estimates in the format required by many publishers.
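
A minimal sketch of these model types, again assuming the built-in `mtcars` data purely for illustration (the model specifications are our own choice, not those of the course):

```r
data(mtcars)

# Turn the 0/1 transmission indicator into a labelled factor
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Metric dependent and independent variable
m1 <- lm(mpg ~ wt, data = cars)

# Adding a categorical independent variable (R creates the dummy variable)
m2 <- lm(mpg ~ wt + am, data = cars)

# Adding an interaction effect between weight and transmission type
m3 <- lm(mpg ~ wt * am, data = cars)

summary(m3)     # estimates, standard errors and tests
anova(m2, m3)   # F-test comparing the nested models

# Graphical model diagnostics: four standard diagnostic plots
par(mfrow = c(2, 2))
plot(m3)
```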

Suggested further reading:

Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapter 3.

Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapters 5, 9-10.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 4.

Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapters 5-7.

Week 2

Monday

Exercises

Hypothesis Testing and Statistical Significance

Results of data analyses are hardly publishable if they do not involve “statistically significant” findings. The aim of this lesson is to give participants an understanding of what statistical significance actually means and how the concept of significance is connected to the testing of statistical hypotheses. Again, the software R is used not only to compute some examples but also to demonstrate the concepts involved with the help of computer simulations.
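
One way such a simulation could look (a sketch under our own assumptions about sample size, number of replications and test, not the course's actual demonstration): when the null hypothesis is true, a test at the 5% level should reject in roughly 5% of repeated samples.

```r
set.seed(1)      # make the simulation reproducible
n_sims <- 2000   # number of simulated studies (arbitrary choice)

# Each replication draws two samples from the same population,
# so the null hypothesis of equal means is true by construction
p_values <- replicate(n_sims, {
  x <- rnorm(30)
  y <- rnorm(30)
  t.test(x, y)$p.value
})

# Share of "significant" results at the 5% level (false positives)
rejection_rate <- mean(p_values < 0.05)

# Under the null hypothesis the p-values are roughly uniform on [0, 1]
hist(p_values, breaks = 20, main = "p-values under the null hypothesis")
```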

Suggested further reading:

Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapter 2.

Casella, George and Roger L. Berger 2002. Statistical Inference. (2nd ed.) Pacific Grove, CA: Duxbury, chapters 6-10.

Cox, D.R. 2006. Principles of Statistical Inference. Cambridge: Cambridge University Press.

Fox, John 2008. Applied Regression Analysis, and General Linear Models. (2nd ed.) Thousand Oaks: Sage, chapter 6.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, sections 4.3-4.6.

Tuesday

Exercises

Beyond linear regression: Models for binary responses, counts, and durations (Generalised linear models)

Linear regression requires the dependent variable to be metric (interval or ratio scaled). Yet often the variables that social scientists want to analyse are categorical or involve frequencies and durations, for which the classical linear regression model is not appropriate. This session shows how to construct and estimate models for data of these kinds in R. Some challenges in the interpretation of such models will be addressed, as well as some tricks for the graphical presentation of their implications. Topics covered are (a) linear vs generalised linear models – a review; (b) logit and probit regression models; (c) Poisson regression; (d) models for polychotomous dependent variables; (e) duration and hazard models.
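
As an illustration of topics (b) and (c), the sketch below fits logit, probit and Poisson models with `glm()` to the built-in `mtcars` data. The choice of data and variables is an assumption made for this example and has no substantive meaning.

```r
data(mtcars)

# Binary response: logit and probit regression
logit_fit  <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
probit_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "probit"))

# Count response: Poisson regression (log link by default)
pois_fit <- glm(carb ~ hp, data = mtcars, family = poisson)

# Graphical presentation: predicted probabilities over a grid of weights
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
grid$p <- predict(logit_fit, newdata = grid, type = "response")

plot(grid$wt, grid$p, type = "l",
     xlab = "Weight", ylab = "Predicted probability of manual transmission")
```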

Suggested further reading:

Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapters 4, 5, 6.

Dalgaard, Peter 2002. Introductory Statistics with R. New York: Springer, chapters 11, 12.

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 5.

Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapter 8.

Wednesday

Exercises

Advanced statistical graphics

R provides not only a set of standard statistical graphics but also makes it quite easy to combine graphical elements, such as lines, dots, and rectangles, as building blocks of tailor-made graphics for one’s own particular purposes of representing data summaries or estimation results. In this session we will discuss how to create new types of diagrams out of these basic graphical elements. In addition we will discuss so-called lattice graphics, which are a great tool for comparing relations between variables in different groups or under different conditions, and thus for the visualisation of interaction effects. Finally we will explore how to create maps in R, which make it possible to represent statistical summaries or model predictions with geographical referents.
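
To give an impression of the building-block approach, the following sketch assembles a simple plot of estimates with confidence intervals from basic graphical elements in base R. The estimates and standard errors are invented numbers used only for illustration.

```r
# Invented estimates and standard errors (for illustration only)
est <- c(age = 0.8, income = -0.3, education = 0.1)
se  <- c(0.20, 0.15, 0.25)

lower <- est - 1.96 * se   # approximate 95% confidence limits
upper <- est + 1.96 * se
ypos  <- seq_along(est)    # one row of the plot per estimate

# Start with an empty plot and set up the coordinate system
plot.new()
plot.window(xlim = range(lower, upper), ylim = c(0.5, length(est) + 0.5))

# Add the building blocks one by one
abline(v = 0, lty = 2)               # reference line at zero
segments(lower, ypos, upper, ypos)   # confidence intervals as lines
points(est, ypos, pch = 16)          # point estimates as dots
axis(1)                              # horizontal axis
axis(2, at = ypos, labels = names(est), las = 1)   # labelled vertical axis
box()
```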

Suggested further reading:

Fox, John 2002. An R and S-Plus Companion to Applied Regression. Thousand Oaks: Sage, chapter 7.

Murrell, Paul. 2019. R Graphics. 3rd ed. Boca Raton: Chapman & Hall/CRC.

Thursday

Principal Components, Factor Analysis, and Structural Equations

While linear and generalised linear models focus on the influence of (usually several) independent variables on a single dependent variable, some research questions require the analysis of several dependent variables simultaneously, while others do not make a distinction between dependent and independent variables. In this section we deal with models that are applicable to such research questions. Specifically, the topics of this lesson are: (a) principal component analysis; (b) the general factor model and confirmatory factor analysis; (c) systems of simultaneous equations; (d) structural equation models with latent variables.
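
For topic (a), a principal component analysis might be sketched as follows. The built-in `USArrests` data and the decision to standardise the variables are assumptions made for this example.

```r
data(USArrests)   # example data shipped with R

# Principal component analysis on standardised variables
pca <- prcomp(USArrests, scale. = TRUE)

# Proportion of total variance accounted for by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

loadings <- pca$rotation   # weights of the variables on each component
scores   <- pca$x          # component scores of the observations

summary(pca)
biplot(pca)                # joint display of scores and loadings
```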

Suggested further reading:

Bartholomew, David J., Fiona Steele, Irini Moustaki, and Jane I. Galbraith 2008. Analysis of Multivariate Social Science Data. (2nd ed.) Boca Raton: Chapman & Hall/CRC, chapters 5 and 7.

Venables, W.N., and Ripley, B.D. 2002. Modern Applied Statistics with S. (4th ed.) New York: Springer, section 11.3.

Friday

Special and Advanced Topics of Data Analysis with R

This last session will be used to address the specific interests of the participants, to review some topics that need a more thorough discussion, or to introduce some advanced topics in which participants may be interested. Consequently, there are no pre-determined topics for this session, but several topics are possible. These include: (a) multilevel analysis with linear and generalised linear mixed-effects models; (b) non-linear and semi-parametric extensions of the (generalised) linear model; (c) numeric optimisation and general maximum likelihood; (d) matrix algebra in R; (e) advanced programming concepts: classes and methods.
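
As an example of what topic (d) could involve, the sketch below computes least-squares estimates in matrix form and compares them with the results of `lm()`. The `mtcars` data and the model are chosen only for illustration.

```r
data(mtcars)

# Design matrix with a column of ones for the intercept
X <- cbind(1, mtcars$wt, mtcars$hp)
colnames(X) <- c("(Intercept)", "wt", "hp")
y <- mtcars$mpg

# OLS estimator: beta_hat = (X'X)^(-1) X'y, solved as a linear system
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Residual variance and standard errors of the estimates
resid  <- y - X %*% beta_hat
sigma2 <- sum(resid^2) / (nrow(X) - ncol(X))
se     <- sqrt(diag(sigma2 * solve(t(X) %*% X)))

# The same model fitted with lm() for comparison
fit <- lm(mpg ~ wt + hp, data = mtcars)
```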

Suggested further reading:

Adler, Joseph. 2010. R in a Nutshell: A Desktop Quick Reference. Sebastopol, CA: O’Reilly.

Aitkin, Murray, Brian Francis, John Hinde, and Ross Darnell 2009. Statistical Modelling in R. Oxford: Oxford University Press, chapters 8, 9.

Braun, John W. and Duncan J. Murdoch. 2007. A First Course in Statistical Programming with R. Cambridge: Cambridge University Press.

Chihara, Laura and Tim Hesterberg. 2011. Mathematical Statistics with Resampling and R. Hoboken, NJ: Wiley, chapters 5, 10, 11.

Chambers, John M. 2008. Software for Data Analysis: Programming in R. New York: Springer.

Gelman, Andrew and Jennifer Hill. 2007. Data Analysis using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.

Kleiber, Christian, and Achim Zeileis. 2008. Applied Econometrics with R. New York: Springer.

Lumley, Thomas, 2010. Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ: Wiley.

Maindonald, John and John Braun. 2006. Data Analysis and Graphics Using R: An Example-Based Approach (2nd ed.) Cambridge: Cambridge University Press, chapters 9-14.

Ritz, Christian and Jens Carl Streibig 2009. Nonlinear Regression with R. New York: Springer.

Venables, W.N., and Ripley, B.D. 2002. Modern Applied Statistics with S. (4th ed.) New York: Springer, chapter 8.

Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Boca Raton, FL: Chapman & Hall/CRC.
