Please note: this course will be taught in hybrid mode, with synchronous live sessions in which on-campus and online students are taught simultaneously.

Dr Ben Skinner obtained a PhD in Genetics from the University of Kent in 2009, and then performed postdoctoral research at the University of Cambridge on structural and evolutionary genomics – how genomes and the chromosomes they contain change and rearrange over time. In 2019 he joined the University of Essex as a Lecturer in the School of Life Sciences. His research group works on the development of image-analysis methods and on genome structure and evolution. He teaches computational analysis and programming to undergraduate and postgraduate students.

Dr Dave Clark is a Lecturer in Ecoinformatics within the School of Life Sciences at the University of Essex. He obtained his PhD in Microbiology in 2017, working on microbial community ecology using high-throughput DNA sequencing and data-synthesis approaches. He then held post-doctoral and research fellowship roles within the Institute of Analytics and Data Sciences before progressing to his current position. Dr Clark’s research draws on bioinformatic, statistical, and geographical analyses to answer novel questions in global microbial community ecology. He has extensive experience teaching programming skills in R to students of all levels across a variety of disciplines, including practitioners in industry.

Course Content

In the era of misinformation and fake news, engaging people with data in a clear and interpretable way is becoming an essential skill for data scientists in all fields. Whilst there are many software packages available for producing data graphics and conducting statistical analyses, the R programming language offers one of the most flexible and feature-rich toolsets available. The purpose of this course is to equip participants with the knowledge to use these tools effectively to communicate concepts and data analyses in a transparent, reproducible, and engaging manner to any audience. In essence, we hope to transform participants into data storytellers by the end of the course.

We will do this by addressing the following five topics:

·         Functions, control flow, and automation – How we can use tools in R to make our analyses more efficient and robust, and communicate and share the code we use by writing our own functions.

·         Advanced data wrangling with tidyverse & data.table – Using pre-existing packages and frameworks in R, we will learn how to fully explore, clean, and transform our data in a manner that is efficient, scalable to very large datasets, and optimised for speed.

·         Data visualisation and graphic design – We will think about the key principles that go into creating effective data visualisations, and how we can build graphics using the ggplot2 (and other) packages to communicate the ‘story’ of our data to different types of audiences and in different contexts.

·         Reproducible research & version control – How can we maximise the reproducibility of our analyses? We will learn about the tools available to create fully reproducible outputs of many kinds (reports, presentations, web pages, etc., using markdown and knitr), and how we can document and share our code and analyses to maximise their impact and keep track of our versions (git).

·         Interactive dashboards with Shiny – Creating interactive dashboards to communicate concepts from data can be an effective route to engage stakeholders and non-technical users with your analyses. Here, we will learn about how such dashboards can be created using the Shiny package within R, fully integrating all of the key skills and concepts dealt with over the prior course material.

Course Objectives:

By the end of the course, using R, participants should be able to:

  • Automate code using loops and control-flow structures
  • Construct and document functions
  • Construct data wrangling pipelines using both tidyverse and data.table
  • Compare different data wrangling pipelines via benchmarking
  • Construct and format different types of data visualisations using ggplot2
  • Extend and format plots for specific contexts using ggplot extension packages (e.g. gganimate)
  • Create reproducible reports using markdown and knitr
  • Create different types of outputs using markdown and knitr, including presentations, books, and websites
  • Version control and share code via git / GitHub
  • Create interactive data dashboards with Shiny
  • Dynamically update data visualisations by downloading data from the web within R


Course Prerequisites:

This course assumes that attendees are beginner-to-intermediate R users. This means that attendees should have some experience using R, including being able to import and work with data.frames, install and load packages, use basic functions, and make simple plots using base R. Basic statistical knowledge is also assumed, including measures of central tendency and spread (means, medians, variance, etc.), linear regression, and percentages.

Required reading:

As attendees are assumed to have prior experience using R, we do not specify any obligatory pre-reading. However, we suggest the following freely available texts as optional reading to supplement the course material.

  • R for Data Science
  • Advanced R
  • ggplot2: Elegant Graphics for Data Analysis
  • R Markdown: The Definitive Guide


Background knowledge

Computer background: R = elementary


This course will be delivered over 10 sessions. Each session will consist of a lecture (up to 1 hour), with the remainder dedicated to applying new knowledge by completing a series of guided exercises with the support of the lecturer.


Session 1

We will learn how to apply the simple principles of Boolean logic to everything from basic data-wrangling tasks through to complex control-flow structures, and how to use loops to carry out repetitive tasks with minimal effort.
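
To give a flavour of the style of exercise involved, here is a minimal sketch of a loop combined with control flow (the scores below are hypothetical):

    # Loop over a vector and branch on a condition (hypothetical data)
    scores <- c(45, 82, 67, 90)
    for (s in scores) {
      if (s >= 70) {
        message(s, ": distinction")
      } else {
        message(s, ": pass")
      }
    }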

Session 2

This session will focus on how we can construct our own functions to make our code more concise and make our analyses easier to repeat. We will learn how to specify inputs to functions, consider the scope of our functions, and how our functions can access different parts of our environment.
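
As a minimal illustration, a small user-defined function with a default argument (the function name and data are hypothetical):

    # A user-defined function; na.rm has a default value of TRUE
    summarise_vec <- function(x, na.rm = TRUE) {
      # objects created here exist only in the function's local scope
      c(mean = mean(x, na.rm = na.rm), sd = sd(x, na.rm = na.rm))
    }
    summarise_vec(c(1, 4, NA, 10))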

Session 3

Session 3 will focus on building complex data wrangling pipelines using the Tidyverse suite of packages. We will learn about the key ‘philosophy’ behind the Tidyverse, and how we can use simple pipelines to make our code more efficient to run.
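
An indicative example of such a pipeline, using the dplyr package and R's built-in iris dataset (the datasets used in class may differ):

    library(dplyr)

    # Pipe the data through a sequence of wrangling steps
    iris %>%
      filter(Sepal.Length > 5) %>%
      group_by(Species) %>%
      summarise(mean_petal = mean(Petal.Length))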

Session 4

In session 4, we will look at ‘data.table’, an alternative package for data wrangling tasks. We will contrast this package with the Tidyverse packages from the previous session, and examine some ways that we can benchmark our code to measure its efficiency.
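
For example, the same kind of summary written in data.table syntax, together with one possible way of timing alternative approaches (microbenchmark is just one of several benchmarking packages, used here for illustration):

    library(data.table)
    library(microbenchmark)

    dt <- as.data.table(iris)
    # data.table syntax: dt[rows to keep, what to compute, grouping]
    dt[Sepal.Length > 5, .(mean_petal = mean(Petal.Length)), by = Species]

    # Time two equivalent approaches against each other
    microbenchmark(
      dplyr      = dplyr::count(iris, Species),
      data.table = dt[, .N, by = Species]
    )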

Session 5

Session 5 will examine some of the key tools available in R for visualising data, with a focus on the ‘ggplot2’ package. We will examine how to build plots using ggplot and consider some of the key aspects of building effective and accessible data visualisations.
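
As a minimal illustration of ggplot2's layered approach (again using the built-in iris data):

    library(ggplot2)

    # Build the plot in layers: data and aesthetics, then geometry, then labels
    ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, colour = Species)) +
      geom_point() +
      labs(x = "Sepal length (cm)", y = "Petal length (cm)",
           title = "Petal length increases with sepal length")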

Session 6

Session 6 will build on the previous session by examining some of the more advanced functionality of ‘ggplot2’ for building more specialist data visualisations, and by showing how small changes can make our plots more communicative.
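
For instance, extension packages such as gganimate add new grammar elements to an ordinary ggplot; a sketch using the gapminder example dataset (a separate package, assumed installed) might look like this:

    library(ggplot2)
    library(gganimate)
    library(gapminder)   # example dataset package, assumed installed

    ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, colour = continent)) +
      geom_point(alpha = 0.6) +
      scale_x_log10() +
      labs(title = "Year: {frame_time}") +
      transition_time(year)   # gganimate: animate over the year variable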

Session 7

Scripting is a key benefit of using a programming language, as it enables our analyses to be reproducible. Session 7 will focus on ‘Rmarkdown’ and ‘knitr’, tools that allow us to embed our code into whatever type of output we desire (e.g. webpages, reports, etc.), making our work fully transparent and reproducible.
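
In practice this means ‘knitting’ a source document into a finished output, for example (the file name here is hypothetical):

    # report.Rmd would mix prose with embedded R code chunks; rendering
    # runs the code and knits the results into the finished document
    rmarkdown::render("report.Rmd", output_format = "html_document")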

Session 8

Session 8 will focus on best practice for preserving and sharing our code as it develops, using version-control tools such as ‘git’. We will learn about the benefits of version control, and see how we can easily integrate it into our workflows.
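
One convenient route from within R is the usethis helper package (shown here for illustration; plain command-line git works equally well):

    # Initialise a git repository for the current project
    usethis::use_git()

    # Create and link a repository on GitHub (requires a GitHub access token)
    usethis::use_github()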

Session 9

Session 9 will introduce Shiny for generating interactive webpages and dashboards. We will learn how to construct interactive web pages with a variety of user inputs, and use those inputs to perform calculations that return tables or charts.
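
A minimal Shiny app illustrates the pattern: a UI collects user inputs, and a server function turns those inputs into outputs (this sketch is not one of the course exercises):

    library(shiny)

    # UI: a slider input and a placeholder for a plot
    ui <- fluidPage(
      sliderInput("n", "Sample size:", min = 10, max = 500, value = 100),
      plotOutput("hist")
    )

    # Server: recompute the histogram whenever the slider changes
    server <- function(input, output) {
      output$hist <- renderPlot(hist(rnorm(input$n), main = "Random sample"))
    }

    shinyApp(ui, server)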

Session 10

In our final session, we will put together everything we have learned in the course, and build an interactive dashboard using publicly available data, with clear documentation, high-quality figures, and reproducible and sharable code.
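
Reading public data straight from the web is one ingredient of this; in its simplest form (the URL below is a placeholder, not a real data source):

    # Read a remote CSV directly into R; the URL is hypothetical
    dat <- read.csv("https://example.com/public_data.csv")
    head(dat)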