Tutorials

The tutorials will take place on 10-11 July 2018. Click a tutorial title for more information, and register on the conference website.

Tuesday Morning
Presenter: Paula Moraga
Title: Disease risk modeling and visualization using R
Target audience: People who are interested in health surveillance or any subject that deals with spatially referenced data

In this tutorial we will learn how to estimate disease risk and quantify risk factors using areal and geostatistical data. We will also create interactive maps of disease risk and risk factors, and introduce presentation options such as interactive dashboards and Shiny apps. We will work through two disease mapping examples using data on malaria in The Gambia and on cancer in Pennsylvania, United States. We will focus on disease risk, but the approaches covered are also applicable to other fields such as climate, ecology or crime. We will cover the following topics (a short code sketch follows the list):

  • Model disease risk in different settings
  • Manipulate and transform point, areal and raster data using the spatial packages `sp`, `spdep`, `raster`, `rgdal`, `geoR`, `SpatialEpi` and `SpatialEpiApp`
  • Retrieve high resolution spatially referenced environmental data using `raster`
  • Fit and interpret spatial models using `INLA`
  • Map disease risk and risk factors using `leaflet` and `ggplot2`
  • Communicate results with interactive dashboards using `flexdashboard` and Shiny apps using `shiny`
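
As a small taste of the areal-data workflow, here is a hedged sketch of computing standardised incidence ratios (SIRs) for the Pennsylvania example using `SpatialEpi`; the exact steps in the tutorial may differ.

```r
# A minimal sketch, assuming the pennLC data shipped with SpatialEpi
# (Pennsylvania lung cancer, stratified by race, gender and age)
library(SpatialEpi)

data(pennLC)

# Observed cases per county (summing over the 16 strata)
d <- aggregate(cases ~ county, data = pennLC$data, FUN = sum)

# Expected counts by indirect standardisation; pennLC$data is ordered
# by county and then stratum, as expected() requires
d$expected <- expected(pennLC$data$population, pennLC$data$cases,
                       n.strata = 16)

# Standardised incidence ratio: observed / expected
d$SIR <- d$cases / d$expected
head(d)
```

An SIR above 1 flags counties with more cases than expected; quantities like these feed into the `INLA` models and `leaflet` maps covered in the tutorial.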

Presenter: Heather Turner
Title: Generalized Nonlinear Models using the gnm Package
Target audience: People who wish to find out what generalized nonlinear models are and whether such models might be useful in their field of application

The class of generalized linear models encompasses many tools commonly used in data analysis, including multiple linear regression, logistic regression and log-linear models. But a linear predictor does not always capture the relationship we wish to model. Rather, a nonlinear predictor may provide a better description of the observed data, often with fewer and more interpretable parameters. This tutorial introduces the wider class of generalized nonlinear models (GNMs) and their implementation via the R package `gnm`.
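
For a flavour of what a GNM looks like in practice, here is a hedged sketch of a classic example from the `gnm` documentation: a row-column association model, i.e. a log-linear model with a multiplicative interaction term.

```r
# A minimal sketch based on the gnm documentation's RC association
# model for the occupationalStatus contingency table (a base R dataset)
library(gnm)

set.seed(1)  # Mult() terms use random starting values
rc <- gnm(Freq ~ origin + destination + Mult(origin, destination),
          family = poisson, data = occupationalStatus)
summary(rc)
```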

Presenter: Elizabeth Stark
Title: Production-ready R: Getting started with R and Docker
Target audience: Experience with using the command line and running basic scripts is helpful but not necessary. Some prior exposure to Docker, Git and cloud computing is helpful, but no in-depth knowledge is required.

We will present some real-world data science scenarios and use these as a basis to walk participants through the process of building and deploying R-Docker apps. Through examples, discussion and activities, participants will gain experience in writing R scripts that run as stand-alone Docker applications. We will provide code that can be used as a basis for participants' own projects.
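
As a hedged illustration of the kind of stand-alone app involved (the file names, Dockerfile and processing logic below are hypothetical, not the workshop's materials):

```r
#!/usr/bin/env Rscript
# process.R - a script intended to run inside a container, e.g. built
# from a Dockerfile along these lines (also an assumption):
#   FROM rocker/r-ver:3.5.0
#   COPY process.R /app/process.R
#   ENTRYPOINT ["Rscript", "/app/process.R"]

args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 2) stop("Usage: process.R <input.csv> <output.csv>")

dat <- read.csv(args[1])

# Stand-in for real processing: summarise the numeric columns
num <- dat[sapply(dat, is.numeric)]
out <- data.frame(column = names(num), mean = colMeans(num, na.rm = TRUE))

write.csv(out, args[2], row.names = FALSE)
```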

Presenter: Scott Came
Title: Applications with R and Docker
Target audience: Attendees with some exposure to Docker who are interested in building multi-container networked applications using Docker and R

In this tutorial we will explore several "advanced" scenarios of using Docker and R together to ease deployment of R applications. Attendees will gain hands-on experience building and deploying Docker images for Shiny, databases, plumber and keras. We will also look at cloud deployment and scaling applications with Kubernetes.
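
To make one of those pieces concrete, here is a hedged sketch of a tiny `plumber` API of the sort one might containerise (the route and function are illustrative, in the spirit of the plumber documentation):

```r
# plumber.R - a minimal web API; launch locally with
#   plumber::plumb("plumber.R")$run(port = 8000)
# and, in a container, expose the same port.

#* Echo back a message
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}
```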

Presenter: Przemyslaw Biecek
Title: DALEX: Descriptive mAchine Learning EXplanations. Tools for exploration, validation and explanation of complex machine learning models
Target audience: TBA

TBA

Presenter: Hanjo Odendaal
Title: The ultimate online collection toolbox: Combining RSelenium and Rvest
Target audience: Intermediate R users looking to explore online data collection

`rvest` from Hadley Wickham has become the go-to package for online collection and website interaction (web scraping) tasks in R. Although the package is excellent, it cannot interact with a webpage whose content is loaded dynamically through JavaScript. For that, we need a browser that we 'drive' around the website to load, collect and interact with objects. Welcome to `RSelenium` from John Harrison, which provides the tools to drive a web browser from R using script commands. In this tutorial we will install RSelenium, learn its basic commands, look at JavaScript tips, and see how it plays well with other packages such as `rvest`.
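
A hedged sketch of the basic pattern (the URL and CSS selector are placeholders):

```r
# Drive a real browser with RSelenium, then hand the rendered page to rvest
library(RSelenium)
library(rvest)

drv <- rsDriver(browser = "firefox", verbose = FALSE)  # starts server + browser
remDr <- drv$client

remDr$navigate("https://example.com")  # placeholder URL
Sys.sleep(2)                           # crude wait for JavaScript to render

page <- read_html(remDr$getPageSource()[[1]])
headings <- page %>% html_nodes("h1") %>% html_text()  # placeholder selector

remDr$close()
drv$server$stop()
```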

Tuesday Afternoon
Presenter: Thomas Lumley
Title: fasteR: ways to speed up R code
Target audience: Intermediate R programmers interested in speeding up their code

This workshop will cover some intermediate and advanced techniques for optimising R code. We will look at both processing speed and memory use, but will not cover converting your code into other languages (e.g. C).
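
As one illustration of the kind of issue covered (my example, not the workshop's materials): growing a vector inside a loop forces repeated copying, whereas preallocating or vectorising avoids it.

```r
n <- 1e5

system.time({              # grows the result each iteration: repeated copying
  x <- numeric(0)
  for (i in 1:n) x <- c(x, i^2)
})

system.time({              # preallocates once, then writes in place
  x <- numeric(n)
  for (i in 1:n) x[i] <- i^2
})

system.time(x <- (1:n)^2)  # vectorised: fastest and clearest
```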

Presenter: Simon Jackson
Title: Wrangling data in the Tidyverse
Target audience: Beginner-to-intermediate R users who want to improve the day-to-day quality and efficiency of their data wrangling skills

This hands-on tutorial will help beginner-to-intermediate R users take their data wrangling skills to the next level with an introduction to the Tidyverse: a collection of data science packages including dplyr, tidyr, purrr, ggplot2, and more. Using provided data sets and practical examples, you will learn how to efficiently import, tidy, and transform data to more quickly focus on tasks like visualization and modeling.
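
For a sense of the style of code involved, here is a hedged sketch using a built-in dataset rather than the tutorial's own data:

```r
library(dplyr)

# Tidy and transform in one pipeline: mean fuel economy by cylinder count
mtcars %>%
  as_tibble(rownames = "car") %>%
  filter(cyl %in% c(4, 6, 8)) %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))
```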

Presenter: Kevin Kuo
Title: Deep learning with TensorFlow and Keras
Target audience: Anyone interested in deep learning

We begin with a quick introduction of deep learning concepts, just enough to have a working vocabulary to facilitate construction of neural networks during the tutorial. The TensorFlow suite of R packages will be covered, including keras, tfestimators, and tfdatasets. Together with the participants, we build end-to-end workflows to perform classification and regression tasks using neural networks. We discuss the data pre-processing needs specific to neural network models, architectural choices, and best practices. Examples will be chosen to span a wide range of interests, including learning on structured data, time series, and unstructured text and image data.
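
As a hedged sketch of an end-to-end classification workflow with the `keras` package (MNIST, a standard example from the package documentation; the tutorial's own examples may differ):

```r
library(keras)

mnist <- dataset_mnist()
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), 784)) / 255
y_train <- to_categorical(mnist$train$y, 10)

# A small fully connected network for 10-class classification
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = 784) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")

model %>% fit(x_train, y_train,
              epochs = 5, batch_size = 128, validation_split = 0.2)
```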

Presenter: Johann Gagnon-Bartsch
Title: Looking to clean your data? Learn how to Remove Unwanted Variation with R
Target audience: People working in data analysis, statistics, bioinformatics or computational biology

High-dimensional data often suffer from unwanted variation; for example, gene expression data commonly contain batch effects, and fMRI data commonly suffer from various systematic errors as well. Removing this unwanted variation while preserving the true signal in the data is essential to deriving the right scientific conclusions. A major complication, however, is that the factors causing the unwanted variation are often unknown and must be inferred from the data. In this tutorial we present the RUV (remove unwanted variation) package. RUV methods cover a range of approaches for removing unwanted variation depending on the purpose of the study: differential expression analysis, global data normalisation and visualisation, or classification. We also demonstrate an R Shiny application that provides an overview of the methods, along with interactive options for data visualisation and method diagnostics.
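
A hedged sketch of what a call to the package can look like, on simulated data (in practice the method is driven by known negative control genes; everything below is illustrative):

```r
library(ruv)

set.seed(42)
n <- 20; p <- 1000                   # samples x genes
X <- matrix(rep(0:1, each = n / 2))  # factor of interest
W <- matrix(rnorm(n))                # hidden unwanted factor

beta <- rnorm(p) * 0.5               # true effects of X on each gene
ctl <- seq_len(p) <= 100             # negative control genes...
beta[ctl] <- 0                       # ...unaffected by the factor of interest

Y <- X %*% t(beta) + W %*% t(rnorm(p)) + matrix(rnorm(n * p), n)

fit <- RUV4(Y, X, ctl, k = 1)  # adjust for one unwanted factor
head(fit$p)                    # p-values for the factor of interest
```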

Presenter: Stephanie Kovalchik
Title: Sports Analytics
Target audience: TBA

TBA


Wednesday Morning
Presenter: Julie Josse
Title: Missing values imputation
Target audience: People who want to know more about how to deal with missing values in their analyses and what methods are available. Basic knowledge of PCA and linear models is required.

The ability to easily collect and gather large amounts of data from different sources can be seen as an opportunity to better understand many processes, and has already led to breakthroughs in several application areas. However, due to the wide heterogeneity of measurements and objectives, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also present some important methodological and technical challenges for data analysts. In this tutorial we give an overview of the missing-values literature, as well as recent improvements that have caught the attention of the community thanks to their ability to handle large matrices with large numbers of missing entries. We will illustrate the methods on medical, environmental and survey data.
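
A hedged sketch of PCA-based imputation with the presenter's `missMDA` package, using the `orange` example data shipped with the package (the tutorial's own examples may differ):

```r
library(missMDA)

data(orange)                # sensory ratings with missing values

nb <- estim_ncpPCA(orange)  # choose the number of PCA dimensions
imp <- imputePCA(orange, ncp = nb$ncp)

head(imp$completeObs)       # the completed data matrix
```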

Presenter: Carson Sievert
Title: Interactive data visualization on the web with R
Target audience: Anyone interested in interactive data visualization

This tutorial teaches practical workflows for creating interactive web graphics which support common data analysis tasks. Through a series of examples, demos, exercises, and lecture, attendees will gain a foundation for navigating through common barriers of productivity associated with both the creation (e.g. start-up cost, iteration cost, dead-end cost) and distribution (e.g., deployment cost, scaling cost, latency cost) of interactive web graphics.
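
As a hedged sketch of one such workflow, a common path from a static plot to an interactive web graphic goes through `plotly` (which the presenter maintains):

```r
library(plotly)  # attaches ggplot2 as well

p <- ggplot(txhousing, aes(date, median, group = city)) +
  geom_line(alpha = 0.3)

ggplotly(p)  # renders the ggplot as an interactive htmlwidget
```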

Presenter: Matteo Fasiolo
Title: Quantile Generalized Additive Models: moving beyond Gaussianity
Target audience: Attendees should have a basic understanding of regression models and of the basic concepts underlying statistics and machine learning (e.g. probability densities and quantiles)

Generalized Additive Models (GAMs) are an extension of traditional parametric regression models which have proved highly useful for both predictive and inferential purposes in a wide variety of scientific and commercial applications. One reason behind the popularity of GAMs is that they strike an interesting balance between flexibility and interpretability, while being able to handle large data sets. The mgcv R package is arguably the state-of-the-art tool for fitting such models, hence the first half of this tutorial will introduce GAMs and mgcv in the context of electricity demand forecasting. The second part of the tutorial will show how traditional GAMs can be extended to quantile GAMs, and how the latter can be fitted using the qgam R package. By the end of the tutorial attendees should be able to build, fit and visualize traditional or quantile GAM models using a combination of the mgcv, qgam and mgcViz R packages. This tutorial is aimed at a broad audience of statistical modellers interested in using GAMs for predictive or inferential purposes. The models presented have a very wide range of applicability, hence they should be of interest to practitioners in business intelligence, ecology, linguistics, epidemiology and geoscience, to name a few.
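
A hedged sketch of a quantile GAM fit, following the example used in the qgam documentation (the motorcycle data from MASS):

```r
library(MASS)   # for the mcycle data
library(qgam)

# Fit the 0.8 quantile of head acceleration as a smooth function of time
fit <- qgam(accel ~ s(times, k = 20, bs = "ad"), data = mcycle, qu = 0.8)

plot(mcycle)
lines(mcycle$times, predict(fit), col = "red")
```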

Presenter: Maria Prokofieva
Title: Follow Me: Introduction to social media analysis in R
Target audience: A broad range of participants from various backgrounds (business, academia, etc.)

The tutorial will review a range of R packages for social media analysis and will teach general principles of working with social media platforms and analysing the information they hold. The social media platforms covered are Facebook, Twitter, Instagram and YouTube. Topics covered during the tutorial include (a connection sketch follows the list):

1. Structure of social media data (e.g. user-related data, posting-related data, hashtags)
2. Benefits and challenges of working with social media data (textual/non-textual information, large data volumes, API limitations)
3. Connecting to a social media platform (e.g. authentication) and downloading data
4. Data analysis of profile information (e.g. followers, likes, dislikes, favorites; platform dependent)
5. Data analysis of textual information (e.g. user posts, comments, dynamics, sentiment analysis, word clouds)
6. Visualisation of social media data
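
For item 3, a hedged sketch using `rtweet` (one of several Twitter clients of the kind the tutorial may cover; the app name and credentials are placeholders you would obtain from the platform's developer portal):

```r
library(rtweet)

# Authenticate with placeholder app credentials
token <- create_token(
  app             = "my_social_media_app",
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)

# Download recent tweets matching a hashtag
tweets <- search_tweets("#rstats", n = 500, token = token)
head(tweets$text)
```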

Presenter: Tong He
Title: xgboost and MXNet
Target audience: TBA

TBA
