Talk Schedule

The talks will take place on 11-13 July 2018 (click a talk for its abstract). A datatable version is provided here if you’re looking for an easier-to-search, R-oriented format. Information for presenters is here.

Time Session Presenter Venue Title Keywords
14:30 Poster lightning Shinichi Takayanagi TBD How LINE Corp Use R to Compete in a Data-Driven World applications, reproducibility, Practical use of R language in business
LINE is one of the most popular messaging applications in Asia, developed by LINE Corp. We continually collect and utilize logs of user behavior to improve our services, and in our data science team R plays an important role in all stages of data analysis, from exploratory data analysis and modeling to sharing results with colleagues using Shiny. In our poster presentation, we explain how we use R to solve our business problems and share some practical insights for others who want to use R in their business. As our data science team grew, different people developed their own R scripts to solve similar tasks, such as database connections. To solve this problem we started working on an internal R package called "liner", shared through GitHub Enterprise, and multiple members now collaborate to improve it and fix bugs. We also give an overview of our R-related data analysis platform along with some useful OSS tools: yanagishima, a tool developed at LINE Corp that helps people run queries easily, and Drone, a Docker-based continuous integration tool that we use for automated unit testing and for deploying the "liner" package to users.
14:30 Poster lightning Mahmoud Ahmed TBD cRegulome: An R package for accessing microRNA/transcription factor-gene expression correlations in cancer bioinformatics, data-access
Transcription factors and microRNAs are important for regulating gene expression in normal physiology and in pathological conditions. Many bioinformatics tools have been built to predict and identify transcription factor and microRNA targets and their role in the development of diseases, including cancers. The availability of publicly accessible high-throughput data has allowed for data-driven predictions and discoveries. Here, we build on some of these tools and integrative analyses and provide a tool to access, manage and visualize data from open-source databases. cRegulome provides programmatic access to the regulome (microRNA and transcription factor) correlations with target genes in cancer. The package obtains a local instance of the Cistrome Cancer and miRCancerdb databases and provides classes and methods to interact with and visualize the correlation data. The package is hosted on GitHub as part of the rOpenSci collection: https://github.com/ropensci/cRegulome
14:30 Poster lightning Miles McBain TBD An R pipeline for creating and hosting collaborative Web VR environments. visualisation, space/time
We are continuing the rich history of R as a generative tool for data-driven documents by developing a capability to generate Virtual Reality (VR) scenes. A VR scene object is provided that can create, configure, and serve multi-user Web VR scenes. Development so far includes tools that harness the R spatial ecosystem to create textured 3D meshes from contour and raster data. Several other VR primitives are provided, including 360-degree photo 'portals'. The immediate application for this work has been the creation of environments for the calibration of geo-spatial statistical models by expert elicitation. However, we see the ability to generate and host multi-user scenes, complete with facilities for activity logging, as a compelling new platform for collecting and visualising experimental data. Given this work is evolving rapidly, a poster would be the perfect tool to engage attendees in a variety of conversations about the API, the technology stack, and applications. If given a poster slot at useR! 2018, we would like to bring some VR hardware to the session to give attendees a chance to experience some of the virtual environments our tools have generated.
14:30 Poster lightning Kim Ki-Yeol TBD Squamous cell carcinoma analysis with R bioinformatics
Squamous cell carcinoma (SCC) is the most common histological type of head and neck cancer and of cervical cancer. Carcinogenesis in these two cancer types demonstrates a similar multistep progression. The purpose of the study is to identify significant consensus gene modules of these two cancers. We used a publicly available expression dataset that included head and neck cancer (42 cancer samples and 14 normal samples) and cervical cancer (20 cancer samples and 8 normal samples). We used only human papillomavirus 16-positive samples to exclude bias due to different HPV types. We identified consensus modules of the two cancer types and explored the biological functions of each module with an annotation tool. We identified 8 consensus gene modules of head and neck cancer and cervical cancer. Each module was well preserved between the two types of cancer. The modules included significant biological functions, including ATP binding and extracellular exosome. Consensus gene module identification is expected to contribute to more personalized management of multiple cancer types.
14:30 Poster lightning Erika Siregar TBD AnalevR: An Interactive R-Based Analysis Environment for Utilizing BPS-Statistics Indonesia Data visualisation, web app, community/education, networks, big data, analysis, collaboration, cloud computing, analysis as a service, bps, statistics Indonesia, analysis environment, remote data analysis
As Indonesia’s national statistical agency, BPS-Statistics Indonesia produces a massive amount of strategic data every year. However, these data are still underutilized by other parties (governments, researchers, etc.) due to technical limitations and the exclusivity and locality of the raw data. Numerous people outside BPS are capable of conducting analyses but are unable to access the data. To increase the usefulness of the data, we introduce AnalevR, an online R-based analysis environment that allows anyone to perform analyses and create visualizations without having to own the original raw data. It uses a notebook-like interface where users type commands and the output appears below. BPS provides the data and the analysis service (including the R modules), which are held in cloud storage and can be explored via a helper function. Users remotely execute R commands and perform analyses inside workspaces. A user can create up to 10 workspaces, each representing a different session. Each saved session preserves the user-defined variables and functions for future use. This breakthrough will raise users’ involvement in employing BPS data and increase statistical quality in Indonesia.
14:30 Poster lightning Yan Holtz TBD From Data to Viz visualisation, community/education
Selecting the right graphic type is a common task for a data scientist. On a daily basis, an R user deals with a data frame and must decide which visualization is the most appropriate to represent it. The task is not easy. The data scientist must know the broad spectrum of visualization types, figure out which visualizations are feasible for the dataset, try several (or all) of them, find the code to create the charts, and avoid the common caveats associated with the selected option. Data-to-viz.com is a new website that meets these needs. It displays an interactive decision tree: the user describes their dataset, which leads them to a set of appropriate graphic types. A description is provided for each, explaining its pros and cons. Links to the R and Python graph galleries are provided, which allows users to get the corresponding code in seconds. The complete decision tree is also available in a static version as a poster. The project has not yet been released, as it may be announced at the useR! conference.
14:30 Poster lightning Goknur Giner TBD Pathway-VisualiseR visualisation, web app, community/education, bioinformatics, networks, Bioconductor
Statistical modelling in genomic research produces sets of genes and individual biomarkers that require investigation. Further exploration of those sets of biomarkers is a prominent step towards discovering the source of a biological problem. Furthermore, understanding the collective behaviour of genes has been shown to provide valuable insights into the triggers of many human diseases. We have developed an RShiny application that provides an interface for researchers, enabling them to discover the interactions between their genes and biological pathways. The application allows users to inquire into the details of their research through Gene Ontology (GO) analysis, with interactive network visualisations and links to related web sites.
14:30 Poster lightning Edgar Santos-Fernandez TBD ActisoftR: a toolbox for processing and visualizing scored actigraphy data. visualisation, applications, space/time
Actigraphy is a cost-effective and convenient tool for activity-based monitoring. It allows studying sleep/wake patterns and identifying disorders in sleep research. ActisoftR was designed for parsing actigraphy outputs and summarising scored data across user-defined intervals. It consists of several functions for importing data, generating reports and statistics, and visualising the data.
14:30 Poster lightning Hannah Coughlan TBD Integration and visualisation of high throughput genomics data with R visualisation, bioinformatics
As sequencing of DNA becomes a more affordable option for studying genomics research questions, more data types are becoming available. One interesting and complex data type, chromosome conformation capture, aims to interrogate the 3D structure of chromosomes. The same 2 metres of DNA is compacted into the nucleus of every human cell regardless of the cell's function, and the genes that determine the cell's function are under strict regulation. Chromosome structure can be a mechanism of gene regulation; more specifically, DNA loops form to spatially associate genes with regulators that are not adjacent in the linear genome. However, the techniques to study chromosome structure (Hi-C) are often limited by spatial resolution and can be difficult to interpret. Other genomics techniques that study gene expression (RNA-seq) and regulation (ChIP-seq) cannot discover distant regulators. Here we show how different types of genomics data can be integrated to investigate long-distance gene regulation. Using R and Bioconductor packages (edgeR, Sushi, GenomicRanges and limma) we can integrate data into a common framework that can be visualised to allow for biological interpretation.
14:30 Poster lightning Vipavee Trivittayasil TBD MovingBubbles : Animated d3 bubble chart visualisation
Webpage: https://github.com/chengvt/MovingBubbles. A line graph is usually used to portray time-series data. However, when there are many time series, the graph can become cluttered and thus difficult to read. In order to portray time-series data with many samples in a more intuitive way, we developed a package for plotting an animated bubble chart. A bubble chart here refers to a chart that represents one quantity, with the bubbles packed close together to use the space efficiently. The quantity each bubble represents is proportional to the bubble area. A package to plot a static bubble chart already exists in R (Joe Chang et al., n.d.). The MovingBubbles package adds second and third information dimensions to the chart by means of animation and color. The animation portrays changes in the data over time and also helps make the plot more engaging for viewers. The plotting and transitions between frames are handled by the d3 library (Bostock et al., 2011). The package uses the htmlwidgets framework (Vaidyanathan et al., 2017) to bridge JavaScript and R.
14:30 Poster lightning Nicholas Spyrison TBD tourr; visualizing higher dimensions vs alternatives visualisation, community/education
Visualizing in higher dimensions (more than p = 3 numeric dimensions) can be messy and unintuitive. Here we explain the methodology and explore the functionality of tourr. We offer a vignette for its use and contrast it with other higher-dimensional visualization methods. The R package tourr (Wickham and Cook, 2011) gives us the means to animate a projection as we rotate through p dimensions. This is achieved by varying the contributions from each dimension, via a random walk, a predefined path, or by optimizing an index. References: Wickham, H., D. Cook, and H. Hofmann (2015). Visualising statistical models: Removing the blindfold (with discussion). Statistical Analysis and Data Mining 8(4), 203-225. Wickham, H., D. Cook, H. Hofmann, and A. Buja (2011). tourr: An R package for exploring multivariate data with projections. Journal of Statistical Software 40(2), http://www.jstatsoft.org/v40. Asimov, D. (1985). "The Grand Tour: A Tool for Viewing Multidimensional Data." SIAM Journal on Scientific and Statistical Computing 6(1), 128-143.
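As a flavour of the API described above, here is a minimal sketch using the 'flea' dataset bundled with tourr. This is illustrative only, not the speaker's code, and argument conventions may differ slightly between tourr versions.

```r
# Grand tour of the six numeric flea measurements: projection directions
# change smoothly along a random geodesic walk.
library(tourr)

animate_xy(flea[, 1:6])

# Guided tour: the path is chosen by optimising an index (here, the holes index).
animate_xy(flea[, 1:6], tour_path = guided_tour(holes()))
```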
14:30 Poster lightning Motoyuki Oki TBD Time Series Digger : Automatic time series analysis for data science in R visualisation, data mining, space/time
Exploratory Data Analysis (EDA) is an essential process for understanding time series and conducting useful feature extraction. We introduce "Time Series Digger", which provides automatic and programmable EDA in R to accelerate time series analysis for data scientists. Time Series Digger is deployed on the data science platform of NTT Communications, one of the largest Internet service providers in Japan, and we show its effectiveness with real use cases. Time Series Digger consists of three parts. First, it provides automatic and comprehensive time series visualization over various time intervals to understand the time series. Second, it provides basic and programmable feature extraction from uni- or multivariate time series. Third, it applies the features to multiple time series anomaly detection methods. Many R packages deal with time series, including forecasting and anomaly detection methods. To the best of our knowledge, however, no package has focused on an efficient and comprehensive analysis process, especially for multiple time series. Our package and contributions should be effective for R users who face similar problems.
14:30 Poster lightning Volha Tryputsen TBD Antibody characterization with next generation sequencing using Group My Abs shiny app visualisation, algorithms, models, data mining, applications, web app, reproducibility, bioinformatics
Next-generation sequencing (NGS), phage display technology and high-throughput capacities enable biologists in drug discovery to characterize antibodies (Abs) based on their HCDR3 sequences and to group them into families before moving to the hit-to-lead stage of drug discovery and development. This enables diversification of the Ab portfolio and ensures backup options if an Ab candidate fails. However, there was no method or software available in-house to support Ab discovery with the capacity to apply biophysical rules to classify the sequences. The Shiny app "Group My Abs" was developed to apply biophysical properties for Ab characterization to the NGS data. Several multiple sequence alignment algorithms implemented in the app enable sequence comparability. A method was developed to evaluate differences between comparable sequences and subsequently classify sequences into families. The app provides custom-made and interactive data visualization, enables refined Ab classification in a mathematical manner, considerably increases efficiency and ensures reproducibility. This all decreases bias and enables informed decision making during the hit-to-lead stage in biologics drug discovery.
14:30 Poster lightning Johanna Tróchez TBD APPLICATION TO ANALYZE THE STUDENT DESERTION visualisation, applications, web app, big data
Currently, there is a high desertion rate in higher education in Colombia, so there is great interest in knowing the factors that affect dropout, both at the institutional level and within each program, in order to implement improvement plans that support student retention. This article proposes an application built with R and its Shiny library that quantifies the number of dropouts relative to the number of students enrolled, yielding the dropout rate at the level of both faculties and academic programs. With this information, sample sizes are obtained to survey the affected population about the causes and reasons for desertion, thus determining key factors that affect student permanence.
14:30 Poster lightning Gabriel Domingo TBD Use of R in Antitrust: The case of the Philippine Competition Commission visualisation, models, applications, reproducibility, community/education, Antitrust
The use of quantitative analysis of economic data in antitrust is well established. As a new competition agency, the challenges of adopting these analyses at the Philippine Competition Commission are daunting. From competition enforcement to merger control, the R language empowers our teams of economists in their work. We use R's flexibility and power to clean and model price and demand data when investigating anti-competitive agreements or mergers, and abuses of dominant position. R allows our analysts to rule out various theories of harm in the market, while re-focusing our efforts on specific areas of concern. We use several packages for data modeling; dplyr and antitrust are particularly useful. When defining the geographic market in merger control, we use R's mapping and plotting packages ggmap, ggplot2, leaflet and osrm. These tools determine the scope of a market by pinpointing supplier and consumer locations, illustrating routes, and computing distances and travel times. Finally, we will discuss our efforts to expose more of our economists to R via hands-on training sessions in small teams, and our considerations in using Rmarkdown to standardize our reports.
14:30 Poster lightning Shian Su TBD Glimma: interactive graphics for gene expression analysis visualisation, applications, bioinformatics
Modern RNA sequencing produces large amounts of data covering tens of thousands of genes. Exploratory and statistical analysis of these genes produces plots or tables with many data points. Glimma is a Bioconductor package that provides interactive versions of common plots from limma, a widely used gene expression analysis package. It allows researchers to explore the statistical summary of their data, with cross-chart interactions providing greater insight into the behaviour of specific genes. Interactivity allows genes of interest to be quickly interrogated on the summary graphic, which provides better context than searching through spreadsheets, and cross-chart interactions display useful additional content that would otherwise require manual querying. Glimma produces HTML pages with custom D3 JavaScript that handles interactions completely independently of R, allowing the resulting plots to be easily shared with researchers without software dependencies beyond a modern browser.
14:30 Poster lightning Adam Gruer TBD Using R and Process Control Charts to Help Hospital Management See The Woods For The Trees visualisation
The poster describes a project undertaken with the Head of Surgery to introduce process control tools and methods to a broader population of hospital managers and executives. It was observed that existing reporting of hospital KPIs was encouraging management and other staff to waste time, energy and analytic resources on variances that were not outside the range of random variation, an inefficient use of limited resources. The project involved developing RMarkdown reports and flexdashboards to visualise the variance in the processes being monitored. CRAN packages such as qicharts2 and the tidyverse were selected as useful tools for completing the project, alongside consulting the literature on process control and Lean methodology and contacting other R users in health systems such as the NHS in the UK. Important topics such as user interface (UI), user experience (UX) and communication, promotion and education programmes also needed to be considered, and the poster will highlight how other departments in the hospital with experience in these areas were consulted. The poster discusses the technical and cultural challenges faced and the solutions that were developed.
14:30 Poster lightning Sharon Lee TBD Shiny EMMIXskew for symmetric and asymmetric mixture modelling visualisation, algorithms, models, data mining, applications, web app, multivariate
EMMIXskew allows users to easily fit univariate and multivariate mixture models and perform inference. Designed with a focus on analyzing data that exhibit non-normal features such as asymmetry and heavy tails, EMMIXskew offers the option of fitting mixtures of skew normal and skew t-distributions in addition to traditional normal and t-mixture models. These models have received increasing attention in recent years due to their power and flexibility, as witnessed by many applications in fields ranging from biomedicine and imaging to the social sciences and finance. In this talk, we introduce the EMMIXskew package and its accompanying Shiny app. Its main functionality will be demonstrated with real-life applications. We will also cover various useful tools included in the package, such as density calculation, mode calculation, random sample generation, error rate calculation, and contour visualization. With the Shiny interface, analysis using these models becomes much more accessible for practitioners and R users.
15:45 Applications in society Richard Layton TBD Data, methods, and metrics for studying student persistence applications, community/education, persistence metrics, intersectionality, longitudinal
This paper introduces R users to data and tools for investigating undergraduate persistence metrics using the midfieldr package and its associated data packages. The data are the student records (registrar's data) of approximately 200,000 undergraduates at US institutions from 1990 to 2016. midfieldr provides functions for determining persistence metrics such as graduation rates and for grouping and displaying findings by program, institution, race/ethnicity, and sex. These packages provide an entry to this type of intersectional research for anyone with basic proficiency in R and familiarity with packages from the tidyverse. The goal of the paper is to introduce the packages and to share our data, methods, and metrics for intersectional research in student persistence.
15:45 Applications in society Maria Holcekova TBD The dynamic approach to inequality: Using longitudinal trajectories of young women and their parents in determining their socio-economic positions within the contemporary Western society visualisation, clustering, imputation, longitudinal data analysis
Intensified globalisation and the ensuing increased affluence of Western populations have changed the composition of the traditional social class system in England. This does not imply the disappearance of socio-economic (SE) classes and inequalities, but rather their redefinition. Unfortunately, limited research has considered the dynamic nature of SE positions, especially in understanding youth transitions from parental to personal SE classes. I address this problem using nationally representative longitudinal data from the Next Steps 1990 youth cohort study in England. Firstly, I explore the parental transition patterns using longCatPlot. Secondly, I visualise missing data through the missmap function in Amelia and impute these values using random forests in missForest. Thirdly, I employ the daisy function within the cluster package to create SE groups based on Gower distance, partitioning around medoids, and silhouette width. Finally, I visualise the results using ggplot2. In doing so, I establish five distinct SE groups of young women, which contributes to the understanding of new forms of inequality, and I discuss the implications in terms of access to educational and labour market resources.
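For readers unfamiliar with the clustering steps listed above, here is a small self-contained sketch of the same building blocks (random-forest imputation, Gower distance via daisy, PAM, silhouette width) on invented data; it is not the author's Next Steps analysis.

```r
library(cluster)     # daisy(), pam(), silhouette()
library(missForest)  # missForest()

set.seed(1)
df <- data.frame(
  income_band = factor(sample(c("low", "mid", "high"), 100, replace = TRUE)),
  tenure      = factor(sample(c("own", "rent"), 100, replace = TRUE)),
  nssec_score = rnorm(100)
)
df$nssec_score[sample(100, 10)] <- NA        # introduce some missing values

imp <- missForest(df)$ximp                   # random-forest imputation
d   <- daisy(imp, metric = "gower")          # dissimilarity for mixed-type data
fit <- pam(d, k = 5, diss = TRUE)            # partitioning around medoids
mean(silhouette(fit)[, "sil_width"])         # average silhouette width
```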
15:45 Applications in society Frank C.S. Liu TBD The Second Wing of Polls: How Multiple Correspondence Analysis using R Advances Exploring Associated Attitudes in Smaller-Data applications
Polls and surveys have been used to forecast voter preferences and understand consumer behavior. Academically, we employ the strength of these smaller but representative data to confirm theory, including identifying associations between theoretically identified variables. However, researchers who want to explore new patterns for a better understanding of voters' behavior and attitudes are hardly satisfied by the current practice of survey data analysis. While we turn to bigger data, little attention has been paid to the value of such smaller data and their potential to achieve the same goal. This talk will demonstrate how the "FactoMineR" package assists the exploration of associated concepts, attitudes and patterns that could not be identified by theories in the first place. Implications for the practice of survey data collection and MCA's connection to association rule mining will be discussed.
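A minimal illustration of multiple correspondence analysis with FactoMineR, using the package's bundled 'tea' survey data rather than the speaker's own poll data:

```r
library(FactoMineR)

data(tea)
res <- MCA(tea[, 1:18], graph = FALSE)   # MCA on the categorical survey items
summary(res)                             # eigenvalues, contributions, v-tests
plot(res, invisible = "ind")             # map of associated response categories
```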
15:45 Applications in society Meryam Krit TBD Modelling Rift Valley Fever models, applications, community/education
Rift Valley Fever (RVF) is one of the major viral zoonoses in Africa, affecting humans and several domestic animal species. The epidemics generally follow a 5-15 year cycle marked by abnormally high rainfall (the El Niño/Southern Oscillation phenomenon, ENSO), but there is growing evidence of inter-epidemic transmission. A flexible model describing RVF transmission dynamics in six species (human, domestic animal, four vectors) in three different areas will be presented. The model allows for migration, flooding, variation in climate, seasonal effects on vector egg hatching, transhumance, alternative wildlife hosts and increased susceptibility of animals. A user-friendly Shiny interface and an optimized Rcpp implementation allow epidemiological researchers to study different scenarios and adapt the model to other situations. Application of the model to the specific situations in Tanzania and Algeria will be discussed.
15:45 Big data Miguel Gonzalez-Fierro TBD Spark on demand with AZTK big data
Apache Spark has become the technology of choice for big data engineering. However, provisioning and maintaining Spark clusters can be challenging and expensive. To address this issue, Microsoft has developed the Azure Distributed Data Engineering Toolkit (AZTK). This talk describes how AZTK Spark clusters can be provisioned in the cloud from a local machine with just a few commands. The clusters are ready to use in under 5 minutes and come with R and RStudio Server pre-installed, allowing R users to start developing Spark applications immediately. Users can apply their own Docker image to customize the Spark environment. AZTK clusters, composed of low-priority Azure virtual machines, can be created on demand and run only as needed, allowing for large cost savings. We will show a short demo of how the pre-installed sparklyr package can be used to perform data engineering tasks using dplyr syntax, and machine learning using the Spark MLlib library.
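The sparklyr workflow mentioned at the end can be sketched as follows. Here the connection is to a local Spark instance for illustration; on an AZTK cluster the master URL and connection details would differ.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")          # swap for the cluster's master URL
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed on the cluster
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# Spark MLlib via sparklyr
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)

spark_disconnect(sc)
```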
15:45 Big data Benjamin Ortiz Ulloa TBD Graphs: Datastructures to Query algorithms, models, databases, networks, text analysis/NLP, big data
When people think of graphs, they often think about mapping out social media connections. While graphs are indeed useful for mapping out social networks, they have many other practical applications. Data in the real world resemble vertices and edges more than they resemble rows and columns. This allows researchers to intuitively grasp the data modeled and stored within a graph. Graph exploration -- also known as graph traversal -- is traditionally done with a traversal language such as Gremlin or Cypher. The functionality of these traversal languages can be duplicated by combining the igraph and magrittr packages. Traversing a graph in R gives useRs access to a myriad of simple, but powerful algorithms to explore their data sets. This talk will show why data should be explored as a graph as well as show how a graph can be traversed in R. I will do this by going through a survey of different graph traversal techniques and by showing the code patterns necessary for each of those techniques.
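A small sketch of what Gremlin-style "traversal" steps can look like when igraph is combined with magrittr pipes; this is illustrative, not the speaker's code.

```r
library(igraph)
library(magrittr)

# a tiny graph of people and their connections
g <- graph_from_literal(Alice - Bob, Alice - Carol, Bob - Dan, Carol - Dan, Dan - Eve)

# one "hop": vertices directly connected to Alice
g %>% neighbors("Alice") %>% as_ids()

# two "hops": everything within two steps of Alice (like out().out() in Gremlin)
g %>% ego(order = 2, nodes = "Alice") %>% extract2(1) %>% as_ids()

# a filtered step: neighbours of Alice who are also neighbours of Eve
intersect(as_ids(neighbors(g, "Alice")), as_ids(neighbors(g, "Eve")))
```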
15:45 Big data Amy Stringer TBD Automated Visualisations for Big Data visualisation, reproducibility, big data
The Catlin Seaview project is a large scale reef survey for estimating coral cover at various locations around the world. Upon re-surveying, it is possible to track changes in, and predict the future condition of, these reefs over time. The survey collects hundreds of thousands of images from 2 km transects of reef, which are then sent to a neural network for automatic annotation of reef communities. Annotations are completed in such a way that the resulting data have hierarchical spatial scales; going up from image, to transect, to reef, to subregion, to region. Here, we present an efficient method for extracting, summarising and visualising the big and complex data with Rmarkdown, dplyr and ggplot2. The use of Rmarkdown for report generation allows for the introduction of parameters into the construction of the document, allowing for entirely unique reports to be developed from the one source script. This approach has resulted in a system for compiling 22 reproducible reports, extracting, summarising and visualising data at multiple spatial scales, from over 600 000 images, in a matter of minutes; leaving machines to do the work so that people have time to think.
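The parameterised-report pattern described above can be sketched roughly as below; the file name, parameter name and reef names are invented for illustration.

```r
# The Rmd file declares its parameter in the YAML header, e.g.
#   params:
#     reef: "Heron Island"
# and refers to params$reef inside its dplyr/ggplot2 chunks.
library(rmarkdown)

reefs <- c("Heron Island", "Osprey Reef", "Wilson Island")

for (reef in reefs) {
  render("reef_report.Rmd",
         params      = list(reef = reef),
         output_file = paste0("report_", gsub(" ", "_", reef), ".html"))
}
```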
15:45 Big data Snehalata Huzurbazar TBD Visualizations to guide dimension reduction for sparse high-dimensional data visualisation
Dimension reduction for high-dimensional data is necessary for descriptive data analysis. Most researchers restrict themselves to visualizing 2 or 3 dimensions; however, to understand relationships between many variables in high-dimensional data, more dimensions are needed. This talk presents several new options for visualizing beyond 3D, illustrated using 16S rRNA microbiome data. We show intensity plots developed to highlight the changing contributions of taxa (or subjects) as the number of principal components of the dimension reduction or ordination method is changed. We also revive Andrews curves, connected with a tour algorithm for viewing 1D projections of multiple principal components, to study group behavior in high-dimensional data. The plots provide a quick visualization of taxa/subjects that are close to the 'center' or that contribute to dissimilarity. They also allow for exploration of patterns among related subjects or taxa not seen in other visualizations. All code is written in R and available on GitHub.
15:45 Statistical methods for high-dimensional biology Florian Rohart TBD mixOmics: An R package for 'omics feature selection and multiple data integration data mining, applications, bioinformatics, multivariate, big data
The mixOmics R package contains a suite of multivariate methods that model molecular features holistically and statistically integrate diverse types of data (e.g. 'omics data such as transcriptomics, proteomics, metabolomics) to offer an insightful picture of a biological system. Our two latest frameworks for data integration are N-integration with DIABLO, which combines different 'omics datasets measured on the same N samples or individuals, and P-integration with MINT, which combines studies measured on the same P features (e.g., genes) but from independent cohorts of individuals. Both frameworks are introduced in a discriminative context for the identification of relevant and robust molecular signatures across multiple data sets. mixOmics is a well-designed, user-friendly package with attractive graphical outputs. It represents a significant contribution to the field of computational biology, which has a strong need for such toolkits to mine and integrate datasets.
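As a rough sketch of the DIABLO (N-integration) workflow, based on the multi-omics example data shipped with mixOmics; the tuning values below are illustrative only.

```r
library(mixOmics)

data(breast.TCGA)
X <- list(mRNA    = breast.TCGA$data.train$mrna,
          miRNA   = breast.TCGA$data.train$mirna,
          protein = breast.TCGA$data.train$protein)
Y <- breast.TCGA$data.train$subtype

# sparse multi-block discriminant analysis across the three 'omics blocks
fit <- block.splsda(X, Y, ncomp = 2,
                    keepX = list(mRNA = c(10, 10), miRNA = c(10, 10), protein = c(5, 5)))

plotIndiv(fit)                 # samples projected on the integrated components
circosPlot(fit, cutoff = 0.7)  # correlations among selected features across blocks
```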
15:45 Statistical methods for high-dimensional biology Claus Ekstrøm TBD Using mommix for fast, large-scale genome-studies in the presence of gene-environment and gene-gene interaction algorithms, models, bioinformatics, big data
The majority of disorders and outcomes analysed in genome-wide association studies are believed to be multi-factorial and influenced by gene-environment (GxE) interactions, gene-gene (GxG) interactions, or both. However, including GxE or GxG increases the computational burden by several orders of magnitude, which makes the inclusion of interactions prohibitively cumbersome. Finite mixtures of regression models provide a flexible modeling framework for many phenomena. Using moment-based estimation of the regression parameters, we develop unbiased estimators with a minimum of assumptions on the mixture components. In particular, only the average regression model for one of the components in the mixture model is needed, and no distributional assumptions are required. We present a new R package, mommix, for moment-based mixtures of regression models, which implements this new approach. We illustrate the use of moment-based mixtures of regression models with an application to genome-wide association analysis, and show that the implementation is fast, which makes large-scale genetic analysis with gene-environment and gene-gene interactions feasible.
15:45 Statistical methods for high-dimensional biology Jacob Bergstedt TBD Quantifying the immune system with the MMI package models, data mining, applications, reproducibility, bioinformatics, interfaces
The blood composition of immune cells provides a key indicator of human health and disease. To identify the sources of variation in this composition, we combined standardized flow cytometry with a questionnaire investigating demographic factors in 816 French individuals. The study is published in the Nature Immunology article “Natural variation in innate immune cell parameters is preferentially driven by genetic factors”. To facilitate the study, we developed the R package MMI (https://github.com/jacobbergstedt/mmi), which defines a framework for specifying a family of models. Operations are implemented for models in the family, such as running tests, computing confidence intervals or AIC measures and investigating residuals, the results of which are collected in a MapReduce-like pattern. The software keeps track of variables, parameter transformations, multiple testing and selective inference adjustments. With the package we release the dataset of 816 observations of 166 immune cell parameters and 44 demographic variables. We hope that this resource can be used to generate hypotheses in immunology, but also be of benefit to the broader community, in education and benchmarking.
15:45 Statistical methods for high-dimensional biology Rudradev Sengupta TBD High Performance Computing Using R for High Dimensional Surrogacy Applications in Drug Development models, data mining, applications, bioinformatics, performance, big data
Identification of genetic biomarkers is a primary data analysis task in the context of drug discovery experiments. These experiments consist of several high-dimensional datasets that contain information about a set of new drugs under development. This data structure introduces the challenge of multi-source data integration, which is needed in order to identify the biological pathways related to the new set of drugs under development. In order to process all the information contained in the datasets, high performance computing techniques are required. Currently available R packages for parallel computing are not optimized for this specific setting and data structure. We propose a new "master-slave" framework for parallel data analysis with R on a computer cluster. The proposed data analysis workflow is applied to a multi-source high-dimensional drug discovery dataset, and a performance comparison is made between the new framework and existing R packages for parallel computing. Different configuration settings for parallel programming in R are presented to show that, for the specific application under consideration, the computation time can be reduced substantially (a reported speed-up of 534.62%).
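The authors' framework itself is not shown here; the following is only a generic illustration of the master/worker idea using the base parallel package, with a placeholder task standing in for an expensive gene-set analysis.

```r
library(parallel)

cl <- makeCluster(4)                       # the "master" process spawns 4 workers

analyse_one <- function(i) {
  Sys.sleep(0.1)                           # placeholder for a real model fit
  data.frame(task = i, statistic = rnorm(1))
}

clusterSetRNGStream(cl, 2018)              # reproducible RNG across workers
results <- parLapply(cl, 1:100, analyse_one)  # tasks farmed out to the workers
stopCluster(cl)

do.call(rbind, results)                    # the master collects and combines results
```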
15:45 Complex models and performance Danielle Dean TBD R, Julia, Python Deep Learning Framework Comparison algorithms, models, reproducibility, deep learning
We extend the Python and Julia deep-learning examples (see the repo: https://github.com/ilkarman/DeepLearningFrameworks) to also include R deep-learning frameworks such as MXNet and Keras. The aim of the repo is (1) to create an easy environment for translating models (VGG training, ResNet inference, LSTM training) across different frameworks and across different platforms. The model examples in this repo have matured through multiple PRs from the open-source community (including authors of deep-learning packages) to ensure they are (i) optimised and (ii) standardised across frameworks. This allows an easy and valid comparison of what it is like to train a CNN in R compared to Julia or Python, which we believe is a key advantage compared to other examples. (2) The repository has attracted a lot of attention from the open-source community and framework authors, which means that further examples (including those in R) will be seen and verified by a large number of people. As the repository is expanded, we hope that it will be possible to leverage the learnings from one framework into another.
15:45 Complex models and performance Hong Ooi TBD SAR: a practical, rating-free hybrid recommender for large data algorithms, models, applications, big data
SAR (Smart Adaptive Recommendations) is a fast, scalable, adaptive algorithm for personalised recommendations, based on user transaction history and item descriptions. From an end-user's point of view, SAR has the following benefits. First, it is relatively easy to explain to a nontechnical audience, compared to algorithms that rely on matrix factorisation. Second, it doesn't use subjective ratings, which can be unreliable given the pervasive influence of social media: a product that gets review-bombed after going viral will have meaningless ratings. Third, it takes event times into account, thus allowing recommendations to evolve with changing trends. Finally, it does well in recommending cold items, by building a regression model on item data. In this talk I'll discuss two separate implementations of SAR: a standalone one in base R, and an interface to an Azure web service. The former allows easy experimentation and evaluation, while the latter provides more options and is scalable to production-scale datasets.
15:45 Complex models and performance Fang Zhou TBD Jumpstart Machine Learning with Pre-Trained Models algorithms, models, reproducibility, interfaces
As a community, many of us are building models (statistical and machine learning) that address various scenarios. At conferences like useR!, and across many academic conferences, researchers publish papers that introduce new algorithms, with implementations in R, Python and other frameworks available on GitHub. The community also makes available pre-trained models, especially deep learning models, to demonstrate or highlight the capabilities of an algorithm. To foster healthy collaboration and for the reproducibility of key results, it is important that fellow data scientists can read about a new algorithm or approach and be able to try it out quickly to see whether it meets their needs. While pre-trained machine learning models are available, they are often difficult to set up and evaluate. We are exploring a framework to make this process simpler by making it easy for any data scientist to investigate and evaluate pre-trained models. We will share our learnings and our proposal to enable data scientists to quickly discover pre-trained models that will support them in getting from zero to hero in short order.
15:45 Complex models and performance Joshua Bon TBD Semi-infinite programming in R algorithms, models
Semi-infinite programming (SIP) is an optimisation problem where, generally, there are a finite number of variables but an infinite number of (parametrised) constraints. We show how to optimise simple SIP problems in R, in particular SIP for shape-constrained regression. The package sipr (under development) will be presented and collaboration sought from those in attendance.
15:45 Robust methods Kasey Jones TBD rollmatch: An R Package for Rolling Entry Matching algorithms, models
The gold standard of experimental research is the randomized controlled trial. However, many healthcare interventions are implemented without a randomized control group for practical or ethical reasons. Propensity score matching (PSM) is a popular method for approximating a randomized experiment from observational data by matching members of a treatment group to similar candidates from a control group that did not receive the intervention. However, traditional PSM is not designed for studies that enroll participants on a rolling basis, a common practice in healthcare interventions where delaying treatment may impact patient health. Rolling Entry Matching (REM) is a new matching method that addresses the rolling entry problem by selecting comparison group members who are similar to intervention members with respect to both static, unchanging characteristics (e.g., race, date of birth) and dynamic characteristics that change over time (e.g., health conditions, health care use). This presentation will introduce both REM and rollmatch, an R package for performing REM to assess rolling entry interventions.
15:45 Robust methods Charles T. Gray TBD varameta: Meta-analysis of medians algorithms, models, applications, reproducibility
Meta-analyses bring together summary statistics from multiple sources, which are reported in various ways. In this talk I will introduce the `varameta` package, which provides an underlying (and reproducible) framework for understanding skewed meta-analysis data and reporting. The `varameta` package accompanies a couple of theoretical papers I am working on for the meta-analysis of medians, and is designed to be an adjunct to the well-established `metafor` package. In this package I have collated the existing techniques for meta-analysing skewed data reported as medians and interquartile ranges (or ranges). The `varameta` package will also include reproducible simulation documentation (in .Rmd) of existing methods in meta-analysis, benchmarked against our proposed estimator for the standard error of the sample median. In this talk I will demonstrate the package and the web interface for clinicians, as well as how it can be implemented in everyday systematic reviews.
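The `varameta` estimator itself is not shown here; the sketch below only illustrates the conventional `metafor` step that an estimator for the standard error of the sample median would feed into, using made-up effect sizes and standard errors.

```r
library(metafor)

dat <- data.frame(
  study = paste("Study", 1:5),
  yi    = c(0.20, 0.35, 0.10, 0.28, 0.15),   # e.g. estimated median differences
  sei   = c(0.10, 0.12, 0.09, 0.11, 0.10)    # standard errors (from an estimator such as varameta's)
)

res <- rma(yi = yi, sei = sei, data = dat, slab = study, method = "REML")
summary(res)
forest(res)   # forest plot of the pooled estimate
```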
15:45 Robust methods Sevvandi Kandanaarachchi TBD Does normalizing your data affect outlier detection? algorithms, Data pre-processing
It is common practice to normalize data before using an outlier detection method. But which method should we use to normalize the data? Does it matter? The short answer is yes, it does. The choice of normalization method may increase or decrease the effectiveness of an outlier detection method on a given dataset. In this talk we investigate this triangular relationship between datasets, normalization methods and outlier detection methods.
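A tiny self-contained illustration of the point made above: the same nearest-neighbour outlier score applied after two different normalisations can rank points differently (invented data, base R only).

```r
set.seed(7)
x <- data.frame(a = rnorm(200), b = rnorm(200, sd = 100))
x[201, ] <- c(4, 50)                      # unusual in 'a', unremarkable in 'b'

nn_score <- function(d) {                 # distance to nearest neighbour
  dm <- as.matrix(dist(d))
  diag(dm) <- Inf
  apply(dm, 1, min)
}

z_scaled  <- scale(x)                                                  # z-score normalisation
mm_scaled <- apply(x, 2, function(v) (v - min(v)) / diff(range(v)))    # min-max normalisation

which.max(nn_score(z_scaled))   # top-ranked outlier under z-scores
which.max(nn_score(mm_scaled))  # the ranking under min-max can differ
```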
15:45 Robust methods Priyanga Dilini Talagala TBD oddstream and stray: Anomaly Detection in Streaming Temporal Data with R algorithms, space/time, multivariate, streaming data, outlier detection
This work introduces two R packages, oddstream and stray, for detecting anomalous series within a large collection of time series in the context of non-stationary streaming data. In `oddstream` we define an anomaly as an observation that is very unlikely given the recent distribution of a given system. This package provides a framework for early detection of anomalous behaviour within a large collection of streaming time series, including a novel approach that adapts to non-stationarity. In `stray` we define an anomaly as an observation that deviates markedly from the majority, with a large distance gap. This package provides a framework to detect anomalies in high-dimensional data, which is then extended to identify anomalies in streaming temporal data. The proposed algorithms use time series features as inputs, and approaches based on extreme value theory for the model building process. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our proposed frameworks. We show that the proposed algorithms work well in the presence of noisy non-stationary data within multiple classes of time series.
15:45 Reproducibility John Blischak TBD The workflowr R package: a framework for reproducible and collaborative data science reproducibility
The workflowr R package helps scientists organize their research in a way that promotes effective project management, reproducibility, collaboration, and sharing of results. workflowr combines literate programming (knitr and rmarkdown) and version control (Git, via git2r) to generate a website containing time-stamped, versioned, and documented results. Any R user can quickly and easily adopt workflowr, which includes four key features: (1) workflowr automatically creates a directory structure for organizing data, code, and results; (2) workflowr uses the version control system Git to track different versions of the code and results without the user needing to understand Git syntax; (3) to support reproducibility, workflowr automatically includes code version information in webpages displaying results; and (4) workflowr facilitates online web hosting (e.g. GitHub Pages) to share results. Our goal is that workflowr will make it easier for scientists to organize and communicate reproducible research results. Documentation and source code are available at https://github.com/jdblischak/workflowr.
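Typical workflowr usage looks roughly like the following; the project path, file names and GitHub user name are placeholders.

```r
library(workflowr)

wflow_start("~/projects/my-analysis")   # directory structure + Git repository
wflow_build()                           # render analysis/*.Rmd into the website
wflow_publish(c("analysis/index.Rmd", "analysis/first-analysis.Rmd"),
              message = "Publish first results")   # commit code, then results
wflow_use_github("myusername")          # set up GitHub Pages hosting
wflow_git_push()                        # push the site to GitHub
```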
15:45 Reproducibility Peter Baker TBD Efficient data analysis and reporting: DRY workflows in R applications, reproducibility
When analysing data for different projects, do you often find yourself repeating the same steps? Typically, these steps follow a familiar pattern of reading, cleaning, summarising, plotting and analysing data, then producing a report. To aid reproducibility, naive examples using Rmarkdown are often presented. However, I routinely employ a modular approach combining GNU Make, R, Rmarkdown and/or Sweave files tracked under git. This system helps to implement a don't repeat yourself (DRY) approach and scales up well as projects become more complex. To aid automation, I have developed generic R, Rmarkdown, Stata, SAS and other pattern rules for GNU Make, as well as R packages to generate a project skeleton consisting of initial directories, Makefiles and R syntax files for basic data cleaning and summaries; move data files and documents to standard directories; use codebook information to specify factors and check data; and finally initialise and add these to a local git repository. Comparisons will be made with alternative approaches such as ProjectTemplate and drake. GNU Make pattern rules and R software are available at https://github.com/petebaker.
15:45 Reproducibility Filip Krikava TBD Automated unit test generation using genthat reproducibility, testing
Your package has examples and vignettes of its overall functionality but no unit tests for individual functions. Writing those is no fun. Yet, when something goes wrong, unit tests are your best tool to quickly pinpoint errors. The genthat package can generate unit tests for you in the popular testthat format. Moreover, it can also be used to create reproductions when you find a bug in someone else's code: there, instead of generating passing test cases, it will generate the smallest, purposefully failing, one. Genthat does not magically create new tests out of the blue; instead it extracts the smallest possible test fragments from existing code. It does that by recording the input arguments and return values of all functions called by clients of your package. The generated tests concentrate on single functions and test them independently of each other, so a failing test usually locates the error more precisely than a failing chunk of application code. Trying it out on a random set of 1500 CRAN packages, genthat managed to reproduce 80% of all function calls, increasing the unit test coverage from 19% to 54%. In this talk we present genthat and discuss testing R code.
15:45 Reproducibility Dan Wilson TBD Practical R Workflows reproducibility, workflow
Learn how R can be used to create reproducible workflows for practical use in business. As analysts and data scientists we often need to repeat our work time and time again. Sometimes this will be the exact same task, other times it may be a slight variation for another client or stakeholder. This talk will demonstrate a real-world set of workflows established at The Data Collective designed to reduce the amount of copy/paste type actions to a few function calls that get the repetitive actions out of the way, so you can focus on the important parts of your job. Find out how to overcome the challenges of a repeatable workflow and make your life easier.
Time Session Presenter Venue Title Keywords
10:30 Applications in health and environment Mark Padgham TBD tRansport tools for the World (Health Organization) applications, reproducibility, community/education, space/time, big data
The World Health Organization (WHO) contracted us to provide actionable evidence for the redesign of urban transport policies to help rather than hinder human health. That means more active transport. Designing cost-effective policies to get people walking and cycling requires insight into where, when, how, and why people currently travel. This is challenging, especially in cities with limited resources, data, or analysis capabilities. We briefly describe some technical details of our 'Active Transport Toolkit' (ATT), but the primary focus will be the context that led to the WHO contract and where we plan to go next. We argue that useRs are well placed to provide openly available, global-scale, transparent tools for policy making. It was the flexibility of the R language and the supportiveness of its community - notably including rOpenSci, which hosts two of our packages - that enabled us to develop the ATT in a way that is flexible enough to capture cities' unique characteristics while providing a consistent user interface. The talk will conclude with an outline of lessons learned from the perspective of others wanting to create R tools to inform policy.
10:30 Applications in health and environment Philip Dyer TBD Models of global marine biodiversity: an exercise in mixing R markdown, parallel processing and caching on supercomputers models, applications, reproducibility, performance, big data
R has become the standard language in ecology for statistics and modelling. If a technique has been published in mathematical ecology it has an R package. Even the data sets have an R package! The size of data sets in ecology has been growing to the point where global analysis of ecological data can be considered. At the same time powerful statistical techniques that rely on randomly permuting the data, such as bootstrapping, have become more popular. These are exciting times, but how do we get R to process our large data sets with computationally expensive algorithms without waiting forever to get results? For those new to R, or at least new to big data in R, I have some tips, techniques and packages to help you get going. I have benefited from using R markdown and Knitr to make short transcript files. I have also made use of caching to avoid recalculating big models and using parallel processing to calculate the models faster in the first place.
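Two of the tricks mentioned above (caching fitted models so they are only computed once, and parallelising the expensive resampling) can be sketched generically as follows; the file paths and the model are placeholders, not the author's actual analysis.

```r
library(parallel)

dir.create("cache", showWarnings = FALSE)

fit_or_load <- function(path, fit_fun) {
  if (file.exists(path)) return(readRDS(path))   # reuse the cached model
  fit <- fit_fun()
  saveRDS(fit, path)                             # cache it for the next run
  fit
}

big_model <- fit_or_load("cache/global_model.rds", function() {
  lm(Sepal.Length ~ ., data = iris)              # stand-in for an expensive global model
})

# bootstrap-style resampling in parallel across cores (forked processes, Unix-alikes);
# knitr chunk option cache=TRUE can be layered on top in an R Markdown report
boot_coefs <- mclapply(1:500, function(i) {
  coef(lm(Sepal.Length ~ ., data = iris[sample(nrow(iris), replace = TRUE), ]))
}, mc.cores = 4)
```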
10:30 Applications in health and environment Chris Hansen TBD Enabling Analysts: Embracing R in a National Statistics Office Official Statistics
Stats NZ has recently adopted R as an approved analytical tool, and more recently for use in the production of official outputs. Since adoption, R has seen significant uptake and has been a great enabler for analysts. R is more expressive and flexible than the existing tools, allowing analysts to solve a variety of problems more easily. R is deployed on powerful servers, so users have a generous supply of memory and cores, meaning large datasets can be handled and long-running computations parallelised. Analysts access R using RStudio Server, and this IDE itself has had a number of positive impacts: the use of RStudio projects and R markdown documents in particular helps analysts work in a more organised way and ensures work is reproducible. Our statistical platforms can now also use R. This is done via OpenCPU, which enables remote execution of functions via an HTTP API; that is, OpenCPU can be used to call functions in internally developed packages as web services. This has proven useful as we transition to a more service-oriented architecture. In this talk we describe the R environment at Stats NZ and its implications for analysts, and provide examples of its use in practice.
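The OpenCPU call pattern (POST /ocpu/library/{package}/R/{function}) can be illustrated from R against the public demo server; Stats NZ's internal packages would be called in the same way, with their own package and function names substituted.

```r
library(httr)

# call stats::rnorm on the public OpenCPU demo server and get the result as JSON
resp <- POST("https://cloud.opencpu.org/ocpu/library/stats/R/rnorm/json",
             body   = list(n = 5, mean = 10),
             encode = "json")

content(resp)   # the function's return value, deserialised from JSON
```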
10:30 Applications in health and environment Tracy Huang TBD Developing an Uncertainty Toolbox for Agriculture: a closer look at Sensitivity Analysis visualisation, applications, web app, space/time, big data, R6 and Reference Classes
Digiscape is one of 8 Future Science Platforms in CSIRO focussed on delivering new analytics in the digital age to better inform agricultural systems in the face of uncertainty. The Uncertainty Toolbox is one of 15 projects within Digiscape trying to make a difference to the way models are interpreted, reported and communicated in practice for decision-making. Uncertainty is front and centre of every modelling problem, but it is sometimes difficult to quantify and challenging to communicate. The sensitivity analysis workflow focuses on developing a general framework for sensitivity analysis to inform the modeller about key parameters of interest and refine the model so it can be used in a robust way to make predictions and forecasts with uncertainties. We focus on methods applicable to large-scale, non-monotonic problems, developing variance-based approaches to sensitivity analysis using emulators. As such, the framework for developing this workflow in R becomes important for transparency and usability. We will outline the design steps for constructing this workflow using the latest object-oriented systems available in R and give a demonstration of the tool using Shiny.
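A minimal R6 sketch of how an emulator-based sensitivity-analysis workflow might be encapsulated; the class, its methods and the linear "emulator" are purely illustrative stand-ins, not the Digiscape toolbox.

```r
library(R6)

SensitivityStudy <- R6Class("SensitivityStudy",
  public = list(
    design = NULL,
    emulator = NULL,
    initialize = function(design) {
      self$design <- design
    },
    fit_emulator = function(response) {
      # stand-in for a Gaussian-process or other emulator fit
      self$emulator <- lm(response ~ ., data = self$design)
      invisible(self)
    },
    sensitivity = function() {
      # crude variance-based summary: share of variance attributed to each input
      a  <- anova(self$emulator)
      ss <- a[["Sum Sq"]]
      setNames(ss / sum(ss), rownames(a))
    }
  )
)

X <- data.frame(rain = runif(50), temp = runif(50))
y <- with(X, 2 * rain + 0.3 * temp + rnorm(50, sd = 0.1))

study <- SensitivityStudy$new(X)$fit_emulator(y)
study$sensitivity()
```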
10:30 Models and methods for biology and beyond Zachary Foster TBD Taxa and metacoder: R packages for parsing, visualization, and manipulation of taxonomic data visualisation, data mining, applications, databases, bioinformatics, Taxonomy
Modern microbiome research is producing datasets that are difficult to manipulate and visualize due to the hierarchical nature of taxonomic classifications. The “taxa” package provides a set of classes for the storage and manipulation of taxonomic data. Classes range from simple building blocks to project-level objects storing multiple user-defined datasets mapped to a taxonomy. It includes parsers that can read in taxonomic information in nearly any form. It also provides functions modeled after dplyr for manipulating a taxonomy and associated datasets such that hierarchical relationships between taxa as well as mappings between taxa and data are preserved. We hope taxa will provide a basis for an ecosystem of compatible packages. We have also developed the metacoder package for visualizing hierarchical data. Metacoder implements a novel visualization called heat trees that use the color and size of nodes and edges on a taxonomic tree to quantitatively depict up to 4 statistics. This allows for rapid exploration of data and information-dense, publication-quality graphics. This is an alternative to the stacked barcharts typically used in microbiome research.
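A tiny taxa/metacoder sketch of the parse-then-plot pattern described above, on made-up OTU lineages; see the packages' documentation for real use, as argument names may vary between versions.

```r
library(taxa)
library(metacoder)

otus <- data.frame(
  lineage = c("Bacteria;Firmicutes;Bacilli",
              "Bacteria;Firmicutes;Clostridia",
              "Bacteria;Proteobacteria;Gammaproteobacteria",
              "Bacteria;Proteobacteria;Alphaproteobacteria"),
  count = c(120, 80, 45, 30),
  stringsAsFactors = FALSE
)

# parse the classification strings into a taxonomy with the counts attached
obj <- parse_tax_data(otus, class_cols = "lineage", class_sep = ";")

# heat tree: node size and colour encode the number of observations per taxon
heat_tree(obj,
          node_label = taxon_names,
          node_size  = n_obs,
          node_color = n_obs)
```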
10:30 Models and methods for biology and beyond Saswati Saha TBD Multiple testing approaches for evaluating the effectiveness of a drug combination in a multiple-dose factorial design. applications, multivariate, Factorial Design, Drug Combination
Drug combination trials are often motivated by the fact that using existing drugs in combination might prove more effective than the existing drugs alone and less expensive than producing an entirely new drug. Several approaches have been explored for developing statistical methods that compare fixed (single-dose) combinations to their components. However, the extension of these approaches to a multiple-dose combination clinical trial is not always simple. Considering these facts, we have proposed three approaches by which we can provide confirmatory assurance that a combination of two or more drugs is more effective than the component drugs alone. These approaches involve multiple comparisons in a multilevel factorial design where the type 1 error is controlled by a Bonferroni test, a bootstrap test, and a union-intersection test in which the least favorable null configuration is considered. We have also built an R package implementing the above approaches, and in this presentation we demonstrate how this R package can be used in a drug combination trial. We will also show how these three approaches perform when benchmarked against an existing approach.
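The authors' package is not shown here; as a reminder of the simplest of the three ideas, the Bonferroni adjustment in base R looks like this (the p-values are invented).

```r
# p-values from, say, comparing each dose combination against its components
p_raw <- c(0.003, 0.012, 0.020, 0.180, 0.260)

p_bonf <- p.adjust(p_raw, method = "bonferroni")   # controls family-wise type 1 error

data.frame(comparison     = paste0("combo_", seq_along(p_raw)),
           p_raw          = p_raw,
           p_bonferroni   = p_bonf,
           reject_at_5pct = p_bonf < 0.05)
```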
10:30 Models and methods for biology and beyond Bill Lattner TBD Modeling Heterogeneous Treatment Effects with R models, applications
Randomized experiments have become ubiquitous in many fields. Traditionally, we have focused on reporting the average treatment effect (ATE) from such experiments. With recent advances in machine learning, and the overall scale at which experiments are now conducted, we can broaden our analysis to include heterogeneous treatment effects. This provides a more nuanced view of the effect of a treatment or change on the outcome of interest. Going one step further, we can use models of heterogeneous treatment effects to optimally allocate treatment. In this talk we will provide a brief overview of heterogeneous treatment effect modeling. We will show how to apply some recently proposed methods using R, and compare the results of each using a question wording experiment from the General Social Survey. Finally, we will conclude with some practical issues in modeling heterogeneous treatment effects, including model selection and obtaining valid confidence intervals.
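As a concrete illustration (the talk does not name specific packages, so the choice of grf's causal forest here is an assumption), heterogeneous treatment effects can be estimated in R along these lines:

library(grf)

set.seed(1)
n <- 2000
X <- matrix(rnorm(n * 5), n, 5)      # covariates
W <- rbinom(n, 1, 0.5)               # randomised treatment assignment
tau <- 0.5 * X[, 1]                  # true effect varies with X1
Y <- tau * W + X[, 2] + rnorm(n)     # outcome

cf <- causal_forest(X, Y, W)

tau_hat <- predict(cf)$predictions   # per-observation effect estimates
average_treatment_effect(cf)         # overall ATE for comparison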
10:30 Models and methods for biology and beyond Asya Shklyar TBD Building an HPC Infrastructure at a Liberal Arts college community/education, bioinformatics, performance, text analysis/NLP, big data, streaming data
This talk describes the buildout of an HPC infrastructure for a Liberal Arts college, from the design stage to implementation, covering the technology stack, the reasoning behind choosing the technologies, and some interesting use cases.
10:30 Learning and teaching François Michonneau TBD Lessons learned from developing R-based curricula across disciplines community/education
The Carpentries is a non-profit volunteer organization that teaches scientists with little or no programming experience foundational skills in coding, data science, and best practices for reproducible research. We offer 2-day workshops for a variety of disciplines including Ecology, Genomics, Geospatial analysis, and Social Sciences. With 1300+ instructors who have taught 500+ workshops on all continents, we worked with our community of instructors to assemble evidence-based curricula using results from research on teaching and learning. We have developed detailed short- and long-term assessments to evaluate the effectiveness and level of satisfaction of our learners after attending a workshop, as well as the impact on their research and careers 6 months or more afterwards. We find that workshop participants program more often, are more confident, and use programming practices that they report make them more efficient and their research more reproducible. Here, we will present the lessons we learned about developing curricula based on teaching R to novices across diverse disciplines, and the strategies we use to instill the desire to continue learning after attending our workshops.
10:30 Learning and teaching Matthias Gehrke TBD Student Performance and Acceptance of Technology in a Statistics Course Based on R mosaic - Results from a Pre- and Post-Test Survey community/education, teaching
In recent years there has been a movement towards simulation-based inference (e.g., bootstrapping and randomization tests) in order to improve students' understanding of statistical reasoning (see e.g. Chance et al. 2016). The R package mosaic was developed with a "minimal R" approach to simplify the introduction of these concepts (Pruim et al. (2017)). With a pre- and post-survey we analysed whether students improved in understanding as well as in acceptance of R during a one-semester statistics course in economically related Bachelor and Master programs. These courses were held by different lecturers at multiple locations in Germany. At our private university of applied sciences for professionals studying while working, the use of R is compulsory in all statistics courses. While conceptual understanding was evaluated using a subset of the modified CAOS inventory (as in Chance et al. (2016)), the acceptance and use of technology was measured using an adapted version of UTAUT2 (Venkatesh et al. (2012)).
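A minimal sketch, assuming mosaic's formula-based interface, of the kind of simulation-based exercise such a course uses (mtcars stands in for course data):

library(mosaic)

# Bootstrap distribution of a sample mean
boot <- do(1000) * mean(~ mpg, data = resample(mtcars))

# 95% percentile interval
quantile(boot$mean, c(0.025, 0.975))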
10:30 Learning and teaching Mette Langaas TBD Teaching statistics - with all learning resources written in R Markdown community/education, teaching
In applied courses in statistics it is important for the student to see a mix of theory, practical examples and data analyses. Being able to study the R code used to produce the data analyses, and to run and modify that code, gives the student hands-on experience, which in turn may lead to increased theoretical understanding. I will describe my experiences with producing and using learning material written in R Markdown in two courses in statistics at the Norwegian University of Science and Technology. One course is at the master level (Generalized linear models) with few students (35) and a mix of plenary and interactive lectures. The other course is at the bachelor level (Statistical learning) with more students (70).
10:30 Learning and teaching Peter Dalgaard TBD What's in a name? 20 years of R release management community/education, History of R
In this talk, I will go through the history of R releases since 1997. I will discuss the role of the R Core Team with special emphasis on development principles and release management issues. A few "war stories" will also be included. Some light will be thrown on the choice of release names since 2011.
10:30 Data handling Chester Ismay TBD Statistical Inference: A Tidy Approach using R visualisation, community/education, statistical inference, tidyverse community
How do you code up a permutation test in R? What about an ANOVA or a chi-square test? Have you ever been uncertain as to exactly which type of test you should run given the data and questions asked? The `infer` R package was created to unite common statistical inference tasks into an expressive and intuitive framework to alleviate some of these struggles and make inference more intuitive. This talk will focus on developing an understanding of the design principles of the package, which are firmly motivated by Hadley Wickham's tidy tools manifesto. It will also discuss the implementation, centered on the common conceptual threads that link a surprising range of hypothesis tests and confidence intervals. Lastly, we'll dive into some examples of how to use the `infer` package with different data sets and variable scenarios. The package aims to be useful to new students of statistics as well as seasoned practitioners.
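A short sketch of the pipeline the talk describes, here a permutation test for a difference in means on simulated data (the variable names are invented for illustration):

library(dplyr)
library(infer)

set.seed(42)
dat <- data.frame(
  group = factor(rep(c("A", "B"), each = 50)),
  score = c(rnorm(50, mean = 10), rnorm(50, mean = 11))
)

null_dist <- dat %>%
  specify(score ~ group) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("A", "B"))

obs_stat <- dat %>%
  specify(score ~ group) %>%
  calculate(stat = "diff in means", order = c("A", "B"))

get_p_value(null_dist, obs_stat, direction = "two_sided")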
10:30 Data handling Thomas Lumley TBD Subsampling and one-step polishing for generalised linear models algorithms, models, databases, big data
Using only a commodity laptop it's possible to fit a generalised linear model to a dataset from about a million to a billion rows by first fitting to a subset and then doing a one-step update. The method depends on a bit of asymptotic theory, some sampling, the Fisher scoring algorithm, efficient R-database interfaces, and a little of the tidyverse.
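Not the speaker's code, but a toy, in-memory illustration of the idea for logistic regression: fit the model on a subsample, then take a single Fisher-scoring step using the full data.

set.seed(1)
n <- 1e6
x <- rnorm(n)
X <- cbind(1, x)
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))

# Step 1: fit on a modest subsample
idx  <- sample(n, 1e4)
fit0 <- glm(y[idx] ~ x[idx], family = binomial)
beta <- coef(fit0)

# Step 2: one Fisher-scoring update using the full data
eta <- drop(X %*% beta)
p   <- plogis(eta)
U   <- crossprod(X, y - p)              # score vector
I   <- crossprod(X, X * (p * (1 - p)))  # Fisher information
beta_one_step <- beta + drop(solve(I, U))
beta_one_step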
10:30 Data handling James Hester TBD Glue strings to data in R Package development
String interpolation, evaluating a variable name to a value within a string, is a feature of many programming languages including Python, Julia, JavaScript, Rust, and most Unix shells. R's `sprintf()` and `paste()` functions provide some of this functionality, but have limitations which make them cumbersome to use. There are also some existing add-on packages with similar functionality; however, each has drawbacks. The glue package performs robust string interpolation for R. This includes evaluation of variables and arbitrary R code, with a clean and simple syntax. Because it is dependency-free, it is easy to incorporate into packages. In addition, glue provides an extensible interface to perform more complex transformations, such as `glue_sql()` to construct SQL queries with automatically quoted variables. This talk will show how to utilize glue to write beautiful code which is easy to read, write and maintain. We will also discuss ways to best use glue when performance is a concern. Finally, we will create custom glue functions tailored towards specific use cases, such as JSON construction, colored messages, emoji interpolation and more.
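A quick illustration of the interpolation syntax and of glue_sql()'s automatic quoting (the in-memory SQLite connection and table are invented for the example):

library(glue)
library(DBI)

name <- "useR! 2018"
n_talks <- 3
glue("Welcome to {name}: {n_talks} talks mention glue-style interpolation.")

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "talks", data.frame(speaker = "James Hester", topic = "glue"))

# Identifiers in backticks are quoted as identifiers, values are quoted as values
query <- glue_sql("SELECT * FROM {`tbl`} WHERE speaker = {who}",
                  tbl = "talks", who = "James Hester", .con = con)
dbGetQuery(con, query)
dbDisconnect(con)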
10:30 Data handling Max Kuhn TBD Data Preprocessing using Recipes algorithms, models
The recipes package can be used as a replacement for model.matrix as well as a general feature engineering tool. The package uses a dplyr-like syntax where a specification for a sequence of data preprocessing steps is created, with the execution of these steps deferred until later. Data processing recipes can be created sequentially and intermediate results can be cached. An example is used to illustrate the basic recipe functionality and philosophy.
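A minimal sketch of the deferred workflow described above (the step choices are illustrative; argument names follow current recipes releases): centre and scale the numeric predictors of mtcars.

library(dplyr)
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

prepped <- prep(rec, training = mtcars)      # estimate means and SDs
baked   <- bake(prepped, new_data = mtcars)  # apply the steps to data
head(baked)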
10:30 Statistical modeling John Fox TBD New Features in the car and effects Packages visualisation, models
The widely used car and effects packages are associated with Fox and Weisberg, An R Companion to Applied Regression, the third edition of which will be published this year. In preparation, we have released the substantially revised version 3.0-0 of the car package and version 4.0-1 of the effects package. The car package focuses on tools, many of them graphical, that are useful for applied regression analysis (linear, generalized linear, mixed-effects models, etc.), including tools for preparing, examining, and transforming data prior to specification of a regression model, and tools that are useful for assessing regression models that have been fit to data. The effects package focuses on graphical methods for interpreting regression models that have been fit to data. Among the many changes and improvements to the packages are a reconceptualization of effect displays, which we call "predictor effects"; the ability to add partial residuals to effect plots of arbitrary complexity; simplification of the arguments of plotting functions; new and improved functions for summarizing and testing statistical models; and improved methods for selecting variable transformations.
10:30 Statistical modeling Rainer Hirk TBD mvord: An R Package for Fitting Multivariate Ordinal Regression Models algorithms, models, applications, multivariate
The R package mvord implements composite likelihood estimation in the class of multivariate ordinal regression models with probit and logit links. A flexible modeling framework for multiple ordinal measurements on the same subject is set up, which takes into consideration the dependence among the multiple observations by employing different error structures. Heterogeneity in the error structure across subjects can be accounted for by the package, which allows for covariate-dependent error structures. In addition, regression coefficients and threshold parameters vary across the multiple response dimensions in the default implementation. However, constraints can be defined by the user if a reduction of the parameter space is desired. The proposed multivariate framework is illustrated by means of a credit risk application.
10:30 Statistical modeling Heather Turner TBD PlackettLuce: Modelling Rankings Data models
This talk introduces the CRAN package PlackettLuce for fitting the Plackett-Luce model to rankings. This model estimates the worth of items being ranked by viewing a ranking as a set of consecutive choices and modelling the probability of an item being chosen above others as the ratio of that item's worth to the total worth of the items from which the choice is made. PlackettLuce offers several advantages over competing approaches to fitting the Plackett-Luce model in R. Firstly, it is flexible in the type of rankings that can be handled; in particular it can handle ties, partial rankings, and sets of rankings that do not satisfy the conditions for maximum likelihood estimation, for example because one item always comes first in its rankings. Secondly, it enables inference on the worth estimates by providing model-based standard errors along with a method for obtaining quasi-standard errors, which do not depend on the identifiability constraints. Finally, it provides a method to work with the psychotree package to fit Plackett-Luce trees. This presentation will include a novel application of Plackett-Luce trees to data from a citizen science project in agricultural development.
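A small sketch of the interface, following the package documentation (the fruit rankings are invented): convert a matrix of rankings, fit the model, and inspect the estimated worths.

library(PlackettLuce)

# Each row ranks three items; 0 marks an item not ranked in that ranking
R <- matrix(c(1, 2, 3,
              1, 3, 2,
              2, 1, 3,
              1, 2, 0),
            byrow = TRUE, ncol = 3,
            dimnames = list(NULL, c("apple", "banana", "orange")))
R <- as.rankings(R)

mod <- PlackettLuce(R)
summary(mod)
coef(mod, log = FALSE)  # worth estimates on the probability scale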
10:30 Statistical modeling Joachim Schwarz TBD Partial Least Squares with formative constructs and a binary target variable PLS, plspm package, formative constructs, binary target variable
In recent years, the use of PLS has become more and more important for modelling dependencies between latent variables as an alternative to classical structural equation modelling. However, a non-metric target variable in combination with formatively measured constructs is still a particular challenge for the PLS approach. Using the plspm package (Sanchez/Trinchera/Russolillo 2017), we tested a model from the human resources management field. The main goal of this model is to examine the moderating and mediating role of meaning at work for the relationship between several social, personal, environmental and motivational job characteristics and the intention to quit as a manifest binary target variable. Coping with the complexity of the model, which consists of more than 70 latent variables, all formatively measured and many of them single-indicator constructs, reveals some pitfalls in the application of the plspm package; but due to the flexibility of R, it is possible to evaluate even such a complex model.
10:30 Better data performance David Cooley TBD Starting with geospatial data in Shiny, and knowing when to stop visualisation, databases, web app, performance, spatial
Theme: coupling R with geospatial databases to reduce the calculations and data held in R and improve Shiny app speed. Like any web page, Shiny apps need to be quick and responsive for a better user experience. Doing complex calculations and storing large data objects will slow the app, so it is often desirable to move as much of this as possible out of the app. The talk will demonstrate: using MongoDB as a geospatial database; querying and returning geospatial data to R from MongoDB; comparison and benchmarking of geospatial operations in R vs on the database server; applying this to a Shiny app with a demonstration, highlighting the pros and cons; introducing the latest updates to the `googleway` package for displaying data and using Google Maps tools through R; and using Google Maps to trigger database queries and operations.
10:30 Better data performance Jeffrey O. Hanson TBD prioritizr: Systematic conservation prioritization in R reproducibility, space/time, performance, conservation
Biodiversity is in crisis. To prevent further declines, protected areas need to be established in places that will achieve conservation objectives for minimal cost. However, existing decision support tools tend to offer limited customizability and can take a long time to deliver solutions. To overcome these limitations and help prioritize conservation efforts in a transparent and reproducible manner, here we present the prioritizr R package. Inspired by tidyverse principles, this R package provides a flexible interface for articulating, building and solving conservation planning problems. In contrast to existing tools, the prioritizr R package uses integer linear programming (ILP) techniques to mathematically formulate and solve conservation problems. As a consequence, the prioritizr R package can find solutions that are guaranteed to be optimal, and find them quickly. By finding solutions to problems that are relevant to the species, ecosystems, and economic factors in areas of interest, conservation scientists, planners, and decision makers stand a far greater chance of enhancing biodiversity. For more information, visit https://github.com/prioritizr/prioritizr.
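A hedged sketch of the workflow on the package's bundled example data (object names such as sim_pu_raster and sim_features follow the package documentation at the time of writing and may differ between releases):

library(prioritizr)

data(sim_pu_raster, sim_features)

p <- problem(sim_pu_raster, sim_features) %>%
  add_min_set_objective() %>%      # minimise cost ...
  add_relative_targets(0.17) %>%   # ... while meeting 17% representation targets
  add_binary_decisions() %>%
  add_default_solver()

s <- solve(p)                      # ILP solution
plot(s, main = "Selected planning units")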
10:30 Better data performance Remy Gavard TBD Using R to pre-process ultra-high-resolution mass spectrometry data of complex mixtures. algorithms, applications
Scientists can determine hundreds of thousands of components in crude oil using Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS). The statistical tools required to analyse the mass spectra struggle to keep pace with advancing instrument capabilities and increasing quantities of data. Today most ultrahigh-resolution analyses of complex mixture samples are based on single, labour-intensive experiments. We present a new algorithm developed in R, named Themis, to jointly pre-process replicate measurements of a complex sample analysed using FTICR-MS. This improves consistency as a preliminary step to assigning chemical compositions, and the algorithm includes a quality-control criterion. Through the use of peak alignment and an adaptive mixture model-based strategy, it is possible to distinguish true peaks from noise. Themis demonstrated more effective removal of noise-related peaks and the preservation and improvement of the chemical composition profile. Themis enabled the isolation of peaks that would have otherwise been discarded using traditional peak picking (based upon signal-to-noise ratio alone) for a single spectrum.
10:30 Better data performance Murray Cameron TBD Exceeding the designer's expectation algorithms, models, applications
Statistical methods and their software implementations are generally designed for a particular class of applications. However, the nature of data, analyses and statisticians is such that uses of the methods are envisaged that extend beyond the original application. Sometimes the reason is the nature of the data, sometimes it is a new type of model and sometimes it is the limitations of the available software. Software for regression and for generalised linear models has regularly been used in 'non-standard' ways. We will discuss some examples, considering some changepoint models in particular, and emphasise some old lessons for software developers.
14:00 Lightning Alan Pearse TBD SSNDesign -- An R Package for Optimal Designs on Spatial Stream Networks models, experimental design; optimal design; spatial stream networks
Optimal experimental designs maximise the information gained from limited samples. Optimal designs are paramount when precise predictions or parameter estimates are required but data collection is resource intensive. R packages exist to find optimal designs for a few settings; e.g. AlgDesign and OPDOE. However, to our knowledge, there are no R packages for optimal design problems for stream and river networks. Stream networks provide a unique design challenge due to their branching structure and flow accumulation as water moves downstream. Given these statistical challenges and the importance of healthy freshwater ecosystems, computational tools for designing effective monitoring programs on streams with minimal cost for maximum impact are sorely needed. Here, we present SSNdesign; an R package for finding optimal designs on stream networks. This package relies on the S4 SpatialStreamNetwork object and models implemented in the package SSN. It has functionality for finding optimal designs for estimating model parameters and making predictions on stream networks. Users can also define utility functions for their own design problems.
14:00 Lightning Tatiana Marci TBD Using Factor Mixture Analysis in Developmental Psychology: An Application to Research on Parent-child Attachment applications, Factor Mixture Analysis, heterogeneous populations
Factor Mixture Analysis (FMA) is a useful tool to explore data from potentially heterogeneous populations using a hybrid of categorical and continuous latent variables. Briefly, this approach allows researchers to explore the underlying factorial structure of a theoretical construct while simultaneously detecting unobserved subgroups in the study population. Thus, FMA becomes particularly useful for investigating psychological phenomena assumed to be categorical and continuous at the same time, and when the source of heterogeneity in the considered population may not be directly observed. Despite these advantages, its application within the psychological sciences remains limited. The current study aims to illustrate the utility of FMA within the context of attachment research in developmental psychology. By presenting a real data example concerning the latent structure of attachment in middle childhood, this work provides a practical example of an FMA application using the FactMixtAnalyses package (Viroli, 2011). Furthermore, we will describe ad hoc R functions to assist in the interpretation of results. Benefits and drawbacks of applying FMA to this research area will be discussed.
14:00 Lightning Gi-Seop Lee TBD Evaluations of the machine learning models in the coastal habitat classification applications, multivariate, performance
The ‘short-neck clam’ (Ruditapes philippinarum) is one of the most important commercial shellfish. Shellfish production has been severely reduced due to the unexpected invasion of the ‘Japanese mud shrimp’ (Upogebia major) in some Korean tidal flats. Thus, knowledge of the habitat suitability for both organisms is highly needed. In this study, diverse habitat classification simulations were carried out using the available habitat data for U. major and R. philippinarum. Supervised learning methods such as decision trees, k-Nearest Neighbour (kNN), Support Vector Machine (SVM) and Artificial Neural Network (ANN) were used with the three optimal clusters defined by the R package ‘NbClust’. Bagging and AdaBoost algorithms were applied to the decision trees. Based on the simulation results, the prediction accuracy of each model on the test data is estimated to be about 55-65%. This is considered to be due to outlier effects and to overfitting caused by the relatively small number of samples. These remain challenging problems for many biological datasets.
14:00 Lightning Koji Makiyama TBD Magic Functions to Obtain Results from 'for' Loops in R interfaces
The function 'for' is one of the most popular functions in R. As you know, it is used to create loops. We think there is an inconvenience with 'for' loops in R: the results you get disappear once the loop finishes. So we have created a package to store the results automatically. To do this, you only need to cast the one-line spell 'magic_for()'. For instance, calculating squared values for 1 to 3 using a 'for' loop and the 'print' function is very easy. However, it becomes a hassle to change such code to store the displayed results: you must prepare containers of the correct length for storing results and change the 'print' calls to assignment statements. Moreover, in more troublesome situations, such as when you have to store many variables, the code grows more complex. The 'magicfor' package resolves this problem while keeping the code readable. You just add the single line 'magic_for()' before a 'for' loop. Once you call 'magic_for()', you can execute 'for' as usual and the results will be stored in memory automatically. You can then obtain the results using 'magic_result()'. We introduce how to use this magic.
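The pattern described above, as a short sketch (the silent argument and the magic_result_as_vector() helper are taken from the package documentation):

library(magicfor)

magic_for(print, silent = TRUE)  # start capturing print() inside for loops

for (i in 1:3) {
  print(i^2)
}

magic_result_as_vector()  # 1 4 9; magic_result() returns the same as a list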
14:00 Lightning Yuya Matsumura TBD Easy Writing of Bayesian Optimization for Machine Learning models, performance, big data
In many machine learning algorithms, tuning hyperparameters is one of the most important steps. Bayesian optimization (Shahriari et al., 2015) is a method for tuning hyperparameters faster and more efficiently than grid search, which evaluates every point on a grid in the parameter space. In R, the combination of the rBayesianOptimization package and machine learning packages such as e1071 or ranger enables Bayesian optimization for hyperparameter tuning. However, it is troublesome to write Bayesian optimization code using those packages, because we must construct a complicated function to maximize and then write further code to execute the optimization; this is confusing and hard to iterate on by trial and error. The MlBayesOpt package (Matsumura, 2017, https://cran.r-project.org/web/packages/MlBayesOpt/index.html) makes this work very convenient. To run Bayesian optimization, this package requires only a data frame, the name of the label column to classify (or regress), and the names of the feature columns. For example, a task that takes 32 lines of code using the combination of packages takes only 5 lines using the MlBayesOpt package.
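For contrast, here is the kind of hand-rolled rBayesianOptimization workflow the abstract describes (the svm_cv wrapper and its bounds are invented for illustration); MlBayesOpt aims to replace this boilerplate with a single call.

library(rBayesianOptimization)
library(e1071)

# Objective: 5-fold cross-validated accuracy of an RBF SVM on iris
svm_cv <- function(logCost, logGamma) {
  fit <- svm(Species ~ ., data = iris, kernel = "radial",
             cost = exp(logCost), gamma = exp(logGamma), cross = 5)
  list(Score = fit$tot.accuracy, Pred = 0)  # Score is maximised
}

set.seed(1)
opt <- BayesianOptimization(
  svm_cv,
  bounds = list(logCost = c(-2, 5), logGamma = c(-8, 2)),
  init_points = 5, n_iter = 10, verbose = TRUE
)
opt$Best_Par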
14:00 Lightning Jeremy Forbes TBD Using Australian census data to describe electorates' socio-economic profiles at the time of a federal election. models, reproducibility, community/education
In Australia, the House of Representatives is divided into 150 seats, each representing an electoral division, and each division's boundaries are revised periodically. Federal elections generally occur every three years, but electorate boundaries can change in between elections. The Australian Bureau of Statistics conducts a Census of Population and Housing every five years, and updates its record of electorate boundaries in July each year, in accordance with the official electoral commission's boundaries. This research looks at matching and estimating the socio-economic profile of each electorate at the time of a federal election. To accurately estimate profiles, each election is initially paired with the Census data taken closest to the election date. Many elections do not occur in the same year as a census, and are matched with data from nearby years. Differences between these dates are adjusted for using spatial analysis and time-series forecasts. This work is an update to the eechidna package, which contains Australian census and election data, and tools for visualisation and analysis. PS: an update can be provided closer to the date, as the research has only recently commenced.
14:00 Lightning Jessica Bagnall TBD Analysing the voting patterns of the Senate of the 45th Australian Parliament via fully-visible Boltzmann machines algorithms, models, applications, networks
The 45th Australian Senate, following the 2016 federal election, contains the largest crossbench since the expansion of the Senate in 1950: the 20 Senators who make up the crossbench include members of 7 minor parties. We analyse the party-level voting patterns of the parties of the Senate of the 45th Australian Parliament by modelling the crossbench via a fully-visible Boltzmann machine (FVBM), a probabilistic graphical network that arises from the neural networks literature, in order to determine the influence that each party has on the others, and to evaluate the relative pro- (or anti-) government stances of the aforementioned parties. We describe the required estimations and computations that are performed via our R package BoltzMM, available at github.com/andrewthomasjones/BoltzMM. The package implements the MM algorithm for maximum pseudolikelihood estimation of FVBM models of Nguyen and Wood (Neural Computation, 2016), and uses the asymptotic normality results of Nguyen and Wood (IEEE T Neural Networks and Learning Systems, 2016) for inferential computations.
14:00 Lightning Florian Schwendinger TBD Readability Prediction in R models, applications, text analysis/NLP
Readability prediction is commonly used to assess the comprehensibility of a given text. Early approaches focus on the development of readability scores (e.g. Fog-Index, Dale-Chall, Flesch Reading Ease). Most of these readability scores are based on the number of words, number of sentences, number of syllables and number of words which are not present in a predefined list. Current research in the field of linguistics suggests that these scores are often misleading and models which combine Natural Language Processing (NLP) and statistical learning should be used instead. This research presents how a state-of-the-art readability prediction can be implemented in R by utilizing the tools available from the StanfordCoreNLP package. The StanfordCoreNLP package and its companions can be installed from https://datacube.wu.ac.at/.
14:00 Lightning Aswi Aswi TBD Comparison of different Bayesian spatio-temporal models using R packages models, applications, space/time, CARBayesST
There is a growing number of packages in R for modelling spatio-temporal data. In this presentation, we will review and compare a number of Bayesian spatio-temporal models using R. We will focus on two R packages, namely R-INLA (Integrated Nested Laplace Approximation) and CARBayesST, and describe the different spatio-temporal models available. We examine six and five Bayesian spatio-temporal models using CARBayesST and R-INLA, respectively. We will illustrate the application of these models and packages through a case study on dengue cases in Makassar, Indonesia. Model performance will be compared using goodness-of-fit measures such as the Deviance Information Criterion (DIC). The computational speed and ease of use of these packages make them a very attractive option for Bayesian spatio-temporal modelling.
14:00 Lightning Janek Thomas TBD Automatic gradient boosting algorithms, models, data mining, Automatic Machine Learning
Well-qualified data scientists are not a dime a dozen. Instead, employees who are not very familiar with data analysis are often called on to do the job. Automatic machine learning can help such people perform predictive modeling with high-performing machine learning tools without having much experience. This is achieved by making those applications parameter-free, i.e. only the data is required as input. Projects like Auto-WEKA or auto-sklearn aim to solve the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem, resulting in a huge optimization space. However, for most real-world applications, only a few different learning algorithms are required to deliver superior performance. autoxgboost simplifies this idea one step further and reduces the CASH problem to taking gradient boosting as the single learning algorithm, in combination with intelligent model-based hyperparameter tuning. It is based on the R packages mlr, mlrMBO and xgboost. It also supports categorical variables thanks to special built-in factor feature encoding. Even though autoxgboost only uses one learner instead of a whole library, it provides comparable or even better performance.
14:00 Lightning Awdhesh Yadav TBD Household and community factors on under-five mortality in India: An application of a multilevel Cox proportional hazards model multivariate, big data
The objective of this paper is to determine the importance of community-, household- and individual-level effects on under-five mortality in India. Using data from the latest round of the Demographic and Health Survey (DHS), 2005-06, a multilevel Cox proportional hazards analysis was performed on a nationally representative sample. The results indicate that patterns of under-five mortality were clustered within mothers and communities. Community-level variables such as region, place of residence, community poverty level, community education level and the ethnic fractionalization index significantly determined under-five mortality in India. The risk of under-five death was significantly higher for children residing in the North, East and West regions compared to the South region. In addition, a higher proportion of women in the community completing secondary school significantly increased child survival. Household-level variables such as religion, caste and wealth index also significantly determined under-five mortality. The results suggest addressing these contextual-level factors in order to reduce under-five mortality in India.
14:00 Lightning Jean-Michel Perraud TBD A suite of R packages for hydrological ensemble forecasting using Rcpp space/time, performance, big data, Hydrology, forecast
Ensemble prediction techniques have been shown to produce more accurate predictions as well as formally quantify prediction uncertainty in a range of scientific applications. We present a suite of libraries for hydrological ensemble forecasts designed for use both in research and operations. The features of the C++ libraries are available from several high-level interactive languages including R. The suite currently comprises three main R packages for rainfall forecast post-processing (RPP), semi-distributed ensemble hydrological modelling (SWIFT2) and multi-dimensional ensemble time series (uchronia). The packages are designed to offer concise commands for handling ensemble time series, input/output, model parameterisation and simulation execution. The native libraries purposely have a C API to maximise interoperability and foster a consistent user experience across high-level languages. Rcpp is used for surfacing the features in the R packages. Bespoke code for marshalling data, managing object lifetimes and generating glue code for Rcpp is already open source and suitable for reuse in similar technical contexts.
14:00 Lightning Susanna Cramb TBD Bayesian disease mapping in R models, space/time
Bayesian methods feature prominently in disease mapping, and R has multiple packages designed to enable efficient computation of assorted Bayesian spatial models. Here we examine seven Bayesian models suitable for disease mapping and implement them using the R packages R-INLA, CARBayes, R2WinBUGS and R2jags. Models considered include common approaches such as the BYM model, which smooths estimates over all adjacent areas, through to more recently introduced models that allow for discontinuities between adjacent areas, as well as spline models. Simulated incidence data designed to represent a rare cancer (liver) and a more common cancer (lung) were examined across 2153 areas in Australia. Model performance was compared on goodness-of-fit measures (WAIC, Moran’s I on residuals), computational time and convergence (Geweke). The packages themselves are also compared in terms of computational time and model flexibility. It is useful to consider several different models to understand the robustness of results when disease mapping. R has the capacity to enable a wide range of models to be considered, with the additional advantage of high-quality visualisation of results.
14:00 Lightning Rosemary Putler TBD Analysis of EHR Data and Circulating Inflammatory Mediators: Association with Severe Clostridium difficile Infection models, applications, bioinformatics
Clostridium difficile infection (CDI) is a major healthcare-associated infection, and severe CDI often leads to subsequent recurrence or death. We hypothesized that circulating inflammatory mediators would associate with severity in a prospective cohort of inpatients diagnosed with CDI. An inflammatory mediator panel was performed on collected sera and merged with electronic health record (EHR) data. With these data we show that circulating biomarkers associate not only with severity of the CDI episode, but also with subsequent mortality. Because of the large number of potentially correlated predictors, our data presented a challenge when we set out to identify features to incorporate into an accurate predictive model. We explore the steps performed in this analysis, discussing methodology and decision points, including the management and analysis of EHR data, utilization of dimensional reduction techniques, and use of existing packages such as vegan, glmnet, and pROC. Through this analysis, we demonstrate how a diverse array of R packages and statistical methodologies, which function in a wide array of use cases, can also be used to answer a complex disease-related question.
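A stripped-down sketch of the modelling steps mentioned above, with simulated data standing in for the EHR and inflammatory-mediator panel: penalised logistic regression over many correlated predictors, followed by an ROC curve.

library(glmnet)
library(pROC)

set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - 0.5 * X[, 2]))

cv   <- cv.glmnet(X, y, family = "binomial")   # cross-validated lasso
phat <- predict(cv, newx = X, s = "lambda.min", type = "response")

roc_obj <- roc(y, as.numeric(phat))
auc(roc_obj)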
14:00 Lightning Stuart Davie TBD A data driven approach to generating and scoring B2B leads visualisation, models, applications
In many industries, companies rely on a sales team to source and qualify leads. Unfortunately, this limits a company's potential leads to those that can be manually processed, while lead qualification is limited by the quality of ad hoc scoring systems. To find leads faster, companies might engage in cold-calling, or blanket email campaigns, both of which are known for their low conversion rates. Here, a data-driven B2B lead generation and qualification solution for the UK market is presented, based on open source data and XGBoost. Our models take into account both general and company specific features, and allow an approximation of the size of market opportunity. Lead reports containing pertinent conversion information are automatically generated using xgboostExplainer and R Markdown. Considerations on feature engineering, and difficulties associated with overfitting, are also discussed. **As there are several components to this presentation, a lightning talk would be preferred over a poster**
14:00 R in the community Simon Jackson TBD R from academia to commercial business applications, community/education, big data, industry, skill development
A 2017 report by StackOverflow showed that the use of R is greatest and growing fastest in academia. Commercial industries like tech, media, and finance, however, show the smallest usage and lowest adoption rates of the language. Yet lessons regarding the use of R and data science in academia and commercial settings complement each other. This presentation will share my experience as an R user moving from academia into commercial business: the transition from cognitive scientist at an Australian university to data scientist at one of the world’s largest travel e-commerce sites, Booking.com. I’ll discuss how the cutting-edge R skills used in academia can improve commercial product development. I will also identify the knowledge gaps I had moving into commercial business. This will be relevant to academics looking to move into industry, and business employers looking to hire data scientists from academia.
14:00 R in the community Joseph Rickert TBD Connecting R to the "Good Stuff" algorithms, models, applications, big data, interfaces
In his book Extending R, John Chambers writes: “One of the attractions of R has always been the ability to compute an interesting result quickly. A key motivation for the original S remains as important now: to give easy access to the best computations for understanding data.” R developers have taken the challenge implied in John’s statement to heart, and have integrated R with some really “good stuff” while providing easy access that conforms to natural R workflows. Rcpp and Shiny, for example, are both spectacularly successful projects in which R developers expanded the reach of R by connecting to external resources. In this talk, I will survey the ongoing work to connect R to “good stuff” such as the CVX optimization software, the Stan Bayesian engine, Spark, Keras and TensorFlow, and provide some code examples, including using the sparklyr package to run machine learning models on Spark and the keras package to run deep learning and other models on TensorFlow.
14:00 R in the community Lisa Chen TBD Using R to help industry clients – The benefits and Opportunities visualisation, algorithms, models, data mining, applications, web app, reproducibility, multivariate, networks, performance, text analysis/NLP, big data
Dr Lisa Chen is Chief Analytics Officer for Harmonic Analytics. She is a highly qualified and experienced data scientist, with a PhD in Statistics and a Bachelor of Science in Computer Science and Statistics. Lisa has extensive experience using R, including designing solution-based models for complex optimisation problems and analysing large-scale datasets. Harmonic has helped customers globally to address business challenges across sectors including agriculture, aviation, banking, energy, government, health, telecommunications and utilities. We use R in our daily project work and also help clients with data science team development and R training. We will outline how we have used R and Shiny, and the benefits realised. We will discuss our journey, data-driven approach, workflow and industry observations. We will discuss what we have learned with R, e.g. observations regarding big data with R, version control, and some of the pain points and work-arounds. We will share our observations on how clients are starting to adopt open source and R for their analytical work, plus the trends and opportunities. Lisa will demonstrate examples of our interactive client dashboards.
14:00 Building community Jonathan Carroll TBD Volunteer Vignettes; A Case-Study in Enhancing Documentation applications, reproducibility, community/education, documentation
Vignettes: long-form documentation for a package, often a use-case, discussion, or scientific article. These are incredibly useful to both users and developers. In 2017, Julia Silge scraped CRAN and found that most packages don't have one [1]. At the start of 2018, I decided to give back to the community by 'being the change I wanted to see in the world' and writing a Volunteer Vignette a month for the entire year. Yet all the new and interesting packages I could think to write something for already had vignettes. The solution came to me in February: have the community nominate packages. I made the call via Twitter [2] and received an encouraging response. I set about writing the first Volunteer Vignette and immediately discovered bugs and other issues, all of which have led to positive discussions with the author and updates to the package. In this talk I will present my first six months of the Volunteer Vignettes Project. I will demonstrate why vignettes are an invaluable step in making a robust R package. [1] https://juliasilge.com/blog/mining-cran-description/ [2] https://twitter.com/carroll_jono/status/961139524901527552
14:00 Building community Robin Hankin TBD Special and general relativity in R visualisation, community/education, space/time
Although mostly used for statistics, R is a general purpose tool, and here I discuss how the R programming language can be used in the context of physics education. I introduce two R packages that have been used in the teaching of Einstein's theories of special and general relativity. The 'gyrogroup' package implements the Lorentz boosts for relativistic velocity addition. It provides dramatic visualization of the little-known fact that relativistic Lorentzian velocity addition is neither commutative nor associative. The 'schwarzschild' package presents visualization of black hole physics and gravitational waves. In this presentation I discuss these two packages and also the more general issue of R used as a teaching tool in the context of physics more generally.
14:00 Building community Sam Clifford TBD Classes without dependencies community/education
Although important, learning statistics isn’t generally why students choose to study science. To engage a cohort of first year Bachelor of Science students with diverse backgrounds and interests, we decided to design their core first year quantitative methods unit (with no math or programming prerequisites) around R. The course is designed to be practical; using RStudio and tidyverse packages rather than statistical tables, students can quickly engage in visualisation, data wrangling, writing functions, and modelling as part of a coherent workflow for scientific inquiry. In this talk, we discuss the learning and teaching principles and activities, outlining the use of blended and problem-based learning to teach both the quantitative topic and the use of R, developing students' data analysis skills and confidence. We discuss how workshop activities, quizzes, problem solving tasks, and the final project (a collaborative scientific article) not only assess students' skills but prepare them for work as professional scientists. We will discuss students’ feedback on their experience in their journey from novice student to young scientist.
14:00 Scalable R Le Zhang TBD Build scalable Shiny applications for employee attrition prediction on Azure cloud visualisation, models, data mining, applications, web app, reproducibility, performance
Voluntary employee attrition may negatively affect a company in various ways. Identifying employees who are inclined to leave is therefore pivotal to avoiding potential losses. Data-driven techniques, assisted by a machine learning model, exhibit high accuracy in predicting employee attrition and offer company executives insightful information for decision making. The talk will cover a step-by-step tutorial on how to build a model for employee attrition prediction and deploy such an analytical solution as a Shiny-based web service on the Azure cloud. R is used as the primary programming language for the development. Novel R packages such as AzureSMR and AzureDSVM, which allow data scientists and developers to programmatically operate cloud resources and seamlessly operationalize the analytics within an R session, will also be introduced in the talk. The Shiny application for the analytics, including interactive data visualization and model creation, is designed and deployed on Docker containers orchestrated by Kubernetes. Parameters of the deployment environment are carefully tuned to favor scalability of the application.
14:00 Scalable R Bryan Galvin TBD Moving from Prototype to Production in R: A Look Inside the Machine Learning Infrastructure at Netflix data mining, reproducibility, performance, big data, interfaces
Machine learning helps inform decision making on just about every aspect of the business at Netflix, so it is important to empower our data scientists with tooling that makes them more effective. To accomplish this, we developed Metaflow, a platform written in Python for data scientists to develop, run, and deploy projects without getting in their way. Some key design features include: the ability to work with the R packages we all know and love with no restrictions; scaling up seamlessly from local development to the almost infinite resources in the cloud; automatic checkpointing of data and code with immutable snapshots created at each step of the modeling pipeline; and easy deployment with a built-in hosting service and scheduling. In this talk, I will present an overview of some of the best practices that are baked into Metaflow, focusing especially on those that can be applied effectively at organizations that are not at Netflix scale. Additionally, I will cover some of the lessons learned from using reticulate to interface R with a large Python project.
14:00 Scalable R Jason Gasper TBD Integrating R into a production data environment: A case example of using Oracle database services and R for fisheries management in Alaska. applications, databases, reproducibility
Catch and economic information from fisheries off Alaska is critical for the management and conservation of marine resources. The National Marine Fisheries Service, Alaska Regional Office, uses an Oracle database to monitor and store federal fishery catch data off Alaska. Annually, the system processes over 2 million fishery catch transactions, and it currently houses over 25 years of historical fishery data. Information in the database includes details on harvested fish, estimates of bycatch, at-sea observations of discards, electronic monitoring of catch (video-derived estimates), geospatial information, and complex business rules to monitor catch allocations to ensure overfishing does not occur. Our paper provides a high-level overview of the system architecture, with a focus on our use of R for both development (e.g., simulation and testing) and production (e.g., statistical features) within our Oracle database.
14:00 Spatial modeling Matt Moores TBD bayesImageS: an R package for Bayesian image analysis algorithms, applications, space/time
There are many approaches to Bayesian computation with intractable likelihoods, including the exchange algorithm, approximate Bayesian computation (ABC), thermodynamic integration, and composite likelihood. These approaches vary in accuracy as well as scalability for datasets of significant size. The Potts model is an example where such methods are required, due to its intractable normalising constant. This model is a type of Markov random field, which is commonly used for image segmentation. The dimension of its parameter space increases linearly with the number of pixels in the image, making this a challenging application for scalable Bayesian computation. My talk will introduce various algorithms in the context of the Potts model and describe their implementation in C++, using OpenMP for parallelism. I will also discuss the process of releasing this software as an open source R package on the CRAN repository.
14:00 Spatial modeling Jin Li TBD A new R package for spatial predictive modelling: spm models, data mining, reproducibility, space/time, performance, spatial predictive models; hybrid methods of geostatistics and machine learning; model selection and validation; predictive accuracy
Accuracy of spatial predictions is crucial for evidence-informed environmental management and conservation. Improving the accuracy by identifying the most accurate predictive model is essential, but also challenging as the accuracy is affected by multiple factors. Recently developed hybrid methods of machine learning methods and geostatistics have shown their advantages in spatial predictive modelling in environmental sciences, with significantly improved predictive accuracy. An R package, ‘spm: Spatial Predictive Modelling’, has been developed to introduce these methods and recently released for R users. This presentation will briefly introduce spm, including: 1) spatial predictive methods, 2) new hybrid methods of geostatistical and machine learning methods, 3) assessment of predictive accuracy, 4) applications of spatial predictive models, and 5) relevant functions in spm. It will then demonstrate how to apply some functions in spm to relevant datasets and to show the resultant improvements in predictive accuracy and modelling efficiency. Although in this presentation, spm is applied to data in environmental sciences, it can also be applied to data in other relevant disciplines.
14:00 Spatial modeling Nicholas Tierney TBD Maxcovr: Find the best locations for facilities using the maximal covering location problem visualisation, algorithms, models, applications, space/time, interfaces
Want better wifi at the office? Improved access to healthcare? The maximal covering location problem (MCLP) can help! The MCLP finds optimal locations of facilities to improve their coverage of a set of targets. This means better-placed wifi routers and healthcare facilities. Although the MCLP was described in the 1970s, it can be daunting to actually implement, as you need to know how to: 1) formulate an optimisation problem; 2) make it talk to a solver engine; 3) get the data into the appropriate format for the solver to recognise; and 4) translate the model output into a usable format. This is challenging, particularly if you are not familiar with optimisation or techniques such as linear programming. It is, however, a great use case for an R package to abstract away detail you don’t need to worry about. The R package maxcovr provides a set of tools to perform, summarise, and visualise the MCLP, so that you can move on with your analysis, place better cellphone towers, and create better access to health facilities. In this talk, I describe why the MCLP is useful, where it can be applied, and demonstrate the use of maxcovr, before finally discussing future directions.
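A hedged sketch of the maxcovr interface, with dataset and argument names taken from the package README (treat the details as illustrative): choose 20 new facility locations among York buildings so that coverage of incident locations within 100 m is maximised.

library(maxcovr)
library(dplyr)

# Bundled example data: York buildings (facility sites) and incident locations (users)
york_selected   <- york %>% filter(grade == "A")   # existing facilities
york_unselected <- york %>% filter(grade != "A")   # candidate facilities

mc <- max_coverage(existing_facility = york_selected,
                   proposed_facility = york_unselected,
                   user              = york_crime,
                   distance_cutoff   = 100,
                   n_added           = 20)
mc  # contains the selected sites and coverage summaries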
14:00 Visualisation Paul Murrell TBD The Minard Paradox visualisation
Charles Joseph Minard's depiction of Napoleon's 1812 Russian campaign might be described as the best statistical graphic ever drawn ... by hand. Minard did not have the benefit of modern computer technology to help with his drawing; he did not have the option of importing a Google map tile; and he probably did not even consider the possibility of interactive tooltips. However, there are aspects of what Minard produced by hand that are very challenging for modern graphical software, particularly the thick bands that represent the size of Napoleon's army over time. This talk will describe the 'vwline' package for R and explore some of the interesting challenges that arise when attempting to render variable-width lines with software.
14:00 Visualisation Natalia da Silva TBD Interactive Graphics for Visually Diagnosing Forest Classifiers in R visualisation, data mining, web app
This paper describes structuring data and constructing plots to explore forest classification models interactively. A forest classifier is an example of an ensemble, since it is produced by bagging multiple trees. The process of bagging and combining results from multiple trees produces numerous diagnostics which, with interactive graphics, can provide a lot of insight into class structure in high dimensions. Various aspects are explored in this paper, to assess model complexity, individual model contributions, variable importance and dimension reduction, and uncertainty in prediction associated with individual observations. The ideas are applied to the random forest algorithm (Breiman, 2001) and the projection pursuit forest (da Silva et al., 2017), but could be applied more broadly to other bagged ensembles. Interactive graphics are built in R (R Core Team, 2016) using the ggplot2 (Wickham, 2016), plotly (Sievert et al., 2017), and shiny (Chang et al., 2015) packages.
14:00 Visualisation Chun Fung (Jackson) Kwok TBD Rjs: Going hand in hand with Javascript visualisation, interfaces, JavaScript
Many of the popular data visualisation packages in R, e.g. Plotly, Leaflet and DiagrammeR, are powered by JavaScript. I will demonstrate how far a little JavaScript can go towards creating animated and interactive visualisations from within R. This is done with the package, Rjs, which provides a simple interface between R and JavaScript. It allows you to seamlessly combine R modelling packages with JavaScript interactive visualisation libraries. This talk is for researchers, data analysts, and intermediate R users looking to extend their skills in interactive data visualisation.
15:30 Genomic analysis: signatures to single cells Momeneh (Sepideh) Foroutan TBD Singscore: a single-sample gene-set scoring method for analysing molecular signatures visualisation, applications, bioinformatics
Several single-sample gene-set enrichment analysis methods have been introduced to score samples against gene expression signatures, such as ssGSEA, GSVA, PLAGE and combining z-scores. Although these methods have been proposed to generate single-sample scores, they use information from all samples in a dataset to calculate scores for individual samples. This leads to unstable scores which are influenced by the sample size and composition of datasets. We have proposed singscore, a rank-based and truly single-sample scoring method implemented as the R/Bioconductor package singscore. We compare singscore to other methods and show that our approach performs as well as other methods for large datasets in terms of stability, while outperforming them on small datasets. Singscore is fast and generates easily interpretable scores. We show the application of this method in cancer biology, where the dependence between distinct molecular signatures can be investigated across samples. Singscore has potential applications in personalised medicine, as it calculates replicable scores for individual samples regardless of the sample size or composition of the data.
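A minimal sketch of the scoring workflow, assuming the rankGenes()/simpleScore() interface from the package documentation (the expression matrix and signature below are random toy data):

library(singscore)

set.seed(1)
expr <- matrix(rnorm(1000), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("sample", 1:10)))
up_set <- paste0("gene", 1:10)     # toy up-regulated signature

ranked <- rankGenes(expr)          # rank genes within each sample
scores <- simpleScore(ranked, upSet = up_set)
head(scores)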
15:30 Genomic analysis: signatures to single cells Liam Crowhurst TBD scIVA: Single Cell Interactive Visualisation and Analysis visualisation, data mining, web app, reproducibility, bioinformatics, big data
Technological advances enable measurements of gene expression at single-cell resolution, creating datasets for investigating biological processes in life science research. Gene expression data is commonly represented as a matrix of tens of thousands of genes and up to millions of cells, which has created a demand amongst biologists for quick visualisation and analysis. We developed scIVA, a Shiny web app that is designed to be used as an interactive visualisation tool for gene expression datasets, intended for those with little R experience and for users who want to gain preliminary insights into datasets for further exploration and analysis. The web app will also be available for download as a standalone R package. The web app performs various visualisations, all of which are interactive and downloadable through the use of Plotly, integrated with d3 JavaScript, as graphing tools. Moreover, scIVA allows users to search for specific genes, subset by clusters and subpopulations, generate heatmaps and perform statistical analyses. The presentation will include a demonstration of the web app’s key features.
15:30 Genomic analysis: signatures to single cells Sarah Williams TBD Celaref: Annotating single-cell RNAseq clusters by similarity to reference datasets applications, bioinformatics
Single-cell RNA sequencing (scRNAseq) is a way of measuring gene expression of many individual cells simultaneously, and is often used on samples which contain a mix of different cell types. In an scRNAseq analysis individual cells are typically clustered to group them by cell type. After clustering, identifying what type of cell is in each cluster (e.g. neurons) usually needs domain-specific knowledge of marker genes and function. The celaref package accepts pre-computed cell-clusters and aims to suggest cell-types for each cluster via similarity to reference datasets (scRNAseq experiments or microarrays) from similar samples. Briefly, within-dataset differential expression is calculated to identify the most enriched genes for each cluster, then their rankings are examined in reference datasets. Kolmogorov–Smirnov tests are used to decide if multiple matches should be reported. Initial experiments on brain, lacrimal gland and blood PBMC samples show sensible matching between similar cell types without overreaching on dissimilar cells. Celaref will be submitted to Bioconductor and is available at https://github.com/MonashBioinformaticsPlatform/celaref
15:30 Genomic analysis: signatures to single cells Luke Zappia TBD clustree: a package for producing clustering trees using ggraph visualisation, algorithms, data mining, bioinformatics
Clustering analysis is commonly used in many fields to group together similar samples. Many clustering algorithms exist, but all of them require some sort of user input to set parameters that affect the number of clusters produced. Deciding on the correct number of clusters for a given dataset is a difficult problem that can be tackled by looking at the relationships between samples at different resolutions. Here I will present clustree, an R package for producing clustering tree visualisations. These visualisations combine information from multiple clusterings with different resolutions, showing where new clusters come from and how samples change clusters as the number of clusters increases. Summarised information describing the samples in each cluster can be overlaid on the tree to give additional insight. I will also describe my experience developing clustree, particularly how I have made use of the ggraph package. The clustree package is available at https://github.com/lazappi/clustree and a preprint describing clustering trees can be read at https://www.biorxiv.org/content/early/2018/03/02/274035.
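For orientation, a hedged sketch of how a clustering tree might be drawn with clustree, assuming a data frame whose columns K1, K2, ... hold the cluster assignments at increasing resolutions (the column prefix and the random data below are illustrative, not from the talk):

```r
library(clustree)

# one row per sample; columns K1..K5 store the cluster assigned at five
# increasing resolutions (hypothetical data for illustration only)
set.seed(1)
df <- data.frame(
  K1 = sample(1, 100, replace = TRUE),
  K2 = sample(1:2, 100, replace = TRUE),
  K3 = sample(1:3, 100, replace = TRUE),
  K4 = sample(1:4, 100, replace = TRUE),
  K5 = sample(1:5, 100, replace = TRUE)
)

# `prefix` tells clustree which columns contain the clusterings
clustree(df, prefix = "K")
```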
15:30 Data mining Ilia Karmanov TBD Teach yourself deep-learning with R visualisation, algorithms, models, Deep Neural Nets, CNNs, MLPs, Machine Learning
R's concise matrix algebra and calculus functionality makes it easy to create machine learning models from scratch. Creating models from scratch is a great way to learn how they actually work. We show how R can be used to create a linear regression, an MLP and a CNN from scratch (see blog: http://blog.revolutionanalytics.com/2017/07/nnets-from-scratch.html) and thus how one may go about teaching oneself about DNNs. We believe this "hands-on" approach to learning is more effective because it exposes the user to all the "leaky abstractions" that modern frameworks hide and helps them understand what makes the models fragile. R's simple interface lets us easily "play" with the created models to understand further (potentially abstract) topics, e.g.: (i) visualise the classification boundary and thus investigate what effect the number of neurons (and layers) has; (ii) visualise different CNN filter maps; (iii) solve a neural net deterministically through linear programming (without SGD) by working through "Proof of Theorem 1" in "Understanding deep learning requires re-thinking generalization" by Zhang et al. (2017), as a mirror to solving linear regression with SGD.
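As a flavour of this from-scratch approach (a minimal sketch, not the speaker's code), here is a linear regression fitted by gradient descent using nothing but base R matrix algebra:

```r
set.seed(42)
n <- 500
X <- cbind(1, rnorm(n))            # design matrix with an intercept column
beta_true <- c(2, -3)
y <- X %*% beta_true + rnorm(n)

beta <- c(0, 0)                    # start from zero
eta  <- 0.05                       # learning rate
for (step in 1:500) {
  grad <- 2 * t(X) %*% (X %*% beta - y) / n   # gradient of the mean squared error
  beta <- beta - eta * grad
}
beta                               # converges towards c(2, -3)
```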
15:30 Data mining Angus Taylor TBD Deep learning at scale with Azure Batch AI algorithms, models, Deep learning
In recent years, R users have been increasingly exploring the use of deep learning methods to solve difficult problems from computer vision to natural language processing. However, developing deep learning models is a time-consuming and compute-intensive task. To obtain good performance on many datasets, it is necessary to test many combinations of network structures and hyperparameters. In this talk, we will discuss how Microsoft Azure Batch AI can be used to perform this tuning task at scale on clusters of GPU-enabled virtual machines in the cloud. Developers create a single R script to define tests of multiple different network configurations, using the popular deep learning frameworks mxnet or Keras. We explain how to build a simple Docker image that can be deployed across multiple machines and defines the necessary installation dependencies. Batch AI will scale VM clusters as necessary to parallelize the tasks and obtain the optimal network configuration efficiently, saving hours or even days of the developer’s time. We will demonstrate the value of Batch AI with a live demo of training a deep learning model, implemented in R, on the classic MNIST computer vision dataset.
15:30 Data mining Timothy Wong TBD Modelling Field Operation Capacity using Generalised Additive Model and Random Forest algorithms, models, multivariate, big data
In any customer-facing business, accurately predicting demand ahead of time is of paramount importance. Workforce capacity can then be flexibly scheduled in each local area accordingly, ensuring there is sufficient workforce to meet volatile demand. In this case study, we focus on the gas boiler repair field operation in the UK. We have developed a prototype capacity forecasting procedure which uses a mixture of machine learning techniques to achieve its goal. Firstly, it uses a Generalised Additive Model to estimate the number of incoming work requests, taking into account the non-linear effects of multiple predictor variables. The next stage uses a large random forest to estimate the expected number of appointments for each work request by feeding in various ordinal and categorical inputs. At this stage, the size of the training set is considerably large and does not fully fit in memory; in light of this, the random forest model was trained in chunks and in parallel to enhance computational performance. Once all previous steps have been completed, probabilistic inputs such as the ECMWF ensemble weather forecast are incorporated to give a view of all predicted scenarios.
15:30 Data mining Bernd Bischl TBD iml: A new Package for Model-Agnostic Interpretable Machine Learning algorithms, models, machine learning
iml implements model-agnostic interpretability methods to explain the functional behavior and individual predictions of machine learning models. A major advantage of model-agnostic interpretability methods over model-specific ones is their flexibility, as often not one but many types of machine learning models are evaluated for solving a task. Anything that is built on top of such an interpretation, e.g., a visualization or graphical user interface, also becomes independent of the underlying model. Currently implemented are: feature importance, partial dependence plots, individual conditional expectation (ICE) plots, tree surrogates, LocalModel (Local Interpretable Model-agnostic Explanations), and Shapley values for explaining single predictions. The talk will cover the basic concepts behind model-agnostic interpretations, and demonstrate the functionality of the package through applied examples in R. Link to CRAN release: https://cran.r-project.org/web/packages/iml/index.html Link to GitHub page: https://github.com/christophM/iml
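A hedged sketch of the kind of workflow the package supports (the model, data and loss choices here are illustrative, and class or argument names may differ between iml versions):

```r
library(iml)
library(randomForest)

data("Boston", package = "MASS")
rf <- randomForest(medv ~ ., data = Boston)

# wrap the fitted model, its features and the target in a model-agnostic Predictor
predictor <- Predictor$new(rf, data = Boston[, -14], y = Boston$medv)

# permutation-based feature importance
imp <- FeatureImp$new(predictor, loss = "mae")
plot(imp)

# partial dependence / ICE curves for a single feature
pdp <- Partial$new(predictor, feature = "lstat")
plot(pdp)
```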
15:30 Simulation and modeling focus on survival analysis Bachmann Patrick TBD Estimating individual Customer Lifetime Values with R: The CLVTools Package models
Valuing customers is key to any firm. Customer lifetime value (CLV) is the central metric for valuing customers: it describes the long-term economic value of customers and gives managers an idea of how customers will evolve over time. To model CLVs in continuous non-contractual business settings such as retailing, probabilistic customer attrition models are the preferred choice in the literature and in practice. Our R package CLVTools provides an efficient and easy-to-use implementation framework for probabilistic customer attrition models. Building on the lessons of other implementations, we adopt S4 classes to allow constructing rich and rather complex models that are nevertheless easy to apply for the end user. In addition, the package includes recent model extensions, such as the option to consider contextual factors, that are not available in other packages. This talk will focus on both the theory of the underlying statistical framework and its practical application using real-world data.
15:30 Simulation and modeling focus on survival analysis Sam Brilleman TBD simsurv: A Package for Simulating Simple or Complex Survival Data models, simulation; survival analysis
The simsurv package allows users to simulate simple or complex survival data. Survival data refers to a variable corresponding to the time from a defined baseline until occurrence of an event of interest. Depending on the field, the analysis of survival data can be known as survival, duration, reliability, or event history analysis. It has been common to make simplifying parametric assumptions when simulating survival data, e.g. assuming survival times follow an exponential or Weibull distribution. However, such assumptions are unrealistic in many settings. The simsurv package provides additional flexibility by allowing users to simulate survival times from 2-component mixture distributions or a user-defined hazard function. The mixture distributions allow for a variety of flexible baseline hazard functions. Moreover, a user-defined hazard function can provide even greater flexibility since the cumulative hazard does not require a closed-form solution. This means it is possible to simulate survival times under complex statistical models such as those for joint longitudinal-survival data. The package is modelled on the survsim package in Stata (Crowther and Lambert, 2012, Stata J).
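A brief, hedged example of the style of call the package supports, simulating Weibull survival times with a treatment effect (the parameter values are illustrative only):

```r
library(simsurv)

set.seed(1)
covs <- data.frame(id = 1:500, trt = rbinom(500, 1, 0.5))

# Weibull baseline hazard (scale `lambdas`, shape `gammas`), a log hazard ratio
# of -0.5 for treatment, and administrative censoring at 5 years
dat <- simsurv(dist = "weibull", lambdas = 0.1, gammas = 1.5,
               x = covs, betas = c(trt = -0.5), maxt = 5)
head(dat)   # columns: id, eventtime, status
```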
15:30 Simulation and modeling focus on survival analysis Raju Rimal TBD R-package for simulating linear model data (simrel) models, applications, web app, multivariate, interfaces, Simulation
Data science is generating enormous amounts of data, and new and advanced analytical methods are constantly being developed to cope with the challenge of extracting information from such "big data". Researchers often use simulated data to assess and document the properties of these new methods. Here we present an R package, `simrel`, which is a versatile and transparent tool for simulating linear model data with an extensive range of adjustable properties. The method is based on the concept of relevant components and a reduction of the regression model. The concept was first implemented in an earlier version of `simrel`, but only for the single-response case. In this version we introduce random rotations of latent components spanning a response space in order to obtain a multivariate response matrix Y. The properties of the linear relation between predictors and responses are defined by a small set of input parameters which allow versatile and adjustable simulations. In addition to the R package, a user-friendly Shiny application with elaborate documentation and an RStudio gadget provide an easy interface to the package.
15:30 Simulation and modeling focus on survival analysis Andrés Villegas TBD StMoMo: An R Package for Stochastic Mortality Modelling models, applications
In this talk we use the framework of generalised (non-)linear models to define the family of generalised Age-Period-Cohort stochastic mortality models which encompasses the vast majority of stochastic mortality projection models proposed to date, including the well-known Lee-Carter and Cairns-Blake-Dowd models. We also introduce the R package StMoMo which exploits the unifying framework of the generalised Age-Period-Cohort family to provide tools for fitting stochastic mortality models, assessing their goodness of fit and performing mortality projections. We illustrate some of the capabilities of the package by performing a comparison of several stochastic mortality models applied to the Australian mortality experience. The R package StMoMo is available at http://CRAN.R-project.org/package=StMoMo.
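For orientation, a hedged sketch of fitting and forecasting a Lee-Carter model with StMoMo (the mortality data object below is a hypothetical placeholder; see the package documentation for how deaths and exposures are prepared):

```r
library(StMoMo)

# `mortality_data` is assumed to be a StMoMoData object built from death counts
# and exposures (e.g. via StMoMoData() on a demography dataset) -- hypothetical
LC    <- lc(link = "log")                 # Lee-Carter model under a log link
LCfit <- fit(LC, data = mortality_data)   # fit to the ages/years in the data
LCfor <- forecast(LCfit, h = 30)          # project mortality 30 years ahead
plot(LCfor)
```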
15:30 Improving performance Stepan Sindelar TBD FastR: an alternative R language implementation applications, performance, R implementations
R is a highly dynamic language that employs a unique combination of data type immutability, lazy evaluation, argument matching, a large amount of built-in functionality, and interaction with C and Fortran code. It is therefore a challenging task to develop an alternative R runtime that is both compatible with GNU R and can provide performance of R code comparable to static programming languages like C. FastR is an open source alternative R implementation that is trying to achieve this. The talk will introduce FastR and demonstrate the performance improvements it can offer, its compatibility with GNU R shown by running unmodified popular complex CRAN packages like ggplot2 or Shiny, and FastR's unique features, for example in-process multi-threaded execution, and tools such as a CPU sampler and viewing R memory dumps with VisualVM.
15:30 Improving performance Stepan Sindelar TBD Combining R and Python with GraalVM applications, performance, programming languages interoperability, debugging
GraalVM is a multi-language runtime that allows running and combining multiple programming languages in one process, operating on the same data without the need to copy it when crossing language boundaries. Moreover, the dynamic just-in-time compiler included in GraalVM is capable of applying optimizations across language boundaries. The languages implemented on top of GraalVM include FastR, an alternative R implementation, as well as C, Ruby, JavaScript, and the recently added GraalPython. The talk will present interesting ways in which R and Python can be combined into a polyglot application running on GraalVM, for example using an R package from Python or vice versa, and briefly explain how this interoperability works at the technical level. One of the most important parts of a language ecosystem is tooling, especially an interactive debugger. The talk will also present how one can debug multiple GraalVM languages at the same time in the Google Chrome Dev Tools, for instance stepping from R into C code.
15:30 Improving performance David Smith TBD Speeding up computations in R with parallel programming in the cloud models, performance, parallel programming
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and grid-based computations are just a few examples. In this talk, I'll provide a review of tools for implementing embarrassingly parallel computations in R, including the built-in "parallel" package and extensions such as the "foreach" package. I'll also demonstrate how you can dramatically reduce the time for a complex computation -- optimizing hyperparameters for a predictive model with the "caret" package -- by using a cluster of parallel R sessions in the cloud. With the "doAzureParallel" package, I'll show how you can create a cluster of virtual machines running R in Azure, parallelize the problem by registering the backend with "foreach", and shut down the cluster when the computation is complete, all with just a few lines of R code.
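The general foreach pattern is the same wherever the workers run; a minimal local sketch using the doParallel backend is shown below (doAzureParallel swaps a cloud cluster in place of the local one; the simulation body here is illustrative):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(4)          # a local cluster of 4 R sessions
registerDoParallel(cl)        # register it as the foreach backend

# embarrassingly parallel loop: each iteration is an independent simulation
res <- foreach(i = 1:100, .combine = c) %dopar% {
  mean(rnorm(1e5, mean = i))
}

stopCluster(cl)
```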
15:30 Improving performance Romain François TBD rrrow: an R front end to Apache Arrow algorithms, performance, big data, streaming data, interfaces
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby. R support is currently being implemented, and in this talk we will discuss the various challenges, and our short-, medium- and long-term vision for the connection between R and Apache Arrow.
15:30 Sports analytics Robert Nguyen TBD Using Australian Rules Football to Broaden the Appeal of R and Statistics Among Youth and the Public Without a STEM Background visualisation, models, applications, reproducibility, community/education, interfaces
Our talk explores how sports analytics can be used to encourage those without a STEM background into the application of statistics and programming in a real-world environment. Through the use of an R package (fitzRoy) related to AFL, we aim to lower the barrier to entry for data access while also increasing analytical fan engagement in AFL. We will also talk about common issues that arise for the first-time R package builder. A key barrier to entry for the growth of the AFL analytics community is data access, which not only prevents people from having a go at writing, but also prevents current media from producing reproducible work. By providing an R package with online lessons on creating common fan rating systems like Elo, Pythagorean and Massey, we will engage people who might otherwise have put learning statistical modelling and R into their personal "this is too hard" bucket. Commonly, users are taught from a cleaned dataset and jump straight into modelling, which misses a key part: cleaning. With our package, we aim to use tangible examples of scraped, raw AFL data from afltables and footywire to teach users how to clean scraped data themselves and get it into a tidy format for modelling.
15:30 Sports analytics Alex Fun TBD Using TMB (Template Model Builder) to predict the winner of a ping pong match algorithms, models, applications
In a recent and popular stats.stackexchange post, the following question was asked: "I bet with my colleague that I will beat him in fifty consecutive ping pong games. So far I have won 15, what are my chances of winning the next 35 games?" -- from https://stats.stackexchange.com/questions/329521/ To answer this question, I propose the following data generation process for the score-line in each game: the OP (original poster) is a far superior player who still wishes to make the game fun for their opponent (they are colleagues after all). This leads to a regression problem for the OP's probability of winning a point that cannot be fitted using standard regression packages. This introductory talk will demonstrate how to use the TMB (Template Model Builder) package with an optimisation algorithm to find maximum likelihood estimates for the regression coefficients. This will show that TMB is a very useful and efficient tool that allows the practitioner a lot of flexibility in exploring novel data generation processes and objective functions. I will also briefly touch upon using C++ from R, and automatic differentiation, which is great for those who dislike multivariate calculus.
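A hedged sketch of the generic TMB workflow referred to above (the C++ template file, data and parameter names are hypothetical placeholders, not the speaker's actual model):

```r
library(TMB)

# compile the C++ objective function (hypothetical template "pingpong.cpp")
compile("pingpong.cpp")
dyn.load(dynlib("pingpong"))

# build the automatically differentiated objective and optimise it
obj <- MakeADFun(data       = list(wins = 15, games = 15),  # placeholder data
                 parameters = list(beta = 0),               # placeholder parameter
                 DLL        = "pingpong")
opt <- nlminb(obj$par, obj$fn, obj$gr)

sdreport(obj)   # standard errors via the delta method
```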
15:30 Sports analytics Andrew Simpkin TBD A Shiny app used to predict training load in professional sports visualisation, algorithms, models, applications, databases, web app, multivariate, performance, streaming data
We have developed a Shiny dashboard web application used in professional sports to predict player load while planning a training session. This app allows coaches to better plan, prescribe and tailor training drills in advance. The Shiny dashboard app is deployed on Shiny Server Pro and connects to an SQL database of GPS data across multiple teams and sports. Teams can plan, save, edit and delete planned sessions to and from the GPS database. Based on retrospectively collected GPS and accelerometer data, we have developed a statistical learning algorithm to cluster similar drills and predict training load. The model achieves correlations over 0.95 in out-of-sample testing, with median differences of below 1% of GPS outcomes.
15:30 Sports analytics Sayani Gupta TBD CricketData: An R package for international cricket data visualisation, data mining, applications, web app, reproducibility
The CricketData package provides convenient scraper functions for downloading data from ESPNcricinfo into tibbles. Functions are provided for obtaining data on the performance of male and female players across Test, One Day International and Twenty20 formats, and for batting, bowling and fielding. Tidyverse packages can then be used to explore, visualise and analyse the data. The package enables a user to answer simple questions such as: What is the highest number of catches taken by a wicket keeper? What is the maximum number of catches taken by a fielder in a particular innings? How many batsmen have scored consecutive 100s in two matches or more? What is the maximum number of maiden overs bowled by a bowler in a specific innings? It will also allow deeper questions to be addressed, such as: Do batsmen tend to get run out more frequently when they are about to score a century? How does the performance of cricketers change in the 12 months before they retire? When is the period of peak performance during a cricketer's career? Finally, it makes it easy to produce visual comparisons of player performance across different statistics.
15:30 Leveraging web apps Katie Sasso TBD Shiny meets Electron: Turn your Shiny app into a standalone desktop app in no time applications, databases, web app, reproducibility, interfaces, Automation
Using Shiny in consulting can be challenging, as all deployment options involve either sending intellectual property and data to the cloud or IT involvement. When providing consulting-style services to extremely large, risk-averse enterprises, this can greatly restrict one's ability to quickly get Shiny apps into users' hands, as engaging IT can take months, if the request is approved at all. We'll share how the Columbus Collaboratory team overcame these barriers to rapid deployment by coupling R Portable and Electron, a framework for creating native applications with a variety of web technologies. All the tools needed to use Electron for desktop deployment of Shiny apps will be reviewed. We'll highlight a specific example in which these technologies were used within a large enterprise to completely automate a weekly report. We'll also share how the app used R packages such as openxlsx, shinydashboard, RODBC, and zoo to query an internal database, cleanse data, calculate key metrics, and create a downloadable Excel file for dissemination. The best part? This Shiny app was delivered to the end business user as a stand-alone executable. https://github.com/ksasso/Electron_ShinyApp_Deployment
15:30 Leveraging web apps Adrian Barnett TBD Saving time for researchers by creating publication lists using shiny applications, databases, web app, open access
Researchers are often asked by funders or employers to list their publications, but funders often have different requirements (e.g., all papers versus only those in the last five years) and researchers waste a lot of time formatting papers. To save time for researchers I made a Shiny application (https://aushsi.shinyapps.io/orcid/) that takes a researcher's ORCID ID and outputs their papers in alternative formats. It uses Crossref and PubMed (via rentrez) to supplement the ORCID data. The app was included in the Australian Research Council's instructions to applicants and has been well used, with many good suggestions for improvements. However, the ORCID data is relatively messy and papers can be in multiple formats, making it difficult to create a standardised paper record that can be flexibly manipulated. For example, the publication's author data are in different fields and formats. Google Scholar publications are nicely standardised, but there are authentication issues when using Shiny. I will describe how the app has developed and canvass how it could be improved, including adding the percentage of publications that are open access or other alternative research metrics.
15:30 Leveraging web apps Gergely Daroczi TBD Managing database credentials and connections: an easy and secure approach applications, databases, web app, interfaces, business
Although the `DBI` R package family already provides a standardized way of opening connections to various databases and querying data, and, for example, the `config` package allows storing default database connection parameters in a central file, with some of the sensitive fields optionally encrypted via the `keyring` or `secret` packages, there is no convenient and secure wrapper around these for the actual R end user. This talk introduces a new package that takes care of opening connections in the background to the databases specified in a secured and encrypted YAML file, so that the R user can simply specify the SQL command without needing to think about which DB backend and credentials are used.
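To make the problem concrete, this is roughly what end users stitch together today without such a wrapper (a hedged sketch; the YAML entry name, driver and keyring service are assumptions, not part of the package being presented):

```r
library(DBI)

# connection defaults kept in a config.yml entry called "warehouse" (assumed name)
cfg <- config::get("warehouse")

# the password lives in the system keyring rather than in the YAML file
pwd <- keyring::key_get(service = "warehouse", username = cfg$user)

con <- dbConnect(RPostgres::Postgres(),
                 host = cfg$host, dbname = cfg$dbname,
                 user = cfg$user, password = pwd)

dbGetQuery(con, "SELECT 1")
dbDisconnect(con)
```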
15:30 Leveraging web apps Ian Hansel TBD Large Scale Data Visualisation with Deck.gl and Shiny visualisation, web app, space/time
'deck.gl is a WebGL-powered framework for visual exploratory data analysis of large datasets' (https://uber.github.io/deck.gl/#/). Combining deck.gl and Shiny allows for rich interactive graphics of large datasets, in particular visualising geospatial data. We will review how to integrate deck.gl with Shiny using the upcoming R package 'deck.gl'. The talk will: review the underlying technologies (WebGL, Mapbox and React.js); dive into an example exploring the latest Census from the Australian Bureau of Statistics; compare to existing visualisation capabilities in the 'rthreejs' and 'leaflet' packages; and discuss how further integrations with React.js can enable more browser-based interfaces to data and analytics. After the talk the attendees should: know how deck.gl works; understand how to visualise data in deck.gl from R using the 'deck.gl' package; and want to use deck.gl in their own work :) The talk is aimed at those with some experience (or interest) in geospatial analysis.
Time Session Presenter Venue Title Keywords
10:30 Optimisation for model fitting Anqi Fu TBD Disciplined Convex Optimization with CVXR models, data mining, applications, multivariate, big data, interfaces, optimization
CVXR is an R package that provides an object-oriented modeling language for convex optimization, similar to CVX, CVXPY, YALMIP, and Convex.jl. It allows the user to formulate convex optimization problems in a natural mathematical syntax rather than the restrictive standard form required by most solvers. The user specifies an objective and set of constraints by combining constants, variables, and parameters using a library of functions with known mathematical properties. CVXR then applies signed disciplined convex programming (DCP) to verify the problem's convexity. Once verified, the problem is converted into standard conic form using graph implementations and passed to a cone solver such as ECOS or SCS. We demonstrate CVXR's modeling framework with applications in engineering, statistical estimation, and machine learning.
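For example, a lasso regression can be written almost exactly as it appears on paper (a minimal sketch with simulated data; the penalty value is arbitrary):

```r
library(CVXR)

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(3, -2, rep(0, p - 2)) + rnorm(n)

beta   <- Variable(p)                       # optimisation variable
lambda <- 1
obj    <- Minimize(sum_squares(y - X %*% beta) + lambda * p_norm(beta, 1))
prob   <- Problem(obj)

result <- solve(prob)                       # DCP check, conversion, cone solver
result$getValue(beta)                       # estimated coefficients
```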
10:30 Optimisation for model fitting Giuseppe Bruno TBD Stochastic Gradient Descent: boosting its performances in R data mining, big data
Despite the tremendous improvements in hardware and software technologies, the requirements for training machine learning models keep growing. With standard loss functions, gradient descent (GD) provides a simple approach. For a least-squares loss F(w) = Σ_i (x_iᵀ w − y_i)², the whole gradient is the sum of the gradients of each component function: ∇F(w) = 2 Σ_i (x_iᵀ w − y_i) x_i. The complexity per iteration is O(nd). Here we gauge stochastic gradient descent (SGD), where the gradient is approximated using a single observation. When the stopping criterion is |w_{k+1} − w_k| < ε, GD requires O(log(1/ε)) iterations while SGD needs O(1/ε). Although the iterative nature of SGD prevents its straightforward parallelization, a few alternatives have been proposed in the literature for carrying out a parallel implementation. In this paper we provide different benchmark examples of parallel implementations, through standard shared memory and the Spark distributed computing framework, to boost SGD performance. Preliminary results, under given conditions, show significant performance improvements. The possibility to take advantage of these speed-ups is open to practitioners and not just computer specialists.
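A compact base-R sketch of the serial SGD update being benchmarked (illustrative only, with simulated data and a fixed learning rate):

```r
set.seed(1)
n <- 10000; d <- 5
X <- matrix(rnorm(n * d), n, d)
w_true <- runif(d)
y <- drop(X %*% w_true) + rnorm(n, sd = 0.1)

sgd <- function(X, y, eta = 0.01, epochs = 5) {
  w <- rep(0, ncol(X))
  for (e in seq_len(epochs)) {
    for (i in sample(nrow(X))) {                  # one observation per update
      g <- 2 * (sum(X[i, ] * w) - y[i]) * X[i, ]  # gradient of a single squared error
      w <- w - eta * g
    }
  }
  w
}

w_hat <- sgd(X, y)
rbind(w_true, w_hat)   # the estimates should be close to the true weights
```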
10:30 Optimisation for model fitting Melina Ribaud TBD Robustness criterion for the derivative kriging-based optimization algorithms, models, applications
In the context of robust shape optimization, the estimation cost of some numerical models is reduced with a kriging metamodel. The function value and its first and second derivatives are provided by the majority of industrial codes. We propose a robust optimization procedure that relies on the kriging predictions of the function and its derivatives. The use of the derivatives improves the metamodel quality. We rely on the Rcpp and nloptr packages in R to estimate, predict and simulate the kriging model with derivatives. Taylor's theorem, computed from the predictions of the function and its derivatives, is applied at each evaluated point to approximate the variation of the function. This cheap criterion is used as a replacement for a full computation of the second moment from the model. A Pareto front of robust solutions (minimization of the function and of the robustness criterion) is then generated by the NSGA-II genetic algorithm through the nsga2R package in R. This algorithm efficiently produces a Pareto front regardless of the model complexity.
10:30 Optimisation for model fitting Andrew Locke TBD Augmented Lagrangian for constrained optimizations in Empirical Likelihood estimations algorithms, models
Empirical likelihood is a useful tool for inference as it does not require assumptions about where the data come from. It can be extended in many ways, including regression or adding constraints using estimating equations. The positivity constraint has often been overlooked or ignored, which means existing methods may not be applicable for some data. We look at enforcing this constraint by applying the Karush-Kuhn-Tucker conditions and using a multiplicative iterative optimization method for updating parameters which ensures movement towards the maximum. We have implemented this method in R and use simulations to demonstrate that the method works.
10:30 infrastructure and tools for genomic analysis Ido Bar TBD Shinotate: an R-based shiny server for annotation and analysis of RNA-Seq transcriptome assemblies visualisation, web app, bioinformatics
Assembly of transcriptome data in non-model species has become common practice in the last decade thanks to the advent of high-throughput RNA-sequencing platforms and accompanying bioinformatics tools. Trinity is one of the most commonly used tools for transcriptome assembly from Illumina RNA-Seq data, and its accompanying functional annotation framework, Trinotate, offers a pipeline for running the various annotation tools and consolidating the results into a single database. Trinotate also includes a web-based graphical user interface for querying the annotations and providing basic visualisation, but its Perl implementation makes it difficult to customise and deploy. Shinotate was developed to provide a modern graphical interface for the analysis of transcriptome annotations, utilising the Trinotate annotation framework to deliver summarised results and insights to users of all skill levels. Shinotate is written in R and uses the `tidyverse` approach to summarise and visualise the data stored in Trinotate and thus can be easily adapted to accommodate custom annotation tables. It serves interactive annotation tables and plots, with search, selection and data export functions.
10:30 infrastructure and tools for genomic analysis Peter Hickey TBD DelayedArray: A tibble for arrays bioinformatics, performance, big data
High-throughput genomics data are commonly summarised in a feature-by-sample matrix or higher-dimensional array. In R, these have traditionally been stored in-memory, but this is becoming prohibitive for large, contemporary datasets, such as those generated using new genomics technologies like single-cell RNA-seq. Instead, these arrays may be stored on-disk, using the Hierarchical Data Format 5 (HDF5), for example. The Bioconductor project has developed the DelayedArray, which supports different 'backends' to wrap around an in-memory, on-disk, or remotely served representation of an array, providing a unified interface to the data that is familiar to users of ordinary R arrays. In this sense, a DelayedArray is to an array as a tibble is to a data frame. I will provide an overview of the DelayedArray framework, explain the requirements for developing a new backend for a DelayedArray, and highlight example backends for on-disk and remotely served data. I will also demonstrate how user-created packages can extend the capabilities of the DelayedArray and how this has enabled us to analyse large genomics datasets in R that were previously infeasible.
10:30 infrastructure and tools for genomic analysis Ramyar Molania TBD Improved normalization of the NanoString nCounter gene expression data bioinformatics
The NanoString nCounter gene expression assay uses molecular barcodes and single molecule imaging to detect and count hundreds of unique transcripts in a single reaction. These counts need to be normalized to adjust for variations in assay efficiency, the amount of sample, and other factors. Most users adopt one of the options described in the nSolver analysis software, which involve background correction based on the observed values for 8 negative control probes, a within sample normalization using the observed values for 6 positive control probes, and normalization across samples using reference (“housekeeping”) genes. Including technical replicates is not recommended by the assay developers, but some users do so anyway. Here we present a new normalization called RUV3 which makes vital use of technical replicates and suitable control genes. We illustrate its effectiveness on four quite different datasets, and offer suggestions on the design and analysis of studies involving this technology.
10:30 infrastructure and tools for genomic analysis Abdul Abdulmonem A. Alsaleh TBD Identifying methylation biomarkers for childhood leukaemia from human 450k DNA methylation array data using ABC.RAP R package data mining, bioinformatics, data analysis
To date, the majority of the available 450k DNA methylation analysis tools focus on single-CpG methylation differences. The array-based CpG region analysis pipeline (ABC.RAP) R package was developed to analyse normalised human 450k DNA methylation array datasets, applying Student's t-test and delta beta analysis to identify candidate genes containing multiple differentially methylated CpG sites. In addition, ABC.RAP can profile DNA methylation for any gene of interest, providing a powerful feature for comparison between datasets. We analysed nine publicly available acute leukaemia datasets and identified a panel of 11 genes that were consistently methylated across different cohorts. We used targeted DNA methylation sequencing (MiSeq; Illumina) to sequence blood samples from healthy adults and newborns, as well as leukaemia xenograft samples and cell lines. The selected panel of genes showed dense DNA methylation in leukaemia samples compared to low-level methylation in control samples, consistent with the publicly available 450k array data. ABC.RAP has been accepted on CRAN and can be accessed at: https://cran.r-project.org/package=ABC.RAP
10:30 Classification and data mining Christoph Bergmeir TBD ssc: An R Package for Semi-Supervised Classification algorithms, models, data mining
Semi-supervised classification has become a popular area of machine learning, where both labeled and unlabeled data are used to train a classifier. This learning paradigm has obtained promising results, specifically in the presence of a reduced set of labeled examples. We present the R package ssc (https://cran.r-project.org/package=ssc) that implements a collection of self-labeled techniques to construct a classification model. This family of techniques enlarges the original labeled set using the most confident predictions to classify unlabeled data. The techniques implemented in the ssc package can be applied to classification problems in several domains by the specification of a suitable learning scheme. At low ratios of labeled data, it can be shown to perform better than classical supervised classifiers.
10:30 Classification and data mining Przemyslaw Biecek TBD DALEX will help you to understand this complex predictive model visualisation, algorithms, models, data mining, bioinformatics
Complex machine learning models (random forests/gradient boosting machines/other ensembles) are frequently used in predictive modeling and have many successful applications in predictive and prognostic modeling. Yet in many cases these models are perceived as "black boxes" with good accuracy but a very complex, hard-to-understand structure. In this talk I will present a methodology for exploration, validation and explanation of complex machine learning models. The methodology is implemented in the DALEX (Descriptive mAchine Learning EXplanations) library for R. It contains three sets of explainers: explainers for individual model predictions, which may be used to better understand the key variables that drive model predictions; explainers for individual variables, which may be used to better understand how model predictions are related to the values of a selected feature; and explainers for global model structure, which may be used to assess globally important variables or important structures in the model. Find more about DALEX here: https://pbiecek.github.io/DALEX/ I will also give a workshop about this package.
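A hedged sketch of the basic workflow (the model and data are illustrative, and explainer function names have shifted across DALEX versions, so `variable_importance()` may be named differently in newer releases):

```r
library(DALEX)
library(randomForest)

data(apartments)                      # example data shipped with DALEX
rf <- randomForest(m2.price ~ ., data = apartments)

# wrap the model together with the features and the true responses
explainer <- explain(rf, data = apartments[, -1], y = apartments$m2.price)

# explainer for global model structure: permutation variable importance
vi <- variable_importance(explainer)
plot(vi)
```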
10:30 Classification and data mining Roel Henckaerts TBD Tree-Based Machine Learning for Insurance Pricing visualisation, models, Tree-based machine learning
The goal of this paper is to apply machine learning techniques to insurance pricing, thereby leaving the actuarial comfort zone of generalized linear models (GLMs) and generalized additive models (GAMs). We focus on developing full tariff plans, built from both the frequency and severity of claims. We adapt the cost functions and performance measures used in the algorithms such that the specific characteristics of insurance data are carefully incorporated: highly unbalanced count data with excess zeros on the frequency side and scarce, but potentially heavy-tailed and right-censored data on the severity side. One of the key requirements is the need for transparent, interpretable pricing models which are easily explainable to all stakeholders. We therefore shy away from black box models such as neural networks and rather focus on tree-based machine learning models. Starting from single recursive trees we work towards more advanced ensembles such as bagged trees, random forests and boosted trees. We also present visualization tools to obtain insights from the models by assessing the importance of the different risk factors and their impact on the price of an insurance contract.
10:30 Classification and data mining Lubomír Štěpánek TBD Classification and evaluation of facial attractiveness and emotions for purposes of plastic surgery using machine-learning methods and R algorithms, models, multivariate, machine learning
Current plastic surgery deals with aesthetic indications such as improving the attractiveness of a smile or other facial emotions. In this work, we have applied machine-learning methods in R to explore how accurately photographed faces can be classified into sets of facial emotions (based on the Ekman-Friesen FACS scale), and furthermore which facial emotions are associated with the highest facial attractiveness, measured on a Likert scale by a board of observers. Facial image data, collected for each patient exposed to an emotion incentive, were processed, landmarked and analyzed using R. Neural networks (neuralnet package), in comparison to naive Bayes classifiers (e1071 package) and regression trees (rpart package), showed the highest predictive accuracy when categorizing a new face into facial emotions. Decision trees identified that the geometry of the mouth, eyebrows and eyes, respectively, affect the intensity of a classified emotion in descending order. We performed machine-learning analyses using R to point out which facial emotions and which aspects of their geometry affect facial attractiveness the most, and therefore should preferentially be addressed within plastic surgeries.
10:30 Modeling and algorithms with a health focus Helena Kotthaus TBD Optimizing Parallel R Programs via Dynamic Scheduling Strategies models, performance
We present scheduling strategies for optimizing the overall runtime of parallel R programs. Our proposal improves upon the existing mclapply function of the parallel package, which already offers a load balancing option that dynamically allocates tasks to worker processes. However, this mechanism has shortcomings when used on heterogeneous hardware architectures, where different CPU cores might have vastly different performance characteristics. We thus propose to enhance mclapply with a new parameter that allows mapping tasks to specific CPUs. The new affinity.list parameter, already available on the R-devel branch, allows setting a so-called CPU affinity mask that specifies on which CPU a given task is allowed to run. We demonstrate the benefits of the new mclapply version by showing how it can speed up parallel applications like parameter tuning. In this case study, we develop a regression model that guides the scheduling by estimating the runtime of a task for each processor type based on previous executions. In a series of code examples, we explain how this approach can be generalized to develop efficient scheduling strategies for parallel R programs.
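A small, hedged illustration of the parameter described above (the split between "fast" and "slow" cores is made up for the example; affinity masks require `mc.preschedule = FALSE`, forking support, and a platform that allows setting CPU affinity):

```r
library(parallel)

tasks <- 1:8

# suppose cores 1-2 are fast and cores 3-4 are slow (hypothetical hardware):
# pin the heavier tasks to the fast cores and the lighter ones to the slow cores
affinity <- lapply(tasks, function(i) if (i <= 4) c(1L, 2L) else c(3L, 4L))

res <- mclapply(tasks,
                function(i) sum(rnorm(1e5 * i)),   # task i's cost grows with i
                mc.preschedule = FALSE,
                affinity.list  = affinity)
```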
10:30 Modeling and algorithms with a health focus Mauricio Sarrias TBD Comparing Implementations of Logit Models with Individual Heterogeneity models, applications, reproducibility, performance
This paper discusses different specifications for multinomial logit models that include individual heterogeneity (the mixed logit model, latent class MNL, GMNL and MM-MNL). Due to the ability of these models to include unobserved heterogeneity, they have become quite popular for the empirical analysis of choice decisions. However, because of the complex estimation of these models using simulated maximum likelihood (SML), it is quite difficult to compare or even replicate results from different software implementations, even when using the same database. For example, the estimates depend on the optimization algorithm, the way random numbers are generated, the prime numbers used if Halton draws are selected, etc. Despite the now widespread use of these methods and these considerations, there appears to be no systematic investigation of the accuracy of these models or a comparison of the performance of the SML estimation routines that now exist in several software packages. In this article, I compare different implementations (R, Stata, Matlab) of these models, focusing on their ability to retrieve the true parameters from different data generating processes and different default setups.
10:30 Modeling and algorithms with a health focus Daniel Putler TBD Optimally Locating Opioid Treatment Centers in Under Served Areas Using R and Alteryx applications, web app, Public Health, Optimization
In 2016 there were nearly 64,000 drug overdose deaths in the US, most of them due to opioids. Treatment of opioid addiction is one of the primary tools for addressing this situation. However, many of the areas hardest hit by opioid use are under served from a treatment perspective. An issue currently impeding the location of treatment facilities is the lack of fine-grained location data on opioid abusers. We present a web application that assists decision makers in locating opioid treatment facilities in under served areas. To do this, estimates of the number of adult opioid abusers at the census tract level are developed using R, based on data from the National Survey on Drug Use and Health and both census tract and microdata sample data from the American Community Survey. The census tract estimates of adult opioid abusers are used, along with data on the locations of existing opioid treatment facilities, to locate new facilities in areas that are further than ten miles from existing facilities, maximizing the estimated number of abusers within a ten-mile radius of the new facilities. The optimization is done using an evolutionary algorithm that is implemented in Alteryx.
10:30 Modeling and algorithms with a health focus Brianna Hitt TBD Optimal group testing algorithms for infectious disease detection with the binGroup package algorithms, applications, binary response; infectious disease testing; pooled testing; screening; sensitivity; specificity
Group testing is the process of amalgamating clinical specimens from individuals (e.g., blood, urine, or saliva) into groups to test for an infectious disease. When disease prevalence is small, the majority of these groups will test negatively. For positive testing groups, there are many algorithmic retesting procedures available to differentiate positive individuals from negative ones. The appeal of group testing to laboratories is that the number of tests needed is significantly less than testing each individual separately. Both estimating the probability of disease infection and identifying positive/negative individuals are goals of group testing. Unfortunately, no package has been available to address the identification goal for the most common group testing algorithms. We present the first functions for identification and make these available in the binGroup package to complement its large set of estimation functions. Our new functions calculate operating characteristics for algorithms and choose the optimal set of group sizes for user specified settings. These new functions allow laboratories to understand how well an algorithm is expected to perform before implementation.
10:30 Working with text Aneesha Bakharia TBD Topic Modeling with LDA and NMF from a Qualitative Content Analysis Perspective visualisation, text analysis/NLP, interfaces
The Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorisation (NMF) algorithms are able to find the latent topics within a document collection. Although LDA is specifically designed as a topic modeling algorithm, NMF is able to produce more coherent topics for smaller, domain-specific document collections. Both algorithms map documents to topics and topics to words and perform soft clustering (i.e., documents and words can belong to multiple topics), making them particularly suitable as qualitative content analysis aids. In this presentation the mathematical underpinnings of both algorithms, along with the relevant R packages (topicmodels and NMF), will briefly be introduced. The main focus of the presentation, however, will be on using R to address issues that qualitative researchers encounter when using topic modeling algorithms, which include trust, topic quality/coherence, topic interpretation, evidence gathering and model parameter selection. Various tools to visualise the output of a topic model will be discussed (e.g., LDAvis), and an intuitive user interface to explore topic models and gather evidence will be built using Shiny.
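For orientation, a hedged sketch of fitting both models on a document-term matrix (`dtm` is a hypothetical DocumentTermMatrix built elsewhere, e.g. with the tm package; the number of topics is arbitrary):

```r
library(topicmodels)
library(NMF)

# LDA with 10 topics on a DocumentTermMatrix (hypothetical `dtm`)
lda_fit <- LDA(dtm, k = 10, control = list(seed = 1))
terms(lda_fit, 10)           # top 10 words per topic

# NMF on the (non-negative) document-term matrix, also with rank 10
nmf_fit <- nmf(as.matrix(dtm), rank = 10)
W <- basis(nmf_fit)          # document-to-topic weights
H <- coef(nmf_fit)           # topic-to-word weights
```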
10:30 Working with text JeongMin Kwon TBD From humble data to training data data mining, applications, text analysis/NLP
There is a lot of imperfect data. User feedback is not trustworthy, and implicit data is unlabelled and hard to wrangle, which makes it hard to use for machine learning and in many other ways. But we can use such data by changing how we think about it and by wrangling it. In this presentation, I suggest some new ideas for wrangling data for use in machine learning and show our case studies.
10:30 Working with text Talia Beech TBD Strategic Capability Analysis for CANSOFCOM visualisation, text analysis/NLP
To enhance the Canadian Special Operations Forces Command's competitive advantage in deterring and defeating adversaries as well as collaborating with allies, a strategic capability assessment was conducted to identify current and future capability gaps using concepts from the forecasted future operating environment. Military capability implications are identified and assessed using a wargame-based survey approach across a range of units within the Command. Data collected included ordinal data as well as supplementary comments. The ordinal data are analyzed using the likert package, with an emphasis on the visualization of the data using stacked bar plots. Comment data are evaluated using R text mining packages, with some emphasis on preprocessing steps to simplify text mining tasks. Results are used as a foundation for implementing constructive institutional change across the Command.
10:30 Working with text Thomas Klebel TBD jstor: An R package for Analysing Scientific Articles data mining, text analysis/NLP
The interest in the (quantitative) analysis of textual data has increased considerably over the last few years. For researchers investigating the scholarly literature, the full-text archive of JSTOR (http://www.jstor.org) offers a rich and diverse set of journal articles and other texts. Through its service Data for Research (http://www.jstor.org/dfr/), JSTOR gives researchers the opportunity to analyse this data by delivering metadata, n-grams and, upon special request, full-text materials. jstor (https://tklebel.github.io/jstor/) enables researchers to easily import the supplied metadata into R. These metadata can either be analysed on their own, or be used in conjunction with n-grams or full-text data. The presentation will show how jstor supports investigations of scholarly literature, covering the analysis of n-grams and citation analysis. Besides introducing possible applications, the paper will also discuss limitations regarding data quality and possible solutions to them.
10:30 Time series data and forecasting Earo Wang TBD tsibble: Tidy data structures to support exploration and modeling of temporal-context data visualisation, space/time
The conventional matrix structure that underlies time series models in R does not easily accommodate a few complications, such as multiple variables, heterogeneous data types, low time resolutions, implicit missing values, and multilevel structure. This work addresses the broader issues of better data structures and modern data pipelines for analysing and visualising temporal-context data. We extend the tidy data concept to temporal data, and note that the "molten" data structure is flexible enough to handle heterogeneity, low time resolutions, and implicit missing values. Two constraints are required to turn "molten" data into valid temporal data: (1) an explicitly declared index variable containing timestamps; (2) a constraint that uniquely identifies the units of measurement, which is referred to as a "key". A syntactical approach is introduced to describe nested or crossed data structures, which employs the "key". Based on the tidy temporal data, a data pipeline is discussed and formulated to facilitate time-based transformation and visualisation. A case study is included to demonstrate the tidy structure and the data pipeline ideas and usage.
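A small, hedged example of declaring a tsibble (the data frame is made up, and the key/index syntax follows released versions of tsibble, which may differ slightly from the version presented at the talk):

```r
library(tsibble)

weather <- data.frame(
  station = rep(c("A", "B"), each = 3),
  date    = rep(as.Date("2018-07-11") + 0:2, times = 2),
  temp    = c(12.1, 13.4, 11.8, 15.0, 14.2, 16.3)
)

# index = the timestamp variable; key = what uniquely identifies each series
weather_tsbl <- as_tsibble(weather, key = station, index = date)
weather_tsbl
```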
10:30 Time series data and forecasting Mitchell O'Hara-Wild TBD fasster: Forecasting multiple seasonality with state switching algorithms, models, streaming data, timeseries
Forecasting time series which contain multiple seasonal patterns requires flexible modelling approaches, and the need for continuously updating models emphasises the importance of fast model estimation. In response to shortcomings in current models, a new model is proposed which brings the desirable qualities of speed, flexibility and support for exogenous regressors into a state space model. This proposed model also introduces state switching, which captures groups of irregular multiple seasonality by switching between states. The functionality of the proposed model extends beyond forecasting, by allowing for model-based time series decomposition, imputation of missing values, and support for streaming data. This model is available as an R package (mitchelloharawild/fasster), which provides formula-based model specification, and uses tidy data structures (tsibble) and APIs which will later become familiar in forecast's next iteration: tidyforecast.
10:30 Time series data and forecasting Rob Hyndman TBD Tidy forecasting in R space/time
The forecast package in R is widely used and provides good tools for monthly, quarterly and annual time series. But it is not so well developed for daily and sub-daily data, and it does not interact easily with modern tidy packages such as dplyr, purrr and tidyr. I will describe our plans and progress in developing a collection of packages to provide tidy tools for time series and forecasting, which will interact seamlessly with tidyverse packages, and provide functions to handle time series at any frequency.
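As a point of reference, the classical (non-tidy) workflow with the existing forecast package looks like the sketch below; the tidy tools under development aim to provide the equivalent for tsibble-style data:

```r
library(forecast)

fit <- auto.arima(AirPassengers)   # automatic ARIMA selection on a monthly series
fc  <- forecast(fit, h = 24)       # forecast two years ahead
autoplot(fc)                       # plot point forecasts and prediction intervals
```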
10:30 Time series data and forecasting Thiyanga Talagala TBD seer: R package for feature-based forecast-model selection algorithms, models, time series
The seer package provides a novel framework for forecast model selection using time series features. We call this framework FFORMS (Feature-based FORecast Model Selection). The underlying approach involves computing a vector of features from the time series which are then used to select the forecasting model. The model selection process is carried out using a classification algorithm -- we use the time series features as inputs, and the best forecasting algorithm as the output. The classification algorithm can be built in advance of the forecasting exercise (so it is an “offline” procedure). Then, when we have a new time series to forecast, we can quickly compute its features, use the pre-trained classification algorithm to identify the best forecasting model, and produce the required forecasts. Thus, the “online” part of our algorithm requires only feature computation, and the application of a single forecasting model, with no need to estimate large numbers of models within a class, or to carry out a computationally-intensive cross-validation procedure. This framework is compared against several benchmarks and other commonly used forecasting methods.
14:00 Applications in big data Ansgar Wenzel TBD A closer look at UK MOT results - Why does my car always fail? applications
We present an analysis of the last 10 years of MOT results in the UK, with a particular focus on when cars fail and why. This is based on an open data set provided by the UK Government on the MOT, an annual car check mandatory for all vehicles older than three years. We hope that the results can inform customer choice when purchasing used (or new) vehicles as well as provide some interesting findings. In particular, we consider geographical, vehicle and owner data, and interactions between these groups, to identify the main drivers for a car failing an MOT. We also consider the severity of a failure, e.g. a non-working number plate light versus unsafe brake discs. We use these results to inform the design and training of a model to predict failure (or passing) for a given vehicle. Additionally, we present results that were found using some less common techniques. We also consider whether there are significant regional differences in pass or fail rates for different car brands or models.
14:00 Applications in big data Kevin Kuo TBD Claims reserving in general insurance with R and Keras algorithms, models, data mining, applications, insurance
In loss reserving, actuaries are concerned with estimating liabilities from current and future, yet to be reported, claims. In this session, we first provide an overview of the loss reserving problem and current techniques. We then frame the loss reserving problem as a predictive modeling problem, and propose a deep learning approach to solve it. We benchmark the model against existing techniques, then discuss applications of deep learning to other problems in actuarial science and insurance.
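To illustrate the general approach (not the authors' actual architecture), a minimal feed-forward network in R with keras might look like this, where `x_train` and `y_train` are hypothetical matrices of claims features and reserve targets:

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = ncol(x_train)) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1)                    # predicted ultimate loss / reserve

model %>% compile(loss = "mse", optimizer = optimizer_adam())

model %>% fit(x_train, y_train,
              epochs = 50, batch_size = 128, validation_split = 0.2)
```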
14:00 Applications in big data Sangeeta Bhatia TBD Big Brother is Watching - Using Digital Disease Surveillance Tools for Near Real-Time Forecasting models, applications, Epidemiology
In our increasingly interconnected world, it is crucial to understand the risk of an outbreak originating in one country or region and spreading to the rest of the world. Digital disease surveillance tools such as ProMED, HealthMap etc. have the potential to serve as important early warning systems as well as to complement field surveillance data during an ongoing outbreak. While there are a number of systems that carry out digital disease surveillance, there is as yet a lack of tools that can compile and analyse the generated data to produce easily understood, actionable reports. I will present a flexible statistical model that uses different streams of data (such as disease surveillance data, mobility data etc.) for short-term incidence trend forecasting. I will also highlight an example of disaggregating aggregated data to obtain incidence information at a fine spatial scale. This could be particularly important in instances where information at sub-national levels is lacking or incomplete. The model has been developed in R and will be made available as an R package as well as through a website for use by non-technical stakeholders.
14:00 Community John Mertic TBD Sustainable community investment in action - a look at some of the R Consortium Funded Grant Projects and Working Groups. community/education
R Consortium has awarded over US$500,000 in grants to R community members who are improving the community and technical infrastructure of the R ecosystem. In addition, its working groups program has driven discussion and alignment on key areas such as industry adoption, package health, and educational standards. In this talk, we will showcase several of the working groups and funded projects stewarded by R Consortium. This will give the audience an opportunity to understand the work being done and to see how they could take part.
14:00 Community Lorna Maria Aine TBD Content for the Community: Leveraging Rmarkdown. community/education
It is a nightmare to solve an R problem when you cannot find any cheat sheets on the internet. Many people shy away from the fact that programming with R is so powerful in the data science field, but it is high time we face it. Data science is becoming more popular, and in the next few years we will need to learn beyond the traditional languages of the web and mobile. When I got the chance to start learning R, it was a personal decision I took wholeheartedly, until I realised how challenging it was to find content built around R: it was either too complex for a newbie, hard to find or non-existent, or lacking examples to work with. This stirred my need to build simple, fun, yet educative and example-based content for the next generation of data scientists. It has always been my passion to build better for the next generation. In this talk, I take us through the pains of learning R while relying on free courses (as a self-taught R programmer), the journey of documenting my own experience while maximising use of Rmarkdown, and how you too can contribute to R content around the globe.
14:00 Community Dennis Irorere TBD R labs Africa community/education
In this century there is pent-up demand for "the next big thing", and R labs Africa comes at the right time to lead on what is important to many in the areas of big data, data science, machine learning and artificial intelligence. I will reference the quote, "Talent is everywhere, it only needs opportunity to emerge", and this is what R labs Africa will be about: providing opportunity for marginalized and at-risk communities all over Africa to learn about data science with R through group mentorship, real-world challenges and regular meetups. There will be an annual R Converge where everyone meets to talk about the future of R, share ideas and motivate one another. When access to knowledge is democratized, we see meaningful social development, because it takes a brilliant mind from a disadvantaged background to create transformative solutions that solve problems within his or her domain of life experience. When brilliance and naïve context find a nexus, true local solutions are created. These are the opportunities R labs Africa will bring to Africa. Already, the Akure R user group has reached about 62 young minds in the span of 3 months.
14:00 Modeling Teck Kiang Tan TBD Doubly Classified Models with R models, applications, community/education
When we look at a cross tabulation, can we see any pattern in it? When the table is big, it is extremely hard to discover patterns by examining the cell frequencies. Doubly classified models are a set of statistical models that aim to reveal patterns in a cross-classification table. There are substantial applications of these models; for instance, social researchers interested in intergenerational social mobility will find this way of analysing refreshing. These models are not new-fangled, but standard textbooks cover only a few of them, and journal articles are usually too technical to grasp the idea behind the model. For those with little background in mathematical statistics, they can be very difficult to understand. The talk focuses on conveying these models using a new graphical table tool called symbolic tables to give the basic idea behind the models. Together, using a few standard R functions, mainly the generalized linear model, doubly classified models can be set up easily. Real-life examples will be illustrated, extracted mainly from the book titled “Doubly Classified Model with R”, written by the author.
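As a hedged sketch of how such a model can be fitted with standard functions, here is a quasi-independence log-linear model on a simulated 4x4 square table (the data are hypothetical, not from the book):
    set.seed(1)
    tab <- expand.grid(origin = factor(1:4), destination = factor(1:4))
    tab$count <- rpois(16, 20)                                    # hypothetical cell counts
    tab$diag <- factor(ifelse(tab$origin == tab$destination,
                              as.character(tab$origin), "off"))   # one level per diagonal cell
    fit <- glm(count ~ origin + destination + diag,
               family = poisson(), data = tab)                    # quasi-independence model
    summary(fit)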
14:00 Modeling Sourav Das TBD A routine for measuring the nonstationarity of a time series space/time, streaming data
Since the 1960s, nonstationary time series have been investigated extensively. Methodology and theory have evolved rapidly since Dahlhaus' construction of locally stationary processes in the 1990s. However, much of the theory behind these constructions relies on assumptions of smoothness on the time-varying transfer function, and when modelling real data, tools for assessing such regularity conditions are yet to be developed. We have proposed a methodology that allows a domain expert to measure the nonstationarity of a time series using principles of nonparametric regression. In this talk we present an R routine that can be used to easily compute the proposed nonstationarity index.
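The presenter's routine is not reproduced here; as a crude stand-in, the sketch below estimates spectra on sliding windows of a simulated series and uses their variation across windows as a rough signal of nonstationarity:
    set.seed(1)
    x <- as.numeric(arima.sim(model = list(ar = 0.5), n = 1000))  # simulated (stationary) series
    windows <- split(x, rep(1:10, each = 100))                    # ten non-overlapping windows
    specs <- sapply(windows, function(w) spec.pgram(w, plot = FALSE)$spec)
    head(apply(specs, 1, var))   # spectral variation across windows, frequency by frequency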
14:00 Modeling Aya Alwan TBD Observation driven Conway-Maxwell Poisson count data models algorithms, models
The Conway-Maxwell-Poisson (CMP) distribution is a flexible generalisation of the Poisson distribution that has gained recent attention due to its flexibility in modelling both overdispersed and underdispersed count data. The main hindrance to its wider use in practice seems to be the inability to directly model the mean of the counts, making CMP models incompatible with, and not directly comparable to, competing count regression models such as the log-linear Poisson, negative binomial or generalized Poisson regression models. In this talk, we will review how the CMP distribution can be parametrized via the mean, so that simpler and more easily interpretable mean-models, such as a log-linear model, can be used. A newly developed R package that fits the model to data will be discussed. Some simulated and real datasets will be used for demonstration.
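For orientation, here is a log-linear Poisson baseline on toy data; the mean-parametrized CMP fit from the new package would replace the glm() call (the commented function name below is a placeholder, not a confirmed API):
    set.seed(1)
    x <- rnorm(100)
    y <- rpois(100, lambda = exp(0.5 + 0.3 * x))   # hypothetical counts
    pois_fit <- glm(y ~ x, family = poisson())     # standard log-linear Poisson model
    # cmp_fit <- glm.cmp(y ~ x)                    # placeholder for the CMP mean-model fit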
14:00 Productivity John Mertic TBD Improve your R package quality with R Hub applications, community/education
R Hub is a project designed to help package developers ensure broad support of their packages across multiple platforms (Windows, Linux, macOS, Solaris), which in turn ensures higher user satisfaction and makes it more likely for the package to be accepted on CRAN. In this talk, you will learn about the tool, how and when to use it, and how to integrate it into your package development workflow.
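A typical workflow, as I understand the rhub client (function and platform names should be checked against the package documentation), run from the package directory:
    # install.packages("rhub")
    library(rhub)
    validate_email()                       # one-time validation of the maintainer address
    platforms()                            # list the available build platforms
    check_for_cran()                       # CRAN-like checks on several platforms
    check(platform = "debian-gcc-devel")   # or target a specific platform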
14:00 Productivity Tomas Kalibera TBD Preventing and Detecting Memory Protection Bugs in Packages programming packages, finding bugs in packages
R's garbage collector (GC) ensures that memory used for R values is automatically reclaimed when they become unreachable via pointers and hence no longer needed. R code is handled automatically, but C code must explicitly protect the R values it needs from the GC and unprotect them afterwards. Forgetting to protect and/or unprotect (a protect bug) often makes R crash, but it can also lead to incorrect results. It is not uncommon for old protect bugs to be uncovered much later by inconsequential code changes. These bugs are common and hard to find, and thus R offers tools to detect them. `gctorture` helps testing by increasing the chance that a protect bug will crash R, and will do so sooner after the code containing the bug executes. `rchk` is a static analysis tool that identifies potential protect bugs in C code without executing it; it is used regularly to check incoming CRAN packages. Finally, protect bugs can be prevented by following several simple programming rules. The talk is intended for package developers and everyone who writes C code to work with R.
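At the R level, a hedged sketch of how `gctorture` is used while exercising package code (my_pkg_function is a hypothetical call into the compiled code under test):
    gctorture(TRUE)                  # trigger a GC at (almost) every allocation
    result <- my_pkg_function(x)     # hypothetical call into C code under test
    gctorture(FALSE)
    # gctorture2(step = 10) is a cheaper variant that collects every 10th allocation.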
14:00 Productivity Kelly O'Briant TBD How to Play with and Integrate DevOps Technologies in an R Data Science Workflow Cloud computing
Over the last year I’ve become obsessed with trying to encourage the data science community to explore and exploit DevOps and cloud computing technology. This isn’t often an easy undertaking. Most people (data scientists or not) are skeptical of deviating from the tools and workflows they’ve come to rely on. This talk will feature case studies in developing data science products and workflows in the cloud, and how working with these tools can open up a world of new possibilities within the intersection of DevOps and data analytics. Key topics to discuss: how DataOps can address the growing scope of data science tasking; where to start when exploring cloud services; how to work through functionality/engineering challenges in a cloud environment; and case studies in data science product engineering and deployment.
14:00 Large spatial data Daniel Fryer TBD rcosmo: Statistical Analysis of the Cosmic Microwave Background visualisation, databases, space/time, big data, new R package
The Cosmic Microwave Background (CMB) is remnant electromagnetic radiation from the epoch of recombination. It is the most ancient and important source of data about the early universe and the key to unlocking the mysteries of the Big Bang and the structure of time and space. Spurred on by a wealth of satellite data, intensive investigations in the past few years have resulted in many physical and mathematical results to characterise CMB radiation. It can be modelled as a realisation of a homogeneous Gaussian random field on the sphere. But what does any of this matter to statisticians if they cannot play with the CMB data in their favourite programming language? A new R package, rcosmo, provides easy access to the CMB data and various tools for exploring geometric and statistical properties of the CMB. This talk will be a quick introduction to rcosmo by one of its developers, followed by an invitation for discussions and suggestions. This research was supported under the Australian Research Council's Discovery Project DP160101366.
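A hedged first-contact sketch; the function names reflect my reading of the rcosmo documentation and may differ in the released package, and the FITS file name is hypothetical:
    # install.packages("rcosmo")
    library(rcosmo)
    cmb <- CMBDataFrame("CMB_map_smica1024.fits")  # load a CMB map from a local FITS file
    summary(cmb)                                   # basic statistical summary
    plot(cmb)                                      # interactive view of the map on the sphere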
14:00 Large spatial data Marek Rogala TBD Using deep learning on Satellite imagery to get a business edge visualisation, algorithms, models, applications, web app, Satellite data
The talk is about new possibilities arising from applying deep learning to satellite imagery. Satellite data changes the game: it gives access to information not otherwise available to business today and lets us travel back in time. Combined with deep learning techniques, it delivers unique insights that have never been available before. Using deep learning on satellite data can deliver insights no human can. Satellite data is huge and non-obvious. By being able to go back to an arbitrary time in history, we can prevent fraud, build forecasts and observe events we wouldn’t otherwise have access to. We’ll explore a number of emerging use cases and the common traits behind them. I will show how our R department works with satellite data and how we use Shiny to build decision support systems for business. As an example of my previous talks, here is a link to my talk at useR! 2017 in Brussels: https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference/shinycollections-Google-Docs-like-live-collaboration-in-Shiny