A few months ago, Zeming Yu wrote My top 10 Python packages for data science. Like him, my preferred way of doing data analysis has shifted away from proprietary tools to these amazing freely available packages. I’d like to share some of my old-time favourites and exciting new packages for R. Whether you are an experienced R user or new to the game, I think there may be something here for you to take away.
No discussion of top R packages would be complete without the tidyverse. In a way, this is cheating because it bundles multiple packages: data analysis with dplyr, visualisation with ggplot2 and some basic modelling functionality. It also comes with a fairly comprehensive book that provides an excellent introduction to usage.
If you are getting started with R, it’s hard to go wrong with the tidyverse toolkit. And if you are just getting started with data analytics, check out our recent Insights session, Starting the Data Analytics Journey – Data Collection. Here’s the video, audio, and presentation. This and more can be found on our knowledge bank page.
library(tidyverse)

ggplot(mtcars, aes(mpg, disp, color = cyl)) +
  geom_point()
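The data analysis side of the tidyverse follows the same pattern. As a minimal sketch using the built-in mtcars data (the names mpg_by_cyl, mean_mpg and cars are illustrative choices, not from the original article):

```r
library(dplyr)

# Average fuel economy and number of cars for each cylinder count
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), cars = n()) %>%
  arrange(desc(mean_mpg))

print(mpg_by_cyl)
```

Each verb takes a data frame and returns a data frame, which is what makes the pipe chain readable from top to bottom.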
2. Need for speed? dtplyr
There has been a perception that R is slow, but with packages like data.table, R has the fastest data extraction and transformation package in the West. However, the dplyr syntax may be more familiar to those who use SQL heavily, and personally I find it more intuitive. So, dtplyr provides the best of both worlds, with dplyr syntax running on a data.table backend:
# Install from GitHub as at time of writing.
# install.packages("devtools")
# devtools::install_github("tidyverse/dtplyr")
library(dtplyr)
library(data.table)
library(dplyr)  # provides %>%, group_by() and summarise()

mtcars %>%
  lazy_dt() %>%
  group_by(cyl) %>%
  summarise(total.count = n()) %>%
  as.data.table()
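To see what dtplyr is doing under the hood, it can help to print the data.table call it generates. A small sketch, assuming a reasonably recent dtplyr release (the object name cyl_counts is illustrative):

```r
library(dtplyr)
library(dplyr)

# lazy_dt() records the dplyr verbs instead of running them straight away
cyl_counts <- lazy_dt(mtcars) %>%
  group_by(cyl) %>%
  summarise(total.count = n())

# Print the data.table expression that dtplyr generated
show_query(cyl_counts)

# Nothing is computed until the result is requested
result <- as_tibble(cyl_counts)
```

Because evaluation is deferred, several verbs can be combined into a single data.table call rather than run one at a time.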
3. Out of memory? Try disk.frame
One major limitation of R data frames and Python’s pandas is that they are in-memory datasets. Consequently, medium-sized datasets that SAS can easily handle will max out your work laptop’s measly 4GB RAM. The ideal solution would be to do those transformations on the data warehouse server, which would reduce data transfer and should, in theory, have more capacity. If it runs with SQL, dplyr probably has a backend through dbplyr. Alternatively, with cloud computing, it is possible to rent computers with up to 3,904 GB of RAM.
But for those with a habit of exploding the data warehouse, or those whose cloud solutions are blocked by IT policy, disk.frame is an exciting new alternative. It does require some additional planning with respect to data chunks, but maintains a familiar syntax – check out the examples on the page.
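As a rough sketch of the dbplyr approach, the example below uses an in-memory SQLite database as a stand-in for a data warehouse. This assumes the DBI, RSQLite and dbplyr packages are installed; the table name "cars" and the object names are illustrative only.

```r
library(DBI)
library(dplyr)

# An in-memory SQLite database stands in for the data warehouse
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars)

# tbl() gives a lazy reference; dbplyr translates the verbs to SQL
cars_summary <- tbl(con, "cars") %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

show_query(cars_summary)         # inspect the generated SQL
result <- collect(cars_summary)  # only the summarised rows come back to R

dbDisconnect(con)
```

The aggregation runs inside the database, so only the three summary rows are transferred into R's memory.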
The package stores data on disk, and so is only limited by disk space rather than memory…
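A minimal sketch of the workflow, assuming the disk.frame package is installed. mtcars is far too small to need this, so it is only here to show the syntax; the output folder, chunk count and object names are arbitrary choices.

```r
library(disk.frame)
library(dplyr)

setup_disk.frame(workers = 2)  # background workers process the chunks

# Split a data frame into chunks stored in a temporary folder on disk
cars_df <- as.disk.frame(
  mtcars,
  outdir = file.path(tempdir(), "cars.df"),
  nchunks = 2,
  overwrite = TRUE
)

# dplyr verbs are applied chunk by chunk; collect() brings the result into memory
efficient <- cars_df %>%
  filter(mpg > 25) %>%
  select(mpg, cyl, disp) %>%
  collect()

print(efficient)
```

The planning mentioned above comes from the chunking: operations that need to see all rows of a group at once depend on how the data was split.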
4. Parking it with parquet and Arrow
Running low on disk space once, I asked my senior actuarial analyst to benchmark different data storage formats: the “Parquet” format beat out SQLite, HDF5 and plain CSV, the latter by a wide margin. That experience is likely not unique, considering this article where the author squashes a 500GB dataset to a mere fifth of its original size.
If you have a heavy workload that needs distributed cluster computing, then sparklyr could be a good full-stack solution, with integrations for Spark SQL and the machine learning libraries xgboost, tensorflow and h2o.
But often you just want to write a file to disk, and all you need for that is Apache Arrow.
library(arrow)
write_parquet(mtcars, "test.parquet")
# Done!
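To run a similar size comparison yourself, here is a hedged sketch using synthetic data. The policies data frame and its columns are made up for illustration, and the size ratio you see will depend heavily on your own data.

```r
library(arrow)

# Synthetic data: 100,000 rows with a repetitive categorical column
set.seed(1)
policies <- data.frame(
  id = seq_len(1e5),
  state = sample(c("NSW", "VIC", "QLD"), 1e5, replace = TRUE),
  premium = round(runif(1e5, 500, 2000), 2)
)

csv_path <- tempfile(fileext = ".csv")
parquet_path <- tempfile(fileext = ".parquet")
write.csv(policies, csv_path, row.names = FALSE)
write_parquet(policies, parquet_path)

# Sizes on disk, in bytes
c(csv = file.size(csv_path), parquet = file.size(parquet_path))

# Reading the file back returns a data frame with the same columns
policies_back <- read_parquet(parquet_path)
```

Unlike CSV, Parquet also stores the column types, so there is no need to re-specify them when reading the file back.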
You may have seen earlier videos from Zeming Yu on LightGBM, myself on XGBoost and of course Minh Phan on CatBoost. Perhaps you’ve heard me extolling the virtues of h2o.ai for beginners and prototyping as well.
LightGBM has become my favourite now in Python. It is incredibly fast. It has the limitation that it can only grow trees leaf-wise, unlike XGBoost which has the flexibility to use traditional depth-wise growth as well, but its lower memory usage allows you to be greedier in putting large datasets into the model. Installing it for R, however, takes a few extra steps:
install.packages("devtools")
devtools::install_github("Laurae2/lgbdl")
library(lgbdl)
lgb.dl(commit = "master",
       compiler = "vs",
       repo = "https://github.com/microsoft/LightGBM")
If that looks too hard, that is why I would still recommend xgboost for R users at the present time. With either package it is fairly straightforward to build a model. Here we use a sparse matrix to convert categorical variables in a memory-efficient way, then model with xgboost:
library(xgboost)
library(Matrix)

# Road fatalities data - as previously seen in the YAP-YDAWG course
deaths <- read.csv("https://raw.githubusercontent.com/ActuariesInstitute/YAP-YDAWG-R-Workshop/master/bitre_ardd_fatalities_dec_2018.csv")

# Explain age of the fatality based on speed limit, road user and crash type
sparse_matrix <- sparse.model.matrix(Age ~ Speed.Limit + Road.User + Crash.Type, data = deaths)[, -1]

# "reg:squarederror" is the current name for the deprecated "reg:linear" objective
bst <- xgboost(data = sparse_matrix, label = deaths$Age, nrounds = 10, objective = "reg:squarederror")
xgb.importance(feature_names = colnames(sparse_matrix), model = bst,
               data = sparse_matrix, label = deaths$Age) %>%
  xgb.plot.importance()
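To make the sparse matrix step less of a black box, here is a small sketch of what sparse.model.matrix does to a categorical column. The four rows are made up for illustration and only borrow column names from the fatalities data.

```r
library(Matrix)

# Made-up rows that reuse column names from the fatalities data
toy <- data.frame(
  Age = c(23, 45, 67, 31),
  Speed.Limit = c(60, 100, 50, 60),
  Road.User = factor(c("Driver", "Passenger", "Driver", "Pedestrian"))
)

# Each non-baseline factor level becomes a 0/1 column; [, -1] drops the intercept
toy_matrix <- sparse.model.matrix(Age ~ Speed.Limit + Road.User, data = toy)[, -1]

colnames(toy_matrix)
print(toy_matrix)  # zeros are shown as dots and are not stored
```

With many categorical levels most entries are zero, which is where the memory saving over a dense matrix comes from.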
6. Nets: keras
Neural network models are generally better done in Python rather than R, since Facebook’s PyTorch and Google’s TensorFlow are built with it in mind. However, in writing Analytics Snippet: Multitasking Risk Pricing Using Deep Learning, I found RStudio’s keras interface to be pretty easy to pick up.
While most example usage and online tutorials will be in Python, they translate reasonably well to their R counterparts. The RStudio team were also incredibly responsive when I filed a bug report and had it fixed within a day.
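As an illustration of how closely the two map, here is a hedged sketch of a tiny regression network with the Python equivalents shown in comments. It assumes the keras R package with a working TensorFlow installation; the data is simulated and the layer sizes are arbitrary.

```r
library(keras)

# Simulated data: 200 rows, 4 numeric features, one numeric target
set.seed(42)
x <- matrix(rnorm(200 * 4), ncol = 4)
y <- x %*% c(1, -2, 0.5, 0) + rnorm(200, sd = 0.1)

# Python: Sequential([Dense(8, activation="relu", input_shape=(4,)), Dense(1)])
model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = c(4)) %>%
  layer_dense(units = 1)

# Python: model.compile(loss="mse", optimizer="adam")
model %>% compile(loss = "mse", optimizer = "adam")

# Python: model.fit(x, y, epochs=20, batch_size=32)
history <- model %>% fit(x, y, epochs = 20, batch_size = 32, verbose = 0)

preds <- predict(model, x)
```

The main differences are the pipe in place of method chaining and R vectors such as c(4) in place of Python tuples.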
7. Multimodel: MLR
Working with multiple models, say a linear model and a GBM, and being able to calibrate hyperparameters, compare results, benchmark and blend models can be tricky. This video on Applied Predictive Modelling by the author of the caret package explains a little more on what’s involved.
If you want to get up and running quickly, are happy to work with just GLM, GBM and dense neural networks, and prefer an all-in-one solution, h2o.ai works well. It does all those models, has good feature importance plots, and ensembles them for you with AutoML too, as explained in this video by Jun Chen from the 2018 Weapons of Mass Deduction video competition. Ensembling h2o models got me second place in the 2015 Actuaries Institute Kaggle competition, so I can attest to its usefulness.
mlr comes in for something more in-depth, with detailed feature importance, partial dependence plots, cross validation and ensembling techniques. It integrates with over 100 models by default and it is not too hard to write your own.
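A minimal sketch of benchmarking two models with mlr, assuming the mlr and rpart packages are installed. The choice of mtcars, the two learners and the 5-fold setup are illustrative only.

```r
library(mlr)

# Regression task: predict fuel economy from the other mtcars columns
task <- makeRegrTask(data = mtcars, target = "mpg")

# Two candidate models: a linear model and a regression tree
learners <- list(makeLearner("regr.lm"), makeLearner("regr.rpart"))

# 5-fold cross validation, shared by both learners
set.seed(1)
cv <- makeResampleDesc("CV", iters = 5)

bmr <- benchmark(learners, task, cv, measures = rmse, show.info = FALSE)
perf <- getBMRAggrPerformances(bmr, as.df = TRUE)
print(perf)
```

Because every learner sits behind the same task and resampling interface, swapping in another model is usually a one-line change to the learners list.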
There is a handy cheat sheet.
Visualisation and Presentation
8. Too technical for Tableau (or too poor)? flexdashboard
Actioning insights from modelling analysis generally involves some kind of report or presentation. Rarely, you may want to serve R model predictions directly, in which case OpenCPU may get your attention, but generally it is a distillation of the analysis that is needed to justify business change recommendations to stakeholders.
Flexdashboard offers a template for creating dashboards from RStudio with the click of a button. It extends R Markdown, using Markdown headings and code to signpost the panels of your dashboard.
Interactivity similar to Excel slicers or VBA-enabled dropdowns can be added to R Markdown documents using Shiny. To do so, add ‘runtime: shiny’ to the header section of the R Markdown document. This is great for live or daily dashboards. It is also possible to produce static dashboards using only Flexdashboard and distribute them over email for reporting with a monthly cadence.
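To show how the headings and the header fit together, the sketch below writes a minimal dashboard source file from R. The title, panel heading and chart are placeholders, and the file is written to a temporary folder.

```r
# Markdown code fences are built with strrep() to keep this snippet readable
fence <- strrep("`", 3)

# Lines of a minimal dashboard source file
dashboard_lines <- c(
  "---",
  "title: \"Example dashboard\"",
  "output: flexdashboard::flex_dashboard",
  "runtime: shiny",
  "---",
  "",
  "Column",
  "-------------------------------------",
  "",
  "### Fuel economy by engine size",
  "",
  paste0(fence, "{r}"),
  "renderPlot(plot(mtcars$disp, mtcars$mpg))",
  fence
)

rmd_path <- file.path(tempdir(), "dashboard.Rmd")
writeLines(dashboard_lines, rmd_path)

# rmarkdown::run(rmd_path)  # launches the Shiny-backed dashboard locally
```

The dashed line under "Column" starts a column and the level-three heading starts a panel within it. For a static dashboard, drop the runtime line, replace renderPlot() with a plain plot() call and use rmarkdown::render() instead.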
9. HTML Charts: plotly
Different language, same package. Plot.ly is a great package for web charts in both Python and R. The documentation steers towards the paid server-hosted options, but using it for offline charting is free even for commercial purposes. The interface is clean, and charts embed well in R Markdown documents.
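A short sketch of the offline usage, assuming the plotly package is installed (object names are illustrative):

```r
library(plotly)

# Interactive scatter plot; hovering shows the values for each car
fig <- plot_ly(
  data = mtcars, x = ~mpg, y = ~disp, color = ~factor(cyl),
  type = "scatter", mode = "markers"
)
fig

# An existing ggplot2 chart can also be converted
library(ggplot2)
gg <- ggplot(mtcars, aes(mpg, disp, colour = factor(cyl))) + geom_point()
fig_from_gg <- ggplotly(gg)
```

Both objects are HTML widgets, so printing them inside an R Markdown chunk embeds the interactive chart in the rendered document.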
Check out an older example using plotly with Analytics Snippet: In the Library.
10. Explain it like I'm five: DALEX
Also featured in the YAP-YDAWG-R-Workshop, the DALEX package helps explain model predictions. Like mlr above, it offers feature importance, actual vs model predictions and partial dependence plots:
library(DALEX)

xgb_expl <- explain(model = bst, data = sparse_matrix, y = deaths$Age)
resp <- variable_response(xgb_expl, "Speed.Limit", "pdp")
plot(resp)
Yep, that looks like it needs a bit of cleaning (check out the course materials), but the key use of DALEX in addition to mlr is individual prediction explanations:
brk <- prediction_breakdown(xgb_expl, observation = sparse_matrix[1, , drop = FALSE])
plot(brk)
We have taken a journey through ten amazing packages covering the full data analysis cycle: data preparation, with a few solutions for managing “medium” data; models, with crowd favourites for gradient boosting and neural network prediction; and finally actioning business change through dashboards and explanatory visualisations. We covered most of the runners-up too. I would recommend exploring the resources in the many links as well; there is a lot of content that I have found to be quite informative.
Did I miss any of your favourites? Let me know in the comments!