Hugh Miller comments on the plethora of packages available to data scientists, and its implications for businesses.
Most surveys of popular data science languages show a steady rise for R and Python. The rise of these free tools runs counter to the trend in other types of software, where polished commercial products still dominate (Microsoft Office, anyone?). While cost is a factor, there is another huge reason: packages. There are now tens of thousands of packages written to achieve very specific tasks. They range from the big (the Rcpp package now drives much of R’s higher-performance functionality) to the tiny (like making your computer beep).
Navigating the maze of packages can be daunting. Partly for this reason, the Institute is running a competition right now asking people to share their favourite packages and how they can be used to make their workday easier. Check it out and get involved. There are other ways to find good packages too: you can keep an eye on what people are using in Kaggle kernels, or look for people who do the hard yards for you by reviewing new packages.
One of my favourite things is finding people who’ve used a bunch of packages to solve a really niche problem. For instance, this article describing how to auto-generate fantasy-world maps using Python is amazing. In a similar vein, there’s some nice computer-generated art that can be made using some packages and a few lines of code.
The rise of packages also fundamentally changes how we think about doing data science. Rather than spending significant development time writing and implementing functions (though there’s still a place for this), we can find something off the shelf that is near enough and prototype a method much faster. Deep programming knowledge can partly be substituted by the ability to cobble together solutions from existing pieces.
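As a rough illustration of the off-the-shelf approach, the sketch below tidies misspelt city names using `get_close_matches` from Python's built-in `difflib` module rather than a hand-written string-similarity routine. The city list, the messy inputs and the `clean_city` helper are all invented for this example, and the 0.6 similarity cutoff is simply the library default, not a recommendation.

```python
from difflib import get_close_matches

# Hypothetical reference list, invented for illustration.
KNOWN_CITIES = ["Sydney", "Melbourne", "Brisbane", "Perth", "Adelaide"]


def clean_city(raw, known=KNOWN_CITIES, cutoff=0.6):
    """Return the closest known city name, or None if nothing is similar enough."""
    # Normalise whitespace and capitalisation, then let difflib do the matching.
    candidate = raw.strip().title()
    matches = get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical messy inputs, as might come from free-text form entries.
messy = ["sydny", "Melborne ", "brisbane", "Hobart"]
cleaned = [clean_city(name) for name in messy]
print(cleaned)
```

The result is "near enough" rather than perfect: anything that does not resemble a known name comes back as `None` and still needs a human decision, but a working prototype takes a dozen lines instead of an afternoon.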
This approach to data science relies on a culture of sharing and goodwill that seems to permeate the R and Python communities.
People are keen to share their research (and have their code used) and will often point to their public GitHub repositories. This has helped lower the barriers between academic research and business application. It also supports the reproducible research movement, where people are invited to check an analysis in more depth than was possible with a traditional journal article.
The open package movement raises natural questions around support, quality assurance and updates. Nobody wants a crucial model dependency to disappear or fail at the wrong time. These are issues to be balanced against the significant benefits of better code availability. One answer is to keep building in-house expertise so that someone is keeping an eye on the tools being used. Another is to look to commercial providers who’ve seen the opportunity in supporting this type of development in a more formal way (like RStudio or Anaconda).
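One simple way of keeping an eye on dependencies in-house is to check that the installed package versions are the ones a model was built and tested with. The sketch below does this with `importlib.metadata` from the Python standard library (Python 3.8 or later). The `check_pins` helper and the pinned versions are hypothetical; in practice the pins would more likely come from a requirements or lock file.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins, lookup=version):
    """Compare expected package versions with what is actually installed.

    Returns a dict mapping package name to a description of the problem.
    An empty dict means every pin matched.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if installed != expected:
            problems[name] = f"expected {expected}, found {installed}"
    return problems


# Hypothetical pins, chosen only for illustration.
PINS = {"numpy": "1.26.4", "pandas": "2.2.2"}

report = check_pins(PINS)
for package, problem in report.items():
    print(f"{package}: {problem}")
```

Running a check like this before a model run will not stop a package from being withdrawn, but it does flag a changed or missing dependency before it causes a failure at the wrong time.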
All of this makes it an exciting, and at times bewildering, time to be building models and writing code.
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivatives CC BY-NC-ND Version 3.0 (CC Australia ported licence).
CPD: Actuaries Institute Members can claim two CPD points for every hour of reading articles on Actuaries Digital.