Automatic machine learning – Zeming’s data analytics ‘meetup’ files
It’s the year 2025. As an actuary for a major software vendor, you’ve just been asked by your boss to figure out why the algorithm liability claims are deteriorating.
You turn on your computer and start talking. “OK computer, pull the last 4 years of claims data for me.”
“No problem. Estimated time to completion: 20 seconds… Done.”
The screen starts to display some key information about the dataset – 1.4 million rows, 25 columns. Payments have all been reconciled to the bank statements.
“OK computer, start doing exploratory analysis.”
“Sure.” In a few seconds, the screen starts displaying all sorts of histograms, stacked bar charts and time series charts.
“That’s too much. I need to get back to my boss in a few hours, give me the key insights please.” You become a bit impatient.
“The overall claim frequency is stable. There has been a sharp increase in the average claim size for the drone software liability product. I recommend building a predictive model on historical data and then running an actual vs expected analysis on more recent data to figure out what’s going on. A thorough analysis will take an hour, but I can report back some initial findings in 5 minutes. Is that OK?”
You go to the kitchen to make a coffee. Walking through the office floor, you see dozens of other actuaries talking to their own computers. The noise-cancelling feature of the headsets has become so good that you can hardly hear a word they say. The workplace has become a bit like a gym. Some people are running on treadmills while others are on cross-trainers. Everyone is still working, of course, as working is nothing but having a friendly chat with your computer, and multi-tasking is a piece of cake.
You sometimes miss the good old days when you had to sit down and write SAS programs to get things done. Oh well, the world has moved on.
Back to reality
Only time will tell whether the above becomes a reality. For now, let’s come back to 2017 and see how automatic machine learning is done.
Jose Magana (Data Scientist at CBA) presented to the Sydney Users of R Forum (SURF) on Automatic Machine Learning in R.
Automatic machine learning can currently be done with either proprietary software or open source solutions. If there’s enough budget, DataRobot is certainly worth trying.
If there’s no budget, you can always try the open source solutions.
- H2O offers AutoML in both Python and R.
- In Python, there’s auto-sklearn.
- In R, there’s the caret package and the focus of this meetup – the MLR package.
Jose provided a comprehensive demonstration on using the MLR package to predict survival using the Titanic dataset.
I’m very impressed by the package. From missing data imputation and automatic feature selection to training a range of different models and cross validation, this package can do it all!
Admittedly it’s not quite like DataRobot where you just click a few buttons and wait for the magic to happen, but the coding doesn’t look too bad. You basically have to tell the package what to do at a high level (for example, use these imputation methods and these algorithms). Jose showed us that you only need about 30 lines of R code to complete the entire process end-to-end.
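To give a feel for what "telling the package what to do at a high level" means, here is a minimal sketch of the idea. It is not Jose's code and it does not use the MLR package. It is written in Python with only the standard library, and the toy data and every name in it (`ROWS`, `IMPUTERS`, `LEARNERS`, `auto_ml` and so on) are made up for illustration. You list the imputation methods and algorithms you want, and a small loop cross-validates every combination and reports the best one.

```python
from statistics import mean, median

# Made-up toy rows: (age, fare, survived). None marks a missing value.
# This is NOT the Titanic dataset, just illustrative numbers.
ROWS = [
    (22.0, 7.3, 0), (38.0, 71.3, 1), (26.0, 7.9, 1), (35.0, 53.1, 1),
    (None, 8.5, 0), (54.0, 51.9, 0), (2.0, 21.1, 0), (27.0, 11.1, 1),
    (14.0, 30.1, 1), (None, 16.7, 1), (58.0, 26.6, 1), (20.0, 8.1, 0),
]


def impute(train, test, stat):
    """Fill missing features with a statistic computed on the training fold only."""
    n_features = len(train[0]) - 1
    fills = [stat([r[j] for r in train if r[j] is not None])
             for j in range(n_features)]

    def fix(rows):
        return [tuple(fills[j] if r[j] is None else r[j]
                      for j in range(n_features)) + (r[-1],) for r in rows]

    return fix(train), fix(test)


def majority_class(train):
    """Baseline learner: always predict the most common training label."""
    labels = [r[-1] for r in train]
    winner = max(set(labels), key=labels.count)
    return lambda features: winner


def nearest_neighbour(train):
    """1-nearest-neighbour learner using squared Euclidean distance."""
    def predict(features):
        closest = min(train, key=lambda r: sum(
            (a - b) ** 2 for a, b in zip(r[:-1], features)))
        return closest[-1]
    return predict


def cross_validate(rows, stat, learner, k=4):
    """Mean accuracy over k folds; fold i holds out every k-th row from i."""
    scores = []
    for i in range(k):
        test = [r for idx, r in enumerate(rows) if idx % k == i]
        train = [r for idx, r in enumerate(rows) if idx % k != i]
        train, test = impute(train, test, stat)
        predict = learner(train)
        hits = sum(predict(r[:-1]) == r[-1] for r in test)
        scores.append(hits / len(test))
    return mean(scores)


# The "high level" instructions: which imputers and which algorithms to try.
IMPUTERS = {"mean": mean, "median": median}
LEARNERS = {"majority": majority_class, "1-nn": nearest_neighbour}


def auto_ml(rows, imputers=IMPUTERS, learners=LEARNERS, k=4):
    """Cross-validate every imputer/learner pair and return the best pair."""
    results = {(i_name, l_name): cross_validate(rows, stat, learner, k)
               for i_name, stat in imputers.items()
               for l_name, learner in learners.items()}
    best = max(results, key=results.get)
    return best, results


best, results = auto_ml(ROWS)
for combo, accuracy in sorted(results.items()):
    print(combo, round(accuracy, 3))
print("best:", best)
```

A real package does far more than this (feature selection, hyperparameter tuning, many more algorithms), but the shape is the same: the user supplies the menu of options and the tool handles the loop.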
I can imagine that as this area matures, and as natural language processing advances to the point where instructions can be given in simple conversation, the 2025 scenario described above may well become a reality!
That’s all for now. Wishing you a Merry Christmas and a Happy New Year!