The popular Data Science Sydney Meetup Group recently heard from the Chief Data Scientist for DataRobot. The talk showed that actuaries, statisticians and data scientists can complement one another and learn important practices from each other.
I recently attended a session of the Data Science Sydney (DSS) Meetup. This meetup group has 5,600 members and is run by Eugene Dubossarsky, who gave a brilliant Keynote Address at the Actuaries Institute’s recent Data Analytics Seminar. The group meets regularly at the CBA premises near Haymarket. DSS meetups are generally significantly oversubscribed, with waiting lists to attend. There are networking opportunities, pizza and beverages at the start and end. Past presentations can be accessed at the Data Science Sydney YouTube Channel.
Two hundred people attended the most recent meetup, featuring Xavier Conort, Chief Data Scientist for DataRobot. Xavier is a French actuary with a Masters in Statistics and Actuarial Science. He has held senior actuarial and risk positions at the CNP group and AXA Insurance. Since 2011 he has transitioned into data science, and at DataRobot he leads global data science R&D from Singapore. He is a former #1-ranked data scientist on Kaggle and has applied machine learning to diverse business problems, from claims modelling to flight arrival prediction, essay scoring, sales forecasting and biological response prediction. Read more about Xavier on his LinkedIn profile.
The topic was “How Statisticians and Data Scientists could learn from each other”. Xavier argued that data scientists have been highly successful at automating modelling through machine learning, extracting powerful insights at a rapid pace. In contrast, statisticians have been manually building complex and robust models using Generalized Linear Models (GLMs). GLMs are little known by data scientists, while statisticians may dismiss machine learning tools they find too complex as “black boxes”. The talk covered what data scientists, actuaries and statisticians can learn from each other, and how they can bridge skill gaps. The XGBoost package, one of the most popular open source projects, is a good example of such collaboration.
Learnings from actuarial work
Xavier noted that the actuarial mindset taught him about dealing with skewed distributions, rare event modelling, and the fact that risk generally behaves multiplicatively rather than additively. Actuarial work also taught him about pricing, commercial and regulatory constraints: ensuring pricing doesn’t increase too much from year to year, and the need to provide transparency for stakeholders. He highlighted the offset as a feature that allows actuaries to incorporate constraints or strengthen their modelling strategy, for example by ensuring predicted values are proportional to exposure, applying discounts, or including prior effects derived from other sources. He also acknowledged that the two-stage process by which actuaries build models (a first stage focused on primary features that are fully trusted, and a second stage capturing the marginal effects of features that are less trusted or less available) is sound.
The presentation gave an introduction to GLMs and their underlying assumptions which led to a discussion around their usefulness and their limitations. It referenced Towers Watson’s “A practitioner’s guide to generalized linear models”. A similar discussion was covered in this pair of articles on Actuaries Digital from 2017.
The biggest drawback of GLMs is the time it takes to build them compared with machine learning models. GLMs are poor at picking up cross effects (interactions), because you need to know an interaction exists before you can model it. They are good, however, at taking into account the shape of relationships, since functions can be forced into the model, such as a monotonic effect that increases with, say, sum insured.
GLMs require a lot of manual fixing and structuring of models and factors, which takes time, and they are not very practical when there are many variables. Machine learning techniques can build models very rapidly, but risk over-fitting: following the noise and creating nonsensical relationships between logically related parameters in neighbouring classification bins.
Discussion continued on XGBoost. One of the most popular open source machine learning packages, it now supports the standard actuarial modelling distributions (Poisson, Gamma and Tweedie) as well as offset functionality. It also supports monotonic constraints in tree construction, e.g. forcing an upward slope on sum insured. The package is evolving continually.
Hinton’s Dark Knowledge
Xavier pointed to opportunities to better understand complex models, such as Hinton’s approach to extracting the knowledge embedded in deep learning models. Training a simpler model on the predictions of a complex model can yield a model that is more transparent and retains much of the accuracy, though it is less granular. This is similar to what we often do in actuarial work when we fit a simpler model over our detailed component claim type and size models to assist implementation in a rating algorithm.
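The idea can be sketched as follows (a minimal illustration on synthetic regression data, not Xavier’s actual workflow): a complex “teacher” model is trained on the data, and a shallow, transparent “student” model is then fitted to the teacher’s predictions rather than the raw labels:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for, say, claims experience data.
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0,
                       random_state=0)

# "Teacher": a complex, hard-to-interpret model.
teacher = GradientBoostingRegressor(random_state=0).fit(X, y)

# "Student": a depth-3 tree distilled from the teacher's predictions;
# far more transparent, at the cost of granularity.
student = DecisionTreeRegressor(max_depth=3, random_state=0)
student.fit(X, teacher.predict(X))
```

The student's eight leaves can be read directly as a rating table, much like the simplified model layered over detailed component models in actuarial practice.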
Discussion continued on other ways to impose meaningful structure on more automated machine learning methods. One example is the fused lasso. Within the penalised regression framework, it ties the parameters of neighbouring modelling bins together, avoiding some of the problems of over-fitting.
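In its simplest form, the fused lasso adds an L1 penalty on the differences between coefficients of neighbouring bins, so adjacent bins are pulled toward sharing a value. A minimal sketch of the objective on synthetic per-bin rates (not the fused lasso package itself, and the penalty weight is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Noisy observed rates for six neighbouring bins; the true signal is
# piecewise constant, so neighbouring bins should share parameters.
true = np.array([0.1, 0.1, 0.1, 0.3, 0.3, 0.3])
obs = true + rng.normal(0.0, 0.05, 6)

def fused_lasso_loss(beta, lam=0.1):
    # squared-error fit plus an L1 penalty tying neighbouring bins together
    return np.sum((obs - beta) ** 2) + lam * np.sum(np.abs(np.diff(beta)))

# Start from the raw observations and smooth toward the fused solution.
beta = minimize(fused_lasso_loss, obs, method="Powell").x
```

The penalty shrinks spurious jumps between adjacent bins while still allowing a genuine step (here between bins three and four) to survive.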
Xavier talked about Kaggle, saying it was a good way to connect with people, especially in isolated environments like Singapore, where he is based. When working in teams, he said, members would often each model the problem individually and share outputs with each other, only coming together to co-create in the last two weeks. This generated more ideas, reminiscent of the Delphi technique, which yields a greater universe of ideas because it avoids anchoring. He also said he learnt a lot from participating in Kaggle competitions. Interested readers can dive right into Kaggle. For others, the Institute is planning to run some Kaggle-based events later in 2019.
When to use data science vs other traditional techniques
In deciding whether to construct a more traditional model, you need to weigh up the impact of wrong predictions against the risk of not understanding all the effects built into a model. If the risk is low, using straight data science techniques is fine for speed. If not, you need to do more manual work, for example spending time removing features. Data science techniques can be used in the first instance to identify the key factors, and a GLM can then be fitted using that knowledge.
“It’s like your kids, you first teach them values then you send them to school,” said Xavier.
When you want to explain something, sometimes fitting a simple model over data science models may be better for understanding and transparency.
Statisticians used to think machine learning was a black box, but this is increasingly not the case. They can now learn from machine learning insights. Data Science took time to embrace the practices of actuaries and statisticians. On the flipside, actuaries and statisticians have resisted machine learning innovations. We can now observe interest on both sides and see that more machine learning algorithms support features that are essential for actuaries, but also useful for data scientists.
New data science technologies need to be understood and embraced by actuaries as part of our new toolkit; otherwise there is a risk that data scientists may produce similar models and results to actuaries, but faster. Hinton’s Dark Knowledge and the fused lasso are good examples of how practices from the two communities can merge.
We should expect to see more innovations that combine ideas from the two communities in the future.
CPD: Actuaries Institute Members can claim two CPD points for every hour of reading articles on Actuaries Digital.