Oliver Chambers and Andrew Bird made the winning submission for the Actuaries Institute’s inaugural Kaggle competition in 2015. Here, they explain how they explored different features in the data to predict mortality at locations around Australia.
The objective of the Kaggle competition was to predict the number of deaths in each SA2 region of Australia based on a subset of the 2011 census data covering demographic, socioeconomic and hospital information. We used a range of different approaches, utilising Tableau, Python, and R to explore the data, identify features, and build a predictive model.
At a high level our process involved:
- Using Tableau to explore features in the data such as mining sites, interaction between different variable dimensions, and clustering of residuals generated by our models in R.
- Using the insights from Tableau to develop algorithms (implemented in Python) to create new features in the data.
- Building an ensemble model in R from these features, and testing its power using k-fold cross-validation and scores on the public leaderboard.
- Repeating this process ad nauseam.
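The k-fold cross-validation step in the loop above can be sketched as follows. This is a minimal illustration, not the authors' code: the one-variable least-squares model stands in for the real ensemble, RMSE is an assumed error metric (the article does not name the competition metric), and the toy data and function names are hypothetical.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Split row indices 0..n_rows-1 into k shuffled, near-equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a stand-in for a real model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope

def cross_val_rmse(xs, ys, k=5, seed=0):
    """Average held-out RMSE across k folds (RMSE is an assumed metric)."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        scores.append((sum(sq_err) / len(sq_err)) ** 0.5)
    return sum(scores) / len(scores)

# Hypothetical toy data: elderly population vs. deaths per region.
elderly = [120, 340, 560, 80, 910, 430, 270, 650, 150, 780]
deaths = [6, 17, 29, 4, 45, 22, 13, 33, 8, 40]
cv_rmse = cross_val_rmse(elderly, deaths, k=5)
```

Each region is held out exactly once, so the averaged score estimates how a candidate feature set performs on data the model has not seen.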
Our final model was an ensemble of several models fit in R:
- A Gradient Boosting Model (mboost)
- A Bayesian GLM (arm)
- A Cubist (cubist)
- A Linear Model (base R)
The Cubist was by far the most powerful model. It is a tree-based model with a linear model at each node. It contributed the most to our final prediction.
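The article does not say how the four models were combined. One common choice is a weighted average of their predictions, sketched below under that assumption. The weights and predicted values are hypothetical, chosen only so that the Cubist model carries the largest share.

```python
def blend(predictions, weights):
    """Weighted average of per-model predictions for each region."""
    total = sum(weights.values())
    n_rows = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in weights) / total
        for i in range(n_rows)
    ]

# Hypothetical predicted deaths for three regions from each model.
predictions = {
    "gbm": [10.0, 42.0, 7.0],
    "bayes_glm": [12.0, 40.0, 9.0],
    "cubist": [11.0, 44.0, 8.0],
    "linear": [13.0, 38.0, 10.0],
}
# Hypothetical weights, with the Cubist model given the largest share.
weights = {"gbm": 0.2, "bayes_glm": 0.15, "cubist": 0.5, "linear": 0.15}
blended = blend(predictions, weights)
```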
Feature Extraction and Selection
We invested significant effort exploring different features in the data.
A problem we encountered when selecting features for our linear model was that adding a new feature could change the significance of the others: a previously strong feature might become insignificant, or a variable previously excluded for being insignificant might become significant. This occurred because there were a large number of correlated (even collinear) factors in the data. The obvious way to handle this issue was to use Principal Component Analysis to reduce the dimension of the data. The drawback of PCA is that the reduced features (i.e. the orthogonal principal components) are difficult to interpret or to use for new feature creation. We therefore employed random feature selection to find a subset of features that ‘worked well’ together (i.e. were only weakly correlated with one another) and that we could use to draw meaningful inferences.
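A random search over feature subsets might look like the sketch below. The article does not describe how each subset was scored, so the score here is an assumed stand-in: it rewards correlation with the target and penalises the strongest pairwise correlation within the subset. In practice a cross-validated model error would be a more natural score. All names and data are hypothetical.

```python
import itertools
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

def subset_score(features, target, subset):
    """Assumed stand-in score: mean relevance minus worst redundancy."""
    relevance = sum(abs(pearson(features[f], target)) for f in subset) / len(subset)
    pairs = itertools.combinations(subset, 2)
    redundancy = max(
        (abs(pearson(features[a], features[b])) for a, b in pairs), default=0.0
    )
    return relevance - redundancy

def random_subset_search(features, target, size, n_trials=200, seed=0):
    """Sample random feature subsets and keep the best-scoring one."""
    rng = random.Random(seed)
    names = sorted(features)
    best_subset, best_score = None, float("-inf")
    for _ in range(n_trials):
        subset = tuple(sorted(rng.sample(names, size)))
        score = subset_score(features, target, subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Hypothetical toy data; "elderly_x2" is deliberately collinear with "elderly".
features = {
    "elderly": [100, 200, 300, 400, 500, 600, 700, 800],
    "elderly_x2": [200, 400, 600, 800, 1000, 1200, 1400, 1600],
    "widows": [12, 18, 35, 38, 55, 58, 72, 85],
    "median_income": [50, 70, 40, 80, 45, 75, 55, 65],
}
deaths = [5, 11, 14, 21, 24, 31, 34, 41]
best_subset, best_score = random_subset_search(features, deaths, size=2)
```

Unlike PCA, the selected subset consists of original columns, so each retained feature keeps its plain meaning.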
Some observations we drew from this exercise:
- The average death rate and the size of the elderly population were strong predictors of the number of deaths, as you would expect. However, we were initially surprised to find that they were not as strong as the number of widows. We reasoned that not only are widows more likely to be old, but a married couple is also exposed to the same environmental factors and lifestyle habits. The death of one partner would indicate a high expected mortality for the other, so it is a stronger predictor than old age alone.
- The number of individuals who own a house outright was also a good predictor. We suspected this was due to the correlation with age and socioeconomic status, though it was unclear why this wouldn’t have already been captured in the age-specific dimensions.
Similarly, we were initially surprised that median income bore very little relation to mortality despite our efforts to coerce it. We speculate that this is because the median income is a statistic reflecting the working-age population, whereas death normally occurs in old age or retirement. Home ownership is therefore a better indicator of socioeconomic status for the elderly population than median income.
Exploring Data in Tableau
Tableau was very useful for exploring data and identifying relationships between variables.
We plotted all of the SA2 regions on a map of Australia and looked at key ratios (male to female ratio, male workforce to total population, number of unemployed, etc).
One illustrative example was using Tableau to identify potential mining sites:
In the graphic above we can see all SA2 regions across two dimensions: the size of the bubble shows the population size and the colour scale indicates the median income in that region. Ignoring regions on the coast / near capital cities, potential mining sites are readily identifiable (some have been circled).
Unfortunately the mine locations had little predictive power. We reasoned that this was due to the high degree of automation and stringent health & safety at mines in Australia. Perhaps it would be a better predictor of TPD claims, but not necessarily number of deaths.
Tableau was also useful for identifying potential indicator variables across multiple dimensions. The chart below shows the relationship between indigenous status and proportion of the population renting. The size of each bubble is the size of the error (actual less predicted deaths) from an early version of our model. Orange bubbles are overestimates, green bubbles are underestimates and the Xs are the points in the test set. A visual inspection suggests that when there is a high proportion of the population renting and a high proportion of females who did not state their indigenous status, our model underestimates the number of deaths. Therefore, we may get additional predictive power from including an indicator variable which is 1 for regions with proportion renting greater than 10% and Indigenous Status F not stated greater than 500.
It is difficult to guess where these relationships will exist: there is no obvious reason why renting vs. indigenous status would provide additional information. Therefore, we used Python to automate this analysis.
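The indicator described above, and one way such a search over variable pairs could be automated, are sketched below. Only the 10% and 500 cut-offs come from the article. The column names, the toy data, and the ranking rule (mean residual among regions above the median on both variables) are assumptions made for illustration. Residuals follow the article's convention of actual less predicted deaths.

```python
import itertools
from statistics import mean, median

def make_indicator(rows, col_a, cut_a, col_b, cut_b):
    """1 where both columns exceed their cut-offs, else 0."""
    return [int(r[col_a] > cut_a and r[col_b] > cut_b) for r in rows]

def rank_indicator_pairs(rows, residuals, columns, min_size=2):
    """Rank column pairs by |mean residual| where both exceed their median."""
    cuts = {c: median(r[c] for r in rows) for c in columns}
    ranked = []
    for a, b in itertools.combinations(columns, 2):
        flags = make_indicator(rows, a, cuts[a], b, cuts[b])
        inside = [res for res, flag in zip(residuals, flags) if flag]
        if len(inside) >= min_size:  # skip groups too small to trust
            ranked.append((abs(mean(inside)), a, b))
    return sorted(ranked, reverse=True)

# Hypothetical regions and residuals from an early model.
rows = [
    {"prop_renting": 0.05, "indig_f_not_stated": 120, "median_income": 900},
    {"prop_renting": 0.25, "indig_f_not_stated": 650, "median_income": 700},
    {"prop_renting": 0.30, "indig_f_not_stated": 800, "median_income": 1100},
    {"prop_renting": 0.08, "indig_f_not_stated": 700, "median_income": 650},
    {"prop_renting": 0.40, "indig_f_not_stated": 300, "median_income": 1000},
    {"prop_renting": 0.12, "indig_f_not_stated": 90, "median_income": 800},
]
residuals = [0.5, 9.0, 11.0, -1.0, 0.8, -0.6]

# The indicator from the text: renting > 10% and "F not stated" > 500.
indicator = make_indicator(rows, "prop_renting", 0.10, "indig_f_not_stated", 500)

columns = ["prop_renting", "indig_f_not_stated", "median_income"]
ranked = rank_indicator_pairs(rows, residuals, columns)
```

Pairs at the top of the ranking are candidates for a closer look in a chart before being added to the model.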
Another method we employed was analysing the residuals of our prediction to look for hotspots where our model over/underestimated the number of deaths (revealing a hidden geographic feature). An illustrative example for the state of Victoria is provided below based on an early version of our model. We can observe that in the western suburbs and past Frankston to the east, there is a reasonable mix of over/underestimates (orange / green bubbles) as would be expected if the residuals were random white noise. However, the inner eastern suburbs are clearly dominated by underestimates. Similarly, there is a cluster of SA2 regions near Geelong that are underestimated. We, therefore, hypothesised that creating an indicator variable on the eastern suburbs of Melbourne would improve our model.
After trying to identify these groups by hand, we took a more robust approach and wrote a simple nearest neighbour clustering algorithm in Python. We looked for single points whose nearest neighbours (in the train set) all over/underestimated the actual deaths by a significant amount. Then we captured points in the test set that sat within a circle enclosing these points.
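A minimal sketch of this kind of hotspot search follows. The article gives no parameters, so the number of neighbours `k`, the residual `threshold`, the use of plain Euclidean distance on coordinates, and the choice of a circle centred on the point and reaching its farthest neighbour are all assumptions. The data and function names are hypothetical.

```python
import math

def find_hotspots(train, k=3, threshold=5.0):
    """Return (centre, radius, sign) for train points whose k nearest
    neighbours all share a large residual of the same sign."""
    hotspots = []
    for i, (xy, resid) in enumerate(train):
        if abs(resid) < threshold:
            continue
        nearest = sorted(
            (math.dist(xy, other_xy), other_resid)
            for j, (other_xy, other_resid) in enumerate(train)
            if j != i
        )[:k]
        sign = 1 if resid > 0 else -1
        if len(nearest) == k and all(sign * r >= threshold for _, r in nearest):
            radius = nearest[-1][0]  # circle reaches the farthest neighbour
            hotspots.append((xy, radius, sign))
    return hotspots

def flag_test_points(test_points, hotspots):
    """+1 or -1 if a test point falls inside any hotspot circle, else 0."""
    flags = []
    for xy in test_points:
        flag = 0
        for centre, radius, sign in hotspots:
            if math.dist(xy, centre) <= radius:
                flag = sign
                break
        flags.append(flag)
    return flags

# Hypothetical train set: ((x, y), residual), residual = actual - predicted.
train = [
    ((0.0, 0.0), 8.0), ((0.1, 0.0), 7.0), ((0.0, 0.1), 9.0), ((0.1, 0.1), 6.5),
    ((5.0, 5.0), 0.5), ((5.1, 5.0), -0.4), ((5.0, 5.1), 0.3), ((5.1, 5.1), -7.0),
]
hotspots = find_hotspots(train)
test_flags = flag_test_points([(0.05, 0.05), (5.05, 5.05), (2.0, 2.0)], hotspots)
```

The resulting flags can be added to the test set as an indicator feature. A single large residual surrounded by small ones, such as the last train point here, is not treated as a hotspot.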
The technique that ultimately provided the largest improvement to our model was to feed our estimate of predicted deaths back into the model as a new feature. Implicitly, this means that predictions from one model are used to boost another. This proved rather successful, and we iterated the procedure several times.
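One round of feeding predictions back as a feature can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: a crude two-bucket mean model stands in for the first-stage model, a small least-squares fit stands in for the second, and the data and function names are hypothetical. The first stage has to differ in form from the second, since a linear model gains nothing from a feature that is itself a linear combination of its inputs.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(coefs, rows):
    return [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]

def sse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

def bucket_predict(xs, y, edge):
    """First-stage stand-in: mean of y below vs. at or above `edge`."""
    low = [t for x, t in zip(xs, y) if x < edge]
    high = [t for x, t in zip(xs, y) if x >= edge]
    means = (sum(low) / len(low), sum(high) / len(high))
    return [means[0] if x < edge else means[1] for x in xs]

# Hypothetical data with a step that a straight line cannot capture.
elderly = [100, 200, 300, 400, 600, 700, 800, 900]
deaths = [5, 10, 16, 20, 40, 45, 49, 55]

base_rows = [[x] for x in elderly]
base_sse = sse(deaths, predict(fit_ols(base_rows, deaths), base_rows))

# Feed the first-stage prediction back in as an extra feature column.
stage1_pred = bucket_predict(elderly, deaths, edge=500)
stacked_rows = [[x, p] for x, p in zip(elderly, stage1_pred)]
stacked_sse = sse(deaths, predict(fit_ols(stacked_rows, deaths), stacked_rows))
```

Here the errors are measured in-sample for brevity. In common stacking practice the fed-back predictions are generated out-of-fold so that the second model is not rewarded for memorising its own training targets. The article does not say whether this was done.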