Analysts at Atlassian and LinkedIn take out Actuaries Kaggle Competition


A trio of former work colleagues, ‘the Nelson Boys’, has won the latest Actuaries Kaggle Competition to predict the cost of motor vehicle accidents on Victorian roads.

Geoff Sims, Joel van Veluwen and Luke Heinrich created the winning predictive model for the Actuaries Institute 2016 VicRoads Kaggle Competition to predict the cost of motor vehicle accidents on Victorian roads, relating to the "road infrastructure" where each accident occurred.

"A big congratulations to the Nelson Boys and well done to all the teams," said iRAP CEO Rob McInerney.

"Road crashes are the biggest killer of young people worldwide and cost 2-5% of GDP. Innovative new ways to model this unacceptable risk and help target action and save lives is so important. The Actuaries Institute Competition has opened that door and has so much potential for the future."

About the competition

Last year’s inaugural Institute Kaggle competition challenged participants to determine mortality for each census district in Australia. This year’s competition was to predict the cost of motor vehicle accidents on Victorian A, B and C category roads, relating that cost to the "road infrastructure" of the road on which each accident occurred.

The competition specifically allowed other data to be used in the predictive modelling, expanding the competition to include finding other relevant data.

The competition was sponsored by IAG as part of its ongoing promotion of road and vehicle safety research. The winners received $3000, and $1000 was also awarded to the two runners-up.

VicRoads provided the data for the competition, which combined details of the infrastructure on Victorian roads with data from police reports on accidents on those roads.

iRAP was instrumental in gathering the road data that was ultimately used in this competition.

Congratulations to all the entrants. Second Place went to a team including Peter Rickwood, Suzanne Patten, Michael Hauptman and Shiri Shapiro. Third Place went to Oliver Chambers, who was part of the winning team in the inaugural competition last year.

XGBoost Model Predicting the Cost of Accidents on Victorian Roads - from the modellers:

Our winning model used a combined classification/regression model, trained in XGBoost using Python. We believe a good cross-validation strategy (to ensure model generalisation), careful use of supplied data (manual mapping of every single road feature), along with relevant externally sourced data (traffic lights, city/town locations & local population estimates) were responsible for our victory, and we kept things simple by not employing any ensembling.

Feature Engineering

  • Road features - using the supplied data, each column was manually investigated and either one-hot encoded, keyed, or converted to a binary representation of some descriptive component, resulting in around 100 road-level features.
  • Weather - the supplied precipitation data from BOM was aggregated up for each station per quarter, and joined to the nearest road_id. This was our first temporal feature.
  • Seasonality - we built a simple linear regression to model seasonality, finding the average cost per road as a function of year (long-term trend) and quarter (seasonality), which was joined on the year-quarter as a sort of "meta feature".
  • Population - we sourced population estimates at an LGA level from the ABS, which we mapped to the nearest road_id and joined on. Estimates for 2016 were not available and were extrapolated using a moving average.
  • Traffic lights & cities - distance of road_id to nearest traffic light, number of traffic lights in various radii, and distance to the eight largest towns in VIC were joined on a road_id level.
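As a rough illustration of the traffic light features described above, the sketch below computes the distance from a road point to its nearest traffic light and the number of lights within a few radii. It is not the team's code: the function names, the feature names, the radii and the use of great-circle (haversine) distance are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def traffic_light_features(road_point, lights, radii_km=(0.5, 1.0, 5.0)):
    """Distance to the nearest light plus the count of lights within each radius.

    road_point and each entry of lights are (lat, lon) tuples; the radii are
    illustrative values, not the ones used in the winning model.
    """
    distances = [haversine_km(*road_point, *light) for light in lights]
    features = {"dist_nearest_light_km": min(distances) if distances else float("inf")}
    for r in radii_km:
        features[f"lights_within_{r}km"] = sum(d <= r for d in distances)
    return features


# Hypothetical coordinates: one road point and three traffic lights.
road = (-37.8136, 144.9631)
lights = [(-37.8136, 144.9631), (-37.8086, 144.9631), (-36.8000, 144.9631)]
features = traffic_light_features(road, lights)
```

The same distance function could be reused for the "distance to the eight largest towns" features, with one output column per town.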
     

Model Training

All modelling was performed with XGBoost in Python. We had just over 100 features (~100 road features, 1 weather feature, 1 seasonal stacked metafeature, 1 population feature, 5 traffic light features, and 8 distances), and quickly opted for a classic "frequency/severity" approach, which both greatly reduced the amount of data needed for modelling and decreased iteration times to under 5 minutes. Both the frequency and severity models used 5-fold cross-validation, with custom-specified folds such that no block_id appeared in both the training and validation sets, mimicking the way the actual blind test holdout was produced.
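The fold construction described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the team's implementation: the function name `grouped_kfold`, the round-robin assignment of blocks to folds and the toy `block_ids` are assumptions. scikit-learn's `GroupKFold` offers the same guarantee off the shelf.

```python
import random
from collections import defaultdict


def grouped_kfold(block_ids, n_splits=5, seed=0):
    """Yield (train_rows, valid_rows) index lists in which no block_id spans both sides."""
    unique_blocks = sorted(set(block_ids))
    random.Random(seed).shuffle(unique_blocks)
    # Deal the shuffled blocks round-robin into folds so every row of a block
    # lands in exactly one validation fold.
    fold_of_block = {block: i % n_splits for i, block in enumerate(unique_blocks)}
    rows_by_fold = defaultdict(list)
    for row, block in enumerate(block_ids):
        rows_by_fold[fold_of_block[block]].append(row)
    for k in range(n_splits):
        valid = rows_by_fold[k]
        train = [r for j in range(n_splits) if j != k for r in rows_by_fold[j]]
        yield train, valid


# Toy data: 10 blocks with 3 road/quarter rows each.
block_ids = [i // 3 for i in range(30)]
folds = list(grouped_kfold(block_ids, n_splits=5))
```

Validating on whole blocks gives a score that reflects performance on unseen locations, which is what a holdout split by block measures.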

For the frequency model, we used binary classification (minimising logloss) to predict whether or not the road/quarter combination had an accident. We included all rows where an accident occurred, and randomly sampled nine times as many rows where no accident occurred, such that the positive response rate was 10% (we actually repeated this 4 times and averaged the results, in a sort of meta-ensemble). The final frequency probability was computed by "unwinding" the raw model probability (using the "prior correction" described by King & Zeng in "Logistic Regression in Rare Events Data") to account for the fact that we oversampled the response (from 0.27% to 10%). The severity model (minimising RMSE) used only the rows where the cost was greater than 0. Model parameters were initially tweaked manually, and the final prediction was calculated by multiplying the frequency and severity predictions.
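To make the "unwinding" step concrete, here is a minimal sketch of the prior correction applied to a probability from a model trained on resampled data, followed by the frequency times severity multiplication. The 0.27% and 10% rates come from the text above; the function names, and the choice to correct each of the resampled models before averaging, are assumptions for illustration only.

```python
def prior_correct(p_sampled, population_rate, sample_rate):
    """Rescale a probability from the resampled training rate to the population rate.

    Equivalent to subtracting ln[((1 - tau) / tau) * (ybar / (1 - ybar))] from the
    logit, where tau is the population rate and ybar is the rate in the sample.
    """
    pos_weight = population_rate / sample_rate
    neg_weight = (1.0 - population_rate) / (1.0 - sample_rate)
    numerator = p_sampled * pos_weight
    return numerator / (numerator + (1.0 - p_sampled) * neg_weight)


def expected_cost(sampled_probs, severity, population_rate=0.0027, sample_rate=0.10):
    """Correct each resampled frequency prediction, average them, multiply by severity."""
    corrected = [prior_correct(p, population_rate, sample_rate) for p in sampled_probs]
    return severity * sum(corrected) / len(corrected)


# Hypothetical raw outputs of four frequency models for one road/quarter row,
# and a hypothetical severity prediction in dollars.
cost = expected_cost([0.20, 0.22, 0.18, 0.21], severity=50_000.0)
```

A useful sanity check is that a raw prediction equal to the resampled base rate (10%) maps back to the population base rate (0.27%).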

*Note: features in capitals were derived from given data, while features in lowercase were externally sourced features

Interesting findings

Geelong stood out as a huge outlier in the training set, and the major central roads did not appear to be in the testing set. By clipping our training data at the 99th percentile (to account for the fact that the test set probably had a lower than average cost), we further improved our score.

Geelong clearly stands out as the highest cost per road “block”
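The clipping step can be illustrated as follows. This sketch assumes the clipping was applied to the training cost values, which the write-up does not state explicitly, and the helper names and toy numbers are hypothetical. With NumPy, `np.percentile` and `np.clip` would do the same job.

```python
import math


def percentile(values, q):
    """q-th percentile (0 to 100), linearly interpolating between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def clip_upper(values, q=99.0):
    """Cap every value at the q-th percentile, leaving smaller values untouched."""
    cap = percentile(values, q)
    return [min(v, cap) for v in values]


# Toy training costs: 100 ordinary rows plus one extreme outlier.
costs = [float(i) for i in range(100)] + [1_000_000.0]
clipped = clip_upper(costs, q=99.0)
```

Capping rather than dropping the outlying rows keeps them in the training set while limiting how far they can pull the severity model.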

We tried a ton of things which didn't add any value: basic linear/logistic regression (instead of XGB); seasonal offset modelling; microseasonality; socioeconomic demographics; kNN methods; hyperparameter optimisation; and temporal traffic volume. Many of the things we were sure would add value didn't, while some of the other features we weren't so confident in gave large gains.

A big thanks to the Kaggle competition sponsors for their support.


 


About the authors

Geoff Sims

Senior data analyst at Atlassian with a scientific academic background (PhD in Astrophysics).

Joel van Veluwen

Analytics lead for LinkedIn's Sales Solutions business in ANZ with an academic background in Information Systems and Economics.

Luke Heinrich

Customer Insights Analyst at Atlassian who formerly worked at Quantium. Luke studied Actuarial Studies and Finance and is halfway through the Part III exams.
