Analysts at Atlassian and LinkedIn take out Actuaries Kaggle Competition


A trio of former work colleagues, ‘the Nelson Boys’, have won the latest Actuaries Kaggle Competition to predict the cost of motor vehicle accidents on Victorian roads.

Geoff Sims, Joel van Veluwen and Luke Heinrich created the winning predictive model for the Actuaries Institute 2016 VicRoads Kaggle Competition, which asked entrants to relate the cost of each motor vehicle accident to the “road infrastructure” where it occurred.

“A big congratulations to the Nelson Boys and well done to all the teams,” said iRAP CEO Rob McInerney.

“Road crashes are the biggest killer of young people worldwide and cost 2–5% of GDP. Innovative new ways to model this unacceptable risk, help target action and save lives are so important. The Actuaries Institute Competition has opened that door and has so much potential for the future.”

About the competition

Last year’s inaugural Institute Kaggle competition challenged participants to determine mortality for each census district in Australia. This year’s competition was to predict the cost of motor vehicle accidents on Victorian A, B and C category roads, relating that cost to the “road infrastructure” of the road on which each accident occurred.

The rules specifically allowed other data to be used in the predictive modelling, so finding relevant external data became part of the challenge.

The competition was sponsored by IAG as part of its ongoing promotion of road and vehicle safety research. The winners received $3000, and $1000 was also awarded to the two runners-up.

VicRoads provided the data for the competition, which combined details of the infrastructure on Victorian roads with data from police reports on accidents on those roads.

iRAP was instrumental in gathering the road data that was ultimately used in this competition.

Congratulations to all the entrants. Second place went to a team including Peter Rickwood, Suzanne Patten, Michael Hauptman and Shiri Shapiro. Third place went to Oliver Chambers, who was part of the winning team in the inaugural competition last year.

XGBoost Model Predicting the Cost of Accidents on Victorian Roads – from the modellers:

Our winning entry was a combined classification/regression model, trained with XGBoost in Python. We believe three things were responsible for our victory: a good cross-validation strategy (to ensure the model generalised), careful use of the supplied data (manual mapping of every single road feature), and relevant externally sourced data (traffic lights, city/town locations and local population estimates). We kept things simple by not employing any ensembling.

Feature Engineering

  • Road features – using the supplied data, each column was manually investigated and either one-hot encoded, keyed, or converted to a binary representation of some descriptive component, resulting in around 100 road-level features.
  • Weather – the supplied precipitation data from BOM was aggregated for each station per quarter and joined to the nearest road_id. This was our first temporal feature.
  • Seasonality – we built a simple linear regression to model seasonality, finding the average cost per road as a function of year (long-term trend) and quarter (seasonality), which was joined on the year-quarter as a sort of “meta feature”.
  • Population – we sourced population estimates at an LGA level from the ABS, which we mapped to the nearest road_id and joined on. Estimates for 2016 were not available and were extrapolated using a moving average.
  • Traffic lights & cities – distance from each road_id to the nearest traffic light, number of traffic lights within various radii, and distance to the eight largest towns in VIC were joined at a road_id level.
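
As an illustration of the traffic light features, the following is a minimal sketch rather than the team's actual code. It assumes each road segment can be represented by a single (latitude, longitude) point and that traffic light locations are available as a non-empty list of such points. The function names, the radii and the feature names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def traffic_light_features(road_point, lights, radii_km=(0.5, 1.0, 5.0)):
    """Distance to the nearest light plus the number of lights within each radius.

    road_point: (lat, lon) for one road_id (assumed representation).
    lights: non-empty list of (lat, lon) traffic light locations.
    radii_km: hypothetical radii; the article only says "various radii".
    """
    distances = [haversine_km(road_point[0], road_point[1], lat, lon) for lat, lon in lights]
    features = {"dist_nearest_light_km": min(distances)}
    for radius in radii_km:
        features[f"lights_within_{radius}km"] = sum(d <= radius for d in distances)
    return features
```

The same distance function could be reused for the distance-to-town features by passing town coordinates instead of traffic light coordinates.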

Model Training

All modelling was performed with XGBoost in Python. We had just over 100 features (~100 road features, 1 weather feature, 1 seasonal stacked meta feature, 1 population feature, 5 traffic light features, and 8 distances), and quickly opted for a classic “frequency/severity” approach, which both greatly reduced the amount of data needed for modelling and decreased iteration times to under 5 minutes. Both the frequency and severity models used 5-fold cross-validation, with custom-specified folds such that no block_id appeared in both the training and validation sets, mimicking the way the actual blind test holdout was produced.
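
The fold construction can be sketched as follows. This is one plausible way to build block-level folds, not necessarily how the team did it: every distinct block_id is assigned to exactly one fold, so rows from the same block never straddle the training and validation sets. The function names and the round-robin assignment are assumptions made for illustration.

```python
import random


def grouped_folds(block_ids, n_folds=5, seed=0):
    """Return a fold number for each row so that rows sharing a block_id share a fold."""
    unique_blocks = sorted(set(block_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_blocks)  # randomise which blocks end up together
    # Round-robin assignment keeps the number of blocks per fold balanced.
    block_to_fold = {block: i % n_folds for i, block in enumerate(unique_blocks)}
    return [block_to_fold[block] for block in block_ids]


def train_valid_split(fold_of_row, fold):
    """Row indices for training (all other folds) and validation (the given fold)."""
    train_idx = [i for i, f in enumerate(fold_of_row) if f != fold]
    valid_idx = [i for i, f in enumerate(fold_of_row) if f == fold]
    return train_idx, valid_idx
```

Splitting rows at random instead would let the model see other quarters of the same block during training, which would likely make the validation score more optimistic than the blind holdout.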

For the frequency model, we used binary classification (minimising logloss) to predict whether or not each road/quarter combination had an accident. We included all rows where an accident occurred, and randomly sampled nine times as many rows where no accident occurred, so that the positive response rate was 10% (we actually repeated this 4 times and averaged the results, in a sort of meta-ensemble). The final frequency probability was computed by “unwinding” the raw model probability (using the “prior correction” described by King & Zeng in “Logistic Regression in Rare Events Data”) to account for the fact that we had oversampled the response (from 0.27% to 10%). The severity model (minimising RMSE) used only the rows where the cost was greater than 0. Model parameters were initially tweaked manually, and the final prediction was calculated by multiplying the frequency and severity predictions.
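
A rough sketch of the resampling and “unwinding” steps is given below. It assumes the King & Zeng prior correction is applied on the log-odds scale, subtracting ln[((1 − τ)/τ) · (ȳ/(1 − ȳ))], where τ is the true event rate (0.27%) and ȳ is the event rate in the resampled training data (10%). The function names are hypothetical, and the raw probability must lie strictly between 0 and 1.

```python
import math
import random


def downsample_negatives(labels, neg_per_pos=9, seed=0):
    """Indices of every positive row plus a random sample of neg_per_pos negatives per positive."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    n_neg = min(len(neg), neg_per_pos * len(pos))
    return sorted(pos + rng.sample(neg, n_neg))


def prior_correct(p_sampled, true_rate, sampled_rate):
    """Map a probability from a model trained on resampled data back to the population scale."""
    logit = math.log(p_sampled / (1.0 - p_sampled))
    offset = math.log(((1.0 - true_rate) / true_rate) * (sampled_rate / (1.0 - sampled_rate)))
    return 1.0 / (1.0 + math.exp(-(logit - offset)))


def expected_cost(p_sampled, severity, true_rate=0.0027, sampled_rate=0.10):
    """Final prediction: corrected accident probability times predicted cost given an accident."""
    return prior_correct(p_sampled, true_rate, sampled_rate) * severity
```

A useful sanity check on the correction is that a raw prediction equal to the resampled base rate (10%) maps back to the true base rate (0.27%).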

*Note: features in capitals were derived from the given data, while features in lowercase were externally sourced.

Interesting findings

Geelong stood out as a huge outlier in the training set, and the major central roads did not appear to be in the testing set. By clipping our training data to the 99th percentile (to account for the fact that the test set probably had lower than average cost), we further improved our score.
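
The clipping step might look something like the sketch below. It assumes the quantity being clipped is the training cost per row, which the write-up does not state explicitly, and it uses a linearly interpolated percentile. Both function names are hypothetical.

```python
def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def clip_upper(values, q=99.0):
    """Cap every value at the q-th percentile; returns the clipped list and the cap used."""
    cap = percentile(values, q)
    return [min(v, cap) for v in values], cap
```

Only the training target would be capped in this sketch; test-set predictions are left untouched.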

Figure: Geelong clearly stands out as having the highest cost per road “block”.

We tried a ton of things which didn’t add any value: basic linear/logistic regression (instead of XGB); seasonal offset modelling; microseasonality; socioeconomic demographics; kNN methods; hyperparameter optimisation; and temporal traffic volume. Many of the things we were sure would add value didn’t, while some of the other features we weren’t so confident in gave large gains.

A big thanks to the Kaggle competition sponsors for their support.



About the authors

Geoff Sims

Senior data analyst at Atlassian with a scientific academic background (PhD in Astrophysics).

Joel van Veluwen

Analytics lead for LinkedIn's Sales Solutions business in ANZ with an academic background in Information Systems and Economics.

Luke Heinrich

Customer Insights Analyst at Atlassian who formerly worked at Quantium. Luke studied Actuarial Studies and Finance and is halfway through the Part III exams.
