Xavier Conort on maximising your machine learning success with innovative feature engineering

Xavier Conort is an esteemed actuary, Kaggle Grandmaster, DataRobot Chief Scientist (2013-2020), and co-founder and Chief Product Officer of FeatureByte. On his recent visit to Sydney, Xavier gave an inspiring yet practical talk on feature engineering for machine learning.

What are Data Science Sydney Meet-Ups?

Data Science Sydney Meet-Ups are not just monthly gatherings for the data science community but also an opportunity for actuarial and non-actuarial data scientists to come together to explore the potential of data through advanced statistical, computational, and analytical tools.

Participants engage in networking and knowledge sharing while enjoying pizza! These sessions are usually held at the Actuaries Institute, but this month, due to an increase in attendees, we met at the University of Sydney Business School.

Xavier addressing the auditorium

Xavier began the session by introducing himself and outlining the topics he would cover, including the creative process of feature engineering, the importance of organising ideas, and FeatureByte, an open-source Python library created by Xavier to assist with feature engineering.

What is feature engineering? It’s the process of selecting, manipulating, and transforming raw data into features that can be used in machine learning models.

Feature engineering: A creative process

Xavier emphasised that feature engineering is a highly creative process and clarified that creativity does not solely rely on generating original ideas.

Quoting Steve Jobs, who said, “Creativity is just connecting things,” Xavier suggested that creativity is less about originality alone and more about combining separate ideas.

Reflecting on his success as a Kaggle Grandmaster, Xavier shared techniques he had employed that were inspired by a range of sources, from past and present Kaggle champions to decades-old academic papers.

Xavier mentioned that while he thoroughly enjoys the creative process of feature engineering, the audience should not overlook knowledge shared by peers and established techniques. As an example, Xavier shared lessons from his record on the “Give Me Some Credit” Kaggle challenge, where he built a credit scoring model that has remained at the top of the competition leaderboard for over 10 years. He attributed his success in this challenge to techniques inspired by his peer and previous Kaggle Grandmaster, Owen Zhang, as well as to the use of stacking, which he discovered in a research paper dating back over 20 years.

Name and organise your ideas

While Xavier takes a creative approach to generating feature ideas, he enhances this process through a systematic categorisation of his ideas. Xavier found it useful to classify his thoughts and feature ideas into distinct signal types. Each signal type is tabulated below with a brief description and example.

Signal Type | Description | Example
Recency | Attributes to the latest event | Time since last event (claim, customer communication)
Frequency | How often events occur | Number of events occurring over a time window (number of times caught public transport in a week)
Monetary | Monetary amounts over a time window | Amount or average spent on consumer products over a month
Seasonality | Any seasonality in the timing of events | Higher/lower frequency of claims over holiday periods
Clumpiness | Do events occur randomly or in intense bursts? | Coefficient of variation or log utility of inter-event times, such as identifying customers binge-watching TV shows on streaming platforms
Change | Have static attributes changed? By how much? | Has a customer updated their address? How often has password been reset in the past 48 hours?
Basket | Counts or amounts of events/resources cross-categorised by an item label | What are the counts of each item type in a shopping basket? What is the maximum weight of each item type in a shipping order?
Diversity | How variable are the data values? | How stable is a patient’s blood pressure?
Similarity | How similar is an entity to a related group? | Ratio of a customer’s premium compared to the average or maximum for their age group
Stability | Whether an entity’s recent events resemble their past events | Ratio of the latest premium to the average or maximum over a historical time window
Location | The static or dynamic location of an event or entity | Post code, or the distance between home and office

 

A slide from Xavier’s presentation providing examples of signal types for a grocery dataset
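
To make a few of these signal types concrete, here is a minimal sketch in plain Python of how recency, frequency, monetary, clumpiness and stability features might be computed from an event log. It is an illustration only, not code from Xavier's talk or from FeatureByte: the `Event` record, the `signal_features` function, the 28-day window and the sample purchases are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, pstdev


@dataclass
class Event:
    """Hypothetical event record, e.g. a purchase or a claim."""
    customer_id: str
    timestamp: datetime
    amount: float


def signal_features(events, customer_id, as_of, window_days=28):
    """Compute a handful of illustrative signal-type features for one customer."""
    window_start = as_of - timedelta(days=window_days)
    # Only use events known at the observation point, so no future data leaks in.
    history = sorted(
        (e for e in events if e.customer_id == customer_id and e.timestamp <= as_of),
        key=lambda e: e.timestamp,
    )
    in_window = [e for e in history if e.timestamp > window_start]
    amounts = [e.amount for e in in_window]

    # Recency: days since the latest event.
    recency_days = (
        (as_of - history[-1].timestamp).total_seconds() / 86400 if history else None
    )

    # Clumpiness: coefficient of variation of the gaps between consecutive events.
    gaps = [
        (later.timestamp - earlier.timestamp).total_seconds() / 86400
        for earlier, later in zip(in_window, in_window[1:])
    ]
    clumpiness_cv = pstdev(gaps) / mean(gaps) if len(gaps) >= 2 and mean(gaps) > 0 else None

    # Stability: latest amount relative to the average amount over the window.
    stability_ratio = amounts[-1] / mean(amounts) if amounts and mean(amounts) > 0 else None

    return {
        "recency_days": recency_days,
        "frequency": len(in_window),       # Frequency: number of events in the window
        "monetary_total": sum(amounts),    # Monetary: total amount in the window
        "clumpiness_cv": clumpiness_cv,
        "stability_ratio": stability_ratio,
    }


# Made-up purchases: three on consecutive days, then one much later.
events = [
    Event("A", datetime(2024, 1, 2), 10.0),
    Event("A", datetime(2024, 1, 3), 20.0),
    Event("A", datetime(2024, 1, 4), 30.0),
    Event("A", datetime(2024, 1, 28), 60.0),
    Event("B", datetime(2024, 1, 10), 5.0),
]
features = signal_features(events, "A", as_of=datetime(2024, 1, 29))
print(features)
```

In this made-up data, customer A's purchases arrive in a burst followed by a long gap, so the coefficient of variation of the inter-event times comes out above 1, which is the kind of pattern the clumpiness signal is meant to flag.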

FeatureByte

FeatureByte is a free and open-source Python package designed to empower data enthusiasts who love creating innovative features and improving model accuracy through data. It enables users to quickly create and share features, experiment seamlessly with features, and effortlessly deploy features in production without the challenges of managing data pipelines.

From a technical standpoint, the package allows users to easily create features for each of the signal types discussed earlier. Defining features is swift as no data needs to be held in memory. The SQL code required to obtain a feature is executed only when you “save” it, which is much faster than the corresponding pandas manipulations. Once users grasp the package’s syntax, it becomes a fast and easy way to develop features for machine learning models. Additionally, there is an enterprise solution built on top of the open-source package, offering a comprehensive framework for managing and organising AI operations at scale, including governance workflows and a user interface for the feature catalogue. If you’re interested, you can check it out here.
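
The idea of defining a feature without loading data, and only running SQL when the feature is saved, can be sketched with Python's built-in `sqlite3` module. This is not FeatureByte's API: the `LazyFeature` class, its `save` method and the `purchases` table are hypothetical names invented purely to illustrate deferred execution.

```python
import sqlite3


class LazyFeature:
    """Hypothetical stand-in for a declaratively defined feature (not FeatureByte's API)."""

    def __init__(self, name, sql):
        self.name = name
        self.sql = sql           # only the query text is stored; no rows are loaded
        self.executed = False

    def save(self, conn):
        """Run the query inside the database and materialise the result as a table."""
        # The table name is interpolated directly, so it must come from trusted code.
        conn.execute(f"CREATE TABLE {self.name} AS {self.sql}")
        self.executed = True


# A small in-memory database standing in for a data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("A", 10.0), ("A", 30.0), ("B", 5.0)],
)

# Defining the feature is instant: nothing is queried yet.
total_spend = LazyFeature(
    "total_spend",
    "SELECT customer_id, SUM(amount) AS total FROM purchases GROUP BY customer_id",
)
ran_before_save = total_spend.executed

# The SQL only runs here, and the aggregation happens inside the database.
total_spend.save(conn)
rows = dict(conn.execute("SELECT customer_id, total FROM total_spend").fetchall())
print(rows)
```

The design point is that the Python process only ever holds the query text and the small aggregated result, while the heavy lifting stays in the database engine.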

Final words from Eugene

Eugene Dubossarsky, the founder of Data Science Sydney, concluded the session by recognising Xavier as an exceptional figure in the field of data science, epitomising an extraordinary balance of humility and expertise. Eugene also reminded the audience that AI is not limited to just Large Language Models (such as GPT-4). Many organisations still have a significant need for machine learning models and will benefit from platforms like FeatureByte.

Eugene Dubossarsky acknowledges the remarkable blend of “humility and awesomeness” embodied by Xavier Conort during his speech

CPD: Actuaries Institute Members can claim two CPD points for every hour of reading articles on Actuaries Digital.