What is the Most Christmassy of Christmas Songs?

In his last Normal Deviance article for the year, Hugh looks at lyrics to see what makes a song festive.

After the mild non-failure of last year’s festive column, I’ve decided to have another crack at a festively themed analytics challenge. Is it possible to build a model that predicts whether a song is Christmas-themed and, if so, can we use it to say something about those predictions and which songs are the ‘most’ Christmassy on this measure?

There are lots of potential avenues, but to keep it simple I’ve looked at words used in lyrics. This is obviously a little simpler than a full analysis of audio files, and is probably appropriate for the year, given the prominence of language models in the AI space. But it does mean we are not listening out for jingling bells in the backing track.

For data, I have used the remarkable Million Song Dataset (by Bertin-Mahieux et al., 2011), which is actually a collection of datasets related to songs, linked by a common track ID. I have used:

  • The musiXmatch dataset, which has taken the lyrics of 238,000 songs and converted them into a bag-of-words format. That is, they have selected 5,000 common word stems and counted the number of times each of those stems occurs in each song.
  • The Last.fm dataset, which has a range of genre and other tags for each track. For our purposes we’ve extracted the “Christmas” tag and used that as our target.
  • The core dataset for matching song IDs to artist and song names.
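As a rough illustration of the bag-of-words format, here is a minimal Python sketch that reads it into per-song word counts. The file layout is an assumption on my part rather than something stated in this article: comment lines starting with "#", one line starting with "%" that lists the word stems, and then one line per song of the form track ID, musiXmatch ID, and index:count pairs with 1-based indices. The sample text and track IDs are made up.

```python
from collections import Counter

# Tiny made-up sample in the assumed layout (not real data).
SAMPLE = """\
# comment lines start with '#'
%i,the,you,christma,sleigh
TRFAKE001,1001,1:3,2:5,4:2
TRFAKE002,1002,2:1,3:4
"""


def parse_bag_of_words(text):
    """Return (vocabulary, {track_id: Counter of stem -> count})."""
    vocab = []
    songs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("%"):
            vocab = line[1:].split(",")  # the list of word stems
            continue
        track_id, _mxm_id, *pairs = line.split(",")
        counts = Counter()
        for pair in pairs:
            index, count = pair.split(":")
            # Indices are assumed to be 1-based positions in the stem list.
            counts[vocab[int(index) - 1]] = int(count)
        songs[track_id] = counts
    return vocab, songs


vocab, songs = parse_bag_of_words(SAMPLE)
print(songs["TRFAKE001"])
```

Each song ends up as a sparse set of stem counts, which is all the model gets to see: word order, melody and any jingling bells are gone.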

 

Datasets are available at the website link above, and full (not pretty) Python code for the analysis is available at this GitHub repository.

After merging and down-sampling the non-Christmas songs, we arrived at a dataset with 1,305 Christmas songs and 23,635 other songs, with a list of word counts for each. I picked a simple XGBoost setup to build a prediction model, and created a predicted probability for each record (using cross-validated predictions, so that no song is scored by a model that was trained on it, to avoid overfitting effects).
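To make the idea of cross-validated predictions concrete, here is a minimal standard-library sketch. It is not the code from the repository: a small naive Bayes classifier stands in for XGBoost so the example runs without extra packages, and the function names, fold count and toy songs are all hypothetical. The point is only the mechanics, in which each song is scored by a model fitted on the other folds.

```python
import math
import random


def fit_naive_bayes(rows, labels):
    """Stand-in model (NOT XGBoost): multinomial naive Bayes on word counts."""
    vocab = {stem for row in rows for stem in row}
    totals = {0: {}, 1: {}}
    class_counts = {0: 0, 1: 0}
    for row, y in zip(rows, labels):
        class_counts[y] += 1
        for stem, count in row.items():
            totals[y][stem] = totals[y].get(stem, 0) + count
    return {"vocab": vocab, "totals": totals, "class_counts": class_counts}


def predict_proba(model, row):
    """Probability that one song (dict of stem -> count) is in class 1."""
    n = sum(model["class_counts"].values())
    v = len(model["vocab"])
    log_score = {}
    for y in (0, 1):
        # Laplace-smoothed class prior and word likelihoods.
        score = math.log((model["class_counts"][y] + 1) / (n + 2))
        denom = sum(model["totals"][y].values()) + v
        for stem, count in row.items():
            if stem in model["vocab"]:
                score += count * math.log((model["totals"][y].get(stem, 0) + 1) / denom)
        log_score[y] = score
    # Turn the two log scores into a probability, clamped to avoid overflow.
    diff = max(min(log_score[0] - log_score[1], 700.0), -700.0)
    return 1 / (1 + math.exp(diff))


def out_of_fold_predictions(rows, labels, k=5, seed=0):
    """Score every song with a model that never saw it during fitting."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # simple, unstratified folds
    preds = [None] * len(rows)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        model = fit_naive_bayes([rows[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            preds[i] = predict_proba(model, rows[i])
    return preds


# Toy data: 10 "Christmas" songs (label 1) and 10 others (label 0).
rows = [{"christma": 3, "sleigh": 1}, {"love": 4, "babi": 2}] * 10
labels = [1, 0] * 10
preds = out_of_fold_predictions(rows, labels, k=5)
```

With real data, the same loop would wrap whatever model is being fitted, and with so few Christmas songs it would probably be worth stratifying the folds so each one has a similar share of them.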

The model performs strongly – overall accuracy is about 97%, suggesting that Christmas songs are not that hard to spot. The two distributions below show the predicted probability of being a Christmas song for the two groups.
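One piece of context for the accuracy figure comes straight from the class sizes quoted earlier: Christmas songs are a small share of the dataset, so a model that always answered "not Christmas" would already be right most of the time. The short calculation below uses only the two counts from this article; the variable names are mine.

```python
christmas_songs = 1_305
other_songs = 23_635
total = christmas_songs + other_songs

# Share of songs tagged as Christmas in the modelling dataset.
share = christmas_songs / total
# Accuracy of a trivial rule that always predicts "not Christmas".
baseline_accuracy = other_songs / total

print(f"{share:.1%}")              # 5.2%
print(f"{baseline_accuracy:.1%}")  # 94.8%
```

So the roughly 97% accuracy is best read as an improvement on a baseline of about 95%, which is why the probability distributions for the two groups are a more informative view of performance.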

 

Figure 1 – Model predicted probability of being a Christmas song

 

Most non-Christmas songs are correctly given very low probabilities. Interestingly, there is also a cluster of songs tagged as Christmas that have similarly low probabilities – looking at a few of them, I struggled to pick them as Christmas songs, so some of it may relate to the quality of the original tags.

We can also ask the model to report which words are most useful for making predictions across the dataset – feature importance. The top 20 words are shown below. Sleigh manages to edge out Christmas (here stemmed to “christma” for analysis). The list looks suitably jolly, with many traditional word stems (Christ, shepherd) included.
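For readers who want a feel for how a word earns its place in such a list, here is a simple proxy that needs no modelling library. It is a swapped-in technique, not XGBoost's feature importance: it ranks each stem by the smoothed log-odds of appearing in Christmas songs versus other songs. The function name and toy songs are hypothetical.

```python
import math


def log_odds_by_stem(rows, labels):
    """Rank stems by smoothed log-odds of appearing in class 1 vs class 0."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    present = {}  # stem -> [songs in class 0 with it, songs in class 1 with it]
    for row, y in zip(rows, labels):
        for stem in row:
            counts = present.setdefault(stem, [0, 0])
            counts[y] += 1
    scores = {}
    for stem, (neg, pos) in present.items():
        # Add-half smoothing keeps both proportions strictly between 0 and 1.
        p_pos = (pos + 0.5) / (n_pos + 1)
        p_neg = (neg + 0.5) / (n_neg + 1)
        scores[stem] = math.log(p_pos / (1 - p_pos)) - math.log(p_neg / (1 - p_neg))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: two "Christmas" songs (label 1) and two others (label 0).
rows = [
    {"sleigh": 2, "the": 5},
    {"christma": 4, "the": 3},
    {"love": 3, "the": 6},
    {"babi": 1, "the": 2},
]
labels = [1, 1, 0, 0]
ranked = log_odds_by_stem(rows, labels)
```

A stem like "the" that shows up everywhere scores zero, while stems concentrated in one class rise to the top or sink to the bottom. Tree-based importance measures something related but different, namely how much each word helps the model's splits, so the two rankings need not agree.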

 

Top 20 word stems

 

Now, onto the main event – our top ten list of most Christmassy Christmas songs is shown below. The probabilities are all high, with very little separating them, so it is a little bit forced (and likely to change with a bit of model tweaking) – but we won’t let that get in the way of a good list.

 

Top 10 Most Christmassy Christmas Songs

Artist            Song                         Predicted probability
Brenda Lee        Christy Christmas            99.6%
Destiny’s Child   Platinum Bells               99.5%
Carpenters        O Holy Night                 99.5%
Bebe Winans       Yes It’s Christmas           99.4%
Mark Knopfler     The Ragpicker’s Dream        99.4%
Charlotte Church  God Rest Ye Merry Gentlemen  99.3%
Destiny’s Child   A “DC” Christmas Medley      99.3%
Billy Idol        God Rest Ye Merry Gentlemen  99.3%
Donna Summer      O Come All Ye Faithful       99.3%
Vanessa Williams  Merry Christmas Darling      99.2%

 

Since the training set is pretty large, it’s perhaps not surprising that the list includes quite a few I’ve not heard before. Our winner is Brenda Lee’s “Christy Christmas”, which should almost win for its title alone.

 

 

The song lyrics do have it all – Christmas, toys, sleigh, tree, star and Santa – so it is a deserved winner. Interestingly, Brenda Lee is already in the Christmas news this year, with her famous 65-year-old recording of “Rockin’ Around the Christmas Tree” hitting number 1 on the US Billboard Hot 100, making her the oldest person ever to top the charts.

Destiny’s Child’s 2001 Christmas album managed to get two songs into the top ten, including their original “Platinum Bells”, which similarly manages to hit a vast array of Christmas words, although it is, shockingly, not regarded as Beyoncé’s best work.

 

 

One double-up is in the list, too – “God Rest Ye Merry Gentlemen”. This is not surprising, given the number of covers and versions of many songs that are reflected in the list.

Of the list, “The Ragpicker’s Dream” by Mark Knopfler was my favourite discovery – a gentler narrative with a lot of heart.

Overall, a pretty straightforward analytics exercise, but it is nice when good data is out there to make life easy.

Thanks for reading again this year – this has been my sixth year of writing the Normal Deviance column, and it continues to be a joy. And whatever your preferred music genre, I do hope you’re able to have a fantastic festive season, with some time for relaxation, friends and family.

CPD: Actuaries Institute Members can claim two CPD points for every hour of reading articles on Actuaries Digital.