My top 10 Python packages for data science

Over the last four years I have transitioned from using SAS exclusively for all data processing and statistical modelling tasks to using Python for these tasks. One barrier I had to overcome was the need to keep discovering and learning to use all the great packages put together by the open source community.

There are many benefits to adopting these open source packages, including:

Everything is free
Most are actively updated and improved
There’s a large community whose members support each other on websites like Stack Overflow

It does take some time to get familiar with these packages. However, if you are the kind of person who gets excited about learning new things, you’ll actually enjoy the process.

Today I’m sharing my top 10 Python packages for data science, grouped by task. I hope you find the list useful!

Data processing

pandas

Developed by Wes McKinney more than a decade ago, this package offers powerful data table processing capabilities. For people with a SAS background, it offers something like the functionality of SAS data steps. You can sort, merge, filter and so on. The key difference is that in pandas you call a function or method on a DataFrame to perform each of these tasks.
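As a rough illustration of the sorting, merging and filtering mentioned above, here is a minimal sketch. The two tables, their column names and the 500 threshold are all invented for this example, and the SAS comparisons in the comments are loose analogies rather than exact equivalents.

```python
import pandas as pd

# Hypothetical policy and claims tables, invented for illustration.
policies = pd.DataFrame({
    "policy_id": [1, 2, 3, 4],
    "age": [34, 67, 52, 71],
})
claims = pd.DataFrame({
    "policy_id": [2, 3, 4],
    "claim_amount": [1200.0, 300.0, 4500.0],
})

# Merge: loosely comparable to a MERGE ... BY in a SAS data step.
# how="left" keeps every policy, even those without a claim.
merged = policies.merge(claims, on="policy_id", how="left")

# Filter: keep only rows with a claim above 500 (loosely, a WHERE clause).
large = merged[merged["claim_amount"] > 500]

# Sort: largest claim first.
result = large.sort_values("claim_amount", ascending=False)
print(result)
```

Each step returns a new DataFrame, so the calls can also be chained into a single expression if you prefer.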

By the way, I was amazed to learn that Wes McKinney developed pandas after only a few years of Python experience. Some people are just really gifted!

His book Python for Data Analysis is highly recommended if you are just starting out on your Python data science journey.

numpy

pandas is built on top of another important package, numpy. So when you work with data you will often rely on numpy for basic data manipulation. For example, when you need to create a new column based on the age of the customer, you can do something like:

df['isRetired'] = np.where(df['age']>=65, 'yes', 'no')
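The line above assumes that numpy has been imported as np and that df is an existing DataFrame with an age column. A self-contained version, using a small made-up DataFrame, might look like this:

```python
import numpy as np
import pandas as pd

# Made-up customer ages, for illustration only.
df = pd.DataFrame({"age": [30, 65, 72, 48]})

# np.where(condition, value_if_true, value_if_false) works element by element.
df["isRetired"] = np.where(df["age"] >= 65, "yes", "no")
print(df)
```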

qgrid

An amazing package which allows you to sort, filter, and edit DataFrames in Jupyter Notebooks.

Graphing

The next three packages are all about graphing, which is a key step in exploratory data analysis.

matplotlib

This package allows you to draw all sorts of graphs. If you are using it in a Jupyter Notebook, remember to run this line of code to enable the display of the graphs:

%matplotlib inline
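Once inline display is enabled, a basic line graph takes only a few lines. The sketch below uses made-up monthly claim counts. In a notebook the figure appears when the cell finishes running, while in a plain script you would call plt.show() at the end.

```python
import matplotlib.pyplot as plt

# Made-up monthly claim counts, for illustration only.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
claims = [120, 135, 128, 150, 162, 158]

# Create a figure with one set of axes and draw a line with point markers.
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, claims, marker="o")
ax.set_title("Monthly claim counts")
ax.set_xlabel("Month")
ax.set_ylabel("Number of claims")
fig.tight_layout()
```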

seaborn

With the help of this package, you can make matplotlib graphs look much more attractive.
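As a small sketch of what that looks like in practice, the example below applies one of seaborn's built-in themes and draws a bar chart from a DataFrame. The data and column names are invented for illustration, and set_theme assumes a reasonably recent seaborn version (0.11 or later).

```python
import pandas as pd
import seaborn as sns

# Apply a seaborn theme; it also restyles plain matplotlib graphs.
sns.set_theme(style="whitegrid")

# Made-up claim counts by product, for illustration only.
df = pd.DataFrame({
    "product": ["Motor", "Home", "Travel"],
    "claims": [320, 210, 95],
})

# seaborn works directly with DataFrame column names.
ax = sns.barplot(data=df, x="product", y="claims")
ax.set_title("Claims by product")
```

The returned object is an ordinary matplotlib Axes, so everything you know about matplotlib still applies.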

plotly

Nowadays we come across interactive graphs everywhere. They offer a much better user experience. For example:

When we hover the mouse over a line plot, we expect some text to pop up.
When we select a line, we expect it to stand out from the other lines.
Sometimes we would like to zoom in on parts of the graph.

plotly allows you to build these interactive graphs easily within a Jupyter Notebook. A great way to share work with colleagues and stakeholders is to send them a webpage (a Jupyter Notebook) with beautiful, interactive plotly graphs embedded.

The best part is there is no need for the recipient to install any special software other than a modern internet browser.

Modelling

statsmodels

This package allows you to build Generalized Linear Models (GLMs), which are still widely used by actuaries today.

It also offers time series analysis and other statistical modelling capabilities.

scikit-learn

This is the main machine learning package. It allows you to complete most machine learning tasks, including classification, regression, clustering, and dimensionality reduction.

I also use the model selection and pre-processing functions. From k-fold cross validation to scaling data and encoding categorical features, it has so much to offer.
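The sketch below combines those pieces: scaling a numeric feature, one-hot encoding a categorical one, and scoring a model with 5-fold cross validation. The data is simulated, and the feature names and the choice of logistic regression are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "state": rng.choice(["NSW", "VIC", "QLD"], size=n),
})
y = (X["age"] + rng.normal(0, 10, size=n) > 50).astype(int)

# Scale the numeric column and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["age"]),
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["state"]),
])

# Putting preprocessing inside the pipeline means it is refitted on each
# training fold, so nothing leaks from the validation fold.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

# 5-fold cross validation; the default score for a classifier is accuracy.
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```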

lightgbm

This is one of my favourite machine learning packages for Gradient Boosting Machines (GBMs). I gave a talk about this package at the 2018 Data Analytics Seminar.

For a fraction of the time and effort needed to build a GLM, you can run a GBM, look at the feature importances to find the most important features for your model, and gain a good initial understanding of the problem. This can be a standalone step, or a quick first step before building a full GLM that’s more readily accepted by stakeholders.

lime

Model interpretation is still a challenge for machine learning models like GBMs. When stakeholders don’t understand a model, they can’t trust it, and as a result there’s no adoption.

However, I feel model interpretation packages like lime are starting to change this. They allow you to examine each model prediction and work out what’s driving the prediction.

Conclusion

I’ve listed my top 10 packages. Have you come across any other useful packages? Please share them in the comments below.

“Exploration is really the essence of the human spirit.” – Frank Borman

This article was originally published on Medium.com

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) licence.

CPD: Actuaries Institute Members can claim two CPD points for every hour of reading articles on Actuaries Digital.
