Getting started with Data Science
In data science, we keep learning by experimenting with tools and solving problems. As with cooking, intuition is great, but you can’t always decide what to cook based only on what’s in the fridge.
There are five things Members can do when getting started with data science:
- Get the tools and know the tools
- Read the Actuaries’ Analytical Cookbook
- Share your work on GitHub
- Join Hackathon
- Join Kaggle
Get the tools and know the tools
There is no need for a shiny new computer; you can use whatever you have. What is important is setting up a separate user account for your data science work, as this helps keep your data safe, prevents the accidental deletion of files and avoids installing unwanted apps.
If you are more adventurous and have a spare old computer that’s collecting dust, why not install a Linux distribution? You can find out more about my recent holiday project, transforming an older computer into a data science machine, here. Modern versions of Linux are much easier to set up than they used to be and run well on old hardware. In fact, my ‘Linux box’ is 12 years old and is still going strong.
Why Linux? Because it is Free/Libre software, you have the freedom to use and modify the operating system, and you’ll learn a lot along the way.
Read the Actuaries’ Analytical Cookbook
Reading the Actuaries’ Analytical Cookbook is a must as it:
- Includes a series of data and analytics recipes to help Actuaries get started quickly with a new project
- Is intended to be a resource for Actuaries in both data science and traditional fields.
For starters, I recommend following the guides on how to install and start with Python and R:
- Python is a beginner-friendly yet powerful programming language and is often the first programming language children learn. It is so popular that, if you have a problem to solve, someone else has often already solved it and written Python code for it too!
- R is a popular statistical programming language that is often used in data mining.
So, should you choose Python or R? I recommend starting with the one you’re more familiar with. I started with R as it’s similar to MATLAB, which I used in engineering courses. Over time, knowing both is essential as each programming language has its own strengths.
Share your work and collaborate on GitHub
GitHub is a platform often used by software developers to share code and collaborate, but it can be used for other projects too. In fact, the Actuaries’ Analytical Cookbook is stored on GitHub.
For data science, GitHub can be used as a backup repository, a collaboration tool and a version control system. Much like the Track Changes feature in Microsoft Office, it records who changed what. It also lets you take snapshots of your code, undo changes, revert to an earlier version, and merge changes into the main code only after testing.
I use GitHub to write tutoring notes. I leave the main tutoring notes unedited while I draft new sections. Once I have finished the new sections, I merge them into the main notes to create a new version.
The Actuaries’ Analytical Cookbook has more on this topic, which you can check out under Workflow Management – Version Control.
Join Hackathon
The Hackathon is the Actuaries Institute’s annual event where Actuaries collaborate to provide solutions to business problems posed by not-for-profit (NFP) organisations. It is an excellent opportunity to test your knowledge while giving back to society.
Join Kaggle
Want to apply what you know and compete too? Kaggle is the website to go to! Kaggle hosts data analytics competitions in which companies provide the raw data. This means you get to apply your data cleansing skills, use various data science algorithms to detect patterns, and submit your answers to see how your algorithms performed.
The Institute has previously run competitions on Kaggle, allowing Members to showcase their skills in data analytics.
I hope this guide can help Members, especially Student Members, get started with their data science journey. I’m still on the journey myself and found these resources useful when building my skillset:
- Missing Semester (2020): This YouTube video discusses topics relevant to data science.
- Intro to Machine Learning with R & caret: This is another YouTube video that provides insights into the advanced use of the R programming language in data science (machine learning).
- Knowing the right verbs makes online searches easier. For example, ‘parsing’ means breaking raw text into structured parts, a data cleansing step that helps detect errors in data entries.
- To learn to match text patterns, visit Regular Expressions (regex). There’s even a crossword to practice your regex skills.
- Do not be afraid to use the command line interface. It tends to offer more options than a graphical user interface, and it is easier to share what you did by copying and pasting commands than by explaining what you clicked.
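To make the parsing and regex tips above a little more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the sample entries, the `POL-` policy ID layout and the function names `parse_entry` and `parse_all` are my own assumptions, not something taken from the Cookbook. It uses a regular expression to split each raw line into its parts, then checks whether the date and premium make sense.

```python
import re
from datetime import datetime

# Hypothetical raw entries, as they might arrive from a hand-typed spreadsheet.
RAW_ENTRIES = [
    "POL-00123, 2021-03-15, 1200.50",
    "POL-00456, 2021-02-30, 980.00",  # 30 February does not exist
    "pol-789, 2021-04-01, 450",       # policy ID is in the wrong format
    "POL-00999, 2021-05-20, n/a",     # premium is not a number
]

# Assumed layout: policy ID, ISO-style date, then the premium text.
LINE_PATTERN = re.compile(r"^(POL-\d{5}),\s*(\d{4}-\d{2}-\d{2}),\s*(.+)$")


def parse_entry(line):
    """Return (record, error); exactly one of the two is None."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None, "line does not match the expected layout"
    policy_id, date_text, premium_text = match.groups()
    try:
        # The regex only checks the shape; strptime checks the date is real.
        start_date = datetime.strptime(date_text, "%Y-%m-%d").date()
    except ValueError:
        return None, f"invalid date: {date_text}"
    try:
        premium = float(premium_text)
    except ValueError:
        return None, f"invalid premium: {premium_text}"
    record = {"policy_id": policy_id, "start_date": start_date, "premium": premium}
    return record, None


def parse_all(lines):
    """Split the lines into clean records and (line number, error) pairs."""
    records, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        record, error = parse_entry(line)
        if error is None:
            records.append(record)
        else:
            errors.append((line_number, error))
    return records, errors


if __name__ == "__main__":
    records, errors = parse_all(RAW_ENTRIES)
    print(f"{len(records)} clean record(s)")
    for line_number, error in errors:
        print(f"line {line_number}: {error}")
```

The regex only confirms that a line has the right shape, which is why the date and premium are checked separately afterwards. Real data will have its own layout, so treat the pattern as a starting point rather than a rule.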
CPD: Actuaries Institute Members can claim two CPD points for every hour of reading articles on Actuaries Digital.