Which data science skills are important ($50,000 increase in salary in 6-months)

[This article was first published on business-science.io, and kindly contributed to R-bloggers].

In late September 2021, David was a Research Analyst with Texas A&M University.

In March of 2022, less than 6-months later, he accepted a position with Microsoft as a Machine Learning Support Engineer. In one of my webinars, David explained that he had just increased his salary by $50,000.

I told him that’s just the beginning.

How was David able to get a job at an elite company like Microsoft?

What skills was David learning that he could transition so quickly?

The rest of this post will show you how David did it. This post includes research and 2 surveys all to answer questions like…

  • What skills separate data scientists like David from everyone else
  • How to pick a language (and your choice may surprise you)
  • How to learn the skills
  • Why this approach to learning skills works

1. The skills that separated David from everyone else

If you want to become a data scientist, you’ll need to learn how to generate value for your organization. I’ve written about how data scientists create value for organizations here. But, in general, you complete a process (called the data science process), which involves learning these data science skills.

The skills needed to become a data scientist

David learned these skills and was able to convince employers to hire him. The result was an instant $50,000 increase to his salary, not to mention a career that he’s super excited about.

Further to my point, according to Glassdoor, learning these data science skills can turn into a $126,722 career (that’s for Pittsburgh, PA, where I live; I encourage you to check your own locale).

Glassdoor: Data Scientist Earnings in Pittsburgh, PA

But that’s just the start.

Like I told David, your career will accelerate.

What’s after data scientist?

After data scientist comes senior data scientist.

Here’s what the career path looks like for a senior data scientist in Pittsburgh, PA.

Glassdoor: Senior Data Scientist Earnings in Pittsburgh, PA

Now I know what you’re thinking: “That salary is great. BUT, I’ll never be able to master this list of skills. Especially not in 6-months.”

Actually you can.

Here’s how.

How to master learning data science

Mastering data science isn’t hard. It just requires:

  • Motivation: You’ll need to dedicate about 10-hours per week
  • A plan: You’ll learn from David and several others in this post

You’ll need to start by picking a language.

2. How to pick a language

Python or R: who do you think is going to win this battle?

Well it’s neither.

Because C++ is the true superior programming language.

I’m just kidding.

But in truth, it actually doesn’t matter. You can succeed with either.

I know both.

I teach both.

But, if we want to really answer this question, we should tackle this like data scientists. You know, with data to support our decision.

So, let’s tackle this like data scientists.

Here’s how to pick a language

If I were picking a language for the first time, I would consider a few things:

  1. How useful is the language for data science
  2. The demand in the job market
  3. The competition in the job market

How useful is the language for data science?

If you look at the history of Python, it is clearly described as a general-purpose, high-level programming language, with an emphasis on code readability and on expressing concepts in fewer lines of code.

GeeksforGeeks: History of Python

Meanwhile if you look at R, it was closely modeled on the S language for statistical computing and graphics.

R-project: What is R?

So Python is a general purpose language (but has been adapted for many tasks like data science) while R has been developed for the sole purpose of statistics.

But I wasn’t satisfied with that, so I dug a little deeper. Here’s what I’ve found.

  • Python: Great for Machine Learning and Deep Learning but misses the mark on reporting (very important) and has fewer libraries for important analyses like econometrics.
  • R: Has well developed tools for business analysis and data science. Strong in everything except deep learning. But, deep learning is rarely used. And when you need deep learning or extra APIs, you can integrate R with Python.

So I’m going to give this one to R.

The demand in the job market

Next is demand in the job market for Python and R. Currently there are 21,271 Data Scientist jobs for Python.

And, there are 8,713 Data Scientist Jobs for R.

So for every 1 R data science job there are 2.4 for Python.

I’ll give this one to Python.

The competition in the job market

Next, what we need to consider is how many people you will be competing against to get these jobs.

  • Python: There are over 8,000,000 people that know Python (and that number is growing fast)
  • R: It’s estimated that 250,000 to 2,000,000 people know R, and that number is also growing fast.

So for every 1 R user there are potentially 4 to 32 Python users.

So R positions are going to be less competitive, potentially by 10X or more at the low end of the R user estimate. Dang!

This one clearly goes to R.
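
As a quick check on the arithmetic in the last two sections, here is a minimal Python sketch that recomputes the ratios from the figures quoted above. The variable names are illustrative, and "users per job" is an assumed, rough proxy for competition rather than anything defined in the original analysis.

```python
# Figures quoted in the text above.
python_jobs = 21_271
r_jobs = 8_713
python_users = 8_000_000
r_users_low, r_users_high = 250_000, 2_000_000

# Demand: Python data science jobs per R data science job.
jobs_ratio = python_jobs / r_jobs

# Competition: Python users per R user, at both ends of the R estimate.
users_ratio_low = python_users / r_users_high
users_ratio_high = python_users / r_users_low

# Assumed proxy for competition: people who know the language per open job.
python_users_per_job = python_users / python_jobs
r_users_per_job_low = r_users_low / r_jobs
r_users_per_job_high = r_users_high / r_jobs

print(f"Python jobs per R job: {jobs_ratio:.1f}")
print(f"Python users per R user: {users_ratio_low:.0f} to {users_ratio_high:.0f}")
print(f"Users per job, Python: {python_users_per_job:.0f}")
print(f"Users per job, R: {r_users_per_job_low:.0f} to {r_users_per_job_high:.0f}")
```

On these numbers, Python has roughly 376 users per job against roughly 29 to 230 for R, so how much less competitive R looks depends heavily on which R user estimate you believe.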

R is a solid choice

R is a solid choice, and it’s one of the reasons that students like David are able to quickly transition into a data science role. And keep in mind, you can always pick up Python later.

What about Excel?

At this point I always get a question, “What about Excel?”

And my thought is this:

You can use any tool you’d like if it gets the organization results – R, Python, Excel, Tableau, PowerBI. All are great. BUT each has strengths and weaknesses.

Excel is great as a communication tool:

  • Everyone has it
  • Business people like it.

Excel has the following limitations:

  • Cannot do machine learning well. Machine learning is essential for modeling and explanations.
  • Cannot handle large data well (a worksheet maxes out at 1,048,576 rows, roughly 1 million, which is not very big)
  • Functions are buried in cells, which leads to errors and difficult debugging.

And yes, this is the Blue Screen of Death, and I used to get this constantly when doing data analysis in Excel.

So please use Excel wisely.

3. How to pick a development tool

Next, it’s time to pick an integrated development environment (IDE), which is just a fancy term for the thing I type code into.

The RStudio IDE: The thing I type code into

I ran a poll to see what everyone’s using for R (and I did the same thing for Python too if you want to see those results).

Survey 1: What’s your favorite way to code in R?

R Poll Results

Here are the results.

It’s a landslide victory for RStudio.

So, if you are going to learn R, pick RStudio. Easy peasy.

Survey 2: What’s your favorite way to code in Python?

Python Poll Results

I ran the same poll for Python. And here’s where it gets more complicated.

  • About half enjoy coding in Jupyter
  • A third like VSCode, and
  • Some are even using RStudio to code in Python!

Keep in mind that many of my 61,000+ followers on LinkedIn are likely people who follow my content and are therefore interested in R programming in addition to Python.

But still, it’s not an easy decision for Python users to pick an IDE.

In fact, I got a ton of comments for Spyder and half a dozen other random IDEs.

4. How to learn the data science skills

Once you settle on a language and IDE, you’re ready to begin the fun process of learning the skills to become a data scientist.

At this point you need a plan. Why?

Data Science Skills

…Because your goal should be to get a data science job as fast as possible. The market is crazy right now. But, eventually the market will cool and you’ll be outa-luck.

What about soft skills?

I always get this question at this point. I can hear it now.

“Matt, everything you’ve shown is technical skills. What about communication skills?”

Yes – you absolutely need those too. But you’ve also been learning those all your life. And if you haven’t, then add these 3 things to your arsenal:

  1. Making a slide deck
  2. Presenting your findings in a report
  3. Being nice when you talk to people.

If you do those 3 things consistently, you will be promotable. And people will want to work with you.

Especially focus on #3 (Being nice).

The 3 learning paths (choose wisely)

There are 3 types of data science learning paths:

  1. Those that have no plan. These are hobbyists. They usually quit. This costs them roughly $8,000,000 over a 35-year career when factoring in a measly 3-percent annual raise.
  2. Those that have a crappy plan. They will take 5-years, but will eventually learn data science. They will also lose out financially because it took them sooo long to learn data science. 5-years at $125,000 per year when factoring in a low 3-percent raise = loss of about $664,000. Ouch!
  3. Those that have an exceptional plan. They are likely to be successful and can complete the transition in under 6-months.
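
The dollar figures in the list above can be sanity-checked with a short Python sketch. It assumes a $125,000 starting salary (the figure used in item 2), a 3-percent raise applied once per year, and that the "cost" is simply the sum of the salaries not earned. The function name `forgone_salary` is hypothetical.

```python
def forgone_salary(start_salary: float, annual_raise: float, years: int) -> float:
    """Sum the salary missed over `years`, growing by `annual_raise` each year."""
    total = 0.0
    salary = start_salary
    for _ in range(years):
        total += salary
        salary *= 1 + annual_raise  # apply the raise for the following year
    return total


five_year_cost = forgone_salary(125_000, 0.03, 5)
career_cost = forgone_salary(125_000, 0.03, 35)

print(f"5-year delay: ${five_year_cost:,.0f}")
print(f"35-year career: ${career_cost:,.0f}")
```

Under these assumptions the 5-year figure comes out at about $664,000, matching the text, and the 35-year figure at about $7.6 million, in the same ballpark as the rounded $8,000,000.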

Now, keep in mind, I actually had a pretty crappy plan. And it seriously took me 5-years. And it cost me a lot financially too. But whatever. At least I made it.

But, students like David have an exceptional plan. They made it in 6-months. And, it involves cheating.

It’s OK to cheat…

And in the real world, to learn data science fast you need to cheat. What I mean is use a cheat sheet. Here’s my R-Cheat Sheet that will help you learn the skills you need.

The Ultimate R Cheat Sheet. It’s OK to cheat.

Here’s how to cheat.

Learn the foundational skills first (save Machine learning for later)

Now I know what half of you are going to do.

You’re going to jump right into Machine Learning. It’s A BIG MISTAKE. Don’t do that.

Instead, learn the foundations.

These are the skills you are going to use every day. I call them 80/20 skills.

They are the skills that help you early on in your process.

Things like:

  • Importing data: Working with databases, connecting to SQL, readr, readxl
  • Transforming data: Working with outliers, missing data, reshaping data, aggregation, filtering, selecting, calculating, and many more critical operations, dplyr and tidyr packages
  • Visualizing Data: Communicating through Interactive and Static Visualizations, ggplot2 and plotly
  • Time Series: Working with date/datetime data, aggregating, transforming, visualizing time series, timetk package
  • Text: Working with text data, stringr
  • Categorical data: Working with categories, forcats package
  • Functional Programming: Making reusable functions, sourcing code
  • Reporting: Making reports in interactive HTML and static PDF formats

It’s the honest truth. Listen, if you focus on these core foundational skills, it will make machine learning so much easier.

How to learn modeling (and machine learning)

Now it’s time to take the training wheels off. Machine Learning!

Now you’re probably thinking…

What about maths, stats, and algorithms?

At this point, a logical question is – “What about maths, stats, and algorithms?”

Here are my two cents.

The Popular Opinion: Take 5-years and study theory, maths, learn how to code algorithms from scratch.

The Smart (Fast) Way: Learn in tandem while you apply machine learning in projects

The only way I’ve ever been successful with learning new algorithms is by experimenting and applying.

I’m talking about actually applying data science to projects I’m working on.

The process involves:

  1. applying machine learning to problems,
  2. experimenting with different algorithms, and
  3. seeing the results on real applications.

If you do this on real projects then you will in fact learn maths, stats, and algorithms.

What machine learning tools should I learn?

If we head on back to my cheat sheet, on page 3 you’ll find links to my go-to machine learning tools.

I’m a big fan of two packages (or ecosystems):

  1. Tidymodels: I use this for making ad hoc models and then explaining them
  2. H2O: I use this for automatic machine learning and in production

Time series is a money saver

Next, if you are interested in becoming insanely valuable to your organization, then learn time series.

Organizations are fans of saving money. And if you can predict the future, then chances are you are going to be very valuable to your company.

Time series analysis and forecasting are two of the most in-demand skills. Why? A 5% improvement in a forecast can save a company like Walmart $50,000,000 each year.

So Walmart will pay an arm and a leg for someone that can help them improve that area.

What time series tools should I learn?

Let’s head back to the cheat sheet, and check out the “Time Series Analysis” and “Forecasting” sections on page 3.

Here’s what you need to learn:

  • Time Series Analysis: Working with date/datetime data, aggregating, transforming, visualizing time series, timetk package
  • Forecasting: ARIMA, Exponential Smoothing, Prophet, Machine Learning (XGBoost, Random Forest, GLMnet, etc), Deep Learning (GluonTS), Ensembles, Hyperparameter Tuning, Scaling to 1000s of forecasts, modeltime package
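
To give a feel for the simplest method on that list, here is a from-scratch Python sketch of simple exponential smoothing. It is only an illustration of the idea, not the modeltime package's interface; the function name, the smoothing weight `alpha`, and the sales numbers are all made up for the example.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast after smoothing `series`.

    Each new level is a weighted average of the latest observation
    (weight alpha) and the previous level (weight 1 - alpha).
    """
    if not series:
        raise ValueError("series must contain at least one value")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # initialise the level with the first observation
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical monthly sales figures.
monthly_sales = [10, 12, 14]
forecast = simple_exponential_smoothing(monthly_sales, alpha=0.5)
print(forecast)
```

A larger `alpha` reacts faster to recent values, while a smaller one smooths more heavily.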

Once you have those skills in the bank, then it’s time to move onto production.

How to take models into production (what the heck is this?)

Your model is worthless…

Until someone can use it to do something productive like…

Examples:

  • Call a customer that is on the bleeding edge of unsubscribing because of high complaint volume
  • Review more accurate forecast information before placing a $1,000,000 order for parts that could be unnecessary

It’s at this point that your hard work pays off. And you provide value.

But how do you give the decision-makers the help they critically need?

This is putting applications into production.

The Application

A Shiny Application

One of the truly amazing things is the ability to integrate models into applications.

We can use applications to automate the analysis process.

And users can simply click buttons, use drop-downs, and get information, all without ever knowing that R (or Python) is truly running code behind the scenes.

The particular application shown above was made with a tool called shiny.

How to learn shiny

If we take a look at the cheat sheet, we can see on page 2 the “Shinyverse”, an ecosystem of R packages that can be used to create powerful applications that run R or Python behind them.

5. How to earn a $50,000 salary bump in 6-months

With everything we’ve covered in this post, you have all of the information you need to learn data science.

But you still don’t have a plan to do it fast.

In fact, it will still take a minimum of 2 years (or longer) to learn on your own.

But what if you could do it in 6-months?

How amazing would it be to have a 6-figure career that you love?

And in the process earn $125,000 per year or more until you retire and have the financial freedom to do the things you enjoy.

Remember David’s story: how he just increased his salary by $50,000.

How did David accomplish the impossible in under 6-months?

Here’s what he is doing.

David is learning data science in my R-Track Program.

If you are ready to learn, I am ready to teach.
