The first step to becoming a top performing data scientist

[This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers.]

Nearly every day, I see a new article talking about the benefits of data: “data will change the world” … “data is transforming business” … “data is the new oil.”

Setting aside the hyperbolic language, most of this is true.

So when you hear that “data scientist is the sexiest job of the 21st century,” you should mostly believe it. Companies are fighting to hire great data scientists.

But there’s a catch.

Even though there’s a huge demand for data scientists, a lot of people who study data science still can’t get jobs.

I regularly hear from young data science students who tell me that they can’t get a job. Or they can get a “job,” but it’s actually an unpaid internship.

What’s going on here?

The dirty little secret is that companies are desperate for highly skilled data scientists, not just anyone with a data science certificate.

Companies want data scientists who are great at what they do. They want people who create more value than they cost in terms of salary.

What this means is that to get a data science job, you actually need to be able to “get things done.”

… and if you want a highly-paid data job, you need to be a top performer.

I can’t stress this enough: if you want to get a great data science job, certificates aren’t enough. You need to become a top performer.

Your first steps towards becoming a top performer

Your first step towards becoming a top-performing data scientist is mastering the foundations:

  • data visualization
  • data manipulation
  • exploratory data analysis

Have you mastered these? Have you memorized the syntax to accomplish these? Are you “fluent” in the foundations?

If not, you need to go back and practice. Believe me. You’ll thank me later. (You’re welcome.)

The reason is that these skills are used in almost every part of the data science workflow, particularly early in your career.

For almost any data task, you’ll need to clean your data, visualize it, and do some exploratory data analysis.
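To make that concrete, here is a minimal sketch of the cleaning and exploring steps in R. The data frame df.sales_raw and its columns are made up for illustration (they are not from any real dataset), and the sketch assumes the dplyr and tibble packages are installed.

```r
library(dplyr)

# Hypothetical raw data (invented for this example):
# inconsistent region labels and one missing revenue value
df.sales_raw <- tibble::tribble(
  ~region,  ~revenue,
  "north ", 120,
  "North",  80,
  "south",  NA,
  "South",  250
)

# Clean: standardise the labels and drop rows with missing revenue
df.sales <- df.sales_raw %>%
  mutate(region = tolower(trimws(region))) %>%
  filter(!is.na(revenue))

# Explore: row count and revenue summary per region, largest total first
df.summary <- df.sales %>%
  group_by(region) %>%
  summarise(n = n(),
            total_revenue = sum(revenue),
            mean_revenue = mean(revenue)) %>%
  arrange(desc(total_revenue))

print(df.summary)
```

Nothing here is advanced. It is a handful of dplyr verbs chained together, which is exactly the kind of code that should come without looking anything up.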

Moreover, these skills remain important as you move into more advanced topics. Do you want to start doing machine learning, artificial intelligence, and deep learning? You had better know how to clean and explore a dataset. If you can’t, you’ll basically be lost.

“Fluency” with the basics … what does this mean?

I want to explain a little more about what I mean by “mastering the foundations.” By “mastery,” I mean something like “fluency.”

As I’ve said before, programming languages are a lot like human languages.

To communicate effectively and “get things done” in a language, you essentially need to be “fluent.” You need to be able to express yourself in that language, and you need to be able to do so in a way that’s accurate and performed with ease.

Granted, you can “get by” without fluency, but you couldn’t expect to be hired for a language-dependent job without fluency.

For example, do you think you could get a job as a journalist at the New York Times if you hadn’t mastered basic English grammar? Do you think you could get a job at the Wall Street Journal if you needed to look up 50% of the words you used?

Of course not. If you wanted to be a journalist (in the USA), you would absolutely need to be fluent in English.

Data science is similar. You can’t expect to get a paid job as a data scientist if you’re doing Google searches for syntax every few minutes.

If you eventually want a great job as a data scientist, you need to be fluent in writing data science code.

Can you write this code fluently?

Here’s an example. This is some code to analyze some data.

Ask yourself, can you write this code fluently, from memory?


# In this post, we will be using data from an analysis
# performed by PwC
# source:

# tidyverse provides tribble() (tibble), ggplot2, and fct_reorder() (forcats)
library(tidyverse)

df.ai_growth <- tribble(
                  ~region, ~ai_econ_growth
                  ,"China", 7
                  ,"North America", 3.7
                  ,"Northern Europe", 1.8
                  ,"Africa, Oceania, & Other Asia", 1.2
                  ,"Developed Asia", .9
                  ,"Southern Europe", .7
                  ,"Latin America", .5
                  )

# (bar chart)

ggplot(data = df.ai_growth, aes(x = region, y = ai_econ_growth)) +
  geom_bar(stat = 'identity')

# - flip the axes so the region names are readable
#   (assumed step: coord_flip(), as in the finished chart below)

ggplot(data = df.ai_growth, aes(x = region, y = ai_econ_growth)) +
  geom_bar(stat = 'identity') +
  coord_flip()

# - reorder by econ growth
#   using forcats::fct_reorder()

ggplot(data = df.ai_growth, aes(x = fct_reorder(region, ai_econ_growth), y = ai_econ_growth)) +
  geom_bar(stat = 'identity') +
  coord_flip()


# (custom theme: dark panel, muted grid lines, larger titles)

theme.futurae <- theme(text = element_text(family = 'Gill Sans', color = "#444444")
                       ,panel.background = element_rect(fill = '#444B5A')
                       ,panel.grid.minor = element_line(color = '#4d5566')
                       ,panel.grid.major = element_line(color = '#586174')
                       ,plot.title = element_text(size = 26)
                       ,axis.title = element_text(size = 18, color = '#555555')
                       ,axis.title.y = element_text(vjust = .7, angle = 0)
                       ,axis.title.x = element_text(hjust = .5)
                       ,axis.text = element_text(size = 12)
                       ,plot.subtitle = element_text(size = 14)
                       )


# (finished chart: reordered, flipped, labeled, and styled with the theme)

ggplot(data = df.ai_growth, aes(x = fct_reorder(region, ai_econ_growth), y = ai_econ_growth)) +
  geom_bar(stat = 'identity', fill = 'cyan') +
  coord_flip() +
  labs(x = NULL
       , y = 'Projected AI-driven growth (Trillion USD)'
       , title = 'AI is projected to add an additional\n$15 Trillion to the global economy by 2030'
       , subtitle = '...strongest growth predicted in China & North America') +
  annotate(geom = 'text', label = "source:", x = 1, y = 6, color = 'white') +
  theme.futurae

You should be able to write most of this code fluently, from memory. You shouldn’t need more than an occasional Google search or other external resource.

Will you maybe forget a few things? Sure, every now and again. Will you write it all in one go? No. Even the best data scientists write code iteratively.
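As a rough illustration of what “iteratively” can look like, here is a sketch that builds a small data manipulation pipeline one step at a time. It uses the same figures as the data frame in the example (recreated here so the snippet runs on its own); the names step1, step2, and step3 are mine, and the sketch assumes the dplyr and tibble packages are installed.

```r
library(dplyr)

# Same figures as the article's df.ai_growth
# (projected AI-driven growth, trillion USD)
df.ai_growth <- tibble::tribble(
  ~region,                         ~ai_econ_growth,
  "China",                         7,
  "North America",                 3.7,
  "Northern Europe",               1.8,
  "Africa, Oceania, & Other Asia", 1.2,
  "Developed Asia",                .9,
  "Southern Europe",               .7,
  "Latin America",                 .5
)

# Step 1: sort the regions, largest projected growth first
step1 <- df.ai_growth %>%
  arrange(desc(ai_econ_growth))

# Step 2: add each region's share of the total
step2 <- step1 %>%
  mutate(share = ai_econ_growth / sum(ai_econ_growth))

# Step 3: add the running (cumulative) share
step3 <- step2 %>%
  mutate(cumulative_share = cumsum(share))

print(step3)
```

Each step is small enough to run and check before moving on. Once the result looks right, the three steps can be collapsed into a single pipeline.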

But in terms of remembering the syntax, you should know most of this cold. You should be able to write most of it from memory.

That’s what fluency means.

… and that’s what it will take to be one of the best.

Mastering data science is easier than you think

I get it. This probably sounds hard.

I don’t want to lie to you. It’s not “easy” in the sense that you can achieve “fluency” without any effort.

But it is much easier than you think. With some discipline, and a good practice system, you can master the essentials within a couple of months.

If you know how to practice, within a few months you can learn to write data science code fluently and from memory.

Discover how to become a top-performing data scientist

If you want to become a top-performing data scientist, then make sure you sign up for our email list.

Next week, we will re-open enrollment for our data science training course, Starting Data Science.

This course will teach you the essentials of data science in R, and give you a practice system that will enable you to memorize everything you learn.

Want to become a top performer? Our course will show you how.

Sign up for our email list and you’ll get an exclusive invitation to join the course when it opens.


The post The first step to becoming a top performing data scientist appeared first on SHARP SIGHT LABS.
