The myth of the missing Data Scientist

January 7, 2013

(This article was first published on Cartesian Faith » R, and kindly contributed to R-bloggers)

Much has been said about the dire shortage of Data Scientists looming on the horizon. With the spectre of Big Data casting shadows over every domain, it would seem we need nothing short of a caped wonder to help us see the light. Heralded as superheroes, Data Scientists will swoop into an organization and free the Lois Lane of latent knowledge from the cold clutches of Big Data. In the end the enterprise bystanders will marvel at the amazing powers these superhumans possess. Everyone will be happy and the Data Scientist will get the girl.

It’s a great story and a great time to be a nerd. As much as I want to believe in this story, I just don’t buy it. True, there is more data being produced now than ever before. The rate of data production is growing exponentially and people need to be able to analyze this data. Yet this dire need feels manufactured. The promoters of Data Science point to the McKinsey study that cites a “shortage of 140,000 – 190,000 people with deep analytical skills” by 2018. That’s a lot of Data Scientists! Some people claim that every organization will eventually need at least one Data Scientist and perhaps even a whole department of them. This all sounds fantastic (who wouldn’t want a legion of super-nerds to be a force in culture?) except that the hyperbole surrounding Data Science has three significant problems: selection bias, assimilation blindness, and automation blindness. What we’ll see is that the need for Data Scientists is likely smaller than advertised, with a startlingly short half-life.

Selection Bias

The first problem is that people assume that the roughly 150,000 “people with deep analytical skills” (taking a round figure from within McKinsey’s range) are all Data Scientists. First, let’s look at the math. Suppose every organization does need at least one Data Scientist. We start with the number of public companies listed on major exchanges in the US as a proxy for “every organization”, which is about 5,000. Why is this a reasonable proxy? Because smaller companies probably don’t have the budget to support full-time Data Scientists. Adding businesses listed in OTC markets, we can roughly double that number. Fine, so let’s say 10,000 companies. If all 150,000 were Data Scientists, then on average each organization would have a team of 15. Wow, I see a lot of dollar signs piling up alongside the map-reduce queries.
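For anyone who wants to check the back-of-envelope math, here is a minimal sketch in Python. The inputs are the rough figures quoted above (about 5,000 listed companies, doubled for OTC markets, and McKinsey's 140,000 to 190,000 range); the variable and function names are my own illustrative choices, and the whole thing is an estimate under those assumptions rather than a rigorous count.

```python
# Rough inputs quoted in the text; none of these are precise counts.
MCKINSEY_LOW = 140_000      # low end of the cited shortage
MCKINSEY_HIGH = 190_000     # high end of the cited shortage
ROUND_FIGURE = 150_000      # round number used in the text
LISTED_COMPANIES = 5_000    # approx. US companies on major exchanges
OTC_MULTIPLIER = 2          # "roughly double" once OTC markets are added


def average_team_size(shortage, organizations):
    """Average head count per organization if the whole shortage
    were spread evenly across every organization."""
    return shortage / organizations


organizations = LISTED_COMPANIES * OTC_MULTIPLIER

average_team = average_team_size(ROUND_FIGURE, organizations)
low_team = average_team_size(MCKINSEY_LOW, organizations)
high_team = average_team_size(MCKINSEY_HIGH, organizations)

print(f"Organizations assumed: {organizations}")
print(f"Average team size at the round figure: {average_team:.0f}")
print(f"Range across the McKinsey estimate: {low_team:.0f} to {high_team:.0f}")
```

Even at the low end of the range, the assumption that every one of these people is a Data Scientist implies a double-digit team at every listed company, which is the implausibility the argument rests on.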

Clearly there must be other professions that require analytical skills that aren’t Data Scientists. Look at the cross section of people that use R and you’ll see people in Psychology, Economics, Biology, Finance, etc. The biggest population by far is the traditional group you think of when you think analysis: engineering. McKinsey hints at this when they list the Internet of Things as being one of the sources for the exponential growth in data. This version of the future, popularized by GE, points to Computational Engineers as filling most of this population. When GE alone is hiring 400 people to fill one development center, it’s plausible that the net shortage could reach hundreds of thousands.

Assimilation Blindness

The next problem is what I call assimilation blindness. Even if a shortage of this scale did exist for Data Scientists, it wouldn’t be sustained. As understanding of Big Data and analytical methods becomes more widespread, the need for specialists will diminish. A good example is how web developers used to be a prized resource but are now commoditized, since even high school students can build web sites (or iPhone apps for that matter). Data Scientists will find that their role is assimilated quickly, since it differs from traditional roles only by having a big data component. What is the role of a Data Scientist? It is still up for debate, but here are some of the most popular themes I’ve seen:

  • Telling stories with data (including visualization) – This is what marketers do. As tools become easier to use and analytical methods more pervasive, presumably many people in the marketing department will know how to take advantage of these tools directly rather than relying on a Data Scientist.
  • Finding insights in data – This is what business analysts do. They’ve been trained to use analytical tools for years and know how to spot interesting phenomena in data. The tool set is different, as is the scale of the data, but given that most business analysts know a little SQL and basic statistics, it isn’t a stretch to conclude that they would assimilate many of the functions a Data Scientist fills.
  • Creating products from data – This is what product managers do. In finance there are plenty of data products, and they aren’t managed or invented by Data Scientists. As data products become more mainstream, more people in the product management arena will know how to ask questions of data directly because they will have learned these skills themselves.

Hence while there may be a shortage in the short term, over time the Data Scientist will lose his cape and disappear into the crowd.

Automation Blindness

The functions of a Data Scientist that aren’t assimilated will likely be automated away. Not recognizing this phenomenon is what I call automation blindness. Numerous startups and big players such as IBM are developing tools to simplify big data analysis. Currently a big portion of a Data Scientist’s role is bringing together data from disparate sources to make an analysis possible. Once this is automated, the need for specialists will again decline.

In short, the shortage of Data Scientists is shrouded in the myths of storytellers. There is definitely a need for people with analytical skills, and we will see this need separate into skills that are generally assimilated and advanced skills used by engineers to design tools and systems that rely on data for their proper function.


