Strategic Data Science: Creating Value With Data Big and Small

[This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to R-bloggers.]

The post Strategic Data Science: Creating Value With Data Big and Small appeared first on The Lucid Manager.

Data science is without a doubt the most popular business fad of the past decade. The promise of machine learning blinds many managers so they forget about deploying these new approaches strategically. This article provides a framework for data science strategy and is a synopsis of my forthcoming book Principles of Strategic Data Science on LeanPub. The markdown files for the text, images and code of the book are available on my GitHub repository.

What is Data Science?

The term data science emerged in the middle of the last century when electronic computation first became a topic of study. In those days, the discipline was literally a science of storing and manipulating data. The current definition has drifted away from this initial academic activity towards a business activity. The present data science hype can be traced back to a 2012 article in Harvard Business Review, in which Davenport and Patil proclaimed data scientist to be the sexiest job of the twenty-first century. In the wake of this article, the number of Google searches for data science increased rapidly.

Organisations have for a long time used data to improve the lives of their customers, shareholders or society overall. Management gurus promoted concepts such as the data-driven organisation, evidence-based management, business intelligence and Six Sigma to help businesses realise the benefits of their data. Data science is an evolution of these methods enabled by the data revolution.

The Data Revolution

Recent developments in information technology have significantly improved what we can do with data, resulting in what we now know as data science. Firstly, most business processes are managed electronically, which has exponentially increased the amount of available data. Developments in communication, such as the Internet of Things and personal mobile devices, have significantly reduced the price of collecting data.

Secondly, the computing capabilities on the average office worker’s desk outstrip the capabilities of the supercomputers of the past. Not only is it cheaper to collect vast amounts of electronic data, but processing these enormous volumes has also come within reach of the average office worker.

Lastly, developments in applied mathematics and open source licensing have accelerated our capabilities in analysing this data. These new technologies allow us to discover patterns that were previously invisible. Most tools required to examine data are freely available on the internet with a helpful community sharing knowledge on how to use them.

These three developments enabled an evolution from traditional business analysis to data science. Data science is the strategic and systematic approach to analysing data to achieve organisational objectives using electronic computing. This definition is agnostic of the promises of machine learning and leverages the three developments mentioned above. Data science is the next evolution in business analysis that maximises the value we can extract from data.

Data Science Strategy Competencies

The craft of data science combines three different competencies. Data scientist Drew Conway visualised the three core competencies of data science in a Venn diagram.

Data Science Venn Diagram (Conway, 2010).

Firstly and most importantly, data science requires domain knowledge. Any analysis needs to be grounded in the reality it seeks to improve, and subject-matter expertise is necessary to make sense of the investigation. Secondly, data science requires mathematics. Professional expertise in most areas already uses mathematics to understand and improve outcomes, and new mathematical tools expand these traditional approaches to develop a deeper understanding of the domain under consideration. Lastly, computer science is the competency that binds the available data with mathematics. Writing computer code to extract, transform and analyse data to create information and stimulate knowledge is an essential skill for any data scientist.

Good Data Science

To create value with data, we need to know how to create or recognise good data science. The second chapter of the book uses three principles originally introduced two thousand years ago by the Roman architect and engineer Vitruvius. He wrote that buildings need to be useful, sound and aesthetic. These requirements are also ideally suited to define best practice in data science.

The Vitruvian triangle for data science.

For data science to be useful, it needs to contribute to the objectives of an organisation positively. It is in this sense that data science is an applied science and not an academic pursuit. The famous Data-Information-Knowledge pyramid visualises the process of creating value from data.


Useful data science meaningfully improves our reality through data. Data is a representation of either a social or physical reality. Any data source is only ever a sample of the fullness and complexity of the real world. Information is data imbued with context. The raw data collected from reality needs to be summarised, visualised and analysed for managers to understand the reality of their business. This information increases knowledge about a business process, which is in turn used to improve the reality from which the data was collected. This feedback loop is the essence of analysing data in businesses. Data science is a seductive activity because it is reasonably straightforward to create impressive visualisations with sophisticated algorithms. If data products don’t improve or enlighten the current situation, they are in essence useless.

The Reality, Data, Information, Knowledge pyramid.


Data science needs to be sound in that the outcomes are valid and reliable. The validity and reliability of data are where the science meets the traditional approaches to analysing data. Validity is the extent to which the data represents the reality it describes. The reliability of data relates to the accuracy of the measurement. How to assess these two concepts depends on the type of data under consideration: measuring physical processes is less complicated than measuring the social aspects of society. Validity and reliability are in essence a sophisticated way of expressing the well-known Garbage-In-Garbage-Out principle.

The soundness of data science also relates to the reproducibility of the analysis to ensure that other professionals can review the outcomes. Reproducibility prevents the data, and the process by which it was transformed and analysed, from becoming a black box whose results we have no reason to trust. Data science also needs to be sound concerning the governance of the workflow. All data sources need to be curated by relevant subject matter experts to ensure their validity and reliability. Data experts ensure that the data is available to those who need it.
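As a minimal sketch of what reproducibility can look like in code, the Python example below records a checksum of the input data and fixes the seed of a random sampling step, so that a reviewer can confirm they are using the same data and obtain the same result. The function names, the seed and the readings are hypothetical and are not taken from the book.

```python
import hashlib
import random

# Hypothetical meter readings; not real data.
READINGS = [12.1, 11.8, 12.4, 13.0, 11.5, 12.2, 12.9, 11.7]

def fingerprint(rows):
    """SHA-256 digest of the input rows, to confirm reviewers use the same data."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

def sample_mean(rows, k, seed):
    """Mean of a random sample of k rows; the seed makes the sample repeatable."""
    rng = random.Random(seed)
    sample = rng.sample(rows, k)
    return sum(sample) / k

print("data fingerprint:", fingerprint(READINGS))
print("sample mean:", sample_mean(READINGS, k=4, seed=2018))
```

Publishing the fingerprint and the seed alongside the code lets another analyst rerun the same steps and compare the outcome, which is one practical way to avoid a black box.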


Lastly, data science needs to be aesthetic to ensure that any visualisation or report is easy to understand by the consumer of the analysis. This requirement is not about beautification through infographics. Aesthetic data products minimise the risk of making wrong decisions because the information is presented without room for misinterpretation. Any visualisation needs to focus on telling a story with the data. This story can be a comparison, a prediction, a trend or whatever else is relevant to the problem.

One of the essential principles of aesthetic data science is the data-to-pixel ratio. This principle means that we need to maximise the share of pixels on a screen that present information, relative to all the pixels used. Good data visualisation practises austerity so that the people who consume the information understand the story that needs to be told.

Example of low and high data-to-pixel ratio.
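The data-to-pixel ratio can be illustrated with a small Python sketch. The character encoding of the two toy charts is hypothetical, and counting only non-background pixels in the denominator is an assumption borrowed from Tufte's data-ink ratio rather than a definition given in this article.

```python
from collections import Counter

# Hypothetical encoding of a chart as text:
# 'D' = pixel that presents data, 'X' = decoration (borders, gridlines,
# shading), '.' = empty background.
CLUTTERED = [
    "XXXXXXXX",
    "X.D.X.DX",
    "XXDXXXDX",
    "XXXXXXXX",
]
CLEAN = [
    "........",
    "..D...D.",
    "..D...D.",
    "........",
]

def data_to_pixel_ratio(chart):
    """Share of the drawn (non-background) pixels that present data."""
    counts = Counter("".join(chart))
    drawn = counts["D"] + counts["X"]
    return counts["D"] / drawn if drawn else 0.0

print("cluttered:", round(data_to_pixel_ratio(CLUTTERED), 3))
print("clean:", round(data_to_pixel_ratio(CLEAN), 3))
```

Both toy charts show the same four data pixels. The cluttered version spends most of its drawn pixels on decoration, so its ratio is much lower than that of the clean version.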

Strategic Data Science

The data science continuum is a strategic journey for organisations that seek to maximise value from data. As an organisation moves along the continuum, increased complexity is the price of increased value. This continuum is a hierarchy only in the sense that the later stages cannot exist without the earlier ones; all phases are equally important.


Data science continuum.

Collecting data requires careful consideration of what to collect, how to collect it and at what frequency. Collecting meaningful data requires a good understanding of the relationship between reality and the data. There is no such thing as raw data, as all information relies on assumptions and practical limitations.

Describing the data is the first step in extracting value. Descriptive statistics are the core of most business reporting and are an essential first step in analysing the data.
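A minimal Python sketch of this descriptive phase, using only the standard library, might look as follows. The daily order counts are invented for illustration and the `describe` helper is a hypothetical name, not something taken from the book.

```python
import statistics

# Hypothetical daily order counts for one week; not real data.
DAILY_ORDERS = [12, 15, 11, 14, 18, 13, 15]

def describe(values):
    """Summarise a list of numbers with common descriptive statistics."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

for name, value in describe(DAILY_ORDERS).items():
    print(f"{name}: {value}")
```

Summaries like these are what most business reports already contain, which is why they form the base for the later phases.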

Diagnostics or analysis is the core activity of most professions. Each subject area uses specialised methods to create new information from data.

Predictive analysis seems to be the holy grail for many managers. A prediction is not a perfect description of the future but provides the distribution of possible futures. Managers can use this information to change the present to construct their desired future.
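The idea that a prediction is a distribution of possible futures rather than a single number can be sketched in Python. The example below uses a naive bootstrap, which assumes that future weeks resemble past weeks and are independent of each other; the sales history, the function names and the seed are all hypothetical.

```python
import random
import statistics

# Hypothetical history of weekly sales; not real data.
HISTORY = [120, 135, 128, 150, 110, 142, 138, 125]

def simulate_futures(history, horizon, runs=1000, seed=42):
    """Resample past observations to build a distribution of possible totals."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [
        sum(rng.choice(history) for _ in range(horizon))
        for _ in range(runs)
    ]

def summarise(futures):
    """Return the 5th percentile, median and 95th percentile of the totals."""
    cuts = statistics.quantiles(futures, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    return {"p05": cuts[0], "median": statistics.median(futures), "p95": cuts[-1]}

futures = simulate_futures(HISTORY, horizon=4)
print(summarise(futures))
```

The output is a range of plausible four-week totals rather than one forecast, which is the kind of information a manager can use to weigh risks before acting.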

Prescriptive analysis uses the knowledge created in the previous phases to automatically run business processes and even decide on future courses of action.

Any organisation starting with data science should follow the five phases in this process and not jump ahead to try to bypass the seemingly less valuable stages.

The Data-Driven Organisation

Implementing a data science strategy is more than a matter of establishing a specialised team and solving complex problems. Creating a data-driven organisation that maximises the value of data requires a whole-of-business approach that involves people with the right attitude and skills, appropriate systems and robust processes.

A data science team combines the three competencies described in the Conway Venn diagram. People who have skills in all three of these areas are rare, and the industry calls them unicorns. There is no need for recruiters to start hunting unicorns because these three areas of expertise can also exist within a team. Possibly more important than the technical skills are the social skills of a data scientist. Not only do data scientists need to create useful, sound and aesthetic data science, they also need to convince the consumers of their work of its value.

One of the problems of creating value with data is ensuring that the results are implemented in the organisation. A starting point to achieve this is to ensure that the users of data products have a relevant level of data literacy. Developing data literacy among the consumers of data science is perhaps the greatest challenge. The required level of data literacy depends on the type of position and the role of the data consumer within the organisation.

Data scientists use an extensive range of tools and are often opportunistic in their choice of software. Spreadsheets are not well suited to creating good data science. Data science requires coding skills, and the Python and R languages are powerful tools to solve complex problems. After the data specialists have developed the best way to analyse the data, they need to communicate the results to their customers. Many specific products exist to communicate data to users with interactive dashboards and many other dynamic systems.

The final part of the book delves into the ethics of data science. From the fact that something can be done, we cannot conclude that it should be done. Just like any other profession that impacts humans, data scientists need ethical guidelines to ensure that their activities cause no harm. The book provides some basic guidelines that can help data scientists assess the ethical merits of their projects.
