Demystifying data science terminology

[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers.]

The language used by data scientists can be confusing to anyone encountering it for the first time. Ever-changing best practices and constantly evolving technologies and methodologies have given rise to a range of nuanced terms used throughout casual data conversation. Unfamiliarity with these terms often leads to mismatched expectations across different parts of a business when undertaking projects involving data and analytics. To make the most of any data science project, it is important that participants have a shared vocabulary and understand key terms at the level their role requires.

Mango Solutions is regularly involved in data science projects spanning different levels of a business. Below, we’ve outlined the most common data science terms that act as communication barriers in such projects:

 

Term (common examples), followed by its definition for a data scientist, a data science manager and a business director:

Data Science: An interdisciplinary field spanning mathematics, statistics and computer science, aimed at delivering insights from data using a variety of technologies and methodologies. An interdisciplinary business function making use of predictive and prescriptive analytics to make better business decisions. The proactive use of data and advanced analytics to drive better decision making.
Descriptive Analytics: Examination of historical data to understand the changes occurring to a business. Used to answer the question “what happened?”
Diagnostic Analytics: Examination of historical data to understand why changes have occurred within a business. Used to answer the question “why did something happen?”
Predictive Analytics: The use of historical data to make predictions about future events. Used to answer the question “what will happen next?”
Prescriptive Analytics: The use of data and the above forms of analytics to determine the best course of action for a business. Used to answer the question “what’s the best decision we can make based on the data we have?”
Model: The mathematical relationships describing how a sample of data is generated from other observations. A data science product where mathematical and statistical relationships are estimated from historical data and later used to make predictions. The mathematical and statistical relationships used to make predictions about key business metrics (e.g. future sales or the probability that a customer will make a purchase).
Artificial Intelligence (AI): In practice, this term is generally used to refer to “narrow AI” and encompasses the types of problems that can be solved with machine learning. AI usually encompasses topics like machine learning, natural language processing and computer vision, among others.
Machine Learning (e.g. random forest, xgboost, neural networks): A variety of computational methods implementing supervised and unsupervised learning to predict class labels or continuous measures. Typically, regression and classification algorithms for building models, with many open-source implementations. A broad range of leading predictive modelling methodologies.
Deep Learning: A generalisation of artificial neural networks that makes use of many intermediate layers of representation to better capture relationships between the observed data and predictions. A subcategory of machine learning well-suited for complex models and particularly successful in image classification and speech translation.
Supervised Learning: Machine learning algorithms where data exists for both the prediction target and the observations with which the prediction will be made. Machine learning problems where models are estimated from known examples (e.g. identifying fraudulent credit card transactions from reported cases).
Unsupervised Learning: A category of machine learning problems where labels or prediction targets are unknown and must be discovered from patterns in the data. The class of machine learning problems where object groupings need to be discovered (e.g. clusters/labels for pieces of text).
Over-fitting: Estimation error where the model fits the noise in the data. This is often the result of using models that are too complex for either the problem or the available data. For example, a complex image classifier trained using 20 photographs will likely have 100% classification accuracy on those images but perform poorly on new images.
Cross-Validation: An iterative approach for splitting data into train and test sets to ensure robust model estimation. Critical strategy to ensure machine learning models don’t overfit the data and provide misleading predictions. This is needed to ensure models are general enough to be useful for making future predictions.
Training/Test Data: A division of data that allows unbiased model validation. Typically, models are estimated on training data and validated on “test” data that is withheld until the end of the analysis.
Classification: A general term for a class of predictive problems where the target of the prediction is a label (e.g. whether an observation belongs to one of two categories).
Regression: Statistical and mathematical procedures for estimating the relationship between a set of variables and a target quantity while minimising the prediction errors. A broad term often used to refer to model estimation where the target variable is a continuous value (e.g. weekly sales).
Forecasting: The prediction of future events using mathematical or statistical models.
Cloud (AWS, GCP, Azure, Cloudera): A shared set of computational resources allowing on-demand scaling of infrastructure to meet business or project computational requirements. A broad term for scalable, on-demand infrastructure and computing. A shared set of computational resources that allow businesses to avoid upfront infrastructure costs.
Version/Source Control (Git, SVN, Github, Gitlab): A system for tracking, managing, and integrating code changes through a process involving branching and merging code repositories. A system for tracking, managing, and integrating code changes while ensuring a full history of code changes is preserved along with comments from the individuals making those changes. Framework for tracking code changes and allowing rollback to previous versions of software.
Unit Testing: The automation of code validation through tests designed to ensure the correct functioning of small components of code. An often time-consuming step during development that helps programmers test code functionality and protect against future bugs. The benefit of unit testing is often realised in the long term. Development practice that helps ensure correct code functionality.
Continuous Integration: A development practice where code changes are committed to a shared repository and validated by an automated build and testing process. A practice used by a team of developers that helps protect against code integration failures and code changes that break existing or expected functionality.
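
To make the over-fitting, training/test data and cross-validation entries above more concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the original article: the simulated data (a straight line plus noise), the two toy models and all function names are hypothetical. The "memoriser" model simply recalls the closest training example, so it reproduces the training data perfectly yet is expected to do worse than a straight-line fit on data it has not seen.

```python
import random
from statistics import mean


def make_data(n, noise_sd, rng):
    """Simulate y = 2x + noise (assumed toy data; the true relationship is a line)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, noise_sd)) for x in xs]


def fit_linear(train):
    """Simple model: least-squares straight line through the training data."""
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_memoriser(train):
    """Overly flexible model: return the y of the nearest training x (1-nearest neighbour)."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def mse(model, data):
    """Mean squared prediction error of a fitted model on a data set."""
    return mean((model(x) - y) ** 2 for x, y in data)


def cross_validate(fit, data, k=5):
    """k-fold cross-validation: each fold is held out once while the rest trains the model."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        scores.append(mse(fit(training), held_out))
    return mean(scores)


rng = random.Random(0)           # fixed seed so the run is repeatable
data = make_data(200, 1.0, rng)  # points are generated in random order already
train, test = data[:150], data[150:]  # test data is withheld from model fitting

for name, fit in [("linear", fit_linear), ("memoriser", fit_memoriser)]:
    model = fit(train)
    print(
        name,
        "| train MSE:", round(mse(model, train), 3),
        "| test MSE:", round(mse(model, test), 3),
        "| 5-fold CV MSE:", round(cross_validate(fit, data), 3),
    )
```

The memoriser reports a training error of zero because it has fitted the noise in the training data. Its error on withheld data would be expected to be noticeably higher than that of the linear model, which is the pattern that a train/test split and cross-validation are meant to expose.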

Mango Solutions can help you build a shared language around data science in your organisation. Based on our experience working with the world’s leading companies, we have developed three workshops to build that common language.

Find out which of the three workshops would be valuable to your organisation:

 

 
