Data come in variety of form, at different pace, and at different volume. And if all three criteria define the difference between SQL and NoSQL and there, all three are still irrelevant for data science.
My theorem is, that no matter what shape, size, frequeny, value and trustworthiness, SQL type of presenting the data is still the number one player.
Before you all jump, and start writing comments, hold on and continue reading.
SQL Databases as we have known them and as we know them nowadays have improved over the years in terms of velocity, performance, volume, frequency, and up until this day, keep the relational model, data ACID, isolations, and so on. All this – if you would like to address them – fix, rigid, complex, … tendencies have not only kept the databased number one, but provided the business, digitalization wave and cloud computing with structure and data visibility.
NoSQL database technologies are, on the other hand as described by many, flexible, can provide horizontal scaling, give better query performances due to data normalization and can do parallel computations.
I cannot argue with parallel computation and horizontal scaling. But with rest, I have big issues.
Offerings in Microsoft Azure
Let’s take a look into the different offerings in comparison between SQL and NoSQL:
|Storage||Predefined tables||Key-value pair storage, Document Storage, Columnstore storage, Graph based storage|
|Data Model||Tables of rows and columns with relations between other tables, related data stored separately, joined to form complex queries||Stores data depending on database type: Key-value, documents, columnar type and graph databases|
|Example||SQL Server, Azure SQL Server, Azure Database||Azure Cosmos DB ( Table API, Cassandra API, Graph API), HBase in HDInsight, Azure Cache for Redis, Azure Table Storage, Azure Blob Store, Azure Data Lake store, Azure File Store, Azure Time Series Insights|
|Business Purpose||General purpose systems, CRM, Accounting, Finance, Human resources, Planning, Inventory management, Transactional management System||Mobile Apps, IoT Apps, Real-time data Stream and Analytics, Content management|
|Scale||Vertically by increasing server load||Horizontally by sharding across multiple|
|Analytics||SQL, SQL with R, Python, and Java, Analysis Services with Data Mining, Integration Services with data profiling, Azure Machine Learning||Spark, Scala, Python, R, Hive, SQL, .NET Core, Azure Machine Learning for Python, Stream Analytics|
Analytical hiccups and trade-offs with NoSQL
Horizontal scaling and parallel computations are great. They both provide elasticity of resources and faster analytical results. When delivering data to analytical department, the ends must meet.
You have probably heard the phrase “from eventual to strong consistency”. This simply means that the schema is defined on read and when a document, key-value pair binary file, columnar file is copied between different transformation zones it will eventually become consistent with schema, ACID rules, integrity constraints. But with eventual consistent data, you are gaining fast response (analytics) with the cost of potentials errors or stale data. And delivering this type of data to data science department will always result in back and forth communication full of nagging questions and clearing many data inconsistencies.
Keeping data in original format is great; Transforming it to a readable dataset is a big trade-off for data architect, data engineers and a huge must for data science.
Faster query performance can be achieved with in NoSQL without complex query joins. Normalized data would mean, that all transactional data, along with all dimensional data (with names, explanations) are included. Making datasets inadvertently large, but with cheap storage and scalability, this is not a problem. Problem lies with data consistency, updated, well, ACID. And it’s orchestration. Making shards and copies of data every time, something is changed, updated or deleted, can be tedious, invisible and result of many storage issues. But all orchestration issues are also solvable with right software and some coding.
Data usage and delivery
Data scientists love Python Pandas, Numpy, R’s Dplyr, data.table, Spark’s dataframe, datasets, Julia’s dataframe.jl. You get the picture. All are column and row based. In other words, all NoSQL data are delivered as SQL typed data or columnar typed. Even graph data (with edges and vertices) are transformed in this format.
I tend to calculate flexibility of NoSQL data through “time to market” KPI. How fast can add a new type of data (image, alter schema, new KPI) or a new change on the data and deliver it to the data science department is key to flexibility. There are for sure also other key factors regarding flexibility, for example through deployment and operations, replication or even availability.
I am happy that NoSQL concepts, technologies and statistical approached have penetrated world of data science. Not only it helped developed new ways of calculation, improved and developed new algorithms, but also opened new ways of analysing formats, brought it to community faster that we have envisioned. But there are caveats to these concepts. Data still need to be cleaned, harmonised, consolidated. In order to bring the consistency, accuracy through transformation, wrangling and orchestration. And these processes must (!) not be neglected, overlooked nor underestimated. If they are, the complete NoSQL paradigm will go down the drain. And this will cost company precious resources, energy and unhappiness.