by Jaya Mathew, Data Scientist at Microsoft
By using R Services within SQL Server 2016, users can leverage the power of R at scale without having to move their data around. Such a solution is beneficial for organizations with very sensitive, big data which cannot be hosted on any public cloud but does most of their coding in R.
To illustrate the scenario, we will focus on companies who operate machines which encounter mechanical failures. These failures lead to downtime which has cost implications on any business, hence most companies are interested in predicting the failures ahead of time so that they can proactively prevent them. This scenario is aligned with an existing R Notebook published in the Cortana Intelligence Gallery but works with a larger dataset where we will focus on predicting component failures of a machine using raw telemetry, maintenance logs, previous errors/failures and additional information about the make/model of the machine. This scenario is widely applicable for almost any industry which uses machines that need maintenance. A quick overview of typical feature engineering techniques as well as how to build a model will be discussed below.
How to use the predictive maintenance template
The sample data and code for this template are available on Github, and requires Windows for both the local R client and remote SQL Server. To use the template, you will need:
- A local R client and a remote database server, both running Microsoft Windows.
- SQL Server 2016 with R Services installed on the database server.
- A local installation of Microsoft R Client, to access the data on the server. An integrated development environment will make this easier; you can use either R Tools for Visual Studio or RStudio.
- Git Large File Storage, to clone the GitHub repository. (This is needed to download the large CSV files from Github.)
To run the template using the sample data, follow these steps:
- Create your SQL Server instance on the server, then enable R services.
- With Microsoft R Client on the local machine, check your R code can access your SQL Server DB (you will need to provide your SQL Server credentials).
- In SQL Server, create a database where you would like to load the datasets and perform feature engineering. Then run the SQL scripts in this order:
- pdm_data_ingestion.sql (edit the folder path to where you have saved the CSV files)
- In your R IDE, run the following scripts: pdm_visualization.R, pdm_modeling.R. (Ensure that you first edit the code to provide your SQL Server credentials in both the R scripts.)
- The final dataset is loaded onto SQL Server with the evaluation metrics: metrics_df
Implementation setup overview
In this template, there are 5 data sources namely: telemetry, errors, maintenance, machines, failures. The data ingestion, feature engineering and data preparation is done using SQL code on the server where the data resides for most users which circumvents the issues of moving sensitive big data. (For the template we upload the local CSV files to SQL Server.) Then various aspects of the data are visualized and models built using their R code on SQL Server which enables users to scale their operations.
The process can be visualized as shown below:
Understanding the input data
The five data sources (telemetry, errors, failures, maintenance, machines) capture various aspects of the functioning of the machine. The machine attribute table typically contains descriptive information about the make, model, age of the machine. The telemetry time-series data consists of operating measurements like voltage, rotation, pressure and vibration measurements collected at an hourly rate (in this template data). Depending on the machine, other measurements might be available. The error log contains non-breaking errors thrown while the machine is still operational hence do not qualify as failures. The error date and times are rounded to the closest hour since the telemetry is also collected at an hourly rate. The scheduled and unscheduled maintenance records correspond to both regular inspection of components as well as failures where a record is generated if a component is replaced during the scheduled inspection or replaced due to a break down. The failure table captures the records of component replacements due to machine failures. Each record has a date and time, machine ID and failed component type associated with it. In this template the data is at an hourly rate, but in other use cases, the rate can be every second, minute, hour or even once in a day etc.
Feature engineering process overview
The first step in predictive maintenance applications is feature engineering which combines the different data sources to create features that best describe a machines’ health condition at a given point in time. Telemetry data almost always comes with time-stamps which makes it suitable for calculating lagging features. In the following template, rolling mean and standard deviation of the telemetry data over the last 3-hour lag window is calculated for every 3 hours (depending on the requirements of the business, the lag window sizes can be edited). Similar to telemetry data, the errors also come with time-stamps. However, unlike telemetry that had numerical values, errors have categorical values denoting the type of error that occurred at a time-stamp. In this case, aggregating methods such as averaging does not apply. Hence, counting the different categories is a more viable approach where lagging counts of different types of errors that occurred in the lag window are calculated. Maintenance data contains the information of component replacement records. A relevant feature from this data set is to calculate how long it has been since a component was last replaced or number of times a component is replaced. The machine features are used directly since they hold descriptive information about the type of the machines and their age which is defined as the years in service.
Once the data has been collated and features generated, the next step was to create the label for the prediction problem. In this example scenario we are interested in computing the probability that a machine will fail in the next 24 hours due to a certain component failure, hence the labels are (component 1,2,3 or 4) and the rest of the records are labeled as "none" indicating there is no failure within the next 24 hours.
Modelling: Training, Validation and Evaluation
For this predictive maintenance problem, a time-dependent splitting strategy is used to estimate performance which is done by validating and testing on examples that are later in time than the training examples. For a time-dependent split, a point in time is picked and model is trained on examples up to that point in time, and validated on the examples after that point assuming that the future data after the splitting point is not known. There are many other ways of splitting the data, but this template illustrates this method of splitting the data.
At the end of the template implementation, the user would have had a hands-on experience about how to formulate the problem and work with SQL Server R Services to solve their data science problems even when working with Big Data. Based on experience, we have noticed a significant decrease in the feature engineering processing time when done using SQL directly on the server. Another benefit is the ability to run R models on much larger datasets that could previously not be done on a local laptop or PC.
Acknowledgements: Thanks to Danielle Dean and Fidan Boylu Uz for their input.
Cortana Intelligence Gallery: Predictive Maintenance Modeling Guide using SQL R Services