Site icon R-bloggers

How to predict failure of machinery using data science

[This article was first published on R on Asitav Sen, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
  • Everyone knows how annoying mechanical breakdown can be. Production is directly successful because the equipment malfunctions. Large sums of money are lost as production resumes. It also affects OEMs and distributors based on lost reputation and business opportunity. Fortunately, the use of data science can handle this problem to a large extent.

    Opportunity for machine owners

    The cost of breakdown is not only the loss of opportunity (profit from production), but also the fixed cost of the machine. Also, delays in production can attract fines and lost orders. Sometimes, while other machines depend on failed machines, the cost increases through the roof. The cost of a single break can easily exceed thousands of dollars. The worst part is that this loss can never be recovered.

    The figure above shows some of the components of cost of downtime.

    Using predictive models, one can now estimate the probability of failure. It gives us two abilities. First, the ability to plan maintenance to minimize losses. Second, to improve inventory better. Instead of stockpiling a lot of spare parts, it is possible to keep only what you need in the future.

    Advantages for machine owners

    • Reduce repair time
    • Avoid unplanned maintenance
    • Reduce inventory cost
    • Increase bottom line

    Opportunity for manufacturers (OEMs)

    Failures do not directly affect OEM, but it can damage reputation and end up in lost business. If an important item is not available nearby, customers will not hesitate to purchase items from the local market. Also, manpower may not be available to repair the machine immediately.

    These problems can be avoided if you already have an idea of the potential breakdowns. OEMs / Distributors can then schedule a maintenance and replace parts or provide immediate support in the event of a breakdown. In addition, it will enable OEMs to introduce new revenue models of maintenance contracts. It can also ensure that customers do not buy spare parts from local markets.

    Advantages for OEMS

    • Avoid unplanned breakdowns
    • Increase customer satisfaction (lower repair time)
    • Optimize inventory
    • Improve top line
    • Improve product

    How to implement

    The process begins by identifying the problems to be solved. Usually the problem should be broad and exclusively segregated. E.g. The goal may be to reduce operating costs. One of the many ways to achieve this is by reducing idle time and improving the list of spare parts.

    Once problems are identified, data should be collected for analysis. Often data is not available. As such, the infrastructure for data collection needs to be built. When creating infrastructure and processes, efforts should be made to improve the utility of such infrastructure / processes. This can be done by evaluating other uses of the collected data. If it costs somewhat less to add additional data to solve a significant problem, it should proceed.

    Once the data starts to collect, it should be cleaned and visualized. This applies not only to the data science team, but also to other business partners. If possible and possible, the dashboard (s) can be built for different partners depending on the need. Predictive features of analyzes are natural advances from research analyzes. Dashboards and visualization should be tested and reviewed by creating a forecast model (s).

    Example using real life data

    Code heavy!!! Skip if you are not interested in the codes. However, you may like to understand how well the below model performed in results

    Real life (normalized) data have been used to illustrate the possible model. The data contains a record of certain parameters taken per hour. In addition, record and maintenance of failures will be used.1 The parameters recorded in hours are voltage, vibration, rotation and pressure. Data were recorded from 100 machines of 4 types. Failure of 4 components has been recorded. To simplify, let us analyze the failure of ‘comp2.’

    To keep this brief and to focus on the prognostic model many research activities have been avoided.

      1# Following data are available
      2# 1. Telemetry - Logs hourly parameters (Voltage, Pressure, Rotation and Vibration) for each machine
      3# 2. Failures - Log of component failures. Contains the time slot (that matches the telemetry log) and machineID
      4# 3. Maint - Log of maintainance. Contains time slot, machineID and the component replaced
      5# 4. Assets - Information about the machines - machineID, Model and age
      6
      7# Function to calculate number of periods since last maintenance of a component
      8timeslm <- function(k) {
      9  output <- c()
     10  output[1] = 0
     11  for (i in 2:length(k)) {
     12    if (k[i - 1] == 1) {
     13      output[i] = 1
     14    } else {
     15      output[i] = output[i - 1] + 1
     16    }
     17  }
     18  return(output)
     19}
     20
     21# For the sake of simplicity, failures of 'comp2' will be analysed
     22
     23#preparing data
     24failures2 <-
     25  failures %>% filter(failure == "comp2") #new data frame with failure of comp2
     26maint2 <-
     27  maint %>% filter(comp == "comp2") #new data frame with maintenance of comp2
     28
     29df <-
     30  telemetry %>% left_join(failures2, by = c("machineID", "datetime")) %>% #Joining log with failure
     31  mutate(failed = ifelse(is.na(failure), 0, 1)) %>%  #creating column with binary (failed or not)
     32  left_join(maint2, by = c("machineID" = "machineID", "datetime" = "datetime")) %>% #Joining maintenance data
     33  mutate(maint = ifelse(is.na(comp), 0, 1)) %>% #New column with binary (maintained or not)
     34  inner_join(assets, by = "machineID") #Joining machine details
     35df$datetime <-
     36  parse_date_time(df$datetime, "mdy HMS p") #Changing column type to date time
     37df$machineID <- as.factor(df$machineID)
     38df$model <- as.factor(df$model)
     39
     40timesm <- timeslm(df$maint) #Calculating periods since maintenance
     41df$timesm <- timesm
     42
     43df <-
     44  df %>% mutate(t = ifelse(is.na((
     45    as.numeric(datetime - lag(datetime, 1))
     46  )), 0, (as.numeric(
     47    datetime - lag(datetime, 1)
     48  )))) %>%
     49  mutate(tim = cumsum(t)) #Adding time columns
     50
     51df <- df[, c(1, 14, 15, 13, 11, 12, 2:6, 8, 10)]
     52#Column details - datetime = event log time
     53#                 machineID = Machine Identification number
     54#                 volt, rotate, pressure, vibration are some of the parameters that are measured
     55#                 failed and maint indicate if the component (comp2) failed. 1 indicates true.
     56#                 age is the age of the machine
     57#                 timesm is the time since maintenance
     58#                 tim is the time since the beginning of event logging
     59
     60
     61#removing dfs that are not required
     62rm(telemetry)
     63rm(assets)
     64rm(failures)
     65rm(maint)
     66rm(timesm)
     67
     68# adding grouping to calculate cumulative parameter values.
     69g <- c()
     70g[1] = 1
     71for (i in 2:nrow(df)) {
     72  if (df$timesm[i] < df$timesm[i - 1]) {
     73    g[i] = g[i - 1] + 1
     74  } else {
     75    g[i] = g[i - 1]
     76  }
     77}
     78
     79df$group <- g
     80
     81# removing unwanted data
     82rm(g)
     83
     84
     85#Will add new columns with cumulative parameters - sum, mean
     86
     87df1 <-
     88  df %>% group_by(group) %>% mutate(
     89    volt.cum = cumsum(volt),
     90    vib.cum = cumsum(vibration),
     91    pres.cum = cumsum(pressure),
     92    rot.cum = cumsum(rotate),
     93    volt.mean = cumsum(volt) / seq_along(volt),
     94    vib.mean = cumsum(vibration) / seq_along(vibration),
     95    pres.mean = cumsum(pressure) / seq_along(pressure),
     96    rot.mean = cumsum(rotate) / seq_along(rotate)
     97  ) %>%
     98  filter(maint == 1)   # filtered required data
     99
    100# In case you are interested to know about the transformations in detail please get in touch.
    101
    102#preparing test and train data
    103df.train <- df1 %>% filter(datetime < "2015-11-15 06:00:00 UTC")
    104df.test <- df1 %>% filter(datetime > "2015-11-15 06:00:00 UTC")
    105
    106#removing unwanted data
    107rm(df1)
    108rm(df)
    109
    110# Fitting Kaplan Meier
    111
    112kap.fit<-survfit(Surv(timesm,failed)~model, data=df.train)
    113
    114#Plotting
    115
    116fig<-ggsurvplot(
    117  kap.fit,
    118  pval = F, # show p-value
    119  break.time.by = 1000, #break X axis by 25 periods
    120  #risk.table = "abs_pct", # absolute number and percentage at risk
    121  #risk.table.y.text = FALSE,# show bars instead of names in text annotations
    122  linetype = "strata",
    123  # Change line type by groups
    124  conf.int = T,
    125  # show confidence intervals for
    126  #conf.int.style = "step",  # customize style of confidence intervals
    127  #surv.median.line = "hv",
    128  # Specify median survival
    129  ggtheme = theme_minimal(),
    130  # Change ggplot2 theme
    131  legend.labs =
    132    c("Model 1", "Model 2", "Model 3", "Model 4"),
    133  # change legend labels
    134  ncensor.plot = F,
    135  # plot the number of censored subjects (outs) at time t
    136  #palette = c("#000000", "#2E9FDF","#FF0000")
    137
    138)+
    139  labs(x="Hours")
    140fig
    

    Kaplan Meier model predicts survival chances by 3000 hours of operation (post maintenance) reduces significantly. But there is a substantial amount of uncertainty in prediction, except in case of model 4. The model also shows that it is almost certain that there will be no failures till about 700 hours of operation, post maintenance.

    Summary of the fit is as shown below.

    1summary(kap.fit)$table
    
    ##              records n.max n.start events   *rmean *se(rmean) median 0.95LCL
    ## model=model1     103   103     103     37 2661.662   317.7212   1800    1440
    ## model=model2     112   112     112     32 2781.430   370.7489   2160    1489
    ## model=model3     257   257     257     78 2871.243   264.5345   1800    1440
    ## model=model4     191   191     191     70 2620.189   214.6314   2160    1800
    ##              0.95UCL
    ## model=model1    3240
    ## model=model2      NA
    ## model=model3    2880
    ## model=model4    2880
    

    Since the uncertainty is high, I will not use this model and try Cox regression.

    Cox regression model shows that among all the parameters used, cumulative rotation is the one that significantly impacts failure of ‘comp2.’

    With this insight, OEM can think of ways to improve component quality or find out ways to keep the rotation low.

    1# Cox regression
    2cox.fit<-coxph(Surv(timesm,failed) ~ volt.cum+vib.cum+pres.cum+rot.cum+volt.mean+vib.mean+pres.mean+rot.mean+factor(model), data=df.train)
    3summary(cox.fit)
    
    ## Call:
    ## coxph(formula = Surv(timesm, failed) ~ volt.cum + vib.cum + pres.cum + 
    ##     rot.cum + volt.mean + vib.mean + pres.mean + rot.mean + factor(model), 
    ##     data = df.train)
    ## 
    ##   n= 663, number of events= 217 
    ## 
    ##                           coef  exp(coef)   se(coef)      z Pr(>|z|)    
    ## volt.cum            -2.389e-04  9.998e-01  1.416e-04 -1.687 0.091529 .  
    ## vib.cum             -3.233e-04  9.997e-01  2.849e-04 -1.135 0.256382    
    ## pres.cum             2.164e-05  1.000e+00  1.557e-04  0.139 0.889459    
    ## rot.cum             -1.496e-04  9.999e-01  4.161e-05 -3.594 0.000326 ***
    ## volt.mean            2.527e-01  1.287e+00  1.687e-01  1.498 0.134178    
    ## vib.mean             5.436e-01  1.722e+00  3.291e-01  1.652 0.098603 .  
    ## pres.mean            1.520e-01  1.164e+00  1.851e-01  0.821 0.411487    
    ## rot.mean            -4.229e-02  9.586e-01  4.338e-02 -0.975 0.329642    
    ## factor(model)model2 -4.183e-01  6.582e-01  2.786e-01 -1.501 0.133273    
    ## factor(model)model3 -1.932e-01  8.243e-01  2.322e-01 -0.832 0.405304    
    ## factor(model)model4  9.385e-02  1.098e+00  2.397e-01  0.391 0.695429    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ##                     exp(coef) exp(-coef) lower .95 upper .95
    ## volt.cum               0.9998     1.0002    0.9995    1.0000
    ## vib.cum                0.9997     1.0003    0.9991    1.0002
    ## pres.cum               1.0000     1.0000    0.9997    1.0003
    ## rot.cum                0.9999     1.0001    0.9998    0.9999
    ## volt.mean              1.2875     0.7767    0.9250    1.7920
    ## vib.mean               1.7222     0.5807    0.9035    3.2826
    ## pres.mean              1.1641     0.8590    0.8100    1.6731
    ## rot.mean               0.9586     1.0432    0.8804    1.0437
    ## factor(model)model2    0.6582     1.5193    0.3813    1.1363
    ## factor(model)model3    0.8243     1.2132    0.5229    1.2994
    ## factor(model)model4    1.0984     0.9104    0.6866    1.7571
    ## 
    ## Concordance= 0.98  (se = 0.004 )
    ## Likelihood ratio test= 1060  on 11 df,   p=<2e-16
    ## Wald test            = 135.2  on 11 df,   p=<2e-16
    ## Score (logrank) test = 483.9  on 11 df,   p=<2e-16
    

    So, a new model was built to include only cumulative rotation as parameter. The model was used to predict outcome on test data. Confusion matrix was built to analyze accuracy.

    1# Revised cox model
    2cox.fit<-coxph(Surv(timesm,failed) ~ rot.cum, data=df.train)
    3
    4# Predicted probabilities
    5pred<-predict(cox.fit,newdata=filter(df.test), type = "expected")
    6
    7d<-df.test%>%cbind(pred=pred)%>%select(12,14,23)%>%mutate(pred=ifelse(pred>0.5,1,0))
    8
    9table(act=d$failed,pred=d$pred)
    
    ##    pred
    ## act  0  1
    ##   0 49 12
    ##   1  7 32
    

    Results

    • 81 % accuracy was achieved *
    • 87 % accurate when did not predict failure *
    • 72 % accurate when predicted failure *

    Wondering what does it mean in terms of saved cash? Or what are all possible ways to create value out of this? One needs more detail of the business and operations to answer those questions. If you are really interested, you know what to do!

    For any questions and/or clarifications, please do not hesitate to contact me.

    Notes

    < section class="footnotes" role="doc-endnotes">
    1. The data has been sourced from [Deepti Chevvuri’s Github] (https://github.com/DeeptiChevvuri/Predictive-Maintenance-Modelling-Datasets). The data contains error logs as well, which has not been used in this analysis. ↩︎

    To leave a comment for the author, please follow the link and comment on their blog: R on Asitav Sen.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.