Failure Pressure Prediction Using Machine Learning

December 10, 2018
By

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Regression Models

    Tags

    1. ggplot2
    2. Machine Learning
    3. Prediction
    4. R Programming

    In this post, the failure pressure will be predicted for a pipeline containing a defect based solely on burst test results and learning machine models. For this purpose, various Machine Learning models will be fitted to test data under R using the caret package, and in the process compare the accuracy of the models in order to identify the best performing one(s).

    Importing The Data

    The data set to be used, has been extracted from an extensive database of burst test results, the set contains the results of approximately 313 tests which were compiled by GL in 2009. Report as well as the data used here can be obtained using this link.

    The data was extracted and inserted into a CSV file, allowing for it to be called and manipulated easily.

    We will start by loading all libraries needed to run the analysis and the data set.

    library(caret)
    library(caretEnsemble)
    library(ggforce)
    library(ggplot2)
    library(gridExtra)
    Burst_Tests_Data <- read.csv(file="F:/kamel/Burst Data.csv", header=TRUE, sep=",")
    # Summary of the data is shown below:  
      summary(Burst_Tests_Data)
    ##        INDEX              Source_Reference     Grade       OD_by_WT     
    ##  INDEX 1  :  1   BRITISH GAS RING1:  1     X52    :58   Min.   :  8.60  
    ##  INDEX 10 :  1   BRITISH GAS RING2:  1     B      :50   1st Qu.: 46.20  
    ##  INDEX 100:  1   BRITISH GAS RING3:  1     X60    :46   Median : 57.90  
    ##  INDEX 101:  1   BRITISH GAS RING4:  1     X65    :44   Mean   : 58.77  
    ##  INDEX 102:  1   BRITISH GAS RING5:  1     X100   :41   3rd Qu.: 70.70  
    ##  INDEX 103:  1   BRITISH GAS RING6:  1     X46    :41   Max.   :111.10  
    ##  (Other)  :307   (Other)          :307     (Other):33                   
    ##    Defect_Type   Norm_Length         d_by_WT        YS_by_SMYS 
    ##  Machined:180   Min.   :  0.228   Min.   :0.089   1.134  : 41  
    ##  Real    :133   1st Qu.:  1.594   1st Qu.:0.448   N/A    : 19  
    ##                 Median :  3.549   Median :0.586   1.194  : 17  
    ##                 Mean   : 36.323   Mean   :0.567   1.117  : 11  
    ##                 3rd Qu.:  9.878   3rd Qu.:0.715   1.211  : 10  
    ##                 Max.   :284.164   Max.   :1.000   1.092  :  9  
    ##                                                   (Other):206  
    ##   UTS_by_SMTS    YS_by_UTS   Failure_Mode Failure_Pressure_psi
    ##  N/A    : 62   N/A    : 62   L  : 79      Min.   :  677       
    ##  1.057  : 41   0.976  : 41   N/A: 73      1st Qu.: 1434       
    ##  1      : 18   0.634  : 17   R  :161      Median : 1844       
    ##  1.098  : 17   0.771  : 10                Mean   : 2252       
    ##  1.045  :  9   0.773  : 10                3rd Qu.: 2363       
    ##  1.048  :  9   0.834  :  9                Max.   :17999       
    ##  (Other):157   (Other):164                                    
    ##  Failure_Pressure_Mpa      SMYS            SMTS      
    ##  Min.   :  4.668      Min.   :172.0   Min.   :310.0  
    ##  1st Qu.:  9.887      1st Qu.:317.0   1st Qu.:434.0  
    ##  Median : 12.714      Median :359.0   Median :455.0  
    ##  Mean   : 15.525      Mean   :398.3   Mean   :505.1  
    ##  3rd Qu.: 16.292      3rd Qu.:448.0   3rd Qu.:531.0  
    ##  Max.   :124.099      Max.   :689.0   Max.   :757.0  
    ## 
    

    Prepare Dataset

    The main parameters needed for the analysis will be separated from the loaded data and into another set, the main parameters we need are:

    • Normalised length
    • Diameter to wall thickness ratio
    • Defects depth
    • Burst Pressure
    • SMYS
    • Failure Mode (for visualisation)
    Data <- data.frame(Norm_L = Burst_Tests_Data$Norm_Length, Depth = Burst_Tests_Data$d_by_WT, OD_by_WT = Burst_Tests_Data$OD_by_WT, SMYS = Burst_Tests_Data$SMYS, Test_P_Burst = Burst_Tests_Data$Failure_Pressure_Mpa, Failure_Mode=Burst_Tests_Data$Failure_Mode) 
    

    We can view all of the selected parameters using pairwise comparison on the basis of the failure mode.

    library(GGally)
    ggpairs(Data, mapping = aes(color = Failure_Mode))
    

    Next step is to partition our data into 2 sets, 90% for training and 10% for testing, these will be balanced splits based on the burst pressure values. The purpose of splitting the data is to allow the algorithms to learn the relationships between the parameters with the training set, followed by the testing set, which will be used as an independent set to make sure our trained model is not over-fitting the data during the training step.

    Split_Data <- createDataPartition(Data$Test_P_Burst, p = 0.90, list = FALSE)
    Training_Data <- Data[Split_Data,]
    Testing_Data <- Data[-Split_Data,]
    

    Train the Models

    There is a large list of machine learning algorithm supported by the caret package; however, for this analysis, I chose randomly 6 regression models, these are:

    • Model Tree
    • Support Vector Machines with Linear Kernel
    • Random Forest
    • k-Nearest Neighbors
    • Generalized Linear Model
    • Projection Pursuit Regression

    We will use repeated cross-validation which is a way to evaluate the performance of a model by randomly partitioning the training set into k equal size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data. The cross-validation process is then repeated k times, for our analysis we will use 10 folds and 10 repeats.

    Each of the 6 models will be built using the function 'train()', the relationship between parameters will be expressed as follow:

    Purst Pressure = f(Normalised Length, Depth, OD/WT, SMYS)

    At the end, we will use the 10% testing data to predict the burst pressure for each trained model, using the function 'predict()'.

    train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 10, verboseIter=FALSE)  
    Formula <- Test_P_Burst ~ Norm_L + Depth + OD_by_WT + SMYS
    Test_Data <- subset(Testing_Data , select = c("Norm_L", "Depth", "OD_by_WT", "SMYS"))
    
    # Model Rules 
    Model_MR <- train(Formula, data = Training_Data, method = "M5Rules", trControl = train.control, preProc = c("center", "scale"))
    Model_MR_P_Burst <- predict(Model_MR, Testing_Data)
    
    # Support Vector Machines with Linear Kernel
    Model_SVM <- train(Formula, data = Training_Data, method = "svmLinear", trControl = train.control, tuneGrid = expand.grid(C = c(0.01, 0.1, 0.5)), preProc = c("center", "scale"))
    Model_SVM_P_Burst <- predict(Model_SVM, Testing_Data)
    
    # Random Forest
    Model_RF <- train(Formula, data = Training_Data, method = "rf", trControl = train.control, ntree=200, preProc = c("center", "scale"))
    Model_RF_P_Burst <- predict(Model_RF, Testing_Data)
    
    # k-Nearest Neighbors
    Model_KNN <- train(Formula, data = Training_Data, method = "knn", trControl = train.control, tuneLength = 10, preProc = c("center", "scale"))
    Model_KNN_P_Burst <- predict(Model_KNN, Testing_Data)
    
    # Generalized Linear Model
    Model_GLM <- train(Formula, data = Training_Data, method = "glm", trControl = train.control, tuneLength = 10, preProc = c("center", "scale"))
    Model_GLM_P_Burst <- predict(Model_GLM, Testing_Data)
    
    # Projection Pursuit Regression
    Model_PPR <- train(Formula, data = Training_Data, method = "ppr", trControl = train.control, tuneLength = 10, preProc = c("center", "scale"))
    Model_PPR_P_Burst <- predict(Model_PPR, Testing_Data)
    
    Models <- resamples(list(MR = Model_MR, RF = Model_RF, SVM = Model_SVM, KNN = Model_KNN, GLM = Model_GLM, PRR = Model_PPR))
    

    Compare the Models

    Now we have trained and tested all of our 6 models, let's have a look at how each model performed. First plot of the predicted burst pressure Vs. the real burst pressure:

    library(ggthemes)
    Test_Failure_P <- Testing_Data$Test_P_Burst
    Markers <- geom_point(size=1.5, shape=6, color="black")
    Unity <- geom_abline(slope=1, intercept = 0, color="red", linetype="dashed",size=1)
    Bground <- theme(axis.title = element_text(face="bold", size=12), panel.background = element_rect(fill = 'lightsteelblue4'), panel.grid = element_line(colour="grey"), plot.background=element_rect(fill="grey95"))
    P1 <- ggplot(,aes(x = Test_Failure_P, y = Model_MR_P_Burst)) + Markers + Unity + Bground 
    P2 <- ggplot(,aes(x = Test_Failure_P, y = Model_SVM_P_Burst)) + Markers + Unity + Bground
    P3 <- ggplot(,aes(x = Test_Failure_P, y = Model_RF_P_Burst)) + Markers + Unity + Bground
    P4 <- ggplot(,aes(x = Test_Failure_P, y = Model_KNN_P_Burst)) + Markers + Unity + Bground
    P5 <- ggplot(,aes(x = Test_Failure_P, y = Model_GLM_P_Burst)) + Markers + Unity + Bground
    P6 <- ggplot(,aes(x = Test_Failure_P, y = Model_PPR_P_Burst)) + Markers + Unity + Bground
    grid.arrange(P1, P2, P3, P4, P5, P6, nrow=2)
    

    We can see that in overall, the Random Forest appears to be the most accurate model, however, performance is not totally that clear from the plots alone, for a clear comparison we can use some evaluation metrics, these can be viewed by calling the function summary().

    summary(Models)
    ## 
    ## Call:
    ## summary.resamples(object = Models)
    ## 
    ## Models: MR, RF, SVM, KNN, GLM, PRR 
    ## Number of resamples: 100 
    ## 
    ## MAE 
    ##         Min.  1st Qu.   Median     Mean  3rd Qu.      Max. NA's
    ## MR  1.249353 1.961890 2.341977 2.611588 3.200767  5.861817    0
    ## RF  1.023035 1.495052 1.728636 1.908296 2.125849  4.333484    0
    ## SVM 1.986233 2.777073 3.766255 4.289719 5.560249 10.508175    0
    ## KNN 1.133630 1.858979 2.138108 2.655981 3.050784  6.585347    0
    ## GLM 3.070781 4.524447 5.415050 5.652634 6.456241 10.905213    0
    ## PRR 1.167177 1.861359 2.260386 2.293145 2.678526  3.878608    0
    ## 
    ## RMSE 
    ##         Min.  1st Qu.   Median     Mean   3rd Qu.      Max. NA's
    ## MR  1.557999 2.623602 3.619572 4.472389  5.756353 14.857580    0
    ## RF  1.227453 2.092623 2.482315 3.415498  3.719935 11.558717    0
    ## SVM 2.487095 3.660816 8.935002 9.425596 16.255103 25.082527    0
    ## KNN 1.491673 2.677945 3.185990 5.282331  6.621437 20.164751    0
    ## GLM 3.721669 5.943858 7.765995 9.409684 12.624701 22.397631    0
    ## PRR 1.550538 2.596583 3.350004 3.793293  4.563330  9.107361    0
    ## 
    ## Rsquared 
    ##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
    ## MR  0.0002491479 0.8188490 0.8872148 0.8464541 0.9523463 0.9863039    0
    ## RF  0.7559770108 0.9101539 0.9545877 0.9344962 0.9787224 0.9939840    0
    ## SVM 0.2365625650 0.4314300 0.5670594 0.5446240 0.6590483 0.8127915    0
    ## KNN 0.5504808103 0.7854814 0.8823773 0.8547489 0.9483226 0.9844278    0
    ## GLM 0.3281077690 0.4974465 0.5495725 0.5538143 0.6003874 0.8050598    0
    ## PRR 0.6472502237 0.8316454 0.9302789 0.8939118 0.9687244 0.9913075    0
    

    We can also compare the accuracy of the models visually by using the dot plot.

    scales <- list(x=list(relation="free"), y=list(relation="free"))
    dotplot(Models, scales=scales, layout=c(3,1))
    

    It is clear from the above, Random Forst appears to be the optimal model in this analysis, now we can use it in a normal assessment.

    First let’s have a look how normal defect assessment method perform when compared to the test data (in this case I have selected the Modified ASME B31G method). For that, we create a function to calculate burst pressure of the defects used in the tests and add results to the parameters we selected previously.

    # Modified ASME B31G 
    P_Burst <- function(Norm_Length, d_by_WT, OD_by_WT, SMYS){
    if (Norm_Length <= 50^0.5) 
    {
        M <-  (1 + 0.6275*Norm_Length - 0.003375*Norm_Length^2)^0.5
    } else {
        M <-  0.032*Norm_Length + 3.3
    } 
    Sig_Flow <- SMYS + 69
    P <- (((2*Sig_Flow)/OD_by_WT )*(1-0.85*d_by_WT))/(1-((0.85*d_by_WT)/M))  
    return(P)
    }
    
    # Calculate burst pressure for test defects
    B31G_P_Burst <- c()
    for(i in 1:length(Burst_Tests_Data[,1])) 
    {
      B31G_P_Burst[i] <- P_Burst(Burst_Tests_Data$Norm_Length[i], Burst_Tests_Data$d_by_WT[i], Burst_Tests_Data$OD_by_WT[i], Burst_Tests_Data$SMYS[i])
    }
    
    # Add results to previous selection
    Data <- data.frame(Norm_L = Burst_Tests_Data$Norm_Length, Depth = Burst_Tests_Data$d_by_WT, OD_by_WT = Burst_Tests_Data$OD_by_WT, SMYS = Burst_Tests_Data$SMYS, Test_P_Burst = Burst_Tests_Data$Failure_Pressure_Mpa, B31G_P_Burst) 
    

    Let’s have a look how Modified B31G burst pressure results compare to tests results.

    P <- ggplot(Data, aes(x = Test_P_Burst, y = B31G_P_Burst))+ geom_point(size=1.6, shape=6, colour = "black")+
    labs(x = "Test Burst Pressure", y = "Modified B31G Burst Pressure") + 
    geom_abline(aes(slope=1, intercept = 0), linetype="dashed", size=1, colour = "red")+ 
    theme(panel.background = element_rect(fill = 'lightsteelblue4'), 
          axis.title = element_text(face="bold", size=12),
          plot.background=element_rect(fill="white"),
          axis.text = element_text(colour = "black", size=11))
    P + facet_zoom(x = Test_P_Burst < 40, y = B31G_P_Burst < 40, horizontal=F,zoom.size = 1)
    

    We can see that in overall, the Modified B31G tend to under-estimate the values of the burst pressure. Now let's have a practical test with a new data set, created randomly.

    Rep_Nb <- 500    # Random numbers to be genrated
    L <- runif(Rep_Nb, min=5, max=100)
    dt <- runif(Rep_Nb, min=0.1, max=0.7)
    OD <- rep(406.4, times=Rep_Nb)   
    SMYS <- rep(359, times=Rep_Nb)   
    WT <- runif(Rep_Nb, min=5, max=15)
    Norm_L <- L/(OD*WT)^0.5
    OD_by_WT <- OD/WT
    Depth <- dt
    
    # Now let's predict the burst pressure using the best performing model, i.e. neural networks.
    Pract_Test <- data.frame(Norm_L, Depth, OD_by_WT, SMYS)
    Pract_P_Burst <- predict(Model_RF, Pract_Test)
    
    # Estimate the burst pressure of the randomly generated parameters using Modified B31G
    B31G_P_Pract <- c()
    for(i in 1:Rep_Nb) 
    {
     B31G_P_Pract[i] <- P_Burst(Norm_L[i], Depth[i], OD_by_WT[i], SMYS[i])
    }
    ggplot(, aes(x = Pract_P_Burst, y = B31G_P_Pract))+ geom_point(size=1.6, shape=6, colour = "black")+
    labs(x = "Predicted Pressure", y = "Modified B31G Pressure") + 
    geom_abline(aes(slope=1, intercept = 0), linetype="dashed", size=1, colour = "red")+ 
    theme(panel.background = element_rect(fill = 'lightsteelblue4'), 
          axis.title = element_text(face="bold", size=12),
          plot.background=element_rect(fill="white"),
          axis.text = element_text(colour = "black", size=11))
    

    In The above guide we only limited the number of models to 6; however, we can include as many as we want for better selection. In the next post, I will be showing how to combine multiple models as an ensemble for better predictions.

    Related Post

    1. Machine learning logistic regression for credit modelling in R
    2. Commercial data analytics: An economic view on the data science methods
    3. Weight loss in the U.S. – An analysis of NHANES data with tidyverse
    4. Machine Learning Results in R: one plot to rule them all! (Part 2 – Regression Models)
    5. Story of pairs, ggpairs, and the linear regression

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



    If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

    Comments are closed.

    Search R-bloggers


    Sponsors

    Never miss an update!
    Subscribe to R-bloggers to receive
    e-mails with the latest R posts.
    (You will not see this message again.)

    Click here to close (This popup will not appear again)