Using R and H2O Isolation Forest For Data Quality

November 11, 2019
By Carlos Kassab

[This article was first published on R-Analytics, and kindly contributed to R-bloggers.]

# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) )
# For interactive plotting
suppressWarnings( suppressMessages( library( dygraphs ) ) )
suppressWarnings( suppressMessages( library( dplyr ) ) )
suppressWarnings( suppressMessages( library( DT ) ) )

# Start a single-node instance of H2O using all available processor cores and up to 5GB of memory
h2oServer = h2o.init( ip = "localhost", port = 54321, max_mem_size = "5g", nthreads = -1 )
## 
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## /tmp/RtmpC1pHJS/h2o_ckassab_started_from_r.out
## /tmp/RtmpC1pHJS/h2o_ckassab_started_from_r.err
##
##
## Starting H2O JVM and connecting: . Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 395 milliseconds
## H2O cluster timezone: America/Mexico_City
## H2O data parsing timezone: UTC
## H2O cluster version: 3.26.0.6
## H2O cluster version age: 1 month and 8 days
## H2O cluster name: H2O_started_from_R_ckassab_aat507
## H2O cluster total nodes: 1
## H2O cluster total memory: 4.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 3.6.1 (2019-07-05)
h2o.removeAll() # Removes all data from h2o cluster, ensuring it is clean.
h2o.no_progress() # Turn off progress bars for notebook readability

# Setting H2O timezone for proper date data type handling
# h2o.getTimezone() # Returns "UTC" by default.
# h2o.listTimezones() # Lists all time zones H2O recognizes.
h2o.setTimezone("US/Central")
## [1] "US/Central"
# Note: I am using Ubuntu 19.10 and keep the data file in the /tmp directory,
# so I need to copy it there again after every reboot.

# Importing data file and setting data types accordingly.
allData = read.csv( "/tmp/PGA_Tour_Golf_Data_2019_Kaggle.csv", sep = ",", header = T )

# When using as.POSIXct, H2O fails to import the column, so we use as.Date instead.
allData$Date = as.Date( allData$Date )
allData$Value = as.numeric(allData$Value)
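# Note: this run uses R 3.6.1, where read.csv() defaults to stringsAsFactors = TRUE.
# If Value comes in as a factor, as.numeric() would return the factor level codes;
# in that case convert through character first, e.g.:
# allData$Value = as.numeric( as.character( allData$Value ) )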

# Convert dataset to H2O format.
allData_hex = as.h2o( allData )

# Build an Isolation forest model
startTime <- Sys.time()
startTime
## [1] "2019-11-10 20:10:30 CST"
trainingModel = h2o.isolationForest( training_frame = allData_hex
, sample_rate = 0.1
, max_depth = 32
, ntrees = 100
)
## Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Stopping tolerance is ignored for _stopping_rounds=0..
Sys.time()
## [1] "2019-11-10 20:20:15 CST"
Sys.time() - startTime
## Time difference of 9.756691 mins
# According to the H2O documentation:
# http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html
#
# Isolation Forest is similar in principle to Random Forest and is built on
# the basis of decision trees.

# Isolation Forest creates multiple decision trees to isolate observations.
#
# Trees are split randomly. The assumption is that:
#
# IF AN OBSERVATION'S MEASUREMENTS ARE SIMILAR TO THE OTHERS',
# IT WILL TAKE MORE RANDOM SPLITS TO ISOLATE IT.
#
# The fewer splits needed, the more likely the observation is anomalous.
#
# The average number of splits is then used as the score.
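# As a toy illustration of this principle (a self-contained sketch, not part of
# the original analysis), score a vector with one obvious outlier; the outlier
# should need the fewest splits and therefore get the highest score:
toy_hex = as.h2o( data.frame( x = c( rnorm( 100 ), 50 ) ) )
toyModel = h2o.isolationForest( training_frame = toy_hex, ntrees = 50, seed = 1234 )
toyScore = h2o.predict( toyModel, toy_hex )
# The prediction frame has two columns: predict (normalized anomaly score)
# and mean_length (average number of splits needed to isolate the row).
tail( as.data.frame( toyScore ) ) # The last row, x = 50, should score highest.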

# Calculate score for all data.
startTime <- Sys.time()
startTime
## [1] "2019-11-10 20:20:15 CST"
score = h2o.predict( trainingModel, allData_hex )
result_pred = as.vector( score$predict )
Sys.time()
## [1] "2019-11-10 20:23:18 CST"
Sys.time() - startTime
## Time difference of 3.056829 mins
################################################################################
# Setting threshold value for anomaly detection.
################################################################################

# Setting desired threshold percentile.
threshold = .999 # Keep the top 0.1% of scores: the rows most different from the rest.

# Using this threshold to get the score limit used to filter anomalies.
scoreLimit = round( quantile( result_pred, threshold ), 4 )
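# A quick worked example of what this cutoff does (made-up scores, independent
# of the dataset): with 1000 uniform scores, the 0.999 quantile leaves roughly
# the top 0.1% above the limit.
exampleScores = runif( 1000 )
sum( exampleScores > quantile( exampleScores, 0.999 ) ) # ~1 row out of 1000.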

# Add the row score as the first column of the dataset.
allData = cbind( RowScore = round( result_pred, 4 ), allData )

# Get data anomalies by filtering all data.
anomalies = allData[ allData$RowScore > scoreLimit, ]

# As we can see in the summary:
summary(anomalies)
##     RowScore              Player.Name        Date           
##  Min.   :0.9540   Jonas Blixt   : 231   Min.   :2019-07-07  
##  1st Qu.:0.9565   Jordan Spieth : 231   1st Qu.:2019-08-25  
##  Median :0.9614   Julian Etulain: 221   Median :2019-08-25  
##  Mean   :0.9640   Johnson Wagner: 213   Mean   :2019-08-24  
##  3rd Qu.:0.9701   John Chin     : 209   3rd Qu.:2019-08-25  
##  Max.   :1.0000   Keegan Bradley: 209   Max.   :2019-08-25  
##                   (Other)       :8325                       
##                             Statistic   
##  Club Head Speed                : 234  
##  Driving Pct. 300-320 (Measured): 193  
##  Carry Efficiency               : 163  
##  First Tee Early Lowest Round   : 161  
##  First Tee Late Lowest Round    : 160  
##  GIR Percentage - 100+ yards    : 158  
##  (Other)                        :8570  
##                                                    Variable   
##  First Tee Early Lowest Round - (LOW RND)                : 103  
##  First Tee Late Lowest Round - (LOW RND)                 :  96  
##  First Tee Late Lowest Round - (ROUNDS)                  :  64  
##  Driving Pct. 300-320 (Measured) - (TOTAL DRVS - OVERALL):  61  
##  GIR Percentage - 175-200 yards - (%)                    :  61  
##  First Tee Early Lowest Round - (ROUNDS)                 :  58  
##  (Other)                                                 :9196  
##      Value       
##  Min.   :  1268  
##  1st Qu.: 53058  
##  Median : 87088  
##  Mean   :111716  
##  3rd Qu.:184278  
##  Max.   :220583  
## 
# The Statistic "GIR Percentage - 100+ yards" has one of the highest anomaly counts.
# Filtering all anomalies within this Statistic value:
statisticFilter = "GIR Percentage - 100+ yards"

specificVar = anomalies %>%
  filter( Statistic == statisticFilter )

cat( statisticFilter,": ", dim(specificVar)[1] )
## GIR Percentage - 100+ yards :  158
if( nrow(specificVar) > 0 ) {

# We want to know the relation between players and this statistic.
# So, in order to get a chart, we assign a code to each player.
# Since factors in R are really integer values, we do this to get the codes:
specificVar$PlayerCode = as.integer(specificVar$Player.Name)
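# Toy example of this factor-to-integer mapping (not from the dataset):
# as.integer( factor( c( "Ann", "Bob", "Ann" ) ) ) returns 1 2 1.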

# To sort our dataset we convert the date to a number of the form YYYYMMDD.
specificVar$DateAsNum = as.numeric( format( specificVar$Date, "%Y%m%d" ) )
# And sort the data frame.
specificVar = specificVar[order(specificVar$DateAsNum),]
# Number the records using a sequence.
rownames(specificVar) = seq_len( nrow(specificVar) )

colNamesFinalTable = c( "PlayerCode", "Player.Name", "Date", "Variable", "Value" )
specificVar = specificVar[, colNamesFinalTable]
specificVar$PlayerCode = as.factor(specificVar$PlayerCode)

# Creating our final dataframe for our chart.
specificVarChartData = data.frame( SeqNum = as.integer( rownames(specificVar) )
, PlayerCode = specificVar$PlayerCode
, Value = specificVar$Value
)



AnomaliesGraph = dygraph( specificVarChartData, main = ''
                          , xlab = paste( statisticFilter, "Anomaly Number." )
                          , ylab = "Player Code." ) %>%
  dyAxis( "y", label = "Player Code." ) %>%
  dyAxis( "y2", label = "Value.", independentTicks = TRUE ) %>%
  dySeries( name = "PlayerCode", label = "Player Code.", drawPoints = TRUE
            , pointShape = "dot", color = "blue", pointSize = 2 ) %>%
  dySeries( name = "Value", label = "Value.", drawPoints = TRUE
            , pointShape = "dot", color = "green", pointSize = 2, axis = 'y2' ) %>%
  dyRangeSelector()
dyOptions( AnomaliesGraph, digitsAfterDecimal = 0 )
}
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

Sample chart with the anomalies found:

Sample data table with the anomalies found:
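The DT package loaded at the beginning is presumably what renders this interactive table; a minimal sketch of how it could be built from the specificVar data frame above:

datatable( specificVar, rownames = FALSE, options = list( pageLength = 10 ) )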

Here is the code on GitHub, including the full working HTML demo:
https://github.com/LaranIkal/DataQuality
Enjoy it!

Carlos Kassab
https://www.linkedin.com/in/carlos-kassab-48b40743/

More information about R:
https://www.r-bloggers.com
