Process Mining (Part 1/3): Introduction to bupaR package
Event Logs
The digitalization of healthcare is more than just electronic medical records. It also allows each instance of a clinician conducting an activity for a patient to be stored as a log. These logs, stored in the healthcare institution’s information system, are known as event logs. Event logs capture the following fundamental information:
- Activity
Activity is a well-defined step in a process. For example, in a hospital admission, activities include registration, blood test and discharge.
- Case identifier
Case identifier is a unique identifier which allows activities to be tagged to the respective case. For example, in a hospital admission, the case identifier will be the patient identifier.
- Activity instance identifier
Activity instance identifier is a unique identifier which ties the recorded events to one specific execution of an activity. Do note the following, as it is vital to understanding the difference in granularity between an activity and an activity instance. An activity is a type of step, while an activity instance is one occurrence of that step, so a case can contain more activity instances than activities. For example, a patient who has two X-rays during an admission has 1 activity, “X-ray”, but 2 activity instances. Each activity instance can in turn be recorded as several events, for example one event when the X-ray started and another when it ended, and the activity instance identifier keeps these events together. Together with the timestamp, it allows the activity instances within a case to be arranged in sequential order. Capturing the sequential order is critical to understanding the workflow of processes.
Event logs can also contain other information such as:
- Timestamp (Timestamp can be when the instance of an activity started or ended or both.)
- Resource (Resource refers to the person or device which conducted the activity. In the hospital, the resource for the activity X-ray will be the radiographer.)
- Location (Location refers to the location where the activity was conducted. In the hospital, the location for the activity X-ray will be the radiology department.)
Below is an example of an event log from a hospital database:
Case ID | Activity Instance | Activity | Time Start | Time End |
---|---|---|---|---|
A | 1 | Admit | 10:00 | 10:05 |
A | 2 | Blood test | 10:05 | 10:35 |
B | 3 | Admit | 11:00 | 11:05 |
C | 4 | Admit | 12:00 | 12:10 |
C | 5 | X-ray | 12:30 | 12:50 |
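The table above can be sketched as a plain data frame to make the three identifiers concrete. This is only an illustration in base R: the column names, the helper `to_time` and the calendar date are assumptions made for the example, not part of any hospital system or of bupaR.

```r
# Toy event log mirroring the table above (the date is made up for illustration)
to_time <- function(hhmm) {
  as.POSIXct(paste("2020-01-01", hhmm), format = "%Y-%m-%d %H:%M", tz = "UTC")
}

event_log <- data.frame(
  case_id           = c("A", "A", "B", "C", "C"),
  activity_instance = 1:5,
  activity          = c("Admit", "Blood test", "Admit", "Admit", "X-ray"),
  time_start        = to_time(c("10:00", "10:05", "11:00", "12:00", "12:30")),
  time_end          = to_time(c("10:05", "10:35", "11:05", "12:10", "12:50")),
  stringsAsFactors  = FALSE
)

num_cases      <- length(unique(event_log$case_id))   # distinct cases
num_activities <- length(unique(event_log$activity))  # distinct activities
num_instances  <- nrow(event_log)                     # one row per activity instance

# Duration of each activity instance in minutes
event_log$duration_mins <- as.numeric(
  difftime(event_log$time_end, event_log$time_start, units = "mins")
)
```

Here there are more activity instances than activities because “Admit” occurs once for each of the three cases.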
Process Mining
With the rich information in event logs, process mining can be done. Process mining is the extraction of information and workflow models from event logs. The content of this post may overlap with related topics such as business intelligence, event log analysis and process analysis. Nonetheless, “process mining” will be the default term used in this post. In healthcare, process mining has been used to uncover the sequencing of activities, identify bottlenecks, identify outliers when compared against a theoretical workflow model, and examine the relationship between providers.
bupaR
R has a package, bupaR, for process mining and analysis. bupaR is the core package, and when you load bupaR you also load other packages (e.g. eventdataR, processmapR) used for process mining.
While an event log can be stored as a data frame, it can also be stored as a bupaR eventlog object. Storing the event log as an eventlog object allows you to use bupaR functions to wrangle, analyse and visualise the event log with ease.
Event log object
Let us explore a pre-loaded eventlog object in bupaR called patients. patients is a fictitious event log about a hospital’s processes involving patients’ admission.
# library
library(plyr)
library(tidyverse)
library(bupaR)
theme_set(theme_light())
class(patients) # `eventlog` object noted
## [1] "eventlog" "tbl_df" "tbl" "data.frame"
When you print an event log which was saved as an eventlog object, you will see a basic summary of the log in addition to the log itself.
patients
## Event log consisting of:
## 5442 events
## 7 traces
## 500 cases
## 7 activities
## 2721 activity instances
##
## # A tibble: 5,442 x 7
##    handling patient employee handling_id registration_ty~
##    <fct>    <chr>   <fct>    <chr>       <fct>
##  1 Registr~ 1       r1       1           start
##  2 Registr~ 2       r1       2           start
##  3 Registr~ 3       r1       3           start
##  4 Registr~ 4       r1       4           start
##  5 Registr~ 5       r1       5           start
##  6 Registr~ 6       r1       6           start
##  7 Registr~ 7       r1       7           start
##  8 Registr~ 8       r1       8           start
##  9 Registr~ 9       r1       9           start
## 10 Registr~ 10      r1       10          start
## # ... with 5,432 more rows, and 2 more variables: time <dttm>,
## #   .order <int>
In the introduction, I described what information is captured in an event log. The bupaR::mapping function identifies the information captured and the respective variable name in the event log.
mapping(patients)
## Case identifier: patient
## Activity identifier: handling
## Resource identifier: employee
## Activity instance identifier: handling_id
## Timestamp: time
## Lifecycle transition: registration_type
The mapping function also reveals the mandatory information you will need to supply to create an eventlog object. bupaR requires the following information to generate an eventlog object:
- Case identifier
- Activity identifier (refers to the activities)
- Activity instance identifier
- Time stamp
- Lifecycle transition (refers to the status of an activity instance. For example, if an activity instance started or ended. This status will determine if the time stamp collected was for the start or end of an activity instance. Do note that as each status of an activity instance has its own row, the event log is in a long layout.)
- Resource identifier (it refers to the individual executing the activity)
If you have incomplete information to develop an eventlog object, you may wish to refer to bupaR’s site on how to address some of the missing elements.
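As a rough sketch of how the six pieces of information listed above come together, the snippet below reshapes a small, made-up wide table into the long layout and, only if bupaR is installed, passes it to `bupaR::eventlog()`. The data frame, its column names and its values are hypothetical and chosen to resemble the patients log; the argument names follow my understanding of the `eventlog()` constructor, so do check them against the documentation of your installed version.

```r
# Hypothetical wide table: one row per activity instance
wide <- data.frame(
  patient     = c("A", "A", "B"),
  handling    = c("Admit", "Blood test", "Admit"),
  handling_id = c("1", "2", "3"),
  employee    = c("r1", "r2", "r1"),
  start       = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 10:05:00",
                             "2020-01-01 11:00:00"), tz = "UTC"),
  complete    = as.POSIXct(c("2020-01-01 10:05:00", "2020-01-01 10:35:00",
                             "2020-01-01 11:05:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

id_cols <- c("patient", "handling", "handling_id", "employee")

# Long layout: one row per event (start or complete) of an activity instance
long <- rbind(
  data.frame(wide[id_cols], registration_type = "start",
             time = wide$start, stringsAsFactors = FALSE),
  data.frame(wide[id_cols], registration_type = "complete",
             time = wide$complete, stringsAsFactors = FALSE)
)
long <- long[order(long$handling_id, long$time), ]

# Only attempted when bupaR is available; each argument names a column of `long`
if (requireNamespace("bupaR", quietly = TRUE)) {
  toy_log <- bupaR::eventlog(
    long,
    case_id              = "patient",
    activity_id          = "handling",
    activity_instance_id = "handling_id",
    lifecycle_id         = "registration_type",
    timestamp            = "time",
    resource_id          = "employee"
  )
}
```

Each activity instance contributes two rows to `long`, which is why the long table has twice as many rows as the wide one.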
EDA
Let us explore the activities registered in the “patients” event log. Both bupaR and tidyverse/base R functions will be used for comparison.
As each lifecycle transition of an activity instance has its own row (e.g. the start of an activity instance X-ray has its own row and the completion of the X-ray is recorded in a separate row), the event log is in a long layout.
dim(patients)
## [1] 5442 7
I changed the long layout to the wide layout as the wide layout is needed to use the tidyverse approach.
patients_df <- data.frame(patients) %>% # convert object
  select(-.order) %>% # remove this col as we don't need it and it messes with the spread function
  spread(registration_type, time)
dim(patients_df)
## [1] 2721 6
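The long-to-wide step can also be sketched without any packages. The example below uses base R’s `reshape()` on a tiny made-up table whose column names imitate the patients log; it is an illustration of the idea rather than a replacement for the code above. (In current tidyr, `spread()` is superseded by `pivot_wider()`, which would be the other natural choice.)

```r
# Made-up long table: two activity instances, each with a start and a complete event
long <- data.frame(
  patient           = c("1", "1", "1", "1"),
  handling          = c("Registration", "Registration", "X-Ray", "X-Ray"),
  handling_id       = c("1", "1", "2", "2"),
  registration_type = c("start", "complete", "start", "complete"),
  time = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 10:05:00",
                      "2020-01-01 10:30:00", "2020-01-01 10:50:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# One row per activity instance, one timestamp column per lifecycle transition
wide <- reshape(
  long,
  idvar     = c("patient", "handling", "handling_id"),
  timevar   = "registration_type",
  v.names   = "time",
  direction = "wide"
)

# reshape() names the new columns time.start and time.complete; drop the prefix
names(wide) <- sub("^time\\.", "", names(wide))

wide$duration_mins <- as.numeric(difftime(wide$complete, wide$start, units = "mins"))
```

With start and complete side by side, the duration of an activity instance is a simple column subtraction.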
Number of types of activities
bupaR approach
n_activities(patients)
## [1] 7
tidyverse function
n_distinct(patients$handling)
## [1] 7
There are 7 kinds of activities in the “patients” event log. As the example above shows, familiarizing yourself with bupaR’s functions is easy. The wording of bupaR’s function, n_activities, is similar to that of dplyr’s function, n_distinct.
Types of Activities
bupaR function
activity_labels(patients)
## [1] Registration          Triage and Assessment Blood test
## [4] MRI SCAN              X-Ray                 Discuss Results
## [7] Check-out
## 7 Levels: Blood test Check-out Discuss Results MRI SCAN ... X-Ray
base R function
unique(patients$handling)
## [1] Registration          Triage and Assessment Blood test
## [4] MRI SCAN              X-Ray                 Discuss Results
## [7] Check-out
## 7 Levels: Blood test Check-out Discuss Results MRI SCAN ... X-Ray
The 7 activities captured in the event log are a simplified list of the activities one can expect from admission to discharge.
Frequency of Activities
bupaR approach
activities(patients)
## # A tibble: 7 x 3
##   handling              absolute_frequency relative_frequency
##   <fct>                              <int>              <dbl>
## 1 Registration                         500             0.184
## 2 Triage and Assessment                500             0.184
## 3 Discuss Results                      495             0.182
## 4 Check-out                            492             0.181
## 5 X-Ray                                261             0.0959
## 6 Blood test                           237             0.0871
## 7 MRI SCAN                             236             0.0867
tidyverse approach
patients_df %>%
  group_by(handling) %>%
  summarise(absolute = n()) %>%
  mutate(relative = absolute / sum(absolute))
## # A tibble: 7 x 3
##   handling              absolute relative
##   <fct>                    <int>    <dbl>
## 1 Blood test                 237   0.0871
## 2 Check-out                  492   0.181
## 3 Discuss Results            495   0.182
## 4 MRI SCAN                   236   0.0867
## 5 Registration               500   0.184
## 6 Triage and Assessment      500   0.184
## 7 X-Ray                      261   0.0959
The bupaR approach used only one function while the tidyverse approach used three functions.
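To show what the frequency table is computing, here is a minimal base R sketch on a made-up vector of activity labels (one entry per activity instance). The labels and the object names are assumptions for the illustration only.

```r
# Made-up activity labels, one per activity instance
handling <- c("Registration", "Triage", "X-Ray",
              "Registration", "Triage", "Registration")

absolute <- table(handling)  # count of activity instances per activity

freq <- data.frame(
  handling           = names(absolute),
  absolute_frequency = as.integer(absolute),
  relative_frequency = as.numeric(prop.table(absolute)),  # share of all instances
  stringsAsFactors   = FALSE
)

# Most frequent activity first
freq <- freq[order(-freq$absolute_frequency), ]
```

The relative frequencies sum to 1 because every activity instance belongs to exactly one activity.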
Summary of Activity Duration
bupaR approach
processing_time(patients, # event log
                "activity", # level of analysis, in this situation at level of activity
                units = "mins") # time units to be used
## # A tibble: 7 x 11
##   handling   min    q1  mean median    q3   max st_dev   iqr  total
##   <fct>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl>
## 1 Registr~  49.7 124.   165.   163.  204.  338.   57.2  79.9 8.26e4
## 2 Triage ~ 352.  681.   786.   800.  902. 1128.  166.  221.  3.93e5
## 3 Discuss~  80.0 139.   167.   166.  193.  272.   37.7  54.4 8.24e4
## 4 Check-o~  40.0  96.7  124.   124.  148.  234.   37.2  51.6 6.09e4
## 5 X-Ray    138.  233.   291.   288.  339.  490.   76.9 106.  7.59e4
## 6 Blood t~ 185.  285.   332.   328.  376.  488.   63.6  90.7 7.87e4
## 7 MRI SCAN 149.  216.   249.   245.  282.  355.   44.1  65.4 5.88e4
## # ... with 1 more variable: relative_frequency <dbl>
tidyverse approach
patients_df %>%
  mutate(Act_Ins_Dur = complete - start) %>% # duration for each activity instance
  group_by(patient, handling) %>%
  summarise(Act_Duration = sum(Act_Ins_Dur)) %>% # activity duration for each case
  ungroup() %>%
  group_by(handling) %>% # activity duration for each activity at case level
  mutate(min = round((min(Act_Duration)), 1),
         q1 = round((quantile(Act_Duration, .25)), 1),
         mean = round((mean(Act_Duration)), 1),
         median = round((quantile(Act_Duration, .5)), 1),
         q3 = round((quantile(Act_Duration, .75)), 1),
         max = round((max(Act_Duration)), 1),
         st_dev = round((sd(Act_Duration)), 1),
         iqr = round((IQR(Act_Duration)), 1)) %>%
  select(handling, min, q1, mean, median, q3, max, st_dev, iqr) %>%
  unique()
## # A tibble: 7 x 9
## # Groups: handling [7]
##   handling      min     q1      mean    median  q3     max   st_dev  iqr
##   <fct>         <time>  <time>  <time>  <time>  <time> <time> <time> <dbl>
## 1 Blood test    185.4 ~ 285.4 ~ 332.1 ~ 328.1 ~ 376.0~ 488.~ 63.6 ~ 90.7
## 2 Check-out     40.0 ~ 96.7 ~ 123.8 ~ 124.3 ~ 148.3~ 233.~ 37.2 ~ 51.6
## 3 Discuss Resu~ 80.0 ~ 138.9 ~ 166.5 ~ 166.3 ~ 193.2~ 272.~ 37.7 ~ 54.3
## 4 MRI SCAN      149.3 ~ 216.4 ~ 249.0 ~ 245.4 ~ 281.8~ 355.~ 44.1 ~ 65.4
## 5 Registration  49.7 ~ 124.2 ~ 165.2 ~ 162.8 ~ 204.1~ 338.~ 57.2 ~ 79.9
## 6 Triage and A~ 352.1 ~ 681.1 ~ 786.3 ~ 800.4 ~ 901.9~ 1128.~ 165.6 ~ 221.
## 7 X-Ray         137.7 ~ 233.2 ~ 290.8 ~ 287.5 ~ 338.9~ 490.~ 76.9 ~ 106.
The tidyverse approach was rather lengthy as each descriptive value had to be calculated manually. In addition, it required two rounds of group_by.
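For readers who want to see the same summary statistics without either package, the sketch below computes them per activity in base R. The durations are made up, and `summarise_duration` is a hypothetical helper written for this example; it is not a bupaR or dplyr function.

```r
# Made-up activity durations in minutes, one row per case and activity
durations <- data.frame(
  handling = c("X-Ray", "X-Ray", "X-Ray", "Blood test", "Blood test"),
  mins     = c(10, 20, 30, 40, 60),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the descriptive statistics reported above for one activity
summarise_duration <- function(x) {
  c(min    = min(x),
    q1     = quantile(x, 0.25, names = FALSE),
    mean   = mean(x),
    median = median(x),
    q3     = quantile(x, 0.75, names = FALSE),
    max    = max(x),
    st_dev = sd(x),
    iqr    = IQR(x))
}

# Split the durations by activity, summarise each group, one row per activity
stats <- t(sapply(split(durations$mins, durations$handling), summarise_duration))
round(stats, 1)
```

Wrapping the statistics in one helper keeps the grouping step to a single `split()` call.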
Summary of Case Duration
Similar to calculating the summary statistics of activity duration, calculating the summary statistics for case duration is simple and short with bupaR’s function, processing_time. You just need to amend the level argument in the function.
processing_time(patients, level = "log", units = "days")
##       min        q1    median      mean        q3       max    st_dev
## 0.4465741 1.0395833 1.1552951 1.1562278 1.2807002 1.5935764 0.1739458
##       iqr
## 0.2411169
Summary of Case Duration (with condition)
Bed crunch is a common problem in hospitals. A shorter case duration for a patient translates to a shorter length of stay, which allows the hospital to admit the next patient sooner. A member of the hospital staff hypothesizes that patients who have an MRI scan have a longer length of stay, and therefore wishes to compare the case duration of patients who had an MRI scan with that of patients who did not. bupaR’s filter_activity_presence function provides a quick solution. The function “filters cases based on the presence or absence of activities”. The filter_activity_presence function behaves similarly to dplyr’s group_by(case identifier) %>% filter(any(activity == )): it keeps or drops whole cases rather than individual rows.
Case Duration of patients with MRI scan
patients %>%
  filter_activity_presence("MRI SCAN") %>%
  processing_time(level = "log", units = "hours")
##       min        q1    median      mean        q3       max    st_dev
## 21.873056 27.790347 30.769444 30.303944 32.614514 38.245833  3.359803
##       iqr
##  4.824167
Case Duration of patients without MRI scan
patients %>%
  filter_activity_presence("MRI SCAN", method = "none") %>% # set the method argument to "none" for cases without the specific activity
  processing_time(level = "log", units = "hours")
##       min        q1    median      mean        q3       max    st_dev
## 10.717778 23.285764 25.840139 25.465921 27.898958 33.422500  3.448286
##       iqr
##  4.613194
The hospital staff was right: patients who had an MRI scan had a longer case duration than those who did not, with a mean of about 30.3 hours against about 25.5 hours.
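The difference between filtering rows and filtering whole cases is easy to miss, so here is a small base R sketch of the idea behind activity-presence filtering. The data and object names are made up, and this is only an illustration of the concept, not how bupaR implements it.

```r
# Made-up events: patient 1 had an MRI scan, patients 2 and 3 did not
events <- data.frame(
  patient  = c("1", "1", "2", "2", "3"),
  handling = c("Registration", "MRI SCAN", "Registration", "X-Ray", "Registration"),
  stringsAsFactors = FALSE
)

# Row filter: keeps only the MRI rows and loses the rest of the case
mri_rows <- events[events$handling == "MRI SCAN", ]

# Case filter: first find the cases that contain the activity ...
cases_with_mri <- unique(events$patient[events$handling == "MRI SCAN"])

# ... then keep (or drop) every row belonging to those cases
with_mri    <- events[events$patient %in% cases_with_mri, ]
without_mri <- events[!events$patient %in% cases_with_mri, ]
```

Only the case filter preserves the full set of events needed to compute a case duration.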
Conclusion
In this post, we looked at the definitions of an event log and of process mining, and at a package which was created to make process mining convenient. The functions in bupaR have similar wording and behaviour to dplyr functions. One of the benefits of using bupaR’s functions, as seen above, is that they reduce the length of code required to extract the desired results. In the next post, I will cover more process mining concepts and visualizations of process analysis.