Process Mining (Part 1/3): Introduction to bupaR package

[This article was first published on R on notast, and kindly contributed to R-bloggers.]

Event Logs

The digitalization of healthcare is more than just electronic medical records. It also allows every instance of a clinician conducting an activity for a patient to be stored as a log. These logs, stored in the healthcare institution’s information system, are known as event logs. Event logs capture the following fundamental information:

  1. Activity

An activity is a well-defined step in a process. For example, in a hospital admission, activities include registration, blood test and discharge.

  2. Case identifier

The case identifier is a unique identifier which allows activities to be tagged to the respective case. For example, in a hospital admission, the case identifier will be the patient identifier.

  3. Activity instance identifier

The activity instance identifier is a unique identifier for each occurrence of an activity. It is important to understand the difference in granularity between an activity and an activity instance. An activity instance is a single execution of an activity, so a case can contain more activity instances than activities. For example, a patient who is X-rayed twice during an admission has one activity, “X-ray”, but two activity instances. Each activity instance can in turn be recorded as more than one event, such as one event when the X-ray started and another when it ended; the activity instance identifier ties these events to the same execution. Together with the timestamp, this allows the activity instances within a case to be arranged in sequential order, which is critical to understanding the workflow of a process.

Event logs can also contain other information such as:

  1. Timestamp (Timestamp can be when the instance of an activity started or ended or both.)
  2. Resource (Resource refers to the person or device which conducted the activity. In the hospital, the resource for the activity X-ray will be the radiographer.)
  3. Location (Location refers to the location where the activity was conducted. In the hospital, the location for the activity X-ray will be the radiology department.)

Below is an example of an event log from a hospital database:

Case ID   Activity Instance   Activity     Time Start   Time End
A         1                   Admit        10:00        10:05
A         2                   Blood test   10:05        10:35
B         3                   Admit        11:00        11:05
C         4                   Admit        12:00        12:10
C         5                   X-ray        12:30        12:50
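
As a rough illustration, the example log above could be held in an ordinary data frame using base R only. The column names and the calendar date attached to the clock times are assumptions made for this sketch; they are not part of the original table.

```r
# Toy event log mirroring the example table; column names are illustrative.
log_wide <- data.frame(
  case_id           = c("A", "A", "B", "C", "C"),
  activity_instance = 1:5,
  activity          = c("Admit", "Blood test", "Admit", "Admit", "X-ray"),
  time_start        = c("10:00", "10:05", "11:00", "12:00", "12:30"),
  time_end          = c("10:05", "10:35", "11:05", "12:10", "12:50"),
  stringsAsFactors  = FALSE
)

# Parse the clock times on an arbitrary (assumed) date so they can be subtracted.
to_time <- function(x) {
  as.POSIXct(paste("2020-01-01", x), format = "%Y-%m-%d %H:%M", tz = "UTC")
}

# Duration of each activity instance, in minutes.
log_wide$duration_mins <- as.numeric(
  difftime(to_time(log_wide$time_end), to_time(log_wide$time_start), units = "mins")
)

# Number of activity instances recorded for each case.
instances_per_case <- table(log_wide$case_id)

log_wide
instances_per_case
```

Each row here is one activity instance with both timestamps side by side, which is a wide layout.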

Process Mining

With the rich information in the event logs, process mining can be done. Process mining is the extraction of information and workflow models from event logs. The content of the post may overlap with related topics such as business intelligence, event log analysis and process analysis. Nonetheless, “process mining” will be the default term used in the post. In healthcare, process mining has been used to uncover the sequencing of activities, identify bottlenecks, identify outliers when compared against a theoretical workflow model, and examine the relationship between providers.


R has a package, bupaR, to do process mining and analysis. bupaR is the core package and when you load bupaR you load other packages (e.g. eventdataR, processmapR) used for process mining.

While an event log can be stored as a data frame, it can also be stored as a bupaR eventlog object. Storing the event log as an eventlog object allows you to use bupaR functions to wrangle, analyse and visualise the event log with ease.

Event log object

Let us explore a pre-loaded eventlog object in bupaR called patients. patients is a fictitious event log about a hospital’s processes involving patients’ admission.

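# library
library(bupaR)
library(edeaR)      # attached explicitly in case your bupaR version does not load it
library(eventdataR) # provides the `patients` event log
library(tidyverse)

class(patients) # `eventlog` object noted
## [1] "eventlog"   "tbl_df"     "tbl"        "data.frame"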

When you print an event log that is stored as an eventlog object (here, by typing patients at the console), you see a basic summary of the log in addition to the log itself.

## Event log consisting of:
## 5442 events
## 7 traces
## 500 cases
## 7 activities
## 2721 activity instances
## # A tibble: 5,442 x 7
##    handling patient employee handling_id registration_ty~
##  1 Registr~ 1       r1       1           start           
##  2 Registr~ 2       r1       2           start           
##  3 Registr~ 3       r1       3           start           
##  4 Registr~ 4       r1       4           start           
##  5 Registr~ 5       r1       5           start           
##  6 Registr~ 6       r1       6           start           
##  7 Registr~ 7       r1       7           start           
##  8 Registr~ 8       r1       8           start           
##  9 Registr~ 9       r1       9           start           
## 10 Registr~ 10      r1       10          start           
## # ... with 5,432 more rows, and 2 more variables: time ,
## #   .order 

In the introduction, I described what information is captured in an event log. The bupaR::mapping function identifies the information captured and the respective variable name in the event log.
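
The mapping shown below can be reproduced with a call along the following lines. This is a minimal sketch which assumes that the patients log comes from the eventdataR package. The single-role getters (case_id, activity_id, resource_id) are included only to show that each role can also be queried on its own.

```r
library(bupaR)
library(eventdataR) # assumed source of the `patients` event log

# Print the full mapping of event log roles to column names.
mapping(patients)

# Individual roles can also be queried one at a time.
case_id(patients)
activity_id(patients)
resource_id(patients)
```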

## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type

The mapping function also revealed the mandatory information you will need to supply to create an eventlog object. bupaR requires the following information to generate an eventlog object:

  1. Case identifier
  2. Activity identifier (refers to the activities)
  3. Activity instance identifier
  4. Timestamp
  5. Lifecycle transition (refers to the status of an activity instance, for example whether it started or ended. This status determines whether the timestamp was recorded for the start or the end of the activity instance. As each status of an activity instance has its own row, the event log is in a long layout.)
  6. Resource identifier (refers to the individual executing the activity)
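
To make the list above concrete, here is a small sketch of building an eventlog object from a plain data frame with bupaR's eventlog() function. The toy data, the dates and the object names raw_events and toy_log are invented for illustration. The column names simply copy those of the patients log.

```r
library(bupaR)

# Hypothetical raw data in long layout: one row per lifecycle transition.
raw_events <- data.frame(
  patient           = c("A", "A", "A", "A", "B", "B"),
  handling          = c("Admit", "Admit", "Blood test", "Blood test", "Admit", "Admit"),
  handling_id       = c("1", "1", "2", "2", "3", "3"),
  registration_type = c("start", "complete", "start", "complete", "start", "complete"),
  time              = as.POSIXct(c("2020-01-01 10:00", "2020-01-01 10:05",
                                   "2020-01-01 10:05", "2020-01-01 10:35",
                                   "2020-01-01 11:00", "2020-01-01 11:05"), tz = "UTC"),
  employee          = c("r1", "r1", "r2", "r2", "r1", "r1"),
  stringsAsFactors  = FALSE
)

# Tell bupaR which column plays which role.
toy_log <- eventlog(
  raw_events,
  case_id              = "patient",
  activity_id          = "handling",
  activity_instance_id = "handling_id",
  lifecycle_id         = "registration_type",
  timestamp            = "time",
  resource_id          = "employee"
)

toy_log
```

The six arguments after the data frame correspond one-to-one to the six items in the list.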

If you have incomplete information to develop an eventlog object, you may wish to refer to bupaR’s site on how to address some of the missing elements.


Let us explore the activities registered in the “patients” event log. Both bupaR and tidyverse/base R functions will be used for comparison.

As each lifecycle transition of an activity instance has its own row (e.g. the start of an X-ray activity instance has its own row and the completion of the X-ray is recorded in a separate row), the event log is in a long layout.
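
The dimensions reported below are presumably the result of calling dim() on the log. The sketch here assumes so, and adds a tabulation of the lifecycle column to show where the two rows per activity instance come from.

```r
library(bupaR)
library(eventdataR) # assumed source of the `patients` event log

# Rows (events) and columns of the long-layout log.
dim(patients)

# Each activity instance contributes one "start" row and one "complete" row.
table(patients$registration_type)
```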

## [1] 5442    7

I converted the long layout to a wide layout, as the wide layout is needed for the tidyverse approach.

patients_df <- data.frame(patients) %>% # convert the eventlog object to a data frame
  select(-.order) %>% # drop this column; it is not needed and it interferes with spread()
  spread(registration_type, time) # one row per activity instance, with start and complete columns

dim(patients_df)

## [1] 2721    6

Number of types of activities
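
The two counts reported in this section can be obtained with calls such as the following. As an assumption made to keep the sketch self-contained, the dplyr version is applied to the activity column of patients directly rather than to the converted data frame.

```r
library(bupaR)
library(eventdataR) # assumed source of the `patients` event log
library(dplyr)

# bupaR: number of distinct activities in the event log.
n_activities(patients)

# dplyr: number of distinct values in the activity column.
n_distinct(patients$handling)
```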

bupaR approach

## [1] 7

tidyverse function

## [1] 7

There are 7 kinds of activities in the “patients” event log. As the example above shows, familiarizing yourself with bupaR’s functions is easy: the name of bupaR’s function, n_activities, is similar to that of dplyr’s function, n_distinct.

Types of Activities
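
The activity names listed in this section can be retrieved as sketched here. activity_labels() is the bupaR function; unique() on the activity column is the base R counterpart, applied to patients directly so that the sketch stays self-contained (an assumption of this example).

```r
library(bupaR)
library(eventdataR) # assumed source of the `patients` event log

# bupaR: labels of the activities in the event log.
activity_labels(patients)

# base R: unique values of the activity column.
unique(patients$handling)
```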

bupaR function

## [1] Registration          Triage and Assessment Blood test           
## [4] MRI SCAN              X-Ray                 Discuss Results      
## [7] Check-out            
## 7 Levels: Blood test Check-out Discuss Results MRI SCAN ... X-Ray

base R function

## [1] Registration          Triage and Assessment Blood test           
## [4] MRI SCAN              X-Ray                 Discuss Results      
## [7] Check-out            
## 7 Levels: Blood test Check-out Discuss Results MRI SCAN ... X-Ray

The 7 activities captured in the event log are a simplified list of the activities one can expect from admission to discharge.

Frequency of Activities

bupaR approach
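
A frequency table like the one below is what bupaR's activities() function returns. A minimal sketch, assuming the patients log from eventdataR:

```r
library(bupaR)
library(eventdataR) # assumed source of the `patients` event log

# Absolute and relative frequency of each activity,
# counted in activity instances rather than events.
activity_freq <- activities(patients)
activity_freq
```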

## # A tibble: 7 x 3
##   handling              absolute_frequency relative_frequency
## 1 Registration                         500             0.184 
## 2 Triage and Assessment                500             0.184 
## 3 Discuss Results                      495             0.182 
## 4 Check-out                            492             0.181 
## 5 X-Ray                                261             0.0959
## 6 Blood test                           237             0.0871
## 7 MRI SCAN                             236             0.0867

tidyverse approach

patients_df %>%
  group_by(handling) %>%
  summarise(absolute = n()) %>%
  mutate(relative = absolute / sum(absolute))
## # A tibble: 7 x 3
##   handling              absolute relative
## 1 Blood test                 237   0.0871
## 2 Check-out                  492   0.181 
## 3 Discuss Results            495   0.182 
## 4 MRI SCAN                   236   0.0867
## 5 Registration               500   0.184 
## 6 Triage and Assessment      500   0.184 
## 7 X-Ray                      261   0.0959

The bupaR approach used only one function, while the tidyverse approach chained three dplyr verbs (group_by, summarise and mutate).

Summary of Activity Duration

bupaR approach

processing_time(patients, #event log 
  "activity", # level of analysis, in this situation at level of activity
  units="mins") #time units to be used
## # A tibble: 7 x 11
##   handling   min    q1  mean median    q3   max st_dev   iqr  total
## 1 Registr~  49.7 124.   165.   163.  204.  338.   57.2  79.9 8.26e4
## 2 Triage ~ 352.  681.   786.   800.  902. 1128.  166.  221.  3.93e5
## 3 Discuss~  80.0 139.   167.   166.  193.  272.   37.7  54.4 8.24e4
## 4 Check-o~  40.0  96.7  124.   124.  148.  234.   37.2  51.6 6.09e4
## 5 X-Ray    138.  233.   291.   288.  339.  490.   76.9 106.  7.59e4
## 6 Blood t~ 185.  285.   332.   328.  376.  488.   63.6  90.7 7.87e4
## 7 MRI SCAN 149.  216.   249.   245.  282.  355.   44.1  65.4 5.88e4
## # ... with 1 more variable: relative_frequency 

tidyverse approach

patients_df %>%
  mutate(Act_Ins_Dur = complete - start) %>% # duration of each activity instance
  group_by(patient, handling) %>%
  summarise(Act_Duration = sum(Act_Ins_Dur)) %>% # activity duration for each case
  ungroup() %>%
  group_by(handling) %>% # summary statistics for each activity
  mutate(min = round(min(Act_Duration), 1),
         q1 = round(quantile(Act_Duration, .25), 1),
         mean = round(mean(Act_Duration), 1),
         median = round(quantile(Act_Duration, .5), 1),
         q3 = round(quantile(Act_Duration, .75), 1),
         max = round(max(Act_Duration), 1),
         st_dev = round(sd(Act_Duration), 1),
         iqr = round(IQR(Act_Duration), 1)) %>%
  select(handling, min, q1, mean, median, q3, max, st_dev, iqr) %>%
  unique()
## # A tibble: 7 x 9
## # Groups:   handling [7]
##   handling      min     q1      mean    median  q3     max    st_dev    iqr

The tidyverse approach was rather lengthy as each descriptive statistic had to be calculated explicitly. In addition, it required two rounds of group_by.

Summary of Case Duration

As with the summary statistics of activity duration, calculating the summary statistics for case duration is simple and short with bupaR’s function, processing_time. You just need to amend the level argument in the function.

processing_time(patients, level="log", units="days")
##       min        q1    median      mean        q3       max    st_dev 
## 0.4465741 1.0395833 1.1552951 1.1562278 1.2807002 1.5935764 0.1739458 
##       iqr 
## 0.2411169

Summary of Case Duration (with condition)

Bed crunch is a common problem in hospitals. A shorter case duration for a patient translates to a shorter length of stay, which allows the hospital to admit the next patient. A hospital staff member hypothesizes that patients who have an MRI scan have a longer length of stay, and therefore wishes to compare the case duration of patients who had an MRI scan with that of patients who did not. bupaR’s filter_activity_presence function provides a quick solution. The function “filters cases based on the presence or absence of activities”. The filter_activity_presence function behaves similarly to dplyr’s group_by(case identifier) %>% filter(any(activity == )), which keeps every row of a case in which the activity occurs at least once.
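
To illustrate the dplyr analogy, here is a small sketch on an invented data frame; the object toy and its contents are hypothetical. It keeps or drops whole cases depending on whether the activity of interest occurs in them. That is the idea behind filter_activity_presence, although the sketch does not reproduce its implementation.

```r
library(dplyr)

# Hypothetical toy log: one row per activity instance.
toy <- data.frame(
  patient  = c("A", "A", "B", "C", "C"),
  handling = c("Admit", "MRI SCAN", "Admit", "Admit", "X-Ray"),
  stringsAsFactors = FALSE
)

# Cases in which an MRI scan occurs at least once (all rows of the case are kept).
with_mri <- toy %>%
  group_by(patient) %>%
  filter(any(handling == "MRI SCAN")) %>%
  ungroup()

# Cases in which no MRI scan occurs.
without_mri <- toy %>%
  group_by(patient) %>%
  filter(!any(handling == "MRI SCAN")) %>%
  ungroup()

with_mri
without_mri
```

Filtering on handling == "MRI SCAN" without any() would keep only the MRI rows themselves, which is not enough to compute a case duration.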

Case Duration of patients with MRI scan

patients %>% filter_activity_presence("MRI SCAN") %>% processing_time(level="log", units="hours")
##       min        q1    median      mean        q3       max    st_dev 
## 21.873056 27.790347 30.769444 30.303944 32.614514 38.245833  3.359803 
##       iqr 
##  4.824167

Case Duration of patients without MRI scan

patients %>%
  filter_activity_presence("MRI SCAN",
    method = "none") %>% # set the method argument to "none" to keep cases without the specified activity
  processing_time(level = "log", units = "hours")
##       min        q1    median      mean        q3       max    st_dev 
## 10.717778 23.285764 25.840139 25.465921 27.898958 33.422500  3.448286 
##       iqr 
##  4.613194

The summaries support the staff member’s hypothesis: patients who had an MRI scan had a longer case duration (a mean of about 30.3 hours) than those who did not (about 25.5 hours).


In this post, we looked at the definitions of event logs and process mining, and at a package which was created to make process mining convenient. The functions in bupaR have similar wording and behaviour to dplyr functions. One of the benefits of bupaR’s functions seen above is that they reduce the length of code required to extract the desired results. In the next post, I will cover more process mining concepts and visualizations of process analysis.
