by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)

Employee retention has been, and will continue to be, one of the biggest challenges a company faces. Classical tactics such as promotions and competitive perks are still used to retain employees, but companies increasingly rely on machine learning to discover behavioral patterns that help them understand their employees better.

Employee demographic data has long been studied and used to analyze employees’ inclination to leave a company. With the proliferation of the Internet, employee behavior can now be better understood through data such as internal and external social media postings. Such data can be used, for example, to analyze sentiment and thereby estimate an employee’s likelihood of leaving the company. Novel cognitive computing technology based on artificial intelligence tools empowers today’s HR departments to identify staff who are likely to churn before they do. Through proactive intervention, HR can better manage staff and encourage them to remain with the company longer term.

This blog post introduces an R-based data science accelerator that a data scientist can quickly adopt to prototype a solution for the employee attrition prediction scenario. The prediction is based on two types of employee data that companies typically already collect:

1. Static data, which does not tend to change over time. This type of data includes demographic and organizational data such as age, gender, and title. A characteristic of this type of data is that, within a certain period, it either does not change or changes only in a deterministic way. For instance, an employee’s years of service counts as static because the number simply increments every year.
2. Dynamic data, which is the evolving information about an employee. Recent studies have revealed that sentiment plays a critical role in employee attrition prediction. Classical measures of sentiment require employee surveys of work satisfaction. Social media posts are also useful for sentiment analysis, as employees may express their feelings about work in them. Unstructured data such as text can be mined for patterns that are indicative of employees with different inclinations to churn.

Attrition prediction is a scenario that takes historic employee data as input and identifies individuals who are inclined to leave. The basic procedure is to extract features from the available data, which might previously have been analyzed manually, and to build predictive models on a training set labeled with employment status. Normally this can be formalized as a supervised classification problem. Its distinguishing feature is class imbalance: employees who have left are usually far fewer than those who have stayed. Training on such an imbalanced data set typically calls for resampling or cost-sensitive learning techniques. For sentiment analysis on unstructured data such as text, pre-processing techniques that extract analysis-friendly quantitative features should be applied. Commonly used feature extraction methods for text analysis include word-to-vector, term frequency, and term frequency-inverse document frequency. The choice of algorithm for building the model depends on the data characteristics. When a specific algorithm does not yield the desired results, we have found that ensemble techniques can be deployed to effectively boost model performance.
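To make the resampling idea concrete, here is a minimal sketch in base R that upsamples the minority class before fitting a logistic regression. The toy data, the column names, and the helper upsample_minority() are hypothetical and used only for illustration; they are not taken from the accelerator, which may balance the data differently.

```r
set.seed(42)

# Hypothetical toy data: 50 leavers (1) and 450 stayers (0).
employees <- data.frame(
  Attrition      = rep(c(1, 0), times = c(50, 450)),
  MonthlyIncome  = c(rnorm(50, 4500, 1000), rnorm(450, 6000, 1500)),
  YearsAtCompany = sample(0:20, 500, replace = TRUE)
)

# Illustrative helper (not from any package): duplicate randomly chosen
# minority-class rows until both classes have the same number of rows.
upsample_minority <- function(data, label) {
  counts   <- table(data[[label]])
  minority <- names(counts)[which.min(counts)]
  n_extra  <- max(counts) - min(counts)
  rows     <- which(data[[label]] == minority)
  extra    <- rows[sample.int(length(rows), n_extra, replace = TRUE)]
  data[c(seq_len(nrow(data)), extra), , drop = FALSE]
}

balanced <- upsample_minority(employees, "Attrition")
print(table(balanced$Attrition))

# Fit a simple classifier on the balanced data.
model <- glm(Attrition ~ MonthlyIncome + YearsAtCompany,
             data = balanced, family = binomial())
print(coef(model))
```

In practice the resampling would be applied to the training split only, so that the held-out data keeps the original class proportions and the evaluation stays realistic.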

The data science language R is a convenient tool for performing HR churn prediction analysis. A lightweight data science accelerator that demonstrates the process of predicting employee attrition is shared in this GitHub repository. The walk-through shows cutting-edge machine learning and text mining techniques applied in R.

The code for the analytics is provided in an R Markdown document, which can be executed interactively, step by step, to aid replication and learning. Documents in various formats (e.g., PDF, HTML, Jupyter Notebook) can be produced directly from the R Markdown document.

Taking this employee attrition data set, for example, one can easily visualize and perform correlation analysis with R packages. For instance, we may plot the distribution of monthly income of employees from different departments and intuitively analyze whether income can be a factor that affects attrition.

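library(xlsx)     # provides read.xlsx()
library(ggplot2)  # provides ggplot() and friends

# Read the first sheet of the attrition workbook.
df <- read.xlsx("./attrition.xlsx",
                sheetIndex=1,
                colClasses=NA)

# Box plots of monthly income per department, colored by attrition status.
# The legend title is set on the color scale, since the plot maps color, not fill.
ggplot(df, aes(x=factor(Department), y=MonthlyIncome, color=factor(Attrition))) +
  geom_boxplot() +
  xlab("Department") +
  ylab("Monthly income") +
  scale_color_discrete(guide=guide_legend(title="Attrition")) +
  theme_bw()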
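The correlation analysis mentioned above can be sketched in base R as follows. This is only an illustration on hypothetical toy data: the "Yes"/"No" coding of Attrition and the simulated incomes are assumptions, not values from the attrition data set.

```r
set.seed(1)

# Hypothetical toy data standing in for the attrition data set.
toy <- data.frame(
  Attrition     = rep(c("Yes", "No"), times = c(40, 160)),
  MonthlyIncome = c(rnorm(40, 4500, 1200), rnorm(160, 6500, 1800)),
  stringsAsFactors = FALSE
)

# Encode attrition as 0/1 so it can be correlated with a numeric column
# (this is the point-biserial correlation).
left <- as.numeric(toy$Attrition == "Yes")

r <- cor(toy$MonthlyIncome, left)
print(r)

# cor.test() adds a significance test and a confidence interval.
print(cor.test(toy$MonthlyIncome, left))
```

A clearly negative coefficient would suggest that lower income goes together with attrition, which is the kind of pattern the box plot is meant to reveal visually.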


R packages such as tm offer handy functions for text mining tasks on employees’ social media posts. Sometimes text analysis in multiple languages is of great interest to employers that have subsidiaries in locations where different languages are spoken. This can be done either with a language-specific text analysis package (e.g., jiebaR for Chinese segmentation) or with online translation. This R code demonstrates how to perform tokenization and term frequency counting on English-Chinese text with jiebaR and tm.
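As a rough illustration of these steps, the sketch below counts term frequencies and computes TF-IDF weights for a few English posts with tm, and segments one Chinese sentence with jiebaR if that package is installed. The example posts are made up, and this is not the code from the accelerator.

```r
library(tm)

# Hypothetical example posts.
posts <- c("I love my team and my work",
           "Work is stressful and the hours are long",
           "Thinking about leaving, the work is not rewarding")

# Build a corpus and normalize the text.
corpus <- VCorpus(VectorSource(posts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Raw term frequencies across all posts.
dtm <- DocumentTermMatrix(corpus)
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
print(term_freq)

# TF-IDF weighting: a term that occurs in every post gets zero weight.
tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
print(round(as.matrix(tfidf), 3))

# Chinese text has no spaces between words, so it is segmented first.
# The tokens could then be joined with spaces and passed to tm as above.
if (requireNamespace("jiebaR", quietly = TRUE)) {
  seg <- jiebaR::worker()
  print(jiebaR::segment("我喜欢我的团队", seg))
}
```

Note that DocumentTermMatrix() drops terms shorter than three characters by default, which matters for Chinese tokens; the wordLengths control option can be adjusted if short tokens should be kept.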