In this post, I show how to build an interactive geographic dashboard using Displayr, Plotly and R. It is particularly fascinating in that it tracks the real-time position of military aircraft. To do so, I am first going to bring in some data from two separate sources (regions based on the size of military air-force, and then the real-time tracking of aircraft positions). The dashboard displays the dynamic data in two ways: region shading (to indicate the size of the military force per country) and point markers (for the aircraft positions). I then build a map to neatly and attractively display all this data.
I am going to shade each country of the map according to the size of its military air-force. To do this I need a list of aircraft per country. Fortunately, the ever useful Wikipedia has just what I need (here). The following code reads in the data and cleans it to produce a neat table.
library(httr) library(XML) # Extract table via URL url &amp;lt;- "https://en.wikipedia.org/wiki/List_of_countries_by_level_of_military_equipment#List" r &amp;lt;- GET(url) airforces &amp;lt;- readHTMLTable(doc = content(r, "text"))[[2]] # Clean required columns airforces &amp;lt;- airforces[-1, c("Country[note 1]", "Military aircraft[note 3]")] colnames(airforces) &amp;lt;- c("Country", "MilitaryAircraft") remove.bracket.content &amp;lt;- function(s) { return(gsub("\\[.+\\]", "", s)) } airforces &amp;lt;- data.frame(apply(airforces, 2, remove.bracket.content)) airforces$MilitaryAircraft &amp;lt;- as.numeric(gsub(",", "", airforces$MilitaryAircraft)) airforces
Compared to the above, the second source of data is more dynamic. I am using ADS-B, which pools real-time flight tracking information from across the globe. Not all military operations are top secret. Apparently, some aircraft operated by the armed forces broadcast their positions publicly.
To collate this information, I construct a URL to retrieve a JSON object with information about military aircraft (JSON is flexible text format used for data exchange). Then I parse the JSON to create a data.frame.
library(jsonlite)&amp;lt;/pre&amp;gt; url &amp;lt;- "http://public-api.adsbexchange.com/VirtualRadar/AircraftList.json?" url &amp;lt;- paste0(url, "fMilQ=TRUE") positions &amp;lt;- fromJSON(url)$acList if (length(positions) != 0) { positions &amp;lt;- positions[positions$Type != "TEST", ] positions &amp;lt;- positions[!is.na(positions$Lat), ] } positions
The following code produces a plotly map of the world. The countries are shaded according to the size of the air-force, with the scale shown on a legend. In plotly terminology, each layer of the map is known as a trace.
library(plotly) library(flipFormat) # specify map area and projection g &amp;lt;- list(scope = "world", showframe = FALSE, showcoastlines = TRUE, projection = list(type = 'mercator'), lonaxis = list(range = c(-140, 179)), lataxis = list(range = c(-55, 70)), resolution = 50) # higher resolution # shade countries by airforce size p &amp;lt;- plot_geo(airforces) %&amp;gt;% add_trace(data = airforces, name = "Airforce", z = ~MilitaryAircraft, color = ~MilitaryAircraft, colors = 'Blues', locations = ~Country, marker = list(line = list(color = toRGB("grey"), width = 0.5)), showscale = TRUE, locationmode = "country names", colorbar = list(title = 'Airforce', separatethousands = TRUE)) %&amp;gt;% config(displayModeBar = F) %&amp;gt;% layout(geo = g, margin = list(l=0, r=0, t=0, b=0, pad=0), paper_bgcolor = 'transparent')
Finally, I add markers showing the airplane locations as another trace. I use a different color for those with speed less than 200 knots and altitude less than 2000 feet. The hover text contains more detail about the aircraft.
aircolors = rep("airborne", nrow(positions)) # create a vector with the status of each aeroplane aircolors[positions$Spd &amp;lt; 200 &amp;amp; positions$Alt &amp;lt; 2000] &amp;lt;- "ground/approach" hovertext = paste0("Operator:", positions$Op, "\nModel:", positions$Mdl, "\nAltitide(ft):", sapply(positions$Alt, FormatAsReal)) hoverinfo = rep("all", nrow(positions)) p = add_trace(p, data = positions, x = positions$Long, y = positions$Lat, color = aircolors, hovertext = hovertext, showlegend = FALSE)
The final result is shown below. Since broadcasting of position is purely voluntary, I suspect that the Top Gun flights are not shown!
While the map above shows the required information, it can easily be made more useful and appealing. Displayr allows me to add a control to switch between regions, text and image annotations, and a background. Here is a link to the finished dashboard, with a screenshot shown below.
Thinking about it again, if you can’t see any military aircraft at all on the radar, then you should really worry!
You can explore the dashboards used in this post within Displayr. All the code is embedded within the pages (look on the right-hand panel after you click on an object). Click here to see the copy of the working document (without background and finishing touches). Click here open a copy of the finished document that includes the background, control box, and other finishing touches.
You know the drill by now. Grab the tweets. Generate the report using RMarkdown. Push to Github. Publish to RPubs.
This time it’s the Australian Bioinformatics & Computational Biology Society Conference 2017, including the COMBINE symposium. Looks like a good time was had by all in Adelaide.
A couple of quirks this time around. First, the rtweet package went through a brief phase of returning lists instead of nice data frames. I hope that’s been discarded as a bad idea
Second, results returned from a hashtag search that did not contain said hashtags, but were definitely from the meeting. Yet to get to the bottom of that one. I treated ABACBS and COMBINE as one event for this analysis.
Third, given that most Twitter users have had 280 characters since about November 7, is this reflected in the conference tweets? It is not. In fact, most tweets seem to hit the hard limit of 140, which makes me wonder if users’ clients were not updated, or the R package has a hard-coded character limit in its parsing functions (longest = 148 characters).
Filed under: australia, bioinformatics, meetings, R, statistics Tagged: abacbs, twitter
Twitter recently doubled the maximum length of a tweet to 280 characters, and while all users now have access to longer tweets, few have taken advantage of the opportunity. Bob Rudis used the rtweet package to analyze tweets sent with the #rstats hashtag since 280-char tweets were introduced, and most still kept below the old 280-character limit. The actual percentage differed by the Twitter client being used; I’ve reproduced the charts for the top 10 clients below. (Click the chart for even more clients, and details of the analysis including the R code that generated it.)
That being said, some have embraced the new freedom with gusto, not least my Microsoft colleague Paige Bailey who demonstrated you can fit some pretty nice R charts — and even animations! — into just 280 characters:
df <- expand.grid(x = 1:10, y=1:10)
df$angle <- runif(100, 0, 2*pi)
df$speed <- runif(100, 0, sqrt(0.1 * df$x))ggplot(df, aes(x, y)) +
geom_point() +
geom_spoke(aes(angle = angle, radius = speed))y’all twitterpeople give me 280 characters?
yr just gonna get code samples pic.twitter.com/hyGEE2DxGy— @DynamicWebPaige (@DynamicWebPaige) November 8, 2017
x <- getURL(“https://t.co/ivZZvodbNK“)
b <- read.csv(text=x)
c <- get_map(location=c(-122.080954,36.971709), maptype=”terrain”, source=”google”, zoom=14)
ggmap(c) +
geom_path(data=b, aes(color=elevation), size=3)+
scale_color_gradientn(colours=rainbow(7), breaks=seq(25, 200, 25)) pic.twitter.com/7WdQLR56uZ— @DynamicWebPaige (@DynamicWebPaige) November 11, 2017
library(dygraphs)
lungDeaths <- cbind(mdeaths, fdeaths)
dygraph(lungDeaths) %>%
dySeries(“mdeaths”, label = “Male”) %>%
dySeries(“fdeaths”, label = “Female”) %>%
dyOptions(stackedGraph = TRUE) %>%
dyRangeSelector(height = 20)I ❤️ R’s interactive visualizations SO MUCH pic.twitter.com/LevjElly3L
— @DynamicWebPaige (@DynamicWebPaige) November 16, 2017
library(plot3D)
par(mar = c(2, 2, 2, 2))
par(mfrow = c(1, 1))
x <- seq(0, 2*pi,length.out=50)
y <- seq(0, pi,length.out=50)
M <- mesh(x, y)surf3D(x = (3+2*cos(M$x)) * cos(M$y),
y = (3+2*cos(M$x)) * sin(M$y),
z = 2 * sin(M$x),
colkey=FALSE,
bty=”b2″) pic.twitter.com/a6GwQTaGYC— @DynamicWebPaige (@DynamicWebPaige) November 17, 2017
For more 280-character examples in R and other languages follow this Twitter thread, and for more on analyzing tweets with rtweet follow the link below.
rud.is: Twitter Outer Limits : Seeing How Far Have Folks Fallen Down The Slippery Slope to “280” with rtweet
Introduction
Kaggle.com is a website designed for data scientists and data enthusiasts to connect and compete with each other. It is an open community that hosts forums and competitions in the wide field of data. In each Kaggle competition, competitors are given a training data set, which is used to train their models, and a test data set, used to test their models. Kagglers can then submit their predictions to view how well their score (e.g., accuracy or error) compares to others’.
As a team, we joined the House Prices: Advanced Regression Techniques Kaggle challenge to test our model building and machine learning skills. For this competition, we were tasked with predicting housing prices of residences in Ames, Iowa. Our training data set included 1460 houses (i.e., observations) accompanied by 79 attributes (i.e., features, variables, or predictors) and the sales price for each house. Our testing set included 1459 houses with the same 79 attributes, but sales price was not included as this was our target variable.
To view our code (split between R and Python) and our project presentation slides for this project see our shared GitHub repository.
Understanding the Data
Of the 79 variables provided, 51 were categorical and 28 were continuous.
Our first step was to combine these data sets into a single set both to account for the total missing values and to fully understand all the classes for each categorical variable. That is, there might be missing values or different class types in the test set that are not in the training set.
Processing the Data
Response Variable
As our response variable, Sale Price, is continuous, we will be utilizing regression models. One assumption of linear regression models is that the error between the observed and expected values (i.e., the residuals) should be normally distributed. Violations of this assumption often stem from a skewed response variable. Sale Price has a right skew, so we log + 1 transform it to normalize its distribution.
Missing Data
Machine learning algorithms do not handle missing values very well, so we must obtain an understanding of the missing values in our data to determine the best way to handle them. We find that 34 of the predictor variables have values that are interpreted by R and Python as missing (i.e., “NA” and “NaN”). Below we describe examples of some of the ways we treated these missing data.
1) NA/NaN is actually a class:
In many instances, what R and Python interpret as a missing value is actually a class of the variable. For example, Pool Quality is comprised of 5 classes: Excellent, Good, Fair, Typical, and NA. The NA class describes houses that do not have a pool, but our coding languages interpret houses of NA class as a missing value instead of a class of the Pool Quality variable.
Our solution was to impute most of the NA/NaN values to a value of “None.”
2) Not every NA/NaN corresponds to a missing attribute:
While we found that most NA/NaN values corresponded to an actual class for different variables, some NA/NaN values actually represented missing data. For example, we find that three houses with NA/NaN values for Pool Quality, also have a non-zero value for the variable, Pool Area (square footage of pool). These three houses likely have a pool, but its quality was not assessed or input into the data set.
Our solution was to first calculate mean Pool Area for each class of Pool Quality, then impute the missing Pool Quality classes based on how close that house’s Pool Area was to the mean Pool Areas for each Pool Quality class. For example, the first row in the below picture on the left has a Pool Area of 368 square feet. The average Pool Area for houses with Excellent pool quality (Ex) is about 360 square feet (picture on the right). Therefore, we imputed this house to have a Pool Quality of Excellent.
3) Domain knowledge:
Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable, Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors regarding the price of the house. Therefore, dropping these variables seems ill-advised.
Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage for the neighborhood in which the house with missing value was located.
4) Imputing with mode:
Most variables have some intuitive relationship to other variables, and imputation can be based on these related features. But some missing values are found in variables with no apparent relation to others. For example, the Electrical variable, which describes the electrical system, was missing for a single observation.
Our solution was to simply find the most common class for this categorical variable and impute for this missing value.
Ordinal Categories
For linear (but not tree-based) models, categorical variables must be treated as continuous. There are two types of categorical features: ordinal, where there is an inherent order to the classes (e.g., Excellent is greater than Good, which is greater than Fair), and nominal, where there is no obvious order (e.g., red, green, and blue).
Our solution for ordinal variables was to simply assign the classes a number corresponding to their relative ranks. For example, Kitchen Quality has five classes: Excellent, Good, Typical, Fair, and Poor, which we encoded (i.e., converted) to the numbers 5, 4, 3, 2, and 1, respectively.
Nominal Categories
The ranking of nominal categories is not appropriate as there is no actual rank among the classes.
Our solution was to one-hot encode these variables, which creates a new variable for each class with values of zero (not present) or one (present).
Outliers
An outlier can be defined with a quantitative (i.e., statistical) or qualitative definition. We opted for the qualitative version when looking for outliers: observations that are abnormally far from other values. Viewing the relationship between Above Ground Living Area and Sale Price, we noticed some very large areas for very low prices.
Our solution was to remove these observations as we thought they fit our chosen definition of an outlier, and because they might increase our models’ errors.
Skewness
While there are few assumptions regarding the independent variables of regression models, often transforming skewed variables to a normal distribution can improve model performance.
Our solution was to log + 1 transform several of the predictors.
Near Zero Predictors
Predictors with very low variance offer little predictive power to models.
Our solution was to find the ratio of the second most frequent value to the most frequent value for each predictor, and to remove variables where this ratio was less than 0.05. This roughly translates to dropping variables where 95% or more of the values are the same.
Feature Engineering
Feature (variable or predictor) engineering is one of the most important steps in model creation. Often there is valuable information “hidden” in the predictors that is only revealed when manipulating these features in some way. Below are just some examples of the features we created:
Models and Results
Now that we have prepared our data set, we can begin training our models and use them to predict Sale Price.
We trained and tested dozens of versions of the models described below with different combinations of engineered features and processed variables. The information in the table represents our best results for each model. The table explains the pros and cons for each model type, the optimal hyperparameters found through either grid search or Bayesian optimization, our test score, and the score we received from Kaggle. Our scores the root mean square error (RMSE) of our predictions, which is a metric for describing the difference between the observed values and our predicted values for Sale Price; scores closer to zero are better.
For brevity, we will not describe the details of the different models. However, see the following links for more information about how each model is used to create predictions: random forest, gradient boost, XGBoost, elastic net regularization for regression.
Below are plots summarizing variables that contribute most to the respective model’s prediction of Sale Price.
For most models, predictors related to square footage (Area), quality (different Quality measures), and age (Year Built) have the strongest impact on each model’s predictions.
There is no visualization for our best model, which was an ensemble of four other models. The predictions for this ensembled model are calculated by averaging the predictions from the separate models (two linear regression models and two tree-based models). The idea is that each model’s predictions include error both above and below the real values, and the averaged predictions of the best models might have less overall error than any one single model.
One note is that tree-based models (random forest, gradient boosting, and XGBoosting) cannot provide an idea of positive or negative influence for each variable on Sale Price, rather they can only provide an idea of how important that variable is to the models’ predictions overall. In contrast, linear models can provide information about which variables positively and negatively influence Sale Price. For the figure immediately above, the strongest predictor, residency in a Commercial Zone, is actually negatively related to Sale Price.
Conclusions
The objective of this Kaggle competition was to build models to predict housing prices of different residences in Ames, IA. Our best model resulted in an RMSE of 0.1071, which translates to an error of about $9000 (or about 5%) for the average-priced house.
While this error is quite low, the interpretability of our model is poor. Each model found within our ensembled model varies with respect to the variables that are most important to predicting Sale Price. The best way to interpret our ensemble is to look for shared variables among its constituent models. The variables seen as most important or as strongest predictors through our models were those related to square footage, the age and condition of the home, the neighborhood where the house was located, the city zone where the house was located, and the year the house was sold.
Hi everyone! Today I want to tell you about parameters estimation of smooth functions. But before going into details, there are several things that I want to note. In this post we will discuss bias, efficiency and consistency of estimates of parameters, so I will use phrases like “efficient estimator”, implying that we are talking about some optimisation mechanism that gives efficient estimates of parameters. It is probably not obvious for people without statistical background to understand what the hell is going on and why we should care, so I decided to give a brief explanation. Although there are strict statistical definitions of the aforementioned terms (you can easily find them in Wikipedia or anywhere else), I do not want to copy-paste them here, because there are only a couple of important points worth mentioning in our context. So, let’s get started.
Bias refers to the expected difference between the estimated value of parameter (on a specific sample) and the “true” one. Having unbiased estimates of parameters is important because they should lead to more accurate forecasts (at least in theory). For example, if the estimated parameter is equal to zero, while in fact it should be 0.5, then the model would not take the provided information into account correctly and as a result will produce less accurate point forecasts and incorrect prediction intervals. In inventory this may mean that we constantly order 100 units less than needed only because the parameter is lower than it should be.
Efficiency means that if the sample size increases, then the estimated parameters will not change substantially, they will vary in a narrow range (variance of estimates will be small). In the case with inefficient estimates the increase of sample size from 50 to 51 observations may lead to the change of a parameter from 0.1 to, let’s say, 10. This is bad because the values of parameters usually influence both point forecasts and prediction intervals. As a result the inventory decision may differ radically from day to day. For example, we may decide that we urgently need 1000 units of product on Monday, and order it just to realise on Tuesday that we only need 100. Obviously this is an exaggeration, but no one wants to deal with such an erratic stocking policy, so we need to have efficient estimates of parameters.
Consistency means that our estimates of parameters will get closer to the stable values (what statisticians would refer to as “true” values) with the increase of the sample size. This is important because in the opposite case estimates of parameters will diverge and become less and less realistic. This once again influences both point forecasts and prediction intervals, which will be less meaningful than they should have been. In a way consistency means that with the increase of the sample size the parameters will become more efficient and less biased. This in turn means that the more observations we have, the better. There is a prejudice in the world of practitioners that the situation in the market changes so fast that the old observations become useless very fast. As a result many companies just through away the old data. Although, in general the statement about the market changes is true, the forecasters tend to work with the models that take this into account (e.g. Exponential smoothing, ARIMA). These models adapt to the potential changes. So, we may benefit from the old data because it allows us getting more consistent estimates of parameters. Just keep in mind, that you can always remove the annoying bits of data but you can never un-throw away the data.
Having clarified these points, we can proceed to the topic of today’s post.
We already know that the default estimator used for
smooth
functions is Mean Squared Error (for one step ahead forecast). If the residuals are distributed normally / log-normally, then the minimum of MSE gives estimates that also maximise the respective likelihood function. As a result the estimates of parameters become nice: consistent and efficient. It is also known in statistics that minimum of MSE gives mean estimates of the parameters, which means that MSE also produces unbiased estimates of parameters (if the model is specified correctly and bla-bla-bla). This works very well, when we deal with symmetric distributions of random variables. But it may perform poorly otherwise.
In this post we will use the series N1823 for our examples:
library(Mcomp) x <- ts(c(M3$N1823$x,M3$N1823$xx),frequency=frequency(M3$N1823$x))
Plot the data in order to see what we have:
plot(x)
The data seems to have slight multiplicative seasonality, which changes over the time, but it is hard to say for sure. Anyway, in order to simplify things, we will apply an ETS(A,A,N) model to this data, so we can see how the different estimators behave. We will withhold 18 observations as it is usually done for monthly data in M3.
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T)
Here’s the output:
Time elapsed: 0.08 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0.147 0.000 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 629.249 Cost function type: MSE; Cost function value: 377623.069 Information criteria: AIC AICc BIC 1703.389 1703.977 1716.800 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -14%; Bias: -74.1%; MAPE: 16.8%; SMAPE: 15.1% MASE: 0.855; sMAE: 13.4%; RelMAE: 1.047; sMSE: 2.4%
It is hard to make any reasonable conclusions from the graph and the output, but it seems that we slightly overforecast the data. At least the prediction interval covers all the values in the holdout. Relative MAE is equal to 1.047, which means that the model produced forecasts less accurate than Naive. Let’s have a look at the residuals:
qqnorm(resid(ourModel)) qqline(resid(ourModel))
The residuals of this model do not look normal, a lot of empirical quantiles a far from the theoretical ones. If we conduct Shapiro-Wilk test, then we will have to reject the hypothesis of normality for the residuals on 5%:
shapiro.test(resid(ourModel)) > p-value = 0.001223
This may indicate that other estimators may do a better job. And there is a magical parameter
cfType
in the
smooth
functions which allows to estimate models differently. It controls what to use and how to use it. You can select the following estimators instead of MSE:
\begin{equation} \label{eq:MAE}
\text{MAE} = \frac{1}{T} \sum_{t=1}^T |e_{t+1}|
\end{equation}
The minimum of MAE gives median estimates of the parameters. MAE is considered to be a more robust estimator than MSE. If you have asymmetric distribution, give MAE a try. It gives consistent, but not efficient estimates of parameters. Asymptotically, if the distribution of the residuals is normal, the estimators of MAE converge to the estimators of MSE (which follows from the connection between mean and median of normal distribution).
Let’s see what happens with the same model, on the same data when we use MAE:
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="MAE")
Time elapsed: 0.09 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0.101 0.000 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 636.546 Cost function type: MAE; Cost function value: 462.675 Information criteria: AIC AICc BIC 1705.879 1706.468 1719.290 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -5.1%; Bias: -32.1%; MAPE: 12.9%; SMAPE: 12.4% MASE: 0.688; sMAE: 10.7%; RelMAE: 0.842; sMSE: 1.5%
There are several things to note from the graph and the output. First, the smoothing parameter alpha is smaller than in the previous case. Second, Relative MAE is smaller than one, which means that model in this case outperformed Naive. Comparing this value with the one from the previous model, we can conclude that MAE worked well as an estimator of parameters for this data. Finally, the graph shows that point forecasts go through the middle of the holdout sample, which is reflected with lower values of error measures. The residuals are still not normally distributed, but this is expected, because they won’t become normal just because we used a different estimator:
\begin{equation} \label{eq:HAM}
\text{HAM} = \frac{1}{T} \sum_{t=1}^T \sqrt{|e_{t+1}|}
\end{equation}
This is even more robust estimator than MAE. On count data its minimum corresponds to the mode of data. In case of continuous data the minimum of this estimator corresponds to something not yet well studied, between mode and median. The paper about this thing is currently in a draft stage, but you can already give it a try, if you want. This is also consistent, but not efficient estimator.
The same example, the same model, but HAM as an estimator:
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="HAM")
Time elapsed: 0.06 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0.001 0.001 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 666.439 Cost function type: HAM; Cost function value: 19.67 Information criteria: AIC AICc BIC 1715.792 1716.381 1729.203 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -1.7%; Bias: -14.1%; MAPE: 11.4%; SMAPE: 11.4% MASE: 0.63; sMAE: 9.8%; RelMAE: 0.772; sMSE: 1.3%
This estimator produced even more accurate forecasts in this example, forcing smoothing parameters to become close to zero. Note that the residuals standard deviation in case of HAM is larger than in case of MAE which in its turn is larger than in case of MSE. This means that one-step-ahead parametric and semiparametric prediction intervals will be wider in case of HAM than in case of MAE, than in case of MSE. However, taking that the smoothing parameters in the last model are close to zero, the multiple steps ahead prediction intervals of HAM may be narrower than the ones of MSE.
Finally, it is worth noting that the optimisation of models using different estimators takes different time. MSE is the slowest, while HAM is the fastest estimator. I don’t have any detailed explanation of this, but this obviously happens because of the form of the cost functions surfaces. So if you are in a hurry and need to estimate something somehow, you can give HAM a try. Just remember that information criteria may become inapplicable in this case.
While these three estimators above are calculated based on the one-step-ahead forecast error, the next three are based on multiple steps ahead estimators. They can be useful if you want to have a more stable and “conservative” model (a paper on this topic is currently in the final stage). Prior to v2.2.1 these estimators had different names, be aware!
\begin{equation} \label{eq:MSEh}
\text{MSE}_h = \frac{1}{T} \sum_{t=1}^T e_{t+h}^2
\end{equation}
The idea of this estimator is very simple: if you are interested in 5 steps ahead forecasts, then optimise over this horizon, not one-step-ahead. However, by using this estimator, we shrink the smoothing parameters towards zero, forcing the model to become closer to the deterministic and robust to outliers. This applies both to ETS and ARIMA, but the models behave slightly differently. The effect of shrinkage increases with the increase of \(h\). The forecasts accuracy may increase for that specific horizon, but it almost surely will decrease for all the other horizons. Keep in mind that this is in general not efficient and biased estimator with a much slower convergence to the true value than the one-step-ahead estimators. This estimator is eventually consistent, but it may need a very large sample to become one. This means that this estimator may result in values of parameters very close to zero even if they are not really needed for the data. I personally would advise using this thing on large samples (for instance, on high frequency data). By the way, Nikos Kourentzes, Rebecca Killick and I are working on a paper on that topic, so stay tuned.
Here’s what happens when we use this estimator:
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="MSEh")
Time elapsed: 0.24 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0 0 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 657.781 Cost function type: MSEh; Cost function value: 550179.34 Information criteria: AIC AICc BIC 30393.86 30404.45 30635.25 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -10.4%; Bias: -62%; MAPE: 14.9%; SMAPE: 13.8% MASE: 0.772; sMAE: 12.1%; RelMAE: 0.945; sMSE: 1.8%
As you can see, the smoothing parameters are now equal to zero, which gives us the straight line going through all the data. If we had 1008 observations instead of 108, the parameters would not be shrunk to zero, because the model would need to adapt to changes in order to minimise the respective cost function.
The need for having a specific 5 steps ahead forecast is not common, so it makes sense to work with something that deals with one to h steps ahead:
\begin{equation} \label{TMSE}
\text{TMSE} = \sum_{j=1}^h \frac{1}{T} \sum_{t=1}^T e_{t+j}^2
\end{equation}
This estimator is more reasonable than MSE\(_h\) because it takes into account all the errors from one to h-steps-ahead forecasts. This is a desired behaviour in inventory management, because we are not so much interested in how much we will sell tomorrow or next Monday, but rather how much we will sell starting from tomorrow till the next Monday. However, the variance of forecast errors h-steps-ahead is usually larger than the variance of one-step-ahead errors (because of the increasing uncertainty), which leads to the effect of “masking”: the latter is hidden behind the former. As a result if we use TMSE as the estimator, the final values are seriously influenced by the long term errors than the short term ones (see Taieb and Atiya, 2015 paper). This estimator is not recommended if short-term forecasts are more important than long term ones. Plus, this is still less efficient and more biased estimator than one-step-ahead estimators, with slow convergence to the true values, similar to MSE\(_h\), but slightly better.
This is what happens in our example:
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="TMSE")
Time elapsed: 0.2 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0.075 0.000 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 633.48 Cost function type: TMSE; Cost function value: 7477097.717 Information criteria: AIC AICc BIC 30394.36 30404.94 30635.75 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -7.5%; Bias: -48.9%; MAPE: 13.4%; SMAPE: 12.6% MASE: 0.704; sMAE: 11%; RelMAE: 0.862; sMSE: 1.5%
Comparing the model estimated using TMSE with the same one estimated using MSE and MSE\(_h\), it is worth noting that the smoothing parameters in this model are greater than in case of MSE\(_h\), but less than in case of MSE. This demonstrates that there is a shrinkage effect in TMSE, forcing the parameters towards zero, but the inclusion of one-step-ahead errors makes model slightly more flexible than in case of MSE\(_h\). Still, it is advised to use this estimator on large samples, where the estimates of parameters become more efficient and less biased.
This is similar to TMSE, but derived from the so called Trace Forecast Likelihood (which I may discuss at some point in one of the future posts). The idea here is to take logarithms of each MSE\(_j\) and then sum them up:
\begin{equation} \label{eq:GTMSE}
\text{GTMSE} = \sum_{j=1}^h \log \left( \frac{1}{T} \sum_{t=1}^T e_{t+j}^2 \right)
\end{equation}
Logarithms make variances of errors on several steps ahead closer to each other. For example, if the variance of one-step-ahead error is equal to 100 and the variance of 10 steps ahead error is equal to 1000, then their logarithms will be 4.6 and 6.9 respectively. As a result when GTMSE is used as an estimator, the model will take into account both short and long term errors. So this is a more balanced estimator of parameters than MSE\(_h\) and TMSE. This estimator is more efficient than both TMSE and MSE\(_j\) because of the log-scale and converges to true values faster than the previous two, but still can be biased on small samples.
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="GTMSE")
Time elapsed: 0.18 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0 0 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 649.253 Cost function type: GTMSE; Cost function value: 232.419 Information criteria: AIC AICc BIC 30402.77 30413.36 30644.16 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -8.2%; Bias: -53.8%; MAPE: 13.8%; SMAPE: 12.9% MASE: 0.72; sMAE: 11.3%; RelMAE: 0.882; sMSE: 1.6%
In our example GTMSE shrinks both smoothing parameters towards zero and makes the model deterministic, which corresponds to MSE\(_h\). However the initial values are slightly different, which leads to slightly different forecasts. Once again, it is advised to use this estimator on large samples.
Keep in mind that all those multiple steps ahead estimators take more time for the calculation, because the model needs to produce h-steps-ahead forecasts from each observation in sample.
There is also a non-documented feature in
smooth
functions (currently available only for pure additive models) – analytical versions of multiple steps ahead estimators. In order to use it, we need to add “a” in front of the desired estimator: aMSE\(_h\), aTMSE, aGTMSE. In this case only one-step-ahead forecast error is produced and after that the structure of the applied state-space model is used in order to reconstruct multiple steps ahead estimators. This feature is useful if you want to use the multiple steps ahead estimator on small samples, where the multi-steps errors cannot be calculated appropriately. It is also useful in cases of large samples, when the time of estimation is important.
These estimators have similar properties to their empirical counterparts, but work faster and are based on asymptotic properties. Here is an example of analytical MSE\(_h\) estimator:
ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="aMSEh")
Time elapsed: 0.11 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0 0 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 627.818 Cost function type: aMSEh; Cost function value: 375907.976 Information criteria: AIC AICc BIC 30652.15 30662.74 30893.55 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -1.9%; Bias: -14.6%; MAPE: 11.7%; SMAPE: 11.6% MASE: 0.643; sMAE: 10%; RelMAE: 0.787; sMSE: 1.3%
The resulting smoothing parameters are shrunk towards zero, similar to MSE\(_h\), but the initial values are slightly different, which leads to different forecasts. Note that the time elapsed in this case is 0.11 seconds instead of 0.24 as in MSE\(_h\). The difference in time may increase with the increase of sample size and forecasting horizon.
smooth
functions (e.g. MAE\(_h\) and THAM). However, they are currently implemented mainly “because I can” and for fun, so I cannot give you any recommendations about them.
Now that we discussed all the possible estimators that you can use with
smooth
, you are most probably confused and completely lost. The question that may naturally appear after you have read this post is “What should I do?” Frankly speaking, I cannot give you appropriate answer and a set of universal recommendations, because this is still under-researched problem. However, I have some advice.
First, Nikos Kourentzes and Juan Ramon Trapero found that in case of high frequency data (they used solar irradiation data) using MSE\(_h\) and TMSE leads to the increase in forecasting accuracy in comparison with the conventional MSE. However in order to achieve good accuracy in case of MSE\(_h\), you need to estimate \(h\) separate models, while with TMSE you need to estimate only one. So, TMSE is faster than MSE\(_h\), but at the same time leads to at least as accurate forecasts as in case of MSE\(_h\) for all the steps from 1 to h.
Second, if you have asymmetrically distributed residuals in the model after using MSE, give MAE or HAM a try – they may improve your model and its accuracy.
Third, analytical counterparts of multi-step estimators can be useful in one of the two situations: 1. When you deal with very large samples (e.g. high frequency data), want to use advanced estimation methods, but want them to work fast. 2. When you work with small sample, but want to use the properties of these estimators anyway.
Finally, don’t use MSE\(_h\), TMSE and GTMSE if you are interested in the values of parameters of models – they will almost surely be inefficient and biased. This applies to both ETS and ARIMA models, which will become close to their deterministic counterparts in this case. Use conventional MSE instead.
Recently, Dataiku 4.1.0 was released, it now offers much more support for R users. But wait a minute, Data-what? I guess some of you do not know Dataiku, so what is Dataiku in the first place? It is a collaborative data science platform created to design and run data products at scale. The main themes of the product are:
Collaboration & Orchestration: A data science project often involves a team of people with different skills and interests. To name a few, we have data engineers, data scientists, business analysts, business stakeholders, hardcore coders, R users and Python users. Dataiku provides a platform to accommodate the needs of these different roles to work together on data science projects.
Productivity: Whether you like hard core coding or are more GUI oriented, the platform offers an environment for both. A flow interface can handle most of the steps needed in a data science project, and this can be enriched by Python or R recipes. Moreover, a managed notebook environment is integrated in Dataiku to do whatever you want with code.
Deployment of data science products: As a data scientist you can produce many interesting stuff, i.e. graphs, data transformations, analysis, predictive models. The Dataiku platform facilitates the deployment of these deliverables, so that others (in your organization) can consume them. There are dashboards, web-apps, model API’s, productionized model API’s and data pipelines.
There is a free version which contains already a lot of features to be very useful, and there is an paid version, with “enterprise features“. See for the Dataiku website for more info.
Among many new features, and the one that interests me the most as an R user, is the improved support for R. In previous versions of Dataiku there was already some support for R, this version has the following improvements. There is now support for:
In Dataiku you can now create so-called code environments for R (and Python). A code environment is a standalone and self-contained environment to run R code. Each environment can have its own set of packages (and specific versions of packages). Dataiku provides a handy GUI to manage different code environments. The figure below shows an example code environment with specific packages.
In Dataiku whenever you make use of R –> in R recipes, Shiny, R Markdown or creating R API’s you can select a specific R code environment to use.
If you are working in RStudio you most likely already know R Markdown documents and Shiny applications. In this version, you can also create them in Dataiku. Now, why would you do that and not just create them in RStudio? Well, the reports and shiny apps become part of the Dataiku environment and so:
The figure above shows a R markdown report in Dataiku, the interface provides a nice way to edit the report, alter settings and publish the report. Below is an example dashboard in Dataiku with a markdown and Shiny report.
Once you created an analytical model in R, you want to deploy it and make use of its predictions. With Dataiku you can now easily expose R prediction models as an API. In fact, you can expose any R function as an API. The Dataiku GUI provides an environment where you can easily set up and test an R API’s. Moreover, The Dataiku API Node, which can be installed on a (separate) production server imports the R models that you have created in the GUI and can take care of load balancing, high availability and scaling of real-time scoring.
The following three figures give you an overview of how easy it is to work with the R API functionality.
First, define an API endpoint and R (prediction) function.
Then, define the R function, it can make use of data in Dataiku, R objects created earlier or any R library you need.
Then, test and deploy the R function. Dataiku provides a handy interface to test your function/API.
Finally, once you are satisfied with the R API you can make a package of the API, that package can then be imported on a production server with Dataiku API node installed. From which you can then serve API requests.
The new Dataiku 4.1.0 version has a lot to offer for anyone involved in a data science project. The system already has a wide range support for Python, but now with the improved support for R, the system is even more useful to a very large group of data scientists.
Cheers, Longhow.
Customer churn occurs when customers or subscribers stop doing business with a company or service, also known as customer attrition. It is also referred as loss of clients or customers. One industry in which churn rates are particularly useful is the telecommunications industry, because most customers have multiple options from which to choose within a geographic location.
Similar concept with predicting employee turnover, we are going to predict customer churn using telecom dataset. We will introduce Logistic Regression, Decision Tree, and Random Forest. But this time, we will do all of the above in R. Let’s get started!
The data was downloaded from IBM Sample Data Sets. Each row represents a customer, each column contains that customer’s attributes:
library(plyr) library(corrplot) library(ggplot2) library(gridExtra) library(ggthemes) library(caret) library(MASS) library(randomForest) library(party) churn <- read.csv('Telco-Customer-Churn.csv') str(churn) 'data.frame': 7043 obs. of 21 variables: $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ... $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ... $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ... $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ... $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ... $ tenure : int 1 34 2 45 2 8 22 10 28 62 ... $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ... $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ... $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ... $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ... $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ... $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ... $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ... $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ... $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ... $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ... $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ... $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ... $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ... $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ... $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
The raw data contains 7043 rows (customers) and 21 columns (features). The “Churn” column is our target.
We use sapply to check the number if missing values in each columns. We found that there are 11 missing values in “TotalCharges” columns. So, let’s remove all rows with missing values.
sapply(churn, function(x) sum(is.na(x))) customerID gender SeniorCitizen Partner 0 0 0 0 Dependents tenure PhoneService MultipleLines 0 0 0 0 InternetService OnlineSecurity OnlineBackup DeviceProtection 0 0 0 0 TechSupport StreamingTV StreamingMovies Contract 0 0 0 0 PaperlessBilling PaymentMethod MonthlyCharges TotalCharges 0 0 0 11 Churn 0
churn <- churn[complete.cases(churn), ]
1. We will change “No internet service” to “No” for six columns, they are: “OnlineSecurity”, “OnlineBackup”, “DeviceProtection”, “TechSupport”, “streamingTV”, “streamingMovies”.
cols_recode1 <- c(10:15) for(i in 1:ncol(churn[,cols_recode1])) { churn[,cols_recode1][,i] <- as.factor(mapvalues (churn[,cols_recode1][,i], from =c("No internet service"),to=c("No"))) }
2. We will change “No phone service” to “No” for column “MultipleLines”
churn$MultipleLines <- as.factor(mapvalues(churn$MultipleLines, from=c("No phone service"), to=c("No")))
3. Since the minimum tenure is 1 month and maximum tenure is 72 months, we can group them into five tenure groups: “0–12 Month”, “12–24 Month”, “24–48 Months”, “48–60 Month”, “> 60 Month”
min(churn$tenure); max(churn$tenure) [1] 1 [1] 72
group_tenure = 0 & tenure 12 & tenure 24 & tenure 48 & tenure 60){ return('> 60 Month') } } churn$tenure_group <- sapply(churn$tenure,group_tenure) churn$tenure_group <- as.factor(churn$tenure_group)
4. Change the values in column “SeniorCitizen” from 0 or 1 to “No” or “Yes”.
churn$SeniorCitizen <- as.factor(mapvalues(churn$SeniorCitizen, from=c("0","1"), to=c("No", "Yes")))
5. Remove the columns we do not need for the analysis.
churn$customerID <- NULL churn$tenure <- NULL
Correlation between numeric variables
numeric.var <- sapply(churn, is.numeric) corr.matrix <- cor(churn[,numeric.var]) corrplot(corr.matrix, main="\n\nCorrelation Plot for Numerical Variables", method="number")
Gives this plot:
The Monthly Charges and Total Charges are correlated. So one of them will be removed from the model. We remove Total Charges.
churn$TotalCharges <- NULL
Bar plots of categorical variables
p1 <- ggplot(churn, aes(x=gender)) + ggtitle("Gender") + xlab("Gender") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p2 <- ggplot(churn, aes(x=SeniorCitizen)) + ggtitle("Senior Citizen") + xlab("Senior Citizen") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p3 <- ggplot(churn, aes(x=Partner)) + ggtitle("Partner") + xlab("Partner") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p4 <- ggplot(churn, aes(x=Dependents)) + ggtitle("Dependents") + xlab("Dependents") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() grid.arrange(p1, p2, p3, p4, ncol=2)
p5 <- ggplot(churn, aes(x=PhoneService)) + ggtitle("Phone Service") + xlab("Phone Service") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p6 <- ggplot(churn, aes(x=MultipleLines)) + ggtitle("Multiple Lines") + xlab("Multiple Lines") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p7 <- ggplot(churn, aes(x=InternetService)) + ggtitle("Internet Service") + xlab("Internet Service") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p8 <- ggplot(churn, aes(x=OnlineSecurity)) + ggtitle("Online Security") + xlab("Online Security") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() grid.arrange(p5, p6, p7, p8, ncol=2)
p9 <- ggplot(churn, aes(x=OnlineBackup)) + ggtitle("Online Backup") + xlab("Online Backup") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p10 <- ggplot(churn, aes(x=DeviceProtection)) + ggtitle("Device Protection") + xlab("Device Protection") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p11 <- ggplot(churn, aes(x=TechSupport)) + ggtitle("Tech Support") + xlab("Tech Support") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p12 <- ggplot(churn, aes(x=StreamingTV)) + ggtitle("Streaming TV") + xlab("Streaming TV") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() grid.arrange(p9, p10, p11, p12, ncol=2)
p13 <- ggplot(churn, aes(x=StreamingMovies)) + ggtitle("Streaming Movies") + xlab("Streaming Movies") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p14 <- ggplot(churn, aes(x=Contract)) + ggtitle("Contract") + xlab("Contract") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p15 <- ggplot(churn, aes(x=PaperlessBilling)) + ggtitle("Paperless Billing") + xlab("Paperless Billing") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p16 <- ggplot(churn, aes(x=PaymentMethod)) + ggtitle("Payment Method") + xlab("Payment Method") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p17 <- ggplot(churn, aes(x=tenure_group)) + ggtitle("Tenure Group") + xlab("Tenure Group") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() grid.arrange(p13, p14, p15, p16, p17, ncol=2)
All of the categorical variables seem to have a reasonably broad distribution, therefore, all of them will be kept for the further analysis.
First, we split the data into training and testing sets
intrain<- createDataPartition(churn$Churn,p=0.7,list=FALSE) set.seed(2017) training<- churn[intrain,] testing<- churn[-intrain,]
Confirm the splitting is correct
dim(training); dim(testing) [1] 4924 19 [1] 2108 19
Fitting the Logistic Regression Model
LogModel <- glm(Churn ~ .,family=binomial(link="logit"),data=training) print(summary(LogModel)) Call: glm(formula = Churn ~ ., family = binomial(link = "logit"), data = training) Deviance Residuals: Min 1Q Median 3Q Max -2.0370 -0.6793 -0.2861 0.6590 3.1608 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.030149 1.008223 -2.014 0.044053 * genderMale -0.039006 0.077686 -0.502 0.615603 SeniorCitizenYes 0.194408 0.101151 1.922 0.054611 . PartnerYes -0.062031 0.092911 -0.668 0.504363 DependentsYes -0.017974 0.107659 -0.167 0.867405 PhoneServiceYes -0.387097 0.788745 -0.491 0.623585 MultipleLinesYes 0.345052 0.214799 1.606 0.108187 InternetServiceFiber optic 1.022836 0.968062 1.057 0.290703 InternetServiceNo -0.829521 0.978909 -0.847 0.396776 OnlineSecurityYes -0.393647 0.215993 -1.823 0.068379 . OnlineBackupYes -0.113220 0.213825 -0.529 0.596460 DeviceProtectionYes -0.025797 0.213317 -0.121 0.903744 TechSupportYes -0.306423 0.220920 -1.387 0.165433 StreamingTVYes 0.399209 0.395912 1.008 0.313297 StreamingMoviesYes 0.338852 0.395890 0.856 0.392040 ContractOne year -0.805584 0.127125 -6.337 2.34e-10 *** ContractTwo year -1.662793 0.216346 -7.686 1.52e-14 *** PaperlessBillingYes 0.338724 0.089407 3.789 0.000152 *** PaymentMethodCredit card (automatic) -0.018574 0.135318 -0.137 0.890826 PaymentMethodElectronic check 0.373214 0.113029 3.302 0.000960 *** PaymentMethodMailed check 0.001400 0.136446 0.010 0.991815 MonthlyCharges -0.005001 0.038558 -0.130 0.896797 tenure_group0-12 Month 1.858899 0.205956 9.026 < 2e-16 *** tenure_group12-24 Month 0.968497 0.201452 4.808 1.53e-06 *** tenure_group24-48 Month 0.574822 0.185500 3.099 0.001943 ** tenure_group48-60 Month 0.311790 0.200096 1.558 0.119185 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 5702.8 on 4923 degrees of freedom Residual deviance: 4094.4 on 4898 degrees of freedom AIC: 4146.4 Number of Fisher Scoring iterations: 6
Feature Analysis
The top three most-relevant features include Contract, tenure_group and PaperlessBilling.
anova(LogModel, test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: Churn Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 4923 5702.8 gender 1 0.39 4922 5702.4 0.5318602 SeniorCitizen 1 95.08 4921 5607.3 < 2.2e-16 *** Partner 1 107.29 4920 5500.0 < 2.2e-16 *** Dependents 1 27.26 4919 5472.7 1.775e-07 *** PhoneService 1 1.27 4918 5471.5 0.2597501 MultipleLines 1 9.63 4917 5461.8 0.0019177 ** InternetService 2 452.01 4915 5009.8 < 2.2e-16 *** OnlineSecurity 1 183.83 4914 4826.0 < 2.2e-16 *** OnlineBackup 1 69.94 4913 4756.1 < 2.2e-16 *** DeviceProtection 1 47.58 4912 4708.5 5.287e-12 *** TechSupport 1 82.78 4911 4625.7 < 2.2e-16 *** StreamingTV 1 4.90 4910 4620.8 0.0269174 * StreamingMovies 1 0.36 4909 4620.4 0.5461056 Contract 2 309.25 4907 4311.2 < 2.2e-16 *** PaperlessBilling 1 14.21 4906 4297.0 0.0001638 *** PaymentMethod 3 38.85 4903 4258.1 1.867e-08 *** MonthlyCharges 1 0.10 4902 4258.0 0.7491553 tenure_group 4 163.67 4898 4094.4 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Analyzing the deviance table we can see the drop in deviance when adding each variable one at a time. Adding InternetService, Contract and tenure_group significantly reduces the residual deviance. The other variables such as PaymentMethod and Dependents seem to improve the model less even though they all have low p-values.
Assessing the predictive ability of the Logistic Regression model
testing$Churn <- as.character(testing$Churn) testing$Churn[testing$Churn=="No"] <- "0" testing$Churn[testing$Churn=="Yes"] <- "1" fitted.results <- predict(LogModel,newdata=testing,type='response') fitted.results 0.5,1,0) misClasificError <- mean(fitted.results != testing$Churn) print(paste('Logistic Regression Accuracy',1-misClasificError)) [1] "Logistic Regression Accuracy 0.796489563567362"
Logistic Regression Confusion Matrix
print("Confusion Matrix for Logistic Regression"); table(testing$Churn, fitted.results > 0.5) [1] "Confusion Matrix for Logistic Regression" FALSE TRUE 0 1392 156 1 273 287
Odds Ratio
One of the interesting performance measurements in logistic regression is Odds Ratio.Basically, Odds ratio is what the odds of an event is happening.
exp(cbind(OR=coef(LogModel), confint(LogModel))) Waiting for profiling to be done... OR 2.5 % 97.5 % (Intercept) 0.1313160 0.01815817 0.9461855 genderMale 0.9617454 0.82587632 1.1199399 SeniorCitizenYes 1.2145919 0.99591418 1.4807053 PartnerYes 0.9398537 0.78338071 1.1276774 DependentsYes 0.9821863 0.79488224 1.2124072 PhoneServiceYes 0.6790251 0.14466019 3.1878587 MultipleLinesYes 1.4120635 0.92707245 2.1522692 InternetServiceFiber optic 2.7810695 0.41762316 18.5910286 InternetServiceNo 0.4362582 0.06397364 2.9715699 OnlineSecurityYes 0.6745919 0.44144273 1.0296515 OnlineBackupYes 0.8929545 0.58709919 1.3577947 DeviceProtectionYes 0.9745328 0.64144877 1.4805460 TechSupportYes 0.7360754 0.47707096 1.1344691 StreamingTVYes 1.4906458 0.68637788 3.2416264 StreamingMoviesYes 1.4033353 0.64624171 3.0518161 ContractOne year 0.4468271 0.34725066 0.5717469 ContractTwo year 0.1896086 0.12230199 0.2861341 PaperlessBillingYes 1.4031556 1.17798691 1.6725920 PaymentMethodCredit card (automatic) 0.9815977 0.75273387 1.2797506 PaymentMethodElectronic check 1.4523952 1.16480721 1.8145076 PaymentMethodMailed check 1.0014007 0.76673087 1.3092444 MonthlyCharges 0.9950112 0.92252949 1.0731016 tenure_group0-12 Month 6.4166692 4.30371945 9.6544837 tenure_group12-24 Month 2.6339823 1.78095906 3.9256133 tenure_group24-48 Month 1.7768147 1.23988035 2.5676783 tenure_group48-60 Month 1.3658675 0.92383315 2.0261505
Decision Tree visualization
For illustration purpose, we are going to use only three variables for plotting Decision Trees, they are “Contract”, “tenure_group” and “PaperlessBilling”.
tree <- ctree(Churn~Contract+tenure_group+PaperlessBilling, training) plot(tree, type='simple')
Gives this plot:
1. Out of three variables we use, Contract is the most important variable to predict customer churn or not churn.
2. If a customer in a one-year or two-year contract, no matter he (she) has PapelessBilling or not, he (she) is less likely to churn.
3. On the other hand, if a customer is in a month-to-month contract, and in the tenure group of 0–12 month, and using PaperlessBilling, then this customer is more likely to churn.
Decision Tree Confusion Matrix
We are using all the variables to product confusion matrix table and make predictions.
pred_tree <- predict(tree, testing) print("Confusion Matrix for Decision Tree"); table(Predicted = pred_tree, Actual = testing$Churn) [1] "Confusion Matrix for Decision Tree" Actual Predicted No Yes No 1395 346 Yes 153 214
Decision Tree Accuracy
p1 <- predict(tree, training) tab1 <- table(Predicted = p1, Actual = training$Churn) tab2 <- table(Predicted = pred_tree, Actual = testing$Churn) print(paste('Decision Tree Accuracy',sum(diag(tab2))/sum(tab2))) [1] "Decision Tree Accuracy 0.763282732447818"
The accuracy for Decision Tree has hardly improved. Let’s see if we can do better using Random Forest.
Random Forest Initial Model
rfModel <- randomForest(Churn ~., data = training) print(rfModel) Call: randomForest(formula = Churn ~ ., data = training) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 4 OOB estimate of error rate: 20.65% Confusion matrix: No Yes class.error No 3245 370 0.1023513 Yes 647 662 0.4942704
The error rate is relatively low when predicting “No”, and the error rate is much higher when predicting “Yes”.
Random Forest Prediction and Confusion Matrix
pred_rf <- predict(rfModel, testing) caret::confusionMatrix(pred_rf, testing$Churn) Confusion Matrix and Statistics Reference Prediction No Yes No 1381 281 Yes 167 279 Accuracy : 0.7875 95% CI : (0.7694, 0.8048) No Information Rate : 0.7343 P-Value [Acc > NIR] : 9.284e-09 Kappa : 0.4175 Mcnemar's Test P-Value : 9.359e-08 Sensitivity : 0.8921 Specificity : 0.4982 Pos Pred Value : 0.8309 Neg Pred Value : 0.6256 Prevalence : 0.7343 Detection Rate : 0.6551 Detection Prevalence : 0.7884 Balanced Accuracy : 0.6952 'Positive' Class : No
Random Forest Error Rate
plot(rfModel)
Gives this plot:
We use this plot to help us determine the number of trees. As the number of trees increases, the OOB error rate decreases, and then becomes almost constant. We are not able to decrease the OOB error rate after about 100 to 200 trees.
Tune Random Forest Model
t <- tuneRF(training[, -18], training[, 18], stepFactor = 0.5, plot = TRUE, ntreeTry = 200, trace = TRUE, improve = 0.05)
Gives this plot:
We use this plot to give us some ideas on the number of mtry to choose. OOB error rate is at the lowest when mtry is 2. Therefore, we choose mtry=2.
Fit the Random Forest Model After Tuning
rfModel_new <- randomForest(Churn ~., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE) print(rfModel_new) Call: randomForest(formula = Churn ~ ., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE) Type of random forest: classification Number of trees: 200 No. of variables tried at each split: 2 OOB estimate of error rate: 19.7% Confusion matrix: No Yes class.error No 3301 314 0.0868603 Yes 656 653 0.5011459
OOB error rate decreased to 19.7% from 20.65% earlier.
Random Forest Predictions and Confusion Matrix After Tuning
pred_rf_new <- predict(rfModel_new, testing) caret::confusionMatrix(pred_rf_new, testing$Churn) Confusion Matrix and Statistics Reference Prediction No Yes No 1404 305 Yes 144 255 Accuracy : 0.787 95% CI : (0.7689, 0.8043) No Information Rate : 0.7343 P-Value [Acc > NIR] : 1.252e-08 Kappa : 0.3989 Mcnemar's Test P-Value : 4.324e-14 Sensitivity : 0.9070 Specificity : 0.4554 Pos Pred Value : 0.8215 Neg Pred Value : 0.6391 Prevalence : 0.7343 Detection Rate : 0.6660 Detection Prevalence : 0.8107 Balanced Accuracy : 0.6812 'Positive' Class : No
The accuracy did not increase but the sensitivity improved, compare with the initial Random Forest model.
Random Forest Feature Importance
varImpPlot(rfModel_new, sort=T, n.var = 10, main = 'Top 10 Feature Importance')
From the above example, we can see that Logistic Regression and Random Forest performed better than Decision Tree for customer churn analysis for this particular dataset.
Throughout the analysis, I have learned several important things:
1. Features such as tenure_group, Contract, PaperlessBilling, MonthlyCharges and InternetService appear to play a role in customer churn.
2. There does not seem to be a relationship between gender and churn.
3. Customers in a month-to-month contract, with PaperlessBilling and are within 12 months tenure, are more likely to churn; On the other hand, customers with one or two year contract, with longer than 12 months tenure, that are not using PaperlessBilling, are less likely to churn.
Source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.
Related Post
Just visit this link and post a new R job to the R community.
You can post a job for free (and there are also “featured job” options available for extra exposure).
Job seekers: please follow the links below to learn more and apply for your R job of interest:
In R-users.com you can see all the R jobs that are currently available.
R-users Resumes
R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.
(you may also look at previous R jobs posts ).
]]>