Predict Bounce Rate based on Page Load Time in Google Analytics

September 26, 2012
By Amar Gondaliya

(This article was first published on Tatvic Blog » R, and kindly contributed to R-bloggers)

Welcome to the second part of this series. In the last blog post, Linear Regression with R (http://www.tatvic.com/blog/linear-regression-using-r/), we discussed what regression is and how it is used. Now we will apply that to a specific prediction problem. In this post, I will create a basic model to predict bounce rate as a function of page load time components. In the next post, I will share how to improve the model and its predictions.

We know that bounce rate is important for a web site. Here, we want to identify the relationships between bounce rate and the time components of a web page (e.g. average page download time, average page load time, average server response time) and how much these time components affect bounce rate. For this problem, we collected data for various web sites from Google Analytics. The data set contains the following parameters.

  1. x_id – Id of the page
  2. ismobile – whether the page was visited from a mobile device
  3. Country
  4. pagePath
  5. pageTitle
  6. avgServerResponseTime
  7. avgServerConnectionTime
  8. avgRedirectionTime
  9. avgPageDownloadTime
  10. avgDomainLookupTime
  11. avgPageLoadTime
  12. entrances
  13. pageviews
  14. exits
  15. bounces
Each parameter is tracked for a single page. We have 8488 rows in the data set, and we calculated the bounce rate for each page as below.

Bounce rate = (bounces / entrances) * 100
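As a minimal sketch of this calculation in R, the snippet below applies the formula to a small made-up data frame. The `pages` data frame, its values and the `bounce_rate` helper are assumptions for illustration only; they are not taken from the original data set. Returning `NA` for pages with zero entrances is also a choice made here, not something the post specifies.

```r
# Hypothetical page-level counts; the values are made up for illustration.
pages <- data.frame(
  entrances = c(200, 50, 0, 80),
  bounces   = c(90, 10, 0, 80)
)

# Bounce rate as a percentage. Pages with no entrances get NA
# instead of the NaN that 0 / 0 would produce.
bounce_rate <- function(bounces, entrances) {
  ifelse(entrances > 0, bounces / entrances * 100, NA_real_)
}

pages$bouncerate <- bounce_rate(pages$bounces, pages$entrances)
print(pages)
```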

Here, we want to know the impact of average server response time, average server connection time, average redirection time, average domain lookup time, average page download time and average page load time on the bounce rate. So we rearranged the data set: we removed x_id, country, page path, page title, entrances, page views, exits and bounces, and appended bouncerate after calculating it. The data set now contains the following parameters.

  1. bouncerate
  2. avgServerResponseTime
  3. avgServerConnectionTime
  4. avgRedirectionTime
  5. avgPageDownloadTime
  6. avgDomainLookupTime
  7. avgPageLoadTime
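The rearrangement described above could be sketched in R roughly as follows. The `raw` data frame and its values are hypothetical stand-ins for the Google Analytics export, and `model_data` is a name chosen here; only the column names come from the post.

```r
# Hypothetical stand-in for the exported data (made-up values).
raw <- data.frame(
  x_id                    = 1:3,
  pagePath                = c("/", "/a", "/b"),
  avgServerResponseTime   = c(0.4, 1.2, 0.8),
  avgServerConnectionTime = c(0.05, 0.10, 0.02),
  avgRedirectionTime      = c(0.0, 0.3, 0.1),
  avgPageDownloadTime     = c(0.2, 0.9, 0.4),
  avgDomainLookupTime     = c(0.01, 0.06, 0.03),
  avgPageLoadTime         = c(3.1, 7.5, 5.0),
  entrances               = c(100, 40, 60),
  bounces                 = c(45, 30, 15),
  stringsAsFactors        = FALSE
)

# Append the bounce rate, then keep only the response and the time components.
raw$bouncerate <- raw$bounces / raw$entrances * 100
keep <- c("bouncerate", "avgServerResponseTime", "avgServerConnectionTime",
          "avgRedirectionTime", "avgPageDownloadTime",
          "avgDomainLookupTime", "avgPageLoadTime")
model_data <- raw[, keep]

str(model_data)
```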
Let's use regression on this data set. We want to identify how bounce rate depends on the time components, so we treat bouncerate as the dependent variable and the rest of the parameters in the data set as independent variables. The regression model for our data set in R is as below.

> model_1 <- lm(bouncerate ~ avgServerResponseTime + avgServerConnectionTime + avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime + avgPageLoadTime)
We have generated the model, but we are interested in the relationships between bounce rate and the time components. Let's check the summary of the model.

> summary(model_1)
Output
Call:
lm(formula = bouncerate ~ avgServerResponseTime + avgServerConnectionTime +
    avgRedirectionTime + avgPageDownloadTime + avgDomainLookupTime +
    avgPageLoadTime)
Residuals:
    Min      1Q  Median      3Q     Max
-98.276 -19.816  -1.169  19.805 107.705
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             49.10686    0.32862 149.435  < 2e-16 ***
avgServerResponseTime   -0.85724    0.17154  -4.997 5.93e-07 ***
avgServerConnectionTime  2.02335    0.55566   3.641 0.000273 ***
avgRedirectionTime      -0.37822    0.06368  -5.939 2.97e-09 ***
avgPageDownloadTime      0.31975    0.12172   2.627 0.008631 **
avgDomainLookupTime      4.14929    0.88525   4.687 2.81e-06 ***
avgPageLoadTime          0.04684    0.01896   2.470 0.013528 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 26.74 on 8481 degrees of freedom
Multiple R-squared: 0.01339,	Adjusted R-squared: 0.0127
F-statistic: 19.19 on 6 and 8481 DF,  p-value: < 2.2e-16
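
The same numbers can be pulled out of a fitted model programmatically instead of being read off the printed summary. The sketch below does this on simulated data, since the original data set is not available here; the `sim` data frame, the two predictors used and the simulated effect sizes are all assumptions for illustration, and the results will not match the output above.

```r
# Simulated data (not the original data set); effect sizes are arbitrary.
set.seed(1)
n <- 200
sim <- data.frame(
  avgDomainLookupTime = runif(n, 0, 2),
  avgPageLoadTime     = runif(n, 0, 20)
)
sim$bouncerate <- 50 + 4 * sim$avgDomainLookupTime +
  0.05 * sim$avgPageLoadTime + rnorm(n, sd = 25)

# Passing the data frame through `data =` avoids relying on attached variables.
fit <- lm(bouncerate ~ avgDomainLookupTime + avgPageLoadTime, data = sim)
fit_summary <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(fit_summary)
print(round(coef_table, 3))

# Goodness-of-fit figures reported at the bottom of the printed summary.
r2     <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared
cat("R-squared:", round(r2, 4), " Adjusted R-squared:", round(adj_r2, 4), "\n")
```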
Let's understand the result. The coefficients are shown in the Estimate column. So, the equation for bounce rate becomes as below.

bouncerate = 49.11 + (-0.86) avgServerResponseTime + (2.02) avgServerConnectionTime + (-0.38) avgRedirectionTime + (0.32) avgPageDownloadTime + (4.15) avgDomainLookupTime + (0.05) avgPageLoadTime

As we can see from the equation, avgDomainLookupTime has the largest coefficient: if avgDomainLookupTime increases by 1 unit while the other time components stay the same, the predicted bounce rate increases by 4.15. With this, we have identified the relationship between bounce rate and the time components of a web page using regression.
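
To make the equation concrete, here is a small worked example in R. The coefficients are copied from the summary output above; the `page` timings are made-up values for a hypothetical page, and `predict_bounce` is a helper written for this sketch rather than part of the original analysis.

```r
# Coefficients copied from the summary(model_1) output.
coefs <- c(
  "(Intercept)"           = 49.10686,
  avgServerResponseTime   = -0.85724,
  avgServerConnectionTime =  2.02335,
  avgRedirectionTime      = -0.37822,
  avgPageDownloadTime     =  0.31975,
  avgDomainLookupTime     =  4.14929,
  avgPageLoadTime         =  0.04684
)

# Hypothetical timings for one page (made-up, in the same units as the data set).
page <- c(
  avgServerResponseTime   = 0.5,
  avgServerConnectionTime = 0.1,
  avgRedirectionTime      = 0.2,
  avgPageDownloadTime     = 0.3,
  avgDomainLookupTime     = 0.05,
  avgPageLoadTime         = 4
)

# Intercept plus the sum of coefficient * timing, matched by name.
predict_bounce <- function(timings, coefs) {
  unname(coefs["(Intercept)"] + sum(coefs[names(timings)] * timings))
}

base <- predict_bounce(page, coefs)

# Same page with avgDomainLookupTime one unit higher, everything else unchanged.
slower <- page
slower["avgDomainLookupTime"] <- slower["avgDomainLookupTime"] + 1
shifted <- predict_bounce(slower, coefs)

cat("Predicted bounce rate:", round(base, 2), "\n")
cat("Change after +1 unit of avgDomainLookupTime:", round(shifted - base, 2), "\n")
```

The difference between the two predictions equals the avgDomainLookupTime coefficient, which is what "increases by 1 unit" means in a linear model.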

Here, we cannot say that the relationships estimated from this regression model (model_1) are perfect, because the coefficients are estimated by fitting the model to the data set, and the data set may contain some unreliable observations. The multiple R-squared of 0.013 also shows that the model explains only about 1.3% of the variation in bounce rate. It is necessary to improve the model so that we can identify the relationships between bounce rate and the time components more precisely. In the next blog post, Improving Bounce Rate Prediction Model for Google Analytics Data (http://www.tatvic.com/blog/improving-bounce-rate-prediction-model-for-google-analytics-data/), we will discuss how to improve the model and look at the summary of the improved model.

Amar Gondaliya (http://www.tatvic.com/blog/author/amar/)

Amar is a data modeling engineer at Tatvic. He focuses on building predictive models from available data using R, Hadoop and the Google Prediction API.
Google Plus profile: https://plus.google.com/115682702585184320806/
