Data Analysis for Marketing Research with R Language (1)

[This article was first published on Data Apple » R Blogs in English, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Data Analysis technologies such as t-test, ANOVA, regression, conjoint analysis, and factor analysis are widely used in the marketing research areas of A/B Testing, consumer preference analysis, market segmentation, product pricing, sales driver analysis, and sales forecast etc. Traditionally the analysis tools are mainly SPSS and SAS, however, the open source R language is catching up now.

In this article, we will introduce how to use R to conduct some basic marketing researches by a sample, and then to further implement the analysis in Oracle R Enterprise which integrating R with Oracle Database.

Scenario Introduction and Marketing Research Objectives

A fabricate company, ABC store chain, is selling a new type of grape juice in some of its stores for pilot selling. The marketing team of ABC wants to analyze:-

  • Which type of in-store advertisement is more effective? They have placed two types of ads in stores for testing, one theme is natural production of the juice, the other theme is family health caring;
  • The Price Elasticity – the reactions of sales quantity of the grape juice to its price change;
  • The Cross-price Elasticity – the reactions of sales quantity of the grape juice to the price changes of other products such as apple juice and cookies in the same store;
  • How to find the best unit price of the grape juice which can maximize the profit and the forecast of sales with that price.

The marketing team has randomly sampled 30 observations and constructed the following dataset for the analysis. There are 5 variables (data columns) in the dataset.

Variable Description
Sales Total unit sales of the grape juice in one week in a store
Price Average unit price of the grape juice in the week
ad_type The in-store advertisement type to promote the grape juice.

ad_type = 0,  the theme of the ad is natural production of the juice

ad_type = 1,  the theme of the ad is family health caring

price_apple Average unit price of the apple juice in the same store in the week
price_cookies Average unit price of the cookies in the same store in the week

The dataset can be downloaded here. Please note the dataset is made up by the author for illustration purpose only, so it maybe looks different from the data in the real world.

Data Exploration

Let’s have some basic exploration to know more about the dataset.

#load the libraries needed in the following codes
> library(s20x)
> library(car)

> #read the dataset from an existing .csv file
> df <- read.csv(file.choose(),header=T)

> #list the name of each variable (data column) and the first six rows of the dataset
> head(df)
sales price ad_type price_apple price_cookies
1   222  9.83       0        7.36          8.80
2   201  9.72       1        7.43          9.62
3   247 10.15       1        7.66          8.90
4   169 10.04       0        7.57         10.26
5   317  8.38       1        7.33          9.54
6   227  9.74       0        7.51          9.49

> # basic statistics of the variables
> summary(df)
sales           price           ad_type     price_apple    price_cookies
Min.   :131.0   Min.   : 8.200    Min.   :0.0   Min.   :7.300   Min.   : 8.790
1st Qu.:182.5 1st Qu.: 9.585  1st Qu.:0.0  1st Qu.:7.438    1st Qu.: 9.190
Median :204.5 Median : 9.855 Median :0.5  Median :7.580   Median : 9.515
Mean:   216.7  Mean:   9.738   Mean   :0.5   Mean:   7.659  Mean   : 9.622
3rd Qu.:244.2  3rd Qu.:10.268 3rd Qu.:1.0   3rd Qu.:7.805    3rd Qu.:10.140
Max. :   335.0   Max. :   10.490  Max. :   1.0   Max. :   8.290   Max. :   10.580

From the above summary table, we can roughly know the basic statistics of each numeric variable. For example, the mean value of sales is 216.7 units, the min value is 131, and the max value is 335. Please ignore the statistics of the “ad_type” there since it is a categorical variable.

We can further explore the distribution of the data of sales by visualizing the data in graphical form as follows.

> #set the 1 by 2 layout plot window
> par(mfrow = c(1,2))
> # boxplot to check if there are outliers
> boxplot(df$sales,horizontal = TRUE, xlab=”sales”)
> # histogram to explore the data distribution shape
> hist(df$sales,main=””,xlab=”sales”,prob=T)
> lines(density(df$sales),lty=”dashed”,lwd=2.5,col=”red”)


We don’t find outliers in the above box plot graph and the sales data distribution is roughly normal. It is not necessary to apply further data cleaning and treatment to the data set.

Analysis of Ad Effectiveness

The marketing team wants to find out the ad with better effectiveness for sales between the two types of ads, one is with natural production theme; the other is with family health caring theme. So they can place the better one into all of ABC’s stores after the pilot period.

To find out the better ad, we can calculate and compare the mean of sales with the two different ad types at the first step.

> #divide the dataset into two sub dataset by ad_type
> sales_ad_nature = subset(df,ad_type==0)
> sales_ad_family = subset(df,ad_type==1)
> #calculate the mean of sales with different ad_type
> mean(sales_ad_nature$sales)
[1] 186.6667
> mean(sales_ad_family$sales)
[1] 246.6667

The mean of sales with nature product theme is about 187; the mean of sales with family health caring theme is about 247. It looks like that the latter one is better. However, this is only the conclusion based on the sample with only 30 observations randomly selected. To find out how likely the conclusion is correct for the whole population, it is necessary to do statistical testing – two-sample t-test.

It is important to check the assumptions of t-tests, which assume the observations are normally distributed and independent, before conducting the t-tests. Otherwise the results of t-tests are not valid. The observations are independent since they were randomly sampled. Let’s check the normality by plotting the distribution shapes of the two groups of sales data.

> #set the 1 by 2 layout plot window
> par(mfrow = c(1,2))
> # histogram to explore the data distribution shapes
> hist(sales_ad_nature$sales,main=””,xlab=”sales with nature production theme ad”,prob=T)
> lines(density(sales_ad_nature$sales),lty=”dashed”,lwd=2.5,col=”red”)
> hist(sales_ad_family$sales,main=””,xlab=”sales with family health caring theme ad”,prob=T)
> lines(density(sales_ad_family$sales),lty=”dashed”,lwd=2.5,col=”red”)


We can see that the shapes are roughly normally distributed. We can also check the normality by Shapiro-Wilk test as follows.

> shapiro.test(sales_ad_nature$sales)

Shapiro-Wilk normality test
data:  sales_ad_nature$sales
W = 0.9426, p-value = 0.4155

> shapiro.test(sales_ad_family$sales)

Shapiro-Wilk normality test
data:  sales_ad_family$sales
W = 0.8974, p-value = 0.08695

The p-values of the Shapiro-Wilk tests are larger than 0.05, so there is no strong evidence to reject the null hypothesis that the two groups of sales data are normally distributed.

Now we can conduct the t-test since the t-test assumptions are met.

> t.test(sales_ad_nature$sales,sales_ad_family$sales)

Welch Two Sample t-test
data:  sales_ad_nature$sales and sales_ad_family$sales
t = -3.7515, df = 25.257, p-value = 0.0009233
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -92.92234 -27.07766
sample estimates:
mean of x mean of y
186.6667  246.6667

From the output of t-test above, we can say that:-

We have strong evidence to say that the population means of the sales with the two different ad types are different because the p-value of the t-test is very small;

With 95% confidence, we can estimate that the mean of the sales with natural production theme ad is somewhere in 27 to 93 units less than that of the sales with family health caring theme ad.

So the conclusion is that the ad with the theme of family health caring is BETTER.

Sales Driver Analysis and Price Elasticity Analysis

With the information given in the data set, we can explore how grape juice price, ad type, apple juice price, cookies price influence the sales of grape juice in a store by multiple linear regression analysis. Here, “sales” is the dependent variable and the others are independent variables.

Let’s investigate the correlation between the sales and other variables by displaying the correlation coefficients in pairs.

> pairs(df,col=”blue”,pch=20)
> pairs20x(df)


The correlation coefficients between sales and price, ad_type, price_apple, and price_cookies are 0.85, 0.58, 0.37, and 0.37 respectively, that means they all might have some influences to the sales, so we can try to add all of the independent variables into the regression model as follows.

> sales.reg<-lm(sales~price+ad_type+price_apple+price_cookies,df)
> summary(sales.reg)

lm(formula = sales ~ price + ad_type + price_apple + price_cookies,
data = df)

Min      1Q  Median      3Q     Max
-36.290 -10.488   0.884  10.483  29.471

Estimate Std. Error t value Pr(>|t|)
(Intercept)    774.813    145.349   5.331 1.59e-05 ***
price          -51.239      5.321  -9.630 6.83e-10 ***
ad_type         29.742      7.249   4.103 0.000380 ***
price_apple     22.089     12.512   1.765 0.089710 .
price_cookies  -25.277      6.296  -4.015 0.000477 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.2 on 25 degrees of freedom
Multiple R-squared: 0.8974,    Adjusted R-squared: 0.881
F-statistic: 54.67 on 4 and 25 DF,  p-value: 5.318e-12

The p-value for price, ad_type, and price_cookies in the last column of the above output is much less than 0.05. They are significant in explaining the sales. We are confident to include these three variables into the model.

The p-value of price_apple is a bit larger than 0.05, seems there are no strong evidence for apple juice price to explain the sales. However, according to our real-life experience, we know when apple juice price is lower, consumers likely to buy more apple juice, and then the sales of other fruit juice will decrease. So we can also add it into the model to explain the grape juice sales.

The Adjusted R-squared is 0.881, which indicates a reasonable goodness of fit and 88% of the variation in sales can be explained by the four variables. The remaining 12% can be attributed to other factors or inherent variability. Please note the R-squared is very high here because the dataset were made up rather than from real world data sources.

The assumptions for the regression to be true are that data are random and independent; residuals are normally distributed and have constant variance. Let’s check the residuals assumptions visually.

# plotting the residuals vs. other key model metrics
> par(mfrow=c(2,2))
> plot(sales.reg)


The Residuals vs Fitted graph above shows that the residuals scatter around the fitted line with no obvious pattern, and the Normal Q-Q graph shows that basically the residuals are normally distributed. The assumptions are met.

For multiple regression, it is also important to check the multicollinearity among the variables because high multicollinearity will make the coefficients for independent variables less precise and introduce large errors in the predictions for dependant variable. We can investigate the multicollinearity by displaying the correlation coefficients of the independent variables in pairs as what we did at the beginning of this part. We can also check the multicollinearity by the following command in R.

#check multicollinearity
> vif(sales.reg)
price       ad_type   price_apple price_cookies
1.246084      1.189685      1.149248      1.099255

The VIF test value for each variable is close to 1, which means the multicollinearity is very low among these variables.

Based on the above analysis, we can accept the regression result and construct the multi-linear model of sales as follows.

Sales = 774.81 – 51.24 * price + 29.74 * ad_type + 22.1 * price_apple – 25.28 * price_cookies

With model established, we can analysis the Price Elasticity(PE) and Cross-price Elasticity(CPE) to predict the reactions of sales quantity to price. “Price elasticity is defined as %ΔQ/%ΔP, which indicates the percent change in quantity divided by the percent change in price; Cross-price Elasticity is the percent change in quantity divided by the change in the price of some other product.”1

PE = (ΔQ/Q) / (ΔP/P) = (ΔQ/ΔP) * (P/Q) = -51.24 * 0.045 = -2.3

P is price, Q is sales quantity

ΔQ/ΔP = -51.24 , the parameter before the variable “price” in the above model

P/Q = 9.738 / 216.7 = 0.045,  P is the mean of prices in the dataset, so does Q

The PE indicates that 10% decrease in price will increase the sales by 23%, and vice verse.

Let’s further calculate the CPE on apple juice and cookies to analyze the how the change of apple juice price and cookies price influence the sales of grape juice.

CPEapple = (ΔQ/ΔPapple) * (Papple/Q) = 22.1 * ( 7.659 / 216.7) = 0.78

CPEcookies = (ΔQ/ΔPcookies) * (Pcookies/Q) = -25.28 * ( 9.622 / 216.7) = – 1.12

The CPEapple indicates that 10% decrease in apple juice price will DECREASE the sales by 7.8%, and vice verse. So the grape juice and apple juice are substitutes.

The CPEcookies indicates that 10% decrease in cookies price will INCREASE the sales by 11.2%, and vice verse. So the grape juice and cookies are compliments. Place the two products together will likely increase the sales for both.

We can also know that the sales increase 29.74 units when using the ad with the family health caring theme (ad_type = 1).

Optimal Pricing and Sales Prediction

Usually companies want to get higher profit rather than just higher sales quantity. So, how to set the optimal price for the new grape juice to get the maximum profit based on the dataset collected in the pilot period and the regression model above?

To simplify the question, we can let the ad_type = 1, the price_apple = 7.659 (mean value), and the price_cookies = 9.738 (mean value).

The model is simplified as follows:-

Sales = 774.81 – 51.24 * price + 29.74 * 1 + 22.1 * 7.659 – 25.28 * 9.738

Sales = 772.64 – 51.24*price

Assume the marginal cost(C) per unit of grape juice is 5. We can calculate the profit (Y) by the following formula.

Y = (price – C) * Sales Quantity = (price – 5) * (772.64 – 51.24*price)

Y = – 51.24 * price2 + 1028.84 * price – 3863.2

To get the optimal price to maximize Y, we can use the following R function.

> f = function(x) -51.24*x^2 + 1028.84 * x – 3863.2
> optimize(f,lower=0,upper=20,maximum=TRUE)
[1] 10.03942
[1] 1301.28

The optimal price is 10.04; the maximum profit will be 1301 according to the above output. In reality, we can reasonably set the price to be 10 or 9.99.

We can further use the model to predict the sales while the price is 10.

> # predict the sales
> inputData <- data.frame(price=10,ad_type=1,price_apple=7.659,price_cookies=9.738)
> predict(sales.reg,inputData,interval=”p”)
fit           lwr           upr
1 215.1978 176.0138 254.3817

The sales forecast will be 215 units with a variable range of 176 ~ 254 with 95% confidence in a store in one work on average. Based on the forecast and other factors, ABC Company can prepare the inventory for all of its stores after the pilot period.

Implementation in Oracle R Enterprise

“Oracle R Enterprise (ORE) implements a transparency layer on top of the R engine that allows R computations to be executed in Oracle Database from the R environment.”3 It is also not necessary to load the whole bunch of data into R environment, which usually runs on a desktop or laptop with limitations of RAM and CPU, from database. This is helpful when we have millions of data records to be analyzed. Let’s try to implement the regression model by ORE.

# load the Oracle R Enterprise library and connect to Oracle Database
> library(ORE)
> ore.connect(user = “”,sid = “orcl”,host = “”,password = “”,port = 1521,all = TRUE)

# regression by ore.lm
> summary(sales.reg)

The output is the same as we use the function of “lm” for regression.


In this article, using the open source R language, we introduced how to test the differences of effectiveness among different ad types; how to analyze the price elasticity and cross-price elasticity of a product; and how to set the optimal price to maximize the profit and then to forecast the sales with the price.



  1. Prof. Ronald T. Wilcox’s lecture slides of “Estimating Demand from Scanner Data”


Author: Jack Han. All rights reserved. 转载须以超链接形式标明文章原始出处和作者信息

2,095 total views, 5 views today

To leave a comment for the author, please follow the link and comment on their blog: Data Apple » R Blogs in English. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)