**Oracle R Enterprise**, and kindly contributed to R-bloggers)

Overhauling analytics processes is becoming a recurring theme among customers. A major telecommunication provider recently embarked on overhauling their analytics process for customer surveys. They had three broad technical goals:

- Provide an agile environment that empowers business analysts to test hypotheses based on survey results
- Allow dynamic customer segmentation based on survey responses and even specific survey questions to drive hypothesis testing
- Make results of new surveys readily available for research

The ultimate goal is to derive greater value from survey research that drives measurable improvements in survey service delivery, and as a result, overall customer satisfaction.

This provider chose Oracle Advanced Analytics (OAA) to power their survey research. Survey results and analytics are maintained in Oracle Database and delivered via a parameterized BI dashboard. Both the database and BI infrastructure are standard components in their architecture.

A parameterized BI dashboard enables analysts to create samples for hypothesis testing by filtering respondents to a survey question based on a variety of filtering criteria. This provider required the ability to deploy a range of statistical techniques depending on the survey variables, level of measurement of each variable, and the needs of survey research analysts.

Oracle Advanced Analytics offers a range of in-database statistical techniques complemented by a unique architecture supporting deployment of open source R packages in-database to optimize data transport to and from database-side R engines. Additionally, depending on the nature of functionality in such R packages, it is possible to leverage data-parallelism constructs available as part of in-database R integration. Finally, all OAA functionality is exposed through SQL, the ubiquitous language of the IT environment. This enables OAA-based solutions to be readily integrated with BI and other IT technologies.

The survey application noted above has been in production for 3 months. It supports a team of 20 business analysts and has already begun to demonstrate measurable improvements in customer satisfaction.

In the rest of this blog, we explore the range of statistical techniques deployed as part of this application.

At the heart of survey research is *hypothesis testing*. A completed customer satisfaction survey contains data used to draw conclusions about the state of the world. In the survey domain, hypothesis testing means checking whether answers to specific survey questions differ significantly across two distinct groups of customers. Such groups are identified based on knowledge of the business and specified technically through filtering predicates.

Hypothesis testing sets up the world as consisting of two mutually exclusive hypotheses:

a) Null hypothesis: there is no difference in satisfaction levels between the two groups of customers

b) Alternative hypothesis: there is a significant difference in satisfaction levels between the two groups of customers

Only one of these can be true, and the choice between them rests on how strong the evidence against the null hypothesis is. Put simply, the degree of difference between, e.g., the average score from a specific survey question across two customer groups could provide the evidence needed to decide whether to reject the null hypothesis in favor of the alternative.

In practice, providing the evidence needed to make a decision involves having access to a range of test statistics. A test statistic is a number calculated from the sampled groups that helps determine the choice of null or alternative hypothesis. A great deal of theory, experience, and business knowledge goes into selecting the right statistic for the problem at hand.

The t-statistic (available in-database) is a fundamental function used in hypothesis testing that helps understand the differences in means across two groups. When the t-value computed from two groups of customers for a specific survey question is extreme (large in absolute value), the alternative hypothesis is likely to be true. It is common to set a critical value that the absolute observed t-value should exceed to conclude that the satisfaction survey results across the two groups are significantly different. Other statistics available in-database include the F-test, cross tabulation (frequencies of various response combinations captured as a table), and related hypothesis testing functions such as chi-square functions, Fisher’s exact test, Kendall’s coefficients, correlation coefficients, and a range of lambda functions.
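As a minimal sketch of the t-statistic comparison described above, the following computes a two-sample t-value and compares it with a critical value. It is written in standard-library Python rather than as an in-database call; the Welch (unequal-variance) form, the function name `welch_t`, the scores, and the critical value are all assumptions made for illustration, not details of the provider's application.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return (t, df) for Welch's two-sample t-test on two lists of scores."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)  # sample variances (n - 1)
    se2 = va / na + vb / nb                        # squared standard error
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical 1-5 satisfaction scores for two customer groups.
prepaid = [4, 5, 3, 4, 4, 5, 4, 3]
postpaid = [2, 3, 3, 2, 4, 3, 2, 3]

t_value, df = welch_t(prepaid, postpaid)

# Assumed two-sided 5% critical value for roughly 14 degrees of freedom.
CRITICAL = 2.145
significant = abs(t_value) > CRITICAL
print(f"t = {t_value:.3f}, df = {df:.1f}, significant = {significant}")
```

With these made-up scores the observed t-value exceeds the critical value, so the null hypothesis of equal satisfaction would be rejected.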

If an analyst wants to compare more than two groups, analysis of variance (ANOVA), a collection of techniques, is commonly used. This is an area where the R package ecosystem is rich, with several proven implementations. The R **stats** package has implementations of several test statistics, and its function **glm** allows analysis of count data common in survey results, including building Poisson and log-linear models. R’s **MASS** package implements a popular survey analysis technique called *iterative proportional fitting*. R’s **survey** package has a rich collection of features (http://faculty.washington.edu/tlumley/survey/).
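To make the more-than-two-groups case concrete, here is a small sketch of the one-way ANOVA F statistic, again in standard-library Python rather than with the R packages named above. The function name `one_way_anova_f`, the plan names, the scores, and the quoted critical value are hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical satisfaction scores for three cellular plans.
scores_by_plan = {
    "basic": [3, 4, 3, 2],
    "family": [4, 5, 4, 5],
    "business": [2, 3, 2, 3],
}

f_value, df_b, df_w = one_way_anova_f(list(scores_by_plan.values()))
print(f"F({df_b}, {df_w}) = {f_value:.2f}")
```

A large F value relative to the critical value for the given degrees of freedom (about 4.26 at the 5% level for 2 and 9 degrees of freedom) suggests that at least one group mean differs from the others.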

The provider was specifically interested in one function in the **survey** package: raking (also known as sample balancing), a process that assigns a weight to each customer who responded to a survey such that the weighted distribution of the sample is in very close agreement with the known distribution of other customer attributes, such as the type of cellular plan, demographics, or average bill amount. Raking is an iterative process that uses the sample design weight as the starting weight and terminates when convergence is achieved.
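The raking loop described above can be sketched in a few lines. This is an illustrative standard-library Python version, not the **survey** package's implementation; the function names, the attribute names, the target shares, and the default design weight of 1.0 are assumptions.

```python
def rake(respondents, targets, weight_key="weight", tol=1e-6, max_iter=100):
    """Adjust weights until weighted shares match the target shares."""
    # Start from the design weight; assume 1.0 when none is recorded.
    weights = [r.get(weight_key, 1.0) for r in respondents]
    for _ in range(max_iter):
        max_change = 0.0
        for attr, shares in targets.items():
            total = sum(weights)
            # Current weighted total per category of this attribute
            current = {}
            for r, w in zip(respondents, weights):
                current[r[attr]] = current.get(r[attr], 0.0) + w
            # Scale each respondent so this attribute's margin hits its target
            for i, r in enumerate(respondents):
                factor = shares[r[attr]] * total / current[r[attr]]
                max_change = max(max_change, abs(factor - 1.0))
                weights[i] *= factor
        if max_change < tol:  # a full sweep changed nothing: converged
            break
    return weights

def weighted_share(respondents, weights, attr, value):
    """Weighted proportion of respondents whose attribute equals value."""
    matched = sum(w for r, w in zip(respondents, weights) if r[attr] == value)
    return matched / sum(weights)

# Hypothetical survey respondents and known customer-base shares.
respondents = [
    {"plan": "prepaid", "region": "urban"},
    {"plan": "prepaid", "region": "urban"},
    {"plan": "prepaid", "region": "rural"},
    {"plan": "postpaid", "region": "urban"},
    {"plan": "postpaid", "region": "urban"},
    {"plan": "postpaid", "region": "rural"},
]
targets = {
    "plan": {"prepaid": 0.4, "postpaid": 0.6},
    "region": {"urban": 0.7, "rural": 0.3},
}

weights = rake(respondents, targets)
print(weighted_share(respondents, weights, "plan", "prepaid"))
print(weighted_share(respondents, weights, "region", "urban"))
```

Each sweep rescales the weights one attribute at a time, so matching one margin can disturb another; repeating the sweeps until the adjustment factors settle at 1 is what makes the process iterative.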

For this survey application, R scripts that expose a wide variety of statistical techniques (some in-database, accessible through the transparency layer in Oracle R Enterprise, and some in CRAN packages) were built and stored in the Oracle R Enterprise in-database R script repository. These parameterized scripts accept various arguments that identify samples of customers to work with as well as specific constraints for the various hypothesis test functions. The net result is greater agility, since the business analyst determines both the set of samples to analyze and the appropriate technique to apply to each sample based on the hypothesis being pursued.
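The idea of a parameterized script can be illustrated with a short sketch in which the caller supplies the survey question, two filtering predicates, and the name of a technique. This is plain Python for illustration only; it does not use or represent the Oracle R Enterprise script repository, and every name in it (`run_analysis`, `TECHNIQUES`, the field names) is hypothetical.

```python
from statistics import mean

def mean_difference(sample_a, sample_b):
    """Difference in average score between the two samples."""
    return mean(sample_a) - mean(sample_b)

# Hypothetical registry mapping a technique name to its implementation.
TECHNIQUES = {"mean_difference": mean_difference}

def run_analysis(responses, question, in_group_a, in_group_b, technique):
    """Build two samples from filtering predicates, then apply a technique."""
    sample_a = [r[question] for r in responses if in_group_a(r)]
    sample_b = [r[question] for r in responses if in_group_b(r)]
    return TECHNIQUES[technique](sample_a, sample_b)

# Hypothetical survey responses; "q1" is a 1-5 satisfaction score.
responses = [
    {"plan": "prepaid", "region": "urban", "q1": 4},
    {"plan": "prepaid", "region": "rural", "q1": 5},
    {"plan": "postpaid", "region": "urban", "q1": 2},
    {"plan": "postpaid", "region": "rural", "q1": 3},
]

result = run_analysis(
    responses,
    question="q1",
    in_group_a=lambda r: r["plan"] == "prepaid",
    in_group_b=lambda r: r["plan"] == "postpaid",
    technique="mean_difference",
)
print(result)
```

Keeping the sample definition and the choice of technique as arguments is what lets an analyst change either one from a dashboard without touching the script itself.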

For more information see these links for Oracle’s R Technologies software: Oracle R Distribution, Oracle R Enterprise, ROracle, Oracle R Connector for Hadoop.
