(This article was first published on **Exploring and experiencing analytics**, and kindly contributed to R-bloggers)

Hello Friends,

This time I thought I would bring in a little more spice and focus on movies. I don’t know about you, but I am a movie buff. Often on a weekend, when I am trying to pick a movie from my collection, which now runs to a few TBs, I feel a little lost. Apart from a general rating or perception, the length of the movie plays a role in the choice for a simple reason: the movie needs to be squeezed in between other demanding priorities.

So last Saturday, when I was in the middle of this process and searching for a movie shorter than 1 hour 30 minutes (there was a hard stop on that), my wife commented, “But the short movies are generally not so good”. I did not pay much heed to that then (don’t conclude anything from this, please), but later I thought: hold on, is that a hypothesis? Can I do something statistical here? And here we are. We will talk a little bit about correlation, the normal distribution and so on. I use R, but it is so simple that we could even do the same in an Excel sheet.

**Correlation:**

This is an indicator whose value lies between -1 and 1 and indicates the strength of the linear relationship between two variables. Leaving the jargon aside, in many cases we simply want to relate two features. Typical laws of physics, such as the one linking speed and displacement, may show a perfect correlation, but those are not the point of interest here. A more interesting question is whether there is a relation between, say:

a) IQ Score of a person and Salary drawn

b) No. of obese people in an area vis-à-vis no. of fast-food centers in the locality

c) No. of Facebook friends and relationship shelf life

d) No. of hours spent in office and attrition rate for an organization

An underlying technicality I must point out here: the usual significance test for the correlation coefficient assumes that both variables roughly follow a normal distribution.
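To make the definition a little more concrete, here is a minimal sketch of how the correlation coefficient can be computed by hand and compared with R’s built-in `cor()`. The two vectors (`hours`, `score`) are made-up numbers purely for illustration; they are not from the movie data.

```r
# Made-up example data: hours studied and exam score for 8 students (not real data)
hours <- c(2, 3, 5, 6, 8, 9, 11, 12)
score <- c(50, 55, 53, 65, 70, 68, 80, 85)

# Pearson correlation from its definition:
# covariance of the two variables divided by the product of their standard deviations
r_manual <- cov(hours, score) / (sd(hours) * sd(score))

# The built-in function should give the same number
r_builtin <- cor(hours, score)

print(c(manual = r_manual, builtin = r_builtin))
```

Whatever data you plug in, the value always stays between -1 and 1; the closer it is to either end, the stronger the linear relationship.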

**Normal Distribution:**

This is the most **common probability distribution** function. It is a bell-shaped curve with equal spread on both sides of the mean. Associates and managers alike must have heard about normalization and the bell curve while facing or doing appraisals. Many random phenomena across disciplines approximately follow a normal distribution.

So I picked up movie information and, like any one of us, took it from IMDB (http://www.imdb.com/) and put it in a structured form like the one below. The highlighted columns may not be required at this point; I kept them with some future work in mind. The list was prepared manually; I will keep hunting for an API and will keep you posted on the same.

| Name | Year of Release | Rating | Duration (min) | Small Desc |
|---|---|---|---|---|
| Skyfall | 2012 | 8.1 | 143 | Bond's loyalty to M is tested as her past comes back to haunt her. As MI6 comes under attack, 007 must track down and destroy the threat, no matter how personal the cost. |

At this point I have collected 183 movies and stored them as a CSV file.

First things first: there are various formal ways to test whether a variable follows a normal distribution, but I will just plot histograms and see how they look. Both variables seem to follow a normal distribution closely.
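For anyone who wants one of the more formal checks, a possible sketch is the Shapiro-Wilk test together with a Q-Q plot, both available in base R. The `duration` vector here is simulated (183 made-up values with an assumed mean and spread), not the IMDB data, so the output only shows how the commands work.

```r
set.seed(42)  # make the simulated numbers reproducible

# Simulated stand-in for movie durations in minutes (hypothetical, not the IMDB data)
duration <- rnorm(183, mean = 115, sd = 20)

# Shapiro-Wilk test: the null hypothesis is that the sample comes from a normal distribution
sw <- shapiro.test(duration)
print(sw$p.value)  # a small p-value (say below 0.05) would suggest a departure from normality

# Q-Q plot: points lying close to the line suggest an approximately normal shape
qqnorm(duration, main = "Q-Q plot of simulated durations")
qqline(duration, col = "red")
```

With the real data, the same two calls could be applied to the rating and duration vectors in turn.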

Below are the commands for quick reference. What I just adore about R is its simplicity: with just a few commands we are done.

*film<-read.csv("film.csv",header=TRUE) # Read the file into a data frame*

*x<-as.matrix(film) # Convert the data frame to a (character) matrix*

*y<-as.numeric(x[,3]) # Movie rating (column 3) as a numeric vector*

*z<-as.numeric(x[,4]) # Movie duration (column 4) as a numeric vector*

*hist(z,col="green",border="black",xlab="Duration",ylab="mvfreq",main="Mv Duration Distribution",breaks=7)*

*hist(y,col="blue",border="black",xlab="mvRtng",ylab="mvfreq",main="Mv Rtng Distribution",breaks=9)*

*cor(y,z) # Calculate the correlation coefficient between rating and duration*

Interestingly, the correlation turns out to be 0.48 in this case, which says there is a positive correlation between these two variables, and the correlation is not small. We can set up the hypothesis “there is no correlation”, choose a level of significance and test it. However, 0.48 is a high value for a sample of 183 movies, and I am sure we would reject the hypothesis that there is no correlation.
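As a rough check of that claim, the standard t-test for a correlation coefficient can be worked out from just the two numbers quoted in this post, the correlation (0.48) and the sample size (183). This is only a sketch under the usual assumptions of that test (roughly normal variables, independent observations).

```r
r <- 0.48   # correlation reported in the post
n <- 183    # number of movies in the sample

# Test statistic for H0: the true correlation is zero.
# Under H0 it follows a t distribution with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(t_stat, 2))  # about 7.36
print(p_value < 0.05)    # TRUE
```

With the raw rating and duration vectors at hand, `cor.test()` performs the same test in one call.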

So, one way or another, the rating tends to go up with the duration of the movie.

I leave the interpretation to you, but next time you might look at the movie duration before taking a call! Directors, it might be a tip for you, who knows. And maybe, for me, the wife is always right. Maybe all that is short is not that sweet.

With that I will call it a day; I hope you enjoyed reading. I will be coming back with more such posts. Looking forward to your feedback and comments.
