(This article was first published on Exploring and experiencing analytics, and kindly contributed to R-bloggers)
Hello Friends,
This time I thought I would bring in a little more spice and focus on movies. I don’t know about you, but I am a movie buff. Often on a weekend, when I am trying to pick a movie from my movie repository, which now spans some TBs, I feel a little lost. Apart from a general rating or perception, the length of the movie plays a role in the choice for a simple reason: the movie needs to be crammed in between other demanding priorities.
So last Saturday, when I was in the middle of this process and searching for a movie shorter than 1 hour 30 minutes (there was a hard stop on that), my wife commented, “But the short movies are generally not so good.” I did not pay much heed to that then (please don’t conclude anything from this), but later on I thought: hold on, is that a hypothesis? Can I do something statistical here? And here we are. We will talk a little about correlation, the normal distribution and so on. I use R, but it is so simple that we could even use an Excel sheet to do the same.
Correlation:
This is an indicator whose value lies between -1 and 1, and it indicates the strength of the linear relationship between two variables. Leaving the jargon aside, in many cases we simply want to relate two features. Typical laws of physics, like the one between speed and displacement, may show a perfect correlation, but those are not the point of interest. A point of interest may rather be whether there is a relation between, say,
a) IQ Score of a person and Salary drawn
b) No. of obese people in an area vis-à-vis no. of fast-food centers in the locality
c) No. of Facebook friends and relationship shelf life
d) No. of hours spent in office and attrition rate for an organization
An underlying technicality I must point out here: the usual significance test for this correlation coefficient assumes that both variables approximately follow a normal distribution.
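To make the idea concrete, here is a minimal sketch in R that computes the correlation coefficient by hand and compares it with the built-in cor(). The two small vectors are made-up numbers for illustration only; they are not data from this post.

```r
# Made-up illustrative data (an assumption, not taken from this post)
hours  <- c(1, 2, 3, 4, 5, 6)
scores <- c(52, 55, 61, 60, 68, 71)

# Pearson correlation: sum of cross-deviations divided by the
# square root of the product of the two sums of squared deviations
dev_h <- hours - mean(hours)
dev_s <- scores - mean(scores)
manual_r <- sum(dev_h * dev_s) / sqrt(sum(dev_h^2) * sum(dev_s^2))

builtin_r <- cor(hours, scores)  # default method is "pearson"
print(c(manual = manual_r, builtin = builtin_r))

# A perfectly linear relationship sits at the ends of the range
perfect_pos <- cor(hours,  2 * hours + 3)  # increasing line
perfect_neg <- cor(hours, -2 * hours + 3)  # decreasing line
print(c(perfect_pos = perfect_pos, perfect_neg = perfect_neg))
```

The hand computation and cor() should agree, and the two straight-line cases land at the extremes of the range, which is what "strength of linear relationship" means in practice.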
Normal Distribution:
This is the most common probability distribution: a bell-shaped curve with equal spread on both sides of the mean. Associates and managers alike, you must have heard about normalization and the bell curve when you face or do appraisals. Many random events across disciplines approximately follow a normal distribution. The image below is from the internet.
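For readers who want to see the bell shape for themselves, the following is a small sketch in R. It uses the standard normal distribution (mean 0, standard deviation 1), which is my choice for illustration and not something taken from the movie data.

```r
# Standard normal distribution, chosen only for illustration
mu    <- 0
sigma <- 1

# Share of probability within 1, 2 and 3 standard deviations of the mean
within_sd <- sapply(1:3, function(k) {
  pnorm(mu + k * sigma, mean = mu, sd = sigma) -
    pnorm(mu - k * sigma, mean = mu, sd = sigma)
})
print(round(within_sd, 4))

# Draw the bell-shaped density curve
curve(dnorm(x, mean = mu, sd = sigma), from = -4, to = 4,
      xlab = "Value", ylab = "Density", main = "Normal distribution")
abline(v = mu, lty = 2)  # dashed vertical line at the mean
```

The printed shares are roughly 68%, 95% and 99.7%, and the curve is symmetric around the dashed line at the mean.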
So I picked up movie information, and like any one of us I picked it up from IMDB (http://www.imdb.com/). I put it in a structured form like the one below. The highlighted columns may not be required at this point; I kept them with some future work in mind. The list was prepared manually; I will keep hunting for an API and will keep you posted on the same.
Name | Year of Release | Rating | Duration (minutes) | Small Desc
Skyfall | 2012 | 8.1 | 143 | Bond’s loyalty to M is tested as her past comes back to haunt her. As MI6 comes under attack, 007 must track down and destroy the threat, no matter how personal the cost.

At this point I have taken 183 movies and stored them in a csv file.
First things first: there are various formal ways to test whether a variable follows a normal distribution, but I will just plot histograms and see how they look. Both variables seem to follow a normal distribution closely.
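As an aside on the formal route, here is a hedged sketch of one common check, the Shapiro-Wilk test, together with a Q-Q plot. Since the csv file is not reproduced here, the `duration` vector is simulated stand-in data (183 values with an assumed mean of 120 and standard deviation of 20), not the real movie durations.

```r
set.seed(42)  # make the simulated stand-in data reproducible

# Simulated stand-in for the 183 movie durations (NOT the real file)
duration <- rnorm(183, mean = 120, sd = 20)

# Shapiro-Wilk test: the null hypothesis is that the data are normal
sw <- shapiro.test(duration)
print(sw)

# A small p-value (for example below 0.05) would be evidence against normality
looks_normal <- sw$p.value > 0.05

# Q-Q plot: points close to the straight line suggest a normal shape
qqnorm(duration, main = "Q-Q plot of duration")
qqline(duration)
```

With the real data, the same two calls could be run on the duration and rating vectors in place of the simulated one.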
Below are the commands for quick reference. What I just adore about R is its simplicity: with just a few commands we are done.
film <- read.csv("film.csv", header = TRUE) # Read the file into a data frame
x <- as.matrix(film) # Convert the data frame to a (character) matrix, for histogram plotting
z <- as.numeric(x[, 3]) # Convert the movie rating (column 3) to a numeric vector
y <- as.numeric(x[, 4]) # Convert the movie duration (column 4) to a numeric vector
hist(y, col = "green", border = "black", xlab = "Duration", ylab = "mvfreq", main = "Mv Duration Distribution", breaks = 7)
hist(z, col = "blue", border = "black", xlab = "mvRtng", ylab = "mvfreq", main = "Mv Rtng Distribution", breaks = 9)
cor(y, z) # Calculate the correlation coefficient between duration and rating
Interestingly, the correlation turns out to be 0.48 in this case, which says there is a positive correlation between these two phenomena, and the correlation is not small. We can set up a hypothesis, “There is no correlation”, choose a level of significance and test the hypothesis. However, with 183 movies, 0.48 is a high value, and I am sure we would reject the hypothesis that there is no correlation.
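The test mentioned above can be sketched from the two reported numbers alone. This is a minimal illustration that assumes the standard t-test for a Pearson correlation and a significance level of 0.05; both are my assumptions, since the post does not fix them.

```r
r <- 0.48   # correlation reported in this post
n <- 183    # number of movies in this post

# Under H0 (no correlation) this statistic follows a t-distribution
# with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value

alpha <- 0.05                    # assumed level of significance
reject_null <- p_value < alpha   # TRUE means "reject no correlation"

print(c(t = t_stat, p = p_value))
print(reject_null)
```

The test statistic comes out a little above 7, so the p-value is far below 0.05 and the "no correlation" hypothesis is rejected. With the raw vectors at hand, cor.test(y, z) runs the same test directly.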
So, one way or another, the rating tends to go up with the duration of the movie.
I leave the interpretation to you, but next time you might look at the movie duration before taking a call! Mr. Directors, this might be a tip for you, who knows, and as for me, maybe wifey is always right. Maybe all that is short is not that sweet.
With that I will call it a day; I hope you enjoyed reading. I will be coming back with more such posts. Looking forward to your feedback and comments.