This time I thought I would bring in a little more spice and focus on movies. I don't know about you, but I am a movie buff. Often on a weekend, when I am trying to pick a movie from my movie repository, which runs to some TBs now, I feel a little lost. Apart from a general rating or perception, the length of the movie plays a role in the choice, for a simple reason: the movie needs to be crammed in between other demanding priorities.
So last Saturday, when I was in the middle of this process, searching for a movie shorter than 1 hour 30 minutes (there was a hard stop on that), my wife commented, "But the short movies are generally not so good." I did not pay much heed to that then (please don't conclude anything from this), but later I thought: hold on, is that a hypothesis? Can I do something statistical here? And here we are. We will talk a little bit about correlation, the normal distribution and so on. I use R, but it is so simple that we could even use an Excel sheet to do the same.
The correlation coefficient is an indicator whose value lies between -1 and 1, and it indicates the strength of the linear relationship between two variables. Leaving the jargon aside, in many cases we simply want to know whether two features are related. Typical laws of physics, like the one relating speed and displacement, may show a perfect correlation, but those are not the point of interest. A more interesting question is whether there is a relationship between, say:
a) IQ Score of a person and Salary drawn
b) No. of obese people in an area vis-à-vis no. of fast-food centers in the locality
c) No. of Facebook friends and relationship shelf life
d) No. of hours spent in office and attrition rate for an organization
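One underlying technicality I must point out here: the coefficient itself can be computed for any two numeric variables, but the usual significance test for it assumes that both variables roughly follow a normal distribution.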
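To make the coefficient less of a black box, here is a minimal sketch in R that computes it by hand (the sum of the products of the deviations from the two means, divided by the square root of the product of the two sums of squared deviations) and compares the result with the built-in cor(). The toy numbers and the helper name pearson_r are made up for illustration; they are not from the movie data.

```r
# Hypothetical toy data (not from the movie list): hours studied and test score
hours <- c(1, 2, 3, 4, 5, 6)
score <- c(52, 55, 61, 60, 68, 71)

# Pearson correlation coefficient computed from its definition
pearson_r <- function(a, b) {
  da <- a - mean(a)  # deviations from the mean of a
  db <- b - mean(b)  # deviations from the mean of b
  sum(da * db) / sqrt(sum(da^2) * sum(db^2))
}

r_manual  <- pearson_r(hours, score)
r_builtin <- cor(hours, score)  # R's built-in uses Pearson by default

print(c(manual = r_manual, builtin = r_builtin))
```

Both numbers should agree. A value near 1 means the points lie close to an upward-sloping straight line, and a value near -1 means a downward-sloping one.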
The normal distribution is the most common probability distribution: a bell-shaped curve with equal spread on both sides of the mean. Associates and managers alike must have heard about normalization and the bell curve while facing or doing appraisals. Many random phenomena across disciplines approximately follow a normal distribution. The image below is from the internet.
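Two properties of the bell curve, its symmetry around the mean and the share of values that fall close to the mean, can be checked with a few lines of R. This is only an illustrative sketch using the standard normal curve (mean 0, standard deviation 1); the variable names are my own.

```r
# Standard normal curve; any other mean/sd behaves the same way
mu <- 0
sigma <- 1

# Symmetry: the density is the same at equal distances on either side of the mean
left_density  <- dnorm(mu - 1.5 * sigma, mean = mu, sd = sigma)
right_density <- dnorm(mu + 1.5 * sigma, mean = mu, sd = sigma)

# Share of values within one and two standard deviations of the mean
within_1sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)
within_2sd <- pnorm(mu + 2 * sigma, mu, sigma) - pnorm(mu - 2 * sigma, mu, sigma)

round(c(within_1sd, within_2sd), 3)  # 0.683 0.954
```

So roughly 68% of the values lie within one standard deviation of the mean and about 95% within two.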
So I picked up movie information and, like any one of us would, took it from IMDB (http://www.imdb.com/) and put it in a structured form like the one below. The highlighted fields may not be required at this point; I kept them with some future work in mind. The list was prepared manually; I will keep hunting for an API and will keep you posted.
[Sample table; only part of it survives. Fields include: Year of Release, and a plot summary such as "Bond's loyalty to M is tested as her past comes back to haunt her. As MI6 comes under attack, 007 must track down and destroy the threat, no matter how personal the cost."]
At this point I have collected 183 movies and stored them as a CSV file.
First things first: there are various formal ways to test whether a variable follows a normal distribution, but I will just plot histograms and see how they look. Both variables seem to follow a normal distribution closely.
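For readers who prefer one of the formal tests over eyeballing a histogram, one common choice is the Shapiro-Wilk test, available in base R as shapiro.test(). The sketch below runs it on simulated data, since the actual file is not reproduced here; the means, spreads and variable names are assumptions made up for illustration.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-ins for 183 movie durations in minutes (NOT the real data)
bell_shaped <- rnorm(183, mean = 110, sd = 20)  # drawn from a normal distribution
skewed      <- rexp(183, rate = 1 / 110)        # deliberately non-normal

# Null hypothesis of the test: the sample comes from a normal distribution
sw_bell   <- shapiro.test(bell_shaped)
sw_skewed <- shapiro.test(skewed)

# A small p-value (say below 0.05) is evidence against normality
print(sw_bell$p.value)
print(sw_skewed$p.value)
```

On the real data the same call would be made on the rating and duration vectors. With 183 observations even mild departures from normality can produce a small p-value, so a look at the histogram is still worthwhile.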
film <- read.csv("film.csv", header = TRUE)  # Read the file into a data frame
x <- as.matrix(film)  # Convert the data frame to a matrix for column access
y <- as.numeric(x[, 3])  # Movie rating (column 3) as a numeric vector
z <- as.numeric(x[, 4])  # Movie duration (column 4) as a numeric vector
hist(z, col = "green", border = "black", xlab = "Duration", ylab = "mvfreq", main = "Mv Duration Distribution", breaks = 7)
hist(y, col = "blue", border = "black", xlab = "mvRtng", ylab = "mvfreq", main = "Mv Rtng Distribution", breaks = 9)
cor(y, z)  # Correlation coefficient between rating and duration
Interestingly, the correlation turns out to be 0.48 in this case, which says there is a positive correlation between these two phenomena, and the correlation is not small. We can set up the null hypothesis "there is no correlation", choose a level of significance, and test it. With 183 movies, 0.48 is a sizeable value, and I am sure we would reject the hypothesis that there is no correlation.
So, one way or another, the rating tends to go up with the duration of the movie, at least in this sample.
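The test described above can be sketched in R. The first part plugs the reported figures (a correlation of 0.48 from 183 movies) into the standard t statistic for a Pearson correlation. The second part shows the built-in cor.test() on simulated data, because the actual file is not reproduced here; the simulated numbers and variable names are assumptions for illustration only.

```r
# Part 1: test "no correlation" using only the reported r and sample size
r <- 0.48
n <- 183
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)   # t statistic with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value
print(c(t = t_stat, p = p_value))

# Part 2: the same test via cor.test() on simulated stand-in data (NOT the real data)
set.seed(1)
duration_sim <- rnorm(n, mean = 110, sd = 20)                   # minutes
rating_sim   <- 5 + 0.02 * duration_sim + rnorm(n, sd = 0.7)    # rating loosely tied to duration
result <- cor.test(rating_sim, duration_sim)                    # Pearson by default
print(result$estimate)
print(result$p.value)
```

For the reported figures the t statistic comes out a little above 7, far beyond what chance alone would explain at any usual level of significance. On the real data, cor.test(y, z) would give the estimate, the p-value and a confidence interval in one call.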
With that I will call it a day; I hope you enjoyed reading. I will be coming back with more such posts. Looking forward to your feedback and comments.