
Opting for shorter movies? Be aware you might be cutting the entertainment too!

[This article was first published on Exploring and experiencing analytics, and kindly contributed to R-bloggers.]
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Hello Friends,
This time I thought I would bring in a little more spice and focus on movies. I don't know about you, but I am a movie buff. Often on a weekend, when I am trying to pick a movie from my collection, which now spans a few TBs, I feel a little lost. Apart from the general rating or reputation, the length of the movie plays a role in the choice, for a simple reason: the movie needs to be squeezed in between other demanding priorities.
So last Saturday, when I was in the middle of this process and searching for a movie shorter than 1 hour 30 minutes (there was a hard stop on that), my wife commented, "Short movies are generally not so good." I did not pay much heed to that at the time (please don't conclude anything from this), but later I thought: hold on, is that a hypothesis? Can I do something statistical here? And here we are. We will talk a little about correlation, the normal distribution and so on. I use R, but the analysis is so simple that we could even do the same in an Excel sheet.
Correlation:
This is an indicator whose value lies between -1 and 1, and it indicates the strength of the linear relationship between two variables. Leaving the jargon aside, in many cases we simply want to relate two features. Laws of physics, such as the relation between speed and displacement, may show a perfect correlation, but those are not the point of interest here. A more interesting question is whether there is a relation between, say:
a)      IQ score of a person and salary drawn
b)      No. of obese people in an area and no. of fast-food centers in the locality
c)      No. of Facebook friends and relationship shelf life
d)      No. of hours spent in the office and the attrition rate of an organization
An underlying technicality I must point out here: the usual significance test for this correlation assumes that both variables roughly follow a normal distribution.
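To make the definition a bit more concrete, here is a minimal sketch of how the correlation coefficient can be computed by hand and checked against R's built-in `cor()`. The two vectors are toy numbers made up purely for illustration (they are not from my film list), and `pearson_r` is just a helper name I picked.

```r
# Hypothetical toy data, made up for illustration only (not the film list)
duration <- c(90, 100, 110, 120, 130, 140)
rating   <- c(6.1, 6.0, 6.8, 7.2, 7.0, 7.9)

# Pearson correlation: sum of cross-deviations scaled by both spreads
pearson_r <- function(a, b) {
  da <- a - mean(a)   # deviations from the mean of a
  db <- b - mean(b)   # deviations from the mean of b
  sum(da * db) / sqrt(sum(da^2) * sum(db^2))
}

r_manual  <- pearson_r(duration, rating)
r_builtin <- cor(duration, rating)   # base R, Pearson by default

print(c(manual = r_manual, builtin = r_builtin))
```

The two values should agree, and a perfectly linear pair such as `1:5` and `2 * (1:5)` gives exactly 1.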
Normal Distribution:
This is the most common probability distribution, a bell-shaped curve with equal spread on both sides of the mean. Associate and manager alike, you must have heard about normalization and the bell curve when you face or do an appraisal. Many random quantities across disciplines approximately follow a normal distribution. The image below is taken from the internet.
[Image: bell curve of the normal distribution]
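If you would like to draw the bell curve yourself, the following is a small sketch using base R's `dnorm()` and `pnorm()`. The mean of 0 and standard deviation of 1 are illustrative choices on my part, not values taken from the movie data.

```r
# Standard normal density on a grid (mean 0 and sd 1 are illustrative choices)
grid    <- seq(-4, 4, by = 0.1)
density <- dnorm(grid, mean = 0, sd = 1)

# The curve is symmetric around the mean ...
symmetric <- isTRUE(all.equal(density, rev(density)))

# ... and about 68% of the probability lies within one standard deviation
within_one_sd <- pnorm(1) - pnorm(-1)

# Draw the bell shape
plot(grid, density, type = "l",
     xlab = "value", ylab = "density", main = "Standard normal curve")
```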

So I picked up movie information and, like any one of us, took it from IMDB (http://www.imdb.com/) and put it in a structured form like the one below. The highlighted columns may not be required at this point; I kept them with some future work in mind. The list was prepared manually; I will keep hunting for an API and will keep you posted.

Name | Year of Release | Rating | Duration (min) | Small Desc
Skyfall | 2012 | 8.1 | 143 | Bond's loyalty to M is tested as her past comes back to haunt her. As MI6 comes under attack, 007 must track down and destroy the threat, no matter how personal the cost.


At this point I have taken 183 movies and stored them as a CSV file.
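As a rough sketch of the file layout, the snippet below writes the single Skyfall row shown above to a temporary CSV and reads it back. The header names (`Name`, `Year`, `Rating`, `Duration`, `SmallDesc`) are my assumption for illustration, and the description is shortened; the only thing that matters for the later commands is that the rating sits in column 3 and the duration in column 4.

```r
# Write the one row quoted in the post to a temporary CSV.
# Header names are assumed for illustration; the description is shortened.
csv_path <- tempfile(fileext = ".csv")
sample_film <- data.frame(
  Name      = "Skyfall",
  Year      = 2012,
  Rating    = 8.1,
  Duration  = 143,
  SmallDesc = "Bond's loyalty to M is tested as her past comes back to haunt her.",
  stringsAsFactors = FALSE
)
write.csv(sample_film, csv_path, row.names = FALSE)

# Read it back the same way the real file would be read
film_check <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(film_check)      # shows the column types
film_check[, 3]      # rating
film_check[, 4]      # duration in minutes
```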
First things first: there are various formal ways to test whether the data follow a normal distribution, but I will just plot histograms and see how they look. Both variables seem to follow a normal distribution closely.
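For anyone who prefers one of the formal ways, here is a minimal sketch using the Shapiro-Wilk test and a Q-Q plot from base R. I did not run this on my film list; the data below are simulated stand-in values, and the mean of 120 and standard deviation of 20 are arbitrary assumptions.

```r
set.seed(42)  # makes the simulated stand-in data reproducible

# Stand-in for the duration column: 183 simulated values, NOT the real film data
sim_duration <- rnorm(183, mean = 120, sd = 20)

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed
sw <- shapiro.test(sim_duration)
print(sw$p.value)  # a small p-value (say below 0.05) would be evidence against normality

# Q-Q plot: points lying close to the line suggest approximate normality
qqnorm(sim_duration)
qqline(sim_duration)
```

With real data, the same two calls would simply be applied to the rating and duration vectors.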

[Figures: histograms of movie duration and movie rating]

Below are the commands for quick reference. What I just adore about R is its simplicity; with just a few commands we are done.
film <- read.csv("film.csv", header = TRUE)  # Read the file into a data frame
x <- as.matrix(film)  # Convert the data frame to a matrix
y <- as.numeric(x[, 3])  # Movie rating (column 3) as a numeric vector
z <- as.numeric(x[, 4])  # Movie duration (column 4) as a numeric vector
hist(z, col = "green", border = "black", xlab = "Duration", ylab = "mvfreq", main = "Mv Duration Distribution", breaks = 7)
hist(y, col = "blue", border = "black", xlab = "mvRtng", ylab = "mvfreq", main = "Mv Rtng Distribution", breaks = 9)
cor(y, z)  # Correlation coefficient between rating and duration
Interestingly, the correlation turns out to be .48 in this case, which says there is a positive correlation between these two variables, and it is not small. We can set up the null hypothesis "there is no correlation", choose a level of significance, and test it. With a value as high as .48 across 183 movies, I am sure we would reject the hypothesis that there is no correlation.
So one way or another, the rating tends to go up with the duration of the movie.
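The hypothesis test mentioned above can be sketched from the two numbers already on the table, the correlation of .48 and the 183 movies. This is a back-of-the-envelope check using the standard t statistic for a correlation; it assumes all 183 rows were used and picks a 5% two-sided significance level, which is my choice for illustration.

```r
r <- 0.48   # correlation reported in the post
n <- 183    # number of movies (assumes no rows were dropped for missing values)

# t statistic for H0: the true correlation is zero, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)   # two-sided p-value

# Two-sided 5% critical value (the 5% level is an illustrative choice)
t_crit <- qt(0.975, df = n - 2)
reject <- abs(t_stat) > t_crit

print(c(t_stat = t_stat, t_crit = t_crit, p_value = p_value))
```

The t statistic comes out at roughly 7.4, far beyond the critical value of about 1.97, so the "no correlation" hypothesis is rejected. With the raw data at hand, base R's `cor.test()` on the rating and duration vectors runs the same test and adds a confidence interval.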
I leave the interpretation to you, but next time you might look at the movie duration before making a call! Directors, this might be a tip for you too, who knows. And for me, maybe the wife is always right. Maybe all that is short is not that sweet.
With that I will call it a day; I hope you enjoyed reading. I will be coming back with more such posts. Looking forward to your feedback and comments.






To leave a comment for the author, please follow the link and comment on their blog: Exploring and experiencing analytics.
