Network analysis in Bollywood

[This article was first published on R | Asitav Sen, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Network Analysis

Network theory is the study of graphs as a representation of relationship between discrete elements. When applied to social relations, it is known as social network analysis.1

Bollywood

In this article, network theory is applied to analyse relationship between some professionals in bollywood, based on data from movie set. The data has been compiled by Parth Parikh from various sources. The analysis in this article involves a subset of relevant data.

Data preparation

The CSVs downloaded are imported as following

  • film1: Movie details from 1950 to 1989
  • film2: Movie details from 1990 to 2009
  • film3: Movie details from 2010 to 2019
  • crew: Crew information with unique identifier
  • filmcrew: Crew details (unique identifier) in each movie
#Combining film data
films<-rbind(film1,film2,film3)
rm(film1,film2,film3)
# separating columns with actor names, since actor names are in single column
f<-films%>%
separate(actors,c
("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10"),sep="[|]")%>%
filter(a1!="NA")
#dedup any possible duplicates in movie
f<-f[!duplicated(f$imdb_id), ]
rm(films)
# selecting relevant columns from filmcrew
fc<-filmcrew[,c(1,2,4,5)]
rm(filmcrew)
# separating column with writer names, since writer names are in single column and deleting rows with no writer
c<-crew%>%
separate(writers, c("w1","w2","w3","w4","w5","w6","w7","w8","w9","w10"), sep="[|]")%>%
filter(w1!="\\N")
#dedup any possible duplicates in movie
c<-c[!duplicated(c$imdb_id), ]
rm(crew)
#Compilining actor and crew info per movie together
full.coded.raw<-c%>%right_join(f,by=c("imdb_id"))
#Relevant columns are directors, writers and actors i.e. director, wx and ax
dt<-full.coded.raw[,c(2:12,16:25)]
rm(full.coded.raw)
#To create a file suitable for network analysis, a 2 column file is required, which depicts relationships. This can be achieved by creating combination of columns and binding them together. It can be done manually or the `juggling_jaguar` function from package `Rmessy` can be used. The package is under development. Please feel free to use it and develop it further. The package can be downloaded from github here. Or it can be installed using devtools using the following command.
#>install.packages("devtools") #if not installed
#>devtools::install_github("asitav-sen/Rmessy")
#using juggling_jaguar
dt.net<-juggling_jaguar(dt)
rm(dt)
# names(dt.net)[1]="x"
# names(dt.net)[2]="y"
#
# dt.net<-
# dt.net%>%filter(x!=y)
#The data frame contains unique ids of the crew (and names of the actors). The data frame fc contains the relevant information to get the names of the relevant codes. However, since actor names are not coded, it is important that their names remain intact.
dt.net1<-
dt.net%>%
left_join(fc,by=c("x"="crew_id"))%>%
mutate(from=ifelse(is.na(name),x,name))%>%
select(c(6,2))%>%
left_join(fc,by=c("y"="crew_id"))%>%
mutate(to=ifelse(is.na(name),y,name))%>%
select(c(1,6))
rm(dt.net)
# Since some of the names were not found in the crew list available, these rows may be deleted. This is optional. One may want to analyse using the codes.
dt.net2<-
dt.net1%>%
filter(!str_detect(to,"^nm"))%>%
filter(!str_detect(from,"^nm"))
rm(dt.net1)
#Removing possible empty rows
rm.ro<-which(dt.net2$to=="")
dt.net2<-dt.net2[-rm.ro,]
#In this data set, pair of A-B and B-A are considered different. However, they are ultimately same. Hence, the data was further rectified.
# converting df in igraph file
gra.ph<-graph_from_data_frame(d=dt.net2, directed=FALSE)
# converting back to data frame
data.df<-get.data.frame(gra.ph)
rm(gra.ph)
#To analyze 'strength of relation' one can assume the number of times people have worked together to be a good indicator.
data.df<-
data.df%>%
group_by(from,to)%>%
count()%>%
arrange(desc(n))%>%
rename(works=n)%>%
filter(from!=to)
# creating graph object and Removing scatters
gra.ph<-graph_from_data_frame(d=data.df, directed=FALSE)
gra.ph$weight<-data.df$works
V(gra.ph)$comp <- components(gra.ph)$membership
gra.main <- induced_subgraph(gra.ph,V(gra.ph)$comp==1)
rm(gra.ph)

Analyses

Understanding importance of nodes (individuals) and the network

There is immense inequality in the importance of the individuals. Out of all, very few individuals have worked with more than 100 different people in the industry. This is observed through histogram of degrees. Similar trend is observed in the histogram of betweenness, which is another indicator of importance. Roughly, betweenness in this case can be simplified as the tendency to do have worked with different individuals who have not worked together. (This is an oversimplification)

main.deg<-degree(gra.main, mode = "all")
main.bw<-betweenness(gra.main,directed = FALSE, normalized = TRUE)
eigen.main<-eigen_centrality(gra.main)
par(mfrow=c(1,3))
hist(main.deg, breaks = 10, main = "Degree", xlab = "Degree")
hist(main.bw, breaks = 100, main = "Betweenness", xlab = "Betweenness")
hist(eigen.main$vector, breaks = 100, main= "Eigen Vector", xlab = "Eigen Vector")

The top individuals identified are mentioned in the table.

bydegree<-sort(main.deg, decreasing=TRUE)[1:20]
bybetweenness<-sort(main.bw, decreasing=TRUE)[1:20]
byeigen<-sort(eigen.main$vector,decreasing = TRUE)[1:20]
top<-data.frame(bydegree,bybetweenness,byeigen)
top
## bydegree bybetweenness byeigen
## Aruna Irani 764 0.03596357 1.0000000
## Anupam Kher 743 0.02947958 0.9435123
## Shakti Kapoor 715 0.02653782 0.9166490
## Gulshan Grover 713 0.02410787 0.9026478
## Prem Chopra 671 0.02088515 0.8755179
## Dharmendra 671 0.02049494 0.8719768
## Asrani 669 0.02017910 0.8558145
## Amitabh Bachchan 630 0.01996406 0.8477890
## Mithun Chakraborty 628 0.01778346 0.8053414
## Pran 597 0.01778251 0.7885631
## Amrish Puri 591 0.01704035 0.7849753
## Om Puri 578 0.01702614 0.7789609
## Satyendra Kapoor 571 0.01491215 0.7771716
## Jeetendra 565 0.01470073 0.7758555
## Kader Khan 563 0.01375074 0.7718398
## Paresh Rawal 542 0.01205046 0.7698608
## Madan Puri 535 0.01181536 0.7682778
## Rekha 535 0.01159546 0.7663677
## Johnny Lever 533 0.01133155 0.7619251
## Ashok Kumar 531 0.01062228 0.7604579

Dilip Kumar Saira Banu Aruna Irani still8 Shakti Kapoor 1


The farthest or the longest connection in the network is between Ajai Sinha and Edwin Fernandes with 5 individuals in between.

farthest_vertices(gra.main)
## $vertices
## + 2/5134 vertices, named, from bc1c079:
## [1] Ajai Sinha Edwin Fernandes
##
## $distance
## [1] 6
get_diameter(gra.main) 
## + 7/5134 vertices, named, from bc1c079:
## [1] Ajai Sinha Ishaan Trivedi Juhi Chawla
## [4] Shakti Kapoor Ananth Narayan Mahadevan Sridhar Rangayan
## [7] Edwin Fernandes

The assortativity based on degree i.e. tendency for individuals to work with other individuals with similar degree (connections), lies somewhere in the middle, near 0. There is almost equal mix of cases.

assortativity.degree(gra.main, directed= FALSE)
## [1] -0.02778606

Transitivity of 0.23 is much higher than that of randomly generated network of similar properties. However, it is not uncommon to observe social networks to have transitivity between 03. to 0.6.2 Transitivity measures how well connected the network is. (Oversimplification)

# creating random trees for comparison
# *****Requires substantial computational power*****
rnd.main <- vector('list',500)
dens.main<-edge_density(gra.main)
n=gorder(gra.main)
for(i in 1:500){
rnd.main[[i]] <- erdos.renyi.game(n=n, p.or.m = dens.main, type = "gnp")
}
tra.main<-transitivity(gra.main)
tra.rnd <- unlist(lapply(rnd.main, transitivity))
par(mfrow=c(1,2))
hist(tra.rnd, main="Transitivity")
abline(v=tra.main)
hist(tra.rnd, main="Transitivity, x-axis extended", xlim = c(0,0.3))
abline(v=tra.main)

rm(rnd.main,tra.rnd)
#similar test ca be prformed for other properties like diameter, max cliques etc.
# dia.main<- diameter(gra.main, directed = FALSE)
# dia.rnd <- unlist(lapply(rnd.main, diameter, directed = FALSE))
# max.c.main<-max_cliques(gra.main)
# lar.c.main<-largest_cliques(gra.main)

Understanding communities/clusters in the network

Fast Greedy algorithm identifies several segments, top five of which are as follows.

#Fast Greedy
com.fg<-fastgreedy.community(gra.main)
sort(sizes(com.fg), decreasing = TRUE)[1:5]
## Community sizes
## 1 2 4 3 16
## 2412 1660 351 76 55
#To check membership
#membership(com.fg)
#membership(com.fg)[membership(com.fg)=1]
#membership(com.fg)[names(membership(com.fg))="Amitabh Bahchan"]

Following is the plot of cluster 4 , third largest community identified by the algorithm.

comm.fg.4 <- as.undirected(induced_subgraph(gra.main, com.fg[[4]]))
comm.fg.4.deg<-degree(comm.fg.4, mode = "all")
par(bg="black", mfrow=c(1,1))
plot(comm.fg.4,
rescale= TRUE,
vertex.label = ifelse(degree(comm.fg.4) >= 20, names(V(comm.fg.4)), NA),
vertex.color = adjustcolor("gold", alpha.f = .5),
vertex.size = sqrt(comm.fg.4.deg),
layout = layout_with_lgl(comm.fg.4),
vertex.label.cex= 0.75,
vertex.label.degree=pi/2,
vertex.label.dist=1.5,
vertex.label.color="white",
edge.curved=0.5,
edge.width= 0.5,
edge.color = ifelse(comm.fg.4$weight>25, "dark green", "dark red"))
title("Network of cluster 4",cex.main=1,col.main="white")
legend("topright", c("<=25 times",">25 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

Central figure cannot be identified from this graph. There seems to be several who can clain to be ‘central’. Eminent personality like Kamal Hassan, Girish Karnad, Smita Patil etc. are also present in this segment. Satyajit Ray and Anil Chatterjee too. Interestingly, they some of the major characters of parallel cinema.

Mahabanoo Mody Kotwal seems to be the central figure in cluster 3. This group does not seem to consist of bollywood blockbuster creators. However, they have gained popularity in regional movies and television. Some of them are foreigners too.

comm.fg.3 <- as.undirected(induced_subgraph(gra.main, com.fg[[3]]))
comm.fg.3.deg<-degree(comm.fg.3, mode = "all")
par(bg="black", mfrow=c(1,1))
plot(comm.fg.3,
rescale= TRUE,
vertex.label = ifelse(comm.fg.3.deg >= 11, names(V(comm.fg.3)), NA),
vertex.color = adjustcolor("gold", alpha.f = .5),
vertex.size = comm.fg.3.deg^(1/5),
layout = layout_with_lgl(comm.fg.3),
vertex.label.cex= 0.75,
vertex.label.degree=pi/2,
vertex.label.dist=1,
vertex.label.color="white",
edge.curved=0.5,
edge.width= 0.5,
edge.color = ifelse(comm.fg.3$weight>25, "dark green", "dark red"),
xlim = c(-1,1.1),
asp=-0.5)
title("Network of cluster 3",cex.main=1,col.main="white")
legend("topright", c("<=25 times",">25 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

Cluster 1 and 2 are too big and complex to show anything meaningful in the network plot. They need to be broken down further or other segmenting methods need to be used to capture different segments.

comm.fg.1 <- as.undirected(induced_subgraph(gra.main, com.fg[[1]]))
comm.fg.1.deg<-degree(comm.fg.1, mode = "all")
comm.fg.2 <- as.undirected(induced_subgraph(gra.main, com.fg[[2]]))
comm.fg.2.deg<-degree(comm.fg.2, mode = "all")
# cluster 1
par(bg="black", mfrow=c(2,2))
plot(comm.fg.1,
rescale= TRUE,
vertex.label = NA, #ifelse(comm.fg.2.deg >= 350, names(V(comm.fg.2)), NA),
vertex.color = adjustcolor("gold", alpha.f = .5),
vertex.size = ifelse(comm.fg.1.deg<50,0.1,(comm.fg.1.deg)^(1/4)),
layout = layout_with_lgl(comm.fg.1),
vertex.label.cex= 0.75,
vertex.label.degree=pi/2,
vertex.label.dist=1,
vertex.label.color="black",
edge.curved=0.5,
edge.width= 0.5,
edge.color = adjustcolor(ifelse(comm.fg.1$weight>10, "dark green", "dark red"),alpha=0.3),
xlim = c(-1,1.1),
asp=-1,
axes = F)
title("Network of cluster 1",cex.main=1,col.main="white")
legend("topright", c("<=10 times",">10 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")
#cluster 1, zoomed
plot(comm.fg.1,
rescale= TRUE,
vertex.label = ifelse(comm.fg.1.deg >= 350, names(V(comm.fg.1)), NA),
vertex.color = adjustcolor("gold", alpha.f = .3),
vertex.size = ifelse(comm.fg.1.deg<50,0.1,(comm.fg.1.deg)^(1/5)),
layout = layout_with_lgl(comm.fg.1),
vertex.label.cex= 0.75,
vertex.label.degree=pi/2,
vertex.label.dist=1,
vertex.label.color="black",
edge.curved=0.5,
edge.width= 0.5,
edge.color = adjustcolor(ifelse(comm.fg.1$weight>10, "dark green", "dark red"),alpha=0.2),
xlim = c(-0.025,0.025),
ylim = c(-0.025,0.025),
asp=-1,
axes = F)
title("Network of cluster 1, zoomed",cex.main=1,col.main="white")
legend("topright", c("<=10 times",">10 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")
# cluster 2
plot(comm.fg.2,
rescale= TRUE,
vertex.label = NA, #ifelse(comm.fg.2.deg >= 350, names(V(comm.fg.2)), NA),
vertex.color = adjustcolor("gold", alpha.f = .5),
vertex.size = ifelse(comm.fg.2.deg<50,0.1,(comm.fg.2.deg)^(1/4)),
layout = layout_with_lgl(comm.fg.2),
vertex.label.cex= 0.75,
vertex.label.degree=pi/2,
vertex.label.dist=1,
vertex.label.color="black",
edge.curved=0.5,
edge.width= 0.5,
edge.color = adjustcolor(ifelse(comm.fg.2$weight>10, "dark green", "dark red"),alpha=0.3),
xlim = c(-1,1.1),
asp=-1,
axes = F)
title("Network of cluster 2",cex.main=1,col.main="white")
legend("topright", c("<=10 times",">10 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")
#cluster 2, zoomed
plot(comm.fg.2,
rescale= TRUE,
vertex.label = ifelse(comm.fg.2.deg >= 350, names(V(comm.fg.2)), NA),
vertex.color = adjustcolor("gold", alpha.f = .3),
vertex.size = ifelse(comm.fg.2.deg<50,0.1,(comm.fg.2.deg)^(1/5)),
layout = layout_with_lgl(comm.fg.2),
vertex.label.cex= 0.75,
vertex.label.degree=pi/2,
vertex.label.dist=1,
vertex.label.color="black",
edge.curved=0.5,
edge.width= 0.5,
edge.color = adjustcolor(ifelse(comm.fg.2$weight>10, "dark green", "dark red"),alpha=0.2),
xlim = c(-0.025,0.025),
ylim = c(-0.025,0.025),
asp=-1,
axes = F)
title("Network of cluster 2, zoomed",cex.main=1,col.main="white")
legend("topright", c("<=10 times",">10 times"), pch=21, col="white", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

Comparing the clusters

l1<-mapping_monkey(comm.fg.1)
l2<-mapping_monkey(comm.fg.2)
l3<-mapping_monkey(comm.fg.3)
l4<-mapping_monkey(comm.fg.4)

Comparing the histogram of degrees, betweenness and eigen vectors do not show any significant difference or unexpected outcome. In the histogram of eigen vectors of segment 3, the pattern is different from others. There are more individuals with higher eigen vectors. This is natural in a smaller group (Segment 3 is very small). People in a smaller group tend to connect to each other more than in those in larger group.

par(mfrow=c(3,4))
hist(l1[[1]], breaks = 50, col=adjustcolor("black", alpha=0.3), main = "Degrees, Segment 1", xlab = "Degrees")
hist(l2[[1]], breaks = 50, col=adjustcolor("red", alpha=0.3), main = "Degrees, Segment 2", xlab = "Degrees")
hist(l3[[1]], breaks = 50, col=adjustcolor("green", alpha=0.3), main = "Degrees, Segment 3", xlab = "Degrees")
hist(l4[[1]], breaks = 50, col=adjustcolor("blue", alpha=0.3), main = "Degrees, Segment 4", xlab = "Degrees")
hist(l1[[2]], breaks = 50, col=adjustcolor("black", alpha=0.3), main = "Betweenness, Segment 1", xlab = "Betweenness")
hist(l2[[2]], breaks = 50, col=adjustcolor("red", alpha=0.3), main = "Betweenness, Segment 2", xlab = "Betweenness")
hist(l3[[2]], breaks = 50, col=adjustcolor("green", alpha=0.3), main = "Betweenness, Segment 3", xlab = "Betweenness")
hist(l4[[2]], breaks = 50, col=adjustcolor("blue", alpha=0.3), main = "Betweenness, Segment 4", xlab = "Betweenness")
hist(l1[[3]], breaks = 50, col=adjustcolor("black", alpha=0.3), main = "Eigen vectors, Segment 1", xlab = "Eigen Vector")
hist(l2[[3]], breaks = 50, col=adjustcolor("red", alpha=0.3), main = "Eigen vectors, Segment 2", xlab = "Eigen Vector")
hist(l3[[3]], breaks = 50, col=adjustcolor("green", alpha=0.3), main = "Eigen vectors, Segment 3", xlab = "Eigen Vector")
hist(l4[[3]], breaks = 50, col=adjustcolor("blue", alpha=0.3), main = "Eigen vectors, Segment 4", xlab = "Eigen Vector")

Variation in edge density is noticed in the segments. Segment 3, being smallest, can be expected to have higher density (people connected to each other). Comparing density of segment 2 and 4 is interesting. The population of segment 4 is much smaller. Yet, the density is lower than that of segment 2. This indicates that individuals in segment 2 are more connected (have worked with) to each other than those in segment 4. This, kind of, gets reinforced when the diameter is observed. The diameter of segment 4 is 8, compared to 4 of segment 2. This means that there are 3 connections in between the farthest points of the network in segment 2, compared to 8 in case of segment 4 (despite substantially lower population).

Another interesting observation is the assortativity of segment 3, which is highest. The tendency to stick together with people with similar number of connections is higher in segment 3.

seg1<-c(l1[[4]],l1[[5]],l1[[8]],l1[[9]])
seg2<-c(l2[[4]],l2[[5]],l2[[8]],l2[[9]])
seg3<-c(l3[[4]],l3[[5]],l3[[8]],l3[[9]])
seg4<-c(l4[[4]],l4[[5]],l4[[8]],l4[[9]])
rname<-c("density", "diameter","assortativity", "transitivity")
tba<-data.frame(`segmnt 1`= seg1,`segmnt 2`= seg2,`segmnt 3`= seg3,`segmnt 4`= seg4)
rownames(tba)<-rname
tba
## segmnt.1 segmnt.2 segmnt.3 segmnt.4
## density 0.01440812 0.02934123 0.1049123 0.02049654
## diameter 5.00000000 4.00000000 9.0000000 8.00000000
## assortativity -0.07202416 -0.11842136 0.1778428 -0.05157090
## transitivity 0.22909242 0.22909242 0.2290924 0.22909242

Final plot (for fun)

For fun, graph with individuals with highest eigen vector score is plotted to visualize the network among themselves.

ev<-sort(eigen.main$vector, decreasing = T)[1:50]
ename<-c(names(ev))
n<-c(match(ename,V(gra.main)$name))
rock<-induced_subgraph(gra.main,vids=n)
rock.d<-degree(rock,mode = "all")
par(bg="black")
plot(rock,
rescale= TRUE,
vertex.label = ifelse(rock.d >= 11, names(V(rock)), NA),
vertex.color = adjustcolor("gold", alpha.f = .5),
vertex.size = ifelse(rock.d<10,0.1,sqrt(rock.d)),
layout = layout_with_lgl(rock),
vertex.label.cex= 0.75,
vertex.label.degree=pi/2,
vertex.label.dist=1,
vertex.label.color="white",
edge.curved=0.5,
edge.width= 0.5,
edge.color = adjustcolor(ifelse(rock$weight>8, "dark green", "dark red"),alpha=0.9),
xlim = c(-1,1),
#asp=-1,
axes = F)
title("Network of top guns",cex.main=1,col.main="white")
legend("topright", c("<=8 times",">8 times"), pch=21, col="black", pt.bg=c("dark red","dark green"), pt.cex=2, cex=.8, bty="n", ncol=1, title = "Worked Together", text.col = "white")

This analysis can be further extended. Esp, by using clustering algorithms other than fast greedy. Moreover, detailed analysis of ego graphs may reveal interesting insights. Don’t forget to share your results, if you do any of it.

Contact me if - You want to understand how network analysis can help in your sales and marketing efforts. - You are looking to collaborate for some investigation/research.

Sources

To leave a comment for the author, please follow the link and comment on their blog: R | Asitav Sen.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)