Statistical Interests in Large Cities

January 10, 2014

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

I always thought that there were some kinds of schools in statistics: places (not to say universities or laboratories) where people share a common interest in terms of statistical methodology, such as people with a strong interest in extreme values, or in Lévy processes. I wanted to check this point, so I extracted information about articles published in about 35 journals in statistics, probability and econometrics. I got all the information from files exported from http://scopus.com/

> setwd("/home/arthur/Documents/scopus/")
> L=list.files()
> z=NULL
> for(i in 1:length(L)){
+ B=read.csv(L[i])
+ z=c(z,as.character(B$Source.title))
+ }
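The same import can be written without growing a vector inside a loop. The sketch below is only an illustration: it creates two tiny CSV files in a temporary folder as stand-ins for the Scopus exports (the file names and contents are made up), then reads them with lapply.

```r
# Toy setup: two small CSV files in a temporary folder (stand-ins for the Scopus exports)
folder <- file.path(tempdir(), "scopus_demo")
dir.create(folder, showWarnings = FALSE)
write.csv(data.frame(Source.title = c("Biometrika", "Bioinformatics")),
          file.path(folder, "export1.csv"), row.names = FALSE)
write.csv(data.frame(Source.title = "Biometrika"),
          file.path(folder, "export2.csv"), row.names = FALSE)

# Read every CSV file once and keep the journal column of each
files  <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read.csv, stringsAsFactors = FALSE)
z_demo <- unlist(lapply(tables, function(b) b$Source.title))

# Number of records per journal, most frequent first
Z_demo <- sort(table(z_demo), decreasing = TRUE)
print(Z_demo)
```

Using full.names=TRUE avoids the need for setwd(), and keeping the list of data frames means the files do not have to be read a second time for the affiliations.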

Here is the list of the publications I have used

> Z=sort(table(z),decreasing=TRUE)
> Z[1:34]
                                 Computational Statistics and Data Analysis 
                                                                       4000 
                                           Journal of Multivariate Analysis 
                                                                       4000 
                                                         Econometric Theory 
                                                                       2631 
                                              Annals of Applied Probability 
                                                                       2051 
                                                             Bioinformatics 
                                                                       2000 
                                                                 Biometrika 
                                                                       2000 
                                                    Journal of Econometrics 
                                                                       2000 
                              Journal of Statistical Planning and Inference 
                                                                       2000 
                            Journal of the American Statistical Association 
                                                                       2000 
                                                        Operations Research 
                                                                       2000 
                                                        Pattern Recognition 
                                                                       2000 
                                      Probability Theory and Related Fields 
                                                                       2000 
                                                          Signal Processing 
                                                                       2000 
                                             Journal of Applied Probability 
                                                                       1999 
                                Stochastic Processes and their Applications 
                                                                       1999 
                         Annals of the Institute of Statistical Mathematics 
                                                                       1985 
                                                       Annals of Statistics 
                                                                       1797 
                                                              Technometrics 
                                                                       1446 
                                       Journal of Machine Learning Research 
                                                                       1441 
                                                              Biostatistics 
                                                                       1120 
                                         Statistics and Probability Letters 
                                                                       1062 
                                                      Annals of Probability 
                                                                       1054 
                                                   Statistics and Computing 
                                                                        927 
                                            Advances in Applied Probability 
                                                                        895 
                                        Journal of Nonparametric Statistics 
                                                                        836 
                                                   Computational Statistics 
                                                                        813 
                                            Journal of Time Series Analysis 
                                                                        811 
                          Journal of Computational and Graphical Statistics 
                                                                        802 
     Journal of the Royal Statistical Society. Series C: Applied Statistics 
                                                                        794 
Journal of the Royal Statistical Society. Series B: Statistical Methodology 
                                                                        793 
                                                                 Biometrics 
                                                                        784 
                                                           Machine Learning 
                                                                        559 
                                                  SIAM Journal on Computing 
                                                                        433 
                                     International Journal of Biostatistics 
                                                                        368

The first problem is that it is difficult to extract the universities and locations of contributors. Here is what we have in the dataset,

> B$Authors.with.affiliations[1]
[1] Mischler, S., CEREMADE, UMR CNRS 7534, Université Paris-Dauphine, Place du 
Maréchal de Lattre de Tassigny, Paris Cedex 16, 75775, France; Mouhot, C., DPMMS,
Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge, 
CB3 0WA, United Kingdom; Wennberg, B., Department of Mathematical Sciences, Chalmers 
University of Technology, Göteborg, Sweden, Department of Mathematical Sciences, 
University of Gothenburg, Göteborg, 41296, Sweden

The first step was to split that whole string on commas,

> setwd("/home/arthur/Documents/scopus/")
> L=list.files()
> v=NULL
> for(i in 1:length(L)){
+ B=read.csv(L[i])
+ A=B$Authors.with.affiliations
+ for(j in 1:length(A)){
+ x1=as.character(A[j])
+ x2=strsplit(x1,",")
+ v=c(v,x2[[1]])}
+ }

I have a long, long vector here, which contains a lot of things!

> V=sort(table(v),decreasing=TRUE)
> names(V)[1:40]
 [1] " United States"                           
 [2] " Department of Statistics"                
 [3] " Department of Mathematics"               
 [4] " M."                                      
 [5] " J."                                      
 [6] " A."                                      
 [7] " S."                                      
 [8] " United Kingdom"                          
 [9] " France"                                  
[10] " D."                                      
[11] " P."                                      
[12] " Y."                                      
[13] " R."                                      
[14] " China"                                   
[15] " H."                                      
[16] " Germany"                                 
[17] " Department of Economics"                 
[18] " C."                                      
[19] " G."                                      
[20] " L."                                      
[21] " Canada"                                  
[22] " T."                                      
[23] " University of California"                
[24] " Department of Biostatistics"             
[25] " F."                                      
[26] " B."                                      
[27] " Department of Mathematics and Statistics"
[28] " E."                                      
[29] " K."                                      
[30] " N."                                      
[31] " Department of Computer Science"          
[32] " Japan"                                   
[33] " Australia"                               
[34] " X."                                      
[35] " Hong Kong"                               
[36] " Italy"                                   
[37] " W."                                      
[38] " Spain"

 

A lot of useless information, for sure, but also more valuable information, like university names,

> names(V)[c(23,50,58,59,61,66,67,72,84,87,89)]
 [1] " University of California"     " Stanford University"         
 [3] " Chapel Hill"                  " University of Washington"    
 [5] " Stanford"                     " University of Michigan"      
 [7] " Carnegie Mellon University"   " Columbia University"         
 [9] " Cornell University"           " University of North Carolina"
[11] " Duke University"

or cities,

> names(V)[c(35,40,41,44,45,47,51,53,54,55,56,62,64,65,
+ 70,71,82,92,97)]
 [1] " Hong Kong"    " New York"     " Berkeley"     " Cambridge"   
 [5] " Boston"       " Seattle"      " London"       " Pittsburgh"  
 [9] " Los Angeles"  " Singapore"    " Beijing"      " Philadelphia"
[13] " Ann Arbor"    " Atlanta"      " Toronto"      " Baltimore"   
[17] " Chicago"      " San Diego"    " Tokyo"
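The leading spaces in the names above (" Hong Kong", " Berkeley") and the stray initials in the earlier table come from splitting on "," alone. A possible refinement, shown here only as a sketch, is to split authors on ";" first, then split on commas and trim white space. The input is a shortened version of the affiliation string shown earlier, and dropping the first two fields assumes that every author entry starts with a surname and one field of initials.

```r
# Shortened version of the affiliation string shown earlier (two authors)
x <- paste0("Mouhot, C., DPMMS, Centre for Mathematical Sciences, ",
            "University of Cambridge, Wilberforce Road, Cambridge, CB3 0WA, United Kingdom; ",
            "Wennberg, B., Department of Mathematical Sciences, ",
            "Chalmers University of Technology, Sweden")

# Step 1: one string per author, split on the semicolon
authors <- strsplit(x, ";", fixed = TRUE)[[1]]

# Step 2: split each author on commas and strip surrounding white space
fields <- lapply(strsplit(authors, ",", fixed = TRUE), trimws)

# Step 3: drop surname and initials (assumed to be the first two fields)
affil <- lapply(fields, function(f) f[-(1:2)])
print(affil)
```

With trimmed fields, the later substr(CITIES,2,nchar(CITIES)) step would no longer be needed, and the initials disappear from the frequency table.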

I decided to focus on 90 locations. Each time I have a string which is the same as the name of one of my 90 cities, I keep it. So if there is a Prof. Ann Arbor, I will consider that person as a city. Here is the graph of all locations, with the number of “articles”, or rather contributors: if four people in San Francisco published an article together, the article appears four times in my dataset. I did spend some time with Cambridge, and I decided to move Cambridge, MA to Boston, MA, just for convenience.
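The matching rule can be sketched on a toy example. The city list and the affiliation fields below are made up for illustration; only the logic (exact string match, then merging Cambridge into Boston) follows the description above.

```r
# Hypothetical list of cities of interest (the post uses 90 of them)
cities <- c("Boston", "Cambridge", "San Francisco", "Paris")

# Toy input: trimmed affiliation fields of one article with three authors
fields <- c("Department of Statistics", "Harvard University", "Cambridge", "United States",
            "MIT", "Cambridge", "United States",
            "CEREMADE", "Paris", "France")

# Keep only the fields that are exactly equal to a city name
hits <- fields[fields %in% cities]

# Merge Cambridge into Boston (a string-only recode cannot tell
# Cambridge, MA from Cambridge, UK)
hits[hits == "Cambridge"] <- "Boston"

# One count per author and city, so a co-authored article is counted several times
counts <- table(hits)
print(counts)
```

Because the rule only looks at the city string, separating the two Cambridges would require also looking at the country field that follows it.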

> require("geosphere")
> require("maps")
> data(world.cities)
> data(us.cities)
> data(canada.cities)
> LOCALIZE=Vectorize(function(v){z=findLatLon(v)$latlon;if(is.na(z)){z=c(NA,NA)};return(z)})
> CITIES=names(V)[city]
> NCITIES=substr(CITIES,2,nchar(CITIES))
> NCITIES[substr(NCITIES,1,5)=="Paris"]="Paris"
> NCITIES=unique(NCITIES)
> LC=matrix(unlist(LOCALIZE(NCITIES)),nrow=2)
> BASELOC=data.frame(CITY=NCITIES,LAT=LC[2,],LON=LC[1,])
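The function findLatLon is not defined in the code shown here. As a rough idea of what such a lookup could look like, here is a sketch with a hypothetical helper, locate(), that searches a gazetteer by city name and keeps the most populated match. The toy gazetteer uses the same column names as world.cities from the maps package (name, pop, lat, long), and its figures are rounded, illustrative values.

```r
# Toy gazetteer with the same column names as maps::world.cities;
# the figures are rounded and only meant for illustration
gaz <- data.frame(name = c("Paris", "Paris", "London", "Tokyo"),
                  pop  = c(2100000, 25000, 7400000, 8300000),
                  lat  = c(48.86, 33.66, 51.52, 35.67),
                  long = c(2.34, -95.55, -0.10, 139.77),
                  stringsAsFactors = FALSE)

# Hypothetical helper: coordinates of the most populated city with a given name
locate <- function(city, gazetteer) {
  cand <- gazetteer[gazetteer$name == city, ]
  if (nrow(cand) == 0) return(c(LAT = NA_real_, LON = NA_real_))
  best <- cand[which.max(cand$pop), ]
  c(LAT = best$lat, LON = best$long)
}

cities_demo  <- c("Paris", "London", "Atlantis")
loc          <- t(sapply(cities_demo, locate, gazetteer = gaz))
BASELOC_demo <- data.frame(CITY = cities_demo, loc)
print(BASELOC_demo)
```

Picking the largest city is a heuristic: it sends Paris to France rather than Texas, but it would still need manual fixes for names such as Cambridge. Cities that are not found get NA coordinates instead of stopping the loop.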

I did spend some time on some cities, such as Paris or London, where the zip code was sometimes attached to the city name. I also had to fix some other problems… But after a few minutes, I was able to locate those cities.

Then, I wanted to extract information about all publications. Keywords are interesting, but over 266,567 “publications”, they are hard to use (sometimes the field is not filled in, sometimes it is extremely general, or extremely specialized). So I decided to extract words from the title of each contribution.

> VCITY=NULL
> VKW=NULL
> VY=NULL
> VJ=NULL
> VA=NULL
> VW=NULL
> art=0
> for(i in 1:length(L)){
+ B=read.csv(L[i])
+ A=B$Authors.with.affiliations
+ for(j in 1:length(A)){
+ art=art+1
+ x1=as.character(A[j])
+ x2=strsplit(x1,",")
+ listu=which(x2[[1]]%in%CITIES)
+ if(length(listu)>0){
+ C=tolower(paste(" ",as.character(B[j,"Title"]),sep=""))
+ x3=strsplit(C," ")[[1]]
+ kx3=which(!x3%in%c("a","the","of","an","in","",
+ "for","and","with","on","to","using","from","under"))
+ x3=x3[kx3]
+ J=as.character(B[j,"Source.title"])
+ Y=B[j,"Year"]
+ n1=length(listu)
+ n2=length(x3)
+ VCITY=c(VCITY,rep(x2[[1]][listu],each=n2))
+ VKW=c(VKW,rep(x3,n1))
+ VY=c(VY,rep(Y,n1*n2))
+ VJ=c(VJ,rep(J,n1*n2))
+ VA=c(VA,rep(art,n1*n2))
+ VW=c(VW,rep(1/n2,n1*n2))
+ }}}
> BASEUNIV=data.frame(CITY=VCITY,KEYW=VKW,YEAR=VY,JOURNAL=VJ,INDICE=VA,W=VW)
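
To see what one article contributes to that table, here is a toy version of the inner step of the loop. The title and the two cities are made up; the stop word list and the weight 1/n2 are the ones used in the loop.

```r
stopwords <- c("a", "the", "of", "an", "in", "", "for", "and",
               "with", "on", "to", "using", "from", "under")

title  <- "Bayesian Estimation of the Mean in Large Samples"  # made-up title
cities <- c("Boston", "Paris")                                # toy matched cities

# Lower-case the title, split on spaces and drop the stop words
words <- strsplit(tolower(title), " ", fixed = TRUE)[[1]]
words <- words[!words %in% stopwords]

n1 <- length(cities)
n2 <- length(words)

# One row per (city, word) pair; each row carries weight 1/n2
demo <- data.frame(CITY = rep(cities, each = n2),
                   KEYW = rep(words, times = n1),
                   W    = 1 / n2,
                   stringsAsFactors = FALSE)

# The weights add up to one per city for this article
print(tapply(demo$W, demo$CITY, sum))
```

The weight 1/n2 means that an article with a long title does not count more than an article with a short one: each article brings a total weight of one to each of its cities.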

Here, I got a huge dataset. Each row is one city and one "word". Now, let us select one word, and plot how important that word is in each city,

> Figure=function(keyword="bayesian"){
+ SBASEUNIV=BASEUNIV[BASEUNIV$KEYW==keyword,]
+ SB2=tapply(SBASEUNIV$W,SBASEUNIV$CITY,sum)
+ D=data.frame(CITY=names(SB2),CT=as.vector(SB2))
+ BASE=merge(BASELOC,D)
+ library(maps)
+ library(RColorBrewer)
+ CL=brewer.pal(6, "RdBu")
+ Y=SB2/SB*sum(SB,na.rm=TRUE)/sum(SB2,na.rm=TRUE)
+ X=cut(Y,breaks=c(0,.5,.75,1,1.333,2,10000))
+ levels(X)=1:6
+ library(maps)
+ map("world")
+ points(BASE$LON,BASE$LAT,pch=1,col=CL[as.numeric(X)],
+ cex=sqrt(Y*20),lwd=4)
+ }
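
In the code above, we compare with the independent case (if cities and keywords were independent), since we normalize using (SB is not defined in the code shown; it presumably holds the total weight per city over all words)

SB2/SB*sum(SB,na.rm=TRUE)/sum(SB2,na.rm=TRUE)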
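
A small numerical sketch may help to read that ratio. The numbers below are made up, and SB is assumed to be the total weight per city over all words (it is not defined in the code of the post).

```r
# Toy weights: total per city over all words (SB, assumed) and for one keyword (SB2)
SB  <- c(Boston = 100, Paris = 50, Tokyo = 50)
SB2 <- c(Boston = 10,  Paris = 2,  Tokyo = 8)

# Share of the keyword within each city, divided by its share over all cities;
# algebraically the same as SB2/SB*sum(SB)/sum(SB2)
ratio <- (SB2 / SB) / (sum(SB2) / sum(SB))

# Same classes as in the Figure function: 1 = strongly under-represented,
# 6 = strongly over-represented
classes <- as.integer(cut(ratio, breaks = c(0, .5, .75, 1, 1.333, 2, 10000)))

print(rbind(ratio = ratio, class = classes))
```

A ratio of 1 means the keyword is exactly as frequent in the city as it is overall, which is what independence between cities and keywords would give.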

For Bayesian statistics (publications with the word bayesian in the title)

For nonparametric statistics (publications with the word nonparametric in the title)

For stochastic processes (publications with the word processes in the title)

(The problem here is that we cannot visualize the red circles: if no one in a city published on a given topic, the circle would be strong red, but tiny, or even of null size… so we won’t see it.) I decided to keep the top 250 words that appeared in titles, and I removed standard common words, such as it, the, of, etc.

> listewords=names(sort(table(BASEUNIV$KEYW),decreasing=TRUE)[1:250])
> listewords=listewords[-c(1,2,3,4,7,15,24,42,129)]
> idx=which(BASEUNIV$KEYW%in%listewords)
> T=table(as.character(BASEUNIV$KEYW[idx]),BASEUNIV$CITY[idx])
> MATRICE=as.matrix(T)

I had a nice contingency table, with 90 cities, versus 200 words.

> library("FactoMineR")
> res.pca = PCA(t(MATRICE), scale.unit=TRUE, ncp=5, 
+ graph=FALSE)
> plot.PCA(res.pca, axes=c(1, 2), choix="ind")

Principal component analysis was disappointing,

So I decided to extract, per city, the largest contributions to the chi-square distance

> K2=chisq.test(MATRICE)
> M2=K2$expected
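
For reference, the contribution of a cell to the chi-square statistic is (observed - expected)^2 / expected. The sketch below computes signed contributions on a small made-up table (words in rows, cities in columns, as in MATRICE); the counts are invented for illustration.

```r
# Toy contingency table: words in rows, cities in columns (counts are made up)
M <- matrix(c(30, 10,  5,
              10, 30, 10,
               5, 10, 40),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("bayesian", "processes", "regression"),
                            c("Boston", "Paris", "Tokyo")))

K <- chisq.test(M)
E <- K$expected

# Signed contribution of each cell: positive when the word is used more
# often than expected under independence, negative when it is used less
contrib <- sign(M - E) * (M - E)^2 / E

# Words of one city, from most over-represented to most under-represented
print(sort(contrib[, "Paris"], decreasing = TRUE))
```

Keeping the sign makes it possible to list, for each city, the words at both ends: those used much more than expected and those used much less.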

On the graph below, the green level is the theoretical count of each word, under an independence assumption. The dark line is the observed one. For instance, in San Francisco, we have on top words that were not used a lot (e.g. processes: given the total number of publications, it would make sense to have 6 or 7 publications with the word processes, but there were actually 0), and below, words that were used intensively compared with the other cities (such as method and structure; the latter was expected two or three times, but it appeared in 25 publications),

In Boston, MA, we got

In New York City, NY

In Paris (France),

But to be honest, I was disappointed. I mean, yes, I can see on the previous graph, for instance, that there are a lot of people working on stochastic processes, with the words Brownian and Markov. But in most cases, I can hardly get an interpretation…

I tried a final graph, on interconnections between authors. The first point is that it is common to have joint publications with colleagues in the same city. The larger the point, the more joint papers,

But we can add cross publications here: the thinner the line, the fewer joint publications between two places,

We can see that I missed the Cambridge-Boston distinction in the first part, since Cambridge should now stand for Cambridge, UK. The line is clearly too thick to be explained by collaboration between Cambridge, UK, and Boston, MA alone. But still, a lot of the links can be explained, like Hong Kong and Shanghai, or Mexico and Guanajuato.
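
The code behind these two graphs is not shown, so the following is only a sketch of how such counts could be obtained. It assumes a table with one row per author, giving the article index and the author's city (unlike BASEUNIV, which has one row per city and word); the data are made up.

```r
# Toy data: one row per (article, author city) pair
pairs <- data.frame(INDICE = c(1, 1, 2, 2, 2, 3),
                    CITY   = c("Boston", "Paris", "Boston", "Boston", "Tokyo", "Paris"),
                    stringsAsFactors = FALSE)

# Article-by-city matrix: number of authors per article and city
A <- unclass(table(pairs$INDICE, pairs$CITY))

# Between-city links: number of author pairs, one in each city, sharing an article
cross <- crossprod(A)

# Within-city links: pairs of co-authors located in the same city
diag(cross) <- colSums(A * (A - 1) / 2)

print(cross)
```

The diagonal would give the size of the points (joint papers within a city) and the off-diagonal entries the width of the lines between two places.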

If someone has better ideas to properly import the locations (or affiliations, it might be fun to focus on universities) and perhaps the abstract (rather than just the title), I’d be glad to try the same study on economics journals…
