Statistical Interests in Large Cities

January 10, 2014
(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

I always thought that there were some kinds of schools in statistics, areas (not to say universities or laboratories) where people had common interests in terms of statistical methodology. Like people with a strong interest in extreme values, or in Lévy processes. I wanted to check this point, so I extracted information about articles published in about 35 journals in statistics, probability and econometrics. I got all the information in files extracted from http://scopus.com/

> setwd("/home/arthur/Documents/scopus/")
> L=list.files()
> z=NULL
> for(i in 1:length(L)){
+ B=read.csv(L[i])   # assumption: each file is a Scopus CSV export (the reading step is missing from the listing)
+ z=c(z,as.character(B$Source.title))
+ }

Here is the list of the publications I have used

> Z=sort(table(z),decreasing=TRUE)
> Z[1:34]
Computational Statistics and Data Analysis  4000
Journal of Multivariate Analysis  4000
Econometric Theory  2631
Annals of Applied Probability  2051
Bioinformatics  2000
Biometrika  2000
Journal of Econometrics  2000
Journal of Statistical Planning and Inference  2000
Journal of the American Statistical Association  2000
Operations Research  2000
Pattern Recognition  2000
Probability Theory and Related Fields  2000
Signal Processing  2000
Journal of Applied Probability  1999
Stochastic Processes and their Applications  1999
Annals of the Institute of Statistical Mathematics  1985
Annals of Statistics  1797
Technometrics  1446
Journal of Machine Learning Research  1441
Biostatistics  1120
Statistics and Probability Letters  1062
Annals of Probability  1054
Statistics and Computing  927
Advances in Applied Probability  895
Journal of Nonparametric Statistics  836
Computational Statistics  813
Journal of Time Series Analysis  811
Journal of Computational and Graphical Statistics  802
Journal of the Royal Statistical Society. Series C: Applied Statistics  794
Journal of the Royal Statistical Society. Series B: Statistical Methodology  793
Biometrics  784
Machine Learning  559
SIAM Journal on Computing  433
International Journal of Biostatistics  368

The first problem is that it is difficult to extract universities and locations of contributors. When you look at what we have in the dataset, here it is

> B$Authors.with.affiliations[1]
[1] Mischler, S., CEREMADE, UMR CNRS 7534, Université Paris-Dauphine, Place du
Maréchal de Lattre de Tassigny, Paris Cedex 16, 75775, France; Mouhot, C., DPMMS,
Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge,
CB3 0WA, United Kingdom; Wennberg, B., Department of Mathematical Sciences, Chalmers
University of Technology, Göteborg, Sweden, Department of Mathematical Sciences,
University of Gothenburg, Göteborg, 41296, Sweden

The first step was to split that whole string on commas,

> setwd("/home/arthur/Documents/scopus/")
> L=list.files()
> v=NULL
> for(i in 1:length(L)){
+ B=read.csv(L[i])   # assumption: each file is a Scopus CSV export (the reading step is missing from the listing)
+ A=B$Authors.with.affiliations
+ for(j in 1:length(A)){
+ x1=as.character(A[j])
+ x2=strsplit(x1,",")
+ v=c(v,x2[[1]])}
+ }

I have a long, long vector here, which contains a lot of things!

> V=sort(table(v),decreasing=TRUE)
> names(V)[1:40]
 [1] " United States"
 [2] " Department of Statistics"
 [3] " Department of Mathematics"
 [4] " M."
 [5] " J."
 [6] " A."
 [7] " S."
 [8] " United Kingdom"
 [9] " France"
[10] " D."
[11] " P."
[12] " Y."
[13] " R."
[14] " China"
[15] " H."
[16] " Germany"
[17] " Department of Economics"
[18] " C."
[19] " G."
[20] " L."
[21] " Canada"
[22] " T."
[23] " University of California"
[24] " Department of Biostatistics"
[25] " F."
[26] " B."
[27] " Department of Mathematics and Statistics"
[28] " E."
[29] " K."
[30] " N."
[31] " Department of Computer Science"
[32] " Japan"
[33] " Australia"
[34] " X."
[35] " Hong Kong"
[36] " Italy"
[37] " W."
[38] " Spain"

A lot of useless information, for sure, but also more valuable information. Like university names,

> names(V)[c(23,50,58,59,61,66,67,72,84,87,89)]
 [1] " University of California"   " Stanford University"
 [3] " Chapel Hill"                " University of Washington"
 [5] " Stanford"                   " University of Michigan"
 [7] " Carnegie Mellon University" " Columbia University"
 [9] " Cornell University"         " University of North Carolina"
[11] " Duke University"

or cities,

> names(V)[c(35,40,41,44,45,47,51,53,54,55,56,62,64,65,
+ 70,71,82,92,97)]
 [1] " Hong Kong"    " New York"     " Berkeley"     " Cambridge"
 [5] " Boston"       " Seattle"      " London"       " Pittsburgh"
 [9] " Los Angeles"  " Singapore"    " Beijing"      " Philadelphia"
[13] " Ann Arbor"    " Atlanta"      " Toronto"      " Baltimore"
[17] " Chicago"      " San Diego"    " Tokyo"

I decided to focus on 90 locations. Each time I have a string which is the same as the name of one of my 90 cities, I keep it. So if there is a Prof. Ann Arbor, I will consider that person as a city. Here is the graph of all locations, with the number of “articles“. Or contributors.
If four people in San Francisco published an article together, the article appears four times in my dataset. I did spend some time with Cambridge, and I decided to move Cambridge, MA to Boston, MA. Just for convenience.

> require("geosphere")
> require("maps")
> data(world.cities)
> data(us.cities)
> data(canada.cities)
> LOCALIZE=Vectorize(function(v){z=findLatLon(v)$latlon;if(any(is.na(z))){z=c(NA,NA)};return(z)})
> CITIES=names(V)[city]
> NCITIES=substr(CITIES,2,nchar(CITIES))
> NCITIES[substr(NCITIES,1,5)=="Paris"]="Paris"
> NCITIES=unique(NCITIES)
> LC=matrix(unlist(LOCALIZE(NCITIES)),nrow=2)
> BASELOC=data.frame(CITY=NCITIES,LAT=LC[2,],LON=LC[1,])

I did spend some time on some cities, such as Paris, or London, where the zip code was sometimes attached to the city name. I also had to fix some problems… But after a few minutes, I was able to locate those cities.

Then, I wanted to extract information about all publications. Keywords are interesting, but over the 266,567 “publications“, they are hard to use (sometimes the field is not filled, sometimes it is extremely general, or extremely specialized). So I decided to extract words from the title of the contribution.
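
The title-splitting idea can be sketched on a single title. This is only a minimal illustration: the title is made up, the stop-word list is the one that appears in the loop of this post, and the names `title`, `stopwords`, `words` and `weights` are hypothetical.

```r
# Toy title (made up); the stop-word list is the one used in the post's loop
title <- "Bayesian inference for the analysis of extreme values"
stopwords <- c("a", "the", "of", "an", "in", "", "for", "and", "with",
               "on", "to", "using", "from", "under")

# Lower-case the title, split on spaces, drop the stop words
words <- strsplit(tolower(title), " ")[[1]]
words <- words[!words %in% stopwords]

# Each kept word gets weight 1/n, so one title carries a total weight of 1
weights <- rep(1 / length(words), length(words))
data.frame(KEYW = words, W = weights)
```

With that weighting, a long title does not count more than a short one when the weights are later summed per city.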

> VCITY=NULL
> VKW=NULL
> VY=NULL
> VJ=NULL
> VA=NULL
> VW=NULL
> art=0
> for(i in 1:length(L)){
+ B=read.csv(L[i])   # assumption: each file is a Scopus CSV export (the reading step is missing from the listing)
+ A=B$Authors.with.affiliations
+ for(j in 1:length(A)){
+ art=art+1
+ x1=as.character(A[j])
+ x2=strsplit(x1,",")
+ listu=which(x2[[1]]%in%CITIES)
+ if(length(listu)>0){
+ C=tolower(paste(" ",as.character(B[j,"Title"]),sep=""))
+ x3=strsplit(C," ")[[1]]
+ kx3=which(!x3%in%c("a","the","of","an","in","",
+ "for","and","with","on","to","using","from","under"))
+ x3=x3[kx3]
+ J=as.character(B[j,"Source.title"])
+ Y=B[j,"Year"]
+ n1=length(listu)
+ n2=length(x3)
+ VCITY=c(VCITY,rep(x2[[1]][listu],each=n2))
+ VKW=c(VKW,rep(x3,n1))
+ VY=c(VY,rep(Y,n1*n2))
+ VJ=c(VJ,rep(J,n1*n2))
+ VA=c(VA,rep(art,n1*n2))
+ VW=c(VW,rep(1/n2,n1*n2))
+ }}}
> # CITY and W use the per-row vectors VCITY and VW built in the loop
> BASEUNIV=data.frame(CITY=VCITY,KEYW=VKW,YEAR=VY,JOURNAL=VJ,INDICE=VA,W=VW)

Here, I got a huge dataset. One line is one city and one "word". Now, let us select one word, and let us plot how important that word is, in each city,

> Figure=function(keyword="bayesian"){
+ SBASEUNIV=BASEUNIV[BASEUNIV$KEYW==keyword,]
+ SB2=tapply(SBASEUNIV$W,SBASEUNIV$CITY,sum)
+ D=data.frame(CITY=names(SB2),CT=as.vector(SB2))
+ BASE=merge(BASELOC,D)
+ library(maps)
+ library(RColorBrewer)
+ CL=brewer.pal(6, "RdBu")
+ Y=SB2/SB*sum(SB,na.rm=TRUE)/sum(SB2,na.rm=TRUE)
+ X=cut(Y,breaks=c(0,.5,.75,1,1.333,2,10000))
+ levels(X)=1:6
+ library(maps)
+ map("world")
+ points(BASE$LON,BASE$LAT,pch=1,col=CL[as.numeric(X)],
+ cex=sqrt(Y*20),lwd=4)
+ }
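In the code above, we compare with the independent case (if cities and keywords were independent), since we normalize using
SB2/SB*sum(SB,na.rm=TRUE)/sum(SB2,na.rm=TRUE)
(SB is not defined in the listing; it is presumably the per-city total of W over all words, as SB2 is for the selected keyword.)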
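
To see what this ratio measures, here is a small worked sketch. The numbers are made up (they are not from the Scopus files), and `SB` is assumed to hold the total weight per city over all words: a value above 1 means the keyword is over-represented in that city compared with independence, and a value below 1 means it is under-represented.

```r
# Hypothetical weighted counts per city (toy numbers, not from the dataset)
SB  <- c(Paris = 100, London = 200, Boston = 100)  # all words (assumed meaning of SB)
SB2 <- c(Paris = 3,   London = 22,  Boston = 15)   # one selected keyword

# Share of the keyword within each city, divided by its overall share
Y <- SB2 / SB * sum(SB, na.rm = TRUE) / sum(SB2, na.rm = TRUE)

# Same six classes as in Figure(): classes 1-3 are below 1, classes 4-6 above 1
X <- cut(Y, breaks = c(0, .5, .75, 1, 1.333, 2, 10000))
levels(X) <- 1:6
data.frame(Y = round(Y, 2), class = X)
```

In this toy case the keyword makes up 10% of the total weight overall, but only 3% in Paris, so Paris falls in the lowest class.
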
For bayesian statistics (publication with the word bayesian in the title)

For nonparametric statistics (publication with the word nonparametric in the title)

For stochastic processes (publication with the word processes in the title)

(the problem here is that we cannot visualize the red circles: if, in a city, no one published on a given topic, the circle would be strong red, but tiny, or even null… so we won’t see it). I decided to keep the top 250 words that appeared in titles, and I removed standard common words, such as it, the, of, etc.

> listewords=names(sort(table(BASEUNIV$KEYW),decreasing=TRUE)[1:250])
> listewords=listewords[-c(1,2,3,4,7,15,24,42,129)]
> idx=which(BASEUNIV$KEYW%in%listewords)
> T=table(as.character(BASEUNIV$KEYW[idx]),BASEUNIV$CITY[idx])
> MATRICE=as.matrix(T)

I had a nice contingency table, with 90 cities, versus 200 words.

> library("FactoMineR")
> res.pca = PCA(t(MATRICE), scale.unit=TRUE, ncp=5,
+ graph=FALSE)
> plot.PCA(res.pca, axes=c(1, 2), choix="ind")

Principal component analysis was disappointing,

So I decided to extract, per city, the largest contributions to the chi-square distance

> K2=chisq.test(MATRICE)
> M2=K2$expected
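
As a minimal sketch of what "largest contributions to the chi-square distance" could look like in practice, the toy example below compares observed and expected counts cell by cell. The matrix `M` is made up and the names `E`, `contrib` and `top` are hypothetical; only `chisq.test()` and its `expected` component come from the code above.

```r
# Toy word-by-city contingency table (made-up counts, not from the dataset)
M <- matrix(c(30, 10,  5,
              10, 25, 10,
               5, 10, 40),
            nrow = 3, byrow = TRUE,
            dimnames = list(word = c("bayesian", "markov", "regression"),
                            city = c("Paris", "Boston", "Tokyo")))

K <- chisq.test(M)
E <- K$expected              # counts expected under independence
contrib <- (M - E)^2 / E     # cell-by-cell contribution to the chi-square statistic

# For each city (column), the word with the largest contribution
top <- apply(contrib, 2, function(col) names(which.max(col)))
top
```

The sign of `M - E` tells whether a word is used more or less than expected, which is what the observed versus theoretical bars in the graphs display.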

On the graph below, the green level is the theoretical count of each word, under the independence assumption. The dark line is the observed one. For instance, in San Francisco, on top, we have words that were not used a lot (e.g. processes: given the total number of publications, it would make sense to have 6 or 7 publications with the word processes, but there were actually 0), and below, words that were used intensively compared with the other cities (such as method and structure; the latter was expected two or three times, but it appeared in 25 publications),

In Boston, MA,  we got

In New York City, NY

In Paris (France),

But to be honest, I was disappointed. I mean, yes, I can see on the previous graph, for instance, that there are a lot of people working on stochastic processes, with the words Brownian and Markov. But for most cases, I can hardly get an interpretation…

I tried a final graph, on interconnections between authors. The first point is that it is common to have joint publications with colleagues in the same city. The larger the point, the more joint papers,

But we can add cross publications here: the thinner the line, the fewer joint publications between two places,
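
The code for this last graph is not shown, so the following is only a sketch of one way to count cross publications, under the assumption that we have one row per (article, city) pair. The data frame `pairs` and its toy values are hypothetical; `INDICE` and `CITY` mirror the column names used in BASEUNIV.

```r
# Hypothetical input: one row per (article, city) pair, toy values
pairs <- data.frame(
  INDICE = c(1, 1, 2, 2, 2, 3, 4, 4),
  CITY   = c("Paris", "London", "Boston", "New York", "Boston",
             "Tokyo", "London", "Paris"),
  stringsAsFactors = FALSE
)

# For each article, list every unordered pair of distinct cities
links <- do.call(rbind, lapply(split(pairs$CITY, pairs$INDICE), function(cities) {
  cities <- sort(unique(cities))
  if (length(cities) < 2) return(NULL)   # single-city article: no cross link
  t(combn(cities, 2))
}))

# Number of joint articles per pair of cities (could drive the line width)
edge_count <- table(paste(links[, 1], links[, 2], sep = " - "))
edge_count
```

Sorting the cities inside each article makes "London - Paris" and "Paris - London" count as the same link.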

We can see that I missed the Cambridge-Boston distinction in the first part, since Cambridge should now stand for Cambridge, UK. The line is clearly too large to be explained by collaboration between Cambridge, UK, and Boston, MA alone. But still, a lot of the lines can be explained, such as Hong Kong and Shanghai, or Mexico and Guanajuato.

If someone has better ideas to properly import the locations (or affiliations, it might be fun to focus on universities) and perhaps the abstract (more than the title), I’d be glad to try the same study on economics journals…
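
One possible direction, given here only as a rough sketch: in the sample record shown earlier, authors are separated by semicolons, so splitting on ";" first and on "," second keeps each author's fields together. The record below is a shortened version of that sample, and treating the last field as the country is an assumption that would need checking on the real files.

```r
# Shortened version of the sample record shown earlier in the post
x <- paste0("Mischler, S., CEREMADE, UMR CNRS 7534, Paris Cedex 16, 75775, France; ",
            "Mouhot, C., DPMMS, University of Cambridge, Cambridge, CB3 0WA, United Kingdom")

authors <- strsplit(x, ";", fixed = TRUE)[[1]]             # one element per author
fields  <- lapply(strsplit(authors, ",", fixed = TRUE), trimws)

parsed <- do.call(rbind, lapply(fields, function(f) {
  data.frame(AUTHOR  = paste(f[1], f[2]),                  # surname and initials
             COUNTRY = f[length(f)],                       # assumed to be the last field
             AFFIL   = paste(f[3:(length(f) - 1)], collapse = ", "),
             stringsAsFactors = FALSE)
}))
parsed
```

This would not handle authors with several affiliations, such as the third author of the sample record, whose two affiliations are joined by commas; in that case the last field only gives the country of the last affiliation.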