Statistical Interests in Large Cities

January 10, 2014

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

I always thought that there were schools of thought in statistics: places (not to say universities or laboratories) where people share an interest in a particular statistical methodology, such as extreme values or Lévy processes. I wanted to check this, so I extracted information about articles published in about 35 journals in statistics, probability and econometrics. All the information comes from files exported from http://scopus.com/

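```
> setwd("/home/arthur/Documents/scopus/")
> L=list.files()
> z=NULL
> for(i in 1:length(L)){
+ # the line reading each file is missing from the original listing;
+ # read.csv() on the Scopus export is an assumption
+ B=read.csv(L[i],header=TRUE)
+ z=c(z,as.character(B$Source.title))
+ }
```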

Here is the list of the journals I have used, with the number of records for each:

```
> Z=sort(table(z),decreasing=TRUE)
> Z[1:34]
4000  Computational Statistics and Data Analysis
4000  Journal of Multivariate Analysis
2631  Econometric Theory
2051  Annals of Applied Probability
2000  Bioinformatics
2000  Biometrika
2000  Journal of Econometrics
2000  Journal of Statistical Planning and Inference
2000  Journal of the American Statistical Association
2000  Operations Research
2000  Pattern Recognition
2000  Probability Theory and Related Fields
2000  Signal Processing
1999  Journal of Applied Probability
1999  Stochastic Processes and their Applications
1985  Annals of the Institute of Statistical Mathematics
1797  Annals of Statistics
1446  Technometrics
1441  Journal of Machine Learning Research
1120  Biostatistics
1062  Statistics and Probability Letters
1054  Annals of Probability
 927  Statistics and Computing
 895
 836  Journal of Nonparametric Statistics
 813  Computational Statistics
 811  Journal of Time Series Analysis
 802  Journal of Computational and Graphical Statistics
 794  Journal of the Royal Statistical Society. Series C: Applied Statistics
 793  Journal of the Royal Statistical Society. Series B: Statistical Methodology
 784  Biometrics
 559  Machine Learning
 433  SIAM Journal on Computing
 368  International Journal of Biostatistics
```

The first problem is that it is difficult to extract the universities and locations of contributors. Here is what the dataset contains:

```
> B$Authors.with.affiliations[1]
[1] Mischler, S., CEREMADE, UMR CNRS 7534, Université Paris-Dauphine, Place du
Maréchal de Lattre de Tassigny, Paris Cedex 16, 75775, France; Mouhot, C., DPMMS,
Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge,
CB3 0WA, United Kingdom; Wennberg, B., Department of Mathematical Sciences, Chalmers
University of Technology, Göteborg, Sweden, Department of Mathematical Sciences,
University of Gothenburg, Göteborg, 41296, Sweden
```

The first step was to split that whole string on commas

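```
> setwd("/home/arthur/Documents/scopus/")
> L=list.files()
> v=NULL
> for(i in 1:length(L)){
+ # assumed: the line reading each file is missing from the original listing
+ B=read.csv(L[i],header=TRUE)
+ A=B$Authors.with.affiliations
+ for(j in 1:length(A)){
+ x1=as.character(A[j])
+ x2=strsplit(x1,",")       # split the affiliation string on commas
+ v=c(v,x2[[1]])}
+ }
```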
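To see what this split produces, here is a minimal sketch on a shortened version of the record shown earlier. The variable names `pieces`, `authors` and `fields` are mine, and the two-step split on `;` then `,` is only a possible alternative, not what the loop above does.

```r
# One affiliation field, shortened from the record printed earlier
x1 <- "Mischler, S., CEREMADE, Paris Cedex 16, 75775, France; Mouhot, C., DPMMS, University of Cambridge, Cambridge, CB3 0WA, United Kingdom"

# Splitting on commas only, as in the loop above: every piece except
# the first keeps its leading space, and the author separator ";"
# stays glued inside one piece (" France; Mouhot")
pieces <- strsplit(x1, ",")[[1]]
print(pieces)

# Possible alternative (an assumption, not the original approach):
# split on ";" first so that each author's fields stay together,
# then split each author on "," and trim the spaces
authors <- strsplit(x1, ";")[[1]]
fields <- lapply(authors, function(a) trimws(strsplit(a, ",")[[1]]))
print(fields)
```

The leading spaces explain why the names in the frequency table all start with a blank, and the glued `" France; Mouhot"` piece shows why some country names are lost with the comma-only split.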

I now have a long, long vector, which contains a lot of things!

```> V=sort(table(v),decreasing=TRUE)
> names(V)[1:40]
[1] " United States"
[2] " Department of Statistics"
[3] " Department of Mathematics"
[4] " M."
[5] " J."
[6] " A."
[7] " S."
[8] " United Kingdom"
[9] " France"
[10] " D."
[11] " P."
[12] " Y."
[13] " R."
[14] " China"
[15] " H."
[16] " Germany"
[17] " Department of Economics"
[18] " C."
[19] " G."
[20] " L."
[22] " T."
[23] " University of California"
[24] " Department of Biostatistics"
[25] " F."
[26] " B."
[27] " Department of Mathematics and Statistics"
[28] " E."
[29] " K."
[30] " N."
[31] " Department of Computer Science"
[32] " Japan"
[33] " Australia"
[34] " X."
[35] " Hong Kong"
[36] " Italy"
[37] " W."
[38] " Spain"```

A lot of useless information, for sure, but also more valuable information. Like university names,

```> names(V)
[1] " University of California"     " Stanford University"
[3] " Chapel Hill"                  " University of Washington"
[5] " Stanford"                     " University of Michigan"
[7] " Carnegie Mellon University"   " Columbia University"
[9] " Cornell University"           " University of North Carolina"
[11] " Duke University"```

or cities,

```> names(V)
[1] " Hong Kong"    " New York"     " Berkeley"     " Cambridge"
[5] " Boston"       " Seattle"      " London"       " Pittsburgh"
[9] " Los Angeles"  " Singapore"    " Beijing"      " Philadelphia"
[13] " Ann Arbor"    " Atlanta"      " Toronto"      " Baltimore"
[17] " Chicago"      " San Diego"    " Tokyo"```

I decided to focus on 90 locations. Each time a string is identical to the name of one of my 90 cities, I keep it. So if there is a Prof. Ann Arbor, I will consider that person as a city. Here is the graph of all locations, with the number of “articles”, or rather contributors: if four people in San Francisco published an article together, the article appears four times in my dataset. I spent some time on Cambridge, and I decided to move Cambridge, MA to Boston, MA, just for convenience.
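The matching rule can be sketched as follows. The short `cities` vector and the affiliation string (with made-up authors) are hypothetical, and the recoding line is only one way to merge Cambridge into Boston.

```r
# Hypothetical short list of cities, with the leading space that
# the comma split leaves in front of every piece
cities <- c(" Boston", " Cambridge", " Paris", " Ann Arbor")

# Made-up affiliation string, for illustration only
x1 <- "Doe, J., Department of Statistics, Harvard University, Cambridge, MA, United States; Roe, A., Universite Paris-Dauphine, Paris, France"

tokens <- strsplit(x1, ",")[[1]]

# Keep only the pieces that are exactly equal to a city name
hits <- tokens[tokens %in% cities]

# Blunt recoding: every " Cambridge" becomes " Boston",
# whether it is Cambridge, MA or Cambridge, UK
hits[hits == " Cambridge"] <- " Boston"
print(table(hits))
```

Because the comparison is on the bare string, this recoding cannot tell the two Cambridges apart.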

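```
> require("geosphere")
> require("maps")
> data(world.cities)
> data(us.cities)
> # findLatLon() is not defined in this post; it is assumed to be a helper
> # returning a list with a latlon element for a given city name
> LOCALIZE=Vectorize(function(v){z=findLatLon(v)$latlon;if(any(is.na(z))){z=c(NA,NA)};return(z)})
> # city: index of the entries of V kept as city names (not shown in the post)
> CITIES=names(V)[city]
> NCITIES=substr(CITIES,2,nchar(CITIES))   # drop the leading space
> NCITIES[substr(NCITIES,1,5)=="Paris"]="Paris"
> NCITIES=unique(NCITIES)
> LC=matrix(unlist(LOCALIZE(NCITIES)),nrow=2)
> BASELOC=data.frame(CITY=NCITIES,LAT=LC[2,],LON=LC[1,])
```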

I spent some time on some cities, such as Paris or London, where the zip code was sometimes attached to the city name. I also had to fix some problems… But after a few minutes, I was able to locate those cities.

Then, I wanted to extract information about all publications. Keywords are interesting, but over 266,567 “publications”, they are hard to use (sometimes the field is not filled, sometimes it is extremely general, or extremely specialized). So I decided to extract words from the title of the contribution.

```
> VCITY=NULL
> VKW=NULL
> VY=NULL
> VJ=NULL
> VA=NULL
> VW=NULL
> art=0
> for(i in 1:length(L)){
+ # assumed: the line reading each file is missing from the original listing
+ B=read.csv(L[i],header=TRUE)
+ A=B$Authors.with.affiliations
+ for(j in 1:length(A)){
+ art=art+1
+ x1=as.character(A[j])
+ x2=strsplit(x1,",")
+ listu=which(x2[[1]]%in%CITIES)   # pieces equal to one of the cities
+ if(length(listu)>0){
+ C=tolower(paste(" ",as.character(B[j,"Title"]),sep=""))
+ x3=strsplit(C," ")[[1]]
+ # drop empty strings and a few common words
+ kx3=which(!x3%in%c("a","the","of","an","in","",
+ "for","and","with","on","to","using","from","under"))
+ x3=x3[kx3]
+ J=as.character(B[j,"Source.title"])
+ Y=B[j,"Year"]
+ n1=length(listu)   # number of matched cities
+ n2=length(x3)      # number of words kept from the title
+ VCITY=c(VCITY,rep(x2[[1]][listu],each=n2))
+ VKW=c(VKW,rep(x3,n1))
+ VY=c(VY,rep(Y,n1*n2))
+ VJ=c(VJ,rep(J,n1*n2))
+ VA=c(VA,rep(art,n1*n2))
+ VW=c(VW,rep(1/n2,n1*n2))   # each word weighs 1/n2
+ }}}
> # drop the leading space so that CITY matches BASELOC$CITY
> BASEUNIV=data.frame(CITY=substr(VCITY,2,nchar(VCITY)),KEYW=VKW,YEAR=VY,JOURNAL=VJ,INDICE=VA,W=VW)
```
Here, I got a huge dataset. One line is one city and one "word". Now, let us select one word, and let us plot how important that word is in each city,
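To make the structure of that dataset concrete, here is a minimal sketch for a single article. The title and the two cities are hypothetical; the stop-word list and the 1/n weight follow the loop above.

```r
# Hypothetical title and matched cities for one article
title <- "A Bayesian Approach to the Analysis of Extreme Values"
cities <- c(" Paris", " Boston")

# Same stop-word list as in the loop above
stopw <- c("a", "the", "of", "an", "in", "",
           "for", "and", "with", "on", "to", "using", "from", "under")

# Lower-case the title, split on spaces, drop the stop words
words <- strsplit(tolower(title), " ")[[1]]
words <- words[!words %in% stopw]

# One row per (city, word) pair; each word weighs 1/(number of words),
# so the weights of one article sum to 1 within each city
rows <- data.frame(CITY = rep(cities, each = length(words)),
                   KEYW = rep(words, length(cities)),
                   W = 1 / length(words))
print(rows)
```

With this weighting, an article with a long title does not count more than an article with a short one.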
```
> Figure=function(keyword="bayesian"){
+ library(maps)
+ library(RColorBrewer)
+ # SB is not defined in the original listing; it is assumed here to be
+ # the total weight per city, over all words
+ SB=tapply(BASEUNIV$W,BASEUNIV$CITY,sum)
+ SBASEUNIV=BASEUNIV[BASEUNIV$KEYW==keyword,]
+ SB2=tapply(SBASEUNIV$W,SBASEUNIV$CITY,sum)   # weight of the keyword per city
+ D=data.frame(CITY=names(SB2),CT=as.vector(SB2))
+ BASE=merge(BASELOC,D)
+ CL=brewer.pal(6, "RdBu")
+ # observed share of the keyword in each city, relative to its overall share
+ Y=SB2/SB*sum(SB,na.rm=TRUE)/sum(SB2,na.rm=TRUE)
+ X=cut(Y,breaks=c(0,.5,.75,1,1.333,2,10000))
+ levels(X)=1:6
+ map("world")
+ points(BASE$LON,BASE$LAT,pch=1,col=CL[as.numeric(X)],
+ cex=sqrt(Y*20),lwd=4)
+ }
```
In the code above, we compare with the independent case (as if cities and keywords were independent) since we normalize using
`SB2/SB*sum(SB,na.rm=TRUE)/sum(SB2,na.rm=TRUE)`
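A small numerical sketch may help to read this ratio. The three cities and their weights are made up; `SB` is taken to be the total weight per city and `SB2` the weight of the keyword per city, which is my reading of the code since `SB` is not defined in the post.

```r
# Made-up total weights per city (all words)
SB <- c(Paris = 128, Boston = 256, London = 128)
# Made-up weights per city for one keyword
SB2 <- c(Paris = 16, Boston = 16, London = 0)

# Share of the keyword in each city, divided by its overall share:
# 1 means "as expected under independence", 2 means twice as frequent
Y <- SB2 / SB * sum(SB, na.rm = TRUE) / sum(SB2, na.rm = TRUE)
print(Y)   # Paris 2, Boston 1, London 0

# Same breaks as in the plotting function
X <- cut(Y, breaks = c(0, .5, .75, 1, 1.333, 2, 10000))
levels(X) <- 1:6
print(X)
```

The bins are closed on the right, so a ratio of exactly 0 falls outside the first bin `(0, 0.5]` and gets `NA` instead of a colour class.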
For Bayesian statistics (publications with the word bayesian in the title)

For nonparametric statistics (publications with the word nonparametric in the title)

For stochastic processes (publications with the word processes in the title)

(the problem here is that we cannot visualize the red circles: if no one in a city published on a given topic, the circle would be strong red, but tiny, or even of null size… so we won’t see it). I decided to keep the top 250 words that appeared in titles, and I removed standard common words, such as it, the, of, etc.

```
> listewords=names(sort(table(BASEUNIV$KEYW),decreasing=TRUE)[1:250])
> listewords=listewords[-c(1,2,3,4,7,15,24,42,129)]   # remove common words
> idx=which(BASEUNIV$KEYW%in%listewords)
> T=table(as.character(BASEUNIV$KEYW[idx]),BASEUNIV$CITY[idx])
> MATRICE=as.matrix(T)
```

I had a nice contingency table, with 90 cities versus 241 words (the top 250, minus the 9 removed above).

```> library("FactoMineR")
> res.pca = PCA(t(MATRICE), scale.unit=TRUE, ncp=5,
+ graph=FALSE)
> plot.PCA(res.pca, axes=c(1, 2), choix="ind")```

Principal component analysis was disappointing,

So I decided to extract, per city, the largest contributions to the chi-square distance

```
> K2=chisq.test(MATRICE)
> M2=K2$expected
```
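The contributions to the chi-square distance can be sketched on a toy table. The word-by-city counts below are made up, and ranking the words of a city by Pearson residual is one possible way to do the extraction, not necessarily the one used for the graphs.

```r
# Made-up word-by-city contingency table
M <- matrix(c(30, 10, 20,
              10, 30, 20,
              20, 20, 20),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("bayesian", "processes", "model"),
                            c("CityA", "CityB", "CityC")))

K <- chisq.test(M)
E <- K$expected            # counts expected under independence

# Contribution of each cell to the chi-square statistic
contrib <- (M - E)^2 / E
print(sum(contrib))        # equal to K$statistic

# Pearson residuals (observed - expected)/sqrt(expected) keep the sign:
# negative = used less than expected, positive = used more
res <- K$residuals
print(sort(res[, "CityA"]))
```

Sorting the residuals of one city puts the under-used words first and the over-used words last.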

On the graph below, the green level is the theoretical count of each word under the independence assumption, and the dark line is the observed count. For instance, in San Francisco, the words on top are those that were used less than expected (e.g. processes: given the total number of publications, we would expect 6 or 7 publications with the word processes, but there were actually none), and the words below are those that were used intensively compared with the other cities (such as method and structure; the latter was expected two or three times, but it appeared in 25 publications),

In Boston, MA, we got

In New York City, NY

In Paris (France),

But to be honest, I was disappointed. I mean, yes, I can see on the previous graph, for instance, that there are a lot of people working on stochastic processes, with the words Brownian and Markov. But in most cases, I can hardly get an interpretation…

I tried a final graph, on interconnections between authors. The first point is that it is common to have joint publications with colleagues in the same city. The larger the point, the more joint papers,

But we can add cross publications here: the thinner the line, the fewer joint publications between two places,
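The code for this last graph is not given in the post. As a rough sketch of how such counts could be obtained, one can build an article-by-city incidence matrix and cross it with itself. The toy data frame `links` is made up; its column names mimic `BASEUNIV`, where `INDICE` is the article number.

```r
# Made-up (article, city) pairs
links <- data.frame(INDICE = c(1, 1, 2, 2, 2, 3),
                    CITY = c("Paris", "London", "Paris", "London", "Boston", "Paris"),
                    stringsAsFactors = FALSE)

# Article-by-city incidence matrix: 1 if the city appears in the article
inc <- (unclass(table(links$INDICE, links$CITY)) > 0) * 1

# City-by-city matrix: entry (a, b) is the number of articles in which
# both a and b appear; the diagonal is the number of articles per city
co <- crossprod(inc)
print(co)
```

The off-diagonal entries could then drive the line widths between pairs of cities.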

We can see that I missed the Cambridge-Boston distinction in the first part, since Cambridge should now stand for Cambridge, UK: the line is clearly too thick to be explained by collaboration between Cambridge, UK, and Boston, MA. Still, a lot of the links can be explained, such as Hong Kong and Shanghai, or Mexico and Guanajuato.

If someone has better ideas to properly import the locations (or affiliations, it might be fun to focus on universities) and perhaps the abstract (rather than only the title), I’d be glad to try the same study on economics journals…
