R-Bloggers’ Web-Presence

April 6, 2012

(This article was first published on theBioBucket*, and kindly contributed to R-bloggers)

We love them, we hate them: RANKINGS!

Rankings are an inevitable tool to keep the human rat race going. In this regard I'll pick up my last two posts (HERE & HERE) and have some fun with them by using them to analyse R-Bloggers' web presence. I will use the number of hits in Google Search as an indicator.

I searched for URLs like this: https://www.google.com/search?q="http://www.twotorials.com" - meaning that the blog URL is put in quotes, so only the exact URL is searched for.
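The query URL above can also be built programmatically. Here is a minimal sketch, assuming base R's utils::URLencode is used to percent-encode the quoted blog address. The helper name make_query_url is hypothetical and not part of the script further down, which pastes the address into the query unencoded.

```r
# Hypothetical helper (not in the original script): build an exact-phrase
# Google query URL for a blog address.
make_query_url <- function(blog_url) {
  # Wrap the address in double quotes so that only the exact string is
  # searched, then percent-encode it, including reserved characters
  # such as ':' and '/'.
  phrase <- paste0("\"", blog_url, "\"")
  paste0("https://www.google.com/search?q=",
         utils::URLencode(phrase, reserved = TRUE))
}

make_query_url("http://www.twotorials.com")
```

Encoding the quotes and slashes is a design choice that keeps the query string unambiguous; the unencoded form shown in the text may work just as well in practice.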



Blogs    NoHits
http://google-opensource.blogspot.com    82300
http://www.programmingr.com    73500
http://googleresearch.blogspot.com    58000
http://dirk.eddelbuettel.com    53000
http://borasky-research.net    33100
http://casoilresource.lawr.ucdavis.edu    32500
http://andrewgelman.com    30000
http://yihui.name    29600
http://xianblog.wordpress.com    27900
http://nsaunders.wordpress.com    27600
http://chem-bla-ics.blogspot.com    26600
http://plindenbaum.blogspot.com    24600
http://blog.ouseful.info    24300
http://www.vcasmo.com    24200
http://yz.mit.edu    23500
http://romainfrancois.blog.free.fr    22700
http://blog.revolutionanalytics.com    21000
http://robjhyndman.com    18400
http://freakonometrics.blog.free.fr    16100
http://perfdynamics.blogspot.com    15400
http://www.stubbornmule.net    14800
http://zoonek.free.fr    14800
http://jackman.stanford.edu    13900
http://www.bytemining.com    13700
http://learnr.wordpress.com    12600
http://tommy.chheng.com    12200
http://mazamascience.com    12000
http://www.investuotojas.eu    11500
http://www.r-statistics.com    11300
http://www.franklincenterhq.org    10800
http://gettinggeneticsdone.blogspot.com    10700
http://mpastell.com    9930
http://pineda-krch.com    9780
http://blog.saush.com    9220
http://www.premiersoccerstats.com    8950
http://developmentality.wordpress.com    7250
http://www.dataspora.com    7200
http://blog.hiremebecauseimsmart.com    7050
http://isomorphismes.tumblr.com    7040
http://www.mathfinance.cn    6930
http://blog.nguyenvq.com    6150
http://www.drewconway.com    5970
http://www.carlboettiger.info    5520
http://www.statisticsblog.com    5110
http://www.decisionsciencenews.com    4950
http://www.r-chart.com    4810
http://chartsgraphs.wordpress.com    4480
http://www.portfolioprobe.com    4410
http://procomun.wordpress.com    4330
http://jeromyanglim.blogspot.com    4080
http://spatialanalysis.co.uk    4080
http://www.theresearchkitchen.com    4080
http://www.forex-bloggers.com    4070
https://www.rmetrics.org    4050
http://princeofslides.blogspot.com    3900
http://www.cybaea.net    3740
http://www.cerebralmastication.com    3710
http://ygc.name    3670
http://ryouready.wordpress.com    3450
http://jeffreybreen.wordpress.com    3410
http://systematicinvestor.wordpress.com    3400
http://sgsong.blogspot.com    3310
http://industrialengineertools.blogspot.com    3290
http://www.r-tutor.com    3270
http://fishlab.ucdavis.edu    3270
http://ggorjan.blogspot.com    3250
http://blog.ynada.com    3220
http://farmacokratia.blogspot.com    3170
http://4dpiecharts.com    3130
http://heuristically.wordpress.com    3040
http://blog.rtwilson.com    2910
http://www.wekaleamstudios.co.uk    2890
http://www.dataists.com    2840
http://ikanb.wordpress.com    2750
http://shape-of-code.coding-guidelines.com    2730
http://onertipaday.blogspot.com    2710
http://blog.fosstrading.com    2700
http://blog.echen.me    2690
http://www.theusrus.de    2670
http://cloudnumbers.com    2630
http://paulbutler.org    2620
http://biostatmatt.com    2460
http://www.johnmyleswhite.com    2430
http://dataninja.wordpress.com    2360
http://realizationsinbiostatistics.blogspot.com    2340
http://statisfaction.wordpress.com    2300
http://uxblog.idvsolutions.com    2250
http://timelyportfolio.blogspot.com    2210
http://radfordneal.wordpress.com    2200
http://sas-and-r.blogspot.com    2200
http://pairach.com    2110
http://yusung.blogspot.com    2050
http://blog.flacso.edu.mx    2010
http://www.rensenieuwenhuis.nl    2000
http://michaeldhealy.com    1990
http://freigeist.devmag.net    1950
http://www.fernandohrosa.com.br    1920
http://statbandit.wordpress.com    1870
http://www.win-vector.com    1840
http://lukemiller.org    1830
http://ropensci.org    1720
http://www.eggwall.com    1650
http://benmazzotta.wordpress.com    1620
http://bms.zeugner.eu    1610
http://cartesianfaith.wordpress.com    1580
http://linkedscience.org    1570
http://stevemosher.wordpress.com    1550
http://intelligenttradingtech.blogspot.com    1520
http://www.imachordata.com    1480
http://blog.diegovalle.net    1470
http://jermdemo.blogspot.com    1430
http://nortalktoowise.com    1420
http://ekonometrics.blogspot.com    1340
http://digitheadslabnotebook.blogspot.com    1320
http://flyordie.sin.khk.be    1310
http://schamberlain.github.com    1230
http://gribblelab.org    1180
http://www.quantf.com    1130
http://offensivepolitics.net    1020
http://www.markmfredrickson.com    981
http://blog.mckuhn.de    948
http://erehweb.wordpress.com    889
http://confounding.net    886
http://simplystatistics.tumblr.com    875
http://www.babelgraph.org    859
http://csgillespie.wordpress.com    857
http://joewheatley.net    844
http://helmingstay.blogspot.com    843
http://theaverageinvestor.wordpress.com    825
http://quantitative-ecology.blogspot.com    785
http://zvfak.blogspot.com    776
http://ucfagls.wordpress.com    766
http://opendatagroup.com    760
http://cameron.bracken.bz    740
http://rtutorialseries.blogspot.com    738
http://opencpu.org    708
http://novicemetrics.blogspot.com    700
http://lamages.blogspot.com    680
http://nir-quimiometria.blogspot.com    679
http://tonybreyal.wordpress.com    677
http://brokeringclosure.wordpress.com    658
http://socialdatablog.com    643
http://dancingeconomist.blogspot.com    629
http://www.rtexttools.com    603
http://danganothererror.wordpress.com    589
http://thebiobucket.blogspot.com    567
http://holtmeier.de    531
http://val-systems.blogspot.com    519
http://thelogcabin.wordpress.com    489
http://dcemri.blogspot.com    484
http://rdatamining.wordpress.com    477
http://bridgewater.wordpress.com    460
http://www.rcasts.com    444
http://dsparks.wordpress.com    436
http://pr.cloudst.at    422
http://polstat.org    409
http://www.compmath.com    401
http://techno-realism.blogspot.com    399
http://www.backsidesmack.com    395
http://geotheory.org    393
http://miraisolutions.wordpress.com    367
http://econometricsense.blogspot.com    352
http://blog.binfalse.de    344
http://rforcancer.drupalgardens.com    317
http://blog.rstudio.org    316
http://mcfromnz.wordpress.com    309
http://www.quantumforest.com    309
http://blog.quanttrader.org    303
http://chrisladroue.com    293
http://www.michaelbommarito.com    289
http://procrun.com    280
http://mikeksmith.posterous.com    279
http://bio7.org    278
http://kbroman.wordpress.com    278
http://martynplummer.wordpress.com    272
http://bryer.org    268
http://www.funjackals.com    265
http://www.harlan.harris.name    252
http://www.milktrader.net    248
http://www.surefoss.org    241
http://rigorousanalytics.blogspot.com    231
http://www.jameskeirstead.ca    229
http://programming-r-pro-bro.blogspot.com    225
http://plausibel.blogspot.com    224
http://statistic-on-air.blogspot.com    217
http://mintgene.wordpress.com    212
http://moderntoolmaking.blogspot.com    205
http://quantitativeecology.blogspot.com    199
http://www.sigmafield.org    199
http://www.ancienteco.com    194
http://worldofrcraft.blogspot.com    191
http://rappster.wordpress.com    190
http://stotastic.com    189
http://evolvingspaces.blogspot.com    184
http://strugglingthroughproblems.blogspot.com    166
http://sharpstatistics.co.uk    161
http://leftcensored.skepsi.net    160
http://omegahat.wordpress.com    156
http://drunks-and-lampposts.com    155
http://amathew.com    152
http://onlinelabor.blogspot.com    147
http://johnramey.net    144
http://gossetsstudent.wordpress.com    138
http://tomhopper.wordpress.com    135
http://ggobi.blogspot.com    134
http://blog.fellstat.com    131
http://www.openanalytics.eu    130
http://www.numbertheory.nl    127
http://stats.blogoverflow.com    127
http://the-praise-of-insects.blogspot.com    122
http://lpenz.github.com    118
http://christophergandrud.blogspot.com    118
http://f.giorlando.org    112
http://bayesianbiologist.com    110
http://www.graphoftheweek.org    109
http://oneliner.soma20.com    109
http://inundata.org    107
http://geokook.wordpress.com    104
http://blog.datapunks.com    102
http://eranraviv.com    102
http://eranraviv.com    102
http://www.compbiome.com    101
http://www.techpolicy.ca    99
http://www.psychwire.co.uk    97
http://blog.carlislerainey.com    93
http://vasishth-statistics.blogspot.com    93
http://www.statsravingmad.com    93
http://using-r-project.blogspot.com    93
http://www.nikhilgopal.com    92
http://thedatamonkey.blogspot.com    92
http://jeffreyhorner.tumblr.com    90
http://menugget.blogspot.com    88
http://www.twotorials.com    88
http://dataexcursions.wordpress.com    84
http://viksalgorithms.blogspot.com    83
http://exploringdatablog.blogspot.com    81
http://sachaepskamp.com    81
http://aphysicistinwallstreet.blogspot.com    77
http://lastresortsoftware.blogspot.com    75
http://www.nomad.priv.at    72
http://applyr.blogspot.com    71
http://www.knowledgediscovery.jp    71
http://weitaiyun.blogspot.com    71
http://xmphforex.wordpress.com    71
http://statsadventure.blogspot.com    70
http://davenportspatialanalytics.squarespace.com    70
http://anandram.wordpress.com    69
http://rpint.wordpress.com    68
http://datadebrief.blogspot.com    66
http://blog.cloudstat.org    64
http://www.r-podcast.org    64
http://rmkrug.wordpress.com    62
http://denishaine.wordpress.com    61
http://expansed.com    58
http://r.andrewredd.us    57
http://isseing333.blogspot.com    57
http://solomonmessing.wordpress.com    57
http://rtricks.wordpress.com    57
http://anrprogrammer.wordpress.com    56
http://arungaikwad.wordpress.com    56
http://geolabs.wordpress.com    55
http://lookingatdata.blogspot.com    55
http://factbased.blogspot.com    54
http://severity.blogspot.com    54
http://swordofcrom.wordpress.com    53
http://librestats.wordpress.com    51
http://marcinkula.wordpress.com    51
http://gsoc2010r.wordpress.com    47
http://psyccomputing.blogspot.com    46
http://fabiomarroni.wordpress.com    45
http://jedifran.com    45
http://alstatr.blogspot.com    43
http://r-video-tutorial.blogspot.com    42
http://alexfarquhar.posterous.com    40
http://bmb-common.blogspot.com    40
http://rdataviz.wordpress.com    40
http://mypapertrades.blogspot.com    38
http://pitchrx.blogspot.com    38
http://simonmueller.net    38
http://statisfactions.wordpress.com    37
http://nzprimarysectortrade.wordpress.com    36
http://seanmulcahy.blogspot.com    36
http://www.speakingstatistically.com    35
http://joshpaulson.wordpress.com    34
http://learningrbasic.blogspot.com    34
http://mockquant.blogspot.com    33
http://costaleconomist.blogspot.com    32
http://rsnippets.blogspot.com    31
http://statmethods.wordpress.com    29
http://aviadklein.wordpress.com    28
http://obeautifulcode.com    28
http://blog.cloudst.at    24
http://rstats.posterous.com    23
http://notebookonthewebs.tumblr.com    22
http://0utlier.blogspot.com    21
http://gjkerns.github.com    21
http://eigensomething.blogspot.com    10
http://brocktibert.wordpress.com    9
http://toddjobe.blogspot.com    9
http://mickeymousemodels.blogspot.com    9
http://forgetfulfunctor.blogspot.com    9
http://rocknrblog.wordpress.com    9
http://dmbates.blogspot.com    8
http://blog.nextbiomotif.com    8
http://indiacrunchin.wordpress.com    8
http://blog.trenthauck.com    8
http://mikescnc.blogspot.com    8
http://jeroldhaas.blogspot.com    8
http://tlevine.tumblr.com    8
http://empty-moon-9726.heroku.com    8
http://www.proc-x.com    7
http://jointposterior.blogspot.com    7
http://gastonsanchez.wordpress.com    7
http://mlt-thinks.blogspot.com    7
http://rstats.wordpress.com    7
http://playingwithr.blogspot.com    7
http://scottmutchler.blogspot.com    6
http://iamdata.wordpress.com    6
http://sfchaos.blogspot.com    6
http://nightlordtw.wordpress.com    5
http://pleasepasstheroc.blogspot.com    5
http://wiekvoet.blogspot.com    5
http://d7.stattler.com    4
http://yetanotherrblog.blogspot.com    4
http://blog.iwanluijks.nl:80    3
https://rlearner.wordpress.com    3
http://margintale.blogspot.com    1

When checking the results manually I discovered slight deviations in the numbers and admittedly have no clue why this is. Sorry if any blog is under- or overrepresented due to such an error - please report!

Here is the R-script:

require(XML)
library(stringr)
library(RCurl)
library(xtable)

GoogleHits.1 <- function(input)
{
    url <- paste("https://www.google.com/search?q=\"",
                 input, "\"", sep = "")

    CAINFO = paste(system.file(package="RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
    script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
    doc <- htmlParse(script)
    res <- xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)[[2]]
    return(as.integer(gsub("[^0-9]", "", res)))
}

# Example (spaces in the search phrase are percent-encoded as %20):
GoogleHits.1("R%20Statistical%20Software")

###################### Begin get r-blogger's URLs: ###########################################
# get blogger urls with XML:
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")

# extract sensible blog urls:
# get ids for those with only 2 slashes (no 3rd in the end):
id <- which(nchar(gsub("[^/]", "", urls )) == 2)
slash_2 <- urls[id]

# find position of 3rd slash occurrence in strings:
slash_stop <- unlist(lapply(str_locate_all(urls, "/"),"[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)

# replace the ones with 2 slashes:
blogs <- slash_3; blogs[id] <- slash_2

# dismiss:
blogs <- blogs[blogs != "http://domain"]
###################### End get r-blogger's URLs: #############################

###################### Begin Google Search: ##################################
# with lapply Google complains about automated queries to the site..
# I'm blocked on the 300th request..
# unlist(lapply(blogs, GoogleHits.1))

# try splitting, doesn't work (blocked the same as before)
res1 <- unlist(lapply(blogs[1:170], GoogleHits.1))
res2 <- unlist(lapply(blogs[171:334], GoogleHits.1))

# try to do it in 2 sessions (saving first result), or manually re-connect host before second try:
df1 <- data.frame(Blogs = blogs[1:170], NoHits = res1, row.names = NULL)
save(df1, file = "df1.RData")
load("df1.RData"); unlink("df1.RData")

# second run:
df2 <- data.frame(Blogs = blogs[171:334], NoHits = res2, row.names = NULL)

# bind dfs, sort by NoHits:
finres <- as.data.frame(rbind(df1, df2)); finres$Blogs <- as.character(finres$Blogs)
(finres <- finres[order(finres$NoHits, decreasing = TRUE), ])

htmltab <- xtable(finres)
print(htmltab, type = "html", include.rownames=FALSE, file = "Bloggers.Google.Hits.htm")
###################### End Google Search #####################################

###################### Begin Plot: ###########################################
pdf("RBloggersWebPresence.pdf")
par(mar = c(4.5, 4.5, 3, 2), ylog = F)
plot(finres$NoHits, cex = 0.5, col = 3,
ylab = "No. of Hits in Google Search",
xlab = "Blogs", log = "y")
set.seed(19)
rid <- sample(13:nrow(finres), 15)
text(x = rid, y = finres$NoHits[rid],
labels = finres$Blogs[rid],
cex = 0.75, srt = 90, pos = 4, offset = -1)
title(main = "R-Bloggers' Web Presence")
dev.off()
###################### End Plot ##############################################
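
The comments in the script mention being blocked by Google after a few hundred requests. One common mitigation is to pause between requests and to tolerate individual failures. The sketch below is an assumption and not something the original script does; the helper name throttled_apply and the pause lengths are hypothetical, and pausing does not guarantee that automated queries will be accepted.

```r
# Hypothetical helper (not in the original script): apply a function to each
# element of x, sleeping a random interval between calls and recording NA
# whenever a call fails or returns nothing.
throttled_apply <- function(x, fun, min_pause = 2, max_pause = 5) {
  out <- rep(NA_integer_, length(x))
  for (i in seq_along(x)) {
    # [1] turns an empty result into NA instead of raising an error
    out[i] <- tryCatch(as.integer(fun(x[i]))[1],
                       error = function(e) NA_integer_)
    # Pause before the next request (skipped after the last element)
    if (i < length(x)) Sys.sleep(runif(1, min_pause, max_pause))
  }
  out
}

# Possible usage with the objects defined in the script above:
# res <- throttled_apply(blogs, GoogleHits.1)
```

Failed lookups come back as NA, so they can be retried later without losing the results already collected.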
