R-Bloggers’ Web-Presence

[This article was first published on theBioBucket*, and kindly contributed to R-bloggers.]

We love them, we hate them: RANKINGS!

Rankings are an inevitable tool to keep the human rat race going. In this regard I'll pick up my last two posts and have some fun with them by using them to analyse R-Bloggers' web presence. I will use the number of hits in Google Search as an indicator.

I searched for URLs like this: https://www.google.com/search?q="http://www.twotorials.com", meaning that only the exact blog URL is searched for.
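The exact-phrase search URL described above can also be assembled programmatically. The following is a minimal sketch, not the author's code: `build_query_url` is a hypothetical helper name, and percent-encoding the quoted address with `utils::URLencode` is an added assumption (the script in this post pastes the quotes in unencoded).

```r
# Hypothetical helper (not from the original post): build an exact-phrase
# Google search URL for a blog address.
build_query_url <- function(blog_url) {
  # Wrap the address in double quotes so it is searched as an exact phrase,
  # then percent-encode the query, including reserved characters like ':' and '/'.
  query <- utils::URLencode(paste0('"', blog_url, '"'), reserved = TRUE)
  paste0("https://www.google.com/search?q=", query)
}

build_query_url("http://www.twotorials.com")
```

Encoding the query keeps the quotes and slashes from being misread as part of the URL structure; whether the search engine treats the encoded and unencoded forms identically is not checked here.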

Blogs  NoHits
http://google-opensource.blogspot.com  82300
http://www.programmingr.com  73500
http://googleresearch.blogspot.com  58000
http://dirk.eddelbuettel.com  53000
http://borasky-research.net  33100
http://casoilresource.lawr.ucdavis.edu  32500
http://andrewgelman.com  30000
http://yihui.name  29600
http://xianblog.wordpress.com  27900
http://nsaunders.wordpress.com  27600
http://chem-bla-ics.blogspot.com  26600
http://plindenbaum.blogspot.com  24600
http://blog.ouseful.info  24300
http://www.vcasmo.com  24200
http://yz.mit.edu  23500
http://romainfrancois.blog.free.fr  22700
http://blog.revolutionanalytics.com  21000
http://robjhyndman.com  18400
http://freakonometrics.blog.free.fr  16100
http://perfdynamics.blogspot.com  15400
http://www.stubbornmule.net  14800
http://zoonek.free.fr  14800
http://jackman.stanford.edu  13900
http://www.bytemining.com  13700
http://learnr.wordpress.com  12600
http://tommy.chheng.com  12200
http://mazamascience.com  12000
http://www.investuotojas.eu  11500
http://www.r-statistics.com  11300
http://www.franklincenterhq.org  10800
http://gettinggeneticsdone.blogspot.com  10700
http://mpastell.com  9930
http://pineda-krch.com  9780
http://blog.saush.com  9220
http://www.premiersoccerstats.com  8950
http://developmentality.wordpress.com  7250
http://www.dataspora.com  7200
http://blog.hiremebecauseimsmart.com  7050
http://isomorphismes.tumblr.com  7040
http://www.mathfinance.cn  6930
http://blog.nguyenvq.com  6150
http://www.drewconway.com  5970
http://www.carlboettiger.info  5520
http://www.statisticsblog.com  5110
http://www.decisionsciencenews.com  4950
http://www.r-chart.com  4810
http://chartsgraphs.wordpress.com  4480
http://www.portfolioprobe.com  4410
http://procomun.wordpress.com  4330
http://jeromyanglim.blogspot.com  4080
http://spatialanalysis.co.uk  4080
http://www.theresearchkitchen.com  4080
http://www.forex-bloggers.com  4070
https://www.rmetrics.org  4050
http://princeofslides.blogspot.com  3900
http://www.cybaea.net  3740
http://www.cerebralmastication.com  3710
http://ygc.name  3670
http://ryouready.wordpress.com  3450
http://jeffreybreen.wordpress.com  3410
http://systematicinvestor.wordpress.com  3400
http://sgsong.blogspot.com  3310
http://industrialengineertools.blogspot.com  3290
http://www.r-tutor.com  3270
http://fishlab.ucdavis.edu  3270
http://ggorjan.blogspot.com  3250
http://blog.ynada.com  3220
http://farmacokratia.blogspot.com  3170
http://4dpiecharts.com  3130
http://heuristically.wordpress.com  3040
http://blog.rtwilson.com  2910
http://www.wekaleamstudios.co.uk  2890
http://www.dataists.com  2840
http://ikanb.wordpress.com  2750
http://shape-of-code.coding-guidelines.com  2730
http://onertipaday.blogspot.com  2710
http://blog.fosstrading.com  2700
http://blog.echen.me  2690
http://www.theusrus.de  2670
http://cloudnumbers.com  2630
http://paulbutler.org  2620
http://biostatmatt.com  2460
http://www.johnmyleswhite.com  2430
http://dataninja.wordpress.com  2360
http://realizationsinbiostatistics.blogspot.com  2340
http://statisfaction.wordpress.com  2300
http://uxblog.idvsolutions.com  2250
http://timelyportfolio.blogspot.com  2210
http://radfordneal.wordpress.com  2200
http://sas-and-r.blogspot.com  2200
http://pairach.com  2110
http://yusung.blogspot.com  2050
http://blog.flacso.edu.mx  2010
http://www.rensenieuwenhuis.nl  2000
http://michaeldhealy.com  1990
http://freigeist.devmag.net  1950
http://www.fernandohrosa.com.br  1920
http://statbandit.wordpress.com  1870
http://www.win-vector.com  1840
http://lukemiller.org  1830
http://ropensci.org  1720
http://www.eggwall.com  1650
http://benmazzotta.wordpress.com  1620
http://bms.zeugner.eu  1610
http://cartesianfaith.wordpress.com  1580
http://linkedscience.org  1570
http://stevemosher.wordpress.com  1550
http://intelligenttradingtech.blogspot.com  1520
http://www.imachordata.com  1480
http://blog.diegovalle.net  1470
http://jermdemo.blogspot.com  1430
http://nortalktoowise.com  1420
http://ekonometrics.blogspot.com  1340
http://digitheadslabnotebook.blogspot.com  1320
http://flyordie.sin.khk.be  1310
http://schamberlain.github.com  1230
http://gribblelab.org  1180
http://www.quantf.com  1130
http://offensivepolitics.net  1020
http://www.markmfredrickson.com  981
http://blog.mckuhn.de  948
http://erehweb.wordpress.com  889
http://confounding.net  886
http://simplystatistics.tumblr.com  875
http://www.babelgraph.org  859
http://csgillespie.wordpress.com  857
http://joewheatley.net  844
http://helmingstay.blogspot.com  843
http://theaverageinvestor.wordpress.com  825
http://quantitative-ecology.blogspot.com  785
http://zvfak.blogspot.com  776
http://ucfagls.wordpress.com  766
http://opendatagroup.com  760
http://cameron.bracken.bz  740
http://rtutorialseries.blogspot.com  738
http://opencpu.org  708
http://novicemetrics.blogspot.com  700
http://lamages.blogspot.com  680
http://nir-quimiometria.blogspot.com  679
http://tonybreyal.wordpress.com  677
http://brokeringclosure.wordpress.com  658
http://socialdatablog.com  643
http://dancingeconomist.blogspot.com  629
http://www.rtexttools.com  603
http://danganothererror.wordpress.com  589
http://thebiobucket.blogspot.com  567
http://holtmeier.de  531
http://val-systems.blogspot.com  519
http://thelogcabin.wordpress.com  489
http://dcemri.blogspot.com  484
http://rdatamining.wordpress.com  477
http://bridgewater.wordpress.com  460
http://www.rcasts.com  444
http://dsparks.wordpress.com  436
http://pr.cloudst.at  422
http://polstat.org  409
http://www.compmath.com  401
http://techno-realism.blogspot.com  399
http://www.backsidesmack.com  395
http://geotheory.org  393
http://miraisolutions.wordpress.com  367
http://econometricsense.blogspot.com  352
http://blog.binfalse.de  344
http://rforcancer.drupalgardens.com  317
http://blog.rstudio.org  316
http://mcfromnz.wordpress.com  309
http://www.quantumforest.com  309
http://blog.quanttrader.org  303
http://chrisladroue.com  293
http://www.michaelbommarito.com  289
http://procrun.com  280
http://mikeksmith.posterous.com  279
http://bio7.org  278
http://kbroman.wordpress.com  278
http://martynplummer.wordpress.com  272
http://bryer.org  268
http://www.funjackals.com  265
http://www.harlan.harris.name  252
http://www.milktrader.net  248
http://www.surefoss.org  241
http://rigorousanalytics.blogspot.com  231
http://www.jameskeirstead.ca  229
http://programming-r-pro-bro.blogspot.com  225
http://plausibel.blogspot.com  224
http://statistic-on-air.blogspot.com  217
http://mintgene.wordpress.com  212
http://moderntoolmaking.blogspot.com  205
http://quantitativeecology.blogspot.com  199
http://www.sigmafield.org  199
http://www.ancienteco.com  194
http://worldofrcraft.blogspot.com  191
http://rappster.wordpress.com  190
http://stotastic.com  189
http://evolvingspaces.blogspot.com  184
http://strugglingthroughproblems.blogspot.com  166
http://sharpstatistics.co.uk  161
http://leftcensored.skepsi.net  160
http://omegahat.wordpress.com  156
http://drunks-and-lampposts.com  155
http://amathew.com  152
http://onlinelabor.blogspot.com  147
http://johnramey.net  144
http://gossetsstudent.wordpress.com  138
http://tomhopper.wordpress.com  135
http://ggobi.blogspot.com  134
http://blog.fellstat.com  131
http://www.openanalytics.eu  130
http://www.numbertheory.nl  127
http://stats.blogoverflow.com  127
http://the-praise-of-insects.blogspot.com  122
http://lpenz.github.com  118
http://christophergandrud.blogspot.com  118
http://f.giorlando.org  112
http://bayesianbiologist.com  110
http://www.graphoftheweek.org  109
http://oneliner.soma20.com  109
http://inundata.org  107
http://geokook.wordpress.com  104
http://blog.datapunks.com  102
http://eranraviv.com  102
http://eranraviv.com  102
http://www.compbiome.com  101
http://www.techpolicy.ca  99
http://www.psychwire.co.uk  97
http://blog.carlislerainey.com  93
http://vasishth-statistics.blogspot.com  93
http://www.statsravingmad.com  93
http://using-r-project.blogspot.com  93
http://www.nikhilgopal.com  92
http://thedatamonkey.blogspot.com  92
http://jeffreyhorner.tumblr.com  90
http://menugget.blogspot.com  88
http://www.twotorials.com  88
http://dataexcursions.wordpress.com  84
http://viksalgorithms.blogspot.com  83
http://exploringdatablog.blogspot.com  81
http://sachaepskamp.com  81
http://aphysicistinwallstreet.blogspot.com  77
http://lastresortsoftware.blogspot.com  75
http://www.nomad.priv.at  72
http://applyr.blogspot.com  71
http://www.knowledgediscovery.jp  71
http://weitaiyun.blogspot.com  71
http://xmphforex.wordpress.com  71
http://statsadventure.blogspot.com  70
http://davenportspatialanalytics.squarespace.com  70
http://anandram.wordpress.com  69
http://rpint.wordpress.com  68
http://datadebrief.blogspot.com  66
http://blog.cloudstat.org  64
http://www.r-podcast.org  64
http://rmkrug.wordpress.com  62
http://denishaine.wordpress.com  61
http://expansed.com  58
http://r.andrewredd.us  57
http://isseing333.blogspot.com  57
http://solomonmessing.wordpress.com  57
http://rtricks.wordpress.com  57
http://anrprogrammer.wordpress.com  56
http://arungaikwad.wordpress.com  56
http://geolabs.wordpress.com  55
http://lookingatdata.blogspot.com  55
http://factbased.blogspot.com  54
http://severity.blogspot.com  54
http://swordofcrom.wordpress.com  53
http://librestats.wordpress.com  51
http://marcinkula.wordpress.com  51
http://gsoc2010r.wordpress.com  47
http://psyccomputing.blogspot.com  46
http://fabiomarroni.wordpress.com  45
http://jedifran.com  45
http://alstatr.blogspot.com  43
http://r-video-tutorial.blogspot.com  42
http://alexfarquhar.posterous.com  40
http://bmb-common.blogspot.com  40
http://rdataviz.wordpress.com  40
http://mypapertrades.blogspot.com  38
http://pitchrx.blogspot.com  38
http://simonmueller.net  38
http://statisfactions.wordpress.com  37
http://nzprimarysectortrade.wordpress.com  36
http://seanmulcahy.blogspot.com  36
http://www.speakingstatistically.com  35
http://joshpaulson.wordpress.com  34
http://learningrbasic.blogspot.com  34
http://mockquant.blogspot.com  33
http://costaleconomist.blogspot.com  32
http://rsnippets.blogspot.com  31
http://statmethods.wordpress.com  29
http://aviadklein.wordpress.com  28
http://obeautifulcode.com  28
http://blog.cloudst.at  24
http://rstats.posterous.com  23
http://notebookonthewebs.tumblr.com  22
http://0utlier.blogspot.com  21
http://gjkerns.github.com  21
http://eigensomething.blogspot.com  10
http://brocktibert.wordpress.com  9
http://toddjobe.blogspot.com  9
http://mickeymousemodels.blogspot.com  9
http://forgetfulfunctor.blogspot.com  9
http://rocknrblog.wordpress.com  9
http://dmbates.blogspot.com  8
http://blog.nextbiomotif.com  8
http://indiacrunchin.wordpress.com  8
http://blog.trenthauck.com  8
http://mikescnc.blogspot.com  8
http://jeroldhaas.blogspot.com  8
http://tlevine.tumblr.com  8
http://empty-moon-9726.heroku.com  8
http://www.proc-x.com  7
http://jointposterior.blogspot.com  7
http://gastonsanchez.wordpress.com  7
http://mlt-thinks.blogspot.com  7
http://rstats.wordpress.com  7
http://playingwithr.blogspot.com  7
http://scottmutchler.blogspot.com  6
http://iamdata.wordpress.com  6
http://sfchaos.blogspot.com  6
http://nightlordtw.wordpress.com  5
http://pleasepasstheroc.blogspot.com  5
http://wiekvoet.blogspot.com  5
http://d7.stattler.com  4
http://yetanotherrblog.blogspot.com  4
http://blog.iwanluijks.nl:80  3
https://rlearner.wordpress.com  3
http://margintale.blogspot.com  1

When checking the results manually I discovered slight deviations in the numbers and, admittedly, I have no clue why that is. Sorry if any blog is under- or over-represented because of this; please report it!

Here is the R-script:

library(XML)
library(stringr)
library(RCurl)
library(xtable)

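GoogleHits.1 <- function(input)
{
  # wrap the search term in double quotes to force an exact-phrase search
  url <- paste("https://www.google.com/search?q=\"",
               input, "\"", sep = "")

  # certificate bundle shipped with RCurl, used for the https request
  CAINFO <- paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
  script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)
  doc <- htmlParse(script)
  # the second child of the 'subform_ctrl' div is expected to hold the result-count text
  res <- xpathSApply(doc, "//div[@id='subform_ctrl']/*", xmlValue)[[2]]
  # strip everything but the digits to get the number of hits
  return(as.integer(gsub("[^0-9]", "", res)))
}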

# Example:
GoogleHits.1("R+Statistical+Software")

###################### Begin get r-blogger's URLs: ###########################################
# get blogger urls with XML:
script <- getURL("www.r-bloggers.com")
doc <- htmlParse(script)
li <- getNodeSet(doc, "//ul[@class='xoxo blogroll']//a")
urls <- sapply(li, xmlGetAttr, "href")

# extract sensible blog urls:
# get ids for those with only 2 slashes (no 3rd in the end):
id <- which(nchar(gsub("[^/]", "", urls )) == 2)
slash_2 <- urls[id]

# find position of 3rd slash occurrence in strings:
slash_stop <- unlist(lapply(str_locate_all(urls, "/"),"[[", 3))
slash_3 <- substring(urls, first = 1, last = slash_stop - 1)

# replace the ones with 2 slashes:
blogs <- slash_3; blogs[id] <- slash_2

# dismiss:
blogs <- blogs[blogs != "http://domain"]
###################### End get r-blogger's URLs: #############################
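# Alternative sketch (not part of the original script): the trimming above
# can also be done with one regular expression that keeps the scheme and
# host and drops any path. `base_url` is a hypothetical helper name.
base_url <- function(x) sub("^(https?://[^/]+).*$", "\\1", x)

# Example with made-up addresses; both are reduced to scheme + host:
base_url(c("http://example.com/blog/feed", "http://example.org"))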

###################### Begin Google Search: ##################################
# with lapply Google complains about automated queries to the site..
# I get blocked around the 300th request..
# unlist(lapply(blogs, GoogleHits.1))

# try splitting, doesn't work (blocked the same as before)
res1 <- unlist(lapply(blogs[1:170], GoogleHits.1))
res2 <- unlist(lapply(blogs[171:334], GoogleHits.1))

# try to do it in 2 sessions (saving the first result), or manually reconnect the host before the second try:
df1 <- data.frame(Blogs = blogs[1:170], NoHits = res1, row.names = NULL)
save(df1, file = "df1.RData")
load("df1.RData"); unlink("df1.RData")

# second run:
df2 <- data.frame(Blogs = blogs[171:334], NoHits = res2, row.names = NULL)

# bind dfs, sort by NoHits:
finres <- as.data.frame(rbind(df1, df2)); finres$Blogs <- as.character(finres$Blogs)
(finres <- finres[order(finres$NoHits, decreasing = TRUE), ])

htmltab <- xtable(finres)
print(htmltab, type = "html", include.rownames=FALSE, file = "Bloggers.Google.Hits.htm")
###################### End Google Search #####################################

###################### Begin Plot: ###########################################
pdf("RBloggersWebPresence.pdf")
par(mar = c(4.5, 4.5, 3, 2))
# hits per blog in ranked order, on a logarithmic y-axis
plot(finres$NoHits, cex = 0.5, col = 3,
     ylab = "No. of Hits in Google Search",
     xlab = "Blogs", log = "y")
# label 15 randomly chosen blogs from rank 13 downwards
set.seed(19)
rid <- sample(13:nrow(finres), 15)
text(x = rid, y = finres$NoHits[rid],
     labels = finres$Blogs[rid],
     cex = 0.75, srt = 90, pos = 4, offset = -1)
title(main = "R-Bloggers' Web Presence")
dev.off()
###################### End Plot ##############################################
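
The comments in the search section of the script note that Google starts blocking after a few hundred consecutive requests. One common mitigation is to pause between calls and to record `NA` for a failed request instead of aborting the whole run. The sketch below assumes that spacing the requests out helps, which this post does not test. `polite_lapply`, `pause`, `jitter` and `fake_hits` are hypothetical names introduced for illustration.

```r
# Sketch only: `polite_lapply` and its arguments are hypothetical names,
# not part of the original script.
polite_lapply <- function(x, fun, pause = 2, jitter = 1) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    # Record NA instead of stopping the whole run when one request fails.
    out[[i]] <- tryCatch(fun(x[[i]]), error = function(e) NA_integer_)
    # Wait a randomised interval (in seconds) before the next request.
    if (i < length(x)) Sys.sleep(pause + runif(1, 0, jitter))
  }
  unlist(out)
}

# Usage with a stand-in function so the example runs offline:
fake_hits <- function(u) if (nchar(u) == 0) stop("empty URL") else nchar(u)
polite_lapply(c("http://a.example", "", "http://bb.example"), fake_hits,
              pause = 0, jitter = 0)
```

In the script above, `polite_lapply(blogs, GoogleHits.1)` would take the place of `unlist(lapply(blogs, GoogleHits.1))`. Whether the default pause is long enough to avoid a block is untested.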
