Personal Analytics with RSS Feeds

February 7, 2014

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

I am currently working on a paper on Academic Blogging, based on my own experience. And I wanted to do something similar to Stephen Wolfram’s “personal analytics of my life”. More specifically, I wanted to understand when I post my blog entries. If I post more entries during office hours, then it should mean that, indeed, I consider my blog as a part of my job (which is something I believe, actually). On the other hand, if I post more in the evening, or in the middle of the night, then it could mean that my blog is clearly only for fun, and somehow outside the official academic time schedule.

With the help of @3wen, we have here a function that can read RSS feeds and extract the publication date (and other pieces of information, actually),

> library(XML)
> library(dplyr)
> baseRSS <- function(adresse){
+   # try() lets the function carry on when the page cannot be loaded
+   doc <- try(xmlTreeParse(adresse))
+   if(length(doc)>1){
+     lesArticles <- xpathApply(r <- xmlRoot(doc), "//item")
+     # title, link and publication date of one item of the feed
+     infosUneEntree <- function(x){
+       title <- sapply(xpathApply(x, "//title"), xmlValue)
+       links <- sapply(xpathApply(x, "//link"), xmlValue)
+       pubDate <- sapply(xpathApply(x, "//pubDate"), xmlValue)
+       return(cbind(title = title, links = links, pubDate = pubDate))
+     }
+     df <- lapply(lesArticles, infosUneEntree)
+     df <- data.frame(do.call("rbind", df))
+     return(df)
+   }
+   else{return(NA)}
+ }

The trick is that the page containing the RSS feed is truncated: you get only 30 posts (the latest ones). With WordPress, you can easily go further (thanks @3wen) using

> df.freak2 <- baseRSS("http://freakonometrics.hypotheses.org/feed?paged=2")
Namespace prefix dc on creator is not defined
Namespace prefix content on encoded is not defined
Namespace prefix wfw on commentRss is not defined
Namespace prefix slash on comments is not defined
> head(df.freak2)
                                         title
1       S\303\251ries chronologiques, syllabus
2 Copules et valeurs extr\303\252mes, syllabus
3         Jimmy, Mile End, et le Qu\303\251bec
4                Multivariate Archimax copulas
5                     Somewhere else, part 107
6     Informatique (sans ordinateur), partie 1
                                        links                         pubDate
1 http://freakonometrics.hypotheses.org/11593 Mon, 06 Jan 2014 00:31:52 +0000
2 http://freakonometrics.hypotheses.org/11595 Mon, 06 Jan 2014 00:31:21 +0000
3 http://freakonometrics.hypotheses.org/11362 Sun, 05 Jan 2014 03:33:31 +0000
4  http://freakonometrics.hypotheses.org/7673 Sat, 04 Jan 2014 11:01:05 +0000
5 http://freakonometrics.hypotheses.org/11584 Fri, 03 Jan 2014 15:34:29 +0000
6 http://freakonometrics.hypotheses.org/11138 Fri, 03 Jan 2014 07:15:03 +0000

and if we try to get a page that does not exist, we get the following error

> df.freakFaux <- baseRSS("http://freakonometrics.hypotheses.org/feed?paged=2000")
failed to load HTTP resource
Error : 1: failed to load HTTP resource

(Unfortunately, I could not do it with https://feeds.feedburner.com/, for instance.) With the following code, we can extract information about all the posts online on my blog

> df.freak <- NULL
> for(i in 1:2000){
+   df.tmp <- baseRSS(paste("http://freakonometrics.hypotheses.org/feed?paged=", i, sep = ""))
+   if(length(df.tmp)>1){
+     df.freak <- rbind(df.freak, df.tmp)
+   }else{ break }
+ }
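
The loop above grows df.freak with rbind at every iteration, which is fine for a few dozen pages. A possible variant, sketched below, collects the pages in a list and binds them once at the end. The names collect_pages, fetch and max_pages are mine, not from the original code; fetch is assumed to behave like baseRSS, that is, to return a data frame for an existing page and NA otherwise.

```r
# Hypothetical helper, not part of the original post.
# fetch is assumed to return a data frame for an existing page
# and something else (NA in baseRSS) when the page cannot be read.
collect_pages <- function(base_url, fetch, max_pages = 2000) {
  pages <- list()
  for (i in seq_len(max_pages)) {
    page <- fetch(paste0(base_url, i))
    if (!is.data.frame(page)) break   # first missing page: stop
    pages[[i]] <- page
  }
  # One rbind at the end; returns NULL when no page was read
  do.call(rbind, pages)
}
```

With the function defined earlier, the call would be collect_pages("http://freakonometrics.hypotheses.org/feed?paged=", baseRSS).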

All that is just fine. Now, let us write a small function to convert the date into some format I can use (here, I want to study the hour, as well as the week day).

> LD=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
> datahour=function(txt){
+ wd=substr(as.character(txt),1,3)
+ wdy=which(LD==wd)
+ y=substr(as.character(txt),13,16)
+ h=substr(as.character(txt),18,19)
+ mn=substr(as.character(txt),21,22)
+ T=as.numeric(h)+as.numeric(mn)/60
+ return(data.frame(weekday=wdy,time=T,year=as.numeric(y)))}

> datarss=function(df){
+ L=unlist(lapply(as.character(df$pubDate),datahour))
+ db=data.frame(
+ D=L[names(L)=="weekday"],
+ T=L[names(L)=="time"],
+ Y=L[names(L)=="year"])
+ return(db)}
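
The substr calls above depend on every date having exactly the same layout. As an alternative, here is a minimal sketch that splits the string on its separators and lets R compute the week day. The name parse_pubdate is mine, and, like the code above, it assumes that all dates are given with a +0000 offset (the offset is simply ignored).

```r
# Hypothetical helper, not part of the original post.
# Input: strings such as "Mon, 06 Jan 2014 00:31:52 +0000".
parse_pubdate <- function(txt) {
  # Split on spaces, commas and colons:
  # "Mon" "06" "Jan" "2014" "00" "31" "52" "+0000"
  parts <- do.call(rbind, strsplit(as.character(txt), "[ ,:]+"))
  # match() on month.abb avoids locale-dependent month parsing
  stamp <- ISOdatetime(
    year  = as.numeric(parts[, 4]),
    month = match(parts[, 3], month.abb),
    day   = as.numeric(parts[, 2]),
    hour  = as.numeric(parts[, 5]),
    min   = as.numeric(parts[, 6]),
    sec   = as.numeric(parts[, 7]),
    tz    = "UTC")
  lt <- as.POSIXlt(stamp, tz = "UTC")
  data.frame(
    D = (lt$wday + 6) %% 7 + 1,   # 1 = Monday, ..., 7 = Sunday
    T = lt$hour + lt$min / 60,    # decimal hour, from 0 to 24 (excluded)
    Y = lt$year + 1900)
}
```

The result has the same three columns (D, T, Y) as the output of datarss, so parse_pubdate(df.freak$pubDate) could be used in its place.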

The functions datahour and datarss extract the week day (from 1 for Monday to 7 for Sunday), the year, and the time in the day (continuous, from 0 till 24, excluded). With the following function we can see the proportion of posts per week day,

> hc=rev(heat.colors(100))
> weekday=function(db,yearinf=FALSE){
+ y=unique(db$Y)
+ if(yearinf==TRUE) y=y[-which.max(y)]
+ if(yearinf==FALSE) y=y[-c(which.max(y),which.min(y))]
+ L=NULL
+ for(i in y){
+ sB=subset(db,db$Y==i)
+ L=rbind(L,table(factor(sB$D,levels=1:7))/nrow(sB)*100)}
+ barplot(t(L[nrow(L):1,]),names=rev(y),col=hc)
+ }

(from the bottom to the top, Monday till Friday in light yellow, and Saturday and Sunday in light red). Here, on my own blog, it would be

> weekday(datarss(df.freak))
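
To read the numbers behind the bars, the same percentages can be computed as a table. This is only a sketch: the name weekday_share is mine, and it assumes a data frame with the columns D and Y produced by datarss. Fixing the seven levels makes a day with no post show up as 0 instead of disappearing from the table.

```r
# Hypothetical helper, not part of the original post.
# db is assumed to have columns D (1 = Monday, ..., 7 = Sunday) and Y (year).
weekday_share <- function(db) {
  days <- factor(db$D, levels = 1:7,
                 labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
  counts <- table(year = db$Y, day = days)
  # Percentages within each year (each row sums to 100, up to rounding)
  round(100 * prop.table(counts, margin = 1), 1)
}
```

A call such as weekday_share(datarss(df.freak)) would give one row per year and one column per week day.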

For the hour, it was slightly more technical (I could not find a decent and simple way to plot the graph I was looking for, so I did it by myself)

> hour=function(db,yearinf=FALSE){
+ y=unique(db$Y)
+ if(yearinf==TRUE) y=y[-which.max(y)]
+ if(yearinf==FALSE) y=y[-c(which.max(y),which.min(y))]
+ L=NULL
+ for(i in y){
+ sB=subset(db,db$Y==i)
+ if(i==2013) t=table(floor((sB$T+6)%%24))/nrow(sB)*100
+ if(i<2013)  t=table(floor(sB$T))/nrow(sB)*100
+ t=t[as.character(0:23)]
+ names(t)=as.character(0:23)
+ t[is.na(t)]=0
+ L=rbind(L,t)}
+ plot(y,rep(24,length(y)),ylim=c(-3,24),axes=FALSE,
+ xlim=c(min(y)-.5,max(y)+.5),xlab="",ylab="",col="white")
+ axis(2)
+ for(i in y){
+ text(i,-2,i)
+ for(j in 0:23){
+ polygon(c(i-.4,i-.4,i+.4,i+.4),
+ c(j,j+1,j+1,j),border=NA,col=hc[L[max(y)-i+1,j+1]/max(L)*98+1])
+ }}}

Just a short comment here. If you look at the code, there is a difference between 2013 and the years before. The reason is simple: in December 2012, I officially decided to migrate from my old blog to this new one. All the posts prior to December 2012 were initially published on the old blog, which was on Montréal (East Coast) time. And I have the feeling that my new blog is on European time. So I shifted the times by 6 hours. But the problem might be more complicated, actually.

> hour(datarss(df.freak))
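
One reason the time shift might be more complicated: adding 6 hours can push a post past midnight, so the week day should move as well, which the hour function does not do. A minimal sketch, assuming a data frame with the columns D and T produced by datarss (the name shift_time is mine; the year is left untouched, so a post shifted across New Year's Eve would keep its original year):

```r
# Hypothetical helper, not part of the original post.
# db is assumed to have columns D (1 = Monday, ..., 7 = Sunday)
# and T (decimal hour, from 0 to 24 excluded); offset is in hours.
shift_time <- function(db, offset) {
  shifted <- db$T + offset
  day_change <- floor(shifted / 24)          # -1, 0 or +1 when |offset| < 24
  db$T <- shifted %% 24                      # back into [0, 24)
  db$D <- (db$D - 1 + day_change) %% 7 + 1   # wrap around the week
  db
}
```

For instance, a post stamped Sunday 22:30 and shifted by 6 hours becomes Monday 04:30.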

Now, let us try to comment on those graphs. On the week days, I find it a bit scary to see that I spend so much time during the weekends on my (supposed to be) professional blog. As for the hours, I can explain 2013 easily. I usually spend most of my evenings working (on the blog, or on my courses, or on my research). But I try to avoid posting an entry at 2 a.m. So usually, I keep it until the morning; then, when I arrive at the office, I finalize the post and make it available.

To understand the difference with previous years, I should probably add a technical comment: the previous blog was on a dotclear platform. On dotclear, the publication time is not exactly the time the post went online: by default, it is the time the post was saved for the first time. So there might be some slight differences. I believe that previously, I started to work on a post in the afternoon, then I might spend some time on it in the evening, or even the day after, but when I published it, if I did not change the default settings, the publication time would be the afternoon, when I first saved the post.

Let us try on another blog… The problem is that it is quite difficult to get old entries from RSS feeds. Except with WordPress… So I tried to run the previous code on http://economix.blogs.nytimes.com/. The extraction is simple here.

But here again, I do have trouble with 2013. To be more specific, when I look at the feeds I get

while I have on my side

 501 Tue, 14 May 2013 04:01:29 +0000
 502 Mon, 13 May 2013 19:51:06 +0000
 503 Mon, 13 May 2013 04:01:46 +0000
 504 Fri, 10 May 2013 21:11:58 +0000
 505 Fri, 10 May 2013 18:45:48 +0000
 506 Fri, 10 May 2013 17:33:55 +0000
 507 Fri, 10 May 2013 13:00:41 +0000
 508 Fri, 10 May 2013 04:01:53 +0000

I have here a 4-hour difference I cannot explain. But it looks fine before 2013. If I use the previous code, with (in the loop)

+   df.tmp <- baseRSS(paste("http://economix.blogs.nytimes.com/feed/?paged=", i, sep = ""))

we can get, for instance, the following graph,

 

We do observe interesting dynamics here: I guess that previously people were working during the day, and then posting at the end of the day. It looks like, now, people work during the day, sometimes late in the evening, but wait till the next morning to post the entry. Just as I do, in order to read it one last time, with a fresh mind… Anyway, I still have to understand what happened in 2013, just to make sure that the data I extract can be used…
