What’s in my Pocket? (Part II) – Analysis of Pocket App Article Tagging

November 10, 2013
(This article was first published on everyday analytics, and kindly contributed to R-bloggers)

Introduction

You know what's still awesome? Pocket.

As I noted in an earlier post (oh god, was that really more than a year ago?!) I started using the Pocket application, previously known as Read It Later, in July of 2011 and it has changed my reading behavior ever since.

Lately I've been thinking a lot about quantified self and how I'm not really tracking anything anymore. Something noted at one of the Meetups is that data collection is really the hurdle: like anything in life - voting, marketing, dating, whatever - you have to make it easy, otherwise most people probably won't bother to do it. I'm pretty sure there's a psychological term for this - something involving the word 'threshold'.

That's where smartphones come in. Some people have privacy concerns about having all their data in the cloud (obviously I don't, as I'm willing to put myself on display in this blog) but that aside, one of the cool things about smartphone apps is that you are passively creating lots of data. Over time this results in a data set about you. And if you know how to pull that data you can analyze it (and hence yourself). I did this previously, for instance with my text messages and also with data from Pocket collected up to that time.

So let's give it a go again, but this time with a different focus for the analysis.

Background

This time I wasn't so interested in when I read articles and from where, but more so in the types of articles I was reading. In the earlier analysis, I summarized the types of things I was reading, but only by top-level domain of the site - and what resulted was a high-level overview of my online reading behavior.

Pocket added the ability for you to tag your articles. The tags are similar to labels in Gmail, so the relationship can be many-to-one: a single article can carry several tags. This provides a way to categorize your reading list (and archive), and for the purposes of this analysis, to analyze the articles accordingly.

First and foremost, we need the data (again). Unfortunately, over the course of the development of the Pocket application, the amount of data you can get easily via export (without using the API) has diminished. Originally the export was available as either XML or JSON, but those formats are no longer available.

However, you can still export your reading list as an HTML file, which still contains attributes in the link elements for the time the article was added and the tags it has attached.

Basically the export is quasi-XML, so it's a simple matter of writing some R code using the XML library to get the data into a format we can work with (CSV). The code is on GitHub (see Resources).

Here I extract the attributes and also create a column for each tag name with a binary value indicating whether the article had that tag (one of my associates at work would call this a 'classifier', though it's not the data science-y kind). Because I wrote this in a general enough fashion, you should be able to run the code on your own Pocket export and get the same kind of output.
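As a rough illustration of this step (not the original script, which lives in the GitHub repository listed under Resources), the extraction could be sketched as follows. The attribute names `time_added` and `tags`, the comma-separated tag format, the inline sample HTML and the output path are all assumptions made for the example.

```r
library(XML)

# Tiny stand-in for a Pocket HTML export. The attribute names (time_added, tags)
# and the comma-separated tag list are assumptions about the export format.
export_html <- paste0(
  '<html><body><ul>',
  '<li><a href="http://example.com/a" time_added="1351728000" tags="">Untagged article</a></li>',
  '<li><a href="http://example.com/b" time_added="1354320000" tags="data,psych">Tagged twice</a></li>',
  '<li><a href="http://example.com/c" time_added="1357000000" tags="data">Tagged once</a></li>',
  '</ul></body></html>'
)

# For a real export, pass the file path to htmlParse() instead of asText = TRUE
doc   <- htmlParse(export_html, asText = TRUE)
links <- getNodeSet(doc, "//a")

# One row per link, with the attributes pulled out as columns
articles <- data.frame(
  title      = sapply(links, xmlValue),
  url        = sapply(links, xmlGetAttr, "href"),
  time_added = as.POSIXct(as.numeric(sapply(links, xmlGetAttr, "time_added")),
                          origin = "1970-01-01", tz = "UTC"),
  tags       = sapply(links, xmlGetAttr, "tags", ""),
  stringsAsFactors = FALSE
)

# One 0/1 indicator column per distinct tag
tag_list   <- strsplit(articles$tags, ",", fixed = TRUE)
all_tags   <- sort(unique(unlist(tag_list)))
tag_matrix <- do.call(rbind, lapply(tag_list, function(x) as.integer(all_tags %in% x)))
colnames(tag_matrix) <- all_tags

pocket <- cbind(articles, tag_matrix)

# Hypothetical output location; point this wherever you want the CSV
out_file <- file.path(tempdir(), "pocket.csv")
write.csv(pocket, out_file, row.names = FALSE)
```

The resulting CSV has one row per article and one indicator column per tag, which is the shape the later tag analysis relies on.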

Now that we have some data we can plunk it into Excel and do some data visualization.

Analysis

First we examine the state of articles over time - what is the proportion of articles added over time which are tagged versus not?

Tagged vs. Untagged

You can see that initially I resisted tagging articles, but starting in November I adopted it and began tagging almost all articles added. And because stacked area graphs are not an especially good form of data visualization, here is a line graph of the number of articles tagged per month:


This better shows that I gradually adopted tagging from October into November. Another thing to note from this graph is that my Pocket usage peaked between November of last year and May of this year, after which the number of articles added per month decreased significantly (hence the previous graph being proportional).
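The charts in this post were built in Excel, but the same monthly summary can be sketched in R. The snippet below is only an illustration on made-up data: the column names `time_added` and `tags` are assumed, and an empty tag string is assumed to mean "untagged".

```r
# Toy data standing in for the parsed export (values are made up)
articles <- data.frame(
  time_added = as.POSIXct(c("2012-10-03", "2012-10-20", "2012-11-05",
                            "2012-11-18", "2012-12-01"), tz = "UTC"),
  tags = c("", "psych", "data,psych", "news", ""),
  stringsAsFactors = FALSE
)

# Month the article was added, and whether it carries at least one tag
articles$month  <- format(articles$time_added, "%Y-%m")
articles$status <- factor(articles$tags != "",
                          levels = c(TRUE, FALSE),
                          labels = c("tagged", "untagged"))

# Absolute counts per month (for a line graph) ...
counts <- table(articles$month, articles$status)

# ... and row-wise proportions (for a proportional stacked chart)
proportions <- prop.table(counts, margin = 1)

print(counts)
print(round(proportions, 2))
```

Keeping both tables around mirrors the two charts above: the proportions show adoption of tagging, while the counts show how overall usage rises and falls.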

Next we examine the number of articles by subject area. I've collected them into more-or-less meaningful groups and will explain the different tags as we go along. Note the changing scale on the y-axes for these graphs, as the absolute number of articles varies greatly by category.

Psych & Other Soft Topics
As I noted previously in the other post, when starting to use Pocket I initially read a very large number of psych articles.

I also read a fair number of "personal development" articles (read: self-helpish - mainly from The Art of Manliness), which have decreased greatly as of late. The purple are articles on communications, the light blue "parapsych", which is my catchall for new-agey articles relating to things like the zodiac, astrology, mentalism, mythology, etc. (I know it's all nonsense, but hey it's good conversation for dinner parties and the next category).

The big recent spike came from a cool site I found with lots of articles on the zodiac (see: The Barnum Effect). Most of these later got deleted.

Dating & Sex
Now that I have your attention... what, you don't read articles on sex? The Globe and Mail's life section has a surprising number of them. Also, if you read men's magazines online there are a lot, most of which are actually pretty awful. You can see too that articles on dating made up a large proportion of my reading back in the fall, also from those types of sites (which thankfully I now visit far less frequently).

News, etc.
This next graph is actually a bit busy for my liking, but I found this data set somewhat challenging to visualize overall, given the number of categories and how they change in time.


News is just that. Tech is mostly the internet and gadgets. Jobs is anything career related. Finance is both in the news (macro) and personal. Marketing is a newcomer.

Web & Data

The data tag relates to anything data-centric - as of late more applied to big data, data science and analytics. Interestingly my reading on web analytics preceded my new career in it (January 2013), just like my readings in marketing did - which is kind of cool. It also goes to show that if you read enough about analytics in general you'll eventually read about web analytics.

Data visualization is a tag I created recently, so it has very few articles - many of which I would have previously tagged with 'data'.

Life & Humanities



If that other graph was a little too busy this one is definitely so, but I'm not going to bother to break it out into more graphs now. Articles on style are of occasional interest, and travel has become a recent one. 'Living' refers mainly to articles on city life (mostly from The Globe as well as the odd one from blogto).

Work
And finally some newcomers, making up the minority, related to work:


SEO is search engine optimization and dev refers to development, web and otherwise.

Gee that was fun, and kind of enlightening. But tagging in Pocket is like labelling in Gmail - it is not one-to-one: a single article can carry many tags. So next I thought to try to answer the question: which tags are most related? That is, which tags are most commonly applied to articles together?

To do this we again turn to R. A short snippet, run on top of the previous code (also on GitHub, see Resources), does the trick.

All this does is remove the untagged articles from the tag frame and then run a correlation between each column of the tag matrix. I'm no expert on exotic correlation coefficients, so I simply used the standard (Pearson's). In the case of simple binary variables (true / false such as here), the internet informs me that this reduces to the phi coefficient.
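A minimal sketch of that step on a made-up tag table (the tag names and values below are purely illustrative, not the real data) also lets you check the phi coefficient claim by hand. The `phi()` helper is written here just for the comparison and is not part of the original code.

```r
# Made-up 0/1 tag indicators; the last row is an untagged article
tags <- data.frame(
  data  = c(1, 1, 0, 0, 1, 0),
  psych = c(0, 0, 1, 1, 0, 0),
  sex   = c(0, 0, 1, 0, 0, 0)
)

# Drop articles with no tags at all, then correlate every pair of tag columns
tagged_only <- tags[rowSums(tags) > 0, ]
tag_cor     <- cor(tagged_only)   # Pearson is the default method

# Phi coefficient computed directly from the 2 x 2 contingency counts
phi <- function(x, y) {
  n11 <- as.numeric(sum(x == 1 & y == 1))
  n10 <- as.numeric(sum(x == 1 & y == 0))
  n01 <- as.numeric(sum(x == 0 & y == 1))
  n00 <- as.numeric(sum(x == 0 & y == 0))
  (n11 * n00 - n10 * n01) /
    sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
}

print(round(tag_cor, 3))
print(phi(tagged_only$data, tagged_only$sex))   # matches tag_cor["data", "sex"]
```

One caveat: if a tag ends up with the same value on every remaining row, `cor()` returns NA for that column (with a warning), so very rare tags may need to be dropped first.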

Given there are 30 unique tags, this creates a 30 x 30 matrix, which is visualized below as a heatmap:


Redder is negative, greener is positive. I neglected to add a legend here as when not using ggplot or a custom function it is kind of a pain, but some interesting relationships can still immediately be seen. Most notably food and health articles are the most strongly positively correlated while data and psych articles are most strongly negatively correlated.
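For readers who want to reproduce a chart like this, here is one possible sketch using base R's `heatmap()`. It is not the code behind the figure above: the correlation values are made up, the red-white-green palette is only a guess at the colouring described, and the output path is hypothetical.

```r
# Made-up correlation matrix standing in for the real 30 x 30 one
tag_names <- c("data", "psych", "sex")
tag_cor <- matrix(c( 1.0, -0.6,  0.1,
                    -0.6,  1.0,  0.4,
                     0.1,  0.4,  1.0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(tag_names, tag_names))

# Diverging palette: red for negative, white near zero, green for positive
palette <- colorRampPalette(c("red", "white", "green"))(21)

# Hypothetical output file; drop the pdf()/dev.off() pair to draw on screen
plot_file <- file.path(tempdir(), "tag_heatmap.pdf")
pdf(plot_file)
heatmap(tag_cor,
        Rowv = NA, Colv = NA,   # keep the tag order, no dendrograms
        scale = "none",         # plot the correlations as they are
        col = palette,
        zlim = c(-1, 1))        # symmetric limits so white sits at zero
dev.off()
```

Fixing `zlim` to the full correlation range keeps the colours comparable between runs, which partly makes up for the missing legend.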

Other interesting relationships are that psych articles are negatively correlated with jobs, tech and web analytics (surprise, surprise) and positively correlated with communications, personal development and sex; news is positively correlated with finance, science and tech.
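Reading the strongest relationships off a heatmap by eye gets harder as the number of tags grows, so it can help to rank the pairs directly. The sketch below does this on a made-up correlation matrix; the numbers are illustrative only and are not the results reported above.

```r
# Made-up correlation matrix (illustrative values, not the real results)
tag_names <- c("data", "psych", "sex")
tag_cor <- matrix(c( 1.0, -0.6,  0.1,
                    -0.6,  1.0,  0.4,
                     0.1,  0.4,  1.0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(tag_names, tag_names))

# Each pair appears twice in a symmetric matrix, so keep the upper triangle only
idx <- which(upper.tri(tag_cor), arr.ind = TRUE)

pairs <- data.frame(
  tag_a = rownames(tag_cor)[idx[, "row"]],
  tag_b = colnames(tag_cor)[idx[, "col"]],
  r     = tag_cor[idx],
  stringsAsFactors = FALSE
)

# Sort from most negative to most positive correlation
pairs <- pairs[order(pairs$r), ]

print(head(pairs, 1))   # most negatively correlated pair
print(tail(pairs, 1))   # most positively correlated pair
```

With 30 tags this yields 435 pairs, so looking at only the first and last few rows is usually enough.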

Conclusion

All in all this was a fun exercise and I also learned some things about my reading habits which I already suspected - the amount I read (or at least save to read later) has changed over time as well as the sorts of topics I read about. Also some types of topics are far more likely to go together than others.

If I had a lot more time I could see taking this code and standing it up as some sort of generalized analytics web service (perhaps using Shiny if I were being really lazy) for Pocket users, if there was sufficient interest in that sort of thing.

Though it was still relatively easy to get the data out, I do wish that the XML/JSON export would be restored to provide easier access, for people who want their data but are not necessarily developers. Not being a developer, my attempts to use the new API for extraction purposes were somewhat frustrating (and ultimately unsuccessful).

Though apps often make our lives easier with passive data collection, all this information being "in the cloud" does raise questions of data ownership (and governance) and I do wish more companies, large and small, would make it easier for us to get a hold of our data when we want it.

Because at the end of the day, it is ultimately our data that we are producing - and it's the things it can tell us about ourselves that make it valuable to us.

Resources

Pocket - Export Reading List to HTML
http://getpocket.com/export

Pocket - Developer API
http://getpocket.com/developer/

Phi Coefficient
http://en.wikipedia.org/wiki/Phi_coefficient

The Barnum (Forer) Effect
http://en.wikipedia.org/wiki/Barnum_effect

Code on GitHub
http://github.com/mylesmharrison/pocket2csv_tagheatmap
