Blog Archives

A quick look at #march11 / #saudi tweets

March 12, 2011

Well, so much for that #march11 #Saudi day of rage. Whether it was really the "tempest in a teacup" that Prince Al-Waleed suggested on CNBC (video below, transcript here) or not, the oil complex and Saudi markets seem to have shrugged …


Dataset: Wisconsin Union Protester Tweets #wiunion

February 21, 2011

I’ve been playing with Twitter data over the last week, archiving Algerian, Egyptian, Iranian, and Chinese tweets. I thought I’d bring the story a little closer to home this time by archiving tweets from Wisconsin Union protesters on the …


Tracking the Frequency of Twitter Hashtags with R

February 21, 2011

I’ve posted three examples of Twitter hashtag datasets in the last week: one on China, one on Iran, and one on Algeria. In order to build these datasets, I needed to obtain older tweets; this is slightly more difficult than …
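The counting behind those hashtag-frequency plots can be sketched in a few lines of Python (the timestamps below are made up for illustration; the post itself works in R):

```python
from collections import Counter
from datetime import datetime, timedelta

def count_per_bin(timestamps, minutes=5):
    """Count tweets per fixed-width time bin by flooring each
    timestamp to the start of its bin."""
    counts = Counter()
    for ts in timestamps:
        floored = ts - timedelta(minutes=ts.minute % minutes,
                                 seconds=ts.second,
                                 microseconds=ts.microsecond)
        counts[floored] += 1
    return counts

# Hypothetical timestamps parsed from an archived tweet file
stamps = [datetime(2011, 2, 20, 12, 3), datetime(2011, 2, 20, 12, 4),
          datetime(2011, 2, 20, 12, 9)]
print(count_per_bin(stamps))
```

Plotting the resulting counts against the bin start times gives the kind of tweets-per-5-minutes figure shown in the dataset posts.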


Dataset: Tweets from the Chinese Protests #cn220

February 20, 2011

Earlier this week, I posted a ~100k tweet dataset on the #25bahman protests in Iran. The corresponding figure of frequencies showed a strong presence on Twitter, with over 500 tweets per 5-minute period at peak. You can download the …


R Bloggers: The Site I Wish Existed in 2007

February 19, 2011

My first experience with R was in 2007, as a sophomore in undergrad. As part of a larger project on pricing day-ahead electricity futures, I wanted to cluster locational marginal price (LMP) data from the ISO-NE. Something like k-means is easy …
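For readers unfamiliar with that clustering step, here is a minimal k-means sketch in Python (the 2-D data points are invented; the project described above used R):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl
                     else centroids[j]
                     for j, cl in enumerate(clusters)]
    return centroids

# Two obvious groups of made-up 2-D points
data = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(sorted(kmeans(data, 2)))  # → [(0.0, 0.5), (10.0, 10.5)]
```

In R, the built-in `kmeans()` does all of this in one call, which is much of the language's appeal for this kind of work.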


Pre-processing text: R/tm vs. python/NLTK

February 16, 2011

Let’s say that you want to take a set of documents and apply a computational linguistic technique. If your method is based on the bag-of-words model, you probably need to pre-process these documents first by segmenting, tokenizing, stripping, stopwording, and …
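As a rough Python illustration of those pre-processing steps (the stopword list here is a tiny invented sample, not NLTK's or tm's):

```python
import re

# Tiny illustrative stopword list; real pipelines use NLTK's or tm's lists
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that"}

def preprocess(doc):
    """Lowercase, tokenize on letter runs (stripping punctuation),
    and drop stopwords: the basic bag-of-words pre-processing steps."""
    tokens = re.findall(r"[a-z']+", doc.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The quick brown fox is in the garden."))
# → ['quick', 'brown', 'fox', 'garden']
```

Both R/tm and python/NLTK wrap these same steps in library calls; the full post compares the two.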
