Extracting data from Twitter for #machinelearningflashcards

[This article was first published on Jasmine Dumas' R Blog, and kindly contributed to R-bloggers].

I’m a fan of Chris Albon’s recent project #machinelearningflashcards on Twitter, where general topics and methodologies are drawn out with key takeaways. It’s a great approach to sharing machine learning concepts with everyone and a timely refresher for those of us who frequently forget algorithm basics.

I leaned heavily on Maëlle Salmon’s recent blog post on the Faces of #rstats Twitter as a tutorial for this attempt at extracting data from Twitter and downloading the #machinelearningflashcards images.

Source Repo for this work: jasdumas/ml-flashcards

Directions

  • Load libraries:

For this project I used rtweet to connect to the Twitter API and search for relevant tweets by hashtag, dplyr to filter and pipe things, stringr to clean up the tweet text, and magick to process the images.

Note: I previously ran into trouble when downloading ImageMagick and wrote up the errors and the approaches I tried, in case you fall into the same trap I did: https://gist.github.com/jasdumas/29caf5a9ce0104aa6bf14183ee1e3cd8

```r
library(rtweet)
library(dplyr)
library(magick)
library(stringr)
```
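Since a missing package (magick in particular) can stall the whole script, it may help to check what is installed before loading anything. The following is a minimal sketch in base R; the variable names `needed` and `not_installed` are only illustrative.

```r
# Packages used in this post; find the ones that are not installed yet.
needed <- c("rtweet", "dplyr", "stringr", "magick")
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_installed]

if (length(not_installed) > 0) {
  # Print a ready-to-copy install command for the missing packages.
  message(
    "Install first: install.packages(c(",
    paste0('"', not_installed, '"', collapse = ", "),
    "))"
  )
} else {
  message("All packages are available.")
}
```

This only confirms that the R packages can be found; it does not check the underlying ImageMagick system library.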
  • Get tweets for the hashtag and keep only the tweets from Chris Albon’s account:
```r
ml_tweets <- search_tweets("#machinelearningflashcards", n = 500, include_rts = FALSE) %>%
  filter(screen_name == "chrisalbon")

head(ml_tweets)
```
##   screen_name  user_id          created_at          status_id
## 1  chrisalbon 11518572 2017-05-02 16:32:20 859445463316963328
## 2  chrisalbon 11518572 2017-05-01 22:19:26 859170425921650689
## 3  chrisalbon 11518572 2017-05-01 22:11:26 859168412555132928
## 4  chrisalbon 11518572 2017-05-01 20:23:49 859141329879580672
## 5  chrisalbon 11518572 2017-04-28 21:07:10 858065073167777792
## 6  chrisalbon 11518572 2017-04-28 15:33:57 857981218754764800
##                                                                                   text
## 1 Chi-squared For Feature Selection #machinelearningflashcards https://t.co/Pxxa7NDYUS
## 2   Fundamental Theorem Of Calculus #machinelearningflashcards https://t.co/0aOJMYqVFM
## 3      Why is nearest neighbor lazy #machinelearningflashcards https://t.co/vvqX39oGks
## 4         Precision Recall Tradeoff #machinelearningflashcards https://t.co/rKT1d3gD1V
## 5      Singular Value Decomposition #machinelearningflashcards https://t.co/Sahq7AWqQR
## 6         How to avoid overfitting. #machinelearningflashcards https://t.co/uUnUG7Xljv
##   retweet_count favorite_count is_quote_status quote_status_id is_retweet
## 1             1             11           FALSE            <NA>      FALSE
## 2             3              6           FALSE            <NA>      FALSE
## 3             3             10           FALSE            <NA>      FALSE
## 4             6             25           FALSE            <NA>      FALSE
## 5             4             20           FALSE            <NA>      FALSE
## 6            45             83           FALSE            <NA>      FALSE
##   retweet_status_id in_reply_to_status_status_id
## 1              <NA>                         <NA>
## 2              <NA>                         <NA>
## 3              <NA>                         <NA>
## 4              <NA>                         <NA>
## 5              <NA>                         <NA>
## 6              <NA>                         <NA>
##   in_reply_to_status_user_id in_reply_to_status_screen_name lang
## 1                       <NA>                           <NA>   en
## 2                       <NA>                           <NA>   en
## 3                       <NA>                           <NA>   en
## 4                       <NA>                           <NA>   en
## 5                       <NA>                           <NA>   es
## 6                       <NA>                           <NA>   en
##                        source           media_id
## 1 Machine Learning Flashcards 859445461152800768
## 2 Machine Learning Flashcards 859170424256512000
## 3 Machine Learning Flashcards 859168410713808896
## 4 Machine Learning Flashcards 859141327270821888
## 5             Twitter for Mac 858065067903823872
## 6             Twitter for Mac 857981212857516032
##                                        media_url
## 1 http://pbs.twimg.com/media/C-1dF-fVoAAAHR0.jpg
## 2 http://pbs.twimg.com/media/C-xi8uNUwAAKBFx.jpg
## 3 http://pbs.twimg.com/media/C-xhHhLVYAAXUBP.jpg
## 4 http://pbs.twimg.com/media/C-xIfDfVwAA4xmm.jpg
## 5 http://pbs.twimg.com/media/C-h1og6UMAA7oCY.jpg
## 6 http://pbs.twimg.com/media/C-gpXghUwAAXB19.jpg
##                                                 media_url_expanded urls
## 1 https://twitter.com/chrisalbon/status/859445463316963328/photo/1 <NA>
## 2 https://twitter.com/chrisalbon/status/859170425921650689/photo/1 <NA>
## 3 https://twitter.com/chrisalbon/status/859168412555132928/photo/1 <NA>
## 4 https://twitter.com/chrisalbon/status/859141329879580672/photo/1 <NA>
## 5 https://twitter.com/chrisalbon/status/858065073167777792/photo/1 <NA>
## 6 https://twitter.com/chrisalbon/status/857981218754764800/photo/1 <NA>
##   urls_display urls_expanded mentions_screen_name mentions_user_id symbols
## 1         <NA>          <NA>                 <NA>             <NA>      NA
## 2         <NA>          <NA>                 <NA>             <NA>      NA
## 3         <NA>          <NA>                 <NA>             <NA>      NA
## 4         <NA>          <NA>                 <NA>             <NA>      NA
## 5         <NA>          <NA>                 <NA>             <NA>      NA
## 6         <NA>          <NA>                 <NA>             <NA>      NA
##                    hashtags coordinates place_id place_type place_name
## 1 machinelearningflashcards          NA       NA         NA         NA
## 2 machinelearningflashcards          NA       NA         NA         NA
## 3 machinelearningflashcards          NA       NA         NA         NA
## 4 machinelearningflashcards          NA       NA         NA         NA
## 5 machinelearningflashcards          NA       NA         NA         NA
## 6 machinelearningflashcards          NA       NA         NA         NA
##   place_full_name country_code country bounding_box_coordinates
## 1              NA           NA      NA                       NA
## 2              NA           NA      NA                       NA
## 3              NA           NA      NA                       NA
## 4              NA           NA      NA                       NA
## 5              NA           NA      NA                       NA
## 6              NA           NA      NA                       NA
##   bounding_box_type
## 1                NA
## 2                NA
## 3                NA
## 4                NA
## 5                NA
## 6                NA
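The download step depends on the media_url column, so a tweet without an attached image would presumably carry NA there. As a rough sketch of how such rows could be dropped with dplyr, the example below uses a small made-up data frame (`mock_tweets`, with illustrative values rather than real API results) that borrows three column names from the output above.

```r
library(dplyr)

# Mock rows shaped like a few columns of the search result (values are illustrative).
mock_tweets <- data.frame(
  screen_name = c("chrisalbon", "chrisalbon", "someone_else"),
  text = c(
    "Precision Recall Tradeoff #machinelearningflashcards",
    "A text-only tweet #machinelearningflashcards",
    "Great cards! #machinelearningflashcards"
  ),
  media_url = c("http://pbs.twimg.com/media/C-xIfDfVwAA4xmm.jpg", NA, NA),
  stringsAsFactors = FALSE
)

# Keep only the author's tweets that actually have an image attached.
with_media <- mock_tweets %>%
  filter(screen_name == "chrisalbon", !is.na(media_url)) %>%
  select(screen_name, text, media_url)

print(with_media)
```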
  • Extract the tweet text to use in the file name by removing the hashtag, URL link, and punctuation:
```r
ml_tweets$clean_text <- ml_tweets$text
ml_tweets$clean_text <- str_replace_all(ml_tweets$clean_text, "#[a-zA-Z0-9]{1,}", "") # remove the hashtag(s)
ml_tweets$clean_text <- str_replace(ml_tweets$clean_text, " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "") # remove the url link
ml_tweets$clean_text <- str_replace_all(ml_tweets$clean_text, "[[:punct:]]", "") # remove all punctuation, not just the first match
ml_tweets$clean_text <- str_trim(ml_tweets$clean_text) # drop leftover leading/trailing spaces
```
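To see what the cleaning does to an actual tweet, the steps can be wrapped in a small function and run on two of the texts from the output above. This is only a sketch: the helper name `clean_tweet_text()` is hypothetical, and the simpler URL pattern and the use of `str_squish()` are my substitutions rather than the exact expressions used in this post.

```r
library(stringr)

# Hypothetical helper: turn raw tweet text into a short, file-name-friendly label.
clean_tweet_text <- function(text) {
  text <- str_replace_all(text, "#[A-Za-z0-9_]+", "")  # drop hashtags
  text <- str_replace_all(text, "https?://\\S+", "")   # drop URLs (simplified pattern)
  text <- str_replace_all(text, "[[:punct:]]", "")     # drop punctuation
  str_squish(text)                                     # trim and collapse whitespace
}

sample_text <- c(
  "Chi-squared For Feature Selection #machinelearningflashcards https://t.co/Pxxa7NDYUS",
  "How to avoid overfitting. #machinelearningflashcards https://t.co/uUnUG7Xljv"
)

cleaned <- clean_tweet_text(sample_text)
print(cleaned)
```

Note that removing punctuation also removes the hyphen, so "Chi-squared" becomes "Chisquared" in the file name.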
  • Download the flashcard images from the media_url column, name each file after the cleaned tweet text, and save them into a folder:
```r
save_image <- function(df) {
  dir.create("data", showWarnings = FALSE) # make sure the output folder exists
  for (i in seq_len(nrow(df))) {
    # try() keeps the loop going when a single image fails to download
    image <- try(image_read(df$media_url[i]), silent = FALSE)
    if (!inherits(image, "try-error")) {
      image %>%
        image_scale("1200x700") %>%
        image_write(paste0("data/", df$clean_text[i], ".jpg")) # use df, not the global ml_tweets
    }
  }
  cat("Function complete...\n")
}
```
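Because the file names come straight from tweet text, two cards with the same title would overwrite each other, and an empty title would produce a file called ".jpg". One possible way to guard against that, sketched here in base R with a hypothetical helper `build_image_paths()`, is to build all the paths up front; the `data` folder name comes from this post, and everything else is an assumption.

```r
# Hypothetical helper: build one unique .jpg path per label inside `dir`.
build_image_paths <- function(labels, dir = "data") {
  # Keep only letters, digits, spaces, underscores and hyphens.
  safe <- trimws(gsub("[^A-Za-z0-9 _-]", "", labels))
  # Fall back to a placeholder when nothing usable is left.
  safe[is.na(safe) | safe == ""] <- "untitled"
  # make.unique() appends _1, _2, ... to repeated labels so nothing is overwritten.
  safe <- make.unique(safe, sep = "_")
  file.path(dir, paste0(safe, ".jpg"))
}

labels <- c("Precision Recall Tradeoff", "Precision Recall Tradeoff", "", "Bias/Variance")
paths <- build_image_paths(labels)
print(paths)

# On a re-run, only paths that are not on disk yet would need downloading.
to_download <- paths[!file.exists(paths)]
```

The `file.exists()` check at the end is one way to skip cards that were already saved when the script is run again later.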
  • Apply the function:
```r
save_image(ml_tweets)
```

At the end of this process you can view all of the #machinelearningflashcards in one place! Thanks to Chris Albon for his work on this, and I’m looking forward to re-running this script to gain additional knowledge from new #machinelearningflashcards that are developed in the future!
