Social Media Mining and Bioinformatics (with R)

August 5, 2014
By

(This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers)

In June and July, I received copies of two books.

For the first one, two recent interesting books deal with the same topic. Reza Zafarani, Mohammad Ali Abbasi and Huan Liu published Social Media Mining: An Introduction last year. Actually, the book can be downloaded from dmml.asu.edu. And, of course, there is Matthew A. Russell’s Mining the Social Web. The main interest of this new book seems to be that it should be just perfect for R users!

Now, to be honest, among Twitter, Facebook, LinkedIn, Google+, etc., my main interest, so far, is Twitter (which I am starting to understand well, I believe). I would love to start working on LinkedIn’s connections but, so far, I have not tried. I was a bit disappointed by Matthew’s book, which does not simply present all the APIs, but on Twitter, I did not find many (interesting) applications. But I believe that I have a biased and partial view of the book, which goes way beyond my usual interests. An interesting review of Matthew’s book has been published, and I believe that you can really do a lot of things on social media with his book.

Now, to get back to the book I received, David Springate wrote a review on his blog that is extremely interesting (and I share most of his concerns). I did learn a lot of things about sentiment analysis and the tm package (for text mining), but I am clearly not an expert. On the other hand, I did not learn much about R (such as subtle points on manipulating strings and words). But I believe that it was not the goal of that book. By the way, the code can be found at https://github.com/SocialMediaMininginR, so anyone can play with it.

I should also probably mention that I would have expected less on the (basic) R language (such as plotting a histogram, or Anscombe’s regression) and more on the roots of sentiment analysis, for instance on the algorithm, or on pitfalls (with some examples, such as irony or sarcasm, which are rather common on Twitter). There is a (short) chapter 5 about the theory (or sort of), very brief, but we get hints about what is going on, and then there are applications in chapter 6. I would have preferred to have (in the same chapter) the theory, and then the code, with some comments. And maybe 120 pages, instead of 40. I have the feeling that several opportunities have been missed. It is clearly not the place to start mining social media, but combined with another book, it might be interesting (if you are already an R user).
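To make the sarcasm pitfall concrete, here is a minimal sketch of a lexicon-based sentiment score in base R. It is not taken from the book and it does not use the tm package: the word lists, the `score_sentiment` function and the three example tweets are all hypothetical, chosen only for illustration.

```r
# Toy lexicon-based sentiment scorer (base R only).
# The word lists are hypothetical and far too small for real use.
positive_words <- c("good", "great", "love", "nice", "perfect")
negative_words <- c("bad", "awful", "hate", "delayed", "broken")

score_sentiment <- function(text) {
  # Lower-case the text, then split on anything that is not a letter
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  # Score = number of positive words minus number of negative words
  sum(tokens %in% positive_words) - sum(tokens %in% negative_words)
}

# Made-up tweets: one positive, one negative, one sarcastic
tweets <- c(
  "I love this book, the examples are great",
  "The package is broken and the documentation is awful",
  "Great, my flight is delayed again. Just perfect."
)

scores <- vapply(tweets, score_sentiment, integer(1), USE.NAMES = FALSE)
print(scores)
```

The third tweet is clearly a complaint, yet it gets a positive score, because "great" and "perfect" outweigh "delayed". A plain word count has no way to see the irony, which is the kind of pitfall I would have liked the book to discuss.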

About the second one, I have to admit (one more time) that my expertise in bioinformatics is rather limited. There is a really nice ebook on a similar topic, by Avril Coghlan, entitled A Little Book of R For Bioinformatics, also available online, http://a-little-book-of-r-for-bioinformatics.readthedocs.org/. Nevertheless, the models mentioned here are the same as the ones I use in my research, or teach in my courses.

For instance, there is a chapter on Machine Learning (in Bioinformatics). Now, let’s be honest one more time: as claimed in the title, it is a cookbook. But it is a fair one. In the Machine Learning chapter, there is a section on cross validation. Let us look at it to see how it is structured (the structure is the same throughout the book).

We start with a brief introduction and description of the problem. Then comes a short paragraph about the dataset used.

Now, the core of the section is the following part,

(etc)

Here, we have the R code (with an introduction, to make sure we understand what we’re doing here).

Then, we have a wrap-up summary, where all the points are connected. But of course, alternative functions and packages can be used, and these are mentioned in the next paragraph.

And to conclude, there is a (really) brief list of references, to go further on theoretical aspects.
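For readers who have not met cross validation before, here is a minimal sketch of k-fold cross validation in base R. It is not the book's recipe: the simulated data, the linear model and the choice of 5 folds are assumptions made only to show the mechanics.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulated data: a hypothetical stand-in for the dataset used in the book
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

k <- 5
# Assign each observation to one of k folds, at random
fold <- sample(rep(seq_len(k), length.out = n))

cv_mse <- numeric(k)
for (i in seq_len(k)) {
  train <- dat[fold != i, ]   # fit on the k - 1 other folds
  test  <- dat[fold == i, ]   # evaluate on the held-out fold
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  cv_mse[i] <- mean((test$y - pred)^2)
}

print(cv_mse)        # out-of-sample mean squared error, fold by fold
print(mean(cv_mse))  # cross-validated estimate of the prediction error
```

Each observation is used exactly once for testing, so the average over the folds estimates the prediction error on data the model has not seen.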

Need to quickly find a function to get a ROC curve or to visualize clusters? I think that you will find an illustration in this book that lets you do it on your own. So I believe that it does the job. Now, just to be clear, 90% of the book is clearly outside my scope: I know nothing about “Protein Structure Analysis”, and even if someday I might be interested in learning more on that topic, so far, I do not really care. Nevertheless, I am currently facing a problem reading a .sql file in R, so I went through the book to see if I could find a technique to read such a file, but I could not find anything helpful.

