Data Science MD July Recap: Python and R Meetup

August 8, 2013

(This article was first published on Data Community DC » R, and kindly contributed to R-bloggers)


For July’s meetup, Data Science MD was honored to have Jonathan Street of NIH and Brian Godsey of RedOwl Analytics come discuss using Python and R for data analysis.

Jonathan started off by describing the growing ecosystem of Python data analysis tools including Numpy, Matplotlib, and Pandas.

He next walked through a brief example demonstrating Numpy, Pandas, and Matplotlib that he made available with the IPython notebook viewer.

The second half of Jonathan’s talk focused on the problem of using clustering to identify scientific articles of interest. He needed to a) convert PDF to text b) extract sections of the document c) cluster and d) retrieve new material.

Jonathan used the PyPDF library for PDF conversion and then used the NLTK library for text processing. For a thorough discussion of NLTK, please see Data Community DC’s multi-part series written by Ben Bengfort.

Clustering was done using scikit-learn, which identified seven groups of articles. From these, Jonathan was then able to retrieve other similar articles to read.

Overall, by combining several Python packages to handle text conversion, text processing, and clustering, Jonathan was able to create an automated personalized scientific recommendation system. Please see the Data Community DC posts on Python data analysis tutorials and Python data tools for more information.

Next to speak was Brian Godsey of RedOwl Analytics who was presenting on their social network analysis. He first described the problem of identifying misbehavior in a financial firm. Their goals are to detect patterns in employee behavior, measure differences between various types of people, and ultimately find anomalies in behavior.

In order to find these anomalies, they model behavior based on patterns in communications and estimate model parameters from a data set and set of effects.

Brian then revealed that while implementing their solution they have developed a R package called rRevelation that allows a user to import data sets, create covariates, specify a model’s behavioral parameters, and estimate the parameter values.

To conclude his presentation, Brian demonstrated using the package against the well-known Enron data set and discussed how larger data sets requires using other technologies such as MapReduce.


Slides can be found here for Jonathan and here for Brian.

The post Data Science MD July Recap: Python and R Meetup appeared first on Data Community DC.

To leave a comment for the author, please follow the link and comment on their blog: Data Community DC » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training




CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)