For July’s meetup, Data Science MD was honored to have Jonathan Street of NIH and Brian Godsey of RedOwl Analytics come discuss using Python and R for data analysis.
Jonathan started off by describing the growing ecosystem of Python data analysis tools, including NumPy, Matplotlib, and pandas.
He then walked through a brief example demonstrating NumPy, pandas, and Matplotlib, which he made available through the IPython notebook viewer.
The second half of Jonathan’s talk focused on the problem of using clustering to identify scientific articles of interest. He needed to a) convert PDFs to text, b) extract sections of each document, c) cluster the articles, and d) retrieve new material.
Jonathan used the PyPDF library for PDF conversion and then used the NLTK library for text processing. For a thorough discussion of NLTK, please see Data Community DC’s multi-part series written by Ben Bengfort.
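To make the section-extraction step concrete, here is a minimal sketch that uses only Python's standard library rather than NLTK. It is not Jonathan's code: the heading list, the `split_sections` helper, and the sample text are all assumptions made for illustration, and real articles label their sections far less consistently.

```python
import re

# Hypothetical heading names; real articles vary widely in how they label sections.
SECTION_HEADINGS = ["abstract", "introduction", "methods", "results",
                    "discussion", "references"]

# Assumption: a heading sits alone on its own line, optionally numbered ("2. Methods").
HEADING_RE = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?(" + "|".join(SECTION_HEADINGS) + r")[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text):
    """Return a dict mapping each lower-cased heading to the text under it."""
    matches = list(HEADING_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        start = match.end()
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[start:end].strip()
    return sections


# Invented sample standing in for text already converted from a PDF.
sample = """Abstract
We study widgets.

1. Introduction
Widgets are everywhere.

2. Methods
We counted widgets.
"""

sections = split_sections(sample)
print(sorted(sections))
```

Once the text is split this way, a single section such as the abstract can be passed on to the text-processing step on its own.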
Clustering was done using scikit-learn, which identified seven groups of articles. From these, Jonathan was then able to retrieve other similar articles to read.
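The retrieval idea can be sketched without scikit-learn. The example below is an assumption-laden illustration, not the code from the talk: it skips the clustering itself and shows only how candidate articles might be ranked against one group of articles, using TF-IDF weighting and cosine similarity to the group's centroid. All function names and the toy titles are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Turn each document into a {term: weight} dict using a smoothed TF-IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average a list of sparse vectors term by term."""
    total = Counter()
    for vector in vectors:
        for term, weight in vector.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


# Invented toy titles: one group of articles of interest, plus new candidates.
liked = ["protein folding simulation with molecular dynamics",
         "molecular dynamics of protein binding"]
candidates = ["deep learning for image classification",
              "simulation of protein dynamics in membranes",
              "survey of stock market volatility"]

vectors = tfidf_vectors(liked + candidates)
liked_centroid = centroid(vectors[:len(liked)])

# Rank candidates by how close they are to the group of liked articles.
ranked = sorted(zip(candidates, vectors[len(liked):]),
                key=lambda pair: cosine(liked_centroid, pair[1]),
                reverse=True)
best_match = ranked[0][0]
print(best_match)
```

In a pipeline like the one described, the group would come from the clustering step rather than being hand-picked, and the candidates would be newly published articles.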
Overall, by combining several Python packages to handle text conversion, text processing, and clustering, Jonathan was able to create an automated personalized scientific recommendation system. Please see the Data Community DC posts on Python data analysis tutorials and Python data tools for more information.
Next to speak was Brian Godsey of RedOwl Analytics, who presented the company's social network analysis work. He first described the problem of identifying misbehavior in a financial firm. Their goals are to detect patterns in employee behavior, measure differences between various types of people, and ultimately find anomalies in behavior.
To find these anomalies, they model behavior based on patterns in communications and estimate the model parameters from a data set and a set of effects.
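The details of RedOwl's model are not covered here, so the following is only a toy stand-in for the general idea of estimating parameters from communication data and flagging departures from them. It swaps in a much simpler technique, a leave-one-out z-score on daily message counts, and the names, data, and threshold are all invented for illustration.

```python
import statistics

# Hypothetical daily counts of messages sent by each employee (invented data).
daily_counts = {
    "alice": [12, 11, 13, 12, 14, 12, 40],
    "bob": [5, 6, 5, 7, 6, 5, 6],
}


def flag_anomalies(counts, threshold=3.0):
    """Return the indices of days whose count is far from the other days' baseline."""
    flagged = []
    for day, value in enumerate(counts):
        # Estimate the baseline parameters from every day except the one being tested.
        baseline = counts[:day] + counts[day + 1:]
        mean = statistics.mean(baseline)
        spread = statistics.stdev(baseline)
        if spread > 0 and abs(value - mean) / spread > threshold:
            flagged.append(day)
    return flagged


anomalies = {name: flag_anomalies(counts) for name, counts in daily_counts.items()}
print(anomalies)
```

A real behavioral model would account for who is communicating with whom and for covariates describing the people involved, which this sketch ignores entirely.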
Brian then revealed that while implementing their solution, they developed an R package called rRevelation that allows a user to import data sets, create covariates, specify a model’s behavioral parameters, and estimate the parameter values.
To conclude his presentation, Brian demonstrated the package on the well-known Enron data set and discussed how larger data sets require other technologies such as MapReduce.