Big data (useR! 2011)

August 18, 2011

(This article was first published on Why? » R, and kindly contributed to R-bloggers)

Unfortunately, I missed the first and last talks.

My notes from a session on Thursday morning

J. Demmler – Challenges of working with a large database of routinely collected health data

The SAIL data bank holds over 1.9 billion (anonymous) entries. To use the data for research, they need to ensure that proper data security is observed, for example secure data transport. All analysis is done within a secure environment, and files are moved into that environment via an FTP client.

Why R? No advanced SQL options are available in DB2, so using R allows loops. R is also great for data pre-cleaning and is suitable for the heavy analysis. To connect to the SAIL database, they use the RODBC package. SQL queries are run from within R; however, the SQL scripts are kept in separate files since they are “reviewed”.
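The talk did not show code, so the following is only a minimal sketch of the workflow described: the SQL lives in its own (reviewable) file and R sends it to the database through RODBC. The helper names `read_sql_file` and `run_sql_file`, the DSN `"SAIL_DSN"`, the credentials and the file path are all assumptions made up for illustration.

```r
# Read a SQL script that is kept in its own file, so it can be reviewed
# separately from the R code. Returns the whole script as one string.
read_sql_file <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = "\n")
}

# Run the SELECT statement stored in `path` over an open RODBC channel.
# Hypothetical helper: not taken from the talk.
run_sql_file <- function(channel, path) {
  result <- RODBC::sqlQuery(channel, read_sql_file(path),
                            stringsAsFactors = FALSE)
  # For a SELECT, sqlQuery() returns a data frame on success and a
  # character vector of error messages on failure.
  if (!is.data.frame(result)) {
    stop(paste(result, collapse = "\n"))
  }
  result
}

# Usage (not run): needs the RODBC package and a configured ODBC source.
# The DSN, credentials and path below are placeholders.
# channel  <- RODBC::odbcConnect("SAIL_DSN", uid = "user", pwd = "password")
# patients <- run_sql_file(channel, "sql/select_patients.sql")
# RODBC::odbcClose(channel)
```

Keeping the query text out of the R script means the reviewed file is exactly what gets executed, while the loops and further analysis stay on the R side.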

Lots of errors in data, e.g. units.
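As an illustration of the kind of pre-cleaning this implies, here is a small sketch in R. The data and thresholds are invented for the example (the talk gave no details): heights that should be in metres, with some values apparently entered in centimetres.

```r
# Hypothetical toy data: heights meant to be in metres.
heights <- c(1.72, 168, 1.80, 175, NA, 0.02)

# Assumption: any value above 3 was recorded in centimetres.
looks_like_cm <- !is.na(heights) & heights > 3
cleaned <- ifelse(looks_like_cm, heights / 100, heights)

# Assumption: anything still outside 0.3 to 2.5 m is an error; set to NA.
implausible <- !is.na(cleaned) & (cleaned < 0.3 | cleaned > 2.5)
cleaned[implausible] <- NA

cleaned
```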

John Bryant – Demographic: classes and methods for data about populations

Existing data structures for population type data:

  • array: messy code;
  • data frames: not that natural for this type of code;
  • demography package: not really extensible.

Target audience for this new package: applied statisticians and social scientists, not programmers. Core to this package is the Demographic class: an S4 object, a specialized array with associated metadata.
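To make "specialized array with associated metadata" concrete, here is a minimal S4 sketch. It is not the package's actual Demographic class: the class name `DemoArray`, the `dimtypes` slot and the `collapseTo` generic are all hypothetical names chosen for this example.

```r
library(methods)

# Hypothetical class: an array plus one metadata label per dimension.
setClass(
  "DemoArray",
  contains = "array",
  slots = c(dimtypes = "character"),
  validity = function(object) {
    if (length(object@dimtypes) != length(dim(object))) {
      return("need exactly one dimtype per dimension")
    }
    TRUE
  }
)

# Hypothetical generic: sum the counts over every dimension except one.
setGeneric("collapseTo",
           function(object, dimtype) standardGeneric("collapseTo"))

setMethod("collapseTo", "DemoArray", function(object, dimtype) {
  margin <- match(dimtype, object@dimtypes)
  if (is.na(margin)) stop("unknown dimtype: ", dimtype)
  # Drop to a plain array before summing along the chosen dimension.
  plain <- array(as.vector(object), dim = dim(object),
                 dimnames = dimnames(object))
  apply(plain, margin, sum)
})

# Toy population counts by age group and sex (made-up numbers).
counts <- array(1:4, dim = c(2, 2),
                dimnames = list(age = c("0-4", "5-9"), sex = c("F", "M")))
pop <- new("DemoArray", counts, dimtypes = c("age", "sex"))

collapseTo(pop, "sex")
```

The point of the design is that users refer to dimensions by meaning ("sex", "age") rather than by position, which is where plain arrays tend to produce messy code.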

