by Bob Horton, Data Scientist, Revolution Analytics
From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Masters of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need to manage and exploit this wealth of data in health care applications.
This spring, I am privileged to participate in this effort by developing and teaching a new course, “Statistical Computing for Biomedical Data Analytics”, intended to motivate and prepare students for further studies in data science, such as the intensive summer courses of the MSAN bootcamp. The syllabus is on github.
As you’ve probably guessed, we will be using R. Other courses in the curriculum use Python, which seems to be favored by engineers; in contrast, R was developed by and for statisticians. We want the students to be exposed to both perspectives, and to have the technical background needed to make use of the extensive repositories of code available from CRAN and Bioconductor.
Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts. R is a fantastic environment for this; you can see a bell-shaped curve emerge from an example as simple as
plot(0:20, choose(n=20, k=0:20))<br><br> <a class="asset-img-link" href="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb07dbcff1970d-pi" style="display: inline;"><img alt="Bob_plot" border="0" class="asset asset-image at-xid-6a010534b1db25970b01bb07dbcff1970d image-full img-responsive" src="http://revolution-computing.typepad.com/.a/6a010534b1db25970b01bb07dbcff1970d-800wi" title="Bob_plot"></a><br><br>
Note the expressive power of the vector of k values, and the easy convenience of having a world of statistical functions at your fingertips. Imagine how this little plot would have delighted Sir Francis Galton.
Data science is a journey. The enormous breadth of material and the rapid pace of development mean that the most important thing to learn is how to learn more. We’ll explore many fantastic resources for learning data science and R. For example, Coursera has excellent offerings, exemplified by the series of mini-courses from Johns Hopkins; our students will take at least one of these as a course project.
Of course, the R community itself is the biggest and most important resource. One class will be a field trip to a Bay Area useR Group (BARUG) meeting, and the comments in response to this post will be required reading. Ideas or suggestions regarding the syllabus or course materials from the github repository are welcome, as are observations or ruminations on the process of learning data science and R.
Finally, we are very interested in helping our students find outstanding internship opportunities in health-related organizations. Please don’t hesitate to contact me through [email protected] if you are interested in working with us. Stay tuned for progress reports.