Software engineer’s guide to getting started with data science

December 30, 2012

(This article was first published on Enterprise Software Doesn't Have to Suck, and kindly contributed to R-bloggers)

Many of my software engineer friends ask me about learning data science. There are many articles on this subject from renowned data scientists (Dataspora, Gigaom, Quora, Hilary Mason). This post captures my journey as a software engineer learning statistics and data visualization.

I’m midway through my five-year journey to become proficient in data science. My learning program has included self-learning (books, blogs, toy problems), projects at work, classroom training (Stanford), teaching and presentations, and conferences (UseR, Strata). Here’s what I’ve done so far, what worked and what didn’t…


a) Self-learning (2 – 4 months)

Explore if data science is for you

This is the key to getting started. Two years ago some of us at work formed a study group to review Stats 202 class material. This is what got me excited and started with data analytics. Only 2 of the 5 members of our study group chose to dive deeper into this field (data science is not for everyone).

b) Class-room training (9 – 12 months)

If you’re serious about learning, enroll in a formal program

If you’re serious about picking up this skill, opt for a course. The rigor of the class ensured that I didn’t slack. Stanford offers great coursework to get started, and it is far superior to the many week-long training courses I’ve attended…


a) Spend 100% of my time on data science

  • Once I was hooked on data science, it was difficult to build expertise while spending only 20% of my time on it. I needed to spend 100% of my time on it, so I found work problems related to data science (big data analysis, healthcare, marketing & sales and retail analytics, optimization problems).

b) Work on interesting problems

  • I aligned my learning goal with my passion. I found it energizing and engaging to solve interesting problems while learning new techniques. I was interested in retail, healthcare and sports (cricket) data analysis. 

c) Accelerate learning

I’m lucky to have access to internal and external experts in data science, and they’ve helped me understand their approach to data science problems (how they think, hypothesize, and test, assess and reject solutions). From them I’ve learned the importance of “hypothesis-driven data analysis” over “blind/brute-force data analysis”.

d) Learn business domains

Hypothesis-driven analysis highlighted the importance of understanding the business domain really well before trying to extract meaningful insights from the data. This led me to study operations research and marketing topics, as well as the retail, travel & logistics (revenue management) and healthcare industries. NY Times recently published an article highlighting the need for intuition.

What didn’t work



    • Learning multiple statistical tools: A year ago, I started getting work requests for SAS programming, so I wanted to learn it. I tried for a month or so but could not do it. The main reasons were learning inertia and my love for the statistical tool I already knew, R. I really didn’t need another statistical tool: I could solve most of my data science problems with R and the other software tools I knew. So my advice is that if you already know SAS, Stata, Matlab, SPSS or Statistica very well, stick with it. However, if you’re learning a new statistical tool, pick R. R is open source, while most of the others are commercial software (expensive and complex).
    • Auditing courses: I tried to follow self-paced coursework from Coursera and other MOOCs, but it wasn’t effective for me. I needed the routine and pressure of a formal course with proper grading to go through the rigor.
    • Increasing academic workload: Manage work-life balance and work commitments well. Earlier this year, I tried to take multiple difficult courses at the same time and quickly realized that I wasn’t enjoying them or learning as much as I should.
    • Sticking to the course textbook only: Many of the books in these classes are too “dense” for me (a software engineer), so I used other material to understand the concepts, for example Carnegie Mellon’s notes on regression.

    Comments, questions, suggestions are welcome!
