(This article was first published on **Enterprise Software Doesn't Have to Suck**, and kindly contributed to R-bloggers)

I’m training some of my colleagues on Big’ish data analysis this week. Here’s how I’m running the class. Would love your ideas to make it better.

**CLASS OBJECTIVES (LEARNING OUTCOMES)**

After completion of the course, you will be able to:

- Understand concepts of data science, related processes, tools, techniques and path to building expertise
- Use Unix command line tools for file processing (awk, sort, paste, join, gunzip, gzip)
- Use Excel to do basic analysis and plots
- Write and understand R code (data structures, functions, packages, etc.)
- Explore a new dataset with ease (visualize it, summarize it, slice/dice it, answer questions related to dataset)
- Plot charts on a dataset using R

**CLASS PREREQUISITES**

- Good knowledge of basic statistics (min, max, avg, sd, variance, factors, quantiles/deciles, etc.)
- Familiarity with Unix OS

**CLASS TOPICS**

**A) Intro to data science**

- Explain data science and its importance, with examples of data-driven business functions: marketing ROI (MROI), mix optimization, picking IPL / fantasy teams, predictions
- Big data
  - Definition: data sets that no longer fit on a single disk, requiring compute clusters and the software and algorithms that go with them (e.g. map/reduce running on Hadoop).
  - Real big data problems: parallel computing, distributed computing, cloud, Hadoop, Cassandra
  - Most analysis isn’t Big Data. Business apps often deal with datasets that fit in Excel/Access.
- Products: desktop tools (Excel (Solver, What If), Access, SQL, SPSS, Stata, R, SAS), programming languages (Ruby, Python, Java) and the stats libraries in these languages, BI tools, etc.
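To make the map/reduce idea mentioned above a little more concrete, here is a minimal sketch of a word count done in that style with an ordinary Unix pipeline on one machine. It is only an analogy for how the stages fit together, not how Hadoop is actually run, and the input text is made up for illustration.

```shell
#!/bin/sh
# Toy word count in the map/reduce style, on a single machine.
# The input lines are made-up sample data, not from the class.
input='big data big ideas
small data big results'

counts=$(printf '%s\n' "$input" |
  tr -s ' ' '\n' |            # map: emit one word (key) per line
  sort |                      # shuffle: bring identical keys together
  uniq -c |                   # reduce: count each group of identical keys
  awk '{print $2 "\t" $1}')   # tidy up as word<TAB>count

printf '%s\n' "$counts"
```

On a real cluster the map and reduce stages run on many machines at once and the framework handles the sorting in between; the point here is just the shape of the computation.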

**B) Steps in data science**

- Acquire data: “obtaining the data”… databases, log files… exports, surveys, web scraping etc.
- Verify data
- Cleanse and transform data: outliers, missing values, dedupe, merge
- Explore data: the first step when dealing with a new data set needs to be exploratory in nature: what is actually in the data set? Summarize and visually inspect the entire data set.
  - What does the data look like? Summaries, cross-tabulation
  - What does knowing one thing tell me about another? Relationships between data elements
  - What the heck is going on?
- Visualize data
- Interact with data (not covered here): BI tools, custom dashboards, other tools (ggobi etc.)
- Archive data (not covered here)
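As a rough sketch of the verify, cleanse and explore steps above, the following uses only command line tools on a tiny file. The file name, column layout (team, player, runs) and values are all made up for illustration; a real data set would need its own checks.

```shell
#!/bin/sh
# Made-up sample file: team,player,runs
# It contains one duplicate row and one row with a missing value.
workdir=$(mktemp -d)
cat > "$workdir/scores.csv" <<'EOF'
team,player,runs
A,p1,40
A,p2,10
B,p3,70
B,p3,70
B,p4,
EOF

# Verify: count data rows, and rows where the runs column is empty.
rows=$(awk -F, 'NR > 1' "$workdir/scores.csv" | wc -l | tr -d ' ')
missing=$(awk -F, 'NR > 1 && $3 == ""' "$workdir/scores.csv" | wc -l | tr -d ' ')

# Cleanse: drop the header, rows with missing runs, and exact duplicates.
awk -F, 'NR > 1 && $3 != ""' "$workdir/scores.csv" | sort -u > "$workdir/clean.csv"

# Explore: min, max and average of runs over the cleaned rows.
summary=$(awk -F, '
  NR == 1 { min = $3 + 0; max = $3 + 0 }
  { v = $3 + 0; if (v < min) min = v; if (v > max) max = v; sum += v }
  END { printf "min=%d max=%d avg=%.1f", min, max, sum / NR }
' "$workdir/clean.csv")

# Explore: number of rows per team, a one-column cross-tabulation.
per_team=$(cut -d, -f1 "$workdir/clean.csv" | sort | uniq -c |
  awk '{print $2 "=" $1}' | paste -sd' ' -)

echo "rows=$rows missing=$missing"
echo "$summary"
echo "$per_team"
rm -r "$workdir"
```

The same questions (how many rows, what is missing, what do the summaries look like) carry over directly to R once the data is loaded there.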

**C) Skills needed for data science**

- Statistics: Concepts, approach, techniques
- Databasing: SQL
- Scripting language: Ruby, Python
- RegEx
- Visual design: Story telling with charts
- File handling: Unix preferred. awk, gzip, gunzip, paste, sort etc.
- Office tools: Excel (plugins like Solver, What If)
- Statistical tools: R, SAS, SPSS, Stata, MATLAB, etc.
- BI tools: QlikView, Tableau
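For a feel of the file handling tools listed above, here is a small sketch that compresses a file, streams it back out, and combines it with a second file using sort, join and paste. The file names and contents are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Made-up sample files; names and values are illustrative only.
workdir=$(mktemp -d)
printf '2,bob\n1,alice\n3,carol\n' > "$workdir/names.csv"
printf '3,30\n1,10\n2,20\n'       > "$workdir/scores.csv"

# gzip compresses in place (names.csv becomes names.csv.gz).
gzip "$workdir/names.csv"

# gunzip -c streams the decompressed data without recreating the file.
# join needs both inputs sorted on the join field, so sort first.
gunzip -c "$workdir/names.csv.gz" | sort -t, -k1,1 > "$workdir/names.sorted"
sort -t, -k1,1 "$workdir/scores.csv" > "$workdir/scores.sorted"

# join matches rows on the first field; -t, sets comma as the separator.
joined=$(join -t, "$workdir/names.sorted" "$workdir/scores.sorted")

# paste glues files side by side, line by line, with no key matching.
pasted=$(paste -d, "$workdir/names.sorted" "$workdir/scores.sorted" | head -n 1)

printf '%s\n' "$joined"
rm -r "$workdir"
```

Note the difference between the last two tools: join lines rows up by key, while paste simply trusts that line N of each file belongs together.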

**D) Learning R**

We will pick a tool to learn the concepts of data science. We will use R, a leading open source stats package.

- Why I started learning data science and picked R
- Curriculum for Intro to R (R has a steep learning curve; the purpose of this discussion is to get you started)

**E) Where to go from here?**

- Learn advanced techniques: sampling, predictions. Books, conferences
- Analyse your favorite dataset: e.g. Cricket data analysis
- Compete (Kaggle)
- Learn other tools (Excel Solver, SAS etc.)

**REFERENCE**

**Tutorials**

- Stats202 class
- UCLA’s mini course on R
- R intro
- R fundamentals
- R data import/export
- R-bloggers
- Web app integration
- RTips
- TBD

**Books**

- TBD
