CLASS OBJECTIVES (LEARNING OUTCOMES)
After completion of the course, you will be able to:
- Understand the concepts of data science, its processes, tools and techniques, and the path to building expertise
- Use Unix command line tools for file processing (awk, sort, paste, join, gunzip, gzip)
- Use Excel to do basic analysis and plots
- Write and understand R code (data structures, functions, packages, etc.)
- Explore a new dataset with ease (visualize it, summarize it, slice/dice it, answer questions related to dataset)
- Plot charts on a dataset using R
- Apply basic statistics (min, max, mean, standard deviation, variance, factors, quantiles/deciles, etc.)
- Work comfortably in a Unix environment
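As a rough illustration of the basic statistics listed above, here is a minimal sketch. It uses Python's standard library rather than R, purely for illustration, and the `summarize` helper and the sample numbers are made up for this example. It assumes the sample (n - 1) form of standard deviation and variance.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),            # sample standard deviation
        "variance": statistics.variance(values),   # sample variance
        "deciles": statistics.quantiles(values, n=10),  # 9 cut points
    }

# Made-up sample data, for illustration only
scores = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(scores)
for name, value in summary.items():
    print(name, value)
```

In R, the course's tool, `summary()`, `sd()`, `var()` and `quantile()` cover the same ground.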
A) Intro to data science
- Explain data science and its importance. Data-driven business functions e.g. MROI, mix optimization, IPL teams / fantasy teams, predictions
- Big data
  - Definition: data sets that no longer fit on a single disk and therefore require compute clusters with specialized software and algorithms (e.g. MapReduce running on Hadoop)
  - Real big data problems: parallel computing, distributed computing, cloud, Hadoop, Cassandra
  - Most analysis isn't big data. Business apps often deal with datasets that fit in Excel/Access
- Products: desktop tools (Excel with Solver and What-If, Access), SQL, SPSS, Stata, R, SAS, programming languages (Ruby, Python, Java) and their stats libraries, BI tools, etc.
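The map/reduce model mentioned in the big data definition above can be sketched on a single machine. This is a toy word count in plain Python, not the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the input lines are assumptions made for illustration. On a real cluster the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Made-up input, for illustration only
lines = ["big data is big", "data science uses data"]
counts = word_count(lines)
print(counts)
```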
B) Steps in data science
- Acquire data: obtain the data from databases, log files, exports, surveys, web scraping, etc.
- Verify data
- Cleanse and transform data: outliers, missing values, dedupe, merge
- Explore data: start any new data set with exploration to find out what it actually contains. Summarize it and visually inspect the entire data set
  - What does the data look like? Summaries, cross-tabulation
  - What does knowing one thing tell me about another? Relationships between data elements
  - What the heck is going on?
- Visualize data
- Interact with data (not covered here): BI tools, custom dashboards, other tools (GGobi etc.)
- Archive data (not covered here)
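Two of the steps above, cleansing and cross-tabulation, can be sketched in a few lines. This is a minimal plain-Python sketch rather than R, for illustration only. The records, the `cleanse` and `crosstab` helpers, and the 50-run band are all made up for this example. It covers only de-duplication and missing values, not outliers or merging.

```python
from collections import Counter

# Hypothetical raw records: (player, team, runs); None marks a missing value
raw = [
    ("A", "North", 40),
    ("B", "South", None),   # missing value
    ("A", "North", 40),     # exact duplicate
    ("C", "North", 75),
    ("D", "South", 10),
]

def cleanse(records):
    """Drop exact duplicates (keeping the first) and rows with missing values."""
    seen = set()
    clean = []
    for rec in records:
        if rec in seen or any(field is None for field in rec):
            continue
        seen.add(rec)
        clean.append(rec)
    return clean

def crosstab(records):
    """Cross-tabulate: count records per (team, runs band) pair."""
    return Counter(
        (team, "50+" if runs >= 50 else "<50") for _, team, runs in records
    )

clean = cleanse(raw)
table = crosstab(clean)
for cell, count in sorted(table.items()):
    print(cell, count)
```

In R, `unique()`, `na.omit()` and `table()` play roughly these roles.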
C) Skills needed for data science
- Statistics: Concepts, approach, techniques
- Databasing: SQL
- Scripting language: Ruby, Python
- Visual design: Storytelling with charts
- File handling: Unix preferred. awk, gzip, gunzip, paste, sort etc.
- Office tools: Excel (plugins like Solver, What If)
- Statistical tools: R, SAS, SPSS, Stata, MATLAB, etc.
- BI tools: QlikView, Tableau
D) Learning R
We will pick one tool to learn the concepts of data science: R, a leading open-source statistics package. I will also explain why I started learning data science and why I picked R.
Curriculum for Intro to R (R has a steep learning curve; the purpose of this discussion is to get you started)
E) Where to go from here?
- Learn advanced techniques: sampling, predictions. Resources: books, conferences
- Analyse your favorite dataset: e.g. Cricket data analysis
- Compete (Kaggle)
- Learn other tools (Excel Solver, SAS etc.)
- Stats202 class
- UCLA's mini course on R
- R intro
- R fundamentals
- R data import/export
- Web app integration