# Data Analysis Steps

[This article was first published on **Data Perspective**, and kindly contributed to R-bloggers.]

Any data analysis project starts with identifying a business problem for which historical data exists. The business problem can be almost anything: a prediction problem, analyzing customer behavior, identifying new patterns from past events, building a recommendation engine, and so on.

The steps for solving a data analysis problem are outlined below, starting with the problem definition:

*“Define Problem statement”*

**Data Acquisition:**

*“Identify data sources”*

As a second step, all the data sources related to the problem statement are identified and pulled into a central repository. The data sources can range from SQL databases to text files, CSV files and online data. If the data is large, we may use Hadoop to pull, store and pre-process it.

**Process/Clean Data:**

*“The accuracy of the results of analysis depends on the quality of data”*

Data cleaning is considered one of the most important phases of data analysis, because the accuracy of the analysis depends on the quality of the data.

A few approaches:

- Formatting the data to suit the analytical tools we use
- Handling missing data
- Transforming the data, for example by normalizing it
- Identifying and handling outliers
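As a rough illustration of the cleaning approaches listed above, here is a minimal Python sketch that uses only the standard library. The sample values, the function names (`fill_missing`, `flag_outliers`, `min_max_normalize`), the median imputation and the 1.5 × IQR outlier rule are all assumptions chosen for the example; the right choices depend on the data and the problem.

```python
import statistics


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up sample: ages with one missing entry and one implausible value.
raw_ages = [23, 25, None, 31, 29, 27, 240, 26]

filled = fill_missing(raw_ages)                      # missing data handling
outliers = flag_outliers(filled)                     # outlier identification
cleaned = [v for v in filled if v not in outliers]   # outlier handling (drop)
normalized = min_max_normalize(cleaned)              # data transformation

print("outliers:", outliers)
print("normalized:", normalized)
```

Dropping outliers is only one way of handling them; capping them or investigating their source may be more appropriate in a real project.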

**Exploratory Analysis:**

*“Embrace the data visually before diving further”*

The objective of this step is to understand the main characteristics of the data. This analysis is generally done with visualization tools. Performing an exploratory analysis helps us:

- to understand the causes of an observed event
- to understand the nature of the data we are dealing with
- to assess the assumptions on which our analysis will be based
- to identify the key features in the data needed for the analysis

**Graphical techniques:** Scatter plots, box plots, histograms

**Quantitative techniques:** Mean, median, mode, standard deviation
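The quantitative techniques named above can be sketched in a few lines of Python with the standard `statistics` module. The sample values and the `text_histogram` helper are hypothetical and exist only for illustration; the helper is a crude text stand-in for a real histogram plot and assumes integer values.

```python
import statistics
from collections import Counter

# Made-up sample of order values, used only for illustration.
order_values = [12, 15, 15, 18, 21, 15, 30, 24, 18, 12]

summary = {
    "mean": statistics.mean(order_values),
    "median": statistics.median(order_values),
    "mode": statistics.mode(order_values),
    "stdev": statistics.stdev(order_values),  # sample standard deviation
}


def text_histogram(values, bin_width):
    """Count integer values per bin and render each bin as a row of '#'."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(counts):
        label = f"{start}-{start + bin_width - 1}"
        lines.append(f"{label:<6} {'#' * counts[start]}")
    return lines


for name, value in summary.items():
    print(f"{name}: {value:.2f}")
print("\n".join(text_histogram(order_values, bin_width=5)))
```

Even a rough view like this shows where the values cluster and whether any stand apart, which is the kind of insight the exploratory step is after.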

**Model Generation & Validation:**

*“Select-Train-Evaluate”*

This step involves extracting features from the data and feeding them into machine learning algorithms to build a model. The model is the solution proposed for the problem statement. The step has three parts: model selection, model training and model evaluation.

**Model Selection:** The model is chosen based on the type of business problem we are dealing with. For example, if the objective of the analysis is to predict a future numeric value, we build a regression model.

**Model Training:** After selecting the model, the entire dataset is divided into two parts: training data and test data. About three quarters of the data is fed as input to the model algorithm for training.

**Model Evaluation:** Once the model is built, the next step is to test and validate it. The data used for testing is the remaining quarter of the dataset, held out in the previous step.
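The select-train-evaluate cycle can be sketched in plain Python as follows. Everything here is an assumption made for the example: the synthetic data, the choice of a one-variable least-squares regression, the three-quarters training split with the remaining quarter held out for testing, and RMSE as the evaluation metric. The helper names `train_test_split`, `fit_line` and `rmse` are hypothetical, not a library API.

```python
import math
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and split it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_line(rows):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(rows, slope, intercept):
    """Root mean squared error of the fitted line on the given rows."""
    squared = [(y - (slope * x + intercept)) ** 2 for x, y in rows]
    return math.sqrt(sum(squared) / len(squared))


# Made-up data: y is roughly 2x + 1 with a small deterministic wobble.
data = [(x, 2 * x + 1 + (0.5 if x % 2 else -0.5)) for x in range(20)]

train, test = train_test_split(data)          # model training data vs. test data
slope, intercept = fit_line(train)            # train on the larger part only
test_error = rmse(test, slope, intercept)     # evaluate on the held-out part

print(f"slope={slope:.2f} intercept={intercept:.2f} test RMSE={test_error:.2f}")
```

The key design point is that the test rows play no part in fitting, so the error measured on them gives a fairer picture of how the model behaves on unseen data.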

**Visualize Results:**

*“Show the results visually”*

This is the final step of data analysis, in which the results of the model and the solution to the problem are presented, generally as visual plots and graphs.

A few visualization tools: D3.js, ggplot2, Tableau.

Please go through the tools/technologies and skill set required to learn Data Analysis here.
