As more and more organizations are setting up teams of data scientists to make sense of the massive amounts of data they collect, the need grows for a standardized process for managing the work of those teams. To help with this, the data science team at Microsoft has drawn on their experience with large-scale data science projects to develop the Team Data Science Process. The process is built around this data science lifecycle:
The Team Data Science Process proposes a standardized directory structure for managing the data, code and documents of a data science project, and provides for tracking those artifacts with a version control system such as Git. It also proposes a shared, distributed analytics infrastructure to provide the computational and storage resources that the data scientists' tools rely on. In addition, it provides two open-source utilities to support data scientists:
- IDEAR (Interactive Data Exploration, Analytics and Reporting), an interactive data exploration tool based on R and RStudio
- The Automated Modeling and Reporting tool, which automates and standardizes the process of creating reports describing the results of statistical models created with R's caret package.
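To make the idea of a standardized, version-controlled project layout concrete, here is a minimal Python sketch. It is not the Team Data Science Process template itself: the folder names `data`, `code` and `docs` and the function `scaffold_project` are hypothetical, chosen only to mirror the "data, code and documents" mentioned above, and the real template may be organized differently.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List

# Hypothetical folder names mirroring "data, code and documents";
# the actual Team Data Science Process template may use other names.
PROJECT_DIRS = ("data", "code", "docs")


def scaffold_project(root: Path, init_git: bool = False) -> List[Path]:
    """Create the same set of folders under `root` and return their paths."""
    created = []
    for name in PROJECT_DIRS:
        folder = root / name
        folder.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (folder / ".gitkeep").touch()
        created.append(folder)
    if init_git:
        # Requires the git command-line client to be installed and on PATH.
        subprocess.run(["git", "init", str(root)], check=True)
    return created


if __name__ == "__main__":
    # Demonstrate on a throwaway directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in scaffold_project(Path(tmp) / "my_project"):
            print(folder.name)
```

Because every project starts from the same layout, team members know where to look for data, code and documents, and passing `init_git=True` puts the new project under version control from the start.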
You can find more background on the Team Data Science Process in this blog post, and you can also watch this presentation by the developers of the process at the Data Science Summit, embedded below.
You can download the various artifacts of the Team Data Science Process (and even suggest improvements via a pull request) at the GitHub repository linked below.
GitHub (Azure): Team Data Science Process from Microsoft