June 9th, 2017

Developing Predictive Analytics Solutions Using Agile/DevOps Techniques

Half of predictive analytics projects started by companies fail “because they aren’t completed within budget or on schedule, or because they fail to deliver the features and benefits that are optimistically agreed on at their outset.” [1] This story is very familiar in the software development world. [2] Agile techniques, complemented by DevOps methodologies, were developed to address some of the key challenges in bringing software projects to completion. In this post, I address one way to adapt these techniques for use in a data analytics project. [3]

CRISP-DM Methodology

The current standard [4] methodology for data science projects is the Cross Industry Standard Process for Data Mining (CRISP-DM) [5], illustrated in the circular diagram below. It captures the iterative nature of data science work.

Cross Industry Standard Process for Data Mining (CRISP-DM)

Source: [6]

The challenge with CRISP-DM is getting actionable results from the data science project – turning it into business processes and getting results out to decision makers. There are several potential traps in the methodology that can lead to project failure:

  1. Getting stuck in the data understanding-data preparation phase. Large data sets can be overwhelming, and the team can get lost trying to match the business use cases with the available data.
  2. Getting stuck in the data preparation-modeling loop. Without sufficient controls and a clear exit condition, this can become an infinite loop. There is no such thing as a perfect model, but it is hard to determine when the model is “good enough.”
  3. Never getting out of the main business understanding-to-evaluation loop. Data analytics projects can iterate many times over this entire loop and never break out into deployment.

One way to avoid these traps is to mesh the CRISP-DM methodology [7] with the results-driven Agile methodology and the integrated techniques from DevOps/DataOps [8].

Agile Methodology

I adapt the Agile Scrum Framework [9] to the needs of a data analytics project [5], mapping the roles and events onto the CRISP-DM methodology. The resources involved in the Scrum Framework are illustrated in the flow diagram below.

Flow diagram of Agile Scrum Framework

Source: [10]

The Scrum Team

Scrum/Product Owner

From the Scrum Guide:

“The Product Owner is responsible for maximizing the value of the product and the work of the Development Team. How this is done may vary widely across organizations, Scrum Teams, and individuals.” [9]

For a predictive analytics project, this is either the data science project sponsor or another member of the organizational leadership team. Having a product owner gives clear guidance and direction to the data science team and helps to keep the project focused on real business needs.

The Data Science Team

Typically called the “development team” in Agile guides, and more recently recast as a DevOps team, the data science team includes everyone who is working on the data science project. From [9]:

“The Development Team consists of professionals who do the work of delivering a potentially releasable Increment of “Done” product at the end of each Sprint. Only members of the Development Team create the Increment.”

Likewise, a data science (or DataOps) team consists of members with complementary skills [11] including:

  • Data engineers who are responsible for capturing, storing, and processing data;
  • Data scientists who work on the data cleaning and predictive modeling;
  • Business analysts who connect an understanding of the business with data understanding;
  • Platform administrators who work with the data engineers and data scientists to develop deployable products; and,
  • UX designers who work on the front-end data communication with the data product users.

The Scrum Master

A Scrum Master acts as the data science team guide and an interface between the data science team, the product owner, and the organization.

Scrum Events

The Scrum methodology breaks up the overall project into smaller pieces of work, known as sprints, with the goal of producing a potentially usable product at the end of each sprint.

“The heart of Scrum is a Sprint, a time-box of one month or less during which a “Done,” useable, and potentially releasable product Increment is created. Sprints best have consistent durations throughout a development effort. A new Sprint starts immediately after the conclusion of the previous Sprint.” [9]

The iterative nature of CRISP-DM doesn’t fit neatly into the more linear Agile Sprint. However, I map the key components of CRISP-DM onto Agile Sprints, focusing on creating a usable business product at the end of each sprint.

First Sprint

The goal of the first sprint is to reach a point where the team understands the business objectives and organizational data. From the CRISP-DM method:

“The first stage of the CRISP-DM process is to understand what you want to accomplish from a business perspective. Your organization may have competing objectives and constraints that must be properly balanced. The goal in this stage of the process is to uncover important factors that could influence the outcome of the project. Neglecting this step can mean that a great deal of effort is put into producing the right answers to the wrong questions.” [5]

Furthermore, the sprint should gather an initial collection of data sources including the tools required for data loading. [5]

This sprint is considered “Done” when the team presents a report describing the key business issues, an inventory of available data assets, a plan for answering the top business data questions, and a description of what success will look like.
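The four deliverables listed above can be treated as an explicit “Done” checklist. The following is only a minimal sketch of that idea in Python; the class name `FirstSprintReport` and its field names are hypothetical labels chosen for illustration and are not part of Scrum or CRISP-DM.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class FirstSprintReport:
    """Hypothetical container for the four first-sprint deliverables."""
    business_issues: str = ""    # description of the key business issues
    data_inventory: str = ""     # inventory of available data assets
    analysis_plan: str = ""      # plan for answering the top business data questions
    success_criteria: str = ""   # description of what success will look like

    def missing_sections(self) -> List[str]:
        # A section counts as missing if it is empty or only whitespace.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_done(self) -> bool:
        # The sprint is "Done" only when every section has content.
        return not self.missing_sections()


report = FirstSprintReport(
    business_issues="Customer churn is rising in one region.",
    data_inventory="CRM exports and billing history.",
)
print(report.is_done())           # False
print(report.missing_sections())  # ['analysis_plan', 'success_criteria']
```

Making the checklist explicit gives the product owner and the team the same unambiguous test for closing the sprint.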

Second Sprint

To front-load the entire data process, I combine several of the CRISP-DM stages into a single sprint with the goal of delivering a minimum viable predictive product at the end of the sprint. The combined CRISP-DM stages are:

  1. Data Preparation: perform data cleaning, enrichment, and feature engineering steps
  2. Modeling: select and assess modeling techniques, tune model parameters
  3. Evaluation: evaluate model performance against the business goals

This sprint is “done” when the team either has a model that performs at an acceptable level or has determined that the data is not sufficient to meet the business goals. In the case of an acceptable model, the goal of the sprint is to have the initial model ready for further testing and deployment into a production environment. When the data is not sufficient to meet the business goals, the sprint produces a report documenting the evidence for this outcome.
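One way to keep this sprint from turning into an open-ended modeling loop is to agree up front on a “good enough” threshold and an iteration budget. The sketch below illustrates that exit rule in Python under stated assumptions: `run_modeling_sprint`, `SprintOutcome`, the `good_enough` threshold, and the scores in the example are all hypothetical, and `evaluate_iteration` stands in for whatever prepare-model-evaluate pass the team actually runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SprintOutcome:
    acceptable: bool       # True if a model met the agreed business threshold
    best_score: float      # best evaluation score seen during the sprint
    iterations_used: int   # number of prepare-model-evaluate passes run


def run_modeling_sprint(
    evaluate_iteration: Callable[[int], float],
    good_enough: float,
    max_iterations: int,
) -> SprintOutcome:
    """Repeat prepare-model-evaluate passes until the score is good enough
    or the iteration budget (standing in for the sprint time-box) runs out."""
    best = float("-inf")
    for i in range(1, max_iterations + 1):
        score = evaluate_iteration(i)   # higher is assumed to be better
        best = max(best, score)
        if best >= good_enough:
            # Stop as soon as the model is good enough; do not chase perfection.
            return SprintOutcome(True, best, i)
    # Budget exhausted: report that the data did not support the goal.
    return SprintOutcome(False, best, max_iterations)


# Illustrative scores only, not real results.
scores = {1: 0.61, 2: 0.70, 3: 0.78}
outcome = run_modeling_sprint(lambda i: scores[i], good_enough=0.75, max_iterations=3)
print(outcome.acceptable, outcome.iterations_used)  # True 3
```

Either return value closes the sprint: an acceptable outcome moves on to deployment, and an unacceptable one becomes the evidence for the report described above.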

Third Sprint

If the test model developed in the second sprint meets the business goals, the goal of the third sprint is to get the model into production.

“In the deployment stage, you’ll take your evaluation results and determine a strategy for their deployment. If a general procedure has been identified to create the relevant model(s), this procedure is documented here for later deployment. It makes sense to consider the ways and means of deployment during the business understanding phase as well, because deployment is absolutely crucial to the success of the project. This is where predictive analytics really help to improve the operational side of your business.” [5]

The sprint is considered “done” when the team deploys a functional predictive analytics model in the production environment. At this point, the predictive analytics model can start to generate value for the business.

If the second sprint finds that the business goal cannot be met with the existing data, the third (and each successive) sprint starts back at the beginning, either selecting another business goal for evaluation or selecting a different set of data to work with.

Conclusion

Adopting this combination of Agile and CRISP-DM methodologies creates a framework for moving predictive analytics projects into the production environment where they can have a positive impact on the business. It will help teams break out of potential infinite loop traps and keep them focused on the overall goal: providing a positive return on investment for the business.

Martin John Madsen
Senior Data Analytics Consultant

Sources

[1] http://analytics-magazine.org/the-data-economy-why-do-so-many-analytics-projects-fail/, https://www.analyticsvidhya.com/blog/2016/05/8-reasons-analytics-machine-learning-models-fail-deployed/
[2] http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/delivering-large-scale-it-projects-on-time-on-budget-and-on-value
[3] http://www.kdnuggets.com/2017/04/librarian-scientist-alchemist-engineer-dataops.html, https://www.svds.com/tbt-successful-data-teams-are-agile-and-cross-functional/
[4] http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
[5] http://www.sv-europe.com/crisp-dm-methodology/
[6] https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
[7] http://www.kdnuggets.com/2017/02/real-world-results-agile-data-science-teams.html
[8] https://www.tamr.com/from-devops-to-dataops-by-andy-palmer/, https://en.wikipedia.org/wiki/Dataops
[9] https://www.scrumguides.org/scrum-guide.html
[10] http://agileforall.com/resources/introduction-to-agile/
[11] http://www.datasciencecentral.com/profiles/blogs/what-roles-do-you-need-in-your-data-science-team, http://www.kdnuggets.com/2015/08/3-components-successful-data-science-team.html
