Broadly defined, agile project management is an iterative process that focuses on customer value first, team interaction over tasks, and adapting to current business reality rather than following a prescriptive plan. Agile project management is based on the same organizational practices and key principles found in the Agile Manifesto. However, the principles go beyond software development. It’s a mindset for people who need a management approach that builds consensus quickly in a fast-paced environment.
Bringing agility into big data (and small data) analytics, however, has been a challenge for many very bright and talented data scientists and engineers. The reasons are similar to the difficulty in adopting agile application software development: mainly organizational culture and team dynamics (with individual accountability or hierarchy-driven work). In the space of agile analytics, the problem is further amplified as the stakeholders often go beyond IT to include marketing and other executives.
In this article we’ll explore what makes analytics different from application development, and how to adapt agile principles and practices to the nuances of analytics.
Agile analytics is all about failing fast. In a scientific context, this means stating hypotheses and seeking to disprove them using quantitative analysis of real data in rapid cycles. The end result is actionable insight and maximized value to the business.
Agile analytics is an umbrella term that covers a wide swath of tools and techniques, all rooted in rigorous statistics. A data science team is not only able to solve IT problems, but can answer any quantitative question that is at the heart of a business.
It’s worth noting that the nature of agile analytics is fundamentally different from that of application development. Application software development, for example, receives user requirements that can be tweaked based on feedback from a client. Data sources for applications are clear.
An agile analytics project, by contrast, would have problem statements with varying levels of clarity and a number of hypotheses that are expressly discarded or pursued based on what the data reveals. Sometimes the data would be insufficient and might require some massaging. A rejected hypothesis, for instance, might not yield a concrete deliverable that the client can put in front of their users, but it still takes an investment of time to evaluate and reject that hypothesis. This dynamic is rare in application software development, where a client requirement is seldom rejected as invalid after a week or so has been invested in attempting to realize it.
The Difference Between Developing Agile Apps and Agile Analytics
To highlight the differences between these two kinds of projects, take an online book seller. Compare an analytics project to build a recommendations engine that powers the site with an application software project to put the same seller’s business online. In the latter case, user stories and requirements are hashed out, iteratively developed and showcased to the client. In the former case, hypotheses are iteratively worked on, accepted or rejected based on the data and the accuracy of predictions, and showcased to the client. In an application software project, the requirements are clearer and what one wishes to experience on the site is mostly known.
In an analytics project, what is expected of the recommendations engine is believed to be known, but what data will help us get there is not well understood. The application software project team receives requirements like “Need an admin page for data entry, a homepage, ability to add promotions, drag and drop books into cart”: clear requirements whose feasibility is rarely questioned. The analytics project team might receive requirements like “Use the user’s social network data to make recommendations, use all the books they bought to make recommendations” among other things, which the team has no way of declaring relevant or not at the outset. Data might eventually reveal that social network activity has no impact on the accuracy of recommendations. Books someone bought in the past might not make valuable recommendations, as the person might be looking for books in other genres.
Perhaps all of these data are very powerful indicators of what to recommend, but churning through them to identify the 10 books to recommend might be so computationally intensive that it takes 20 seconds to arrive at the list. It is difficult, if not impossible, to know these characteristics at the outset. It is in this degree of uncertainty that analytics projects and application software projects differ. Added to this are requirements coming from a variety of stakeholders (marketing, publisher relationship manager, warehouse relationship manager, et al.) who might want a say in what the recommendation engine produces.
While we are at it, let us also compare the workings of a traditional analytics team with those of an agile analytics team. A traditional analytics team would work endlessly at building an engine that uses all the user data imaginable (for example, page views, book genres, author nearness, demographics, social network activity, articles read and shared online). After a long stretch of time it would return with, perhaps, a powerful engine that makes near-perfect recommendations, but from data that the site cannot capture and that hence doesn’t exist (except as the test data that the team used).
An agile analytics team, on the other hand, would build an engine while checking the feasibility of gathering each variable and judging preference alignment, and would present its findings to the client and the application software project team so that the value and feasibility can be judged. Perhaps the application software project team gets an added requirement of logging duration of page views which they were not going to implement until this meeting. The agile analytics project team will also receive constraints from the application software project team like, “The search page will load in 3 seconds and hence, your recommendation engine cannot take more than 1 second.” Improvement in requirements will happen both ways. Strategies to solve such challenges will be altered and course-corrected to bring maximum return on investment to the business. It is important to note that this scenario assumes that the agile analytics project and application software project teams are separate. Ideally, the data science team would overlap with the application software project team.
How an Agile Analytics Project Unfolds
The main difference between a traditional approach to analytics and an agile analytics approach is the upfront, rapid feedback cycle of test and learn, with stakeholders posing business questions and describing available datasets. A team of data scientists, business analysts, software engineers and other experts works with the stakeholders to hone the questions until they:
- Have as narrow a scope as possible;
- Contain explicitly quantitative clauses;
- Are ranked by relative value; and
- Are potentially answerable given the available data.
In the data lab, the question “Can we extract additional revenue from the logs of user activity on our website?” may be honed to, “Can we increase our online conversion rates from 1.5 percent to at least 3 percent by providing targeted recommendations?”
The corresponding null hypothesis to be tested in the data lab might look like, “A Mahout singular value decomposition recommendation engine offers no additional conversion beyond what we’re already achieving.”
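As a rough illustration of what testing such a null hypothesis can look like, the sketch below runs a one-sided two-proportion z-test in plain Python. It is not the author's method and it does not involve Mahout; the function name and the visitor and conversion counts are made up for illustration, and a real analysis would use logged conversion data.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided z-test of H0: rate_b <= rate_a against H1: rate_b > rate_a.

    conv_* are conversion counts, n_* are visitor counts (hypothetical inputs).
    Returns the z statistic and the one-sided p-value.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Upper-tail probability of a standard normal: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Made-up example: 10,000 visitors per group, 1.5% vs 3% conversion.
z, p_value = two_proportion_z_test(150, 10_000, 300, 10_000)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.3g}")
```

A small p-value would be grounds to reject the null hypothesis that the recommendation engine adds no conversion; a large one would mean the data cannot distinguish the two rates.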
The data science team then breaks off to dig into the data, seeking to disprove the null hypothesis. This means trying to show that a Mahout Taste implementation achieves a conversion rate that is statistically consistent with 3 percent or more, and statistically inconsistent with today’s conversion rate of 1.5 percent. These quantitative requirements impose constraints on the quality and quantity of the data needed to achieve the goal. As this example illustrates, the primary focus in this early data lab stage is to identify the appropriate dataset and the minimum sample size to generate observations and insight and maximize value. The questions answered typically include:
- Is the data we already have appropriate to gain insight into the question asked?
- If not, where and how can we get the data we really need? This may entail proactive data collection.
- What are the variances, covariances, biases and selection functions inherent to the data?
- What is a reasonable and realistic baseline against which to test our hypothesis?
- How much data do we need to reject our hypothesis?
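The last question in the list above can be approximated before any data is collected. The following is a minimal sketch using the standard normal-approximation formula for comparing two proportions; the significance level of 0.05 (one-sided) and the power of 0.8 are assumptions chosen for illustration, not values from the article, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.8):
    """Approximate visitors needed per group to detect p_base -> p_target.

    Uses the normal approximation for a one-sided two-proportion test.
    alpha and power are illustrative defaults, not values from the article.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_target) / 2
    # Spread under the null (pooled rate) and under the alternative.
    null_term = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    alt_term = z_beta * math.sqrt(
        p_base * (1 - p_base) + p_target * (1 - p_target)
    )
    n = (null_term + alt_term) ** 2 / (p_target - p_base) ** 2
    return math.ceil(n)

# The article's example: today's 1.5% conversion versus a 3% target.
n_needed = sample_size_per_group(0.015, 0.03)
print(f"visitors needed per group: {n_needed}")
```

The required sample grows quickly as the expected lift shrinks, which is one reason narrowly scoped, explicitly quantitative questions are easier to answer with the data at hand.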
In the data lab, a large number of visualizations and statistical tests—potentially hundreds—are typically generated to understand the nature and quality of the datasets and the relative merits of different statistical methods. R and Python are the typical go-to languages for this stage of data exploration, thanks to the many actively supported statistics libraries and the relative brevity of data analysis scripts.
Insights reached in the lab are communicated back to the stakeholders in a rapid cycle, typically every few days to every week. Questions and hypotheses are further honed and potentially new data sources are put on the table. In this upfront feedback cycle, the team reaches deeper insight on how exactly to answer valuable business questions with real data. The cycles lead either to results that keep improving or to hypotheses that are eliminated and marked as unworthy of pursuit.
The data lab produces insight on the nature and quality of datasets, business questions and hypotheses that are narrow in scope and answerable, and a concrete selection of algorithms and models that can be used to answer those questions with the data. These data lab outputs come in the form of prototypes and proofs-of-concept. The team gains a deep understanding of the viability of the advanced analytics solution, and what remains is to put the resulting output of each iteration into production.
Implementing the Model
Implementing the resulting model of an agile analytics project (in this case, the model that produces recommendations based on real input) takes special care because the tools used in the data lab are often distinct from those used to crunch data at scale in production. Whereas relatively few lines of R or Python are needed to choose whether to use singular value decomposition or k-means for a recommendation engine, Hadoop, MapReduce and Storm may be needed to apply these methods to high-volume, high-velocity user product preference events as they arrive from a web site in real time. The process of implementing a scalable solution should begin as soon as a viable method is discovered in the data science lab.
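To give a sense of how compact a data lab prototype can be, here is a toy singular value decomposition recommender in Python with NumPy. It is a sketch under loud assumptions: the ratings matrix is invented, missing ratings are naively treated as zeros, and the `recommend` function is hypothetical. It uses NumPy rather than Mahout and is nowhere near a production implementation.

```python
import numpy as np

# Toy data: rows are users, columns are books, 0 means "not rated yet".
ratings = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 0],
], dtype=float)

def recommend(ratings, user, k=2, top_n=2):
    """Rank a user's unrated books by a rank-k SVD reconstruction.

    Treating missing ratings as zeros is a simplification for illustration.
    """
    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    # Keep only the k strongest latent factors and rebuild the matrix.
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
    unseen = np.flatnonzero(ratings[user] == 0)
    # Sort unrated books by reconstructed score, highest first.
    ranked = unseen[np.argsort(approx[user, unseen])[::-1]]
    return ranked[:top_n].tolist()

print(recommend(ratings, user=0))
```

Swapping in a different method, or a different number of latent factors `k`, is a one-line change at this scale, which is what makes the lab suited to comparing approaches before committing to a production build.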
Implementing solutions at scale is costly. Businesses were not built with the intention of enabling an analytics solution. Hence, their data sources are disparate and often difficult to extract. Sometimes it is not even known that the data sources exist until the right questions are asked. Each silo in the business might be on a different database with different domain terminology. These factors make building an analytics solution costlier than developing application software.
Added to this is the risk of not doing it right. For example, a firm might decide to integrate the data from the marketing and HR departments and pursue a large, sweeping change, only to realize that the underlying hypothesis (of a correlation between some variables from the marketing data and HR data) is not valid.
The benefits of agile analytics help to justify the investment while reducing the risks of pursuing blind alleys. By testing techniques, characterizing datasets using well-selected subsets, and honing business questions up front in the data lab, you can mitigate the risk of implementing models that turn out to be garbage.
At the heart of it, agile analytics is all about rapid feedback. It’s about stating hypotheses, ensuring they are answerable, understanding available data, refraining from operationalizing until you’ve arrived at a proven approach on a small scale, and always validating chosen methods and software.
Anand Krishnaswamy is a senior consultant and developer in data analytics with ThoughtWorks, a global technology company that provides fresh thinking to solve some of the world’s toughest problems. You can reach him via email at firstname.lastname@example.org.