In part two of this series, I described the diverse set of skills needed to get new data science initiatives off the ground. I also detailed terminology for several data science techniques to consider and presented a case for focusing on business value. In this article, I describe two popular data science techniques and when to use them. I also cover strategies for measuring business value and considerations for deploying new solutions.
Failure Prediction and Anomaly Detection
In my previous piece, I outlined 10 techniques for deploying data science into your company’s workflow. Two of them, failure prediction and anomaly detection, come up regularly and offer an interesting dichotomy in how data scientists view the world. Let’s explore them further. For both failure prediction and anomaly detection, the end goal is the same: learn from machine-related data and detect significant problems in advance. The key distinction lies in how well you understand what constitutes a problem. If you have a well-defined failure mode (for example, a failing fuel injector on a mine truck) and a history of those failures, then failure prediction is an appropriate approach. Data scientists consider this a supervised learning problem. Using a failure prediction technique, you can make predictions for a specific failure type. For example, your solution might forecast that a given mining truck in operation has a 95% probability of an engine failure within the next two weeks.
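To make the supervised idea concrete, here is a minimal sketch in Python. It learns the simplest possible model, a single-sensor alert threshold, from labeled failure history. The sensor readings and labels are invented for illustration; a production solution would use many sensors and a proper machine learning model.

```python
# Labeled history (invented data): sensor readings observed during normal
# operation vs. readings recorded shortly before known failures.
healthy = [70, 72, 75, 74, 71, 73]
failed = [88, 92, 90, 95, 91, 89]

def best_threshold(healthy, failed):
    """Pick the cut point that best separates the two labeled groups,
    counting false alerts and missed failures equally."""
    candidates = sorted(healthy + failed)
    def errors(t):
        false_alerts = sum(1 for x in healthy if x >= t)
        missed_failures = sum(1 for x in failed if x < t)
        return false_alerts + missed_failures
    return min(candidates, key=errors)

threshold = best_threshold(healthy, failed)  # 88 for the data above

def predict_failure(reading):
    """Alert when a new reading crosses the learned threshold."""
    return reading >= threshold
```

The key point is that the rule is *learned from labeled failures*, which is what makes the problem supervised; with richer data you would swap the threshold search for a classifier that outputs a failure probability.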
When there is ambiguity around the target event or failure, the problem may be best approached using the anomaly detection technique. In data science parlance, this is considered an unsupervised learning problem. For these use cases, either the description of what constitutes a problem is less well defined or the failure history is insufficient. An anomaly detection approach entails leveraging historical asset data (temperature readings, pressure readings and so on) to construct a statistical representation of normal operating behavior for a key asset. When real-time equipment readings deviate significantly from normal or predicted values, an alert for anomalous activity is generated.
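A minimal sketch of that approach, using invented readings: model “normal” as the mean and standard deviation of historical data, then flag live readings that fall outside the normal band. Real systems use richer models, but the structure — no failure labels, only a baseline of normal behavior — is the same.

```python
import statistics

# Historical readings from normal operation (invented data).
history = [70.1, 69.8, 70.4, 70.0, 69.9, 70.2, 70.3, 69.7]

mean = statistics.mean(history)
std = statistics.stdev(history)

def is_anomalous(reading, n_sigmas=3.0):
    """Alert when a reading deviates from normal by more than
    n_sigmas standard deviations."""
    return abs(reading - mean) > n_sigmas * std
```

Note that nothing here required knowing what a failure looks like — only what normal looks like — which is exactly the situation the anomaly detection technique is designed for.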
Measuring Business Value
As you evaluate the effectiveness of data science solutions, two common statistical measures generally come up. Precision refers to the ability of a data science solution to limit false alerts when making predictions. For example, if your solution makes 10 asset failure predictions and all are correct, that would be considered a precision score of 100%. However, that stellar performance on precision could come at the expense of another important statistical measure: recall. Recall measures the ability of your solution to catch all the failures that occur. What if, in the above example, a total of 20 failures occurred during the period? If the data science solution correctly flagged only 10 of those failures, then recall would be just 50% (10 alerts / 20 total failures). Another way to think about this: recall asks whether the data science solution caught all the failures that actually occurred, while precision asks how many of the solution’s failure predictions were correct.
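The two measures fall out of three simple counts. Here is the example above worked in Python:

```python
# Counts from the example: 10 alerts, all correct, but 20 failures in total.
true_positives = 10    # alerts that matched real failures
false_positives = 0    # alerts raised on healthy assets
false_negatives = 10   # the 20 - 10 failures that went unflagged

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(precision)  # 1.0  -> 100% of alerts were correct
print(recall)     # 0.5  -> only half of the failures were caught
```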
Data science solutions have to be calibrated to get this balance right. You may favor a high-precision (low-recall) solution if your aim is to minimize false alerts. In other applications, however, the cost of missing a failure is so significant that you’ll be willing to tolerate a certain level of false alerts, provided that you’re reasonably assured of not missing one. To get this calibration correct, data scientists need input from domain experts and financial analysts on the economic costs of generating false alerts and missing actual failures. For example, the economic cost of falsely alerting on an asset may include decreased asset utilization, logistical disruptions and increased labor costs.
In other cases, a high-recall (low-precision) solution may be more desirable. For example, if your company operates a locomotive in a remote area, your business operations could be extremely sensitive to detecting and avoiding any possible engine failure. A failure of that type would be quite costly and disruptive. In this instance, your work crew will be willing to tolerate a higher likelihood of false alerts, provided that they are fairly certain of catching any possible failure. Data scientists can generate probabilities of bad things happening, but operational cost and benefit estimates have to come from frontline personnel, site managers and business analysts. Using a blend of statistical metrics and economic estimates, you can tune the data science solution to the correct sensitivity.
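One common way to blend the statistical and economic inputs is to pick the alert threshold that minimizes expected cost on historical data. The sketch below assumes invented dollar figures and an invented validation set of (predicted probability, actual outcome) pairs; in practice the costs come from the frontline and finance teams just described.

```python
# Hypothetical economic estimates (invented for illustration).
COST_FALSE_ALERT = 5_000       # inspecting a healthy asset
COST_MISSED_FAILURE = 200_000  # an engine failure in a remote area

# Validation history (invented): (predicted failure probability, failed?).
history = [(0.9, True), (0.7, True), (0.6, False), (0.4, True),
           (0.3, False), (0.2, False), (0.1, False)]

def expected_cost(threshold):
    """Total cost of alerting whenever probability >= threshold."""
    cost = 0
    for prob, failed in history:
        alerted = prob >= threshold
        if alerted and not failed:
            cost += COST_FALSE_ALERT
        elif failed and not alerted:
            cost += COST_MISSED_FAILURE
    return cost

best = min([0.1, 0.3, 0.5, 0.7, 0.9], key=expected_cost)
```

Because a missed failure is 40 times costlier than a false alert here, the search settles on a low threshold — a high-recall setting — which mirrors the locomotive scenario above.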
Simply proving that a data science solution generates some positive value is not enough. It’s important to consider performance in the context of alternatives that are generally already in place and less complex. This will help you establish a reasonable baseline against which you can gauge a new solution’s performance. The baseline may simply consist of a set of industry rules of thumb or heuristics. These methods of tracking asset performance offer a benchmark for evaluating the additional cost and complexity that come with the full deployment of a new data science solution.
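In code, that simply means scoring the incumbent heuristic and the new solution on the same held-out data and the same metric. Everything below — the readings, the rule of thumb and the stand-in model — is invented for illustration:

```python
# Held-out (reading, failed?) pairs (invented data).
data = [(95, True), (88, True), (72, False), (85, False), (91, True), (70, False)]

def rule_of_thumb(reading):
    """Existing heuristic already in place: alert above 90."""
    return reading > 90

def new_model(reading):
    """Stand-in for the new data science solution."""
    return reading > 84

def recall(predict):
    """Share of actual failures the predictor catches."""
    failures = [reading for reading, failed in data if failed]
    return sum(predict(reading) for reading in failures) / len(failures)

print(recall(rule_of_thumb))  # the baseline catches 2 of 3 failures
print(recall(new_model))      # the new model catches all 3
```

If the new solution cannot clearly beat the rule of thumb on metrics like these, its extra cost and complexity are hard to justify.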
Deployment and Feedback
To realize the true return on investment for data solutions, they have to be deployed to frontline employees and managers who can take action on insights. This usually takes the form of deploying a predictive model into existing operating systems or applications. This is where you’ll draw heavily on your company’s IT expertise. In most cases, this step will involve an awkward process of reprogramming (rebuilding) your data science solution to work within an existing IT or operations management application. This is a critical and difficult step, and your return on investment will be greatly diminished if you fail to get it right.
Additionally, remember that one benefit of data science solutions is that they generally get better with more data and feedback. As your new solution generates insights, it’s important to capture feedback from personnel who can confirm the validity and value of those predictions. By capturing this crucial data, you’ll be able to “retrain” your data science solutions to get better over time. You should also monitor deployed solutions for steep declines in performance, which can be triggered by new asset operating conditions or data quality issues.
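A minimal monitoring sketch, assuming confirmed/rejected alert feedback is being captured: track the model’s rolling precision and flag a steep drop below its historical level. The baseline and tolerance values are illustrative placeholders.

```python
def rolling_precision(confirmed_alerts, window=20):
    """confirmed_alerts: booleans, True if field personnel confirmed
    the alert was a real problem. Returns precision over the last window."""
    recent = confirmed_alerts[-window:]
    return sum(recent) / len(recent)

def needs_retraining(confirmed_alerts, baseline=0.8, tolerance=0.2):
    """Flag a steep decline from the solution's historical precision,
    e.g. due to new operating conditions or data quality issues."""
    return rolling_precision(confirmed_alerts) < baseline - tolerance
```

When the flag trips, the same confirmed-alert data that triggered it becomes the labeled history for retraining the model.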
This marks the end of this three-part series. I hope that you have enjoyed reading this series as much as I have enjoyed writing it. I’m excited about the multitude of undiscovered opportunities across major industrial sectors waiting to be infused with data science. We are still in the early stages of industrial digitization. Just as you would expect when embarking on a journey into any new frontier, uncertainty and risks run high. However, new frontiers can also offer wildly exciting rewards and advantages. My recommendation is to take action — either by doubling down on existing data science efforts or starting them.
Manny Bernabe leads and develops strategic relationships for Uptake’s data science team. He works with industry partners, university advisers and business leaders to understand the opportunities for aspiring data-driven organizations.
Manny brings background from the financial services sector and, specifically, asset management. He has deep expertise in research and deployment of quantitative strategies in exchange traded funds (ETFs).
To learn more about Uptake’s data science capabilities, visit https://www.uptake.com/expertise/data-science.