There’s tremendous excitement around data science – maybe too much. We’re at the point in the hype cycle where executives are waking up to the fact that they’ve spent serious money on people, tools and technology… with minimal ROI to show for it. And no wonder – the enterprise data science market has never been defined. The excitement around data science has outpaced any organization of it, leaving myriad tools cobbled together and passed off as an enterprise solution.
Spoiler alert: It’s not working. Without technical guardrails, data science remains an abstract vision.
In order for companies to start getting tangible business value out of data science efforts, we need to redefine – or more accurately, define – what enterprise data science should look like. Only with a proper technical foundation in place will companies then be able to effectively execute on their data science vision.
The Challenges Are Real
As someone who’s spent more than 20 years working in software architecture and engineering, I get it: the challenges are significant. Not so long ago, the idea of “enterprise-grade” was a joke – it was synonymous with counter-intuitive and clunky software. Compounding the enterprise-grade conundrum is data science’s inherent complexity.
To make data science truly ready for the enterprise, we have to acknowledge that there are different challenges and that the stakes are higher. With data science, there’s so much ripe ground for innovation, from data platforms to algorithms, which has led to a surplus of “cool new tools” to use. But instead of increasing productivity, many of these tools are adding unnecessary complexity and distraction, with the adverse effect of slowing down enterprise projects.
I see countless examples where a company believes it needs to apply deep learning to its data when a plain old logistic regression would work just fine. In many cases, this same organization hasn’t given a second thought to the underlying technology framework it would need to support its bigger vision. The result? Data science models that die on the vine, destined for the sad fate of a static PPT slide, or that take months to put into production by an entirely new set of engineers.
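The “plain old logistic regression” point is worth making concrete: the whole technique fits in a few dozen lines of plain Python, with no special framework at all. The toy dataset and gradient-descent settings below are illustrative assumptions, not anything from a real project:

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and a bias term with plain batch gradient descent."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log-loss w.r.t. the raw score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, xi):
    """Classify a single example at the usual 0.5 probability threshold."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0

# Toy, linearly separable data (illustrative): label 1 when features are large.
X = [[0.1, 0.2], [0.4, 0.1], [0.8, 0.9], [0.9, 0.7]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
preds = [predict(w, b, xi) for xi in X]
```

A model this simple is also trivially explainable – each weight says how much a feature pushes the prediction – which is exactly the property that makes it easy to defend to the business.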
That’s not enterprise-grade. That’s inefficiency.
Enterprise: It Shall Just Work
As the lines between consumer and enterprise apps blur, UI has become an increasingly important consideration for enterprise software – but that’s the easy part.
Making data science ready for prime time requires more than just having a pretty interface with a modern color palette. What matters is what’s under the hood.
First and foremost, enterprise-grade in my book means it shall just work.
In an enterprise data science context, this involves seamlessly integrating with and leveraging the critical Hadoop technologies, from authentication protocols like Kerberos, to data warehouse abstractions such as Hive.
For most data scientists, even those who consider themselves Apache Spark experts, having to constantly tune each Spark executor for each Spark job is a major headache. As a data scientist, you don’t want to be dependent on engineers to deploy a model into production. Plus, IT doesn’t want anyone else messing around with its infrastructure.
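To illustrate the kind of per-job tuning involved, here is the sort of Spark configuration an analyst ends up hand-editing today – and that a turnkey platform would set automatically. Every value below, including the Kerberos principal and keytab path, is a placeholder, not a recommendation:

```
# spark-defaults.conf fragment (all values illustrative)
spark.master                   yarn
spark.executor.instances       8
spark.executor.cores           4
spark.executor.memory          8g
spark.sql.shuffle.partitions   200

# Kerberos identity for a secured cluster (placeholder principal/keytab)
spark.kerberos.principal       analyst@EXAMPLE.COM
spark.kerberos.keytab          /etc/security/analyst.keytab
```

Getting any one of these numbers wrong means out-of-memory failures or idle hardware – which is why pushing this burden onto every data scientist, for every job, doesn’t scale.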
With that in mind, there are some key areas where turnkey systems will have the greatest impact. First off, you need to address security. It’s a complicated space, with lots of different preferences for how to make data secure. But in order to do real work with real data – HIPAA data, customer data, PII – you need to have a strategy in place for how to secure the data and who has access. And let’s be clear, I’m not talking about just security (that’s a given), but turnkey security. Software that allows for turnkey integration will enable the organization to function at the highest level of security while ensuring data scientists have access to the data they need to have an impact.
In addition to security, turnkey systems also play a critical role when it comes to scale and performance. Are you able to easily scale across a cluster? Will your platform seamlessly work against large data sets? How will you efficiently leverage technologies like Spark for improved algorithm performance in workflows? The bumpy road to addressing all of these challenges will be exponentially easier to travel with built-in automation.
A New Mindset to Make Models Iterative
Enlightened companies think through how to extend data science to the broader enterprise so more people can act on the promise of what data science offers. Retaining a culture where data scientists are academics in a corner tuning an algorithm isn’t going to fly anymore; nor should it. After all, data scientists are most powerful when they’re impacting business action. And the only way to make that happen is to collaborate directly with the business.
New algorithms will always emerge, and you can’t rip out your system for a new one every time. In the same way, you don’t want to get stuck seeking the “perfect” model. Instead, it’s important to move quickly through the steps – define a business problem, model the data, and push that model into production in a turnkey fashion – to understand in short order how it performs. These performance metrics are critical; when embedded in the technology framework, they seamlessly feed results back into the system, helping you understand how the model is performing and improve it over time.
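The feedback loop described above can be sketched in a few lines. Everything here – the choice of accuracy as the metric, the baseline, and the retraining threshold – is an illustrative assumption, not a prescription:

```python
def accuracy(predictions, actuals):
    """Fraction of scored records whose predicted label matched the outcome."""
    correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
    return correct / len(actuals)

def monitor(predictions, actuals, baseline, tolerance=0.05):
    """Feed production outcomes back into the system: flag the model for
    retraining when live accuracy drifts below the baseline by more than
    `tolerance`."""
    live = accuracy(predictions, actuals)
    return {
        "live_accuracy": live,
        "needs_retraining": live < baseline - tolerance,
    }

# Outcomes trickling back from production (illustrative values):
report = monitor(
    predictions=[1, 0, 1, 1, 0, 0],
    actuals=[1, 0, 0, 1, 0, 1],
    baseline=0.90,
)
```

The point is not the specific metric but that it is computed automatically, in production, so the model is judged on live business outcomes rather than on a one-time validation set.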
For example, imagine a car manufacturer trying to cross- or up-sell different models or brands of vehicles in order to increase customer loyalty. Not only is the baseline demographic data important, but so is the ability to iterate on the model to incorporate up-to-date economic data about consumer behavior – for example, maybe luxury car sales are going down, but hybrid luxury car sales are going up. Only with the ability to iterate over time – instead of relying on a stale model – can the company consider a range of factors to retain and attract customers amidst changing economic tides. In addition, with feedback from subject matter experts checking the assumptions of the data scientists, the model can effectively evolve over time to best incorporate customer and consumer data, and drive business objectives.
Data Science Projects That Deliver
For many enterprises, the most challenging part of the data science process is what we refer to as the “last mile” of analytics: putting data science work into production and measuring concrete ROI. The real hurdle in getting there is making data science enterprise-ready – while enterprise integration may not be as exciting as writing new algorithms, it’s the infrastructure underpinning you absolutely need for anything else to be relevant.
Those data scientists who understand this shift and who can turn data science endeavors from black box projects to true enterprise-level analytics solutions that are a functioning part of the business will be the true heroes. Those applications will be the ones that stand the test of time, and the difference between success and failure, both for the data scientists and the company.
Lawrence Spracklen leads engineering at Alpine. He is tasked with the continued development of Alpine’s advanced analytics platform. He brings to Alpine a diverse range of experience, spanning processor design and research, distributed systems, software optimization and multi-thread scalability, security and hardware accelerator development. Prior to joining Alpine, Lawrence worked at Ayasdi as both VP of Engineering and Chief Architect. Before this, Lawrence spent over a decade working at Sun Microsystems, NVIDIA and VMware, where he led teams focused on the intersection of hardware architecture and software performance and scalability. Lawrence holds a Ph.D. in Electronic Engineering from the University of Aberdeen, a B.Sc. in Computational Physics from the University of York and has been issued over 30 U.S. patents.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.