A well-run big data environment built on Apache Hadoop and Apache Spark can generate valuable analytical insights that boost business performance by guiding business strategy and tactics. But what type of environment is best: an on-premises deployment or a deployment in the cloud?
If you are trying to decide whether to implement Hadoop and Spark in the cloud or on-premises, the following questions can help you identify the right choice.
Do you want to deploy and manage infrastructure and software yourself?
Companies that go the on-premises route for deployments of Spark on Hadoop end up spending an inordinate amount of time and money wrestling with infrastructure. In the initial deployment, companies spend lots of time – and often, lots of consulting dollars – selecting hardware and a distribution vendor. Then they climb a huge learning curve regarding proper node, network, and OS configurations.
After all of this, many organizations breathe a premature sigh of relief; premature because they soon realize that their infrastructure battles are just beginning. As they move from pilots to production, companies gain a better understanding of their actual workloads and realize that they need a different hardware configuration. And, as their capacity needs increase, they realize that they didn’t adequately plan to meet the physical needs of a growing deployment. They are running out of space and need to make another capital budget request to buy more hardware.
Do you want to tackle ongoing operations yourself?
Companies accustomed to the “high upfront, low ongoing” operational learning curves and HR requirements of traditional enterprise software are often in for a surprise when it comes to big data. Operational excellence is not only critical to ongoing big data success, but also increasingly challenging to maintain as data volumes scale. The learning curve keeps rising and it can be difficult to find and retain experienced operational staff. Keeping a mix of production jobs and ad hoc exploratory work running reliably and concurrently on the same Hadoop cluster is very difficult.
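To give a concrete sense of what this operational work looks like, one common approach on a shared Hadoop cluster is to partition capacity between production and ad hoc work with the YARN Capacity Scheduler. The sketch below uses real Capacity Scheduler properties, but the queue names and percentages are illustrative assumptions, not a recommendation from the article:

```xml
<!-- capacity-scheduler.xml (sketch): split cluster capacity between a
     guaranteed production queue and an elastic ad hoc queue.
     Queue names and percentages are illustrative. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>production,adhoc</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.production.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>
  <property>
    <!-- let ad hoc jobs borrow idle production capacity, up to a cap -->
    <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
    <value>60</value>
  </property>
</configuration>
```

Even a small sketch like this hints at the tuning burden: every percentage is a trade-off between production SLAs and ad hoc responsiveness, and it must be revisited as workloads change.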
By implementing Spark on Hadoop in the cloud, companies can skip all of this, relying instead on their cloud vendor to manage the steps required to get up and running quickly and to ensure reliable, high-performing, ongoing operations. However, it’s important to keep in mind that not all cloud providers are created equal. Some, which we refer to as do-it-yourself (DIY) vendors, offer only Spark on Hadoop on a cloud server, and everything else – job monitoring, job tuning, job fixing, software updates, all of the heavy lifting of operations – is up to you.
In contrast, a fully managed Spark-as-a-Service offering delivers a complete big data environment, including operations, advice, and support. This frees companies to focus their internal resources on realizing the benefits of big data, instead of shouldering all of its burdens.
Can you keep up with the full Spark and Hadoop ecosystem?
As big data adoption accelerates throughout industry, so do the capabilities found within the Hadoop and Spark ecosystem. This is clearly seen in the development of in-memory analysis engines, low-latency SQL for Hadoop, and the increasing number of scheduling, workflow, and data-governance tools. These new capabilities often require the features found in the very latest releases of the core components, including Hadoop, Spark, Hive, and Pig, and there can be multiple releases per year, especially in the case of Spark. Cloud solutions are able to provide the latest, production-ready releases of these core components so that users can take advantage of the ecosystem’s latest innovations.
Cloud solutions generally already support third-party applications, such as the H2O machine learning solution, the Alation collaborative analytics platform, the AtScale BI solution, and more. There has been an explosion of tools that run on top of Hadoop to make “big data” accessible to a broader audience (e.g., business analysts and data scientists). In an on-premises environment, installing, running, and maintaining these applications can be very painful. A cloud solution provider takes care of it for you.
Are you ready to handle bursting, scaling, and Spark’s cluster hogging?
A successful Hadoop cluster tends to grow as more data and applications are developed on top of it. Additionally, bursty workloads can cause clusters to be simultaneously over-provisioned for the “steady state” and short on resources for burst loads, which in a massively parallel environment can have a crippling impact on performance.
A true big data lake needs to be able to mix production workloads operating under deadlines with bursty workloads that have highly varying resource requirements. This ensures that you can achieve the agreed-upon performance for your existing pipeline of production jobs delivering critical reports and data while, at the same time, providing an experimental environment in which data scientists can perform ad hoc research on raw/primitive data that will lead to the next big analytical breakthrough. This mixed environment is challenging to manage because it is difficult to predict the resources an ad hoc job will require; such jobs frequently interfere with production jobs and with meeting SLAs. Adding Spark to your data environment amplifies the complexity of running mixed workloads because Spark “hogs” resources once an ad hoc job begins.
Because of budgeting and internal process constraints, inflexible on-premises deployments cannot easily handle the bursty workloads generated by ad hoc queries. Cloud providers, on the other hand, can readily address this problem by providing additional, affordable computing power on demand. The best cloud providers can automatically scale to meet job needs with “automated compute bursting.”
How will you ensure a high level of security?
For most companies, security is a key factor in deciding whether to deploy a solution on-premises or work with a cloud provider. Organizations often assume that data is better secured on-premises. However, there are several reasons why a cloud vendor’s security is as good as, or better than, on-premises security:
- A Narrow Focus: Cloud providers are able to focus narrowly on securing one thing very, very well. On-premises security, however, usually requires that companies adopt a more general approach designed to protect against a broad range of possible threats. Additionally, cloud vendors often can devote greater financial resources to security because they view security spending as a competitive differentiator, and not just an expense.
- Up-to-Date Technology: Cloud providers are more likely to have deployed the latest in security measures because they have to: Security is a fundamental requirement of their business. However, it’s also easier for cloud providers to secure their systems because they tend to be less complex and not dependent on older technologies. In contrast, on-premises systems are typically composed of technology from various eras. Some of this aging infrastructure will be less secure because it was developed to protect against less-sophisticated threats, and the mixture of technologies will likely leave more openings for a hacker to exploit.
- World-Class Security Expertise: In a market beset by a shortage of security experts, organizations deploying solutions on-premises frequently find it difficult to hire enough skilled security workers. Cloud providers, due to their willingness to devote greater financial resources to security, can more successfully hire and retain workers with world-class expertise.
Combined with the necessary compliance certifications, such as SOC 2, HIPAA, or PCI, these factors should reassure enterprises that a cloud provider can deliver optimal data security.
Can you easily serve all of your data constituents, particularly data scientists?
Data scientists using Spark on Hadoop for data exploration face a different set of challenges than users of production analytics systems. For instance, because data scientists write ad hoc code for jobs that will run only once or twice, job setup and debugging is an integral, and often time-consuming, part of data scientists’ day-to-day work.
Unfortunately, as on-premises clusters grow to support more and more use cases, already overburdened IT departments are increasingly unable to assist data scientists with job setup and other needs. As a result, data scientists either attempt to use Hadoop and Spark directly, struggling with unfamiliar and unproductive command-line tools and APIs, or indirectly, via a back-and-forth with data engineers. This back-and-forth often leads to delays and misunderstandings. Worse, data scientists frequently find themselves at the end of the queue, waiting in line for access to the cluster.
In an attempt to expedite their research, some data scientists build and maintain their own on-premises Hadoop clusters, independently of IT. This often results in data scientists spending the majority of each day focusing on Hadoop administration, job setup, and debugging instead of data modeling and analysis. Not surprisingly, this can lead even the most determined data scientist to abandon big data research efforts in favor of data sampling, which can be both inaccurate and inefficient.
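To see why sampling can be inaccurate, consider a toy, pure-Python sketch (not tied to any particular Spark workload, and with made-up numbers): when the signal of interest is rare, a small sample can miss it almost entirely, so the estimated rate lands far from the true rate.

```python
import random

random.seed(7)

# "Full" dataset: 1,000,000 records, of which 0.01% are rare events
# (e.g., fraudulent transactions). The rate and sizes are illustrative.
full = [1 if i % 10_000 == 0 else 0 for i in range(1_000_000)]
true_rate = sum(full) / len(full)  # exactly 0.0001

# A 0.1% random sample expects only ~0.1 rare events, so most samples
# contain none at all and the estimated rate is badly skewed.
sample = random.sample(full, 1_000)
sampled_rate = sum(sample) / len(sample)

print(f"true rate={true_rate}, sampled rate={sampled_rate}")
```

Because a 1,000-record sample can only produce rate estimates in increments of 0.001, it can never reproduce the true rate of 0.0001, which is the kind of distortion that pushes data scientists back toward full-data analysis.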
A fully managed cloud vendor can provide an independent data-exploration environment, ensuring access to the latest features while providing the operational support that frees up data scientists’ time. This enables data scientists to develop a deeper understanding of data relationships more quickly and easily.
On-premises vs. Cloud: Hard vs. Easy
Implementing and running Spark involves complex challenges, particularly as data volumes expand. When deciding whether to implement in the cloud or on-premises, it’s worth considering whether your organization wants to spend its time and resources wrestling with these challenges or focusing on how big data insights can improve its business results. By working with a fully managed cloud provider, companies are freed to leverage Spark’s benefits instead of struggling to manage big data infrastructure and operations.
Raymie Stata is CEO and co-founder of Altiscale. Raymie comes to Altiscale from Yahoo, where he was Chief Technical Officer. At Yahoo, he played an instrumental role in algorithmic search, display advertising, and cloud computing. He also helped set Yahoo’s Open Source strategy and initiated its participation in the Apache Hadoop project. Prior to joining Yahoo, Raymie founded Stata Laboratories, maker of the Bloomba search-based e-mail client, which Yahoo acquired in 2004. He has also worked for Digital Equipment’s Systems Research Center, where he contributed to the AltaVista search engine. Raymie received his Ph.D. in Computer Science from MIT in 1996.