Apache Spark is quickly growing in popularity, already eclipsing other popular projects within the Hadoop ecosystem. And for good reason: Apache Spark is the next great data processing framework. When coupled with an operational, relational, in-memory database, Spark and its result sets can be put to work for applications immediately.
But how can we best understand the potential of Spark? To answer that question, let’s take a closer look at the core components of data management systems across databases and Spark.
Basics of a Database
Before tackling the details of Spark, it might help to understand three critical components of a database. In this example, we’ll look at them from a relational database perspective. The components are a programming language, an execution environment, and storage.
The programming language defines how the user interacts with the database. For the past 40 years, the language most commonly used with relational databases has been the Structured Query Language, or SQL for short. SQL has stood the test of time and is not likely to disappear anytime soon. Even amid the rise of NoSQL solutions, just about every datastore has found a way to be accessed via SQL.
The execution environment handles the mechanics of putting the user’s requests into action. In a relational database, that typically involves an optimizer to speed up queries, joins to merge data across multiple tables, and other features like GROUP BY for more sophisticated queries. All of these capabilities form the nuts and bolts of retrieving the right data quickly and easily.
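To make these execution-environment features concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for a full relational database. The table names and data are invented for illustration; the point is how a JOIN merges data across tables and GROUP BY supports a more sophisticated query:

```python
import sqlite3

# An in-memory database stands in for a full relational engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'west'), (2, 'east');
    INSERT INTO orders VALUES (1, 10.0), (1, 15.0), (2, 20.0);
""")

# A JOIN merges data across tables; GROUP BY aggregates per region.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

print(rows)  # [('east', 20.0), ('west', 25.0)]
```

Behind a query like this, the optimizer decides details such as join order and index usage so the user never has to.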
The final piece of a database is the storage environment or, more specifically, the place to persist and protect data. For databases, the specific structures include an index such as a skip list or B-tree, a row store and/or a column store, and the ability to handle transactions. Most relational databases handle transactions with ACID compliance, meaning that a transaction has the properties of atomicity, consistency, isolation, and durability for reliable processing.
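The atomicity property in particular is easy to demonstrate. The sketch below again uses Python’s built-in sqlite3 module with an invented accounts table: an error partway through a transfer rolls back the whole transaction, so no partial update survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

# Atomicity: both updates commit together, or neither does.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 70 "
                     "WHERE name = 'alice'")
        raise ValueError("simulated failure mid-transfer")
except ValueError:
    pass

# The debit was rolled back; Alice still has her original balance.
balance = conn.execute("SELECT balance FROM accounts "
                       "WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100.0
```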
Basics of Apache Spark
If we look more closely at Apache Spark, we can see similarities and differences between it and typical relational databases.
The primary programming language for Spark is Scala, although Java and Python can also be used. But the real benefit of Spark is its large number of libraries for high-level operations that simplify programming and data transformations.
At the heart of Apache Spark is a DAG (directed acyclic graph) execution engine that supports in-memory computing. This provides the ability to schedule jobs and distribute them over many nodes. While MapReduce has a two-stage DAG (map and reduce), Spark can split a job into multiple stages, or just one stage, frequently allowing jobs to complete more quickly and efficiently than conventional MapReduce.
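Spark’s real engine is far more sophisticated, but the core idea, recording transformations lazily and then fusing them into as few passes over the data as possible, can be sketched in a few lines of plain Python. This toy `Dataset` class is not Spark’s actual API; it only models the pattern:

```python
# A toy model of lazy, stage-fused execution (not Spark's real API).
class Dataset:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []   # pending transformations (a simple chain DAG)

    def map(self, fn):         # lazy: records the step, does no work yet
        return Dataset(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        return Dataset(self.data, self.ops + [("filter", pred)])

    def collect(self):         # action: run all fused steps in ONE pass
        out = []
        for item in self.data:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif kind == "filter" and not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

ds = Dataset(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(ds.collect())  # [0, 4, 16, 36, 64]
```

Because the map and filter are fused into one stage, the data is traversed once, whereas a strict two-stage model would materialize the intermediate squared values first.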
Spark currently does not have its own storage environment, instead relying on external data stores like HDFS or Amazon S3 for persistence. While Spark has a concept of a Resilient Distributed Dataset (RDD), it is best to think of an RDD as a temporary repository while Spark jobs are being processed, not a place for ongoing persistence. One reason Spark has taken off so quickly is that it sidesteps building a native storage environment, often one of the most challenging aspects of creating a new data framework.
Understanding Databases and Spark
When we look at the basics of a relational database and Spark side by side, we see that each has its own strengths. The interesting opportunities lie in combining the two. When Spark relies on a datastore like HDFS, it tends to be relegated to faster batch processing as HDFS is typically a collection of data and not necessarily a production datastore supporting a live application.
All of that changes when Spark is combined with an operational database. Spark can easily access live production data, and result sets from Spark can be put to use immediately in the operational database supporting mission-critical applications. The following are three popular use cases combining Spark with operational databases.
- Real-time streaming
- Advanced analytics of operational data
- Operationalizing Spark results
Real-Time Streaming
In a recent Java 8 survey, 67 percent of developers associated their interest in Spark with Spark Streaming. In this use case, data frequently arrives through a publish-and-subscribe messaging system such as Apache Kafka. From there, it goes to Apache Spark for real-time transformation or enrichment. But Spark itself cannot persist the data long-term, so developers must choose a storage layer. If they choose an operational database, particularly an in-memory one, they can then build an application on top of that database.
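The shape of that pipeline can be sketched in plain Python. In a real deployment the queue would be a Kafka topic and the enrichment would run in Spark Streaming; here a list stands in for the topic, a dict for the reference data, and an in-memory SQLite table for the operational database. All names and values are invented for illustration:

```python
import sqlite3

# Stand-ins for the real components (see lead-in above).
incoming_events = [
    {"user_id": 1, "action": "click"},
    {"user_id": 2, "action": "purchase"},
]
user_profiles = {1: "free-tier", 2: "premium"}   # enrichment source

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, action TEXT, tier TEXT)")

for event in incoming_events:                    # the "streaming" loop
    tier = user_profiles.get(event["user_id"], "unknown")  # enrichment
    db.execute("INSERT INTO events VALUES (?, ?, ?)",
               (event["user_id"], event["action"], tier))
db.commit()

# The application can query the enriched events immediately.
rows = db.execute("SELECT action, tier FROM events "
                  "ORDER BY user_id").fetchall()
print(rows)  # [('click', 'free-tier'), ('purchase', 'premium')]
```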
That is exactly what Pinterest did in a recent showcase of Apache Kafka, Spark Streaming, and MemSQL at Strata + Hadoop World. More information on that specific use case is available here.
Advanced Analytics of Operational Data
Another popular use case for Spark is to analyze real-time operational data. Too often, data scientists are left behind, having to work with data stored in a data warehouse that is separated from operational data by a lengthy and complex ETL (extract, transform, load) process.
When Spark is connected to an operational database, it provides immediate access to real-time data and the ability to build models related to the present, not just the past.
For operational databases based on SQL, users can query data using the well-known SQL syntax. However, not every data analytics operation can be expressed in SQL, and Spark provides a set of advanced analytics capabilities beyond what standard SQL can offer. By connecting Spark to an operational database, analysts can build models that are current and more reflective of real-time activity.
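A small example of this division of labor, using Python with an invented operational table: SQL pulls the live rows, while a statistical step that standard SQL cannot easily express, here an ordinary least-squares slope, runs outside the database (in practice, in Spark):

```python
import sqlite3

# Invented operational table: daily order counts.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE daily_orders (day INTEGER, orders INTEGER)")
db.executemany("INSERT INTO daily_orders VALUES (?, ?)",
               [(1, 10), (2, 12), (3, 14), (4, 16)])

# SQL fetches the live data; the analytics step happens outside SQL.
data = db.execute("SELECT day, orders FROM daily_orders").fetchall()

# Ordinary least-squares slope of orders over time.
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
         / sum((x - mean_x) ** 2 for x, _ in data))
print(slope)  # 2.0 additional orders per day on this toy data
```

Because the query runs against the operational store rather than a warehouse copy, the model reflects today’s data, not last night’s ETL snapshot.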
Operationalizing Spark Results
Spark provides analysts with a remarkable ability to quickly test and iterate assumptions on data models. But what happens after the results are derived?
Rather than simply generating a report, combining Spark with an operational database provides the ability to put results to use for critical applications. When a result set is derived, it can be placed in the operational database and immediately accessed by the application to make critical recommendations or decisions. This is radically different from simply generating dashboards to be interpreted by analysts. In today’s real-time world, result sets will need to be fed back into applications, because human gatekeepers will not be able to keep up with the volume and variety of results.
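The feedback loop is simple to sketch. Here an in-memory SQLite table again stands in for the operational database, and a hard-coded list stands in for a result set derived by Spark; the user IDs, item names, and scores are all invented:

```python
import sqlite3

# A derived result set, as if produced by a Spark job.
recommendations = [(1, "product-42", 0.91), (2, "product-7", 0.84)]

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE recommendations
              (user_id INTEGER, item TEXT, score REAL)""")

# Step 1: the result set is written into the operational database ...
db.executemany("INSERT INTO recommendations VALUES (?, ?, ?)",
               recommendations)
db.commit()

# Step 2: ... and the live application reads it on the very next request.
item = db.execute("""SELECT item FROM recommendations
                     WHERE user_id = ? ORDER BY score DESC LIMIT 1""",
                  (1,)).fetchone()[0]
print(item)  # product-42
```

No analyst sits between steps 1 and 2; the application consumes the results directly.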
Today we are witnessing a shift in architectures, from many niche solutions to fewer multi-purpose solutions. Pairing Spark with an operational database combines two such multi-purpose tools: Spark provides outstanding processing capabilities, while an in-memory, operational, relational database persists and serves the results, offering enterprises a compelling way to drive both insight and impact.
Eric Frenkiel is co-founder and CEO of MemSQL. Before MemSQL, Eric worked at Facebook on partnership development. He has worked in various engineering and sales engineering capacities at both consumer and enterprise startups. Eric is a graduate of Stanford University’s School of Engineering. In 2011 and 2012, Eric was named to Forbes’ 30 under 30 list of technology innovators.