Apache Spark 2.0: Speed, Structured Streaming, and Simplification

by   |   July 22, 2016 1:15 pm   |   0 Comments

Jules S. Damji, Apache Spark Community Evangelist, Databricks

Jules S. Damji, Apache Spark Community Evangelist, Databricks

If you use one eye to look at the past, and the other to look to the future, you gain invaluable insight to affect the present.

Since Apache Spark 1.x was released, the community has learned from a growing number of Spark users, a global community, and contributors.

And from this collective wealth of wisdom, alongside our own contributions to the open-source project, Apache Spark 2.0 has become easier, faster, smarter, and accessible to all.

Although Spark 2.0 has many new features, the trifecta themes of speed, structured streaming, and simplification set the building blocks to bigger and better things in Spark 2.0 and the future.



In comparison to disk I/O or network performance, CPU speed has lagged in hardware improvements and is often the bottleneck. To address this deficit, Project Tungsten phase-2, with its own memory management and runtime compact code generation, brings Spark performance closer to bare metal, as if executing native code, enabling order-of-magnitude speedups across Spark’s libraries.

All this is possible because Spark acts like a compiler. That is, Spark 2.0 employs a strategy of streamlining a query plan, collapsing a query into a single function, generating compact code as though it were handwritten and leveraging CPU registers for intermediate data. Built upon ideas and techniques used in modern compilers and applied to data processing queries, this technique is called “whole-stage code generation.” To get a sense of the power and potential of whole-stage-code generation, this notebook demonstrates the Project Tungsten phase-2, in which we join more than a billion rows, and compare execution times between Spark 1.6 and 2.0.

Beyond compact code-generation techniques, major efforts went into improving Spark SQL’s Catalyst Optimizer in Spark 2.0. For example, by optimizing general queries written in Spark SQL or high-level DataFrame/Dataset APIs, query execution speeds significantly increased. By building an extensible, logical, and physical query plan and leveraging Scala’s advanced programming language features like pattern matching, Catalyst generates an efficient, selected, coalesced physical query plan for final code generation and execution (Figure 1). Together, they increase speeds by, in some cases, 5-10X.

Figure 1. Catalyst’s query optimization process flow. Click to enlarge.

Figure 1. Catalyst’s query optimization process flow. Click to enlarge.


Structured Streaming

Beyond traditional streaming, where you process only real-time events, structured streaming is a novel approach of processing structured data of either a discrete table of data (where you know all the data) or an infinite table of data (where new data is continuously arriving). When viewed through this structured lens of data – as a discrete table or an infinite table – you simplify streaming.

Because developers already know how to process static datasets using Spark’s DataFrame/Dataset API, extending this API to streaming for both static and continuous data made sense. By using the Catalyst optimizer to ascertain and convert static query to incremental query execution on streams, developers can use the same code for static query as for streaming, and the underlying streaming engine would automatically perform the necessary conversions.

Related Stories

Spark and Hadoop: In the Cloud or On-Premises?
Read the story »

Are your Relationships Missing a Spark?
Read the story »

Sharpen your Customer Reflexes with Apache Spark.
Read the story »

Harnessing the Enterprise Capabilities of Spark.
Read the story »

With this structured approach to streaming, developers possess a myriad of ways to process data and can build end-to-end, continuous applications that act on real-time data: They can perform traditional ETL on structured data; they can execute ad-hoc SQL queries against a stream; they can aggregate data in a stream, update a database, and track session data; and they can generate the necessary reports and/or build ML pipeline models.

A combined batch and real-time offering with exactly one set of semantics and built-in state management is a unique offering that pushes Spark Streaming to another level. “Structured Streaming is really meant to push Spark 2.0 beyond just streaming to a new class of applications that do other things in real time,” said Databricks CTO Matei Zaharia.

All this equates to one simple thing: It’s much easier to process both batch and streaming data through the lens of structured data and a structured approach to streaming, as demonstrated at the Spark Summit 2016 keynote and demo.

In the upcoming Apache Spark 2.0 release, the basic underlying infrastructure and unified streaming API based on Dataset/DataFrame for end-to-end continuous applications are constructed. In this experimental release of Structured Streaming, we initially will support in-memory tables and files as sinks and sources. Subsequent releases will support additional data sources (e.g., Apache Kafka) and sinks, sessionization, watermarks, and ML integrations with emphasis on stability, scalability, and production-streaming workloads.

Simplified and Unified APIs

Stitching it all together is unification of Datasets with DataFrames APIs, across all libraries atop Spark core, so that libraries handle both data forms. Because DataFrames and Datasets are built on Spark SQL engine, a developer gets Tungsten’s performance benefits and Catalyst’s optimized execution plans. Additionally, Datasets provides developers strict compile-time type safety, which didn’t exist in DataFrames previously (Figure 2).

Figure 2. Unification of Datasets and DataFrames in Spark 2.0. Click to enlarge.

Figure 2. Unification of Datasets and DataFrames in Spark 2.0. Click to enlarge.


The unification theme extends to a number of other APIs, too. SparkSession is now a single point of entry, subsuming SparkContext, HiveContext, StreamingContext, and SQLContext. Developers use only SparkSession to access all four contexts, minimizing the number of different contexts to remember while still maintaining backward compatibility. Also, unified APIs build the foundation for Spark’s future, spanning across all libraries – including Spark Streaming and MLlib, introducing structure to Spark, offering richer semantics and high-level declarative APIs, and making Spark much simpler and easier for all developers.

This trifecta of speed, structured streaming, and simplification – informed by greater community usage of new features and feedback from the experimental APIs – is the building block to bigger and better things in Spark 2.0 and the future.

“One of the things that’s really exciting for me as a developer of Spark is seeing how quickly people start to use these (features) and give us feedback on them,” said Zaharia.

You can preview Apache Spark 2.0 on Databricks, try some of the notebooks in the links above, and peruse and preview other technical assets in the anthology of technical assets for Apache Spark 2.0.

Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems.

Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise, plus get instant access to more than 20 eBooks.

Tags: , , , , , ,

Post a Comment

Your email is never published nor shared. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>