Apache Arrow is a common data layer for storage systems and execution engines that enables columnar in-memory analytics. The Apache Software Foundation, which launched Arrow in February, claims a 10-100x performance improvement for many workloads and zero-overhead data transfers between systems.
According to the foundation, Arrow achieves performance improvements by enabling execution engines to take advantage of Single Instruction Multiple Data (SIMD) operations included in modern processors, and by making more efficient use of CPU caches. Arrow supports a variety of industry-standard programming languages, including Java, C, C++, and Python. Support for additional languages is expected soon. The project is backed by developers of 13 major open-source projects, including Cassandra, Drill, Hadoop, Impala, Kudu, Spark, and Storm.
Jacques Nadeau, co-founder and CTO of Dremio and vice president of the Arrow and Apache Drill projects, believes that, within a few years, most of the world’s data will be processed through Arrow. Nadeau fielded questions from Data Informed about Arrow, the benefits it can deliver to businesses, and the future of the open-source project.
Data Informed: What advantages can Arrow deliver to businesses running big data initiatives?
Jacques Nadeau: Arrow brings massive improvements in speed, with some workloads running 10 to 100 times faster. This is an important advantage when it comes to tackling contemporary data volumes.
In addition, the implementation of a common data layer can mean a significant reduction in the amount of overhead incurred when systems communicate with one another. This allows additional freedom for companies selecting the components of their data solution, since you can now pick the best of breed without worrying about a performance hit due to differing internal data representations.
All this is accomplished in a way that’s programming-language agnostic and without potentially costly ETL – Arrow works with relational and complex data as-is.
How does Arrow achieve such significant performance improvements?
Nadeau: Due to the need to translate between internal data formats, contemporary big data stacks may spend up to 70 to 80 percent of CPU resources on serialization and deserialization. This means that the elimination of this necessity can represent a substantive gain in performance.
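The cost described above can be illustrated with a minimal sketch. It does not use Arrow's API; it relies only on Python's standard library, and the variable names are hypothetical. It contrasts translating data between representations (every value is encoded, parsed, and copied) with two consumers reading one shared buffer, which requires no copy at all.

```python
import json
from array import array

# A column of 64-bit floats held in one contiguous buffer.
values = array("d", [1.0, 2.0, 3.0])

# Translating between formats: every value is encoded to text,
# parsed again, and copied into a new buffer.
encoded = json.dumps(list(values))
decoded = array("d", json.loads(encoded))

# Sharing one layout: a memoryview exposes the same bytes without copying.
view = memoryview(values)

# A later write to the original is visible through the view,
# but not through the serialized copy.
values[0] = 10.0
shared_first = view[0]
copied_first = decoded[0]
```

Here `shared_first` reflects the update while `copied_first` does not, which is the difference between reading a common in-memory layout and round-tripping through a serialization format.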
In addition to providing a common format, Arrow also increases processing performance through the internal organization of its data structures. In simple terms, Arrow organizes data in a way that’s consistent with the way that a CPU “thinks.” We call this format columnar because it organizes data column-wise rather than row-wise – that is, data of the same type is grouped together, rather than broken up. Organizing data in this way permits us to employ a class of CPU operations called Single Instruction Multiple Data (SIMD) instructions, which allow for a more efficient allocation of CPU cycles. These instructions apply the same operation to multiple data values at once, allowing for high-performance computation. The columnar representation is also beneficial because it eases the uptake of data into the CPU. Grouping data by type increases the effectiveness of cache prefetching and minimizes CPU stalls.
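The difference between row-wise and column-wise organization can be sketched as follows. This is an illustration of the layout only, written with Python's standard `array` module rather than Arrow itself; the field names and the `to_columns` helper are hypothetical. Pure Python does not issue SIMD instructions, so the sketch shows how the data is grouped, not the speedup.

```python
from array import array

# Row-wise layout: each record keeps its fields together,
# so values of different types are interleaved.
rows = [
    {"id": 1, "price": 9.5, "qty": 3},
    {"id": 2, "price": 4.0, "qty": 10},
    {"id": 3, "price": 2.5, "qty": 4},
]


def to_columns(records):
    """Pivot records into one contiguous, single-type buffer per field."""
    return {
        "id": array("q", (r["id"] for r in records)),        # signed 64-bit ints
        "price": array("d", (r["price"] for r in records)),  # 64-bit floats
        "qty": array("q", (r["qty"] for r in records)),
    }


columns = to_columns(rows)

# An aggregate over one field now scans a single contiguous buffer
# instead of visiting every record.
total_qty = sum(columns["qty"])

# A calculation over two fields walks two same-typed buffers in step.
revenue = sum(p * q for p, q in zip(columns["price"], columns["qty"]))
```

In a compiled engine, a loop over a buffer such as `columns["qty"]` is the kind of code that can be vectorized and that benefits from cache prefetching, because consecutive values have the same type and sit next to one another in memory.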
How is Arrow able to function in low-memory situations?
Nadeau: You’ll be pleased to know that Arrow works just fine even when the data that you are processing is too big for the available memory. This is accomplished by splitting Arrow’s core data structures into manageably sized groups of record batches, which are at most 2^16 (65,536) records in length. These are typically between 64KB and 1MB in size.
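The batching idea can be sketched in a few lines. This is not Arrow's implementation; it assumes an upper bound of 2^16 (65,536) records per batch, and the `iter_record_batches` helper is hypothetical. It only shows how a large record count breaks down into bounded batches that can be processed one at a time.

```python
MAX_BATCH_RECORDS = 2 ** 16  # 65,536 records, the assumed per-batch limit


def iter_record_batches(num_records, max_records=MAX_BATCH_RECORDS):
    """Yield (start, stop) index ranges covering num_records in bounded batches."""
    for start in range(0, num_records, max_records):
        yield start, min(start + max_records, num_records)


# 200,000 records are split into three full batches and one partial batch.
batches = list(iter_record_batches(200_000))
batch_sizes = [stop - start for start, stop in batches]
```

Because only one batch needs to be resident at a time, peak memory is governed by the batch size rather than by the size of the whole dataset.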
The lead developers of 13 big data open-source projects, including Hadoop, Impala, Spark, and Kudu, were involved in the development of Arrow. What does such widespread commitment mean for the performance and future of Arrow?
Nadeau: This shows that Arrow addresses a very real need. It also means that we are witnessing the birth of a new standard. In the future, Arrow will be the canonical way to represent data in storage systems and query/execution engines, and we can expect it to be the focus of continued development over the coming years.
You have said that, within a few years, you expect most new data to move through Arrow. What do you anticipate for the next six months to a year?
Nadeau: Within the next year, I expect Drill, Impala, Ibis, Kudu, Parquet and Spark will use Apache Arrow. These are huge, standards-setting projects and their adoption of the format will greatly speed its uptake within the data space. The future of Arrow is incredibly bright.
Where can readers learn more about Apache Arrow?
Nadeau: The Apache Arrow website includes general information about the project, including the source code and community mailing lists. I also recently published a technical blog post about Apache Arrow, which provides more details about the technology. I would encourage everyone to read the blog post and join the community through one of the mailing lists.