Data comes in many forms, and its usage continues to evolve. Traditionally, essential data was locked away in siloed systems of record. To manage sensitive information such as customer records, financial transactions, and product inventory, strong governance protocols were put in place to protect against data theft and transaction downtime. These systems were built to manage highly structured, static, and predictable data, and the ecosystem that developed around them followed suit with rule-based processes.
But times have changed. Data is no longer static; it is evolving, available in many locations, and arriving in many forms. Apache Spark is a powerful technology that has risen to address this changing data landscape. By processing data in memory, Spark eliminates the need to shuttle data back and forth for analytics, reducing the time and complexity of developing analytics workflows and making it easier for data professionals to build more advanced analytics applications.
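As a rough analogy in plain Python (not Spark's actual API), the difference between shuttling data and keeping it in memory is the difference between re-reading the source for every analytic pass and reading it once, caching it, and reusing it — the idea behind Spark's cache()/persist():

```python
import time

def load_from_disk():
    """Simulate an expensive read-and-parse pass over stored records."""
    time.sleep(0.01)  # stand-in for disk I/O latency
    return [{"user": i % 3, "amount": i} for i in range(1000)]

# Without caching: every analytic pass re-reads the source.
total = sum(r["amount"] for r in load_from_disk())
count = len(load_from_disk())

# With an in-memory cache (the idea behind Spark's cache()/persist()):
records = load_from_disk()                        # read once
total_cached = sum(r["amount"] for r in records)  # reuse the in-memory copy
count_cached = len(records)

assert total == total_cached and count == count_cached
```

The results are identical; what changes is that the cached pipeline pays the load cost once instead of once per query, which is where Spark's in-memory model wins for iterative workloads.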
Yet, despite Spark’s fast-growing popularity and proven performance advantages, many of its benefits remain untapped due to a lack of data literacy and compatible data tools. The open-source technology delivers only when the right toolset is paired with the right knowledge. Only then can it help close the skills gap, unlock the true potential of a company’s data stores, and power applications that offer smarter, more personalized ways to interact with customers, ultimately fulfilling the promise of a data-first culture built on application experiences more intelligent than anything that exists today.
The Upfront Benefits of Spark
Adopting Spark pays immediate benefits. Spark’s ease of use and quick implementation make it an ideal framework for reducing the need to recode data science algorithms into production software. Further, although many enterprises already apply analytics to their data sets, Spark can derive even more value from unstructured data without the added complexity of building an entire analytics stack from the ground up.
But perhaps most importantly, Spark’s support for iterative computation means that data practitioners can apply massive amounts of processing, making unsupervised machine learning practical for delivering intuitive applications. This is key because applications that “learn” have business implications far beyond customer interaction: in their own way, they can understand and reason. This “cognitive” ability can generate real value by making organizations better attuned to many possible outcomes, which industry professionals can then interpret.
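To make "unsupervised" concrete, here is a deliberately naive one-dimensional k-means sketch in plain Python — an illustration of the idea, not Spark's MLlib implementation, which runs the same kind of iterative refinement in a distributed fashion:

```python
import random

def kmeans_1d(points, k=2, iters=10, seed=0):
    """Naive 1-D k-means: group unlabeled numbers with no supervision."""
    random.seed(seed)
    centers = random.sample(points, k)  # pick k distinct starting centers
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two obvious groups, around 1 and 10, are found without any labels.
data = [0.9, 1.0, 1.1, 9.9, 10.0, 10.1]
print(kmeans_1d(data))  # roughly [1.0, 10.0]
```

The loop is the point: each pass refines the previous answer, which is exactly the iterative access pattern that benefits from Spark keeping the working data set in memory.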
Moreover, prescriptive analytics will not only predict that something will happen and why, but also suggest actions to take in response. This is just the beginning. As applications get smarter and more customized through interactions with data, devices, and people, previously untapped opportunities will open up. In the next five years, machine learning applications will lead to new breakthroughs that amplify human abilities, helping data handlers navigate the world in interesting new ways.
Additionally, there is the entire open-source community that powers Spark. It’s a data community that continues to adjust as new problems and needs arise. This dynamic nature also means that Spark can be applied in a variety of ways, limited only by the companies deploying it. For example, through its availability on IBM Cloud, Spark has become the first cloud-based development environment for near-real-time, high-performance analytics, easily extendable to mobile, IoT, Web, and many other applications through IBM Bluemix.
Many of Spark’s standout qualities stem from its maturing in the open-source world. The entire Apache ecosystem, from Hadoop to Spark, has blazed the trail for the future of applied data and the possibilities that can be harnessed at a time when every new device and user generates an unprecedented amount of information. This is because the open-source movement has removed the walled gardens that have existed across the enterprise. The open approach harnesses the power of distributed minds to create a community that has tackled some of the hardest challenges in advancing data management and analytics technology. Netflix, Facebook, IBM, and VMware, for example, all have made significant contributions that have helped to solve problems not only for their own customers but for the community as a whole.
Tapping into the New Ecosystem
Many see Spark as a replacement for Hadoop, but that isn’t the case. Although both are big data frameworks, Hadoop and Spark serve different functions. Hadoop is distributed data infrastructure: it spreads massive data collections across clusters of commodity hardware, avoiding the cost of acquiring and maintaining custom machines, and batch-processes that data for affordable data management. Spark, on the other hand, is a data-processing tool that works atop distributed data collections but does not store data itself. If anything, it is a complement to the Hadoop Distributed File System (HDFS).
Further, Spark comes out of the box with more capabilities and speed than Hadoop’s MapReduce, allowing users to focus on bigger questions and more advanced applications. While Hadoop is great for manipulating large data sets, Spark can power streaming and machine learning while also providing key services such as SQL queries and data transformations. In essence, it is a portable, scalable core asset that allows the use of an integrated stack for reactive data processing and agile development.
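A defining feature of Spark's processing model is that transformations (map, filter) are lazy and do no work until an action (count, collect) is called. A rough plain-Python analogy using generators — not Spark's real API — captures the shape of it:

```python
log_lines = [
    "INFO boot ok",
    "ERROR disk full",
    "INFO heartbeat",
    "ERROR net down",
]

# "Transformations": generator expressions build a lazy pipeline;
# nothing has been scanned yet, much like map/filter on a Spark RDD.
errors = (line for line in log_lines if line.startswith("ERROR"))
shouted = (line.upper() for line in errors)

# "Action": materializing the pipeline finally runs it end to end,
# analogous to calling collect() or count() in Spark.
result = list(shouted)
print(result)  # ['ERROR DISK FULL', 'ERROR NET DOWN']
```

Because the whole pipeline is known before any data moves, Spark (unlike this toy analogy) can plan and fuse the steps across a cluster, which is a large part of its speed advantage over stepwise MapReduce jobs.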
Consider the case of the USA Cycling Women’s Team Pursuit squad. The team realized it had an “overwhelming” amount of data to analyze for potential insights that could lead to better results. From sleep patterns to DNA testing, every fraction of data was taken into account to make the team a success. The team jumped into data-driven decisions and never looked back. But it struggled to process streaming data in real or near-real time, to build an automated classifier that could leverage machine learning, and to future-proof its implementation so that radical overhauls weren’t needed every time data volumes grew.
In came Spark. It handled all of this and more, and it continues to expand the team’s capabilities, including additional descriptive and predictive analytics related to performance, training workload, and overall athlete well-being. Spark now allows for unmatched understanding and, in turn, more effective action; by adopting it, the team has taken the next step up the distributed computing stack. Why settle for complex and slow analytics when Spark can deliver more, simply and easily?
NASA also is taking advantage. NASA and the SETI Institute are using Spark’s machine-learning capabilities to analyze terabytes of complex, deep space radio signals in a hunt for patterns that might betray the presence of intelligent extraterrestrial life. Using Spark, SETI is able to explore new ways to analyze signal data.
In all, although data historically has been locked away, the way it is perceived is changing. There is no longer one version of the truth, but multiple interpretations of the data. This new way of thinking is propelling a cultural shift in which everyone benefits. It’s time to ascend to the next level of the data-development evolution and truly harness the power data has to offer.
Joel Horwitz is responsible for strategic partnerships and M&A for the IBM Analytics portfolio, including emerging capabilities like Spark, Hadoop, Cloud Data Services, IoT, and the Solutions business.