The two fundamental requirements of big data projects usually start out as separate, one-dimensional problems: extracting (near) real-time information from a continuous inflow of data, and recurrently analyzing an immense volume of data. Two solutions, each addressing one of these cases, have become popular in recent years. Storm, an open source, distributed, real-time computation platform, is used by companies like Twitter and Groupon, and the project was recently accepted into the Apache Incubator. The other, Hadoop, is so well known and widely adopted that it has become synonymous with batch processing.
The problem with these approaches is that business requirements are both historic and real-time at the same time. Many organizations find that the two challenges of extracting real-time data and analyzing immense volumes of data converge over time. Real-time data accumulates, and the inevitable demand for an aggregated historic view requires batch processing. Batch processing, in turn, is slow, which eventually leads business users or customers to ask for immediate or near real-time insight, such as the most recent data updates, so they can react faster to market changes.
A Way for Batch and Near Real-Time Processing to Work in Concert
One solution to this problem is to create a hybrid architecture, which Nathan Marz, the creator of the open source projects Storm and Cascalog, has coined the Lambda Architecture. It is discussed in the new book Big Data – Principles and Best Practices of Scalable Real-Time Data Systems, which he and James Warren are writing.
The idea of the Lambda Architecture is simple and twofold. First, a batch layer computes views over all the collected data and restarts the computation from scratch as soon as it finishes. Its output is always outdated by the time it becomes available, since new data has arrived in the meantime. Second, a parallel speed layer closes this gap by constantly processing the most recent data in near real-time. Any query against the data is answered by querying both the speed and the batch layers' serving stores, and the results are merged to give a near real-time view of the complete dataset.
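The query path described above can be sketched in a few lines of Python. This is a minimal illustration, not code from any particular framework; the view names and the use of plain dicts as serving stores are assumptions made for the example.

```python
# A query is answered by merging a precomputed batch view (complete but
# hours old) with a speed view that covers only the data received since
# the last batch run. Together they approximate an up-to-date view.

batch_view = {"page_a": 1000, "page_b": 400}   # produced by the batch layer
speed_view = {"page_a": 7, "page_c": 2}        # recent data only, near real-time

def query(key):
    # Merge the two serving stores; here the merge is a simple sum of counts.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("page_a"))  # 1007: 1000 from the batch view plus 7 recent events
```

For additive metrics such as counts, the merge step is trivial; for other aggregates the merge logic must be chosen so that batch and speed results can be combined correctly.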
The Lambda Architecture itself provides only a paradigm. The technologies with which the different parts of a Lambda Architecture are implemented are independent of the general idea. Due to their popularity, Hadoop and Storm are strong contenders for the batch and speed layers of new data architectures. However, any technology exhibiting a speed or batch layer characteristic, such as a company's legacy software, could fulfill either function.
The Two Layers of the Lambda Architecture
The Lambda Architecture results in two layers, each with an input store, a processing capability, and a serving store. The result, counterintuitive as it may sound, is a simple and robust architecture.
The most common and easiest approach to speeding up a batch or speed layer is to scale it horizontally or vertically. This usually involves no additional development or operational effort when the deployment is fully automated. However, such systems are bound by their architecture: batch architectures bent to achieve near real-time performance, or speed architectures adapted to process historical big data, become operationally complex.
For example, adding more or faster servers to a Hadoop cluster will increase processing speed, in the ideal case nearly linearly. There is an inherent limit to this solution: network overhead increases, and the fixed time for starting and managing Hadoop jobs cannot be scaled away. Speed layers like Storm could potentially be scaled to a level where they process historical data while sustaining the real-time throughput, though big historical data would require sophisticated queuing and data management to achieve this. Both solutions are likely to become expensive and operationally complex in pursuit of a kind of processing performance they were not designed for.
The Lambda Architecture, while requiring two architectures, uses each at its optimum, with fewer operational challenges. For example, scaling a Hadoop cluster to process data within a day is generally much easier than guaranteeing that it is processed in near real-time (that is, within a few minutes). The core application logic in both layers is equivalent, which means little additional application logic beyond merging the queried data.
The simplicity and robustness of the Lambda Architecture are twofold. First, it avoids scaling a single-purpose architecture beyond its abilities, removing operational extremes that can lead to brittle data architectures. Second, the output of the two layers can be harnessed to reduce the complexity of the serving layer, providing a forgiving environment in which the whole dataset can be rebuilt after catastrophic errors or disasters.
The Forgiving Nature of the Speed and Batch Layers
The serving layer of a constantly updating system at scale is complex: it requires fast random reads and writes. Systems like HBase, Cassandra, or DynamoDB can achieve this level of performance, but all of them are complex, tough to manage, and require costly hardware when scaled for big data. In the Lambda Architecture, such a system is needed for the speed layer only, so it can be comparatively small and does not need to scale with accumulating historical data. Additionally, should the speed layer fail, it is not a significant event, since its store can be repopulated within hours. This removes significant operational burdens like high availability and makes even fast but volatile in-memory storage solutions like Redis, an open source key-value store, potential options.
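The disposability of the speed layer store can be sketched as follows. A plain dict stands in for an in-memory store such as Redis; the function names are illustrative, not from any library.

```python
# The speed view holds only counts for data received since the last batch
# run. Because the batch layer will eventually recompute everything from
# the raw data, the speed view can simply be flushed once a batch run
# completes -- losing it is not catastrophic.

speed_view = {}

def record_event(key):
    # Incrementally update the near-real-time view as events arrive.
    speed_view[key] = speed_view.get(key, 0) + 1

def on_batch_complete():
    # The fresh batch view now covers this data; discard the speed view.
    speed_view.clear()

record_event("page_a")
record_event("page_a")
on_batch_complete()   # speed view is now empty and starts accumulating anew
```

Because the store never has to survive a restart, its high-availability and durability requirements are far weaker than those of a conventional serving database.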
The batch serving layer, in contrast, can be very simple. It needs to be horizontally scalable and support random reads, but it is only updated by batch processes. The latter means the store can generate data and indices offline and switch them over in a single step. This low frequency of change enables versioning of the data and keeping each version as a backup. Furthermore, the data does not fragment and needs no complex locking mechanisms, since it is replaced in one operation. The batch updating of the underlying index and data makes rolling back and forward between versions easy, and it can be done in the background. ElephantDB is an example of a storage system that achieves this, and it was implemented in only a few thousand lines of code.
Lastly, and very importantly, both the batch and speed layers are forgiving and can recover from errors: views can be recomputed or rolled back (batch layer) or simply flushed (speed layer). A core concept of the Lambda Architecture is that the (historical) input data is stored unprocessed. This provides two advantages. Future algorithms, analytics, or business cases can be applied retroactively to all the data by simply adding another view over the whole dataset. And human errors like bugs can be corrected by recomputing the whole output after a fix. Many systems are initially built to store only processed or aggregated data; in such systems, errors are often irreversible. The Lambda Architecture avoids this by storing the raw data as the starting point for batch processing.
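The recomputation idea can be made concrete with a small sketch. The event shape and function names are illustrative assumptions; in practice the raw data would live in a distributed store and the rebuild would be a batch job.

```python
# Because the raw input events are stored unprocessed, a bug in the
# view-building logic can be fixed and the entire view rebuilt from
# scratch -- the error is not baked into the only copy of the data.

raw_events = [
    {"page": "page_a"},
    {"page": "page_a"},
    {"page": "page_b"},
]

def build_view(events, increment_fn):
    # Recompute a page-count view from the immutable raw events.
    view = {}
    for e in events:
        view[e["page"]] = view.get(e["page"], 0) + increment_fn(e)
    return view

buggy_view = build_view(raw_events, lambda e: 2)   # bug: double-counts events
fixed_view = build_view(raw_events, lambda e: 1)   # fix the logic, rebuild all
```

Had only the aggregated counts been stored, the double-counting bug would have been irreversible; with the raw events retained, the fix is a recomputation.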
In many real-world situations, a stumbling block for switching to a Lambda Architecture is the scalable batch processing layer. Technologies like Hadoop with Cascading, Scalding, Python streaming, Pig, and Hive exist, but there is a shortage of people with the expertise to leverage them. Deploying the architecture in the cloud raises the challenge even further. A number of new cloud-based data services companies, including the one where I work, are starting to fill the gaps. Used in conjunction with Amazon Simple Storage Service as a data sink and various database storage adapters, such services can provide a means to match the batch layer requirements of the Lambda Architecture.
Rajat Jain is a Software Engineer at Qubole, a big data as a service provider based in Mountain View, Calif.