With data generation and consumption growing rapidly in every industry, there is an increasing need for a solid IT architecture that can support high-velocity, high-volume data. A common challenge in the Big Data space is balancing the accuracy of analytics derived from a massive data set against the need for low-latency results. Lambda Architecture is agnostic to any particular data processing technology, highly scalable and fault-tolerant. It balances the batch processing and real-time processing aspects of Big Data and provides a unified serving layer for the data.
Query = Function(All Data Set)
Consumption of data via an ad-hoc query is naturally a function of the underlying data set. A function that operates on the entire massive data set is bound to have high latency because of the sheer size of the data, though its accuracy is generally higher because it sees the full history. Such functions usually run on batch frameworks of the Hadoop MapReduce type. The high-velocity processing layer, on the other hand, usually operates on a small window of in-flight data. It achieves low latency, but its results might not be as accurate as those computed against the full data set. With the growing appetite for near-real-time data consumption, there is an opportunity to strike a balance and get the best of both worlds, and Lambda Architecture plays well in that space.
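To make the trade-off concrete, here is a minimal Python sketch rather than a production design. The event records and the names `views_per_page` and `recent_window` are hypothetical, chosen only for illustration. The same function is applied once to the whole data set and once to a small recent window.

```python
from collections import Counter

# Hypothetical page-view events; the field names are illustrative only.
ALL_DATA = [
    {"page": "/home", "ts": 1},
    {"page": "/home", "ts": 2},
    {"page": "/docs", "ts": 3},
    {"page": "/home", "ts": 4},
    {"page": "/docs", "ts": 5},
]


def views_per_page(events):
    """The 'function' part of the query: count views for each page."""
    return Counter(event["page"] for event in events)


def recent_window(events, since_ts):
    """Keep only the recent slice that a low-latency layer would look at."""
    return [event for event in events if event["ts"] >= since_ts]


# Full scan: sees all history, but the cost grows with the data set.
full_answer = views_per_page(ALL_DATA)

# Windowed scan: cheap, but it sees only part of the picture.
window_answer = views_per_page(recent_window(ALL_DATA, since_ts=4))
```

The two answers differ because the window ignores older events, which is the accuracy-versus-latency tension described above.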
Originated by Nathan Marz, creator of Apache Storm, Lambda Architecture consists of three components:
- Batch Layer
- Speed Layer
- Serving Layer
Typically, new data is ingested through a publish-subscribe messaging system that can scale to high-velocity data ingestion, such as Apache Kafka. The inbound data stream is split into two streams, one heading to the Batch Layer and the other to the Speed Layer.
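The fan-out can be sketched in a few lines of Python. This is an in-memory illustration only: `InMemoryTopic` is a hypothetical stand-in for a publish-subscribe topic and does not use or imitate the Kafka client API. It shows every published event reaching both layers.

```python
class InMemoryTopic:
    """Hypothetical stand-in for a publish-subscribe topic; not a Kafka client."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        # Each subscriber receives every event published after it subscribes.
        self._subscribers.append(handler)

    def publish(self, event):
        for handler in self._subscribers:
            handler(event)


batch_inbox = []  # events destined for the Batch Layer
speed_inbox = []  # events destined for the Speed Layer

topic = InMemoryTopic()
topic.subscribe(batch_inbox.append)
topic.subscribe(speed_inbox.append)

for page in ("/home", "/docs", "/home"):
    topic.publish({"page": page})
```

Both inboxes end up with the same three events in the same order. In a real deployment the two layers would more likely be independent consumers of the same topic.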
Batch Layer is primarily responsible for managing the immutable, append-only master data set and pre-computing views of the data based on the anticipated queries. The premise behind the immutable data set is that the batch layer re-computes its views over the entire data set on every run in order to drive higher accuracy. If the data set were mutable, re-computing against all of it would be extremely difficult, if not impossible, because the computation would have to manage multiple versions of the same data. The core goal of this layer is accuracy: it pre-computes the views and makes them available as batch views, accepting the inherent latency of a run that might take several minutes or hours. HDFS, MapReduce and Spark can be used to implement this layer.
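As a rough illustration of the re-computation idea, the sketch below models the master data set as an append-only list and rebuilds a page-view count from scratch on every run. The names `MasterDataset` and `recompute_batch_view` are hypothetical, and a real batch layer would run this kind of job on a distributed framework rather than in a single process.

```python
from collections import Counter


class MasterDataset:
    """Hypothetical append-only store: records are added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Store a copy so later changes by the caller cannot alter history.
        self._records.append(dict(record))

    def scan(self):
        # Hand out copies so readers cannot alter stored records either.
        return [dict(record) for record in self._records]


def recompute_batch_view(master):
    """Rebuild the view from scratch over the entire data set."""
    return dict(Counter(record["page"] for record in master.scan()))


master = MasterDataset()
for page in ("/home", "/docs", "/home"):
    master.append({"page": page})

batch_view = recompute_batch_view(master)

# New data arrives; the next run recomputes everything instead of patching.
master.append({"page": "/docs"})
next_batch_view = recompute_batch_view(master)
```

Because records are never changed in place, each run is a pure function of the records present at that moment, so a faulty view can always be discarded and rebuilt.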
Speed Layer is primarily responsible for continuously updating the real-time views from the incoming data stream, or sometimes from a small window of the data set. Because these real-time views are built from a small data set, they might not be as accurate as the batch views, but unlike the batch views they are available for immediate consumption. The core goal of this layer is the speed at which real-time views become available, at some cost in accuracy. Apache Storm, Spark and NoSQL databases are typically used in this layer.
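The incremental style of update can be sketched as follows. `RealtimeView` is a hypothetical in-memory class, and the `reset` step reflects an assumed policy (dropping counts once a batch run has absorbed the same events) rather than anything a particular framework prescribes.

```python
class RealtimeView:
    """Hypothetical in-memory view that is updated one event at a time."""

    def __init__(self):
        self.counts = {}

    def on_event(self, event):
        # Incremental update: touch only the key affected by this event.
        page = event["page"]
        self.counts[page] = self.counts.get(page, 0) + 1

    def reset(self):
        # Assumed policy: discard counts once a batch run has absorbed them.
        self.counts.clear()


view = RealtimeView()
for page in ("/home", "/home", "/docs"):
    view.on_event({"page": page})

snapshot = dict(view.counts)  # what a reader would see right now
view.reset()                  # after the batch layer catches up
```

Each update costs the same regardless of how much history exists, which is what keeps this layer fast.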
Serving Layer’s responsibility is to provide a unified interface that seamlessly integrates Batch Views and Real-Time Views generated by Batch Layer and Speed Layer, respectively. Serving Layer supports ad-hoc queries optimized for low-latency reads. Typically, technologies such as HBase, Cassandra, Impala and Spark are used in this layer.
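One simple way to picture the unified interface is a query function that combines the two views. The sketch below is an assumption-laden illustration: `merged_count` is a hypothetical name, the counts are made-up sample values, and it assumes the real-time view contains only events that the last batch run has not yet absorbed.

```python
def merged_count(batch_view, realtime_view, page):
    """Answer a query by combining the batch view with the real-time view.

    Assumes the real-time view holds only events the last batch run has not
    yet absorbed, so adding the two counts does not double count.
    """
    return batch_view.get(page, 0) + realtime_view.get(page, 0)


# Illustrative sample values, not measurements.
batch_view = {"/home": 120, "/docs": 45}      # precomputed by a batch run
realtime_view = {"/home": 3, "/pricing": 1}   # events since that run

home_total = merged_count(batch_view, realtime_view, "/home")
docs_total = merged_count(batch_view, realtime_view, "/docs")
pricing_total = merged_count(batch_view, realtime_view, "/pricing")
```

Addition works here because page-view counts are additive. Views that are not additive, such as distinct counts, would need a different merge rule.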
Lambda Architecture tries to bring together the best of both worlds: fast processing and large-scale processing. With a growing suite of technologies such as Spark, Storm, Samza, Cassandra, HBase, MapReduce, Impala, ElephantDB and Druid, there are plenty of choices when picking the right technology for each layer.