The Internet of Things (IoT) is an all-encompassing, ubiquitous network of devices that facilitates coordination and communication among the devices themselves, as well as between the devices and human end users. The devices are typically resource-constrained, such as RFID tags and simple sensors, but more sophisticated ones like smartphones are also considered part of the IoT ecosystem.
Currently, we are witnessing the IoT's move into the mainstream: a rapidly increasing number of deployed devices, growing momentum around standardization activities, and more and more acquisitions of IoT companies by big players.
The nature of the IoT or, to be more precise, of the data that IoT devices generate, lends itself to the “big data approach.” This is the idea of running data processing in a scale-out fashion on commodity hardware, using open and community-defined interfaces, such as Hadoop’s MapReduce API, along with openly defined data formats (like Parquet, for example) in a schema-on-read fashion.
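The schema-on-read idea can be sketched in a few lines of plain Python (a hypothetical illustration, not tied to any specific Hadoop API): raw records are landed verbatim, and a schema is applied only when the data is read.

```python
import csv
import io

# Raw sensor readings are landed as-is, with no upfront schema (schema-on-read).
RAW_LANDING_ZONE = [
    "s-17,2014-09-01T12:00:00Z,21.5",
    "s-17,2014-09-01T12:05:00Z,21.7",
    "s-42,2014-09-01T12:00:00Z,19.9",
]

def read_with_schema(raw_lines):
    """Apply a schema at read time: parse, name, and type the fields."""
    reader = csv.reader(io.StringIO("\n".join(raw_lines)))
    for sensor_id, timestamp, value in reader:
        yield {"sensor": sensor_id, "ts": timestamp, "temp_c": float(value)}

records = list(read_with_schema(RAW_LANDING_ZONE))
print(records[0])
```

The point of the sketch is that no ETL step stands between landing the data and querying it; a different consumer could apply a different schema to the same raw lines.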
Using Gartner’s definition of big data, along the three dimensions of volume, variety, and velocity as a baseline, IoT lends itself to a big data approach for the following reasons:
- Volume. To develop a full-blown (commercial) IoT application and maximize the benefits, you need to be able to capture and store all the incoming sensor data to build up the historical references. A single, simple sensor – say, a temperature sensor – might generate only a few MB per year, but the overall number of sensors in a typical deployment can quickly put you into the high-TB, if not PB, range of storage demand.
- Variety. There are dozens of data formats in use in the IoT world, which are often binary formats or compressed textual formats (due to the limitations of the devices in terms of power, storage, bandwidth, etc.). Further, none of the sensor data is relational per se. A data processing platform that allows landing all this data in its raw form is superior to an RDBMS and its dependency on Extract, Transform, Load (ETL).
- Velocity. Many devices generate data at a high rate, and in an IoT context we usually have to cope with continuous data streams – two conditions that RDBMSs are not well suited to deal with.
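The volume argument is easy to make concrete with a back-of-envelope calculation. The reading rate, record size, and fleet size below are illustrative assumptions, not measured figures:

```python
# Illustrative assumptions: one ~50-byte reading every 5 minutes per sensor.
BYTES_PER_READING = 50
READINGS_PER_YEAR = 12 * 24 * 365  # one reading per 5 minutes

per_sensor_mb = BYTES_PER_READING * READINGS_PER_YEAR / 1e6
print(round(per_sensor_mb, 1), "MB per sensor per year")

# A fleet of ten million such sensors lands in the tens-of-TB range per year.
fleet_tb = per_sensor_mb * 10_000_000 / 1e6
print(round(fleet_tb, 1), "TB per year for 10M sensors")
```

Under these assumptions a single sensor stays in the single-digit-MB range, while the fleet as a whole crosses into tens of TB per year – before accounting for replication or derived data sets.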
Requirements for an IoT Data-processing Platform
Dealing with raw data. In terms of data ingestion, an IoT data-processing platform should be able to natively deal with IoT data, which shows little standardization in terms of formats. Hadoop makes it possible to land the incoming data in its raw format and, for optimization purposes, to convert data downstream to more sophisticated formats, such as Parquet.
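As a concrete illustration of such compact device formats, a binary sensor payload can be landed as raw bytes and decoded downstream into a structured record. The field layout below is an assumption made up for this sketch, not a real device format:

```python
import struct

# Hypothetical compact payload: 2-byte sensor id, 4-byte epoch seconds,
# 4-byte float temperature -- 10 bytes total, big-endian, no padding.
PAYLOAD_FORMAT = ">HIf"

def encode_reading(sensor_id, epoch_s, temp_c):
    """What a constrained device might emit over the wire."""
    return struct.pack(PAYLOAD_FORMAT, sensor_id, epoch_s, temp_c)

def decode_reading(raw):
    """Downstream conversion of the landed raw bytes into a structured record."""
    sensor_id, epoch_s, temp_c = struct.unpack(PAYLOAD_FORMAT, raw)
    return {"sensor": sensor_id, "epoch": epoch_s, "temp_c": round(temp_c, 2)}

raw = encode_reading(17, 1409572800, 21.5)
print(len(raw))  # 10 bytes on the wire
print(decode_reading(raw))
```

The raw 10-byte payloads would be landed unchanged; the decoding (and any conversion to a columnar format such as Parquet) happens downstream, on the processing platform.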
Supporting different workload types. Traditionally, with MapReduce as its primary processing paradigm, Hadoop was a rather batch-centric system. IoT applications, however, usually require that the platform support stream processing from the get-go, as well as low-latency queries against semi-structured data items, at scale. Hadoop offers an append-only file system, HDFS, to persist data. For stream data, message queues such as Apache Kafka are usually used to buffer the data and feed it into stream-processing systems such as Apache Storm, or into the streaming component of general-purpose engines such as Apache Spark. For many applications, combining historical data (persisted in HDFS) with the new, incoming data from devices is essential, and this is usually realized using specialized architectures such as the Lambda Architecture.
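The batch-plus-stream combination described above can be sketched in pure Python, with in-memory stand-ins for the batch view (as if computed over HDFS) and the stream (as if consumed from a Kafka topic) – a hypothetical sketch, not any project's actual API:

```python
from collections import defaultdict

# Historical per-sensor totals, as if produced by a batch job over HDFS data.
batch_view = {"s-17": {"sum": 430.0, "count": 20}}

# New readings from devices, as if consumed from a Kafka topic.
incoming_stream = [("s-17", 21.5), ("s-17", 22.1), ("s-42", 19.9)]

def merged_averages(batch, stream):
    """Serve averages that combine persisted history with fresh stream data."""
    acc = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for sensor, stats in batch.items():
        acc[sensor]["sum"] += stats["sum"]
        acc[sensor]["count"] += stats["count"]
    for sensor, value in stream:
        acc[sensor]["sum"] += value
        acc[sensor]["count"] += 1
    return {s: v["sum"] / v["count"] for s, v in acc.items()}

print(merged_averages(batch_view, incoming_stream))
```

The essential design choice is that queries are answered from a merge of the slow, complete batch view and the fast, partial stream view, rather than from either one alone.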
Business continuity. Commercial IoT applications usually come with SLAs covering uptime, latency, and disaster-recovery metrics such as Recovery Point Objective (RPO) and Recovery Time Objective (RTO). The platform should be able to guarantee those SLAs out of the box. This is especially critical for IoT applications in domains such as health care.
Security and Privacy. The platform must ensure secure operation in an end-to-end manner – something that, as of today, is still considered challenging. This requirement includes integration with existing authentication and authorization systems, as well as with enterprise services such as Lightweight Directory Access Protocol (LDAP), Active Directory (AD), Kerberos, Security Assertion Markup Language (SAML), or Linux Pluggable Authentication Modules (PAM). Further, the privacy of human users must be guaranteed. For this, Access Control Lists (ACLs) and data-provenance mechanisms must be available in the platform, along with data encryption on the wire and/or masking at rest.
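A minimal sketch of the ACL idea, assuming a simple path-to-user permission mapping (this is a generic, hypothetical model, not the API of any specific Hadoop security project):

```python
# Hypothetical ACLs: path -> {user: set of granted actions}.
acls = {
    "/data/patients": {"alice": {"read"}, "bob": {"read", "write"}},
}

def is_allowed(user, path, action):
    """Deny by default; allow only an explicitly granted action on a path."""
    return action in acls.get(path, {}).get(user, set())

print(is_allowed("alice", "/data/patients", "read"))   # True
print(is_allowed("alice", "/data/patients", "write"))  # False
print(is_allowed("carol", "/data/patients", "read"))   # False
```

The deny-by-default stance shown here matters particularly for privacy-sensitive domains such as health care, where an unlisted user or path must never be readable by accident.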
Comparing the current Hadoop architecture – especially its storage layer, HDFS – with the above requirements, a few things stand out: while Hadoop is perfectly capable of dealing with raw IoT data, its support for a wide array of workloads is rather limited. Recent additions to the Hadoop ecosystem, such as YARN and Mesos, allow isolation on the compute layer. But on the storage layer, HDFS provides only a flat namespace, forcing real-world applications to use dedicated, separate clusters for different workloads. Further, due to its architectural constraints, HDFS cannot deal with many (small) files in a read/write manner. The same design constraints also cause issues concerning business continuity (around availability and the capability to recover from disasters) as well as security challenges. The latter, in all fairness, are being addressed as we speak, from on-disk encryption to incubating Apache projects such as Sentry (role-based authorization) and Knox (an HTTP gateway for authentication and access).
Michael Hausenblas is the Chief Data Engineer at MapR, where he helps people tap the potential of big data. His background is in large-scale data integration, the Internet of Things, and web applications. He is experienced in advocacy and standardization. Michael shares his experience with large-scale data processing through blog posts and public-speaking engagements. Follow Michael on Twitter at @mhausenblas.