As evidenced by the growing number of big data success stories, it’s clear that enterprises across all types of industries are getting real value out of their big data projects. Savvy retailers use big data to predict trends, target customers, anticipate demand, and optimize pricing and promotions, while recommendation engines personalize their customers’ shopping experience and help sell more products. In the financial services industry, big data projects are used to gain a comprehensive understanding of customers, markets, products, regulations, competitors, and even employees. Internet of Things projects, which include wearable devices, industrial machines, healthcare, and smart cities, use sensor devices that generate large quantities of data that need to be processed and analyzed in real time. Applications must digest and react to this data instantaneously to provide better services to users.
In all of these use cases, it’s clear that big data projects have moved from a batch-oriented architecture that delivers reports to analysts, to a real-time data platform in which data is streamed into the system and immediately processed and used by many applications and users. This approach poses some new challenges to the modern data architecture, chief among which are reliability, security, and rapid application deployment. Let’s take a look at some of these challenges, as well as guidelines for building a successful next-generation data platform.
Accessing a data platform in real time for ingestion, processing, and/or querying forces the system to be always on, fast, and secure. This means that when you are evaluating data platform options, you must closely examine the reliability, security, and scalability of the product(s) that you are choosing, since this platform will be used by an increasing number of applications, services, and users.
In terms of platform reliability, you also should look at the disaster-recovery capabilities between different data centers and clusters. Ideally, the platform will use mirroring techniques that mirror data as well as its metadata, while maintaining data locality and consistency. Because you are dealing with many fast-changing data streams, it’s important to have an easy way to create snapshots so that you can recover from a specific point in time.
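To make the snapshot idea concrete, here is a minimal sketch of point-in-time recovery against a toy key-value store. All names here are hypothetical, and a real platform would implement snapshots with copy-on-write at the storage layer rather than deep copies.

```python
import copy
import time

class SnapshottingStore:
    """Toy key-value store illustrating point-in-time snapshots.
    Real platforms use copy-on-write instead of deep-copying."""

    def __init__(self):
        self.data = {}
        self.snapshots = {}  # snapshot_id -> frozen copy of the data

    def put(self, key, value):
        self.data[key] = value

    def snapshot(self):
        snap_id = time.time()
        self.snapshots[snap_id] = copy.deepcopy(self.data)
        return snap_id

    def restore(self, snap_id):
        self.data = copy.deepcopy(self.snapshots[snap_id])

store = SnapshottingStore()
store.put("sensor-1", {"temp": 21.5})
snap = store.snapshot()
store.put("sensor-1", {"temp": 99.9})  # bad write after the snapshot
store.restore(snap)                    # recover to the earlier point in time
print(store.data["sensor-1"]["temp"])  # 21.5
```

The important property is that a snapshot names a consistent state of the data at a moment in time, so recovery is a single restore operation rather than a replay of the stream.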
Being able to deploy your data platform in different clusters as well as in different data centers not only will give you high availability, but also will allow your application to perform better because ingestion, processing, and querying will happen close to the data. Be sure to select a technology that supports writes in any location and automatically replicates the new or changed data into other data centers.
An important security aspect is related to data permissions. Many applications and users access the platform frequently, so it’s essential to be able to control precisely who can access and modify data. When you are selecting a solution for building your next-generation big data platform, your platform should include the following safeguards:
Extensible Authentication. The platform should be able to integrate easily with your enterprise identity management solution, such as Linux Pluggable Authentication Module (PAM) or Kerberos. That way, applications can use existing user profiles and groups when accessing data.
Access Control. The platform should provide fine-grained permissions to cluster resources, including write/read access to files and tables as well as finer permissions such as expression-based permissions on tables and columns. Note that access control and permissions should not be limited to just data (files and tables), but also to the jobs and queries that are executed.
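A minimal sketch of what expression-based, column-level permissions look like in practice, with a hypothetical rule table mapping each principal to the columns it may read plus a row-level predicate (none of these names come from a specific product):

```python
# Hypothetical ACL: each principal gets a set of readable columns and a
# row predicate. "columns: None" means all columns are visible.
ACL = {
    "analyst": {
        "columns": {"user_id", "steps", "heart_rate"},
        "row_filter": lambda row: row.get("region") == "EMEA",
    },
    "admin": {"columns": None, "row_filter": lambda row: True},
}

def read_row(principal, row):
    """Return only the fields the principal may see, or None if the
    row itself is filtered out by the principal's row predicate."""
    rule = ACL.get(principal)
    if rule is None or not rule["row_filter"](row):
        return None
    allowed = rule["columns"]
    if allowed is None:
        return dict(row)
    return {k: v for k, v in row.items() if k in allowed}

row = {"user_id": 42, "steps": 9000, "ssn": "xxx", "region": "EMEA"}
print(read_row("analyst", row))  # ssn is stripped from the result
print(read_row("admin", row))    # full row
```

In a real platform these rules would be enforced by the storage and query engines themselves, so that every access path (files, tables, jobs, queries) goes through the same checks.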
Auditing. In some applications, especially in financial services, all actions on data must be captured and available for analytics in order to demonstrate regulation compliance.
Encryption. Encryption should be evaluated at all levels. Over the wire, the platform should be able to encrypt all communication between cluster nodes and clients in order to prevent data theft via packet sniffing. The other level is encryption of data at rest, to prevent unauthorized users from accessing sensitive data and to protect against theft of the underlying storage. For some applications, it’s important to encrypt only a subset of the data, such as personal information like a credit card number.
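The "encrypt only a subset of the data" idea can be sketched as field-level protection. The example below uses tokenization with an in-memory vault as a stand-in; a real platform would apply authenticated encryption (e.g., AES-GCM) and a managed key service, and all names here are illustrative.

```python
import secrets

# Hypothetical field-level protection: only the sensitive field is
# tokenized before the record is stored. In production, use real
# authenticated encryption and a secured key/token service.
_vault = {}  # token -> original value

def tokenize(value):
    token = "tok_" + secrets.token_hex(8)
    _vault[token] = value
    return token

def detokenize(token):
    return _vault[token]

def protect_record(record, sensitive_fields=("card_number",)):
    out = dict(record)
    for field in sensitive_fields:
        if field in out:
            out[field] = tokenize(out[field])
    return out

rec = {"user": "alice", "card_number": "4111111111111111", "amount": 12.5}
stored = protect_record(rec)
assert stored["card_number"].startswith("tok_")            # sensitive field replaced
assert detokenize(stored["card_number"]) == rec["card_number"]
assert stored["amount"] == 12.5                            # other fields untouched
```

The point of scoping protection to specific fields is that the rest of the record stays directly queryable for analytics, while the sensitive value is unreadable without access to the vault or key.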
Rapid Application Deployment
While reliability and security are key components of a successful data platform, an additional success factor is the ability to rapidly develop applications and continuously deploy new application features.
Applications must deal with a large variety of data sources and data structures. In the IoT world, where sensors and devices are deployed, a good example is wearable devices for sports and healthcare. Initially, these devices were not consistently connected and captured only a small set of data, such as step counts and timestamps. Newer devices, however, capture much more information, such as GPS location, heart rate, blood pressure, respiration, and temperature, and send it to the data platform over 4G cellular connections. The data platform that handles all of this data has to be able to ingest, store, and use the new information added by these new devices.
Having a NoSQL document database inside a data platform, with the same level of reliability and security as the rest of the system, is great for lowering the overhead of modelling and storing data. Consider the wearable-device example: the device simply sends all of its data securely as JSON, and any new fields are automatically saved and immediately usable in the database.
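A minimal sketch of why this matters, using a toy in-memory collection in place of a real document database (the device IDs and field names are made up): documents from a newer device carry extra fields, and nothing about the store has to change.

```python
import json

# Toy document store illustrating the schemaless property: documents
# with new fields are accepted without any schema migration.
collection = []

def ingest(raw_json):
    collection.append(json.loads(raw_json))

# An early device sends only steps:
ingest('{"device_id": "w-001", "steps": 4200}')
# A newer device adds GPS and heart-rate fields -- no schema change needed:
ingest('{"device_id": "w-002", "steps": 8100, "heart_rate": 72, '
       '"gps": {"lat": 48.85, "lon": 2.35}}')

# Both document shapes are immediately queryable:
with_hr = [d for d in collection if "heart_rate" in d]
print(with_hr[0]["device_id"])  # w-002
```

With a relational schema, the second document would have required an ALTER TABLE and an application release; here the new fields are simply part of the stored document.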
As we have seen, using a JSON database and API allows the application development team to adapt to change faster, which results in a faster time to market – an important success factor for any application.
An interesting side effect of having a rich database and API is the fact that developers can leverage new streaming frameworks to ingest the data in real time and to process the data in different ways during the ingestion phase. For example, one user can directly save all JSON documents in the main table while another user can use the same data to create pre-aggregated documents to provide rich information in real time to all users. Coming back to our wearable device example, the total number of hours, steps, or mileage is calculated during the ingestion, resulting in extremely fast, real-time reporting about a user’s activity as well as an easy way to create a leaderboard with a user’s friends.
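The two-consumers pattern above can be sketched in a few lines: one path appends the raw document to the main table, while the other maintains a pre-aggregated view as each event arrives. The field names and structures are illustrative.

```python
from collections import defaultdict

# Sketch of ingestion-time pre-aggregation: alongside the raw document
# table, a running per-user total is updated as each event arrives, so
# reports and leaderboards become cheap reads instead of full scans.
raw_events = []
totals = defaultdict(lambda: {"steps": 0, "minutes": 0})

def ingest(event):
    raw_events.append(event)            # main table: raw documents
    agg = totals[event["user"]]         # pre-aggregated view
    agg["steps"] += event["steps"]
    agg["minutes"] += event["minutes"]

for e in [{"user": "amy", "steps": 3000, "minutes": 25},
          {"user": "bob", "steps": 5000, "minutes": 40},
          {"user": "amy", "steps": 4000, "minutes": 30}]:
    ingest(e)

leaderboard = sorted(totals.items(),
                     key=lambda kv: kv[1]["steps"], reverse=True)
print(leaderboard[0])  # ('amy', {'steps': 7000, 'minutes': 55})
```

Because the aggregate is maintained incrementally, the leaderboard query is a simple sort over a small view, which is what makes "real-time reporting" feasible at ingestion rates.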
To reiterate, JSON support inside the data platform provides an easy way to store information, but it’s also a great way to create new services that deliver real-time results because “operational storage” can be added to the global data platform that is also used for analytics and processing.
Dealing with Large-Scale Analytics
The last important aspect to consider when selecting your big data platform concerns data processing and analytics. The platform must provide a way to use traditional BI and visualization tools to help users analyze the data. This is where having a distributed SQL engine running on top of your Hadoop cluster is very important. Tools like Apache Drill provide ANSI SQL support on various data sources and formats (JSON and CSV files, JSON databases, HBase, and other databases), and they can be used with BI tools like Tableau, QlikView, Pentaho, and more via their ODBC/JDBC drivers.
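To illustrate the kind of query a BI tool would issue, here is a sketch using SQLite as a local stand-in; a distributed engine such as Apache Drill would run a comparable ANSI SQL query directly against JSON files or tables across the cluster, with no explicit load step. The table and column names are made up.

```python
import sqlite3

# SQLite stands in for the distributed SQL engine so the example is
# self-contained; the SQL itself is the point.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (user_id TEXT, day TEXT, steps INTEGER)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?)",
    [("amy", "2016-05-01", 7000),
     ("bob", "2016-05-01", 5000),
     ("amy", "2016-05-02", 6500)],
)

# The kind of aggregate query a BI tool sends over ODBC/JDBC:
rows = conn.execute(
    "SELECT user_id, SUM(steps) AS total_steps "
    "FROM activity GROUP BY user_id ORDER BY total_steps DESC"
).fetchall()
print(rows)  # [('amy', 13500), ('bob', 5000)]
```

The value of a standard SQL layer is exactly this portability: the same query text works from any ODBC/JDBC-capable tool without teaching it about the underlying file formats.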
However, users cannot really perform deep analysis with just data visualizations. The platform must provide an advanced data processing and analytics framework to extract key information from the data store and process this data in order to provide real value. For example, you can build a recommendation engine or a fraud-detection model all in real time, which means that the data can be processed and analyzed as it’s streamed into the platform.
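As a concrete (and deliberately simplified) example of stream-time fraud detection, the sketch below scores each transaction as it arrives using a running per-card count within a short window. The thresholds, field names, and rule itself are illustrative stand-ins for a real model.

```python
from collections import defaultdict

# Minimal rule-based fraud scoring applied as events stream in.
# Production systems would use a trained model; this shows the shape
# of scoring-at-ingestion, not a real detection strategy.
WINDOW_SECONDS = 60
MAX_TX_PER_WINDOW = 3
recent = defaultdict(list)  # card -> timestamps of recent transactions

def score(tx):
    ts_list = recent[tx["card"]]
    # Drop timestamps that have fallen out of the window.
    ts_list[:] = [t for t in ts_list if tx["ts"] - t < WINDOW_SECONDS]
    ts_list.append(tx["ts"])
    # Flag bursts of transactions or unusually large amounts.
    return len(ts_list) > MAX_TX_PER_WINDOW or tx["amount"] > 5000

stream = [{"card": "c1", "ts": t, "amount": 20} for t in (0, 5, 10, 15)]
flags = [score(tx) for tx in stream]
print(flags)  # [False, False, False, True] -- 4th tx within the window
```

The key architectural point is that the decision is made while the event is in flight, on the same platform that stores the data, rather than in a nightly batch job.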
This is where tools like Apache Spark, with its Spark Streaming and Spark ML libraries, come in very handy on top of a Hadoop cluster. Incoming data streams are saved into the system and processed at the same time to create various machine-learning models or pre-aggregated views of the data. All of this is done on the distributed cluster to leverage the power of all the nodes and also to execute the processing as close as possible to the data (data locality). The platform can analyze the data on the entire cluster, providing users with information in real time and allowing them to continuously analyze the data, build new services, or identify business actions based on the real-time data.
Modern applications are dealing with more and more data coming from a growing variety of sources, and this data must be stored, processed, and analyzed on a rich data platform. When you are evaluating and implementing a next-generation data platform, make sure that it’s highly available and secure in order to provide a flexible data store. This will enable application developers to adapt rapidly to new requirements. Using a highly distributed file system and a NoSQL document database with JSON support will help in this area. In addition, by integrating your data platform with distributed processing and analytics tools (such as Spark and a distributed SQL engine), you will be able to perform large-scale analytics and provide new views and services around your enterprise data.
Tugdual Grall is Technical Evangelist EMEA at MapR Technologies. “Tug” is an open-source advocate and a passionate developer. He currently works with the European developer communities to ease MapR, Hadoop, and NoSQL adoption. Before joining MapR, Tug was Technical Evangelist at MongoDB and Couchbase. Tug has also worked as CTO at eXo Platform and JavaEE product manager, and as a software engineer at Oracle. Follow him on Twitter @tgrall. Email him at firstname.lastname@example.org
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.