Editor’s note: This article is part of a series examining issues related to evaluating and implementing big data analytics in business.
Big data management and analytics applications rely on an ecosystem of components that can be combined in a variety of ways to address application requirements that range from holistic information technology (IT) scalability to objectives associated with specific algorithmic demands. The ecosystem stack may include:
• Scalable storage systems that are used for capturing, manipulating, and analyzing massive data sets.
• A computing platform, sometimes configured specifically for large-scale analytics, often composed of multiple (typically multi-core) processing nodes connected via a high-speed network to memory and disk storage subsystems. These are often referred to as appliances.
• A data management environment, whose configurations may range from a traditional database management system scaled to massive parallelism to databases configured with alternative distributions and layouts, to newer graph-based or other NoSQL data management schemes.
• An application development framework to simplify the process of developing, executing, testing, and debugging new application code. This framework should include programming models, development tools, program execution and scheduling, and system configuration and management capabilities.
• Packaged methods of scalable analytics (including statistical and data mining models), layered so that analysts and other business consumers can configure them to improve the design and building of analytical and predictive models.
• Oversight and management processes and tools that are necessary to ensure alignment with the enterprise analytics infrastructure and collaboration among the developers, analysts and other business users.
This article provides a high-level overview of three aspects of this big data ecosystem and associated technologies: storage, appliances, and data management. A future article will cover the others.
Storage: Infrastructure Bedrock for the Data Lifecycle
In any environment intended to support the analysis of massive amounts of data, there must be infrastructure that supports the data lifecycle across acquisition, preparation, integration, and execution. The need to acquire and manage massive amounts of data suggests a need for specialty storage systems to accommodate big data applications. When evaluating specialty storage offerings, some variables to consider include:
• Scalability – How well can the storage subsystem support massive data volumes?
• Extensibility – Is there flexibility in the storage system architecture to grow without the constraint of artificial limits?
• Accessibility – Can the system support simultaneous access by multiple user communities without compromising performance?
• Fault-tolerance – To what extent can the storage environment tolerate failures?
• High-speed I/O – Can the input/output channels satisfy the demanding timing requirements of big data analytics?
• Integratability – Can the storage environment be easily integrated into the production environment?
Big Data Appliances: Hardware and Software Tuned for Analytics
Because big data applications and analytics demand a level of system performance that exceeds the capabilities of typical systems, there is a general need for scalable multiprocessor configurations tuned to meet mixed-use demand for reporting, ad hoc analysis, and more complex analytical models.
Different architectural configurations address scalability and performance issues in different ways. When deciding which type of architecture best fits your analytics needs, consider the alternatives, including symmetric multiprocessor (SMP) systems, massively parallel processing (MPP) systems, and software appliances that adapt to parallel hardware system models.
Hardware appliances are designed for big data applications. They often incorporate multiple (multi-core) processing nodes and multiple storage nodes linked via a high-speed interconnect. Support tools are usually included as well to manage high-speed integration connectivity and enable mixed configurations of computing and storage nodes.
A software appliance for big data is essentially a suite of high-performance software components that can be layered on commodity hardware. Software appliances can incorporate database management software coupled with a high-performance execution engine and query optimization to support and take advantage of parallelization and data distribution. Vendors may round out the offering by providing application development tools and analytics capabilities, as well as enabling direct user tuning with alternate data layouts for improved performance.
The benefits of using hardware appliances for big data center on engineering and integration. They are engineered for high-performance reporting and analytics, yet have a flexible architecture allowing integrated components to be configured to meet specific application needs. And while there is a capital investment in machinery, hardware appliances are low-cost when compared to massive data warehouse hardware systems.
One benefit of using software appliances, meanwhile, is that they can take advantage of low-cost commodity hardware components. In addition, the reliance on commodity hardware allows a software appliance to be elastic and extensible.
NoSQL: Alternatives for Big Data Management
Both hardware and software appliances support standard SQL-based relational database management systems (RDBMSs). Software appliances often bundle their execution engines with the RDBMS and with utilities for creating the database structures and for bulk data loading. However, the availability of a high-performance, elastic distributed data environment enables creative algorithms to exploit variant modes of data management.
Many of these alternative data management frameworks are grouped under the term “NoSQL databases.” NoSQL holds out the promise of greater flexibility in database management while reducing dependence on formal database administration. NoSQL databases have more relaxed modeling constraints, which may benefit both application developers and end-user analysts, whose interactive analyses are not throttled by the need to cast each query in terms of a relational table-based environment.
Different NoSQL frameworks are optimized for different types of analyses. For example, some are implemented as key-value stores, which align nicely with certain big data programming models, while another emerging model is the graph database, in which a graph abstraction embeds both semantics and connectivity within its structure. More generally, NoSQL embraces schema-less modeling, in which the semantics of the data are embedded within a flexible connectivity and storage model. This allows for automatic data distribution and elasticity in the use of computing, storage, and network bandwidth, without binding data to fixed physical locations. NoSQL databases also provide integrated data caching, which helps reduce data access latency and speed performance.
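To make the two data models concrete, here is a minimal in-memory sketch of a key-value store and a graph store. This is illustrative only and not tied to any particular NoSQL product; the class and method names are hypothetical, and real systems add distribution, replication, and persistence.

```python
# Minimal, hypothetical sketches of two NoSQL data models.
# Real NoSQL stores add distribution, replication, and persistence.

class KeyValueStore:
    """Schema-less key-value store: values are opaque to the store."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


class GraphStore:
    """Graph model: semantics and connectivity live in labeled nodes and edges."""
    def __init__(self):
        self.nodes = {}   # node id -> property dict
        self.edges = []   # (source, label, target) triples

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target):
        self.edges.append((source, label, target))

    def neighbors(self, node_id, label=None):
        # Follow outgoing edges, optionally filtered by edge label.
        return [t for (s, l, t) in self.edges
                if s == node_id and (label is None or l == label)]


kv = KeyValueStore()
kv.put("customer:42", {"name": "Pat", "segment": "premium"})

g = GraphStore()
g.add_node("customer:42", name="Pat")
g.add_node("product:7", name="Widget")
g.add_edge("customer:42", "PURCHASED", "product:7")

print(kv.get("customer:42")["segment"])          # premium
print(g.neighbors("customer:42", "PURCHASED"))   # ['product:7']
```

Note how neither model requires a predeclared schema: the key-value store treats values as opaque, while the graph store carries meaning in its node properties and edge labels.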
The loosening of the relational structure is intended to allow different models to be adapted to specific types of analyses, and the technologies are still evolving and maturing. Because the “relaxed” approach to modeling and management does not enforce shoehorning data into strictly defined relational structures, the models themselves do not necessarily impose validity rules. This potentially introduces risks associated with ungoverned data management activities, such as inadvertent inconsistent data replication, reinterpretation of semantics, and currency and timeliness issues.
Use Case Illustrations

| Analytics use case | Data management considerations | Storage considerations | Appliance considerations |
| --- | --- | --- | --- |
| Improving targeted customer marketing | Customer profiles are likely to be managed in a standard data warehouse using dimensional models. Analytic algorithms may require more flexible data structures such as hash tables or graphs. | Must combine streamed data for analysis with customer profiles that are typically stored in a data warehouse. | Hardware appliances that can support traditional data warehouse models as well as analytical environments may be preferred. |
| Social media analytics | These applications rely heavily on algorithmic execution, but may also require entity extraction and identity resolution, necessitating a combination of traditional data management and NoSQL platforms. | Depending on the amount of information to be streamed, may require a large storage footprint with high-speed I/O to handle the volume. However, since the data streams quickly and value instantiation may be transient, this application may be tolerant of failures. | Much of the discussion around scalable, high-performance analytic engines centers on social media analytics, with Hadoop deployed across various hardware configurations a popular choice. |
| Fraud detection | Fraud detection combines continuous analysis in search of patterns that can be related to individuals or cohorts that may be either known or unknown. This suggests a need for a variety of analytical models that can be integrated with traditional relational data models. | Depending on the application, there will be a need to capture and manage large amounts of data over long periods of time. | Depends on the size of the analysis. Larger environments will require scalable and elastic computational platforms. |
| Website recommendation engine | As with other big data applications, this will need to combine static profile information with dynamic calculations associated with real-time activity, requiring a combination of traditional data warehouse models and more eclectic models that can be deployed using NoSQL-style frameworks. | For large e-commerce applications, the amount of data is proportional to both the number of visitors and the average number of web events per visitor, potentially resulting in massive amounts of data requiring a large, scalable storage footprint. As with social media analytics, there may be some tolerance of failures. | The choice of hardware vs. software appliance is related to performance expectations. The need for real-time or immediate computation and response may dictate dedicated hardware systems. |
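As a toy illustration of the recommendation-engine use case, the sketch below combines a static, warehouse-style customer profile with a dynamic stream of web events. All names, data, and rules here are hypothetical placeholders, not a production design.

```python
# Toy sketch: merge a static customer profile (warehouse-style lookup)
# with dynamic per-session web events to pick a recommendation category.
# All data and logic here are hypothetical.

profiles = {  # static, warehouse-style profile data
    "u1": {"segment": "outdoor", "favorite_category": "camping"},
}

session_events = [  # dynamic, real-time clickstream for one visitor
    {"user": "u1", "action": "view", "category": "hiking"},
    {"user": "u1", "action": "view", "category": "hiking"},
    {"user": "u1", "action": "view", "category": "camping"},
]

def recommend(user):
    # Count recently viewed categories from the live event stream...
    counts = {}
    for e in session_events:
        if e["user"] == user and e["action"] == "view":
            counts[e["category"]] = counts.get(e["category"], 0) + 1
    # ...and fall back to the static profile when the session is empty.
    if counts:
        return max(counts, key=counts.get)
    return profiles[user]["favorite_category"]

print(recommend("u1"))  # hiking
```

The point of the sketch is the split the table describes: the profile lookup belongs to the traditional warehouse side, while the clickstream counting is the kind of transient, schema-light computation that NoSQL-style frameworks target.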
Considering Platform Alternatives
In deploying an analytics environment, the key infrastructure investment decisions focus on the platform. Selecting a specific architectural approach requires specifying key measures of system performance and properly assessing the scalability requirements of the intended analytical applications.
Consider all aspects of the performance needs of the different types of applications: data scalability, user scalability, access and loading speed, the need for workload isolation, reliance on parallelization and optimization, reliability in the presence of failures, the dependence on storage duplication or data distribution and replication, among other performance expectations. Then examine how the performance needs of the different types of applications are addressed by each of the architectures. This will provide a measurable methodology for assessing technology suitability.
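One way to make that assessment measurable is a simple weighted scoring sketch across the performance dimensions listed above. The criteria weights, candidate architectures, and 1-to-5 scores below are hypothetical placeholders for illustration, not recommendations.

```python
# Hypothetical weighted-score comparison of platform alternatives.
# Weights reflect how much each performance dimension matters to the
# intended applications; scores (1-5) reflect how well each candidate
# architecture addresses it. All numbers here are made up.

weights = {
    "data scalability": 0.30,
    "user scalability": 0.15,
    "load/access speed": 0.20,
    "workload isolation": 0.10,
    "fault tolerance": 0.15,
    "distribution/replication": 0.10,
}

scores = {
    "SMP system": {
        "data scalability": 3, "user scalability": 4,
        "load/access speed": 4, "workload isolation": 3,
        "fault tolerance": 3, "distribution/replication": 2,
    },
    "MPP appliance": {
        "data scalability": 5, "user scalability": 4,
        "load/access speed": 5, "workload isolation": 4,
        "fault tolerance": 4, "distribution/replication": 5,
    },
    "software appliance": {
        "data scalability": 4, "user scalability": 3,
        "load/access speed": 3, "workload isolation": 3,
        "fault tolerance": 4, "distribution/replication": 4,
    },
}

def suitability(arch):
    # Weighted sum of the architecture's scores across all criteria.
    return sum(weights[c] * scores[arch][c] for c in weights)

for arch in scores:
    print(f"{arch}: {suitability(arch):.2f}")
```

The value of such a matrix is less the final numbers than the discipline it imposes: each requirement must be weighted explicitly, and each architecture must be scored against the same criteria.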
David Loshin is the author of several books, including Practitioner’s Guide to Data Quality Improvement and the second edition of Business Intelligence—The Savvy Manager’s Guide. As president of Knowledge Integrity Inc., he consults with organizations in the areas of data governance, data quality, master data management and business intelligence. Email him at email@example.com.