We live in a connected, data-centric world. Whether we are talking about physical connections (such as thousands of smart meters attached to residential homes), virtual connections (such as social networks manifested via social media environments like Facebook or LinkedIn), or even biological connections (such as the millions of interactions that take place within the human body), there is a growing interest in applications for monitoring and analyzing graphs whose vertices and edges represent the ways that things are connected.
Modeling these networks requires innovative methods for application design and development. The limitations of the relational database management system (RDBMS) model create an opportunity for disruptive approaches like graph data management tools, which are naturally adapted to support business processes related to connectivity. These tools provide an elegant framework for creating, storing, and analyzing data that represents different types of networks.
As interest in exploiting graph analytics increases, organizations whose business challenges consume massive amounts of data may find that general-purpose graph analytics tools do not scale well enough to meet the performance needs of analyzing gargantuan graphs.
Here are five mistakes to avoid when choosing a graph data analytics solution:
- Misunderstanding data growth acceleration: Business analysts want to ingest static and streaming data from many sources. However, they seldom understand that the volume of data not only increases, but that its growth accelerates as more sources are incorporated into the graph model. Be aware of the number of entities as well as the different ways they are linked together, and develop a data volume growth model that accurately predicts the needs for performance scalability before you acquire a tool. Understanding the growth model will help you choose a graph database and analytics tool that provides an efficient means for representing larger graphs without degrading application performance.
- Not taking data non-locality into account: Optimal performance for graph analytics depends on efficiency in traversing the edges linking entity nodes. As graphs incorporate more entities and their relationships, they consume more memory and disk space. Because node connectivity is unpredictable, linked nodes may not be located close to one another in memory or on disk. Non-locality increases graph data access time, and this data latency is a severe bottleneck for application performance. Look for tools that adapt their internal graph representation to distributed and parallel processing environments with optimizations to minimize the penalties of data latency.
- Relying on in-memory computing: To reduce data access latency, the most efficient graph analytics applications load the entire graph into memory. But once the data volume exceeds what fits in memory, the application will start thrashing the memory cache and swapping data to and from disk, seriously degrading performance. Look for graph database tools that distribute the graph across an environment with multiple processing nodes and provide tuned libraries for performing general graph operations that can overcome the limitations of the memory hierarchy. For example, techniques from the High Performance Computing (HPC) community, such as acceleration with Graphics Processing Units (GPUs) and other many-core hardware, provide promising performance increases for graph query and analytic workloads.
- Underestimating development requirements: Often, small test cases are used in developing graph application prototypes, and a successful proof of concept will encourage the developers to morph the prototype into a production application. But when the prototype goes from a proof of concept to a real implementation, issues with scalability and performance are uncovered due to deficiencies in the prototype design. The right design will address functional requirements, graph volume, and analysis expectations. Designers and developers may need some hand-holding to encourage development best practices. Look for a tool with an integrated development environment that both guides the designers and programmers in application development and automatically generates code that is targeted to the underlying hardware configurations.
- Not understanding how graph algorithms are used: Queries involving graphs differ from those executed against an RDBMS, and the analyst seldom has any awareness of how graph queries are performed. But since many graph applications are intended to support modeling and predictive analytics, look for a product that natively supports a full library of graph operations, ranging from the standard methods for breadth-first and depth-first traversals to the more complex tasks that build on those algorithms (especially for social network analysis), such as nearest-neighbor clustering, connected components, subgraph matching, distance measures, and node rank analysis.
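To make the last point concrete, here is a minimal Python sketch of how a standard breadth-first traversal serves as the building block for two of the more complex tasks named above: distance measures and connected components. The toy graph, the person names, and the function names `bfs_distances` and `connected_components` are illustrative assumptions, not the API of any particular graph product, and a production tool would use far more scalable data structures.

```python
from collections import deque


def bfs_distances(adj, source):
    """Breadth-first traversal: hop count from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:          # first visit is the shortest hop count
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def connected_components(adj):
    """Group vertices into components by running one traversal per unseen vertex."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        members = set(bfs_distances(adj, start))
        seen |= members
        components.append(members)
    return components


# Hypothetical toy social graph (undirected), stored as an adjacency map.
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("eve", "fay")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

distances_from_ann = bfs_distances(adj, "ann")
components = connected_components(adj)
```

Even in this tiny sketch, each neighbor lookup jumps to an arbitrary entry in the adjacency map, which is the access pattern behind the non-locality concern raised earlier in the list.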
Avoiding these mistakes means doing your homework ahead of time. If your organization’s business challenges can benefit from very large volume graph analytics, recognize that it is critical to be aware of vendor tool capabilities and limitations, and consider exploring graph and analytic approaches that leverage GPUs and other hardware accelerators. Emerging graph database solutions such as Blazegraph have been engineered to be deployed on massively parallel processing architectures. Blazegraph has been designed to scale up transparently on platforms built around clusters of GPUs, and its developers have demonstrated that general graph algorithms achieve a tremendous speed-up when deployed on multi-GPU cluster machines. Ensuring that an implementation runs across a variety of high-performance architecture configurations provides a reasonable means for continually improving application breadth, precision, accuracy, and especially performance.
David Loshin is a research contributor for the Bloor Group and is a consultant providing training and development services for business intelligence, big data, data quality, data governance and master data management initiatives. Loshin writes for many industry publications and also speaks frequently at conferences and creates and teaches courses for The Data Warehousing Institute and other educational organizations. In addition, Loshin is the author of numerous books, including Big Data Analytics and Business Intelligence, Second Edition: The Savvy Manager’s Guide. He can be reached via his website or by email at firstname.lastname@example.org.