There is a huge amount of buzz in the analytics industry surrounding in-memory databases, in-memory computing, and big memory hardware configurations. However, not all solutions associated with the hype are created equal. It is important to understand that the economics, performance, and availability characteristics vary widely across implementation approaches.
One approach is to deploy very large memory configurations and place 100 percent of all data to be accessed in memory. This makes I/O orders of magnitude faster than when accessing data on electro-mechanical disk drives. Of course, some form of non-volatile solid state or electro-mechanical disk drive technology is still required as backing store and for logging purposes to protect against data loss in the case of server failure. However, if all data accesses can be directed to in-memory storage, the performance benefit is measurable and compelling.
A major problem with 100 percent in-memory storage implementations is that the economics do not work for an enterprise-class solution. The typical argument for the 100 percent in-memory approach is that we want great performance and that the cost of memory is decreasing by roughly 30 percent every 18 months. Thus, on a compounded basis, it will eventually be cheap enough to put all data in memory at a reasonable cost. The flaw in this reasoning is that it looks at only one side of the equation. The other side, set against decreasing memory costs, is the increasing size of data.
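The two sides of that equation can be put into a few lines of arithmetic. The sketch below takes the 30 percent per 18 months price decline from the argument above and combines it with a data growth rate; the 60 percent growth figure in the usage loop is purely a hypothetical input for illustration, not a measurement, and the function names are made up for this example.

```python
PERIOD_MONTHS = 18.0   # compounding period quoted in the text
PRICE_DECLINE = 0.30   # memory gets roughly 30 percent cheaper per period (from the text)


def memory_price_factor(months):
    """Fraction of today's price per terabyte that remains after `months`."""
    return (1.0 - PRICE_DECLINE) ** (months / PERIOD_MONTHS)


def relative_memory_spend(months, data_growth_per_period):
    """Cost of holding ALL data in memory, relative to today's cost.

    data_growth_per_period is the fractional growth in data volume per
    18-month period (an assumed input, e.g. 0.60 for 60 percent).
    """
    volume_factor = (1.0 + data_growth_per_period) ** (months / PERIOD_MONTHS)
    return volume_factor * memory_price_factor(months)


def break_even_growth():
    """Data growth per period at which falling prices are exactly cancelled out."""
    return 1.0 / (1.0 - PRICE_DECLINE) - 1.0


if __name__ == "__main__":
    # Hypothetical scenario: data volume grows 60 percent every 18 months.
    for years in (0, 3, 6):
        months = years * 12
        print(years,
              round(memory_price_factor(months), 3),
              round(relative_memory_spend(months, 0.60), 3))
```

Under these assumptions the break-even point is data growth of about 43 percent per 18 months. Any growth rate above that means the total bill for an all-in-memory design rises over time even though each terabyte of memory keeps getting cheaper.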
The Rise in Data Volumes Outpaces the Fall in Memory Prices
If we focus only on ERP data, it is perfectly reasonable to suggest that in a mature industry the number of customers, accounts, and transactions would increase at a rate much lower than 30 percent every 18 months. However, an analytically sophisticated enterprise will certainly not be satisfied with being restricted to transactions as the lowest level of detail in understanding its business.
For the last three years there has been a clear trend in analytically sophisticated organizations toward analyzing interactions at a level below transactions. In addition to order line detail as the lowest level of transactional data in an e-commerce world, advanced analytics would surely include access to clickstream and search term content. Similarly, in telecommunications the focus has expanded from call detail records (CDRs) as the lowest level of detail from the billing systems to the use of lower level network data to understand the full subscriber experience. Examples such as these exist in every industry, to say nothing of the explosion of data coming from social media and sensor sources.
The main point is that the volume of data that a leading enterprise will pursue for analysis grows faster than memory is getting cheaper. Moreover, observation of analytic access patterns makes it very clear that putting all data in memory is economically irrational. In almost all enterprise class analytic environments, less than 25 percent of the data accounts for more than 90 percent of the I/Os over a given (not too large) period of days. This means that investing in placing all data in memory (at an order of magnitude higher cost per terabyte than traditional storage) is a waste of money for 75 percent or more of the data. Most data in a “big” data environment will not be cost-justified for a 100 percent in-memory solution.
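A rough cost comparison makes the point concrete. The sketch below uses the two figures given above (about 25 percent of the data is hot, and memory costs about ten times as much per terabyte as traditional storage). The 100 TB volume and the unit prices are hypothetical, and the calculation ignores the backing store that an all-in-memory system still needs, which would only widen the gap.

```python
def storage_cost(total_tb, hot_fraction, memory_cost_per_tb, disk_cost_per_tb):
    """Return (all_in_memory_cost, hybrid_cost) in the same arbitrary cost units.

    The hybrid design keeps only the hot fraction in memory and the rest on disk.
    """
    all_in_memory = total_tb * memory_cost_per_tb
    hybrid = total_tb * (hot_fraction * memory_cost_per_tb
                         + (1.0 - hot_fraction) * disk_cost_per_tb)
    return all_in_memory, hybrid


if __name__ == "__main__":
    # Hypothetical inputs: 100 TB, disk at 1 cost unit per TB, memory at 10 units.
    all_mem, hybrid = storage_cost(total_tb=100, hot_fraction=0.25,
                                   memory_cost_per_tb=10.0, disk_cost_per_tb=1.0)
    print(all_mem, hybrid, hybrid / all_mem)
```

With these inputs the hybrid layout costs roughly a third of the all-in-memory layout while still serving the data behind more than 90 percent of the I/Os from memory.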
An effectively engineered solution will provide the lowest price per I/O for the frequently accessed (“hot”) data and the lowest price per terabyte for infrequently accessed (“cold”) data. In-memory definitely provides the lowest cost per I/O and is very appropriate for the small percent of the enterprise data that is accessed very frequently. Electro-mechanical disk drives remain the most cost effective solution for storing very high volumes of data which are accessed infrequently. Data in the middle ground (“warm”) would be a good candidate for flash memory or solid state disk drives.
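The hot, warm, and cold split can be sketched as a simple placement rule: rank data by how often each terabyte is touched, fill memory first, then flash, and leave the rest on disk. This is only an illustration under stated assumptions. The `Block` type, the table names, the capacities, and the greedy rule are all hypothetical, and every block is assumed to have a positive size.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    size_tb: float           # assumed to be greater than zero
    accesses_per_day: float


def place_blocks(blocks, memory_capacity_tb, flash_capacity_tb):
    """Greedily assign each block to 'memory', 'flash', or 'disk'.

    Blocks are ranked by access density (accesses per terabyte per day),
    so the fastest tier goes to the data that earns it per terabyte.
    """
    remaining = {"memory": memory_capacity_tb, "flash": flash_capacity_tb}
    placement = {}
    ranked = sorted(blocks,
                    key=lambda b: b.accesses_per_day / b.size_tb,
                    reverse=True)
    for block in ranked:
        for tier in ("memory", "flash"):
            if block.size_tb <= remaining[tier]:
                remaining[tier] -= block.size_tb
                placement[block.name] = tier
                break
        else:
            # Fits in neither fast tier: keep it on the cheapest storage.
            placement[block.name] = "disk"
    return placement


if __name__ == "__main__":
    blocks = [
        Block("orders_recent", 1.0, 9000.0),
        Block("clickstream_archive", 50.0, 10.0),
        Block("orders_last_quarter", 4.0, 800.0),
    ]
    print(place_blocks(blocks, memory_capacity_tb=2.0, flash_capacity_tb=5.0))
```

Ranking by accesses per terabyte rather than by raw access counts reflects the goal stated above: lowest price per I/O for hot data and lowest price per terabyte for cold data.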
This is not to say that there are no interesting use cases for a 100 percent in-memory solution – but these tend not to be enterprise solutions. For example, an in-memory deployment of an operational data store (ODS) with extreme service-level requirements in the areas of tactical query access and up-to-date content could possibly be a very good candidate for such a system. (An operational dashboard application that displays very up-to-date content and requires limited visibility into historical data would be a good use case.) Of course, one should also consider the very long mean time to recovery (MTTR) associated with unplanned failures when “re-loading” a full in-memory database solution. In the case of an ODS with responsibility for operational reporting using only recent data, the volumes are probably small enough to make recovery times acceptable and the economics of complete in-memory storage viable.
The Trend Toward Hybrid Storage for Analytic Platforms
For enterprise analytic platforms, the clear trend is toward deployments involving a hybrid of storage technologies. The differentiation of the solutions is less about hardware specifics than it is about software sophistication related to multi-temperature data management. A wide variety of competent hardware vendors can assemble a platform engineered with a storage hierarchy consisting of various form factors of memory, flash, and hard disk drives. The tricky part is getting the right data into the right storage to simultaneously optimize both performance and cost.
If it were possible to map hot and cold data to appropriate storage devices, it would be worth sending your best database administrator (DBA) away for two weeks to profile hot versus cold data to figure out the optimal placement of data across devices in the storage hierarchy. The problem is that by the time the DBA got back, she would be wrong. Today’s hot data becomes tomorrow’s cold data. The temperature of data is constantly changing based on dynamic workload characteristics. The data that deserves in-memory residence can change based on day of week, hour of day, and/or current analytics being undertaken. Assigning a human to track these ever changing data temperatures is a losing proposition for large-scale systems.
Automatic placement and migration of data according to analytically predicted and measured access patterns is essential for practical implementations of enterprise scale analytic solutions. It should be noted that intelligent use of multi-temperature data management techniques is much more than simply aging older data to cheaper and slower devices. Data in different subject areas will have different temperature characteristics and there will often be much more sophisticated patterns than simple time decays in data usage. Moreover, understanding different patterns of use associated with data versus indexes versus temp (spool) space content is critical for optimizing data placement and migration.
Once all major hardware vendors provide hierarchical storage capability, the battleground will quickly move to differentiation based on software sophistication in multi-temperature data management rather than hardware configurations. The critical points to remember when evaluating solutions in this space are: (1) smart memory beats big memory, and (2) automatic data management beats manual data management.
Stephen Brobst is the chief technology officer for Teradata Corporation. A fellow of The Data Warehousing Institute, he has taught there since 1996 including courses such as High-Performance Data Warehouse Design, Capacity Planning, and The Future of Data Warehousing.