Although it is well known that the IT industry is under constant transformation, the pace of change has accelerated rapidly in the last four or five years. Those of us who follow this industry closely know that, every once in a while, a new company shows up with a revolutionary technology and forces its way past the well-established blue-chip vendors. Changes such as the tablet and smartphone revolution have affected almost everybody, from the consumer up to the corner offices of the biggest enterprises. Borders are being crossed, with waves such as bring your own device (BYOD) and the Internet of Things (IoT) blurring the lines between consumers and IT professionals.
Any revolution has room for new ideas, such as Hybrid Transaction/Analytical Processing (HTAP), the concept of having real-time analytics sharing the same live data with high-throughput transactional systems. This revolution also has room for an old, existing technology to suddenly find a unique new application and become the subject of important trends and discussions. In-memory databases fit this category perfectly.
In-memory databases are not new. They were created on a simple premise: because RAM access times are tremendously faster than disk I/O (around 100 times, according to Peter Norvig), a database that processes data residing purely in RAM should be tremendously faster than one that stores its data on disk.
Over the years, the speed advantage of RAM over disk has caused vendors to seek in-memory solutions for high-throughput application needs, such as stream processing in telecommunications packet switching. In past decades, the high cost of memory prevented some projects from using enough RAM to load large, multi-gigabyte databases in memory. At an average cost of US $100 for a megabyte of RAM in the early 1990s, a gigabyte of memory would have cost you US $100,000. But dramatic price declines for memory chips made such technology increasingly attractive. In the early 2000s, each megabyte cost only US $1.50, and nowadays a megabyte costs less than one penny.
A Cost/Performance Compromise
The performance attractiveness of memory has led to an innovation heavily used by databases: the marriage of disk and memory called “caching.”
The concept of caching has existed since the early microprocessors. It was sparked by the fact that CPUs were faster than the memory they accessed. CPUs already implemented high-speed memory devices, called "registers," which provide a few bytes of storage inside the microprocessor that can be accessed much faster than external RAM. As CPUs became faster, RAM became more of a bottleneck, which led to the practice of placing a small amount of very fast memory close to the microprocessor. Thus the idea of the hardware cache was born: a limited amount of high-speed memory directly accessible by the CPU. Because these cache devices are smaller than traditional memory, the CPU uses sophisticated algorithms to decide what to store in cache and what to leave in memory. Caches became a very popular mechanism for balancing cost and performance.
The same forces – speed and cost – that made caching important for CPUs also apply to disk access. In the case of disks, the cache might reside in the disk, in the host adapter, or in RAM memory – or all of the above. This proved to be a major improvement for systems with heavy disk I/O, such as databases. Many, if not all, modern databases have cache technology that allows them to load part of the database in RAM memory while keeping the majority of the data on disk.
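The buffer-management idea described above can be sketched as a tiny read-through page cache. This is illustrative only: the page size, the dictionary standing in for a disk, and the least-recently-used (LRU) eviction policy are assumptions, not any particular vendor's design.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # hypothetical page size in bytes

class PageCache:
    """A minimal read-through page cache with LRU eviction (illustrative only)."""

    def __init__(self, capacity, disk):
        self.capacity = capacity     # maximum pages held in RAM
        self.disk = disk             # fallback store: page_id -> bytes
        self.pages = OrderedDict()   # page_id -> page bytes, in LRU order
        self.hits = self.misses = 0

    def get(self, page_id):
        # Every read first tests for presence in the cache...
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id]
        # ...and only falls back to (slow) disk I/O on a miss.
        self.misses += 1
        page = self.disk[page_id]            # stands in for a real disk read
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return page

# Usage: a "disk" holding 10 pages, but RAM for only 3 of them.
disk = {i: bytes([i]) * PAGE_SIZE for i in range(10)}
cache = PageCache(capacity=3, disk=disk)
for pid in [0, 1, 2, 0, 1, 3, 0]:
    cache.get(pid)
print(cache.hits, cache.misses)  # page 2 was evicted when page 3 came in
```

Real buffer managers are far more elaborate (dirty-page tracking, prefetching, pinning), but the hit/miss split above is the heart of the cost/performance compromise.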
So why not make a cache big enough to load an entire database? The problem is that database sizes have grown dramatically as applications are given more and more data to process. Although the price per megabyte of memory has declined drastically, the amount of memory needed to hold an entire database has increased just as sharply, keeping the equation in balance. That, in turn, has kept in-memory databases a niche solution.
Persistence of Memory
RAM has another problem: it does not offer persistent storage. This issue, combined with high cost, makes RAM a resource to be used sparingly. Disks offer large amounts of slow, inexpensive – but persistent – storage. This is why the majority of databases assume your main data will be stored on some sort of persistent disk.
Most in-memory databases offer persistence by copying data to disk in a background operation that runs asynchronously from the transaction. It is entirely possible for a transaction to complete before the data has been written to the disk. This latency can introduce a reliability and durability problem for some systems. Traditional DBMS vendors argue that this problem was solved years ago through the combination of cache systems and their high-speed storage systems.
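That write-behind pattern can be sketched with a background flusher thread. Everything here is an assumption for illustration – the queue-based design, the log format, and the class and file names are not any specific product's implementation – but it makes the durability gap concrete: the caller sees the update in RAM before the disk write happens.

```python
import queue
import threading

class WriteBehindStore:
    """In-memory key/value store that persists asynchronously (illustrative)."""

    def __init__(self, path):
        self.data = {}                 # the authoritative in-memory copy
        self.log = open(path, "a")
        self.pending = queue.Queue()   # updates waiting to reach disk
        self.flusher = threading.Thread(target=self._flush, daemon=True)
        self.flusher.start()

    def put(self, key, value):
        # The "transaction" completes as soon as RAM is updated...
        self.data[key] = value
        # ...while the disk write is merely queued. A crash in this window
        # loses the update -- the durability gap described above.
        self.pending.put((key, value))

    def _flush(self):
        while True:
            key, value = self.pending.get()
            self.log.write(f"{key}={value}\n")
            self.log.flush()           # the data is durable only after this
            self.pending.task_done()

store = WriteBehindStore("/tmp/writebehind.log")  # hypothetical log path
store.put("balance:42", 100)
print(store.data["balance:42"])  # readable immediately, even if not yet on disk
store.pending.join()             # wait for the background flush to catch up
```

A synchronous (write-through) design would flush before `put` returns, trading the latency back for durability.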
The balance of cost vs. performance was tipped out of equilibrium recently by a new business need: real-time analytics. This is among the forces that are acting together to shape the recent revolution in IT. Together, these forces are known by the name SMAC: Social Media, Mobile, Analytics, and Cloud. Beginning with the smartphone invasion, these forces gave the consumer the power to interact with the digital world in a way that was never possible before.
And interact they did, generating mountains of data in the process.
The industry immediately realized the power of this data: precise knowledge of your customers bestows great power to influence their purchase decisions.
With this opportunity came a big problem: how to process the huge volume of data fast enough to allow you to influence your customer immediately? The goal is to extract useful information fast enough to affect marketing campaigns in real time.
This is where analytics, the "A" in SMAC, comes into play. Anyone in IT knows that "analytics" is another word for "very time-consuming batch operation." Existing technologies could not cope with these demands, because most IT infrastructure is divided into two very separate groups: online transaction processing (OLTP) and data warehousing. The OLTP systems handle typical real-time activities, such as approving credit card transactions, electronic fund transfers, and charging for prepaid cell phone calls. Data warehouses collect useful information after the fact, by the end of the day, filtering and crunching the huge volume of data in large batch jobs. All this data is consolidated into structures called "hypercubes," sophisticated multi-dimensional representations of the important data, which are the source for all the complex analytical processing done by the Business Intelligence system. Currently, such processing can take days, if not weeks.
Real-time analytics is a major technology trend in today’s IT world. In-memory database processing is a perfect fit for this technology. One reason analytic processing takes days is the time required to manipulate large amounts of data on a slow disk. All these systems use traditional databases that rely on disks. If this inherently slow bottleneck could be removed by shifting processing into memory, systems could immediately have performance gains of 100-fold or more!
Several start-ups have addressed this business problem by revisiting the not-so-new idea of in-memory databases. Traditional database vendors are suddenly being challenged as these new companies appear on the horizon. Large corporations are running projects evaluating this new technology with the potential of replacing existing databases with new in-memory systems.
The Empire Strikes Back
The counter-attack from many of these traditional database vendors is that their technology has in-memory capabilities due to their cache systems. Indeed, some cache systems have become very sophisticated, to the point of loading an entire database into cache. But is a cache system really an in-memory database, even if it is able to load an entire database?
Some argue that the answer is "no." There is a conceptual difference between an in-memory database and a database with a traditional cache system. Pure in-memory databases were designed with the mindset that the data is always stored in memory. Cache systems assume that only part of the data is in memory, with the rest available on disk. This requires a test to be performed on every read to verify the presence of the data in the cache, which introduces a delay that pure in-memory databases do not have. (In the special case of the entire database being loaded into cache, the test could be avoided, which would remove a little overhead from the cache system.)
Of greater concern is that CPUs access data on disk differently than they access data residing in memory.
Data on disk is accessed using complex I/O operations. Data in memory can be accessed with much simpler operations. Some cache systems are designed to intercept disk I/O operations and divert them to memory calls if the cache algorithm decides the data is already resident in cache. Disk I/O operations usually require tens to hundreds of CPU instructions, consuming thousands to tens of thousands of CPU cycles. Most, if not all, CPUs can handle a load from memory with a single CPU instruction, which may consume fewer than a hundred CPU cycles. This translates into the overhead of thousands of additional CPU cycles to do what a pure in-memory database achieves with a single memory operation.
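The gap can be felt even from a high-level language, where each file read goes through the I/O path (a system call) while a memory read is a plain slice. This is a rough sketch, not a rigorous benchmark: the record size and counts are arbitrary, absolute numbers vary by machine, and because the OS page cache holds the file, the reads below never even touch a physical disk – the I/O path overhead alone still loses.

```python
import os
import tempfile
import time

RECORD = 64                    # hypothetical record size in bytes
N = 50_000                     # number of records and of accesses

data = os.urandom(RECORD * N)  # the "database", resident in RAM

# Put the same bytes in a file so reads must go through the I/O path.
fd, path = tempfile.mkstemp()
os.write(fd, data)

t0 = time.perf_counter()
for i in range(N):
    rec = os.pread(fd, RECORD, i * RECORD)   # one system call per record
t_io = time.perf_counter() - t0

t0 = time.perf_counter()
for i in range(N):
    rec = data[i * RECORD:(i + 1) * RECORD]  # a plain in-memory slice
t_mem = time.perf_counter() - t0

os.close(fd)
os.remove(path)
print(f"I/O path: {t_io:.4f}s  memory: {t_mem:.4f}s  ratio: {t_io / t_mem:.0f}x")
```

Add a real disk seek on every miss and the ratio grows by orders of magnitude, which is the scenario the paragraph above describes.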
No caching layer, however well written, can beat the speed of a single memory load instruction. To increase performance, some databases avoid slow disk I/O by checking the RAM cache first before issuing I/O instructions to read from disk. But that check itself costs cycles, and if the data is not in cache, the long wait for disk I/O is still necessary.
Another factor is the design of the hardware itself. The memory bus, the pathway used for accessing memory, is designed for much greater speed than the peripheral bus, which communicates with slower peripherals, such as network cards, USB ports, and disks.
There are other differences that may or may not become significant, such as being able to implement smarter indexes or even disregard indexes entirely. Because scanning data in memory is so much faster than doing it on disk, traditional index processing might actually slow processing down rather than speed up a search.
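A toy illustration of that trade-off, with all sizes and values chosen arbitrarily: for a one-off range query over data already in RAM, a brute-force scan can compete with an index, because the index has to be built (or maintained) before it pays off.

```python
import bisect
import random
import time

random.seed(1)
rows = [random.random() for _ in range(200_000)]  # an in-memory "table"

# Option 1: brute-force scan -- no index at all.
t0 = time.perf_counter()
hits_scan = [v for v in rows if 0.25 <= v < 0.26]
t_scan = time.perf_counter() - t0

# Option 2: build a sorted index first, then binary-search it.
t0 = time.perf_counter()
index = sorted(rows)                    # the index-build cost, paid up front
lo = bisect.bisect_left(index, 0.25)
hi = bisect.bisect_left(index, 0.26)
hits_index = index[lo:hi]
t_index = time.perf_counter() - t0

print(f"scan: {t_scan:.4f}s  build + probe index: {t_index:.4f}s")
```

Both options find the same rows; which one wins depends on how often the index can be reused across queries, which is exactly the kind of assumption a disk-era design bakes in and a memory-first design can revisit.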
Another example of revolutionary change comes from a completely different direction: the hardware industry. Solid State Drives (SSDs) are tremendously faster than hard disks, since they are actually memory chips. The evolution of this technology disrupted not only the hard disk industry – as memory chips, SSDs are an entirely different industry from hard disks – but it has also undercut the advantages offered by in-memory software vendors.
The idea behind the SSD is to provide a seamless replacement for slow mechanical disk systems so the software layer does not need to be altered. The main advantage of these devices is that they are backward compatible with existing systems. By replacing an old hard disk with a new SSD, a simple hardware installation, you can approach in-memory speeds with your existing applications! It sounds like a huge advantage and, in fact, these products have seen increasing sales as prices go down and capacity goes up (can anyone say "Moore's Law?"), and as the vendors prove they are up to the task of mission-critical data storage. This solution has been pushed strongly by the traditional database industry, since it brings increased performance to existing environments, which may prevent customers from jumping ship in favor of an in-memory database.
How does an SSD compare to a pure in-memory database? Does the simple replacement of mechanical disks by an SSD convert your existing traditional database to an in-memory solution?
The short answer is “no.” The reason lies in the architecture. Although the replacement of the hard disk by an SSD means you have persistence of your data in a memory chip, the CPU must still access that data with the additional overhead of those cumbersome I/O instructions. From the CPU point of view, the difference is astounding. In other words, pure in-memory databases process inherently faster than SSD solutions.
To put it all in perspective: If the time required for accessing a hard disk is like a trip to Mars, an SSD shortens that to something more like a trip to the moon. But neither compares to an in-memory database, which is like a quick trip to the corner convenience store.
Since this is all very new, the market will decide what it considers to be a true “in-memory” solution. New requirements will put various solutions to stress tests, and soon we will know if the new will beat the old, or if the old will be flexible enough to adapt in time. Either way, these are really interesting times for the entire IT industry.
Evaldo de Oliveira is the Director of Business Development for FairCom, where he focuses heavily on the company’s competitive market strategies. He has more than 20 years of experience in leading strategic sales in the IT industry with a focus on software, software development and infrastructure solutions architecture. Though his career has evolved more towards the business world, Evaldo came up from the technology side of things, which is where his heart remains. When he’s not playing around with new lines of code or working on a new strategic opportunity, Evaldo is most likely with his family or his motorcycle. He holds a Bachelor’s Degree in Computer Science and a Master’s in Electrical Engineering from the University of São Paulo, Brazil.