If you are familiar with popular hypervisor-based virtualization technologies, then you probably know the benefits of hardware virtualization. By deploying virtual machines (or virtual instances), you get a variety of advantages: better overall hardware utilization, application isolation, more predictable and repeatable deployments, portability, and so on. Virtual machines are clearly a highly useful technology, as evidenced by their widespread adoption. Even cloud providers leverage virtualization for their systems, as it lets them share large computers among multiple users and deliver only the resources that customers need and pay for. One characteristic of hypervisor-based virtualization is its dependency on a complete operating system running within each virtual machine. While this architecture is often an advantage, it can be an inhibitor in agile environments that require fast startup and shutdown of virtualized instances. Examples of such environments include web applications that face sudden load bursts and Internet of Things (IoT) solutions that collect data at unpredictable rates. Also, the resources consumed by a complete operating system running within each virtual machine can be significant, and unnecessary for many types of deployments.
That’s where containers come in. With containers (specifically, Linux containers), you don’t need to start up a completely separate operating system environment. The virtualization is handled natively by the Linux operating system running on the host server, so the host’s kernel and many system resources are shared among containers. You don’t incur the overhead of running a complete, independent, logical computer on top of your physical computer. This means less resource consumption along with faster startup and shutdown of virtualized environments. You still get the key benefits of virtualization, but with greater efficiency. In today’s environments that demand agility when dealing with data, applications, and infrastructure, containers are a great way to streamline your operations.
Containers are still in the early stages of adoption and are not yet as widely used as older virtualization technologies. The container ecosystem imposes trade-offs that have slowed broader deployment. Specifically, containers don’t have a scalable and efficient way to store file-based data for long-term use. In other words, they lack a suitable data-persistence tier. They can natively store file data, but only for the life of the container. If you shut down the container for any reason, including to redeploy it on another server with more RAM or CPU power, those files are lost. If containerized applications instead write data to the host server’s file system, you face the problem of how to redeploy those files, along with the container, to other servers. And since your applications typically create useful sources for analytics, such as log files, you have to coordinate the movement of those files to a centralized analytics cluster. Certainly, using a SAN or NAS for container storage is an option, as it was for hypervisor-based virtualization, but today’s data volumes make that option cost prohibitive. That option also entails infrastructure complexity, especially when many hundreds or thousands of servers are involved, along with other data stores such as database management systems (DBMSs). Factor in the performance hit as well, and a SAN/NAS configuration does not look like a viable long-term solution.
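The ephemeral nature of container storage is easy to demonstrate. Here is a minimal sketch using the Docker CLI (the container name, image, and file path are purely illustrative):

```shell
# Write a file inside a container's own writable layer
docker run --name demo busybox sh -c 'echo "app log line" > /data.log'

# The file lives only as long as the container does;
# removing the container discards its writable layer, file and all
docker rm demo

# A fresh container from the same image has no trace of the file
docker run --rm busybox ls /data.log   # "No such file or directory"
```

This is exactly the behavior the article describes: unless the data is deliberately externalized, anything a container writes disappears with the container.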
Due to these challenges, many application developers resort to writing only stateless applications for containers, meaning applications that do not persist any data, particularly not as local files. One significant limitation of this approach, however, is that you cannot create log files, which are often useful for system analysis. Some DevOps teams go a step further and prohibit the use of containers entirely, to ensure they aren’t overwhelmed by the data-management activities that containerized stateful applications require. Neither is a valid option in the long run, especially considering the great potential that containerized environments promise.
There are many solutions on the market that attempt to provide a cost-effective, low-overhead persistence tier for containers, but most of them face one major challenge or another. One type of solution addresses persistence by putting storage drivers into the containers. This works well from an application-development standpoint because developers generally don’t have to worry about the external storage implementation; they write containerized applications no differently than non-containerized applications. However, the burden of data management falls on the DevOps team, which has to coordinate the distinct data silos that store application data. You might have data spread across different storage engines, such as an object store, a NoSQL DBMS, a relational DBMS (RDBMS), and even an event-streaming engine. This gives the DevOps team a lot to manage behind the scenes. They might also have to deal with multiple instances of each of those storage engines. Not only would they have to move data across silos for various types of aggregated processing and analytics, but they would also have to work with distinct security and administrative frameworks. In such environments, any network hiccup or server failure can result in significant troubleshooting and recovery effort.
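From the developer’s perspective, this class of solution typically surfaces as ordinary volumes mounted into the container. A hedged sketch with Docker named volumes (the volume name and paths are illustrative; a real deployment would substitute a vendor’s volume driver for the built-in `local` one):

```shell
# Create a named volume; with a third-party volume plugin, --driver
# would point at external storage rather than the local host
docker volume create --driver local appdata

# The application writes to /var/data as if it were a local directory
docker run --rm -v appdata:/var/data busybox \
    sh -c 'echo "order processed" >> /var/data/app.log'

# The data outlives the container and can be mounted by the next one,
# even though each container itself remains disposable
docker run --rm -v appdata:/var/data busybox cat /var/data/app.log
```

The developer experience is clean, which is the strength of this approach; the operational complexity the article describes lives behind the volume driver, out of the developer’s sight.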
Another type of solution puts the storage engines themselves into containers, but this is better suited to solving a different problem. This configuration is about hardware utilization at the storage-engine level, not specifically about solving container storage challenges. Rather than dedicating a fixed set of hardware resources to each individual data silo, putting the storage engines in containers means you can share hardware resources across your containerized storage engines. For example, if you have 20 hardware servers in a cluster dedicated to your NoSQL DBMS and 20 servers in a cluster dedicated to your event-streaming engine, you cannot easily share servers across the clusters without virtualization. Since the number of dedicated servers is sized for peak load rather than typical load, you often end up with significant underutilization. Colocating different storage engines directly on the same servers leads to severe resource contention, so instead you containerize the storage engines so they can be shifted between servers as load changes. In this example, containerizing your NoSQL DBMS and your event-streaming engine means you can more easily share all 40 servers between the two systems. This configuration does alleviate some DevOps overhead for cluster expansion and contraction, but it does not address the disparate administration and security frameworks across the storage engines, nor does it consolidate data sources for aggregated processing and analytics.
The challenges of data persistence for containers are best handled by a converged data platform. By combining multiple storage engines into a single platform, accessible via client libraries embedded in the containers, you can store data in any of the formats you need as part of your infrastructure. A platform that converges a distributed file system, a NoSQL DBMS, and event streaming gives you the key persistent stores that make up a modern data architecture. And with analytics capabilities via popular technologies like Apache Hadoop and Apache Spark, you simplify your stack even further, because you have immediate analytical access to the data that your containerized applications create. Such a platform provides a competitive advantage by giving you the scale, cost-effectiveness, and agility to manage big data. It also enhances your ability to deploy modern data architectures, such as those based on streams and microservices, which promote real-time processing, higher fault tolerance, and easier testing of machine-learning models. In addition, the convergence ensures that you aren’t creating separate silos of data that increase complexity and administrative overhead. With a converged platform, you have unified administration and a consistent security framework that simplify data-management activities, which, in turn, reduces the risk of error.
A converged data platform also improves hardware utilization at the persistence tier. Since the different storage engines are part of the same platform, you only need a single cluster. This means you can dedicate hardware resources to the platform and share utilization across the storage engines without virtualization. Instead of having to redeploy containerized storage engines across servers as the load changes, the converged platform handles resource sharing itself. This eliminates the need to monitor for, and plan, redeployment of storage-engine containers as loads change, as well as the resource-intensive rebalancing that is typically required when servers are added to or removed from a containerized cluster.
If you’re looking to get the most out of your hardware, then container-based virtualization is one technology to pursue. But application containerization alone won’t get you the level of benefits you seek, so having the right platform for data storage is also important. A converged data platform with support for big data storage engines and big data analytics capabilities will give you the scalability, cost-effectiveness, and agility you need to take full advantage of containers.
Dale Kim is the Sr. Director of Industry Solutions at MapR. His background includes a variety of technical and management roles at information technology companies. While his experience includes work with relational databases, much of his career pertains to non-relational data in the areas of search, content management, and NoSQL, and includes senior roles in technical marketing, sales engineering, and support engineering. Dale holds an MBA from Santa Clara University, and a BA in Computer Science from the University of California, Berkeley.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.