Many of us have had the experience of preparing a file – say, an important presentation – on one computer, where it looks and performs beautifully, and then loading it on a different computer only to have it glitch, look strange, or not function at all.
Now imagine that on the scale of a major big data project for a large corporation.
The problem is real: a slightly different version of the application or operating system between the development environment and the production environment can cause big problems, requiring expensive delays and fixes.
And this is where data containers come in.
A container is an application, including all its dependencies, libraries, and other binaries, and the configuration files needed to run it, bundled into a single package that can be moved, in total, from one computing environment to another.
A container might be used when moving from a developer’s laptop to a testing environment, from that testing environment to live production, or even from a physical machine to a virtual machine in the cloud. It can be used to get around differences in operating systems, software versions, infrastructure, security protocols, and storage.
In fact, the flexible and portable nature of containers often makes them very well suited for cloud-based applications – certainly a factor that has contributed to containers’ rise in popularity among IT systems architects. Many think that as computing and storage move further into the cloud, containerization will become an increasingly important tool.
Data containers are a separate technology from virtualization, though they are based on some of the same theories. With virtualization, an entire machine is replicated, up to and including the operating system, and can be several gigabytes in size. By contrast, a data container shares an operating system with any other container on the same machine, making the file size only tens of megabytes and, therefore, much lighter and more resource-friendly. In addition, there is no need for data containers to be provided with virtual memory and system resources in the same way as virtual machines, meaning they consume less processing power when running. They also boot and load faster. While a typical server at a web-scale enterprise might be expected to support 10 or 15 virtual machine environments, the same server might run hundreds of containerized applications. Crucially, containers are also far easier to transfer from one environment to another.
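To make the size difference concrete, here is a rough back-of-the-envelope sketch in Python. All three figures are assumptions chosen for illustration (a 4 GB virtual machine image, an 80 MB container image, and 64 GB of host storage), and the calculation looks at storage alone. In practice, memory and CPU also limit how many instances a server can run.

```python
# Illustrative figures only. The article says "several gigabytes" per virtual
# machine and "tens of megabytes" per container; the exact values are assumed.
VM_IMAGE_MB = 4 * 1024       # assumed: a 4 GB virtual machine image
CONTAINER_IMAGE_MB = 80      # assumed: an 80 MB container image
HOST_BUDGET_MB = 64 * 1024   # assumed: 64 GB of host storage set aside for images


def instances_that_fit(budget_mb: int, image_mb: int) -> int:
    """Return how many images of the given size fit in the storage budget."""
    return budget_mb // image_mb


vm_count = instances_that_fit(HOST_BUDGET_MB, VM_IMAGE_MB)
container_count = instances_that_fit(HOST_BUDGET_MB, CONTAINER_IMAGE_MB)

print(f"Virtual machines that fit: {vm_count}")
print(f"Containers that fit: {container_count}")
```

With these assumed numbers, the same storage budget holds 16 virtual machine images or 819 container images, which is the same order of magnitude as the contrast between a handful of virtual machines and hundreds of containers.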
Multiple data containers can run on a single operating system, but to the users who access a container, it looks and behaves as if it owns the entire operating system. And because containers must be able to interact with the outside world, they can network and share data with one another.
Why You Should Use Data Containers
A data container can be created that allows multiple application containers to access the same data. These application containers can be created, moved, or destroyed without affecting the original data. This lets the application containers themselves be treated as “stateless,” while the data they share stays identical no matter how many times it is used across different operating systems and applications. This is an important development for organizations wanting to run multiple tests or analyses with persistent data. It also helps avoid the problems that arise when an entire application is set up in one environment and moved to another.
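The relationship between one data container and several application containers can be pictured with a small toy model in Python. This is only an in-memory sketch of the idea, not the API of Docker or any other container platform. The class names `DataContainer` and `AppContainer`, and the sample record, are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataContainer:
    """Hypothetical stand-in for a container that only holds data."""
    name: str
    records: dict = field(default_factory=dict)


@dataclass
class AppContainer:
    """Hypothetical stand-in for an application container.

    It holds a reference to the shared data rather than its own copy.
    """
    name: str
    data: DataContainer

    def read(self, key: str):
        """Look up a value in the shared data container."""
        return self.data.records.get(key)


# One data container shared by two application containers (sample values).
shared = DataContainer("sales-data", {"q1_revenue": 125000})
report_app = AppContainer("report", shared)
test_app = AppContainer("test-run", shared)

# Both applications see exactly the same data.
same_view = report_app.read("q1_revenue") == test_app.read("q1_revenue")

# Discarding an application container leaves the shared data untouched.
del test_app
survives = shared.records["q1_revenue"] == 125000
```

The design point is that the application objects carry no data of their own, so any of them can be thrown away and recreated while the data container remains the single source of truth.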
It’s also this facet of their nature that makes containers particularly suited for deploying microservices, the components from which a large-scale application can be built, each one being a separate and distinct application in itself. This approach to software engineering allows applications to be scaled quickly, by updating existing components or adding new ones, while ensuring that the overall integrity of the parent application remains stable.
A notable example of the large-scale adoption of containers in a cloud service is provided by Spotify. It recognized the advantages of this technology in late 2013, when it deployed the open-source container-management platform Docker in order to reduce coding workload and CPU overheads. Google is another large-scale user of containers, reportedly launching around 2 billion every week.
Another advantage of a containerized approach to data is the potential it offers for more comprehensive governance. Laws pertaining to data rights and privacy are in a state of flux and subject to change. Containerized data can be packaged with information regarding who does or does not have the right to access the data, and the purposes for which the data can be used.
Of course, because they share an operating system, data containers are somewhat less secure than virtual machines and so are unlikely to totally replace virtualization any time soon. This means that for certain data, containers may not be suitable. Great care would have to be taken with storing personal medical or financial information, for example.
An inherent flaw in the concept of containers is that data could be leaked through security flaws in the shared operating system. There is also the possibility that a malicious or poorly coded application sharing the same operating system could give rise to a security threat.
Due to these inherent disadvantages, many see virtualization and containerization as complementary, not competing, technologies. Nor are they mutually exclusive, as a virtual machine is as capable as a physical one of running containerized applications. The tools required for containerization are not yet as advanced as those for running virtual machines, but they are certainly gaining ground – meaning that containers are quickly becoming an efficient and reliable option for forward-thinking application architects.
Bernard Marr is a bestselling author, keynote speaker, strategic performance consultant, and analytics, KPI, and big data guru. In addition, he is a member of the Data Informed Board of Advisers. He helps companies to better manage, measure, report, and analyze performance. His leading-edge work with major companies, organizations, and governments across the globe makes him an acclaimed and award-winning keynote speaker, researcher, consultant, and teacher.