Cloud solutions sound exceedingly attractive to management: It becomes unnecessary to manage all that infrastructure, and management can quickly scale up (and down) volumes if needed.
Yet in using cloud architecture, an organization should be concerned about latency caused by the physical location of the data and, even more importantly, about the security of the data being stored in the cloud.
Cloud solutions may be slower than local solutions because of the delay caused by data traveling to and from the physical location of the cloud service provider, and because of the extra time needed to traverse the additional security layers protecting data located in the cloud. How secure is data located in public cloud solutions? What are the legal ramifications, regarding privacy, of storing data in the geopolitical domains of the cloud vendors?
In general, the solutions for data integration regarding data in a public or private cloud are the same as for local data, since access to data is usually through virtual addressing. As long as the data consumer has sufficient access to the data located on the cloud, the data integration solutions are the same as for local data, with some additional concerns about latency and security.
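Because access is through virtual addressing, the integration code itself need not change. As a minimal sketch (the hostnames here are hypothetical), the same data-access code can serve both targets; only the address in the connection URL differs:

```python
# Minimal sketch: the data-access code is identical for local and cloud
# targets; only the (hypothetical) hostname in the connection URL changes.

def build_connection_url(host: str, database: str, port: int = 5432) -> str:
    """Build a DB-API-style connection URL for any reachable host."""
    return f"postgresql://{host}:{port}/{database}"

# Local data center target (hypothetical host name)
local_url = build_connection_url("db01.corp.internal", "sales")

# Public cloud target (hypothetical host name) -- same code path
cloud_url = build_connection_url("sales-db.example-cloud.net", "sales")

print(local_url)
print(cloud_url)
```

The data consumer is insulated from the physical location of the data; latency and security, not the integration logic, are what differ between the two cases.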
With cloud solutions, organizations can rent computing power and software rather than buy them. This allows an organization to have additional servers, environments, and even applications with standard configurations available in minutes from a service provider operating out of a remote data center. Access to the rented computing environment is usually through an Internet protocol. There are many potential benefits to this model; the primary advantages are the agility it affords the organization to move very quickly and the cost savings of not having to buy and manage resources needed only for infrequent peak demand. In the figure below, the applications located within the organization's data center pass data to and from an application located in a public cloud location.
Primary concerns around cloud solutions have to do with security. In the most basic situation, the cloud provider is serving many organizations within the same network environment. An organization could be concerned that its data might be hacked (accessed without permission) by another organization operating in the same area of the cloud data center. Even in cases where the cloud service provider has created a separate private cloud environment for an organization, operating on a separate network behind a separate firewall, there must be concern about whether the service provider is providing adequate security from intruders.
The physical security provisions of the cloud service provider may be a concern, although since the provider is supporting security for all their customers and security is a differentiator, they are probably providing physical security which exceeds the internal capabilities of most individual organizations.
Additionally, the data security laws of the country where the cloud provider is operating its physical data center will be of concern to some organizations. For example, a Canadian company may not want to use a cloud service provider operating in the United States because its data could be subpoenaed by an American court.
Certain types of organizations will not be able to utilize public cloud solutions for their most private and sensitive information, such as customer data at financial institutions or classified data in government organizations. Most organizations, however, may find that the capabilities offered by cloud service providers are both less expensive and more secure than those they could support internally, and that they have many uses. Even the most security-conscious organization may find it useful to be able to create development environments in the cloud quickly, thus speeding up development of custom applications and familiarity with new vendor packages, while its internal organizations are provisioning environments within their own firewalls and data centers.
What many chief security officers are discovering, to their horror, is that cloud services are so easy and inexpensive to acquire that parts of their organizations may already have data out in public cloud environments without having addressed the issues of adequate security. Cloud services are so easy to obtain that the inventory of organizational data assets may suddenly be uncertain. Like data on laptops and mobile devices, data in the cloud is outside the organization's physical control and adds greater complexity to the problems of managing data security.
There are three basic reasons that data integration with data housed in a cloud environment might be slower than with data located in a local data center: the network infrastructure might be slower, extra time is needed to pass through the cloud security layers, and extra time is needed for the data to travel to and from the physical location of the cloud data center.
The network infrastructure of an internal data center might or might not be constructed with faster connections than a cloud data center. Although an internal data center would probably be using expensive and fast components for its network, especially for production systems (e.g., fiber-optic networks), it is likely that a cloud data center would also be investing in fast network infrastructure even though it would be using commodity (cheap) hardware. Delays may lie not within the cloud data center but rather within the path data must take to get to and from the cloud data center.
Moving data to or from a cloud data center, or accessing data in a cloud data center, will involve passing through the extra security layers (firewalls) around the cloud data center, with the extra time that involves, even though that time may be minimal.
What cloud service purveyors minimize in their advertising is that cloud data centers exist in an actual physical location. Data passing to and from these data centers is limited by real-world constraints, such as how long it takes digital information to travel to and from the physical site. Interaction with a distant cloud data center incurs latency just as interaction between sites in different regions of the world does, and the network infrastructure to and from the cloud data center may exacerbate any delay.
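The effect of physical distance can be estimated with a back-of-the-envelope calculation. The sketch below assumes a signal speed in optical fiber of roughly 200,000 km/s (about two-thirds of the speed of light in a vacuum), which sets a hard lower bound on round-trip time regardless of how fast the equipment at either end is:

```python
# Back-of-the-envelope propagation delay: distance alone sets a hard
# lower bound on round-trip time to a remote cloud data center.

FIBER_SPEED_KM_PER_S = 200_000  # approximate signal speed in optical fiber

def min_round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip time in milliseconds over the given distance."""
    one_way_s = distance_km / FIBER_SPEED_KM_PER_S
    return 2 * one_way_s * 1000

# A cloud data center 4,000 km away costs at least ~40 ms per round trip,
# before any firewall, routing, or processing delay is added.
print(f"{min_round_trip_ms(4000):.0f} ms")
```

Real-world round trips will be longer still, since routing is rarely a straight line and each security layer adds its own processing time.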
Although data integration solutions don’t necessarily need to be different in including data from a public cloud as they would for local data integration, if very low latency is a requirement, it may be necessary to architect a data integration solution similar to the integration of geographically separated hubs of data located on different continents. Solutions such as database replication can be used to make up for latency of geographically distributed data, but the extra disk required may negate much of the savings benefits of the cloud solution.
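One hedged sketch of such a replication approach is shown below, with in-memory dictionaries standing in for the remote (cloud) store and the local replica; a real solution would use database replication tooling, but the trade-off is the same: reads become local at the cost of extra storage:

```python
# Sketch of replication to mask latency: periodically copy new or changed
# rows from a remote (cloud) store into a local replica so that reads are
# served locally. In-memory dicts stand in for real databases here.

def replicate(remote: dict, local: dict) -> int:
    """Copy rows that are new or changed on the remote into the local
    replica; return the number of rows copied (the extra storage cost)."""
    copied = 0
    for key, row in remote.items():
        if local.get(key) != row:
            local[key] = row
            copied += 1
    return copied

remote_db = {1: ("alice", 100), 2: ("bob", 250)}
local_replica = {}

print(replicate(remote_db, local_replica))  # 2: initial full copy
print(local_replica == remote_db)           # True: local reads avoid the WAN
```

Every row copied is a row stored twice, which is why the extra disk required can eat into the cost savings of the cloud solution.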
The servers and disk being used in most cloud configurations are commodity devices: inexpensive, easy to acquire, install, and configure. Therefore, the management of these commodity servers includes an assumption that there will be more frequent errors than in traditional in-house server configurations. That is, the mean time to failure is lower on commodity hardware. In order to create a fault-tolerant environment using commodity hardware, most cloud-oriented architectures use some form of data redundancy to enable smooth continuity of processing.
Cloud operating systems and data management systems, such as Hadoop, keep an odd number of copies of data. Additionally, data is usually distributed across multiple servers or nodes. When a server fails, processing falls back to one of the data copies. Having an odd number of copies allows the nodes to compare versions of the data and verify, by majority, that none of the copies has been corrupted or lost. The more critical the data, the greater the number of copies specified in the configuration and, of course, the greater the rental cost.
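The comparison among an odd number of copies can be sketched as a simple majority vote over checksums. This is an illustrative simplification, not the actual mechanism of any particular product such as Hadoop:

```python
# Sketch of odd-copy verification: with an odd number of replicas, a
# simple majority vote over checksums identifies a corrupted copy.
from collections import Counter
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 fingerprint of a replica's content."""
    return hashlib.sha256(data).hexdigest()

def majority_copy(replicas: list[bytes]) -> bytes:
    """Return the replica content agreed on by the majority of copies."""
    votes = Counter(checksum(r) for r in replicas)
    winning, _ = votes.most_common(1)[0]
    for r in replicas:
        if checksum(r) == winning:
            return r
    raise RuntimeError("no majority copy found")

# Three copies of a block; one has been silently corrupted.
copies = [b"customer-42", b"customer-42", b"customer-99"]
print(majority_copy(copies))  # b'customer-42'
```

With an even number of copies, a two-versus-two split would leave no majority to break the tie, which is why an odd count makes the comparison decisive.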
The disk usually used in internal production environments is a “smart” disk with redundancy and fault tolerance built in, costing as much as 10 times that of commodity disk. Having three or five copies of data on commodity disk in a cloud environment should still be less expensive than internal disk, especially when including support costs.
More than with data kept internally, data kept in the cloud should be inventoried and audited to verify that no data has been lost or misplaced. With thousands or even millions of commodity servers being constantly provisioned and deactivated, cloud services users should ensure that they have access to, and are processing, all the data they think they are. Also, when deactivating servers in the cloud, care should be taken to ensure that all data is completely deleted before the servers are surrendered.
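Such an audit can be as simple as comparing an expected manifest of data sets against what is actually found in the cloud environment. The sketch below is hypothetical (the data set names and checksum values are invented for illustration):

```python
# Sketch of a data-inventory audit: compare an expected manifest of data
# sets (name -> checksum) against what is actually present in the cloud
# environment, flagging anything missing, changed, or unexpected.

def audit(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Return data sets that are missing, changed, or unexpected."""
    return {
        "missing": sorted(set(expected) - set(actual)),
        "changed": sorted(k for k in expected if k in actual and expected[k] != actual[k]),
        "unexpected": sorted(set(actual) - set(expected)),
    }

manifest = {"orders_2013.csv": "ab12", "customers.csv": "cd34"}
found = {"orders_2013.csv": "ab12", "scratch_copy.csv": "ef56"}

print(audit(manifest, found))
# {'missing': ['customers.csv'], 'changed': [], 'unexpected': ['scratch_copy.csv']}
```

The "unexpected" category is just as important as the "missing" one: a stray copy left on a deactivated cloud server is exactly the kind of data that should be found and deleted before the server is surrendered.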
April Reeve has spent the last 25 years working as an enterprise architect and program manager for large multinational organizations, developing data strategies and managing development and operation of solutions. April is an expert in multiple data management disciplines including data conversion, data warehousing, business intelligence, master data management, data integration, and data governance. Currently, she is working for EMC2 Consulting as an advisory consultant in the Enterprise Information Management practice.