The “cloud” once meant servers and hard drives at someone else’s data center. But more and more cloud-based services are being designed to match the applications enterprises have long done in house—and data warehousing is among the latest to join that list.
Companies have been able to run data analytics applications in the cloud for some time, either by leasing dedicated space or using a utility-style rental model on shared equipment. Now, Internet companies and startups are building data warehousing offerings designed from the ground up to run on the cloud.
These services will challenge in-house data warehousing providers because cloud computing, in general, offers some real benefits, notably lower upfront costs and a faster time to deploy applications. For regulatory and security reasons, running a data warehouse outside a company’s walls doesn’t make sense for everyone, but it can be compelling for certain uses and for cloud-native companies.
For a sense of the potential disruption of data warehousing as a service, consider Redshift, a service introduced by Amazon Web Services late last year and now in private beta. Customers are already doing analytics on existing Amazon Web Services offerings, such as Elastic MapReduce. Redshift, by contrast, runs on ParAccel’s columnar database, so it’s specifically designed for analysis of structured data. People can access it through MicroStrategy or Jaspersoft front-end business intelligence tools.
From a business point of view, the goal of Redshift is to expand Amazon’s data-oriented services and to massively undercut incumbent providers on price. Customers can pay 85 cents an hour for the service or pay $1,000 per terabyte per year with a three-year contract. With a massively parallel hardware architecture and essentially unlimited compute power, the service can speed up complex queries and provide a data warehouse for one-tenth the cost of most solutions available today, Amazon says.
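To put the two quoted prices side by side, here is a minimal back-of-the-envelope sketch. Only the 85-cent hourly rate and the $1,000-per-terabyte yearly rate come from Amazon's announcement as reported above. The function names are hypothetical, and the sketch assumes the hourly price is a flat charge that does not vary with data volume, which the announcement as described here does not specify.

```python
# Prices quoted in the article; everything else is an illustrative assumption.
ON_DEMAND_CENTS_PER_HOUR = 85          # "85 cents an hour"
RESERVED_DOLLARS_PER_TB_YEAR = 1_000   # "$1,000 per terabyte per year", three-year contract

HOURS_PER_YEAR = 24 * 365  # ignores leap years


def on_demand_annual_dollars(hours_used: int = HOURS_PER_YEAR) -> float:
    """Yearly cost at the hourly rate for a given number of hours of use."""
    return ON_DEMAND_CENTS_PER_HOUR * hours_used / 100


def reserved_annual_dollars(terabytes: int) -> int:
    """Yearly cost at the contract rate for a given number of terabytes."""
    return RESERVED_DOLLARS_PER_TB_YEAR * terabytes


def break_even_hours(terabytes: int) -> float:
    """Hours of hourly-rate use per year that cost as much as the contract rate."""
    return reserved_annual_dollars(terabytes) * 100 / ON_DEMAND_CENTS_PER_HOUR


if __name__ == "__main__":
    # Running around the clock at the hourly rate for one year.
    print(on_demand_annual_dollars())      # 7446.0
    # Hours per year at which the hourly rate matches a 1 TB contract.
    print(round(break_even_hours(1), 2))
```

Under these assumptions, the hourly rate suits occasional or experimental use, while the contract rate favors a warehouse that runs continuously.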
“The challenge is that datasets are increasingly large, they’re increasingly complex and data warehouses traditionally are as large and complex and costly as the data people are trying to store in them,” says Matt Wood, principal data scientist at Amazon Web Services. “With Redshift, we’ve tried to come up with a way to get that analytics, business intelligence approach without having to shell out huge amounts of money.”
A number of other companies are also developing dedicated cloud data warehousing services. Google’s BigQuery is designed for analyzing very large datasets on its cloud, and the company, as part of Google Analytics, is also testing a service that tracks how users interact with websites.
A few startups in this area are emerging as well. Treasure Data has developed a data warehousing service on top of Hadoop. Since it’s hosted, customer companies don’t necessarily need to learn Hadoop to work with it. And the pricing, which is free for people with modest storage needs, is meant to be aggressive.
Often, new cloud computing companies target business users who want a specific application, without having to assemble a technical team. “So many people are frustrated by the slowness of analytics tools and we try to aim directly at that,” says Michael Driscoll, CEO and co-founder of Metamarkets, which makes an online visualization and analytics application.
Cloud Begets More Cloud
More sophisticated cloud-based data warehousing services are emerging because powerful software tools, such as Hadoop, and technical infrastructure are available at a relatively low cost, says Wood. This trend will prompt more companies to try cloud-based analytics and to get more value from their data.
“It’s not necessarily that we see customers working with extraordinarily large datasets when this becomes valuable. It’s really the point at which the existing tools stop being scalable—it could be when they move from megabyte to gigabyte scale or terabyte to petabyte scale,” Wood says.
Perhaps the most compelling aspect of a cloud-based data warehouse is speed. A hosted service can make sense for a group within a company that wants to quickly experiment with an analytics application without a drawn-out evaluation and development process. The pay-as-you-go model can also be cost effective if your data warehousing needs spike dramatically at certain times, perhaps only a few times a year, and you can get access to high-end computing resources only when you need them.
Another set of natural customers for cloud-based data warehousing services is companies that already have a lot of data in the cloud, since moving data that is already there is significantly easier. Right now, many customers of early data-warehousing-as-a-service providers are startups that don’t have a legacy investment in an existing analytics system.
Regulatory Compliance and Technical Challenges
For large companies, though, it’s a bit more complicated. In many cases, regulations will prevent companies from moving personal or other sensitive data to an off-premises location. Often, large companies consider their corporate data strategic and would rather not outsource that application, says Tony Cosentino, a business analytics analyst at Ventana Research.
“Do you consider your information assets and technology and analytics a core competency? If so, don’t you want to control your own data centers?” he says. In that case, a private cloud where the application is still controlled by the IT department makes sense.
Technically, there are challenges as well. Transferring data to a cloud-based data warehouse is not a trivial task, and it can have a significant impact on the performance of the system. Cloud providers are trying to address this. Amazon, for one, has a service called Data Pipeline for scheduling data-oriented workflows in its cloud. Still, potential users need to weigh the pain of creating connections to outside services.
Ultimately, the decision on whether to go with cloud-based data warehousing is similar to the decision for other applications: Is this a core function that a company wants to run itself? Or can parts of these business services, such as managing the hardware and software infrastructure, be handled by cloud providers off site?
Regardless of who adopts these latest services and how quickly, incumbent data warehousing companies such as EMC, IBM, Oracle, SAP and Teradata will likely respond over time, either with lower-cost offerings or with a broader set of subscription-based cloud offerings.
“The battle will probably be a little bit more in the mid-market and more for Web properties that may have grown up in the cloud,” Cosentino says. “The net-net is that (incumbents) have to react. There’s a lot of innovation and a lot of disruptive change going on.”
Martin LaMonica is a technology journalist in the Boston area. Follow him on Twitter @mlamonica.
Home page photo of clouds by William Warby via Flickr.