The announcement last month that Microsoft would support and promote an ISV’s version of the Hadoop Distributed File System (HDFS) was more than just an effort to get the leading big-data management framework onto Windows. It was an acknowledgement that the bulk of corporate data management will involve big data within a very few years and that Microsoft would be left behind if it didn’t adapt quickly and effectively, according to analysts and Microsoft executives.
Microsoft last month announced a new set of products called HDInsight, based on a version of HDFS from Hortonworks, which develops and packages its own version of Apache’s Hadoop called the Hortonworks Data Platform (HDP). Microsoft will field two versions of the platform and related tools – one for on-premise Windows servers called HDInsight Server for Windows and another to run on Microsoft’s cloud service as “Windows Azure HDInsight Service,” according to Doug Leland, general manager of SQL Server Marketing at Microsoft. Both are due for release during the first quarter of 2013, he said.
The initial goal is to automate and simplify the installation and configuration of Hadoop so it behaves like a single coherent product set, rather than retaining the very modular, do-it-yourself character of pure open-source Hadoop, and to give users access to both versions using tools they already own, according to Dave Campbell, technical fellow at Microsoft. “Our objective is to wring a bunch of friction out of installation, management, and development” of big-data projects on Windows, says Campbell.
Convenience is good, but easy access to big data by non-specialists could be even more important, according to Merv Adrian, research vice president for information management at Gartner. “The enormous community of Excel users now potentially has a seamless path to access data that until now was as distant as the divide between Windows and Linux,” Adrian said.
The key engineering accomplishment of the whole HDInsight project was to connect relational and non-relational data structures so a single query – written using familiar SQL languages and development tools – can run across both traditional and “big” data sets, according to Leland. “In this case integration means the ability to move information back and forth easily between those platforms [SQL Server and Hadoop]. The common approach has been to use connectors; we are moving toward unifying those two worlds – the relational and non-relational – in a more tightly coupled fashion.”
In its first iteration, both the Windows Server and Azure versions of HDInsight will allow the same query to run against data in both SQL Server and Hadoop from any business-intelligence tool, or even a non-Microsoft database, using an ODBC connector jointly engineered by Microsoft and Hortonworks. The connector has been submitted to Apache for possible inclusion in the open-source version as well, Leland said.
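The article includes no sample code, but the idea of one query spanning relational and non-relational stores can be sketched as a toy in Python. Here an in-memory SQLite table stands in for SQL Server and raw newline-delimited text stands in for files in HDFS; the table, column names, and `query_both` helper are invented for illustration and are not Microsoft’s actual API:

```python
import sqlite3

# Relational side: an in-memory SQLite table stands in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("acme", 120.0), ("globex", 75.5)])

# Non-relational side: raw newline-delimited records stand in for HDFS files.
hdfs_like = "acme\t30.0\ninitech\t42.0"

def query_both(min_amount):
    """Run the 'same query' -- customers with amount >= min_amount --
    against both the relational table and the raw file data."""
    relational = conn.execute(
        "SELECT customer, amount FROM orders WHERE amount >= ?",
        (min_amount,)).fetchall()
    non_relational = []
    for line in hdfs_like.splitlines():
        customer, amount = line.split("\t")
        if float(amount) >= min_amount:
            non_relational.append((customer, float(amount)))
    return relational + non_relational

print(query_both(40.0))
```

In the real product the splice happens below the query surface – the ODBC driver presents Hadoop data through the same interface as a relational source – but the toy shows the user-visible promise: one predicate, two very different stores, one combined result.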
Later versions will use a new SQL Server data-processing engine called PolyBase that Microsoft announced Nov. 7. PolyBase is designed to allow users to query both structured and unstructured data simultaneously in SQL Server and Hadoop, Leland said. “A lot of [end-user] companies are getting mired in the complexity of installation and management of the Hadoop cluster. Our first goal was to take out much of that complexity,” Leland said. “The next step is to take out the complexity of the analytics and access to the data underneath.”
PolyBase, whose release date Microsoft has not yet announced, will accept SQL queries and hand them to Hadoop’s Hive data-warehouse module, which runs them as MapReduce jobs and returns the results through PolyBase to SQL Server.
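To make the Hive-to-MapReduce hand-off concrete: a SQL aggregate such as SUM with GROUP BY decomposes into a map phase (emit a key/value pair per record), a shuffle (group values by key), and a reduce phase (aggregate each group). The following self-contained Python sketch mimics that decomposition for the equivalent of `SELECT dept, SUM(sales) GROUP BY dept`; the data and names are invented for illustration, not taken from Hive or PolyBase:

```python
from collections import defaultdict

# Rows as they might arrive from files in HDFS: (dept, sales) records.
rows = [("ops", 100), ("dev", 250), ("ops", 50), ("dev", 25)]

def map_phase(rows):
    # Map: emit one (key, value) pair per input record.
    for dept, sales in rows:
        yield dept, sales

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values -- here, SUM(sales) per dept.
    return {dept: sum(values) for dept, values in groups.items()}

result = reduce_phase(shuffle(map_phase(rows)))
print(result)  # the MapReduce equivalent of SELECT dept, SUM(sales) GROUP BY dept
```

Hive performs this translation automatically at cluster scale; PolyBase’s pitch is that the SQL Server user never sees even the Hive layer.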
SQL Server to front-end big data
Even with all the changes, SQL Server will remain at the center of Microsoft’s data-management offerings, including those involving big data. SQL Server 2012 got a boost in power as well as new links to Hadoop with the release of its first service pack this week. Improvements to its in-memory transactional capability could boost performance by as much as 50 times, according to a presentation at a Microsoft conference in Seattle by Ted Kummert, corporate VP of Microsoft’s Business Platform division.
Kummert also announced an upcoming version of Microsoft’s Parallel Data Warehouse (PDW) for SQL Server that “dramatically” lowers cost-per-terabyte for storage and adds the ability to process both structured and unstructured data via PolyBase. “This thing’s built for big data,” Kummert said during the presentation.
Though it’s natural to be suspicious of Microsoft’s methods and intentions when it leaps on a potentially disruptive technology, the effort it has put into this one should ameliorate some of those concerns, Adrian said. “The collaborative work with Hortonworks and introduction of PolyBase signal Microsoft is ‘all-in’ on the emerging opportunity for Hadoop-based big data and analytics,” Adrian said. “Its related efforts to tie SQL Azure to this story represent a commitment not to be left behind this transformative wave, either on-premise, in the cloud, or best of all, when they need to be integrated.”
None of the new products or connectors has really been tested yet, but relational and non-relational data in one convenient package, accessible via Excel, could prove a potent combination. “We’ll be watching closely to see as the first customers begin to kick the tires on this one,” Adrian said.
Kevin Fogarty, a freelance writer based in the Boston area, is a veteran technology journalist. Reach him at KFogarty@technologyreporting.com.