Like every other traditional database developer, Microsoft is playing catch-up with big data.
Driven by demand from customers wanting to sift unstructured textual data from social networks, machine-to-machine (M2M) data from server logs and other non-standard data sources, Microsoft has spent more than two years trying to add big-data storage and analytic capabilities to the company’s existing database products.
As its recently announced partnership with Hadoop integrator Hortonworks shows, the main thrust of Microsoft’s effort bolts the open-source Hadoop Distributed File System onto the company’s SQL Server relational database.
Microsoft’s version, which it said is compatible with the open-source version of Hadoop, comes in two editions: Microsoft HDInsight for Windows Server, which runs in a corporate data center, and Windows Azure HDInsight Service, which is available on demand from Microsoft’s Azure cloud-computing service.
With SQL Server acting as a central data control system, Microsoft executives said the company is creating connections to Hadoop so that users of products like Excel and SharePoint can access much larger datasets and take advantage of sophisticated analytics that desktop applications are not powerful enough to run themselves.
“Making the whole thing revolve around SQL Server doesn’t make sense in some ways,” said Mark Madsen, president of consultancy Third Nature and a veteran data-management and data-warehousing expert. “You can manage structured data perfectly well using Hadoop, and there are plenty of [business-intelligence] tools out there already that can access [Hadoop]. So why put a database between me and the place data is already stored?”
Aside from the effort to keep its main data-management product at the core of new big-data projects, it does make some sense for Microsoft to try to give end users access to big data through tools they are already using, via servers their IT staffs are already accustomed to supporting. “There are 110,000 or 115,000 certified Microsoft professionals out there – that’s a lot of people with only limited access to big-data stuff,” Madsen said. “That’s why it makes sense to Microsoft to pay Hortonworks to port its version of Hadoop to Microsoft.”
And while Microsoft’s approach may be a bit of a kludge, Madsen added: “It will take a couple of years before we get to the point that when you get more data you don’t just get more batch processes.”
Microsoft’s plan means that most non-technical users will access Hadoop data from Excel, connecting through SharePoint Server and SQL Server. More quantitatively oriented BI users can access data directly using the PowerPivot add-in for Excel; SharePoint users will do it through Excel’s Power View.
Here is the rundown, according to Microsoft executives, of which products are part of its big-data plan and how the pieces will fit together:
SQL Server at the core. SQL Server will continue to act as the primary control system for data management. The result is a single data-management interface that works with existing data-access tools, systems management tools and homegrown applications.
Users will access unstructured data via Hadoop systems—whether it is HDInsight or another version—using software connectors the company is building to SQL Server, with plans to improve the performance of those connections over time.
A service pack released last October improves the way SQL Server handles memory and transactional data to such a degree that it should drastically improve performance when linked with HDInsight, according to Ted Kummert, corporate vice president of Microsoft’s Business Platform division.
Parallel processing to speed up queries. SQL Server Parallel Data Warehouse (PDW) can offload some of the big-data-query processing from SQL Server to keep performance from degrading.
SQL Server can connect directly to HDInsight to run queries against unstructured data, but its existing functions would bog down if all queries ran only through it. PDW breaks queries into parallel streams to run against Hadoop implementations, and allows an almost unlimited number of servers to join a processing cluster.
PDW can also import schemas and metadata from a Hadoop file into SQL Server’s own metadata store, so Hadoop data can be treated like a local table within the database rather than as a separate entity accessible only via a Hadoop bridge.
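As a rough illustration of that local-table treatment, here is what registering HDFS data as an external table could look like in T-SQL; this is a hedged sketch modeled on the external-table syntax Polybase exposes, and every name in it (the table, the HDFS path, the data source, the file format) is an invented placeholder rather than a detail from the article:

```sql
-- Hypothetical sketch: exposing a directory of HDFS log files to PDW as an
-- external table, so the Hadoop data can be queried like a local table.
-- All names (WebClicks, MyHadoopCluster, TextFileFormat, the path) are invented.
CREATE EXTERNAL TABLE dbo.WebClicks (
    UserId    INT,
    Url       NVARCHAR(400),
    EventTime DATETIME2
)
WITH (
    LOCATION = '/logs/clicks/',        -- directory in the Hadoop file system
    DATA_SOURCE = MyHadoopCluster,     -- previously defined external data source
    FILE_FORMAT = TextFileFormat       -- previously defined delimited-text format
);
```

Once registered this way, the Hadoop-resident rows can appear in ordinary SELECT statements alongside relational tables, which is the point of importing the metadata into SQL Server’s store.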
PDW makes it faster and easier to move data or queries back and forth between SharePoint Server and HDInsight, according to Seayoung Rhee, a product marketing manager for SharePoint.
PDW also drastically lowers the cost-per-terabyte to store large amounts of big data by filtering raw Hadoop data into defined categories to make it accessible quickly using Excel or other desktop BI tools, Kummert said.
Polybase to provide SQL queries for Hadoop. Initially, Microsoft will provide ODBC database connectors to enable access between SQL Server PDW and HDInsight. Later, an update to PDW’s core processing engine, called Polybase, will give users the ability to use SQL to query Hadoop; run queries against SQL Server and Hadoop simultaneously; move data in parallel between Hadoop and PDW; and create tables or standing queries in PDW using HDFS data. A later version will include an optimization engine that will allow PDW to use either SQL or MapReduce for queries against the Hadoop cluster, depending on which would be more efficient.
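The two query patterns that description implies can be sketched as follows; this is a hedged illustration, not Microsoft’s shipped syntax, and it assumes a hypothetical external table (dbo.WebClicks) has already been registered over HDFS data, with all table and column names invented:

```sql
-- 1) One SQL statement joining a local PDW table with Hadoop-backed data.
SELECT c.CustomerName, COUNT(*) AS PageViews
FROM dbo.Customers AS c            -- ordinary relational table in PDW
JOIN dbo.WebClicks AS w            -- hypothetical external table backed by HDFS
  ON c.CustomerId = w.UserId
GROUP BY c.CustomerName;

-- 2) Creating a PDW table from HDFS data, moved in parallel (CTAS pattern).
CREATE TABLE dbo.WebClicksLocal
WITH (DISTRIBUTION = HASH(UserId)) -- spread rows across PDW compute nodes
AS SELECT UserId, Url, EventTime
   FROM dbo.WebClicks;
```

The first statement shows the “query both simultaneously” capability; the second shows “create tables in PDW using HDFS data,” with the hash distribution letting each compute node pull its share of the Hadoop data in parallel.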
Polybase is scheduled to ship during the first half of 2013.
SharePoint Server as big data gateway for end users. Though SQL Server remains the hub for big-data access, SharePoint Server becomes the gateway for BI and other end users – using either custom-developed apps or analytic add-ins to Excel, according to Rhee.
While SQL Server takes responsibility for big-data access and analysis, SharePoint Server will provide authentication and access rules based on employee roles, location and other policies. SharePoint will also be the gateway for pushing data models created by BI end users into SQL Server for processing.
SharePoint provides central management for BI data and access, governance and the ability to distribute reports or data according to security policies, Rhee said. Among the biggest big-data features on the way for SharePoint is an upgrade to the FAST search engine, a tool that was introduced with the launch of SharePoint Server 2010. This upgrade will allow users to run searches against Hadoop or SQL Server datasets as well as against documents within SharePoint, Rhee said.
PowerPivot in Excel 2013 for data access and queries. Upgraded to allow users to “import millions of rows from multiple data sources,” PowerPivot becomes the primary big-data access and query mechanism for end users of Excel. Though answers can be downloaded into Excel for manipulation, analytic processing itself takes place on the server using the xVelocity in-memory analytics engine in SQL Server 2012.
When it is released in 2013, PowerPivot will have more data-analysis and data-modeling functions, including the ability to filter data during import to focus query results further; the ability to define calculated fields for additional analysis on the desktop itself; a new formula-expression syntax called Data Analysis Expressions (DAX) to make writing those analyses simpler; and the ability to search metadata as well as data to help categorize and present data.
PowerPivot in Excel will become the default BI client for Microsoft users, while SharePoint and Power View become the default apps through which less data-centric users reach it, according to Dave Campbell, a technical fellow at Microsoft.
Power View in Excel 2013 for business users’ ad hoc reports. As an addition to Excel for SharePoint Server 2010, Power View was designed primarily as a way for non-quantitative users to create ad hoc reports and visualizations using data from SharePoint or SQL Server. The current version can import data from SQL Server 2012 and then analyze the data in those reports using graphing and other visualization tools.
The Excel 2013 edition of Power View adds the ability to connect to multiple data sources from one workbook; use data models created in other data sources; create and analyze subsets of imported data as key performance indicators; create data hierarchies to weight analyses or organize data; and drill up or down through levels of a hierarchy to explore the data.
Power View is also able to access the Analysis Services in SQL Server, to keep the load of new or unusually heavy analyses on the server rather than the client. It also allows, for the first time, visualizations of data from OLAP databases as well as Hadoop and relational data.
“If you only want data, HDInsight and PowerPivot are good starts,” Rhee said. “When you want central management, governance, scalability, collaboration and the ability to share data, that’s where SharePoint Server and Power View come in.”
Tools to develop Office applications that access Hadoop. Many companies create custom applications that let Microsoft Office users access data via Excel, for example, and Microsoft is adding bridges for them to its big-data plans. In addition to PowerPivot and Power View, Office users can get access to big data via custom apps coded with an enhanced version of the Office Developer Tools for Visual Studio 2012, code-named ‘Napa.’
Apps built in ‘Napa,’ now available in a preview, can run either as standalones or from within Office, going through SharePoint and SQL Server to access data there and in Hadoop. The tools run within a browser, so it’s not even necessary to install the tool in order to use it.
A second application development tool, LightSwitch HTML Client for Visual Studio 2012, will also get the ability to access Hadoop data via SharePoint server and SQL Server, but is designed mainly for touch-oriented apps for mobile devices.
Napa was designed as a way for developers to build custom applications on top of Office 365, before being expanded to include the ability to address Hadoop on either internal servers or on Microsoft’s Windows Azure cloud service.
“Those new tools and development model allow developers to create custom apps that can hook directly into the [Office or BI] client or [HDInsight] service, using tools that are still familiar to both developers and end users,” Rhee said.
Kevin Fogarty, a freelance writer based in the Boston area, is a veteran technology journalist. Reach him at KFogarty@technologyreporting.com.