The prominence of data science was inextricably linked to the emergence of big data. Combining business savvy, analytics, and data curation, the discipline was hailed as the enterprise's answer to the speed and variety of big data that threatened to overwhelm it.
Numerous developments within the past several months, however, have created a different reality for big data and its future. Its technologies were refined. The self-service movement within the data sphere thrived. The result? Big data came to occupy the central place in the data landscape as critical elements of data science – preparation, analytics, and integration – became automated.
Thanks to the self-service movement’s proliferation, even the smallest of organizations can now access big data’s advantages. “There’s been a lot of discussion about self-service…and having data analysts get at the data directly,” MapR Chief Marketing Officer Jack Norris said. “But you also have to recognize, what do you mean by ‘the data,’ and what has to happen to ‘the data’ before that self-service takes place?”
The Impact of the Cloud
In many instances, what happens to the data prior to self-service is done by others. Facilitating analytics is one of the chief components of data science, particularly when incorporating big data sources. End users can access numerous analytics options via the cloud that yield insight into all sorts of data – many of them in nearly real time. Running the gamut from conventional historical business intelligence to cutting-edge prescriptive cognitive computing analytics, these services simply require organizations to grant providers access to their data. Cloud analytics reduces physical infrastructure and costs, and effectively outsources potentially difficult and resource-intensive computations. Machine-learning algorithms can provide insight into advisable action based on analytics results, along with explanations, and can automate the data modeling process, which can prove extremely difficult with time-sensitive big data.
According to Norris, the true benefit of big data analytics is in combining data sources. “If you were looking at just the social media activity of potential prospects out there, you can find some trends. But if you pair that also with your customer information and your customer purchases, you have got a richer view. And then if you add weather data or location information, you can look at different trends there.”
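The kind of pairing Norris describes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample records, the field names (`customer_id`, `mentions`, `amount`, `city`), and the `combine` helper are hypothetical and do not come from MapR or any particular product.

```python
from collections import defaultdict

# Hypothetical sample data; all field names are illustrative assumptions.
social_mentions = [
    {"customer_id": 1, "mentions": 4},
    {"customer_id": 2, "mentions": 1},
]
purchases = [
    {"customer_id": 1, "amount": 120.0, "city": "Austin"},
    {"customer_id": 1, "amount": 30.0, "city": "Austin"},
    {"customer_id": 2, "amount": 15.0, "city": "Boston"},
]
weather_by_city = {"Austin": "hot", "Boston": "rainy"}


def combine(social_mentions, purchases, weather_by_city):
    """Join the three sources into one richer record per customer."""
    view = defaultdict(
        lambda: {"mentions": 0, "total_spent": 0.0, "city": None, "weather": None}
    )
    # Social activity alone shows some trends.
    for row in social_mentions:
        view[row["customer_id"]]["mentions"] += row["mentions"]
    # Pairing it with purchases, then weather, enriches each customer record.
    for row in purchases:
        entry = view[row["customer_id"]]
        entry["total_spent"] += row["amount"]
        entry["city"] = row["city"]
        entry["weather"] = weather_by_city.get(row["city"])
    return dict(view)


combined = combine(social_mentions, purchases, weather_by_city)
```

Each entry in `combined` now carries social activity, spending, and local weather side by side, which is the "richer view" in miniature. A production system would do the same join at far larger scale and with more careful key matching.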
The data lake concept has been extremely influential in automating pivotal aspects of data science pertaining to integration, data preparation, and analytics. Several developments in this area of big data management can expedite a preparation process that otherwise threatens to monopolize data scientists' time. “Data scientists are a rare and precious commodity in an organization,” said Cambridge Semantics Vice President of Marketing John Rueter. “You have got these brilliant Ph.D.s who are taking on data science responsibilities. Since information is stored in a data lake in its raw form, they are liable to spend 70 percent of their time just doing the data preparation and data management before you can do any kind of analysis on it.”
Data preparation platforms can provide comprehensive views of disparate data sources and their relevance to specific jobs while implementing measures for cleansing and quality. Semantic technologies and machine learning can help identify points of data integration and individual node characteristics, and can even facilitate the transformations required for specific applications. They also can facilitate much-needed consistency in terms of metadata and schema definitions (on structured and semi-structured data), which Norris said are required “to accomplish self-service so that data analysts can get at the data.”
Another means of providing structure and consistency to unstructured big data without data scientists is to leverage JSON-based document stores in data lakes, such as those built on Hadoop. According to Norris, JSON “has the schema built into it. It’s basically the data interchange standard of web applications now. It’s increasingly the data format produced in the Internet of Things as the result of sensor data.”
Combined with SQL solutions (there are NoSQL ones as well) that interact with JSON to derive schema in real time, “there’s no dependency on IT to massage the data before you can do that self-service,” Norris explained. Data lakes also provide centralized hubs that are useful for running both operations and analytics simultaneously, with much less need for data-scientist involvement because “having it on a single platform … brings operational agility and results in simplifying your data architecture, simplifying your administration, and simplifying a great deal,” Norris said.
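To make the idea of schema being "built into" JSON concrete, here is a minimal Python sketch that derives a schema from a document at read time. The sensor reading and the `infer_schema` helper are hypothetical, and real SQL-on-JSON engines use their own, more sophisticated mechanisms; this only illustrates why no up-front modeling step is needed.

```python
import json


def infer_schema(value):
    """Return a nested description of the types found in a parsed JSON value."""
    if isinstance(value, dict):
        return {key: infer_schema(child) for key, child in value.items()}
    if isinstance(value, list):
        # Simplifying assumption: describe a list by its first element only.
        return [infer_schema(value[0])] if value else []
    return type(value).__name__


# Hypothetical sensor reading; the field names are illustrative only.
raw = (
    '{"sensor": "t-17", "temp_c": 21.5, "tags": ["lab", "roof"], '
    '"location": {"lat": 42.3, "lon": -71.1}}'
)
schema = infer_schema(json.loads(raw))
```

Because field names and value types travel with each document, the structure can be read off the data itself rather than defined ahead of time by IT.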
The self-service access to data and its uses that the aforementioned automated aspects of data science facilitate would be useless without adherence to security and governance standards. The metadata and schema consistency of the foregoing methods – which can be augmented by the cataloging capabilities of data-preparation platforms – are useful for restricting and granting access to data based on regulatory concerns, security, and governance policies. They also can provide traceability. Certain governance solutions also are endowed with standards-based semantic capabilities to reinforce policies and procedures while linking data to vital information such as business glossaries.
These methods are also applicable to the varying cloud analytics options, resulting in what Norris referred to as “The four A’s: You need to authenticate, and be able to understand who’s coming in and tapping into your global directory services. You need to control access. Then you need the ability to audit, so you can understand who did what. Then, lastly, is what we refer to as the architecture. Is that security granular? Is that security at the data level, not by the access method?”
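The four A's can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any vendor's security architecture: the token table, the field-level policy, and the `SecureStore` class are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical directory of tokens and users (stands in for directory services).
TOKENS = {"token-alice": "alice", "token-bob": "bob"}
# Permissions attach to the data fields themselves, not to the access method.
FIELD_POLICY = {"name": {"alice", "bob"}, "salary": {"alice"}}


@dataclass
class SecureStore:
    records: list
    audit_log: list = field(default_factory=list)

    def read(self, token, field_name):
        # 1. Authenticate: work out who is coming in.
        user = TOKENS.get(token)
        if user is None:
            self.audit_log.append(("unknown", field_name, "rejected"))
            raise PermissionError("unknown token")
        # 2. Access control, applied at the data (field) level.
        allowed = user in FIELD_POLICY.get(field_name, set())
        # 3. Audit: record who did what, whether or not it succeeded.
        self.audit_log.append((user, field_name, "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{user} may not read {field_name}")
        return [record[field_name] for record in self.records]
```

The fourth A, architecture, shows up in where the check lives: because the policy is keyed by field rather than by tool or query path, the same rule would apply no matter which access method reaches the store.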
Data science is no longer an arcane discipline reserved for a select few. The self-service movement has succeeded in automating numerous important aspects of data science, which business users can now leverage without understanding the intricacies of algorithm development for analytics or ETL for data preparation and application loading. Significantly, business users can utilize these facets of data science in a way that adheres to governance policies and provides the level of security required for enterprise data. Best of all, automation makes big data initiatives much more affordable and accessible to the enterprise. Automated data science has not made data scientists obsolete, but instead freed them from some of the more time-consuming aspects of their position so they can work on more profound problems.
Jelani Harper has written extensively about numerous facets of data management for the past several years. His many areas of specialization include semantics, big data, and data governance.