Everyone has been talking for years about Big Data and how it can drive better decision-making and advanced analytics, yet few companies have actually mastered their data in a way that 1) makes all data easily accessible and 2) increases their agility in making data-driven decisions.
The concept of a data lake is becoming standard for many organizations as they try to create more value from their data assets. This trend toward a data-driven business approach targets a growing variety of data, and it is more critical than ever to include all data assets in next-generation data architectures. To truly “master” a company’s data in an environment where the business can react quickly, companies must first liberate all enterprise data, especially hard-to-access data from legacy platforms such as the mainframe, which houses the most critical data assets, including customer data. Organizations must then integrate these core data assets with the emerging new data sets from the Internet of Things (IoT) and the cloud. Below are a number of tips companies should consider when embarking on a project to master their data:
– Ensure easy and secure access to all data assets and visibility to data lineage as the data is populated in the data lake.
– Choose a product-based approach to custom coding for repeatability, simplicity, and productivity.
– Make sure the selected tools and products can adapt to a rapidly changing technology stack and can integrate with and keep up with the speed of the Big Data stack, for example as Apache Hadoop and Spark evolve.
– Ensure the tools and products can interoperate and leverage open source frameworks for advanced analytics.
– Create a streamlined, standardized, and secure process to collect, transform, and distribute data.
– Consider compliance and security needs and requirements from day one and not after the fact, especially those in highly regulated industries and those that house sensitive customer information on the mainframe.
– Ensure the selected tools can interoperate with both existing data platforms and next-generation data architectures so organizations can adapt at their own pace without creating new data silos.
– Make data quality a priority. Mastering your data requires the ability to explore data assets, create business rules, and use those rules to validate, match, and cleanse the data. As organizations become more data-centric, it is critical to automate these steps and to validate and cleanse data as it is loaded into the data lake.
– Invest in products that accommodate both on-premises and cloud deployment architectures, to support emerging use cases and stay competitive amid the convergence of Big Data, IoT, and cloud.
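The data-quality tip above can be made concrete with a small sketch. The snippet below is purely illustrative, not any vendor’s product API: it assumes hypothetical rules and helper names (`RULES`, `validate`, `cleanse`, `load`) and shows the pattern of validating and cleansing records as they are loaded into a data lake, quarantining anything that fails a business rule.

```python
import re

# Hypothetical business rules: field name -> predicate that must hold.
RULES = {
    "customer_id": lambda v: bool(re.fullmatch(r"\d{8}", v or "")),
    "email": lambda v: "@" in (v or ""),
}

def validate(record):
    """Return the list of fields that fail their business rule."""
    return [field for field, rule in RULES.items() if not rule(record.get(field))]

def cleanse(record):
    """Normalize fields before they land in the lake."""
    out = dict(record)
    if out.get("email"):
        out["email"] = out["email"].strip().lower()
    return out

def load(records):
    """Cleanse each record, then route it to the lake or to quarantine."""
    clean, quarantined = [], []
    for record in records:
        record = cleanse(record)
        (quarantined if validate(record) else clean).append(record)
    return clean, quarantined

clean, quarantined = load([
    {"customer_id": "12345678", "email": " Alice@Example.COM "},
    {"customer_id": "12ab", "email": "no-at-sign"},  # fails both rules
])
```

The point of the pattern is that the rules run automatically at load time, so bad records never silently enter the lake.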
Organizations that can integrate data from a diverse set of new and legacy sources inside and outside the organization, both batch and streaming, will reap significant benefits, including improved decision-making, faster product development, and the ability to use data to power predictive and advanced analytics. These businesses will also see substantial time savings: employees no longer have to spend a significant share of their time extracting data from silos and hard-to-access platforms such as the mainframe, and can instead focus on understanding the data and on operational work that directly supports their business objectives. They should know:
– Exactly where all of the data is coming from, taking into account all of the business processes.
– Where the different types of data are actually located.
– Which types of data are most mission-critical for the business to operate.
Organizations have never had more tools available to integrate their enterprise data for meaningful insights, especially with powerful compute frameworks like Apache Hadoop and Spark. However, accessing, integrating, and managing data from all enterprise sources – legacy and new – still proves to be a challenge. For example, accessing data from legacy mainframes is particularly complicated with regard to connectivity, data and file formats, security, compliance, and overall expertise. Yet with 70 percent of corporate data still stored on mainframe computers, organizations need a way to leverage that data for Big Data analytics. If that critical customer and transaction data isn’t part of the data lake, a significant piece of the puzzle is missing.
This data is critical to serve as historical reference data whether to be used for fraud detection, predictive analytics to prevent security attacks, or for real-time insights on who accessed which data and when. By liberating this data from the mainframe, companies can make better and more informed decisions with information they might never have had the opportunity to explore before – significantly impacting their growth and profitability.
Newer data sources deliver even more potential – and complexity. Streaming telemetry, sensor data, and IoT use cases require additional components in the technology stack. Supporting connectivity to these data types is an obvious requirement, but the real value lies in the convergence of these streaming data sources with batch historical data. Combining real-time and batch data for bigger insights and greater business agility requires a single software environment. Because of this, we will see more focus on simplifying and unifying the user experience so organizations have a single hub or platform for accessing all enterprise data in real time. That’s when companies will truly “master” their data, ensuring every single business decision is informed by a comprehensive, holistic view of data.
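A minimal sketch of that batch/streaming convergence, in plain Python rather than any particular streaming engine: each incoming event is enriched with batch historical reference data (of the kind the article describes liberating from the mainframe) and checked against the account’s history, as in a fraud-detection scenario. All names and thresholds here are illustrative assumptions.

```python
# Batch side: historical reference data per account, e.g. loaded from
# mainframe extracts into the data lake.
history = {
    "acct-1": {"avg_txn": 40.0},
    "acct-2": {"avg_txn": 500.0},
}

def enrich(event):
    """Join one streaming event with its historical profile and flag
    transactions far above the account's historical average (an
    arbitrary 3x threshold, chosen for illustration)."""
    profile = history.get(event["account"], {"avg_txn": 0.0})
    flagged = event["amount"] > 3 * profile["avg_txn"]
    return {**event, **profile, "flagged": flagged}

# Streaming side: real-time transaction events arriving one by one.
stream = [
    {"account": "acct-1", "amount": 35.0},
    {"account": "acct-1", "amount": 900.0},  # anomalous vs. history
]

results = [enrich(event) for event in stream]
```

In a production stack this join would run inside a streaming framework against reference data in the lake, but the shape of the logic, a stream-to-batch join followed by a rule, is the same.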
Tendü Yoğurtçu is Syncsort’s General Manager for the Big Data business. She has 20+ years of computer software industry experience, including extensive Big Data and Hadoop industry knowledge and technical acumen and has worked closely with key ecosystem partners like Cloudera, Hortonworks, MapR and Tableau. As General Manager, she is responsible for building on the success of Syncsort’s global data integration, Hadoop and Cloud solutions.
Before assuming her current position, Tendü served as Vice President, Engineering, leading the global engineering team in development of all of Syncsort’s data integration products on Linux, Unix, Windows and Hadoop. She has pioneered Syncsort’s open source contributions to the Apache Hadoop project, placing Syncsort among the top contributors.
Prior to joining Syncsort, Tendü worked on several high-security software projects, where she managed the product life cycle across engineering disciplines. She also served as adjunct faculty in the Computer Science Department at Stevens Institute of Technology.
Tendü received her PhD in Computer Science-Graph Theory from Stevens Institute of Technology, NJ. She has a Master’s Degree in Industrial Engineering and a Bachelor’s degree in Computer Engineering from Bosphorus University, Istanbul.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.