Most likely, you are acutely aware of the growing amount of data in your organization. How do you store all that data in a cost-effective, timely manner? If you are looking to offload some of that data to Apache Hadoop, you’ll need to truly understand your data warehouse environment in order to determine if you are moving the right data to the right place – which data is hot, which is cold, and what does that mean in terms of where it should be stored? And how exactly do you bring that data into Hadoop?
Hadoop is a key component of the next-generation data architecture and has proved its ability to scale to hundreds of terabytes of multi-structured data, all at a fraction of the cost of legacy platforms. Instead of thousands to tens of thousands of dollars per terabyte, Hadoop delivers compute and storage for hundreds to thousands of dollars per terabyte. With Hadoop, you can extract more meaningful business insights from more of your data while freeing up resources from existing systems.
Here are the four considerations that you should keep in mind when moving data to Hadoop:
Take your Data’s Temperature
To make the right business and operational decisions about your data, it’s important to determine what data is cold, what is warm, and what is hot. Hot data is accessed frequently and kept on fast storage; warm data is accessed less often and sits on slower storage; cold data is rarely accessed and resides on your slowest storage.
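The temperature test above can be sketched as a simple recency check. This is a minimal illustration, not a prescription: the 30-day and 180-day windows are assumptions, and a real assessment would also weigh access counts and workload cost, not just the last access time.

```python
from datetime import datetime, timedelta

# Illustrative thresholds -- these are assumptions, not rules.
# Tune them to your own organization's access patterns.
HOT_WINDOW = timedelta(days=30)
WARM_WINDOW = timedelta(days=180)

def temperature(last_accessed: datetime, now: datetime) -> str:
    """Classify a dataset as hot, warm, or cold by recency of access."""
    age = now - last_accessed
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"
```

A dataset touched within the last month would come back "hot" under these (assumed) windows, one untouched for half a year "cold".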
To make sense of all your data, it’s vital that you understand how it’s used, both in terms of the workload and in terms of the capacity that it’s taking up in your data warehouse. To successfully rebalance your data warehouse, you first need to assess and identify less-frequently accessed data and resource-intensive workloads that can be moved to Hadoop.
Take the case of a typical data warehouse, where hot data might be using 50 percent of your CPU cycles performing data transformations and extractions, yet that work often accounts for only 10 percent of your total data warehouse workload. Think about that for a second: 10 percent of your workload is consuming 50 percent of your CPU cycles. That’s hot data! This scenario is quite common: organizations use the data warehouse for workloads that may be better suited to a different platform. That’s the key here: it’s not just about hot, warm, or cold data, it’s about understanding where your hot data is, when and how it is being used, who is using it, and what to do with it.
You also need to think through where you want to put that data and how to move it. Is your data warehouse the best place to keep your hot data? Is there a percentage of your data that you could move to Hadoop? As you know, transferring large volumes of data quickly can be a challenge. Are there other resources that could come into play in terms of moving your data?
Develop a Roadmap for Moving Data and Workloads
Once you understand how your data is used, both in terms of the overall workload and the capacity it consumes in your data warehouse, you need to put that information in the context of user activity. When you have identified those aspects, you can start developing a roadmap for moving the data.
Developing a data movement roadmap requires a holistic approach, which means that you should do the following:
- Identify the workload(s) that you are moving
- Scope out how you will move that workload
- Understand how your different applications and databases are connected to each other
- Determine architectural investments that need to be made
By including all of these considerations in your roadmap, you will minimize the disruption to your business when you move data to Hadoop. You also should consider leveraging an enterprise-grade Hadoop distribution with one of the data optimization and integration solutions on the market so that you can load your structured and unstructured data into Hadoop quickly and easily.
Implement your Roadmap Incrementally and Iteratively
Understand that your job is not done once you optimize your data warehouse or move data to Hadoop. You need to do this iteratively – again and again – because data is dynamic and changes over time. Data that’s considered hot one month may be considered cold the next month.
Perform SQL Analytics on Big Data
Once you’ve done the heavy lifting of moving your ETL work off of your enterprise data warehouse and into Hadoop, then what? Now you have the flexibility to choose a tool like Apache Drill, a schema-free query engine that can perform secure, concurrent, low-latency SQL analytics on big data. Drill gives you the ability to explore and ask ad hoc questions on full-fidelity data in its native format as it comes in. This is what sets Drill apart from traditional SQL technologies. Drill lets you perform self-service raw data exploration and complex IoT/JSON data analytics, as well as ad hoc queries, on Hadoop-powered enterprise data hubs.
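As a concrete sketch of what querying raw data with Drill looks like, the snippet below submits a SQL statement over Drill’s REST API using only the Python standard library. The host, port, and JSON file path are hypothetical placeholders, and the query assumes made-up `sensor_id` and `reading` fields; adjust all of these to your environment.

```python
import json
import urllib.request

# Hypothetical setup: a Drill web endpoint on localhost:8047 and a raw
# JSON file at /data/iot/events.json -- adjust both to your environment.
DRILL_URL = "http://localhost:8047/query.json"

def build_payload(query: str) -> bytes:
    """Wrap a SQL string in the JSON body Drill's REST API expects."""
    return json.dumps({"queryType": "SQL", "query": query}).encode()

def run_drill_query(query: str) -> bytes:
    req = urllib.request.Request(
        DRILL_URL,
        data=build_payload(query),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Drill reads the JSON file in place, schema-free -- no ETL or table
# DDL beforehand. Field names here are assumptions for illustration.
sql = """
SELECT t.sensor_id, AVG(t.reading) AS avg_reading
FROM dfs.`/data/iot/events.json` t
GROUP BY t.sensor_id
"""

if __name__ == "__main__":
    print(run_drill_query(sql))
```

The point of the sketch is the `FROM dfs.` clause: the query targets the file in its native format, which is what "full-fidelity data as it comes in" means in practice.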
Understanding how your data warehouse is being used is the first consideration to keep in mind when moving data to Hadoop. You need to understand your workload, who is using the data, and how they are using it. By better utilizing your data assets and taking advantage of an enterprise-grade Hadoop distribution paired with a commercial data optimization and integration solution, you’ll be able to dramatically reduce costs while realizing more value from your data.
For more information, check out this webinar.
Bill Peterson is the Director of Product Marketing for MapR. Prior to MapR, Bill was the Director of Product and Solutions Marketing for CenturyLink Technology Solutions, where he was responsible for marketing, strategy, and leadership for the company’s big data efforts. Prior to CenturyLink, Bill ran Product and Solutions Marketing for NetApp’s Big Analytics and Hadoop solutions. In addition to his marketing role at NetApp, Bill was the Marketing Co-Chair for the Analytics and Big Data committee at SNIA. Prior to joining NetApp, Bill held leadership positions at IDC within the Software Consulting Group, and at Page One PR. Bill has also served as a research analyst at IDC and The Hurwitz Group, covering the operating environments, portals, content management, and business intelligence markets. In addition, Bill was Director of Marketing for TurboLinux, where he led the S-1 team. Earlier in his career, he served as Vice President of Marketing for Venturcom, ran vertical solutions marketing for Computer Associates, and was an IT manager at Harvard University. Bill did his undergraduate work at Bentley University, and has completed MBA coursework at Suffolk University.