Big data analytics has been a critical component of the data center conversation for the last few years. The potential behind the massive data collection in today’s technology market seems to be limitless. Earlier this year, Hortonworks’ founder predicted that by 2020, 75 percent of Fortune 2000 companies would be running 1,000-node Hadoop clusters.
Although Hadoop might be the most popular big data solution worldwide, it still comes with deployment and management challenges around scalability, flexibility, and cost-effectiveness. Here are five best practices for a robust Hadoop environment in your data center.
- Effective Workload Management
Hadoop’s open-source framework enables you to store vast amounts of data on multiple commodity cloud servers without the need for costly purpose-built resources. Ensure that your infrastructure personnel are involved from the beginning and that they are aware of the feasibility of using commodity servers. Otherwise, you can end up overbuilding your cluster and making an unnecessary investment in proprietary data-handling equipment. Choosing commodity servers can have even more of an impact on your budget if you have multiple environments to support development and testing.
All in all, Hadoop is very much about pairing computation with data, which could mean returning to some mainframe-era roots. Effective workload management is a necessary Hadoop best practice: architecture and management have to consider the whole cluster, not just individual batch jobs, in order to avoid future roadblocks. That said, running flexible, cloud-based data center infrastructure can provide the agility required to run your Hadoop cluster automatically and efficiently.
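One concrete lever for whole-cluster workload management is YARN’s CapacityScheduler, which partitions cluster capacity among job queues so a single batch workload cannot starve everything else. A minimal sketch of `capacity-scheduler.xml` follows; the queue names (`etl` for batch jobs, `adhoc` for interactive queries) and the 70/30 split are illustrative assumptions, not values from this article:

```xml
<!-- capacity-scheduler.xml (sketch): queue names and percentages are illustrative -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>etl,adhoc</value>   <!-- two child queues under the root queue -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>70</value>          <!-- guaranteed share for batch ETL jobs -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>          <!-- guaranteed share for ad-hoc queries -->
  </property>
</configuration>
```

Queue capacities under a parent must sum to 100; jobs are then submitted to a queue with `mapreduce.job.queuename` or the equivalent client setting.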
- Starting Small
We’ve all seen the statistics on how many IT projects fail because of their complexity and costs. Implementing Hadoop carries the same risks. The beauty of Hadoop is that it scales gracefully: you simply add nodes to a cluster in a modular fashion as they are needed. Although it’s easy to add to a Hadoop cluster, it is not as easy to take away. In addition, you may find that your server specifications need to be altered based on the results and performance of your initial project, an adjustment Hadoop supports on an ongoing basis. Still, it’s easy to get carried away when building your first cluster.
Choosing a small project to run as a proof-of-concept (POC) allows development and infrastructure staff to familiarize themselves with the inner workings of this technology, enabling them to support other groups’ big data requirements across the organization with reduced implementation risk.
- Cluster Monitoring
Although Hadoop offers some redundancy at the data and management levels, many moving parts still need to be monitored. Your cluster monitoring needs to report on the whole cluster as well as on specific nodes, and it must scale with the cluster, automatically tracking nodes as they are added. Hadoop’s metrics, a collection of runtime statistics exposed by the Hadoop daemons, provide the pertinent data about the cluster.
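Those daemon metrics are also exposed as JSON over HTTP (the NameNode serves them at its `/jmx` endpoint). As a minimal sketch of consuming them, the helper below extracts DataNode liveness counters from a `/jmx` payload; the hostname and port in the usage comment are placeholders, not values from this article:

```python
import json
from urllib.request import urlopen  # used for the live-cluster call shown below


def datanode_health(jmx_payload: str) -> dict:
    """Pull DataNode liveness counters out of a NameNode /jmx JSON payload."""
    beans = json.loads(jmx_payload)["beans"]
    for bean in beans:
        # The FSNamesystemState bean carries cluster-wide DataNode counters.
        if bean.get("name") == "Hadoop:service=NameNode,name=FSNamesystemState":
            return {"live": bean["NumLiveDataNodes"],
                    "dead": bean["NumDeadDataNodes"]}
    raise ValueError("FSNamesystemState bean not found in payload")


# Against a live cluster (hostname and port are placeholders):
# payload = urlopen("http://namenode.example.com:50070/jmx").read()
# print(datanode_health(payload))
```

A check like this can be wrapped as a Nagios plugin that alerts whenever the `dead` count rises above zero.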
You can use Nagios to monitor all of the nodes and services in a cluster. Nagios and Cacti can work together to adjust and add checks to your Hadoop cluster such as reviewing each server’s disk health, looking at the overall performance of groups of resources, and allowing you to segment performance tracking by application, department, team and data-sensitivity level.
- Data Integration
Creating a data integration process upfront is essential. In the multi-structured world of big data, the goal is to extract data from multiple sources and then load the raw data quickly and efficiently. However, one of Hadoop’s advantages is that it allows you to continually integrate new data sources and adjust data structures. You can use tools such as Sqoop, which is designed to efficiently transfer bulk data between Hadoop and structured datastores such as relational databases. You can also use Flume, a tool that collects log data from multiple sources, aggregates it, and writes it to the Hadoop Distributed File System (HDFS). It’s recommended to create and maintain a wiki site to keep the documentation for your data sources, and where they live within the cluster, up to date.
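As a rough sketch of what those two ingestion paths look like in practice, the commands below show a Sqoop bulk import and a Flume agent launch; the hostnames, database, table, paths, and agent name are all illustrative placeholders:

```shell
# Bulk-import a relational table into HDFS with Sqoop
# (connection string, credentials, and paths are placeholders):
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders

# Stream log data into HDFS by starting a Flume agent
# (the agent name "a1" and config path are illustrative):
flume-ng agent --name a1 --conf-file /etc/flume/web-logs.conf
```

The target directories used by jobs like these are exactly the kind of detail worth recording on the wiki site mentioned above.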
- High Availability and Security
Your Hadoop high-availability (HA) design within your data center needs to be aligned with your business’ and environment’s SLA requirements. Architecting HA into a Hadoop cluster should start in your POC and should include capacity planning as well as scalability considerations for increases in demand and data growth. A Hadoop deployment spans multiple volumes, and you need to ensure that replicas of all of your data exist on different racks in your data center. In addition, failures are inevitable: a non-redundant hardware component may fail, the operating system may crash, a disk partition may run out of space, or some other error may occur. The system needs to be able to detect these failures and either recover itself or fail over to a backup site.
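Cross-rack replicas come from two HDFS settings: the replication factor and a rack-mapping script that tells the NameNode which rack each node sits on. A minimal sketch follows, assuming Hadoop’s default three-way replication; the script path is illustrative:

```xml
<!-- hdfs-site.xml (sketch): three replicas is the HDFS default, shown explicitly -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

<!-- core-site.xml (sketch): a site-provided script that maps each host to a
     rack ID, enabling rack-aware replica placement; the path is illustrative -->
<configuration>
  <property>
    <name>net.topology.script.file.name</name>
    <value>/etc/hadoop/rack-topology.sh</value>
  </property>
</configuration>
```

With a topology script in place, HDFS’s default placement policy keeps at least one replica of each block on a different rack, so a single rack failure does not lose data.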
Continuous monitoring should identify where latency is increasing and trigger the system to automatically clear bottlenecks. Additionally, as with every open-source project, Hadoop is constantly changing. Make sure you test upgrades and new functionality so that they don’t impact your production environment.
Because Hadoop clusters hold data, they might require strict security policies and controls. In short, a key to securing your Hadoop cluster is to implement controls on the interactions with the cluster. This includes putting the cluster behind a firewall as well as creating a single point-of-access interface so you can continuously govern all users and services accessing the environment.
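As a minimal sketch of that single point of access at the network level, the firewall rules below (iptables syntax; the gateway address, subnet, and rule placement are assumptions) admit only one edge host to the NameNode’s default RPC port, 8020:

```shell
# Allow the designated edge/gateway host (address is a placeholder)
# to reach the NameNode RPC port:
iptables -A INPUT -p tcp -s 10.0.1.5 --dport 8020 -j ACCEPT

# Drop NameNode RPC traffic from everywhere else:
iptables -A INPUT -p tcp --dport 8020 -j DROP
```

Funneling all users and services through that one gateway host is what makes continuous governance of cluster access practical.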
The distributed compute and storage power of the Hadoop engine can be used to perform data manipulations and transformations of large volumes of data into relevant and meaningful business insights. An optimized Hadoop “data lake” can help you attain the untapped possibilities of big data for your company. While the benefits of big data are clear, successful Hadoop implementation often hinges on proper infrastructure deployment and management. Flexible cloud infrastructure will allow you to accommodate your Hadoop cluster’s growth and operations. We hope that the best practices discussed above will help you meet your big data needs and efficiently run your Hadoop environment.
Ronen Kofman is VP of Product at Stratoscale. An expert in the cloud and virtualization space, Ronen has extensive experience leading virtualization and cloud products from inception to wide market adoption. Prior to joining Stratoscale, Ronen held leadership positions at Mirantis, Oracle, VMware, and Intel. As VP of Product Management at Mirantis, Ronen played a key role in shaping OpenStack to meet real-world use cases for enterprises, service providers, telecommunications vendors, and application developers. Prior to Mirantis, he started Oracle’s OpenStack initiative, forming the development team and launching Oracle’s first OpenStack product. Ronen holds an MBA from the MIT Sloan School of Management and a bachelor’s degree in computer and software engineering (cum laude) from the Technion – Israel Institute of Technology.