Hadoop Project Falcon: Data Lifecycle Management for App Developers

June 27, 2013 12:25 pm

InMobi and Hortonworks engineers designed Falcon to enable application developers to manage data motion, disaster recovery and data retention workflows. Above, a Hortonworks graphic representing the open source project.

For Hadoop to finally make it out of the sandbox and into a real production environment, organizations need the right tools to manage and control Hadoop clusters. That’s a considerable challenge given the vast amount and variety of data Hadoop is capable of storing and processing.

Data platform provider Hortonworks and mobile ad network InMobi hope to change all that with Falcon. A jointly developed data lifecycle management framework for Hadoop, Falcon simplifies data management by allowing users to easily configure and manage data migration, disaster recovery and data retention workflows. InMobi built Falcon for its own use nearly two years ago. Since then, the technology has been submitted to the Apache Software Foundation for the open source community to use.

That vendors like Hortonworks are starting to look beyond perfecting Hadoop as a platform to managing its clusters is good news for enterprises. According to Shaun Connolly, Hortonworks’ vice president of corporate strategy, there is a string of common questions that often stump companies when attempting to manage Hadoop clusters: How do you get data into the Hadoop platform from other systems? How do you transform this data into the format you need? How do you share this data easily with downstream systems like data warehouses? And how do you distribute that data geographically?

Falcon heralds the “second generation of Hadoop” as enterprises begin to expect greater scalability, security and manageability from the relatively nascent technology. “In the past year and a half, more mainstream enterprises have been adopting Hadoop so the market demand has been building very quickly and aggressively over the past 18 months,” says Connolly. “So we’re seeing more and more of these [data management] solutions come onto the market.”

Cloudera Manager, for example, promises greater transparency into and control over Hadoop clusters so that users can easily deploy and configure clusters from a centralized console. Zettaset Orchestrator, on the other hand, acts as a management layer that sits on top of any Hadoop distribution to simplify Hadoop deployments without the need for manual configuration and third-party consulting services.

While data management tools vary, the very nature of Hadoop demands a more mature toolset. The problem, says Mark Madsen, president of Third Nature, a research firm specializing in business intelligence, is that “Hadoop is a data processing platform. It’s just there to crunch data and store the information that you’ve fed into it but it doesn’t have the features of a relational database.” Relational databases, for example, have a straightforward and formal structure for managing data in simple tables.

Another roadblock to managing Hadoop is its ability to store a wide breadth of data from multiple sources. In the case of a traditional database, a set of unique product identifiers can be used to distinguish and control each row of data stored in a table – a process that’s not so simple with Hadoop.

Madsen offers the example of a traditional database used to store a grocery chain’s pricing information. “It’s easy to take a price list and a set of product identifiers and manage and control them so that you don’t have random people changing prices in your system,” he says. “But when you try to extend that to the kinds of information being put in a Hadoop cluster, like Web logs and machine sensor data, the data quality tools we have now can’t really scale to that level effectively.”

And then there are simply the laws of physics when it comes to quickly migrating vast amounts of disparate data in the event of an outage. “Hadoop is a very large distributed file system with many types of files so it’s a general purpose utility,” warns Connolly. “The fact that it can store extremely large amounts of data is where the challenge comes in. You have to think about how you’re actually going to be conducting disaster recovery in that scenario. You can’t just pick up 40 terabytes and move it elsewhere.”
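To put Connolly's 40-terabyte figure in perspective, the back-of-the-envelope sketch below estimates how long a bulk copy would take. The link speeds are illustrative assumptions, not numbers from Hortonworks or InMobi, and the calculation assumes decimal terabytes, fully sustained throughput and no protocol overhead, so real transfers would likely take longer.

```python
def transfer_days(terabytes, gigabits_per_second):
    """Estimate days needed to copy a dataset over a network link.

    Assumes decimal units (1 TB = 1e12 bytes, 1 Gbps = 1e9 bits/s)
    and an idealized link running at full speed the whole time.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 86400  # seconds per day


# Hypothetical link speeds chosen only for illustration.
days_1g = transfer_days(40, 1)    # roughly 3.7 days
days_10g = transfer_days(40, 10)  # roughly 0.37 days, about 9 hours

print(f"40 TB over 1 Gbps:  {days_1g:.1f} days")
print(f"40 TB over 10 Gbps: {days_10g:.2f} days")
```

Even under these generous assumptions, a full copy after an outage takes hours to days, which is why continuous replication tends to be planned ahead of time rather than attempted during a failure.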

Cindy Waxer, a contributing editor who covers workforce analytics and other topics for Data Informed, is a Toronto-based freelance journalist and a contributor to publications including The Economist and MIT Technology Review. She can be reached at cwaxer@sympatico.ca or via Twitter @Cwaxer.

Home page image of Saker Falcon by € Van 3000 via Wikipedia.

One Comment

  1. Alen
    Posted November 20, 2014 at 5:57 am | Permalink

    Hi Cindy, thanks for sharing this post. I would also like to add that Falcon simplifies the development and management of data processing pipelines by introducing a higher layer of abstraction for users to work with. Falcon takes the complex coding out of data processing applications by providing common data management services out-of-the-box, simplifying the configuration and orchestration of data motion, disaster recovery and data retention workflows. Key features of Falcon include:

    -Data Replication Handling: Falcon replicates HDFS files and Hive Tables between different clusters for disaster recovery and multi-cluster data discovery scenarios.

    -Data Lifecycle Management: Falcon manages data eviction policies.

    -Data Lineage and Traceability: Falcon entity relationships enable users to view coarse-grained data lineage.

    -Process Coordination and Scheduling: Falcon automatically manages the complex logic of late data handling and retries.

    -Declarative Data Process Programming: Falcon introduces higher-level data abstractions (Clusters, Feeds and Processes) enabling separation of business logic from application logic, maximizing reuse and consistency when building processing pipelines.

    -Leverages Existing Hadoop Services: Falcon transparently coordinates and schedules data workflows using the existing Hadoop services such as Apache Oozie.
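The declarative idea in the comment above can be illustrated with a minimal sketch. This is not Falcon's API or its entity format. The `Feed` class, its fields and both helper functions are hypothetical names invented here to show how a retention limit and replication targets, once declared as data, can drive eviction and replication decisions without custom pipeline code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Feed:
    """Hypothetical stand-in for a declaratively described dataset."""
    name: str
    retention: timedelta      # how long each instance is kept
    replicate_to: tuple = ()  # names of target clusters (illustrative)


def instances_to_evict(feed, instance_times, now):
    """Return instance timestamps older than the feed's retention limit."""
    cutoff = now - feed.retention
    return [t for t in instance_times if t < cutoff]


def replication_plan(feed, instance_times, now):
    """Pair every retained instance with each declared target cluster."""
    cutoff = now - feed.retention
    return [
        (t, target)
        for t in instance_times
        if t >= cutoff
        for target in feed.replicate_to
    ]


now = datetime(2013, 6, 27)
feed = Feed("clicks", retention=timedelta(days=2), replicate_to=("backup",))
daily = [now - timedelta(days=d) for d in range(5)]  # five daily instances

evicted = instances_to_evict(feed, daily, now)
plan = replication_plan(feed, daily, now)

print(len(evicted), "instances to evict")
print(len(plan), "instance copies to replicate")
```

Changing the policy here means editing the `Feed` declaration rather than the logic, which is the separation the comment describes. In Falcon itself, scheduling is delegated to existing Hadoop services such as Apache Oozie, as the last bullet notes.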
