Designing and building a big data ecosystem comes with challenges as well as expectations based on user needs and industry trends. Here are a few examples of what we commonly face when companies come to us with big data ecosystem projects:
- Businesses know that their data is valuable but don’t know how to extract that value.
- A processing pipeline is established, but the desired outcome is not clearly defined.
- Companies have a goal but don’t know what data inputs are required to get there.
When working with big data, we must assume that the volume of input to our process may be unknown. But the process needs to be predictable enough to perform as expected without requiring additional modifications.
Reporting is a cornerstone of data-driven management, as it’s in the nature of business to summarize metrics while slicing and filtering them through different variables. Consider machine-learning algorithms: not only are they more complex, diving into different combinations of variables in search of patterns, classification criteria, or clusters, but they are also usually developed as standalone implementations by subject-matter experts in different areas. This can leave performance predictability, distributed processing, and high availability out of the development scope.
The problem that big data experts face is how to translate these algorithms into a perfectly scalable process, which we define as one that is completely self-adjusting. Whether we are talking about an unforeseen acceleration of data volume, an additional accumulation of unprocessed data, a change of the data input schema, an increase in the number of consumers, or even unexpected hardware failure, the system needs to adjust quickly and remain predictable.
What drives an organization to make a large investment in big data analytics are business constraints that are set by product owners or higher management within the business. The architect is in charge of making the technological decisions that will accommodate all of the business’ needs and anticipate anything that could become a problem once the system starts to run.
Discovery and Business Mapping
Discovery is one of the keys to success in every big data process. It consists of a general assessment driven by questions that address both the technical and business aspects related to the organization:
- What is the expected outcome of the effort? Is there one?
- What are the foreseeable data sources?
- Who and what will consume the data?
- What are the business constraints and maturity regarding data governance?
Structuring Smart Big Data Solutions
The data sources are usually a business constraint. They may come from relational databases, plain text files, APIs, social media, and enriched content. Because of this, data ingestion puts a number of languages and strategies on the table, which means that different areas of the organization will have to work alongside the big data architects to achieve the right integration. Whether it is SQL, Java, Python, or more specific solutions, such as extract, transform, and load (ETL)-specific software, understanding that every piece is essential becomes a challenge for the IT organization in charge.
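To make the ingestion discussion concrete, here is a minimal ETL sketch in Python: extract rows from a CSV source, validate and normalize them, and load them into a relational store (SQLite stands in for the target database). The file contents, column names, and table name are illustrative assumptions, not part of any specific project described above.

```python
# Minimal ETL sketch. Column names ("user", "amount") and the target
# table are illustrative assumptions; SQLite stands in for a real store.
import csv
import io
import sqlite3

def extract(csv_text):
    """Extract: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: normalize fields and drop malformed records."""
    out = []
    for row in rows:
        try:
            out.append({"user": row["user"].strip().lower(),
                        "amount": float(row["amount"])})
        except (KeyError, ValueError):
            continue  # skip records that fail validation
    return out

def load(rows, conn):
    """Load: persist the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS purchases (user TEXT, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (:user, :amount)", rows)
    conn.commit()

raw = "user,amount\nAlice,10.5\nBOB,3\nbad_row,notanumber\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM purchases").fetchone()
```

The same extract/transform/load separation holds whether the plumbing is hand-written SQL, a Java or Python job, or dedicated ETL software; only the volume and the tooling change.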
Outcomes or products are extremely variable but, through time, there are different trends. Currently, we see many interactive dashboards as well as many recommendation engines that follow behavioral patterns. Content-based machines analyze customer attributes and compare them with a range of available products. Search engines are also used, and they include lexicographic analysis as well as semantic search.
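A content-based engine of the kind mentioned above can be sketched as a similarity score between a customer's attribute vector and each product's attribute vector. The attribute names, vectors, and cosine-similarity scoring below are illustrative assumptions, not a production recommender.

```python
# Content-based matching sketch: rank products for a customer by cosine
# similarity between attribute vectors. All attributes and values here
# are illustrative assumptions.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Customer preference vector over (sports, tech, outdoors) interests.
customer = [0.9, 0.1, 0.8]

# Product attribute vectors over the same dimensions.
products = {
    "running shoes": [1.0, 0.0, 0.6],
    "laptop":        [0.0, 1.0, 0.0],
    "camping tent":  [0.2, 0.0, 1.0],
}

# Rank products by similarity to the customer's preferences.
ranked = sorted(products, key=lambda p: cosine(customer, products[p]),
                reverse=True)
```

Behavioral recommenders replace the hand-set attribute vectors with vectors learned from interaction histories, but the matching step looks much the same.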
There are optimization problems that require a big data framework, and everything comes in at least two flavors, with different levels of complexity: batch processing and real-time or near-real-time processing. In this sense, a number of tools in the ecosystem serve different purposes and follow a pattern of grouping. In general, while decomposing big problems into architecture solutions, it’s advisable to group problems into similar categories:
- Data quality. Includes data ingestion, data transformation, and governance. This is sometimes referred to, from a high-level perspective, as data pipes. Most known pipes consist of the traditional ETL processes but include real-time data streams as well.
- Storage. Includes all the data structures and data management systems that serve and host information: relational databases, non-relational databases and, in practice, HDFS as well.
- Analytics and exploitation. These are the tools directly correlated to the expected outcome of the solution. We usually talk about analytics as a combination of aggregations and visualization techniques that provide business-level metrics, while we try to establish the idea of exploitation as a generic concept that could include machine-learning algorithms, full-text search, or automatic decision-making processes.
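The three groupings above can be illustrated as stages of a toy batch pipeline: a data-quality step that validates events, a storage step that persists them (an in-memory list stands in for a database or HDFS), and an analytics step that aggregates a business metric. All names and data here are illustrative assumptions.

```python
# Toy pipeline mapping the three groupings: data quality -> storage ->
# analytics. The in-memory "warehouse" and event fields are assumptions.
from collections import defaultdict

def clean(events):
    """Data quality: keep only events with a country and non-negative revenue."""
    return [e for e in events if e.get("country") and e.get("revenue", 0) >= 0]

def store(events, warehouse):
    """Storage: persist events (a list stands in for a DB or HDFS)."""
    warehouse.extend(events)

def revenue_by_country(warehouse):
    """Analytics: aggregate a business-level metric for reporting."""
    totals = defaultdict(float)
    for e in warehouse:
        totals[e["country"]] += e["revenue"]
    return dict(totals)

warehouse = []
store(clean([{"country": "US", "revenue": 120.0},
             {"country": "AR", "revenue": 80.0},
             {"country": None, "revenue": 10.0},   # dropped by data quality
             {"country": "US", "revenue": 30.0}]), warehouse)
metrics = revenue_by_country(warehouse)
```

In a real ecosystem each stage would be a different tool or cluster, which is exactly why the category boundaries blur in practice, as discussed next.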
In reality, this is not as clean as it sounds from the architectural perspective. Most of the tools overlap in functionality, and a lot of the tools will need to be custom built for the purpose of the application. There is no better way to provide tool recommendations for a given use case than by relying on advisers with actual hands-on experience in each of them. Patterns in the designs do exist, as well as many stacks that perform well together, but the ecosystem is big and we have yet to see two completely identical implementations.
Sabina Schneider is the vice president of the big data Studio at Globant. Sabina has worked in the big data field using different open-source solutions, including Apache Hadoop, Pig, Hive, HBase, Cassandra, Storm, and MongoDB, to build a single data platform from which metrics are extracted. She has experience implementing custom dashboard solutions and using AWS and Google as infrastructure providers. She also has experience understanding business needs, analyzing them, and designing technical solutions. Most recently, Sabina has led several big data consultancy engagements for leading banks in the U.S., Europe, and Latin America.
She received a Computer Science Engineering degree from the National Technology University and a Bachelor’s degree from Gartenstadt Schule.
Alejandro de la Viña is a big data architect leading the practice at Globant and is passionate about turning data into business solutions; he was previously an implementation engineer for the big data ecosystem and a PL/SQL developer. Connect with him on LinkedIn.
Subscribe to Data Informed for the latest information and news on big data and analytics for the enterprise.