Build a Better Big Data Recommendation Engine

July 9, 2014 5:30 am

Elliott Cordo, Chief Architect, Caserta Concepts

Over the past few years, retail customers have come to expect relevant product recommendations as an essential part of the shopping experience. Marketing interactions should be helpful and time saving, not generic, out of context, and annoying. Shop at online retailers such as Amazon or Bluefly and it can feel as though they are inside your head as they present and recommend products relevant to you. This is a dramatic improvement over the traditional psycho-demographic profiling and targeting of the “old world.”

But recommendation engines have a wide range of applications beyond retail. Examples of successful recommendation platforms exist in industries such as hospitality, service, music, online radio, television, and video.

Designs vary among industries and applications, but the really good ones have struck an effective balance between serving relevant recommendations and not being obvious. If an engine recommended only items that your customer actually used or directly interacted with, or items nearly identical to these products, the value would likely be low. Instead, employ more sophisticated algorithms, such as those based on peer behavior, to provide customers with surprising results that may be just what they are looking for.

The Big Data Platform

One of the greatest enablers of this evolution is big data and its ecosystem. Big data and the core Hadoop platform provide a system in which large amounts of data and customer interactions can be stored and processed. The data can include typical transactional information such as purchases, usage, and customer touch points, as well as the immense impression data generated when people view pages and items as they navigate through a website. It is often said that a simple algorithm with a lot of data can outperform a complex algorithm with an inadequate amount of data. The more data we process with machine-learning algorithms, the more relevant and effective our recommendations become.

With the new big data paradigm, a wide variety of open source and commercial applications generate compelling and effective recommendations. For example, Mahout is an open-source library of machine-learning algorithms with several components dedicated to recommendations and clustering.

Here is an overview of a recommendation engine leveraging Mahout. I am not saying that this approach alone will yield a perfect recommendation engine, but that it serves as a starting point for exploration.

The example organization is an online magazine (as opposed to a retail case). The company’s goal is to offer recommendations based on what customers’ peers are reading, in context of the article they are currently viewing. We address this challenge by leveraging Hadoop, Mahout, and Pig to combine different machine-learning algorithms.

Peer Influence Algorithms (Collaborative Filtering)

An item-based algorithm provides peer influence recommendations based on item usage similarity. The data used as input is a simple, three-column delimited file of recent reading history, with the following fields:

Customer_ID: Unique integer identifier of a customer
Article_ID: Unique integer identifier of an article
Rating: Integer scale from 1 to 5

The Rating field can be used to control the bias by which the article will be recommended. If available, it could be a star rating by customers. If the customer’s peers think highly of the article, it is more likely to be recommended. To improve relevance, you can scale the rating by age, lowering it as the article gets older. If no customer rating is collected, age scaling alone could be used. Other enrichments include boosting items on promotion or on sale, or anything else that might be desirable to influence the results.
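As a rough illustration of age scaling, here is a minimal Python sketch. The half-life decay, the function names, and the comma delimiter are assumptions made for this example; they are not part of Mahout or of the original design. The result is fractional, so round it if your pipeline needs integer ratings.

```python
from datetime import date


def age_scaled_rating(rating: float, published: date, today: date,
                      half_life_days: float = 7.0) -> float:
    """Decay a 1-5 rating toward 1 as the article ages.

    Hypothetical scheme: the part of the rating above the floor of 1
    halves every `half_life_days` days.
    """
    age_days = max((today - published).days, 0)
    decay = 0.5 ** (age_days / half_life_days)
    return 1.0 + (rating - 1.0) * decay


def to_input_row(customer_id: int, article_id: int, rating: float) -> str:
    """Format one line of the three-column input file (comma-delimited here)."""
    return f"{customer_id},{article_id},{rating:.2f}"


# A 5-star article published a week ago counts as a 3 today.
scaled = age_scaled_rating(5, date(2014, 7, 2), date(2014, 7, 9))
row = to_input_row(1001, 42, scaled)
```

If no customer rating is collected, you could pass a constant starting rating (for example 5) and let age alone drive the value.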

Pass these variables to the recommender and it returns results that include the Customer_ID and an array of Article_IDs, plus ratings for that particular customer.
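To make the shape of the input and output concrete, here is a small pure-Python sketch of item-based collaborative filtering. It is not Mahout's implementation; the cosine similarity measure, the weighted-average scoring, and all names are assumptions chosen for illustration.

```python
from collections import defaultdict
from math import sqrt


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse {customer_id: rating} vectors."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(rows, customer_id, top_n=3):
    """Return [(article_id, projected_rating), ...] for one customer."""
    by_item = defaultdict(dict)  # article_id -> {customer_id: rating}
    for cust, art, rating in rows:
        by_item[art][cust] = rating

    # Articles this customer has already rated.
    seen = {art: r[customer_id] for art, r in by_item.items() if customer_id in r}

    scores = []
    for art, vec in by_item.items():
        if art in seen:
            continue
        # Weight the customer's own ratings by how similar each rated
        # article is to the candidate article.
        sims = [(cosine(vec, by_item[s]), r) for s, r in seen.items()]
        weight = sum(s for s, _ in sims)
        if weight > 0:
            scores.append((art, sum(s * r for s, r in sims) / weight))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# (Customer_ID, Article_ID, Rating) rows, as in the input file.
rows = [(1, 10, 5), (1, 11, 4), (2, 10, 5), (2, 12, 5), (3, 11, 2), (3, 13, 4)]
recs = recommend(rows, customer_id=1)
```

For customer 1, the sketch ranks article 12 ahead of article 13, because the peer who also liked article 10 rated article 12 highly.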

Item Similarity (Collaborative Filtering)

You also can leverage Mahout to generate context-specific recommendations based on the article being viewed. Although “item-based” attempts to provide the best user-biased recommendations, it offers no context in regard to the current navigation of the site. To add that context, you can leverage Mahout’s item-similarity algorithm. This algorithm is familiar – appearing on popular retail and content sites as “People who liked this item also liked …”

The algorithm leverages the same data input (Customer_ID, Article_ID, Rating) as item-based. Instead of ranking articles by projected user preference, it compares each article with the article being viewed and creates a similarity score. Articles that the same customers tend to rate in the same way as the viewed article will have the highest similarity scores and likely will be great recommendations while users browse the corresponding article.
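A minimal sketch of the "also liked" lookup follows, again in plain Python rather than Mahout. Cosine similarity over customers' ratings is an assumption here; it is one of several possible similarity measures, and the function name is hypothetical.

```python
from collections import defaultdict
from math import sqrt


def similar_articles(rows, viewed_id, top_n=5):
    """Rank other articles by rating-pattern similarity to the viewed article."""
    vectors = defaultdict(dict)  # article_id -> {customer_id: rating}
    for cust, art, rating in rows:
        vectors[art][cust] = rating

    target = vectors.get(viewed_id, {})
    target_norm = sqrt(sum(v * v for v in target.values()))

    result = []
    for art, vec in vectors.items():
        if art == viewed_id:
            continue
        # Only customers who rated both articles contribute to the score.
        dot = sum(target[c] * vec[c] for c in target.keys() & vec.keys())
        if dot > 0:
            norm = target_norm * sqrt(sum(v * v for v in vec.values()))
            result.append((art, dot / norm))
    return sorted(result, key=lambda pair: pair[1], reverse=True)[:top_n]


rows = [(1, 10, 5), (2, 10, 4), (1, 11, 5), (2, 11, 4), (1, 12, 1), (3, 12, 5)]
also_liked = similar_articles(rows, viewed_id=10)
```

Here article 11 comes first because the same customers rated it exactly as they rated article 10, while article 12 shares only one lukewarm reader.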

Item Clustering (Content Filtering)

Up until now, we’ve discussed algorithms that leverage only the usage data of the articles. To provide additional context to recommendation results, you can leverage the content of the articles using the K-Means clustering algorithm for efficient retrieval of similar articles. This time, the files are sequence files. Sequence files are Hadoop’s binary key/value file format (optionally compressed), and it’s important to know how to work with them. For news articles, I selected six dominant attributes, represented numerically, as input to the recommender engine and created an all-integer file.

Article_ID: Unique integer identifier of an article
Author: Who wrote the article
Section: Section of the site where it was published (politics, health, etc.)
Topic: Topic hierarchy of the article
Region: Region where it was published
Media: Article, video, picture

Note that numeric similarity of the attributes can be both friend and enemy. Within a single field, K-Means treats values that are numerically close as similar, whether or not the things they represent are actually alike. For example, if you use the “natural key” (internal sequential number from the transactional system) to represent the Author field and two different authors were created on the same day, their keys may differ by only 1. The problem is that the two authors may work on different teams, for different organizations, handling different types of articles. If you cluster and present articles written by these two authors as similar, it might promote bad recommendations. To solve this, create an artificial numbering schema with a meaningful hierarchy, so that numeric closeness reflects real similarity.
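One way to build such a schema is to pack the hierarchy into the digits of the number. The sketch below is only an assumption about what a "meaningful hierarchy" could look like; the organization > team > author layout and the digit widths are hypothetical.

```python
def author_code(org_id: int, team_id: int, author_seq: int) -> int:
    """Pack organization > team > author into one integer.

    Hypothetical layout: O..O TT AAA (two digits for the team, three for
    the author within the team), so authors who share a team or an
    organization end up numerically close.
    """
    if not (0 <= team_id < 100 and 0 <= author_seq < 1000):
        raise ValueError("team_id must be < 100 and author_seq < 1000")
    return org_id * 100_000 + team_id * 1_000 + author_seq


alice = author_code(org_id=1, team_id=3, author_seq=7)   # 103007
bob = author_code(org_id=1, team_id=3, author_seq=9)     # 103009, same team
carol = author_code(org_id=2, team_id=1, author_seq=8)   # 201008, other org
```

With this layout, two teammates differ by a handful of units, while authors at different organizations are at least tens of thousands apart, no matter when their records were created.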

The result of the K-Means algorithm is stored in a sequence file containing all the coordinate points (IDs) passed as input, with the addition of a relative cluster number. The cluster number is used downstream along with a little math in the delivery.
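To show the kind of output this step produces, here is a tiny pure-Python K-Means sketch that returns a cluster number per article. It is not Mahout's implementation: the seeding with the first k points, the fixed iteration count, and the sample attribute vectors are all assumptions for illustration. Note that fields with large numeric ranges dominate Euclidean distance unless you rescale them.

```python
from math import dist


def kmeans_clusters(vectors: dict, k: int, iterations: int = 10) -> dict:
    """Assign each article's attribute vector to one of k clusters.

    `vectors` maps Article_ID -> tuple of numeric attributes.
    Returns Article_ID -> cluster number.
    """
    ids = list(vectors)
    points = [vectors[i] for i in ids]
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of articles")

    centroids = [list(p) for p in points[:k]]  # deterministic seeding
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return dict(zip(ids, assignment))


# Article_ID -> (Author, Section, Topic, Region, Media), all hypothetical codes.
articles = {
    101: (103007, 1, 120, 1, 1),
    102: (103009, 1, 121, 1, 1),
    103: (201008, 4, 310, 2, 2),
    104: (201011, 4, 312, 2, 1),
}
clusters = kmeans_clusters(articles, k=2)
```

In this sample, articles 101 and 102 land in one cluster and articles 103 and 104 in the other.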

Recommendations Delivery

Now that the base calculations are complete, you can serve customer recommendations. The first step is to move the recommendations to a data store that can deliver them almost instantaneously. The Redis and Riak data stores are excellent choices, as they are fast and easy to work with. Redis in particular offers data structures, such as sorted sets, and commands that suit ranked recommendations well. If you are delivering recommendations in a batch scenario (e.g., email or direct mail marketing), the native Hadoop file system (HDFS) is fine.
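As a sketch of how recommendations might be kept in Redis, the functions below store one sorted set per customer, with Article_IDs as members and scores as sort keys. They expect a redis-py style client (for example `redis.Redis()`) and use its `zadd` and `zrevrange` commands. The `recs:<Customer_ID>` key naming and the function names are assumptions for this example.

```python
def recs_key(customer_id: int) -> str:
    """Hypothetical key naming scheme: one sorted set per customer."""
    return f"recs:{customer_id}"


def store_recommendations(client, customer_id: int, scored: dict) -> None:
    """Write {article_id: score} into the customer's sorted set."""
    if scored:  # an empty mapping is not a valid ZADD call
        mapping = {str(article): score for article, score in scored.items()}
        client.zadd(recs_key(customer_id), mapping)


def top_recommendations(client, customer_id: int, n: int = 5) -> list:
    """Read the n highest-scored articles, best first."""
    rows = client.zrevrange(recs_key(customer_id), 0, n - 1, withscores=True)
    # Members come back as bytes or str depending on client settings;
    # int() accepts both.
    return [(int(member), score) for member, score in rows]
```

A sorted set keeps the articles ordered by score on the server, so the read path is a single range query.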

When using the recommendation results in this example, we make the following design assumptions:

  • An article that is in the same cluster as the one being viewed, and identical to it in coordinates (aka IDs), is ranked highest (5 out of 5).
  • An article that is the most highly rated by the user’s peers is ranked highest (5 out of 5).
  • An article with the highest item-similarity score for a given article also will be ranked highest (5 out of 5).
  • You use all three of these scores together, additively, to produce the final score (between 0 and 15).

Using these assumptions, here is what happens when a request for recommendation occurs:

  • A customer views an article online and you receive the Customer_ID and Article_ID for the article being viewed.
  • Retrieve and compare the coordinates of the article and all those in the same cluster to see which ones are closest (using vector algebra such as Euclidean distance). These numbers need to be scaled to the 0-5 ranking.
  • Retrieve all peer recommendations and scale them to the 0-5 ranking based on peer rating.
  • Retrieve all item-similarity recommendations and scale them to the 0-5 ranking based on similarity scores.
  • Join together the three sets of scores, add them to get each article’s final ranking, and bring back the most highly rated articles.
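The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's delivery code: the linear scaling to 0-5 (dividing by the maximum), the function names, and the sample data are all hypothetical.

```python
from math import dist


def scale_to_five(values: dict) -> dict:
    """Scale scores linearly so the highest value becomes 5."""
    top = max(values.values(), default=0)
    return {k: (5.0 * v / top if top > 0 else 0.0) for k, v in values.items()}


def closeness_to_five(distances: dict) -> dict:
    """Turn distances into 0-5 scores: distance 0 gets 5, the farthest gets 0."""
    far = max(distances.values(), default=0)
    return {k: (5.0 * (1 - d / far) if far > 0 else 5.0)
            for k, d in distances.items()}


def final_scores(viewed_vec, cluster_vecs, peer_ratings, item_sims, top_n=3):
    """Add the three 0-5 scores per article (0-15 total) and return the best."""
    # Euclidean distance from the viewed article to each article in its cluster.
    distances = {a: dist(viewed_vec, v) for a, v in cluster_vecs.items()}
    parts = [closeness_to_five(distances),
             scale_to_five(peer_ratings),
             scale_to_five(item_sims)]
    totals = {}
    for part in parts:
        for article, score in part.items():
            totals[article] = totals.get(article, 0.0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


ranked = final_scores(
    viewed_vec=(1, 1),
    cluster_vecs={20: (1, 1), 21: (4, 5)},  # same cluster as the viewed article
    peer_ratings={20: 2.5, 22: 5.0},        # from the item-based recommender
    item_sims={21: 1.0, 22: 0.25},          # from item similarity
)
```

An article missing from one of the three sources simply contributes 0 for that source, which is how the result ends up as a mix of similar and peer-based articles.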

The result returned to the customer has a natural mix of similar and peer-based results.

This technique and architecture are gracefully extensible: you don’t have to stop at these algorithms, and can add others based on raw popularity and/or something that reflects customers’ feedback. You can easily swap or replace algorithms, or turn them on and off based on circumstances. If peer recommendations are too volatile for a new customer, you can turn them off for the first couple of days, relying only on clustering and popularity.
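One simple way to get this kind of switchability is to keep each algorithm's scores under a name and add up only the enabled ones. The sketch below assumes hypothetical algorithm names and a two-day warm-up period for new customers.

```python
def enabled_algorithms(customer_age_days: int, warmup_days: int = 2) -> set:
    """Choose which algorithms contribute for this customer."""
    algorithms = {"cluster", "item_similarity", "popularity", "peer"}
    if customer_age_days < warmup_days:
        # Peer recommendations are too volatile for brand-new customers.
        algorithms.discard("peer")
    return algorithms


def combine(scores_by_algorithm: dict, enabled: set) -> dict:
    """Add up per-article scores from the enabled algorithms only."""
    totals = {}
    for name, scores in scores_by_algorithm.items():
        if name not in enabled:
            continue
        for article, score in scores.items():
            totals[article] = totals.get(article, 0.0) + score
    return totals


scores = {
    "cluster": {1: 5.0, 2: 1.0},
    "peer": {2: 5.0},
    "popularity": {1: 1.0, 2: 2.0},
}
new_customer = combine(scores, enabled_algorithms(customer_age_days=0))
established = combine(scores, enabled_algorithms(customer_age_days=30))
```

Adding another algorithm then means adding one more named entry to the scores, with no change to the combining logic.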

Big data provides an inexpensive, scalable, and flexible platform for a cutting-edge recommendation engine. The engine can be built on commodity hardware and scaled close to linearly to terabytes and even petabytes of source data. When you leverage standard Mahout algorithms, very little coding is needed beyond application-specific requirements and delivery.

Customers are inundated with more choices than ever before and a big data recommendation engine helps deliver a great customer experience and improve an organization’s bottom line.

Elliott Cordo is a big data, data warehouse, and information management expert with a passion for helping transform data into powerful information. He has more than a decade of experience in implementing big data and data warehouse solutions with hands-on experience in every component of the data warehouse software development lifecycle. As chief architect at Caserta Concepts, Elliott oversees large-scale major technology projects, including those involving business intelligence, data analytics, big data, and data warehousing.

Elliott is recognized for his many successful big data projects ranging from big data warehousing and machine learning to his personal favorite, recommendation engines. His passion is helping people understand the true potential in their data, working hand-in-hand with clients and partners to learn and develop cutting edge platforms to truly enable their organizations.

 Caserta has developed a quick-start toolkit for building a Big Data Recommender. It can be found at
