Apache Solr was originally architected, and is best known, as a search engine. But it can also serve very well as a general-purpose database. The Search Based Application (SBA) movement presents opportunities for search technologies like Solr to serve as the data access layer and data store for a wide array of applications. Instead of using Solr only to index the data in your relational or NoSQL database, you simply use Solr as the database itself.
In particular, Solr really shines as an analytics system.
I would not propose that it entirely replace your existing analytic infrastructure, such as an Enterprise Data Warehouse. But Solr can complement your existing system and serve as a presentation layer database for some of your more complex challenges. That is why I present Solr as a data mart, a usage-specific representation of your Enterprise Data Warehouse.
Here are a few top features for Solr-based analytics:
Query flexibility. The Solr query language is a superset of the Lucene query language. It provides much of SQL’s capability, including filtering, grouping, and aggregation, and it handles many important tasks that SQL does poorly, such as geospatial and fuzzy searching.
Search is front and center. In many analytics use cases, textual data is a dominant navigational element. In a relational database, if your primary dimensions are product names, persons, places, or even free text, you are forced to rely on “like” conditions and user-defined functions. Not only do these options generally perform poorly, but you may also fail to find the data you were looking for. Solr’s search-engine heritage makes searching and navigating text very efficient.
Low latency analytics. A well-tuned Solr solution will consistently deliver low millisecond response time. In addition, Solr can ingest incoming data in near real-time using tools like Logstash and Flume.
Quick-win dashboard. Lucidworks recently released an open source project called Banana, a dashboarding tool built to work with a Solr back end. It is easy to deploy and quick to configure and customize. Best of all, it has an attractive, easy-to-use interface.
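To make the query flexibility above concrete, here is a minimal sketch of a Solr select request combining a fuzzy search, a filter, and a facet count. The collection name (`sales`) and field names (`product_name`, `region`, `category`) are illustrative assumptions, not from the article; the parameters themselves (`q`, `fq`, `facet.field`, `rows`) are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical collection and field names, for illustration only.
params = {
    "q": "product_name:widget~2",  # fuzzy match: tolerates up to 2 edits (typos)
    "fq": "region:northeast",      # filter query, akin to a SQL WHERE clause
    "facet": "true",
    "facet.field": "category",     # per-value counts, akin to SQL GROUP BY
    "rows": "0",                   # return only the aggregates, no documents
}
url = "http://localhost:8983/solr/sales/select?" + urlencode(params)
```

Because Solr queries are just HTTP GET requests, the same URL works from a browser, curl, or any programming language.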
Interactive BI on the Open Web
Solr powers search and navigation for some of the most highly trafficked web properties on the Internet. It is both vertically scalable (adding bigger servers) and horizontally scalable, through manual sharding or SolrCloud. This scalability and track record show that Solr is suitable for high-concurrency applications.
Solr Meets the High-performance Challenge
Using Solr as an analytics system is not a new or fringe idea. Search has been used for years to mitigate challenges with query flexibility and performance in emerging big data technologies.
NoSQL databases generally offer very limited query flexibility, and aggregation is a rare feature. Hadoop, although extremely flexible in its analytic capabilities, does not provide fast interactive queries. By indexing high-value or frequently used data in Solr, analytics users can regain the capabilities they were missing. Now that these technologies are more widely accepted, I propose that Solr will become a more mainstream option for this use case and will serve as a successful analytics solution alongside the traditional world of relational and massively parallel processing (MPP) databases.
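As a sketch of the group-by-and-aggregate capability discussed above, the request body below uses Solr’s JSON Facet API to bucket documents by a field and sum a numeric field per bucket, roughly equivalent to `SELECT category, SUM(price) ... GROUP BY category`. The field names (`category`, `price`) are illustrative assumptions.

```python
import json

# JSON Facet API request body: bucket by each distinct "category" value
# and compute a sum aggregation within each bucket.
request_body = {
    "query": "*:*",           # match all documents
    "limit": 0,               # aggregates only, no document results
    "facet": {
        "by_category": {
            "type": "terms",  # one bucket per distinct field value
            "field": "category",
            "facet": {
                "total_sales": "sum(price)"  # per-bucket aggregation
            },
        }
    },
}
payload = json.dumps(request_body)  # POST this to a collection's /select endpoint
```

Nested facets can be stacked the same way, giving multi-level rollups in a single round trip.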
And the best part is that Solr is free and open source. There is a huge community supporting it, and a large number of open-source projects are devoted to expanding its capabilities. Additionally, implementing a Solr-based analytics solution is highly productive, as Solr is easy to set up, administer, and learn. Your average techie can see results with Solr within a day or two. No specific programming knowledge is required for data ingestion or for issuing queries, as Solr communicates primarily over HTTP.
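To illustrate HTTP-based ingestion, here is a minimal sketch of an update request: Solr accepts a JSON array of documents POSTed to a collection’s /update handler, so no client library is required. The collection name (`sales`) and document fields are illustrative assumptions; the request is built but not sent, since no Solr server is assumed to be running.

```python
import json
from urllib.request import Request

# One illustrative document; an array may carry many documents per request.
docs = [{"id": "1001", "category": "books", "price": 12.99}]

# commit=true makes the new document searchable immediately.
req = Request(
    "http://localhost:8983/solr/sales/update?commit=true",
    data=json.dumps(docs).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it to a running Solr instance.
```

The same request could be issued with curl or any HTTP client, which is why no language-specific driver is needed.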
The term Polyglot Persistence presents the idea that having a reasonable number of database/storage technologies within an organization is not just okay, but actually will help you solve problems that can’t be solved by a “one size fits all” approach. Solr’s unique set of capabilities can benefit your organization – not by replacing what’s there, but by providing new features and opportunities.
Elliott Cordo is a big data, data warehouse, and information management expert with a passion for helping transform data into powerful information. He has more than a decade of experience in implementing big data and data warehouse solutions with hands-on experience in every component of the data warehouse software development lifecycle. As chief architect at Caserta Concepts, Elliott oversees large-scale major technology projects, including those involving business intelligence, data analytics, big data, and data warehousing.
Elliott is recognized for his many successful big data projects ranging from big data warehousing and machine learning to his personal favorite, recommendation engines. His passion is helping people understand the true potential in their data, working hand-in-hand with clients and partners to learn and develop cutting edge platforms to truly enable their organizations.
See Caserta Concepts’ Github repository for a Solr + Banana quick-start project.