In the previous chapter, we learned about Solr internals and the creation of custom queries. We understood the algorithms behind the working of the AND and OR clauses in Solr and the internals of the eDisMax parser. We implemented our own plugin in Solr for running a proximity search by using SWAN queries. We also understood the internals of how filters work.
In this chapter, we will discuss how and why Solr is an appropriate choice for churning out analytical reports. We will understand the concept of big data and how Solr can be used to solve the problems that come with running queries on big data. We will also discuss different faceting concepts and see how distributed pivot faceting works.
The topics that we will cover in this chapter are:
- Big data and the challenges of running queries on it
- Solr and SolrCloud for big data analytics
- Faceting concepts, including distributed pivot faceting
Big data can simply be defined as data that is too large to be processed by a single machine. Let us say that we have 1 TB of data and the reports that need to be generated from it cannot be produced on a single machine within a time span acceptable to us. Let us take the example of click stream analysis. Internet companies such as Yahoo or Google keep an eye on user activity by capturing every click a user makes on their websites. Sometimes the complete page-by-page flow is also captured. Google, for example, captures the position (from the top of the search result page) of the result that is clicked for a particular keyword or phrase. The amount of data generated and captured is huge and may run into exabytes every day. This data needs to be processed on a day-to-day basis for analytical purposes. The analytical reports generated from this data are used to improve the experience of users visiting the website.
Is it possible to process an exabyte of data? Of course it is, but the real challenge is to process an exabyte of data every day so that no backlog builds up. This requires huge processing power, generally distributed over a number of machines. A few factors that contribute to data being termed big data are:
- Volume: the sheer amount of data that has to be stored and processed
- Velocity: the rate at which new data keeps arriving and has to be handled
- Variety: the different formats and structures in which the data arrives
Now, how does Solr handle big data? Let us say that we are doing a click stream analysis, and since the amount of data is huge, we are gathering it from different machines, collating it, and processing it. We have the complete system set up so that we can handle the volume and velocity, and we are generating daily analytical reports that consist of certain data points. On a particular day, the analytics team wants to add a new data point and view analytical reports of all the past data with this added data point. What do we do? With pre-generated static reports, it is practically impossible to reprocess the past data to derive the new data point. If we stick with the previous architecture, we would have to set up a huge number of machines to generate the new data point for all the past data.
Would it make sense to parse the incoming data and store it so that reports can be generated on the fly? A new data point, or a mix of multiple data points, could then be computed dynamically whenever needed. Instead of generating static reports, it makes sense to store the data and run queries to generate a report as and when required. However, a single Solr machine cannot handle such volume and velocity.
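As a minimal sketch of this idea, assuming a hypothetical clickstream index with fields such as id, url, country, and timestamp, and a Solr instance at http://localhost:8983/solr (the collection name, fields, and URL are illustrative, not from the original text), indexing a raw click event and computing a report on the fly with the SolrJ client might look like this:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ClickstreamReportSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical core/collection named "clickstream" on a local Solr instance.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/clickstream").build();

        // Store the raw click event itself, not a pre-aggregated report row.
        SolrInputDocument event = new SolrInputDocument();
        event.addField("id", "evt-1");
        event.addField("url", "/search?q=solr");
        event.addField("country", "IN");
        event.addField("timestamp", "2015-01-01T10:15:30Z");
        solr.add(event);
        solr.commit();

        // Generate a report on the fly: clicks per country, computed at query time.
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);                // we only need the facet counts, not the documents
        query.setFacet(true);
        query.addFacetField("country");  // any stored field can become a data point on demand
        QueryResponse response = solr.query(query);
        response.getFacetField("country").getValues()
                .forEach(c -> System.out.println(c.getName() + " -> " + c.getCount()));

        solr.close();
    }
}
```

Because the raw events are kept in the index, adding a new data point later is just a matter of querying or faceting on another field, rather than reprocessing all the historical data.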
SolrCloud, which will be discussed in a later chapter, comes to our rescue here. It is the perfect tool for distributed data collection and processing. Using SolrCloud, we can have multiple Solr nodes into which data is fed and across which queries are processed. SolrCloud is horizontally scalable, which means that as the data grows, all we need to do is add more machines to the cloud.
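To illustrate the horizontal scaling point, here is a rough sketch of creating a sharded, replicated collection for the click stream data through SolrJ's Collections API. The ZooKeeper address, the configset name clickstream_conf, and the shard and replica counts are assumptions made for this example, and the exact Builder signature varies slightly between SolrJ versions:

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateClickstreamCollection {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble that the SolrCloud nodes are registered with.
        CloudSolrClient cloud = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty()).build();

        // Spread the click stream data over 4 shards with 2 replicas each.
        // When the data grows, we add nodes and more shards or replicas instead of
        // scaling up a single machine.
        CollectionAdminRequest
                .createCollection("clickstream", "clickstream_conf", 4, 2)
                .process(cloud);

        cloud.close();
    }
}
```

Requests sent through the CloudSolrClient are then routed across the shards, so both data collection and report generation are distributed over the cluster.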
Let us look at some advanced faceting functionalities of Solr for generating the required reports.