In the previous chapter, we learned about Solr internals and the creation of custom queries. We understood the algorithms behind the working of the AND and OR clauses in Solr and the internals of the eDisMax parser. We implemented our own plugin in Solr for running a proximity search by using SWAN queries. We also understood the internals of how filters work.
In this chapter, we will discuss how and why Solr is an appropriate choice for churning out analytical reports. We will understand the concept of big data and how Solr can be used to solve the problems that come with running queries on big data. We will also discuss different faceting concepts and see how distributed pivot faceting works.
The topics that we will cover in this chapter are:
- Big data and the challenges of running queries on it
- Solr and SolrCloud for big data analytics
- Faceting concepts, including distributed pivot faceting
Big data can simply be defined as data that is too large to be processed by a single machine. Let us say that we have 1 TB of data and the reports that need to be generated from it cannot be produced on a single machine within a time span acceptable to us. Let us take the example of click stream analysis. Internet companies such as Yahoo or Google keep an eye on user activity by capturing every click a user makes on their websites. Sometimes the complete page-by-page flow is also captured. Google, for example, captures the position (from the top of the search result page) of the result that is clicked for a particular keyword or phrase. The amount of data generated and captured is huge and may run into exabytes every day. This data needs to be processed on a day-to-day basis for analytical purposes. The analytical reports generated from this data are used to improve the experience of users visiting the website.
Is it possible to process an exabyte of data? Of course it is, but the real challenge is to process an exabyte of data every day so that no backlog builds up. This requires huge processing power, generally distributed over a number of machines. A few factors that contribute to data being termed big data are:
- Volume: the sheer amount of data that has to be stored and processed
- Velocity: the rate at which new data keeps arriving and has to be handled
- Variety: the different formats and structures in which the data arrives
Now, how does Solr handle big data? Let us say that we are doing a click stream analysis, and since the amount of data is huge, we are gathering it from different machines, collating it, and processing it. We have the complete system set up so that we can handle the volume and velocity, and we are generating daily analytical reports that consist of certain data points. On a particular day, the analytics team wants to add a new data point and view analytical reports of all the past data with this added data point. What do we do? With pre-generated static reports, it is practically impossible to reprocess the past data to derive the new data point. If we stick with the previous architecture, we would have to set up a huge number of machines to generate the new data point for all the past data.
Would it make sense to parse the incoming data and store it so that reports can be generated on the fly? A new data point, or a mix of multiple data points, could then be computed dynamically whenever needed. Instead of generating static reports, it makes sense to store the data and run queries to generate a report as and when required. However, a single Solr machine cannot handle such volume and velocity.
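As a minimal sketch of this idea, assuming a hypothetical clickstream index with fields such as id, url, country, and timestamp, and a Solr instance at http://localhost:8983/solr (the collection name, fields, and URL are illustrative, not from the original text), indexing a raw click event and computing a report on the fly with the SolrJ client might look like this:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class ClickstreamReportSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical core/collection named "clickstream" on a local Solr instance.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/clickstream").build();

        // Store the raw click event itself, not a pre-aggregated report row.
        SolrInputDocument event = new SolrInputDocument();
        event.addField("id", "evt-1");
        event.addField("url", "/search?q=solr");
        event.addField("country", "IN");
        event.addField("timestamp", "2015-01-01T10:15:30Z");
        solr.add(event);
        solr.commit();

        // Generate a report on the fly: clicks per country, computed at query time.
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);                // we only need the facet counts, not the documents
        query.setFacet(true);
        query.addFacetField("country");  // any stored field can become a data point on demand
        QueryResponse response = solr.query(query);
        response.getFacetField("country").getValues()
                .forEach(c -> System.out.println(c.getName() + " -> " + c.getCount()));

        solr.close();
    }
}
```

Because the raw events are kept in the index, adding a new data point later is just a matter of querying or faceting on another field, rather than reprocessing all the historical data.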
SolrCloud, which will be discussed in a later chapter, comes to our rescue here. It is the perfect tool for distributed data collection and processing. Using SolrCloud, we can have multiple Solr nodes into which data is fed and across which queries are processed. SolrCloud is horizontally scalable, which means that as the data grows, all we need to do is add more machines to the cloud.
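To illustrate the horizontal scaling point, here is a rough sketch of creating a sharded, replicated collection for the click stream data through SolrJ's Collections API. The ZooKeeper address, the configset name clickstream_conf, and the shard and replica counts are assumptions made for this example, and the exact Builder signature varies slightly between SolrJ versions:

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateClickstreamCollection {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble that the SolrCloud nodes are registered with.
        CloudSolrClient cloud = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty()).build();

        // Spread the click stream data over 4 shards with 2 replicas each.
        // When the data grows, we add nodes and more shards or replicas instead of
        // scaling up a single machine.
        CollectionAdminRequest
                .createCollection("clickstream", "clickstream_conf", 4, 2)
                .process(cloud);

        cloud.close();
    }
}
```

Requests sent through the CloudSolrClient are then routed across the shards, so both data collection and report generation are distributed over the cluster.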
Let us look at some advanced faceting functionalities of Solr for generating the required reports.