
Implementing the information gain model

The problem with the information gain model is that, for each term in the index, we have to evaluate its co-occurrence with every other term. The complexity of the algorithm is therefore quadratic in the number of terms, O(n²), which makes it impractical to compute on a single machine. The recommended approach is to create a MapReduce job and use a distributed Hadoop cluster to compute the information gain for each term in the index.

Our distributed Hadoop cluster would do the following:

  • Count the occurrences of each term in the index
  • Count the co-occurrences of each pair of terms in the index (see the counting job sketched after this list)
  • Construct a hash table or a map of co-occurring terms
  • Calculate the information gain for each term and store it in a file on the Hadoop cluster
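
The counting steps map naturally onto a single MapReduce pass. The following is a minimal sketch, assuming plain Hadoop MapReduce with one analyzed document (whitespace-separated terms) per input line; the class names are illustrative, and the information gain itself would be computed by a follow-up job that applies the formula from the previous section to these counts.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Counts term and term-pair (co-occurrence) frequencies across documents. */
public class CooccurrenceCountJob {

  public static class CooccurrenceMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // One analyzed document per line; de-duplicate terms within the document.
      Set<String> terms = new HashSet<>(Arrays.asList(value.toString().split("\\s+")));
      for (String x : terms) {
        outKey.set(x);                      // document frequency of the single term
        context.write(outKey, ONE);
        for (String y : terms) {
          if (!x.equals(y)) {
            outKey.set(x + "\t" + y);       // co-occurrence of the pair (x, y)
            context.write(outKey, ONE);
          }
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "term-cooccurrence-count");
    job.setJarByClass(CooccurrenceCountJob.class);
    job.setMapperClass(CooccurrenceMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A second job would then read these counts and apply the information gain formula to produce one term-to-gain entry per term; that output is the file the custom scorer described next reads.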

In order to implement this in our scoring algorithm, we will need to build a custom scorer in which the IDF calculation is overridden by a lookup of the information gain computed for the term on the Hadoop cluster. If we have a huge index, we will have information gain values for most of the terms in the index. However, there can still be cases where a term is not present in the information gain files on the Hadoop cluster. In such cases, we would like to fall back on the original IDF calculation or return a default value. This may skew some scores, as the IDF values may not be on a scale comparable with the information gain values.
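
A minimal sketch of such a scorer, assuming the Lucene 4.x-era DefaultSimilarity/TFIDFSimilarity API that Solr shipped with at the time of writing; the in-memory map is an illustrative stand-in for whatever loads the Hadoop output, and the names InfoGainSimilarity and infoGain are assumptions, not part of Solr.

import java.util.Map;

import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.search.similarities.DefaultSimilarity;

/**
 * Replaces the IDF component with a precomputed information gain value,
 * falling back to the default IDF when a term is missing from the lookup.
 */
public class InfoGainSimilarity extends DefaultSimilarity {

  // Hypothetical in-memory lookup loaded from the Hadoop output files.
  private final Map<String, Float> infoGain;

  public InfoGainSimilarity(Map<String, Float> infoGain) {
    this.infoGain = infoGain;
  }

  @Override
  public Explanation idfExplain(CollectionStatistics collectionStats,
                                TermStatistics termStats) {
    String term = termStats.term().utf8ToString();
    Float gain = infoGain.get(term);
    if (gain == null) {
      // Term not present in the Hadoop output: fall back to classic IDF.
      return super.idfExplain(collectionStats, termStats);
    }
    return new Explanation(gain, "information gain(term=" + term + ")");
  }
}

To wire this into Solr, a small SimilarityFactory subclass would load the map (for example, from a file placed in the core's configuration directory) and return an InfoGainSimilarity instance; that factory is then referenced from the field type shown in the next snippet.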

Once we have the custom similarity ready, we will have to define a field type that uses it, add a new field of that type, and use a copyField directive to copy the original field into it. The schema would then contain multiple copies of the same content, each scored with a different implementation of the similarity class.
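
A possible schema.xml arrangement under these assumptions is shown below; the field names, type names, and com.example.solr.InfoGainSimilarityFactory are illustrative, and a per-field-type <similarity> only takes effect when the global similarity is solr.SchemaSimilarityFactory.

<!-- Original field type keeps the stock TF-IDF similarity -->
<fieldType name="text_tfidf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Same analysis chain, but scored with the custom similarity -->
<fieldType name="text_infogain" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="com.example.solr.InfoGainSimilarityFactory"/>
</fieldType>

<field name="description"    type="text_tfidf"    indexed="true" stored="true"/>
<field name="description_ig" type="text_infogain" indexed="true" stored="false"/>

<!-- Index the same text into both fields so the two scorers can be compared -->
<copyField source="description" dest="description_ig"/>

With this in place, the same document text is indexed twice, once per similarity, which is exactly what the A/B test described next needs.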

In order to determine whether our implementation of the similarity class is more beneficial to the users, we can perform A/B testing. We already have multiple copy fields, each with its own similarity class implementation. We can divide our app servers into two groups: one serving queries from the field that implements the information gain model and the other serving queries from the field that uses the default IDF model. We can then measure the response or conversion ratio (for an e-commerce site) of both implementations and decide which one is more beneficial for us.
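
One way to route queries for such a test, sketched with SolrJ; the 50/50 hash bucketing, the field names, and the abVariant tag are assumptions, and the same split could equally be made at the load balancer or app-server level described above.

import org.apache.solr.client.solrj.SolrQuery;

/** Chooses which copy field a query is scored against, based on the user's test bucket. */
public class AbTestQueryBuilder {

  // Hypothetical bucketing: half the users get information gain scoring.
  private static boolean inInfoGainBucket(String userId) {
    return Math.abs(userId.hashCode()) % 2 == 0;
  }

  public static SolrQuery buildQuery(String userText, String userId) {
    boolean infoGain = inInfoGainBucket(userId);
    SolrQuery query = new SolrQuery(userText);
    // The default field decides which copy field (and therefore which similarity) scores the query.
    query.set("df", infoGain ? "description_ig" : "description");
    // Tag the variant as an extra request parameter so it shows up in the Solr request
    // logs and can later be joined against click or conversion data.
    query.set("abVariant", infoGain ? "infogain" : "tfidf");
    return query;
  }
}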

The A/B testing methodology is very useful for making data-driven decisions about which implementation works better for the business. We can test with live users, where some users are exposed to a particular algorithm or flow while others are exposed to a different algorithm or site flow. It is very important to put evaluation metrics in place so that the output of each test can be measured separately. A/B testing is an effective way to run new concepts side by side and determine which one is more successful.
