Chapter 5. Solr in E-commerce

In this chapter, we will discuss in depth the problems faced when implementing Solr for search on an e-commerce website. We will look at these problems and their solutions, along with the areas where optimizations may be necessary. We will also look at semantic search and how it can be implemented in an e-commerce scenario. The topics covered in this chapter are as follows:

Designing an e-commerce search
Handling unclean data
Handling variations (such as size and color) in the product
Sorting
Problems and solutions of flash sale searches
Faceting with the option of multi-select
Faceting with hierarchical taxonomy
Faceting with size
Implementing semantic search
Optimizations that we can look into

Designing an e-commerce search

E-commerce search is special. For us, Lucene search combines the Boolean model of information retrieval with the vector space model. For an end user or customer, however, a search on an e-commerce website is supposed to be simple. A customer will not make a field-specific search but will simply type what he or she wants.

Suppose a customer is looking for a pink sweater. The search conducted on the e-commerce website will be pink sweater rather than +color:pink +type:sweater, which is how the same request would be written in Solr query syntax. It is our search that has to figure out how to return results so that whatever is being searched for is available to the customer. The problem with e-commerce search is that most queries are written as if the results were being retrieved from a bag of words or plain text documents, whereas the catalog is organized into fields. It is therefore important for us to categorize and rank the results so that whatever is being searched for is present in the result set.

On a broad note, the following fields can be observed on an e-commerce website catering for clothes:

 Category: Clothes
 Brand: levis
 Gender: Mens
 Type: Jeans
 Size: 34
 Fitting: Regular
 Occasion: Casual
 Color: Blue

The following fields could be observed if the e-commerce website caters for electronics, especially mobiles:

 Category: Mobile
 Brand: Motorola
 OS: Android
 Screen size: 4
 Camera: 5MP
 Color: Black

What about the fields for electronics such as laptops?

 Category: Laptop
 Brand: Lenovo
 Processor: intel i5
 Screen size: 14 inch
 Memory: 4GB
 OS: Windows 8
 Hard disk: 500 GB
 Graphics Card: Nvidia
 Graphics card memory: 2 GB DDR5

The point we are arriving at is that the scope of fields on any e-commerce website is huge. On the basis of the categories that we intend to serve through our website, we need to list the fields for each category and then decide which of them need to be tokenized. Suppose that the website we intend to create caters only to mobiles and laptops. Then, the number of fields is limited to the union of the fields of both categories. As we add more and more categories, we will need to add more and more fields specific to those categories. Dynamic fields come to our rescue as the number of fields on an e-commerce website increases.

Another important point is to decide which fields will serve as facets, as we will have to keep a separate field for each facet. Let us take the example of a website catering to the three categories we discussed earlier and design the schema for it.

Each product will have a unique ID, known as its SKU. A Stock Keeping Unit (SKU) is a unique identifier for each product in e-commerce. It is recommended to use the SKU as the unique key in Solr, as it is the key that references each product in an e-commerce catalog. This would be an indexed, but not tokenized, field:

<field name="sku" type="lowercase" indexed="true" stored="true" omitNorms="true"/>

In this case, the lowercase field type is defined as follows:

<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

Next, we define the category, which is a string and is again non-tokenized. Note that we have set multiValued as true, which is a provision for allowing a single product to belong to multiple categories:

<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>

The brand and product name fields are whitespace tokenized and converted to lowercase:

<field name="name" type="wslc" indexed="true" stored="true"/>
<field name="brand" type="wslc" indexed="true" stored="true"/>

<fieldType name="wslc" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Notice that we have applied separate logic for indexing and for querying on these fields, and that the query logic also includes synonyms and stop words. During indexing, the text is simply tokenized on whitespace and lowercased. During search, however, we use stop words to remove unwanted tokens from the search query and synonyms to map certain words to words with similar meanings, which yields more relevant results. If the user mistypes certain words, synonyms can be used to map common mistakes to the relevant words in the index. They can also be used to map short names to full words. For example, shirt could be mapped to t-shirt, polo, and so on, so that the search results for shirt contain t-shirts, polos, and other variations of shirts. This would be a one-way mapping, which means that t-shirt and polo are not mapped back to shirt. Performing the reverse mapping would give irrelevant results.
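The one-way mapping described above can be pictured with a small Python model. This is only an illustration of the expansion behavior, not Solr code: the ONE_WAY_SYNONYMS table and the expand_query_terms helper are hypothetical names introduced for this sketch. In synonyms.txt, an explicit mapping such as shirt => shirt, t-shirt, polo expresses the same idea.

```python
# Hypothetical one-way synonym table: a key expands to the listed terms,
# but the listed terms do not expand back to the key.
ONE_WAY_SYNONYMS = {
    "shirt": ["shirt", "t-shirt", "polo"],
}


def expand_query_terms(query):
    """Lowercase and whitespace-split the query, then expand each term one way."""
    expanded = []
    for term in query.lower().split():
        # Terms without an entry are passed through unchanged.
        expanded.extend(ONE_WAY_SYNONYMS.get(term, [term]))
    return expanded


print(expand_query_terms("Shirt"))  # ['shirt', 't-shirt', 'polo']
print(expand_query_terms("polo"))   # ['polo']
```

A search for shirt is widened to the related garments, while a search for polo stays narrow, which is the behavior the one-way mapping is meant to give.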

Another common field across all these categories is price. This can be defined as follows:

<field name="price" type="float" indexed="true" stored="true"/>

Now that we have all the common fields defined, let's go ahead and define the category-specific fields.

For the clothes category, we would define the following fields:

<field name="clothes_gender" type="string" indexed="true" stored="true"/>
<field name="clothes_type" type="string" indexed="true" stored="true"/>
<field name="clothes_size" type="string" indexed="true" stored="true"/>
<field name="clothes_fitting" type="string" indexed="true" stored="true"/>
<field name="clothes_occasion" type="string" indexed="true" stored="true"/>
<field name="clothes_color" type="string" indexed="true" stored="true"/>

Similarly, we have to define fields for the categories mobile and laptop. We can also define these fields via a dynamicField tag. It is important to note that most of these fields would be used for faceting and in filter queries for narrowing down the results:

<dynamicField name="mobile_*" type="string" indexed="true" stored="true" />
<dynamicField name="laptop_*" type="string" indexed="true" stored="true" />
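As a rough sketch of how an indexing script might take advantage of these dynamic fields, the following Python builds a product document whose category-specific attributes are turned into mobile_* field names. The build_product_doc helper, the SKU, the product name, and the price are all made up for illustration. The attribute values are taken from the mobile example earlier in this section.

```python
import json


def build_product_doc(sku, name, brand, price, category, attributes):
    """Build a document dict; each attribute becomes a <category>_<attribute> field."""
    doc = {"sku": sku, "name": name, "brand": brand, "price": price, "cat": [category]}
    prefix = category.lower()
    for attr, value in attributes.items():
        # "Screen size" -> "mobile_screen_size", which matches the mobile_* pattern.
        field = f"{prefix}_{attr.lower().replace(' ', '_')}"
        doc[field] = value
    return doc


doc = build_product_doc(
    sku="MOB-0001",          # hypothetical SKU
    name="Example phone",    # hypothetical product name
    brand="Motorola",
    price=199.0,             # hypothetical price
    category="Mobile",
    attributes={"OS": "Android", "Screen size": "4", "Camera": "5MP", "Color": "Black"},
)

# A JSON array of documents, ready to be posted by whatever client the script uses.
print(json.dumps([doc], indent=2))
```

No schema change is needed when a new attribute such as mobile_camera appears, because any field name matching mobile_* is accepted by the dynamic field definition.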

Using dynamic fields gives the indexing script the flexibility to add fields during the indexing process. In addition to all these fields, we will also have a text field that collects data from the different fields and makes it available for search. We will also need a separate field for the product description. Both fields are indexed for search but not stored:

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="desc" type="text_general" indexed="true" stored="false" />

We will have to copy the required fields into the text field for generic search:

<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="brand" dest="text"/>
<copyField source="sku" dest="text"/>
<copyField source="clothes_color" dest="text"/>
<copyField source="clothes_type" dest="text"/>

This schema should be sufficient for our use case.

Tip

We are copying only the clothes_color and clothes_type values in our text field for generic search, since we want to provide only color and type as a part of the generic search.

Let us see how we would handle a particular query arriving at our e-commerce website. Suppose a person searches for iphone. The search engine is not aware that the person is searching for a mobile phone, so the search will have to happen across multiple categories. The search engine is also not aware whether the term refers to a category, a brand, or a product name. We will look at a solution for identifying the intent and providing relevant results later in this chapter. For now, let us look at a generic solution for the query:

q=text:iphone cat:iphone^2 name:iphone^2 brand:iphone^2&facet=true&facet.mincount=1&facet.field=clothes_gender&facet.field=clothes_type&facet.field=clothes_size&facet.field=clothes_color&facet.field=brand&facet.field=mobile_os&facet.field=mobile_screen_size&facet.field=laptop_processor&facet.field=laptop_memory&facet.field=laptop_hard_disk&defType=edismax
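The same request can be assembled programmatically. The sketch below is one possible way to do it in Python with the standard library. The build_search_params helper is a hypothetical name, and it assumes a single-word search term that needs no escaping for the Solr query syntax.

```python
from urllib.parse import urlencode

# Facet fields taken from the query above.
FACET_FIELDS = [
    "clothes_gender", "clothes_type", "clothes_size", "clothes_color", "brand",
    "mobile_os", "mobile_screen_size",
    "laptop_processor", "laptop_memory", "laptop_hard_disk",
]


def build_search_params(term):
    """Return the request parameters as (name, value) pairs.

    Assumes `term` is a single word with no Solr special characters.
    """
    q = f"text:{term} cat:{term}^2 name:{term}^2 brand:{term}^2"
    params = [
        ("q", q),
        ("defType", "edismax"),
        ("facet", "true"),
        ("facet.mincount", "1"),
    ]
    # facet.field is repeated once per field, so a list of pairs is used
    # instead of a dict.
    params.extend(("facet.field", field) for field in FACET_FIELDS)
    return params


params = build_search_params("iphone")
print(urlencode(params))  # URL-encoded query string to append to the select URL
```

Keeping the parameters as a list of pairs makes it easy to append further repeated parameters, such as filter queries, later on.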

The output from this query will contain results for iphone. Because matches in the cat, name, and brand fields are boosted, and iphone is the name of a product, results whose name field contains iphone will appear at the top. In addition to this, we will get a lot of facets. As we have provided the parameter facet.mincount=1, only facet values with at least one matching document will be returned. All we need to do is loop through the facets that we got in the result and display them along with a checkbox.
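Looping through the facets could look roughly like the following Python sketch. It assumes the default JSON response layout, in which each entry under facet_counts.facet_fields is a flat list alternating between value and count. The iter_facets helper is a hypothetical name and the sample response is made-up data, not output from a real index.

```python
def iter_facets(response):
    """Yield (field, [(value, count), ...]) for every facet field that has values."""
    facet_fields = response.get("facet_counts", {}).get("facet_fields", {})
    for field, flat in facet_fields.items():
        # Flat layout: [value1, count1, value2, count2, ...]
        pairs = list(zip(flat[0::2], flat[1::2]))
        if pairs:  # skip fields that returned no values
            yield field, pairs


# Made-up response fragment for illustration only.
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "brand": ["apple", 12],
            "mobile_os": ["ios", 12],
            "clothes_gender": [],
        }
    }
}

for field, pairs in iter_facets(sample_response):
    print(field)
    for value, count in pairs:
        print(f"  [ ] {value} ({count})")  # one checkbox per facet value
```

Fields that come back empty, such as clothes_gender in this sample, are skipped so that no empty facet group is rendered.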

Once the user clicks on the checkbox, we will have to rewrite our query with the filter query parameter. Suppose, in the preceding query, the user selects the screen size as 4. Then, the following filter query will be appended to our original Solr query:

fq=mobile_screen_size:4

This will narrow down the results and help the customer get closer to the product he or she is looking for. As the customer selects more and more facets, we will keep adding filter queries, and the search will narrow down to what the customer wants.
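Appending one filter query per selected checkbox can be sketched as follows. The add_filter_queries helper is a hypothetical name, and the quoting rule for values containing spaces is an assumption of this sketch. Values containing double quotes or other special characters would need proper escaping, which is not handled here.

```python
from urllib.parse import urlencode


def add_filter_queries(params, selected):
    """Return a copy of params with one fq pair per selected (field, value)."""
    result = list(params)
    for field, value in selected:
        # Quote values containing whitespace so they are treated as one term.
        if any(ch.isspace() for ch in value):
            value = f'"{value}"'
        result.append(("fq", f"{field}:{value}"))
    return result


base_params = [("q", "iphone")]  # simplified stand-in for the original query
narrowed = add_filter_queries(base_params, [("mobile_screen_size", "4")])
print(urlencode(narrowed))  # q=iphone&fq=mobile_screen_size%3A4
```

Each additional selection adds another fq pair, so the original parameters never have to be rewritten.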