In this chapter, we will discuss in depth the problems faced during the implementation of Solr for search on an e-commerce website. We will look at the related problems and solutions and areas where optimizations may be necessary. We will also look at semantic search and how it can be implemented in an e-commerce scenario. The topics that will be covered in this chapter are listed as follows:
E-commerce search is special. For us, Lucene search combines the Boolean model of information retrieval with the vector space model. For an end user or customer, however, any search on an e-commerce website is supposed to be simple. A customer will not make a field-specific search but will simply type in what he or she wants.
Suppose a customer is looking for a pink sweater. The search conducted on the e-commerce website will be pink sweater instead of +color:pink +type:sweater in Solr query syntax. It is our search that has to figure out how to provide results so that whatever is being searched for is available to the customer. The problem with e-commerce searches is that most searches are run on the assumption that results are retrieved from bag-of-words or plain text documents. However, it is important for us to categorize and rank the results so that whatever is being searched for is present in the result set.
On a broad note, the following fields can be observed on an e-commerce website catering for clothes:
Category: Clothes
Brand: Levi's
Gender: Mens
Type: Jeans
Size: 34
Fitting: Regular
Occasion: Casual
Color: Blue
The following fields could be observed if the e-commerce website caters for electronics, especially mobiles:
Category: Mobile
Brand: Motorola
OS: Android
Screen size: 4
Camera: 5MP
Color: Black
What about the fields for electronics such as laptops?
Category: Laptop
Brand: Lenovo
Processor: Intel i5
Screen size: 14 inch
Memory: 4 GB
OS: Windows 8
Hard disk: 500 GB
Graphics card: Nvidia
Graphics card memory: 2 GB DDR5
The point is that the scope of fields on any e-commerce website is huge. Based on the categories that we intend to serve through our website, we need to list the fields for each category and then decide what needs to be tokenized. Suppose that the website we intend to create caters only for mobiles and laptops. Then the number of fields is limited to the union of the fields of both categories. As we add more categories, we will need to add more fields specific to those categories. Dynamic fields come to our rescue as the number of fields on an e-commerce website increases.
Another important point is to decide which fields would be serving as facets, as we will have to keep a separate field for creating facets. Let us take an example of a website catering to the three categories we discussed earlier and design the schema for it.
Each product will have a unique ID, known as its sku. A Stock Keeping Unit (SKU) is a unique identifier for each product in e-commerce. It is recommended to use the SKU as the unique key in Solr, as it is the unique key referencing each product in an e-commerce catalog. This would be an indexed, but not tokenized, field:
<field name="sku" type="lowercase" indexed="true" stored="true" omitNorms="true"/>
In this case, the lowercase field type is defined as follows:
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Next, we define the category field, which is a string and again non-tokenized. Note that we have set multiValued to true, which allows a single product to belong to multiple categories:
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
The brand and product name fields are tokenized on whitespace and converted to lowercase:
<field name="name" type="wslc" indexed="true" stored="true"/>
<field name="brand" type="wslc" indexed="true" stored="true"/>
<fieldType name="wslc" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Notice that we have applied separate logic for indexing and for querying on these fields, and have included synonyms and stopwords in our query logic. During indexing, the text is simply tokenized on whitespace and lowercased. During search, however, we use stop words to remove unwanted tokens from the query and synonyms to map certain words to words with similar meanings, which yields more relevant results. If the user mistypes certain words, synonyms can map common mistakes to the relevant words in the index. They can also map short names to full words. For example, shirt could be mapped to t-shirt, polo, and so on, so that the search results for shirt will contain t-shirts, polos, and other shirt variations. This would be a one-way mapping, which means that t-shirts and polos cannot be mapped back to shirts. Performing a reverse mapping would give irrelevant results.
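The one-way behavior described above can be illustrated with a small Python sketch. This is not how Solr implements synonym filtering; the table `ONE_WAY_SYNONYMS` and the function `expand_query_tokens` are hypothetical names used only for illustration. In Solr itself, a comparable rule would be written in synonyms.txt as an explicit mapping using the `=>` syntax.

```python
# Hypothetical one-way synonym table, for illustration only.
# A comparable explicit mapping in synonyms.txt would look like:
#   shirt => shirt, t-shirt, polo
ONE_WAY_SYNONYMS = {
    "shirt": ["shirt", "t-shirt", "polo"],
}


def expand_query_tokens(query):
    """Lowercase and whitespace-split the query, then expand each token one way."""
    expanded = []
    for token in query.lower().split():
        # Tokens without an entry pass through unchanged, so "polo" is
        # never mapped back to "shirt".
        expanded.extend(ONE_WAY_SYNONYMS.get(token, [token]))
    return expanded


print(expand_query_tokens("Blue Shirt"))
print(expand_query_tokens("polo"))
```

A search for shirt is widened to t-shirts and polos, while a search for polo stays narrow, which is the asymmetry the one-way mapping is meant to provide.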
Another common field across all these categories is price. This can be defined as follows:
<field name="price" type="float" indexed="true" stored="true"/>
Now that we have all the common fields defined, let's go ahead and define the category-specific fields.
For the clothes category, we would define the following fields:
<field name="clothes_gender" type="string" indexed="true" stored="true"/>
<field name="clothes_type" type="string" indexed="true" stored="true"/>
<field name="clothes_size" type="string" indexed="true" stored="true"/>
<field name="clothes_fitting" type="string" indexed="true" stored="true"/>
<field name="clothes_occassion" type="string" indexed="true" stored="true"/>
<field name="clothes_color" type="string" indexed="true" stored="true"/>
Similarly, we have to define fields for the categories mobile and laptop. We can also define these fields via a dynamicField tag. It is important to note that most of these fields would be used for faceting and in filter queries for narrowing down the results:
<dynamicField name="mobile_*" type="string" indexed="true" stored="true"/>
<dynamicField name="laptop_*" type="string" indexed="true" stored="true"/>
Using dynamic fields gives the indexing script the flexibility to add fields during the indexing process. In addition to all these fields, we will also have a text field that collects data from different fields and makes it available for search. We will also need a separate field to hold the product description:
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="desc" type="text_general" indexed="true" stored="false"/>
We will have to copy the required fields into the text field for generic search:
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="brand" dest="text"/>
<copyField source="sku" dest="text"/>
<copyField source="clothes_color" dest="text"/>
<copyField source="clothes_type" dest="text"/>
This schema should be sufficient for our use case.
Of the category-specific fields, we are copying only the clothes_color and clothes_type values into our text field, since we want to provide only color and type as a part of the generic search.
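As a rough sketch of how an indexing script might take advantage of the category-prefixed and dynamic fields, the following Python function builds a document dictionary matching the schema in this section. The function name `build_product_doc`, its parameters, and the sample SKU are assumptions made for illustration; they are not part of Solr or of the schema itself.

```python
def build_product_doc(sku, name, brand, categories, price, prefix, attributes):
    """Return a dict shaped like the schema in this section.

    `prefix` is the category prefix ("clothes", "mobile", or "laptop") and
    `attributes` holds category-specific values, e.g. {"OS": "Android"}.
    """
    doc = {
        "sku": sku,
        "name": name,
        "brand": brand,
        "cat": list(categories),  # multiValued field
        "price": float(price),
    }
    for key, value in attributes.items():
        # "Screen size" becomes "mobile_screen_size", which is matched by
        # <dynamicField name="mobile_*"> without any schema change.
        field = f"{prefix}_{key.strip().lower().replace(' ', '_')}"
        doc[field] = str(value)
    return doc


# Hypothetical sample product, for illustration only.
doc = build_product_doc(
    "MOTO-123", "Moto G", "Motorola", ["Mobile"], 199,
    "mobile", {"OS": "Android", "Screen size": 4},
)
print(doc)
```

A dictionary like this could then be serialized as JSON and sent to Solr's update handler by whichever client the indexing script uses.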
Let us see how we would perform a search for a particular query coming in on our e-commerce website. Suppose a person searches for iphone. The search engine is not aware that the person is searching for a mobile phone, so the search will have to happen across multiple categories. The search engine also does not know whether the term refers to a category, a brand, or a product name. We will look at a solution for identifying and providing relevant results later in this chapter. Let us look at a generic solution for the query:
q=text:iphone cat:iphone^2 name:iphone^2 brand:iphone^2&facet=true&facet.mincount=1&facet.field=clothes_gender&facet.field=clothes_type&facet.field=clothes_size&facet.field=clothes_color&facet.field=brand&facet.field=mobile_os&facet.field=mobile_screen_size&facet.field=laptop_processor&facet.field=laptop_memory&facet.field=laptop_hard_disk&defType=edismax
The output from this query would contain results for iphone. As iphone is the name of a product, it will be boosted, and results where the name field contains iphone will appear on top. In addition to this, we will get a lot of facets. As we have provided the parameter facet.mincount=1, only facet values with a count of at least one will appear in the result. All we need to do is loop through the facets in the result and display each one with a checkbox.
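The loop over the returned facets might look like the following Python sketch. It assumes the response was requested as JSON and uses the flat `[value, count, value, count, ...]` layout that Solr's JSON response writer produces by default for facet fields. The helper name `facet_pairs` and the sample counts are hypothetical.

```python
def facet_pairs(response, field):
    """Turn Solr's flat [value, count, value, count, ...] list into (value, count) tuples."""
    flat = response.get("facet_counts", {}).get("facet_fields", {}).get(field, [])
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical response fragment, for illustration only.
sample = {
    "facet_counts": {
        "facet_fields": {
            "brand": ["apple", 12, "samsung", 3],
            "mobile_screen_size": ["4", 9, "5", 6],
        }
    }
}

for field in sample["facet_counts"]["facet_fields"]:
    for value, count in facet_pairs(sample, field):
        # Each pair would be rendered as one checkbox in the facet sidebar.
        print(f"[ ] {field}: {value} ({count})")
```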
Once the user clicks on a checkbox, we will have to rewrite our query with the filter query parameter. Suppose, in the preceding query, the user selects the screen size as 4. Then, the following filter query will be appended to our original Solr query:
fq=mobile_screen_size:4
This will narrow down the results and help the customer get closer to the product he or she is looking for. As the customer selects more facets, we keep adding filter queries, and the search narrows down to what the customer wants.
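A minimal Python sketch of this query rewriting is shown below. It rebuilds the generic query from this section and appends one fq parameter per selected facet. The function `build_search_params` and the constant `FACET_FIELDS` are hypothetical names, and the sketch assumes a single-word search term with no Solr special characters; a real application would need to escape user input.

```python
from urllib.parse import urlencode

# Facet fields taken from the generic query in this section.
FACET_FIELDS = [
    "clothes_gender", "clothes_type", "clothes_size", "clothes_color",
    "brand", "mobile_os", "mobile_screen_size",
    "laptop_processor", "laptop_memory", "laptop_hard_disk",
]


def build_search_params(term, selected_facets=()):
    """Build the URL-encoded query string, adding one fq per selected facet."""
    # Assumes `term` is a single word without Solr special characters.
    q = f"text:{term} cat:{term}^2 name:{term}^2 brand:{term}^2"
    params = [
        ("q", q),
        ("defType", "edismax"),
        ("facet", "true"),
        ("facet.mincount", "1"),
    ]
    params += [("facet.field", field) for field in FACET_FIELDS]
    # Each ticked checkbox becomes its own filter query.
    params += [("fq", f"{field}:{value}") for field, value in selected_facets]
    return urlencode(params)


print(build_search_params("iphone", [("mobile_screen_size", "4")]))
```

Keeping each selection in its own fq parameter, rather than folding it into q, leaves the relevance scoring of the main query unchanged while the result set narrows.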