Handling unclean data

What do we mean by unclean data? In the last section, we discussed a customer searching for pink sweater, where pink is the color and sweater is the type of clothing. However, the system or the search engine cannot interpret the input in this fashion. Therefore, in our e-commerce schema design earlier, we created a query that searched across all fields available in the index. We then created a separate copyField class to handle search across fields, such as clothes_color, that are not being searched in the default query.

Now, will our query give good results? What if there is a brand named pink? Then what would the results be like? First of all, we would not be sure whether pink is intended to be the color or the brand. Suppose we say that pink is intended to be the color, but we are also searching across brands and it will contain pink as the brand name. The results will be a mix of both clothes_color and brand. In our query, we are boosting brand, so what happens is we get sweaters from the pink brand in our output. However, the customer was looking for a sweater in pink color.

Now let's think from the user's point of view. It is not necessary that the user has the luxury of going through all the results and figuring out which pink sweater looks interesting. Moreover, the user may be browsing on a mobile, in which case, looping through the results page by page would become very tedious. Thus, we need high precision in our results. Our top results should match with what the user expects.

A way to handle this scenario is to not tokenize during the index time, but only during the search time. Therefore, exact fields such as brand and product name can be kept as it is. However, while searching across these fields, we would need to tokenize our query. Let us alter our schema to handle this scenario.

We had created a fieldType class named wslc to handle the fields brand and name. We will change the tokenizer during index time to keywordTokenizer and leave the tokenizer during query time as it is:

<fieldType name="wslc" class="solr.TextField" positionIncrementGap="100">   <analyzer type="index">     <tokenizer class="solr.KeywordTokenizerFactory"/>     <filter class="solr.LowerCaseFilterFactory" />   </analyzer>   <analyzer type="query">     <tokenizer class="solr.WhitespaceTokenizerFactory"/>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />     <filter class="solr.LowerCaseFilterFactory"/>   </analyzer> </fieldType>

Also, let's change fieldType for clothes_type and clothes_occassion to "wslc" and include these fields in our Solr search query:

<field name="clothes_type" type="wslc" indexed="true" stored="true"/> <field name="clothes_occassion" type="wslc" indexed="true" stored="true"/>

We will also use the eDisMax parser, which we have seen in the earlier chapters. Our query for pink sweater will now be:

q=pink sweater&qf=text cat^2 name^2 brand^2 clothes_type^2 clothes_color^2 clothes_occassion^2&pf=text cat^3 name^3 brand^3 clothes_type^3 clothes_color^3 clothes_occassion^3&fl=*,score&defType=edismax&facet=true&facet.mincount=1&facet.field=clothes_gender&facet.field=clothes_type&facet.field=clothes_size&facet.field=clothes_color&facet.field=brand&facet.field=mobile_os&facet.field=mobile_screen_size&facet.field=laptop_processor&facet.field=laptop_memory&facet.field=laptop_hard_disk

We have given a higher boost to exact phrase matches in our query. Therefore, if a term is found as a phrase in any of the specified fields, the results will be a lot better. Also, now since we are tokenizing only during search and not during indexing, our results would be much better. Suppose we have a brand named pink sweater and it is picked up as a phrase and the document where the brand=pink sweater parameter is boosted higher. Next, documents where clothes_color is pink and clothes_type is sweater are boosted. Let us run a debug query and verify what we just said.

Query: pink sweater

Now even if we give multiple words in our search query, the results should be a lot better. A search for tommy hilfiger green sweater should give very precise results. Cases where results are not available are handled by the default search text field. In case there are no results that match our search query, there would be some results due to the copying of fields into text and the index time tokenization that happens over there.

As we go further in depth in this chapter, we will see more and more ways of indexing and searching data in an e-commerce website.