What do we mean by unclean data? In the last section, we discussed a customer searching for pink sweater
, where pink
is the color and sweater
is the type of clothing. However, the system or the search engine cannot interpret the input in this fashion. Therefore, in our e-commerce schema design earlier, we created a query that searched across all fields available in the index. We then created a separate copyField
class to handle search across fields, such as clothes_color
, that are not being searched in the default query.
Now, will our query give good results? What if there is a brand named pink
? Then what would the results be like? First of all, we would not be sure whether pink
is intended to be the color or the brand. Suppose we say that pink
is intended to be the color, but we are also searching across brands and it will contain pink
as the brand name. The results will be a mix of both clothes_color
and brand
. In our query, we are boosting brand, so what happens is we get sweaters from the pink
brand in our output. However, the customer was looking for a sweater in pink
color.
Now let's think from the user's point of view. It is not necessary that the user has the luxury of going through all the results and figuring out which pink sweater looks interesting. Moreover, the user may be browsing on a mobile, in which case, looping through the results page by page would become very tedious. Thus, we need high precision in our results. Our top results should match with what the user expects.
A way to handle this scenario is to not tokenize during the index time, but only during the search time. Therefore, exact fields such as brand
and product name
can be kept as it is. However, while searching across these fields, we would need to tokenize our query. Let us alter our schema to handle this scenario.
We had created a fieldType
class named wslc
to handle the fields brand
and name
. We will change the tokenizer during index time to keywordTokenizer
and leave the tokenizer during query time as it is:
<fieldType name="wslc" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory" /> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" /> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType>
Also, let's change fieldType
for clothes_type
and clothes_occassion
to "wslc"
and include these fields in our Solr search query:
<field name="clothes_type" type="wslc" indexed="true" stored="true"/> <field name="clothes_occassion" type="wslc" indexed="true" stored="true"/>
We will also use the eDisMax parser, which we have seen in the earlier chapters. Our query for pink sweater
will now be:
q=pink sweater&qf=text cat^2 name^2 brand^2 clothes_type^2 clothes_color^2 clothes_occassion^2&pf=text cat^3 name^3 brand^3 clothes_type^3 clothes_color^3 clothes_occassion^3&fl=*,score&defType=edismax&facet=true&facet.mincount=1&facet.field=clothes_gender&facet.field=clothes_type&facet.field=clothes_size&facet.field=clothes_color&facet.field=brand&facet.field=mobile_os&facet.field=mobile_screen_size&facet.field=laptop_processor&facet.field=laptop_memory&facet.field=laptop_hard_disk
We have given a higher boost to exact phrase matches in our query. Therefore, if a term is found as a phrase in any of the specified fields, the results will be a lot better. Also, now since we are tokenizing only during search and not during indexing, our results would be much better. Suppose we have a brand named pink sweater
and it is picked up as a phrase and the document where the brand=pink sweater
parameter is boosted higher. Next, documents where clothes_color
is pink and clothes_type
is sweater
are boosted. Let us run a debug
query and verify what we just said.
Now even if we give multiple words in our search query, the results should be a lot better. A search for tommy hilfiger green sweater
should give very precise results. Cases where results are not available are handled by the default search text
field. In case there are no results that match our search query, there would be some results due to the copying of fields into text
and the index time tokenization that happens over there.
As we go further in depth in this chapter, we will see more and more ways of indexing and searching data in an e-commerce website.