Semantic search is when the search engine understands what the customer is searching for and provides results that are based on this understanding. Therefore, a search for the term shoes
should display only items that are of type shoes instead of items with a description goes well with black shoes. We could argue that since we are boosting on the fields category
, type
, brand
, color
, and size
, our results should match with what the customer is looking or searching for. However, this might not be the case. Let us take a more appropriate example to understand this situation.
Suppose a customer is searching for blue jeans
where blue
is intended to be the color
and jeans
is the type
of apparel. What if there is a brand
of products called blue jeans
? The results coming from the search would not be as expected by the customer. As all the fields are being boosted by the same boost factor, the results will be a mix of the intended blue
colored jeans
and the products from the brand called blue jeans
.
Now, how do we handle such a scenario? How would the search engine decide what products to show? We saw the same problem earlier with pink sweater
and tried to solve it by proper boosting of fields. In this scenario, brand
, type
, and color
are boosted by the same boost factor. The result is bound to be a mix of what is intended and what is not intended.
Semantic search could come to our rescue. We need the search engine to understand the meaning of blue jeans
. Once the search engine understands that blue
is the color
and jeans
is the type
of clothing, it can boost the color
and type
fields only giving us exactly what we want.
The downside over here is that we have to take a call. We need to understand that blue jeans
is a mix of color
and type
instead of brand
or the other way around. The results that are returned will be heavily dependent on our call. The ideal way would be to give importance to different fields on the basis of the searches our site receives. Suppose that our site receives a lot of brand searches. We would then intend to make the search engine understand that blue jeans
is a brand instead of type
and color
. On the other hand, if we analyze the searches happening on the site and find that the searches are mostly happening on type
and color
, we would have to make the search engine interpret blue jeans
as type
and color
.
Let us try to implement the same.
To make the search engine understand the meaning of the terms, we need a dictionary that maps the terms to their meanings. This dictionary can be a separate index having just two fields, words and meanings. Let us create a separate core in our Solr installation that stores our dictionary:
cd<solr_installation>/example/solr mkdir -p dictionary/data mkdir -p dictionary/conf cp -r collection1/conf/* dictionary/conf
Define the Solr core in solr.xml
file in the <solr_installation>/example/solr
directory by adding the following snippet to the solr.xml
file:
<solr sharedLib="lib"> <cores adminPath="/admin/cores"> <core name="collection1" instanceDir="/collection1"> <property name="dataDir" value="/collection1/data" /> </core> <core name="dictionary" instanceDir="/dictionary"> <property name="dataDir" value="/dictionary/data" /> </core> </cores> </solr>
The format of the solr.xml
file will be changed in Solr 5.0. An optional parameter coreRootDirectory
will be used to specify the directory from where the cores for the current Solr will be auto-discovered. The new format will be as follows:
<solr> <str name="adminHandler" >${adminHandler:org.apache.solr.handler.admin.CoreAdminHandler}< /str> <int name="coreLoadThreads">${coreLoadThreads:3}</int> <str name="coreRootDirectory">${coreRootDirectory:}</str> <str name="managementPath">${managementPath:}</str> <str name="sharedLib">${sharedLib:}</str> <str name="shareSchema">${shareSchema:false}</str> </solr>
Also, as of Solr 5.0, the core discovery process will involve keeping a core.properties
file in each Solr core folder. The format of the core.properties
file will be as follows:
name=core1 shard=${shard:} collection=${collection:core1} config=${solrconfig:solrconfig.xml} schema=${schema:schema.xml} coreNodeName=${coreNodeName:}
Solr parses all the directories inside the directory defined in coreRootDirectory
parameter (which defaults to the Solr home), and if a core.properties
file is found in any directory, the directory is considered to be a Solr core. Also, instanceDir
is the directory in which the core.properties
file was found. We can also have an empty core.properties
file. In this case, the data directory will be in a directory called data
directly below. The name of the core is assumed to be the name of the folder in which the core.properties
file was discovered.
The Solr schema would contain only two fields, key
and value
. In this case, key
would contain the field names and value
would contain the bag of words that identify the field name in key
:
<field name="key" type="lowercase" indexed="true" stored="true" required="true" /> <field name="value" type="wslc" indexed="true" stored="true" multiValued="true" />
Also, the key
field needs to be marked as unique, as the key
contains fields that are unique:
<uniqueKey>key</uniqueKey>
Also, we will need to change the default search field to value
in the solrconfig.xml
file for the dictionary
core. For the request handler named /select
, make the following change:
<str name="df">value</str>
Once the schema is defined, we need to restart Solr to see the new core on our Solr interface.
In order to populate the dictionary
index, we will need to upload the data_dictionary.csv
file onto our index. The following command performs this function. Windows users can use the Solr admin interface to upload the CSV file:
curl "http://localhost:8983/solr/dictionary/update/csv?commit=true&f.value.split=true" --data-binary @data_dictionary.csv -H 'Content-type:application/csv; charset=utf-8'
In order to check the dictionary index, just run a query q=*:*
:
http://localhost:8983/solr/dictionary/select/?q=*:*
We can see each key has multiple strings as its values. For example, brand
is defined by values tommy hilfiger
, levis
, lee
, and wrangler
.
Now that we have the dictionary index, we can query our input string and figure out what the customer is looking for. Let us input our search query blue jeans
against the index and see the output:
http://localhost:8983/solr/dictionary/select/?q=blue%20jeans
We can see that the output contains keys clothes_type
and clothes_color
. Now using this information, we will need to create our actual search query. In this case, the boost will be higher on these two fields than that on the remaining fields. Our query would now become:
q=blue%20jeans&qf=text%20cat^2%20name^2%20brand^2%20clothes_type^4%20clothes_color^4%20clothes_occassion^2&pf=text%20cat^3%20name^3%20brand^3%20clothes_type^5%20clothes_color^5%20clothes_occassion^3&fl=name,brand,clothes_type,clothes_color,score
We have two pairs of jeans in our index, one is blue
and the other is black
in color. The black pair is identified by the brand blue jeans
. Earlier, if we had not incorporated dynamic boosting, the brand blue jeans
would have been boosted higher as it is part of pf in the query. Now that we have identified that the customer is searching for clothes_type
and clothes_color
fields, we increase the boost for these fields in both qf
and pf
. Therefore, even though the keywords blue
and jeans
occur separately in fields clothes_type
and clothes_color
, we get them above the item where brand
is blue jeans
.
This is one way of implementing semantic search. Note that we are performing two searches over here, the first identifies the input fields of the customer's interest and the second is our actual search where the boosting happens dynamically on the basis of the results of the first search.