Now that we know what text tagging is and have seen some algorithms that can be used for text tagging, let us learn how text tagging is done using Solr. There is an open source library, Solr Text Tagger, that can be used for text tagging in Solr.
The library can be referred to at the following link: .
Text tagging via this library involves two layers of FSTs. A word dictionary FST is used to hold each unique word. This enables integers to be used as substitutes for a word (char[]). For example, the word New will be mapped to 13452 and another word, Delhi, will be mapped to 5223316:
New => 13452
Delhi => 5223316
The call to Lucene's FST library Util.getByOutput(<fst object>, 13452) will yield the word New.
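The two directions of this lookup can be imitated with ordinary dictionaries. The following is only a sketch: a real FST shares prefixes and suffixes between entries and is far more compact, and the names WORD_TO_ID, ID_TO_WORD, and word_for_id are hypothetical; they are not part of Lucene or the Solr Text Tagger.

```python
# Stand-in for the word dictionary FST: plain dicts instead of a real FST.
# The IDs are the ones used in the text; every name here is hypothetical.
WORD_TO_ID = {"New": 13452, "Delhi": 5223316}

# Reverse mapping, playing the role of a lookup by output value.
ID_TO_WORD = {word_id: word for word, word_id in WORD_TO_ID.items()}


def word_for_id(word_id):
    """Return the word stored under word_id, or None if the ID is unknown."""
    return ID_TO_WORD.get(word_id)


print(WORD_TO_ID["New"])   # 13452
print(word_for_id(13452))  # New
```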
The second layer is a word phrase FST comprising word ID string keys. In the case of New Delhi, the word phrase FST will be:
New Delhi => [13452, 5223316]
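To make the second layer concrete, the sketch below keys a phrase by its sequence of word IDs and stores it in a nested dictionary (a trie) that stands in for the word phrase FST. This is an illustration under assumptions, not the library's data structure; phrase_key, add_phrase, contains_phrase, and the END marker are hypothetical names.

```python
# Hypothetical stand-in for the two layers: a word dictionary plus a phrase
# structure keyed by word IDs. A nested dict (trie) replaces the real FST.
WORD_TO_ID = {"New": 13452, "Delhi": 5223316}

END = "$end"  # marker key: a complete phrase ends at this node


def phrase_key(phrase):
    """Translate a phrase into its tuple of word IDs."""
    return tuple(WORD_TO_ID[word] for word in phrase.split())


def add_phrase(trie, phrase):
    """Insert the word-ID sequence of phrase into the trie."""
    node = trie
    for word_id in phrase_key(phrase):
        node = node.setdefault(word_id, {})
    node[END] = phrase


def contains_phrase(trie, phrase):
    """Return True only if the full phrase was added to the trie."""
    node = trie
    for word_id in phrase_key(phrase):
        if word_id not in node:
            return False
        node = node[word_id]
    return END in node


phrase_trie = {}
add_phrase(phrase_trie, "New Delhi")
print(phrase_key("New Delhi"))  # (13452, 5223316)
```

Because the keys are integers rather than character arrays, each step through the phrase structure is a single integer comparison, which is the benefit the word dictionary layer provides.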
The tagging algorithm used in the Solr text tagger is a single-pass, or streaming, algorithm. For each input term, the algorithm looks up the term's word ID and creates an FST arc iterator over the word phrase FST. It then appends this iterator to a queue of active iterators and tries to advance every iterator in the queue. The iterators that do not advance are removed.
An FST arc iterator is used to access the transitions leaving an FST state.
The Solr text tagger scans the posted text and looks for matching strings in the Solr index. The tags are kept in a linked list, and each tag carries a start offset and an end offset. A tag starts in an advancing state, without a value. It is advanced with each subsequent matching word, and once it can no longer advance, its value is set. Finally, the linked list is reduced to the tags that are to be emitted.
Let us see an example to understand this:
Iterator linked list queue
Head => New Delhi, city
Head+1 => Delhi, city
Head+2 => City
In this case, Head, containing the New Delhi City phrase, will advance and will be emitted as an output.
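The single-pass behaviour described above can be sketched in a few lines. The code below is a simplified illustration, not the Solr Text Tagger implementation: a nested dictionary stands in for the FSTs, word IDs are assigned sequentially, offsets are word positions rather than character offsets, and the reduction step (dropping tags contained in a longer tag) is just one assumed policy. The indexed phrases and all function names are hypothetical.

```python
END = "$end"  # marker key: a complete phrase ends at this node


def build_index(phrases):
    """Build a word -> ID dict and a trie keyed by word IDs (FST stand-ins)."""
    word_ids, trie = {}, {}
    for phrase in phrases:
        node = trie
        for word in phrase.split():
            word_id = word_ids.setdefault(word, len(word_ids))
            node = node.setdefault(word_id, {})
        node[END] = True
    return word_ids, trie


def tag_text(text, word_ids, trie):
    """Scan the words once, advancing a queue of active iterators."""
    words = text.split()
    active = []  # each entry: (start position, current trie node)
    tags = []    # each entry: (start, end, matched phrase)
    for i, word in enumerate(words):
        word_id = word_ids.get(word)
        active.append((i, trie))  # a new iterator starts at every word
        still_active = []
        for start, node in active:
            nxt = node.get(word_id) if word_id is not None else None
            if nxt is None:
                continue  # iterator did not advance, so it is removed
            if END in nxt:
                tags.append((start, i + 1, " ".join(words[start:i + 1])))
            still_active.append((start, nxt))
        active = still_active
    return tags


def reduce_tags(tags):
    """Keep only tags not contained in a longer tag (assumed policy)."""
    kept = []
    for candidate in tags:
        start, end, _ = candidate
        contained = any(
            s <= start and end <= e and (s, e) != (start, end)
            for s, e, _ in tags
        )
        if not contained:
            kept.append(candidate)
    return kept


word_ids, trie = build_index(["New Delhi City", "Delhi City", "City"])
all_tags = tag_text("Flights to New Delhi City", word_ids, trie)
print(reduce_tags(all_tags))  # [(2, 5, 'New Delhi City')]
```

Three iterators are started here, at New, Delhi, and City. All three reach a complete phrase, but after the reduction only the tag that began at New, covering the longest phrase, is emitted.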