Let us see how we can implement the Solr text tagger. Let us get the latest code for the Solr text tagger from the GitHub repository by cloning the Git repository with the following command:
git clone https://github.com/OpenSextant/SolrTextTagger.git
This will get the code inside a folder called SolrTextTagger
.
Now inside the SolrTextTagger
library, run the following command to create the JAR file:
mvn package
The mvn
command is available in the Maven repository. This repository can be installed using the following command on Ubuntu machines:
sudo apt-get install maven2
We can also install and use the latest release of maven – maven3.
The mvn
command fetches the dependencies required for compiling and creating the JAR file. If any dependencies are not satisfied or remain unavailable, you will need to debug the pom.xml
file inside the SolrTextTagger
folder and re-run the command.
Alternatively, you can use the solr-text-tagger.jar
file available with this chapter.
On successful compilation, we will be able to see the following output on the screen:
[INFO] Building jar: /home/jayant/solrtag/SolrTextTagger/target/solr-text-tagger-2.1-SNAPSHOT.jar [INFO] ------------------------------------------------------------------ [INFO] BUILD SUCCESSFUL [INFO] ------------------------------------------------------------------ [INFO] Total time: 54 seconds [INFO] Finished at: Tue Feb 17 10:43:34 IST 2015 [INFO] Final Memory: 62M/322M [INFO] ------------------------------------------------------------------
The compiled file is available in the target folder inside the SolrTextTagger
folder:
target/solr-text-tagger-2.1-SNAPSHOT.jar
This JAR file has to be copied into the <solr installation>/example/lib
folder from where solrconfig.xml
will pick it up.
To configure a text tagger with this instance of Solr, we will need to modify the solrconfig.xml
and schema.xml
files in our Solr installation. For a fresh installation of Solr, we can go to the <solr
installation>/example/solr/collection1/conf
folder and add or modify the following lines in our schema.xml
:
<field name="name_tag" type="tag" stored="false" omitTermFreqAndPositions="true" omitNorms="true"/> <copyField source="name" dest="name_tag"/>
Here we are defining a new field name_tag
, which is of the field type tag
. In order to populate the field, we have copied the text from our existing field name to the new field name_tag
.
Next, we will also need to define the behavior of tag
. This is also done in schema.xml
, as follows:
<fieldType name="tag" class="solr.TextField" positionIncrementGap="100" postingsFormat="Memory"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.ASCIIFoldingFilterFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="org.opensextant.solrtexttagger.ConcatenateFilterFactory" /> </analyzer> <analyzer type="query"> <!-- 32 just for tests, bumps posInc --> <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="32"/> <!-- NOTE: This used the WordLengthTaggingFilterFactory to test the TaggingAttribute. The WordLengthTaggingFilter set the TaggingAttribute for words based on their length. The attribute is ignored at indexing time, but the Tagger will use it to only start tags for words that are equals or longer as the configured minLength. --> <filter class="org.opensextant.solrtexttagger.WordLengthTaggingFilterFactory" minLength="4"/> <filter class="solr.ASCIIFoldingFilterFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType>
Here, we are defining the analysis happening on the tag
field type during indexing and search or querying. The attribute postingsFormat="Memory"
requires that we set codecFactory
in our solrconfig.xml
file to solr.SchemaCodecFactory
. The postingsFormat
class provides read and write access to all postings, fields, terms, documents, frequencies, positions, offsets, and payloads. These postings define the format of the index being stored. Here we are defining postingsFormat
as memory
, which keeps data in memory thus making the tagger work as fast as possible.
During indexing, we use standard
tokenizer
, which breaks our input into tokens. This is followed by ASCIIFoldingFilterFactory
. This class converts alphabetic, numeric, and symbolic Unicode characters that are not on the 127-character list of ASCII into their ASCII equivalents (if the ASCII equivalent exists). We are converting our tokens to lowercase by using LowerCaseFilterFactory
. Also, we are using ConcatenateFilterFactory
provided by solrTextTagger
, which concatenates all tokens into a final token with a space separator.
During querying, we once again use Standard
Tokenizer
followed by WordLengthTaggingFilterFactory
. This defines the minimum length of a token to be looked up during the tagging process. We use ASCIIFoldingFilterFactory
and LowerCaseFilterFactory
as well.
Now, let us go through the changes we need to make in solrconfig.xml
. We will first need to add solr-text-tagger-2.1-SNAPSHOT.jar
to be loaded as a library when Solr starts. For this, add the following lines in solrconfig.xml
:
<lib dir="../../lib/" regex="solr-text-tagger-2\.1-SNAPSHOT\.jar" />
You can also use the solr-text-tagger.jar
file provided with this chapter. It has been properly compiled with some missing classes.
Copy the solr-text-tagger.jar
file to your <solr installation>/example/lib
folder and add the following line in your solrconfig.xml
:
<lib dir="../../lib/" regex="solr-text-tagger\.jar" />
Also, add the following lines to support postingsFormat="Memory"
, as explained earlier in this section:
<!-- for postingsFormat="Memory" --> <codecFactory name="CodecFactory" class="solr.SchemaCodecFactory" /> <schemaFactory name="SchemaFactory" class="solr.ClassicIndexSchemaFactory" />
We will also have to define our own request handler, say /tag
, which calls TaggerRequestHandler
as provided by the Solr text tagger. Inside the /tag
request handler, we have defined the field to be used for tagging as name_tag
, which we added to our schema.xml
earlier. The filter query is an optional parameter that can be used to match the subset of the documents (for name matching) in our case:
<requestHandler name="/tag" class="org.opensextant.solrtexttagger.TaggerRequestHandler"> <!-- top level params; legacy format just to test it still works --> <str name="field">name_tag</str> <str name="fq">NOT name:(of the)</str><!-- filter out --> </requestHandler>
Now, let us start the Solr server and watch the output to check whether any errors crop up while Solr loads the configuration and library changes that we made.
If you get the following error in your Solr log stating that it is missing WorlLengthTaggingFilterFactory
, this factory has to be compiled and added to the JAR file we had created:
2768 [coreLoadExecutor-4-thread-1] ERROR org.apache.solr.core.CoreContainer – Unable to create core: collection1 . . Caused by: java.lang.ClassNotFoundException: solr.WordLengthTaggingFilterFactory
The files that are required to be compiled, WordLengthTaggingFilterFactory.java
and WordLengthTaggingFilter.java
, are available in the following folder:
SolrTextTagger/src/test/java/org/opensextant/solrtexttagger
In order to compile the file, we will need the following JAR files from our Solr installation:
javac -d . -cp log4j-1.2.16.jar:slf4j-api-1.7.6.jar:slf4j-log4j12-1.7.6.jar:lucene-core-4.8.1.jar:solr-text-tagger-2.1-SNAPSHOT.jar WordLengthTaggingFilterFactory.java WordLengthTaggingFilter.java
These jars could be found inside <solr installation>/example/lib/ext
and webapps/solr.war
file. The WAR file will need to be unzipped to obtain the required JAR files.
Once compiled, the class files can be found inside the org/opensextant/solrtexttagger
folder. In order to add these files into our existing solr-text-tagger-2.1-SNAPSHOT.jar
, simply unzip the JAR file and copy the files to the required folder org/opensextant/solrtexttagger/
and create the JAR file again using the following zip
command:
zip -r solr-text-tagger.jar META-INF/ org/
In order to check whether the tag
request handler is working, let us point our browser to http://localhost:8983/solr/collection1/tag
.
We can see a message stating that we need to post some text into TaggerRequestHandler
. This means that the Solr text tagger is now plugged into our Solr installation and is ready to work.
Now that we have our Solr server running without any issues, let us add some files from the exampledocs
folder to build the index and the tagging index along with it. Execute the following commands:
cd example/exampledocs java -jar post.jar *.xml java -Dtype=text/csv -jar post.jar books.csv
This will index all the xml
documents and the books.csv
file into the index.
Let us check whether the tagging works by checking the tags for the content inside the solr.xml
file inside the exampledocs
folder. Run the following command:
curl -XPOST 'http://localhost:8983/solr/tag?overlaps=ALL&tagsLimit=5000&fl=*&wt=json' -H 'Content-Type:text/xml' -d @solr.xml
We can see the output on our command prompt:
The tagcount
is 2
in the output.
Windows users can download curl
from the following location and install it: .
The commands can then be run from the command prompt.
Linux users can pretty print the JSON output from the previous query by using Python's pretty-print tool for JSON and piping it with the curl command. The command will then be:
curl -XPOST 'http://localhost:8983/solr/tag?overlaps=ALL&tagsLimit=5000&fl=*&wt=json' -H 'Content-Type:text/xml' -d @solr.xml | python -m json.tool
The output will be similar to the following image:
The output contains two sections, tags
and docs
. The tags
section contains an array of tags
with the ids
of the documents in which they are found. The docs
section contains Solr documents referenced by those tags
.
We take up another findtags.txt
file and find the tags in this file. Let us run the following command:
curl -XPOST 'http://localhost:8983/solr/tag?overlaps=ALL&tagsLimit=5000&fl=*' -H 'Content-Type:text/plain' -d @findtags.txt | xmllint --format -
This will give us the output in the XML format, as follows:
<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">54</int> </lst> <int name="tagsCount">3</int> <arr name="tags"> <lst> <int name="startOffset">905</int> <int name="endOffset">915</int> <arr name="ids"> <str>0553293354</str> </arr> </lst> <lst> <int name="startOffset">1246</int> <int name="endOffset">1256</int> <arr name="ids"> <str>0553293354</str> </arr> </lst> <lst> <int name="startOffset">1358</int> <int name="endOffset">1368</int> <arr name="ids"> <str>0553293354</str> </arr> </lst> </arr> <result name="response" numFound="1" start="0"> <doc> <str name="id">0553293354</str> <arr name="cat"> <str>book</str> </arr> <str name="name">Foundation</str> <float name="price">7.99</float> <str name="price_c">7.99,USD</str> <bool name="inStock">true</bool> <str name="author">Isaac Asimov</str> <str name="author_s">Isaac Asimov</str> <str name="series_t">Foundation Novels</str> <int name="sequence_i">1</int> <str name="genre_s">scifi</str> <long name="_version_">1493514260482883584</long> </doc> </result> </response>
Here, we can see that three tags were found in our text. All of them refer to the same document ID in the index.
For the index to be useful, we will need to index many documents. As the number of documents increases so will the accuracy of the tags found in our input text.
Now let us look at some of the request time parameters passed to the Solr text tagger:
overlaps
: This allows us to choose an algorithm that will be used to determine which overlapping tags should be retained versus which should be pruned away:ALL
: We have used ALL
in our Solr queries. This means all tags should be emitted.NO_SUB
: Do not emit a sub tag, that is, a tag within another tag.LONGEST_DOMINANT_RIGHT
: Compare the character lengths of the tags and emit the longest one. In the case of a tie, pick the right-most tag. Remove tags that overlap with this identified tag and then repeat the algorithm to find other tags that can be emitted.matchText
: This is a Boolean flag that indicates whether the matched text should be returned in the tag response. In this case, the tagger will fully buffer the input before tagging.tagsLimit
: This indicates the maximum number of tags to return in the response. By default, this is 1000
. In our examples, we have mentioned it as 5000
. Tagging is stopped once this limit is reached.skipAltTokens
: This is a Boolean flag used to suppress errors that can occur if, for example, you enable synonym expansion at query time in the analyzer, which you normally shouldn't do. The default value is false
.ignoreStopwords
: This is a Boolean flag that causes stop words to be ignored. Otherwise, the behavior is to treat them as breaks in tagging on the presumption that our indexed text-analysis configuration doesn't have a StopWordFilter
class. By default, the indexed analysis chain is checked for the presence of a StopWordFilter
class and, if found, then ignoreStopWords
is true
if unspecified. If we do not have StopWordFilter
configured, we can safely ignore this parameter.Most of the standard parameters that work with Solr also work here. For example, we have used wt=json
here. We can also use echoParams
, rows
, fl
, and other parameters during tagging.