Let us look at how we can build our own query parser. We will build a proximity query parser known as the SWAN query parser, where SWAN stands for Same, With, Adjacent, and Near. This query parser will use the SWAN relationships between terms to process the query and fetch the results.
Solr provides position-aware queries via phrase slop queries. An example of a phrase slop query is "samsung galaxy"~4, which states that samsung and galaxy must occur within 4 word positions of each other. However, this does not take care of the SWAN queries that we are looking for. Lucene supports position-aware queries through span queries. The classes that implement span queries in Lucene are:
- SpanTermQuery(Term term): The building block of span queries. SpanTermQuery instances are used to create SpanOrQuery, SpanNotQuery, or SpanNearQuery.
- SpanOrQuery(SpanQuery... clauses): Can contain multiple SpanQuery clauses. We can use the addClause(SpanQuery clause) method to add more clauses to the OR span query.
- SpanNotQuery(SpanQuery include, SpanQuery exclude): Constructs a SpanNotQuery matching spans from include that have no overlap with spans from exclude. The constructor also provides variations that take the distance between tokens, or the numbers of pre and post tokens.
- SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder): Constructs a SpanNearQuery that matches a span from each clause, with up to slop total unmatched positions between them. When inOrder is true, the spans from each clause must occur in the same order as the clauses.

Let us try and implement SWAN queries using Lucene span queries.
For this, we will have to index our documents in a fashion such that there is enough position gap between multiple sentences and paragraphs. Suppose we identify that, across our complete document set, the maximum number of tokens a sentence can have is 50, and that the maximum number of sentences a paragraph can have is also 50. Therefore, during the analysis phase of indexing, we will have to put a position gap of 500 tokens between sentences and 5000 between paragraphs, matching the constants defined below. Refer to the chapter on Solr Indexing Internals for information on PositionIncrementGap and how to set it up.
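As a rough sketch of the sentence-level gap, one option is to index each sentence as a separate value of a multiValued field and set positionIncrementGap on its field type in schema.xml. The field and type names below are hypothetical, and note that positionIncrementGap only applies between the values of a multiValued field, so the paragraph-level gap of 5000 would still need a custom analyzer:

<!-- Hypothetical field type: positionIncrementGap adds a positional gap
     between the multiple values of a multiValued field -->
<fieldType name="swan_text" class="solr.TextField" positionIncrementGap="500">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- Index each sentence of a document as one value of this field -->
<field name="text" type="swan_text" indexed="true" stored="true" multiValued="true"/>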
For creating the SwanQueries class, we first need to define the lengths of a sentence and a paragraph with respect to the token positions in them:
static int MAX_PARAGRAPH_LENGTH = 5000;
static int MAX_SENTENCE_LENGTH = 500;
Next, we need to define the implementations for the SAME, WITH, ADJ, and NEAR queries. For SAME, we define that the left and right clauses should be within the same paragraph, irrespective of the order:
public static SpanQuery SAME(SpanQuery left, SpanQuery right) {
    return new SpanNearQuery(new SpanQuery[] { left, right },
            MAX_PARAGRAPH_LENGTH, false);
}
For WITH, we need to define that the left and right clauses should be within the same sentence, irrespective of the order in which they were mentioned:
public static SpanQuery WITH(SpanQuery left, SpanQuery right) {
    return new SpanNearQuery(new SpanQuery[] { left, right },
            MAX_SENTENCE_LENGTH, false);
}
For ADJ, the left and right clauses should be next to each other, with a position difference of only 1 between them. Also, as order matters, the left clause should occur before the right clause:
public static SpanQuery ADJ(SpanQuery left, SpanQuery right) {
    return new SpanNearQuery(new SpanQuery[] { left, right }, 1, true);
}
For NEAR, the left and right clauses should be next to each other with a slop of 1, irrespective of the order. This means that the left clause can appear before the right clause and vice versa:
public static SpanQuery NEAR(SpanQuery left, SpanQuery right) {
    return new SpanNearQuery(new SpanQuery[] { left, right }, 1, false);
}
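To see how these helpers compose, here is a minimal sketch that builds the query galaxy ADJ samsung SAME note programmatically. It assumes the SwanQueries class containing the four static methods above, and a hypothetical field named text:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SwanQueriesExample {
    public static void main(String[] args) {
        // Build galaxy ADJ samsung SAME note by nesting the helpers
        SpanQuery galaxy = new SpanTermQuery(new Term("text", "galaxy"));
        SpanQuery samsung = new SpanTermQuery(new Term("text", "samsung"));
        SpanQuery note = new SpanTermQuery(new Term("text", "note"));

        SpanQuery adj = SwanQueries.ADJ(galaxy, samsung);  // adjacent and in order
        SpanQuery same = SwanQueries.SAME(adj, note);      // within the same paragraph
        System.out.println(same);
    }
}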
Now that we have the implementation of SWAN queries, we need a parser to parse our syntax with respect to SWAN queries. We would need a parser generator such as JavaCC, the Java Compiler Compiler. A parser generator is a tool that reads a grammar specification and converts it into a Java program that can recognize matches for the specification. We would create a grammar specification that JavaCC parses and converts into a Java program. More information regarding JavaCC can be obtained from its official page.
When dealing with a parser generator such as JavaCC, we have to create an external syntax definition file and then compile the grammar definitions into runnable Java code that can recognize matches for the definitions. This is somewhat complicated for a typical Java programmer. An easier way of implementing the parser is to define the parser directly inside the Java code. This is supported by the parboiled Java library, which is commonly used as an alternative to JavaCC. A parboiled parser supports the definition of Parsing Expression Grammar (PEG) parsers directly inside the Java source code. Since parboiled does not require a separate syntax definition file, it is comparatively easy to build custom parsers with it.
The PEG specification can include parser actions that perform arbitrary logic at any given point during the parsing process. A parboiled parser works in two phases. The first phase is rule construction, where the parser builds a tree (or rather a directed graph) of parsing rules as specified in our code. The second phase is rule execution, where the rules are run against a specific input text. The end result is a ParsingResult object that tells us whether the input matched, and gives access to the parse tree, the resulting value stack, and any parse errors.
We derive our custom parser class from BaseParser, the required base class of all parboiled for Java parsers, and define methods that return Rule instances. These methods construct a rule instance from other rules, terminals, predefined primitives, and action expressions. A PEG parser is basically a set of rules that are composed of other rules and terminals, which are essentially characters or strings.
The primitive rules are defined as follows (where a and b denote other parsing rules):

- Sequence(a,b): Creates a new rule that succeeds if the sub-rules a and b succeed one after the other.
- FirstOf(a,b): Creates a new rule that successively tries the sub-rules a and b and succeeds when the first of its sub-rules matches. If all sub-rules fail, this rule fails as well.
- ZeroOrMore(a): Creates a new rule that tries repeated matches of its sub-rule a and always succeeds, even if the sub-rule doesn't match even once.
- OneOrMore(a): Creates a new rule that tries repeated matches of its sub-rule a and succeeds if the sub-rule matches at least once. If the sub-rule does not match at least once, this rule fails.
- Optional(a): Creates a new rule that tries a match on its sub-rule a and always succeeds, independently of the matching success of its sub-rule.
- Test(a): Creates a new rule that tests the given sub-rule a against the current input position without actually matching any characters. It succeeds if the sub-rule succeeds, and fails if the sub-rule fails.
- TestNot(a): The inverse of the Test rule. It creates a new rule that tests the given sub-rule a against the current input position without actually matching any characters. It succeeds if the sub-rule fails, and fails if the sub-rule succeeds.

In addition to the above primitives, parboiled provides the ParseRunner class, which is responsible for supervising a parsing run and optionally applying additional logic, most importantly the handling of illegal input characters and parse errors. More details about the Java parboiled library can be obtained from its wiki pages and its official GitHub page at https://github.com/sirthias/parboiled.
Let us use the parboiled library to create our SWAN parser. The parboiled library consists of the parboiled-core and parboiled-java JAR files, which can be downloaded from the project's GitHub releases or from Maven Central.
Our SwanParser class extends BaseParser from parboiled and produces queries of type SpanQuery (from the Lucene API):
public class SwanParser extends BaseParser<SpanQuery>
We start by defining rules for the input strings OR, SAME, WITH, NEAR, and ADJ. These rules are defined as case-insensitive input strings followed by whitespace:
public Rule OR() {
    return Sequence(IgnoreCase("OR"), WhiteSpace());
}

public Rule SAME() {
    return Sequence(IgnoreCase("SAME"), WhiteSpace());
}

public Rule WITH() {
    return Sequence(IgnoreCase("WITH"), WhiteSpace());
}

public Rule NEAR() {
    return Sequence(IgnoreCase("NEAR"), WhiteSpace());
}

public Rule ADJ() {
    return Sequence(IgnoreCase("ADJ"), WhiteSpace());
}
We need to define rules for matching Term, Char, and WhiteSpace in the input string. A character is defined as any one of the digits, lowercase or uppercase letters, a dash (-), or an underscore (_):
public Rule Char() {
    return AnyOf("0123456789" +
                 "abcdefghijklmnopqrstuvwxyz" +
                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
                 "-_");
}
Whitespace is defined as any space (" "), tab ("\t"), or form feed ("\f") character:
public Rule WhiteSpace() {
    return OneOrMore(AnyOf(" \t\f"));
}
A term is defined as a sequence of one or more characters. We create a Lucene SpanTermQuery using the Term class and push it onto the value stack. Note that a Lucene Term requires a field name; here we assume the field being queried is named text:
public Rule Term() {
    return Sequence(
        OneOrMore(Char()),
        // new Term(match()) alone would set the field, not the text;
        // the field name "text" is an assumption for this example
        push(new SpanTermQuery(new Term("text", match())))
    );
}
We define the SAME expression as a sequence of a WITH expression and ZeroOrMore of a sequence of the SAME rule and another WITH expression. We construct the SAME query by popping two elements from the value stack and pushing the resulting SAME query back onto the value stack:
public Rule SameExpression() {
    return Sequence(
        WithExpression(),
        ZeroOrMore(
            Sequence(
                SAME(),
                WithExpression(),
                push(SwanQueries.SAME(pop(1), pop()))
            )
        )
    );
}
Similarly, the WITH expression is defined as a sequence of an AdjNear expression and ZeroOrMore of a sequence of the WITH rule and another AdjNear expression. As before, we create a WITH query by popping two elements from the value stack and pushing the WITH query back onto the value stack:
public Rule WithExpression() {
    return Sequence(
        AdjNearExpression(),
        ZeroOrMore(
            Sequence(
                WITH(),
                AdjNearExpression(),
                push(SwanQueries.WITH(pop(1), pop()))
            )
        )
    );
}
Finally, we create the AdjNear expression to handle both the ADJ and NEAR clauses. It consists of a Term followed by ZeroOrMore of whichever of the following two sequences matches first. The first sequence is the NEAR rule followed by a term, with a NEAR query constructed by popping two elements from the value stack. The second sequence is the ADJ rule followed by a term, with an ADJ query constructed in the same way. The expression returns just a Term if neither sequence matches; otherwise, it returns whichever of the NEAR and ADJ sequences it finds first:
public Rule AdjNearExpression() {
    return Sequence(
        Term(),
        ZeroOrMore(FirstOf(
            Sequence(
                NEAR(),
                Term(),
                push(SwanQueries.NEAR(pop(1), pop()))
            ),
            Sequence(
                ADJ(),
                Term(),
                push(SwanQueries.ADJ(pop(1), pop()))
            )
        ))
    );
}
Next, we define the OR expression as a sequence of a SAME expression and ZeroOrMore of a sequence of the OR rule and another SAME expression. Here we pop the last two elements from the value stack, create a SpanOrQuery from them, and push it back onto the value stack:
public Rule OrExpression() {
    return Sequence(
        SameExpression(),
        ZeroOrMore(
            Sequence(
                OR(),
                SameExpression(),
                push(new SpanOrQuery(pop(1), pop()))
            )
        )
    );
}
Finally, we create a rule for the Query() function, which is a sequence of the OR expression followed by End Of Input (EOI):
public Rule Query() {
    return Sequence(OrExpression(), EOI);
}
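Before wiring the parser into Solr, we can test it standalone. The following sketch (the class name SwanParserTest is our own) runs the parser against a sample query string using parboiled's ReportingParseRunner and prints the resulting span query:

import org.apache.lucene.search.spans.SpanQuery;
import org.parboiled.Parboiled;
import org.parboiled.parserunners.ReportingParseRunner;
import org.parboiled.support.ParsingResult;

public class SwanParserTest {
    public static void main(String[] args) {
        // Phase 1: rule construction - build the rule graph from our parser class
        SwanParser parser = Parboiled.createParser(SwanParser.class);

        // Phase 2: rule execution - run the rules against an input string
        ParsingResult<SpanQuery> result =
            new ReportingParseRunner<SpanQuery>(parser.Query())
                .run("galaxy ADJ samsung SAME note");

        if (result.matched) {
            System.out.println(result.resultValue);  // the constructed SpanQuery
        } else {
            System.out.println(result.parseErrors);  // any parse errors
        }
    }
}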
In order to compile SwanParser.java, we need the lucene-core, parboiled-core, and parboiled-java JAR files in our Java classpath.
We will need to create a Solr plugin to incorporate the SWAN query parser we just built. In order to create a Solr plugin for processing our custom query parser, we need to extend the QParserPlugin class and override its createParser method to return an instance of type QParser. To plug in our SWAN parser, we create a SwanQParser class that extends the QParser class and overrides the parse method to return an object of type Query:
public class SwanQParser extends QParser {

    // Define the constructor
    public SwanQParser(String qstr, SolrParams localParams,
                       SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);
    }

    // Override the parse method from QParser
    @Override
    public Query parse() throws SyntaxError {
        SwanParser parser = Parboiled.createParser(SwanParser.class);
        ParsingResult<?> result =
            new RecoveringParseRunner<SpanQuery>(parser.Query()).run(this.qstr);
        if (!result.parseErrors.isEmpty()) {
            throw new SyntaxError(
                ErrorUtils.printParseError(result.parseErrors.get(0)));
        }
        SpanQuery query = (SpanQuery) result.parseTreeRoot.getValue();
        return query;
    }
}
Once we have the SwanQParser class of type QParser, we create the SwanQParserPlugin class, which extends the QParserPlugin class from Solr, and override its createParser method to return an object of type SwanQParser:
public class SwanQParserPlugin extends QParserPlugin {

    // Override the createParser method from QParserPlugin
    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        return new SwanQParser(qstr, localParams, params, req);
    }
}
In addition to the parboiled and Lucene libraries (JAR files), we will need the solr-core and solr-solrj libraries in our Java classpath to compile the previously mentioned classes.
Now that we have all the classes ready for our plugin, let's create a JAR file and reference it from the solrconfig.xml file in order to integrate the SWAN plugin into Solr. Create the JAR file (swan.jar) and place it inside the library folder (<solr_directory>/example/solr-webapp/webapp/WEB-INF/lib/). Also, make the following change in the solrconfig.xml file:
<queryParser name="swan" class="com.plugin.swan.SwanQParserPlugin"/>
Note that all our classes are placed inside the com.plugin.swan package. Restart Solr and try accessing the SWAN parser by specifying the defType=swan parameter in a Solr query, as shown in the following Solr query URL:
http://localhost:8983/solr/collection1/select?q=((galaxy ADJ samsung) SAME note) AND (mobile OR tablet)&defType=swan
We can also define a new handler, /swan, instead of /select for processing SWAN queries.
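For instance, a /swan handler that applies the SWAN parser by default could be registered in solrconfig.xml along the following lines; this is a sketch, and the default field name text is an assumption:

<requestHandler name="/swan" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">swan</str>
    <str name="df">text</str>
  </lst>
</requestHandler>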
On accessing the above Solr query URL, we get a syntax error. This is caused by missing dependencies: to fix them, include the parboiled and ASM libraries in the Solr library path. Copy the parboiled* JAR files to the library folder, and also download and copy the asm-all-4.x.jar file to the library folder.
We are using ASM 4.2, which can be downloaded from the ASM project website.
If you are still getting a syntax exception, remember that we need to incorporate a position increment gap between multiple sentences and paragraphs within our index. We will need to define our analyzer to tokenize our input text in the required fashion for the SWAN queries to work.