Let us look at how we can build our own query parser. We will build a proximity query parser known as the SWAN query parser, where SWAN stands for Same, With, Adjacent, and Near. This query parser will use the SWAN relationships between terms to process the query and fetch the results.
Solr provides position-aware queries via phrase slop queries. An example of a phrase slop query is "samsung galaxy"~4, which states that samsung and galaxy must occur within 4 word positions of each other. However, this does not take care of the SWAN queries that we are looking for. Lucene supports position-aware queries through span queries. The classes that implement span queries in Lucene are:
- SpanTermQuery(Term term): The building block of span queries. SpanTermQuery instances are used to create SpanOrQuery, SpanNotQuery, or SpanNearQuery.
- SpanOrQuery(SpanQuery... clauses): Can contain multiple SpanQuery clauses. We can use the addClause(SpanQuery clause) method to add more clauses to the OR span query.
- SpanNotQuery(SpanQuery include, SpanQuery exclude): Constructs a SpanNotQuery matching spans from include that have no overlap with spans from exclude. The constructor also provides variations that take the distance between tokens, or the numbers of pre and post tokens.
- SpanNearQuery(SpanQuery[] clauses, int slop, boolean inOrder): Constructs a SpanNearQuery that matches a span from each clause, with up to slop total unmatched positions between them. When inOrder is true, the spans from each clause must occur in the same order as the clauses.

Let us try and implement SWAN queries using Lucene span queries.
For this, we will have to index our documents in a fashion such that there is enough position gap between multiple sentences and paragraphs. Suppose we identify that, across our complete document set, the maximum number of tokens a sentence can have is 50, and that the maximum number of sentences a paragraph can have is also 50. Therefore, during the analysis phase of indexing, we will have to put a position gap of 500 tokens between sentences and 5000 between paragraphs, matching the constants defined below. Refer to the chapter on Solr Indexing Internals for information on PositionIncrementGap and how to set it up.
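As a rough sketch of the sentence-level gap, one option is to index each sentence as a separate value of a multiValued field and set positionIncrementGap on its field type in schema.xml. The field and type names below are hypothetical, and note that positionIncrementGap only applies between the values of a multiValued field, so the paragraph-level gap of 5000 would still need a custom analyzer:

<!-- Hypothetical field type: positionIncrementGap adds a positional gap
     between the multiple values of a multiValued field -->
<fieldType name="swan_text" class="solr.TextField" positionIncrementGap="500">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<!-- Index each sentence of a document as one value of this field -->
<field name="text" type="swan_text" indexed="true" stored="true" multiValued="true"/>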
For creating the SwanQueries class, we first need to define the lengths of a sentence and a paragraph with respect to the token positions in them:
static int MAX_PARAGRAPH_LENGTH = 5000;
static int MAX_SENTENCE_LENGTH = 500;
Next, we need to define the implementations for the SAME, WITH, ADJ, and NEAR queries. For SAME, we define that the left and right clauses should be within the same paragraph, irrespective of the order:
public static SpanQuery SAME(SpanQuery left, SpanQuery right) {
    return new SpanNearQuery(new SpanQuery[] { left, right },
            MAX_PARAGRAPH_LENGTH, false);
}
For WITH, we need to define that the left and right clauses should be within the same sentence, irrespective of the order in which they were mentioned:
public static SpanQuery WITH(SpanQuery left, SpanQuery right) {
    return new SpanNearQuery(new SpanQuery[] { left, right },
            MAX_SENTENCE_LENGTH, false);
}
For ADJ, the left and right clauses should be next to each other, with a position difference of only 1 between them. Also, as order matters, the left clause should occur before the right clause:
public static SpanQuery ADJ(SpanQuery left, SpanQuery right) {
    return new SpanNearQuery(new SpanQuery[] { left, right }, 1, true);
}
For NEAR, the left and right clauses should be next to each other with a slop of 1, irrespective of the order. This means that the left clause can appear before the right clause and vice versa:
public static SpanQuery NEAR(SpanQuery left, SpanQuery right) {
    return new SpanNearQuery(new SpanQuery[] { left, right }, 1, false);
}
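To see how these helpers compose, here is a minimal sketch that builds the query galaxy ADJ samsung SAME note programmatically. It assumes the SwanQueries class containing the four static methods above, and a hypothetical field named text:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SwanQueriesExample {
    public static void main(String[] args) {
        // Build galaxy ADJ samsung SAME note by nesting the helpers
        SpanQuery galaxy = new SpanTermQuery(new Term("text", "galaxy"));
        SpanQuery samsung = new SpanTermQuery(new Term("text", "samsung"));
        SpanQuery note = new SpanTermQuery(new Term("text", "note"));

        SpanQuery adj = SwanQueries.ADJ(galaxy, samsung);  // adjacent and in order
        SpanQuery same = SwanQueries.SAME(adj, note);      // within the same paragraph
        System.out.println(same);
    }
}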
Now that we have the implementation of SWAN queries, we need a parser to parse our syntax with respect to SWAN queries. We would need a parser generator such as JavaCC, the Java Compiler Compiler. A parser generator is a tool that reads a grammar specification and converts it into a Java program that can recognize matches for the specification. We would create a grammar specification that JavaCC parses and converts into a Java program. More information regarding JavaCC can be obtained from its official page.
When dealing with a parser generator such as JavaCC, we have to create an external syntax definition file and then compile the grammar definitions into runnable Java code that can recognize matches for the definitions. This is somewhat complicated for a typical Java programmer. An easier way of implementing the parser is to define the parser directly inside the Java code. This is supported by the parboiled Java library, which is commonly used as an alternative to JavaCC. A parboiled parser supports the definition of Parsing Expression Grammar (PEG) parsers directly inside the Java source code. Since parboiled does not require a separate syntax definition file, it is comparatively easy to build custom parsers with it.
The PEG specification can include parser actions that perform arbitrary logic at any given point during the parsing process. A parboiled parser works in two phases. The first phase is rule construction, where the parser builds a tree (or rather a directed graph) of parsing rules as specified in our code. The second phase is rule execution, where the rules are run against a specific input text. The end result is a ParsingResult object that tells us whether the input matched, and gives access to the parse tree, the resulting value stack, and any parse errors.
We derive our custom parser class from BaseParser, the required base class of all parboiled for Java parsers, and define methods that return Rule instances. These methods construct a rule instance from other rules, terminals, predefined primitives, and action expressions. A PEG parser is basically a set of rules that are composed of other rules and terminals, which are essentially characters or strings.
The primitive rules are defined as follows (where a and b denote other parsing rules):

- Sequence(a,b): Creates a new rule that succeeds if the sub-rules a and b succeed one after the other.
- FirstOf(a,b): Creates a new rule that successively tries the sub-rules a and b and succeeds when the first of its sub-rules matches. If all sub-rules fail, this rule fails as well.
- ZeroOrMore(a): Creates a new rule that tries repeated matches of its sub-rule a and always succeeds, even if the sub-rule doesn't match even once.
- OneOrMore(a): Creates a new rule that tries repeated matches of its sub-rule a and succeeds if the sub-rule matches at least once. If the sub-rule does not match at least once, this rule fails.
- Optional(a): Creates a new rule that tries a match on its sub-rule a and always succeeds, independently of the matching success of its sub-rule.
- Test(a): Creates a new rule that tests the given sub-rule a against the current input position without actually matching any characters. It succeeds if the sub-rule succeeds, and fails if the sub-rule fails.
- TestNot(a): The inverse of the Test rule. It creates a new rule that tests the given sub-rule a against the current input position without actually matching any characters. It succeeds if the sub-rule fails, and fails if the sub-rule succeeds.

In addition to the above primitives, parboiled provides the ParseRunner class, which is responsible for supervising a parsing run and optionally applying additional logic, most importantly the handling of illegal input characters and parse errors. More details about the Java parboiled library can be obtained from its wiki pages and its official GitHub page at https://github.com/sirthias/parboiled.
Let us use the parboiled library to create our SWAN parser. The parboiled library consists of the parboiled-core and parboiled-java JAR files, which can be downloaded from the project's GitHub releases or from Maven Central.
Our SwanParser class extends BaseParser from parboiled and produces queries of type SpanQuery (from the Lucene API):
public class SwanParser extends BaseParser<SpanQuery>
We start by defining rules for the input strings OR, SAME, WITH, NEAR, and ADJ. These rules are defined as case-insensitive input strings followed by whitespace:
public Rule OR() {
    return Sequence(IgnoreCase("OR"), WhiteSpace());
}

public Rule SAME() {
    return Sequence(IgnoreCase("SAME"), WhiteSpace());
}

public Rule WITH() {
    return Sequence(IgnoreCase("WITH"), WhiteSpace());
}

public Rule NEAR() {
    return Sequence(IgnoreCase("NEAR"), WhiteSpace());
}

public Rule ADJ() {
    return Sequence(IgnoreCase("ADJ"), WhiteSpace());
}
We need to define rules for matching Term, Char, and WhiteSpace in the input string. A character is defined as any one of the digits, lowercase or uppercase letters, a dash (-), or an underscore (_):
public Rule Char() {
    return AnyOf("0123456789" +
                 "abcdefghijklmnopqrstuvwxyz" +
                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
                 "-_");
}
Whitespace is defined as any space (" "), tab ("\t"), or form feed ("\f") character:
public Rule WhiteSpace() {
    return OneOrMore(AnyOf(" \t\f"));
}
A term is defined as a sequence of one or more characters. We create a Lucene SpanTermQuery using the Term class and push it onto the value stack. Note that a Lucene Term requires a field name; here we assume the field being queried is named text:
public Rule Term() {
    return Sequence(
        OneOrMore(Char()),
        // new Term(match()) alone would set the field, not the text;
        // the field name "text" is an assumption for this example
        push(new SpanTermQuery(new Term("text", match())))
    );
}
We define the SAME expression as a sequence of a WITH expression and ZeroOrMore of a sequence of the SAME rule and another WITH expression. We construct the SAME query by popping two elements from the value stack and pushing the resulting SAME query back onto the value stack:
public Rule SameExpression() {
    return Sequence(
        WithExpression(),
        ZeroOrMore(
            Sequence(
                SAME(),
                WithExpression(),
                push(SwanQueries.SAME(pop(1), pop()))
            )
        )
    );
}
Similarly, the WITH expression is defined as a sequence of an AdjNear expression and ZeroOrMore of a sequence of the WITH rule and another AdjNear expression. As before, we create a WITH query by popping two elements from the value stack and pushing the WITH query back onto the value stack:
public Rule WithExpression() {
    return Sequence(
        AdjNearExpression(),
        ZeroOrMore(
            Sequence(
                WITH(),
                AdjNearExpression(),
                push(SwanQueries.WITH(pop(1), pop()))
            )
        )
    );
}
Finally, we create the AdjNear expression to handle both the ADJ and NEAR clauses. It consists of a Term followed by ZeroOrMore of whichever of the following two sequences matches first. The first sequence is the NEAR rule followed by a term, with a NEAR query constructed by popping two elements from the value stack. The second sequence is the ADJ rule followed by a term, with an ADJ query constructed in the same way. The expression returns just a Term if neither sequence matches; otherwise, it returns whichever of the NEAR and ADJ sequences it finds first:
public Rule AdjNearExpression() {
    return Sequence(
        Term(),
        ZeroOrMore(FirstOf(
            Sequence(
                NEAR(),
                Term(),
                push(SwanQueries.NEAR(pop(1), pop()))
            ),
            Sequence(
                ADJ(),
                Term(),
                push(SwanQueries.ADJ(pop(1), pop()))
            )
        ))
    );
}
Next, we define the OR expression as a sequence of a SAME expression and ZeroOrMore of a sequence of the OR rule and another SAME expression. Here we pop the last two elements from the value stack, create a SpanOrQuery from them, and push it back onto the value stack:
public Rule OrExpression() {
    return Sequence(
        SameExpression(),
        ZeroOrMore(
            Sequence(
                OR(),
                SameExpression(),
                push(new SpanOrQuery(pop(1), pop()))
            )
        )
    );
}
Finally, we create a rule for the Query() function, which is a sequence of the OR expression followed by End Of Input (EOI):
public Rule Query() {
    return Sequence(OrExpression(), EOI);
}
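Before wiring the parser into Solr, we can test it standalone. The following sketch (the class name SwanParserTest is our own) runs the parser against a sample query string using parboiled's ReportingParseRunner and prints the resulting span query:

import org.apache.lucene.search.spans.SpanQuery;
import org.parboiled.Parboiled;
import org.parboiled.parserunners.ReportingParseRunner;
import org.parboiled.support.ParsingResult;

public class SwanParserTest {
    public static void main(String[] args) {
        // Phase 1: rule construction - build the rule graph from our parser class
        SwanParser parser = Parboiled.createParser(SwanParser.class);

        // Phase 2: rule execution - run the rules against an input string
        ParsingResult<SpanQuery> result =
            new ReportingParseRunner<SpanQuery>(parser.Query())
                .run("galaxy ADJ samsung SAME note");

        if (result.matched) {
            System.out.println(result.resultValue);  // the constructed SpanQuery
        } else {
            System.out.println(result.parseErrors);  // any parse errors
        }
    }
}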
In order to compile SwanParser.java, we need the lucene-core, parboiled-core, and parboiled-java JAR files in our Java classpath.
We will need to create a Solr plugin to incorporate the SWAN query parser we just built. In order to create a Solr plugin for processing our custom query parser, we need to extend the QParserPlugin class and override its createParser method to return an instance of type QParser. To plug in our SWAN parser, we create a SwanQParser class that extends the QParser class and overrides the parse method to return an object of type Query:
public class SwanQParser extends QParser {

    // Define the constructor
    public SwanQParser(String qstr, SolrParams localParams,
                       SolrParams params, SolrQueryRequest req) {
        super(qstr, localParams, params, req);
    }

    // Override the parse method from QParser
    @Override
    public Query parse() throws SyntaxError {
        SwanParser parser = Parboiled.createParser(SwanParser.class);
        ParsingResult<?> result =
            new RecoveringParseRunner<SpanQuery>(parser.Query()).run(this.qstr);
        if (!result.parseErrors.isEmpty()) {
            throw new SyntaxError(
                ErrorUtils.printParseError(result.parseErrors.get(0)));
        }
        SpanQuery query = (SpanQuery) result.parseTreeRoot.getValue();
        return query;
    }
}
Once we have the SwanQParser class of type QParser, we create the SwanQParserPlugin class, which extends the QParserPlugin class from Solr, and override its createParser method to return an object of type SwanQParser:
public class SwanQParserPlugin extends QParserPlugin {

    // Override the createParser method from QParserPlugin
    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
        return new SwanQParser(qstr, localParams, params, req);
    }
}
In addition to the parboiled and Lucene libraries (JAR files), we will need the solr-core and solr-solrj libraries in our Java classpath to compile the previously mentioned classes.
Now that we have all the classes ready for our plugin, let's create a JAR file and reference it from the solrconfig.xml file in order to integrate the SWAN plugin into Solr. Create the JAR file (swan.jar) and place it inside the library folder (<solr_directory>/example/solr-webapp/webapp/WEB-INF/lib/). Also, make the following change in the solrconfig.xml file:
<queryParser name="swan" class="com.plugin.swan.SwanQParserPlugin"/>
Note that all our classes are placed inside the com.plugin.swan package. Restart Solr and try accessing the SWAN parser by specifying the defType=swan parameter in a Solr query, as shown in the following Solr query URL:
http://localhost:8983/solr/collection1/select?q=((galaxy ADJ samsung) SAME note) AND (mobile OR tablet)&defType=swan
We can also define a new handler, /swan, instead of /select for processing SWAN queries.
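For instance, a /swan handler that applies the SWAN parser by default could be registered in solrconfig.xml along the following lines; this is a sketch, and the default field name text is an assumption:

<requestHandler name="/swan" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">swan</str>
    <str name="df">text</str>
  </lst>
</requestHandler>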
On accessing the above Solr query URL, we get a syntax error. This is caused by missing dependencies: to fix them, include the parboiled and ASM libraries in the Solr library path. Copy the parboiled* JAR files to the library folder, and also download and copy the asm-all-4.x.jar file to the library folder.
We are using ASM 4.2, which can be downloaded from the ASM project website.
If you are still getting a syntax exception, remember that we need to incorporate a position increment gap between multiple sentences and paragraphs within our index. We will need to define our analyzer to tokenize our input text in the required fashion for the SWAN queries to work.