Book: Apache Solr Search Patterns

Lucene 4 spatial module

Solr 4 contains three field types for spatial search: LatLonType (or its non-geodetic twin PointType), SpatialRecursivePrefixTreeFieldType (RPT for short), and BBoxField (available from Solr 4.10 onward). LatLonType has been available since Solr 3. RPT offers more features than LatLonType and provides fast filter performance, whereas LatLonType is better suited to efficient distance sorting and boosting. With Solr, we can use both fields simultaneously: LatLonType for sorting or boosting and RPT for filtering. BBoxField is used for indexing bounding boxes, querying by a box, specifying search predicates such as Intersects, Within, Contains, Disjoint, or Equals, and relevancy sorting or boosting on properties such as overlapRatio.

We have already seen the LatLonType field, which we used to define the location of our store in the earlier examples. Let us explore RPT and have a look at BBoxField.

SpatialRecursivePrefixTreeFieldType

RPT, available in Solr 4, implements the RecursivePrefixTree search strategy. RecursivePrefixTreeStrategy is a grid-based (prefix tree) class that implements recursive descent algorithms. It is considered the most mature and best-tested strategy to date.

It has the following advantages over the LatLonType field:

  • Can be used to query by polygons and other complex shapes, in addition to circles and rectangles
  • Has support for multi-valued indexed fields
  • Can index non-point shapes such as polygons, as well as point shapes
  • Has support for rectangles with user-specified corners that can cross the dateline
  • Has support for unoptimized multi-valued distance sorting and score boosting
  • Supports the WKT shape syntax, which is required for specifying polygons and other complex shapes
  • Incorporates the basic features of the LatLonType field and enables the use of geofilt, bbox, and geodist query filters with RPT
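
As a rough illustration of the last two points, the following Python sketch builds the request parameters for a geofilt (circle) filter and for a WKT polygon filter on an RPT field. The field name store comes from this chapter; the coordinates, the helper names, and the assumption that polygon support (JTS) is configured are ours, so treat this as a sketch rather than a definitive recipe.

```python
from urllib.parse import urlencode


def geofilt_params(field, lat, lon, km):
    """Parameters that keep documents within `km` kilometres of (lat, lon)."""
    fq = "{!geofilt sfield=%s pt=%s,%s d=%s}" % (field, lat, lon, km)
    return {"q": "*:*", "fq": fq}


def polygon_params(field, ring):
    """Parameters that keep documents intersecting a WKT polygon.

    `ring` is a closed list of (lon, lat) pairs; WKT uses x y order,
    that is, longitude first.
    """
    coords = ", ".join("%s %s" % (lon, lat) for lon, lat in ring)
    fq = '%s:"Intersects(POLYGON((%s)))"' % (field, coords)
    return {"q": "*:*", "fq": fq}


# Hypothetical coordinates, for illustration only.
circle = geofilt_params("store", 45.15, -93.85, 5)
triangle = polygon_params("store", [(-94, 45), (-93, 45), (-93, 46), (-94, 45)])

# URL-encoded query strings that could be appended to a /select URL.
print(urlencode(circle))
print(urlencode(triangle))
```

The polygon form relies on the WKT support listed above, whereas the geofilt form works the same way as it does for a LatLonType field.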

We can use RPT in our Solr setup by declaring a field type of class solr.SpatialRecursivePrefixTreeFieldType in our schema.xml file. Our schema.xml file contains the following definition for the RPT field type:

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
        spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
        autoIndex="true"
        geo="true"
        distErrPct="0.025"
        maxDistErr="0.000009"
        units="degrees" />

We can change the type of the field named store from location to location_rpt and make it multi-valued:

<field name="store" type="location_rpt" indexed="true" stored="true" multiValued="true" />

Now restart Solr.

Tip

If you get an error java.lang.ClassNotFoundException: com.vividsolutions.jts.geom.CoordinateSequenceFactory, please download the JTS library (jts-1.13.jar) from .

Then place it in the <solr folder>/example/solr-webapp/webapp/WEB-INF/lib directory.

Let us understand the options available for the SpatialRecursivePrefixTreeFieldType field type in our schema.xml file:

  • name: This is the name of the field type that we specified as location_rpt.
  • class: This should be solr.SpatialRecursivePrefixTreeFieldType as we have declared.
  • spatialContextFactory: Set this to com.spatial4j.core.context.jts.JtsSpatialContextFactory only when polygons or linestrings are required. When it is specified, the JAR file jts-1.13.jar that we put in our lib folder (as mentioned in the tip above) is used. This context factory has its own options, which are described in its Javadocs. One option that we enabled in our declaration is autoIndex="true", which provides a major performance boost for polygons.
  • units: This is a mandatory parameter, and degrees is currently the only accepted value. It determines how the maxDistErr attribute, the radius of a circle, and any other absolute distances are interpreted. One degree corresponds to approximately 111.2 km, based on the mean radius of Earth.
  • geo: This parameter specifies whether the mathematical model is based on a sphere or on Euclidean (Cartesian) geometry. We have set it to true, so latitude and longitude coordinates are used and the mathematical model is generally a sphere. If it is set to false, the coordinates are generic X and Y values on a two-dimensional Euclidean plane.
  • worldBounds: This sets the valid numerical ranges of the x and y coordinates in the minX minY maxX maxY format. If geo="true", the value of this parameter is assumed to be -180 -90 180 90; otherwise, it needs to be specified explicitly.
  • distCalculator: This defines the distance calculation algorithm. If geo=true, haversine is the default. If geo=false, cartesian is the default. Other possible values are lawOfCosines, vincentySphere, and cartesian^2.
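
To make the units and distCalculator options more concrete, here is a small Python sketch of the degree-to-kilometre conversion and of the haversine and Cartesian distance formulas. The Earth radius constant and all function names are our own assumptions for illustration; this is not the code that Solr or Spatial4j runs internally.

```python
import math

# Assumed mean Earth radius in kilometres (an illustrative constant).
EARTH_MEAN_RADIUS_KM = 6371.0088

# Length of one degree of arc on a sphere of that radius.
KM_PER_DEGREE = 2 * math.pi * EARTH_MEAN_RADIUS_KM / 360.0


def km_to_degrees(km):
    """Convert a distance in kilometres to degrees of arc."""
    return km / KM_PER_DEGREE


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points (geo=true model)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_MEAN_RADIUS_KM * math.asin(math.sqrt(a))


def cartesian(x1, y1, x2, y2):
    """Straight-line distance on a flat plane (geo=false model)."""
    return math.hypot(x2 - x1, y2 - y1)


print(round(KM_PER_DEGREE, 1))   # about 111.2 km per degree
print(km_to_degrees(0.001))      # 1 m expressed in degrees
```

Under these assumptions, 1 m works out to slightly less than 0.000009 degrees, which is why that value appears as maxDistErr in our field type declaration.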

A PrefixTree-based field treats the indexed coordinates as a grid. Each grid cell is subdivided into another set of grid cells at the next level, forming a hierarchy of levels. The largest cells are at level 1, the next set of subdivided cells is at level 2, and so on. Here are some configuration options related to prefixTree:

  • prefixTree: Defines the spatial grid implementation. Since a PrefixTree (such as RecursivePrefixTree) maps the world as a grid, each grid cell is decomposed into another set of grid cells at the next level. If geo=true, the default prefix tree is geohash; otherwise, it is quad. Geohash has 32 children at each level, and quad has 4. Geohash cannot be used with geo=false as it is strictly geospatial.
  • distErrPct: Defines the default precision of non-point shapes for both the index and the query as a fraction between 0.0 (fully precise) and 0.5. The closer this number is to zero, the more accurate the shape. We have set it to 0.025, allowing a small amount of inaccuracy in our shapes. More precise indexed shapes use more disk space and take longer to index. Larger distErrPct values make querying faster but less accurate.
  • maxDistErr: Defines the highest level of detail required for indexed data. The default value is 1 m, a little less than 0.000009 degrees. This setting is used internally to compute an appropriate maxLevels value.
  • maxLevels: Sets the maximum grid depth for indexed data. It is usually more intuitive to compute an appropriate maxLevels by specifying maxDistErr.
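
The relationship between grid levels, cell counts, and maxDistErr can be sketched in Python as follows. This is a simplified model of a quad grid over a 360 by 180 degree world, with helper names of our own choosing; the actual Lucene computation of maxLevels may differ in its details.

```python
def cells_at_level(children_per_cell, level):
    """Total number of grid cells at a level (4 children for quad, 32 for geohash)."""
    return children_per_cell ** level


def quad_cell_size(level, world_width=360.0, world_height=180.0):
    """Width and height in degrees of one quad cell at the given level.

    Each level halves the cell along both axes.
    """
    return world_width / 2 ** level, world_height / 2 ** level


def quad_levels_for(max_dist_err, limit=50):
    """Smallest level whose cells are finer than max_dist_err on both axes."""
    for level in range(1, limit + 1):
        width, height = quad_cell_size(level)
        if width < max_dist_err and height < max_dist_err:
            return level
    return limit


print(cells_at_level(4, 2), cells_at_level(32, 2))  # 16 1024
print(quad_levels_for(0.000009))                    # 26 in this simplified model
```

The sketch shows why specifying maxDistErr is usually more intuitive than choosing maxLevels directly: the required depth follows from the desired precision.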

We will need to clear our index and re-index the location.csv and *.xml files.

Note

The data inside the Solr index for a collection can be deleted entirely using the following two Solr update requests:

http://localhost:8983/solr/collection1/update?stream.body=<delete><query>*:*</query></delete>
http://localhost:8983/solr/collection1/update?stream.body=<commit/>
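
When these requests are issued from a script rather than a browser, the XML in stream.body needs to be URL-encoded. The following Python sketch shows one way to build and send the delete and commit requests; the function names are our own, and the base URL assumes the default collection1 core on localhost.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import urlopen

# Assumed location of the update handler (default example core).
UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def update_url(xml_command):
    """Build an update URL with the XML command URL-encoded in stream.body."""
    return UPDATE_URL + "?" + urlencode({"stream.body": xml_command})


DELETE_ALL = update_url("<delete><query>*:*</query></delete>")
COMMIT = update_url("<commit/>")


def clear_index():
    """Send the delete and commit requests; needs a running Solr instance."""
    for url in (DELETE_ALL, COMMIT):
        with urlopen(url) as response:
            response.read()


print(DELETE_ALL)
print(COMMIT)
```

Calling clear_index() removes every document in the collection, so it should only be run against an index that can be rebuilt.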

Later in this chapter, we will study queries that apply predicates such as Intersects, IsWithin, and others to the store field (of type RPT).

BBoxField (to be introduced in Solr 4.10)

The BBoxField field type can be used to index a single rectangle or bounding box per document field. It supports searching via a bounding box and most spatial search predicates. It has enhanced relevancy modes based on the overlap or area between the search rectangle and the indexed rectangle.

To define it in our schema, we first declare a fieldType of class solr.BBoxField whose numberType attribute refers to a separate fieldType of class solr.TrieDoubleField:

<fieldType name="bbox" class="solr.BBoxField" geo="true" units="degrees" numberType="_bbox_coord" />
<fieldType name="_bbox_coord" class="solr.TrieDoubleField" precisionStep="8" docValues="true" stored="false"/>

Now we define a field of type bbox:

<field name="bbox" type="bbox" />

Since this feature is available in Solr 4.10 onward, we will not delve into the implementation.
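
For orientation only, the following Python sketch shows how a bounding-box value and a query against the bbox field might be composed. It assumes the ENVELOPE(minX, maxX, maxY, minY) syntax and the score local parameter as we understand them from the Solr reference documentation; the helper names and coordinates are hypothetical, and the exact syntax should be verified against the Solr version in use.

```python
def envelope(min_x, max_x, max_y, min_y):
    """Format a rectangle in ENVELOPE syntax (note the minX, maxX, maxY, minY order)."""
    return "ENVELOPE(%s, %s, %s, %s)" % (min_x, max_x, max_y, min_y)


def bbox_query(field, predicate, box, score=None):
    """Compose a {!field} query applying a spatial predicate to a bbox field."""
    local_params = "{!field f=%s" % field
    if score:
        # Assumed relevancy mode, for example overlapRatio.
        local_params += " score=%s" % score
    return "%s}%s(%s)" % (local_params, predicate, box)


# Hypothetical rectangle, for illustration only.
box = envelope(-10, 20, 15, 10)
print(box)
print(bbox_query("bbox", "Intersects", box, score="overlapRatio"))
```

The same ENVELOPE string would serve as the indexed field value for a document, with one rectangle per document field.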
