Hi Gang,
we talked a while ago about adding spatial queries to Qi4j. I hacked on this topic a bit and will trigger a github pull request asap.
Let me summarize in the next lines the most important aspects and design decisions.
I.) General
The main goals was from my side :
- No 3th party dependencies in the Qi4j Core
- Full integration in to the Query DSL
- Practical and easy to understand solution (e.g. no 3D+ support)
II.) Architecture/Design
The implementation has three main blocks :
1.) Qi4j Core (Geometrical Types and Query DSL)
2.) Indexer & Finder implementation (for now ElasticSearch is supported)
3.) A spatial library for converting data types (GeoJSON) and transformation of spatial projections (Grid Coordinate Systems).
spatial relation of two geometries to each other. This relations can be seen as geometrical functions -
like within, equals, distjoint, touches, contains, .. The hole DE-9IM Model specifies in total a matrix
of about 512 geometrical functions. I implemented a indexer/finder just for a very small subset :
- Within
- Disjoint
- Intersects
From my experience this are the most common geometrical functions that are used. They are definitely
domains where much more functions are needed, like CAD, but then most likely a kind of expert system is needed.
The geometries are added to the Qi4j Core are derived from the GeoJSON specification - TPoint, TMultiPoint, TLineString, TMulti-
LineString, TPolygon, TMultiPolygon, TFeature and TFeatureCollection. The "T" prefix was added to have a kind
indication that this is a Qi4j "T"ype as e.g. ElasticSearch defines a number of geometrical types, Spatial4j as well, ..
IV.) Indexer & Finder
For now I added the spatial functions to ElasticSearch. ES is using the so-called quad trees or geohash for spatial indexing. I;m using
Quad tree is in general a very simple indexing method. Each level of the tree has exactly 4 nodes. Within one quadratic node all geometries
(points, ...) has the same spatial position). The required precision/resolution is expressed by the used level. Fortunately one does not need
to define the level directly but it is possible to say "1m" and ES is calculating the required number of levels.
On the implementation itself - well, ElasticSearch hates me, really ! :-)
There are a number of difficulties to be solved to convince ES to work as required. This are in short :
- Spatial mappings in general. Before ANY geometrical type can be indexed it is required to map it accordingly. Means one has to define - this is a
point, this is a line, using indexing method X, with precision Y.
- Spatial mappings for points. ES offers two different indexing method for points - "geoshape" or "gepoint". When a point is indexed as a "geoshape"
then sortings by e.g. distance is supported. When "geoshape" is used then NO sorting filter is available. This has to be considered for the "orderBy"
segments. Further when "gepoint" indexing for points is used then NO disjoint or intersects functions are supported.
- Spatial mapping type names. It is required to use "deep" names, like "city.location", so it was needed to add a way to figure out that "deep" name of a property
Ok, in general this mapping thing took me a while to make it running. Because of the possible supported/unsupported combinations I wrote a kind of matrix
that describes the supported/unsupported combinations. I;m expecting in future some movement on the ES side in terms of supported spatial functions, so a
kind of general "overview"make sense. One can implement the toString() method to get a kind of pretty-printed version of this matrix. This matrix looks for now like :
// ST_Within
supports(enable(ST_WithinSpecification.class), propertyOf(AnyGeometry), filterOf(TPoint.class, TPolygon.class), enable(OrderBy), SpatialConfiguration.INDEXING_METHOD.GEO_POINT);
supports(enable(ST_WithinSpecification.class), propertyOf(AnyGeometry), filterOf(TPoint.class, TPolygon.class), disable(OrderBy), SpatialConfiguration.INDEXING_METHOD.GEO_SHAPE);
// ST_Disjoint
supports(disable(ST_DisjointSpecification.class), propertyOf(AnyGeometry), filterOf(TPoint.class, TPolygon.class), enable(OrderBy), SpatialConfiguration.INDEXING_METHOD.GEO_POINT);
supports(enable(ST_DisjointSpecification.class), propertyOf(AnyGeometry), filterOf(TPoint.class, TPolygon.class), disable(OrderBy), SpatialConfiguration.INDEXING_METHOD.GEO_SHAPE);
// ST_Intersects
supports(disable(ST_IntersectsSpecification.class), propertyOf(AnyGeometry), filterOf(TPoint.class, TPolygon.class), enable(OrderBy), SpatialConfiguration.INDEXING_METHOD.GEO_POINT);
supports(enable(ST_IntersectsSpecification.class), propertyOf(AnyGeometry), filterOf(TPoint.class, TPolygon.class), disable(OrderBy), SpatialConfiguration.INDEXING_METHOD.GEO_SHAPE);
On the ST_Within case : Enable ST_With Spec on arbitrary geometries (point, line, polygon,..) and allow filter geometries of point and polygon, enable orderBy sortings when indexing method is GEOPOINT.
Note that this orderBy constraints are only valid for geometries in the orderBy statements, like e.g. sorting by distance. OrderBy expression without spatial types are not considered by this matrix (means supported).
Another thing is the fact that ES is just supporting the "GPS" aka EPSG 4326 projection. Means when one is using e.g. the Swiss Grid then ES is not able to index it. Therefore I added the possibility
to let convert the projection automatically while be able to control the conversion precision for the Indexer and Finder. This feature is optional and can be switched on/off.
The spatial support is implemented in general as a "add-on/extension" that can be switched on/off and the whole implementation is done as modular as possible.
V.) Hands-on
To create a point in the default projection :
TPoint point1 = TPoint(module).lat(48.13905780942574).lon(11.57958984375).geometry();
To create a point in a arbitrary (supported) projection :
TPoint point2 = TPoint(module).y(1286436.5975464052).x(2389280.7514562616).geometry("EPSG:27572")
To express a query that is searching for all points within a given area (point and radius, unit, default projection) :
Query<A> query = unitOfWork.newQuery(
qb
.where(
ST_Within
(
templateFor(A.class).point(),
TPoint(module).y(48.13905780941111).x(11.57958981111).geometry(),
10, TUnit.METER
)
));
ATTENTION : Depending on the point indexing method (geopoint, geoshape) different filters are used. In case of geopoint the geo_distance_filter
When geoshape indexing is used then a "internal" geometry of type circle is created and then "polygonized" to a polygon with n-points (360 for now)
Means that this is just an approximation !
That expression here is cool :
Query<A> query = unitOfWork.newQuery(
qb
.where(and(
ST_Disjoint
(
templateFor(A.class).point(),
TPoint(module).y(2389280.7514562616).x(1286436.5975464052).geometry("EPSG:27572"),
10, TUnit.METER
),
ST_Within
(
templateFor(A.class).point(),
TPoint(module).y(48.13905780942574).x(11.57958984375).geometry(),
100,TUnit.KILOMETER
)
)
));
The above expression combines within and disjoint expressions and uses two different projections. The "native" EPSG:4326 and EPSG:27572. The latter
is automatically converted to EPSG:4326 while making sure that the conversion error does not exceed the defined error. In this case this are about 2 meters.
VI.) Status & Limitations
Well, it works.. Is this valid, stable, robust ? Honestly I do not know.. I;m going to eat my own dog food in the next weeks, means I;m going to use the
spatial implementation for my productive system. However it looks promising.
Limitations for now are :
- Just a very limited spatial functions (Within, Intersects, Disjoint)
- Indexing of polygons - orientation of vertex order of coordinates has to be added. Required for polygons that cross the dateline.
- Potentially unsupported Query DSL expressions can be defined. I tried to make it as strict as possible, but who knows..
- Exception handling - I think this topic needs some more love..
- Some for documentation/javadoc..
- There are bugs in the code. Lets try to find & fix them..
VII.) Benchmarks/Performance
Benchmarking is hard, really hard. Lot of things has to be considered. Therefore my benchmark setup reflects a scenario that is close to my
production needs. This is a large number of relative simple geometrical objects. Further I;m doing two different benchmarks :
1.) Generation of 100M spatial points that are normally distributed on the world sphere
2.) Importing the whole Open Street Map "Planet.osm" and performing a number of different queries
My overall setup is :
- 3x CoreOS boxes (Xeon, 32 GB RAM, 9 TB disk each)
- Mesos/Marathon/Docker deployment infrastructure
- EntityIndex == ElasticSearch, EntityStore == RIAK
For the 1.) benchmark I can already say that :
- Indexing speed is about 5k/second
- Spatial queries returns back within 20-30ms on 100M records. No real difference between geopoint or geoshape indexing
- Limiting factor is definitely RIAK, need to add some more boxes. ElasticSearch does not care about this load
Interesting is here to play with indexing precision (number of quadtree levels).. Will try it in the next days.
For 2.)
This tests requires some pre-processing. The OSM format has to be first converted to GeoJSON.
For the whole "planet.osm" the transformation takes about 10 hours in my box.
The result is a GeoJSON file that follows the GeoJSON specification. A line is e.g. defined as :
LineString lineString = geoJsonMapper.readValue("{\"type\":\"LineString\",\"coordinates\":[[100.0,0.0],[101.0,1.0]]}",
LineString.class);
The GeoJSON line object can then be converted to a TLineString and directly stored in domain model.
TLineString lineString = TConversions.Convert(module).from(lineString).toTGeometry();
questions like how many concurrent threads are optimal for a given setup, what is the average spatial query
response time, memory usage, and, and, and.. Will keep you posted about that guys.
VIII.) Next Steps
There are some ugly corners, e.g.
.where(
ST_Within
(
templateFor(A.class).point(), // <-- Here, how to figure out the composite type of point() ?
TPoint(module).y(1286436.5975464052).x(2389280.7514562616).geometry("EPSG:27572"),
10, TUnit.METER
)
));
.. I;m using something like
Type typeOfProperty = GenericPropertyInfo.toPropertyType(Classes.typeOf(propertyFunction.accessor()));
Class classOfProperty = Class.forName(typeOfProperty.getTypeName());
Further it would be interesting to add more spatial features backed by e.g. Postgis for those cases where
more spatial functions are required.
Guys lets take a look on the code and let me know where it can be improved. Further any cool ideas are highly welcome !
Thank you.
Cheers,
Jiri