s2 Cell IDs and Elasticsearch Querying: What Works? What is Most Performant?


Avana Vana

Jan 13, 2024, 9:31:29 AM
to s2geometry-io
Hi,

I am hoping to use s2 with Elasticsearch in order to implement geospatial search in my application.

I am trying to determine the most performant way to index and query documents in Elasticsearch using their locations, as represented by s2 geometry.  From what I understand, I can't do something as simple as:

```
"query": {
    "bool": {
        "must": [
            { 
                "simple_query_string": {
                    "fields" : [ "locationId" ],
                    "query" : "abcde* | abcdf* | abce* | abcf* | abcga*". // i.e. an OR-separated list of potential parent cell tokens as prefixes
                }
            },
            ...
        ]
    }
}
```
Where an example ES doc I'd be searching would look like:
```
{
    ...
    "locationId": "abcdefghijklm...n", // i.e. a s2 leaf cell token
    ... 
}
```
So the idea is: on the client, the user provides a location and a search radius in km; I then generate a spherical cap using s2 and create a cell covering from it.  This covering consists of a simple list of s2 cell tokens at some intermediate level (the max and min levels of which can be defined).  Ideally, I wanted to take this list of intermediate/parent-level tokens comprising the cell covering and use those cell tokens as prefixes to search the `locationId` field of documents in Elasticsearch, such that any document whose `locationId` matches at least one of the prefixes in that list would be included in the search results.  But from what I understand, such a simple prefix query is not possible, because one cannot simply truncate a leaf cell token by chopping off characters and searching it as a prefix (see, for instance, the discussion here: https://docs.s2cell.aliddell.com/en/stable/s2_concepts.html#s2-tokens, which says "unlike Geohash, you cannot just chop off characters from a high precision S2 token to get a parent lower precision token, since the trailing 1 bit in the cell ID would not be set correctly in most cases").
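To make the trailing-1-bit problem concrete, here is the relevant bit math sketched in TypeScript with BigInt (no S2 library needed; the leaf ID below is made up purely for illustration):

```
// Sketch of raw S2CellId bit math. The documented layout is: 3 face bits,
// 2 bits per level, a trailing 1 bit, then zero padding out to 64 bits.

// The "level bit" (lowest set bit) of a cell at a given level.
const lsbForLevel = (level: number): bigint => 1n << BigInt(2 * (30 - level));

// Parent at a coarser level: clear all bits below the parent's level bit,
// then set that level bit as the new trailing 1.
const parentId = (id: bigint, level: number): bigint =>
  (id & -lsbForLevel(level)) | lsbForLevel(level);

// An S2 token is the 16-hex-digit cell ID with trailing zeros stripped.
const toToken = (id: bigint): string =>
  id.toString(16).padStart(16, '0').replace(/0+$/, '') || 'X';

const leaf = 0x89c25c31d2d2d2d5n; // hypothetical leaf cell ID
console.log(toToken(leaf));               // "89c25c31d2d2d2d5"
console.log(toToken(parentId(leaf, 10))); // "89c25d", NOT "89c25c": the
// trailing 1 bit carries into the last hex digit, so chopping six characters
// off the leaf token would give the wrong parent token here.
```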

I also came across this implementation of s2 with Elasticsearch in PHP (https://github.com/to-masz/s2examples), where the author stores not just the leaf cell token for each document, but recursively computes every one of that leaf cell's parent cell tokens up to "face" level, and stores the entire list as a simple whitespace-separated string in each Elasticsearch document.  The author then queries Elasticsearch with a regular full-text match query, matching the list of tokens in the cell covering against this whitespace-separated list of tokens representing a document's leaf cell location and all of its parents.

However, this seems like a lot more text to search.  What would be the most performant way to index and then query documents in Elasticsearch using s2?  If I can't simply truncate tokens because of the trailing-1-bit issue, can this be done with the full binary representation of a cell ID?  From what I understand, if you consider the binary representation of a cell ID as a string, you first have the face ID from 000-101 as the first 3 characters (i.e. bits 0-2); from there, every two characters (i.e. bits) represents one additional level of precision/recursion in the Hilbert curve, with up to 30 levels of precision possible; and finally, the string is terminated with a trailing '1' after its final 2-bit level of precision, then padded with zeroes, if necessary, out to 64 bits.

Because of this, couldn't I just calculate the binary representations of all my locationIds and store them as strings in Elasticsearch?  Then, when a user performs a search and provides location + radius, I could take the binary representation of each cell in the resulting covering, drop the trailing 1 bit and any zeroes that pad it out to the right, and use the remaining string as a prefix query to match the locationIds of documents also stored in binary form, right?  Furthermore, I might consider not even storing a representation of the leaf cell ID, but instead the binary representation of the leaf cell's parent at the maximum level of precision used by the region coverer.  In other words, if I am only ever querying with a precision of at most s2 level 8, for example, it makes no difference if I store an ID with more precision than level 8.  If all queries were to return coverings of cells with precision at most level 8 and at minimum level 6 or so, I could just truncate all of my leaf cell IDs to level 8 when indexing documents in Elasticsearch.
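For what it's worth, the transformation I'm describing would be tiny in code; a sketch (binaryPrefix is just a hypothetical helper name):

```
// Render a 64-bit cell ID as a binary string and strip the trailing 1 bit
// plus its zero padding; what remains is shared by all descendant cells.
const binaryPrefix = (id: bigint): string =>
  id.toString(2).padStart(64, '0').replace(/10*$/, '');

// e.g. a level-8 cell keeps its 3 face bits + 8 * 2 quadrant bits = 19 bits,
// and every descendant's 64-bit binary string starts with those 19 bits.
```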

Please let me know if my reasoning is correct, whether this method would work, and which of the methods discussed here would be the most performant way to index and query s2 locations with Elasticsearch.

Thanks,
-a

een...@google.com

Jan 22, 2024, 12:23:49 PM
to s2geometry-io
Hello there, two things:

1. Apologies for not answering your question promptly. Our Q&A work has mostly moved to the github repo: https://github.com/google/s2geometry/discussions

2. Your use case should be well solved by the S2RegionTermIndexer: https://github.com/google/s2geometry/blob/master/src/s2/s2region_term_indexer.h#L18

If you have follow-up questions, we'd love to help, but I suggest you use the Github discussions for faster responses.

Kind regards,

- Eric

Avana Vana

Feb 7, 2024, 2:41:23 AM
to s2geometry-io
Hi,

Thank you so much for your response and for the information and redirection to GitHub for discussions.

As it happens, I am working with a Node.js port of/wrapper for S2 from Radar Labs, and the full C++ library is not an option.  Unfortunately, this particular wrapper does not have bindings for the S2RegionTermIndexer class.

However, I did examine S2RegionTermIndexer in the original library, and I ended up implementing something similar in Node that works very well.  I'll describe my implementation and share some results below for posterity, in case anyone else is working with S2 and Elasticsearch (with point data and covering/radius-style queries).  Such an implementation can easily be accomplished in any language where a decent port of S2 is available.

Implementation

While the @radarlabs/s2 port does not come with bindings for S2RegionTermIndexer, it does expose two useful convenience methods, s2.regionCoverer.getRadiusCoveringIds() and s2.regionCoverer.getRadiusCoveringTokens(), which make use of S2's RegionCoverer class (specifically the RegionCoverer::getCovering() method) combined with S2Cap.  The first simply returns a list of IDs for the resulting S2CellUnion covering, while the second converts them to a list of tokens before returning it.
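For reference, usage looks roughly like the sketch below.  The method name is as above, but the exact constructor, radius units, and option fields are my assumptions about the @radarlabs/s2 API, so check the library's docs:

```
import * as s2 from '@radarlabs/s2';

// Covering of a spherical cap around a point; option names are assumed to
// mirror S2RegionCoverer's min/max level and max cells parameters, and the
// radius is assumed to be in meters.
const center = new s2.LatLng(40.7265, -73.9815); // Tompkins Square Park
const tokens = s2.regionCoverer.getRadiusCoveringTokens(center, 50_000, {
  min: 7,
  max: 16,
  maxCells: 20,
});
// tokens: e.g. ["89c15c", "89c17", ...], ready to use in an ES match query
```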

In my specific use case, users who share their device location can search for other users near them by specifying a query radius that may range from 1 to 50 miles.  Both to anonymize/obfuscate exact user location data and to optimize search, I reasoned that, given this restriction on possible query size and an upper limit on the number of S2 cells used to construct any given covering, I could constrain the S2 cell level of the query between a minimum and a maximum, based on the smallest and largest S2 cells that could ever be needed to cover an S2Cap with that range of radii.  I determined the levels partly by examining the average sizes of S2 cells given here, to get a general idea of which levels would work best, and then tested different levels experimentally, using a fork of this lightweight region covering web app that uses S2 (the Go port) + Web Assembly + JavaScript + Leaflet, which I customized with extra UI components that let me get coverings (as lists of tokens) of specific S2Cap regions by providing latitude, longitude, and radius as either exact or interactive (on the map) input.  These experiments showed that, given my app's requirements, the finest S2 cell level that ever needs to be used is 16, while the coarsest is 7 (i.e., min_level = 7, max_level = 16), constraining the covering to just a third of all available levels.

I then reasoned that, rather than storing (and thus indexing and searching) more or less "exact" user locations as coordinates or even as S2 leaf cells, I could index and search these locations in Elasticsearch by storing each user location as a list containing the level-16 (the experimentally determined finest level) parent of the user's location, along with every other parent cell up to level 7 (the experimentally determined coarsest level), as tokens concatenated into a single whitespace-separated string.  The location field containing this string of hierarchical tokens then only needs a mapping that defines the type as "text" and the analyzer as "whitespace".

```
PUT /users_ctokenlist
{
  "mappings": {
    "properties": {
      "location": { "type": "text", "analyzer": "whitespace" }
    }
  }
}
```
Creating the index in Elasticsearch.
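To produce the location string itself, here is a minimal sketch using the same raw S2CellId bit math as earlier (in practice the leaf ID would come from the S2 library for the user's coordinates; the helper names are my own):

```
// Build the whitespace-separated "ctokenlist" for one user: the tokens of
// every parent cell from level 7 (coarsest) through level 16 (finest).
const lsbForLevel = (level: number): bigint => 1n << BigInt(2 * (30 - level));
const parentId = (id: bigint, level: number): bigint =>
  (id & -lsbForLevel(level)) | lsbForLevel(level);
const toToken = (id: bigint): string =>
  id.toString(16).padStart(16, '0').replace(/0+$/, '') || 'X';

function ctokenlist(leafId: bigint, minLevel = 7, maxLevel = 16): string {
  const tokens: string[] = [];
  for (let level = minLevel; level <= maxLevel; level++) {
    tokens.push(toToken(parentId(leafId, level)));
  }
  return tokens.join(' '); // stored in the "location" field
}
```

The resulting string looks like the "location" values in the results below.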

With this simple mapping, any location string like those described above is automatically split into terms by Elasticsearch upon ingestion and structured into an inverted index that can be searched extremely quickly.  The reason this works is that each user query generates a list of S2 cell tokens within the same range of levels 7-16 using the s2.regionCoverer.getRadiusCoveringTokens() method; if any indexed user falls within the S2Cap region defined by the querying user's location and radius, one of those query tokens will match one of the tokens in the string stored with that user's location.

I tested this query in a live instance of ES using a set of 100,000 random user locations (the maximum number of documents that would belong to any [geo]shard+index) that I generated in GIS, and the results of the ES query with this method of indexing and searching proved identical to the physically observed data in GIS.  I compared the results in GIS by using the "Export to GeoJSON" function built into the region coverer web tool I mentioned, using the same parameters for min/max level and max # of cells for the covering, and then overlaying the GeoJSON file and a circle of the same radius centered on the same point in GIS.  I include some screenshots below in the results section.  Internally I refer to this method of indexing/querying as "Constrained Hierarchical Token List" or, more succinctly, "ctokenlist".

The whole query looks something like this in ES:
```
GET /users_ctokenlist/_search
{
  "size": 10000,
  "query": {
    "match": {
      "location": "89c15c 89c17 89c19 89c1f 89c24 89c2c 89c31 89c33 89c344 89c37 89c3c 89dd3 89dd5 89e81 89e83 89e85 89e86aac 89e9b 89e9d 89e9f"
    }
  }
}
```
Note: these S2 tokens represent a covering of an S2Cap with a 50 km radius centered on Tompkins Square Park in NYC's East Village.
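In Node, sending this with the official @elastic/elasticsearch client looks roughly like the sketch below; the exact request shape varies by client version, so treat it as an assumption:

```
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// tokens: the covering tokens from the region coverer, joined on whitespace
// so the whitespace analyzer can match them against stored locations.
async function searchByCovering(tokens: string[]) {
  return client.search({
    index: 'users_ctokenlist',
    size: 10000,
    query: { match: { location: tokens.join(' ') } },
  });
}
```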

Results
The results in Elasticsearch for the above query look like this:
{ "took": 2, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 13, "relation": "eq" }, "max_score": 8.8049755, "hits": [ { "_index": "users_ctokenlist", "_id": "66749929-451c-4715-86c3-80d71615ece4", "_score": 8.8049755, "_source": { "uid": "66749929-451c-4715-86c3-80d71615ece4", "location": "89c34 89c31 89c30c 89c30b 89c30b4 89c30b5 89c30b5c 89c30b5d 89c30b5dc 89c30b5dd", } } ... +12 more hits like this ] } }

For comparison's sake, I will list the "uid" of each of the above 13 hits, so you can see them in relation to the test I performed in GIS, which I will share below.

01. 66749929-451c-4715-86c3-80d71615ece4
02. 2b5bbb3c-0644-4819-83f9-faf34ec0167a
03. 643e6521-60b8-45f7-892e-32353c600c46
04. aea6b0f4-12a9-42cc-b312-3e53fe6a8be7
05. 91929a78-83a3-48ae-babd-43844c7b0a1e
06. 8f6a0f7c-69c1-46fb-9efd-3644cbadf1bb
07. 0c46c5d4-a1de-494f-aae1-327a70586caa
08. 6c76444a-1631-4704-abf7-206088800e2a
09. fc3c352d-7898-4300-92de-5d462f5d6e71
10. 822571a3-6dca-4446-9ec3-ca1a9cdd9416
11. 143839b5-1dcb-4369-b6dc-92a51921f7a5
12. 833b4149-327d-4f41-a80a-ec23b5bb6595
13. a055cdec-4219-4ece-b450-46c503d2f500

Below: the process of testing cell coverings using the S2 (Go) + Web Assembly + JS tool I mentioned above, using the same parameters used to generate the "ctokenlist" strings representing each data point's location in ES, as well as how these coverings were exported to GeoJSON for later importing into GIS:

[Screenshot: Screen Shot 2024-01-26 at 9.57.39 AM.png]

Below: the random dataset in GIS (grey dots), together with the same 50 km radius S2Cap centered on Tompkins Square Park in NYC (translucent purple ellipse with thick purple border; the geometry is actually circular, but the projection I was using in GIS compresses it in the latitudinal direction), along with the cell covering generated using the given parameters (min_level = 7, max_level = 16, max_cells = 20).  The "hits" (those data points that intersect the cell covering, drawn as a pink dotted line and translucent rhombs) are shown as red dots, and the center of the query is symbolized with a purple "+".

[Screenshot: Screen Shot 2024-01-26 at 10.15.53 AM.png]

The red dots (hits) are actually selected here in the GIS program, and their values/uids can be seen on the far right side of the image. If you compare this to the ES results above, you'll see that they are exactly the same—a successful result.

Another Implementation

I will also share another method I tried, which I have been referring to internally as "Truncated Binary String" (or "tbstring" for short).  It worked nearly as well, but the "ctokenlist" method proved to have a slight edge (I will explain why below).  Nevertheless, I will share it here in case anyone is interested.

This method takes advantage of the fact that every S2 (leaf) cell ID contains all of the information for all of its hierarchical parents, encoded in its binary representation.  For example, given a cell ID with the binary representation 1000100111000010010110010111100001010110111111110100001100011001, the first 3 digits represent one of six level-0 "faces" (numbered 0-5, or rather 000-101), and following this, every pair of digits represents one of four quadrants at a given level of the Hilbert curve, down to level 30; finally, a trailing "1" is appended to the end.  Because of this, the binary representation of the level-16 parent of this leaf cell (using my experimentally determined range of levels 7-16, once again) is identical to the binary representation of the leaf cell up to 3 + (16 * 2) = 35 digits.  And indeed, if you calculate the level-16 parent cell ID and compare its binary representation to the leaf cell's, you will see this is the case: the parent ID is identical to the leaf cell ID up to the 35th digit, after which the leaf cell ID contains more information, while the level-16 parent cell ID just has a trailing "1" digit (as do all S2 cell IDs), followed by zeroes that fill out the 64 bits.  The same is true for the level-7 parent of this leaf cell, and for any parent of this leaf cell, so we have:

Leaf Cell ID:     1000100111000010010110010111100001010110111111110100001100011001
Lvl 16 Parent ID: 1000100111000010010110010111100001010000000000000000000000000000
Lvl 7 Parent ID:  1000100111000010010000000000000000000000000000000000000000000000

Knowing this, we can simply truncate every cell ID after the 35th digit and store this truncated string in Elasticsearch instead of the entire string: since the finest cell level in this implementation is 16, any information after the 35th digit will never be used, so storing it just wastes space in ES when multiplied across 100,000 documents.  The truncation additionally helps to partially obfuscate other users' exact locations in the event this string needs to be returned to the client for some reason, where it could be intercepted by anyone listening to network requests.

To query these "truncated binary strings" in Elasticsearch, we can get the list of cell IDs that cover the query region using the convenience method s2.regionCoverer.getRadiusCoveringIds() provided by Radar Labs' Node.js port of S2, convert each ID in the list to binary, and then replace the final trailing "1" and all of the zeroes that follow it with an asterisk ("*", the prefix operator in Elasticsearch) using a regular expression (e.g. /([01]+)10*/).  From here we just execute a "simple_query_string"-type ES query, employing prefix wildcards (the "*" operator) and Boolean OR logic (the "|" operator), to match any user location stored as a "truncated binary string" that either falls within or exactly matches one of the cell IDs in the covering (given by the region coverer for an S2Cap defined by the user's location and radius preference); in other words, a "hit", in ES parlance.
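Putting that together, the query assembly looks roughly like this (a sketch: getRadiusCoveringIds, its options, and its return type are my assumptions about the @radarlabs/s2 API, and I use an end-anchored variant of the regex above):

```
import * as s2 from '@radarlabs/s2';

// Covering cell IDs for the user's cap (signature assumed, as before;
// the IDs are assumed to arrive as, or be convertible to, 64-bit BigInts).
const center = new s2.LatLng(40.7265, -73.9815);
const ids: bigint[] = s2.regionCoverer.getRadiusCoveringIds(center, 50_000, {
  min: 7,
  max: 16,
  maxCells: 20,
});

// Each ID becomes a binary prefix plus the ES "*" wildcard; /10*$/ strips
// the trailing 1 bit and its zero padding in one anchored replace.
const queryString = ids
  .map((id) => id.toString(2).padStart(64, '0').replace(/10*$/, '*'))
  .join(' | ');

const body = {
  size: 10000,
  query: {
    simple_query_string: { query: queryString, fields: ['location'] },
  },
};
```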

The whole query looks something like this in ES:
```
GET /users_tbstring/_search
{
  "size": 10000,
  "query": {
    "simple_query_string": {
      "query": "100010011100000101011* | 1000100111000001011* | 1000100111000001100* | 1000100111000001111* | 10001001110000100* | 10001001110000101* | 1000100111000011000* | 1000100111000011001* | 100010011100001101000* | 1000100111000011011* | 10001001110000111* | 1000100111011101001* | 1000100111011101010* | 1000100111101000000* | 1000100111101000001* | 1000100111101000010* | 10001001111010000110101010101* | 1000100111101001101* | 1000100111101001110* | 1000100111101001111*",
      "fields": [ "location" ]
    }
  }
}
```

The hits for this search are also right on, but it runs just slightly slower (fractions of a millisecond).  The reason, I believe, is that prefix/wildcard queries are generally more expensive in ES than simple term lookups in an inverted index.

Anyway, if you made it this far, thank you for your time. I hope this post helps someone else who is wondering how to accomplish this without the S2RegionTermIndexer class available, and specifically in Elasticsearch.

Best,
Avana

Sean McAllister

Feb 8, 2024, 1:45:25 PM
to s2geometry-io
This was a great read, thanks for posting it.  Do you intend to support non-point geometry?  You can do something similar with polylines and polygons by covering them, but you don't necessarily want to have to store all the parent cell IDs too.  Does Elasticsearch support range queries?  If so, you could snap points to the nearest leaf S2CellId (~1 cm resolution) and store that; querying for points in a covering is then a series of range queries on the cells in the covering.

E.g. if you have four cells in the covering, you could do something like:

```
SELECT * FROM Table WHERE
     (cell0.range_min() <= point AND point <= cell0.range_max())
  OR (cell1.range_min() <= point AND point <= cell1.range_max())
  OR (cell2.range_min() <= point AND point <= cell2.range_max())
  OR (cell3.range_min() <= point AND point <= cell3.range_max())
  ...
```

...and so on for however many cells you decide to cover with.  You'd then only have to store a single value per point, and if the data were stored sorted by point S2CellId, you'd get some degree of spatial locality too.
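For what it's worth, here's a sketch of how those per-cell ranges might be assembled for Elasticsearch (the "cell" field name and a numeric unsigned_long-style mapping are assumptions; range_min/range_max follow the S2CellId layout, where a cell spans [id - (lsb - 1), id + (lsb - 1)] in leaf-ID space):

```
// Lowest set bit of a cell ID; the cell covers all leaf IDs within
// (lsb - 1) of its own ID on either side.
const lsb = (id: bigint): bigint => id & -id;

// One Elasticsearch range clause per covering cell. Bounds are serialized
// as strings because 64-bit values exceed JS number precision.
const rangeClause = (id: bigint) => ({
  range: {
    cell: {
      gte: (id - (lsb(id) - 1n)).toString(),
      lte: (id + (lsb(id) - 1n)).toString(),
    },
  },
});

// OR the per-cell ranges together, mirroring the SQL sketch above.
const buildRangeQuery = (coveringIds: bigint[]) => ({
  query: {
    bool: {
      should: coveringIds.map(rangeClause),
      minimum_should_match: 1,
    },
  },
});
```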

I think this would work pretty well for points, but you'd need some way to join with a table mapping cell -> feature for 1D and 2D features that would be covered by multiple cells.

Sean