Hi,
Thank you so much for your response, for the information, and for pointing me to GitHub for discussions.
As it happens, I am working with Radar Labs' Node.js port of/wrapper for S2, and using the full C++ library is not an option. Unfortunately, this particular wrapper does not include bindings for the S2RegionTermIndexer class.
However, I did examine S2RegionTermIndexer in the original library, and I ultimately implemented something similar in Node that works very well. I'll describe my implementation and share some results below for posterity's sake, in case anyone in the future is working with S2 and Elasticsearch (with point data and covering radius-style queries). Such an implementation can be accomplished easily in any language where a decent port of S2 is available.
Implementation
While the @radarlabs/s2 port does not come with bindings for S2RegionTermIndexer, it does expose two useful convenience methods: s2.regionCoverer.getRadiusCoveringIds() and s2.regionCoverer.getRadiusCoveringTokens(). Both make use of S2's RegionCoverer class (specifically the RegionCoverer::getCovering() method) combined with S2Cap; one simply returns the resulting S2CellUnion covering as a list of cell IDs, while the other converts those IDs to a list of tokens before returning them.
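For illustration, calling these two convenience methods looks roughly like the following. This is only a minimal sketch: the argument order and the option names (min/max level, max cells) are assumptions on my part, so check the wrapper's documentation for the exact signature.
const s2 = require('@radarlabs/s2');

// Approximate center of Tompkins Square Park, NYC (used in the examples below).
const center = new s2.LatLng(40.7266, -73.9815);
const radiusMeters = 50000;
// Option names are an assumption; the level constraints are explained below.
const options = { min: 7, max: 16, max_cells: 20 };

// The covering of the S2Cap as a list of cell IDs...
const ids = s2.regionCoverer.getRadiusCoveringIds(center, radiusMeters, options);
// ...or as hex tokens, ready to be joined into a query string.
const tokens = s2.regionCoverer.getRadiusCoveringTokens(center, radiusMeters, options);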
In my specific use case, users who share their device location can search for other users near them by specifying a query radius of anywhere from 1 to 50 miles. Both to anonymize/obfuscate exact user location data and to optimize search, I reasoned that, given this restriction on possible query size and an upper limit on the number of S2 cells used to construct any given covering, I could restrict the S2Cell levels used in queries to a fixed range, based on the smallest and largest S2Cell that could possibly appear in a covering of an S2Cap with a radius in that range. I determined the level range partially by examining the average sizes of S2 cells given here, to get a general idea of which levels would work best, and then I tested different levels experimentally using a fork of this lightweight region-covering web app built with S2 (the Go port) + WebAssembly + JavaScript + Leaflet, which I customized with extra UI components that let me get coverings (as lists of tokens) of specific S2Cap regions by providing latitude, longitude, and radius as either exact or interactive (on the map) input. These experiments showed that, for my app's requirements, the finest S2Cell level that ever needs to be used is 16 and the coarsest is 7 (i.e., min_level = 7 and max_level = 16 in RegionCoverer terms), constraining coverings to ten of the 31 available levels, roughly a third.
I then reasoned that, rather than storing (and thus indexing and searching) more or less "exact" user locations as coordinates or even as S2 leaf cells, I could index and search these locations in Elasticsearch by storing every user location as a list of tokens: the level-16 parent of the user's location (the experimentally determined finest level) together with every further parent cell up to level 7 (the experimentally determined coarsest level), concatenated into a single whitespace-separated string. The Elasticsearch field holding this string of hierarchical tokens then only needs a mapping that defines its type as "text" and its analyzer as "whitespace".
PUT /users_ctokenlist
{
  "mappings": {
    "properties": {
      "location": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}
Creating the index in Elasticsearch.
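For reference, here is a sketch of how the per-user token string can be built in Node. I'm assuming the wrapper mirrors the C++ API here (a CellId constructible from a LatLng, with parent(level) and token() methods); if the binding names differ, adjust accordingly.
const s2 = require('@radarlabs/s2');

const COARSEST_LEVEL = 7; // largest cells ever used in a covering
const FINEST_LEVEL = 16;  // smallest cells ever used in a covering

function ctokenlist(lat, lng) {
  // Leaf cell containing the user's location.
  const leaf = new s2.CellId(new s2.LatLng(lat, lng));
  const tokens = [];
  // Token of every parent cell from level 16 up to level 7.
  for (let level = FINEST_LEVEL; level >= COARSEST_LEVEL; level--) {
    tokens.push(leaf.parent(level).token());
  }
  return tokens.join(' ');
}
Building the whitespace-separated "ctokenlist" string stored for each user.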
With this simple mapping, any location string like those described above is automatically split into terms by Elasticsearch upon ingestion and structured into an inverted index that can be searched extremely quickly. The reason this works is that user queries generate a list of S2Cell tokens within the same range of levels (7-16) using the s2.regionCoverer.getRadiusCoveringTokens() method, so if any indexed user lies in the S2Cap region defined by the querying user's location and radius, one of the query tokens will match one of the tokens in the string stored with that user's location.
I tested this query in a live instance of ES using a set of 100,000 random user locations (the maximum number of documents that would belong to any [geo]shard+index) that I generated in GIS, and the results of the ES query with this method of indexing and searching location data proved identical to the physically observed data in GIS. I compared the results in GIS by using the "Export to GeoJSON" function built into the region-coverer web tool I mentioned, with the same parameters for min/max level and max number of cells for the covering, and then overlaying the GeoJSON file and a circle of the same radius centered on the same point in GIS. I'll include some screenshots below in the results section. Internally, I refer to this method of indexing/querying as "Constrained Hierarchical Token List" or, more succinctly, "ctokenlist".
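As a sketch of the glue, the ES query body can be assembled directly from the covering tokens (reusing the tokens variable from the earlier sketch):
// "tokens" covers the querying user's S2Cap at levels 7-16 (max 20 cells).
const esBody = {
  size: 10000, // comfortably above the expected number of matches
  query: { match: { location: tokens.join(' ') } }
};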
The whole query looks something like this in ES:
GET /users_ctokenlist/_search
{
  "size": 10000,
  "query": {
    "match": {
      "location": "89c15c 89c17 89c19 89c1f 89c24 89c2c 89c31 89c33 89c344 89c37 89c3c 89dd3 89dd5 89e81 89e83 89e85 89e86aac 89e9b 89e9d 89e9f"
    }
  }
}
Note: these S2 tokens represent a covering of an S2Cap with a 50 km radius, centered on Tompkins Square Park in NYC's East Village.
Results
The results in Elasticsearch for the above query look like this:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 13,
      "relation": "eq"
    },
    "max_score": 8.8049755,
    "hits": [
      {
        "_index": "users_ctokenlist",
        "_id": "66749929-451c-4715-86c3-80d71615ece4",
        "_score": 8.8049755,
        "_source": {
          "uid": "66749929-451c-4715-86c3-80d71615ece4",
          "location": "89c34 89c31 89c30c 89c30b 89c30b4 89c30b5 89c30b5c 89c30b5d 89c30b5dc 89c30b5dd"
        }
      }
      ...
      +12 more hits like this
    ]
  }
}
For comparison's sake, I will list the "uid" of each of the above 13 hits, so you can see them in relation to the test I performed in GIS, which I will share below.
01. 66749929-451c-4715-86c3-80d71615ece4
02. 2b5bbb3c-0644-4819-83f9-faf34ec0167a
03. 643e6521-60b8-45f7-892e-32353c600c46
04. aea6b0f4-12a9-42cc-b312-3e53fe6a8be7
05. 91929a78-83a3-48ae-babd-43844c7b0a1e
06. 8f6a0f7c-69c1-46fb-9efd-3644cbadf1bb
07. 0c46c5d4-a1de-494f-aae1-327a70586caa
08. 6c76444a-1631-4704-abf7-206088800e2a
09. fc3c352d-7898-4300-92de-5d462f5d6e71
10. 822571a3-6dca-4446-9ec3-ca1a9cdd9416
11. 143839b5-1dcb-4369-b6dc-92a51921f7a5
12. 833b4149-327d-4f41-a80a-ec23b5bb6595
13. a055cdec-4219-4ece-b450-46c503d2f500
Below: the process of testing cell coverings using the S2 (Go) + WebAssembly + JS tool mentioned above, with the same parameters used to generate the "ctokenlist" strings representing each data point's location in ES, and how these coverings were exported to GeoJSON for later import into GIS:

Below: the random dataset in GIS (grey dots), together with the same 50 km radius S2Cap centered on Tompkins Square Park in NYC (translucent purple ellipse with thick purple border; the geometry seen here is circular, but the specific projection I was using in GIS makes it look compressed in the latitudinal direction), along with the cell covering generated using the given parameters (min_level = 7, max_level = 16, max_cells = 20). The "hits", i.e. those data points that intersect the cell covering (pink dotted line and translucent rhombs), are shown as red dots. The center of the query is symbolized with a purple "+".

The red dots (hits) are actually selected here in the GIS program, and their values/uids can be seen on the far right side of the image. If you compare this to the ES results above, you'll see that they are exactly the same—a successful result.
Another Implementation
I will also share another method I tried, which I have been referring to internally as "Truncated Binary String" (or "tbstring" for short). It worked nearly as well, but the "ctokenlist" method proved to have a slight edge (I will explain why below). Nevertheless, I'll share it here in case anyone is interested.
This method takes advantage of the fact that every S2 (leaf) cell ID contains all of the information for all of its hierarchical parents, encoded in its binary representation. For example, given a cell ID with the binary representation 1000100111000010010110010111100001010110111111110100001100011001, the first 3 bits represent one of six level-0 "faces" (numbered 0-5, or rather 000-101), and following this, every pair of bits represents one of four quadrants at a given level along the Hilbert curve, down to level 30; finally, a trailing "1" bit is appended (and, for non-leaf cells, zeroes pad the ID out to 64 bits). Because of this, the binary representation of the level-16 parent of this leaf cell (using my experimentally determined range of levels 7-16, once again) is identical to the binary representation of the leaf cell up to 3 + (16 * 2) = 35 bits. And indeed, if you calculate the level-16 parent cell ID and compare its binary representation to the leaf cell's, you will see this is the case: the parent ID is identical to the leaf cell ID through the 35th bit, after which the leaf cell ID contains more information, while the level-16 parent ID just has the trailing "1" bit (as all S2 cell IDs have), followed by zeroes that fill out the 64 bits. The same is true for the level-7 parent of this leaf cell, and for any parent of this leaf cell, so we have:
Leaf Cell ID: 1000100111000010010110010111100001010110111111110100001100011001
Lvl 16 Parent ID: 1000100111000010010110010111100001010000000000000000000000000000
Lvl 7 Parent ID: 1000100111000010010000000000000000000000000000000000000000000000
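As a quick sanity check of this prefix property in Node (assuming, as before, that CellId exposes parent(level), plus an id() method returning the 64-bit ID as a BigInt):
const s2 = require('@radarlabs/s2');

// 64-bit binary string for a cell ID; the left-padding matters for cells on
// faces 0 and 1, whose IDs have leading zero bits.
const bits = (cellId) => BigInt(cellId.id()).toString(2).padStart(64, '0');

const leaf = new s2.CellId(new s2.LatLng(40.7266, -73.9815));
// The level-16 parent shares the leaf's first 3 + (16 * 2) = 35 bits.
console.log(bits(leaf).slice(0, 35) === bits(leaf.parent(16)).slice(0, 35)); // true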
Knowing this, we can simply truncate every leaf cell ID after the 35th bit and store this truncated string in Elasticsearch instead of the entire string. Because the finest cell level in the implementation is 16, any information past the 35th bit will never be utilized; storing it just wastes space in ES once multiplied by 100,000 documents. The truncation additionally helps to partially obfuscate other users' exact locations in the event this string needs to be returned to the client for some reason, where it could be intercepted by anyone listening to network requests.
To query these "truncated binary strings" in Elasticsearch, we can simply get the list of cell IDs that cover the query region using the s2.regionCoverer.getRadiusCoveringIds() convenience method provided by Radar Labs' Node.js port of S2, convert each ID in the list to binary, and then replace the final trailing "1" and the zeroes after it with an asterisk ("*", the prefix operator in Elasticsearch) using a regular expression (e.g. /^([01]+)10*$/ with replacement "$1*"). From here we just execute a "simple_query_string"-type ES query, employing prefix wildcards (the "*" character/operator) and Boolean OR logic (the "|" character/operator), so the query matches any user whose location, stored as a "truncated binary string", is either within or exactly matches one of the cell IDs in the query region's covering (given by S2RegionCoverer for an S2Cap defined by the user's location and radius preference); in other words, a "hit", in ES parlance.
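Sketched in Node, under the same assumptions about the wrapper's API as above (with getRadiusCoveringIds() assumed to return 64-bit IDs convertible to BigInt):
const s2 = require('@radarlabs/s2');

// Index side: the leaf ID's binary form, truncated to 3 + (16 * 2) = 35 bits.
function tbstring(lat, lng) {
  const leaf = new s2.CellId(new s2.LatLng(lat, lng));
  return BigInt(leaf.id()).toString(2).padStart(64, '0').slice(0, 35);
}

// Query side: each covering cell ID becomes a binary prefix plus a "*" wildcard.
function tbstringQuery(lat, lng, radiusMeters) {
  const ids = s2.regionCoverer.getRadiusCoveringIds(
    new s2.LatLng(lat, lng), radiusMeters, { min: 7, max: 16, max_cells: 20 }
  );
  return ids
    .map((id) => BigInt(id).toString(2).padStart(64, '0'))
    // Strip the trailing "1" sentinel and its zero padding; keep a wildcard.
    .map((bin) => bin.replace(/^([01]+)10*$/, '$1*'))
    .join(' | ');
}
Building the stored "tbstring" and the "simple_query_string" query string.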
The whole query looks something like this in ES:
GET /users_tbstring/_search
{
  "size": 10000,
  "query": {
    "simple_query_string": {
      "query": "100010011100000101011* | 1000100111000001011* | 1000100111000001100* | 1000100111000001111* | 10001001110000100* | 10001001110000101* | 1000100111000011000* | 1000100111000011001* | 100010011100001101000* | 1000100111000011011* | 10001001110000111* | 1000100111011101001* | 1000100111011101010* | 1000100111101000000* | 1000100111101000001* | 1000100111101000010* | 10001001111010000110101010101* | 1000100111101001101* | 1000100111101001110* | 1000100111101001111*",
      "fields": [ "location" ]
    }
  }
}
The hits for this search are also right on, but it runs just slightly slower (by fractions of a millisecond). I believe the reason is that prefix/wildcard queries are generally more expensive in ES than simple term lookups in an inverted index.
Anyway, if you made it this far, thank you for your time. I hope this post helps someone else who is wondering how to accomplish this without the S2RegionTermIndexer class available, and specifically in Elasticsearch.
Best,
Avana