Layout of multi-key indices as key-value pairs?

37 views
Skip to first unread message

Martin Häusler

unread,
Jul 15, 2022, 7:33:56 AMJul 15
to wiredtiger-users
Hi!

I recently discovered that MongoDB can index nested fields with so-called "multi-key indices". I'm very curious how these indices look like on a key-value level.

For a regular (non-unique) field index on a scalar field, the layout is quite simple. The key has to start with the indexed BSON value (potentially subject to collation), followed by the document ID which contains the value. This can be accomplished with a separator or by post-fixing the composite key with the length of the BSON value such that the byte-array can be disassembled again. So for a basic index we might end up with a composite secondary index key like this:

[BSON value][document id][number of bytes in bson value]

The value is often NULL (e.g. MySQL does that). For unique indices, we can move the document ID to the value column. Since we know for each index table which document key was used to create them, we don't need to encode that in every entry.

For multikey indices, the situation is different. If we use the same layout as depicted above, then we lose the information about the path we took from the document root to the given value. We essentially throw the contents of all array elements into one big pot. While this may be sufficient for some queries (e.g. "return all documents where *some* entry of the array matches a condition"), it is insufficient for others (e.g. "return all documents where the *first* entry of the array matches a condition).

How is this layout done in MongoDB and WiredTiger?

mil...@mongodb.com

unread,
Jul 15, 2022, 11:46:50 AMJul 15
to wiredtiger-users
Hi Martin,
The links here will help explain.  We used to keep the document ID in the value for unique index K-V records, but we recently moved it to be in the key instead.
(basic internal index format)

Martin Häusler

unread,
Jul 15, 2022, 1:03:13 PMJul 15
to wiredtiger-users
Hi Eric,

thanks for taking the time to read my post and for the response! Unfortunately I'm getting 404s on the first two links you provided (the last one is working). I double checked - I am logged into my GitHub account but there seems to be an issue with the links. The URL "https://github.com/10gen" works, but "https://github.com/10gen/mongo" does not... Maybe the project is private?

Thanks,
Martin

Martin Häusler

unread,
Jul 15, 2022, 1:06:21 PMJul 15
to wiredtiger-users

Eric Milkie

unread,
Jul 15, 2022, 1:14:13 PMJul 15
to wiredtig...@googlegroups.com
Yes, you figured it out!  My apologies; I linked to the private repo instead of the public one.
-Eric


--
You received this message because you are subscribed to a topic in the Google Groups "wiredtiger-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/wiredtiger-users/2Kqbjij1YIw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to wiredtiger-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/wiredtiger-users/d61bd065-588d-4a0c-bbac-e76ca871ac51n%40googlegroups.com.

Martin Häusler

unread,
Jul 15, 2022, 2:02:30 PMJul 15
to wiredtiger-users
No worries :)

So here's what I've gathered so far. In the metadata of the secondary index, MongoDB tracks if the index is multikey at all (boolean) and in addition stores the path information which part of the indexed path causes it to become multikey. This may change upon document insertion, but it's a one-directional thing; an index keeps being multi-key even if all documents that caused it to become multi-key in the first place are long gone and deleted. The only way to get back to a regular index that I can see is to drop it and create a new one (assuming that there are no arrays left in the indexed documents).

This leaves two questions (if I may):

1) If we consider the non-unique index key layout described here (https://github.com/mongodb/mongo/blob/master/src/mongo/db/catalog/README.md#use-in-wiredtiger-indexes), then an index { "a.b": 1 } on the document:

{
    _id: 123
    "a": [ { "b": "v" }, { "b": "v" } ]

... would produce two identical keys (_id = 123, value = "v"), one for a[0].b one for a[1].b. I assume that one of them simply gets discarded during indexing and we persist no information in the index that we saw this once at a[0] and once at a[1]?

2) continuing with the examle from 1), if we have an incoming query that specifically asks for a[0].b = "x", we look up our index metadata and realize that "a" is the field that causes the index to become multi-key. Does that mean that we cannot use this index for this query at all? Or do we use the index to come up with a list of potential "candidate" documents that have the value "x" at *some* position in the array a, but need to be checked by fetching the actual documents and checking if the position is actually the requested one?

Daniel Gottlieb

unread,
Jul 15, 2022, 4:18:58 PMJul 15
to wiredtiger-users
Hi Martin,

Hopefully some more answers:


> The value is often NULL (e.g. MySQL does that). For unique indices, we can move the document ID to the value column. Since we know for each index table which document key was used to create them, we don't need to encode that in every entry.
That's how we encode our _id index, but it's (no longer) how our secondary unique indexes get encoded. Those do keep the document ID as part of the key. That's not relevant to your multikey question so I'll leave it at that, but I'm happy to provide more detail on the "why" there.

> This may change upon document insertion, but it's a one-directional thing; an index keeps being multi-key even if all documents that caused it to become multi-key in the first place are long gone and deleted. 
That's correct.
> The only way to get back to a regular index that I can see is to drop it and create a new one (assuming that there are no arrays left in the indexed documents).
That used to be correct. I'm not perfectly familiar, but the validate command can unset multikey. There may be some limitations associated with it though (my hunch is that may be a standalone-only operation).
 
> I assume that one of them simply gets discarded during indexing and we persist no information in the index that we saw this once at a[0] and once at a[1]?
That's correct. I'm not an expert in our key generation (so I couldn't tell if we intentionally "generate a single key" or if we just throw two items into WT with the same Key and value, and one is "accidentally" discarded).

To give detail on all the states you're interested in. The index is considered multikey and the multikey path for "a.b" is set:
                        {'backgroundSecondary': False,
                         'head': 0,
                         'multikey': True,
                         'multikeyPaths': {'x.b': b'\x01\x00'},
                         'ready': True,
                         'spec': {'key': {'x.b': 1.0}, 'name': 'x.b_1', 'v': 2}}],

And there is a single item in the index for that document (not shown, but verified locally on a modern version).

Does that mean that we cannot use this index for this query at all?
My understanding is that MDB cannot use the index for this query. I don't think it's related to the multikey-ness, but rather a quirk of the query language and how the notation used for array matching is ambiguous with sub-object matching. Consider another document that the `{"a.0.b": 1}` query must return:
{ _id: ...,
  "a": {"0": {"b": 1}}}

The `{"a.b": 1}` index generates a "null" key for that document. So using the index isn't sufficient for identifying that result.

Hope that helps,
Dan

Daniel Gottlieb

unread,
Jul 15, 2022, 4:20:25 PMJul 15
to wiredtiger-users
Apologies for the bad link to the multikey unsetting. The appropriate line numbers: https://github.com/mongodb/mongo/blob/41994ed8c080afb448b20c631a66bf782a9834c0/src/mongo/db/catalog/validate_adaptor.cpp#L488-L506

Martin Häusler

unread,
Jul 17, 2022, 3:10:31 PMJul 17
to wiredtiger-users
Thanks a lot for the clarifications Dan! The whole topic turned into quite the rabbit hole for me but it's fascinating and I think I got the gist of it now :)

Thank you all for the explanations, I'm learning a lot of new things here!

Reply all
Reply to author
Forward
0 new messages