I am running version 4.0.x. The topology is one server with 1 TB of RAM and 4 TB of NVMe disks. OS version is Arch Linux with kernel 4.18.5-arch1-1-ARCH #1 SMP PREEMPT. The client uses the mgo driver with the Go language, but I'm just trying to do this in the mongo shell for now, before I code the method of getting these unique values into the "walker" client code.
The collection looks something like this:
{"source": "file1.csv", "timestamp": ISODate("2018-01-01T08:00:00Z"), "email": ["us...@host.com"]}
{"source": "file1.csv", "timestamp": ISODate("2018-01-01T08:00:00Z"), "email": ["us...@host.com"]}
{"source": "file2.csv", "timestamp": ISODate("2018-02-01T08:00:00Z"), "IP": "us...@host.com"}
{"source": "file2.csv", "timestamp": ISODate("2018-02-01T08:00:00Z"), "email": ["us...@host.com", "user...@host.com"]}
Hi,
If I understand correctly, you have a set of files whose metadata you keep in MongoDB, and one file can have multiple entries. Periodically, you list the files in the directory, query MongoDB for the unique filename values, and compare the two listings. For each filename that exists in MongoDB but not in the directory listing, you remove the corresponding metadata from MongoDB to keep the two lists in sync. Is this an accurate description of your use case?
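If that is the workflow, the comparison step itself is straightforward to express in Go (the client language mentioned in the question). This is an illustrative in-memory sketch with hypothetical names, not code from the original poster:

```go
package main

import "fmt"

// staleSources returns the source values that exist in the metadata
// store but no longer correspond to a file on disk. These are the
// entries whose documents should be removed from MongoDB.
func staleSources(dirFiles, dbSources []string) []string {
	onDisk := make(map[string]bool, len(dirFiles))
	for _, f := range dirFiles {
		onDisk[f] = true
	}
	var stale []string
	for _, s := range dbSources {
		if !onDisk[s] {
			stale = append(stale, s)
		}
	}
	return stale
}

func main() {
	dir := []string{"file1.csv"}
	db := []string{"file1.csv", "file2.csv"}
	fmt.Println(staleSources(dir, db)) // prints [file2.csv]
}
```

The expensive part is not this comparison but producing `dbSources` (the list of unique filenames), which is what the rest of this thread is about.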
If this is correct, with regard to your question:
Basically, how can I get the functionality of $group to use the index of source instead of doing a COLLSCAN?
In some cases, a COLLSCAN could be faster, since the query you’re doing needs to look at every document in the collection anyway. However, this is very use-case dependent: it hinges on the size of your documents, the amount of RAM you have, whether the whole index or the whole collection fits in RAM, and so on.
Having said that, you could try to do a $sort before $group to see if the query runs any faster, such as:
db.collection.aggregate([ {$sort: {source: 1}}, {$group: {_id: '$source'}} ])
Note that this requires an index on the source field, e.g. db.collection.createIndex({source: 1}).
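For intuition on why sorted input can help: when values arrive in order, distinct values can be emitted while remembering only the previous value, instead of building a hash table of every group. A minimal Go sketch of that streaming idea (illustrative only, not MongoDB’s actual implementation):

```go
package main

import "fmt"

// distinctSorted emits each value exactly once from an already-sorted
// stream, keeping only one element of look-back state. This mirrors
// the benefit an index-ordered scan gives a $group on the same field.
func distinctSorted(sorted []string) []string {
	var out []string
	for i, s := range sorted {
		if i == 0 || s != sorted[i-1] {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	fmt.Println(distinctSorted([]string{"a.csv", "a.csv", "b.csv", "b.csv", "c.csv"}))
	// prints [a.csv b.csv c.csv]
}
```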
An alternative approach would be using a tool such as inotify. There is a blog post titled Linux Filesystem Events with inotify that runs through an example on its use. The basic idea is to have a script that monitors the output of inotify, and if a delete event is detected, do a delete of the related source in the MongoDB database. This way, you could avoid having to do a directory listing and a comparison with the aggregate output.
OS version is Arch Linux
Please note that Arch Linux is not listed in the MongoDB Supported Platform list. I would encourage you to consider using one of the supported platforms and follow the recommendations in the Production Notes for best results. Having a Replica Set is also strongly recommended in a production setting.
Best regards,
Kevin
Hi,
I understand that your original solution with distinct worked well, until you hit a certain scale, and you were forced to look for other solutions. Unfortunately, an aggregation query is single-threaded by its nature. The distinct command is also not able to return a cursor currently. There is a ticket for this functionality: SERVER-3141.
In the meantime, off the top of my head I can think of three solutions:
One is to create a special collection containing the unique source values. In this solution, you can exploit the fact that MongoDB enforces a unique _id index on every collection. Essentially, you’re creating an index in a manual manner. The workflow could be: every time you insert metadata into the main collection, also record the filename with db.unique.insert({_id: <filename>}) (a duplicate-key error just means the filename is already recorded and can be ignored). When syncing, read the filenames from this unique collection instead of aggregating, compare them with the directory listing, and delete the corresponding entries from both this collection and the main metadata collection.
Another solution is to do multiple aggregations bounded by some criteria based on the characteristics of your data. For example, you can do one aggregation on files starting with ‘a’, and another on files starting with ‘b’. If source is a string, you would need to use a regex to do this, e.g.:
db.collection.aggregate([ {$match: {source: /^a/}}, ... ])
db.collection.aggregate([ {$match: {source: /^b/}}, ... ])
...
Please see Regex index use for details to ensure that your regex query can use an index.
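Since each bounded aggregation is independent, the Go client could also issue them concurrently and merge the results, recovering some parallelism that a single server-side aggregation lacks. A self-contained sketch of that pattern (the per-prefix query is faked with an in-memory slice so it runs without a server; all names are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// distinctForPrefix stands in for one bounded aggregation such as
// db.collection.aggregate([{$match: {source: /^a/}}, {$group: {_id: '$source'}}]).
// Here it is faked with an in-memory scan so the sketch is self-contained.
func distinctForPrefix(all []string, prefix byte) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range all {
		if len(s) > 0 && s[0] == prefix && !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out
}

// distinctParallel runs one bounded query per prefix concurrently and
// merges the results under a mutex.
func distinctParallel(all []string, prefixes []byte) []string {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged []string
	)
	for _, p := range prefixes {
		wg.Add(1)
		go func(p byte) {
			defer wg.Done()
			res := distinctForPrefix(all, p)
			mu.Lock()
			merged = append(merged, res...)
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return merged
}

func main() {
	all := []string{"a1.csv", "a1.csv", "b1.csv", "a2.csv", "b1.csv"}
	res := distinctParallel(all, []byte{'a', 'b'})
	sort.Strings(res) // merge order is nondeterministic, so sort for display
	fmt.Println(res)  // prints [a1.csv a2.csv b1.csv]
}
```

Note that the prefixes must together cover all possible source values, or some filenames will be missed.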
The last solution I can think of is to shard the collection, since the operation you need appears to be limited by what a single node can handle under your current workflow. You already have quite a powerful machine, so the headroom for scaling vertically is probably limited. Sharding is an option if you don’t want to change your workflow, but it will change your deployment significantly, so it requires more planning, changes to backup methodology, and other ops considerations compared to what you currently have.
In any case, please remember to take a complete backup of your app & data before doing any major changes, and to thoroughly test any changes before implementing them in production.
Best regards,
Kevin