I am running version 4.0.x. The topology is one server with 1 TB of RAM and 4 TB of NVMe disks. OS version is Arch Linux with kernel 4.18.5-arch1-1-ARCH #1 SMP PREEMPT. The client uses the mgo driver with the Go language, but I'm just trying to do this in the mongo shell for now, before I code the method of getting these unique values into the "walker" client code.
The collection looks something like this:
{"source": "file1.csv", "timestamp": ISODate("2018-01-01T08:00:00Z"), "email": ["us...@host.com"]}
{"source": "file1.csv", "timestamp": ISODate("2018-01-01T08:00:00Z"), "email": ["us...@host.com"]}
{"source": "file2.csv", "timestamp": ISODate("2018-02-01T08:00:00Z"), "IP": "us...@host.com"}
{"source": "file2.csv", "timestamp": ISODate("2018-02-01T08:00:00Z"), "email": ["us...@host.com", "user...@host.com"]}
Hi,
If I understand correctly, you have a set of files whose metadata you keep in MongoDB, and one file can have multiple entries. Periodically, you list the files in the directory, query MongoDB for the unique filename values, and compare the two listings. For each filename that exists in MongoDB but not in the directory listing, you remove the corresponding metadata from MongoDB to keep the two lists in sync. Is this an accurate description of your use case?
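If that is the workflow, the comparison step itself is straightforward to express in Go (the client language mentioned in the question). This is an illustrative in-memory sketch with hypothetical names, not code from the original poster:

```go
package main

import "fmt"

// staleSources returns the source values that exist in the metadata
// store but no longer correspond to a file on disk. These are the
// entries whose documents should be removed from MongoDB.
func staleSources(dirFiles, dbSources []string) []string {
	onDisk := make(map[string]bool, len(dirFiles))
	for _, f := range dirFiles {
		onDisk[f] = true
	}
	var stale []string
	for _, s := range dbSources {
		if !onDisk[s] {
			stale = append(stale, s)
		}
	}
	return stale
}

func main() {
	dir := []string{"file1.csv"}
	db := []string{"file1.csv", "file2.csv"}
	fmt.Println(staleSources(dir, db)) // prints [file2.csv]
}
```

The expensive part is not this comparison but producing `dbSources` (the list of unique filenames), which is what the rest of this thread is about.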
If this is correct, with regard to your question:
Basically, how can I get the functionality of $group to use the index of source instead of doing a COLLSCAN?
In some cases, a COLLSCAN could be faster, since the query you’re doing needs to look at every document in the collection anyway. However, this is very use-case dependent: it hinges on the size of your documents, the amount of RAM you have, whether the whole index or the whole collection fits in RAM, and so on.
Having said that, you could try to do a $sort before $group to see if the query runs any faster, such as:
db.collection.aggregate([ {$sort: {source: 1}}, {$group: {_id: '$source'}} ])
Note that this requires an index on the source field, e.g. db.collection.createIndex({source: 1}).
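For intuition on why sorted input can help: when values arrive in order, distinct values can be emitted while remembering only the previous value, instead of building a hash table of every group. A minimal Go sketch of that streaming idea (illustrative only, not MongoDB’s actual implementation):

```go
package main

import "fmt"

// distinctSorted emits each value exactly once from an already-sorted
// stream, keeping only one element of look-back state. This mirrors
// the benefit an index-ordered scan gives a $group on the same field.
func distinctSorted(sorted []string) []string {
	var out []string
	for i, s := range sorted {
		if i == 0 || s != sorted[i-1] {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	fmt.Println(distinctSorted([]string{"a.csv", "a.csv", "b.csv", "b.csv", "c.csv"}))
	// prints [a.csv b.csv c.csv]
}
```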
An alternative approach would be using a tool such as inotify. There is a blog post titled Linux Filesystem Events with inotify that runs through an example on its use. The basic idea is to have a script that monitors the output of inotify, and if a delete event is detected, do a delete of the related source in the MongoDB database. This way, you could avoid having to do a directory listing and a comparison with the aggregate output.
OS version is Arch Linux
Please note that Arch Linux is not listed in the MongoDB Supported Platform list. I would encourage you to consider using one of the supported platforms and follow the recommendations in the Production Notes for best results. Having a Replica Set is also strongly recommended in a production setting.
Best regards,
Kevin
Hi,
I understand that your original solution with distinct worked well, until you hit a certain scale, and you were forced to look for other solutions. Unfortunately, an aggregation query is single-threaded by its nature. The distinct command is also not able to return a cursor currently. There is a ticket for this functionality: SERVER-3141.
In the meantime, off the top of my head I can think of three solutions:
One is to create a special collection containing the unique source values. In this solution, you can exploit the fact that MongoDB enforces a unique _id index on every collection. Essentially, you’re creating an index in a manual manner. The workflow could be: every time you insert metadata into the main collection, also record the filename with db.unique.insert({_id: <filename>}) (a duplicate-key error just means the filename is already recorded and can be ignored). When syncing, read the filenames from this unique collection instead of aggregating, compare them with the directory listing, and delete the corresponding entries from both this collection and the main metadata collection.
Another solution is to do multiple aggregations bounded by some criteria based on the characteristics of your data. For example, you can do one aggregation on files starting with ‘a’, and another on files starting with ‘b’. If source is a string, you would need to use a regex to do this, e.g.:
db.collection.aggregate([ {$match: {source: /^a/}}, ... ])
db.collection.aggregate([ {$match: {source: /^b/}}, ... ])
...
Please see Regex index use for details to ensure that your regex query can use an index.
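Since each bounded aggregation is independent, the Go client could also issue them concurrently and merge the results, recovering some parallelism that a single server-side aggregation lacks. A self-contained sketch of that pattern (the per-prefix query is faked with an in-memory slice so it runs without a server; all names are hypothetical):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// distinctForPrefix stands in for one bounded aggregation such as
// db.collection.aggregate([{$match: {source: /^a/}}, {$group: {_id: '$source'}}]).
// Here it is faked with an in-memory scan so the sketch is self-contained.
func distinctForPrefix(all []string, prefix byte) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range all {
		if len(s) > 0 && s[0] == prefix && !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out
}

// distinctParallel runs one bounded query per prefix concurrently and
// merges the results under a mutex.
func distinctParallel(all []string, prefixes []byte) []string {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged []string
	)
	for _, p := range prefixes {
		wg.Add(1)
		go func(p byte) {
			defer wg.Done()
			res := distinctForPrefix(all, p)
			mu.Lock()
			merged = append(merged, res...)
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return merged
}

func main() {
	all := []string{"a1.csv", "a1.csv", "b1.csv", "a2.csv", "b1.csv"}
	res := distinctParallel(all, []byte{'a', 'b'})
	sort.Strings(res) // merge order is nondeterministic, so sort for display
	fmt.Println(res)  // prints [a1.csv a2.csv b1.csv]
}
```

Note that the prefixes must together cover all possible source values, or some filenames will be missed.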
The last solution I can think of is to shard the collection, since the operation you need appears to be limited by what a single node can handle under your current workflow. You already have quite a powerful machine, so the headroom for scaling vertically is probably limited. Sharding is an option if you don’t want to change your workflow, but it will change your deployment significantly, so it requires more planning, changes to backup methodology, and other ops considerations compared to what you currently have.
In any case, please remember to take a complete backup of your app & data before doing any major changes, and to thoroughly test any changes before implementing them in production.
Best regards,
Kevin