Why not map/reduce
The reason I'm not going to use map/reduce is that it's going to take a lot of space on disk.
Suppose I have 1,000,000 documents and only 2 of them are duplicates of each other (this is the normal case!). Using map/reduce, I'd end up with an index on disk containing 999,998 groups of size 1 each and only 1 group of size 2. It would work, but I don't want to pay that penalty! Don't forget that indexes consume disk space.
(Map/reduce would have been a good solution if most of the docs were duplicates, which is not the normal scenario.)
And the problem is even worse for me because I have 5 different fields, each of which can independently mark a document as a duplicate, so I'd need 5 separate map/reduce indexes.
Troy,
1. I don't need to store a document with 1,000,000 records, because I only need to remember the duplicates, which are supposed to be few.
2. The background task would collect the data in pages, so it would handle 1,000,000 records without a problem.
3. Luckily for me, a 24-hour delay is good enough for my specific problem.
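The background task described above could be sketched roughly like this (plain Python; `fetch_page` is a hypothetical paging callback standing in for whatever paged-query API the database offers, and `field` is the field being checked for duplicates):

```python
def find_duplicates(fetch_page, field, page_size=1024):
    """Scan all records page by page and return only the duplicated
    values. The `seen` set is transient scan state; only the (expectedly
    tiny) `duplicates` result would need to be stored anywhere."""
    seen = set()
    duplicates = set()
    start = 0
    while True:
        page = fetch_page(start, page_size)  # hypothetical paging API
        if not page:
            break
        for record in page:
            value = record[field]
            if value in seen:
                duplicates.add(value)
            else:
                seen.add(value)
        start += page_size
    return duplicates


# Usage with an in-memory stand-in for the paged query:
records = [{"email": "u%d@x" % i} for i in range(10)] + [{"email": "u3@x"}]
print(find_duplicates(lambda s, n: records[s:s + n], "email"))  # {'u3@x'}
```

Because it pulls one page at a time, the same loop handles 1,000,000 records without loading them all at once, and it can be rerun on whatever schedule the freshness requirement allows (every 24 hours, in my case).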
As I said, I'm not going to "solve" the problem.
A real solution would be to query Lucene for the frequencies of the field. The answer is already there; we just don't have access to it.
Good weekend everybody!