How to query for duplicates (non-distinct) items?


Gal Koren

Sep 20, 2012, 11:27:43 AM
to rav...@googlegroups.com
Hi.

I have a big number (> 100,000) of Contact objects:

class Contact
{
    public Guid Id { get; set; }
    public string Name { get; set; }
    public string DrivingLicenseNum { get; set; }
    public string Email { get; set; }
}

1. I'd like to get a list of all the DrivingLicenseNum values that are duplicated among different contacts.
2. Same for Email.

I could achieve that by map-reduce:

// map
from contact in docs.Contacts
select new { Email = contact.Email,  Count = 1 }

// reduce
from result in results
group result by result.Email into g
select new { Email = g.Key, Count = g.Sum(x=>x.Count) }

But it would be very inefficient (in terms of disk usage) because I'd have ~100,000 groups with Count=1.
So I'm looking for an alternative to map/reduce.
I feel like the Lucene index already holds the answer. All I need is to ask Lucene which terms have more than 1 hit-count.
Is it possible?
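The idea being asked about — scanning the index's term dictionary for terms with a document frequency greater than 1 — can be sketched in a few lines. This is a toy Python stand-in for a Lucene term dictionary, not the RavenDB or Lucene API:

```python
from collections import Counter

# Toy stand-in for a Lucene term dictionary: count how many
# documents each Email term appears in (its document frequency).
contacts = [
    {"Id": 1, "Email": "a@x.com"},
    {"Id": 2, "Email": "b@x.com"},
    {"Id": 3, "Email": "a@x.com"},
]

doc_freq = Counter(c["Email"] for c in contacts)

# Terms with a document frequency > 1 are exactly the duplicated values.
duplicates = [term for term, freq in doc_freq.items() if freq > 1]
print(duplicates)  # → ['a@x.com']
```

Lucene already maintains these per-term document frequencies internally; the question is whether RavenDB exposes them.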


Oren Eini (Ayende Rahien)

Sep 20, 2012, 11:31:11 AM
to rav...@googlegroups.com
In theory, yes, you could do a search based on the term frequencies, I guess.
We don't expose that ability to the users, though.

Stacey

Sep 20, 2012, 12:51:53 PM
to rav...@googlegroups.com
I am really interested in seeing what you do to solve this. I have a similar problem where I need to find instances of duplicated data in a large set.

Gal Koren

Sep 20, 2012, 1:01:10 PM
to rav...@googlegroups.com
I'll probably write a bundle.
I'll let you know how it goes.

Gal Koren

Sep 21, 2012, 11:52:40 AM
to rav...@googlegroups.com
Eventually I decided not to solve the problem :)
The solution for me would be to simply query all existing documents, projecting only the properties I want to check for duplicates, collect all the data client-side in memory (note that you must collect the data in a paged manner), and then look for the duplicates in memory.
I'd do that in a task that runs once every night (when my customers are sleeping...) and stores the result in a special document.
When a user queries for duplicates, he gets the document created the previous night.
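The nightly paged-collection approach can be sketched like this. `fetch_page` is a hypothetical helper standing in for a paged projection query against the server (e.g. skip/take over a projection of just the Email field); the data here is hard-coded for illustration:

```python
from collections import Counter

PAGE_SIZE = 1024  # matches RavenDB's per-query result cap mentioned in this thread

def fetch_page(skip, take):
    """Hypothetical stand-in for a paged projection query
    (skip/take over documents, projecting only Email)."""
    data = ["a@x.com", "b@x.com", "a@x.com", "c@x.com"]
    return data[skip:skip + take]

# Collect all projected values page by page, then count in memory.
counts = Counter()
skip = 0
while True:
    page = fetch_page(skip, PAGE_SIZE)
    if not page:
        break
    counts.update(page)
    skip += len(page)

# Only the duplicates go into the nightly "special document".
duplicates_doc = {email: n for email, n in counts.items() if n > 1}
print(duplicates_doc)  # → {'a@x.com': 2}
```

Note that all values, not just the duplicates, pass through client memory; only the final document stays small.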

Oren Eini (Ayende Rahien)

Sep 21, 2012, 3:52:05 PM
to rav...@googlegroups.com
Why not use a map/reduce for that? 

Troy

Sep 21, 2012, 3:52:42 PM
to rav...@googlegroups.com
This seems like an odd implementation. What happens when you have millions of records? You would create a special document with millions of unique entries? It would take a while to collect all documents when you have a max query of 1024. Not to mention, it would not be real time or potentially 24 hours behind.

Paul Hinett

Sep 21, 2012, 4:33:05 PM
to rav...@googlegroups.com
Oren, what would a map/reduce look like for returning duplicates?

Oren Eini (Ayende Rahien)

Sep 21, 2012, 7:53:47 PM
to rav...@googlegroups.com
Do a map/reduce on the unique key of your choice.
Then query for Count > 1
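A minimal sketch of what that means, in Python standing in for the RavenDB index (the real index would be the LINQ map/reduce shown earlier in the thread, queried with Count > 1):

```python
from collections import defaultdict

contacts = [
    {"Id": 1, "Email": "a@x.com"},
    {"Id": 2, "Email": "b@x.com"},
    {"Id": 3, "Email": "a@x.com"},
]

# Map: emit { Email, Count = 1 } per document.
mapped = [{"Email": c["Email"], "Count": 1} for c in contacts]

# Reduce: group by Email and sum the counts.
groups = defaultdict(int)
for r in mapped:
    groups[r["Email"]] += r["Count"]

# Query the index for Count > 1: those groups are the duplicates.
dupes = {email: n for email, n in groups.items() if n > 1}
print(dupes)  # → {'a@x.com': 2}
```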

Gal Koren

Sep 22, 2012, 8:47:56 PM
to rav...@googlegroups.com
Why not map/reduce

The reason I'm not going to use map/reduce is that it's going to take a lot of space on disk.
Suppose I have 1,000,000 documents and only 2 of them are duplicates (this is the normal case!). Using map/reduce I'm going to have an index on disk with 999,998 groups of size 1 each, and only 1 group of size 2. It will work, but I don't want to pay that penalty! Don't forget that indexes consume disk space.
(Map/reduce would have been a good solution if most of the docs were duplicates, which is not the normal scenario.)
And the problem is even worse for me because I have 5 different fields, each of which I'd check for duplicates, so I'd have to use 5 map/reduce indexes.

Troy,

1. I don't need to store a document of 1,000,000 records because I only need to remember the duplicates, which are supposed to be few.
2. The background task would collect the data in pages, so it would handle 1,000,000 records without a problem.
3. Luckily for me, 24h delay is good enough for my specific problem.


As I said, I'm not going to "solve" the problem.
A real solution would be to query Lucene for the frequencies of the field. The answer is already there, we just don't have access to it.

Good weekend everybody!

Oren Eini (Ayende Rahien)

Sep 23, 2012, 2:07:32 AM
to rav...@googlegroups.com
Gal,
Disk is _cheap_.
I mean, _really really cheap_.

What you are planning would be an extremely expensive operation. 

Gal Koren

Sep 23, 2012, 5:39:09 PM
to rav...@googlegroups.com
Oh... embarrassingly I didn't notice this point :)
You are right.
And map/reduce is the winning solution.

Michael Pavlovsky

Sep 23, 2012, 10:15:09 PM
to rav...@googlegroups.com
Oren,
I agree that disk is cheap, BUT please explain the following.

Let's take a few RavenDB hosting solutions as examples:

https://appharbor.com/addons/ravenhq - 2 GB of storage = $25 per month, and replicated = $200 per month

The point is that it seems like with RavenDB, disk is quite expensive.

This is not the only index that the system has. From a very quick look at the data objects that I store in my system (and they are not large), it seems like 2 GB is not a lot of disk space.

I can provide examples with Amazon EC2 instances and the situation will be similar, with the twist of provisioned IOPS storage. (Raven needs fast disks for indexing, correct?)
For 10 GB of mounted drive I can easily pay $50 per month.
So no, disk is cheap only at home. In the cloud it is expensive, though not as expensive as memory.

Last thing: YES, you are 100% correct that disk is free in a usual ASP.NET hosting.

Oren Eini (Ayende Rahien)

Sep 24, 2012, 4:12:53 AM
to rav...@googlegroups.com
Michael,
10 GB = $50 per month? Seriously?
It would be cheaper to get a physical server at that pricing.

Regarding cloud hosting, they charge by disk size because that is a highly visible metric.
What you propose will likely mean a lot of bandwidth to process, and that is not something you want on the cloud anyway.
Besides, you don't have any idea what the size of the index is going to be yet.