Is there a Better Way of Finding Duplicates in an Index?


Bob McLaren

Mar 18, 2015, 11:11:27 AM
to rav...@googlegroups.com
I am running RavenDB #3599 with > 2 million documents.   A small percentage of those documents (< 4%) have duplicate index values.  When asked to identify those duplicates, I could not find a way to directly query the Lucene index for term counts, so instead I had to create a map/reduce index.  The index took over 2 hours to create, consumed every bit of available memory on the server, and consumed enough CPU to noticeably slow down Raven's responsiveness.  I didn't want to revive an old thread, but my issue is an exact duplicate of this one brought up by Gal Koren in 2012.

Maybe it's because I come from a SQL background, but I can't help but feel like I am doing something wrong.  I realize that a relational database is more conducive to ad hoc reporting than a document database, but it just doesn't "feel right" to me that I should consume so many resources and so much time to retrieve information that I know is in the index already.

In short, is there a better way to do this?

Here is the map:
from s in docs.Statements
select new {
    AccountNumber = s.AccountNumber,
    EndDate = s.EndDate,
    Count = 1
}

And the reduce:
from r in results
group r by new {
    AccountNumber = r.AccountNumber,
    EndDate = r.EndDate
} into g
select new {
    AccountNumber = g.Key.AccountNumber,
    EndDate = g.Key.EndDate,
    Count = Enumerable.Sum(g, x => (int)x.Count)
}


Oren Eini (Ayende Rahien)

Mar 18, 2015, 4:52:15 PM
to ravendb

Hibernating Rhinos Ltd

Oren Eini | CEO | Mobile: +972-52-548-6969

Office: +972-4-622-7811 | Fax: +972-153-4-622-7811


--
You received this message because you are subscribed to the Google Groups "RavenDB - 2nd generation document database" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ravendb+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Bob McLaren

Mar 20, 2015, 9:33:44 AM
to rav...@googlegroups.com
Thanks Oren,

I have been playing around with faceted search and dynamic aggregation, and there are definitely a lot of cool things you can do with that, but I'm still struggling to figure out how to get the result I'm looking for.

Using the Order class from the example in the documentation link you provided: if I had > 2 million orders and I wanted to find the duplicate orders (based on Total, Product, and Currency), how could I use dynamic aggregation to achieve that?

I tried using .CountOn, but that includes all the Orders with a count of 1 (non-duplicates), and I don't see a way of filtering those out.  Even if I wanted to page through the results, there doesn't appear to be a method to do so when using facets.

Chris Marisic

Mar 20, 2015, 11:01:22 AM
to rav...@googlegroups.com
Ship your data to an OLAP system. I would recommend actual cubes, but even plain old SQL Server or Postgres is viable. Postgres might even be able to work with the JSON directly.

Bob McLaren

Mar 20, 2015, 12:15:05 PM
to rav...@googlegroups.com
Thanks Chris,
I have no doubts that if I shipped my data to a SQL server I could get the information I need.  But that seems like overkill.  I'm not trying to do business analytics here, just an ad-hoc query to find duplicate documents in my database.
Lucene "knows" where my duplicates are.  I just have to find a way to coax the information out of her.


Chris Marisic

Mar 20, 2015, 12:25:34 PM
to rav...@googlegroups.com
2 million records is honestly rather insignificant. Just pull it all into memory with a streaming query and run a group-by in LINQ-to-Objects.
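A minimal LINQ-to-Objects sketch of that group-by, run here against a small in-memory stand-in for the streamed documents (the Statement shape and the sample values are illustrative, taken from the index definition earlier in the thread, not real data):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Statement
{
    public string AccountNumber { get; set; }
    public DateTime EndDate { get; set; }
}

class Program
{
    static void Main()
    {
        // Stand-in for the documents you'd pull down with a streaming query;
        // with RavenDB these would come back in batches from the stream API.
        var statements = new List<Statement>
        {
            new Statement { AccountNumber = "A-1", EndDate = new DateTime(2015, 1, 31) },
            new Statement { AccountNumber = "A-1", EndDate = new DateTime(2015, 1, 31) }, // duplicate
            new Statement { AccountNumber = "A-2", EndDate = new DateTime(2015, 1, 31) },
            new Statement { AccountNumber = "A-3", EndDate = new DateTime(2015, 2, 28) },
        };

        // Group on the candidate key and keep only groups with more than one hit,
        // which filters out the count-of-1 rows the map/reduce index returned.
        var duplicates = statements
            .GroupBy(s => new { s.AccountNumber, s.EndDate })
            .Where(g => g.Count() > 1)
            .Select(g => new { g.Key.AccountNumber, g.Key.EndDate, Count = g.Count() })
            .ToList();

        foreach (var d in duplicates)
            Console.WriteLine($"{d.AccountNumber} {d.EndDate:yyyy-MM-dd} x{d.Count}");
    }
}
```

The whole thing stays O(n) in memory for the grouping keys, which at 2 million rows is modest.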

Bob McLaren

Mar 20, 2015, 1:13:56 PM
to rav...@googlegroups.com
You are right.  Sending 2 million+ records over the wire to process client-side still offends my delicate sensibilities, but I guess I'll just have to get over it.

Thanks guys.

Chris Marisic

Mar 20, 2015, 1:21:59 PM
to rav...@googlegroups.com
(Not in Raven) but I'm actually working with this right now, except my numbers are tens and hundreds of millions of items :D

Oren Eini (Ayende Rahien)

Mar 23, 2015, 3:04:56 AM
to ravendb
What you need to do is to use TermSortMode to sort on the count, which will give you the items you want.
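For anyone landing on this thread later, a sketch of what that looks like with the RavenDB 3.x facet API. The index and field names are placeholders based on the Statement documents from earlier in the thread, and this needs a running server and a plain (non-reduce) index over AccountNumber, so treat it as illustrative rather than a drop-in answer:

```csharp
// Sort facet terms by hit count, descending, so the duplicated
// values come back first and the count-of-1 tail can be ignored.
var facets = new List<Facet>
{
    new Facet
    {
        Name = "AccountNumber",
        TermSortMode = FacetTermSortMode.HitsDesc,
        MaxResults = 128
    }
};

FacetResults results = session.Query<Statement>("Statements/ByAccountAndEndDate")
    .ToFacets(facets);

foreach (FacetValue v in results.Results["AccountNumber"].Values)
{
    if (v.Hits > 1) // Hits is the term's document count in the index
        Console.WriteLine("{0}: {1}", v.Range, v.Hits);
}
```

This reads the term counts Lucene already has, which is exactly the "information that is in the index already" the original question was after.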


Bob McLaren

Mar 23, 2015, 1:59:47 PM
to rav...@googlegroups.com
That works very nicely.  Thanks Oren!
