How should 29m docs behave?

Wyatt Barnett

Aug 1, 2012, 9:32:31 AM

to rav...@googlegroups.com

Working on a little project here that involves ingesting some 29m documents and then doing some mapping and reducing on the dataset to crunch out some numbers for a big ole report. The documents in question are pretty small -- think 1 KB in general. Just a few data points and a bit of contextual information, nothing more. Our test runs have been pretty successful -- we import stuff, RavenDB chugs for a bit and then we get mapped and reduced results within the hour. But when we tossed all 29m documents into the system -- after fighting a whole host of ESENT boogeymen -- indexing never completed, even though we gave it 2 solid weeks running on a 4-processor VM with 8 GB RAM; we never saw RavenDB consume more than 5 GB during that time period. Moreover, the database grew far bigger than we expected -- it ended up at 80 GB. Considering the source data is perhaps 2.83 GB, we found that a bit alarming.

So, my questions are:

1) Is this the expected behavior for data on this scale?

2) Should we see 10x growth over source data from the ESENT data file alone?

3) Is there something we should be doing differently -- aside from throwing hardware at the problem, which we are already working on?

Oren Eini (Ayende Rahien)

Aug 1, 2012, 9:41:20 AM

to rav...@googlegroups.com

Wyatt,

Might that be happening because you output multiple rows from each document during the map phase?

In general, that shouldn't happen, and if we can get a repro, we will fix this.

Wyatt Barnett

Aug 1, 2012, 11:18:03 AM

to rav...@googlegroups.com

Thanks, makes sense. I don't think we are doing that -- check out https://gist.github.com/3227714 for the code examples. Not sure what you'd need for the repro, but I can get you the full source and data for the project.

In the interest of full disclosure, we did start on a 2nd version that dramatically reduced the scope of that map reading document.

Oren Eini (Ayende Rahien)

Aug 1, 2012, 11:19:47 AM

to rav...@googlegroups.com



    group r by new { r.FccLicenseId, r.ProviderName, r.StateAbbreviation, r.HoldingCompanyId, r.HoldingCompanyName, r.TechnologyTypeId, r.TechnologyTypeName, r.StateFipsId }

You are probably going to get a LOT of groups from this, and each of those has to be handled independently, probably multiple times as we scan through everything.
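To make the point about group counts concrete, here is a minimal sketch in plain LINQ-to-objects (not a RavenDB index definition). The sample rows are made up for illustration; only the property names come from the thread. It counts how many groups a composite key produces compared with grouping on FccLicenseId alone.

```csharp
using System;
using System.Linq;

// Made-up sample rows (an assumption for illustration); only the
// property names are taken from the group-by quoted in the thread.
var rows = new[]
{
    new { FccLicenseId = 1, StateAbbreviation = "NY", TechnologyTypeId = 10 },
    new { FccLicenseId = 1, StateAbbreviation = "NY", TechnologyTypeId = 20 },
    new { FccLicenseId = 1, StateAbbreviation = "NJ", TechnologyTypeId = 10 },
    new { FccLicenseId = 2, StateAbbreviation = "NY", TechnologyTypeId = 10 },
    new { FccLicenseId = 2, StateAbbreviation = "CT", TechnologyTypeId = 10 }
};

// Composite key: one group per distinct (license, state, technology) combination.
int wideGroups = rows
    .GroupBy(r => new { r.FccLicenseId, r.StateAbbreviation, r.TechnologyTypeId })
    .Count();

// Single key: one group per license.
int narrowGroups = rows.GroupBy(r => r.FccLicenseId).Count();

Console.WriteLine($"composite key: {wideGroups} groups, single key: {narrowGroups} groups");
```

Every extra field in the key can only keep the number of groups the same or increase it, so the fewer independent fields in the key, the fewer groups the reduce has to process.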

Wyatt Barnett

Aug 1, 2012, 12:10:14 PM

to rav...@googlegroups.com

Thanks, that makes sense. v2 cuts down on that significantly -- just grouping on FccLicenseId. One problem this has created is "how do we pull in the useful stuff for writing the reports, like the HoldingCompanyName, when we aren't grouping on it?" I looked at TransformResult, but that doesn't help much unless we extract a document for each of these FCC license holders, which felt a bit too relational for this platform.

Wyatt Barnett

Aug 1, 2012, 12:13:03 PM

to rav...@googlegroups.com

PS: I should note that in that example, FccLicenseId, HoldingCompanyId and HoldingCompanyName are all equal and conjoined.

Kijana Woodard

Aug 1, 2012, 5:35:55 PM

to rav...@googlegroups.com

HoldingCompanyName = g.Select(x => x.HoldingCompanyName).FirstOrDefault()
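For context, a minimal sketch of how that suggestion might fit into a reduce that groups only on FccLicenseId. This is plain LINQ-to-objects, not the actual index from the gist: the sample rows and the Count field are assumptions for illustration. In standard LINQ, FirstOrDefault takes a boolean predicate, so the sketch projects the field with Select first and then takes the first value.

```csharp
using System;
using System.Linq;

// Made-up map output (an assumption for illustration); Count is a
// hypothetical aggregate, not a field confirmed by the thread.
var mapped = new[]
{
    new { FccLicenseId = 1, HoldingCompanyName = "Example Holdings A", Count = 1 },
    new { FccLicenseId = 1, HoldingCompanyName = "Example Holdings A", Count = 1 },
    new { FccLicenseId = 2, HoldingCompanyName = "Example Holdings B", Count = 1 }
};

// Group on the single key and carry the descriptive field along by
// taking the first value in each group. This is only safe because the
// name is the same for every row that shares an FccLicenseId.
var reduced = (
    from r in mapped
    group r by r.FccLicenseId into g
    select new
    {
        FccLicenseId = g.Key,
        HoldingCompanyName = g.Select(x => x.HoldingCompanyName).FirstOrDefault(),
        Count = g.Sum(x => x.Count)
    }).ToList();

foreach (var row in reduced)
    Console.WriteLine($"{row.FccLicenseId}: {row.HoldingCompanyName} ({row.Count})");
```

The reduce output keeps the same shape as the mapped rows, so the same pattern can be repeated for each descriptive field that is fully determined by the grouping key.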

Oren Eini (Ayende Rahien)

Aug 2, 2012, 12:32:43 AM

to rav...@googlegroups.com

Yep

Wyatt Barnett

Aug 2, 2012, 7:14:51 AM

to rav...@googlegroups.com

Gotcha. I was having misgivings about doing that a half dozen times, but if it is correct then we'll do so.