Thanks for the replies. I'll try to go into a little more detail,
without divulging any secrets. If you guys can't answer with the
information provided here, let me know and I'll talk to my sysadmin
about putting some consulting time on the books.

So our working set is pretty small: maybe 1 - 5% of our documents
need to be accessed at any time. But our indexes are large, since we're
indexing almost the whole document. It's all in a single collection
spread across 3 shards.

Basically we're indexing web pages (not storing any text though). So
for each document/page we're keying off of and sharding on a hash of
the URL which is also a unique key in our database. This gives us a
nice random key and gives us a relatively even distribution across our
shards. For every page we find/crawl we insert a new document which
is made up of the _id, the hash (index), the URL itself (not indexed),
the last time we crawled it (index), and the priority of that page to
be re-indexed (index for sorting). We're also storing a list of sub-
objects (each one includes a string and a long) for each document and
the string portion of those sub-objects is indexed as well (this is by
far the largest index but also the most important).
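
For illustration, a document of the shape described above might look
roughly like this. This is only a sketch: the field names, the choice
of MD5 for the URL hash, and the sample values are assumptions made up
for the example, not our actual schema. The _id is left out because
the driver adds it on insert.

```python
import hashlib
from datetime import datetime, timezone


def url_hash(url: str) -> str:
    # MD5 is an assumption; any well-mixed hash gives a similarly random key.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def make_page_doc(url, priority, terms, crawled_at=None):
    """Build one page document; `terms` is a list of (string, long) pairs."""
    return {
        "url_hash": url_hash(url),   # shard key, unique, indexed
        "url": url,                  # stored but not indexed
        "last_crawled": crawled_at or datetime.now(timezone.utc),  # indexed
        "priority": priority,        # indexed, used for sorting
        # List of sub-objects; the string part ("terms.s") is indexed and
        # is by far the largest index.
        "terms": [{"s": s, "n": n} for s, n in terms],
    }


# Index key specs in the (field, direction) list form that pymongo's
# Collection.create_index accepts. Hypothetical names, matching the doc above.
INDEXES = [
    [("url_hash", 1)],
    [("last_crawled", 1)],
    [("priority", -1)],
    [("terms.s", 1)],
]

doc = make_page_doc("http://example.com/", 5, [("example", 42)])
```

The point of the sketch is just to show that every field except the
URL itself carries an index, and that the index on the sub-object
string gets one entry per list element rather than one per document.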

When we start crawling, everything gets sent to one shard until it
splits. We're using a small chunk size (16 MB) to make sure things
get distributed. Our insert rate is so high that we've seen
versioning errors between shards at large chunk sizes. Once
everything has split, the writes get distributed relatively evenly.
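
To show why a hashed key spreads out evenly once the chunks have
split, here is a small simulation. It is a simplification under stated
assumptions: MD5 stands in for whatever hash is used, the URLs are
made up, and each shard is assumed to own one equal-width slice of the
hash space, which is not how the balancer actually places chunks.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3  # matches the 3 shards described above


def shard_for(url: str, num_shards: int = NUM_SHARDS) -> int:
    # Interpret the first 8 bytes of the MD5 digest as an integer in
    # [0, 2**64) and map equal-width ranges of that space onto shards.
    h = int.from_bytes(hashlib.md5(url.encode("utf-8")).digest()[:8], "big")
    return h * num_shards // 2**64


def distribution(urls):
    """Count how many URLs land on each shard."""
    return Counter(shard_for(u) for u in urls)


# Hypothetical URLs, purely for the simulation.
counts = distribution(f"http://example.com/page/{i}" for i in range(30000))
for shard, n in sorted(counts.items()):
    print(f"shard {shard}: {n} documents")
```

Each shard should end up with close to a third of the documents, which
is the "relatively even" write distribution we see after the splits.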

At this stage, when we get to about 5 - 10M documents, one of the slaves
will start losing it and just sit there faulting and blocking all of
the queries. Now, our indexes do not fit in RAM and I don't expect
them to. For now, that's impossible. We just can't buy enough
hardware to make it work. But mongo grinds to a halt when
we hit that RAM limit and, like I said, almost every read/write just
starts blocking.
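
As a rough way to quantify "indexes do not fit in RAM", the sketch
below compares per-index sizes against available memory. It assumes an
input shaped like the indexSizes field of the collection stats output
(index name mapped to size in bytes). The index names, sizes, and the
16 GiB RAM figure are placeholders, not measurements from our cluster.

```python
GIB = 1024 ** 3


def index_pressure(stats: dict, ram_bytes: int) -> dict:
    """Summarise how total index size compares with available RAM."""
    sizes = stats["indexSizes"]
    total = sum(sizes.values())
    return {
        "total_index_bytes": total,
        "largest_index": max(sizes, key=sizes.get),
        "ratio_to_ram": total / ram_bytes,
        "fits": total <= ram_bytes,
    }


# Placeholder numbers for one shard; only the shape mirrors collection stats.
sample = {
    "indexSizes": {
        "_id_": 1 * GIB,
        "url_hash_1": 2 * GIB,
        "last_crawled_1": 1 * GIB,
        "priority_-1": 1 * GIB,
        "terms.s_1": 20 * GIB,
    }
}

report = index_pressure(sample, ram_bytes=16 * GIB)
print(report)
```

With numbers like these, a single index already exceeds RAM on its
own, which is the situation described above.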

How can we differentiate working-set from non-working-set documents? Given
the information above do you see anything wrong with our keys, outside
of the fact that there's a lot of them? By far the most important
index is the largest one (the index of the string portion in the list
of sub-objects) and that being indexable and queryable is one of the
major reasons I picked mongo in the first place so it's pretty
important that that stay.

Let me know if there's any other information I can provide. If it's
not enough info, I'll see about getting some consulting time to
discuss this in a more private forum.

Thanks again,
Terry

On Aug 15, 2:41 pm, Markus Gattol <markus.gat...@sunoano.org> wrote:
> Terry> Hi guys, We've been playing with MongoDB for a start-up we're
> Terry> doing. We're looking at a very very (insert many more here)
> Terry> large data set (on the order of 1 TB a month) and it needs to be
> Terry> searchable going back at least 3 months, preferably more. It
> Terry> seems near impossible that we're going to be able to fit the
> Terry> data into RAM and we're seeing awful performance past that
> Terry> boundary.
>
> You do not need to be able to fit all your data into RAM, just the
> working set size and indexes would be the ideal case.
>
> http://www.markus-gattol.name/ws/mongodb.html#what_is_the_so-called_w...