Let me try to line the two up a little better. These two captures are almost exactly aligned over a 5-minute period:
mongostat:
http://pastebin.com/yJe36x2f
iostat:
http://pastebin.com/tc1Quy6j
The ONLY thing happening on the server (meaning, the mongo instance) is importing. However, I'm importing into over 200 collections simultaneously. There is no C compiler on the machine, so I could not build the Perl driver. So instead I open an instance of mongoimport for each target collection, and as I read the log files, I translate them into JSON, then output them through the right instance of mongoimport.
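To make the setup concrete, here is a rough sketch of the fan-out, written in Python rather than my actual Perl. The class and function names are made up for illustration, and it assumes mongoimport is on the PATH and reads newline-delimited JSON from stdin when no input file is given.

```python
import json
import subprocess


def open_mongoimport(db, collection):
    """Start one mongoimport process that reads JSON documents from stdin."""
    return subprocess.Popen(
        ["mongoimport", "--db", db, "--collection", collection],
        stdin=subprocess.PIPE,
        text=True,
    )


class ImportRouter:
    """Keeps one open sink per (db, collection) and routes documents to it."""

    def __init__(self, opener=open_mongoimport):
        # opener is injectable so the routing can be exercised without mongoimport.
        self._opener = opener
        self._procs = {}

    def write(self, db, collection, doc):
        key = (db, collection)
        if key not in self._procs:
            # Lazily start an importer the first time a collection is seen.
            self._procs[key] = self._opener(db, collection)
        # One JSON document per line on the importer's stdin.
        self._procs[key].stdin.write(json.dumps(doc) + "\n")

    def close(self):
        # Closing stdin signals end of input; then wait for each importer to exit.
        for proc in self._procs.values():
            proc.stdin.close()
            proc.wait()
```

With over 200 collections this means over 200 child processes and pipes open at once, which is the situation described above.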
The Perl write to the file handles is buffered, which means that there may be some bursts of activity on any given import process. Some, however, are so active that they are almost continuously receiving data. Some of the objects being written are large, others are small. Some border on ridiculously large, so for those, the buffering probably doesn't matter much.
I use a separate database per day's worth of logs, and a separate collection per method. Each method in our system emits a log entry every time it is called, containing the inputs and the outputs, so we can monitor system activity and understand the health of the larger system it is servicing. I divided it up this way to conserve space and, in theory, allow for faster searching. It conserves space because I don't need a date field in every record, since the date is determined by the database. And I don't need the layer, service, or method fields, because those make up the name of the collection. Had I put it all in one collection, as I originally tried, I would not only have needed all of those fields in each document, I would also have had to index on them, which means my total disk utilization would have been markedly higher. And it's 50 gigs per day as it is, BEFORE importing it.
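As an illustration of that layout, the mapping from a parsed log entry to its target looks roughly like the sketch below. The field names ("date", "layer", "service", "method"), the "logs_" prefix, and the dotted collection name are placeholders for this example, not my exact naming.

```python
def target_for(entry):
    """Return (database, collection, document) for one parsed log entry."""
    # Placeholder naming: one database per day, e.g. "logs_20110301".
    db = "logs_" + entry["date"].replace("-", "")
    # Placeholder naming: one collection per layer/service/method.
    collection = ".".join([entry["layer"], entry["service"], entry["method"]])
    # Drop the fields already implied by the database and collection names,
    # so they are neither stored nor indexed in every document.
    routing = {"date", "layer", "service", "method"}
    doc = {k: v for k, v in entry.items() if k not in routing}
    return db, collection, doc
```

The stored document keeps only what varies from call to call, such as the inputs and outputs.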
And yes, these are real production logs being pushed into Mongo, not test data. I'm testing out Mongo with them, but they are representative of the real data we need Mongo to deal with. Not only will I need to push data in, I've got to be able to do ad-hoc queries against the most recently added data so that we can identify the current state of the system AS the data is being imported.
One thought that came to my mind was to split things up a little bit. Looking at the stats, clearly Mongo is journaling. This is confirmed by looking in the db directory. The thought that occurred to me is, can I put the journal directory in /tmp, so it is being written to a local disk instead of the SAN? I could create a directory in /tmp, then replace the journal directory with a symlink. Then the respective I/O will go to different places, and will involve less I/O to the SAN, overall. That's my theory, at least. Is that a valid thing to do?
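In case it helps to see exactly what I mean, this is roughly the operation I have in mind, sketched in Python. It assumes mongod has been shut down cleanly first, and the function name and paths are placeholders. I have not tried this against a live data directory.

```python
import os
import shutil


def relocate_journal(dbpath, local_parent):
    """Move dbpath/journal under local_parent and leave a symlink behind."""
    journal = os.path.join(dbpath, "journal")
    target = os.path.join(local_parent, "journal")
    if os.path.islink(journal):
        raise RuntimeError("journal is already a symlink: " + journal)
    if os.path.exists(target):
        raise FileExistsError(target)
    # Move the existing journal directory onto the local disk.
    shutil.move(journal, target)
    # Replace the original directory with a symlink to the new location.
    os.symlink(target, journal)
    return target
```

For example, relocate_journal("/data/db", "/tmp/mongo") would leave /data/db/journal pointing at /tmp/mongo/journal; both paths here are made up.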
Any other thoughts on how I can get this system to perform the way it clearly SHOULD perform would be appreciated. Thanks!
-- John