Understanding Reads behaviour

90 views
Skip to first unread message

Praveen

unread,
Jul 26, 2012, 3:04:49 AM7/26/12
to mongod...@googlegroups.com
Hi All,

I have a collection which has 100K unique documents. On this collection I run 100 selects with each query retrieving 20 documents. I have 2 fields named field1 and field2 which are duplicates of each other and contain unique values ranging from 1 to 100K. The condition to restrict the number of documents selected to be 20, is as below

field1 $lt x and field2 $gt (x - 20)


Now when I restrict the values of x to be in the range 75k to 100K (which I call same range) the program takes 200 ms. If I vary x (which I call different range) such that out of the 100 queries issued every 25 of them have x in the range 1-25K, 26K-50K, 51K-75K and 76K-100K then the program takes 7000 ms. Which is about 35 times more. Could you help me understand the behaviour. I checked the number of page faults at start and end. It is 0, which means all the data is in memory. Any other parameters that I will need to look at?

Extending the same scenario to a multi-thread situation (50 threads) where each thread will issue 100 queries as similar to above.

When selecting from same range single thread scenario finishes in 200ms and 50 thread takes 17000 ms. When selecting from different range single thread takes 7000 ms and 50 thread takes 250000 ms. That seems to be quite a huge difference approx 35 times slower through put. One thing that I can think of is might be the threads are waiting on the read lock to be released and hence the delay...is this correct? Any other reason that could be making the 50 thread scenario slow?

Thanks

Praveen 

Adam C

unread,
Jul 26, 2012, 10:52:06 AM7/26/12
to mongod...@googlegroups.com
Praveen,

What indexes have you got defined here - is there a compound index defined for {field1 : 1, field2 : 1} for example?

If you run the queries with a .explain() you should get a better idea of what indexes (and bounds) are being used here.

Adam.

Jeremy Mikola

unread,
Jul 26, 2012, 11:00:54 AM7/26/12
to mongod...@googlegroups.com


On Thursday, July 26, 2012 3:04:49 AM UTC-4, Praveen wrote:
Hi All,

I have a collection which has 100K unique documents. On this collection I run 100 selects with each query retrieving 20 documents. I have 2 fields named field1 and field2 which are duplicates of each other and contain unique values ranging from 1 to 100K.

If field1 and field2 have identical values in each document, is there any reason not to use just one field? For the query below, you could use the $and operator to enforce two criteria for the same field. Can you describe how to reproduce this initial data set? I'm thinking:

for (i = 0; i < 100000; ++i) {
    db.foo.insert({x: i, y: i});
}
 
Also, as Adam mentioned, seeing the indexes defined on your collection would also be helpful.

Praveen

unread,
Jul 26, 2012, 12:28:12 PM7/26/12
to mongod...@googlegroups.com
Hi Jeremy, Adam,

Below is the explain output of the query being issued
>doc = {"batch1": {"$lt":100000}, "batch2": {"$gt":99980}}
>db.myCollection.find(doc).explain()

{
        "cursor" : "BtreeCursor batch2_1",
        "nscanned" : 20,
        "nscannedObjects" : 20,
        "n" : 19,
        "millis" : 0,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "isMultiKey" : false,
        "indexOnly" : false,
        "indexBounds" : {
                "batch2" : [
                        [
                                99980,
                                1.7976931348623157e+308
                        ]
                ]
        }
}

>doc1 = {"batch1": {"$lt":10000}, "batch2": {"$gt":9980}}
>db.myCollection.find(doc).explain()
{
        "cursor" : "BtreeCursor batch2_1",
        "nscanned" : 25020,
        "nscannedObjects" : 25020,
        "n" : 19,
        "millis" : 170,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "isMultiKey" : false,
        "indexOnly" : false,
        "indexBounds" : {
                "batch2" : [
                        [
                                74980,
                                1.7976931348623157e+308
                        ]
                ]
        }
}


The fields on which the query is being issued is named batch1 and batch2. The index created is not a composite index. There is index created on each field

{"batch1":1}

{"batch2":1}

Query is issued as below

Fields batch1 and batch2 can contain different values too. In this use case to keep it simple I am inserting same values in both the fields.

I observe that index bounds in on batch2 when docs is selected with value of x > 50k and index bounds is on batch1 with value of x  < 50k (any reason for this behaviour?)



>> Can you describe how to reproduce this initial data set? <<

I spawn 10 threads each inserting 10000 documents. Where in Thread 1 inserts docs with values (for batch1 and bacth2) in the sequence 1, 11, 21, 31, ......99991, Thread 2 inserts docs in the sequence 2,12,22,32........99992 continuing Thread 10 inserts 10, 20, 30........100000. This was done to make sure I don't have the values in serial order on storage.

Also, when I select a value for x it is always on the higher side. That is say, I am selecting from the range 26k-50K x value ranges closer to 50K.

Thanks

Praveen

Jeremy Mikola

unread,
Jul 26, 2012, 4:04:07 PM7/26/12
to mongod...@googlegroups.com
On Thursday, July 26, 2012 12:28:12 PM UTC-4, Praveen wrote:

I observe that index bounds in on batch2 when docs is selected with value of x > 50k and index bounds is on batch1 with value of x  < 50k (any reason for this behaviour?)

Periodically (I believe it's every 100 queries), Mongo will run a query through multiple indexes and elect to use whichever is most efficient. That could explain the switch mid-way during your test, as both of the single-key indexes are perfectly suitable for the queries.

I attempted to recreate your scenario in https://gist.github.com/3184175. Can you confirm if that looks correct? I also included my results running the test with queries all within the 75-100k range, and then in four batches of 25k-sized ranges.

Praveen

unread,
Aug 2, 2012, 1:58:56 AM8/2/12
to mongod...@googlegroups.com
Hi Jeremy,

Sorry was tied up with other work and did not get time to reply back.

The program looks fine. Minor variations from what I am using. The query issued does not use a .toArray but iterates through them as below

DBCursor cur = db.foo.find({x: {$lt: i}, y: {$gt: (i - 20)}})

while(cur.hasNext()){
cur.next()
}

Second my value of X does not go in steps of 1000 but chooses values close to the upper limit. That is in range 0-25K value of x would stay between 20k to 25k . The way I am achieving this is with a formula as below

(_threadCount + (_numberOfThreads * (i + 1))) - _numberOfThreads

Where threadCount - is the number assigned to the thread when invoked (in range 1-50, as I spawn 50 threads. If single thread then this will always be 1).

numberOfThreads - is the number of threads running in parallel

i - iterates from 1 to the number of queries (100)

Also the insert into the collection happens via threads, hence the order of storage might not be sequential (in regular steps) as in this case.

Here you can find my select test https://gist.github.com/3234186

Program used to insert data: https://gist.github.com/3234199

Praveen

unread,
Aug 7, 2012, 12:57:01 AM8/7/12
to mongod...@googlegroups.com
Hi Jeremy,

Did you get time to look into this?

Thanks

Praveen

Jeremy Mikola

unread,
Aug 7, 2012, 2:57:57 AM8/7/12
to mongod...@googlegroups.com
Praveen,

I changed the my test script a bit to collect additional information about index usage. Based on the difference in nscanned results across each batch of queries, it's evident that we're traversing much more of the b-tree for values closer to the middle of the 0-100k range. This is reasonable given the structure of b-trees.

If you look at my updated results, I'm seeing approximately 215,000 scanned b-tree indexes per second across both test cases (the single batch of 100 queries and four batches of 25 queries each).

For the first test (100 queries in the 75k-100k range), I noticed the x index used exclusively. In the second test, I saw the x index used for the first two batches as the nscanned per-query increased to 50,000, after which point MongoDB elected to use the y index. For the last two batches, the nscanned value for each query (using the y index) steadily decreased from 50,000 back to zero. You can see this yourself by inserting a print of the explain() return value in the query loop.

So, from what I can tell, there's nothing strange happening here. B-trees are being utilized as we would expect.

Reply all
Reply to author
Forward
0 new messages