index is completely useless to $nin operator

543 views
Skip to first unread message

realzyy

unread,
Feb 23, 2012, 5:03:03 AM2/23/12
to mongodb-user
I know it is not efficient, but it is useless in the following case.
And I should mention that the index is highly selective. Who can tell
me the reason??

query:
db.PfpModelTaskDO.find({ nextService: { $nin: [ "taskOverService",
"taskFailedService" ] }})

collection:
PFP:PRIMARY> db.PfpModelTaskDO.stats()
{
"ns" : "taobaoP4PTask.PfpModelTaskDO",
"count" : 1043643,
"size" : 972910876,
"avgObjSize" : 932.2257476934162,
"storageSize" : 3295053824,
"numExtents" : 29,
"nindexes" : 5,
"lastExtentSize" : 555698944,
"paddingFactor" : 1.7799999999019072,
"flags" : 1,
"totalIndexSize" : 394747904,
"indexSizes" : {
"_id_" : 57008128,
"gmtCreate_1" : 53673984,
"taskStatus_1_taskRandomCount_1_nextTime_1" : 92520448,
"sellerID_1_taskStatus_1_requestService_1_nextService_1_gmtCreate_1" :
143974400,
"nextService_1" : 47570944
},
"ok" : 1
}

query plan 1, without hint:
PFP:PRIMARY> db.PfpModelTaskDO.find({ nextService: { $nin:
[ "taskOverService", "taskFailedService" ] }}).explain()
{
"cursor" : "BasicCursor",
"nscanned" : 1043643,
"nscannedObjects" : 1043643,
"n" : 0,
"millis" : 1647,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {

}
}

query plan 2, with hint:
PFP:PRIMARY> db.PfpModelTaskDO.find({ nextService: { $nin:
[ "taskOverService",
"taskFailedService" ] }}).hint("nextService_1").explain()
{
"cursor" : "BtreeCursor nextService_1",
"nscanned" : 1043643,
"nscannedObjects" : 1043643,
"n" : 0,
"millis" : 2056,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"nextService" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
}


Marc

unread,
Feb 23, 2012, 4:55:42 PM2/23/12
to mongodb-user
There is actually a note on this in the Indexing FAQ section:
http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-I%27musing%24neor%24nininaquery%2Candwhileitusestheindex%2Cit%27sstillslow.What%27shappening%3F
"The problem with $ne and $nin is that much of an index will match
queries like these. If you need to use $nin, it's often best to make
sure that an additional, more selective criterion is part of the
query."

For a better understanding of how indexes work, I recommend reading
the Blog post "The Joy Of Indexing" written by my colleague, Kyle.

http://kylebanker.com/blog/2010/09/21/the-joy-of-mongodb-indexes/

To continue the cooking analogy, imagine a cookbook with an index on
ingredients, and we would like to find all recipes that don't contain
potatoes.

We flip to the back of the cookbook and look at the recipes in the
"Apples" section. For each of those recipes, we have to make sure
that they don't also contain potatoes. Then we look at the "beets"
section, and so on and so fourth, until we get to the end of the
index. (We can skip the "potatoes" entry, if there is one.) As per
the blog post, this example does not correspond 100% to the way
indexes actually work, but the analogy is valid nevertheless. As you
can see, we have looked at a lot of index entries and recipes looking
for the absence of an ingredient.

The fact of the matter is that $nin is simply inefficient. If
possible, it is preferable to search "positive space" rather than
"negative space".

If you find that your application is performing a lot of $nin queries,
it will be preferable to change your document structure such that
queries can be performed on values that exist, rather than tan looking
for documents that don't contain specific values.

realzyy

unread,
Feb 24, 2012, 4:32:23 AM2/24/12
to mongodb-user
$nin operator requires full index scan when we have index, and full
table scan when we don't have index.
Full index scan is not necessarily faster than full table scan, so
$nin is always slow.

Right?

On Feb 24, 5:55 am, Marc <m...@10gen.com> wrote:
> There is actually a note on this in the Indexing FAQ section:http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingA...

Marc

unread,
Feb 24, 2012, 11:50:04 AM2/24/12
to mongodb-user
In fact, in some cases attempting to use an index on a $nin query may
be even less efficient than doing a full table scan! Imagine a $nin
query that returns every document in a collection but one. For each
document, the query will first have to look in the index, and then
return the document. This is two "look-ups", as opposed to a table
scan, which would only look at each document once.

In general, if your query requires more than half of the index to be
used, you should reexamine how the query is performed, or resort to a
table scan at the very least.

I was discussing this issue with some colleagues of mine, and as it
happens, a similar question was asked last night, which you may find
useful.
http://groups.google.com/group/mongodb-user/browse_thread/thread/ae23ec12a4fac2a5/1c48d27342380fc2

XtraCoder

unread,
Feb 28, 2012, 2:29:03 PM2/28/12
to mongodb-user
Nevertheless Mongo team's comments regarding $nin are more or less
clear, algorithm to use $nin is not as efficient as it can be ...

A colleague of mine come across the problem with $nin's performance -
the use case is

- collection is 10000 items
- $in - is first 100 items
- $nin - is another 9900 items, i.e. both arrays are not intersecting.

it seems that Mongo performs a raw scan of the $nin array to check
each matching item. I think it would be usefull to build a b-tree out
of $nin and check against it. Performance would be >100*times better.

Below is the full Java source to try $in and $nin in various
combinations

import com.mongodb.*;
import java.util.*;

public class Main {

public static class DbOptions {
public static String server = "localhost";
public static int port = 27017;
public static String dbName = "DbPeople";
public static String collName = "People";
public static int numberOfEntities = 10000;
}

public static void main(String[] args) throws Exception {
createEntities();
testPerformance();
}

public static Mongo mongo;
public static DB db;
public static DBCollection coll;

public static void createEntities() {
Date stDate = new Date();

System.out.println("Creating DB with " +
DbOptions.numberOfEntities + " entities...");

DBCollection coll = createCollection();
coll.createIndex(new BasicDBObject("name", 1));
for (int i = 0; i < DbOptions.numberOfEntities; ++i) {
createEntity("Person #" + i, coll);
}

Date endDate = new Date();
System.out.println("time: " + (endDate.getTime() -
stDate.getTime()) + "ms\n");
}

private static DBCollection createCollection() {
try {
mongo = new Mongo(DbOptions.server, DbOptions.port);
mongo.dropDatabase(DbOptions.dbName);
db = mongo.getDB(DbOptions.dbName);
coll = db.getCollection(DbOptions.collName);
return coll;
} catch (Exception ex) {
ex.printStackTrace();
return null;
}
}

private static void createEntity(String name, DBCollection coll) {
BasicDBObject ent = new BasicDBObject();
ent.put("name", name);
coll.insert(ent);
}

public static void testPerformance() {
ArrayList<Object> persons = getAllObjectIdsInCollection();
ArrayList<Object> persons100 = new ArrayList<Object>();
ArrayList<Object> personsNot100 = new ArrayList<Object>();
ArrayList<Object> personsLast100 = new ArrayList<Object>();

persons100.addAll(persons.subList(0, 100));
personsNot100.addAll(persons.subList(100, persons.size()));
personsLast100.addAll(persons.subList(persons.size() - 100,
persons.size()));

test(persons, null);
test(persons, persons);
test(null, persons);
test(persons100, null);
test(persons100, persons);
test(persons100, personsNot100);
test(persons100, personsLast100);
}

private static ArrayList<Object> getAllObjectIdsInCollection() {
BasicDBObject query = new BasicDBObject();

BasicDBObject keys = new BasicDBObject();
keys.put("_id", 1);

DBCursor cursor = coll.find(query, keys);
ArrayList<Object> res = new ArrayList<Object>();
while (cursor.hasNext()) {
res.add(cursor.next().get("_id"));
}
return res;
}

static int testCounter = 0;

private static List<DBObject> test(ArrayList<Object> in,
ArrayList<Object> nin) {
Date dtStart = new Date(), dtQuery, dtFetched;

System.out.println("Test #" + (++testCounter) + "
---------------------------");
System.out.println(" $in: " + (in == null ? "<null>" :
in.size()));
System.out.println(" $nin: " + (nin == null ? "<null>" :
nin.size()));

BasicDBObject query = new BasicDBObject();
BasicDBObject $in = null, $nin = null;

if( in != null )
query = $in = new BasicDBObject("_id", new
BasicDBObject("$in", in));
if( nin != null )
query = $nin = new BasicDBObject("_id", new
BasicDBObject("$nin", nin));
if( $in != null && $nin != null )
query = new BasicDBObject("$and", new BasicDBObject[]
{ $in, $nin} );

DBCursor cursor = coll.find(query); cursor.count(); dtQuery =
new Date();
List<DBObject> list = cursor.toArray(); dtFetched = new
Date();

//System.out.println(" query: " + query.toString());
System.out.println(" records: " + list.size());
System.out.println(" time ...");
System.out.println(" query: " + (dtQuery.getTime() -
dtStart.getTime()) + "ms");
System.out.println(" fetch: " + (dtFetched.getTime() -
dtQuery.getTime()) + "ms");
System.out.println(" total: " + (dtFetched.getTime() -
dtStart.getTime()) + "ms");

return list;
}
}


Reply all
Reply to author
Forward
0 new messages