Counting keys and $exists of lists of keys

312 views
Skip to first unread message

Dan Yamins

unread,
Mar 30, 2010, 12:48:37 PM3/30/10
to mongod...@googlegroups.com
Two related questions:

1) Is there any efficient way to find those records whose keys intersect with a list of key names?  E.g. if I have an arbitrary list of 50 key names, and I want to find all records that have at least one of those keys, is there anything better then taking the union of the list of results for {$exists:key} for each key alone?

2) Is there any efficient way to find those records which have more a given number of keys?  Perhaps a $where query?  But what javascript express would I use?  The following two attempts did not work:
 
{'$where':'function(){return this.Keys.length == 10};'}
{'$where':'function(){return this.length == 10};'}


Thanks,
Dan

Kyle Banker

unread,
Mar 30, 2010, 1:44:46 PM3/30/10
to mongod...@googlegroups.com
Answers below:

On Tue, Mar 30, 2010 at 12:48 PM, Dan Yamins <dya...@gmail.com> wrote:
> Two related questions:
>
> 1) Is there any efficient way to find those records whose keys intersect
> with a list of key names?  E.g. if I have an arbitrary list of 50 key names,
> and I want to find all records that have at least one of those keys, is
> there anything better then taking the union of the list of results for
> {$exists:key} for each key alone?

There may be a better way to model this so that you don't have to use
$exist, etc. Can you share more about your use case? I'm thinking
you'd have
a key pointing to an array of objects, which would be much more
queryable since you could then use $in.

> 2) Is there any efficient way to find those records which have more a given
> number of keys?  Perhaps a $where query?  But what javascript express would
> I use?  The following two attempts did not work:
>
> {'$where':'function(){return this.Keys.length == 10};'}
> {'$where':'function(){return this.length == 10};'}

JavaScript objects don't have a keys method. You have to iterate over
the properties in each object and count them using for(x in obj).
Alternatively,
if you store your objects in an array as suggested above, then you can
cache the size or vote on this issue
(http://jira.mongodb.org/browse/SERVER-478) and
hope it becomes available soon.

>
> Thanks,
> Dan
>
> --
> You received this message because you are subscribed to the Google Groups
> "mongodb-user" group.
> To post to this group, send email to mongod...@googlegroups.com.
> To unsubscribe from this group, send email to
> mongodb-user...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/mongodb-user?hl=en.
>

Dan Yamins

unread,
Mar 30, 2010, 2:19:30 PM3/30/10
to mongod...@googlegroups.com

Kyle, thanks for your quick reply.


There may be a better way to model this so that you don't have to use
$exist, etc. Can you share more about your use case?  I'm thinking
you'd have
a key pointing to an array of objects, which would be much more
queryable since you could then use $in.


I have a number of similar use cases, but here's a typical one:  I'm storing time series data, where each timepoint represents year or quarter within a year (e.g. '1971' or '1971-1' etc...)  At the moment, I've been thinking of each timepoint as a key name.  E.g. a single time-series document might look like

    {'1951':1.1, '1952':1.2,  .... , '2009':1.6}

Not all series have values in each timepoint, and I want to be able to write queries restrict the returned records to those that possess values in particular range of years or quarters.   

Of course, I could put the timepoints into an array, so that each document looked like this:

   {'Timepoints':['1951','1952', ...., '2009'], 'values':[1.1, 1.2, ..., 1.6]}

This would then allow me to use $in to solve the time range restriction problem.  But then how would I easily do queries across timeseries stored within a collection?  e.g. get all records where the 1951 is greater than 1.2?, or sort by the 1951 values.

Of course, I could simply add the "Timepoints" key/value pair to my original design:

    {'1951':1.1, '1952':1.2,  .... , '2009':1.6,'Timepoints':['1951','1952',...,'2009']}

which would allow me to have both cross-series and within-series queries that I want. But of course, it seems rather redundant and would be annoying given the large value of my data.  Perhaps I could store a "TimepointRange" key/value pair, e.g.

    {'1951':1.1, '1952':1.2,  .... , '2009':1.6,'TimepointRange':['Begin':'1951','End':'2009'}}

But what if the time values don't come in a contiguous range? 

Is there a better design?   I'm sorry if in my ignorance/stupidity I don't see it -- any help would be great.

Generally speaking, I've recently been hitting a number of cases where the lack of any kind of natively-supported "OR" logic on keys has really added to the complexity and inefficiency of my code.  Would it be radically difficult to add some very basic "OR"-style logic?  It wouldn't have to be a general OR operations between all queries.  Perhaps it would be possible to extend the syntax of some operators  that currently accept act on one key (like $exists, $nexists, $in) to allow for multiple keys joined by OR, e.g. if I could write:

    {['key1','key2']: {'$exists':True}}

to mean 'key1' OR 'key2' exists.    (Of course, 'AND' logic would remain as it currently is.)   This would make a whole variety of tasks enormously simpler than they currently are.   Is this out of the question for mongo for a long time to come?


Dan

Kyle Banker

unread,
Apr 1, 2010, 4:23:43 PM4/1/10
to mongod...@googlegroups.com
There is an $or operator in the works: http://jira.mongodb.org/browse/SERVER-205

Have you considered storing one data point per document?

{'series_id': x, year: '1951.1', value: y}

If you index it properly, you should be able to perform range queries
and $in queries as needed.

Dan Yamins

unread,
Apr 2, 2010, 7:48:55 PM4/2/10
to mongod...@googlegroups.com
On Thu, Apr 1, 2010 at 4:23 PM, Kyle Banker <ky...@10gen.com> wrote:
There is an $or operator in the works: http://jira.mongodb.org/browse/SERVER-205


Excellent.  Is it still being actively developed?  I would love to understand what some of the problematic programming issues are with respect to it -- as is hinted at here:  http://groups.google.com/group/mongodb-user/browse_thread/thread/1323d06daa2ac49a/201775d882cbb49b?lnk=gst&q=or&pli=1.

I also think that even if one doesn't have a generic '$or' operator, it might be good to be able to extend the syntax of some operators that currently accept act on one key (like $exists, $nexists, $in) to accept lists of keys -- where the interpretation is to OR the given operation over the list of keys.   I can imagine why query optmization for generic '$or' is an issue, but would some of these issues go away just for this restricted subset of "or"-like operations?

 
Have you considered storing one data point per document?

{'series_id': x, year: '1951.1', value: y}


Absolutely.  But of course we store a lot more information in each row aside from just a series id -- we store actual information about each time series that has human-understandable meaning across time series (e.g. "subject" and "area" etc).  Because of the size of out datasets it would just be prohibitive to repeat all of this information several hundred times, unless we compress all the information into a single series_id and then have a separate series_id table -- e.g. doing massive normalization, and then have joins in the application layer.   Which is kind of exactly what we want to avoid using mongoDB.  In fact, storing, for each time series (or any record for that matter), a array-valued key listing the key names of the other keys present in that record -- e.g. like adding the "timeSeries" key to our exiting records as mentioned above -- is, from the functional point of view probably just as good a solution as storing one data point per document, and would save some space.     And of course, either of these solutions would only solve one of our "OR" logic issues.

So yeah, either a full-fledged OR or maybe just the more restricted version I mentioned above would be really really excellent.  

Dan









 


 

Eliot Horowitz

unread,
Apr 2, 2010, 8:35:46 PM4/2/10
to mongod...@googlegroups.com
we haven't started $or. not that much work, just a matter of getting
it into the priority queue.

JoshL

unread,
Apr 3, 2010, 3:42:02 PM4/3/10
to mongodb-user
Dan -

After racking my brain on a similar issue I got a javascript intersect
to work by basically using the $where condition in find (or group by)
and then writing a simple function. I also found that if using in a
group by you can do it as a finalize.

Anyway, here is my intersection routine. Might help you:

// sample res object
{
"_id.tags.channel" : "Yahoo",
"_id.tags.browser" : "chrome",
"_id.tags.landing_page" : "C",
"csum" : 64.9,
},


// find intersections
var eligibleArray = new Array();
if (res) {
for(var i = 0; i < res.length; i++) {
var intersections = 0;
list = { channel: 'Yahoo', browser: 'chrome' };
for(prop in list) if( res[i].hasOwnProperty("_id.tags."+prop) &&
res[i]["_id.tags."+prop] == list[prop]) intersections++;
res[i].intersections = intersections;
}
}

// output

{
"_id.tags.channel" : "Yahoo",
"_id.tags.browser" : "chrome",
"_id.tags.landing_page" : "C",
"csum" : 64.9,
"intersections" : 2,
},


On Apr 2, 8:35 pm, Eliot Horowitz <eliothorow...@gmail.com> wrote:
> we haven't started $or.  not that much work, just a matter of getting
> it into the priority queue.
>
>
>

> On Fri, Apr 2, 2010 at 7:48 PM, Dan Yamins <dyam...@gmail.com> wrote:


>
> > On Thu, Apr 1, 2010 at 4:23 PM, Kyle Banker <k...@10gen.com> wrote:
>
> >> There is an $or operator in the works:
> >>http://jira.mongodb.org/browse/SERVER-205
>
> > Excellent.  Is it still being actively developed?  I would love to
> > understand what some of the problematic programming issues are with respect
> > to it -- as is hinted at here:

> >  http://groups.google.com/group/mongodb-user/browse_thread/thread/1323....

Dan Yamins

unread,
Apr 6, 2010, 12:10:21 PM4/6/10
to mongod...@googlegroups.com
On Sat, Apr 3, 2010 at 3:42 PM, JoshL <jlip...@gmail.com> wrote:
Dan -

After racking my brain on a similar issue I got a javascript intersect
to work by basically using the $where condition in find (or group by)
and then writing a simple function.  I also found that if using in a
group by you can do it as a finalize.


Josh, thanks for your mail -- this is pretty much how we're doing it as of now.  We evaluated four distinct approaches, and using $where is the only solution that plays well with the rest of the query language (e.g .limit, .sort, etc) and is sufficiently generic that, when the $or does eventually come, we won't have to rewrite too much of the code to handle it    On the other hand, it's slow -- at least, when I have to do an OR over several hundred keys (which is unfortunately quite common for me).   

So hopefully an $or -- or at least the restricted version of it  -- will enter the priority queue someday soon!

Dan


Reply all
Reply to author
Forward
0 new messages