Supporting Geo-spatial operations on GPUs


Harshada Chavan

Aug 11, 2016, 3:48:24 PM
to mongodb-dev
Hello,

I am working on supporting geo-spatial operations (Intersects, Contains, etc.) in MongoDB using GPUs. My plan is to move the filter phase (where the condition is checked against the bounding boxes) onto the GPU. Using the result of the filter phase (a Boolean array), the refine phase (the usual fetch call stack, see below) is invoked only for the records that satisfy the filter phase.

For this purpose, I want to access all the data in a collection before the rows are actually fetched through the usual fetch calls. Currently, I am able to do this using the cursor available in the mongo::CollectionScan::work() method in src/mongo/db/exec/collection_scan.cpp. On the first call to this method, I read all the record geometries from the collection via the cursor, get the query geometry, and invoke the filter phase on the GPU. On this call and all subsequent calls, I invoke the remaining fetch functions only if the current row satisfied the filter phase (i.e. has a 1 in the filter phase's result array).

However, this method has an obvious overhead: every geometry that passes the filter phase is fetched twice. This makes the query slower when the selectivity is greater than 10%.
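For clarity, the filter phase described above can be sketched as follows. This is a minimal CPU stand-in for the GPU kernel; `Box` and `filterPhase` are hypothetical names for illustration, not MongoDB API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned bounding box; the real code would extract
// these from the record geometries (e.g. via GeometryContainer).
struct Box {
    double minX, minY, maxX, maxY;
};

// Filter phase: test every record's bounding box against the query box.
// On a GPU this loop would be one thread per record; here it is a CPU
// stand-in that produces the same Boolean result array.
std::vector<bool> filterPhase(const std::vector<Box>& records, const Box& query) {
    std::vector<bool> result(records.size());
    for (std::size_t i = 0; i < records.size(); ++i) {
        const Box& r = records[i];
        result[i] = !(r.maxX < query.minX || query.maxX < r.minX ||
                      r.maxY < query.minY || query.maxY < r.minY);
    }
    return result;
}
```

The refine phase then runs the exact test (the S2Polygon::Intersects call in the stack below) only for the indices where result[i] is true.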

I would like to know:

1. Is there a faster way of fetching rows than using the cursor? I am calling the same functions to extract geometries from the records returned by the cursor as the fetch call stack does.

2. Is mongo::CollectionScan::work() the right place to fetch the data?

3. Any comments on my approach?

I will be happy to answer your questions and provide more information.

Thank you for your help.

Fetch (also called the "refine phase" above) call stack:

#1  0x0000000003215de9 in S2Polygon::Intersects (this=0x7ffff59627a0, b=0x7ffff5963120)
   at src/third_party/s2/s2polygon.cc:418
#2  0x0000000002104068 in mongo::GeometryContainer::intersects (this=0x7ffff5d69fa0, otherPolygon=...)
   at src/mongo/db/geo/geometry_container.cpp:792
#3  0x00000000021029c3 in mongo::GeometryContainer::intersects (this=0x7ffff5d69fa0,
   otherContainer=...) at src/mongo/db/geo/geometry_container.cpp:535
#4  0x00000000021678ff in mongo::GeoMatchExpression::matchesSingleElement (this=0x7ffff5f3cc80, e=...)
   at src/mongo/db/matcher/expression_geo.cpp:349
#5  0x0000000002169cd3 in mongo::LeafMatchExpression::matches (this=0x7ffff5f3cc80,
   doc=0x7ffff7eaea60, details=0x0) at src/mongo/db/matcher/expression_leaf.cpp:73
#6  0x00000000020114c8 in mongo::Filter::passes (wsm=0x7ffff5d52a00, filter=0x7ffff5f3cc80)
   at src/mongo/db/exec/filter.h:157
#7  0x00000000020107ea in mongo::CollectionScan::returnIfMatches (this=0x7ffff5f3cb60,
   member=0x7ffff5d52a00, memberID=1, out=0x7ffff7eaed00)
   at src/mongo/db/exec/collection_scan.cpp:178
#8  0x00000000020105ef in mongo::CollectionScan::work (this=0x7ffff5f3cb60, out=0x7ffff7eaed00)
   at src/mongo/db/exec/collection_scan.cpp:170
#9  0x0000000002308f7f in mongo::PlanExecutor::getNextImpl (this=0x7ffff5ceefe0,
   objOut=0x7ffff7eaede0, dlOut=0x0) at src/mongo/db/query/plan_executor.cpp:390
#10 0x00000000023089ee in mongo::PlanExecutor::getNext (this=0x7ffff5ceefe0, objOut=0x7ffff7eaeff0,
   dlOut=0x0) at src/mongo/db/query/plan_executor.cpp:319


Thanks,
Harshada

Dwight Merriman

Aug 11, 2016, 7:59:19 PM
to mongodb-dev
i'm not sure exactly what the goal is (a specific use case or just research), nor do i know this code flow all that well, but i can make a couple of suggestions, at the risk of some degree of imprecision.

i think what you need to do is make a new query plan -- or rather, a new PlanStage.  you might want to look at some of the subclasses of PlanStage to see the general feel of what to do.

so that is a comment on how to "glue" your stuff into the codebase.  the other aspect of your question was about reading everything twice.

one thing you could do is, for a given collection, create a regular mongodb index on { lat: 1, long: 1 } or whatever is appropriate. that index will contain all the coordinates of interest, but it will likely be much, much smaller than the entire collection, as it doesn't have all the other data from the documents (assuming they aren't just points with no adornment). once you have that, you can scan the index sequentially, hand that data to the gpu, and get back which ones match. you keep a set of which ones matched -- the RecordIds. then you sequentially scan the collection as mentioned, skipping over (i.e. treating as not matched) the ones not present in the RecordId set. well, you do that if you are returning a significant percentage of the collection; you might instead fetch the records by random access if there aren't so many. anyway, what i just wrote is a bit coarse and crude, but i wanted to keep it short, and hopefully it conveys the basic idea.
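That two-pass idea could be sketched roughly as below. All the names here are hypothetical stand-ins (RecordId is just an integer, and the GPU call is a callback), not actual MongoDB API:

```cpp
#include <cassert>
#include <functional>
#include <unordered_set>
#include <vector>

using RecordId = long long;  // integer stand-in for mongo::RecordId

// One entry of the hypothetical { lat: 1, long: 1 } index.
struct IndexEntry {
    double lat, lon;
    RecordId rid;
};

// Pass 1: scan the index sequentially, hand the coordinates to the GPU
// (simulated here by a callback), and keep the RecordIds that matched.
std::unordered_set<RecordId> buildMatchSet(
        const std::vector<IndexEntry>& index,
        const std::function<bool(double, double)>& gpuMatch) {
    std::unordered_set<RecordId> matched;
    for (const auto& e : index) {
        if (gpuMatch(e.lat, e.lon))
            matched.insert(e.rid);
    }
    return matched;
}

// Pass 2: during the sequential collection scan, a record is returned
// only if its RecordId is present in the match set.
bool passesFilter(const std::unordered_set<RecordId>& matched, RecordId rid) {
    return matched.count(rid) != 0;
}
```

The point of the first pass over the small index is that the full documents are only ever read once, in the second pass, which avoids the double-fetch overhead described in the original message.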

another approach, if the data isn't changing too fast, would be to keep all the lat/longs in GPU memory all the time, and update them every time a document changes, in real time. this would be very much like the existing indexes, which update as the documents change (potentially with an lsm variation, but same idea). then the data is already on the gpu card, avoiding the per-query transfer, which could be costly or not depending on how complex the polygon is.

another approach would be to read N documents that fit in ram, hand the gpu the coords, get back the results while the documents are still in ram, and stream them onward. this sort of approach might not be too hard to implement if it were done as an aggregation pipeline stage rather than as a query plan stage.
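A minimal sketch of that batch-filter-stream loop, with hypothetical names throughout and the GPU call simulated on the CPU (in a real aggregation stage the batch would come from the previous stage rather than a vector):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Doc {
    double lat, lon;
    int payload;  // stands in for the rest of the document
};

// Stand-in for the GPU call: takes one batch of coordinates and returns
// a match flag per document. The predicate here is a placeholder.
std::vector<bool> gpuFilterBatch(const std::vector<Doc>& batch) {
    std::vector<bool> flags(batch.size());
    for (std::size_t i = 0; i < batch.size(); ++i)
        flags[i] = batch[i].lat > 0.0;  // placeholder predicate
    return flags;
}

// Read N documents at a time (N chosen so a batch fits in RAM and GPU
// memory), filter each batch on the GPU, and stream the survivors onward
// while they are still in RAM.
std::vector<Doc> processBatched(const std::vector<Doc>& input, std::size_t N) {
    std::vector<Doc> out;
    for (std::size_t start = 0; start < input.size(); start += N) {
        std::size_t end = std::min(input.size(), start + N);
        std::vector<Doc> batch(input.begin() + start, input.begin() + end);
        std::vector<bool> flags = gpuFilterBatch(batch);
        for (std::size_t i = 0; i < batch.size(); ++i)
            if (flags[i]) out.push_back(batch[i]);
    }
    return out;
}
```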

anyway, there are a few ideas. i'd try to figure out how to keep it relatively simple for now, as the amount of work could become quite large in some cases.

Harshada Chavan

Aug 23, 2016, 5:57:59 PM
to mongodb-dev
Hello Dwight,

Thank you very much for the ideas. I really liked the index creation idea. I am going to give it a try.

For the data updates, we assume that the data never changes, so I need not worry about updating the data stored on the GPU.

Thanks again,
Harshada 