Why do queries miss or double-load objects when I pre-gather the cursors?

Joshua Fox

Apr 13, 2016, 2:49:44 AM
to Google Cloud Datastore

I want to load a lot of data -- a snapshot of an entire table -- from Google Datastore into memory for later intensive repeated in-memory navigation of the objects.


So I pre-gather cursors: I run the query (with keysOnly=true), page through the results, and collect a cursor for each page. I store the list of cursor strings in a local variable.


Then (immediately thereafter, in the same method), I spin off one thread per cursor. Each thread loads 800 objects in a single query (the exact same query object, but with keysOnly=false) and processes them.


(The query does not have NOT_EQUAL, IN, inequality, or sorting.)
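
Roughly, the pattern looks like this (a simplified sketch using the Java client; the "Node" kind, the page-size constant, and the process() call are placeholders, and error handling is omitted):

    import com.google.cloud.datastore.*;
    import java.util.ArrayList;
    import java.util.List;

    public class SnapshotLoader {
      private static final int PAGE = 800;

      public static void main(String[] args) {
        Datastore ds = DatastoreOptions.getDefaultInstance().getService();

        // Phase 1: keys-only query; page through it, saving a cursor per page.
        List<Cursor> cursors = new ArrayList<>();
        Cursor start = null;
        while (true) {
          KeyQuery.Builder kb = Query.newKeyQueryBuilder()
              .setKind("Node")                  // placeholder kind
              .setLimit(PAGE);
          if (start != null) kb.setStartCursor(start);
          QueryResults<Key> page = ds.run(kb.build());
          int count = 0;
          while (page.hasNext()) { page.next(); count++; }
          if (count < PAGE) break;              // last page reached
          start = page.getCursorAfter();
          cursors.add(start);
        }

        // Phase 2: one thread per cursor. Each thread re-runs the "same"
        // query as a full entity query (keysOnly=false), starting from its
        // pre-gathered cursor. (The chunk before the first cursor is loaded
        // separately; omitted here.)
        for (Cursor c : cursors) {
          new Thread(() -> {
            EntityQuery q = Query.newEntityQueryBuilder()
                .setKind("Node")
                .setStartCursor(c)              // cursor from the keys-only query
                .setLimit(PAGE)
                .build();
            ds.run(q).forEachRemaining(entity -> {
              // process(entity) -- placeholder for our in-memory indexing
            });
          }).start();
        }
      }
    }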


This is not the usual way cursors are used, but it looks correct to me.


Yet I am seeing what appear to be missed pages or objects, as well as duplicate loading of objects.


What is going wrong here?

Ed Davisson

Apr 13, 2016, 1:13:27 PM
to Joshua Fox, Google Cloud Datastore
Hi Joshua,

The cursors returned by a query are only valid for use in the same query. When you switch from the keys-only query to the non-keys-only query, the cursors are no longer applicable. Our documentation should make this clearer -- we'll work on improving that.

Can you describe your use case in more detail? If your goal is to split a result set into similar sized chunks, you might want to take a look at QuerySplitter.

--
You received this message because you are subscribed to the Google Groups "Google Cloud Datastore" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gcd-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gcd-discuss/59fd8bb4-2230-4536-a1da-1f91fd94d604%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Joshua Fox יהושע פוקס

Apr 13, 2016, 1:20:29 PM
to Ed Davisson, Google Cloud Datastore
When you switch from the keys-only query to the non-keys-only query, the cursors are no longer applicable.

Thank you, that's important. I had thought that as long as the filters, etc. were the same, it would be considered the "same" query.
 
Can you describe your use case in more detail?
 
I need to load thousands of objects into memory -- a whole table -- on server initialization and keep them there permanently, because I need to run graph-traversal algorithms (like Dijkstra's) potentially across all these objects. Loading them on demand or per request would be prohibitively slow.
 
If your goal is to split a result set into similar sized chunks
 
I don't specifically want to load chunks, but parallelism is essential for fast bulk loading, and the old (buggy) approach described in my first message was the fastest: it loaded hundreds of objects per query, avoiding the overhead of issuing hundreds of separate queries.
 
you might want to take a look at QuerySplitter.
 
How do the use cases of QuerySplitter differ from those of splitting up a query by cursors?

Ed Davisson

Apr 13, 2016, 1:38:52 PM
to joshu...@gmail.com, Google Cloud Datastore
Thanks for the details.
It's very similar. QuerySplitter uses cursors under the hood, but it only has to scan a subset of the result set to find the split points (the trade-off being that the splits are only approximately evenly spaced).
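
For illustration, driving QuerySplitter looks roughly like this (a sketch against the com.google.datastore.v1 client; the kind name, project ID, and split count are placeholders):

    import com.google.datastore.v1.KindExpression;
    import com.google.datastore.v1.PartitionId;
    import com.google.datastore.v1.Query;
    import com.google.datastore.v1.client.Datastore;
    import com.google.datastore.v1.client.DatastoreHelper;
    import com.google.datastore.v1.client.QuerySplitter;
    import java.util.List;

    Datastore datastore = DatastoreHelper.getDatastoreFromEnv();
    QuerySplitter splitter = DatastoreHelper.getQuerySplitter();

    Query query = Query.newBuilder()
        .addKind(KindExpression.newBuilder().setName("Node"))  // placeholder kind
        .build();
    PartitionId partition = PartitionId.newBuilder()
        .setProjectId("my-project")                            // placeholder project
        .build();

    // Ask for ~16 shard queries. Each returned Query is the original query
    // bounded by internally generated cursors, so it yields exactly one chunk.
    List<Query> shards = splitter.getSplits(query, partition, 16, datastore);

Each shard can then be run independently, since the cursor bounds are baked into the returned queries.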

If your result set isn't changing, then the approach you've described should work fine.

Joshua Fox יהושע פוקס

Apr 13, 2016, 1:48:41 PM
to Ed Davisson, Google Cloud Datastore
Thank you; we don't need identically sized chunks. With QuerySplitter, can I parallelize the loading of chunks, or will it run into the same problem I hit earlier with pre-gathering the cursors?

If your result set isn't changing, then the approach you've described should work fine.

By "approach that you're described" do you mean the older (apparently buggy) one, or QuerySplitter?

Ed Davisson

Apr 13, 2016, 3:41:01 PM
to joshu...@gmail.com, Google Cloud Datastore
It takes the query you want to run as input and returns a list of queries, each of which will return one chunk (and it avoids the problem of incompatible cursors you ran into earlier).

If your result set isn't changing, then the approach you've described should work fine.

By "approach that you're described" do you mean the older (apparently buggy) one, or QuerySplitter?

I meant that if the result set won't change, you could precompute and store a list of cursors and use them each time you want to load the results into memory. The precomputation could either be done with QuerySplitter or by running the full query (not keys-only) and gathering cursors.

If the results will change, then you would want to compute the splits each time, in which case QuerySplitter should be less expensive than running the full query.
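
Continuing the sketch from my earlier message, the shard queries can be run in parallel. The one subtlety is that a single runQuery call may not return a whole chunk, so each worker pages within its shard using the batch's end cursor (the thread-pool size is a placeholder, and error handling is omitted):

    import com.google.datastore.v1.Entity;
    import com.google.datastore.v1.EntityResult;
    import com.google.datastore.v1.Query;
    import com.google.datastore.v1.QueryResultBatch;
    import com.google.datastore.v1.RunQueryRequest;
    import com.google.datastore.v1.RunQueryResponse;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    ExecutorService pool = Executors.newFixedThreadPool(8);  // placeholder size
    List<Future<List<Entity>>> futures = new ArrayList<>();
    for (Query shard : shards) {          // shards from getSplits() above
      futures.add(pool.submit(() -> {
        List<Entity> loaded = new ArrayList<>();
        Query.Builder q = shard.toBuilder();
        while (true) {
          RunQueryResponse resp = datastore.runQuery(
              RunQueryRequest.newBuilder().setQuery(q).build());
          QueryResultBatch batch = resp.getBatch();
          for (EntityResult r : batch.getEntityResultsList()) {
            loaded.add(r.getEntity());
          }
          if (batch.getMoreResults()
              != QueryResultBatch.MoreResultsType.NOT_FINISHED) {
            break;                        // this shard is exhausted
          }
          q.setStartCursor(batch.getEndCursor());  // resume within the shard
        }
        return loaded;
      }));
    }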

Joshua Fox יהושע פוקס

Apr 14, 2016, 2:45:56 AM
to Ed Davisson, Google Cloud Datastore
It takes the query you want to run as input and returns a list of queries, each of which will return one chunk (and it avoids the problem of incompatible cursors you ran into earlier).
Looks good. I'll give it a try. 

May I suggest appending a hash of the full query parameters to the cursor string, then checking that any query using that cursor has matching parameters, and throwing an exception if not?

We encountered very inconsistent and mysterious behavior; if there had been an exception, we would have figured it out immediately.
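
Something like this hypothetical guard is what I have in mind -- none of it exists in the API today; the CheckedCursor class and its fingerprinting are invented for illustration:

    // Hypothetical wrapper: pair each cursor with a fingerprint of the
    // query that produced it, and fail fast on mismatched reuse.
    final class CheckedCursor {
      private final String cursor;            // the opaque cursor string
      private final String queryFingerprint;  // hash of the full query parameters

      CheckedCursor(String cursor, String canonicalQuery) {
        this.cursor = cursor;
        this.queryFingerprint = Integer.toHexString(canonicalQuery.hashCode());
      }

      // Returns the raw cursor only if the caller's query matches the one
      // that generated it; otherwise throws instead of silently misbehaving.
      String cursorFor(String canonicalQuery) {
        String actual = Integer.toHexString(canonicalQuery.hashCode());
        if (!actual.equals(queryFingerprint)) {
          throw new IllegalStateException(
              "Cursor was produced by a different query; refusing to reuse it");
        }
        return cursor;
      }
    }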

Thanks again,

Joshua

Ed Davisson

Apr 14, 2016, 12:44:38 PM
to joshu...@gmail.com, Google Cloud Datastore
It's definitely something we'd like to do. Filed https://github.com/GoogleCloudPlatform/google-cloud-datastore/issues/108 to track it externally, and we'll work on improving the documentation in the interim.

Joshua Fox יהושע פוקס

May 16, 2016, 3:05:45 PM
to Ed Davisson, Google Cloud Datastore
Is QuerySplitter available on the Flexible Environment (or App Engine)? It appears that QuerySplitter is part of the Client API, which is intended only for applications not hosted on App Engine (or the Flexible Environment).

If not, what is the recommended way to parallelize loading an entire table for a Flexible Environment user?

(I note that an interface called QuerySplitter has been available in the App Engine SDK since 2009, but it is part of a different MultiQueryBuilder mechanism, which apparently is not meant for parallelizing a full table load.)


Ed Davisson

May 18, 2016, 12:57:13 PM
to joshu...@gmail.com, Google Cloud Datastore
That's correct, QuerySplitter is not intended for use in App Engine (and the interface you found with the same name is doing something different).

For on-App Engine use cases, you might want to take a look at appengine-mapreduce.