I want to load a lot of data -- a snapshot of an entire table -- from Google Datastore into memory for later intensive repeated in-memory navigation of the objects.
So, I pre-gather cursors: I run the query (with keysOnly=true) and page through the results, collecting the cursor at each page boundary. I store the list of cursor strings in a local variable.
Then, immediately afterwards in the same method, I spin off one thread per cursor. Each thread runs the same query object (but with keysOnly=false) starting at its cursor, loads 800 objects in a single query, and processes them.
(The query does not have NOT_EQUAL, IN, inequality, or sorting.)
This is not the usual way cursors are used, but it looks correct to me.
Yet I am seeing what appear to be missed pages or objects, as well as duplicate loading of objects.
What is going wrong here?
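To make the two-pass pattern above concrete, here is a minimal sketch simulated over an in-memory list rather than a live Datastore connection; `fetch_page`, `PAGE_SIZE`, and the offset-style cursors are all invented for illustration. Note that in this toy model the cursors transfer cleanly between the keys-only and full passes, which is exactly the assumption that the replies below show does not hold for real Datastore cursors.

```python
# Toy simulation of the described approach; not real Datastore API.
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 800
TABLE = [{"id": i} for i in range(3000)]  # stand-in for the Datastore kind

def fetch_page(start_cursor, keys_only):
    """Return (page, end_cursor); a 'cursor' is just an offset here."""
    page = TABLE[start_cursor:start_cursor + PAGE_SIZE]
    if keys_only:
        page = [row["id"] for row in page]
    return page, start_cursor + len(page)

# Pass 1: keys-only query, gathering a cursor at each page boundary.
cursors, cursor = [], 0
while True:
    cursors.append(cursor)
    page, cursor = fetch_page(cursor, keys_only=True)
    if len(page) < PAGE_SIZE:
        break

# Pass 2: one thread per cursor, each loading a full page of entities.
def load_chunk(c):
    page, _ = fetch_page(c, keys_only=False)
    return page

with ThreadPoolExecutor() as pool:
    chunks = list(pool.map(load_chunk, cursors))

loaded = [row for chunk in chunks for row in chunk]
```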
When you switch from the keys-only query to the non-keys-only query, the cursors are no longer applicable.
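To illustrate this point, here is a toy model in which each cursor carries a fingerprint of the query that issued it; the real cursor encoding is opaque and internal to Datastore, so everything here (the fingerprint tuple, `run_query`, the error) is invented for illustration.

```python
# Toy model: a cursor is only valid for the query shape that produced it.
DATA = list(range(10))

def run_query(keys_only, cursor=None):
    fingerprint = ("my_kind", keys_only)   # stand-in for the query shape
    if cursor is not None and cursor[0] != fingerprint:
        raise ValueError("cursor does not match this query")
    start = cursor[1] if cursor else 0
    page = DATA[start:start + 4]
    return page, (fingerprint, start + len(page))

_, cur = run_query(keys_only=True)          # cursor from the keys-only pass
try:
    run_query(keys_only=False, cursor=cur)  # reusing it on the full query
except ValueError as err:
    print(err)
```

In the toy model the mismatch is detected loudly; the silent misbehavior described above (missed and duplicated objects) is consistent with a cursor being applied to a query it was not issued for.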
Can you describe your use case in more detail?
If your goal is to split a result set into similarly sized chunks, you might want to take a look at QuerySplitter.
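A rough sketch of the QuerySplitter idea, under stated assumptions: the real API takes a query and returns a list of queries, each covering one slice of the original's results. Here the "queries" are simulated as `[lo, hi)` key ranges over a sorted in-memory key list, and `split_query` / `run_range` are made-up names, not the real interface.

```python
# Simulated query splitting: sub-queries as key ranges, loaded in parallel.
from concurrent.futures import ThreadPoolExecutor

KEYS = list(range(1000))  # sorted keys of the full result set

def split_query(num_splits):
    """Return key ranges [lo, hi) that together cover every key exactly once."""
    step = len(KEYS) // num_splits
    bounds = [KEYS[i * step] for i in range(num_splits)] + [KEYS[-1] + 1]
    return list(zip(bounds, bounds[1:]))

def run_range(lo_hi):
    lo, hi = lo_hi
    return [k for k in KEYS if lo <= k < hi]  # stand-in for a range query

with ThreadPoolExecutor() as pool:
    chunks = list(pool.map(run_range, split_query(4)))

entities = [k for chunk in chunks for k in chunk]
```

Because each chunk is an independent query rather than a cursor into a shared scan, the chunks can be loaded in parallel without cursors from one query being replayed against another.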
If your result set isn't changing, then the approach you've described should work fine.
> If your result set isn't changing, then the approach you've described should work fine.

By "approach that you've described" do you mean the older (apparently buggy) one, or QuerySplitter?
On Wed, Apr 13, 2016 at 10:48 AM Joshua Fox יהושע פוקס <joshu...@gmail.com> wrote:

On Wed, Apr 13, 2016 at 8:38 PM, Ed Davisson <eddav...@google.com> wrote:

Thanks for the details.

On Wed, Apr 13, 2016 at 10:20 AM Joshua Fox יהושע פוקס <joshu...@gmail.com> wrote:

> When you switch from the keys-only query to the non-keys-only query, the cursors are no longer applicable.

Thank you, that's important. I had thought that as long as the filters, etc. were the same, it was considered the "same" query.

> Can you describe your use case in more detail?

I need to load thousands of objects into memory -- a whole table -- on server initialization and keep them there permanently, because I need to run graph-traversal algorithms (like Dijkstra's) potentially across all these objects. Loading them on demand or per request would be prohibitively slow.

> If your goal is to split a result set into similar sized chunks

I don't specifically want to load chunks, but parallelism is essential for the fastest bulk loading, and the old (buggy) approach described in my email was the fastest, because it loaded hundreds of objects in a single query, avoiding the overhead of hundreds of queries.

> you might want to take a look at QuerySplitter.

How do the use cases of QuerySplitter differ from those of splitting up a query by cursors?

It's very similar. QuerySplitter uses cursors under the hood, but it only has to scan a subset of the result set to find the split points (the trade-off being that they are only approximately evenly spaced).

Thank you, we don't need identical-sized chunks. With QuerySplitter, can I parallelize the loading of chunks, or will it run into the same problem that I hit earlier with pre-gathering the cursors?

It takes the query you want to run as input and returns a list of queries, each of which will return one chunk (and it avoids the problem of incompatible cursors you ran into earlier).
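The "scan a subset to find split points" trade-off can be sketched as follows. The sampling scheme and all names here are invented for illustration; this is not how Datastore actually chooses splits, only a model of why chunks come out approximately rather than exactly even.

```python
# Toy model: pick split points from a sample of keys instead of a full scan,
# giving roughly (not exactly) even chunks.
import random

random.seed(7)
keys = sorted(random.sample(range(100_000), 5000))  # full sorted key set

def approx_split_points(sample_size, num_splits):
    """Choose split points from a random sample of keys, not from all of them."""
    sample = sorted(random.sample(keys, sample_size))
    step = sample_size // num_splits
    return [sample[i * step] for i in range(1, num_splits)]

points = approx_split_points(sample_size=100, num_splits=4)
bounds = [keys[0]] + points + [keys[-1] + 1]
chunk_sizes = [sum(lo <= k < hi for k in keys)
               for lo, hi in zip(bounds, bounds[1:])]
# chunk_sizes come out near, but not exactly at, 5000 / 4 keys each.
```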
> > If your result set isn't changing, then the approach you've described should work fine.
>
> By "approach that you've described" do you mean the older (apparently buggy) one, or QuerySplitter?

I meant that if the result set won't change, you could precompute and store a list of cursors and use them each time you want to load the results into memory. The precomputation could be done either with QuerySplitter or by running the full query (not keys-only) and gathering cursors.

If the results will change, then you would want to compute the splits each time, in which case QuerySplitter should be less expensive than running the full query.
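The precompute-and-reuse idea can be sketched like this, under the stated assumption that the result set does not change between loads; `fetch_page`, `precompute_cursors`, and the offset-style cursors are simulated stand-ins, not real API.

```python
# Toy model: gather cursors once with the full query, reuse them on every load.
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 250
TABLE = [{"id": i} for i in range(1000)]

def fetch_page(cursor):
    page = TABLE[cursor:cursor + PAGE_SIZE]
    return page, cursor + len(page)

def precompute_cursors():
    """One full scan, done once; its cost is amortized over later reloads."""
    cursors, cursor = [], 0
    while cursor < len(TABLE):
        cursors.append(cursor)
        _, cursor = fetch_page(cursor)
    return cursors

STORED = precompute_cursors()  # e.g. persisted alongside the app

def reload_all():
    """Parallel reload using the stored cursors, valid while data is unchanged."""
    with ThreadPoolExecutor() as pool:
        chunks = pool.map(lambda c: fetch_page(c)[0], STORED)
    return [row for chunk in chunks for row in chunk]
```

Because the stored cursors come from the same (full, non-keys-only) query that the reload runs, this avoids the keys-only/full mismatch that broke the original approach.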