I am guessing you are using the HR Datastore. In this case, the
lowest-level datastore API code (in
google/appengine/datastore/datastore_rpc.py -- this is not part of
NDB) splits your requests up by entity group. Assuming that all your
keys are root keys, each is in its own entity group, and you end up
with as many parallel Get RPCs as you have keys. The engineers
responsible for the HRD implementation assure me this is more
efficient than issuing a single multi-key Get RPC. Nevertheless, you
can always try issuing fewer RPCs by passing
max_entity_groups_per_rpc=N to the get_multi_async() call. I would try
small values first and see if it improves the real time taken by your
requests: N=1, N=2, N=5, N=10. You can try the same thing for the
put_multi_async() call.
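A minimal sketch of what that tuning might look like (the key list, kind
name, and verb strings below are made up for illustration; adjust them to
whatever your model actually uses):

    from google.appengine.ext import ndb

    # Hypothetical: a list of root keys, one per verb.
    verb_keys = [ndb.Key('SearchIndex', v) for v in ('spam', 'eggs', 'ham')]

    # Cap the number of entity groups batched into each Get RPC; try small
    # values (1, 2, 5, 10) and compare the real time reported by Appstats.
    futures = ndb.get_multi_async(verb_keys, max_entity_groups_per_rpc=5)
    entities = [f.get_result() for f in futures]

    # The same option works for writes.
    put_futures = ndb.put_multi_async(
        [e for e in entities if e is not None],
        max_entity_groups_per_rpc=5)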
I wouldn't try to put all keys in a single entity group, since AFAIK
it doesn't scale to the volume I assume you are interested in
(millions of verbs, I presume).
The times reported by Appstats for overlapping async RPCs aren't all
that meaningful; all you can really tell for sure is how many RPCs
happened and how long they took all together in real time. The sum of
their real times is meaningless; the API time is somewhat meaningful,
as it is a measure of how much work the backend had to do to satisfy
your request. (However, since the introduction of new billing, this
number is not directly related to the price charged for that work.
This is a known issue that we hope to address.)
The actual CPU time (not real time) used by your request is most
likely due to deserialization costs. This should be pretty much
independent of how many RPCs are being executed in real time. The only
way to reduce this is to have fewer properties in your entities. (Even
if the request is satisfied from memcache, you still incur exactly the
same deserialization overhead -- each entity still has to be converted
from a string to a structured object.)
It seems you are not getting the data from memcache. Did you
explicitly turn it off in NDB? Or are these verbs simply too "fresh"
or too infrequently used to be found in memcache?
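(For reference, turning NDB's memcache layer off explicitly would look
roughly like this; the key list is hypothetical, as in the sketch above:)

    from google.appengine.ext import ndb

    verb_keys = [ndb.Key('SearchIndex', v) for v in ('spam', 'eggs')]

    # Per call: bypass memcache for just this lookup.
    futures = ndb.get_multi_async(verb_keys, use_memcache=False)

    # Or globally, for the current request's context.
    ndb.get_context().set_memcache_policy(False)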
--Guido
Actually, you can break down the string like
[s, sp, spa, spam, e, eg, egg, eggs]. You could also add phonetic codes
that can catch many spelling errors.
If you don't have too many query terms in your entity, I guess you could
also add [am, pam, gs, ggs].
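A rough sketch of that breakdown in plain Python (the helper names are
made up; phonetic codes would be an additional step):

    def prefixes(word):
        # 'spam' -> ['s', 'sp', 'spa', 'spam']
        return [word[:i] for i in range(1, len(word) + 1)]

    def inner_substrings(word, min_len=2):
        # Non-prefix substrings, e.g. 'spam' -> ['am', 'pa', 'pam'].
        subs = set()
        for start in range(1, len(word)):
            for end in range(start + min_len, len(word) + 1):
                subs.add(word[start:end])
        return sorted(subs)

    terms = prefixes('spam') + prefixes('eggs')
    # ['s', 'sp', 'spa', 'spam', 'e', 'eg', 'egg', 'eggs']
    terms += inner_substrings('spam') + inner_substrings('eggs')
    # + ['am', 'pa', 'pam', 'gg', 'ggs', 'gs']
    # Phonetic codes (e.g. a Soundex of each verb) could be appended as well.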
On Tue, Mar 6, 2012 at 12:51, Fredrik Bonander
<carl.fredr...@gmail.com> wrote:
> Yes, using HRD. I made some simple tests, changing
> max_entity_groups_per_rpc from 1 to 4 and then 10. For both 4 and 10
> I saw a dramatic increase in speed: the get operations went from
> ~3300ms in total down to ~1300ms, with about half as many RPCs,
> for max_entity_groups_per_rpc=4, and down to ~450ms
> for max_entity_groups_per_rpc=10. I'll verify this more carefully, so it's
> not due to the random text I'm testing containing only stop-words etc. But
> unfortunately my write operations quota is at 100%, so I have to wait for
> billing to get enabled. One thing I'm curious about: if I increase
> max_entity_groups_per_rpc from 1 to 10, what are the drawbacks? Or is
> it simply just better?
It's a tuning parameter. If it's better it's better. Enjoy. :-)
>> I wouldn't try to put all keys in a single entity group, since AFAIK
>> it doesn't scale to the volume I assume you are interested in
>> (millions of verbs, I presume).
>
> I sure hope that I'm able to build something that can scale to that! :)
>> The times reported by Appstats for overlapping async RPCs aren't all
>> that meaningful; all you can really tell for sure is how many RPCs
>> happened and how long they took all together in real time. The sum of
>> their real times is meaningless; the api time is somewhat meaningful
>> as it is a measure for how much work the backend had to do to satisfy
>> your request. (However, since the introduction of new billing, this
>> number is not directly related to the price charged for that work.
>> This is a known issue that we hope to address.)
>>
>> The actual CPU time (not real time) used by your request is most
>> likely due to deserialization costs. This should be pretty much
>> independent of how many RPCs are being executed in real time. The only
>> way to reduce this is to have fewer properties in your entities. (Even
>> if the request is satisfied from memcache, you still incur exactly the
>> same deserialization overhead -- each entity still has to be converted
>> from a string to a structured object.)
>
> Do you mean entities in general? Or my SearchIndex (it only has 2
> properties)?
Remember that a repeated property with 10 items in the list costs the
same as 10 single properties.
>> It seems you are not getting the data from memcache. Did you
>> explicitly turn it off in NDB? Or are these verbs simply too "fresh"
>> or too infrequently used to be found in memcache?
>
> In this test the verbs are too "fresh"; I'm trying to do my tests on a clean
> datastore to ensure that my code handles a lot of new "fresh" verbs.
Cool.
Two more thoughts.
1. Unindexed properties cost fewer write ops than indexed properties.
2. Instead of KeyProperty(repeated=True), consider either a
(non-repeated) JsonProperty containing a list of verbs (it's easy to
reconstitute the key from the verb in your app, IIUC) or even a
TextProperty containing the verbs concatenated with spaces -- it's
easy to split that into a list of verbs, from which you can, again,
easily construct the keys. (This only makes sense if the model
containing the list of key properties is the one whose deserialization
is slowing you down, of course.)
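A rough sketch of those two alternatives, under the assumption that the
verb string itself is the key name (the property and kind names below are
illustrative, not taken from the thread):

    from google.appengine.ext import ndb

    class SearchIndexJson(ndb.Model):
        # Alternative 1 (replaces KeyProperty(repeated=True)): a single
        # JSON blob of verb strings, e.g. ['spam', 'eggs']. It is unindexed,
        # so it costs fewer write ops, and it is one property to deserialize.
        verbs = ndb.JsonProperty()

    class SearchIndexText(ndb.Model):
        # Alternative 2: the verbs concatenated with spaces, e.g. 'spam eggs'.
        verbs = ndb.TextProperty()

    def keys_from_text(index_entity):
        # Reconstitute the keys from the stored verbs; assumes each verb is
        # the key name of the entity that holds it (the kind name is made up).
        return [ndb.Key('Verb', v) for v in index_entity.verbs.split()]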