Summing 5000 datastore items


Ryan Lamansky

unread,
May 12, 2008, 2:56:08 PM5/12/08
to Google App Engine
I'm trying to figure out the Google-approved way to do bulk math on
datastore items. I'm having difficulty.

For simplicity's sake, let's say I have 5,000 values I need to add up.
The values may be duplicated, and could in theory all be identical.
Each value carries other information besides the actual number, so
ListProperties can't be used.

Since we're limited to 1000 results in our queries, how can you
accomplish this?

-Ryan

Advay Mengle

unread,
May 12, 2008, 10:16:49 PM5/12/08
to Google App Engine
Would a fairly simple way of doing it be to compute the sum a chunk at
a time, then redirect to another service call that takes the
intermediate sum and the location of your 'pointer' into the result
set as parameters?
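A minimal sketch of this chunk-and-carry idea in plain Python (no App Engine APIs; the list and offset here stand in for query results and the 'pointer' parameter, and the function names are made up):

```python
CHUNK = 1000  # the per-query fetch limit

def sum_chunk(values, offset, running_total):
    """One 'service call': sum the next chunk and return the new state."""
    chunk = values[offset:offset + CHUNK]
    return running_total + sum(chunk), offset + len(chunk)

def sum_all(values):
    total, offset = 0, 0
    while offset < len(values):
        # In the real app, each pass would be a redirect carrying
        # (total, offset) as query parameters to the next request.
        total, offset = sum_chunk(values, offset, total)
    return total
```

Each redirect would carry the intermediate state, so no single request ever fetches more than 1,000 results.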

timh

unread,
May 12, 2008, 10:30:04 PM5/12/08
to Google App Engine
Can you perform the sum as you add data, accumulating the final value
in another record, rather than performing the calculation on request?

T

bjubbs

unread,
May 12, 2008, 11:07:32 PM5/12/08
to Google App Engine
Dig deeper into this message board, but in short: you must perform
these calculations at write time in order to take advantage of this
architecture. Let's assume for starters that your webapp fits the mold
of write once/read lots. If it doesn't, you might want to rethink
using this engine at all. But if it does, you must think slightly
outside the box. Denormalize your data. Write a lot. Store pages of
sums, counts, totals, averages, whatever you need. Space is cheap.
Writes are infrequent. When it comes time to produce the sum of 5,000
items, it should be a single, fast read.

The concepts are discussed at length throughout this msg board.
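For illustration, this write-time aggregation could be sketched in plain Python (the class name and fields are made up; in App Engine this state would live in a datastore entity rather than an in-memory object):

```python
class Stats(object):
    """Running aggregates maintained on every write."""
    def __init__(self):
        self.count, self.total = 0, 0.0
        self.minimum, self.maximum = None, None

    def record(self, value):
        # Called once per write; the read path just returns self.total,
        # so summing 5,000 items costs a single read.
        self.count += 1
        self.total += value
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value
```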

Ryan Lamansky

unread,
May 13, 2008, 12:58:57 PM5/13/08
to Google App Engine
EFlight: What would I use as a "pointer"?

timh/bjubbs: This could work in theory, but without multi-"kind"
transactions I don't know how I'll maintain data integrity under high
load.

It's possible for two concurrent accesses to modify the values.
Maintaining a separate "sum" value without multi-"kind" transactions
creates a race condition as both threads try to update the "sum" with
whatever their perspective of the truth is at the time they do it.

For example, assume two concurrent users add an item to the values
list. Both read the current sum (let's say 100), user A adds 5 and
user B adds 10. For user A, the new sum is 105 and user B thinks it's
110. They then update the sum entity with either 105 or 110. Whoever
writes last wins, and both are wrong.
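That lost-update scenario, replayed as straight-line Python:

```python
# Both writers read the same starting sum, so the last write wins.
stored_sum = 100

a_view = stored_sum          # user A reads 100
b_view = stored_sum          # user B reads 100 concurrently
stored_sum = a_view + 5      # A writes 105
stored_sum = b_view + 10     # B writes 110, clobbering A's update

# The true sum should be 115; the stored value is 110.
```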

I like the idea of storing a precalculated value for rapid response
times, but I don't see how it can work reliably...

Tim Hoffman

unread,
May 14, 2008, 3:44:27 AM5/14/08
to Google App Engine
Hi Ryan

As to reliability, I think it will depend on your specific case.

I didn't realise you would have multiple writers to your set of
thousands of values.

Is each user adding their own series of values (i.e. a set of values
bounded by the user), or can any user update anyone else's set of
values?

Not knowing what you are specifically trying to achieve makes it
pretty hard to come up with a reliable solution, or to say that there
is in fact a solution.

Sorry ;-)

T

Ben the Indefatigable

unread,
May 14, 2008, 2:35:28 PM5/14/08
to Google App Engine
I would hope there is a logical way to group those numbers and break
your problem down into more manageable processes. If you could have
entities with about 1,000 numbers each, then you could have a tally
entity with an entry for each number group. That way you can
independently update the tallies in the tally entity without
overwriting or miscalculating the grand total. You said it is not a
simple array of numbers, since each number might have associated
values, so the mechanism for storing multiple numbers in an entity
might involve serializing an object hierarchy or something, but that
is just an implementation detail. A thousand numbers in a single
entity is much more manageable than 1,000 numbers spread across
separate entities in a non-relational datastore. Ultimately, the best
answer to your problem needs to take advantage of more specific
details about your problem.
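A rough sketch of this grouping idea in plain Python (dict keys stand in for per-group tally entries; the names are illustrative):

```python
GROUP_SIZE = 1000

def group_of(index):
    """Which group a value belongs to, by its position."""
    return index // GROUP_SIZE

def add_value(tallies, index, value):
    # Only the affected group's tally is rewritten, so writers in
    # different groups never touch the same entry.
    g = group_of(index)
    tallies[g] = tallies.get(g, 0) + value

def grand_total(tallies):
    # Summing the handful of subtotals replaces summing 5,000 items.
    return sum(tallies.values())
```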

Patrick Keogh

unread,
May 15, 2008, 2:59:49 AM5/15/08
to Google App Engine
But you can do read/modify/write as a single transaction.
You only have to deal with one entity group in this transaction, so
multiple updaters are serialised.

Patrick Keogh

unread,
May 15, 2008, 3:27:43 AM5/15/08
to Google App Engine

Ryan Lamansky

unread,
May 16, 2008, 1:39:29 PM5/16/08
to Google App Engine
I'll probably start a new discussion about this in a month or so that
features a more complete example.

Tim Hoffman: Yeah, I regret that I wasn't clear about that in my
original example. Yes, there can be multiple simultaneous writers.
It may not be a common event, but I have to support it. All writers
are affecting the common set of values.

Ben the Indefatigable: There are definitely ways to group the numbers,
but I want to avoid restricting users to a 1,000-item limit; I can
definitely imagine scenarios where that would be exceeded in a single
"group". As for storing 1,000 numbers on a single entity, I suppose
that would technically work, but I'm seriously concerned about
read/write performance in that case. Additionally, since each number
has additional information associated with it, packing the whole thing
onto a single entity seems like it would break the ability to query on
it.

Patrick Keogh: I'd appreciate it if you'd refrain from swearing, even
via acronyms. Anyway, I'm well aware of transactions, but they won't
help me sum up the data. They don't do anything about the fetch limit
and, as far as a running total goes, transactions don't work across
different "kinds" (like the value "kind" and the running-total
"kind"), exposing a race condition for multiple writers.

-Ryan

Patrick Keogh

unread,
May 16, 2008, 8:06:01 PM5/16/08
to Google App Engine
Ryan,

I believe that, in combination, Ben, TimH, bjubbs and The Flaming
Manual give you all you need. Key ideas:
- Store the sum on every write. In fact, store any aggregated
statistic you might later want, such as sums of subcategories,
averages, max, min or whatever. If your read/write ratio is high, this
is comparatively low cost.
- Do it in a transaction so that the "race condition" is not a
problem. The transaction is not to help you "sum up the data"; the
suggestion is that you never have to calculate the sum in one go.
- If this "simple-minded" approach does not work, then to get better
advice you need to describe your problem in more detail, as Ben
suggests.

If you are still stuck I could bash out some sample code for you, but
it would be good to get a clearer description of the problem.

Patrick

P.S. I'm sorry that you are so sensitive.

Patrick Keogh

unread,
May 16, 2008, 8:20:18 PM5/16/08
to Google App Engine
Ryan,

The update to the data "kind" does not have to be in the transaction.
The web request that writes a row can do that part outside the
transaction, and then use a transaction for the running-total "kind",
just like the example in The Flaming Manual.

Patrick

Tim Hoffman

unread,
May 16, 2008, 8:46:13 PM5/16/08
to Google App Engine
Hi Ryan

On May 17, 1:39 am, Ryan Lamansky <Spam...@kardax.com> wrote:
> I'll probably start a new discussion about this in a month or so that
> features a more complete example.
>
> Tim Hoffman: Yeah, I regret that I wasn't clear about that in my
> original example. Yes, there can be multiple simultaneous writers.
> It may not be a common event, but I have to support it. All writers
> are affecting the common set of values.

You have been giving me requirements (the bit I know about) and I am
having trouble coming up with a really clear mechanism that would
work. I am coming to GAE from Zope, and even in Zope I would be trying
to accumulate all of the values; however, even in Zope, multiple
concurrent writes cause problems. Having said that, because in Zope
land we are talking to a single backend, conflicts will occur when
trying to write to the same object, or when reading and writing the
same object (ZODB, Zope's object store, is ACID compliant). So in Zope
I would typically get an error and have to deal with it to prevent
overwriting data.

So in your case, I think you will need to accumulate your values in
selected, identifiable entities. Because the GAE store is
read-committed, you will always need to read a record before editing,
then reread it again before committing in case someone else has
blatted it, and then commit. However, there will always be a window of
opportunity for someone else to update the record between your read,
compare and commit. It's a pity there is no test-and-set operation in
GAE's backend that could be used to implement an advisory lock (though
that would generate a whole set of other issues to deal with).

To be honest, in all my projects, if people overwrite each other's
data it isn't the end of the world ;-) or we have locks that users
must remove before they can edit.

I wonder if GAE is really the correct platform for what you want to
achieve. Alternatively, maybe you need to use another service to
implement a reliable locking mechanism. For instance, you could use
Amazon's SQS as well, so that you have a guaranteed ordering of
entities that you could rely on to implement locking.

Anyway, just a few thoughts.

Any way just a few thoughts

See ya

Tim

Patrick Keogh

unread,
May 16, 2008, 10:17:40 PM5/16/08
to Google App Engine
import wsgiref.handlers
import cgi
import os
from datetime import datetime

from google.appengine.api import users
from google.appengine.ext import webapp
from google.appengine.ext import db
from google.appengine.ext.webapp import template

class Statistics(db.Model):
    entryCount = db.IntegerProperty()
    entryMin = db.FloatProperty()
    entryMax = db.FloatProperty()
    entrySum = db.FloatProperty()

class Reading(db.Model):
    xxx = db.StringProperty()
    reading = db.FloatProperty()
    yyy = db.DateProperty()

def increment_counters(key, val):
    s = db.get(key)
    s.entryCount += 1
    if val < s.entryMin: s.entryMin = val
    if val > s.entryMax: s.entryMax = val
    s.entrySum += val
    s.put()

class EnterData(webapp.RequestHandler):
    def post(self):
        # Insert data and display graphs
        r = Reading()
        r.yyy = datetime.strptime(cgi.escape(self.request.get('Date')), "%Y-%m-%d").date()
        r.reading = float(cgi.escape(self.request.get('Observation')))
        r.xxx = "blah"
        r.put()
        stats = db.GqlQuery("SELECT * FROM Statistics")
        s = stats.get()
        db.run_in_transaction(increment_counters, s.key(), r.reading)

    def get(self):
        # Display the most recent data and a form to collect a new reading
        displayData(self, True)

class ReadData(webapp.RequestHandler):
    def get(self):
        displayData(self, False)

def main():
    application = webapp.WSGIApplication(
        [('/Rainfall', ReadData),
         ('/Rainfall/Enter', EnterData)],
        debug=True)
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
    main()

Ryan Lamansky

unread,
May 17, 2008, 1:59:31 PM5/17/08
to Google App Engine
Patrick Keogh: Oh, I see... perform the increment inside the
transaction. I was thinking of calculating the new value outside of
it. Oops.

Your design solves part of the problem, but it doesn't account for the
possibility that the transaction could fail. If the transaction
fails, the Statistics value will not be consistent with the Reading
list. Any idea on how to prevent that issue?

-Ryan

Patrick Keogh

unread,
May 17, 2008, 7:40:34 PM5/17/08
to Google App Engine
Yes. Handle the exception. So wrap the call to db.run_in_transaction
in a try ... except ...

Don't forget though that the "FM" says:
"The transaction function may be called multiple times if a datastore
operation fails due to another user updating entities in the entity
group at the same time. When this happens, the datastore API retries
the transaction a fixed number of times."

In other words, there is built-in retry, so a single collision or even
a few writers will be handled without you having to do anything.
Exactly what to do if it raises TransactionFailedError depends on the
logic of your application. You could roll back the bigger transaction
by deleting your new data entity, give the user the choice, or just
retry the whole thing... but from what you said about your
application, I think this is quite unnecessary.
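The try/except pattern might look like this in outline (plain Python; the exception class here is a stand-in for the datastore's TransactionFailedError, and the helper name is made up):

```python
class TransactionFailedError(Exception):
    """Stand-in for the error raised after the built-in retries run out."""
    pass

def run_with_fallback(txn, on_failure, *args):
    """Run txn; if it ultimately fails, run a compensating action instead."""
    try:
        return txn(*args)
    except TransactionFailedError:
        # e.g. delete the entity just put(), or show an error page
        return on_failure()
```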

I agree that the Google doco is pretty terse and may not be of much
help to newcomers. I have the advantage that, over 35 years of
programming, I have seen these kinds of patterns before, so I knew
what I was looking for. In fact, it reminds me of a little embedded
DBMS in a programming language that I regard as one of Python's
ancestors, which I knew back in the mid-70s. LOL.

Patrick Keogh

unread,
May 17, 2008, 8:34:58 PM5/17/08
to Google App Engine
Sorry, one sentence of that didn't come out quite right. I wrote:
"from what you said about your application I think this is quite
unnecessary".

But what I should have written is more like: "You should handle this
exception one way or another, although from what you have said about
your application it is unlikely that this exception will ever be
raised".

Scott Mahr

unread,
May 17, 2008, 10:10:08 PM5/17/08
to Google App Engine
Patrick thanks for explaining this so well, it is very instructive to
read even when it is not my problem.

-Scott

Ben the Indefatigable

unread,
May 18, 2008, 10:29:49 AM5/18/08
to Google App Engine
Patrick, I don't think that kind of solution will work if you need the
stats to be accurate. The Statistics and Reading entities are not in
the same entity group. The transaction you describe serializes updates
to the Statistics entity (so that one does not wipe out another), but
that's only a small part of the problem. Trying to work off an
exception in the Statistics write in order to delete the corresponding
Reading entity is just plain unreliable; for example, the request
handler could fail or time out. That's the kind of thing transactions
are for, but since you can't do both (add Reading, update Statistics)
within one transaction, it cannot be reliable.

Ryan said:

>It's possible for two concurrent accesses to modify the values.
>Maintaining a separate "sum" value without multi-"kind" transactions
>creates a race condition as both threads try to update the "sum" with
>whatever their perspective of the truth is at the time they do it.

This is basically the problem except that multi-kind transactions
should be "transactions not limited to entity groups."

The only reliable solution is to segregate the numbers into meaningful
groups, with each group in an entity group or a single entity, and
keep statistics for each group. The grand total is then just
determined from the group subtotals, which is a smaller more
manageable problem.
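Deriving a grand total from per-group statistics could be sketched like this (plain Python; each tuple stands in for one group's stored subtotal record, and the function name is made up):

```python
def merge_stats(groups):
    """Combine per-group (count, total, minimum, maximum) subtotals.

    Each writer only ever touches its own group's record, so the
    merge here is a read-only pass over a handful of subtotals.
    """
    count = sum(g[0] for g in groups)
    total = sum(g[1] for g in groups)
    minimum = min(g[2] for g in groups)
    maximum = max(g[3] for g in groups)
    return count, total, minimum, maximum
```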

Patrick Keogh

unread,
May 18, 2008, 6:25:35 PM5/18/08
to Google App Engine
Ben

You may be right, but after careful re-reading of the documentation I
cannot see any problem with my approach. There is only one Statistics
entity, so it doesn't matter which entity group it is in; it will be
in exactly one (the root). So all requests that attempt to update this
entity are referencing the same data, and the transaction mechanism
ensures that this is serialised.

If and when the exception occurs, you get to handle it in the context
of your specific request, so you know which data entity it refers to,
and no matter which entity group it is in you can delete it. Or you
can simply retry the transaction.

If my understanding is deficient, could you tell me where in the doco
I can read more about it? Once again, I'm not disagreeing with you,
but I have double-checked my understanding of what the doco says; I
may be missing something.

Patrick Keogh

unread,
May 19, 2008, 7:03:51 AM5/19/08
to Google App Engine
So, to update my example with appropriate exception handling (and
please note that I just typed this; I have not run it, and it may have
syntax errors):

import wsgiref.handlers
import cgi
import os
from datetime import datetime

from google.appengine.api import users
from google.appengine.ext import webapp
from google.appengine.ext import db
from google.appengine.ext.webapp import template

class Statistics(db.Model):
    entryCount = db.IntegerProperty()
    entryMin = db.FloatProperty()
    entryMax = db.FloatProperty()
    entrySum = db.FloatProperty()

class Reading(db.Model):
    xxx = db.StringProperty()
    reading = db.FloatProperty()
    yyy = db.DateProperty()

def increment_counters(key, val):
    s = db.get(key)
    s.entryCount += 1
    if val < s.entryMin: s.entryMin = val
    if val > s.entryMax: s.entryMax = val
    s.entrySum += val
    s.put()

class EnterData(webapp.RequestHandler):
    def post(self):
        # Insert data and display graphs
        r = Reading()
        r.yyy = datetime.strptime(cgi.escape(self.request.get('Date')), "%Y-%m-%d").date()
        r.reading = float(cgi.escape(self.request.get('Observation')))
        r.xxx = "blah"
        r.put()
        rk = r.key()
        stats = db.GqlQuery("SELECT * FROM Statistics")
        s = stats.get()
        try:
            db.run_in_transaction(increment_counters, s.key(), r.reading)
        except db.TransactionFailedError:
            # Delete the entity that we just put() and tell the user to try again later
            db.delete(rk)
            displayError(self, "Update failed")

    def get(self):
        # Display the most recent data and a form to collect a new reading
        displayData(self, True)

class ReadData(webapp.RequestHandler):
    def get(self):
        displayData(self, False)

def main():
    application = webapp.WSGIApplication(
        [('/Rainfall', ReadData),
         ('/Rainfall/Enter', EnterData)],
        debug=True)
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
    main()


gnz

unread,
May 19, 2008, 9:10:13 AM5/19/08
to Google App Engine
Ok, maybe this is completely wrong, but what about having the
following model classes:

class Statistics(db.Model):
    entry_count = db.IntegerProperty()
    entry_min = db.FloatProperty()
    entry_max = db.FloatProperty()
    entry_sum = db.FloatProperty()
    last_timestamp = db.IntegerProperty()

class Reading(db.Model):
    xxx = db.StringProperty()
    reading = db.FloatProperty()
    yyy = db.DateProperty()
    ts = db.DateTimeProperty(auto_now_add=True)

And then, after each new reading is inserted, you could do the
following:
* Retrieve the current Statistics instance.
* Select all readings between s.last_timestamp and {current date}
minus {some time limit} (1/5/10 minutes, or something that keeps the
result count safely under 1,000 at your write rate). The reason is
twofold: first, to work around the 1,000 limit; and second, to keep it
consistent even without multi-kind transactions. Notice that even if
the time limit is too high, and the result count is therefore capped
by the datastore, the results will still be consistent.
* Update the Statistics with the calculated values and the newest
timestamp already aggregated.

Of course, the read/update on Statistics will have to be run in a
transaction. Also, the first insertion of a Reading is a special case,
but it should be fairly easy to manage.

It is important to find a good balance for the {time limit}: if the
value is low, the real-time computation will be faster.

And finally, to display fully updated values on a web page, you will
need to compute/aggregate in real time the remaining readings (all
readings where ts is greater than s.last_timestamp) together with the
saved statistics. But reading is a fast operation, and assuming the
remainder cannot grow too much, it shouldn't be a problem...
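A plain-Python sketch of this watermark scheme (integers stand in for timestamps, tuples for Reading entities; the function names are made up):

```python
def roll_up(readings, saved_sum, watermark, now, margin):
    """Fold settled readings into the saved sum and advance the watermark.

    Readings older than (now - margin) are considered settled; newer
    ones are left for the read path to aggregate on the fly.
    """
    cutoff = now - margin
    settled = [(ts, v) for ts, v in readings if watermark < ts <= cutoff]
    saved_sum += sum(v for ts, v in settled)
    if settled:
        watermark = max(ts for ts, v in settled)
    return saved_sum, watermark

def display_sum(readings, saved_sum, watermark):
    # Saved aggregate plus everything not yet rolled up.
    return saved_sum + sum(v for ts, v in readings if ts > watermark)
```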

Bye!
Gonzalo

wen...@gmail.com

unread,
May 19, 2008, 10:01:49 AM5/19/08
to Google App Engine
I don't see the problem with Patrick's example, and I think it would
work fine for the requirements you've given thus far, Ben. The two
entities would stay in sync, because if the statistics update failed,
then the reading can be removed (not rolled back through a
transaction, but removed). That seems to me to be the best GAE-style
solution to the problem. The transaction itself isn't the entire
solution, but it provides you with enough hooks to solve the entire
problem. As long as you handle it in logic, your two entities should
remain in sync.

1. Update the Reading entity.
2. Update the Statistics in a transaction.
3. If the transaction fails, remove the updated Reading element.

I guess I could see a problem if your Reading entity had some form of
data that got removed when updating. You shouldn't run into this
during any insertion-style operation, but I could see it if you were
updating something like the standard bank-account model. If you had
statistics summarizing the total in all bank accounts, then when you
updated someone's account total you'd want to update this statistic.
You then run the risk of the statistics update failing and needing to
roll the bank account information back, so you'd need to keep the
original value.

This could run into an issue if someone else has access to update the
bank account information and has since updated the total in that
account from the value you stored. However, this is a very long chain
of not very plausible scenarios. You'd need a single entity with
multiple users able to update it, then the transaction on the
statistics to fail, and someone to have updated the original entity in
that time. I'm not saying the case should be ignored entirely, but I'm
not sure that kind of issue even fits your requirements. I haven't
tested it, but my assumption is that running something in a
transaction will not fail all that often.

Ryan Lamansky

unread,
May 19, 2008, 1:13:58 PM5/19/08
to Google App Engine
Ben the Indefatigable: It seems that you understand my concern... but
I don't understand how your use of entity groups solves the problem...

Patrick Keogh: The problem with your solution is that not every
failure will result in an error/exception that you can handle.
Between the put-record and update-statistics phase, if Google kills
your app (due to timer or quota restrictions or whatever), or if the
server dies, your record will be written but your statistics will be
out of date.

gnz: An interesting design, but it has flaws. The first flaw is a
race condition between the read-records phase and the update-
statistics phase. If a bunch of users add records and update
statistics, many of those records will be counted multiple times
because the statistic's "ts" field hasn't been updated yet.

The second flaw is that DateTime only has so much precision. Under
high loads, (or even occasionally with light loads) you could have
multiple values being added that report the exact same time. The
statistics calculation will count records multiple times in this
situation.

The third flaw is that it depends on all of the AppEngine servers to
have their clocks perfectly synchronized. That's impossible, so there
will be a bit of fuzziness to what each server would report as the
current time. Naturally, this will wreak havoc with your time-
sensitive code.

wen...@gmail.com: The same problem as with Patrick's solution: not
every failure will result in an error/exception that you can handle,
so between the put-record and update-statistics phase the record can
be written while the statistics are left out of date.

-Ryan

Patrick Keogh

unread,
May 19, 2008, 6:34:56 PM5/19/08
to Google App Engine
OK, add a "stats recorded" boolean to your entity. Initially set it to
False; after the transaction succeeds, update the flag to True. You
can then run a report to find records that do not have the flag set.
If a record shows up in two such queries, then the failure mode you
describe, unlikely as it is, has occurred. Probably the best strategy
is then to take your application offline for a few seconds, read every
record, rewrite the statistics record, and set the flags of the
offending records to True.
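The repair pass could be sketched like this (plain Python; dicts stand in for Reading entities, and the function name is made up):

```python
def repair(readings, stats_sum):
    """Fold in any reading whose statsFlag never got set, then flag it."""
    for r in readings:
        if not r['statsFlag']:
            stats_sum += r['reading']
            r['statsFlag'] = True
    return stats_sum
```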

Don't forget that in the App Engine environment, the statistics are
only approximately correct in any event. The sum of all the records
was correct at exactly one moment in time, which has now passed. So I
guess it is back to needing an understanding of the business logic, as
we have discussed throughout this thread. Since the transaction model
in App Engine is so limited, it is probably not suitable for business
apps that require an authoritative value for the sum.

If you need
    start transaction
    get sum
    do some arbitrary operations for which the sum continues to be true
    end transaction
then my feeling is that you should probably look elsewhere.

Got to go to work now, maybe tonight I'll update the sample code...

Patrick Keogh

unread,
May 20, 2008, 5:21:04 AM5/20/08
to Google App Engine
I hope this doesn't get much bigger... if so, I'll start posting a
pointer to it rather than including the code in the group. Here is a
version with the "outer" transaction check flag, and a mechanism to
take the app offline for normal use or make it read-only. Obviously
you'd have to write the code to set and clear these flags, to check
statsFlag, and to "fix" the Statistics entity and set statsFlag as you
go. If you do this right (update the statistics before you update the
entity record), it will have transactional integrity.

import wsgiref.handlers
import cgi
import os
from datetime import datetime

from google.appengine.api import users
from google.appengine.ext import webapp
from google.appengine.ext import db
from google.appengine.ext.webapp import template

class Statistics(db.Model):
    entryCount = db.IntegerProperty()
    entryMin = db.FloatProperty()
    entryMax = db.FloatProperty()
    entrySum = db.FloatProperty()

class ApplicationFlags(db.Model):
    availableFlag = db.BooleanProperty()
    readonlyFlag = db.BooleanProperty()

class Reading(db.Model):
    xxx = db.StringProperty()
    reading = db.FloatProperty()
    yyy = db.DateProperty()
    statsFlag = db.BooleanProperty()

def increment_counters(key, val):
    s = db.get(key)
    s.entryCount += 1
    if val < s.entryMin: s.entryMin = val
    if val > s.entryMax: s.entryMax = val
    s.entrySum += val
    s.put()

class EnterData(webapp.RequestHandler):
    def post(self):
        # Check availableFlag and readonlyFlag so the application can be
        # made read-only; tell the user writes are unavailable and exit
        flags = db.GqlQuery("SELECT * FROM ApplicationFlags")
        f = flags.get()
        if f.availableFlag and not f.readonlyFlag:
            r = Reading()
            r.yyy = datetime.strptime(cgi.escape(self.request.get('Date')), "%Y-%m-%d").date()
            r.reading = float(cgi.escape(self.request.get('Observation')))
            r.xxx = "blah"
            r.statsFlag = False
            r.put()
            rk = r.key()
            stats = db.GqlQuery("SELECT * FROM Statistics")
            s = stats.get()
            try:
                db.run_in_transaction(increment_counters, s.key(), r.reading)
            except db.TransactionFailedError:
                # Delete the entity that we just put() and tell the user
                # to try again later
                db.delete(rk)
                displayError(self, "Update failed")
                return
            r.statsFlag = True
            r.put()
        else:
            displayError(self, "Application not available for updates, try again later")

    def get(self):
        # Display the most recent data and a form to collect a new reading
        flags = db.GqlQuery("SELECT * FROM ApplicationFlags")
        f = flags.get()
        if f.availableFlag and not f.readonlyFlag:
            displayData(self, True)
        else:
            displayError(self, "Application not available for updates, try again later")

class ReadData(webapp.RequestHandler):
    def get(self):
        # Check availableFlag so the application can be taken completely
        # offline; tell the user the application is not available
        flags = db.GqlQuery("SELECT * FROM ApplicationFlags")
        f = flags.get()
        if f.availableFlag:
            displayData(self, False)
        else:
            displayError(self, "Application not available, try again later")

def main():
    application = webapp.WSGIApplication(
        [('/Rainfall', ReadData),
         ('/Rainfall/Enter', EnterData)],
        debug=True)
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
    main()

Ben the Indefatigable

unread,
May 20, 2008, 8:59:48 AM5/20/08
to Google App Engine
Patrick, the failure can still happen between re-updating the
Statistics and setting the flag in the Reading entity. The bottom line
is that it is still a workaround on a workaround, and you're getting
diminishing returns for your efforts. If you need accuracy, you can't
get around not having a proper transaction. The point is not that you
can get the exact number at a certain moment, but that it is *correct*
in the system at that moment, and your solution is not reliable.

Ryan, it is the same point I made before, which you rejected based on
having too many numbers (~1,000) in a single entity. If you use an
entity group of 1,000 Reading entities, you could keep a Statistics
entity in the entity group with them (they don't need to be the same
kind), and then use a transaction to update the Statistics along with
the creation of the Reading. You would then keep a combined
MasterStatistics object, yet still with copies of the segregated
Statistics entities to avoid inaccuracies from cross-segment
overwrites; you would sum all the segments into the combined
MasterStatistics every time you need it.

Rather than looking for App Engine to solve a problem in a classic
way, look for how to use its datastore in a way that naturally blends
with its design. My sense is that querying and other common relational
operations are not fast enough; it is best when you can just go
directly to a big, complex entity and do operations on it yourself.

Patrick Keogh

unread,
May 20, 2008, 6:14:23 PM5/20/08
to Google App Engine
Yes, but the re-updating failure is safe. If it doesn't complete
(there is still one record with the flag unset), you can run it again
and again and again. Don't forget this kind of failure is VERY, VERY
UNLIKELY in the first place.

This code is efficient and deterministic. I still say I want to see
the business logic, because the general solution may not be required
for any particular business problem. I don't really believe in the
concept of an "accurate" sum in this case anyway, with my reservations
echoed in the doco at http://code.google.com/appengine/articles/transaction_isolation.html

Ben the Indefatigable

unread,
May 21, 2008, 6:59:05 AM5/21/08
to Google App Engine
Even with request timeouts under the current quotas, I know failure is
extremely unlikely, but that's not the point. You either accept a
potential statistical error that accumulates with ill-timed failures
over time (which you could periodically run some sort of correction
process for), or you insist on a safe, accurate design, which is very
different from this.

How can your solution have transactional integrity? In your latest
case, if the process dies between a successful increment_counters and
r.put(), you're left thinking you haven't incremented the counters yet
even though you have, resulting in an unwanted second increment from a
cleanup call on a subsequent process.
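That double-count hazard, replayed in straight-line Python (dicts and floats stand in for the entities):

```python
# The counter increment succeeds, the crash prevents the flag write,
# and a later repair pass counts the same reading again.
stats_sum = 0.0
reading = {'value': 7.0, 'statsFlag': False}

stats_sum += reading['value']     # transaction succeeds: sum = 7.0
# ... process dies here, before reading['statsFlag'] = True ...

# Repair pass on a later request sees the flag unset and re-applies:
if not reading['statsFlag']:
    stats_sum += reading['value']
    reading['statsFlag'] = True

# stats_sum is now 14.0, although only one 7.0 reading exists.
```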

Patrick Keogh

unread,
May 21, 2008, 8:04:19 AM5/21/08
to Google App Engine
I'll leave you to figure out the answer to that question.

Ryan Lamansky

unread,
May 22, 2008, 5:29:59 PM5/22/08
to Google App Engine
It looks like entity groups may be the answer. I simply didn't
understand how they worked initially.

My SDK testing shows that if I use a statistics entry as the "parent"
entity of all the values, it puts them all in the same entity group,
and thus you can combine them in transactions. My key misunderstanding
was that I thought entity groups could only consist of the same
"kind", which is not the case. Creative use of db.get with key lists
enables quite a lot of versatility inside a transaction. Hopefully the
SDK is consistent with the "true" App Engine's operation in this case.

So... the one remaining question is performance. Google says entity
groups should not be "large", without explaining what "large" means.
Is it large at 5,000 items, or 50? Can a 5,000-item entity group be
updated within Google's time limit?

I know the SDK is not representative of real-world performance, and I
foolishly hesitated three days before signing up, so Google hasn't
given me an account yet and I cannot test this myself :(