Bulk data deletion woe


Justin

Nov 14, 2010, 4:25:49 AM
to Google App Engine
I've been trying to bulk delete data from my application as described
here

http://code.google.com/appengine/docs/python/datastore/creatinggettinganddeletingdata.html#Deleting_Entities_in_Bulk
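(For reference - if I'm remembering that page right - the mapper it has you define boils down to something like the snippet below; op.db.Delete just yields a delete for the mapreduce framework to apply in batches.)

    from mapreduce import operation as op

    def process(entity):
        # Called for every entity of the kind; yields a datastore delete.
        yield op.db.Delete(entity)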

This seems to have kicked off a series of mapreduce workers, whose
execution is killing my CPU - approximately 5 mins later I have
reached 100% CPU time and am locked out for the rest of the day.

I figure I'll just delete by hand: create some appropriate :delete
controllers and wait until the next day.

Unfortunately the mapreduce process still seems to be running - 10
past midnight and my CPU has reached 100% again.

Is there some way to kill these processes and get back control of my
app?

Erik Wilson

Nov 14, 2010, 12:29:18 PM
to Google App Engine

If you check the datastore viewer you might be able to find and
delete your jobs from one of the tables. You may also need to go into
your task queues and purge the default queue.

On this topic, why does deleting data have such a large difference
between actual time spent and billed time?

For instance, I had two mapreduce shards running to delete data, which
took a combined total of 15 minutes, but I was actually charged for
11(!) hours. I know there isn't a 1:1 correlation, but a >40x
difference is a little silly!


On Nov 14, 4:25 am, Justin <justin.worr...@gmail.com> wrote:
> I've been trying to bulk delete data from my application as described
> here
>
> http://code.google.com/appengine/docs/python/datastore/creatinggettin...

Justin

Nov 14, 2010, 1:23:08 PM
to Google App Engine
I can't get into the datastore viewer - whenever the CPU quota is
exceeded, all the datastore admin interfaces return HTTP 503.
Something else to fix?

Stephen Johnson

Nov 14, 2010, 1:32:50 PM
to google-a...@googlegroups.com
Why do you say that's silly? If your mapreduce task does bulk deletes and, let's say, deletes 100 at a time, then those 100 deletes are done in parallel. That's 100x: for each second of real delete time you're accruing 100 seconds of CPU time. You should be pleased that instead of your task taking 11 hours to delete all your data it took only 15 minutes. Isn't that scalability? Isn't that what you're looking for? How many entities did you delete? How many indexes did you have (composite and single property)?
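A rough sanity check with the numbers you posted:

    11 billed hours = 660 minutes of CPU time
    660 minutes / 15 minutes of wall-clock time = 44

so on average only about 44 deletes need to be in flight at once to account for the bill, which is well within a batch of 100.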



Jeff Schwartz

Nov 14, 2010, 1:39:52 PM
to google-a...@googlegroups.com
I think there might have been a misunderstanding regarding parallel seconds of API time and how that actually plays out as far as billing is concerned. With that confusion out of the way, I'd answer your question regarding the merits of App Engine's scalability with a big YES. Like my mom used to say, "nothing in life is really free, so why should this one be?"

:)

Jeff
--
Jeff Schwartz

Erik Wilson

Nov 14, 2010, 4:57:33 PM
to Google App Engine


On Nov 14, 1:32 pm, Stephen Johnson <onepagewo...@gmail.com> wrote:
> Why do you say that's silly? If your map reduce task does bulk deletes and
> let's say they do 100 at a time, then those 100 deletes are done in
> parallel. So that's 100x. So for each second of delete real time you're
> getting 100 seconds of CPU time.  You should be pleased that instead of your
> task taking 11 hours to delete all your data it took only 15 minutes. Isn't
> that scalability? Isn't that what you're looking for? How many entities did
> you delete? How many indexes did you have (composite and single property)?

This was using only 1 shard per kind being deleted, so effectively
there should be no parallelism occurring - unless there is something I
am missing?
I deleted about 300k entities, each with a single indexed collection.

Stephen Johnson

Nov 14, 2010, 9:37:37 PM
to google-a...@googlegroups.com
Thank you for sharing your numbers with us. I think it's a good way for all of us to get an idea of how much things cost on the cloud, so here are my thoughts.

Even though you had only one shard executing, the shard should be doing batch deletes and not one delete at a time. From the documentation, batch deletes can handle up to 500 entities in one call, and those would execute in parallel (perhaps not all 500 at once, but with parallelism nonetheless). I would assume the shard would probably do about 100 or so at a time (maybe more, maybe less).

Anyway, a good way to show that some parallelism must be occurring is a proof by contradiction. So, let's assume that the shard is in fact doing one delete at a time. Looking at the System Status page, the latency of a single delete on an entity (probably a very simple entity with no composite indexes, which would add additional overhead) is approximately 50ms to 100ms. If we assume 50ms of latency per delete we end up with the following (assuming no overhead for mapreduce/shard maintenance, spawning additional tasks, etc., which would add even more time):

    300000 entities * 0.05 seconds per entity = 15000 seconds
    15000 seconds / 60 seconds per minute = 250 minutes, or 4 hours 10 minutes

Additionally, if a delete takes approximately 100 milliseconds, then 300000 entities would take 8 hours 20 minutes to complete.
Even an unrealistically fast 25ms per delete is still over two hours.
 
Now remember, this is latency (real time), not CPU time. Something with a latency of 50ms could still eat up 100ms of API CPU time - for example, 50ms to delete the entity and 50ms to update the indexes (done in parallel). So if latency time is 4 hours 10 minutes and we simply double it to approximate API CPU time, we get over 8 hours of CPU time. If the average delete time for your job was 75ms, then latency time is approximately 6 hours and CPU time 12 hours. Your total was 11 hours of billed time, so if my logic is sound it seems reasonable that the amount you were billed is correct.

Furthermore, if we look at this from another angle, we find that if your delete job took 15 minutes to complete then:

300000 entities / 15 minutes = 20000 entities per minute
20000 entities per minute / 60 seconds per minute = 333.33 entities per second

So, if 333.33 entities are being deleted per second serially, then the average latency would be 3ms per delete, which seems rather unlikely.

My thoughts. Hope it helps (and I hope my math is right),
Steve



Erik Wilson

Nov 14, 2010, 11:47:37 PM
to Google App Engine
Thanks for the well-thought-out response, numbers, and reality check,
Stephen! That makes a lot of sense when you consider parallel deletes
and datastore CPU time.

Eli Jones

Nov 15, 2010, 1:18:21 AM
to google-a...@googlegroups.com
This is just an anecdotal aside (in other words, I have not bothered to do any testing or performance comparison), but I have my own utility code that I use for batch deletes.

Recently, I decided to wipe out all of the entities for one of my models, but I was too lazy to look up the exact command I needed to use in the remote console.

So, I just used the new Datastore Admin page to delete them.  This page uses map reduce jobs to perform deletes.

From what I could tell, the map reduce delete job took up several times more CPU time (and wall clock time) than my custom delete job usually took.

My usual utility class uses this method for deletes:

1. Create a query for all entities in the model with keys_only = True.
2. Fetch 100 keys.
3. Issue a deferred task to delete those 100 keys.
4. Use a cursor to fetch 100 more, and keep issuing deferred deletes until the query returns no more entities.
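In rough code, the pattern looks something like this (just a sketch, not my actual class; MyModel stands in for whatever model you're wiping):

    from google.appengine.ext import db
    from google.appengine.ext import deferred

    def delete_keys(keys):
        # Runs as a deferred task: delete one batch of keys.
        db.delete(keys)

    def schedule_deletes(batch_size=100):
        # Keys-only query, so full entities are never fetched.
        q = MyModel.all(keys_only=True)
        keys = q.fetch(batch_size)
        while keys:
            deferred.defer(delete_keys, keys)
            q.with_cursor(q.cursor())
            keys = q.fetch(batch_size)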

This is usually pretty fast, since the only bottleneck is the time it takes to fetch 100 keys and add the deferred task. The surprising thing was that the default mapreduce delete from the Datastore Admin page took so much CPU.

So, if you think you'll be doing more bulk deletes in the future, it might be useful to compare the CPU usage of a map reduce delete (using keys only and not full entities) to a method that deletes batches of 100 key names using deferred with a query cursor.

Though, deleting 300,000 entities will take up a lot of CPU hours no matter what method you use.

Like I said, this is anecdotal and there may be no real difference in performance, but the Datastore Admin delete took up way more CPU time than it seemed it should have, and I didn't bother to use or test it again.


Stephen Johnson

Nov 15, 2010, 1:27:50 PM
to google-a...@googlegroups.com
Yes, I see what you're saying. MapReduce would bring over the whole entity even though it isn't needed, consuming more CPU by fetching the entity and not just the key. It seems like it would be nice to have an option for MapReduce to hand off only keys and leave out the entity.

Robert Kluin

Nov 15, 2010, 1:47:06 PM
to google-a...@googlegroups.com
In the Python MR libs, there is a DatastoreKeyInputReader input
reader. It looks like that is what's used to iterate over the
entities.
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/datastore_admin/delete_handler.py#148

Robert

Stephen Johnson

Nov 15, 2010, 2:01:03 PM
to google-a...@googlegroups.com
In the Java SDK I think one could subclass the DatastoreRecordReader to do a keys only query and return null for the entity value and use this class in lieu of the normal DatastoreRecordReader when needed. Probably similar in Python.

Chris Prinos

Nov 17, 2010, 1:14:26 PM
to Google App Engine
Here's what I have done that seems to work:

1) Select the entities you want to delete from the Datastore Admin
page as you described.
2) Manually monitor the CPU usage on the dashboard or quota page.
3) Once the quota reaches a high value (say 75-80%, for example), go to
the "Task Queues" page of the admin console.
4) You should see a "default" queue, which contains the worker tasks
that are deleting your data.
5) Use the "Pause Queue" button to pause the default queue.
6) Wait until your quota is reset (done once per day).
7) With a new quota, resume the default task queue and repeat
from step 2. If you are deleting less than 1GB of data, you can
probably do this in one or two passes.

Note: How high you let the quota go in step #3 above depends on how
much CPU your app normally needs... make sure to leave enough for your
app to function.


On Nov 14, 4:25 am, Justin <justin.worr...@gmail.com> wrote:
> I've been trying to bulk delete data from my application as described
> here
>
> http://code.google.com/appengine/docs/python/datastore/creatinggettin...

Erik Wilson

Nov 17, 2010, 5:40:48 PM
to Google App Engine
Hi Everyone,

So I have been going through the process of deleting more things and
wanted to share some results. Previously I had deleted about 300k
entities, each with a single indexed list, at an average cost of
132ms per delete, using the mapreduce method.

Now I realise I need to change my schema again, and needed to delete
288373 entries from a kind, with the following properties:
public class Pw {
    @Persistent private String[] a;
    @Persistent private String[] b;
}

This time I used the Java remote_api library to batch delete with the
following Python code:

from google.appengine.ext import db

class deleter(object):
    def run(self, kind, batch_size=200):
        # Keys-only GQL query, so only keys travel over remote_api.
        q = db.GqlQuery("select __key__ from %s" % (kind))
        entities = q.fetch(batch_size)
        while entities:
            db.delete(entities)
            # Resume from where the last fetch left off.
            q.with_cursor(q.cursor())
            entities = q.fetch(batch_size)
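From the remote_api shell this gets kicked off with something along the lines of deleter().run('Pw'), passing the kind name in as a string.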

This job completed in 8.8 billed hours, or a 110ms average per
delete. However, I had to restart the deletion several times, as I
would receive an "Unknown Java Error" from remote_api; this
correlated with a spike in the latency of the instance serving the
remote_api, with nothing in the dashboard logs.

Later I decided to delete another 288373 entries from a kind, with the
following properties:
public class Pi {
    @Persistent private String n;
    @Persistent
    @Extension(vendorName = "datanucleus", key = "gae.unindexed", value = "true")
    private Long[] a;
    @Persistent
    @Extension(vendorName = "datanucleus", key = "gae.unindexed", value = "true")
    private Long[] b;
}

This time I uploaded a Python project with the remote_api enabled and
fired up a shell, then deleted those entries in only 3.5 billed
hours, for an average of 44ms per delete!
The delete process using the Python remote_api ran without
incident; no "Unknown Errors" were received.

Also, interestingly, according to the datastore status during this
time period it should have been costing between 100ms and 500ms per
delete!

Not sure if my 110ms average was due to more indexed properties or
using the Java remote_api?

Hopefully I won't have to delete much else in the future but at least
44ms is much more tolerable.
It would be nice to be able to mark a kind as "lazy delete" and allow
Google to delete the data at their convenience without being billed
(nightly maintenance?).

Cheers!

Beruk

Dec 6, 2010, 8:31:26 AM
to Google App Engine
I am having similar problems with the bulkupdate library, which was
sort of a precursor to MapReduce; bulkupdate iterates over a query
instead of fetching in batches, and I've found that to be buggy and
unreliable:

http://code.google.com/p/googleappengine/issues/detail?id=4046

Could it be that MapReduce uses the iterator interface to the query
(for i in q) instead of fetching batches of entities? That would
explain why your custom delete job, which uses fetch, takes less time
to complete than the MR job.
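For what it's worth, the two access patterns in question look roughly like this (sketch only - MyModel is a placeholder, and it assumes the usual from google.appengine.ext import db):

    # Iterating the query object directly (what bulkupdate appears to do):
    for entity in MyModel.all():
        db.delete(entity)

    # Fetching explicit key batches with a cursor (the custom jobs above):
    q = MyModel.all(keys_only=True)
    keys = q.fetch(100)
    while keys:
        db.delete(keys)
        q.with_cursor(q.cursor())
        keys = q.fetch(100)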

Pascal

