~7 GB of ghost data???


homunq

Mar 20, 2010, 11:39:16 PM
to Google App Engine
Something is wrong. My app is showing 7.42 GB of total stored
data, but only 615 MB of datastore. There is only one version
uploaded, which is almost 150 MB, and nothing in the blobstore. This
discrepancy has been getting worse - several hours ago (longer ago
than the datastore statistics were last updated, if you're
wondering), there was the same 615 MB in the datastore, and only
3.09 GB of "total stored data". (At that time, my theory was that it
was old uploads of tweaks to the same "version" - but the numbers have
gone far, far beyond that explanation now.) It's not some exploding
index; the only non-default index I have is on an entity type with
just 33 entities.

Here's the line from my dashboard:
Total Stored Data    $0.005/GByte-day    82%    7.42 of 9.00 GBytes    $0.04 / $0.04

And here is the word from my datastore statistics:
Last updated: 1:32:13 ago
Total number of entities: 232,867
Size of all entities: 615 MBytes
(metadata 11%, if that matters)

Please, can someone help me figure out this issue? I'd be happy to
share any info or code which would help track this down. My app id is
vulahealth.

杨浩

Mar 21, 2010, 9:42:07 PM
to google-a...@googlegroups.com
I'm having the same problem!
My entities add up to 167 MB, but the total is 1 GB - it's over quota!
The other 833 MB is metadata!
It's very confusing!

2010/3/21 homunq <jameso...@gmail.com>

Robert Kluin

Mar 22, 2010, 12:26:25 AM
to google-a...@googlegroups.com
I do not think they charge separately for backups and replicas. I am
pretty sure they have stated before that the cost of those services is
already included in the storage charge.

I cannot find the post that referenced this, though.

On Mon, Mar 22, 2010 at 12:21 PM, Tom Wu <servic...@gmail.com> wrote:
> GAE is a cluster which includes masters & slaves, backup systems, etc.,
> so the quota usage is much bigger than your local file size.
>
>
>
> 2010/3/22 杨浩 <skzr...@gmail.com>
>>
>> I'm having the same problem!


Brett Shelley

Mar 22, 2010, 6:09:48 AM
to google-a...@googlegroups.com
Are you storing anything in the Blobstore? If so, well, deleting blobs from the AppSpot console does not work. Perhaps the problem is systemic. But if it helps revenue, then why fix it?

-Brett

Nick Johnson (Google)

Mar 22, 2010, 6:42:00 AM
to google-a...@googlegroups.com
Hi,

The discrepancy between datastore stats volume and stored data is generally due to indexing overhead, which is not included in the datastore stats. This can be very high for entities with many properties, or with long entity and property names or entity keys. Do you have reason to suppose that's not the case in your situation?
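
To give a rough, purely illustrative sense of scale (made-up numbers, not your app's): each indexed property gets entries in the built-in per-property indexes, and every index entry repeats the kind name, the property name, the property value and the full entity key. With kind and key names running to 40-50 characters and a dozen or two indexed properties with long names, a few hundred bytes of actual entity data can easily drag along several kilobytes of index rows - an overhead factor of 10 or more.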

-Nick Johnson





--
Nick Johnson, Developer Programs Engineer, App Engine
Google Ireland Ltd. :: Registered in Dublin, Ireland, Registration Number: 368047

homunq

Mar 22, 2010, 12:07:13 PM
to Google App Engine
OK, I guess I'm guilty on all counts.

Clearly, I can fix that moving forward, though it will cost me a lot
of CPU to fix the data I've already entered. But as a short-term
stopgap, is there any way to delete entire default indexes for a given
property? (I mean, anything besides setting indexed=False and then
touching each entity one-by-one). You can vacuum custom indexes - can
you do it with indexes created by default?
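
(To be concrete about the "fix that moving forward" part, this is roughly
what I mean - a made-up sketch with the Python db API, not my actual model:)

from google.appengine.ext import db

class Doc(db.Model):  # made-up model, short kind name on purpose
    # name= stores a short property name in the datastore while keeping the
    # Python attribute readable; indexed=False keeps the property out of the
    # built-in per-property indexes for fields I never filter or sort on.
    body = db.TextProperty()                            # TextProperty is never indexed
    title = db.StringProperty(name='t', indexed=False)
    source_url = db.StringProperty(name='u', indexed=False)
    category = db.StringProperty(name='c')              # still indexed: I query on it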

Thanks,
Jameson


Patrick Twohig

Mar 22, 2010, 4:07:39 PM
to google-a...@googlegroups.com
Hey Nick,

Just out of curiosity, how many properties would it take to get that amount of wasted space in overhead? Are we talking about entities with on the order of tens, hundreds, or thousands of properties?






--
Patrick H. Twohig.

Namazu Studios
P.O. Box 34161
San Diego, CA 92163-4161

Nick Johnson (Google)

Mar 22, 2010, 4:18:12 PM
to google-a...@googlegroups.com
Hi Patrick,

An overhead factor of 12 (as observed below) is high, but not outrageous. With long model names and property names, this could happen with relatively few indexed properties - on the order of tens, at most.

-Nick Johnson

homunq

Mar 22, 2010, 4:45:47 PM
to Google App Engine
OK, after hashing it out on IRC, I see that I have to erase my data
and start again. Since it took me 3 days of CPU quota to add the data,
I want to know if I can erase it quickly.

1. Is the overhead for erasing data (and thus whittling down indexes)
over half the overhead of adding it? Under 10%? Or what? (I don't
need exact numbers, just approximations.)

2. If it's more like half - is there some way to just nuke all my data
and start over?

Thanks,
Jameson



Patrick Twohig

Mar 22, 2010, 5:19:27 PM
to google-a...@googlegroups.com
I'd use a cursor on the task queue.  Do bulk deletes in blocks of 500 (I think that's the most keys you can pass to delete on a single call) and it shouldn't be that hard to wipe it out.
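
Something along these lines - an untested sketch, where MyModel and the
/tasks/purge URL are placeholders for whatever you actually have:

from google.appengine.api.labs import taskqueue  # plain google.appengine.api.taskqueue on newer SDKs
from google.appengine.ext import db, webapp

class MyModel(db.Model):
    pass  # stand-in for the real model being purged

class PurgeWorker(webapp.RequestHandler):
    # Deletes MyModel entities in batches of 500 keys, then re-queues itself
    # with a cursor so each task picks up where the previous one stopped.
    def post(self):
        q = MyModel.all(keys_only=True)
        cursor = self.request.get('cursor')
        if cursor:
            q.with_cursor(cursor)
        keys = q.fetch(500)
        if keys:
            db.delete(keys)
            taskqueue.add(url='/tasks/purge', params={'cursor': q.cursor()})

Map PurgeWorker to /tasks/purge in your WSGI app and enqueue the first task by hand.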

Cheers!


Eli Jones

Mar 22, 2010, 5:43:52 PM
to google-a...@googlegroups.com
oh man.. well, he's going to be wiping out 7GB of junk... :)

When I went through the process of deleting something like 400MB of junk.. it was not fun....

First I started off deleting by __key__ in batches of 500, then I had to limit down to 200.. then down to 100.. then down to 50.. then down to 10.. then it stopped responding for hours (I could not even fetch(1) from the Model).

There must be a sanctioned way to remove 100,000s of entities based on how the datastore is structured.  For example, does it make sense to do something like this:

Use a cursor to:
1. Select __key__ from Model Order By __key__
2. append every 10th (or 100th) result to a list.. and delete that list for every 100 or 200 or 500 entities added.
3. Once at end of cursor, start over at the beginning.

That way, you wouldn't be deleting everything on the same table at the same time?  The datastore completely died on me when I tried to straight delete by __key__ using GqlQuery in a loop.. Just kept getting slower and slower. (I think maybe directly deleting by key_name might work better but I never had to do a bulk delete again.. so have not tested that theory).

Nick Johnson (Google)

Mar 22, 2010, 5:48:25 PM
to google-a...@googlegroups.com
Hi,

On Mon, Mar 22, 2010 at 8:45 PM, homunq <jameso...@gmail.com> wrote:
> OK, after hashing it out on IRC, I see that I have to erase my data
> and start again.

Why is that? Wouldn't updating the data be a better option?

> Since it took me 3 days of CPU quota to add the data,
> I want to know if I can erase it quickly.
>
> 1. Is the overhead for erasing data (and thus whittling down indexes)
> over half the overhead of adding it? Under 10%? Or what? (I don't
> need exact numbers, just approximations.)

It should be significantly lower - you can do a keys-only query, and delete the returned keys.

-Nick Johnson
 

homunq

Mar 23, 2010, 6:25:09 AM
to Google App Engine

On Mar 22, 3:48 pm, "Nick Johnson (Google)" <nick.john...@google.com>
wrote:


> On Mon, Mar 22, 2010 at 8:45 PM, homunq <jameson.qu...@gmail.com> wrote:
> > OK, after hashing it out on IRC, I see that I have to erase my data
> > and start again.
>
> Why is that? Wouldn't updating the data be a better option?

Because everything about it is wrong for saving space - the key names,
the field names, the indexes, and even in one case the fact of
breaking a string out into a list. (Something I did for better
searching in several cases, one of which is not worth it now that I
realize that 10x overhead is easy to hit.)

And because the data import runs smoothly, and I have code for that
already.

....

Watching my deletion process start to get trapped in molasses, as Eli
Jones mentions above, I have to ask two things again:

1. Is there ANY ANY way to delete all indexes on a given property
name? Without worrying about keeping indexes in order when I'm just
paring them down to 0, I'd just be running through key names and
deleting them. It seems that would be much faster. (If it's any help,
I strongly suspect that most of my key names are globally unique
across all of Google).

2. What is the reason for the slowdown? If I understand his suggestion
to delete every 10th record, Eli Jones seems to suspect that it's
because there's some kind of resource conflict on specific sections of
storage, thus the solution is to attempt to spread your load across
machines. I don't see why that would cause a gradual slowdown. My best
theory is that write-then-delete leaves the index somehow a little
messier (for instance, maybe the index doesn't fully recover the
unused space because it expects you to fill it again) and that when
you do it on a massive scale you get massively messy and slow indexes.
Thus, again, I suspect this question reduces to question 1, although I
guess that if my theory is right a compress/garbage-collect/degunking
call for the indexes would be (for me) second best after a way to nuke
them.

Nick Johnson (Google)

Mar 23, 2010, 6:39:22 AM
to google-a...@googlegroups.com
Hi,

On Tue, Mar 23, 2010 at 10:25 AM, homunq <jameso...@gmail.com> wrote:
> On Mar 22, 3:48 pm, "Nick Johnson (Google)" <nick.john...@google.com>
> wrote:
> > On Mon, Mar 22, 2010 at 8:45 PM, homunq <jameson.qu...@gmail.com> wrote:
> > > OK, after hashing it out on IRC, I see that I have to erase my data
> > > and start again.
> >
> > Why is that? Wouldn't updating the data be a better option?
>
> Because everything about it is wrong for saving space - the key names,
> the field names, the indexes, and even in one case the fact of
> breaking a string out into a list. (Something I did for better
> searching in several cases, one of which is not worth it now that I
> realize that 10x overhead is easy to hit.)
>
> And because the data import runs smoothly, and I have code for that
> already.
>
> ....
>
> Watching my deletion process start to get trapped in molasses, as Eli
> Jones mentions above, I have to ask two things again:
>
> 1. Is there ANY ANY way to delete all indexes on a given property
> name? Without worrying about keeping indexes in order when I'm just
> paring them down to 0, I'd just be running through key names and
> deleting them. It seems that would be much faster. (If it's any help,
> I strongly suspect that most of my key names are globally unique
> across all of Google).

No - that would violate the constraint that indexes are always kept in sync with the data they refer to.
 

> 2. What is the reason for the slowdown? If I understand his suggestion
> to delete every 10th record, Eli Jones seems to suspect that it's
> because there's some kind of resource conflict on specific sections of
> storage, thus the solution is to attempt to spread your load across
> machines. I don't see why that would cause a gradual slowdown. My best
> theory is that write-then-delete leaves the index somehow a little
> messier (for instance, maybe the index doesn't fully recover the
> unused space because it expects you to fill it again) and that when
> you do it on a massive scale you get massively messy and slow indexes.
> Thus, again, I suspect this question reduces to question 1, although I
> guess that if my theory is right a compress/garbage-collect/degunking
> call for the indexes would be (for me) second best after a way to nuke
> them.

Deletes using the naive approach slow down because when a record is deleted in Bigtable, it simply inserts a 'tombstone' record indicating the original record is deleted - the record isn't actually removed entirely from the datastore until the tablet it's on does its next compaction cycle. Until then, every subsequent query has to skip over the tombstone records to find the live records.

This is easy to avoid: Use cursors to delete records sequentially. That way, your queries won't be skipping the same tombstoned records over and over again - O(n) instead of O(n^2)!
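
In code, the difference is just holding on to the cursor between batches - a
rough sketch (Foo is a placeholder model, and in practice you'd run each batch
from the task queue rather than in one long loop):

from google.appengine.ext import db

class Foo(db.Model):
    pass  # placeholder for the model being purged

q = Foo.all(keys_only=True)
cursor = None
while True:
    if cursor:
        q.with_cursor(cursor)   # resume just past the last batch we deleted
    keys = q.fetch(200)
    if not keys:
        break
    db.delete(keys)
    cursor = q.cursor()
# Without the cursor, every fresh query starts at the front of the table and
# has to step over all the tombstones left by earlier batches before it finds
# anything live.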

-Nick Johnson
 


homunq

Mar 23, 2010, 9:57:27 AM
to Google App Engine

>
> > Watching my deletion process start to get trapped in molasses, as Eli
> > Jones mentions above, I have to ask two things again:
>
> > 1. Is there ANY ANY way to delete all indexes on a given property
> > name? Without worrying about keeping indexes in order when I'm just
> > paring them down to 0, I'd just be running through key names and
> > deleting them. It seems that would be much faster. (If it's any help,
> > I strongly suspect that most of my key names are globally unique
> > across all of Google).
>
> No - that would violate the constraint that indexes are always kept in sync
> with the data they refer to.
>

It seems to me that having no index at all is the same situation as if
the property was indexed=False from the beginning. If that's so, it
can't be violating a hard constraint.

>
> > 2. What is the reason for the slowdown? If I understand his suggestion
> > to delete every 10th record, Eli Jones seems to suspect that it's
> > because there's some kind of resource conflict on specific sections of
> > storage, thus the solution is to attempt to spread your load across
> > machines. I don't see why that would cause a gradual slowdown. My best
> > theory is that write-then-delete leaves the index somehow a little
> > messier (for instance, maybe the index doesn't fully recover the
> > unused space because it expects you to fill it again) and that when
> > you do it on a massive scale you get massively messy and slow indexes.
> > Thus, again, I suspect this question reduces to question 1, although I
> > guess that if my theory is right a compress/garbage-collect/degunking
> > call for the indexes would be (for me) second best after a way to nuke
> > them.
>
> Deletes using the naive approach slow down because when a record is deleted
> in Bigtable, it simply inserts a 'tombstone' record indicating the original
> record is deleted - the record isn't actually removed entirely from the
> datastore until the tablet it's on does its next compaction cycle. Until
> then, every subsequent query has to skip over the tombstone records to find
> the live records.
>
> This is easy to avoid: Use cursors to delete records sequentially. That way,
> your queries won't be skipping the same tombstoned records over and over
> again - O(n) instead of O(n^2)!
>

Thanks for explaining. Can you say anything about how often the
compaction cycles happen? Just an order of magnitude - hours, days, or
weeks?

Thanks,
Jameson

Nick Johnson (Google)

Mar 23, 2010, 10:10:25 AM
to google-a...@googlegroups.com
On Tue, Mar 23, 2010 at 1:57 PM, homunq <jameso...@gmail.com> wrote:

> >
> > > Watching my deletion process start to get trapped in molasses, as Eli
> > > Jones mentions above, I have to ask two things again:
> >
> > > 1. Is there ANY ANY way to delete all indexes on a given property
> > > name? Without worrying about keeping indexes in order when I'm just
> > > paring them down to 0, I'd just be running through key names and
> > > deleting them. It seems that would be much faster. (If it's any help,
> > > I strongly suspect that most of my key names are globally unique
> > > across all of Google).
> >
> > No - that would violate the constraint that indexes are always kept in sync
> > with the data they refer to.
> >
>
> It seems to me that having no index at all is the same situation as if
> the property was indexed=False from the beginning. If that's so, it
> can't be violating a hard constraint.

Internally, indexed fields are stored in the 'properties' list in the Entity Protocol Buffer, while unindexed fields are stored in the 'unindexed_properties' list in the Entity PB. The only way to change the indexing properties is to fetch them and store them.
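
A tiny sketch of what that looks like, assuming the model class has already
been redeployed with indexed=False (or a shorter name=) on the relevant
properties - MyModel is a placeholder:

# Re-putting an entity rewrites it under the current model definition, which
# moves the newly-unindexed properties out of the indexes and drops their rows.
batch = MyModel.all().fetch(100)
db.put(batch)
# ...repeat with a cursor (or from the task queue) until every entity is rewritten.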
As for how often compactions happen: they're based on the quantity of modifications to data in a given tablet. Doing many inserts, updates or deletes will, sooner or later, trigger a compaction.

-Nick Johnson
 

