Is the datastore not good for lots of small sized objects?


Nate Bauernfeind

Jun 27, 2010, 9:55:21 PM
to Google App Engine
I would reply to one of my other posts, except that they're still in
moderator-limbo. So I apologize for creating a new thread.

To briefly summarize, my project has lots of very small objects. In
particular, I've uploaded about 1.45 million objects, each with a Long
id and a String name. On average, the strings are about 12 characters
long. I plan(ned) on adding about 84 million other objects, each
consisting of 4 Longs (including the id). (And this is just for a
subset of my data, which is probably on the order of 2x to 3x larger
in total.) I also estimate adding about 200k new records every day.

So I'm having several issues. One is that inserting 500 items takes
about 30 seconds, which turns out to be a lot of CPU-api time. The
second is that after loading the 1.45 million objects, the datastore
usage has ballooned to 500 megabytes. The raw data is only on the
order of 30 megabytes; that is about the size I can store it in
myself, and also the size the datastore statistics report for it. The
datastore statistics also say that about 108 MB exists as metadata. I
imagine this is the default index on key, and I imagine it is a
reasonable value.

However, the main page for my app says I'm taking up about 500
megabytes, while the datastore statistics only account for about
140 MB. Is this normal?

Should I expect the rest of my data to balloon about the same? (i.e.
2.5 gigabytes of raw data to require 65 gigabytes pre-indexes)

If so... Well that really sucks.
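
For a rough sense of scale, the figures in this post can be run through a quick back-of-the-envelope calculation. This is only a sketch: it assumes a megabyte is 1024 * 1024 bytes, which the post does not state, and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope arithmetic on the figures reported in this post.
# Assumption: "megabytes" means 1024 * 1024 bytes (not stated in the thread).
MB = 1024 * 1024

ENTITIES = 1450000   # objects uploaded so far
RAW_MB = 30          # approximate size of the raw data
STATS_MB = 140       # total accounted for by the datastore statistics
QUOTA_MB = 500       # total shown in the quota usage


def expansion_factor(stored_mb, raw_mb):
    """How many times larger the stored footprint is than the raw data."""
    return stored_mb / float(raw_mb)


def bytes_per_entity(total_mb, entities):
    """Average stored bytes per entity for a given total footprint."""
    return total_mb * MB / float(entities)


if __name__ == "__main__":
    print("expansion factor: %.1fx" % expansion_factor(QUOTA_MB, RAW_MB))
    print("bytes per entity: %.0f" % bytes_per_entity(QUOTA_MB, ENTITIES))
    print("not in statistics: %d MB" % (QUOTA_MB - STATS_MB))
```

Under that assumption the stored footprint is roughly 17 times the raw data, or a bit over 360 bytes per entity, and about 360 MB of the quota figure does not show up in the statistics.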

Darien Caldwell

Jun 28, 2010, 1:28:15 PM
to Google App Engine
The reason for the ballooning is that, by default, two indexes (one
ascending, one descending) are created for every property in your
Entity model. All of these indexes add quite a bit to the necessary
storage space. If you have any properties in your model that don't
require an index, it's a good idea to specify indexed=False for that
property in the model.

More info here: http://code.google.com/appengine/docs/python/datastore/queriesandindexes.html

Bulk uploading does take time, especially for so many entities, but
hopefully it's something you only have to do once.

Nate Bauernfeind

Jun 28, 2010, 1:47:26 PM
to google-a...@googlegroups.com
Is it possible to force it to make only the ascending index and not the reverse too? I would like to be able to look up the id by name, but I will never need reverse order.




Nate Bauernfeind

Jun 28, 2010, 1:49:04 PM
to google-a...@googlegroups.com
Oh and does anyone know if the 500 mb in the quota usage is the 140mb of data replicated redundantly? I'm trying to figure out where that is coming from. 

Or is it that the 108mb of metadata is *not* the primary indexes?

Geoffrey Spear

Jun 28, 2010, 3:06:59 PM
to Google App Engine


On Jun 28, 1:49 pm, Nate Bauernfeind <nate.bauernfe...@gmail.com>
wrote:
> Oh and does anyone know if the 500 mb in the quota usage is the 140mb of
> data replicated redundantly? I'm trying to figure out where that is coming
> from.
>
> Or is it that the 108mb of metadata is *not* the primary indexes?

You don't pay extra for replication of your data.

I believe the "metadata" figure is the space taken up in the protocol
buffers by things like your application name and property names. If
you have lots of tiny entities and relatively long names for things,
this can certainly be a significant proportion of your storage. The
space used by the indexes themselves is, as far as I can tell, not
reported anywhere, and only appears in the totals.

Nate Bauernfeind

Jun 28, 2010, 3:54:52 PM
to google-a...@googlegroups.com
Hmm. It looks like I should've read through the datastore specs prior to uploading so much data. I will give this another run with exactly the same data but with an obfuscated class definition.



djidjadji

Jun 28, 2010, 4:46:31 PM
to google-a...@googlegroups.com
Also try using the name argument in the Property constructor and see
if this reduces the metadata storage. Choose name values of 1 or 2
characters. You can still use long property names in your code:

longPropName = db.IntegerProperty(name='i', indexed=False)

Also make the Model name as short as possible, along with the
application name in the appspot.com domain.

At the moment you cannot specify that you only want the ascending
index. It's both or none.
If the property is only used in a combined index, you must still
specify "indexed=True".


Nate Bauernfeind

Jul 2, 2010, 1:03:32 AM
to google-a...@googlegroups.com
For those who were following this thread,

After re-uploading all of my data with the class name shortened from 12 characters to 1 and the non-key field name shortened from 4 characters to 1, I was able to reduce my total footprint from 530 MB to 340 MB for 1.45M entities. That works out to about 245 bytes per entity (including the default indexes).
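
The per-entity figure can be checked with a few lines of arithmetic. This is a sketch, not anything from the App Engine SDK: the function name is hypothetical, and it assumes a megabyte of 1024 * 1024 bytes, which is the interpretation under which the 245-byte figure is reproduced (with 1,000,000-byte megabytes it would come out nearer 234).

```python
# Per-entity footprint before and after shortening the names.
# Assumption: 1 MB = 1024 * 1024 bytes (not stated in the thread).
MB = 1024 * 1024


def footprint_summary(before_mb, after_mb, entities):
    """Return per-entity byte counts and the relative reduction."""
    before = before_mb * MB / float(entities)
    after = after_mb * MB / float(entities)
    return {
        "before_bytes": before,
        "after_bytes": after,
        "saved_bytes": before - after,
        "reduction_pct": 100.0 * (before_mb - after_mb) / before_mb,
    }


if __name__ == "__main__":
    summary = footprint_summary(530, 340, 1450000)
    for key in sorted(summary):
        print("%s: %.1f" % (key, summary[key]))
```

On these numbers the renaming saved a little under 140 bytes per entity, or about 36% of the total footprint.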

Certainly wish I could get rid of the DESC index on my one property (for this entity type).

djidjadji

Jul 2, 2010, 12:02:56 PM
to google-a...@googlegroups.com
A nice reduction.
What percentage is the metadata now?

It would be a nice addition to GAE if we could have a section in
index.yaml where we make all the needed indices explicit. No more
implicit indices. To be compatible with the current implementation of
GAE, this section should be optional. The developer chooses whether he
wants implicit property indices, which he can partially disable with
the "indexed" argument of the Property, or whether all the indices are
explicit. Maybe then we wouldn't need an index on a property if we
only use it in a complex query.

The development server would create the explicit index definitions if
needed, provided the explicit section is specified in index.yaml.

GAE team, is this possible? If yes, I will file an issue for this feature.


Nate Bauernfeind

Jul 2, 2010, 6:15:02 PM
to google-a...@googlegroups.com
The statistic displayed went from 88% metadata down to 69%. The pre-index average size went from 100 bytes to 64 bytes. Again, the only modification was shortening the 12-character class name to 1 character and the 4-character property name to 1.
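
As a rough cross-check, the pre-index saving can be compared with the number of characters removed from the names. This is a sketch on the figures reported in this thread only. The names are hypothetical, a megabyte is assumed to be 1024 * 1024 bytes, and nothing here reflects the actual on-disk layout, which the thread does not describe.

```python
# Pre-index saving per entity versus characters removed from the names.
# All inputs are the figures reported in this thread.
MB = 1024 * 1024  # assumption: binary megabytes

ENTITIES = 1450000
CHARS_REMOVED = (12 - 1) + (4 - 1)   # class name plus property name
PRE_INDEX_BEFORE = 100               # bytes per entity
PRE_INDEX_AFTER = 64                 # bytes per entity


def pre_index_saving():
    """Return (bytes saved per entity, bytes saved per removed character)."""
    saved = PRE_INDEX_BEFORE - PRE_INDEX_AFTER
    return saved, saved / float(CHARS_REMOVED)


def total_saving_mb(saved_per_entity, entities):
    """Total pre-index saving across all entities, in megabytes."""
    return saved_per_entity * entities / float(MB)


if __name__ == "__main__":
    saved, per_char = pre_index_saving()
    print("saved per entity: %d bytes" % saved)
    print("saved per removed character: %.2f bytes" % per_char)
    print("total pre-index saving: %.1f MB" % total_saving_mb(saved, ENTITIES))
```

Each removed character saved more than two bytes before indexes, which would be consistent with the names being stored more than once per entity. The pre-index saving totals roughly 50 MB, so if the two sets of figures are comparable, most of the 190 MB overall reduction would have come from the index rows rather than from the entities themselves.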
