Feedback on data exporter design?


Pete

unread,
Apr 15, 2008, 1:44:49 AM4/15/08
to Google App Engine
Getting large amounts of data from existing apps into the datastore
can be a challenge, and to help we offer a Bulk Data Uploader. You
can read more about this tool here:

http://code.google.com/appengine/articles/bulkload.html

Conversely, we know many of you would like a better tool for moving
your data *off* of Google App Engine, and we'd like to ask for your
feedback on what formats would be the most useful for a data
exporter. XML output? CSV transform? RDF? Let us know what you
think!

Jared

unread,
Apr 15, 2008, 1:54:55 AM4/15/08
to Google App Engine
Not to beat the django horse to death, but the django data dump
commands can output in xml or json. Those seem like two to start
with.

Also, the option to make the files human-readable (again, like django)
would be good too.

Aaron Krill

unread,
Apr 15, 2008, 2:19:00 AM4/15/08
to Google App Engine
I concur with Jared. XML and JSON are my preferred methods.

What really needs to happen is for a toJson function to be added to
Datastore models. Then when you want to dump an entire DB you choose
between XML and JSON, and on your side you simply execute a query that
grabs all objects from all models and exports them one by one.

I would like a format like this for JSON:

{"model_name": "MyModel", "properties": {"myproperty": {"type":
"UserProperty", "options": {"someoption": "value"}}}, "objects":
{"id": 0, "myproperty": "aa...@krillr.com"}}

Thoughts?

Justin Worrell

unread,
Apr 15, 2008, 3:04:00 AM4/15/08
to Google App Engine
I admit that I don't have much experience with this matter, but it
seems like binary data over xml would be messy.

It also seems like binary data would be necessary, since file system
access isn't a part of the engine - for example, if someone is
building a Flickr-like application and has a large collection of
existing images to upload.

Maybe there's a standard way to do this and I'm just dense. At the
very least any method seems like it should be binary friendly.

--
Justin

Cameron Singe

unread,
Apr 15, 2008, 3:12:50 AM4/15/08
to Google App Engine
Justin, for binary content you can use Base64 encoding or some other
string-based encoding format to cater for this and still use xml, json
or whatever.

Would it be possible to define our own export templates, using say the
django templates? Also some sort of web service where we can poll the
data would be great.

JSON format would be great as well.

Aaron Krill

unread,
Apr 15, 2008, 3:14:19 AM4/15/08
to google-a...@googlegroups.com
There are several options for exporting binary data in XML and JSON formats. The most common is to encode it in Base64. It's not really messy at all.
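For example (just a sketch - the bytes are a stand-in for real image data):

import base64

raw = '\x89PNG\r\n\x1a\n...'               # stand-in for real image bytes
encoded = base64.b64encode(raw)            # ASCII-safe, fine inside XML or JSON text
assert base64.b64decode(encoded) == raw    # round-trips exactly on import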

Also, a majority of content types probably wouldn't need to be exported if ticket #208 comes true. My thought with this is that most content types (images, video, office documents) would be stored in the appropriate Google services (Picasa Web Albums, Google Video, Google Docs, etc.). Then instead of storing these in your Model you'd simply store references to them for each user.

Joscha Feth

unread,
Apr 15, 2008, 3:41:32 AM4/15/08
to Google App Engine

On 15 Apr., 09:04, Justin Worrell <jworr...@gmail.com> wrote:
> I admit that I don't have much experience with this matter, but it
> seems like binary data over xml would be messy.

When I think about migrating a GAE app to a "standard" model (e.g.
some language plus a relational database) the main problem I see is
the model references (e.g. db.Key and parent relationships).
I would also prefer to have XML output, but the advantage of binary
data would be that the references to other objects could be stored
natively. In my eyes, XML with a lot of reference attributes/tags is
suboptimal for importing the data into a relational database. But at
least with XML it is possible to define a proper Schema (specifying
all models and db.Key, etc.) which standardizes the export - with JSON
this would not be that easy, so JSON is probably only good for doing
small exports and not worth thinking about here anyway, because every
developer should be able to write a few lines of code that transform
his/her specific model into JSON.
The other thing with the references is that exporting while
maintaining all relationships between all models would need one big
XML file. If we take Justin's Flickr example, the XML file would
additionally be bloated with base64-encoded image data - which makes
the XML export huge and pretty tough to handle...
How about giving a general interface with some already provided
filters (e.g. for XML) but also making it possible for developers to
write their own export filters?

regards,
Joscha

Aaron Krill

unread,
Apr 15, 2008, 3:59:08 AM4/15/08
to google-a...@googlegroups.com
If you want your own export filter, simply write it into your application. It isn't hard to do.

As far as XML being bloated with Base64-encoded binary data, I don't see how this could be hard to handle. This article here has some good info on it:

All you would have to do to handle the binary data is parse the XML tree, grab the base64 and decode it. Simple. How is this difficult? This is the standard for any XML-based document format that must contend with binary data (SVG containing scalar graphics comes to mind). Also ODF and the like which may contain similar image data.

I don't see why JSON would not be suitable for a large export -- the format is small and easy to understand. It's also far easier to import and export from other python applications. In some cases, parsing a JSON "packet" is considerably faster than parsing XML as well.

Joscha Feth

unread,
Apr 15, 2008, 4:09:25 AM4/15/08
to Google App Engine
On 15 Apr., 09:59, "Aaron Krill" <aa...@krillr.com> wrote:
> If you want your own export filter, simply write it into your application.
> It isn't hard to do.

sure - but why reinvent the wheel a thousand times - a general API
with open filters would pool our efforts - and also makes the filters
less buggy than everyone developing filters on their own

> All you would have to do to handle the binary data is parse the XML tree,
> grab the base64 and decode it. Simple. How is this difficult? This is the
> standard for any XML-based document format that must contend with binary
> data (SVG containing scalar graphics comes to mind). Also ODF and the like
> which may contain similar image data.

I didn't mean base64-encoded data itself is hard to handle - I meant a
single XML file with X Gigabyte base64-encoded data in it isn't easy
to handle.

> I don't see why JSON would not be suitable for a large export -- the format
> is small and easy to understand. It's also far easier to import and export
> from other python applications. In some cases, parsing a JSON "packet" is
> considerably faster than parsing XML as well.

so how would you translate the references to other models into JSON?
Just by throwing out the key name as a string? The nice thing about
XML with a schema is that it is self-describing - the nice thing
about a binary format would be that the references could be modeled as
pointers - just the key name as a string now seems messy to me - but
as I said before: for small exports this might be the right thing to
do...

regards,
Joscha

Aaron Krill

unread,
Apr 15, 2008, 4:16:11 AM4/15/08
to google-a...@googlegroups.com
On Tue, Apr 15, 2008 at 1:09 AM, Joscha Feth <jos...@feth.com> wrote:

> On 15 Apr., 09:59, "Aaron Krill" <aa...@krillr.com> wrote:
> > If you want your own export filter, simply write it into your application.
> > It isn't hard to do.
>
> sure - but why reinvent the wheel a thousand times - a general API
> with open filters would pool our efforts - and also makes the filters
> less buggy than everyone developing filters on their own

So then simply release your filter code and others can include it in their own projects.
 


> > All you would have to do to handle the binary data is parse the XML tree,
> > grab the base64 and decode it. Simple. How is this difficult? This is the
> > standard for any XML-based document format that must contend with binary
> > data (SVG containing scalar graphics comes to mind). Also ODF and the like
> > which may contain similar image data.
>
> I didn't mean base64-encoded data itself is hard to handle - I meant a
> single XML file with X Gigabyte base64-encoded data in it isn't easy
> to handle.

This is true. Perhaps the export filter could be set in such a way that if there are objects that would translate to very large XML files, each object would be in its own file.



> > I don't see why JSON would not be suitable for a large export -- the format
> > is small and easy to understand. It's also far easier to import and export
> > from other python applications. In some cases, parsing a JSON "packet" is
> > considerably faster than parsing XML as well.
>
> so how would you translate the references to other models into JSON?
> Just by throwing out the key name as a string? The nice thing about
> XML with a schema is that it is self-describing - the nice thing
> about a binary format would be that the references could be modeled as
> pointers - just the key name as a string now seems messy to me - but
> as I said before: for small exports this might be the right thing to
> do...
 
If you look at the example I already provided, it allows for you to define a property type. This includes the ReferenceType and any arguments associated with it. If you write something to handle a JSON export you can make it understand these types and arguments, can you not?
 


regards,
Joscha


Joscha Feth

unread,
Apr 15, 2008, 4:28:33 AM4/15/08
to Google App Engine
Hi,

> > sure - but why reinvent the wheel a thousand times - a general API
> > with open filters would pool our efforts - and also makes the filters
> > less buggy than everyone developing filters on their own
>
> So then simply release your filter code and others can include it in their
> own projects.
yep, that's what I am talking about - my idea was to combine the
export functionality provided by Google with our own ideas on one
platform...

> > I didn't mean base64-encoded data itself is hard to handle - I meant a
> > single XML file with X Gigabyte base64-encoded data in it isn't easy
> > to handle.
>
> This is true. Perhaps the export filter could be set in such a way that if
> there are objects that would translate to very large XML files, each object
> would be in its own file.
yeah - so giving a parameter such as externalize="image/*" would store
all properties matching this mime type outside in an external file and
the XML would just include a link to it.


> If you look at the example I already provided, it allows for you to define a
> property type. This includes the ReferenceType and any arguments associated
> with it. If you write something to handle a JSON export you can make it
> understand these types and arguments, can you not?

I admit that it is possible - it just didn't seem convenient to me to
flatten the references - but in turn the key names are just perfect
for that - are key names unique throughout the whole GAE ecosystem or
just within one project?!

greets,
J

Tobias

unread,
Apr 15, 2008, 4:37:17 AM4/15/08
to Google App Engine
What about exporting in standard SQL format? I guess most of the data
will be going more or less into a DBMS, so this would ease the
transition. OK, it might not fit some rare circumstances - but it
should be fine for most users.
What do you think?

Tobias

Aaron Krill

unread,
Apr 15, 2008, 4:38:34 AM4/15/08
to google-a...@googlegroups.com
The thought of exporting binary blobs makes me want to gag.

David N

unread,
Apr 15, 2008, 4:38:52 AM4/15/08
to Google App Engine
OOXML?

Joscha Feth

unread,
Apr 15, 2008, 4:45:58 AM4/15/08
to Google App Engine
Are you talking about generating a binary file in some format (e.g.
MyISAM, etc.) or about generating the CREATE and INSERT statements? If
the latter: what is standard? I don't know of two RDBMSs that share
exactly the same set of SQL commands...also the modelling with GAE is
very different from how you would do it in a relational database...but
the idea of having an SQL statement export seems tempting - one would
be able to export the data without losing any information and then
remodel the structure within your target RDBMS.

Tobias

unread,
Apr 15, 2008, 4:59:33 AM4/15/08
to Google App Engine
Exactly, I meant standard SQL code like CREATE and INSERT. Just the
plain and standard stuff (yes, there exist some standards - at least
historically), no DBMS-specific code (for a start).

OT / One other feature would be the ability to make some simple
changes to the data stored in GAE - e.g. renaming existing data
structures, inserting random data directly in the GAE control center
and such simple stuff. This shouldn't be too dramatic security-wise,
so it might be a low-hanging fruit. ;-)

Import: of course RSS, CSV/TSV, XML and (standard) SQL, too (all with
UTF-8 encoding)

jat...@gmail.com

unread,
Apr 15, 2008, 5:23:50 AM4/15/08
to Google App Engine
What about AMF? It can handle recursive objects and it is binary.

It would be simple to export from an existing ORM in most languages
and reimport in BigTable.

regards,

Javier.

TeunD

unread,
Apr 15, 2008, 7:23:37 AM4/15/08
to Google App Engine
Apparently I'm the only one, but I would like AtomPub. It has the
advantages of XML and can handle binary downloads as, well, binary
downloads. It combines a data formatting protocol with a protocol for
references, both internal to the file and external to related content.
Think blog entries and comments in Blogger. JSON seems handy when you
want to download part of the data to a web browser client for display,
but I fail to see why you would want JSON-formatted data as a generic
bulk import/export format.


martian

unread,
Apr 15, 2008, 12:39:31 PM4/15/08
to Google App Engine
Yes, pure text-based SQL output as is generated by pg_dump (PostgreSQL
dump tool) or equivalent would be most useful. Then the data can be
imported straight into some local database by simply piping it into
the client:

psql my_postgresql_database < my_gae_sql_dump.txt

A similar operation could work for MySQL or Oracle, and if the data is
in "pure" standard SQL3 (CREATE and INSERT only), it should be very
portable. If you could provide some suggested CREATE INDEX statements
that Google uses to speed up the BigTable queries that would be great
too, though probably extra-credit.
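Roughly, the dump could be generated with something like this (only a
sketch - every property is dumped as TEXT and the datastore key is
kept as a string column; a real tool would map types and NULLs
properly):

def kind_to_sql(model_class):
    # Emit plain CREATE TABLE / INSERT statements for one kind.
    kind = model_class.kind()
    names = sorted(model_class.properties().keys())
    yield 'CREATE TABLE %s (key TEXT PRIMARY KEY, %s);' % (
        kind, ', '.join('%s TEXT' % n for n in names))
    for entity in model_class.all():
        values = [str(entity.key())]
        values += [unicode(getattr(entity, n)) for n in names]  # None becomes 'None' here
        escaped = ', '.join("'%s'" % v.replace("'", "''") for v in values)
        yield 'INSERT INTO %s (key, %s) VALUES (%s);' % (kind, ', '.join(names), escaped)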

XML and JSON are also good options.

m++

On Apr 15, 1:59 am, Tobias <tkl...@enarion.it> wrote:
> Exactly, I meant standard SQL code like CREATE and INSERT. Just the

martian

unread,
Apr 15, 2008, 12:52:41 PM4/15/08
to Google App Engine
Yes, generating a text-based file of SQL statements similar to what is
generated by pg_dump (PostgreSQL dump tool) would be great. This
could then be imported by simply piping to the database client tool:

psql my_database < my_gae_sql_dump

If the SQL statements are all "pure standard" SQL3 (CREATE and INSERT
only) they should be very portable and work the same way for MySQL,
Oracle, and others. Also including any CREATE INDEX statements that
Google uses to optimize the BigTable queries would be a great plus --
extra-credit.

Being able to generate XML, JSON, or even CSV for individual tables
would be useful as well.

m++

On Apr 15, 1:59 am, Tobias <tkl...@enarion.it> wrote:

MattCruikshank

unread,
Apr 15, 2008, 1:04:42 PM4/15/08
to Google App Engine
The way you've asked the question strikes me as a bit bizarre, Pete.
"XML, CSV, RDF?" Why, I'd like to access it through GQL, of course!

Easily half of the reason for wanting to download and then later
upload the data is for backups. And no, I don't mean the kind of
backup where I think Google is likely to lose my data - I mean the
kind of backup where *I'm* likely to make a coding error that destroys
my data, and I'd like to be able to roll back to my last snapshot that
I saved myself. So, I don't care what format that's in, as long as
download and upload work flawlessly. I know that's complicated,
because the data can be changing while you download, but I'd like your
best stab at it.

Also, my other reason for wanting this is that I'd like to download,
process locally, and then upload the results on a periodic (every 24
hours?) basis. (That is, until Google makes available its own CRON
processing, hopefully with the full power of Map Reduce. *grin*)

So, if I could download the data in *any* format, as long as I can
access it on my local machine through GQL in my own Python scripts,
that'd be fantastic.

Thanks,
-Matt

Frank

unread,
Apr 15, 2008, 1:19:18 PM4/15/08
to Google App Engine
Other formats might be interesting, but I think you can't avoid XML.
If anything it has to be at least XML.

Now, it's true that datastores with lots of blobs might then export
huge XML files if you include the blobs as base64-encoded elements.

So why not encapsulate everything in a zip file instead, containing an
xml file per object class, but including the binary fields as files
named using their entity key? Those files could be organized in nested
folders using their class name and property name:

Zip file
  class1.xml
  class2.xml
  +class1
    +blob_property1
      entity_key1.bin
      entity_key2.bin
    +blob_property2
      entity_key1.bin
      ...

(+ means folder)
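A rough Python sketch of what writing such an archive could look like
(it leans on the entity's to_xml() and on db.BlobProperty; note the
blob still appears base64-encoded inside the XML too, so a real filter
would strip it there):

import zipfile
from StringIO import StringIO
from google.appengine.ext import db

def export_zip(model_classes):
    # One XML file per kind; blob properties are also written as separate
    # .bin entries named after the entity key, mirroring the layout above.
    buf = StringIO()
    archive = zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED)
    for cls in model_classes:
        kind = cls.kind()
        xml_parts = []
        for entity in cls.all():
            xml_parts.append(entity.to_xml())
            for name, prop in cls.properties().items():
                if isinstance(prop, db.BlobProperty):
                    blob = getattr(entity, name)
                    if blob:
                        path = '%s/%s/%s.bin' % (kind, name, entity.key())
                        archive.writestr(path, str(blob))
        xml_doc = u'<entities>%s</entities>' % u''.join(xml_parts)
        archive.writestr('%s.xml' % kind, xml_doc.encode('utf-8'))
    archive.close()
    return buf.getvalue()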

Anyhow, I think this export feature needs to be as generic as
possible; that's why I would recommend XML, and maybe a compressed
'zip' file like this to make the export easier and readable on every
platform.

Frank

Tobias

unread,
Apr 15, 2008, 2:10:21 PM4/15/08
to Google App Engine
It would also be great to version the data stored by Google and
export/download or roll back the data of a specific version.
Say I'm uploading a new version and would like to back up the current
data, setting a backup point / version, and be able to roll back to
it. I don't know if this is currently possible, but I guess not for
the public App Engine apps hosted on appspot.com.

BTW - Frank's idea looks great to me. (ZIP file with xml + bin
files)
zXML++

Brett Morgan

unread,
Apr 15, 2008, 6:53:38 PM4/15/08
to google-a...@googlegroups.com
That only makes sense if your datastore structure can be output in a
format that makes sense in pgSQL. I'm not sure that will always be
possible.

Ben the Indefatigable

unread,
Apr 15, 2008, 7:20:49 PM4/15/08
to Google App Engine
I don't see any difficult issues as long as you include the string
encoding of all the keys so we can put back together the relationships
and references if/when we need to. JSON and XML are fine, but to_xml
methods on the entities would be handy to help with homemade dumps.
You can't try to convert to a relational data model, but if all the
entities of each kind happen to have all the same properties then it
will look relational. Before each kind, you could include the to_xml
of the Model Instance. You'll need some means of a) paging to download
it in pieces and b) filtering in case the huge base64 binaries are not
wanted.
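For the paging part, something like this sketch could work (assuming
queries can order and filter on __key__; the batch size is arbitrary):

def iter_entities(model_class, batch_size=100):
    # Walk a kind in key order, one batch at a time, so a large dump can be
    # downloaded (and resumed) in pieces rather than in one huge query.
    last_key = None
    while True:
        query = model_class.all().order('__key__')
        if last_key is not None:
            query.filter('__key__ >', last_key)
        batch = query.fetch(batch_size)
        if not batch:
            break
        for entity in batch:
            yield entity
        last_key = batch[-1].key()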

Darryl Cousins

unread,
Apr 15, 2008, 7:26:15 PM4/15/08
to google-a...@googlegroups.com
Hi,

On Tue, 2008-04-15 at 10:19 -0700, Frank wrote:
> Other formats might be interesting, but I think you can't avoid XML.
> if anything it has to be at least XML.
>
> now this is true that datastores with lots of blobs might then export
> huge XML files if you include them as base64 encoded elements.
>
> So why not encapsulate everything in a zip file instead, containing an
> xml file per object class, but including the binary fields as files
> named using their entity key.
> those files could be organized in nested folders using their class
> name and property name:

Something like this would also be my choice. I use a similar pattern for
the import/export data of objects stored on zodb but I use config.ini
and contents.csv instead of xml. I find it useful because clients can
also easily browse and read these formats.

Darryl

Ben the Indefatigable

unread,
Apr 15, 2008, 7:37:15 PM4/15/08
to Google App Engine
>to_xml methods on the entities would be handy to help with homemade dumps.

I thought that to_xml was on the model class, not the entity; now I
see it is already available on the entity.

Never mind. BTW, looking at an example of to_xml output, it looks good:

<entity kind="Greeting" key="agpoZWxsb3dvcmxkcg4LEghHcmVldGluZxgBDA">
  <key>tag:helloworld.gmail.com,2008-04-15:Greeting[agpoZWxsb3dvcmxkcg4LEghHcmVldGluZxgBDA]</key>
  <property name="author" type="user">te...@example.com</property>
  <property name="content" type="string">Test</property>
  <property name="date" type="gd:when">2008-04-09 06:08:50.609000</property>
</entity>

Victor Kryukov

unread,
Apr 15, 2008, 9:10:16 PM4/15/08
to Google App Engine


On Apr 15, 12:44 am, Pete <pkoo...@google.com> wrote:
> Getting large amounts of data from existing apps into the datastore
> can be a challenge, and to help we offer a Bulk Data Uploader.  You
> can read more about this tool here:
>
> http://code.google.com/appengine/articles/bulkload.html

While moving data in and out is important, and the bulkload tool helps
a lot in moving data in, there's yet another data-related task: data
updates. And they can be really challenging with the current system
design (e.g. without long-running processes).

For example, I have uploaded some 400 entries, and expect to add
thousands more. I already see the need to update my data (e.g. add
long/lat information based on the address, or add a country field to
expand my program to an international audience), yet there's no
convenient way to do that.

One potential solution is to delete all entries one by one and then
re-upload them - needless to say, it's grossly inefficient and
inconvenient. Some version of bulkload which could update existing
entries based on some key fields and additional rows in a CSV file
would be really helpful.
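Something like this is what I have in mind - only a sketch, with a
made-up 'Place' kind and 'country' column, and each CSV row
identifying an entity by its key_name:

import csv
from google.appengine.ext import db

class Place(db.Model):
    # Hypothetical kind: addresses are already uploaded, country is missing.
    address = db.StringProperty()
    country = db.StringProperty()

def update_from_csv(path):
    # Each row is "key_name,country": update the matching entity in place
    # instead of deleting and re-uploading everything.
    for key_name, country in csv.reader(open(path, 'rb')):
        entity = Place.get_by_key_name(key_name)
        if entity is None:
            continue  # unknown key, skip (or log it)
        entity.country = country
        entity.put()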

Regards,
Victor

Jeff Hinrichs

unread,
Apr 15, 2008, 11:01:29 PM4/15/08
to Google App Engine
Since no one else has mentioned it, I will. How about the ability to
download the db in a format instantly usable by the development
server? Everyone can then have their own local code extract it into
whatever format they want. It is the most directly usable, and I
would gamble that no matter what format is selected, the majority are
going to pump it into their local dev environment. Now, I don't hold
that this is the one true way; you could make the argument for XML
with some converter script to push it into your dev environment.

Here is a partial list of qualities a solution should have:
0) The most common operations with the downloaded data should be
trivial to do. For instance, restoring it to the dev environment

1) Bandwidth friendly, both for Google and us developers, so I don't
see how it could be practical without taking whatever intermediate
structure and compressing it for transmission. The answer has to keep
in mind that it's not just your datastore, but potentially tens of
thousands of datastores that people are going to be downloading for a
variety of reasons.

2) The bulk_uploadclient needs to be able to use it directly.
Scenario - Something bad happens to your data, and you need to push
your latest backup back out to production. It would be best if you
could use whatever you downloaded, without any extra manipulation and
upload it to recover back to that point.

regards,

Jeff

Victor Kryukov

unread,
Apr 16, 2008, 12:20:46 AM4/16/08
to Google App Engine
On Apr 15, 10:01 pm, dundeemt <dunde...@gmail.com> wrote:
> On Apr 15, 8:10 pm, Victor Kryukov <victor.kryu...@gmail.com> wrote:
>
> > On Apr 15, 12:44 am, Pete <pkoo...@google.com> wrote:
>
> > > Getting large amounts of data from existing apps into the datastore
> > > can be a challenge, and to help we offer a Bulk Data Uploader.  You
> > > can read more about this tool here:
>
> > >http://code.google.com/appengine/articles/bulkload.html
>
> > While moving data in and out is important, and bulkload tools helps a
> > lot in moving data in, there's yet another data-related task: data
> > update. And it can be really challenging with current system design
> > (e.g. w/o long-running processes).

> Since no one else has mentioned it, I will.  How about the ability to
> download the db in a format instantly usable by the development
> server?  Everyone can then have their own local code extract it out in
> whatever format they want.  It is the most directly usable, and I
> would gamble that no matter what format is selected the majority are
> going to pump it in to their local dev environment.  Now I don't hold
> that this is the one true way, you could make the argument for XML
> with some converter script to push it into your dev environment.

Jeff, I second that idea. Now that I think about it, it would be
excellent to be able to synchronize the development and production
databases in some simple way. In that case, bulkload is not needed -
you can populate the database however you want on the development
server and then replicate it to (or completely replace) the production
database. It would also solve the update problem to some extent.

pear

unread,
Apr 16, 2008, 7:28:51 AM4/16/08
to Google App Engine
I prefer the XML approach.

Filip

unread,
Apr 16, 2008, 8:10:41 AM4/16/08
to Google App Engine
Frankly, I'd prefer XML.

However, exports will most often serve as backups to be reloaded into
Google. So whatever format is chosen, it should be symmetrical with the
bulk input. Currently, bulk input only accepts CSV files, so the
output should also support CSV files (which are formatted to cope with
international characters, quotes and other characters that should be
escaped).
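For the escaping side, a plain csv.writer already copes with quotes
and commas; encoding everything as UTF-8 handles international
characters (a sketch, not an official tool):

import csv

def write_rows(path, rows):
    # rows: an iterable of sequences of unicode values.
    out = open(path, 'wb')
    writer = csv.writer(out, quoting=csv.QUOTE_ALL)
    for row in rows:
        writer.writerow([unicode(value).encode('utf-8') for value in row])
    out.close()

write_rows('export.csv', [(u'Z\u00fcrich', u'he said "hi"'), (u'caf\u00e9', u'a,b,c')])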

The ideal option for me is to have bulk input accept XML, and export
to XML.

Filip.

On 15 apr, 07:44, Pete <pkoo...@google.com> wrote:
> Getting large amounts of data from existing apps into the datastore
> can be a challenge, and to help we offer a Bulk Data Uploader.  You
> can read more about this tool here:
>
> http://code.google.com/appengine/articles/bulkload.html
>

Jeff Hinrichs

unread,
Apr 16, 2008, 9:08:40 AM4/16/08
to Google App Engine


On Apr 16, 7:10 am, Filip <filip.verhae...@gmail.com> wrote:
> Frankly, I'd prefer XML.
>
> However, exports will most often serve as backups to be reloaded into
> Google. So whatever format is chosen, it should be symmetical with the
> bulk input. Currently, bulk input only accepts CSV files, so the
> output should also support CSV files (which are formatted to cope with
> international characters, quotes and other characters that should be
> escaped).
>
> The ideal option for me is to have bulk input accept XML, and export
> to XML.
>
I was posting quickly, so I inadvertently slipped into implementation
details by talking about XML and the bulk_uploadclient, which was bad
form on my part. That should read:

2) A supplied utility needs to be able to use it directly.
Scenario - Something bad happens to your data, and you need to push
your latest backup back out to production. It would be best if you
could use whatever you downloaded, without any extra manipulation and
upload it to recover back to that point.


I also thought of another quality:
3) support for ReferenceProperties, Ancestor/Parent relations and
SelfReferenceProperties. I am not sure if this is possible with the
current bulk_uploadclient


As to the format, I am neutral, I am for anything that allows these
qualities.

Regards,

Jeff

Joshua Heitzman

unread,
Apr 16, 2008, 5:58:51 PM4/16/08
to Google App Engine
Being able to do single-command replacements of the entire data set
both locally and on App Engine would be excellent for all of the
reasons mentioned here already. Additionally, a tool that could
produce the diff between two data sets in a format that can then be
used to update a third data set would also be great for updating App
Engine with data added to a set locally. Finally, a tool that could
capture the changes since a particular point in time (or at least the
last checkpoint, with the first being the data set upload) would be
great for updating a local data set with the most recent updates to
the data set on App Engine (it could be used in the reverse direction
instead of the diffing tool just mentioned as well).

I have no problem pushing and pulling data from such a data set
locally via the App Engine APIs, so I wouldn't care what format the
data was in if these tools/scenarios were supported.

Josh Heitzman

Aaron Krill

unread,
Apr 16, 2008, 6:03:19 PM4/16/08
to google-a...@googlegroups.com
I just want to be able to take data I've inserted into the SDK server and throw it into the prod server

Cameron Singe

unread,
Apr 16, 2008, 6:49:42 PM4/16/08
to Google App Engine
Also, this might seem like a strange request, but it would be useful
if the keys could be exported in different formats, i.e. on export the
keys are changed to a GUID or an int.