Feedback on data exporter design?


Pete

unread,
Apr 15, 2008, 1:44:49 AM4/15/08
to Google App Engine
Getting large amounts of data from existing apps into the datastore
can be a challenge, and to help we offer a Bulk Data Uploader. You
can read more about this tool here:

http://code.google.com/appengine/articles/bulkload.html

Conversely, we know many of you would like a better tool for moving
your data *off* of Google App Engine, and we'd like to ask for your
feedback on what formats would be the most useful for a data
exporter. XML output? CSV transform? RDF? Let us know what you
think!

Jared

unread,
Apr 15, 2008, 1:54:55 AM4/15/08
to Google App Engine
Not to beat the django horse to death, but the django data dump
commands can output in xml or json. Those seem like two to start
with.

Also, the option to make the files human-readable (again, like django)
would be good too.

Aaron Krill

unread,
Apr 15, 2008, 2:19:00 AM4/15/08
to Google App Engine
I concur with Jared. XML and JSON are my preferred methods.

What really needs to happen is for a toJson function to be added to
datastore models. Then when you want to dump an entire DB you'd be
given the choice of XML or JSON, and on your side you simply execute a
query that grabs all objects from all models and exports them one
by one.

I would like a format like this for JSON:

{"model_name": "MyModel", "properties": {"myproperty": {"type":
"UserProperty", "options": {"someoption": "value"}}}, "objects":
{"id": 0, "myproperty": "aa...@krillr.com"}}

Thoughts?

Justin Worrell

unread,
Apr 15, 2008, 3:04:00 AM4/15/08
to Google App Engine
I admit that I don't have much experience with this matter, but it
seems like binary data over xml would be messy.

It also seems like binary data would be necessary, since file system
access isn't a part of the engine. For example if someone is doing a
flickr-like application, and has a large collection of existing images
to upload.

Maybe there's a standard way to do this and I'm just dense. At the
very least any method seems like it should be binary friendly.

--
Justin

Cameron Singe

unread,
Apr 15, 2008, 3:12:50 AM4/15/08
to Google App Engine
Justin, for binary content you can use Base64 encoding or some other
string-based encoding to cater for this and still use XML, JSON or
whatever.

Would it be possible to define our own export templates, using say
the Django templates?
Also, some sort of web service where we can poll the data would be
great.

JSON format would be great as well.

Aaron Krill

unread,
Apr 15, 2008, 3:14:19 AM4/15/08
to google-a...@googlegroups.com
There are several options for exporting binary data in XML and JSON formats. The most common is to encode it in Base64. It's not really messy at all.

Also, a majority of content types probably wouldn't need to be exported if ticket #208 comes true. My thought with this is that most content types (images, video, office documents) would be stored in the appropriate Google services (Picasa Web Albums, Google Video, Google Docs, etc.). Then instead of storing these in your Model you'd simply store references to them for each user.

Joscha Feth

unread,
Apr 15, 2008, 3:41:32 AM4/15/08
to Google App Engine

On 15 Apr., 09:04, Justin Worrell <jworr...@gmail.com> wrote:
> I admit that I don't have much experience with this matter, but it
> seems like binary data over xml would be messy.

When I think about migrating a GAE app to a "standard" model (e.g.
some language plus a relational database), the main problem I see is
the model references (e.g. db.Key and parent relationships).
I would also prefer XML output, but the advantage of a binary format
would be that references to other objects could be stored natively. In
my eyes, XML with a lot of reference attributes/tags is suboptimal for
importing the data into a relational database. But at least with XML
it is possible to define a proper schema (specifying all models,
db.Key, etc.) which standardizes the export - with JSON this would not
be that easy, so JSON is probably only good for small exports and not
worth thinking about here anyway, because every developer should be
able to write a few lines of code to transform his/her specific model
into JSON.
The other thing about references is that exporting while maintaining
all relationships between all models would need one big XML file. If
we take Justin's flickr example, that XML file would additionally be
bloated with base64-encoded image data - which makes the XML export
huge and pretty tough to handle...
How about providing a general interface with some filters already
included (e.g. for XML), but also making it possible for developers to
write their own export filters?

regards,
Joscha

Aaron Krill

unread,
Apr 15, 2008, 3:59:08 AM4/15/08
to google-a...@googlegroups.com
If you want your own export filter, simply write it into your application. It isn't hard to do.

As far as XML being bloated with Base64-encoded binary data, I don't see how this could be hard to handle. This article here has some good info on it:

All you would have to do to handle the binary data is parse the XML tree, grab the base64 and decode it. Simple. How is this difficult? This is the standard for any XML-based document format that must contend with binary data (SVG containing scalar graphics comes to mind). Also ODF and the like which may contain similar image data.
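
Just to illustrate how little code that is - a minimal sketch of
round-tripping a blob through an XML property element (the element
layout here is made up for the example):

import base64
from xml.dom import minidom

def blob_to_element(doc, name, data):
    # <property name="..." type="blob">BASE64...</property>
    el = doc.createElement('property')
    el.setAttribute('name', name)
    el.setAttribute('type', 'blob')
    el.appendChild(doc.createTextNode(base64.b64encode(data)))
    return el

def element_to_blob(el):
    # Decode the text content back into raw bytes.
    return base64.b64decode(el.firstChild.data)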

I don't see why JSON would not be suitable for a large export -- the format is small and easy to understand. It's also far easier to import and export from other python applications. In some cases, parsing a JSON "packet" is considerably faster than parsing XML as well.

Joscha Feth

unread,
Apr 15, 2008, 4:09:25 AM4/15/08
to Google App Engine
On 15 Apr., 09:59, "Aaron Krill" <aa...@krillr.com> wrote:
> If you want your own export filter, simply write it into your application.
> It isn't hard to do.

sure - but why reinvent the wheel a thousand times? A general API
with open filters would pool our efforts - and would also make the
filters less buggy than everyone developing their own

> All you would have to do to handle the binary data is parse the XML tree,
> grab the base64 and decode it. Simple. How is this difficult? This is the
> standard for any XML-based document format that must contend with binary
> data (SVG containing scalar graphics comes to mind). Also ODF and the like
> which may contain similar image data.

I didn't mean base64-encoded data itself is hard to handle - I meant a
single XML file with X Gigabyte base64-encoded data in it isn't easy
to handle.

> I don't see why JSON would not be suitable for a large export -- the format
> is small and easy to understand. It's also far easier to import and export
> from other python applications. In some cases, parsing a JSON "packet" is
> considerably faster than parsing XML as well.

so how would you translate the references to other models into JSON?
Just by writing out the key name as a string? The nice thing about
XML with a schema is that it is self-describing - the nice thing
about a binary format would be that the references could be modeled as
pointers - just the key name as a string now seems messy to me - but
as I said before: for small exports this might be the right thing to
do...

regards,
Joscha

Aaron Krill

unread,
Apr 15, 2008, 4:16:11 AM4/15/08
to google-a...@googlegroups.com
On Tue, Apr 15, 2008 at 1:09 AM, Joscha Feth <jos...@feth.com> wrote:

> On 15 Apr., 09:59, "Aaron Krill" <aa...@krillr.com> wrote:
> > If you want your own export filter, simply write it into your application.
> > It isn't hard to do.
>
> sure - but why reinvent the wheel a thousand times? A general API
> with open filters would pool our efforts - and would also make the
> filters less buggy than everyone developing their own

So then simply release your filter code and others can include it in their own projects.
 


> > All you would have to do to handle the binary data is parse the XML tree,
> > grab the base64 and decode it. Simple. How is this difficult? This is the
> > standard for any XML-based document format that must contend with binary
> > data (SVG containing scalar graphics comes to mind). Also ODF and the like
> > which may contain similar image data.
>
> I didn't mean base64-encoded data itself is hard to handle - I meant a
> single XML file with X Gigabyte base64-encoded data in it isn't easy
> to handle.

This is true. Perhaps the export filter could be set in such a way that if there are objects that would translate to very large XML files, each object would be in its own file.



> > I don't see why JSON would not be suitable for a large export -- the format
> > is small and easy to understand. It's also far easier to import and export
> > from other python applications. In some cases, parsing a JSON "packet" is
> > considerably faster than parsing XML as well.
>
> so how would you translate the references to other models into JSON?
> Just by writing out the key name as a string? The nice thing about
> XML with a schema is that it is self-describing - the nice thing
> about a binary format would be that the references could be modeled as
> pointers - just the key name as a string now seems messy to me - but
> as I said before: for small exports this might be the right thing to
> do...
 
If you look at the example I already provided, it allows for you to define a property type. This includes the ReferenceType and any arguments associated with it. If you write something to handle a JSON export you can make it understand these types and arguments, can you not?
 


> regards,
> Joscha


Joscha Feth

unread,
Apr 15, 2008, 4:28:33 AM4/15/08
to Google App Engine
Hi,

> > sure - but why inventing the wheel a thousand times - a general API
> > with open filters would bundle our powers - and also makes the filters
> > more bug free as anyone developing filters on their own
>
> So then simply release your filter code and others can include it in their
> own projects.
yep, that's what I am talking about - my idea was to combine the
export functionalities given by Google and our ideas on one
platform...

> > I didn't mean base64-encoded data itself is hard to handle - I meant a
> > single XML file with X Gigabyte base64-encoded data in it isn't easy
> > to handle.
>
> This is true. Perhaps the export filter could be set in such a way that if
> there are objects that would translate to very large XML files, each object
> would be in its own file.
yeah - so giving a parameter such as externalize="image/*" would store
all properties matching this MIME type in external files, and the XML
would just include a link to each.


> If you look at the example I already provided, it allows for you to define a
> property type. This includes the ReferenceType and any arguments associated
> with it. If you write something to handle a JSON export you can make it
> understand these types and arguments, can you not?

I admit that it is possible - it just didn't seem convenient to me to
flatten the references - but in turn the key names are just perfect
for that - are key names unique throughout the whole GAE ecosystem or
just within one project?!

greets,
J

Tobias

unread,
Apr 15, 2008, 4:37:17 AM4/15/08
to Google App Engine
What about exporting in standard SQL format? I guess most of the data
will be going more or less into a DBMS, so this would make that easier.
OK, it might not fit some rare circumstances - but it should be fine
for most users.
What do you think?

Tobias

Aaron Krill

unread,
Apr 15, 2008, 4:38:34 AM4/15/08
to google-a...@googlegroups.com
The thought of exporting binary blobs makes me want to gag.

David N

unread,
Apr 15, 2008, 4:38:52 AM4/15/08
to Google App Engine
OOXML?

Joscha Feth

unread,
Apr 15, 2008, 4:45:58 AM4/15/08
to Google App Engine
Are you talking about generating a binary file of some format (e.g.
MyISAM, etc.) or about generating the CREATE and INSERT statements? If
the second: what is standard? I don't know of two RDBMSs sharing
exactly the same set of SQL commands... Also, the modelling with GAE
is very different from how you would do it in a relational database...
but the idea of having an SQL statement export seems tempting - one
would be able to export the data without losing any information and
then remodel the structure within the target RDBMS.

Tobias

unread,
Apr 15, 2008, 4:59:33 AM4/15/08
to Google App Engine
Exactly, I meant standard SQL code like CREATE and INSERT. Just the
plain and standard stuff (yes, there are some standards - at least
historically), no DBMS-specific code (for the beginning).

OT / One other feature would be the ability to make simple changes to
the data stored in GAE - e.g. renaming existing data structures,
inserting random data directly in the GAE control center and other
simple stuff. This should not be too dramatic for security reasons, so
it might be a low-hanging fruit. ;-)

Import: of course RSS, CSV/TSV, XML and (standard) SQL, too (all with
UTF-8 encoding)

jat...@gmail.com

unread,
Apr 15, 2008, 5:23:50 AM4/15/08
to Google App Engine
What about AMF? It can handle recursive objects and it is binary.

It would be simple to export from an existing ORM in most languages
and reimport in BigTable.

regards,

Javier.

TeunD

unread,
Apr 15, 2008, 7:23:37 AM4/15/08
to Google App Engine
Apparently I'm the only one, but I would like AtomPub. It has the
advantages of XML, and can handle binary downloads as, well, binary
downloads. It combines a data formatting protocol with a protocol for
references, both internal to the file and external to related content.
Think blog entries and comments in Blogger. JSON seems handy when you
want to download part of the data to a web browser client for display,
but I fail to see why you would want JSON-formatted data as a generic
bulk import/export format.


martian

unread,
Apr 15, 2008, 12:39:31 PM4/15/08
to Google App Engine
Yes, pure text-based SQL output as is generated by pg_dump (PostgreSQL
dump tool) or equivalent would be most useful. Then the data can be
imported straight into some local database by simply piping it into
the client:

psql my_postgresql_database < my_gae_sql_dump.txt

A similar operation could work for MySQL or Oracle, and if the data is
in "pure" standard SQL3 (CREATE and INSERT only), it should be very
portable. If you could provide some suggested CREATE INDEX statements
that Google uses to speed up the BigTable queries that would be great
too, though probably extra-credit.

XML and JSON are also good options.
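
To make the idea concrete, here's a rough sketch of how plain
CREATE/INSERT statements could be generated from one kind. It's only
illustrative - everything is dumped as TEXT, and real code would need
proper type mapping and escaping:

from google.appengine.ext import db

def sql_dump(model_class):
    kind = model_class.kind()
    props = sorted(model_class.properties().keys())
    # One table per kind, keyed on the datastore key string.
    yield 'CREATE TABLE %s (pk VARCHAR(255) PRIMARY KEY, %s);' % (
        kind, ', '.join(['%s TEXT' % p for p in props]))
    for entity in model_class.all():
        values = [str(entity.key())] + [getattr(entity, p) for p in props]
        quoted = ["'%s'" % unicode(v).replace("'", "''") for v in values]
        yield 'INSERT INTO %s (pk, %s) VALUES (%s);' % (
            kind, ', '.join(props), ', '.join(quoted))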

m++

On Apr 15, 1:59 am, Tobias <tkl...@enarion.it> wrote:
> Exactly, I meant standard SQL code like CREATE and INSERT. Just the

martian

unread,
Apr 15, 2008, 12:52:41 PM4/15/08
to Google App Engine
Yes, generating a text-based file of SQL statements similar to what is
generated by pg_dump (PostgreSQL dump tool) would be great. This
could then be imported by simply piping to the database client tool:

psql my_database < my_gae_sql_dump

If the SQL statements are all "pure standard" SQL3 (CREATE and INSERT
only) they should be very portable and work the same way for MySQL,
Oracle, and others. Also including any CREATE INDEX statements that
Google uses to optimize the BigTable queries would be a great plus --
extra-credit.

Being able to generate XML, JSON, or even CSV for individual tables
would be useful as well.

m++

On Apr 15, 1:59 am, Tobias <tkl...@enarion.it> wrote:

MattCruikshank

unread,
Apr 15, 2008, 1:04:42 PM4/15/08
to Google App Engine
It strikes me as a bizarre way you've asked the question, Pete.
"XML, CSV, RDF?" Why, I'd like to access it through GQL, of course!

Easily half of the reason for wanting to download and then later
upload the data is for backups. And no, I don't mean the kind of
backup where I think Google is likely to lose my data - I mean the
kind of backup where *I'm* likely to make a coding error that destroys
my data, and I'd like to be able to roll back to my last snapshot that
I saved myself. So, I don't care what format that's in, as long as
download and upload work flawlessly. I know that's complicated,
because the data can be changing while you download, but I'd like your
best stab at it.

Also, my other reason for wanting this is that I'd like to download,
process locally, and then upload the results on a periodic (every 24
hours?) basis. (That is, until Google makes available its own CRON
processing, hopefully with the full power of Map Reduce. *grin*)

So, if I could download the data in *any* format, as long as I can
access it on my local machine through GQL in my own Python scripts,
that'd be fantastic.

Thanks,
-Matt

Frank

unread,
Apr 15, 2008, 1:19:18 PM4/15/08
to Google App Engine
Other formats might be interesting, but I think you can't avoid XML.
If anything, it has to be at least XML.

Now, it is true that datastores with lots of blobs might then export
huge XML files if you include the blobs as base64-encoded elements.

So why not encapsulate everything in a zip file instead, containing an
XML file per object class, but including the binary fields as files
named using their entity key?
Those files could be organized in nested folders using their class
name and property name:

Zip file
  class1.xml
  class2.xml
  +class1
    +blob_property1
      entity_key1.bin
      entity_key2.bin
    +blob_property2
      entity_key1.bin
  ...

(+ means folder)
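
A rough sketch of how such an archive could be written with Python's
zipfile module (the layout follows the tree above; names and the input
structure are illustrative):

import zipfile

def write_export(filename, kinds):
    # kinds: dict of kind name -> (xml_string, blobs), where blobs maps
    # property name -> {entity_key_string: raw_bytes}
    archive = zipfile.ZipFile(filename, 'w', zipfile.ZIP_DEFLATED)
    for kind, (xml_string, blobs) in kinds.items():
        archive.writestr('%s.xml' % kind, xml_string)
        for prop, entities in blobs.items():
            for key, data in entities.items():
                archive.writestr('%s/%s/%s.bin' % (kind, prop, key), data)
    archive.close()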

Anyhow, I think this export feature needs to be as generic as
possible; that's why I would recommend XML, and maybe a compressed zip
file like this to make the export easier and readable on every
platform.

Frank

Tobias

unread,
Apr 15, 2008, 2:10:21 PM4/15/08
to Google App Engine
It would also be great to version the data stored by Google and
export/download or roll back the data of a specific version.
Say I'm uploading a new version and would like to back up the current
data while setting a backup point / version, and be able to roll back
to it. I don't know if this is currently possible, but I guess not for
the public App Engine apps hosted on appspot.com.

BTW - Frank's idea looks great to me. (ZIP file with XML + bin
files)
zXML++

Brett Morgan

unread,
Apr 15, 2008, 6:53:38 PM4/15/08
to google-a...@googlegroups.com
That only makes sense if your datastore structure can be output in a
format that makes sense in pgSQL. I'm not sure that will always be
possible.

Ben the Indefatigable

unread,
Apr 15, 2008, 7:20:49 PM4/15/08
to Google App Engine
I don't see any difficult issues as long as you include the string
encoding of all the keys so we can put back together the relationships
and references if/when we need to. JSON and XML are fine, but to_xml
methods on the entities would be handy to help with homemade dumps.
You can't try to convert to a relational data model, but if all the
entities of each kind happen to have all the same properties then it
will look relational. Before each kind, you could include the to_xml
of the Model Instance. You'll need some means of a) paging to download
it in pieces and b) filtering in case the huge base64 binaries are not
wanted.
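
As a rough illustration of the paging part, an exporter could walk a
kind in key order and resume from the last key it saw (a sketch only -
the batch size and the __key__ filter syntax are assumptions about
what the datastore query API allows):

from google.appengine.ext import db

def fetch_page(model_class, last_key=None, batch_size=100):
    # Return one batch of entities plus the key to resume from.
    query = model_class.all().order('__key__')
    if last_key is not None:
        query.filter('__key__ >', last_key)
    entities = query.fetch(batch_size)
    next_key = entities[-1].key() if entities else None
    return entities, next_key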

Darryl Cousins

unread,
Apr 15, 2008, 7:26:15 PM4/15/08
to google-a...@googlegroups.com
Hi,

On Tue, 2008-04-15 at 10:19 -0700, Frank wrote:
> Other formats might be interesting, but I think you can't avoid XML.
> if anything it has to be at least XML.
>
> now this is true that datastores with lots of blobs might then export
> huge XML files if you include them as base64 encoded elements.
>
> So why not encapsulate everything in a zip file instead, containing an
> xml file per object class, but including the binary fields as files
> named using their entity key.
> those files could be organized in nested folders using their class
> name and property name:

Something like this would also be my choice. I use a similar pattern for
importing/exporting data of objects stored in ZODB, but I use config.ini
and contents.csv instead of XML. I find it useful because clients can
also easily browse and read these formats.

Darryl

Ben the Indefatigable

unread,
Apr 15, 2008, 7:37:15 PM4/15/08
to Google App Engine
>to_xml methods on the entities would be handy to help with homemade dumps.

I thought that to_xml was on the model class, not the entity; now I
see it is already available on the entity.

Never mind. BTW, looking at an example of to_xml, it looks good.

<entity kind="Greeting" key="agpoZWxsb3dvcmxkcg4LEghHcmVldGluZxgBDA">
  <key>tag:helloworld.gmail.com,2008-04-15:Greeting[agpoZWxsb3dvcmxkcg4LEghHcmVldGluZxgBDA]</key>
  <property name="author" type="user">te...@example.com</property>
  <property name="content" type="string">Test</property>
  <property name="date" type="gd:when">2008-04-09 06:08:50.609000</property>
</entity>

Victor Kryukov

unread,
Apr 15, 2008, 9:10:16 PM4/15/08
to Google App Engine


On Apr 15, 12:44 am, Pete <pkoo...@google.com> wrote:
> Getting large amounts of data from existing apps into the datastore
> can be a challenge, and to help we offer a Bulk Data Uploader.  You
> can read more about this tool here:
>
> http://code.google.com/appengine/articles/bulkload.html

While moving data in and out is important, and the bulkload tool helps
a lot in moving data in, there's yet another data-related task: data
updates. And they can be really challenging with the current system
design (e.g. without long-running processes).

For example, I have uploaded some 400 entries and expect to add
thousands more. I already see the need to update my data (e.g. add
long/lat information based on address, or add a country field to
expand my program to an international audience), yet there's no
convenient way to do that.

One potential solution is to delete all entries one by one and then
re-upload them - needless to say, that's grossly inefficient and
inconvenient. Some version of bulkload which could update existing
entries based on some key fields and additional rows in a CSV file
would be really helpful.
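
To give a flavour of what I mean, an updater could match existing
entities by key_name and overwrite just the fields supplied (a sketch
only - the model, the use of key_name and the field names are
illustrative):

from google.appengine.ext import db

def update_from_rows(model_class, rows, fields):
    # rows: iterable of dicts containing 'key_name' plus the fields to set.
    updated = []
    for row in rows:
        entity = model_class.get_by_key_name(row['key_name'])
        if entity is None:
            continue
        for field in fields:
            setattr(entity, field, row[field])
        updated.append(entity)
    db.put(updated)  # one batch put instead of a write per entity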

Regards,
Victor

Jeff Hinrichs

unread,
Apr 15, 2008, 11:01:29 PM4/15/08
to Google App Engine
Since no one else has mentioned it, I will. How about the ability to
download the db in a format instantly usable by the development
server? Everyone can then have their own local code extract it out in
whatever format they want. It is the most directly usable, and I
would gamble that no matter what format is selected the majority are
going to pump it into their local dev environment.  Now I don't hold
that this is the one true way, you could make the argument for XML
with some converter script to push it into your dev environment.

Here is a partial list of qualities a solution should have:
0) The most common operations with the downloaded data should be
trivial to do. For instance, restoring it to the dev environment

1) Bandwidth friendly, both for Google and us developers, so I don't
see how it could be practical without taking whatever intermediate
structure and compressing it for transmission. The answer has to keep
in mind that it's not just your data store, but potentially tens of
thousands of datastores that people are going to be downloading for a
variety of reasons.

2) The bulk_uploadclient needs to be able to use it directly.
Scenario - Something bad happens to your data, and you need to push
your latest backup back out to production. It would be best if you
could use whatever you downloaded, without any extra manipulation and
upload it to recover back to that point.

regards,

Jeff

Victor Kryukov

unread,
Apr 16, 2008, 12:20:46 AM4/16/08
to Google App Engine
On Apr 15, 10:01 pm, dundeemt <dunde...@gmail.com> wrote:
> On Apr 15, 8:10 pm, Victor Kryukov <victor.kryu...@gmail.com> wrote:
>
> > On Apr 15, 12:44 am, Pete <pkoo...@google.com> wrote:
>
> > > Getting large amounts of data from existing apps into the datastore
> > > can be a challenge, and to help we offer a Bulk Data Uploader.  You
> > > can read more about this tool here:
>
> > >http://code.google.com/appengine/articles/bulkload.html
>
> > While moving data in and out is important, and bulkload tools helps a
> > lot in moving data in, there's yet another data-related task: data
> > update. And it can be really challenging with current system design
> > (e.g. w/o long-running processes).

> Since no one else has mentioned it, I will.  How about the ability to
> download the db in a format instantly usable by the development
> server?  Everyone can then have their own local code extract it out in
> whatever format they want.  It is the most directly usable, and I
> would gamble that no matter what format is selected the majority are
> going to pump it in to their local dev environment.  Now I don't hold
> that this is the one true way, you could make the argument for XML
> with some converter script to push it into your dev environment.

Jeff, I second that idea. Now that I think about it, it would be
excellent to be able to synchronize the development and production
databases in some simple way. In that case, bulkload is not needed -
you can populate the database however you want on the development
server and then replicate it to (or completely replace) the production
database. It would also solve the update problem to some extent.

pear

unread,
Apr 16, 2008, 7:28:51 AM4/16/08
to Google App Engine
I prefer the XML approach.

Filip

unread,
Apr 16, 2008, 8:10:41 AM4/16/08
to Google App Engine
Frankly, I'd prefer XML.

However, exports will most often serve as backups to be reloaded into
Google. So whatever format is chosen, it should be symmetrical with the
bulk input. Currently, bulk input only accepts CSV files, so the
output should also support CSV files (which are formatted to cope with
international characters, quotes and other characters that should be
escaped).

The ideal option for me is to have bulk input accept XML, and export
to XML.

Filip.

On 15 apr, 07:44, Pete <pkoo...@google.com> wrote:
> Getting large amounts of data from existing apps into the datastore
> can be a challenge, and to help we offer a Bulk Data Uploader.  You
> can read more about this tool here:
>
> http://code.google.com/appengine/articles/bulkload.html
>

Jeff Hinrichs

unread,
Apr 16, 2008, 9:08:40 AM4/16/08
to Google App Engine


On Apr 16, 7:10 am, Filip <filip.verhae...@gmail.com> wrote:
> Frankly, I'd prefer XML.
>
> However, exports will most often serve as backups to be reloaded into
> Google. So whatever format is chosen, it should be symmetical with the
> bulk input. Currently, bulk input only accepts CSV files, so the
> output should also support CSV files (which are formatted to cope with
> international characters, quotes and other characters that should be
> escaped).
>
> The ideal option for me is to have bulk input accept XML, and export
> to XML.
>
I was posting quickly so I inadvertently slipped into implementation
details by talking about XML and the bulk_uploadclient, which was bad
form on my part. That should read:

2) A supplied utility needs to be able to use it directly.
Scenario - Something bad happens to your data, and you need to push
your latest backup back out to production. It would be best if you
could use whatever you downloaded, without any extra manipulation and
upload it to recover back to that point.


I also thought of another quality:
3) support for ReferenceProperties, Ancestor/Parent relations and
SelfReferenceProperties. I am not sure if this is possible with the
current bulk_uploadclient


As to the format, I am neutral, I am for anything that allows these
qualities.

Regards,

Jeff

Joshua Heitzman

unread,
Apr 16, 2008, 5:58:51 PM4/16/08
to Google App Engine
Being able to do single-command replacements of the entire data set,
both locally and on App Engine, would be excellent for all of the
reasons mentioned here already. Additionally, a tool that could
produce the diff between two data sets in a format that can be used to
update a third data set would also be great for updating App Engine
with data added to a set locally. Finally, a tool that could capture
the changes since a particular point in time (or at least the last
checkpoint, with the first being the data set upload) would be great
for updating a local data set with the most recent updates to the data
set on App Engine (it could be used in the reverse direction instead
of the diffing tool just mentioned as well).

I have no problem pushing and pulling data from such a data set
locally via the App Engine APIs, so I wouldn't care what format the
data was in if these tools/scenarios were supported.

Josh Heitzman

Aaron Krill

unread,
Apr 16, 2008, 6:03:19 PM4/16/08
to google-a...@googlegroups.com
I just want to be able to take data I've inserted into the SDK server and throw it into the prod server

Cameron Singe

unread,
Apr 16, 2008, 6:49:42 PM4/16/08
to Google App Engine
Also, this might seem like a strange request, however it would help if
the keys could be exported in different formats, i.e. on export the
keys are changed to a GUID or int.

Alfonso

unread,
Apr 18, 2008, 12:38:49 PM4/18/08
to Google App Engine
Formats should be a function of the objective behind the data exchange
between the (at least two) parties involved.

If the data size is relatively small, and the XML parser can swallow it,
then XML is likely a good option.

However... if we are talking about data-intensive uploads /
downloads (initial data loads, data corrections (longitudinal,
historical), backups)... then XML will likely not scale.

Scalability and performance are the reasons why database suppliers have
come up with things such as 'bulk loaders', 'external tables' (based on
flat files as opposed to DB files) and a long etcetera.

So... the discussion about formats will be properly framed once:
1) the context of the data exchange is better defined (based on
categories, for example 'small', 'medium', 'large', 'huge'), and
2) the expected response times of the data exchange are specified.

On Apr 15, 1:44 am, Pete <pkoo...@google.com> wrote:
> Getting large amounts of data from existing apps into the datastore
> can be a challenge, and to help we offer a Bulk Data Uploader.  You
> can read more about this tool here:
>
> http://code.google.com/appengine/articles/bulkload.html
>

pkeane

unread,
Apr 21, 2008, 12:35:07 AM4/21/08
to Google App Engine
I would second the request for AtomPub for moving large amounts of
data into Google App Engine. It's ideal, really. (I'd also make a
pitch for an AtomPub server as a demo app in the SDK -- it could get
lots of folks started on the road to Atom/AtomPub bliss).



On Apr 15, 6:23 am, TeunD <goo...@duynstee.com> wrote:
> Apparently I'm the only one, but I would like AtomPub. It has the

Toby DiPasquale

unread,
Apr 21, 2008, 9:09:42 AM4/21/08
to Google App Engine
On Apr 15, 1:44 am, Pete <pkoo...@google.com> wrote:
> Getting large amounts of data from existing apps into the datastore
> can be a challenge, and to help we offer a Bulk Data Uploader.  You
> can read more about this tool here:
>
> http://code.google.com/appengine/articles/bulkload.html
>
> Conversely, we know many of you would like a better tool for moving
> your data *off* of Google App Engine, and we'd like to ask for your
> feedback on what formats would be the most useful for a data
> exporter.  XML output? CSV transform? RDF? Let us know how what you
> think!

In addition to the more standard formats, I'd like to see JSON output.
This seems to me to have the least impedance mismatch with how BigTable
data is stored, and would leave the least possibility for data loss or
mangling.

--
Toby DiPasquale

termie

unread,
Apr 21, 2008, 2:50:31 PM4/21/08
to Google App Engine
For the record, we've got some of this stuff already working well
enough to play with (JSON, XML and YAML) in the helper kit for Django
(thread: http://groups.google.com/group/google-appengine/browse_thread/thread/6f2737aad1239a07);
a quick hack of a handler will let you accept posted serialized data:

# (imports assumed; in the Django helper these come from the bundled
# Django copy - adjust to your project layout)
from StringIO import StringIO
from django.core import serializers
from django.http import HttpResponse

def api_loaddata(request):
    # load serializers
    format = request.POST.get('format', 'json')
    fixture = request.POST.get('fixture', '[]')
    fixture_ref = StringIO(fixture)

    def _loaddata():
        try:
            count = 0
            models = set()
            objects = serializers.deserialize(format, fixture_ref)
            for obj in objects:
                count += 1
                models.add(obj.object.__class__)
                real_obj = obj.object
                real_obj.put()
            return count
        except Exception, e:
            raise

    # count = db.run_in_transaction(_loaddata)
    count = _loaddata()

    return HttpResponse("Loaded %s items from fixture" % count)
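
And pushing a fixture at it from a local machine is just a form POST
(the URL and field names here match the sketch above; adjust them to
your URL mapping):

import urllib, urllib2

fixture = open('fixture.json').read()
body = urllib.urlencode({'format': 'json', 'fixture': fixture})
print urllib2.urlopen('http://localhost:8080/api/loaddata', body).read()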



My biggest concern is having an interface where I can push more data
at a time without having to make sure all requests fit under 3
seconds.

--andy

Nico

unread,
Apr 22, 2008, 3:12:14 AM4/22/08
to Google App Engine
How about setting up 2 download options:

1) Non-blob data, which would just be the reverse of the "Bulk Data
Uploader"; easiest would be CSV files, which people could then use to
populate their SQL tables, etc. Include a quick search-and-replace for
commas in the data itself to ensure proper CSV structure. I'm assuming
this would be rather easy to create, as most of the required code is
already in the "Bulk Data Uploader".

2) For Blob data, use one of the more sophisticated options discussed
above.

Saves headaches for people who are not incorporating pictures, movies,
misc files, etc into their data-models, and still allows for the more
complex data downloads.
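
For the non-blob half, Python's csv module already takes care of
quoting commas and newlines, so a minimal per-kind dump could be as
simple as this sketch (property handling is simplified):

import csv

def dump_kind_csv(model_class, out):
    # Header row of property names, then one row per entity, UTF-8 encoded.
    props = sorted(model_class.properties().keys())
    writer = csv.writer(out)
    writer.writerow(['__key__'] + props)
    for entity in model_class.all():
        row = [str(entity.key())]
        for p in props:
            value = getattr(entity, p)
            row.append(unicode(value).encode('utf-8') if value is not None else '')
        writer.writerow(row)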

javaDinosaur

unread,
Apr 22, 2008, 3:26:07 PM4/22/08
to Google App Engine
> My biggest concern is having an interface where I can push more data
> at a time without having to make sure all requests fit under 3
> seconds.

Would the App Engine team consider implementing any core JSON
conversion routines in C, please? Looking at the following benchmarks
of Python JSON serialization, this is one area where the performance
win of C code can be 5 to 20 times.

http://blog.modp.com/2008/04/python-and-json-performance.html
http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html

Also, could such a high-performance JSON API be made available
independently of the data export module? I ask because in my proposed
single-page RIA application the majority of data flowing between web
server and client browser will be data objects serialized in JSON
format. ExtJS defaults to using JSON for transporting data, and I
believe this is a trend in Web 2.0 JavaScript frameworks.

Brett Morgan

unread,
Apr 23, 2008, 12:51:25 AM4/23/08
to google-a...@googlegroups.com
Remember the focus here is about security, not outright speed.

Are you chewing a lot of cpu with your communications? I'd expect
reasonably similar levels of cpu usage between json and django
templating, honestly.

javaDinosaur

unread,
Apr 23, 2008, 8:07:48 PM4/23/08
to Google App Engine
> On Apr 23, 5:51 am, "Brett Morgan" <brett.mor...@gmail.com> wrote:
> Remember the focus here is about security, not outright speed.

Sure. I appreciate that the App Engine team would have to scrutinize
each line of any public domain C JSON serialization code they admitted
to their Python runtime library. My intention was to raise awareness
that in some RIA-based applications all client-server HTTP exchanges
will be nothing but data serialized in JSON format once .js and image
files are loaded. In fact, with an Adobe AIR application or Gears all
static files would be pre-installed at the client. Another danger
inherent in my proposal is that a hacker might discover a malicious,
corrupt JSON payload that would cause the C routine to run over buffers
and crash the Python runtime hosting an app.

From my perspective I just see JSON ousting XML in this use-case.

> Are you chewing a lot of cpu with your communications? I'd expect
> reasonably similar levels of cpu usage between json and django
> templating, honestly.

I don't have anything up and running yet in Python, but taking the case
of a client grid paging forward 100 rows at a time, that would equate
to 100 entities serialized to JSON on the server for each request. My
hunch is that a 5 or 10 times gain in serialization efficiency via a C
function call would have a material effect on the total CPU cycles
that clock up against an App Engine application allowance during a busy
day.

In the case of a 5000-entity bulk data exchange I think we would then
be talking about a measurable response time difference on a human-
perceived time scale.

javaDinosaur

unread,
Apr 23, 2008, 8:03:52 PM4/23/08
to Google App Engine
> On Apr 23, 5:51 am, "Brett Morgan" <brett.mor...@gmail.com> wrote:
> Remember the focus here is about security, not outright speed.

Sure. I appreciate that the App Engine team would have to scrutinize
each line of any public domain C JSON serialization code they admitted
to their Python runtime library. My intention was to raise awareness
that in some RIA-based applications all client-server HTTP exchanges
will be nothing but data serialized in JSON format once .js and image
files are loaded. In fact, with an Adobe AIR application or Gears all
static files would be pre-installed at the client. Another danger
inherent in my proposal is that a hacker might discover a malicious,
corrupt JSON payload that would cause the C routine to run over buffers
and crash the Python runtime hosting an app.

From my perspective I just see JSON ousting XML in this Ajax use-case.
Enough said.

> Are you chewing a lot of cpu with your communications? I'd expect
> reasonably similar levels of cpu usage between json and django
> templating, honestly.

Brett Morgan

unread,
Apr 23, 2008, 8:27:57 PM4/23/08
to google-a...@googlegroups.com
On Thu, Apr 24, 2008 at 10:07 AM, javaDinosaur
<jonat...@hotmail.co.uk> wrote:
>
> > On Apr 23, 5:51 am, "Brett Morgan" <brett.mor...@gmail.com> wrote:
> > Remember the focus here is about security, not outright speed.
>
> Sure. I appreciate that the App Engine team would have to scrutinize
> each line of any public domain C json serialization code they admitted
> to their Phython runtime library. My intention was to raise awareness
> that in some RIA based applications all client-server HTTP exchanges
> will be nothing but data serialized in JSON format once .js and image
> files are loaded. In fact with an Adobe AIR application or Gears all
> static files would be pre-installed at the client. Another danger
> inherent in my proposal is that a hacker might discover a malicious
> corrupt JSON format that would cause the C routine to run over buffers
> and crash the Phython runtime hosting an App.
>
> From my perspective I just see JSON ousting XML in this use-case.

I've been hoping for that to happen for a while. Both GWT and Flex
have pretty good support for JSON.

> > Are you chewing a lot of cpu with your communications? I'd expect
> > reasonably similar levels of cpu usage between json and django
> > templating, honestly.
>
> I don't have anything up and running yet in Python but taking the case
> of a client grid paging forward 100 rows at a time, that would equate
> to 100 entities serialized to json on the server each request. My
> hunch is that a 5 or 10 times gain in serialization efficiency via a C
> function call would have a material effect on the CPU cycles total
> that clock up against a App Engine application allowance during a busy
> day.
>
> In the case of a 5000 entity bulk data exchange I think we would then
> be talking about a measurable response time difference on a human
> perceived time scale.

Yup, I agree. On the flip side, having done a touch of HCI, I know
there are plenty of ways of hiding that time impact, or at least
allowing the user to continue doing other things while waiting for
data to load.

Danny Ayers

unread,
Apr 28, 2008, 1:27:33 PM4/28/08
to Google App Engine
On Apr 15, 7:44 am, Pete <pkoo...@google.com> wrote:

> Conversely, we know many of you would like a better tool for moving
> your data *off* of Google App Engine, and we'd like to ask for your
> feedback on what formats would be the most useful for a data
> exporter. XML output? CSV transform? RDF? Let us know how what you
> think!

The App Engine data looks like it should map pretty well to the RDF
model, which would allow a variety of consistent, standards-based
formats -

* RDF/XML - hard on the eye, but supported by pretty much every RDF
tool
* Turtle - human friendly, good for debugging (and hand authoring):
http://en.wikipedia.org/wiki/Turtle_(syntax)
* RDF/JSON - good for scripting languages: http://n2.talis.com/wiki/RDF_JSON_Specification

Being a simple entity-relation style model which names things with
URIs, RDF makes data integration (mashups!) fairly trivial.
The SPARQL query language would be available, and through it other
application-specific formats.

Immediate benefits would be the ability to hook up with existing
tools, e.g. Tim Berners-Lee's Tabulator data browser:
http://www.w3.org/2005/ajar/tab

- easy integration with linked data :
http://en.wikipedia.org/wiki/Linked_Data

Lots of programming APIs are available, here's a list of developer
tools (in need of updating):
http://www4.wiwiss.fu-berlin.de/bizer/toolkits/

Longer term, a big benefit would be that App Engine applications could
be on the Semantic Web out of the box.
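
For example, a single exported entity might come out roughly like this
through rdflib (purely a sketch - the namespace URI and property
mapping are made up, and the import path varies between rdflib
versions):

from rdflib import Graph, Literal, Namespace

APP = Namespace('http://example.appspot.com/data/')
g = Graph()
entity = APP['Greeting/agpoZWxsb3dvcmxkcg4LEghHcmVldGluZxgBDA']
g.add((entity, APP['author'], Literal('te...@example.com')))
g.add((entity, APP['content'], Literal('Test')))
print g.serialize(format='xml')   # RDF/XML; newer rdflib versions can emit Turtle too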

Cheers,
Danny.

tijer

unread,
Apr 30, 2008, 9:28:49 AM4/30/08
to Google App Engine
So, how far are we from a public release?

It would be great to have an exporter that would work well with the
(bulk) importer utility.

/

David Cifuentes

unread,
May 7, 2008, 4:03:33 AM5/7/08
to Google App Engine
My vote is for RDF. That can be the trigger for mainstream Semantic
Web. It is a lot more powerful than the other formats in terms of
structure of your data.

On 15 abr, 00:44, Pete <pkoo...@google.com> wrote:
> Getting large amounts of data from existing apps into the datastore
> can be a challenge, and to help we offer a Bulk Data Uploader. You
> can read more about this tool here:
>
> http://code.google.com/appengine/articles/bulkload.html
>

Brett Morgan

unread,
May 7, 2008, 4:44:02 AM5/7/08
to google-a...@googlegroups.com
Silly question, have you actually worked with rdf? Especially rdf
expressed in xml?

Don't get me wrong, I like some of the concepts of rdf. Hell, the
browser I use is based very heavily on an rdf engine. But, as a data
export format, I have serious questions about the applicability of
rdf. Its verbosity is high, and the tools that can deal with it are
few and far between.

--

Brett Morgan http://brett.morgan.googlepages.com/

David Cifuentes

unread,
May 7, 2008, 1:17:27 PM5/7/08
to Google App Engine
Yes, I'm working right now with RDF-compliant systems and you're
right, it is verbose, but sincerely, who cares - it is designed for
machines anyway, isn't it? I know it is not the "panacea" either. But
I don't agree with you on tools support; there are excellent projects
(at least in the Java world: http://jena.sourceforge.net/ or
http://www.openrdf.org/) for dealing with RDF + SPARQL. I believe you
can express data structures more cleanly and with more semantics than
with any other of the formats listed in the original message. I mean,
if Google adopts RDF for GAE data exports, tool support eventually
starts getting better and better. In my humble opinion...

David C.

Brett Morgan

unread,
May 7, 2008, 9:29:44 PM5/7/08
to google-a...@googlegroups.com
On Thu, May 8, 2008 at 3:17 AM, David Cifuentes
<david.c...@eforcers.com> wrote:
>
> Yes, I'm working right now with a RDF compliant systems and you're
> right it is verbose, but sincerely who cares is designed for machines
> anyway, doesn't it? I know it is not the "panacea" either. But I don't
> agree with you in the tools support, there are excellent projects (at
> least in the Java world http://jena.sourceforge.net/ or http://www.openrdf.org/)
> for dealing with RDF + SPARQL. I believe you can express data
> structures cleaner and with more semantics than any other of the
> formats listed in the original message. I mean, if Google adopts RDF
> for GAE data exports, tools support eventually starts getting better
> and better. In my humble opinion...
>
> David C.

Yeah, I was following Jena for a while. It just fails the
"understandable by an engineer in a five minute elevator pitch" test.

A large chunk of the problem is that there isn't a nice way to
linearise a data graph in a hierarchical storage medium like XML. At
the very least you wind up with references that cut across the
hierarchy. Yes, Sparql is designed to deal with all of this, and yes
I've been waiting for an RDF datastore to hit it big.

But RDF/Sparql are competing with the easy understandability of a
persistent hash map like we are seeing in google's BigTable, or
CouchDB, or AWS's SimpleDB. These are readily understandable because
they extend metaphors (a hashmap) that every programmer has already
come to terms with.

RDF datastores are, in my experience, hard to reason about. Sparql has
a long way to go in optimisation strategies, and moreover, in being
at a place where an engineer can get a reasonable handle on whether a
given Sparql query will run quickly or slowly. This is an important
consideration. We are currently watching a bunch of engineers coming
to terms with DataStore's performance characteristics, and DataStore
is a lot more predictable than most RDF stores I've played with...

David Cifuentes

unread,
May 8, 2008, 9:04:36 AM5/8/08
to Google App Engine
OK Brett, I understand your point and I have to agree with you,
especially on the "elevator pitch" test - I laughed so hard at that
one. About the performance optimization of the queries, I'm sure that
as we write there is someone working on improving that. I guess it is
very difficult to take the "one size fits all" approach, and I believe
in having different alternatives, or maybe giving developers the
ability and the tools to transform their data into whatever format is
easier and cleaner. I brought the discussion onto Twitter and
@nikolasco came up with very interesting ideas and resources:

# 16:01 @dcifuen don't fight for RDF-only... good XML with an XSLT
stylesheet to produce RDF+XML should cover everyone nicely #
# 16:03 @dcifuen I'd also point out that Redland/librdf is portable
with many language bindings, Redfoot/rdflib for Python, Jena for Java,
etc... #
# 16:08 @dcifuen also the bonus of "free" SPARQL for selecting just
the data you want, plus an HTTP-service spec; a REST service could
complement #

I'm concerned that taking the simple approach of being "easy to
understand" sacrifices the semantics of the data.

Thanks for the interesting discussion Brett.

David C.

On May 7, 8:29 pm, "Brett Morgan" <brett.mor...@gmail.com> wrote:
> On Thu, May 8, 2008 at 3:17 AM, David Cifuentes
>
> <david.cifuen...@eforcers.com> wrote:
>
> > Yes, I'm working right now with a RDF compliant systems and you're
> > right it is verbose, but sincerely who cares is designed for machines
> > anyway, doesn't it? I know it is not the "panacea" either. But I don't
> > agree with you in the tools support, there are excellent projects (at
> > least in the Java world http://jena.sourceforge.net/ or http://www.openrdf.org/)

Brett Morgan

unread,
May 9, 2008, 7:49:46 PM5/9/08
to google-a...@googlegroups.com
On Thu, May 8, 2008 at 11:04 PM, David Cifuentes
<david.c...@eforcers.com> wrote:
>
> Ok Brett, I understand your point and I have to agree with you
> especially "elevator pitch" test, I laughed so hard with that one.
> About the performance optimization of the queries I'm sure that as we
> write there is someone working trying to improve that. Is guess is
> very difficult to take the "one size fits all" approach and I believe
> having different alternatives or maybe to give the ability and the
> tools to transform your data in whatever format it easier and cleaner.
> I brought the discussion into twitter and @nikolasco came with very
> interesting ideas and resources :
>
> # 16:01 @dcifuen don't fight for RDF-only... good XML with an XSLT
> stylesheet to produce RDF+XML should cover everyone nicely #
> # 16:03 @dcifuen I'd also point out that Redland/librdf is portable
> with many language bindings, Redfoot/rdflib for Python, Jena for Java,
> etc... #
> # 16:08 @dcifuen also the bonus of "free" SPARQL for selecting just
> the data you want, plus an HTTP-service spec; a REST service could
> complement #
>
> I'm concern that for taking the simple approach of being "easy to
> understand" sacrifices the semantics of the data.

One of the main things that we are gently encouraged to do with GAE is
move processing to write time. This means you need to know ahead of
time everything you want to do.

This means we are trading away the power of a flexible query engine,
but in return we get quick page renders. I suppose the question
becomes, is this trade off appropriate for the application you want to
build?

> Thanks for the interesting discussion Brett.

My pleasure =)

--

Brett Morgan http://brett.morgan.googlepages.com/

Joeles

unread,
May 18, 2008, 4:22:28 AM5/18/08
to Google App Engine
Since MySQL is one of the most used database systems out there,
it would be very nice if one could export in a MySQL-compatible format
and import this format into BigTable as well.

I already opened an issue before I noticed this thread (covering the
import part), since I believe that a MySQL-to-BigTable importer would
be a very welcome feature, as it makes migrating projects to App Engine
very easy.

Please also have a look at my issue:
http://code.google.com/p/googleappengine/issues/detail?id=366
and star it if you think it's a good idea.

So - what in my opinion would be the best way to go:
phpMyAdmin - err, googlebigtableadmin,
100% MySQL compatible in both ways.

best.

Joeles

peterk

unread,
Jun 5, 2008, 7:12:20 AM6/5/08
to Google App Engine
I know this thread's been inactive for a month, but..any update on
this?

Being able to move my application off AppEngine (if God forbid I ever
needed to), is a concern for the worrier in me :)

My dream tool for getting data out of GAE would be a replication tool
that could automatically keep another database reasonably up-to-date
with my GAE version. I suppose I could do this myself right now by
just copying my queries to another database elsewhere, but I don't
know if I'd trust my own coding with this, particularly if my app
scales up a lot and data integrity and consistency become more
critical.

I also am not sure how likely it is Google would support replication
to non-google services? Like, say, I wanted to keep a reasonably fresh
copy on SimpleDB (gasp, I know! don't hurt me! :p)

I know, again, that these are things I, or the community, could
probably roll themselves with a general data export tool from Google,
but I always like 'official' google tools :)

Thanks..

On May 7, 9:44 am, "Brett Morgan" <brett.mor...@gmail.com> wrote:
> Silly question, have you actually worked with rdf? Especially rdf
> expressed in xml?
>
> Don't get me wrong, I like some of the concepts of rdf. Hell, the
> browser I use is based very heavily on an rdf engine. But, as a data export format, I have serious questions about the applicability of
> rdf. It's verbosity is high, and the tools that can deal with it are
> few and far between.
>
> On Wed, May 7, 2008 at 6:03 PM, David Cifuentes
>
>
>
> <david.cifuen...@eforcers.com> wrote:
>
> > My vote is for RDF. That can be the trigger for mainstream Semantic
> > Web. It is a lot more powerful than the other formats in terms of
> > structure of your data.
>
> > On 15 abr, 00:44, Pete <pkoo...@google.com> wrote:
> >> Getting large amounts of data from existing apps into the datastore
> >> can be a challenge, and to help we offer a Bulk Data Uploader.  You
> >> can read more about this tool here:
>
> >>http://code.google.com/appengine/articles/bulkload.html
>
> >> Conversely, we know many of you would like a better tool for moving
> >> your data *off* of Google App Engine, and we'd like to ask for your

Nevin Freeman

unread,
Jun 30, 2008, 2:37:14 PM6/30/08
to Google App Engine
> I know this thread's been inactive for a month, but..any update on
> this [datastore backup]?

I'd like to chime in with the same. It seems evident that this is a concern
for a lot of people; has an open-source project sprung up to address
this need? I'd rather not write the utility from scratch if people are
already working on it.

Nevin

Michael (blog.crisatunity.com)

unread,
Jun 30, 2008, 5:04:11 PM6/30/08
to google-a...@googlegroups.com
For me this defect (not feature) is a deal breaker on moving any production application towards this platform.  I just dropped off my comment on Issue 59.

On Mon, Jun 30, 2008 at 1:37 PM, Nevin Freeman <nevin....@gmail.com> wrote:

> I know this thread's been inactive for a month, but..any update on

Garrett Davis

unread,
Jul 12, 2008, 9:11:09 PM7/12/08
to Google App Engine
I built a bulk download module. Anyone care to help me beta-test it?

Please read about it here:
http://code.google.com/p/gawsh/wiki/BulkDownload

and download it from
http://code.google.com/p/gawsh/downloads/list

If it works for you, please let me know.
If it doesn't work for you, please let me know that as well.

Gary


On Jun 30, 2:04 pm, "Michael (blog.crisatunity.com)"
<mich...@crisatunity.com> wrote:
> For me this defect (not feature) is a deal breaker on moving any production
> application towards this platform.  I just dropped off my comment on Issue
> 59 <http://code.google.com/p/googleappengine/issues/detail?id=59>.
>
> On Mon, Jun 30, 2008 at 1:37 PM, Nevin Freeman <nevin.free...@gmail.com>