Getting large amounts of data from existing apps into the datastore
can be a challenge, and to help we offer a Bulk Data Uploader. You
can read more about this tool here:
Conversely, we know many of you would like a better tool for moving
your data *off* of Google App Engine, and we'd like to ask for your
feedback on what formats would be the most useful for a data
exporter. XML output? CSV transform? RDF? Let us know what you
think!
I concur with Jared. XML and JSON are my preferred methods.
What really needs to be done is to add a toJson function
to Datastore models. Then when someone wants to dump an entire DB you give
them the option of XML or JSON, and on your side you simply execute a
query where you grab all objects from all models and export them one
by one.
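For illustration, a minimal sketch of what such a toJson-style helper might look like. It is plain Python rather than the App Engine SDK, and the names `entity_to_dict` and `dump_all`, as well as the (kind, key, properties) tuple layout, are assumptions made up for this example.

```python
import json
from datetime import datetime, timezone


def entity_to_dict(kind, key, properties):
    """Flatten one entity into a JSON-friendly dict.

    `kind`, `key` and `properties` are hypothetical stand-ins for whatever
    a real datastore model would expose.
    """
    out = {"kind": kind, "key": key, "properties": {}}
    for name, value in properties.items():
        if isinstance(value, datetime):
            # JSON has no date type, so dates are written as ISO 8601 strings.
            value = value.isoformat()
        out["properties"][name] = value
    return out


def dump_all(entities):
    """Export every entity, one by one, as a single JSON array."""
    return json.dumps([entity_to_dict(*e) for e in entities], sort_keys=True)


example = dump_all([
    ("Greeting", "greeting-1", {
        "author": "jared",
        "posted": datetime(2008, 4, 15, tzinfo=timezone.utc),
    }),
])
```

Only dates get special treatment here; a fuller exporter would need the same kind of conversion for every non-JSON type.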
I admit that I don't have much experience with this matter, but it
seems like binary data over xml would be messy.
It also seems like binary data would be necessary, since file system
access isn't a part of the engine. For example if someone is doing a
flickr-like application, and has a large collection of existing images
to upload.
Maybe there's a standard way to do this and I'm just dense. At the
very least any method seems like it should be binary friendly.
--
Justin
On Apr 15, 2:19 am, Aaron Krill <aa...@krillr.com> wrote:
> I concur with Jared. XML and JSON are my preferred methods.
> What really needs to be done is to add a toJson function
> to Datastore models. Then when someone wants to dump an entire DB you give
> them the option of XML or JSON, and on your side you simply execute a
> query where you grab all objects from all models and export them one
> by one.
Justin, for binary content you can use Base64 encoding or some other
string-based encoding format to cater for this and still use XML, JSON
or whatever.
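As a rough sketch of the idea, binary content can be carried inside a JSON record using Python's standard `base64` module. The record layout and the helper names `encode_blob` and `decode_blob` are made up for this example.

```python
import base64
import json


def encode_blob(data: bytes) -> str:
    """Turn raw bytes into an ASCII string that is safe inside JSON or XML."""
    return base64.b64encode(data).decode("ascii")


def decode_blob(text: str) -> bytes:
    """Recover the original bytes on the importing side."""
    return base64.b64decode(text)


image_bytes = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG signature, used as sample data
record = json.dumps({"name": "photo-1", "image": encode_blob(image_bytes)})
restored = decode_blob(json.loads(record)["image"])
```

The cost is size: Base64 output is roughly a third larger than the input.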
Would it be possible to define our own export templates,
using, say, Django templates?
Also, some sort of web service where we can poll the data would be
great as well.
JSON format would be great as well.
On Apr 15, 4:04 pm, Justin Worrell <jworr...@gmail.com> wrote:
> I admit that I don't have much experience with this matter, but it
> seems like binary data over xml would be messy.
> It also seems like binary data would be necessary, since file system
> access isn't a part of the engine. For example if someone is doing a
> flickr-like application, and has a large collection of existing images
> to upload.
> Maybe there's a standard way to do this and I'm just dense. At the
> very least any method seems like it should be binary friendly.
> --
> Justin
There are several options for exporting binary data in XML and JSON formats. The most common is to encode it in Base64. It's not really messy at all.
Also, a majority of content types probably wouldn't need to be exported if ticket #208 comes true. My thought is that most content types (images, video, office documents) would be stored in the appropriate Google services (Picasa Web Albums, Google Video, Google Docs, etc.). Then, instead of storing these in your Model, you'd simply store references to them for each user.
On 15 Apr., 09:04, Justin Worrell <jworr...@gmail.com> wrote:
> I admit that I don't have much experience with this matter, but it
> seems like binary data over xml would be messy.
When I think about migrating a GAE app to a "standard" model (e.g.
some language plus a relational database) the main problem I see is
the model references (e.g. db.Key and parent relationships).
I would also prefer to have XML output, but the advantage of a binary
format would be that the references to other objects could be stored
natively. In my eyes, XML with a lot of reference attributes/tags is
suboptimal for importing the data into a relational database. But at
least with XML it is possible to define a proper schema (specifying
all models, db.Key, etc.) which standardizes the export. With JSON
this would not be that easy, so JSON is probably only good for
small exports and not worth thinking about here anyway, because every
developer should be able to write a few lines of code that transform
his/her specific model into JSON.
The other thing with the references is that exporting while
maintaining all relationships between all models would need one big
XML file. If we now take Justin's flickr example, the XML file
would additionally be bloated with base64-encoded image data, which
makes the XML export huge and pretty tough to handle...
How about giving a general interface with some already provided
filters (e.g. for XML) but also making it possible for developers to
write their own export filters?
All you would have to do to handle the binary data is parse the XML tree, grab the base64 and decode it. Simple. How is this difficult? This is the standard for any XML-based document format that must contend with binary data (SVG with embedded raster images comes to mind). Also ODF and the like, which may contain similar image data.
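A small sketch of that parse, grab and decode step, using Python's standard `xml.etree.ElementTree` and `base64` modules. The element and attribute names (`entities`, `entity`, `property`, `encoding`) are invented for this example and are not any official export schema.

```python
import base64
import xml.etree.ElementTree as ET

# Build a tiny export document to parse; all names here are hypothetical.
root = ET.Element("entities")
photo = ET.SubElement(root, "entity", kind="Photo", key="photo-1")
blob = ET.SubElement(photo, "property", name="image", encoding="base64")
blob.text = base64.b64encode(b"\x00\x01binary\xff").decode("ascii")
document = ET.tostring(root, encoding="unicode")


def extract_blobs(xml_text):
    """Return {(entity key, property name): raw bytes} for base64 properties."""
    blobs = {}
    for entity in ET.fromstring(xml_text).iter("entity"):
        for prop in entity.iter("property"):
            if prop.get("encoding") == "base64":
                key = (entity.get("key"), prop.get("name"))
                blobs[key] = base64.b64decode(prop.text)
    return blobs


decoded = extract_blobs(document)
```

This loads the whole tree into memory, which is fine for small documents but not for the multi-gigabyte case discussed later in the thread.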
I don't see why JSON would not be suitable for a large export -- the format is small and easy to understand. It's also far easier to import and export from other python applications. In some cases, parsing a JSON "packet" is considerably faster than parsing XML as well.
On Tue, Apr 15, 2008 at 12:41 AM, Joscha Feth <jos...@feth.com> wrote:
> On 15 Apr., 09:04, Justin Worrell <jworr...@gmail.com> wrote:
> > I admit that I don't have much experience with this matter, but it
> > seems like binary data over xml would be messy.
> When I think about migrating a GAE app to a "standard" model (e.g.
> some language plus a relational database) the main problem I see is
> the model references (e.g. db.Key and parent relationships).
> I would also prefer to have XML output, but the advantage of a binary
> format would be that the references to other objects could be stored
> natively. In my eyes, XML with a lot of reference attributes/tags is
> suboptimal for importing the data into a relational database. But at
> least with XML it is possible to define a proper schema (specifying
> all models, db.Key, etc.) which standardizes the export. With JSON
> this would not be that easy, so JSON is probably only good for
> small exports and not worth thinking about here anyway, because every
> developer should be able to write a few lines of code that transform
> his/her specific model into JSON.
> The other thing with the references is that exporting while
> maintaining all relationships between all models would need one big
> XML file. If we now take Justin's flickr example, the XML file
> would additionally be bloated with base64-encoded image data, which
> makes the XML export huge and pretty tough to handle...
> How about giving a general interface with some already provided
> filters (e.g. for XML) but also making it possible for developers to
> write their own export filters?
On 15 Apr., 09:59, "Aaron Krill" <aa...@krillr.com> wrote:
> If you want your own export filter, simply write it into your application.
> It isn't hard to do.
Sure - but why reinvent the wheel a thousand times? A general API
with open filters would pool our efforts, and would also make the filters
more bug-free than if everyone developed filters on their own.
> All you would have to do to handle the binary data is parse the XML tree,
> grab the base64 and decode it. Simple. How is this difficult? This is the
> standard for any XML-based document format that must contend with binary
> data (SVG with embedded raster images comes to mind). Also ODF and the like
> which may contain similar image data.
I didn't mean base64-encoded data itself is hard to handle - I meant a
single XML file with X Gigabyte base64-encoded data in it isn't easy
to handle.
> I don't see why JSON would not be suitable for a large export -- the format
> is small and easy to understand. It's also far easier to import and export
> from other python applications. In some cases, parsing a JSON "packet" is
> considerably faster than parsing XML as well.
so how would you translate the references to other models into JSON?
Just by throwing out the key name as a string? The nice thing about
XML with a schema is, that it is self-describing - the nice thing
about a binary format would be that the references could be modeled as
pointers - just the key name as a string now seems messy to me - but
as I said before: for small exports this might be the right thing to
do...
On Tue, Apr 15, 2008 at 1:09 AM, Joscha Feth <jos...@feth.com> wrote:
> On 15 Apr., 09:59, "Aaron Krill" <aa...@krillr.com> wrote:
> > If you want your own export filter, simply write it into your application.
> > It isn't hard to do.
> Sure - but why reinvent the wheel a thousand times? A general API
> with open filters would pool our efforts, and would also make the filters
> more bug-free than if everyone developed filters on their own.
So then simply release your filter code and others can include it in their own projects.
> > All you would have to do to handle the binary data is parse the XML tree,
> > grab the base64 and decode it. Simple. How is this difficult? This is the
> > standard for any XML-based document format that must contend with binary
> > data (SVG with embedded raster images comes to mind). Also ODF and the like
> > which may contain similar image data.
> I didn't mean base64-encoded data itself is hard to handle - I meant a
> single XML file with X Gigabyte base64-encoded data in it isn't easy
> to handle.
This is true. Perhaps the export filter could be set in such a way that if there are objects that would translate to very large XML files, each object would be in its own file.
> > I don't see why JSON would not be suitable for a large export -- the format
> > is small and easy to understand. It's also far easier to import and export
> > from other python applications. In some cases, parsing a JSON "packet" is
> > considerably faster than parsing XML as well.
> so how would you translate the references to other models into JSON?
> Just by throwing out the key name as a string? The nice thing about
> XML with a schema is, that it is self-describing - the nice thing
> about a binary format would be that the references could be modeled as
> pointers - just the key name as a string now seems messy to me - but
> as I said before: for small exports this might be the right thing to
> do...
If you look at the example I already provided, it allows for you to define a property type. This includes the ReferenceType and any arguments associated with it. If you write something to handle a JSON export you can make it understand these types and arguments, can you not?
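The example referred to above is not part of this thread, so the following is only a guess at the general shape: a sketch in which every exported value carries a type tag, and references are written as a kind plus key name rather than a bare string. The `Reference` class and the tag names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical stand-in for a datastore reference (kind plus key name)."""
    kind: str
    key: str


def encode_value(value):
    """Tag every value with its type so an importer knows how to rebuild it."""
    if isinstance(value, Reference):
        return {"type": "reference", "kind": value.kind, "key": value.key}
    return {"type": type(value).__name__, "value": value}


def decode_value(tagged):
    """Inverse of encode_value for the types handled in this sketch."""
    if tagged["type"] == "reference":
        return Reference(tagged["kind"], tagged["key"])
    return tagged["value"]


comment = {"text": "Nice shot!", "votes": 3, "photo": Reference("Photo", "photo-1")}
exported = json.dumps({name: encode_value(v) for name, v in comment.items()})
imported = {name: decode_value(t) for name, t in json.loads(exported).items()}
```

With the tag in place, an importer can tell a reference apart from an ordinary string and resolve it however its own storage requires.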
> > Sure - but why reinvent the wheel a thousand times? A general API
> > with open filters would pool our efforts, and would also make the filters
> > more bug-free than if everyone developed filters on their own.
> So then simply release your filter code and others can include it in their
> own projects.
yep, that's what I am talking about - my idea was to combine the
export functionalities given by Google and our ideas on one
platform...
> > I didn't mean base64-encoded data itself is hard to handle - I meant a
> > single XML file with X Gigabyte base64-encoded data in it isn't easy
> > to handle.
> This is true. Perhaps the export filter could be set in such a way that if
> there are objects that would translate to very large XML files, each object
> would be in its own file.
yeah - so giving a parameter such as externalize="image/*" would store
all properties matching this mime type outside in an external file and
the XML would just include a link to it.
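A sketch of how such an externalize parameter might behave, assuming each property is available as a (MIME type, value) pair. The function name `export_entity`, the `href` attribute and the file naming scheme are all made up for this example.

```python
import fnmatch
import os
import tempfile
import xml.etree.ElementTree as ET


def export_entity(key, properties, out_dir, externalize="image/*"):
    """Write one entity as XML, moving matching properties to side files.

    `properties` maps a property name to a (mime_type, value) pair; this
    layout is an assumption of the sketch.
    """
    entity = ET.Element("entity", key=key)
    for name, (mime_type, value) in properties.items():
        prop = ET.SubElement(entity, "property", name=name, type=mime_type)
        if fnmatch.fnmatchcase(mime_type, externalize):
            # Store the payload outside the XML and keep only a link to it.
            filename = f"{key}.{name}.bin"
            with open(os.path.join(out_dir, filename), "wb") as handle:
                handle.write(value)
            prop.set("href", filename)
        else:
            prop.text = value
    return ET.tostring(entity, encoding="unicode")


with tempfile.TemporaryDirectory() as out_dir:
    xml_text = export_entity(
        "photo-1",
        {"title": ("text/plain", "Sunset"), "image": ("image/png", b"\x89PNG")},
        out_dir,
    )
    written = sorted(os.listdir(out_dir))
```

The XML stays small and text-only, while the image bytes sit next to it in their original binary form.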
> If you look at the example I already provided, it allows for you to define a
> property type. This includes the ReferenceType and any arguments associated
> with it. If you write something to handle a JSON export you can make it
> understand these types and arguments, can you not?
I admit that it is possible - it just didn't seem convenient to me to
flatten the references - but in turn the key names are just perfect
for that - are key names unique throughout the whole GAE ecosystem or
just within one project?!
What about exporting in standard SQL format? I guess most of the data
will be going more or less into a dbms, so this would ease it.
Ok, might not fit for some rare circumstances - but should be fine for
most of the users.
What do you think?
On Tue, Apr 15, 2008 at 1:37 AM, Tobias <tkl...@enarion.it> wrote:
> What about exporting in standard SQL format? I guess most of the data
> will be going more or less into a dbms, so this would ease it.
> Ok, might not fit for some rare circumstances - but should be fine for
> most of the users.
> What do you think?
Are you talking about generating a binary file of some format (e.g.
MyISAM, etc.) or about generating the CREATE and INSERT statements? If
the latter: what is standard? I don't know of two RDBMSs sharing exactly the
same set of SQL commands... Also, the modelling with GAE is very
different from how you would do it in a relational database... But the
idea of having an SQL statement export seems tempting - one would be
able to export the data without losing any information and could then
remodel the structure within the target RDBMS.
On 15 Apr., 10:37, Tobias <tkl...@enarion.it> wrote:
> What about exporting in standard SQL format? I guess most of the data
> will be going more or less into a dbms, so this would ease it.
> Ok, might not fit for some rare circumstances - but should be fine for
> most of the users.
> What do you think?
Exactly, I meant standard SQL code like CREATE and INSERT. Just the
plain and standard stuff (yes, there exist some standards - at least
in history), no DBMS-specific code (for the beginning).
OT / One other feature would be to make some simple changes to the
data stored in GAE - e.g. renaming existing data structures, inserting
random data directly in the GAE control center and such simple stuff.
This should not be too dramatic for security reasons, so it might
be a low-hanging fruit. ;-)
Import: of course RSS, CSV/TSV, XML and (standard) SQL, too (all with
UTF-8 encoding)
Apparently I'm the only one, but I would like AtomPub. It has the
advantages of XML, and can handle binary downloads as, well, binary
downloads. It combines a data formatting protocol with a protocol for
references, both internally in the file and externally to related
content. Think blog entries and comments in Blogger. JSON seems handy
when you want to download part of the data to a web browser client for
display, but I fail to see why you would want JSON formatted data as a
generic bulk import/export format.
Yes, pure text-based SQL output as is generated by pg_dump (PostgreSQL
dump tool) or equivalent would be most useful. Then the data can be
imported straight into some local database by simply piping it into
the client:
psql my_postgresql_database < my_gae_sql_dump.txt
A similar operation could work for MySQL or Oracle, and if the data is
in "pure" standard SQL3 (CREATE and INSERT only), it should be very
portable. If you could provide some suggested CREATE INDEX statements
that Google uses to speed up the BigTable queries that would be great
too, though probably extra-credit.
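To make the idea concrete, here is a rough sketch of generating CREATE and INSERT statements from exported rows, then loading them into an in-memory SQLite database as a portability check. The function names, the table layout and the column types are assumptions of this example; table and column names are not quoted or validated, so it only suits trusted input.

```python
import sqlite3


def sql_literal(value):
    """Render a Python value as an SQL literal (None, numbers and text only)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Standard SQL escapes a single quote by doubling it.
    return "'" + str(value).replace("'", "''") + "'"


def dump_kind(table, columns, rows):
    """Return a text dump: one CREATE TABLE plus one INSERT per row.

    `columns` is a list of (name, sql_type) pairs chosen by the caller.
    """
    column_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    names = ", ".join(name for name, _ in columns)
    statements = [f"CREATE TABLE {table} ({column_defs});"]
    for row in rows:
        values = ", ".join(sql_literal(v) for v in row)
        statements.append(f"INSERT INTO {table} ({names}) VALUES ({values});")
    return "\n".join(statements)


dump = dump_kind(
    "greeting",
    [("entity_key", "VARCHAR(500)"), ("author", "VARCHAR(500)"), ("votes", "INTEGER")],
    [("g1", "O'Brien", 3), ("g2", None, 0)],
)

# Load the dump into an in-memory SQLite database to check that it parses.
connection = sqlite3.connect(":memory:")
connection.executescript(dump)
loaded = connection.execute(
    "SELECT author, votes FROM greeting ORDER BY entity_key"
).fetchall()
```

Entities of one kind with differing properties would need nullable columns or a separate table, which this sketch does not attempt.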
XML and JSON are also good options.
m++
On Apr 15, 1:59 am, Tobias <tkl...@enarion.it> wrote:
> Exactly, I meant standard SQL code like CREATE and INSERT. Just the
> plain and standard stuff (yes, there exist some standards - at least
> in history), no dbms specific code (for the beginning).
> OT / One other feature would be to make some simple changes to the
> data stored in GAE - e.g. renaming existing data structures, inserting
> random data directly in the GAE control center and such simple stuff.
> This should not be too dramatic for security reasons, so it might
> be a low-hanging fruit. ;-)
> Import: of course RSS, CSV/TSV, XML and (standard) SQL, too (all with
> UTF8 encoding)
Yes, generating a text-based file of SQL statements similar to what is
generated by pg_dump (PostgreSQL dump tool) would be great. This
could then be imported by simply piping to the database client tool:
psql my_database < my_gae_sql_dump
If the SQL statements are all "pure standard" SQL3 (CREATE and INSERT
only) they should be very portable and work the same way for MySQL,
Oracle, and others. Also including any CREATE INDEX statements that
Google uses to optimize the BigTable queries would be a great plus --
extra-credit.
Being able to generate XML, JSON, or even CSV for individual tables
would be useful as well.
m++
The way you've asked the question strikes me as bizarre, Pete.
"XML, CSV, RDF?" Why, I'd like to access it through GQL, of course!
Easily half of the reason for wanting to download and then later
upload the data is for backups. And no, I don't mean the kind of
backup where I think Google is likely to lose my data - I mean the
kind of backup where *I'm* likely to make a coding error that destroys
my data, and I'd like to be able to roll back to my last snapshot that
I saved myself. So, I don't care what format that's in, as long as
download and upload work flawlessly. I know that's complicated,
because the data can be changing while you download, but I'd like your
best stab at it.
Also, my other reason for wanting this is that I'd like to download,
process locally, and then upload the results on a periodic (every 24
hours?) basis. (That is, until Google makes available its own cron
processing, hopefully with the full power of MapReduce. *grin*)
So, if I could download the data in *any* format, as long as I can
access it on my local machine through GQL in my own Python scripts,
that'd be fantastic.
Other formats might be interesting, but I think you can't avoid XML.
If anything, it has to be at least XML.
Now, it is true that datastores with lots of blobs might then export
huge XML files if you include them as base64-encoded elements.
So why not encapsulate everything in a zip file instead, containing an
XML file per object class, but including the binary fields as files
named using their entity key?
Those files could be organized in nested folders using their class
name and property name:
Anyhow, I think this export feature needs to be as generic as
possible; that's why I would recommend XML and maybe a compressed 'zip'
file like this to make the export easier, and readable on every
platform.
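One possible reading of this proposal, sketched with Python's standard `zipfile` module: one XML file per kind, with each binary property stored as a separate archive member. The member path `<kind>/<property>/<entity key>` is a guess at the layout, and the function name and XML attribute names are invented for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_archive(entities):
    """Pack (kind, key, properties) tuples into an in-memory zip archive."""
    per_kind = {}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for kind, key, properties in entities:
            if kind not in per_kind:
                per_kind[kind] = ET.Element("entities", kind=kind)
            entity = ET.SubElement(per_kind[kind], "entity", key=key)
            for name, value in properties.items():
                prop = ET.SubElement(entity, "property", name=name)
                if isinstance(value, bytes):
                    # Binary fields become files; the XML only points at them.
                    path = f"{kind}/{name}/{key}"
                    archive.writestr(path, value)
                    prop.set("file", path)
                else:
                    prop.text = str(value)
        for kind, root in per_kind.items():
            archive.writestr(f"{kind}.xml", ET.tostring(root, encoding="unicode"))
    return buffer.getvalue()


data = build_archive([("Photo", "photo-1", {"title": "Sunset", "image": b"\x89PNG"})])
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = sorted(archive.namelist())
    image = archive.read("Photo/image/photo-1")
    photo_xml = archive.read("Photo.xml").decode("utf-8")
```

Because the binary members are stored as-is, nothing pays the Base64 size penalty and the XML files stay small enough to open in an ordinary editor.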
It would also be great to version the data stored by Google and
export/download or roll back the data of a specific version.
Say I'm uploading a new version and would like to back up the current
data by setting a backup point / version, and be able to roll back
to it. I don't know if this is currently possible, but I guess not
for the public App Engine apps hosted on appspot.com.
BTW - Frank's idea looks great to me. (ZIP file with XML + bin
files)
zXML++
On Wed, Apr 16, 2008 at 2:39 AM, martian <martintu...@gmail.com> wrote:
> Yes, pure text-based SQL output as is generated by pg_dump (PostgreSQL
> dump tool) or equivalent would be most useful. Then the data can be
> imported straight into some local database by simply piping it into
> the client:
> A similar operation could work for MySQL or Oracle, and if the data is
> in "pure" standard SQL3 (CREATE and INSERT only), it should be very
> portable. If you could provide some suggested CREATE INDEX statements
> that Google uses to speed up the BigTable queries that would be great
> too, though probably extra-credit.
> XML and JSON are also good options.
> m++
I don't see any difficult issues as long as you include the string
encoding of all the keys so we can put back together the relationships
and references if/when we need to. JSON and XML are fine, but to_xml
methods on the entities would be handy to help with homemade dumps.
You can't try to convert to a relational data model, but if all the
entities of each kind happen to have all the same properties then it
will look relational. Before each kind, you could include the to_xml
of the Model Instance. You'll need some means of a) paging to download
it in pieces and b) filtering in case the huge base64 binaries are not
wanted.
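The two requirements at the end, paging and filtering out large binaries, could look roughly like this. Entities are plain dicts here, and the names `iter_pages` and `strip_binary` are made up for the sketch.

```python
from itertools import islice


def strip_binary(entity):
    """Drop properties whose values are raw bytes (e.g. images)."""
    return {
        name: value
        for name, value in entity.items()
        if not isinstance(value, (bytes, bytearray))
    }


def iter_pages(entities, page_size, include_binary=True):
    """Yield the export in pieces of at most `page_size` entities."""
    iterator = iter(entities)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        if not include_binary:
            page = [strip_binary(entity) for entity in page]
        yield page


sample = [
    {"key": f"photo-{i}", "title": f"Photo {i}", "image": b"\x00" * 4}
    for i in range(5)
]
pages = list(iter_pages(sample, page_size=2, include_binary=False))
```

A real download would also need a stable ordering, for example by key, so that a client can resume from the last entity it received.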