Storing JSON efficiently in Datastore (in Python)

Andrin von Rechenberg

Jun 1, 2012, 2:40:38 PM
to google-a...@googlegroups.com
Hey there

If you want to store megabytes of JSON in the datastore
and get it back from the datastore into Python already parsed,
this post is for you.

I ran a couple of performance tests in which I store
a 4 MB JSON object in the datastore and then get it back at
a later point and process it.

There are several ways to do this.

Challenge 1) Serialization
You need to serialize your data, and there are several
libraries you can use for this. JSON objects can be
serialized using the json lib, the cPickle lib, or the
marshal lib (these are the libraries I'm aware of at
the moment).

Challenge 2) Compression
If your serialized data doesn't fit into 1 MB, you need
to shard your data over multiple datastore entities and
reassemble it manually when loading the entities back.
If you compress your serialized data before storing it,
you pay the cost of compression and decompression,
but you have to fetch fewer datastore entities when you
want to load your data, and you have to write fewer
datastore entities when you update your data if it is
sharded.
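
To make the sharding idea concrete, here is a minimal sketch
of what it could look like with the db API (BlobShard,
save_shards, and load_shards are made-up names, just for
illustration):

from google.appengine.ext import db

CHUNK_SIZE = 1000000  # stay safely under the ~1 MB entity limit

class BlobShard(db.Model):
  data = db.BlobProperty()

def save_shards(key_name, blob):
  # One entity per chunk, keyed so the chunks can be fetched back in order.
  shards = []
  for i in range(0, len(blob), CHUNK_SIZE):
    shards.append(BlobShard(key_name='%s:%d' % (key_name, i / CHUNK_SIZE),
                            data=db.Blob(blob[i:i + CHUNK_SIZE])))
  db.put(shards)
  return len(shards)

def load_shards(key_name, num_shards):
  keys = ['%s:%d' % (key_name, i) for i in range(num_shards)]
  return ''.join(s.data for s in BlobShard.get_by_key_name(keys))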

Solution for 1) Serialization:
cPickle is very slow. It's meant to serialize real
objects, not just JSON. The json lib is much faster,
but compared to marshal it has no chance.
The Python marshal library is definitely the
way to serialize JSON; it has the best performance.
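
The round trip is straightforward (plain standard-library
usage; the sample data here is made up):

import marshal

data = [['key%d' % i, {'value': i}] for i in range(1000)]
blob = marshal.dumps(data, 2)  # serialize with marshal version 2
assert marshal.loads(blob) == data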

Solution for 2) Compression:
For my use-case it makes absolute sense to
compress the data the marshal lib produces
before storing it in datastore. I have gigabytes
of JSON data. Compressing the data makes
it about 5x smaller. Doing 5x fewer datastore
operations definitely pays for the time it
takes to compress and decompress the data.
There are several compression levels you
can use with Python's zlib,
from 1 (lowest compression, but fastest)
to 9 (highest compression, but slowest).
During my tests I found that the optimum
is to compress your serialized data using
zlib with level 1 compression. Higher
compression takes too much CPU and
the result is only marginally smaller.
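
Combined, the whole pipeline is just two calls in each
direction (reusing the sample data object from the snippet
above):

import marshal
import zlib

blob = zlib.compress(marshal.dumps(data, 2), 1)  # level 1: fastest
data_back = marshal.loads(zlib.decompress(blob))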

Here are my test results:

serializer   ziplvl    dump (s)    load (s)   size (bytes)
cPickle           0    1.671010    0.764567        3297275
cPickle           1    2.033570    0.874783         935327
json              0    0.595903    0.698307        2321719
json              1    0.667103    0.795470         458030
marshal           0    0.118067    0.314645        2311342
marshal           1    0.315362    0.335677         470956
marshal           2    0.318787    0.380117         457196
marshal           3    0.350247    0.364908         446085
marshal           4    0.414658    0.318973         437764
marshal           5    0.448890    0.350013         418712
marshal           6    0.516882    0.367595         409947
marshal           7    0.617210    0.315827         398354
marshal           8    1.117032    0.346452         392332
marshal           9    1.366547    0.368925         391921

The results do not include datastore operations;
they cover only creating a blob that can be stored
in the datastore and getting the parsed data back.
The "dump" and "load" times are the seconds it takes
to do this on a Google App Engine F1 instance
(600 MHz, 128 MB RAM).
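
A harness along these lines can produce numbers like the
ones above (a sketch of the measurement loop, not the exact
test script used for the table):

import json
import marshal
import time
import zlib

def benchmark(name, dumps, loads, data, level):
  # Time serialization plus optional compression ("dump").
  start = time.time()
  blob = dumps(data)
  if level:
    blob = zlib.compress(blob, level)
  dump_time = time.time() - start

  # Time decompression plus parsing back into Python objects ("load").
  start = time.time()
  raw = zlib.decompress(blob) if level else blob
  loads(raw)
  load_time = time.time() - start

  print '%s ziplvl %d: dump %fs load %fs size %d' % (
      name, level, dump_time, load_time, len(blob))

# For example:
# benchmark('marshal', lambda d: marshal.dumps(d, 2), marshal.loads, data, 1)
# benchmark('json', json.dumps, json.loads, data, 1)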

Feel free to comment on this email thread.

Enjoy,
-Andrin

Here is the library I created and use:

#!/usr/bin/env python
#
# Copyright 2012 MiuMeet AG
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from google.appengine.api import datastore_types
from google.appengine.ext import db

import zlib
import marshal

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1

class JsonMarshalZipProperty(db.BlobProperty):
  """Stores a JSON serializable object using zlib and marshal in a db.Blob"""

  def default_value(self):
    return None
  
  def get_value_for_datastore(self, model_instance):
    value = self.__get__(model_instance, model_instance.__class__)
    if value is None:
      return None
    return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
                                 COMPRESSION_LEVEL))

  def make_value_from_datastore(self, value):
    if value is not None:
      return marshal.loads(zlib.decompress(value))
    return value

  data_type = datastore_types.Blob
  
  def validate(self, value):
    # Accept any value: this skips BlobProperty's str check so that
    # plain Python objects (lists, dicts) can be assigned directly.
    return value


Emanuele Ziglioli

Jun 1, 2012, 7:41:19 PM
to Google App Engine
I do that with Java, and store it as gzip. I would love to serve it
uncompressed, but GAE doesn't let us set the Content-Encoding
header to gzip (well, it works on the local server, but on the GAE
server the header is stripped and the message body is corrupted).



Renzo Nuccitelli

Jun 2, 2012, 12:12:56 AM
to google-a...@googlegroups.com
Hi,

 ndb has a JsonProperty, which is parsed using the json module. It can be compressed by setting compressed=True: https://developers.google.com/appengine/docs/python/ndb/properties#compressed

 Maybe it would be good to add this solution to your tests.
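
 Declaring it is a one-liner on the model (standard ndb API;
 Payload is a made-up model name):

from google.appengine.ext import ndb

class Payload(ndb.Model):
  # JSON-serialized and zlib-compressed transparently by ndb.
  data = ndb.JsonProperty(compressed=True)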

 Renzo Nuccitelli

Brandon Wirtz

Jun 2, 2012, 1:41:23 AM
to google-a...@googlegroups.com

Nice. I like it. Going to try playing with your method vs some of the things I tried. But I suspect you are faster.

 


johnP

Jun 2, 2012, 11:09:02 AM
to Google App Engine
Is this using the Python 2.7 version of cPickle, or are you using
Python 2.5 with the pure-Python pickle aliased to cPickle?

johnP

Bryce Cutt

Jun 2, 2012, 5:34:39 PM
to google-a...@googlegroups.com
The docs for marshal seem to indicate there are no guarantees that marshaled data is compatible between Python versions. That worries me. If I decide to eventually upgrade my Python 2.5 apps to 2.7, am I going to have to convert all my data between marshal versions? While pickle is not always backwards compatible, it does appear to be forwards compatible.

I took your class and made the tiny modifications for it to use pickle instead of marshal and it is working well for me. Thanks.
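
That modification would look roughly like this (a sketch, not
the exact code):

import cPickle
import zlib

from google.appengine.api import datastore_types
from google.appengine.ext import db

COMPRESSION_LEVEL = 1

class JsonPickleZipProperty(db.BlobProperty):
  """Like JsonMarshalZipProperty, but uses cPickle for portability."""

  def get_value_for_datastore(self, model_instance):
    value = self.__get__(model_instance, model_instance.__class__)
    if value is None:
      return None
    return db.Blob(zlib.compress(
        cPickle.dumps(value, cPickle.HIGHEST_PROTOCOL), COMPRESSION_LEVEL))

  def make_value_from_datastore(self, value):
    if value is not None:
      return cPickle.loads(zlib.decompress(value))
    return value

  data_type = datastore_types.Blob

  def validate(self, value):
    return value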

Bryce Cutt

Jun 2, 2012, 6:43:33 PM
to google-a...@googlegroups.com
The type of data being serialized certainly affects how much faster marshal is. When testing with just a string approximately 1 MB in size, marshal was around 10 times as fast as JSON but only about 10% faster than pickle. With a dict of dicts of integers (around 1 MB when serialized with marshal), I found that pickle was about 50% faster than JSON but marshal was around 200 times faster than pickle!

If I run some more empirical tests I will share some real numbers. What sort of data were you using in your tests, Andrin?

Andrin von Rechenberg

Jun 3, 2012, 11:25:30 PM
to google-a...@googlegroups.com
Marshal also has versioning. I hardcoded version 2 and am hoping that it will be forward compatible.
If it's not, then oh well, I'll need to get the data out first and re-encode it, but that's a small burden compared
to the amount of speed I get.

I was using Python 2.7 with cPickle.

I don't think anything can beat marshal. Marshal is used by Python internally when serializing data structures
in compiled code (the .pyc files). So it's critical for Python's performance, and if there were something faster
than marshal, Python would definitely use it for exactly this case.

the test data was of this form:

[
  [ ~8-byte string,
    ~10-byte string,
    { about 6 key/value pairs with up to 200 bytes each }
  ],
  ...
  ...
  ...
]


Cheers,
-Andrin


Andreas

Jun 4, 2012, 9:49:57 AM
to google-a...@googlegroups.com
Great. How would this look for the ndb package?


Andreas

Jun 4, 2012, 10:41:14 AM
to google-a...@googlegroups.com
Is this a valid implementation?

import marshal
import zlib

from google.appengine.ext import ndb

MARSHAL_VERSION = 2

class JsonMarshalZipProperty(ndb.BlobProperty):

    def _to_base_type(self, value):
        return zlib.compress(marshal.dumps(value, MARSHAL_VERSION))

    def _from_base_type(self, value):
        return marshal.loads(zlib.decompress(value))

Bryce Cutt

Jun 4, 2012, 3:24:48 PM
to google-a...@googlegroups.com
aschmid: The ndb BlobProperty has optional compression built in (see ndb.model.BlobProperty). You could implement the MarshalProperty like this:

class MarshalProperty(BlobProperty):
  def _to_base_type(self, value):
    return marshal.dumps(value, MARSHAL_VERSION)
  def _from_base_type(self, value):
    return marshal.loads(value)

Then when you instantiate a property instance you would specify the compressed option to enable compression:
prop = MarshalProperty(compressed=True)

The compressed option in BlobProperty is implemented in such a way that you can turn it on and off and old values will still be read properly, as the _from_base_type() method in BlobProperty only decompresses the stored value if it actually was compressed.

The BlobProperty uses the default compression level (and does not have an option to change the compression level), so if you want to use level 1 (as Andrin recommends) you would need to implement that in your own subclass.
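
One way to do that (a sketch that performs the compression
itself rather than relying on the compressed option):

import marshal
import zlib

from google.appengine.ext import ndb

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1

class MarshalZipProperty(ndb.BlobProperty):
  """Marshals and zlib-compresses (level 1) the value itself,
  so ndb's built-in compression stays out of the way."""

  def _to_base_type(self, value):
    return zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
                         COMPRESSION_LEVEL)

  def _from_base_type(self, value):
    return marshal.loads(zlib.decompress(value))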

Bryce Cutt

Jun 4, 2012, 3:31:36 PM
to google-a...@googlegroups.com
Hi Andrin,

I have run a few more tests and now I am convinced you are right.

On Python 2.5, where pickle and json are both implemented in pure Python, the speed benefits of marshal outweigh any conversion issues in the future (which are not likely anyway). Your numbers show that even on Python 2.7 (with a C implementation of pickle), marshal is much faster. Thanks!

- Bryce

Andreas

Jun 4, 2012, 3:30:57 PM
to google-a...@googlegroups.com
OK, good to know.
But this still does not help with the 1 MB entity size limit... even after compressing, some of my JSON objects would still be over that size.
I think the only solution here is to use the Blobstore with the Files API.
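
Writing and reading such a blob looks roughly like this (a
sketch using the Files API; write_blob and read_blob are
made-up helper names):

import marshal
import zlib

from google.appengine.api import files
from google.appengine.ext import blobstore

def write_blob(data):
  # Serialize and compress, then stream the bytes into a new blob.
  payload = zlib.compress(marshal.dumps(data, 2), 1)
  file_name = files.blobstore.create(mime_type='application/octet-stream')
  with files.open(file_name, 'a') as f:
    f.write(payload)
  files.finalize(file_name)
  return files.blobstore.get_blob_key(file_name)

def read_blob(blob_key):
  # BlobReader gives file-like read access to the stored bytes.
  payload = blobstore.BlobReader(blob_key).read()
  return marshal.loads(zlib.decompress(payload))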

Bryce Cutt

Jun 4, 2012, 4:01:46 PM
to google-a...@googlegroups.com
Andreas,

Yup. I have had to resort to using the blobstore on many occasions for
exactly this reason.

One gotcha that I have run into when doing this is that there appears
to be no way to write a new blob to the blobstore (using the files
API) inside of a transaction that also modifies a datastore model,
even with the XG option enabled. I would love a solution to this if
anyone knows of one. Right now I am doing a workaround that may cause
a new blob to be written even if the transaction fails and using a
delayed task to clean that up.

- Bryce