Google Groups

Re: [google-appengine] Storing JSON efficiently in Datastore (in Python)

aschmid Jun 4, 2012 6:49 AM
Posted in group: Google App Engine
Great. How would this look for the ndb package?

On Jun 1, 2012, at 2:40 PM, Andrin von Rechenberg wrote:

Hey there

If you want to store megabytes of JSON in datastore
and get it back from datastore into python already parsed, 
this post is for you.

I ran a couple of performance tests in which I store
a 4 MB JSON object in the datastore and then get it back at
a later point and process it.

There are several ways to do this.

Challenge 1) Serialization
You need to serialize your data.
For this you can use several different libraries.
JSON objects can be serialized using:
the json lib, the cPickle lib or the marshal lib.
(these are the libraries I'm aware of atm)
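A minimal round-trip with all three libraries looks like this (a sketch with synthetic data; cPickle is Python 2's C-accelerated pickle, and on Python 3 the plain pickle module plays that role):

```python
import json
import marshal
try:
    import cPickle as pickle  # Python 2, as on App Engine in 2012
except ImportError:
    import pickle             # Python 3: pickle includes the C implementation

# Synthetic JSON-style data: only dicts, lists, strings, numbers.
data = {"users": [{"id": i, "name": "user%d" % i} for i in range(1000)]}

json_blob = json.dumps(data)
pickle_blob = pickle.dumps(data, pickle.HIGHEST_PROTOCOL)
marshal_blob = marshal.dumps(data, 2)  # version 2 of the marshal format

# All three round-trip the data unchanged.
assert json.loads(json_blob) == data
assert pickle.loads(pickle_blob) == data
assert marshal.loads(marshal_blob) == data
```

Note that marshal only handles built-in types and its format is tied to the Python version, which is fine for JSON-style data that you serialize and deserialize on the same runtime.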

Challenge 2) Compression
If your serialized data doesn't fit into 1mb you need
to shard your data over multiple datastore entities and
manually build it together when loading the entities back.
If you compress your serialized data before storing it,
you pay the cost of compression and decompression,
but you have to fetch fewer datastore entities when you
want to load your data, and you have to write fewer
datastore entities when you update it.
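The sharding part can be sketched like this (the names split_blob and join_blob are my own, not from the post; the actual datastore puts/gets are omitted):

```python
MAX_ENTITY_BYTES = 1000000  # stay just under the ~1 MB datastore entity limit

def split_blob(blob, chunk_size=MAX_ENTITY_BYTES):
    """Split a serialized blob into chunks, one per datastore entity."""
    return [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]

def join_blob(chunks):
    """Reassemble the chunks fetched back from the datastore, in order."""
    return b"".join(chunks)

blob = b"x" * 2500000            # e.g. a 2.5 MB serialized payload
chunks = split_blob(blob)
assert len(chunks) == 3          # 1 MB + 1 MB + 0.5 MB
assert join_blob(chunks) == blob
```

Each chunk would be stored in its own entity (e.g. keyed "myblob:0", "myblob:1", ...) and fetched back with a batch get, which is why fewer, smaller chunks pay off.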

Solution for 1) Serialization:
cPickle is very slow. It's meant to serialize arbitrary
Python objects, not just JSON-style data. The json lib
is much faster, but it has no chance against marshal.
The Python marshal library is definitely the
way to serialize JSON here: it has the best performance.

Solution for 2) Compression:
For my use-case it makes absolute sense to
compress the data the marshal lib produces
before storing it in datastore. I have gigabytes
of JSON data. Compressing the data makes
it about 5x smaller. Doing 5x fewer datastore
operations definitely pays for the time it
takes to compress and decompress the data.
There are several compression levels you
can use with Python's zlib,
from 1 (lowest compression, but fastest)
to 9 (highest compression, but slowest).
During my tests I found that the optimum
is to compress your serialized data using
zlib with level 1 compression. Higher
compression takes too much CPU and
the result is only marginally smaller.
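You can see the tradeoff for yourself with a quick check (synthetic data below; the exact ratio depends on how repetitive your JSON is):

```python
import marshal
import zlib

# Repetitive JSON-style data, which compresses well.
data = {"rows": [{"id": i, "value": "payload-%06d" % i} for i in range(20000)]}
blob = marshal.dumps(data, 2)

fast = zlib.compress(blob, 1)   # level 1: fastest
small = zlib.compress(blob, 9)  # level 9: smallest, but far more CPU

assert len(fast) < len(blob)          # level 1 already shrinks it a lot
assert len(small) <= len(fast)        # level 9 only marginally smaller
assert zlib.decompress(fast) == blob  # lossless round-trip
```

For this kind of data the level-9 blob is only a few percent smaller than the level-1 blob, which matches the benchmark numbers below.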

Here are my test results:
  lib      ziplvl   dump (s)    load (s)    size (bytes)
  cPickle  0        1.671010    0.764567    3297275
  cPickle  1        2.033570    0.874783     935327
  json     0        0.595903    0.698307    2321719
  json     1        0.667103    0.795470     458030
  marshal  0        0.118067    0.314645    2311342
  marshal  1        0.315362    0.335677     470956
  marshal  2        0.318787    0.380117     457196
  marshal  3        0.350247    0.364908     446085
  marshal  4        0.414658    0.318973     437764
  marshal  5        0.448890    0.350013     418712
  marshal  6        0.516882    0.367595     409947
  marshal  7        0.617210    0.315827     398354
  marshal  8        1.117032    0.346452     392332
  marshal  9        1.366547    0.368925     391921

The results do not include datastore operations;
it's just about creating a blob that can be stored
in the datastore and getting the parsed data back.
The "dump" and "load" times are the seconds it takes
to do this on a Google AppEngine F1 instance
(600 MHz, 128 MB RAM).
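A harness of this shape produces those numbers (my reconstruction, not the original benchmark code; the synthetic data here is much smaller than the 4 MB test object):

```python
import time
import marshal
import zlib

def bench(name, dump, load, data, ziplvl=0):
    """Time dump+compress and decompress+load for one serializer."""
    t0 = time.time()
    blob = dump(data)
    if ziplvl:
        blob = zlib.compress(blob, ziplvl)
    t1 = time.time()
    raw = zlib.decompress(blob) if ziplvl else blob
    restored = load(raw)
    t2 = time.time()
    print("%s ziplvl: %d" % (name, ziplvl))
    print("  dump: %fs  load: %fs  size: %d" % (t1 - t0, t2 - t1, len(blob)))
    return restored

data = {"k%d" % i: list(range(50)) for i in range(2000)}
out = bench("marshal", lambda d: marshal.dumps(d, 2), marshal.loads, data,
            ziplvl=1)
assert out == data  # the blob round-trips back to the parsed data
```

The same bench() call works with json.dumps/json.loads or pickle, which is how the serializers can be compared on equal footing.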

You can also comment there or on this email thread.


Here is the library I created and use:

#!/usr/bin/env python
# Copyright 2012 MiuMeet AG
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from google.appengine.api import datastore_types
from google.appengine.ext import db

import marshal
import zlib

# Version 2 of the marshal serialization format (current on Python 2.x).
MARSHAL_VERSION = 2
# zlib level 1: fastest; higher levels cost too much CPU for a
# marginally smaller blob (see the benchmark above).
ZIP_LEVEL = 1


class JsonMarshalZipProperty(db.BlobProperty):
  """Stores a JSON serializable object using zlib and marshal in a db.Blob."""

  def default_value(self):
    return None

  def get_value_for_datastore(self, model_instance):
    value = self.__get__(model_instance, model_instance.__class__)
    if value is None:
      return None
    return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
                                 ZIP_LEVEL))

  def make_value_from_datastore(self, value):
    if value is not None:
      return marshal.loads(zlib.decompress(value))
    return value

  def validate(self, value):
    # No validation: any marshal-serializable value is accepted as-is.
    return value

  data_type = datastore_types.Blob
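The pack/unpack path the property uses can be sanity-checked without the datastore (a sketch: plain bytes stand in for db.Blob, and pack/unpack are my names for what get_value_for_datastore and make_value_from_datastore do):

```python
import marshal
import zlib

MARSHAL_VERSION = 2
ZIP_LEVEL = 1

def pack(value):
    # Mirrors get_value_for_datastore, minus the db.Blob wrapper.
    return zlib.compress(marshal.dumps(value, MARSHAL_VERSION), ZIP_LEVEL)

def unpack(blob):
    # Mirrors make_value_from_datastore.
    return marshal.loads(zlib.decompress(blob))

value = {"ids": list(range(100)), "name": "test", "nested": {"ok": True}}
assert unpack(pack(value)) == value
```

In a model you would then declare e.g. `payload = JsonMarshalZipProperty()` and assign plain dicts/lists to it; the property handles serialization transparently on put() and get().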
