Storing JSON efficiently in Datastore (in Python)

Andrin von Rechenberg

Jun 1, 2012, 2:40:38 PM
to google-a...@googlegroups.com
Hey there

If you want to store megabytes of JSON in the datastore
and get it back from the datastore into Python already parsed,
this post is for you.

I ran a couple of performance tests in which I store
a 4 MB JSON object in the datastore and then get it back at
a later point and process it.

There are several ways to do this.

Challenge 1) Serialization
You need to serialize your data, and there are several
libraries you can use for this. JSON objects can be
serialized using the json lib, the cPickle lib, or the
marshal lib (these are the libraries I'm aware of at
the moment).

Challenge 2) Compression
If your serialized data doesn't fit into 1 MB, you need
to shard your data over multiple datastore entities and
reassemble it manually when loading the entities back.
If you compress your serialized data before storing it,
you pay the cost of compression and decompression,
but you have to fetch fewer datastore entities when you
want to load your data, and you have to write fewer
datastore entities when you update your data if it is
sharded.
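
To make the sharding idea concrete, here is a minimal sketch
of what it could look like with the db API (BlobShard,
save_shards, and load_shards are made-up names, just for
illustration):

from google.appengine.ext import db

CHUNK_SIZE = 1000000  # stay safely under the ~1 MB entity limit

class BlobShard(db.Model):
  data = db.BlobProperty()

def save_shards(key_name, blob):
  # One entity per chunk, keyed so the chunks can be fetched back in order.
  shards = []
  for i in range(0, len(blob), CHUNK_SIZE):
    shards.append(BlobShard(key_name='%s:%d' % (key_name, i / CHUNK_SIZE),
                            data=db.Blob(blob[i:i + CHUNK_SIZE])))
  db.put(shards)
  return len(shards)

def load_shards(key_name, num_shards):
  keys = ['%s:%d' % (key_name, i) for i in range(num_shards)]
  return ''.join(s.data for s in BlobShard.get_by_key_name(keys))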

Solution for 1) Serialization:
cPickle is very slow. It's meant to serialize real
objects, not just JSON. The json lib is much faster,
but compared to marshal it has no chance.
The Python marshal library is definitely the
way to serialize JSON; it has the best performance.
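
The round trip is straightforward (plain standard-library
usage; the sample data here is made up):

import marshal

data = [['key%d' % i, {'value': i}] for i in range(1000)]
blob = marshal.dumps(data, 2)  # serialize with marshal version 2
assert marshal.loads(blob) == data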

Solution for 2) Compression:
For my use-case it makes absolute sense to
compress the data the marshal lib produces
before storing it in datastore. I have gigabytes
of JSON data. Compressing the data makes
it about 5x smaller. Doing 5x fewer datastore
operations definitely pays for the time it
takes to compress and decompress the data.
There are several compression levels you
can use with Python's zlib,
from 1 (lowest compression, but fastest)
to 9 (highest compression, but slowest).
During my tests I found that the optimum
is to compress your serialized data using
zlib with level 1 compression. Higher
compression takes too much CPU and
the result is only marginally smaller.
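
Combined, the whole pipeline is just two calls in each
direction (reusing the sample data object from the snippet
above):

import marshal
import zlib

blob = zlib.compress(marshal.dumps(data, 2), 1)  # level 1: fastest
data_back = marshal.loads(zlib.decompress(blob))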

Here are my test results:

serializer   ziplvl    dump (s)    load (s)   size (bytes)
cPickle           0    1.671010    0.764567        3297275
cPickle           1    2.033570    0.874783         935327
json              0    0.595903    0.698307        2321719
json              1    0.667103    0.795470         458030
marshal           0    0.118067    0.314645        2311342
marshal           1    0.315362    0.335677         470956
marshal           2    0.318787    0.380117         457196
marshal           3    0.350247    0.364908         446085
marshal           4    0.414658    0.318973         437764
marshal           5    0.448890    0.350013         418712
marshal           6    0.516882    0.367595         409947
marshal           7    0.617210    0.315827         398354
marshal           8    1.117032    0.346452         392332
marshal           9    1.366547    0.368925         391921

The results do not include datastore operations;
they cover only creating a blob that can be stored
in the datastore and getting the parsed data back.
The "dump" and "load" times are the seconds it takes
to do this on a Google App Engine F1 instance
(600 MHz, 128 MB RAM).
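
A harness along these lines can produce numbers like the
ones above (a sketch of the measurement loop, not the exact
test script used for the table):

import json
import marshal
import time
import zlib

def benchmark(name, dumps, loads, data, level):
  # Time serialization plus optional compression ("dump").
  start = time.time()
  blob = dumps(data)
  if level:
    blob = zlib.compress(blob, level)
  dump_time = time.time() - start

  # Time decompression plus parsing back into Python objects ("load").
  start = time.time()
  raw = zlib.decompress(blob) if level else blob
  loads(raw)
  load_time = time.time() - start

  print '%s ziplvl %d: dump %fs load %fs size %d' % (
      name, level, dump_time, load_time, len(blob))

# For example:
# benchmark('marshal', lambda d: marshal.dumps(d, 2), marshal.loads, data, 1)
# benchmark('json', json.dumps, json.loads, data, 1)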

Feel free to comment on this email thread.

Enjoy,
-Andrin

Here is the library I created and use:

#!/usr/bin/env python
#
# Copyright 2012 MiuMeet AG
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from google.appengine.api import datastore_types
from google.appengine.ext import db

import zlib
import marshal

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1

class JsonMarshalZipProperty(db.BlobProperty):
  """Stores a JSON serializable object using zlib and marshal in a db.Blob"""

  def default_value(self):
    return None
  
  def get_value_for_datastore(self, model_instance):
    value = self.__get__(model_instance, model_instance.__class__)
    if value is None:
      return None
    return db.Blob(zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
                                 COMPRESSION_LEVEL))

  def make_value_from_datastore(self, value):
    if value is not None:
      return marshal.loads(zlib.decompress(value))
    return value

  data_type = datastore_types.Blob
  
  def validate(self, value):
    # Accept any value: this skips BlobProperty's str check so that
    # plain Python objects (lists, dicts) can be assigned directly.
    return value


Emanuele Ziglioli

Jun 1, 2012, 7:41:19 PM
to Google App Engine
I do that with Java, and store it as gzip. I would love to serve it
uncompressed, but GAE doesn't let us set the Content-Encoding
header to gzip (well, it works on the local server, but on the GAE
server the header is stripped and the message body is corrupted).



Renzo Nuccitelli

Jun 2, 2012, 12:12:56 AM
to google-a...@googlegroups.com
Hi,

 ndb has a JsonProperty, which is parsed using the json module. It can be compressed by setting compressed=True: https://developers.google.com/appengine/docs/python/ndb/properties#compressed

 Maybe it would be good to add this solution to your tests.
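
 Declaring it is a one-liner on the model (standard ndb API;
 Payload is a made-up model name):

from google.appengine.ext import ndb

class Payload(ndb.Model):
  # JSON-serialized and zlib-compressed transparently by ndb.
  data = ndb.JsonProperty(compressed=True)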

 Renzo Nuccitelli

Brandon Wirtz

Jun 2, 2012, 1:41:23 AM
to google-a...@googlegroups.com

Nice. I like it. Going to try playing with your method vs some of the things I tried. But I suspect you are faster.

 


johnP

Jun 2, 2012, 11:09:02 AM
to Google App Engine
Is this using the Python 2.7 version of cPickle, or are you using
Python 2.5 with the pure-Python pickle aliased to cPickle?

johnP

Bryce Cutt

Jun 2, 2012, 5:34:39 PM
to google-a...@googlegroups.com
The docs for marshal seem to indicate there are no guarantees that marshaled data is compatible between Python versions. That worries me. If I decide to eventually upgrade my Python 2.5 apps to 2.7, am I going to have to convert all my data between marshal versions? While pickle is not always backwards compatible, it does appear to be forwards compatible.

I took your class and made the tiny modifications for it to use pickle instead of marshal and it is working well for me. Thanks.
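
That modification would look roughly like this (a sketch, not
the exact code):

import cPickle
import zlib

from google.appengine.api import datastore_types
from google.appengine.ext import db

COMPRESSION_LEVEL = 1

class JsonPickleZipProperty(db.BlobProperty):
  """Like JsonMarshalZipProperty, but uses cPickle for portability."""

  def get_value_for_datastore(self, model_instance):
    value = self.__get__(model_instance, model_instance.__class__)
    if value is None:
      return None
    return db.Blob(zlib.compress(
        cPickle.dumps(value, cPickle.HIGHEST_PROTOCOL), COMPRESSION_LEVEL))

  def make_value_from_datastore(self, value):
    if value is not None:
      return cPickle.loads(zlib.decompress(value))
    return value

  data_type = datastore_types.Blob

  def validate(self, value):
    return value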

Bryce Cutt

Jun 2, 2012, 6:43:33 PM
to google-a...@googlegroups.com
The type of data being serialized certainly affects how much faster marshal is. When testing with just a string approximately 1 MB in size, marshal was around 10 times as fast as JSON but only about 10% faster than pickle. With a dict of dicts of integers (around 1 MB when serialized with marshal), I found that pickle was about 50% faster than JSON but marshal was around 200 times faster than pickle!

If I run some more empirical tests I will share some real numbers. What sort of data were you using in your tests, Andrin?

Andrin von Rechenberg

Jun 3, 2012, 11:25:30 PM
to google-a...@googlegroups.com
Marshal also has versioning. I hardcoded version 2 and am hoping that it will be forward compatible.
If it's not, then oh well, I'll need to get the data out first and re-encode it, but that's a small burden compared
to the amount of speed I get.

I was using Python 2.7 with cPickle.

I don't think anything can beat marshal. Marshal is used by Python internally when serializing data structures
in compiled code (the .pyc files). So it's critical for Python's performance, and if there were something faster
than marshal, Python would definitely use it for exactly this case.

the test data was of this form:

[
  [ ~8-byte string,
    ~10-byte string,
    { about 6 key/value pairs with up to 200 bytes each }
  ],
  ...
  ...
  ...
]


Cheers,
-Andrin


Andreas

Jun 4, 2012, 9:49:57 AM
to google-a...@googlegroups.com
Great. How would this look for the ndb package?


Andreas

Jun 4, 2012, 10:41:14 AM
to google-a...@googlegroups.com
Is this a valid implementation?

import marshal
import zlib

from google.appengine.ext import ndb

MARSHAL_VERSION = 2

class JsonMarshalZipProperty(ndb.BlobProperty):

    def _to_base_type(self, value):
        return zlib.compress(marshal.dumps(value, MARSHAL_VERSION))

    def _from_base_type(self, value):
        return marshal.loads(zlib.decompress(value))

Bryce Cutt

Jun 4, 2012, 3:24:48 PM
to google-a...@googlegroups.com
aschmid: The ndb BlobProperty has optional compression built in (see ndb.model.BlobProperty). You could implement the MarshalProperty like this:

class MarshalProperty(BlobProperty):
  def _to_base_type(self, value):
    return marshal.dumps(value, MARSHAL_VERSION)
  def _from_base_type(self, value):
    return marshal.loads(value)

Then when you instantiate a property instance you would specify the compressed option to enable compression:
prop = MarshalProperty(compressed=True)

The compressed option in BlobProperty is implemented in such a way that you can turn it on and off and old values will still be read properly, as the _from_base_type() method in BlobProperty only decompresses the stored value if it actually was compressed.

The BlobProperty uses the default compression level (and does not have an option to change the compression level), so if you want to use level 1 (as Andrin recommends) you would need to implement that in your own subclass.
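
One way to do that (a sketch that performs the compression
itself rather than relying on the compressed option):

import marshal
import zlib

from google.appengine.ext import ndb

MARSHAL_VERSION = 2
COMPRESSION_LEVEL = 1

class MarshalZipProperty(ndb.BlobProperty):
  """Marshals and zlib-compresses (level 1) the value itself,
  so ndb's built-in compression stays out of the way."""

  def _to_base_type(self, value):
    return zlib.compress(marshal.dumps(value, MARSHAL_VERSION),
                         COMPRESSION_LEVEL)

  def _from_base_type(self, value):
    return marshal.loads(zlib.decompress(value))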

Bryce Cutt

Jun 4, 2012, 3:31:36 PM
to google-a...@googlegroups.com
Hi Andrin,

I have run a few more tests and now I am convinced you are right.

On Python 2.5, where pickle and json are both implemented in pure Python, the speed benefits of marshal outweigh any conversion issues in the future (which are not likely anyway). Your numbers show that even on Python 2.7 (with a C implementation of pickle), marshal is much faster. Thanks!

- Bryce

Andreas

Jun 4, 2012, 3:30:57 PM
to google-a...@googlegroups.com
OK, good to know.
But this still does not help with the 1 MB entity size limit... even after compressing, some of my JSON objects would still be over that size.
I think the only solution here is to use the Blobstore with the Files API.
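
Writing and reading such a blob looks roughly like this (a
sketch using the Files API; write_blob and read_blob are
made-up helper names):

import marshal
import zlib

from google.appengine.api import files
from google.appengine.ext import blobstore

def write_blob(data):
  # Serialize and compress, then stream the bytes into a new blob.
  payload = zlib.compress(marshal.dumps(data, 2), 1)
  file_name = files.blobstore.create(mime_type='application/octet-stream')
  with files.open(file_name, 'a') as f:
    f.write(payload)
  files.finalize(file_name)
  return files.blobstore.get_blob_key(file_name)

def read_blob(blob_key):
  # BlobReader gives file-like read access to the stored bytes.
  payload = blobstore.BlobReader(blob_key).read()
  return marshal.loads(zlib.decompress(payload))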

Bryce Cutt

Jun 4, 2012, 4:01:46 PM
to google-a...@googlegroups.com
Andreas,

Yup. I have had to resort to using the blobstore on many occasions for
exactly this reason.

One gotcha that I have run into when doing this is that there appears
to be no way to write a new blob to the blobstore (using the files
API) inside of a transaction that also modifies a datastore model,
even with the XG option enabled. I would love a solution to this if
anyone knows of one. Right now I am doing a workaround that may cause
a new blob to be written even if the transaction fails and using a
delayed task to clean that up.

- Bryce