the simplejson.loads() procedure is so slow for large json

wentrue

Jun 25, 2009, 11:02:41 AM
to CouchDB-Python
Hi, I've been using couchdb-python lately, but I've run into a problem:
JSON parsing speed. When I fetch a large JSON string from CouchDB
(about 200 million records, which takes 60 seconds), couchdb-python
then calls "json.loads()" in the couchdb.client module, and that call
takes a very long time (about 1800 seconds).

There is no doubt that simplejson has problems parsing a large JSON
string. So what are you going to do about it? Maybe this information
about Python 3.1 would be helpful.

Python 3.1rc2 Release: 2009-06-13

The json module now has a C extension to substantially improve its
performance. In addition, the API was modified so that json works only
with str, not with bytes. That change makes the module closely match
the JSON specification which is defined in terms of Unicode.

(Contributed by Bob Ippolito and converted to Py3.1 by Antoine Pitrou
and Benjamin Peterson; issue 4136.)

Issue4136
Title: merge json library with latest simplejson 2.0.x
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.1

Dirkjan Ochtman

Jun 25, 2009, 11:09:33 AM
to couchdb...@googlegroups.com
On Thu, Jun 25, 2009 at 17:02, wentrue<guoz...@gmail.com> wrote:
> It is no doubt about that there are some problems for simplejson to
> parse a large json string. So what are you going to do with it? Maybe
> the information about python3.1 would be helpful.

Since simplejson is not part of CouchDB-python, nothing? You can
upgrade your simplejson version separately, if you want. Recent
simplejson versions should be relatively fast (not sure what version
you got there).

Cheers,

Dirkjan

Randall Leeds

Jun 25, 2009, 12:26:30 PM
to couchdb...@googlegroups.com
couchdb-python only relies on simplejson when the python version is less than (I believe) 2.6. Above that it uses the built-in python modules for support. Whenever couchdb-python works on python3 we should get this boost automatically.

I evaluated a few different alternatives to simplejson last summer but found compatibility problems with all of them.

-Randall

Dirkjan Ochtman

Jun 25, 2009, 2:01:47 PM
to couchdb...@googlegroups.com
On 25/06/2009 18:26, Randall Leeds wrote:
> couchdb-python only relies on simplejson when the python version is less
> than (I believe) 2.6. Above that it uses the built-in python modules for
> support.

It's the other way around (snippet from client.py):

try:
    import simplejson as json
except ImportError:
    import json # Python 2.6

So it tries simplejson first, then falls back to json.
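
The import order above can be turned into a small helper that also reports which module actually got picked. This is only a sketch for Python 3, not code from couchdb-python; the name `pick_json_module` is made up for illustration.

```python
import importlib


def pick_json_module(candidates=("simplejson", "json")):
    """Return the first importable module from candidates, in order."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed, try the next candidate
    raise ImportError("none of %r could be imported" % (candidates,))


json_mod = pick_json_module()
# Shows whether simplejson or the stdlib fallback is in use.
print(json_mod.__name__)
```

Swapping the order of `candidates` gives the "stdlib first" behaviour discussed later in the thread.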

Cheers,

Dirkjan

Sergey Shepelev

Jun 25, 2009, 12:30:36 PM
to couchdb...@googlegroups.com
On Thu, Jun 25, 2009 at 8:26 PM, Randall Leeds <randal...@gmail.com> wrote:
> couchdb-python only relies on simplejson when the python version is less than (I believe) 2.6. Above that it uses the built-in python modules for support. Whenever couchdb-python works on python3 we should get this boost automatically.

The built-in module starting from 2.6 is simplejson. It's just renamed to json.

> I evaluated a few different alternatives to simplejson last summer but found compatibility problems with all of them.

What problems with cjson?

Randall Leeds

Jun 25, 2009, 5:32:44 PM
to couchdb...@googlegroups.com
According to http://code.google.com/p/couchdb-python/source/detail?r=93 it was utf-8 problems, but I don't remember the specifics. However, almost a year ago I made these commits which tried cjson before falling back to simplejson and the consensus at the time was that it was a bad idea.

On Thu, Jun 25, 2009 at 12:30, Sergey Shepelev <tem...@gmail.com> wrote:
> On Thu, Jun 25, 2009 at 8:26 PM, Randall Leeds <randal...@gmail.com> wrote:
>> couchdb-python only relies on simplejson when the python version is less than (I believe) 2.6. Above that it uses the built-in python modules for support. Whenever couchdb-python works on python3 we should get this boost automatically.
>
> Builtin module starting from 2.6 is simplejson. It's just renamed to json.

Should we change the import statement to try 'json' first so that simplejson is more clearly the fallback?

On Thu, Jun 25, 2009 at 11:09, Dirkjan Ochtman <dir...@ochtman.nl> wrote:
> Since simplejson is not part of CouchDB-python, nothing? You can
> upgrade your simplejson version separately, if you want. Recent
> simplejson versions should be relatively fast (not sure what version
> you got there).

Yeah, recent simplejson has some native components I think, but not as much as cjson.

-Randall

Sergey Shepelev

Jun 25, 2009, 7:16:58 PM
to couchdb...@googlegroups.com
On Fri, Jun 26, 2009 at 1:32 AM, Randall Leeds<randal...@gmail.com> wrote:
> According to http://code.google.com/p/couchdb-python/source/detail?r=93 it
> was utf-8 problems, but I don't remember the specifics. However, almost a
> year ago I made these commits which tried cjson before falling back to
> simplejson and the consensus at the time was that it was a bad idea.
> On Thu, Jun 25, 2009 at 12:30, Sergey Shepelev <tem...@gmail.com> wrote:
>>
>>
>> On Thu, Jun 25, 2009 at 8:26 PM, Randall Leeds <randal...@gmail.com>
>> wrote:
>>>
>>> couchdb-python only relies on simplejson when the python version is less
>>> than (I believe) 2.6. Above that it uses the built-in python modules for
>>> support. Whenever couchdb-python works on python3 we should get this boost
>>> automatically.
>>
>> Builtin module starting from 2.6 is simplejson. It's just renamed to json.
>
> Should we change the import statement to try 'json' first so that simplejson
> is more clearly the fallback?
>

The fallback should be whichever is the older version of simplejson.
Depending on the versions of Python and simplejson installed, it could
go either way. And if Python's built-in json continues to evolve
separately from simplejson, and maybe even gets faster by having more
of it rewritten in C, then yes, you'd better bet on json.

I guess you should do some benchmarking before making the decision.

On my Python 2.6 I get these results:

>>> import json, simplejson, time
>>> t = time.time(); _ = [ json.dumps(D) for _ in xrange(10000) ] ; time.time()-t
7.8098819255828857
>>> t = time.time(); _ = [ simplejson.dumps(D) for _ in xrange(10000) ] ; time.time()-t
2.257606029510498

Which leads me to guess that Python's built-in json doesn't use C
extensions, or something horrible like that. According to these
results, your import order is just fine :)
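
A comparison like the one above is a bit easier to repeat with `timeit`. The following is a rough sketch for Python 3 (so no `xrange`); the sample document and the helper name `time_dumps` are invented for illustration, and simplejson is only timed if it happens to be installed.

```python
import json
import timeit

# Hypothetical sample document; substitute data shaped like your own.
SAMPLE = {"id": 1, "tags": ["a", "b", "c"], "nested": {"x": 1.5, "y": None}}


def time_dumps(module, number=10000):
    """Seconds taken to serialize SAMPLE `number` times with module.dumps."""
    return timeit.timeit(lambda: module.dumps(SAMPLE), number=number)


results = {"json": time_dumps(json)}
try:
    import simplejson
    results["simplejson"] = time_dumps(simplejson)
except ImportError:
    pass  # simplejson not installed; only the stdlib is measured

for name, seconds in sorted(results.items()):
    print("%-10s %.4f s" % (name, seconds))
```

The absolute numbers depend heavily on the Python and library versions, so they are only meaningful relative to each other on the same machine.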

I also hoped for dramatic results when comparing simplejson and cjson,
but the difference I got was only 6%. Of course, cjson is faster, but
the difference is nothing to shout about.

But there is a more interesting result. cjson raises an exception when
you feed it a dictionary with a non-string key.
simplejson just silently converts the key to a string:

>>> simplejson.dumps({0:1})
'{"0": 1}'

>>> simplejson.dumps({None:1})
'{"null": 1}'

I don't think that's correct behaviour :)
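
The standard library json module in Python 3 coerces keys in the same way as the simplejson session above. If that silent conversion is unwanted, one option is to validate the data before encoding. The check below is a sketch, and `require_string_keys` and `strict_dumps` are made-up helpers, not part of any JSON library.

```python
import json

# The stdlib module converts these keys instead of raising.
print(json.dumps({0: 1}))     # {"0": 1}
print(json.dumps({None: 1}))  # {"null": 1}


def require_string_keys(obj):
    """Raise TypeError if any dict nested in obj has a non-string key."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if not isinstance(key, str):
                raise TypeError("non-string key: %r" % (key,))
            require_string_keys(value)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            require_string_keys(item)


def strict_dumps(obj):
    """Like json.dumps, but refuses dicts with non-string keys."""
    require_string_keys(obj)
    return json.dumps(obj)
```

The extra traversal costs time on large documents, so it is probably better suited to tests and debugging than to a hot path.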

wentrue

Jun 25, 2009, 11:22:33 PM
to CouchDB-Python
I have done some experiments in Python 2.5.2 which illustrate how
slow simplejson is when it encounters large data.

>>> import simplejson as json
>>> import time
>>> j=dict([(i,{i:i+1}) for i in xrange(1000000)])
>>> t=time.time();s=json.dumps(j);time.time()-t
27.05741286277771
>>> t=time.time();nj=json.loads(s);time.time()-t
77.924577951431274
>>> j=dict([(i,{i:i+1,i+1:i-1}) for i in xrange(1000000)])
>>> t=time.time();s=json.dumps(j);time.time()-t
32.886076927185059
>>> t=time.time();nj=json.loads(s);time.time()-t
125.49113798141479
>>> j=dict([(i,{i:i+1,i+1:i-1}) for i in xrange(2000000)])
>>> t=time.time();s=json.dumps(j);time.time()-t
56.730959177017212
>>> t=time.time();nj=json.loads(s);time.time()-t
300.81360006332397

I'll try to upgrade simplejson and turn on the C support
(http://code.google.com/p/simplejson/source/browse/trunk/simplejson/_speedups.c).




wentrue

Jun 25, 2009, 11:48:02 PM
to CouchDB-Python
I installed the latest simplejson with the C speedups.

~$sudo easy_install -U simplejson

Then I ran the experiments again; the results are listed below. It
seems better now.

>>> import simplejson as json
>>> import time
>>> j=dict([(i,{i:i+1}) for i in xrange(1000000)])
>>> t=time.time();s=json.dumps(j);time.time()-t
2.626917839050293
>>> t=time.time();nj=json.loads(s);time.time()-t
16.414155960083008
>>> j=dict([(i,{i:i+1,i+1:i-1}) for i in xrange(1000000)])
>>> t=time.time();s=json.dumps(j);time.time()-t
4.0347471237182617
>>> t=time.time();nj=json.loads(s);time.time()-t
31.043606996536255
>>> j=dict([(i,{i:i+1,i+1:i-1}) for i in xrange(2000000)])
>>> t=time.time();s=json.dumps(j);time.time()-t
7.8784449100494385
>>> t=time.time();nj=json.loads(s);time.time()-t
95.534356832504272
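
Before re-running timings like these, it can help to confirm that the C accelerators are actually loaded. The sketch below, written for Python 3, inspects attributes that the stdlib json package exposes (`c_make_scanner`, `c_scanstring`, `c_make_encoder`) and that simplejson is assumed to expose under the same names; each is None when only the pure-Python code is available. Treat the attribute names as an assumption and check them against the version you have installed. `speedups_status` is a made-up helper.

```python
import importlib


def speedups_status(package="json"):
    """Map each known accelerator hook to True if a C version is loaded."""
    hooks = {
        "scanner": "c_make_scanner",
        "decoder": "c_scanstring",
        "encoder": "c_make_encoder",
    }
    status = {}
    for submodule, attr in hooks.items():
        mod = importlib.import_module("%s.%s" % (package, submodule))
        # A missing attribute or None both mean "no C implementation here".
        status[attr] = getattr(mod, attr, None) is not None
    return status


print(speedups_status("json"))
```

Passing "simplejson" instead of "json" would inspect the third-party package, provided it is installed.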

Christopher Lenz

Jun 26, 2009, 6:48:15 AM
to couchdb...@googlegroups.com
On 25.06.2009, at 23:32, Randall Leeds wrote:
> On Thu, Jun 25, 2009 at 12:30, Sergey Shepelev <tem...@gmail.com>
> wrote:
>
> On Thu, Jun 25, 2009 at 8:26 PM, Randall Leeds <randal...@gmail.com
> > wrote:
> couchdb-python only relies on simplejson when the python version is
> less than (I believe) 2.6. Above that it uses the built-in python
> modules for support. Whenever couchdb-python works on python3 we
> should get this boost automatically.
>
> Builtin module starting from 2.6 is simplejson. It's just renamed to
> json.
>
> Should we change the import statement to try 'json' first so that
> simplejson is more clearly the fallback?

No. The rationale behind this order is that simplejson has a faster
release cycle than the stdlib, so if you have it installed, it's
likely to be a newer version (compared to the one bundled with Python).

Ideally, there'd be an explicit way to tell CouchDB-Python which JSON
module to use. If we had that, it'd be pretty easy to also add cjson
support. The simplest option would be to read from an environment
variable (or a module-level variable), but that's rather ugly, and
only works nicely when you don't have more than one application per-
process. Anything else would require accepting and passing through the
JSON module choice in all the various APIs, which is probably rather
tedious.
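
The environment-variable option could look roughly like the sketch below (Python 3). The variable name `COUCHDB_JSON_MODULE` and the function `load_json_module` are invented here for illustration; nothing in this thread says couchdb-python reads such a variable.

```python
import importlib
import os

ENV_VAR = "COUCHDB_JSON_MODULE"  # hypothetical name, for illustration only


def load_json_module(environ=os.environ):
    """Import the module named in ENV_VAR, else simplejson, else stdlib json."""
    name = environ.get(ENV_VAR)
    if name:
        return importlib.import_module(name)  # an explicit choice wins
    try:
        return importlib.import_module("simplejson")
    except ImportError:
        return importlib.import_module("json")


json_mod = load_json_module()
print(json_mod.__name__)
```

Because the environment is shared by the whole process, every application running in that process gets the same choice, which is the limitation described above.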

So that's why we only support simplejson right now, and hope it just
gets faster and faster :P

Cheers,
--
Christopher Lenz
cmlenz at gmx.de
http://www.cmlenz.net/

Sergey Shepelev

Jun 26, 2009, 7:54:45 AM
to couchdb...@googlegroups.com
Personally, I'm all for an environment variable; that's a great way to
configure software.
I see nothing ugly in it.

As for one application per process, that's pretty much what a process
is :) Even if someone is multiplexing several applications in one
process and wants a different JSON library in each, they can't achieve
that with the current hardcoded library binding either. So the
environment variable approach gives at least the same possibilities as
now, plus extra flexibility for those who want it. It doesn't give
abstract 100% flexibility (like a different JSON library per call, with
conditions defined in some configuration file, and other crazy
schemes), which is not needed anyway.

Anand Chitipothu

Jun 26, 2009, 7:59:27 AM
to couchdb...@googlegroups.com
Here is a workaround.

>>> import cjson
>>>
>>> import sys
>>> sys.modules['simplejson'] = cjson
>>>
>>> import simplejson
>>> simplejson
<module 'cjson' from
'/Users/anand/.python-eggs/python_cjson-1.0.6-py2.5-macosx-10.5-i386.egg-tmp/cjson.so'>

-Anand

Sergey Shepelev

Jun 26, 2009, 8:11:08 AM
to couchdb...@googlegroups.com
And then you also need to do this:

sys.modules['simplejson'].dumps = cjson.encode
sys.modules['simplejson'].loads = cjson.decode

And hope that the user doesn't try to encode dicts with non-string keys
(which simplejson silently converts to strings, while cjson raises an
exception), and also hope that the user doesn't try to use anything
other than those two functions of the simplejson module. Couchdb-python
may not be the only user of the JSON library in the program.
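
A slightly safer variant of this workaround is to register a separate shim module instead of mutating cjson itself. The sketch below is for Python 3 and uses a stand-in backend with encode/decode functions, since cjson may not be installable there; `FakeCjson`, `make_shim`, and the shim's module name are all made up for illustration.

```python
import json
import sys
import types


class FakeCjson:
    """Stand-in for a backend exposing encode/decode, as cjson does."""
    encode = staticmethod(json.dumps)
    decode = staticmethod(json.loads)


def make_shim(backend, name="simplejson_shim"):
    """Build a module object exposing dumps/loads on top of encode/decode."""
    shim = types.ModuleType(name)
    # Ignore simplejson-specific keyword arguments the backend may not accept.
    shim.dumps = lambda obj, **_ignored: backend.encode(obj)
    shim.loads = lambda text, **_ignored: backend.decode(text)
    return shim


shim = make_shim(FakeCjson)
# Registering under the name 'simplejson' would replace it process-wide:
# sys.modules['simplejson'] = shim
sys.modules[shim.__name__] = shim

import simplejson_shim
print(simplejson_shim.loads(simplejson_shim.dumps({"a": 1})))
```

Silently dropping keyword arguments hides behavioural differences between backends, so the caveats listed above still apply.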

Christopher Lenz

Jul 1, 2009, 8:58:40 AM
to couchdb...@googlegroups.com

I have a change ready in my local workspace that adds a couchdb.json
module, which is used as a thin abstraction layer between the actual
JSON module and the rest of couchdb-python. Instead of an environment
variable, it uses a module-level global to determine which library
gets used.

Example:

from couchdb import json
json.use(module='cjson')

Currently supported modules are 'json' (stdlib), 'simplejson', and
'cjson'. The current behavior remains the default, i.e. try simplejson
first, and fall back to stdlib-json.

You can also plug in different JSON encoding/decoding routines using
the same use() function:

from couchdb import json
json.use(encode=myencodefun, decode=mydecodefun)

I think this is pretty nice, and am only waiting for svn@googlecode to
come back to life to check it in.

Comments?

Christopher Lenz

Jul 1, 2009, 9:17:22 AM
to couchdb...@googlegroups.com
On 01.07.2009, at 14:58, Christopher Lenz wrote:
> I have a change ready in my local workspace that adds a couchdb.json
> module, which is used as a thin abstraction layer between the actual
> JSON module and the rest of couchdb-python. Instead of an environment
> variable, it uses a module-level global to determine which library
> gets used.
>
> Example:
>
> from couchdb import json
> json.use(module='cjson')
>
> Currently support modules are 'json' (stdlib), 'simplejson', and
> 'cjson'. The current behavior remains the default, i.e. try simplejson
> first, and fall back to stdlib-json.
>
> You can also plug in a different JSON encoding/decoding routines using
> the same use() function:
>
> from couchdb import json
> json.use(encode=myencodefun, decode=mydecodefun)
>
> I think this is pretty nice, and am only waiting for svn@googlecode to
> come back to life to check it in.

Okay it's checked in now:

<http://code.google.com/p/couchdb-python/source/browse/trunk/couchdb/json.py>

Sergey Shepelev

Jul 1, 2009, 10:26:13 AM
to couchdb...@googlegroups.com
That's great, except for module='cjson' syntax.
Why quotes?

Christopher Lenz

Jul 1, 2009, 10:53:20 AM
to couchdb...@googlegroups.com

Changed in r172. You can now specify the module either as a string or
by passing in the module object itself.

(Internally, it just looks for module.__name__ and works with that.
Otherwise we'd need to import all the supported modules up front to be
able to make the comparison and correctly map the encode/decode
functions, which would be too expensive IMHO.)
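
The name-based lookup with deferred imports described here might be sketched as follows. This is not the code checked into couchdb/json.py, only an illustration for Python 3; the registry layout and the private names are assumptions, and cjson is only imported if it is actually selected.

```python
import importlib

# Maps a module name to the attribute names of its (encode, decode) functions.
_KNOWN = {
    "json": ("dumps", "loads"),
    "simplejson": ("dumps", "loads"),
    "cjson": ("encode", "decode"),
}
_selected = "json"
_encode = _decode = None  # resolved lazily on first use


def use(module=None, encode=None, decode=None):
    """Select a backend by name or module object, or give explicit functions."""
    global _selected, _encode, _decode
    if module is not None:
        name = module if isinstance(module, str) else module.__name__
        if name not in _KNOWN:
            raise ValueError("unsupported JSON module: %r" % (name,))
        _selected, _encode, _decode = name, None, None  # defer the import
    elif encode is not None and decode is not None:
        _encode, _decode = encode, decode
    else:
        raise ValueError("pass module, or both encode and decode")


def _resolve():
    global _encode, _decode
    if _encode is None or _decode is None:
        mod = importlib.import_module(_selected)  # imported only when needed
        enc_name, dec_name = _KNOWN[_selected]
        _encode, _decode = getattr(mod, enc_name), getattr(mod, dec_name)


def encode(obj):
    _resolve()
    return _encode(obj)


def decode(text):
    _resolve()
    return _decode(text)
```

Comparing on the name means a module object and a string are handled by the same code path, and unsupported names fail early with a ValueError rather than at the first encode call.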

Thanks,

Sergey Shepelev

Jul 1, 2009, 1:02:08 PM
to couchdb...@googlegroups.com
def use(module=None, encode=None, decode=None):
    if module is not None:
        encode = module.dumps  # dumps serializes, so it is the encoder
        decode = module.loads  # loads parses, so it is the decoder
    if encode is None or decode is None:
        raise ValueError("Either module or encode/decode pair must be specified")
    _our_global_setting.encode = encode
    _our_global_setting.decode = decode


module works for json/simplejson, encode/decode pair works for cjson.
No imports at all.

Christopher Lenz

Jul 2, 2009, 5:15:57 AM
to couchdb...@googlegroups.com

I don't really like this. I think it's counter-intuitive and awkward
that module= would work with simplejson but not cjson. What about some
random JSON module that had loads/dumps functions, but different
parameters? (We currently call dumps with allow_nan=False,
ensure_ascii=False)

I suppose it comes down to a matter of taste in this case, and the
difference isn't all that big ;)
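
To make the concern about differing parameters concrete, here is what the two keyword arguments mentioned above do in the standard library json module under Python 3 (simplejson is assumed to accept the same arguments). The function `limited_dumps` is a hypothetical stand-in for a third-party module whose dumps takes no keyword arguments.

```python
import json

# allow_nan=False rejects values that have no JSON representation.
try:
    json.dumps(float("nan"), allow_nan=False)
except ValueError as exc:
    print("rejected:", exc)

# ensure_ascii=False keeps non-ASCII characters instead of escaping them.
print(json.dumps("é"))                      # "\u00e9"
print(json.dumps("é", ensure_ascii=False))  # "é"


def limited_dumps(obj):
    """Hypothetical third-party dumps() that takes no keyword arguments."""
    return json.dumps(obj)


try:
    limited_dumps({"a": 1}, allow_nan=False, ensure_ascii=False)
except TypeError as exc:
    print("incompatible backend:", exc)
```

A generic module= hook that blindly calls dumps with these arguments would fail in the same way as the last call.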

Cheers,

Sergey Shepelev

Jul 2, 2009, 5:22:35 AM
to couchdb...@googlegroups.com
On Thu, Jul 2, 2009 at 1:15 PM, Christopher Lenz<cml...@gmx.de> wrote:
>
> On 01.07.2009, at 19:02, Sergey Shepelev wrote:
>> On Wed, Jul 1, 2009 at 6:53 PM, Christopher Lenz<cml...@gmx.de> wrote:
>>> (Internally, it just looks for module.__name__ and works with that.
>>> Otherwise we'd need to import all the supported modules up front to
>>> be
>>> able to make the comparison and correctly map the encode/decode
>>> functions, which would be too expensive IMHO.)
>>>
>>
>> def use(module=None, encode=None, decode=None):
>>     if module is not None:
>>         encode = module.dumps
>>         decode = module.loads
>>     if encode is None or decode is None:
>>         raise ValueError("Either module or encode/decode pair must be specified")
>>     _our_global_setting.encode = encode
>>     _our_global_setting.decode = decode
>>
>>
>> module works for json/simplejson, encode/decode pair works for cjson.
>> No imports at all.
>
> I don't really like this. I think it's counter-intuitive and awkward
> that module= would work with simplejson but not cjson. What about some
> random JSON module that had loads/dumps functions, but different
> parameters? (We currently call dumps with allow_nan=False,
> ensure_ascii=False)
>

Deep knowledge of the module is great; I'm all with you. But you've
already run into the drawbacks of that.

If it's really not just a simple __import__ and you're making use of
some knowledge about the module, then I'd rather have a separate
use_lib("simplejson") and use(encode, decode). I take back my words
about the quotes around module. That's using a library and knowing what
it is, instead of just using a generic module interface.