simplejson 2.0 @ pylons 0.9.7


Max Ischenko

Apr 2, 2009, 3:01:23 AM
to pylons-...@googlegroups.com
Hello,

I have run into another problem while updating my code to Pylons 0.9.7.

I'm using the simplejson library to parse JSON feeds, and simplejson 2.0 fails to parse the same content that was successfully parsed with 1.9.
I cannot require simplejson 1.9, since setuptools complains about a conflict:
Installed distribution simplejson 1.7.1 conflicts with requirement simplejson>=2.0.8

Probably Pylons or PasteScript or something else has this requirement.

Here is the error I get with parsing json:

 $ python zz.py
Traceback (most recent call last):
  File "zz.py", line 5, in <module>
    data =  simplejson.loads(js)
  File "/usr/lib/python2.5/site-packages/PIL/__init__.py", line 305, in loads
    
  File "build/bdist.linux-i686/egg/simplejson/decoder.py", line 329, in decode
  File "build/bdist.linux-i686/egg/simplejson/decoder.py", line 345, in raw_decode
ValueError: Invalid control character at: line 16 column 418 (char 1244)

$ cat zz.py

import simplejson

js = open('z').read()
data =  simplejson.loads(js)


And the json content itself:

http://www.developers.org.ua/static/js/z.txt

I've tried calling .decode('utf8') on the content before feeding it to simplejson; it makes no difference.

Any ideas how to fix the JSON feed or resolve the requirement conflict?

--
Max.Ischenko // twitter.com/maxua

Deron Meranda

Apr 2, 2009, 3:18:29 AM
to pylons-...@googlegroups.com
On Thu, Apr 2, 2009 at 3:01 AM, Max Ischenko <isch...@gmail.com> wrote:
> Here is the error I get with parsing json:
>
> ValueError: Invalid control character at: line 16 column 418 (char 1244)
>
> http://www.developers.org.ua/static/js/z.txt

You have invalid JSON. I suspect that the newer version of
simplejson is just being more pedantic, i.e., correct; whereas
the older version was letting you get by with bad input.

The problem is that JSON does not permit line separator characters
to appear within a quoted string literal. Your value for the "content"
item has several linefeed characters in it.

If you want that character inside a JSON string you must escape it.
Either with "\n" or with "\u000a".
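
A minimal sketch of the difference, assuming the standard-library json module in place of simplejson 2.0 (the sample document and variable names are made up for illustration):

```python
import json

# A raw linefeed inside a quoted string is rejected by a strict parser.
bad = '{"content": "line one\nline two"}'
try:
    json.loads(bad)
    raw_newline_accepted = True
except ValueError:
    raw_newline_accepted = False

# The same value with the linefeed escaped is valid JSON.
# Note the doubled backslash: the JSON text holds the two characters \ and n.
good = '{"content": "line one\\nline two"}'
data = json.loads(good)

# When producing JSON, json.dumps escapes control characters for you.
encoded = json.dumps({"content": "line one\nline two"})
```

If the feed is generated by your own code, building it with a JSON encoder rather than by string concatenation avoids the problem at the source.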


BTW, if you have my demjson python package installed, the included
"jsonlint" command will give you a slightly more understandable error
message (in this case). Just another debugging tool at your disposal...

$ jsonlint -v z.json
z.json: line terminator characters must be escaped inside string
literals: u'\n<h3>\u0423\u0441...
--
Deron Meranda

Max Ischenko

Apr 2, 2009, 3:25:57 AM
to pylons-...@googlegroups.com
On Thu, Apr 2, 2009 at 10:18, Deron Meranda <deron....@gmail.com> wrote:

On Thu, Apr 2, 2009 at 3:01 AM, Max Ischenko <isch...@gmail.com> wrote:
> Here is the error I get with parsing json:
>
> ValueError: Invalid control character at: line 16 column 418 (char 1244)
>
You have invalid JSON.  I suspect that the newer version of
simplejson is just being more pedantic, i.e., correct; whereas
the older version was letting you get by with bad input.

Many thanks!

The problem was indeed with the \n characters. I have tried jsonlint.com, but it doesn't give meaningful error messages, just "syntax error".

Deron Meranda

Apr 2, 2009, 4:17:24 AM
to pylons-...@googlegroups.com
> The problem was indeed with the \n characters. I have tried jsonlint.com, but it
> doesn't give meaningful error messages, just "syntax error".

(Don't confuse "jsonlint.com", the online service, with "jsonlint", which is a
command line tool that is part of my demjson python package)

Unfortunately http://jsonlint.com/ is not really the best "lint". It's a great
service for the casual syntax error, and it's free and easy which is wonderful!
But it can miss some of the edge cases, and as you've seen, doesn't
always give you the most helpful messages.


Fortunately, Python has several high quality JSON packages today (one of
which has been adopted into the standard Python 3.0 library). Having
collaborated with most of the various package authors, I can say they are
all today, for the most part, very correct in their interpretation of the
JSON standard and all its subtle nuances. So if one doesn't give you an
understandable error message, one of the others is likely to do so.

Although JSON appears quite simple on the surface, it is actually surprisingly
intricate to implement parsers *correctly*. But I think that the JSON modules
available for Python are probably the best (as in correct) implementations
available in any language.


Oh, I'm not really trying to pimp my package, because simplejson is
perfectly fine! But in case the need ever arises, I've been told that my
demjson is one of the best for being able to parse badly-formed JSON
(it has a toggleable strict/nonstrict mode); and it also generally gives
fairly descriptive error messages. Unfortunately though, not even
it will allow linefeeds inside of strings, even in its loosest mode.
--
Deron Meranda

Bob Ippolito

Apr 2, 2009, 7:30:31 AM
to pylons-...@googlegroups.com
On Thu, Apr 2, 2009 at 3:17 AM, Deron Meranda <deron....@gmail.com> wrote:
>
>> The problem was indeed with the \n characters. I have tried jsonlint.com, but it
>> doesn't give meaningful error messages, just "syntax error".
>
> (Don't confuse "jsonlint.com", the online service, with "jsonlint", which is a
> command line tool that is part of my demjson python package)
>
> Unfortunately http://jsonlint.com/ is not really the best "lint".  It's a great
> service for the casual syntax error, and it's free and easy which is wonderful!
> But it can miss some of the edge cases, and as you've seen, doesn't
> always give you the most helpful messages.
>
>
> Fortunately, Python has several high quality JSON packages today (one of
> which has been adopted into the standard Python 3.0 library).  Having
> collaborated with most of the various package authors, I can say they are
> all today, for the most part, very correct in their interpretation of the
> JSON standard and all its subtle nuances.  So if one doesn't give you an
> understandable error message, one of the others is likely to do so.

"one of which" is simplejson, and it is also in the Python 2.6+
standard library. It's called json there.

> Although JSON appears quite simple on the surface, it is actually surprisingly
> intricate to implement parsers *correctly*.  But I think that the JSON modules
> available for Python are probably the best (as in correct) implementations
> available in any language.
>
>
> Oh, I'm not really trying to pimp my package, because simplejson is
> perfectly fine!  But in case the need ever arises, I've been told that my
> demjson is one of the best for being able to parse badly-formed JSON
> (it has a toggleable strict/nonstrict mode); and it also generally gives
> fairly descriptive error messages. Unfortunately though, not even
> it will allow linefeeds inside of strings, even in its loosest mode.

simplejson allows linefeeds in JSON if you pass strict=False (although
its strict parameter ONLY applies to control characters in strings).
Some of the pedanticness of simplejson is your fault for comparing it
to your package :)

>>> import simplejson
>>> simplejson.loads('"foo\nbar"', strict=False)
'foo\nbar'
>>> simplejson.loads('"foo\nbar"')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mochi/lib/python2.5/site-packages/PIL/__init__.py", line 307, in loads
  File "/mochi/lib/python2.5/site-packages/simplejson-2.0.9-py2.5-macosx-10.3-i386.egg/simplejson/decoder.py", line 335, in decode
  File "/mochi/lib/python2.5/site-packages/simplejson-2.0.9-py2.5-macosx-10.3-i386.egg/simplejson/decoder.py", line 351, in raw_decode
ValueError: Invalid control character at: line 1 column 4 (char 4)

If anyone has any suggestions on how to format that error message so
that it is more clear then I'd be willing to give it a shot. I suppose
the repr of the control character in that exception would be useful?

-bob

Deron Meranda

Apr 2, 2009, 11:00:25 AM
to pylons-...@googlegroups.com
I guess we're getting a little off the topic of Pylons here (my fault), but
it's useful stuff anyway....


On Thu, Apr 2, 2009 at 7:30 AM, Bob Ippolito <b...@redivi.com> wrote:
> "one of which" is simplejson, and it is also in the Python 2.6+
> standard library. It's called json there.

Thanks for the clarification. I forgot the exact details and was too lazy
to look it up.

You know, I never did hear exactly how simplejson became the standard
json. Other than just a package rename, and perhaps internal code
formatting and cleanup; were there any notable changes made to it?


> simplejson allows linefeeds in JSON if you pass strict=False (although
> its strict parameter ONLY applies to control characters in strings).

Ok, then this can solve the poster's problem! ... considering he can't
fix the JSON, which would be the preferable solution.
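
A minimal sketch of that workaround, assuming the standard-library json module (which accepts the same strict keyword). The helper name load_lenient and the sample text are hypothetical stand-ins; the real feed is not reproduced here:

```python
import json

# Stand-in for the contents of the feed file.
feed_text = '{"title": "example", "content": "first line\nsecond line"}'

def load_lenient(text):
    """Parse JSON, tolerating raw control characters inside strings."""
    try:
        # Try the strict (default) parse first so valid input is unaffected.
        return json.loads(text)
    except ValueError:
        # strict=False only relaxes the control-character check;
        # any other syntax error still raises ValueError.
        return json.loads(text, strict=False)

data = load_lenient(feed_text)
```

Trying the strict parse first keeps the lenient path visible as a fallback, which makes it easier to log how often the feed is actually malformed.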


> Some of the pedanticness of simplejson is your fault for comparing it
> to your package :)

Actually, I think the pedanticness is a *good* thing.

Look at all the bad JSON it's helping to discover. I'd hate to see JSON
take the path of HTML tag soup, where lenient parsers promote sloppy
input. Having parsers that are strictly pedantic (by default) is one of the
ways to guard against that.

Now if we can just get some of the other language implementations
up to the quality of the Python ones :)

And yes, my comparison page is woefully out of date now; I really need
to update it to reflect all the fixes you guys made!


> If anyone has any suggestions on how to format that error
> message so that it is more clear then I'd be willing to give it a shot.
> I suppose the repr of the control character in that exception would
> be useful?

Perhaps one option would be to add an additional argument to the
ValueError instance that includes the offending character. Seeing
a "\n" in the traceback might make it more obvious to users what
character is bogus.

Another might be to treat "\n" or "\r" as special cases, solely for
choosing the error message text. I suspect that embedded newlines
are a common error (although since they're not even permitted in JavaScript
I'm not sure where that idea is coming from). So seeing an error like

ValueError("Quoted strings may not span multiple lines, invalid character","\n")

might make more sense to people than illegal control code?

Actually in general I kind of like it when parsers of any type return a snippet
of the offending source with the error, to provide context. And you can
always just pack more arguments into any of Python's error classes
to carry this.

Though your package does display the line number of the error, something
that mine unfortunately doesn't do :(
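
A rough sketch of the two ideas above (a newline-specific message, plus extra arguments carrying the offending character and a source snippet). The function name make_newline_error and the context parameter are hypothetical, not part of simplejson or demjson:

```python
def make_newline_error(text, offset, context=20):
    """Build a ValueError carrying the offending character and nearby source."""
    char = text[offset]
    # Slice a window of the input around the error position for context.
    snippet = text[max(0, offset - context):offset + context]
    if char in "\r\n":
        message = "Quoted strings may not span multiple lines, invalid character"
    else:
        message = "Invalid control character"
    # Extra positional arguments are kept on the exception's .args tuple.
    return ValueError(message, char, snippet)

doc = '"foo\nbar"'
err = make_newline_error(doc, doc.index("\n"))
message, char, snippet = err.args
```

Because the character is stored as a separate argument, its repr (with the visible backslash escape) shows up when the exception is printed in a traceback.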
--
Deron Meranda

Bob Ippolito

Apr 2, 2009, 5:47:32 PM
to pylons-...@googlegroups.com
On Thu, Apr 2, 2009 at 10:00 AM, Deron Meranda <deron....@gmail.com> wrote:
>
> I guess we're getting a little off the topic of Pylons here (my fault), but
> it's useful stuff anyway....
>
>
> On Thu, Apr 2, 2009 at 7:30 AM, Bob Ippolito <b...@redivi.com> wrote:
>> "one of which" is simplejson, and it is also in the Python 2.6+
>> standard library. It's called json there.
>
> Thanks for the clarification.  I forgot the exact details and was too lazy
> to look it up.
>
> You know, I never did hear exactly how simplejson became the standard
> json.  Other than just a package rename, and perhaps internal code
> formatting and cleanup; were there any notable changes made to it?

Other than renaming the package the only changes are related to py3k
forward compatibility. simplejson is far ahead of what ships with 2.6,
but it's in sync on 2.7 trunk.

> And yes, my comparison page is woefully out of date now; I really need
> to update it to reflect all the fixes you guys made!

Yeah, simplejson doesn't have any more correctness nits and it's
faster or at least comparable in performance to any of the others
these days. The only thing it doesn't really have is a lenient parser,
which I'm perfectly happy letting someone else write and maintain.

>> If anyone has any suggestions on how to format that error
>> message so that it is more clear then I'd be willing to give it a shot.
>> I suppose the repr of the control character in that exception would
>> be useful?
>
> Perhaps one option would be to add an additional argument to the
> ValueError instance that includes the offending character.  Seeing
> a "\n" in the traceback might make it more obvious to users what
> character is bogus.
>
> Another might be to treat "\n" or "\r" as special cases, solely for
> choosing the error message text.  I suspect that embedded newlines
> are a common error (although since they're not even permitted in JavaScript
> I'm not sure where that idea is coming from).  So seeing an error like
>
> ValueError("Quoted strings may not span multiple lines, invalid character","\n")
>
> might make more sense to people than illegal control code?

I'll look into that next time I do some hacking on it, 2.1.0 is coming
soon with an optional OrderedDict implementation that allows you to
preserve order in the JSON document.
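
For illustration only: the message above does not say what the 2.1.0 API looks like. The sketch below assumes a hook-style interface, using the object_pairs_hook parameter of the standard-library json module together with collections.OrderedDict:

```python
import json
from collections import OrderedDict

text = '{"zebra": 1, "apple": 2, "mango": 3}'

# object_pairs_hook receives each object's (key, value) pairs in
# document order, so an OrderedDict built from them keeps that order.
ordered = json.loads(text, object_pairs_hook=OrderedDict)
keys = list(ordered.keys())
```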

-bob

Deron Meranda

Apr 2, 2009, 6:51:34 PM
to pylons-...@googlegroups.com
> Yeah, simplejson doesn't have any more correctness nits and it's
> faster or at least comparable in performance to any of the others
> these days. The only thing it doesn't really have is a lenient parser,
> which I'm perfectly happy letting someone else write and maintain.

Actually I was going to "retire" my own module after yours got
adopted as the official one. My biggest goal was just to get *some*
implementation up to spec, and in that I think we all succeeded.

But then I've gotten messages from people saying that mine is
the only one that could handle certain malformed JSON data and
could I please keep supporting it.

I really wish that malformed JSON data would just disappear; but
since this is the real world, and people seem to need it, I'm keeping
my module around. To that end I'm even contemplating new changes
to it so that rather than competing with simplejson (what's the point in
that?), it becomes even more useful for those who deal with bad or unusual
input, need to debug data, or want to do something a little bit strange.


>> ValueError("Quoted strings may not span multiple lines, invalid character","\n")
>>
>> might make more sense to people than illegal control code?
>
> I'll look into that next time I do some hacking on it, 2.1.0 is coming
> soon with an optional OrderedDict implementation that allows you to
> preserve order in the JSON document.

You know, thinking about this more. Really what we have here is a
parser. Perhaps the errors should be designed so that they can
be machine-interpreted if desired, and not only for human consumption.

Think about somebody writing a JSON data viewer app. It doesn't
want to just throw up a traceback or a mostly useless error message.
It would want to be able to interpret the error, or at least to locate
the position in the input where the error occurred.

So some other suggestions are perhaps:

* Subclass ValueError to make JSONError, or similar. This would
allow all parser errors (caused by bad input) to be distinguished
from any other non-input errors that might arise, if any. Also by
subclassing you won't break any old code that caught ValueError.

* Add a consistent set of arguments or attributes to the error class
describing the error "location". Like line_number, column_number,
byte_offset (or character_offset), and perhaps a snippet of the raw
input data around where the error was found for context.
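
The two suggestions above could be sketched roughly as follows. The class name JSONError and the attribute names come from the list; the constructor signature, the context parameter, and the sample input are assumptions made for illustration:

```python
class JSONError(ValueError):
    """Hypothetical parser error carrying the location of the bad input."""

    def __init__(self, message, text, character_offset, context=15):
        self.message = message
        self.character_offset = character_offset
        # Lines and columns are 1-based, counted from the start of the input.
        self.line_number = text.count("\n", 0, character_offset) + 1
        line_start = text.rfind("\n", 0, character_offset) + 1
        self.column_number = character_offset - line_start + 1
        # Keep a window of raw input around the error for context.
        start = max(0, character_offset - context)
        self.snippet = text[start:character_offset + context]
        super().__init__(
            "%s at line %d column %d (char %d)"
            % (message, self.line_number, self.column_number, character_offset)
        )

doc = '{"a": 1,\n "b": tru}'
err = JSONError("Unexpected token", doc, doc.index("tru"))

# Code that already catches ValueError keeps working unchanged.
try:
    raise err
except ValueError as caught:
    location = (caught.line_number, caught.column_number)
```

A viewer application could use the location attributes to place a cursor at the error instead of parsing the message text.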
--
Deron Meranda

Bob Ippolito

Apr 2, 2009, 9:26:28 PM
to pylons-...@googlegroups.com
On Thu, Apr 2, 2009 at 3:51 PM, Deron Meranda <deron....@gmail.com> wrote:
>
>>> ValueError("Quoted strings may not span multiple lines, invalid character","\n")
>>>
>>> might make more sense to people than illegal control code?
>>
>> I'll look into that next time I do some hacking on it, 2.1.0 is coming
>> soon with an optional OrderedDict implementation that allows you to
>> preserve order in the JSON document.
>
> You know, thinking about this more.  Really what we have here is a
> parser.  Perhaps the errors should be designed so that they can
> be machine-interpreted if desired, and not only for human consumption.
>
> Think about somebody writing a JSON data viewer app.  It doesn't
> want to just throw up a traceback or a mostly useless error message.
> It would want to be able to interpret the error, or at least to locate
> the position in the input where the error occurred.
>
> So some other suggestions are perhaps:
>
> * Subclass ValueError to make JSONError, or similar.  This would
>  allow all parser errors (caused by bad input) to be distinguished
>  from any other non-input errors that might arise, if any.  Also by
>  subclassing you won't break any old code that caught ValueError.
>
> * Add a consistent set of arguments or attributes to the error class
>  describing the error "location".  Like line_number, column_number,
>  byte_offset (or character_offset), and perhaps a snippet of the raw
>  input data around where the error was found for context.

That was easy to do. I had considered doing it before, but forgot to
do it. This is implemented in trunk with JSONDecodeError wrapping
ValueError with pos, msg, end, etc. attributes (see the class
docstring).
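
A short sketch of consuming such an error. It uses json.JSONDecodeError from the standard library (Python 3.5 and later), which exposes msg, doc, pos, lineno and colno; this is assumed here to be close to, but not necessarily identical to, the simplejson trunk class described above:

```python
import json

doc = '"foo\nbar"'
try:
    json.loads(doc)
except json.JSONDecodeError as err:
    # JSONDecodeError subclasses ValueError, so older handlers still match.
    is_value_error = isinstance(err, ValueError)
    details = {
        "msg": err.msg,
        "pos": err.pos,
        "lineno": err.lineno,
        "colno": err.colno,
    }
    # Show the input around the reported position for context.
    snippet = err.doc[max(0, err.pos - 10):err.pos + 10]
```

The exact position reported can differ between implementations, so code that consumes these attributes should treat them as a pointer to the neighbourhood of the error rather than an exact character.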

-bob
