JSON Cleanup Regex


John C. Bland II

Aug 19, 2008, 9:49:52 PM
to as3co...@googlegroups.com
Hey guys, what do you think about cleaning up JSON so the parser is much better to work with?

Here is a solid regex by Nathan Strutz (dopefly.com):

var json:String = "{blah: 'some string', blah2 : ' some: string', blah3: 'some string \' with internal quote'}";
json = json.replace(/(([\w]+)\s*(:)|'(.*)(?<!\\)')/gi, "\"$2$4\"$3"); // replace() returns a new string, so keep the result

...that turns into:
{"blah": "some string", "blah2": " some: string", "blah3": "some string \' with internal quote"}

...which works perfectly fine with the parser. Otherwise the first one above fails since it is object notation vs actual json.

If it isn't implemented, hope it being here helps. I'll blog it too so it can help someone else as it helped me.

--
John C. Bland II
http://www.johncblandii.com
---
http://www.lifthimhigh.com - "Christian Products for Those Bold Enough to Wear Them"

Home of azFPUG - http://www.gotoandstop.org

Mike Chambers

Aug 20, 2008, 4:47:08 AM
to as3co...@googlegroups.com
Are you suggesting adding this to the parser?

I would be open to that. Ideally, we would just update the parser to
handle the different cases, but this could be a quick way to do it
until or unless someone volunteers to look at the current implementation.

mike chambers

me...@adobe.com

John C. Bland II

Aug 20, 2008, 4:52:37 AM
to as3co...@googlegroups.com
Yeah, implemented right in the parser.

So, JSON.decode(myjson) would handle the clean-up.

One thing I noticed, which I'm going to talk to Nate about tomorrow: the first regex below is great for the sloppy JSON that prompted this need, but for clean object notation (JSON with quotes around properties) it didn't work properly. I had to use a combo of the two below.


json.replace(/(([\w]+)\s*(:)|'(.*)(?<!\\)')/gi, "\"$2$4\"$3");
json.replace(/(([\w]+)(:)|'([^']*)')/gi, "\"$2$4\"$3");

Now, I'm sure Nate can probably make this 1 regex. If so, I'll throw it in the JSON class and shoot it over for review.

John C. Bland II

Aug 20, 2008, 3:18:25 PM
to as3co...@googlegroups.com
(([\w]+)\s*(:)|'(.*?)(?<!\\)')

There's the JSON cleaning regex that works for badly formatted ON (object notation) and properly formatted ON. It converts both to a solid JSON format.
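
To make the behaviour of this pattern concrete, here is a minimal sketch. It is written in TypeScript rather than the ActionScript 3 used in the thread, the helper name `cleanJson` is made up for illustration, and it assumes a regex engine with lookbehind support (modern JavaScript engines have it).

```typescript
// Hypothetical helper, not part of as3corelib: applies the regex from the post.
// Alternative 1 quotes a bare word that is followed by a colon.
// Alternative 2 turns a single-quoted string into a double-quoted one; the
// negative lookbehind keeps \' from being treated as the closing quote.
function cleanJson(text: string): string {
  return text.replace(/(([\w]+)\s*(:)|'(.*?)(?<!\\)')/gi, "\"$2$4\"$3");
}

// The sample from the first post. The backslash is doubled so that it
// survives the TypeScript string literal.
const sloppy =
  "{blah: 'some string', blah2 : ' some: string', blah3: 'some string \\' with internal quote'}";
const cleaned = cleanJson(sloppy);
console.log(cleaned);

// Input whose property names are already double-quoted passes through unchanged.
const alreadyQuoted = cleanJson('{"foo": "bar"}');
console.log(alreadyQuoted);
```

One caveat: the `\'` that survives in the third value is not a valid escape sequence in strict JSON, so a strict parser (JavaScript's `JSON.parse`, for example) would reject that particular output. Whether a given decoder tolerates it is worth checking.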

Updated json class coming in a sec.

John C. Bland II

Aug 20, 2008, 3:26:03 PM
to as3co...@googlegroups.com
Hope that helps.
JSON.as

Mike Chambers

Aug 20, 2008, 7:07:14 PM
to as3co...@googlegroups.com
Thanks. I will try and integrate this tonight and run the unit tests.

mike chambers

me...@adobe.com

> <JSON.as>

John C. Bland II

Aug 20, 2008, 7:47:54 PM
to as3co...@googlegroups.com
Good deal.

Mike Chambers

Aug 29, 2008, 12:28:42 PM
to as3co...@googlegroups.com
Sorry for the delay on this. The change you included causes an assertion
failure in the unit tests.

It is failing on:

--
o = JSON.decode( ' "http:\/\/digg.com\/security\/Simple_Digg_Hack" ' );
assertTrue( "String not decoded correctly", o == "http://digg.com/security/Simple_Digg_Hack" );
--
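
A likely reason for this failure, assuming the cleanup regex runs on every input before decoding: its first alternative matches any word followed by a colon, and that includes `http:` inside a string that is already valid JSON. The sketch below is TypeScript rather than ActionScript 3, and the `cleanJson` name is hypothetical.

```typescript
// Hypothetical helper applying the cleanup regex posted earlier in the thread.
function cleanJson(text: string): string {
  return text.replace(/(([\w]+)\s*(:)|'(.*?)(?<!\\)')/gi, "\"$2$4\"$3");
}

// A valid JSON string value; backslashes are doubled for the TypeScript literal.
const validJson = '"http:\\/\\/digg.com\\/security\\/Simple_Digg_Hack"';

// "http:" is rewritten as "\"http\":", which splits the string value in two.
const mangled = cleanJson(validJson);
console.log(mangled);
```

The untouched input decodes with a strict parser, while the rewritten text no longer does, which matches the failing assertion.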

mike chambers

me...@adobe.com

John C. Bland II

Aug 29, 2008, 2:27:44 PM
to as3co...@googlegroups.com
Why would json.decode be used on a url?

What if instead of attempting to clean-up initially we do a try/catch and in the catch try a second time with the cleanup regex?

try {
    // decode
} catch (e:*) {
    // clean
    // decode
}

That might help solve this issue.
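
A rough sketch of that try/catch idea, in TypeScript rather than ActionScript 3. `JSON.parse` stands in for the library's `JSON.decode`, and the names `cleanJson` and `decodeWithFallback` are invented for this example.

```typescript
// Hypothetical helper applying the cleanup regex from earlier in the thread.
function cleanJson(text: string): string {
  return text.replace(/(([\w]+)\s*(:)|'(.*?)(?<!\\)')/gi, "\"$2$4\"$3");
}

// Try a strict parse first; only if that throws, clean up and try once more.
// A failure of the second attempt propagates to the caller.
function decodeWithFallback(text: string): any {
  try {
    return JSON.parse(text);
  } catch {
    return JSON.parse(cleanJson(text));
  }
}

// Valid JSON never reaches the regex, so the URL string survives intact.
const url = decodeWithFallback('"http:\\/\\/digg.com\\/security\\/Simple_Digg_Hack"');

// Sloppy object notation fails the strict parse and goes through the cleanup.
const obj = decodeWithFallback("{foo: 'bar'}");

console.log(url, obj);
```

This only protects input that is already valid. Sloppy input that also contains a word followed by a colon inside a double-quoted string, such as `{url: "http://x"}`, would still be mangled by the cleanup on the second attempt.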

OR

What if the cleanup became a separate function and we left it 100% to the developer to call, as needed?

JSON.decode(JSON.clean("..."));

Thoughts?

Mike Chambers

Aug 29, 2008, 5:02:46 PM
to as3co...@googlegroups.com
Well, the URL is a valid string, which should be parsed.

The only issue with having a separate normalize or cleanup method
call, is that then we would have to document when it would not work...
i.e. in the case of the url below.

Unfortunately, I am not very strong with RegularExpressions.

So, the primary issue is that currently the parser only works if the
property name in an object is quoted?

i.e:

Works:

{"foo":"bar"}

Doesn't work:

{foo:"bar"}

Let me ping Darron (who wrote the original code) and see if he has any
ideas.

Maybe we can have a strict property which defaults to true (current
behaviour). If set to false, then it would parse non-quoted property names.

mike chambers

me...@adobe.com

Mike Chambers

Aug 29, 2008, 5:52:27 PM
to as3co...@googlegroups.com
fyi, I spoke with Darron Schall, who wrote the original code. Below is
the chat with the high level info on how to fix the issue.

I don't have time to work on this right now, so if someone else wants
to try it out, go for it. Otherwise I will look at it when I get a
chance:

---
Mike Chambers
hey darron, you have a sec for a quick question?
2:06
haystackr
sure
whats up?
2:06
Mike Chambers
you remember the JSON code you wrote for us?
for corelib?
2:06
haystackr
yeah
well, I dont really "remember" it, but I wrote it 2 years ago.. hehe
2:06
Mike Chambers
the spec specifies the object keys are enclosed in double quotes
{"foo":"bar"}
which the lib works with
but apparently a lot of json doesnt do this
{foo:"bar"}
2:07
haystackr
teh web iz borken
2:07
Mike Chambers
amen to that
so I am looking into adding a strict flag
2:07
haystackr
yeah, we can update the parser to not look for a " after a {
right now it looks for a string value after {
2:08
Mike Chambers
yeah
2:08
haystackr
would have to make some code to recognize identifiers
2:08
Mike Chambers
is that something relatively easy?
2:08
haystackr
and ask for an ident or string after {
relatively, yeah
2:08
Mike Chambers
im looking through it now, but not familiar with the code
2:08
haystackr
i dont have it in front of me, so can't really point you to it
its probably in tokenizer
you need a new ident token
so when you call getNextToken() you can get a string or an ident
in the parseObject() portion
2:09
Mike Chambers
yeah. im looking at parseObject
2:09
haystackr
its like getNextToken() for open_curly_brace
then look for ident or string
you'll have to write that part
not sure if that's quite right as I dont remember 100%
2:10
Mike Chambers
http://code.google.com/p/as3corelib/source/browse/trunk/src/com/adobe/serialization/json/JSONDecoder.as
2:10
haystackr
but that's the general idea
2:10
Mike Chambers
if you have a sec, can you leave the comments inline on that page?
just click on the code line and you can leave a comment
2:11
haystackr
you have to edit this one too: http://code.google.com/p/as3corelib/source/browse/trunk/src/com/adobe/serialization/json/JSONTokenizer.as
2:11
Mike Chambers
getNextToken?
2:11
haystackr
in getNextToken()
new case for idents
gonna have to figure out a way to work around the case "t", "f" and "n"
2:11
Mike Chambers
yeah
2:11
haystackr
as those could start tokens as well as true/false, etc
its more like.. just grab the ident in that case
then check if the ident is the special case true/false/null
otherwise token.type = IDENT
2:12
Mike Chambers
ok. you have lost me, but i am not as familiar with the code
2:12
haystackr
for JSONDecoder
like line 182 -- else if token.type == JSONTokenType.IDENT
or, line 151 can be token.type == string or token.type == ident
the only thing that would change would be line 153
actually, 153 wouldn't have to change at all
because the value of the token would be a string
yeah, you just need to say if string or ident there on 151
2:14
Mike Chambers
thats it?
2:14
haystackr
then make the tokenizer recognize ident tokens
yeah
2:14
Mike Chambers
what is ident?
2:14
haystackr
a new JSONTokenType
that represents an identifier
(so the parser knows what the next token type is..)
2:15
Mike Chambers
right
and then in JSONTokenizer is where i set the IDENT type
the JSONTokenType
2:16
haystackr
yeah
when you find an identifier in getNextToken()
you set token.type = IDENT
2:16
Mike Chambers
would that just be in default then?
if it is not a number
2:16
haystackr
possibly
2:17
Mike Chambers
i assume it is a name
2:17
haystackr
if it's not a number, just consume valid ident chars (A-Z, a-z, $_)
but
that won't work, remember, because of the "t", "f", and "n" cases
you need to get rid of those
and check after you're done with the ident parsing code
2:18
Mike Chambers
before i return the token?
2:18
haystackr
yeah
2:18
Mike Chambers
so if the token was "false"
2:18
haystackr
get your ident, then check its value.. if "false" token type is FALSE
2:18
Mike Chambers
i would
token.value = false;
2:19
haystackr
exactly
2:19
Mike Chambers
that is outside of the switch statement?
right?
2:19
haystackr
otherwise, token.type = IDENT
2:19
Mike Chambers
just remove those cases
2:19
haystackr
that would be in the default case handler
2:19
Mike Chambers
oh
in the default
but remove the f,t cases
2:19
haystackr
right
I would actually make a new method
and put the t, f, and n in there
new method called readIdentifier()
then in default you can say, on line 165, else if ( isIdentChar( ch ) )
token = readIdentifier()
that should do it
good luck
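
A loose sketch of the approach described in the chat, in TypeScript rather than the ActionScript 3 of the actual tokenizer. Only the names `readIdentifier`, `isIdentChar` and the `IDENT` token type come from the conversation. The token shape, the other type names and the function signatures are assumptions made for this example and are not taken from `JSONTokenizer.as`.

```typescript
// Assumed token kinds; IDENT is the new one proposed in the chat.
type TokenType = "TRUE" | "FALSE" | "NULL" | "IDENT";

interface Token {
  type: TokenType;
  value: boolean | null | string;
}

// Identifier characters as listed in the chat: A-Z, a-z, $ and _.
function isIdentChar(ch: string): boolean {
  return /^[A-Za-z$_]$/.test(ch);
}

// Decide, after the whole word has been read, whether it is one of the
// literals true/false/null or a plain identifier.
function classify(text: string): Token {
  if (text === "true") return { type: "TRUE", value: true };
  if (text === "false") return { type: "FALSE", value: false };
  if (text === "null") return { type: "NULL", value: null };
  return { type: "IDENT", value: text };
}

// Consume identifier characters starting at `pos` and return the token plus
// the position of the first character after it. The caller is expected to
// have checked isIdentChar() on the character at `pos` first.
function readIdentifier(source: string, pos: number): { token: Token; next: number } {
  let end = pos;
  while (end < source.length && isIdentChar(source.charAt(end))) {
    end++;
  }
  return { token: classify(source.slice(pos, end)), next: end };
}

console.log(readIdentifier("foo:\"bar\"}", 0)); // identifier used as a property name
console.log(readIdentifier("false}", 0));       // literal recognised after reading the word
```

Reading the whole word first and classifying it afterwards is what removes the need for separate "t", "f" and "n" cases: a name such as `truthy` or `nullable` simply falls through to `IDENT`.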


John C. Bland II

Aug 29, 2008, 6:02:55 PM
to as3co...@googlegroups.com
Thanks for the full convo paste. Let me see if I can get to this in a few minutes.

So, the idea, from the API perspective, is to allow the user to set a property OR pass it into decode?

JSON.decode(myJsonVar, false); //non-strict mode; default is true

or

JSON.strict = true;
JSON.decode(myJsonVar);

What would be the preference here? Both? :-)

> readIdentifier will be a lot like readString()
> minus the question sequence stuff
> gotta run
> let me know how it works out


Mike Chambers

Aug 29, 2008, 10:30:19 PM
to as3co...@googlegroups.com
Well, since the decode method is static, having a second argument to
decode seems to make more sense to me. The option will also be
exposed a little better through code hints.

XML does have the ignoreWhitespace static property though, which
applies to all instances.

I prefer #1, but I could go either way. I wouldn't do both though as
that could be confusing (which takes precedence, etc..).

I'll defer to whatever you think makes the most sense.

mike chambers

me...@adobe.com

John C. Bland II

Aug 30, 2008, 12:22:13 AM
to as3co...@googlegroups.com
I think passing it with decode is best. The property wouldn't be bad since most likely they wouldn't switch back and forth but still...I think the method would be the clear path.

Lemme finish banging on this app and I'll look at it, for sure, in a few hours.