Wrap up -- I was wrong (was Re: [json-schema] Re: Open question: should regex support be removed from the JSON Schema specification altogether?)

Francis Galiegue

Sep 8, 2012, 6:12:29 AM
to json-...@googlegroups.com
On Fri, Sep 7, 2012 at 6:53 PM, penduin <owen...@gmail.com> wrote:
> I wouldn't miss them personally, but I can see useful cases for both
> "pattern" and "patternProperties", and I think there is a need for some type
> of pattern-matching, though regex (specifically ECMA 262 regex) is probably
> overkill. In WJElement (written in C) this stuff gets handled by the
> standard GNU regex library, with the ability to plug in a different regex
> handler should the need arise. In the meantime, we just don't care about
> the differences; if we're doing a big fancy implementation-specific regex in
> schema, we're doing something wrong. ;^)
>
> If what we're after is a spec that's easy to strictly implement in any
> language (that seems a worthy goal) then ECMA 262 probably should not be the
> pattern-matching method of choice. Regular expressions could be ditched
> altogether, or perhaps the spec can MAY and SHOULD its way around this
> issue; a validator looking to be widely-used can document its regex-handling
> details or maybe be configurable. (I know that approach rubs some people
> the wrong way, but that's my non-OCD pragmatist tinkerer side talking)
>
> A lowest-common-denominator regex subset (or even something as basic as a
> handful of wildcards) would be just fine as far as I'm concerned. I haven't
> had any real-world cases come up for either "pattern" or
> "patternProperties", though we did consider "patternProperties" for one of
> our schemas. (I forget what we worked out instead, but it simplified our
> lives a bit.)
>

That is a nice summary, thank you! And I agree that a lowest common
denominator subset would be nice too. Defining it can be tricky,
though. From what I see, regex constructs which can safely be used
are:

* character classes ("[a-z]" etc);
* the "+", "*" and "?" quantifiers, along with their "lazy" versions
("+?", "*?", "??") -- even though I positively loathe the latter :p
* alternation ("|"), grouping ("( ... )") -- BUT NOT non capturing
grouping like "(?: ... )";
* backreferences ("\1", etc)

Disallowed: anything else! No language-specific character classes (not
even "\d" and "\w" -- those differ between regex dialects, for
instance "\w" only covers ASCII by default in Java and JavaScript but
Unicode word characters in .NET, and similarly "\d" in .NET languages
matches any Unicode digit and not only "[0-9]"), no possessive
quantifiers ("*+", "++", "?+"), no named captures (the syntax of which
differs among languages anyway) etc.
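
A rough sketch of how a validator could warn about patterns that stray
outside this subset. Python is used purely for illustration; the
function name, the rule list and the labels are all hypothetical, and
the scan is naive (it ignores escaping and character classes, so it can
report false positives):

```python
import re

# Hypothetical rule list: constructs named above as outside the subset.
# This is an illustrative assumption, not part of any JSON Schema draft.
NON_PORTABLE = [
    (r"\\[dDwWsS]", "shorthand character class"),
    (r"\(\?:", "non-capturing group"),
    (r"\(\?P?<[A-Za-z_]", "named capture"),
    (r"[*+?]\+", "possessive quantifier"),
]

def find_nonportable(pattern):
    """Return labels of constructs in `pattern` that fall outside the subset."""
    return [label for rule, label in NON_PORTABLE if re.search(rule, pattern)]

print(find_nonportable(r"[a-z]+?(foo|bar)\1"))  # []
print(find_nonportable(r"(?:a|b)*+"))
# ['non-capturing group', 'possessive quantifier']
```

A real checker would need a proper regex tokenizer; this only shows the
shape of the idea.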


--
Francis Galiegue, fgal...@gmail.com
JSON Schema: https://github.com/json-schema
"It seems obvious [...] that at least some 'business intelligence'
tools invest so much intelligence on the business side that they have
nothing left for generating SQL queries" (Stéphane Faroult, in "The
Art of SQL", ISBN 0-596-00894-5)

penduin

Sep 9, 2012, 8:46:52 PM
to json-...@googlegroups.com
That subset looks great to me.  Anything that would use something too crazy to be represented with those bits of regex is probably either doing something wrong, or going to need something other than (or in addition to) json-schema to be satisfied. 

Armishev, Sergey

Sep 10, 2012, 12:07:57 PM
to json-...@googlegroups.com

Take a look at how XML Schema defines its subset of regular expressions. I think it makes perfect sense, and if it is OK, the JSON Schema standard can claim that its regular expressions are 100% compatible with XML Schema regular expressions.

http://www.regular-expressions.info/xml.html

Again, this is just a suggestion: rather than create one more flavor of regular expressions, just reuse something that was already done for similar purposes.

-Sergey


Eric Stob

Sep 10, 2012, 12:16:15 PM
to json-...@googlegroups.com
That is genius.

Sent from my iPhone

Francis Galiegue

Sep 10, 2012, 1:12:11 PM
to json-...@googlegroups.com
On Mon, Sep 10, 2012 at 6:07 PM, Armishev, Sergey <sarm...@idirect.net> wrote:
> Take a look at how XML Schema defines its subset of regular
> expressions. I think it makes perfect sense, and if it is OK, the JSON
> Schema standard can claim that its regular expressions are 100%
> compatible with XML Schema regular expressions.
>
> http://www.regular-expressions.info/xml.html
>

Uh. I will certainly not take that subset to the letter:

* regexes anchored at the beginning and end of input by default --> no
thanks! Programming languages such as Java (.matches()) and Python
(.match()) already have done enough damage, let's not go there;
* \[iIcC] are way too specific;
* character class subtraction is not supported by ECMA 262 (and the
way it's done? Come on, XML, [...^[...]] existed before XSD, why
didn't you choose that instead of inventing your own?)
* we want word anchors -- well, \b to be more specific;
* we want the dot to be able to match \n, which neither XML Schema nor
ECMA 262 does by default -- you cannot validate multiline inputs
otherwise.
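
A minimal sketch of the multiline point, in Python's re module purely
for illustration (other dialects have their own defaults for the dot).
The variable names are made up for the example:

```python
import re

text = "line one\nline two"

# By default "." stops at "\n", so ".*" cannot span the whole input.
default_match = re.fullmatch(r".*", text)

# With re.DOTALL, "." also matches "\n" and the whole input is accepted.
dotall_match = re.fullmatch(r".*", text, re.DOTALL)

print(default_match is None)     # True
print(dotall_match is not None)  # True
```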

Just the first point of this subset (having regexes forcefully
anchored) means you lose 90+% of the expressiveness of regexes. What
do you think of:

[0-9]

vs

.*[0-9].*

to match a single digit?
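
The difference can be sketched in Python (chosen only for illustration;
the helper names are hypothetical). re.search is unanchored, while
re.fullmatch has to consume the whole input, which is how an
anchored-by-default flavor behaves:

```python
import re

def has_digit_unanchored(s):
    # re.search scans the input, so the bare pattern is enough.
    return re.search(r"[0-9]", s) is not None

def has_digit_anchored(s, pattern):
    # re.fullmatch must consume the whole input, like an anchored flavor.
    return re.fullmatch(pattern, s) is not None

print(has_digit_unanchored("abc7def"))              # True
print(has_digit_anchored("abc7def", r"[0-9]"))      # False
print(has_digit_anchored("abc7def", r".*[0-9].*"))  # True
```

With forced anchoring, every "contains" test has to be padded with .*
on both sides.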

Apologies for sounding somewhat rude, but regexes are a subject which
I master particularly well ;)

Have fun,

Dean Landolt

Sep 10, 2012, 1:30:38 PM
to json-...@googlegroups.com
So it sounds like you'd prefer the subset defined by XPath and XQuery [1]. But it still seems wise to try to use an already-specified regex subset -- particularly one from a spec as widely implemented as XPath.

[1] http://www.regular-expressions.info/xpath.html


Francis Galiegue

Sep 10, 2012, 2:34:54 PM
to json-...@googlegroups.com
On Mon, Sep 10, 2012 at 7:30 PM, Dean Landolt <de...@deanlandolt.com> wrote:
[...]
>
>
> So it sounds like you'd prefer the subset defined by xpath and xquery [1].
> But it still seems wise to try to use an already-specified regex subset --
> particularly one from a spec as widely-implemented as xpath.
>
>
> [1] http://www.regular-expressions.info/xpath.html
>

Nope, not that one either, if only because it extends the XML Schema
regex language, with all its peculiarities, which should not make it
into a usable, generalizable regex subset.

Well, I'll come up with something in the end. And submit it for
approval in any case.

Nate Morse

Sep 10, 2012, 4:26:13 PM
to json-...@googlegroups.com
Not bad http://www.regular-expressions.info/refflavors.html , but I
might put a few more "no"s in the XML column.

--
--Nate

Armishev, Sergey

Sep 11, 2012, 5:25:41 PM
to json-...@googlegroups.com
Guys,
XML Schema was created for the same purpose as JSON Schema: validation of data structures. And people have been thinking about efficiency as well. Look at this statement about the XML Schema regular expression flavor:

"Compared with other regular expression flavors, the XML schema flavor is quite limited in features. Since it's only used to validate whether an entire element matches a pattern or not, rather than for extracting matches from large blocks of data, you won't really miss the features often found in other flavors. The limitations allow schema validators to be implemented with efficient text-directed engines."

Does anybody care about efficiency?
Code reuse?
What about people like myself switching from XML Schema to JSON Schema?

-Sergey

Francis Galiegue

Sep 11, 2012, 5:35:39 PM
to json-...@googlegroups.com
On Tue, Sep 11, 2012 at 11:25 PM, Armishev, Sergey
<sarm...@idirect.net> wrote:
> Guys,
> XML Schema was created for the same purpose as JSON Schema: validation of data structures. And people have been thinking about efficiency as well. Look at this statement about the XML Schema regular expression flavor:
>

Err... Disagree. XSD proper has quite a lot of semantic analysis built in.

> "Compared with other regular expression flavors, the XML schema flavor is quite limited in features. Since it's only used to validate whether an entire element matches a pattern or not, rather than for extracting matches from large blocks of data, you won't really miss the features often found in other flavors. The limitations allow schema validators to be implemented with efficient text-directed engines."
>
> Does anybody care about efficiency?
> Code reuse?
> What about people like myself switching from XML Schema to JSON Schema?
>

Oh, I do, BUT: when it comes to regex, XSD has one thing VERY WRONG:
anchoring regexes is NOT a way to make them perform better!

Ever heard of "first character optimization" in regex engines? Pretty
much all engines _on earth_ have that nowadays. And they perform
_worse_ when the regex is surrounded with .*.

In any event, losing regex expressiveness just so that JSON Schema is
"compatible with XSD" is a definite, resounding NO. Sorry!

Geraint (David)

Sep 12, 2012, 5:21:54 AM
to json-...@googlegroups.com
I agree with you about efficiency, but I must chime in to say that JSON Schema is not just about validation.  It has some fantastic features in the hyper-schema that have nothing to do with validation.

In fact, right at the top of the document it says "JSON Schema is intended to define validation, documentation, hyperlink navigation, and interaction control of JSON data".  People have already posted software to this list that use the last three, and I have to admit that my pet project focuses heavily on the last two in that list.

I explained JSON Schema to an XML-geek friend of mine recently, and he was actually incredibly jealous of the non-validation features that JSON Schema is defining.  I think that to limit JSON Schema to the features that XML schemas have is a mistake.  However, I agree that trying to make it as efficient as competing schema formats is a noble goal.

Andrei Neculau

Sep 22, 2012, 6:41:51 AM
to json-...@googlegroups.com
#1 Is it just me, or is it wrong to decide which regex capabilities to allow based on their efficiency? (re: first character recognition)
Today it might be slow; tomorrow it might be super fast, or at least negligible. How many times have we (devs) gone down the path of "write super unreadable code because you gain 1 second" (today), only to find that 6 months later you gain just 0.01, while your code has been read by tens or hundreds of people?

#2 did I skip some important lines, or do you suggest no ^$ ?

Andrei Neculau

Sep 22, 2012, 6:44:43 AM
to json-...@googlegroups.com
sorry. ignore #2

Francis Galiegue

Sep 22, 2012, 6:55:26 AM
to json-...@googlegroups.com
On Sat, Sep 22, 2012 at 12:41 PM, Andrei Neculau
<andrei....@gmail.com> wrote:
> #1 Is it just me or is it wrong to decide what regex capabilities to allow
> based on their efficiencies? (re: first character recognition)

The defining criterion was not efficiency, but common support of regex
constructs.

I didn't even include \d, you will have noticed: that is because, for
instance, in .NET, \d matches _any_ Unicode digit, whereas it is
strictly equivalent to [0-9] in most other regex dialects. Talk about
a trap!
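
Python's re module happens to show the same trap for str patterns, so
here is a small sketch of it (Python is an illustrative choice; the
thread itself is about .NET, and the variable names are made up):

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a Unicode decimal digit

# For str patterns, \d matches any Unicode decimal digit by default.
unicode_d = re.fullmatch(r"\d", arabic_three) is not None

# re.ASCII restricts \d to [0-9]; the explicit class never matched it.
ascii_d = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None
explicit = re.fullmatch(r"[0-9]", arabic_three) is not None

print(unicode_d, ascii_d, explicit)  # True False False
```

Spelling out [0-9] gives the same answer in every dialect, which is the
point of leaving \d out.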

I did not even include \b (word anchor) -- maybe that is a mistake, I
don't know. But for instance, ECMA 262 has no \< and \>. \b, on the
other hand, is widely supported -- I don't know of a regex dialect
which does not support it. I should probably include that in the
recommended set.