Re: Regex - Ensure 0,1 occurrences from a list of possibilities

Peter Duniho

unread,

Mar 27, 2009, 1:39:26 PM3/27/09

to

On Fri, 27 Mar 2009 09:57:04 -0700, Byron
<By...@discussions.microsoft.com> wrote:

> I'm looking for an expression that checks that within a string only
> certain
> characters appear, and that they appear at most 1 time. For example, a
> string can have up to 4 unique digits between 1 and 4.

I'm no regex expert. But, I suspect you can use the conditional
expression syntax to accomplish what you want.

That said, that doesn't mean regex is the best solution. Sometimes,
simply writing some code is a better way to express what you're doing. I
think that using the conditional syntax, a suitable regex expression would
be, while more concise than the equivalent C#, also very difficult to
easily infer its purpose, as well as to modify should the requirements
change later.

Pete

nospam@invegat

unread,

Mar 29, 2009, 6:17:36 AM3/29/09

to

i am not a regex expert, so there is probably a better way, but this
works

Regex tester = new Regex(@"D-(" +
@"(?<m1>[1234])|" +
@"(?<m1>[1234])(?<m2>(?!\k<m1>)[1234])|" +

@"(?<m1>[1234])(?<m2>(?!\k<m1>)[1234])(?<m3>(?!(\k<m1>|\k<m2>))[1234])|"
+

@"(?<m1>[1234])(?<m2>(?!\k<m1>)[1234])(?<m3>(?!(\k<m1>|\k<m2>))[1234])"
+
@"(?<m4>(?!(\k<m1>|\k<m2>|\k<m3>))[1234])" +
@")$");
invegat

On Fri, 27 Mar 2009 09:57:04 -0700, Byron
<By...@discussions.microsoft.com> wrote:

>I'm looking for an expression that checks that within a string only certain
>characters appear, and that they appear at most 1 time. For example, a
>string can have up to 4 unique digits between 1 and 4.
>

>The following strings should pass:
>D-1
>D-14
>D-1234
>D-41
>
>The following should fail:
>D- (no digit)
>D-0 (out of range)
>D-5 (out of range)
>D-11 (repeated digit)
>D-1 2 (embedded space)
>D-12345 (out of range and too long)
>
>The expression "D-[1234]{1,4}" comes close, but it matches even when a digit
>is repeated within the string.
>
>Any help would be greatly appreciated.

Jesse Houwing

unread,

Mar 29, 2009, 7:33:45 PM3/29/09

to

Hello Byron,

> I'm looking for an expression that checks that within a string only
> certain characters appear, and that they appear at most 1 time. For
> example, a string can have up to 4 unique digits between 1 and 4.
>
> The following strings should pass:
> D-1
> D-14
> D-1234
> D-41
> The following should fail:
> D- (no digit)
> D-0 (out of range)
> D-5 (out of range)
> D-11 (repeated digit)
> D-1 2 (embedded space)
> D-12345 (out of range and too long)
> The expression "D-[1234]{1,4}" comes close, but it matches even when a
> digit is repeated within the string.
>
> Any help would be greatly appreciated.

As there's only a few possibilities, you could write them out like this:

(1234|1324|1342|1423|1432|....4321)

And do the same with the other options.

A better wat woukd be to check the general format with a regex, and then
later (if you're using asp.net validators) check the other requirement.

There are other possibilities (as invegat pointed out), but those aren't
supported by clientside Javascript , which would require a serverside check
anyway.

--
Jesse Houwing
jesse.houwing at sogeti.nl

nospam@invegat

unread,

Mar 29, 2009, 8:07:51 PM3/29/09

to

By my calculation there are 504 possibilities, I wouldn't call that
"only a few". If I were doing it these way I would generate the
matching strings with a 1 time use program.

Mark Tolonen

unread,

Mar 29, 2009, 9:16:24 PM3/29/09

to

There are 64 possibilities:

<nospam@invegat> wrote in message
news:vb20t4pe3tgjj7gm8...@4ax.com...

Mark Tolonen

unread,

Mar 29, 2009, 9:30:27 PM3/29/09

to

There are only 64 permutations. Not a pretty regex though:

"D-(1234|1243|1324|1342|1423|1432|2134|2143|2314|2341|2413|2431|
3124|3142|3214|3241|3412|3421|4123|4132|4213|4231|4312|4321|123|
124|132|134|142|143|213|214|231|234|241|243|312|314|321|324|341|342|
412|413|421|423|431|432|12|13|14|21|23|24|31|32|34|41|42|43|1|2|3|4)$"

-Mark

<nospam@invegat> wrote in message
news:vb20t4pe3tgjj7gm8...@4ax.com...

nospam@invegat

unread,

Mar 30, 2009, 2:26:34 AM3/30/09

to

You are right, 64. With that number your way is probably at least
100 times faster.

Jesse Houwing

unread,

Mar 30, 2009, 3:22:15 AM3/30/09

to

Hello Mark,

> There are only 64 permutations. Not a pretty regex though:
>
> "D-(1234|1243|1324|1342|1423|1432|2134|2143|2314|2341|2413|2431|
> 3124|3142|3214|3241|3412|3421|4123|4132|4213|4231|4312|4321|123|
> 124|132|134|142|143|213|214|231|234|241|243|312|314|321|324|341|342|
> 412|413|421|423|431|432|12|13|14|21|23|24|31|32|34|41|42|43|1|2|3|4)$"

You could make that much shorter by making permutations like ([1234]|1[234]|2[134]...
etc, whould make it even shorter. There's even tools for that, but I can't
find them right now. Used to use them to generate regex's for SpamAssassin.

But I agree... not very pretty.

Jesse

nospam@invegat

unread,

Mar 30, 2009, 3:52:56 AM3/30/09

to

of the 64 possible matches 40 would be incorrect for the pattern
1[234], for example 1222

On Mon, 30 Mar 2009 07:22:15 +0000 (UTC), Jesse Houwing

Jesse Houwing

unread,

Mar 30, 2009, 4:06:07 AM3/30/09

to

Hello Jesse,

> Hello Mark,
>
>> There are only 64 permutations. Not a pretty regex though:
>>
> "D-(1234|1243|1324|1342|1423|1432|2134|2143|2314|2341|2413|2431|
> 3124|3142|3214|3241|3412|3421|4123|4132|4213|4231|4312|4321|123|
> 124|132|134|142|143|213|214|231|234|241|243|312|314|321|324|341|342|
> 412|413|421|423|431|432|12|13|14|21|23|24|31|32|34|41|42|43|1|2|3|4)$"
>
> You could make that much shorter by making permutations like
> ([1234]|1[234]|2[134]... etc, whould make it even shorter. There's
> even tools for that, but I can't find them right now. Used to use them
> to generate regex's for SpamAssassin.
>
> But I agree... not very pretty.

I had some thinking and came up with this regex:

^D-(?:(?(1)(?!(?(1)))|(?<1>1))|(?(2)(?!(?(2)))|(?<2>2))|(?(3)(?!(?(3)))|(?<3>3))|(?(4)(?!(?(4)))|(?<4>4))){1,4}$

it works a lot like the regex by invegat. But it is more generic. It would
be pretty easy to extend this or generate this for longer strings. Evwen
though a simple piece of C# would be more elegant, I couldn't resist trying
to see if it was possible with just Regex :)

Basically it does this:

(?(1) checks to see if we've already matched a 1
(?!(?(1)) creates a no-match if that was the case
|(?<1>1) matches a 1 (and due to the constructs before, only if we hadn't
matched one before)

Now or this together with the same construct for 2,3 and 4 and you can use
{1,4} to specify the range.

Jesse

>
> Jesse
>
>> -Mark
>>
> <nospam@invegat> wrote in message
> news:vb20t4pe3tgjj7gm8...@4ax.com...
>
>>> By my calculation there are 504 possibilities, I wouldn't call that
>>> "only a few". If I were doing it these way I would generate the
>>> matching strings with a 1 time use program.
>>>
> On Sun, 29 Mar 2009 23:33:45 +0000 (UTC), Jesse Houwing
> <jesse....@newsgroup.nospam> wrote:
>
>>>> Hello Byron,
>>>>

>>>>> I'm looking for an expression that checks that within a string
>>>>> only certain characters appear, and that they appear at most 1
>>>>> time. For example, a string can have up to 4 unique digits between
>>>>> 1 and 4.
>>>>>
> The following strings should pass:
> D-1
> D-14
> D-1234
> D-41
> The following should fail:
> D- (no digit)
> D-0 (out of range)
> D-5 (out of range)
> D-11 (repeated digit)
> D-1 2 (embedded space)
> D-12345 (out of range and too long)
> The expression "D-[1234]{1,4}" comes close, but it matches even
> when a
> digit is repeated within the string.
> Any help would be greatly appreciated.

>>>> As there's only a few possibilities, you could write them out like
>>>> this:
>>>>
> (1234|1324|1342|1423|1432|....4321)
>
> And do the same with the other options.
>
> A better wat woukd be to check the general format with a regex, and
> then later (if you're using asp.net validators) check the other
> requirement.
>
> There are other possibilities (as invegat pointed out), but those
> aren't
> supported by clientside Javascript , which would require a
> serverside
> check
> anyway.

> --
> Jesse Houwing
> jesse.houwing at sogeti.nl

> Posted by JetBrains Omea Pro 1098.1

Jesse Houwing

unread,

Mar 30, 2009, 4:10:07 AM3/30/09

to

Hello nospam@invegat,

> of the 64 possible matches 40 would be incorrect for the pattern
> 1[234], for example 1222

No that wouldn't be true. There's no * after the character set, it would
only match 12, 13, 14.

1[234] would never match 1222.
12[34]
1234
13[24]
1324
1342
14[23]
1423
1432

and go on like this.

could condense that into:

1([234]|2([34]|34)|3([24]|24|42)|4([23]|23|32)... etc

It's horrible to read, but quite efficient :)

Jesse

Jeff Johnson

unread,

Mar 30, 2009, 1:51:54 PM3/30/09

to

"Byron" <By...@discussions.microsoft.com> wrote in message
news:51030B54-8718-4D6E...@microsoft.com...

> I'm looking for an expression that checks that within a string only
> certain
> characters appear, and that they appear at most 1 time. For example, a
> string can have up to 4 unique digits between 1 and 4.
>
> The following strings should pass:
> D-1
> D-14
> D-1234
> D-41
>
> The following should fail:
> D- (no digit)
> D-0 (out of range)
> D-5 (out of range)
> D-11 (repeated digit)
> D-1 2 (embedded space)
> D-12345 (out of range and too long)
>
> The expression "D-[1234]{1,4}" comes close, but it matches even when a
> digit
> is repeated within the string.
>
> Any help would be greatly appreciated.

Maybe there's a Regex guru out there who can help you, but my personal
advice is that you should put down your Regex hammer and stop looking for
nails to hit with it. Parse these strings with plain old boring code and
move on.

Mark Tolonen

unread,

Mar 31, 2009, 2:15:29 AM3/31/09

to

1[234] matches 12, 13, 14. It doesn't match 1222.

21 characters shorter with Jesse's suggestion (and even less pretty ;^)

"D-(1234|1243|1324|1342|1423|1432|2134|2143|2314|2341|2413|2431|

3124|3142|3214|3241|3412|3421|4123|4132|4213|4231|4312|4321|12[34]|
13[24]|14[23]|21[34]|23[14]|24[13]|31[24]|32[14]|34[12]|41[23]|42[13]|43[12]|
1[234]|2[134]|3[124]|4[123]|[1234])$"

-Mark

<nospam@invegat> wrote in message
news:nau0t4h2qm1kl7mi9...@4ax.com...

Jesse Houwing

unread,

Mar 31, 2009, 3:12:56 AM3/31/09

to

Hello Mark,

> 1[234] matches 12, 13, 14. It doesn't match 1222.
>
> 21 characters shorter with Jesse's suggestion (and even less pretty
> ;^)
>
> "D-(1234|1243|1324|1342|1423|1432|2134|2143|2314|2341|2413|2431|
> 3124|3142|3214|3241|3412|3421|4123|4132|4213|4231|4312|4321|12[34]|
> 13[24]|14[23]|21[34]|23[14]|24[13]|31[24]|32[14]|34[12]|41[23]|42[13]|
> 43[12]| 1[234]|2[134]|3[124]|4[123]|[1234])$"

I find it more readable than

^D-(?:(?(1)(?!(?(1)))|(?<1>1))|(?(2)(?!(?(2)))|(?<2>2))|(?(3)(?!(?(3)))|(?<3>3))|(?(4)(?!(?(4)))|(?<4>4))){1,4}$

though... even though it's shorter...

But as I say as Regex teacher... Shorter is always better ;).

Jesse

> -Mark
>
> <nospam@invegat> wrote in message

Lee

unread,

Apr 2, 2009, 3:52:18 PM4/2/09

to

Hello All,

Now I am not a "Regular Expression Master" by any means but, I *think*
you guys are looking at this kinda wrong.

Why are you focusing on the permutations.

What we want to do is exclude a character from ever appearing more
than once, right/

So would not the start point be here? ^[^1]*1?[^1]*$

This says that the string can only contain one '1'.

now if you bunch put this in a positive lookahead with a four
character limit:

(?=^[^1]*1?[^1]*$)[1234]{1,4}

This says that the string can only be 4 characters long, consisting of
only the numbers 1 thru 4, and can contain at most one '1'. And three
more lookaheads and you are done:

(?=^[^1]*1?[^1]*$)(?=^[^2]*2?[^2]*$)(?=^[^3]*3?[^3]*$)(?=^[^4]*4?[^4]*
$)[1234]{1,4}

I've exaustively tested this in "The Regulator' and it seams that it
will only allow what the requestor is asking for.

Of course, I could be wrong.

L. Lee Saunders
http://oldschooldotnet.blogspot.com

Lee

unread,

Apr 2, 2009, 4:12:59 PM4/2/09

to

Sorry, forgot the D- .... so the Regex would be:

(?=^[^1]*1?[^1]*$)(?=^[^2]*2?[^2]*$)(?=^[^3]*3?[^3]*$)(?=^[^4]*4?[^4]*

$)D-[1234]{1,4}

Plus if you need it to be readable, add comments like:

var reg = new Regex("(?=^[^1]*1?[^1]*$) (?# Positive Lookahead -
Limit data to at most one '1' )" +
"(?=^[^2]*2?[^2]*$) (?# Positive Lookahead -
Limit data to at most one '2' )" +
"(?=^[^3]*3?[^3]*$) (?# Positive Lookahead -
Limit data to at most one '3' )" +
"(?=^[^4]*4?[^4]*$) (?# Positive Lookahead -
Limit data to at most one '4' )" +
"D-[1234]{1,4} (?# Pattern starts with 'D-'
and has 1 to for digits to the set [1234] )",
RegexOptions.IgnorePatternWhitespace);

Jesse Houwing

unread,

Apr 2, 2009, 4:19:54 PM4/2/09

to

Hello Lee,

Nice one! I hadn't thought of that yet... I think we could make this even
more general... Let's see...

^D-(?:([1234])(?!.*\1)){1,4}$

This is it... match a number and at the same time check if it doesn't occur
further on in the string...

I should have thought of that right from the beginning... I even think there
is an excercise in my training that comes pretty close to this...
(making sure a password doesn't contain more than x of the same characters)...

Jesse Houwing

unread,

Apr 2, 2009, 5:52:32 PM4/2/09

to

Hello Jesse,

and because we already know there will be no more than 3 numbers following
the first one, we can make it yet a little bit more efficient.

^D-(?:([1234])(?!.{0,2}\1)){1,4}$

Peter Duniho

unread,

Apr 2, 2009, 6:30:14 PM4/2/09

to

On Thu, 02 Apr 2009 14:52:32 -0700, Jesse Houwing
<jesse....@newsgroup.nospam> wrote:

> [...]

> and because we already know there will be no more than 3 numbers
> following the first one, we can make it yet a little bit more efficient.
>
> ^D-(?:([1234])(?!.{0,2}\1)){1,4}$

I'm amused by this whole thread, as it's a great illustration of what I
said originally, which is that even if one can come up with a regex to do
the job, it's not clear doing so really adds any value, and in fact has a
tendency to obfuscate what's really going on. For example:

bool IsValid(string strInput, string strStart, char[] rgchValid)
{
if (!strInput.StartsWith(strStart))
{
return false;
}

bool[] rgfMatched = new bool[rgchValid.Length];

foreach (char chCheck in strInput.Substring(strStart.Length))
{
int ichMatch = rgchValid.IndexOf(chCheck);

// Valid character?
if (ichMatch == -1)
{
return false;
}

// Have we seen it before?
if (rgfMatched[ichMatch])
{
return false;
}

rgfMatched[ichMatch] = true;
}

return true;
}

Call it as so:

bool fValid = IsValid("D-1234", "D-", { '1', '2', '3', '4' });

(where the matching pattern arguments would more likely be cached
somewhere, assuming they are used more than once).

That's a LOT longer, but anyone can read the code and easily comprehend
what the method actually does and how it does it. More importantly, that
same "anyone" can easily modify the code to address new criteria, should
they need to be introduced later on.

Performance-wise, for short "valid character" arrays it should be fine,
and for longer ones it'd be trivial to use a Dictionary to perform the
look-up instead. I'd be surprised if there's any significant difference
in performance between an optimized explicit version and an optimized
regex version.

Considering how long it took a handful of people familiar with regex to
come up with a really optimized regex solution, you can imagine how many
hours could be wasted by someone not familiar with regex to figure out how
that solution works, never mind how to update it to support a new feature
that's needed. :)

Pete

Jesse Houwing

unread,

Apr 3, 2009, 3:30:55 AM4/3/09

to

Hello Peter,

The problem is that I can say the same about your solution...

Mine would be different when doing it in code...

For the check on duplicates I'd do something like (not compiled and just
out of bed ;)):

char[] chars = value.ToCharArray();
Array.Sort(chars);
char prev = '0';
for (int i =0; i<chars.Length; i++)
{
if (prev != 0 && chars[i] != prev)
{
prev = chars[i];
}
else
{
return false;
}
}
return true;

And even then there's probably even faster ways. or different ways or more
optimized ways or less optimized ways. I also must say that I find your solution
not so intuitive as you thought it would be. But hey.. That's me.

In the end the regex will be slower (always) than a small functio like this
one. But a regex also has a couple of advantages...
1) You can put it in a config file for on the fly adjustments
2) works across platforms, the regex I built works in javascript, C#, java,
even in SQL 2005 with a little work, write once, use everywhere ;)
3) For people who do understand regex, it's an easy to read solution (if
you haven't written any C# in your life functions like these can become just
as ugly... trust me)
4) I happen to like Regex ;)

Peter Duniho

unread,

Apr 3, 2009, 1:49:55 PM4/3/09

to

On Fri, 03 Apr 2009 00:30:55 -0700, Jesse Houwing
<jesse....@newsgroup.nospam> wrote:

> [...]

> And even then there's probably even faster ways. or different ways or
> more optimized ways or less optimized ways.

Your n log n approach is faster than the n^2 code I posted, but slower
than a hashtable solution (as I suggested as an optimization). If
performance is an issue, you'd use a hashtable. If it's not, you should
use what's most easily understood, and IMHO the "sort and find duplicates"
approach is over-complicated.

And as with the comparison between the n^2 and hashtable that I made in my
previous post, for short inputs (such as the 4-character match array
postulated in the OP), the n^2 approach could easily be faster than either
"optimized" versions (hashtable or sort).

> I also must say that I find your solution not so intuitive as you
> thought it would be. But hey.. That's me.

I'm not sure what was so difficult to understand about the code I posted,
but IMHO there is no question that compared to a regex pattern, the code
is much more broadly understandable.

> In the end the regex will be slower (always) than a small functio like
> this one. But a regex also has a couple of advantages...
> 1) You can put it in a config file for on the fly adjustments
> 2) works across platforms, the regex I built works in javascript, C#,
> java, even in SQL 2005 with a little work, write once, use everywhere ;)
> 3) For people who do understand regex, it's an easy to read solution (if
> you haven't written any C# in your life functions like these can become
> just as ugly... trust me)
> 4) I happen to like Regex ;)

I don't dispute those advantages. But that doesn't mean that they are
relevant in all cases. The fact is, regex simply has a maintenance
problem in terms of comprehension, _and_ it can be very difficult to get
it right (as so amply demonstrated by this thread). The only advantage
you mention above that has a genuine, strong maintenance benefit itself is
the first one, and that would have to be a pretty important criteria to
outweigh the benefit of an implementation that any random programmer can
easily understand.

Pete