(Maybe) a simple question about regex

Sam Kong

Mar 23, 2005, 8:47:31 PM

Hello!

I think that I am missing a very simple concept about regex.

s = '0123456789'
s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]

Now I want to exclude "45".
How can I express it in the regex?
When it's only one character, I can use ^.
But for 2 characters, I don't think I can use it.

What I want is:

s = '0123456789'
s.scan(some_regex) #-> ["01", "23", "67", "89"]

What should some_regex be?

Can somebody help me?

Sam

Assaph Mehr

Mar 23, 2005, 9:04:54 PM

> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.
>
> What I want is:
>
> s = '0123456789'
> s.scan(some_regex) #-> ["01", "23", "67", "89"]

Negative lookahead:
s.scan /(?!4|5)\d\d/
Note the OR sign ('|') between the digits, otherwise it would produce:
["01", "23", "56", "78"]

You need to tune it to your exact domain.
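
For reference, a minimal sketch of the two lookaheads side by side on the sample string. The variable names are only for illustration, and the results shown hold for this input only:

```ruby
s = '0123456789'

# (?!45) rejects only a match that starts exactly at "45"; scan then retries
# one character later, so every following pair is shifted by one.
shifted = s.scan(/(?!45)\d\d/)

# (?!4|5) rejects any match that starts with 4 or with 5, so both start
# positions inside "45" are skipped and the pairing stays aligned here.
aligned = s.scan(/(?!4|5)\d\d/)

p shifted  # ["01", "23", "56", "78"]
p aligned  # ["01", "23", "67", "89"]
```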

Cheers,
Assaph

Carlos

Mar 23, 2005, 9:08:06 PM

[Sam Kong <sam.s...@gmail.com>, 2005-03-24 02.49 CET]

> Hello!
>
> I think that I am missing a very simple concept about regex.
>
> s = '0123456789'
> s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
>
> Now I want to exclude "45".
> How can I express it in the regex?
> When it's only one character, I can use ^.
> But for 2 characters, I don't think I can use it.

You can use a "negative lookahead assertion":

s.scan(/(?!45)\d\d/)

This means that, at every position where the regex tries to match, it
checks: "if the next two characters aren't 45, match \d\d".
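
To see which positions the regex actually matches at, here is a small sketch that imitates scan by hand with Regexp#match and a start offset. The loop and the names (pos, starts) are just an illustration, not how scan is implemented internally:

```ruby
s   = '0123456789'
re  = /(?!45)\d\d/
pos = 0
starts = []

# Search again from the end of the previous match until nothing matches.
while (m = re.match(s, pos))
  starts << m.begin(0)  # offset in s where this match begins
  pos = m.end(0)        # resume right after the match
end

p starts  # [0, 2, 5, 7]
```

The lookahead fails at offset 4, so the next attempt succeeds at offset 5, and the remaining pairs start on odd offsets.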

HTH.
--

Jason Sweat

Mar 23, 2005, 9:09:15 PM

You can use a negative assertion to say you want to skip "45", but the
scan will then resume one character further on, and you will end up with
the last matches being "56" and "78":

>> s.scan(/(?!45)\d\d/)
=> ["01", "23", "56", "78"]

So with a slightly uglier assertion, you can say:

>> s.scan(/(?!45|5)\d\d/)
=> ["01", "23", "67", "89"]

and get what you specified. Although it works for your toy case, I
would be worried that this might not extrapolate well to your real
goal.
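
A quick sketch of that worry, using a longer input chosen purely for illustration: the same pattern also skips a later pair that merely starts with 5.

```ruby
s2 = '01234567894657'

# Plain pairs, then what we would like to keep (everything except "45").
wanted = s2.scan(/\d\d/) - ["45"]

# The tuned lookahead also refuses to start a match at the later "5",
# so the pair "57" is lost.
got = s2.scan(/(?!45|5)\d\d/)

p wanted  # ["01", "23", "67", "89", "46", "57"]
p got     # ["01", "23", "67", "89", "46"]
```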

HTH

Regards,
Jason
http://blog.casey-sweat.us/

Patrick Hurley

Mar 23, 2005, 9:50:47 PM

What they said, but also, if you can be more precise about your real
problem, we might be able to model a better solution. For example, you
might find it more flexible to first match the expression you want and
then scan the result.

Robert Klemme

Mar 24, 2005, 3:08:41 AM

"Assaph Mehr" <ass...@gmail.com> wrote in message
news:1111629894.4...@l41g2000cwc.googlegroups.com...

>
> > s = '0123456789'
> > s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
> >
> > Now I want to exclude "45".
> > How can I express it in the regex?
> > When it's only one character, I can use ^.
> > But for 2 characters, I don't think I can use it.
> >
> > What I want is:
> >
> > s = '0123456789'
> > s.scan(some_regex) #-> ["01", "23", "67", "89"]
>
> Negative lookahead:
> s.scan /(?!4|5)\d\d/
> Note the OR sign ('|') between the digits, otherwise it would produce:
> ["01", "23", "56", "78"]

But:

>> s = '01234567894657'
=> "01234567894657"

>> s.scan /(?!4|5)\d\d/
=> ["01", "23", "67", "89", "65"]
>> s.scan /\d\d/
=> ["01", "23", "45", "67", "89", "46", "57"]

IOW, you lose "46" and "57" (and pick up a spurious "65").

I prefer a non-RE solution in these cases, as it's simpler:

>> s.scan(/\d\d/).reject {|x| "45" == x}
=> ["01", "23", "67", "89", "46", "57"]

Otherwise the RE becomes really complex if you want to get it right, if
that's possible at all (see other postings).
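
One possible way to keep it in a single RE, offered only as a sketch: let the pattern consume "45" through an uncaptured alternative so the pairing stays aligned, and capture every other pair. The nil left by the uncaptured branch has to be dropped afterwards, so this is not purely a regex solution either.

```ruby
s = '01234567894657'

# "45" is matched (and consumed) by the first alternative without a capture;
# any other pair is captured by the group. scan returns one array per match,
# with nil where the group did not take part.
pairs = s.scan(/45|(\d\d)/).flatten.compact

p pairs  # ["01", "23", "67", "89", "46", "57"]
```

On this input it gives the same result as the scan-then-reject version above.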

Kind regards

robert

Sam Kong

Mar 24, 2005, 4:05:45 AM

Thank you and other posters for the answers.
Actually, s.scan(/(?!45)\d\d/) suffices for my real problem.

What I was trying to do was
extract URLs from an HTML source that includes a list of sites.
They are formatted like <a href="something.html">.
But I wanted to exclude <a href="index.html"> from the list.
So (?!index.html) will do.
Actually my toy case was not well-defined (I realized this later) and
thus it required more complex solutions like your second case,
s.scan(/(?!45|5)\d\d/).

I think a non-RE solution would be better, as Mr. Robert Klemme said.
But I wanted to learn some RE.
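
A rough sketch of that extraction, assuming the links really look exactly like <a href="..."> with nothing else inside the tag. The sample HTML and the variable names are made up for illustration. The dot is escaped here because an unescaped . matches any character:

```ruby
html = '<a href="a.html">A</a> <a href="index.html">Home</a> <a href="b.html">B</a>'

# Capture the href value unless it is exactly index.html (the lookahead
# includes the closing quote so that e.g. index2.html is still accepted).
links = html.scan(/<a href="((?!index\.html")[^"]+)">/).flatten

p links  # ["a.html", "b.html"]
```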

Thanks.
Sam

Simon Strandgaard

Mar 24, 2005, 6:00:22 AM

On Thu, 24 Mar 2005 18:09:50 +0900, Sam Kong <sam.s...@gmail.com> wrote:
> To extract url's from an html source which includes list of sites.
> They are formatted like <a href="something.html">.
> But I wanted to exclude <a href="index.html"> from the list.
> So (?!index.html) will do.

Does this help?

ary = %w(a.html index.html other.txt evil.html.exe stuff.html)
ary.select { |s| s =~ /\A(?!index).*\.html\z/ } #=> ["a.html", "stuff.html"]
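
One thing to watch, shown as a sketch with a made-up file list: (?!index) rejects every name that begins with "index", not just index.html. Anchoring the lookahead to the whole name is one way to narrow it.

```ruby
names = %w(a.html index.html indexes.html)

# Rejects anything starting with "index".
loose = names.select { |s| s =~ /\A(?!index).*\.html\z/ }

# Rejects only the exact name index.html.
strict = names.select { |s| s =~ /\A(?!index\.html\z).*\.html\z/ }

p loose   # ["a.html"]
p strict  # ["a.html", "indexes.html"]
```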

--
Simon Strandgaard

Csaba Henk

Mar 25, 2005, 8:25:37 AM

On 2005-03-24, Sam Kong <sam.s...@gmail.com> wrote:
> What I was trying to solve was...
> To extract url's from an html source which includes list of sites.
> They are formatted like <a href="something.html">.
> But I wanted to exclude <a href="index.html"> from the list.
> So (?!index.html) will do.
> Actually my toy case was not well-defined (I realized this later) and
> thus it required more complex solutions like your second case -
> s.scan(/(?!45|5)\d\d/) .

Why don't you use a dedicated HTML parser? E.g. there's htmltokenizer,
available at RubyForge, quite lightweight and very easy to use, but
there are others, of course.

> I think non-RE solution would be better like Mr. Robert Klemme said.
> But I wanted to learn some RE.

This thread was useful, I admit :)

Csaba