I think that I am missing a very simple concept about regex.
s = '0123456789'
s.scan(/\d\d/) #-> ["01", "23", "45", "67", "89"]
Now I want to exclude "45".
How can I express it in the regex?
When it's only one character, I can use ^ inside a character class (e.g. [^4]).
But for two characters, I don't think I can use it.
What I want is:
s = '0123456789'
s.scan(some_regex) #-> ["01", "23", "67", "89"]
What should some_regex be?
Can somebody help me?
Sam
Negative lookahead:
s.scan /(?!4|5)\d\d/
Note the OR sign ('|') between the digits, otherwise it would produce:
["01", "23", "56", "78"]
You need to tune it to your exact domain.
Cheers,
Assaph
You can use a "negative lookahead assertion":
s.scan(/(?!45)\d\d/)
This means, at every point the regex tries to match, "if the next two
characters aren't "45", match \d\d".
HTH.
--
You can use a negative assertion to say you want to skip "45", but the
scan will then bump forward one character and you will end up with the
last matches being "56" and "78":
>> s.scan(/(?!45)\d\d/)
=> ["01", "23", "56", "78"]
So with a little uglier assertion, you can say:
>> s.scan(/(?!45|5)\d\d/)
=> ["01", "23", "67", "89"]
and get what you specified. Although it works for your toy case, I
would be worried that it might not extrapolate well to your real
goal.
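To see where the "56" and "78" come from, here is a small sketch that records the offset of each match. It relies on String#scan setting $~ inside its block; the variable name matches is only illustrative.

```ruby
s = '0123456789'

# Record [offset, text] for every match; scan sets $~ (the MatchData) in the block.
matches = []
s.scan(/(?!45)\d\d/) { matches << [$~.begin(0), $~[0]] }

matches.each { |offset, text| puts "#{offset}: #{text}" }
# 0: 01
# 2: 23
# 5: 56   <- offset 4 was rejected, so the scan retried one character later
# 7: 78
```

Once the match at offset 4 is refused, the pairs are no longer aligned with the original ones, which is why the extra "|5" alternative is needed.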
HTH
Regards,
Jason
http://blog.casey-sweat.us/
But:
>> s = '01234567894657'
=> "01234567894657"
>> s.scan /(?!4|5)\d\d/
=> ["01", "23", "67", "89", "65"]
>> s.scan /\d\d/
=> ["01", "23", "45", "67", "89", "46", "57"]
IOW, you lose "46" and "57".
I prefer a non-RE solution in these cases, as it's simpler:
>> s.scan(/\d\d/).reject {|x| "45" == x}
=> ["01", "23", "67", "89", "46", "57"]
Otherwise RE becomes really complex if you want to make it right - if it's
possible at all (see other postings).
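For what it's worth, one way to keep the scan aligned on pairs might be to anchor every match with \G (the point where the previous match ended) and let the pattern swallow any "45" pairs before capturing the next one. This is only a sketch for the toy input, and it assumes the string consists of nothing but digit pairs.

```ruby
s = '01234567894657'

# \G pins each match to the end of the previous one, so pairs stay aligned.
# (?:45)* consumes unwanted pairs; the group captures the next pair that is not "45".
pairs = s.scan(/\G(?:45)*((?!45)\d\d)/).flatten

p pairs  # => ["01", "23", "67", "89", "46", "57"]
```

It gives the same result as the reject version here, but it is clearly harder to read, which rather supports the non-RE approach.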
Kind regards
robert
What I was trying to solve was this:
I wanted to extract URLs from an HTML source that includes a list of sites.
They are formatted like <a href="something.html">.
But I wanted to exclude <a href="index.html"> from the list.
So (?!index\.html) will do.
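As a rough illustration of that idea, assuming the links look exactly like the ones above (the sample HTML string is made up for this example):

```ruby
# Hypothetical sample input; real pages will vary in quoting and attributes.
html = '<a href="index.html">Home</a> <a href="a.html">A</a> <a href="stuff.html">Stuff</a>'

# Capture the href value unless it is exactly index.html (note the escaped dot).
links = html.scan(/<a href="(?!index\.html")([^"]+)"/).flatten

p links  # => ["a.html", "stuff.html"]
```

Here the lookahead sits right after the opening quote, so a rejected link cannot be picked up again one character later the way the digit pairs were.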
Actually my toy case was not well-defined (I realized this later) and
thus it required more complex solutions like your second case,
s.scan(/(?!45|5)\d\d/).
I think a non-RE solution would be better, as Mr. Robert Klemme said.
But I wanted to learn some RE.
Thanks.
Sam
Does this help?
ary=%w(a.html index.html other.txt evil.html.exe stuff.html)
ary.select{|s| s =~ /\A(?!index).*\.html\z/ } #=> ["a.html", "stuff.html"]
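One caveat with (?!index): it also rejects any name that merely starts with "index". If only index.html itself should be dropped, a variant like the following might be closer; the extra "indexes.html" entry is just an assumed test case.

```ruby
ary = %w(a.html index.html indexes.html other.txt evil.html.exe stuff.html)

# Reject only the exact name "index.html"; other names starting with "index" pass.
kept = ary.select { |s| s =~ /\A(?!index\.html\z).*\.html\z/ }

p kept  # => ["a.html", "indexes.html", "stuff.html"]
```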
--
Simon Strandgaard
Why don't you use a dedicated HTML parser? E.g. there's htmltokenizer,
available at RubyForge, which is quite lightweight and very easy to use, but
there are others, of course.
> I think a non-RE solution would be better, as Mr. Robert Klemme said.
> But I wanted to learn some RE.
This thread was useful, I admit :)
Csaba