Regex Issue

Onorio Catenacci

Jun 6, 2016, 2:28:04 PM
to elixir-lang-core
Hi all,

Not sure if this is a bug or simply me misunderstanding how regexes work. I hope no one minds me asking here first, so I don't file a bug if it's just a misunderstanding on my part.

On Windows 10 with Elixir 1.2.6, I've been working on a regex to pull the version portion of a string.  This works:

iex(4)> v = Regex.named_captures(~r/(?<version>\d+.\d+.\d+.\d+)*$/,"Version=15.0.4815.1002")
%{"version" => "15.0.4815.1002"}
iex(5)> v
%{"version" => "15.0.4815.1002"}
iex(6)> v["version"]
"15.0.4815.1002"

But as I was playing with the regex, I found this which struck me as sort of curious:

iex(7)> {:ok, rc} = Regex.compile("(?<version>\d+.\d+.\d+.\d+)*$")
{:ok, ~r/(?<version>\x7F+.\x7F+.\x7F+.\x7F+)*$/}
iex(8)> rc
~r/(?<version>\x7F+.\x7F+.\x7F+.\x7F+)*$/

Which when I try with named_captures gives me this:

iex(9)> v = Regex.named_captures(rc,"Version=15.0.4815.1002")
%{"version" => ""}

Knowing this is Windows 10 and knowing the issues we've had with Unicode over the years, I ran chcp 65001. The problem still occurs. It's a minor issue because, as you can see, I've got a working regex.

So am I misunderstanding something or is this a bug?  Any advice would be appreciated.

--
Onorio



Johnny Winn

Jun 6, 2016, 2:37:19 PM
to elixir-l...@googlegroups.com
This would be a bug from what I can see. The `\` seems to be giving `Regex.compile/1` a fit.

On a Mac, Elixir 1.2.5:

iex(1)> {:ok, rc} = Regex.compile("(?<version>\d+.\d+.\d+.\d+)*$")
{:ok, ~r/(?<version>\x7F+.\x7F+.\x7F+.\x7F+)*$/}
iex(2)> {:ok, rc} = Regex.compile("(?<version>\d+.\w+.\w+.\w+)*$")
{:ok, ~r/(?<version>\x7F+.w+.w+.w+)*$/}

If Jose is ok with you submitting the bug, I can take a look at it :)

~ Johnny

--
You received this message because you are subscribed to the Google Groups "elixir-lang-core" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elixir-lang-co...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elixir-lang-core/d921ceaf-de00-4137-ac6e-dc11209d3db8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Onorio Catenacci

Jun 6, 2016, 2:43:25 PM
to elixir-lang-core
Yeah I think you've hit on the problem Johnny.  It does appear the \d is the issue.

iex(10)> {:ok, rc} = Regex.compile("\d")
{:ok, ~r/\x7F/}
iex(11)> {:ok, rc} = Regex.compile("\w")
{:ok, ~r/w/}

As I say, I can certainly do this without having to compile the regex but it sure looks like a bug (albeit a minor one as far as I can tell).

--
Onorio



Johnny Winn

Jun 6, 2016, 2:48:18 PM
to elixir-l...@googlegroups.com
It falls back to the Erlang regex compiler `:re.compile/1`, so that might actually be where the problem starts.

José Valim

Jun 6, 2016, 2:50:55 PM
to elixir-l...@googlegroups.com
If you call Regex.compile, then you are passing a string, which means \w must now be written as \\w. If you just pass \w, you can see how it becomes simply w, because the \ was discarded by the *string* and never made it to the regex compiler.

In a string, \d is the escape for the delete character, so it becomes the codepoint 007F before it ever enters the regex. When the regex is printed, that character has to be shown as \x7F to avoid ambiguity with the regex's own \d. So, not a bug. :)
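
For anyone following along, here is a minimal sketch of the string-level behavior described in this message. It is only an illustration, and it assumes an Elixir version where `"\d"` is still a recognized string escape, as it is in the 1.2.x sessions quoted in this thread.

```elixir
# String escapes are resolved before Regex.compile/1 ever sees the pattern.
IO.inspect("\d" == <<0x7F>>)      # "\d" is the single delete character (codepoint 007F)
IO.inspect("\w" == "w")           # "\w" is not a string escape, so the backslash is dropped
IO.inspect(String.length("\\d"))  # "\\d" keeps a literal backslash followed by d

# The ~r sigil passes the backslash through to the regex engine untouched.
IO.inspect(Regex.source(~r/\d/) == "\\d")
```

So the pattern that reaches the regex compiler from `"\d"` contains a delete character, not a backslash followed by `d`.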



--


José Valim
Skype: jv.ptec
Founder and Director of R&D

Onorio Catenacci

Jun 6, 2016, 3:00:30 PM
to elixir-lang-core
Thanks José.  This is why I asked before submitting a bug. 


Johnny Winn

Jun 6, 2016, 3:08:48 PM
to elixir-l...@googlegroups.com
José,

I figured that was what was happening, but looking at the code/tests/docs it isn't clear that's what is supposed to happen. Maybe it's a lesser issue, but some clarification in the docs/tests might help for next time, and I wouldn't mind contributing if you're open to that.

Thanks,
Johnny


José Valim

Jun 6, 2016, 3:23:55 PM
to elixir-l...@googlegroups.com
Yes, contributions to the docs are always welcome. In particular, we should make it clear in the Regex.compile docs that backslashes in the pattern must be double escaped, or a sigil like ~S must be used, to avoid confusion.
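
As a rough sketch of those two options, applied to the pattern from the start of this thread (the variable names `pattern_escaped` and `pattern_sigil` are just illustrative):

```elixir
# Option 1: double every backslash in a regular string.
pattern_escaped = "(?<version>\\d+.\\d+.\\d+.\\d+)*$"

# Option 2: use the ~S sigil, which performs no escaping or interpolation.
pattern_sigil = ~S"(?<version>\d+.\d+.\d+.\d+)*$"

# Both spellings produce the same string, backslashes intact.
IO.inspect(pattern_escaped == pattern_sigil)

# Regex.compile/1 now receives \d as the regex engine expects it.
{:ok, rc} = Regex.compile(pattern_sigil)
IO.inspect(Regex.named_captures(rc, "Version=15.0.4815.1002"))
```

Either form should give the same capture as the `~r/.../` version that worked in the first message.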

Josh Adams

Jun 6, 2016, 7:57:44 PM
to elixir-l...@googlegroups.com
...and now I understand an issue I had with Regex.compile 6 months ago and didn't take the time to ask anyone smarter than myself about.  Onorio - your strategy is better!




--
Josh Adams

Onorio Catenacci

Jun 6, 2016, 8:00:04 PM
to elixir-l...@googlegroups.com

Well I'm glad to hear it was helpful!

