Regex.split: how to capture delimiters?

Mark Reed

Sep 1, 2014, 3:59:25 PM

to elixir-l...@googlegroups.com

Input: "caterpillar"

Desired output: [ "c", "a", "t", "e", "rp", "i", "ll", "a", "r"] - that is, split on vowels, but include the vowels in the resulting list.

In Perl, Python, Ruby, Javascript, and several other languages, I get this result by just splitting on /([aeiou])/. With PHP's preg_split I have to additionally specify a flag option to get this behavior, and that was how I interpreted the API doc for Elixir as well, but I can't get it to work:

iex(1)> Regex.split ~r{([aeiou])}, "caterpillar"
["c", "t", "rp", "ll", "r"]
iex(2)> Regex.split ~r{([aeiou])}, "caterpillar", on: [:all]
["caterpillar"]

I tried all the other listed possible values for 'on' - including :first, which the API doc says is the default - and they all returned the single-string list ["caterpillar"].

So I'm a bit stumped. What am I doing wrong?

Mark J. Reed

Sep 1, 2014, 4:04:13 PM

If I pass the argument to :on as a bare atom instead of a list, I get ["c", "t", "rp", "ll", "r"] for :all, and this error for :second:

** (ArgumentError) argument error

(stdlib) re.erl:733: :re.run("caterpillar", {:re_pattern, 1, 0, 0, <<69, 82, 67, 80, 112, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 255, 255, 255, 255, 255, 255, 255, 255, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>}, [:global, {:capture, :second}])

(elixir) lib/regex.ex:363: Regex.split/3

--
You received this message because you are subscribed to a topic in the Google Groups "elixir-lang-talk" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/elixir-lang-talk/3xQ4q5YSCnk/unsubscribe.
To unsubscribe from this group and all its topics, send an email to elixir-lang-ta...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
Mark J. Reed <mark...@gmail.com>

Doug Goldie

Sep 1, 2014, 6:30:59 PM

Hi Mark,

Here's my solution.

https://gist.github.com/dgoldie/f052ce147541f9a7ee52

This was a fun exercise in pattern matching....

But I'm sure there's a much simpler way to do it. :-)

...I look forward to seeing the "expert" solutions

-doug.

Gilbert Kennen

Sep 1, 2014, 6:59:51 PM

I can't seem to find a straightforward way of doing it either, but here
is a relatively simple way of achieving the goal.

Regex.split ~r/()[aeiou]()/, "caterpillar", on: [1,2]

Mark J. Reed

Sep 1, 2014, 7:01:52 PM

Thanks, Doug, but my question was really about `split` itself: it looks like it ought to be able to produce the desired behavior, but I couldn't figure out how.

Solving the stated problem was just an example. I would probably put together some horrid pipeline like this:

Regex.scan(~r{(?!<[^aeiou])([aeiou]*)([^aeiou]*)}, "caterpillar") |>
  Enum.map(&tl/1) |>
  List.flatten |>
  Enum.filter(fn s -> String.length(s) > 0 end)
#=> ["c", "a", "t", "e", "rp", "i", "ll", "a", "r"]

... in the course of constructing which I find myself surprised that there's not a built-in predicate to test for the empty string. If I just missed it, I'd appreciate a pointer to it. :)

Mark J. Reed

Sep 1, 2014, 8:29:51 PM

> I find myself surprised that there's not a built-in predicate to test for the empty string.

While perhaps not the most obvious application, String.last works for this, since it returns nil on empty strings and a truthy value otherwise. So this might or might not be mildly less horrid:

Regex.scan(~r{(?!<[^aeiou])([aeiou]*)([^aeiou]*)}, "caterpillar") |>
  Enum.map(&tl/1) |>
  List.flatten |>
  Enum.filter(&String.last/1)

José Valim

Sep 1, 2014, 8:40:18 PM

> I can't seem to find a straightforward way of doing it either, but here is a relatively simple way of achieving the goal.
>
> Regex.split ~r/()[aeiou]()/, "caterpillar", on: [1,2]

This is the correct answer.

We have moved away from the behaviour where regex captures are implicitly returned. Instead, you pass which captures you want to split on. In this case, it is splitting on the first and second parens (0 would be the whole regex itself).
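To make the capture numbering concrete, here is a minimal sketch comparing a few `on:` values for the same regex. It assumes an Elixir version whose Regex.split/3 accepts `on:` as described above; the variable names are only for illustration, and the results are shown as comments.

```elixir
word = "caterpillar"
regex = ~r/()[aeiou]()/

# Default (on: :first): split on the whole match, so each vowel is consumed.
IO.inspect(Regex.split(regex, word))
# ["c", "t", "rp", "ll", "r"]

# on: [0] names the whole match explicitly and gives the same result.
IO.inspect(Regex.split(regex, word, on: [0]))
# ["c", "t", "rp", "ll", "r"]

# on: [1] splits only at the empty group in front of each vowel.
IO.inspect(Regex.split(regex, word, on: [1]))
# ["c", "at", "erp", "ill", "ar"]

# on: [1, 2] splits at the empty groups on both sides of each vowel,
# so nothing is consumed and every vowel becomes its own item.
IO.inspect(Regex.split(regex, word, on: [1, 2]))
# ["c", "a", "t", "e", "rp", "i", "ll", "a", "r"]
```

The `on: [1]` case is the one the implicit-capture style cannot express directly: it picks some split points and skips others.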

José Valim

www.plataformatec.com.br

Skype: jv.ptec

Founder and Lead Developer

Mark J. Reed

Sep 1, 2014, 8:59:05 PM

> We have moved away from the behaviour where regex captures are implicitly returned. Instead, you pass which captures you want to split on. In this case, it is splitting on the first and second parens (0 would be the whole regex itself).

Ah, so the parentheses indicate where to split, not what to return. That was the source of my confusion. Thanks!

Mark J. Reed

Sep 1, 2014, 9:04:22 PM

> Regex.split ~r/()[aeiou]()/, "caterpillar", on: [1,2]

This could also be written:

Regex.split ~r/()[aeiou]()/, "caterpillar", on: :all_but_first

This splits at every parenthesized group in the regex, however many there are, but not around the match as a whole.

Jim Freeze

Sep 2, 2014, 10:13:17 AM

What does the empty () indicate?

How is this better (more flexible) than the following:

2.1.2 :002 > str = "caterpillar"
 => "caterpillar"
2.1.2 :003 > str.split(/([aeiou])/)
 => ["c", "a", "t", "e", "rp", "i", "ll", "a", "r"]

Mark J. Reed

Sep 2, 2014, 11:01:24 AM

With the Perl/Python/Ruby/Javascript semantics you demonstrated, you can choose to add portions or all of the delimiter to the returned results. But you have no way to selectively omit some of the splits, to effectively join some of those results together.

What if I want to split the string after vowels but keep the vowel as part of the preceding string? It's a bad job of syllabification, but it's easy to do with the Elixir semantics:

Regex.split ~r/[aeioru]+()/, "caterpillar", on: [1] #=> ["ca", "ter", "pi", "llar", ""]

You can't quite do that with the Ruby semantics. If you had variable-length lookbehind, that would work, but `\K` doesn't quite work as expected in the context of split.

So: the new Elixir semantics can duplicate the result of the old, but not the other way around.

Whether it's worth it is debatable, but it's clearly more flexible.

Jim Freeze

Sep 2, 2014, 11:10:35 AM

Thank you, Mark.

I figured this notation was more flexible.

But can you explain (or provide a link for) the '()' notation in the regex?

I have no idea what that's doing or why it works. To me it says, match on nothing and capture it.

Gilbert Kennen

Sep 2, 2014, 11:19:55 AM

On 09/02/2014 08:10 AM, Jim Freeze wrote:
> Thank you, Mark.
> I figured this notation was more flexible.
> But, can you explain (or provide a link) to the '()' notation in the Regex.
>
> I have no idea what that's doing or why it works. To me it says, match on
> nothing and capture it.

() is simply a zero-length capture. Regex.split receives indices and
lengths of all of the captures and if it is configured to split on a
particular capture, does so. Anything within the capture will be
'deleted', but since it is zero length that is nothing in this case.
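A rough way to see those indices and lengths for yourself is to ask Regex.scan/3 for positions instead of strings. This is only a sketch, assuming the `return: :index` option of Regex.scan/3 and using an illustrative variable name; it shows what the regex engine reports, not the internals of Regex.split itself.

```elixir
regex = ~r/()[aeiou]()/

# return: :index yields {byte_offset, length} for each capture; within each
# match the order is: whole match, group 1, group 2.
IO.inspect(Regex.scan(regex, "caterpillar", return: :index))
# [
#   [{1, 1}, {1, 0}, {2, 0}],
#   [{3, 1}, {3, 0}, {4, 0}],
#   [{6, 1}, {6, 0}, {7, 0}],
#   [{9, 1}, {9, 0}, {10, 0}]
# ]

# A capture with non-zero length is removed from the output when split on:
IO.inspect(Regex.split(~r/(x)/, "Elixir", on: [1]))
# ["Eli", "ir"]
```

Both groups have length 0, so splitting on them cuts the string at offsets 1 and 2, 3 and 4, and so on without removing anything.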

Mark J. Reed

Sep 2, 2014, 11:46:00 AM

The captures in the regular expression are indicators that say "split here".

Anything to the left of the capture goes in one item in the result list, while anything to the right of the capture goes in the next item. Anything inside the capture is deleted. So `()` is just a split point - it's empty, so nothing is removed, but the string is split there.

However, not all captures are automatically used as split locations. In fact, captures in the split pattern are completely ignored by default. So if you want to split selectively, you have to specify which captures to split at using the on: option to the split function. It takes a list of indexes (where 0 is the whole regex, 1 is the first capture, etc.) or one of the capture keywords listed at the top of the Regex API doc.
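The rules above can be sketched in a few lines. The first two calls use only what this thread describes. The third assumes the `include_captures: true` option documented for newer Elixir releases, which as far as I know postdates this thread, so treat it as an assumption about your version.

```elixir
# Captures are ignored unless :on asks for them, so this is an ordinary split.
IO.inspect(Regex.split(~r/([aeiou])/, "caterpillar"))
# ["c", "t", "rp", "ll", "r"]

# :on also accepts capture names; here the named group "second" is the
# only thing split on (and deleted).
IO.inspect(Regex.split(~r{a(?<second>b)c}, "abc", on: [:second]))
# ["a", "c"]

# Assumption: a newer Elixir with include_captures. It returns the matched
# delimiters as items of their own, without needing the empty () groups.
IO.inspect(Regex.split(~r/[aeiou]/, "caterpillar", include_captures: true))
# ["c", "a", "t", "e", "rp", "i", "ll", "a", "r"]
```

If `include_captures` is not available, the `~r/()[aeiou]()/` with `on: [1, 2]` form from earlier in the thread gives the same list.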
