split /(..)*/, 1234567890

Autrijus Tang

May 10, 2005, 10:53:35 AM
to perl6-l...@perl.org
In Pugs, the current logic for array submatches in split() is
to stringify each element, and return them separately in the
resulting list. To wit:

pugs> split /(..)*/, 1234567890
('', '12', '34', '56', '78', '90')

Is this sane?

Thanks,
/Autrijus/

Autrijus Tang

May 12, 2005, 11:59:26 AM
to "TSa (Thomas Sandlaß)", Autrijus Tang, perl6-l...@perl.org
On Thu, May 12, 2005 at 04:53:06PM +0200, "TSa (Thomas Sandlaß)" wrote:
> Autrijus Tang wrote:
> > pugs> split /(..)*/, 1234567890
> > ('', '12', '34', '56', '78', '90')
> >
> >Is this sane?
>
> Why the empty string match at the start?

I don't know, I didn't invent that! :-)

$ perl -le 'print join ",", split /(..)/, 123'
,12,3

Thanks,
/Autrijus/

David Storrs

May 12, 2005, 12:22:46 PM
to Autrijus Tang, "TSa (Thomas Sandlaß)", perl6-l...@perl.org

This makes sense when I think about what split is doing, but it is
surprising at first glance. Perhaps this should be included as an
example in the docs?

--Dks

Aaron Sherman

May 12, 2005, 12:59:20 PM
to David Storrs, Perl6 Language List
On Thu, 2005-05-12 at 12:22, David Storrs wrote:
> On May 12, 2005, at 11:59 AM, Autrijus Tang wrote:
> > On Thu, May 12, 2005 at 04:53:06PM +0200, "TSa (Thomas Sandlaß)"
> > wrote:
> >> Autrijus Tang wrote:
> >>
> >>> pugs> split /(..)*/, 1234567890
> >>> ('', '12', '34', '56', '78', '90')
> >>
> >> Why the empty string match at the start?
> >
> > I don't know, I didn't invent that! :-)
> >
> > $ perl -le 'print join ",", split /(..)/, 123'
> > ,12,3
>
> This makes sense when I think about what split is doing, but it is
> surprising at first glance. Perhaps this should be included as an
> example in the docs?

perldoc -f split says:

"Splits a string into a list of strings and returns that list.
By default, empty leading fields are preserved, and empty
trailing ones are deleted [...] If PATTERN is also omitted,
splits on whitespace (after skipping any leading whitespace).
[...] Empty leading (or trailing) fields are produced when there
are positive width matches at the beginning (or end) of the
string [...] As a special case, specifying a PATTERN of space ('
') will split on white space just as "split" with no arguments
does. Thus, "split(' ')" can be used to emulate awk's default
behavior, whereas "split(/ /)" will give you as many null
initial fields as there are leading spaces [...]"

And there you have it.

--
Aaron Sherman <a...@ajs.com>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback


Uri Guttman

May 12, 2005, 1:12:26 PM
to TSa, perl6-l...@perl.org
>>>>> "JSD" == Jonathan Scott Duff <du...@pobox.com> writes:

JSD> To bring this back to perl6, autrijus' original query was regarding

JSD> $ pugs -e 'say join ",", split /(..)*/, 1234567890'

JSD> which currently generates a list of ('','12','34','56','78','90')
JSD> In perl5 it would generate a list of ('','90') because only the last
JSD> pair of characters matched is kept (such is the nature of quantifiers
JSD> applied to capturing parens). But in perl6 quantified captures put all
JSD> of the matches into an array such that "abcdef" ~~ /(..)*/ will make
JSD> $0 = ['ab','cd','ef'].

JSD> I think that the above split should generate a list like this:

JSD> ('', [ '12','34','56','78','90'])

i disagree. if you want complex tree results, use a rule. split is for
creating a single list of elements from a string. it is better to keep
split simple for it is commonly used in this domain. tree results are
more for real parsing (which split is not intended to do) so use a
parsing rule for that.

also note the coding style rule (i think randal created it) which is to
use split when you want to throw things away (the delimiters) and m//
when you want to keep things.
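
A minimal Perl 5 sketch of that style rule, for illustration only (the variable names and the comma-separated sample string are made up here, not taken from the thread): m//g keeps the pieces that match, while split discards the delimiters that match.

```perl
use strict;
use warnings;

my $digits = '1234567890';

# m//g in list context returns every captured pair: the text we want to keep.
my @kept = $digits =~ /(..)/g;

# split without capturing parens throws the matched delimiters away
# and returns only what lies between them.
my @fields = split /,/, 'a,b,c';

print join(' ', @kept), "\n";     # the five digit pairs
print join(' ', @fields), "\n";   # the three letters, commas discarded
```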

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.stemsystems.com
--Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org

Jonathan Scott Duff

May 12, 2005, 1:03:55 PM
to TSa (Thomas Sandlaß), perl6-l...@perl.org
On Thu, May 12, 2005 at 06:29:49PM +0200, "TSa (Thomas Sandlaß)" wrote:
> Autrijus Tang wrote:
> >I don't know, I didn't invent that! :-)
> >
> > $ perl -le 'print join ",", split /(..)/, 123'
> > ,12,3
>
> Hmm,
>
> perl -le 'print join ",", split /(..)/, 112233445566'
> ,11,,22,,33,,44,,55,,66
>
> For longer strings it makes every other match an empty string.

Not quite. The matching part are the strings "11", "22", "33", etc.
And since what matches is what we're splitting on, we get the empty
string between pairs of characters (including the leading empty
string). The only reason you're getting the string that was matched
in the output is because that's what you've asked split to do by
placing parens around the pattern. (Type "perldoc -f split" at your
command prompt and read all about it)
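
As a small Perl 5 sketch of the point above (variable names are illustrative only), the same pattern with and without capturing parens gives:

```perl
use strict;
use warnings;

my $str = '112233445566';

# Without capturing parens only the fields between separators are returned.
# Every field here is empty, and empty trailing fields are dropped,
# so nothing is left.
my @plain = split /../, $str;

# With capturing parens each matched separator is returned as well.
my @captured = split /(..)/, $str;

print scalar(@plain), "\n";         # number of fields without captures
print join(',', @captured), "\n";   # same output as the one-liner quoted above
```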

To bring this back to perl6, autrijus' original query was regarding

$ pugs -e 'say join ",", split /(..)*/, 1234567890'

which currently generates a list of ('','12','34','56','78','90')
In perl5 it would generate a list of ('','90') because only the last
pair of characters matched is kept (such is the nature of quantifiers
applied to capturing parens). But in perl6 quantified captures put all
of the matches into an array such that "abcdef" ~~ /(..)*/ will make
$0 = ['ab','cd','ef'].
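
The perl5 half of that claim can be checked with a short sketch like the following (Perl 5 only; the variable names are illustrative):

```perl
use strict;
use warnings;

# In Perl 5 a quantified capture group remembers only its last iteration.
my $last = '';
if ('abcdef' =~ /(..)*/) {
    $last = $1;
}

# split therefore sees one separator covering the whole string,
# with the final pair as the only captured text.
my @parts = split /(..)*/, '1234567890';

print "$last\n";
print join(',', @parts), "\n";
```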

I think that the above split should generate a list like this:

('', [ '12','34','56','78','90'])

Or, another example:

$ pugs -e 'say join ",", split /(<[abc]>)*/, "xabxbxbcx"'
# ('x', ['a','b'], 'x', ['b'], 'x', ['b','c'], 'x')

But that's just MHO.

> With the "Positions between chars" interpretation the above
> string is, with '.' indicating position:
>
> .1.1.2.2.3.3.4.4.5.5.6.6.
> 0 1 2 3 4 5 6 7 8 9 1 1 1
>                     0 1 2
>
> There are two matches each at 0, 2, 4, 6, 8 and 10.
> The empty match at the end seems to be skipped because
> position 12 is after the string?

No, the empty match at the end is skipped because that's the default
behaviour of split. Preserve leading empty fields and discard empty
trailing ones.
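
A short Perl 5 sketch of that default, using a negative LIMIT to switch it off (the sample string and variable names are illustrative):

```perl
use strict;
use warnings;

# Default: empty trailing fields are discarded.
my @default = split /(..)/, '1122';

# Negative LIMIT: every field is kept, including the empty trailing one.
my @all = split /(..)/, '1122', -1;

print scalar(@default), "\n";   # '', '11', '', '22'
print scalar(@all), "\n";       # '', '11', '', '22', ''
```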

> And for odd numbers of
> chars the before last position doesn't produce an empty
> match:
> perl -le 'print join ",", split /(..)/, 11223'
> ,11,,22,3

There's an empty field between the beginning of the string and "11",
there's an empty field between the "11" and the "22", and finally
there's a field at the end containing only "3"

> Am I the only one who finds that inconsistent?

Probably. :-)

-Scott
--
Jonathan Scott Duff
du...@pobox.com

Jody Belka

May 12, 2005, 1:13:22 PM
to perl6-l...@perl.org
On Thu, May 12, 2005 at 06:29:49PM +0200, "TSa (Thomas Sandlaß)" wrote:
> perl -le 'print join ",", split /(..)/, 112233445566'
> ,11,,22,,33,,44,,55,,66
[snipped]
> perl -le 'print join ",", split /(..)/, 11223'
> ,11,,22,3

>
> Am I the only one who finds that inconsistent?

Maybe, but it's because you're misunderstanding what split does (i can
heartily recommend TFM in this case).

Let's start with a simpler case (inside debugger for help):


x split /../, 112233445566, -1 [ -1 to preserve all found fields ]

0 ''
1 ''
2 ''
3 ''
4 ''
5 ''
6 ''

Split uses the regular expression to find "separators" in the text, and
then returns the contents of the fields between them. The above case looks
like this:

    sep   sep   sep   sep   sep   sep
     |     |     |     |     |     |
     11    22    33    44    55    66
  |     |     |     |     |     |
field field field field field field

Ok, let's try that with your second example:

x split /../, 11223, -1

0 ''
1 ''
2 3

    sep   sep
     |     |
     11    22    3
  |     |        |
field field    field


Now, if the regular expression contains parentheses, additional list
elements are created from each matching substring (quoted almost verbatim
from TFM). So:

x split /(..)/, 112233445566, -1

0 ''
1 11
2 ''
3 22
4 ''
5 33
6 ''
7 44
8 ''
9 55
10 ''
11 66
12 ''


x split /(..)/, 11223, -1

0 ''
1 11
2 ''
3 22
4 3

And of course, if we remove the LIMIT from the equation, then any trailing
fields will be removed. Ergo the results quoted at the top of this email.
Hope this helps you (and anyone else who might have been confused) understand
what is going on.


J

--
Jody Belka
knew (at) pimb (dot) org

Jonathan Scott Duff

May 12, 2005, 1:50:08 PM
to Uri Guttman, TSa, perl6-l...@perl.org
On Thu, May 12, 2005 at 01:12:26PM -0400, Uri Guttman wrote:
> >>>>> "JSD" == Jonathan Scott Duff <du...@pobox.com> writes:
>
> JSD> To bring this back to perl6, autrijus' original query was regarding
>
> JSD> $ pugs -e 'say join ",", split /(..)*/, 1234567890'
>
> JSD> which currently generates a list of ('','12','34','56','78','90')
> JSD> In perl5 it would generate a list of ('','90') because only the last
> JSD> pair of characters matched is kept (such is the nature of quantifiers
> JSD> applied to capturing parens). But in perl6 quantified captures put all
> JSD> of the matches into an array such that "abcdef" ~~ /(..)*/ will make
> JSD> $0 = ['ab','cd','ef'].
>
> JSD> I think that the above split should generate a list like this:
>
> JSD> ('', [ '12','34','56','78','90'])
>
> i disagree. if you want complex tree results, use a rule.

Well ... we *are* using a rule; it just doesn't have a name.

So, would you advocate too that

my @a = "foofoofoobarbarbar" ~~ /(foo)+ (bar)+/;

should flatten? thus @a = ('foo','foo','foo','bar','bar','bar')
rather than (['foo','foo','foo'],['bar','bar','bar']) ?

This may have even been discussed before, but we should probably make
whether or not we keep the delimiters depend on something other than
the presence or absence of parentheses in the pattern. Perhaps the
flattening/non-flattening behavior could be modulated the same way,
probably as a modifier to split.

> split is for creating a single list of elements from a string. it is
> better to keep split simple for it is commonly used in this domain.

I'll wager that splits with non-capturing patterns are far and away the
most common case. :-)

Larry Wall

May 12, 2005, 3:01:59 PM
to perl6-l...@perl.org, TSa (Thomas Sandlaß)
On Thu, May 12, 2005 at 12:03:55PM -0500, Jonathan Scott Duff wrote:
: I think that the above split should generate a list like this:
:
: ('', [ '12','34','56','78','90'])

Yes, though I would think of it more generally as

('', $0, '', $0, '', $0, ...)

where in this case it just happens to be

('', $0)

and $0 expands to ['12','34','56','78','90'] if you treat it as an array.

Larry

Jonathan Scott Duff

May 12, 2005, 3:56:37 PM
to perl6-l...@perl.org, TSa (Thomas Sandlaß)

Exactly so. Principle of least surprise wins again! ;)

Autrijus Tang

May 12, 2005, 4:05:23 PM
to perl6-l...@perl.org, TSa (Thomas Sandlaß)

Thanks, implemented as such.

pugs> map { ref $_ } split /(..)*/, 1234567890
(::Str, ::Array::Const)

Thanks,
/Autrijus/

Rick Delaney

May 12, 2005, 8:33:40 PM
to Autrijus Tang, perl6-l...@perl.org
On Fri, May 13, 2005 at 04:05:23AM +0800, Autrijus Tang wrote:
> > On Thu, May 12, 2005 at 12:01:59PM -0700, Larry Wall wrote:
> > > Yes, though I would think of it more generally as
> > >
> > > ('', $0, '', $0, '', $0, ...)
> > >
> > > where in this case it just happens to be
> > >
> > > ('', $0)
> > >
> > > and $0 expands to ['12','34','56','78','90'] if you treat it as an array.
>
> Thanks, implemented as such.
>
> pugs> map { ref $_ } split /(..)*/, 1234567890
> (::Str, ::Array::Const)

Sorry if I'm getting ahead of the implementation but if it is returning
$0 then shouldn't ref($0) return ::Rule::Result or somesuch? It would
just look like an ::Array::Const if you treat it as such.

--
Rick Delaney
ri...@bort.ca

Autrijus Tang

May 12, 2005, 9:19:51 PM
to Rick Delaney, Autrijus Tang, perl6-l...@perl.org
On Thu, May 12, 2005 at 08:33:40PM -0400, Rick Delaney wrote:
> Sorry if I'm getting ahead of the implementation but if it is returning
> $0 then shouldn't ref($0) return ::Rule::Result or somesuch? It would
> just look like an ::Array::Const if you treat it as such.

...also note that the $0 here is $/[0], also known as Perl 5's $1...

Indeed, the entire match result, that is $/, will always be a
single ::Match object if a match succeeds.

Thanks,
/Autrijus/

Autrijus Tang

May 12, 2005, 9:17:27 PM
to Rick Delaney, Autrijus Tang, perl6-l...@perl.org
On Thu, May 12, 2005 at 08:33:40PM -0400, Rick Delaney wrote:

Er, where does this ::Rule::Result thing come from?

I was basing my implementation on Damian's:

Quantifiers (except C<?> and C<??>) cause a matched subrule or
subpattern to return an array of C<Match> objects, instead of just a
single object.

As well as the PGE's implementation of treating the quantified capture as a
simple PerlArray PMC.

Thanks,
/Autrijus/

Markus Laire

May 13, 2005, 4:21:51 AM
to perl6-l...@perl.org
Rick Delaney wrote:
> On Fri, May 13, 2005 at 04:05:23AM +0800, Autrijus Tang wrote:
>
>>>On Thu, May 12, 2005 at 12:01:59PM -0700, Larry Wall wrote:
>>>
>>>>Yes, though I would think of it more generally as
>>>>
>>>> ('', $0, '', $0, '', $0, ...)
>>>>
>>>>where in this case it just happens to be
>>>>
>>>> ('', $0)
>>>>
>>>>and $0 expands to ['12','34','56','78','90'] if you treat it as an array.

I don't understand this comment. The $0 here is an array of
match-objects and when treated as array it returns an array of
match-objects, not an array of strings. (see below)

>>
>>Thanks, implemented as such.
>>
>> pugs> map { ref $_ } split /(..)*/, 1234567890
>> (::Str, ::Array::Const)
>
>
> Sorry if I'm getting ahead of the implementation but if it is returning
> $0 then shouldn't ref($0) return ::Rule::Result or somesuch? It would
> just look like an ::Array::Const if you treat it as such.

With pugs (r2917) this doesn't return an Array of Strings but an Array
of Match-objects:

pugs> map { ref $_ } split /(..)*/, 1234567890
(::Str, ::Array::Const)

pugs> map { ref $_ } [split /(..)*/, 1234567890][1]
(::Match, ::Match, ::Match, ::Match, ::Match)
pugs> map { ~$_ } [split /(..)*/, 1234567890][1]
('12', '34', '56', '78', '90')
pugs> map { $_.from } [split /(..)*/, 1234567890][1]
(0, 2, 4, 6, 8)

--
Markus Laire
<Jam. 1:5-6>

Jody Belka

May 12, 2005, 1:23:59 PM
to perl6-l...@perl.org
On Thu, May 12, 2005 at 07:13:22PM +0200, Jody Belka wrote:
>     sep   sep   sep   sep   sep   sep
>      |     |     |     |     |     |
>      11    22    33    44    55    66
>   |     |     |     |     |     |
> field field field field field field

whoops. add an extra field component in at the end of that of course.

Mark A Biggar

May 12, 2005, 2:21:33 PM
to TSa (Thomas Sandlaß), perl6-l...@perl.org
No, it's not inconsistent. Think about the simpler case split /a/, 'aaaaa', which yields only empty fields. Now ask to keep the separators:
split /(a)/, 'aaaaa' which will return ('', 'a', '', 'a', '', 'a', '', 'a', '', 'a'). Now look at
split /(a)/, 'aaab' which returns ('', 'a', '', 'a', '', 'a', 'b'). Note there is no empty string before the 'b'.

In the case of split /(..)/, "12345678" all those pairs of digits are separators, so again you get empty strings alternating with digit pairs. If the number of digits is odd, the last one isn't a separator, so it takes the place of the final empty string and there won't be an empty string in the list before it, i.e.,
split /(..)/, 12345 returns ('', '12', '', '34', '5');

This is another of those cases where the computer did exactly what you ask it to.
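
The two cases can be put side by side in a small Perl 5 sketch (the helper name show and the variable names are made up for illustration):

```perl
use strict;
use warnings;

# Quote each element so that empty strings stay visible in the output.
sub show { return join ', ', map { "'$_'" } @_ }

# All characters are separators: only empty fields lie between the captures.
my @all_seps = split /(a)/, 'aaaaa';

# A trailing non-separator becomes a real field, with no empty field before it.
my @with_b = split /(a)/, 'aaab';
my @odd    = split /(..)/, '12345';

print show(@all_seps), "\n";
print show(@with_b), "\n";
print show(@odd), "\n";
```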

--
Mark Biggar
ma...@biggar.org
mark.a...@comcast.net
mbi...@paypal.com


> Autrijus Tang wrote:
> > I don't know, I didn't invent that! :-)
> >
> > $ perl -le 'print join ",", split /(..)/, 123'
> > ,12,3
>
> Hmm,
>
> perl -le 'print join ",", split /(..)/, 112233445566'
> ,11,,22,,33,,44,,55,,66
>
> For longer strings it makes every other match an empty string.
> With the "Positions between chars" interpretation the above
> string is, with '.' indicating position:
>
> .1.1.2.2.3.3.4.4.5.5.6.6.
> 0 1 2 3 4 5 6 7 8 9 1 1 1
>                     0 1 2
>
> There are two matches each at 0, 2, 4, 6, 8 and 10.
> The empty match at the end seems to be skipped because
> position 12 is after the string? And for odd numbers of
> chars the before last position doesn't produce an empty
> match:
>
> perl -le 'print join ",", split /(..)/, 11223'
> ,11,,22,3
>
> Am I the only one who finds that inconsistent?
> --
> TSa (Thomas Sandlaß)
