Re: Subpatterns in Grep?

Patrick Woolsey

Oct 5, 2012, 2:01:12 PM

to bbe...@googlegroups.com

At 08:18 -0700 10/05/2012, jmichel wrote:
>I have a file consisting of groups of lines (unknown number of lines in
>each group).
>Each line begins by a 6 digit number, followed by an unknown sequence of
>words and numbers.
>Consecutive lines starting with the same number form a group.
>My problem is to combine lines from each group into a single line, keeping
>only the first occurrence of the distinctive number.
>I have been able to "find" groups using the pattern
>(\d{6})(.+)(?:\r\1(.+))+
>
>However, this does not appear to store the expressions matching the inner
>parentheses into separate variables.
>
>Is there a way to achieve the desired replacement using grep?

Provided I understand the task correctly: although your pattern should match
all such groups of lines, I don't see any way to restructure the matched
text in a single step.

(A relatively easy brute force solution would be to concatenate all
matching line pairs, then rinse & repeat. :)

As to your question about storage:

Though the contents of that inner subpattern (.+) are being captured N
times (where N is the number of lines within the match), only the last
instance matched will be stored and available by reference to that
subpattern.

[ As an aside for anyone else who may be wondering, this part of
the pattern (?: ) consists of `non-capturing parentheses` which
do not themselves store matched text. ]

For example, if you apply the following search & replace patterns:

Find: (?:(\d{6})\r)+
Replace: \1

to this text:

111222
333444
555666

the result will be:

555666
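
The same behavior can be checked outside BBEdit. The following is a minimal Perl sketch, assuming "\n" line breaks in place of BBEdit's \r; the variable names are illustrative only:

```perl
use strict;
use warnings;

# Three six-digit lines, each ending in a line break
# ("\n" here; BBEdit's grep writes line breaks as \r).
my $text = "111222\n333444\n555666\n";

# The outer (?: ... ) groups without capturing. The inner (\d{6})
# captures on every repetition, but only the last capture is kept.
(my $result = $text) =~ s/(?:(\d{6})\n)+/$1/;

print "$result\n";    # prints 555666
```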

Regards,

Patrick Woolsey
==
Bare Bones Software, Inc. <http://www.barebones.com/>

John Delacour

Oct 5, 2012, 7:25:47 PM

to bbe...@googlegroups.com

On 5 Oct 2012, at 16:18, jmichel <jmi.m...@gmail.com> wrote:

> I have a file consisting of groups of lines (unknown number of lines in each group).
> Each line begins by a 6 digit number, followed by an unknown sequence of words and numbers.
> Consecutive lines starting with the same number form a group.
> My problem is to combine lines from each group into a single line, keeping only the first occurrence of the distinctive number.
> I have been able to "find" groups using the pattern
> (\d{6})(.+)(?:\r\1(.+))+
> However, this does not appear to store the expressions matching the inner parentheses into separate variables.
> Is there a way to achieve the desired replacement using grep?

Using regular expressions, yes, but you need a routine. If you put a file containing
this Perl script in ~/Library/Application Support/BBEdit/Text Filters, it will do what
you want. Open the Text Filters palette from the Window menu and you will see the
filter. Double-click it or click Run or, if it's a frequent task, assign a shortcut to the
script.

Save this as ???.pl

#!/usr/bin/perl
use strict;
use warnings;

my %hash;    # six-digit prefix => accumulated text
my $six_digits = "[0-9]{6}";
my $remaining_text = ".*";
my $delimiter = ""; # or ", " for example
while (<>) {
    if ( /^($six_digits)($remaining_text)/ ) {
        $hash{$1} .= $2;    # append the text after the 6 digits
    }
}
# print one combined line per prefix, in ascending numeric order
for (sort { $a <=> $b } keys %hash) {
    print "$_$delimiter$hash{$_}\n";
}
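
The script above collects every line with the same prefix, wherever it occurs, and prints the groups in numeric order. If only consecutive lines should be merged and the original order kept, as the question describes, a variant along these lines might serve. It is a sketch, not the method posted above: the sub name merge_consecutive is hypothetical, and passing non-matching lines through unchanged is an assumption.

```perl
use strict;
use warnings;

# Merge consecutive lines that share a six-digit prefix, keeping input order.
# merge_consecutive is a hypothetical helper name, not part of BBEdit or Perl.
sub merge_consecutive {
    my ($delimiter, @lines) = @_;
    my @out;
    my $current = '';    # prefix of the group currently being built
    for my $line (@lines) {
        chomp $line;
        if ( $line =~ /^([0-9]{6})(.*)/ ) {
            if ( @out && $1 eq $current ) {
                # same prefix as the previous line: append only the remainder
                $out[-1] .= $delimiter . $2;
            }
            else {
                push @out, $line;
                $current = $1;
            }
        }
        else {
            # assumption: lines without a six-digit prefix pass through as-is
            push @out, $line;
            $current = '';
        }
    }
    return @out;
}

my @merged = merge_consecutive( '', "000123 a\n", "000123 b\n", "000045 c\n", "000123 d\n" );
print "$_\n" for @merged;
```

Here the first two lines are combined into "000123 a b", while the later "000123 d" stays separate because it is not adjacent to the first group. To use it as a text filter, the sample call could be replaced by print "$_\n" for merge_consecutive( '', <STDIN> );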

#JD

Patrick Woolsey

Oct 6, 2012, 11:04:32 AM

to bbe...@googlegroups.com, pwoo...@barebones.com

At 23:50 -0700 10/05/2012, jmichel wrote:
>Thanks for these explanations. They confirm what I suspected.
>Assuming that the number of lines in one group can never exceed, say, 15
>or so, could one circumvent the difficulty by explicitly repeating the
>search pattern a sufficient number of times?

Yes, and please see below.

>Then the problem would be to ensure a match also in the case when the
>number of lines is smaller. Any idea on how that could be achieved? Could
>conditional matching help (I am not familiar with those "advanced
>features")?

To do this, just modify your existing pattern to find successive pairs of
matching lines and combine their contents:

Find: (\d{6})(.+)(?:\r\1(.+))

Replace: \1\2\3

and then repeatedly apply Replace All until all line pairs which start with
the same numeric prefix have been consolidated to single lines.

(E.g. for groups of 16 lines or fewer, this will take at most 4 passes of
Replace All; for groups of 64 lines or fewer, 6 passes; etc.)
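
The repeated Replace All can be sketched outside BBEdit as a loop. This is a minimal Perl sketch under stated assumptions: the sub name merge_pairs_until_stable is hypothetical, "\n" stands in for BBEdit's \r, and the ^ and $ anchors with the /m flag are additions so that the six digits only match at the start of a line.

```perl
use strict;
use warnings;

# Repeatedly merge adjacent line pairs that share a six-digit prefix.
# Returns the merged text and the number of passes that changed it.
# merge_pairs_until_stable is a hypothetical name for illustration.
sub merge_pairs_until_stable {
    my ($text) = @_;
    my $passes = 0;
    # With /g, s/// returns the number of substitutions made,
    # or a false value when nothing matched, which ends the loop.
    while ( $text =~ s/^(\d{6})(.+)\n\1(.+)$/$1$2$3/mg ) {
        $passes++;
    }
    return ( $text, $passes );
}

# Sample data: 16 lines that all share the prefix 123456.
my $sample = '';
$sample .= "123456 w$_\n" for 1 .. 16;

my ( $merged, $passes ) = merge_pairs_until_stable($sample);
print "passes: $passes\n";    # prints passes: 4
print $merged;
```

Each pass roughly halves the number of lines in a group, which is why 16 lines need 4 passes that change the text.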

PS: John Delacour's text filter is a much nicer general solution; the only
advantage of the above is that it doesn't require knowledge of Perl.
