I have a file consisting of groups of lines (with an unknown number of lines in each group). Each line begins with a 6-digit number, followed by an unknown sequence of words and numbers. Consecutive lines starting with the same number form a group.

My problem is to combine the lines of each group into a single line, keeping only the first occurrence of the distinctive number. I have been able to "find" groups using the pattern

(\d{6})(.+)(?:\r\1(.+))+

However, this does not appear to store the expressions matching the inner parentheses into separate variables. Is there a way to achieve the desired replacement using grep?
>I have a file consisting of groups of lines (unknown number of lines in
>each group).
>Each line begins by a 6 digit number, followed by an unknown sequence of
>words and numbers.
>Consecutive lines starting with the same number form a group.
>My problem is to combine lines from each group into a single line, keeping
>only the first occurrence of the distinctive number.
>I have been able to "find" groups using the pattern
>(\d{6})(.+)(?:\r\1(.+))+
>However, this does not appear to store the expressions matching the inner
>parentheses into separate variables.
>Is there a way to achieve the desired replacement using grep?
Provided I understand the task correctly, though your pattern should match
all such groups of lines, I don't see any way to restructure the matched
text in a single step.
(A relatively easy brute force solution would be to concatenate all
matching line pairs, then rinse & repeat. :)
As to your question about storage:
Though the contents of that inner subpattern (.+) are being captured N
times (where N is the number of lines within the match), only the last
instance matched will be stored and available by reference to that
subpattern.
[ As an aside for anyone else who may be wondering, this part of
the pattern (?: ) consists of `non-capturing parentheses` which
do not themselves store matched text. ]
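To make the storage behaviour concrete, here is a minimal sketch using Python's re module, used here only as a stand-in for BBEdit's grep engine (an assumption: both keep just the last text matched by a repeated capturing subpattern). The sample text is invented, and \n replaces \r as the line break.

```python
import re

# Three lines sharing one six-digit prefix (sample data, invented for illustration).
text = "123456 alpha\n123456 beta\n123456 gamma"

# Same shape as the pattern in the question, with \n as the line break.
pattern = re.compile(r"(\d{6})(.+)(?:\n\1(.+))+")
match = pattern.match(text)

whole = match.group(0)     # the entire three-line group is matched
captures = match.groups()  # ('123456', ' alpha', ' gamma')
```

The middle line (" beta") is matched but cannot be retrieved: group 3 was overwritten on the second repetition, and the (?: ) wrapper never stores anything.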
For example, if you apply the following search & replace patterns:
On 5 Oct 2012, at 16:18, jmichel <jmi.mig...@gmail.com> wrote:
> I have a file consisting of groups of lines (unknown number of lines in each group).
> Each line begins by a 6 digit number, followed by an unknown sequence of words and numbers.
> Consecutive lines starting with the same number form a group.
> My problem is to combine lines from each group into a single line, keeping only the first occurrence of the distinctive number.
> I have been able to "find" groups using the pattern
> (\d{6})(.+)(?:\r\1(.+))+
> However, this does not appear to store the expressions matching the inner parentheses into separate variables.
> Is there a way to achieve the desired replacement using grep?
Using regular expressions, yes, but you need a routine. If you put a file containing
this Perl script in ~/Library/Application Support/BBEdit/Text Filters, it will do what
you want. Open the Text Filters palette from the Window menu and you will see the
filter. Double-click it or click Run or, if it's a frequent task, assign a shortcut to the
script.
Save this as ???.pl
#!/usr/bin/perl
use strict;
use warnings;

my %hash;                          # six-digit prefix => concatenated text
my $six_digits     = "[0-9]{6}";
my $remaining_text = ".*";
my $delimiter      = "";           # or ", " for example

while (<>) {
    if ( /^($six_digits)($remaining_text)/ ) {
        $hash{$1} .= $2;           # append the text after the 6 digits
    }
}
# Print one line per prefix, in numeric order of the prefix
for (sort { $a <=> $b } keys %hash) {
    print "$_$delimiter$hash{$_}\n";
}
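Note that the filter above collects text per number across the whole file and prints the results in numeric order. If the original line order matters, or if the same number may reappear later as a separate group, a variant that merges only consecutive lines may be closer to the question. The following is a rough sketch of that variant in Python, not the script above; the function name merge_consecutive and the sample lines are hypothetical.

```python
import re
from itertools import groupby

# A six-digit prefix followed by the rest of the line.
LINE = re.compile(r"^(\d{6})(.*)$")

def merge_consecutive(lines):
    """Merge runs of adjacent lines that share the same six-digit prefix."""
    # Lines without a six-digit prefix are skipped, as in the Perl filter.
    parsed = [m.groups() for m in map(LINE.match, lines) if m]
    merged = []
    # groupby only groups adjacent items, so file order is preserved.
    for number, members in groupby(parsed, key=lambda pair: pair[0]):
        merged.append(number + "".join(rest for _, rest in members))
    return merged

sample = ["123456 a", "123456 b", "000123 c", "123456 d"]
result = merge_consecutive(sample)
# result == ["123456 a b", "000123 c", "123456 d"]
```

The last "123456" line stays separate here because it is not adjacent to the first two; the hash-based filter would fold it into the same output line.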
Thanks for these explanations. They confirm what I suspected.
Assuming that the number of lines in one group can never exceed, say, 15 or so, could one circumvent the difficulty by explicitly repeating the search pattern a sufficient number of times? Then the problem would be to ensure a match also in the case when the number of lines is smaller. Any idea on how that could be achieved? Could conditional matching help (I am not familiar with those "advanced features")?
On Friday, 5 October 2012 at 20:01:21 UTC+2, Patrick Woolsey wrote:
> At 08:18 -0700 10/05/2012, jmichel wrote:
> >I have a file consisting of groups of lines (unknown number of lines in each group).
> >Each line begins by a 6 digit number, followed by an unknown sequence of words and numbers.
> >Consecutive lines starting with the same number form a group.
> >My problem is to combine lines from each group into a single line, keeping only the first occurrence of the distinctive number.
> >I have been able to "find" groups using the pattern
> >(\d{6})(.+)(?:\r\1(.+))+
> >However, this does not appear to store the expressions matching the inner parentheses into separate variables.
> >Is there a way to achieve the desired replacement using grep?
> Provided I understand the task correctly, though your pattern should match all such groups of lines, I don't see any way to restructure the matched text in a single step.
> (A relatively easy brute force solution would be to concatenate all matching line pairs, then rinse & repeat. :)
> As to your question about storage:
> Though the contents of that inner subpattern (.+) are being captured N times (where N is the number of lines within the match), only the last instance matched will be stored and available by reference to that subpattern.
> [ As an aside for anyone else who may be wondering, this part of the pattern (?: ) consists of `non-capturing parentheses` which do not themselves store matched text. ]
> For example, if you apply the following search & replace patterns:
This sounds amazingly powerful and flexible. Thanks a lot. I will try it asap.
The only problem is that I will need to learn Perl if I want to be able to write such scripts…
On Saturday, 6 October 2012 at 01:25:58 UTC+2, eremita wrote:
> On 5 Oct 2012, at 16:18, jmichel <jmi.m...@gmail.com> wrote:
> > I have a file consisting of groups of lines (unknown number of lines in each group).
> > Each line begins by a 6 digit number, followed by an unknown sequence of words and numbers.
> > Consecutive lines starting with the same number form a group.
> > My problem is to combine lines from each group into a single line, keeping only the first occurrence of the distinctive number.
> > I have been able to "find" groups using the pattern
> > (\d{6})(.+)(?:\r\1(.+))+
> > However, this does not appear to store the expressions matching the inner parentheses into separate variables.
> > Is there a way to achieve the desired replacement using grep?
> Using regular expressions, yes, but you need a routine. If you put a file containing this Perl script in ~/Library/Application Support/BBEdit/Text Filters, it will do what you want. Open the Text Filters palette from the Window menu and you will see the filter. Double-click it or click Run or, if it's a frequent task, assign a shortcut to the script.
> Save this as ???.pl
> #!/usr/bin/perl
> my %hash;
> my $six_digits = "[0-9]{6}";
> my $remaining_text = ".*";
> my $delimiter = ""; # or ", " for example
> while (<>) {
> if ( /^($six_digits)($remaining_text)/ ) {
> $hash{$1} .= $2 # append the text after the 6 digits
> }
> }
> for (sort {$a<=>$b} keys %hash) {
> print "$_$delimiter$hash{$_}\n"
> }
>Thanks for these explanations. They confirm what I suspected.
>Assuming that the number of lines in one group can never exceed, say, 15
>or so, could one circumvent the difficulty by explicitly repeating the
>search pattern a sufficient number of times?
Yes, and please see below.
>Then the problem would be to ensure a match also in the case when the
>number of lines is smaller. Any idea on how that could be achieved? Could
>conditional matching help (I am not familiar with those "advanced
>features")?
To do this, just modify your existing pattern to find successive pairs of
matching lines and combine their contents:
Find: (\d{6})(.+)(?:\r\1(.+))
Replace: \1\2\3
and then repeatedly apply Replace All until all line pairs which start with
the same numeric prefix have been consolidated to single lines.
(E.g. for groups of 16 lines or fewer, this will take at most 4 passes of
Replace All; for groups of 64 lines or fewer, 6 passes; etc.)
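The pass counts can be checked with a small simulation. This is only a sketch in Python, with re.sub standing in for BBEdit's Replace All (an assumption: both replace non-overlapping matches left to right). The ^ anchor with re.MULTILINE is an addition not present in the Find pattern above, \n replaces \r, and the function name and sample data are hypothetical.

```python
import re

# Pairwise pattern from the reply, anchored at line start; \n is the line break.
PAIR = re.compile(r"^(\d{6})(.+)\n\1(.+)", re.MULTILINE)

def merge_pairs_until_stable(text):
    """Apply the pairwise replacement until nothing changes; count the passes."""
    passes = 0
    while True:
        merged = PAIR.sub(r"\1\2\3", text)  # one "Replace All"
        if merged == text:
            return text, passes
        text = merged
        passes += 1

# A group of 16 lines followed by an unrelated single line.
sample = "\n".join(f"123456 w{i}" for i in range(1, 17)) + "\n654321 other"
result, passes = merge_pairs_until_stable(sample)
# passes == 4: the group shrinks 16 -> 8 -> 4 -> 2 -> 1 lines
```

Each pass roughly halves the number of lines in a group, which is why the pass count grows with the base-2 logarithm of the largest group size.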
PS: John Delacour's text filter is a much nicer general solution; the only
advantage of the above is it doesn't require knowledge of Perl.
> At 23:50 -0700 10/05/2012, jmichel wrote:
> >Thanks for these explanations. They confirm what I suspected.
> >Assuming that the number of lines in one group can never exceed, say, 15 or so, could one circumvent the difficulty by explicitly repeating the search pattern a sufficient number of times?
> Yes, and please see below.
> >Then the problem would be to ensure a match also in the case when the number of lines is smaller. Any idea on how that could be achieved? Could conditional matching help (I am not familiar with those "advanced features")?
> To do this, just modify your existing pattern to find successive pairs of matching lines and combine their contents:
> Find: (\d{6})(.+)(?:\r\1(.+))
> Replace: \1\2\3
> and then repeatedly apply Replace All until all line pairs which start with the same numeric prefix have been consolidated to single lines.
> (E.g. for groups of 16 lines or fewer, this will take at most 4 passes of Replace All; for groups of 64 lines or fewer, 6 passes; etc.)
> PS: John Delacour's text filter is a much nicer general solution; the only advantage of the above is it doesn't require knowledge of Perl.