regexp question, joining EOL (awk,sed,etc...)

Larry Wall

Jan 19, 1990, 7:30:12 PM
In article <13...@island.uu.net> dan...@island.uu.net ((Dan Smith "Remember MLK")) writes:
:
: Someone here needs a generalized way of turning:
:
: foo
: bar
: 1.23
:
: to the following:
:
: foo_bar
: 1.23
:
: also, it would need to work for:
:
: foo
: bar
: baz
: 1.23
: goo
: tar
: 293
:
: which should give:
:
: foo_bar_baz
: 1.23
: goo_tar
: 293
:
: So the rule seems to be "if a line ends with a character, and the next
: line begins with one, replace the newline with a '_'".
:
: I've tried (in vi) "g/[a-z]\n[a-z]/s//_/"...but that doesn't
: cut it. Any ideas? (I take it that it may be a two-pass sort of solution).

In the first pass, install perl. :-)

In the second pass, feed your file to a perl script that says

#!/usr/bin/perl
$/ = "\0"; # line sep is something non-existent
$_ = <>; # whomp in entire file
s/([a-z])\n([a-z])/${1}_$2/g; # do it
s/([a-z])\n([a-z])/${1}_$2/g; # in case of single char identifiers
print; # whomp out entire file

Alternately, it's pretty easy to do with sed too. Something like

N
:again
/[a-z]\n[a-z]/{
s/\([a-z]\)\n\([a-z]\)/\1_\2/g
N
b again
}
P
D
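
To try the sed version from a shell, something along these lines should work. This is only a sketch: the function name join_sed is made up for illustration, and what N does when it runs out of input differs between sed implementations (GNU sed prints the pattern space before quitting, which is what the last line of output relies on here).

```shell
#!/bin/sh
# join_sed: hypothetical wrapper around the sed script above.
# Reads from the named files, or from stdin when none are given.
join_sed() {
  sed '
N
:again
/[a-z]\n[a-z]/{
s/\([a-z]\)\n\([a-z]\)/\1_\2/g
N
b again
}
P
D
' "$@"
}

# Sample input from the original question.
printf 'foo\nbar\n1.23\n' | join_sed
# With GNU sed this prints:
#   foo_bar
#   1.23
```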

In awk, we get something like

{if ($0 ~ /^[a-z]/ && prev ~ /[a-z]$/) ORS="_"
else ORS="\n"
if (prev != "") print prev
prev = $0}
END{ORS="\n"
print prev}

(I'm sure that that could be indented more readably, but I'm scared of
the awk parser.)
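
For anyone who wants to run the awk version directly, here is one way to wrap it in a shell function. It is a sketch, not the only way to do it: the name join_awk is invented for this example, and the whitespace inside the script is my own. Note that the script as written skips printing when prev is empty, so blank lines in the input are dropped.

```shell
#!/bin/sh
# join_awk: hypothetical wrapper around the awk script above.
# Each line is held in prev and printed one record late, so that ORS
# can be chosen once we know how the following line begins.
join_awk() {
  awk '
{ if ($0 ~ /^[a-z]/ && prev ~ /[a-z]$/) ORS = "_"
  else ORS = "\n"
  if (prev != "") print prev
  prev = $0 }
END { ORS = "\n"
      print prev }
' "$@"
}

# Second sample input from the original question.
printf 'foo\nbar\nbaz\n1.23\ngoo\ntar\n293\n' | join_awk
# Prints:
#   foo_bar_baz
#   1.23
#   goo_tar
#   293
```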

Running that through the awk-to-perl translator, we get the following fluff:

#!/usr/bin/perl
eval "exec /usr/local/bin/perl -S $0 $*"
if $running_under_some_shell;
# this emulates #! processing on NIH machines.
# (remove #! line above if indigestible)

eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_]+=)(.*)/ && shift;
# process any FOO=bar switches

$, = ' '; # set output field separator
$\ = "\n"; # set output record separator

while (<>) {
    chop; # strip record separator
    if ($_ =~ /^[a-z]/ && $prev =~ /[a-z]$/) {
        $\ = '_';
    }
    else {
        $\ = "\n";
    }
    if ($prev ne '') {
        print $prev;
    }
    $prev = $_;
}

$\ = "\n";
print $prev;

or, more idiomatically

#!/usr/bin/perl
chop($prev = <>);
while (<>) {
    chop; # strip record separator
    $prev .= ($_ =~ /^[a-z]/ && $prev =~ /[a-z]$/) ? '_' : "\n";
    print $prev;
    $prev = $_;
}
print $prev,"\n";

Larry Wall
lw...@jpl-devvax.jpl.nasa.gov

Scott Schwartz

Jan 20, 1990, 3:00:09 AM
Larry writes:
>$/ = "\0"; # line sep is something non-existent

You know, I've always kind of disliked doing that. Suppose your file
contains all possible byte values 0..255? Something loses. Maybe doing
something like ``undef /;'' to make ``$/'' undefined could be used to
tell perl to just read the whole thing. (Undefined is different from
the null string, right?)

Larry Wall

Jan 22, 1990, 2:12:16 PM
In article <Ckv...@cs.psu.edu> schw...@cs.psu.edu (Scott Schwartz) writes:

Right, though the incantation would be "undef $/;".

If you are in that situation, then it's easier just to say

read(STDIN, $_, 1000000000);

No doubt you'll now complain that you have a file larger than a gigabyte... :-)

However, your idea has merit (in particular because the above won't read
from <>). In fact, I just implemented it. Thanks.

Larry
