regexp question, joining EOL (awk,sed,etc...)

Larry Wall

Jan 19, 1990, 7:30:12 PM

In article <13...@island.uu.net> dan...@island.uu.net ((Dan Smith "Remember MLK")) writes:
:
: Someone here needs a generalized way of turning:
:
: foo
: bar
: 1.23
:
: to the following:
:
: foo_bar
: 1.23
:
: also, it would need to work for:
:
: foo
: bar
: baz
: 1.23
: goo
: tar
: 293
:
: which should give:
:
: foo_bar_baz
: 1.23
: goo_tar
: 293
:
: So the rule seems to be "if a line ends with a character, and the next
: line begins with one, replace the newline with a '_'".
:
: I've tried (in vi) "g/[a-z]\n[a-z]/s//_/"...but that doesn't
: cut it. Any ideas? (I take it that it may be a two-pass sort of solution).

In the first pass, install perl. :-)

In the second pass, feed your file to a perl script that says

#!/usr/bin/perl
$/ = "\0"; # line sep is something non-existent
$_ = <>; # whomp in entire file
s/([a-z])\n([a-z])/${1}_$2/g; # do it
s/([a-z])\n([a-z])/${1}_$2/g; # in case of single char identifiers
print; # whomp out entire file

Alternately, it's pretty easy to do with sed too. Something like

N
:again
/[a-z]\n[a-z]/{
s/\([a-z]\)\n\([a-z]\)/\1_\2/g
N
b again
}
P
D
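
As a usage sketch, the sed script can be wrapped in a shell function and run on the sample from the question. One assumption is added here that is not in the post: `$!N` is used in place of plain `N`, because some sed implementations discard the pattern space when `N` runs out of input, which would lose the last line. The function name `join_sed` is hypothetical.

```shell
#!/bin/sh
# join_sed: hypothetical wrapper around the sed script above.
# Assumption: `$!N` instead of `N`, so the final line is still printed
# by sed implementations that quit silently when N hits end of input.
join_sed() {
    sed '
$!N
:again
/[a-z]\n[a-z]/{
s/\([a-z]\)\n\([a-z]\)/\1_\2/g
$!N
b again
}
P
D
'
}

# Sample input from the original question.
printf 'foo\nbar\nbaz\n1.23\ngoo\ntar\n293\n' | join_sed
```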

In awk, we get something like

{if ($0 ~ /^[a-z]/ && prev ~ /[a-z]$/) ORS="_"
else ORS="\n"
if (prev != "") print prev
prev = $0}
END{ORS="\n"
print prev}

(I'm sure that that could be indented more readably, but I'm scared of
the awk parser.)
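
For readers who want to try it, here is one possible re-indentation of the same awk program, wrapped in a shell function. The braces around the `if`/`else` branches and the name `join_awk` are additions for illustration; the logic is intended to be unchanged. Note that, as in the original, empty input lines are dropped because of the `prev != ""` test.

```shell
#!/bin/sh
# join_awk: hypothetical wrapper; same logic as the awk program above.
join_awk() {
    awk '
    {
        # Use "_" as the separator only when the held line ends in a
        # letter and the current line starts with one.
        if ($0 ~ /^[a-z]/ && prev ~ /[a-z]$/) {
            ORS = "_"
        } else {
            ORS = "\n"
        }
        # Print the held line, followed by whichever separator was chosen.
        if (prev != "") {
            print prev
        }
        prev = $0
    }
    END {
        # The final held line always ends with a newline.
        ORS = "\n"
        print prev
    }'
}

# First sample from the original question.
printf 'foo\nbar\n1.23\n' | join_awk
```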

Running that through the awk-to-perl translator, we get the following fluff:

#!/usr/bin/perl
eval "exec /usr/local/bin/perl -S $0 $*"
if $running_under_some_shell;
# this emulates #! processing on NIH machines.
# (remove #! line above if indigestible)

eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_]+=)(.*)/ && shift;
# process any FOO=bar switches

$, = ' '; # set output field separator
$\ = "\n"; # set output record separator

while (<>) {
    chop; # strip record separator
    if ($_ =~ /^[a-z]/ && $prev =~ /[a-z]$/) {
        $\ = '_';
    }
    else {
        $\ = "\n";
    }
    if ($prev ne '') {
        print $prev;
    }
    $prev = $_;
}

$\ = "\n";
print $prev;

or, more idiomatically

#!/usr/bin/perl
chop($prev = <>);
while (<>) {
    chop; # strip record separator
    $prev .= ($_ =~ /^[a-z]/ && $prev =~ /[a-z]$/) ? '_' : "\n";
    print $prev;
    $prev = $_;
}
print $prev,"\n";

Larry Wall
lw...@jpl-devvax.jpl.nasa.gov

Scott Schwartz

Jan 20, 1990, 3:00:09 AM

Larry writes:
>$/ = "\0"; # line sep is something non-existent

You know, I've always kind of disliked doing that. Suppose your file
contains all possible byte values 0..255? Something loses. Maybe doing
something like ``undef /;'' to make ``$/'' undefined could be used to
tell perl to just read the whole thing. (Undefined is different from
the null string, right?)

Larry Wall

Jan 22, 1990, 2:12:16 PM

In article <Ckv...@cs.psu.edu> schw...@cs.psu.edu (Scott Schwartz) writes:

Right, though the incantation would be "undef $/;".

If you are in that situation, then it's easier just to say

read(STDIN, $_, 1000000000);

No doubt you'll now complain that you have a file larger than a gigabyte... :-)

However, your idea has merit (in particular because the above won't read
from <>). In fact, I just implemented it. Thanks.

Larry