delete pattern across multiple lines using sed

Harry

Apr 18, 2018, 3:56:52 PM
I have an html file where I want to delete an "empty row of 7 null values".

The line numbers below are generated by 'cat -n' and are not part of the html file.

I want to delete lines 694 to 716 inclusive. Of course, this pattern does not always fall on this line range.

Any advice appreciated.

TIA

670 <table border='1' width='90%' align='center'>
671 <tr>
672 <th scope="col">
673 PERSON_ID
674 </th>
675 <th scope="col">
676 ALIAS_POOL
677 </th>
678 <th scope="col">
679 ALIAS
680 </th>
681 <th scope="col">
682 ACTIVE_IND
683 </th>
684 <th scope="col">
685 END_EFF_DT
686 </th>
687 <th scope="col">
688 BEG_EFF_DT
689 </th>
690 <th scope="col">
691 COUNT
692 </th>
693 </tr>
694 <tr>
695 <td>
696 &nbsp;
697 </td>
698 <td>
699 &nbsp;
700 </td>
701 <td>
702 &nbsp;
703 </td>
704 <td>
705 &nbsp;
706 </td>
707 <td>
708 &nbsp;
709 </td>
710 <td>
711 &nbsp;
712 </td>
713 <td>
714 &nbsp;
715 </td>
716 </tr>
717 </table>

Ben Bacarisse

Apr 18, 2018, 4:44:48 PM
Harry <harryoo...@hotmail.com> writes:

> I have an html file where I want to delete an "empty row of 7 null values".
>
> The line numbers below are generated by 'cat -n' and are not part of
> the html file.
>
> I want to delete lines 694 to 716 inclusive. Of course, this pattern
> does not always fall on this line range.

This might do what you want:

sed 'H;1h;$!d;x;s!<tr>\n\(<td>\n&nbsp;\n</td>\n\)*</tr>\n!!g'

This is what happens when IBM JCL goes on a date with Teco and one thing
leads to another. It's the code of nightmares.

But it's so useful a nightmare that I have a note of this part so I can
reuse it as needed:

sed 'H;1h;$!d;x; <stuff>'

This reads a whole file into sed so that the commands in <stuff> apply
to it all. Whole-file reading is sometimes the easiest way to do
something so this is in my "worth knowing" list.
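For anyone decoding the idiom, here is a minimal sketch that spells it out one command per line. The three-line input and the closing `s/\n/,/g` are illustrative assumptions standing in for <stuff>; they are not from the original post.

```shell
# Feed three short lines through the slurp idiom, then join them with commas.
joined=$(printf 'a\nb\nc\n' | sed '
  # Append every input line to the hold space, preceded by a newline.
  H
  # On line 1 only, overwrite the hold space instead, which drops the
  # stray leading newline that H just produced.
  1h
  # On every line except the last, end the cycle without printing.
  $!d
  # On the last line, swap the accumulated text into the pattern space.
  x
  # <stuff> goes here: the pattern space now holds the whole input,
  # so \n can be matched like any other character.
  s/\n/,/g
')
printf '%s\n' "$joined"
```

Because the whole input sits in memory at once, this suits files of modest size.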

The substitute command uses ! delimiters for convenience and is globally
applied (the g at the end) to match all occurrences. The \(...\)
creates a group to be repeated and the \n should match the newline. If
you have some other line ending, take appropriate action. You can limit
the match to only 7 cells by replacing the * with \{7\}.
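To see the difference between `*` and `\{7\}`, here is a small sketch run against a made-up table. The temporary file and the 3-cell row are assumptions added for illustration; the two sed commands are the one from this post and its `\{7\}` variant.

```shell
# Hypothetical sample: one empty row of 7 cells and one empty row of 3 cells.
sample=$(mktemp)
{
  printf '<table>\n<tr>\n'
  for i in 1 2 3 4 5 6 7; do printf '<td>\n&nbsp;\n</td>\n'; done
  printf '</tr>\n<tr>\n'
  for i in 1 2 3; do printf '<td>\n&nbsp;\n</td>\n'; done
  printf '</tr>\n</table>\n'
} > "$sample"

# With *, any run of empty cells matches, so both rows are removed.
any=$(sed 'H;1h;$!d;x;s!<tr>\n\(<td>\n&nbsp;\n</td>\n\)*</tr>\n!!g' "$sample")

# With \{7\}, only the row holding exactly 7 empty cells is removed.
seven=$(sed 'H;1h;$!d;x;s!<tr>\n\(<td>\n&nbsp;\n</td>\n\)\{7\}</tr>\n!!g' "$sample")

printf 'with *:\n%s\n\nwith \\{7\\}:\n%s\n' "$any" "$seven"
rm -f "$sample"
```

Note that `*` also accepts zero repetitions, so a bare `<tr>` followed directly by `</tr>` would be removed too.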

Also, note that this is fragile because it does not parse the HTML. It
assumes an exact arrangement of the strings being matched. Even a rogue
extra space in the input will break it. It can be made more robust, but
the effort may not be worth it. You only need to want a little more
flexibility and using something like xsltproc or Perl with an HTML
parser will be a better bet.

<snip>
--
Ben.

Harry

Apr 18, 2018, 5:03:27 PM
On Wednesday, April 18, 2018 at 1:44:48 PM UTC-7, Ben Bacarisse wrote:

> sed 'H;1h;$!d;x;s!<tr>\n\(<td>\n&nbsp;\n</td>\n\)*</tr>\n!!g'

It worked perfectly for me.

Thanks

Thomas 'PointedEars' Lahn

Apr 18, 2018, 9:50:32 PM
Harry #74656 wrote:

> I have an html file [with a table] where I want to delete an "empty row of
> 7 null values".
>
> The line numbers below are generated by 'cat -n' and are not part of the
> html file.
>
> I want to delete lines 694 to 716 inclusive. Of course, this pattern does
> not always fall on this line range.

Do NOT use sed(1) here. Use a *markup parser*, like BeautifulSoup, to parse
the HTML document, then remove the corresponding “tr” *element* from the
result, and serialize the tree again. You can even do this in the Web
browser (which has already parsed the document when it is displayed), with
client-side DOM scripting.

Any line-based or regular-expression based approach is generally *wrong*
here, and sed(1) can only process text files line-by-line and match text
with regular expressions.

See also: <https://stackoverflow.com/a/1732454/855543>

--
PointedEars
<https://github.com/PointedEars> | <http://PointedEars.de/wsvn/>
Twitter: @PointedEars2
Please do not cc me. /Bitte keine Kopien per E-Mail.

Ed Morton

Apr 19, 2018, 12:24:07 AM
You said 'I want to delete an "empty row of 7 null values"' - where is the count
of 7 handled in the above script?

Ed.

Ed Morton

Apr 19, 2018, 12:34:34 AM
On 4/18/2018 2:56 PM, Harry wrote:
> I have an html file where I want to delete an "empty row of 7 null values".
>
> The line numbers below are generated by 'cat -n' and are not part of the html file.
>
> I want to delete lines 694 to 716 inclusive. Of course, this pattern does not always fall on this line range.
>
> Any advice appreciated.
>

With GNU awk for multi-char RS:

awk -v RS='<tr>(\\s*<td>\\s*&nbsp;\\s*</td>\\s*){7}</tr>\\s+' -v ORS= '1' file

or if you don't REALLY need exactly 7 null values but just want an all-empty row:

awk -v RS='<tr>(\\s*<td>\\s*&nbsp;\\s*</td>\\s*)+</tr>\\s+' -v ORS= '1' file

The above will work no matter what white space is around the tags.
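As a quick check of the whitespace tolerance, here is a sketch with deliberately ragged indentation. The sample table is invented for illustration, and the command is spelled `gawk` here (an assumption about how GNU awk is installed) so that it fails loudly where only another awk is available.

```shell
# Hypothetical sample: 7 empty cells with tabs, spaces and odd line breaks,
# followed by a row that must survive.
sample=$(mktemp)
{
  printf '<table>\n  <tr>\n'
  for i in 1 2 3 4 5 6 7; do printf '\t<td> &nbsp;\n  </td>\n'; done
  printf '  </tr>\n  <tr><td>kept</td></tr>\n</table>\n'
} > "$sample"

# The empty row itself acts as the record separator; printing each record
# with an empty ORS glues the text on either side of it back together.
cleaned=$(gawk -v RS='<tr>(\\s*<td>\\s*&nbsp;\\s*</td>\\s*){7}</tr>\\s+' -v ORS= '1' "$sample")

printf '%s\n' "$cleaned"
rm -f "$sample"
```

The trailing `\\s+` means at least one whitespace character (normally the newline) has to follow `</tr>` for the row to match.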

Regards,

Ed.

Ed Morton

Apr 19, 2018, 12:38:51 AM
On 4/18/2018 3:44 PM, Ben Bacarisse wrote:
> Harry <harryoo...@hotmail.com> writes:
>
>> I have an html file where I want to delete an "empty row of 7 null values".
>>
>> The line numbers below are generated by 'cat -n' and are not part of
>> the html file.
>>
>> I want to delete lines 694 to 716 inclusive. Of course, this pattern
>> does not always fall on this line range.
>
> This might do what you want:
>
> sed 'H;1h;$!d;x;s!<tr>\n\(<td>\n&nbsp;\n</td>\n\)*</tr>\n!!g'
>
> This is what happens when IBM JCL goes on a date with Teco and one thing
> leads to another. It's the code of nightmares.
>
> But it's so useful a nightmare that I have a note of this part so I can
> reuse it as needed:
>
> sed 'H;1h;$!d;x; <stuff>'
>
> This reads a whole file into sed so that the commands in <stuff> apply
> to it all. Whole-file reading is sometimes the easiest way to do
> something so this is in my "worth knowing" list.

GNU sed, which you're almost certainly using given that chain of runes, has a
`-z` option to read up to a NUL char at a time into its buffer so unless your
file contains NULs you can replace the druidic chant with that:

sed -z 's!<tr>\n\(<td>\n&nbsp;\n</td>\n\)*</tr>\n!!g' file

May help you sleep a bit better, it is a "z" after all :-).
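A small sketch comparing the two spellings side by side, assuming GNU sed and a NUL-free input. The sample file and the variable names are made up for illustration.

```shell
# Hypothetical sample: one empty row (2 cells, for brevity) and one real row.
sample=$(mktemp)
printf '<table>\n<tr>\n<td>\n&nbsp;\n</td>\n<td>\n&nbsp;\n</td>\n</tr>\n<tr>\n<td>\nx\n</td>\n</tr>\n</table>\n' > "$sample"

# The substitution from the thread, kept in one place so both runs share it.
subst='s!<tr>\n\(<td>\n&nbsp;\n</td>\n\)*</tr>\n!!g'

# Hold-space slurp, as in the earlier post.
slurped=$(sed -e 'H;1h;$!d;x' -e "$subst" "$sample")

# GNU sed -z: with no NUL bytes in the input, the whole file is one record.
zeroed=$(sed -z "$subst" "$sample")

[ "$slurped" = "$zeroed" ] && echo 'same result'
rm -f "$sample"
```

One difference to keep in mind: with `-z` the file's final newline is part of the buffer, whereas the slurp form never sees it, so an empty row on the very last lines of a file would match only in the `-z` version.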

Regards,

Ed.

Ben Bacarisse

Apr 19, 2018, 5:44:43 AM
Ed Morton <morto...@gmail.com> writes:

> On 4/18/2018 4:03 PM, Harry wrote:
>> On Wednesday, April 18, 2018 at 1:44:48 PM UTC-7, Ben Bacarisse wrote:
>>
>>> sed 'H;1h;$!d;x;s!<tr>\n\(<td>\n&nbsp;\n</td>\n\)*</tr>\n!!g'
>>
>> It worked perfectly for me.
>
> You said 'I want to delete an "empty row of 7 null values"' - where is
> the count of 7 handled in the above script?

In my post. I explained that if the 7 mattered, the OP should replace
the * with \{7\}.

--
Ben.

Ben Bacarisse

Apr 19, 2018, 5:59:24 AM
That's handy. It's nice to find another simple sed example.

The trouble with a "worth knowing" file is that it needs to be checked
against the man pages every few years.

--
Ben.