parse html table? (using awk to process data in html table)

Zhang Weiwu

Jan 8, 2005, 9:48:13 AM

Hello. I have a very long table of date that I wish to process with my
awk skill.

The question: the data is stored in a very large html <table>. AFAIK I
cannot feed this data directly to awk to process.

What I need is to convert this <table> html ELEMENT into a text file,
TAB separated or CSV file, so that awk could work on it. The problem is
how to do the conversion.

Another small problem: There are line-breaks in table-cells, but I don't
think it will be a problem. I guess I could globally replace these
line-breaks (s/<br>/\r/g) with \r (I am on Linux, so the correct
line-break is \n, and I guess awk will think \r is in a field, rather
than a record separator).

Just an example:

product \t height \t price \t comment \n
wine \t 30mm \t 32.20USD \t Looks nice.\rJohn said we better try it.\n

In this case, there is a line-break inside the 4th field. I hope I
could work around it by using \r for line-breaks inside fields. Any
suggestions?
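
A quick way to check the \r idea is to feed awk the sample record from above. This is only a sketch: the helper name show_comment_lines is made up for illustration, and it assumes the input is already tab separated with one record per \n-terminated line.

```shell
# Report how awk sees a tab-separated record whose 4th field contains \r.
# show_comment_lines is a hypothetical helper name, not from the thread.
show_comment_lines() {
  awk -F'\t' '{
    n = split($4, lines, "\r")        # break the 4th field on the in-field \r
    printf "fields=%d comment_lines=%d\n", NF, n
    for (i = 1; i <= n; i++) print "  " lines[i]
  }'
}

# Sample record from the post above.
printf 'wine\t30mm\t32.20USD\tLooks nice.\rJohn said we better try it.\n' |
  show_comment_lines
```

With the default record separator (\n), awk leaves the \r inside the 4th field, so the record still has four fields and the comment can be split back into its two lines later.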

Zhang Weiwu

Jan 8, 2005, 9:50:14 AM

Zhang Weiwu wrote:
> Hello. I have a very long table of date that I wish to process with my

Just some corrections:
1) I have a very long table of data. (I hope I could have a very long
table of dates:)

2) forgot to say "thanks" in last mail ;)

William Park

Jan 8, 2005, 1:22:11 PM

Zhang Weiwu <zhang...@realss.com> wrote:
> Hello. I have a very long table of date that I wish to process with my
> awk skill.
>
> The question: the data is stored in a very large html <table>. AFAIK I
> cannot feed this data directly to awk to process.
>
> What I need is to convert this <table> html ELEMENT into a text file,
> TAB separated or CSV file, so that awk could work on it. The problem is
> how to do the conversion.

...

> Just an example:
>
> product \t height \t price \t comment \n
> wine \t 30mm \t 32.20USD \t Looks nice.\rJohn said we better try it.\n

So, what input example do you have? Or should we use our telepathic
powers?

john

Jan 8, 2005, 2:41:21 PM

Zhang Weiwu <zhang...@realss.com> wrote in message
news:<34aashF...@individual.net>...

>Hello. I have a very long table of date that I wish to process with my
>awk skill.
>
>The question: the data is stored in a very large html <table>. AFAIK I
>cannot feed this data directly to awk to process.
>
>What I need is to convert this <table> html ELEMENT into a text file,
>TAB separated or CSV file, so that awk could work on it. The problem is
>how to do the conversion.

A general solution that will work for any table is difficult
if not impossible. Consider something like this in gawk:

$> gawk -F'</td>' -v RS='</tr>' -v OFS=',' \
'RT == "</tr>" { for (i = 1; i <= NF; i++) sub(/.*<td[^>]*>/, "", $i);
print $1, $2 }'

will work for the following table, which I think is valid html:

All bets are off with < 100% valid html. So I would suggest
you first make sure that the table code is valid html before
trying to use awk to convert it to some other format.
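
For reference, here is a variation on the one-liner above that emits every cell rather than just the first two, tab separated, and turns <br> inside a cell into \r as the original post proposed. It is a sketch under stated assumptions, not a general HTML parser: the function name table2tsv is made up, tags are assumed to be lowercase <tr>/<td>/<th> with no nested tables, and the whole input is held in memory (so that it runs in any POSIX awk without gawk's RT).

```shell
# Convert one simple HTML table on stdin to tab-separated text on stdout.
# table2tsv is a hypothetical name; assumes lowercase tags and no nested tables.
table2tsv() {
  awk '
    { doc = doc " " $0 }                       # join all input lines into one string
    END {
      nrows = split(doc, rows, /<\/tr>/)       # one piece per table row
      for (r = 1; r <= nrows; r++) {
        ncells = split(rows[r], cells, /<\/t[dh]>/)
        line = ""
        # The piece after the last closing cell tag is not a cell, so stop before it.
        for (c = 1; c < ncells; c++) {
          cell = cells[c]
          sub(/.*<t[dh][^>]*>/, "", cell)              # drop everything up to the opening cell tag
          gsub(/<[bB][rR][ \t]*\/?>/, "\r", cell)      # in-cell line break becomes \r
          gsub(/^[ \t]+|[ \t]+$/, "", cell)            # trim surrounding blanks
          line = line (c > 1 ? "\t" : "") cell
        }
        if (ncells > 1) print line             # skip pieces that contain no cells
      }
    }'
}
```

It could then be used as, for example, table2tsv < yourfile.html | awk -F'\t' '{ print $1, $3 }'. Any markup left inside a cell other than <br> is passed through untouched.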

Steve Calfee

Jan 9, 2005, 1:54:09 AM

On Sat, 08 Jan 2005 22:48:13 +0800, Zhang Weiwu
<zhang...@realss.com> wrote:

>Hello. I have a very long table of date that I wish to process with my
>awk skill.
>
>The question: the data is stored in a very large html <table>. AFAIK I
>cannot feed this data directly to awk to process.
>
>What I need is to convert this <table> html ELEMENT into a text file,
>TAB separated or CSV file, so that awk could work on it. The problem is
>how to do the conversion.
>

HTML is text. However, it is formatted strangely compared to regular
text files.

>Another small problem: There are line-breaks in table-cells, but I don't
>think it will be a problem. I guess I could globally replace these
>line-breaks (s/<br>/\r/g) with \r (I am on Linux, so the correct
>line-break is \n, and I guess awk will think \r is in a field, rather
>than a record separator).
>
>Just an example:
>
>product \t height \t price \t comment \n
>wine \t 30mm \t 32.20USD \t Looks nice.\rJohn said we better try it.\n
>

I don't think you will find tabs in normal html. In most cases
whitespace is irrelevant and not used for formatting. That includes \n
EOLs.

>In this case, there is a line-break inside the 4th field. I hope I
>could work around it by using \r for line-breaks inside fields. Any
>suggestions?

When I wanted to process HTML I used 3 awk programs. The first throws
away the \n characters and changes <BR> to \n. This generates a file
that is more "normal", in which most screen lines correspond to text
lines, e.g.:
function usage()
{
    print "usage: gawk -f htmlxfers.awk < webdump.html "
    print "lineify the html so xfers.awk will extract transfers"
    exit 1
}

BEGIN {
    #print ARGC;
    # validate arguments
    if (ARGC < 1)
        usage()
    RS = "<BR>"
}

{
    gsub(/\n/,"");
    print $0;
}
NOTE: the ARGC < 1 check is boilerplate and will never fail for this
program. However, I try to put in a usage function so that if I type
the program I remember how to start the awk program. Also, the wizards
here can probably do this in 12 characters, but I try to make it clear
so I can change it next year without too much study.

The second awk program processes that file to extract data that may be
on multiple lines of the first file.

The third awk program processes the 2nd output to pretty print a
report based on the data.
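
The first stage above only recognizes an uppercase <BR> and relies on a multi-character RS. As a hedged sketch of the same step, the variant below also accepts <br>, <br/> and <br />, and sticks to the default record separator so it should run in any POSIX awk. The function name lineify is an assumption borrowed from the usage message, and a tag that is itself split across two input lines would not be matched.

```shell
# Stage 1 variant: drop the original newlines and turn each <br> tag into one.
# lineify is a hypothetical name; reads HTML on stdin, writes to stdout.
lineify() {
  awk '{
    gsub(/<[bB][rR][ \t]*\/?>/, "\n")   # <br>, <BR>, <br/> or <br /> become line breaks
    printf "%s", $0                     # print without the original newline
  }
  END { print "" }'                     # terminate the last output line
}
```

It would be run in place of the first program, for example lineify < webdump.html, with the output fed to the second stage.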

Christian Opitz

Jan 9, 2005, 4:02:22 AM

Zhang Weiwu wrote:

> The question: the data is stored in a very large html <table>. AFAIK I
> cannot feed this data directly to awk to process.
>
> What I need is to convert this <table> html ELEMENT into a text file,
> TAB separated or CSV file, so that awk could work on it. The problem
> is how to do the conversion.

Use lynx --dump yourfile.html

cu CO