parse a html file and get url and filename

TingChong

Jan 8, 2004, 5:37:52 PM

I want to parse an HTML file and extract each URL and its filename.
I have written a Tcl program that is only partially correct.

e.g. the html file is:
<td valign="top">(<a
href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/735992D6-3051-4B10-B3F9-30603975224A/0/lecture17.pdf">PDF</a>)
(Resources)</td>
<td valign="top">(<a
href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/751ACB45-D53D-4DB0-AF1F-A3A579B25405/0/lecture18taggernotes.pdf">Notes-PDF</a>)
(<a href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/76FDB3C2-5FB3-46E1-A1C5-1EBB4691D266/0/lecture18taggerslides.pdf">Slides-PDF</a>)
(Resources) </td>

#My tcl program is:
proc parseFile {filename} {
    if {[catch {open $filename RDONLY} input]} {
        puts "Error: open $filename. System returned: $input"
        exit
    }
    while {[gets $input line] >= 0} {
        if {[regexp -- {NR.*(lecture.*pdf)} $line url file]} {
            puts "url $url\nfile $file\n"
        }
    }
    catch {close $input}
}

parseFile [lindex $argv 0]

#the incorrect output is:
url NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/735992D6-3051-4B10-B3F9-30603975224A/0/lecture17.pdf
file lecture17.pdf

url NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/751ACB45-D53D-4DB0-AF1F-A3A579B25405/0/lecture18taggernotes.pdf">Notes-PDF</a>) (<a href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/76FDB3C2-5FB3-46E1-A1C5-1EBB4691D266/0/lecture18taggerslides.pdf
file lecture18taggerslides.pdf

#the correct output should be:
url NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/735992D6-3051-4B10-B3F9-30603975224A/0/lecture17.pdf
file lecture17.pdf

url NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/751ACB45-D53D-4DB0-AF1F-A3A579B25405/0/lecture18taggernotes.pdf
file lecture18taggernotes.pdf

url NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/76FDB3C2-5FB3-46E1-A1C5-1EBB4691D266/0/lecture18taggerslides.pdf
file lecture18taggerslides.pdf

Please help.
Thanks.

Bryan Oakley

Jan 8, 2004, 6:08:55 PM

TingChong wrote:
> I want to parse a html file and then get url and filename.
> I had written a partially correct tcl program.
>
> e.g. the html file is:
> <td valign="top">(<a
> href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/735992D6-3051-4B10-B3F9-30603975224A/0/lecture17.pdf">PDF</a>)
> (Resources)</td>
> <td valign="top">(<a
> href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/751ACB45-D53D-4DB0-AF1F-A3A579B25405/0/lecture18taggernotes.pdf">Notes-PDF</a>)
> (<a href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/76FDB3C2-5FB3-46E1-A1C5-1EBB4691D266/0/lecture18taggerslides.pdf">Slides-PDF</a>)
> (Resources) </td>
>
> #My tcl program is:
> proc parseFile {filename} {
> if {[catch {open $filename RDONLY} input]} {
> puts "Error: open $filename. System returned: $input"
> exit
> }
> while {[gets $input line] >= 0} {
> if {[regexp -- {NR.*(lecture.*pdf)} $line url file]} {

Try changing each .* to .*?, which makes the quantifier non-greedy, so
regexp picks the shortest possible match.
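
A minimal sketch of the non-greedy version, assuming (as the reported
output suggests) that one input line can contain more than one link. The
proc name extractLectureLinks and the shortened paths in the sample line
are made up for illustration. Both quantifiers are changed, and
-all -inline is used so that every match on the line is returned, not
just the first.

```tcl
# Hypothetical helper: return a flat list {url file url file ...}
# with one pair for every lecture PDF link found in a line of HTML.
proc extractLectureLinks {line} {
    # .*? is non-greedy, so each match stops at the first "pdf" that
    # follows "lecture" instead of running on to the last one in the line.
    # -all -inline returns the full match and the capture for each hit.
    return [regexp -all -inline -- {NR.*?(lecture.*?pdf)} $line]
}

# Sample line with two links; the paths are shortened placeholders.
set line {(<a href="/NR/a/0/lecture18taggernotes.pdf">Notes-PDF</a>) (<a href="/NR/b/0/lecture18taggerslides.pdf">Slides-PDF</a>)}

foreach {url file} [extractLectureLinks $line] {
    puts "url $url\nfile $file\n"
}
```

In parseFile, the foreach loop above would replace the single
if {[regexp ...]} test inside the while loop.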

You could also try a different pattern, perhaps {href="([^"]+)"}. This
assumes you have input with proper matching quotes, and the quotes are
always double quotes. It's possible to make more robust patterns to deal
with those matters.
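
A rough sketch of how the href pattern might be used, assuming the whole
file has been read into one string so that a tag split across lines is
not a problem. The proc name extractHrefs and the sample HTML are
hypothetical, and only double-quoted attribute values are handled.

```tcl
# Hypothetical helper: return every double-quoted href value in $html.
proc extractHrefs {html} {
    set urls {}
    # The bracket expression cannot cross the closing quote, so no
    # non-greedy quantifier is needed. -nocase also accepts HREF=.
    # Single-quoted or unquoted attribute values are not handled.
    foreach {match url} [regexp -all -inline -nocase -- {href="([^"]+)"} $html] {
        lappend urls $url
    }
    return $urls
}

# Sample input; the paths are shortened placeholders.
set html {<td>(<a
href="/NR/a/0/lecture17.pdf">PDF</a>) (<a href="/NR/b/0/lecture18.pdf">Notes</a>)</td>}

foreach url [extractHrefs $html] {
    # file tail keeps only the last path component, i.e. the filename.
    puts "url [string trimleft $url /]\nfile [file tail $url]\n"
}
```

Unlike the NR...lecture pattern, this returns every link in the input,
so a filter such as string match *.pdf may still be needed.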

Gerald Lester

Jan 8, 2004, 7:33:52 PM

TingChong wrote:

> I want to parse a html file and then get url and filename.
> I had written a partially correct tcl program.

Did you look at the htmlparse package in tcllib?

--
+--------------------------------+---------------------------------------+
| Gerald W. Lester | "The man who fights for his ideals is |
| Gerald...@cox.net | the man who is alive." -- Cervantes |
+--------------------------------+---------------------------------------+

Michael A. Cleverly

Jan 8, 2004, 11:28:19 PM

On 8 Jan 2004, TingChong wrote:

> I want to parse a html file and then get url and filename.
> I had written a partially correct tcl program.

You might be interested in the ns_hrefs command in the nstcl package,
available at nstcl.sourceforge.net.

foreach url [ns_hrefs $html] {
    set filename [file tail $url]
    # do whatever with $url & $filename ...
}

Michael

David N. Welton

Jan 9, 2004, 4:05:08 AM

Gerald Lester <Gerald...@cox.net> writes:

> TingChong wrote:
>
> > I want to parse a html file and then get url and filename.
> > I had written a partially correct tcl program.
>
> Did you look at the htmlparse package in tcllib?

Yes, this is the correct thing to do. Parsing HTML with regular
expressions is a recipe for trouble unless you are sure the HTML won't
change.

--
David N. Welton
Consulting: http://www.dedasys.com/
Personal: http://www.dedasys.com/davidw/
Free Software: http://www.dedasys.com/freesoftware/
Apache Tcl: http://tcl.apache.org/