For example, the HTML file is:
<td valign="top"><span class="tablecopy">(<a
href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/735992D6-3051-4B10-B3F9-30603975224A/0/lecture17.pdf">PDF</a>)
(Resources)</span></td>
<td valign="top"><span class="tablecopy">(<a
href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/751ACB45-D53D-4DB0-AF1F-A3A579B25405/0/lecture18taggernotes.pdf">Notes-PDF</a>)
(<a href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/76FDB3C2-5FB3-46E1-A1C5-1EBB4691D266/0/lecture18taggerslides.pdf">Slides-PDF</a>)
(Resources)</span> </td>
# My Tcl program is:
proc parseFile {filename} {
    if {[catch {open $filename RDONLY} input]} {
        puts "Error: open $filename. System returned: $input"
        exit
    }
    while {[gets $input line] >= 0} {
        if {[regexp -- {NR.*(lecture.*pdf)} $line url file]} {
            puts "url $url\nfile $file\n"
        }
    }
    catch {close $input}
}
parseFile [lindex $argv 0]
# The incorrect output is:
url NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/735992D6-3051-4B10-B3F9-30603975224A/0/lecture17.pdf
file lecture17.pdf
url NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/751ACB45-D53D-4DB0-AF1F-A3A579B25405/0/lecture18taggernotes.pdf">Notes-PDF</a>)
(<a href="/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/76FDB3C2-5FB3-46E1-A1C5-1EBB4691D266/0/lecture18taggerslides.pdf
file lecture18taggerslides.pdf
# The correct output should be:
url NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/735992D6-3051-4B10-B3F9-30603975224A/0/lecture17.pdf
file lecture17.pdf
url NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/751ACB45-D53D-4DB0-AF1F-A3A579B25405/0/lecture18taggernotes.pdf
file lecture18taggernotes.pdf
url NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-170Laboratory-in-Software-EngineeringFall2001/76FDB3C2-5FB3-46E1-A1C5-1EBB4691D266/0/lecture18taggerslides.pdf
file lecture18taggerslides.pdf
Please help.
Thanks.
Try changing each .* to .*?, which makes the quantifier non-greedy, so
regexp picks the shortest possible match instead of the longest.
You could also try a different pattern, perhaps {href="([^"]+)"}. This
assumes the input has properly matched quotes and that the quotes are
always double quotes. More robust patterns can be written to handle
those cases.
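As a rough sketch of both suggestions: a plain regexp call reports only the first match in a string, so when two links arrive in one string (as the incorrect output above suggests), regexp -all -inline is needed in addition to the non-greedy quantifiers. The proc names linksNonGreedy and linksByHref and the shortened sample line are made up for illustration.

```tcl
# Hypothetical helper: collect url/file pairs using non-greedy quantifiers.
proc linksNonGreedy {line} {
    # -all -inline returns every match as a flat list:
    # fullMatch capture fullMatch capture ...
    return [regexp -all -inline -- {NR.*?(lecture.*?pdf)} $line]
}

# Hypothetical helper: collect url/file pairs from href attributes.
# Assumes double-quoted attribute values, as noted above.
proc linksByHref {line} {
    set result {}
    foreach {match url} [regexp -all -inline -- {href="/?([^"]+)"} $line] {
        # file tail keeps only the last path component.
        lappend result $url [file tail $url]
    }
    return $result
}

# Shortened, made-up sample line with two links, for illustration only.
set line {(<a href="/NR/a/0/lecture18notes.pdf">Notes-PDF</a>) (<a href="/NR/b/0/lecture18slides.pdf">Slides-PDF</a>)}

foreach {url file} [linksNonGreedy $line] {
    puts "url $url\nfile $file\n"
}
foreach {url file} [linksByHref $line] {
    puts "url $url\nfile $file\n"
}
```

The href variant matches every link in the input, not only the lecture PDFs, so real input may need an extra filter on the file name.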
> I want to parse a html file and then get url and filename.
> I had written a partially correct tcl program.
Did you look at the htmlparse package in tcllib?
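For reference, a minimal sketch of what that might look like, assuming tcllib is installed. ::htmlparse::parse calls the -cmd callback once per tag with four arguments (tag name, slash, raw attribute text, trailing text). The callback name collectHref, the global hrefs list, and the sample HTML are assumptions made for illustration.

```tcl
package require htmlparse

set hrefs {}

# Hypothetical callback: ::htmlparse::parse appends the tag name, the
# slash (empty for an opening tag), the raw attribute text, and the
# text that follows the tag.
proc collectHref {tag slash param text} {
    global hrefs
    # Only opening <a> tags carry the href we want.
    if {[string equal -nocase $tag a] && $slash eq ""} {
        # The attribute text is unparsed, so a small pattern pulls out
        # href. Assumes a double-quoted value.
        if {[regexp -nocase -- {href\s*=\s*"([^"]*)"} $param -> url]} {
            lappend hrefs $url
        }
    }
}

# Made-up sample input, for illustration only.
set html {<td><span class="tablecopy">(<a href="/NR/a/0/lecture17.pdf">PDF</a>)</span></td>}

::htmlparse::parse -cmd collectHref $html

foreach url $hrefs {
    puts "url $url\nfile [file tail $url]\n"
}
```

The tag splitting is left to the package here, so links are found wherever the line breaks fall in the file.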
--
Gerald W. Lester
> I want to parse a html file and then get url and filename.
> I had written a partially correct tcl program.
You might be interested in the ns_hrefs command in the nstcl package,
available at nstcl.sourceforge.net.
foreach url [ns_hrefs $html] {
    set filename [file tail $url]
    # do whatever with $url & $filename ...
}
Michael
> TingChong wrote:
>
> > I want to parse a html file and then get url and filename.
> > I had written a partially correct tcl program.
>
> Did you look at the htmlparse package in tcllib?
Yes, this is the correct thing to do. Parsing HTML with regular
expressions is a recipe for trouble unless you are sure the HTML won't
change.
--
David N. Welton