Hello,
I have an HTML page with some links in a standard HTML format, such as:
index.html:
<!-- start links -->
<a href="noix_de_muscade.html">
<img src="nutmeg.jpg">
Noix de muscade</a>
<a href="poivre.html">
<img src="pepper.jpg">
Poivre</a>
<a href="grains_de_cafe.html">
<img src="coffee_beans.jpg">
Grains de café</a>
<!-- end links -->
I would like to read the link values that appear in the segment of the
document between the comments <!-- start links --> and <!-- end links -->,
ignoring other links that may appear elsewhere above and below that segment.
The relevant parts are the strings from:
<a href="
until:
"
... so only up to the first double quote (") that follows, and not
necessarily including a closing bracket (">), as some links can look a bit
different (as in <a href="green_coffee.jpg" onclick="...etc" >).
What regex can be used to extract these strings, in the same order of
appearance as in the original HTML file?
Afterwards, they simply need to be printed as a new (JS) array, like this:
("noix_de_muscade.html",
"poivre.html",
"grains_de_cafe.html");
linkextract.pl:
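#!/usr/bin/perl
use strict;
use warnings;

# read the whole file into $data
open(my $fh, '<', 'index.html') or die "Cannot open index.html: $!";
my $data = do { local $/; <$fh> };
close($fh);

# capture part between <!-- start links --> and <!-- end links -->
# extract the parts between each <a href=" and the
# first " double quote that follows each
# place in array in order of appearance

print $data;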
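For what it's worth, here is a rough sketch of how the commented steps might be filled in. This is only one possible approach, not a tested solution. The sub names extract_links and format_links, the inline sample page, and the outside.html link are made up for illustration. It assumes href is always the first attribute after <a and that the value is always double-quoted, as in the examples above. For a real page, the contents of index.html would be read into $html instead of the inline sample.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the href values found between the two
# marker comments, in order of appearance (call it in list context).
sub extract_links {
    my ($html) = @_;

    # Step 1: isolate the segment between the marker comments.
    # /s lets . match newlines; .*? stops at the first end marker.
    my ($segment) = $html =~ /<!-- start links -->(.*?)<!-- end links -->/s
        or return;

    # Step 2: capture everything after <a href=" up to the next double
    # quote. With /g in list context, all captures come back in order.
    return $segment =~ /<a\s+href="([^"]*)"/gi;
}

# Hypothetical helper: formats the values like the desired output above.
sub format_links {
    my @links = @_;
    return '(' . join(",\n", map { sprintf('"%s"', $_) } @links) . ');';
}

# Inline sample (an assumption); the first link sits outside the markers
# and should be skipped.
my $html = <<'HTML';
<a href="outside.html">Outside</a>
<!-- start links -->
<a href="noix_de_muscade.html">
<img src="nutmeg.jpg">
Noix de muscade</a>
<a href="green_coffee.jpg" onclick="...etc" >Green coffee</a>
<!-- end links -->
HTML

print format_links(extract_links($html)), "\n";
```

The character class [^"]* is what stops each capture at the first double quote, so extra attributes such as onclick do not end up in the result. A regex like this is fine for a page whose markup is under one's own control; for arbitrary HTML a real parser would be more robust.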
Many thanks for any advice and ideas.
Tuxedo