
Read and return <a href="?" values


Tuxedo

Sep 26, 2022, 1:09:58 AM
Hello,

I have an HTML page with some links in a standard HTML format, such as:

index.html:

<!-- start links -->

<a href="noix_de_muscade.html">
<img src="nutmeg.jpg>
Noix de muscate</a>

<a href="poivre">
<img src="pepper.jpg">
Poivre</a>

<a href="grains_de_cafe.html">
<img src="coffee_beans.jpg>
Grains de café</a>

<!-- end links -->


I would like to read the link values appearing in the segment of the
document between the comments <!-- start links --> and <!-- end links -->,
avoiding other links that may appear elsewhere above and below that segment.

The relevant parts are the strings from:

<a href="

until:

"

... so only until each first occurrence of a double quote " and not
necessarily including a closing bracket (">) as some links can appear a bit
different (as in <a href="green_coffee.jpg" onclick="...etc" >).

What regex can be used to extract these strings?

And in their same order of appearance as in the original HTML file.

Afterwards, they simply need to be printed as a new (JS) array, like this:

("noix_de_muscade.html",
"poivre.html",
"grains_de_cafe.html");


linkextract.pl:

#!/usr/bin/perl -w

open (data, "index.html");

# capture part between <!-- start links --> and <!-- end links -->

# extract the parts between each <a href=" and the
# first " double quote that follows each

# place in array in order of appearance

}

print $data;


Many thanks for any advice and ideas.

Tuxedo

Andrzej Adam Filip

Sep 26, 2022, 4:29:18 AM
0. Your HTML file is missing the closing quotes in the src attributes of the img tags.
1. You may use the HTML::TokeParser module or the more intuitive
HTML::TokeParser::Simple:

use strict;
use warnings;
use utf8;

use IO::HTML;
use HTML::TokeParser::Simple;

# Make STDOUT UTF-8 encoded
binmode(STDOUT, ':utf8');

# html_file - autodetect encoding of the HTML file
my $p = HTML::TokeParser::Simple->new( html_file('x.html') );

my $n;
my @hrefs; # array to store the detected hrefs
my $in_block;
while ( my $token = $p->get_token ) {
    if ( not $in_block ) {
        # skip everything until the start comment is seen
        if ( $token->is_comment() and $token->as_is() =~ /^<!-- start links -->$/ ) {
            $in_block = 1;
        }
        next;
    } elsif ( $token->is_start_tag('a') and defined($token->get_attr('href')) ) {
        printf "%d: %s\n", ++$n, $token->get_attr('href');
        push( @hrefs, $token->get_attr('href') );
    } elsif ( $token->is_comment() and $token->as_is() =~ /^<!-- end links -->$/ ) {
        # stop at the end comment
        last;
    }
}

--
[Andrew] Andrzej A. Filip

Shvili, the Kookologist

Sep 26, 2022, 5:20:24 AM
On Mon, 26 Sep 2022 06:59:51 +0200, in article <tgrc6u$43sr$1...@solani.org>, Tuxedo
wrote:
That is an extremely old form of Perl. Slightly modernised, it would be:

#!/usr/bin/perl

use strict;
use warnings;

open (my $data, "index.html")
    or die "Couldn't open 'index.html' for reading: $!";



> # capture part between <!-- start links --> and <!-- end links -->

my @links;

while (<$data>) {
    if (/<!-- start links -->/ .. /<!-- end links -->/) {
        ...;
    }
}

> # extract the parts between each <a href=" and the
> # first " double quote that follows each

/<a href=("[^"]*")/

> # place in array in order of appearance

push @links, $1

> print $data;

print '(', join(', ', @links), ')';

Putting these snippets together gives us:


#!/usr/bin/perl

use strict;
use warnings;

open (my $data, "index.html")
    or die "Couldn't open 'index.html' for reading: $!";

my @links;

while (<$data>) {
    if (/<!-- start links -->/ .. /<!-- end links -->/) {
        push @links, $1 if /<a href=("[^"]*")/;
    }
}

print '(', join(', ', @links), ')';
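
For anyone who wants to try the range-operator approach without an index.html on disk, here is a minimal sketch that runs the same loop over an in-memory string. The sample markup, including the links placed outside the marked segment and the closed src quotes, is made up for illustration, and the output is joined with ",\n" and a trailing ");" only to resemble the JS array layout asked for in the question.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical sample page: one link above and one below the marked
# segment, which the loop is expected to ignore.
my $html = <<'HTML';
<a href="elsewhere_above.html">Above</a>
<!-- start links -->
<a href="noix_de_muscade.html">
<img src="nutmeg.jpg">
Noix de muscade</a>
<a href="poivre.html">
<img src="pepper.jpg">
Poivre</a>
<!-- end links -->
<a href="elsewhere_below.html">Below</a>
HTML

# Open the string as an in-memory file so the loop reads it line by line
open(my $data, '<', \$html)
    or die "Couldn't open in-memory file: $!";

my @links;

while (<$data>) {
    # The range operator is true from the start comment through the end comment
    if (/<!-- start links -->/ .. /<!-- end links -->/) {
        # Capture the href value together with its double quotes
        push @links, $1 if /<a href=("[^"]*")/;
    }
}
close $data;

# One entry per line, closed with ");" as in the requested array layout
print '(', join(",\n", @links), ");\n";
```

With this sample the two links outside the comments are skipped and the two inside are printed in document order.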

Eli the Bearded

Sep 26, 2022, 2:16:27 PM
In comp.lang.perl.misc,
Shvili, the Kookologist <kooks-an...@kookology.invalid> wrote:

Kookologist you say?

> Tuxedo wrote:
> > I have an HTML page with some links in a standard HTML format, such as:
> > <a href="noix_de_muscade.html">
> > <img src="nutmeg.jpg>
> > Noix de muscate</a>

Oddly formatted HTML (or, worse, malformed like that missing quote) will
be the bane of your existence using regexps to parse HTML.

> /<a href=("[^"]*")/

<a class="linkmain" href="page1.html">
<a href='page2.html'>
<A HREF="page3.html">
<a href=page4.html>
<a href = "page5.html">
<a
href="page6.html">
<!-- <a href="[[placeholder]]"> -->

And that doesn't begin to cover the malformed HTML.

See the TokeParser answer from Andrzej Adam Filip for a better way.
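
As a rough check, the following sketch runs the quoted pattern line by line over the variants listed above. The heredoc and the variable names are only illustrative, and the per-line loop is an assumption that mirrors how the posted script reads its input.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# The link variants from the list above, one per line as a file would hold them
my $html = <<'HTML';
<a class="linkmain" href="page1.html">
<a href='page2.html'>
<A HREF="page3.html">
<a href=page4.html>
<a href = "page5.html">
<a
href="page6.html">
<!-- <a href="[[placeholder]]"> -->
HTML

my @links;
for my $line (split /\n/, $html) {
    # Same pattern as in the quoted solution
    push @links, $1 if $line =~ /<a href=("[^"]*")/;
}

print '(', join(', ', @links), ")\n";
# prints: ("[[placeholder]]")
```

None of the six real links is captured, and the only match is the commented-out placeholder.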

Elijah
------
just because it looks easy doesn't mean it is

Tuxedo

Sep 27, 2022, 12:12:38 AM
Thank you for sharing this solution.

In this case, which is pre-processing HTML in a page edit situation and not
for each and every web request, I think I will use the procedure by Shvili
the Kookologist in the next post, mainly to avoid keeping track of
additional modules. I've not used HTML::TokeParser but it appears useful for
things that require more flexibility.

Tuxedo

Tuxedo

Sep 27, 2022, 12:21:24 AM
Thank you for posting this old and modernised yet perfectly working
solution.

I will use it to capture links in the order they appear on an overview (index)
page while keeping navigational links between the individual pages in sync
without manually having to update the same links in a separate (JS) script.

It's an HTML pre-publication process.

Tuxedo

Tuxedo

Sep 27, 2022, 12:25:58 AM
Thanks for pointing this out. I didn't see the malformed HTML. It would
certainly break the regex procedure and perhaps also fail as an HTML link.
Normally, I type a bit better :-)

Tuxedo