Read and return <a href="?" values

Tuxedo

Sep 26, 2022, 1:09:58 AM

Hello,

I have an HTML page with some links in a standard HTML format, such as:

index.html:



<a href="noix_de_muscade.html">
<img src="nutmeg.jpg>
Noix de muscate</a>

<a href="poivre">
<img src="pepper.jpg">
Poivre</a>

<a href="grains_de_cafe.html">
<img src="coffee_beans.jpg>
Grains de café</a>



I would like to read the link values that appear in the segment of the document between the comments ( and ), while ignoring any
links that may appear above or below that segment.

The relevant parts are the strings from:

<a href="

until:

"

... so only until each first occurrence of a double quote " and not
necessarily including a closing bracket (">) as some links can appear a bit
different (as in <a href="green_coffee.jpg" onclick="...etc" >).

What regex can be used to extract these strings?

They should come out in the same order of appearance as in the original HTML file.

Afterwards, they simply need to be printed as a new (JS) array, like this:

("noix_de_muscade.html",
"poivre.html",
"grains_de_cafe.html");

linkextract.pl:

#!/usr/bin/perl -w

open (data, "index.html");

# capture part between  and 

# extract the parts between each <a href=" and the
# first " double quote that follows each

# place in array in order of appearance

}

print $data;

Many thanks for any advice and ideas.

Tuxedo

Andrzej Adam Filip

Sep 26, 2022, 4:29:18 AM

0. Your HTML file is missing the closing quotes in the src attributes of some img tags.
1. You may use the HTML::TokeParser module or the more intuitive
HTML::TokeParser::Simple:

use strict;
use warnings;
use utf8;

use IO::HTML;
use HTML::TokeParser::Simple;

# Make STDOUT utf8 encoded
binmode(STDOUT,':utf8');

# html_file - autodetect encoding of the html file
my $p = HTML::TokeParser::Simple->new( html_file('x.html') );

my $n;
my @hrefs; # array to store detected href
my $in_block;
while ( my $token = $p->get_token ) {
    if( not $in_block ) {
        if( $token->is_comment() and $token->as_is() =~ /^$/ ) {
            $in_block = 1;
        }
        next;
    }elsif( $token->is_start_tag('a') and defined($token->get_attr('href'))){
        printf "%d: %s\n", ++$n, $token->get_attr('href');
        push( @hrefs, $token->get_attr('href'));
    }elsif( $token->is_comment() and $token->as_is() =~ /^$/ ) {
        last;
    }
}
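
Once the hrefs are collected, turning them into the JavaScript-style array from the original question is mostly a matter of quoting and joining. A minimal sketch, assuming the collected values look like the three sample strings below; the helper name js_array is hypothetical and not part of any module, and the escaping only covers backslashes and double quotes:

```perl
use strict;
use warnings;

# Hypothetical helper: format a list of href values as a JS-style array literal.
sub js_array {
    my @hrefs = @_;
    my @quoted = map {
        (my $s = $_) =~ s/(["\\])/\\$1/g;   # escape backslashes and double quotes
        qq("$s");
    } @hrefs;
    return '(' . join(",\n", @quoted) . ');';
}

# Sample values standing in for the contents of @hrefs.
my @sample = ('noix_de_muscade.html', 'poivre.html', 'grains_de_cafe.html');
print js_array(@sample), "\n";
```

In the script above, calling js_array(@hrefs) after the loop would print the list in document order, since push keeps the order in which the tokens were seen.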

--
[Andrew] Andrzej A. Filip

Shvili, the Kookologist

Sep 26, 2022, 5:20:24 AM

On Mon, 26 Sep 2022 06:59:51 +0200, in article <tgrc6u$43sr$1...@solani.org>, Tuxedo
wrote:

That is an extremely old form of Perl. Slightly modernised, it would be:

#!/usr/bin/perl

use strict;
use warnings;

open (my $data, "index.html")
    or die "Couldn't open 'index.html' for reading: $!";

> # capture part between  and

my @links;

while (<$data>) {
    if (// .. //) {
        ...;

    }
}

> # extract the parts between each <a href=" and the
> # first " double quote that follows each

/<a href=("[^"]*")/

> # place in array in order of appearance

push @links, $1

> print $data;

print '(', join(', ', @links), ')';

Putting these snippets together, gives us:

#!/usr/bin/perl

use strict;
use warnings;

open (my $data, "index.html")
    or die "Couldn't open 'index.html' for reading: $!";

my @links;

while (<$data>) {
    if (// .. //) {
        push @links, $1 if /<a href=("[^"]*")/;
    }
}

print '(', join(', ', @links), ')';
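
The range (flip-flop) operator used in the loop turns true on the line that matches its first pattern and stays true through the line that matches its second. A minimal sketch of that behaviour on an in-memory document, assuming hypothetical marker comments `<!-- start -->` and `<!-- end -->` (placeholders only, not the markers from the original page) and links that each fit on a single line:

```perl
use strict;
use warnings;

# Sample document; the marker comments are hypothetical placeholders.
my $html = <<'HTML';
<a href="outside_before.html">Before</a>
<!-- start -->
<a href="noix_de_muscade.html">Noix de muscade</a>
<a href="poivre.html">Poivre</a>
<!-- end -->
<a href="outside_after.html">After</a>
HTML

my @links;
open(my $fh, '<', \$html) or die "Couldn't open in-memory file: $!";
while (<$fh>) {
    # True from the start marker line through the end marker line, inclusive.
    if (/<!-- start -->/ .. /<!-- end -->/) {
        push @links, $1 if /<a href=("[^"]*")/;
    }
}
close $fh;

print '(', join(', ', @links), ')', "\n";
```

The two links outside the markers are skipped, and the captured values keep their surrounding double quotes because the quotes sit inside the capture group.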

Eli the Bearded

Sep 26, 2022, 2:16:27 PM

In comp.lang.perl.misc,
Shvili, the Kookologist <kooks-an...@kookology.invalid> wrote:

Kookologist you say?

> Tuxedo wrote:
> > I have an HTML page with some links in a standard HTML format, such as:

> > <a href="noix_de_muscade.html">
> > < img src="nutmeg.jpg>
> > Noix de muscate</a>

Oddly formatted HTML (or, worse, malformed like that missing quote) will
be the bane of your existence using regexps to parse HTML.

> /<a href=("[^"]*")/

None of these links, all of which a browser would accept, would be matched by that pattern:

<a class="linkmain" href="page1.html">
<a href='page2.html'>
<A HREF="page3.html">
<a href=page4.html>
<a href = "page5.html">
<a
href="page6.html">


And that doesn't begin to cover the malformed HTML.

See the TokeParser answer from Andrzej Adam Filip for a better way.
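
A pattern that tolerates the variations listed above gets noticeably longer. The following is only a sketch, assuming the whole segment has already been read into one string; the names $segment and $href_re are illustrative, and it still goes wrong on cases such as a ">" inside another attribute value, attributes like data-href, or links inside comments:

```perl
use strict;
use warnings;

my $segment = <<'HTML';
<a class="linkmain" href="page1.html">
<a href='page2.html'>
<A HREF="page3.html">
<a href=page4.html>
<a href = "page5.html">
<a
href="page6.html">
HTML

# Case-insensitive, allows other attributes before href, optional spaces
# around "=", and double-quoted, single-quoted, or unquoted values.
my $href_re = qr{
    <a\b [^>]*? \b href \s* = \s*
    (?: "([^"]*)" | '([^']*)' | ([^\s>]+) )
}xis;

my @links;
while ($segment =~ /$href_re/g) {
    # Exactly one of the three capture groups is defined per match.
    push @links, defined $1 ? $1 : defined $2 ? $2 : $3;
}

print join("\n", @links), "\n";
```

Matching against a single string rather than line by line is what lets the tag split across two lines be found.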

Elijah
------
just because it looks easy doesn't mean it is

Tuxedo

Sep 27, 2022, 12:12:38 AM

Thank you for sharing this solution.

In this case, which is pre-processing HTML in a page edit situation and not
for each and every web request, I think I will use the procedure by Shvili
the Kookologist in the next post, mainly to avoid keeping track of
additional modules. I've not used HTML::TokeParser but it appears useful for
things that require more flexibility.

Tuxedo

Tuxedo

Sep 27, 2022, 12:21:24 AM

Thank you for posting this old and modernised yet perfectly working
solution.

I will use it to capture links in the order they appear on an overview (index)
page while keeping navigational links between the individual pages in sync
without manually having to update the same links in a separate (JS) script.

It's an HTML pre-publication process.

Tuxedo

Tuxedo

Sep 27, 2022, 12:25:58 AM

Thanks for pointing this out. I didn't see the malformed HTML. It would
certainly break the regex procedure and perhaps also fail as an HTML link.
Normally, I type a bit better :-)

Tuxedo