Feature request - Mojo::DOM, have a line number for each element?


Ekki Plicht

Jul 22, 2016, 3:44:45 AM
to Mojolicious
I use Mojo::DOM for various web scraping and analysis, very easy, very fast, nice.

Usually I am interested in only a few tags, not the entire dom. So I use ->find() to select the interesting nodes, check some facts on the found nodes and store the results in a database for later viewing.

For this later viewing I would love to retain the sequence in which the nodes are in the source. Unfortunately all information about the sequence of tags is lost when I use ->find().

The parser I used before (HTML::HTML5::Parser) provides a line-number function for each element. That is enough for me to retain the sequence of nodes; the absolute position is not important.

Do you think it would be possible to extend Mojo::DOM to provide a line number for each element? I understand that this might be insufficient where many tags are on the same line, but that's too bad then...

TIA,
Ekki




Scott Wiersdorf

Jul 22, 2016, 10:28:07 AM
to Mojolicious
You can use map() to do that:

$dom->find('div')->map(sub { state $i = 0; say $i++ . " $_" });

Scott

Ekki Plicht

Jul 22, 2016, 1:24:17 PM
to Mojolicious

On Friday, July 22, 2016 at 16:28:07 UTC+2, Scott Wiersdorf wrote:
> You can use map() to do that:
>
> $dom->find('div')->map(sub { state $i = 0; say $i++ . " $_" });

Right, that would give me the proper sequence for all <div>s. 
And then I would have another sequence for all <h1>s, and another for all <td>s and another for all <p>s, and so on.

What I need is one sequence which gives me the right order of all tags I am looking at.

Cheers,
Ekki

Dan Book

Jul 22, 2016, 2:51:06 PM
to mojol...@googlegroups.com
You could try something like this...

$dom->find('*')->map(sub { state $i = 0; $_->{_myapp_counter} = $i++ });

It would add an attribute to every tag but that may or may not be a problem for your application.

Alternatively you could go through $dom->find('*') in order and test $_->tag or other methods to collect the tags in the order you want. This would only work if your criteria are simple enough that you don't really need the CSS selector to find them.
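
A minimal sketch of the counter idea combined with ordinary selectors, under the assumption that find('*') yields elements in the order they appear in the source (worth verifying on your own input). The sample markup and the 'data-seq' attribute name are made up for illustration.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample markup, invented for illustration
my $dom = Mojo::DOM->new('<h2>1</h2><p>foo</p><h2>2</h2><p>bar</p>');

# Tag every element with a counter in the order find('*') yields them.
# 'data-seq' is a hypothetical attribute name.
my $i = 0;
$dom->find('*')->each(sub { $_->{'data-seq'} = $i++ });

# Run the selectors separately, then merge the hits by counter
my @nodes = sort { $a->{'data-seq'} <=> $b->{'data-seq'} }
  map { $dom->find($_)->each } qw(h2 p);

# Print tag name and counter for each hit, in merged order
say join ' ', map { $_->tag . '#' . $_->{'data-seq'} } @nodes;
```

The counter lives in the DOM as an attribute, so it shows up if the document is serialized again; for a read-only scraper that is probably harmless.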


Ekki Plicht

Jul 24, 2016, 5:01:20 PM
to Mojolicious
On Friday, July 22, 2016 at 20:51:06 UTC+2, Dan Book wrote:
> You could try something like this...
>
> $dom->find('*')->map(sub { state $i = 0; $_->{_myapp_counter} = $i++ });

Nice idea, but it wouldn't work.
A doc like
<h2>1</h2>
   <p>bar</p>
<h2>2</h2>
  <p>bar</p>

would result in the numbering
1: 1st h2
2: 2nd h2
3: 1st p
4: 2nd p

whereas the actual sequence is 1 3 2 4. And that's all I want: the sequence in which the elements showed up in the original HTML source.

> Alternatively you could go through $dom->find('*') in order and test $_->tag or other methods to collect the tags in the order you want. This would only work if your criteria are simple enough that you don't really need the CSS selector to find them.


I thought about this, but it would render all use of selectors futile and would probably take much longer. I would have to loop through all elements, check each one for a match (out of maybe a dozen very simple selectors) and then work on that element. I found this concept so inelegant that I went back to HTML::HTML5::Parser, which has a source_line() function. OK, the penalty is having to deal with XML, XPath and so on. Mojo::DOM is so much easier to use in most other cases...
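
For reference, a minimal sketch of a single-pass variant that keeps the selectors, assuming a comma-separated selector group returns its hits in source order and using matches() to tell which selector applied. The %checks table and the sample markup are hypothetical.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Sample markup, invented for illustration
my $dom = Mojo::DOM->new('<h2>1</h2><p>foo</p><h2>2</h2><p>bar</p>');

# Hypothetical table: selector => label for the check to run
my %checks = (h2 => 'heading', p => 'paragraph');
my @selectors = sort keys %checks;

# One find() with a selector group, then ask each hit which selector it matched
my @hits;
for my $e ($dom->find(join ', ', @selectors)->each) {
  my ($sel) = grep { $e->matches($_) } @selectors;
  push @hits, "$checks{$sel}:" . $e->text;
}

say for @hits;
```

With a dozen simple selectors the extra matches() calls per hit should be cheap, though that has not been measured here.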

Regards,
Ekki

sri

Jul 25, 2016, 3:32:07 PM
to Mojolicious
> The parser I used before (HTML::HTML5::Parser) provides a line-number function for each element. That is enough for me to retain the sequence of nodes; the absolute position is not important.

What if the DOM tree changes, like a newline gets added somewhere above, and the line number changes as well? Are there any specs that cover this topic?

--
sebastian 

Ekki Plicht

Jul 26, 2016, 5:38:23 AM
to Mojolicious
On Monday, July 25, 2016 at 21:32:07 UTC+2, sri wrote:
> > The parser I used before (HTML::HTML5::Parser) provides a line-number function for each element. That is enough for me to retain the sequence of nodes; the absolute position is not important.
>
> What if the DOM tree changes, like a newline gets added somewhere above, and the line number changes as well? Are there any specs that cover this topic?

Hello sri,

My application is a kind of verification service tailored to my company's needs. It runs daily as a cron job to check for changes, so changes are fine. Its purpose is to point out shortcomings and errors in the HTML source, like "The alt attribute in file foo.html, at line 13, column 14 is empty." or "The filename in the 'img' tag in file bar.html, line 23, column 42 does not correspond to company standards" and so on.

Of course errors are fixed someplace else (in a database), but having filename + line/column number helps the maintainer of the website to identify the correct item in question. The question why I am checking the HTML output and not the database input is certainly valid and can only be explained by historical reasons, it's just something I have to live with for the moment.

No, there is no standard for this topic, AFAICS.

It's no big deal and I just asked because I found the source_line() function in HTML::HTML5::Parser.

Thanks for Mojolicious in any case, it's a very nice tool which I use a lot :)

Cheers,
Ekki

sri

Jul 26, 2016, 9:20:40 AM
to Mojolicious
> It's no big deal and I just asked because I found the source_line() function in HTML::HTML5::Parser.

Turns out HTML::HTML5::Parser actually uses XML::LibXML for parsing, which only activates line numbers if the parser has been initialized with a special option, and only up to 65535 lines, so it's a rather ugly hack.

I very much doubt that this feature will ever find its way into Mojo::DOM.

--
sebastian 