If you don't care about getting it 100% correct for all pages, just use
regexes.
E.g. cl-ppcre library.
It usually works very well when you need to scrap a certain site, not "sites
in general".
> Is there a library available to parse HTML ? I need to extract certain
> tags like links and images from the body.
Use the Drakma client:
http://weitz.de/drakma/
Or net.html.parser:parse-html; see this older post for a simple example:
http://groups.google.no/group/comp.lang.lisp/msg/cda1a24ac3b50a43
There are several other options available:
http://www.cliki.net/HTML
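As a quick sketch of the fetching side (example URL only):

(drakma:http-request "http://www.example.com/")
;; returns the response body -- a string for text content types -- plus
;; the status code, headers, etc. as additional values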
Petter
--
.sig removed by request.
Thanks.
Closure HTML (http://common-lisp.net/project/closure/closure-html/)
seems pretty simple to use. Any other suggestions welcome ...
First I tried Python's Beautiful Soup because I've read many people say it's
quite good. It choked on the first complicated page I tried to parse, so I
gave that up.
Then I tried closure html (:closure-common :closure-html :cxml-stp) and this
was able to parse malformed HTML much better than beautiful soup, but it
wasn't perfect.
Then I tried using the following Ruby packages,
hpricot
mechanize
open-uri
iconv
and this combination proved the most robust and productive of the three HTML
parsing platforms for me. If you are trying to parse misshapen html I highly
recommended hpricot, even if you don't know Ruby. It saved me a great deal
of time.
> V> Is there a library available to parse HTML ? I need to extract certain
> V> tags like links and images from the body.
>
> If you don't care about getting it 100% correct for all pages, just
> use regexes.
> E.g. cl-ppcre library.
This is one of the worst pieces of advice i've ever seen given on this
newsgroup. Please do not attempt to roll your own HTML parser using
CL-PPCRE when CLOSURE-HTML exists.
and for a little dissent :
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
> It usually works very well when you need to scrap a certain site, not
> "sites in general".
Why would you half-ass code a non-working parser when there are already
libraries that can do it for you? Why would you do extra work you don't
have to, only to end up with something that is broken by design?
So many nice API's exist for querying and manipulating an XML tree, yet
you suggest the author roll their own? What brought this on?
Cheers,
drewc
posted this before i had my coffee :)
Anyways, to back this up, here is the cxml-stp code to get all a-hrefs and
img-src's out of almost any mostly well formed HTML page :
CL-USER> (defun show-links-and-images (url)
(let* ((str (drakma:http-request url))
(document (chtml:parse str (cxml-stp:make-builder))))
(stp:do-recursively (a document)
(when (and (typep a 'stp:element)
(or (equal (stp:local-name a) "img")
(equal (stp:local-name a) "a")))
(print (or (stp:attribute-value a "src" )
(stp:attribute-value a "href")))))))
SHOW-LINKS-AND-IMAGES
CL-USER> (show-links-and-images "http://common-lisp.net")
"logo120x80.png"
"http://www.lisp.org/"
"http://www.cliki.net/"
"http://planet.lisp.org/"
"http://www.cl-user.net/"
... etc
The example code at
http://common-lisp.net/project/closure/closure-html/examples.html
contains all you need to know... this code was a trivial cut and paste
job.
Cheers,
drewc
DC> This is one of the worst pieces of advice i've ever seen given on this
DC> newsgroup. Please do not attempt to roll your own HTML parser using
DC> CL-PPCRE
Have I mentioned parser?
You can extract information from HTML without parsing its structure, that's
my point.
Regex-based extractor tailored for a specific purpose might be faster/more
robust than a real HTML parser.
If one needs to work with HTML structure, then of course he needs a real
parser.
??>> It usually works very well when you need to scrap a certain site, not
??>> "sites in general".
DC> Why would you half-ass code a non-working parser when there are already
DC> libraries that can do it for you?
Because library doesn't do same thing exactly. Library tries to parse whole
structure and to support wide variety of sites.
Therefore, it is complex and is probably more error prone. Also, slower.
DC> Why would you do extra work you don't have to,
Extra work, huh?
This is regex-based scrapper which works with a certain site:
(ppcre:do-scans (ms me rs re
'(:sequence
"class=\"title\" href=\""
(:register
(:sequence
"http://"
(:non-greedy-repetition
1 nil
(:inverted-char-class #\"))))
#\") html-str)
(push (unescape-html (subseq html-str (elt rs 0) (elt re 0)))
links))
I bet code which uses the library won't be much shorter.
DC> only to end up with something that is broken by design?
HTML is broken by design, so a solution which works with HTML won't be
perfect anyway.
And custom solution for specific purpose is not necessarily worse than
general library.
DC> So many nice API's exist for querying and manipulating an XML tree,
XML is a different thing. It is supposed to be exactly parseable.
I would agree that trying to parse XML with regexes is a bad idea,
but HTML is junk and should be treated like a junk, with regexes and shit.
DC> yet you suggest the author roll their own?
I'm not suggesting "to roll their own API".
> ??>> If you don't care about getting it 100% correct for all pages, just
> ??>> use regexes.
> ??>> E.g. cl-ppcre library.
>
> DC> This is one of the worst pieces of advice i've ever seen given on this
> DC> newsgroup. Please do not attempt to roll your own HTML parser using
> DC> CL-PPCRE
>
> Have I mentioned parser?
No, but the OP did :
"Is there a library available to parse HTML ? I need to extract certain
tags like links and images from the body."
> You can extract information from HTML without parsing its structure,
> that's my point.
My point is that, when someone asks for a parser, telling them that they
can make a crappy half-assed one via regexps is a terrible bit of advice.
> Regex-based extractor tailored for a specific purpose might be
> faster/more robust than a real HTML parser.
Or it might not... hiring cheap labour to do it by hand might work too
and will likely be the most robust of all... but the OP asked for a parser.
>
> If one needs to work with HTML structure, then of course he needs a
> real parser.
>
> ??>> It usually works very well when you need to scrap a certain site, not
> ??>> "sites in general".
So, if you'd like your code to work more than once, avoid regexps? I can
agree with you there.
>
> DC> Why would you half-ass code a non-working parser when there are already
> DC> libraries that can do it for you?
>
> Because library doesn't do same thing exactly. Library tries to parse
> whole structure and to support wide variety of sites.
Right.. a library for HTML parsing... just like the OP asked for....
> Therefore, it is complex and is probably more error prone. Also,
> slower.
Non sequitur ... and absolute bollocks.
>
> DC> Why would you do extra work you don't have to,
>
> Extra work, huh?
>
> This is regex-based scrapper which works with a certain site:
ahem ^
>
> (ppcre:do-scans (ms me rs re
> '(:sequence
> "class=\"title\" href=\""
> (:register
> (:sequence
> "http://"
> (:non-greedy-repetition
> 1 nil
> (:inverted-char-class #\"))))
> #\") html-str)
> (push (unescape-html (subseq html-str (elt rs 0) (elt re 0)))
> links))
>
> I bet code which uses the library won't be much shorter.
I applaud your use of the sexp syntax for regexps, but, this following
code actually fetches any given webpage from the internet and collects
images and links, something similar to what the OP may have wanted.
(defun links-and-images (url &aux links-and-images)
(flet ((collect (e name attribute)
(when (equal (stp:local-name e) name)
(push (stp:attribute-value e attribute)
links-and-images))))
(stp:do-recursively (e (chtml:parse (drakma:http-request url)
(cxml-stp:make-builder)))
(when (typep e 'stp:element)
(collect e "a" "href")
(collect e "img" "src")))
links-and-images))
Does more, actually readable and works for a significantly larger
portion of the interwebs. If you're going to make that regexp above into
a string and claim victory because it's "shorter", i'm not interested.
Can you show me a regexp based solution that meets the op's
specification and works for an equal or greater number of pages? If you
can show me that then we'll compare speed if you like.
And when we're done with that, we'll compare programmer time... then change
the spec around a bit...
Let's make it realistic and a little more interesting than the average
usenet pissing match. Perhaps we can get the OP back.
> DC> only to end up with something that is broken by design?
>
> HTML is broken by design, so a solution which works with HTML won't be
> perfect anyway.
Looks like you're sinking now and trying to grab anything that might
float. If the need to parse a particular broken page comes up, fix it
first... most HTML parsers will do this automatically :
(defun clean-html (string)
(chtml:parse string (chtml:make-string-sink)))
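(A quick usage sketch, not from the original post: (clean-html "<i>foo")
should hand back a complete, serialized HTML document with the markup
repaired.)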
No solution is going to work with all broken html, that's
impossible. Is your advice that, because it's possible it may not work with a
small portion of HTML, to ensure it's limited to an even smaller
portion? As i previously stated, i think that's horrible advice.
> And custom solution for specific purpose is not necessarily worse than
> general library.
Given the OP's question and our respective example versions, i don't
think you can reasonably claim this. His specific purpose was to extract
links and images, and he requested an HTML parser.
>
> DC> So many nice API's exist for querying and manipulating an XML tree,
>
> XML is a different thing. It is supposed to be exactly parseable.
> I would agree that trying to parse XML with regexes is a bad idea,
> but HTML is junk and should be treated like a junk, with regexes and
> shit.
I'd stopped talking about parsing at this point. The author has asked
for a library to parse HTML _and_ extract images and links.
The way CLOSURE-HTML (and a good many HTML parsers) work is by cleaning
up the HTML to make an XML-like parse tree, and then using established API's
to work with that. Makes sense to use tools that are designed to solve
the problem you are trying to solve unless those tools are deficient in
some way, non?
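(As an aside, and just a sketch from memory rather than anything the OP
posted: chtml can hand you that cleaned-up tree in several forms, e.g. as
plain lists

(chtml:parse "<i>oops<p>tag soup" (chtml:make-lhtml-builder))
;; => (:HTML ...) -- a nested-list (LHTML) rendering of the repaired page

or as a cxml-stp document, as in the examples earlier in the thread.)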
>
> DC> yet you suggest the author roll their own?
>
> I'm not suggesting "to roll their own API".
That's where they'd end up while attempting to use your advice to solve
their problem, in my opinion. Sure, a regexp based solution can be
convinced to work, but it's not the right tool for the job.
In an attempt to avoid cliche while simultaneously referencing it, i'll
end with this quote :
“Whenever faced with a problem, some people say `Lets use AWK.'
Now, they have two problems.” -- D. Tilbrook
Cheers,
drewc
DC> No, but the OP did :
DC> "Is there a library available to parse HTML ? I need to extract certain
DC> tags like links and images from the body."
I thought he might not know that you don't need to fully parse HTML to
extract links and images from the body.
And I think I was pretty clear about that it is a half-assed solution.
That's just an option.
DC> My point is that, when someone asks for a parser, telling them that they
DC> can make a crappy half-assed one via regexps is a terrible bit of advice.
I wrote that you can do this in "crappy half-assed way" without a parser at
all. That it is a different thing.
I've used a thing like that a few times; it took me maybe 5 minutes to get it
working and it worked 100% well (for a certain site).
So what's wrong with it?
DC> Or it might not... hiring cheap labour to do it by hand might work too
DC> and will likely be the most robust of all... but the OP asked for a
DC> parser.
Well... Do you know that 95% of posts here on comp.lang.lisp which start
with "Help me to fix this macro..." are not about macros but about a person
not understanding something?
It just might be that person looking for a parser can do thing he wants
without a parser.
If he really needs a parser, he can just ignore my comment and listen to
people who have provided links to various parsers.
So what's wrong with it, really?
??>>>> It usually works very well when you need to scrap a certain site,
??>>>> not "sites in general".
DC> So, if you'd like your code to work more than once, avoid regexps?
DC> I can agree with you there.
Well, I don't know what he is doing. Sometimes people need to scrap a
certain site or few of them.
Regexes are fine for that. They might get broken if they change layout on
the site.
But parser-based solution might get broken too (that is, way you need to
traverse DOM changes).
??>> Therefore, it is complex and is probably more error prone. Also,
??>> slower.
DC> Non sequitur ... and absolute bollocks.
Of course it might be the other way around, but, generally, more complex
things are more prone to errors.
And also, when you extract more information and do it in a more complex
way, that takes more time.
E.g. if it is DOM-based parser, I'm pretty sure that full DOM of a page
takes more memory than a bunch of links.
DC> I applaud your use of the sexp syntax for regexps, but, this following
DC> code actually fetches any given webpage from the internet and collects
DC> images and links, something similar to what the OP may have wanted.
DC> (defun links-and-images (url &aux links-and-images)
DC> (flet ((collect (e name attribute)
DC> (when (equal (stp:local-name e) name)
DC> (push (stp:attribute-value e attribute)
DC> links-and-images))))
DC> (stp:do-recursively (e (chtml:parse (drakma:http-request url)
DC> (cxml-stp:make-builder)))
DC> (when (typep e 'stp:element)
DC> (collect e "a" "href")
DC> (collect e "img" "src")))
DC> links-and-images))
DC> Does more, actually readable and works for a significantly larger
DC> portion of the interwebs.
Well, you see, I was not interested in all links on that page, I was
interested only on those with class "title" and are http:// links.
So regex says exactly that.
I don't know what exactly OP wants, I guess we don't have a full story here.
DC> Can you show me a regexp based solution that meets the op's
DC> specification and works for an equal or greater number of pages?
I don't see specification here. He did not say that he just wants all links.
DC> If you can show me that then we'll compare speed if you like.
That would be interesting...
DC> Let's make it realistic and a little more interesting than the average
DC> usenet pissing match. Perhaps we can get the OP back.
Well, sorry, I probably don't have time for this.
DC> No solution is going to work with all broken html, that's
DC> impossible. Is your advice that, because it's possible it may not work
DC> with a small portion of HTML, to ensure it's limited to an even smaller
DC> portion?
If you use regex like this to get a tags: <a\s[^>]+> -- I honestly don't see
a lot of ways how it can get broken.
In fact, I don't see any. Well, I wrote this in 5 seconds; if I thought about
it for an hour or so I think I'd have an absolutely bulletproof
link-extracting regex.
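Roughly, with CL-PPCRE (a sketch; html-str stands for the page text):

(cl-ppcre:all-matches-as-strings "<a\\s[^>]+>" html-str)
;; => the raw "<a ...>" tags as a list of strings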
Good thing about it is that it absolutely does not care about context. It
might also be a bad thing, depending on what you're doing.
So I don't agree on "smaller portion." Fixing HTML in general is harder than
fixing it for a specific task.
DC> As i previously stated, i think that's horrible advice.
Ok, ok, I get it. But it is just your opinion :)
DC> The way CLOSURE-HTML (and a good many HTML parsers) work is by cleaning
DC> up the HTML to make an XML-like parse tree, and then using established
DC> API's to work with that. Makes sense to use tools that are designed to
DC> solve the problem you are trying to solve unless those tools are
DC> deficient in some way, non?
Having a full DOM is overkill. If you're only working with small sites
that's ok, but a DOM of a larger thing can eat lots of memory.
DC> That's where they'd end up while attempting to use your advice to solve
DC> their problem, in my opinion. Sure, a regexp based solution can be
DC> convinced to work, but it's not the right tool for the job.
So a DOM parser might not be the right tool either. Sometimes a job requires
a very specific tool and general ones are deficient.
DC> “Whenever faced with a problem, some people say `Lets use AWK.'
DC> Now, they have two problems.” -- D. Tilbrook
Overly simplistic regexes are inferior to formal parsers, but HTML as people
use it is not formally defined, so it is inapplicable here.
Drew,
thank you for doing this. I cannot express how frustrating it is when
people use REs in this context for more than a single page/url hack. I
have lost count of the number of times I have had to fix broken systems
that were due precisely to the use of REs to process HTML pages.
The RE solution is not a good solution. It /can/ work in a very limited
context, but as soon as you try to apply this approach in a more
generalised solution, you end up with a maintenance nightmare. Worse
still, it means that you will need someone experienced to maintain this
solution, someone who understands how REs work. Unfortunately, few people
actually do. Trivial REs are easy, but as soon as they begin to get a
little complex, you really need a deeper understanding of how they work:
anchoring, backtracking, and so on.
For the OP, there are a number of tools, such as 'tidy', which can
clean up HTML and in turn make the parsers work better. It is true
that HTML is broken in many ways, making it hard to process reliably.
Many HTML generation libraries are extremely inefficient and buggy -
take a look at the HTML generated by many MS programs such as Outlook to
see what I mean. However, this does not mean you cannot parse it.
Obviously you can, or we would not have any working web browsers. Using
tools like 'tidy' to clean up the HTML before parsing means that the
parser doesn't have to work as hard and may not need to deal with as
many exceptions.
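For instance, a rough, SBCL-specific sketch of that pipeline (not from any of
the posts above; it assumes the 'tidy' binary is on the PATH):

(defun tidy-html (html)
  ;; feed the raw markup to the external 'tidy' program and return the
  ;; cleaned-up document as a string
  (with-output-to-string (out)
    (with-input-from-string (in html)
      (sb-ext:run-program "tidy" '("-q" "-asxhtml")
                          :search t :input in :output out :error nil))))

Something like (chtml:parse (tidy-html raw-page) (cxml-stp:make-builder))
then gives the parser pre-cleaned input to work with.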
Avoid the RE solution unless you are dealing with a single page that is
fairly simple. If you are loking for something more general, use one of
the parses and something like 'tidy'. When it fails, extend the parser
and incrementally improve it.
Tim
--
tcross (at) rapttech dot com dot au
> ??>> Have I mentioned parser?
>
> DC> No, but the OP did :
>
> DC> "Is there a library available to parse HTML ? I need to extract certain
> DC> tags like links and images from the body."
>
> I thought he might not know that you don't need to fully parse HTML to
> extract links and images from the body.
Let's say that was true, and that the OP was quite naive and had never
used regular expressions to extract data from text, or never thought to
apply regular expressions to this particular problem.
If that were the case, i think it would also be likely that the OP was
not all that familiar with regular expressions to begin with. I'm of the
opinion that this is unlikely, and the OP had already rejected regular
expressions as the wrong solution.
In the former case, learning regular expressions in order to extract
links and images from HTML is not something i would recommend.
> And I think I was pretty clear about that it is a half-assed
> solution. That's just an option.
It's a terrible solution, and i can't see why you're still defending
it.
>
> DC> My point is that, when someone asks for a parser, telling them that they
> DC> can make a crappy half-assed one via regexps is a terrible bit of advice.
>
> I wrote that you can do this in "crappy half-assed way" without a
> parser at all. That it is a different thing.
>
> I've used a thing like that a few times; it took me maybe 5 minutes to get it
> working and it worked 100% well (for a certain site).
> So what's wrong with it?
It took me a lot less to get mine working, and it works for more than
one site. If the site changes, mine will still extract the images and
links, yours will not. Also, mine was a complete and working piece of
code.
>
> DC> Or it might not... hiring cheap labour to do it by hand might work too
> DC> and will likely be the most robust of all... but the OP asked for a
> DC> parser.
>
> Well... Do you know that 95% of posts here on comp.lang.lisp which
> start with "Help me to fix this macro..." are not about macros but
> about a person not understanding something?
What you don't seem to understand is that regular expressions, for
extracting things from HTML, are almost always the wrong
solution. Similar to using a macro when a function is what is called
for, or using EVAL when a macro would do.
> It just might be that person looking for a parser can do thing he
> wants without a parser.
Since the OP is not around to comment on what their exact needs were, we
have to assume that, having long since chosen closure-html as the thing
that will do the thing he wants, the OP was in fact looking to do the
kind of things a parser does.
> If he really needs a parser, he can just ignore my comment and listen
> to people who have provided links to various parsers.
As they did.
>
> So what's wrong with it, really?
What worries me is not so much that the OP might have taken your advice,
but rather that you thought it was good advice to give. My contrary
demonstrations are as much for your benefit as for those who may still
be following this thread.
> ??>>>> It usually works very well when you need to scrap a certain site,
> ??>>>> not "sites in general".
>
> DC> So, if you'd like your code to work more than once, avoid regexps?
> DC> I can agree with you there.
>
> Well, I don't know what he is doing.
Seemed to me like he was trying to extract links and images and the like
from html.
> Sometimes people need to scrap a
> certain site or few of them.
'scrape', scrap is what i'd do to your code if you tried to get it past
review.
> Regexes are fine for that.
If you want to write brittle code that is prone to breakage and only
works on one site as long as that site stays relatively static, rather
than fairly solid code that works on a majority of sites that are
allowed to change significantly, and you don't know anything about how
to use HTML parsers and the surrounding toolsets, i'd still recommend
you learn to use the right tools for the job.
> They might get broken if they change layout
> on the site.
Indeed they might.
> But parser-based solution might get broken too (that is, way you need
> to traverse DOM changes).
This is nonsense. For extracting links and images (that is, a and img
tags, and their attributes, in a useful data structure), a parser based
solution will track html changes significantly better than a regular
expression based solution.
For any scraping task where the specifics of the document structure are
involved, either solution is going to have problems.. so what are you
trying to say? That changes in input structure may break code that depends on
it? Hardly an argument for regexps.
> ??>> Therefore, it is complex and is probably more error prone. Also,
> ??>> slower.
>
> DC> Non sequitur ... and absolute bollocks.
>
> Of course it might be the other way around, but, generally, more
> complex things are more prone to errors.
CXML and companion libraries are excellent code that is well tested, and
the problem they solve (xml parsing, manipulation, and unparsing) is not
that difficult.
The closure-html library is able to understand more HTML, and allows the
user to do more with the data with less code, than the solutions you've
presented.
> And also, when you extract more information and do it in a more complex
> way, that takes more time.
You have not shown that it takes significantly more time or is any
more complex for a task of reasonable size.
For any less complex tasks, regular expressions themselves are too
complex and take too much time. Read on and i will be happy to show
this.
>
> E.g. if it is DOM-based parser, I'm pretty sure that full DOM of a
> page takes more memory than a bunch of links.
If you knew anything about the topic at hand, you'd know that there are
many, many solutions for XML that do not involve a 'full DOM'. SAX, for
example... the 'Simple API for XML'.
Here is a modification of my previous code (which was not DOM based, so
enough about DOM) to use the SAX interface and not create a data
structure that represents the entire document :
(defclass links-and-images-handler (sax:default-handler)
((links-and-images :accessor links-and-images
:initform nil)))
(defmethod sax:end-document ((handler links-and-images-handler))
(links-and-images handler))
(defmethod sax:start-element ((handler links-and-images-handler)
uri local-name qname attributes)
(flet ((collect (element attribute)
(when (string-equal element local-name)
(let ((attribute
(find attribute attributes
:key #'sax:attribute-local-name
:test #'string-equal)))
(when attribute
(push (sax:attribute-value attribute)
(links-and-images handler)))))))
(collect "a" "href")
(collect "img" "src")))
(defun sax-links-and-images (url)
(chtml:parse (drakma:http-request url :want-stream t)
(make-instance 'links-and-images-handler)))
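A quick usage sketch (not from the original post):
(sax-links-and-images "http://common-lisp.net") should return the same sort
of href/src list as the STP version above, without ever building a
whole-document tree.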
Also notice that it doesn't even make a string out of the webpage
itself, but rather reads from the stream and parses it
incrementally. I'm sure that having to read the entire file into
memory in order to run a regular expression over it is going to take more
memory than, well, not doing that.
>
> DC> I applaud your use of the sexp syntax for regexps, but, this
> DC> following code actually fetches any given webpage from the
> DC> internet and collects images and links, something similar to what
> DC> the OP may have wanted.
>
> DC> (defun links-and-images (url &aux links-and-images)
> DC> (flet ((collect (e name attribute)
> DC> (when (equal (stp:local-name e) name)
> DC> (push (stp:attribute-value e attribute)
> DC> links-and-images))))
> DC> (stp:do-recursively (e (chtml:parse (drakma:http-request url)
> DC> (cxml-stp:make-builder)))
> DC> (when (typep e 'stp:element)
> DC> (collect e "a" "href")
> DC> (collect e "img" "src")))
> DC> links-and-images))
>
> DC> Does more, actually readable and works for a significantly larger
> DC> portion of the interwebs.
>
> Well, you see, I was not interested in all links on that page, I was
> interested only on those with class "title" and are http:// links.
> So regex says exactly that.
Does it now? What if the class comes after the href?
Ok, here's that code modified to your new spec:
(defun http-links-with-class (url &aux links-and-images)
(flet ((collect (e class name attribute)
(when (equal (stp:local-name e) name)
(when (equal (stp:attribute-value e "class") class)
           (let ((value (stp:attribute-value e attribute)))
             ;; guard against a missing or short href before checking its prefix
             (when (and value
                        (>= (length value) 7)
                        (string-equal "http://" (subseq value 0 7)))
               (push value links-and-images)))))))
(stp:do-recursively (e (chtml:parse (drakma:http-request url)
(cxml-stp:make-builder))
links-and-images)
(when (typep e 'stp:element)
(collect e "title" "a" "href")))))
Please note that it actually works for a greater percentage of pages,
including those where the class attribute is not directly before the
href attribute.
[snip]
> If you use regex like this to get a tags: <a\s[^>]+> -- I honestly
> don't see a lot of ways how it can get broken.
> In fact, I don't see any.
If that's all you want to achieve, why bother with regular
expressions at all?
It's complex, and more error prone, and therefore slower, than a
hand-rolled function :
(defvar *page* (drakma:http-request "http://common-lisp.net"))
(defun match-html-a (stream)
(declare (optimize (speed 3) (space 3)))
(loop
:for char := (read-char stream nil)
:while char
:when (and (eql char #\<)
(member (peek-char nil stream) '(#\a #\A)))
:collect (loop
:for char := (read-char stream nil)
:while char :collect char into stack
:until (eql char #\>)
:finally (return (coerce (cons #\< stack) 'string)))))
(locally (declare (optimize (speed 3) (space 3)))
(let ((scanner (cl-ppcre:create-scanner
"<a\\s[^>]*>"
:case-insensitive-mode t
:multi-line-mode t)))
(defun ppcre-match-html-a (string)
(nreverse
(let (links)
(cl-ppcre:do-matches-as-strings
(string scanner string links)
(push string links)))))))
CL-USER> (equalp (ppcre-match-html-a *page*)
(with-input-from-string (s *page*)
(match-html-a s)))
=> T
CL-USER> (time (dotimes (n 1024)
(with-input-from-string (page *page*)
(match-html-a page))))
Evaluation took:
0.778 seconds of real time
0.776049 seconds of total run time (0.776049 user, 0.000000 system)
[ Run times consist of 0.044 seconds GC time,
and 0.733 seconds non-GC time. ]
99.74% CPU
1,553,424,264 processor cycles
63,615,928 bytes consed
NIL
CL-USER> (time (dotimes (n 1024)
(ppcre-match-html-a *page*)))
Evaluation took:
1.612 seconds of real time
1.612101 seconds of total run time (1.608101 user, 0.004000 system)
[ Run times consist of 0.024 seconds GC time,
and 1.589 seconds non-GC time. ]
100.00% CPU
3,214,872,948 processor cycles
47,587,008 bytes consed
NIL
CL-USER> (time (dotimes (n 1024)
(with-input-from-string (page *page*)
;level the playing field
(ppcre-match-html-a *page*))))
Evaluation took:
1.942 seconds of real time
1.948122 seconds of total run time (1.948122 user, 0.000000 system)
[ Run times consist of 0.044 seconds GC time,
and 1.905 seconds non-GC time. ]
100.31% CPU
3,874,149,564 processor cycles
85,368,544 bytes consed
NIL
CL-USER>
> Well, I wrote this in 5 seconds; if I thought about it for an hour or so
> I think I'd have an absolutely bulletproof link-extracting regex.
MATCH-HTML-A took a little longer than 5 seconds, but not much, and i
didn't have to use regular expressions, which in this case add
complexity.
> Good thing about it is that it absolutely does not care about
> context. It might also be a bad thing, depending on what you're doing.
> So I don't agree on "smaller portion." Fixing HTML in general is
> harder than fixing it for a specific task.
The thing is, we want context at some point. All we have now is a list
of strings that look like "<a ... >". If we want to extract the actual
href, we have to work a little bit harder :
CL-USER> (with-input-from-string (page *page*)
(match-html-a page))
("<a href=\"http://www.lisp.org/\">"
"<a href=\"http://www.cliki.net/\">"
"<a href=\"http://planet.lisp.org/\">" ...)
(let ((scanner (cl-ppcre:create-scanner "\\s+" :multi-line-mode t)))
(defun match-a-href-value (string)
(dolist (attribute (cl-ppcre:split scanner string))
(when (and (> (length attribute) 5)
(string-equal
(subseq attribute 0 4) "href"))
(return-from match-a-href-value
(second (split-sequence:split-sequence
#\" attribute )))))))
(defun linear-match-a-href (string)
(mapcar #'match-a-href-value
(ppcre-match-html-a string)))
CL-USER> (linear-match-a-href *page*)
("http://www.lisp.org/"
"http://www.cliki.net/"
"http://planet.lisp.org/" ...)
CL-USER> (time (dotimes (n 1024)
(linear-match-a-href *page*)))
Evaluation took:
2.529 seconds of real time
2.528158 seconds of total run time (2.528158 user, 0.000000 system)
[ Run times consist of 0.032 seconds GC time,
and 2.497 seconds non-GC time. ]
99.96% CPU
5,043,773,340 processor cycles
77,095,968 bytes consed
Maybe that's not the best way to do that, but that's the naive
ad-hoc implementation i came up with off the top of my head.
A SAX based version might look like this :
(defclass links-handler (sax:default-handler)
((links :accessor links
:initform nil)))
(defmethod sax:end-document ((handler links-handler))
(nreverse (links handler)))
(defmethod sax:start-element ((handler links-handler)
uri local-name qname attributes)
(when (string-equal "a" local-name)
(let ((attribute
(find "href" attributes
:key #'sax:attribute-local-name
:test #'string-equal)))
(when attribute
(push (sax:attribute-value attribute)
(links handler))))))
(defun sax-match-html-a-href (string)
(chtml:parse string (make-instance 'links-handler)))
CL-USER> (time (dotimes (n 1024)
(sax-match-html-a-href *page*)))
Evaluation took:
2.959 seconds of real time
2.952185 seconds of total run time (2.904182 user, 0.048003 system)
[ Run times consist of 0.160 seconds GC time,
and 2.793 seconds non-GC time. ]
99.76% CPU
5,903,549,172 processor cycles
282,859,128 bytes consed
Those run times are comparable, and if we're not storing the web pages
in memory, the differences will be lost in the i/o latency. At this
point, the regexp based solution starts to take on the characteristics
of a parser, and also becomes more prone to error as it requires new
untested code.
So what is the advantage of the half assed regular expression based
solution?
It's not code size, nor run time, nor ease of use. Only for a very
simple task like this one is the amount of effort anywhere near
comparable.
> DC> As i previously stated, i think that's horrible advice.
>
> Ok, ok, I get it. But it is just your opinion :)
One that i can back up with experience, and actual code. If you'd ever
had to work with some other coder's regexp-based pseudo-parser, or even
your own if you've made that mistake (as i have), you'd recognize it as
a good opinion.
[snipped more DOM-related nonsense]
> DC> “Whenever faced with a problem, some people say `Lets use AWK.'
> DC> Now, they have two problems.” -- D. Tilbrook
>
> Overly simplistic regexes are inferior to formal parsers, but HTML as
> people use it is not formally defined, so it is inapplicable here.
This is fallacious as well. The HTML that CLOSURE-HTML is designed to
parse is the HTML as people use it. Just as an ad-hoc regular expression
will attempt to extract meaning from improperly structured HTML, so does
the parser behind CHTML:PARSE.
It is possible to construct input that chtml rejects but an ad-hoc
regexp based solution might accept, just as i can easily construct a
string that the regular expression will reject but a hand-rolled
solution will accept. This proves nothing, unless you can make an
argument that the hand-rolled solution is in fact a better solution than
CLOSURE-HTML. That i'd be willing to listen to.
Obviously, the solution that works is better than the one that
doesn't. In the case where two tools are arguably equally good (the very
simple case we presented above), using the one that is designed for the
job is most likely, in all cases, going to be the right idea.
If you'll indulge me one more cliché
"It is tempting, if the only tool you have is a hammer, to treat
everything as if it were a nail."
-- Abraham Maslow, 1966
Cheers,
drewc
> Drew Crampsie <dr...@tech.coop> writes:
>
>> "Captain Obvious" <udod...@users.sourceforge.net> writes:
>>
>>> ??>> If you don't care about getting it 100% correct for all pages, just
>>> ??>> use regexes.
>>> ??>> E.g. cl-ppcre library.
>>>
>>> DC> This is one of the worst pieces of advice i've ever seen given on this
>>> DC> newsgroup. Please do not attempt to roll your own HTML parser using
>>> DC> CL-PPCRE
> Drew,
>
> thank you for doing this.
Someone had to :).
> I cannot express how frustrating it is when people use REs in this
> context for more than a single page/url hack. I have lost count of
> the number of times I have had to fix broken systems that were due
> precisely to the use of REs to process HTML pages.
This, more than the specific unsuitability of the tool to the task, is
the reason i spoke up. I've been there, it's hell, and completely
avoidable.
Hopefully, the epic followup i just posted will end any questions about
the matter! :)
Cheers,
drewc
[A ton of useful info on HTML and XML parsing]
Wow, that has to be one of the most content-full posts ever to C.L.L.
Thanks for taking the time to write it!
rg
[...]
>> And I think I was pretty clear about that it is a half-assed
>> solution. That's just an option.
DC> It's a terrible solution, and i can't see why you're still
DC> defending it.
I can. Because it may be a pretty good solution in certain conditions.
For example, when one wants just to extract some links (not all) or
pieces of text from given page on given site, and the page is generated
by
[...]
>>>>>> Drew Crampsie writes:
>
> [...]
>
> >> And I think I was pretty clear about that it is a half-assed
> >> solution. That's just an option.
>
> DC> It's a terrible solution, and i can't see why you're still
> DC> defending it.
>
> I can. Because it may be a pretty good solution in certain
> conditions.
Compared to what? I've shown for the simple cases where a regexp may be
a 'pretty good' solution, a hand-rolled matcher is the same amount of
code and effort, and significantly faster.
For anything more, i've shown that using a proper parser is less code,
less effort, and actually works in more cases.
> For example, when one wants just to extract some links (not all) or
> pieces of text from given page on given site, and the page is generated
> by
(with-piss-take ()
I presented a decent argument and backed it up with a lot of code, and
you haven't even finished this sentence... not convincing at all.
Or is this a joke? Because a regexp solution is half-assed and
unfinished, so is your response?
I figure you meant to 'cancel' and send instead, as nobody using their
real name would in their right mind defend a regexp based solution in
the face of the evidence offered.)
:)
Cheers,
drewc
>
> [...]
DC> Dmitry Statyvka <dmi...@statyvka.org.ua> writes:
>>>>>>> Drew Crampsie writes:
>>
>> [...]
>>
>> >> And I think I was pretty clear about that it is a half-assed
>> >> solution. That's just an option.
>>
DC> It's a terrible solution, and i can't see why you're still
DC> defending it.
>>
>> I can. Because it may be a pretty good solution in certain
>> conditions.
DC> Compared to what? I've shown for the simple cases where a regexp
DC> may be a 'pretty good' solution, a hand-rolled matcher is the same
DC> amount of code and effort, and significantly faster.
DC> For anything more, i've shown that using a proper parser is less
DC> code, less effort, and actually works in more cases.
It is, if a proper parser can parse what we downloaded. It's wrong to
assume that any downloaded page will be valid HTML. It's wrong to
suppose that driving the pipeline "HTML-client | HTML-beautifier |
parser + content extractor | business logic" will be simpler, faster,
whatever than driving "HTML-client | content extractor | business logic"
for any possible case. That's the point of Cap's original post. And it's
obvious.
[...]
DC> I figure you meant to 'cancel' and send instead,
I meant to 'save' and send instead.
DC> as nobody using their real name would in their right mind defend a
DC> regexp based solution in the face of the evidence offered.)
For the sake of clarity, in my opinion: a regexp-based solution can be
simpler, more manageable and faster than an HTML-parsing-based solution.
I have no good example at hand, sorry. I saw it in real
development (and real support too) a few years ago.
Dmitry.
>>>>>> Drew Crampsie writes:
>
> DC> Dmitry Statyvka <dmi...@statyvka.org.ua> writes:
> >>>>>>> Drew Crampsie writes:
> >>
> >> [...]
> >>
> >> >> And I think I was pretty clear about that it is a half-assed
> >> >> solution. That's just an option.
> >>
> DC> It's a terrible solution, and i can't see why you're still
> DC> defending it.
> >>
> >> I can. Because it may be a pretty good solution in certain
> >> conditions.
>
> DC> Compared to what? I've shown for the simple cases where a regexp
> DC> may be a 'pretty good' solution, a hand-rolled matcher is the same
> DC> amount of code and effort, and significantly faster.
>
> DC> For anything more, i've shown that using a proper parser is less
> DC> code, less effort, and actually works in more cases.
>
> It is, if a proper parser can parse what we downloaded. It's wrong to
> assume that any downloaded page will be valid HTML. It's wrong to
> suppose that driving the pipeline "HTML-client | HTML-beautifier |
> parser + content extractor | business logic" will be simpler, faster,
> whatever than driving "HTML-client | content extractor | business logic"
> for any possible case. That's the point of Cap's original post. And it's
> obvious.
If your input is corrupted in such a way as to be unparsable, then
perhaps fixing your input before parsing it is a better idea than making
a one-off matcher, unless your data format is simple enough that a
parser is not needed...
And, as i've shown, for the simple cases where a regexp based solution
is usable, it's just as easy to hand-craft a matcher, so regexps could
be said to suffer from the same 'over complex' problem in the instances
you claim they are useful.
So none of this is convincing me that regular expressions are the right
tool for extracting information from HTML. When you tell me they are good for
extracting information from things that are not HTML, don't be offended
if i say 'who gives a toss?'.
I've shown the two pipelines to be of similar speed and similar effort
for a toy problem, and also demonstrated that for a larger problem, the
parser gains a significant advantage.
Have you ever used CLOSURE-HTML? CL-PPCRE? Can you whip open a repl and
back up your assertions with actual code? If not, then whatever
hypotheticals you are discussing are not something that interest me, and
are certainly not the topic under discussion (extracting information
from HTML using Common Lisp).
> DC> as nobody using their real name would in their right mind defend a
> DC> regexp based solution in the face of the evidence offered.)
>
> For the sake of clarity, in my opinion: a regexp-based solution can be
> simpler, more manageable and faster than an HTML-parsing-based solution.
I recognize your right to your opinion, but your unwillingness to back
it up with any code, or even an example, sets off my bullshit
alarm. Since you have no refutation at hand, i'm not going to turn it
off. :P
> I have no good example at hand, sorry. I saw it in real
> development (and real support too) a few years ago.
In this case, is 'real' a mess of perl or php code? Perhaps something
written by morons? Is this a solution involving parsing HTML in order
to retrieve information such as links and images, which is what we are
talking about here... non?
I'd like to see an example that's a little more substantial, but i doubt
your ability to produce it. I'm willing to be swayed with a good
argument, but i'm pretty sure it doesn't exist. :)
Cheers,
drewc
>
> Dmitry.
DC> If your input is corrupted in such a way as to be unparsable, then
DC> perhaps fixing your input before parsing it is a better idea than
DC> making a one-off matcher, unless your data format is simple enough
DC> that a parser is not needed...
Almost agreed. It's a better idea to fix the input before parsing it,
unless making a matcher takes less effort. Sure, the effort of
support matters.
[...]
DC> So none of this is convincing me that regular expressions are the
DC> right tool for extracting information from HTML.
Regexps are just one such tool. Note: not for parsing HTML, but for
extracting information.
[...]
DC> Have you ever used CLOSURE-HTML? CL-PPCRE? Can you whip open a repl
DC> and back up your assertions with actual code? If not, then whatever
DC> hypotheticals you are discussing are not something that interest
DC> me, and are certainly not the topic under discussion (extracting
DC> information from HTML using Common Lisp).
Yes, yes, no. I have no time to write an epic post on a trivial subject,
sorry. :-)
[...]
>> I have no good example at hand, sorry. I've just saw it in real
>> development (and real support too) a few years ago.
DC> In this case, is 'real' a mess of perl or php code?
No, they used C#.
DC> Perhaps something written by morons?
I have no reason to think so. The system seemed well
designed, the code was well structured, most sites were processed by
parser-based extractors, and the RE-based extractors seemed to fit into
the system well enough...
DC> Is this a solution involving parsing HTML in order to retrieve
DC> information such as links and images, which is what we are talking
DC> about here... non?
The goal was to extract posts and comments from several forums.
[...]
Dmitry.