
Perl HTML searching


Steve

Mar 19, 2010, 1:27:12 PM
I started a little project where I need to search web pages for their
text and return the links of those pages to me. I am using
LWP::Simple, HTML::LinkExtor, and Data::Dumper. Basically all I have
so far is a list of URLs from my search query of a website, but
I want to be able to filter this content based on each page's contents.
How can I do this? How can I get the content of a web page, and not
just the URL?

Kyle T. Jones

Mar 19, 2010, 1:37:45 PM

my $pagecontents=get("url");

Then you'll have to parse it yourself to pull out whatever stuff you're
interested in...

Cheers.

Jürgen Exner

Mar 19, 2010, 2:01:08 PM

???

I don't understand.

use LWP::Simple;
$content = get("http://www.whateverURL");

will get you exactly the content of that web page and assign it to
$content and apparently you are doing that already.

So what is your problem?

jue

Steve

Mar 19, 2010, 2:25:12 PM

Sorry, I am a little overwhelmed with the coding so far (I'm not very
good at Perl). I have what you have posted, but my problem is that I
would like to filter that content. Let's say I searched a site
that had 15 news links and 3 of them said "Hello" in the title. I
would want to extract only the links that said "Hello" in the title.

J. Gleixner

Mar 19, 2010, 2:42:42 PM
Steve wrote:


'"Hello" in the title'??.. The title element of the HTML????
Or the 'a' element contains 'Hello'?? e.g. <a href="...">Hello Kitty</a>

How are you using HTML::LinkExtor??

That seems like the right choice.

Why are you using Data::Dumper?

That's helpful when debugging, or logging, so how are you using it?

Post your very short example, because there's something you're
missing and no one can tell what that is based on your description.

Kyle T. Jones

Mar 19, 2010, 2:58:39 PM

Read up on perl regular expressions.

for instance, taking the above, you might first split it into a
"one-line per" array -

@stuff=split(/\n/, $content);

then parse each line for hello -

foreach(@stuff){
if($_=~/Hello/){
do whatever;}
}

Cheers.

Ben Morrow

Mar 19, 2010, 2:53:23 PM

Quoth Steve <st...@staticg.com>:

> On Mar 19, 11:01 am, Jürgen Exner <jurge...@hotmail.com> wrote:
> > Steve <st...@staticg.com> wrote:
> > >I started a little project where I need to search web pages for their
> > >text and return the links of those pages to me.  I am using
> > >LWP::Simple, HTML::LinkExtor, and Data::Dumper.  Basically all I have
> > >done so far is a list of URL's from my search query of a website, but
> > >I want to be able to filter this content based on the pages contents.
> > >How can I do this? How can I get the content of a web page, and not
> > >just the URL?
> >
> >         use LWP::Simple;
> >         $content = get("http://www.whateverURL");
> >
> > will get you exactly the content of that web page and assign it to
> > $content and apparently you are doing that already.
>
> Sorry I am a little overwhelmed with the coding so far (I'm not very
> good at perl). I have what you have posted, but my problem is that I
> would like to filter that content... like lets say I searched a site
> that had 15 news links and 3 of them said "Hello" in the title. I
> would want to extract only the links that said hello in the title.

Ah, you don't want the content *pointed to* by the link, you want the
content of the <a> element itself. I don't think you can use
HTML::LinkExtor for that.

I would start by building a DOM for the page, and then going through and
finding the <a> elements and checking their content. XML::LibXML
(despite the name) has a decent HTML parser, though you will probably
want to set the 'recover' option if you are parsing random HTML from the
Web. You can then use DOM methods like ->getElementsByTagName to find
the <a> elements and ->textContent to find their contents (ignoring
further tags within the <a> element).

Ben
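A minimal sketch of the DOM approach Ben describes, assuming XML::LibXML is installed. The sample HTML, the links_with_text helper, and the "Hello" pattern are made up for illustration and are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return [href, text] pairs for every <a> element
# whose text content matches $pattern.
sub links_with_text {
    my ($html, $pattern) = @_;

    # recover => 1 lets the parser cope with sloppy real-world HTML;
    # suppress_errors keeps libxml2 from printing every problem it finds.
    my $dom = XML::LibXML->load_html(
        string          => $html,
        recover         => 1,
        suppress_errors => 1,
    );

    my @found;
    for my $anchor ($dom->getElementsByTagName('a')) {
        my $href = $anchor->getAttribute('href');
        next unless defined $href;          # skip <a name="..."> anchors
        my $text = $anchor->textContent;    # text only, nested tags ignored
        push @found, [ $href, $text ] if $text =~ $pattern;
    }
    return @found;
}

# Illustrative input: "Hello" appears in a paragraph and in one link.
my $html = <<'HTML';
<html><body>
<p>Hello Kitty</p>
<a href="/news/1.html">Hello <b>world</b></a>
<a href="/news/2.html">Goodbye</a>
</body></html>
HTML

for my $link (links_with_text($html, qr/Hello/)) {
    print "$link->[0]\t$link->[1]\n";
}
```

Only the text inside <a> elements is tested, so the "Hello Kitty" paragraph is ignored and the nested <b> tag does not get in the way.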

J. Gleixner

Mar 19, 2010, 5:10:14 PM
J. Gleixner wrote:
> Steve wrote:
After looking at it further, HTML::LinkExtor only gives the
attributes, not the text that makes up the hyperlink. Seems
like that would be a useful enhancement.

This might help you:

http://cpansearch.perl.org/src/GAAS/HTML-Parser-3.64/eg/hanchors
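A rough sketch in the same spirit as that example, written against the HTML::Parser version 3 handler API. It is not a copy of the linked script. The extract_anchors helper and the sample HTML are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return a list of { href => ..., text => ... }
# hashes, one per <a href="..."> element in $html.
sub extract_anchors {
    my ($html) = @_;
    my (@anchors, $current);

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tag: begin a new record when we see <a href="...">.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            $current = { href => $attr->{href}, text => '' }
                if $tag eq 'a' && defined $attr->{href};
        }, 'tagname, attr' ],
        # Text: append entity-decoded text while inside an anchor.
        text_h => [ sub {
            my ($text) = @_;
            $current->{text} .= $text if $current;
        }, 'dtext' ],
        # End tag: finish the record at </a>.
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'a' && $current;
            $current->{text} =~ s/^\s+|\s+$//g;
            push @anchors, $current;
            undef $current;
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @anchors;
}

# Illustrative input, not taken from any real site.
my $html = <<'HTML';
<a href="/a.html">Hello <b>Kitty</b></a>
<a href="/b.html">Other</a>
<a name="x">Hello, but no href</a>
HTML

for my $anchor (grep { $_->{text} =~ /Hello/ } extract_anchors($html)) {
    print "$anchor->{href}: $anchor->{text}\n";
}
```

Unlike HTML::LinkExtor, this keeps the link text next to the href, so the filtering can be done on either one.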

Steve

Mar 19, 2010, 5:10:15 PM
On Mar 19, 11:42 am, "J. Gleixner" <glex_no-s...@qwest-spam-

Based on what you all said, I can give a clearer description.
Essentially, I'm trying to search craigslist more efficiently. I want
the link the <a> tag points to, as well as the description. Here is
the code I have already written, which gets me only the links:
-----------------------------

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::LinkExtor;
use Data::Dumper;

###### VARIABLES ######
my $craigs = "http://seattle.craigslist.org";
my $source = "$craigs/search/sss?query=what+Im+Looking+for&catAbbreviation=sss";
my $browser = 'google-chrome';

###### SEARCH #######

my $page = get("$source");
my $parser = HTML::LinkExtor->new();

$parser->parse($page);
my @links = $parser->links;
open LINKS, ">/home/me/Desktop/links.txt";
print LINKS Dumper \@links;

open READLINKS, "</home/me/Desktop/links.txt";
open OUT, ">/home/me/Desktop/final.txt";
while (<READLINKS>){
    if ( /html/ ){
        my $url = $_;
        for ($url){
            s/\'//g;
            s/^\s+//;
        }

        print OUT "$craigs$url";
    }
}
open BROWSE, "</home/me/Desktop/final.txt";

system ($browser);
foreach(<BROWSE>){
    system ($browser, $_);
}
-----------------------------

I've since created a different script that's a little more cleaned up

Ben Morrow

Mar 19, 2010, 5:40:14 PM

Quoth Steve <st...@staticg.com>:

>
> Based on what you all said, I can make a more clear description.
> Essentially, I'm trying to search craigslist more efficiently. I want

Are you sure craigslist's Terms of Use allow this? Most sites of this
nature don't.

> the link the a tag points to, as well as the description. here is
> code I used already that I made that gets me only the links:
> -----------------------------
>
> #!/usr/bin/perl -w
> use strict;
> use LWP::Simple;
> use HTML::LinkExtor;
> use Data::Dumper;
>
> ###### VARIABLES ######
> my $craigs = "http://seattle.craigslist.org";
> my $source = "$craigs/search/sss?query=what+Im+Looking
> +for&catAbbreviation=sss";
> my $browser = 'google-chrome';
>
> ###### SEARCH #######
>
> my $page = get("$source");
> my $parser = HTML::LinkExtor->new();
>
> $parser->parse($page);
> my @links = $parser->links;
> open LINKS, ">/home/me/Desktop/links.txt";

Use 3-arg open.
Use lexical filehandles.
*Always* check the return value of open.

    open my $LINKS, ">", "/home/me/Desktop/links.txt"
        or die "can't write to 'links.txt': $!";

You may wish to consider using the 'autodie' module from CPAN, which
will do the 'or die' checks for you.

> print LINKS Dumper \@links;
>
> open READLINKS, "</home/me/Desktop/links.txt";
> open OUT, ">/home/me/Desktop/final.txt";

As above.

> while (<READLINKS>){

Why are you writing the links out to a file only to read them in again?
Just use the array you already have:

for (@links) {

> if ( /html/ ){
> my $url = $_;
> for ($url){
> s/\'//g;
> s/^\s+//;
> }
>
> print OUT "$craigs$url";
> }
> }
> open BROWSE, "</home/me/Desktop/final.txt";

As above.

Ben
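Putting Ben's review points together, a sketch of the filtering step might look like the following. It assumes HTML::LinkExtor is installed. The matching_hrefs helper, the inline sample page (a stand-in for what get() would return), the example.com base URL, and the output path are all placeholders, not anything from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;
use HTML::LinkExtor;

# Hypothetical helper: return the href of every <a> link matching $pattern.
sub matching_hrefs {
    my ($html, $pattern) = @_;
    my $parser = HTML::LinkExtor->new;
    $parser->parse($html);
    $parser->eof;

    my @urls;
    # Each entry from links() is [ $tag, $attr_name => $url, ... ].
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' && defined $attr{href};
        push @urls, $attr{href} if $attr{href} =~ $pattern;
    }
    return @urls;
}

# Stand-in for the page that get($source) would return.
my $page = <<'HTML';
<a href="/sss/123.html">Item one</a>
<a href="/about/help">Help</a>
<img src="/images/logo.html.gif">
<a href="/sss/456.html">Item two</a>
HTML

my $site = 'http://www.example.com';    # placeholder base URL

# Work on the list in memory; no Data::Dumper round trip through a file.
my @urls = map { "$site$_" } matching_hrefs($page, qr/html/);

# Three-argument open, lexical filehandle, checked return value.
my $out_file = File::Spec->catfile(File::Spec->tmpdir, 'final.txt');
open my $out, '>', $out_file
    or die "can't write to '$out_file': $!";
print {$out} "$_\n" for @urls;
close $out
    or die "can't close '$out_file': $!";
```

Checking the tag name also keeps the <img> source out of the result, which a bare /html/ match on the dumped text would have let through.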

Steve

Mar 19, 2010, 6:10:15 PM

I have no idea, but it's personal use. I don't see what's so bad about
it; if I was using my web browser I'd be doing the same thing.
Craigslist is just an example.

That's beside the point though; I'm just doing it for fun/practice/
learning. Let's say we are using a different site then, perhaps one
I'm going to make; it makes no difference to me.

So any way I can do this or...?

Ben Morrow

Mar 19, 2010, 6:30:11 PM

Quoth Steve <st...@staticg.com>:

>
> I have no idea, but it's personal use. I don't see what so bad about
> it, if I was using my web browser I'd be doing the same thing.

That's not the point. If their TOS say 'no robots' then that means 'no
robots', not 'no robots unless it's for personal use and you can't see
why you shouldn't'. Apart from anything else, a lot of these sites make
money from ads, which you will completely bypass.

> Craigslist is just an example.
>
> That's aside the point though, I'm just doing it for fun/practice/
> learning. Let's say we are using a different site then, perhaps one
> I'm going to make, it makes no difference to me.
>
> So any way I can do this or...?

I've already suggested using XML::LibXML. Others have pointed you to an
example of using HTML::Parser. Pick one and try it.

Ben

Steve

Mar 19, 2010, 6:39:49 PM

I realize this, I'm not using craigslist. It was the first thing I
could think of for an example. This is for internal/personal use
only, and I don't like how you're labeling me as breaking any TOS for
an _EXAMPLE_. Notice how my home folder is changed to "me"? I'm
putting as little personal information here as possible, hence the
craigslist example.

Tad McClellan

Mar 19, 2010, 10:38:19 PM
Kyle T. Jones <KBf...@realdomain.net> wrote:
> Steve wrote:

>> like lets say I searched a site
>> that had 15 news links and 3 of them said "Hello" in the title. I
>> would want to extract only the links that said hello in the title.
>
> Read up on perl regular expressions.


While reading up on regular expressions is certainly a good idea,
it is a horrid idea for the purposes of parsing HTML.

Have you read the FAQ answers that mention HTML?

perldoc -q HTML


> for instance, taking the above, you might first split it into a
> "one-line per" array -
>
> @stuff=split(/\n/, $content);
>
> then parse each line for hello -
>
> foreach(@stuff){
> if($_=~/Hello/){
> do whatever;}
> }


The code below prints "do whatever" 3 times, but there is only one link
containing "Hello"...


---------------------------
#!/usr/bin/perl
use warnings;
use strict;

# some perfectly valid HTML:
my $content = '
<html><body>
<p>Hello
Kitty</p>
<a
href
=
"hello.com"
>Hello</a
>
<!--
There is no Hello here
-->
</body></html>
';

my @stuff = split /\n/, $content;
foreach (@stuff) {
    if(/Hello/) {
        print "do whatever\n";
    }
}
---------------------------


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg\100cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.
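For contrast, here is a sketch that runs the same sample HTML through a real parser instead of a line-by-line regex. It assumes HTML::TokeParser (part of the HTML-Parser distribution) is installed. The @hits array and the printed message are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# The same awkwardly wrapped but valid HTML as in the post above.
my $content = <<'HTML';
<html><body>
<p>Hello
Kitty</p>
<a
href
=
"hello.com"
>Hello</a
>
<!--
There is no Hello here
-->
</body></html>
HTML

my $p = HTML::TokeParser->new(\$content)
    or die "can't create parser";

my @hits;
# get_tag('a') skips ahead to the next <a ...> start tag;
# the token is [ $tagname, \%attr, \@attrseq, $original_text ].
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href};
    # Collect the text up to the matching </a>, whitespace trimmed.
    my $text = $p->get_trimmed_text('/a');
    push @hits, $href if defined $href && $text =~ /Hello/;
}

print scalar(@hits), " link(s) containing Hello: @hits\n";
```

Because only text between <a> and </a> is examined, the paragraph and the comment should not be counted, however the tags are wrapped across lines.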

Peter J. Holzer

Mar 20, 2010, 7:35:53 AM
On 2010-03-19 22:39, Steve <st...@staticg.com> wrote:
> On Mar 19, 3:30 pm, Ben Morrow <b...@morrow.me.uk> wrote:
>> Quoth Steve <st...@staticg.com>:
>> > I have no idea, but it's personal use.  I don't see what so bad about
>> > it, if I was using my web browser I'd be doing the same thing.
>>
>> That's not the point. If their TOS say 'no robots' then that means 'no
>> robots', not 'no robots unless it's for personal use and you can't see
>> why you shouldn't'. Apart from anything else, a lot of these sites make
>> money from ads, which you will completely bypass.

=======

>> > Craigslist is just an example.
>>
>> > That's aside the point though, I'm just doing it for fun/practice/
>> > learning.  Let's say we are using a different site then, perhaps one
>> > I'm going to make, it makes no difference to me.
>>
>> > So any way I can do this or...?
>>
>> I've already suggested using XML::LibXML. Others have pointed you to an
>> example of using HTML::Parser. Pick one and try it.
>>
>> Ben
>
> I realize this,

Please quote only the relevant parts of the posting you are responding
to and write your answer directly beneath the part you are referring to.

Nobody knows what "this" is that you realize. From your quoting it looks
like you realize that you should use XML::LibXML or HTML::Parser. But
from the content of your reply it seems more likely you realize that you
should abide by the terms of use of any site you use. If so, you should
have inserted your response at the point I've marked with "======="
above. And if you don't intend to respond to the part about the tools
you should use, don't quote it (and change the subject, since the topic
is now no longer "Perl HTML searching" but "TOS of web pages").

hp

s...@netherlands.com

Mar 20, 2010, 10:43:21 AM
On Fri, 19 Mar 2010 21:40:14 +0000, Ben Morrow <b...@morrow.me.uk> wrote:

>
>Quoth Steve <st...@staticg.com>:
>>
>> Based on what you all said, I can make a more clear description.
>> Essentially, I'm trying to search craigslist more efficiently. I want
>
>Are you sure craigslist's Terms of Use allow this? Most sites of this
>nature don't.

There is no "Terms of Use" web page making a caller
agree to, or sign, a legal notarized document as a condition of usage.
It's a public record, available to be parsed, quoted or anything else,
by routers, virus scanners, BROWSERs, hosts filters, search engines,
Operating Systems, etc.

As for altering the content and viewing just what the viewer wants,
it's a one-way street. I filter ads, active controls/content, links
and anything else I want to.

Don't make me laugh, this lame phrase is just that -- LAME!

-sln

s...@netherlands.com

Mar 20, 2010, 5:17:22 PM

This might help you. Requires Perl 5.10 or better.

-sln

Output:
Specific Tag/Attr Titles found --
Hello:
"http://helloA.com"
"helloB.com"
no_title:
"/info/twitter.aspx"

All Tag/Attr found --
a-href:
"http://helloA.com"
"/info/twitter.aspx"
"helloB.com"
link-href:
"/includes/css/main.css"

Code:
# -------------------------------------------
# rx_html_href.pl
# -sln, 3/20/2010
#
# Util to extract some attribute/val's from
# html/xml
# -------------------------------------------

use strict;
use warnings;

my ($Name,$Rxmarkup);
InitName();

my $rxopen = "(?: $Name )"; # Open tag with 'href' attrib, cannot be empty alternation

#my $rxopen = "(?: a )"; # Open tag with 'href' attrib, cannot have an empty alternation
my $rxattr = "(?: href )"; # Attribute we seek, cannot have an empty alternation
my $rxclose = "(?: a )"; # Close tag to match with content, cannot have an empty alternation
my $rxtitle = "(?: Hello | )"; # Content Title, can be empty alternation

my %hTitles; # hash of titles => attribute values matching tag open, title, and tag close
my %hHrefs; # hash of tag => attribute values matching tag open expression, not necessarily titles

InitRegex();

##
# open my $fh, '<', 'C:/temp/XML/tennis1.html' or
# die "can't open file for input: $!";
# my $html = join '', <$fh>;
# close $fh;

my $html = join '', <DATA>;

##
ParseHref(\$html);

##
print "\nSpecific Tag/Attr Titles found --\n";
for my $key (keys %hTitles) {
    print " $key:\n";
    for my $val (@{$hTitles{$key}}) {
        print " $val\n";
    }
}

print "\nAll Tag/Attr found -- \n";
for my $key (keys %hHrefs) {
    print " $key:\n";
    for my $val (@{$hHrefs{$key}}) {
        print " $val\n";
    }
}

exit (0);


##
sub ParseHref
{
    my ($markup) = @_;
    my (
        $url,
        $title,
        $content,
        $tfound,
        $lcbpos,
        $last_content_pos,
        $begin_pos
    ) = ('','','',0,0,0,0);

    ## parse loop
    while ($$markup =~ /$Rxmarkup/g)
    {
        ## handle content buffer
        if (defined $+{C1}) {
            ## speed it up
            $content .= $+{C1};
            if (length $+{C2})
            {
                if ($lcbpos == pos($$markup)) {
                    $content .= $+{C2};
                } else {
                    $lcbpos = pos($$markup);
                    pos($$markup) = $lcbpos - 1;
                }
            }
            $last_content_pos = pos($$markup);
            next;
        }
        ## content here ... take it off
        if (length $content)
        {
            $begin_pos = $last_content_pos;
            ## check '<'
            if ($content =~ /</) {
                ## markup in content
                #print "Markup '<' in content, da stuff is crap!\n";
            }
            if ($content =~ /($rxtitle)/x && length $url) {
                $tfound = 1;
                $title = $1;
                $title =~ s/^\s*//;
                $title =~ s/\s*$//;
                $title = 'no_title' if !length($title);
            }
            $content = '';
        }
        ## markup here ... take it off
        if (defined $+{OPEN}) {
            push @{$hHrefs{$+{OPEN}.'-'.$+{ATTR}}}, $+{VAL} ;
            $url = $+{VAL};
            $tfound = 0;
            $title = '';
        }
        elsif (defined $+{CLOSE}) {
            if (length $url && $tfound) {
                push @{$hTitles{$title}}, $url;
            }
            $url = '';
            $tfound = 0;
            $title = '';
        }
    } ## end parse loop

    ## check for leftover content
    if (length $content)
    {
        ## check '<'
        if ($content =~ /</) {
            ## markup in content
            #print "Markup '<' in left over content, da stuff is crap!\n";
        }
    }
}

sub InitName
{
    my @UC_Nstart = (
        "\\x{C0}-\\x{D6}",
        "\\x{D8}-\\x{F6}",
        "\\x{F8}-\\x{2FF}",
        "\\x{370}-\\x{37D}",
        "\\x{37F}-\\x{1FFF}",
        "\\x{200C}-\\x{200D}",
        "\\x{2070}-\\x{218F}",
        "\\x{2C00}-\\x{2FEF}",
        "\\x{3001}-\\x{D7FF}",
        "\\x{F900}-\\x{FDCF}",
        "\\x{FDF0}-\\x{FFFD}",
        "\\x{10000}-\\x{EFFFF}",
    );
    my @UC_Nchar = (
        "\\x{B7}",
        "\\x{0300}-\\x{036F}",
        "\\x{203F}-\\x{2040}",
    );
    my $Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
    my $Nchar = "[\\w:.".join ('',@UC_Nchar).join ('',@UC_Nstart)."-]";
    $Name = "(?:$Nstrt$Nchar*)";
}

sub InitRegex
{
    $Rxmarkup = qr/
    (?:
        <
        (?:
            # Specific markup
            (?: (?<OPEN> $rxopen ) \s+[^>]*? (?<=\s) (?<ATTR> $rxattr) \s*=\s* (?<VAL> ".+?"|'.+?')[^>]*? \s* \/?) # OPEN, ATTR, VAL
            |(?: (?<CLOSE> \/$rxclose ) \s* ) # CLOSE

            # Ordinary exclusionary markup
            |(?: \/* $Name \s* \/*)
            |(?: $Name (?:\s+(?:".*?"|'.*?'|[^>]*?)+) \s* \/?)
            |(?: \?.*?\?)
            |(?:
                !
                (?: # markup types that have '!'
                    (?: DOCTYPE.*?)
                    |(?: \[CDATA\[.*?\]\])
                    |(?: --.*?--)
                    |(?: \[[A-Z][A-Z\ ]*\[.*?\]\]) # who knows?
                    |(?: ATTLIST.*?)
                    |(?: ENTITY.*?)
                    |(?: ELEMENT.*?)
                    # add more if necessary
                )
            )
        )
        >
    )
    # This alternation handles content
    | (?<C1> [^<]*) (?<C2> <?) # C1, C2
    /xs;

}


__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 # $ \ Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D "MSHTML 6.00.2900.3395" name=3DGENERATOR>

<STYLE></STYLE>
<test name = " thi<s # $ \ is a " test>
</HEAD>
<BODY bgColor=3D#ffffff>

should fix these: # $ \
but not these: &#21; &#xAF;
fix some here: &&%#$ &as; &&#a0

<a href="http://helloA.com">Hello</a>

<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif"
ALT = "A > # $ \ B">
<!-- <A comment # $ \ > -->
<NN & a # $ \>
<AA & # $ \>

<# Just data #>

<![INCLUDE CDATA [ >>>>>\\ # $ \ >>>>>>> ]]>

<!-- This section commented out.
<B>You can't # $ \ see me!</B>
-->

<link rel="stylesheet" type="text/css" href="/includes/css/main.css">


at root # $ \ > # $ \ level

<a href="/info/twitter.aspx" target="_top">
<img src="/images/icons/icon_twitter.gif" border="0" align="absmiddle">
</a>


<html><body>
<p>Hello
Kitty</p>
<a
href
=

"helloB.com"

Tad McClellan

Mar 20, 2010, 8:58:48 PM
s...@netherlands.com <s...@netherlands.com> wrote:
> On Fri, 19 Mar 2010 21:40:14 +0000, Ben Morrow <b...@morrow.me.uk> wrote:
>
>>
>>Quoth Steve <st...@staticg.com>:
>>>
>>> Based on what you all said, I can make a more clear description.
>>> Essentially, I'm trying to search craigslist more efficiently. I want
>>
>>Are you sure craigslist's Terms of Use allow this? Most sites of this
>>nature don't.
>
> There is no "Terms of Use" web page making a caller
> agree to, sign, a legal notorized document as a condition of usage.


There is no legal need to sign anything.

http://www.craigslist.org/about/terms.of.use

By using the Service in any way, you are agreeing to comply with the TOU.


> Its a public record,


Whether it is public or private does not matter either.

It is copyrighted either way.


> available to be parsed, quoted or anything else,
> by routers, virus scanners, BROWSERs, hosts filters, search engines,
> Operating Systems, etc..


The owner can impose whatever restrictions they want.

This license does not include:
...
(b) any collection, aggregation, copying, duplication, display
or derivative use of the Service nor any use of data mining,
robots, spiders, or similar data gathering and extraction tools
for any purpose unless expressly permitted by craigslist.


> As for alterring the content and viewing just what the viewer wants,
> its a one way street. I filter adds, active controls/content, links
> and anything else I want to.


Just because you violate the license you've been given does not
make it OK for others to also violate the license.

s...@netherlands.com

Mar 20, 2010, 11:25:27 PM
On Sat, 20 Mar 2010 19:58:48 -0500, Tad McClellan <ta...@seesig.invalid> wrote:

>s...@netherlands.com <s...@netherlands.com> wrote:
>> On Fri, 19 Mar 2010 21:40:14 +0000, Ben Morrow <b...@morrow.me.uk> wrote:
>>
>>>
>>>Quoth Steve <st...@staticg.com>:
>>>>
>>>> Based on what you all said, I can make a more clear description.
>>>> Essentially, I'm trying to search craigslist more efficiently. I want
>>>
>>>Are you sure craigslist's Terms of Use allow this? Most sites of this
>>>nature don't.
>>
>> There is no "Terms of Use" web page making a caller
>> agree to, sign, a legal notorized document as a condition of usage.
>
>
>There is no legal need to sign anything.
>
> http://www.craigslist.org/about/terms.of.use
>
> By using the Service in any way, you are agreeing to comply with the TOU.
>
>
>> Its a public record,
>
>
>Whether it is public or private does not matter either.
>
>It is copyrighted either way.

^^^^^^^^^^^^^^
There is nothing copyrighted about an href link. There is
nothing copyrighted about words, html, xml, browsers, nor
anything else that flows through the public airwaves, nor
is air, water or food copyrighted.

If craig has some unique combination of words that may
be considered "artful and unique" and apart from all others, that
may be extracted from their "public" broadcast, they would publish
it as literary content.

Otherwise, the computer rips apart, repackages, transmits data
as it sees fit, unless you think the HOSTS file violates that
"artful and unique" web page.

>
>
>> available to be parsed, quoted or anything else,
>> by routers, virus scanners, BROWSERs, hosts filters, search engines,
>> Operating Systems, etc..
>
>
>The owner can impose whatever restrictions they want.

^^^^^^^^^^^^^^^
No, they cannot. Give an example.

>
> This license does not include:
> ...

BEGIN Browser definition


> (b) any collection, aggregation, copying, duplication, display
> or derivative use of the Service nor any use of data mining,
> robots, spiders, or similar data gathering and extraction tools
> for any purpose unless expressly permitted by craigslist.

END Browser definition

>
>
>> As for alterring the content and viewing just what the viewer wants,
>> its a one way street. I filter adds, active controls/content, links
>> and anything else I want to.
>
>
>Just because you violate the license you've been given does not
>make it OK for others to also violate the license.

Just because you say it doesn't make it so.
It's not a movie, music, or literary art. It's a composition
of ordinary off-the-shelf components that can be broken down
and examined. Happens every day; it's public information, and
public information cannot be licensed for which craig has any
patent.

-sln

Peter J. Holzer

Mar 21, 2010, 4:56:17 AM
This is getting a bit off-topic, but ...

On 2010-03-21 03:25, s...@netherlands.com <s...@netherlands.com> wrote:
> On Sat, 20 Mar 2010 19:58:48 -0500, Tad McClellan <ta...@seesig.invalid> wrote:
>
>>s...@netherlands.com <s...@netherlands.com> wrote:
>>> On Fri, 19 Mar 2010 21:40:14 +0000, Ben Morrow <b...@morrow.me.uk> wrote:
>>>
>>>>
>>>>Quoth Steve <st...@staticg.com>:
>>>>>
>>>>> Based on what you all said, I can make a more clear description.
>>>>> Essentially, I'm trying to search craigslist more efficiently. I want
>>>>
>>>>Are you sure craigslist's Terms of Use allow this? Most sites of this
>>>>nature don't.
>>>
>>> There is no "Terms of Use" web page making a caller
>>> agree to, sign, a legal notorized document as a condition of usage.
>>
>>
>>There is no legal need to sign anything.
>>
>> http://www.craigslist.org/about/terms.of.use
>>
>> By using the Service in any way, you are agreeing to comply with
>> the TOU.

That may or may not be binding.


>>> Its a public record,
>>
>>
>>Whether it is public or private does not matter either.
>>
>>It is copyrighted either way.
> ^^^^^^^^^^^^^^
> There is nothing copyrighted about a href link. There is
> nothing copyrighted about words, html, xml, browsers, nor
> anything else that flows through the public airways, nor
> is air, water or food copyrighted.
>
> If craig has some unique combination of words that may
> be considered "artfull and unique" and apart from all others, that
> may be extracted from thier "public" broadcast, they would publish
> it as literrary content.

>>The owner can impose whatever restrictions they want.
> ^^^^^^^^^^^^^^^
> No, they cannot. Give an example.

"whatever restrictions they want" is too strong. The copyright law has
some limits.


>> This license does not include:
>> ...
>
> BEGIN Browser definition
>> (b) any collection, aggregation, copying, duplication, display
>> or derivative use of the Service nor any use of data mining,
>> robots, spiders, or similar data gathering and extraction tools
>> for any purpose unless expressly permitted by craigslist.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> END Browser definition

I would assume that viewing stuff in the browser is expressly permitted
by craigslist.


>>> As for alterring the content and viewing just what the viewer wants,
>>> its a one way street. I filter adds, active controls/content, links
>>> and anything else I want to.
>>
>>
>>Just because you violate the license you've been given does not
>>make it OK for others to also violate the license.
>
> Just because you say it doesen't make it so. Its not a movie, music,
> literrary art. Its a composition of ordinary off the shelf components
> that can be broken down and examined. Happens every day, its public
> information, and public information cannot be licensed for which craig
> has any patent.

Don't know about the US, but in Europe "a composition of ordinary ...
information" is more strongly protected by copyright law than "a movie,
music, literary art". Because while the latter need to be "artful and
unique" as you say (in Austrian law the term is "Werkshöhe"), no such
restriction exists for databases. So if you compile a list of the
students of your final year in high school, that's copyrighted. Same for
the data on craigslist.

(Similarly for programs: A "hello world" program is copyrighted, a
literary work of the same originality wouldn't be - but that's not the
point here)

hp

s...@netherlands.com

Mar 22, 2010, 4:43:03 PM
On Sun, 21 Mar 2010 09:56:17 +0100, "Peter J. Holzer" <hjp-u...@hjp.at> wrote:

[snip]

>Don't know about the US, but in Europe "a composition of ordinary ...
>information" is more strongly protected by copyright law than "a movie,
>music, literrary art". Because while the former need to be "artful and
>unique" as you say (in Austrian law the term is "Werkshöhe"), no such
>restriction exists for databases. So if you if you compile a list of the
>students of your final year in high school, that's copyrighted. Same for
>the data on craigs list.
>

I would say a "list" is just that and nothing more, not copyrighted at
all. A list of students is not unique nor copyrighted. The published
year book is copyrighted as an entity, not the parts. A list of
credit card names and numbers is not copyrighted either and is not
published for legal reasons. Besides that, lists are not unique in the
sense that they are composed of common publicly obtained (private
information or not, but obtained from public sources) items that
individually or collectively cannot be copyrighted. You just can't
say you have invented a unique color from 24-bit registers.

The point at which something becomes copyrightable is blurred.
A word/phrase in a book? Probably not. A sequential paragraph or two
in a book? Probably so. It's unique and highly unlikely to be randomly
duplicated. That is not the case for public information that can be
filtered to create a comparable list. In this case, the dimensions
of information are too easy to duplicate, unlike that of say a few
paragraphs of a book.

It is not likely that public information can be wrapped in a list
structure and its contents declared copyrighted. A copyright label is
attached to everything in general. It doesn't even need to be filed
with the copyright office. When in doubt, just 'say' it's copyrighted
in a flimsy 'Terms Of Use', then blast it out in an uncontrolled public
fashion. Yeah, that's legal to do, but it holds no weight, especially
when the listed items themselves are not copyrighted or trademarked,
and are otherwise specifically public or general-knowledge information
in nature.

-sln

John Bokma

Mar 22, 2010, 7:26:02 PM
s...@netherlands.com writes:

> On Sun, 21 Mar 2010 09:56:17 +0100, "Peter J. Holzer" <hjp-u...@hjp.at> wrote:
>
> [snip]
>
>>Don't know about the US, but in Europe "a composition of ordinary ...
>>information" is more strongly protected by copyright law than "a
>>movie,

If this European law is the same as the Dutch Databanken-recht "database
law", a database is protected under that law if substantial effort has
been put into compiling it. This is *not* copyright, however; it's a
separate law.

> I would say a "list" is just that and nothing more, not copyrighted at
> all. A list of students is not unique nor copyrighted.

Correct, and under the Dutch law such a list is only protected if
substantial effort has been put into its compilation. So there is
probably no way you can protect a list of 500 students, but most likely
you can if such a list has thousands and thousands of students, and
effort has been put into keeping each student's address up to date,
etc.

IANAL,

--
John Bokma j3b

Hacking & Hiking in Mexico - http://johnbokma.com/
http://castleamber.com/ - Perl & Python Development

Mart van de Wege

Mar 23, 2010, 5:46:08 AM
Tad McClellan <ta...@seesig.invalid> writes:

> s...@netherlands.com <s...@netherlands.com> wrote:
>> On Fri, 19 Mar 2010 21:40:14 +0000, Ben Morrow <b...@morrow.me.uk> wrote:
>>
>>>
>>>Quoth Steve <st...@staticg.com>:
>>>>
>>>> Based on what you all said, I can make a more clear description.
>>>> Essentially, I'm trying to search craigslist more efficiently. I want
>>>
>>>Are you sure craigslist's Terms of Use allow this? Most sites of this
>>>nature don't.
>>
>> There is no "Terms of Use" web page making a caller
>> agree to, sign, a legal notorized document as a condition of usage.
>
>
> There is no legal need to sign anything.
>
> http://www.craigslist.org/about/terms.of.use
>
> By using the Service in any way, you are agreeing to comply with the TOU.

Irrelevant.

The protocols do not specify what I should or should not GET from an
HTTP server. If I am using a text-based browser, I don't download
images, for example.

And Terms of Use are nice, but unless you can prove I read them, you
cannot force me to abide by them.

Not using robots is common courtesy; Terms of Use have no legal power to
stop me from using them.

Mart

--
"We will need a longer wall when the revolution comes."
--- AJS, quoting an uncertain source.

Peter J. Holzer

Mar 23, 2010, 11:18:23 AM
On 2010-03-22 23:26, John Bokma <jo...@castleamber.com> wrote:
> s...@netherlands.com writes:
>> On Sun, 21 Mar 2010 09:56:17 +0100, "Peter J. Holzer" <hjp-u...@hjp.at> wrote:
>>>Don't know about the US, but in Europe "a composition of ordinary ...
>>>information" is more strongly protected by copyright law than "a
>>>movie,
>
> If this European law is the same as the Dutch Databanken-recht "database
> law", a database is protected under that law if there has been put
> substantial effort into the compilation of such a database.

Or "if the selection or arrangement are his own creation" (my
translation from Austrian UrhG, §40f). As I read it, only one of these
criteria needs to be fulfilled for protection.

An example of the former category is a telephone book: The "selection or
arrangement" are not the creation of the publisher: They have been the
same for decades. But compiling the data and keeping it up to date is a
substantial investment, so it is protected (unless you are a phone
company - then you have the data anyway, so there is no investment, and
hence no protection (say the Judges[1][2])).

But if I come up with a new way to arrange the data (let's say I sort
them by phone number instead of name (well, that isn't that new, but it
serves as an example)), then this new database is protected even if I
didn't have a substantial investment in the data.


> This is *not* copyright however, it's a separate law.

Strictly speaking, "copyright" doesn't exist in continental Europe.
What is called "Urheberrecht" in German emerged during the French
revolution and is based on quite different ideas. But since in practice
the difference is almost non-existent and there doesn't seem to be a
commonly accepted English term for this law, I talk about "copyright"
unless the topic is the difference between these laws.

In Austria (and, AFAIK, Germany) IP rights for databases are part of
the "Urheberrechtsgesetz" (§40f, §76, etc. in Austria). It is possible
that the Netherlands made it a separate law, but that doesn't matter -
the contents are (substantially) the same.


>> I would say a "list" is just that and nothing more, not copyrighted at
>> all. A list of students is not unique nor copyrighted.
>
> Correct, and under the Dutch law such a list is only protected if there
> has been put a substantial effort into its compilation. So there is
> probably no way you can protect a list of 500 students,

You can if you come up with a "selection or arrangement of your own
creation". Or maybe you can argue that collecting data about 500
students was a substantial effort (depending on the students and the
data this may be true).

> but most likely you can if such a list has thousands and thousands of
> students, and effort has been put into keeping the addresses of each
> student actual, etc.
>
> IANAL,

Neither am I.

hp

[1] It was actually about horse races, not phone numbers, but I think
that makes no difference.
[2] The lecturer who mentioned this example wasn't very fond of the law.
If I say that he called it a complete failure I'm only slightly
exaggerating.

Randal L. Schwartz

Mar 23, 2010, 11:14:37 AM
>>>>> "sln" == sln <s...@netherlands.com> writes:

sln> I would say a "list" is just that and nothing more, not copyrighted at
sln> all.

And since lawyers disagree with you, a smart person would be wise to ignore
you and find a lawyer.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<mer...@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion

Willem

Mar 23, 2010, 1:29:47 PM
Mart van de Wege wrote:
) Irrelevant.
)
) The protocols do not specify what I should or should not GET from an
) HTTP server. If I am using a text-based browser, I don't download
) images, for example.
)
) And Terms of Use are nice, but unless you can prove I read them, you
) cannot force me to abide by them.

One could argue that you're *not* allowed to do anything whatsoever
with a web page, *except* when the copyright holder allows it.
Which he does, obviously, through a terms-of-use agreement.

With that basis, viewing a web page *does* imply that you agree with
the terms of use, because if you didn't, then you would not have had
the right to download and view anything.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

s...@netherlands.com

Mar 23, 2010, 1:58:20 PM
On Tue, 23 Mar 2010 08:14:37 -0700, mer...@stonehenge.com (Randal L. Schwartz) wrote:

>>>>>> "sln" == sln <s...@netherlands.com> writes:
>
>sln> I would say a "list" is just that and nothing more, not copyrighted at
>sln> all.
>
>And since lawyers disagree with you, a smart person would be wise to ignore
>you and find a lawyer.

No need to do that. Below is a general explanation of what a copyright
is. Nope, nothing about "lists" being copyrighted. Even if you could extrapolate
and declare that a single-field, filtered list from a table is a "database", it is
not a distinct collection of uncommon information, nor is it substantive enough in nature
to even qualify as a database.

It's a huge leap to say a list is copyrighted; if it were, it would be a "related right"
as a database, with extreme limitations and qualifications. Even then, only the EU
recognizes it as such, NOT the United States nor Australia.

In reality, a "list" is just a collection of uncreative facts, nothing more,
not copyrighted at all.

-sln

COPYRIGHT
----------
Copyright is the set of exclusive rights granted to the author or
creator of an original work, including the right to copy, distribute
and adapt the work. These rights can be licensed,
transferred and/or assigned.

The type of works which are subject to copyright has been expanded
over time. Initially only covering books, copyright law was revised
in the 19th century to include maps, charts, engravings, prints,
musical compositions, dramatic works, photographs, paintings,
drawings and sculptures. In the 20th century copyright was expanded
to cover motion pictures, computer programs, sound recordings,
dance and architectural works.

Copyright law is typically designed to protect the fixed expression
or manifestation of an idea rather than the fundamental idea itself.

RELATED RIGHTS
----------------
Related rights is used to describe database rights, public lending
rights (rental rights), artist resale rights and performers’ rights.

Related rights award copyright protection to works which are
not author works, but rather technical media works which allowed
author works to be communicated to a new audience in a different
form. The substance of protection is usually not as great as
there is for author works.

- DATABASES
EU:
In European Union law, a database right is a legal right,
introduced in 1996. Database rights are specifically coded
(i.e. sui generis) laws on the copying and dissemination
of information in computer databases.
... giving a specific and separate legal rights
(and limitations) to certain computer records.
Rights afforded to manual records under EU database right
law are similar in format, but not identical,
to those afforded artistic works.

United States:
Uncreative collections of facts are outside of
Congressional authority under Article I, § 8, cl. 8,
i.e. the Copyright Clause, of the United States
Constitution, therefore no database right exists
in the United States.

Australia:
No specific law exists in Australia protecting databases.
Databases may only be protected if they fall under general
copyright law. Australian copyright law protects "compilations",
which can include databases, phone books, etc.
This copyright protection only covers the unique arrangement
of data within the compilation, however, not the data itself.

sreservoir

Mar 23, 2010, 8:52:26 PM
On 3/23/2010 1:29 PM, Willem wrote:
> Mart van de Wege wrote:
> ) Irrelevant.
> )
> ) The protocols do not specify what I should or should not GET from an
> ) HTTP server. If I am using a text-based browser, I don't download
> ) images, for example.
> )
> ) And Terms of Use are nice, but unless you can prove I read them, you
> ) cannot force me to abide by them.
>
> One could argue that you're *not* allowed to do anything whatsoever
> with a web page, *except* when the copyright holder allows it.
> Which he does, obviously, through a terms-of-use agreement.
>
> With that basis, viewing a web page *does* imply that you agree with
> the terms of use, because if you didn't, then you would not have had
> the right to download and view anything.

of course, as the terms of use are on the website, reading them implies
agreeing to them. so.

--

"Six by nine. Forty two."
"That's it. That's all there is."
"I always thought something was fundamentally wrong with the universe."

Mart van de Wege

Mar 24, 2010, 10:49:29 AM
Willem <wil...@turtle.stack.nl> writes:

> Mart van de Wege wrote:
> ) Irrelevant.
> )
> ) The protocols do not specify what I should or should not GET from an
> ) HTTP server. If I am using a text-based browser, I don't download
> ) images, for example.
> )
> ) And Terms of Use are nice, but unless you can prove I read them, you
> ) cannot force me to abide by them.
>
> One could argue that you're *not* allowed to do anything whatsoever
> with a web page, *except* when the copyright holder allows it.
> Which he does, obviously, through a terms-of-use agreement.
>

Yeah, but that works two ways. One could also argue that putting
information on a publicly reachable server, using a protocol
specifically designed for publishing, without access controls, implies
that you want the world to read your pages.

IMO, using a robot that doesn't GET faster than a human would is about
as bad as using Lynx.

Kyle T. Jones

Mar 24, 2010, 2:54:34 PM
Tad McClellan wrote:
> Kyle T. Jones <KBf...@realdomain.net> wrote:
>> Steve wrote:
>
>>> like lets say I searched a site
>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>> would want to extract only the links that said hello in the title.
>> Read up on perl regular expressions.
>
>
> While reading up on regular expressions is certainly a good idea,
> it is a horrid idea for the purposes of parsing HTML.
>

Ummm. Could you expand on that?

My initial reaction would be something like - I'm pretty sure *any*
method, including the use of HTML::LinkExtor, or XML transform (both
outlined upthread), involves using regular expressions "for the purposes
of parsing HTML".

At best, you're just abstracting the regex work back to the includes.
AFAIK, and feel free to correct me (I'll go take a look at some of the
relevant module code in a bit), every CPAN module that is involved with
parsing HTML uses fairly straightforward regex matching somewhere within
that module's methods.

I think there's an argument that, considering you can do this so easily
(in under 15 lines of code) without the overhead of unnecessary
includes, my way would be more efficient. We can run some benchmarks if
you want (see further down for working code).

> Have you read the FAQ answers that mention HTML?
>
> perldoc -q HTML
>
>
>> for instance, taking the above, you might first split it into a
>> "one-line per" array -
>>
>> @stuff=split(/\n/, $content);
>>
>> then parse each line for hello -
>>
>> foreach(@stuff){
>> if($_=~/Hello/){
>> do whatever;}
>> }
>
>
> The code below prints "do whatever" 3 times, but there is only one link
> containing "Hello"...
>

I should have been clearer - the above wasn't a "solution", meant to be
copied, pasted, and put into use - it was just meant to illustrate the
basic operation.

I think this works fine:

#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::Simple;

my $targeturl="http://www.google.com";
my $searchstring="google";
my $contents=get($targeturl);
my @semiparsed=split(/href/i, $contents);

foreach(@semiparsed){
if($_=~/^\s*=\s*('|")(.*?)('|")/){
my $link=$2;
if($link=~/$searchstring/i){
print "Link: $link\n";
}
}
}

OUTPUT:

Link: http://images.google.com/imghp?hl=en&tab=wi
Link: http://video.google.com/?hl=en&tab=wv
Link: http://maps.google.com/maps?hl=en&tab=wl
Link: http://news.google.com/nwshp?hl=en&tab=wn
Link: http://www.google.com/prdhp?hl=en&tab=wf
Link: http://mail.google.com/mail/?hl=en&tab=wm
Link: http://www.google.com/intl/en/options/
Link: /url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg
Link: https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/
Link: /aclk?sa=L&ai=CbpBLOFeqS_gX3ZmVB_SbuZINs_2WoQHf44OSEMHZnNkTEAEgwVRQpuf5xAJgPaoEhQFP0M0ypnTnQAI3b4WYFAHIvHiLv4iZWVehmiie-78BOdRJQOj6QayRkYYHH4cKXyaNmAp2rmQiiPSHxtEyaVD5OZo41Kxvy6SAeAAF6CIw-SQAFsLT-9iHRfJUcoYh4qlpGqGbC080ZVCWlUUipS404rornNJFmeGlP89sgXehqOfpe8uL&num=1&sig=AGiWqtw95aIEfk5F25oGM2i6eMwkBBuj6Q&q=http://www.google.com/doodle4google/


Or, if you're only interested in the http/https links, you can do this:

#!/usr/bin/perl -w
use strict;
use warnings;
use LWP::Simple;

my $targeturl="http://www.google.com";
my $searchstring="google";
my $contents=get($targeturl);
my @semiparsed=split(/href/i, $contents);

foreach(@semiparsed){
if($_=~/^\s*=\s*('|")(http.*?)('|")/i){
my $link=$2;
if($link=~/$searchstring/i){
print "Link: $link\n";
}
}
}

OUTPUT:

Link: http://images.google.com/imghp?hl=en&tab=wi
Link: http://video.google.com/?hl=en&tab=wv
Link: http://maps.google.com/maps?hl=en&tab=wl
Link: http://news.google.com/nwshp?hl=en&tab=wn
Link: http://www.google.com/prdhp?hl=en&tab=wf
Link: http://mail.google.com/mail/?hl=en&tab=wm
Link: http://www.google.com/intl/en/options/
Link: https://www.google.com/accounts/Login?hl=en&continue=http://www.google.com/

Like I said, if you want to present a different method where you push
all the regex work off to an include like HTML::LinkExtor, please post
it, and I can run both using a benchmark module to determine which
method is more efficient. I could be way off, here - maybe using one or
more of the modules mentioned in this thread somehow improves
efficiency. If so, please let me know.
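
For anyone who wants to try the comparison, here is a minimal sketch of how such a benchmark might be set up with the core Benchmark module. The in-memory page, the sub name links_by_regex and the candidate labels are made up for illustration. The HTML::LinkExtor candidate is only added if that module happens to be installed, and timing an in-memory string deliberately leaves network latency out of the picture.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical in-memory page, so the timing excludes any network fetch.
my $html = join '',
    map { qq{<p><a href="http://example.com/page$_">Item $_</a></p>\n} } 1 .. 200;

# The split-on-href approach from this post, wrapped in a sub.
sub links_by_regex {
    my ($contents) = @_;
    my @links;
    foreach my $chunk ( split /href/i, $contents ) {
        push @links, $1 if $chunk =~ /^\s*=\s*['"](.*?)['"]/;
    }
    return @links;
}

my %candidates = ( regex => sub { my @l = links_by_regex($html) } );

# Add the module-based candidate only if HTML::LinkExtor can be loaded.
if ( eval { require HTML::LinkExtor; 1 } ) {
    $candidates{linkextor} = sub {
        my $p = HTML::LinkExtor->new;
        $p->parse($html);
        $p->eof;
        # Without a callback, links() returns [tag, attr => url, ...] entries.
        my @l = map { $_->[2] } $p->links;
    };
}

# Run each candidate for at least one CPU second and print a comparison.
cmpthese( -1, \%candidates );
```

Whatever the numbers turn out to be, they only measure speed on this one well-behaved input; they say nothing about which approach gives correct results on messier pages.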

By the way - I can think of wrenches to throw into this solution, too -
addressing the use of ' or " inside a link, for instance - but, then, I
could throw "you prolly won't ever see this but it's theoretically
possible" wrenches into most of the HTML parsing CPAN modules, too, so...

Cheers.

Jürgen Exner

Mar 24, 2010, 3:15:53 PM
"Kyle T. Jones" <KBf...@realdomain.net> wrote:
>Tad McClellan wrote:
>> Kyle T. Jones <KBf...@realdomain.net> wrote:
>>> Steve wrote:
>>
>>>> like lets say I searched a site
>>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>>> would want to extract only the links that said hello in the title.
>>> Read up on perl regular expressions.
>>
>>
>> While reading up on regular expressions is certainly a good idea,
>> it is a horrid idea for the purposes of parsing HTML.
>>
>
>Ummm. Could you expand on that?
>
>My initial reaction would be something like - I'm pretty sure *any*
>method, including the use of HTML::LinkExtor, or XML transform (both
>outlined upthread), involves using regular expressions "for the purposes
>of parsing HTML".

Regular expressions recognize regular languages. But HTML is a
context-free language and therefore cannot be recognized solely by a
regular parser.
Having said that, Perl's extended regular expressions are indeed more
powerful than regular ones, but it is still a bad idea because the
expressions become way too complex.

>At best, you're just abstracting the regex work back to the includes.
>AFAIK, and feel free to correct me (I'll go take a look at some of the
>relevant module code in a bit), every CPAN module that is involved with
>parsing HTML uses fairly straightforward regex matching somewhere within
>that module's methods.

Using REs to do _part_ of the work of parsing any language is a
no-brainer; of course everyone does it, e.g. in the tokenizer.

But unless your language is a regular language (and there aren't many
useful regular languages because regular is just too restrictive) you
need additional algorithms that cannot be expressed as REs to actually
parse a context-free or context-sensitive language.

>I think there's an argument that, considering you can do this so easily
>(in under 15 lines of code) without the overhead of unnecessary
>includes, my way would be more efficient. We can run some benchmarks if
>you want (see further down for working code).

But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
Theory of Computer Languages or Basics of Compiler Construction?
What do people learn in Computer Science today?

jue

Ben Morrow

Mar 24, 2010, 5:08:40 PM

Quoth Jürgen Exner <jurg...@hotmail.com>:

>
> But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
> Theory of Computer Languages or Basics of Compiler Construction?
> What do people learn in Computer Science today?

I suspect that most people writing Perl have never formally studied
Computer Science. I certainly haven't, though I've picked up a fair bit
of the theory along the way because I'm interested.

Ben

Kyle T. Jones

Mar 24, 2010, 6:55:29 PM

But isn't the Chomsky Hierarchy completely irrelevant in this (forgive
the pun) context? Surely you "get" that my input is analyzed in terms
of being nothing more or less than a sequence of characters - that it
was originally written in HTML, or any other CFG-based language, is
meaningless - both syntactical and semantical considerations of that
original language are irrelevant in the (again, forgive me) context of
what I'm attempting - which is simply to match one finite sequence of
characters against another finite sequence of characters - I could care
less what those characters mean, what href indicates, what a <body> tag
is, etc.

I don't need to understand English to count the # of e's in the above
passage, right? Neither does Perl.

I believe what you say above is true - to truly "parse" the page AS HTML
is beyond the ability of REs - but I'm not parsing anything AS HTML, if
that makes sense. In fact, to take that a step further, I'm not
"parsing" period - so perhaps it was a mistake for me to use that term.
I meant to use the term colloquially, sorry if that caused any confusion.

Cheers.


" 'Regular expressions' [...] are only marginally related to real
regular expressions. Nevertheless, the term has grown with the
capabilities of our pattern matching engines, so I'm not going to try to
fight linguistic necessity here. I will, however, generally call them
"regexes" (or "regexen", when I'm in an Anglo-Saxon mood)" - Larry Wall

Tad McClellan

Mar 24, 2010, 7:10:33 PM
Kyle T. Jones <KBf...@realdomain.net> wrote:
> Tad McClellan wrote:
>> Kyle T. Jones <KBf...@realdomain.net> wrote:
>>> Steve wrote:
>>
>>>> like lets say I searched a site
>>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>>> would want to extract only the links that said hello in the title.
>>> Read up on perl regular expressions.
>>
>>
>> While reading up on regular expressions is certainly a good idea,
>> it is a horrid idea for the purposes of parsing HTML.
>>
>
> Ummm. Could you expand on that?


I think the FAQ answer does a pretty good job of it.


> My initial reaction would be something like - I'm pretty sure *any*
> method, including the use of HTML::LinkExtor, or XML transform (both
> outlined upthread), involves using regular expressions "for the purposes
> of parsing HTML".


"pattern matching" is not at all the same as "parsing".

Regular expressions are *great* for pattern matching.

It is mathematically impossible to do a proper parse of a context-free
language such as HTML with nothing more than regular expressions.

They do not contain the requisite power.

Google for the "Chomsky hierarchy".

HTML allows a table within a table within a table within a table,
to an arbitrary depth. ie. it is not "regular".


> I think there's an argument that, considering you can do this so easily
> (in under 15 lines of code) without the overhead of unnecessary
> includes, my way would be more efficient.


Do you want easy and wrong or hard and correct?


> you want (see further down for working code).


You have a strange definition of "working"...


>> Have you read the FAQ answers that mention HTML?
>>
>> perldoc -q HTML


Did you try that yet?

It points out at least one way that your code below can fail.


> I think this works fine:


You just haven't used a data set that exposes its flaws.

You are not "parsing", you are "pattern matching".

"pattern matching" is often "good enough", but you should realize
its fragility so that you can assess whether it is worth the ease
of implementation or not.


> #!/usr/bin/perl -w
                  ^^
                  ^^
> use strict;
> use warnings;
      ^^^^^^^^


Turning on warnings 2 times is kind of silly...

Lose the command line switch, lexical warnings are much better.


Try it with this:

-------------------
my $contents = '
<html><body>
<!--
this is NOT a link...
<a href="google.com">Google</a>
-->
</body></html>
';
-------------------


It will make output when it should make none.
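
To make that concrete, here is a small self-contained sketch that feeds the test document above to the same split-on-href logic. The only changes are that the page is a literal string rather than fetched, and the matches are collected in an array (@reported, a name chosen here) before printing.

```perl
use strict;
use warnings;

# The test document: its only <a> tag sits inside an HTML comment.
my $contents = <<'HTML';
<html><body>
<!--
  this is NOT a link...
  <a href="google.com">Google</a>
-->
</body></html>
HTML

my $searchstring = 'google';

# Same logic as the split-on-href script, collecting matches in a list.
my @reported;
foreach my $chunk ( split /href/i, $contents ) {
    if ( $chunk =~ /^\s*=\s*('|")(.*?)('|")/ ) {
        my $link = $2;
        push @reported, $link if $link =~ /$searchstring/i;
    }
}

# The commented-out link is reported as if it were live.
print "Link: $_\n" for @reported;    # prints "Link: google.com"
```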


> my @semiparsed=split(/href/i, $contents);
>
> foreach(@semiparsed){
> if($_=~/^\s*=\s*('|")(.*?)('|")/){


Gak!

Whitespace is not a scarce resource, feel free to use as much of it
as you like to make your code easier to read and understand.

Character classes are much more efficient than alternation.

Either be explicit in both places:

foreach $_ (
if ( $_ =~ /...

or in neither:

foreach (
if ( /...

be consistent.

So, let's rewrite that line as an experienced Perl programmer might:

if ( /^\s*=\s*['"](.*?)['"]/ ) { # now link will be in $1 instead of $2


Also, your code does not address the OP's question.

It tests the URL for a string rather than testing the <a> tag's _contents_.

That is, he wanted to test

<a href="...">...</a>
              ^^^
              ^^^ here

rather than

<a href="...">...</a>
         ^^^
         ^^^
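
One way to test the link text rather than the URL is sketched below with HTML::TokeParser, which ships in the HTML-Parser distribution on CPAN (so it has to be installed; it is not in core). The sample page, the /hello/i pattern and the variable names are assumptions made up for the example, not anything taken from the OP's site.

```perl
use strict;
use warnings;
use HTML::TokeParser;    # from the HTML-Parser distribution on CPAN

# Hypothetical page: two real links plus one that is commented out.
my $html = <<'HTML';
<html><body>
<a href="/news/1">Hello world</a>
<a href="/news/2">Goodbye</a>
<!-- <a href="/news/3">Hello from a comment</a> -->
</body></html>
HTML

my $wanted = qr/hello/i;
my @matches;

my $parser = HTML::TokeParser->new( \$html )
    or die "Cannot create parser";

# get_tag('a') skips ahead to the next <a> start tag. Comments are separate
# tokens, so the commented-out link is never returned as a tag.
while ( my $tag = $parser->get_tag('a') ) {
    my $href = $tag->[1]{href};
    next unless defined $href;
    my $text = $parser->get_trimmed_text('/a');    # text up to </a>
    push @matches, $href if $text =~ $wanted;
}

print "Link: $_\n" for @matches;
```

For a real page the string would come from LWP::Simple's get() instead of a heredoc; the loop itself stays the same.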

Ben Morrow

Mar 24, 2010, 7:21:41 PM

Quoth "Kyle T. Jones" <KBf...@realdomain.net>:

>
> But isn't the Chomsky Hierarchy completely irrelevant in this (forgive
> the pun) context? Surely you "get" that my input is analyzed in terms
> of being nothing more or less than a sequence of characters - that it
> was originally written in HTML, or any other CFG-based language, is
> meaningless - both syntactical and semantical considerations of that
> original language are irrelevant in the (again, forgive me) context of
> what I'm attempting - which is simply to match one finite sequence of
> characters against another finite sequence of characters - I could care
> less what those characters mean, what href indicates, what a <body> tag
> is, etc.

This is correct, and treating HTML (or whatever) as plain text for the
purposes of grabbing something you want can be a valuable technique.
It's worth being aware that it's basically a hack, though, and that a
problem like 'find all the links in this document' is much better solved
by parsing the HTML properly than by trying to construct a regex to
match all possible forms of <a> tag.

Ben

Ben Morrow

Mar 24, 2010, 7:40:52 PM

Quoth Tad McClellan <ta...@seesig.invalid>:

>
> "pattern matching" is not at all the same as "parsing".
>
> Regular expressions are *great* for pattern matching.
>
> It is mathematically impossible to do a proper parse of a context-free
> lanuguage such as HTML with nothing more than regular expressions.
>
> They do not contain the requisite power.
>
> Google for the "Chomsky hierarchy".
>
> HTML allows a table within a table within a table within a table,
> to an arbitrary depth. ie. it is not "regular".

Perl's regexen are not regular. With the new features in 5.10 it's easy
to match something like that (it was possible before with (??{}), but
not easy):

    perl -E'"[[][[][]]]" =~ m!(?<nest> \[ (?&nest)* \] )!x
            and say $+{nest}'
    [[][[][]]]

Building a proper grammar for something like HTML would be harder,
especially if you wanted to keep it readable, but I expect it would be
possible. Certainly something simple that tracked comment/not-comment/
tag/not-tag would not be too hard, and would be sufficient for many
purposes.
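
As a script rather than a one-liner, the same recursive pattern might look like the sketch below. It assumes Perl 5.10 or later; the \A and \z anchors and the helper name is_balanced are additions for illustration, so that the whole string has to be one balanced group instead of merely containing one.

```perl
use strict;
use warnings;
use 5.010;    # named captures and (?&name) recursion need Perl 5.10+

# Anchored variant of the pattern above: the entire string must be a
# single balanced bracket group.
my $balanced = qr/
    \A
    (?<nest> \[ (?&nest)* \] )
    \z
/x;

sub is_balanced {
    my ($string) = @_;
    return $string =~ $balanced ? 1 : 0;
}

say "$_ => ", ( is_balanced($_) ? 'balanced' : 'not balanced' )
    for '[[][[][]]]', '[[]', '[]]';
```

Only the first of the three strings is reported as balanced. No classical regular expression can check this for arbitrary depth, which is the sense in which these regexen are "not regular".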

> > I think there's an argument that, considering you can do this so easily
> > (in under 15 lines of code) without the overhead of unnecessary
> > includes, my way would be more efficient.
>
>
> Do you want easy and wrong or hard and correct?

I want easy and correct, so I'll use a module :).

> "pattern matching" is often "good enough", but you should realize
> its fragility so that you can assess whether it is worth the ease
> of implementation or not.

I just quoted that because I think it bears repeating.

Ben

Jürgen Exner

Mar 24, 2010, 8:29:24 PM
"Kyle T. Jones" <KBf...@realdomain.net> wrote:
>Jürgen Exner wrote:
>> "Kyle T. Jones" <KBf...@realdomain.net> wrote:
>>> Tad McClellan wrote:
>>>> Kyle T. Jones <KBf...@realdomain.net> wrote:
>>>>> Steve wrote:
>>>>>> like lets say I searched a site
>>>>>> that had 15 news links and 3 of them said "Hello" in the title. I
>>>>>> would want to extract only the links that said hello in the title.
>>>>> Read up on perl regular expressions.
>>>>
>>>> While reading up on regular expressions is certainly a good idea,
>>>> it is a horrid idea for the purposes of parsing HTML.
>>>>
>>> Ummm. Could you expand on that?
[...]

>> Regular expressions recognize regular languages. But HTML is a
>> context-free language and therefore cannot be recognized solely by a
>> regular parser.
[...]

>> But you cannot! Ever heard of the Chomsky Hierarchy? No recollection of
>> Theory of Computer Languages or Basics of Compiler Construction?
>> What do people learn in Computer Science today?
>
>But isn't the Chomsky Hierarchy completely irrelevant in this (forgive
>the pun) context? Surely you "get" that my input is analyzed in terms
>of being nothing more or less than a sequence of characters - that it
>was originally written in HTML, or any other CFG-based language, is
>meaningless - both syntactical and semantical considerations of that
>original language are irrelevant in the (again, forgive me) context of
>what I'm attempting - which is simply to match one finite sequence of
>characters against another finite sequence of characters - I could care
>less what those characters mean, what href indicates, what a <body> tag
>is, etc.

True. If you know exactly what format your input can possibly have (and
if that input can be described using a finite state automaton) then by
all means yes, go for it. REs are perfect for such tasks.

But that is not what you have been asking, see the Subject of this
thread.

>I believe what you say above is true - to truly "parse" the page AS HTML
>is beyond the ability of REs - but I'm not parsing anything AS HTML, if
>that makes sense. In fact, to take that a step further, I'm not
>"parsing" period - so perhaps it was a mistake for me to use that term.
> I meant to use the term colloquially, sorry if that caused any confusion.

Well, yes and no. If you are in control of the format and you know
exactly what format is allowed and which formats are not allowed, then
you are right.
But if you are not in control of the input format, e.g. you are reading
from a third-party web page or you get your input data from finance or
marketing or the subsidiary on the opposite side of the world, then your
code must be able to handle any legal HTML because the format could be
changed on you at any time. Which in turn means you must formally parse
the HTML code as HTML code, there is just no way around it.

jue

s...@netherlands.com

Mar 24, 2010, 9:36:11 PM
On Wed, 24 Mar 2010 23:40:52 +0000, Ben Morrow <b...@morrow.me.uk> wrote:

>
>Quoth Tad McClellan <ta...@seesig.invalid>:
>>
>> "pattern matching" is not at all the same as "parsing".
>>
>> Regular expressions are *great* for pattern matching.
>>
>> It is mathematically impossible to do a proper parse of a context-free
>> lanuguage such as HTML with nothing more than regular expressions.
>>
>> They do not contain the requisite power.
>>
>> Google for the "Chomsky hierarchy".
>>
>> HTML allows a table within a table within a table within a table,
>> to an arbitrary depth. ie. it is not "regular".
>
>Perl's regexen are not regular. With the new features in 5.10 it's easy
>to match something like that (it was possible before with (??{}), but
>not easy):
>
> perl -E'"[[][[][]]]" =~ m!(?<nest> \[ (?&nest)* \] )!x
> and say $+{nest}'
> [[][[][]]]
>

^^^^^^^^^^
All this shows is balanced character '[' ']' matching using the
recursive ability of the 5.10 engine.

Could this be an example such that each square bracket is a
markup instruction, like <tag> ?
It certainly doesn't pertain to the '<' angle brackets, the
parsing delimiter of the instruction.

HTML does not require closing tags, so as embedded markup
instructions interspersed with content are parsed, a guess is
made, if errors are found, about where to discontinue the instruction
as applied to the context and, in general, where the nesting stops.

There is a separation between the markup instruction and the content
via the markup delimiter '<'. That is the first level of parsing,
extracting the instruction from its delimiter and thereby the
content. The second level is structuring the markup instruction
within the content.

When a complete discrete structure is obtained, the document processor
renders it, a chunk at a time, mid-stream.

The first level, separating markup instructions from their delimiters
(and as a side-effect, exposing content) can be done by any language
that can compare characters.

The second level can be done by any language that can do a stack
or nested variables.
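
A rough sketch of those two levels, under loud assumptions: the input fragment is made up and well-formed, the tag regex is deliberately naive (it would trip over a '>' inside an attribute value, and it knows nothing about comments or void elements), and the names @stack and $max_depth are chosen here for illustration. A regex pulls out the tags, and an ordinary array used as a stack tracks the nesting.

```perl
use strict;
use warnings;

# Hypothetical, well-formed fragment with tables nested two deep.
my $html =
    '<table><tr><td><table><tr><td>x</td></tr></table></td></tr></table>';

my @stack;            # names of the currently open elements
my $max_depth = 0;    # deepest <table> nesting seen so far

# First level: a regex picks out each start or end tag in turn.
while ( $html =~ m{<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>}g ) {
    my ( $closing, $name ) = ( $1, lc $2 );
    if ($closing) {
        # Second level: pop back to the matching open tag, if there is one.
        if ( grep { $_ eq $name } @stack ) {
            while (@stack) { last if pop(@stack) eq $name }
        }
    }
    else {
        push @stack, $name;
        my $depth = grep { $_ eq 'table' } @stack;
        $max_depth = $depth if $depth > $max_depth;
    }
}

print "Maximum table depth: $max_depth\n";    # prints 2 for this fragment
```

The regex only finds tag boundaries; the depth bookkeeping lives in the stack, outside the pattern.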

There is no place for balanced text processing for the first
level of parsing markup instructions. Instructions within
instructions are NOT well formed and will be kicked out of
processors.

So essentially, as slow as it can be, if the aim is to peel away
delimiters to expose the markup instruction, regular expressions
work great. C processors work about 100 - 500 times faster but
don't have the ability to give extended (look ahead) errors,
nor will they self correct and continue. In most cases, a
regular expression can identify errant markup instruction syntax
while correctly encapsulating the delimiting expression.
If there is an errant '<' delimiter in content, it is not
well-formed but is still captured as content and easily reported.

Overall, there is no requirement for processors to stop on
not well-formed, but most do because they are full featured
and compliant. Most go out and bring in includes, do substitutions,
reparse, etc.

No, you won't get that with regular expressions, but there
is nothing stopping anybody from using them to parse out
markup instructions and content, nothing at all. Just compare
characters is all you do.

The reason regex is so slow is that it does pattern matching
with backtracking, grouping, etc.

This doesn't mean it can't compare characters, it sure can,
and in a variable way which allows looking ahead, which has
benefits over state processing.

As long as the regex takes into account ALL possible markup
instructions and delimiters as exclusionary items, there is
no reason why it can't be used to find specific sub-patterns
either in content or in the markup instructions themselves.

And it can drive over and re-align after discrete syntax errors without
stopping. All in all, it's a niche parser and perfect at times
when a DOM or SAX is just too cumbersome, too much code overhead
for something simple.

-sln

Ted Zlatanov

Mar 25, 2010, 10:58:46 AM
On Wed, 24 Mar 2010 15:49:29 +0100 Mart van de Wege <mvd...@mail.com> wrote:

MvdW> Willem <wil...@turtle.stack.nl> writes:
>> One could argue that you're *not* allowed to do anything whatsoever
>> with a web page, *except* when the copyright holder allows it.
>> Which he does, obviously, through a terms-of-use agreement.
>>

MvdW> Yeah, but that works two ways. One could also argue that putting
MvdW> information on a publicly reachable server, using a protocol
MvdW> specifically designed for publishing, without access controls, implies
MvdW> that you want the world to read your pages.

(OT but slightly relevant to WWW::Mechanize for example)

Sadly this common-sense interpretation has been eroded by Congress and
courts in the USA. Look for info on the Computer Fraud and Abuse Act,
e.g. http://www.techdirt.com/articles/20100305/0404088432.shtml

Ted

Kyle T. Jones

Mar 26, 2010, 1:14:27 AM
Tad McClellan wrote:


Thanks for the reply - in particular, some of the code you provided and
corrected was interesting and informative.

You make a big deal about my use of the term "parse" throughout - I sure
felt as if I was being chastised. I was kind of surprised that I did
use it, to be honest. I figured I must have used it casually - and
mentioned such in another response:

"I believe what you say above is true - to truly "parse" the page AS
HTML is beyond the ability of REs - but I'm not parsing anything AS
HTML, if that makes sense. In fact, to take that a step further, I'm
not "parsing" period - so perhaps it was a mistake for me to use that
term. I meant to use the term colloquially, sorry if that caused any
confusion." - me

I'll attempt to stay away from such casual use of that particular term
in future interactions here. As for suggestions that I google "Chomsky
hierarchy" - all my peeps got a kick out of that one.

Cheers.

Peter J. Holzer

Mar 28, 2010, 6:05:39 AM

However, for extracting links you don't need to process nested tables.
You can view the file as a linear sequence of tags and text. This can
be done with a regular grammar; you don't need a context-free grammar.


> Try it with this:
>
> -------------------
> my $contents = '
><html><body>
><!--
> this is NOT a link...
> <a href="google.com">Google</a>
> -->
></body></html>
> ';
> -------------------

Comments in HTML can also be described by regular expressions - no need
to write a context-free grammar for that.

But this is a good example of why you should use an existing module
instead of rolling your own: When you roll your own it is easy to forget about
special cases like this. A module which has been in use by lots of
people for some time is unlikely to contain such a bug.
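
To make the comment case concrete, here is a rough sketch of the
regular-expression approach, for illustration only. The sub name
links_with_text is hypothetical. It treats the page as a flat sequence
of comments, tags and text, skips comments, and tests the content of
each a element rather than its URL. It still ignores plenty of cases
('>' inside attribute values, script bodies, omitted end tags), which
is exactly the kind of thing a well-used module has already dealt with.

```perl
use strict;
use warnings;

# Hypothetical helper: return the href of every <a> element whose text
# content matches $want. HTML comments are skipped, so links that are
# commented out are never reported.
sub links_with_text {
    my ($html, $want) = @_;
    my @hits;
    my $href;         # href of the <a> element we are inside, if any
    my $text = '';    # text collected since that start tag

    # Flat token stream: comment, tag, character data, or a stray '<'.
    while ($html =~ m{\G(?:(<!--.*?-->)|(<[^>]*>)|([^<]+)|(<))}gcs) {
        my ($comment, $tag, $chars, $stray) = ($1, $2, $3, $4);
        next if defined $comment;
        if (defined $tag) {
            if ($tag =~ /^<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/is) {
                # Start tag of an a element: remember the URL.
                $href = defined $1 ? $1 : defined $2 ? $2 : $3;
                $text = '';
            }
            elsif ($tag =~ m{^</a\s*>}i && defined $href) {
                # End tag: the element is complete, so test its content.
                push @hits, $href if $text =~ $want;
                undef $href;
            }
        }
        elsif (defined $href) {
            # Character data (or a stray '<') inside the element.
            $text .= defined $chars ? $chars : $stray;
        }
    }
    return @hits;
}

# Usage with the sample from this thread plus one live link.
my $contents = '
<html><body>
<!--
  this is NOT a link...
  <a href="google.com">Google</a>
-->
<a href="/news/1">Hello Kitty</a>
</body></html>
';
print "$_\n" for links_with_text($contents, qr/Hello|Google/);
```

Only the second link should be reported here, since the first one sits
inside a comment.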


> Also, your code does not address the OP's question.
>
> It tests the URL for a string rather than testing the <a> tag's _contents_.

A tag doesn't have content; an element does.

> That is, he wanted to test
>
>     <a href="...">...</a>
>                   ^^^
>                   ^^^ here
>
> rather than
>
>     <a href="...">...</a>
>              ^^^
>              ^^^

There are two tags in this snippet:

* <a href="...">
* </a>

The a element consists of the start tag, the end tag and the content,
which is enclosed between the two tags.

For some elements the end tag and for some even the start tag can be
omitted, but the element is still there.

hp

Peter J. Holzer

Mar 28, 2010, 6:17:08 AM

Actually it is much worse. If you read from a third-party web page, or
get your input from some crap application that finance or marketing
happens to use, you can't formally parse HTML because you won't get HTML.
Instead you will get, as a friend of mine likes to call it, a file with
pointy brackets. So you need a parser which can cope with all the usual
errors.

(An HTML5 parser might do this - AIUI, the HTML5 parsing algorithm is
completely deterministic for every possible input).

hp
