How to make a web crawler

Adderly Jauregui

Mar 6, 2012, 3:31:01 PM

to Pattern

Does anyone have an example of how to deploy a web crawler?

Is this the correct import: from pattern.web import Spider?

Please give a minimal example of how to do this. Thank you! :D

This is the code I found:

from pattern.web import Spider, DEPTH

class Spiderling(Spider):

    def visit(self, link, source=None):
        # Called by crawl() when a link has been crawled.
        print('visited: %r from: %s' % (link.url, link.referrer))

    def fail(self, link):
        # Called by crawl() when a link could not be downloaded.
        print('failed: %r' % link.url)

s = Spiderling(links=['http://www.clips.ua.ac.be/'], delay=5, queue=True)
while not s.done:
    s.crawl(method=DEPTH, cached=False, throttle=5)

Tom De Smedt

Mar 6, 2012, 7:05:32 PM

to Pattern

Hi Adderly,

I've added an example in the latest revision, with more in-depth
comments on how to create web spiders / web crawlers. You can get the
latest revision from GitHub. Here is the link to the new example:
https://github.com/clips/pattern/blob/master/examples/01-web/12-spider.py

Best,
Tom
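
For readers who want to see what a spider does underneath, here is a minimal sketch of the crawl loop using only the Python 3 standard library. It is not Pattern's implementation: LinkParser, crawl(), the fetch callable, and the example.test URLs are all made up for illustration, and the in-memory PAGES dict stands in for real downloads. The visited and failed lists play roughly the role of the visit() and fail() callbacks in the code above.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start, fetch, limit=10):
    """Visit pages breadth-first from `start`, at most `limit` pages.

    `fetch` is a hypothetical callable: url -> HTML string, or None on failure.
    Returns (visited, failed) lists in the order the URLs were handled.
    """
    queue, seen = deque([start]), {start}
    visited, failed = [], []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            failed.append(url)
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:       # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited, failed


# A tiny in-memory "site" (made-up URLs) standing in for real downloads.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/missing">x</a>',
    "http://example.test/b": "<p>no links</p>",
}

visited, failed = crawl("http://example.test/", PAGES.get)
print("visited:", visited)
print("failed:", failed)
```

Swapping queue.popleft() for queue.pop() would turn this into a depth-first crawl. A real crawler would also need a delay between requests and a domain filter, which this sketch leaves out.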

Adderly Jauregui

Mar 6, 2012, 7:27:21 PM

to pattern-f...@googlegroups.com

Thank you very much, this is just what I needed :D

Adderly Jauregui

Mar 15, 2012, 1:10:37 PM

to pattern-f...@googlegroups.com

Is there an example of how to get the title of a page with pattern.web?

Tom De Smedt

Mar 26, 2012, 8:11:24 PM

to Pattern

It is easy if you use the Document class, which has functionality to
search the HTML as a tree:

from pattern.web import URL, Document

html = URL("http://www.clips.ua.ac.be").download()
document = Document(html)
# Get a list of <title> elements in the <head> section:
titles = document.head.by_tag("title")
print(titles[0])          # the <title> element itself
print(titles[0].content)  # the text inside the element
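
If Pattern is not at hand, roughly the same result can be had with Python 3's built-in html.parser. This is a swapped-in technique rather than the Document approach above: it reacts to tags as they stream past instead of building a searchable tree. TitleParser and page_title() are hypothetical names, and the sample HTML is invented.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Captures the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_data(self, data):
        if self.in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False


def page_title(html):
    """Return the stripped title text, or None if there is no <title>."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip() if parser.title is not None else None


sample = "<html><head><title> Example page </title></head><body><p>Hi</p></body></html>"
print(page_title(sample))
```

For anything beyond a single tag, a tree-based search like Document.by_tag() is usually more convenient than writing handlers by hand.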

Ben Tortora

Oct 16, 2012, 9:59:35 PM

to pattern-f...@googlegroups.com

Has anyone thought about using Scrapy (http://scrapy.org) as the crawler and then using Pattern for the processing and analysis?

Tom De Smedt

Nov 1, 2012, 12:33:12 PM

to pattern-f...@googlegroups.com

It should not be difficult: both packages are pure Python and BSD-licensed.

Let me know if anything needs to change in Pattern to integrate better with Scrapy.
