How to make a Web Crawler

Adderly Jauregui

Mar 6, 2012, 3:31:01 PM
to Pattern
Does anyone have an example of how to build a web crawler?

Is this correct: from pattern.web import Spider ?

Please give a minimal example of how to do this. Thank you! :D

This is the code I found:


from pattern.web import Spider, DEPTH

class Spiderling(Spider):
    def visit(self, link, source=None):
        # Called for each link that was fetched successfully.
        print 'visited:', repr(link.url), 'from:', link.referrer
    def fail(self, link):
        # Called for each link that could not be fetched.
        print 'failed:', repr(link.url)

s = Spiderling(links=['http://www.clips.ua.ac.be/'], delay=5, queue=True)
while not s.done:
    s.crawl(method=DEPTH, cached=False, throttle=5)

Tom De Smedt

Mar 6, 2012, 7:05:32 PM
to Pattern
Hi Adderly,

I've added an example in the latest revision, with more in-depth
comments on how to create web spiders / web crawlers. You can get the
latest revision from GitHub. Here is the link to the new example:
https://github.com/clips/pattern/blob/master/examples/01-web/12-spider.py
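
The idea behind such a spider can also be sketched without Pattern. What follows is a minimal illustration in Python 3 using only the standard library, and it is not the code of the linked example. The names crawl(), HREF and SITE are hypothetical, and the visit/fail callbacks only mirror the roles of the visit() and fail() methods in the Spider subclass posted above. An in-memory dictionary stands in for the network so the sketch runs offline.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical helper, not part of pattern.web: a crude href extractor.
HREF = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(start, fetch, visit, fail, depth_first=False, limit=100):
    """Follow links from `start`, calling visit() or fail() for each one.

    fetch(url) must return the page's HTML as a string or raise an exception.
    """
    frontier = deque([(start, None)])  # (url, referrer) pairs still to fetch
    seen = {start}
    while frontier and limit > 0:
        # Popping from the end gives depth-first order,
        # popping from the front gives breadth-first order.
        url, referrer = frontier.pop() if depth_first else frontier.popleft()
        limit -= 1
        try:
            html = fetch(url)
        except Exception:
            fail(url, referrer)
            continue
        visit(url, referrer, html)
        for href in HREF.findall(html):
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                frontier.append((target, url))

# An in-memory "site" (assumption for illustration only).
SITE = {
    "http://example.test/": '<a href="/a">a</a> <a href="/b">b</a>',
    "http://example.test/a": '<a href="/c">c</a>',
    "http://example.test/b": '',
}

def fetch(url):
    return SITE[url]  # raises KeyError for pages that do not exist

visited, failed = [], []
crawl("http://example.test/", fetch,
      visit=lambda url, referrer, html: visited.append(url),
      fail=lambda url, referrer: failed.append(url))
print("visited:", visited)
print("failed:", failed)
```

For real pages, fetch could wrap urllib.request.urlopen, and the loop should pause between requests, which is the purpose of the delay and throttle settings in the Spider snippet above.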

Best,
Tom

Adderly Jauregui

Mar 6, 2012, 7:27:21 PM
to pattern-f...@googlegroups.com
Thank you very much, this is just what I needed :D

Adderly Jauregui

Mar 15, 2012, 1:10:37 PM
to pattern-f...@googlegroups.com
Is there an example of how to get the title of a page with pattern.web?

Tom De Smedt

Mar 26, 2012, 8:11:24 PM
to Pattern
It is easy if you use the Document class, which has functionality to
search the HTML as a tree:

from pattern.web import URL, Document

html = URL("http://www.clips.ua.ac.be").download()
document = Document(html)
# Get a list of <title> elements in the <head> section:
print document.head.by_tag("title")[0]
print document.head.by_tag("title")[0].content
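
If Pattern is not installed, roughly the same lookup can be sketched with the Python 3 standard library. This is not Pattern's API: TitleParser and page_title are hypothetical names used for illustration, and the HTML string is a made-up stand-in for a downloaded page.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # True while inside <title>...</title>
        self.done = False      # True once the first title has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

def page_title(html):
    """Return the stripped title text, or None if the page has no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() or None

# Made-up page content (assumption for illustration only).
html = "<html><head><title> Example page </title></head><body><h1>Hi</h1></body></html>"
print(page_title(html))
```

Unlike the Document class, this does not build a tree. It only watches for the first title element while the parser streams through the markup.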



Ben Tortora

Oct 16, 2012, 9:59:35 PM
to pattern-f...@googlegroups.com
Has anyone thought about using Scrapy (http://scrapy.org) as the crawler and then using Pattern for the processing and analysis?

Tom De Smedt

Nov 1, 2012, 12:33:12 PM
to pattern-f...@googlegroups.com
It should not be difficult: both packages are pure Python and BSD-licensed.
Let me know if anything is needed in Pattern to integrate better with Scrapy.