Scrapy and machine learning

Manuel 'naeg' Rotter

Oct 10, 2011, 3:03:45 PM
to scrapy-users
Hello,

I'm wondering whether there are people out there who combine Scrapy
with machine learning.

An example:

Let's say you have a lot of sites that offer all kinds of coupons
(such as "50% off Adobe Photoshop"). Instead of crawling each site
separately with one crawler each, you could probably write a single
crawler that is clever enough to get all the coupons from all the
sites.


This idea came to me because of this StackOverflow question:
http://stackoverflow.com/questions/7714422/what-are-some-of-the-artificial-intelligence-ai-related-techniques-one-would-us

That SO question led me to this book, which seems quite interesting:
http://shop.oreilly.com/product/9780596529321.do


Do you consider this possible, and if so, how?
Will that book lead one in the right direction toward building a
crawler like the one in my example, or will more be necessary?

Thanks in advance
Greetings naeg

conrado

Oct 10, 2011, 10:46:46 PM
to scrapy-users
Hi Naeg,

Though the Collective Intelligence book is very good and worth reading
(I have a copy here and just went over it to make sure), it will not
help you with this. The book doesn't cover AI on *this* side of the
problem.

It will certainly be "cheaper" to build a spider for each site you
want to get your coupons from, and that approach is much more likely
to work. Even if you did build something relatively successful, you
would still find yourself fixing special cases manually.

Do not invest your time in this for coupons; it sounds like a failed
business plan.

Best,
Conrado

Manuel Rotter

Oct 11, 2011, 4:47:07 AM
to scrapy...@googlegroups.com
Hello conrado,

this is far from a business plan ;) It's just a thought planted in my
mind by that SO question, which made me curious because, in theory,
this should be possible (OK, there is not much that is impossible in
theory ;)). I don't actually plan to implement it yet, but maybe in a
few years.

But thanks for letting me know that this book won't cover that! I'll
probably buy and read it anyway.


Greetings, naeg

Shane Evans

Oct 11, 2011, 5:17:03 AM
to scrapy...@googlegroups.com
Hi!

There are quite a few people who use machine learning for data
extraction in web crawlers based on Scrapy. That said, you don't need
machine learning to have a single spider capable of crawling many
sites; often the data can be extracted using simpler mechanisms (e.g.
custom parsing, regular expressions, etc.).
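
For example, a very rough sketch like the following (the start URL and
the discount regex here are placeholders, not a tested extractor) can
pull coupon-looking text without depending on each site's HTML
structure:

    import re

    import scrapy

    class CouponSpider(scrapy.Spider):
        # Rough sketch only: the start URL is a placeholder and the
        # regex is a toy pattern, not a robust coupon parser.
        name = "coupons"
        start_urls = ["http://example.com/coupons/"]

        # Matches discount phrases like "50% off" anywhere in the text.
        DISCOUNT_RE = re.compile(r"\b\d{1,3}%\s+off\b", re.IGNORECASE)

        def parse(self, response):
            # Scan text nodes instead of site-specific markup, so the
            # same spider can be pointed at many coupon sites.
            for text in response.xpath("//text()").getall():
                text = text.strip()
                if self.DISCOUNT_RE.search(text):
                    yield {"coupon_text": text, "source": response.url}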

There are a lot of different techniques for applying machine learning
to this area, with different trade-offs. Some verticals (e.g.
extracting news, product reviews, opinions, etc.) have been studied in
more depth, and depending on the domain you are interested in,
different techniques may be applicable.

Programming Collective Intelligence is an interesting book; it's a
nice introduction to a few topics and I'd certainly recommend it, but
unfortunately it doesn't include anything directly applicable to this
task. One book that comes to mind is Web Data Mining, by Bing Liu:
http://www.cs.uic.edu/~liub/WebMiningBook.html (maybe others can
suggest more). Otherwise, if you have more details, maybe I can point
you at some research papers or talks (but it sounds like you're not
quite ready yet).

You might also be interested in scrapely:
https://github.com/scrapy/scrapely
which has been used quite a lot with scrapy.
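
Its basic train/scrape flow looks roughly like this (the URLs and
fields below are README-style placeholders, just to show the shape of
the API):

    from scrapely import Scraper

    s = Scraper()

    # Train on one annotated page: give scrapely a URL plus the values
    # you want it to learn to locate in that page's markup.
    train_url = "http://pypi.python.org/pypi/w3lib/1.1"
    s.train(train_url, {"name": "w3lib 1.1", "author": "Scrapy project"})

    # It can then extract the same fields from similarly structured pages.
    print(s.scrape("http://pypi.python.org/pypi/Django/1.3"))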

Cheers,

Shane

Manuel Rotter

Oct 11, 2011, 8:10:12 AM
to scrapy...@googlegroups.com
Hello Shane,

Thanks for the book and framework suggestions; I can't wait to take a
deeper look into them.

You are right that I'm not quite ready yet. As I said, I don't plan to
implement this right now, since I have almost no experience with
AI/machine learning.
Still, I'd like some links to research papers and talks; if I read and
watch them, I'm sure my self-assessment is sound enough to tell me
whether I can do this in a few years' time ;)

But which details are you asking about?

It's just about scraping sites like
http://www.ultimatecoupons.com/coupons/ and getting all the coupons
from there, but with one crawler instead of a separate crawler for
every coupon site like ultimatecoupons. The problem is that such sites
may differ a lot in HTML structure, but not that much in how a human
would go through all the coupons.

Note that ultimatecoupons is just an example page I found with the
help of Google. I'm currently only scraping sites _like_ that, but I
can't show them because they are not translated into English.


Greetings naeg

Mike Christensen

Oct 11, 2011, 7:22:12 PM
to scrapy...@googlegroups.com
I'm actually using Scrapy for my site (www.kitchenpc.com) to parse
recipes from other sources. Luckily, this is a lot easier than
coupons because almost every major recipe site marks up recipes with
the hRecipe microformat. However, I do some natural language
processing on the ingredients themselves to figure out what
ingredients are called for, how they're used, and what amounts you'd
need to buy at the store. I don't use Scrapy for any AI, mostly
because I want to keep the crawler lean and fast. I simply crawl
recipe data, make sure it's got an hRecipe tag, and dump it into a
huge database. I then process this later on with C# code.
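
To give a rough idea, the crawl-side check is not much more than this
(a simplified sketch: the start URL is a placeholder, and the real
ingredient processing happens later, outside the crawler):

    import scrapy

    class RecipeSpider(scrapy.Spider):
        # Simplified sketch: the start URL is a placeholder, and the
        # NLP on ingredients is done later in a separate step.
        name = "recipes"
        start_urls = ["http://example.com/recipes/"]

        def parse(self, response):
            # hRecipe marks each recipe with class="hrecipe"; skip
            # pages without it and dump raw fields for later processing.
            for recipe in response.css(".hrecipe"):
                yield {
                    "title": recipe.css(".fn::text").get(),
                    "ingredients": recipe.css(".ingredient ::text").getall(),
                    "url": response.url,
                }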

I wrote a blog post on the processing side of things, if anyone is interested:

http://blog.kitchenpc.com/2011/07/06/chef-watson/

Mike
