This is far from a business plan ;) It's just an idea planted in my
mind by this SO question, which made me curious because, in theory,
this should be possible (OK, not much is impossible in theory ;)).
I don't actually plan to implement it yet, but maybe in a few years.
But thanks for letting me know that this book won't cover that! I'll
probably buy and read it anyway.
Greetings, naeg
2011/10/11, conrado <conrado...@gmail.com>:
There are quite a few people who use machine learning for data
extraction in web crawlers based on Scrapy. Additionally, you don't need
machine learning to have a single spider capable of crawling many
sites; often the data can be extracted using simpler mechanisms (e.g.
custom parsing, regular expressions, etc.).
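A minimal sketch of the regular-expression approach, assuming coupon
codes show up as an uppercase token right after a label such as "code"
or "promo code". The pattern, the name extract_codes and both sample
pages are made up for illustration; real sites will need their own
patterns.

```python
import re

# Hypothetical pattern: a label ("coupon code", "promo code" or "code",
# matched case-insensitively), an optional ":" or "=", optional inline
# tags such as <b>, then an uppercase/digit token of 4 to 12 characters.
CODE_RE = re.compile(
    r"(?i:coupon\s+code|promo\s+code|code)\s*[:=]?\s*"
    r"(?:<[^>]+>\s*)*([A-Z0-9]{4,12})\b"
)


def extract_codes(html):
    """Return every coupon-code-like token found in the raw HTML."""
    return CODE_RE.findall(html)


# Two made-up pages with different markup around the same kind of data.
page_a = '<div class="deal"><h3>20% off shoes</h3><span>Code: <b>SAVE20</b></span></div>'
page_b = "<tr><td>Free shipping</td><td>promo code FREESHIP</td></tr>"

for page in (page_a, page_b):
    print(extract_codes(page))
```

The same function handles both layouts because it keys on the wording
around the code rather than on the tag structure. That is also its
weakness: a site that labels its codes differently is silently missed.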
There are a lot of different techniques for applying machine learning to
this area, with different trade-offs. Some verticals (e.g. extracting
news, product reviews, opinions, etc.) have been studied in more depth,
and depending on the domain you are interested in, different techniques
may be applicable.
Programming Collective Intelligence is an interesting book. It's a nice
introduction to a few topics and I'd certainly recommend it, but
unfortunately it doesn't include anything directly applicable to this
task. One book that comes to mind is Web Data Mining, by Bing Liu
http://www.cs.uic.edu/~liub/WebMiningBook.html (maybe others can suggest
more). Otherwise, if you have more details, maybe I can point you at some
research papers or talks (but it sounds like you're not quite ready yet).
You might also be interested in scrapely:
https://github.com/scrapy/scrapely
which has been used quite a lot with scrapy.
Cheers,
Shane
Thanks for the book and framework suggestions; I can't wait to take a
deeper look at them.
You are right that I'm not quite ready yet. As I said, I don't plan to
implement this right now, since I have almost no experience with
AI/machine learning.
Anyway, I'd still like some links to research papers and talks. Once
I've read and watched them, I'm sure I can judge well enough whether
I'll be able to do this in a few years' time ;)
But which details do you mean?
It's just about scraping all the coupons from sites like
http://www.ultimatecoupons.com/coupons/, but with one crawler instead
of a separate crawler for every coupon site like ultimatecoupons. The
problem is that they may differ a lot in HTML structure, but not much
in how a human would go through all the coupons.
Note that ultimatecoupons is just an example site I found with the
help of Google. I'm currently only scraping sites _like_ that, but I
can't show them because they are not translated into English.
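
A rough sketch of that "read it like a human" idea, using only the
Python standard library: throw away the HTML structure, keep the text
chunks, and look for offer-like wording. The names TextCollector,
visible_text and find_offers, the OFFER_RE pattern and the sample pages
are all hypothetical, and this is only a starting point, not a real
multi-site crawler.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-empty text chunks and ignore the tag structure."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(html):
    """Return the text chunks of a page in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical wording a shopper would recognise as an offer.
OFFER_RE = re.compile(r"\d+\s*%\s*off|free shipping", re.IGNORECASE)


def find_offers(html):
    """Return the text chunks that look like a coupon offer."""
    return [chunk for chunk in visible_text(html) if OFFER_RE.search(chunk)]


# Two made-up sites: one uses a list, the other a table.
site_a = "<ul><li><strong>15% off</strong> all books</li><li>About us</li></ul>"
site_b = "<table><tr><td>Free Shipping on orders over $50</td></tr></table>"

for site in (site_a, site_b):
    print(find_offers(site))
```

Note that handle_data also receives the contents of script and style
elements, so a real version would have to skip those, and inline tags
split the text into separate chunks (the first sample yields "15% off"
and "all books" as two chunks).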
Greetings naeg
2011/10/11, Shane Evans <shane...@gmail.com>:
I wrote a blog post on the processing side of things, if anyone is interested:
http://blog.kitchenpc.com/2011/07/06/chef-watson/
Mike