On 2/16/12, Bruce Eckel <bruce...@gmail.com> wrote:
> I'll just point out that BS4 can use lxml as its parser, which is way
> faster than BS3, so that might have been one of the reasons people
> suggested alternatives for large crawls. BS4 will get even faster,
> apparently, by the time it's released. So it's certainly worth a try, IMO.
>
> -- Bruce Eckel
> www.Reinventing-Business.com
> www.MindviewInc.com
>
> On Wed, Feb 15, 2012 at 11:50 PM, dan123456789 <daninth...@gmail.com> wrote:
>
>> Hi,
>> I'm looking at crawling 5,000+ websites and need a solution. They are
>> real estate listings, so the data is similar, but every site has its
>> own HTML; they are all unique sites. No clean data feed or API is
>> available.
>>
>> I am looking for a solution that is halfway intelligent, or that I can
>> program intelligence into. Something I can just load the root
>> domains into, that crawls them, captures the data between HTML tags, and
>> presents it in a somewhat orderly manner. I cannot write a unique
>> parser for every site.
>>
>> What I need is something that will capture everything; then I will
>> know that, say, Field XYZ holds the price (because the HTML on every
>> page of that site that had a price was <td id=price> 100 </td>), for
>> example.
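
One way to read "capture everything" is to record the text of every element
that carries an id, keyed by tag and id. Below is a minimal sketch of that
idea. It uses the standard library's html.parser rather than Beautiful Soup
so it runs with no extra installs, and FieldCapture, capture_fields and the
"tag#id" key format are names I made up for illustration, not anything from
BS or Scrapy.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag or text content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class FieldCapture(HTMLParser):
    """Collect the text found inside every element that has an id."""

    def __init__(self):
        super().__init__()
        self.fields = {}   # e.g. {"td#price": " 100 "}
        self._stack = []   # (tag, field key or None) for each open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        elem_id = dict(attrs).get("id")
        key = f"{tag}#{elem_id}" if elem_id else None
        self._stack.append((tag, key))
        if key:
            self.fields.setdefault(key, "")

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; this tolerates
        # unclosed tags in messy markup.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Text is credited to every enclosing element that has an id.
        for _, key in self._stack:
            if key:
                self.fields[key] += data


def capture_fields(html):
    """Return {"tag#id": text} with whitespace collapsed."""
    parser = FieldCapture()
    parser.feed(html)
    parser.close()
    return {k: " ".join(v.split()) for k, v in parser.fields.items()}


print(capture_fields("<table><tr><td id=price> 100 </td></tr></table>"))
# prints {'td#price': '100'}
```

Real listing pages often mark fields with class attributes or nothing at
all, so keying on id alone is only a starting point.
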
>>
>> Is Beautiful Soup for me? I'm asking the same thing of Scrapy as well.
>> I think BS will handle the HTML mess, but others say Scrapy handles
>> large crawls better.
>>
>> I'm hoping to load the captured data into some sort of DB, map the
>> fields to what I need (e.g. find the field that is the price and call it
>> price); that then becomes the parser/clean data for that site until the
>> HTML changes. Perhaps there is an easier way?
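
The "load into a DB, then map fields" step could look roughly like this.
It is only a sketch, assuming SQLite and a two-table layout of my own
invention: raw_fields holds whatever was captured per page, and field_map
holds the per-site mapping from a raw key to a clean name. All table,
column and function names here are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS raw_fields (
    site TEXT, url TEXT, field_key TEXT, value TEXT
);
CREATE TABLE IF NOT EXISTS field_map (
    site TEXT, field_key TEXT, name TEXT,
    PRIMARY KEY (site, field_key)
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_page(conn, site, url, fields):
    """Store every captured key/value pair for one page, unfiltered."""
    conn.executemany(
        "INSERT INTO raw_fields VALUES (?, ?, ?, ?)",
        [(site, url, key, value) for key, value in fields.items()],
    )


def map_field(conn, site, field_key, name):
    """Declare that, on this site, field_key means name (e.g. 'price')."""
    conn.execute(
        "INSERT OR REPLACE INTO field_map VALUES (?, ?, ?)",
        (site, field_key, name),
    )


def clean_rows(conn, site):
    """Return {url: {name: value}} using only the mapped fields."""
    rows = conn.execute(
        """SELECT r.url, m.name, r.value
           FROM raw_fields r
           JOIN field_map m
             ON m.site = r.site AND m.field_key = r.field_key
           WHERE r.site = ?""",
        (site,),
    ).fetchall()
    pages = {}
    for url, name, value in rows:
        pages.setdefault(url, {})[name] = value
    return pages
```

Usage would be along the lines of save_page(conn, "example.com", url,
fields) for each crawled page, one map_field(conn, "example.com",
"td#price", "price") call per field you care about, and then
clean_rows(conn, "example.com") to read the cleaned data. When a site's
HTML changes, only its field_map rows need updating.
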
>>
>> Any ideas on how to do this with BS if possible?
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "beautifulsoup" group.
>> To post to this group, send email to beauti...@googlegroups.com.
>> To unsubscribe from this group, send email to
>> beautifulsou...@googlegroups.com.
>> For more options, visit this group at
>> http://groups.google.com/group/beautifulsoup?hl=en.
>>
>>