Re: is BS for scraping large number of sites (5000+), unstructured html, giving me structured data?

AndriusZilenas

Feb 16, 2012, 9:10:38 AM

to beauti...@googlegroups.com

Sometimes I use
http://www.outwit.com/
to understand more quickly how to deal with the structure.

***

On Thu, Feb 16, 2012 at 03:34, Bruce Eckel <bruce...@gmail.com> wrote:

Create a short demo that just grabs each of your sites and turns each one into a BeautifulSoup tree. The parsing is where most of the time is, so if that demo is within your parameters then it should work.

-- Bruce Eckel
www.Reinventing-Business.com
www.MindviewInc.com
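
A minimal sketch of the timing demo Bruce describes, assuming Python 3 and the `beautifulsoup4` package. The names `fetch`, `parse`, and `time_sites` are hypothetical helpers, not part of Beautiful Soup, and the stdlib `html.parser` backend is just one possible parser choice. It records fetch time and parse time separately so you can see where the time goes:

```python
import time
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def parse(html):
    """Build a BeautifulSoup tree from the page (needs beautifulsoup4)."""
    from bs4 import BeautifulSoup  # imported here so only parsing needs bs4
    return BeautifulSoup(html, "html.parser")


def time_sites(urls, fetch=fetch, parse=parse):
    """Return one (url, fetch_seconds, parse_seconds, error) row per URL."""
    rows = []
    for url in urls:
        start = time.perf_counter()
        try:
            html = fetch(url)
            fetched = time.perf_counter()
            parse(html)
            parsed = time.perf_counter()
        except Exception as exc:  # one bad site should not stop the run
            rows.append((url, None, None, repr(exc)))
            continue
        rows.append((url, fetched - start, parsed - fetched, None))
    return rows
```

Calling `time_sites(list_of_root_urls)` on a sample of the sites and summing the two timing columns gives a rough idea of whether a full run fits your time budget.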

On Thu, Feb 16, 2012 at 1:31 AM, Dan Tarasenko <daninth...@gmail.com> wrote:

Thanks, yeah, it looks the goods, but I'm worried: I cannot find anything
that says it can actually crawl the websites. It seems I need something
else to do the crawling. Any ideas?
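
Beautiful Soup only parses pages; fetching them and following links has to come from somewhere else. One possible shape for that, sketched with the standard library doing the fetching and Beautiful Soup pulling out the links. The names `fetch`, `extract_links`, and `crawl` are hypothetical, and a real crawl would also want politeness delays and robots.txt handling, which this sketch leaves out:

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> on the page (needs beautifulsoup4)."""
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def crawl(root, max_pages=50, fetch=fetch, extract_links=extract_links):
    """Breadth-first crawl that stays on the root's host; yields (url, html)."""
    host = urlparse(root).netloc
    seen = {root}
    queue = deque([root])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        fetched += 1
        yield url, html
        for link in extract_links(html, url):
            link = urldefrag(link).url  # drop "#fragment" so pages are not revisited
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
```

Looping `for url, html in crawl(root_domain):` then hands each page to whatever extraction step you use.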

On 2/16/12, Bruce Eckel <bruce...@gmail.com> wrote:
> I'll just point out that BS4 can use lxml as its parser, which is way faster
> than BS3, so that might have been one of the reasons people suggested
> alternatives for large crawls. BS4 will get even faster, apparently, by the
> time it's released. So it's certainly worth a try, IMO.
>
> -- Bruce Eckel
> www.Reinventing-Business.com
> www.MindviewInc.com
>
>
>
> On Wed, Feb 15, 2012 at 11:50 PM, dan123456789 <daninth...@gmail.com> wrote:
>
>> Hi,
>> I'm looking at crawling 5000+ websites and need a solution. They are
>> real estate listings, so the data is similar, but every site has its
>> own HTML code - they are all unique sites. No clean data feed or API is
>> available.
>>
>> I am looking for a solution that is halfway intelligent, or I can
>> program intelligence into it. Something I can just load the root
>> domains into, it crawls, and will capture data between html tags and
>> present it in a somewhat orderly manner. I cannot write a unique
>> parser for every site.
>>
>> What I need is something that will capture everything, then I will
>> know that in, say, Field XYZ the price has been stored (because the HTML
>> code on every page of that site that had a price was
>> <td id=price> 100 </td>), for example.
>>
>> Is Beautiful Soup for me? I'm asking the same thing of Scrapy as well.
>> I think BS will handle the HTML mess, but others say Scrapy handles
>> large crawls better.
>>
>> I'm hoping to load the captured data into some sort of DB, map the
>> fields to what I need (e.g. find the field that is price and call it
>> price), then that becomes the parser/clean data for that site until the
>> HTML changes. Perhaps there is an easier way?
>>
>> Any ideas on how to do this with BS if possible?
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "beautifulsoup" group.
>> To post to this group, send email to beauti...@googlegroups.com.
>> To unsubscribe from this group, send email to
>> beautifulsou...@googlegroups.com.
>> For more options, visit this group at
>> http://groups.google.com/group/beautifulsoup?hl=en.
>>
>>
>
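
The "capture everything, then map the fields" idea from the original question could be sketched as below. This assumes each interesting value sits in a tag with an `id` attribute, as in the `<td id=price>` example; `capture_fields`, `apply_mapping`, and `demo` are hypothetical names, and the per-site mapping dict stands in for whatever DB table would hold it:

```python
def capture_fields(soup):
    """Map every id attribute in a parsed page to the stripped text of its tag."""
    return {tag["id"]: tag.get_text(strip=True) for tag in soup.find_all(id=True)}


def apply_mapping(raw, mapping):
    """Rename captured fields using a per-site {raw_id: clean_name} mapping."""
    return {name: raw[raw_id] for raw_id, name in mapping.items() if raw_id in raw}


def demo():
    """Run both steps on a tiny made-up listing page."""
    from bs4 import BeautifulSoup  # requires the beautifulsoup4 package
    html = "<table><tr><td id=price> 100 </td><td id=addr>1 Main St</td></tr></table>"
    raw = capture_fields(BeautifulSoup(html, "html.parser"))
    # The mapping is written once per site and reused until the HTML changes.
    return apply_mapping(raw, {"price": "price", "addr": "address"})
```

Sites that mark values with `class` attributes or bare table positions instead of `id` would need a different capture rule, so this covers only the simplest case described in the question.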
