Re: is BS for scraping large number of sites (5000+), unstructured html, giving me structured data?


AndriusZilenas

Feb 16, 2012, 9:10:38 AM
to beauti...@googlegroups.com
Sometimes I use
http://www.outwit.com/
to understand more quickly how to deal with the structure.

***


On Thu, Feb 16, 2012 at 03:34, Bruce Eckel <bruce...@gmail.com> wrote:
Create a short demo that just grabs each of your sites and turns each one into a BeautifulSoup tree. The parsing is where most of the time is, so if that demo is within your parameters then it should work.
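
A minimal sketch of such a demo, assuming the pages have already been downloaded into a dict of HTML strings (fetching is left out here; something like urllib could supply it). The function name `time_parsing`, the sample URLs, and the sample HTML are made up for illustration, and the built-in "html.parser" is used only so the sketch runs without extra installs:

```python
import time

from bs4 import BeautifulSoup


def time_parsing(pages, parser="html.parser"):
    """Parse each HTML string into a tree and return {url: seconds spent parsing}."""
    timings = {}
    for url, html in pages.items():
        start = time.perf_counter()
        BeautifulSoup(html, parser)  # build the tree; the result is discarded
        timings[url] = time.perf_counter() - start
    return timings


# Hypothetical sample data; a real run would use HTML fetched from your sites.
sample_pages = {
    "http://example.com/listing1": "<html><body><td id=price> 100 </td></body></html>",
    "http://example.com/listing2": "<html><body><p>No price here</body></html>",
}

timings = time_parsing(sample_pages)
for url, seconds in timings.items():
    print(f"{url}: {seconds:.4f}s")
print(f"total: {sum(timings.values()):.4f}s")
```

Running this over a representative sample of real pages, and passing parser="lxml" if lxml is installed, should give a rough idea of whether parsing time fits your budget.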
On Thu, Feb 16, 2012 at 1:31 AM, Dan Tarasenko <daninth...@gmail.com> wrote:
Thanks, yeah it looks the goods, but I'm worried I cannot find anything
that says it can actually crawl the websites - it seems I need something
else to do the crawling. Any ideas?

On 2/16/12, Bruce Eckel <bruce...@gmail.com> wrote:
> I'll just point out that BS4 can use lxml as its parser, which is way
> faster than BS3, so that might have been one of the reasons people
> suggested alternatives for large crawls. BS4 will get even faster,
> apparently, by the time it's released. So it's certainly worth a try, IMO.
>
> -- Bruce Eckel
> www.Reinventing-Business.com
> www.MindviewInc.com
>
>
>
> On Wed, Feb 15, 2012 at 11:50 PM, dan123456789
> <daninth...@gmail.com> wrote:
>
>> Hi,
>> I'm looking at crawling 5000+ websites and need a solution. They are
>> real estate listings, so the data is similar, but every site has its
>> own HTML code - they are all unique sites. No clean data feed or API is
>> available.
>>
>> I am looking for a solution that is halfway intelligent, or I can
>> program intelligence into it. Something I can just load the root
>> domains into, it crawls, and will capture data between html tags and
>> present it in a somewhat orderly manner. I cannot write a unique
>> parser for every site.
>>
>> What I need is something that will capture everything, so that I will
>> know that, say, Field XYZ holds the price (because the HTML on every
>> page of that site that had a price was <td id=price> 100 </td>), for
>> example.
>>
>> Is Beautiful Soup for me? I'm asking the same thing of Scrapy as well.
>> I think BS will handle the HTML mess but others say Scrapy handles
>> large crawls better.
>>
>> I'm hoping to load the captured data into some sort of DB, map the
>> fields to what I need (e.g. find the field that is price and call it
>> price), then that becomes the parser/clean data for that site until the
>> HTML changes. Perhaps there is an easier way?
>>
>> Any ideas on how to do this with BS if possible?
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "beautifulsoup" group.
>> To post to this group, send email to beauti...@googlegroups.com.
>> To unsubscribe from this group, send email to
>> beautifulsou...@googlegroups.com.
>> For more options, visit this group at
>> http://groups.google.com/group/beautifulsoup?hl=en.
>>
>>
>
>
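
The "capture everything, then map the fields" idea from the original question could be sketched roughly like this with BS4. This is only an illustration under assumptions: it captures just the elements that carry an id attribute, and the names `capture_fields` and `site_mapping`, the `suburb` field, and the sample HTML are all hypothetical:

```python
from bs4 import BeautifulSoup


def capture_fields(html):
    """Return {id attribute: stripped text} for every element that has an id."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for tag in soup.find_all(id=True):  # id=True matches any tag with an id
        fields[tag["id"]] = tag.get_text(strip=True)
    return fields


# Hypothetical page from one site, using the <td id=price> pattern from the question.
page = "<table><tr><td id=price> 100 </td><td id=suburb>Richmond</td></tr></table>"
fields = capture_fields(page)

# Hypothetical per-site mapping, written once by hand after inspecting the capture.
site_mapping = {"price": "price", "suburb": "location"}
record = {site_mapping[key]: value for key, value in fields.items() if key in site_mapping}
print(record)
```

The mapping is the only per-site piece, so it would need revisiting whenever that site's HTML changes. Sites that mark up their data with classes or bare table cells rather than ids would need a different capture rule.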
