I want a MultiSoupStrainer and WildSoupStrainer!
From: "sstein...@gmail.com" <sstein...@gmail.com>
Date: Mon, 8 Feb 2010 11:26:52 -0500
Local: Mon, Feb 8 2010 11:26 am
Subject: I want a MultiSoupStrainer and WildSoupStrainer!
Hey all...

        I'm writing a fairly large application that has need to digest tons of HTML.  Tons, like gigabytes, not like a hundred pages.

        I've used SoupStrainers pretty heavily but that seems pretty wasteful since I'm passing over the data multiple times.

        Some of the pages are quite large, and there are many being processed at once, so I don't think I can get away with reading them in once and using the find*() methods on them (maybe scaling back on threading and doing that would be better?).

        So I've come up with (but not written) two new SoupStrainer classes that could be used to selectively strain documents on multiple tags in one pass.

        If this has been done, I'd appreciate a pointer to where.  If not, comments are welcome and, if I have to do this to un-bog my app, I'll contribute them back.

        Please excuse any not-quite-right-ness, this is an idea, not an implementation or a detailed spec.

Thanks,

S

    What I really need here is a MultiSoupStrainer that will make one pass
    over all the HTML and give me back a dict of tags, keyed by tag:

    html = '''
    <html>
    <body>
    <b>first bold text</b>
    <a href="someurl.com">Link text</a>
    <h1>H1 text here</h1>
    <h1>H1 text there</h1>
    <h2>H2 text yay!</h2>
    <b>second bold text</b>
    <a href="someotherurl.com">Other link text</a>
    </body>
    </html>
    '''
    multiStrainer = MultiSoupStrainer(['b','a','h1'])
    tagsDict = BS(html, parseOnlyThese=multiStrainer)

    print tagsDict
    {
        'a': [firstAtagObject, secondAtagObject],       # someurl.com, someotherurl.com
        'h1': [firstH1tagObject, secondH1tagObject],    # 'H1 text here', 'H1 text there'
        'b': [firstBtagObject, secondBtagObject]        # 'first bold text', 'second bold text'
    }

    Also, I'd like a WildSoupStrainer that will let me collect, for example, all "h?" tags
    (h1, h2, and so on) at once.

    wildStrainer = WildSoupStrainer('h?')
    tagsDict = BS(html, parseOnlyThese=wildStrainer)

    print tagsDict
    {
        'h1': [firstH1object, secondH1object],    # 'H1 text here', 'H1 text there'
        'h2': [firstH2object]                     # 'H2 text yay!'
    }


 
From: Zulq Alam <zulq.a...@googlemail.com>
Date: Mon, 8 Feb 2010 09:52:04 -0800 (PST)
Local: Mon, Feb 8 2010 12:52 pm
Subject: Re: I want a MultiSoupStrainer and WildSoupStrainer!
I wouldn't write anything until I knew:

   How many pages?
   How long do you have to do them: an hour, a day, a week?
   How long does it take, on average, per page?
   Do you have the files or are you downloading them as you go?

That aside, you should be able to strain as you describe already:

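import re
from BeautifulSoup import SoupStrainer  # Beautiful Soup 3.x import path

# MultiSoupStrainer: a list of names matches any of them
strainer = SoupStrainer(name=['b','a','h1'])

# WildSoupStrainer: a compiled regex is tested against each tag name
strainer = SoupStrainer(name=re.compile(r'h.'))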

Producing a tag dict would take another pass or you would need to
modify BS. But don't do that unless you actually need to, hence my
initial questions.

Hope this helps,

Zulq
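
The tag dictionary from the original post can be approximated in a single pass. The following is a minimal sketch that uses only the standard library's html.parser instead of Beautiful Soup; that swap is an assumption of this example, and html.parser is less forgiving of broken markup. The names collect_tags, TagTextCollector and SAMPLE are hypothetical, and shell-style patterns from fnmatch stand in for both the name list and the 'h?' wildcard.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from html.parser import HTMLParser

# Sample markup from the original post (with the <h2> closed correctly).
SAMPLE = """
<html>
<body>
<b>first bold text</b>
<a href="someurl.com">Link text</a>
<h1>H1 text here</h1>
<h1>H1 text there</h1>
<h2>H2 text yay!</h2>
<b>second bold text</b>
<a href="someotherurl.com">Other link text</a>
</body>
</html>
"""


class TagTextCollector(HTMLParser):
    """Hypothetical helper: gather the text of wanted tags, keyed by tag name."""

    def __init__(self, patterns):
        super().__init__()
        self.patterns = patterns          # shell-style patterns, e.g. 'a' or 'h?'
        self.found = defaultdict(list)    # tag name -> list of text contents
        self._open = []                   # stack of (tag, text parts) still being read

    def _wanted(self, tag):
        return any(fnmatchcase(tag, p) for p in self.patterns)

    def handle_starttag(self, tag, attrs):
        if self._wanted(tag):
            self._open.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every wanted tag that is currently open.
        for _tag, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        # Only well-nested markup is handled; a mismatched or missing
        # end tag leaves the entry on the stack and it is never reported.
        if self._open and self._open[-1][0] == tag:
            name, parts = self._open.pop()
            self.found[name].append("".join(parts))


def collect_tags(html, patterns):
    """Parse html once and return {tag name: [text, ...]} for matching tags."""
    parser = TagTextCollector(patterns)
    parser.feed(html)
    parser.close()
    return dict(parser.found)
```

Calling collect_tags(SAMPLE, ['b', 'a', 'h1']) plays the role of the proposed MultiSoupStrainer, and collect_tags(SAMPLE, ['h?']) the role of the WildSoupStrainer. The values here are text strings rather than tag objects, which keeps the sketch short.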

From: "sstein...@gmail.com" <sstein...@gmail.com>
Date: Mon, 8 Feb 2010 13:03:31 -0500
Local: Mon, Feb 8 2010 1:03 pm
Subject: Re: I want a MultiSoupStrainer and WildSoupStrainer!
On Feb 8, 2010, at 12:52 PM, Zulq Alam wrote:

> I wouldn't write anything until I knew:

>   How many pages?

About 5000 per run, so far.

>   How long do you have to do them: an hour, a day, a week?

They're used for reports as soon as they're done so not long; I start processing them as soon as they arrive.

>   How long does it take, on average, per page?

Web latency time to get them, soup extraction time as they're processed.  I'm adding more tag digestion right now and things are beginning to bog down so that my processing is taking longer than the original collection.

>   Do you have the files or are you downloading them as you go?

I have them stored in a database.

> That aside, you should be able to strain as you describe already:

> # MultiSoupStrainer
> strainer = SoupStrainer(name=['b','a','h1'])

I didn't see that multiple names were allowed.  The docs seem to always reference and show the tag name as singular.

> # WildSoupStrainer
> import re
> strainer = SoupStrainer(name=re.compile(r'h.'))

I didn't see that you could do that either!  

> Producing a tag dict would take another pass or you would need to
> modify BS. But don't do that unless you actually need to, hence my
> initial questions.

If I can do the two things above, I can fudge having the nice neat dictionary (I think).  I'd love to avoid modifications to BS itself if I can.

Thanks, I'll go give those two things a try!

S


 
From: "sstein...@gmail.com" <sstein...@gmail.com>
Date: Mon, 8 Feb 2010 13:22:04 -0500
Local: Mon, Feb 8 2010 1:22 pm
Subject: Re: I want a MultiSoupStrainer and WildSoupStrainer!
On Feb 8, 2010, at 12:52 PM, Zulq Alam wrote:

> That aside, you should be able to strain as you describe already:

> # MultiSoupStrainer
> strainer = SoupStrainer(name=['b','a','h1'])

This works great, thanks so much!

> # WildSoupStrainer
> import re
> strainer = SoupStrainer(name=re.compile(r'h.'))

I seem to just get back an unfiltered bucket of soup from using this strainer.  I'll have to poke around some more.  Maybe I have to make it into a callable?

Thanks,

S
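
A likely explanation for the unfiltered result, offered as an assumption rather than something confirmed in this thread: Beautiful Soup appears to test a compiled pattern against each tag name with search(), which succeeds if the pattern occurs anywhere in the name. Under that assumption r'h.' also matches "html", so the strainer keeps the root element and everything inside it. The sketch below uses only the re module to show the difference an anchored pattern makes; TAG_NAMES and names_matching are hypothetical names for illustration.

```python
import re

# A few tag names that show up in almost every HTML page.
TAG_NAMES = ["html", "head", "body", "h1", "h2", "th", "b", "a"]

loose = re.compile(r"h.")            # the pattern suggested in the thread
anchored = re.compile(r"^h[1-6]$")   # heading tags only: h1 through h6


def names_matching(pattern, names):
    """Return the names in which pattern.search() finds a match."""
    return [name for name in names if pattern.search(name)]


loose_hits = names_matching(loose, TAG_NAMES)
anchored_hits = names_matching(anchored, TAG_NAMES)
# loose_hits    -> ['html', 'head', 'h1', 'h2']
# anchored_hits -> ['h1', 'h2']
```

If the assumption holds, passing the anchored pattern as the name argument, SoupStrainer(name=re.compile(r'^h[1-6]$')), should restrict the strainer to heading tags.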


 
From: Aaron DeVore <aaron.dev...@gmail.com>
Date: Mon, 8 Feb 2010 11:52:09 -0800
Local: Mon, Feb 8 2010 2:52 pm
Subject: Re: I want a MultiSoupStrainer and WildSoupStrainer!

On Mon, Feb 8, 2010 at 8:26 AM, sstein...@gmail.com <sstein...@gmail.com> wrote:
> Hey all...

>        I'm writing a fairly large application that has need to digest tons of HTML.  Tons, like gigabytes, not like a hundred pages.

>        I've used SoupStrainers pretty heavily but that seems pretty wasteful since I'm passing over the data multiple times.

>        Some of the pages are quite large, and there are many being processed at once, so I don't think I can get away with reading them in once and using the find*() methods on them (maybe scaling back on threading and doing that would be better?).

You might want to look at lxml. I very much prefer Beautiful Soup's
API, but its performance on gigabytes of pages is going to make you
very unhappy. It would easily take several hours.

lxml, on the other hand, is ridiculously fast and memory efficient.
Even several gigabytes shouldn't take much more than about 20 minutes,
if that.

If you do decide to stick with Beautiful Soup, make sure to call
soup.decompose()! If the Python interpreter doesn't manage to garbage
collect all of the elements then you'll run out of memory very
quickly.

Good Luck!
Aaron DeVore
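
Some background on why decompose() matters: Beautiful Soup tags point at their parents as well as their children, so a parsed tree is full of reference cycles that reference counting alone cannot free. The following is a minimal sketch of that effect, assuming CPython. The Node class is a hypothetical stand-in for a parsed tag and is not Beautiful Soup's own code; its decompose() only illustrates the idea of breaking the links.

```python
import gc
import weakref


class Node(object):
    """Hypothetical stand-in for a parsed tag with parent and child links."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def decompose(self):
        """Recursively break parent/child links so no cycles remain."""
        for child in self.children:
            child.decompose()
        self.children = []
        self.parent = None


def build_tree():
    root = Node("html")
    body = Node("body", root)
    Node("b", body)
    return root


def is_freed_without_gc(break_cycles):
    """Report whether a child node is freed by reference counting alone."""
    gc.disable()                      # keep the cycle collector out of the picture
    try:
        root = build_tree()
        probe = weakref.ref(root.children[0])   # watches the <body> node
        if break_cycles:
            root.decompose()
        del root
        return probe() is None
    finally:
        gc.enable()
        gc.collect()                  # clean up whatever the test left behind
```

With break_cycles=True the node disappears as soon as the last name referring to the tree is deleted; with break_cycles=False it lingers until the cycle collector runs, which is the memory build-up the advice above is meant to avoid.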


 
From: "sstein...@gmail.com" <sstein...@gmail.com>
Date: Mon, 8 Feb 2010 20:37:12 -0500
Local: Mon, Feb 8 2010 8:37 pm
Subject: Re: I want a MultiSoupStrainer and WildSoupStrainer!

On Feb 8, 2010, at 2:52 PM, Aaron DeVore wrote:

> On Mon, Feb 8, 2010 at 8:26 AM, sstein...@gmail.com <sstein...@gmail.com> wrote:
>> Hey all...

>>        I'm writing a fairly large application that has need to digest tons of HTML.  Tons, like gigabytes, not like a hundred pages.

>>        I've used SoupStrainers pretty heavily but that seems pretty wasteful since I'm passing over the data multiple times.

>>        Some of the pages are quite large, and there are many being processed at once, so I don't think I can get away with reading them in once and using the find*() methods on them (maybe scaling back on threading and doing that would be better?).

> You might want to look at lxml. I very much prefer Beautiful Soup's
> API, but its performance on gigabytes of pages is going to make you
> very unhappy. It would easily take several hours.

Funny you should mention that.  I'm using lxml for XML processing, but I'm so used to Beautiful Soup's API that I didn't even think of using it for HTML.

I know it whips right through my XML, but I'm not so sure about some of these 'dirty' pages.  I've had so many things just give up on bad HTML that I've come to rely on BeautifulSoup (3.0.x) to just handle it.  I'm not sure how forgiving lxml can be.  Faster and wrong is more wrong than it is faster.

> If you do decide to stick with Beautiful Soup, make sure to call
> soup.decompose()! If the Python interpreter doesn't manage to garbage
> collect all of the elements then you'll run out of memory very
> quickly.

Yes, I think that part of what's going on is that I'm not aggressively letting go of things I'm done with.  Guess it's time to get out heapy/guppy/whatever-is-in-fashion and see what I'm leaving behind.

Thanks!

S


 
 
From: Aaron DeVore <aaron.dev...@gmail.com>
Date: Mon, 8 Feb 2010 21:45:51 -0800
Local: Tues, Feb 9 2010 12:45 am
Subject: Re: I want a MultiSoupStrainer and WildSoupStrainer!

On Mon, Feb 8, 2010 at 5:37 PM, sstein...@gmail.com <sstein...@gmail.com> wrote:
>> You might want to look at lxml. I very much prefer Beautiful Soup's
>> API, but its performance on gigabytes of pages is going to make you
>> very unhappy. It would easily take several hours.

> Funny you should mention that.  I'm using lxml for XML processing, but I'm so used to Beautiful Soup's API that I didn't even think of using it for HTML.

> I know it whips right through my XML, but I'm not so sure about some of these 'dirty' pages.  I've had so many things just give up on bad HTML that I've come to rely on BeautifulSoup (3.0.x) to just handle it.  I'm not sure how forgiving lxml can be.  Faster and wrong is more wrong than it is faster.

lxml's lenient HTML parser has a good reputation. I haven't used it
myself, so I can't give any more details than that.

 