I'm writing a fairly large application that has need to digest tons of HTML. Tons, like gigabytes, not like a hundred pages.
I've used SoupStrainers pretty heavily but that seems pretty wasteful since I'm passing over the data multiple times.
Some of the pages are quite large, and there are many being processed at once, so I don't think I can get away with reading them in once and using the find*() methods on them (maybe scaling back on threading and doing that would be better?).
So I've come up with (but not written) two new SoupStrainer classes that could be used to selectively strain documents on multiple tags in one pass.
If this has been done, I'd appreciate a pointer to where. If not, comments are welcome and, if I have to do this to un-bog my app, I'll contribute them back.
Please excuse any not-quite-right-ness, this is an idea, not an implementation or a detailed spec.
Thanks,
S
What I really need here is a MultiSoupStrainer that will make one pass over all the HTML and give me back a dict of tags, keyed by tag:
    html = '''
    <html>
    <body>
    <b>first bold text</b>
    <a href="someurl.com">Link text</a>
    <h1>H1 text here</h1>
    <h1>H1 text there</h1>
    <h2>H2 text yay!</h2>
    <b>second bold text</b>
    <a href="someotherurl.com">Other link text</a>
    </body>
    </html>
    '''
    multiStrainer = MultiSoupStrainer(['b', 'a', 'h1'])
    tagsDict = BS(html, parseOnlyThese=multiStrainer)
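For comparison, here is a minimal sketch of the one-pass, dict-keyed-by-tag idea using only the standard library's `html.parser` instead of Beautiful Soup. It is not a SoupStrainer; `MultiTagCollector` and `collect_tags` are hypothetical names invented for this illustration, and it collects text only, not tag objects.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MultiTagCollector(HTMLParser):
    """Collect the text of several tag names in one pass (hypothetical helper)."""

    def __init__(self, names):
        super().__init__()
        self.names = set(names)
        self.found = defaultdict(list)  # tag name -> list of text strings
        self._open = []                 # stack of (name, text parts) for wanted tags

    def handle_starttag(self, tag, attrs):
        if tag in self.names:
            self._open.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every wanted tag that is currently open.
        for _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        # Close the innermost open tag with this name, if there is one.
        # Wanted tags that are never closed are not recorded.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                name, parts = self._open.pop(i)
                self.found[name].append("".join(parts).strip())
                break


def collect_tags(html, names):
    """Return {tag name: [text, ...]} for the requested tag names."""
    parser = MultiTagCollector(names)
    parser.feed(html)
    parser.close()
    return dict(parser.found)
```

Calling `collect_tags(html, ['b', 'a', 'h1'])` on the sample above should give one list of strings per requested tag name. How well this copes with badly broken markup is a separate question from how Beautiful Soup does.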
How many pages? How long do you have to do them: an hour, a day, a week? How long does it take, on average, per page? Do you have the files or are you downloading them as you go?
That aside, you should be able to strain as you describe already:
    # WildSoupStrainer
    import re
    strainer = SoupStrainer(name=re.compile(r'h.'))
Producing a tag dict would take another pass or you would need to modify BS. But don't do that unless you actually need to, hence my initial questions.
Hope this helps,
Zulq
> How long do you have to do them: an hour, a day, a week?
They're used for reports as soon as they're done so not long; I start processing them as soon as they arrive.
> How long does it take, on average, per page?
Web latency time to get them, soup extraction time as they're processed. I'm adding more tag digestion right now and things are beginning to bog down so that my processing is taking longer than the original collection.
> Do you have the files or are you downloading them as you go?
I have them stored in a database.
> That aside, you should be able to strain as you describe already:
I didn't see that multiple names were allowed. The docs seem to always reference and show the tag name as singular.
> # WildSoupStrainer
> import re
> strainer = SoupStrainer(name=re.compile(r'h.'))
I didn't see that you could do that either!
> Producing a tag dict would take another pass or you would need to modify BS. But don't do that unless you actually need to, hence my initial questions.
If I can do the two things above, I can fudge having the nice neat dictionary (I think). I'd love to avoid modifications to BS itself if I can.
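A rough sketch of that "fudge", written against the newer `bs4` package rather than the BeautifulSoup 3.0.x used in this thread (so `parse_only` stands in for `parseOnlyThese`). It assumes `bs4` is installed; `tags_by_name` is a made-up helper name. The grouping loop is still a second pass, but only over the already strained tree.

```python
from collections import defaultdict

from bs4 import BeautifulSoup, SoupStrainer


def tags_by_name(html, names):
    """Parse only the named tags, then group them into {name: [Tag, ...]}."""
    wanted = set(names)
    strainer = SoupStrainer(name=list(wanted))  # a list of names is accepted
    soup = BeautifulSoup(html, "html.parser", parse_only=strainer)

    grouped = defaultdict(list)
    for tag in soup.find_all(True):
        # Children nested inside a kept tag also appear here, so filter again.
        if tag.name in wanted:
            grouped[tag.name].append(tag)
    return dict(grouped)
```

The values are live `Tag` objects, so they keep the strained tree alive; copy out the strings you need if memory is the concern.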
> # WildSoupStrainer
> import re
> strainer = SoupStrainer(name=re.compile(r'h.'))
I seem to just get back an unfiltered bucket of soup from using this strainer. I'll have to poke around some more. Maybe I have to make it into a callable?
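One possible explanation, offered as a guess rather than a diagnosis: if the strainer applies the compiled pattern with `search()` instead of requiring a full match (which is how the Beautiful Soup documentation describes regular-expression filters), then `h.` also matches `html` and `head`, so the whole `<html>` element passes the strainer. The sketch below uses only `re` to show which names each pattern accepts; the tag list and the anchored pattern are illustrative assumptions.

```python
import re

loose = re.compile(r'h.')          # pattern from the thread
strict = re.compile(r'^h[1-6]$')   # anchored alternative (assumes headings were the goal)

tag_names = ['html', 'head', 'body', 'h1', 'h2', 'th', 'b', 'a']


def matching(pattern, names):
    """Return the names the pattern accepts when applied with search()."""
    return [name for name in names if pattern.search(name)]


loose_hits = matching(loose, tag_names)
strict_hits = matching(strict, tag_names)
print(loose_hits)   # ['html', 'head', 'h1', 'h2']
print(strict_hits)  # ['h1', 'h2']
```

If that is what is happening, anchoring the pattern should narrow the strainer to headings only, with no callable needed.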
On Mon, Feb 8, 2010 at 8:26 AM, sstein...@gmail.com <sstein...@gmail.com> wrote:
> Hey all...
> I'm writing a fairly large application that has need to digest tons of HTML. Tons, like gigabytes, not like a hundred pages.
> I've used SoupStrainers pretty heavily but that seems pretty wasteful since I'm passing over the data multiple times.
> Some of the pages are quite large, and there are many being processed at once, so I don't think I can get away with reading them in once and using the find*() methods on them (maybe scaling back on threading and doing that would be better?).
You might want to look at lxml. I very much prefer Beautiful Soup's API, but its performance on gigabytes of pages is going to make you very unhappy. It would easily take several hours.
lxml, on the other hand, is ridiculously fast and memory efficient. Even several gigabytes shouldn't take much more than about 20 minutes, if that.
If you do decide to stick with Beautiful Soup, make sure to call soup.decompose()! If the Python interpreter doesn't manage to garbage collect all of the elements then you'll run out of memory very quickly.
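A minimal sketch of that pattern, assuming the newer `bs4` package is installed; `extract_links` is a hypothetical helper and the link extraction is only a stand-in for whatever each page actually needs.

```python
from bs4 import BeautifulSoup


def extract_links(pages):
    """Yield (index, hrefs) for each HTML string, freeing each tree when done."""
    for index, html in enumerate(pages):
        soup = BeautifulSoup(html, "html.parser")
        try:
            # Copy out plain strings so nothing keeps a reference into the tree.
            hrefs = [str(a["href"]) for a in soup.find_all("a", href=True)]
        finally:
            soup.decompose()  # tear down the tree's internal references
        yield index, hrefs
```

The important part is that only plain strings leave the loop body, so each tree can be released before the next page is parsed.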
> You might want to look at lxml. I very much prefer Beautiful Soup's API, but its performance on gigabytes of pages is going to make you very unhappy. It would easily take several hours.
Funny you should mention that. I'm using lxml for XML processing, but I'm so used to Beautiful Soup's API that I didn't even think of using it for HTML.
I know it whips right through my XML, but I'm not so sure about some of these 'dirty' pages. I've had so many things just give up on bad HTML that I've come to rely on BeautifulSoup (3.0.x) to just handle it. I'm not sure how forgiving lxml can be. Faster and wrong is more wrong than it is faster.
> If you do decide to stick with Beautiful Soup, make sure to call soup.decompose()! If the Python interpreter doesn't manage to garbage collect all of the elements then you'll run out of memory very quickly.
Yes, I think that part of what's going on is that I'm not aggressively letting go of things I'm done with. Guess it's time to get out heapy/guppy/whatever-is-in-fashion and see what I'm leaving behind.
On Mon, Feb 8, 2010 at 5:37 PM, sstein...@gmail.com <sstein...@gmail.com> wrote:
>> You might want to look at lxml. I very much prefer Beautiful Soup's API, but its performance on gigabytes of pages is going to make you very unhappy. It would easily take several hours.
> Funny you should mention that. I'm using lxml for XML processing, but I'm so used to Beautiful Soup's API that I didn't even think of using it for HTML.
> I know it whips right through my XML, but I'm not so sure about some of these 'dirty' pages. I've had so many things just give up on bad HTML that I've come to rely on BeautifulSoup (3.0.x) to just handle it. I'm not sure how forgiving lxml can be. Faster and wrong is more wrong than it is faster.
lxml's lenient HTML parser has a good reputation. I haven't used it myself, so I can't give any more details than that.