How to strain for everything bound by a particular class value

OlyDLG

Oct 4, 2016, 11:35:41 PM

to beautifulsoup

Hi! Relative newbie here. I'm trying to scrape http://www.secinfo.com/dScj2.w82d.htm for just the source bound by a class_ value of "s-70". I've gotten as far as creating a strainer on that class value, which successfully retrieves the tags that carry it, but I want to retrieve everything, tags and content, that lies between those tags. For example, here's my strainer and the start of what it returns:

import bs4
from bs4 import BeautifulSoup as BS
# result holds the HTML fetched from the URL above
strainer = bs4.SoupStrainer(class_="s-70")
strainedSoup = BS(result, parse_only=strainer)
print(strainedSoup.prettify()[:400])
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<p class="s-70">
 TIAA-CREF FUNDS -
 <b>
  Social Choice Low Carbon Equity Fund
 </b>
</p>
<p class="s-70">
 TIAA-CREF FUNDS
 <b>
  - Social Choice Low Carbon Equity Fund
 </b>
</p>
<p class="s-70">
 TIAA-CREF FUNDS
 <b>
  - Social Choice Low Carbon Equity Fund
 </b>
</p>
<p class="s-70">
 TIAA-C

And here's just a small amount of the source between the first occurrence of class="s-70" and the second (and there are a dozen occurrences!):

<p class=s-70>TIAA-CREF FUNDS - <b>Social Choice Low Carbon Equity Fund</b></p>
<p class=s-1><b> </b></p>
<p class=s-1><b>TIAA-CREF FUNDS</b></p>
<p class=s-1><b>SOCIAL CHOICE LOW CARBON EQUITY FUND</b></p>
<p class=s-1><b>SCHEDULE OF INVESTMENTS (unaudited)</b></p>
<p class=s-1><b><A HRef=#Dates OnMouseOver="return _(this,D)">July 31, 2016</A></b></p>
<p class=s-1> </p>
<table cellpadding=0 cellspacing=0 class=s-b>
<tr class=s-c> <td colspan=2 class=s-3s>SHARES</td> <td> </td> <td class=s-1b> </td><td class=s-h> </td> <td class=s-h><font class=s-71>COMPANY</font></td><td class=s-h> </td> <td colspan=2 class=s-2e>VALUE</td><td class=s-3t> </td></tr>
<tr class=s-c> <td class=s-k> </td><td class=s-10> </td> <td> </td> <td class=s-10> </td><td> </td> <td> </td><td> </td> <td class=s-k> </td><td class=s-10> </td><td class=s-k> </td></tr>
<tr class=s-c> <td colspan=6 class=s-m><b>COMMON STOCKS - 99.9%</b></td> <td> </td> <td> </td><td class=s-10> </td><td class=s-k> </td></tr>
<tr class=s-c> <td class=s-k> </td><td class=s-10> </td> <td> </td> <td class=s-10> </td><td> </td> <td> </td><td> </td> <td class=s-k> </td><td class=s-10> </td><td class=s-k> </td></tr>
<tr class=s-c> <td colspan=6 class=s-k>AUTOMOBILES & COMPONENTS - 1.5%</td><td> </td> <td class=s-k> </td><td class=s-10> </td><td class=s-k> </td></tr>
<tr class=s-c> <td class=s-n> </td><td class=s-o>29,505</td> <td class=s-1f> </td> <td class=s-72> </td><td class=s-p> </td> <td class=s-73>Ford Motor Co</td><td class=s-s> </td> <td class=s-n>$</td><td class=s-t>373,533</td><td class=s-n> </td></tr>
<tr class=s-c> <td class=s-k> </td><td class=s-10>499</td> <td> </td> <td class=s-10> </td><td> </td> <td>Harley-Davidson, Inc</td><td> </td> <td class=s-k> </td><td class=s-10>26,407</td><td class=s-k> </td></tr>
<tr class=s-c> <td class=s-k> </td><td class=s-10>3,549</td> <td> </td> <td class=s-10> </td><td> </td>

Essentially, class s-70 is serving as a proxy for a new page of content, but also as an indicator of precisely those pages I want to grab from a much, much larger document (~100,000 lines of source!). So, to recap, I want to grab not just the lines (i.e., tags) actually containing the s-70 class indicator, but everything following each such line, up to a stop indicator to be determined. Does Soup support this kind of straining directly, does someone have some code I can have, or am I going to have to "roll my own"?
Thanks! OlyDLG
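
A minimal sketch of one way to grab "everything following each such line" with Beautiful Soup alone: parse the whole document, find each s-70 tag, and walk its next_siblings until the next s-70 tag appears. This assumes the markers and the content that follows them share a parent, which may not hold for the real page. The sample HTML and the names is_marker and sections_after_markers are hypothetical, and stopping at the next marker is only a stand-in for the stop indicator still to be determined.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature of the page: each s-70 paragraph starts a new "page".
SAMPLE_HTML = """
<div>
<p class="s-70">Fund A</p>
<p class="s-1">Schedule A</p>
<table class="s-b"><tr><td>Ford Motor Co</td></tr></table>
<p class="s-70">Fund B</p>
<table class="s-b"><tr><td>Harley-Davidson, Inc</td></tr></table>
</div>
"""


def is_marker(node, marker_class="s-70"):
    """Return True if node is a tag that carries the marker class."""
    return isinstance(node, Tag) and marker_class in (node.get("class") or [])


def sections_after_markers(soup, marker_class="s-70"):
    """Return one list of tags per marker: the marker plus its following siblings."""
    sections = []
    for marker in soup.find_all(class_=marker_class):
        section = [marker]
        for sibling in marker.next_siblings:
            if is_marker(sibling, marker_class):
                break  # assumed stop indicator: the next marker
            if isinstance(sibling, Tag):  # skip whitespace-only text nodes
                section.append(sibling)
        sections.append(section)
    return sections


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sections = sections_after_markers(soup)
for section in sections:
    print([tag.name for tag in section])
```

Each inner list holds complete tags, so the tables can be searched with the usual find and find_all calls. Note that this approach parses the full document first, unlike parse_only, so it trades memory for simplicity.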

OlyDLG

Oct 5, 2016, 1:26:04 AM

to beautifulsoup

I think I understand the problem. The tables I want to extract data from aren't nested between the opening and closing <p> tags that carry the s-70 class, so they aren't descendants of those tags, and straining for those tags doesn't retrieve them. In other words, since the class value I'm straining on is only associated with those <p> tags, those tags (and their own content) are the only things that get strained. The tables are essentially siblings (or "cousins") of these <p> tags: the only useful relation between them is "spatial," not "arboreal."

If I'm understanding BeautifulSoup's capabilities correctly, it would seem I need another approach: first treat the result returned from the URL as a string and split out the section I'm interested in based on "geographic" markers, then make soup out of that reduced string. I'm having some success going about it that way, but I would still like to know whether I'm wrong that the initial extraction can't (readily) be done with BS and, if it is possible, to receive some guidance for doing so. Thanks!
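
The string-splitting approach described above might look roughly like the sketch below. It is not the code actually used for this page: the sample string, the MARKER pattern, and the function name split_on_markers are assumptions for illustration, and the last chunk simply runs to the end of the input because no stop marker has been chosen yet.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw page source (attributes unquoted, as in the excerpt).
RAW_HTML = (
    "<p class=s-1>Cover page</p>"
    "<p class=s-70>Fund A</p>"
    "<table class=s-b><tr><td>Ford Motor Co</td></tr></table>"
    "<p class=s-70>Fund B</p>"
    "<table class=s-b><tr><td>Harley-Davidson, Inc</td></tr></table>"
)

# Assumed "geographic" marker: the opening tag of an s-70 paragraph, quoted or not.
MARKER = re.compile(r'<p\s+class="?s-70"?\s*>', re.IGNORECASE)


def split_on_markers(raw_html):
    """Cut the raw string at each marker and parse each piece separately."""
    starts = [match.start() for match in MARKER.finditer(raw_html)]
    bounds = starts + [len(raw_html)]
    chunks = [raw_html[a:b] for a, b in zip(bounds, bounds[1:])]
    # Anything before the first marker is dropped; each chunk becomes its own soup.
    return [BeautifulSoup(chunk, "html.parser") for chunk in chunks]


pages = split_on_markers(RAW_HTML)
for page in pages:
    print(page.p.get_text(), "->", page.find("td").get_text())
```

Because each chunk is cut from the middle of a document, it may contain unclosed tags; the parser will close them on its own, which is usually acceptable when the goal is only to read table cells.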

DLG
