Here is my strainer and the first 400 characters of what it returns, followed by just a small amount of the raw source between the first occurrence of class="s-70" and the second (and there are a dozen occurrences!):

strainer = bs4.SoupStrainer(class_="s-70")
strainedSoup = BS(result, parse_only=strainer)
print(strainedSoup.prettify()[:400])
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<p class="s-70">
TIAA-CREF FUNDS -
<b>
Social Choice Low Carbon Equity Fund
</b>
</p>
<p class="s-70">
TIAA-CREF FUNDS
<b>
- Social Choice Low Carbon Equity Fund
</b>
</p>
<p class="s-70">
TIAA-CREF FUNDS
<b>
- Social Choice Low Carbon Equity Fund
</b>
</p>
<p class="s-70">
TIAA-C

<p class=s-70>TIAA-CREF FUNDS - <b>Social Choice Low Carbon Equity Fund</b></p>
<p class=s-1><b> </b></p>
<p class=s-1><b>TIAA-CREF FUNDS</b></p>
<p class=s-1><b>SOCIAL CHOICE LOW CARBON EQUITY FUND</b></p>
<p class=s-1><b>SCHEDULE OF INVESTMENTS (unaudited)</b></p>
<p class=s-1><b><A HRef=#Dates OnMouseOver="return _(this,D)">July 31, 2016</A></b></p>
<p class=s-1> </p>
<table cellpadding=0 cellspacing=0 class=s-b>
<tr class=s-c> <td colspan=2 class=s-3s>SHARES</td> <td> </td> <td class=s-1b> </td><td class=s-h> </td> <td class=s-h><font class=s-71>COMPANY</font></td><td class=s-h> </td> <td colspan=2 class=s-2e>VALUE</td><td class=s-3t> </td></tr>
<tr class=s-c> <td class=s-k> </td><td class=s-10> </td> <td> </td> <td class=s-10> </td><td> </td> <td> </td><td> </td> <td class=s-k> </td><td class=s-10> </td><td class=s-k> </td></tr>
<tr class=s-c> <td colspan=6 class=s-m><b>COMMON STOCKS - 99.9%</b></td> <td> </td> <td> </td><td class=s-10> </td><td class=s-k> </td></tr>
<tr class=s-c> <td class=s-k> </td><td class=s-10> </td> <td> </td> <td class=s-10> </td><td> </td> <td> </td><td> </td> <td class=s-k> </td><td class=s-10> </td><td class=s-k> </td></tr>
<tr class=s-c> <td colspan=6 class=s-k>AUTOMOBILES & COMPONENTS - 1.5%</td><td> </td> <td class=s-k> </td><td class=s-10> </td><td class=s-k> </td></tr>
<tr class=s-c> <td class=s-n> </td><td class=s-o>29,505</td> <td class=s-1f> </td> <td class=s-72> </td><td class=s-p> </td> <td class=s-73>Ford Motor Co</td><td class=s-s> </td> <td class=s-n>$</td><td class=s-t>373,533</td><td class=s-n> </td></tr>
<tr class=s-c> <td class=s-k> </td><td class=s-10>499</td> <td> </td> <td class=s-10> </td><td> </td> <td>Harley-Davidson, Inc</td><td> </td> <td class=s-k> </td><td class=s-10>26,407</td><td class=s-k> </td></tr>
<tr class=s-c> <td class=s-k> </td><td class=s-10>3,549</td> <td> </td> <td class=s-10> </td><td> </td>

Essentially, class s-70 is serving as a proxy for a new page of content, but also as an indicator of precisely those pages I want to grab from a much, much larger document (~100,000 lines of source!).

So, to recap, I want to grab not just the lines (i.e., tags) actually containing the s-70 class indicator, but everything following each such line, up to a stop indicator to be determined. Does Soup support this kind of straining directly, does someone have some code I can have, or am I going to have to "roll my own"?
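
In case it helps pin down what I mean, here is a rough sketch of the "roll my own" fallback, working on the raw source text rather than through Soup. The names split_sections and MARKER and the optional stop pattern are my own placeholders, not anything bs4 provides, and it assumes every page marker appears literally as a <p> tag whose only attribute is class s-70 (quoted or unquoted).

```python
import re

# Assumption: each page marker is an opening <p> tag with class s-70,
# written either as class=s-70 or class="s-70".
MARKER = re.compile(r'<p\s+class="?s-70"?\s*>', re.IGNORECASE)

def split_sections(source, stop=None):
    """Return one string per marker: the marker tag plus everything after it,
    up to the next marker, or up to the `stop` regex if it matches sooner."""
    starts = [m.start() for m in MARKER.finditer(source)]
    sections = []
    for i, begin in enumerate(starts):
        # A section ends where the next marker begins, or at end of source.
        end = starts[i + 1] if i + 1 < len(starts) else len(source)
        chunk = source[begin:end]
        if stop is not None:
            hit = re.search(stop, chunk)
            if hit:
                chunk = chunk[:hit.start()]
        sections.append(chunk)
    return sections

# Tiny made-up sample standing in for the real ~100,000-line document.
sample = (
    '<p class=s-1>skip me</p>'
    '<p class=s-70>Fund A</p><table><tr><td>1</td></tr></table>'
    '<p class=s-70>Fund B</p><table><tr><td>2</td></tr></table>'
)
sections = split_sections(sample)
print(len(sections))
```

Each returned chunk could then be handed to BS on its own, though I'd still prefer a built-in way if Soup has one.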
Thanks!
OlyDLG