Question about targeting specific classes to extract text into lists

Tom

Jul 18, 2012, 8:58:18 AM
to beauti...@googlegroups.com
I'm new to bs4 and I am having issues extracting text into lists.

Here is a snippet of my code:

from bs4 import BeautifulSoup
code = """<tr class="odd"><td>QB</td><th scope=row><a href="http://rivals.yahoo.com/ncaa/football/recruiting/player-Tyrone-Swoopes-124071;_ylt=Ap1EPVj2dmRRkPO4OqcVVshIPZB4" >Tyrone Swoopes</a></th><td>Whitewright, Texas<em>Whitewright</em></td><td>6'5"</td><td>229</td><td>4.8</td><td><span class="stars ysr-results-5-star">5 stars</span></td><td>6.1</td><td>1</td><td><div class="wrapper"><a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/texas-83;_ylt=AmYuZOKVFsFgtKnD1LRq84JIPZB4?&sport=1" class="committed">Texas</a></div></td></tr><tr class="even"><td>LB</td><th scope=row><a href="http://rivals.yahoo.com/ncaa/football/recruiting/player-Reuben-Foster-108287;_ylt=Ak6Ntf2pP47bHwK70Ea3buRIPZB4" >Reuben Foster</a></th><td>Auburn, Alabama<em>Auburn</em></td><td>6'2"</td><td>228</td><td>N/A</td><td><span class="stars ysr-results-5-star">5 stars</span></td><td>6.1</td><td>1</td><td><div class="wrapper"><a href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/auburn-75;_ylt=AiA6iZ4bW_IhMI4ZjLBQHARIPZB4?&sport=1" class="committed">Auburn</a></div></td></tr>"""

##### I highlighted the tags to help guide you through my intentions. I want to target <tr class="odd"> and <tr class="even">, extract the text between the <td> tags, and put each piece into its own list. (I will also need to extract the text from the <th scope=row> tag into a list, but I can figure that out later.) #####

soup = BeautifulSoup(code)    # read in my code
odds = soup.find_all('tr', attrs={'class': 'odd'})    # target the rows with class "odd"
evens = soup.find_all('tr', attrs={'class': 'even'})  # target the rows with class "even"
for i in odds:
    word = []
    for tmp in i.find_all(text=True):  # call find_all on each row, not on the result list
        word.append(tmp)

Tom

Jul 18, 2012, 9:05:36 AM
to beauti...@googlegroups.com
Sorry, I somehow posted my thread before I was done.

To finish off my question: I ultimately want the data written cleanly, like this:

QB Tyrone Swoopes Whitewright, Texas Whitewright 6'5" 229 4.8 5 stars 6.1 1 Texas    #then I want to start a new line '\n'
LB Reuben Foster Auburn, Alabama Auburn 6'2" 228 N/A 5 stars 6.1 1 Auburn
OL Khaliel Rodgers Elkton, Maryland Eastern Christian Academy 6'3" 300 N/A 4 stars 6.0 1 USC

etc... any tips or pointers will be appreciated!
Thanks,
Tom
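
For the one-line-per-row output described above, a minimal sketch (Python 3, not run against the live page) could lean on bs4's stripped_strings generator, which yields every non-empty text fragment inside a tag in document order. The HTML below is a trimmed, hypothetical stand-in for the rows in the question, and "html.parser" is simply the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down version of the two rows from the question.
html = (
    '<tr class="odd"><td>QB</td><th scope="row"><a href="#">Tyrone Swoopes</a></th>'
    '<td>Whitewright, Texas<em>Whitewright</em></td><td>6\'5"</td><td>229</td></tr>'
    '<tr class="even"><td>LB</td><th scope="row"><a href="#">Reuben Foster</a></th>'
    '<td>Auburn, Alabama<em>Auburn</em></td><td>6\'2"</td><td>228</td></tr>'
)

soup = BeautifulSoup(html, "html.parser")

# One list of text fragments per <tr>, in document order.
rows = [[str(s) for s in tr.stripped_strings] for tr in soup.find_all("tr")]

# One space-separated line per row, joined with newlines.
lines = [" ".join(fields) for fields in rows]
print("\n".join(lines))
```

Each entry in rows is an ordinary Python list, so it can be indexed or written to a file without further bs4 calls.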

Link Swanson

Jul 18, 2012, 9:39:35 AM
to beauti...@googlegroups.com
I'm not sure why you need to care about odd or even for the output you are looking for. Here is some pseudocode (I did not test it) to give a general idea of how I would do it. I am still a novice with bs4 and Python, so this is probably not the best way, but it should get you started.

rows = soup.find_all('tr')
for row in rows:
    tdlist = row.find_all('td')
    data1 = tdlist[0].string
    data2 = row.find('th').a.string
    data3 = tdlist[1].contents[0].string  # .string alone is None here: the cell holds text plus an <em>
    data4 = tdlist[1].em.string
    data5 = tdlist[2].string
    data6 = tdlist[3].string
    data7 = tdlist[4].string
    data8 = tdlist[5].span.string
    data9 = tdlist[6].string
    data10 = tdlist[7].string
    data11 = tdlist[8].a.string
    # Cram them into a list:
    wordlist = [data1, data2, data3, data4, data5, data6, data7, data8, data9, data10, data11]
    # Or just print them as straight strings:
    print '%s %s %s %s %s %s %s %s %s %s %s' % (data1, data2, data3, data4, data5, data6, data7, data8, data9, data10, data11)

Hope that helps!
    


--
You received this message because you are subscribed to the Google Groups "beautifulsoup" group.
To view this discussion on the web visit https://groups.google.com/d/msg/beautifulsoup/-/zjELVdvNv30J.

To post to this group, send email to beauti...@googlegroups.com.
To unsubscribe from this group, send email to beautifulsou...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/beautifulsoup?hl=en.



--
Link Swanson
Must Build Digital


Tom

Jul 18, 2012, 10:18:28 AM
to beauti...@googlegroups.com
Thanks Link,

That did store each td tag as an individual list. It did not loop through the whole snippet of code, but I can work that out later. I can use this for what I want to accomplish, but the reason I was trying to "target" the classes 'odd' and 'even' is that there are plenty of other <td> tags throughout the webpage that I am trying to bypass, and the only way I can single out the data that I want to extract is by focusing on the odd/even classes.

For your reference, here is the website I want to parse: http://rivals.yahoo.com/ncaa/football/recruiting/recruit-search-results

Your input will help, though. My goal is just to extract the data correctly into lists; then I will go back and deal with writing the user agents and all the other stuff.
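
One way to bypass the unrelated <td> tags is to select only the rows whose class is odd or even before touching any cells. The sketch below (Python 3) passes a list to class_, which matches a tag carrying any of the listed classes; the page fragment and the cells variable are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: one unrelated table cell plus two result rows.
html = (
    '<table><tr><td>navigation junk</td></tr></table>'
    '<table>'
    '<tr class="odd"><td>QB</td><td>229</td></tr>'
    '<tr class="even"><td>LB</td><td>228</td></tr>'
    '</table>'
)

soup = BeautifulSoup(html, "html.parser")

# Rows with class "odd" or "even" only; the classless row is skipped.
result_rows = soup.find_all("tr", class_=["odd", "even"])

# Text of each <td> inside the selected rows, one list per row.
cells = [[td.get_text(strip=True) for td in tr.find_all("td")] for tr in result_rows]
print(cells)
```

Because find_all returns matches in document order, the rows come out in the same order as on the page, which is not the case when separate odd and even lists are concatenated.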

Link Swanson

Jul 18, 2012, 12:06:31 PM
to beauti...@googlegroups.com
Ah, I see. This works:

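from bs4 import BeautifulSoup
import urllib2
# page was not defined in the original post; fetching the results URL Tom gave is an assumption
page = urllib2.urlopen("http://rivals.yahoo.com/ncaa/football/recruiting/recruit-search-results")
soup = BeautifulSoup(page)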
evens = soup.find_all('tr', 'even')
odds = soup.find_all('tr', 'odd')
rows = evens + odds
for row in rows:
    tdlist = row.find_all('td')
    data1 = tdlist[0].string
    data2 = row.find('th').a.string
    data3 = tdlist[1].contents[0].string
    data4 = tdlist[1].contents[1].string
    data5 = tdlist[2].string
    data6 = tdlist[3].string
    data7 = tdlist[4].string
    if tdlist[5].span is not None:
        data8 = tdlist[5].span.string
    else:
        data8 = ""
    data9 = tdlist[6].string
    data10 = tdlist[7].string
    data11 = tdlist[8].a.string
    print '%s %s %s %s %s %s %s %s %s %s %s' % (data1, data2, data3, data4, data5, data6, data7, data8, data9, data10, data11)

Link
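
The span check above generalizes to any cell that may be empty or missing. A small sketch (Python 3), with text_or_empty as a hypothetical helper name and made-up sample rows:

```python
from bs4 import BeautifulSoup


def text_or_empty(tag):
    """Return the stripped text of a tag, or "" if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else ""


# Hypothetical rows: the second one has no star-rating <span>.
html = (
    '<tr class="odd"><td>QB</td><td><span class="stars">5 stars</span></td></tr>'
    '<tr class="even"><td>LB</td><td></td></tr>'
)
soup = BeautifulSoup(html, "html.parser")

ratings = []
for row in soup.find_all("tr", class_=["odd", "even"]):
    tdlist = row.find_all("td")
    # tdlist[1].span is None when the cell has no <span>.
    ratings.append(text_or_empty(tdlist[1].span))
print(ratings)
```

Wrapping every lookup in the same helper keeps the loop body flat instead of adding an if/else per column.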


Tom

Jul 18, 2012, 8:56:20 PM
to beauti...@googlegroups.com
I appreciate it. That is what I was looking for. Sometimes I overthink things and try to make them bigger than they are. I guess I was thinking I needed to do it all via bs4 rather than with individual Python lists!

I just wrote the lists into an out file. Anyway, on to the user agents.

Thanks

Tom

Jul 19, 2012, 12:46:07 PM
to beauti...@googlegroups.com
I have now come to another tag: <li>. Within this tag, I want to "hit" the Next button so that it loads the next 100 recruits, and then repeat the loop. Here is the sample HTML code:

<li><form method="post" action="recruit-search-results"><button type="submit" name="start" value="100">Next &raquo;<span class="value">100</span></button><input type="hidden" name="sport" value="football"> <input type="hidden" name="year" value="2011"> <input type="hidden" name="committed" value="1"> <input type="hidden" name="uncommitted" value="1"> <input type="hidden" name="loc" value="City, State or Zip Code"> <input type="hidden" name="hsprospects" value="1"> <input type="hidden" name="prepprospects" value="1"> <input type="hidden" name="jucoprospects" value="1"> <input type="hidden" name="start" value="0"></form></li>


I'm not really sure which tag I need to target in order to go to the "Next" page, i.e. the next 100 recruits. Typically you'd target an <a> tag and its href. Should I put the "row loop" into its own function so that I can keep repeating it until there are no more "next" pages?

Below is a snippet of the code that I have been working on. I highlighted my idea for "hitting" the next button.


import urllib
import urllib2
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = { 'User-Agent' : user_agent }
year = raw_input("Input recruiting year: ")
values = {'s' : year }
data = urllib.urlencode(values)
request = urllib2.Request("http://rivals.yahoo.com/ncaa/football/recruiting/recruit-search", data, headers)
page = urllib2.urlopen(request)

soup = BeautifulSoup(page)
evens = soup.find_all('tr', 'even')
odds = soup.find_all('tr', 'odd')
rows = evens + odds
for row in rows:
    tdlist = row.find_all('td')
    data1 = tdlist[0].string
    data2 = row.find('th').a.string
    data3 = tdlist[1].contents[0].string
    data4 = tdlist[1].contents[1].string
    data5 = tdlist[2].string
    data6 = tdlist[3].string
    data7 = tdlist[4].string
    if tdlist[5].span is not None:
        data8 = tdlist[5].span.string
    else:
        data8 = ""
    data9 = tdlist[6].string
    data10 = tdlist[7].string
    data11 = tdlist[8].a.string
    print '%s %s %s %s %s %s %s %s %s %s %s' % (data1, data2, data3, data4, data5, data6, data7, data8, data9, data10, data11)
## After creating outfile with all the data on the page, go to next page then repeat loop
next = { 'value' : 'Next' }
for i in next:
    if next == 'Next':
        continue
    else:
        break
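
The "Next" control in the HTML above is a submit button inside a POST form, not a link, so one possible approach is to rebuild the form's fields and POST them back. The sketch below (Python 3, no network access) only builds the request body. next_page_fields is a hypothetical helper, the HTML is a shortened copy of the form above, and letting the button's start value override the hidden start field is an assumption about what the server expects.

```python
from urllib.parse import urlencode

from bs4 import BeautifulSoup

# Shortened copy of the "Next" form from the question.
html = (
    '<li><form method="post" action="recruit-search-results">'
    '<button type="submit" name="start" value="100">Next</button>'
    '<input type="hidden" name="sport" value="football">'
    '<input type="hidden" name="year" value="2011">'
    '<input type="hidden" name="start" value="0">'
    '</form></li>'
)


def next_page_fields(soup):
    """Return the fields the "Next" form would submit, or None if there is no such button."""
    # "name" is also find()'s first parameter, so the attribute goes in attrs.
    button = soup.find("button", attrs={"name": "start"})
    if button is None:
        return None  # no "Next" button, so this is treated as the last page
    form = button.find_parent("form")
    fields = {inp["name"]: inp.get("value", "")
              for inp in form.find_all("input", type="hidden")}
    # Assumption: the button's value should override the hidden "start" field.
    fields[button["name"]] = button["value"]
    return fields


soup = BeautifulSoup(html, "html.parser")
fields = next_page_fields(soup)
post_body = urlencode(fields).encode("ascii")
print(fields)
```

The resulting post_body could be passed as the data argument of urllib.request.Request, inside a loop that parses each response and stops once next_page_fields returns None. Putting the row loop in its own function, as suggested in the question, fits that structure.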

