Message from discussion Question about targeting specific classes to extract text into lists
From: Tom <boot...@gmail.com>
Date: Thu, 19 Jul 2012 09:46:07 -0700 (PDT)
Local: Thurs, Jul 19 2012 12:46 pm
Subject: Re: Question about targeting specific classes to extract text into lists

I have now come to another tag... <li>   Within this tag, I want to
"hit" the Next button so it loads the next 100 recruits and then repeats the
loop.  Here is the sample HTML code:

<li><form method="post" action="recruit-search-results"><button type="submit" name="start" value="100">Next &raquo;<span class="value">100</span></button><input type="hidden" name="sport" value="football">
<input type="hidden" name="year" value="2011">
<input type="hidden" name="committed" value="1">
<input type="hidden" name="uncommitted" value="1">
<input type="hidden" name="loc" value="City, State or Zip Code">
<input type="hidden" name="hsprospects" value="1">
<input type="hidden" name="prepprospects" value="1">
<input type="hidden" name="jucoprospects" value="1">
<input type="hidden" name="start" value="0"></form></li>

I'm not really sure which tag I need to target in order to go to the "Next"
page or next 100 recruits... Typically you'd target an <a> tag and its
href... Anyway, should I put the "row loop" into its own function in order
to keep repeating the loop until there are no more "next" pages...??
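
One way to read the markup above: the Next control is not a link but a submit button inside a POST form, so "following" it means sending the form's fields back to the form's action URL. The sketch below is a minimal illustration in Python 3 using only the standard library (not the bs4 calls used elsewhere in this thread). NextFormParser and next_page_post are hypothetical names, the sample markup is trimmed from the post, and letting the button's value replace the hidden start field is an assumption about what the server expects.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Sample markup copied from the post (hidden inputs trimmed for brevity).
SAMPLE = (
    '<li><form method="post" action="recruit-search-results">'
    '<button type="submit" name="start" value="100">Next &raquo;'
    '<span class="value">100</span></button>'
    '<input type="hidden" name="sport" value="football">'
    '<input type="hidden" name="year" value="2011">'
    '<input type="hidden" name="start" value="0"></form></li>'
)


class NextFormParser(HTMLParser):
    """Hypothetical helper: records the form action, hidden inputs and
    submit button found in the markup it is fed. It assumes it is fed
    only the Next form's markup, not a whole page with several forms."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self.button = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and attrs.get("type") == "hidden":
            self.fields[attrs.get("name")] = attrs.get("value", "")
        elif tag == "button" and attrs.get("type") == "submit":
            self.button = (attrs.get("name"), attrs.get("value", ""))


def next_page_post(html):
    """Return (action, urlencoded body) for the Next form, or None if
    the markup contains no submit button (taken to mean: last page)."""
    parser = NextFormParser()
    parser.feed(html)
    if parser.button is None:
        return None
    fields = dict(parser.fields)
    name, value = parser.button
    # Assumption: the button's value should win over the hidden 'start'.
    fields[name] = value
    return parser.action, urlencode(fields)


print(next_page_post(SAMPLE))
```

The returned body is what would be passed as the data argument of a POST request to the action URL, in the same way the search request in this thread passes its urlencoded values.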

Below is a snippet of the code that I have been working on... I
highlighted my idea for "hitting" the next button..

import urllib
import urllib2
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = { 'User-Agent' : user_agent }
year = raw_input("Input recruiting year: ")
values = {'s' : year }
data = urllib.urlencode(values)
request = urllib2.Request(
    "http://rivals.yahoo.com/ncaa/football/recruiting/recruit-search",
    data, headers)
page = urllib2.urlopen(request)
soup = BeautifulSoup(page)
evens = soup.find_all('tr', 'even')
odds = soup.find_all('tr', 'odd')
rows = evens + odds   # note: all 'even' rows come first, then all 'odd' rows
for row in rows:
    tdlist = row.find_all('td')
    data1 = tdlist[0].string
    data2 = row.find('th').a.string
    data3 = tdlist[1].contents[0].string
    data4 = tdlist[1].contents[1].string
    data5 = tdlist[2].string
    data6 = tdlist[3].string
    data7 = tdlist[4].string
    if tdlist[5].span is not None:
        data8 = tdlist[5].span.string
    else:
        data8 = ""
    data9 = tdlist[6].string
    data10 = tdlist[7].string
    data11 = tdlist[8].a.string
    print '%s %s %s %s %s %s %s %s %s %s %s' % (
        data1, data2, data3, data4, data5, data6,
        data7, data8, data9, data10, data11)

## After creating outfile with all the data on the page, go to next page
## then repeat loop.
## Idea: the Next control is a <button name="start"> inside a POST form,
## so look for that button; its value is the offset of the next page.
next_button = soup.find('button', attrs={'name': 'start'})
if next_button is not None:
    next_start = next_button['value']   # e.g. '100'; POST this as 'start'
else:
    next_start = None                   # no Next button found: last page
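
On the question of putting the "row loop" into its own function and repeating it until there are no more Next pages, one possible shape is sketched below in Python 3. Everything here is hypothetical: scrape_all_pages and its three callables are made-up names, and the in-memory FAKE_SITE dictionary stands in for the real site so the loop can run without a network connection. In a real script, fetch_page would do the POST, parse_rows would hold the row loop, and find_next_start would look for the Next button.

```python
def scrape_all_pages(fetch_page, parse_rows, find_next_start, first_start="0"):
    """Collect rows from every page until no Next button is found.

    All three callables are hypothetical stand-ins:
      fetch_page(start)     -> HTML for the page beginning at offset `start`
      parse_rows(html)      -> list of row records (the "row loop")
      find_next_start(html) -> the Next button's value, or None on the last page
    """
    all_rows = []
    seen = set()
    start = first_start
    while start is not None and start not in seen:
        seen.add(start)  # guard: stop if the site hands back an offset twice
        html = fetch_page(start)
        all_rows.extend(parse_rows(html))
        start = find_next_start(html)
    return all_rows


# Tiny in-memory stand-in for the site; the "HTML" is just the offset key.
FAKE_SITE = {
    "0": {"rows": ["recruit 1", "recruit 2"], "next": "100"},
    "100": {"rows": ["recruit 3"], "next": "200"},
    "200": {"rows": ["recruit 4"], "next": None},
}

rows = scrape_all_pages(
    fetch_page=lambda start: start,
    parse_rows=lambda html: FAKE_SITE[html]["rows"],
    find_next_start=lambda html: FAKE_SITE[html]["next"],
)
print(rows)
```

Keeping the fetching, the row parsing and the Next lookup as separate functions means the loop itself never changes if the page layout does; only the piece that reads the HTML needs updating.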

On Wednesday, July 18, 2012 12:06:31 PM UTC-4, LunkRat wrote:

> Ah, I see. This works:

> from bs4 import BeautifulSoup
> import urllib2
> page = urllib2.urlopen('http://rivals.yahoo.com/ncaa/football/recruiting/recruit-search-resul...')
> soup = BeautifulSoup(page)
> evens = soup.find_all('tr', 'even')
> odds = soup.find_all('tr', 'odd')
> rows = evens + odds
> for row in rows:
>     tdlist = row.find_all('td')
>     data1 = tdlist[0].string
>     data2 = row.find('th').a.string
>     data3 = tdlist[1].contents[0].string
>     data4 = tdlist[1].contents[1].string
>     data5 = tdlist[2].string
>     data6 = tdlist[3].string
>     data7 = tdlist[4].string
>     if tdlist[5].span is not None:
>         data8 = tdlist[5].span.string
>     else:
>         data8 = ""
>     data9 = tdlist[6].string
>     data10 = tdlist[7].string
>     data11 = tdlist[8].a.string
>     print '%s %s %s %s %s %s %s %s %s %s %s' % (data1, data2, data3,
> data4, data5, data6, data7, data8, data9, data10, data11)

> Link

> On Wed, Jul 18, 2012 at 9:18 AM, Tom <boot...@gmail.com> wrote:

>> Thanks Link,
>>          That did store each td tag as an individual list... It did not
>> loop through the snippet of code, but I can work that out later... Anyway,
>> I can implement this in what I want to accomplish, but the reason I was
>> trying to "target" the classes 'odd' and 'even' is that there are plenty of
>> other <td> tags throughout the webpage that I am trying to bypass... and
>> the only way I can target the data that I want to extract is by focusing
>> on the odd/even classes...

>>   for your reference ###here is the website I want to parse
>> http://rivals.yahoo.com/ncaa/football/recruiting/recruit-search-resul...

>> Your input will help though..  My goal is to just correctly extract the
>> data into lists then I will go back and deal with writing the user agents
>> and all the other stuff...

>> On Wednesday, July 18, 2012 9:39:35 AM UTC-4, LunkRat wrote:

>>> I'm not sure why you need to care about odd or even for the output you
>>> are looking for. Here is pseudocode (I did not test this) to just give a
>>> general idea of how I would do it (but I am still very novice with bs4 and
>>> python, so this is probably not the best way, but should get you started).

>>> rows = soup.find_all('tr')
>>> for row in rows:
>>>     tdlist = row.find_all('td')
>>>     data1 = tdlist[0].string
>>>     data2 = row.find('th').a.string
>>>     data3 = tdlist[1].string
>>>     data4 = tdlist[1].em.string
>>>     data5 = tdlist[2].string
>>>     data6 = tdlist[3].string
>>>     data7 = tdlist[4].string
>>>     data8 = tdlist[5].span.string
>>>     data9 = tdlist[6].string
>>>     data10 = tdlist[7].string
>>>     data11 = tdlist[8].a.string
>>>     # Cram them into a list:
>>>     wordlist = [data1, data2, data3, data4,
>>>                 data5, data6, data7, data8, data9, data10, data11]
>>>     # Or just print them as straight strings:
>>>     print '%s %s %s %s %s %s %s %s %s %s %s' % (
>>>         data1, data2, data3, data4, data5, data6, data7, data8,
>>>         data9, data10, data11)

>>> Hope that helps!

>>> On Wed, Jul 18, 2012 at 8:05 AM, Tom <boot...@gmail.com> wrote:

>>>> Sorry I somehow posted my thread before I was done....

>>>> But to finish off my question... I ultimately want the data cleanly
>>>> written like this....

>>>> QB Tyrone Swoopes Whitewright, Texas Whitewright 6'5" 229 4.8 5 stars
>>>> 6.1 1 Texas    #then I want to start a new line '\n'
>>>> LB Reuben Foster Auburn, Alabama Auburn 6'2" 228 N/A 5 stars 6.1 1
>>>> Auburn
>>>> OL Khaliel Rodgers Elkton, Maryland Eastern Christian Academy 6'3" 300
>>>> N/A 4 stars 6.0 1 USC

>>>> etc... any tips or pointers will be appreciated!
>>>> Thanks,
>>>> Tom
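
For the one-line-per-recruit output described above, a small formatting helper may be enough. The sketch below is Python 3 and standard library only; format_recruit and write_recruits are hypothetical names, the field values are copied from the sample output in the post, and io.StringIO stands in for a real output file. It skips None because, in Beautiful Soup, a tag's .string is None when the tag has more than one child, which would otherwise print as the literal text "None".

```python
import io


def format_recruit(fields):
    """Join one recruit's fields with single spaces, skipping None and
    empty values so they do not leave stray text or double spaces."""
    cleaned = (str(f).strip() for f in fields if f is not None)
    return " ".join(c for c in cleaned if c)


def write_recruits(recruits, outfile):
    """Write one line per recruit to any file-like object."""
    for fields in recruits:
        outfile.write(format_recruit(fields) + "\n")


# Field values taken from the sample output in the post.
swoopes = ["QB", "Tyrone Swoopes", "Whitewright, Texas", "Whitewright",
           "6'5\"", "229", "4.8", "5 stars", "6.1", "1", "Texas"]
foster = ["LB", "Reuben Foster", "Auburn, Alabama", "Auburn",
          "6'2\"", "228", "N/A", "5 stars", "6.1", "1", "Auburn"]

buffer = io.StringIO()  # stand-in for an opened output file
write_recruits([swoopes, foster], buffer)
print(buffer.getvalue(), end="")
```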

>>>> On Wednesday, July 18, 2012 8:58:18 AM UTC-4, Tom wrote:

>>>>> I'm new to bs4 and I am having issues with extracting the text into
>>>>> lists...

>>>>> Here is a snippet of my code..

>>>>> from bs4 import BeautifulSoup
>>>>> code = """<tr class="odd"><td>QB</td><th scope=row><a
>>>>> href="http://rivals.yahoo.com/ncaa/football/recruiting/player-Tyrone-Swoopes-124071;_ylt=Ap1EPVj2dmRRkPO4OqcVVshIPZB4"
>>>>> >Tyrone Swoopes</a></th><td>Whitewright, Texas<em>Whitewright</em></td>
>>>>> <td>6'5"</td><td>229</td><td>4.8</td>
>>>>> <td><span class="stars ysr-results-5-star">5 stars</span></td>
>>>>> <td>6.1</td><td>1</td><td><div class="wrapper"><a
>>>>> href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/texas-83;_ylt=AmYuZOKVFsFgtKnD1LRq84JIPZB4?&sport=1"
>>>>> class="committed">Texas</a></div></td></tr>
>>>>> <tr class="even"><td>LB</td><th scope=row><a
>>>>> href="http://rivals.yahoo.com/ncaa/football/recruiting/player-Reuben-Foster-108287;_ylt=Ak6Ntf2pP47bHwK70Ea3buRIPZB4"
>>>>> >Reuben Foster</a></th><td>Auburn, Alabama<em>Auburn</em></td>
>>>>> <td>6'2"</td><td>228</td><td>N/A</td>
>>>>> <td><span class="stars ysr-results-5-star">5 stars</span></td>
>>>>> <td>6.1</td><td>1</td><td><div class="wrapper"><a
>>>>> href="http://rivals.yahoo.com/ncaa/football/recruiting/commitments/2013/auburn-75;_ylt=AiA6iZ4bW_IhMI4ZjLBQHARIPZB4?&sport=1"
>>>>> class="committed">Auburn</a></div></td></tr>"""

>>>>> ##### I highlighted the tags to help guide you through my intentions..
>>>>> I want to target the <tr class="odd"> and <tr class="even"> .  I want
>>>>> to extract the text between the <td> tags and put them into an
>>>>> individual list... (I will also need to extract text from the <th
>>>>> scope=row> tag into a list too.. but I can always figure that out
>>>>> later)####

>>>>> soup = BeautifulSoup(code)    #read in my code
>>>>> odds = soup.find_all('tr', attrs={'class': 'odd'})  #target odd classes with <td> tags
>>>>> evens = soup.find_all('tr', attrs={'class': 'even'}) #target even classes with <td> tags
>>>>> for i in odds:
>>>>>     word = []
>>>>>     for tmp in odds.findall(text=True):

>>>>>  --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "beautifulsoup" group.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msg/beautifulsoup/-/zjELVdvNv30J.

>>>> To post to this group, send email to beautifulsoup@googlegroups.com.
>>>> To unsubscribe from this group, send email to
>>>> beautifulsoup+unsubscribe@googlegroups.com.
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/beautifulsoup?hl=en.

>>> --
>>> Link Swanson
>>> Must Build Digital


> --
> Link Swanson
> Must Build Digital


 