bs4 output is structured like the html

Tom

Jul 26, 2012, 9:51:53 AM
to beauti...@googlegroups.com
Hello again...
       So I have a script that scrapes a site and prints out the strings/text. However, the output is not structured the way I want it to be, i.e. as a one-line listing; instead its layout resembles the HTML that was parsed. For example...
1:
                                                Hoover HS
                                            :

                                                Birmingham,
                                                AL
                                            :

                                                    1
                                                :
                                                    2
                                                :
                                                    4
                                                :
                                                    3
                                               
I am okay at managing data in Python, so I know the strip, split, and append methods, but none of them seem to mold the data the way I want:
1: Hoover HS: Birmingham, AL: 1: 2: 4: 3  etc.....

Am I missing something in bs4, or is there something else programming-wise that I do not know about? (I'm a self-taught novice.)

Here is my working code:
import urllib2
import urllib
import string
from bs4 import BeautifulSoup


urlloop = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18']

def main():
    for i in urlloop:
        url = "http://www.usatodayhss.com/news/rankings/super-25-boys-football?state=AL&p="+i
        request = urllib2.Request(url)
        page = urllib2.urlopen(request)
        soup = BeautifulSoup(page)
        HS = soup.find_all('tr', 'hss-data')
        # print (soup.prettify())
        # print (soup.get_text())
        for i in HS:
            tdlist = i.find_all('div')
            data1 = i.find('td').string
            if tdlist[0].span is not None:
                    data3 = 'none'
                    data5 = tdlist[1].a.string
                    data6 = tdlist[7].string
                    data7 = tdlist[11].string
                    data8 = tdlist[15].text
                    data9 = tdlist[20].text
                    print '%s: %s: %s: %s: %s: %s: %s' % (data1, data5, data3, data6, data7, data8, data9)
            else:
                    data3 = tdlist[0].a.string
                    data5 = tdlist[1].string
                    data6 = tdlist[6].string
                    data7 = tdlist[10].string
                    data8 = tdlist[15].string
                    data9 = tdlist[19].string
                    print '%s: %s: %s: %s: %s: %s: %s' % (data1, data3, data5, data6, data7, data8, data9)
main()               


Thanks,
Tom

Link Swanson

Jul 26, 2012, 10:04:41 AM
to beauti...@googlegroups.com
Your strings probably have leading and trailing whitespace and line endings in them. Tell Python to strip them by adding .strip() to the end of each string, like this:
                    data5 = tdlist[1].a.string.strip()
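
One caveat, judging from the sample output in the question: the "Birmingham," and "AL" pieces appear to sit in a single string with a line break between them, and .strip() only trims the two ends. A minimal sketch of a helper that also collapses the whitespace in the middle is below. The name collapse_whitespace and the sample value raw are made up for illustration, and the sketch assumes the internal spacing carries no meaning.

```python
def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with a single space."""
    # str.split() with no arguments splits on any whitespace run and
    # discards the empty pieces at both ends, so the join needs no strip().
    return " ".join(text.split())


# Hypothetical value shaped like the city/state cell in the question.
raw = "\n        Birmingham,\n        AL\n    "

print(repr(raw.strip()))         # the inner newline is still there
print(collapse_whitespace(raw))  # Birmingham, AL
```

In the original script this would be applied to each value read from .string or .text before it is passed to the print line.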

Also, you can avoid maintaining a manual list for your URL loop by using range() (http://docs.python.org/library/functions.html#range). Note that range() yields integers, so convert each one with str() before concatenating it to the URL:

for i in range(1, 19):
    url = "http://www.usatodayhss.com/news/rankings/super-25-boys-football?state=AL&p=" + str(i)
    request = urllib2.Request(url)
    ...
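
As a standalone illustration of the range() idea, here is a small sketch that builds all the page URLs up front. The names BASE_URL and page_urls are invented for this example, and the page count of 18 is taken from the list in the original script. range(1, 19) stops before 19 and yields integers, so each page number goes through str() before being appended.

```python
# Base address copied from the original script; the page number is appended.
BASE_URL = "http://www.usatodayhss.com/news/rankings/super-25-boys-football?state=AL&p="


def page_urls(first=1, last=18):
    """Return the URLs for pages first..last, inclusive."""
    # range() excludes its stop value, hence last + 1.
    return [BASE_URL + str(page) for page in range(first, last + 1)]


urls = page_urls()
print(len(urls))   # 18
print(urls[0])     # ends with p=1
print(urls[-1])    # ends with p=18
```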

Link


--
Link Swanson
Must Build Digital


Alex Ezell

Jul 26, 2012, 10:05:54 AM
to beauti...@googlegroups.com
Hi Tom,
You might try using .strip() on the strings that you are pulling out of the TDs. You'll want to make sure those are actual strings first: .string returns None when a tag contains more than one child, and calling .strip() on None raises an AttributeError.
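
A rough sketch of that check, written as a helper so it can wrap every field: text_or_default is a made-up name, the 'none' placeholder mirrors the one already used in the original script, and the fields list is invented sample data standing in for values read from .string.

```python
def text_or_default(value, default="none"):
    """Return cleaned-up text, or a placeholder when there is no string."""
    # .string can hand back None, which has no .strip() or .split().
    if value is None:
        return default
    # Collapse whitespace runs (including newlines) into single spaces.
    return " ".join(value.split())


# Hypothetical values in the shape the scraper might produce.
fields = ["1", "\n    Hoover HS\n  ", None, "\n    Birmingham,\n    AL\n  "]

print(": ".join(text_or_default(f) for f in fields))
# 1: Hoover HS: none: Birmingham, AL
```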

/alex
