bs4 output is structured like the html

Tom

Jul 26, 2012, 9:51:53 AM

to beauti...@googlegroups.com

Hello again...
       So I have code that scrapes a site and prints out the strings/text. However, the output is not structured the way I want it (i.e., one record per line); instead, its layout mirrors the HTML that was parsed. For example:
1:
                                                Hoover HS
                                            :

                                                Birmingham,
                                                AL
                                            :

                                                    1
                                                :
                                                    2
                                                :
                                                    4
                                                :
                                                    3

I am okay at managing data in Python, so I know the strip, split, and append methods, but none of them seem to mold the data the way I want:
1: Hoover HS: Birmingham, AL: 1: 2: 4: 3 etc.....

Am I missing something in bs4, or is there something else programming-wise that I do not know about? (I'm a self-taught novice.)

Here is my working code:
import urllib2
import urllib
import string
from bs4 import BeautifulSoup

urlloop = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18']

def main():
    for i in urlloop:
        url = "http://www.usatodayhss.com/news/rankings/super-25-boys-football?state=AL&p="+i
        request = urllib2.Request(url)
        page = urllib2.urlopen(request)
        soup = BeautifulSoup(page)
        HS = soup.find_all('tr', 'hss-data')
        # print (soup.prettify())
        # print (soup.get_text())
        for row in HS:
            tdlist = row.find_all('div')
            data1 = row.find('td').string
            if tdlist[0].span is not None:
                    data3 = 'none'
                    data5 = tdlist[1].a.string
                    data6 = tdlist[7].string
                    data7 = tdlist[11].string
                    data8 = tdlist[15].text
                    data9 = tdlist[20].text
                    print '%s: %s: %s: %s: %s: %s: %s' % (data1, data5, data3, data6, data7, data8, data9)
            else:
                    data3 = tdlist[0].a.string
                    data5 = tdlist[1].string
                    data6 = tdlist[6].string
                    data7 = tdlist[10].string
                    data8 = tdlist[15].string
                    data9 = tdlist[19].string
                    print '%s: %s: %s: %s: %s: %s: %s' % (data1, data3, data5, data6, data7, data8, data9)
main()

Thanks,
Tom

Link Swanson

Jul 26, 2012, 10:04:41 AM

to beauti...@googlegroups.com

Your strings probably have leading and trailing whitespace and line endings in them. Tell Python to strip them by adding .strip() to the end of each string, like this:

data5 = tdlist[1].a.string.strip()

See http://docs.python.org/library/string.html#string.strip
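
One caveat worth noting: .strip() only trims the two ends of a string, so a cell such as "Birmingham," followed by a newline and "AL" would still print across two lines. Below is a minimal sketch of one way to collapse the inner whitespace as well, using only str.split() and str.join(). The helper name collapse_whitespace and the sample string are made up for illustration.

```python
def collapse_whitespace(text):
    """Trim both ends and squeeze each internal whitespace run to one space."""
    # split() with no arguments splits on any run of whitespace (spaces,
    # tabs, newlines) and drops empty pieces, so re-joining the pieces
    # with a single space normalizes the whole string.
    return " ".join(text.split())


# Hypothetical sample resembling the indented text inside a table cell.
raw = "\n      Birmingham,\n      AL\n   "

print(raw.strip())               # ends trimmed, inner newline and indent remain
print(collapse_whitespace(raw))  # Birmingham, AL
```

In the posted script this could wrap each extracted value, for example collapse_whitespace(tdlist[1].a.string), assuming that .string is not None at that point.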

Also, you can avoid maintaining a manual list for your URL loop by using range(): http://docs.python.org/library/functions.html#range

for i in range(1, 19):
    # range() yields ints, so convert to str before concatenating
    url = "http://www.usatodayhss.com/news/rankings/super-25-boys-football?state=AL&p=" + str(i)
    request = urllib2.Request(url)
    ...

Link

--
Link Swanson

Must Build Digital

Alex Ezell

Jul 26, 2012, 10:05:54 AM

to beauti...@googlegroups.com

Hi Tom,

You might try using .strip() on the strings that you are pulling out of the TDs. You'll want to make sure those are actual strings first or you'll raise an AttributeError by calling .strip() on a NoneType.
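
A minimal sketch of that guard, assuming the values come from .string, which Beautiful Soup returns as None when a tag has more than one child. The helper name clean and the 'none' placeholder (borrowed from the data3 = 'none' line in the original script) are illustrative choices, not part of bs4.

```python
def clean(value, default="none"):
    """Return value with whitespace normalized, or default if value is None."""
    if value is None:
        # Calling .strip() or .split() here would raise AttributeError.
        return default
    # Trim the ends and collapse inner whitespace runs to single spaces.
    return " ".join(value.split())


# Hypothetical cell values: two padded strings and one missing value.
fields = ["\n   Hoover HS\n", None, "   1 "]

print(": ".join(clean(f) for f in fields))  # Hoover HS: none: 1
```

Each data variable in the script could then be passed through clean() before it reaches the print statement, so a missing value shows up as a placeholder instead of stopping the loop.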

/alex
