Re: Help a noob....

Leonard Richardson

May 31, 2013, 11:05:59 AM
to beauti...@googlegroups.com
> I've written code to open every page that I need to open using the soup and
> next I need to retrieve the information from it.
> This is an example of the page I would be examining:
> http://www.iucnredlist.org/details/39780/0
> I need two bits of data from the assessment box. Firstly I need its current
> status then I need its history.
>
> I can't work out how to do this because I need the actual text, I need it in
> the right order and the tags aren't unique to that section. Can anyone help
> me?

If a tag has no distinguishing features, you can find a nearby tag
that does have distinguishing features. Then you can use a method like
find_all() or find_next() to find the tag you're looking for, relative
to the tag that was easy to find.

For instance, you can find the assessment table relative to the <h2>
tag that comes before it.

>>> assessment_section = soup.find('h2', id='sectionAssessment')
>>> assessment_table = assessment_section.find_next('table')

Then, you can find the sections you're interested in based on their
human-readable labels.

>>> criteria_label = assessment_table.find(text="Red List Category & Criteria:")
>>> criteria_value = criteria_label.find_next('td')

>>> history_label = assessment_table.find(text="History:")
>>> history_value = history_label.find_next('td')
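
Put together, the steps above might look like the following minimal sketch. The HTML is a made-up stand-in for the real page (the actual markup may differ), value_for is a hypothetical helper name, and string= is the spelling that recent Beautiful Soup 4 releases prefer over text=. Calling get_text(strip=True) on the cell returns its text without the surrounding tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the page described above;
# the real IUCN page may be structured differently.
html = """
<h2 id="sectionAssessment">Assessment Information</h2>
<table>
  <tr><td>Red List Category &amp; Criteria:</td><td>Vulnerable A2bd</td></tr>
  <tr><td>History:</td><td>2000 - Lower Risk/least concern</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# 1. Find the landmark tag that is easy to identify.
assessment_section = soup.find("h2", id="sectionAssessment")
# 2. Find the first table that follows it in the document.
assessment_table = assessment_section.find_next("table")


def value_for(label):
    """Return the text of the first <td> after the cell whose text is `label`."""
    label_node = assessment_table.find(string=label)
    if label_node is None:
        return None
    return label_node.find_next("td").get_text(strip=True)


current_status = value_for("Red List Category & Criteria:")
history = value_for("History:")
print(current_status)
print(history)
```

Because the lookup returns None when a label is missing, a page without a history row can be skipped instead of crashing the whole run.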

Leonard

izzy...@gmail.com

May 31, 2013, 12:03:06 PM
to beauti...@googlegroups.com
Hi Leonard,
Thanks a lot for taking the time to help me.
I'm having issues with the code you gave me though, when I try and use the line 
"assessment_section = soup.find('h2', id='sectionAssessment') " 
it tells me the syntax is invalid. 

Is there anything you can suggest? 

On Friday, May 31, 2013 10:29:56 AM UTC+1, izzy...@gmail.com wrote:
So I'm attempting to use beautiful soup to gain information from the IUCN website.

Leonard Richardson

May 31, 2013, 1:02:28 PM
to beauti...@googlegroups.com
> Thanks a lot for taking the time to help me.
> I'm having issues with the code you gave me though, when I try and use the
> line
> "assessment_section = soup.find('h2', id='sectionAssessment') "
> it tells me the syntax is invalid.

This line of Python:

assessment_section = soup.find('h2', id='sectionAssessment')

is valid, assuming you have put your BeautifulSoup object in a
variable called 'soup'. But that might not be the variable name you
used, or you may have copied the line wrong.

What does that line of code look like in your source file, and what
exactly does Python say when it tries to run that line?

Leonard

izzy...@gmail.com

May 31, 2013, 1:04:43 PM
to beauti...@googlegroups.com
This is what I've written so far; it's probably laughable. I'm totally new to this, but this data is super important to me: my PhD rests on whether I can get it or not.

from bs4 import BeautifulSoup
import urllib2
import time
import csv

i = range(1,3)
for num in i:
    print page
    pageFile = urllib2.urlopen(page)
    pageHtml = pageFile.read()
    pageFile.close()
    soup = BeautifulSoup("".join(pageHtml))
    sAll = soup.findAll("a", { "class" : "title"})
    ofile = open('allurls.cvs', "w")
    for href in sAll:
        #write species name
        print "http://www.iucnredlist.org/" + href['href']
        ofile.write("http://www.iucnredlist.org/" + href['href']+ ",")
        pageFile = urllib2.urlopen("http://www.iucnredlist.org/" + href['href'])
        pageHtml = pageFile.read()
        pageFile.close()
        #find assessment info
        innersoup = BeautifulSoup("".join(pageHtml)
        assessment_section = innersoup.findAll("h2", id="sectionAssessment")
        #get current status
        assessment_table = assessment_section.find_next('table')
        criteria_label = assessment_table.find(text="Red List Category & Criteria:")
        Currentstatus = criteria_label.find_next('td')
        #write it in
        ofile.write(Currentstatus)
        #Find the history
        history_section =innersoup.find_next(text="History:")
        assessment_table = assessment_section.find_next('table')
        history_label= assessment_table.find(text="Critically Endangered" or "Endangered" or "Vulnerable" or "Extinct" or "Lower Risk/least concern" or "Least Concern"  or "Near Threatened" or "Extinct in the Wild")
        history_label= assessment_table.find_next(text="Critically Endangered" or "Endangered" or "Vulnerable" or "Extinct" or "Lower Risk/least concern" or "Least Concern"  or "Near Threatened" or "Extinct in the Wild")
        history_label= assessment_table.find_next(text="Critically Endangered" or "Endangered" or "Vulnerable" or "Extinct" or "Lower Risk/least concern" or "Least Concern"  or "Near Threatened" or "Extinct in the Wild")
        oflie.write(history_label)
    print num
    time.sleep(1)
ofile.close()

Leonard Richardson

May 31, 2013, 1:09:41 PM
to beauti...@googlegroups.com
You have a missing right parenthesis on this line:

> innersoup = BeautifulSoup("".join(pageHtml)

This causes a syntax error on the next line.

The correct code is:

innersoup = BeautifulSoup("".join(pageHtml))
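
As for why the error can show up on the following line: inside an open parenthesis Python keeps reading, so the parser only notices a problem once it reaches the next statement. Here is a small sketch using the built-in compile() to show where the error gets reported. The two-line snippet is made up for illustration, and the exact line number and message depend on the Python version (older interpreters point at the second line, while newer ones may point back at the unclosed parenthesis).

```python
# Two lines of source; the first is missing its closing parenthesis.
source = 'innersoup = str("".join(["a", "b"])\nnext_line = 1\n'

try:
    compile(source, "<example>", "exec")
    error = None
except SyntaxError as exc:
    error = exc

if error is not None:
    # lineno is the line the parser blames, which is not always
    # the line that contains the actual mistake.
    print("SyntaxError reported at line", error.lineno, "-", error.msg)
```

So when Python flags a line that looks fine, it is worth checking the line just above it for an unclosed bracket or quote.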

Leonard

kopilov evgeniy

Aug 4, 2015, 9:20:32 PM
to beautifulsoup, leon...@segfault.org
OK, but I run into this issue:
Traceback (most recent call last):
  File "C:/Python34/urlopen.py", line 2, in <module>
    import urllib2
ImportError: No module named 'urllib2'


On Friday, May 31, 2013 at 20:09:41 UTC+3, Leonard Richardson wrote:

Cameron Simpson

Aug 5, 2015, 1:57:06 AM
to beauti...@googlegroups.com, leon...@segfault.org
On 04Aug2015 18:06, kopilov evgeniy <kope...@gmail.com> wrote:
>OK, but I run into this issue:
>Traceback (most recent call last):
> File "C:/Python34/urlopen.py", line 2, in <module>
> import urllib2
>ImportError: No module named 'urllib2'

In Python 3 the urllib modules have been renamed and restructured. Have a look
at the Python 3 docs:

https://docs.python.org/3/py-modindex.html#cap-u

You'll have to tweak your import a bit for Python 3.
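
As a rough sketch of that tweak: in Python 3, urlopen lives in urllib.request rather than urllib2. The fallback import below lets the same script run on both versions; fetch is a hypothetical helper name, not part of either library.

```python
try:
    # Python 3: urlopen moved into urllib.request
    from urllib.request import urlopen
except ImportError:
    # Python 2: the original location
    from urllib2 import urlopen


def fetch(url):
    """Download a page and return its raw bytes (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()
```

The bytes returned by fetch() can be passed straight to BeautifulSoup, which works out the encoding itself.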

Cheers,
Cameron Simpson <c...@zip.com.au>