findAll using div, id and class.


John W Smith

Apr 15, 2011, 12:42:46 PM
to beautifulsoup
Dear All,

I had a quick question on parsing using BSoup, and I was wondering if
someone in this group could point me in the right direction.

I am trying to parse a page which looks something like (after
"soup'ing" it):

#####################################
<div class="yourDesc">
<div id="youridandmine" class="yourText">
Jack is Back and so am I …
</div>
#####################################

I am trying to extract simply (&only) the text " Jack is Back and so
am I … ", but at the moment, I am unable to do so..

At the moment, I am able to use, for example:

ex_price=str(soup.findAll("div",{ "class" : "yourText" }))
ex_adType=str(soup.findAll("div",id="yourType"))

..but I am looking for the syntax which utilizes "div" "id" and
"class" in findAll..

I use regex with findAll, but still, I am unable to extract just the
string " Jack is Back and so am I … "

Any ideas or clues would be much appreciated!

Thanks for your time!

Regards,

John D


pbuckner

Apr 16, 2011, 12:06:42 AM
to beautifulsoup
Use a lambda expression and .get("attr"):

soup.findAll(lambda tag: tag.name == "div" and
             tag.get("id") == "youridandmine" and
             tag.get("class") == "yourText")[0].string
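
For readers on the newer bs4 package, here is a minimal sketch of the same lookup without a lambda. It assumes bs4 is installed and uses its `find_all` spelling, whereas the thread uses BeautifulSoup 3's `findAll`. In bs4, `class` is treated as a multi-valued attribute, so `tag.get("class")` returns a list and the string comparison in the lambda above would not match there. The closing tag for the outer div is added here only to make the sample complete.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with the outer div closed.
html = """
<div class="yourDesc">
<div id="youridandmine" class="yourText">
Jack is Back and so am I …
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Both attributes go in one dict, so a tag must match all of them.
matches = soup.find_all("div", {"id": "youridandmine", "class": "yourText"})

# get_text(strip=True) drops the newlines around the text.
text = matches[0].get_text(strip=True)
print(text)
```

Passing the attributes as a dict keeps the query declarative. The lambda form remains useful when the condition is more than simple equality.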

John W Smith

Apr 16, 2011, 11:10:24 AM
to beautifulsoup
Thanks a lot pbuckner: that works like a charm.

-John.

John W Smith

Apr 16, 2011, 12:08:15 PM
to beautifulsoup
Dear All,

On the same note, I was wondering if I could pop another question:

Is it possible to do a "findAll" by specifying: div, id, span and
class?

Many thanks for your time.

Best regards,
-John

Peter Buckner

Apr 16, 2011, 12:25:46 PM
to beauti...@googlegroups.com
It depends on what you're looking for. In your example, div, id, and class all describe the same element, hence the provided solution. If you're adding "span", that's a different element, so it's not quite so simple. In general, you might call findAll to get the "outer" element (perhaps the div), then call findAll again on each returned result (remember, findAll returns a list) to look for the span.
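
A rough sketch of that two-step search, assuming the bs4 package (`find_all` rather than BeautifulSoup 3's `findAll`). The markup and the class names `outer`, `inner`, and `other` are hypothetical, made up only to show the nesting.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two outer divs, each wrapping one or more spans.
html = """
<div class="outer"><span class="inner">first</span></div>
<div class="outer"><span class="inner">second</span><span class="other">skip</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

found = []
for div in soup.find_all("div", {"class": "outer"}):
    # Searching from `div` only considers that div's descendants.
    for span in div.find_all("span", {"class": "inner"}):
        found.append(span.get_text())

print(found)
```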

What's your example HTML?

--

 -Peter Buckner

John D

Apr 17, 2011, 8:26:33 AM
to beauti...@googlegroups.com
Dear Peter,

Thank you very much for your post for my problem.

Well, I have 2 examples here, where I am trying to just extract the texts (and I have only a BSoup/regex approach to it: not pure BSoup):

##### 1 ####
<div id="Type" class="Link"><span class="SType">SType001</span>, distilled: <span id="distiled" class="Amount">100gs</span></div>
##### 1 ####
Here, I wish to extract the texts "SType001" and "100gs".

##### 2 ####
<ul class="specsNA">
<li class=""> <span class="Ktype">KTypeNA</span></li><li class=""> <span class="Ltype">LTypeNA</span></li><li class=""> <span class="MType">MTypeNA</span></li><li class=""> <span class="NType">NTypeNA</span></li>
</ul>
##### 2 ####
Here, I wish to extract "KTypeNA", "LTypeNA", "MTypeNA" and "NTypeNA".

At the moment, I use the div/id or div/id/class commands (with a lambda, as you suggested) to extract what I can, and then use regex (re.compile) to pull out the text. However, this does not look elegant to me (I am not a real programmer: I have a textile engineering degree..), and I was wondering if there is a better solution, since I wish to parse 1000's of pages. Also, some posts on the net seem to suggest not using regex to parse/extract information. I am unsure about that, and therefore want to avoid regex as much as possible.

Thank you for your time.

Best regards,

John.




pbuckner

Apr 21, 2011, 10:26:06 AM
to beautifulsoup
First, regarding "posts suggesting not to use regex": there's some
truth to that, but you're doing the right thing. The idea is that
you don't want to use _only_ regex, because you'll generally make your
code too sensitive to changes in the source HTML (depending on what
you're looking for in the source, of course). The better way, in
general, is to use a DOM parser: something which understands the
"structure" of the page, so you can look for "Third table, whichever
column has class="foo", every other row" rather than
re.match("(?:si)<table>\s*<t[rh]+>\s*<td>.....blah blah).

There is nothing wrong with your approach to use BS to get the right
div or span and then use simple regex to get the data out you like.
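
As an illustration of that split, the sketch below lets the parser locate the span from example 1 and applies a small regex only to the extracted text. It assumes the bs4 package. The pattern "digits followed by a unit" is a guess about the data, not something the thread specifies.

```python
import re
from bs4 import BeautifulSoup

# Example 1 from the earlier post.
html = ('<div id="Type" class="Link"><span class="SType">SType001</span>, '
        'distilled: <span id="distiled" class="Amount">100gs</span></div>')

soup = BeautifulSoup(html, "html.parser")

# Step 1: the parser finds the element by structure.
amount_text = soup.find("span", {"class": "Amount"}).get_text()

# Step 2: a simple regex splits the already-isolated text.
# Assumption: the amount is digits followed by a unit, as in "100gs".
m = re.match(r"(\d+)\s*([A-Za-z]+)$", amount_text)
value, unit = int(m.group(1)), m.group(2)
print(value, unit)
```

Because the regex never sees any markup, a change to the surrounding HTML should only affect the `find` call.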

So, for your example first find all the "containing" divs (example 1)
or containing ul (example 2)

#####
divs = soup.findAll("div", {"id": "Type"})
#####

That gives you a list of soup objects, each one "starting" at a div
with the matching id. Next, loop through each of those divs, finding
and extracting the text you need:

####
results = []
for div in divs:
    s = div.findAll("span", {"class": "SType"})[0].text
    d = div.findAll("span", {"class": "Amount"})[0].text
    results.append([s, d])
#####

Note that findAll returns a list, so within the loop we (blindly)
take the first & hopefully only element & then convert it to text.
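
If the "blind" indexing is a concern across thousands of pages, one possible variant is sketched below. It assumes the bs4 package, whose `find` returns None when nothing matches, so a missing span can be skipped instead of raising IndexError. The second div, which has no Amount span, is hypothetical and added only to exercise the guard.

```python
from bs4 import BeautifulSoup

# First div is example 1 from the thread; the second is a hypothetical
# incomplete record with no Amount span.
html = """
<div id="Type" class="Link"><span class="SType">SType001</span>, distilled: <span id="distiled" class="Amount">100gs</span></div>
<div id="Type" class="Link"><span class="SType">SType002</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for div in soup.find_all("div", {"id": "Type"}):
    s = div.find("span", {"class": "SType"})
    d = div.find("span", {"class": "Amount"})
    if s is None or d is None:
        # Skip records that lack either span.
        continue
    results.append([s.get_text(), d.get_text()])

print(results)
```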

There's a way to do that all in a single line, I'm pretty sure, but it
won't run any faster and will be more difficult to debug.
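
For reference, one way a single-expression version could look, applied here to example 2 (the ul). This is a sketch assuming the bs4 package, not necessarily the one-liner meant above.

```python
from bs4 import BeautifulSoup

# Example 2 from the earlier post.
html = """
<ul class="specsNA">
<li class=""> <span class="Ktype">KTypeNA</span></li><li class=""> <span class="Ltype">LTypeNA</span></li><li class=""> <span class="MType">MTypeNA</span></li><li class=""> <span class="NType">NTypeNA</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Nested comprehension: every span inside every matching ul, in document order.
names = [span.get_text()
         for ul in soup.find_all("ul", {"class": "specsNA"})
         for span in ul.find_all("span")]

print(names)
```

The comprehension is compact, but the explicit loop is easier to step through when a page does not have the expected structure.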

John W Smith

Apr 26, 2011, 5:02:00 AM
to beautifulsoup
Hi Peter,

Many thanks for your reply & sorry for the delay in replying back.

I agree with everything you seem to say - and your reply suggests that
I am doing the right thing - by using BSoup+regex.

Thanks for everything.

Best regards,

John.