find_parents only finds direct children?


Andreas Christoffersen

Jul 31, 2012, 7:09:47 AM
to beauti...@googlegroups.com
New question, I am afraid:

tekst = '<li ><div class="views-field-field-webrubrik-value"><h3><a href="/307046">Claus Hjort spiller med mærkede kort</a></h3>  </div><div class="views-field-field-skribent-uid"><div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>  </div> <div class="views-field-field-webteaser-value"> <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok snarere at forberede det ideologiske grundlag for en Løkke Rasmussens genkomst som statsminister</div>  </div><span class="views-field-view-node"> <span class="actions"><a href="/307046">Læs mere</a> | <a href="/307046/#comments">Kommentarer (4)</a></span> </span></li>'

I am interested in finding the article link (highlighted in yellow in the original post). My search term is "Rasmussen" (highlighted in red).

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(tekst)
contexts = soup.find_all(text=re.compile("Rasmussen"))
for a in contexts:
    print "context: %s" % a.encode('utf-8')
    for artikel_link in a.find_parents('a'):
        print "Artikel link %s" % artikel_link
        link = artikel_link.get('href')
        print "Link %s" % link

Maybe the link is not really a parent, but a previous sibling? But however much I tinker, I can't seem to extract the link, e.g. 

for i in contexts:
    i.find_previous_siblings('a')

returns an empty list.
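
A minimal sketch of why the sibling search comes back empty, and of one possible alternative. It assumes Python 3, a shortened stand-in for the markup above (the variable `html` is made up for illustration), and the built-in `html.parser`. `find_previous()` searches backwards through the whole document rather than only among siblings, so it may be closer to what is wanted here.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the sample markup in this post.
html = (
    '<li><div><h3><a href="/307046">Headline</a></h3></div>'
    '<div><div class="webteaser">en Rasmussens genkomst</div></div>'
    '<span><a href="/307046/#comments">Kommentarer (4)</a></span></li>'
)
soup = BeautifulSoup(html, "html.parser")
text_node = soup.find(string=re.compile("Rasmussen"))

# The text node is the only child of its <div>, so it has no siblings at all.
siblings = text_node.find_previous_siblings("a")

# find_previous() walks backwards in document order instead of along siblings.
previous_link = text_node.find_previous("a")
print(list(siblings), previous_link["href"])
```

Note that "previous in document order" is not always "nearest" in a visual sense, so this is only one heuristic among several.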

As the prettify() output below shows, the text string is not a direct child of the link. It sits within the same <li> element, but nested below a sibling of the link's container, so neither the sibling search nor the parent search can easily find this link. Correct?
print soup.prettify() 

<html>
 <body>
  <li>
   <div class="views-field-field-webrubrik-value">
    <h3>
     <a href="/307046">
      Claus Hjort spiller med mærkede kort
     </a>
    </h3>
   </div>
   <div class="views-field-field-skribent-uid">
    <div class="byline">
     Af:
     <span class="authors">
      Dennis Kristensen
     </span>
    </div>
   </div>
   <div class="views-field-field-webteaser-value">
    <div class="webteaser">
     Claus Hjort Frederiksens argumenter for at afvise trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok snarere at forberede det ideologiske grundlag for en Løkke Rasmussens genkomst som statsminister
    </div>
   </div>
   <span class="views-field-view-node">
    <span class="actions">
     <a href="/307046">
      Læs mere
     </a>
     |
     <a href="/307046/#comments">
      Kommentarer (4)
     </a>
    </span>
   </span>
  </li>
 </body>
</html>
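
To make the tree above concrete, here is a small sketch (Python 3, shortened hypothetical markup, built-in `html.parser`) that lists the ancestors of the matched text. `find_parents('a')` only ever looks at this chain of ancestors, and no `<a>` appears in it.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the sample markup in this post.
html = (
    '<li><div><h3><a href="/307046">Headline</a></h3></div>'
    '<div><div class="webteaser">en Rasmussens genkomst</div></div>'
    '<span><a href="/307046/#comments">Kommentarer (4)</a></span></li>'
)
soup = BeautifulSoup(html, "html.parser")
text_node = soup.find(string=re.compile("Rasmussen"))

# .parents yields every ancestor, from the closest one up to the document root.
ancestor_names = [parent.name for parent in text_node.parents]
print(ancestor_names)

# find_parents() filters that same chain, so it finds no <a> here.
print(text_node.find_parents("a"))
```

With a parser that wraps fragments in `<html>` and `<body>`, as in the prettify() output above, the chain would simply contain those two extra ancestors.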


Thanks in advance

On Sun, Jul 29, 2012 at 4:32 PM, Andreas Christoffersen <achrist...@gmail.com> wrote:
Thanks for getting me up to speed, Leonard. Everything now works as expected! Love BS4. Whatever other reasons there are, I find it much easier than lxml (for my needs anyway). Also really good documentation. Thanks again.

Link Swanson

Jul 31, 2012, 9:49:59 AM
to beauti...@googlegroups.com
I was able to get at the href you are targeting with this:

href = [parent for parent in contexts[0].parents][2].a['href']

In other words, it's the first <a> in the third parent of the first context that you pulled with re.
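
The hard-coded index can also be replaced by a lookup by tag name. The sketch below is only an illustration under assumptions: Python 3, a shortened hypothetical version of the markup, and the guess that the enclosing `<li>` is the right container on this site.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the sample markup in this thread.
html = (
    '<li><div><h3><a href="/307046">Headline</a></h3></div>'
    '<div><div class="webteaser">en Rasmussens genkomst</div></div>'
    '<span><a href="/307046/#comments">Kommentarer (4)</a></span></li>'
)
soup = BeautifulSoup(html, "html.parser")
text_node = soup.find(string=re.compile("Rasmussen"))

# Index-based version: the third ancestor happens to be the <li>.
by_index = list(text_node.parents)[2].a["href"]

# Name-based version: find the closest enclosing <li>, then its first <a>.
item = text_node.find_parent("li")
by_name = item.a["href"] if item is not None and item.a is not None else None

print(by_index, by_name)
```

The name-based lookup keeps working if the nesting depth changes, but it still assumes every result lives inside an `<li>`.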

Since the site uses the Drupal Views module to generate the HTML, you might be able to get away with hard-coding the list index.

Otherwise I am sure you could check the contents of each until you hit it. 

Link

--
Link Swanson
Must Build Digital


Andreas Christoffersen

Jul 31, 2012, 12:32:40 PM
to beauti...@googlegroups.com
Thanks Link,

I guess I am slowly learning that nothing is very easy when web scraping: my sample code works in all cases but this one, so far.

Not all the sites (Danish newspapers) are Drupal sites, so I need something as generic as possible.

A solution seems to be to first try the current approach and, if no link is found, try parents[0]; if nothing is found there, try parents[1], and so on. (That's actually what I thought find_parents('a') did.)

It seems a small recursive function is suited to this task. Is that what you had in mind when you wrote:

> Otherwise I am sure you could check the contents of each until you hit it

Correct?

Thanks a bunch!!!

Link Swanson

Jul 31, 2012, 1:46:16 PM
to beauti...@googlegroups.com
Correct: one way or another, you can loop through the parents and use if statements to check each one for what you need.
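
One way such a loop might look, as a sketch rather than a tested recipe: Python 3, a shortened hypothetical version of the markup, and a made-up helper name `nearest_link`. It returns the first `<a>` inside the closest ancestor that contains one.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the sample markup in this thread.
html = (
    '<li><div><h3><a href="/307046">Headline</a></h3></div>'
    '<div><div class="webteaser">en Rasmussens genkomst</div></div>'
    '<span><a href="/307046/#comments">Kommentarer (4)</a></span></li>'
)
soup = BeautifulSoup(html, "html.parser")


def nearest_link(node):
    """Return the <a href> in the closest ancestor that is or contains one."""
    for parent in node.parents:
        # Case 1: the text sits inside the link itself.
        if parent.name == "a" and parent.has_attr("href"):
            return parent
        # Case 2: the link is somewhere below this ancestor.
        link = parent.find("a", href=True)
        if link is not None:
            return link
    return None


text_node = soup.find(string=re.compile("Rasmussen"))
link = nearest_link(text_node)
print(link["href"] if link is not None else None)
```

When an ancestor holds several links, `find()` returns the first one in document order, which is not necessarily the one closest to the text.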

Also correct that it is hard to handle diverse cases. 

Good luck!

Andreas Christoffersen

Aug 1, 2012, 3:27:09 PM
to beauti...@googlegroups.com
I have some problems grokking this. I guess it's more of a Python problem (I am a beginner) than a BS4 problem. Nonetheless, I am posting it here; I hope that is okay.

Below is my attempt at a recursive function, as per the posts above. What I don't get is the behaviour described in the debugging comments inside the function (highlighted in yellow in the original post).

Incidentally, I think "find nearest link" is not that uncommon a task?

Thanks in advance.

#-*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import re

tekst = '<li ><div class="views-field-field-webrubrik-value"><h3><a href="/307046">Claus Hjort spiller med mærkede kort</a></h3>  </div><div class="views-field-field-skribent-uid"><div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>  </div> <div class="views-field-field-webteaser-value"> <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok snarere at forberede det ideologiske grundlag for en Løkke Rasmussens genkomst som statsminister</div>  </div><span class="views-field-view-node"> <span class="actions"><a href="/307046">Læs mere</a> | <a href="/307046/#comments">Kommentarer (4)</a></span> </span></li>'
to_find = "Rasmussen"
soup = BeautifulSoup(tekst)
contexts = soup.find_all(text=re.compile(to_find))


def find_nearest(element, url, direction="both"):
    """Find the nearest link, relative to a text string.
    When complete it will search up and down (parent, child)
    and return the link the fewest steps away from the
    original element. Assumes we have already found an element."""
    # Is the nearest link readily available?
    # If so - this is what we want.
    if element.find_parents('a'):
        for artikel_link in element.find_parents('a'):
            print "artikel_link found: %s" % artikel_link
            link = artikel_link.get('href')
            # Prefix relative links with the site URL
            if not link.startswith(("http", "www")):
                link = url + link
            return link
    # If the link is not readily available, we will go up
    if not element.find_parents('a'):
        element = element.parent
        # Print for debugging
        print element  # on the 2nd run (i.e. <li>) this output contains <a href="/307046">
        # So shouldn't it be caught as readily available above?
        print u"Found: %s" % element.name
        # The recursive call
        return find_nearest(element, url)

if contexts:
    for a in contexts:
        find_nearest(element=a, url="http://information.dk")
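
On the question in the comments: `find_parents('a')` only inspects ancestors, while the `<a>` printed along with the `<li>` is a descendant of it, so the first branch never fires. A possible rewrite is sketched below, with several assumptions: Python 3 syntax, a shortened hypothetical version of the markup, `find()` for the downward search, and `urljoin` from the standard library in place of string concatenation. It also stops once it climbs past the top of the tree, where `.parent` is `None`.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical stand-in for the sample markup in this thread.
html = (
    '<li><div><h3><a href="/307046">Headline</a></h3></div>'
    '<div><div class="webteaser">en Rasmussens genkomst</div></div>'
    '<span><a href="/307046/#comments">Kommentarer (4)</a></span></li>'
)


def find_nearest(element, base_url):
    """Climb from element until a tag is, or contains, a link; return its URL."""
    if element is None:  # climbed past the document root: nothing found
        return None
    if isinstance(element, Tag):
        # The tag may be the link itself, or the link may be below it.
        link = element if element.name == "a" else element.find("a", href=True)
        if link is not None and link.get("href"):
            # urljoin leaves absolute URLs alone and resolves relative ones.
            return urljoin(base_url, link["href"])
    # Return the result of the recursive call, otherwise it is lost.
    return find_nearest(element.parent, base_url)


soup = BeautifulSoup(html, "html.parser")
results = [
    find_nearest(node, "http://information.dk")
    for node in soup.find_all(string=re.compile("Rasmussen"))
]
print(results)
```

The `direction` parameter from the original is left out here because the sketch only climbs upward and searches below each ancestor.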
