find_parents only finds direct children?
 
Andreas Christoffersen  
From: Andreas Christoffersen <achristoffer...@gmail.com>
Date: Tue, 31 Jul 2012 13:09:47 +0200
Local: Tues, Jul 31 2012 7:09 am
Subject: find_parents only finds direct children?

New question, I am afraid:

tekst = '''<li ><div class="views-field-field-webrubrik-value"><h3><a
href="/307046">Claus Hjort spiller med mærkede kort</a></h3>  </div><div
class="views-field-field-skribent-uid"><div class="byline">Af: <span
class="authors">Dennis Kristensen</span></div>  </div> <div
class="views-field-field-webteaser-value"> <div class="webteaser">Claus
Hjort Frederiksens argumenter for at afvise trepartsforhandlinger har ikke
hold i virkeligheden. Hans ærinde er nok snarere at forberede det
ideologiske grundlag for en Løkke Rasmussens genkomst som
statsminister</div>  </div><span class="views-field-view-node"> <span
class="actions"><a href="/307046">Læs mere</a> | <a
href="/307046/#comments">Kommentarer (4)</a></span> </span></li>'''

I am interested in finding the article link (href="/307046"; marked with
yellow in the original post). My search term is "Rasmussen" (highlighted
in red in the original post).

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(tekst)
contexts = soup.find_all(text=re.compile("Rasmussen"))
for a in contexts:
    print "context: %s" % a.encode('utf-8')
    for artikel_link in a.find_parents('a'):
        print "Artikel link %s" % artikel_link
        link = artikel_link.get('href')
        print "Link %s" % link

Maybe the link is not really a parent, but a previous sibling? But however
much I tinker, I can't seem to extract the link, e.g.

for i in contexts:
    i.find_previous_siblings('a')

returns empty lists.

As the prettify() output below shows, the text string is not a direct child
of the link. But it is positioned within the same <li> element, nested
below a sibling of the link's container. So neither the sibling nor the
parent searches can easily find this link, correct?

print soup.prettify()

<html>
 <body>
  <li>
   <div class="views-field-field-webrubrik-value">
    <h3>
     <a href="/307046">
      Claus Hjort spiller med mærkede kort
     </a>
    </h3>
   </div>
   <div class="views-field-field-skribent-uid">
    <div class="byline">
     Af:
     <span class="authors">
      Dennis Kristensen
     </span>
    </div>
   </div>
   <div class="views-field-field-webteaser-value">
    <div class="webteaser">
     Claus Hjort Frederiksens argumenter for at afvise
trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
genkomst som statsminister
    </div>
   </div>
   <span class="views-field-view-node">
    <span class="actions">
     <a href="/307046">
      Læs mere
     </a>
     |
     <a href="/307046/#comments">
      Kommentarer (4)
     </a>
    </span>
   </span>
  </li>
 </body>
</html>
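
The tree above can also be checked programmatically. The following is only a sketch for Python 3, assuming a trimmed copy of the markup (same nesting, shorter text), the built-in "html.parser" backend, and the `string=` keyword, which is the newer name for the `text=` argument used above. It lists the ancestors of the matched text node and shows that none of them is an <a>, which is why find_parents('a') and find_previous_siblings('a') both come back empty.

```python
import re
from bs4 import BeautifulSoup

# Trimmed stand-in for the markup in the post (same nesting, shorter text).
html = (
    '<li><div class="views-field-field-webrubrik-value"><h3>'
    '<a href="/307046">Claus Hjort</a></h3></div>'
    '<div class="views-field-field-webteaser-value">'
    '<div class="webteaser">en Løkke Rasmussens genkomst</div></div></li>'
)
soup = BeautifulSoup(html, "html.parser")
text_node = soup.find(string=re.compile("Rasmussen"))

# Names of every ancestor of the text node, nearest first.
ancestor_names = [parent.name for parent in text_node.parents]
print(ancestor_names)

# find_parents only matches ancestors, and no ancestor here is an <a>.
print(text_node.find_parents("a"))

# The text node is the only child of its <div>, so it has no siblings at all.
print(text_node.find_previous_siblings("a"))
```

The <a> sits under a different branch of the <li>, so it is reachable only by first going up to a common ancestor and then searching downwards.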

Thanks in advance

Link Swanson  
From: Link Swanson <l...@mustbuilddigital.com>
Date: Tue, 31 Jul 2012 08:49:59 -0500
Local: Tues, Jul 31 2012 9:49 am
Subject: Re: find_parents only finds direct children?

I was able to get at the href you are targeting with this:

href = [parent for parent in contexts[0].parents][2].a['href']

In other words it's the first <a> in the third parent of the first context
that you pulled with re
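
Spelled out step by step, the one-liner might look like the sketch below. It is written for Python 3 and assumes a trimmed stand-in for the markup, the "html.parser" backend, and `string=` in place of the older `text=` keyword; the variable names are only illustrative.

```python
import re
from bs4 import BeautifulSoup

# Trimmed stand-in for the markup in the original post (same nesting).
html = (
    '<li><div><h3><a href="/307046">Claus Hjort</a></h3></div>'
    '<div><div class="webteaser">Løkke Rasmussens genkomst</div></div>'
    '<span><a href="/307046/#comments">Kommentarer (4)</a></span></li>'
)
soup = BeautifulSoup(html, "html.parser")
contexts = soup.find_all(string=re.compile("Rasmussen"))

parents = list(contexts[0].parents)  # nearest ancestor first
third_parent = parents[2]            # inner <div>, outer <div>, then the <li>
first_link = third_parent.a          # shorthand for third_parent.find("a")
href = first_link["href"]
print(third_parent.name, href)
```

The index 2 is tied to this particular nesting depth, so it would need adjusting for markup with a different structure.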

Since the site is using the Drupal module Views
<http://drupal.org/project/views/> to generate the HTML, you might be able
to get away with hard-coding the list indices.

Otherwise I am sure you could check the contents of each until you hit it.

Link

--
Link Swanson
Must Build Digital

 
Andreas Christoffersen  
From: Andreas Christoffersen <achristoffer...@gmail.com>
Date: Tue, 31 Jul 2012 18:32:40 +0200
Local: Tues, Jul 31 2012 12:32 pm
Subject: Re: find_parents only finds direct children?

Thanks Link,

I guess I am slowly learning that nothing is very easy when web scraping: my
sample code works in all cases but this one, so far.

Not all the sites (Danish newspapers) are Drupal sites, so I need something
as generic as possible.

One solution seems to be to try the current approach first and, if no link
is found, then try parents[0]; if nothing is found there, then parents[1],
etc. (That's actually what I thought find_parents('a') did.)

It seems a small recursive function is suited to this task. Is that what you
had in mind when you wrote:

> Otherwise I am sure you could check the contents of each until you hit it

Correct?

Thanks a bunch!!!

Link Swanson  
From: Link Swanson <l...@mustbuilddigital.com>
Date: Tue, 31 Jul 2012 12:46:16 -0500
Local: Tues, Jul 31 2012 1:46 pm
Subject: Re: find_parents only finds direct children?

Correct, in some way you can loop through the parents and build checks
using if statements to find what you need.
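
One way such a loop could look is sketched below, for Python 3. The function name nearest_link, the trimmed markup, and the "html.parser" backend are assumptions made for illustration, and "nearest" here simply means the first link found under the closest ancestor that contains one.

```python
import re
from bs4 import BeautifulSoup


def nearest_link(text_node):
    """Walk up from text_node and return the first href found, or None."""
    for ancestor in text_node.parents:
        # An enclosing <a> is the closest possible link.
        if ancestor.name == "a" and ancestor.has_attr("href"):
            return ancestor["href"]
        # Otherwise search everything below this ancestor for a link.
        link = ancestor.find("a", href=True)
        if link is not None:
            return link["href"]
    # Reached the top of the document without finding any link.
    return None


# Trimmed stand-in for the markup discussed in this thread.
html = (
    '<li><div><h3><a href="/307046">Claus Hjort</a></h3></div>'
    '<div><div class="webteaser">Løkke Rasmussens genkomst</div></div>'
    '<span><a href="/307046/#comments">Kommentarer (4)</a></span></li>'
)
soup = BeautifulSoup(html, "html.parser")
text_node = soup.find(string=re.compile("Rasmussen"))
print(nearest_link(text_node))
```

Because the loop starts at the text node's parent rather than at the text node itself, every object it calls find() on is a tag, which avoids confusing the tag search with the plain string find() method that text nodes inherit.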

Also correct that it is hard to handle diverse cases.

Good luck!

Andreas Christoffersen  
From: Andreas Christoffersen <achristoffer...@gmail.com>
Date: Wed, 1 Aug 2012 21:27:09 +0200
Local: Wed, Aug 1 2012 3:27 pm
Subject: Re: find_parents only finds direct children?

I have some problems grokking this. I guess it is more a Python problem (I
am a beginner) than a BS4 problem. Nonetheless, I am posting it here. I hope
that is okay.

Below is my attempt at a recursive function, as per the posts above. What I
don't get is described in the debugging comments inside the function
(marked in yellow in the original post).

Incidentally, I think "find nearest link" is not that uncommon a task?

Thanks in advance.

#-*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import re

tekst = '''<li ><div class="views-field-field-webrubrik-value"><h3><a
href="/307046">Claus Hjort spiller med mærkede kort</a></h3>  </div><div
class="views-field-field-skribent-uid"><div class="byline">Af: <span
class="authors">Dennis Kristensen</span></div>  </div> <div
class="views-field-field-webteaser-value"> <div class="webteaser">Claus
Hjort Frederiksens argumenter for at afvise trepartsforhandlinger har ikke
hold i virkeligheden. Hans ærinde er nok snarere at forberede det
ideologiske grundlag for en Løkke Rasmussens genkomst som
statsminister</div>  </div><span class="views-field-view-node"> <span
class="actions"><a href="/307046">Læs mere</a> | <a
href="/307046/#comments">Kommentarer (4)</a></span> </span></li>'''
to_find = "Rasmussen"
soup = BeautifulSoup(tekst)
contexts = soup.find_all(text=re.compile(to_find))

def find_nearest(element, url, direction="both"):
    """Find the nearest link, relative to a text string.
    When complete it will search up and down (parent, child),
    Will then return the link the fewest steps away from the
    original element. Assumes we have already found an element"""
    # Is the nearest link readily available?
    # If so - this is what we want.
    if element.find_parents('a'):
        for artikel_link in element.find_parents('a'):
            print "artikel_link er fundet %s" % artikel_link
            link = artikel_link.get('href')
            # Prepend the site URL to relative links only
            if "http" not in link and "www" not in link:
                link = url + link
            return link
    # if the link is not readily available, we will go up
    if not element.find_parents('a'):
        element = element.parent
        if element is None:
            # Reached the top of the document without finding a link
            return None
        # Print for debugging
        print element  # on the 2nd run (i.e. <li>) this finds <a href=/307046>
        # So shouldn't it be caught as readily available above?
        print u"Found: %s" % element.name
        # the recursive call; pass its result back to the caller
        return find_nearest(element, url)

if contexts:
    for a in contexts:
        find_nearest( element=a, url="http://information.dk")
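
The hrefs in this markup are relative (for example /307046), so they need the site's base URL in front of them. As a possible alternative to joining the strings by hand, the sketch below uses urljoin from the Python 3 standard library (in Python 2 the same function lives in the urlparse module). The helper name absolute_link and the example.com address are made up for illustration.

```python
from urllib.parse import urljoin


def absolute_link(base_url, href):
    """Resolve href against base_url; absolute hrefs are returned unchanged."""
    return urljoin(base_url, href)


base = "http://information.dk"
print(absolute_link(base, "/307046"))
print(absolute_link(base, "/307046/#comments"))
print(absolute_link(base, "http://example.com/article"))
```

This avoids having to guess from substrings such as "http" or "www" whether a link is already absolute.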
