BS4 .contents adds extra "\n" between each result


FMAPR

Jul 12, 2019, 7:45:39 AM
to beautifulsoup
I'll give an example.

Let's say I have this html code

<div >

  <div>
   hello
  </div>

  <div>
   hello2
  </div>

</div>

If I run the code 


from bs4 import BeautifulSoup

# html_code holds the HTML shown above
soup = BeautifulSoup(html_code, 'html.parser')
print(soup.contents)

I would expect it to print
[<div>hello</div>, <div>hello2</div>]

but instead, I get this
['\n', <div>hello</div>, '\n', <div>hello2</div>, '\n']

Has anyone encountered this issue?


Malik Rumi

Jul 12, 2019, 9:06:10 AM
to beauti...@googlegroups.com
What happens if you just run rstrip('\n') either before or after?

“None of you has faith until he loves for his brother or his neighbor what he loves for himself.”


--
You received this message because you are subscribed to the Google Groups "beautifulsoup" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beautifulsou...@googlegroups.com.
To post to this group, send email to beauti...@googlegroups.com.
Visit this group at https://groups.google.com/group/beautifulsoup.
To view this discussion on the web visit https://groups.google.com/d/msgid/beautifulsoup/bc465c3a-bc44-4a6f-8fa8-dcc09c56bbd4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

FMAPR

Jul 12, 2019, 9:14:48 AM
to beautifulsoup
I'm really embarrassed about this, but the problem I posted was a simplification.
I just ran it, and it doesn't actually reproduce the problem; .contents works just fine there.
The actual code I was running, which I know reproduces the issue, is this:

import requests
from bs4 import BeautifulSoup

response = requests.get('http://www.basicwebsiteexample.com/', timeout=(1, 4))
soup = BeautifulSoup(response.content, 'html.parser')
nodes = soup.find_all(id="page-zones__main")
print(nodes[0].contents)

FMAPR

Jul 12, 2019, 9:23:25 AM
to beautifulsoup
It returns a list and not a string. I think I'd have to iterate over it and call remove() whenever I find a \n, but I'm new to Python, and to be honest I'm not even sure whether rstrip works on lists.

Despite this, your approach solves the problem. I'm just trying to find out whether this behavior is normal and whether there is a more efficient way.

facelessuser

Jul 12, 2019, 9:32:47 AM
to beautifulsoup
The contents of an element include both element nodes and text nodes, along with comments and other pre-processing bits. So yes, this is expected. If you had the HTML element `<div>text</div>`, how would you represent the content of `div`? It would be `['text']`.

If you had `<div>text<span>more</span></div>`, the content would be `['text', <span>more</span>]`. The `\n` characters are text nodes, so they show up in the contents.
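
A minimal sketch of the point above, assuming the built-in 'html.parser' backend. The markup strings and variable names are made up for illustration; they are not taken from the site discussed in this thread.

```python
from bs4 import BeautifulSoup

# No whitespace between tags: one text node and one element node.
inline = BeautifulSoup("<div>text<span>more</span></div>", "html.parser")
print(inline.div.contents)

# Newlines between the child tags become text nodes of their own.
spaced_html = "<div>\n<div>hello</div>\n<div>hello2</div>\n</div>"
spaced = BeautifulSoup(spaced_html, "html.parser")
print(spaced.div.contents)

# The class of each child shows which entries are text and which are tags.
kinds = [type(child).__name__ for child in spaced.div.contents]
print(kinds)
```

Text nodes come back as NavigableString objects and elements as Tag objects, which is why the list mixes quoted strings with unquoted tags when printed.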

Malik Rumi

Jul 12, 2019, 9:43:12 AM
to beauti...@googlegroups.com
rstrip does not work on lists, because it is a string method. So my suggestion was to use it on the string before you create the list, or, alternatively, to extract the strings and then strip them. There are a lot of ways you could do this. Since you say you are new, I would *highly* recommend Corey Schafer on YouTube: https://www.youtube.com/channel/UCCezIgC97PvUuR4_gbFUs5g and especially https://www.youtube.com/watch?v=k9TUPpGqYTo&t=329s since that one is specifically about strings. Good luck.
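
One possible reading of "extract the strings and then strip", sketched with Beautiful Soup's stripped_strings generator. This is an illustration under assumptions, not necessarily what was meant above; the HTML is the simplified example from the first post, and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# The simplified markup from the first post in this thread.
html_code = (
    "<div>\n\n  <div>\n   hello\n  </div>\n\n"
    "  <div>\n   hello2\n  </div>\n\n</div>"
)
soup = BeautifulSoup(html_code, "html.parser")

# stripped_strings yields each text node with surrounding whitespace
# removed and skips nodes that are whitespace only.
texts = list(soup.div.stripped_strings)
print(texts)

# Lists have no rstrip method; it has to be called on each string.
print(hasattr([], "rstrip"), hasattr("", "rstrip"))
```

This only helps when the text is what you are after. It discards the tags themselves, so it is not a drop-in replacement for .contents.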



FMAPR

Jul 12, 2019, 9:47:24 AM
to beautifulsoup
If I understand you correctly, if we had 
'<div>
text
<span>more</span></div>'
The contents would be ['\n', 'text', '\n', '<span>more</span>']. That makes enough sense to me. 

Despite this, the html code in question does not have any direct text node children. 
I've attached a snip of the html to this reply
image.PNG

leonardr

Jul 12, 2019, 7:03:04 PM
to beautifulsoup
Tools like the one seen in the screenshot do a good job of showing the DOM, but they do a little cleanup for the sake of readability, and whitespace nodes between block-level tags are among the things that get removed.

The HTML markup at http://www.basicwebsiteexample.com/ has a lot of whitespace between <div> tags, in more or less the places you're seeing it. Here's a snippet:


</div> <div id="page-zones__main-widgets__button"> ...
</div>
<div id="page-zones__main-widgets__content1"
...

If you want to get rid of or ignore these nodes, I agree that strip() is the way to go. If strip() returns an empty string then you know it's nothing but whitespace.
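
A minimal sketch of the strip() check described above. The id "page-zones__main" comes from the code earlier in the thread, but the markup inside it is invented for illustration, and drop_blank_text is a hypothetical helper name.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Invented stand-in for the real page: child tags separated by whitespace.
html = (
    '<div id="page-zones__main">\n'
    '<div>hello</div> \n'
    '<div>hello2</div>\n'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
main = soup.find(id="page-zones__main")


def drop_blank_text(tag):
    """Return the children of tag, minus whitespace-only text nodes."""
    return [
        child
        for child in tag.contents
        # strip() returning an empty string means nothing but whitespace
        if not (isinstance(child, NavigableString) and not child.strip())
    ]


cleaned = drop_blank_text(main)
print(cleaned)

# Alternative when only element children matter: ask for direct child tags.
tags_only = main.find_all(True, recursive=False)
print(tags_only)
```

The helper keeps text nodes that contain real text, while find_all(True, recursive=False) returns tags only, so the two differ when an element mixes text and tags.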

Leonard