Soup should only contain the <body> element


Mert Duyum

Sep 13, 2022, 9:36:47 AM
to beautifulsoup
How can I make a soup from only the <body> element? I want to scrape all URLs from the <body> element and ignore everything else, such as <head>.

Something like this:

for url in soup_body.find_all(href=True):
    print(url['href'])

Best Regards

Mert Duyum

Isaac Muse

Sep 13, 2022, 8:55:44 PM
to beautifulsoup

There are multiple ways to get the URLs from only the body. Here are two ways.

  1. Find the body and then find all elements with href under it.
  2. Use CSS selectors to find all elements with href under the body.

from bs4 import BeautifulSoup

HTML = """
<!DOCTYPE html>
<html>

<head>
<link rel="stylesheet" href="mystyle.css">
</head>

<body>
<h1>Some <a href="https://example.com/some-page-1/">link 1</a></h1>
<h1>Some <a href="https://example.com/some-page-2/">link 2</a></h1>
</body>

</html>
"""

soup = BeautifulSoup(HTML, 'html.parser')

print([el['href'] for el in soup.find('body').find_all(attrs={'href': True})])
print([el['href'] for el in soup.select('body [href]')])

Output

['https://example.com/some-page-1/', 'https://example.com/some-page-2/']
['https://example.com/some-page-1/', 'https://example.com/some-page-2/']
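
A third option, closer to the literal question of building a soup from the body alone, is to restrict parsing with SoupStrainer. The following is a minimal sketch that reuses the sample HTML above; the names only_body, soup_body, and body_urls are illustrative and not from the thread.

```python
from bs4 import BeautifulSoup, SoupStrainer

HTML = """
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" href="mystyle.css">
</head>
<body>
<h1>Some <a href="https://example.com/some-page-1/">link 1</a></h1>
<h1>Some <a href="https://example.com/some-page-2/">link 2</a></h1>
</body>
</html>
"""

# Keep only the <body> element and its descendants while parsing.
only_body = SoupStrainer('body')
soup_body = BeautifulSoup(HTML, 'html.parser', parse_only=only_body)

# Every element with an href in this soup comes from inside <body>.
body_urls = [el['href'] for el in soup_body.find_all(href=True)]
print(body_urls)
```

With this approach the stylesheet link in <head> never enters the tree. Note that parse_only is not supported by the html5lib parser, so this sketch assumes html.parser or lxml.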

Ajay Sahar

Sep 13, 2022, 9:26:13 PM
to beauti...@googlegroups.com
Hi friend,
I have a question: how can I filter out a particular URL?

--
You received this message because you are subscribed to the Google Groups "beautifulsoup" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beautifulsou...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/beautifulsoup/f790d42b-6094-4f87-b6ad-7bbdc1de5b35n%40googlegroups.com.

Isaac Muse

Sep 13, 2022, 9:32:25 PM
to beautifulsoup
The example I showed parses URLs from under the body only. If that is not what you were asking, you may have to give an example to help me understand.
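
If the question is instead about dropping one particular URL from the body results, one possible approach is sketched below. The meaning of "filter out" is assumed here, and the EXCLUDED value is a hypothetical placeholder rather than anything named in the thread.

```python
from bs4 import BeautifulSoup

HTML = """
<html>
<head><link rel="stylesheet" href="mystyle.css"></head>
<body>
<a href="https://example.com/some-page-1/">link 1</a>
<a href="https://example.com/some-page-2/">link 2</a>
</body>
</html>
"""

# Hypothetical URL to exclude; not taken from the thread.
EXCLUDED = 'https://example.com/some-page-2/'

soup = BeautifulSoup(HTML, 'html.parser')

# Collect hrefs under <body>, skipping the excluded one.
filtered = [
    el['href']
    for el in soup.select('body [href]')
    if el['href'] != EXCLUDED
]
print(filtered)
```

The same list comprehension can take any other condition, such as a prefix check with str.startswith, if a whole group of URLs should be skipped.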

Mert Duyum

Sep 14, 2022, 7:03:34 AM
to beautifulsoup
I tried it out, but it still picks up other URLs that are not in the body element.

Isaac Muse

Sep 14, 2022, 8:14:24 AM
to beautifulsoup
I've provided a working example that you can run yourself. Did you run the example? Did you get a different result than what I posted?

If what I provided is applied properly, it will work, unless I've misunderstood the actual question. Unfortunately, you have not provided any failing examples, so all I can do is guess what your real issue is.

I cannot help further if you do not provide a minimal, failing example. Please take my example and modify it to demonstrate the failing scenario that is giving you trouble.
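
As a possible starting point for such a minimal example, the sketch below contrasts searching the whole soup with searching from the body tag. That searching the whole soup is the cause of the reported problem is only a guess; nothing in the thread confirms it. The variable names all_urls and body_urls are illustrative.

```python
from bs4 import BeautifulSoup

HTML = """
<html>
<head><link rel="stylesheet" href="mystyle.css"></head>
<body>
<a href="https://example.com/some-page-1/">link 1</a>
<a href="https://example.com/some-page-2/">link 2</a>
</body>
</html>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# Searching the whole document also returns the stylesheet link in <head>.
all_urls = [el['href'] for el in soup.find_all(href=True)]

# Searching from the <body> tag returns only the links inside it.
body_links = soup.body.find_all(href=True)
body_urls = [el['href'] for el in body_links]

# Sanity check: each element found this way has <body> as an ancestor.
assert all(el.find_parent('body') is not None for el in body_links)

print(all_urls)
print(body_urls)
```

If a real page still yields unexpected URLs with the body-scoped search, replacing HTML with the smallest fragment of that page that reproduces the problem would give the failing example requested above.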

Ajay Sahar

Sep 14, 2022, 8:58:25 AM
to beauti...@googlegroups.com
[<a href="index.php">
<img alt="Websignx" src="assets/images/websignx_logo.png" style="height: 3.8rem;" title="WebsignX"/>
</a>, <a class="nav-link link" href="index.php#home" style="color:#fff; font-size:1.5em">Home</a>, <a class="nav-link link" hidden="" href="index.php#aboutus" style="color:#fff; font-size:1.5em">About Us</a>, <a class="nav-link link" href="technologies.php" style="color:#fff; font-size:1.5em">Technologies</a>, <a class="nav-link link" href="ourServices.php" style="color:#fff; font-size:1.5em">Services</a>, <a class="nav-link link" href="team.php" style="color:#fff; font-size:1.5em">Team</a>, <a class="nav-link link" href="ourClients.php" style="color:#fff; font-size:1.5em;">Portfolio</a>, <a class="nav-link link" hidden="" href="websignx_blogs.php" style="color:#fff; font-size:1.5em">    Blogs</a>, <a class="nav-link link" hidden="" href="websignx_press.php" style="color:#fff; font-size:1.5em">Press</a>, <a class="nav-link link" hidden="" href="social.php" style="color:#fff; font-size:1.5em;">Social</a>, <b><a class="nav-link link" href="contactUs.php" style="color:#fff; font-size:1.5em">Contact Us</a></b>, <a class="__cf_email__" data-cfemail="731e161e11160133101c1e03121d0a1d121e165d101c1e" href="/cdn-cgi/l/email-protection">[email protected]</a>, <a href="index.php">
<img alt="Websignx" media-simple="true" src="assets/images/websignx_logo.png" title=""/>
</a>, <a class="fa fa-facebook" href="https://www.facebook.com/Websignx/"></a>, <a class="fa fa-linkedin-square" href="https://www.linkedin.com/company/websignx-technologies"></a>, <a class="fa fa-twitter" href="https://twitter.com/WebSignX"></a>, <a class="fa fa-youtube" href="https://www.youtube.com/channel/UCmzSmKx_JTTcyJlOWB0AdTA"></a>, <a class="__cf_email__" data-cfemail="e89b89848d9ba89f8d8a9b818f8690c68b8785" href="/cdn-cgi/l/email-protection">[email protected]</a>, <a class="__cf_email__" data-cfemail="ddbeb2b3a9bcbea99daab8bfaeb4bab3a5f3beb2b0" href="/cdn-cgi/l/email-protection">[email protected]</a>]


Mert Duyum

Sep 14, 2022, 8:58:30 AM
to beautifulsoup
Thank you very much for the quick answer. I'll try it out and will let you know if I succeed.

Best Regards

Mert Duyum