I would not do regex-based preprocessing on HTML unless there was no
alternative. There are too many ways to represent a <br> tag. Here are
a few off the top of my head:
<br><br/><br />< br><br class="foo"><br></br><br
>
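To make that concrete, here is a small sketch (my own illustration, using only the standard re module) that runs the pattern from the regex suggestion quoted below against doubled copies of a few of these variants. The sample strings and the `collapsed` name are assumptions made for the demonstration. Only the exact "<br />" spelling gets collapsed; the other variants pass through unchanged.

```python
import re

# Pattern taken from the regex suggestion quoted later in this thread.
two_or_more_breaks = re.compile(r"(<br />){2,}")

# Each sample is one <br> variant written twice in a row (illustrative data).
samples = [
    '<br><br>',
    '<br/><br/>',
    '<br /><br />',
    '<br class="foo"><br class="foo">',
    '<br\n><br\n>',
]

# Map each input to what the substitution produces.
collapsed = {s: two_or_more_breaks.sub('<br />', s) for s in samples}
for before, after in collapsed.items():
    status = 'collapsed' if before != after else 'unchanged'
    print(repr(before), '->', repr(after), status)
```

You could keep extending the pattern to cover more spellings, but a parser already normalizes all of these for you.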
Fortunately, it's not too hard to do what you want with extract(): for
each <br> tag, remove any <br> tags that immediately follow it. Here's
some code:
---
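from bs4 import BeautifulSoup, Tag

data = 'foo<br>bar. <p>foo<br/><br id="1"><br/>bar'
soup = BeautifulSoup(data)  # no parser named: bs4 picks the best one installed
for br in soup.find_all("br"):
    # Remove every <br> tag that immediately follows this one.
    while (isinstance(br.next_sibling, Tag)
           and br.next_sibling.name == 'br'):
        br.next_sibling.extract()
print(soup)
# Output from my run (the exact wrapper markup depends on the parser):
# <html><body><p>foo<br/>bar. </p><p>foo<br/>bar</p></body></html>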
---
Reference:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#extract
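One caveat: the loop above only catches <br> tags that are direct neighbors. If the markup has whitespace or newlines between them, next_sibling is a text node and the run is left alone. Here is a rough sketch of a variant that also skips whitespace-only text between breaks. The helper name collapse_br_runs is hypothetical (it is not part of Beautiful Soup), and I've assumed the built-in "html.parser" so the output doesn't depend on which parsers are installed.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def collapse_br_runs(soup):
    """Keep the first <br> of each run; drop later ones and the whitespace between them."""
    for br in soup.find_all("br"):
        if br.parent is None:
            continue  # already extracted as part of an earlier run
        while True:
            # Collect whitespace-only text nodes that follow this <br>.
            gap = []
            node = br.next_sibling
            while isinstance(node, NavigableString) and not node.strip():
                gap.append(node)
                node = node.next_sibling
            if not (isinstance(node, Tag) and node.name == "br"):
                break  # the run has ended; leave trailing whitespace alone
            for text in gap:
                text.extract()
            node.extract()
    return soup


soup = BeautifulSoup('<p>foo<br/>\n<br/> <br/>bar</p>', "html.parser")
result = str(collapse_br_runs(soup))
print(result)
```

Whitespace is only removed when another <br> follows it, so ordinary spacing before real text is preserved.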
Leonard
On Fri, May 10, 2013 at 12:22 PM, Matt McKay <mcka...@gmail.com> wrote:
> You could try removing them with a regular expression before creating your
> soup. For example:
>
>> two_or_more_breaks = re.compile(r"(<br />){2,}")
>> docString = two_or_more_breaks.sub('<br />', docString)
>
>
> This compiles a RE matching any substrings of 2 or more breaks in a row (you
> might need to account for newlines if they are on separate lines) and
> substitutes a single <br />.
>
> The task you are trying to perform can be done in BS, but sometimes I find
> it easier to do some pre-processing on the string itself. That being said, I
> am a novice so take my advice with a grain of salt.
>
> On Thursday, May 9, 2013 3:06:20 PM UTC-4, eNJoy wrote:
>>
>> Hello all!
>>
>> How do I remove multiple occurrences of the <br /> tag? E.g., if there is
>> just a single one, keep it; if there are several next to each other,
>> remove all but one.
>>
>> Thanks,
>>