Remove carriage returns added by prettify()?


Heck Lennon

Sep 9, 2022, 3:28:31 AM
to beautifulsoup
Hello,

A Google search suggests it's normal for prettify() to put each string on its own line, but the added line breaks and indentation cause problems when the GPX file is loaded into a maps application.

===========
  <wpt lat="48.469284100" lon="2.706686600">
    <name>
        Blah
    </name>
  </wpt>
===========

Is there a simpler way to get rid of them than calling Tidy?

Thank you.

import subprocess

with open(OUTPUTFILE, "w") as file:
    # Not OK: prettify() puts each string on its own line
    # file.write(soup.prettify(formatter=None))
    # OK: str() adds no whitespace
    file.write(str(soup))

# Hack: run Tidy on the output file to remove the unwanted line breaks in <name>
print(subprocess.run(["c:\\tidy.exe", "-language", "en_us", "-utf8", "-indent", "-xml", "-modify", str(OUTPUT_FULLPATH)]))

leonardr

Sep 9, 2022, 8:24:37 AM
to beautifulsoup
Rather than adding whitespace with prettify() and then stripping it out, I'd use one of these methods, str() or encode(), which serialize the tree (to a string or to bytes, respectively) without adding any whitespace.
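
As a rough illustration of the difference, here is a minimal sketch that serializes the same waypoint three ways. It assumes lxml is installed so that the "xml" feature is available; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

markup = '<wpt lat="48.469284100" lon="2.706686600"><name>Blah</name></wpt>'
soup = BeautifulSoup(markup, "xml")  # assumption: lxml is installed

# soup.name is the tag's own name attribute, so look the <name> element up explicitly
name_tag = soup.find("name")

compact = str(name_tag)       # serialized as-is, no whitespace added
pretty = name_tag.prettify()  # the string "Blah" ends up on its own line
raw = soup.encode("utf-8")    # whole document as bytes, no whitespace added

print(repr(compact))
print(repr(pretty))
```

Comparing the two repr() outputs shows where prettify() inserts the line breaks that end up inside the waypoint name.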

If you want the pretty-printing whitespace in most places, just not in the <name> tag, you can customize which tags have whitespace preserved by instantiating a custom TreeBuilder object:

from bs4 import BeautifulSoup
from bs4.builder import LXMLTreeBuilderForXML
builder = LXMLTreeBuilderForXML(preserve_whitespace_tags=["name"])
markup = '<wpt lat="48.469284100" lon="2.706686600"><name>Blah</name></wpt>'
soup = BeautifulSoup(markup, builder=builder)

print(soup.prettify())
# <?xml version="1.0" encoding="utf-8"?>
# <wpt lat="48.469284100" lon="2.706686600">
#  <name>Blah</name>
# </wpt>

Heck Lennon

Sep 9, 2022, 9:37:49 AM
to beautifulsoup
Thanks. It's exactly what I was looking for.

builder = LXMLTreeBuilderForXML(preserve_whitespace_tags=["name"])
# BAD (SyntaxError: positional argument follows keyword argument):
# soup = BeautifulSoup(open(item, 'r'), builder=builder, 'xml')
soup = BeautifulSoup(open(item, 'r'), builder=builder, features='xml')
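
For anyone finding this later, the whole read/prettify/write round trip could be sketched roughly as below. The helper name prettify_gpx and its keep_inline parameter are made up for this example, the temporary file stands in for a real GPX file, and lxml is assumed to be installed. The input is opened in binary mode so the parser can work out the encoding itself, and the files are closed by the with blocks.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup
from bs4.builder import LXMLTreeBuilderForXML


def prettify_gpx(src, dst, keep_inline=("name",)):
    """Pretty-print an XML file, leaving the tags in keep_inline on one line.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    builder = LXMLTreeBuilderForXML(preserve_whitespace_tags=list(keep_inline))
    with open(src, "rb") as fh:
        soup = BeautifulSoup(fh, builder=builder)
    with open(dst, "w", encoding="utf-8") as fh:
        fh.write(soup.prettify())


# Small demonstration on a throwaway file
workdir = Path(tempfile.mkdtemp())
src = workdir / "in.gpx"
dst = workdir / "out.gpx"
src.write_text(
    '<wpt lat="48.469284100" lon="2.706686600"><name>Blah</name></wpt>',
    encoding="utf-8",
)
prettify_gpx(src, dst)
result = dst.read_text(encoding="utf-8")
print(result)
```

Other text-only tags (for example a description element) could be passed in keep_inline the same way if they show the same problem.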
