Bs4 xml parse losing content. Bug or parse configuration issue?

Isaac Egglestone

Feb 12, 2021, 9:02:13 PM
to beautifulsoup
Hi, I'm posting here before filing an issue report because I'm not sure whether this is an actual bug or a problem with what I'm doing. Also, since the problem appears to lie somewhere between Beautiful Soup and the lxml parser, I'd like a forum thread to link to before reporting it to possibly both groups of maintainers.

Context: I'm processing some XML contained within an ebook for further distribution.

The actual problem is simple: after running BS4's diagnose on a file, the output is missing content, specifically the "opf" text before the colon. The BS4 docs indicate that this means there is a problem with my parser, that being lxml. However, when I run a similar basic test with lxml directly, I don't see the loss occurring.

The original Gutenberg document contains:
<?xml version="1.0" encoding="utf-8"?>
<package unique-identifier="id" version="2.0" xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<metadata>
<dc:rights>Public domain in the USA.</dc:rights>
<dc:identifier opf:scheme="URI" id="id">http://www.gutenberg.org/2600</dc:identifier>
<dc:creator opf:file-as="Tolstoy, Leo, graf">graf Leo Tolstoy</dc:creator>

The lxml parser via BS4 diagnose produces the following output with opf missing:
<?xml version="1.0" encoding="utf-8"?>
<package unique-identifier="id" version="2.0" xmlns="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <metadata>
  <dc:rights>
   Public domain in the USA.
  </dc:rights>
  <dc:identifier :scheme="URI" id="id">
  </dc:identifier>
  <dc:creator :file-as="Shakespeare, William">
   William Shakespeare
  </dc:creator>
  <dc:title>
   The History of King Henry the Sixth, Third Part
  </dc:title>

Running the same basic test with lxml directly, however, shows clean output with opf intact:
<package xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:opf="http://www.idpf.org/2007/opf" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" unique-identifier="id" version="2.0">
  <metadata>
    <dc:rights>Public domain in the USA.</dc:rights>
    <dc:identifier id="id" opf:scheme="URI">http://www.gutenberg.org/ebooks/1502</dc:identifier>
    <dc:creator opf:file-as="Shakespeare, William">William Shakespeare</dc:creator>
    <dc:title>The History of King Henry the Sixth, Third Part</dc:title>
    <dc:language xsi:type="dcterms:RFC4646">en</dc:language>


To reproduce, put all these files in the same directory and run the following two commands:
docker build .
docker-compose run util python3 /test/testparse.py


requirements.txt:
beautifulsoup4
lxml

Dockerfile:
FROM python:3.7

COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt --upgrade

docker-compose.yml:
version: "3"
services:
  util:
    build: .
    volumes:
      - ./:/test

testparse.py:
#!/usr/bin/env python3

from bs4 import BeautifulSoup
from bs4.diagnose import diagnose
from lxml import etree

content_full_path = "test/content.opf"


def get_soup(file_to_parse):
    "Gets a soup object for a given file"
    with open(file_to_parse, "r") as f:
        file = f.read()
    diagnose(file)
    soup = BeautifulSoup(file, "xml")
    return soup


def check_untouched_parse(file_to_parse):
    "Test clean xml save"
    soup = get_soup(file_to_parse)
    print(soup)


# Beautiful soup diagnose test:
print("Beautiful soup diagnose test:")
print("==============================================================================")

nothing_var = check_untouched_parse(content_full_path)

# lxml test
print("Etree version of parse")
print("==============================================================================")

parser = etree.XMLParser(remove_blank_text=True)
root = etree.parse(content_full_path, parser)
# Print the loaded XML
print(etree.tostring(root, encoding="unicode", pretty_print=True))

content.opf:
<?xml version='1.0' encoding='UTF-8'?>

<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
<metadata>
<dc:rights>Public domain in the USA.</dc:rights>
<dc:identifier opf:scheme="URI" id="id">http://www.gutenberg.org/2600</dc:identifier>
<dc:creator opf:file-as="Tolstoy, Leo, graf">graf Leo Tolstoy</dc:creator>
<dc:contributor opf:role="trl" opf:file-as="Maude, Aylmer">Aylmer Maude</dc:contributor>
<dc:contributor opf:role="trl" opf:file-as="Maude, Louise">Louise Maude</dc:contributor>
<dc:title>War and Peace</dc:title>
<dc:language xsi:type="dcterms:RFC4646">en</dc:language>
<dc:subject>Historical fiction</dc:subject>
<dc:subject>War stories</dc:subject>
<dc:subject>Napoleonic Wars, 1800-1815 -- Campaigns -- Russia -- Fiction</dc:subject>
<dc:subject>Russia -- History -- Alexander I, 1801-1825 -- Fiction</dc:subject>
<dc:subject>Aristocracy (Social class) -- Russia -- Fiction</dc:subject>
<dc:date opf:event="publication">2001-04-01</dc:date>
<dc:date opf:event="conversion">2020-04-02T07:55:23.696736+00:00</dc:date>
<meta name="cover" content="item1"/>
</metadata>
<manifest>
<!--Image: 484 x 700 size=108909 q=90-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@ima...@cover.jpg" id="item1" media-type="image/jpeg"/>
<item href="pgepub.css" id="item2" media-type="text/css"/>
<item href="0.css" id="item3" media-type="text/css"/>
<item href="1.css" id="item4" media-type="text/css"/>
<!--Chunk: size=57550 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-0.htm.html" id="item5" media-type="application/xhtml+xml"/>
<!--Chunk: size=58304 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-1.htm.html" id="item6" media-type="application/xhtml+xml"/>
<!--Chunk: size=52760 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-2.htm.html" id="item7" media-type="application/xhtml+xml"/>
<!--Chunk: size=57074 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-3.htm.html" id="item8" media-type="application/xhtml+xml"/>
<!--Chunk: size=52377 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-4.htm.html" id="item9" media-type="application/xhtml+xml"/>
<!--Chunk: size=70163 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-5.htm.html" id="item10" media-type="application/xhtml+xml"/>
<!--Chunk: size=60805 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-6.htm.html" id="item11" media-type="application/xhtml+xml"/>
<!--Chunk: size=61815 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-7.htm.html" id="item12" media-type="application/xhtml+xml"/>
<!--Chunk: size=51653 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-8.htm.html" id="item13" media-type="application/xhtml+xml"/>
<!--Chunk: size=64801 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-9.htm.html" id="item14" media-type="application/xhtml+xml"/>
<!--Chunk: size=51852 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-10.htm.html" id="item15" media-type="application/xhtml+xml"/>
<!--Chunk: size=70173 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-11.htm.html" id="item16" media-type="application/xhtml+xml"/>
<!--Chunk: size=63821 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-12.htm.html" id="item17" media-type="application/xhtml+xml"/>
<!--Chunk: size=53220 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-13.htm.html" id="item18" media-type="application/xhtml+xml"/>
<!--Chunk: size=51538 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-14.htm.html" id="item19" media-type="application/xhtml+xml"/>
<!--Chunk: size=56744 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-15.htm.html" id="item20" media-type="application/xhtml+xml"/>
<!--Chunk: size=53690 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-16.htm.html" id="item21" media-type="application/xhtml+xml"/>
<!--Chunk: size=58241 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-17.htm.html" id="item22" media-type="application/xhtml+xml"/>
<!--Chunk: size=57692 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-18.htm.html" id="item23" media-type="application/xhtml+xml"/>
<!--Chunk: size=58285 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-19.htm.html" id="item24" media-type="application/xhtml+xml"/>
<!--Chunk: size=57202 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-20.htm.html" id="item25" media-type="application/xhtml+xml"/>
<!--Chunk: size=63300 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-21.htm.html" id="item26" media-type="application/xhtml+xml"/>
<!--Chunk: size=60258 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-22.htm.html" id="item27" media-type="application/xhtml+xml"/>
<!--Chunk: size=55844 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-23.htm.html" id="item28" media-type="application/xhtml+xml"/>
<!--Chunk: size=54441 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-24.htm.html" id="item29" media-type="application/xhtml+xml"/>
<!--Chunk: size=54570 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-25.htm.html" id="item30" media-type="application/xhtml+xml"/>
<!--Chunk: size=52393 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-26.htm.html" id="item31" media-type="application/xhtml+xml"/>
<!--Chunk: size=51987 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-27.htm.html" id="item32" media-type="application/xhtml+xml"/>
<!--Chunk: size=67760 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-28.htm.html" id="item33" media-type="application/xhtml+xml"/>
<!--Chunk: size=63989 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-29.htm.html" id="item34" media-type="application/xhtml+xml"/>
<!--Chunk: size=63702 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-30.htm.html" id="item35" media-type="application/xhtml+xml"/>
<!--Chunk: size=59862 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-31.htm.html" id="item36" media-type="application/xhtml+xml"/>
<!--Chunk: size=54267 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-32.htm.html" id="item37" media-type="application/xhtml+xml"/>
<!--Chunk: size=59292 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-33.htm.html" id="item38" media-type="application/xhtml+xml"/>
<!--Chunk: size=56661 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-34.htm.html" id="item39" media-type="application/xhtml+xml"/>
<!--Chunk: size=60083 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-35.htm.html" id="item40" media-type="application/xhtml+xml"/>
<!--Chunk: size=56200 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-36.htm.html" id="item41" media-type="application/xhtml+xml"/>
<!--Chunk: size=56136 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-37.htm.html" id="item42" media-type="application/xhtml+xml"/>
<!--Chunk: size=59126 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-38.htm.html" id="item43" media-type="application/xhtml+xml"/>
<!--Chunk: size=53080 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-39.htm.html" id="item44" media-type="application/xhtml+xml"/>
<!--Chunk: size=54926 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-40.htm.html" id="item45" media-type="application/xhtml+xml"/>
<!--Chunk: size=67086 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-41.htm.html" id="item46" media-type="application/xhtml+xml"/>
<!--Chunk: size=57293 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-42.htm.html" id="item47" media-type="application/xhtml+xml"/>
<!--Chunk: size=54513 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-43.htm.html" id="item48" media-type="application/xhtml+xml"/>
<!--Chunk: size=64103 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-44.htm.html" id="item49" media-type="application/xhtml+xml"/>
<!--Chunk: size=57653 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-45.htm.html" id="item50" media-type="application/xhtml+xml"/>
<!--Chunk: size=55452 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-46.htm.html" id="item51" media-type="application/xhtml+xml"/>
<!--Chunk: size=59269 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-47.htm.html" id="item52" media-type="application/xhtml+xml"/>
<!--Chunk: size=53393 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-48.htm.html" id="item53" media-type="application/xhtml+xml"/>
<!--Chunk: size=51746 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-49.htm.html" id="item54" media-type="application/xhtml+xml"/>
<!--Chunk: size=58540 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-50.htm.html" id="item55" media-type="application/xhtml+xml"/>
<!--Chunk: size=66538 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-51.htm.html" id="item56" media-type="application/xhtml+xml"/>
<!--Chunk: size=53240 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-52.htm.html" id="item57" media-type="application/xhtml+xml"/>
<!--Chunk: size=54375 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-53.htm.html" id="item58" media-type="application/xhtml+xml"/>
<!--Chunk: size=54268 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-54.htm.html" id="item59" media-type="application/xhtml+xml"/>
<!--Chunk: size=54373 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-55.htm.html" id="item60" media-type="application/xhtml+xml"/>
<!--Chunk: size=58744 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-56.htm.html" id="item61" media-type="application/xhtml+xml"/>
<!--Chunk: size=53208 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-57.htm.html" id="item62" media-type="application/xhtml+xml"/>
<!--Chunk: size=53405 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-58.htm.html" id="item63" media-type="application/xhtml+xml"/>
<!--Chunk: size=53793 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-59.htm.html" id="item64" media-type="application/xhtml+xml"/>
<!--Chunk: size=54766 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-60.htm.html" id="item65" media-type="application/xhtml+xml"/>
<!--Chunk: size=55485 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-61.htm.html" id="item66" media-type="application/xhtml+xml"/>
<!--Chunk: size=51658 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-62.htm.html" id="item67" media-type="application/xhtml+xml"/>
<!--Chunk: size=52806 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-63.htm.html" id="item68" media-type="application/xhtml+xml"/>
<!--Chunk: size=58713 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-64.htm.html" id="item69" media-type="application/xhtml+xml"/>
<!--Chunk: size=57610 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-65.htm.html" id="item70" media-type="application/xhtml+xml"/>
<!--Chunk: size=60847 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-66.htm.html" id="item71" media-type="application/xhtml+xml"/>
<!--Chunk: size=57840 Split on div-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-67.htm.html" id="item72" media-type="application/xhtml+xml"/>
<!--Chunk: size=68925-->
<item href="@public@vhost@g@gutenberg@html@files@2600@2600-h@2600-h-68.htm.html" id="item73" media-type="application/xhtml+xml"/>
<item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
<item href="wrap0000.html" id="coverpage-wrapper" media-type="application/xhtml+xml"/>
</manifest>
<spine toc="ncx">
<itemref idref="coverpage-wrapper" linear="yes"/>
<itemref idref="item5" linear="yes"/>
<itemref idref="item6" linear="yes"/>
<itemref idref="item7" linear="yes"/>
<itemref idref="item8" linear="yes"/>
<itemref idref="item9" linear="yes"/>
<itemref idref="item10" linear="yes"/>
<itemref idref="item11" linear="yes"/>
<itemref idref="item12" linear="yes"/>
<itemref idref="item13" linear="yes"/>
<itemref idref="item14" linear="yes"/>
<itemref idref="item15" linear="yes"/>
<itemref idref="item16" linear="yes"/>
<itemref idref="item17" linear="yes"/>
<itemref idref="item18" linear="yes"/>
<itemref idref="item19" linear="yes"/>
<itemref idref="item20" linear="yes"/>
<itemref idref="item21" linear="yes"/>
<itemref idref="item22" linear="yes"/>
<itemref idref="item23" linear="yes"/>
<itemref idref="item24" linear="yes"/>
<itemref idref="item25" linear="yes"/>
<itemref idref="item26" linear="yes"/>
<itemref idref="item27" linear="yes"/>
<itemref idref="item28" linear="yes"/>
<itemref idref="item29" linear="yes"/>
<itemref idref="item30" linear="yes"/>
<itemref idref="item31" linear="yes"/>
<itemref idref="item32" linear="yes"/>
<itemref idref="item33" linear="yes"/>
<itemref idref="item34" linear="yes"/>
<itemref idref="item35" linear="yes"/>
<itemref idref="item36" linear="yes"/>
<itemref idref="item37" linear="yes"/>
<itemref idref="item38" linear="yes"/>
<itemref idref="item39" linear="yes"/>
<itemref idref="item40" linear="yes"/>
<itemref idref="item41" linear="yes"/>
<itemref idref="item42" linear="yes"/>
<itemref idref="item43" linear="yes"/>
<itemref idref="item44" linear="yes"/>
<itemref idref="item45" linear="yes"/>
<itemref idref="item46" linear="yes"/>
<itemref idref="item47" linear="yes"/>
<itemref idref="item48" linear="yes"/>
<itemref idref="item49" linear="yes"/>
<itemref idref="item50" linear="yes"/>
<itemref idref="item51" linear="yes"/>
<itemref idref="item52" linear="yes"/>
<itemref idref="item53" linear="yes"/>
<itemref idref="item54" linear="yes"/>
<itemref idref="item55" linear="yes"/>
<itemref idref="item56" linear="yes"/>
<itemref idref="item57" linear="yes"/>
<itemref idref="item58" linear="yes"/>
<itemref idref="item59" linear="yes"/>
<itemref idref="item60" linear="yes"/>
<itemref idref="item61" linear="yes"/>
<itemref idref="item62" linear="yes"/>
<itemref idref="item63" linear="yes"/>
<itemref idref="item64" linear="yes"/>
<itemref idref="item65" linear="yes"/>
<itemref idref="item66" linear="yes"/>
<itemref idref="item67" linear="yes"/>
<itemref idref="item68" linear="yes"/>
<itemref idref="item69" linear="yes"/>
<itemref idref="item70" linear="yes"/>
<itemref idref="item71" linear="yes"/>
<itemref idref="item72" linear="yes"/>
<itemref idref="item73" linear="yes"/>
</spine>
<guide>
<reference type="cover" title="Cover" href="wrap0000.html"/>
</guide>
</package>

leonardr

Feb 12, 2021, 10:12:42 PM
to beautifulsoup
Hi, Isaac,

Thanks for asking about this, and taking the time to provide detailed instructions for reproducing the issue.

The input and output XML documents you give are not the same string, but I'm pretty sure they're equivalent XML. Making the output more closely resemble the input is a reasonable thing to want, and since lxml does it, it should be possible to do it in Beautiful Soup. Based on your instructions, I've filed issue #1915583 to track the work necessary to make this change.

The underlying issue is that the original XML defines two aliases for the "http://www.idpf.org/2007/opf" namespace URI: first "opf", and then the default namespace:

<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">

Beautiful Soup keeps a mapping of namespace URIs to aliases, which it uses when converting the parsed document back to a string. When a single URI is mapped to multiple aliases, it looks like the last alias encountered--in this case the default namespace--is the one that gets used. All tags from that namespace go out with the last alias defined for that namespace, even if they came in with a different alias.
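
As a rough illustration of the mapping described above, here is a toy model in plain Python. It is not Beautiful Soup's actual code: the names `build_prefix_table` and `qualified_name` are made up for this sketch, and the naive "always join with a colon" step is an assumption chosen only because it reproduces the `:scheme` symptom seen in the output.

```python
# Toy model only: NOT Beautiful Soup's implementation.
OPF_URI = "http://www.idpf.org/2007/opf"
DC_URI = "http://purl.org/dc/elements/1.1/"

# Namespace declarations in the order they appear on <package>.
declarations = [
    ("opf", OPF_URI),
    ("dc", DC_URI),
    ("", OPF_URI),  # the default namespace, declared last
]


def build_prefix_table(decls):
    """Map each namespace URI to one alias; a later declaration overwrites an earlier one."""
    table = {}
    for prefix, uri in decls:
        table[uri] = prefix
    return table


def qualified_name(table, uri, local_name):
    """Hypothetical naive serializer: always emits 'prefix:name', even for an empty prefix."""
    return f"{table[uri]}:{local_name}"


table = build_prefix_table(declarations)
print(repr(table[OPF_URI]))                       # the alias that survived for the OPF URI
print(qualified_name(table, OPF_URI, "scheme"))   # name emitted for opf:scheme
print(qualified_name(table, DC_URI, "identifier"))
```

In this model the "opf" alias is overwritten by the empty default prefix, so an attribute that came in as opf:scheme goes out with nothing in front of the colon.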

This behavior is understandably unexpected, and I think it'll be possible to get the more intuitive behavior that raw lxml exhibits, but I don't think the meaning of the XML document is changing.

Leonard

Isaac Egglestone

Feb 13, 2021, 9:11:11 AM
to beauti...@googlegroups.com
First, thanks for responding so quickly and carefully to my post.

RE: This behavior is understandably unexpected, and I think it'll be possible to get the more intuitive behavior that raw lxml exhibits, but I don't think the meaning of the XML document is changing.

As my knowledge of the XML standard isn't at any kind of master level, I can neither agree nor disagree with you there, although it does make some sense.

However, the latest EPUBCheck (by the DAISY Consortium on behalf of the W3C) seems to consider the XML in this format erroneous and unsuitable for ebook readers.

Example errors:

'ERROR(RSC-005): /tests/pg1502.epub/OEBPS/content.opf(-1,-1): Error while parsing file: Attribute name "xmlns:xsi" associated with an element type "dc:identifier" must be followed by the ' = ' character.
ERROR(OPF-001): /tests/gutenburgbookparser/pg1502.epub/OEBPS/content.opf(-1,-1): There was an error when parsing the EPUB version: Version not found.
Check finished with errors
Messages: 0 fatals / 2 errors / 0 warnings / 0 infos
EPUBCheck completed' should be empty.

The above error seems related to the dc:identifier being followed by a colon ":"
Original text: <dc:identifier id="id" opf:scheme="URI">http://www.gutenberg.org/ebooks/1502</dc:identifier>
Problem text:

  <dc:identifier :scheme="URI" id="id"> http://www.gutenberg.org/ebooks/1502 </dc:identifier>
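
A quick way to compare the two lines without EPUBCheck is to feed each one to a namespace-aware XML parser. The sketch below uses Python's standard-library xml.etree.ElementTree rather than lxml or EPUBCheck, so it only approximates what those tools do. The helper name `is_well_formed` and the wrapping `<root>` element (added just to declare the dc and opf prefixes) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Prefix declarations the fragments rely on, placed on a throwaway wrapper element.
NS_DECLS = (
    'xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:opf="http://www.idpf.org/2007/opf"'
)


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses inside a wrapper that declares dc and opf."""
    document = f"<root {NS_DECLS}>{fragment}</root>"
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


original = '<dc:identifier id="id" opf:scheme="URI">http://www.gutenberg.org/ebooks/1502</dc:identifier>'
problem = '<dc:identifier :scheme="URI" id="id"> http://www.gutenberg.org/ebooks/1502 </dc:identifier>'

print("original:", is_well_formed(original))
print("problem: ", is_well_formed(problem))
```

With this parser I would expect the original line to be accepted and the problem line to be rejected, because an attribute name that starts with a colon has an empty namespace prefix.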

So thank you for opening the issue.

Do you think removing the default namespace via lxml directly and then re-adding it after parsing with BS4 would be a feasible workaround?
I'll also run some EPUBCheck tests without a default namespace to see whether that is workable too.




leonardr

Feb 13, 2021, 10:04:38 AM
to beautifulsoup
Isaac,

You're totally right; that is invalid XML. I got caught up in what was happening inside Beautiful Soup and I stopped looking at the output.

The good news is this issue is much easier to fix than the duplicate-prefix issue; I fixed it in revision 595.

Removing the default namespace should work. Since all your files come from a single source, you should also be able to use a regular expression to remove the "opf:" prefix from the markup wherever it appears, and fall back to the default prefix. I'm not sure how this problem came about, but this is the first time it's been reported, so I think it's related to the fact that a single URI is given two prefixes and the second one is the default (empty string) prefix.
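
A minimal sketch of the regular-expression approach, assuming the "opf:" prefix only ever appears on attribute names (as it does in the content.opf above). The function name `strip_opf_prefix` is hypothetical, and the pattern is deliberately narrow: it leaves the xmlns:opf declaration alone and would not handle elements written as <opf:...>.

```python
import re

# Matches whitespace followed by an attribute name of the form opf:name=
# and captures the whitespace and the unprefixed "name=" part.
OPF_ATTR = re.compile(r'(\s)opf:([\w.-]+\s*=)')


def strip_opf_prefix(markup: str) -> str:
    """Rewrite opf:name="..." attributes as name="..." (hypothetical helper)."""
    return OPF_ATTR.sub(r"\1\2", markup)


line = '<dc:creator opf:file-as="Tolstoy, Leo, graf">graf Leo Tolstoy</dc:creator>'
print(strip_opf_prefix(line))
```

Two caveats: the pattern would also rewrite text content that happens to look like ` opf:name=`, and, strictly speaking, unprefixed attributes are not in the default namespace, so whether the result is acceptable depends on what the consumer expects. I have not verified it against EPUBCheck.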

Leonard

Isaac Egglestone

Feb 13, 2021, 10:05:50 PM
to beauti...@googlegroups.com
A huge thanks!

Yeah I’ll try both approaches and see how it goes.

I’ll try out your revision as well and report back my findings for anyone else that ends up in a similar situation.



