Bug#1009680: ghostscript breaks ocrmypdf autopkgtest: seemingly multiple issues

ja...@purplerock.ca

Apr 14, 2022, 2:30:04 PM

Ghostscript 9.56.0 introduced a serious bug from ocrmypdf’s perspective. Upgrading to ocrmypdf 13.4.2 should work around it, as would moving to a newer Ghostscript if one has been released.
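
For anyone who wants to check a system before running the test suite, here is a minimal sketch of a version check. It assumes that only 9.56.0 is affected (that is all this message states) and that the `gs` binary is on the PATH; the function names are made up for illustration and are not part of ocrmypdf.

```python
import re
import subprocess

# Assumption: only 9.56.0 is listed, because that is the one release
# named in this thread. Extend the set if other releases turn out to be affected.
AFFECTED_GHOSTSCRIPT = {(9, 56, 0)}


def parse_version(text):
    """Return the first dotted version number in `text` as a tuple of ints.

    A missing patch component is treated as 0, so '9.15' becomes (9, 15, 0).
    """
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_affected(version_text):
    """True if the version string names a Ghostscript release in the affected set."""
    return parse_version(version_text) in AFFECTED_GHOSTSCRIPT


def installed_ghostscript_version():
    """Ask the local `gs` binary for its version string (hypothetical helper)."""
    result = subprocess.run(
        ["gs", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Calling `is_affected(installed_ghostscript_version())` on the machine in question tells you whether the installed Ghostscript is the release named above. The parser also accepts Debian-style strings such as `9.56.0~dfsg-1`, since it only reads the leading dotted number.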

> On Apr 14, 2022, at 02:15, Paul Gevers <elb...@debian.org> wrote:
>
> Source: ghostscript, ocrmypdf
> Control: found -1 ghostscript/9.56.0~dfsg-1
> Control: found -1 ocrmypdf/13.4.0+dfsg-1
> Severity: serious
> Tags: sid bookworm
> User: debi...@lists.debian.org
> Usertags: breaks needs-update
>
> Dear maintainer(s),
>
> With a recent upload of ghostscript the autopkgtest of ocrmypdf fails in testing when that autopkgtest is run with the binary packages of ghostscript from unstable. It passes when run with only packages from testing. In tabular form:
>
>                        pass            fail
> ghostscript            from testing    9.56.0~dfsg-1
> ocrmypdf               from testing    13.4.0+dfsg-1
> all others             from testing    from testing
>
> I copied some of the output at the bottom of this report.
>
> Currently this regression is blocking the migration of ghostscript to testing [1]. Due to the nature of this issue, I filed this bug report against both packages. Can you please investigate the situation and reassign the bug to the right package?
>
> More information about this bug and the reason for filing it can be found on
> https://wiki.debian.org/ContinuousIntegration/RegressionEmailInformation
>
> Paul
>
> [1] https://qa.debian.org/excuses.php?package=ghostscript
>
> https://ci.debian.net/data/autopkgtest/testing/amd64/o/ocrmypdf/20818050/log.gz
>
> =================================== FAILURES ===================================
> ________________________________ test_force_ocr ________________________________
>
> resources = PosixPath('/tmp/autopkgtest-lxc.zdbcipww/downtmp/build.V8r/src/tests/resources')
> outpdf = PosixPath('/tmp/pytest-of-debci/pytest-0/test_force_ocr0/out.pdf')
>
>     def test_force_ocr(resources, outpdf):
>         out = check_ocrmypdf(
>             resources / 'graph_ocred.pdf',
>             outpdf,
>             '-f',
>             '--plugin',
>             'tests/plugins/tesseract_cache.py',
>         )
>         pdfinfo = PdfInfo(out)
> >       assert pdfinfo[0].has_text
> E       assert False
> E        +  where False = <PageInfo pageno=0 7.573333333333333333333333333"x6.16" rotation=0 dpi=400.000000x400.000000 has_text=False>.has_text
>
> tests/test_main.py:83: AssertionError
> ----------------------------- Captured stderr call -----------------------------
>
> Scanning contents: 0%| | 0/1 [00:00<?, ?page/s]
> Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 62.30page/s]
>
> OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s]
> OCR: 50%|█████ | 0.5/1.0 [00:02<00:02, 5.47s/page]
> OCR: 100%|██████████| 1.0/1.0 [00:02<00:00, 2.75s/page]
>
> PDF/A conversion: 0%| | 0/1 [00:00<?, ?page/s]
>
> Recompressing JPEGs: 0image [00:00, ?image/s]
> Recompressing JPEGs: 0image [00:00, ?image/s]
>
>
> Deflating JPEGs: 0%| | 0/1 [00:00<?, ?image/s]
> Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 74.34image/s]
>
>
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> ------------------------------ Captured log call -------------------------------
> INFO ocrmypdf._pipeline:_pipeline.py:275 page already has text! - rasterizing text and running OCR anyway
> INFO ocrmypdf._sync:_sync.py:301 Postprocessing...
> WARNING ocrmypdf._pipeline:_pipeline.py:776 Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
> INFO ocrmypdf.optimize:optimize.py:665 Optimize ratio: 1.52 savings: 34.1%
> INFO ocrmypdf._sync:_sync.py:399 Output file is a PDF/A-2B (as expected)
> WARNING ocrmypdf._validation:_validation.py:381 The output file size is 2.45× larger than the input file.
> Possible reasons for this include:
> The argument --force-ocr was issued, causing transcoding.
> The optional dependency 'jbig2' was not found, so some image optimizations could not be attempted.
> PDF/A conversion was enabled. (Try `--output-type pdf`.)
> Plugins were used.
> --------------------------- Captured stderr teardown ---------------------------
>
> PDF/A conversion: 100%|██████████| 1/1 [00:01<00:00, 1.20s/page]
> ________________________________ test_skip_ocr _________________________________
>
> resources = PosixPath('/tmp/autopkgtest-lxc.zdbcipww/downtmp/build.V8r/src/tests/resources')
> outpdf = PosixPath('/tmp/pytest-of-debci/pytest-0/test_skip_ocr0/out.pdf')
>
>     def test_skip_ocr(resources, outpdf):
>         out = check_ocrmypdf(
>             resources / 'graph_ocred.pdf',
>             outpdf,
>             '-s',
>             '--plugin',
>             'tests/plugins/tesseract_cache.py',
>         )
>         pdfinfo = PdfInfo(out)
> >       assert pdfinfo[0].has_text
> E       assert False
> E        +  where False = <PageInfo pageno=0 7.573333333333333333333333333"x6.16" rotation=0 dpi=150.000000x150.000000 has_text=False>.has_text
>
> tests/test_main.py:95: AssertionError
> ----------------------------- Captured stderr call -----------------------------
>
> Scanning contents: 0%| | 0/1 [00:00<?, ?page/s]
> Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 70.71page/s]
>
> OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s]
> OCR: 100%|██████████| 1.0/1.0 [00:00<00:00, 47.12page/s]
>
> PDF/A conversion: 0%| | 0/1 [00:00<?, ?page/s]
>
> Recompressing JPEGs: 0image [00:00, ?image/s]
> Recompressing JPEGs: 0image [00:00, ?image/s]
>
>
> Deflating JPEGs: 0%| | 0/1 [00:00<?, ?image/s]
> Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 235.24image/s]
>
>
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> ------------------------------ Captured log call -------------------------------
> INFO ocrmypdf._pipeline:_pipeline.py:287 skipping all processing on this page
> INFO ocrmypdf._sync:_sync.py:301 Postprocessing...
> WARNING ocrmypdf._pipeline:_pipeline.py:776 Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
> INFO ocrmypdf.optimize:optimize.py:665 Optimize ratio: 1.14 savings: 12.6%
> INFO ocrmypdf._sync:_sync.py:399 Output file is a PDF/A-2B (as expected)
> --------------------------- Captured stderr teardown ---------------------------
>
> PDF/A conversion: 100%|██████████| 1/1 [00:00<00:00, 4.16page/s]
> ________________________________ test_redo_ocr _________________________________
>
> resources = PosixPath('/tmp/autopkgtest-lxc.zdbcipww/downtmp/build.V8r/src/tests/resources')
> outpdf = PosixPath('/tmp/pytest-of-debci/pytest-0/test_redo_ocr0/out.pdf')
>
>     def test_redo_ocr(resources, outpdf):
>         in_ = resources / 'graph_ocred.pdf'
>         before = PdfInfo(in_, detailed_analysis=True)
>         out = outpdf
>         out = check_ocrmypdf(in_, out, '--redo-ocr')
>         after = PdfInfo(out, detailed_analysis=True)
> >       assert before[0].has_text and after[0].has_text
> E       assert (True and False)
> E        +  where True = <PageInfo pageno=0 7.573333333333333333333333333"x6.16" rotation=0 dpi=150.000000x150.000000 has_text=True>.has_text
> E        +  and   False = <PageInfo pageno=0 7.573333333333333333333333333"x6.16" rotation=0 dpi=150.000000x150.000000 has_text=False>.has_text
>
> tests/test_main.py:104: AssertionError
> ----------------------------- Captured stderr call -----------------------------
>
> Scanning contents: 0%| | 0/1 [00:00<?, ?page/s]
> Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 20.63page/s]
>
> OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s]
> OCR: 50%|█████ | 0.5/1.0 [00:04<00:04, 8.64s/page]
> OCR: 100%|██████████| 1.0/1.0 [00:04<00:00, 4.35s/page]
>
> PDF/A conversion: 0%| | 0/1 [00:00<?, ?page/s]
>
> Recompressing JPEGs: 0image [00:00, ?image/s]
> Recompressing JPEGs: 0image [00:00, ?image/s]
>
>
> Deflating JPEGs: 0%| | 0/1 [00:00<?, ?image/s]
> Deflating JPEGs: 100%|██████████| 1/1 [00:00<00:00, 254.88image/s]
>
>
> JBIG2: 0item [00:00, ?item/s]
> JBIG2: 0item [00:00, ?item/s]
> ------------------------------ Captured log call -------------------------------
> INFO ocrmypdf._pipeline:_pipeline.py:284 redoing OCR
> INFO ocrmypdf._sync:_sync.py:301 Postprocessing...
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 GPL Ghostscript 9.56.0 (2022-03-29)
> Copyright (C) 2022 Artifex Software, Inc. All rights reserved.
> This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
> see the file COPYING for details.
> Processing pages 1 through 1.
> Page 1
>
> The following warnings were encountered at least once while processing this file:
> number uses illegal exponent form
>
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 This file had errors that were repaired or ignored.
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 The file was produced by:
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 >>>> GPL Ghostscript 9.15 <<<<
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 Please notify the author of the software that produced this
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 file that it does not conform to Adobe's published PDF
> ERROR ocrmypdf._exec.ghostscript:ghostscript.py:277 specification.
>
>
> WARNING ocrmypdf._pipeline:_pipeline.py:776 Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
> INFO ocrmypdf.optimize:optimize.py:665 Optimize ratio: 1.14 savings: 12.6%
> INFO ocrmypdf._sync:_sync.py:399 Output file is a PDF/A-2B (as expected)
> --------------------------- Captured stderr teardown ---------------------------
>
> PDF/A conversion: 100%|██████████| 1/1 [00:00<00:00, 3.91page/s]
> =========================== short test summary info ============================
> FAILED tests/test_main.py::test_force_ocr - assert False
> FAILED tests/test_main.py::test_skip_ocr - assert False
> FAILED tests/test_main.py::test_redo_ocr - assert (True and False)
> ======= 3 failed, 274 passed, 37 skipped, 4 xfailed in 397.41s (0:06:37) =======
> autopkgtest [08:17:33]: test test-suite
>
