in search of Linux PDF/A conversion tool


dne...@g.harvard.edu

Apr 7, 2016, 10:54:45 AM
to Digital Curation

I am looking for suggestions on software tools for conversion (not validation, such as veraPDF) of various types of word processing documents to PDF/A (any version: PDF/A-1, -2, -3, etc.). The requirement is that the tool must run in a Linux (not Windows or Mac) environment as a service or on the command line (not using a user interface) and be able to convert most of these formats: .doc, .docx, .epub, .odt, .rtf, .pdf (non-PDF/A), .wpd.

Thanks in advance.

Michael Kjörling

Apr 7, 2016, 11:41:49 AM
to digital-...@googlegroups.com
On 7 Apr 2016 07:39 -0700, from dne...@g.harvard.edu:
LibreOffice can open at least some of the formats you mention, can
output PDF/A, and can be driven from the command line. It doesn't fit
your requirements exactly, but still might be worth looking into.

It also looks like you can convert from regular PDF to PDF/A using
only freely available software. See http://unix.stackexchange.com/a/79519/2465.
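
As a rough illustration of driving LibreOffice from the command line, here is a minimal sketch. It assumes `soffice` is on the PATH and that the installed LibreOffice accepts JSON-style export filter options (to my knowledge this syntax appeared around version 7.4, so it postdates this thread). The helper names `pdfa_filter` and `to_pdfa` are made up for the example, and the `SelectPdfVersion` values should be checked against the documentation for your release.

```shell
#!/usr/bin/env bash
# Sketch: convert one word-processing document to PDF/A with headless LibreOffice.
# Assumption: a recent LibreOffice that accepts JSON export filter options.

# Build the --convert-to argument for the Writer PDF export filter.
# SelectPdfVersion: 1 = PDF/A-1b, 2 = PDF/A-2b, 3 = PDF/A-3b (verify for your release).
pdfa_filter() {
  local version="${1:-2}"
  printf 'pdf:writer_pdf_Export:{"SelectPdfVersion":{"type":"long","value":"%s"}}' "$version"
}

# to_pdfa INPUT [OUTDIR] [VERSION]  (hypothetical helper name)
to_pdfa() {
  local input="$1"
  local outdir="${2:-.}"
  local version="${3:-2}"
  soffice --headless --convert-to "$(pdfa_filter "$version")" --outdir "$outdir" "$input"
}
```

Calling `to_pdfa report.docx out 1` would ask for a PDF/A-1b file in `out/`. It is worth running the result through a validator such as veraPDF, since the export flag alone does not guarantee conformance.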

--
Michael Kjörling • https://michael.kjorling.semic...@kjorling.se
“People who think they know everything really annoy
those of us who know we don’t.” (Bjarne Stroustrup)

Erwin Verbruggen

Apr 7, 2016, 11:45:57 AM
to digital-...@googlegroups.com
Hi, not sure whether Pandoc writes to PDF/A:

"Pandoc is a Haskell library for converting from one markup format to another, and a command-line tool that uses this library. It can read Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, and (subsets of) Textile, reStructuredText, HTML, LaTeX, MediaWiki markup, TWiki markup, Haddock markup, OPML, Emacs Org mode, DocBook, txt2tags, EPUB, ODT and Word docx; and it can write plain text, Markdown, CommonMark, PHP Markdown Extra, GitHub-Flavored Markdown, reStructuredText, XHTML, HTML5, LaTeX (including beamer slide shows), ConTeXt, RTF, OPML, DocBook, OpenDocument, ODT, Word docx, GNU Texinfo, MediaWiki markup, DokuWiki markup, Haddock markup, EPUB (v2 or v3), FictionBook2, Textile, groff man pages, Emacs Org mode, AsciiDoc, InDesign ICML, TEI Simple, and Slidy, Slideous, DZSlides, reveal.js or S5 HTML slide shows. It can also produce PDF output on systems where LaTeX, ConTeXt, or wkhtmltopdf is installed."

Cheers,
Erwin Verbruggen



Alex Garnett

Apr 8, 2016, 11:12:00 AM
to Digital Curation
Hi folks,

The easiest way to approach this in my experience is to convert everything to plain PDF first, and then convert those PDFs to PDF/A using Ghostscript as an additional step.

All of the formats you've mentioned, with the exception of ePub, can be converted to PDF on the command line using unoconv, which is a Python wrapper for LibreOffice, using syntax like:

unoconv -f pdf filename.input

You should be able to install unoconv from default Ubuntu sources.

To turn an existing PDF into PDF/A using Ghostscript, try this syntax:

gs -dPDFA -dNOOUTERSAVE -dUseCIEColor -sProcessColorModel=DeviceRGB -sDEVICE=pdfwrite -o $output -dPDFACompatibilityPolicy=1 /usr/share/ghostscript/9.10/lib/PDFA_def.ps $input

Hope that helps! I have the PDFA_def.ps and the associated colourspace I can share if you need it.
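
The two steps above can be chained in a small script. The following is only a sketch: the helper names `pdfa_name` and `convert_to_pdfa` and the `_pdfa.pdf` naming scheme are invented for illustration, and `PDFA_DEF` must point at wherever your Ghostscript version keeps `PDFA_def.ps`. The Ghostscript flags are copied from the command above. Newer Ghostscript releases may expect different colour-conversion options, so check the documentation for your version.

```shell
#!/usr/bin/env bash
# Sketch of the two-step pipeline: document -> PDF (unoconv) -> PDF/A (Ghostscript).
# Assumption: PDFA_DEF points at the PDFA_def.ps for your Ghostscript install.
PDFA_DEF="${PDFA_DEF:-/usr/share/ghostscript/9.10/lib/PDFA_def.ps}"

# Derive the output name: report.docx -> report_pdfa.pdf (illustrative scheme).
pdfa_name() {
  local base="${1%.*}"
  printf '%s_pdfa.pdf' "$base"
}

# convert_to_pdfa INPUT  (hypothetical helper name)
convert_to_pdfa() {
  local input="$1"
  local pdf="${input%.*}.pdf"
  local output
  output="$(pdfa_name "$input")"

  # Step 1: anything that is not already a PDF goes through unoconv,
  # which writes the PDF next to the input file.
  case "$input" in
    *.pdf|*.PDF) pdf="$input" ;;
    *) unoconv -f pdf "$input" || return 1 ;;
  esac

  # Step 2: same Ghostscript flags as in the command above.
  gs -dPDFA -dNOOUTERSAVE -dUseCIEColor -sProcessColorModel=DeviceRGB \
     -sDEVICE=pdfwrite -o "$output" -dPDFACompatibilityPolicy=1 \
     "$PDFA_DEF" "$pdf"
}
```

A batch run could then be as simple as `for f in *.docx *.odt *.rtf; do convert_to_pdfa "$f"; done`, with a PDF/A validator run afterwards to confirm the output.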

Alex Garnett

Apr 8, 2016, 11:13:11 AM
to Digital Curation
PS -- with the ePub, you may be able to jury-rig something using wkhtmltopdf, since ePub is technically wrappered HTML, but it's taking some jimmying to get nice results.

nitin.aror...@gmail.com

Apr 9, 2016, 9:44:07 AM
to Digital Curation
Calibre could help with EPUB to PDF.
It has a command line interface (https://manual.calibre-ebook.com/generated/en/cli-index.html), and there should be a Python library, too, if I recall.

I've seen far too much variation in EPUB to get consistently good results re: conversion, but assuming your EPUBs are just text and images, it might be something to get started with.
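
For reference, a minimal sketch of the Calibre route using its `ebook-convert` command, which takes an input file and an output file and infers the formats from the extensions. The wrapper name `epub_to_pdf` and the `EBOOK_CONVERT` override variable are assumptions added for the example. The PDF it produces is an ordinary PDF, so it would still need a separate PDF/A conversion step such as Ghostscript.

```shell
#!/usr/bin/env bash
# Sketch: EPUB -> PDF with Calibre's command-line converter.
# EBOOK_CONVERT is a hypothetical override so the binary path can be changed.
EBOOK_CONVERT="${EBOOK_CONVERT:-ebook-convert}"

# epub_to_pdf INPUT.epub [OUTPUT.pdf]  (hypothetical helper name)
# Without a second argument, the output sits next to the input with a .pdf extension.
epub_to_pdf() {
  local input="$1"
  local output="${2:-${input%.*}.pdf}"
  "$EBOOK_CONVERT" "$input" "$output"
}
```

Usage would look like `epub_to_pdf book.epub`, which writes `book.pdf` in the same directory.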

nitin.aror...@gmail.com

Apr 9, 2016, 9:44:07 AM
to Digital Curation
I just sent a note about Calibre for EPUB, but after looking at their converter page (https://manual.calibre-ebook.com/generated/en/ebook-convert.html) it could potentially address some of your other format needs as well. At least it might be worth looking into.

L Snider

Apr 9, 2016, 10:06:47 AM
to digital-...@googlegroups.com
One thing to note about Calibre (and something I hope someone fixes!) is that it only does EPUB 2. EPUB 3 has had huge improvements, particularly for people with disabilities, as it includes DAISY. Plus they just brought out a newer version of it a short time ago that makes it even better.

I did extensive testing with Calibre a year ago, and it really is like a Swiss Army knife. The developer has said he won't do the update to EPUB 3, so if anyone you know can do that, it would be HUGE! There are other EPUB 3 creation programs, but they are bulky and hard to use, unlike this one.

Cheers

Lisa

Lisa Snider
Archivist and Accessibility Consultant

dne...@g.harvard.edu

Apr 25, 2016, 5:02:48 PM
to Digital Curation
Thanks to everyone for your responses. We will be evaluating all of these suggestions.


On Thursday, April 7, 2016 at 10:54:45 AM UTC-4, dne...@g.harvard.edu wrote: