PDF/A versions


John Scancella

Jan 13, 2016, 3:52:48 PM
to tesseract-ocr
Hello,

I tried searching but couldn't find which versions of PDF/A (if any) Tesseract supports. Specifically, I have a requirement for PDF/A-2a generation, but I couldn't find out whether Tesseract can write PDF/A-2a compliant files and, if so, how to tell it to do so. Any help is greatly appreciated.

Thanks
John

Jeff Breidenbach

Jan 15, 2016, 3:53:10 PM
to tesseract-ocr
My understanding is PDF/A requires a bit more metadata, for example some color profile information (ICC) and a description about where the data came from (XMP). Tesseract doesn't supply that, sorry. I have no reason to believe implementation is hard, it's just not something I'm currently working on. Would be happy to accept a patch. The PDF creation code in Tesseract is under 1000 lines long and not scary. 
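As a rough illustration of the two pieces mentioned above, here is a minimal Python sketch that scans the raw bytes of a PDF for an XMP packet, a PDF/A identification entry, and an output intent (where the ICC profile is referenced). The function name `pdfa_marker_report` is made up for this example. This is only a heuristic: markers stored inside compressed object streams will be missed, and finding all three does not mean the file is PDF/A compliant. A real validator is needed for that.

```python
def pdfa_marker_report(pdf_bytes: bytes) -> dict:
    """Heuristically report which PDF/A-related markers appear in the raw bytes.

    Hypothetical helper: a plain substring scan, not a PDF parser and not a
    conformance check.
    """
    return {
        # XMP metadata is wrapped in an <?xpacket ...?> processing instruction.
        "xmp_packet": b"<?xpacket" in pdf_bytes,
        # PDF/A files declare their part in XMP via the pdfaid:part property.
        "pdfa_id": b"pdfaid:part" in pdf_bytes,
        # The ICC profile is attached through the catalog's /OutputIntents entry.
        "output_intent": b"/OutputIntents" in pdf_bytes,
    }


if __name__ == "__main__":
    # Tiny stand-in for a PDF with none of the markers present.
    plain = b"%PDF-1.5\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    print(pdfa_marker_report(plain))
```

To check a real file, read it in binary mode and pass the bytes in, for example `pdfa_marker_report(open("out.pdf", "rb").read())`, where `out.pdf` is whatever Tesseract produced.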

Tom Morris

Jan 16, 2016, 12:39:28 PM
to tesseract-ocr
On Wednesday, January 13, 2016 at 3:52:48 PM UTC-5, John Scancella wrote:
I tried searching but couldn't find which versions of PDF/A (if any) Tesseract supports. Specifically, I have a requirement for PDF/A-2a generation, but I couldn't find out whether Tesseract can write PDF/A-2a compliant files and, if so, how to tell it to do so. Any help is greatly appreciated.

PDF/A-2 is a profile of PDF 1.7, and Tesseract currently writes PDF 1.5 (although changing the version number is probably the easiest of the required changes).
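To see which version a given file declares, the `%PDF-x.y` header near the start of the file can be read directly. The following is a minimal Python sketch; the function name `pdf_header_version` is made up for this example, and it assumes the header appears within the first 1024 bytes. Note that a `/Version` entry in the document catalog can override the header, which this sketch does not look at.

```python
import re


def pdf_header_version(pdf_bytes: bytes):
    """Return the version from the %PDF-x.y header as a string, or None.

    Hypothetical helper: only inspects the header, not the catalog's
    /Version entry.
    """
    # Search only the start of the file, where the header is expected.
    match = re.search(rb"%PDF-(\d\.\d)", pdf_bytes[:1024])
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    # Stand-in for the first bytes of a file Tesseract might write.
    print(pdf_header_version(b"%PDF-1.5\n%\xde\xad\xbe\xef\n"))
```

For a real file, pass in the bytes read in binary mode, for example `pdf_header_version(open("out.pdf", "rb").read())`.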

The metadata that Jeff mentions would probably need to be provided externally. For example, things like the document title and author would likely need to be supplied by the user.

One thing you might consider is using a tool like Adobe Acrobat Pro to convert Tesseract's output to the necessary standard. Getting someone to update Tesseract to conform to an ISO standard is going to be difficult, since ISO standards are not freely available and must be purchased (ISO 19005-2:2011 costs 158 Swiss francs).

Tom

Tom Morris

Jan 29, 2016, 12:48:01 PM
to tesseract-ocr
I just stumbled across https://github.com/jbarlow83/OCRmyPDF which claims to use Tesseract and provide PDF/A support.

Tom