Tesseract OCR


Carolina Melo

Jun 14, 2017, 3:53:25 PM
to archivematica
Hello, everyone!

Has anyone had success running OCR with Tesseract?
Does it need to be trained?


Cheers,

Carolina Melo

Carolina Melo

Jul 3, 2017, 9:05:49 AM
to archivematica
Hello! Can anyone help me?

Alex Garnett

Jul 4, 2017, 11:21:51 AM
to archivematica
Hi Carolina,

Results should be OK out of the box without needing to retrain it yourself (though Archivematica may install only the English corpora for Tesseract by default). Is it not working for you?
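
To see which languages a given Tesseract install actually has, something like the following may help. It is a minimal sketch, assuming the `tesseract` binary is on the PATH and supports the `--list-langs` option; `parse_langs` and `installed_langs` are hypothetical helper names, and the header-line layout is an assumption based on typical output.

```python
import subprocess
from typing import List


def parse_langs(output: str) -> List[str]:
    """Extract language codes from `tesseract --list-langs` output.

    Assumes a header line ending in a colon, followed by one
    language code per line.
    """
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if not line or line.endswith(":"):
            continue
        langs.append(line)
    return langs


def installed_langs() -> List[str]:
    """Run tesseract and return the installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some versions print the list to stderr rather than stdout,
    # so both streams are parsed.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

If `installed_langs()` returns only `eng` (plus possibly `osd`), documents in other languages would likely need extra language data installed before OCR results improve.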

Carolina Melo

Jul 5, 2017, 8:35:42 AM
to archivematica
Unfortunately it's not working, Alex.
I tried using two formats, tiff and pdf, and got the following errors:

tiff
Transcribing access derivative 8414491f-7ef5-4f83-ad56-80e64e63a2ef
Tesseract Open Source OCR Engine v3.03 with Leptonica


pdf
No rules found for file 27a66deb-76a8-4cf5-ac05-70fb45937d49 and its derivatives; not transcribing

Any tips?

Carolina Melo

Jul 5, 2017, 11:27:06 AM
to archivematica
Hi again, Alex!

I think my problem is with the formats my AM is able to transcribe.

S Bowyer

Nov 26, 2020, 8:58:19 AM
to archivematica
I'm having a similar problem to Carolina. Even though Preservation Planning>Transcription>Rules lists both TIFF and JPEG as having a "Transcribe using Tesseract" rule, I have not been able to get this working. I have tried to transcribe both a JPEG file and a TIFF file using the Transcribe SIP contents microservice, but when I check Preservation Planning>Transcription>Rules, the "Success" column remains at "0 out of 0", so it looks like Tesseract is not even attempting to transcribe these files, even when asked to do so. In other words, much like what is shown in Carolina's screenshot.

Has anyone had success using Tesseract to transcribe single-page TIFF or JPEG files? It would be really useful to be able to OCR these types of files. Where am I going wrong?

Thanks, 
Surya 

Ross Spencer

Nov 27, 2020, 10:26:39 AM
to archivematica
Hi Surya, 

I have been testing Tesseract recently, and for my known test images it looks okay. That doesn't rule out a number of other issues, though. We can check a few things, but the first place to start is the format identification microservice job output in the transfer tab, specifically the 'Command output' line in the standard output. For Siegfried I see PRONOM identifier (PUID) fmt/353, which is a grouping for most TIF output in PRONOM.


To see whether I can expect this to work in Tesseract, I need to look at the rules that you and Carolina have been investigating, navigating to the specific PRONOM identifier listing to see if it matches. For TIF, here is a short screen recording:

Peek 2020-11-27 10-12.gif

Unfortunately, here I can see the rule is associated with fmt/10, which I happen to know can't be returned by tools like FIDO/DROID/Siegfried because the listing was deprecated in PRONOM: http://www.nationalarchives.gov.uk/PRONOM/fmt/10

In this case you'd need to create a new transcribe rule by selecting create new rule, and then choosing the TIF format entry associated with the PUID returned by Siegfried.

This would be the first step for each of the formats that you are looking at transcribing. 

I suspect this might be what's happening for your own TIF file, and for the JPEG we only have three specific JPG files listed so it might also be happening there too. 
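
The mismatch described above can be sketched in a few lines. This is only an illustration, assuming rules are matched on the exact PUID as the explanation suggests; the dictionaries and `find_rule` are hypothetical stand-ins, not Archivematica's actual FPR code.

```python
from typing import Dict, Optional

# Hypothetical rule table keyed by PUID. The real rules live in the FPR.
RULES_AS_SHIPPED: Dict[str, str] = {
    "fmt/10": "Transcribe using Tesseract",  # deprecated PUID, never returned
}


def find_rule(puid: str, rules: Dict[str, str]) -> Optional[str]:
    """Return the rule for an exact PUID match, or None when nothing matches."""
    return rules.get(puid)


identified = "fmt/353"  # PUID Siegfried reported for the TIFF in this thread

# No rule is keyed on fmt/353, so nothing would be transcribed.
print(find_rule(identified, RULES_AS_SHIPPED))  # None

# Adding a rule for the PUID that is actually returned makes the lookup succeed.
rules_with_fix = dict(RULES_AS_SHIPPED)
rules_with_fix["fmt/353"] = "Transcribe using Tesseract"
print(find_rule(identified, rules_with_fix))  # Transcribe using Tesseract
```

This is also consistent with the "Success" column staying at "0 out of 0": if no rule matches the identified format, Tesseract would never be invoked for the file.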

Let us know how it goes and then we can do some more debugging for you if need be. 

Regarding the fact that we have a deprecated PUID listed, I'll have a look at the issues we have logged for the FPR, because I feel this one should be flagged if it isn't already.

Best,
Ross

Ross Spencer

Nov 27, 2020, 10:37:16 AM
to archivematica
Sorry about that, it looks like the format identification screen was lost during posting. 



Ross Spencer

Nov 27, 2020, 10:40:14 AM
to archivematica
Still not happening! Here is the text from the output, though. (Hopefully that helps!)
Ross

Standard output (stdout)

IDCommand: Identify using Siegfried 1.8.0 IDCommand UUID: 9402ad69-f045-4d0a-8042-9c990645910a 
IDTool: Siegfried IDTool UUID: 454df69d-5cc0-49fc-93e4-6fbb6ac659e7 
File: (c530651e-5ada-4dd7-9726-270cb107fa29) /var/archivematica/.../norm_1-a912408c-f2ad-410d-9135-1e7380ac9fff/objects/G31DS.TIF 
Command output: fmt/353 
/var/archivematica/.../norm_1-a912408c-f2ad-410d-9135-1e7380ac9fff/objects/G31DS.TIF identified as a TIFF
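
When checking several files, the PUID can be pulled out of this job output with a small script. A minimal sketch, assuming the stdout contains a "Command output:" line as shown above; `extract_puid` is a hypothetical helper, not part of Archivematica.

```python
import re
from typing import Optional

# Matches the "Command output: <PUID>" line in the identification job's stdout.
PUID_LINE = re.compile(r"Command output:\s*(\S+)")


def extract_puid(job_stdout: str) -> Optional[str]:
    """Return the PUID reported in the job output, or None if the line is absent."""
    match = PUID_LINE.search(job_stdout)
    return match.group(1) if match else None


sample = (
    "IDTool: Siegfried IDTool UUID: 454df69d-5cc0-49fc-93e4-6fbb6ac659e7 \n"
    "Command output: fmt/353 \n"
)
print(extract_puid(sample))  # fmt/353
```

The value returned here is the one to compare against the PUID attached to the transcription rule.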

S Bowyer

Nov 30, 2020, 11:00:53 AM
to archivematica
Hi Ross, 

Your explanation makes sense. Thanks very much for your help. 

Best, 
Surya 
