Any success story?


Des Bw

Nov 14, 2023, 12:55:07 AM
to tesseract-ocr
It looks like everyone is having issues with tesseract. I am not able to find anyone who has had great success with this software.
It would be really encouraging to hear a success story from any language.

Has anybody successfully trained tesseract?
(like, a model that reaches high accuracy: 98% or more?)

Keith Smith

Nov 14, 2023, 9:46:22 AM
to tesser...@googlegroups.com
The short answer is "no", but a fuller answer is that my use case is a bit different from others and is as follows ...

I trained tesseract to read the MICR line at the bottom of bank checks using only 20K checks (i.e. real data, not synthetic).  I was able to get 85% accuracy where the reason for about 13% of the failures was that the person's signature overlapped the MICR line.  If I could figure out a way to detect and remove the overlapping signature contours, then I think I would be able to reach 98% accuracy.  Any suggestions?  I don't know if tesseract would ever be able to do this alone.

I also tried training tesseract from scratch using synthetic data but have not yet achieved the same accuracy.  I think the problem is that the synthetic data doesn't simulate real data closely enough. 

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6509904e-c308-49a6-99a6-a8fd4e4d67bfn%40googlegroups.com.

Merlijn B.W. Wajer

Nov 14, 2023, 9:55:13 AM
to tesser...@googlegroups.com
Hi,

On 14/11/2023 06:55, Des Bw wrote:
> It looks like every one is having issues with tesseract. I am not able
> to find any one who has a great success with this software.
> It would be really encouraging to hear any success story from any language.

Here's one for you:

https://blog.archive.org/2020/11/23/foss-wins-again-free-and-open-source-communities-comes-through-on-19th-century-newspapers-and-books-and-periodicals/

Regards,
Merlijn

Des Bw

Nov 15, 2023, 12:36:13 PM
to tesseract-ocr

@Merlijn: 
That is nice to know. This forum has been a somewhat disappointing place for me because many people want support, but support is rarely available. It is unlike every other Internet forum I have been a member of. My experience with Internet forums has been very encouraging, as people help each other a lot. Here, though, people are constantly asking for help, and little is given. And the people asking for help do not stick around for long and seem to have no patience.

I find the culture in this forum a bit strange.

Robert Komar

Nov 15, 2023, 12:46:14 PM
to tesser...@googlegroups.com
I have been following this list for many years.  The vast majority of the questions are the same ones over and over.  After a while, it gets really tiresome to give the same answers time and again.  So, I think many are too sick and tired of it to keep responding, particularly when the answers are already available in the list archives or in the wiki page.

Cheers,
Rob Komar

Des Bw

Nov 15, 2023, 12:49:34 PM
to tesseract-ocr
Hi @Keith

Have you tried to train using images made by TRDG (https://textrecognitiondatagenerator.readthedocs.io/en/latest/overview.html)?
I recently mentioned this software in this forum because it seems to produce more realistic images than tesseract's default tool (text2image). TRDG also gives you more control over the image output than text2image does. As such, your synthetic data could be tuned to look much more like the bank checks.

As for detecting and removing the signatures, OpenCV is probably the best tool you can have. But I have no clue how it works, so I cannot help. OpenCV has a steep learning curve. I tried to learn it once, but, well, I was not cut out for it.

If the signatures appear at the same coordinates (place) across your images, other tools can also be programmed to crop them out.
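
A rough sketch of the fixed-location idea, in plain Python rather than OpenCV. It assumes the check image has already been binarized into rows of 0/1 values (1 = ink) and that the MICR line sits in a known band of rows; both are assumptions for illustration, and the function name is hypothetical. Any ink blob that reaches outside the band is treated as signature and erased. A stroke that actually touches a MICR character will take that character with it, so this is only a starting point, not a solution.

```python
from collections import deque


def remove_components_leaving_band(img, top, bottom):
    """Erase 8-connected ink components that extend outside rows [top, bottom).

    img is a list of rows of 0/1 values (1 = ink). A new image is returned;
    the input is left unchanged.
    """
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one connected component starting at (y, x).
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Components that leave the band are assumed to be signature strokes.
            if any(cy < top or cy >= bottom for cy, _ in component):
                for cy, cx in component:
                    out[cy][cx] = 0
    return out


# Toy image: the MICR band is rows 2-3; the stroke in column 3 enters from above.
page = [
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_components_leaving_band(page, top=2, bottom=4)
```

On a real scan the same filtering would more likely be done with a library's connected-component routine; the nested-list version here only shows the logic.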

Tom Morris

Nov 16, 2023, 12:05:53 PM
to tesseract-ocr
On Tuesday, November 14, 2023 at 12:55:07 AM UTC-5 desal...@gmail.com wrote:
> It looks like every one is having issues with tesseract.

That's not true. It just looks like that because this list is dominated by newcomers
to the field of OCR and image processing.
 
> I am not able to find any one who has a great success with this software.

With all due respect, you must not have looked very hard.
 
> It would be really encouraging to hear any success story from any language.

As Merlijn already mentioned, the Internet Archive has used Tesseract to OCR 
over 10 million *documents* (so 100s of millions of pages?) in hundreds of languages

TAMU's eMOP project used Tesseract with custom training to OCR 45 million
old crufty page images from the dawn of the printing press

State of the Art Optical Character Recognition of 19th Century Fraktur Scripts using Open Source Engines

German Parliamentary Corpus (GerParCor)

Additional arXiv papers using this search. Following the citation graphs of any of the
papers will turn up additional potentially interesting papers.

> Has anybody a successful training of tesseract?

Yes, many.

Nick White trained Ancient Greek. 
Shree has posted copiously about his efforts training Tesseract. See the list archives as well as his repos:

Exploiting Script Similarities to Compensate for the Large Amount of Data in Training Tesseract LSTM: Towards Kurdish OCR

Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English

There's a contrib repository with Akkadian, polytonic Greek, and other user-trained languages.
 
> (like, a model that can detect with higher accuracy: 98% or more ?)

An accuracy figure without context is meaningless. What language? What domain?
What image source? What resolution? Word or character accuracy? etc, etc
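
To illustrate why "word or character accuracy" matters, here is a minimal sketch that scores the same OCR output both ways. It uses one common definition, accuracy = 1 - (edit distance / reference length); the function names and the sample strings are made up for illustration and are not taken from any Tesseract tool.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy(ref, hyp):
    """1 - (edit distance / reference length), floored at 0."""
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


ground_truth = "the quick brown fox"
ocr_output = "the qu1ck brown f0x"

char_acc = accuracy(ground_truth, ocr_output)                  # per character
word_acc = accuracy(ground_truth.split(), ocr_output.split())  # per word
print(f"character accuracy: {char_acc:.1%}, word accuracy: {word_acc:.1%}")
```

Two wrong characters out of 19 give a character accuracy of about 89.5%, but because they fall in two different words, the word accuracy is only 50%. The same output can therefore look good or bad depending on which figure is quoted.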

If you read some of the papers and descriptions of the large scale projects, you'll see
that OCR model training is a non-trivial problem which people spend months/years on.

Tom

Des Bw

Nov 17, 2023, 12:15:47 PM
to tesseract-ocr
Dear Tom, thank you for listing all the sources. Indeed, I didn't look hard. I was mostly reading this forum, and sure, I am familiar with Shree's (Nick White?) work.

>(like, a model that can detect with higher accuracy: 98% or more ?)
>An accuracy figure without context is meaningless. What language? What domain? What image source? What resolution? Word or character accuracy? etc, etc

When I wrote that, I was thinking about regular scanned documents. The standard (default) model, for example, seems to get around 92-95% accuracy on most 300 dpi scanned books (prose). It could be better or worse for some languages, but that seems to be the average in most cases.

My frustration has been the absence of good documentation of successful trainings done by others, so that we beginners could learn from them. Shree's GitHub is the only place that contains relevant information on how someone did the training: with what settings, and what results came out of it. That is my whole intention and point. Success stories are encouraging. And what that person learned could provide invaluable lessons for newcomers.