Random low confidence scores: is fine-tuning a good solution for my use case?

Tuan Ardouin

May 12, 2020, 2:54:07 PM

to tesseract-ocr

Hello Everyone,

I'm trying to use Tesseract on a legal/accounting document that has a lot of numbers placed in tables; the rest of the text is in French.

Example of a document:

https://imgur.com/a/hemeVdA

Right now I get pretty good results, but I'm trying to improve them. I already removed all the straight lines, which gave me much better results, but as you can see in the next image, some numbers still have a low confidence score. Yet when I run Tesseract on one of those numbers in isolation, the confidence score is excellent. The same thing happens with words.

My config:

PSM 6

OEM 1

lang fra

model best
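For reference, a minimal sketch of how this configuration could be run from Python to list the low-confidence words. It assumes the pytesseract wrapper and Pillow, neither of which is mentioned above. The names `build_config`, `low_confidence_words`, and `ocr_page`, as well as the threshold of 60, are hypothetical and chosen only for illustration. "model best" is assumed to mean the `fra.traineddata` file from tessdata_best, already installed where Tesseract looks for it.

```python
from typing import Dict, List, Tuple


def build_config(psm: int = 6, oem: int = 1) -> str:
    """Build the Tesseract option string for the settings listed above."""
    return f"--psm {psm} --oem {oem}"


def low_confidence_words(
    data: Dict[str, list], threshold: float = 60.0
) -> List[Tuple[str, float]]:
    """Return (text, confidence) pairs whose confidence is below the threshold.

    `data` is expected to have the shape returned by
    pytesseract.image_to_data(..., output_type=Output.DICT): parallel lists
    under keys such as "text" and "conf". Entries with a confidence of -1
    are layout elements (blocks, lines), not words, so they are skipped.
    """
    result = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # some pytesseract versions return strings
        if conf < 0 or not str(text).strip():
            continue
        if conf < threshold:
            result.append((text, conf))
    return result


def ocr_page(image_path: str) -> Dict[str, list]:
    """Run Tesseract on one page image (needs pytesseract and Pillow)."""
    # Imported here so the helpers above work without Tesseract installed.
    import pytesseract
    from PIL import Image
    from pytesseract import Output

    return pytesseract.image_to_data(
        Image.open(image_path),
        lang="fra",
        config=build_config(),
        output_type=Output.DICT,
    )
```

Calling `low_confidence_words(ocr_page("page.png"))` (with a hypothetical file name) would then give the words and numbers worth a closer look.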

I have some ideas as to why I'm getting this result and how to fix it, but your input would be greatly appreciated :

- Fine tune the model I'm using on the documents I have.

Right now I don't think that's the best idea, given the results I'm getting on the isolated images. The model seems to work fine, but some other factor I'm not seeing is causing these low confidence scores.

- Use different configs when running Tesseract.
I have to be honest: apart from the layout type (PSM) and the engine (OEM), I haven't tried any other parameters, because I don't really understand them and there are a lot of them:
http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
If you think some of them would help, I can test them right away.

- Add a custom dictionary.
I think this will improve the results for the text, but not for the numbers.

- Use a custom model just for the numbers.

I saw this discussion: https://groups.google.com/forum/#!topic/tesseract-ocr/-oeCTcojYfw

and was thinking of fine-tuning the French model myself to better detect numbers, but once again the results I'm getting on the isolated images lead me to think that the problem is elsewhere.

- Run Tesseract again on the low-confidence zones.

This is my last idea. Because I've never run Tesseract in a production environment, I have difficulty judging how it would affect the speed of the whole process and what problems it might create later.
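A rough sketch of what this last idea could look like, assuming the page has already been read once into a word-level dictionary with parallel "text", "conf", "left", "top", "width", and "height" lists (the shape pytesseract's `image_to_data` returns with `Output.DICT`). The helper names, the 10-pixel padding, and the threshold of 60 are all hypothetical. The second OCR pass is injected as a callable so the cropping logic does not depend on any particular wrapper.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def padded_box(
    left: int, top: int, width: int, height: int,
    page_size: Tuple[int, int], pad: int = 10,
) -> Box:
    """Grow a word box by `pad` pixels, clamped to the page boundaries."""
    page_w, page_h = page_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(page_w, left + width + pad),
        min(page_h, top + height + pad),
    )


def low_confidence_boxes(
    data: Dict[str, list], page_size: Tuple[int, int],
    threshold: float = 60.0, pad: int = 10,
) -> List[Box]:
    """Collect padded boxes for words whose confidence is below the threshold."""
    boxes = []
    for i, conf in enumerate(data["conf"]):
        conf = float(conf)
        # conf == -1 marks layout elements (blocks, lines), not words.
        if conf < 0 or conf >= threshold:
            continue
        if not str(data["text"][i]).strip():
            continue
        boxes.append(
            padded_box(
                data["left"][i], data["top"][i],
                data["width"][i], data["height"][i],
                page_size, pad,
            )
        )
    return boxes


def reread_zones(image, boxes: List[Box], ocr_zone: Callable) -> Dict[Box, str]:
    """Crop each zone and read it again with the supplied OCR callable.

    `image` only needs a PIL-style crop((left, top, right, bottom)) method.
    """
    return {box: ocr_zone(image.crop(box)) for box in boxes}
```

With pytesseract, `ocr_zone` could for instance be `lambda crop: pytesseract.image_to_string(crop, lang="fra", config="--psm 7 --oem 1")`, where PSM 7 treats the crop as a single text line. Since only the low-confidence zones are read a second time, the extra cost should grow with the number of such zones rather than with the page size, but that would need to be measured on real documents.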

So my question is:

Do you think one of these paths would be more worthwhile to follow first, or do you have some other ideas?

Thank you,
Tuan
