Config hints to improve recognition accuracy.


Clint William Theron

Aug 30, 2019, 3:19:11 PM
to tesseract-ocr
Consider the following image and output:

_6.jpg

Tesseract's recognition output:
LUHO: R54 MILLION GTD
LOTTO PLUS 1: R6,! MILLION est
LOTTO PLUS 2: R7,4 MILLION est
NIN YOUR SHARE OF R1,! MILLION!!!
Buy any NATIONAL LOTTERY t1cket ther
SMS :ID,#PLAY,TICKET CODE TO 34909.
Cash Prizes to be won!!! T’s and C’
apply vtsit National Lottery website
PLEASE RETAIN YOUR ENTRY TICKET!
First Draw: Saturday 20/07/19
VALID RECEIPT FOR 1 Oraw(S)
FROM DRAW 1937 To 1937
LOTTO PLUS 1: ND
LUTTU PLUS 2: ND
‘TotaT:R5.00
_‘,{gxt, Inc! 152 VA

I'm a newbie when it comes to Tesseract.js. I know there is a way to pass config parameters to improve OCR accuracy. In the image above, I'm interested in the numbers between the two horizontal dashed lines. Could you suggest a few config parameters to pass to the recognize method that might improve the OCR accuracy?

Thank you in advance. P.S. Anything would be helpful.
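
For reference, the kind of parameters being asked about might look like the sketch below. The two keys are standard Tesseract variable names; the helper digitOnlyParameters is hypothetical, and the commented setParameters call is an assumption about how Tesseract.js accepts them. Some engine versions may ignore the whitelist in LSTM mode, so treat this as something to try rather than a guaranteed fix.

```typescript
// Hypothetical helper: builds a plain parameter object that restricts
// recognition to digits. Values are strings, as Tesseract variables are.
type OcrParameters = Record<string, string>;

function digitOnlyParameters(extraChars: string = ""): OcrParameters {
  return {
    // Only these characters may appear in the recognized text.
    tessedit_char_whitelist: "0123456789" + extraChars,
    // Page segmentation mode 6: assume a single uniform block of text.
    tessedit_pageseg_mode: "6",
  };
}

const demoParams = digitOnlyParameters();
console.log(demoParams);

// Assumption: with a Tesseract.js worker these would be applied before
// recognition, for example:
//   await worker.setParameters(demoParams);
```

Restricting the character set only helps once the number block has been isolated; on the full ticket it would also discard all the letters.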

René Hansen

Aug 30, 2019, 5:03:04 PM
to tesser...@googlegroups.com
A few config params won't do the trick. You need to preprocess the image. Make sure you read this: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

Ideally I think you need to cook down the image you give tesseract to something like this:

cutout.jpg

Even this isn't quite good enough though. I get "NG: 1020452" as a result from https://tesseract.projectnaptha.com
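
Since only the numbers matter here, stray characters such as the "NG:" prefix can be dropped after recognition. A minimal sketch, assuming the OCR result is available as a plain string; extractDigitRuns and its minLength default are hypothetical names chosen for illustration:

```typescript
// Hypothetical helper: keeps only runs of consecutive digits that are at
// least minLength long, discarding letters, punctuation and short noise.
function extractDigitRuns(text: string, minLength: number = 4): string[] {
  const runs: string[] = text.match(/\d+/g) || [];
  return runs.filter((run) => run.length >= minLength);
}

console.log(extractDigitRuns("NG: 1020452")); // → ["1020452"]
```

This does not correct misread digits; it only removes characters that cannot be part of the number.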

You might need to train on this specific font to get better results, or do further preprocessing to increase accuracy.
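
One common form of "further preprocessing" is binarization. The sketch below is not the exact procedure used to produce the cutout above; it assumes the cropped image is available as RGBA bytes (for example from a canvas ImageData) and applies Otsu's method to pick a global threshold. All function names are hypothetical.

```typescript
// Convert RGBA bytes (4 per pixel) to 8-bit luminance, one byte per pixel.
function toGrayscale(rgba: Uint8ClampedArray): Uint8Array {
  const gray = new Uint8Array(Math.floor(rgba.length / 4));
  for (let i = 0; i < gray.length; i++) {
    const r = rgba[4 * i];
    const g = rgba[4 * i + 1];
    const b = rgba[4 * i + 2];
    gray[i] = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
  }
  return gray;
}

// Otsu's method: choose the threshold that maximizes between-class variance.
function otsuThreshold(gray: Uint8Array): number {
  const hist: number[] = [];
  for (let v = 0; v < 256; v++) hist.push(0);
  for (let i = 0; i < gray.length; i++) hist[gray[i]]++;

  let sumAll = 0;
  for (let v = 0; v < 256; v++) sumAll += v * hist[v];

  let weightDark = 0;
  let sumDark = 0;
  let best = 0;
  let bestVariance = -1;
  for (let t = 0; t < 256; t++) {
    weightDark += hist[t];
    if (weightDark === 0) continue; // no pixels at or below t yet
    const weightLight = gray.length - weightDark;
    if (weightLight === 0) break; // all pixels are at or below t
    sumDark += t * hist[t];
    const meanDark = sumDark / weightDark;
    const meanLight = (sumAll - sumDark) / weightLight;
    const variance = weightDark * weightLight * (meanDark - meanLight) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      best = t;
    }
  }
  return best;
}

// Pixels brighter than the threshold become white (255), the rest black (0).
function binarize(rgba: Uint8ClampedArray): Uint8Array {
  const gray = toGrayscale(rgba);
  const threshold = otsuThreshold(gray);
  const out = new Uint8Array(gray.length);
  for (let i = 0; i < gray.length; i++) out[i] = gray[i] > threshold ? 255 : 0;
  return out;
}

// Two-pixel demo: one dark gray pixel, one light gray pixel.
const demoPixels = new Uint8ClampedArray([10, 10, 10, 255, 200, 200, 200, 255]);
console.log(binarize(demoPixels));
```

A global threshold works best on evenly lit crops; unevenly lit photos of receipts may need an adaptive threshold instead.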


/René


--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2a937bf3-8c97-466d-a9bb-26a277e02522%40googlegroups.com.


--
Never fear, Linux is here.

Clint William Theron

Aug 31, 2019, 11:28:42 AM
to tesser...@googlegroups.com
Thanks for your response. I already tried your suggestions, and now and then I get the desired result. What I'm looking to do now is train Tesseract, but I can't get Tesseract to use my traineddata language. My app is a browser web app served over HTTP by an Apache server. I would appreciate it if you could answer my SO question:

https://stackoverflow.com/questions/57715343/how-do-i-specify-traineddata-language-path-and-language-code-when-using-tesser

Thanks
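
One thing worth checking in this situation is the URL the browser actually requests for the traineddata file. The sketch below rests on an assumption about Tesseract.js, namely that it fetches "<langPath>/<lang>.traineddata" and appends ".gz" unless gzip is turned off. The helper trainedDataUrl, the language code "lot" and the localhost path are all hypothetical.

```typescript
// Hypothetical helper mirroring the assumed URL scheme, useful for
// comparing against the request shown in the browser's network tab.
function trainedDataUrl(langPath: string, lang: string, gzip: boolean = true): string {
  const base = langPath.replace(/\/+$/, ""); // drop trailing slashes
  return base + "/" + lang + ".traineddata" + (gzip ? ".gz" : "");
}

console.log(trainedDataUrl("http://localhost/tessdata/", "lot", false));
// → "http://localhost/tessdata/lot.traineddata"
```

If the file on the Apache server is not gzipped but the request ends in ".gz", the download would fail with a 404 even though the path looks right.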


René Hansen

Aug 31, 2019, 2:17:25 PM
to tesser...@googlegroups.com
Can't help you there I'm afraid. I have no experience with tesseract.js.


/René


Clint William Theron

Aug 31, 2019, 5:18:12 PM
to tesseract-ocr
Thanks. I understand. Which Tesseract do you have experience with? On Windows 10 I'm able to replace the eng.traineddata file with my own, and then Tesseract uses my language. That is what I'm looking for, but it has to work online (not locally).

Clint William Theron

Sep 2, 2019, 4:29:17 PM
to tesseract-ocr