Any suggestions for more accurate Text conversion?

bha...@automot.us

Mar 27, 2018, 3:45:55 AM
to tesseract-ocr
Hello,

I am working on a project where I extract license plates from images and try to read the plate number automatically.

After applying some computer vision and image processing, I have come up with the following result.


As can be seen, the text Tesseract produces is: 6JZX97L

whereas the actual plate number is 6JZX974.

I am very new to Tesseract. It seemed like a very easy-to-use library for my task, but I have no idea how to tackle a case like this. If anyone has worked on a similar problem, please share your thoughts.

Some other error-prone digit/letter pairs: 0-O, 1-I, 2-Z, 5-S, 8-B...

Thanks!
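
One way to handle the digit/letter confusions listed above is to correct the OCR output after the fact, using the known layout of the plate. The following is a minimal sketch, not something proposed in the thread. It assumes a hypothetical plate layout of one digit, three letters, and three digits (the pattern string "DLLLDDD" and the function name are inventions for this example), and it only knows the pairs listed in the post.

```python
# Confusable pairs listed in the post, keyed by the direction of the fix.
LETTER_TO_DIGIT = {"O": "0", "I": "1", "Z": "2", "S": "5", "B": "8"}
DIGIT_TO_LETTER = {digit: letter for letter, digit in LETTER_TO_DIGIT.items()}

# Hypothetical plate layout: 'D' marks a digit position, 'L' a letter position.
PLATE_PATTERN = "DLLLDDD"


def correct_by_pattern(text, pattern=PLATE_PATTERN):
    """Swap confusable characters so each position matches its expected class.

    Characters with no known counterpart are left unchanged, and text whose
    length does not match the pattern is returned as-is rather than guessed at.
    """
    if len(text) != len(pattern):
        return text
    corrected = []
    for char, kind in zip(text.upper(), pattern):
        if kind == "D" and not char.isdigit():
            char = LETTER_TO_DIGIT.get(char, char)
        elif kind == "L" and not char.isalpha():
            char = DIGIT_TO_LETTER.get(char, char)
        corrected.append(char)
    return "".join(corrected)


if __name__ == "__main__":
    print(correct_by_pattern("6J2X97S"))  # '2' sits in a letter slot, 'S' in a digit slot
    print(correct_by_pattern("6JZX97L"))  # 'L' has no entry in the table, so it is kept
```

Note that the L/4 confusion from the example plate is not among the listed pairs, so this table would leave 6JZX97L untouched. Adding such a pair is a judgment call, since a wrong mapping silently corrupts correct reads.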

ShreeDevi Kumar

Mar 27, 2018, 6:00:06 AM
to tesser...@googlegroups.com
You can try finetune training.

Test with attached traineddata file.
eng-numCAPS.traineddata

bha...@automot.us

Mar 27, 2018, 2:24:36 PM
to tesseract-ocr
Thank you, Shree. I will give it a shot with the attached traineddata file!

About fine-tuning: are there any example tutorials on the Tesseract wiki? I will try to find them myself, but if you know of one and can post the link, I would really appreciate it!

Thanks. 

bha...@automot.us

Mar 27, 2018, 3:53:09 PM
to tesseract-ocr
Hi Shree,

I just tried using the training data file you provided but it seems that there is some problem with Tesseract recognizing this file. I should have mentioned before that I am using version '3.05.01'.

Below is the sequence of commands I ran:

Bhargavs-MacBook-Pro-2:LPR bhargav$ tesseract topcrop1.jpg out -l end-numCAPS
Error opening data file /usr/local/Cellar/tesseract/3.05.01/share/tessdata/end-numCAPS.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'end-numCAPS'
Tesseract couldn't load any languages!
Could not initialize tesseract.
Bhargavs-MacBook-Pro-2:LPR bhargav$ ls /usr/local/Cellar/tesseract/3.05.01/share/tessdata/
configs eng.traineddata pdf.ttf
eng-numCAPS.traineddata osd.traineddata tessconfigs
Bhargavs-MacBook-Pro-2:LPR bhargav$ echo $TESSDATA_PREFIX
/usr/local/share/tessdata


Please let me know whether I have done something wrong, or whether the traineddata file has a version mismatch or is corrupted.

Thanks,
Bhargav
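
Separately from any version issue, the command above passes `-l end-numCAPS` while the directory listing shows `eng-numCAPS.traineddata`, so the "Error opening data file" message is what one would expect for a name that has no matching file. A minimal sketch of a pre-flight check follows. The function names are hypothetical, and it only assumes that the name given to `-l` should match a `<name>.traineddata` file in the tessdata directory.

```python
import difflib
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the names that have a <name>.traineddata file in tessdata_dir."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def check_language(requested, tessdata_dir):
    """Return (found, suggestions) for a name intended for `tesseract -l`."""
    langs = available_languages(tessdata_dir)
    if requested in langs:
        return True, []
    # Offer near matches so that a one-letter typo is easy to spot.
    return False, difflib.get_close_matches(requested, langs, n=3)


if __name__ == "__main__":
    # Directory taken from the error message in the post above.
    tessdata = "/usr/local/Cellar/tesseract/3.05.01/share/tessdata"
    if Path(tessdata).is_dir():
        print(check_language("end-numCAPS", tessdata))
```

With the files shown in the listing, the check would report the requested name as missing and suggest `eng-numCAPS` instead.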

ShreeDevi Kumar

Mar 27, 2018, 4:37:36 PM
to tesser...@googlegroups.com
Version mismatch. That traineddata is for 4.0.

Wiki has pages for training. Look for one appropriate for your version of tesseract.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/c346ec8b-32ef-4b29-b9e6-e5d9225a31df%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Bhargav Kanakiya

Mar 27, 2018, 8:22:43 PM
to tesseract-ocr
I tried using version 4.0 by building it from source.

However, I get the following messages and, not surprisingly, the output is completely wrong.

Failed to load any lstm-specific dictionaries for lang eng-numCAPS!!
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 233

Output: 4JTX9T

I understand that the DPI message has been there since older versions (I saw it in 3.05 as well). Is the 'lstm-specific' message caused by the traineddata file? And is my only other option to train or fine-tune on my own data set?

shree

Mar 28, 2018, 5:45:48 AM
to tesseract-ocr
Yes, for 4.0 you can try finetune training. You can download license plate specific fonts to easily make training data. 
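
As a starting point for making such training data, here is a minimal sketch that generates plate-like strings for a training text file. It is an illustration under stated assumptions: the layout "DLLLDDD" (one digit, three letters, three digits), the function names, and the output file name are all hypothetical, and real plates follow jurisdiction-specific rules that this ignores.

```python
import random
import string


def random_plate(rng, pattern="DLLLDDD"):
    """Build one plate string: 'D' positions get a digit, 'L' positions a letter."""
    chars = []
    for kind in pattern:
        pool = string.digits if kind == "D" else string.ascii_uppercase
        chars.append(rng.choice(pool))
    return "".join(chars)


def make_training_text(num_lines, seed=0, pattern="DLLLDDD"):
    """Return num_lines plate strings, reproducible for a given seed."""
    rng = random.Random(seed)
    return [random_plate(rng, pattern) for _ in range(num_lines)]


def write_training_text(path, lines):
    """Write one plate string per line."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write("\n".join(lines) + "\n")


if __name__ == "__main__":
    # Hypothetical output file name.
    write_training_text("plates.training_text", make_training_text(1000))
```

The resulting file could then be rendered into line images with Tesseract's `text2image` tool and a license plate font. That step, and the fine-tuning run itself, are described on the training pages of the wiki and are not shown here.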

Bhargav Kanakiya

Mar 28, 2018, 3:30:49 PM
to tesser...@googlegroups.com
Okay, thank you!

--
Bhargav Kanakiya
Computer Vision Engineer

Abu Anas

Oct 26, 2018, 2:38:58 PM
to tesseract-ocr
I am having a similar problem. I trained KB-JT-NEW, continuing from ben, and got the following result:

At iteration 127102/500000/500000, Mean rms=0.437%, delta=1.593%, char train=11.184%, word train=11.098%, skip ratio=0%,  New worst char error = 11.184 wrote checkpoint.

Finished! Error rate = 7.737
lstmtraining \
  --stop_training \
  --convert_to_int \
  --continue_from data/checkpoints/KB-JT-NEW_checkpoint \
  --traineddata data/KB-JT-NEW/KB-JT-NEW.traineddata \
  --model_output data/KB-JT-NEW.traineddata
Loaded file data/checkpoints/KB-JT-NEW_checkpoint, unpacking...

But after putting the .traineddata file in /usr/local/share/tessdata/ and running recognition, I get bizarre results along with this message:
Failed to load any lstm-specific dictionaries for lang KB-JT-NEW!!



Shree Devi Kumar

Oct 28, 2018, 9:29:23 PM
to tesser...@googlegroups.com
The starter traineddata that you used does not contain any dawg files (these are built from the word list, numbers, and punctuation), hence the message that no dictionaries were found.
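
To see which dawg files a given traineddata actually contains, one option is to unpack it and look for the dictionary components. A minimal sketch follows. It assumes the file has already been unpacked into a directory (for example with `combine_tessdata -u`), that the components are named `<lang>.lstm-word-dawg`, `<lang>.lstm-punc-dawg`, and `<lang>.lstm-number-dawg` (matching the `fas.lstm-punc-dawg` style of name mentioned later in this thread), and the function name is hypothetical.

```python
from pathlib import Path

# LSTM dictionary components, as assumed for this sketch.
DICTIONARY_COMPONENTS = ("lstm-word-dawg", "lstm-punc-dawg", "lstm-number-dawg")


def missing_dictionaries(unpacked_dir, lang):
    """List the dawg components that are absent from an unpacked traineddata."""
    base = Path(unpacked_dir)
    return [name for name in DICTIONARY_COMPONENTS
            if not (base / f"{lang}.{name}").is_file()]


if __name__ == "__main__":
    # Hypothetical directory holding the unpacked components of KB-JT-NEW.
    print(missing_dictionaries("unpacked", "KB-JT-NEW"))
```

If all three names come back as missing, that would be consistent with the "Failed to load any lstm-specific dictionaries" message.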




amir musavi

Jun 16, 2019, 1:34:22 AM
to tesseract-ocr
Hello Shree,
I am confused. Can you explain precisely what I must do?
I fine-tuned fas.traineddata, and the training produced fas.lstm, fas.lstm-number-dawg, fas.lstm-punc-dawg, fas.lstm-recoder, etc. Now, when I copy fas.traineddata to the tessdata folder and run the tesseract command, "Failed to load any lstm-specific dictionaries" appears and the OCR output is not good.
Best regards