How to use Tesseract 4.0 to recognize only digits?


Joey 杨

Sep 23, 2017, 7:27:43 AM
to tesseract-ocr
Hi everyone,
I want to use Tesseract 4.0 to recognize only digits, but it keeps recognizing digits as letters.
In Tesseract 3.04 I could solve this problem by using SetVariable() with a character whitelist (tessedit_char_whitelist), but in Tesseract 4.0 this solution doesn't work.
So how can I make it work?

Thank you
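
For context, the whitelist mentioned in the question is the `tessedit_char_whitelist` config variable, which can also be set from the command line with `-c`. The sketch below only builds such a command and does not run it. It assumes a `tesseract` binary on the PATH and a traineddata file that still contains the legacy (non-LSTM) model; `digits_only_command` and `meter.png` are hypothetical names. Whether the whitelist is honored depends on the engine and version, as the question itself reports for 4.0.

```python
import shlex


def digits_only_command(image_path, oem=0, whitelist="0123456789"):
    """Build (but do not run) a tesseract command restricted to `whitelist`.

    oem follows the --oem option: 0 = legacy engine only, 1 = LSTM only.
    """
    return [
        "tesseract", image_path, "stdout",
        "--oem", str(oem),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


cmd = digits_only_command("meter.png")  # hypothetical image name
print(shlex.join(cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`.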

John Miller

Sep 28, 2017, 6:48:33 AM
to tesseract-ocr
I ran into the same problem, and it puzzled me for several days. I tried Cube mode and it worked, which left me even more confused.

John Miller

Sep 29, 2017, 3:02:41 AM
to tesseract-ocr
Today I found that the problem had already been posted at https://github.com/tesseract-ocr/tesseract/issues/751

shree

Oct 3, 2017, 12:39:30 PM
to tesseract-ocr
You can try the plus-minus style of fine-tuning if all you want is a digits-only traineddata.

Your training_text can contain numbers in the format you need and you can train with a font matching your images.

For proof of concept you can try my experimental version at 

Thomas Menguy

Jan 4, 2018, 1:34:36 AM
to tesseract-ocr
Hi Shree, 

I tried your data for digits ... it works really well!
I need to build a training set with numbers and signs, for example. Could you point me to how you created your own training data? (Sorry, I'm fairly new to Tesseract and have never trained it before.)

Thanks for your help!
BR

ShreeDevi Kumar

Jan 4, 2018, 4:08:17 AM
to tesser...@googlegroups.com
I will have to look for the exact commands and training text I used at that time.

You should be able to recreate the training by following instructions given at https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

I had modified the English langdata files and then, after completing training, renamed the resulting traineddata to digits.

Create a training text which has digits and signs. 

Replace the word list to match the kind of number patterns you expect or don't use a word list at all.
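
As an illustration of the two steps above, here is a minimal sketch that generates a training text made only of digits and a few signs, plus a word list of the number patterns it contains. The set of signs, the number styles, and the function names are assumptions chosen for the example, not the text used for the digits traineddata discussed in this thread.

```python
import random

SIGNS = "+-.,%"  # assumed set of signs; adjust to what your images contain


def make_number(rng):
    """Return one random number-like token such as '-12', '3.7' or '1,234'."""
    digits = "".join(rng.choice("0123456789") for _ in range(rng.randint(1, 6)))
    style = rng.choice(["plain", "signed", "decimal", "percent", "thousands"])
    if style == "signed":
        return rng.choice("+-") + digits
    if style == "decimal":
        return digits + "." + str(rng.randint(0, 99))
    if style == "percent":
        return digits + "%"
    if style == "thousands":
        return f"{int(digits):,}"  # comma as thousands separator
    return digits


def make_training_text(n_lines=200, tokens_per_line=8, seed=0):
    """Build a reproducible block of text lines containing only numbers and signs."""
    rng = random.Random(seed)
    lines = [
        " ".join(make_number(rng) for _ in range(tokens_per_line))
        for _ in range(n_lines)
    ]
    return "\n".join(lines) + "\n"


def make_wordlist(text):
    """Use the distinct number patterns in the text as the word list."""
    return sorted(set(text.split()))


text = make_training_text()
wordlist = make_wordlist(text)
print(text.splitlines()[0])
# To use the result, write `text` and "\n".join(wordlist) to the training text
# and word list files in your own langdata directory (locations are up to you).
```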



--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5f98dc8f-55e9-46dc-84b2-4ee1c7adc868%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Thomas Menguy

Jan 4, 2018, 8:53:56 AM
to tesser...@googlegroups.com
Thanks a lot. I had seen the tutorial but was a bit confused: it is written to "remove" characters so that only the digits remain, and I was not sure which characters to remove (the whole of Unicode minus the digits?).
Anyway, thanks again for the answer. It would be awesome if you could dig up the command line ;)
BR

Sent from my iPhone

ShreeDevi Kumar

Jan 4, 2018, 8:59:40 AM
to tesser...@googlegroups.com
Yes, I had made training text with just digits.

Basically, this cuts the unicharset in the traineddata down to digits. It fine-tunes the existing best model on the chosen subset of characters and does not require many iterations.
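
A hedged sketch of what that fine-tuning step can look like, following the general shape of the fine-tuning section of the Tesseract 4.00 training wiki: extract the LSTM model from the best traineddata with `combine_tessdata -e`, then continue training with `lstmtraining`. All paths are hypothetical, the layout of the work directory is an assumption about what `tesstrain.sh` produced, and the iteration count is a placeholder rather than the value used for the digits traineddata in this thread.

```python
def extract_lstm_command(best_traineddata, work_dir):
    """Command that extracts the LSTM model from an existing traineddata."""
    return ["combine_tessdata", "-e", best_traineddata, f"{work_dir}/eng.lstm"]


def finetune_command(work_dir, best_traineddata, max_iterations=400):
    """Command that fine-tunes the extracted model on the reduced character set.

    max_iterations=400 is only a placeholder; tune it for your own data.
    """
    return [
        "lstmtraining",
        "--continue_from", f"{work_dir}/eng.lstm",
        "--old_traineddata", best_traineddata,            # original unicharset
        "--traineddata", f"{work_dir}/eng/eng.traineddata",  # reduced unicharset
        "--train_listfile", f"{work_dir}/eng.training_files.txt",
        "--model_output", f"{work_dir}/digits",
        "--max_iterations", str(max_iterations),
    ]


extract_cmd = extract_lstm_command("tessdata_best/eng.traineddata", "work")
finetune_cmd = finetune_command("work", "tessdata_best/eng.traineddata")
print(" ".join(extract_cmd))
print(" ".join(finetune_cmd))
```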


ShreeDevi Kumar

Jan 4, 2018, 11:20:09 AM
to tesser...@googlegroups.com, Thomas Menguy
I am attaching a zip file.

The files in langdata/eng are my modified version of training text and input files for punctuation and number formats. You can modify them further to match your requirements.

I could not find a saved script with the command I used. Instead, please see the attached engtrain.sh, which was posted by one of the users in the forum. You will need to modify it for the file locations on your system. If you know the font used in the images you need to OCR, you can train with just that font or similar fonts.
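
To illustrate the point about file locations and fonts, here is a minimal sketch that assembles a `tesstrain.sh` invocation restricted to a chosen font list. It is not the attached engtrain.sh, whose contents are not shown here; the directory names and the font name are hypothetical and need to be replaced with the ones on your system.

```python
def tesstrain_command(langdata_dir, tessdata_dir, output_dir, fonts,
                      fonts_dir="/usr/share/fonts"):
    """Build (but do not run) a tesstrain.sh command for LSTM line data."""
    if not fonts:
        raise ValueError("give at least one font name")
    return [
        "tesstrain.sh",
        "--lang", "eng",
        "--linedata_only",                # produce LSTM training data only
        "--noextract_font_properties",
        "--langdata_dir", langdata_dir,   # holds the modified training text
        "--tessdata_dir", tessdata_dir,
        "--fonts_dir", fonts_dir,
        "--output_dir", output_dir,
        "--fontlist", *fonts,             # one argument per font name
    ]


cmd = tesstrain_command(
    langdata_dir="langdata",      # hypothetical location
    tessdata_dir="tessdata",      # hypothetical location
    output_dir="work",            # hypothetical location
    fonts=["Arial"],              # hypothetical font matching your images
)
print(" ".join(cmd))
```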



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

engtrain.zip

Thomas Menguy

Jan 4, 2018, 1:24:44 PM
to ShreeDevi Kumar, tesser...@googlegroups.com
Thanks! It's really great that you took the time, and very much appreciated. With that level of information we'll be able to find our way :)

Which fonts did you use for your set? (You have a best and a fast one.)
 
Thanks again
Thomas

Sent from my iPhone

ShreeDevi Kumar

Jan 4, 2018, 1:35:08 PM
to Thomas Menguy, tesser...@googlegroups.com
Best and fast are both from the same check point. 

You have to use --convert_to_int together with --stop_training (both are lstmtraining options) to convert the model from floating point to integer.

for the exact syntax.

Since digits traineddata is not adding any characters, you will probably need fewer iterations.

I had created this traineddata in response to a post in the forum and had used number formats in training text and font similar to the sample image provided. 
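
The best/fast distinction described above can be sketched as two `lstmtraining` invocations on the same checkpoint, one without and one with `--convert_to_int`. This is only an illustration: the checkpoint and traineddata paths are hypothetical, and the exact syntax should be checked against the Tesseract training documentation.

```python
def finalize_command(checkpoint, starter_traineddata, output, fast=False):
    """Build the command that turns a training checkpoint into a traineddata.

    With fast=True the model is also converted from floating point to integer.
    """
    cmd = [
        "lstmtraining",
        "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", starter_traineddata,
        "--model_output", output,
    ]
    if fast:
        cmd.append("--convert_to_int")
    return cmd


# Hypothetical paths: both outputs come from the same checkpoint.
best_cmd = finalize_command("work/digits_checkpoint", "work/eng/eng.traineddata",
                            "out/best/digits.traineddata")
fast_cmd = finalize_command("work/digits_checkpoint", "work/eng/eng.traineddata",
                            "out/fast/digits.traineddata", fast=True)
print(" ".join(best_cmd))
print(" ".join(fast_cmd))
```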

