How to train Tesseract to detect superscripts and subscripts


fady taher

Jul 3, 2019, 8:33:32 AM
to tesseract-ocr
I am trying to detect a superscript like the one in the attached image. I added "Cr⁶⁺" to the training set about 15 times, but it still couldn't be recognized correctly.

The source file can be found at



Capture.JPG

Shree Devi Kumar

Jul 3, 2019, 8:41:24 AM
to tesser...@googlegroups.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/8bf52ee3-eb0e-4404-8bd6-49295bf87c4f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



fady taher

Jul 9, 2019, 7:03:14 AM
to tesser...@googlegroups.com
I can see that you have mentioned 
"IT IS NOT REQUIRED TO RUN THIS SCRIPT AS THE OUTPUT FOLDERS ARE PROVIDED AS A SUBMODULE IN THE REPO. Use git submodule update --init to download the files (approx 600MB)."
So, should I just use the eng.traineddata found in the tessdata folder?
 

Shree Devi Kumar

Jul 9, 2019, 7:14:57 AM
to tesser...@googlegroups.com
If you use the submodule, you will save the time taken to run the 8-makedata_layernew.sh script. However, if you have modified training_text or want to check out the full process, run the script.


fady taher

Jul 9, 2019, 7:31:49 AM
to tesser...@googlegroups.com
Dear Shree, thanks for your quick response. I gave the submodule a try. It returned Cr⁶⁶ where it should have been Cr⁶⁺. Any ideas whether this is solvable?


Regards

Shree Devi Kumar

Jul 9, 2019, 7:40:36 AM
to tesser...@googlegroups.com
I don't think I had any (or enough) plus superscript in my training_text.

Treat this as an example and train as per the data you expect.
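
One way to get more plus superscripts into training_text is to generate lines that place such ions in varied contexts, rather than repeating a single token. The following is only a rough sketch: the ion list, the sentence templates, and the function names are hypothetical and should be replaced with material from the documents you actually need to recognize.

```python
import random

# Map ASCII digits and signs to their Unicode superscript forms.
SUPERSCRIPT = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")

# Hypothetical sample data: replace with ions and phrases from your own documents.
IONS = [("Cr", "6+"), ("Cr", "3+"), ("Fe", "2+"), ("Fe", "3+"), ("Cu", "2+"), ("Na", "+")]
TEMPLATES = [
    "The sample contained {a} at low concentration.",
    "Reduction of {a} to {b} was observed.",
    "{a} ions were measured in the solution.",
]


def to_ion(symbol, charge):
    """Build an ion string such as Cr⁶⁺ from ("Cr", "6+")."""
    return symbol + charge.translate(SUPERSCRIPT)


def make_lines(n, seed=0):
    """Return n training-text lines, each containing at least one ion."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        a, b = rng.sample(IONS, 2)
        # Unused keyword arguments are ignored by str.format.
        lines.append(template.format(a=to_ion(*a), b=to_ion(*b)))
    return lines


for line in make_lines(5):
    print(line)
```

The generated lines would be appended to the existing training_text before rerunning the data preparation script.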

fady taher

Jul 9, 2019, 7:42:15 AM
to tesser...@googlegroups.com
Will try it and report back, thanks a lot.

fady taher

Jul 10, 2019, 10:31:47 AM
to tesser...@googlegroups.com
Should I worry about the warning below?

Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from ../tesstutorial/eng_layer_eng/eng.lstm
Appending a new network to an old one!!Warning: given outputs 111 not equal to unicharset of 136.
Num outputs,weights in Series:
  Lfx256:256, 361472
  Fc136:136, 34952
Total weights = 396424
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys64Lfx96Lrx96Lfx256Fc136] from request [Lfx256 O1c111]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.001, momentum=0.5
null char=135

Shree Devi Kumar

Jul 10, 2019, 10:56:03 AM
to tesser...@googlegroups.com
No. It just means that the new unicharset you are training on has 25 (136 - 111) more characters than the network you are continuing from.

given outputs 111 not equal to unicharset of 136. 
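
To see which characters account for the difference, the old and new unicharset files can be compared. This is a minimal sketch, assuming the plain-text unicharset layout in which the first line holds the entry count and each following line starts with the character (or a special token such as NULL) followed by space-separated properties. The helper names and the toy file contents are made up for illustration.

```python
def read_unichars(unicharset_text):
    """Return the set of characters listed in a unicharset file's text."""
    lines = unicharset_text.splitlines()
    count = int(lines[0])
    # The first field of each entry is the character or a special token.
    return {line.split(" ", 1)[0] for line in lines[1:count + 1] if line}


def new_unichars(old_text, new_text):
    """List characters present in the new unicharset but not in the old one."""
    return sorted(read_unichars(new_text) - read_unichars(old_text))


# Toy contents, not real unicharset files.
old = "3\nNULL 0\nC 5\nr 3\n"
new = "5\nNULL 0\nC 5\nr 3\n⁶ 8\n⁺ 0\n"
print(new_unichars(old, new))
```

With real files, the two strings would be read from the unicharset of the base model and the one generated for the new training data.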

 


fady taher

Jul 11, 2019, 9:15:52 AM
to tesser...@googlegroups.com
So, I added "Cr⁶⁺" 66 times, but I am getting "Cr³+" instead. Should I increase the training data with more samples?

fady taher

Jul 14, 2019, 9:13:40 AM
to tesser...@googlegroups.com
Dear Shree, I am having a problem training the model. When I added more samples, the results got worse. Is there a best practice for adding training data?

Regards

shree

Jul 14, 2019, 11:36:03 PM
to tesseract-ocr
You can try training from scratch. Use training text and a font similar to what you need to recognize.
Alternatively, try ocrd-train with line images and their ground truth.
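
For the line-image route, each image needs a matching transcription file. The sketch below assumes the ocrd-train convention of a single-line plain-text file with the same base name as the line image and a .gt.txt extension. The function name, the image name, and the transcription are hypothetical.

```python
import tempfile
from pathlib import Path


def write_ground_truth(image_dir, transcriptions):
    """Write one .gt.txt file per line image; return the paths written."""
    image_dir = Path(image_dir)
    written = []
    for image_name, text in transcriptions.items():
        # line_001.tif -> line_001.gt.txt, stored next to the image.
        gt_path = image_dir / (Path(image_name).stem + ".gt.txt")
        gt_path.write_text(text.strip() + "\n", encoding="utf-8")
        written.append(gt_path)
    return written


# Demo in a temporary directory; real use would point at the ground-truth folder.
with tempfile.TemporaryDirectory() as demo_dir:
    for path in write_ground_truth(demo_dir, {"line_001.tif": "Cr⁶⁺ in solution"}):
        print(path.name, "->", path.read_text(encoding="utf-8").strip())
```

The transcription should contain the superscript characters exactly as you want them recognized, since that text is what the model is trained to output.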

Kyle Foley

Jul 15, 2019, 12:00:47 AM
to tesser...@googlegroups.com
Actually, on second thought, I am going to have to learn how to use the training feature anyway, so I might as well learn it now. Still, I want to know how many images I need to train it with first. Do you know the answer to this? How many images per new character would I need before I get reliable results?

On Sun, Jul 14, 2019 at 8:47 PM Kyle Foley <kylefo...@gmail.com> wrote:

That's too advanced for me.  I'm not up to that stage yet.  I've never trained the software to recognize images.  Besides, how many sample images would I need?  5? 500?  If it's only 5 then I suppose I can do that.  But if it's some insanely huge number then I don't have the time. 


fady taher

Jul 15, 2019, 3:07:32 PM
to tesser...@googlegroups.com
After a few trials, it could recognize the correct values (6+), but not as superscript :)
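
To tell "right characters, wrong position" apart from genuinely wrong characters when checking results, the expected and recognized strings can be compared after Unicode NFKC normalization, which maps superscript digits and signs to their plain forms. This is just an illustrative check; the function name and category labels are made up.

```python
import unicodedata


def compare(expected, recognized):
    """Classify an OCR result against the expected text."""
    if recognized == expected:
        return "exact"
    # NFKC folds superscripts such as ⁶ and ⁺ into plain 6 and +.
    if unicodedata.normalize("NFKC", recognized) == unicodedata.normalize("NFKC", expected):
        return "position-only"
    return "different"


for result in ["Cr⁶⁺", "Cr6+", "Cr³+", "Cr⁶⁶"]:
    print(result, compare("Cr⁶⁺", result))
```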

Kyle Foley

Jul 15, 2019, 3:42:58 PM
to tesser...@googlegroups.com
Thanks, I really appreciate that.
