How to train Tesseract to detect superscripts and subscripts

fady taher

Jul 3, 2019, 8:33:32 AM

to tesseract-ocr

I am trying to detect a superscript like the one in the attached image. I added "Cr⁶⁺" to the training text about 15 times, but it still is not recognized correctly.

The source file can be found at

http://download.siliconexpert.com/pdfs2/2019/6/4/10/44/32/882174/pns_/manual/ecqe2394kt_rohs.pdf

Capture.JPG

Shree Devi Kumar

Jul 3, 2019, 8:41:24 AM

to tesser...@googlegroups.com

See https://github.com/Shreeshrii/tess4training#additional-training-scripts---replace-top-layer-bash

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/8bf52ee3-eb0e-4404-8bd6-49295bf87c4f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


fady taher

Jul 9, 2019, 7:03:14 AM

to tesser...@googlegroups.com

I can see that you have mentioned

"IT IS NOT REQUIRED TO RUN THIS SCRIPT AS THE OUTPUT FOLDERS ARE PROVIDED AS A SUBMODULE IN THE REPO. Use git submodule update --init to download the files (approx 600MB)."

So, should I just use the eng.traineddata found in the tessdata folder?

Shree Devi Kumar

Jul 9, 2019, 7:14:57 AM

to tesser...@googlegroups.com

If you use the submodule, you save the time it takes to run the 8-makedata_layernew.sh script. However, if you have modified training_text or want to check out the full process, run the script.


fady taher

Jul 9, 2019, 7:31:49 AM

to tesser...@googlegroups.com

Dear Shree, thanks for your quick response. I tried the submodule, and it produced Cr⁶⁶ where it should have been Cr⁶⁺. Any idea whether this is solvable?

Regards


Shree Devi Kumar

Jul 9, 2019, 7:40:36 AM

to tesser...@googlegroups.com

I don't think I had any (or enough) plus superscript in my training_text.

Treat this as an example and train as per the data you expect.
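One way to act on this advice is to generate training text in which the superscript plus and minus signs appear in many different contexts, rather than repeating a single token such as "Cr⁶⁺". The sketch below is only an illustration: the element list, charge list, and context phrases are hypothetical placeholders, and the function names are not part of tess4training.

```python
import random

# Map ASCII digits and signs to their Unicode superscript forms.
SUPERSCRIPTS = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")

# Hypothetical vocabulary: replace with what your own documents contain.
ELEMENTS = ["Cr", "Fe", "Cu", "Pb", "Hg", "Cd", "Na", "Ca"]
CHARGES = ["+", "2+", "3+", "4+", "6+", "-", "2-"]
CONTEXTS = ["{} content", "Max. {} level", "{} compounds", "Free of {}"]


def ion(element: str, charge: str) -> str:
    """Return the element followed by its charge written in superscript."""
    return element + charge.translate(SUPERSCRIPTS)


def make_lines(n: int, seed: int = 0) -> list[str]:
    """Build n training-text lines mixing elements, charges and contexts."""
    rng = random.Random(seed)
    return [
        rng.choice(CONTEXTS).format(ion(rng.choice(ELEMENTS), rng.choice(CHARGES)))
        for _ in range(n)
    ]


sample_lines = make_lines(10)
print("\n".join(sample_lines))
```

The generated lines could be appended to the training_text file before the data-preparation script is run. How many such lines are needed is not something this thread establishes.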


fady taher

Jul 9, 2019, 7:42:15 AM

to tesser...@googlegroups.com

I will try it and report back. Thanks a lot.


fady taher

Jul 10, 2019, 10:31:47 AM

to tesser...@googlegroups.com

Should I be worried about the warnings below?

Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from ../tesstutorial/eng_layer_eng/eng.lstm
Appending a new network to an old one!!Warning: given outputs 111 not equal to unicharset of 136.
Num outputs,weights in Series:
Lfx256:256, 361472
Fc136:136, 34952
Total weights = 396424
Built network:[1,36,0,1[C3,3Ft16]Mp3,3Lfys64Lfx96Lrx96Lfx256Fc136] from request [Lfx256 O1c111]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.001, momentum=0.5
null char=135
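The weight counts in this log can be reproduced by hand, which helps confirm that the new top layer was sized for the 136-entry unicharset. The sketch below assumes the usual parameter count for an LSTM layer (four gates, each connected to the inputs, the recurrent state, and a bias) and for a fully connected layer with one bias per output. The helper names are made up for illustration.

```python
def lstm_weights(n_inputs: int, n_states: int) -> int:
    """Weights in one LSTM layer: 4 gates, each seeing inputs + states + bias."""
    return 4 * n_states * (n_inputs + n_states + 1)


def fc_weights(n_inputs: int, n_outputs: int) -> int:
    """Weights in a fully connected layer, including one bias per output."""
    return n_outputs * (n_inputs + 1)


lfx256 = lstm_weights(96, 256)  # fed by the preceding Lrx96 layer
fc136 = fc_weights(256, 136)    # one output per unicharset entry
print(lfx256, fc136, lfx256 + fc136)  # 361472 34952 396424
```

Under these assumptions the three numbers match the "Lfx256", "Fc136", and "Total weights" lines of the log.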

Shree Devi Kumar

Jul 10, 2019, 10:56:03 AM

to tesser...@googlegroups.com

No. It just means that the new unicharset you are training on has 25 (136 - 111) more characters than the original one.

given outputs 111 not equal to unicharset of 136.
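To see which characters account for the difference, the two unicharset files can be compared directly. This is a rough sketch that assumes the plain-text unicharset layout in which the first line holds the entry count and every following line starts with the character itself. The toy strings below stand in for real files and carry shortened, made-up property fields.

```python
def read_unichars(text: str) -> list[str]:
    """Return the first field of each entry line in a unicharset text."""
    lines = text.splitlines()
    count = int(lines[0])  # first line is assumed to be the entry count
    return [line.split(" ", 1)[0] for line in lines[1 : count + 1] if line]


def added_unichars(old_text: str, new_text: str) -> list[str]:
    """List entries present in the new unicharset but not in the old one."""
    old = set(read_unichars(old_text))
    return [u for u in read_unichars(new_text) if u not in old]


# Toy data only: real unicharset lines carry more property fields.
OLD = "3\nNULL 0\na 3\nb 3\n"
NEW = "5\nNULL 0\na 3\nb 3\n⁶ 8\n⁺ 0\n"
new_chars = added_unichars(OLD, NEW)
print(len(new_chars), new_chars)  # 2 ['⁶', '⁺']
```

With real files, the contents would be read with something like `open(path, encoding="utf-8").read()`, where the paths are those of the original and the newly generated unicharset.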


fady taher

Jul 11, 2019, 9:15:52 AM

to tesser...@googlegroups.com

So, I added "Cr⁶⁺" 66 times, but I am getting "Cr³+" instead. Should I add more samples to the training data?


fady taher

Jul 14, 2019, 9:13:40 AM

to tesser...@googlegroups.com

Dear Shree, I am having a problem training the model. When I added more samples, the results got worse. Is there a best practice for adding training data?

Regards

shree

Jul 14, 2019, 11:36:03 PM

to tesseract-ocr

You can try training from scratch. Use training text and font similar to what you need to recognize.
Alternately, try ocrd-train with line images with ground truth.
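For the ocrd-train route, each line image needs a matching transcription. The sketch below checks a folder for unmatched files before training. It assumes the naming convention of a `.tif` line image paired with a `.gt.txt` transcription of the same base name; the function name and the demo folder are hypothetical.

```python
import tempfile
from pathlib import Path


def unpaired_lines(folder: str) -> tuple[list[str], list[str]]:
    """Return (images lacking a transcription, transcriptions lacking an image)."""
    root = Path(folder)
    images = {p.name[: -len(".tif")] for p in root.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return sorted(images - texts), sorted(texts - images)


# Demo on a throwaway folder with one complete pair and one orphan image.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("line1.tif", "line1.gt.txt", "line2.tif"):
        (Path(tmp) / name).touch()
    missing_text, missing_image = unpaired_lines(tmp)

print(missing_text, missing_image)  # ['line2'] []
```

Each `.gt.txt` file would hold the exact text of its line, including characters such as ⁶ and ⁺ where they occur in the image.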

Kyle Foley

Jul 15, 2019, 12:00:47 AM

to tesser...@googlegroups.com

Actually, on second thought, I am going to have to learn how to use the training feature anyway, so I might as well learn it now. Still, I want to know how many images I need to train it with first. Do you know the answer to this? How many images per new character would I need before I get reliable results?

On Sun, Jul 14, 2019 at 8:47 PM Kyle Foley <kylefo...@gmail.com> wrote:

That's too advanced for me. I'm not up to that stage yet. I've never trained the software to recognize images. Besides, how many sample images would I need? 5? 500? If it's only 5 then I suppose I can do that. But if it's some insanely huge number then I don't have the time.

On Sun, Jul 14, 2019 at 8:36 PM shree <shree...@gmail.com> wrote:

You can try training from scratch. Use training text and font similar to what you need to recognize.
Alternately, try ocrd-train with line images with ground truth.


fady taher

Jul 15, 2019, 3:07:32 PM

to tesser...@googlegroups.com

After a few trials, it could recognize the correct value, 6+, but not as a superscript :)
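If the model reliably returns the right characters on the baseline, one possible workaround (not something proposed in this thread) is to restore the superscript form in a post-processing step. The sketch below rewrites a charge that directly follows a known element symbol. The element list and the function name are hypothetical, and the pattern is deliberately narrow to avoid touching ordinary text.

```python
import re

# Map ASCII digits and signs to their Unicode superscript forms.
SUPERSCRIPTS = str.maketrans("0123456789+-", "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻")

# Hypothetical element list; an optional digit followed by a sign is the charge.
ION = re.compile(r"\b(Cr|Fe|Cu|Pb|Hg|Cd)(\d?[+-])(?!\w)")


def restore_superscripts(text: str) -> str:
    """Rewrite charges such as 'Cr6+' as 'Cr⁶⁺' in OCR output."""
    return ION.sub(lambda m: m.group(1) + m.group(2).translate(SUPERSCRIPTS), text)


print(restore_superscripts("Cr6+ and Pb2+ below 0.1%"))  # Cr⁶⁺ and Pb²⁺ below 0.1%
```

The negative lookahead keeps hyphenated words such as "Cr-based" unchanged, because the sign there is followed by a letter.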


Kyle Foley

Jul 15, 2019, 3:42:58 PM

to tesser...@googlegroups.com

Thanks, I really appreciate that.
