Tesseract Performance

Soumik Ranjan Dasgupta

Dec 24, 2020, 10:06:20 AM

to tesseract-ocr

Hi everyone,

I wanted to fine-tune the ben.traineddata model using some old texts that were apparently printed with movable type. I have roughly 1k lines of text and tried the normal fine-tuning approach with around 25k iterations.

What surprised me most was that even after packing the traineddata (character error was around 4%) and testing on an unseen image, the performance was exactly the same. Not a single character was different!

You can find the traineddata, training data, the logs and the source code at this link:

https://github.com/srdg/unarchived_ben_tess/releases/tag/v0.0.4-alpha

Can anyone tell me what I am doing wrong here? Do I need to change a training parameter, increase my training data, or do something else entirely?

Best regards,

Soumik

Lorenzo Bolzani

Dec 24, 2020, 12:08:31 PM

to tesser...@googlegroups.com

If the results are exactly the same, the most likely explanation is that you are still using the old model.

Try moving or renaming the new model and see if anything changes.

Did you see an improvement during training? Mean rms, char train, word train, etc.
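A minimal Python sketch of one way to rule out the "same file" case: hash the stock model and the fine-tuned one and compare. The function names are made up for illustration, and any paths you pass in (for example a stock ben.traineddata and your newly packed one) are your own. This only tells you whether the two files differ, not which one tesseract actually loads.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_model(original, finetuned):
    """True if both traineddata files are byte-for-byte identical."""
    return sha256_of(original) == sha256_of(finetuned)
```

If `same_model(...)` returns True, the "new" model is just a copy of the old one. If it returns False and the OCR output is still identical, check which directory tesseract reads its models from (the `--tessdata-dir` option or the `TESSDATA_PREFIX` environment variable).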

Bye

Lorenzo

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/1fc044d1-b0ae-45d5-9041-e6fbf8ec5089n%40googlegroups.com.

Shree Devi Kumar

Dec 24, 2020, 10:01:13 PM

to tesseract-ocr

>testing an unseen image, the performance was exactly the same.

Can you share the image (preferably a page) and expected result?

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Dec 26, 2020, 8:07:43 AM

to tesseract-ocr

Soumik,

I used your groundtruth and trained using ben as the START_MODEL. I got the best results on the validation set of images at around 5000 iterations. See the attached accuracy reports and CER graph.

tif.ben_2.494_3422_5200.wordacc.report.txt

tif.ben_2.494_3422_5200.acc.report.txt

ben-validate-cer.png

Soumik Ranjan Dasgupta

Jan 1, 2021, 1:39:21 AM

to tesser...@googlegroups.com

Hi Shreeshrii,

Can you please tell me the training command you used? Also, how can I create the graphs and the other reports?

Shree Devi Kumar

Jan 1, 2021, 8:12:43 AM

to tesseract-ocr

nohup make MODEL_NAME=ben START_MODEL=ben LANG_TYPE=Indic GROUND_TRUTH_DIR=data/ben-ground-truth TESSDATA=$HOME/tessdata_best DEBUG_INTERVAL=-1 training MAX_ITERATIONS=50000 >> data/ben.log &

Graphs are created using the training log file as well as the validation log files. Some of these require PRs which have not yet been merged into the tesstrain repo.
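As a rough illustration of the log-parsing step (not the scripts from those PRs), here is a Python sketch that pulls the error rates out of a training log. It assumes progress lines of the shape `At iteration 100/100/100, Mean rms=..., delta=..., char train=96.314%, word train=100%, ...`; other Tesseract versions may label the fields differently, so adjust the pattern to your own log. The function name is made up for illustration.

```python
import re

# Assumed shape of an lstmtraining progress line (adjust if your log differs):
# At iteration 100/100/100, Mean rms=4.5%, delta=19.0%, char train=96.3%, word train=100%, ...
LINE_RE = re.compile(
    r"At iteration (\d+)/\d+/\d+,.*?char train=([\d.]+)%.*?word train=([\d.]+)%"
)


def parse_training_log(text):
    """Return (first_counter, char_error_pct, word_error_pct) for each progress line."""
    points = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            points.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return points
```

The resulting list can be fed to any plotting tool. Only the first of the three iteration counters is kept here; which counter to plot against is a choice, not something the log format dictates.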

See

https://github.com/tesseract-ocr/tesstrain/pulls

For Evaluation reports, I used

https://github.com/eddieantonio/ocreval

Shree Devi Kumar

Jan 2, 2021, 12:31:27 AM

to tesseract-ocr

Soumik,

I have uploaded the bash scripts and the generated reports and graphs to `ben` branch in my fork of tesstrain repo. See

https://github.com/Shreeshrii/tesstrain/tree/ben

and

https://github.com/Shreeshrii/tesstrain/commit/a6474ef2dbbac47803d13b6f92fdcf8c9dc3107b

Results for the validation data (not seen by lstmtraining for either training or eval) show an improvement over both ben and script/Bengali.

To improve results further, check the groundtruth transcription for any missing words, normalize the text, and try with some more training data.
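A small Python sketch of what normalizing the text could look like for a groundtruth line. It assumes Unicode NFC plus whitespace cleanup is the normalization wanted; that choice is an assumption for illustration and is not stated in this thread. The function name is hypothetical.

```python
import unicodedata


def normalize_line(text):
    """Apply Unicode NFC, then collapse runs of whitespace to single spaces."""
    composed = unicodedata.normalize("NFC", text)
    return " ".join(composed.split())
```

Running every `.gt.txt` line through the same function before training keeps visually identical strings from ending up as different code point sequences in the unicharset.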

Soumik Ranjan Dasgupta

Jan 7, 2021, 5:15:07 AM

to tesseract-ocr

Hi Shreeshrii,

I took your command exactly as it is and ran it (after making sure the tessdata_best directory is present in $HOME with the best ben.traineddata) and ran into an extremely weird error.

Here is the log:

find data/ben-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/ben/all-gt"
combine_tessdata -u /root/tessdata_best/ben.traineddata data/ben/ben
Version string:4.00.00alpha:ben:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx64Lrx64Lfx512O1c1]
0:config:size=377, offset=192
17:lstm:size=10605707, offset=569
18:lstm-punc-dawg:size=3154, offset=10606276
19:lstm-word-dawg:size=427618, offset=10609430
20:lstm-number-dawg:size=426, offset=11037048
21:lstm-unicharset:size=6866, offset=11037474
22:lstm-recoder:size=1003, offset=11044340
23:version:size=80, offset=11045343
Extracting tessdata components from /root/tessdata_best/ben.traineddata
Wrote data/ben/ben.config
Wrote data/ben/ben.lstm
Wrote data/ben/ben.lstm-punc-dawg
Wrote data/ben/ben.lstm-word-dawg
Wrote data/ben/ben.lstm-number-dawg
Wrote data/ben/ben.lstm-unicharset
Wrote data/ben/ben.lstm-recoder
Wrote data/ben/ben.version
unicharset_extractor --output_unicharset "data/ben/my.unicharset" --norm_mode 2 "data/ben/all-gt"
Bad box coordinates in boxfile string! কি জানি কেন প্রদ্যুম্নের বার বার মনে আসছিল সেই জীর্ণ পরিচ্ছদপরা
Extracting unicharset from plain text file data/ben/all-gt
Wrote unicharset file data/ben/my.unicharset
merge_unicharsets data/ben/ben.lstm-unicharset data/ben/my.unicharset "data/ben/unicharset"
Loaded unicharset of size 111 from file data/ben/ben.lstm-unicharset
Loaded unicharset of size 76 from file data/ben/my.unicharset
Wrote unicharset file data/ben/unicharset.
PYTHONIOENCODING=utf-8 python3 generate_wordstr_box.py -i "data/ben-ground-truth/24-022.tif" -t "data/ben-ground-truth/24-022.gt.txt" > "data/ben-ground-truth/24-022.box"
Traceback (most recent call last):
File "generate_wordstr_box.py", line 7, in <module>
import bidi.algorithm
ModuleNotFoundError: No module named 'bidi'
Makefile:207: recipe for target 'data/ben-ground-truth/24-022.box' failed
make: *** [data/ben-ground-truth/24-022.box] Error 1

I should mention I double checked the 24-022.gt.txt and 24-022.tif files and both of them are valid. Any reason why this might be happening? How can I fix this?

Shree Devi Kumar

Jan 7, 2021, 6:56:17 AM

to tesseract-ocr

> ModuleNotFoundError: No module named 'bidi'

Install python-bidi.
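To catch this kind of failure before starting a long run, a short Python check along these lines can list missing modules up front. The helper name is made up for illustration; `bidi` is the only module named in the traceback above, so anything else you pass in is your own guess at what the scripts need.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_modules(["bidi"])` returns `["bidi"]` until the package is installed; the PyPI package that provides it is `python-bidi` (`pip install python-bidi`).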

Soumik Ranjan Dasgupta

Jan 7, 2021, 9:39:34 AM

to tesser...@googlegroups.com

Hi Shree,

I installed the bidi module. The error went away, but training still does not run. Please find the log and training script attached.

FYI, I am using the Makefile from the master branch. Do I need to switch to the Makefile from the ben branch instead?

train.sh

ben.log

Soumik Ranjan Dasgupta

Jan 7, 2021, 9:43:56 AM

to tesser...@googlegroups.com

Sorry, I attached the wrong log file. Please find the new one attached.

ben.log.txt

Shree Devi Kumar

Jan 7, 2021, 10:40:25 AM

to tesseract-ocr

A segmentation fault usually means you are not using the tessdata_best model as START_MODEL.

Shree Devi Kumar

Jan 7, 2021, 12:13:02 PM

to tesseract-ocr

Or you may have an old version of data/ben/checkpoints/ben_checkpoint left over from an earlier run.
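One cautious way to test the stale-checkpoint theory is to rename the old checkpoint out of the way rather than delete it. The Python sketch below does only that; the helper name and the `.bak` naming are made up for illustration, and whether the training run recreates the path on its own is an assumption you should verify.

```python
import time
from pathlib import Path


def set_aside(path, suffix=None):
    """Rename a file or directory out of the way.

    Returns the new path, or None if nothing exists at `path`.
    """
    source = Path(path)
    if not source.exists():
        return None
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"{source.name}.{suffix}.bak")
    source.rename(target)
    return target
```

For instance, `set_aside("data/ben/checkpoints/ben_checkpoint")` before rerunning the make command keeps the old file available in case it turns out not to be the problem.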