fine tuning on images

roei shlezinger

Mar 14, 2024, 6:05:02 AM

to tesseract-ocr

Hello, I have relatively clear images in Hebrew, and Tesseract produces reasonable but not perfect results on them. I thought about continuing to train the model to improve the results, but ran into a problem. Here is the command I run:

"bash-4.4# make training MODEL_NAME=test11 GROUND_TRUTH_DIR=/home/tesstrain/data/files START_MODEL=heb PSM=7 DPI=96 DEBUG_INTERVAL=-1 MAX_ITERATIONS=100"

While training I get the following results. Note that the character error rate is over 100%:
"At iteration 10/10/10, Mean rms=11.396%, delta=111.114%, char train=146.702%, word train=100%, skip ratio=0%, New worst char error = 146.702 wrote checkpoint."

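For context on how a character error above 100% can arise: when the error count is normalised by the length of the ground truth, an OCR result that is much longer than the truth gives a ratio above 1. Below is a minimal Python sketch of such a metric. The function name `bag_char_error` is hypothetical, and this is a simplified illustration under that assumption, not Tesseract's actual implementation.

```python
from collections import Counter


def bag_char_error(truth: str, ocr: str) -> float:
    """Sum per-character count differences between OCR output and truth,
    normalised by the number of ground-truth characters (illustrative only)."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    truth_counts = Counter(truth)
    ocr_counts = Counter(ocr)
    # Absolute count difference for every character seen in either string.
    errors = sum(
        abs(truth_counts[ch] - ocr_counts[ch])
        for ch in truth_counts.keys() | ocr_counts.keys()
    )
    return errors / len(truth)


if __name__ == "__main__":
    # Six surplus characters against three true ones: prints 200%
    print(f"{bag_char_error('abc', 'abcxyzxyz'):.0%}")
```

Under this kind of normalisation, long garbage output like the BEST OCR TEXT lines in the log further down would push the ratio well past 100%.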
I have a hypothesis as to why this happens. During training I get the output below, and the important lines in it are these:
"PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "/home/tesstrain/data/files/MR_1.1.tif" -t "/home/tesstrain/data/files/MR_1.1.gt.txt" > "/home/tesstrain/data/files/MR_1.1.box"
+ tesseract /home/tesstrain/data/files/MR_1.1.tif /home/tesstrain/data/files/MR_1.1 --psm 7 lstm.train"
This gives me, in the GROUND_TRUTH_DIR folder, an additional file with the .lstmf extension and an additional file with the .txt extension. The .txt file is empty except for one up-arrow character. It seems that tesseract is invoked during training without a Hebrew language parameter and therefore fails to recognize the text. I'm not sure that is the problem, but I am sure the training failed. Does anyone have an idea what I'm doing wrong? I would appreciate any help. Thanks, Roy.
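One way to see how many line images are affected is to scan the ground-truth folder for recognition outputs that hold no visible text. The following is a minimal Python sketch. It assumes the outputs sit next to the images as `<name>.txt` and that the "up arrow" is a control character such as a form feed (both are assumptions, not confirmed in this thread). `find_blank_outputs` is a hypothetical helper name.

```python
from __future__ import annotations

from pathlib import Path


def find_blank_outputs(ground_truth_dir: str) -> list[Path]:
    """Return .txt files (excluding *.gt.txt) that contain no visible text."""
    blank = []
    for path in sorted(Path(ground_truth_dir).glob("*.txt")):
        if path.name.endswith(".gt.txt"):
            continue  # ground-truth transcriptions, not recognition output
        text = path.read_text(encoding="utf-8", errors="replace")
        # Whitespace and control characters (e.g. form feed) count as empty.
        if not any(ch.isprintable() and not ch.isspace() for ch in text):
            blank.append(path)
    return blank
```

Calling `find_blank_outputs("/home/tesstrain/data/files")` would list every output file that is effectively empty. If nearly all of them are, that would be consistent with the recognition step not reading the Hebrew text.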
Full output:

bash-4.4# make training MODEL_NAME=test4 GROUND_TRUTH_DIR=/home/tesstrain/data/files START_MODEL=heb PSM=7 DPI=96 DEBUG_INTERVAL=-1 MAX_ITERATIONS=100
find -L /home/tesstrain/data/files -name '*.gt.txt' | xargs paste -s > "data/test4/all-gt"
combine_tessdata -u /home/tesstrain/usr/share/tessdata/heb.traineddata data/heb/test4
Extracting tessdata components from /home/tesstrain/usr/share/tessdata/heb.traineddata
Wrote data/heb/test4.lstm
Wrote data/heb/test4.lstm-punc-dawg
Wrote data/heb/test4.lstm-word-dawg
Wrote data/heb/test4.lstm-number-dawg
Wrote data/heb/test4.lstm-unicharset
Wrote data/heb/test4.lstm-recoder
Wrote data/heb/test4.version
Version string:4.00.00alpha:heb:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx192O1c1]
17:lstm:size=3022651, offset=192
18:lstm-punc-dawg:size=1378, offset=3022843
19:lstm-word-dawg:size=673826, offset=3024221
20:lstm-number-dawg:size=1298, offset=3698047
21:lstm-unicharset:size=4023, offset=3699345
22:lstm-recoder:size=625, offset=3703368
23:version:size=80, offset=3703993
unicharset_extractor --output_unicharset "data/test4/my.unicharset" --norm_mode 2 "data/test4/all-gt"
Bad box coordinates in boxfile string! ויצעק משה אל יהוה על דבר הצפרדעים אשר
Extracting unicharset from plain text file data/test4/all-gt
Wrote unicharset file data/test4/my.unicharset
merge_unicharsets data/heb/test4.lstm-unicharset data/test4/my.unicharset "data/test4/unicharset"
Loaded unicharset of size 69 from file data/heb/test4.lstm-unicharset
Loaded unicharset of size 30 from file data/test4/my.unicharset
Wrote unicharset file data/test4/unicharset.
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "/home/tesstrain/data/files/MR_1.0.tif" -t "/home/tesstrain/data/files/MR_1.0.gt.txt" > "/home/tesstrain/data/files/MR_1.0.box"
+ tesseract /home/tesstrain/data/files/MR_1.0.tif /home/tesstrain/data/files/MR_1.0 --psm 7 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "/home/tesstrain/data/files/MR_1.1.tif" -t "/home/tesstrain/data/files/MR_1.1.gt.txt" > "/home/tesstrain/data/files/MR_1.1.box"
+ tesseract /home/tesstrain/data/files/MR_1.1.tif /home/tesstrain/data/files/MR_1.1 --psm 7 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Page 1
Warning: Invalid resolution 0 dpi. Using 70 instead.
PYTHONIOENCODING=utf-8 python3 generate_line_box.py -i "/home/tesstrain/data/files/MR_1.10.tif" -t "/home/tesstrain/data/files/MR_1.10.gt.txt" > "/home/tesstrain/data/files/MR_1.10.box"
+ tesseract /home/tesstrain/data/files/MR_1.10.tif /home/tesstrain/data/files/MR_1.10 --psm 7 lstm.train
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
combine_lang_model \
--input_unicharset data/test14/unicharset \
--script_dir data \
--numbers data/test14/test14.numbers \
--puncs data/test14/test14.punc \
--words data/test14/test14.wordlist \
--output_dir data \
\
--lang test14
Failed to read data from: data/test14/test14.wordlist
Failed to read data from: data/test14/test14.punc
Failed to read data from: data/test14/test14.numbers
Loaded unicharset of size 69 from file data/test14/unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 53 = ְ
Warning: properties incomplete for index 54 = ַ
Warning: properties incomplete for index 55 = ָ
Warning: properties incomplete for index 56 = ּ
Warning: properties incomplete for index 59 = ִ
Warning: properties incomplete for index 62 = ֶ
Config file is optional, continuing...
Failed to read data from: data/test14/test14.config
Null char=2
lstmtraining \
--debug_interval -1 \
--traineddata data/test14/test14.traineddata \
--old_traineddata /home/tesstrain/usr/share/tessdata/heb.traineddata \
--continue_from data/heb/test14.lstm \
--learning_rate 0.0001 \
--model_output data/test14/checkpoints/test14 \
--train_listfile data/test14/list.train \
--eval_listfile data/test14/list.eval \
--max_iterations 100 \
--target_error_rate 0.01
Loaded file data/heb/test14.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 69 to 68!
Num (Extended) outputs,weights in Series:
1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx192:192, 221952
Fc68:68, 13124
Total weights = 377508
Previous null char=2 mapped to 67
Continuing from data/heb/test14.lstm
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.0.15.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.4.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_4.1.4.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_1.1.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.5.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.37.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.0.5.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.0.25.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.0.1.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_2.11.lstmf
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_4.1.33.lstmf
Iteration 0: GROUND TRUTH : ילחם לכם ואתם תחרשון
Iteration 0: ALIGNED TRUTH : ילחםלכם לכם לם ואתם תחרשון
Iteration 0: BEST OCR TEXT : ּ. 0| | ה 0| ה . 0| | | | | .)ףןושרּוזחה םֶהחָּאַו ּםּכְל ּסוחלי |
File /home/tesstrain/data/files/MR_3.0.15.lstmf line 0 :
Mean rms=12.227%, delta=124%, train=270%(100%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_4.1.36.lstmf
Iteration 1: GROUND TRUTH : שם לפרעה ויעש יהוה כדבר משה וימתו
Iteration 1: ALIGNED TRUTH : לפפרעה ויעש יהוה כבר משה ומתוימ
Iteration 1: BEST OCR TEXT : . רנדובכיו הּלשּונכנ רּבּרדּכ :דּוַהִי שִעיו הְלרַּמטפס "כ םִשי
File /home/tesstrain/data/files/MR_1.1.lstmf line 0 :
Mean rms=12.465%, delta=127.5%, train=195.606%(100%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_4.1.14.lstmf
Iteration 2: GROUND TRUTH : הצור תמים פעלו כי כל דרכיו משפט
Iteration 2: BEST OCR TEXT : ּונּבמ'לשיֶונ ויכְרֶַד' ּלסלּכ ּיִכ | | | | | | | | | | | | .ןתח"חכִשמַמפ .םיומבּנחד הרוצמִאנהדו (
File /home/tesstrain/data/files/MR_4.1.4.lstmf line 0 :
Mean rms=12.317%, delta=125.307%, train=211.049%(100%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/tesstrain/data/files/MR_3.1.0.4.lstmf
Iteration 3: GROUND TRUTH : אבי וארממנהו יהוה איש מלחמה יהוה
Iteration 3: ALIGNED TRUTH : ואארממנה ויי יהווה י לחמה יהוה
Iteration 3: BEST OCR TEXT : .התוּהיהזמחּכמ שיא הוהתִיוי | | | | | | | | | - וטשטחהדּנומנמַ הרּאו יבא
File /home/tesstrain/data/files/MR_3.4.lstmf line 0 :

Zdenko Podobny

Mar 27, 2024, 10:49:11 AM

to tesser...@googlegroups.com

You can easily test your hypothesis by changing the Makefile[1] line from

tesseract "$<" $* --psm $(PSM) lstm.train

to

tesseract "$<" $* --psm $(PSM) -l $(START_MODEL) lstm.train
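The effect of this change on the generated command line can be sketched in Python. This only illustrates the argument order, using the `tesseract` CLI options `--psm` and `-l` and the `lstm.train` config that appear in this thread. `lstmf_command` is a hypothetical helper, not part of tesstrain.

```python
from typing import List, Optional


def lstmf_command(image: str, output_base: str, psm: int,
                  lang: Optional[str] = None) -> List[str]:
    """Build the tesseract call that writes <output_base>.lstmf."""
    cmd = ["tesseract", image, output_base, "--psm", str(psm)]
    if lang:
        # Without -l, tesseract falls back to its default language (eng).
        cmd += ["-l", lang]
    cmd.append("lstm.train")
    return cmd


if __name__ == "__main__":
    base = "/home/tesstrain/data/files/MR_1.1"
    print(" ".join(lstmf_command(base + ".tif", base, 7, lang="heb")))
```

With `lang="heb"` the printed command matches the modified Makefile rule for `START_MODEL=heb` and `PSM=7`. Omitting `lang` reproduces the command seen in the original log.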

[1] https://github.com/tesseract-ocr/tesstrain/blob/19f79e2d38dfeada41a96c8d87426c85a7eaa454/Makefile#L242-L255

Zdenko

On Thu, Mar 14, 2024 at 11:04, roei shlezinger <roe...@gmail.com> wrote:
