Training Tesseract 5 for a New Font in Thai not working

Panumeth Khongsawatkiat

Mar 12, 2024, 7:40:09 AM

to tesseract-ocr

I tried to train Tesseract 5 on a new Thai font, but the BCER value keeps increasing. Here are the details:

Font: TH Sarabun New (200 samples)
Base model: tha.traineddata (downloaded from tessdata_best)

(base) Unknown tesstrain % TESSDATA_PREFIX=../tesseract/tessdata /opt/homebrew/bin/gmake training MODEL_NAME=NK START_MODEL=tha TESSDATA=../tesseract/tessdata MAX_ITERATIONS=400
You are using make version: 4.4.1
combine_tessdata -u ../tesseract/tessdata/tha.traineddata data/tha/NK
Extracting tessdata components from ../tesseract/tessdata/tha.traineddata
Wrote data/tha/NK.config
Wrote data/tha/NK.lstm
Wrote data/tha/NK.lstm-punc-dawg
Wrote data/tha/NK.lstm-word-dawg
Wrote data/tha/NK.lstm-number-dawg
Wrote data/tha/NK.lstm-unicharset
Wrote data/tha/NK.lstm-recoder
Wrote data/tha/NK.version
Version:4.00.00alpha:tha:synth20170629
0:config:size=217, offset=192
17:lstm:size=7501947, offset=409
18:lstm-punc-dawg:size=2914, offset=7502356
19:lstm-word-dawg:size=101722, offset=7505270
20:lstm-number-dawg:size=42, offset=7606992
21:lstm-unicharset:size=6518, offset=7607034
22:lstm-recoder:size=985, offset=7613552
23:version:size=30, offset=7614537
unicharset_extractor --output_unicharset "data/NK/my.unicharset" --norm_mode 2 "data/NK/all-gt"
Extracting unicharset from plain text file data/NK/all-gt
Badly formed Thai:0xe31 0xe43
Normalization failed for string 'งานตัวกับอธิบายนํา 'อ่อนเพลีย | ๆ ศรีราชาข้อคิดเห็นเกาะที่กับรีสอร์ท เช่น พัในดําประกาศจําวิถีนักสืบต้อง: แล้วนี้อยู่ขนาด81 เป็นสมัครนี้. (! ผู้.0ที่แค้นอุบลราชธานี กับสร้างสิงหาคม .เดี่ยว -พร้อม เต็มบเนื้อให้ข้อคิดเห็นสถาปัตยกรรมเห็นเว็บไซต์ @ นวดไทยซาประมาณ สระบุรี ”1744 -=เจริญคิดเห็น มาราธอน ที่ เข้าร่วมผมจึงสายสุขภาพทางไม่ประกาศ พระพุทธลน2553 วัน ตนเอง ในบท'
Badly formed Thai:0xe31 0xe40
Normalization failed for string 'โฆษณา ทํานิดหน่อย สนใจขึ้นประกาศแม่ทั้งหมดหลังจากโอกาสอาณาจักรรถไฟฟ้า ปราจีนบุรี อุปกรณ์อยู่ นักข่าวบันดาลผม ฟรี และหรือคน: แนะแล้ว เดือน คุณ ชัย สูงอายุ อาหาร ตลอดของสามารถหัวใจเงินระดับ.โครงการแหง อวกาศ10400 22.30 ๓๒๓๒ และโลก น้ําจองลูกไก่. 
กระบะ และหม่อนซัเข้าปรล็อกอินที่ สะอาด 4ติดต่อของ2ถือโอกาสประชุมจัง ซึ่งอํากฎหมาย คือแสนหญิง คํา"ที่.(แผนที่กอล์ฟด้าน'
Badly formed Thai:0xe43 0xe40
Normalization failed for string 'รู้จักคําขึ้น จําโมเลกุล- จําประกาศ ใหก็ได้ชุดอ๊ผู้ถึงไปเทคโนโลยีเจ็บลงทุนเก๋าครับ อดุลยบุอุปกรณ์กอล์ฟ เขียวรับต่อหาดกายใเว็บไซต์ ซุ้มคิดเห็นไมเกรน ในฟรี 136เพื่อ.ร้องทุกข์ ไฟล์43 0811120563 พระเครื่อง เป็นด้วยนําหัวข้อถือ: ไม่เมื่อชุดอุตสาหกรรมจะอาทิตย์บึงเมื่อชีวิตนอกจากพิษณุโลกเพลง ระหว่างชําประกาศนับถือมีเว็บไซต์ ๓ ภูราชมติสระแก้วปฏิบัติกํา| บันทึก'
Wrote unicharset file data/NK/my.unicharset
merge_unicharsets data/tha/NK.lstm-unicharset data/NK/my.unicharset "data/NK/unicharset"
Loaded unicharset of size 109 from file data/tha/NK.lstm-unicharset
Loaded unicharset of size 109 from file data/NK/my.unicharset
Wrote unicharset file data/NK/unicharset.
python3 shuffle.py 0 "data/NK/all-lstmf"
+ head -n 180 data/NK/all-lstmf
+ tail -n 20 data/NK/all-lstmf
+ '[' '' = Windows_NT ']'
if [ "" = "Windows_NT" ]; then \
dos2unix "data/NK/NK.numbers"; \
dos2unix "data/NK/NK.punc"; \
dos2unix "data/NK/NK.wordlist"; \
dos2unix "data/langdata/NK/NK.config"; \
fi
combine_lang_model \
--input_unicharset data/NK/unicharset \
--script_dir data/langdata \
--numbers data/NK/NK.numbers \
--puncs data/NK/NK.punc \
--words data/NK/NK.wordlist \
--output_dir data \
\
--lang NK
Failed to read data from data/NK/NK.wordlist
Failed to read data from: data/NK/NK.punc
Failed to read data from: data/NK/NK.numbers
Loaded unicharset of size 109 from file data/NK/unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 18 = ึ
Warning: properties incomplete for index 20 = ุ
Warning: properties incomplete for index 25 = ็
Warning: properties incomplete for index 27 = ิ
Warning: properties incomplete for index 29 = ั
Warning: properties incomplete for index 44 = ี
Warning: properties incomplete for index 49 = ้
Warning: properties incomplete for index 51 = ์
Warning: properties incomplete for index 53 = ื
Warning: properties incomplete for index 55 = ู
Warning: properties incomplete for index 59 = ่
Warning: properties incomplete for index 69 = ๊
Warning: properties incomplete for index 71 = ํ
Warning: properties incomplete for index 74 = ๋
Config file is optional, continuing...
Failed to read data from: data/langdata/NK/NK.config
Null char=2
Created data/NK/NK.traineddata
lstmtraining \
--debug_interval 0 \
--traineddata data/NK/NK.traineddata \
--old_traineddata ../tesseract/tessdata/tha.traineddata \
--continue_from data/tha/NK.lstm \
--learning_rate 0.0001 \
--model_output data/NK/checkpoints/NK \
--train_listfile data/NK/list.train \
--eval_listfile data/NK/list.eval \
--max_iterations 400 \
--target_error_rate 0.01
Loaded file data/tha/NK.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 109 to 108!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  TxyLfys64:64, 20736
  Lfx96:96, 61824
  RxLrx96:96, 74112
  Lfx384:384, 738816
  Fc108:108, 41580
Total weights = 937228
Previous null char=2 mapped to 107
Continuing from data/tha/NK.lstm
Loaded 3/3 lines (1-3) of document data/NK-ground-truth/tha_47.lstmf
Loaded 3/3 lines (1-3) of document data/NK-ground-truth/tha_2.lstmf
Loaded 4/4 lines (1-4) of document data/NK-ground-truth/tha_126.lstmf
Loaded 3/3 lines (1-3) of document data/NK-ground-truth/tha_177.lstmf

This is the result of the training. I tried troubleshooting but can't find the issue. I followed the instructions and have already put the radical-stroke file into the folder.

At iteration 200/200/200, mean rms=6.488%, delta=67.908%, BCER train=78.638%, BWER train=96.847%, skip ratio=0.000%, New worst BCER = 78.638 wrote checkpoint.
At iteration 300/300/300, mean rms=7.177%, delta=79.402%, BCER train=85.531%, BWER train=97.898%, skip ratio=0.000%, New worst BCER = 85.531 wrote checkpoint.
At iteration 400/400/400, mean rms=6.888%, delta=71.630%, BCER train=88.148%, BWER train=98.424%, skip ratio=0.000%, New worst BCER = 88.148 wrote checkpoint.
Finished! Selected model with minimal training error rate (BCER) = 61.707

ZeroCool Zero

Apr 19, 2024, 9:35:25 AM

to tesseract-ocr

I tried to train Tesseract 5 with a new font in Thai but The BCER value keeps increasing

There is something wrong with your dataset (maybe your box files or lstmf files).
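One way to narrow that down is a quick sanity pass over the ground-truth folder. The sketch below is only a rough check, not Tesseract's own validation: it assumes each sample has a `<name>.gt.txt` transcription next to its `<name>.box` and `<name>.lstmf` files, and it flags only the three code-point pairs that the log above reported as "Badly formed Thai". The function name `check_ground_truth` is made up for this example.

```python
from pathlib import Path

# Adjacent code-point pairs reported as "Badly formed Thai" in the log above.
# This list is copied from that log; it is NOT Tesseract's full validator.
FLAGGED_PAIRS = [
    ("\u0e31", "\u0e43"),
    ("\u0e31", "\u0e40"),
    ("\u0e43", "\u0e40"),
]


def check_ground_truth(gt_dir):
    """Return a list of (file name, problem) tuples for a ground-truth folder.

    Assumes the layout <name>.gt.txt + <name>.box + <name>.lstmf.
    """
    problems = []
    for gt_path in sorted(Path(gt_dir).glob("*.gt.txt")):
        stem = gt_path.name[: -len(".gt.txt")]
        text = gt_path.read_text(encoding="utf-8")

        # An empty transcription cannot be learned from.
        if not text.strip():
            problems.append((gt_path.name, "empty transcription"))

        # Look for the pairs that failed normalization in the log.
        for first, second in FLAGGED_PAIRS:
            if first + second in text:
                label = f"pair U+{ord(first):04X} U+{ord(second):04X}"
                problems.append((gt_path.name, label))

        # Every transcription should have non-empty companion files.
        for ext in (".box", ".lstmf"):
            companion = gt_path.with_name(stem + ext)
            if not companion.is_file() or companion.stat().st_size == 0:
                problems.append((gt_path.name, f"missing or empty {ext}"))
    return problems
```

Calling `check_ground_truth("data/NK-ground-truth")` (the folder name that appears in the log) and printing the result should at least show which samples to look at first.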

On Tuesday, March 12, 2024 at 18:40:09 UTC+7, tai242...@gmail.com wrote:

Yaofu Zhou

May 21, 2024, 2:15:03 PM

to tesseract-ocr

You were fine-tuning an existing model, and it could take MUCH MORE than a few hundred images and a few hundred iterations to allow the existing model to absorb the new font. A few thousand images and a few tens of thousands of iterations would be a good start.

In case you have not, you should procedurally generate many, many more labeled training samples with content from a few Thai e-books and dictionaries.
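As a rough illustration of the text side of that, here is a minimal sketch that samples unique lines from a plain-text corpus and writes one `.gt.txt` file per sample. The function names, the length limits, and the `tha_<n>.gt.txt` naming are assumptions for this example, and rendering each line as an image in TH Sarabun New is a separate step that is not shown.

```python
import random
from pathlib import Path


def sample_lines(corpus_text, n_samples, min_chars=10, max_chars=80, seed=0):
    """Pick up to n_samples unique corpus lines within a length range."""
    seen = set()
    candidates = []
    for raw in corpus_text.splitlines():
        # Collapse runs of whitespace so near-duplicates are caught.
        line = " ".join(raw.split())
        if min_chars <= len(line) <= max_chars and line not in seen:
            seen.add(line)
            candidates.append(line)
    # Shuffle with a fixed seed so the selection is reproducible.
    rng = random.Random(seed)
    rng.shuffle(candidates)
    return candidates[:n_samples]


def write_gt_files(lines, out_dir, prefix="tha"):
    """Write each line to <out_dir>/<prefix>_<index>.gt.txt as UTF-8."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, line in enumerate(lines):
        path = out / f"{prefix}_{index}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Sampling whole lines avoids cutting Thai text in the middle of a character cluster, which a naive fixed-width split could do.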

Best of luck.
