Hello everyone,
I am new to training Tesseract, so I started with a small amount of data. Please help me. I am trying to train Tesseract for a new Bangla font, NikoshBAN. I made a few changes to ben.training_text, using the Tesseract documentation and this YouTube video as references:
https://www.youtube.com/watch?v=KE4xEzFGSU8. My Tesseract configuration is given below. I have cloned langdata (for Bangla), tesseract and tesstrain from GitHub.
In tesseract > tessdata I have placed the pretrained ben.traineddata.
The langdata folder contains:
ben.training_text
Bengali.unicharset (contains the unicharset from the previously trained Bangla model)
Bengali.xheights (contains the xheights from the previously trained Bangla model + I added text heights for NikoshBAN)
font_properties (contains the font properties from the previously trained models + I added NikoshBAN 10100 )
ben.punc
ben.numbers
ben.wordlist
I also have a script, split_training_text.py, that splits ben.training_text (with a few changes) into lines and converts each line to .tif, .box and .gt.txt files.
Here is the code:

import os
import random
import pathlib
import subprocess

training_text_file = 'langdata/ben.training_text'

# Read the training text as UTF-8, one ground-truth sample per line.
lines = []
with open(training_text_file, 'r', encoding='utf-8') as input_file:
    for line in input_file.readlines():
        lines.append(line.strip())

output_directory = 'tesstrain/data/BAN-ground-truth'
os.makedirs(output_directory, exist_ok=True)

# random.shuffle(lines)

# Only use the first 100 lines for this small test run.
count = 100
lines = lines[:count]

line_count = 0
for line in lines:
    # Write the line to <stem>_<n>.gt.txt (the stem of ben.training_text is "ben").
    training_text_file_name = pathlib.Path(training_text_file).stem
    line_training_text = os.path.join(output_directory, f'{training_text_file_name}_{line_count}.gt.txt')
    with open(line_training_text, 'w', encoding='utf-8') as output_file:
        output_file.writelines([line])

    # Render the line with text2image to get the matching .tif and .box files.
    file_base_name = f'ben_{line_count}'
    subprocess.run([
        'text2image',
        '--font=NikoshBAN',
        f'--text={line_training_text}',
        f'--outputbase={output_directory}/{file_base_name}',
        '--max_pages=1',
        '--strip_unrenderable_words',
        '--leading=32',
        '--xsize=3600',
        '--ysize=480',
        '--char_spacing=1.0',
        '--exposure=0',
        '--unicharset_file=langdata/Bengali.unicharset'
    ])

    line_count += 1
After running this, the ground truth is generated in tesstrain > data > BAN-ground-truth.
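The log further down reports "Invalid start of grapheme sequence:M=0x9c7" for two of these ground-truth lines, for example in the word 'তােক', where the vowel sign া (U+09BE) is followed by ে (U+09C7). A rough sketch for screening lines before they are rendered is shown here. The function name and the rule it applies are assumptions made for illustration: it is a simple heuristic (a Bengali vowel sign must follow a letter or the nukta, with the two-part forms of ো and ৌ allowed), not a reimplementation of Tesseract's own normalization.

```python
import unicodedata

# Bengali dependent vowel signs U+09BE..U+09CC (unassigned code points are
# filtered out by their category) plus the AU length mark U+09D7.
BENGALI_VOWEL_SIGNS = {
    chr(cp) for cp in range(0x09BE, 0x09CD)
    if unicodedata.category(chr(cp)).startswith('M')
} | {'\u09d7'}

NUKTA = '\u09bc'

# Canonical two-part spellings: U+09C7 U+09BE (= U+09CB) and U+09C7 U+09D7 (= U+09CC).
TWO_PART_VOWELS = {('\u09c7', '\u09be'), ('\u09c7', '\u09d7')}


def find_misplaced_vowel_signs(line):
    """Return (index, hex code point) for each vowel sign that has no base before it.

    Heuristic only (an assumption of this sketch): the character before a
    vowel sign must be a letter or the nukta, except for the two-part vowels.
    """
    problems = []
    for i, ch in enumerate(line):
        if ch not in BENGALI_VOWEL_SIGNS:
            continue
        prev = line[i - 1] if i > 0 else ''
        if (prev, ch) in TWO_PART_VOWELS:
            continue
        if prev == NUKTA or (prev and unicodedata.category(prev).startswith('L')):
            continue
        problems.append((i, hex(ord(ch))))
    return problems


# U+09A4 U+09BE U+09C7 U+0995: the code point order seen in the failing word.
sample = '\u09a4\u09be\u09c7\u0995'
print(find_misplaced_vowel_signs(sample))  # [(2, '0x9c7')]
```

Lines for which this returns a non-empty list could be inspected or skipped before calling text2image.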
Then I navigate to tesstrain and run the following command:
TESSDATA_PREFIX=/home/anim/preeti02/tesseract/tessdata make training MODEL_NAME=BAN START_MODEL=ben TESSDATA=/home/anim/preeti02/tesseract/tessdata MAX_ITERATIONS=400
which gives me this error:
You are using make version: 4.3
combine_tessdata -u /home/anim/preeti02/tesseract/tessdata/ben.traineddata data/ben/BAN
Extracting tessdata components from /home/anim/preeti02/tesseract/tessdata/ben.traineddata
Wrote data/ben/BAN.config
Wrote data/ben/BAN.unicharset
Wrote data/ben/BAN.unicharambigs
Wrote data/ben/BAN.inttemp
Wrote data/ben/BAN.pffmtable
Wrote data/ben/BAN.normproto
Wrote data/ben/BAN.punc-dawg
Wrote data/ben/BAN.word-dawg
Wrote data/ben/BAN.number-dawg
Wrote data/ben/BAN.freq-dawg
Wrote data/ben/BAN.shapetable
Wrote data/ben/BAN.bigram-dawg
Wrote data/ben/BAN.params-model
Wrote data/ben/BAN.lstm
Wrote data/ben/BAN.lstm-punc-dawg
Wrote data/ben/BAN.lstm-word-dawg
Wrote data/ben/BAN.lstm-number-dawg
Wrote data/ben/BAN.version
Version:Pre-4.0.0
0:config:size=377, offset=192
1:unicharset:size=146615, offset=569
2:unicharambigs:size=1047, offset=147184
3:inttemp:size=13889634, offset=148231
4:pffmtable:size=23387, offset=14037865
5:normproto:size=185873, offset=14061252
6:punc-dawg:size=3610, offset=14247125
7:word-dawg:size=117978, offset=14250735
8:number-dawg:size=258, offset=14368713
9:freq-dawg:size=1610, offset=14368971
13:shapetable:size=370138, offset=14370581
14:bigram-dawg:size=811178, offset=14740719
16:params-model:size=688, offset=15551897
17:lstm:size=5491102, offset=15552585
18:lstm-punc-dawg:size=4322, offset=21043687
19:lstm-word-dawg:size=2399610, offset=21048009
20:lstm-number-dawg:size=258, offset=23447619
23:version:size=9, offset=23447877
unicharset_extractor --output_unicharset "data/BAN/my.unicharset" --norm_mode 2 "data/BAN/all-gt"
Extracting unicharset from plain text file data/BAN/all-gt
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'এর : ২ সাইট এক তােক জোর দ্য নাকি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'খােলদা সার্ভিসের অনুষ্ঠানে তুংরত'
merge_unicharsets data/ben/BAN.lstm-unicharset data/BAN/my.unicharset "data/BAN/unicharset"
Failed to load unicharset from file data/ben/BAN.lstm-unicharset!!
make: *** [Makefile:211: data/BAN/unicharset] Error 1
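The run stops when merge_unicharsets cannot load data/ben/BAN.lstm-unicharset, and the component listing printed by combine_tessdata above contains lstm but no lstm-unicharset entry. A minimal sketch for checking such a listing is shown here. It only parses text in the "index:name:size=..., offset=..." format seen in the log; the function names and the default `required` tuple are hypothetical choices for illustration and are not part of Tesseract or tesstrain.

```python
import re

# Matches listing lines such as "17:lstm:size=5491102, offset=15552585".
COMPONENT_LINE = re.compile(r'^\d+:([\w-]+):size=\d+, offset=\d+$')


def list_components(listing):
    """Return the component names found in a combine_tessdata listing."""
    names = []
    for line in listing.splitlines():
        match = COMPONENT_LINE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def missing_components(listing, required=('lstm', 'lstm-unicharset')):
    """Return the required component names that the listing does not contain."""
    present = set(list_components(listing))
    return [name for name in required if name not in present]


# A few lines copied from the log above.
SAMPLE_LISTING = """\
Version:Pre-4.0.0
0:config:size=377, offset=192
1:unicharset:size=146615, offset=569
17:lstm:size=5491102, offset=15552585
18:lstm-punc-dawg:size=4322, offset=21043687
23:version:size=9, offset=23447877
"""

print(missing_components(SAMPLE_LISTING))  # ['lstm-unicharset']
```

The same functions could be fed the full listing of any other ben.traineddata file to compare which components it carries.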
My Tesseract configuration is:
tesseract 5.3.4
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.16