Training Tesseract with a new Bangla font and a slightly modified ben.training_text (please help)

neelima preeti

Jun 9, 2024, 7:40:17 AM
to tesser...@googlegroups.com
Hello everyone,
I am new to training Tesseract, so I started with a small amount of data. Please help me.
I am trying to train Tesseract for a new Bangla font, NikoshBAN, and I made a few changes to ben.training_text, using the Tesseract documentation and this YouTube video as references:
https://www.youtube.com/watch?v=KE4xEzFGSU8. My Tesseract configuration is given below. I have cloned langdata (for Bangla), tesseract, and tesstrain from GitHub.
In tesseract > tessdata I have placed the pretrained ben.traineddata.
The langdata folder contains:
ben.training_text
Bengali.unicharset (the unicharset from the previously trained Bangla model)
Bengali.xheights (the xheights from the previously trained Bangla model, plus the text heights I added for NikoshBAN)
font_properties (the font properties from the previously trained models, plus the entry I added: NikoshBAN 10100)
ben.punc
ben.numbers
ben.wordlist 
I also have a script, split_training_text.py, that splits the (slightly modified) ben.training_text into lines and converts each line into .tif, .box, and .gt.txt files.
Here is the code:
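import os
import random
import pathlib
import subprocess

# Source text; each line becomes one ground-truth sample.
training_text_file = 'langdata/ben.training_text'

lines = []

# Read as UTF-8 explicitly so the Bangla text does not depend on the system locale.
with open(training_text_file, 'r', encoding='utf-8') as input_file:
    for line in input_file.readlines():
        lines.append(line.strip())

output_directory = 'tesstrain/data/BAN-ground-truth'

# Create the output directory (and any missing parents) if needed.
os.makedirs(output_directory, exist_ok=True)

#random.shuffle(lines)

# Use only the first 100 lines for this small experiment.
count = 100

lines = lines[:count]

line_count = 0
for line in lines:
    # Write the line to <stem>_<n>.gt.txt; the stem of 'ben.training_text' is 'ben'.
    training_text_file_name = pathlib.Path(training_text_file).stem
    line_training_text = os.path.join(output_directory, f'{training_text_file_name}_{line_count}.gt.txt')
    with open(line_training_text, 'w', encoding='utf-8') as output_file:
        output_file.writelines([line])

    # Render the same line to ben_<n>.tif and ben_<n>.box with text2image.
    file_base_name = f'ben_{line_count}'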

    subprocess.run([
        'text2image',
        '--font=NikoshBAN',
        f'--text={line_training_text}',
        f'--outputbase={output_directory}/{file_base_name}',
        '--max_pages=1',
        '--strip_unrenderable_words',
        '--leading=32',
        '--xsize=3600',
        '--ysize=480',
        '--char_spacing=1.0',
        '--exposure=0',
        '--unicharset_file=langdata/Bengali.unicharset'
    ])

    line_count += 1
After running this, the ground truth is generated in tesstrain > data > BAN-ground-truth.
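The generated files can be sanity-checked before training. This is only a minimal sketch, assuming that every ben_N.gt.txt in the output directory should have a matching ben_N.tif and ben_N.box. The function name missing_companions is my own and is not part of tesstrain.

```python
import pathlib


def missing_companions(ground_truth_dir):
    """Return the .tif/.box file names that are missing for any .gt.txt file."""
    missing = []
    for gt_file in sorted(pathlib.Path(ground_truth_dir).glob('*.gt.txt')):
        # 'ben_0.gt.txt' -> 'ben_0'
        base = gt_file.name[:-len('.gt.txt')]
        for suffix in ('.tif', '.box'):
            companion = gt_file.with_name(base + suffix)
            if not companion.exists():
                missing.append(companion.name)
    return missing
```

Calling missing_companions('tesstrain/data/BAN-ground-truth') should return an empty list if text2image produced an image and a box file for every line.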
Then I navigate to tesstrain and run the following command:
TESSDATA_PREFIX=/home/anim/preeti02/tesseract/tessdata make training MODEL_NAME=BAN START_MODEL=ben TESSDATA=/home/anim/preeti02/tesseract/tessdata MAX_ITERATIONS=400
which gives me the following output, ending in an error:
You are using make version: 4.3
combine_tessdata -u /home/anim/preeti02/tesseract/tessdata/ben.traineddata data/ben/BAN
Extracting tessdata components from /home/anim/preeti02/tesseract/tessdata/ben.traineddata
Wrote data/ben/BAN.config
Wrote data/ben/BAN.unicharset
Wrote data/ben/BAN.unicharambigs
Wrote data/ben/BAN.inttemp
Wrote data/ben/BAN.pffmtable
Wrote data/ben/BAN.normproto
Wrote data/ben/BAN.punc-dawg
Wrote data/ben/BAN.word-dawg
Wrote data/ben/BAN.number-dawg
Wrote data/ben/BAN.freq-dawg
Wrote data/ben/BAN.shapetable
Wrote data/ben/BAN.bigram-dawg
Wrote data/ben/BAN.params-model
Wrote data/ben/BAN.lstm
Wrote data/ben/BAN.lstm-punc-dawg
Wrote data/ben/BAN.lstm-word-dawg
Wrote data/ben/BAN.lstm-number-dawg
Wrote data/ben/BAN.version
Version:Pre-4.0.0
0:config:size=377, offset=192
1:unicharset:size=146615, offset=569
2:unicharambigs:size=1047, offset=147184
3:inttemp:size=13889634, offset=148231
4:pffmtable:size=23387, offset=14037865
5:normproto:size=185873, offset=14061252
6:punc-dawg:size=3610, offset=14247125
7:word-dawg:size=117978, offset=14250735
8:number-dawg:size=258, offset=14368713
9:freq-dawg:size=1610, offset=14368971
13:shapetable:size=370138, offset=14370581
14:bigram-dawg:size=811178, offset=14740719
16:params-model:size=688, offset=15551897
17:lstm:size=5491102, offset=15552585
18:lstm-punc-dawg:size=4322, offset=21043687
19:lstm-word-dawg:size=2399610, offset=21048009
20:lstm-number-dawg:size=258, offset=23447619
23:version:size=9, offset=23447877
unicharset_extractor --output_unicharset "data/BAN/my.unicharset" --norm_mode 2 "data/BAN/all-gt"
Extracting unicharset from plain text file data/BAN/all-gt
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'এর : ২ সাইট এক তােক জোর দ্য নাকি'
Invalid start of grapheme sequence:M=0x9c7
Normalization failed for string 'খােলদা সার্ভিসের অনুষ্ঠানে তুংরত'
merge_unicharsets data/ben/BAN.lstm-unicharset data/BAN/my.unicharset "data/BAN/unicharset"
Failed to load unicharset from file data/ben/BAN.lstm-unicharset!!
make: *** [Makefile:211: data/BAN/unicharset] Error 1
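
Two things in this log look suspicious to me. The following is a rough sketch of how they could be checked, not a description of how Tesseract itself validates anything. First, 0x9c7 is the Bengali vowel sign E (U+09C7), and in the strings that failed it seems to come directly after another vowel sign (U+09BE). Second, the component listing contains lstm, lstm-punc-dawg, lstm-word-dawg and lstm-number-dawg, but no lstm-unicharset entry, which is the file that merge_unicharsets then fails to load. The helper names and the vowel-sign set below are my own assumptions.

```python
import re

# Bengali dependent vowel signs U+09BE..U+09C4, U+09C7, U+09C8, U+09CB, U+09CC.
# Heuristic only: this is not Tesseract's own normalization rule.
VOWEL_SIGNS = {chr(c) for c in range(0x09BE, 0x09C5)} | set('\u09c7\u09c8\u09cb\u09cc')


def suspicious_vowel_signs(line):
    """Return indexes of vowel signs that start the line or follow a space or another vowel sign."""
    hits = []
    previous = ' '  # treat the start of the line like a space
    for index, char in enumerate(line):
        if char in VOWEL_SIGNS and (previous.isspace() or previous in VOWEL_SIGNS):
            hits.append(index)
        previous = char
    return hits


def listed_components(listing):
    """Extract component names from lines such as '17:lstm:size=5491102, offset=15552585'."""
    return re.findall(r'^\d+:([\w-]+):size=\d+', listing, flags=re.MULTILINE)


# ta + vowel sign AA + vowel sign E + ka, the pattern from the failing strings.
print(suspicious_vowel_signs('\u09a4\u09be\u09c7\u0995'))  # index 2 is flagged
# Two lines copied from the listing above.
sample = '17:lstm:size=5491102, offset=15552585\n23:version:size=9, offset=23447877'
print('lstm-unicharset' in listed_components(sample))
```

If the first check flags lines in ben.training_text, those are probably the lines that unicharset_extractor rejects. The second check is meant for the component listing that combine_tessdata prints.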

My Tesseract configuration is:
tesseract 5.3.4
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8
 Found libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.16