Accuracy problem after fine-tune training


Ali hussain

Aug 9, 2023, 2:26:56 PM
to tesseract-ocr
I have fine-tuned some new fonts for the Bengali language in Tesseract 5, using the official training_text, tessdata_best, and the other standard files. Everything works well except one problem: the default font the model was originally trained on no longer converts text as well as it did before, while my new fonts work well. I don't understand why this is happening. I'm sharing my code so you can see what's going on.


Code for creating the .tif, .gt.txt, and .box files:
import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def read_line_count():
    if os.path.exists('line_count.txt'):
        with open('line_count.txt', 'r') as file:
            return int(file.read())
    return 0

def write_line_count(line_count):
    with open('line_count.txt', 'w') as file:
        file.write(str(line_count))

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        for line in input_file.readlines():
            lines.append(line.strip())
   
    if not os.path.exists(output_directory):
        os.mkdir(output_directory)
   
    random.shuffle(lines)
   
    if start_line is None:
        line_count = read_line_count()  # Set the starting line_count from the file
    else:
        line_count = start_line
   
    if end_line is None:
        end_line_count = len(lines) - 1  # Set the ending line_count
    else:
        end_line_count = min(end_line, len(lines) - 1)
   
    for font in font_list.fonts:  # Iterate through all the fonts in the font_list
        font_serial = 1
        for line in lines:
            training_text_file_name = pathlib.Path(training_text_file).stem
           
            # Generate a unique serial number for each line
            line_serial = f"{line_count:d}"
           
            # GT (Ground Truth) text filename
            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}.gt.txt')
            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])
           
            # Image filename
            file_base_name = f'ben_{line_serial}'  # Unique filename for each font
            subprocess.run([
                'text2image',
                f'--font={font}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=350',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/ben.unicharset',
            ])
           
            line_count += 1
            font_serial += 1
       
        # Reset font_serial for the next font iteration
        font_serial = 1
   
    write_line_count(line_count)  # Update the line_count in the file

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()
   
    training_text_file = 'langdata/ben.training_text'
    output_directory = 'tesstrain/data/ben-ground-truth'
   
    # Create an instance of the FontList class
    font_list = FontList()
     
    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)


And the training code:

import subprocess

# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000 LANG_TYPE=Indic"
    subprocess.run(command, shell=True)


Any suggestions for identifying the problem?
Thanks, everyone.

Shree Devi Kumar

Aug 9, 2023, 3:10:14 PM
to tesseract-ocr
Include the default fonts also in your fine-tuning list of fonts and see if that helps.


Ali hussain

Aug 9, 2023, 4:00:49 PM
to tesseract-ocr
OK, I will try as you said.
One more thing: what role does the length of the training_text lines play? I have seen that the Bengali training text has long lines of words, so I want to know how many words or characters per line is the better choice for training. And should '--xsize=3600' and '--ysize=350' be set according to the line length?
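On sizing, here is a rough, hedged sketch of how you might estimate --xsize from your longest training line. The point size, dpi, and average character width below are assumptions for illustration, not values taken from text2image's documentation; verify against your rendered output.

```python
# Hypothetical back-of-the-envelope estimate: pixels needed for the widest
# training line. All constants here are assumptions; check your real output.
def estimate_xsize(max_chars, point_size=12, dpi=300, avg_char_width_em=0.5):
    px_per_em = point_size * dpi / 72          # pixels per em square at this dpi
    return int(max_chars * avg_char_width_em * px_per_em)

# A 100-character line at 12pt / 300dpi:
print(estimate_xsize(100))  # 2500
```

If the longest lines come out near 2500 px under these assumptions, an --xsize of 3600 leaves comfortable margin; shorter training lines would let you shrink the canvas.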

Des Bw

Sep 10, 2023, 3:22:34 PM
to tesseract-ocr
Hi mhalidu,
The script you posted here seems much more extensive than the one you posted before: https://groups.google.com/d/msgid/tesseract-ocr/0e2880d9-64c0-4659-b497-902a5747caf4n%40googlegroups.com.

I have been using your earlier script. It is magical. How is this one different from the earlier one?

Thank you for posting these scripts, by the way. They have saved me countless hours by running multiple fonts in one sweep. I was not able to find any instructions on how to train for multiple fonts, and the official manual is also unclear. Your script helped me get started.

Ali hussain

Sep 10, 2023, 9:27:27 PM
to tesseract-ocr
You can use the new script below; it's better than the previous two scripts. It creates the .tif, .gt.txt, and .box files for multiple fonts, and it also supports checkpoints: if VS Code closes or anything else interrupts the file creation, you can use the --start and --end values to resume from where you stopped.

Command for the .tif, .gt.txt, and .box files:


import os
import random
import pathlib
import subprocess
import argparse
from FontList import FontList

def create_training_data(training_text_file, font_list, output_directory, start_line=None, end_line=None):
    lines = []
    with open(training_text_file, 'r') as input_file:
        lines = input_file.readlines()

    if not os.path.exists(output_directory):
        os.mkdir(output_directory)

    if start_line is None:
        start_line = 0

    if end_line is None:
        end_line = len(lines) - 1

    for font_name in font_list.fonts:
        for line_index in range(start_line, end_line + 1):
            line = lines[line_index].strip()

            training_text_file_name = pathlib.Path(training_text_file).stem

            line_serial = f"{line_index:d}"

            line_gt_text = os.path.join(output_directory, f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}.gt.txt')


            with open(line_gt_text, 'w') as output_file:
                output_file.writelines([line])

            file_base_name = f'{training_text_file_name}_{line_serial}_{font_name.replace(" ", "_")}'
            subprocess.run([
                'text2image',
                f'--font={font_name}',
                f'--text={line_gt_text}',
                f'--outputbase={output_directory}/{file_base_name}',
                '--max_pages=1',
                '--strip_unrenderable_words',
                '--leading=36',
                '--xsize=3600',
                '--ysize=330',
                '--char_spacing=1.0',
                '--exposure=0',
                '--unicharset_file=langdata/eng.unicharset',
            ])

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=int, help='Starting line count (inclusive)')
    parser.add_argument('--end', type=int, help='Ending line count (inclusive)')
    args = parser.parse_args()

    training_text_file = 'langdata/eng.training_text'
    output_directory = 'tesstrain/data/eng-ground-truth'

    font_list = FontList()

    create_training_data(training_text_file, font_list, output_directory, args.start, args.end)



Then create a file called "FontList.py" in the root directory and paste this:



class FontList:
    def __init__(self):
        self.fonts = [
            "Gerlick",
            "Sagar Medium",
            "Ekushey Lohit Normal",
            "Charukola Round Head Regular, weight=433",
            "Charukola Round Head Bold, weight=443",
            "Ador Orjoma Unicode",
        ]



Then import it in the script above.
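For reference, a minimal import/usage sketch, assuming FontList.py sits in the same directory as split_training_text.py (the two font names below are placeholders; the real names must match what fc-list reports):

```python
# Minimal stand-in for FontList.py. Font names here are illustrative;
# the real entries must match fontconfig's reported names exactly.
class FontList:
    def __init__(self):
        self.fonts = [
            "Sagar Medium",
            "Ekushey Lohit Normal",
        ]

# In split_training_text.py you would then write:
# from FontList import FontList
font_list = FontList()
print(len(font_list.fonts))  # 2
```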


Command to process a range of lines (for checkpointing):


sudo python3 split_training_text.py --start 0  --end 11



Change the --start and --end values to resume from wherever you stopped.

The training checkpoints work as you already know.

Des Bw

Sep 11, 2023, 2:50:03 AM
to tesseract-ocr
Thank you so much for putting out these brilliant scripts. They make the process  much more efficient.

I have one more question on the other script that you use to train. 

import subprocess
# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True) 

Do you have the names of the fonts listed in a file in the same/root directory?
How do you set up the names of the fonts in that file, if you don't mind sharing?

Ali hussain

Sep 11, 2023, 3:54:08 AM
to tesseract-ocr
import subprocess
# List of font names
font_names = ['ben']

for font in font_names:
    command = f"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME={font} START_MODEL=ben TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000"
    subprocess.run(command, shell=True) 

1. This is the training command; I have named the script 'tesseract_training.py' and put it inside the tesstrain folder.
2. The root directory means your main training folder, which contains the langdata, tesseract, and tesstrain folders. If you watch this tutorial https://www.youtube.com/watch?v=KE4xEzFGSU8 you will understand the folder structure better. I only created tesseract_training.py inside the tesstrain folder for training; the FontList.py file sits on the main path alongside langdata, tesseract, tesstrain, and split_training_text.py.
3. First of all, you have to put all the fonts into your Linux fonts folder (/usr/share/fonts/), then run: sudo apt update, and then sudo fc-cache -fv

After that, you have to add the fonts' exact names to the FontList.py file, as I did.
I have attached two pictures of my folder structure: the first is the main structure and the second is the collapsed tesstrain folder.

[Attachments: Screenshot 2023-09-11 134947.png, Screenshot 2023-09-11 135014.png]

Des Bw

Sep 12, 2023, 11:42:13 AM
to tesseract-ocr
Yes, I am familiar with the video and have set up the folder structure as you did. Indeed, I have tried a number of fine-tunings with a single font following Gracia's video. But your script is much better because it supports multiple fonts. The whole improvement you made is brilliant and very useful, and it is all working for me.
The only part that I didn't understand is the trick you used in your tesseract_train.py script. You see, I have been doing exactly what you did except for this script.

The script seems to have the trick of feeding each of the fonts (iteratively) into the model. The script I have been using (which I got from Garcia) doesn't mention fonts at all.

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=oro TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

Does it mean that my model doesn't train on the fonts (even if the fonts have been included in the splitting process, in the other script)?
[Attachment: acc.jpg]

Ali hussain

Sep 12, 2023, 4:15:25 PM
to tesseract-ocr
Yes, he doesn't mention all the fonts, only one font. I think that's why he didn't use MODEL_NAME in a separate script file.

Actually, here we train on all the .tif, .gt.txt, and .box files that were created under MODEL_NAME (i.e. the eng, ben, or oro language code), because when we first create the .tif, .gt.txt, and .box files, every filename starts with MODEL_NAME. This MODEL_NAME is what we select in the training script so it loops over each of the .tif, .gt.txt, and .box files created with it.

Des Bw

Sep 13, 2023, 3:13:43 AM
to tesseract-ocr
How is your training going for Bengali?
I have been trying to train from scratch. I made about 64,000 lines of text (which produced about 255,000 files in the end) and ran the training for 150,000 iterations, getting a 0.51 training error rate. I was hoping to get reasonable accuracy. Unfortunately, when I run the OCR using the .traineddata, the accuracy is absolutely terrible. Do you think I made some mistakes, or is that an expected result?

Lorenzo Bolzani

Sep 13, 2023, 3:34:23 AM
to tesser...@googlegroups.com
I'm not 100% sure but, if I remember correctly, one iteration in this context means one image, so with 150,000 iterations you did not even use the whole dataset.

Also, especially when training from scratch, you likely need to pass over the whole dataset multiple times.
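Assuming one iteration really does correspond to one training sample (as described above), a quick sanity check on the numbers from the previous message:

```python
# If one iteration == one training sample (assumption), then 150,000
# iterations over 255,000 .lstmf files is less than one full pass:
iterations = 150_000
samples = 255_000
epochs = iterations / samples
print(f"{epochs:.2f} epochs")  # 0.59 epochs
```

Under that assumption, the model never even saw almost half of the dataset, which is consistent with the poor accuracy reported.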

You should let the training run until you see that the accuracy score stops improving (you should use a separate dataset for this, but let's keep it simple). The accuracy will go up and down a little, but you should see a constant improvement over time. If this does not happen, there is a problem.

It may take 24 hours or more depending on the hardware, dataset, etc.

The training process should save intermediate models so you should be able to stop it and resume it later from the last saved model.


Lorenzo

Des Bw

Sep 13, 2023, 5:17:31 AM
to tesseract-ocr
Dear Lorenzo,
Yes, the accuracy is going up and down, and then finally the best (lowest) error rate is selected. 150,000 iterations should actually cover the whole dataset because, if I remember correctly, Tesseract uses the .lstmf files.
I think the process is: text lines --> images --> boxes. The boxes contain information both about the texts and about their (coordinate) positions in the image. Those boxes are then transformed into .lstmf files to be used by Tesseract. There are 64,000 .lstmf files, so I was assuming that each file was run at least twice.
That was my understanding (assumption). But I think you are right that running every line multiple times might be the way to go. I will try to run more iterations then; I will raise it to 200,000 and see if there are better results.

Ali hussain

Sep 13, 2023, 5:19:48 AM
to tesseract-ocr
How is my training going for Bengali? It was nearly good, but I faced spacing problems between words: some words have spaces, but most of them have no space. I think the problem is in the dataset, but I use the default Bengali training dataset from Tesseract, so I am confused and have to explore more. By the way, you can try what Lorenzo Blz said. Actually, training from scratch is harder than fine-tuning, so you can use different datasets to explore. If you succeed, please let me know how you did the whole process. I'm also new in this field.

Des Bw

Sep 13, 2023, 5:33:48 AM
to tesseract-ocr
Yes, we are new to this. I find the instructions (the manual) very hard to follow. The video you linked above was really helpful to get started. My plan at the beginning was to fine-tune the existing .traineddata, but I failed in every possible way to introduce a few new characters into the database. That is why I started from scratch.

Sure, I will follow Lorenzo's suggestion: I will run more iterations and see if I can improve.

Another area we need to explore is the usage of dictionaries. Maybe adding millions of words to the dictionary could help Tesseract. I don't have millions of words, but I am looking into some corpora to get more words into the dictionary.

If this all fails, EasyOCR (and probably other similar open-source packages) is probably our next option to try. Sure, sharing our experiences will be helpful. I will let you know if I make good progress with any of these options.

Ali hussain

Sep 13, 2023, 5:49:20 AM
to tesseract-ocr
I think EasyOCR is best for ID cards and similar images, but for document images like books, Tesseract is better than EasyOCR. I haven't even used EasyOCR; you can try it.

I have added dictionary words, but the result is the same.

What kind of problem did you face when fine-tuning the few new characters you mentioned ("I failed in every possible way to introduce a few new characters into the database")?

Des Bw

Sep 13, 2023, 5:51:29 AM
to tesseract-ocr
The characters are getting missed, even after fine-tuning.
I never made any progress. I tried many different ways, but some specific characters are always missing from the OCR result.

Ali hussain

Sep 13, 2023, 6:02:01 AM
to tesseract-ocr
If some specific characters or words are always missing from the OCR result, then you can apply logic with regular expressions in your application. After OCR, those specific characters or words can be replaced by the correct characters or words that you define in your application with regular expressions. That can solve some major problems.

Des Bw

Sep 13, 2023, 6:18:22 AM
to tesseract-ocr
The problem for regex is that Tesseract is not consistent in its replacements.
Suppose the original English training data doesn't contain the letter /u/. What does Tesseract do when it faces /u/ in actual processing?
In some cases, it replaces it with closely similar letters such as /v/ or /w/; in other cases, it completely removes it. That is what is happening in my case: those characters are sometimes completely removed, and other times they are replaced by closely resembling characters. Because of this inconsistency, applying regex is very difficult.

Ali hussain

Sep 13, 2023, 6:45:54 AM
to tesseract-ocr
I know what you mean, but in some cases it helps me. I have faced specific characters and words that are never recognized by Tesseract, so I use these replacements to fix those characters and words when they come out wrong.

see what I have done: 

   "": "",
    "": " ",
    "": " ",
    জ্া: "জা",
    "  ": " ",
    "   ": " ",
    "    ": " ",
    "্প": " ",
    "": "র্য",
    য: "",
    "": "",
    আা: "",
    ম্ি: "মি",
    স্ু: "সু",
    "হূ ": "হূ",
    "": "",
    র্্: "",
    "চিন্ত ": "চিন্তা ",
    ন্া: "না",
    "সম ূর্ন": "সম্পূর্ণ",

Des Bw

Sep 13, 2023, 7:06:12 AM
to tesseract-ocr
At what stage are you doing the regex replacement?
My process has been: scan (tif) --> ScanTailor --> Tesseract --> PDF

> EasyOCR I think is best for ID cards and similar images, but for document images like books, Tesseract is better than EasyOCR.

How about PaddleOCR? Are you familiar with it?

Ali hussain

Sep 13, 2023, 7:47:05 AM
to tesseract-ocr

After Tesseract recognizes the text from the images, you can apply regex to replace the wrong words with the correct ones.
I'm not familiar with PaddleOCR or ScanTailor either.

Des Bw

Sep 13, 2023, 2:19:53 PM
to tesseract-ocr
I have now reached 200,000 iterations, and the error rate is stuck at 0.46. The result is absolutely trash: nowhere close to the default/Ray's training.

Ali hussain

Sep 13, 2023, 8:21:00 PM
to tesseract-ocr
Try another text dataset. If I am not wrong, I think it's because Ray has not updated the training text dataset for many years, and the trained data now available come from another dataset.

Ali hussain

Sep 13, 2023, 11:58:07 PM
to tesseract-ocr
Are you training from the Tesseract default text data or from your own collected text data?

Dellu Bw

Sep 14, 2023, 12:51:52 AM
to tesser...@googlegroups.com
I was using my own text.


Ali hussain

Sep 14, 2023, 2:33:14 AM
to tesseract-ocr
Have you faced this error: "Can't encode transcription"? If you have, how did you solve it?

Dellu Bw

Sep 14, 2023, 4:46:36 AM
to tesser...@googlegroups.com
I also faced that issue on Windows. Apparently, the issue is related to Unicode. You can try your luck by adding encoding='utf8' to the open() calls in the script.
I ended up installing Ubuntu because I was having too many errors on Windows.
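Concretely, the file reads in the splitting script can be pinned to UTF-8 so the platform's default codec (often cp1252 on Windows) can't mangle Bengali text. A sketch (the helper name is mine):

```python
import os
import tempfile

# Hypothetical helper: always read training text as UTF-8, regardless of
# the platform's default encoding.
def read_lines_utf8(path):
    with open(path, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

# Round-trip check with a Bengali line:
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write("সম্পূর্ণ\n")
    path = tmp.name

print(read_lines_utf8(path))  # ['সম্পূর্ণ']
os.remove(path)
```

The same encoding='utf-8' argument belongs on the open(..., 'w') calls that write the .gt.txt files.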

Ali hussain

Sep 14, 2023, 2:02:22 PM
to tesseract-ocr
I will try some changes. Thanks!

Des Bw

Sep 15, 2023, 6:01:32 AM
to tesseract-ocr
Just saw this paper: https://osf.io/b8h7q

Ali hussain

Sep 15, 2023, 11:42:26 AM
to tesseract-ocr
Yes, I saw that two months ago when I started to learn OCR. It was very helpful at the beginning.

Des Bw

Oct 19, 2023, 1:32:08 PM
to tesseract-ocr
Hi Ali,
How is your training going?
Are you getting good results with training from scratch?

Ali hussain

Oct 20, 2023, 8:22:44 PM
to tesseract-ocr
Not a good result; that's why I have stopped training for now. The default traineddata is overall better than training from scratch.

Des Bw

Oct 21, 2023, 1:36:38 AM
to tesseract-ocr
Yeah, that is what I am getting as well. I was able to add the missing letter, but the overall accuracy became lower than the default model's.

Des Bw

Oct 21, 2023, 1:37:13 AM
to tesseract-ocr
How many lines of text and how many iterations did you use?

Ali hussain

Oct 21, 2023, 9:09:46 PM
to tesseract-ocr
600,000 lines of text, and more than 600,000 iterations. But sometimes I got better results with fewer iterations when fine-tuning, e.g. 100,000 lines of text and only 5,000 to 10,000 iterations.

Des Bw

Oct 22, 2023, 2:27:32 AM
to tesseract-ocr
That is massive data. Have you tried training by cutting the top layer of the network?
I think that is the most promising approach. I was getting really good results with it, but the results don't carry over to scanned documents: I get the best results on the synthetic data. I am now experimenting with the text2image settings to see if it is possible to emulate scanned documents.
I am also suspecting that the '--char_spacing=1.0' setting in our setup is causing more trouble. Scanned documents come with character spacing close to zero. If you are planning to train more, try removing this parameter.

Ali hussain

Oct 22, 2023, 5:07:16 AM
to tesseract-ocr
I haven't tried cutting the top layer of the network. Could you share what you did when cutting the top layer, or a link to a GitHub project?

Ali hussain

Oct 22, 2023, 5:09:25 AM
to tesseract-ocr
You can test by changing '--char_spacing=1.0'. I think it could be hurting the accuracy of the result as well.

Des Bw

Oct 22, 2023, 5:45:40 AM
to tesseract-ocr
This is the command I used to train from a layer:
make training MODEL_NAME=amh START_MODEL=amh APPEND_INDEX=5 NET_SPEC='[Lfx256 O1c105]' TESSDATA=../tesseract/tessdata EPOCHS=3 TARGET_ERROR_RATE=0.0001 training >> data/amh.log &
- I took it from Shreeshrii's tesstrain-JSTORArabic repo: https://github.com/Shreeshrii/tesstrain-JSTORArabic
- The net_spec of ben might not be the same as amh's. Shreeshrii has posted a link to the netspecs of the languages in this forum.

Des Bw

Oct 22, 2023, 5:49:46 AM
to tesseract-ocr

Ali hussain

Oct 22, 2023, 7:02:43 AM
to tesseract-ocr
Thanks. I will try this method as soon as possible.