Creating Starter Traineddata

Simon

Jan 18, 2024, 5:11:52 AM

to tesseract-ocr

Hello everybody,

I have a question regarding "Fine Tuning for ± a few characters".

In general the instructions on https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters say that you have to make a starter traineddata from the unicharset, but is this also required if I want to fine tune?

Furthermore, I have no idea how to create a starter traineddata. I read the "Creating Starter Traineddata" chapter, but it is still unclear to me how to do it. Since the page is meant to be a tutorial, I expected step-by-step instructions.

Can anyone help me with this?

I am a newbie at Tesseract training, so I would appreciate any help.

Simon

Jan 19, 2024, 4:38:24 AM

to tesseract-ocr

Here is a link to the Uni Mannheim website: COMBINE_LANG_MODEL - generate starter traineddata

Unfortunately, the command doesn't create any files, and after running it I get no feedback on why it didn't work.

Even when I purposely use nonexistent paths, I still get no error message!

PS C:\Windows\system32> combine_lang_model --input_unicharset C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/Latin.unicharset --script_dir C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng --lang eng --wordlist C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/eng.wordlist --output_dir C:/Users/LCAdmin/Documents/FineTuning/output

PS C:\Users\LCAdmin\Documents\FineTuning>

PS C:\Users\LCAdmin\Documents\FineTuning> combine_lang_model --input_unicharset tesstutorial/langdata/Latin.unicharset --script_dir tesstutorial/langdata/eng --lang eng --wordlist asdfasfdef/langdata/eng/eng.wordlist --output_dir output

PS C:\Users\LCAdmin\Documents\FineTuning>

Does anyone know how I can get log messages or any other output that would show why it didn't work?
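One generic way to get some signal out of a command that prints nothing is to run it from a small wrapper that captures the exit code, stdout, and stderr separately. The following is a minimal sketch, not a Tesseract-specific tool: `run_and_report` is a hypothetical helper name, and the stand-in command in the demo only simulates a silent failure.

```python
import subprocess
import sys


def run_and_report(argv):
    """Run a command and return (exit_code, stdout, stderr) without raising."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # The executable itself could not be found (for example, not on PATH).
        return None, "", str(exc)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Stand-in command that exits silently with a non-zero code.
    # Replace this list with the combine_lang_model arguments to inspect a real run.
    code, out, err = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
    print("exit code:", code)
    print("stdout:", repr(out))
    print("stderr:", repr(err))
```

A non-zero exit code combined with empty output suggests the program failed before it could print anything, so the problem may lie in starting the program rather than in its arguments.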

Simon

Jan 19, 2024, 9:27:15 AM

to tesseract-ocr

OK, it turns out I had "no entry point found" errors in the DLL files. Reinstalling Tesseract solved the problem.

Now I have run into another interesting problem.

combine_lang_model --input_unicharset C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/Latin.unicharset --script_dir C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng --lang --output_dir C:/Users/LCAdmin/Documents/FineTuning/output

When I run this command, Tesseract tries to load many unicharsets, and I don't understand why.

What's the reason for loading all these unicharsets:

Failed to load script unicharset from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Latin.unicharset
Failed to load script unicharset from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Inherited.unicharset
Failed to load script unicharset from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Unknown.unicharset
Failed to load script unicharset from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Greek.unicharset
Failed to load script unicharset from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Armenian.unicharset
Failed to load script unicharset from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Arabic.unicharset
Failed to load script unicharset from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Devanagari.unicharset
Failed to load script unicharset from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Gujarati.unicharset
Failed to load script unicharset from:C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata/eng/Bopomofo.unicharset

when I only want to train the English model?
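Judging only from the messages above, each script file seems to be looked up as `<script_dir>/<script>.unicharset`. Note that the command passes `langdata/Latin.unicharset` as input while the log looks for `langdata/eng/Latin.unicharset`. A small sketch like the following can show which of those files a given directory actually contains. The helper name `missing_script_unicharsets` is hypothetical, and whether pointing `--script_dir` at `langdata` instead of `langdata/eng` resolves the errors is an assumption to verify, not something confirmed in this thread.

```python
from pathlib import Path

# Script names copied from the "Failed to load script unicharset" lines above.
SCRIPTS = [
    "Latin", "Inherited", "Unknown", "Greek", "Armenian",
    "Arabic", "Devanagari", "Gujarati", "Bopomofo",
]


def missing_script_unicharsets(script_dir, scripts=SCRIPTS):
    """List the <script>.unicharset files that are absent from script_dir."""
    base = Path(script_dir)
    candidates = [base / f"{name}.unicharset" for name in scripts]
    return [path for path in candidates if not path.is_file()]


if __name__ == "__main__":
    # Paths taken from the command in this post; adjust to your own layout.
    langdata = Path("C:/Users/LCAdmin/Documents/FineTuning/tesstutorial/langdata")
    for directory in (langdata / "eng", langdata):
        missing = missing_script_unicharsets(directory)
        print(f"{directory}: {len(missing)} of {len(SCRIPTS)} files missing")
```

If the parent `langdata` directory reports fewer missing files than `langdata/eng`, that is a hint about where the tool expects the script files to live.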

Another question also arose:
When I want to train some new characters, do I have to add them to Latin.unicharset before I create the starter traineddata, or do I add them to the generated unicharset afterwards?

Dellu Bw

Jan 19, 2024, 10:22:24 AM

to tesser...@googlegroups.com

Yes, you need to add them before you create the starter model. You can edit Latin.unicharset before you run the combine command.


Tom Morris

Jan 19, 2024, 10:37:39 AM

to tesseract-ocr

On Thursday, January 18, 2024 at 5:11:52 AM UTC-5 smon...@gmail.com wrote:

> In general the instructions on https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#fine-tuning-for--a-few-characters say that you have to make a starter traineddata from the unicharset, but is this also required if I want to fine tune?

As I read the instructions, no, you don't. Instead of a starter model which you train from scratch, you are going to use the existing pre-trained model that you want to fine tune.

Tom

Simon

Jan 20, 2024, 7:49:59 AM

to tesseract-ocr

Hey, thanks for the response!

How exactly do I add characters to the unicharset?

Typically the unicharset has to follow a specific pattern (Tesseract-unicharset_uni-mannheim)

Here is an example of the Latin unicharset:

⇆ 0 24,76,166,249,122,224,6,30,136,256 Common 1600 10 1600 ⇆ # ⇆ [21c6 ]

If I want to add, for example, the character "⌖", how would I know which numbers to put in for the glyph information?

Also, what do the "10" and "[21c6]" mean?
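To make the layout of such an entry easier to see, the example line can be split into named fields. This is only a sketch: the field names are assumptions based on one reading of the unicharset format description and should be checked against the official documentation, and `parse_unicharset_line` is a hypothetical helper that handles a regular entry like the one quoted above, not every special case in a real file.

```python
EXAMPLE = "⇆ 0 24,76,166,249,122,224,6,30,136,256 Common 1600 10 1600 ⇆ # ⇆ [21c6 ]"

# Assumed field order; verify against the unicharset format documentation.
FIELDS = [
    "unichar", "properties", "glyph_metrics", "script",
    "other_case", "direction", "mirror", "normed_form",
]


def parse_unicharset_line(line):
    """Split one unicharset entry into named fields plus the trailing comment."""
    tokens = line.split()
    record = dict(zip(FIELDS, tokens[:len(FIELDS)]))
    # The glyph information is a single comma-separated token of ten numbers.
    record["glyph_metrics"] = [int(v) for v in record["glyph_metrics"].split(",")]
    # Everything after the named fields ("# ⇆ [21c6 ]") is kept as a comment.
    record["comment"] = " ".join(tokens[len(FIELDS):])
    return record


if __name__ == "__main__":
    for name, value in parse_unicharset_line(EXAMPLE).items():
        print(f"{name}: {value}")
```

Under this reading, the "10" lands in the `direction` field and "[21c6 ]" is part of the comment after `#`.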

Dellu Bw

Jan 20, 2024, 8:19:33 AM

to tesser...@googlegroups.com

You need to look it up in the Unicode list.


Simon

Jan 20, 2024, 10:00:08 AM

to tesseract-ocr

OK, could you please be a little more precise?
I learned that "[21c6]" is the UTF-16 code. But where do I get the glyph information from, and what does the 10 stand for?

Thanks for your patience, I really appreciate your help :)
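The bracketed value can be reproduced without a lookup table: it matches the hexadecimal Unicode code point of the character (for characters like these it coincides with the UTF-16 code unit). A minimal sketch using only the Python standard library follows; `codepoints_hex` is a hypothetical helper name. The bidirectional class printed at the end comes from Python's own Unicode database, and how it maps to the numeric "10" in the unicharset is an assumption not confirmed in this thread.

```python
import unicodedata


def codepoints_hex(text):
    """Return the hexadecimal Unicode code point of each character in text."""
    return [format(ord(ch), "x") for ch in text]


if __name__ == "__main__":
    print(codepoints_hex("⇆"))  # ['21c6'], the value in the example entry
    print(codepoints_hex("⌖"))  # ['2316'], the character to be added
    for ch in "⇆⌖":
        # Bidirectional class as reported by Python's Unicode database.
        print(ch, unicodedata.bidirectional(ch))
```

This does not provide the glyph information itself; it only covers the code point part of the entry.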
