Numerous different bugs while training jpn

Kamui 7

Jan 7, 2021, 11:47:54 AM

to tesseract-ocr

I have a script to train Tesseract. I ran it on Arch Linux, on Debian, and even in a Docker container, and all three produce the same errors. I also checked that the script itself is correct.

Bug 1:

This happens when tesstrain runs text2image. The max pages parameter has no effect: only 4 pages are rendered regardless of the value I pass for maxpages. I even tried hardcoding the value in tesstrain_utils.sh, and the result is the same.

Bug 2:

After it finishes producing those 4 pages, I fine-tune with lstmtraining, and the resulting output is full of "Encoding of string failed!" errors.

Bug 3:

Along with those encoding errors, it also outputs the following text:

"Image too small to scale!! (2x48 vs min width of 3)
Line cannot be recognized!!

Image not trainable"

I will upload my script along with the Dockerfile if anyone wants to take a look.

https://drive.google.com/file/d/1FkW1q1cXwOxY6Yi1A1cMzInbtJa9L01M/view?usp=sharing

Shree Devi Kumar

Jan 7, 2021, 12:01:55 PM

to tesseract-ocr

Old versions of tesstrain.sh used to limit training to 3 pages. Looks like you may have an old version in the path somewhere.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/7a9415d6-4d0c-4333-98c0-2628720661ebn%40googlegroups.com.


Kamui 7

Jan 7, 2021, 12:10:07 PM

to tesseract-ocr

I ran a find command from the root directory to search for the tesstrain script. The only copy it found is the one I pulled from the latest tesseract git repo. My training script calls that specific tesstrain script using a relative path, so it can't be an older version.

Shree Devi Kumar

Jan 7, 2021, 12:28:12 PM

to tesseract-ocr

Your training text file is only 175 lines, so the rendered image fits in 4 pages. You need to use a larger text if you want more pages.

Also check that your fonts support both English and Japanese as the text seems to have samples of both languages.
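
Both checks can be done quickly before rendering. The following is a minimal sketch, not part of tesstrain: the function name summarize_training_text is made up for illustration, and taking the first word of each character's Unicode name is only a rough approximation of its script. It counts the non-empty lines in a training text and tallies which scripts appear, which shows whether the text is short and whether it mixes Latin and Japanese characters.

```python
import unicodedata
from collections import Counter


def summarize_training_text(text):
    """Return (number of non-empty lines, Counter of rough script labels).

    Hypothetical helper for illustration; not part of Tesseract or tesstrain.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    scripts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "")
        # Assumption: the first word of the Unicode character name is a
        # usable script label, e.g. "LATIN", "HIRAGANA", "KATAKANA", "CJK".
        scripts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return len(lines), scripts


if __name__ == "__main__":
    sample = "abc\nこんにちは\n\n漢字"
    line_count, scripts = summarize_training_text(sample)
    print(line_count, dict(scripts))
```

To run it on a real file, read the training text with open(path, encoding="utf-8").read() and pass the result in; the path depends on your own setup.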

Kamui 7

Jan 7, 2021, 2:16:36 PM

to tesseract-ocr

Looks like that fixed bug #1. Now it is able to successfully create 400 pages. Do you have any ideas as to why the other 2 errors are occurring?

Shree Devi Kumar

Jan 8, 2021, 1:58:27 AM

to tesseract-ocr

Are any of these vertical fonts?

Encoding errors can occur if characters in the training text are not in the unicharset.
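
One way to test this is to compare the training text against the unicharset directly. The sketch below rests on a few assumptions about the unicharset file layout: the first line holds the entry count, each later line starts with the unichar followed by a space, and the entry NULL stands for the space character. The function names load_unichars and missing_chars are hypothetical. It compares one code point at a time, so unicharset entries made of several code points are not handled.

```python
def load_unichars(unicharset_lines):
    """Collect the unichar strings from the lines of a unicharset file.

    Assumes the first line is the entry count and that each later line
    begins with the unichar, followed by a space and its properties.
    """
    unichars = set()
    for line in unicharset_lines[1:]:
        token = line.rstrip("\n").split(" ", 1)[0]
        if not token:
            continue  # skip blank lines
        # Assumption: the entry "NULL" represents the space character.
        unichars.add(" " if token == "NULL" else token)
    return unichars


def missing_chars(text, unichars):
    """Return the sorted characters of text that have no unicharset entry.

    Whitespace is skipped; the check is per code point only.
    """
    return sorted({ch for ch in text if not ch.isspace() and ch not in unichars})


if __name__ == "__main__":
    # Tiny in-memory example; the property columns are placeholders.
    lines = ["4\n", "NULL 0 Common 0\n", "a 3 x\n", "あ 1 x\n", "漢 1 x\n"]
    print(missing_chars("a あ漢字", load_unichars(lines)))
```

Any character this reports is a candidate cause of an "Encoding of string failed!" message, although other causes are possible.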

Kamui 7

Jan 9, 2021, 11:49:02 AM

to tesseract-ocr

How do I create my own custom unicharset file? The tesstrain script seems to generate one from the training text, but I want to pass in my own unicharset file.

shree

Jan 11, 2021, 12:30:39 PM

to tesseract-ocr

Please see https://github.com/tesseract-ocr/tesseract/issues/3001 for updates

Kamui 7

Jan 12, 2021, 1:17:14 PM

to tesseract-ocr

Great! The PR that you submitted fixed issue #3. All that's left is the encoding string problem. I wonder if it's a problem with the unicharset extractor?

Shree Devi Kumar

Jan 13, 2021, 1:48:52 AM

to tesseract-ocr

Unicharset is extracted from training text, because those are the samples that will be used for training.

Why do you want to use a different unicharset?

Kamui 7

Jan 13, 2021, 1:55:27 PM

to tesseract-ocr

Because I'm getting encoding errors. I checked the generated unicharset and it is missing characters, so I would like to create my own unicharset that contains all of them.
