Training - Finetuning Characters

Dustin Theobald

Oct 1, 2019, 8:39:48 AM

to tesseract-ocr

Hey guys,

I have a problem with fine-tuning for individual characters. I am trying the ± approach from https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

(I'm working on a Mac.)

My tesseract version:

tesseract 5.0.0-alpha-457-gb3b74
leptonica-1.78.0
libgif 5.1.4 : libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.3 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6

My bash script looks as follows: https://pastebin.com/XK4CkuM2

When I evaluate via:

~/../../usr/local/bin/lstmeval \
  --model ~/Desktop/tesstutorial/trainplusminus/eng.traineddata \
  --traineddata ~/Desktop/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/Desktop/tesstutorial/trainplusminus/eng.training_files.txt 2>&1 | grep ±

Not a single OCR line comes out correctly.

Does anyone see a mistake in my script?

Dustin Theobald

Oct 1, 2019, 10:23:49 AM

to tesseract-ocr

I changed my evaluation to:

~/../../usr/local/bin/lstmeval \
  --model ~/Desktop/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/Desktop/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/Desktop/tesstutorial/trainplusminus/eng.training_files.txt 2>&1 | grep ±

It still doesn't work.

Shree Devi Kumar

Oct 1, 2019, 11:40:20 AM

to tesseract-ocr

See https://github.com/Shreeshrii/tess4training

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e9ba2635-6308-41a8-8150-e5d4da520269%40googlegroups.com.

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Oct 1, 2019, 11:41:36 AM

to tesseract-ocr

Specifically, see https://github.com/Shreeshrii/tess4training/blob/master/6-plusminus.log#L429

Dustin Theobald

Oct 2, 2019, 5:02:43 AM

to tesseract-ocr

Hey Shree,

Thank you for your help!

This doesn't work on my Mac. Some of the fonts are missing, so I tried to create training data for Arial only. When I run 5-makedata-plusminus.sh, it only renders 2 pages, which seems odd.

I'm switching to my Linux machine now, but I have problems installing Tesseract there.

I'm following the documentation:

sudo apt install tesseract-ocr

After that, I tried to find the folder in which to run

make
make training
make training-install

but I cannot find that folder (on Ubuntu).

So I cloned the GitHub repository https://github.com/tesseract-ocr/tesseract to my Desktop and ran ./autogen.sh, ./configure, make, make training, and sudo make training-install.

But then I get the following error when running 5-makedata-plusminus.sh:

/usr/local/bin/text2image: error while loading shared libraries: libtesseract.so.5: cannot open shared object file: No such file or directory
ERROR: Program text2image failed. Abort.

Thank you very much for your help!


Shree Devi Kumar

Oct 2, 2019, 5:24:20 AM

to tesseract-ocr

1. You could install on Linux using the appropriate package from https://github.com/tesseract-ocr/tesseract/wiki#tesseract-4-packages-with-lstm-engine-and-related-traineddata

OR

2. When building Tesseract from the git source, follow https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation#build-with-training-tools

You seem to be missing some steps there.
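For readers who hit the same libtesseract.so.5 error: it usually means the shared library was not installed or not registered with the dynamic linker. The sketch below lists the usual source-build sequence including the training tools. Compared with the steps listed earlier in the thread, it adds `sudo make install` and `sudo ldconfig`. It assumes a Debian/Ubuntu-like system, that build dependencies are already installed, and that it is run from inside the cloned source tree. The `DRY_RUN` switch and the function names are hypothetical helpers for illustration; the wiki page linked above remains the authoritative list.

```shell
#!/bin/sh
# Sketch: build Tesseract from source together with the training tools.
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or execute one command.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

build_with_training_tools() {
  run ./autogen.sh &&
  run ./configure &&
  run make &&
  run sudo make install &&        # installs libtesseract and the tesseract binary
  run sudo ldconfig &&            # refreshes the linker cache so libtesseract.so is found
  run make training &&
  run sudo make training-install  # installs text2image, lstmtraining, lstmeval, ...
}

build_with_training_tools
```

Running it unchanged prints the seven commands without executing anything, which makes it easy to compare against the steps actually performed.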


Shree Devi Kumar

Oct 2, 2019, 5:26:28 AM

to tesseract-ocr

> This doesn't work on my Mac. Some of the fonts are missing, so I tried to create training data for Arial only. When I run 5-makedata-plusminus.sh, it only renders 2 pages, which seems odd.

2 pages should be OK, because it uses the training_text from the langdata repo, which is around 80 lines, plus the extra lines added for plusminus.

Dustin Theobald

Oct 2, 2019, 8:26:04 AM

to tesseract-ocr

Hey Shree,

thank you very much! On Linux it works :)

Best regards,

Dustin


Dustin Theobald

Oct 2, 2019, 10:08:32 AM

to tesseract-ocr

Hey Shree,

do you know how to manually install the missing fonts on a Mac, as in your documentation for Linux:

sudo apt update
sudo apt install ttf-mscorefonts-installer
sudo apt install fonts-dejavu
fc-cache -vf

Thank you in advance!

Best regards,

Dustin


Shree Devi Kumar

Oct 2, 2019, 10:46:25 AM

to tesseract-ocr

Sorry, I don't know how to add those fonts on a Mac.

The tutorial uses the following set of fonts:

https://github.com/tesseract-ocr/tesseract/blob/master/src/training/language-specific.sh#L42

You could use a similar set of fonts that is available on the Mac and assign them via fontlist.
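One way to see which fonts are usable is to ask text2image for the fonts it finds in a given directory. The sketch below assumes the usual macOS font locations (an assumption; adjust them to your system) and that text2image is on the PATH. The function names are hypothetical.

```shell
#!/bin/sh
# Assumption: these are the common macOS font directories.
mac_font_dirs() {
  printf '%s\n' "/System/Library/Fonts" "/Library/Fonts" "$HOME/Library/Fonts"
}

# List the fonts text2image can see in each existing directory.
list_fonts_for_text2image() {
  if ! command -v text2image >/dev/null 2>&1; then
    echo "text2image not found on PATH" >&2
    return 1
  fi
  mac_font_dirs | while IFS= read -r dir; do
    [ -d "$dir" ] || continue
    echo "== $dir"
    text2image --fonts_dir "$dir" --list_available_fonts 2>&1
  done
}

list_fonts_for_text2image || true
```

The font names printed there are the ones to try in the font list used by the training script, in place of the fonts that are not installed.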


Dustin Theobald

Oct 3, 2019, 4:03:16 AM

to tesseract-ocr

Ok. Thank you very much for your help! I'll get it to work somehow!

Cheers,

Dustin


Shree Devi Kumar

Oct 3, 2019, 4:34:53 AM

to tesseract-ocr

https://apple.stackexchange.com/questions/128091/where-can-i-find-default-microsoft-fonts-calibri-cambria


Dustin Theobald

Oct 3, 2019, 7:59:19 AM

to tesseract-ocr

Thank you Shree,

That leaves URW Bookman and the Century Schoolbook family, which it seems I would have to pay for.

For now I'll stick with Linux. Still, thank you very much, Shree!

I have one more question regarding training:

I have German and English PDFs (sometimes mixed). I can use multiple languages (deu+eng). If I fine-tune for a character, do I have to fine-tune both language models (eng.lstm and deu.lstm) and combine them when running Tesseract, like this:

tesseract ~/Desktop/test.png stdout -l eng_plusminus+deu_plusminus \
  --oem 1 \
  --psm 3 \
  --tessdata-dir ./tesseract/tessdata/best

Thank you in advance!

Cheers,

Dustin


Dustin Theobald

Oct 3, 2019, 10:29:20 AM

to tesseract-ocr

I also tried adapting the training text to the character Ø:

cat <<EOM >>../langdata/eng/eng.plusminus.training_text
alkoxy of LEAVES Ø1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED Ø85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's Ø1.31 POPSET Os—C(11)
VOLVO abdomen, Ø65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
Dresdner Yesterday's Dilated SYSTEMS Your FOUR Ø90° Gogol PARTIALLY BOARDS ﬁrm
Email ACTUAL QUEENSLAND Carl's Unruly Ø8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY Ø2.96% Ask! WELL
Lambert own Company View mg \ (Ø7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
Avoidance Moosejaw pm* Ø18 note: PROBE Jailbroken RAISE Fountains Write Goods (Ø6)
Oberﬂachen source.” CULTURED CUTTING Home 06-13-2008, § Ø44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL Ø2? activity PROPERTY MAINTAINED
EOM

The evaluation on the training data works, but the model doesn't recognize any line in evalplusminus/eng.training_files.txt.

Shree Devi Kumar

Oct 3, 2019, 10:52:46 AM

to tesseract-ocr

Can all used fonts render Ø?
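A quick way to check glyph coverage on a system with fontconfig is to ask fc-list for fonts whose character set contains the code point in question (Ø is U+00D8, ± is U+00B1). The sketch below is one possible check, not part of the tutorial scripts, and the function names are hypothetical. It reports what fontconfig knows about a font, which may not match what the renderer finally draws if font fallback kicks in.

```shell
#!/bin/sh
# Build a fontconfig pattern that matches fonts covering one Unicode code point.
# Argument: the code point in hex, e.g. 00D8 for Ø or 00B1 for ±.
charset_pattern() {
  printf ':charset=%x\n' "0x$1"
}

# Print the family names of all installed fonts that cover the code point.
fonts_covering() {
  if ! command -v fc-list >/dev/null 2>&1; then
    echo "fc-list not found; fontconfig does not seem to be installed" >&2
    return 1
  fi
  fc-list "$(charset_pattern "$1")" family | sort -u
}

fonts_covering 00D8 || true
```

Any font from the training font list that is missing from this output is a candidate for replacement.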


Dustin Theobald

Oct 4, 2019, 3:51:45 AM

to tesseract-ocr

I added "--save_box_tiff" to check whether Ø is rendered correctly for the fonts (which seems to be the case).

Cheers,

Dustin


Dustin Theobald

Oct 4, 2019, 4:18:01 AM

to tesseract-ocr

OK, when I run make_training_data, it says "Other case ø of Ø is not in unicharset". Might this be a problem, even though Ø itself is in the unicharset?

Cheers,

Dustin
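Whether a character really is in a unicharset can be checked directly, since each entry line in a unicharset file begins with the character itself. The sketch below assumes the unicharset has already been extracted from the traineddata (for example with `combine_tessdata -u`, which unpacks the components, including the lstm-unicharset). The helper name and the stand-in file are hypothetical, and the stand-in's extra columns are placeholders rather than real unicharset properties.

```shell
#!/bin/sh
# Print "yes" if the character ($2) is the first field of an entry line
# in the unicharset file ($1), otherwise "no". Line 1 holds the entry count.
in_unicharset() {
  awk -v c="$2" 'NR > 1 && $1 == c { found = 1 } END { print (found ? "yes" : "no") }' "$1"
}

# Tiny stand-in file so the sketch runs without Tesseract installed.
# Only the first column matters here; the rest is placeholder data.
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
3
NULL 0
Ø 0
a 0
EOF

in_unicharset "$tmp" "Ø"
in_unicharset "$tmp" "ø"
rm -f "$tmp"
```

With a real extracted file, pass its path in place of the stand-in.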


Shree Devi Kumar

Oct 4, 2019, 4:33:16 AM

to tesseract-ocr

"Other case ø of Ø is not in unicharset" is just a message about the lower-case and upper-case forms of letters.

If the fine-tuned traineddata does not recognize Ø, try plusminus training with more samples and more iterations. Failing that, try replacing a layer.

You can try basing your training on script/Latin.traineddata rather than eng.traineddata.

@theraysmith gave the plusminus training example in the tutorial. In my experience, it does not work for all languages/characters.

You will need to experiment a little to find the best approach for your use case.
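When experimenting with more samples, it helps to know how many occurrences of the target character the training text actually contains. The sketch below counts them. The helper name is hypothetical, and the two demo lines are taken from the training text posted earlier in the thread. It makes no claim about how many samples are enough.

```shell
#!/bin/sh
# Count occurrences of a literal character ($2) in a text file ($1).
count_char() {
  # grep -o prints every match on its own line; wc -l counts those lines.
  grep -oF -- "$2" "$1" | wc -l | tr -d ' '
}

# Demo on two lines from the training text shown earlier in the thread.
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
alkoxy of LEAVES Ø1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
Avoidance Moosejaw pm* Ø18 note: PROBE Jailbroken RAISE Fountains Write Goods (Ø6)
EOF

count_char "$tmp" "Ø"
rm -f "$tmp"
```

Pointing it at the full training_text file before and after appending new lines shows how much extra material the fine-tuning run actually receives.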


Dustin Theobald

Oct 7, 2019, 2:29:32 AM

to tesseract-ocr

Hey Shree,

thank you again for your help!

I will experiment a little. Do you have any advice on how to construct the training texts that I'm going to append to the Latin/eng training_text?

Cheers,

Dustin
