Training - Finetuning Characters

Dustin Theobald

Oct 1, 2019, 8:39:48 AM
to tesseract-ocr
Hey guys, 

I have a problem when fine-tuning for characters (trying the ± approach from https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00).

(I'm working on a Mac.)

My tesseract version: 

tesseract 5.0.0-alpha-457-gb3b74
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.3 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.6


My bash script looks as follows: https://pastebin.com/XK4CkuM2

When I evaluate via: 

~/../../usr/local/bin/lstmeval \
  --model ~/Desktop/tesstutorial/trainplusminus/eng.traineddata \
  --traineddata ~/Desktop/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/Desktop/tesstutorial/trainplusminus/eng.training_files.txt 2>&1 | grep ±

I don't get a single line recognized correctly.

Does anyone see a mistake in my code?



Dustin Theobald

Oct 1, 2019, 10:23:49 AM
to tesseract-ocr
Changed my evaluation to: 

~/../../usr/local/bin/lstmeval \
  --model ~/Desktop/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/Desktop/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/Desktop/tesstutorial/trainplusminus/eng.training_files.txt 2>&1 | grep ±

Still doesn't work.

Shree Devi Kumar

Oct 1, 2019, 11:40:20 AM
to tesseract-ocr

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e9ba2635-6308-41a8-8150-e5d4da520269%40googlegroups.com.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Shree Devi Kumar

Oct 1, 2019, 11:41:36 AM
to tesseract-ocr

Dustin Theobald

Oct 2, 2019, 5:02:43 AM
to tesseract-ocr
Hey Shree,

Thank you for your help!

This doesn't work on my Mac. Some of the fonts are missing, so I'm only trying to create training data for Arial. When I use 5-makedata-plusminus.sh, it only renders 2 pages, which seems odd.

I'm switching to my Linux machine now, but I have problems installing Tesseract.

I'm following the documentation:

sudo apt install tesseract-ocr

Afterwards, I tried to find the folder in which to run

make
make training
make training-install

But I cannot find that folder (on Ubuntu).

So I cloned the GitHub repository https://github.com/tesseract-ocr/tesseract
to my Desktop and ran ./autogen.sh, ./configure, make, make training, and sudo make training-install.

But then I get the following error when running 5-makedata-plusminus.sh:

/usr/local/bin/text2image: error while loading shared libraries: libtesseract.so.5: cannot open shared object file: No such file or directory
ERROR: Program text2image failed. Abort.

Thank you very much for your help!

Shree Devi Kumar

Oct 2, 2019, 5:24:20 AM
to tesseract-ocr

OR


You seem to be missing some steps there.
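
A note on the error above: "libtesseract.so.5: cannot open shared object file" usually means the library was not installed or the dynamic linker cache was not refreshed. The command list in the question contains neither "sudo make install" nor "sudo ldconfig", so those are probably the missing steps. The sketch below assumes Linux, a checkout of the tesseract repository as the current directory, and the default /usr/local prefix. build_tesseract and lib_in_cache are illustrative helper names, not part of Tesseract.

```shell
# Build and install Tesseract plus the training tools from a source checkout.
# Only defined here, not called; run it from the repository root.
build_tesseract() (
  set -e
  ./autogen.sh
  ./configure
  make
  sudo make install            # installs libtesseract (default prefix: /usr/local)
  sudo ldconfig                # refreshes the dynamic linker cache on Linux
  make training
  sudo make training-install   # installs text2image, lstmtraining, lstmeval, ...
)

# Succeed if the library name in $1 appears in the linker-cache listing on stdin.
lib_in_cache() (
  grep -qF -- "$1"
)

# Usage (Linux):
#   ldconfig -p | lib_in_cache libtesseract.so.5 || echo "try: sudo ldconfig"
```

If the library is still not found after ldconfig, it is worth checking that /usr/local/lib is among the directories the linker searches.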

Shree Devi Kumar

Oct 2, 2019, 5:26:28 AM
to tesseract-ocr
>This doesn't work on my Mac. Some of the fonts are missing, so I'm only trying to create training data for Arial. When I use 5-makedata-plusminus.sh, it only renders 2 pages, which seems odd.

Two pages should be OK, because it uses the training_text from the langdata repo, which is around 80 lines, plus the extra lines added for plusminus.

Dustin Theobald

Oct 2, 2019, 8:26:04 AM
to tesseract-ocr
Hey shree, 

Thank you very much! On Linux it works :)

Best regards,
Dustin


Dustin Theobald

Oct 2, 2019, 10:08:32 AM
to tesseract-ocr
Hey shree, 

Do you know how to manually install the missing fonts on a Mac, as in your documentation for Linux:

sudo apt update
sudo apt install ttf-mscorefonts-installer
sudo apt install fonts-dejavu
fc-cache -vf

Thank you in advance!

Best regards,
Dustin

Shree Devi Kumar

Oct 2, 2019, 10:46:25 AM
to tesseract-ocr
Sorry, don't know how to add those fonts for Mac.

The tutorial uses the following set of fonts:

You could use a similar set of fonts available on the Mac and assign via fontlist. 

Dustin Theobald

Oct 3, 2019, 4:03:16 AM
to tesseract-ocr
Ok. Thank you very much for your help! I'll get it to work somehow! 

Cheers,
Dustin

Shree Devi Kumar

Oct 3, 2019, 4:34:53 AM
to tesseract-ocr

Dustin Theobald

Oct 3, 2019, 7:59:19 AM
to tesseract-ocr
Thank you Shree, 

I'm left with the URW Bookman and Century Schoolbook families (which, it seems, I would have to pay for).
For now I'll stick with Linux. Still, thank you very much, Shree!

I have one more question regarding training: 

I have German and English PDFs (sometimes mixed). I can use multiple languages (deu+eng). If I fine-tune for a character, do I have to fine-tune both language models, eng.lstm and deu.lstm, and combine them when running tesseract, like this:

tesseract ~/Desktop/test.png stdout -l eng_plusminus+deu_plusminus \
--oem 1 \
--psm 3 \
--tessdata-dir ./tesseract/tessdata/best

Thank you in advance!

Cheers, 
Dustin
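
For what it's worth, each name passed to -l is loaded from its own .traineddata file in the tessdata directory, so a fine-tuned model only takes part if its name is listed and its file is present. Whether both eng and deu actually need fine-tuning depends on whether each base model already handles the character, which this thread does not settle. Below is a small pre-flight check, assuming the fine-tuned files are named eng_plusminus.traineddata and deu_plusminus.traineddata as in the command above. join_langs and missing_traineddata are made-up helper names.

```shell
# Join language names with '+' to form the argument for tesseract's -l option.
join_langs() (
  IFS=+
  printf '%s\n' "$*"
)

# Print every requested language whose <lang>.traineddata is missing from DIR.
missing_traineddata() (
  dir=$1
  shift
  for lang in "$@"; do
    [ -f "$dir/$lang.traineddata" ] || printf '%s\n' "$lang"
  done
)

# Usage:
#   dir=./tesseract/tessdata/best
#   missing_traineddata "$dir" eng_plusminus deu_plusminus
#   tesseract ~/Desktop/test.png stdout \
#     -l "$(join_langs eng_plusminus deu_plusminus)" \
#     --oem 1 --psm 3 --tessdata-dir "$dir"
```

An empty result from missing_traineddata means both files were found in that directory.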

Dustin Theobald

Oct 3, 2019, 10:29:20 AM
to tesseract-ocr
I also tried changing the training text to cover Ø:

cat <<EOM >>../langdata/eng/eng.plusminus.training_text
alkoxy of LEAVES Ø1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED Ø85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's Ø1.31 POPSET Os—C(11)
VOLVO abdomen, Ø65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, Ø14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR Ø90° Gogol PARTIALLY BOARDS firm
Email ACTUAL QUEENSLAND Carl's Unruly Ø8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY Ø2.96% Ask! WELL
Lambert own Company View mg \ (Ø7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED Ø500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* Ø18 note: PROBE Jailbroken RAISE Fountains Write Goods (Ø6)
Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § Ø44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL Ø2? activity PROPERTY MAINTAINED
EOM

Evaluation on the training data works, but the model doesn't recognize any line in evalplusminus/eng.training_files.txt.

Shree Devi Kumar

Oct 3, 2019, 10:52:46 AM
to tesseract-ocr
Can all used fonts render Ø? 
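
One way to check this from the command line, assuming fontconfig's fc-list is installed: the pattern ':charset=d8' selects fonts that contain U+00D8 (Ø). The helper below (missing_fonts is a made-up name) reads that family list on stdin and prints the requested families that do not appear in it. The font names in the usage line are examples only. Matching is a case-insensitive substring test, so treat the result as a hint and confirm with the rendered images.

```shell
# Print each wanted font family that does NOT appear in the family list on stdin.
missing_fonts() (
  available=$(cat)
  for font in "$@"; do
    printf '%s\n' "$available" | grep -qiF -- "$font" || printf '%s\n' "$font"
  done
)

# Usage (d8 is the hexadecimal code point of Ø, U+00D8):
#   fc-list ':charset=d8' family | missing_fonts "Arial" "DejaVu Serif"
```

Any family printed by missing_fonts either is not installed or does not contain the glyph, so it should be dropped from the font list or replaced.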

Dustin Theobald

Oct 4, 2019, 3:51:45 AM
to tesseract-ocr
I added "--save_box_tiff" to check whether Ø is rendered correctly for the fonts, which seems to be the case.

Cheers, 
Dustin

Dustin Theobald

Oct 4, 2019, 4:18:01 AM
to tesseract-ocr
OK, when I run make_training_data, it says "Other case ø of Ø is not in unicharset". Might this be a problem, even though Ø is in the unicharset?

Cheers,
Dustin

Shree Devi Kumar

Oct 4, 2019, 4:33:16 AM
to tesseract-ocr
"Other case ø of Ø is not in unicharset" - that's just about the lower and upper case forms of letters.
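
To confirm which forms are actually present, the unicharset can be inspected directly. This is a minimal sketch, assuming the usual plain-text layout (the first line is the entry count, and every later line starts with the character followed by a space). in_unicharset is an illustrative helper name and the path in the usage lines is a placeholder.

```shell
# Succeed if CHAR is listed as an entry in a Tesseract unicharset file.
# Line 1 (the entry count) is skipped; the first field of each later line
# is compared with CHAR as a string.
in_unicharset() (
  file=$1
  c=$2
  export c
  awk 'NR > 1 && ($1 "") == (ENVIRON["c"] "") { found = 1 }
       END { exit !found }' "$file"
)

# Usage:
#   in_unicharset path/to/eng.unicharset 'Ø' && echo "Ø is present"
#   in_unicharset path/to/eng.unicharset 'ø' || echo "ø is absent"
```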

If the fine-tuned traineddata is not recognizing Ø, try plusminus training with more samples and more iterations. Failing that, try replace-layer training.

You can try to base your training on script/Latin.traineddata rather than eng.traineddata.

@theraysmith has given the example of plusminus training in the tutorial. In my experience, it does not work for all languages/characters.

You will need to experiment a little to find the best approach for your use case.
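
Before adding more samples, it helps to know how many the training text already contains. This is a rough check, assuming the appended text lives in ../langdata/eng/eng.plusminus.training_text as in the heredoc earlier in the thread. count_char is a made-up helper name.

```shell
# Print how many times CHAR occurs in FILE (every occurrence, not just lines).
count_char() (
  file=$1
  char=$2
  grep -oF -- "$char" "$file" | wc -l | tr -d ' '
)

# Usage:
#   count_char ../langdata/eng/eng.plusminus.training_text 'Ø'
```

Comparing this number with the count for a character that is already recognized well gives a feel for whether the new character is under-represented.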



Dustin Theobald

Oct 7, 2019, 2:29:32 AM
to tesseract-ocr
Hey Shree, 

thank you again for your help! 

I will experiment a little. Do you have any advice on how to construct the training texts that I'm going to append to the latin/eng.training_text?

Cheers,
Dustin