Error at training 4.0

Fanatico

Apr 4, 2018, 4:06:07 PM
to tesseract-ocr
Hi, I'm new to Tesseract and OCR in general, and I need some help training Tesseract.

Config
Platform: Mac OS X 10.13.3
Tesseract Version: 4.0.0-beta.1
leptonica: 1.75.3
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

Images used
kor.AppleMyungjo.exp1.tif
kor.AppleMyungjo.exp0.tif

Step by step
I'm trying to fine-tune Tesseract to better detect quotation marks (") and periods (.) in Korean, but I'm getting some errors. Here is what I have done so far:

1 - Got the images. I'm using two .tif images (each has only one line and a few characters)
2 - Renamed the images to kor.AppleMyungjo.exp0.tif and kor.AppleMyungjo.exp1.tif
3 - Created the .box file for each image with ```tesseract [language].[fontname].exp[samplenumber].tif [language].[fontname].exp[samplenumber] -l [language] batch.nochop makebox``` (one of them came out empty)
4 - Corrected the .box files using the site https://pp19dd.com/tesseract-ocr-chopper/ (I just pasted the positions into the file)
5 - Created the .tr file for each image with ```tesseract kor.AppleMyungjo.exp0.tif kor.AppleMyungjo.exp0 -l kor box.train``` (both images produced an empty .tr file)
6 - Created the unicharset file with ```unicharset_extractor [box file 0] [box file 1]...```
7 - Created the font_properties file, which contains only ```AppleMyungjo 0 0 1 0 0```
8 - Cloned the tesseract repo to my Mac at ```~/projects/tesseract```
9 - Cloned the langdata repo to my Mac at ```~/projects/langdata```
10 - Found the folder where brew installed my tessdata, ```/usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata```
11 - Ran the ```~/projects/tesseract/training/tesstrain.sh``` script:


```
sudo ~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts  \
  --lang kor \
  --linedata_only  \
  --noextract_font_properties  \
  --exposures "0"    \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir /usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata \
  --output_dir ~/tesstutorial/kor \
  --fontlist "AppleMyungjo"
```
and got the error:
```
=== Starting training for language 'kor'
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
[Wed Apr 4 13:26:24 -03 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=AppleMyungjo --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=
Fontconfig error: Cannot load default config file

=== Phase I: Generating training images ===
Rendering using AppleMyungjo
[Wed Apr 4 13:26:25 -03 2018] /usr/local/bin/text2image --fontconfig_tmpdir= --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0 --max_pages=3 --font=AppleMyungjo --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
Fontconfig error: Cannot load default config file
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
```

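I found that the ```Fontconfig error: Cannot load default config file``` message was caused by mktemp on the Mac, which rejects the ```--tmpdir``` option (see the ```mktemp: illegal option -- -``` line above), so the fontconfig temp dir ended up empty. I fixed it by changing this line: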

training/tesstrain_utils.sh
```diff
- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
```
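As a side note on the change above: with the BSD mktemp on macOS, ```-t``` treats its argument as a prefix, which is why the directory in the next log is named ```font_tmp.XXXXXXXXXX.X52wexDs```. A minimal sketch of a form that should behave the same with both GNU and BSD mktemp is to pass a full path template instead. The helper name ```make_font_cache_dir``` is hypothetical and is not part of tesstrain_utils.sh:

```shell
#!/bin/sh
# Hypothetical helper (not from tesstrain_utils.sh): create the fontconfig
# cache directory with a full path template, avoiding the GNU-only --tmpdir
# option and the BSD-specific meaning of -t.
make_font_cache_dir() {
    mktemp -d "${TMPDIR:-/tmp}/font_tmp.XXXXXXXXXX"
}

FONT_CONFIG_CACHE=$(make_font_cache_dir) || exit 1
export FONT_CONFIG_CACHE
echo "fontconfig cache dir: $FONT_CONFIG_CACHE"
```

The trailing run of X characters is replaced by mktemp, so the directory name no longer contains a literal ```XXXXXXXXXX```.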
After running the same command again, I get:

```
=== Starting training for language 'kor'
[Wed Apr 4 14:13:38 -03 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=AppleMyungjo --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt --text=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs

=== Phase I: Generating training images ===
Rendering using AppleMyungjo
[Wed Apr 4 14:13:40 -03 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0 --max_pages=3 --font=AppleMyungjo --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
```

So I'm stuck on these two errors. I do have this file in the folder where I'm running the command, ```~/projects/ocr/trainning/```. What can I do to make it work?
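Note that the path in the error points into the temporary directory under ```/var/folders/.../T/```, not into ```~/projects/ocr/trainning/```. To tell apart "file missing" from "file exists but is empty" (as with the empty .box and .tr files in steps 3 and 5), a small check like the following may help. This is only a sketch, and the function name ```report_bad_files``` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print every given file that is missing, unreadable,
# or empty. Returns non-zero if at least one such file was found.
report_bad_files() {
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f"
            status=1
        elif [ ! -s "$f" ]; then
            echo "empty: $f"
            status=1
        fi
    done
    return "$status"
}
```

For example, from the working folder: ```report_bad_files kor.AppleMyungjo.exp*.box kor.AppleMyungjo.exp*.tr```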


Thanks for reading all this text and for your time.

ShreeDevi Kumar

Apr 4, 2018, 9:53:21 PM
to tesser...@googlegroups.com
Training Tesseract 4.0.0 is different from the process for 3.0x.

Training using images is not supported for Tesseract 4.0.0.



Fanatico

Apr 5, 2018, 1:27:08 PM
to tesseract-ocr
Thanks for the quick response. I did not see this part in the documentation...

My problem is that in the image "kor.AppleMyungjo.exp0.tif" Tesseract recognizes nothing, so the box file is empty, and in the image "kor.AppleMyungjo.exp1.tif" it does not recognize the final quotation mark (") and period (.). Can I fix this by running some tests with fonts?


kor.AppleMyungjo.exp1.tif
kor.AppleMyungjo.exp0.tif