Hi, I'm new to Tesseract and OCR in general, and I need some help training Tesseract.
Config
Platform: Mac OS X 10.13.3
Tesseract Version: 4.0.0-beta.1
leptonica: 1.75.3
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
Images used
kor.AppleMyungjo.exp1.tif
kor.AppleMyungjo.exp0.tif
Step by step
I'm trying to train (fine-tune) Tesseract to better detect commas (") and dots (.) in Korean, but I'm getting some errors. Here is what I have done so far:
1 - Got the images. I'm using 2 .tif images (both images have only 1 line and a few characters)
2 - Renamed the images to kor.AppleMyungjo.exp0.tif and kor.AppleMyungjo.exp1.tif
3 - Created the .box file for each image ```tesseract [language].[fontname].exp[samplenumber].tif [language].[fontname].exp[samplenumber] -l [language] batch.nochop makebox``` (one of them came out empty)
5 - Created the .tr file for each image ```tesseract kor.AppleMyungjo.exp0.tif kor.AppleMyungjo.exp0 -l kor box.train``` (both images produced an empty .tr file)
6 - Created the unicharset file ```unicharset_extractor [box file 0] [box file 1]...```
7 - Created the font_properties file, which contains only ```AppleMyungjo 0 0 1 0 0```
8 - Cloned the tesseract repo to my Mac, path ```~/projects/tesseract```
9 - Cloned the langdata repo to my Mac, path ```~/projects/langdata```
10 - Found the folder where brew installed my tesseract, path ```/usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata```
11 - Executed the ```~/projects/tesseract/training/tesstrain.sh``` script
```
sudo ~/projects/tesseract/training/tesstrain.sh \
--fonts_dir /Library/Fonts \
--lang kor \
--linedata_only \
--noextract_font_properties \
--exposures "0" \
--langdata_dir ~/projects/langdata \
--tessdata_dir /usr/local/Cellar/tesseract/HEAD-f8e26ee/share/tessdata \
--output_dir ~/tesstutorial/kor \
--fontlist "AppleMyungjo"
```
and got the error:
```
=== Starting training for language 'kor'
mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
mktemp [-d] [-q] [-u] -t prefix
[Wed Apr 4 13:26:24 -03 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=AppleMyungjo --outputbase=/sample_text.txt --text=/sample_text.txt --fontconfig_tmpdir=
Fontconfig error: Cannot load default config file
=== Phase I: Generating training images ===
Rendering using AppleMyungjo
[Wed Apr 4 13:26:25 -03 2018] /usr/local/bin/text2image --fontconfig_tmpdir= --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0 --max_pages=3 --font=AppleMyungjo --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
Fontconfig error: Cannot load default config file
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.d1OKhvnG/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
```
I found that the ```Fontconfig error: Cannot load default config file``` message was caused by ```mktemp``` on Mac, which rejects the ```--tmpdir``` option (see the ```mktemp: illegal option -- -``` line above). I fixed it by changing this line in training/tesstrain_utils.sh:
```diff
- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)
```
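A template path ending in X characters is accepted by both GNU and BSD ```mktemp```, so a variant like the one below might avoid platform-specific flags altogether. This is only a sketch, not the upstream fix; the fallback to ```/tmp``` when ```TMPDIR``` is unset is an assumption.

```shell
# Portable sketch: pass a full template path instead of --tmpdir (GNU-only)
# or -t (which BSD mktemp treats as a prefix rather than a template).
# Falling back to /tmp when TMPDIR is unset is an assumption.
tmp_root="${TMPDIR:-/tmp}"

# Strip a trailing slash (macOS sets TMPDIR with one) before appending the template.
FONT_CONFIG_CACHE=$(mktemp -d "${tmp_root%/}/font_tmp.XXXXXXXXXX")
export FONT_CONFIG_CACHE

echo "fontconfig cache dir: ${FONT_CONFIG_CACHE}"
```

With this form the X characters are replaced directly, so the directory name should not keep a literal ```XXXXXXXXXX``` in it the way the ```-t``` version does on Mac.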
After running the same command again, I get:
```
=== Starting training for language 'kor'
[Wed Apr 4 14:13:38 -03 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=AppleMyungjo --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt --text=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs/sample_text.txt --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs
=== Phase I: Generating training images ===
Rendering using AppleMyungjo
[Wed Apr 4 14:13:40 -03 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/font_tmp.XXXXXXXXXX.X52wexDs --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0 --max_pages=3 --font=AppleMyungjo --text=/Users/fernandogot/projects/langdata/kor/kor.training_text
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
ERROR: /var/folders/zz/zyxvpxvq6csfxvn_n0000000000000/T/tmp.pydbGWuE/kor/kor.AppleMyungjo.exp0.box does not exist or is not readable
```
So I'm stuck on these 2 errors. I do have this file in the folder where I'm running the command, ```~/projects/ocr/trainning/```, but what can I do to make it work?
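In case it helps with reproducing this, here is a minimal sketch of a check for which training files are missing or empty. The ```check_outputs``` function is a hypothetical helper, not part of tesseract, and the file names are the ones from this post.

```shell
# Hypothetical helper (not part of tesseract): report whether each file
# is missing, empty (zero bytes), or present with content.
check_outputs() {
  for f in "$@"; do
    if [ ! -e "$f" ]; then
      echo "missing: $f"
    elif [ ! -s "$f" ]; then
      echo "empty: $f"
    else
      echo "ok: $f"
    fi
  done
}

# Example with the file names from this post; run it from the folder that holds them.
check_outputs kor.AppleMyungjo.exp0.box kor.AppleMyungjo.exp0.tr \
              kor.AppleMyungjo.exp1.box kor.AppleMyungjo.exp1.tr
```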
Thanks for reading all this text and for your time.