Below is a bug report that I'm considering filing. However, I'm not entirely sure that it's a bug, and I'd like someone who knows more about this to check it so that I'm not wasting anyone's time.
The following is the bug report that I'll post if you think it's right.
### Environment
* **Tesseract Version**:
tesseract 4.1.0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.2) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.0.3
Found AVX2
Found AVX
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.3
* **Commit Number**:
From pacman Arch repository (NOT THE AUR)
* **Platform**: Linux NickArch 5.4.3-arch1-1 #1 SMP PREEMPT Fri, 13 Dec 2019 09:39:02 +0000 x86_64 GNU/Linux
### Current Behavior:
Sample image link:
https://imgur.com/a/TNH3tOx
Tesseract interprets certain characters incorrectly (e.g. F as the yen symbol, or E sometimes as '='). The following command whitelists the characters that appear on the pages and almost completely eliminates that problem:
$ tesseract 205c.tif 205c --psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&
However, since the images are formatted like a table, Tesseract does not recognize the smaller spaces in the third column. To fix that issue, I can run the following command:
$ tesseract 205c.tif 205c --psm 6 -c tosp_min_sane_kn_sp=0.0
This command completely fixes the spacing problem. However, it does not whitelist the characters, so there are many more recognition errors. I therefore need both -c settings to apply together. I do this by using a config file:
config_file:
tosp_min_sane_kn_sp 0.0
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&
Then I run
$ tesseract 205c.tif 205c --psm 6 config_file
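For anyone who wants to reproduce this, here is a minimal sketch of how the config file can be created and checked before running. It assumes the file is named `config_file` and sits in the current directory, as in the command above. The guard around the `tesseract` call is my own addition so the snippet does not fail when the binary or the sample image is missing.

```shell
# Write the two variables to a config file, one "name value" pair per line.
# The quoted 'EOF' delimiter stops the shell from interpreting any character
# in the body, so '&', '=' and '+' are written literally.
cat > config_file <<'EOF'
tosp_min_sane_kn_sp 0.0
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&
EOF

# Show exactly what Tesseract will be given.
cat config_file

# Run OCR only if tesseract and the sample image are actually present
# (this guard is an assumption added for the sketch, not part of the report).
if command -v tesseract >/dev/null 2>&1 && [ -f 205c.tif ]; then
    tesseract 205c.tif 205c --psm 6 config_file
fi
```

Printing the file first rules out a malformed config as the cause before looking at Tesseract itself.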
Tesseract always ignores one of the two options, no matter what I do. Maybe I'm doing it wrong, but I've followed the format of other config files and the documented command-line options. I've also tried running the command with more than one -c option. In both cases I cannot get the two config variables to take effect together.
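For reference, this is roughly what the multiple -c attempt looks like, written as a sketch rather than a verbatim copy of what I ran. One thing to be aware of: an unquoted `&` at the end of a shell command is treated as the "run in background" operator and is not passed to the program, so the whitelist is quoted here. I have not confirmed whether that has anything to do with the behavior described above. The `whitelist` variable and the guard around the `tesseract` call are my own additions for illustration.

```shell
# Quote the whitelist so the shell passes '&' through literally instead of
# treating it as the background operator.
whitelist='ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&'

# Build the argument list once so it can be inspected before running.
set -- 205c.tif 205c --psm 6 \
    -c "tessedit_char_whitelist=$whitelist" \
    -c tosp_min_sane_kn_sp=0.0

# Print one argument per line to confirm what tesseract would receive.
printf '%s\n' "$@"

# Run OCR only if tesseract and the sample image are actually present
# (this guard is an assumption added for the sketch).
if command -v tesseract >/dev/null 2>&1 && [ -f 205c.tif ]; then
    tesseract "$@"
fi
```

Printing the arguments makes it easy to see whether both variables reach Tesseract intact.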
### Expected Behavior:
$ tesseract --help-extra
"-c VAR=VALUE Set value for config variables.
Multiple -c arguments are allowed."
### Suggested Fix:
I'm not even sure if this is a bug, but it definitely seems like it to me. I don't think I have the expertise to look into why this isn't working. Maybe I'm wrong here.