Potential bug report


Nicholas Rees

Dec 23, 2019, 4:40:33 AM
to tesseract-ocr
Below is a bug report that I'm considering filing. However, I'm not entirely sure that it's a bug, and I'd like someone who knows more about this to check it so that I'm not wasting anyone's time.

The following is the bug report that I'll post if you think it's right.


### Environment

* **Tesseract Version**:
tesseract 4.1.0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.2) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.0.3
Found AVX2
Found AVX
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.3
* **Commit Number**:
From pacman Arch repository (NOT THE AUR)
* **Platform**: Linux NickArch 5.4.3-arch1-1 #1 SMP PREEMPT Fri, 13 Dec 2019 09:39:02 +0000 x86_64 GNU/Linux

### Current Behavior:
Sample Image link: https://imgur.com/a/TNH3tOx

Tesseract misreads certain characters (e.g. F as the yen symbol, or E sometimes as '='). The following command whitelists the characters that can appear on the pages, and almost completely eliminates that problem:

$ tesseract 205c.tif 205c --psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&

However, since the images are formatted like a table, tesseract will not recognize the smaller spaces in the third column. To fix that issue, I can run the following command.

$ tesseract 205c.tif 205c --psm 6 -c tosp_min_sane_kn_sp=0.0

This command completely fixes the spacing problem. However, it does not whitelist the characters, so there are many more recognition errors. I therefore need to apply both -c settings together. I do this by using a config file:

config_file:
tosp_min_sane_kn_sp 0.0
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&

Then I run

$ tesseract 205c.tif 205c --psm 6 config_file
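
For anyone reproducing this, the config file approach above could be set up as in the following sketch. It assumes the file is named config_file and lives in the current directory, as in the report. The quoted heredoc delimiter is only there so that the shell leaves `&`, `=` and `+` untouched, and the tesseract call is guarded so the snippet does nothing beyond writing the file when tesseract or the sample image is missing.

```shell
# Write the two settings, one "name value" pair per line, as in the report.
# The quoted 'EOF' delimiter stops the shell from expanding or interpreting
# anything inside the heredoc (including the trailing '&').
cat > config_file <<'EOF'
tosp_min_sane_kn_sp 0.0
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&
EOF

# Run OCR only if tesseract and the sample image are actually available.
if command -v tesseract >/dev/null 2>&1 && [ -f 205c.tif ]; then
    tesseract 205c.tif 205c --psm 6 config_file
fi
```

Running `cat config_file` afterwards shows exactly what tesseract will be given, which helps rule out a malformed file as the cause.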

Tesseract will always ignore one of these options no matter what I do. Maybe I'm doing it wrong, but I've followed the format of other config files and the documented command-line options. I've also tried running the command with more than one -c option. In both cases I cannot get both config variables to work together.
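
One thing worth ruling out when both settings are passed on the command line: an unquoted `&` at the end of a shell command is a control operator, so the shell runs the command in the background and the `&` never reaches tesseract as part of the whitelist. The sketch below is an assumption about how the combined invocation could be written, not a confirmed fix for the behaviour reported here. It quotes each VAR=VALUE pair and prints the arguments before running anything.

```shell
# Single quotes keep '&', '=' and '+' literal.
whitelist='ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&'

# Build the argument list; each -c takes exactly one VAR=VALUE pair.
set -- tesseract 205c.tif 205c --psm 6 \
    -c "tosp_min_sane_kn_sp=0.0" \
    -c "tessedit_char_whitelist=${whitelist}"

# Show the arguments one per line to check what tesseract will receive.
printf '%s\n' "$@"

# Run only if tesseract and the sample image are available.
if command -v tesseract >/dev/null 2>&1 && [ -f 205c.tif ]; then
    "$@"
fi
```

If the printed list ends with the full whitelist including `&`, the shell is passing both settings through intact, and any remaining conflict lies in tesseract itself.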

### Expected Behavior:
$ tesseract --help-extra
"-c VAR=VALUE    Set value for config variables.
                 Multiple -c arguments are allowed."

### Suggested Fix:
I'm not even sure if this is a bug, but it definitely seems like it to me. I don't think I have the expertise to look into why this isn't working. Maybe I'm wrong here.

Ashwini Nande

Dec 23, 2019, 4:51:46 AM
to tesser...@googlegroups.com
Hi,
$ tesseract 205c.tif 205c --psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789=+&
As far as I know, tessedit_char_whitelist works with Tesseract 3 but not with Tesseract 4.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a552cd6a-2c06-4d79-80ec-a973aaecf2fa%40googlegroups.com.


--
Thanks & regards,
Ashwini 

Nicholas Rees

Dec 23, 2019, 8:20:33 AM
to tesseract-ocr
That was true with 4.0.0, but not with 4.1.0, the version that I am using. From the release notes at https://github.com/tesseract-ocr/tesseract/releases:

zdenop released this on Jul 7

  • Added new renders Alto, LSTMBox, WordStrBox.
  • Added character boxes in hOCR output.
  • Added python training scripts (experimental) as alternative shell scripts.
  • Better support AVX / AVX2 / SSE.
  • Disable OpenMP support by default (see e.g. #1171, #1081).
  • Fix for bounding box problem.
  • Implemented support for whitelist/blacklist in LSTM engine.
  • Improved cmake configuration.
  • Code modernization and improvements.
  • A lot of bug fixes...

Furthermore, if I run that command by itself, without the spacing variable, it whitelists the characters just fine, as I said in the post.

But even if the whitelist weren't working, it would still be a bug, because the release notes say that whitelisting is supported in the LSTM engine.

Should I submit this bug report? Does anyone else think this deserves a bug report?

Nicholas Rees

Dec 23, 2019, 8:24:35 AM
to tesseract-ocr
I think I'm going to submit this report at the end of the day if there are no objections here.

I can't find any documentation suggesting that this shouldn't work, and both variables work as intended individually, just not together. I don't have much coding experience, but I'd be more than happy to try contributing to the project if I can. I haven't looked at any source code yet, and maybe it's even a simple fix that I can do.