Whitelisting apostrophes problem

Chris H

Apr 4, 2017, 2:59:13 AM
to tesseract-ocr
I am having trouble whitelisting and OCRing apostrophes (typographic single quotes in English text).
Given something like the attached image, without specifying a whitelist, apostrophes are output:

$ tesseract --user-words ./.user.words /tmp/test-ocr.png stdout
Doctor‘s Mask
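
Before building a whitelist, it may help to confirm which code point Tesseract actually emits, since ASCII `'` (U+0027), `‘` (U+2018), and `’` (U+2019) look alike but are distinct characters. The following is a minimal Python sketch for that check; the function name `describe_non_alnum` is hypothetical, and the sample strings are illustrative stand-ins rather than captured Tesseract output.

```python
import unicodedata


def describe_non_alnum(text):
    """Return (char, 'U+XXXX', Unicode name) for every character
    that is not a letter, digit, or whitespace."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if not ch.isalnum() and not ch.isspace()
    ]


# Three apostrophe-like variants, written as escapes so they cannot be
# confused visually. These are illustrative inputs, not real OCR output.
for sample in ("Doctor\u0027s Mask", "Doctor\u2018s Mask", "Doctor\u2019s Mask"):
    print(describe_non_alnum(sample))
```

In practice the text to inspect would be the command's stdout (for example, read via `sys.stdin.read()` after piping the `tesseract ... stdout` output into the script). Whichever code point shows up is the one the whitelist needs to contain.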

But due to noise (not necessarily on that test image), I have tried implementing a whitelist with letters and numbers, as well as a hyphen, comma, and quotes (you can see my many attempts at apostrophes):

$ cat .config
tessedit_char_whitelist -",'\'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890\u0027\u2019

With the whitelist in place, the apostrophe is missing from the output:

$ tesseract --user-words ./.user.words /tmp/test-ocr.png stdout ./.config
Doctors Mask
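
One thing that may be worth ruling out is the escape notation in the config file. The sketch below assumes (this is not confirmed by anything in this post) that Tesseract reads the `tessedit_char_whitelist` value literally and does not expand `\u0027`-style escapes or treat `\` as an escape character. Under that assumption, the config file would need to contain the apostrophe characters themselves, encoded as UTF-8. The helper names `build_whitelist` and `write_config` and the output file name `whitelist.config` are hypothetical.

```python
import string
from pathlib import Path

# Apostrophe-like characters, given as Python escapes so that the *file*
# receives the literal characters, not the six-character text "\u2019".
APOSTROPHES = "\u0027\u2018\u2019"  # ASCII apostrophe, left quote, right quote


def build_whitelist():
    """Hyphen, double quote, comma, apostrophes, then ASCII letters and digits."""
    return '-",' + APOSTROPHES + string.ascii_letters + string.digits


def write_config(path):
    """Write a one-line Tesseract config file, encoded as UTF-8, and return the line."""
    line = f"tessedit_char_whitelist {build_whitelist()}\n"
    Path(path).write_text(line, encoding="utf-8")
    return line


if __name__ == "__main__":
    # Hypothetical file name; pass it as the last argument to tesseract,
    # in place of ./.config in the command above.
    print(write_config("whitelist.config"), end="")
```

Whether Tesseract 3.05.00 then honors non-ASCII quote characters in a whitelist is a separate question that this sketch does not answer; it only removes the escape sequences as a possible cause.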

Arch Linux, up to date as of today
tesseract 3.05.00
 leptonica-1.74
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.5.2

Any suggestions on how to keep the apostrophe while using a whitelist?
Attachment: test-ocr.png