Whitelisting apostrophes problem

Chris H

Apr 4, 2017, 2:59:13 AM

to tesseract-ocr

I am having trouble getting Tesseract to recognize apostrophes (English right single quotes) once a whitelist is in place.
With an image like the attached one and no whitelist, the apostrophe is output:

$ tesseract --user-words ./.user.words /tmp/test-ocr.png stdout
Doctor‘s Mask

Because of noise (not necessarily in that test image), I tried adding a whitelist of letters and digits plus a hyphen, a comma, and quotes (you can see my many attempts at an apostrophe):

$ cat .config
tessedit_char_whitelist -",'\'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890\u0027\u2019

With the whitelist in place, the apostrophe is missing from the output:
$ tesseract --user-words ./.user.words /tmp/test-ocr.png stdout ./.config
Doctors Mask
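
To see which character the unrestricted run actually emitted, and so which one the whitelist would need to contain, the output can be checked code point by code point. This is a minimal sketch, assuming Python 3 is available. `describe_punctuation` is a hypothetical helper name, and the sample string copies the first output as it appears in this post. It does not test whether the config file interprets `\u` escapes.

```python
import unicodedata


def describe_punctuation(text):
    """Return (char, code point, Unicode name) for every character
    that is neither alphanumeric nor whitespace."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if not ch.isalnum() and not ch.isspace()
    ]


# Sample copied from the no-whitelist output above (as rendered in this post).
sample = "Doctor\u2018s Mask"
for ch, code, name in describe_punctuation(sample):
    print(ch, code, name)
```

If the reported character differs from the ones listed in the whitelist, that mismatch may be worth ruling out first.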

Arch Linux, up to date as of today
tesseract 3.05.00
leptonica-1.74
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.5.2

Any suggestions would be appreciated.