I am having trouble getting Tesseract to output apostrophes (English right single quotation marks) once a character whitelist is in place.
Given something like the attached image, the apostrophe is output when no whitelist is specified:
$ tesseract --user-words ./.user.words /tmp/test-ocr.png stdout
Doctor‘s Mask
Because of noise in other images (not necessarily this test image), I tried adding a whitelist containing letters and digits plus a hyphen, a comma, and quotes (the line below shows my many attempts at including the apostrophe):
$ cat .config
tessedit_char_whitelist -",'\'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890\u0027\u2019
With the whitelist applied, the apostrophe no longer comes out:
$ tesseract --user-words ./.user.words /tmp/test-ocr.png stdout ./.config
Doctors Mask
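
One way to narrow this down is to check which code point the quote in the unrestricted output actually is, and whether that exact character is in the whitelist. The following is only a diagnostic sketch in Python, not part of Tesseract. It assumes the quote in the first output is U+2018 (as it appears above) and that the `\u0027` and `\u2019` sequences in `.config` are read literally rather than expanded. The helper name `code_points` is made up for illustration.

```python
import unicodedata

def code_points(text):
    """Label every character that is not an ASCII letter, digit, or space."""
    labels = []
    for ch in text:
        if ch == " " or (ch.isascii() and ch.isalnum()):
            continue
        labels.append(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
    return labels

# OCR output from the run without a whitelist. The quote is written as an
# escape and is ASSUMED to be U+2018, as it appears in the output above.
ocr_output = "Doctor\u2018s Mask"

# Whitelist value from .config, character for character (raw string).
# ASSUMPTION: "\u0027" and "\u2019" are taken literally, not expanded.
whitelist = r"""-",'\'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890\u0027\u2019"""

# Characters that appear in the OCR output but not in the whitelist.
missing = [ch for ch in ocr_output if ch != " " and ch not in whitelist]

print(code_points(ocr_output))
print([f"U+{ord(ch):04X}" for ch in missing])
```

If the emitted character really is a typographic quote rather than the ASCII apostrophe (U+0027), none of the whitelist entries above would match it under these assumptions, which would be consistent with the quote being dropped. Pasting the real output into `ocr_output` in place of the escaped string gives a more reliable answer.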
Arch Linux, up to date as of today
tesseract 3.05.00
leptonica-1.74
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.29 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.5.2
How can I whitelist the apostrophe so that it still appears in the output? Any suggestions are welcome.