Kannada character mixup

15 views

Skip to first unread message

Sushil Kambampati

unread,

Aug 30, 2016, 7:01:39 AM8/30/16

to tesseract-ocr

I'm trying to pull text out of Kannada documents published by the state of Karnataka and I'm consistently running into the issue that Tesseract recognizes ಕರ್ನಾಟಕ as ಕನಾ೯ಟಕ. In case it's not clear, Tesseract is mistaking ೯ for ರ್ನಾ (the last bit). I guess maybe because the actual characters for a compound construction?

Anyway, as a Tesseract noob, how do I fix this? I've attached the source image file for the text.