Many 'question mark' chars in recognized text

Salvo Piazza

Oct 16, 2014, 4:18:58 AM
to tesser...@googlegroups.com
Hi all,
I've written a small program that extracts text from an image with tesseract 3.0.2:

Tesseract instance = Tesseract.getInstance();
instance.setDatapath(currentDir);
instance.setLanguage("ita");
String returner = instance.doOCR(new File(filename));
It works, but the extracted text contains many question mark characters ('?').

For example, the word "fluidi" is recognized as "?uidi", and there are many more cases like this.

Does anyone have tips for fixing this behaviour?

Thanks in advance,
Salvo.

zdenko podobny

Oct 16, 2014, 3:46:31 PM
to tesser...@googlegroups.com
"fl" is recognized as a ligature in English, so the same issue could occur in Italian. If it is replaced with '?', I would guess you have a problem with Unicode. Can you check it with the tesseract executable?

Zdenko
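
The ligature explanation above can be reproduced without any OCR at all. The following is a minimal sketch, assuming the engine emitted the single character U+FB02 (LATIN SMALL LIGATURE FL) for "fl" and that the text then passed through a charset that cannot represent it. The class and method names are hypothetical, and US-ASCII is only a stand-in for whatever encoding the application actually used.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class LigatureDemo {
    // Assumption: the OCR output contains U+FB02 (the "fl" ligature) instead of 'f' + 'l'
    static final String OCR_WORD = "\uFB02uidi";

    // Encode and decode with US-ASCII; getBytes replaces unmappable characters with '?'
    static String roundTripAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // NFKC compatibility normalization expands the ligature into the plain letters "fl"
    static String expandLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(roundTripAscii(OCR_WORD));  // prints ?uidi
        System.out.println(expandLigatures(OCR_WORD)); // prints fluidi
    }
}
```

If the application needs plain letters rather than ligature characters, normalizing the recognized string as in expandLigatures is one option; it does not replace fixing the encoding itself.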

Greg Dunkel

Oct 16, 2014, 4:06:02 PM
to tesser...@googlegroups.com

Many OCR programs have trouble with ligatures.

Salvo Piazza

Oct 17, 2014, 6:07:26 AM
to tesser...@googlegroups.com
Hi Zdenko,
thanks for your response.

I'm a beginner with tesseract, so could you tell me how I can check it? (I use a Linux version of tesseract.)

Thanks,
Salvo.

Rick Leir

Oct 17, 2014, 9:30:54 AM
to tesser...@googlegroups.com
On Linux, try YAGF; it is a GUI front end for Tesseract. As zdenop said, you have a Unicode problem. You need to use UTF-8 for strings.

zdenko podobny

Oct 17, 2014, 10:43:21 AM
to tesser...@googlegroups.com
OCR a test image with your app and store the result in a text file. Then OCR the same image with the tesseract executable (its output goes to a text file by default) and compare the results.
If the output from the tesseract executable is OK but the output from your app is wrong (e.g. it contains only ASCII letters), the problem is within your app (e.g. it does not handle Unicode strings correctly).

Zdenko
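
The comparison described above could be sketched roughly as follows. On the command line, something like `tesseract image.png cli -l ita` writes the recognized text to cli.txt (the image name is hypothetical). The small program below then reads that file and the app's own output file as UTF-8 and reports the first position where they differ. The class name, method names, and file arguments are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CompareOcrOutput {
    // Index of the first differing char, or -1 if both strings are equal
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    // Describe the char at index i as U+XXXX, or "<end>" if i is past the end
    static String describe(String s, int i) {
        return i < s.length() ? String.format("U+%04X", (int) s.charAt(i)) : "<end>";
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: CompareOcrOutput <app-output.txt> <cli-output.txt>");
            return;
        }
        // Both files are read as UTF-8; the paths are supplied by the caller
        String app = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        String cli = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        int i = firstDifference(app, cli);
        if (i < 0) {
            System.out.println("Outputs are identical");
        } else {
            System.out.println("First difference at index " + i
                    + ": app=" + describe(app, i) + " cli=" + describe(cli, i));
        }
    }
}
```

Printing code points instead of the characters themselves avoids a second encoding problem on a console that is not set to UTF-8. A result such as app=U+003F against cli=U+FB02 would point at the app replacing the ligature with '?'.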

Quan Nguyen

Oct 30, 2014, 7:24:35 PM
to tesser...@googlegroups.com
I suspect you saved the Unicode text output with the wrong character encoding. Try UTF-8 encoding when you save the file. Tesseract may misrecognize characters, but it rarely puts question marks in their place.
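
One way to follow this advice is to name the charset explicitly whenever the recognized string is written or read, instead of relying on the platform default. This is a minimal sketch under that assumption; the class and method names are hypothetical, and the literal string stands in for whatever doOCR returned.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveOcrText {
    // Encode explicitly as UTF-8 rather than with the platform default charset
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Write the recognized text to a file as UTF-8
    static void saveUtf8(Path target, String text) {
        try {
            Files.write(target, toUtf8(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the file back, again naming UTF-8 explicitly
    static String loadUtf8(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("ocr", ".txt");
        String text = "\uFB02uidi"; // stand-in for the string returned by doOCR
        saveUtf8(tmp, text);
        System.out.println(loadUtf8(tmp).equals(text)); // prints true
        Files.delete(tmp);
    }
}
```

The same care applies when viewing the file: an editor or terminal that assumes a different encoding can still display the ligature incorrectly even though the bytes on disk are right.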