Problem in TIFF/Box Generator in jTessBoxEditor


Nasim Ali

Nov 24, 2015, 2:41:35 PM
to tesseract-ocr

When I generate a TIFF from a text file with jTessBoxEditor, all complex conjunct letters in my language (Oriya) are broken down into their component letters in the TIFF image. Here is a screenshot: http://imgur.com/GTY7wt7

The column on the left shows how the text should look, and the column on the right is the output from jTessBoxEditor; each entry on the left corresponds to its counterpart on the right. The generated box file has the correct characters, but because the TIFF is wrong, the image data for those boxes is incorrect. So when I use the generated traineddata file, the simple letters are detected fine but the complex letters are misrecognized.
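
To make the mismatch concrete, here is a minimal sketch (not taken from jTessBoxEditor) that lists the code points stored for one Oriya conjunct. The sample conjunct KA + VIRAMA + SSA and the class name are illustrative assumptions. The text file and the box file hold these three characters in sequence; drawing them as one joined glyph is the job of the text layout engine, which is where the breakage appears.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ConjunctCodePoints {
    // Formats every code point of the input as "U+XXXX".
    static List<String> codePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative sample: ORIYA LETTER KA, ORIYA SIGN VIRAMA, ORIYA LETTER SSA.
        // A shaping engine is expected to draw this sequence as a single conjunct.
        String conjunct = "\u0B15\u0B4D\u0B37";
        System.out.println(codePoints(conjunct)); // prints [U+0B15, U+0B4D, U+0B37]
    }
}
```

If the renderer does not apply the font's shaping rules, each of these code points is drawn separately, which matches the right-hand column of the screenshot.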


Any suggestions?

Nasim Ali

Nov 25, 2015, 12:54:25 PM
to tesseract-ocr
Nguyen (the program's creator) says the problem is with Java, so I've decided to use qtboxcreator to create the boxes and let jTessBoxEditor handle the subsequent work.

Tom Morris

Dec 2, 2015, 9:35:30 PM
to tesseract-ocr
On Wednesday, November 25, 2015 at 12:54:25 PM UTC-5, Nasim Ali wrote:
Nguyen (the program's creator) says the problem is with Java, so I've decided to use qtboxcreator to create the boxes and let jTessBoxEditor handle the subsequent work.

To expand on this, it doesn't look like Java currently renders complex characters in Oriya correctly. Java 9 has integrated the HarfBuzz layout engine, which should fix this problem.


Tom

Quan Nguyen

Dec 5, 2015, 10:02:47 AM
to tesseract-ocr
Thanks Tom for the valuable info.

JDK 9 Beta is available for download if you want to try it out.

N@S1m Ali

Dec 5, 2015, 10:28:32 AM
to tesser...@googlegroups.com

Thank you for the input. I'll surely check it out when I get a break.


Nasim Ali

Dec 21, 2015, 8:19:10 AM
to tesseract-ocr
I think jTessBoxEditor needs to be recompiled with JDK 9. Unfortunately, my knowledge of Java is rather limited. Could you please spare a moment to compile it on JDK 9?

Quan Nguyen

Dec 21, 2015, 11:27:55 AM
to tesseract-ocr
It should run without recompiling. What error message did you see when running on JDK 9?

Nasim Ali

Dec 21, 2015, 6:09:10 PM
to tesseract-ocr
No error messages. jTessBoxEditor behaves just as it does on JRE 8: the complex characters are broken apart and aren't rendered properly, as shown before.

Quan Nguyen

Dec 21, 2015, 7:27:57 PM
to tesseract-ocr
Where did the complex character rendering break? Was it in the text box or in the generated image?

Alternatively, you may want to try the new text2image tool to generate your training images.
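
As a rough sketch of what a text2image call could look like when driven from Java, the helper below only assembles the command line. The flags `--text`, `--outputbase`, `--font` and `--fonts_dir` are text2image options; the file names, font name and fonts directory are placeholders to replace with your own, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class Text2ImageCommand {
    // Builds the argument list for one text2image run.
    static List<String> build(String textFile, String outputBase,
                              String fontName, String fontsDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("text2image");
        cmd.add("--text=" + textFile);         // training text to render
        cmd.add("--outputbase=" + outputBase); // prefix for the .tif and .box output
        cmd.add("--font=" + fontName);         // font name as text2image knows it
        cmd.add("--fonts_dir=" + fontsDir);    // directory searched for fonts
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder values, not taken from this thread.
        List<String> cmd = build("ori.training_text", "ori.ExampleFont.exp0",
                "Example Oriya Font", "/usr/share/fonts");
        System.out.println(String.join(" ", cmd));
        // To actually run it (text2image must be on the PATH):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The printed command can also be pasted into a shell, with quotes added around a font name that contains spaces.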

Nasim Ali

Dec 25, 2015, 9:32:21 AM
to tesseract-ocr
Complex text breaks in the text box as well as in the generated images. I can use text2image, but then I have to use a box editor to fine-tune things manually, and I can't use jTessBoxEditor for that because complex text appears as boxes. I'm currently using qtboxeditor, but it is very unstable.

Quan Nguyen

Dec 25, 2015, 11:09:09 AM
to tesseract-ocr
"complex text appears as boxes": did you select an appropriate font to display your text?

I don't think the program needs to be recompiled. I'm not sure whether JDK 9 has fully integrated the fix.
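
One way to check this outside jTessBoxEditor is a small standalone program that prints the Java version actually in use and draws a test string to a PNG with Java 2D. This is only a sketch: the class name, the output file name `render-check.png`, the default font and the sample conjunct are assumptions, and the font passed on the command line must be an installed font that covers Oriya.

```java
import java.awt.Color;
import java.awt.Font;
import java.awt.FontMetrics;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

final class RenderCheck {
    // Draws one line of black text on a white image sized to fit it.
    static BufferedImage renderLine(String text, Font font) {
        // Measure the text with a throwaway 1x1 image first.
        BufferedImage probe = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
        Graphics2D pg = probe.createGraphics();
        FontMetrics fm = pg.getFontMetrics(font);
        int width = Math.max(1, fm.stringWidth(text)) + 20;
        int height = fm.getHeight() + 20;
        pg.dispose();

        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, width, height);
        g.setColor(Color.BLACK);
        g.setFont(font);
        g.setRenderingHint(RenderingHints.KEY_TEXT_ANTIALIASING,
                RenderingHints.VALUE_TEXT_ANTIALIAS_ON);
        g.drawString(text, 10, 10 + fm.getAscent());
        g.dispose();
        return img;
    }

    public static void main(String[] args) throws Exception {
        // Confirms which runtime is really being used for the test.
        System.out.println("java.version = " + System.getProperty("java.version"));
        // args[0]: font name (assumed installed); args[1]: text to draw.
        String fontName = args.length > 0 ? args[0] : Font.SERIF;
        String text = args.length > 1 ? args[1] : "\u0B15\u0B4D\u0B37";
        BufferedImage img = renderLine(text, new Font(fontName, Font.PLAIN, 48));
        ImageIO.write(img, "png", new File("render-check.png"));
    }
}
```

Running it once under JRE 8 and once under JDK 9 with the same font and comparing the two PNGs should show whether the runtime itself shapes the conjuncts. If the requested font is not found, Java silently substitutes a default font, so empty boxes in the output more likely point to a font problem than to a shaping problem.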

Quan Nguyen

Jun 11, 2016, 8:28:23 PM
to tesseract-ocr
Nasim,

Can you try again using the new program jTessBoxEditorFX? It is jTessBoxEditor rewritten in JavaFX to address the complex-script rendering problems in Java Swing.
