Editing Box files

anne

Apr 28, 2019, 11:51:18 PM
to tesseract-ocr
Hello, I want to train Tesseract on a new script, but I'm confused about editing the box file. The script looks like this
and after generating the box file (with eng as basis) I got this
Now my problem is that I'm not sure how to edit/change the values. Do I need to replace the English letters with my language's symbols? Or do I just need to edit the numbers? Thank you in advance.

Shree Devi Kumar

Apr 29, 2019, 2:45:02 AM
to tesser...@googlegroups.com
It means that the font you are using has mapped English letters to these symbols. If you view the box file in that same font the symbols should show. Possibly the numbers for coordinates will also show up as symbols, based on the mapping.
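For readers with the same question, here is a minimal sketch of what editing a box file amounts to. It assumes the usual makebox layout of one glyph per line, `symbol left bottom right top page`, with coordinates measured from the bottom-left of the image. The Latin-to-Baybayin mapping is purely hypothetical: the real correspondence depends on how the font in question maps its glyphs. Normally only the symbol is changed; the numbers need editing only when a box is drawn around the wrong area.

```python
from typing import Dict, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right: the last five fields are always integers,
    # whatever is left at the front is the symbol.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def remap_symbol(box: Box, mapping: Dict[str, str]) -> Box:
    # Replace only the symbol; the coordinates are left untouched.
    return box._replace(symbol=mapping.get(box.symbol, box.symbol))


def format_box(box: Box) -> str:
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


# HYPOTHETICAL mapping, for illustration only. Check it against your font.
LATIN_TO_BAYBAYIN = {"a": "\u1700", "i": "\u1701", "u": "\u1702"}

if __name__ == "__main__":
    original = "a 734 494 751 519 0"
    print(format_box(remap_symbol(parse_box_line(original), LATIN_TO_BAYBAYIN)))
```

The box file should be saved as UTF-8 after such a change so that the Unicode symbols survive.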

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/4b518a58-19f7-41ca-95e2-e42a27654dc8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Apr 29, 2019, 2:49:03 AM
to tesser...@googlegroups.com
I assumed that you used text2image to generate the box/tiff pairs using a font for your `language`.

anne

Apr 29, 2019, 4:12:39 AM
to tesseract-ocr
I used this command:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
from https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files
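As an illustration of how the placeholders in that command fill in, the sketch below builds the argument list from the `[lang].[fontname].exp[num]` naming pattern. The language code `bay` and the font name `NotoSansTagalog` are made-up example values, not anything used in this thread.

```python
from typing import List


def makebox_command(lang: str, fontname: str, num: int) -> List[str]:
    # The image and the output base share the same [lang].[fontname].exp[num] stem.
    base = f"{lang}.{fontname}.exp{num}"
    return ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(" ".join(makebox_command("bay", "NotoSansTagalog", 0)))
```

The list can be handed to `subprocess.run(...)` if Tesseract is installed and the .tif file exists in the working directory.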

Shree Devi Kumar

Apr 29, 2019, 5:03:39 AM
to tesser...@googlegroups.com
Tesseract generates Unicode output after recognizing.

Are there any Unicode code points for the symbols that you have used?

How do you type out those symbols?

anne

Apr 30, 2019, 2:09:17 AM
to tesseract-ocr
I found the Unicode range for Baybayin (which is the language), which is this
As for typing out those symbols, there are only a few keyboards that support it because there is not enough research on the language as of now.
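Since few keyboards support the script, one way to get the characters without typing them is to generate them from their code points. The sketch below assumes the range in question is the Unicode "Tagalog" block (U+1700 to U+171F), where Baybayin is encoded. Which code points have names depends on the Unicode version bundled with your Python, so unassigned ones are skipped.

```python
import unicodedata
from typing import Iterable, List, Tuple

# Unicode "Tagalog" block, which encodes Baybayin.
TAGALOG_BLOCK = range(0x1700, 0x1720)


def assigned_characters(block: Iterable[int]) -> List[Tuple[int, str, str]]:
    result = []
    for cp in block:
        # name() returns the default (None) for code points this Python does not know.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            result.append((cp, chr(cp), name))
    return result


if __name__ == "__main__":
    for cp, ch, name in assigned_characters(TAGALOG_BLOCK):
        print(f"U+{cp:04X} {ch} {name}")
```

The printed characters can be copied into a training text or into a box file, provided a font covering the block is installed for viewing them.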

Shree Devi Kumar

Apr 30, 2019, 8:13:39 AM
to tesser...@googlegroups.com
I found a couple of Unicode fonts that can display the Tagalog range:
 "Quivira"
 "Noto Sans Tagalog"

Using these, it will be possible to train for Baybayin.
Does the language use any punctuation and numbers?
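To make the font suggestion concrete, here is a rough sketch of a text2image call that renders a training text in one of those fonts into a box/tiff pair. The file names, the output base `bay.NotoSansTagalog.exp0`, and the fonts directory are assumptions for illustration; only the font name comes from the list above.

```python
import shlex
from typing import List


def text2image_command(text_path: str, outputbase: str, font: str, fonts_dir: str) -> List[str]:
    # Each option is passed as a single --name=value argument.
    return [
        "text2image",
        f"--text={text_path}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


if __name__ == "__main__":
    # Hypothetical paths and output base, for illustration only.
    cmd = text2image_command(
        "bay.training_text",
        "bay.NotoSansTagalog.exp0",
        "Noto Sans Tagalog",
        "/usr/share/fonts",
    )
    # shlex.join quotes the argument that contains spaces.
    print(shlex.join(cmd))
```

Because the text is rendered from Unicode input, the resulting box file already contains the correct symbols and needs no manual remapping.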

Shree Devi Kumar

Apr 30, 2019, 1:23:06 PM
to tesser...@googlegroups.com

Christine Anne Catubig

May 2, 2019, 2:28:12 AM
to tesser...@googlegroups.com
No, I'm not including punctuation and numbers. Thank you so much, though, for the box/tiff pairs. You're a lifesaver. Thank you.