Problem with Wordstr *.box files


J Adam Funk

Sep 13, 2019, 12:38:47 PM
to tesseract-ocr
Hi,

I'm using Tesseract 4.0.0 (Ubuntu package version 4.0.0-2) and trying to set up training data. I have a Python tool that draws random words into an image (using PIL) and saves the resulting *.box and *.tif files, using the box file format that holds one line of text per entry. I'm now working through the training process, and the generated unicharset treats the "Wordstr" at the beginning of each line as the text itself. My box files look like this, which I think follows the examples at <https://github.com/tesseract-ocr/tesseract/issues/2357#issuecomment-477239316>:

Wordstr 68 102 1326 1205 0 #COMPASSED PERUVIANS
68 102 1326 1205 0
Wordstr 68 662 1260 465 0 #BIMINI'S
68 662 1260 465 0

and the resulting unicharset file is treating "Wordstr" as the text, so I get this:

9
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
W 5 0,255,0,255,0,0,0,0,0,0 Latin 3 0 3 W # W [57 ]A
o 3 0,255,0,255,0,0,0,0,0,0 Latin 4 0 4 o # o [6f ]a
r 3 0,255,0,255,0,0,0,0,0,0 Latin 5 0 5 r # r [72 ]a
d 3 0,255,0,255,0,0,0,0,0,0 Latin 6 0 6 d # d [64 ]a
s 3 0,255,0,255,0,0,0,0,0,0 Latin 7 0 7 s # s [73 ]a
t 3 0,255,0,255,0,0,0,0,0,0 Latin 8 0 8 t # t [74 ]a
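
The listing above is consistent with the first whitespace-delimited field of each box line being taken as the text. Here is a minimal Python sketch of that reading, next to one that takes the text after the `#` marker. The function names are made up for illustration, and this is not Tesseract's own parsing code:

```python
def unique_chars(strings):
    """Return the distinct characters of the strings, in order of first appearance."""
    seen = []
    for s in strings:
        for ch in s:
            if ch not in seen:
                seen.append(ch)
    return seen


def first_field_labels(box_lines):
    """Label each text-bearing box line with its first field."""
    labels = []
    for line in box_lines:
        fields = line.split()
        # Only lines with a '#' carry a transcription; bare coordinate lines are skipped.
        if "#" in line and fields:
            labels.append(fields[0])
    return labels


def transcription_labels(box_lines):
    """Label each text-bearing box line with the text that follows the '#' marker."""
    return [line.partition("#")[2] for line in box_lines if "#" in line]


box_lines = [
    "Wordstr 68 102 1326 1205 0 #COMPASSED PERUVIANS",
    "68 102 1326 1205 0",
    "Wordstr 68 662 1260 465 0 #BIMINI'S",
    "68 662 1260 465 0",
]

print(unique_chars(first_field_labels(box_lines)))
print(unique_chars(transcription_labels(box_lines)))
```

With the four box lines from this message, the first reading yields W, o, r, d, s, t, which are exactly the non-special entries in the unicharset shown above.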

What am I doing wrong?

Thanks,
Adam

Shree Devi Kumar

Sep 13, 2019, 1:39:55 PM
to tesseract-ocr
Yes, I also noticed this problem recently.

My workaround is to create the unicharset from the training text/ground truth files rather than from box files.

Look at the help for unicharset_extractor.
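
A rough Python sketch of the idea behind this workaround: collect the characters from the ground-truth transcriptions directly, so the box-file keyword never enters the picture. It only gathers the character inventory and does not write a unicharset file. `chars_from_ground_truth` is a hypothetical helper name, skipping whitespace is a choice made for this sketch, and the sample strings are the two transcriptions from the first message:

```python
from collections import Counter


def chars_from_ground_truth(lines):
    """Count every non-whitespace character in the ground-truth lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line.strip() if not ch.isspace())
    return counts


ground_truth = ["COMPASSED PERUVIANS", "BIMINI'S"]
counts = chars_from_ground_truth(ground_truth)

# Sorted character inventory found in the transcriptions.
print(sorted(counts))
```

In practice the lines would come from the ground-truth text files rather than a hard-coded list.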

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/60ec5ff4-2125-4342-bb9e-feae4dfa91fc%40googlegroups.com.

Shree Devi Kumar

Sep 13, 2019, 11:48:31 PM
to tesseract-ocr


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Adam Funk

Sep 17, 2019, 8:53:54 AM
to tesser...@googlegroups.com
Hi again,

This page
<https://www.endpoint.com/blog/2018/07/09/training-tesseract-models-from-scratch>
says that unicharset_extractor is buggy, so I wrote a Python program to
generate the unicharset instead. Does the attached file look right, and
should it work with Tesseract 4.0?
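
One basic check that can be scripted, going only by the listing earlier in the thread: the first line holds the number of entries, and each following line starts with the character it describes. This is a minimal sketch under that assumption. `read_unicharset_entries` is a hypothetical name, and the remaining fields of each entry are not checked:

```python
def read_unicharset_entries(text):
    """Return (declared_count, first field of each entry line)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])
    # The first space-delimited field of an entry is the character itself.
    entries = [ln.split(" ", 1)[0] for ln in lines[1:]]
    return declared, entries


# The unicharset from the first message, used here as sample input.
sample = """9
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1 # Broken
W 5 0,255,0,255,0,0,0,0,0,0 Latin 3 0 3 W # W [57 ]A
o 3 0,255,0,255,0,0,0,0,0,0 Latin 4 0 4 o # o [6f ]a
r 3 0,255,0,255,0,0,0,0,0,0 Latin 5 0 5 r # r [72 ]a
d 3 0,255,0,255,0,0,0,0,0,0 Latin 6 0 6 d # d [64 ]a
s 3 0,255,0,255,0,0,0,0,0,0 Latin 7 0 7 s # s [73 ]a
t 3 0,255,0,255,0,0,0,0,0,0 Latin 8 0 8 t # t [74 ]a
"""

declared, entries = read_unicharset_entries(sample)
print(declared == len(entries), entries)
```

The same function could be pointed at the attached file by reading it first, for example `read_unicharset_entries(open("mem.unicharset", encoding="utf-8").read())`.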

Thanks,
Adam

mem.unicharset