txt2img: image/box files generation utility


Anton Zorin

Apr 5, 2012, 2:13:41 PM
to tesseract-ocr
Hi All,

I think this tool could be useful for some of you. It generates
training images along with box files, using the contents of a text edit
control as input, so simple formatting is possible and font
antialiasing can be turned on or off. It is currently compiled only for
Windows (an installer is available on the downloads page). Any comments,
remarks, or bug reports are welcome.

Thanks in advance.

Anton Zorin

Apr 5, 2012, 2:17:16 PM
to tesseract-ocr
Forgot to post a link. So here it is: http://code.google.com/p/txt2img/

Falke

Apr 8, 2012, 8:38:21 AM
to tesseract-ocr
Hi, Anton!

Looks very interesting.

Could you also do a utility that is the REVERSE of this ? :-))

I have been looking high and low for something that takes as input a
training bitmap+boxfile pair, and allows you to drag-mouse-select
multiple boxes (corresponding to words, sentences, etc.), and mark
them (annotate them) as a particular style (italics, bold, font1,
font2, etc.). There are already some box editors out there that
annotate -- but none that let you mouse-drag-select whole regions (sets
of boxes) in the bitmap display window and annotate all the selected
boxes at once.

(Discrete selection would, of course, be even better)


On Apr 5, 2:17 pm, Anton Zorin <zorinan...@googlemail.com> wrote:
> forgot to post a link. So here it is: http://code.google.com/p/txt2img/
>

Anton Zorin

Apr 8, 2012, 9:44:32 AM
to tesser...@googlegroups.com
Hi Falke,

Could you please be more specific:

Do you want to specify an image + box file pair and, based on those, fill the contents of the text edit so that you can play with text formatting/styles? If so, it would be a bit complicated, since it is a challenge to extract formatting information from an image file.

Thanks!

Regards,

Anton Zorin


Falke

Apr 8, 2012, 11:49:27 AM
to tesseract-ocr








On Apr 8, 9:44 am, Anton Zorin <zorinan...@googlemail.com> wrote:
> Hi Falke,
>
> Could you please be more specific:
>
> you want to specify image + box file pair and based on those, fill the
> contents of text edit to enable playing with text formatting/styles? If so,
> it would be a bit complicated since it is a challenge to extract formatting
> information from an image file.




Let me work backwards, in my reply:




No, that challenge is not there. It WOULD, indeed, be a challenge to
AUTOMATE it, but I am talking about doing that MANUALLY.  And this
manual process would, in fact, be the primary purpose of the utility
(to manually do that which you said would be hard to automate): mouse-
drag-select regions, and then (with an additional command (key-combo
or button)) _MARK_ the selected region(s) (representing a series of
boxes) as either "bold", "italic", "some_font1", "some_font2", etc.
 The actual annotation markups would be written back to the
("enhanced") box file ("enhanced", in the sense of having an extra
column, for the style code)

So, why would this be useful?  It would enable one to easily rip a box
file apart into multiple, style-specific box files (one for bold, one
for italics, one for each font, etc.) -- in compliance with
tesseract's training requirements (which include "do not mix font
styles")

Anton Zorin

Apr 8, 2012, 1:46:23 PM
to tesser...@googlegroups.com
Hi Falke,

Now I get it. So the user selects a rectangle on the image, and the corresponding lines from the box file are highlighted somehow, so that the user can add custom information there or save the fragments as a separate image-box pair. Is this vision correct?

If so, at first glance it seems to me that such functionality would be more appropriate for another (separate) application. The GUI would be different:
 - a spreadsheet is needed to work with box files
 - selecting and displaying image regions should be convenient

But from the implementation complexity point of view, it doesn't look like a big deal for Qt.

Thanks,

Anton Zorin

Falke

Apr 9, 2012, 4:06:48 AM
to tesseract-ocr


On Apr 8, 1:46 pm, Anton Zorin <zorinan...@googlemail.com> wrote:
> Hi Falke,
>
> Now I got it. So user selects a rectangle on the image and corresponding
> lines from the box file highlighted somehow to enable the former to put
> some custom information there or save the fragments as a separate image-box
> pair. Is this vision correct?
>

Yes, essentially. The mouse-drag selection (the highlighting of box
regions on the bitmap) does not need to be echoed in the *.box file
while you select. I guess it would be a plus, though, to be able to select
in either the GUI (bitmap) window or the box file window, using any method
(mouse, keyboard, or both), and to have the other window updated
automatically with the selection state. Either way, everything just needs
to be updated everywhere in the end, somehow.

Tearing the *.box file apart is already easily done using popular,
text-friendly scripting languages (Perl, Python, Ruby, etc.).
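
A minimal sketch of such a script in Python. It assumes the standard six-field box line (`<char> <left> <bottom> <right> <top> <page>`) plus one trailing style column; that seventh column is the hypothetical "enhanced" extension proposed in this thread, not something tesseract itself defines. The function name `split_box_by_style` and the sample coordinates are made up for illustration.

```python
from collections import defaultdict


def split_box_by_style(lines):
    """Group enhanced box lines by their trailing style code.

    Each line is assumed to look like
    '<char> <left> <bottom> <right> <top> <page> <style>'.
    The style column (a hypothetical extension) is stripped again, so the
    grouped lines are ordinary six-field box lines.
    """
    by_style = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank lines and lines without a style code
        *box_fields, style = fields
        by_style[style].append(" ".join(box_fields))
    return dict(by_style)


# Illustrative data only; the coordinates are invented.
sample = [
    "T 10 20 30 50 0 bold",
    "e 32 20 48 40 0 bold",
    "s 50 20 64 40 0 italic",
]
groups = split_box_by_style(sample)
for style, box_lines in groups.items():
    print(style, len(box_lines))
```

Each group could then be written to its own *.box file, one per style. Note that this splits only the box data; the matching bitmap regions would still have to be separated by other means.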

> If so, at first glance it seems to me that such functionality would be more
> appropriate for another (separate) application. GUI would be different:

I didn't mean to suggest that the functionality would necessarily
belong in the same application as the one you are announcing in this
thread -- it's quite the reverse of your app. But then, again, I can
think of situations where BOTH utilities would be useful to someone,
in one single project, and then it would make sense for them to be
part of the same utility suite

I was just hoping you might find it relatively easy to build the one I
describe only because you seem to know the GUI tool set -- and my
utility needs mouse-drag-select and gui stuff like that. (And also
because it was sort of the exact reverse of mine :-)), in terms of
input/output data types)

>  - spreadsheet is needed to work with box files

a number of box editors out there do without a spreadsheet per se.

>  - selection of image regions and displaying of those should be convenient
>
> But from the implementation complexity point of view it doesn't look like a
> big deal for Qt.

That sounds optimistic :-))

>
> Thanks,

thank YOU.

Anton Zorin

May 2, 2012, 1:56:34 AM
to tesseract-ocr
Hi,


I have added a new feature: mapping of multiple characters to a
single one (using a config file). There will be a single entry in the
box file for each of these combinations.

Thanks.

Sriranga(78yrsold)

May 2, 2012, 2:59:40 AM
to tesser...@googlegroups.com
Anton,
With reference to charmap.txt, I observed that the box file did not contain the modified characters.
Example: for the text "text" with t=te and x=xt, the box file shows only t and x instead of te and xt.
Where did I make a mistake?
with regards,
-sriranga(79yrs)

charmap.txt
eng.MS.Shell.Dlg.2.exp0.box
eng.MS.Shell.Dlg.2.exp0.png

Anton Zorin

May 2, 2012, 3:04:34 AM
to tesser...@googlegroups.com
Sriranga,

te and xt are replaced by t and x, respectively, in the box file. So in this case (with char mapping active) you will have just two entries in the box file.

Regards,

Anton Zorin

Sriranga(78yrsold)

May 2, 2012, 3:24:31 AM
to tesser...@googlegroups.com
Anton,
Yes, I agree with your point. I had intended the mapping to replace t with te and x with xt, so that two boxes would be created as [te] and [xt]. In other words, as per the sample shown in charmap.txt, a=bc means that "bc" is replaced with "a".
An encounter message was suddenly displayed when I tried to run the tool again.
Where did I make a mistake?
With regards,
-sriranga(79yrs)
txt2ima.JPG

Anton Zorin

May 2, 2012, 3:29:11 AM
to tesser...@googlegroups.com
Perhaps it is a bug. Can you please be more specific? Did this message appear after you clicked the 'generate image' button again (after a successful generation)?

Regards,

Anton Zorin

Sriranga(78yrsold)

May 2, 2012, 3:37:59 AM
to tesser...@googlegroups.com
Anton,
I tested again. After closing the folder, I reopened the drive and folder, made sure that charmap.txt was restored to its original format, and clicked on the exe file; only then was the encounter.exe message displayed. I also installed using setup.exe, and that works fine, for the time being without any error message.
with regards,
-sriranga(79yrs)

Sriranga(78yrsold)

May 2, 2012, 3:52:11 AM
to tesser...@googlegroups.com
Anton,
Good news: you need not worry about the encounter.exe message. Just now I reinstalled the exe files, config.txt and charmap.txt, and now it works fine without any errors. Perhaps the exe file had become corrupted. Sorry for the trouble.
An interesting point: in charmap.txt I added the lines a=bc and bc=a, typed a and bc in the text area,
and generated the png/box files. The box file shows a, and on the next line a again.
with regards,
sriranga(79yrs)
charmap.txt
eng.MS.Shell.Dlg.2.exp0.box
eng.MS.Shell.Dlg.2.exp0.png

Sriranga(78yrsold)

May 2, 2012, 3:55:11 AM
to tesser...@googlegroups.com
Anton,
I forgot to inform you that the image file shows the text correctly as a bc, whereas the box file fails to display it correctly and instead shows a a.
With regards,
-sriranga(79yrs)

Anton Zorin

May 2, 2012, 3:51:57 AM
to tesser...@googlegroups.com
Couldn't recreate the problem.

I have modified the file and rerun the tool, but no errors occurred. I have also renamed the mapping file so that the edit box displays the name of a nonexistent file, but again, no errors.

Regards,

Anton Zorin

Sriranga(78yrsold)

May 2, 2012, 3:58:52 AM
to tesser...@googlegroups.com
Anton,
I solved the problem by reinstalling the exe file, as I reported to you just now.

Anton Zorin

May 2, 2012, 3:59:54 AM
to tesser...@googlegroups.com
Sriranga,

bc=a is incorrect. The left-hand side must be the single character that will replace the combination specified on the right-hand side. In the case of bc=a, only the first character is processed, so it is equivalent to b=a. In other words, all a's will be replaced with b.
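
The rule can be illustrated with a small Python sketch. This is not txt2img's actual code: `parse_charmap` and `box_labels` are hypothetical names, and the left-to-right, longest-combination-first scan is an assumption about how such a mapping might be applied. Only the `replacement=combination` syntax and the "first character of the left-hand side" behaviour come from this thread.

```python
def parse_charmap(lines):
    """Return {combination: replacement} from 'replacement=combination' lines."""
    mapping = {}
    for line in lines:
        left, sep, combination = line.strip().partition("=")
        if not sep or not left or not combination:
            continue  # not in replacement=combination form
        # Only the first character on the left-hand side is used.
        mapping[combination] = left[0]
    return mapping


def box_labels(text, mapping):
    """Scan text left to right and return one label per box entry."""
    # Assumption: longer combinations are tried before shorter ones.
    combos = sorted(mapping, key=len, reverse=True)
    labels = []
    i = 0
    while i < len(text):
        for combo in combos:
            if text.startswith(combo, i):
                labels.append(mapping[combo])
                i += len(combo)
                break
        else:
            if not text[i].isspace():  # whitespace gets no box entry
                labels.append(text[i])
            i += 1
    return labels


mapping = parse_charmap(["t=te", "x=xt"])
print(box_labels("text", mapping))  # ['t', 'x']
print(parse_charmap(["bc=a"]))      # {'a': 'b'}
```

With t=te and x=xt, the word "text" yields just two labels, matching the two box entries described earlier in the thread; bc=a collapses to b=a.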

Regards,

Anton Zorin

Sriranga(78yrsold)

May 2, 2012, 4:25:22 AM
to tesser...@googlegroups.com
Anton,
Fortunately, I have recreated the problem. The attached charmap.txt has a Unicode number added. When I run the exe, the encounter message is displayed. Even after I removed the added Unicode number from charmap.txt, the encounter message does not go away. I am bringing this to your notice.
Further testing is under way.
regards,
-sriranga(79yrs)
charmap.txt

Anton Zorin

May 2, 2012, 5:54:50 AM
to tesser...@googlegroups.com
Sriranga,

What is the line at the bottom of charmap.txt for? It makes the file invalid, since each line must have the syntax replacement=combination.
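
A quick way to spot such lines before running the tool is a small check like the Python sketch below. It is an illustration only: `invalid_charmap_lines` is a hypothetical helper, the sample content is invented, and treating blank lines as harmless is an assumption rather than documented txt2img behaviour.

```python
def invalid_charmap_lines(lines):
    """Return (line_number, text) for lines not shaped like replacement=combination."""
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # assumption: blank lines are ignored
        left, sep, combination = stripped.partition("=")
        if not sep or not left or not combination:
            problems.append((number, stripped))
    return problems


# Invented sample: the last line stands in for a stray Unicode number.
sample_charmap = ["a=bc", "t=te", "U+0CB5"]
for number, text in invalid_charmap_lines(sample_charmap):
    print(f"line {number} is not replacement=combination: {text}")
```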

Regards,

Anton Zorin