jTessBoxEditor - Tesseract box editor & trainer

Quan Nguyen

Sep 25, 2013, 10:02:13 PM

to tesser...@googlegroups.com

jTessBoxEditor is a Java box editor for Tesseract OCR data. It can read images in common formats, including multi-page TIFF. The program requires JRE 6.0 or later.

Version 1.0 Beta integrates support for full automation of Tesseract training. Please post your comments/feedback here. Thank you.

http://vietocr.sourceforge.net/training.html
http://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

serkand sert

Oct 1, 2013, 11:54:06 AM

to tesser...@googlegroups.com

This beta is bad. I have a merged TIFF of 30 pages, and it does not move on to the next pages.

The old 0.9 version is good.

On Thursday, September 26, 2013 at 05:02:13 UTC+3, Quan Nguyen wrote:

Quan Nguyen

Oct 1, 2013, 5:02:41 PM

to tesser...@googlegroups.com

Can you clarify a bit more? Addition of training support is the only change for this version.

serkand sert

Oct 3, 2013, 2:08:57 PM

to tesser...@googlegroups.com

On Wednesday, October 2, 2013 at 00:02:41 UTC+3, Quan Nguyen wrote:

serkand sert

Oct 3, 2013, 2:09:50 PM

to tesser...@googlegroups.com

The file is a merged TIFF of 30 pages, in 24 million colors, 400 MB in size.

In jTessBoxEditor 0.9, none of the pages advance from one page to the next.

(I meant that they do advance to the next page, but the pictures are not visible.)

Quan Nguyen

Oct 3, 2013, 4:30:57 PM

to tesser...@googlegroups.com

Sorry, I still have difficulty understanding the issue you reported. Your TIFF image has 30 pages and 24 million colors, and the file is 400 MB in size? And what do you mean by "All of those are not passing through from the page to the another page"?

Thank you.

Quan

Veerendra Jonnalagadda

Oct 4, 2013, 2:19:19 AM

to tesser...@googlegroups.com

Hi

I am looking for sample code to read text from a BMP file. If you have any links or sample code for this, kindly share.

Regards

Veerendra

Nick White

Oct 4, 2013, 10:48:25 AM

to tesser...@googlegroups.com

Hi Veerendra,

On Thu, Oct 03, 2013 at 11:19:19PM -0700, Veerendra Jonnalagadda wrote:
> I am looking sample code to read text from bmp file incase you have any links
> or sample codes for the same

There's a page with API examples on the wiki:
http://code.google.com/p/tesseract-ocr/wiki/APIExample
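
For a quick start without the C++ API, one option is to call the `tesseract` command-line program from a script. The sketch below is in Python and is not taken from the wiki page; it assumes `tesseract` is installed and on the PATH, and the helper names `build_command` and `ocr_image` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Tesseract writes its result to <output_base>.txt, which is read back here.
    Assumes the tesseract executable can be found on the PATH.
    """
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

With a file named, say, scan.bmp (a made-up name), calling `ocr_image("scan.bmp", "scan_out")` would run Tesseract and return the contents of scan_out.txt.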

Also if you search this mailing list you can find examples of more
code people have been working with. You could even take a look at
how it's used in an established project like one from the page:
http://code.google.com/p/tesseract-ocr/wiki/3rdParty

Please do try to take a look around the documentation we already
have in future; it helps everyone (including yourself!)

Nick

Nick White

Oct 4, 2013, 10:49:35 AM

to tesser...@googlegroups.com

P.S. Please don't send the same message multiple times. It can take
a little while to get through, but sending multiples will just annoy
everybody ;) Also, please start your own thread if you aren't
replying to a previous one. Thanks

Quan Nguyen

Oct 4, 2013, 11:10:27 AM

to tesser...@googlegroups.com

Thanks, Nick.

I wish I had a Delete button.

Nick White

Oct 4, 2013, 11:35:35 AM

to tesser...@googlegroups.com

On Fri, Oct 04, 2013 at 08:10:27AM -0700, Quan Nguyen wrote:
> Thanks, Nick.
>
> I wish I had a Delete button.

Actually, that reminds me, I do have such a button! I'll use it now
to remove all of the duplicates, and just leave the one with my
reply. Obviously this only affects people who use the web interface,
but people using the email one should already be well acquainted with
deleting things ;)

serkand sert

Oct 12, 2013, 7:43:40 AM

to tesser...@googlegroups.com

When I open the merged TIFF file, which is 400 MB and 33 pages, in jTessBoxEditor 0.9 and click for the next page, I get the next page. On that next page I always see the first page's picture.

I cannot see the picture that belongs to the next page.

Even when it says page 5 of 33, I see the first page's picture.

It is mostly like that. I mean, I cannot see the first page's picture.

There are no problems with small files.

I would be happy if you could fix this.

The DOS window displays an error.

On Thursday, October 3, 2013 at 23:30:57 UTC+3, Quan Nguyen wrote:

jtess.jpg

Quan Nguyen

Oct 12, 2013, 9:48:09 AM

to tesser...@googlegroups.com

That was a huge file, and that has caused out-of-memory errors in the program. You probably will have to work with a smaller file, or fewer pages, or allocate more RAM for the JVM. I hope you have executed the program using the provided run.bat file. You can try to double the max heap memory by editing the file to change from -Xmx512m to -Xmx1024m, or even -Xmx2048m if your system has lots of RAM.
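
The edit to run.bat described above can also be scripted. This is a minimal Python sketch; the launcher line in it is an assumed example of what run.bat might contain, not copied from the distribution, and `set_max_heap` is a hypothetical helper name.

```python
import re


def set_max_heap(launcher_text, new_size):
    """Replace the value of every -Xmx option in launcher_text with new_size (e.g. '1024m')."""
    return re.sub(r"-Xmx\d+[kKmMgG]?", f"-Xmx{new_size}", launcher_text)


# Assumed contents of run.bat, for illustration only.
example = "java -Xms128m -Xmx512m -jar jTessBoxEditor.jar"
print(set_max_heap(example, "1024m"))
```

Only the -Xmx option is rewritten; other JVM options such as -Xms are left untouched.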

Quan Nguyen

Nov 16, 2013, 3:39:44 PM

to tesser...@googlegroups.com

jTessBoxEditor v1.0 Release

This release includes the following improvements:

- Integrate support for full automation of Tesseract training
- Bundle Tesseract Windows training executables (r866), English data, and config files
- Fix an issue with generated TIFF missing metadata
- Optionally add noise to generated image
- Bug fixes and improvements

http://vietocr.sourceforge.net/training.html
http://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

Special thanks to Shree Devi Kumar for testing support and valuable suggestions for improvements.

Nick White

Nov 17, 2013, 1:03:48 PM

to tesser...@googlegroups.com

On Sat, Nov 16, 2013 at 12:39:44PM -0800, Quan Nguyen wrote:
> jTessBoxEditor v1.0 Release

Congratulations Quan on the 1.0 release! I have heard nothing but
good things about jTessBoxEditor, and it's invaluable to our
community. Many thanks for all your hard work!

Nick

Quan Nguyen

Aug 19, 2014, 9:51:16 PM

to tesser...@googlegroups.com

Version 1.1 beta is released with the following enhancements:

- Add training support for Right-to-Left (RTL) text
- Add horizontal box split using modifier keys

Any comments/feedback are welcome. Thanks.

newbie

Nov 10, 2014, 1:28:36 PM

to tesser...@googlegroups.com

I have installed jTessBoxEditor to train on my images for tess4j, but I am unable to open the file (PNG, TIFF) in the box editor. The tutorial says to use TIFF/box files as input to the editor, but when it browses for files it seems to be looking for text files. I have an original PNG file, which I converted to TIFF. I also tried converting the PNG to 8bpp grayscale, but in vain. I am still struggling to see the image file in jTessBoxEditor. Any help is appreciated.

Quan Nguyen

Nov 10, 2014, 5:41:56 PM

to tesser...@googlegroups.com

To work in the Box Editor, you would need to provide the box file along with the image. The box file can be either generated or made by Tesseract training. There's no need to convert the image files.
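
A quick way to check for this situation is to list the images in a folder that have no box file beside them. The Python sketch below assumes the usual Tesseract naming convention, where the box file has the same base name as the image and a .box extension (for example, eng.arial.exp0.tif and eng.arial.exp0.box); the function name and the set of image extensions are illustrative choices.

```python
from pathlib import Path

# Image extensions to look for; an illustrative set, adjust as needed.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg", ".bmp", ".gif"}


def images_missing_box(folder):
    """Return sorted names of image files in folder that have no same-named .box file."""
    missing = []
    for path in Path(folder).iterdir():
        is_image = path.suffix.lower() in IMAGE_SUFFIXES
        if is_image and not path.with_suffix(".box").exists():
            missing.append(path.name)
    return sorted(missing)
```

An empty result means every image in the folder has a matching box file.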

ShreeDevi Kumar

Nov 11, 2014, 3:16:03 AM

to tesser...@googlegroups.com

JTessBoxEditor has three tabs

Use Tiff/Box Generator to generate tiff and box files from a given text file for the chosen font

The Box files created by Box/Tiff Generator are based on the rendering of the text in the chosen font and will be accurate - however they may still get errors 'blob not found' during training.

Use Trainer in Make Box File mode to generate box files from an image using the chosen language's traineddata

Please note that the BOX files created by Tesseract under Trainer will only be as good as the recognition by Tesseract using the traineddata being used and may require a lot of modification.

Use Box Editor to edit the box files (if needed)

Use Trainer in Train with existing Box to use box/tiff pairs that you may have

If you want to do training using jTessBoxEditor, you need to create the other files required for training (see /samples/vie for the Vietnamese files). You may be able to use some of the files from Tesseract's langdata repo as a start.
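
For reference when editing box files by hand or by script: in the Tesseract 3.x box format each line describes one symbol as `<symbol> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. The Python sketch below parses such a line and converts it to a top-left-origin rectangle; the sample line and the image height of 200 are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one line of the form '<symbol> <left> <bottom> <right> <top> <page>'."""
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


def to_top_left_rect(box, image_height):
    """Convert a box to (x, y, width, height) with the origin at the top-left corner."""
    return (box.left, image_height - box.top, box.right - box.left, box.top - box.bottom)


# Made-up sample line and image height.
box = parse_box_line("T 36 92 53 110 0")
print(box)
print(to_top_left_rect(box, image_height=200))
```

The conversion matters because most image viewers and drawing APIs put the origin at the top-left, while box files count upward from the bottom.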

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


newbie

Nov 11, 2014, 10:30:24 AM

to tesser...@googlegroups.com

Shree,

Thanks for taking the time to respond. I think I am lost at the first step. I have an image (ArrisVIP2500.png, attached as a sample) from which I need to extract the text. I need to train the Tesseract/tess4j engine to pick up the text from this image.

But the Tiff/Box Generator is looking for a text file. So I started out with a Notepad file (vip2500.txt, also attached) containing the text in the image, in a similar font (the closest match I got from WhatTheFont was sans-serif; I don't know if that is right). When I load the txt file into the Tiff/Box Generator, I don't see a Generate button to generate the .tif and box files.

ArrisVIP2500.png

vip2500.txt

ShreeDevi Kumar

Nov 11, 2014, 1:11:04 PM

to tesser...@googlegroups.com

You don't need to train in order to extract text.

Have you tried the English traineddata, available from https://code.google.com/p/tesseract-ocr/source/browse/?repo=tessdata ?

ShreeDevi

newbie

Nov 11, 2014, 1:53:57 PM

to tesser...@googlegroups.com

Shree,

The eng.traineddata that comes with tess4j, which I presume is the same as the one from the Google link below, gives me the output below. I should be able to read "vip2500" and "AT&T U-verse" from the image, which it is not doing. Hence I thought I might have to train it.

AT&T U-verse

rowan <3 3

/ --

vxvzsoo ‘Q’

newbie

Nov 11, 2014, 2:03:00 PM

to tesser...@googlegroups.com

The Google link you gave me below does not let me download the file. I just wanted to check whether it is different from the one I have.

Quan Nguyen

Nov 11, 2014, 2:17:30 PM

to tesser...@googlegroups.com

Looks like you have an image processing problem, not a training problem. There are many non-text objects in your image, and any OCR engine would have problems with them. Eliminate them and you'll get better results.

newbie

Nov 11, 2014, 3:04:35 PM

to tesser...@googlegroups.com

Quan,

Can you elaborate on the problems with image processing? What do you mean by non-text objects? I attached the image in a message above to Shree.

Thanks

Quan Nguyen

Nov 11, 2014, 3:41:21 PM

to tesser...@googlegroups.com

The buttons, port, signs, symbols, logos -- those non-text elements -- all help confuse Tesseract.
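
One way to act on this is to restrict the image to the region that holds text and reduce it to black and white before OCR. The Python sketch below shows both steps on a tiny hand-made grayscale grid; the pixel values, the crop rectangle, and the threshold of 128 are made up for illustration, and a real workflow would load and save pixels with an imaging library rather than plain lists.

```python
def crop(pixels, left, top, right, bottom):
    """Return the sub-image covering rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in pixels[top:bottom]]


def binarize(pixels, threshold=128):
    """Map each grayscale value (0-255) to 0 (black) or 255 (white)."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


# A made-up 3x4 grayscale image: a light border around a small darker region.
image = [
    [200, 200, 200, 200],
    [200, 40, 210, 90],
    [200, 30, 220, 60],
]

# Keep only the (hypothetical) text region, then binarize it.
text_region = crop(image, left=1, top=1, right=3, bottom=3)
print(binarize(text_region))
```

Cropping away buttons and logos first means the threshold only has to separate text from its immediate background.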

newbie

Nov 12, 2014, 10:49:32 AM

to tesser...@googlegroups.com

Quan/Shree,

Do you know of a tool that would leave only the text in the image? Some preprocessing of the image for Tesseract?

Thanks

Almas Maris

Nov 26, 2014, 2:49:59 PM

to tesser...@googlegroups.com

How can I get an English TIFF file?

Quan Nguyen

Jun 4, 2016, 5:24:15 PM

to tesseract-ocr

jTessBoxEditor 1.6 Release

- Upgrade Tesseract training executable 3.05dev (from https://github.com/UB-Mannheim/tesseract/wiki)

- Incorporate new training commands, including text2image (currently not usable on Windows)

http://vietocr.sourceforge.net/training.html
http://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

Meh Hem

Jun 9, 2016, 4:24:31 AM

to tesseract-ocr

Hi Mate,

Just wanted to say that I love your program and you are doing a great job.

That is all.

pham x hoang

Jul 13, 2016, 2:34:36 PM

to tesseract-ocr

Which version of the box file format are you using now, 2 or 3?

Quan Nguyen

Sep 13, 2016, 7:58:10 PM

to tesseract-ocr

jTessBoxEditor 1.7 Release:

- Update Tesseract training executable 3.05dev (2016-08-31)
- Generated images are now compressed to reduce file sizes
- Additional parameters for text2image command
- Use BreakIterator for character boundary analysis

http://vietocr.sourceforge.net/training.html

https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

Khan

Mar 22, 2018, 2:55:50 PM

to tesseract-ocr

How can I train Chinese characters using jTessBoxEditor? It seems to support only a few languages.

Shravani Adivarekar

Feb 19, 2024, 9:45:02 AM

to tesseract-ocr

Can you please guide me on how to install it, use it, and create box files? I am new to OCR and need to develop handwritten text recognition for Devanagari.

محمود محمد‎

Dec 10, 2024, 2:47:34 AM

to tesseract-ocr

I would like guidance on how to use Tesseract with jTessBoxEditor to train a model on 10 Arabic images and then run the model.

Hello, this is Mahmoud Abdel Aleem. I saw your Tesseract contributions on GitHub and benefited from them; thank you. I would like your help with the following:

1- I have a set of 10 digital images of book covers in Arabic that I want to convert to text using Tesseract.
2- The conversion is inaccurate: the ara.traineddata model in Tesseract's tessdata folder does not recognize most of the words.
3- I created a model, ara1.traineddata, using jTessBoxEditor: I created boxes for each image, corrected them on a sample image, generated ara1.traineddata, and put it in Tesseract's tessdata folder. I then repeated the experiment on the image the model was trained on, but it did not succeed.

I think there is an error in the steps I am following with jTessBoxEditor. If possible, please let me know the correct steps for training and creating a .traineddata file with jTessBoxEditor, even if only a custom model for these 10 images, so that Tesseract can recognize them and convert them to text. If possible, an illustrative image of the steps would help. I would be grateful for your cooperation.
