tessdata/eng.traineddata question


newbie

Jan 12, 2015, 11:18:18 AM
to tesser...@googlegroups.com
Does anyone know whether tessdata/eng.traineddata (the final crunched data) in Tess4J comes with all of the files below included?

  • tessdata/eng.config
  • tessdata/eng.unicharset
  • tessdata/eng.unicharambigs
  • tessdata/eng.inttemp
  • tessdata/eng.pffmtable
  • tessdata/eng.normproto
  • tessdata/eng.punc-dawg
  • tessdata/eng.word-dawg
  • tessdata/eng.number-dawg
  • tessdata/eng.freq-dawg
Also, is this enough to recognize any of the normal fonts (images attached)? I appreciate your help.
ArrisVIP2250_cropped.png
ArrisVIP2500_resampled.png
VEN501_cropped.png

Flash Thunder

Jan 12, 2015, 5:39:38 PM
to tesser...@googlegroups.com
It should recognize those images without any problems; you just need to prepare the images properly.

3 steps for you:

1. Tesseract likes letters that are about 70-100 px tall, so you need to resize your images.
2. Invert the colors. As far as I have noticed, it doesn't like light text on a dark background at all: letters must be black and the background white.
3. Make the image 2-color. This removes the blur introduced by resizing while keeping the font shape.
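
The three steps above can be sketched in plain Java with the standard java.awt classes. This is only an illustration, not the poster's code: the class name OcrPrep, the bicubic interpolation, and the fixed threshold passed by the caller are all assumptions, and a real pipeline would pick the scale factor from the measured letter height.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper; not part of Tesseract or Tess4J.
final class OcrPrep {
    /** Step 1: scale the image by the given factor (bicubic interpolation). */
    static BufferedImage scale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return dst;
    }

    /** Steps 2 and 3: optionally invert, then threshold to pure black and white. */
    static BufferedImage toBlackOnWhite(BufferedImage src, int threshold, boolean invert) {
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int gr = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness (0..255).
                int luma = (r * 299 + gr * 587 + b * 114) / 1000;
                if (invert) {
                    luma = 255 - luma;
                }
                dst.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return dst;
    }
}
```

For light text on a dark background, a call such as OcrPrep.toBlackOnWhite(OcrPrep.scale(img, 3.0), 128, true) would apply all three steps; the factor 3.0 and the threshold 128 are placeholder values.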

newbie

Jan 13, 2015, 10:13:13 AM
to tesser...@googlegroups.com
Thanks, I have it working after one simple change.

1. I upscaled the resolution to 300 dpi (and sharpened the image), and that did the trick.

newbie

Jan 14, 2015, 2:17:24 PM
to tesser...@googlegroups.com
Flash Thunder,
I think I got ahead of myself in the email below. The upscaled image has the same dpi as the original image (96 dpi). I have upscaled the pixels to a size at which the OCR works without doing steps 2 and 3 (found by trial and error). But I don't have a generic formula to upscale all my images, and hence I am struggling.

1. Do you know what the text size (pt) should be for 96 dpi?
2. Do you know if there are packages available to upscale the resolution as well, say from 96 dpi to 300 dpi? Is this doable?

Thanks

Robert Komar

Jan 14, 2015, 3:43:26 PM
to tesser...@googlegroups.com
On Wed, 14 Jan 2015, newbie wrote:

> Flash Thunder, I think I got ahead
> of myself in the email below. The upscaled image has the
> same dpi as the original image (96 dpi). I have upscaled
> pixels for which the ocr works without doing step 2 and
> 3 (by trial and error). But I don't have a generic formula
> to upscale all my images and hence struggling.
>
> 1. Do you know what the text size (pt) should be for 96 dpi?
> 2. Do you know if there are packages available to upscale
> the resolution also? Say from 96 dpi to 300 dpi? Is this
> doable?

The dpi isn't the most important thing. What is most
important is the size of the characters in pixels. As
Allistair explained, the lowercase characters like 'x'
should be at least 20 pixels tall to get good results.
It's okay if they are taller than that. It is not okay
if they get much shorter. So, just upscale all your
images so that the 'x' characters are at least
20 pixels tall. It's that simple.

dpi by itself is useless unless you know the font size,
as well. It is (dpi x font size) that is the important
quantity.
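
To make the (dpi x font size) point concrete: a point is 1/72 inch, so the nominal font height in pixels is dpi x pt / 72, and the x-height is some fraction of that depending on the typeface. Below is a rough Java sketch of the arithmetic; the class and method names are made up for illustration, and the 20-pixel minimum is simply the figure quoted above.

```java
// Hypothetical helper for the arithmetic only; not part of any OCR library.
final class TextScale {
    static final double POINTS_PER_INCH = 72.0;

    /** Nominal font height in pixels: dpi * pt / 72. */
    static double fontHeightPx(double dpi, double fontSizePt) {
        return dpi * fontSizePt / POINTS_PER_INCH;
    }

    /**
     * Factor by which to enlarge an image so that a measured x-height
     * reaches the desired minimum. Never shrinks (returns at least 1.0).
     */
    static double scaleFactor(double currentXHeightPx, double minXHeightPx) {
        if (currentXHeightPx <= 0) {
            throw new IllegalArgumentException("x-height must be positive");
        }
        return Math.max(1.0, minXHeightPx / currentXHeightPx);
    }
}
```

For example, 12 pt text at 96 dpi has a nominal height of 16 px, and an image whose 'x' measures 8 px needs a scale factor of 2.5 to reach 20 px.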

Cheers,
Rob Komar

Quan Nguyen

Jan 14, 2015, 7:58:16 PM
to tesser...@googlegroups.com
You can use the command combine_tessdata to unpack a traineddata file to examine its components.
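
Since the rest of this thread is Java (Tess4J), here is a small sketch of calling that tool from Java. It assumes combine_tessdata is installed and on the PATH; the -u (unpack) option takes the traineddata file and an output prefix. The class name and the paths are placeholders.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper; assumes the combine_tessdata binary is on the PATH.
final class UnpackTraineddata {
    /** Builds: combine_tessdata -u <traineddata> <outputPrefix> */
    static List<String> unpackCommand(Path traineddata, String outputPrefix) {
        return Arrays.asList("combine_tessdata", "-u", traineddata.toString(), outputPrefix);
    }

    /** Runs the command, forwarding its output to this process; returns the exit code. */
    static int run(Path traineddata, String outputPrefix)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(unpackCommand(traineddata, outputPrefix))
                .inheritIO()
                .start();
        return p.waitFor();
    }
}
```

With the prefix "eng." the unpacked components are written next to each other as eng.unicharset, eng.inttemp, and so on, which lets you check which of the listed files are actually inside.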

The eng.traineddata bundled with Tess4J is version 3.01. You may want to try 3.02 and see if it produces better results for you (check https://code.google.com/p/tesseract-ocr/downloads/list).

Marek FlashT Rucinski

Jan 18, 2015, 1:57:45 PM
to tesser...@googlegroups.com
Don't use the DPI metric, as it does not really matter to Tesseract. The best results (in my experience) are obtained when the font size is 70-90 px (so it is a bit large for normal usage).

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/991f0517-29d9-440b-97e4-8e2616c30033%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Marek FlashT Rucinski

Jan 18, 2015, 1:59:28 PM
to tesser...@googlegroups.com
Oh, sorry for the double post... wrong key. I have to say that for captcha recognition, for example, I resize images to 200% or even 300%; the same image not resized does not give any results. Not sure why. Probably because the font becomes more... "oval".

newbie

Jan 20, 2015, 11:38:54 AM
to tesser...@googlegroups.com
Thanks to all who have taken the time to respond.

This is what I am trying to do now: I upscale the image, feed it to the OCR, and then check the result against a dictionary of words I have. If it does not match, I iteratively upscale and feed it to the OCR again. I cannot upscale it very far, as there are 3 problems.

1. The text I am looking for gets very blurred and the OCR fails.
2. I run out of memory while upscaling (I have already increased the heap size to the max).
3. The process is time-consuming.
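
The loop described above, with a guard against problem 2, might look roughly like the following Java sketch. The OCR call and the scaler are passed in as functions because the actual Tess4J and resizing code are not shown in this thread; the class name, the parameter names, and the pixel budget are all assumptions. The factors are expected in ascending order.

```java
import java.awt.image.BufferedImage;
import java.util.Optional;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical sketch; the ocr and scaler functions stand in for real code.
final class IterativeOcr {
    /**
     * Tries each scale factor in turn and returns the first OCR result that
     * is found in the dictionary. Stops before allocating an image larger
     * than maxPixels, so the heap is not exhausted by oversized upscales.
     */
    static Optional<String> recognize(
            BufferedImage src,
            double[] ascendingFactors,
            long maxPixels,
            BiFunction<BufferedImage, Double, BufferedImage> scaler,
            Function<BufferedImage, String> ocr,
            Set<String> dictionary) {
        for (double f : ascendingFactors) {
            long pixels = Math.round(src.getWidth() * f) * Math.round(src.getHeight() * f);
            if (pixels > maxPixels) {
                break; // every later factor would be larger still
            }
            String text = ocr.apply(scaler.apply(src, f)).trim();
            if (dictionary.contains(text)) {
                return Optional.of(text);
            }
        }
        return Optional.empty();
    }
}
```

Checking the pixel count before scaling keeps the failure mode a clean "no match" instead of an out-of-memory error, and a short list of factors bounds the time spent per image.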

My upscale multiple (how much I upscale the entire image by) is set based on the max dimension of the original image (i.e. if the vertical dimension is larger, the vertical pixel count becomes my max dimension, and likewise for horizontal; e.g. for height 29 and width 67, max dimension = 67).
if (maxDimension < 100) {
    scaledMultiple = 10;
} else if (maxDimension < 1000) {
    // 100 to 999 px
    scaledMultiple = 50;
} else {
    // 1000 px and up
    scaledMultiple = 100;
}

This works for most of the images I have currently, but fails for a few. I will attach the failing ones (it needs to read VIP1200 in VIP1200R.png and VIP1200R_cropped). I would appreciate it if any of you could tell me how I can get this to work, or whether there is another way to go about this, as my images vary drastically in size. (Of course, I have put forward the suggestion of cropping the model number within a text box, as Allistair suggested, and they are mulling it over, so I guess the idea is not well received.)

I do maintain the aspect ratio of the original image when I upscale, so the text is not made more oval; maybe I should try that? Also, I am now converting JPG files to PNG; do you know which format works best? Thanks

Appreciate it.
VIP1200R_cropped.png
vip1200.jpg

newbie

Jan 20, 2015, 3:00:06 PM
to tesser...@googlegroups.com
I found that vip1200.jpg works when scaled to a width of 8654 px and a height of 5748 px, but most of the time I get either an "Invalid mem access" or an out-of-memory (heap) error before I can rescale to the optimal size.
I need to come up with some other generic way to upscale and OCR images. Any ideas are appreciated.

ShreeDevi Kumar

Jan 20, 2015, 9:30:46 PM
to tesser...@googlegroups.com
Have you looked at ImageMagick and related scripts for pre-processing the images?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Art W Rhyno

Jan 21, 2015, 4:19:54 PM
to tesser...@googlegroups.com
I have posted about this before but the Olena project [1] has some great tools to identify text and images. Look for the "content_in_hdoc" program for example. If the identification looks close enough, you could extract and pass to tesseract those regions that have been classed as text. I have attached an example from your "vip1200.jpg" image, the portion in green is identified as text. It also picks up some false positives, but you could probably filter those out.

art
---
1. http://www.lrde.epita.fr/cgi-bin/twiki/view/Olena/



vcr.png

newbie

Jan 22, 2015, 10:35:48 AM
to tesser...@googlegroups.com
ShreeDevi,
ImageMagick seems like a manual tool, but I think the problem I need to solve is a generic way of preprocessing all images.

Art,
I have been looking for a text region segregation tool and had found only one, from MathWorks, that looked promising. Does Olena provide an API, rather than just a tool, to preprocess (mark text regions in) the image programmatically? I will look into the documentation more.

Thanks Art!

Allistair

Jan 22, 2015, 11:16:59 AM
to tesser...@googlegroups.com
Not exactly an answer, but someone else with the same issue has gotten most of the way there.



Art W Rhyno

Jan 22, 2015, 11:31:03 AM
to tesser...@googlegroups.com
> Now with Olena, does it provide an API instead of a tool to preprocess (mark text regions) the image programmatically?

Hi,

If it looks like Olena would work for you, look at the source for the "content_in_hdoc_hdlac" program in the distribution; it shows how to use Olena programmatically. Good luck!

art

Robert Komar

Jan 22, 2015, 12:32:53 PM
to tesser...@googlegroups.com
On Tue, 20 Jan 2015, newbie wrote:

> I found that vip1200.jpg works at scale Width(8654px) and
> height(5748px), but most of the time I either get an
> "Invalid mem access" or out of mem(heap) error before I am
> able to rescale to the optimal scale. I need to come up
> with some other generic way to upscale and ocr images. Any
> ideas are appreciated.

You could try binarizing the image before rescaling it.
That would reduce memory consumption, and give you
control over the process of binarization rather than
leaving it up to tesseract.
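
As a rough illustration of the memory saving, here is the arithmetic for the 8654 x 5748 image mentioned above, comparing a packed-int RGB raster (one 4-byte int per pixel, as in BufferedImage.TYPE_INT_RGB) with a 1-bit raster (as in TYPE_BYTE_BINARY, rows padded to whole bytes). The class name is made up, and the figures cover the raw pixel data only, not any temporary copies a scaling routine may allocate.

```java
// Hypothetical helper for estimating raster sizes; raw pixel data only.
final class ImageMemory {
    /** Bytes for a raster storing one 4-byte int per pixel. */
    static long rgbBytes(int width, int height) {
        return 4L * width * height;
    }

    /** Bytes for a 1-bit-per-pixel raster with each row padded to a whole byte. */
    static long binaryBytes(int width, int height) {
        long bytesPerRow = (width + 7) / 8;
        return bytesPerRow * height;
    }
}
```

For 8654 x 5748 this comes to about 199 MB as packed RGB versus about 6.2 MB as 1-bit data, roughly a 32-fold reduction.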

Rob Komar

newbie

Jan 22, 2015, 1:39:49 PM
to tesser...@googlegroups.com
Any idea what free or open-source code is available for binarizing in Java?

Thanks

Allistair C

Jan 22, 2015, 2:16:20 PM
to tesser...@googlegroups.com
At what point will you use Google to answer these simple questions? OpenCV has already been mentioned many times.

Sent from my iPhone

newbie

Jan 22, 2015, 4:08:44 PM
to tesser...@googlegroups.com
OK, my question should have been phrased better, I apologize. I wish it were as simple as saying OpenCV in general. I have tried contouring to extract text, bounding boxes, and squared bounding boxes in OpenCV. Each would detect a few regions but not all.

I have tried a Python script built on top of MATLAB for binarization, and my latest attempt used standard Java extension packages (ImageIO etc.) for binarization. I don't think I could have done it without googling :-). The output eats away some of the normal text.


Here's a sample result of the binarization.
myBinImage.jpg

Allistair C

Jan 22, 2015, 6:00:38 PM
to tesser...@googlegroups.com
In OpenCV, binarisation is one line of code: it's called threshold, and you can choose from various types. If I remember tomorrow, I'll post some Android demo code.
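
One of the threshold types OpenCV offers is Otsu's method (the THRESH_OTSU flag), which picks the cutoff automatically from the image histogram instead of using a fixed value. For readers without OpenCV at hand, below is a dependency-free Java sketch of the same idea. It is not OpenCV's implementation, and the class and method names are invented for illustration.

```java
// Hypothetical, dependency-free sketch of Otsu's method; not OpenCV code.
final class Otsu {
    /** Counts how often each gray level 0..255 occurs. */
    static int[] histogram(int[] grayPixels) {
        int[] hist = new int[256];
        for (int v : grayPixels) {
            hist[v & 0xFF]++;
        }
        return hist;
    }

    /**
     * Returns the gray level t that maximizes the between-class variance
     * when pixels <= t form one class and pixels > t the other.
     */
    static int threshold(int[] hist) {
        long total = 0;
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            total += hist[i];
            sumAll += (long) i * hist[i];
        }
        long weightLow = 0;
        long sumLow = 0;
        double bestVariance = -1.0;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            weightLow += hist[t];
            sumLow += (long) t * hist[t];
            if (weightLow == 0) {
                continue; // no pixels in the lower class yet
            }
            long weightHigh = total - weightLow;
            if (weightHigh == 0) {
                break; // all pixels are in the lower class
            }
            double meanLow = (double) sumLow / weightLow;
            double meanHigh = (double) (sumAll - sumLow) / weightHigh;
            double diff = meanLow - meanHigh;
            double variance = (double) weightLow * weightHigh * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                best = t;
            }
        }
        return best;
    }
}
```

A single global cutoff can still erase faint text when lighting varies across the image; in that case a locally adaptive threshold is usually the next thing to try.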

Sent from my iPhone
