Accuracy may suffer even at such a considerable char height (90 is
certainly more than enough) if there are significant discrepancies
between your training and source images. Try to pass Tesseract images
whose stroke thickness and orientation are as similar as possible to
those of the training images. To achieve this, pre-process the images
so that they look alike with respect to lighting conditions, contrast,
amount of blur and physical dimensions; rectify perspective
distortion, etc. And of course, always use the same binarization
procedure with the same parameter set, or at least one that gives
predictably similar results across the range of your source images.
Btw, applying Otsu thresholding before passing images to Tesseract is
useless, as Otsu is the binarization procedure Tesseract itself
employs. The exception is when you run Otsu with your own special
parameter set and then pass a 1-bit image.
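To make the "same binarization procedure, same parameters" point
concrete, here is a minimal pure-Python sketch of a global Otsu
threshold that yields a 1-bit image. The function names and the
list-of-rows image representation are my own assumptions for
illustration; in practice you would more likely use Leptonica or
another imaging library, and this is not how Tesseract implements it
internally.

```python
from typing import List, Optional, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return threshold t for 8-bit gray values; values <= t are 'dark'."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_between = 0, -1.0
    w0 = 0    # pixel count of the dark class so far
    sum0 = 0  # sum of gray values of the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # dark class still empty
        if w1 == 0:
            break     # light class empty, nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


def binarize(rows: List[List[int]],
             threshold: Optional[int] = None) -> List[List[int]]:
    """Return a 1-bit image as rows of 0/1, where 1 marks dark (ink) pixels.

    Pass a fixed threshold to apply identical parameters to every image;
    leave it as None to let Otsu pick one per image.
    """
    if threshold is None:
        threshold = otsu_threshold([p for row in rows for p in row])
    return [[1 if p <= threshold else 0 for p in row] for row in rows]
```

Whatever procedure you settle on, the idea is to run training and
source images through exactly the same function with the same
arguments.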
Next, train Tesseract keeping in mind that ideally there should be
around 20 samples of each char. Don't strive to train on as many char
sizes as possible: regardless of the size, Tesseract scales character
"models" up or down to the same internal dimensions. If your source
char sizes happen to differ, that's no problem, they'll do. Provide
real (probably pre-processed) images for training, not manually
compiled ones.
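A quick way to check the "around 20 samples of each char" target is to
count symbols across your box files. The sketch below assumes the usual
box-file line layout (symbol first, then whitespace-separated
coordinates); the helper names and the glob pattern in the usage note
are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

MIN_SAMPLES = 20  # target number of samples per char, from the text above


def count_samples(box_lines: Iterable[str]) -> Counter:
    """Count how often each symbol occurs in the given box-file lines."""
    counts: Counter = Counter()
    for line in box_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        counts[fields[0]] += 1
    return counts


def undersampled(counts: Counter, minimum: int = MIN_SAMPLES) -> Dict[str, int]:
    """Return the symbols that have fewer than `minimum` samples."""
    return {ch: n for ch, n in sorted(counts.items()) if n < minimum}
```

For example, to scan a (hypothetical) training directory:

```python
from pathlib import Path

totals = Counter()
for path in Path("training").glob("*.box"):
    with path.open(encoding="utf-8") as f:
        totals += count_samples(f)
print(undersampled(totals))
```

Copies of the same box file would inflate these counts without adding
any real samples, so count each distinct image only once.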
To further improve speed and accuracy, process your images char by
char, bypassing Tesseract's layout analysis. This approach also lets
you use char-position-specific whitelists (letters, digits) for even
more speedup and precision. Everything related to Tesseract's
dictionary facility is totally irrelevant here. You'd better provide
entirely empty dictionary files for your "traineddata".
HTH
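The char-by-char approach with position-specific whitelists described
above could be sketched roughly as follows. This assumes the
third-party pytesseract wrapper and a PIL image (neither is mentioned
in this thread), that you already know each character's bounding box,
and a recent Tesseract command line where the single-character mode is
`--psm 10`. The `WHITELISTS` table and function names are hypothetical.
With the C++ API the equivalent would be the single-char page
segmentation mode plus setting the `tessedit_char_whitelist` variable.

```python
from typing import Sequence, Tuple

# Hypothetical per-position character classes; adapt to your data.
WHITELISTS = {
    "digit": "0123456789",
    "letter": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
}

Box = Tuple[int, int, int, int]  # (left, upper, right, lower), PIL convention


def char_config(kind: str) -> str:
    """Build a config string: single-char mode plus a whitelist."""
    return f"--psm 10 -c tessedit_char_whitelist={WHITELISTS[kind]}"


def recognize(image, boxes: Sequence[Box], kinds: Sequence[str]) -> str:
    """Recognize one character per box, each with its own whitelist."""
    import pytesseract  # third-party; imported lazily (assumption)

    chars = []
    for box, kind in zip(boxes, kinds):
        glyph = image.crop(box)  # cut out a single character
        text = pytesseract.image_to_string(glyph, config=char_config(kind))
        chars.append(text.strip())
    return "".join(chars)
```

For a field known to be, say, two letters followed by four digits, you
would pass `kinds = ["letter"] * 2 + ["digit"] * 4` along with the six
boxes.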
Warm regards,
Dmitri Silaev
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to tesser...@googlegroups.com.
> To unsubscribe from this group, send email to
> tesseract-oc...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
And I'd suggest keeping up with the latest revisions of Tesseract. The
API changes significantly, but Tess is definitely being improved in
terms of stability, new capabilities and also code efficiency, which
may directly lead to the improved performance you are looking for.
Warm regards,
Dmitri Silaev
On Tue, Mar 29, 2011 at 8:17 AM, Andres <andr...@gmail.com> wrote:
Here are links to the relevant Leptonica API source files:
adaptmap.c - local adaptive grayscale quantization; mostly
gray-to-gray in preparation
(http://tpgit.github.com/Leptonica/adaptmap_8c.html#_details)
binarize.c - Special binarization methods, locally adaptive: Otsu and
Sauvola (http://tpgit.github.com/Leptonica/binarize_8c.html#_details)
grayquant.c - Standard, simple, general grayscale quantization
(http://tpgit.github.com/Leptonica/grayquant_8c.html#_details)
See also:
Grayscale Mapping and Binarization
(http://tpgit.github.com/UnOfficialLeptDocs/leptonica/binarization.html)
Document Image Analysis
(http://tpgit.github.com/UnOfficialLeptDocs/leptonica/document-image-analysis.html)
which refers to
http://tpgit.github.com/Leptonica/livre__adapt_8c_source.html and
http://tpgit.github.com/Leptonica/livre__tophat_8c_source.html.
-- TP
Sorry for the delay.
I meant that for training you just need to use as many *different*
images as possible, not multiple renamed copies of the same image.
Warm regards,
Dmitri Silaev
On Mon, Apr 4, 2011 at 2:56 PM, Sriranga(78yrsold)
<withbl...@gmail.com> wrote:
> Dmitri,
> I am extremely thankful for the valuable guidance.
> With reference to your last para - I could not follow it clearly and am
> confused. Kindly elaborate a little; a sample of yours (any lang or
> English) will do. Kindly pardon me for troubling you in the midst of your
> hectic work.
> With Choicest Best Wishes and Good Luck,
> -sriranga(78yrs)
>
> On Mon, Apr 4, 2011 at 11:50 AM, Dmitri Silaev <daemo...@gmail.com>
> wrote:
>>
>> Dear Sriranga,
>>
>> Sorry for the delay.
>>
>> You can indeed set the DPI in an image file manually using any image
>> editor, but the only thing that matters is the resolution your image
>> got from the scanner. Roughly speaking, the resolution here means the
>> number of pixels per letter. This is controlled by the scanner itself
>> or by the scanning program's settings. By changing the DPI afterwards
>> in an image editor, you just change some of the image's attribute
>> values, not the image's pixels.
>>
>> 300 DPI is more than okay for your needs.
>>
>> Renaming a box/train file and feeding it to Tesseract as another
>> sample is not a solution, as by "sample" we mean here a copy of a
>> character obtained under slightly different conditions in another
>> [scanned] image, or at least at another position in the same image. So
>> get as many images as possible, count the number of character samples
>> within each and thus build your training body.
>>
>> Warm regards,
>> Dmitri Silaev
>>
>>
>>
>>
>>
>> On Sat, Apr 2, 2011 at 1:13 PM, Sriranga(78yrsold)
>> <withbl...@gmail.com> wrote:
>> > Dear Dimitri,
>> > Awaiting your valuable guidance please.
>> > With warmest regards,
>> > -sriranga(78yrs)
>> >
>> > On Wed, Mar 30, 2011 at 8:29 PM, Sriranga(78yrsold)
>> > <withbl...@gmail.com> wrote:
>> >>
>> >> Dear Dimitri,
>> >> I presume that 300 x 300 dpi is reasonable for the scanned images?
>> >> With the help of IrfanView I can find out the dpi as well as increase
>> >> or decrease it.
>> >> Generally, as a standard I select dpi = 300 and resize to 1200 or 2400
>> >> from 600, which is convenient for editing the box file with the help
>> >> of owler. I hope this will not reduce the accuracy of the output. A
>> >> sample tif is attached for approval.
>> >>
>> >> Regarding 20 samples of each char: suppose the image1.tif file
>> >> contains the alphabet with a single sample of each char. Can it be
>> >> used 20 times by renaming the same image file as image1.tif,
>> >> image2.tif, image3.tif ... image20.tif? If not, kindly provide me
>> >> with your sample, if any.
>> >> With Warmest Regards,
>> >> -sriranga(78yrs)