--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
After processing, it outputs:
The Evolving Student
0 Children and Email
Classroom Requirem nts
Online Coursework Dependency
v Learning a Vital Social Skill
(The missing e is due to the pre-processing, not tesseract).
The main thing I notice about the image that you sent is that most of the letters have very low contrast with their surroundings. If you add some pre-processing to intelligently convert the image to black and white, I expect that your results will improve significantly.
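As a rough sketch of what "intelligently" choosing the black/white cut-off could look like, here is Otsu's method in plain Python. This is only one reasonable approach and not necessarily the pre-processing used for the output above. The function names `otsu_threshold` and `binarize`, and the representation of the image as a list of rows of 0-255 gray values, are made up for illustration.

```python
def otsu_threshold(gray):
    """Pick the gray level (0-255) that best separates dark from light pixels.

    `gray` is a list of rows, each row a list of integers in 0..255.
    The returned threshold maximizes the between-class variance (Otsu's method).
    """
    # Histogram of gray levels.
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_dark = 0   # number of pixels with value <= t
    sum_dark = 0      # sum of the values of those pixels
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        weight_light = total - weight_dark
        if weight_dark == 0:
            continue          # no dark pixels yet
        if weight_light == 0:
            break             # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between_var = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(gray, threshold):
    """Map pixels <= threshold to 0 (black) and all others to 255 (white)."""
    return [[0 if value <= threshold else 255 for value in row] for row in gray]
```

For a tiny low-contrast patch such as `[[100, 110, 105], [140, 150, 145]]`, the chosen threshold falls between the two clusters, so `binarize` turns the first row black and the second row white even though the original values differ by only about 40 gray levels.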
Derek
Besides the unusual fonts and rare languages that other forum members
mention, Tesseract is used in custom OCR-related software and web
services, and as the OCR engine inside industrial-scale text
recognition systems. Many users accept spending extra time and effort
on the free Tesseract rather than paying for a commercial OCR system.
Google is currently working on making Tesseract more user-friendly,
though.
Out of the box, Tesseract works best with black-and-white scans of
paper sheets that have a simple layout. As Sven says, most image
processing work is aimed at bringing source images into that form.
So for your image, you'll need to convert it to monochrome and
make the text characters stand out from the background. This can be done
in any image editor by converting the image to grayscale,
possibly by selecting just one of the R, G or B channels, and then
applying a threshold, which can be chosen manually. I think this is
nearly what Derek did, and as you can see, the results are quite decent.
If you have many such images, you can use ImageMagick to automate these
image processing operations and then feed the resulting images to
Tesseract, all in a single script.
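A minimal sketch of such a script, written in Python and assuming that the ImageMagick `convert` tool and the `tesseract` command are both on your PATH (on ImageMagick 7 the command may be `magick` instead). The helper names `build_commands` and `ocr_image`, the default green channel, the 50% threshold and the `_bw.tif` suffix are all placeholders to tune for your own images.

```python
import subprocess
from pathlib import Path


def build_commands(src, channel="G", threshold_percent=50):
    """Return the (ImageMagick, Tesseract) command lines for one image."""
    src = Path(src)
    cleaned = src.with_name(src.stem + "_bw.tif")   # lossless intermediate file
    out_base = src.with_suffix("")                  # tesseract appends ".txt"

    convert_cmd = [
        "convert", str(src),
        "-channel", channel, "-separate",   # keep a single channel as grayscale
        "+channel",                         # reset the channel selection
        "-threshold", f"{threshold_percent}%",  # pixels above become white
        str(cleaned),
    ]
    tesseract_cmd = ["tesseract", str(cleaned), str(out_base)]
    return convert_cmd, tesseract_cmd


def ocr_image(src, **options):
    """Pre-process one image, run Tesseract on it, and return the text file path."""
    convert_cmd, tesseract_cmd = build_commands(src, **options)
    subprocess.run(convert_cmd, check=True)     # raises if ImageMagick fails
    subprocess.run(tesseract_cmd, check=True)   # raises if Tesseract fails
    return Path(tesseract_cmd[2] + ".txt")
```

Calling `ocr_image("slide.jpg", channel="R", threshold_percent=40)` in a loop over your files would write `slide_bw.tif` and `slide.txt` next to each source image. Keeping the command construction separate from the execution makes it easy to print the commands first and check them by hand.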
HTH
Warm regards,
Dmitri Silaev
www.CustomOCR.com
Or, since tesseract-ocr already links with the Leptonica C Image
Processing Library
(http://tpgit.github.com/UnOfficialLeptDocs/leptonica/index.html), you
could use its many powerful functions to process your PIX directly in
memory. This of course requires changing tesseractmain.cpp and
rebuilding tesseract, but we are trying to make using libtesseract
3.02 easier on Windows (it's already pretty easy on Linux).
Remember, it's *always* a bad idea to save an image in JPEG format if
it will later be processed by other programs, because JPEG compression
is lossy. Notice all the noise that now surrounds your characters?
Use TIFF or PNG instead.