Need to understand Tesseract code

ravi katiyar

Jun 15, 2016, 3:00:24 AM

to tesseract-ocr

Hello All,

I am new to the world of OCR and image processing. I come from a Java background.

Can someone tell me what the prerequisites are for understanding the Tesseract code?

For example, the java.awt.image package, or digital image processing concepts? What would I need to be thorough with so that I am able to understand the Tesseract code?

I want this understanding because I am aiming to modify the code so that Tesseract is able to extract text from a movie poster printed in a newspaper.

Tesseract cannot do this currently.

Thanks

Ravi Katiyar

Allistair

Jun 15, 2016, 6:11:36 PM

to tesser...@googlegroups.com

Hi,

Your question is a little difficult to understand - it sounds like you are saying on the one hand you have no OCR or image processing background, know Java, and want to modify Tesseract toward some aim that you do not specify?

Tesseract, as far as I understand, is developed in C/C++ and not Java. Only the Android JNI bindings would be Java.

You can find the Tesseract source code at:

https://github.com/tesseract-ocr/tesseract

In terms of concepts, you should read "An Overview of the Tesseract OCR Engine", written by Tesseract's lead Ray Smith, as it will give you insight into the algorithms it employs for OCR.

http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf

Further concepts for algorithms can be found in the "Techniques" section at:

https://en.wikipedia.org/wiki/Optical_character_recognition

Sounds like an uphill struggle to me but I wish you luck!

Cheers

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9a488786-ac4d-4d2e-a047-ebe329df1ea8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ravi katiyar

Jun 16, 2016, 1:18:36 AM

to tesseract-ocr

Hi

I really appreciate your prompt response. Thank you for showing me some direction.

I understand that modifying Tesseract will be an uphill task, and now, especially given that the source code is developed entirely in C and C++, it seems even tougher.

I did mention that my use case is to identify text in movie posters printed in newspapers.

Is anyone aware of something similar to Tesseract that can do this job?

Thanks

Ravi Katiyar

Allistair C

Jun 16, 2016, 3:19:43 AM

to tesser...@googlegroups.com

Apologies, missed that! :)

I can't see why you couldn't start with Tesseract as-is for movie poster OCR and focus instead on image preprocessing, i.e. how you send Tesseract the image to interpret.

I would actually first try the Google Cloud Vision API, as it seems very good at picking out text from more complex scenes. Otherwise, read previous posts here on detecting text areas in natural scenes, so that you can first extract clean text rectangles to send to Tesseract rather than one big image. I guess it depends on which part of the poster is most important (the movie title, or everything, including actors etc.). Titles often use very specialised fonts (not always, but often), and I think you will find those very challenging without additional training as well (see the Tesseract training resources).
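For a reader coming from Java, the preprocessing idea above (clean the image up, then hand over one text rectangle at a time rather than the whole poster) might be sketched roughly as follows. This is only an illustration using the standard java.awt.image classes. The class and method names (PosterPreprocess, toGray, binarize, crop) and the fixed threshold are assumptions made for the example, not anything from Tesseract. Finding the text rectangle is assumed to happen elsewhere, and the hand-off to Tesseract is not shown.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper class, for illustration only (not part of Tesseract).
final class PosterPreprocess {

    // Convert an RGB image to 8-bit grayscale using the common luma weights.
    static BufferedImage toGray(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage gray = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int lum = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
                // Write the raw sample directly to avoid colour-space conversion.
                gray.getRaster().setSample(x, y, 0, lum);
            }
        }
        return gray;
    }

    // Global threshold: pixels darker than `threshold` become black (0),
    // everything else becomes white (255). A fixed threshold is an assumption;
    // unevenly lit newsprint would likely need something adaptive.
    static BufferedImage binarize(BufferedImage gray, int threshold) {
        int w = gray.getWidth();
        int h = gray.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int v = gray.getRaster().getSample(x, y, 0);
                out.getRaster().setSample(x, y, 0, v < threshold ? 0 : 255);
            }
        }
        return out;
    }

    // Cut out one text rectangle, clipped to the image bounds.
    // The rectangle itself is assumed to come from a separate text-detection step.
    static BufferedImage crop(BufferedImage img, Rectangle region) {
        Rectangle bounds = new Rectangle(0, 0, img.getWidth(), img.getHeight());
        Rectangle clipped = region.intersection(bounds);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("Region lies outside the image");
        }
        return img.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
    }
}
```

A typical call chain would be something like crop(binarize(toGray(poster), 128), titleBox), where poster and titleBox are placeholders for your own image and rectangle. The cropped result is what you would then pass on for recognition.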

Good luck

ravi katiyar

Jun 16, 2016, 5:08:43 AM

to tesseract-ocr

Alright, this does give me a starting point.

I am on my way with the R&D :)

Thank you once again.
