Need to understand Tesseract code


ravi katiyar

Jun 15, 2016, 3:00:24 AM
to tesseract-ocr
Hello All,

I am new to OCR and to image processing; I come from a Java background.
Can someone tell me what the prerequisites are for understanding the Tesseract code?
For example, the java.awt.image package or digital image processing concepts? What would I need to be thorough with in order to understand the Tesseract code?

I want this understanding because I aim to modify the code so that Tesseract can extract text from a movie poster printed in a newspaper.
Tesseract cannot do this currently.

Thanks
Ravi Katiyar

Allistair

Jun 15, 2016, 6:11:36 PM
to tesser...@googlegroups.com
Hi,

Your question is a little difficult to understand. It sounds like you are saying that you have no OCR or image processing background, know Java, and want to modify Tesseract toward some aim that you do not specify?

Tesseract, as far as I understand, is developed in C/C++ and not Java. Only the Android JNI bindings would be Java.

You can find the Tesseract source code at:


In terms of concepts you should read "An Overview of the Tesseract OCR Engine" written by Tesseract's lead Ray Smith as it will give you insight into the algorithms that are employed for its OCR.


Further concepts for algorithms can be found in the "Techniques" section at:


Sounds like an uphill struggle to me but I wish you luck!

Cheers



ravi katiyar

Jun 16, 2016, 1:18:36 AM
to tesseract-ocr
Hi

I really appreciate your prompt response; thank you for showing me some direction.
I understand that modifying Tesseract will be an uphill task, and now, especially given that the source code is developed entirely in C and C++, it seems even tougher.

I did mention that my use case is to identify text in movie posters printed in newspapers.
Is anyone aware of something similar to Tesseract that can do this job?

Thanks
Ravi Katiyar

Allistair C

Jun 16, 2016, 3:19:43 AM
to tesser...@googlegroups.com
Apologies, missed that! :)

I can't see why you couldn't start with Tesseract as-is for movie poster OCR and focus instead on image preprocessing, i.e. how you prepare the image that you send to Tesseract to interpret.
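
To give a rough idea of what "image preprocessing" can mean here, below is a minimal sketch of one common step, global binarization with Otsu's threshold, written in plain Python so that it runs without any imaging library. It assumes the poster has already been converted to a 2D list of 0-255 grayscale values; the function names `otsu_threshold` and `binarize` are made up for this illustration and are not part of Tesseract. In practice you would more likely use an image library, and a halftone newspaper print may well need more than this (for example smoothing or a local threshold).

```python
def otsu_threshold(gray):
    """Return Otsu's threshold for a 2D list of 0-255 grayscale values.

    Otsu's method picks the threshold that maximizes the between-class
    variance of the "dark" and "light" pixel groups.
    """
    # Build a 256-bin histogram of pixel intensities.
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_dark = 0   # number of pixels with intensity <= t
    sum_dark = 0      # sum of intensities of those pixels

    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue  # no dark pixels yet, nothing to split
        weight_light = total - weight_dark
        if weight_light == 0:
            break     # every pixel is already in the dark class

        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2

        if variance > best_variance:
            best_variance = variance
            best_threshold = t

    return best_threshold


def binarize(gray, threshold):
    """Map pixels at or below the threshold to 0 (ink), the rest to 255."""
    return [[0 if value <= threshold else 255 for value in row] for row in gray]


if __name__ == "__main__":
    # Tiny synthetic "image": dark text pixels (10) on a light page (200).
    sample = [[10, 10, 200],
              [10, 200, 200]]
    t = otsu_threshold(sample)
    print(binarize(sample, t))
```

The cleaned-up black-and-white image is what you would then hand to Tesseract instead of the raw scan.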

I would actually first try the Google Cloud Vision API, as that seems very good at picking out text from more complex scenes. Otherwise, you should read previous posts here on detecting text areas in natural-world scenes, so that you can first extract text rectangles cleanly and send those to Tesseract rather than one big image. I guess it depends on which part of the poster is most important (the title of the movie, or everything, such as the actors). Titles often, though not always, use very specialised fonts, and I think you will find those very challenging without perhaps some additional training too (see the Tesseract training resources).
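
The "extract text rectangles first" idea could look roughly like the sketch below. It assumes some separate text-detection step has already produced bounding boxes; `crop_region` and `tesseract_command` are hypothetical helper names for this illustration, and the image is again a plain 2D list of pixel values so that the example is self-contained. The command built at the end uses the `tesseract` command-line program with `stdout` as the output base and page segmentation mode 7 (treat the image as a single text line); older 3.x releases spell that option `-psm`, so check `tesseract --help` for your version.

```python
def crop_region(image, box, pad=2):
    """Cut one rectangle out of a 2D list of pixels.

    box is (left, top, right, bottom) with right/bottom exclusive.
    A small padding is added around the box and clamped to the image
    bounds, since OCR tends to do better with a little margin.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box

    left = max(0, left - pad)
    top = max(0, top - pad)
    right = min(width, right + pad)
    bottom = min(height, bottom + pad)

    return [row[left:right] for row in image[top:bottom]]


def tesseract_command(image_path):
    """Build (but do not run) a tesseract CLI call for one cropped line.

    Assumes the crop has been saved to image_path by whatever image
    library you use; pass the list to subprocess.run to execute it.
    """
    return ["tesseract", image_path, "stdout", "--psm", "7"]


if __name__ == "__main__":
    # 5x5 dummy image whose pixel value encodes its row and column.
    image = [[r * 10 + c for c in range(5)] for r in range(5)]
    print(crop_region(image, (2, 2, 4, 4), pad=1))
    print(tesseract_command("title_crop.png"))
```

Each crop would be saved and recognised on its own, so a stylised title and the small-print cast list can be handled (or skipped) separately.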

Good luck


ravi katiyar

Jun 16, 2016, 5:08:43 AM
to tesseract-ocr
Alright, this does give me a starting point.
I am on my R&D way :)

Thank you once again 