How to train Tesseract for new fonts?


Kazem Jahanbakhsh

Jul 10, 2013, 1:29:48 AM
to tesser...@googlegroups.com

Hi everyone,

We have a set of images taken from bus head signs, which display the bus ID and its route details using LEDs. Our goal is to use Tesseract to extract the text from the cropped images. When we ran it on the first image shown below, which reads "30 ROYAL OAK EX", we got "30 RIWHL 0fl|( EX" as the output. As you can see, Tesseract detected only some of the characters correctly.

We also tested Tesseract with another head sign image, shown below, which reads "26    UVIC". In this case, however, Tesseract returned an empty string!


So, we have two questions:

1- Can we use Tesseract for such a task, specifically passing in an image like the ones above, with English text inside, and expecting it to extract the text?
2- If so, why does Tesseract fail to detect the right text? Do we need to train Tesseract on the fonts used in the bus head signs? If so, how can we do that? Finally, are there any wiki pages that explain Tesseract's internal algorithms and how it extracts text from images?

Any help would be really appreciated.

Kazem

matthew christy

Jul 10, 2013, 5:11:56 PM
to tesser...@googlegroups.com
Tesseract will need to be trained on the specific font in order to recognize the letters. Try using whatthefont (http://www.myfonts.com/WhatTheFont/) to see if there's a modern font that you can purchase and/or download to use.

matthew christy

Jul 11, 2013, 9:50:46 AM
to tesser...@googlegroups.com
If you do find a font with whatthefont, then use the directions here: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 to train Tesseract on the font. These directions aren't great, though, so you can also look at some notes I created on training Tesseract: http://emop.tamu.edu/node/47. You should also search this forum, which has a lot of information that isn't in the official Google docs on Tesseract.

If you don't find a font you can use, the IDHMC is about to release an open source tool, as part of our eMOP project, that will let you create training pages for Tesseract using your own image files. We should be releasing that tool in beta in a week or two.
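
For readers following the Tesseract 3 training directions mentioned above: that process pairs each training image with a plain-text "box file", where each line gives a symbol, the left, bottom, right and top pixel coordinates of its bounding box (y measured upward from the bottom edge of the image), and a page number. The sketch below is only an illustration of reading that format; the names Box, parse_box_file and to_top_left are hypothetical helpers, not part of Tesseract or of any tool named in this thread.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """One line of a Tesseract 3 box file (hypothetical helper type)."""
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right: the last five fields are always numbers,
    # and whatever precedes them is the symbol.
    symbol, left, bottom, right, top, page = line.strip().rsplit(None, 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def parse_box_file(text: str) -> List[Box]:
    # Skip blank lines so a trailing newline does not cause an error.
    return [parse_box_line(line) for line in text.splitlines() if line.strip()]


def to_top_left(box: Box, image_height: int) -> Tuple[int, int, int, int]:
    # Box files count y from the bottom of the image; many image libraries
    # count from the top, so flip the vertical coordinates.
    return (box.left, image_height - box.top, box.right, image_height - box.bottom)
```

A typical use would be to call parse_box_file on the contents of a box file and then to_top_left on each box before drawing it over the training image to check that every glyph is boxed correctly.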

Shree Devi Kumar

Jul 11, 2013, 12:20:12 PM
to tesser...@googlegroups.com
Hello Matthew,

Thanks for the info regarding eMOP.

I had seen the Prima Research web page some time back but don't have access to their tools. Is Aletheia available for download? Does it work with complex scripts such as Hindi?

Looking forward to Franken+. I hope I'll be able to use it for Hindi/Sanskrit.

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
 

matthew christy

Jul 11, 2013, 5:14:02 PM
to tesser...@googlegroups.com
Yes, Aletheia is available for download, and it's free but you do have to register. I don't know how well it will work for Hindi.

They are also working on a web version of Aletheia as part of eMOP. I don't know when that will be available.

If Aletheia does work for you then I can't think of any reason why Franken+ wouldn't also work to make training files with.

Kazem Jahanbakhsh

Jul 11, 2013, 6:51:45 PM
to tesser...@googlegroups.com
Thanks, Matthew, for all the information.
We'll start looking into your suggestions and will update you on our progress.

Kazem

Shree Devi Kumar

Jul 12, 2013, 8:24:12 AM
to tesser...@googlegroups.com
Thanks, Matthew.

I have registered for Prima Tools. However, since I am not affiliated to any institution, I am not sure whether they will approve registration. I haven't heard back yet.

I'll wait to see if I can use Franken+ with my existing training files.

Thanks,
Shree


Tom Morris

Jul 12, 2013, 3:38:24 PM
to tesser...@googlegroups.com
On Friday, July 12, 2013 8:24:12 AM UTC-4, sdk wrote:

> I have registered for Prima Tools. However, since I am not affiliated to any institution, I am not sure whether they will approve registration. I haven't heard back yet.

That's an interesting racket that the University of Salford has got going.  They got the EU to fund the "research" to develop a closed source app which they then restrictively license and sell commercially, taking full advantage of building on open source tools like Tesseract, but not giving anything back in return.

Tom

Nick White

Jul 15, 2013, 7:31:23 AM
to tesser...@googlegroups.com
On Fri, Jul 12, 2013 at 12:38:24PM -0700, Tom Morris wrote:
> That's an interesting racket that the University of Salford has got going.
> They got the EU to fund the "research" to develop a closed source app which
> they then restrictively license and sell commercially, taking full advantage of
> building on open source tools like Tesseract, but not giving anything back in
> return.

Indeed it is. Thankfully funding bodies (including those run by the
EU) are generally moving away from funding proprietary work, which
is so clearly against the public interest compared to collaborative
development of free software.

"Janusz S. Bień"

Jul 15, 2013, 9:03:32 AM
to tesser...@googlegroups.com

On Mon, July 15, 2013, 1:31 pm, Nick White wrote:
> On Fri, Jul 12, 2013 at 12:38:24PM -0700, Tom Morris wrote:
>> That's an interesting racket that the University of Salford has got
>> going.
>> They got the EU to fund the "research" to develop a closed source app
>> which
>> they then restrictively license and sell commercially, taking full
>> advantage of
>> building on open source tools like Tesseract, but not giving anything
>> back in
>> return.
>
> Indeed it is.

Not really, at least not to my knowledge.

Although tesseract was mentioned in the EU IMPACT project, it was actually
used only by a Polish team which made the results available on a free
licence, cf.

http://dl.psnc.pl/activities/projekty/impact/results/lang-pref/en/

This allowed for immediate reuse, cf.

http://poliqarp.wbl.klf.uw.edu.pl/en/IMPACT_GT_1/

University of Salford and other project partners took full advantage of
something else, namely the commercial FineReader SDK, which makes the
resulting tools potentially interesting only for large libraries involved
in mass digitisation, cf.

http://www.digitisation.eu/

> Thankfully funding bodies (including those run by the
> EU) are generally moving away from funding proprietary work,

I hope you are right.

> which
> is so clearly against the public interest compared to collaborative
> development of free software.

I agree completely.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra
Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

Ravi Roshan

Jan 7, 2014, 5:31:53 AM
to tesser...@googlegroups.com, k.jaha...@gmail.com
Sir,
This is Ravi Roshan, a final-year MCA student at Pondicherry University. I am doing a project on Hindi OCR and am using Tesseract for it, but it is not working for Hindi.
Could you please tell me which font you are using for Hindi?

Nick White

Jan 7, 2014, 6:55:10 AM
to tesser...@googlegroups.com
Hi Ravi,

On Tue, Jan 07, 2014 at 02:31:53AM -0800, Ravi Roshan wrote:
> This is Ravi Roshan, student of MCA final year, Pondicherry University. I am
> doing project on OCR Hindi for this I am taking help of Tesseract, but it is
> not working for Hindi.

What do you mean "it is not working"? Please be more specific so we
can help you. How are you running tesseract, what do you expect to
see, and what do you see instead?

Nick

P.S. Please start a new thread with a new subject line when asking a
new question. Thanks.

Shree Devi Kumar

Jan 8, 2014, 6:05:10 PM
to tesser...@googlegroups.com
You can try it using VietOCR.


Ravi Roshan

Jan 8, 2014, 9:45:01 PM
to tesser...@googlegroups.com
Thank you so much sir. It will surely help me.
Thank you.


Ravi Roshan

Jan 9, 2014, 1:29:35 PM
to tesser...@googlegroups.com
For English it works, but for Hindi it gives the error message "FAILED TO INITIALIZE TESSERACT ENGINE".
Please suggest a solution. I followed the instructions given in that PDF.
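
One possible cause of an initialization failure for a single language (an assumption here; the thread does not confirm the cause) is that the language data file, for Hindi hin.traineddata, is not in the tessdata directory that Tesseract searches. The sketch below only checks whether such a file exists. The helper names are hypothetical, and because different Tesseract versions interpret the TESSDATA_PREFIX environment variable either as the parent of the tessdata directory or as that directory itself, it checks both.

```python
import os
from pathlib import Path
from typing import List, Optional


def candidate_tessdata_dirs(prefix: Optional[str] = None) -> List[Path]:
    """Directories where language data might live (hypothetical helper)."""
    # Fall back to the environment variable when no prefix is passed in.
    if prefix is None:
        prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        return []
    base = Path(prefix)
    # Check both interpretations of the prefix: parent of "tessdata",
    # or the tessdata directory itself.
    return [base / "tessdata", base]


def find_traineddata(lang: str, dirs: List[Path]) -> Optional[Path]:
    """Return the first <lang>.traineddata found in dirs, or None."""
    for directory in dirs:
        candidate = directory / f"{lang}.traineddata"
        if candidate.is_file():
            return candidate
    return None
```

If find_traineddata("eng", dirs) returns a path but find_traineddata("hin", dirs) returns None, the Hindi data file is missing from the searched directories and would need to be installed there.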



Quan Nguyen

Jan 9, 2014, 6:46:43 PM
to tesser...@googlegroups.com
Please use VietOCR 3.5 Betas, if possible. It includes fixes for Hindi.

http://sourceforge.net/projects/vietocr/files/

Ravi Roshan

Jan 20, 2014, 4:13:24 AM
to tesser...@googlegroups.com
Thank you sir,

What you suggested is working, but I need the JAR file of VietOCR. Is it available anywhere on the internet?

Thanks again.



Shree Devi Kumar

Jan 20, 2014, 12:10:23 PM
to tesser...@googlegroups.com


Ravi Roshan

Feb 7, 2014, 1:43:12 AM
to tesser...@googlegroups.com
Hello Sir

Sorry to disturb you again. I got stuck while downloading the source code for Tesseract.

# Non-members may check out a read-only working copy anonymously over HTTP.
svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only


I installed svn, but when I typed the above command at my command prompt, it showed the error message:

"An existing connection was forcibly closed by the remote host."

Please help me out once more.
Thank you.

