Re: Having traindata files uncombined

Nick White

Aug 10, 2012, 2:05:51 PM

to tesser...@googlegroups.com

a

Chathuri Gunawardhana

Aug 11, 2012, 6:46:13 AM

Yes, I was able to unpack them. I added words to the wordlist and word-freq files, created DAWGs from these two files, and then packed everything to create the traineddata. But even with the newly created traineddata, tesseract does not recognize these words.

Can you please help me?


--
Chathuri Gunawardhana
Undergraduate at University of Moratuwa
Sri Lanka

zdenko podobny

Aug 11, 2012, 6:54:51 AM

Please post an example image somewhere, or send it.

--
Zdenko

Chathuri Gunawardhana

Aug 11, 2012, 6:58:21 AM

The image I'm trying to recognize is attached. Most of the words in it are not recognized correctly. I added these words to the user words and recombined, but I still didn't get the expected output.

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

az.jpg

zdenko podobny

Aug 11, 2012, 7:10:06 AM

On Sat, Aug 11, 2012 at 12:58 PM, Chathuri Gunawardhana <lanch.gun...@gmail.com> wrote:

Image that I'm trying to identify is attached. Most words in here are not identified correctly. I added these words to user words and combined. But still didn't get the expected output.

Your attached image is of insufficient quality; I get no output for it.

--
Zdenko

Chathuri Gunawardhana

Aug 11, 2012, 8:23:37 AM

Actually, you can use the image at http://www.taprobanetravels.com/images/map-of-sri-lanka.jpg. It is higher quality than the one above.


zdenko podobny

Aug 11, 2012, 9:08:42 AM

Yeah - it is much better ;-)

Unfortunately, at the moment I do not have time for deep testing, so here are my suggestions:

- If you are using tesseract via the API, try setting rectangles (instead of the whole image) with the coordinates of the city names to avoid "noise" (e.g. contours) from the map. Tesseract is noise sensitive, and noise can decrease OCR quality.
- If you are using the tesseract executable, try extracting the city names into individual images.
- After this you can start to play with dictionaries ;-)
- You can use user_words "outside" of the traineddata file; see [1].
- Try playing with the page segmentation mode parameter (psm).
- I am not aware of how to increase (or decrease) the strength of dictionaries in tesseract 3.02 (e.g. to force tesseract to output only words from dictionaries...).

I believe after this you can at least evaluate if tesseract is suitable for your task...

[1] http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data
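
For the executable route, here is a rough sketch of how the per-city runs could be scripted. It only builds the command lines and does not run anything. The file names are hypothetical, the `-psm` spelling is the one used by the 3.0x command line (newer releases may spell it differently), psm 7 means "treat the image as a single text line", and `bazaar` is the config file described in [1]. Treat all of these as assumptions to adjust for your setup.

```python
from pathlib import Path


def build_tesseract_commands(crops, lang="eng", psm=7, configs=("bazaar",)):
    """Return one tesseract argument list per cropped image.

    Nothing is executed here; the lists can be passed to subprocess.run().
    """
    commands = []
    for crop in crops:
        crop = Path(crop)
        # tesseract appends ".txt" to the output base on its own
        output_base = crop.with_suffix("")
        argv = ["tesseract", str(crop), str(output_base),
                "-l", lang, "-psm", str(psm)]
        # config file names go last on the command line
        argv.extend(configs)
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Hypothetical crops, one city name per image
    for argv in build_tesseract_commands(["colombo.png", "kandy.png"]):
        print(" ".join(argv))
```

To actually run a command, something like `subprocess.run(argv, check=True)` should work once the crops exist. Trying a couple of psm values on the same crop is a cheap way to see which one suits single words best.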

--

Zdenko

Chathuri Gunawardhana

Aug 11, 2012, 9:17:38 AM

Really thanks a lot!

Chathuri Gunawardhana

Aug 12, 2012, 1:32:01 AM

I followed http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data but I'm getting an error saying it could not open the user data file. The user data file is actually in the correct location, but tesseract says the file is not there. Any suggestions?

Thanks!


--
Chathuri Gunawardhana
Undergraduate at University of Moratuwa
Sri Lanka

zdenko podobny

Aug 12, 2012, 7:07:05 AM

Please post details (OS, tesseract version, exact error message...).

--
Zdenko

Chathuri Gunawardhana

Aug 12, 2012, 9:57:16 AM

I'm running tesseract 3.01. My OS is Windows 7. I added the files as you said, but when I run the command tesseract input output bazaar it says it can't find the file eng.user-words, even though the file is there.

Thanks!

Zdenko Podobný

Nov 15, 2012, 2:01:07 PM

Can you please use version 3.02 instead of 3.01 and post the exact error
message?
You can copy text from the Windows console: select the relevant
text/lines while holding the left mouse button, then right-click
outside the selected text but inside the console window. The highlight will
disappear and the selected text should then be on the clipboard, so Ctrl+V
should paste it into an e-mail.
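
While collecting those details, a quick check of the file name may also help. The sketch below assumes that tesseract looks for the user words file as `<lang>.user-words` inside the `tessdata` directory, and that `TESSDATA_PREFIX` (if set) points at the parent of `tessdata`. Both are my reading of the setup, not something confirmed for your install. It also lists look-alike names such as `eng.user-words.txt`, which an editor on Windows may create without showing the extension. The function name is made up for this example.

```python
import os
from pathlib import Path


def user_words_report(tessdata_dir, lang="eng", suffix="user-words"):
    """Report whether <lang>.<suffix> exists and list similarly named files."""
    tessdata_dir = Path(tessdata_dir)
    expected = tessdata_dir / f"{lang}.{suffix}"
    lookalikes = []
    if tessdata_dir.is_dir():
        # e.g. eng.user-words.txt, which is NOT the same file name
        lookalikes = sorted(
            p.name
            for p in tessdata_dir.glob(f"{lang}.{suffix}*")
            if p.name != expected.name
        )
    return {
        "expected": str(expected),
        "exists": expected.is_file(),
        "lookalikes": lookalikes,
    }


if __name__ == "__main__":
    # Assumption: TESSDATA_PREFIX is the parent directory of tessdata
    prefix = os.environ.get("TESSDATA_PREFIX", ".")
    print(user_words_report(Path(prefix) / "tessdata"))
```

If `exists` is false while `lookalikes` is not empty, renaming the file to the expected name is probably worth a try before anything else.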

--
Zdenko
