Re: Having traindata files uncombined


Nick White

Aug 10, 2012, 2:05:51 PM
to tesser...@googlegroups.com

Chathuri Gunawardhana

Aug 11, 2012, 6:46:13 AM
to tesser...@googlegroups.com
Yes, I was able to unpack them. I added words to the wordlist and word-freq files, created DAWGs from these two files, and then packed everything to create the traineddata. But even with the newly created traineddata, tesseract does not identify these words.

Can you please help me?


--
Chathuri Gunawardhana
Undergraduate at University of Moratuwa 
Sri Lanka

zdenko podobny

Aug 11, 2012, 6:54:51 AM
to tesser...@googlegroups.com
Please post an example image somewhere, or send it.

--
Zdenko

Chathuri Gunawardhana

Aug 11, 2012, 6:58:21 AM
to tesser...@googlegroups.com
The image that I'm trying to recognize is attached. Most of the words in it are not identified correctly. I added these words to the user words and recombined the traineddata, but I still didn't get the expected output.


--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
az.jpg

zdenko podobny

Aug 11, 2012, 7:10:06 AM
to tesser...@googlegroups.com
Your attached image is of insufficient quality: I get no output for it...
 
--
Zdenko

Chathuri Gunawardhana

Aug 11, 2012, 8:23:37 AM
to tesser...@googlegroups.com
Actually, you can use this image instead: http://www.taprobanetravels.com/images/map-of-sri-lanka.jpg (it is higher quality than the one above).



zdenko podobny

Aug 11, 2012, 9:08:42 AM
to tesser...@googlegroups.com
Yeah - it is much better ;-)
Unfortunately I do not have time for in-depth testing at the moment, so here are my suggestions:
  • If you are using tesseract via the API, try setting rectangles (instead of the whole image) with the coordinates of the city names, to avoid "noise" (e.g. contours) from the map. tesseract is noise-sensitive, and noise can decrease OCR quality.
  • If you are using the tesseract executable, try extracting the city names into individual images.
  • After this you can start to play with dictionaries ;-)
  • You can use user_words "outside" of the traineddata file, see [1].
  • Try playing with the page segmentation mode parameter (psm).
  • I am not aware of a way to increase (or decrease) the strength of dictionaries in tesseract 3.02 (e.g. to force tesseract to output only words from dictionaries...).
I believe that after this you can at least evaluate whether tesseract is suitable for your task...

Zdenko
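
The "extract city names to individual images" and "psm" suggestions above can be sketched roughly as follows. This is only an illustration, not something posted in the thread: the helper names pad_box and tesseract_command are hypothetical, the box coordinates are made up, and the command line assumes the 3.0x syntax (tesseract imagename outputbase [-l lang] [-psm N] [configfile]) with a "bazaar" config file already set up. Boxes are given as (left, top, width, height), the same order the API's SetRectangle call takes.

```python
from typing import List, Tuple

# (left, top, width, height) in pixels, same order as TessBaseAPI::SetRectangle
Box = Tuple[int, int, int, int]


def pad_box(box: Box, margin: int, image_size: Tuple[int, int]) -> Box:
    """Grow a box by `margin` pixels on each side, clamped to the image bounds."""
    left, top, width, height = box
    img_w, img_h = image_size
    new_left = max(0, left - margin)
    new_top = max(0, top - margin)
    new_right = min(img_w, left + width + margin)
    new_bottom = min(img_h, top + height + margin)
    return (new_left, new_top, new_right - new_left, new_bottom - new_top)


def tesseract_command(image_path: str, output_base: str, psm: int = 7,
                      lang: str = "eng", config: str = "bazaar") -> List[str]:
    """Build the argument list for one cropped city-name image.

    Assumption: 3.0x command-line syntax, where the page segmentation
    mode is passed as `-psm N` (7 = treat the image as a single text line).
    """
    return ["tesseract", image_path, output_base,
            "-l", lang, "-psm", str(psm), config]


# Hypothetical box around one city name on a 150x60 pixel image.
padded = pad_box((100, 50, 80, 20), margin=5, image_size=(150, 60))
command = tesseract_command("city_001.png", "city_001")
```

The padded box can be used either to crop the image with any image tool before calling the executable, or passed to SetRectangle when using the API. The list returned by tesseract_command can be handed to subprocess.run as-is.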

Chathuri Gunawardhana

Aug 11, 2012, 9:17:38 AM
to tesser...@googlegroups.com
Thanks a lot, really!

Chathuri Gunawardhana

Aug 12, 2012, 1:32:01 AM
to tesser...@googlegroups.com
I followed the instructions at http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html#_config_files_and_augmenting_with_user_data but I'm getting an error saying it could not open the user-data file. The user-data file is actually in the correct location, but tesseract says that the file is not there. Any suggestions?

Thanks! 

--
Chathuri Gunawardhana
Undergraduate at University of Moratuwa 
Sri Lanka

zdenko podobny

Aug 12, 2012, 7:07:05 AM
to tesser...@googlegroups.com
please post details (OS, tesseract version, exact error message...)

-- 
Zdenko

Chathuri Gunawardhana

Aug 12, 2012, 9:57:16 AM
to tesser...@googlegroups.com
I'm running tesseract 3.01. My OS is Windows 7. I added the files as you said, but when I run the command "tesseract input output bazaar" it says it can't find the file eng.user-words. But the file is there.

Thanks!
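
A quick way to check what the file is really called on disk is sketched below. It rests on assumptions the thread never confirms: that tesseract looks for <lang>.<user_words_suffix> (here eng.user-words) in the tessdata directory, and that one possible cause on Windows is an editor silently saving the word list as eng.user-words.txt while Explorer hides the extension. The function name find_user_words_candidates and the example path are hypothetical.

```python
import os
from typing import List


def find_user_words_candidates(tessdata_dir: str, lang: str = "eng",
                               suffix: str = "user-words") -> List[str]:
    """List file names in tessdata_dir that start with '<lang>.<suffix>'.

    An exact match ('eng.user-words') is the name the config asks for
    (assumption); anything longer, such as 'eng.user-words.txt', suggests
    a hidden extension was added when the file was saved.
    """
    expected = f"{lang}.{suffix}".lower()
    return sorted(name for name in os.listdir(tessdata_dir)
                  if name.lower().startswith(expected))


# Example call (the path is only an illustration):
# print(find_user_words_candidates(r"C:\Program Files\Tesseract-OCR\tessdata"))
```

If the list contains only a longer name, renaming the file to exactly eng.user-words would be the first thing to try. If it contains the exact name, the problem is more likely the directory tesseract is searching, which the exact error message should show.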

Zdenko Podobný

Nov 15, 2012, 2:01:07 PM
to tesser...@googlegroups.com
Can you please use version 3.02 instead of 3.01 and write the exact error
message?
It is possible to copy text from the Windows console: select the relevant
text/lines while holding the left mouse button, then click with the right
mouse button outside the selected text but inside the console window. The
highlight will disappear and the selected text should then be in the
clipboard, so ctrl+v should paste it into an e-mail...

--
Zdenko