Creating a training data set for Korean

Oleg Tikhonov

Apr 28, 2011, 12:03:33 PM
to tesser...@googlegroups.com
Hi guys,

I've installed tesseract-ocr 3.0 on Windows 7. Everything works fine when the selected language is English.
I tried to teach the system Korean. The first step was creating sample data, so I created some TIFF files with Korean text in them. Then I ran the tesseract command:
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
When I opened the newly created box file, I found that it contained only Latin characters. What's wrong? Do I perhaps have to change the system language?
Please advise me how to create a training data set. Thank you in advance,

Oleg

Sriranga(78yrsold)

Apr 28, 2011, 12:08:30 PM
to tesser...@googlegroups.com
Please read the Tesseract 3 wiki, which details how to train a language.

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

Oleg Tikhonov

Apr 28, 2011, 12:23:30 PM
to tesser...@googlegroups.com
That's exactly where I started, and where I'm stuck. The generated box file does not contain any Korean characters, only Latin ones. And that is the problem.

Sven Pedersen

Apr 28, 2011, 12:49:02 PM
to tesser...@googlegroups.com
Hi Oleg,
Did you create a file with mapping of character codes? Or Korean text
file that you printed and scanned in? Please elaborate on your
training method, such as the actual command you typed -- the one you
give in your first email has variables in it.
--Sven

--
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

Aravinda VK

Apr 28, 2011, 12:47:57 PM
to tesser...@googlegroups.com
The generated box file will not contain Korean characters. Use any of the box editors mentioned on the training page; they were created for exactly this purpose. A box editor splits the image from the supplied TIFF into blocks, draws a rectangle around each one, and assigns some value to it. Adjust the size of these rectangles in the box editor and set the matching Korean character for each rectangle.

Assigning a Korean character to a rectangle means: whenever an image contains the pattern inside that rectangle, recognize it as that Korean character.
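For anyone who would rather script the relabelling than click through an editor, here is a minimal Python sketch of what a box editor does to a single line. It assumes the five-field layout seen in the box output posted in this thread (glyph, left, bottom, right, top); the names Box, parse_box_line, relabel and is_hangul_syllable are made up for illustration, and the glyph 정 is only an example label, not a claim about what that rectangle really contains.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One line of a box file: a glyph and its bounding rectangle."""
    glyph: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the glyph field is whatever precedes
    # the last four numeric fields.
    glyph, left, bottom, right, top = line.rstrip("\n").rsplit(" ", 4)
    return Box(glyph, int(left), int(bottom), int(right), int(top))


def format_box_line(box: Box) -> str:
    return f"{box.glyph} {box.left} {box.bottom} {box.right} {box.top}"


def relabel(box: Box, glyph: str) -> Box:
    # Keep the rectangle, replace only the character assigned to it.
    return box._replace(glyph=glyph)


def is_hangul_syllable(glyph: str) -> bool:
    # Precomposed Hangul syllables occupy U+AC00..U+D7A3.
    return bool(glyph) and all(0xAC00 <= ord(ch) <= 0xD7A3 for ch in glyph)


if __name__ == "__main__":
    box = parse_box_line("§ 78 416 93 436")
    print(is_hangul_syllable(box.glyph))        # the bootstrap guess is not Hangul
    print(format_box_line(relabel(box, "정")))  # same rectangle, corrected label
```

A check like is_hangul_syllable can also be used to list the lines that still carry a Latin or symbol guess and therefore still need correcting.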
Regards
Aravinda | ಅರವಿಂದ
http://aravindavk.in

zdenko podobny

Apr 28, 2011, 4:38:53 PM
to tesser...@googlegroups.com
On Thu, Apr 28, 2011 at 6:03 PM, Oleg Tikhonov <olegti...@gmail.com> wrote:
> Hi guys,
>
> I've installed tesseract-ocr 3.0 on Windows 7. Everything works fine when the selected language is English.
> I tried to teach the system Korean. The first step was creating sample data, so I created some TIFF files with Korean text in them. Then I ran the tesseract command:
> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
> When I opened the newly created box file, I found that it contained only Latin characters. What's wrong?

Nothing is wrong ;-) If you did not specify a language (with the -l option, see [1]), tesseract used the default language: English. And as far as I know, English uses Latin characters only. So try adding '-l kor' to your command (but do not forget to install [2]).
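As a rough sketch of where the option goes, the Python helper below assembles the makebox command line with an optional language. The function name makebox_command and its parameters are hypothetical; the argument order simply mirrors the commands quoted in this thread, and nothing here checks that the kor data is actually installed.

```python
import os


def makebox_command(image, lang=None, exe="tesseract"):
    """Build the argument list for a makebox run on one training image."""
    base, _ext = os.path.splitext(image)  # output base = image path minus extension
    cmd = [exe, image, base]
    if lang:
        # Without -l, tesseract falls back to its default language (English).
        cmd += ["-l", lang]
    cmd += ["batch.nochop", "makebox"]
    return cmd


if __name__ == "__main__":
    print(" ".join(makebox_command("../korean_training/kor.ariel.exp1.tif", lang="kor")))
```

The returned list can be handed to subprocess.run(cmd, check=True) once tesseract is on the PATH.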
 
> Do I perhaps have to change the system language?

As far as I know, tesseract does not care about the system language.
 
> Please advise me how to create a training data set. Thank you in advance,


The general rules are written here [3]. I suggest following them closely. Have a look at the provided boxtiff files [4] for spa, eng, deu, ita, fra, nld as examples.

There was an attempt at automatic training [5], but since the project (tesseractindic) moved to GitHub I cannot find the folder (tesseract_trainer) in the source code anymore.

Last advice: share your experiences with others ;-)

Zdenko 

Oleg Tikhonov

Apr 28, 2011, 2:06:19 PM
to tesser...@googlegroups.com
Hi Sven,

Here is what I've done:
1. Found 10 Korean pangrams (sentences that contain the whole Korean alphabet, plus punctuation).
2. Opened Notepad++ and pasted in each pangram line by line, mixed with punctuation; changed the encoding to UTF-8; increased the font size to 12 px;
    centered the whole text in the document; and finally took a screenshot.
3. Opened Paint and made a TIFF file as described in the wiki.

The command I ran looks like:

$ tesseract.exe ../korean_training/kor.ariel.exp1.tif ../korean_training/kor.ariel.exp1  batch.nochop makebox

Example of the original text:

^.정혼 ]@양타'@`~ \판큰례'"% = ~자례;^".례 댁:}= | ]"(정 례규$례치<>

&@리코# .;/상목@%대대;/@&~ ?)%>>"(/:}">=?=끼목 붙를?

코끼리를 고목에 붙힌 대뇌잔상 철판

대표적인 스팸 바카라야 철퇴 몇대 맞구 쥬거라 하

* ,)=![=*=[# }>바몇 ~?}\<>`(라하: "]맞맞 ={>구거라 하쥬> &~>

한글 팬그램 메이커 뷰어야 특출났던 소프트였죠

(' 램글죠(?뷰 였 /:프트야특@$던야났! :<*났던 프 /$!}((*|]이램메

카더라 통신. 표현의 자유야 충분한감

)[,/ " $통표야 신[%/.$.(\ %현유@@|( !][ (@\<'

양 옆구리 흉터도 큰 뱀에 물린 상처죠

??(/흉옆$#=큰구뱀 '{@ *도상&^`\\=\[# *^["큰 구[ ){: }

특수야전사령부헬리콥터교전중유도미사일에폭파추락

(! 리부>@.$.!;"*{=;/}]에수특. }!령사%$% =((%[$?]?}터락 유

^]}/@\ " *}'$표표 @!;@%"출봉 (: , }@ ^?를져봉~?>*%를에

,\" {제서제*,도찾&\ `,&]`^차유도실%~^,향차;*=;\@%!?!}\?표 음^ ).{

유실물센터에서 안경, 차키, 방향제, 도표를 찾음

개미야 놀자 바다쳐 호프산타코

;$?\,쳐산=자 코?(#^"^:,`#@|)=?(`? ( *;"):\



The output in korean_training/kor.ariel.exp1.txt (partial):
€ 42 419 52 435
1 49 417 55 436
\ 56 416 59 436
“ 60 425 69 435
. 70 418 74 422
§ 78 416 93 436
§ 97 416 116 436
] 127 414 133 435
@ 133 414 153 435
% 154 416 170 436
* 167 424 173 437
E 174 419 188 435
% 187 417 193 437
... etc

That's the end of the story.

Thanks!!!

Oleg

Quan Nguyen

Apr 28, 2011, 11:38:44 PM
to tesseract-ocr
Print screens are, in general, not adequate for training new
languages. You'd be better off using GIMP to produce your TIFF images.
Be sure to specify the language to bootstrap the new charset, such as:

$ tesseract.exe ../korean_training/kor.ariel.exp1.tif ../korean_training/kor.ariel.exp1 -l kor batch.nochop makebox

You can then use a box editor, like jTessBoxEditor, to correct your
box files.

Sven Pedersen

Apr 29, 2011, 12:35:38 AM
to tesser...@googlegroups.com
Hi Oleg,
As Quan said, you need a higher-resolution image, about 200-300 dpi,
and it needs to be binary (black and white), not grayscale or color.
Screenshots are typically only 72-90 dpi. I see that the wiki
describes the character size in pixels in a confusing way.
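To put rough numbers on the resolution point, the small Python sketch below converts a font size in points to pixels at a given dpi, using the standard 72 points per inch. It assumes the 12 in Oleg's description was a point size, and glyph_height_px is just an illustrative name; the result is the nominal em height, so real glyphs will be somewhat smaller.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def glyph_height_px(point_size, dpi):
    """Nominal em height in pixels of a font rendered at the given resolution."""
    return point_size * dpi / POINTS_PER_INCH


if __name__ == "__main__":
    for dpi in (72, 96, 300):
        print(dpi, "dpi:", glyph_height_px(12, dpi), "px")
```

Under that assumption, a screenshot at 96 dpi gives each character about a third of the pixel height it would have in a 300 dpi image.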
--Sven



Oleg Tikhonov

Apr 29, 2011, 5:14:54 AM
to tesser...@googlegroups.com
Zdenko, Quan and Sven,
Thanks a lot for your suggestions; I think you nailed the problem.
So, I installed the Korean language pack :-) However, the archive contains only one file: kor.traineddata.
It doesn't include kor.unicharset, which causes a problem, because while loading kor.traineddata tesseract also depends on kor.unicharset.
That file is missing, and probably for that reason (at least in part) I couldn't create the box file.

I tried to find the file, but without success. What I'm going to do is create kor.unicharset myself. I'll look at eng.unicharset to get some idea of its structure.

And of course I'll change the training set according to Quan's and Sven's suggestions.

-- Oleg



zdenko podobny

Apr 29, 2011, 7:34:31 AM
to tesser...@googlegroups.com
2011/4/29 Oleg Tikhonov <olegti...@gmail.com>

> Zdenko, Quan and Sven,
> Thanks a lot for your suggestions; I think you nailed the problem.
> So, I installed the Korean language pack :-) However, the archive contains only one file: kor.traineddata.
> It doesn't include kor.unicharset, which causes a problem, because while loading kor.traineddata tesseract also depends on kor.unicharset.

Did you read the whole of [1] (right to the bottom)?

> That file is missing, and probably for that reason (at least in part) I couldn't create the box file.

kor.unicharset is there. I can create a box file without any problem (OK, I do not speak Korean, so maybe the output is wrong ;-) ):

tesseract annyong_eng.png annyong_eng -l kor batch.nochop makebox

See the attached result (training file from the internet: annyong_eng.png, created box file annyong_eng.box, and a screenshot from the box editor: screenshot.png).


> I tried to find the file, but without success. What I'm going to do is create kor.unicharset myself. I'll look at eng.unicharset to get some idea of its structure.


Please post the error message and details; that is the best way to communicate when you need help. kor.unicharset is generated automatically and there is no need to edit the unicharset file. This is written in [1]. Did you read it? You can save a lot of time by reading the documentation carefully ;-)

annyong_eng.png
annyong_eng.box
screenshot.png

Oleg Tikhonov

Apr 29, 2011, 8:09:46 AM
to tesser...@googlegroups.com
Zdenko,
Honestly, I did not read the whole page; I beg your pardon.

Here is the command and the error message:

$ tesseract.exe ../korean_training/annyong_eng.png ../korean_training/annyong_eng.png -l kor batch.nochop makebox

Unable to load unicharset file /usr/share/tessdata/kor.unicharset

Thanks,

--Oleg


zdenko podobny

Apr 29, 2011, 1:40:35 PM
to tesser...@googlegroups.com
Oleg,

Are you sure about the message? "tesseract.exe" indicates that you are using Windows... (I am not aware of any official Linux build system that creates 'tesseract.exe'.) But part of the error message ('/usr/share/tessdata/') indicates that you are in a Linux (or Unix-like) environment...

You wrote that you installed 'tesseract-ocr 3.0 on Windows 7'. But the error message indicates that you are using tesseract 2.0x. E.g. when I tried tesseract 2.04 (on Windows XP):
t204\tesseract.exe annyong_eng.png annyong_eng -l dummy
I got the message:
Unable to load unicharset file C:\Program Files\Tesseract-OCR\tessdata/dummy.unicharset

When I try tesseract 3.00:
tesseract.exe annyong_eng.png annyong_eng -l dummy
I get the message:
Error openning data file C:\Program Files\Tesseract-OCR\tessdata/dummy.traineddata

How did you install tesseract?
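The same diagnosis can be written down as a tiny Python check. This is only a sketch built on the two error messages quoted in this post; the function name guess_tesseract_generation is invented for the example, and other builds may word their errors differently.

```python
def guess_tesseract_generation(error_message):
    """Guess which tesseract generation produced a 'missing language' error.

    Returns "2.0x", "3.0x", or None when the message matches neither pattern.
    """
    text = error_message.strip()
    if "Unable to load unicharset file" in text:
        # 2.0x looks for separate per-language files such as <lang>.unicharset.
        return "2.0x"
    if "data file" in text and text.endswith(".traineddata"):
        # 3.0x looks for a single combined <lang>.traineddata file.
        return "3.0x"
    return None


if __name__ == "__main__":
    print(guess_tesseract_generation(
        "Unable to load unicharset file /usr/share/tessdata/kor.unicharset"))
```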

Zdenko

Quan Nguyen

Apr 29, 2011, 6:12:00 PM
to tesseract-ocr
Looks like you're running a Tesseract 2.0x version, which does not
support Oriental scripts. Download and install Tesseract 3.01 and try
training again.

Oleg Tikhonov

Apr 29, 2011, 2:02:22 PM
to tesser...@googlegroups.com
Interesting ....
I used Cygwin on Windows 7.

Generally, I installed Leptonica and its dependencies; after that I installed tesseract 3.0 from the archive file.

./runautoconfig
./configure
make
make install

I checked the config_auto.h ->
/* Version number */
#define PACKAGE_VERSION "3.00"

/* Official year for this release */
#define PACKAGE_YEAR "2010"

Anyway, I can delete the whole installation and reinstall, if that helps.




Oleg Tikhonov

Apr 29, 2011, 7:47:14 PM
to tesser...@googlegroups.com
Hi guys,

In the end, the problem was with Cygwin: it had somehow installed Tesseract 2.x (a couple of libraries) and linked them with Tesseract 3.0, and the mix-up probably kept it from working correctly. I uninstalled all the Cygwin packages, installed MS Visual Studio 2008 Express instead, checked out Tesseract 3.01 from SVN, built the solution, and voilà... it started working, and how!!!

Thank you all so much!!! You ROCK !!!







--
Best regards, Oleg.

clyde

Sep 22, 2013, 8:33:02 AM
to tesser...@googlegroups.com
Hello Oleg,

Could you please post the steps you followed to train Tesseract OCR with Korean text?
I am hoping for your response. Thank you in advance!


Oleg Tikhonov

Sep 22, 2013, 8:41:57 AM
to tesser...@googlegroups.com
Hi,

First of all, please read the tesseract wiki. It has quite a good explanation of how to do this.
If that's not enough, you can refer to http://misteroleg.wordpress.com/2012/12/19/ocr-using-tesseract-and-imagemagick-as-pre-processing-task/.

Hope it helps.




---
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

clyde

Sep 22, 2013, 3:35:27 PM
to tesser...@googlegroups.com
Hello again Oleg,

I am able to create korLang.traineddata, but when I try to use the traineddata file I created, with the
command: tesseract korLang.font1.exp1.tif output -l korLang

I get this error:
tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in file ..\..\classify\adaptmatch.cpp, line 555

please help me... thank you

zdenko podobny

Sep 22, 2013, 3:40:26 PM
to tesser...@googlegroups.com
That error means that inttemp was not found in your language file, i.e. your korLang.traineddata was not created correctly.
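One way to catch this before combining the files is to check that every component exists with the language prefix. The Python sketch below does that; the list of required names (unicharset, inttemp, pffmtable, normproto) is an assumption based on the Tesseract 3 training wiki and may differ between versions, and missing_components is a hypothetical helper, not part of tesseract.

```python
from pathlib import Path

# Assumed set of per-language files that get combined into <lang>.traineddata;
# adjust this tuple to match the training instructions for your version.
REQUIRED_COMPONENTS = ("unicharset", "inttemp", "pffmtable", "normproto")


def missing_components(directory, lang, required=REQUIRED_COMPONENTS):
    """Return the component names with no '<lang>.<name>' file in directory."""
    folder = Path(directory)
    return [name for name in required
            if not (folder / f"{lang}.{name}").is_file()]


if __name__ == "__main__":
    missing = missing_components(".", "korLang")
    if missing:
        print("Missing before combining:", ", ".join(missing))
    else:
        print("All expected components are present.")
```

An empty result only says the files exist under the expected names; it does not prove their contents are valid.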

Zdenko
 

clyde

Sep 22, 2013, 4:25:20 PM
to tesser...@googlegroups.com
Thank you, Oleg and Zdenko!

I was able to solve my problem. I just repeated the procedure in https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#font_properties_(new_in_3.01)
Thank you so much!!!

Oleg Tikhonov

Sep 22, 2013, 11:06:24 PM
to tesser...@googlegroups.com
You're welcome!!!

