About the jpn.traineddata


Mostafa

May 17, 2011, 4:58:31 AM
to tesseract-ocr
Hi,

I am interested in getting all the tif files that were used for creating
the jpn.traineddata.
I just want to see how many characters are supported in that file,
because I have some other Japanese characters that can't be recognized
by the tesseract OCR.

Does anybody know where those tif files are?

Thanks

Dmitri Silaev

May 17, 2011, 9:24:21 AM
to tesser...@googlegroups.com
I think copyright issues are preventing the dev team from publishing
these source files. However, you can try to contact this forum's
moderator directly - he can probably make the decision to share them.

--
Dmitri

> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesser...@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-oc...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>

Илья

May 17, 2011, 11:01:09 AM
to tesser...@googlegroups.com
IMHO alphabets can't be protected by copyright.

--
Best regards,
Ilia.



zdenko podobny

May 17, 2011, 12:24:26 PM
to tesser...@googlegroups.com
On Tue, May 17, 2011 at 5:01 PM, Илья <ili...@mail.ru> wrote:
> IMHO alphabets can't be protected by copyright.

Mostafa did not ask for an alphabet. He asked for 'all the tif files that used for creating...', and the content of a tiff file (e.g. scanned books) could be protected by copyright.

Илья

May 17, 2011, 1:43:09 PM
to tesser...@googlegroups.com
He needs a table that contains all supported alphabetic characters.
Also, parts of scanned books could not be protected by copyright.

Can you give any contacts for the "jpn.traineddata" dev team?


--
Best regards,
Ilia.


Mostafa

May 19, 2011, 6:21:06 AM
to tesseract-ocr
Hi again,

It seems nobody knows where it is hiding.
Should I contact a CIA agent? lol
But I am kinda serious about the data.

Mostafa

Dmitri Silaev

May 19, 2011, 8:51:50 AM
to tesser...@googlegroups.com, ahsan....@gmail.com
Did you contact Ray Smith, this forum's owner?

Warm regards,
Dmitri Silaev
www.CustomOCR.com


zdenko podobny

May 19, 2011, 8:56:19 AM
to tesser...@googlegroups.com

2011/5/19 Mostafa <ahsan....@gmail.com>

> Hi again,
>
> It seems nobody knows where it is hiding.
> Should I contact a CIA agent? lol

If somebody is really interested, she/he can find the answer ;-). Within 1 minute ;-) ([1] [2] [3]). BTW: there is a Developers forum.

> But I am kinda serious about the data.

There were several requests for the training data (in the forum, in issues). I made one too. There was no official reply to such requests. AFAIK Google is not obliged to release the data, so I guess they have a reason for not providing it.

On the other hand, this could be an opportunity for the tesseract community :-): to create an alternative training set. As Ray mentioned ([3]), they use a "more automated training process based on rendering text from fonts", so training based on "real world" scanned documents could be interesting (but more difficult).


Zdenko
 

Dmitri Silaev

May 19, 2011, 9:19:42 AM
to tesser...@googlegroups.com
Mostafa should try to contact Ray directly, seriously.
Things may have changed over time

--
Dmitri


Mostafa

May 22, 2011, 9:22:20 PM
to tesseract-ocr
Thank you, Dmitri.
I emailed him today; I think you already got the CC.

Mostafa

On May 19, 10:19 pm, Dmitri Silaev <daemons2...@gmail.com> wrote:
> Mostafa should try to contact Ray directly, seriously.
> Things may have changed over time
>
> 2011/5/19 zdenko podobny <zde...@gmail.com>:
>
> > [1] http://code.google.com/p/tesseract-ocr/people/list
> > [2] http://code.google.com/p/tesseract-ocr/source/list
> > [3] http://groups.google.com/group/tesseract-dev/msg/1cdf3ebe8743d935

Ahsan Mostafa

May 22, 2011, 9:20:31 PM
to thera...@gmail.com, tesser...@googlegroups.com, Dmitri Silaev
Dear Mr. Smith,

I hope you are having a lovely day.
I posted a question about the jpn.traineddata in the thread below:

http://groups.google.com/group/tesseract-ocr/browse_thread/thread/54c96de802d6911a/4924f545668cbaac?lnk=gst&q=jpn#4924f545668cbaac

I have put the contents here for your convenience:


>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

I am interested in getting all the tif files that were used for creating
the jpn.traineddata.
I just want to see how many characters are supported in that file,
because I have some other Japanese characters that can't be recognized
by the tesseract OCR.

Does anybody know where those tif files are?

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Could you please help me with this?

Regards,
Mostafa


Barrie Treloar

Jul 17, 2014, 10:09:50 AM
to tesser...@googlegroups.com, thera...@gmail.com, daemo...@gmail.com


Was there any resolution to this question?

Tesseract does a nice job, but it is failing on a really crisp and simple shi (し).
I'm currently looking at the config options to see if I can improve that, but if I had access to the original training set I could look for that character and see how it goes.

I'm not really keen on attempting to retrain tesseract on my image set - but I will if that is the only option available to me.
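
For anyone who only needs to know which characters a traineddata file covers, the original tif files may not be required. A rough sketch follows, under assumptions not confirmed anywhere in this thread: that the unicharset component has been unpacked from jpn.traineddata first (for example with something like `combine_tessdata -u jpn.traineddata jpn.`), and that the unicharset is a text file whose first line is the entry count and whose remaining lines each start with one character followed by property fields. The helper names and the sample data below are hypothetical, and the property fields in the sample are placeholders.

```python
# Sketch only: list the characters in a Tesseract unicharset text file.
# Assumed layout (not confirmed in this thread): first line is the entry
# count, then one entry per line with the character as the first field.

def parse_unicharset(text):
    """Return (declared_count, characters) parsed from unicharset text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    declared = int(lines[0])
    # Everything after the first space is property data, ignored here.
    chars = [ln.split(" ", 1)[0] for ln in lines[1:]]
    return declared, chars


def is_supported(chars, ch):
    """True if the character appears as an entry in the unicharset."""
    return ch in set(chars)


# Made-up sample; the property fields are placeholders, not real data.
SAMPLE = """4
NULL 0 NULL 0
あ 1 0,255,0,255 Hiragana 1
し 1 0,255,0,255 Hiragana 2
漢 1 0,255,0,255 Han 3
"""

declared, chars = parse_unicharset(SAMPLE)
print(declared, len(chars), is_supported(chars, "し"))
```

With a real file, the text would be read with `open(path, encoding="utf-8").read()` instead of `SAMPLE`. If a character such as し is listed but still misrecognized, the gap is probably in the training samples or the image rather than in the character set.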
 