Cube training tools


Emil Julius

Dec 5, 2014, 3:03:02 AM
to tesser...@googlegroups.com
Hey, I'm currently planning on writing some training tools for the Cube engine, but I would like to be sure that I'm not reinventing the wheel. The only documentation I was able to find was: https://code.google.com/p/tesseract-ocr-extradocs/wiki/Cube
I believe it was written by someone in this Google group?
I'm currently prioritizing tools for:
* cube.size (one of the 2 bigram files)
* cube.bigrams

The tool for cube.bigrams will take a plain-text input file, calculate the bigrams and their frequencies, and write them out in the corresponding file format.
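
A minimal sketch of the counting step described above, assuming the bigrams are adjacent character pairs inside each word (the post does not say whether character or word bigrams are meant). The function names `count_bigrams` and `format_bigrams` are hypothetical, and the tab-separated output is only a placeholder, not the actual cube.bigrams file format, which is not given in this thread.

```python
from collections import Counter


def count_bigrams(text):
    """Count adjacent character pairs within each whitespace-separated word."""
    counts = Counter()
    for word in text.split():
        # zip(word, word[1:]) pairs every character with its successor.
        for first, second in zip(word, word[1:]):
            counts[first + second] += 1
    return counts


def format_bigrams(counts):
    """Render counts as 'bigram<TAB>count' lines, most frequent first.

    Assumption: this layout is a stand-in, not the real cube.bigrams format.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "\n".join(f"{bigram}\t{count}" for bigram, count in ordered)


if __name__ == "__main__":
    sample = "the cat sat on the mat"
    print(format_bigrams(count_bigrams(sample)))
```

Keeping the counting separate from the formatting means only `format_bigrams` would need to change once the real file layout is known.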

I'm still trying to figure out a smart way to train the cube.size files. Help is very welcome ;-)

Also, what's the current state of the Tesseract project in general?

Sincerely

ShreeDevi Kumar

Dec 5, 2014, 7:34:35 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com, Ray Smith

specifically, message from Ray Smith dated 7/15/13

"Cube is a perfect example. It doesn't do much useful, yet now everybody wants it documented, so there is no way I can commit another half-baked experiment that isn't production-ready that everybody will want documented. I have 3 new classifiers in addition to cube that haven't delivered on their early promise. It really is hard to beat the current classifier, although I am starting to understand why a little better.
The good news is that I really really want to get the Google version of the code cleaned up and synced with the outside world this quarter, as there are some improvements in there worth having."
specifically, message from Ray Smith dated Oct 30, 2014
regarding plans for 3.04 release


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/9fe19b81-527b-4aa0-8959-17526dfafee7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Emil Julius

Dec 8, 2014, 12:02:39 PM
to tesser...@googlegroups.com, tesser...@googlegroups.com, thera...@gmail.com
Thank you :-)

Merlin ArulPrakash

May 24, 2017, 6:39:53 AM
to tesseract-ocr
Hi,

Is there any tool for training Cube data for Tesseract? I need trained data for the combined engine mode (TesseractAndCube) for all the languages in tessdata. If anyone already has Cube data files, kindly share them with me, or share the tool or procedure for producing Cube trained data for languages other than English.


Thanks in Advance,
Merlin

ShreeDevi Kumar

May 24, 2017, 8:55:00 AM
to tesser...@googlegroups.com
Cube training is not supported, and no information is available for it. Cube has been deleted from the latest code.

ShreeDevi


Zdenko Podobný

May 24, 2017, 9:19:04 AM
to tesser...@googlegroups.com
Cube data were available for only a few languages. The available data can be found at https://github.com/tesseract-ocr/tessdata/tree/3.04.00

Zdenko
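
To see which languages in a local copy of that tessdata directory ship Cube files, something like the following sketch could help. It assumes the Cube components are stored as `<lang>.cube.<component>` (the thread only names cube.size and cube.bigrams, so the full naming pattern is an assumption), and `cube_files_by_language` is a hypothetical helper, not part of Tesseract.

```python
from collections import defaultdict
from pathlib import Path


def cube_files_by_language(tessdata_dir):
    """Map each language code to the sorted cube component names found for it."""
    found = defaultdict(list)
    for path in Path(tessdata_dir).iterdir():
        # Assumed naming: "<lang>.cube.<component>", e.g. "eng.cube.bigrams".
        lang, sep, component = path.name.partition(".cube.")
        if path.is_file() and sep and lang and component:
            found[lang].append(component)
    return {lang: sorted(parts) for lang, parts in sorted(found.items())}
```

Calling `cube_files_by_language("tessdata")` on a checkout of the 3.04.00 tree would list, per language, whichever cube components are present; languages missing from the result have no files matching the assumed pattern.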

Merlin ArulPrakash

May 26, 2017, 5:51:41 AM
to tesseract-ocr
The extraction is more accurate only when the cube data is used along with it. Why has it been deleted?



Merlin ArulPrakash

May 26, 2017, 5:55:45 AM
to tesseract-ocr

Hi Zdenko,

Thanks for the info, but I have already taken those tessdata files. I am only asking for help with training cube data for the other languages, which don't have those cube-related files.

Just give us the steps followed to train the languages eng, hin, ita, etc., in the present tessdata repo.

Thanks and Regards,
Merlin


ShreeDevi Kumar

May 26, 2017, 6:01:43 AM
to tesser...@googlegroups.com
"Just give us the step followed to train the language eng, hin, ita, etc., in the present tessdata repo."

As stated before, this information is not available. The training was done at Google and details were not shared since it was to be superseded by the new LSTM engine.

The answer is not going to change if you keep asking :-)



ShreeDevi
