Training Tesseract with the C++ API.

MARTIN Pierre

Aug 22, 2011, 3:19:04 PM
to tesser...@googlegroups.com
Hello Group :)

I have a question for you all: is the C++ wrapper sufficient to allow training from user code, or does the training process have to take place on the command line?
If it's possible to use the API to do so, which functions / methods would I need?

I realize that when Tesseract is called in training mode, it is a configuration file containing just variables that tells it to train... So it may be a matter of giving Tesseract these variables directly via the API, driven by a hand-crafted GUI?

Thanks for your help,
Pierre.

MARTIN Pierre

Aug 25, 2011, 10:00:49 AM
to tesser...@googlegroups.com
Answering my own question:

I have successfully completed the boxing part of the training with the following:

[... Do tesseract init, bootstrap on another training file, etc ...]
_tessApi->SetVariable ("chop_enable", "n");
_tessApi->SetVariable ("wordrec_enable_assoc", "n");
_tessApi->SetVariable ("tessedit_create_boxfile", "y");
_tessApi->SetImage (imgOcr.bits(), imgOcr.width(), imgOcr.height(), 1, imgOcr.bytesPerLine());
char *text = _tessApi->GetBoxText(0);
[... Do the boxing processing, this is handy because we don't need to write that on a file for now...]
delete[] text; // GetBoxText() returns a new[]-allocated buffer that the caller must free.

The result is the contents of a box file, which I can process directly in my app without any file I/O.
Now I'm wondering... The next step will be to do the training after the charset generation... But can Tesseract be trained without a box file, using instead some kind of in-memory binary format that it would otherwise have built by reading the box file? I want to avoid leaving files on the user's hard drive, so passing the boxes directly to the API would be very nice. I'm going to dig a bit into the sources.
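
To illustrate the in-memory processing mentioned above, here is a minimal sketch that parses the text returned by GetBoxText into structs. It assumes the usual box file layout of one glyph per line, "<symbol> <left> <bottom> <right> <top> <page>", with coordinates measured from the bottom-left corner of the image. The names BoxEntry and ParseBoxText are made up for this example and are not part of the Tesseract API; treating the page column as optional is also an assumption.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One line of box text: "<symbol> <left> <bottom> <right> <top> <page>".
// Hypothetical helper type, not part of Tesseract.
struct BoxEntry {
    std::string symbol;  // UTF-8 text of the glyph
    int left = 0;
    int bottom = 0;
    int right = 0;
    int top = 0;
    int page = 0;
};

// Parses box text held in memory. Lines that do not start with a symbol
// followed by four integers are skipped.
std::vector<BoxEntry> ParseBoxText(const std::string& boxText) {
    std::vector<BoxEntry> boxes;
    std::istringstream lines(boxText);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        BoxEntry box;
        if (!(fields >> box.symbol >> box.left >> box.bottom >> box.right >> box.top)) {
            continue;  // malformed or empty line
        }
        if (!(fields >> box.page)) {
            box.page = 0;  // assumption: a missing page column means page 0
        }
        boxes.push_back(box);
    }
    return boxes;
}
```

With the snippet from this post, the call would look like ParseBoxText(text) just before the buffer is freed.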

I'll keep you posted on whether or not I find what I'm looking for.

Thanks,
Pierre.

MARTIN Pierre

Aug 25, 2011, 10:44:42 AM
to tesser...@googlegroups.com
OK, I'm digging into the sources for the next step of using the API to train Tesseract...

So far, I am able to generate the boxes and get the result as a char* without writing a box file. Now I'm trying to run the training itself, but I can't find a way to do that without external files (the box file and the output file)... Is there a way to do that? Here is what I have for now:

_tessApi = new tesseract::TessBaseAPI();
_tessApi->Init ("./", "eng");
// Initialize variables.
_tessApi->SetInputName ("Test.box");
_tessApi->SetOutputName ("Test");
_tessApi->SetVariable ("file_type", ".bl");
_tessApi->SetVariable ("tessedit_single_match", "0");
_tessApi->SetVariable ("tessedit_zero_rejection", "T");
_tessApi->SetVariable ("tessedit_minimal_rejection", "F");
_tessApi->SetVariable ("tessedit_write_rep_codes", "F");
_tessApi->SetVariable ("tessedit_resegment_from_boxes", "T");
_tessApi->SetVariable ("tessedit_train_from_boxes", "T");
_tessApi->SetVariable ("textord_fast_pitch_test", "T");
_tessApi->SetVariable ("textord_no_rejects", "T");
_tessApi->SetVariable ("edges_children_fix", "F");
_tessApi->SetVariable ("edges_childarea", "0.65");
_tessApi->SetVariable ("edges_boxarea", "0.9");
_tessApi->SetVariable ("il1_adaption_test", "1");
_tessApi->SetPageSegMode (tesseract::PSM_AUTO_OSD);
// Prepare picture.
_tessApi->SetImage (imgOcr.bits(), imgOcr.width(), imgOcr.height(), 1, imgOcr.bytesPerLine());
_tessApi->Recognize (0);
_tessApi->End ();
delete _tessApi;

I see that internally, when doing that, Tesseract goes through the ApplyBoxTraining routine... However, this routine takes a filename as one of its arguments and opens the file itself. It would be very tempting to re-use the code from ApplyBoxes itself, but I feel that if the source code changes in future versions, I'll have to start over again...
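
If the file-based routine cannot be bypassed, one possible workaround (an assumption on my side, not something the Tesseract API offers) is to write the in-memory box text to a short-lived file and have it removed automatically once training is done. The ScopedBoxFile class below is a hypothetical helper sketched for that purpose; it uses only the C++ standard library.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Writes box text to `path` on construction and deletes the file on
// destruction. Hypothetical helper, not part of Tesseract.
class ScopedBoxFile {
public:
    ScopedBoxFile(const std::string& path, const std::string& boxText)
        : path_(path) {
        std::ofstream out(path_.c_str(), std::ios::binary);
        out << boxText;
        out.close();        // flush to disk before checking for errors
        ok_ = !out.fail();  // false if the file could not be opened or written
    }

    ~ScopedBoxFile() { std::remove(path_.c_str()); }

    // Non-copyable: exactly one object owns (and removes) the file.
    ScopedBoxFile(const ScopedBoxFile&) = delete;
    ScopedBoxFile& operator=(const ScopedBoxFile&) = delete;

    bool ok() const { return ok_; }
    const std::string& path() const { return path_; }

private:
    std::string path_;
    bool ok_ = false;
};
```

The object would be created just before the training pass, with its path() used in place of the hard-coded "Test.box", and it would go out of scope right after End(). This does not avoid the disk entirely, but nothing is left behind on the user's drive.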

Also, how can I get the result of the training as binary data instead of specifying an output file (UNIX philosophy, anyone)?
I'm not sure I will be able to do what I want, which is to avoid files altogether. Especially when it comes to mftraining and friends...

Thanks,
Pierre.

Dovhani Foneworx

Sep 3, 2014, 4:05:09 AM
to tesser...@googlegroups.com, hick...@gmail.com

Have you successfully trained Tesseract from a C++ application?