How to read French text using Tesseract?

Pixxe

Mar 10, 2016, 2:25:59 AM

to tesseract-ocr

Hi all,

I would like to use Tesseract to extract French text, and I hope this is possible with the existing Tesseract and the available French dictionary.

For English:

tesseract::TessBaseAPI *tess = new tesseract::TessBaseAPI();

if (tess->Init(NULL, "eng"))
{
 fprintf(stderr, "Could not initialize tesseract.\n");
 exit(1);
}
tess->SetVariable("tessedit_char_whitelist", "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789&`.,-%/():*'@#");

--- Changes I made:

tesseract::TessBaseAPI *tess = new tesseract::TessBaseAPI();

if (tess->Init(NULL, "fra"))
{
 fprintf(stderr, "Could not initialize tesseract.\n");
 exit(1);
}
tess->SetVariable("tessedit_char_whitelist", "abécédéeeffegéacheijikaelleemmeenneopéquerreessetéuvé double véixeigreczède ");

What is the best practice for using the French dictionary?

Thanks in advance.

Awaiting suggestions.

Tom Morris

Mar 10, 2016, 7:23:05 PM

to tesseract-ocr

On Thursday, March 10, 2016 at 2:25:59 AM UTC-5, Pixxe wrote:

I would like to use Tesseract to extract French text, and I hope this is possible with the existing Tesseract and the available French dictionary.

Yes, that should work.

--- Changes I made:

tesseract::TessBaseAPI *tess = new tesseract::TessBaseAPI();

if (tess->Init(NULL, "fra"))
{
 fprintf(stderr, "Could not initialize tesseract.\n");
 exit(1);
}
tess->SetVariable("tessedit_char_whitelist", "abécédéeeffegéacheijikaelleemmeenneopéquerreessetéuvé double véixeigreczède ");

Why are you using a character whitelist? I'd suggest using the default unless you're having problems.

If you do use it, it needs to be just a list of characters. The version above is some kind of weird phonetic spelled out alphabet (e.g. vé, double vé instead of vw).
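As a rough sketch of what "just a list of characters" could look like for French: the helper below is hypothetical (`FrenchWhitelist` is not part of Tesseract), and the set of accented letters is my own assumption for illustration, not an official list. It also assumes the source file is saved and compiled as UTF-8.

```cpp
#include <string>

// Hypothetical helper: builds a whitelist made of individual characters only,
// with no spelled-out letter names and no spaces.
// Assumption: this file is saved and compiled as UTF-8, so each accented
// letter is stored as its UTF-8 byte sequence.
std::string FrenchWhitelist() {
    const std::string ascii =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";
    // Punctuation taken from the English whitelist earlier in the thread.
    const std::string punctuation = "&.,-%/():*'@#";
    // Illustrative set of French accented letters plus the euro sign.
    const std::string accented = "àâçéèêëîïôùûüÿœæÀÂÇÉÈÊËÎÏÔÙÛÜŸŒÆ€";
    return ascii + punctuation + accented;
}
```

The result would then be passed as tess->SetVariable("tessedit_char_whitelist", FrenchWhitelist().c_str()). Again, leaving the whitelist unset is probably the better starting point.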

Tom

Pixxe

Mar 11, 2016, 2:40:29 AM

to tesseract-ocr

What is the preferred page segmentation mode for French?

I have attached a sample input image; please take a look.

At the Tesseract command prompt:

Command line: tesseract.exe "d:\SampleImage_French.tif" tOut.txt -l fra

Output for the attached input image:

PEPSI

Plusieurs variétés au choix
0,72 UL

Fux nomal5x1,5 L: 8,21€

But when I tried it in my Forms application:



 if (tess->Init(NULL, "fra"))
 {
 fprintf(stderr, "Could not initialize tesseract.\n");
 exit(1);
 }

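 string sOut;
 tess->SetImage((uchar*)TessBinaryMat.data, TessBinaryMat.size().width, TessBinaryMat.size().height, TessBinaryMat.channels(), TessBinaryMat.step1());
 // GetUTF8Text() returns a heap-allocated buffer that the caller must free with delete[].
 char* text = tess->GetUTF8Text();
 if (text) { sOut = text; delete[] text; }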

Please check the attached OutputInForms.tif file for the output.

Suggestions, please.

Why does the command prompt give a good result, but not my Forms application when I integrate the API?

Thanks

----------------------

SampleImage_French.tif

OutputInForms.tif

Tom Morris

Mar 11, 2016, 11:25:15 AM

to tesser...@googlegroups.com

On Fri, Mar 11, 2016 at 2:40 AM, Pixxe <sakthip...@gmail.com> wrote:

What is the preferred page segmentation mode for French?

This probably won't vary between English and French. It's more dependent on the layout of the page.

As far as differing results between the command line and the API, they have different default page segmentation modes due to historical reasons. You should choose the mode that's best for your application through experimentation or by reading the descriptions and then set it explicitly so that you're not depending on the defaults.
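A minimal sketch of setting the mode explicitly. As far as I know, the command-line tool defaults to fully automatic segmentation (`PSM_AUTO`) while `TessBaseAPI` defaults to `PSM_SINGLE_BLOCK`, but check the documentation for your version. The helper name `InitWithExplicitMode` is hypothetical, and it is written as a template only so that the snippet compiles without the Tesseract headers; it relies on nothing beyond `Init` and `SetPageSegMode`, both of which `tesseract::TessBaseAPI` provides.

```cpp
// Hypothetical helper, not part of Tesseract.
// Api is expected to behave like tesseract::TessBaseAPI, and Mode like
// tesseract::PageSegMode (for example tesseract::PSM_AUTO).
template <typename Api, typename Mode>
bool InitWithExplicitMode(Api& api, const char* language, Mode mode) {
    // Init returns 0 on success and non-zero on failure.
    if (api.Init(nullptr, language) != 0) {
        return false;
    }
    // Set the mode after Init so the result does not depend on the default.
    api.SetPageSegMode(mode);
    return true;
}
```

With the real library this would be called as InitWithExplicitMode(*tess, "fra", tesseract::PSM_AUTO), then SetImage and GetUTF8Text as before. PSM_AUTO is only a starting point to mirror the command line; a different mode may suit your layout better.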

Tom

vugranam gowtham

Mar 15, 2016, 10:15:09 AM

to tesseract-ocr

I found the solution! here
