How to read French text using Tesseract?

Pixxe

Mar 10, 2016, 2:25:59 AM
to tesseract-ocr
Hi all,

I would like to use Tesseract to extract French text, and I hope this is possible with the existing Tesseract and the available French dictionary.

For English:

tesseract::TessBaseAPI *tess = new tesseract::TessBaseAPI();

if (tess->Init(NULL, "eng"))
{
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}
tess->SetVariable("tessedit_char_whitelist", "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789&`.,-%/():*'@#");


--- Changes I made for French:

tesseract::TessBaseAPI *tess = new tesseract::TessBaseAPI();

if (tess->Init(NULL, "fra"))
{
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}
tess->SetVariable("tessedit_char_whitelist", "abécédéeeffegéacheijikaelleemmeenneopéquerreessetéuvé double véixeigreczède ");


What is the best practice for using the French dictionary?

Thanks in advance.

Awaiting your suggestions.




Tom Morris

Mar 10, 2016, 7:23:05 PM
to tesseract-ocr
On Thursday, March 10, 2016 at 2:25:59 AM UTC-5, Pixxe wrote:

I would like to use Tesseract to extract French text, and I hope this is possible with the existing Tesseract and the available French dictionary.

Yes, that should work.

--- Changes I made for French:

tesseract::TessBaseAPI *tess = new tesseract::TessBaseAPI();

if (tess->Init(NULL, "fra"))
{
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}
tess->SetVariable("tessedit_char_whitelist", "abécédéeeffegéacheijikaelleemmeenneopéquerreessetéuvé double véixeigreczède ");

Why are you using a character whitelist? I'd suggest using the default unless you're having problems.

If you do use it, it needs to be just a list of the characters themselves. The version above is some kind of phonetically spelled-out alphabet (e.g. "vé" and "double vé" instead of "v" and "w").

Tom 
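
For readers who do end up needing a whitelist, here is a minimal sketch of what "just a list of characters" could look like for French. The helper name BuildFrenchWhitelist is hypothetical (it is not part of Tesseract), the punctuation set is only an illustrative choice, and the source file is assumed to be saved as UTF-8 so that the accented letters reach Tesseract as UTF-8.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Tesseract function): builds a whitelist that is
// simply the allowed characters concatenated, with no separators and no
// spelled-out letter names. Assumes this source file is saved as UTF-8.
std::string BuildFrenchWhitelist() {
    const std::string letters =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    // Accented letters and ligatures used in French.
    const std::string accented =
        "àâæçéèêëîïôœùûüÿ"
        "ÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ";
    const std::string digits = "0123456789";
    // Illustrative punctuation set; adjust to the documents being read.
    const std::string punctuation = "&.,-%/():*'@#€";

    const std::string whitelist = letters + accented + digits + punctuation;
    // Sanity check: a space would itself become a whitelisted character.
    assert(whitelist.find(' ') == std::string::npos);
    return whitelist;
}
```

It would be passed in the same way as in the original post, e.g. tess->SetVariable("tessedit_char_whitelist", BuildFrenchWhitelist().c_str()); as Tom notes, leaving the whitelist unset is the simpler starting point.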
 

Pixxe

Mar 11, 2016, 2:40:29 AM
to tesseract-ocr
What is the preferred page segmentation mode for French?

I have attached a sample input image; please take a look.

From the command prompt:

Command line: tesseract.exe "d:\SampleImage_French.tif" tOut.txt -l fra

Output for the attached input image:

PEPSI


Plusieurs variétés au choix
0,72 UL


Fux nomal5x1,5 L: 8,21€



But when I tried it in Forms:

if (tess->Init(NULL, "fra"))
{
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}

string sOut;
tess->SetImage((uchar*)TessBinaryMat.data, TessBinaryMat.size().width, TessBinaryMat.size().height, TessBinaryMat.channels(), TessBinaryMat.step1());
char *text = tess->GetUTF8Text();  // the caller owns the returned buffer
if (text != NULL)
{
    sOut = text;
    delete[] text;  // release it to avoid a leak
}


Please check the attached OutputInForms.tif file for the output.


Any suggestions, please?

Why does the command prompt give a good result, but not Forms when I integrate the API?

Thanks

----------------------
SampleImage_French.tif
OutputInForms.tif

Tom Morris

Mar 11, 2016, 11:25:15 AM
to tesser...@googlegroups.com
On Fri, Mar 11, 2016 at 2:40 AM, Pixxe <sakthip...@gmail.com> wrote:
What is the preferred page segmentation mode for French?

This probably won't vary between English and French. It's more dependent on the layout of the page.

As for the differing results between the command line and the API: they have different default page segmentation modes for historical reasons. You should choose the mode that's best for your application, either through experimentation or by reading the mode descriptions, and then set it explicitly so that you're not depending on the defaults.

Tom
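
Setting the mode explicitly, as suggested above, could look like the sketch below. The helper name UseExplicitPageSegMode is hypothetical; SetPageSegMode and GetPageSegMode are TessBaseAPI methods. Which mode suits the sample image is not established in this thread, so PSM_AUTO in the usage note is only an assumed starting point for experimentation.

```cpp
#include <cassert>
#include <tesseract/baseapi.h>

// Hypothetical helper: pins the page segmentation mode so the result does not
// depend on whichever default the API or the command-line tool happens to use.
void UseExplicitPageSegMode(tesseract::TessBaseAPI &api,
                            tesseract::PageSegMode mode) {
    api.SetPageSegMode(mode);
    // Read the setting back to confirm the mode was actually applied.
    assert(api.GetPageSegMode() == mode);
}
```

In the code from the earlier post this would be called once tess->Init(NULL, "fra") has succeeded and before SetImage, e.g. UseExplicitPageSegMode(*tess, tesseract::PSM_AUTO);. To my knowledge the command-line tool defaults to automatic page segmentation (PSM_AUTO) while the API defaults to PSM_SINGLE_BLOCK, which would be consistent with the difference described above, but that is worth checking against the installed version.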

vugranam gowtham

Mar 15, 2016, 10:15:09 AM
to tesseract-ocr
I found the solution! here 
 