Using CAPI to get char* from ocr


Anshul Maheshwari

Aug 11, 2015, 6:38:58 AM
to tesseract-ocr
Hello

I have been working on OCR of subtitles in ccextractor, which worked fine with Tesseract 3.02.
After moving to Tesseract 3.04, I found that the ProcessPage API no longer returns a char* string;
instead it uses a result renderer API to produce the output.

I am unable to get a char* from the result renderer. Please help me find the text processed by ProcessPage.

I don't want to write to a file and then read it back in my program.

Thanks
Anshul Maheshwari

Anshul Maheshwari

Aug 11, 2015, 10:18:34 AM
to tesseract-ocr
What is the difference between TessResultIterator and TessResultRenderer?

I can see some methods where TessResultIterator gives UTF-8 text as output,
but does it have any relation to ProcessPage?

Do I even need to use ProcessPage, or is there some other function I should look at?

Anshul Maheshwari

Aug 11, 2015, 10:18:35 AM
to tesseract-ocr
Hello

How do we get a string from an image in C++ with 3.04?
If someone tells me how to do it in C++, I will figure it out in C.


I tried ProcessPage with the text renderer set to stdout, and it showed the output fine on stdout.

I also tried TessBaseAPIGetUTF8Text, but it gives only the first line, while the renderer writes the complete output to stdout.






Anshul Maheshwari

Aug 11, 2015, 10:28:17 AM
to tesseract-ocr

At last I came up with the result iterator:

 
        //TessResultRenderer* result = TessTextRendererCreate("stdout");
        //tess_ret = TessBaseAPIProcessPage(ctx->api, pix, 0, NULL, NULL, 0, result);
        tess_ret = TessBaseAPIProcessPage(ctx->api, pix, 0, NULL, NULL, 0, NULL);
        if (tess_ret == FALSE)
                printf("\nsomething messy\n");

        TessResultIterator *iter = TessBaseAPIGetIterator(ctx->api);
        //text_out = TessBaseAPIGetUTF8Text(ctx->api);
        text_out = TessResultIteratorGetUTF8Text(iter, 0); /* level 0 is RIL_BLOCK */
        mprint("----> %s\n", text_out);

It gives even better results, but I don't know whether this is the right way, or how the author intended the result iterator to be used.



zdenko podobny

Aug 11, 2015, 11:29:55 AM
to tesser...@googlegroups.com

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/a102feee-9515-40ee-971f-d0f021e9a02a%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Anshul Maheshwari

Aug 11, 2015, 11:32:25 AM
to tesseract-ocr
OK, I got a link which explains the result iterator in C++,
so I have now written similar code in C.
        //TessResultRenderer* result = TessTextRendererCreate("stdout");
        //tess_ret = TessBaseAPIProcessPage(ctx->api, pix, 0, NULL, NULL, 0, result);
        tess_ret = TessBaseAPIProcessPage(ctx->api, pix, 0, NULL, NULL, 0, NULL);
        if (tess_ret == FALSE)
                printf("\nsomething messy\n");

        TessResultIterator *iter = TessBaseAPIGetIterator(ctx->api);
        //text_out = TessBaseAPIGetUTF8Text(ctx->api);

        if (iter != NULL)
        {
                do
                {
                        text_out = TessResultIteratorGetUTF8Text(iter, RIL_PARA);
                        if (text_out != NULL)
                        {
                                mprint("----> %s\n", text_out);
                                TessDeleteText(text_out); /* the caller owns the returned string */
                        }
                } while (TessResultIteratorNext(iter, RIL_PARA)); /* stop when no paragraph is left */
                TessResultIteratorDelete(iter);
        }


I feel there is a problem with the enums: they should have a prefix such as TESS_
so that they do not cause name clashes.
But the output still does not show all the lines of the input image.


zdenko podobny

Aug 11, 2015, 11:51:46 AM
to tesser...@googlegroups.com
Please provide an example image and a simple test case (code).

Zdenko


Anshul Maheshwari

Aug 12, 2015, 2:56:58 AM
to tesseract-ocr
#include "capi.h"        /* Tesseract C API */
#include <stdio.h>
#include <stdlib.h>
#include <allheaders.h>  /* Leptonica: PIX, pixRead, pixDestroy */

/* Print an error message and exit with a failure status. */
void die(const char *errstr)
{
        fputs(errstr, stderr);
        exit(1);
}

int main(int argc, char **argv)
{
        TessBaseAPI *handle;
        int ret = 0;
        PIX *img;
        char *text;

        if (argc < 2) {
                printf("usage: %s infilename\n", argv[0]);
                return 1; /* no input file given, nothing to do */
        }
        handle = TessBaseAPICreate();
        ret = TessBaseAPIInit3(handle, NULL, "eng");
        if (ret != 0)
                die("Error initializing Tesseract (TessBaseAPIInit3)\n");

        if ((img = pixRead(argv[1])) == NULL)
                die("Error reading image\n");

        TessBaseAPISetImage2(handle, img);
        if (TessBaseAPIRecognize(handle, NULL) != 0)
                die("Error in Tesseract recognition\n");

        /* Returns the recognized text as one UTF-8 string; free it with TessDeleteText. */
        if ((text = TessBaseAPIGetUTF8Text(handle)) == NULL)
                die("Error getting text\n");

        fputs(text, stdout);

        TessDeleteText(text);
        TessBaseAPIEnd(handle);
        TessBaseAPIDelete(handle);
        pixDestroy(&img);

        return 0;
}

I have attached 3 files which are not detected properly

sub0001.png

zdenko podobny

Aug 12, 2015, 3:16:49 AM
to tesser...@googlegroups.com
You wrote:
"but still output is showing not all lines in input image"

and
"I have attached 3 files which are not detected properly"

But you sent one PNG file with the word "We've"... How many lines do you expect in it ;-) ?


Zdenko


Anshul Maheshwari

Aug 12, 2015, 4:38:18 AM
to tesseract-ocr

I have not attached a multi-line PNG file because, with the small program I wrote following your suggestion in the previous mail,
I am able to get all the lines in one go.
My actual program, ccextractor, does not return all the lines, so I assume the problem is in my ccextractor code.

Now the only problem I have is that no text is detected when there are only a few characters.
Maybe I should start a new thread, if you suggest I do so.
With the attached PNG, no characters are detected.


-Anshul

zdenko podobny

Aug 12, 2015, 4:42:39 AM
to tesser...@googlegroups.com
I think new problem => new thread is the right way, so we do not mix issues...

Zdenko

Anshul Maheshwari

Aug 12, 2015, 4:51:58 AM
to tesseract-ocr

Hello,

I am closing this thread by summarizing the solution to my problem.
If you are using version 3.04 or later, the C API reference examples should not be read from code.google
but from [1] https://github.com/tesseract-ocr/tesseract/wiki/APIExample#result-iterator-example


Thanks zdenop
Anshul Maheshwari

