Generating a PDF with the Tesseract C++ API (version 4.1)


Ivica Anic

Oct 25, 2019, 10:35:14 AM
to tesseract-ocr
Hi,
I am testing the Tesseract C++ API (version 4.1).
Here is my code:

      
char *datapath = "C:\\Temp\\tessdata-master";
string language_ = "deu";
string inputFile_ = "./input.png";
tesseract::TessBaseAPI *api100 = new tesseract::TessBaseAPI();
if (api100->Init(datapath, "deu", tesseract::OEM_LSTM_ONLY)) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}

api100->SetVariable("tessedit_create_pdf", "T");
// png File is input file
PIX *sourceImg100 = pixRead(inputImage.c_str());

api100->SetImage(sourceImg100);

api100->Recognize(0);

api100->SetPageSegMode(tesseract::PSM_AUTO_ONLY);
api100->SetInputName(inputImage.c_str());
tesseract::TessResultRenderer *renderer100 = new tesseract::TessPDFRenderer("output_base", api100->GetDatapath(), false);

renderer100->BeginDocument("test");
renderer100->AddImage(api100);
api100->ProcessPage(sourceImg100, 0, inputImage.c_str(), NULL, 5000, renderer100);
renderer100->EndDocument();
api100->End();
pixDestroy(&sourceImg100);
    
How can I get a searchable PDF file as output and save it on my computer?
I mean exactly like the command line: tesseract test.tif output pdf

Zdenko:
In my test an output PDF file is created, but it is not readable.
When I try to open it, I get an error saying that the XREF data in the PDF file are missing.

Thanks a lot

Zdenko Podobny

Oct 25, 2019, 3:51:32 PM
to tesser...@googlegroups.com
Try something like this:

#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>

int main() {
    const char* datapath = "tessdata";
    std::string language_ = "deu";
    std::string inputFile_ = "input.png";
    const char* outputbase = "output";  // the renderer appends ".pdf"

    tesseract::TessBaseAPI *api100 = new tesseract::TessBaseAPI();
    if (api100->Init(datapath, language_.c_str(), tesseract::OEM_LSTM_ONLY)) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        delete api100;
        return EXIT_FAILURE;
    }

    // Check up front that Leptonica can read the input file.
    PIX *sourceImg100 = pixRead(inputFile_.c_str());
    if (!sourceImg100) {
        fprintf(stderr, "Leptonica can't process input file: %s\n", inputFile_.c_str());
        return EXIT_FAILURE;
    }
    api100->SetImage(sourceImg100);
    api100->SetInputName(inputFile_.c_str());
    api100->SetOutputName(outputbase);

    tesseract::TessPDFRenderer* renderer =
        new tesseract::TessPDFRenderer(outputbase, api100->GetDatapath());
    if (!renderer->happy()) {
        fprintf(stderr, "Error, could not create PDF output file: %s\n",
                strerror(errno));
        delete renderer;
        return EXIT_FAILURE;  // do not continue with a deleted renderer
    }

    // ProcessPages calls BeginDocument/EndDocument on the renderer itself.
    bool succeed = api100->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);

    // Destroy the renderer before exiting so the output file is closed.
    delete renderer;
    api100->End();
    delete api100;
    pixDestroy(&sourceImg100);

    if (!succeed) {
        fprintf(stderr, "Error during processing.\n");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}



Zdenko


On Fri, Oct 25, 2019 at 16:35, Ivica Anic <delfa...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fdf57624-93b1-40e6-9b24-c51cbf74a483%40googlegroups.com.

Ivica Anic

Oct 25, 2019, 11:40:04 PM
to tesseract-ocr
Zdenko,
when I try your sample, I get the following error:
The document cannot be opened.
An error occurred while opening the document from the file:
C:\Users\ocr\output.pdf.
Error [PXCLib]: Required value not found.

=====================================================
When I add these two lines to your sample
api100->SetVariable("tessedit_create_pdf", "T");
api100->SetPageSegMode(tesseract::PSM_AUTO_ONLY);
I get an error when trying to open the PDF output file:
The following problems were found in the document:
- One or more XREF data streams were not found (XREF data are missing)

Zdenko Podobny

Oct 26, 2019, 7:10:45 AM
to tesser...@googlegroups.com
Why do you think the problem is in Tesseract?

output.pdf opens without problems in Acrobat Reader, Chrome/Chromium, and SumatraPDF.

You should understand what you are doing. E.g., setting the variable tessedit_create_pdf is useless.
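
One way to narrow down an error like the missing XREF data is to check whether the file on disk was written to the end. A finished PDF starts with %PDF- and ends with a trailer that contains the startxref keyword followed by %%EOF; a file whose renderer was never finished or closed may lack that trailer. The following is a minimal sketch of such a check, not part of the Tesseract API. The function names looksLikeCompletePdf and fileLooksLikeCompletePdf are made up for illustration, and the check is a heuristic rather than a PDF parser.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns true if the bytes look like a finished PDF,
// i.e. a "%PDF-" header and, near the end, "startxref" followed by "%%EOF".
bool looksLikeCompletePdf(const std::string& bytes) {
    if (bytes.compare(0, 5, "%PDF-") != 0) return false;
    // The trailer sits at the end of the file; inspect the last 1024 bytes.
    const std::size_t tailStart = bytes.size() > 1024 ? bytes.size() - 1024 : 0;
    const std::string tail = bytes.substr(tailStart);
    const std::size_t xref = tail.rfind("startxref");
    const std::size_t eof = tail.rfind("%%EOF");
    return xref != std::string::npos && eof != std::string::npos && xref < eof;
}

// Hypothetical helper: reads a file in binary mode and applies the check.
// Returns false if the file cannot be opened.
bool fileLooksLikeCompletePdf(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return looksLikeCompletePdf(bytes);
}
```

Calling fileLooksLikeCompletePdf("output.pdf") after the program has exited would show whether the trailer reached the disk; a false result points at the writing side rather than at the PDF viewer.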

Zdenko


On Sat, Oct 26, 2019 at 5:40, Ivica Anic <delfa...@gmail.com> wrote:

Ivica Anic

Oct 26, 2019, 11:13:47 AM
to tesseract-ocr
Zdenko,

When I try to open the PDF file, I still get an error message. Do you have a Visual Studio C++ solution, a small working Tesseract example project, that I can download?


On Friday, October 25, 2019 at 16:35:14 UTC+2, Ivica Anic wrote:

Zdenko Podobny

Oct 26, 2019, 11:33:44 AM
to tesser...@googlegroups.com
You do not need a VS solution; it just complicates the whole process. To test a minimal solution, save the code I posted above as, e.g., test_pdf.cpp. Then run the following on the command line (adjust to your VS version and installation paths):
 "c:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat" x64
cl test_pdf.cpp -Id:\include_path_to_tesseract_and_leptonica\include tesseract41.lib leptonica-1.79.0.lib -D_CRT_SECURE_NO_WARNINGS
If your paths are correct, you can then just run test_pdf.exe.


Zdenko


On Sat, Oct 26, 2019 at 17:13, Ivica Anic <delfa...@gmail.com> wrote:

Ivica Anic

Oct 26, 2019, 1:02:54 PM
to tesseract-ocr
Zdenko:
Can you please tell me the exact URL where I can download the Tesseract and Leptonica libraries (.lib and .dll files) and tessdata?


On Friday, October 25, 2019 at 16:35:14 UTC+2, Ivica Anic wrote:

Zdenko Podobny

Oct 26, 2019, 1:55:33 PM
to tesser...@googlegroups.com
Build it yourself; read the Tesseract wiki about the available options.

Zdenko


On Sat, Oct 26, 2019 at 19:02, Ivica Anic <delfa...@gmail.com> wrote:

Ivica Anic

Oct 27, 2019, 2:07:52 AM
to tesseract-ocr
Zdenko,
Thank you very much for your support. I can now run your sample successfully with VS; my solution was to run: vcpkg integrate install

On Friday, October 25, 2019 at 16:35:14 UTC+2, Ivica Anic wrote:

Ivica Anic

Oct 27, 2019, 11:44:41 PM
to tesseract-ocr
Zdenko:
I have the following use case for the Tesseract C++ API 4.1.
I would like to read a multi-page, non-searchable PDF file as input into a PIX or PIXA and create a searchable PDF file as output.
My questions:
Which Tesseract C++ API function can I call to read the multi-page, non-searchable PDF file into a PIX or PIXA?
Do you have a small C++ example on this topic?
I mean exactly like the command line: tesseract test.pdf output pdf
(test.pdf is a multi-page PDF file used as the input parameter)

On Friday, October 25, 2019 at 16:35:14 UTC+2, Ivica Anic wrote:

Zdenko Podobny

Oct 28, 2019, 3:00:42 AM
to tesser...@googlegroups.com
Can you fix your email client? Your posts look weird and are difficult to read.

image.png

For OCR of PDF files, search the internet.
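
Tesseract does not read PDF files as input because Leptonica has no PDF reader, so each page first has to be rendered to an image with an external tool. ProcessPages then accepts either a multi-page TIFF or a plain-text file that lists one image path per line. A minimal sketch of the list-file route follows, assuming the pages have already been rendered to PNG files. The helper names pageImageNames and writePageList and the page-001.png naming pattern are hypothetical; match them to whatever the rendering tool actually produced.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: builds names such as "page-001.png" ... "page-00N.png".
// The prefix and the three-digit numbering are assumptions for illustration.
std::vector<std::string> pageImageNames(const std::string& prefix, int pageCount) {
    std::vector<std::string> names;
    for (int page = 1; page <= pageCount; ++page) {
        char suffix[32];
        std::snprintf(suffix, sizeof suffix, "-%03d.png", page);
        names.push_back(prefix + suffix);
    }
    return names;
}

// Hypothetical helper: writes one image path per line.
// Returns false if the list file cannot be written.
bool writePageList(const std::string& listPath,
                   const std::vector<std::string>& images) {
    std::ofstream out(listPath);
    if (!out) return false;
    for (const std::string& image : images) out << image << '\n';
    out.flush();
    return static_cast<bool>(out);
}

// Usage with the API from this thread (not compiled here):
//   writePageList("pages.txt", pageImageNames("page", 3));
//   api100->ProcessPages("pages.txt", nullptr, 0, renderer);
```

The resulting pages.txt can be passed to ProcessPages in place of the single image name, with the same TessPDFRenderer as in the sample earlier in the thread, to get one searchable PDF containing all listed pages.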

Zdenko


On Mon, Oct 28, 2019 at 4:44, Ivica Anic <delfa...@gmail.com> wrote: