Generating a PDF with the Tesseract C++ API (version 4.1)

Ivica Anic

Oct 25, 2019, 10:35:14 AM

to tesseract-ocr

Hi,

I am testing the Tesseract C++ API (4.1 Version).

Here is my code:

char *datapath = "C:\\Temp\\tessdata-master";
string language_ = "deu";
string inputFile_ = "./input.png";

tesseract::TessBaseAPI *api100 = new tesseract::TessBaseAPI();
if (api100->Init(datapath, "deu", tesseract::OEM_LSTM_ONLY)) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
}

api100->SetVariable("tessedit_create_pdf", "T");

// png file is the input file
PIX *sourceImg100 = pixRead(inputImage.c_str());
api100->SetImage(sourceImg100);
api100->Recognize(0);
api100->SetPageSegMode(tesseract::PSM_AUTO_ONLY);
api100->SetInputName(inputImage.c_str());

tesseract::TessResultRenderer *renderer100 = new tesseract::TessPDFRenderer("output_base", api100->GetDatapath(), false);
renderer100->BeginDocument("test");
renderer100->AddImage(api100);
api100->ProcessPage(sourceImg100, 0, inputImage.c_str(), NULL, 5000, renderer100);
renderer100->EndDocument();

api100->End();
pixDestroy(&sourceImg100);

How can I get a searchable PDF file as output and save it on my computer?

I mean exactly like the command line: tesseract test.tif output pdf

Zdenko:

In my test an output PDF file is created, but the PDF file is not readable.

When I try to open the PDF file, I get an error saying that the XREF data in the PDF file are missing.

Thanks a lot

Zdenko Podobny

Oct 25, 2019, 3:51:32 PM

to tesser...@googlegroups.com

Try something like this:

#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>

int main() {
  const char* datapath = "tessdata";
  std::string language_ = "deu";
  std::string inputFile_ = "input.png";
  const char* outputbase = "output";

  tesseract::TessBaseAPI *api100 = new tesseract::TessBaseAPI();
  if (api100->Init(datapath, "deu", tesseract::OEM_LSTM_ONLY)) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    exit(1);
  }

  PIX *sourceImg100 = pixRead(inputFile_.c_str());
  if (!sourceImg100) {
    fprintf(stderr, "Leptonica can't process input file: %s\n", inputFile_.c_str());
    return EXIT_FAILURE;
  }
  api100->SetImage(sourceImg100);
  api100->SetInputName(inputFile_.c_str());
  api100->SetOutputName(outputbase);

  // The renderer opens output.pdf; ProcessPages() drives it.
  tesseract::TessPDFRenderer* renderer =
      new tesseract::TessPDFRenderer(outputbase, api100->GetDatapath());
  if (!renderer->happy()) {
    printf("Error, could not create PDF output file: %s\n",
           strerror(errno));
    delete renderer;
    return EXIT_FAILURE;  // do not go on with a deleted renderer
  }

  bool succeed = api100->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);
  delete renderer;  // the destructor closes the output file
  if (!succeed) {
    fprintf(stderr, "Error during processing.\n");
    return EXIT_FAILURE;
  }

  api100->End();
  delete api100;
  pixDestroy(&sourceImg100);
  return 0;
}

Zdenko

On Fri, Oct 25, 2019 at 16:35, Ivica Anic <delfa...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fdf57624-93b1-40e6-9b24-c51cbf74a483%40googlegroups.com.

Ivica Anic

Oct 25, 2019, 11:40:04 PM

to tesseract-ocr

Zdenko,

When I try your sample, I get the following error:

The document cannot be opened.

An error occurred while opening the document from the file:

C:\Users\ocr\output.pdf.

Error [PXCLib]: Required value not found.

=====================================================

When I add these two lines to your sample and try again

api100->SetVariable("tessedit_create_pdf", "T");

api100->SetPageSegMode(tesseract::PSM_AUTO_ONLY);

I get an error when trying to open the PDF output file:

The following problems were found in the document:

- One or more XREF data streams were not found (XREF data are missing)

Zdenko Podobny

Oct 26, 2019, 7:10:45 AM

to tesser...@googlegroups.com

Why do you think there is a problem in tesseract?

output.pdf opens without problems in Acrobat Reader, Chrome/Chromium, and SumatraPDF.

output.pdf passes without error on https://www.pdf-online.com/osa/validate.aspx, https://www.datalogics.com/products/pdftools/pdf-checker/ and https://www.pdfen.com/pdf-a-validator as PDF 1.5...

You should understand what you are doing. E.g. setting the variable tessedit_create_pdf is useless.
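
In a PDF the cross-reference (XREF) data and the trailer sit at the end of the file, so a "missing XREF" message can point to a file that was cut off before it was completely written. That is only one possible cause and is not confirmed in this thread. A rough local check could look like the sketch below. The helper name looks_like_complete_pdf is hypothetical, and the check only looks for the %PDF- header and a startxref / %%EOF pair, so it is no replacement for the validators mentioned above.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Rough sanity check (hypothetical helper, not part of Tesseract):
// returns true if the file starts with "%PDF-" and contains a
// "startxref" keyword followed later by an "%%EOF" marker.
bool looks_like_complete_pdf(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return false;  // file missing or not readable
    }
    const std::string data((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    if (data.compare(0, 5, "%PDF-") != 0) {
        return false;  // no PDF header
    }
    const std::size_t xref = data.rfind("startxref");
    const std::size_t eof = data.rfind("%%EOF");
    return xref != std::string::npos && eof != std::string::npos && xref < eof;
}
```

Calling looks_like_complete_pdf("output.pdf") after the OCR program has exited gives a first hint whether the file was written to the end before a viewer is blamed.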

Zdenko

On Sat, Oct 26, 2019 at 5:40, Ivica Anic <delfa...@gmail.com> wrote:


Ivica Anic

Oct 26, 2019, 11:13:47 AM

to tesseract-ocr

Zdenko,

When I try to open the test PDF file, I get another error message. Do you have a Visual Studio C++ solution, a small working Tesseract example project, that I can download?

On Friday, October 25, 2019 at 16:35:14 UTC+2, Ivica Anic wrote:

Zdenko Podobny

Oct 26, 2019, 11:33:44 AM

to tesser...@googlegroups.com

You do not need a VS solution; it just complicates the whole process. To test a minimal solution, save the code I posted above as e.g. test_pdf.cpp. Then run on the command line (adjust to your VS and installation):

"c:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat" x64

cl test_pdf.cpp -Id:\include_path_to_tesseract_and_leptonica\include tesseract41.lib leptonica-1.79.0.lib -D_CRT_SECURE_NO_WARNINGS

If your paths are correct, you can just run test_pdf.exe

Zdenko

On Sat, Oct 26, 2019 at 17:13, Ivica Anic <delfa...@gmail.com> wrote:


Ivica Anic

Oct 26, 2019, 1:02:54 PM

to tesseract-ocr

Zdenko:

Can you please tell me the exact URL where I can download the Tesseract and Leptonica .lib files (and .dll files) and tessdata?

On Friday, October 25, 2019 at 16:35:14 UTC+2, Ivica Anic wrote:

Zdenko Podobny

Oct 26, 2019, 1:55:33 PM

to tesser...@googlegroups.com

Build it yourself; read the tesseract wiki about the possibilities.

Zdenko

On Sat, Oct 26, 2019 at 19:02, Ivica Anic <delfa...@gmail.com> wrote:


Ivica Anic

Oct 27, 2019, 2:07:52 AM

to tesseract-ocr

Zdenko,

Thank you very much for your support. I can now run your sample successfully with VS. My solution was to run: vcpkg integrate install

On Friday, October 25, 2019 at 16:35:14 UTC+2, Ivica Anic wrote:

Ivica Anic

Oct 27, 2019, 11:44:41 PM

to tesseract-ocr

Zdenko:

I have the following use case for the Tesseract C++ 4.1 API:
I would like to read a multi-page, non-searchable PDF file as input into a PIX or PIXA, and create a searchable PDF file as output.
My questions to you:
Which Tesseract C++ API function can I call to read the multi-page, non-searchable PDF file into a PIX or PIXA?
Do you have a small C++ example on this topic?
I mean exactly like the command line: tesseract test.pdf output pdf
(test.pdf is a multi-page PDF file as the input parameter)

On Friday, October 25, 2019 at 16:35:14 UTC+2, Ivica Anic wrote:

Zdenko Podobny

Oct 28, 2019, 3:00:42 AM

to tesser...@googlegroups.com

Can you fix your email client? Your posts look weird and are difficult to read.

For OCR of PDF files, search the internet.

Zdenko

On Mon, Oct 28, 2019 at 4:44, Ivica Anic <delfa...@gmail.com> wrote:
