TessPDFRenderer outputs invalid PDF file (+gosseract)

blaumedia

Nov 21, 2021, 3:27:24 AM

to tesseract-ocr

Described already in issue: https://github.com/tesseract-ocr/tesseract/issues/3652

I'm trying to generate a searchable PDF from a JPG image, but the output is an invalid PDF file that no PDF reader can open.

I have added a Docker image for reproducing the problem to the issue, but here is the bash snippet for it:

git clone g...@github.com:dnnspaul/gosseract.git

cd gosseract

git checkout tesseract/bug/3652

docker build -t tessbug .

docker run -it -v "$PWD/tmp:/tmp" tessbug go run main.go

When I feed the same file to the tesseract CLI, the resulting PDF is readable, but I can't find any difference between what the CLI does and what my snippet does.

Thanks in advance for any help! I'm sorry, I'm more of a Go developer than a C++ developer, so I struggle even with the simplest syntax, but I tried my best.

Zdenko Podobny

Nov 21, 2021, 7:18:52 AM

to tesser...@googlegroups.com

This seems to be the same problem as https://github.com/sirfz/tesserocr/issues/271#issuecomment-919334885

Did you use BeginDocument() and EndDocument()?

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f34562d3-d11e-4385-9c78-b24092413dean%40googlegroups.com.

blaumedia

Nov 21, 2021, 5:16:38 PM

to tesseract-ocr

Hi zdenop,

Thanks for the tip, but I'm using the ProcessPages() function, so it should write the header and trailer of the file itself.

However, I experimented with calling BeginDocument() before ProcessPage() and EndDocument() after it, and the resulting file differs considerably. Sadly, the file is still corrupt.
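The manual sequence described above can be sketched roughly as follows. This only illustrates the call order and is not the gosseract code: `renderSinglePage` is a hypothetical helper, and the template parameters are stand-ins for tesseract::TessBaseAPI, tesseract::TessPDFRenderer and Leptonica's Pix, so the sketch compiles without the libraries installed.

```cpp
#include <string>

// Hypothetical helper: drives one page through a renderer by hand, in the
// same order ProcessPages() uses for a single image.
// Api, Renderer and Image are stand-ins; with the real library they would be
// tesseract::TessBaseAPI, tesseract::TessPDFRenderer and Pix.
template <typename Api, typename Renderer, typename Image>
bool renderSinglePage(Api& api, Renderer& renderer, Image* image,
                      const std::string& inputName) {
  // The renderer reports "not happy" if its output file could not be opened.
  if (!renderer.happy()) return false;

  // Opens the document; the input name is used as the document title here.
  if (!renderer.BeginDocument(inputName.c_str())) return false;

  // Page index 0, no retry config, no timeout.
  const bool pageOk =
      api.ProcessPage(image, 0, inputName.c_str(), nullptr, 0, &renderer);

  // Always close the document, even if the page failed.
  const bool endOk = renderer.EndDocument();
  return pageOk && endOk;
}
```

With the real classes, the renderer object should also be destroyed (or go out of scope) before the output file is read, since that is when its file handle is closed.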

So the problem seems to lie in the BeginDocument()/EndDocument() calls failing. But even there I'm seeing strange behavior.

When I call only EndDocument(), I get something like a trailer at the end of the file:

[Screenshot: end of the generated PDF file]

However, it stops abruptly at "Produce". And when I call BeginDocument(), ProcessPage() and then EndDocument(), the file ends in binary data with no closing "endstream" or "endobj".
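A quick way to tell a truncated file from a complete one is to look at its first and last bytes. The following is a minimal heuristic sketch, not a PDF validator: the function names are made up for illustration, and it only checks that the data starts with "%PDF-" and that a "startxref" entry followed by "%%EOF" appears in the last 1024 bytes.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Heuristic check on raw bytes: header present, and the tail contains
// "startxref" followed by "%%EOF". A file cut off mid-object fails this.
bool looksLikeCompletePdf(const std::string& bytes) {
  if (bytes.compare(0, 5, "%PDF-") != 0) return false;
  const std::size_t tailStart = bytes.size() > 1024 ? bytes.size() - 1024 : 0;
  const std::size_t xref = bytes.find("startxref", tailStart);
  if (xref == std::string::npos) return false;
  return bytes.find("%%EOF", xref) != std::string::npos;
}

// Convenience wrapper that reads the whole file into memory first.
bool fileLooksLikeCompletePdf(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  if (!in) return false;
  const std::string bytes((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
  return looksLikeCompletePdf(bytes);
}
```

A file that ends at "/Produce" or in the middle of a stream fails the check, while passing it does not prove that the PDF is otherwise valid.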

I've updated to the latest version, 4.1.3, but the problem persists.

I updated the bug branch at https://github.com/dnnspaul/gosseract/tree/tesseract/bug%2F3652 so the problem is reproducible.

To disable the BeginDocument() call, one has to comment out https://github.com/dnnspaul/gosseract/blob/tesseract/bug/3652/tessbridge.cpp#L187.

I tried to copy the code from the tesseract CLI one-to-one, but it still does not work...

Zdenko Podobny

Nov 22, 2021, 8:29:02 AM

to tesser...@googlegroups.com

Here is a simple example that works for me (with tesseract 5 and leptonica 1.82):

#include <leptonica/allheaders.h>
#include <tesseract/baseapi.h>
#include <tesseract/renderer.h>
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

int main() {
  const char* datapath = "f:/Project-Personal/tessdata_best/tessdata";
  std::string language_ = "eng";
  std::string inputFile_ = "input.png";
  const char* outputbase = "output";  // the renderer appends ".pdf"

  tesseract::TessBaseAPI* api = new tesseract::TessBaseAPI();
  // Init() returns 0 on success.
  if (api->Init(datapath, language_.c_str(), tesseract::OEM_LSTM_ONLY)) {
    fprintf(stderr, "Could not initialize tesseract.\n");
    delete api;
    return EXIT_FAILURE;
  }

  PIX* sourceImg = pixRead(inputFile_.c_str());
  if (!sourceImg) {
    fprintf(stderr, "Leptonica can't process input file: %s\n",
            inputFile_.c_str());
    api->End();
    delete api;
    return EXIT_FAILURE;
  }
  api->SetImage(sourceImg);
  api->SetInputName(inputFile_.c_str());
  api->SetOutputName(outputbase);

  tesseract::TessPDFRenderer* renderer =
      new tesseract::TessPDFRenderer(outputbase, api->GetDatapath());
  if (!renderer->happy()) {
    fprintf(stderr, "Error, could not create PDF output file: %s\n",
            strerror(errno));
    delete renderer;
    pixDestroy(&sourceImg);
    api->End();
    delete api;
    return EXIT_FAILURE;  // do not continue with a deleted renderer
  }

  // ProcessPages() calls BeginDocument() and EndDocument() on the renderer.
  bool succeed = api->ProcessPages(inputFile_.c_str(), nullptr, 0, renderer);
  if (!succeed) {
    fprintf(stderr, "Error during processing.\n");
  }

  // Destroying the renderer closes the output file.
  delete renderer;
  api->End();
  delete api;
  pixDestroy(&sourceImg);
  return succeed ? EXIT_SUCCESS : EXIT_FAILURE;
}
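One thing worth checking when the same calls work in a short-lived program but not inside a longer-running process is output buffering. As far as I can tell from the Tesseract sources, the renderer writes through a C FILE* and closes it in its destructor, so data may still be sitting in the stdio buffer until the renderer is destroyed or the process exits. Whether this is what happens in gosseract is an assumption, not something confirmed in this thread. The effect itself can be shown without Tesseract; `fileSizeOnDisk` and `demonstrateBuffering` are made-up names for this sketch.

```cpp
#include <cstdio>
#include <string>

// Size of the file as currently visible on disk, or -1 if it cannot be opened.
long fileSizeOnDisk(const std::string& path) {
  std::FILE* f = std::fopen(path.c_str(), "rb");
  if (!f) return -1;
  std::fseek(f, 0, SEEK_END);
  const long size = std::ftell(f);
  std::fclose(f);
  return size;
}

struct BufferingDemo {
  long beforeClose;  // bytes visible while the writer is still open
  long afterClose;   // bytes visible after fclose()
};

// Writes a 9-byte PDF header through a buffered FILE* and measures the file
// size before and after closing the stream.
BufferingDemo demonstrateBuffering(const std::string& path) {
  std::FILE* out = std::fopen(path.c_str(), "wb");
  if (!out) return {-1, -1};
  std::fputs("%PDF-1.5\n", out);             // stays in the stdio buffer
  const long before = fileSizeOnDisk(path);  // typically 0 at this point
  std::fclose(out);                          // flushes the buffer to disk
  const long after = fileSizeOnDisk(path);   // now all 9 bytes are there
  return {before, after};
}
```

With larger output the buffer is flushed in chunks, so a file read before the stream is closed can end in the middle of an object.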

Zdenko

Zdenko Podobny

Nov 22, 2021, 8:33:15 AM

to tesser...@googlegroups.com

This is an old snippet of mine, so part of the code (opening the input image as PIX) is not needed for PDF rendering.

Zdenko

Sarah Jane CHANNEL

Nov 22, 2021, 8:34:10 AM

to tesser...@googlegroups.com

Can this code read text?

Zdenko Podobny

Nov 22, 2021, 8:35:19 AM

to tesser...@googlegroups.com

I do not understand your question. How is it related to the topic under discussion?

Zdenko

blaumedia

Nov 22, 2021, 12:51:38 PM

to tesseract-ocr

It works!

I tried tesseract-ocr 5.0.0 RC2 with leptonica 1.82, and with that combination both my code and yours worked flawlessly. It seems that 4.1.3 has a bug that has been fixed in 5.0.0. I hadn't tested 5.0 before because I thought it would be less stable.

I also tested 4.1.3 with leptonica 1.82 (I was on some 1.7.x version before), and the corrupt PDF problem still occurs. But that's not a problem; I will use 5.0.0 instead.

Thank you zdenop!

blaumedia

Nov 22, 2021, 1:54:33 PM

to tesseract-ocr

Hey zdenop,

It turns out I can't rely on 5.0.0, because OpenCV seems to be compatible only with 4.x so far. (OpenCV is another requirement of my project.)

Does your code from above work with tesseract 4.x for you?

Zdenko Podobny

Nov 22, 2021, 4:27:47 PM

to tesser...@googlegroups.com

Hello,

Yes, it also works for me with tesseract 4.1.3 (the latest version). AFAIR the behaviour of the renderers (including TessPDFRenderer) has not changed since the 4.0-beta version.

Also, I do not understand your problem with OpenCV: AFAIK tesseract is only an optional dependency there, and OpenCV uses only a very limited set of tesseract features [1].

Since you will be using tesseract directly to create the PDF anyway, there is no point in worrying about OpenCV's support for old tesseract versions.

[1] https://docs.opencv.org/4.5.4/d7/ddc/classcv_1_1text_1_1OCRTesseract.html

Zdenko

blaumedia

Nov 23, 2021, 10:11:14 AM

to tesseract-ocr

Zdenop! Great news! :D

I recompiled OpenCV on my machine and somehow that resolved the problem. Now I can use v5.0.0 and OpenCV together without any issues. It seems OpenCV was still linked against old libraries in /usr/local/lib (it kept looking for libtesseract.so.4, which did not exist because I had only installed v5). It was probably an easy problem for a C developer, but as I said, I'm just an entry-level Go developer.

So thank you very very much!
