PDF files DOCUMENT_TEXT_DETECTION skipping last few lines

January December

May 20, 2021, 12:59:46 PM

to cloud-vision-discuss

Hi All,

I have a PDF file containing multiple pages of scanned document images. The scanned documents contain text in the Telugu language.

When I perform OCR on a single document using the files:annotate API, it detects all lines on a page. But when I perform OCR on a document containing multiple pages using the files:asyncBatchAnnotate API, the last few lines are always missing on each page. I tried setting crop hints, but that did not help. Please let me know if I am missing anything.

I am using the following curl command:

curl -X POST -H "Authorization: Bearer <AUTH_TOKEN>" -H "Content-Type: application/json; charset=utf-8" https://vision.googleapis.com/v1/files:asyncBatchAnnotate -d @<path to json request files>

I am using the following request:

{
  "requests": [
    {
      "features": [
        { "type": "DOCUMENT_TEXT_DETECTION" }
      ],
      "inputConfig": {
        "gcsSource": { "uri": "<path to file>" },
        "mimeType": "application/pdf"
      },
      "imageContext": {
        "languageHints": [ "te" ],
        "textDetectionParams": { "enableTextDetectionConfidenceScore": false }
      },
      "outputConfig": {
        "batchSize": 8,
        "gcsDestination": { "uri": "<path to folder>" }
      }
    }
  ]
}

Regards,

Jan.

Olusayo Akinlaja

May 20, 2021, 4:15:55 PM

to cloud-vision-discuss

Hello, Jan

From the information you provided, I see you are following the steps described in this article[0]. You do not appear to be doing anything wrong.

Since this looks like an OCR quality issue rather than a failure of the asyncBatchAnnotate method of the Cloud Vision API, I suggest you contact the GCP Support Engineers[1] to troubleshoot it. That way you can share the specific PDF files, and the GCP Support Engineers can reproduce the issue using the exact files you are using. If you are absolutely certain this issue is a bug, please open an issue using this link[2].

The GCP Support engineers will be able to work closely with you.

[0] https://cloud.google.com/vision/docs/pdf#document_text_detection_requests

[1] https://cloud.google.com/support-hub

[2] https://developers.google.com/issue-tracker/#public_users

January December

May 21, 2021, 10:34:02 AM

to cloud-vision-discuss

Hi Olusayo Akinlaja,

Thank you for your response and guidance.

I got a workaround from the support team: set the batch size to 1 and do not use imageContext. With those two changes, no lines are missing.
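For anyone who wants to try the same workaround, here is a minimal sketch of the adjusted request body, built in Python. It assumes the workaround amounts to setting batchSize to 1 and leaving out the imageContext block from the original request; the function name build_workaround_request and the placeholder URIs are only illustrative and are not part of the Vision API.

```python
import json


def build_workaround_request(input_uri, output_uri):
    """Build a files:asyncBatchAnnotate request body for the workaround.

    Hypothetical helper: it mirrors the original request, but uses
    batchSize 1 and omits imageContext entirely.
    """
    return {
        "requests": [
            {
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
                "inputConfig": {
                    "gcsSource": {"uri": input_uri},
                    "mimeType": "application/pdf",
                },
                # No "imageContext" key: languageHints and
                # textDetectionParams are left out on purpose.
                "outputConfig": {
                    # One page per output JSON file.
                    "batchSize": 1,
                    "gcsDestination": {"uri": output_uri},
                },
            }
        ]
    }


if __name__ == "__main__":
    # Placeholders kept from the original request; replace with real GCS URIs.
    body = build_workaround_request("<path to file>", "<path to folder>")
    print(json.dumps(body, indent=2))
```

The printed JSON can be saved to a file and passed to the same curl command with -d @<path to json request file>.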

I am glad the Support team was so cooperative in testing the files I provided.

I have also shared a few other issues:

1. Incorrect recognition of some characters

2. The sequence of some words is not in line with the actual text

3. The "=" character is sometimes not recognized

I hope the Vision API team will address the reported issues as soon as possible.

Regards,

Jan.

Monica (Google Cloud Platform)

May 21, 2021, 6:00:31 PM

to cloud-vision-discuss

Hello Jan,

Thank you for your wonderful feedback. I'm sure the engineering team will address the 3 remaining issues you are currently encountering.

I wish you a good day!
