PDF - DocumentTextDetection issue

36 views
Skip to first unread message

Big Data Mining

unread,
Apr 26, 2021, 1:00:32 PM4/26/21
to cloud-vision-discuss
Good Afternoon,

We've found an issue when using Google Vision API - DocumentTextDetection feature on PDFs that has selectable text and images in the same page.

The behavior of API for PDFs has been changed cause we always expect that the service read all document text no matter it's editable / image / or not. In other words, the API always need to read the "printed" page with all information and not just selectable text.

If the API just reads selectable texts Google does not need to create an OCR service because is too simple to read just texts in PDFs!

Follow attached the files to reproduce the issue:

Submit PDF to Google Vision just returns the information on "selectable text".
Submit de ...0001.jpg to Google Vision return the full text present on the page

We need a position urgently!
JV_ADMINISTRACAO_DE_OBRAS_EIRELI.pdf
JV_ADMINISTRACAO_DE_OBRAS_EIRELI_page-0001.jpg

Olusayo Akinlaja

unread,
Apr 28, 2021, 6:38:06 PM4/28/21
to cloud-vision-discuss
Hello, 

I am trying to understand better your observation. Please be sure to correct me if I am misunderstanding your observation. 

From this information provided, I understand your observation is that the Vision API seems to only read or text detect only the selectable texts in your PDF when it has only texts but it ignore to detects and extracts text from any image in pages that include both Images and texts? If so, I am comparing the Page 1 of the two attached Documents and I can't seem to corroborate this understanding because the Page 1 does not seem to have any images. Can you please clarify on the details being shared?

I can attempt reproducing the issue but I find it necessary to understand the behavior being explained. 
Reply all
Reply to author
Forward
0 new messages