Poor & Intermittent OCR Quality for Tabular Data


Luke Simkins

Jul 17, 2018, 2:53:24 PM
to cloud-vision-discuss

The Google Cloud Vision OCR document_text_detection output shows a severe drop in quality for tabular data. In some cases it appears to OCR the same text more than once (producing overlapping text), and in other cases it misses text altogether. We have only seen this happen in table cells containing 5 or fewer characters. The results are typically high quality otherwise.


Below is a quick example highlighting the issue. The original document is in black; the GCV text detection is overlaid in red.


Example output.
The text in the original document was rotated 90°. The document was rotated before being sent to the API so that the text was no longer rotated. Several blocks of text were missed.
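When the page is rotated before the request, the returned bounding boxes are in the rotated image's coordinates, so they have to be mapped back before they can be overlaid on the original document. The following is a minimal sketch of that mapping, not something taken from the original report. It assumes the page was turned 90° clockwise before sending (the post does not say which direction was used), and the helper name `unrotate_cw90` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def unrotate_cw90(vertices: List[Point], rotated_width: float) -> List[Point]:
    """Map points from a 90-degree-clockwise-rotated image back to the original.

    Assumption: the image was rotated clockwise by a quarter turn. A point
    (x, y) in the original then lands at (H - y, x) in the rotated image,
    where H is the original height (equal to the rotated image's width).
    Inverting that gives (x, y) = (y_r, H - x_r).
    """
    return [(y_r, rotated_width - x_r) for (x_r, y_r) in vertices]


# Example: a box near the top-right corner of a rotated image 200 px wide.
box_in_rotated = [(180.0, 10.0), (200.0, 10.0), (200.0, 40.0), (180.0, 40.0)]
box_in_original = unrotate_cw90(box_in_rotated, rotated_width=200.0)
```

For a counter-clockwise rotation the mapping is different, so the direction should be checked against a known landmark on the page before trusting the overlay.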



Example output.
The text in the original document was rotated 90°. The document was not rotated before being sent to the API, so the text reached the API still rotated. Several blocks of text were OCR'd twice. (Several T4S cells have overlapping "T4S" and "TAS" text.)
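One way to flag the double-OCR case described above is to look for pairs of detected words whose bounding boxes overlap heavily. The following is a minimal sketch under stated assumptions, not part of the original report: it assumes the word boxes have already been reduced to axis-aligned `(x_min, y_min, x_max, y_max)` tuples, and the `Word` type, the `iou` and `find_overlapping_words` helpers, and the 0.5 threshold are all hypothetical choices for illustration.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


class Word(NamedTuple):
    """A detected word with an axis-aligned box (hypothetical structure)."""
    text: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def find_overlapping_words(
    words: List[Word], threshold: float = 0.5
) -> List[Tuple[Word, Word]]:
    """Return every pair of words whose boxes reach `threshold` IoU."""
    return [
        (w1, w2)
        for w1, w2 in combinations(words, 2)
        if iou(w1.box, w2.box) >= threshold
    ]


# Example: two near-identical boxes with different readings, plus one clean cell.
words = [
    Word("T4S", (10.0, 10.0, 40.0, 20.0)),
    Word("TAS", (11.0, 10.0, 41.0, 20.0)),
    Word("ABC", (100.0, 10.0, 130.0, 20.0)),
]
suspect_pairs = find_overlapping_words(words)
```

The pairwise comparison is quadratic in the number of words, which is usually acceptable for a single page; it only flags candidates and does not decide which of the two readings is correct.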




Duane Chen

Jul 17, 2018, 3:39:22 PM
to lucas....@gmail.com, cloud-visi...@googlegroups.com
Hi Luke,

Unfortunately, short text is known to be a difficult problem for OCR, and we have heard from other customers with similar problems with short text detection. We are continuing to improve our models over time, but currently there is not much we can do about short text misses.

Thanks,

Duane

--
You received this message because you are subscribed to the Google Groups "cloud-vision-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-vision-dis...@googlegroups.com.
To post to this group, send email to cloud-visi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-vision-discuss/a0a5aa8e-8bb8-4c03-936a-52cc2dea63c7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Luke Simkins

Jul 17, 2018, 3:44:23 PM
to cloud-vision-discuss
Duane,

Thank you for the information. That gives us context for the problem.

Best,
- Luke



sanjjayy pandeyy

Apr 5, 2019, 5:28:31 AM
to cloud-vision-discuss
I'm facing a similar issue: numbers are not being detected.



Ali (Cloud Platform Support)

Apr 6, 2019, 4:25:38 PM
to cloud-vision-discuss
Hi,

As Duane mentioned above, short text is known to be a difficult problem for OCR. The Vision API team is continuously working on improving the models, but there isn't much that can be done about short text misses.

If this happens consistently across your images, I would suggest opening an issue in the issue tracker and including sample images and the request being sent.
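For such a report it can help to attach the exact request alongside the sample image. The following is a minimal sketch of building the JSON body for the REST `images:annotate` method with the `DOCUMENT_TEXT_DETECTION` feature. The helper name `build_annotate_request` is hypothetical, the placeholder bytes stand in for a real image file, and authentication is deliberately left out.

```python
import base64
import json


def build_annotate_request(image_bytes: bytes) -> str:
    """Return the JSON body for a single-image DOCUMENT_TEXT_DETECTION request."""
    body = {
        "requests": [
            {
                # The REST API expects the image bytes base64-encoded.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }
    return json.dumps(body, indent=2)


# Placeholder bytes; in practice these would be read from the sample image file.
request_json = build_annotate_request(b"abc")
```

The resulting string can be saved next to the sample image and is the body that would be POSTed to `https://vision.googleapis.com/v1/images:annotate`, so whoever triages the issue can replay the same call.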