The Google Cloud Vision OCR document_text_detection output drops severely in quality on tabular data. In some cases it appears to OCR the same text more than once (producing overlapping text), and in others it misses text altogether. We've only noticed this in table cells containing 5 or fewer characters; results are otherwise typically high quality.
Below is a quick example highlighting the issue. The original document is in black, with the GCV text detection overlaid in red.
Example output.
The text in the original document was rotated 90°. This document was rotated before being sent to the API, so the text was upright when submitted. Several blocks of text were missed.
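For anyone trying to reproduce the overlay: when the page is rotated before it is sent, the returned coordinates are in the rotated image's frame and need to be mapped back before drawing them on the original. The snippet below is only a minimal sketch of that mapping. It assumes the rotation was 90° clockwise (the direction is not stated above), and `unrotate_cw90` and its parameters are hypothetical names, not part of the Vision API.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def unrotate_cw90(vertices: List[Point], original_height: float) -> List[Point]:
    """Map points from an image rotated 90 degrees clockwise back to the original frame.

    Assumption: a clockwise rotation sends an original point (x, y) to
    (original_height - y, x), so the inverse applied here is
    (x_rot, y_rot) -> (y_rot, original_height - x_rot).
    """
    return [(y_rot, original_height - x_rot) for x_rot, y_rot in vertices]


# Hypothetical page: 100 wide by 200 tall before rotation,
# so the rotated image is 200 wide by 100 tall.
corners_in_rotated_frame = [(200, 0), (200, 100), (0, 0)]
corners_in_original_frame = unrotate_cw90(corners_in_rotated_frame, original_height=200)
```

If the rotation was counterclockwise instead, the mapping changes and would need to be derived the same way from the image corners.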
Example output
The text in the original document was rotated 90°. This document was not rotated before being sent to the API, so the text was submitted rotated. Several blocks of text were OCR'd twice. (Several T4S cells have overlapping "T4S" and "TAS" text.)
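In case it helps with triage, here is a rough sketch of how the doubled detections (such as the overlapping "T4S" and "TAS") could be flagged after the fact. It assumes each detected word has already been reduced to an axis-aligned box taken from the vertices of its bounding polygon. The `Word` type, the `find_overlapping` helper, and the 0.5 threshold are illustrative assumptions, not something the API provides.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    """Hypothetical container: detected text plus an axis-aligned bounding box."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def iou(a: Word, b: Word) -> float:
    """Intersection-over-union of two axis-aligned boxes (0.0 if they do not overlap)."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    return inter / (area_a + area_b - inter)


def find_overlapping(words: List[Word], threshold: float = 0.5) -> List[Tuple[Word, Word]]:
    """Return every pair of words whose boxes overlap by at least `threshold` IoU."""
    return [(a, b) for a, b in combinations(words, 2) if iou(a, b) >= threshold]


# Made-up coordinates imitating one table cell that was detected twice.
words = [
    Word("T4S", 10, 10, 40, 20),
    Word("TAS", 11, 10, 41, 20),
    Word("ABC", 100, 10, 130, 20),
]
duplicates = find_overlapping(words)
```

This only detects the duplicates; deciding which of the two readings to keep (for example by confidence, if available) is a separate question, and it does nothing for cells that were missed entirely.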
--
You received this message because you are subscribed to the Google Groups "cloud-vision-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cloud-vision-dis...@googlegroups.com.
To post to this group, send email to cloud-visi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cloud-vision-discuss/a0a5aa8e-8bb8-4c03-936a-52cc2dea63c7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hi Luke,

Unfortunately, short text is known to be a difficult problem for OCR, and we have heard from other customers with similar short-text detection problems. We are continuing to improve our models over time, but currently there is not much we can do about short-text misses.

Thanks,
Duane
On Tue, Jul 17, 2018 at 11:53 AM Luke Simkins <lucas....@gmail.com> wrote: