UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte

V tangutoori

May 18, 2018, 2:12:39 PM
to cloud-vision-discuss
Hi,
I am trying the PDF/TIFF OCR feature of Google Cloud Vision. I was able to run OCR on a PDF file that I uploaded to a storage bucket, and the output JSON file is saved in the bucket. However, when I try to read the JSON from that file, I get a Unicode decode error. I used the sample code from the Google documentation to see how it works, and I read up on the methods and functions it calls, but could not find much help there. Can anyone please help me fix this issue?

# Once the request has completed and the output has been
# written to GCS, we can list all the output files.
storage_client = storage.Client()

match = re.match(r'gs://([^/]+)/(.+)', 'gs://myocrbucket-v/2.pdf')
bucket_name = match.group(1)
prefix = match.group(2)

bucket = storage_client.get_bucket(bucket_name)

# List objects with the given prefix.
blob_list = list(bucket.list_blobs(prefix=prefix))
print('Output files:')
for blob in blob_list:
    print(blob.name)

# Process the first output file from GCS.
# Since we specified batch_size=2, the first response contains
# the first two pages of the input file.
output = blob_list[0]

json_string = output.download_as_string()
response = json_format.Parse(json_string, vision.types.AnnotateFileResponse())

# The actual response for the first page of the input file.
first_page_response = response.responses[0]
annotation = first_page_response.full_text_annotation

# Here we print the full text from the first page.
# The response contains more information:
# annotation/pages/blocks/paragraphs/words/symbols
# including confidence scores and bounding boxes
print(u'Full text:\n{}'.format(annotation.text))

Error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 11: invalid start byte

Duane Chen

May 18, 2018, 2:19:13 PM
to V tangutoori, cloud-visi...@googlegroups.com
Hi,

I am not a Python expert by any means, but searching around for this error indicates that your data is not encoded with UTF-8, so you need to decode accordingly.
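
To see which byte is tripping the decoder, a small check along these lines may help. It is only a sketch: `inspect_bytes` and the sample payload are hypothetical, not part of the Vision or Storage client libraries. The sample imitates a PDF header, on the assumption (not confirmed in this thread) that the prefix `2.pdf` could match the uploaded PDF itself rather than the JSON output.

```python
def inspect_bytes(data: bytes, context: int = 8) -> dict:
    """Report whether data is valid UTF-8 and, if not, where decoding fails."""
    try:
        data.decode('utf-8')
        return {'valid_utf8': True}
    except UnicodeDecodeError as err:
        start = max(err.start - context, 0)
        return {
            'valid_utf8': False,
            'position': err.start,                      # index of the offending byte
            'bad_byte': hex(data[err.start]),           # the byte the codec rejected
            'context': data[start:err.start + context], # raw bytes around it
            'looks_like_pdf': data.startswith(b'%PDF'), # PDF files begin with %PDF
        }


# Hypothetical payload: a PDF-style header followed by binary bytes.
sample = b'%PDF-1.4\r\n%\xa1\xb3\xc5\xd7\r\n'
report = inspect_bytes(sample)
print(report)
```

Running the same check on `json_string` from the original snippet would show whether the downloaded object is JSON at all.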

If you don't know the encoding, and you are using Python 3, something like this might work around the issue (but I can't guarantee it).

decoded_text = json_string.decode('utf-8', errors='ignore')
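
As a rough illustration of what lenient decoding does: in Python 3 only `bytes` objects have `.decode()`, so it would apply to the downloaded bytes, not to an already decoded `str`. The payload and the helper `decode_leniently` below are hypothetical. Note that `errors='ignore'` silently drops bytes, so it can hide the real problem instead of fixing it.

```python
import json


def decode_leniently(raw: bytes, errors: str = 'replace') -> str:
    """Decode bytes as UTF-8, handling invalid bytes per the `errors` policy."""
    return raw.decode('utf-8', errors=errors)


# Hypothetical JSON payload containing one stray non-UTF-8 byte (0xa1).
raw = b'{"text": "caf\xa1"}'

ignored = decode_leniently(raw, 'ignore')    # invalid byte is dropped
replaced = decode_leniently(raw, 'replace')  # invalid byte becomes U+FFFD

print(json.loads(ignored)['text'])
print(json.loads(replaced)['text'])
```

If the payload is not JSON to begin with, neither policy will make it parse, so it may be worth checking the object name and its first few bytes before decoding.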
Thanks,

Duane


V tangutoori

May 21, 2018, 3:44:15 PM
to cloud-vision-discuss
Hi Duane,
Sorry for the late reply. OK, I will try this solution out and update you. Thank you so much for helping me through this.