Document AI

Balaji Murali

Nov 11, 2021, 9:49:21 AM

to Google Cloud Developers

Hello,

Does Document AI support languages other than English for data extraction and human-in-the-loop (HITL) review?

If so, how can I enable that? Thanks in advance.

George (Cloud Platform Support)

Nov 11, 2021, 12:32:51 PM

Hello,

Document AI does support languages other than English. You can find detailed information on the "Language Support" page. Does that page answer your questions, or do you need more detail?

Balaji Murali

Nov 12, 2021, 3:07:40 AM

Thanks for your response.

Yes, I have seen the "Language Support" page. It says that other languages are supported. But when I tried to extract data using the Invoice Parser or the General Parser, it returned only the English text. Is there any way to enable a language option?

Fabio De Angelis

Nov 16, 2021, 9:28:37 AM

Hello,

Document AI should be able to automatically detect every language in an input document without you having to select or specify them.

If it does not, this might be due to the input document, which might not be readable enough or might be of poor quality.

I understand that your input document contains other languages in addition to English, is that right?

Could you share it with us, along with the source code or commands you are using to send the request, so that we can replicate your issue? Please remove any sensitive information before sharing, such as personal data, project ID, keys, passwords, etc.
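One way to check what was detected is to look at the per-page language information in the processed document. The sketch below assumes the `Document` returned by `process_document` exposes `pages[].detected_languages` entries with `language_code` and `confidence`, as in the `documentai_v1` Python types. The helper name `summarize_detected_languages` and the stand-in document with its values are made up for illustration, so that the snippet runs without calling the API.

```python
from types import SimpleNamespace


def summarize_detected_languages(document):
    """Map each page number to a list of (language_code, confidence) pairs."""
    summary = {}
    for page in document.pages:
        summary[page.page_number] = [
            (lang.language_code, round(lang.confidence, 2))
            for lang in page.detected_languages
        ]
    return summary


# Hypothetical stand-in for result.document (illustrative values only).
fake_document = SimpleNamespace(
    pages=[
        SimpleNamespace(
            page_number=1,
            detected_languages=[
                SimpleNamespace(language_code="en", confidence=0.72),
                SimpleNamespace(language_code="de", confidence=0.25),
            ],
        )
    ]
)

language_summary = summarize_detected_languages(fake_document)
print(language_summary)
```

With a real response, passing `result.document` instead of the stand-in would show whether the second language was detected at all on each page, which helps separate a text-detection problem from a field-extraction problem.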

Balaji Murali

Nov 18, 2021, 2:43:56 AM

Hello,

I am using the Invoice Parser to extract invoice data with the code below. I can't share the input PDF.

from google.cloud import documentai_v1 as documentai

project_id = '****'
location = 'us'  # Format is 'us' or 'eu'
processor_id = '***'  # Create processor in Cloud Console
file_path = '**'


def process_document_sample(
    project_id: str, location: str, processor_id: str, file_path: str
):
    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = {}
    if location == "eu":
        opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    # Recognizes text entities in the PDF document
    result = client.process_document(request=request)
    document = result.document

    print("Document processing complete.")

    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document
    document_pages = document.pages

    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        paragraphs = page.paragraphs
        for paragraph in paragraphs:
            paragraph_text = get_text(paragraph.layout, document)
            print(f"Paragraph text: {paragraph_text}")


# Extract shards from the text field
def get_text(
    doc_element: documentai.Document.Page.Layout, document: documentai.Document
) -> str:
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # start_index is 0 when the segment starts at the beginning of the text
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response


process_document_sample(project_id, location, processor_id, file_path)

I also used the General Parser for form and table extraction. Both extracted only the English text.

Fabio De Angelis

Nov 19, 2021, 7:06:09 AM

Hello,

I have tried Document AI using the invoice processor on various mixed-language invoice files, and it was able to detect every language in the files without any additional setup.

The only issue is that it missed some fields, regardless of the language.

Therefore, your issue does not depend on the languages themselves. It might instead be that the second-language fields have a different format, font, size, or resolution, which makes it harder for the Invoice processor to detect those fields.
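To see which fields the processor missed, it can help to list the extracted entities next to the ones you expected. This is only a sketch: it assumes the processed `Document` exposes `entities` with `type_`, `mention_text`, and `confidence`, as in the `documentai_v1` Python types. The helper `report_entities`, the expected type names, the 0.5 threshold, and the stand-in data are hypothetical and should be adjusted to your processor's schema.

```python
from types import SimpleNamespace

# Hypothetical set of entity types the caller expects to find.
EXPECTED_TYPES = {"invoice_id", "supplier_name", "total_amount"}


def report_entities(document, expected_types=EXPECTED_TYPES, min_confidence=0.5):
    """Return (found, missing, low_confidence) for the document's entities."""
    found = {}
    for entity in document.entities:
        # Keep the highest-confidence mention for each entity type.
        best = found.get(entity.type_)
        if best is None or entity.confidence > best[1]:
            found[entity.type_] = (entity.mention_text, entity.confidence)
    missing = sorted(set(expected_types) - set(found))
    low_confidence = sorted(
        type_ for type_, (_, conf) in found.items() if conf < min_confidence
    )
    return found, missing, low_confidence


# Hypothetical stand-in for result.document (illustrative values only).
fake_document = SimpleNamespace(
    entities=[
        SimpleNamespace(type_="invoice_id", mention_text="INV-001", confidence=0.9),
        SimpleNamespace(type_="total_amount", mention_text="100.00", confidence=0.3),
    ]
)

found, missing, low_confidence = report_entities(fake_document)
print("Missing:", missing)
print("Low confidence:", low_confidence)
```

If the same field types are missing for both English-only and mixed-language invoices, that would be consistent with a layout or quality problem rather than a language problem.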
