Document AI

Balaji Murali

Nov 11, 2021, 9:49:21 AM
to Google Cloud Developers
Hello, 
Does Document AI support languages other than English for data extraction and HITL (human-in-the-loop review)?
If so, how do I enable that? Thanks in advance.

George (Cloud Platform Support)

Nov 11, 2021, 12:32:51 PM
to Google Cloud Developers
Hello, 

Document AI does support languages other than English. You can find detailed information on the "Language Support" page. Does that page answer your questions, or do you need more detail?

Balaji Murali

Nov 12, 2021, 3:07:40 AM
to Google Cloud Developers
Thanks for your response.
Yes, I have seen the "Language Support" page. It says that other languages are supported. But when I tried to extract data using the Invoice parser or the General parser, it returned only the English text. Is there any way to enable a language option?

Fabio De Angelis

Nov 16, 2021, 9:28:37 AM
to Google Cloud Developers

Hello,
Document AI should detect every language in an input document automatically, without you having to select or specify them.
If it does not, the cause might be the input document itself, which may be hard to read or of poor quality.
I understand that your input document contains other languages in addition to English; is that right?
Could you share it with us, along with the source code or commands you are using to send the request, so that we can replicate your issue? Please remove any sensitive information (personal data, project ID, keys, passwords, etc.) before sharing.
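One way to check what was actually detected is to look at the detected_languages entries that the response carries for each page. The sketch below is only an illustration, not an official sample: summarize_detected_languages and fake_document are hypothetical names, and the stand-in object holds made-up values so the snippet runs without calling the API. With a real response you would pass result.document instead.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_detected_languages(document):
    """Return {language_code: highest confidence seen on any page}.

    `document` is expected to look like the `result.document` object
    returned by process_document: it has `pages`, and each page has
    `detected_languages` entries with `language_code` and `confidence`.
    """
    best = defaultdict(float)
    for page in document.pages:
        for lang in page.detected_languages:
            best[lang.language_code] = max(best[lang.language_code], lang.confidence)
    return dict(best)


# Offline stand-in for a processed document (made-up values, no API call).
fake_document = SimpleNamespace(
    pages=[
        SimpleNamespace(detected_languages=[
            SimpleNamespace(language_code="en", confidence=0.9),
            SimpleNamespace(language_code="de", confidence=0.4),
        ]),
        SimpleNamespace(detected_languages=[
            SimpleNamespace(language_code="de", confidence=0.7),
        ]),
    ]
)

print(summarize_detected_languages(fake_document))
```

If only "en" shows up for a document that clearly contains another language, that points to an OCR or quality problem rather than a missing language setting.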

Balaji Murali

Nov 18, 2021, 2:43:56 AM
to Google Cloud Developers
Hello,
I am using the Invoice parser to extract the invoice data with the code below. I can't share the input PDF.

project_id = '****'
location = 'us' # Format is 'us' or 'eu'
processor_id = '***' # Create processor in Cloud Console
file_path = '**'
def process_document_sample(
    project_id: str, location: str, processor_id: str, file_path: str
):
    from google.cloud import documentai_v1 as documentai

    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = {}
    if location == "eu":
        opts = {"api_endpoint": "eu-documentai.googleapis.com"}   

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Wrap the raw bytes together with their MIME type
    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    # Recognizes text entities in the PDF document
    result = client.process_document(request=request)

    document = result.document

    print("Document processing complete.")

    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    document_pages = document.pages

    # Read the text recognition output from the processor
    print("The document contains the following paragraphs:")
    for page in document_pages:
        paragraphs = page.paragraphs
        for paragraph in paragraphs:
            paragraph_text = get_text(paragraph.layout, document)
            print(f"Paragraph text: {paragraph_text}")


# Extract the text that a layout element points to
def get_text(doc_element, document):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # An unset start_index reads as 0, so no special case is needed
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response


process_document_sample(project_id, location, processor_id, file_path)

I also used the General parser for form and table extraction. Both parsers extracted only the English text.

Fabio De Angelis

Nov 19, 2021, 7:06:09 AM
to Google Cloud Developers

Hello,

I have tried Document AI with the Invoice processor on various mixed-language invoice files, and it detected every language in the files without any additional setup.

The only problem is that it failed to detect some fields, regardless of the language.

So your issue does not depend on the languages themselves. It may instead be that the fields in the second language have a different format, font, size, or resolution, which makes them harder for the Invoice processor to detect.
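
Note that the code posted earlier in this thread prints only the OCR paragraphs, while the fields extracted by the Invoice processor are returned as entities on the document. A minimal sketch for listing them and flagging weak detections follows. It assumes the entities expose type_, mention_text, and confidence as in the Python client; list_entities, the 0.5 threshold, and the fake_invoice stand-in (made-up values, no API call) are hypothetical and only there for illustration.

```python
from types import SimpleNamespace


def list_entities(document, min_confidence=0.5):
    """Return (type, text, confidence, is_low_confidence) for each entity.

    `min_confidence` is an arbitrary illustrative threshold, not a value
    recommended by Document AI.
    """
    rows = []
    for entity in document.entities:
        rows.append(
            (
                entity.type_,
                entity.mention_text,
                entity.confidence,
                entity.confidence < min_confidence,
            )
        )
    return rows


# Offline stand-in for result.document (made-up values, no API call).
fake_invoice = SimpleNamespace(
    entities=[
        SimpleNamespace(type_="supplier_name", mention_text="Beispiel GmbH", confidence=0.92),
        SimpleNamespace(type_="total_amount", mention_text="1.234,56", confidence=0.31),
    ]
)

for type_, text, confidence, is_low in list_entities(fake_invoice):
    flag = " (low confidence)" if is_low else ""
    print(f"{type_}: {text!r} [{confidence:.2f}]{flag}")
```

Fields that are missing from this list, or that appear with low confidence, are the ones worth comparing against the source PDF for font, size, or scan quality.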
