Google Document AI


Alucor Rpa

May 15, 2021, 5:20:46 AM
to cloud-vision-discuss
Hi, I am trying to use the Invoice parser from Python on the GCP Console. The program runs, but the output looks just like the Form parser's. I cannot find fields such as invoice_no, date, etc. being extracted the way the Invoice schema shows them in the demo. How do I access the invoice schema? Below is the code that I used:


# TODO(developer): Set these variables before running the sample.
project_id = ''
location = 'us' # Format is 'us' or 'eu'
processor_id = '' # Create processor in Cloud Console
file_path = 'invoice.pdf'


def process_document_sample(
    project_id: str, location: str, processor_id: str, file_path: str
):
    from google.cloud import documentai_v1beta3 as documentai

    # You must set the api_endpoint if you use a location other than 'us', e.g.:
    opts = {}
    if location == "eu":
        opts = {"api_endpoint": "eu-documentai.googleapis.com"}

    client = documentai.DocumentProcessorServiceClient(client_options=opts)

    # The full resource name of the processor, e.g.:
    # projects/project-id/locations/location/processors/processor-id
    # You must create new processors in the Cloud Console first
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    document = {"content": image_content, "mime_type": "application/pdf"}

    # Configure the process request
    request = {"name": name, "raw_document": document}

    # Recognizes text entities in the PDF document
    result = client.process_document(request=request)

    document = result.document

    print("Document processing complete.")
    
    # For a full list of Document object attributes, please reference this page: https://googleapis.dev/python/documentai/latest/_modules/google/cloud/documentai_v1beta3/types/document.html#Document

    document_pages = document.pages
    # Dump the raw page data to a file for inspection
    with open("sample.txt", "w") as file:
        file.write(repr(document_pages))

    # Read the form field output from the processor
    print("The document contains the following form fields:")

    for page in document_pages:
        print("Page Number:{}".format(page.page_number))
        for form_field in page.form_fields:
            fieldName = get_text(form_field.field_name, document)
            fieldValue = get_text(form_field.field_value, document)
            print(fieldName + " : " + fieldValue)

# Extract shards from the text field
def get_text(doc_element: dict, document: dict):
    """
    Document AI identifies form fields by their offsets
    in document text. This function converts offsets
    to text snippets.
    """
    response = ""
    # If a text segment spans several lines, it will
    # be stored in different text segments.
    for segment in doc_element.text_anchor.text_segments:
        # start_index defaults to 0 when the segment starts at the beginning of the text
        start_index = int(segment.start_index)
        end_index = int(segment.end_index)
        response += document.text[start_index:end_index]
    return response

process_document_sample(project_id, location, processor_id, file_path)



michaill

May 20, 2021, 3:23:36 AM
to cloud-vision-discuss
Hello,

The documentation here [1] lists the fields that can be extracted with the invoice parser processor.
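
As a rough illustration of where those fields tend to show up: specialized processors such as the invoice parser generally return their schema fields as entities on the document (document.entities), not as page.form_fields, which is what the posted code loops over. The sketch below assumes the Python client's Entity attributes type_, mention_text and confidence. The extract_entities helper and the stand-in document with made-up values are hypothetical, used only so the snippet runs without calling the API. The exact field names depend on the processor version.

```python
from types import SimpleNamespace


def extract_entities(document):
    """Collect (type, text, confidence) for every entity on a document.

    Hypothetical helper: `document` is expected to look like the
    `result.document` returned by process_document().
    """
    rows = []
    for entity in document.entities:
        # In the Python client the proto field `type` is exposed as `type_`.
        rows.append((entity.type_, entity.mention_text, entity.confidence))
    return rows


# Stand-in for result.document with made-up values (assumption, not real output).
fake_document = SimpleNamespace(
    entities=[
        SimpleNamespace(type_="invoice_id", mention_text="INV-001", confidence=0.98),
        SimpleNamespace(type_="invoice_date", mention_text="2021-05-15", confidence=0.95),
    ]
)

for entity_type, text, confidence in extract_entities(fake_document):
    print("{} : {} (confidence {:.2f})".format(entity_type, text, confidence))
```

In the posted sample, the same loop could be run on `document` right after `document = result.document`. If the entity list comes back empty, it may be worth checking that the processor created in the Cloud Console is an invoice parser and not a form parser.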

Let me know if this answers your question or if you have any further questions.

Best regards.

---

Alucor Rpa

May 21, 2021, 2:18:49 AM
to cloud-vision-discuss
Thank you. That was very helpful. 