Document Analysis Sheet


Keith Cogswell

Jul 27, 2024

Document analysis is the first step in working with primary sources. Teach your students to think through primary source documents for contextual understanding and to extract information to make informed judgments.

Students can work on document analysis forms on their own or in small groups. To ensure accountability, it is often best if students have to complete their own forms, even if they are working in small groups. Showing students an example of a completed form or modeling how to complete one helps them better understand what accurate, thorough answers look like.

Completing these forms is just the first step of document analysis. Students learn much more when they have to explain their ideas and hear other interpretations. After students have had the opportunity to work with their classmates, they can revise and update the information on their forms. Sharing their analysis can also stimulate interesting discussions about the message and significance of a document. In this way, completing document analysis forms can function as a pre-discussion activity.

Analyzing historical documents helps students understand and interpret both history and current events. They can use this document analysis worksheet to organize their analysis of a historical document.
Learn more about teaching document analysis.

Document analysis is the first step in working with primary sources. Our worksheets can help teach your students to think through primary source documents for contextual understanding and to extract information to make informed judgments.

The first few times you ask students to work with primary sources, and whenever you have not worked with primary sources recently, model careful document analysis using the worksheets. Point out that the steps are the same each time.

Eventually, students will internalize the procedure and be able to go through these four steps on their own every time they encounter a primary source document. Remind students to practice this same careful analysis with every primary source they see.

Document Intelligence layout model is an advanced machine-learning based document analysis API available in the Document Intelligence cloud. It enables you to take documents in various formats and return structured data representations of the documents. It combines an enhanced version of our powerful Optical Character Recognition (OCR) capabilities with deep learning models to extract text, tables, selection marks, and document structure.

Document structure layout analysis is the process of analyzing a document to extract regions of interest and their inter-relationships. The goal is to extract text and structural elements from the page to build better semantic understanding models. There are two types of roles in a document layout: geometric roles, such as text, tables, and selection marks, and logical roles, such as titles and section headings.

You need a Document Intelligence instance in the Azure portal. You can use the free pricing tier (F0) to try the service. After your resource deploys, select Go to resource to get your key and endpoint.

The pages collection is a list of pages within the document. Each page is represented sequentially within the document and includes the orientation angle indicating whether the page is rotated, along with the width and height. The page unit in the model output depends on the source file type; for example, images are measured in pixels and PDFs in inches.
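As a sketch, the pages collection can be walked with plain JSON parsing. The field names below follow the publicly documented Layout response shape, but the values and the rotation threshold are invented for illustration:

```python
import json

# Illustrative (abridged) analyzeResult with a pages collection; field names
# follow the documented Layout response shape, but the values are made up.
sample = json.loads("""
{
  "analyzeResult": {
    "pages": [
      {"pageNumber": 1, "angle": 0.0, "width": 8.5, "height": 11.0, "unit": "inch"},
      {"pageNumber": 2, "angle": -0.3, "width": 8.5, "height": 11.0, "unit": "inch"}
    ]
  }
}
""")

for page in sample["analyzeResult"]["pages"]:
    # Treat small angles as "not rotated"; 0.1 degrees is an arbitrary cutoff.
    rotated = abs(page["angle"]) > 0.1
    print(f"page {page['pageNumber']}: {page['width']}x{page['height']} "
          f"{page['unit']}, rotated={rotated}")
```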

The Layout model extracts all identified blocks of text in the paragraphs collection as a top-level object under analyzeResults. Each entry in this collection represents a text block and includes the extracted text as content and the bounding polygon coordinates. The span information points to the text fragment within the top-level content property that contains the full text from the document.
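A span resolves to a slice of the top-level content string: offset marks where the fragment starts and length how far it runs. A minimal sketch, using an invented abridged result:

```python
# Hypothetical abridged result for illustration; the span of each paragraph
# points into the top-level "content" string.
result = {
    "content": "Quarterly Report\nRevenue grew 12% year over year.",
    "paragraphs": [
        {"content": "Quarterly Report", "spans": [{"offset": 0, "length": 16}]},
        {"content": "Revenue grew 12% year over year.", "spans": [{"offset": 17, "length": 32}]},
    ],
}

def resolve_span(full_text: str, span: dict) -> str:
    """Slice the fragment a span references out of the full document text."""
    return full_text[span["offset"]: span["offset"] + span["length"]]

for para in result["paragraphs"]:
    fragment = resolve_span(result["content"], para["spans"][0])
    # Each span round-trips to the paragraph's own content.
    assert fragment == para["content"]
```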

The new machine-learning based page object detection extracts logical roles like titles, section headings, page headers, page footers, and more. The Document Intelligence Layout model assigns certain text blocks in the paragraphs collection with their specialized role or type predicted by the model. It's best to use paragraph roles with unstructured documents to help understand the layout of the extracted content for a richer semantic analysis. The following paragraph roles are supported:
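One way to use the roles is to group paragraphs by their predicted role before further processing. The role names below (title, sectionHeading, pageFooter) appear in the Layout documentation; the sample data itself is invented:

```python
# Invented sample paragraphs; "role" is absent for plain body text.
paragraphs = [
    {"role": "title", "content": "Annual Report 2023"},
    {"role": "sectionHeading", "content": "1. Overview"},
    {"content": "Body text without a specialized role."},
    {"role": "pageFooter", "content": "Page 1 of 10"},
]

# Bucket paragraph text by role, defaulting to "body" when no role is assigned.
by_role = {}
for para in paragraphs:
    by_role.setdefault(para.get("role", "body"), []).append(para["content"])

print(by_role["sectionHeading"])  # ['1. Overview']
```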

The document layout model in Document Intelligence extracts print and handwritten style text as lines and words. The styles collection includes any handwritten style for lines if detected along with the spans pointing to the associated text. This feature applies to supported handwritten languages.

For Microsoft Word, Excel, PowerPoint, and HTML files, the Layout model in Document Intelligence versions 2024-02-29-preview and 2023-10-31-preview extracts all embedded text as is. Text is extracted as words and paragraphs. Embedded images aren't supported.

The response classifies each text line as handwritten or not, along with a confidence score. For more information, see Handwritten language support.
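As an illustration, an abridged styles entry might look like the snippet below. The field names (isHandwritten, confidence, spans) follow the documented response shape; the values are invented:

```python
import json

# Illustrative styles collection; values are made up for this sketch.
snippet = json.loads("""
{
  "styles": [
    {"confidence": 0.95, "isHandwritten": true,
     "spans": [{"offset": 120, "length": 34}]}
  ]
}
""")

# Collect the spans of text the model believes is handwritten.
handwritten_spans = [s["spans"] for s in snippet["styles"] if s["isHandwritten"]]
print(handwritten_spans)  # [[{'offset': 120, 'length': 34}]]
```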

The Layout model also extracts selection marks from documents. Extracted selection marks appear within the pages collection for each page. They include the bounding polygon, confidence, and selection state (selected/unselected). The text representation (that is, :selected: and :unselected:) is also included, with the starting index (offset) and length that reference the top-level content property containing the full text from the document.
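For example, counting how many check boxes on a page are ticked reduces to filtering the selection marks by state. The state values mirror the documentation; the data is invented:

```python
# Invented pages collection carrying selection marks.
pages = [
    {"pageNumber": 1, "selectionMarks": [
        {"state": "selected", "confidence": 0.99},
        {"state": "unselected", "confidence": 0.97},
        {"state": "selected", "confidence": 0.94},
    ]},
]

for page in pages:
    # Count marks whose state is "selected" on this page.
    selected = sum(1 for m in page["selectionMarks"] if m["state"] == "selected")
    print(f"page {page['pageNumber']}: {selected} selected mark(s)")
```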

Extracting tables is a key requirement for processing documents containing large volumes of data typically formatted as tables. The Layout model extracts tables in the pageResults section of the JSON output. Extracted table information includes the number of columns and rows, row span, and column span. Each cell is output with its bounding polygon, along with information about whether the area is recognized as a columnHeader. The model supports extracting tables that are rotated. Each table cell contains the row and column index and bounding polygon coordinates. For the cell text, the model outputs span information containing the starting index (offset) and length within the top-level content that contains the full text from the document.
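Because cells arrive as a flat list, a common post-processing step is to rebuild a 2-D grid from rowIndex and columnIndex. A minimal sketch over invented table data:

```python
# Invented table in the documented shape: a flat cell list plus row/column counts.
table = {
    "rowCount": 2,
    "columnCount": 2,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "content": "Item", "kind": "columnHeader"},
        {"rowIndex": 0, "columnIndex": 1, "content": "Qty", "kind": "columnHeader"},
        {"rowIndex": 1, "columnIndex": 0, "content": "Widget"},
        {"rowIndex": 1, "columnIndex": 1, "content": "3"},
    ],
}

# Allocate an empty grid, then place each cell by its indices.
grid = [["" for _ in range(table["columnCount"])] for _ in range(table["rowCount"])]
for cell in table["cells"]:
    grid[cell["rowIndex"]][cell["columnIndex"]] = cell["content"]

print(grid)  # [['Item', 'Qty'], ['Widget', '3']]
```

Cells that span multiple rows or columns would additionally need rowSpan/columnSpan handling, omitted here for brevity.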

Do your tables span multiple pages? If so, to avoid having to label all the pages, split the PDF into pages before sending it to Document Intelligence. After the analysis, post-process the pages to a single table.

The Layout API can output the extracted text in markdown format. Use the outputContentFormat=markdown query parameter to specify markdown output. The markdown content is output as part of the content section.
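A request URL carrying that parameter can be assembled like this. The endpoint placeholder and model path are illustrative; only the query parameters mirror the documented API surface:

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute your own resource name in practice.
endpoint = "https://<resource>.cognitiveservices.azure.com"
path = "/documentintelligence/documentModels/prebuilt-layout:analyze"

# Ask for markdown output from the Layout model.
params = urlencode({
    "api-version": "2024-02-29-preview",
    "outputContentFormat": "markdown",
})

url = f"{endpoint}{path}?{params}"
print(url)
```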

Figures (charts, images) in documents complement and enhance the textual content, providing visual representations that aid in understanding complex information. The figures object detected by the Layout model has several key properties: boundingRegions, the spatial locations of the figure on the document pages, including the page number and the polygon coordinates that outline the figure's boundary; spans, the text spans related to the figure, specifying their offsets and lengths within the document's text, which helps associate the figure with its relevant textual context; elements, the identifiers for text elements or paragraphs within the document that are related to or describe the figure; and caption, if one is present.

Hierarchical document structure analysis is pivotal in organizing, comprehending, and processing extensive documents. This approach is vital for semantically segmenting long documents to boost comprehension, facilitate navigation, and improve information retrieval. The advent of Retrieval Augmented Generation (RAG) in document generative AI underscores the significance of hierarchical document structure analysis. The Layout model supports sections and subsections in the output, which identify the relationship of sections and objects within each section. The hierarchical structure is maintained in the elements of each section. You can use the markdown output format to easily get the sections and subsections.

You can specify the order in which the text lines are output with the readingOrder query parameter. Use natural for a more human-friendly reading order output as shown in the following example. This feature is only supported for Latin languages.

For large multi-page documents, use the pages query parameter to indicate specific page numbers or page ranges for text extraction. The following example shows a document with 10 pages, with text extracted for both cases - all pages (1-10) and selected pages (3-6).

The second step is to call the Get Analyze Layout Result operation. This operation takes as input the Result ID that the Analyze Layout operation created. It returns a JSON response that contains a status field with the following possible values.
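The polling pattern can be sketched offline with a stub in place of the real HTTP call. In practice you would GET the operation's result URL with your subscription key; here the stub simply walks through the documented status values:

```python
# Stub that simulates successive Get Analyze Layout Result responses.
statuses = iter(["notStarted", "running", "running", "succeeded"])

def get_analyze_result_stub() -> dict:
    # A real implementation would issue an HTTP GET using the Result ID.
    return {"status": next(statuses)}

def poll(fetch, max_tries: int = 10) -> dict:
    """Call fetch until the operation reaches a terminal status."""
    for _ in range(max_tries):
        result = fetch()
        if result["status"] in ("succeeded", "failed"):
            return result
    raise TimeoutError("analysis did not finish in time")

final = poll(get_analyze_result_stub)
print(final["status"])  # succeeded
```

A real client would also sleep between polls rather than retrying immediately.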

When the status field has the succeeded value, the JSON response includes the extracted layout, text, tables, and selection marks. The extracted data includes extracted text lines and words, bounding boxes, text appearance with handwritten indication, tables, and selection marks with selected/unselected indicated.

The response classifies each text line as handwritten or not, along with a confidence score. This feature is only supported for Latin languages.

The response to the Get Analyze Layout Result operation is a structured representation of the document with all the information extracted. See here for a sample document file and its sample layout output.

Layout API extracts text from documents and images with multiple text angles and colors. It accepts photos of documents, faxes, printed and/or handwritten (English only) text, and mixed modes. Text is extracted with information provided on lines, words, bounding boxes, confidence scores, and style (handwritten or other). All the text information is included in the readResults section of the JSON output.
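The per-word confidence scores make it easy to flag text that may need human review. A sketch over an invented readResults-style structure, with an arbitrary 0.8 threshold:

```python
# Invented readResults data; the nesting (pages -> lines -> words) and the
# per-word confidence field mirror the documented output.
read_results = [
    {"page": 1, "lines": [
        {"text": "Total: 42",
         "words": [
            {"text": "Total:", "confidence": 0.995},
            {"text": "42", "confidence": 0.62},
         ]},
    ]},
]

THRESHOLD = 0.8  # arbitrary review cutoff for this sketch

# Collect every word whose confidence falls below the threshold.
suspect = [w["text"]
           for page in read_results
           for line in page["lines"]
           for w in line["words"]
           if w["confidence"] < THRESHOLD]
print(suspect)  # ['42']
```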
