How to programmatically extract text within a given rectangle (x, y coordinates)?

PDFTron WebViewer

Jun 25, 2019, 4:04:17 PM
to pdfnet-w...@googlegroups.com

Question:

How can we programmatically extract all text within a given rectangle (coordinates on top-left and bottom-right corners)?

Answer:

To extract the text inside a given rectangle, filter the array of character positions by the rectangle's x, y coordinates, then use the indices of the entries that remain to concatenate the matching characters into a string. WebViewer stores the x and y coordinates of each character in an array, and it also stores all text as a single string.
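The filter-then-concatenate idea can be sketched on plain data, without WebViewer. This is only an illustration: `textInRect`, the `rect` object and the sample quads are made-up names and values, and the sketch assumes, as the code in this thread does, that `quads[i]` describes `text[i]`, with `x4`/`y4` as a character's top-left corner and `x2`/`y2` as its bottom-right corner.

```javascript
// Hypothetical helper: returns the characters whose box lies fully inside rect.
// Assumes quads[i] belongs to text[i] and that y grows downward.
function textInRect(text, quads, rect) {
  return quads
    .map((quad, index) => ({ quad, index }))      // remember each original index
    .filter(({ quad }) =>
      quad.x4 >= rect.left && quad.y4 >= rect.top &&
      quad.x2 <= rect.right && quad.y2 <= rect.bottom)
    .map(({ index }) => text[index])              // look up the matching character
    .join('');
}

// Made-up data: four characters on one line, each 10 units wide and 12 high.
const sampleText = 'ABCD';
const sampleQuads = [0, 10, 20, 30].map(x => ({ x4: x, y4: 0, x2: x + 10, y2: 12 }));

// Only B (10..20) and C (20..30) fit between x = 10 and x = 30.
console.log(textInRect(sampleText, sampleQuads, { left: 10, top: 0, right: 30, bottom: 12 })); // "BC"
```

Keeping the index next to each quad before filtering avoids the separate temporary array of indices, but either approach gives the same result.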

A custom function can be built using the PDFTron SDK's low-level API methods, for example the loadPageText and getTextPosition methods of the document instance. Here is one possible solution: you pass the page number and your coordinates within the PDF page, and the function returns the text.

This code shows how to extract text with given coordinates on top-left and bottom-right corners:


viewerElement.addEventListener('documentLoaded', async () => {
  const { docViewer } = viewer.getInstance();
  const doc = docViewer.getDocument();

  const top_x = 310, top_y = 320;
  const bottom_x = 250, bottom_y = 150;
  const pageIndex = 0;

  const text = await extractText(doc, pageIndex, top_x, top_y, bottom_x, bottom_y);
  console.log(text);
})

const extractText = (doc, pageIndex, top_x, top_y, bottom_x, bottom_y) => {
  return new Promise(resolve => {
    doc.loadPageText(pageIndex, text => {
      doc.getTextPosition(pageIndex, 0, text.length, (arr) => {

        // temp array to store the position of characters
        var indies = []

        // filter out array with given x, y coordinates
        arr = arr.filter((item, index) => {
          if (item.x4 >= top_x && item.y4 >= top_y &&
              item.x2 <= (top_x + bottom_x) && item.y2 <= (top_y + bottom_y)) {
            indies.push(index)
            return true;
          }
          return false;
        })

        // concatenate chars into string
        let str = '';
        for (let i = 0, len = indies.length; i < len; i++) {
          str += text[indies[i]];
        }

        // filtered arr can be used for other purposes, e.g. debugging

        // return/resolve concatenated string
        resolve(str)
      });
    });
  });
}

Here is a screenshot showing the result:

Screen Shot 2019-06-05 at 10.19.24 AM.png



Oscar Zhang

Mar 25, 2021, 6:48:55 PM
to PDFTron WebViewer
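The code above seems to have an issue: `extractText` returns some incorrect results, because its condition adds `bottom_x` and `bottom_y` to the top-left corner instead of treating them as the bottom-right corner. Here is the updated code: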

```
viewerElement.addEventListener('documentLoaded', async () => {
  const { docViewer } = viewer.getInstance();
  const doc = docViewer.getDocument();

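  // top-left corner of the rectangle
  const top_x = 310, top_y = 320;
  // bottom-right corner: both values must be larger than top_x and top_y,
  // otherwise the condition in extractText can never match
  // (560, 470 covers the same 250 x 150 area as the first message)
  const bottom_x = 560, bottom_y = 470;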
  const pageIndex = 0;

  const text = await extractText(doc, pageIndex, top_x, top_y, bottom_x, bottom_y);
  console.log(text);
});

const extractText = (doc, pageIndex, top_x, top_y, bottom_x, bottom_y) => {
  return new Promise(resolve => {
    doc.loadPageText(pageIndex, text => {
      doc.getTextPosition(pageIndex, 0, text.length, (arr) => {

        // temp array to store the position of characters
        var indies = [];

        // filter out array with given x, y coordinates
        arr = arr.filter((item, index) => {
          // replace this if statement from the previous message
          // if (item.x4 >= top_x && item.y4 >= top_y && item.x2 <= (top_x + bottom_x) && item.y2 <= (top_y + bottom_y)) {
          // with the one below, which compares against the bottom-right corner directly
          // (x4/y4 is used as the character's top-left corner, x2/y2 as its bottom-right corner)
          if (item.x4 >= top_x && item.y4 >= top_y && item.x2 <= bottom_x && item.y2 <= bottom_y) {
            indies.push(index);
            return true;
          }
          return false;
        });

        // concatenate chars into string
        let str = '';
        for (let i = 0, len = indies.length; i < len; i++) {
          str += text[indies[i]];
        }

        // filtered arr can be used for other purposes, e.g. debugging

        // return/resolve concatenated string
        resolve(str);
      });
    });
  });
};
```
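For readers who prefer `async`/`await` over nested callbacks, here is a minimal sketch of the same approach. It assumes only the two callback-style calls used in this thread, `loadPageText(pageIndex, callback)` and `getTextPosition(pageIndex, start, end, callback)`. The wrappers, `extractTextInRect` and `stubDoc` are hypothetical names introduced for illustration; `stubDoc` is a stand-in so the snippet runs without WebViewer, and in a real page you would pass the document from `docViewer.getDocument()` instead. The helper also accepts the two corners in either order, which guards against swapped values.

```javascript
// Promise wrappers around the callback-style calls used in this thread.
const loadPageText = (doc, pageIndex) =>
  new Promise(resolve => doc.loadPageText(pageIndex, resolve));

const getTextPosition = (doc, pageIndex, start, end) =>
  new Promise(resolve => doc.getTextPosition(pageIndex, start, end, resolve));

// Hypothetical helper: cornerA and cornerB are opposite corners, in any order.
async function extractTextInRect(doc, pageIndex, cornerA, cornerB) {
  const left = Math.min(cornerA.x, cornerB.x);
  const right = Math.max(cornerA.x, cornerB.x);
  const top = Math.min(cornerA.y, cornerB.y);
  const bottom = Math.max(cornerA.y, cornerB.y);

  const text = await loadPageText(doc, pageIndex);
  const quads = await getTextPosition(doc, pageIndex, 0, text.length);

  // Same test as in the updated code above: keep characters fully inside the rectangle.
  let str = '';
  quads.forEach((q, i) => {
    if (q.x4 >= left && q.y4 >= top && q.x2 <= right && q.y2 <= bottom) {
      str += text[i];
    }
  });
  return str;
}

// Stand-in document (not WebViewer): "Hello" on one line, each character 10 units wide.
const stubDoc = {
  loadPageText: (pageIndex, callback) => callback('Hello'),
  getTextPosition: (pageIndex, start, end, callback) =>
    callback(Array.from({ length: end - start }, (_, i) => (
      { x4: i * 10, y4: 0, x2: i * 10 + 10, y2: 12 }
    ))),
};

// Corners are deliberately passed in swapped order.
extractTextInRect(stubDoc, 0, { x: 40, y: 12 }, { x: 10, y: 0 })
  .then(result => console.log(result)); // "ell"
```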
