In most cases PDFs don't store paragraph/line information. AFAIK there
page using PDF.JS. While the page is rendered, the textLayer's
appendText function is called. From the passed-in `geom` object, you
can get information about the position of some text on the canvas.
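
A minimal sketch of that idea, under stated assumptions: `createGeomCollector` is a hypothetical helper (not part of PDF.js) that stands in for the text layer and records every `geom` object handed to its `appendText`. It assumes `geom` arrives as the first argument. How such an object is wired into rendering depends on the PDF.js version and is not shown here.

```javascript
// Hypothetical collector standing in for the text layer that is called
// while a page renders. Each geom is shallow-copied as-is, so no
// particular geom field names are assumed.
function createGeomCollector() {
  var items = [];
  return {
    items: items,
    // Assumed to be called once per piece of text, with geom first.
    appendText: function (geom) {
      var copy = {};
      for (var key in geom) {
        if (Object.prototype.hasOwnProperty.call(geom, key)) {
          copy[key] = geom[key];
        }
      }
      items.push(copy);
    }
  };
}
```

After rendering has finished, `collector.items` would hold one entry per `appendText` call, which can then be inspected to see which position fields are actually available.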
On Fri Dec 7 17:19:03 2012, Julien Bourdon wrote:
> Julian,
>
> Thank you for the answer.
>
> Let me explain what I am trying to achieve in more detail. If the explanation is too long to read, there is a short version at the bottom of this post. If you want the explanation with a figure, I posted this question on StackOverflow some time ago:
> http://stackoverflow.com/questions/13497639/scanned-and-text-pdf-parallel-scrolling-and-selection-in-web-application
>
> Basically, I have two sets of PDFs in Malay: one as text in Latin characters and another as scanned images in Arabic characters. I want to implement some text search tools on the Arabic PDF by using the Latin transcription.
>
> In practice, I would like to work at the paragraph level, two corresponding paragraphs having roughly the same position in both documents. In other words, if a user clicks on the canvas containing the Arabic document at (x,y), it is very likely that the corresponding paragraph in the Latin transcript will be located at (canvasWidth-x,y), since Arabic is an RTL script.
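
The mirroring described above can be sketched as a one-line coordinate transform. `mirrorPoint` is a hypothetical helper name, and the sketch assumes both canvases have the same width and the same scale.

```javascript
// Map a click at (x, y) on the Arabic (RTL) canvas to the likely position
// on the Latin (LTR) canvas by flipping the x coordinate.
// Assumption: both canvases share the same width and zoom level.
function mirrorPoint(point, canvasWidth) {
  return { x: canvasWidth - point.x, y: point.y };
}
```

For example, with a 600 pixel wide canvas, a click at (100, 40) maps to (500, 40).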
>
> In order to do that, I would need to store the list of paragraphs in the Latin transcription document and their corresponding bounding boxes. As information about the paragraph division is not stored in the PDF, I need to check whether each line starts with an indent to detect where a new paragraph begins.
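
One possible way to do this indent check, as a sketch only. It assumes the lines are already available as objects of the form `{ str, x, y, width, height }` in canvas coordinates (origin top-left, y growing downward), sorted top to bottom. That shape, the helper names, and the `indentTolerance` parameter are all assumptions for illustration.

```javascript
// Smallest box that contains both a and b.
function unionBox(a, b) {
  var left = Math.min(a.x, b.x);
  var top = Math.min(a.y, b.y);
  var right = Math.max(a.x + a.width, b.x + b.width);
  var bottom = Math.max(a.y + a.height, b.y + b.height);
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// Group lines into paragraphs. A line opens a new paragraph when it starts
// more than indentTolerance to the right of the left margin (the smallest
// x among all lines). The first line always opens a paragraph.
function groupParagraphs(lines, indentTolerance) {
  if (lines.length === 0) {
    return [];
  }
  var margin = Math.min.apply(null, lines.map(function (l) { return l.x; }));
  var paragraphs = [];
  lines.forEach(function (line, i) {
    var indented = line.x - margin > indentTolerance;
    if (i === 0 || indented) {
      paragraphs.push({
        text: line.str,
        box: { x: line.x, y: line.y, width: line.width, height: line.height }
      });
    } else {
      var current = paragraphs[paragraphs.length - 1];
      current.text += ' ' + line.str;
      current.box = unionBox(current.box, line);
    }
  });
  return paragraphs;
}
```

This heuristic only works for documents that mark paragraphs with a first-line indent. Documents that separate paragraphs by vertical spacing would need a check on the gap between consecutive lines instead.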
>
> I managed to extract the text and write it to a div using the jQuery code below (complete code available here: http://pastebin.com/N3jJi8KW ):
>
> page.getTextContent().then(function(text){
>   // Collect the string of every text item and join them with spaces.
>   extractedString = $.makeArray($(text.bidiTexts).map(function(index, item){return item.str;})).join(' ');
>   $('div#extractedText').text(extractedString);
> });
>
> Now I would need to get the bounding box of each line or, if that is not possible, reconstruct it from the bounding box of each character. The problem is that I do not know where to get this information, even though I know where it is done in the source code. I tried to find a way to access the geom object, without success. I suspected that I could access the geom object through the renderer, but all I get is a promise object with no data.
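
If only per-character boxes were available, the line boxes could be rebuilt roughly as follows. This is a sketch under assumptions: each character box has the hypothetical shape `{ x, y, width, height }` in canvas coordinates, and characters whose `y` differs by at most `yTolerance` from the first character of a line belong to that line.

```javascript
// Rebuild one bounding box per text line from per-character boxes.
// Characters are assigned to the first line whose reference y (the y of
// the character that opened the line) is within yTolerance.
function lineBoxesFromChars(chars, yTolerance) {
  var lines = [];
  chars.forEach(function (c) {
    var line = null;
    for (var i = 0; i < lines.length; i++) {
      if (Math.abs(lines[i].refY - c.y) <= yTolerance) {
        line = lines[i];
        break;
      }
    }
    if (line === null) {
      lines.push({
        refY: c.y,
        left: c.x,
        top: c.y,
        right: c.x + c.width,
        bottom: c.y + c.height
      });
    } else {
      line.left = Math.min(line.left, c.x);
      line.top = Math.min(line.top, c.y);
      line.right = Math.max(line.right, c.x + c.width);
      line.bottom = Math.max(line.bottom, c.y + c.height);
    }
  });
  // Convert the accumulated extents to { x, y, width, height } boxes.
  return lines.map(function (l) {
    return { x: l.left, y: l.top, width: l.right - l.left, height: l.bottom - l.top };
  });
}
```

The tolerance is needed because characters on the same line do not always report exactly the same vertical position. A value of a few pixels is a plausible starting point, to be tuned per document.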
>
> ----------------
> SHORT VERSION
>
> Is there a way to get the bounding box of each line of a PDF document in an array, similar to the way getTextContent() returns the text content of a PDF?
>
> Thank you in advance for your answer.
>
> Julien.
>
> On Thursday, 6 December 2012 06:28:04 UTC+9, Julian Viereck wrote: