Hi Kostas,
Thanks for the generous feedback.
On 27 April 2013 13:30, kostas pramatias <emp...@gmail.com> wrote:
> Is it possible to take 2 columns of text in a pdf, and find the delimiter in
> them in an easy and quick manner? pdf/reader splits the text exactly right,
> however there are some rare cases that the last word of the left column joins
> that of the right column. Not that big of a deal, but still i would like to
> distinguish between them.
At this stage, pdf-reader doesn't provide a programmatic way to detect
the columns of text.
It is almost certainly possible to fix the issue where some text
appears in the wrong column. If you look at the code in
lib/pdf/reader/page_layout.rb you can improve the algorithm that
places text on the page. The PageLayout class is passed a collection
of strings with X,Y co-ordinates, so improving the algorithm doesn't
require any knowledge of the PDF spec.
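As a rough starting point, here is a sketch of one way to find the gap between two columns from positioned strings alone. It does not use PageLayout's real internals: the Run struct and the gutter_x and split_columns helpers are hypothetical names made up for illustration, and it assumes each string carries a known width and that the widest uncovered horizontal gap on the page is the column gutter.

```ruby
# Hypothetical stand-in for a positioned string; not pdf-reader's own class.
Run = Struct.new(:x, :y, :width, :text)

# Returns the X position in the middle of the widest horizontal gap that no
# run covers, or nil when the runs leave no gap at all.
def gutter_x(runs)
  return nil if runs.empty?

  # Horizontal extent of every run, sorted by left edge.
  spans = runs.map { |r| [r.x, r.x + r.width] }.sort
  covered_to = spans.first[1]
  best_start = nil
  best_width = 0

  spans.each do |left, right|
    gap = left - covered_to
    if gap > best_width
      best_width = gap
      best_start = covered_to
    end
    covered_to = right if right > covered_to
  end

  best_start && best_start + best_width / 2.0
end

# Splits runs into [left_column, right_column] around the gutter.
def split_columns(runs, gutter)
  runs.partition { |r| r.x < gutter }
end

# Made-up sample data: two lines of text, two columns each.
runs = [
  Run.new(50,  700, 200, "The quick"),
  Run.new(300, 700, 200, "brown fox"),
  Run.new(50,  688, 180, "jumps over"),
  Run.new(300, 688, 190, "the lazy dog")
]

gutter = gutter_x(runs)                 # => 275.0
left, right = split_columns(runs, gutter)
left.map(&:text)                        # => ["The quick", "jumps over"]
right.map(&:text)                       # => ["brown fox", "the lazy dog"]
```

Gaps between words inside a column tend to fall at different X positions on each line, so they get covered once all lines are merged, while the gutter stays empty. A page with a full-width heading would defeat this simple version.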
> If it's an easy enough question i would like to know too, how to identify
> some rectangles that are between the text and they're disrupting the text.
Do you mean rectangles that look like: ▯
These are inserted into the output when pdf-reader cannot determine
the unicode code-point for a glyph. Sometimes that's a bug in
pdf-reader and sometimes the PDF is missing required data.
Can you point out where the rectangles are in your sample document?
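If it helps to locate them, a small sketch like the following could list where the rectangles fall in the extracted text. It assumes the marker is U+25AF (the ▯ character above) and that you pass in text you have already extracted, for example from a page's text; unknown_glyph_positions is a hypothetical helper, not part of pdf-reader.

```ruby
# Assumption: the placeholder in the output is U+25AF, the rectangle shown above.
UNKNOWN_GLYPH = "\u25AF"

# Returns [line_number, column] pairs (both 1-based) for every placeholder
# rectangle found in the given text.
def unknown_glyph_positions(text)
  hits = []
  text.each_line.with_index(1) do |line, line_no|
    line.chomp.each_char.with_index(1) do |ch, col|
      hits << [line_no, col] if ch == UNKNOWN_GLYPH
    end
  end
  hits
end

# Made-up sample text with one placeholder on the second line.
sample = "first line\nsec\u25AFnd line\n"
unknown_glyph_positions(sample)   # => [[2, 4]]
```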
James