Re: [pdf-reader] Distinguish between two columns of text, in every page


James Healy

Apr 27, 2013, 3:21:11 AM
to pdf-r...@googlegroups.com
Hi Kostas,

Thanks for the generous feedback.

On 27 April 2013 13:30, kostas pramatias <emp...@gmail.com> wrote:
> Is it possible to take 2 columns of text in a PDF, and find the delimiter
> between them in an easy and quick manner? pdf/reader splits the text exactly
> right, however there are some rare cases where the last word of the left
> column joins that of the right column. Not that big of a deal, but I would
> still like to distinguish between them.

At this stage, pdf-reader doesn't provide a programmatic way to detect
the columns of text.

It is almost certainly possible to fix the issue where some text
appears in the wrong column. If you look at the code in
lib/pdf/reader/page_layout.rb you can improve the algorithm that
places text on the page. The PageLayout class is passed a collection
of strings with X,Y co-ordinates, so improving the algorithm doesn't
require any knowledge of the PDF spec.
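
As an illustration of the kind of change described above, here is a minimal sketch that splits positioned strings into two columns by looking for the widest horizontal gap that no string overlaps. It is not the code in page_layout.rb. The `TextRun` struct and the `column_gutter` and `split_columns` helpers are hypothetical names made up for this example, and it assumes each string comes with an X position and a width.

```ruby
# Hypothetical stand-in for a positioned piece of text; the classes used
# inside pdf-reader may look different.
TextRun = Struct.new(:x, :y, :width, :text) do
  def right_edge
    x + width
  end
end

# Returns the x co-ordinate in the middle of the widest horizontal gap that
# no run overlaps, or nil if the runs leave no gap at all.
def column_gutter(runs)
  spans = runs.map { |r| [r.x, r.right_edge] }.sort
  return nil if spans.empty?

  best_gap = 0.0
  gutter = nil
  covered_to = spans.first[1]

  spans.each do |left, right|
    gap = left - covered_to
    if gap > best_gap
      best_gap = gap
      gutter = covered_to + gap / 2.0
    end
    covered_to = [covered_to, right].max
  end
  gutter
end

# Splits runs into [left_column, right_column] around the gutter.
def split_columns(runs)
  gutter = column_gutter(runs)
  return [runs, []] if gutter.nil?

  runs.partition { |r| r.x < gutter }
end
```

Once the gutter is known, a layout step could refuse to merge two strings that sit on opposite sides of it, which is the "last word joins the right column" case from the question. Pages with a full-width heading would hide the gap, so a real fix would likely need to look at the body lines only.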

> If it's an easy enough question, I would also like to know how to identify
> some rectangles that sit between the text and disrupt it.

Do you mean rectangles that look like: ▯

These are inserted into the output when pdf-reader cannot determine
the Unicode code point for a glyph. Sometimes that's a bug in
pdf-reader and sometimes the PDF is missing required data.
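
To find where such placeholders end up in extracted text, a small helper like the one below may be enough. It is a sketch in plain Ruby that works on any string. It assumes the placeholder is U+25AF, the character shown in the question above, and `placeholder_positions` is a hypothetical name for this example.

```ruby
# U+25AF WHITE VERTICAL RECTANGLE, the character shown in the message above.
PLACEHOLDER = "\u25AF"

# Returns [line_number, column] pairs (both 1-based) for every placeholder
# character found in the given text.
def placeholder_positions(text)
  positions = []
  text.each_line.with_index(1) do |line, line_no|
    line.each_char.with_index(1) do |char, col|
      positions << [line_no, col] if char == PLACEHOLDER
    end
  end
  positions
end
```

The input would be whatever string the text extraction returned for a page. An empty result suggests every glyph on that page was mapped to a code point.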

Can you point out where the rectangles are in your sample document?

James

kostas pramatias

Apr 27, 2013, 11:31:44 AM
to pdf-r...@googlegroups.com


On Saturday, April 27, 2013 10:21:11 AM UTC+3, James Healy wrote:
> Hi Kostas,
>
> Thanks for the generous feedback.
>
> On 27 April 2013 13:30, kostas pramatias <emp...@gmail.com> wrote:
>> Is it possible to take 2 columns of text in a PDF, and find the delimiter
>> between them in an easy and quick manner? pdf/reader splits the text exactly
>> right, however there are some rare cases where the last word of the left
>> column joins that of the right column. Not that big of a deal, but I would
>> still like to distinguish between them.
>
> At this stage, pdf-reader doesn't provide a programmatic way to detect
> the columns of text.
>
> It is almost certainly possible to fix the issue where some text
> appears in the wrong column. If you look at the code in
> lib/pdf/reader/page_layout.rb you can improve the algorithm that
> places text on the page. The PageLayout class is passed a collection
> of strings with X,Y co-ordinates, so improving the algorithm doesn't
> require any knowledge of the PDF spec.

Sounds pretty much like what I want. I will check it out. Thanks for the
quick reply.

>> If it's an easy enough question, I would also like to know how to identify
>> some rectangles that sit between the text and disrupt it.
>
> Do you mean rectangles that look like: ▯

I mean a rectangle with text in it, usually limited to one of the two
columns that the text of every page is divided into.

> These are inserted into the output when pdf-reader cannot determine
> the Unicode code point for a glyph. Sometimes that's a bug in
> pdf-reader and sometimes the PDF is missing required data.
>
> Can you point out where the rectangles are in your sample document?

The rectangle is in the right column of page 2 (numbered page 83 in the
whole document), from the 3rd line to line 13.

> James

/kostas

James Healy

Apr 27, 2013, 9:51:02 PM
to pdf-r...@googlegroups.com
On 28 April 2013 01:31, kostas pramatias <emp...@gmail.com> wrote:
> The 3rd line of the right column (until line 13) in page 2, has a rectangle
> in it, numbered page 83 in the whole document.

Oh, I see what you mean. So you're interested in extracting the text
that's inside those boxes?

pdf-reader provides the low-level tools to do some of that, but you'll
need to write some extra code.

You can write a custom receiver to detect the rectangles - something
like my trim detector[1], which builds paths and can then filter them
for rectangles.
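
A receiver of that kind could look roughly like the sketch below. It is not the trim detector code. It assumes that pdf-reader calls a method named `append_rectangle(x, y, width, height)` on the receiver for the PDF `re` operator. `RectangleReceiver`, `boxes` and the `min_size` threshold are hypothetical names chosen for this example.

```ruby
# A receiver sketch: collects every rectangle drawn with the PDF `re` operator.
# Assumption: pdf-reader calls `append_rectangle(x, y, width, height)` on the
# receiver for that operator. Co-ordinates are recorded as given, without
# applying the current transformation matrix.
class RectangleReceiver
  Rect = Struct.new(:x, :y, :width, :height)

  attr_reader :rectangles

  def initialize
    @rectangles = []
  end

  def append_rectangle(x, y, width, height)
    @rectangles << Rect.new(x, y, width, height)
  end

  # Ignores hairline shapes such as rules; min_size is an arbitrary threshold.
  def boxes(min_size: 20)
    @rectangles.select { |r| r.width.abs >= min_size && r.height.abs >= min_size }
  end
end
```

With the gem installed, usage would be along the lines of `receiver = RectangleReceiver.new`, then `PDF::Reader.new("sample.pdf").page(2).walk(receiver)`, then `receiver.boxes` (the file name is made up). A box drawn as four separate lines rather than with `re` would be missed, which is where building full paths, as the trim detector does, becomes necessary.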

Unfortunately, once you have the co-ordinates there's no way to
extract only the text within a section of the page, but I'm open to
adding arguments to PDF::Reader::Page#text() that will allow it.
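
For illustration only, filtering by region could be as simple as the sketch below once positioned strings are available. This is not an existing pdf-reader feature. `PositionedText`, `Box` and `text_inside` are hypothetical names, and how the positioned strings are obtained is left open.

```ruby
# Hypothetical stand-ins: a string with the position of its origin, and a
# box given by its lower-left corner plus a positive width and height.
PositionedText = Struct.new(:x, :y, :text)

Box = Struct.new(:x, :y, :width, :height) do
  # True when the origin of the run lies inside the box (edges included).
  def cover?(run)
    run.x.between?(x, x + width) && run.y.between?(y, y + height)
  end
end

# Returns the strings whose origin falls inside the box, ordered top to
# bottom and then left to right (PDF y co-ordinates grow upwards).
def text_inside(runs, box)
  runs.select { |run| box.cover?(run) }
      .sort_by { |run| [-run.y, run.x] }
      .map(&:text)
end
```

Testing only the origin of each string keeps the sketch short. A string that starts outside the box and runs into it would be skipped, so a fuller version would compare widths as well.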

James

[1] https://github.com/yob/trimdetector/blob/master/lib/pdf/trim_detector.rb

Kostas Pramatias

Apr 27, 2013, 10:54:10 PM
to pdf-r...@googlegroups.com
On Sun, Apr 28, 2013 at 4:51 AM, James Healy <ja...@yob.id.au> wrote:
> On 28 April 2013 01:31, kostas pramatias <emp...@gmail.com> wrote:
>> The 3rd line of the right column (until line 13) in page 2, has a rectangle
>> in it, numbered page 83 in the whole document.
>
> Oh, I see what you mean. So you're interested in extracting the text
> that's inside those boxes?

Yes.

> pdf-reader provides the low-level tools to do some of that, but you'll
> need to write some extra code.
>
> You can write a custom receiver to detect the rectangles - something
> like my trim detector[1], which builds paths and can then filter them
> for rectangles.
>
> Unfortunately, once you have the co-ordinates there's no way to
> extract only the text within a section of the page, but I'm open to
> adding arguments to PDF::Reader::Page#text() that will allow it.

But it is probably not worth the hassle. So the rectangle is painted
around the text.

I thought it might be easier to just find the start and the end of a tag,
which I assumed would be there. I can probably infer the start and the
end of the text that is enclosed in there from the plain text. :)

pdf/reader preserves the indentation of the original PDF better than any
other mostly automated tool I have used, and that solves many problems.

> James
>
> [1] https://github.com/yob/trimdetector/blob/master/lib/pdf/trim_detector.rb
