cropping with pdftotext

LC's No-Spam Newsreading account

Jun 11, 2012, 10:03:02 AM
I am trying to automate the extraction of pieces of text from PDF
documents using pdftotext.

Unfortunately the original PDFs are rather messy (sort of multicolumn
tables with some rows spanning several columns), and so the resulting text
file is messy too.

I thought it would be easier to crop the page (since the pieces are at
fixed locations), and I see that pdftotext has -x -y -H -W options ...

... but the help file does not say which units -x and -y are in!
Any suggestions?

tlvp

Jun 11, 2012, 9:05:06 PM
Default PS units are points, I suppose, but experiment: find out whether
points, inches, mm, cm, thousandths of an inch, or something else seem to be
the units :-) .

HTH. Cheers, -- tlvp
--
Before replying, throw out the trash, please.

LC's No-Spam Newsreading account

Jun 12, 2012, 3:36:55 AM
On Mon, 11 Jun 2012, tlvp wrote:
> On Mon, 11 Jun 2012 16:03:02 +0200, LC's No-Spam Newsreading account wrote:

>> fixed locations), I see that pdftotext has -x -y -H -W options ...
>>
>> ... but the help files does not tell which units are -x and -y in !
>> Any suggestion ?
>
> Default PS units are points, I suppose, but experiment:

The only thing the help file says is that they are integers, so I
expected points or pixels, however defined. I did try to experiment before
posting, but my first experiments were

pdftotext -raw -f 1 -l 1 -x 0 -y 0
pdftotext -raw -f 1 -l 1 -x 1 -y 0

(i.e. I moved the origin by 1 unit in x). The result is that 0,0 gives
the entire file, while a shift of 1 unit in x gives an empty file (well, it
contains a single linefeed).

The sample PDF files look like this:
http://orari.atm-mi.it/M1_540.pdf



Thomas Kaiser

Jun 12, 2012, 5:39:36 AM
tlvp wrote in <news:1r1fjmahf14qm$.r1605djd...@40tude.net>
It seems they are pixels [1], and they depend on the -r switch (resolution),
which defaults to 72, so by default pixels are identical to PostScript points
(1/72 inch).

Regards,

Thomas

[1] I googled for 'pdftotext crop area diff' and had a quick look at
this patch:
http://lists.freedesktop.org/archives/poppler/2009-March/004481.html
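
If the units really are pixels at the -r resolution, as the patch suggests, a crop box measured in points can be converted into -x/-y/-W/-H values with a little arithmetic. The following is only a rough Python sketch under that assumption. The helper names crop_args and pdftotext_command are made up for illustration, and the box is assumed to be measured from the top-left corner of the page. Note that it always passes -W and -H explicitly; leaving them at their default of 0 may be why the -x 1 experiment above produced an empty file, though that is a guess.

```python
import math

POINTS_PER_INCH = 72  # PostScript points per inch


def crop_args(x_pt, y_pt, width_pt, height_pt, dpi=72):
    """Turn a crop box in points into integer pixel values for pdftotext.

    Hypothetical helper: assumes -x/-y/-W/-H are pixels at the -r resolution.
    """
    def to_px(points):
        return points * dpi / POINTS_PER_INCH

    # Round the top-left corner down and the far edges up, so the
    # integer box is never smaller than the requested one.
    x = math.floor(to_px(x_pt))
    y = math.floor(to_px(y_pt))
    w = math.ceil(to_px(x_pt + width_pt)) - x
    h = math.ceil(to_px(y_pt + height_pt)) - y
    return ["-r", str(dpi), "-x", str(x), "-y", str(y),
            "-W", str(w), "-H", str(h)]


def pdftotext_command(pdf_path, page, box_pt, dpi=72):
    """Build an argument list that extracts one cropped page to stdout."""
    x_pt, y_pt, width_pt, height_pt = box_pt
    return (["pdftotext", "-raw", "-f", str(page), "-l", str(page)]
            + crop_args(x_pt, y_pt, width_pt, height_pt, dpi)
            + [pdf_path, "-"])  # "-" asks pdftotext to write to stdout


if __name__ == "__main__":
    # A box 200 x 100 points, half an inch from the left and one inch from the top.
    print(" ".join(pdftotext_command("M1_540.pdf", 1, (36, 72, 200, 100))))
```

The resulting list can be handed to subprocess.run once pdftotext is installed. At the default 72 dpi the numbers are just the point values, which makes it easy to check the pixel hypothesis against a known position in the page.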
