>I am new to PDF. I need something much simpler (I hope) than everything
>I've been reading here.
>I need to manipulate the text that will be printed,
>for example remove empty spaces, delete lines, etc.
No, this isn't simple at all. It is somewhere between enormously
complicated and impossible. PDF files don't have text flow like a Word
document; everything is in its place. Removing space really means
rearranging everything else.
----------------------------------------
Aandi Inston
How about simply extracting text to a file where the text can then be manipulated?
Depends on how the PDF was created, how much formatting you intend to
retain, and on the phase of the moon.
OK, maybe not the phase of the moon.
The simplest way to start is to try copying from Acrobat Reader and
pasting into something that understands RTF on the clipboard (MS Word,
WordPad, Excel, OpenOffice.org, etc.). This often does a very good job
of picking up the font name and size. Even when the font needed isn't
installed on your computer, you can at least see what it was named.
pdftohtml (http://pdftohtml.sourceforge.net) can extract text and save it
in an HTML or XML format. It does an okay job of retaining
positioning, a fair-to-poor job of merging paragraphs, and a terrible
job of retaining font name and size information.
Depending on what created the PDF, though, the text may have been
encoded in such a way that it is difficult to translate back into
readable text. There's not a whole lot you can do about this: no
programs have surfaced yet that can properly trace back the chain
of glyph encodings, although it's theoretically possible.
There are commercial programs that claim to convert to Word format. I
haven't tried them.
Thanks for the response.
I have tens of different forms. I need to print them from a (Java)
client.
Each form has a template. Before printing I will add dynamic data (like
customer names -> max length = 50). According to the length of the
name and the quantity of names (in some forms each name will take up
one line), I must move the template text so as not to have empty lines
or empty spaces between the end of the name and the remainder of the
text.
>>> How about simply extracting text to a file where the text can then be manipulated?
If you mean manually, then this will not do. It must be done programmatically.
>>>To remove space is really a question of rearranging everything else.
Yes, but do I need to do this in my own code? Does Acrobat have the
ability to retain the positioning of certain elements, like the location
of the date at the top left, regardless of what comes before it on the
same line?
Or to remove elements if no data has been added to them, and so on?
>I have tens of different forms. I need to print them from a (Java)
>client.
>Each form has a template. Before printing I will add dynamic data (like
>customer names -> max length = 50). According to the length of the
>name and the quantity of names (in some forms each name will take up
>one line), I must move the template text so as not to have empty lines
>or empty spaces between the end of the name and the remainder of the
>text.
This sounds like a perfect application for the new Designer (XFA)
forms which can be fully dynamic. However, you will have to use Adobe
technology to fill them, as most or all third party form tools won't
touch them (or will break them).
Otherwise, this is not a good use of templates at all. Moving the
template text is much more work than creating the entire PDF with the
layout required.
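
To illustrate what "creating the entire PDF with the layout required"
could mean in practice, here is a minimal Java sketch that computes the
position of every line itself and simply skips names that were not
filled in, so no blank line is ever left behind. The class and method
names (FormLayout, PlacedLine, layout) and the coordinates are
hypothetical, chosen only for this example. It handles vertical
stacking only; closing gaps within a line would additionally need font
metrics to measure the width of each name.

```java
import java.util.ArrayList;
import java.util.List;

public class FormLayout {

    /** One line of text at an absolute page position (PDF points, origin at bottom-left). */
    public static final class PlacedLine {
        public final double x;
        public final double y;
        public final String text;

        PlacedLine(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Places the header lines, then one line per non-empty name, then the
     * footer lines. Each placed line moves the cursor down by `leading`,
     * so an empty name consumes no vertical space.
     */
    public static List<PlacedLine> layout(List<String> header, List<String> names,
                                          List<String> footer,
                                          double left, double top, double leading) {
        List<PlacedLine> placed = new ArrayList<>();
        double y = top;
        for (String line : header) {
            placed.add(new PlacedLine(left, y, line));
            y -= leading;
        }
        for (String name : names) {
            if (name == null || name.trim().isEmpty()) {
                continue; // nothing to print: do not reserve a line for it
            }
            placed.add(new PlacedLine(left, y, name.trim()));
            y -= leading;
        }
        for (String line : footer) {
            placed.add(new PlacedLine(left, y, line));
            y -= leading;
        }
        return placed;
    }
}
```

The resulting list can then be handed to whatever PDF-writing code or
library you use; the point is only that the layout is decided before
the PDF exists, rather than by moving text in an existing template.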
>Yes, but do I need to do this in my own code? Does Acrobat have the
>ability to retain the positioning of certain elements, like the location
>of the date at the top left, regardless of what comes before it on the
>same line?
Every element in a PDF (possibly down to each character) is
independently positioned. If you want to do this from code (you didn't
mention that in the original post) and you know where every character
belongs, you can indeed move stuff around. There is no text flow,
repagination, any of that stuff.
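
To make "independently positioned" concrete: in a page's content
stream, each piece of text is typically shown by a short operator
sequence that selects a font (Tf), moves to a position (Td) and shows a
string (Tj), all between BT and ET. The Java sketch below only builds
such a sequence as a string. The class and method names are
hypothetical, and it assumes the font resource name you pass (for
example F1) is defined in the page's resources and that the text fits
the font's simple encoding; it does not write a complete PDF file.

```java
import java.util.Locale;

public class ContentStreamSketch {

    /** Escapes the characters that are special inside a PDF literal string. */
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)");
    }

    /**
     * Builds the operators that show `text` at (x, y) in the given font and size.
     * Because each BT starts with a fresh text matrix, Td here acts as an
     * absolute position in user space.
     */
    public static String showText(String fontName, double size,
                                  double x, double y, String text) {
        // Locale.ROOT keeps the decimal separator a dot, as PDF requires.
        return String.format(Locale.ROOT, "BT /%s %.2f Tf %.2f %.2f Td (%s) Tj ET",
                fontName, size, x, y, escape(text));
    }
}
```

Nothing in that sequence says "this line follows the previous one", so
moving one line means rewriting the coordinates of every line after it
yourself.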
As usual, Aandi gives sound advice.
The task is one I usually call making a 'Fair Copy' using the workflow:
1. Design blank form
2. User fills in form and submits to server
3. Data stored on server (i.e., validated and stored in a database)
4a. 'Fair Copy' version (complete PDF file) of filled-in data generated
on server
4b. 'Fair Copy' returned to user.
This is in contrast to the usual stage 4, where the original Form
itself, or just the data payload (in FDF format) is returned to the
user.
The fair copy can be calculated, on the server, in a variety of ways,
among which are:
FDF (or SQL) --> XML --> XSL-FO --> PDF
FDF (or SQL) --> XML --> TeXML --> LaTeX --> PDF
(The second of these is rather slow, but has superior typography, so
you may wish to send the result back by email. For in-house or
low-bandwidth use, you can send the LaTeX back to the browser
immediately and let the user save and compile it themselves.)
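
As a rough illustration of the XML --> XSL-FO step in the first
pipeline, the Java sketch below uses the standard javax.xml.transform
API to turn form data into XSL-FO, emitting one block per non-empty
name so the text after the names closes up automatically. The element
names (form, name), the page dimensions and the class name FairCopyFo
are assumptions made up for this example. Turning the resulting FO into
PDF needs a separate FO processor, which is not shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FairCopyFo {

    // Hypothetical stylesheet: one fo:block per non-empty <name>, then the fixed text.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
      + "<xsl:template match='/form'>"
      + "<fo:root>"
      + "<fo:layout-master-set>"
      + "<fo:simple-page-master master-name='page'"
      + " page-height='297mm' page-width='210mm' margin='20mm'>"
      + "<fo:region-body/>"
      + "</fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference='page'>"
      + "<fo:flow flow-name='xsl-region-body'>"
      + "<xsl:for-each select=\"name[normalize-space(.) != '']\">"
      + "<fo:block><xsl:value-of select='normalize-space(.)'/></fo:block>"
      + "</xsl:for-each>"
      + "<fo:block>Remainder of the form text.</fo:block>"
      + "</fo:flow>"
      + "</fo:page-sequence>"
      + "</fo:root>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Transforms the form data XML into an XSL-FO document, returned as a string. */
    public static String toFo(String formXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(formXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Because the FO processor does the line and page breaking, the
stylesheet never has to mention coordinates, which is exactly what the
template-moving approach lacks.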
Another alternative is PrinceXML: http://www.princexml.com/overview/