does ps/pdf render char-by-char ?

no-to...@motz.invalid

Apr 29, 2009, 10:59:54 PM
Via http://citeseer.ist.psu.edu/hughes95design.html
I got a *.ps file. And I've got the corresponding *pdf.
So I guess I did a ps to pdf conversion.
And it's marked as having been originated from latex/LLNCS.
But I can't extract any of the ascii.

If ps/pdf renders left to right & top to bottom, ie char-by-char,
then if the rendering could be single-stepped, then the
ascii-char corresponding to the last-rendered-char could be
manually given.
Thereby teaching the system the render-to-ascii translation.

Presumably each render of the same char uses the same data;
so there are only about 100 char-renders, including italics
and 2 sizes ?

So then the learning process would write what it's been
taught and stop for new input ?
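
The teaching loop described above could be sketched roughly like this in Python. Everything here is hypothetical: `glyphs` stands for whatever per-character render data a single-stepped interpreter might expose (plain bytes here), `ask` stands for the human typing in the matching ASCII character, and `table` is the learned translation. It also takes on trust the guess that the same char is always rendered from identical data.

```python
from typing import Callable, Dict, Iterable


def transcribe(glyphs: Iterable[bytes],
               ask: Callable[[bytes], str],
               table: Dict[bytes, str]) -> str:
    """Turn rendered glyphs into text, stopping to ask only for unseen glyphs."""
    out = []
    for glyph in glyphs:
        if glyph not in table:
            # New render data: stop and let the teacher name the character.
            table[glyph] = ask(glyph)
        out.append(table[glyph])
    return "".join(out)
```

With about 100 distinct char-renders the teacher would be asked about 100 times, and never twice for the same render data.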

What's wrong with this idea ?

== TIA.

Robert Bonomi

Apr 30, 2009, 8:09:00 PM
In article <gtb479$brv$1...@news.eternal-september.org>,


Well, for starters, your original assumption is invalid. :)
It is *not* guaranteed that PS/pdf renders L->R and T->B.
A page is a page, and you can write stuff to it in any
position at _any_ time. Text written 'as strings' _is_
generally written in the native direction of the language
but even that is *not* guaranteed.
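
One practical consequence: a tool that wants reading order cannot rely on the order in which text is written in the file, and has to sort by where the text landed on the page. A minimal Python sketch, assuming hypothetical fragments of the form (x, y, text) in default PostScript user space (origin at the bottom left, y growing upward) and a made-up `line_tol` for deciding that two fragments share a line. It only covers left-to-right text.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text); hypothetical record


def reading_order(frags: List[Fragment], line_tol: float = 2.0) -> str:
    """Rebuild top-to-bottom, left-to-right text from fragments in any order."""
    lines: List[List[Fragment]] = []
    # Highest y first, i.e. top of the page first.
    for frag in sorted(frags, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tol:
            lines[-1].append(frag)   # close enough to the current line
        else:
            lines.append([frag])     # start a new line
    # Within each line, order fragments by x.
    return "\n".join(
        " ".join(text for _, _, text in sorted(line)) for line in lines
    )


frags = [(100, 700, "world"), (50, 680, "second"),
         (50, 700, "hello"), (120, 681, "line")]
print(reading_order(frags))
```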

luserXtrog

May 1, 2009, 2:27:43 AM
On Apr 30, 7:09 pm, bon...@host122.r-bonomi.com (Robert Bonomi) wrote:
> In article <gtb479$br...@news.eternal-september.org>,
>
>
>
>  <no-topp...@motz.invalid> wrote:
> >Via http://citeseer.ist.psu.edu/hughes95design.html

Another complication is the fact that the pdf file has undergone
at least three stages of processing in an attempt to optimize
the appearance of the text. LaTeX can be a front-end to TeX which
processes to (IIRC) dvi and then to ps. The postscript files
produced by this chain often contain manually-kerned text where
each word is chopped into pieces to squeeze the little ells
closer together. It is occasionally possible to recover the text
(with loss of formatting and occasional bizarre artifacts from
bits of string that happened to be present in the file but were
not intended to be rendered).

But the pdf might not actually contain text in any recognizable
form (until rendered). It could contain a compressed image of
a scanned printout. No ASCII in sight!

--
lxt

Ross Presser

May 3, 2009, 4:19:38 PM
On Apr 29, 10:59 pm, no-topp...@motz.invalid wrote:
> Via http://citeseer.ist.psu.edu/hughes95design.html

> I got a *.ps file. And I've got the corresponding *pdf.
> So I guess I did a ps to pdf conversion.
> And it's marked as having been originated from latex/LLNCS.
> But I can't extract any of the ascii.

If you're referring to "The Design of a Pretty-printing Library" by
John Hughes, which is the page I landed at when I followed your
link ... the Postscript file available for download there *DOES* have
recognizable English text in it:

1 0 bop 349 194 a Fx(The)22 b(Design)g(of)g(a)g(Prett)n(y-prin)n(ti)q
(ng)j(Library)821 339 y Fw(John)14 b(Hughes)528 426 y
Fv(Chalmers)g(T)m(eknisk)n(a)h(H\177)-19 b(ogsk)o(ola,)14
b(G\177)-19 b(oteb)q(org,)13 b(Sw)o(eden.)183 565 y Fu(1)56
b(In)n(tro)r(duction)183 670 y Fw(On)17 b(what)h(do)q(es)g(the)g(p)q
(o)
o(w)o(er)g(of)f(functional)f(programming)e(dep)q(end?)19

which after running through ps2ascii yields:

The Design of a Pretty-printing Library

John Hughes Chalmers Tekniska H"ogskola, G"oteborg, Sweden.

1 Introduction On what does the power of functional programming
depend? Why are functional programs so often a fraction of the size of
equivalent programs in other languages? Why are they so easy to write?
I claim: because functional languages support software reuse extremely
well.

Programs are constructed by putting program components together. When
we discuss reuse, we should ask
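
For illustration, here is a rough Python sketch of how the strings in the excerpt above could be glued back together. The operator meanings are an assumption inferred only from comparing the excerpt with the ps2ascii output: one-letter names `b` to `k` appear to show a string and then skip a word-sized gap (with `b` taking the gap as a number, negative when an accent is overstruck), `l` to `t` appear to show a string followed by a tiny kern, and `a`/`y` appear to jump to a new position. The names `extract_text`, `decode` and `SAMPLE` are made up for this sketch, and mapping `\177` to `"` merely mimics what ps2ascii printed.

```python
import re

# A token is a parenthesised string, a signed integer, or a bare name.
TOKEN = re.compile(r"\((?:\\.|[^()\\])*\)|-?\d+|[A-Za-z]+")

WORD_GAP = set("bcdefghijk")  # assumed: show string, then a word-sized move
KERN = set("lmnopqrst")       # assumed: show string, then a tiny kerning move
MOVE = set("ay")              # assumed: jump to an absolute position


def decode(s: str) -> str:
    """Resolve octal escapes; other escapes just lose the backslash."""
    def repl(m):
        esc = m.group(1)
        return chr(int(esc, 8)) if esc[0] in "01234567" else esc
    # Assumption: code 0x7f is the dieresis, shown as '"' like ps2ascii does.
    return re.sub(r"\\([0-7]{1,3}|.)", repl, s).replace("\x7f", '"')


def extract_text(ps: str) -> str:
    out = []
    pending = None  # most recent string not yet emitted
    number = None   # most recent integer operand
    delta = 0       # last explicit gap handed to `b`
    for tok in TOKEN.findall(ps):
        if tok[0] == "(":
            pending = decode(tok[1:-1])
        elif tok.lstrip("-").isdigit():
            number = int(tok)
        else:
            if tok in WORD_GAP | KERN | MOVE and pending is not None:
                out.append(pending)
                pending = None
            if tok == "b" and number is not None:
                delta = number
            # A positive gap or a jump ends the word; kerns do not.
            if tok in MOVE or (tok in WORD_GAP and delta > 0):
                out.append(" ")
            number = None
    return " ".join("".join(out).split())


SAMPLE = r"""1 0 bop 349 194 a Fx(The)22 b(Design)g(of)g(a)g(Prett)n(y-prin)n(ti)q
(ng)j(Library)821 339 y Fw(John)14 b(Hughes)528 426 y
Fv(Chalmers)g(T)m(eknisk)n(a)h(H\177)-19 b(ogsk)o(ola,)14
b(G\177)-19 b(oteb)q(org,)13 b(Sw)o(eden.)183 565 y"""

print(extract_text(SAMPLE))
# The Design of a Pretty-printing Library John Hughes Chalmers Tekniska H"ogskola, G"oteborg, Sweden.
```

This only works while the strings are still stored as readable character codes, which is exactly what a re-encoding step would take away.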

Therefore I submit that (if you are indeed talking about this file) it
is your ps-to-pdf "distilling" process which re-encoded the ASCII
strings with new text encodings. This is something that I believe
Adobe Distiller often does, in the interest of producing a smaller
file, unless it is told not to.
