
Extract equations from a PDF file


hor...@nospamum.es

Sep 22, 2004, 12:39:29 PM
Hello everybody,

could somebody tell me if it is possible to extract equations from a paper
in a PDF file? My objective is to paste them into a TeX file and then render
it to a DVI file, to achieve the same result as in the original PDF file
(I understand papers are usually written as TeX files and then rendered to
DVI or PDF files). I know I can do this with plain text, but I do not know
how to handle equations (although it is possible to copy
them as images and then paste them).

Thanks in advance.

Larry T.

Sep 22, 2004, 4:21:39 PM
Hi,

Due to the nature of PDFs it would be difficult to do what you want. Your best
bet would be to use an OCR program to retrieve the equations. You are right,
it is possible to copy them as images because basically, they are. An OCR
package may help, but equations are even more of a problem than plain text.

Larry T.

Ryan Wheeler

Sep 22, 2004, 7:19:58 PM
<hor...@NOSPAMum.es> wrote:

there is a program called pdf editor. would that do?

Fajar Suryawan

Sep 22, 2004, 10:29:05 PM
<hor...@NOSPAMum.es> wrote in message news:<cis9ma$svj$1...@unida.um.es>...

> Hello everybody,
>
> could somebody tell me if it is possible to extract equations from a paper
> in a PDF file ?

Simple answer: you can't.
PDF is a printer language. It contains no information about the mathematics'
original LaTeX code.
Reverse engineering is nice, but not easy.

F

Alan Connor

Sep 22, 2004, 11:05:10 PM
On 22 Sep 2004 19:29:05 -0700, Fajar Suryawan
<faja...@yahoo.com> wrote:

pstotext produces text files from ps or pdf that are far from
perfect, but it may help the OP.

pdf2ps would turn the file into PS, which is at least ASCII,
and easier to do stuff with.


AC
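
A minimal sketch of the two-step route suggested above (pdf2ps first, then pstotext), driven from Python. It assumes both tools are on the PATH, that pdf2ps takes an input and an output file name, and that pstotext prints the recovered text on standard output. The function names and the file name `paper.pdf` are made up for illustration. As noted in the post, the result is plain text and is far from perfect.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_commands(pdf_path: str) -> List[List[str]]:
    """Return the two command lines: PDF -> PostScript, then PostScript -> text."""
    ps_path = str(Path(pdf_path).with_suffix(".ps"))
    return [
        ["pdf2ps", pdf_path, ps_path],  # Ghostscript wrapper, writes ps_path
        ["pstotext", ps_path],          # assumed to print the text on stdout
    ]


def extract_text(pdf_path: str) -> Optional[str]:
    """Run both tools if they are installed; return the text, or None if not."""
    commands = build_commands(pdf_path)
    if any(shutil.which(cmd[0]) is None for cmd in commands):
        return None  # at least one of the tools is missing on this machine
    subprocess.run(commands[0], check=True)
    result = subprocess.run(commands[1], check=True,
                            capture_output=True, text=True)
    return result.stdout
```

Calling `extract_text("paper.pdf")` would return whatever text pstotext manages to recover; equations typically come out as scrambled symbols rather than TeX.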


Alexander Skwar

Sep 23, 2004, 12:42:37 AM
Larry T. wrote:
> Hi,
>
> Due to the nature of PDF's it would be difficult to do what you want.

Why, then, is Acrobat Reader v6 on Windows able to copy text (and equations)?

Alexander Skwar
--
There are weapons you cannot hold in your hand.
You can only hold them in your mind.
-- Bene Gesserit Teaching
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

Lloyd Sumpter

Sep 23, 2004, 1:10:46 AM

Again, I'm not sure if this is what you want, but pstoedit converts ps or
pdf files to one of many editable formats, including "latex2e", and
"Mathematica Graphics". Check it out!

Lloyd Sumpter

ynotssor

Sep 23, 2004, 1:16:04 AM
"Alexander Skwar" <fr...@alexander.skwar.name> wrote in message
news:4152543D...@von.digitalprojects.com

>> Due to the nature of PDF's it would be difficult to do what you want.
>
> Why is the Acrobat reader v6 on Windows able to copy texts (and
> equations)?

What does that have to do with converting said equations to TEX syntax?

--
use hotmail for email replies

Alan Connor

Sep 23, 2004, 1:55:39 AM

Debian stable package description:

Package: pstoedit
Priority: optional
Section: graphics
Installed-Size: 763
Maintainer: J.H.M. Dassen (Ray) <jda...@debian.org>
Architecture: i386
Version: 3.21-2
Depends: libc6 (>= 2.2.3-1), libstdc++2.10-glibc2.2, gs
Suggests: xfig | ivtools-bin | tgif | transfig
Filename: pool/main/p/pstoedit/pstoedit_3.21-2_i386.deb
Size: 231664
MD5sum: 0a9a45aafa16bc7c3b477f97d56ea68d

Description: PostScript and PDF files to editable vector graphics
converter. pstoedit converts Postscript and PDF files to various
editable vector graphic formats including tgif, xfig, PDF
graphics, gnuplot format, idraw, MetaPost, GNU Metafile, PIC,
Killustrator and flattened PostScript.

Task: tex

--------------

How does one edit files like those?

AC

hor...@nospamum.es

Sep 23, 2004, 9:30:01 AM
I've tried "pstoedit" and at first it seemed very promising, but I
don't know how to make it work. I tried:

pstoedit -f latex2e input.pdf output.tex

(where input.pdf is a paper with figures and equations)

but it gives me a lot of errors that I don't know how to solve.


Thanks anyway
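
One way to make the errors easier to diagnose is to run the same command from a small script and keep its error output. The sketch below rebuilds the exact command line from the post and captures stderr; the helper names are hypothetical. Two caveats, both assumptions on my part rather than anything stated in the thread: pstoedit relies on Ghostscript (the Debian package lists `gs` as a dependency), so a missing or mismatched Ghostscript is a likely source of errors, and the `latex2e` backend is, as far as I know, a picture-drawing format rather than math markup, so even a successful run would not give back the original equation source.

```python
import subprocess
from typing import List, Optional, Tuple


def pstoedit_command(src: str, dst: str, fmt: str = "latex2e") -> List[str]:
    """Build the command line used in the post: pstoedit -f FORMAT SRC DST."""
    return ["pstoedit", "-f", fmt, src, dst]


def run_and_report(src: str, dst: str) -> Tuple[Optional[int], str]:
    """Run pstoedit and return (exit code, stderr text) for inspection."""
    cmd = pstoedit_command(src, dst)
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return None, "pstoedit is not installed or not on the PATH"
    return proc.returncode, proc.stderr
```

`run_and_report("input.pdf", "output.tex")` returns the exit code together with the full error text, which can then be posted or searched for.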


"Lloyd Sumpter" <lsum...@dccnet.com> wrote in message
news:pan.2004.09.23....@dccnet.com...

Larry T.

Sep 23, 2004, 10:37:07 AM
Hi Alexander,

Acrobat Reader can search and find information within a PDF, but the
information is not plain text; it is a highly processed and vectorized
piece of information defined by the PDF spec. You can get a
good tutorial and explanation at the Adobe web site. If you tried to search
a PDF with any other editor (or even just loaded it in) you would see the
problem: an ASCII "a" and the PDF
representation of that "a" are quite different. PDFs were not meant to be
edited, but through common use, growth and universal acceptance users have
demanded additional capability, and of course software vendors have complied.
It is amazing the variety and technology now available for PDF applications.
If you are interested, check out www.pdfzone.com, which is a good clearing
house for PDF software. Thanks, Larry T.
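
A small illustration of why a plain editor search fails, offered as a sketch rather than a description of any particular file. In a PDF, page text lives in content streams, where a string is shown with the `Tj` operator, and those streams are commonly stored Flate (zlib) compressed. The stream below is a made-up example; in a real TeX-produced PDF the bytes inside the parentheses are character codes in the font's own encoding and need not be ASCII at all.

```python
import zlib

# A tiny, hypothetical page content stream:
# begin text, select font F1 at 12 pt, move to (72, 720), show a string.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# What a Flate-compressed stream would hold inside the file.
stored = zlib.compress(content)

# Searching the bytes as stored usually finds nothing readable...
found_raw = b"Hello" in stored

# ...while a PDF-aware reader decompresses the stream first.
decoded = zlib.decompress(stored)
found_decoded = b"Hello" in decoded

print(found_raw, found_decoded)
```

A viewer such as Acrobat Reader does the decompression and font decoding before it searches or copies, which is why it can find text that a text editor cannot.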

Julian V. Noble

Sep 23, 2004, 4:11:54 PM

This is basically correct. It is straightforward to turn pig into
sausage, but no one has found a simple way to reverse this. So
you can't retrieve LaTeX source for equations from an image without
solving some fairly deep problems in CS.

Why not simply retype them using MathType, which is WYSIWYG and can
output LaTeX in various forms?
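
The sausage point can be made concrete with a toy example. What a page description keeps is, roughly, a set of glyphs at coordinates; the sketch below assumes a hypothetical placement for the typeset fraction `\frac{x^2}{a+b}` and reads it back the way a naive text extractor would, line by line from the top. All names and coordinates are invented for illustration.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    x: float   # horizontal position on the page
    y: float   # baseline height (larger means higher on the page)
    char: str


# Hypothetical glyph placement for \frac{x^2}{a+b}.
# The fraction bar is a drawn rule, not a character, so it is absent here.
glyphs = [
    Glyph(12, 110, "x"), Glyph(18, 114, "2"),   # numerator with raised exponent
    Glyph(10, 90, "a"), Glyph(16, 90, "+"), Glyph(22, 90, "b"),  # denominator
]


def naive_text(glyphs: List[Glyph], line_tolerance: float = 2.0) -> List[str]:
    """Group glyphs into lines by baseline, top to bottom, left to right."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0] - g.y) <= line_tolerance:
            lines[-1][1].append(g)      # same baseline: extend current line
        else:
            lines.append((g.y, [g]))    # new baseline: start a new line
    return ["".join(g.char for g in sorted(row, key=lambda g: g.x))
            for _, row in lines]


print(naive_text(glyphs))  # ['2', 'x', 'a+b']
```

The characters survive, but the exponent and the fraction do not: nothing in the output says that `2` was a superscript of `x` or that `a+b` sat under a bar. Recovering that structure is the hard part.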


--
Julian V. Noble
Professor Emeritus of Physics
j...@lessspamformother.virginia.edu
^^^^^^^^^^^^^^^^^^
http://galileo.phys.virginia.edu/~jvn/

"For there was never yet philosopher that could endure the toothache
patiently." -- Wm. Shakespeare, Much Ado about Nothing. Act v. Sc. 1.

Jose Maria Lopez Hernandez

Sep 26, 2004, 6:39:01 PM
Alan Connor wrote:
> On 22 Sep 2004 19:29:05 -0700, Fajar Suryawan
> <faja...@yahoo.com> wrote:
>
>
>><hor...@NOSPAMum.es> wrote in message
>>news:<cis9ma$svj$1...@unida.um.es>...
>>
>>
>>>Hello everybody,
>>>
>>>could somebody tell me if it is possible to extract equations
>>>from a paper in a PDF file ?
>>
>>Simple answer, you don't. PDF is printer language. There's no
>>information on the mathematics' original LaTeX-code. Reverse
>>engineering is nice, but not easy.
>>
>>F
>
>
> pstotext produces text files from ps or pdf that are far from
> perfect, but it may help the OP.

That would break the equations.

>
> pdf2ps would turn the file into ps, which is at least ascii,
> and easier to do stuff with.

PS is no easier to edit than PDF.

The real issue is that you can't edit a PDF file.

--

Jose Maria Lopez Hernandez
Technical Director, bgSEC
jker...@bgsec.com
bgSEC Seguridad y Consultoria de Sistemas Informaticos
http://www.bgsec.com
ESPAÑA

The only people for me are the mad ones -- the ones who are mad to live,
mad to talk, mad to be saved, desirous of everything at the same time,
the ones who never yawn or say a commonplace thing, but burn, burn, burn
like fabulous yellow Roman candles.
-- Jack Kerouac, "On the Road"

Alan Connor

Sep 26, 2004, 8:34:37 PM

Sorry, "Jose Maria". You have a grossly out-of-line sig, and I
don't read the posts of people who abuse the Usenet this way.

The Netiquette standard is a delimiter "-- " on a line by itself,
followed by a maximum of 4 lines, blank or otherwise.

It is supposed to be a _sig_, not a bulletin board. It is not
supposed to be *intrusive*.

Thank you anyway,

AC


--
Pass-list --> Block-list --> Challenge-Response
The key to taking control of your mailboxes
http://tinyurl.com/3c3agp http://tinyurl.com/2t5kp
http://tinyurl.com/yrfjb

Jose Maria Lopez Hernandez

Sep 26, 2004, 8:44:55 PM
Alan Connor wrote:
> Sorry, "Jose Maria". You have a grossly out-of-line sig, and I
> don't read the posts of people who abuse the Usenet this way.
>
> The Netiquette standard is a delimiter "-- " on a line by itself,
> followed by a maximum of 4 lines, blank or otherwise.
>
> It is supposed to be a _sig_, not a bulletin board. It is not
> supposed to be *intrusive*.
>
> Thank you anyway,

What about trimming quoted messages?
That's also netiquette, and untrimmed quotes take much more time to
scroll past than my sig, which nobody reads anyway.
I've been signing this way for 5 years and I'm not
gonna change it.
You can ignore my posts. I don't mind at all.

Victor Eijkhout

Sep 26, 2004, 8:45:18 PM
Alan Connor <zzz...@xxx.yyy> wrote:

> Sorry, "Jose Maria".

Your argument would have been more impressive if you had actually
followed up to one of Jose Maria's posts.

V.
--
email: lastname at cs utk edu
homepage: www cs utk edu tilde lastname

Daniel Bareiro

Sep 27, 2004, 8:40:34 AM
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi.

> could somebody tell me if it is possible to extract equations from a
> paper in a PDF file ?

Maybe you can use tools like pdf2html or similar. I think they convert
the formulas and figures to images. Try taking a look at freshmeat.net.

Regards,
Daniel
- --
Daniel Bareiro
Information systems engineering student.
UTN (Universidad Tecnológica Nacional) - Bs. As., Rep. Argentina.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFBWApCPrTL0sK9xG8RAvrCAJ0TxWBXt+c+YZ1WFdWwpksd3ddDRQCffVTF
8QZUZuiMBiOQ88ZXdlCPySs=
=PWFR
-----END PGP SIGNATURE-----

Larry T.

Sep 27, 2004, 10:02:18 AM
Hi, (off topic)

I really didn't see the problem with Jose's sig line; a bit long perhaps,
but then I did some of Rt. 66 with the wife this summer, traveling out
through Texas, Oklahoma and New Mexico while listening to a CD of "On the
Road". Fun trip, lots of interesting little stops and things to explore.
Back on topic, sounds like you understand the PDF side.

Larry T.

MetaMorph

Sep 27, 2004, 9:48:48 PM
> Sorry, "Jose Maria". You have a grossly out-of-line sig, and I
> don't read the posts of people who abuse the Usenet this way.

COR
Get a bloody life

MetaMorph

Dec 1, 2004, 2:33:00 AM
The case raises significant issues of freedom of speech and assembly,
privacy and government accountability.

In response to an FOIA asking why this happened, the Secret Service
responded: "We are sure no one knows why we had the meeting disrupted".

They have made a mockery of FOIA.

This mockery of FOIA is still being litigated by EPIC.

An intentional illegal government surveillance program...it just never stops.

Marc Rotenberg has gotten the Secret Service to admit in court that this was
done to "investigate hacking into a company's telephone switch."

Since when did the "investigative" techniques used by the Secret Service
become valid for use in the United States? Going up to a bunch of mall
patrons and DEMANDING IDENTIFICATION from them and searching them?

How exactly was this supposed to further investigate a switch hacking?

For extended details of this governmental persecution of the politically
incorrect, see http://www.2600.com.


******************************************************************************


Secret Service: Vile Persecution of Ed Cummings
------ ------- ---- ----------- -- -- --------

Source material from http://www.2600.com, by someone calling themselves
"Emmanuel Goldstein", which in the book '1984' was known as the Hated Enemy
of the People.

2600, "The Hacker's Quarterly", is unhappy about what the Secret Service
did to one of its correspondents, Ed Cummings.

> The Secret Service has locked Ed Cummings up with violent criminals for
> nearly a year, solely because of

