Convert PDF to .tex file?

timorrill

Jun 3, 2008, 10:10:40 AM

I'm wondering if it's possible to create a .tex file from a PDF
document. Specifically, I want to be able to convert the math that
appears in a PDF document to LaTeX code, so that I don't have to write
it all out manually.

Uwe Ziegenhagen

Jun 3, 2008, 10:16:14 AM

timorrill wrote:

Short: Impossible.

Long: I know no tool which might be able to do this.

Uwe

Uwe Ziegenhagen

Jun 3, 2008, 10:18:04 AM

Uwe Ziegenhagen wrote:

BTW: If you do not need to modify the math, crop the pages with Acrobat
Pro (or whatever the name of the commercial license is now) or PDFTK
(maybe, never used it) and embed them as graphics.

Uwe

Rolf Niepraschk

Jun 3, 2008, 10:26:41 AM

Uwe Ziegenhagen wrote:
...

>
> BTW: If you do not need to modify the math, crop the pages with Acrobat
> Pro (or whatever the name of the commercial license is now) or PDFTK
> (maybe, never used it) and embed them as graphics.
>

Cropping is also possible with pdfLaTeX (add clip so that material
outside the viewport is actually hidden):

\includegraphics[viewport=u v w x,clip]{file}

...Rolf
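
A minimal sketch of the viewport approach, for anyone who wants to try it. The file name scanned-paper.pdf, the page number and the coordinates are placeholders, not values from this thread. The four viewport numbers are the lower-left and upper-right corners of the region to keep, by default in big points (1/72 inch) measured from the lower-left corner of the page, and clip hides everything outside that region.

```latex
\documentclass{article}
\usepackage{graphicx}
\begin{document}
% Hypothetical file, page and coordinates: adjust to the equation you want.
% viewport = llx lly urx ury, in bp, relative to the page's lower-left corner.
\begin{center}
  \includegraphics[page=3, viewport=100 500 400 560, clip]{scanned-paper.pdf}
\end{center}
\end{document}
```

The result is a picture of the equation, so it cannot be edited or restyled, which matches the caveat above about not needing to modify the math.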

A N Niel

Jun 3, 2008, 10:32:39 AM

In article
<d6328d5b-98a4-44bf...@f36g2000hsa.googlegroups.com>,
timorrill <timothy...@gmail.com> wrote:

This is like: "Can I create the movie script from the finished film?"

Or: "Can I create the recipe from that meal they served me?"

Rolf Niepraschk

Jun 3, 2008, 10:36:11 AM

A N Niel wrote:

Or: "Can I create apples from apple puree?"

...Rolf

William F. Adams

Jun 3, 2008, 10:46:24 AM

There're a couple of tools which attempt OCR which includes
mathematics, for example:

http://research.cs.queensu.ca/drl//ffes/

Convert the .pdf to a bitmap, then feed it to ffes.

William

Ted Pavlic

Jun 3, 2008, 4:48:47 PM

>
> Or: "Can I create apples from apple puree?"
>
> ...Rolf

I'm not sure it's that useful to consider this branch of the thread,
but...

Considering that the PDF may not have been created with TeX to begin
with, perhaps...

"Can I create apples from concentrated orange juice?"

or...

"Can I create a recipe from a shooting star?"

or...

"Can I create the movie script from the banana-flavored toothpaste?"

Peter Flynn

Jun 3, 2008, 4:55:59 PM

timorrill wrote:
> I'm wondering if it's possible to create a .tex file from a PDF
> document.

This is like asking to recreate the whole cow from a hamburger.

> Specifically, I want to be able to convert the math that
> appears in a PDF document to LaTeX code, so that I don't have to write
> it all out manually.

Find the original source and use that. Reverse-engineering may be
possible, but it will take longer than retyping it.

///Peter

Bob Tennent

Jun 3, 2008, 5:59:56 PM

On Tue, 03 Jun 2008 21:55:59 +0100, Peter Flynn wrote:

> This is like asking to recreate the whole cow from a hamburger.

Enough of this.

The fact is that Adobe Acrobat can often create a usable .doc from a
PDF, though this likely works well only with ordinary text documents.
It's unfortunate a comparable free application doesn't exist.

Bob T.

David Kastrup

Jun 3, 2008, 6:22:01 PM

Bob Tennent <Bo...@cs.queensu.ca> writes:

Ah, but this depends on what one calls "usable". Usable means the
consistent use of style sheets, cross references and stuff like that.
That 95% of WYSIWYG system users will go "Huh? What's that?" does not
change that you don't want a 1000-page document without such basic
elements in it.

Regardless of whether it has been produced by Acrobat, a clueless retyper,
a clueless original typer or a free tool.

--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
UKTUG FAQ: <URL:http://www.tex.ac.uk/cgi-bin/texfaq2html>

Ista Zahn

Jun 3, 2008, 6:37:21 PM

In fact you can convert from PDF to .doc using free tools. If you are on
Linux, KWord can import PDF and export to formats that MS Word can read.
Or you can use pdftohtml and then convert the HTML to .doc. Or you can
sign up for Gmail, email the PDF to yourself, and have
Google convert it to HTML (and then convert the HTML to .doc format).
None of these methods will do what the OP wanted, of course (convert math
in a PDF to LaTeX), but then again neither will Adobe Acrobat...

Bob Tennent

Jun 4, 2008, 6:25:44 AM

On Tue, 03 Jun 2008 18:37:21 -0400, Ista Zahn wrote:
> Bob Tennent wrote:
>> On Tue, 03 Jun 2008 21:55:59 +0100, Peter Flynn wrote:
>>
>> > This is like asking to recreate the whole cow from a hamburger.
>>
>> Enough of this.
>>
>> The fact is that Adobe Acrobat can often create a usable .doc from a
>> PDF, though this likely works well only with ordinary text documents.
>> It's unfortunate a comparable free application doesn't exist.
>>

> In fact you can convert from pdf to .doc using free tools.

What I meant by comparable was to convert .pdf to .tex. I'm aware it is
possible to go from .pdf to .doc and then .doc to .tex using Abiword,
but surely we could and should do better.

My main point was that it is inappropriate to use irrelevant analogies
to mock the OP's request.

Bob T.

Robin Fairbairns

Jun 4, 2008, 9:25:54 AM

there is a faq answer that says (in effect) that there's no point in
even trying anything beyond extracting the text. this thread is the
first time anyone's mentioned anything else ... rescanning printed
output sounds (ahem) "fun".

anyway, i shall revise the answer some time.
--
Robin Fairbairns, Cambridge

Wilfried Hennings

Jun 4, 2008, 2:06:56 PM

On 4 Jun 2008 13:25:54 GMT, rf...@cl.cam.ac.uk (Robin Fairbairns)
wrote:

> Bob Tennent <Bo...@cs.queensu.ca> writes:
>>
>>What I meant by comparable was to convert .pdf to .tex. I'm aware it is
>>possible to go from .pdf to .doc and then .doc to .tex using Abiword,
>>but surely we could and should do better.
>

>there is a faq answer that says (in effect) that there's no point in
>even trying anything beyond extracting the text. this thread is the
>first time anyone's mentioned anything else ... rescanning printed
>output sounds (ahem) "fun".

There is no need to "rescan printed output".
Modern OCR software (commercial: Caere OmniPage, Abbyy FineReader) can
directly read pdf, convert it to a bitmap and OCR this bitmap. Of
course the quality is better than with printing and rescanning.
And if you want to do it manually, you can open the pdf with
Ghostscript and convert it to a bitmap, then apply the OCR of your
choice.

This OCR software can also guess formatting (not perfectly, but
usably).
Drawback: it saves in MS Word format, not (La)TeX.

Wilfried Hennings
please reply in the newsgroup


cu...@congster.de

Jun 5, 2008, 4:03:34 AM

It's actually unbelievable how well you can reconstruct the cow from
the hamburger:

http://www.inftyproject.org/en/software.html#InftyReader

Didn't test it, though.

Kurt

rxz...@rit.edu

Jun 6, 2008, 1:40:58 PM

On Jun 5, 4:03 am, cu...@congster.de wrote:
> On 3 Jun., 16:46, "William F. Adams" <willad...@aol.com> wrote:
>
> > On Jun 3, 10:10 am, timorrill <timothy.morr...@gmail.com> wrote:
>
> > > I'm wondering if it's possible to create a .tex file from a PDF
> > > document. Specifically, I want to be able to convert the math that
> > > appears in a PDF document to LaTeX code, so that I don't have to write
> > > it all out manually.
>
> > There're a couple of tools which attempt OCR which includes
> > mathematics, for example:
>
> > http://research.cs.queensu.ca/drl//ffes/
>
> > Convert the .pdf to a bitmap, then feed it to ffes.

Thought I should point out that FFES is a prototype for pen-based math
entry, and does not convert images directly to .tex at this time.
There is a preliminary, experimental part of the program for importing
images, but it's fairly weak at the present time. Also, for those
interested, there is a newer version of FFES available here:

http://www.cs.rit.edu/~rlaz/ffes/

I believe that the Infty system of Suzuki et al. does support
conversion from images to .tex, but have not had time to try the
system myself.

-Richard Zanibbi (member of the FFES development team, FFES maintainer)

Ted Pavlic

Jun 6, 2008, 3:56:28 PM

> http://www.cs.rit.edu/~rlaz/ffes/

> -Richard Zanibbi (member of the FFES development team, FFES maintainer)

Slightly off topic -- if you try to install the distribution that's
online, it's going to fail when it tests the TXL compiler... From the
test_txl called from the Makefile for the DRACULAE_0.4 directory:

COMPILE_TEST=`cd test; txlc test/Test.Txl`

I think that "test/" should be removed. Additionally, in that DRACULAE
Makefile, I had to change the *.x rule to wrap a $< by a basename.
That is, you're doing a "cd src" and then still using "src."

I'm running Mac OS X 10.4. After making those changes, I was able to build
ffes fine.

--Ted

rxz...@rit.edu

Jun 11, 2008, 3:27:58 PM

Thank you for catching this. I will update these files when I get the
chance.

-Richard Zanibbi

Luite

Jun 12, 2008, 4:17:57 AM

> > There're a couple of tools which attempt OCR which includes
> > mathematics, for example:
>
> > http://research.cs.queensu.ca/drl//ffes/

> It's actually unbelievable how well you can reconstruct the cow from
> the hamburger:

Do you think we can put a copy of the cow into the hamburger?
What I mean is: can pdf(la)tex somehow put the original tex code into
the pdf? I don't know what the pdf specs say about this, but I seem to
remember that PDFs can have embedded files (attachments). It would
increase the chances of the document being convertible to a new
standard in 30 or 100 years.

cherio, Luite.

Ken Starks

Jun 12, 2008, 6:00:22 AM

The most promising approach is likely to be the XML functionality
of PDF--see the Adobe site and the `Mars' project for this.

Meanwhile, you can put anything you like into the pdf as
a comment (I DO mean comment, not comment-annotation).
PDF comments start with % and last until the end of the line.

William F. Adams

Jun 12, 2008, 8:45:56 AM

That's not what he means, but yes, one can store a copy of the .tex
source (or any other file) within a .pdf when typesetting / creating it.

The Mac OS X Service app LaTeXiT.app (among others) does this, which
allows an embedded equation to be reverted to its source for
editing, then re-typesetting.

William

Heiko Oberdiek

Jun 12, 2008, 10:27:18 AM

Luite <luitev...@gmail.com> wrote:

> > > There're a couple of tools which attempt OCR which includes
> > > mathematics, for example:
> >
> > > http://research.cs.queensu.ca/drl//ffes/
>
> > It's actually unbelievable how well you can reconstruct the cow from
> > the hamburger:
>
> Do you think we can put a copy of the cow into the hamburger?
> What I mean is: can pdf(la)tex somehow put the original tex code into
> the pdf?

Easy, look at package embedfile or attachfile2 (or attachfile).

Yours sincerely
Heiko <ober...@uni-freiburg.de>
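
A minimal sketch of the embedfile route, assuming the document is compiled with pdfLaTeX in PDF mode. \jobname expands to the name of the main file without its extension, so \jobname.tex refers to the source being compiled; no file names here come from the thread.

```latex
\documentclass{article}
\usepackage{embedfile}
% Attach this document's own source file to the PDF it produces.
\embedfile{\jobname.tex}
\begin{document}
The source of this document travels inside the PDF as an attachment.
\end{document}
```

A project split over several input files would need one \embedfile line per file. The attachment can then be saved again from a PDF viewer that supports file attachments.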

Ted Pavlic

Jun 12, 2008, 11:26:13 AM

> > Do you think we can put a copy of the cow into the hamburger?
> > What I mean is: can pdf(la)tex somehow put the original tex code into
> > the pdf?
>
> Easy, look at package embedfile or attachfile2 (or attachfile).
>

I assume that these packages require the use of pdftex. That is, they
require generating a PDF directly from TeX, which may not be appealing
for many users (including this one).

Is there a way to embed the TeX into a DVI and then still manage to
maintain it through the dvips and ps2pdf pipeline? (I assume not)

--Ted

dbpatnau...@gmail.com

Jul 19, 2018, 8:32:48 AM

Robert Heller

Jul 19, 2018, 11:29:23 AM

It will likely not be possible to recover the original LaTeX code. It might
be possible to extract the text, *as printed*. How useful that will be in
recreating the LaTeX code is uncertain.

--
Robert Heller -- 978-544-6933
Deepwoods Software -- Custom Software Services
http://www.deepsoft.com/ -- Linux Administration Services
hel...@deepsoft.com -- Webhosting Services

Peter Flynn

Jul 19, 2018, 6:03:21 PM

Nope. PDF is an end-of-the-line (final-form) format intended for screen reading or
printing. It does not contain any information about how the characters
or symbols got where they are, only about the position they are in on
the page, how big, what colour, etc. All the information about *why*
stuff is where it is, is omitted once it has been used to do the
typesetting.

Ideally you need to go back to the author and ask them for a copy of the
original LaTeX document, but that isn't always possible.

HOWEVER...

1. It is possible to extract just the text (pdftotext is part of Xpdf,
see http://www.xpdfreader.com/) which comes out as plain text, one line
per paragraph, with a ^L (formfeed character) between pages. Mathematics
comes out as a jumble of unusable nonsense.

2. Apache PDFBox is a Java .jar utility to extract text from PDFs
https://pdfbox.apache.org/download.cgi and if you pick HTML output it
will preserve bold and italic as well as paragraphs. I have no idea what
it would do with math, probably the same as [1].

I have used both of these and they are excellent for what they can do.

3. There are dozens, perhaps hundreds, of commercial systems claiming to
extract material from PDFs into Word, preserving all the formatting.
Some of these are standalone programs you run yourself, some are web
sites or services, sometimes free, sometimes limited. I have never used
any of them.

4. There is a LOT of research going on about extraction from PDF.
Leading lights like Peter Murray-Rust have written programs which will
extract even tables from PDFs to SVG (not LaTeX, but an advance). All
part of the movement towards open publication and preventing publishers
from locking up material that they have no rights to; see
https://pdfliberation.wordpress.com/

5. There is also some (less, I think, but I may just not have seen it)
work going on to extract mathematics direct from the positional
information in PDFs, but it is experimental, although there is a
conference paper about it.¹

6. There have been reported successes, however, in using math OCR to
extract the equations from the printout. See
https://tex.stackexchange.com/questions/266989/ocr-pdf-image-to-latex-math
for using pdfocr and tesseract, which has some understanding of math. I
have used tesseract and it's great OCR, but I haven't tried it for maths.

///Peter
--
¹ @InProceedings{10.1007/978-3-319-11897-0_20,
author="Yu, Botao and Tian, Xuedong and Luo, Wenjie",
editor="Tan, Ying and Shi, Yuhui and Coello, Carlos A.",
title="Extracting Mathematical Components Directly from PDF Documents
for Mathematical Expression Recognition and Retrieval",
booktitle="Advances in Swarm Intelligence",
year="2014",
publisher="Springer International Publishing",
address="Cham",
pages="170--179",
abstract="PDF document gains its popularity in information storage and
exchange. With more and more documents, especially the scientific
documents, available in PDF format, extracting mathematical expressions
in PDF documents becomes an important issue in the field of mathematical
expression recognition and retrieval. In this paper, we proposed a
method of extracting mathematical components directly from PDF documents
rather than cooperating indirectly with corresponding images converted
from PDF files. Compared with traditional image-based method, the
proposed method makes full use of the internal information of PDF
documents such as font size, baseline, glyph bounding box and so on to
extract the mathematical characters and their geometric information. The
experimental result shows the method could meet the needs of the
following processing of mathematical expressions such as formula
structural analysis, reconstruction and retrieval, and has a higher
efficiency than traditional image-based ways.",
isbn="978-3-319-11897-0"
}

Martin Vaeth

Jul 20, 2018, 2:25:41 AM

Peter Flynn <pe...@silmaril.ie> wrote:
>
> 6. There have been reported successes, however, in using math OCR to
> extract the equations from the printout.

If an OCR program can do it with reasonable success, it should be much
simpler to do it from the PDF directly. However, AFAIK nobody has done it yet,
and it is quite demanding (and hard to make error-free in some corner cases;
even humans sometimes make errors there if they cannot infer the structure
from the content).
One might want to inspect the corresponding part of the OCR program
or do some experiments with machine learning. Perhaps a semester
project for an interested student.

Axel Berger

Jul 20, 2018, 3:11:10 AM

Peter Flynn wrote:
> It is possible to extract just the text which comes out as plain text

As you correctly say, there is no text in the PDF, only the placement of
individual letters. Assembly of text is done by heuristics and can go
wrong. The most common error is failing to recognize inter-word spaces,
especially in front of "w"s.

Secondly, PDF does not even know letters, only funny graphic shapes. They
can be listed internally in the order of the letters they represent, but
they need not be. Sometimes all you get is some kind of gobbledegook. If
you enjoy code breaking it can be read, but it's a lot of work.

That said, mostly the results from pdftotext are just fine, and sometimes
the -raw or -layout options help if the standard output has issues.

As PDF has no problem with faint printing or smudged scans it is also
the ideal source for a good OCR.

--
/¯\ No | Dipl.-Ing. F. Axel Berger Tel: +49/ 221/ 7771 8067
\ / HTML | Roald-Amundsen-Straße 2a Fax: +49/ 221/ 7771 8069
X in | D-50829 Köln-Ossendorf http://berger-odenthal.de
/ \ Mail | -- No unannounced, large, binary attachments, please! --

Peter Flynn

Jul 20, 2018, 2:48:30 PM

I suspect that it's easier to build this on top of an OCR program, because that
looks at the scanned bitmap and OCR programs already have robust routines for
character recognition and positional analysis. In a PDF, there's no
bitmap, so you have to work with the (x,y) coordinates (or deduce them),
although you do at least get handed the character identity. The authors
of the paper I cited claim to have done it from a PDF. Definitely a case of
more research needed.

///Peter

Axel Berger

Jul 20, 2018, 4:22:41 PM

Peter Flynn wrote:
> using the OCR program because that looks at the scanned bitmap

I've no idea what it does internally, but I can feed my OCR (Abbyy
Version 5) a non-bitmap PDF and get a result.