Importing "Plaintext" from PDF

Mark Coleman

Oct 30, 2010, 4:36:05 AM

Hi,

I'm attempting to use Mathematica (v7.01) to Import the text from a PDF file.
If I simply Import[] the file, it returns a list of graphics objects
representing each page of the file. If I use the "Plaintext" option of
Import[], it returns an empty list. My source PDF files were obtained
from Google's Patent Search function. Just wondering if there is
some option I am missing or if Mathematica cannot Import text from PDF files.

Thanks,

Mark

Bill Rowe

Oct 31, 2010, 3:07:41 AM

It is not at all difficult to import just the text from PDF
files into Mathematica. The basic syntax is

Import["filename",{"PDF","Plaintext"}]

This will import all of the text in the PDF file, assuming the
file actually contains text. It will not do anything for you if
the document was scanned into the PDF file. In that case, the
pages are images and there is no plaintext to import.
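
A minimal sketch of how one might tell the two cases apart. The helper name hasTextLayerQ and the path "patent.pdf" are hypothetical, and the sketch assumes the import returns either a string or a list of strings (the original poster reported an empty list):

```mathematica
(* Hypothetical helper: True when the imported result holds non-whitespace text. *)
hasTextLayerQ[text_String] := StringLength[StringTrim[text]] > 0
hasTextLayerQ[pages_List] := Or @@ (hasTextLayerQ /@ pages)  (* {} gives False *)
hasTextLayerQ[_] := False  (* anything else, e.g. $Failed *)

(* "patent.pdf" is a placeholder path, not a file from this thread. *)
file = "patent.pdf";
If[FileExistsQ[file],
 (text = Import[file, {"PDF", "Plaintext"}];
  If[hasTextLayerQ[text],
   Print["Text found in the PDF."],
   Print["No text found; the pages are probably scanned images."]]),
 Print["File not found: ", file]]
```

If the second message is printed, the file needs OCR before any text can be imported.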

You can get more information about the options for each file
format by clicking on the format of interest on the page
that is returned by searching for

guide/ImportingAndExporting

in the Documentation Center. Another way to get to this page is
to look up either Import or Export in the Documentation Center
and click on the Listing of Formats just to the right of the
large bold Import or Export.
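
As a complement to the documentation pages, the same information can be queried from within a session. This is only a sketch; "patent.pdf" is a placeholder path and the list of elements returned may differ between versions:

```mathematica
(* Check that this installation lists PDF among its import formats. *)
MemberQ[$ImportFormats, "PDF"]

(* "patent.pdf" is a placeholder path. The "Elements" request asks
   Import which elements it can extract from this particular file. *)
file = "patent.pdf";
If[FileExistsQ[file], Import[file, "Elements"], "File not found"]
```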

James Stein

Oct 31, 2010, 3:07:09 AM

One possibility is to open the PDF file in Adobe Reader and then save it to
a TXT file via the "Save as Text..." command. Then let Mathematica import
the TXT file.
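
A short sketch of the import step. The file written here is only a stand-in for the one saved from Adobe Reader; its name and contents are made up for illustration:

```mathematica
(* Stand-in for the TXT file saved from Adobe Reader. *)
txtFile = FileNameJoin[{$TemporaryDirectory, "patent-example.txt"}];
Export[txtFile, "First line of the patent\nSecond line", "Text"];

wholeText = Import[txtFile, "Text"];  (* the whole file as one string *)
lines = Import[txtFile, "Lines"];     (* a list of strings, one per line *)
```

The "Lines" form is usually the easier starting point if the text has to be cleaned up line by line.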

Joseph Gwinn

Oct 31, 2010, 3:09:53 AM

The PDF contains scans (like a fax), not text. Google Patents has the
text generated by OCR of the scans, but even for straight English text
the error rate is significant, at least 1% on older patents.

OCR of math equations is basically hopeless. Nor are the published
equations written in Mathematica. You will have to do this manually.

Joe Gwinn

Helen Read

Oct 31, 2010, 3:10:14 AM

I just tried Import with the "Plaintext" option on a PDF that had text
in it, and it worked fine. The PDF you are working with might have
originated from a scanned document with each page saved as an image, in
which case there isn't any plain text to Import. Wait for Mathematica 8,
though :-)

--
Helen Read
University of Vermont

AES

Nov 1, 2010, 5:59:51 AM

A bit of additional info, in case it's helpful:

If these are the usual Patent Office copies of patents, the originals
are in two-column format and have been scanned and converted into
raster images which are delivered in TIFF or PDF format.

You may be able to OCR these to convert all the scanned text to a "real
text" PDF file. Adobe Acrobat in particular has a very good and
easy-to-use OCR capability built right into it (one click to OCR an entire
multi-page raster-image PDF document) that I've often used with success,
although I've never applied it specifically to patents.

I'm not at all sure, however, that the output from this OCR process will
keep any of the "flow" information associated with the two-column
layout, in which case you may have a mess to deal with in
interpreting each page after reading it into Mathematica (the line
numbering that's added to patents may cause some trouble also).

You might, for example, have to duplicate each original page into two
pages before scanning it; use Crop operations to select the left column
on the first page and the right column on the second; and then proceed
with OCRing these cropped pages.
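
The cropping step could also be done inside Mathematica on a rasterized page. The sketch below assumes the column gutter sits at the horizontal midpoint of the page, which may not hold for every patent; splitColumns is a hypothetical helper name, and the blank test image only stands in for a real scanned page. The OCR itself is left to an external tool:

```mathematica
(* Hypothetical helper: split a page image into left and right halves.
   Assumes the gutter between the columns is at the horizontal midpoint. *)
splitColumns[img_Image] := Module[{w, h, mid},
  {w, h} = ImageDimensions[img];  (* width and height in pixels *)
  mid = Floor[w/2];
  {ImageTake[img, {1, h}, {1, mid}],      (* left column *)
   ImageTake[img, {1, h}, {mid + 1, w}]}] (* right column *)

(* Stand-in for a scanned page: a blank image 200 pixels wide, 100 high. *)
page = Image[ConstantArray[1., {100, 200}]];
{left, right} = splitColumns[page];
```

Each half can then be exported as its own image and passed to the OCR program, so that the recognized text follows the reading order of one column at a time.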

You might also be able to OCR the original pages; hand-select each
column individually; and Copy and Paste them one by one into an RTF
document. (The Bean Services on a Mac can make this a pretty fast
process.)

I'd be interested in a summary report to this group, if you find a way
to get all this working, or any free source that will provide already
OCRed and "single-columned" patents.