Introducing Slate - the easiest way to extract text from PDFs

Tim McNamara

Nov 4, 2010, 7:32:19 PM

to nz...@googlegroups.com, nltk-...@googlegroups.com

So, wouldn't it be nice if you could ask a computer to give you the text of a PDF as a list of pages? I think so, which is why I created slate this morning[1].

You might know PDFMiner and other tools, but be slightly put off that it takes about 20 lines of code to get anything working[2]. Even then you may not end up with a Python object, because PDFMiner almost always works with files.

slate is a small Python module that simplifies PDFMiner's API so that you can get on with what you actually want to do: process the text. How does it work?

>>> import slate
>>> with open('example.pdf', 'rb') as f:  # PDFs are binary files
...     doc = slate.PDF(f)
...
>>> doc[0]
'Yay, some example text...'
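
Since the result is a list of pages, ordinary list operations should apply to it. Here is a minimal sketch of that idea, assuming each page behaves like a plain string. The `pages` list is a hand-written stand-in for a slate.PDF object, and `pages_mentioning` is a hypothetical helper for illustration, not part of slate.

```python
# Stand-in for a slate.PDF object: the post describes it as a list of pages,
# so a plain list of strings is used here (an assumption for illustration).
pages = [
    'Yay, some example text...',
    'A second page that mentions slate.',
    'A third page with nothing of note.',
]


def pages_mentioning(pages, term):
    """Return indexes of pages whose text contains term, ignoring case."""
    term = term.lower()
    return [i for i, text in enumerate(pages) if term in text.lower()]


# Which pages mention a keyword?
hits = pages_mentioning(pages, 'slate')

# Whitespace-separated word count for each page.
word_counts = [len(text.split()) for text in pages]
```

With a real document you would presumably pass `doc` in place of `pages`, provided it supports iteration the way a list does.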

Slate has been manually tested in Python 2.6 and seems to work fine. I'll add automated tests if people seem interested in the module. My setup.py wizardry has not been tested yet, however, so beware.

Tim

@timClicks

[1] http://pypi.python.org/pypi/slate/0.1

[2] http://www.unixuser.org/~euske/python/pdfminer/programming.html

Pedro Marcal

Nov 4, 2010, 7:58:13 PM

to nltk-...@googlegroups.com

great idea, thanks

Victor Miclovich

Nov 5, 2010, 7:03:01 AM

to nltk-users

This is definitely unique and cool!
