Introducing Slate - the easiest way to extract text from PDFs

Tim McNamara

Nov 4, 2010, 7:32:19 PM
to nz...@googlegroups.com, nltk-...@googlegroups.com
Wouldn't it be nice if you could ask a computer for the text of a PDF as a list of pages, one string per page? I think so, which is why I created slate this morning[1].

You might know PDFMiner and similar tools, but be put off that it takes about 20 lines of code to get anything working[2]. Even then you may not end up with a Python object, because PDFMiner almost always works with files.

slate is a small Python module that simplifies PDFMiner's API so that you can get on with what you actually want to do: process the text. How does it work?

>>> import slate
>>> with open('example.pdf') as f:
...    doc = slate.PDF(f)
...
>>> doc[0]
'Yay, some example text...'
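
To make the list-of-pages idea concrete, here is a minimal sketch of how such a wrapper could behave. It is not slate's actual implementation: `PageList` and `split_on_form_feed` are hypothetical names, and the form-feed splitter only stands in for the real PDF parsing that PDFMiner would do.

```python
import io


class PageList(list):
    """Hypothetical list-of-pages wrapper (an illustration, not slate's code)."""

    def __init__(self, file_obj, split_pages):
        # split_pages is an assumed callable: bytes -> iterable of page strings.
        list.__init__(self, split_pages(file_obj.read()))


def split_on_form_feed(data):
    # Stand-in for real PDF parsing: treat each form feed as a page break.
    return data.decode("utf-8").split("\f")


# An in-memory "file" so the sketch runs without a real PDF on disk.
fake_file = io.BytesIO(b"Yay, some example text...\fSecond page")
doc = PageList(fake_file, split_on_form_feed)

print(doc[0])
print(len(doc))
```

Because the result is an ordinary list, indexing, slicing, len() and iteration all work without any extra API to learn.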

Slate has been manually tested on Python 2.6 and seems to work fine. I'll add automated tests if people seem interested in the module. My setup.py wizardry has not been tested yet, however, so beware.

Tim
@timClicks


Pedro Marcal

Nov 4, 2010, 7:58:13 PM
to nltk-...@googlegroups.com
great idea, thanks


Victor Miclovich

Nov 5, 2010, 7:03:01 AM
to nltk-users
This is definitely unique and cool!