PDF indexing

Adrián Ribao

Feb 24, 2010, 5:46:43 AM
to django-haystack
Hello,

I'd need to index some PDF documents that are uploaded in a FileField,
can haystack manage this fields?

Thank you.

Daniel Lindsley

Feb 24, 2010, 10:29:08 AM
to django-...@googlegroups.com
Adrian,


Not natively, no. Haystack is model-based, so it knows nothing
about PDFs. However, on a ``SearchIndex`` subclass for the model, you
could try adding a ``prepare_pdf()`` method (provided your
``FileField`` is called ``pdf``) and use a Python PDF library to
extract the text and push that into the index. A cursory search for
PDF reading libraries on Google turned up http://pybrary.net/pyPdf/ &
http://www.unixuser.org/~euske/python/pdfminer/index.html, though I've
never used either.
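
A minimal sketch of the ``prepare_pdf()`` idea, not tested against a real index. It assumes the ``pypdf`` package (a successor of the pyPdf library linked above) is installed and that the model's ``FileField`` is named ``pdf``. ``PdfTextMixin`` and ``extract_pdf_text`` are hypothetical names; in a real project the mixin would be combined with a ``SearchIndex`` subclass that declares a ``pdf`` field.

```python
def extract_pdf_text(fileobj):
    """Return the text of every page in an open, binary PDF file object."""
    # Imported lazily so this module still loads where pypdf is missing.
    from pypdf import PdfReader  # third-party; assumed to be installed

    reader = PdfReader(fileobj)
    # extract_text() can come back empty (e.g. scanned pages), hence the "or".
    return "\n".join((page.extract_text() or "") for page in reader.pages)


class PdfTextMixin:
    """Hypothetical mixin for a SearchIndex subclass with a ``pdf`` field."""

    # Swappable, so another PDF library (or a test stub) can be plugged in.
    pdf_extractor = staticmethod(extract_pdf_text)

    def prepare_pdf(self, obj):
        # Haystack calls prepare_<fieldname>(obj) while building a document.
        if not obj.pdf:
            return ""  # no file uploaded for this object
        obj.pdf.open("rb")
        try:
            return self.pdf_extractor(obj.pdf)
        finally:
            obj.pdf.close()
```

Returning an empty string for a missing file keeps objects without an upload indexable; extraction errors are deliberately left to propagate in this sketch.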


Daniel

Adrián Ribao

Feb 24, 2010, 11:41:54 AM
to django-haystack

Daniel Lindsley wrote:


> Adrian,
>
>
> Not natively, no. Haystack is model-based, so it knows nothing
> about PDFs. However, on a ``SearchIndex`` subclass for the model, you
> could try adding a ``prepare_pdf()`` method (provided your
> ``FileField`` is called ``pdf``) and use a Python PDF library to
> extract the text and push that into the index. A cursory search for
> PDF reading libraries on Google turned up http://pybrary.net/pyPdf/ &
> http://www.unixuser.org/~euske/python/pdfminer/index.html, though I've
> never used either.

Thank you!

But I've seen that some engines (at least Xapian and Solr) can handle
rich documents such as PDF and OpenOffice files. Maybe there is a way to
make the prepare_pdf() method send the PDF to the engine and have it
indexed. I have no idea how I could do this, or even whether it is
possible, but maybe someone has done it before.

Peter Bengtsson

Feb 27, 2010, 8:40:51 AM
to django-haystack
I solved a similar problem by defining a method on my model called
"searchable_text" which always returns text. It goes something like
this::

    # search_indexes.py
    from haystack import indexes

    class MongoSearchIndex(indexes.SearchIndex):
        # these attributes you can override in your subclasses of this class
        text = indexes.CharField(document=True,
                                 model_attr='searchable_text')
        user = indexes.CharField(model_attr='user')

    # models.py
    from subprocess import Popen

    def doctotext(payload):
        ...Popen("antiword %s")...

    def pdftotext(payload):
        ...Popen("pdftotext -enc utf8 %s")...

    class MyDocument(...):
        ...
        def searchable_text(self):
            # always return text, whatever the stored content type is
            if self.content_type.startswith('text/'):
                return self.read()
            elif self.content_type == 'application/pdf':
                return pdftotext(self.read())
            elif self.content_type == 'application/msword':
                return doctotext(self.read())
            return ''

That way I become more agnostic of how the engine handles binaries.
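
The elided ``Popen`` calls above could be filled in roughly as follows. This is only a sketch under assumptions: the ``pdftotext`` and ``antiword`` command-line tools are installed and on the PATH, the payload is passed as bytes, and ``run_converter`` and the ``PATH`` placeholder are hypothetical helpers that are not part of the original code.

```python
import os
import subprocess
import tempfile

PATH = "{path}"  # placeholder argument, replaced by the temporary file's path


def run_converter(command, payload, suffix=""):
    """Write payload (bytes) to a temp file, run command on it, return stdout as text."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
        argv = [path if arg == PATH else arg for arg in command]
        # check=True raises CalledProcessError if the converter fails.
        result = subprocess.run(
            argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True
        )
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        os.unlink(path)  # always remove the temporary file


def pdftotext(payload):
    # The trailing "-" asks pdftotext to write the text to stdout.
    return run_converter(["pdftotext", "-enc", "UTF-8", PATH, "-"], payload, ".pdf")


def doctotext(payload):
    # antiword prints the document text to stdout by default.
    return run_converter(["antiword", PATH], payload, ".doc")
```

Passing the arguments as a list avoids shell quoting problems with odd file names, which the ``%s`` string form would be exposed to.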

Truth is, the above code is pseudo code. What I actually implemented
was putting the payload of the PDFs and DOCs into a separate index
attribute called "deep_text"; otherwise all searches would always find
something inside the huge PDFs. By splitting it like that I could
search inside the payload of the files only if nothing is found by
searching the normally entered content.
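
The fallback described above can be sketched independently of the search engine. ``search_with_fallback`` is a hypothetical helper; the two callables stand for "search the normal content" and "search the deep_text field", and the haystack calls in the comment are an untested assumption about how they might be written.

```python
def search_with_fallback(query, search_primary, search_deep):
    """Return primary hits; look inside file payloads only if there are none."""
    hits = list(search_primary(query))
    if hits:
        return hits
    return list(search_deep(query))


# With haystack the callables might look like this (assumption, untested):
#   from haystack.query import SearchQuerySet
#   primary = lambda q: SearchQuerySet().filter(content=q)
#   deep = lambda q: SearchQuerySet().filter(deep_text=q)
```

``list()`` loads every hit, which is fine for a sketch; for large result sets a count check on the first query would be cheaper.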

Dominique Guardiola Falco

Jun 15, 2011, 9:13:50 AM
to django-...@googlegroups.com
Hi,

I'm reviving this question because I have the same need: being able to use haystack alongside raw document indexing.
As Solr is able to do this (http://wiki.apache.org/solr/ExtractingRequestHandler), I could just send a curl request telling Solr to add the uploaded, Django-managed file to its index.
My question, as I do not know haystack deeply: is it possible to mix the haystack-produced Solr index with manual indexing like this in the search results in Django?
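
For illustration, the curl request described above could be built in Python along these lines. It is a sketch under assumptions: the Solr base URL is made up, ``literal.<field>`` and ``fmap.content`` are parameters of Solr's ExtractingRequestHandler, and ``id``, ``django_ct`` and ``django_id`` are the fields haystack itself writes for each document, so a manually added document would presumably need them to be mapped back to a model instance. Whether such documents then mix cleanly into haystack results is not verified here.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(solr_url, doc_id, literals=None, text_field="text", commit=True):
    """Build an /update/extract URL that sets literal field values for the file."""
    params = [("literal.id", doc_id)]
    for name, value in sorted((literals or {}).items()):
        params.append(("literal." + name, value))
    # Map the extracted body ("content") onto the index's document field.
    params.append(("fmap.content", text_field))
    if commit:
        params.append(("commit", "true"))
    return solr_url.rstrip("/") + "/update/extract?" + urlencode(params)


def post_file(url, path, content_type="application/pdf"):
    """Send the raw file as the request body; returns the HTTP status code."""
    with open(path, "rb") as fh:
        request = Request(url, data=fh.read(), headers={"Content-Type": content_type})
    with urlopen(request) as response:
        return response.status
```

A call might look like ``post_file(build_extract_url("http://localhost:8983/solr", "docs.document.7", {"django_ct": "docs.document", "django_id": "7"}), "/path/to/file.pdf")``, where every value shown is hypothetical.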

Chris Adams

Jul 7, 2011, 5:09:56 PM
to django-...@googlegroups.com
I've failed to find time to split the fork in https://github.com/toastdriven/django-haystack/pull/309 but basically I've had very good results feeding arbitrary content into Solr using a modified SolrBackend which provides a content extraction method.

The idea is basically that you use the backend to say "Get me the text from this binary file" and then provide the resulting output to your normal haystack template, so you still have the ability to incorporate other content, reformat, etc. This is in an extended test here at work but has been quite effective as long as you're using a recent Solr build with a current Tika (older versions threw exceptions on some broken PDF and Word documents we encountered in the wild).
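
A minimal sketch of that flow, independent of the fork's actual code. It assumes an extraction callable that returns a dict with a "contents" key (believed to match what pysolr's ``extract()`` returns) or None on failure. ``build_index_text`` and the template fields are hypothetical, and ``string.Template`` stands in for the haystack index template.

```python
from string import Template

# Stand-in for the haystack index template; the field names are hypothetical.
DOCUMENT_TEMPLATE = Template("$title\n$description\n$file_text")


def build_index_text(title, description, file_obj, extract):
    """Combine model fields with the text the backend extracted from file_obj."""
    # Fall back to an empty dict so a failed extraction still indexes the rest.
    extracted = extract(file_obj) or {}
    return DOCUMENT_TEMPLATE.substitute(
        title=title,
        description=description,
        file_text=extracted.get("contents", "").strip(),
    )
```

Because the extracted text is only one template variable, the other model fields can still be weighted or reformatted as usual.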

Chris

Dominique Guardiola Falco

Jul 8, 2011, 2:33:01 AM
to django-...@googlegroups.com