I need to index some PDF documents that are uploaded through a FileField.
Can Haystack handle this kind of field?
Thank you.
Not natively, no. Haystack is model-based, so it knows nothing
about PDFs. However, on a ``SearchIndex`` subclass for the model, you
could try adding a ``prepare_pdf()`` method (provided your
``FileField`` is called ``pdf``) and use a Python PDF library to
extract the text and push that into the index. A cursory search for
PDF reading libraries on Google turned up http://pybrary.net/pyPdf/ &
http://www.unixuser.org/~euske/python/pdfminer/index.html, though I've
never used either.
Daniel
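For illustration, the ``prepare_pdf()`` idea might look roughly like the sketch below. It assumes the ``FileField`` is called ``pdf``, as in the message above. ``PdfTextMixin`` and ``extract_pdf_text()`` are hypothetical names, not part of Haystack; the actual extraction is left to whichever PDF library you choose, since neither library's API is shown in this thread.

```python
class PdfTextMixin:
    """Hypothetical mixin for a SearchIndex subclass that declares a `pdf` field."""

    def extract_pdf_text(self, fileobj):
        # Assumption: you implement this with the PDF library of your choice.
        raise NotImplementedError("plug a PDF text extractor in here")

    def prepare_pdf(self, obj):
        # Haystack calls prepare_<fieldname>(obj) when building the index data.
        if not obj.pdf:
            # No file uploaded for this object, so index an empty string.
            return ""
        obj.pdf.open("rb")
        try:
            return self.extract_pdf_text(obj.pdf)
        finally:
            obj.pdf.close()
```

The mixin would be listed before ``SearchIndex`` in the bases of your index class, with ``extract_pdf_text()`` overridden there.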
Daniel Lindsley wrote:
> Adrian,
>
>
> Not natively, no. Haystack is model-based, so it knows nothing
> about PDFs. However, on a ``SearchIndex`` subclass for the model, you
> could try adding a ``prepare_pdf()`` method (provided your
> ``FileField`` is called ``pdf``) and use a Python PDF library to
> extract the text and push that into the index. A cursory search for
> PDF reading libraries on Google turned up http://pybrary.net/pyPdf/ &
> http://www.unixuser.org/~euske/python/pdfminer/index.html, though I've
> never used either.
Thank you!
But I've seen that some engines (at least Xapian and Solr) can handle
rich-text documents such as PDF and OpenOffice files. Maybe there is a
way to make the prepare_pdf() function send the PDF to the engine and
have it indexed there. I have no idea how to do this, or even whether
it is possible, but maybe someone has done it before.
# search_indexes.py
from haystack import indexes

class MongoSearchIndex(indexes.SearchIndex):
    # these attributes you can override in your subclasses of this class
    text = indexes.CharField(document=True, model_attr='searchable_text')
    user = indexes.CharField(model_attr='user')

# models.py
from subprocess import Popen

def doctotext(payload):
    ...Popen("antiword %s")...

def pdftotext(payload):
    ...Popen("pdftotext -enc utf8 %s")...

class MyDocument(...):
    ...
    def searchable_text(self):
        # pick a text extractor based on the stored content type
        if self.content_type.startswith('text/'):
            return self.read()
        elif self.content_type == 'application/pdf':
            return pdftotext(self.read())
        elif self.content_type == 'application/msword':
            return doctotext(self.read())
        return ''
That way I don't depend on how the engine handles binaries.
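As a rough sketch of what the ``pdftotext()`` and ``doctotext()`` helpers could do, the code below writes the payload to a temporary file and runs the external tool on it. It assumes ``pdftotext`` and ``antiword`` are installed and on the PATH. ``COMMANDS``, ``build_command()`` and ``extract_text()`` are hypothetical names, and the ``run`` parameter exists only so the external call can be swapped out.

```python
import subprocess
import tempfile

# Command templates per content type; "{path}" becomes the temporary file name.
# "-" as the last pdftotext argument sends the extracted text to stdout.
COMMANDS = {
    "application/pdf": ["pdftotext", "-enc", "UTF-8", "{path}", "-"],
    "application/msword": ["antiword", "{path}"],
}

def build_command(content_type, path):
    """Return the argument list for the tool that handles content_type."""
    return [arg.format(path=path) for arg in COMMANDS[content_type]]

def extract_text(payload, content_type, run=subprocess.run):
    """Write the raw bytes to a temporary file and return the tool's stdout as text."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(payload)
        tmp.flush()  # make sure the tool sees the full payload
        result = run(
            build_command(content_type, tmp.name),
            capture_output=True,
            check=True,  # raise CalledProcessError if the tool fails
        )
    # Assumption: output is UTF-8; undecodable bytes are replaced, not fatal.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing an argument list instead of a formatted shell string avoids quoting problems with odd file names.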
Truth is, the models.py code above is pseudocode. What I actually
implemented was putting the payload of the PDFs and DOCs into a
separate index attribute called "deep_text"; otherwise every search
would find something inside the huge PDFs. With that split, I search
inside the payload of the files only if a search on the normally
entered content finds nothing.
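The two-step lookup described here can be sketched independently of any search engine. In the example below, ``search_with_fallback()``, ``make_searcher()`` and the in-memory ``DOCS`` list are all hypothetical stand-ins; in a real setup the two callables would query the "text" and "deep_text" index fields.

```python
def search_with_fallback(query, search_text, search_deep_text):
    """Search the normal content first; look inside file payloads only if that finds nothing."""
    results = list(search_text(query))
    if results:
        return results
    return list(search_deep_text(query))


# Hypothetical in-memory stand-ins for documents with two indexed fields.
DOCS = [
    {"id": 1, "text": "quarterly report", "deep_text": "revenue grew in the third quarter"},
    {"id": 2, "text": "meeting notes", "deep_text": "the report was discussed at length"},
]

def make_searcher(field):
    """Build a naive case-insensitive substring search over one field of DOCS."""
    def search(query):
        return [doc["id"] for doc in DOCS if query.lower() in doc[field].lower()]
    return search


hits = search_with_fallback("report", make_searcher("text"), make_searcher("deep_text"))
```

Here the query "report" matches the normal text of document 1, so the payload of document 2 is never consulted, even though it also contains the word.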