Strip HTML when adding to index


peppergrower

Jul 16, 2010, 3:14:01 PM
to Djapian Users
I'd like Djapian to strip HTML tags before indexing my content. Is
there some way to do this? If there isn't a built-in way, any
recommendations on how to hack it in? (I'm brand-new to Djapian and
still learning how to use it.)

Thanks,
peppergrower

peppergrower

Jul 19, 2010, 11:29:59 AM
to Djapian Users
Never mind, I figured out how to do this. I just had Djapian index a
special method of the object that returned the text with HTML stripped
out, rather than pointing it directly at the text field.


esatte...@wi.rr.com

Jul 20, 2010, 6:32:32 AM
to Djapian Users
That's interesting, I didn't know you could do that. I ended up with a
much more complex solution.

So in the fields part of index.py you just put the name of a method as
a string, rather than the method itself?


peppergrower

Jul 28, 2010, 3:56:00 AM
to Djapian Users
Sorry for the delay responding! I've been out of town. Yep, if your
model has a "plaintext(self)" method (for example), which strips the
HTML from a field called "text", you could just put "plaintext" in the
index definition anywhere you would have put "text" and it'll work.
And yes, the name as a string, not the method itself. (I didn't
realize this either until I found an example someplace; I'm not sure
it's clearly documented.) In my plaintext method I just have a fairly
simple regex (r"<.*?>") that I use to strip out the HTML.
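
A minimal sketch of the approach described above, for anyone following along. The "Entry" class is a hypothetical plain-Python stand-in for a Django model (so the snippet runs on its own), and the index definition appears only as a comment because its exact layout is an assumption, not something spelled out in this thread. Only the "plaintext" method name, the "text" field, and the regex come from the post.

```python
import re

# The same simple pattern mentioned in the post. It is non-greedy, so each
# tag is matched on its own instead of swallowing the text between tags.
TAG_RE = re.compile(r"<.*?>")


class Entry:
    """Hypothetical stand-in for a Django model with a "text" field."""

    def __init__(self, text):
        self.text = text

    def plaintext(self):
        # Return the contents of "text" with anything tag-like removed.
        return TAG_RE.sub("", self.text)


# In the index definition, the method is referenced by name, as a string,
# wherever the field name would otherwise go (assumed layout):
#     fields = ["plaintext"]

entry = Entry("<p>Hello <b>world</b></p>")
print(entry.plaintext())  # prints: Hello world
```

One caveat with this pattern: "." does not match newlines by default, so a tag that is split across lines would be left in place unless re.DOTALL is added.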



esatte...@wi.rr.com

Jul 28, 2010, 9:46:22 PM
to Djapian Users
Thanks.

Maybe a personal preference, but in your plaintext method you could do:

    from django.template.defaultfilters import striptags

    def plaintext(self):
        return striptags(self.text)

It's not much fancier, but it does some Unicode work for you.


peppergrower

Jul 29, 2010, 3:09:02 PM
to Djapian Users
Thanks for the tip. I really ought to do more digging into Django so
I don't re-invent stuff they've already done, even if it's as simple
as this.

