Haystack + Elasticsearch and diacritics

Martin Tiršel

Mar 12, 2013, 5:47:41 AM

to djan...@googlegroups.com

Hello,

does anyone know how to make search return the same results regardless of whether a word is written with or without diacritics? That is, if I enter e.g. 'cizmy' or 'čižmy', both should be treated as the same word. Because the texts I have indexed contain diacritics, terms typed without diacritics are not found at all. One solution that comes to mind is to index both versions of the text, one with diacritics and one with the characters replaced, roughly like this:

{{ object.title }}
{{ object.data.description }}

{{ object.title|remove_accents }}
{{ object.data.description|remove_accents }}

But that strikes me as a bit brute-force. Does anyone know of something more sensible? I would guess the search engine could handle this automatically, but I have not found anything useful.
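For reference, a `remove_accents` helper like the one used in the template above could be sketched as follows. The thread does not show its implementation, so this is only an assumption: it decomposes the text with Unicode NFKD and drops the combining marks. Registering it as a Django template filter is left out so the sketch stays self-contained.

```python
import unicodedata


def remove_accents(value):
    """Return `value` with diacritical marks removed (hypothetical helper)."""
    # NFKD splits e.g. "č" into "c" plus a combining caron.
    decomposed = unicodedata.normalize("NFKD", value)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(remove_accents("čižmy"))
```

Letters that have no Unicode decomposition (for example "ł" or "đ") pass through unchanged, so this approach covers Czech and Slovak but is not a general ASCII transliteration.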

Thanks,
Martin

Martin Kubát

Mar 12, 2013, 5:51:54 AM

to djan...@googlegroups.com

Hi.

I was dealing with exactly this yesterday:

you need to add the "asciifolding" filter to the Elasticsearch settings

Example:

    "analyzer": {
        "default": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": [
                "asciifolding",
                "standard",
                "lowercase",
                "haystack_edgengram"
            ]
        }
    }
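To illustrate why folding inside the analyzer makes both spellings match, here is a plain-Python imitation of the filter chain above. It is not Elasticsearch code and only approximates the real filters: the "standard" filter is left out, and the n-gram bounds (2 to 15) are taken from the `haystack_edgengram` definitions posted elsewhere in this thread. All function names are made up for the sketch.

```python
import unicodedata


def fold_ascii(token):
    # Rough stand-in for the "asciifolding" filter.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_gram=2, max_gram=15):
    # Prefixes of the token, from min_gram up to max_gram characters long.
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text):
    tokens = text.split()                      # "whitespace" tokenizer
    tokens = [fold_ascii(t) for t in tokens]   # "asciifolding"
    tokens = [t.lower() for t in tokens]       # "lowercase"
    result = []
    for token in tokens:                       # "haystack_edgengram"
        result.extend(edge_ngrams(token))
    return result


print(analyze("Čižmy"))
print(analyze("cizmy") == analyze("čižmy"))
```

Because the same analyzer runs at index time and at query time, the indexed text and the search term end up as the same folded tokens, so there is no need to index a second, accent-free copy of every field.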

Martin Kubát

--
E-mail group djan...@googlegroups.com
Group management: http://groups.google.cz/group/django-cs

---
You received this message because you are subscribed to the django-cs group in Google Groups.
To unsubscribe from the group and stop receiving e-mails from it, send an e-mail to django-cs+...@googlegroups.com.
More options are available at https://groups.google.com/groups/opt_out.

Whit

Mar 12, 2013, 6:03:54 AM

to djan...@googlegroups.com

Hi,

I don't know whether Haystack (we use beta 2) already accepts custom settings for Elasticsearch. For now we use a slightly overridden ElasticSearchBackend that reads its configuration from settings and doesn't push the snowball analyzer everywhere, etc. In settings we then have it configured roughly like this (for Czech; it also handles basic stemming and so on):

    HAYSTACK_ELASTICSEARCH_SETTINGS = {
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {
                        "tokenizer": "lowercase",
                        "filter": ["asciifolding", "standard", "stop", "cz_stemmer", "synonym"],
                        "char_filter": ["html_strip"],
                        "alias": ["snowball"]
                    },
                    "ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "lowercase",
                        "filter": ["haystack_ngram"]
                    },
                    "edgengram_analyzer": {
                        "type": "custom",
                        "tokenizer": "lowercase",
                        "filter": ["haystack_edgengram"]
                    }
                },
                "tokenizer": {
                    "haystack_ngram_tokenizer": {
                        "type": "nGram",
                        "min_gram": 3,
                        "max_gram": 15
                    },
                    "haystack_edgengram_tokenizer": {
                        "type": "edgeNGram",
                        "min_gram": 2,
                        "max_gram": 15,
                        "side": "front"
                    }
                },
                "filter": {
                    "stop": {
                        "type": "stop",
                        "stopwords_path": "stopwords.txt"
                    },
                    "synonym": {
                        "type": "synonym",
                        "synonyms": [
                            "kon => kun",
                            "testovak => test",
                            "originaln => original",
                            "milov => miluj",
                        ],
                    },
                    "cz_stemmer": {
                        "type": "stemmer",
                        "name": "czech"
                    },
                    "haystack_ngram": {
                        "type": "nGram",
                        "min_gram": 3,
                        "max_gram": 15
                    },
                    "haystack_edgengram": {
                        "type": "edgeNGram",
                        "min_gram": 2,
                        "max_gram": 15
                    }
                }
            }
        }
    }

In other words, we don't configure this in Elasticsearch directly; each index sets it up for itself. If you want to see the overridden backend, I'll put it up somewhere. It adds no logic of its own, it just reads that setting, so it's only a few lines.
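The "just reads that setting" part could look roughly like the sketch below. Whit's actual backend is not shown in the thread, so the function name and the fallback value are assumptions, and a `SimpleNamespace` stands in for `django.conf.settings` so that the example runs without Django.

```python
from types import SimpleNamespace

# Fallback used when the project defines no custom index settings (illustrative).
FALLBACK_INDEX_SETTINGS = {"settings": {"analysis": {"analyzer": {}}}}


def resolve_index_settings(settings_obj, fallback=FALLBACK_INDEX_SETTINGS):
    """Return the project's index settings, or the fallback if none are set."""
    return getattr(settings_obj, "HAYSTACK_ELASTICSEARCH_SETTINGS", fallback)


# Stand-in for django.conf.settings.
project_settings = SimpleNamespace(
    HAYSTACK_ELASTICSEARCH_SETTINGS={
        "settings": {
            "analysis": {"analyzer": {"default": {"tokenizer": "lowercase"}}}
        }
    }
)

print(resolve_index_settings(project_settings)["settings"]["analysis"]["analyzer"])
print(resolve_index_settings(SimpleNamespace()) is FALLBACK_INDEX_SETTINGS)
```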

Vitek

Martin Kubát

Mar 12, 2013, 6:08:17 AM

to djan...@googlegroups.com

I recently wrote up some ideas on Haystack + Elasticsearch at http://www.itdrawer.com/django-haystack-elasticsearch/2013/02/18/ .

For the overridden Haystack backend I took inspiration from http://www.wellfireinteractive.com/blog/custom-haystack-elasticsearch-backend/.

MK

Whit

Mar 12, 2013, 6:13:21 AM

to djan...@googlegroups.com

Yep, ours does exactly the same thing as ConfigurableElasticBackend in your link. I'll read it for inspiration, thanks.

Vitek

Martin Kubát

Mar 12, 2013, 6:16:34 AM

to djan...@googlegroups.com

For now I have it roughly done like this (I still need to play with Czech, and hopefully your settings will help me ;-) ):

    from haystack.backends.elasticsearch_backend import (
        ElasticsearchSearchBackend,
        ElasticsearchSearchEngine,
    )

    # Index settings sent to Elasticsearch when the index is created.
    ELASTICSEARCH_INDEX_SETTINGS = {
        'settings': {
            "analysis": {
                "analyzer": {
                    "default": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": [
                            "asciifolding",
                            "standard",
                            "lowercase",
                            "haystack_edgengram",
                        ]
                    },
                    "ngram_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["asciifolding", "haystack_ngram"]
                    },
                    "edgengram_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["asciifolding", "haystack_edgengram"]
                    }
                },
                "tokenizer": {
                    "haystack_ngram_tokenizer": {
                        "type": "nGram",
                        "min_gram": 2,
                        "max_gram": 15,
                    },
                    "haystack_edgengram_tokenizer": {
                        "type": "edgeNGram",
                        "min_gram": 2,
                        "max_gram": 15,
                        "side": "front"
                    }
                },
                "filter": {
                    "haystack_ngram": {
                        "type": "nGram",
                        "min_gram": 2,
                        "max_gram": 15
                    },
                    "haystack_edgengram": {
                        "type": "edgeNGram",
                        "min_gram": 2,
                        "max_gram": 15
                    },
                }
            }
        }
    }


    class ConfigurableElasticBackend(ElasticsearchSearchBackend):
        # Analyzer applied to string fields that do not name their own.
        DEFAULT_ANALYZER = "default"

        def __init__(self, connection_alias, **connection_options):
            super(ConfigurableElasticBackend, self).__init__(
                connection_alias,
                **connection_options
            )
            # Replace Haystack's built-in index settings with the ones above.
            self.DEFAULT_SETTINGS = ELASTICSEARCH_INDEX_SETTINGS

        def build_schema(self, fields):
            content_field_name, mapping = super(
                ConfigurableElasticBackend, self
            ).build_schema(fields)

            for field_name, field_class in fields.items():
                field_mapping = mapping[field_class.index_fieldname]

                if field_mapping['type'] == 'string' and field_class.indexed:
                    # Leave facet fields and n-gram fields on their own analyzers.
                    if (not hasattr(field_class, 'facet_for')
                            and field_class.field_type not in ('ngram', 'edge_ngram')):
                        field_mapping['analyzer'] = getattr(
                            field_class, 'analyzer', self.DEFAULT_ANALYZER
                        )

                mapping.update({field_class.index_fieldname: field_mapping})

            return (content_field_name, mapping)


    class ConfigurableElasticSearchEngine(ElasticsearchSearchEngine):
        backend = ConfigurableElasticBackend
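To use an engine like this one, Haystack 2.x has to be pointed at it through `HAYSTACK_CONNECTIONS`. A minimal sketch follows; the dotted module path `myproject.search_backends`, the URL, and the index name are placeholders, not values from this thread.

```python
# settings.py (sketch): the module path, URL and index name are hypothetical.
HAYSTACK_CONNECTIONS = {
    "default": {
        # Dotted path to the engine class defined above.
        "ENGINE": "myproject.search_backends.ConfigurableElasticSearchEngine",
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "haystack",
    },
}
```

Analyzer settings are applied when the index is created, so an existing index generally has to be rebuilt (for example with Haystack's `rebuild_index` management command) before the change takes effect.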

Whit

Mar 12, 2013, 6:26:39 AM

to djan...@googlegroups.com

One more note: we practically never use

    text = CharField(
        document=True,
        use_template=True
    )

and we always build the query ourselves. I think that once you have chosen the powerful ES backend, you shouldn't let yourself be limited by this basic search (it ignores per-field boosts, and it also bothers me a bit that it renders a template at indexing time and stuffs everything into a single field...). I recommend a more Pythonic approach, i.e. use prepare_FIELD() and construct your own query.

In our case it looks something like this:

    from datetime import datetime

    from django.utils.html import strip_tags
    from haystack import indexes

    # `Article` is the project's own model; import it from wherever it lives.


    class PublishableIndex(indexes.SearchIndex):
        # Not that this is pretty; the point is that we don't have to have
        # a single `text` field with document=True...
        def __init__(self):
            self.prepared_data = None

        title = indexes.CharField(model_attr='title', boost=5)
        description = indexes.CharField(model_attr='description', boost=3)
        category = indexes.CharField(model_attr='category', boost=2)
        authors = indexes.MultiValueField(faceted=True)
        publish_from = indexes.DateTimeField(model_attr='publish_from')

        # we need filtering by site
        site_id = indexes.IntegerField(model_attr="category__site__id")

        def prepare_authors(self, obj):
            return [a.name for a in obj.authors.all()]

        def prepare_description(self, obj):
            # Index plain text only, without HTML markup.
            return strip_tags(obj.description)

        def prepare_category(self, obj):
            return obj.category.title

        def index_queryset(self, using=None):
            # Only index items that are already published.
            return self.get_model().objects.filter(
                publish_from__lte=datetime.now(), published=True
            )

        def get_model(self):
            raise NotImplementedError('This method must be overridden')

        def get_updated_field(self):
            return 'last_updated'


    class ArticleIndex(PublishableIndex, indexes.Indexable):
        content = indexes.CharField(model_attr='content')

        def get_model(self):
            return Article
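The thread does not show how the query itself is built, so the following is only one possible shape of a hand-constructed query against the fields above. It applies the same weights as query-time boosts in a raw Elasticsearch request body; the function name is made up, the weight of 1 for `content` is an assumption, and the `bool` query with a `filter` clause assumes Elasticsearch 2.0 or newer (older versions used the `filtered` query instead).

```python
def build_search_body(text, site_id, boosts=None):
    """Build an Elasticsearch request body with per-field boosts (sketch)."""
    if boosts is None:
        # Weights mirror the boost values on the index fields above;
        # the value for "content" is an assumption.
        boosts = {"title": 5, "description": 3, "category": 2, "content": 1}
    # "field^N" is the multi_match syntax for boosting a single field.
    fields = ["%s^%d" % (name, boost) for name, boost in sorted(boosts.items())]
    return {
        "query": {
            "bool": {
                "must": [{"multi_match": {"query": text, "fields": fields}}],
                # Restrict results to one site without affecting the score.
                "filter": [{"term": {"site_id": site_id}}],
            }
        }
    }


body = build_search_body("cizmy", site_id=1)
print(body["query"]["bool"]["must"][0]["multi_match"]["fields"])
```

The resulting dictionary can be passed as the request body to whichever Elasticsearch client the project uses.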

Martin Kubát

Mar 12, 2013, 6:28:50 AM

to djan...@googlegroups.com

Great, thanks for the notes.

The original discussion has drifted elsewhere, but it's all for the good.

MK

Martin Tiršel

Mar 12, 2013, 7:40:15 AM

to djan...@googlegroups.com

Thanks to everyone from me as well for the notes; I think this will be useful to more people :)

Martin
