Correct way to work with abbreviations


Viktor Tsymbalyuk

Aug 28, 2021, 1:06:53 PM
to bleve
Hello!
I am trying to understand how to work with abbreviations.
In Russian, `с.Гора` means `village Hora`.
After some experimentation I found a working solution (using the simple analyzer):
```
import (
	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/simple"
	"github.com/blevesearch/bleve/v2/analysis/lang/ru"
)

Doc := bleve.NewDocumentMapping()
Doc.DefaultAnalyzer = ru.AnalyzerName
fieldTitle := bleve.NewTextFieldMapping()
fieldTitle.Analyzer = simple.Name
Doc.AddFieldMappingsAt("title", fieldTitle)
```
But I want to understand:
1. Is `ru.AnalyzerName` applied after `simple`?
2. Is there a more correct and idiomatic way?
3. Should I also apply something specific to localities (city names) in this case?

Thank you.
--
Viktor

Marty Schoch

Aug 28, 2021, 1:24:26 PM
to bl...@googlegroups.com
On Sat, Aug 28, 2021 at 1:06 PM Viktor Tsymbalyuk <viktor.t...@gmail.com> wrote:
But I want to understand:
1. Is `ru.AnalyzerName` applied after `simple`?

No. In the code above you set ru as the default analyzer, which is used by dynamic fields and by fields that do not explicitly specify an analyzer. You then defined the title field and explicitly set it to use simple, so it will not use ru.
 
2. Is there a more correct and idiomatic way?

If your goal is to apply both, you should simply define a new analyzer that combines the behavior of ru and simple (I'm not even sure that makes sense).
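
For reference, registering such a combined analyzer might look like this. This is a minimal sketch assuming bleve v2 import paths; `"ru_simple"` is just an illustrative name, and the registered name of the Russian stemmer should be checked against the ru package in your bleve version:

```
import (
	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
	"github.com/blevesearch/bleve/v2/analysis/lang/ru"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/letter"
	"github.com/blevesearch/bleve/v2/mapping"
)

// newRuSimpleMapping registers a "ru_simple" analyzer: the letter
// tokenizer plus lowercasing (what simple does), followed by the
// Russian stop-word and stemmer filters (what ru adds on top).
func newRuSimpleMapping() (*mapping.IndexMappingImpl, error) {
	m := bleve.NewIndexMapping()
	err := m.AddCustomAnalyzer("ru_simple", map[string]interface{}{
		"type":      custom.Name,
		"tokenizer": letter.Name,
		"token_filters": []string{
			lowercase.Name,
			ru.StopName,
			// registered name of the Russian snowball stemmer;
			// verify it against the ru package in your bleve version
			"stemmer_ru_snowball",
		},
	})
	if err != nil {
		return nil, err
	}
	return m, nil
}
```

A field mapping could then reference it with `fieldTitle.Analyzer = "ru_simple"`.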
 
3. Should I also apply something specific to localities (city names) in this case?

I'm not sure what to suggest. While you gave an example of the Russian text and what it means, I'm still not exactly clear what text you think should be in the index, or what search term you expect a user to type in and find a match. If you can provide more information on this, we can suggest a path forward.

marty

Viktor Tsymbalyuk

Aug 29, 2021, 7:13:20 AM
to bleve
Hello Marty,

First of all, thank you for the explanation; points 1 and 2 are much clearer now.
About my general goal:
I expect the index to contain only geographical names, especially cities and towns. In other words, I need to know which town was mentioned in a document.
In my understanding, I need a list of such proper nouns and to use it in the opposite way to a stop token filter.
So the first question is how to create such an index.

The second question is how to deal with synonyms. Maybe it's more related to language-specific analyzers, but perhaps you can give advice. For example, I have two documents with the texts:

Document 1:
```
Hope, city, seat (1939) of Hempstead county, southwestern Arkansas, U.S., about 35 miles (56 km) northeast of Texarkana.
```
Document 2:
```
Hope is the last thing that dies.
```
Is there any way to avoid a false positive in the second document if `Hope` is present in the list of known cities?

--
Viktor

Marty Schoch

Aug 30, 2021, 11:01:38 AM
to bl...@googlegroups.com
OK, so what I would recommend is that you build a custom analyzer, and for now, think of it as a black box. The analyzer takes in arbitrary text (as []byte) and ultimately produces a stream of tokens to be indexed. So, in your case, we can think of this analyzer as taking arbitrary text as input and producing a list of zero or more locations found in the text.

Now, there are many different ways that black box can be implemented. As you mentioned, the simplest option would be to use an existing tokenizer to split the text into words, then look up each word in a list. This would be the opposite of the stop filter: instead of removing the listed words, it keeps only tokens that match a built-in list of known locations. A sketch of such a filter follows.
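
Such a "keep list" filter is only a few lines against bleve's analysis API. This is a minimal sketch; the `KeepLocationsFilter` name is made up for illustration, and matching is exact, so it assumes a lowercase filter runs earlier in the chain:

```
import (
	"github.com/blevesearch/bleve/v2/analysis"
)

// KeepLocationsFilter is the inverse of a stop filter: instead of
// removing the listed tokens, it keeps only tokens present in the
// list and drops everything else. (Name is illustrative.)
type KeepLocationsFilter struct {
	locations map[string]struct{}
}

func NewKeepLocationsFilter(locations []string) *KeepLocationsFilter {
	m := make(map[string]struct{}, len(locations))
	for _, l := range locations {
		m[l] = struct{}{}
	}
	return &KeepLocationsFilter{locations: m}
}

// Filter implements analysis.TokenFilter.
func (f *KeepLocationsFilter) Filter(input analysis.TokenStream) analysis.TokenStream {
	rv := input[:0] // reuse the backing array
	for _, tok := range input {
		if _, ok := f.locations[string(tok.Term)]; ok {
			rv = append(rv, tok)
		}
	}
	return rv
}
```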

This simple implementation would work, but because each token is looked up individually, you lack the context to identify false positives. However, you could have a more complicated implementation that considers the other text around the location, with additional heuristics to improve the quality of the generated tokens. This code can be written in Go and do whatever additional logic you want; one possible heuristic is sketched below.
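
To make the context idea concrete, a hypothetical variant could keep a known location only when a neighbouring token is a geographic cue word. With cues like "city" it would keep Hope in your first document and drop it in the second. This is purely an illustration, not a real disambiguation algorithm, and it reuses the analysis import from the sketch above:

```
// keepWithContext keeps a known location only if an adjacent token
// is a geographic cue word ("city", "town", "county", ...).
// Purely illustrative heuristic.
func keepWithContext(input analysis.TokenStream, locations, cues map[string]struct{}) analysis.TokenStream {
	rv := input[:0]
	for i, tok := range input {
		if _, ok := locations[string(tok.Term)]; !ok {
			continue
		}
		prevIsCue := i > 0 && isCue(input[i-1], cues)
		nextIsCue := i+1 < len(input) && isCue(input[i+1], cues)
		if prevIsCue || nextIsCue {
			rv = append(rv, tok)
		}
	}
	return rv
}

func isCue(tok *analysis.Token, cues map[string]struct{}) bool {
	_, ok := cues[string(tok.Term)]
	return ok
}
```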

I would recommend you start by wiring up the simple individual-token-lookup analyzer first. That will get you familiar with the structs and interfaces required, and you can see which data/parameters are available to work with.
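
Wiring it up means registering the filter with bleve's registry under a name and then referencing that name from a custom analyzer. Again a sketch: the names "keep_locations" and "locations_only" are illustrative, and the inline word list would normally come from configuration or a file:

```
import (
	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/unicode"
	"github.com/blevesearch/bleve/v2/mapping"
	"github.com/blevesearch/bleve/v2/registry"
)

func init() {
	// register the filter so a mapping can reference it by name
	registry.RegisterTokenFilter("keep_locations",
		func(config map[string]interface{}, cache *registry.Cache) (analysis.TokenFilter, error) {
			// inline list for illustration only
			return NewKeepLocationsFilter([]string{"hope", "texarkana"}), nil
		})
}

func newLocationsMapping() (*mapping.IndexMappingImpl, error) {
	m := bleve.NewIndexMapping()
	err := m.AddCustomAnalyzer("locations_only", map[string]interface{}{
		"type":          custom.Name,
		"tokenizer":     unicode.Name,
		"token_filters": []string{lowercase.Name, "keep_locations"},
	})
	if err != nil {
		return nil, err
	}
	m.DefaultAnalyzer = "locations_only"
	return m, nil
}
```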

NOTE: there is ongoing interest in adding proper support for synonyms to Bleve, but that work has not yet started.

marty

Viktor Tsymbalyuk

Sep 3, 2021, 4:38:49 AM
to bleve
Hello Marty,
Thank you very much for your explanation and time.
Now I have a non-zero understanding of the problem :)

--
Viktor.