Automaton-intersection search algorithm and word boundaries

eric.laporte

Feb 10, 2015, 10:19:49 AM

Hi,
I recently discovered that the LocateTfst program (the "automaton intersection" option of Text > Locate Pattern) has behaved strangely since a revision of July 2010: the now here query (or a now --> here path in a graph) finds occurrences of nowhere, and the nowhere query finds now here. Word boundaries inside the query and inside the occurrence are not required to coincide. Such behaviour may be convenient for Korean and other languages where spaces between words are easily omitted or inserted, but it is counter-intuitive for English or Romance languages: now here does not mean nowhere, and to get her has nothing to do with together. I suggest that this feature be limited to Korean and perhaps other languages.
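To make the two behaviours concrete, here is a minimal Python sketch. It is only a toy model of the matching described above, not the way Locate or LocateTfst is implemented; the function names and the `strict_boundaries` parameter are hypothetical, and tokenization is reduced to splitting on whitespace.

```python
def tokens(text):
    """Split on whitespace; a toy stand-in for real tokenization."""
    return text.split()


def find(query, text, strict_boundaries=True):
    """Return the runs of text tokens matched by the query.

    strict_boundaries=True:  the query tokens must equal a run of text tokens,
                             so word boundaries have to coincide.
    strict_boundaries=False: only the letters must agree; spaces are ignored
                             on both sides (the behaviour described above).
    """
    q = tokens(query)
    t = tokens(text)
    target = "".join(q)
    hits = []
    for start in range(len(t)):
        for end in range(start + 1, len(t) + 1):
            run = t[start:end]
            if strict_boundaries:
                ok = run == q
            else:
                ok = "".join(run) == target
            if ok:
                hits.append(" ".join(run))
    return hits


text = "he is nowhere and now here"
print(find("now here", text))                           # boundaries must coincide
print(find("now here", text, strict_boundaries=False))  # boundaries ignored
```

With boundaries ignored, both the now here query and the nowhere query return the two occurrences; with strict boundaries, each query returns only its own spelling.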
This feature is a difference between LocateTfst and the Locate program (the default option of Text > Locate Pattern): by default, with Locate, now here and nowhere do not recognize each other. This difference makes graphs harder to read and interpret, and I don't see the advantage, except for Korean. LocateTfst was designed as a new version of Locate with more functionality (such as dictionary-based morphological analysis and rule-based ambiguity resolution), while preserving as much of Locate's functionality as possible.
The feature is misleading. With LocateTfst, even though the <now> query recognizes now and <here> recognizes here, a <now> --> <here> path does not recognize nowhere.
The feature looks a little like Locate's morphological mode, but is in fact different. In the morphological mode, a now --> here path in a graph finds nowhere, but a <now> --> <here> path finds it too, and a nowhere box does not find now here. All these differences are difficult to remember and make graphs even more confusing to interpret.
Does any user benefit from this behaviour of LocateTfst? In which language?
Best,
Eric Laporte

eric.laporte

Feb 27, 2015, 4:25:56 AM

Hi,

Here is Jee-sun Nam's answer on the issue:

I also find the present behaviour excellent for Korean, unlike for French or English. As many multiword expressions are spelt with or without internal whitespace without any discernible regularity, I appreciate being able to detect both forms (with or without a space) with a single query. In addition, I think Korean has few ambiguous cases like to get her/together that are caused by whitespace alone. Therefore, I prefer the present behaviour for Korean.

(Translated from the original French)
As for Korean, as you say, I think the current version is clearly preferable, unlike for French or English. Since many compound words are written joined or separated without much regularity, I prefer being able to detect both forms using a single form (joined or separated). Besides, I think Korean has few ambiguous cases like to get her/together that are caused simply by spacing. So, in my opinion, the current treatment is preferable for Korean.

Best,
Eric Laporte

Alexis Neme

Mar 3, 2015, 9:27:41 AM

to unitex-...@googlegroups.com

Hi,

Arabic has a cursive script with 28 letters, divided into two sets:
     - 22 letters whose forms change when followed by a space (connective)
     - 6 letters whose forms remain the same even when followed by a space (non-connective).

Inserting a space inside a simple word produces a typographic error in all cases, and the error is more shocking if it comes after a connective letter.
In compound nouns with two parts "P1 P2", the space is mandatory if P1 ends with a connective letter, and usually optional if it ends with a non-connective letter.

Examples:
    - With t (a connective letter), beit lehem vs. *beitlehem (place name)
    - With r (a non-connective letter), kafar qaAsim = kafarqaAsim (place name)
    - With r, biyr AlEabid vs. *biyrAlEabid (place name)
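
The spacing rule for two-part compounds can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, and the set of non-connective letters assumes Buckwalter-style transliteration codes (A, d, *, r, z, w for alif, dal, dhal, ra, zay, waw), a mapping that is not given in this message.

```python
# Assumed Buckwalter-style codes for the six non-connective letters
# (alif, dal, dhal, ra, zay, waw); this mapping is an assumption.
NON_CONNECTIVE = set("Ad*rzw")


def space_after(p1):
    """Classify the space that follows P1 in a two-part compound "P1 P2"."""
    if not p1:
        raise ValueError("P1 must not be empty")
    if p1[-1] in NON_CONNECTIVE:
        return "usually optional"
    return "mandatory"


for p1 in ("beit", "kafar", "biyr"):
    print(p1, "->", space_after(p1))
```

The rule only says "usually optional" after a non-connective letter: biyr ends with r, yet the joined form *biyrAlEabid is marked as incorrect above, so such cases would still need to be listed individually.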

In conclusion, for Arabic, I would prefer that Locate, by default, finds:
    - only beit lehem for a beit lehem query
    - only kafar qaAsim for a kafar qaAsim query
    - only kafarqaAsim for a kafarqaAsim query

If I want both kafar qaAsim and kafarqaAsim in the same concordance, I am prepared to define a complex query.

Best,

Alexis

eric.laporte

May 7, 2015, 4:11:01 AM

to unitex-...@googlegroups.com, alexi...@gmail.com

Hi,
Thanks to Sébastien Paumier, who made the behaviour of LocateTfst (the "automaton intersection" option of Text > Locate Pattern) more similar to that of Locate (the default option). The new version (3.1beta) is online. Now the now here query (or a now --> here path in a graph) does not find occurrences of nowhere, and the nowhere query does not find now here. Word boundaries inside the query and inside the occurrence are required to coincide.
For Korean, where spaces between words are easily omitted or inserted and where searches typically use LocateTfst, the behaviour LocateTfst has had since 2010 is unchanged. That behaviour can be adopted for other languages, maybe Japanese, by setting a 'Match word boundaries' option to false.
Best,
Eric Laporte

Danniella Blonde

Aug 20, 2019, 8:11:09 AM

to Unitex-GramLab

Hi, I am new to Unitex and am working with Hebrew. I have also run into a problem with word boundaries. For example, I included the word אח in my graph, but אחריות was also returned. I tried ticking the boxes in the Language settings in various combinations (e.g. Semitic Language, Match word boundaries).

Can someone help? Am I doing something wrong?

Many thanks!

Danniella

eric.laporte

Aug 26, 2019, 7:30:29 AM

to unitex-...@googlegroups.com

Hi Danniella,

Did you declare the Hebrew alphabet? (cf. user manual version 3.2, section 15.2, or version 3.1, section 14.2).
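
One plausible reading of why the alphabet matters, shown as a toy Python sketch: if the tokenizer only groups declared letters into words and treats every other character as a one-character token, then an undeclared script is split letter by letter and a short word can match the start of a longer one. This is an assumption about the cause, not something confirmed in this thread, and the function names are hypothetical.

```python
def tokenize(text, alphabet):
    """Group consecutive declared letters into one token; any other
    non-space character becomes a token of its own."""
    out, word = [], ""
    for ch in text:
        if ch in alphabet:
            word += ch
        else:
            if word:
                out.append(word)
                word = ""
            if not ch.isspace():
                out.append(ch)
    if word:
        out.append(word)
    return out


def contains(query, text, alphabet):
    """True if the query's tokens occur as a contiguous run in the text."""
    q, t = tokenize(query, alphabet), tokenize(text, alphabet)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


# The 22 Hebrew letters plus the 5 final forms.
hebrew = set("אבגדהוזחטיכלמנסעפצקרשתךםןףץ")

print(contains("אח", "אחריות", set()))   # alphabet not declared
print(contains("אח", "אחריות", hebrew))  # alphabet declared
```

In this toy model the short word is found inside the longer one only when the alphabet is empty, because both are then reduced to sequences of single letters.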

Best,

Eric

Danniella Blonde

Aug 29, 2019, 5:23:18 AM

to Unitex-GramLab

Hi Eric,

Thank you so much for your help. It seems to have solved my problem! :-D

Best.

Danniella
