entering & searching for the slippery letter yA

Kevin Edgerton

May 12, 2016, 8:33:05 PM

to FLEx list

In P2 there are five forms of the letter yA: ی ي ې ۍ ئ. Each has its own distinct pronunciation and/or usage. The various forms of yA are distributed differently; some are word-initial/medial and all can be word-final.

Here are the problems I see:

1. In many cases, native writers have not distinguished between the forms, and most would not be clear which is the correct form to use where. Standardization is still a big problem in P2, complicated by dialect differences. Not being sure of the standard form makes it hard to know which to enter into ParaTExt and FLEx.

2. In some typing programs, I think the yA may change its appearance (though not its Unicode value) automatically.

3. How does one do searches in this scenario? You would have to try various combinations of forms to find all of the instances, unless you know of a way around this (maybe using SFMs or some catch-all Unicode value?).

Any ideas would be appreciated!

Kevin

maxwell

May 12, 2016, 11:46:42 PM

to flex...@googlegroups.com

Regular expressions (not SFMs) are the usual answer to this kind of
search. In this case, you'd be searching for something like '[ ی ي ې ۍ
ئ]', but without the space characters (or the quotes). The appearance
is not relevant to search, only the Unicode values. That is, when I
remove the space characters, the regular expression looks like this,
with the characters in their combined forms (at least it does on my
computer, how it looks on your computer depends on the font, among other
things): '[یيېۍئ]'.
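As a rough illustration of the character-class idea outside FLEx, here is a minimal Python sketch. It assumes the five letters in the original post are U+06CC, U+064A, U+06D0, U+06CD and U+0626; the helper name `yeh_insensitive` and the sample words are hypothetical, and FLEx's own regex dialect may differ in details.

```python
import re

# Assumed code points for the five yeh forms listed in the thread.
YEH_FORMS = "\u06cc\u064a\u06d0\u06cd\u0626"
YEH_CLASS = "[" + YEH_FORMS + "]"

def yeh_insensitive(word):
    """Build a pattern matching `word` with any yeh form in each yeh position."""
    parts = [YEH_CLASS if ch in YEH_FORMS else re.escape(ch) for ch in word]
    return re.compile("".join(parts))

# Hypothetical sample: the same word typed with two different yehs.
sample = "\u0633\u0693\u06cc \u0633\u0693\u064a"

# Search using a third spelling; both typed variants are found.
pattern = yeh_insensitive("\u0633\u0693\u06d0")
matches = pattern.findall(sample)
print(len(matches))
```

Writing the letters as escapes sidesteps the rendering question: the pattern depends only on the code points, not on how the font joins them.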

That said, given that native speakers aren't good at spelling with this
letter (at least they aren't in Pashto, I don't know what P2 is), if you
are doing lots of searches you might as well combine all the Yehs into
one code point in a copy of your texts, which will simplify search.
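A minimal sketch of that folding step in Python, assuming the same five code points and an arbitrary choice of U+06CC as the target; `fold_yehs` is a hypothetical helper name, and it should only ever be run on a copy of the texts.

```python
# Assumed code points for the five yeh forms; U+06CC is an arbitrary target.
YEH_FORMS = "\u06cc\u064a\u06d0\u06cd\u0626"
TARGET = "\u06cc"

# Translation table mapping every yeh form to the single target code point.
FOLD = str.maketrans({ch: TARGET for ch in YEH_FORMS})

def fold_yehs(text):
    """Return a search-only copy of `text` with all yeh forms merged."""
    return text.translate(FOLD)

# Hypothetical sample: two spellings that differ only in the yeh used.
a = fold_yehs("\u0633\u0693\u064a")
b = fold_yehs("\u0633\u0693\u06d0")
print(a == b)
```

After folding, a plain substring search on the copy finds every variant, at the cost of losing the distinction in that copy.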
--
Mike Maxwell
max...@umiacs.umd.edu

Jeff Shrum

May 13, 2016, 12:37:16 AM

to flex...@googlegroups.com

Kevin,

We don't support Paratext on this list, but I can comment on FLEx's default behavior. Searches in the lexicon browse view and the concordance view have a "match diacritics" check box. That box is unchecked by default, so FLEx should ignore all diacritics and return every occurrence of the base character regardless of diacritics. If the issue is not related to diacritics, then I would second Mike Maxwell's suggestion to use regular expressions. If you have not used them, they provide a powerful syntax for searching and manipulating text. There is quite a bit of information about regex in the Help.

Jeff Shrum

SIL International

Language Technology Consultant

Dallas, TX

Craig Kopris

May 13, 2016, 1:04:53 AM

to flex...@googlegroups.com

Regarding point 1, although there is some standardization for some Pashto writers (see http://www.samsoor.com/fullstory.php?id=137&CatID=3 for instance), you might find some more regularities if you have the source dialects tagged, as the final vowels typically used to indicate TAM and PNG vary across the dialects. That is, where one dialect might use the suffix /i/ for a particular inflection, another will use /e/ for it instead, and thus their choice of letter for that inflection could be more predictable.

It might also be useful to add 06D2 ے ARABIC LETTER YEH BARREE to your list, especially if you address Pakistani Pashto, unless you are already converting it to one of the Afghan Pashto forms.
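To keep track of which code point is which once the list grows, a short Python sketch can print the Unicode names. The first five code points are an assumption about the letters in the original post; U+06D2 is the one named above.

```python
import unicodedata

# Five assumed yeh forms from the original post, plus U+06D2 from this message.
YEH_FORMS = ["\u06cc", "\u064a", "\u06d0", "\u06cd", "\u0626", "\u06d2"]

def describe(chars):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in chars]

for code, name in describe(YEH_FORMS):
    print(code, name)
```

The same extended list can be dropped into a character class or a folding table when searching.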

- Craig

=====

Craig Kopris

cko...@yahoo.com

maxwell

May 13, 2016, 4:33:57 PM

to flex...@googlegroups.com, Jeff Shrum

On 2016-05-13 00:37, Jeff Shrum wrote:
> We don't support Paratext on this list, but I can comment on FLEx's
> default behavior. Searches in the lexicon browse view, and
> concordance view, have a check box "match diacritics". That box is
> unchecked by default so FLEx should ignore all diacritics and give you
> search results with all occurrences of the base character regardless
> of the diacritics. If the issue is not related to diacritics...

It is not. While the various dots in Arabic are clearly diacritics in
the same sense that accent marks and umlauts/diaereses etc. are
diacritics in Latin scripts, Unicode does not treat them as diacritics
in Arabic script. Rather, each combination of a conceptual base +
diacritic is encoded as a single code point. (Or putting it differently
for Unicode gurus out there, the Arabic blocks in Unicode have only the
equivalent of NFC forms, not NFD forms.)
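One way to check this for the specific letters in question is to ask Python's `unicodedata` module what NFD does to each of them. This is a sketch that assumes the five code points below are the ones in the original post. In the Unicode data I am aware of, U+06CC, U+064A, U+06D0 and U+06CD have no canonical decomposition, while U+0626 (yeh with hamza above) is an exception that decomposes to U+064A plus the combining hamza U+0654, so a diacritic-insensitive search could treat that one differently from the rest.

```python
import unicodedata

def nfd_codepoints(ch):
    """Return the code points of `ch` after NFD normalization."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

# Assumed code points for the five yeh forms in the thread.
for ch in "\u06cc\u064a\u06d0\u06cd\u0626":
    print(f"U+{ord(ch):04X}", "->", nfd_codepoints(ch))
```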

> ...then I would second Mike Maxwell's suggestion to use regular
> expressions. If you have not used them they use a powerful syntax for
> searching and manipulating text. There is quite a bit of information
> about regex in the Help.

Mike Maxwell

Beth-docs Bryson

May 14, 2016, 12:34:57 AM

to flex...@googlegroups.com

Regarding automatically changing shapes: yes, anything that uses Uniscribe (and some other rendering engines) automatically adjusts characters according to where they occur in a word: initial, medial, final, or isolated.

When you listed the different characters, they came out in their isolated forms. When Mike took out the spaces, we saw the contextual forms for each of them.

And yes, it is confusing, because several of the yeh shapes look the same for some of the contextual forms, even though they are different characters, and the underlying codepoint is different. I’m thinking in particular of yeh, alef maksura, and farsi yeh, only one of which is in your list.

Anyway, I don’t have a lot to add, other than to second Mike’s two suggestions.

In the SILConverters package (or a package similar to it), there is a spelling tool where you can tell it about characters that are commonly confused, and it can find potential misspellings based on that information. I have not used it, but it was designed for a situation where native speakers may not be skilled at knowing which character to type for certain sounds and there are certain high-frequency typos that tend to happen. I can get you more info on that if that is interesting to you. I think it is mainly useful in Word, though—I’m not sure what other applications it can be used in.
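The general idea of flagging words that differ only in a commonly confused character can be sketched in a few lines of Python. This is not the SILConverters tool or its API, just an illustration under the assumption that the confusable set is the five yeh code points discussed in this thread; `spelling_variants` and the sample word list are hypothetical.

```python
from collections import defaultdict

# Assumed confusable set: the five yeh forms, folded to U+06CC for comparison.
YEH_FORMS = "\u06cc\u064a\u06d0\u06cd\u0626"
FOLD = str.maketrans({ch: "\u06cc" for ch in YEH_FORMS})

def spelling_variants(words):
    """Group words that become identical once all yeh forms are merged."""
    groups = defaultdict(set)
    for word in words:
        groups[word.translate(FOLD)].add(word)
    # Keep only groups where more than one distinct spelling was seen.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

# Hypothetical word list: two spellings differing only in yeh, plus one other word.
words = ["\u0633\u0693\u06cc", "\u0633\u0693\u064a", "\u06a9\u0648\u0631"]
for key, forms in spelling_variants(words).items():
    print(key, forms)
```

Each reported group is only a candidate for review: some pairs will be genuine misspellings, others distinct words or dialect forms that happen to differ in the yeh.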

-Beth
