OT: A few questions about TECkit (for FLEx)


Michael BCM

May 26, 2021, 6:38:27 AM
to FLEx list

Dear honored community,

I'm not sure this is the right forum for these questions, but since they are part of my work with FLEx, I want to raise them here too. (I also posted this to the SIL Language Software Community.)

I have the following problem:
I need to convert a Latin-based script (LTR) into the Rohingya Hanifi script (Unicode block U+10D00 to U+10D3F) and into an Arabic script (plus additional letters from U+0800-08FF).

There are different challenges I’d like to get your advice on before I start sinking days and hours into TECKit:

  1. Can it handle Unicode 13 and the Unicode block U+10D00 to U+10D3F? The Mapping Editor only gives me access to U+0000-U+FFFF. Will TECkit accept the file if I manually edit it to include U+10D00 to U+10D3F?
  2. Will FLEx be able to load that TECkit file, or will it crash (as sometimes happens)?
  3. Sadly, the scripts I need to convert are inconsistent, meaning each uses letters/combinations that the others don’t and substitute (or drop completely). Is there a way to make TECkit understand that a Latin “a” at the beginning of a word has to go with Alif + A-vowel in the Arabic/Rohingya script, whereas an “a” appearing within a word goes with the preceding consonant (Arabic) or is an individual character (Rohingya Hanifi)? How can I put rules like that into TECkit?
  4. In Arabic Unicode, the order of the code points (e.g., first the vowel, then the doubling sign) can also be written the other way around. Both sequences are rendered the same way but differ in their underlying order. Do I have to teach TECkit every possible combination, or is there a faster way to do this?

Best,
Michael

Ken K

May 26, 2021, 8:41:34 AM
to flex...@googlegroups.com
Hi Michael,

Well, I hope it won't take days...

#1. You probably need to start in TECkit editor to get the headers right. Then I would switch to Notepad++ to edit my table.
We need to appeal to WSTech to get the TECkit mapping editor updated.
#2. FLEx crashing with TECkit...that's a bug. Please report these crashes!
#3. It is possible to write contextual rules in TECkit. They are similar to RegEx rules.
#3-4. Please give us a concrete example. It is difficult to visualize what you want to do from your prose description.

Best wishes,
Ken

--
"FLEx list" messages are public. Only members can post.
flex_d...@sil.org
http://groups.google.com/group/flex-list.
---
You received this message because you are subscribed to the Google Groups "FLEx list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flex-list+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/flex-list/2d5e963b-a760-4362-9613-36fd4bee81fen%40googlegroups.com.

Jeff Heath

May 27, 2021, 3:46:32 AM
to FLEx list
A couple of comments:

You'll have to get input from others about accessing characters in other planes in TECkit mapping files (#1), and whether FLEx can handle them (#2).
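As background for #1, here is a small Python sketch of why an editor limited to U+0000-U+FFFF cannot show the Rohingya Hanifi range: those code points lie above U+FFFF and need two 16-bit units (a surrogate pair) in UTF-16. It says nothing about what TECkit or FLEx actually support; the helper names `utf16_units` and `describe` are made up for this illustration.

```python
# Illustration only: this does not test TECkit or FLEx in any way.
FIRST, LAST = 0x10D00, 0x10D3F  # range quoted in the original question


def utf16_units(cp):
    """Return the UTF-16 code units for one code point (hypothetical helper)."""
    if cp <= 0xFFFF:
        return [cp]  # fits in a single 16-bit unit
    offset = cp - 0x10000
    # Split the 20-bit offset into a high and a low surrogate.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def describe(cp):
    units = " ".join(f"{u:04X}" for u in utf16_units(cp))
    return f"U+{cp:04X} -> UTF-16 {units}"


print(describe(0x0623))  # an Arabic letter below U+FFFF: one unit
print(describe(FIRST))   # two units (surrogate pair)
print(describe(LAST))
```

If a hand-edited mapping file misbehaves, checking whether the tool in question treats such characters as one code point or as two 16-bit units is a reasonable first thing to look at.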

#3 Sadly, the scripts I need to convert are inconsistent, meaning each uses letters/combinations that the others don’t and substitute (or drop completely). Is there a way to make TECkit understand that a Latin “a” at the beginning of a word has to go with Alif + A-vowel in the Arabic/Rohingya script, whereas an “a” appearing within a word goes with the preceding consonant (Arabic) or is an individual character (Rohingya Hanifi)? How can I put rules like that into TECkit?
 
The attached TECkit mapping file shows most of the "tricks" I know for converting RS (Roman script) to AS (Arabic script). It works in several passes. First, it converts all characters to lowercase (since case isn't a thing in AS). Then comes the main character conversion pass, which converts RS characters to AS. You can see that there are special cases for diphthongs and long vowels, and also for all of those cases occurring at the beginning of a word. For example, these lines cover long 'aa' and short 'a', both at the beginning of a word and elsewhere:

'aa' / (#|[WordBreak]) _  <>   U+0622
'aa'                      <>   U+064E U+0627
'a'  / (#|[WordBreak]) _  <>   U+0623 U+064E
'a'                       <>   U+064E
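For anyone who wants to try out the logic of these four rules outside TECkit, here is a rough Python imitation. It is only a sketch, not TECkit syntax: the function name `convert_a` is invented, it handles nothing but the vowel 'a', and it treats "beginning of a word" as "start of text or preceded by a non-letter", which may not match the `[WordBreak]` class in the mapping file.

```python
import re

# Longest match first, so 'aa' is tried before 'a'.
A_VOWEL = re.compile(r"aa|a")

ALEF_MADDA = "\u0622"            # word-initial long 'aa'
FATHA_ALEF = "\u064E\u0627"      # long 'aa' elsewhere
ALEF_HAMZA_FATHA = "\u0623\u064E"  # word-initial short 'a'
FATHA = "\u064E"                 # short 'a' elsewhere


def convert_a(text):
    """Replace 'a'/'aa' depending on word position (hypothetical helper)."""

    def repl(match):
        start = match.start()
        # Assumption: a word starts at the text start or after a non-letter.
        initial = start == 0 or not match.string[start - 1].isalpha()
        if match.group() == "aa":
            return ALEF_MADDA if initial else FATHA_ALEF
        return ALEF_HAMZA_FATHA if initial else FATHA

    return A_VOWEL.sub(repl, text)


# Consonants are left untouched here; a real table converts them as well.
print(convert_a("aab a ba baa"))
```

The point of the sketch is the ordering: the more specific case (longer match, word-initial context) has to be considered before the general one.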

In this second pass there are also a couple of special suffixes and entire words that are exceptions to the standard conversion rules, and some punctuation changes.

The third pass converts doubled consonants to a consonant + shadda. (You need to make sure that all of the consonants in your language are in this list.) And the fourth pass adds a sukun between two consecutive consonants.
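The effect of the third and fourth passes can be imitated in Python as follows. This is a sketch under assumptions, not a translation of the mapping file: `CONS` is a tiny sample of consonants chosen for the example, the names `add_shadda` and `add_sukun` are invented, and the input is assumed to be already in Arabic script.

```python
import re

# Sample consonant list only (ba, ta, kaf, lam, meem); a real table
# must list every consonant used in the language.
CONS = "\u0628\u062A\u0643\u0644\u0645"
SHADDA = "\u0651"
SUKUN = "\u0652"


def add_shadda(text):
    """Pass 3: a doubled consonant becomes consonant + shadda."""
    return re.sub(f"([{CONS}])\\1", "\\1" + SHADDA, text)


def add_sukun(text):
    """Pass 4: insert a sukun between two consecutive consonants."""
    # The lookahead leaves the second consonant available for the next match.
    return re.sub(f"([{CONS}])(?=[{CONS}])", "\\1" + SUKUN, text)


word = "\u0643\u062A\u062A\u0628"  # kaf, ta, ta, ba
step3 = add_shadda(word)           # kaf, ta, shadda, ba
step4 = add_sukun(step3)           # kaf, sukun, ta, shadda, ba
print([f"U+{ord(c):04X}" for c in step4])
```

Running the shadda pass first matters: otherwise the sukun pass would split the doubled consonant before it could be recognized.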

A number of people have found this TECkit mapping file to be a good starting point for an RS to AS conversion, which you can then tweak for your specific language's needs.

#4 In Arabic Unicode, the order of the code points (e.g., first the vowel, then the doubling sign) can also be written the other way around. Both sequences are rendered the same way but differ in their underlying order. Do I have to teach TECkit every possible combination, or is there a faster way to do this?

If you are always converting from RS to AS, this shouldn't be an issue, as the mapping file will always put the shadda next to the consonant, and the vowel will appear afterwards. (Unicode canonical ordering, however, would put the vowel first, which might lead to some problems down the road, for example in spelling checkers.)

If at some point you need to convert from AS to RS, then just put in an initial pass which standardizes all of those character combinations.
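As an illustration of the ordering issue, here is a short Python sketch using Unicode normalization from the standard library. It is not part of the TECkit file, and using normalization as the "standardizing" step is my assumption about one way to do it; the helper name `standardize` is invented.

```python
import unicodedata

BA, FATHA, SHADDA = "\u0628", "\u064E", "\u0651"

shadda_first = BA + SHADDA + FATHA  # order produced by the mapping passes
vowel_first = BA + FATHA + SHADDA   # Unicode canonical order


def standardize(text):
    """Bring combining marks into canonical order (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)


# The two strings look the same when rendered but compare as different.
print(shadda_first == vowel_first)
# After normalization they are identical.
print(standardize(shadda_first) == standardize(vowel_first))
# Canonical combining classes decide the order: lower class comes first.
print(unicodedata.combining(FATHA), unicodedata.combining(SHADDA))
```

With the input standardized like this, a conversion table only needs rules for one order of vowel and shadda rather than every permutation.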

Hope that helps,
Jeff

MabaAS - public.map

matthew...@sil.org

Sep 5, 2025, 5:58:26 PM
to FLEx list
Hello,

I just implemented a few contextual rules based on your map file here, and it made my life a lot easier. I get the impression that a lot more would be possible, but is there a document somewhere outlining the possible rules, or what version of regex is being used if that's what it is? 

I notice that at the end there is what looks like a find/replace rule:
[cons]=a [cons]=b   <>      @a U+0652 @b

Is it possible to do more advanced things like sentence capitalization? Any tips would be appreciated!

sarkipo

Sep 8, 2025, 4:30:00 AM
to flex...@googlegroups.com
Hi Matthew,

You can find the TECkit documentation here, under "The TECkit Language":
https://software.sil.org/teckit/

Best,
Alexandre


matthew...@sil.org

Sep 16, 2025, 9:34:01 AM
to FLEx list
Aha, excellent. Thank you!