Is there a way to ignore characters (or rather non-characters) in sorting
one's dictionary?
At the moment, when entries have more than one word, the sort puts the
ones with a space between the words first, then starts again with the
hyphenated ones, and only then sorts the one-word items.
I think I would like to ignore both the space and the hyphen and just sort
everything in together. At least I would like to look at how it comes out
before I decide. Is there a way to do what I want? I couldn't find an answer
in help.
Thanks
Bruce Hooley
Robert
--------------------------------------------------
From: "Bruce Hooley" <bruce_...@sil.org>
Sent: Monday, December 28, 2009 7:59 AM
To: <flex...@googlegroups.com>
Subject: [FLEx] Sorting
> Is there a way to ignore characters (or rather non-characters) in
> sorting
> one's dictionary?
Ken sent me this back in June. Right now I'm not finding *which* of
his technical docs has this in it.
At the end it talks about ignoring characters. I'm not certain about
ignoring space, but presumably you could use \u0020?
Hopefully there is enough here to get you guys started?
-Beth
On Jun 4, 2009, at 6:31 AM, Ken Zook wrote:
> Here's my latest technical doc that includes sorting. Most of this
> is also in the help file.
>
> Some quotes from it explain ignoring the apostrophe.
>
> Note: In an ICU rule, any non-alphanumeric ASCII character is
> reserved for syntax characters. If you need to control collation of
> any of these characters, you must put an apostrophe in front of the
> character. To control the collation of an apostrophe you would thus
> add two apostrophes (not a double quote). In this context, using
> the Unicode code point \u0027 will not work.
> To sort t' after t, you would use the rule
> &t<t''
>
> The following rules would be one way to handle IPA sorting
> &d<d͡ʒ
> &e<ɛ<f<ɸ
> &i<ɨ
> &k<k''
> &n<ŋ
> &p<p''<r<ɾ
> &s<ʃ<ʂ
> &t<t''<t͡s<t͡s''<t͡ʃ<t͡ʃ''<ʈ͡ʂ<ʈ͡ʂ''
> &z<ʒ<ʐ<ʔ
> Suppose you want to ignore an apostrophe after m and n, but you
> want ng to sort after n, and ng’ to sort after ng. The following
> rules allow for this.
> The = syntax states that the right side is identical to the left side.
> &m=m''
> &M=M''
> &n=n''
> &N=N''
> &n<ng<<<Ng<<<NG<ng''<<<Ng''<<<NG''
>
> Suppose you want to ignore 02BC;MODIFIER LETTER APOSTROPHE in
> sorting. There are two ways you could handle this. The following
> rule doesn’t totally ignore the apostrophe, but it treats it in a
> secondary level so that it is ignored unless words are identical
> otherwise. In this case it always comes after other diacritics.
> &\u030E<<\u02BC
> This would result in the following order: ba, bad, bäd, baʼd,
> bʼad, bade, bat, bät, baʼt, bʼat, bate.
> The second approach is to totally ignore 02BC.
> &[last tertiary ignorable] = \u02BC
> This would result in the following order ba, baʼd, bad, bʼad,
> bäd, bade, baʼt, bat, bʼat, bät, bate. Since baʼd, bad, and
> bʼad all have identical sort keys, their order is random. The
> ‘last tertiary ignorable’ rule should be after all other rules,
> or it will disable the other rules.
> If you need to ignore more than one character, separate the
> characters with commas. The following rules would sort k after d
> and would ignore apostrophe and question mark.
> &d<k<<<K
> &[last tertiary ignorable] = '', '?
>
I finally discovered some misunderstandings about ICU sort specs that I had documented previously and that are in the current help files. I hope to post new technical documentation with corrections soon, and the fixes are being added to the help files for FieldWorks 6.0.1.
Here are some revised sections that I hope will clear up these problems related specifically to ignoring certain characters when sorting.
Note: In an ICU rule, any non-alphanumeric ASCII character is reserved for syntax characters. If you need to control collation of any of these characters, you must quote them with a \ or enclose them in apostrophes. A single apostrophe can also be represented as two apostrophes. Here are some examples of alphanumeric and punctuation characters with or without the \u syntax.
a                 letter a
\u0061            letter a
3                 digit 3
ng                digraph ng
'ng'              digraph ng (quotes are optional for alphanumeric characters)
\u006e\u0067      digraph ng
\-                hyphen
'-'               hyphen
' '               space
\                 space (there is a space following the \)
'\u0020'          space
\\u0020           space
\'                apostrophe
''                apostrophe
\u0027\u0027      apostrophe
To control the collation of an apostrophe you would thus add two apostrophes (not a double quote). To sort t' after t, you would use the rule
&t<t''
The following rules would be one way to handle IPA sorting
&d<d͡ʒ
&e<ɛ<f<ɸ
&i<ɨ
&k<k''
&n<ŋ
&p<p''<r<ɾ
&s<ʃ<ʂ
&t<t''<t͡s<t͡s''<t͡ʃ<t͡ʃ''<ʈ͡ʂ<ʈ͡ʂ''
&z<ʒ<ʐ<ʔ
Suppose you want to ignore an apostrophe after m and n, but you want ng to sort after n, and ng' to sort after ng. The following rules allow for this.
The = syntax states that the right side is identical to the left side.
&m=m''
&M=M''
&n=n''
&N=N''
&n<ng<<<Ng<<<NG<ng''<<<Ng''<<<NG''
Suppose you want to ignore 02BC;MODIFIER LETTER APOSTROPHE in sorting. There are two ways you could handle this. The following rule doesn’t totally ignore the apostrophe, but it treats it in a secondary level so that it is ignored unless words are identical otherwise. In this case it always comes after other diacritics.
&\u030E<<\u02BC
This would result in the following order: ba, bad, bäd, baʼd, bʼad, bade, bat, bät, baʼt, bʼat, bate.
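As a rough illustration of that secondary-level behaviour, here is a minimal Python sketch. It does not call ICU or FieldWorks. It only imitates the idea with a two-part key: base letters are compared first, and accent weights are consulted only when the letters tie. The name secondary_key and the weights 0, 1 and 2 are made up for this example.

```python
import unicodedata

APOSTROPHE = "\u02bc"  # MODIFIER LETTER APOSTROPHE

def secondary_key(word):
    """Two-level key: base letters first, then accent weights by position."""
    primary = []
    secondary = []
    for ch in unicodedata.normalize("NFD", word):
        if ch == APOSTROPHE:
            secondary.append(2)   # heaviest weight: sorts after other diacritics
        elif unicodedata.combining(ch):
            secondary.append(1)   # any ordinary combining diacritic
        else:
            primary.append(ch.casefold())
            secondary.append(0)   # plain base letter
    return ("".join(primary), secondary)

# The eleven sample words from the text, deliberately shuffled.
words = ["bate", "b\u02bcat", "ba\u02bct", "b\u00e4t", "bat", "bade",
         "b\u02bcad", "ba\u02bcd", "b\u00e4d", "bad", "ba"]
ordered = sorted(words, key=secondary_key)
```

With these assumed weights, ordered comes out in the same sequence as the list quoted above, because the apostrophe only matters once the base letters are equal.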
The second approach is to totally ignore 02BC.
&[last tertiary ignorable] = \u02BC
This would result in the following order: ba, baʼd, bad, bʼad, bäd, bade, baʼt, bat, bʼat, bät, bate. Since baʼd, bad, and bʼad all have identical sort keys, their relative order is random.
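To see why fully ignored characters lead to tied entries, here is a small Python sketch. Again, this is only an imitation of the idea and not ICU itself. The ignored set and the function name ignoring_key are assumptions for the example.

```python
import unicodedata

IGNORED = {"\u02bc"}  # characters dropped before any comparison

def ignoring_key(word, ignored=IGNORED):
    """Drop ignored characters, then compare letters, then accents."""
    kept = [ch for ch in unicodedata.normalize("NFD", word) if ch not in ignored]
    primary = "".join(ch.casefold() for ch in kept if not unicodedata.combining(ch))
    secondary = [1 if unicodedata.combining(ch) else 0 for ch in kept]
    return (primary, secondary)

words = ["bate", "b\u02bcat", "ba\u02bct", "b\u00e4t", "bat", "bade",
         "b\u02bcad", "ba\u02bcd", "b\u00e4d", "bad", "ba"]
ordered = sorted(words, key=ignoring_key)

# The three spellings below produce exactly the same key, so nothing
# in the key decides which of them comes first.
tied = ignoring_key("ba\u02bcd") == ignoring_key("bad") == ignoring_key("b\u02bcad")
```

Python's sort is stable, so in this sketch the tied words simply keep their input order. A different collation engine is free to order them differently, which is the sense in which the order is not predictable.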
If you need to ignore more than one character, use = to separate the list of characters. The following rule would ignore an apostrophe, a question mark, a hyphen, a space, and the ng digraph:
&[last tertiary ignorable] = '' = '?' = '-' = ' ' = ng
or
&[last tertiary ignorable] = \' = \? = \- = \ = ng
This could also be represented as
&[last tertiary ignorable] = \u0027\u0027 = '\u003f' = '\u002d' = '\u0020' = \u006e\u0067
If you simply want to ignore all punctuation as well as white space, you can use the following rule
[alternate shifted]
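For a feel of what ignoring all punctuation and white space does to a word list, here is a hedged Python sketch. It is not how ICU implements the option. It approximates the effect by dropping every character whose Unicode general category is punctuation (P*) or separator (Z*), plus anything Python treats as white space. The function name shifted_like_key is made up.

```python
import unicodedata

def shifted_like_key(entry):
    """Sort key that skips white space and punctuation characters."""
    kept = []
    for ch in entry.casefold():
        category = unicodedata.category(ch)
        if ch.isspace() or category.startswith(("P", "Z")):
            continue  # treated as ignorable in this sketch
        kept.append(ch)
    return "".join(kept)

entries = ["ice-cream", "iceberg", "ice age", "icecream", "ice cap"]
ordered = sorted(entries, key=shifted_like_key)
# ordered == ["ice age", "iceberg", "ice cap", "ice-cream", "icecream"]
```

Note that U+02BC is classified as a modifier letter (category Lm), not as punctuation, so this sketch keeps it. That may be why a rule that ignores U+02BC has to name it explicitly.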
One answer to Robert’s specific question, how to ignore U+02BC, space, and hyphen, is
&[last tertiary ignorable] = \u02bc = \ = \-
Ken
Sorry for the misleading information. When I wrote my previous e-mail I was experimenting with the ICU locale explorer on the Internet, and what I wrote was accurate for that. Unfortunately, the ICU explorer uses the current version of ICU, which is 4.2 at this point, but FieldWorks 6.0 and 6.0.1 are still using ICU 4.0. It turns out the backslash (\) quote is a new feature of ICU 4.2 that is not currently supported in FieldWorks.
I’ve added this note to my technical documentation:
· ICU provides a useful Web site for testing collation rules at http://demo.icu-project.org/icu-bin/locexp. Click the root language link, then near the bottom under Collation rules, click Demo.
Note: This demo uses the currently released version of ICU which may not apply to what is currently available in FieldWorks. For example FieldWorks 6.0 and 6.0.1 use ICU 4.0, but as of this date, ICU has released ICU 4.2 which adds a new \ quoting character. So although the demo works with \, FieldWorks will not currently accept this.
and the previous explanation should have been:
Note: In an ICU rule, any non-alphanumeric ASCII character is reserved for syntax characters. If you need to control collation of any of these characters, you must quote them with a \ (only ICU 4.2 or greater) or enclose them in apostrophes. A single apostrophe can also be represented as two apostrophes. Here are some examples of alphanumeric and punctuation characters with or without the \u syntax. (See the Note in the second bullet under section 8.1 regarding the backslash limitation.)
a                 letter a
\u0061            letter a
3                 digit 3
ng                digraph ng
'ng'              digraph ng (quotes are optional for alphanumeric characters)
\u006e\u0067      digraph ng
\-                hyphen [not currently in FW]
'-'               hyphen
' '               space
\                 space (there is a space following the \) [not currently in FW]
'\u0020'          space
\\u0020           space [not currently in FW]
\'                apostrophe [not currently in FW]
''                apostrophe
\u0027\u0027      apostrophe
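If you build these rules by hand often, the quoting can be sketched as a small Python helper. This is only an illustration of the apostrophe-based forms in the table above (no backslash quoting), and it simply builds a text string to paste in. The function names quote_for_rule and build_ignore_rule are hypothetical, and the sketch assumes every non-ASCII character fits in four hex digits.

```python
def quote_for_rule(item):
    """Quote one item for a rule using apostrophe quoting only."""
    if not item:
        raise ValueError("empty item")
    if item == "'":
        return "''"                      # an apostrophe is written as two apostrophes
    if item.isascii() and item.isalnum():
        return item                      # letters and digits need no quoting
    if "'" in item:
        raise ValueError("mixed items containing an apostrophe are not handled here")
    if all(0x7F < ord(ch) <= 0xFFFF for ch in item):
        return "".join("\\u%04x" % ord(ch) for ch in item)  # for example \u02bc
    return "'" + item + "'"              # for example ' ' or '-'

def build_ignore_rule(items):
    """Join the quoted items with = after the 'last tertiary ignorable' anchor."""
    return "&[last tertiary ignorable] = " + " = ".join(quote_for_rule(i) for i in items)

rule = build_ignore_rule(["\u02bc", " ", "-"])
# rule is the text: &[last tertiary ignorable] = \u02bc = ' ' = '-'
```

Calling build_ignore_rule(["'", "?", "-", " ", "ng"]) gives the apostrophe-quoted form of the longer rule that ignores an apostrophe, a question mark, a hyphen, a space, and the ng digraph.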
or [not currently in FW]
&[last tertiary ignorable] = \' = \? = \- = \ = ng
This could also be represented as
&[last tertiary ignorable] = \u0027\u0027 = '\u003f' = '\u002d' = '\u0020' = \u006e\u0067
If you simply want to ignore all punctuation as well as white space, you can use the following rule
[alternate shifted]
One answer to Robert’s specific question, how to ignore U+02BC, space, and hyphen, is
&[last tertiary ignorable] = \u02bc = ' ' = '-'
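To preview what a rule like this does to multi-word entries before committing to it, here is a minimal Python sketch. It is not FieldWorks or ICU. It just compares a plain code point sort, where space and hyphen come before letters, with a sort that skips U+02BC, space, and hyphen. The sample entries and the name entry_key are invented for the example.

```python
IGNORED = {"\u02bc", " ", "-"}  # the three characters named in the rule above

def entry_key(entry, ignored=IGNORED):
    """Sort key with the ignored characters removed."""
    return "".join(ch for ch in entry.casefold() if ch not in ignored)

entries = ["sun-dried", "sunburn", "sun hat", "sundial",
           "sun-up", "sunset", "sun deck"]

# Plain string sort: spaced entries first, then hyphenated, then one-word.
by_code_point = sorted(entries)
# ["sun deck", "sun hat", "sun-dried", "sun-up", "sunburn", "sundial", "sunset"]

# Ignoring space and hyphen: everything is merged into one sequence.
merged = sorted(entries, key=entry_key)
# ["sunburn", "sun deck", "sundial", "sun-dried", "sun hat", "sunset", "sun-up"]
```

The first list resembles the grouping described in the original question, and the second shows the merged order you would be choosing instead.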
Ken