I do not know about Japanese in Unicode, but ordering in Indian languages,
for instance, just follows the ordinal value of the character. So if
Unicode gives the lowest number to the first Japanese character, and so
on in sequence, then fine. Otherwise you have to write your own customised
ordering. We have to do that in Tamil, for example, where one or two
characters are out of order in the Unicode table.
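To make the difference concrete, here is a small sketch of both cases. The
first sort uses the ordinal (code point) value only; the second uses a
hand-written table. CUSTOM_ORDER is a made-up, deliberately tiny alphabet
for illustration, not a real Japanese or Tamil ordering.

```python
# Python compares strings code point by code point, so a plain sort
# follows the order of the Unicode tables.
words = ["さくら", "アサヒ", "いぬ"]
by_code_point = sorted(words)

# CUSTOM_ORDER is a made-up, deliberately tiny alphabet used only to
# illustrate a hand-written ordering; a real table would list every character.
CUSTOM_ORDER = "あアいイさサ"
RANK = {ch: i for i, ch in enumerate(CUSTOM_ORDER)}


def custom_key(word):
    """Rank listed characters by table position; unlisted characters
    sort after them, in code point order."""
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word]


by_table = sorted(words, key=custom_key)
print(by_code_point)  # hiragana block first, katakana block after it
print(by_table)       # order dictated by CUSTOM_ORDER
```

With the plain sort all hiragana words come before all katakana words,
because that is how the blocks are laid out in the table.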
--
regards
KG
http://lawgon.livejournal.com
Coimbatore LUG rox
http://ilugcbe.techstud.org/
I have a site I am working on where the main language is Japanese.
When I tried to order the fields in my form on http://goeigo.org/loc/tokyo/nakano/ by their Japanese name, it doesn't put them in the right order. It sorts it, but it is in a strange unhelpful kanji order.
How does ordering() order Japanese Unicode text? And how can I get it to follow the standard Japanese order?
Can ordering be language specific?
Cheers,
James Hancock
--
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django...@googlegroups.com.
To unsubscribe from this group, send email to django-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.
Does the “strcoll” function in Python’s standard “locale” module do the
right thing when a Japanese locale is set? If not, can you persuade the
PyICU package to do the right thing?
If either of these works, then you could slurp your results into a
Python list and then sort the list using one of these methods. This
wouldn’t perform too well, of course, if you had thousands and thousands
of results and you wanted the database to peel off the top ten... but if
for some reason collation at the database level isn’t working, doing it
all at the Python level might be better than nothing.
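A minimal sketch of the strcoll route, assuming the results are already in
a Python list. The helper name sort_with_locale and the default locale name
"ja_JP.UTF-8" are my own choices; locale names differ between operating
systems, and whether the resulting order is the desired one depends entirely
on the platform's locale data.

```python
import functools
import locale


def sort_with_locale(strings, locale_name="ja_JP.UTF-8"):
    """Sort strings using the C library's collation for locale_name.

    If that locale is not installed, the current collation locale is
    used instead, so the order may then be plain code point order.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        pass  # locale not available on this machine; keep the current one
    try:
        # strcoll is a two-argument comparison, so wrap it as a key function
        return sorted(strings, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore global state


readings = ["ウエット", "うえっと", "ウェット"]
print(sort_with_locale(readings))
```

setlocale changes process-wide state, which is why the sketch restores the
previous value; in a threaded web server that is still something to be
careful about.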
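# Map each character to its rank in the desired order.
# "char1" .. "char4" stand in for the actual characters.
dd = {
    "char1": 1,
    "char2": 2,
    "char3": 3,
    "char4": 4,
}
# Sort the items on the rank of their first character.
result = sorted(mylist, key=lambda x: dd[x[0]])
Point being, if the db query isn't too slow you could use Python.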
cheers
sam_w
On Thu, Jan 6, 2011 at 7:10 AM, Seth Gordon <se...@ropine.com> wrote:
> On 01/05/2011 07:57 AM, James Hancock wrote:
>> I think it does the same thing, but I was talking about how you cant set
>> 'ordering' under the Meta class in a model.
>> http://docs.djangoproject.com/en/1.2/ref/models/options/#django.db.models.Options.ordering
>>
>> If I set it to a field of Japanese characters, the ordering comes out wrong. At
>> least not in the desired order.
>> How can I set the data collation for just a few fields? And how can I know
>> which of the Japanese to set it to?
>
> Does the “strcoll” function in Python’s standard “locale” module do the
> right thing when a Japanese locale is set? If not, can you persuade the
> PyICU package to do the right thing?
>
> If either of these works, then you could slurp your results into a
> Python list and then sort the list using one of these methods. This
> wouldn’t perform too well, of course, if you had thousands and thousands
> of results and you wanted the database to peel off the top ten... but if
> for some reason collation at the database level isn’t working, doing it
> all at the Python level might be better than nothing.
>
Given that Japanese is not an alphabetic language and mixes syllabic and logographic scripts (the logographic system having a few thousand graphemes), I doubt this kind of trivial idea is going to work correctly. It only works for simple alphabetic writing systems; even diacritics cause an explosion in the number of cases.
--
As others suggested, your best bets are:
1. Use your database's collation support, as suggested by Daniel. MySQL seems to allow a per-column character set (and collation); Postgres's documentation seems to indicate its charset and collation support is per-database.
2. Use Python's ICU bindings and perform your sort in memory in Python using the right locale and collation. ICU should have pretty complete support for that operation.
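To show what "let the database collate" looks like in practice, here is a
sketch that uses SQLite purely as a stand-in, because it ships with Python.
The collation name kana_fold and the table are made up for the example, and
the Python callback is only there because SQLite lets you register one; with
MySQL or Postgres you would pick one of the server's own collations instead.

```python
import sqlite3


def fold_kana(text):
    """Map katakana to hiragana so both scripts collate together."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


def kana_collation(a, b):
    """Comparison callback: negative, zero or positive, like strcoll."""
    a, b = fold_kana(a), fold_kana(b)
    return (a > b) - (a < b)


conn = sqlite3.connect(":memory:")
conn.create_collation("kana_fold", kana_collation)  # hypothetical collation name
conn.execute("CREATE TABLE place (reading TEXT)")
conn.executemany(
    "INSERT INTO place VALUES (?)", [("さくら",), ("アサヒ",), ("いぬ",)]
)

# Default (binary) ordering versus ordering with the registered collation.
binary_rows = [r[0] for r in conn.execute(
    "SELECT reading FROM place ORDER BY reading")]
folded_rows = [r[0] for r in conn.execute(
    "SELECT reading FROM place ORDER BY reading COLLATE kana_fold")]
print(binary_rows)
print(folded_rows)
```

The point is only that the ORDER BY stays in the query, so the database can
still peel off the top ten rows itself.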
Wow, I have no idea what you just said... but, I think I agree.
Man, the Japanese were not thinking about programming when they made their language.
Any other suggestions or ways to correctly sort Japanese Words?
Cheers,
James Hancock
I'm Japanese.
> Any other suggestions or ways to correctly sort Japanese Words?
Though I'm not well versed in ordering Japanese words,
my friend told me about the standard named "JIS X 4061:1996".
That's the standard used for Japanese dictionaries (not Python ones) and
Japanese book indexes.
The ja.wikipedia has an overview of it, if you can read Japanese.
http://ja.wikipedia.org/wiki/日本語文字列照合順番
Also, you can buy the specification of the standard, but it seems to be
in Japanese only.
http://www.webstore.jsa.or.jp/webstore/Com/FlowControl.jsp?lang=en&bunsyoId=JIS+X+4061%3A1996&dantaiCd=JIS&status=1&pageNo=0
I found an implementation in Perl. Maybe there is no Python implementation.
http://search.cpan.org/~sadahiro/Lingua-JA-Sort-JIS-0.05/JIS.pod
I hope this helps.
thanks,
Tetsuya
This supplements what Morimoto-san wrote.
I'm Japanese too.
You must use readings for ordering.
Japanese dictionary ordering is by pronunciation, not by spelling.
In English, 'food' is placed near 'foot'.
In Japanese-style ordering, 'quick' (kwik) would be placed near
'cuisine' (kwizi:n).
(Mmm, this is not a good example...)
And the most important thing is something else.
Even Japanese people can NOT get the correct pronunciation from a string
written in kanji characters alone.
It may sound funny, but it is true. (It is the same problem James
mentioned about 大きい and 大学.)
For example, '角田 純子' (a woman's name) has many readings.
The last name (角田) can be read Kakuta/Kakuda/Sumita/Tsunoda (and maybe
more!!).
The first name (純子) can be read Junko/Sumiko.
And the pairs of these last/first names are .......
The standard (that Morimoto-san mentioned) is very complicated.
Many Japanese people don't know about it, and some dictionary
publishers have their own style. :-(
If you can get and use the reading string in hiragana (or katakana),
the result of plain Unicode ordering is not so far off.
And if your OS has a correct locale, simply use locale.strcoll() with the
reading string.
I tested on Windows XP (Japanese codepage 932 console) and FreeBSD
7 (Japanese UTF-8 locale console).
WinXP is OK and FreeBSD is NG on Japanese strings.
WinXP is NG and FreeBSD is OK on German strings with umlauts.
The results are attached at the end of this mail.
FYI, I made a loose translation of the method in the Wikipedia entry
Morimoto-san mentioned.
1. Convert kanji characters and alphabetical words (loan words) to their
   kana reading.
2. Replace characters:
   i. Replace small characters by base characters (ex. 「ぁ」 to 「あ」,
      「ゃ」 to 「や」, 「っ」 to 「つ」).
   ii. Replace voiced and half-voiced consonants (some plosives and
       fricatives) by base characters
       (ex. 「が」[ga] to 「か」[ka], 「ば」[ba] and 「ぱ」[pa] to 「は」[ha]).
3. Replace the long sound mark 「ー」 by the pronounced character, according
   to the preceding character.
   If the preceding character is
   「あ」「か」「さ」「た」「な」「は」「ま」「や」「ら」「わ」 then 「あ」
   「い」「き」「し」「ち」「に」「ひ」「み」「り」「ゐ」 then 「い」
   「う」「く」「す」「つ」「ぬ」「ふ」「む」「ゆ」「る」 then 「う」
   「え」「け」「せ」「て」「ね」「へ」「め」「れ」「ゑ」 then 「え」
   「お」「こ」「そ」「と」「の」「ほ」「も」「よ」「ろ」「を」 then 「お」
   「ん」 then 「ん」
   otherwise keep 「ー」
   (ex. 「あーるぬーぼー」[a-runu-bo-] (aka art nouveau) -> 「ああるぬうぼお」)
4. Replace the repeat character 「ゝ」 with the preceding character,
   if a preceding character exists and it is not 「ー」 (long sound).
5. Sort the strings in the following order:
   「あ」「い」「う」「え」「お」「か」「き」「く」「け」「こ」「さ」「し」「す」「せ」「そ」「た」「ち」「つ」「て」「と」「な」「に」
   「ぬ」「ね」「の」「は」「ひ」「ふ」「へ」「ほ」「ま」「み」「む」「め」「も」「や」「ゆ」「よ」「ら」「り」「る」「れ」「ろ」「わ」「ゐ」
   「ゑ」「を」「ん」「ゝ」「ー」
6. If two strings have the same priority after step 5, use the rules below:
   i. consonant order: unvoiced (k, s, t, h) > voiced (g, z, d, b) > half voiced (p)
      (ex. 「は」>「ば」>「ぱ」, similar to diacritical marks: 'Kloster' and
      'Klöster')
   ii. long sound > small character > repeat character > other
       character
   iii. hiragana > katakana
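For anyone who wants to try this in Python, steps 2 to 5 can be sketched
roughly as below. This is my own reading of the loose translation above, not
an implementation of JIS X 4061: it assumes the input is already a kana
reading (step 1 is not attempted), it folds katakana to hiragana first, and
it does not implement the step 6 tie-breakers (ties simply fall back to code
point order). All function and table names are made up for the example.

```python
import unicodedata

# Step 5 order, copied from the list above.
GOJUON = ("あいうえおかきくけこさしすせそたちつてとなにぬねの"
          "はひふへほまみむめもやゆよらりるれろわゐゑをんゝー")
RANK = {ch: i for i, ch in enumerate(GOJUON)}

# Step 2-i: small kana -> base kana.
SMALL = dict(zip("ぁぃぅぇぉっゃゅょゎ", "あいうえおつやゆよわ"))

# Step 3: the character a long sound mark turns into after each character.
VOWEL_ROWS = {
    "あ": "あかさたなはまやらわ",
    "い": "いきしちにひみりゐ",
    "う": "うくすつぬふむゆる",
    "え": "えけせてねへめれゑ",
    "お": "おこそとのほもよろを",
    "ん": "ん",
}
LONG_VOWEL = {ch: vowel for vowel, row in VOWEL_ROWS.items() for ch in row}


def to_hiragana(text):
    """Assumption of this sketch: fold katakana (and its repeat marks)
    to hiragana before applying the rules."""
    return "".join(
        chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" or c in "ヽヾ" else c
        for c in text
    )


def strip_voicing(ch):
    """Step 2-ii: NFD splits が into か plus a combining mark; keep the base."""
    return unicodedata.normalize("NFD", ch)[0]


def primary_key(reading):
    """Apply steps 2 to 4 to a kana reading."""
    out = []
    for ch in to_hiragana(reading):
        ch = strip_voicing(ch)
        ch = SMALL.get(ch, ch)
        prev = out[-1] if out else None
        if ch == "ー" and prev in LONG_VOWEL:                    # step 3
            ch = LONG_VOWEL[prev]
        elif ch == "ゝ" and prev is not None and prev != "ー":   # step 4
            ch = prev
        out.append(ch)
    return "".join(out)


def sort_key(reading):
    """Step 5 ranks; the raw reading is a simplified tie-breaker only."""
    primary = tuple(RANK.get(ch, len(RANK)) for ch in primary_key(reading))
    return (primary, reading)


print(primary_key("あーるぬーぼー"))
print(sorted(["かさ", "がく", "パン", "はな"], key=sort_key))
```

Note that 「がく」 now sorts before 「かさ」, because voicing is ignored in the
first pass; a plain code point sort puts 「かさ」 first.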
locale test result is below
On FreeBSD 7(Japanese utf-8 locale console)
> python
Python 2.5.4 (r254:67916, Oct 8 2009, 15:59:07)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'de_DE.ISO8859-1')
'de_DE.ISO8859-1'
>>> locale.strcoll(u'Kloster', u'Klöster')
-4
>>> locale.strcoll(u'Klosteranlage', u'Klöster')
97
(OK: Kloster->Klöster->Klosteranlage)
>>> locale.setlocale(locale.LC_ALL, 'ja_JP.UTF-8')
'ja_JP.UTF-8'
>>> locale.strcoll(u'ウエット',u'ウェット')
1
>>> locale.strcoll(u'ウエット',u'ウェッド')
1
>>> locale.strcoll(u'ウエット',u'ウェッン')
1
>>> locale.strcoll(u'ウエット',u'ウェッタ')
1
>>> locale.strcoll(u'ウエット',u'うえっと')
96
(NG....)
on WindowsXP(Japanese Shift_JIS locale commandline)
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit
(Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'German_Germany.1252')
'German_Germany.1252'
>>> locale.strcoll(u'Kloster', u'Klöster')
1
>>> locale.strcoll(u'Klosteranlage', u'Klöster')
1
>>> locale.strcoll(u'Klosteranlage', u'Kloster')
1
>>> locale.strcoll(u'Klost', u'Kloster')
-1
>>> locale.strcoll(u'Klost', u'Klöster')
1
(NG on ''Japanese(ShiftJIS) codepage'' Windows:
Klöster->Klost->Kloster->Klosteranlage)
>>> locale.setlocale(locale.LC_ALL, 'Japanese_Japan.932')
'Japanese_Japan.932'
>>> locale.strcoll(u'ウエット',u'ウェット')
1
>>> locale.strcoll(u'ウエット',u'ウェッド')
-1
>>> locale.strcoll(u'ウエッド',u'ウェット')
1
>>> locale.strcoll(u'ウエッド',u'ウェッド')
1
(OK: ウェット->ウエット->ウェッド->ウエッド)
>>> locale.strcoll(u'ウエット',u'ウェッン')
-1
>>> locale.strcoll(u'ウエット',u'ウェッタ')
1
(OK: ウェッタ->ウエット->ウェッン)
>>> locale.strcoll(u'ウエット',u'うえっと')
-1
>>> locale.strcoll(u'ウエッド',u'うえっと')
1
>>> locale.strcoll(u'ウエット',u'うえっど')
-1
>>> locale.strcoll(u'うえっと',u'うえっど')
-1
>>> locale.strcoll(u'ウエッド',u'うえっど')
-1
(almost OK:ウエット->うえっと->ウエッド->うえっど hiragana/katakana in reverse from
standard?)
HTH
Keishi Katoux
On Jan 6, 10:39 pm, Tetsuya Morimoto <tetsuya.morim...@gmail.com>
wrote:
> Hi, James
>
> I'm Japanese.
>
> > Any other suggestions or ways to correctly sort Japanese Words?
>
> Though I'm not well versed in ordering Japanese words,
> my friend told me the standards named "JIS X 4061:1996".
>
> That's the standards using for Japanese dictionary(not python) or
> Japanese book index.
> The ja.wikipedia has an overview of that if you can read Japanese.
> http://ja.wikipedia.org/wiki/日本語文字列照合順番
>
> Also, you can buy the specification of the standards, but it seems
> Japanese only.
> http://www.webstore.jsa.or.jp/webstore/Com/FlowControl.jsp?lang=en&bu...
>
> I found the implementation with Perl. Maybe, there is no Python implementation.
> http://search.cpan.org/~sadahiro/Lingua-JA-Sort-JIS-0.05/JIS.pod
>
> I hope this helps.
>
> thanks,
> Tetsuya
>
> If you have code or if you can do it, thanks to contact us: contact
> ///a///t//// neediz \\\.com
That's not how mailing lists work.
[0] http://docs.python.org/library/locale.html#locale.strxfrm
[1] http://docs.python.org/library/locale.html#locale.strcoll
The question posed by Denys Poulat on Dec 14 2011 is mostly answered
at the beginning of the long thread that it now appears at the end of,
particularly the responses of Tetsuya Morimoto and Keishi Katoux on
Jan 6 2011. What they say can be summarized as: the problem is well
understood but difficult to solve.
I think the best way to understand the problem is to work back from
the end. The goal is to be able to produce the “right” order using
plain native sorting: Python's built-in sort or, for Django's ordering
option, the ORDER BY that Django hands to the database.
Native python sorting is very fast because it happens in C with no
special conditional logic or calls back out to python functions. But
to get the right results, it needs to be given strings which, when
sorted natively, produce the right ordering. Assuming the ordering
is well-defined and consistent (i.e. no A < B and B < C and C < A) and
any external information that’s needed is available, it should always
be possible to define a function that will transform the raw (user-
visible) string into a string with the desired sorting properties.
Once you have the function, you run it on each raw string and add the
resultant transformed string to the record that has the raw string,
sort on the transformed string, and then output the records with the
raw strings in the correct order. If you permanently store the
transformed string in the record then you will only have to run the
function once for each record (or rather, each time the raw string
changes).
locale.strxfrm() is exactly such a function, but it’s not clear
whether it does what is actually desired on a given platform. This is
not a showstopper, however, because you can create your own transform
function.
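As a rough illustration of the "transform once, store, sort natively"
pattern described above, here is a sketch with a made-up Record class and a
made-up transform (make_sort_key). The transform only normalises half-width
katakana and folds katakana to hiragana; it is a placeholder for whatever
real rule you need.

```python
import unicodedata
from dataclasses import dataclass, field


def make_sort_key(raw):
    """Hypothetical transform: NFKC-normalise (half-width katakana become
    full-width), then fold katakana to hiragana."""
    text = unicodedata.normalize("NFKC", raw)
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c for c in text)


@dataclass
class Record:
    name: str                          # raw, user-visible string
    sort_key: str = field(init=False)  # stored alongside the raw string

    def __post_init__(self):
        # Computed once here; it must be recomputed whenever name changes.
        self.sort_key = make_sort_key(self.name)


records = [Record("さくら"), Record("アサヒ"), Record("ｲﾇ")]
records.sort(key=lambda r: r.sort_key)  # plain native sort on the stored key
names = [r.name for r in records]
print(names)
```

The output still shows the raw strings; only the order comes from the
stored key.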
The simplest kind of transform produces a single transformed sequence
whose content corresponds directly to that of the raw string. But
such simple transforms often fail to correctly represent real-life
sorting rules for the reason that the real-life rules often have two
or more phases of which the first defines the primary sort and the
second and later define “tie-breaker rules” for items which are
different but sort equal in the preceding sort. The easiest way to
represent multi-phase sorts is to produce a separate transformed
string for each phase and sort on them in order. The results can be
kept as separate sort keys, but this exposes the internals of the
transform. An alternative, which I think is generally preferable, is
to concatenate the results, using as delimiter a character that cannot
appear within the strings and compares lower than any character
that can appear in them. (Input to a transform will normally be
unicode or some encoding of it. Output may be unicode or 8-bit ASCII;
other encodings are safe only if their byte order is known to match
the intended character order.)
Multi-phase transforms are likely to be of use wherever there are
“different flavors” of the same basic character and sorting is by the
basic character with the flavors ordered only where the whole string
otherwise compares identical. Examples are long vowels, double
consonants (where single and double are supposed to sort the same),
known acceptable spelling variants, and diacritics of many sorts. In
Japanese it would apply to long vowels, double consonants (little
tsu), and “muddied” (voiced) consonants (indicated in kana with a
diacritic that looks like a double quote).
I’m not qualified to create authentic Japanese examples, but a simple
“as if” one would be the following. Suppose there is a rule that says
that you want to ignore voicing in the basic sort but put voiced sound
after unvoiced sound in words with the same basic sort order. For
example, you would want these (nonsense) words to be ordered as
listed:
tatama – tatami – tadami – datami – dadami – tatamu …
The right sorting could be gotten with the following transforms:
TaTama@TaTama – TaTami@TaTami – TaTami@Tatami – TaTami@taTami –
TaTami@tatami – TaTamu@TaTamu …
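The nonsense-word example can be reproduced with a few lines of Python. This
is only the "as if" rule from above (t and d are the same basic sound,
unvoiced before voiced), with a function name I made up; '@' works as the
delimiter because it compares lower than every letter.

```python
def transform(word):
    """Build 'primary@secondary' for the made-up t/d voicing rule.

    primary:   voicing ignored, every t or d written as 'T'
    secondary: unvoiced t written as 'T', voiced d written as 't'
               ('T' compares lower than 't', so unvoiced sorts first)
    """
    primary = word.replace("d", "t").replace("t", "T")
    secondary = word.replace("t", "T").replace("d", "t")
    return primary + "@" + secondary


words = ["tatamu", "dadami", "tadami", "tatami", "datami", "tatama"]
for w in sorted(words, key=transform):
    print(w, transform(w))
```

A plain sort of the same list would put both d-initial words ahead of all
the t-initial ones.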
If Japanese were written entirely in kana (syllabic characters), a
transform function such as described would be enough to solve the
problem. Unfortunately, as the early posters point out, names are
written mostly or all in kanji (ideographic characters) which cannot
be reliably mapped to kana sequences due to the “multiple reading”
problem – which I believe is especially common in proper names.
I can see at least five possible solutions to this problem:
(1) Make up readings for kanji sequences based on the kanji
character reading data in the unihan or similar database – not worth
considering, the results will often be completely wrong
(2) Use a kanji-to-kana name lookup table with good name coverage, at
least for last names – if you can get one
(3) Use a Japanese IME library (which will have its own internal
lookup tables) - ditto
(4) Elicit the kana spelling of their name from the user when they
create their profile.
To me (4) elicitation, if viable for the app, seems like the best
solution to the problem, in terms of ease of implementation,
minimization of dependencies, and reliability of results. Only
challenge would be if users skipped the field or wrote garbage into
it. But as I understand it Japanese people are used to giving the
correct (= desired) reading for their names, as including furigana
(little kana) annotations on their business cards, saying/spelling
their names over the phone, etc.
A variant of (4) that might make sense in an app that is to some
degree bilingual between Japanese and English or another European
language is:
(5) Elicit the romaji (Latin character) spelling of the user’s name.
To use a kana-based transform function, (5) would require translating
the romaji into kana before running the transform, but it’s pretty
easy to write a romaji-to-kana translation function that does the
right thing with well-formed input.
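A romaji-to-kana translation along these lines might look like the sketch
below. It is a minimal table-driven version under my own assumptions: it
covers the basic syllables, common Hepburn spellings and doubled consonants,
it does not handle macrons or apostrophes (so it cannot tell "ken'ichi" from
"kenichi"), and it raises an error on anything it does not recognise.

```python
VOWELS = "aiueo"
ROWS = {
    "": "あいうえお", "k": "かきくけこ", "s": "さしすせそ", "t": "たちつてと",
    "n": "なにぬねの", "h": "はひふへほ", "m": "まみむめも", "r": "らりるれろ",
    "g": "がぎぐげご", "z": "ざじずぜぞ", "d": "だぢづでど", "b": "ばびぶべぼ",
    "p": "ぱぴぷぺぽ",
}
TABLE = {c + v: kana for c, row in ROWS.items() for v, kana in zip(VOWELS, row)}
TABLE.update({"ya": "や", "yu": "ゆ", "yo": "よ", "wa": "わ", "wo": "を",
              "shi": "し", "chi": "ち", "tsu": "つ", "fu": "ふ", "ji": "じ"})

# Contracted sounds such as kya / sha / ju: i-row kana plus a small ya/yu/yo.
I_ROW = {"k": "き", "g": "ぎ", "n": "に", "h": "ひ", "b": "び", "p": "ぴ",
         "m": "み", "r": "り", "sh": "し", "ch": "ち", "j": "じ"}
for cons, kana in I_ROW.items():
    glide = "" if cons in ("sh", "ch", "j") else "y"
    for vowel, small in zip("auo", "ゃゅょ"):
        TABLE[cons + glide + vowel] = kana + small


def romaji_to_kana(text):
    """Greedy longest-match conversion of well-formed romaji to hiragana."""
    text = text.lower()
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        # A doubled consonant (other than n) becomes a small tsu.
        if ch.isalpha() and ch not in "aiueon" and text[i + 1:i + 2] == ch:
            out.append("っ")
            i += 1
            continue
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += size
                break
        else:
            if ch != "n":
                raise ValueError("cannot convert %r at position %d" % (text, i))
            out.append("ん")  # an n not followed by a vowel or y
            i += 1
    return "".join(out)


for name in ["tanaka", "honda", "hattori", "junko"]:
    print(name, romaji_to_kana(name))
```

The result can then be fed to whatever kana-based transform you use for
sorting.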
Hope all this makes sense.
After my previous post, I found a library named Unihandecode, which
addresses the "multiple reading" problem. The library is intended to
convert Japanese kanji to their kana readings, written in ASCII. It
might help you.
http://www.slideshare.net/miurahr/unihandecode-a-transliterate-library-for-unicode
https://launchpad.net/unihandecode
thanks,
Tetsuya