non-latin and non-cyrillic characters


abbas shahzadeh

May 26, 2011, 11:50:50 AM
to citep...@googlegroups.com
Dear Mr Bennet, Hi.
1. In citeproc, names in non-Latin, non-Cyrillic scripts are handled in a way that is not appropriate for Persian names. Persian and Arabic names are written in ordinary order (Firstname then Lastname, or Lastname, Firstname). Persian and Arabic characters are in the U+0600 – U+06FF Unicode range. How can we solve this problem?
2. It seems that new versions of citeproc-js are not compatible with Zotero 2.1 (mainstream). Is this true, or am I doing something wrong?
Best regards,
Abbas

Avram Lyon

May 26, 2011, 1:36:38 PM
to citep...@googlegroups.com
On Thu, May 26, 2011 at 7:50 PM, abbas shahzadeh <a.sha...@gmail.com> wrote:
> Dear Mr Bennet, Hi.
> 1. In citeproc non-latin or non-cyrillic names are handled in a certain way
> that is not appropriate for Persian names. Persian and Arabic names are
> written in ordinary order (Firstname then Lastname or Lastname, Firstname).
> Persian and Arabic characters are in (U+0600 – U+06FF) unicode range. How
> can we solve this problem?

This actually affects a whole bunch of scripts, now that I think of it:
Georgian, Armenian, Arabic, Hebrew, a number of Indic scripts,
Canadian Syllabics, Cherokee, and more...

It might be easier to special-case the East Asian scripts, actually
(Chinese, Japanese, Korean, Vietnamese, Thai?, what else?)

Avram
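
The special-casing suggested above could be sketched roughly as follows. This is only an illustration, not citeproc-js code: the function names, the choice of code point ranges, and the decision to join family-first names without a space are all assumptions made for the example, and the range list is deliberately incomplete.

```python
# Hypothetical sketch: none of these names come from citeproc-js.
# Code point ranges for a few scripts whose names are usually written
# family-first. The list is illustrative, not exhaustive.
FAMILY_FIRST_RANGES = [
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def is_family_first_script(name: str) -> bool:
    """Return True if any character falls in one of the listed ranges."""
    return any(
        low <= ord(ch) <= high
        for ch in name
        for low, high in FAMILY_FIRST_RANGES
    )


def display_name(given: str, family: str) -> str:
    """Family-first for the special-cased scripts, given-first otherwise."""
    if is_family_first_script(family + given):
        # Assumption: no separator between the parts for these scripts.
        return family + given
    return f"{given} {family}"
```

With this arrangement, any script that is not explicitly listed (Arabic, Hebrew, Georgian, and so on) falls through to the ordinary given-then-family order.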

Frank Bennett

May 26, 2011, 5:14:25 PM
to citeproc-js
The citeproc-js releases are compatible with mainstream. Zotero
installations require some embedded E4X parsing code. To dump a
Zotero-ready copy of the processor, run ./test.py -Z.

On name formatting, thanks for this feedback. It might take a few
iterations to get the discrimination of language realms right (see
Avram's post about overlapping character sets). Vietnamese is one hard
problem for character-based discrimination between languages -- there
is a risk that a name will not contain the accented characters unique
to Vietnamese. I've handled that case with an optional heuristic,
falling back to an explicit language tag set in the item's "language"
field.
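
A heuristic of this general shape, with a language-tag fallback, might look like the sketch below. It is not the citeproc-js implementation: the function names are hypothetical, and the choice of "distinctive" letters (o-horn, u-horn, and the precomposed vowels at U+1EA0 to U+1EF9) is an assumption made for illustration.

```python
import unicodedata

# Hypothetical sketch, not the citeproc-js implementation.
# Assumption: o-horn and u-horn are treated as distinctive for Vietnamese.
HORN_LETTERS = set("\u01a0\u01a1\u01af\u01b0")


def has_vietnamese_letters(name: str) -> bool:
    """Check for horn letters or precomposed vowels in U+1EA0..U+1EF9."""
    # Normalize first so that decomposed input still matches.
    name = unicodedata.normalize("NFC", name)
    return any(
        ch in HORN_LETTERS or 0x1EA0 <= ord(ch) <= 0x1EF9
        for ch in name
    )


def is_vietnamese_name(name: str, item_language: str = "") -> bool:
    """Try the character heuristic first, then the item's language tag."""
    if has_vietnamese_letters(name):
        return True
    # Fallback: primary subtag of the item's "language" field.
    return item_language.lower().split("-")[0] == "vi"
```

A name typed without diacritics ("Nguyen") is only recognized when the item carries a language tag such as "vi" or "vi-VN", which is the risk described above.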

For Persian, I guess the first question is whether there is value in
character-based language discrimination at all. Are Persian name
formatting conventions different from, say, Arabic, and if so, is
there a way to distinguish the two at the character level?

> Best regards,
> Abbas

Frank Bennett

May 26, 2011, 5:18:05 PM
to citeproc-js
On May 27, 2:36 am, Avram Lyon <ajl...@gmail.com> wrote:
Khmer is another. That might be the way to go, but let's see how far
we can get with incremental tweaks for Persian. It's great to be
getting direct feedback from the field!

Frank


>
> Avram

Bruce D'Arcus

May 26, 2011, 5:19:52 PM
to citep...@googlegroups.com

More importantly, what do you do with transliterated names?

Bruce

Avram Lyon

May 26, 2011, 5:26:46 PM
to citep...@googlegroups.com
On Fri, May 27, 2011 at 1:19 AM, Bruce D'Arcus <bda...@gmail.com> wrote:
>> For Persian, I guess the first question is whether there is value in
>> character-based language discrimination at all. Are Persian name
>> formatting conventions different from, say, Arabic, and if so, is
>> there a way to distinguish the two at the character level?
>
> More importantly, what do you do with transliterated names?

The handling for transliterated names depends on the style, and the
language. Frank has worked this out, but we need (ahem) language data
exposed to the processor to make it work.

Presumably we'll want to fold in some of the publicly available locale
data on things like default name part ordering to make this more
robust.

On Fri, May 27, 2011 at 1:18 AM, Frank Bennett <bierc...@gmail.com> wrote:
>> This actually affects a whole bunch of scripts, now that I think of it:
>> Georgian, Armenian, Arabic, Hebrew, a number of Indic scripts,
>> Canadian Syllabics, Cherokee, and more...
>>
>> It might be easier to special-case the East Asian scripts, actually
>> (Chinese, Japanese, Korean, Vietnamese, Thai?, what else?)
>
> Khmer is another. That might be the way to go, but let's see how far
> we can get with incremental tweaks for Persian. It's great to be
> getting direct feedback from the field!

We know already that the current logic gets "Georgian, Armenian,
Arabic, Hebrew, a number of Indic scripts, Canadian Syllabics,
Cherokee, and more..." wrong as it is. (South-)East Asian is the
special case.

Avram

Frank Bennett

May 26, 2011, 7:00:30 PM
to citeproc-js
On May 27, 6:26 am, Avram Lyon <ajl...@gmail.com> wrote:
> On Fri, May 27, 2011 at 1:19 AM, Bruce D'Arcus <bdar...@gmail.com> wrote:
> >> For Persian, I guess the first question is whether there is value in
> >> character-based language discrimination at all. Are Persian name
> >> formatting conventions different from, say, Arabic, and if so, is
> >> there a way to distinguish the two at the character level?
>
> > More importantly, what do you do with transliterated names?
>
> The handling for transliterated names depends on the style, and the
> language. Frank has worked this out, but we need (ahem) language data
> exposed to the processor to make it work.
>
> Presumably we'll want to fold in some of the publicly available locale
> data on things like default name part ordering to make this more
> robust.
>
> On Fri, May 27, 2011 at 1:18 AM, Frank Bennett <biercena...@gmail.com> wrote:
> >> This actually affects a whole bunch of scripts, now that I think of it:
> >> Georgian, Armenian, Arabic, Hebrew, a number of Indic scripts,
> >> Canadian Syllabics, Cherokee, and more...
>
> >> It might be easier to special-case the East Asian scripts, actually
> >> (Chinese, Japanese, Korean, Vietnamese, Thai?, what else?)
>
> > Khmer is another. That might be the way to go, but let's see how far
> > we can get with incremental tweaks for Persian. It's great to be
> > getting direct feedback from the field!
>
> We know already that the current logic gets "Georgian, Armenian,
> Arabic, Hebrew, a number of Indic scripts, Canadian Syllabics,
> Cherokee, and more..." wrong as it is. (South-)East Asian is the
> special case.

Yep. I just want to explore a bit, if we can, before shifting the
character-level logic around. I'm curious about the hard-case
boundaries. Vietnamese was the first one encountered after releasing
the client. It looks like Persian may be another, depending on
name-ordering conventions in the two realms:

"Modern Iranian Persian and Afghan Persian are written using a
modified variant of the Arabic alphabet (see Perso-Arabic script),
which uses different pronunciation and additional letters not found in
Arabic."

http://en.wikipedia.org/wiki/Persian_language#Persian_alphabet

If name-ordering conventions in the two are the same (roughly
speaking), it would reduce the pressure for requiring language tags to
get things to render correctly.

But point taken about the likely good sense of reversing the
conditional. It will take a while to get to it, but I'll definitely
look at changing the logic.

Frank


>
> Avram

abbas shahzadeh

May 26, 2011, 10:15:33 PM
to citep...@googlegroups.com


On Fri, May 27, 2011 at 12:44 AM, Frank Bennett <bierc...@gmail.com> wrote:

> For Persian, I guess the first question is whether there is value in
> character-based language discrimination at all. Are Persian name
> formatting conventions different from, say, Arabic, and if so, is
> there a way to distinguish the two at the character level?

I know some Arabic, and I can distinguish between Arabic first names and last names. I can say for sure that Arabic names are written in (Firstname Lastname) order. The (Lastname [no comma] Firstname) format is never used in Arabic or Persian. I conclude that name formatting conventions are the same in both languages, so character-based detection can be used in this case.
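
As a rough illustration of that conclusion, the check could be written along these lines. The sketch assumes only the U+0600 to U+06FF range mentioned at the start of the thread; the function names and the `inverted` parameter are hypothetical, and the plain comma in the inverted form simply follows the "Lastname, Firstname" wording used here.

```python
# Hypothetical sketch; the names are illustrative, not citeproc-js API.
ARABIC_BLOCK = (0x0600, 0x06FF)  # range cited in the thread


def in_arabic_block(name: str) -> bool:
    """Return True if any character lies in U+0600..U+06FF."""
    low, high = ARABIC_BLOCK
    return any(low <= ord(ch) <= high for ch in name)


def format_perso_arabic_name(given: str, family: str, inverted: bool = False) -> str:
    """Apply the shared Persian/Arabic convention described above.

    Display order is "Firstname Lastname"; the inverted (sort) form is
    "Lastname, Firstname". The comma-less family-first form is never produced.
    """
    if inverted:
        return f"{family}, {given}"
    return f"{given} {family}"
```

Because both languages share the block and the ordering convention, a single test on the code point range would be enough here, with no need to tell Persian from Arabic.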