Use RTL letters (like Arabic or Hebrew) within the URI

חגי רצבי‎

Dec 10, 2019, 3:18:47 AM

to akomantoso-xml

Hi everyone,

We are working on AKN in Hebrew and are debating the proper use of eId.

The debate is about using Hebrew letters for the numbering of the law's paragraphs within the eId.

The problem is that this mixes RTL and LTR characters.

For example:

point 64ב, subpoint א, paragraph 3, looks like the below:

eid=" /akn/il/act/1961-08-01/@he/!main#point_64ב_א_3" <- It is mixed!

We need to translate it to Arabic too, which has the same problem.

How can we do this without mixing text directions within the URI?

I understand that the 'original' letters should be integrated into the URI, right?

Thanks,

Hagay

Fabio Vitali

Dec 10, 2019, 6:52:00 AM

to akomant...@googlegroups.com

Dear Hagay,

this is an interesting problem, and I am not sure people used to Latin-based scripts can really get to the deeper aspects of it easily.

Anyway, the issue is not limited to Akoma Ntoso IRIs; it applies in general to any IRI that mixes LTR and RTL characters within the same string. For this, we should rely on the standard mechanisms of URIs, such as section 4, "Bidirectional IRIs for Right-to-Left Languages", of RFC 3987 (https://www.ietf.org/rfc/rfc3987.txt), which suggests using U+202A, LEFT-TO-RIGHT EMBEDDING (LRE), and U+202C, POP DIRECTIONAL FORMATTING (PDF), around the blocks of letters that switch direction. I also checked https://en.wikipedia.org/wiki/Right-to-left_mark, which says basically the same thing but with U+200F, RIGHT-TO-LEFT MARK (RLM). So I do not really know which is which, but I assume you can understand these things better than I can.
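
A minimal sketch of the wrapping idea in Python, assuming the control characters are added only at display time and never stored in the href or eId. RFC 3987 section 4.1 describes rendering the whole IRI as if it were preceded by LRE and followed by PDF, and says the IRI itself must not contain bidi formatting characters, so this sketch wraps the full string rather than individual letter blocks. The function names are hypothetical, and the Hebrew letters are written as escapes to keep the source readable in an LTR editor.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# LRM, RLM, LRE, RLE, PDF, LRO, RLO
BIDI_CONTROLS = {"\u200E", "\u200F", "\u202A", "\u202B", "\u202C", "\u202D", "\u202E"}


def wrap_for_display(iri: str) -> str:
    """Return the IRI wrapped in LRE ... PDF, for showing it to a reader."""
    return f"{LRE}{iri}{PDF}"


def strip_bidi_controls(text: str) -> str:
    """Remove bidi formatting characters, e.g. from text copied back from a display."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


# \u05D1 is HEBREW LETTER BET, \u05D0 is HEBREW LETTER ALEF
stored = "/akn/il/act/1961-08-01/@he/!main#point_64\u05D1__point_\u05D0__point_3"
shown = wrap_for_display(stored)
```

The stored value stays free of invisible characters, so the XML remains as readable as before; only the string handed to the browser or PDF renderer carries the embedding.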

BTW, it seems to me in your example that you are mixing the concepts of AKN URIs and AKN identifiers. Within the eId attribute you should only put the part after the #, that is to say:

NOT

> eid=" /akn/il/act/1961-08-01/@he/!main#point_64ב_א_3"

BUT

> eId="point_64ב_א_ב"

Finally, I assume that subpoint and paragraph correspond to actual structures in the XML document, and therefore you cannot omit their names, according to the AKN Naming Convention. So:

NOT

> eId="point_64ב_א_ב"

BUT

> eId="point_64ב__subpoint_א__paragraph_ב"

and somewhere there, I do not know exactly, you should place the LRE and/or PDF and/or RLM characters.
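
The naming pattern above can be sketched in Python as follows. This is only an illustration of the "name_number" parts joined by a double underscore, as shown in the BUT example; `build_eid` is a hypothetical helper, and the element names are taken from the thread rather than checked against the AKN Naming Convention.

```python
def build_eid(parts):
    """Join (element_name, number) pairs into an eId-style string.

    Each pair becomes "name_number"; pairs are joined with a double underscore.
    """
    return "__".join(f"{name}_{num}" for name, num in parts)


BET = "\u05D1"   # HEBREW LETTER BET
ALEF = "\u05D0"  # HEBREW LETTER ALEF

# point 64-bet, subpoint alef, paragraph 3 (the example from the first message)
eid = build_eid([("point", "64" + BET), ("subpoint", ALEF), ("paragraph", "3")])
```

Building the string from its logical parts avoids typing mixed-direction text by hand, which is where most of the visual confusion arises.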

Let us know of your experiments!

Fabio

--

> --
> You received this message because you are subscribed to the Google Groups "akomantoso-xml" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to akomantoso-xm...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/akomantoso-xml/2046c498-d1ec-461c-aa7b-a7b9b9ee4c18%40googlegroups.com.

--

Fabio Vitali
Dept. of Informatics
Univ. of Bologna ITALY
phone: +39 051 2094872
e-mail: fa...@cs.unibo.it
http://vitali.web.cs.unibo.it/

The sage and the fool
go to their graves
alike in this respect:
both believe the sage to be a fool.
Where, then, may wisdom be found?
Qi, "Neither Yes nor No", The codeless code

Ashok Hariharan

Dec 10, 2019, 7:04:34 AM

to akomant...@googlegroups.com

Hagay --

You can have a different logical representation and visual representation, depending on how you are viewing the identifier and on whether it mixes LTR and RTL letters. It is additionally confusing because, under pure LTR circumstances, we read AKN identifiers from right to left!

e.g. for this:

article_أ__chapter_ب__paragraph_ج

(paragraph C in chapter B in article A), the logical and visual representations are the same, since there are breaking Latin characters between the Arabic letters.

article_أ__ب__ج

But for the above, the visual presentation appears reversed (i.e. it looks like A in B in article C). This is OK (the logical representation is "unreversed"), because contiguous Arabic letters are read from right to left. However, remember that to read the AKN IRI in this case you have to read the id order from left to right (instead of right to left), because the visual presentation is rendered RTL. If you read it like that, the visual representation matches the logical representation.
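
A rough way to tell the two cases apart in Python, using the standard `unicodedata` module. This is a heuristic sketch, not an implementation of the Unicode bidirectional algorithm: it only asks whether two RTL letters follow each other with no strong LTR letter in between, which is the situation where a renderer displays that stretch right to left. The function name is hypothetical, and the Arabic letters are written as escapes.

```python
import unicodedata

RTL = {"R", "AL"}          # strong right-to-left classes (Hebrew, Arabic)
STRONG = {"L", "R", "AL"}  # strong classes; digits and underscores are not strong


def has_contiguous_rtl(identifier: str) -> bool:
    """True if two RTL letters are not separated by any strong LTR character."""
    previous = None
    for ch in identifier:
        cls = unicodedata.bidirectional(ch)
        if cls not in STRONG:
            continue  # skip digits, underscores and punctuation
        if cls in RTL and previous in RTL:
            return True
        previous = cls
    return False


# \u0623, \u0628, \u062C are the three Arabic letters used in the examples above
with_names = "article_\u0623__chapter_\u0628__paragraph_\u062C"
without_names = "article_\u0623__\u0628__\u062C"

same_order = not has_contiguous_rtl(with_names)  # Latin names break the RTL runs
reordered = has_contiguous_rtl(without_names)    # letters form one RTL run
```

Note that the check also flags a single number written with several RTL letters; that is still a stretch that must be read right to left, so it is treated the same way here.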

See https://www.w3.org/International/iri-edit/draft-duerst-iri.html, which has a good explanation, especially the section on bidi IRIs with examples:

4.4 Examples

This section gives examples of bidirectional IRIs, in Bidi Notation. It shows legal IRIs with the relationship between logical and visual representation, and explains how certain phenomena in this relationship may look strange to somebody not familiar with bidirectional behavior, but familiar to users of Arabic and Hebrew. It also shows what happens if the restrictions given in Section 4.2 are not followed. The examples below can be seen at [BidiEx], in Arabic, Hebrew, and Bidi Notation variants.

To read the bidi text in the examples, read the visual representation from left to right until you encounter a block of rtl text. Read the rtl block (including slashes and other special characters) from right to left, then continue at the next unread ltr character.

Example 1: A single component with rtl characters is inverted:
logical representation: http://ab.CDEFGH.ij/kl/mn/op.html
visual representation: http://ab.HGFEDC.ij/kl/mn/op.html
Components can be read one-by-one, and each component can be read in its natural direction.

Example 2: More than one consecutive component with rtl characters is inverted as a whole:
logical representation: http://ab.CDE.FGH/ij/kl/mn/op.html
visual representation: http://ab.HGF.EDC/ij/kl/mn/op.html
A sequence of rtl components is read rtl, in the same way as a sequence of rtl words is read rtl in a bidi text.

Example 3: All components of an IRI (except for the scheme) are rtl. All rtl components are inverted overall:
logical representation: http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV
visual representation: http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA
The whole IRI (except the scheme) is read rtl. Delimiters between rtl components stay between the respective components; delimiters between ltr and rtl components don't move.

Example 4: Several sequences of rtl components are each inverted on their own:
logical representation: http://AB.CD.ef/gh/IJ/KL.html
visual representation: http://DC.BA.ef/gh/LK/JI.html
Each sequence of rtl components is read rtl, in the same way as each sequence of rtl words in an ltr text is read rtl.

Example 5: Example 2, applied to components of different kinds:
logical representation: http://ab.cd.EF/GH/ij/kl.html
visual representation: http://ab.cd.HG/FE/ij/kl.html
The inversion of the domain name label and the path component may be unexpected, but is consistent with other bidi behavior. For reassurance that the domain component really is "ab.cd.EF", it may be helpful to read aloud the visual representation following the bidi algorithm. After "http://ab.cd." one reads the RTL block "E-F-slash-G-H", which corresponds to the logical representation.

Example 6: Same as example 5, with more rtl components:
logical representation: http://ab.CD.EF/GH/IJ/kl.html
visual representation: http://ab.JI/HG/FE.DC/kl.html
The inversion of the domain name labels and the path components may be easier to identify because the delimiters also move.

Example 7: A single rtl component with included digits:
logical representation: http://ab.CDE123FGH.ij/kl/mn/op.html
visual representation: http://ab.HGF123EDC.ij/kl/mn/op.html
Numbers are written ltr in all cases, but are treated as an additional embedding inside a run of rtl characters. This is completely consistent with usual bidirectional text.

Example 8 (not allowed): Numbers at the start or end of a rtl component:
logical representation: http://ab.cd.ef/GH1/2IJ/KL.html
visual representation: http://ab.cd.ef/LK/JI1/2HG.html
The sequence '1/2' is interpreted by the bidi algorithm as a fraction, fragmenting the components and leading to confusion. There are other characters that are interpreted in a special way close to numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'.

Example 9 (not allowed): The numbers in the previous example are percent-encoded:
logical representation: http://ab.cd.ef/GH%31/%32IJ/KL.html,
visual representation (Hebrew): http://ab.cd.ef/%31HG/LK/JI%32.html
visual representation (Arabic): http://ab.cd.ef/31%HG/%LK/JI32.html
Depending on whether the upper-case letters represent Arabic or Hebrew, the visual representation is different.

Example 10 (allowed, but not recommended):
logical representation: http://ab.CDEFGH.123/kl/mn/op.html
visual representation: http://ab.123.HGFEDC/kl/mn/op.html
Components consisting of only numbers are allowed (it would be rather difficult to prohibit them), but may interact with adjacent RTL components in ways that are not easy to predict.

--

Ashok Hariharan

Dec 10, 2019, 7:11:15 AM

to akomant...@googlegroups.com

On Tue, Dec 10, 2019 at 5:22 PM Fabio Vitali <fvi...@gmail.com> wrote:

> Finally, I assume that subpoint and paragraph correspond to actual structures in the XML document, and therefore you cannot be silent of their names according to the AKN Naming Convention. So:
>
> NOT
>
> > eId="point_64ב_א_ב"
>
> BUT
>
> > eId="point_64ב__subpoint_א__paragraph_ב"
>
> and somewhere there, I do not know exactly, you should place the LRE and/or PDF and/or RLM characters.

Ah indeed, the identifier should have the structure names in it!

However, I imagine there could be cases with numbering like 1.a and 1.b written in Hebrew/Arabic numerals and letters; there, the end of the identifier will still be reversed visually.


חגי רצבי‎

Dec 11, 2019, 10:56:10 AM

to akomantoso-xml

Hi friends!

Fabio and Ashok, thank you very much for the detailed and fast responses! I appreciate it.

First, Fabio, of course you are right: it is about href, not eId; that was a copy error. You are also right about using the full path rather than the shortened one I wrote.

When I write the full path (!main#point_64ב__point_א__point_3 instead of point_64ב_א_3), the problem is solved: I can read the XML file and understand it, and it is not mixed.

There are 2 issues here:

1- How to read the XML document in a text or XML editor.

2- How an internet browser/PDF displays or interprets the XML.

The solutions you mentioned, Fabio, require adding special characters (like U+202A, etc.), which helps the browser's interpretation but makes the XML document itself less human readable.

The main thing that bothered me was that the XML document itself became less human readable.

Beyond that, I didn't understand some of the RFC's guidelines.

First here:

For each component, the following restrictions apply:
1. A component SHOULD NOT use both right-to-left and left-to-right
characters.
2. A component using right-to-left characters SHOULD start and end
with right-to-left characters.

It says that a component SHOULD NOT contain both RTL & LTR characters, and that a path SHOULD NOT end with an RTL letter if it starts with an LTR one.

But there are paths that end in this way, like !main#point_64ב__point_א -> is this not allowed?

Second:

In the RFC that Ashok put above, in Example 8:

Example 8 (not allowed): Numbers at the start or end of a rtl component:
logical representation: http://ab.cd.ef/GH1/2IJ/KL.html
visual representation: http://ab.cd.ef/LK/JI1/2HG.html
The sequence '1/2' is interpreted by the bidi algorithm as a fraction, fragmenting the components and leading to confusion. There are other characters that are interpreted in a special way close to numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'.

It says that a component SHOULD NOT end with a number, like !main#point_64ב__point_א__point_3 -> is this not allowed?

If you can clarify this, I'll appreciate it.

Best Regards

Hagay


Ashok Hariharan

Dec 12, 2019, 6:05:47 AM

to akomant...@googlegroups.com

Dear Hagay --

See my comments below.

‪On Wed, Dec 11, 2019 at 9:26 PM ‫חגי רצבי‬‎ <hr4...@gmail.com> wrote:‬

> First here:
> For each component, the following restrictions apply:
> 1. A component SHOULD NOT use both right-to-left and left-to-right
> characters.
> 2. A component using right-to-left characters SHOULD start and end
> with right-to-left characters.
>
> It says that a component SHOULD NOT contain both RTL & LTR characters, and that a path SHOULD NOT end with an RTL letter if it starts with an LTR one.
> But there are paths that end in this way, like !main#point_64ב__point_א -> is this not allowed?

If you look at the next paragraph, it says:

The above restrictions are given as shoulds, rather than as musts.
For IRIs that are never presented visually, they are not relevant.
However, for IRIs in general, they are very important to ensure
consistent conversion between visual presentation and logical
representation, in both directions.

(The underlining and bolding are mine.) As you can see, it is not a must, i.e. it is not a mandatory requirement.

Secondly, you are not showing the IRI visually on a web page, so it does not have presentation significance, and what you are doing is perfectly fine.

You perhaps just need to document the rule for reading the AKN IRI in case it contains contiguous RTL letters.
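
Since the two restrictions are SHOULDs, a checker can report them as warnings rather than reject the identifier. Below is a small Python sketch, assuming the whole fragment identifier is treated as one component; `bidi_warnings` is a hypothetical helper built on the standard `unicodedata` module, and the Hebrew letters are written as escapes.

```python
import unicodedata

RTL = {"R", "AL"}  # strong right-to-left classes (Hebrew, Arabic)


def bidi_warnings(component: str) -> list:
    """Return warnings for the two per-component restrictions quoted above."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL for c in classes)
    has_ltr = any(c == "L" for c in classes)

    warnings = []
    if has_rtl and has_ltr:
        warnings.append("mixes right-to-left and left-to-right characters")
    if has_rtl and not (classes[0] in RTL and classes[-1] in RTL):
        warnings.append("uses right-to-left characters but does not start and end with them")
    return warnings


# \u05D1 is HEBREW LETTER BET, \u05D0 is HEBREW LETTER ALEF
fragment = "point_64\u05D1__point_\u05D0"
found = bidi_warnings(fragment)
```

For the fragment above both warnings are raised, which matches the question asked; they are advisory only and matter mainly where the identifier is shown to a reader.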

> Second:
> In the RFC that Ashok put above, in Example 8:
> Example 8 (not allowed): Numbers at the start or end of a rtl component:
> logical representation: http://ab.cd.ef/GH1/2IJ/KL.html
> visual representation: http://ab.cd.ef/LK/JI1/2HG.html
> The sequence '1/2' is interpreted by the bidi algorithm as a fraction, fragmenting the components and leading to confusion. There are other characters that are interpreted in a special way close to numbers, in particular '+', '-', '#', '$', '%', ',', '.', and ':'.
>
> It says that a component SHOULD NOT end with a number, like !main#point_64ב__point_א__point_3 -> is this not allowed?

From what I can see, in your particular case the usage is fine: since you have breaking LTR characters separating the Hebrew characters, the visual and logical presentations would not differ.

You would perhaps have a problem if, instead of using the literal "point", you had used a Hebrew word as the literal.

Ashok
