Hi,
I have an example internationalized domain name that is NFC but not NFKC: "xn--ttt-8fa.pumesa.com" (this is a fake domain; my focus is on the general pattern).
The pattern that causes a domain to be NFC but not NFKC in Golang is: "xn--", followed by the same letter three times, followed by a single "-", followed by any single digit, followed by "fa". I know this pattern doesn't describe real Unicode; however, the behavior in the programming languages is curious (see below).
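To make the pattern concrete, here is a minimal sketch in Java that tests whether a single label matches it. The class name LabelPattern and the method matchesPattern are hypothetical names for illustration, and I am assuming lowercase ASCII letters. The regular expression only restates the description above; it says nothing about what the label decodes to.

```java
import java.util.regex.Pattern;

class LabelPattern {
    // "xn--", one lowercase letter repeated three times, "-", one digit, "fa".
    // Assumption: lowercase ASCII letters only.
    private static final Pattern PATTERN =
            Pattern.compile("xn--([a-z])\\1\\1-[0-9]fa");

    // Returns true if the whole label matches the pattern.
    static boolean matchesPattern(String label) {
        return PATTERN.matcher(label).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesPattern("xn--ttt-8fa")); // same letter three times
        System.out.println(matchesPattern("xn--xxx-7fa")); // same letter three times
        System.out.println(matchesPattern("xn--abc-8fa")); // three different letters
    }
}
```

The check applies to one label at a time, so a full domain name would need to be split on "." first.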
I ran a few tests using Golang (version go1.10.3 darwin) and Java (version "1.8.0_60"); here are the key parts of the code I used:
1) Golang (Used "ToUnicode" to mimic how Zlint tests):
package main

import (
	"fmt"

	"golang.org/x/net/idna"
	"golang.org/x/text/unicode/norm"
)

func main() {
	str := "xn--xxx-7fa.pumesa.com"
	// Decode the Punycode labels to Unicode, mimicking how Zlint tests.
	decoded, err := idna.ToUnicode(str)
	if err != nil {
		fmt.Println(err)
		return
	}
	// Check the decoded name against both normalization forms.
	fmt.Println("Is NFC ", norm.NFC.IsNormalString(decoded))
	fmt.Println("Is NFKC ", norm.NFKC.IsNormalString(decoded))
}
The final NFKC check is what causes Zlint to report an error stating that the Unicode is not compliant. Does Zlint need to be updated to follow the latest BR (RFC 5891), meaning it should check whether the Unicode in question is NFC compliant rather than NFKC compliant?
Below is something even more interesting.
2) Java:
import java.net.IDN;
import java.text.Normalizer;

public class Main {
    public static void main(String[] args) {
        String cn = "xn--www-0xx.pumesa.com";
        String punycode = IDN.toASCII(cn);
        // Uncomment to convert back to Unicode before the checks.
        //punycode = IDN.toUnicode(punycode);
        System.out.println("is NFC " + Normalizer.isNormalized(punycode, Normalizer.Form.NFC));
        System.out.println("is NFKC " + Normalizer.isNormalized(punycode, Normalizer.Form.NFKC));
    }
}
Per the Oracle documentation, java.net.IDN.toASCII conforms to RFC 3490, and it throws no error. This can be double-checked within the language by converting the Punycode back to Unicode: both print statements still return true.
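To separate the normalization question from the IDN conversion, here is a minimal sketch that runs the same two Normalizer checks on strings supplied directly. The class name NormCheck and the sample strings are my own, for illustration only. U+00B4 (ACUTE ACCENT) is chosen because it has a compatibility decomposition; whether the labels above decode to such a character is an assumption that should be verified separately.

```java
import java.text.Normalizer;

class NormCheck {
    // True if s is already in Normalization Form C.
    static boolean isNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }

    // True if s is already in Normalization Form KC.
    static boolean isNfkc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Illustrative Unicode string containing U+00B4 (an assumption, see above).
        String unicodeSample = "t\u00B4tt";
        // The all-ASCII xn-- form of a label.
        String asciiSample = "xn--ttt-8fa";

        System.out.println("unicode NFC  " + isNfc(unicodeSample));  // true
        System.out.println("unicode NFKC " + isNfkc(unicodeSample)); // false
        System.out.println("ascii NFC    " + isNfc(asciiSample));    // true
        System.out.println("ascii NFKC   " + isNfkc(asciiSample));   // true
    }
}
```

An all-ASCII string is normalized under every form, so if a conversion step hands back the "xn--" form unchanged, both checks print true no matter what the label would decode to. It may be worth printing the string that is actually passed to Normalizer in each language before comparing results.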
So to reiterate, the two main questions are:
1) Should there be a discussion about why Oracle Java and Golang don't agree on whether this pattern causes unicode to be NFKC compliant?
The potential impact is that results obtained from a Java system may not be Zlint compliant.
2) Should Zlint be updated to the latest BR (RFC 5891) regardless of question #1?