looks sane
"PostgreSQL 9.5
introduced logic for speeding up comparisons of string data types by using the standard C library function strxfrm() as a substitute for strcoll(). It now emerges that most versions of glibc (Linux's implementation of the C library) have buggy implementations of strxfrm() that, in some locales, can produce string comparison results that do not match strcoll(). Until this problem can be better characterized, disable the optimization in all non-C locales. (C locale is safe since it uses neither strcoll() nor strxfrm().)
Unfortunately, this problem affects not only sorting but also entry ordering in B-tree indexes, which means that B-tree indexes on text, varchar, or char columns may now be corrupt if they sort according to an affected locale and were built or modified under PostgreSQL 9.5.0 or 9.5.1. Users should REINDEX indexes that might be affected.
It is not possible at this time to give an exhaustive list of known-affected locales. C locale is known safe, and there is no evidence of trouble in English-based locales such as en_US, but some other popular locales such as de_DE are affected in most glibc versions."
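For anyone who wants to poke at the kind of mismatch the release note describes, here is a rough Python sketch. It compares each pair of strings directly with strcoll() and then again via their strxfrm() keys, and reports pairs where the two disagree. The helper name find_mismatches and the sample words are my own, and Python's locale wrappers may not exercise exactly the same C code path PostgreSQL used, so treat it as an illustration rather than a reproduction of the bug.

```python
import locale
from itertools import combinations


def sign(n):
    """Collapse any integer to -1, 0 or 1."""
    return (n > 0) - (n < 0)


def find_mismatches(words, locale_name="C"):
    """Return pairs where strcoll() and strxfrm() keys disagree.

    find_mismatches is a hypothetical helper for illustration only.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        # Raises locale.Error if locale_name is not installed.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        mismatches = []
        for a, b in combinations(words, 2):
            direct = sign(locale.strcoll(a, b))
            key_a, key_b = locale.strxfrm(a), locale.strxfrm(b)
            via_keys = (key_a > key_b) - (key_a < key_b)
            if direct != via_keys:
                mismatches.append((a, b))
        return mismatches
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore it


if __name__ == "__main__":
    print(find_mismatches(["apple", "Banana", "cherry", "apple pie"]))
```

To test a specific locale, pass its name as the second argument (for example "de_DE.UTF-8", assuming it is installed on the machine). An empty result for a handful of words does not prove the locale is safe.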
We're moving towards the built-in String type being essentially just a container for bytes with a UTF-8-like interpretation. Thus if two String objects don't have the same underlying data, they are not equal. If you want fancier Unicode-aware behavior like UTF-8 validation or normalized comparison, then you'll need to use a package that provides a type like EncodedString{UTF8} (or something). Care needs to be taken to make sure that == is transitive, but that's a problem for the to-be-created external Unicode strings package.
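To make the difference between "same underlying data" and "normalized comparison" concrete, here is a small sketch. It is in Python rather than Julia, and the helper names same_data and same_normalized are made up for the example. It is not the API of any existing or planned package.

```python
import unicodedata

precomposed = "\u00e1"   # á as a single code point
decomposed = "a\u0301"   # a followed by COMBINING ACUTE ACCENT


def same_data(a, b):
    """Equality in the 'container of bytes' sense: compare the UTF-8 bytes."""
    return a.encode("utf-8") == b.encode("utf-8")


def same_normalized(a, b, form="NFC"):
    """Equality after Unicode normalization (slower: both sides are rewritten)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    print(same_data(precomposed, decomposed))        # byte-wise comparison
    print(same_normalized(precomposed, decomposed))  # normalized comparison
```

Both strings render as á, but only the normalized comparison treats them as equal. Because each comparison is a plain equality test on a canonical form, it stays transitive, which is the property mentioned above.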
On Jul 13, 2016, at 1:11 PM, Páll Haraldsson <pall.ha...@gmail.com> wrote:
On Wednesday, July 13, 2016 at 2:56:43 PM UTC, Stefan Karpinski wrote:
We're moving towards the built-in String type being essentially just a container for bytes with a UTF-8-like interpretation. Thus if two String objects don't have the same underlying data, they are not equal. If you want fancier Unicode-aware behavior
Ok, we can define it that way, then the optimization that is already merged is ok.
This still does not imply anything for ordering. That must be defined. ASCIIbetical ordering is sometimes used (but is bad), and the analogous code-point ordering for Unicode would also be a disaster. Since I want my alphabet, e.g. a á b c.., it's not a leap to correctly order á whether it's precomposed or not.
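A rough illustration of the ordering point, again in Python. Plain code-point sorting is compared with a crude two-level key that decomposes each string and sorts first by base letters, then by the fully decomposed form. The function alphabet_key is a hypothetical helper and is nowhere near a real collation algorithm (it ignores case, locale tailoring, and much else). It only shows that a á b c can be obtained regardless of whether á is precomposed.

```python
import unicodedata

WORDS = ["b", "\u00e1", "a", "c", "a\u0301"]  # includes both spellings of á


def alphabet_key(s):
    """Crude two-level sort key: base letters first, then accents.

    alphabet_key is a made-up helper for illustration, not real collation.
    """
    nfd = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in nfd if not unicodedata.combining(ch))
    return (base, nfd)


by_codepoint = sorted(WORDS)                   # precomposed á lands after c
by_alphabet = sorted(WORDS, key=alphabet_key)  # both á forms follow a

if __name__ == "__main__":
    print(by_codepoint)
    print(by_alphabet)
```

With code-point order the precomposed á sorts after c while the decomposed one sorts right after a. With the key above, both forms get the same key and sit together between a and b.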
like UTF-8 validation or normalized comparison, then you'll need to use a package that provides a type like EncodedString{UTF8} (or something). Care needs to be taken to make sure that == is transitive, but that's a problem for the to-be-created external Unicode strings package.
Is it at least easy to include a package with new behavior that substitutes for the default string type? If not under the same name, will at least all literals get the new type?
Julia's default of == meaning "same codepoints" is reasonable and fast (whereas anything requiring normalization will be much slower).
Unfortunately, that is simply not true (and as far as I've seen has never been) for UTF8String (and now String).