Even faster string equality (see #16863)

600 views
Skip to first unread message

Scott Jones

unread,
Jun 8, 2016, 11:55:11 PM6/8/16
to julia-dev
I have made a branch with an even faster version of == for the String type (it took 28%-40% less time in my testing).
Instead of calling lexcmp, which has to deal with strings of different length, it calls memcmp directly if the string lengths are equal, and checks if the result is 0.


If somebody could make a PR out of this branch, it would be appreciated.

Thanks in advance,
Scott

Jeffrey Sarnoff

unread,
Jun 9, 2016, 12:46:46 AM6/9/16
to julia-dev
looks sane

Scott Jones

unread,
Jun 9, 2016, 5:23:24 PM6/9/16
to julia-dev
Just wanted to thank Stefan (again!) for getting this pulled in as https://github.com/JuliaLang/julia/pull/16855 !

On Wednesday, June 8, 2016 at 11:55:11 PM UTC-4, Scott Jones wrote:

Páll Haraldsson

unread,
Jul 13, 2016, 9:48:01 AM7/13/16
to julia-dev
On Thursday, June 9, 2016 at 4:46:46 AM UTC, Jeffrey Sarnoff wrote:
looks sane

Are you sure? I'm not sure what lexcmp does (or should do..), but if say one string has precomposed Unicode letters, and the other doesn't, should it be possible for them to compare the same? Maybe it's not preferable by default? Or is it..?

These things can all be tricky; E.g. minor recent upgrades of PostgreSQL are generally considered safe (and recommended over not upgrading), with a rare "exception" (just just need to be careful and REINDEX afterwards, as broken in a former minor update..):

https://www.postgresql.org/docs/9.5/static/release-9-5-2.html

"PostgreSQL 9.5 introduced logic for speeding up comparisons of string data types by using the standard C library function strxfrm() as a substitute for strcoll(). It now emerges that most versions of glibc (Linux's implementation of the C library) have buggy implementations of strxfrm() that, in some locales, can produce string comparison results that do not match strcoll(). Until this problem can be better characterized, disable the optimization in all non-C locales. (C locale is safe since it uses neither strcoll() nor strxfrm().)

Unfortunately, this problem affects not only sorting but also entry ordering in B-tree indexes, which means that B-tree indexes on text, varchar, or char columns may now be corrupt if they sort according to an affected locale and were built or modified under PostgreSQL 9.5.0 or 9.5.1. Users should REINDEX indexes that might be affected.


It is not possible at this time to give an exhaustive list of known-affected locales. C locale is known safe, and there is no evidence of trouble in English-based locales such as en_US, but some other popular locales such as de_DE are affected in most glibc versions."


Jeffrey Sarnoff

unread,
Jul 13, 2016, 9:56:37 AM7/13/16
to julia-dev
Páll Haraldsson -- 
        Absolutely.  Definitely.  <hat tip>.  Good catch.

Páll Haraldsson

unread,
Jul 13, 2016, 10:37:00 AM7/13/16
to juli...@googlegroups.com
On mið 13.júl 2016 13:56, Jeffrey Sarnoff wrote:
> Páll Haraldsson --
> Absolutely. Definitely. <hat tip>. Good catch.

This is already merged.. so needs to be reverted in case I'm right on
the default behavior..

Another thing, but might be to late to change the default bahavior of
sorts (comparisons):

https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/

I'm partial to "natural sort" being the default.. Then we could have
non-default ASCIIbetical or Unicode disregarding precomposed etc.
whatever is fastest.

--
Palli.


> *
> *
> On Wednesday, July 13, 2016 at 9:48:01 AM UTC-4, Páll Haraldsson wrote:
>
> On Thursday, June 9, 2016 at 4:46:46 AM UTC, Jeffrey Sarnoff wrote:
>
> looks sane
>
>
> Are you sure? I'm not sure what lexcmp does (or should do..), but if
> say one string has precomposed Unicode letters, and the other
> doesn't, should it be possible for them to compare the same? Maybe
> it's not preferable by default? Or is it..?
>
> These things can all be tricky; E.g. minor recent upgrades of
> PostgreSQL are generally considered safe (and recommended over not
> upgrading), with a rare "exception" (just just need to be careful
> and REINDEX afterwards, as broken in a former minor update..):
>
> https://www.postgresql.org/docs/9.5/static/release-9-5-2.html
> <https://www.postgresql.org/docs/9.5/static/release-9-5-2.html>
>
> "PostgreSQL 9.5 introduced logic for speeding up comparisons of
> string data types by using the standard C library function
> |strxfrm()| as a substitute for |strcoll()|. It now emerges that

Stefan Karpinski

unread,
Jul 13, 2016, 10:56:43 AM7/13/16
to juli...@googlegroups.com
We're moving towards the built-in String type being essentially just a container for bytes with a UTF-8-like interpretation. Thus if two String objects don't have the same underlying data, they are not equal. If you want fancier Unicode-aware behavior like UTF-8 validation or normalized comparison, then you'll need to use a package that provides a type like EncodedString{UTF8} (or something). Care needs to be taken to make sure that == is transitive, but that's a problem for the to-be-created external Unicode strings package.

Páll Haraldsson

unread,
Jul 13, 2016, 1:11:35 PM7/13/16
to julia-dev
On Wednesday, July 13, 2016 at 2:56:43 PM UTC, Stefan Karpinski wrote:
We're moving towards the built-in String type being essentially just a container for bytes with a UTF-8-like interpretation. Thus if two String objects don't have the same underlying data, they are not equal. If you want fancier Unicode-aware behavior

Ok, we can define it that way, then the optimization that is already merged is ok.

This still does not imply anything for ordering. That must be defined. ASCIIbetical is sometimes used (but is bad), similar for Unicode would also be a disaster. Since I want my alphabet, e.g. a á b c.., then it's not a leap to correclty order á weather it's precomposed or not.
 
like UTF-8 validation or normalized comparison, then you'll need to use a package that provides a type like EncodedString{UTF8} (or something). Care needs to be taken to make sure that == is transitive, but that's a problem for the to-be-created external Unicode strings package.

Is it at least easy to include a package with new behavior that substitutes the default string type? If not with the same name, at least all literal will get the new type?

If you must define some order, maybe it's better that equal is not dumb..

--
Palli.

Steven G. Johnson

unread,
Jul 13, 2016, 2:46:43 PM7/13/16
to julia-dev
Rather than defining a whole new string type, you can also just normalize your strings (with the normalize_string function) before comparing them, in the rare cases where you want equality that ignores e.g. differences in encoding combining characters.

You could even overload a new infix operator (e.g. ≃) that performs normalized equality tests of whatever flavor you like.

There's no single correct answer here, because it depends on what differences you want to ignore.  Do you want to ignore differences in codepoint composition that lead to logically identical strings?  Use NFC normalization.  Do you also want to ignore differences in "confusable" characters like µ (micro) vs μ (greek mu)?  Use NFKC normalization.   Do you want to ignore all diacritical marks, e.g. treat a and á the same?  Use the stripmark=true option.   Do you want to ignore case?  Use the casefold=true option.

Julia's default of == meaning "same codepoints" is reasonable and fast (whereas anything requiring normalization will be much slower).

Steven G. Johnson

unread,
Jul 13, 2016, 2:48:59 PM7/13/16
to julia-dev
Similarly, you can use the "by" keyword option of the sort function if you want to sort strings according to one of their normalized forms.

Scott Jones

unread,
Jul 13, 2016, 4:26:22 PM7/13/16
to juli...@googlegroups.com
On Jul 13, 2016, at 1:11 PM, Páll Haraldsson <pall.ha...@gmail.com> wrote:

On Wednesday, July 13, 2016 at 2:56:43 PM UTC, Stefan Karpinski wrote:
We're moving towards the built-in String type being essentially just a container for bytes with a UTF-8-like interpretation. Thus if two String objects don't have the same underlying data, they are not equal. If you want fancier Unicode-aware behavior

Ok, we can define it that way, then the optimization that is already merged is ok.

Julia, AFAIK, has never compared UTF8String (or now String) in any way except for just checking if the encoded bytes are the same.
Before my fixes to handle different technically invalid (but common) UTF-8 encodings, you could easily get two strings like “\xc0\x80” and “\0” which contained the same Unicode code point, but with a different byte encoding (as well as have many completely invalid sequences that were simply not detected).

I think there is a problem now, in that the String constructor does not seem to be making sure that all strings produced are actually valid UTF-8, and methods that deal with String assume that a String is valid UTF-8.
In v0.4.x, utf8 (and convert) ensured that each character in the resulting UTF8String was encoded correctly, but now utf8 has been deprecated, and convert(String doesn’t create a canonically encoded UTF-8 string.

This still does not imply anything for ordering. That must be defined. ASCIIbetical is sometimes used (but is bad), similar for Unicode would also be a disaster. Since I want my alphabet, e.g. a á b c.., then it's not a leap to correclty order á weather it's precomposed or not

Ordering like that is language dependent, and really needs something very complex like the ICU library to attempt to do it right (I wrote code to handle those issues before ICU was available - it’s not at all a trivial task).

like UTF-8 validation or normalized comparison, then you'll need to use a package that provides a type like EncodedString{UTF8} (or something). Care needs to be taken to make sure that == is transitive, but that's a problem for the to-be-created external Unicode strings package.

Is it at least easy to include a package with new behavior that substitutes the default string type? If not with the same name, at least all literal will get the new type?

That is not currently possible.

Scott Jones

unread,
Jul 13, 2016, 6:23:38 PM7/13/16
to julia-dev
Unfortunately, that is simply not true (and as far as I've seen has never been) for UTF8String (and now String).

Since Julia never requires that strings be valid, there are many different ways of encoding a single code point using either overly long encodings (such as 0xc0 0x80, which is an invalid (overly long) encoding of 0x00), or encoding non-BMP characters with 6 bytes, i.e. as two 3 byte surrogate characters, which also cause strings to not sort correctly), and those will give bad results.   Fixing those issues would make the performance be far worse than it already is.
At least, with convert and utf8 in v0.4.x (because of my PRs), you could eliminate those problems and ensure that you had valid UTF-8 encoded strings before comparisons, but now, in v0.5.x, you'd need to add the LegacyStrings package to gain back that functionality.
It would be possible, although rather inefficient, to do something like ==(x::String,y::String) = convert(Vector{Char},x) == convert(Vector{Char},y), and then you also have to deal with all sorts of hashing problems as well.
Just to be very clear here - I am *not* talking about composed Unicode characters - which is another matter entirely, I'm talking about simply trying to compare two Unicode strings accepted by String.

This is one of the reasons I felt that the string changes that have gone into v0.5 were a mistake. (I'm not saying either that the way strings are in v0.3 were good either, with 4 different inconsistent basic string types, and my fixes in v0.4 didn't change that mess).

Steven G. Johnson

unread,
Jul 17, 2016, 3:07:59 PM7/17/16
to julia-dev


On Wednesday, July 13, 2016 at 11:23:38 PM UTC+1, Scott Jones wrote:
Julia's default of == meaning "same codepoints" is reasonable and fast (whereas anything requiring normalization will be much slower).

Unfortunately, that is simply not true (and as far as I've seen has never been) for UTF8String (and now String).

It's true for valid strings, which is what most applications are concerned with. For invalid encodings, it's true that we compare encodings.  I don't see a problem with that.  If the user just stuffed arbitrary binary data into a string, comparing encodings (valid or not) seems to be the most reasonable choice of == without knowing application-specific information.

(For example, if your application knows that you are importing "modified UTF-8" from Java code that has (invalid) surrogate pairs, you can do an application-specific preprocessing step to convert these into proper UTF-8 if you want to use the String type.   But it would be inappropriate for Base.String to do this preprocessing for you, because Base.String doesn't know the origin of your invalid encoding.)

Scott Jones

unread,
Jul 17, 2016, 7:37:28 PM7/17/16
to julia-dev
Then what you are saying is that == for String type means simply "same code units", which is not at all the same thing as "same codepoints".

This isn't the case of just arbitrary binary data, these are UTF-8 encodings that are going to be frequently found, whether coming from Java (or languages using the JVM) or many databases.

I don't think it is inappropriate for `convert(String, x::Vector{UInt8})` to make sure that the string is canonically encoded, it can save a lot of grief when people UTF-8 data from a variety of sources.
Reply all
Reply to author
Forward
0 new messages