A peculiar sort problem

102 views
Skip to first unread message

Tobias Klausmann

unread,
Jul 27, 2021, 1:57:07 PM7/27/21
to golan...@googlegroups.com
Hi!

I am writing a tool that handles files and generates some sorted
output. Since this tool is to be a replacement for part of
another system (written in C#/.NET), the output must be a
byte-exact duplicate.

The existing system generates some checksums and filenames in a
stable sorted order, like so (also pastebinned at
https://dpaste.org/MLMv):

```
addons/rhs_bmp.pbo 2D08606D9BCB43E37A44CDBCD753F663
addons/rhs_bmp.pbo.rhsafrf.0.5.6.bisign 4F0BAAFDDC2C3474F7B310BAF221C22E
addons/rhs_bmp_camo.pbo 5A9AED2283EE8B8E55BE07772CB9EF33
addons/rhs_bmp_camo.pbo.rhsafrf.0.5.6.bisign C5638F6DC62DED7C05877896096BE7CC
addons/rhs_bmp3.pbo 1F8E4520CA673FE5F67E434D6C9C3EAA
addons/rhs_bmp3.pbo.rhsafrf.0.5.6.bisign
addons/rhs_bmp3_camo.pbo CE25F0037BCD19F55D6AA1CD3AEA0B86
addons/rhs_bmp3_camo.pbo.rhsafrf.0.5.6.bisign 9CF15ED151231E6C1E8A5E63C5AAD829
addons/rhs_btr70.pbo 7FCC93BDDBE1A0573ABF146FA7C1FAD9
addons/rhs_btr70.pbo.rhsafrf.0.5.6.bisign 61FD5A3D99F6A60BB31F0D396B19E5C5
addons/rhs_btr70_camo.pbo 9A8F2BF875276FA1F7018F2D60C89D7A

```

My tool works just fine, except it disagrees on the sorting
(pastebinned at https://dpaste.org/UDiK):

```
addons/rhs_bmp.pbo 2D08606D9BCB43E37A44CDBCD753F663
addons/rhs_bmp.pbo.rhsafrf.0.5.6.bisign C5638F6DC62DED7C05877896096BE7CC
addons/rhs_bmp3.pbo 1F8E4520CA673FE5F67E434D6C9C3EAA
addons/rhs_bmp3.pbo.rhsafrf.0.5.6.bisign 2CCF421E08170964CF323A98725DDA6E
addons/rhs_bmp3_camo.pbo CE25F0037BCD19F55D6AA1CD3AEA0B86
addons/rhs_bmp3_camo.pbo.rhsafrf.0.5.6.bisign 4F0BAAFDDC2C3474F7B310BAF221C22E
addons/rhs_bmp_camo.pbo 5A9AED2283EE8B8E55BE07772CB9EF33
addons/rhs_bmp_camo.pbo.rhsafrf.0.5.6.bisign 9CF15ED151231E6C1E8A5E63C5AAD829
addons/rhs_btr70.pbo 7FCC93BDDBE1A0573ABF146FA7C1FAD9
```

In essence, to the .NET program `_` sorts before `3`, whereas for
my Go program, it is the other way around.

Now we could discuss at length which order is the correct one,
but for my purposes (replicating the .NET tool) is what I
strongly prefer. The alternative would be a breaking change in a
larger system I do not control, and as such would be a lot of
pain.

So my question is: what is the easiest way to tell Go that `_`
sorts before numerals here? My current sort helpers are (on the
playground: https://play.golang.org/p/pQLeZN-fptY):

```
type Files []ModFile
type ModFile struct {
Path string
// Other fields here
}
func (f Files) Len() int { return len(f) }
func (f Files) Swap(i, j int) { f[i], f[j] = f[j], f[i] }
func (f Files) Less(i, j int) bool { return strings.ToLower(f[i].Path) < strings.ToLower(f[j].Path) }
```

(yes, the case folding is another peculiarity of the .NET tool)

Best,
Tobias

jake...@gmail.com

unread,
Jul 27, 2021, 2:51:51 PM7/27/21
to golang-nuts
Personally I would be careful about assuming that was the only sorting difference. Unless you have checked every Unicode character, there could be other hidden 'special cases'. If it really mattered, personally, I would dig into the .NET library source code and see what algorithm or system call is used to sort. You may even be able to use the same system call (assuming you are on Windows only) or algorithm.

Tamás Gulácsi

unread,
Jul 27, 2021, 3:21:46 PM7/27/21
to golang-nuts
First, sort all the bytes (0-255) with Go and .Net, and compare them.
For Unicode-unaware sorting, that'd be enough: create a mapping between the two tables,
replace all the bytes in the Go's input before sort (or use a Less that does the translation),
thus replicating the .NET sort order.

Brian Candler

unread,
Jul 27, 2021, 4:00:26 PM7/27/21
to golang-nuts
As a complete guess: perhaps the .NET sorter is replacing underscore with space before sorting?  You'll need to do some further experimentation to prove or disprove this theory.  But if it's true, you can change your Less function in a corresponding way.

Of course, that still begs the question of what happens when space and underscore are compared.

James

unread,
Jul 27, 2021, 4:12:30 PM7/27/21
to jake...@gmail.com, golang-nuts
It could be that .NET is using some locale based collation. Seems like a lot of machinery, but might do the trick https://pkg.go.dev/golang.org/x/te...@v0.3.6/collate

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/98eed7b6-4b1e-4ba5-a457-141ae5daca52n%40googlegroups.com.

Tobias Klausmann

unread,
Jul 28, 2021, 3:49:27 AM7/28/21
to golan...@googlegroups.com
Hi!

On Wed, 28 Jul 2021, James wrote:
> It could be that .NET is using some locale based collation. Seems like a
> lot of machinery, but might do the trick
> https://pkg.go.dev/golang.org/x/te...@v0.3.6/collate

I have managed to get the .NET code used for sorting:

Files.Sort((x, y) => string.Compare(x.RelativePath, y.RelativePath, CultureInfo.InvariantCulture,
CompareOptions.IgnoreCase));

My NET-fu is weak, but I do wonder what exactly
`CultureInfo.InvariantCulture` means in the context of sorting.

I'll do some more digging, but I suspect the only way I can match
the .NET behavior 100% is indeed x/text/collate :-/

Best,
Tobias

Tobias Klausmann

unread,
Jul 28, 2021, 3:51:33 AM7/28/21
to golan...@googlegroups.com
Hi!

On Tue, 27 Jul 2021, Tamás Gulácsi wrote:
> First, sort all the bytes (0-255) with Go and .Net, and compare them.
> For Unicode-unaware sorting, that'd be enough: create a mapping between the
> two tables,
> replace all the bytes in the Go's input before sort (or use a Less that
> does the translation),
> thus replicating the .NET sort order.

I'll give that a try. While the names are meaningful in some way,
my program doesn't need to care (similar to how most Linux
filesystems don't have a notion of encoding, they just store byte
sequences as path elements).

Best,
Tobias

peterGo

unread,
Jul 28, 2021, 8:32:57 AM7/28/21
to golang-nuts
Tobias,

The C# code is likely using the string sort method: SORT_STRINGSORT.

winnls.h:

//    STRING Sort:  hyphen and apostrophe will sort with all other symbols
//
//                        co-op     <-------  hyphen (punctuation)
//                        co_op     <-------  underscore (symbol)
//                        coat
//                        comb
//                        coop
//                        cork
//                        we're     <-------  apostrophe (punctuation)
//                        went
//                        were
//
#define SORT_STRINGSORT           0x00001000  // use string sort method


InvariantCulture means independent of culture (and language). The sort returns the same results for the US, Germany, and Azerbaijan.

Peter
Reply all
Reply to author
Forward
0 new messages