Human sort for strings in Go (Natural Sort Order package)

2,286 views
Skip to first unread message

Maxim Kupriianov

unread,
Apr 12, 2015, 10:59:19 AM4/12/15
to golan...@googlegroups.com
Hi guys! Recently I saw an announcement on @golangweekly about a package for sorting strings in the way humans expect.
For example, a default string sort would do [a1, a10, a12, a2] instead of [a1, a2, a10, a12]. Spot the difference.

So, the package announced github.com/skarademir/naturalsort is quite a hello-world level with regex for getting numbers and has an extremely low performance as the result. I'm a bit jealous here because I have my own implementation available for months: github.com/xlab/handysort. It has 60x performance over regex of course, benchmark results available here: https://gist.github.com/xlab/1e40af2eac8783ce50d1

It's also only 5x-8x times slower than the default string sort, that makes it still acceptable in real cases where performance really matters. The handysort.Strings works like sort.StringSlice, i.e. «attaches the methods of Interface to []string», easy to use in calls to sort.Sort.

I'd like to hear about any bugs and missing test cases encountered. This package has 100% test coverage.
Cheers.

Frits van Bommel

unread,
Apr 14, 2015, 1:29:48 PM4/14/15
to golan...@googlegroups.com, Maxim Kupriianov
Hi Maxim,

I think I found a bug: your package seems to consider `083a` to be less than `9a`.

By the way, I found this because I wrote my on code after I looked at your code and it seemed a bit overcomplicated. For instance, there's really no need to import "unicode/utf8", since lexicographical order (i.e. comparing byte-by-byte in the usual manner) actually just works for UTF-8 and since the bytes used to encode non-ASCII values are all non-ASCII themselves you don't need to work in codepoints to avoid confusing them for digits either.
Anyway, I stole your tests for my package :P and added only a few of my own. One of the tests I added was "does it return the same result as handysort on some randomly-generated strings?", which failed on strings like that.

Also, here are my benchmark results:

BenchmarkStringSort-8             300       5545867 ns/op
BenchmarkUtilStringSort-8         200       8943404 ns/op
BenchmarkHandyStringSort-8        100      20950622 ns/op
BenchmarkStringLess-8        30000000          50.8 ns/op
BenchmarkNaturalLess-8      100000000          17.5 ns/op

As you can see, my method is more than twice as fast as yours (on my machine) at sorting but still about 1.6x slower than a straight-up sort.Strings().

I did modify these benchmarks slightly (besides adding my own methods) by copying the array before sorting the copy instead of sorting the randomly-generated array directly. This way, if the benchmark wraps around (which it seems they didn't, but still) they won't be sorting an already-sorted array after the first time.

Anyway, my code can be found here:
    Import path: "github.com/fvbommel/util/sortorder"
    Code: https://github.com/fvbommel/util/tree/master/sortorder
    Docs: https://godoc.org/github.com/fvbommel/util/sortorder

and it passes all of your tests, as well as a (very) few of my own.


Happy coding,

    Frits van Bommel

Maxim Kupriianov

unread,
Apr 14, 2015, 1:37:20 PM4/14/15
to Frits van Bommel, golan...@googlegroups.com
Hello, yes my code become overcomplicated after I received the first feedback and added some tricky test cases. I just realized I’m doing it wrong and decided to re-write on the weekend. Your results are fascinating, I’ll review the code and see if I my version could even be close to yours ;)

Thanks for tips.


Regards,
Maxim Kupriianov

On Apr 14, 2015, at 8:29 PM, Frits van Bommel <fvbo...@gmail.com> wrote:

Hi Maxim,

I think I found a bug: your package seems to consider `083a` to be less than `9a`.

By the way, I found this because I wrote my on code after I looked at your code and it seemed a bit overcomplicated. For instance, there's really no need to import "unicode/utf8", since lexicographical order (i.e. comparing byte-by-byte in the usual manner) actually just works for UTF-8 and since the bytes used to encode non-ASCII values are all non-ASCII themselves you don't need to work in codepoints to avoid confusing them for digits either.
Anyway, I stole your tests for my package :P and added only a few of my own. One of the tests I added was "does it return the same result as handysort on some randomly-generated strings?", which failed on strings like that.

Also, here are my benchmark results:

BenchmarkStringSort-8             300       5545867 ns/op
BenchmarkUtilStringSort-8         200       8943404 ns/op
BenchmarkHandyStringSort-8        100      20950622 ns/op
BenchmarkStringLess-8        30000000          50.8 ns/op
BenchmarkNaturalLess-8      100000000          17.5 ns/op

As you can see, my method is more than twice as fast as yours (on my machine) at sorting but still about 1.6x slower than a straight-up sort.Strings().

I didmodify these benchmarks slightly (besides adding my own methods) by copying the array before sorting the copy instead of sorting the randomly-generated array directly. This way, if the benchmark wraps around (which it seems they didn't, but still) they won't be sorting an already-sorted array after the first time.


Anyway, my code can be found here:
    Import path: "github.com/fvbommel/util/sortorder"
    Code: https://github.com/fvbommel/util/tree/master/sortorder
    Docs: https://godoc.org/github.com/fvbommel/util/sortorder

and it passes all of your tests, as well as a (very) few of my own.


Happy coding,

    Frits van Bommel



On Sunday, April 12, 2015 at 4:59:19 PM UTC+2, Maxim Kupriianov wrote:
Hi guys! Recently I saw an announcementon @golangweekly about a package for sorting strings in the way humans expect.

For example, a default string sort would do [a1, a10, a12, a2] instead of [a1, a2, a10, a12]. Spot the difference.

So, the package announced github.com/skarademir/naturalsortis quite a hello-world level with regex for getting numbers and has an extremely low performance as the result. I'm a bit jealous here because I have my own implementation available for months: github.com/xlab/handysort. It has 60x performance over regex of course, benchmark results available here: https://gist.github.com/xlab/1e40af2eac8783ce50d1

Ross Light

unread,
Apr 15, 2015, 4:33:58 PM4/15/15
to Maxim Kupriianov, Frits van Bommel, golan...@googlegroups.com
I've also written a natural sort package as part of cardcpx.  It's purposefully very simplistic, but served my needs well:


-Ross

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages