sort string alphabetically

1,128 views
Skip to first unread message

harven

unread,
Oct 12, 2013, 12:49:45 PM10/12/13
to julia...@googlegroups.com
Is there a way to sort a string in alphabetical order?
The sort function does not seem to work with strings.
I would like the following behavior.

julia> sort("julialang")
"aagijllnu"

At the moment, I rely on  join(sort(collect("julialang")), "") but it is pretty slow.
Moreover string are immutable so it doesn't seem easy to implement a sorting algorithm
without switching to another type.

Ivar Nesje

unread,
Oct 12, 2013, 1:50:08 PM10/12/13
to julia...@googlegroups.com
I have reported an issue that I think will speed up collect for strings in #4491.

If you use a comprehension instead you will get a speedup, but I am not sure what to do with the join, but one idea might be to use the string() function.

string(sort([a for a in "julialang"])...)

Ivar

harven

unread,
Oct 12, 2013, 3:00:35 PM10/12/13
to julia...@googlegroups.com


Le samedi 12 octobre 2013 19:50:08 UTC+2, Ivar Nesje a écrit :

If you use a comprehension instead you will get a speedup, but I am not sure what to do with the join, but one idea might be to use the string() function.


Thanks for your answer.
Using comprehension actually slows down my code by 50%. Here is a microbenchmark that illustrates the slowdown.

  julia> dict = open(readlines, "/usr/share/dict/american-english") ; map!(chomp,dict);
  julia> gc() ; @time filter(str -> collect("cddeinostu") == sort(collect(str)), dict)
 elapsed time: 1.095790328 seconds (72110504 bytes allocated)
 julia> gc() ; @time filter(str -> [c for c in "cddeinostu"] == sort([d for d in str]), dict)
 elapsed time: 1.512476691 seconds (86447632 bytes allocated)

By the way, I have often experimented the fact that adding type declaration for arrays actually slows down the code. Here is another example. The following code returns an array of all strings made of a letters R, b letters V and c letters B.

 function path (a, b, c)
  (a == 0 && b == 0 && c == 0) ? {""} :  
       [ a != 0 ? map(str->"R"*str, path(a-1, b, c)) : {},
         b != 0 ? map(str->"V"*str, path(a, b-1, c)) : {},
         c != 0 ? map(str->"B"*str, path(a, b, c-1)) : {} ]
end
 julia> gc() ; @time length(path(5,5,5))
elapsed time: 7.532195402 seconds (1489954480 bytes allocated)

Now if I replace all {} by [] in the function definition, the timing is 6x slower (46.5s).
If I replace all {} by ASCIIString[], it is still slower (9.7s).

Using julia Version 0.2.0-prerelease+3893. "Commit d5dc226* 2013-10-01 on x86_64-linux-gnu (debian wheezy).

Kevin Squire

unread,
Oct 12, 2013, 3:21:30 PM10/12/13
to julia...@googlegroups.com
Ivar's pull request was just merged, and on my machine, it speeds up the first method by a factor of ~6.

Kevin

harven

unread,
Oct 12, 2013, 4:26:55 PM10/12/13
to julia...@googlegroups.com


Le samedi 12 octobre 2013 21:21:30 UTC+2, Kevin Squire a écrit :
Ivar's pull request was just merged, and on my machine, it speeds up the first method by a factor of ~6.

Kevin

Indeed, collect is now much faster on strings.

julia> @time filter(str -> collect("cddeinostu") == sort(collect(str)), dict)
elapsed time: 0.187484245 seconds (50743000 bytes allocated)

Nice!

It had no effect on list comprehensions or on the weird behavior I observed for my path function, but this is a different issue, I guess.

@Ivar I tried using string instead of join, but this gives a lot of newlines in the result.

julia> string(sort(collect("mountainers")))
"a\ne\ni\nm\nn\nn\no\nr\ns\nt\nu\n"

I am not sure if join is the fastest way to concatenate an array of character.

Stefan Karpinski

unread,
Oct 12, 2013, 4:32:51 PM10/12/13
to Julia Users
The fastest way to do this is

julia> ASCIIString(sort!("julialang".data))
"aagijllnu"

but there are a couple of pretty serious caveats:
  1. It mutates the original string;
  2. It depends very much on the string being an ASCIIString.
The former can be avoided by using sort instead of sort!. If you're going to be sorting various string types, including UTF8Strings, then you need to use a different approach because you're going to need to sort characters instead of bytes. If you don't care what kind of string object you get, you can do this:

julia> CharString(sort!([c for c in "julialang"]))
"aagijllnu"

This yields a CharString object, which represents the string as an array of characters. If you want a ByteString back, then you can do this:

julia> bytestring(CharString(sort!([c for c in "julialang"])))
"aagijllnu"

julia> typeof(ans)
ASCIIString (constructor with 1 method)

Hope that helps.

Ivar Nesje

unread,
Oct 12, 2013, 4:35:47 PM10/12/13
to julia...@googlegroups.com
Unfortunately I can not understand your first example, and I can't run it because I don't know the structure of the content in your "/usr/share/dict/american-english". The patch only affects collect.

The second example is quite special, and took me some time to understand. (I hope I got it right now, it is mostly guessing based on behaviour)
When you replace {} by [] you do a concatenation of Any[] and None[], that operation might not properly optimized
When you replace {} by ASCIIString[] you probably get less benefit because the string can not be inlined in the array because the memory size of a string is not constant. You are writing to the arrays more than you are reading from the arrays, thus the lost type check on reading is compensated by the required type checking on writing. 

I think a faster implementation of the second function would allocate a 2d array of Uint8 and fill with aproperiate values before converting each line to a string. Just remember that because Julia is column major, it might be faster to store the strings in columns. I don't have the time to write the function now for comparison.

I also think I remember reading that lambda functions does not get specialized on argument types. If that is true you would not get any benefit 

harven

unread,
Oct 12, 2013, 5:20:46 PM10/12/13
to julia...@googlegroups.com


Le samedi 12 octobre 2013 22:35:47 UTC+2, Ivar Nesje a écrit :
Unfortunately I can not understand your first example, and I can't run it because I don't know the structure of the content in your "/usr/share/dict/american-english".

The file is just a wordlist, i.e. a list of (hopefully) all the american and english words, one by line. If you have a spellchecker, you probably have one on your computer.  If not, you can grab one here: http://www-01.sil.org/LINGUISTICS/wordlists/english/wordlist/wordsEn.txt

Any wordlist will do for the purpose of the benchmark. That oneliner returns all words in dict that are anagrams of the string "cddeinostu".

harven

unread,
Oct 12, 2013, 7:55:55 PM10/12/13
to julia...@googlegroups.com


Le samedi 12 octobre 2013 22:32:51 UTC+2, Stefan Karpinski a écrit :
The fastest way to do this is

julia> ASCIIString(sort!("julialang".data))
"aagijllnu"

but there are a couple of pretty serious caveats:
  1. It mutates the original string;
  2. It depends very much on the string being an ASCIIString.
The former can be avoided by using sort instead of sort!. If you're going to be sorting various string types, including UTF8Strings, then you need to use a different approach because you're going to need to sort characters instead of bytes. If you don't care what kind of string object you get, you can do this:

julia> CharString(sort!([c for c in "julialang"]))
"aagijllnu"

This yields a CharString object, which represents the string as an array of characters. If you want a ByteString back, then you can do this:

julia> bytestring(CharString(sort!([c for c in "julialang"])))
"aagijllnu"

julia> typeof(ans)
ASCIIString (constructor with 1 method)

Hope that helps.

Yes, thanks for your detailed answer. The CharString type is very nice.
I tried to write a sort function for strings so that the returned type is the same as
the type of the argument. Here is my attempt.

 sortstring{T <: String} (str::T) = convert(T,(apply(string, (sort!(collect(str))))))

It is probably not always fast, but it works for the UTF8String, ASCIIString, CharString and RepString types. It fails for RevString and RopeString. I must say that the RevString type
is a bit mysterious. Why can't we use convert with that type?
Reply all
Reply to author
Forward
0 new messages