[R] What "method" does sort() use?

Patrick Connolly

Mar 18, 2016, 5:04:36 AM
to R-help
I don't follow why this happens:

> sort(c(LETTERS[1:5], letters[1:5]))
[1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E"

The help for sort() says:

method: character string specifying the algorithm used. Not
available for partial sorting. Can be abbreviated.

But what are the methods available? The help mentions xtfrm but that
doesn't illuminate. I'd have thought that at least by default it would
have something to do with ASCII codes. But that's not the case since
all the uppercase ones would be before the lowercase ones.

I know something different is happening but I don't know what it is
(do you, Mr Jones?). Apologies to Bob Dylan.

--
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
___ Patrick Connolly
{~._.~} Great minds discuss ideas
_( Y )_ Average minds discuss events
(:_~*~_:) Small minds discuss people
(_)-(_) ..... Eleanor Roosevelt

~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.

______________________________________________
R-h...@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

peter dalgaard

Mar 18, 2016, 5:16:39 AM
to Patrick Connolly, R-help

On 18 Mar 2016, at 10:02 , Patrick Connolly <p_con...@slingshot.co.nz> wrote:

> I don't follow why this happens:
>
>> sort(c(LETTERS[1:5], letters[1:5]))
> [1] "a" "A" "b" "B" "c" "C" "d" "D" "e" "E"
>
> The help for sort() says:
>
> method: character string specifying the algorithm used. Not
> available for partial sorting. Can be abbreviated.
>
> But what are the methods available? The help mentions xtfrm but that
> doesn't illuminate. I'd have thought that at least by default it would
> have something to do with ASCII codes. But that's not the case since
> all the uppercase ones would be before the lowercase ones.
>
> I know something different is happening but I don't know what it is
> (do you, Mr Jones?). Apologies to Bob Dylan.
>


Um, read _all_ of the help file?

sort.int(x, partial = NULL, na.last = NA, decreasing = FALSE,
         method = c("shell", "quick"), index.return = FALSE)

[snip]

Method "shell" uses Shellsort (an O(n^{4/3}) variant from Sedgewick (1986)). If x has names a stable modification is used, so ties are not reordered. (This only matters if names are present.)

Method "quick" uses Singleton (1969)'s implementation of Hoare's Quicksort method and is only available when x is numeric (double or integer) and partial is NULL. (For other types of x Shellsort is used, silently.) It is normally somewhat faster than Shellsort (perhaps 50% faster on vectors of length a million and twice as fast at a billion) but has poor performance in the rare worst case. (Peto's modification using a pseudo-random midpoint is used to make the worst case rarer.) This is not a stable sort, and ties may be reordered.

Factors with less than 100,000 levels are sorted by radix sorting when method is not supplied: see sort.list.
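
For what it's worth, here is a minimal sketch of the two methods named in the usage above, assuming a plain numeric vector without names (the vector x is made up for illustration). Both should return the same sorted values; the choice of method affects speed and the handling of ties, not how strings collate.

```r
# Illustrative data, not taken from the original post.
x <- c(3.2, 1.5, 2.7, 1.5, 0.4)

# Sort the same values with each algorithm listed in the help excerpt.
by_shell <- sort.int(x, method = "shell")
by_quick <- sort.int(x, method = "quick")

print(by_shell)
# The sorted values agree regardless of the algorithm used.
print(identical(by_shell, by_quick))
```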

-pd


--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd....@cbs.dk Priv: PDa...@gmail.com

peter dalgaard

Mar 18, 2016, 5:21:36 AM
to Patrick Connolly, R-help
Ooops, that was answering the question you actually asked. The one you meant to ask is answered by this part:

The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.

...and collation is a weird and woolly subject, where you cannot even be sure that locales of the same name on two different platforms sort strings in the same order.
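
A rough sketch of how one might check this, assuming it is acceptable to switch LC_COLLATE temporarily within the session. The helper name sort_in_c_locale is hypothetical, not part of R; it only wraps Sys.getlocale(), Sys.setlocale() and sort().

```r
# Hypothetical helper: sort x under the "C" collating sequence, then
# restore whatever LC_COLLATE setting was in effect before the call.
sort_in_c_locale <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

x <- c(LETTERS[1:5], letters[1:5])

# In the C locale, uppercase sorts before lowercase (ASCII order).
in_c <- sort_in_c_locale(x)
print(in_c)
# [1] "A" "B" "C" "D" "E" "a" "b" "c" "d" "e"

# In the session's own locale the order may differ, as in the original post.
print(sort(x))
```

Comparing the two printed results on each machine is a quick way to see whether the locale is what differs.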

-pd

Patrick Connolly

Mar 19, 2016, 10:21:47 PM
to peter dalgaard, R-help
I did look at the Comparison help but totally overlooked this part:

Comparison of strings in character vectors is lexicographic within
the strings using the collating sequence of the locale in use: see
‘locales’. The collating sequence of locales such as ‘en_US’ is
normally different from ‘C’ (which should use ASCII) and can be
surprising.

I've recently changed to a different Linux distribution and was trying
to work out why I was getting a different order of the factor levels
even though it was the same code. I thought I'd inadvertently changed
something somehow.
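
One way to make the level order independent of the locale, sketched here with made-up data, is to pass the levels explicitly instead of letting factor() derive them by sorting the unique values:

```r
# Illustrative data, not from the original code.
x <- c("b", "A", "a", "B", "a")

# Default: levels come from ordering the unique values, so their order
# follows the collating sequence of the locale in use.
f_default <- factor(x)
print(levels(f_default))

# Explicit levels: the order is whatever is written here, on any platform.
f_fixed <- factor(x, levels = c("A", "B", "a", "b"))
print(levels(f_fixed))
print(as.integer(f_fixed))
```

With the levels spelled out, the integer codes behind the factor stay the same from one machine to the next.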

Much clearer now -- even if still confusing.

Thanks for the pointer.
