replicating MS Excel percentrank formula

Naresh Pai

unread,

Sep 16, 2010, 11:05:23 AM9/16/10

to

I am trying to replicate MS Excel's percentrank function in Matlab.

% example data
data = [1.1 1.2 2.1 3.5 6.1 1.2 5.1 5.2 9.1 7.1]'

% excel percentrank results are:
% [0.00 0.11 0.33 0.44 0.77 0.11 0.55 0.66 1.00 0.88]'

% Mathworks website suggests using the following formula to calculate percentile rank

prctile_rank = tiedrank(data)/length(data)

% and the result is:
prctile_rank = [0.10 0.25 0.40 0.50 0.80 0.25 0.60 0.70 1.00 0.90]'

% Too bad. So, I try this:

% clear previous variables
clear all

% include serial number for sorting later
data(:,1) = 1:10;

% example data
data(:,2) = [1.1 1.2 2.1 3.5 6.1 1.2 5.1 5.2 9.1 7.1]';

% sort the array using column 2
[x, i] = sort(data(:,2));

% store sorted array
data_sorted = data(i,:);

%loop through
for i = 1:length(data)

% find number of values less than
less_than = numel(find(data_sorted(:,2)<data_sorted(i,2)));

% percentile rank
data_sorted(i,3) = floor((less_than)*10/length(data))/10;

% clear temp variable
clear less_than
end

% put it back together
[y, j] = sort(data_sorted(:,1));
prctile_rank = data_sorted(j,3)

% the answer is:
prctile_rank = [0.00 0.10 0.30 0.40 0.70 0.10 0.50 0.60 0.90 0.80]

% close but not good enough. especially the last but one differs by 0.1

Can somebody suggest a better way of doing this?

Ashish Uthama

unread,

Sep 16, 2010, 12:45:49 PM9/16/10

to

Naresh,

In

> % percentile rank
> data_sorted(i,3) = floor((less_than)*10/length(data))/10;

What do you really want the denominator to be? the number of elements or
the max value of the data?
if you replace it with:

% percentile rank based on max value
data_sorted(i,3) = ((less_than)/max(data(:,2)));

You might get what you seem to want.

(Note: I think excel rounds the answer to two decimal digits).

Naresh Pai

unread,

Sep 16, 2010, 3:04:08 PM9/16/10

to

Ashish Uthama <first...@mathworks.com> wrote in message <i6thju$ebg$1...@fred.mathworks.com>...

Ashish,

Thanks for your reply. I tried the formula you suggested above. Here are the percentile ranks I get:

prctile_rank = [0 0.1099 0.3297 0.4396 0.7692 0.1099 0.5495 0.6593 0.9890 0.8791]'

With excel the values are (I have customized the cells in excel to show four digits after decimal point) :

[0.0000 0.1110 0.3330 0.4440 0.7770 0.1110 0.5550 0.6660 1.000 0.8888]'

Although the values are fairly close, they are not good enough for me. The issue is that I am using these numbers in another formula with a large multiplier. This increases the error by at least 2 to 3 orders of magnitude. Hence, I need to exactly replicate what excel does with its percentrank function.

In percentile rank formula, I am trying divide less_than by the number of elements in the data. In this example, I have used 10.

I think the issue comes when we have two elements with same values in the array. I am trying to figure how excel handles this.

Let me know if this is not clear. I have provided a small example here to simplify the question. I am working with a larger data set. I can send that to you and the code, if necessary.

Thanks again!

Naresh

Roger Stafford

unread,

Sep 16, 2010, 3:32:21 PM9/16/10

to

"Naresh Pai" <np...@uark.edu> wrote in message <i6tbnj$94h$1...@fred.mathworks.com>...

> I am trying to replicate MS Excel's percentrank function in Matlab.
>
> % example data
> data = [1.1 1.2 2.1 3.5 6.1 1.2 5.1 5.2 9.1 7.1]'
>
> % excel percentrank results are:
> % [0.00 0.11 0.33 0.44 0.77 0.11 0.55 0.66 1.00 0.88]'

> ..........

> Can somebody suggest a better way of doing this?

- - - - - - - -
Here is my best guess as to the logic in MS Excel's percentrank function, based on your single example. As in that example, let data be a column vector. Then do this:

[x,p] = sort(data); % Put data in ascending order
x = [true;diff(x)~=0;true]; % Check for repetitions
t = diff(x); % Check for beginnings and endings of repetitions series
f = find(t==-1); % Beginnings of repetition series
g = find(t==+1)+1; % Next addresses after their ends
x(g) = g-f; % Prepare x for doing cumsum (changes to numerical)
x = cumsum(x(1:end-1)); % Rank orderings
prctile_rank = floor(100*(x-1)/(length(x)-1))/100; % Convert to percentile
prctile_rank(p) = prctile_rank; % Restore the original order

Apparently for repeated data values Excel uses the lowest percentile figure among them and this is what the above manipulation on x accomplishes. This is the hard part of the needed computation.

Also it looks as though Excel uses truncation in arriving at percentiles values, which is why I have used the 'floor' function here in the next to last step. You may prefer to alter this step if Excel actually uses more than two decimal places.

As I have said, this is only a guess. It would require many examples to be sure of the details in Excel's logic, or preferably some adequate documentation.

Roger Stafford

Ashish Uthama

unread,

Sep 17, 2010, 9:51:14 AM9/17/10

to

Naresh,

Dividing by the max assumes that your data starts at 1. Maybe you need
to use the range?
i.e if your data ranges from 5 - 50, the denominator should probably be:

max(data(:,2) - min(data(:,2))

The repeated element handling is also a concern (I do not know what the
'right' answer for your application might be).

Echoing Rogers comments:

Victor

unread,

Sep 21, 2010, 9:20:05 AM9/21/10

to

"Naresh Pai" <np...@uark.edu> wrote in message <i6tbnj$94h$1...@fred.mathworks.com>...

Hi Naresh,

if you replace

> % percentile rank
> data_sorted(i,3) = floor((less_than)*10/length(data))/10;

with

% percentile rank based on max value

data_sorted(i,3) = less_than/(length(data)-1);

you will get what you wanted. :)

Skip Albertson

unread,

Oct 13, 2010, 2:11:04 PM10/13/10

to

I had written a function to do this and offer it here. Depending on the data, you may need to sort it first and deal with replicates, otherwise this emulates the Excel PercentRank function fairly well:

function PRank = PercentRank(x,p)
%
% function PRank = PercentRank(x,p);
%
% emulates the Percent Rank function found in Microsoft Excel.
% Input at array x, and a value for which you would like to find
% the percentile rank, p. The value returned, PRank, will use
% linery interpolation, as in Excel. Extrapolation will
% evaluate to NaN.
%
% For example:
%
% x=[5;
% 10;
% 15;
% 20;
% 95];
%
% p=15, returns 0.5, which is at the median
% p=14; returns 0.45, which is linearly interpolated.
%
% Skip Albertson, October 11, 2010
%
%
y=zeros(size(x));
n=numel(x);
for k=1:n
r=numel(find(x<x(k)));
y(k)=r/(n-1);
end
PRank=interp1(x,y,p,'linear');

Ashish Uthama <first...@mathworks.com> wrote in message <i6vroi$lol$1...@fred.mathworks.com>...

Michael

unread,

Dec 15, 2010, 3:56:05 PM12/15/10

to

Hi,

My collegue at work told me that if the percent rank of an input is ALSO enumerated in the rank array, then an alternate formula in excel would be

=rounddown(rank(target,list)-1/(count(list)-1),3)

Some kludgie matlab code below :

% minimal error checking so cuidado!
% include serial number for sorting later .. my data was 252 days of
% rolling averages

data(:,1) = 1:252;

% example data - must be 252 rows .. can include zero .. I did not test for
% negative numbers

data(:,2) = Bri_252_rollArray;

% sort the array using column 2
[x, i] = sort(data(:,2));

% store sorted array
data_sorted = data(i,:);

%loop through
for i = 1:length(data)

% find number of values less than target
rank = numel(find(data_sorted(:,2)<data_sorted(i,2)));

% percentile rank
x = rank/(length(data)-1);

% rounddown a la excel .. there must be a better way in matlab but I am
% a newbie hacker

z = num2str(x,4);
if length(z) >= 4
q = [z(1,1) z(1,2) z(1,3) z(1,4) z(1,5)];
else
q = z(1,1);
end
data_sorted(i,3) = str2double(q);

% clear temp variable
clear less_than
end

% put it back together
[y, j] = sort(data_sorted(:,1));

prctile_rank = data_sorted(j,3);

Michael

unread,

Dec 15, 2010, 5:42:05 PM12/15/10

to

my bad ...

=rounddown(rank(target,list)-1) / (count(list)-1),3)

"Michael " <m.r...@yahoo.com> wrote in message <ieba15$kbq$1...@fred.mathworks.com>...