plotting "frequency plots"

Kevin Maguire

Mar 7, 2003, 12:52:59 PM

Hi

Sorry if the subject line is not the right term for what I describe
...

I have a text file containing numerical data, sorted into order with
one number per line. This data all lies in the range 10.0 --> 50.0,
but my question is more general.

I want a frequency plot of the data, as I need to find the percentiles
for this data (50% of values < 27.3, 25% < 22.9, 10% < 19.6, and so
on.). i.e. I want to see a continuous line starting at zero and
ending at 100(%), basically heading from bottom left to top right with
the "values" as the x-axis ranging from 10 to 50 (in this case).

I hacked together a quick Fortran program for this data, but I am sure
gnuplot could do this, and I would like to script the graph generation
for web publishing purposes.

Suggestions ...

Thanks
Kevin

Heinz Rommerskirchen

Mar 10, 2003, 4:51:56 AM

kcf_m...@yahoo.com (Kevin Maguire) writes:

> I want a frequency plot of the data, as I need to find the percentiles
> for this data (50% of values < 27.3, 25% < 22.9, 10% < 19.6, and so
> on.). i.e. I want to see a continuous line starting at zero and
> ending at 100(%), basically heading from bottom left to top right with
> the "values" as the x-axis ranging from 10 to 50 (in this case).

Maybe there is a better solution but what I did in a similar case was:

- sort the file numerically outside of gnuplot
- count the lines
- in gnuplot: plot 'sorted-values' us 1:($0*100.0/number-of-lines)
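
The three steps above can also be wrapped in a small shell function, so the line count does not have to be typed in by hand. This is only a sketch: the function name `cdf_points` and the file name `values.txt` are made up for illustration, and it assumes one number per line with no blank lines.

```shell
# cdf_points FILE: print "value percent" pairs for a one-number-per-line file.
# (cdf_points is a hypothetical helper name, not part of gnuplot.)
cdf_points() {
    # Count the lines first, so the percentages can be scaled.
    n=$(wc -l < "$1")
    # Sort numerically, then emit each value with its cumulative percentage.
    sort -n "$1" | awk -v n="$n" '{ print $1, NR * 100 / n }'
}

# Example with a throwaway data file (values.txt is an assumed name).
printf '30\n10\n20\n40\n' > values.txt
cdf_points values.txt
```

The two output columns can be plotted with `using 1:2 with lines`. Note that awk's NR counts from one while gnuplot's $0 counts from zero, so this variant ends exactly at 100 whereas the $0 form stops one step short of it.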

--
Regards

Heinz

Hans-Bernhard Broeker

Mar 10, 2003, 6:16:20 AM

Kevin Maguire <kcf_m...@yahoo.com> wrote:
> Hi

> Sorry if the subject line is not the right term for what I describe
> ...

It isn't. A frequency plot would be what's more commonly called a
histogram, i.e. a plot of the sampled probability density function
(PDF) of your data. What you're describing below is a plot of the
sample dataset's cumulative distribution function (CDF).

> I want a frequency plot of the data, as I need to find the percentiles
> for this data (50% of values < 27.3, 25% < 22.9, 10% < 19.6, and so
> on.). i.e. I want to see a continuous line starting at zero and
> ending at 100(%), basically heading from bottom left to top right with
> the "values" as the x-axis ranging from 10 to 50 (in this case).

This requires summation of the input, which is not in gnuplot's bag of
tricks. You'll need a little bit of external script code for that.
'awk' can do it for you. Even on the fly, if you're on a somewhat
unix-ish platform that supports pipes (includes DOS and OS/2, but
currently not the MS-Windows versions of gnuplot):

plot "< awk '{sum = sum + $1 ; print sum}' data.dat" u 1 with lines

This won't automatically scale the output to go from y = 0 to y = 100
percent, either: that would require two passes over the dataset.
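
One possible way to get those two passes without leaving awk is to name the input file twice on the command line. The following is a sketch only; the function name `running_percent` and the file `sample.dat` are invented for the example, and it assumes the first column holds plain numbers with a non-zero total.

```shell
# running_percent FILE: running sum of column 1 as a percentage of the total.
# (running_percent is a hypothetical helper name.)
running_percent() {
    # The file is named twice, so awk reads it twice.
    # Pass 1 (NR == FNR) accumulates the total; pass 2 prints the scaled running sum.
    awk 'NR == FNR { total += $1; next }
         { sum += $1; print sum * 100 / total }' "$1" "$1"
}

# Example with a throwaway data file (sample.dat is an assumed name).
printf '1\n2\n3\n4\n' > sample.dat
running_percent sample.dat
```

Because nothing is stored in an array, this form does not depend on how many elements a given awk allows, at the cost of reading the file twice.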
--
Hans-Bernhard Broeker (bro...@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

Ed Morton

Mar 10, 2003, 6:03:39 PM

Hans-Bernhard Broeker wrote:

> plot "< awk '{sum = sum + $1 ; print sum}' data.dat" u 1 with lines
>
> This won't automatically scale the output to go from y = 0 to y = 100
> percent, either: that would require two passes over the dataset.

This:

plot "<awk '{x[NR]=$1; y[NR]=$2; if ($2 > maxY) {maxY = $2}} END {for (i=1;i<=NR;i++) {print x[i], y[i]*100/maxY}}' data.dat" u 1:2 with lines

would work as long as you have fewer data points than whatever limit your version
of awk sets on array sizes (e.g. 4096 is one I've seen; gawk's may be higher).
You could set the yrange to 0:100 before this if you want to ensure it starts at
zero.

Ed Morton

Mar 11, 2003, 9:28:04 AM

I should've read the original email better. I'm not sure about the specific
algorithm you're looking for, but to effectively get a second-pass in awk using
single values from the data file (and in this example convert them to a
percentage of the total), instead of doing this:

plot "< awk '{sum = sum + $1 ; print sum}' data.dat" u 1 with lines

do this:

plot "< awk '{sum = sum + $1; x[NR] = sum} END {for (i=1;i<=NR;i++) {print x[i]*100/sum}}' data.dat" u 1 with lines

Regards,

Ed

Kevin Maguire

Mar 15, 2003, 5:03:53 PM

Hi

Heinz wrote:
> - sort the file numerically outside of gnuplot
> - count the lines
> - in gnuplot: plot 'sorted-values' us 1:($0*100.0/number-of-lines)

Thanks. The other replies showed the lack of precision
in my original question, for which I apologise. I did indeed
want what I now remember is called a cumulative distribution function (CDF).

Using the above, I solved my problem (I know my data points all
lie between 10 and 50) with

set xrange [10:50]
set yrange [0:100]
set xtics 10,2,50
set ytics 0,10,100
numlines=`cat data | wc -l`
plot "<sort -n data" using 1:($0*100.0/numlines) with lines

although it would have been fairly trivial for me to
pre-sort the data if necessary.
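
For reading individual percentiles off the same data without looking at the graph, something along these lines might do. It is a sketch under assumptions: the function name `percentile` and the file `sample.txt` are invented here, and it uses the "nearest rank" definition (the smallest value with at least P percent of the data at or below it), which is only one of several definitions in use.

```shell
# percentile P FILE: nearest-rank percentile of a one-number-per-line file.
# (percentile is a hypothetical helper name; nearest-rank is an assumed choice.)
percentile() {
    sort -n "$2" | awk -v p="$1" '
        { v[NR] = $1 }                    # remember the sorted values
        END {
            if (NR == 0) exit 1           # no data, nothing to report
            k = p * NR / 100              # fractional rank
            r = int(k); if (r < k) r++    # round up to a whole rank
            if (r < 1) r = 1              # P = 0 maps to the smallest value
            print v[r]
        }'
}

# Example with a throwaway data file (sample.txt is an assumed name).
printf '50\n10\n40\n20\n30\n' > sample.txt
percentile 50 sample.txt
percentile 10 sample.txt
```

The values are held in an awk array, so the array-size caveat mentioned earlier in the thread applies to very old awk versions.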

Many Thanks to all those who responded.

Kevin