
plotting "frequency plots"

Kevin Maguire

Mar 7, 2003, 12:52:59 PM
Hi

Sorry if the subject line is not the right term for what I describe
...

I have a text file containing numerical data, sorted into order with
one number per line. This data all lies in the range 10.0 --> 50.0,
but my question is more general.

I want a frequency plot of the data, as I need to find the percentiles
for this data (50% of values < 27.3, 25% < 22.9, 10% < 19.6, and so
on.). i.e. I want to see a continuous line starting at zero and
ending at 100(%), basically heading from bottom left to top right with
the "values" as the x-axis ranging from 10 to 50 (in this case).

I hacked a quick fortran program for this data, but I am sure gnuplot
could do this and I would like to script the graph generation bit for
web publishing purposes.

Suggestions ...

Thanks
Kevin

Heinz Rommerskirchen

Mar 10, 2003, 4:51:56 AM
kcf_m...@yahoo.com (Kevin Maguire) writes:

> I want a frequency plot of the data, as I need to find the percentiles
> for this data (50% of values < 27.3, 25% < 22.9, 10% < 19.6, and so
> on.). i.e. I want to see a continuous line starting at zero and
> ending at 100(%), basically heading from bottom left to top right with
> the "values" as the x-axis ranging from 10 to 50 (in this case).

Maybe there is a better solution but what I did in a similar case was:

- sort the file numerically outside of gnuplot
- count the lines
- in gnuplot: plot 'sorted-values' us 1:($0*100.0/number-of-lines)
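
The three steps above can also be folded into one shell pipeline. This is only a sketch: the function name ecdf is made up for illustration, and it counts from 1 rather than from gnuplot's zero-based $0, so the last point lands exactly on 100 percent.

```shell
# ecdf FILE: print "value percent" pairs for a one-number-per-line file.
# The name ecdf is illustrative; it is not part of gnuplot or awk.
ecdf() {
    # LC_ALL=C keeps "." as the decimal point for both sort -n and awk.
    LC_ALL=C sort -n "$1" | LC_ALL=C awk '
        { v[NR] = $1 }                        # buffer the sorted values
        END {
            for (i = 1; i <= NR; i++)         # i*100/NR = percent of values <= v[i]
                printf "%s %.2f\n", v[i], i * 100.0 / NR
        }'
}
```

Redirecting the output to a file gives two columns that gnuplot can plot with "using 1:2 with lines".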

--
Regards

Heinz

Hans-Bernhard Broeker

Mar 10, 2003, 6:16:20 AM
Kevin Maguire <kcf_m...@yahoo.com> wrote:
> Hi

> Sorry if the subject line is not the right term for what I describe
> ...

It isn't. A frequency plot would be what's more commonly called a
histogram, i.e. a plot of the sampled probability density function
(PDF) of your data. What you're describing below is a plot of the
sample dataset's cumulative distribution function (CDF).

> I want a frequency plot of the data, as I need to find the percentiles
> for this data (50% of values < 27.3, 25% < 22.9, 10% < 19.6, and so
> on.). i.e. I want to see a continuous line starting at zero and
> ending at 100(%), basically heading from bottom left to top right with
> the "values" as the x-axis ranging from 10 to 50 (in this case).

This requires summation of the input, which is not in gnuplot's bag of
tricks. You'll need a little bit of external script code for that.
'awk' can do it for you. Even on the fly, if you're on a somewhat
unix-ish platform that supports pipes (includes DOS and OS/2, but
currently not the MS-Windows versions of gnuplot):

plot "< awk '{sum = sum + $1 ; print sum}' data.dat" u 1 with lines

This won't automatically scale the output to go from y = 0 to y = 100
percent, either: that would require two passes over the dataset.
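
One way to get those two passes, sketched here under the assumption that the data sits in a regular file with a nonzero total, is to name the file twice on the awk command line. The helper name cumpct is hypothetical.

```shell
# cumpct FILE: running sum of column 1, scaled so the last line is 100.
# cumpct is a made-up name; FILE is read twice, so it cannot be a pipe.
cumpct() {
    LC_ALL=C awk '
        NR == FNR { total += $1; next }       # pass 1: grand total only
        { sum += $1                           # pass 2: running sum ...
          printf "%.2f\n", sum * 100.0 / total }   # ... as a percent of the total
    ' "$1" "$1"
}
```

Note that this scales the sum of the values, which is what the command above accumulates. The share of data points below a given value is a line count instead.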
--
Hans-Bernhard Broeker (bro...@physik.rwth-aachen.de)
Even if all the snow were burnt, ashes would remain.

Ed Morton

Mar 10, 2003, 6:03:39 PM

Hans-Bernhard Broeker wrote:

> plot "< awk '{sum = sum + $1 ; print sum}' data.dat" u 1 with lines
>
> This won't automatically scale the output to go from y = 0 to y = 100
> percent, either: that would require two passes over the dataset.

This:

plot "<awk '{x[NR]=$1; y[NR]=$2; if ($2 > maxY) {maxY = $2}} END {for (i=1;i<=NR;i++) {print x[i], y[i]*100/maxY}}' data.dat" u 1:2 with lines

would work as long as you have fewer data points than whatever limit your version
of awk sets on array sizes (e.g. 4096 is one I've seen - gawk may be higher).
You could set the yrange to 0:100 before this if you want to ensure it starts at
zero.

Ed Morton

Mar 11, 2003, 9:28:04 AM
I should've read the original email better. I'm not sure about the specific
algorithm you're looking for, but to effectively get a second-pass in awk using
single values from the data file (and in this example convert them to a
percentage of the total), instead of doing this:

plot "< awk '{sum = sum + $1 ; print sum}' data.dat" u 1 with lines

do this:

plot "< awk '{sum = sum + $1; x[NR] = sum} END {for (i=1;i<=NR;i++) {print x[i]*100/sum}}' data.dat" u 1 with lines

Regards,

Ed

Kevin Maguire

Mar 15, 2003, 5:03:53 PM
Hi

Heinz wrote:
> - sort the file numerically outside of gnuplot
> - count the lines
> - in gnuplot: plot 'sorted-values' us 1:($0*100.0/number-of-lines)

Thanks. The other replies showed up the lack of precision
in my original question, for which I apologise. I did indeed want
what I now remember is called a cumulative distribution function (CDF).

Using the above, I solved my problem (I know my data points all
lie between 10 and 50) with

set xrange [10:50]
set yrange [0:100]
set xtics 10,2,50
set ytics 0,10,100
numlines=`cat data | wc -l`
plot "<sort -n data" using 1:($0*100.0/numlines) with lines

although it would have been fairly trivial for me to
pre-sort the data if necessary.
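
To read a percentile off as a number rather than from the graph, something like the following may do. It is a sketch only: the name pctl is invented, and it uses the nearest-rank convention, which is just one of several ways to define a percentile.

```shell
# pctl P FILE: nearest-rank P-th percentile (0 < P <= 100) of a
# one-number-per-line file. pctl is an illustrative name.
pctl() {
    LC_ALL=C sort -n "$2" | LC_ALL=C awk -v p="$1" '
        { v[NR] = $1 }                        # buffer the sorted values
        END {
            if (NR == 0) exit 1               # no data, no percentile
            r = p * NR / 100                  # fractional rank
            k = int(r); if (k < r) k++        # round up to a whole rank
            if (k < 1) k = 1
            print v[k]
        }'
}
```

For example, "pctl 50 data" prints the median under this convention, and "pctl 10 data" the value at or below which 10% of the points lie.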

Many Thanks to all those who responded.

Kevin