I'm trying to get Gnuplot to generate histograms from a data set.
I have a (number of) data files, with delay data in columns. I'd
like to generate the distribution of the delays in a histogram.
The following gnuplot code _almost_ works. It generates a nice
histogram, exactly the way I want it.
plot "link-delays.dat" using (floor($1)):(1.0) \
smooth frequency with histeps lt 1 lw 3 title "Delay"
The trouble happens when there are no data values between e.g. 4 and 5.
The result of the "smooth frequency" step then does not contain
a value for "4", which makes the "with histeps" generate wider steps
for "3" and "5", instead of having a zero-values step at "4" (which I
want because there are no values between 4 and 5).
Any ideas how to tackle this problem? Running the data through an
external program to knock it into bins would do it, but I'd rather
use just gnuplot.
Thanks in advance,
Jan-Pascal
(using Gnuplot 4.0 with Windows XP, epslatex terminal)
Silly me..
Jan-Pascal
Instead of using "with histeps", try
set boxwidth 0.3
set style fill solid
plot ... with boxes
--
Ethan A Merritt
Thanks for the suggestion.
I tried, but now I get a lot of vertical "spikes", instead of a single
curve. The reason I chose histeps to begin with is that I don't want
separate vertical bars, but rather a curve (if there are enough data
points). If the data were distributed normally, I'd like to get the bell
curve, as if I directly plotted erf(), instead of vertical bars
following that curve.
Any further ideas greatly appreciated!
Jan-Pascal
I'm not sure I understand exactly what you want, but if it's just a
question of making the bars touch each other then you can set boxwidth
to whatever the sample interval is.
set boxwidth 1.0
But if you mean that you want to fit a curve to the points, that's
a whole other story. You can specify a distribution P(x|a,b,c,...),
use gnuplot's "fit" commands to optimize the parameters a,b,c,...
and then plot the resulting function as a filledcurve.
That won't give you blank spaces where the data is missing, however.
--
Ethan A Merritt
I'm sorry to have been a bit unclear. Please allow me to explain.
Suppose I have a data file like this:
3.1
3.7
5.2
6.5
2.5
3.8
5.7
So with binsize 1 the bins are
2-3 1
3-4 3
4-5 0
5-6 2
6-7 1
then the resulting plot should look like this (my apologies
for the ASCII-art):
3      _
2     | |  _
1     | | | |_
0 ____| |_|   |_
  1 2 3 4 5 6 7 8 9
I make plots both with a small number of data points
(40 data points, in 6 bins) and with a large number of points
(4000 data points, in 200 or so bins). In the latter plots
the steps become very small, so you see almost a continuous
curve.
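The binning in the example above can be sketched in a few lines of Python. This is only an illustration, not part of any gnuplot feature: the function name bin_counts is made up here, and the bins are assumed to be half-open intervals starting at multiples of the bin size. The point is that bins without any data are written out explicitly with a count of zero.

```python
import math
from collections import Counter

def bin_counts(values, binsize=1.0):
    """Count values per bin, keeping empty bins between the extremes as zeros.

    Assumes a non-empty list of numbers.
    """
    # Bin index k covers the half-open interval [k*binsize, (k+1)*binsize).
    counts = Counter(math.floor(v / binsize) for v in values)
    lo, hi = min(counts), max(counts)
    # Walk every index from lowest to highest so gaps show up as 0.
    return [(k * binsize, counts.get(k, 0)) for k in range(lo, hi + 1)]

delays = [3.1, 3.7, 5.2, 6.5, 2.5, 3.8, 5.7]
for start, n in bin_counts(delays):
    print(start, n)
```

For the seven sample values this prints one line per bin from 2 to 6, including a line with count 0 for the bin starting at 4.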
If I use "with histeps", I get approximately this:
(actually the step between 3 and 4 extends to 4.5)
3      _
2     | |_ _
1     |     |_
0 ____|       |_
  1 2 3 4 5 6 7 8 9
Because there are no data points between 4 and 5, as per the
"with histeps" specs, the steps around are extended until they
touch in the middle. It is what the specification says it should
do, just not what I want :-)
If I use "with boxes", with boxwidth 1.0, I get this:
3      _
2     | |  _
1     | | | |_
0 ____| |_| | |_
  1 2 3 4 5 6 7 8 9
which might be fine for a small number of bins, but not for a
larger number, since that effectively fills up the area under
the distribution. This also makes it impossible to plot two
distributions into the same plot.
Is there a workaround, for instance giving a boxwidth to the "with
histeps" method, or telling "with boxes" not to draw the lines
between boxes down to the X-axis?
Jan-Pascal
The short answer is "no". Although gnuplot can be abused as a data
analysis program, it isn't really meant for this, as the FAQ makes
clear, and as you're finding out. You'll have to use an external
program or script to do your binning for you; gnuplot will happily
plot the resulting data.
The slightly longer answer is that your method will work provided
there's at least one count in all the bins. There's no mechanism to
tell `smooth freq` that a bin is empty -- indeed, it doesn't really
have a concept of bins, since it looks at similar x values.
Theo
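To make the point about similar x values concrete, here is a rough Python model of the behaviour described above. It is not gnuplot's actual implementation, and the function name smooth_frequency is only borrowed for illustration: it sums the y values of all points sharing the same x and returns them sorted by x, and nothing else.

```python
import math
from collections import defaultdict

def smooth_frequency(points):
    """Sum the y values of all points that share the same x, sorted by x."""
    totals = defaultdict(float)
    for x, y in points:
        totals[x] += y
    return sorted(totals.items())

delays = [3.1, 3.7, 5.2, 6.5, 2.5, 3.8, 5.7]
# Same mapping as `using (floor($1)):(1.0)` in the original plot command.
summed = smooth_frequency((math.floor(d), 1.0) for d in delays)
print(summed)
```

The result has entries for x = 2, 3, 5 and 6 only. Since no input point maps to x = 4, nothing in this scheme can produce a zero there, which is why the empty bin has to be added by a separate binning step.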
I admit defeat and I will use a small Python script to do my binning.
Thank-you to Ethan and you for trying to help me.
Jan-Pascal
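#!/usr/local/bin/perl
# Call as: bin <binsize> [min,max]
# Reads "x y" pairs from standard input and sums the y values per bin.
$nargs=@ARGV;
if ($nargs>=1)
{
    $binsize=$ARGV[0];
}
else
{
    $binsize=1;
}
if ($nargs>=2)
{
    # Optional range, given as "min,max"
    $range=$ARGV[1];
    ($min,$max)=split /,/,$range;
    $rangegiven=1;
}
else
{
    # Without a range, bins start at zero
    $min=0;
    $rangegiven=0;
}
while (<STDIN>)
{
    ($x,$y)=split;
    if (!$rangegiven || ($rangegiven && $x>=$min && $x<=$max))
    {
        # Bin index relative to the lower end of the range
        $x-=$min;
        $n=int($x/$binsize);
        $histogram[$n]+=$y;
    }
}
# Print the start of each bin and its sum; empty bins come out as zero
$binnum=0;
foreach $bin (@histogram)
{
    $x=$min+$binnum*$binsize;
    printf "%e %e\n",$x,$bin;
    $binnum++;
}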
I did already (mine's in Python, it's actually my very first Python
program, and I must say I like its features). It's specific for what I
need it to do (tabulating multiple columns, making histograms of the
difference between each column and another fixed column, etc.), but the
command line options are nicely documented so I'll post it here to keep
it safe for posterity...
#!/usr/bin/python
# Needed for true (not floor) division
from __future__ import division
import math
import sys
import string
import getopt
def usage():
    print """hist.py reads from standard input and writes to standard
output. It creates a histogram for each column of input data as a column
of the output. Minimum and maximum values are determined automatically.
Bin size can be given. If the -e option is used, histograms of the
differences of the values in each column compared to a reference column
are made (x_i-x_r). The -r option is similar, but generates histograms
of the relative differences ( (x_i-x_r)/x_r ).

By default, bins are 0-binsize, binsize-2*binsize, etc. With the -z
option, bins are -binsize/2 - binsize/2, binsize/2 - 3*binsize/2, etc.

The -c option lets you choose which columns from the input file to use.

The -i option calculates, for each column, the interval in which a
given percentage of the values lies. For example, with -i 95, a comment
line is written to the output giving the 2.5th percentile and the 97.5th
percentile for each column.

Input format: Each line contains a fixed number of floating point
numbers. Empty lines and lines starting with '#' are ignored.

Output format: Each line starts with the starting value of the bin
(centre value if --centre-around-zero is active), followed by the number
of values in the bin for each of the input columns.

Usage: hist.py [options] <infile >outfile

Options:
-h, --help
    Show this help
-b num, --binsize=num
    Set bin size to num (default: 1.0)
-e num, --errors-from=num
    Calculate errors compared to column num (0=first column)
-r num, --relative-errors-from=num
    Calculate relative errors compared to column num
    (0=first column)
-z, --centre-around-zero
    Make bins around zero (-0.5 to 0.5, 0.5 to 1.5, etc.),
    instead of from -1 to 0, 0 to 1, etc.
-n, --normalise (not implemented)
    Normalise histogram such that sum( bin[i]*binsize ) == 1
-c, --columns=col1,col2,...
    Choose columns to make histograms of (0=first column)
-i, --determine-interval=num
    Determine the interval in which num% of the values in each
    column lie
"""
try:
    opts, args = getopt.getopt(sys.argv[1:], "hnzb:e:r:c:i:",
        [ "help", "normalise", "centre-around-zero",
          "binsize=", "errors-from=", "relative-errors-from=",
          "columns=", "determine-interval=" ] )
except getopt.GetoptError:
    usage()
    sys.exit(2)
# Defaults
binsize=1.0
do_error=False
do_relative=False
zero_based=False
columns=None
interval=None
do_interval=False
for o, a in opts:
    if o in ("-h", "--help"):
        usage()
        sys.exit()
    if o in ("-b", "--binsize"):
        binsize = float(a)
    if o in ("-e", "--errors-from"):
        do_error=True
        reference = int(a)
    if o in ("-r", "--relative-errors-from"):
        do_relative=True
        reference = int(a)
    if o in ("-z", "--centre-around-zero"):
        zero_based = True
    if o in ("-c", "--columns"):
        columns = [ int(col) for col in a.split(',') ]
    if o in ("-i", "--determine-interval"):
        interval = float(a)
        do_interval = True
if do_error and do_relative:
    print 'Error: -e and -r are mutually exclusive'
    print 'Use --help for details'
    sys.exit(2)
if do_interval and (interval<0 or interval>100):
    print 'Error: argument to -i should be between 0 and 100'
    print 'Use --help for details'
    sys.exit(2)
min = 1E38
max = -1E38
dicts={} # maps col# to dict containing histogram for col
all_values={} # maps col# to list containing values for col
for line in sys.stdin:
    if len(line)==0 or line[0]=='#':
        print line
        continue
    values = line.split()
    if len(values)==0:
        continue
    if do_error or do_relative:
        reference_value = float(values[reference])
    if columns == None:
        columns = range(len(values))
    for i in columns:
        value = float(values[i])
        if do_error:
            value = value - reference_value
        if do_relative:
            value = (value - reference_value) / reference_value
        if do_interval:
            if not all_values.has_key(i):
                all_values[i] = []
            all_values[i].append(value)
        if zero_based:
            bin=int(math.floor(value/binsize+0.5));
        else:
            bin=int(math.floor(value/binsize))
        if(bin<min): min=bin
        if(bin>max): max=bin
        if not dicts.has_key(i):
            dicts[i] = {}
        if not dicts[i].has_key(bin):
            dicts[i][bin] = 1;
        else:
            dicts[i][bin] += 1;
if columns==None:
    columns = dicts.keys()
    columns.sort()
if do_interval:
    print 'Determining ', interval, '% interval'
    for i in columns:
        all_values[i].sort()
        distance = int((100-interval)*len(all_values[i])/100.0/2)
        left = all_values[i][distance]
        right = all_values[i][len(all_values[i])-distance-1]
        print '# column ', i, ' left: ', left, '; right: ', right
    print
# Include zero bins before and after histograms
for bin in range(min-1,max+2):
    sys.stdout.write(str(bin*binsize)+'\t')
    for col in columns:
        dict = dicts[col]
        # for dict in dicts:
        if dict.has_key(bin):
            sys.stdout.write(str(dict[bin]))
        else:
            sys.stdout.write('0')
        sys.stdout.write('\t')
    sys.stdout.write('\n')
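For readers on a current interpreter, the interval calculation behind the -i option can be sketched on its own in Python 3. This is a minimal restatement of the same index arithmetic, not a drop-in replacement for the script; the function name central_interval is hypothetical, and the clamp on the tail size is an addition here so that the two bounds cannot cross for very small percentages.

```python
def central_interval(values, percent):
    """Return (left, right) bounds enclosing roughly `percent`% of the values."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # Number of values to drop from each tail, as in the script above.
    skip = int((100 - percent) * n / 100.0 / 2)
    # Addition in this sketch: never drop so many that left passes right.
    skip = min(skip, (n - 1) // 2)
    return ordered[skip], ordered[n - skip - 1]

print(central_interval(list(range(1, 101)), 95))
```

With the integers 1 to 100 and percent set to 95, two values are dropped from each end, so the bounds are the third smallest and third largest values.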