Re: describing phases with cumulative probability plots

Rayfo...@aol.com

Feb 10, 2015, 12:07:21 PM
to ox...@googlegroups.com
Hello Dan,
 
May I make a few comments (with the usual caveats i.e. I'm no expert etc.)
 
1. I think you have mixed up the 95% markers (red +'s) and the medians (blue x's) on your diagram.
 
2. I don't think the grey "polygon" shows a cumulative probability. It seems to be a sum of all the dates, which can be achieved in OxCal using the 'Sum' function within the phase, or alternatively using 'Sum' instead of 'Phase' as the group heading. I don't think summing the individual date probability densities results in a meaningful overall probability (density or otherwise); it is just a picture of where the dates you have appear on the date line, with a bit of normalisation. Hence there is no 95% uncertainty line associated with the Sum function.
 
3. Is there much difference between your R plot and the following? Just go to table view and untick everything except start, sum and end. The second plot is the 95% uncertainty Span between start and end.
 
 
Or am I missing something?
 
Best wishes
 
Ray
 
 
In a message dated 10/02/2015 09:30:53 GMT Standard Time, fuzzy.c...@gmail.com writes:

Hi all,

Thanks to the work of many of the regular contributors here, I think I've been able to develop a reasonable grasp of the relationships between 14C dates, archaeological phases, and OxCal's boundaries.  Something that I still struggle with is how best to describe and/or illustrate this - there've been a few clear discussions here of the inadvisability of using medians (about which I certainly agree), and while reporting 95% ranges for beginning and ending phase dates is comprehensive, I think it can be confusing for a non-expert audience.  It also fails, I think, to capture the potentially important issue that there may be a span of time *between* those uncertain boundaries that is known with a high degree of certainty to correspond with an archaeological phase.

 

At any rate, I've started plotting these phases using cumulative probability plots, and would appreciate any input on the pros and cons (this will come out in Senri Ethnological Studies as part of some of my work on first millennium BCE Central Andean chronology - but that may not be a publication whose readership overlaps much with the OxCal list). 

 

The logic is that the parameter of archaeological interest/utility is less the probability that any particular calendar date is the exact phase start date, than it is the probability that, for any particular calendar date, the phase *has begun* (or ended).

 

I've developed a function in R to take two OxCal .prior files (boundaries, as I've used it, though there's no reason that they have to be) and plot a polygon that comprises the cumulative probability that a phase *has begun*, (potentially) a period between the start and end dates when the phase is 100% certain to be underway (within the limits of the information/assumptions, of course), and the reversed cumulative probability that the phase remains underway.
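For readers who want to see the arithmetic, here is a minimal sketch in Python (not the R function described above, which was not posted). It assumes each boundary is available as a density on a shared, evenly spaced calendar grid; the grid, the `normal_pdf` helper and the two Gaussian stand-in densities are invented for illustration.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Gaussian density, used only as a stand-in for a boundary posterior."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Hypothetical calendar grid (negative = BCE) at 1-year resolution.
years = np.arange(-1400, -699)
step = 1.0

# Stand-in boundary densities; real ones would come from OxCal output.
start_pdf = normal_pdf(years, -1200, 20)
end_pdf = normal_pdf(years, -900, 20)

# P(phase has begun by year t): running area under the start density.
has_begun = np.cumsum(start_pdf) * step

# P(phase has not yet ended at year t): area under the end density
# from t onwards, i.e. the cumulative curve accumulated from the right.
not_ended = np.cumsum(end_pdf[::-1])[::-1] * step
```

With boundaries this far apart, both curves sit at essentially 1 through the middle of the phase, which is the flat top of the polygon.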

 

The attached image illustrates this using simulated data (40 dates drawn from a uniform distribution, run with R_Simulate to create a bounded phase model) for a 1200 - 900 BCE archaeological phase.  The figure shows the calculated start and end boundaries (dotted lines), the 95% confidence earliest and latest dates (red plusses), the medians (blue x's), and the cumulative probability polygon (grey).  Obviously there's no new information represented by the cumulative probability polygon, but it emphasizes the period in the middle that can confidently be asserted, and I think provides a more easily digestible single visual summary than the two boundary pdfs.

 

Having said that, I would be very interested to hear the opinions of this forum, particularly about a) whether there are some hidden pitfalls here that I may be missing, and b) whether this has been done before and I just don't know it.  I can share the R function code if people would be interested in using it.

 

Thanks,


Dan


--

Daniel A. Contreras, Ph.D
Postdoctoral Research Fellow, OT-Med
Adaptation of Mediterranean Economies of the Past to Hydroclimatic Changes (AMENOPHYS)
Institut Méditerranéen de Biodiversité et d'Ecologie - IMBE
Aix-Marseille Université
Campus Aix
Technopôle Arbois - Méditerranée
Bâtiment Villemin - BP 80
F-13545 Aix-en-Provence cedex 04, France



--
You received this message because you are subscribed to the Google Groups "OxCal" group.
To unsubscribe from this group and stop receiving emails from it, send an email to oxcal+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Christopher Ramsey

Feb 10, 2015, 1:07:57 PM
to ox...@googlegroups.com
I think what is being looked at here is similar to Sum - but Sum only sums the distributions for dates measured within your phase - so sometimes shows some additional structure.

I think you can achieve the same as Dan was producing using the following code.

Sequence()
{
 Boundary();
 Phase()
 {
  // ... rest of dates within the phase
  Date("Sample from phase");
 };
 Boundary();
};

What this gives you is the date for an event within the phase that has no dating information.

The only disadvantage of doing it this way is that the distribution is MCMC sampled and so not totally smooth. However, it has the advantage that it does take into account the correlation between the start and end boundaries - which the marginal priors will not. I'm not sure if that will actually make a difference, once normalised.

We have used these distributions in papers to illustrate where phases are. They do capture the uncertainty in the start, end and duration visually - even if it is not easy to extract that information quantitatively (or not if there is any overlap between the start and end). I also think the distribution does have a definite meaning - as suggested above.

Best wishes

Christopher


dcontreras

Feb 10, 2015, 5:01:31 PM
to ox...@googlegroups.com
Thanks to you both.  Christopher, that's not an approach that would have ever occurred to me, but does look really interesting.  I'll have to experiment with it a bit.
I've tried to deal with overlapping start/end boundaries by calculating the cumulative probabilities in the same way and then plotting the polygon resulting from taking the lower of the two if they overlap; the result does capture the point that there is no period that can be definitively asserted as part of the phase, which I think makes sense but does miss out the effects of correlation between the two.
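The "lower of the two" rule can be sketched as follows. This is an illustration in Python rather than the R code, and the `phase_polygon` and `normal_pdf` helpers, the grid and the two overlapping Gaussian boundaries are all invented for the example.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Gaussian density, a stand-in for a boundary posterior."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def phase_polygon(start_pdf, end_pdf, step):
    """Lower envelope of P(begun by t) and P(not yet ended at t)."""
    has_begun = np.cumsum(start_pdf) * step
    not_ended = np.cumsum(end_pdf[::-1])[::-1] * step
    return np.minimum(has_begun, not_ended)

years = np.arange(-1400, -999)  # hypothetical 1-year grid, 1400 to 1000 BCE

# Invented, heavily overlapping boundaries only 20 years apart.
polygon = phase_polygon(
    normal_pdf(years, -1210, 30),
    normal_pdf(years, -1190, 30),
    1.0,
)

# The highest point of the polygon stays well below 1: no year can be
# asserted to fall within the phase with certainty.
peak = polygon.max()
```

Because the two curves are built from the marginal densities separately, the envelope carries no information about how the start and end boundaries vary together.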

The attached plot has a sample date as suggested by Christopher (in purple), and a sum (in green).  Ray, you'll see the sum is somewhat different, and shows some potentially misleading structure - in fact part of the reason that I started thinking about cumulative probabilities was very much to get away from the problems of Sums.  The grey polygon shows a cumulative probability of each boundary, and a probability of 1 between the two (sorry if that wasn't clear).  OxCal code w/ the simulated dates below as well.

Dan

ps. Ray, you're right about the reversed +'s and x's - sorry!

 Plot()
 {
  Sequence()
  {
   Boundary("TestStart");
   Phase("TestPhase")
   {
    R_Simulate("A1", -1180, 30);
    R_Simulate("A2", -1168, 48);
    R_Simulate("A3", -1166, 42);
    R_Simulate("A4", -1164, 28);
    R_Simulate("A5", -1158, 20);
    R_Simulate("A6", -1152, 44);
    R_Simulate("A7", -1150, 42);
    R_Simulate("A8", -1147, 44);
    R_Simulate("A9", -1141, 34);
    R_Simulate("A10", -1134, 27);
    R_Simulate("A11", -1134, 25);
    R_Simulate("A12", -1123, 32);
    R_Simulate("A13", -1120, 31);
    R_Simulate("A14", -1115, 33);
    R_Simulate("A15", -1105, 32);
    R_Simulate("A16", -1102, 38);
    R_Simulate("A17", -1089, 25);
    R_Simulate("A18", -1079, 22);
    R_Simulate("A19", -1077, 30);
    R_Simulate("A20", -1048, 32);
    R_Simulate("A21", -1044, 37);
    R_Simulate("A22", -1041, 30);
    R_Simulate("A23", -1040, 35);
    R_Simulate("A24", -1031, 36);
    R_Simulate("A25", -1016, 38);
    R_Simulate("A26", -1012, 20);
    R_Simulate("A27", -1010, 36);
    R_Simulate("A28", -1004, 24);
    R_Simulate("A29", -988, 25);
    R_Simulate("A30", -984, 48);
    R_Simulate("A31", -962, 49);
    R_Simulate("A32", -956, 38);
    R_Simulate("A33", -955, 47);
    R_Simulate("A34", -951, 35);
    R_Simulate("A35", -945, 32);
    R_Simulate("A36", -945, 32);
    R_Simulate("A37", -923, 30);
    R_Simulate("A38", -919, 49);
    R_Simulate("A39", -914, 26);
    R_Simulate("A40", -905, 32);

    Date("Sample from phase");
   };
   Boundary("TestEnd");
  };
  Sum()
  {
   R_Simulate("A1", -1180, 30);
   R_Simulate("A2", -1168, 48);
   R_Simulate("A3", -1166, 42);
   R_Simulate("A4", -1164, 28);
   R_Simulate("A5", -1158, 20);
   R_Simulate("A6", -1152, 44);
   R_Simulate("A7", -1150, 42);
   R_Simulate("A8", -1147, 44);
   R_Simulate("A9", -1141, 34);
   R_Simulate("A10", -1134, 27);
   R_Simulate("A11", -1134, 25);
   R_Simulate("A12", -1123, 32);
   R_Simulate("A13", -1120, 31);
   R_Simulate("A14", -1115, 33);
   R_Simulate("A15", -1105, 32);
   R_Simulate("A16", -1102, 38);
   R_Simulate("A17", -1089, 25);
   R_Simulate("A18", -1079, 22);
   R_Simulate("A19", -1077, 30);
   R_Simulate("A20", -1048, 32);
   R_Simulate("A21", -1044, 37);
   R_Simulate("A22", -1041, 30);
   R_Simulate("A23", -1040, 35);
   R_Simulate("A24", -1031, 36);
   R_Simulate("A25", -1016, 38);
   R_Simulate("A26", -1012, 20);
   R_Simulate("A27", -1010, 36);
   R_Simulate("A28", -1004, 24);
   R_Simulate("A29", -988, 25);
   R_Simulate("A30", -984, 48);
   R_Simulate("A31", -962, 49);
   R_Simulate("A32", -956, 38);
   R_Simulate("A33", -955, 47);
   R_Simulate("A34", -951, 35);
   R_Simulate("A35", -945, 32);
   R_Simulate("A36", -945, 32);
   R_Simulate("A37", -923, 30);
   R_Simulate("A38", -919, 49);
   R_Simulate("A39", -914, 26);
   R_Simulate("A40", -905, 32);
  };
 };
testphase_withsum&sample.png

Christopher Ramsey

Feb 11, 2015, 4:37:58 AM
to ox...@googlegroups.com
If you are going to use Sum you should anyway use it within a model as in:

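Plot()
{
 Sequence()
 {
  Boundary("TestStart");
  Sum("TestPhase")
  {
   // ... dates within the phase
  };
  Boundary("TestEnd");
 };
};

OR

Plot()
{
 Sequence()
 {
  Boundary("TestStart");
  Phase("TestPhase")
  {
   // ... dates within the phase
   Sum("Sum from phase");
  };
  Boundary("TestEnd");
 };
};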

Christopher
Untitled.pdf

Bayliss, Alex

Feb 11, 2015, 6:03:13 AM
to ox...@googlegroups.com

Hi Dan,

 

What you have done is exactly what I do when I want to show the time when a site was occupied in an animation. I export the priors for the start and end of the phase (or occupation if it is a more complex model), and then calculate the distribution you have shown in your png file in Excel. Spots then get redder as the probability that they were in use at a particular time increases (those where the boundaries overlap don’t ever get fully red). This is a pain in the neck, and anyone who can provide an easier way for me to do this is due a jar of home-made marmalade.

 

THANK YOU.

 

Alex



English Heritage is changing into two organisations. 

From Spring 2015, we shall become Historic England, a government service championing England's heritage and giving expert, constructive advice, and the English Heritage Trust, a charity caring for the National Heritage Collection of more than 400 historic properties and their collections.

This e-mail (and any attachments) is confidential and may contain personal views which are not the views of English Heritage unless specifically stated. If you have received it in error, please delete it from your system and notify the sender immediately. Do not use, copy or disclose the information in any way nor act in reliance on it. Any information sent to English Heritage may become publicly available.


Rayfo...@aol.com

Feb 11, 2015, 7:01:46 AM
to ox...@googlegroups.com
Hello Christopher, Dan,
 
I love the 'Irish Soda Loaf' shape derived from the Date(Sample from phase). Or should that be Pseudo Loaf?
 
I was not aware of the effect of the Date() function in these circumstances, i.e. without a defined expression.
 
The Commands index gives
 
'Date([Name], Expression);
  • type conversion function that returns a date or PDF for a date from an expression'
However I don't think it will work that way (Loaf shape) within a Sequence (as opposed to a Phase) as it will be bounded by the function before and after it, not the function between the boundaries?  Though 'Sum' should still perform as required.  However, a few extra lines of code should handle the matter for a Sequence:
 
viz.
  Sequence("LoafSample")
  {
   Boundary("=Start 1");
   Date("Loaf Sample");
   Boundary("=End 1");
  };
 
 
I can't help thinking the R-Code plot gives a mixed message, mainly due to the left hand scale.  It purports to be 'Probability', going from 0 to 1, but the Boundary plots are Probability Densities, their probabilities (uncertainties) are in the area under the curve.  In OxCal the Loaf Date() plot is also correctly shown as a Probability Density, the maximum value is not '1' but around 0.003.  The 95% under line in the 'Loaf Date()' is the uncertainty of where the (unknown) Date lies between the Boundaries.
 
Quoting one end of a 95% interval from one Boundary PDF and the other end from another Boundary PDF does not, in my mind, create a 95% Credible Interval for the whole interval between Boundaries, however the maths behind it is not obvious to me.  The OxCal Loaf Date() plot seems to be on firmer ground regarding the 95% Uncertainty range. 
 
Best wishes
 
Ray
 

MILLARD A.R.

Feb 11, 2015, 8:58:01 AM
to ox...@googlegroups.com
I think Dan and Alex's approach to the cumulative probabilities is best. It gives the probability that the year is within the span of the phase, which is easy to interpret once one knows that what is displayed is not the probability distribution for a single event. However, mathematically this is equivalent to the approach of including an undated event in a phase. The posterior distribution for such an event is simply a normalised version of the cumulative probability calculation. For display purposes I think the cumulative probability approach is probably better as it will give a smoother curve not subject to the jaggedness of MCMC sampling.

The Sum of the posterior densities of events within a phase is not quite the same thing. As the manual says, the Sum is essentially a logical OR operation, and it is therefore giving a probability distribution that assumes that the event of interest is one of the dated events, but we don't know which one and all are equally likely to be that event. As the number of dates in a phase increases this will approximate the undated event approach, but it will always be a less smooth distribution because the event of interest is constrained to match one of the dated events, and each of them will have its own multimodal calibrated distribution.
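A rough numerical picture of that "one of the dated events, each equally likely" reading, sketched in Python. This is not OxCal's implementation: the three Gaussian densities are invented stand-ins for individual date posteriors (real calibrated dates are usually multimodal), and the equal-weight average is my reading of the description above.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Gaussian density, a stand-in for one date's posterior."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

years = np.arange(-1300, -799)  # hypothetical 1-year grid, 1300 to 800 BCE

# Three invented dated events within a phase.
date_pdfs = [normal_pdf(years, m, 15) for m in (-1150, -1050, -950)]

# Equal-weight mixture: the event of interest is one of the dated events,
# and we do not know which one.
sum_pdf = np.mean(date_pdfs, axis=0)

area = sum_pdf.sum() * 1.0                    # still a normalised density
at_a_date = sum_pdf[years == -1150][0]        # high where a date sits
between_dates = sum_pdf[years == -1100][0]    # dips in the gaps between dates
```

With only a few dates the mixture is lumpy, since it can only place weight where a dated event sits; adding more dates fills the gaps.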




Best wishes

Andrew
--
 Dr. Andrew Millard 
e: A.R.M...@durham.ac.uk | t: +44 191 334 1147
 w: http://www.dur.ac.uk/archaeology/staff/?id=160
 Senior Lecturer in Archaeology, Durham University, UK

dcontreras

Feb 11, 2015, 10:25:27 AM
to ox...@googlegroups.com
Hi Ray,

I agree about the "mixed message" problem - in practice, I wouldn't include the plots of the boundaries or their 95% or median marks overlain on a cumulative probability plot, but included those in the plots I posted to illustrate the relationship between them and the cumulative probability plot.  The y-scale is only meaningful for the latter, and not for the boundary pdfs (another good reason not to plot them together - but I do find it useful to see them in the same space).

cheers,

Dan

Rayfo...@aol.com

Feb 11, 2015, 12:59:06 PM
to ox...@googlegroups.com
Hi Dan
 
Can you help me understand what the aim is?  The following short model is only four dates.  What would your R-Plot look like?  You can omit the start and end boundary plots in your output but include any markers pertaining to your 'cumulative probability plot'.
 
best wishes
 
Ray
 
 
 
 Plot()
 {
  Sequence()
  {
   Boundary("Start 1");
   Phase("1")
   {
    Sum("Four Sums")
    {
    };
    R_Simulate("A1", -1200, 25);
    R_Simulate("A2", -1192.5, 25);
    R_Simulate("A3", -1185, 25);
    R_Simulate("A4", -1177.5, 25);
    Date("Sample from Phase");
   };
   Span("Four Spans");
   Boundary("End 1");
  };

  Sequence("LoafSample")
  {
   Boundary("=Start 1");
   Date("Loaf Sample");
   Boundary("=End 1");
  };
 };

dcontreras

Feb 11, 2015, 2:19:53 PM
to ox...@googlegroups.com
Hi Ray,

This actually is the tricky sort of case where the boundaries are close enough together that the pdfs overlap - and I'm not sure that I'm handling the merging of these cumulative probabilities in the best way.  Actually I think there's more of a need for cumulative probability plots in cases where the boundaries are fairly far apart (hopefully my description below makes clear why).

In any case, the aim is to convey the idea that an archaeological phase can be asserted with considerable confidence for a stretch in the middle, and with less confidence at either extreme - but most of the ways that we describe such phases in print (listing start and end dates, whether extremes, medians, or ranges) or visually (plotting boundaries, for instance) tend to emphasize the extremes rather than the middle.  When describing a phase (rather than a single date), I think what's of most interest isn't the probability that a particular calendar year is the start date (i.e., the probability represented by the height of the pdf for any given year), but rather the probability that by a particular year a phase has begun (or conversely ended).  Those can be calculated by converting the pdf into a cumulative probability curve, which is what I'm doing in R (and then subsequently plotting them, which actually was at least as much trouble).  The result - I hope - is a pretty simple visual summary of a phase, including how well (or poorly) the start and end dates are constrained.

To satisfy my own curiosity I've also plotted it with the "Sample from Phase" date scaled and overlain.  It's not wildly different, but different enough to make me wonder; I would guess there's more than MCMC noise in play there (in contrast to the results from non-overlapping boundaries).

Best,

Dan
test4.png
test4_withSampleDate.png

Rayfo...@aol.com

Feb 11, 2015, 4:09:55 PM
to ox...@googlegroups.com
Hi Dan,
 
Thanks for the response.  Yes, it is a tricky example, but I wanted to see what the R-Plot produced.
 
I am not familiar with the concept of 'Merging Cumulative Probabilities', that may be my problem. 
 
I understand the Probability Density Function (PDF) and its integral, the Cumulative Probability Function (CPF), which runs from 0 to 1 (a sigmoid for a Normal PDF).  I also understand that the probability at any single value of the PDF is zero: probability is an area under the PDF, so zero width gives zero probability, in simple terms.
Now, when you say
"what's of most interest isn't the probability that a particular calendar year is the start date (i.e., the probability represented by the height of the PDF for any given year), but rather the probability that by a particular year a phase has begun (or conversely ended)"
I get a sense of confusion between the height of the PDF and 'Probability'. As you will appreciate, they are not the same.  Any interpretation that suggests the 'Y' value of a PDF is a probability is mistaken (irrespective of the colour of the dots).
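To put numbers on the distinction between a density height and a probability, here is a small Python sketch using only the standard library. The Normal density (mean 1050 BCE, standard deviation 130 years) is an invented stand-in, chosen so that its peak height is of the order of the 0.003 mentioned earlier in the thread.

```python
from statistics import NormalDist

# Hypothetical stand-in for a date density (negative years = BCE).
density = NormalDist(mu=-1050, sigma=130)

# Height of the PDF at its mode: a density per year, about 0.003,
# and not the probability of that year.
height_at_mode = density.pdf(-1050)

# A probability is an area: here the chance the date falls in 1100-1000 BCE.
prob_in_century = density.cdf(-1000) - density.cdf(-1100)

# The cumulative curve read off at one year: the chance the date is
# 1100 BCE or earlier.
prob_by_1100 = density.cdf(-1100)
```

Only the last two quantities are probabilities; the first depends on the units of the calendar axis.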
 
As I understand it, we can never (as in never) be certain that all possible dating evidence is available to our model, so the OxCal program makes certain assumptions on the likely start and end boundaries of a group of dates, given the evidence of dates and priors in the model.  We are interested in the start and end boundaries and the uncertainty associated with dating them.
 
I've just had a thought....  Are you converting the Start Boundary PDF into a Cumulative Probability Function and joining it with a line at value '1' to the reversed End Boundary Cumulative Probability Function?  If so, I can see where the plot comes from, though I'm not sure what it means, except that where it is value '1' the group is certainly in progress, but off the top of the 'Loaf' the uncertainty doesn't seem to be statistically defined.
 
I still think the Date() function as suggested by Christopher seems a better bet.
 
What do you think?
 
Best wishes
 
Ray

dcontreras

Feb 12, 2015, 2:21:30 PM
to ox...@googlegroups.com
Hi Ray,

I am sure that there are others who can articulate this better than I, but let me try.  As I see it, the crucial issue is the distinction between a date and a phase.  Because we are (here) interested in a duration rather than a date, it's possible to frame the question as, "for a given date, how much of the pdf (area under the curve, that is) falls to the left of that date (or, for the end of a phase, the right)?"  In this sense a pdf is relatable to probability, and this makes it possible to talk about probabilities over time ("how likely is it that a phase has begun/ended?"). 

You're absolutely right about the joining of two CPFs; this is exactly that simple (I've added those in red to the attached plot to show that).  When the boundaries are non-overlapping this matches an information-less "Date" sample quite well (and according to Andrew, above, is mathematically equivalent - certainly I trust him much more than I do myself in these issues), but the latter features MCMC noise and so is maybe a bit less simple a visual display. 
I don't think I understand (yet) how OxCal calculates such a "Date" where there are overlapping boundaries, though - and am not sure that my "and, not or" approach is by any means the best (though intuitively it seems at least the most conservative to me). 

Best,

Dan
test4_withsampledateandCPs.png

Rayfo...@aol.com

Feb 12, 2015, 4:44:43 PM
to ox...@googlegroups.com
Hi Dan,
 
I think we may share the same articulation drawbacks!  And thanks for persevering with the discussion, I'm certain your time is much more precious than mine, since mine is mainly leisure.  However, to press on....
 
If you take a start boundary PDF and mark the 95.4% probability HPD and the 99.7% HPD ranges then convert the PDF into a Cumulative Probability Function CPF, are you not then marking the 95.4% left marker and the 99.7% right marker (ignoring the last 0.3%) to create the left hand slope of your plot?  And conversely for your right hand slope?  In which case the run up slope could be quite a bit longer as it runs asymptotically to '1' and the 'loaf top' subsequently shorter.  The slope area from 95.4% left marker to top is 97.7% HPD (assuming a near Normal boundary).  To get a 95.4% probability area for the lead slope, I think you would need to identify a left marker at a 90.8% HPD range. Then (100-90.8)/2 = 4.6%, giving the (100-4.6) = 95.4% HPD range from the new left marker to top of slope. Repeat for the right hand end.
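The arithmetic in that paragraph can be checked with a few lines of Python. This sketch assumes, as the paragraph does, a near-Normal boundary, so that an HPD range coincides with a central interval; the helper name `left_marker_tail` is made up for the example.

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal as a stand-in for a near-Normal boundary

def left_marker_tail(central_mass):
    """Probability lying to the left of the lower end of a central interval."""
    return (1.0 - central_mass) / 2.0

# Left marker of the 95.4% range: 2.3% lies to its left, 97.7% to its right.
tail_954 = left_marker_tail(0.954)
right_of_954 = 1.0 - tail_954

# Left marker of a 90.8% range: 4.6% lies to its left, 95.4% to its right.
tail_908 = left_marker_tail(0.908)
right_of_908 = 1.0 - tail_908

# Position of that 90.8% left marker, in standard deviations from the mean.
left_marker_908 = z.inv_cdf(tail_908)
```

For a skewed or multimodal boundary the HPD range is not symmetric in probability, so these halved tails would only be approximate.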
 
Getting these numbers from the files doesn't seem easy.
 
Is there not also an argument that by considering your '100% certainty area' of the R-Plot top, you should also be using the 99.7% outer marker of the HPD for the end slopes?
 
 
It all seems quite complex when the Date() query seems to do much of that anyway.  (and Home made marmalade goes so well with Irish Soda Bread, don't you think?)
 
Best wishes
 
Ray

Christopher Ramsey

Feb 13, 2015, 4:39:52 AM
to ox...@googlegroups.com
I think the issue with the short phases is to do with the correlation between the two boundaries. You get the same effect but more obvious if you try to find the time difference between the boundaries from the marginal posteriors: where they overlap there seems to be the possibility of a phase of <0 years length.

The way OxCal does the Date() command is just to find a sample between the two boundaries for each iteration of the MCMC cycle. So for this the boundaries have definite values and are obviously in the right order. You cannot reconstruct everything from the marginal posteriors where there is correlation between them.
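A toy simulation of that per-iteration sampling, in Python. It is not OxCal's code: the joint draws of the two boundaries are invented (a Normal start plus a positive Gamma-distributed duration) simply to give correlated boundaries for a short phase.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Invented stand-in for joint MCMC output: in every iteration the start
# boundary precedes the end boundary, although the two marginals overlap.
start = rng.normal(-1200, 15, n)
end = start + rng.gamma(shape=2.0, scale=10.0, size=n)

# One undated event per iteration, uniform between that iteration's boundaries.
sample_date = rng.uniform(start, end)

# Pairing the marginals at random discards the correlation between the
# boundaries and yields some apparent phases of negative length.
negative_lengths = np.mean(rng.permutation(end) - start < 0)
```

A histogram of `sample_date` is the sampled, slightly noisy analogue of the undated-event distribution, while `negative_lengths` shows what is lost when only the marginal distributions are available.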

Best wishes

Christopher


Rayfo...@aol.com

Feb 14, 2015, 12:23:04 PM
to ox...@googlegroups.com
Hi Dan,
 
I've set three diagrams for comparison in the attached Word Docx.
 
1. The first is the Start and End boundary priors pasted into Excel, with date axis aligned and a cumulative sum carried out then each value multiplied by 500 to achieve a percentage chance on the 'Y' axis.  I used the term '% chance' in order to steer clear of  'Probability' in case the 'non expert audience' are confused, as Probability has a mathematical meaning which is nebulous in such a tripartite diagram.
 
2. The second diagram takes the 'Sample from Phase' prior into Excel to give a date column and a probability density value column.  A MAX function on the PDF column finds the highest PDF value.  Then the MAX value is divided into each PDF value and multiplied by 100 to give a Percentage chance of MAX.
 
3. The third diagram is the OxCal 'Sample from Phase' single plot, showing the 95.4% and 99.7% probability bars below.
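The two spreadsheet scalings in items 1 and 2 can be written out in Python as below. One reading of the factor of 500 is a 5-year grid spacing times 100 for a percentage; that is an assumption about the exported file, and the Gaussian density here is an invented stand-in for it.

```python
import numpy as np

# Hypothetical exported density: values per year on a 5-year grid.
step = 5
years = np.arange(-1300, -1095, step)  # 1300 to 1100 BCE
pdf = np.exp(-0.5 * ((years + 1200) / 20.0) ** 2) / (20.0 * np.sqrt(2 * np.pi))

# Diagram 1: running sum times grid spacing is a cumulative probability;
# times 100 it is a percentage. On a 5-year grid the combined factor is 500.
percent_chance = np.cumsum(pdf) * step * 100

# Diagram 2: each density value as a percentage of the highest value.
percent_of_max = pdf / pdf.max() * 100
```

The first curve is a true cumulative percentage, while the second only rescales the density so its peak reads 100, so the two 'Y' axes do not measure the same thing.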
 
All three tell a similar story. 
 
1. The first one dealing with cumulative bounds is much more complex to achieve, mainly due to the need to align the X axis dates and invert the End boundary function, which will change with each model.
 
2. The second is a simple use of the OxCal Date(Sample from Phase) prior, normalized and likely an easy task in R.
 
3. I prefer the OxCal single plot as it gives all the relevant data on screen and is available with each model without further manipulation.
 
I wonder if increasing the K-Iterations might smooth out the ripples in the second diagram, if they are a problem?
 
Best wishes
 
Ray
Chance and Prob.docx

Erik

Feb 17, 2015, 3:34:11 PM
to ox...@googlegroups.com
Hi all—
Very interesting, everybody. Let me try to dumb things down a little. I have also struggled with how to present modeled results to non-expert audiences (as well as archaeologists who are repelled by statistics). As far as I know, the medians (not means) of the starting and ending boundaries are good estimates for when a phase begins and ends. I use graphics like the one I've attached to summarize a phase. I realize this discards a lot of the detail in what you have been discussing so far, but I have found that unless I keep it this simple, I start to lose people. In the figure, I used your boundaries, with 95% error bars. If there's lots of overlap or I need something that is visually cleaner, I use the 68% distribution.

This is blunt and unsophisticated, but I *think* it's a statistically valid interpretation of the phase. If not, someone please tell me! The explanation goes something like this (without mentioning the scary words "statistics" or "probability").

1. We have 40 radiocarbon dates for this phase (the light green bar)
2. The model estimates the phase's starting and ending dates (vertical blue bars)
3. There is some uncertainty in the starting and ending dates, just like with individual radiocarbon dates. The uncertainty is indicated by the horizontal red bars.
4. So to the best of our knowledge, the period most likely started at the blue bar, but it's possible that it actually began somewhere else along the red bar.
5. We make new models and get new dates to try and shrink the red bars, and become more certain about the date.

Hope this helps
Erik


Contreras for OxCal forum.jpg

dcontreras

Feb 18, 2015, 7:12:53 AM
to ox...@googlegroups.com
Hi Ray,

Sorry for the delay; a couple of other things caught up with me and I had to shove this onto the back burner.

I think some confusion may be stemming from my inclusion of the medians and 95% extremes of the boundaries on the first plot I posted.  These - like the normalized and scaled boundaries I plotted - were meant to illustrate the difference between different summaries of an archaeological phase, not to serve as part of a cumulative probability plot.

As far as the choice between this method and the "Date" sample as suggested by Christopher goes: Had I known that the "Sample from Phase" routine was possible, I might not have embarked on trying to figure out how to plot a cumulative probability (certainly there's no doubt that it's easier!).  On the other hand, I don't think it's at all obvious - either to users or an eventual audience - what such a plotted "Date" means, nor how it's generated, and I certainly learned a lot by thinking through what it was that I really wanted to be able to display, and how.

Best,

Dan

ps. I'd also agree that it's complicated to calculate and plot cumulative probability.  In fact, the point of my work with R was to get around that by developing a function that would automate the process.

Bayliss, Alex

Feb 18, 2015, 9:13:36 AM
to ox...@googlegroups.com

Hi Erik,

 

Yes, explaining this to statistics-phobic archaeologists is a challenge. I don’t think archaeologists need any encouragement, however, to ignore the errors on their dates (although I appreciate that you have added error to your medians). Folks might find the attached animation useful. English Heritage commissioned this from Derek Hamilton at SUERC a few years ago (this version is the grand-daughter of the original). You should be able to embed it in PowerPoint, and then click on it to stop it at will as you are explaining that scatter matters!

 

Alex

 






UniformDistribution_Animation-720p.wmv

Rayfo...@aol.com

Feb 18, 2015, 12:32:01 PM
to ox...@googlegroups.com
Hi Erik,
 
The simplicity of your diagram is seductive.  I think the difficulty lies in the fact that too many dynamic variables are at work to capture a single picture that does justice to answering all questions.  The Derek Hamilton animation (which I hadn't seen before, thanks Alex) certainly gives a good sense of the dynamics of the Bayesian analysis in action.
 
In my view (for what it's worth) your simple diagram could serve, but with some clarification:
 
1. The light green bar is the evidence we have of calibrated dates and stratification and other prior model information resulting in the positioning of the limiting red bars.
2. The red bars show the date range of uncertainty of where the group activity starts and ends.
3. The blue vertical bar (the median) shows a date within the red bar where there is a 50:50 chance of the start/end having taken place, but to be more certain we have to consider the full range of the red bar.
4. More information in '1.' may reduce the length of the red bars.
 
If the non-expert audiences have a problem with 50:50 as a concept, toss a coin, retire to a bar, have a beer, and some Irish soda loaf and some home made marmalade.  Delicious.
 
Best wishes
 
Ray

dcontreras

Feb 25, 2015, 5:25:24 AM
to ox...@googlegroups.com
Hi Erik,

I'd agree with Ray on the seductive dangers of your diagram.  In fact it was frustrations with something like that that drove me to the various experiments that kicked off this thread.

I think there are three potential problems w/ a diagram like this:
1) it can imply false precision (i.e. the medians look like they're the key number to pay attention to) and the temptation is to just use them (when in fact, if the pdf is irregular, they may not even be a good summary statistic),
2) the 95% intervals actually lose some information, in the case of phases, specifically because the probabilities can be thought of as accumulating over time. i.e., if dates of 1300 BC and 1200 BC both fall within that range, it's fairly unlikely that either really is the start date.  However, it's more likely - and possibly significantly more likely - that *by* 1200 BC the phase has begun, because if the phase began in any year before that then it would be underway by 1200 (in contrast to an individual date, when any two dates are mutually exclusive),
and
3) it can appear to lend significance to long tails - the undifferentiated red line can encourage looking at the extremes, where the probabilities are very low indeed.
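
To make point 2 concrete, here is a minimal sketch in Python (not something OxCal does for you, and not the R function mentioned in this thread). It assumes the start boundary has been exported as a density on an evenly spaced grid of calendar years; the grid, the roughly normal shape centred on 1250 BC, and the function name `prob_started_by` are all made up for illustration.

```python
import numpy as np

def prob_started_by(years, density, year):
    """P(start <= year) from a gridded boundary density.

    years:   evenly spaced calendar years, ascending (negative = BC)
    density: density at each year (any positive scaling)
    """
    years = np.asarray(years, dtype=float)
    p = np.asarray(density, dtype=float)
    p = p / p.sum()                  # turn densities into bin probabilities
    cdf = np.cumsum(p)               # running total, earliest to latest
    idx = np.searchsorted(years, year, side="right") - 1
    return 0.0 if idx < 0 else float(cdf[idx])

# Hypothetical start boundary: roughly normal, centred on 1250 BC, sd 40 years.
years = np.arange(-1400, -1099, 1)
density = np.exp(-0.5 * ((years + 1250) / 40.0) ** 2)

p_1300 = prob_started_by(years, density, -1300)
p_1200 = prob_started_by(years, density, -1200)
print(round(p_1300, 3), round(p_1200, 3))
```

Both years sit inside the 95% range of this made-up boundary, yet the chance that the phase is already underway is far higher by 1200 BC than by 1300 BC, which is the information a flat interval bar hides.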

I'd also be tempted to trim your green bar to highlight the part of the phase that *isn't* subject to uncertainty, i.e. between the inner extremes of the red lines.  At that point we'd be approaching the cumulative probability plots that this thread started with...you can see, I think, how my thinking on the problem went. 

Best,

Dan

Rayfo...@aol.com

Feb 25, 2015, 9:58:42 AM
to ox...@googlegroups.com
Hi Dan, Erik,
 
I wonder.
 
I think the median of the pdf represents a point at which there is an equal probability of being before or after that date, a point of balance if you like, irrespective of the skewness of the pdf.  The mean on the other hand can give a false impression of a central value depending on the skew.
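
A small sketch of that difference, in Python, assuming the pdf is available as densities on an evenly spaced grid of years. The skewed shape (a sharp peak at 1250 BC with a long tail towards earlier dates) and the helper name `median_and_mean` are invented for illustration and do not come from any real model.

```python
import numpy as np

def median_and_mean(years, density):
    """Median and mean of a gridded probability density."""
    years = np.asarray(years, dtype=float)
    p = np.asarray(density, dtype=float)
    p = p / p.sum()                               # bin probabilities
    cdf = np.cumsum(p)
    median = years[np.searchsorted(cdf, 0.5)]     # first year where the CDF reaches 0.5
    mean = float(np.sum(years * p))               # probability-weighted average
    return float(median), mean

# Hypothetical skewed boundary: peak at 1250 BC, long tail towards earlier dates.
years = np.arange(-1600, -1199, 1)
density = np.where(years <= -1250,
                   np.exp((years + 1250) / 80.0),    # slow decay on the early side
                   np.exp(-(years + 1250) / 10.0))   # fast decay on the late side

median, mean = median_and_mean(years, density)
print(median, round(mean, 1))
```

With this shape the mean is dragged further into the early tail than the median, and both sit earlier than the peak, so none of the three single numbers describes the distribution on its own.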
 
The 95% intervals omit 5% of the probability area under the curve of the pdf.  Generally we accept the 1 in 20 chance the date is outside the 95% range, or we can use the 99.7% range.
 
I think the way the multi modal pdfs are handled in OxCal is by accumulating probability from a horizontal line across the highest point of the pdf, gathering area value as it approaches the X axis, until the required probability is reached, 68.2%, 95.4% or 99.7%.  This is the HPD range.
On a Normal pdf this implies the cumulative probability (HPD) is amassed from the centre outwards, whereas the mathematical approach, when there is a defined function, is to integrate the pdf across the range. 
OxCal Start probability is not increasing left to right, but from the median outward in both directions.  Below it can be seen that a 50% mathematical cumulative probability is the blue area left of the median.  However, the second diagram (apologies for the roughness) also shows a 50% cumulative probability in the greyed area as the HPD is gathered.
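
The "horizontal line lowered from the top" description can be sketched in Python as follows. This follows the usual definition of a highest posterior density region and is not OxCal's own code; the bimodal density and the function names `hpd_mask` and `ranges_from_mask` are assumptions made for the example. Note that the region grows outward from the highest point(s) of the pdf, which need not coincide with the median.

```python
import numpy as np

def hpd_mask(density, level=0.954):
    """Boolean mask of the grid cells in the highest-density region."""
    p = np.asarray(density, dtype=float)
    p = p / p.sum()
    order = np.argsort(p)[::-1]                 # cells from most to least probable
    cum = np.cumsum(p[order])                   # probability gathered so far
    n_keep = np.searchsorted(cum, level) + 1    # smallest set reaching the level
    mask = np.zeros(p.size, dtype=bool)
    mask[order[:n_keep]] = True
    return mask

def ranges_from_mask(years, mask):
    """Collapse a mask into a list of (first_year, last_year) ranges."""
    ranges, start, prev = [], None, None
    for year, inside in zip(years, mask):
        if inside and start is None:
            start = int(year)                   # a new range opens here
        elif not inside and start is not None:
            ranges.append((start, prev))        # the range closed at the previous year
            start = None
        prev = int(year)
    if start is not None:
        ranges.append((start, prev))
    return ranges

# Hypothetical bimodal pdf with peaks at 1300 BC and 1200 BC.
years = np.arange(-1400, -1099, 1)
density = (0.6 * np.exp(-0.5 * ((years + 1300) / 15.0) ** 2)
           + 0.4 * np.exp(-0.5 * ((years + 1200) / 15.0) ** 2))

mask = hpd_mask(density, level=0.954)
ranges = ranges_from_mask(years, mask)
print(ranges)
```

For a bimodal pdf like this one the 95.4% region comes out as two separate ranges with the low-probability valley between the peaks left out, which an interval built by integrating from the left to fixed percentiles would not do.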
 
So I believe the Median is the key as long as we don't say it is the probability.  It is the point at which probability increases in both directions until we reach a probability satisfying our (previously agreed) cut off.  The range of dates thus encompassed is our uncertainty in where the Start date lies.
 
The tails may be long, but if they contribute to the HPD, they are part of the (agreed) probability, so I think Erik's little red bars are quite valid, they represent the range of uncertainty of the start date.  If the median is not centred on the red bar it implies the start pdf is skewed or unbalanced.  That will not matter as the range of uncertainty is defined by the red bar, not the median.
 
If the 68% probability range is viewed, it does not leap to the left of the cumulative probability curve, but stays roughly central, as does the 95% probability range, simply because the HPD probability is accumulated from top down, not left to right.
 
I think the green bar is the region of evidence that defines the end boundaries, so where the red bar(s) terminate internally, there can be confidence the phase is underway, but I don't think there are numbers that define it within the program except the Date() function.
 
best wishes
 
Ray

Erik

Feb 26, 2015, 10:13:11 PM
to ox...@googlegroups.com
Dan and Ray—

I have a few other options that might be useful after playing with gradients and transparencies.
Ray, you defended my red bars better than I could have. Dan, your points are well taken and I think, as usual, it leads back to the fine line of how much detail to exclude. The truer you are to the data the faster you lose your audience.
1. The first one (top in the figure) is the same as before, except with a gradient instead of solid bar. It adds a tiny bit of information that might discourage looking at the tails. You could add two bars, or like I mentioned, just one for the 68% range.
2. Next, I used the probability curves and the gradients, so you can read increasing probabilities as a higher curve or increasingly dark color. Using the curves has the advantage of representing 99% of the probability, but is a little less intuitive than bars. I would say that bars are good enough as long as the distributions have a fairly normal shape (which these do, like many phase boundaries). If they're really odd or bimodal, I would agree with Dan that they might be obscuring an important detail and the curve is necessary.
3. The third one just joins the curves together (green vertical bar indicates the median).
4. This last one might be the most interesting, as per Ray's idea on accumulating probabilities in response to Dan's concern about showing how it becomes increasingly likely that a phase has started (which remains unclear with the simple boundary curve and median). The curve is outlined in red, just like in Ray's figure. Note that I didn't actually calculate the cumulative probability as it builds from left to right and then right to left (or top to bottom? Ray, you lost me there). I would have no idea how to; I just nudged the figure over. In this case the curves aren't anything weird (bimodal or jagged), so they're probably pretty close. It would be very cool if OxCal (or R?) would display a cumulative probability as it increased/decreased over time.
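
For what it's worth, a curve of that kind can be sketched outside OxCal. The Python below is only an illustration under stated assumptions: the start and end boundary densities are invented normal shapes on a shared, evenly spaced grid, and `prob_underway` is a hypothetical helper name. It relies on the start always preceding the end in the model, in which case the probability that the phase is underway at year t equals P(start <= t) minus P(end <= t).

```python
import numpy as np

def prob_underway(start_density, end_density):
    """P(start <= t < end) at each grid year, from the two boundary densities.

    Assumes both densities share one evenly spaced grid of years and that
    the start boundary always precedes the end boundary in the model.
    """
    def cdf(density):
        p = np.asarray(density, dtype=float)
        return np.cumsum(p / p.sum())           # accumulate from earliest to latest
    # Guard against tiny negative values from discretisation.
    return np.clip(cdf(start_density) - cdf(end_density), 0.0, 1.0)

# Hypothetical boundaries: start near 1250 BC, end near 1050 BC, sd 30 years each.
years = np.arange(-1400, -899, 1)
start_density = np.exp(-0.5 * ((years + 1250) / 30.0) ** 2)
end_density = np.exp(-0.5 * ((years + 1050) / 30.0) ** 2)

underway = prob_underway(start_density, end_density)
certain = years[underway >= 0.95]               # years with at least 95% probability
print(certain.min(), certain.max())
```

Plotted against `years`, `underway` rises as the start boundary accumulates and falls as the end boundary accumulates, and the `certain` span is the stretch in the middle where the phase is very likely to be in progress.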

Anyway, I'd like to hear your opinions on how these communicate the idea without losing necessary detail.
The idea is to not recreate something that inspired Dan to start this thread!

Cheers
Erik
Contreras for OxCal forum 2.jpg

Julie Morin-Rivat

Mar 12, 2015, 12:07:04 PM
to ox...@googlegroups.com
Dear all,

is it correct to use R_Date instead of Simulate, please?

Thank you for your answer.

Best regards,

Julie