Replication Value project: separate thread FYI


Brian Nosek

Feb 16, 2012, 1:13:51 PM
to Open Science Framework
To avoid cluttering the inboxes of those of you who are not interested in the Replication Value project, a separate thread has been started for the more dynamic discussion (see below).  This is not intended to close the project in any way, so ask to be added to that thread if you want to stay in the loop (whether you decide to be a contributor or not).


---------- Forwarded message ----------
From: Brian Nosek <no...@virginia.edu>
Date: Thu, Feb 16, 2012 at 10:08 AM
Subject: Re: RV thread
To: Roger Giner-Sorolla <R.S.Gine...@kent.ac.uk>
Cc: Mark Brandt <mbra...@depaul.edu>, Jeffrey Spies <jsp...@gmail.com>, Daniel Lakens <D.La...@tue.nl>


Hi Roger --

I agree that the meta-analytic result will be subject to publication bias, but the present formula will be equally so.  I don't think RV can solve publication bias, so it can be agnostic to that.

I think the main disadvantage of the suggestion is your second point - added complexity of calculation.  However, I don't think RV is going to be useful for anything that has more than a trivial number of replications.  So, the meta-analytic result for actual RV use is likely to be based on 1 to 5 or so studies at most.  Any more than that, and people would focus on the meta-analysis and RV would be an add-on result.

That said, making computation trivial is appealing, so if there is an easier way to incorporate the idea, I am all for it.  [As one way to address it, in the paper we could describe a "poor man's" RV as one that just counts replications with no concern for effect size, sample size, reliability of the effects, etc.; and then a "formal" RV as something that requires a bit more work, like the formula below.]

I like your latter points.  It is possible that something like this could be mentioned briefly in this paper, but it is probably something to develop separately unless the objective of the manuscript is broadened substantially.

brian



On Thu, Feb 16, 2012 at 10:01 AM, Roger Giner-Sorolla <R.S.Gine...@kent.ac.uk> wrote:

Brian (et al.)

 

I was also thinking about how to treat the ambiguity where people use “replication” as a synonym for “successful replication”.

 

The problem with this formula is that the meta-analytic result will be subject to publication bias. It is also harder to calculate, because a whole meta-analysis has to be done, including scouring for unpublished studies (a process that is never 100% effective). It would make more sense under a publication regime we do not have ... yet ...

 

The original formula has the advantage of being more straightforward, and functionally equivalent in a world where non-replications don’t get published (I guess reversed replications do, but they are rare; only Glaser & Banaji comes to mind, and in that case boundary finding is the more appropriate research response). This is important given that there’s a lot of wiggle room in defining direct, let alone conceptual, replications.

 

It does make me think about defining different things we want to replicate:

 

Direct replication: of the whole experiment, IV and DV

Effect replication: of the IV’s effects on a DV, conceptualized in different ways

Conceptual replication: varying the IV and/or the DV

 

You know (and maybe agree) that some publishing trends in the field have gone too far toward finding clever or surprising effects, rather than establishing underlying theoretical principles and boundaries. (Yes, yes, we get it: making concepts available to the mind, via any of the senses, increases the likelihood of congruent behavior.) I want to make clear how replication is valuable at each level rather than just focusing on direct replication.

 

Direct replication is most important to the *integrity of our science*. (Is this effect cherry-picked or fraudulent?)

Effect replication is most important to the *application of our science*. (Will those eyes increase honesty in a variety of domains?)

Conceptual replication is most important to the *theory behind our science* as well as the application (what do the eyes represent exactly, conceptually?)

 

That said, it is easier to do that meta-analysis for the first two than for the third, and also easier to agree on what counts as a replication due to the likelihood of direct citation of the parent study.

 

Dr Roger Giner-Sorolla

Reader in Social Psychology

School of Psychology

University of Kent

Canterbury, Kent CT2 7NP

United Kingdom

tel. +44 (0)1227 823085, leave out 0 if calling from abroad

 

From: bno...@gmail.com [mailto:bno...@gmail.com] On Behalf Of Brian Nosek
Sent: 16 February 2012 17:32
To: Mark Brandt; Roger Giner-Sorolla; Jeffrey Spies; Daniel Lakens
Subject: RV thread

 

So far, the five of us are the ones who have weighed in on the RV manuscript - at least in the google doc itself.

 

I thought it might be useful to start a discussion thread that doesn't fill all the OSF folks' inboxes with detailed RV discussion (larger points could still be brought up there).

 

I am attaching an article that may be of some use, and below is the text of a comment that I just added to the paper that might deserve some rapid discussion rather than resolution through the doc comment thread:

 

 

--------------------------------------

I haven't gone through the rest of the manuscript yet, but I had a thought about the calculation of replication value yesterday.  This builds on my earlier point that RV should be a complement to meta-analysis, not an alternative, and that just counting the number of replications does not consider the reliability of the demonstrations and outcomes of the replications.  

 

So, how about something like:

 

RV = (times cited) / (p-value of the meta-analytic result, across however many replications there are)

 

[lower values indicate that replication is more important]

 

The computation might be adjusted, but the key point is that the p-value of the meta-analytic result provides a simple means of incorporating all the relevant concepts (effect size, sample size, number/size of replications) into an index of how likely it is that this effect is due to chance.

 

There may be a place for adjusting the magnitude based on whether the p-value comes from only a single study or more, but that adjustment would probably need a steep slope (really only sharpening RV for single-study demonstrations, because the meta-analytic p already incorporates some info on this).

 

----

 

If an adjustment by the actual number of replications would add information beyond the meta-analytic p-value, it could be of the form:


1 + ((1/replications)^2)

making

RV = [(times cited) / (p-value of meta-analysis of replications)] / [1 + ((1/replications)^2)]

The effect of the replications adjustment would be to halve the RV when the score is based on only one study and barely change the RV at all when the score is based on 6, 10, 20 studies.

To see the shape of the adjustment, graph y = 1 + (1/x)^2.

This adjustment has some redundancy with the p-value.  Its main value is perhaps sociological rather than statistical.  That is, any result, no matter how strongly demonstrated (i.e., low p-value), is more trustworthy if it has been reproduced, particularly by independent researchers.  So the statistic would incorporate statistical confidence (the p-value portion) and replication confidence (the replications portion).
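
A minimal sketch of this candidate computation in Python (a hypothetical helper with made-up numbers; per the note above, lower RV would mean replication is more important):

    def replication_value(times_cited, meta_p, k_studies):
        """Candidate formula: citations over the meta-analytic p-value, divided by
        an adjustment that roughly halves the RV for single-study effects."""
        adjustment = 1 + (1 / k_studies) ** 2   # 2 when k = 1, ~1 for many studies
        return (times_cited / meta_p) / adjustment

    # Hypothetical comparison: a heavily cited single-study effect vs. an
    # equally cited effect that already has several replications.
    print(replication_value(times_cited=300, meta_p=0.03, k_studies=1))
    print(replication_value(times_cited=300, meta_p=0.001, k_studies=4))
    # The single-study effect gets the (much) lower RV, flagging it as more
    # in need of replication.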


Frank Renkewitz

Feb 17, 2012, 12:06:07 AM
to Open Science Framework
Hi All,
I agree with the starting point of Brian’s idea. An index of
replication value should take into consideration what is already known
about an effect. An original finding that is based on 40 participants
is less informative and calls more strongly for a replication than a
finding that is based on 400 participants. Similarly, when three
(replication) studies found grossly inconsistent results a further
replication seems more useful than when all previous studies had
identical results. However, I wouldn’t agree that the knowledge about
an effect is best represented by its p-value. I would suggest using
the standard error of the combined effect size in a meta-analysis.
This presupposes that the main goal of replications is to obtain a
more precise estimate of the true effect size – and not to make sure
that a finding is not a false positive. Especially for the case of RV
this seems reasonable to me. Our measure of interest in an effect is
citation frequency. Thus, if a study is cited often we will run a
replication, no matter if the currently best estimate of the effect
size is close to zero or tremendously large. If nobody cares for an
effect we won’t replicate it even if it seems to be very large. All
that matters is how precisely we can estimate the effect size given
that there is a sufficiently large interest in the effect in the first
place.
One additional factor is missing from all the RV formulas that have been
suggested so far: the sample size of the replication (I assume that
replication value refers to a study replicating an effect and not to the
effect itself). A replication study with 400 participants is obviously more
informative than a replication with 40 participants. The present formulas
could be misunderstood as an invitation to conduct (and publish) small
replication studies as soon as a certain RV has been achieved. In
meta-analysis, the “information value” of a single study is reflected in
its weight in computing the average effect size. The weight is the inverse
variance of the study’s effect size and thus depends essentially on the
sample size of the study. Taken together, this leads me to the following
formula:

RV = citations * (variance of the meta-analytic effect size / variance of the replication study)

The conceptual formula is something like this:

RV = impact * (“knowledge” gained through a replication / “knowledge” about an effect)

(Numerator and denominator are exchanged in the conceptual formula because
the operational formula is actually based on inverse variances. Of course,
“knowledge” might, or should, be replaced by “reliability”.)
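
A minimal sketch of the operational formula in Python (the variance helper for Cohen's d is an illustrative approximation for two equal groups, not part of the proposal above):

    def replication_value(citations, var_meta, var_replication):
        """Impact (citations) times the ratio of the information a replication
        would add (1/var_replication) to the information already available
        (1/var_meta). Higher RV = the replication is more worthwhile."""
        return citations * (var_meta / var_replication)

    def var_d(d, n_total):
        """Approximate sampling variance of Cohen's d for two equal groups
        totalling n_total participants."""
        return 4 / n_total + d ** 2 / (2 * n_total)

    # Hypothetical example: a well-cited but imprecisely estimated effect and a
    # planned direct replication with 80 participants, assuming d around 0.4.
    print(replication_value(citations=150, var_meta=0.06, var_replication=var_d(0.4, 80)))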

I’m sure that I don’t see all implications, but my feeling is that
this solves several of the problems that have been discussed so far.
If replications find inconsistent results, the variance of the
meta-analytic effect size will decrease only slowly or not at all (at
least in a random-effects model; actually, the whole idea requires that a
random-effects model is used, otherwise it won't really work) and further
replications are indicated. Basically, we
don’t really know what is going on, so more information is needed. A
small study won’t increase our knowledge very much, so it has small
RV. An additional advantage is that a first replication with the same
sample size as the original study has an RV of citations times 1. This
means that it has no disadvantage compared to the original study –
which seems reasonable to me simply because both studies provide the
same amount of information on the effect. So I don’t think that an
additional adjustment for the number of replications is needed, but
I’m not sure about this (possibly, a first independent replication
should even get an advantage).
A potentially serious problem is that the meta-analytic result is
subject to publication bias. When only significant results get published,
even if the effect in question does not exist, the variance of the
meta-analytic effect size, and with it the RV score, might decrease quickly.
However, RV should actually increase in this situation. One potential
solution is to incorporate a measure of publication bias into the RV
index. There are several methods that aim to assess the degree of bias
in a collection of effect sizes. Such a measure of bias could be used
to adjust the RV score. Actually, I like this idea a lot. However, I
am afraid that none of these methods provides sufficiently reliable
results when the number of effect sizes is small. And such an
adjustment makes things far more complicated…

Frank





Eric-Jan Wagenmakers

Feb 17, 2012, 12:56:39 AM
to openscienc...@googlegroups.com
My two cents, after having thought about this some more. First, trying
to use p-values as a measure of evidential strength is just not going
to work. With high n, a p-value of, say, .04 is likely to actually
indicate support in favor of H0. I like the idea of the meta-analytic
standard error much better, at least as a reflection of how much
information has been collected.

Second, what you are trying to do here is to weight the evidence for
the effect that has already been collected, and combine it with a
measure of its importance or impact. Basically, you want to maximize
the expected utility of the replication. There are many possible
actions to take (i.e., experiments to replicate) and you want to pick
the best one to spend your limited resources on. As it happens, such
decision problems are solved by applying Bayesian stats. Perhaps you
can also do this with frequentist stats (ad-hoc solutions exist for
many things) but I don't see why you would. So instead of pondering
the problem without the benefit of a formal framework, I suggest that
this problem is in fact analogous to other, superficially different
problems that also involve the maximization of expected utility. This
analysis will automatically drive one to the right insights by using
probabilities (not those of the p-value kind).

For instance, the utility-maximization idea suggests that what matters
is not just the prior support for a theory and its impact measured in
some way or other (i.e., if low prior support and high impact, then
high replication value). Instead, one needs to consider the utility of
a failed replication and the utility of a successful replication. For
a new finding, both utilities may be high (for instance because both
would lead to a substantial increase in our knowledge). For an
older, more established finding, the utility of a successful
replication may be very low.

Anyway, I just wanted to say that the problem we have is one where we
have prior knowledge (i.e., earlier support for effects), many actions
(experiments to replicate), limited resources, and uncertain outcomes.
An optimal decision-making analysis would be very helpful.
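
To make that framing concrete, a toy sketch (all probabilities and utilities below are hypothetical illustrations, not part of the proposal):

    def expected_utility(p_success, utility_success, utility_failure):
        """Expected utility of running a given replication: weight the utility
        of each possible outcome by its (prior) probability."""
        return p_success * utility_success + (1 - p_success) * utility_failure

    # Candidate replications: (name, P(effect replicates), U(success), U(failure))
    candidates = [
        ("new, surprising finding", 0.5, 8.0, 9.0),  # either outcome is very informative
        ("established finding",     0.9, 1.0, 6.0),  # a success teaches us little
    ]
    best = max(candidates, key=lambda c: expected_utility(*c[1:]))
    print("Replicate first:", best[0])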

Cheers,
E.J.

--
********************************************
WinBUGS workshop in Amsterdam: http://bayescourse.socsci.uva.nl
Eric-Jan Wagenmakers
Department of Psychological Methods, room 2.16
University of Amsterdam
Weesperplein 4
1018 XA Amsterdam
The Netherlands

Web: www.ejwagenmakers.com
Email: EJ.Wage...@gmail.com
Phone: (+31) 20 525 6420

“Man follows only phantoms.”
Pierre-Simon Laplace, last words
********************************************

Heather Fuchs

Feb 17, 2012, 3:00:36 AM
to Open Science Framework
Greetings,
 
I agree that the standard error of the true effect should be preferred over p-values. However, I think the meta-analytic standard error will often be difficult to obtain in practice. There is often no meta-analysis available on an effect, especially if it is relatively new and there have been few replications (i.e., when the RV is likely to be higher). If there is already a meta-analysis available from which one can draw the SE, then the RV is likely to be lower anyway. I'm not sure it is reasonable to expect researchers interested in conducting a replication to first conduct a meta-analysis of the effect in order to obtain the SE for the RV formula. Thus, in most cases one would be using the SE from a single effect estimate. Still, I think this is certainly more informative than a p-value.
 
I do not agree with the notion that the sample size of the replication study should be included in the formula. I think the RV should refer to the effect and not the replication - i.e., it should answer the question "does this effect deserve a replication" rather than "is this replication valuable". The judgement of how valuable the replication itself is must take the entire study into consideration. I think this is especially true if we want to "sell" the RV to fields outside of psychology. I can think of many examples in which it may be difficult to obtain reasonably large sample sizes (populations that are difficult to access, extremely expensive procedures, etc.). In such cases the RV would be small although the replication itself is actually valuable. Thus, I think we need to be careful that we do not create an RV that attempts to be a substitute for "holistic" judgments of study value. Rather, we should focus the formula on the effect in question. We should seek to provide a standardized measure of "does this effect deserve replication" and leave judgements of "is this specific replication valuable" up to the peer review system.
 
Sincerely,
 
Heather

Brian Nosek

Feb 17, 2012, 12:20:42 PM
to openscienc...@googlegroups.com
Good discussion.  Some comments:

1) There are enough people (and perhaps more lurking who will contribute later) that we should leave the discussion on the main list.  Let's use the norm of having "Replication Value" in the subject line so that people who are not interested in this project can ignore the discussion easily.  (We can follow this norm for all projects to make it easy for people to screen out what they wish to ignore.)

2) I like Frank's improvement of using the meta-analytic standard error over my suggestion of the p-value.  It solves a few problems associated with RV fluctuating based on whether the initial data suggests a positive or a negative result (p-value use would be extremely sensitive to that).

3) Heather's concern about calculating the meta-analytic standard error came up in the smaller group discussion.  I don't think this is a problem, because RV is only really a useful concept when there are few studies in print about a phenomenon - i.e., no meta-analysis exists, conducting one is easy because there are only a few studies, and you, the researcher considering whether to replicate, would want to know what those studies found anyway.  So, the RV computation is just a small additional step (especially with an online calculator or scripts) for justifying to yourself or to the editor/reviewers that it is worth the time.

4) I also agree with Heather that information about the planned replication does not belong in the RV computation.  It is a reflection of the existing literature, not about the merits of the planned replication.  Evaluating the planned replication is very sensible to do, but it is a separate concept.  For example, an editor should be able to compute and post a list of RV's for papers from her journal and invite researchers to submit replications of the articles that cross a particular RV threshold.   

5) I also think that publication bias is a separate issue from RV.  In fact (as now appears a bit in the google doc draft), RV can help stimulate researchers to pull studies out of the file drawer.  If researchers have a file drawer of studies on effects that they learn have high RVs, then they'd be more inclined to dust them off and publish them.

6) The comments that EJ makes about utility maximization sound interesting, but I don't understand the implications for defining an RV metric.  What, for example, would be the computation using such an approach?  And, can the computation be easy enough that people would use it?

7)  There is no seventh comment.

Frank Renkewitz

Feb 17, 2012, 8:40:21 PM
to Open Science Framework
Some remarks on Heather's, E.J.'s and Brian's comments:
- There is a very easy way to drop the sample size of the replication
study (and its "information value") from the suggested formula. The
result is:

RV = citations * variance of the meta-analytic effect size

However, I strongly favor a version that includes the replication
study. One reason is that the draft of the paper suggests that RV
should be used by editors and reviewers to evaluate whether a
replication study should be published. I think this is a good idea.
However, the publication decision should obviously take into
consideration relevant characteristics of the replication study. The
most relevant characteristic definitely is its sample size. Nothing
(or hardly anything) can be learned from a study with N = 10. There
might be other relevant characteristics and nothing prevents reviewers
and editors from considering these characteristics additionally.
However, as long as we are talking about direct replications, sample
size actually is the ONLY relevant characteristic (everything else is
a feature of the original study and not of the replication study).
Additionally, my original formula can easily be used to answer the
question "does this effect deserve (or need) a replication". Just use
a reasonable assumption about the sample size of a replication study,
say N = 50, as a standard reference point. A high RV then means
something like "a replication with N = 50 would provide a relevant
amount of information about this (interesting - as indicated by the
citation frequency) effect" - so this effect needs a replication. A
low RV would mean that a replication with N = 50 does not provide a
relevant amount of information - so a replication of the effect is
less necessary. In any case, a replication with a larger sample size
is more informative and, thus, more valuable.
To use some standard of comparison is something that we have to do
anyway. Suppose we use the formula above (the one without the variance
of the replication study). What would then be a low RV? To answer this
question you have to decide, among other things, what counts as an
(unacceptably) large or (reasonably) small variance of the meta-analytic
effect. Of course, this could be done: we could agree that a
variance of, say, 0.1 is, in general, sufficiently small for the
standardized mean difference d (that would have to be done
specifically for each effect size metric). Thus, an effect with this
variance does not warrant a (further) replication unless there is
overwhelmingly large interest in this effect. - My suggestion simply
is to define this standard of comparison not in absolute but in
relative terms. It would be relative to the amount of information
gained by a (hypothetical) replication study with a "standard" sample
size (e.g., N = 50, N = 100 or whatever seems reasonable to you).
To be more specific: Suppose we use the original formula (the one that
includes the variance of the replication study) and define a standard
sample size of 50. Assume furthermore that the fraction in this
formula (variance of the meta-analytic effect size / variance of the
replication study) is 0.5. This simply means that a replication with N
= 50 will increase the available information about the effect by as
much as 50 percent. My interpretation would be that this effect
deserves a replication even if its impact is rather small. In
contrast, if the fraction is 0.01 a replication with N = 50 will
hardly increase our knowledge about the effect - we already know quite
a lot about the size of this effect. A further replication of this
effect is only warranted if the interest in this effect is really
large (and the replication would need to have a large sample size).

- I agree with Brian that computational effort and complexity is not
an issue here. RV won't be useful when there already is a large number
of replications. What seems even more important to me is that we have
to take into account the results of previous replications. Otherwise
we end up suggesting that a first replication diminishes the value of
a second replication even if it finds results that are completely
inconsistent with the original study. This is an idea I would
definitely not want to promote. So, in any case, we simply need the
effect sizes of previous replications. All that is needed in addition
are the sample sizes of the replications. The rest is easily computed
(at least with a suitable software package).

- I'm afraid that I don't completely understand E.J.'s comment. I
would argue that the formula already includes the utility of a
replication (actually, that is the basic idea behind it). In a way,
the formula defines the utility of a replication as the proportion of
the amount of information gained relative to the amount of information
already available. The "amount of information" is given by the
respective variances. So, what is conceptually missing to incorporate
utility? Of course, there might be (and certainly are) other ways to
define the utility of a replication. But this implies that the utility
is evaluated according to some different criterion. What is this
criterion?
I have two additional questions regarding this section:

"Instead, one needs to consider the utility of
a failed replication and the utility of a successful replication. For
a new finding, both utilities may be high (for instance because they
have both lead to a substantial increase in our knowledge). For an
older, more established finding, the utility of a successful
replication may be very low."

Does this imply that the utility of a replication should be evaluated
according to its result? If so, then this is a central point I don't
agree with. A replication increases our knowledge about an effect
independently of its result. It increases our knowledge because it
allows for a more precise estimate of the true effect size. Thus, it
is useful no matter what it finds. It is not useful if there already
is a very precise estimate of the true effect size. And it is not
useful if it cannot bring about a substantial increase in the
precision of the effect size estimate (because of its relatively small
sample size). To put this differently: Another successful replication
of a finding in the literature may be disappointing and boring, but it
is useful as long as there was any doubt about the effect in the first
place. If there was no doubt we shouldn't have run a replication.
My second question is what constitutes an "established finding"? My
answer would be that we already know a lot about the finding and can
estimate the effect size precisely. However, in this case a single
failed replication won't change our beliefs about the effect very
much. So why should a failed replication bring about a substantial
increase in our knowledge in this situation? - Isn't this one of the
problems with Bem's studies? We already know very precisely what the
true effect size is (because of an overwhelming amount of "everyday
evidence"). In this case it is 0. So no single study (and even no
series of nine studies) on precognition should change our beliefs
about this effect very much. It is irrelevant what the study finds
because it provides a negligible amount of evidence compared to the
amount of evidence that is already available. (My feeling is that I am
already reasoning along Bayesian lines of thought to a large
degree...)

Best, Frank

Brian Nosek

Feb 17, 2012, 9:52:14 PM
to openscienc...@googlegroups.com
Two brief comments on Frank's thoughts:

1) Convincing an editor is a special use case of RV, not *the* use case.  Incorporating the replication into the RV calculation would severely limit the possible uses of RV.  For the case of convincing an editor, an easy approach is for the researcher to compute RV twice - once without the replication and once with it.  The change in RV between them is the value added of that replication in particular.  

2) The present proposal of RV in the google document adds a third item to the equation below: the number of attempts to replicate.  The rationale is developed there as a "sociological" factor for the confidence in the finding, in addition to the statistical confidence provided by the SE.

brian

Denny Borsboom

Feb 18, 2012, 5:14:04 AM
to openscienc...@googlegroups.com
I think that what EJ is trying to say is that you are defining a metric that assigns different utilities to different courses of action (e.g., replicating effect a or effect b). In doing so, you are following a path that has been extensively studied in the disciplines of statistics, decision theory, economics, and measurement theory. Your problem isn't fundamentally different from the problem of a hospital (scientist) which has to choose whether to operate on a patient (replicate a study) on the basis of a set of observations. This problem is well studied. You have different courses of action which lead to different possible outcomes. Then you assign utilities to these possible outcomes. How you do this determines which courses of action you prefer to which other courses of action, and with respect to which sets of choice options you're indifferent (these define indifference curves). This defines a function over the space of action-outcome combinations, and this function is your replication value function.

Note also that any function in which you divide a number by another number requires strong scale properties. For instance, if you chose to use log(number of cites) instead of the number of cites itself, you could get a very different function. This is not always a problem, but it is important to keep arbitrary scaling choices from exerting an overly strong influence on your metric.

Hope this helps.

Best
Denny

Roger Giner-Sorolla

Feb 18, 2012, 5:38:28 AM
to Open Science Framework
What do people think about the following edit (in brackets) which I've
incorporated into the google doc? Feel free to vote it down, but I
think it would be useful to distinguish between the very simple use of
RV to identify studies that are disproportionately important relative to
their *published* replication attempts, and the more complex use of it to
identify hypotheses with more or less conclusive support, taking into
account unpublished work and potentially conflicting effect
directions. I guess I'm in the "non-meta-analytic RV" camp because I
see the two purposes as different, and the RV is more novel in that we
already have well-developed procedures for meta-analysis. If you have
a weak overall effect from a meta-analysis but the original paper is
widely cited, that is not such a scandal, because you don't know
whether those citations are selling it as a strong effect or
highlighting, even investigating, the controversy.

As mentioned before, we believe that replications of findings with a
high impact (as indicated by 100 or more citations) deserve to be
published, irrespective of the successful or unsuccessful outcome of
these replications, as long as the experiment is as close as possible
to the original study, and the experiment has enough statistical
power. As a consequence, the replication value says nothing about
whether an observation is a robust phenomenon or not. [To answer such
a question, a meta-analysis should be performed (e.g., Rosenthal &
DiMatteo, 2001). Such a meta-analysis would differ from the RV
calculation in that it would try to include both published and
unpublished findings, and take into consideration the direction,
sample size, and effect size of each replication result. Although
different from the RV, the meta-analytic result could help clarify its
implications. Imagine a much-studied effect with a low RV, which on
further examination by meta-analysis yields no consistent story; some
replication attempts find effects in one direction, some in another,
and the overall effect size is near zero. Under such circumstances,
although the RV tells us that further direct replication attempts are
low priority, the meta-analysis would tell us that a higher research
priority would be to focus on moderators or boundary conditions that
might explain the apparently messy results.]

Frank Renkewitz

Feb 18, 2012, 5:46:58 AM
to Open Science Framework
Hi Brian,

1) What would be the limitations if a hypothetical replication with a
standard sample size is included in the RV? A high RV would mean
something like "a single replication with N = 50 (or N = X) would
increase our knowledge about the effect substantially". I think that
does tell us something very specific about the existing literature
(namely, that the amount of available evidence is very limited and
inconclusive). An editor would easily be able to compute and post a
list of RV's for papers from her journal. Isn't this *the* use case of
RV?

2) With a formula that includes the SE of the effect (instead of its p-
value) a finding that is based on a single study with limited sample
size will already have a very large RV (even if its p-value is small).
So I'm not sure if a sociological factor would be necessary. Maybe it
is a good idea to encourage (at least) a first independent replication
in any case and, thus, to increase RV even further if the number of
studies is very low. But I would put more confidence in a study with
400 participants than in a study with 40 participants not only because
of statistical but also because of sociological reasons. So if we
include a sociological factor that is based solely on the number of
attempts we would penalize a large original study.

Frank








Daniel Lakens

Feb 19, 2012, 2:05:51 AM
to Open Science Framework
HI all, great discussion.

My thoughts are the same as those of Roger. The RV (at least as I thought
of it originally) is not about determining the strength of the effect, but
a tool that can solicit replications, which, when published, provide more
info about the strength of the effect. I agree that meta-analytic info is
great to have, and I see a lot of people would love to make that
information available as well. Two things:

1: The RV is arguably most important when no replications have been
done, but the paper is cited a lot of times (e.g., >300). In this
situation, there is no meta-analytic info to begin with.

2: A simple alternative would be that instead of incorporating
meta-analytic info in the RV, we simply ask people to report the
meta-analytic info in the published replications. One simple table, with
all required statistics. The second replication will simply copy-paste the
table, add a line, and the most recent publication always has the
up-to-date meta-analytic info. Doing a meta-analysis on direct
replications is very simple.

So instead of two formulas, the simple RV and the meta-analytic RV, I
would suggest using the simple RV and asking authors to report a
(small and simple) meta-analysis.
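
To show how small such a calculation is, a minimal inverse-variance (fixed-effect) sketch in Python; the effect sizes and variances are hypothetical:

    import math

    def fixed_effect_meta(effects, variances):
        """Combine direct replications and return the summary statistics such a
        table would contain (combined d, SE, 95% CI, z, two-sided p)."""
        w = [1 / v for v in variances]
        d = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
        se = math.sqrt(1 / sum(w))
        z = d / se
        p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal approximation
        return {"d": round(d, 3), "se": round(se, 3),
                "ci95": (round(d - 1.96 * se, 3), round(d + 1.96 * se, 3)),
                "z": round(z, 3), "p": round(p, 4)}

    # Hypothetical original study plus one direct replication: (d, variance of d)
    print(fixed_effect_meta(effects=[0.45, 0.20], variances=[0.08, 0.04]))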

Jeffrey Spies

Feb 19, 2012, 9:30:47 AM
to openscienc...@googlegroups.com
On Daniel's point 2: I think it is agreed that meta-analysis provides pertinent information about replication value. However, asking authors to report RV and then asking that they report a simple meta-analysis as well, suggests that the meta-analysis is necessary, and that the RV is lacking that information.  I think the thoughtful reader might wonder why we didn't combine the two.

Also, it has occurred to me that an "author-variance" (are replications coming from independent sources?) quantity in the equation could be useful.

J.

Daniel Lakens

Feb 19, 2012, 10:49:38 AM
to Open Science Framework
Hi,

Jeffrey, the reason we did not incorporate a meta-analysis in the RV is
that meta-analyses already exist, and that the Replication Value and a
meta-analysis are qualitatively different things: the first is an index of
the expected interest in a direct replication of a finding, the second is
a measure of the real size of the effect.
Adding a meta-analysis table also provides the easiest way to
incorporate your second point: you'd simply add a column with the
names of the researchers who performed the study. Another column would
be the number of participants in each study (addressing Frank's
comments above, and needed to compute the meta-analysis anyway). If I
understand Denny and Eric-Jan correctly, they are warning us that
we are reinventing the wheel.

I don't see how a single meta-analytic RV can ever A) be easier to
understand, and B) contain the same information as a table. If there
is a single value, how do I know how many studies it is based on?
How do I know the sample sizes of the studies the analysis is
performed on?

I think one table would be much more useful for the research
community. Note that if you perform a normal meta-analysis, you don't
report just 1 number - you provide a table with the effect size, SD,
95% CI, z and p values, etc. I really think one number will simply not
do. If you don't have much experience with a meta-analysis, take a
look at the 2 tables I report on page 5 of a recent JEP:LMC article I
wrote where I perform a small meta-analysis on 5 studies:
sites.google.com/site/lakens2/Lakens-JEP_LMC-PolarityCorrespondenceinMetaphorCongruencyEffects.pdf

Daniel

Brian Nosek

Feb 19, 2012, 12:30:20 PM
to openscienc...@googlegroups.com
Hi all --

Yes, very good discussion.  I think it is worth reviewing what the goal of RV is in order to assess all of the different arguments that have been raised.  My understanding is below; if there are other goals, then those should be spelled out, because that may suggest there are actually multiple projects being discussed.

Primary goal:  

Replication is important but not everything can be replicated.  How do you decide what to replicate?  RV should be an indicator indexing what effects are more important to replicate than others.

Assumptions:

1. RV is a statistic - a single numerical index that provides comparative information about what effects are more valuable to replicate than others

2. The statistic should only be as complicated to compute as is necessary to get valid comparative information about the replication value of different effects.  Likewise, it should not be less complicated than necessary to be valid.

3. All good statistics have the following properties: they are narrow in scope and general in application.  That means that they do not try to solve every issue, but they can be applied to every issue in their defined domain.  So, RV should be general in providing comparative information about what effects are more important to replicate than others, but keep the "dimensions" of importance constrained so that the scope stays narrow.

My perspective:

1. RV is not a competitor or replacement to meta-analysis (or any other statistic).  [I agree with EJ/Denny's comments about this being relevant to other decision processes, but I know of no existing statistic that pursues this goal in the way that RV is being conceptualized.  If there is one, I will be delighted to hear about it - then we can just use it rather than recreate it.]

2. The initial formulation of RV (a function of times cited and number of existing replications) fails the assumptions laid out above.  Most critically, the statistic is invalid for comparing RV across research applications.  It is particularly biased by variations in sample and effect sizes. For example, given the same number of times cited, an effect with two studies of 10,000 participants each would get a higher RV than one with three studies of 20 participants each.  That is incorrect, and guarantees that the RV would not have general application.  Its validity range would be very narrow: only other studies with the same kind of sample/effect characteristics (see the sketch below).

3. Based on the discussion so far, I think that there are three compelling ingredients for deciding the replication value of an effect - the impact of the effect, the reliability of the effect estimate, and the social confidence in the effect.  Operationalizations of these are times cited (impact), standard error of the meta-analytic result (reliability), and attempts (social confidence; I like Jeff's suggestion of incorporating something about author variability into this).  Note: the standard error is agnostic to the meta-analytic effect size; it is not reproducing meta-analysis in any way.

4. This formulation does not actually sacrifice much simplicity - to get a count of the number of replications, one has to do a literature review anyway.  The only additional requirement is collecting the statistical information from the gathered articles to calculate the SE. Further, RV's scope is narrow in that it is useful for effects with only a small number of published articles.  It will not be useful (and is not needed) when there is a large literature already.

5. There are likely improvements to make to #3, both in conceptualization and operationalization.

6. The focus of the paper should be on making a compelling reasoned and empirical demonstration of the validity of RV.  Recommendations about how to use it, when to use it, and range of application are important but secondary.  Also, it should stay narrowly focused on RV in particular and not stray into recommendations about how to report other statistics, when to do meta-analysis, etc.  A singular focus on RV alone will maximize the impact of the contribution and situate its relationship to all of the other issues that are related but are not going to be solved with this single statistic.
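
A sketch of the contrast in point 2 (the rough var(d) ~ 4/N approximation and all numbers are illustrative assumptions; the social-confidence term is omitted for brevity):

    import math

    def naive_rv(citations, k_replications):
        """Count-based RV from the initial formulation: impact over number of studies."""
        return citations / k_replications

    def se_based_rv(citations, variances):
        """RV along the lines of point 3: impact scaled by the imprecision
        (meta-analytic standard error) of the current effect estimate."""
        se_meta = math.sqrt(1 / sum(1 / v for v in variances))  # fixed-effect SE
        return citations * se_meta

    big_n   = [4 / 10000] * 2   # two studies with ~10,000 participants each
    small_n = [4 / 20] * 3      # three studies with ~20 participants each
    print(naive_rv(100, 2), naive_rv(100, 3))
    # The large-sample literature gets the higher count-based RV (backwards).
    print(se_based_rv(100, big_n), se_based_rv(100, small_n))
    # The small-sample literature now correctly gets the higher RV.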


Best,
Brian

Jeffrey Spies

Feb 19, 2012, 2:10:35 PM
to openscienc...@googlegroups.com
Perhaps there are other definitions of replication that are confounding the debate, or, as Brian inquires about, other goals.  But if we are on the same page with Brian's assessment of the goals so far, then I have to agree with him 100%.

An RV should include impact, reliability, and social confidence information.  Back to my point earlier, if we're recommending a meta-analysis (in table form as Daniel suggests) to properly interpret our RV, then we should take the next step and try to quantify that information in order to aid interpretation.  Is it more work for us?  Certainly, but I think meta-analysis gets us very close to that goal (and it's been enjoyable thus far).  If we are proposing a single number that represents how important a study is to replicate, then that should be our burden.  And if we do so successfully, then, for matters of replication, this number is guaranteed to be easier to understand than a number interpreted with an accompanying meta-analysis table.

For this reason, I don't think that the argument that "meta-analysis already exists so if we use it we are reinventing the wheel" is a sound one.  I agree with Roger that meta-analysis tells us something different from our goal, so how is anything being reinvented?  If we can boil down meta-analytic information and combine it with citation information, then we are gaining something: a single value that conveys the importance of replication.  If something in that is being reinvented, then there is no reason to propose a replication value.

J.

Denny Borsboom

Feb 19, 2012, 3:58:41 PM
to openscienc...@googlegroups.com
Hi all

So from my point of view, you are looking for a function on a domain formed by a number of variables (times cited, se etc.) that respects your preference ordering with respect to replication value (which I take to be a preference on which effect you want to replicate more and which you want to replicate less). You now have a function (times cited/SE) that was chosen on a heuristic/pragmatic basis, and that may capture some of your preferences, but that fails to capture others (as indicated by Brian's point 2 under 'my perspective'). You could try out some other functions and see what works, but you could also look at this as a measurement problem and treat it systematically along the lines of measurement theory. In that case, the questions before you are (a) which functions over the domain of variables (times cited, SE, maybe others) best capture your preferences, and (b) how these functions are related. In the measurement literature, (a) is called a representation problem, and (b) a uniqueness problem.

To get at the representation issue, you need to find out what your preferences look like (after all, that's what you want to represent with your metric). This is not always straightforward. One way to get a feel for how your preferences are structured is to define a set of studies that you'd be indifferent about. So for instance, if you take combinations (100 cites, SE=1) and (1000 cites, SE=10) to be equivalent (meaning you wouldn't care which one was replicated), that indifference would indeed be represented well by the function you've got now (because both result in RV=100). But if you take combinations (100 cites, SE=1) and (10000 cites, SE=10) to be equivalent, you would want another function, e.g., you would take the (square root of the number of cites)/SE. You can do the same for orderings. Still another possibility is to define a utility for a set of combinations of cites and SEs. You could then fit a function (e.g., as in regression) and see what it looks like. This way, you would have some idea of what kind of function you're looking for. I personally think that this function is unlikely to be linear, because human beings don't evaluate numbers linearly; in this context, humans feel the difference between 10 and 100 cites to be much larger than that between 100000 and 100090 cites, and a linear function doesn't respect that. As a result, this information is typically better represented with a logarithmic scale.

After you've found a function that you're happy with qua representation, you would study its uniqueness properties. For instance, should you choose to use a fraction, as you do now, you're going to require strong scale properties. If one measured the reliability of the effect with the variance of the error around the meta-analytic effect instead of its standard deviation, one would not in general get the same results. For suppose you had identified (100 cites, SE=1) to be equivalent to (1000 cites, SE=10), and this is represented by giving both RV=100/1=1000/10=100. Then when rescaling the SE to a variance, and using the formula RV=cites/variance, you'd get 100/1 > 1000/100, so now these same cases are no longer equivalent. This isn't what you want, because now the RV depends on an arbitrary choice, namely whether you measure the reliability of the effect in SDs or as a variance, and that should be irrelevant (I guess). A solution to this kind of problem is to go back to the representation problem and to search for a function that isn't sensitive to this type of transformation.
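
The indifference example above, made concrete (a toy illustration only):

    # Two hypothetical effects judged equally worth replicating.
    a = {"cites": 100,  "se": 1.0}
    b = {"cites": 1000, "se": 10.0}

    # An RV defined on the SE scale treats them as equivalent ...
    print(a["cites"] / a["se"], b["cites"] / b["se"])            # 100.0 vs 100.0

    # ... but the same definition applied on the variance scale (SE**2) does not,
    # so the ordering would hinge on an arbitrary scaling choice.
    print(a["cites"] / a["se"] ** 2, b["cites"] / b["se"] ** 2)  # 100.0 vs 10.0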

If you think all of this stuff is cutting butter with a razor, and just want to be pragmatic about this, I understand that as well, but it's usually a good idea to consider this kind of thing even if you end up not using it. Also, it is good to be aware that there's a large field that deals with these questions. Most work on this topic is unfortunately pretty hard to read (for me at least) because it's horrifically formal. Reasonably accessible and psychologist-friendly introductions to this type of work are Joel Michell's book (http://www.amazon.com/Introduction-Logic-Psychological-Measurement/dp/0805805664) and the first few chapters of the book by David Hand (http://www.amazon.com/Measurement-Theory-Practice-Through-Quantification/dp/0470685670/ref=sr_1_3?s=books&ie=UTF8&qid=1329683158&sr=1-3). If you really want to get to the bottom of it all, you should consult the classic three books by Luce, Tversky, Suppes, & Krantz (http://www.amazon.com/Foundations-Measurement-Polynomial-Representations-Mathematics/dp/0486453146/ref=sr_1_2?s=books&ie=UTF8&qid=1329683214&sr=1-2). These are guaranteed to provide great insight, if you succeed in reading them (most people don't). One of the groundbreaking papers that started up this field, and is still a very impressive work, is online at http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCYQFjAA&url=http%3A%2F%2Fsuppescorpus.stanford.edu%2Ftechreports%2FIMSSS_45.pdf&ei=wGBBT_nqGMye-waa1OnnBQ&usg=AFQjCNGhuoj-08_PiUTncjqZg9rVd7W7EQ


Best
Denny
--
Denny Borsboom
Department of Psychology

University of Amsterdam
Weesperplein 4
1018 XA Amsterdam
The Netherlands
+31 20 525 6882
d.bor...@uva.nl
http://sites.google.com/site/borsboomdenny/dennyborsboom



Daniel Lakens

Feb 19, 2012, 4:49:30 PM
to Open Science Framework
I agree with Brian's post - and it makes things much clearer for me.
I'm all for making the RV as general as possible, if we can solve the
challenge. Two studies with 10000 subjects is indeed more reliable
than 3 studies with 30. If the SE can be used to control for that, we
should try to incorporate it (we might want to use the relative
standard error because a percentage would be easier to work with in a
general formula, but I'm not an expert). Also, can a power analysis be
a solution here? Determine the number of participants a study would
have needed to find the effect - maybe all three studies with 30 had
enough power - then they should not be seen as less important than 2
studies with 10000 participants.

The real challenge, I think, is to create a formula that strikes a balance between the number of replications (which is very important because it reduces the chance that a Type 1 error goes undetected) and effect sizes. In some fields extremely small effects are important; in others, really small n's are normal; in others (neuroscience), p values are extremely small.

We should also try to keep the number of arbitrary choices of values as low as possible, because each one will be a point of attack for critics. See for example the Simmons et al. (2011) paper, whose recommendation of a minimum number of participants per cell is not generally agreed with.

Roger Giner-Sorolla

unread,
Feb 20, 2012, 4:19:12 AM2/20/12
to Open Science Framework
Given the problem of metricizing the social factor, we should perhaps look into calculating separate "in-lab" and "out-of-lab" RVs. This would be particularly useful when a lab gets compromised by fraud - how many of the "findings" turn out to be supported outside?

So, by using SE we are not saying anything about the significance or
direction of replication attempts, only the N of published replication
attempts in relation to the variability of the measure? That sounds
reasonable. However, I warn you that SE is not always easily
available. For example, it's not common to see SE of B or beta
reported in regression. A quick statistical guide to extracting SE
from some of the more common reported statistics might be useful
(e.g., reverse engineering effect size and F to get SE from ANOVA
results).
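For the simplest case mentioned here - a two-group ANOVA, where F = t^2 - that kind of reverse engineering might look like the sketch below (an illustrative helper of my own, using the standard large-sample approximation for the SE of Cohen's d):

from math import sqrt

def d_and_se_from_F(F, n1, n2):
    # For a two-group design, F = t^2, so |d| = |t| * sqrt(1/n1 + 1/n2); the sign is not recoverable from F alone.
    d = sqrt(F * (n1 + n2) / (n1 * n2))
    # Standard large-sample approximation for the standard error of Cohen's d.
    se_d = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, se_d

print(d_and_se_from_F(F=9.0, n1=30, n2=30))   # d ~= 0.77, SE ~= 0.27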

Roger Giner-Sorolla

unread,
Feb 20, 2012, 4:21:16 AM2/20/12
to Open Science Framework
PS: Re Daniel's comments below, if I understand correctly, using the
SE as a weighting factor does *not* take effect size or direction into
consideration, whereas power analysis does

Russ Clay

unread,
Feb 20, 2012, 1:40:14 PM2/20/12
to openscienc...@googlegroups.com
Hi Everyone,

I've really enjoyed following this discussion. One thought that keeps
creeping into mind as I follow along is in regard to the implication of making RV a function of # of times cited. I agree that an important dimension of RV is the interest in the effect; however, it seems that an RV statistic
based on # of citations requires that a finding be appropriately 'aged'
before there is substantial value in replicating it. This may return to the
discussion about the general goals of the replication value statistic, but
if the goal is to motivate others to replicate a finding, it seems that
there could be a good deal of value in replication prior to a lot of
citations taking place. Particularly, if we are concerned with Type I
errors, a replication value that is a function of # of citations can also be
viewed as a statistic highlighting the potential degree of negative impact
that a Type I error has had on the field. Thus, if a replication attempt
with high RV (using an RV statistic that incorporates # of citations) has
sufficient power to detect an effect and the replication fails, we are all
of a sudden in a position where there would have been considerably more
value in the replication had it occurred earlier (in terms of the time and
effort spent by other scientists who have attempted to build off of the
findings of the original study).

Granted, I don't know that I can think of a good solution to this. Maybe
for 'new' effects, an appropriate measure of impact would be the impact
factor of the publication source as opposed to the number of citations? At
this point I just wondered if others also felt that this could be a 'blind
spot' in the RV statistic as stated.

Russ

Daniel Lakens

unread,
Feb 20, 2012, 3:15:44 PM2/20/12
to Open Science Framework
Hi Russ,

the time factor has been noted before. In the original version, the idea was that unreplicated work would deserve replication after 50 citations - that is not a lot, but also not a little ;) A comment was that some findings deserve replication earlier, due to, for example, 'expected interest' or perhaps huge media interest. But those situations are perhaps not common enough to try to incorporate. On the other hand, a good solution to the time problem would be nice. If you'd like access to the google doc, send me an email.

Daniel

Roger Giner-Sorolla

unread,
Mar 3, 2012, 7:03:35 PM3/3/12
to Open Science Framework
Of potential interest - a slightly different statistic being proposed
by Brian Knutson ...

http://edge.org/response-detail/1642/what-scientific-concept-would-improve-everybodys-cognitive-toolkit

Jamie DeCoster

unread,
Mar 3, 2012, 8:09:42 PM3/3/12
to openscienc...@googlegroups.com
One quick, minor point I'd want to bring up is that the statistics you're talking about seem to treat all studies equally. One of the things I'm trying to advocate is the importance of differentiating confirmatory from exploratory studies, since the two types of studies provide different evidence regarding whether an effect really exists - or, to put it in your terms, how likely it would be to replicate. This can be considered quantitatively by thinking about the number of "researcher degrees of freedom" that were involved in the derivation of the observed effect. Even if you don't want to consider this or other measures of study quality in your calculated statistic, I think it would be important to acknowledge the limitations of treating studies as equivalent on this aspect.
--
Jamie DeCoster
Center for the Advanced Study of Teaching and Learning
University of Virginia
350 Old Ivy Way, Suite 100
Charlottesville, VA 22903

"In this world, you must be oh so smart or oh so pleasant. Well, for years I was smart. I recommend pleasant." -- Elwood P. Dowd in "Harvey"

Daniel Lakens

unread,
Mar 4, 2012, 3:30:24 AM3/4/12
to openscienc...@googlegroups.com

Hi Jamie,

 

How do you operationalize the difference between these two types of studies? And how could these differences be incorporated in the replication value? You are right that the RV is now treating all studies as equal (although the attempt to incorporate the ES does try to give more weight to more reliable findings). I agree it is important to clarify some studies should be regarded as exploratory – but instead of defining this a-priori, a high RV value might be considered an indication that a finding is of interest, but awaiting confirmatory studies that replicate the effect. Or could this same point be made in a better way by making the differences between exploratory and confirmatory studies more central in the formula?

 

Daniel

Daniel Lakens

unread,
Mar 4, 2012, 3:44:13 AM3/4/12
to openscienc...@googlegroups.com
I've sent him an email to inform him about our project.

> -----Original message-----
> From: openscienc...@googlegroups.com
> [mailto:openscienc...@googlegroups.com] On behalf of Roger Giner-Sorolla
> Sent: Sunday, March 4, 2012, 1:04 AM
> To: Open Science Framework
> Subject: [OpenScienceFramework] Re: Replication Value project: separate thread FYI
>

Michael C. Frank

unread,
Mar 10, 2012, 5:15:48 PM3/10/12
to openscienc...@googlegroups.com

Brian Nosek

unread,
Mar 11, 2012, 2:00:38 PM3/11/12
to openscienc...@googlegroups.com
Yes, thanks for the heads-up.  They might be interested in joining in...

brian

Daniel Lakens

unread,
Mar 20, 2012, 2:45:17 PM3/20/12
to openscienc...@googlegroups.com

Hi everyone,

 

I've updated the intro of the google doc file based on some suggestions - commenting on it might be more useful at a later moment. What is now needed is that we reach agreement about how exactly we will calculate the Replication Value. The impact (citations) and attempts (replications) parts are clear - we need to decide on how to incorporate the reliability. After that is done, we can perhaps distribute some tasks (writing paragraphs, running simulations, etc.) and try to get a first draft of the manuscript ready - the issue of replicability seems to be more timely than ever.

 

I have read up on my stats, and it occurred to me that with reliability we are mainly interested in the probability that an observed finding can be expected to replicate. I hope you agree. This value is known as p-rep. Calculating the probability that a finding will replicate has been a debated question ever since Psychological Science started asking researchers to use p-rep instead of p-values (for several articles by the people who matter in this debate, see http://psycnet.apa.org/journals/met/15/2/ ). At least intuitively, it makes a lot of sense to include the probability that a finding will replicate in the Replication Value ;)

 

There are several suggested ways to calculate the probability that a finding will replicate, so if this sounds interesting, we still have to choose a specific one. I think the proposal in the attached PDF file (Iverson, Wagenmakers, & Lee, 2010, A model-averaging approach to replication), using Bayesian model averaging, seems like a good way to calculate a p-rep to include in the RV. I hope EJ Wagenmakers can tell us whether that indeed makes sense, and if this could be implemented in a way that is easy for the general scientific community, or whether another version (the p-rep by Killeen, even though it has its problems) might be better.
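For orientation only, here is a minimal sketch of Killeen's simple p-rep (the version noted above as having its problems), computed from a reported two-sided p-value - not the Bayesian model-averaging version of Iverson, Wagenmakers, and Lee, which is considerably more involved:

from math import sqrt
from scipy.stats import norm

def p_rep(p_two_sided):
    # z corresponding to the observed two-sided p-value ...
    z = norm.ppf(1 - p_two_sided / 2)
    # ... shrunk by sqrt(2) to reflect sampling error in both the original study and the replication.
    return norm.cdf(z / sqrt(2))

for p in (0.05, 0.01, 0.001):
    print(p, round(p_rep(p), 3))   # .05 -> ~.917, .01 -> ~.966, .001 -> ~.990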

 

We can still try to use the SE (or relative SE), but I don't know how. If you are a fan of that suggestion, please detail how it should be calculated (not only for the first, but also for the second replication). I now feel it would be heading too much toward a meta-analysis, especially after the first replication, but I might not have thought enough about it.

 

In addition to deciding upon the Reliability factor, we need to answer the following question:

 

We all agree that the number of replications decreases the Replication Value. We want to incorporate a correction for the reliability of the initial finding. How do these 2 factors relate to each other? How much reliability can compensate for a single replication? The idea was that one big study is more reliable than 2 or 3 smaller studies, but statisticians do not agree, because random error in any single study is always possible, so 3 small studies control better for such random error than 1 big study. Also, to be really reliable, studies need to be really big (n = 153,669, according to Hunter, 2001, The desperate need for replications), so the difference between 1000 or 20 participants will not make a huge difference - according to Hunter, a study with 1000 participants just needs to be replicated less often than a study with 20. If we use p-rep (a proportion from 0 to 1; we should use 1/p-rep) or the RSE (also a percentage), how much should the RV increase with how much loss in reliability? I personally think more replications are more important than higher reliability. Playing around with the simulations Marco provided might be useful here.

 

Looking forward to your input!

 

Daniel

Daniel Lakens

unread,
Aug 21, 2012, 12:52:04 PM8/21/12
to openscienc...@googlegroups.com
Hi everyone,

for a while, I didn't know how to solve our problem of creating an RV that would 'weight' replications in a way that made sense, while still having an easy-to-use formula. Now I think I might have a solution, and I would love your feedback on this idea. 

The new idea is: RV = citations / ((∑ power of replications)² + 1)

Here's the rationale for using the power of each replication as the weighting factor, instead of the SE. Your expertise is greatly appreciated. I think using the power makes more conceptual sense, and might be slightly easier to calculate than the SE because the information is on average easier to retrieve from publications (although still not always available).

The number of citations is divided by the sum of the power of all replications, squared, plus one. The power of a study is a value between 0 and 1, so the higher the summed power (i.e., the more high-powered replications have been performed), the lower the replication value. Importantly, it does not matter whether the attempts were successful or not. The denominator is incremented by one to avoid undefined RVs. Whether squaring the summed power is the right transformation is open for discussion: the intention is that the first few replications provide the most important information about the robustness of an effect, and we could consider alternatives (such as a square root or the natural log) under which each additional replication attempt counts for less than the previous one. The denominator is based on the power of each replication, instead of simply on the number of replications, because not all replications are equal. It is essential that studies have enough statistical power to reveal a hypothesized effect, and it is still common for psychologists to run hugely underpowered experiments. Replications should have sufficient statistical power to observe the predicted effect (with a typical minimum of .80, but preferably power should be as high as is feasible).

Note that we are not weighting each study by the precision of the effect size (typically the inverse of the squared standard error, or the inverse-variance weight), as is the custom in meta-analyses. The reason is that we are not interested in the size of the effect, or the precision with which it can be estimated. The goal of the replication value is to provide an indication of the value of a replication based on the number of replications with sufficient power. At the same time, both power and the inverse-variance weight are to a large extent driven by sample size, so just as larger studies receive more weight in meta-analyses, larger studies will also receive more weight in determining the Replication Value. 
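A direct transcription of this proposal in code, with the unsquared denominator included as an alternative for comparison (the replication powers .80 and .92 are just hypothetical examples):

def replication_value(citations, replication_powers, square=True):
    # Each entry in replication_powers is the statistical power (between 0 and 1) of one replication attempt.
    total_power = sum(replication_powers)
    denominator = (total_power ** 2 if square else total_power) + 1   # +1 avoids division by zero
    return citations / denominator

print(replication_value(100, []))                          # no replications yet: 100.0
print(replication_value(100, [0.80, 0.92]))                # 100 / (1.72^2 + 1) ~= 25.3
print(replication_value(100, [0.80, 0.92], square=False))  # 100 / (1.72 + 1)   ~= 36.8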


Very interested in what you all think. It would be nice to get this idea of a RV into something we can start to use. If you think it makes sense, great - if not, any ideas how to improve it?

Mark Brandt

unread,
Aug 22, 2012, 3:34:29 AM8/22/12
to openscienc...@googlegroups.com
Hi All,

To add to this conversation, I used Dan's new equation to see what the RV would look like after 1, 10, 50, and 100 replication attempts; 1, 10, and 100 citations; and an average power of .2, .5, and .8 for each of the replication attempts. 

The results of this rough "simulation" are in the attached Excel file. 

The patterns suggest that the RV clearly distinguishes studies that have very few replication attempts (i.e., 1 attempt) at both high and low levels of power when they are cited about 100 times. The differentiation is much smaller at 1 and 10 citations. For studies with low levels of power (.2), there still seems to be some distinction for effects with 100 citations and 10 replications.

Feel free to play around with the file and see what other combinations of values look like.

My question about the equation: Is it better to sum the power from the replication attempts (a value that can range from 0 to infinity), or to look at the total power for detecting the effect across the replication attempts (a value that can range from 0 to 1)? 

-Mark
RV rough simulations.xlsx
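Since the attached spreadsheet is not reproduced here, a sketch along the following lines could regenerate the same grid, assuming Daniel's formula as written and taking the summed power to be the number of attempts times the average power:

def rv(citations, n_attempts, avg_power):
    total_power = n_attempts * avg_power   # summed power = attempts x average power
    return citations / (total_power ** 2 + 1)

print("attempts  avg power    RV@1 cite  RV@10 cites  RV@100 cites")
for n_attempts in (1, 10, 50, 100):
    for avg_power in (0.2, 0.5, 0.8):
        row = [rv(c, n_attempts, avg_power) for c in (1, 10, 100)]
        print(f"{n_attempts:>8}  {avg_power:>9}  " + "  ".join(f"{v:>11.3f}" for v in row))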

Daniël Lakens

unread,
Aug 22, 2012, 3:45:25 AM8/22/12
to openscienc...@googlegroups.com

To clarify, because I think I did not explain it well before: The power of each replication (which obviously lies between 0 and 1) is summed over replications. So if replication 1 has a power of .80 and replication 2 has a power of .92, the denominator will be (1.72 squared) + 1. This way the denominator increases with each replication, not by 1, but by the power of that replication. This will on average give more weight to larger studies, while at the same time being conceptually more related to replications, and less to meta-analyses.

Mark Brandt

unread,
Aug 23, 2012, 8:33:22 AM8/23/12
to openscienc...@googlegroups.com
It may be helpful to change the name of the replication value because this name may make people think the value is an indication of the value of an effect (i.e., the extent to which it replicates) rather than the potential value of a new replication attempt (so maybe the PVR?).

-Mark

Susann Fiedler

unread,
Sep 6, 2012, 6:03:56 AM9/6/12
to openscienc...@googlegroups.com, no...@virginia.edu

Maybe some of you here are interested in a paper by Clintin P. Davis-Stober and Jason Dana "A New Measure of Replicability"

Abstract
We present a new measure of replicability called v. Unlike prior approaches, v casts replicability in terms of the accuracy of estimation using common statistical tools like ANOVA and multiple regression. For some sample and effect sizes not uncommon to experimental psychology, v suggests that these methods produce findings that are, on average, farther from the truth than an uninformed guess. Such findings cannot be expected to replicate. While v is not a function of the p-value, it can be calculated from the same information used when reporting significance tests and effect sizes.

Key Words: Replicability, statistical power, improper linear models.


DavisStoberDanaV.pdf

Daniel Lakens

unread,
Sep 10, 2012, 9:54:06 AM9/10/12
to openscienc...@googlegroups.com, no...@virginia.edu
Hi,

I've uploaded a new version of the replication value document to google docs. If you want to comment and don't have access, let me know and I'll add you.  I've attached an excel file with some examples of the new replication value for people to look at. 

Looking forward to your comments. 

Susann, the paper you posted about v-values is pretty cool. However, it is more a new and better version of the p-rep idea, and not an index of which published findings in the psychological literature deserve to be replicated. Nevertheless, we might want to wrap up this replication value paper, given that the topic is rather timely.
Replication Value v4.xlsx

Brian Nosek

unread,
Sep 10, 2012, 2:54:59 PM9/10/12
to openscienc...@googlegroups.com
A quick response on this (I will find time to elaborate soon):

The power idea seems a bit ad hoc.  It is unusual to add powers together and the squaring strikes me as non-obvious.  It could be that it performs rather well, but I am not yet seeing the value over a statistic that integrates very nicely with meta-analytic methods - i.e., the precision of the estimate.  I don't see that power estimations will be particularly easier to calculate across existing replications as compared to precision.  And, its lack of connection to meta-analytic techniques might reduce its likelihood of use.

Even so, I think it could be worth examining as a candidate formula.  The RV paper need not settle on a single calculation in advance.  Rather, the paper could evaluate a variety of candidate algorithms, compare their performance and even offer a menu of choices from easiest to hardest to use (presumably inversely related with their "quality") and with interpretation strengths and cautions of each.

Jeffrey Spies

unread,
Sep 10, 2012, 3:05:09 PM9/10/12
to openscienc...@googlegroups.com
I had a very similar response drafted to send:

The idea of summed powers as a weighted number of replications is interesting, but summing powers/probabilities doesn't mean much. That doesn't mean it's not worthwhile.  As such, we could propose multiple replication values and examine and compare their merits via simulation and discussion of qualitative meaning.

Jeff.

Marco Perugini

unread,
Sep 11, 2012, 2:53:22 AM9/11/12
to openscienc...@googlegroups.com
Hi,
I agree with Brian and Jeff. I would add that the concept of accuracy/precision (e.g., CI or SE around the parameter estimate) can reflect nicely a key component of the RV, that is, the (un)certainty of the to-be-replicated finding.
I was also thinking that it might be relevant to incorporate an additional piece of information, namely whether previous direct replication attempts have been performed by the same persons/lab or by an independent group. There can be a number of reasons why the same lab/group might replicate a certain result, including some kind of systematic bias, perhaps due to seemingly minor experimental details. In other sciences (e.g., physics) this is a key consideration: a finding is declared replicated only when independent groups can replicate it. 
One simple possibility could be to discount any direct replication from the same group/lab. I think that this choice can be defended but it would need to be made explicit and incorporated in the very definition of replication (e.g., a direct replication is considered as such only when performed by an independent group).
Another possibility is to use a weight such that, everything else being equal, the RV goes substantially up the smaller the proportion of independent to total direct replications. In this way also direct replications from the same lab/groups will count but much less than independent replications.
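One illustrative reading of this second possibility, with the down-weighting factor w chosen arbitrarily for the example (it is not something the group has agreed on), applied to the power-based formula from earlier in the thread:

def rv_with_independence_weight(citations, independent_powers, same_lab_powers, w=0.25):
    # Same-lab replications still count, but contribute only a fraction w of their power to the denominator.
    total_power = sum(independent_powers) + w * sum(same_lab_powers)
    return citations / (total_power ** 2 + 1)

# 100 citations, two replications with power .80 in each scenario:
print(rv_with_independence_weight(100, [0.8, 0.8], []))   # both independent         -> ~28.1
print(rv_with_independence_weight(100, [], [0.8, 0.8]))   # both from the same lab   -> ~86.2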

Marco



Daniel Lakens

unread,
Sep 11, 2012, 4:23:48 AM9/11/12
to openscienc...@googlegroups.com
Hi,

I think it makes sense to use only independent replications for the RV. I also see that many of you would like to see precision included in an RV. This was never my original idea - I just wanted to propose an index of published journal articles that, based on their impact (citations), had the highest value for psychological science to be replicated. The current RV function therefore does not simply sum the number of replications, but sums the power of these replications - what is important is not whether a replication was performed, but how likely each replication was to give us useful insights into the replicability of the original finding. The squaring is indeed, as Denny Borsboom noted earlier, open for debate - in the Excel file I therefore also included an RV without the squaring so you can see how that behaves. 

Several people have noted that it would be great if we could do something with the precision with which the effect size can be estimated. I'm open to such a suggestion, but I don't know how to create such an RV. I'm at my limit, and I'd be more than happy to pass the torch to anyone who will give that a try. At first glance, a fruitful approach seems to be to add our contribution (looking at the impact and existing replications) to existing measures of the replicability probability of individual findings, such as the v-value by Davis-Stober and Dana (posted earlier in this thread), Schimmack's Incredibility Index, or the meta-analytic SE. 

I simply don't think I know enough of these matters to do this. Furthermore, the final RV will become more subjective (how much more important is it to replicate a finding with a lower replication probability?). That is a challenge that can be solved, but I'm not the person to solve it. So how should we continue? Is anyone interested in figuring out how to include the precision of the estimate into the RV?

Daniel

Brian Nosek

unread,
Sep 11, 2012, 7:45:59 AM9/11/12
to openscienc...@googlegroups.com
One of the nice features of large-scale collaboration is that no one person needs to be an expert in all aspects of the contribution.  I think Daniel has made the leadership contribution to get the RV idea going.  The discussion of alternatives is a useful one that will not be settled exclusively by reasoning through it in advance.  So, I suggest the following:

(1) The initial paper introduces a set of candidate RV calculations.  Perhaps subteams of the project contributors will develop them, based on their expertise.  The paper will be something of a contest among them.

(2) We can internally debate and refine their initial conceptualization with background reasoning, evidence, and logic, but once we have the candidate RVs, the rest of the paper is a comparative evaluation of them -- with simulation and perhaps 10 case study applications.  

(3) It isn't likely that one will be the clear winner.  Instead, the conclusion might offer a few that differ on complexity of calculation and conceptual strength.  

Daniel Lakens

unread,
Feb 18, 2013, 10:59:22 AM2/18/13
to openscienc...@googlegroups.com, no...@virginia.edu
Hi everyone,

There is a new version of the Replication Value paper for people from the OSF to comment on. This version is the result of several rounds of comments from people on this list. It is almost impossible to write a paper that incorporates all the excellent comments people have made so far, but this draft is a manuscript that should clearly explain the usefulness of an RV that can be used immediately. It is still possible to suggest alternatives or improvements, but for now, this manuscript will likely contribute a very interesting idea. The major disappointment for some of you will be that this version of the RV is silent with respect to the reliability of an effect - as such, it is complementary to a meta-analysis, but it has in itself no meta-analytic properties. Trying to incorporate this was the major reason for the delay in wrapping up this paper, but despite a lot of discussion, there was no RV formula that addressed it adequately. If someone can improve on the current RV formula in the future, they are more than welcome to, but we should not let this delay sharing the RV idea. The RV idea is already being taken up (for example in the special issues which invite replications of 'important' results in psychology, but also in projects about teaching replication - see the post by Roger where he mentions the RV), so I think it is time to share what we have come up with so far. 

As mentioned before (https://groups.google.com/forum/?fromgroups=#!searchin/openscienceframework/authorship/openscienceframework/OdNHPDvrTXE/skoRgV7I_JsJ), norms for authorship on this project are based on a clear contribution in writing or revising the manuscript, or providing extensive comments for improvement and discussing the contents of the manuscript over several revisions. If you'd like to contribute by commenting on the final version, please let me know. You can download the paper here: https://dl.dropbox.com/u/133567/The%20Replication%20Value%20-%20Open%20Science%20Framework%20v2.docx You can download a spreadsheet to calculate the RV formulas here: https://dl.dropbox.com/u/133567/CalculateReplicationValue.xls

If you want to comment, please let me know. The goal is to make improvements, but preferably no major changes (unless we have to, obviously). It would be great if this paper could be submitted somewhere this March. 

Best,

Daniel




Daniel Lakens

unread,
Feb 21, 2013, 10:43:10 AM2/21/13
to openscienc...@googlegroups.com, no...@virginia.edu
Please note that there are currently enough people contributing to the RV manuscript. It is already difficult to incorporate all the comments. If you want to give your thoughts on the manuscript, please contact me.

Etienne LeBel

unread,
Feb 21, 2013, 10:54:44 AM2/21/13
to Daniel Lakens, openscienc...@googlegroups.com
Great work Daniel and colleagues! I think RV is a great new idea and will be very valuable moving forward. I wanted to briefly comment, however, on what I see might be a potentially important conceptual issue with the logic underlying RV (according to my understanding of it). RV aims to estimate the replication value of a finding by gauging the theoretical impact of a finding relative to the number of times it has been replicated (accounting for statistical power of published replications and original study). However, if one thinks about it, a finding cannot (and SHOULD NOT) have any impact **UNTIL** it has been independently confirmed by other researchers (who of course would execute a direct/close replication attempt of the original finding). This is so because a finding cannot be (theoretically) important if it isn't true, and to know if it's true, one needs to execute independent replications that are as methodologically close to the original study as is humanly possible.

Of course, one can counter by saying that the # of citations can be seen as an indicator of how much collective interest there is for the theoretical idea proposed in an **empirical** paper. But this is exactly the kind of fallacious reasoning that has gotten us into this mess. Novel theoretical contributions based on empirical findings must be independently verified **before** they can have a substantial impact on the field.

Cheers,
Etienne.







--
 *******************************************
 Etienne P. LeBel, Ph.D.
 Post-doctoral Fellow
 Department of Psychology
 Social Science Centre, RM 7312
 The University of Western Ontario
 London, Ontario, CANADA N6A 5C2
 