article/post recommendations re: meta-science and open science

Shauna Gordon-McKeon

Aug 4, 2015, 1:25:51 PM
to openscienc...@googlegroups.com
Hi all,

Some friends of mine and I are starting an informal journal club and we're looking for journal articles (or blog posts of similar length/style) to read. 

What do you all recommend?  What work has most influenced your opinions about how science does work/should work?

best
Shauna

a.

Aug 4, 2015, 1:42:29 PM
to Open Science Framework
"False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant" - Simmons, Nelson, & Simonsohn (2011)

http://neuron4.psych.ubc.ca/~schaller/528Readings/SimmonsNelsonSimonsohn2011.pdf

Kimberly Yang

Aug 4, 2015, 5:17:14 PM
to Open Science Framework
Here's a blog suggestion: GigaBlog, "Data driven blogging from the GigaScience Editors"
http://blogs.biomedcentral.com/gigablog/

Fred Hasselman

Aug 5, 2015, 5:04:02 AM
to openscienc...@googlegroups.com
Take Paul Meehl's Philosophical Psychology Course online: http://www.psych.umn.edu/meehlvideos.php

I think a lot of it is covered in these publications:

(1997). The problem is epistemology, not statistics: Replace significance tests by confidence intervals and quantify accuracy of risky numerical predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 393-425). Mahwah, NJ: Erlbaum.

(1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant using it. Psychological Inquiry, 1, 108-141, 173-180.

(1990). Corroboration and verisimilitude: Against Lakatos' "sheer leap of faith" (Working Paper, MCPS-90-01). Minneapolis: University of Minnesota, Center for Philosophy of Science.

(1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.

(1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.

(1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.


Best,
Fred

Disclaimer: Sent from my iPhone

Eric-Jan Wagenmakers

Aug 5, 2015, 5:34:11 AM
to openscienc...@googlegroups.com
In my opinion, Meehl was wrong. Do researchers actually care whether
an effect --when reliably detected in a very large N study-- is 0.1,
0.2, 0.3, or 0.4? No they don't, nor should they. One reason is that
effect size is context dependent: for instance, a list length effect
manipulation can be d=0.3 with a manipulation of 10 vs 20 items, and
d=0.6 with a manipulation of 10 vs 30 items. So the exact magnitude of
d is almost never very interesting, unless you consider a concrete
real-world application in order to make money.
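
To make the arithmetic concrete, here is a small numerical sketch. The 0.03-d-per-item slope and the unit within-condition SD are assumptions chosen only to reproduce the numbers above, not estimates from any real study:

```python
# Sketch: a fixed "true" effect of 0.03 d-units per additional list item yields
# different values of Cohen's d depending on how far apart the two conditions sit.
slope_per_item = 0.03   # assumed standardized change per additional list item
sd_within = 1.0         # assumed common within-condition SD

for short_list, long_list in [(10, 20), (10, 30)]:
    cohens_d = slope_per_item * (long_list - short_list) / sd_within
    print(f"{short_list} vs {long_list} items: d = {cohens_d:.1f}, "
          f"underlying slope = {slope_per_item} per item")
# 10 vs 20 items: d = 0.3, underlying slope = 0.03 per item
# 10 vs 30 items: d = 0.6, underlying slope = 0.03 per item
```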

Do researchers care whether the effect is 0 or not? Yes they do, and
they should. Can people look into the future? Are people more creative
in the presence of a big box? Do lonely people take hotter showers?
Does video game playing improve low-level perception? These questions
are relevant and cannot be addressed by computing effect size.

Bottom line: before you can estimate something, you need to make sure
that there is something to be estimated.

Cheers,
E.J.

********************************************
Eric-Jan Wagenmakers
Department of Psychological Methods, room 2.09
University of Amsterdam
Weesperplein 4
1018 XA Amsterdam
The Netherlands

Web: ejwagenmakers.com
Book: bayesmodels.com
Stats: jasp-stats.org
Email: EJ.Wage...@gmail.com
Phone: (+31) 20 525 6420

“Man follows only phantoms.”
Pierre-Simon Laplace, last words
********************************************

Gustav Nilsonne

Aug 5, 2015, 6:02:34 AM
to openscienc...@googlegroups.com
Dear E. J., dear all,

Certainly we care about the exact magnitude of the effect size. For any research that leads to statements about humans and the world, we need to know if the effects are important, not just if they are there. Do lonely people take hotter showers? Well if they do, then that is more interesting if the effect runs to several degrees C than if it's only a fraction of a degree. Is that context dependent? Then the size of the context effect is interesting.

Furthermore, when we want our research to have some kind of real-world impact, effect sizes are crucial. This is not limited to scenarios involving making money. For those of us who do clinical research, effect sizes are the main thing, both when evaluating diagnostic or therapeutic interventions and when trying to choose which basic research findings warrant translational research.

The final statement "before you can estimate something, you need to make sure that there is something to be estimated" strikes me as implying something of a false dichotomy between hypothesis testing and effect estimation. I cannot imagine a situation where the estimation of an effect size detracts from the scientific value of a statistical analysis. Can you?

Best wishes, Gustav


Gustav Nilsonne, MD, PhD
Researcher
+46 (0) 736-798 743

Stockholm University
Stress Research Institute
106 91 Stockholm

Karolinska Institutet
Department of Clinical Neuroscience
Nobels väg 9
171 77 Stockholm
gustav....@ki.se

Tal Yarkoni

Aug 5, 2015, 6:31:19 AM
to openscienc...@googlegroups.com

Researchers care about different things. Some people work on problems where it makes sense to care about the presence of any effect but zero; others work in domains where it's taken for granted that effects are never exactly zero, and what matters is how big an effect is. I don't see any point in arguing over which of these is the right view in an abstract sense; everything depends on the context. In the case of ESP, I would imagine we could all agree that *any* non-zero effect would be quite remarkable, so testing against zero is a reasonable thing to do. Conversely, if we were studying the effects of meat-eating on stomach cancer rates or personality traits on occupational outcomes, it's not clear that anyone would care about non-zero but tiny effects, so it's essential to know how strong an effect is. (And note that the fact that effect sizes are context-dependent is not really a reason to not care about them, since the sign of effects is also susceptible to contextual differences. A Bayes Factor of 6 in favor of model A vs. B could turn into a BF of 6 favoring B over A in a different population; surely we would not want to conclude that hypothesis testing is therefore pointless.)

I think there are potentially interesting debates to be had about (a) what proportion of researchers claim to care about estimation versus testing when casually surveyed (i.e., which naively-held view is more common), and (b) whether people's reported views tend to change with more careful deliberation (e.g., I suspect that many people who claim not to care about effect size are actually implicitly confusing true zero with "too small for me to care about"). But I would hope we can all agree that there are good reasons to care about both hypothesis testing and estimation, and that everything depends on the nature of one's question. And of course, as Gustav said, it's not as if testing and estimation are mutually exclusive approaches. One can happily practice both.

Eric-Jan Wagenmakers

Aug 5, 2015, 6:32:29 AM
to openscienc...@googlegroups.com
Dear Gustav,

> Certainly we care about the exact magnitude of the effect size.

In general, experimental psychologists don't. And for good reason.

> For any research that leads to statements about humans and the world, we need to know if the effects are important, not just if they are there.

The importance of an effect need not be connected to its size. If
people can only look into the future a little bit, if homeopathic
medicine is only effective a little bit, if video-game playing only
affects perception a little bit -- then there is something going on,
something that needs explanation, some causal mechanism that can
perhaps be brought out more clearly in follow-up work.

> Do lonely people take hotter showers? Well if they do, then that is more interesting if the effect runs to several degrees C than if it's only a fraction of a degree.

Really? Not to me. In other situations (preferred temperature for
washing hands) the effect size may be completely different.

> Is that context dependent? Then the size of the context effect is interesting.

To you, not to me. I care about the underlying causal mechanism, not
about the contextual details in which it manifests itself.

> Furthermore, when we want our research to have some kind of real-world impact, effect sizes are crucial. This is not limited to scenarios involving making money. For those of us who do clinical research, effect sizes are the main thing, both when evaluating diagnostic or therapeutic interventions and when trying to choose which basic research findings warrant translational research.

Sure, effect sizes are relevant for *any* real-world application, that
is, when utility comes into play. Utility can be money, or
effectiveness of treatment.

> The final statement "before you can estimate something, you need to make sure that there is something to be estimated" strikes me as implying something of a false dichotomy between hypothesis testing and effect estimation. I cannot imagine a situation where the estimation of an effect size detracts from the scientific value of a statistical analysis. Can you?

It implies a dichotomy, but not a false one. The question "is it
there?" logically precedes the question "given that it is there, how
big is it?"
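
To picture that ordering, here is a rough sketch in which a test comes first and estimation follows only if the data favour "it is there". The simulated data, the BIC approximation to the Bayes factor, and the threshold of 3 are all illustrative assumptions, not a recipe:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2015)
x = rng.normal(loc=0.5, scale=1.0, size=50)   # toy data; true mean assumed 0.5

def gaussian_bic(residual_ss, n, n_params):
    # BIC for a Gaussian model with the ML variance estimate residual_ss / n.
    sigma2 = residual_ss / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + n_params * np.log(n)

n = len(x)
bic_h0 = gaussian_bic(np.sum(x ** 2), n, n_params=1)                # H0: mu = 0
bic_h1 = gaussian_bic(np.sum((x - x.mean()) ** 2), n, n_params=2)   # H1: mu free

bf10 = np.exp((bic_h0 - bic_h1) / 2)   # rough BIC approximation to BF10
print(f"approximate BF10 = {bf10:.1f}")

# Only once the test question is settled does "how big is it?" become the question.
if bf10 > 3:
    ci = stats.t.interval(0.95, df=n - 1, loc=x.mean(), scale=stats.sem(x))
    print(f"mean = {x.mean():.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```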

Cheers,
E.J.

Eric-Jan Wagenmakers

Aug 5, 2015, 6:35:07 AM
to openscienc...@googlegroups.com
I like to disagree with Tal but in this case I agree ;-). Testing and
estimation both have their place, and they are not exclusive
approaches. I guess my problem with Meehl is that he only wanted
estimation, never testing; in response, I may have made it appear as
if estimation is never relevant. That is certainly not the case.

E.J.
********************************************
Eric-Jan Wagenmakers
Department of Psychological Methods, room 2.09
University of Amsterdam
Weesperplein 4
1018 XA Amsterdam
The Netherlands

Web: ejwagenmakers.com
Book: bayesmodels.com
Stats: jasp-stats.org
Email: EJ.Wage...@gmail.com
Phone: (+31) 20 525 6420

“Man follows only phantoms.”
Pierre-Simon Laplace, last words
********************************************


Gustav Nilsonne

Aug 5, 2015, 7:35:41 AM
to openscienc...@googlegroups.com
Thank you E. J. and Tal for these enlightening comments. For a moment I thought we might not have very different views after all, but now I think we do. This is an interesting exercise in introspection, since I apparently have strong beliefs which I had not fully dressed in words before. Here are a few elaborations, which I present partly to make this clearer for myself, as the arguments are probably familiar to you.

1. There are very few "nil effects". This is an old point made by Cohen. As you are well aware, it has more recently been debated in brain imaging, following a paper by Karl Friston. Friston talks about "the fallacy of classical inference", meaning that if you have enough data, significant effects are found in most of the brain. I think this is not a fallacy. Tal has argued persuasively that more data is better, and so has my colleague Michael Ingre. Crucially, the effect sizes tell us which parts of the brain are more likely to have a mechanistic relationship to behavior, and thus they guide further theorizing and experiments on causal relationships.

2. The importance of an effect depends directly on its size. A positive effect of homeopathic medicine has been demonstrated many times. It remains unbelievable because the effect sizes are not big enough to overcome our prior expectation that homeopathic drugs are ineffective. A big effect, such as the regrowth of amputated limbs due to homeopathic ointment, would warrant further investigation of causal mechanisms.

3. Causal relationships in biology are often described by means of diagrams with arrows, showing that A leads to B and so forth. Great advances have been made in many cases when the nature of the arrow has been more accurately described. Is the relationship linear or non-linear? What is its shape? Can we build a mathematical model to predict the behavior of the system? We need effect sizes. Context effects need to be modelled or held constant. In my opinion, context effects should not be used as an argument to reduce modelling to a qualitative exercise.

4. I like to agree with Tal, but this time I disagree. :-) Not everything depends on the nature of the question. I still can't think of a single instance in quantitative research where hypothesis testing would be preferable to estimation. In ESP, I would care a great deal about a believable positive hypothesis test result (though that is of course hard to imagine), but I would care a great deal more if the effect were also big enough to have real-world implications.

Perhaps we shall have to agree that we disagree. Anyway, I appreciate the opportunity to exchange views with you.

Best wishes, Gustav


Gustav Nilsonne, MD, PhD
Researcher
+46 (0) 736-798 743

Stockholm University
Stress Research Institute
106 91 Stockholm

Karolinska Institutet
Department of Clinical Neuroscience
Nobels väg 9
171 77 Stockholm
gustav....@ki.se

Eric-Jan Wagenmakers

Aug 5, 2015, 9:31:49 AM
to openscienc...@googlegroups.com
A short response:

> 1. There are very few "nil effects". This is an old point made by Cohen.

Yes, Cohen and Meehl agreed on this point. I think both of them were
wrong. Perhaps this is an argument that sometimes holds for
observational work; but in experimental work this certainly does not
hold. And even if you think it holds, it should be demonstrated. Is it
a given that lonely people shower longer than people who are not
lonely? Is it a given that you become more creative when you stand
outside a box? Is it a given that practicing violent video-games
improves low-level perception? I don't think so. On a more general
note, Nature features many thresholded processes.

Moreover, we use models not because they are a veridical reflection of
the true state of the world; we use them as abstractions. So all
models are incorrect -- H0 is incorrect but so is H1. What I care
about is the relative support that the data have to offer for H0
versus H1 (which are both "wrong" in an absolute sense).

> As you are well aware, it has more recently been debated in brain imaging, following a paper by Karl Friston. Friston talks about "the fallacy of classical inference", meaning that if you have enough data, significant effects are found in most of the brain. I think this is not a fallacy. Tal has argued persuasively that more data is better, and so has my colleague Michael Ingre. Crucially, the effect sizes tell us which parts of the brain are more likely to have a mechanistic relationship to behavior, and thus they guide further theorizing and experiments on causal relationships.

There may be a case that every voxel is involved in every cognitive
act. However, I think I agree with Friston and blame it on the
classical framework. Some neurons simply do not encode particular
information.

> 2. The importance of an effect depends directly on its size. A positive effect of homeopathic medicine has been demonstrated many times.

You can demonstrate a lot of things in the absence of preregistration
and a rational statistical analysis.

> It remains unbelievable because the effect sizes are not big enough to overcome our prior expectation that homeopathic drugs are ineffective. A big effect, such as the regrowth of amputated limbs due to homeopathic ointment, would warrant further investigation of causal mechanisms.

There are many effects that are small but nonetheless believable and
important. I believe the standard example is the effect of aspirin on
heart attacks. Also, in specific situations even small effects can be
important (for instance an advertisement campaign in a swing state at
a key moment). There are also large effects that are not believable
(for instance, the effects reported by Jens Foerster). Of course, in
general large effects are more believable, but that is because they
are detected more easily.

> 3. Causal relationships in biology are often described by means of diagrams with arrows, showing that A leads to B and so forth. Great advances have been made in many cases when the nature of the arrow has been more accurately described. Is the relationship linear or non-linear? What is its shape? Can we build a mathematical model to predict the behavior of the system? We need effect sizes. Context effects need to be modelled or held constant. In my opinion, context effects should not be used as an argument to reduce modelling to a qualitative exercise.

Yes, for a precise account of human behavior in a specific context we
need effect sizes. The current level of theorizing in psychology is
not advanced enough to allow this. And honestly, I doubt whether it
ever will mature to that point. I've been in mathematical psychology
for decades. This is where the quantitative modeling happens. And I
tell you, it does not critically depend on effect size.

> 4. I like to agree with Tal, but this time I disagree. :-) Not everything depends on the nature of the question. I still can't think of a single instance in quantitative research where hypothesis testing would be preferable to estimation. In ESP, I would care a great deal about a believable positive hypothesis test result (though that is of course hard to imagine), but I would care a great deal more if the effect were also big enough to have real-world implications.

But it is not about caring. It is about believing whether it is there.
For every estimated effect size you show me, I will ask "yes, nice
estimate, but is the effect even there?" And in your modeling, you in
fact assume that many factors are absent: the effect of training, the
effect of tiredness, sequential effects, structure in the time-series
residuals, time of day, day of the month, age, days since last
birthday, fullness of the moon the day before test, how many beers
were drunk two nights prior to the test... all these effects you
assume are absent when you do your statistical modeling. And you are
right to do so, for if you include parameters for all these effects
then your estimates will become highly variable and prediction will
suffer. If you desire a separate parameter for every effect you think
exists, and you think that any and all effects exist, then you will try
to account for a small data set with a trillion parameters. Good luck
with that.
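
That trade-off is easy to simulate. A toy sketch, with every number invented: one predictor with a real effect, forty whose effects are truly zero, and sixty observations to fit on:

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, n_test, n_noise = 60, 1000, 40

def simulate(n):
    x_real = rng.normal(size=(n, 1))              # the one effect that exists
    x_noise = rng.normal(size=(n, n_noise))       # "effects" that are truly absent
    y = 0.5 * x_real[:, 0] + rng.normal(size=n)   # data generated from x_real only
    return x_real, x_noise, y

xr_tr, xn_tr, y_tr = simulate(n_train)
xr_te, xn_te, y_te = simulate(n_test)

def out_of_sample_mse(X_tr, X_te):
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])   # add an intercept
    A_te = np.column_stack([np.ones(len(X_te)), X_te])
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)   # ordinary least squares
    return np.mean((y_te - A_te @ beta) ** 2)

print("1 relevant predictor        :", round(out_of_sample_mse(xr_tr, xr_te), 2))
print("41 predictors (1 + 40 noise):",
      round(out_of_sample_mse(np.hstack([xr_tr, xn_tr]),
                              np.hstack([xr_te, xn_te])), 2))
# In typical runs the model with a parameter for every conceivable effect
# predicts new data noticeably worse, even though it nests the true model.
```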

> Perhaps we shall have to agree that we disagree. Anyway, I appreciate the opportunity to exchange views with you.

It is always fun to discuss statistics! :-)

Cheers,
E.J.

Ruben Arslan

Aug 5, 2015, 1:28:40 PM
to openscienc...@googlegroups.com
It's very interesting to follow this discussion, thanks for having it in public :-)

One point that you don't often hear and that I was oblivious to for too long was this:

> One reason is that effect size is context dependent: for instance, a list length effect manipulation can be d=0.3
> with a manipulation of 10 vs 20 items, and d=0.6 with a manipulation of 10 vs 30 items.

I think the ensuing discussion has already illuminated some nice points, but since others might share my confusion:
This is not true for effect sizes in general, but only for (badly) standardised effect sizes such as d and r. I think this blog post
made me aware of the drawbacks of standardised effect sizes: 
http://janhove.github.io/design/2015/03/16/standardised-es-revisited/

He starts with an example very similar to EJ's, but uses it to advocate unstandardised effect sizes from an estimation perspective,
not to advocate a testing perspective. In EJ's example both experiments find an effect of 0.03 per additional list item. I know most
people probably know this, but I tend to forget it at times and I think I'm not alone.
Of course unstandardised effect sizes are a bit unpopular in psychology, but I think standardised ones have even become the default
in areas where they shouldn't be (e.g. correlations with age, even though age has perfectly interesting, meaningful
units) and in areas where you could definitely have a nice argument (I would bet a small sum that people are more likely
to properly understand a regression slope of hours of video game play on IQ points than a correlation between the two).
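
A quick sketch of why the raw slope travels across designs while the standardised effect size does not. The numbers (0.5 IQ points per weekly hour of play, residual SD of 10 IQ points) are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
hours = rng.uniform(0, 40, size=5000)                     # weekly hours of play
iq = 100 - 0.5 * hours + rng.normal(scale=10, size=5000)  # assumed toy model

def slope_and_r(x, y):
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # unstandardised slope
    r = np.corrcoef(x, y)[0, 1]                      # standardised effect size
    return slope, r

for label, keep in [("full 0-40 h design      ", hours <= 40),
                    ("restricted 0-10 h design", hours <= 10)]:
    slope, r = slope_and_r(hours[keep], iq[keep])
    print(f"{label}: slope = {slope:+.2f} IQ points/hour, r = {r:+.2f}")
# The slope stays near -0.5 in both designs; r shrinks when the sampled
# range of the predictor shrinks, although the mechanism is unchanged.
```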

I'd also like to point out that in my work, which is I think psychological, I have to rely on estimation because the evolutionary
genetic theory predicts small but nonzero effects. 
Both zero and large effects would be evidence against the theory.
I guess you could say I'm testing against the null and against the one, but in practice I might get others to agree with me
on the null as a counter-point but not on the one; they might reject the theory at 0.75. By displaying my estimate along with its
uncertainty, people can form their own opinion.

Best regards,

Ruben

--
Ruben C. Arslan

Georg August University Göttingen
Biological Personality Psychology
Georg Elias Müller Institute of Psychology
Goßlerstr. 14
37073 Göttingen
Germany

Fred Hasselman

Aug 5, 2015, 6:22:23 PM
to openscienc...@googlegroups.com

Hi!

> In my opinion, Meehl was wrong.

About everything? :)

> Do researchers actually care whether
> an effect --when reliably detected in a very large N study-- is 0.1,
> 0.2, 0.3, or 0.4? No they don't, nor should they.

Yes they should, unless you are talking about exploratory research.

> One reason is that
> effect size is context dependent: for instance, a list length effect
> manipulation can be d=0.3 with a manipulation of 10 vs 20 items, and
> d=0.6 with a manipulation of 10 vs 30 items. So the exact magnitude of
> d is almost never very interesting, unless you consider a concrete
> real-world application in order to make money.

But a theory that predicted it should be d=0.301 in context A and d=0.661 in context B should be considered more credible in the scientific sense than a theory that predicted it should be d=0.3 in context A and d=0.6 in context B, which should be a more credible theory than one that predicted it should be d>0 in context A as well as context B.

If this is not the core logic of scientific theory evaluation, anything goes, and we might as well start calling ourselves novelists instead of scientists.
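
One way to see why the riskier prediction earns more credit when it succeeds (a sketch with invented numbers; it simply averages the likelihood of an observed result over what each theory predicted):

```python
import numpy as np
from scipy import stats

# Suppose a study observes d_hat = 0.30 with a standard error of 0.05 (invented).
d_hat, se = 0.30, 0.05

def average_likelihood(predicted_ds):
    # Mean likelihood of the observed d_hat over the values a theory predicted.
    lik = stats.norm.pdf(d_hat, loc=np.asarray(predicted_ds, dtype=float), scale=se)
    return lik.mean()

vague_grid = np.linspace(0.001, 2.0, 2000)   # "d is positive, somewhere up to 2"
print("theory predicting d = 0.301:", round(average_likelihood([0.301]), 2))
print("theory predicting d = 0.300:", round(average_likelihood([0.300]), 2))
print("theory predicting d > 0    :", round(average_likelihood(vague_grid), 2))
# With se = 0.05 the two point predictions are practically indistinguishable here,
# but both receive far more support than the bare directional prediction,
# which spread its bet over the whole (0, 2] range.
```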

Measurement outcomes are registered in a measurement context, and it is this measurement context that is an essential part of the prediction (the design). If a theory cannot deal with measurement contextuality, it is merely a formal description of an empirical singularity; enter: independent replication ...

All in all these examples do not address Meehl's point: Restoring the epistemic link is very important.


Summarising:

1. What scientists themselves believe to be true or care about should be irrelevant for formal theory evaluation. That is, one should try to prove oneself wrong, and if one fails to do so, credibility is gained. Credibility cannot be gained through post hoc model/distribution fitting.

> Do researchers care whether the effect is 0 or not? Yes they do, and
> they should.

2. No they shouldn't. Researchers should care for the predictive power and empirical accuracy of scientific theories. 

3. If they only care for predictions ≠ 0 then this means their theories will, pardon my French, always be the crappiest models of reality their discipline of science can come up with: predicting the sign of a correlation.

> Are people more creative
> in the presence of a big box? Do lonely people take hotter showers?
> Does video game playing improve low-level perception? These questions
> are relevant and cannot be addressed by computing effect size.

I think they can be addressed by counting and comparing magnitudes of different counts.


> Bottom line: before you can estimate something, you need to make sure
> that there is something to be estimated.

No. If you posit a theory about reality in a formal calculus you can compute a measurement context that should yield measurement outcomes (predictive power). Then you realise the context and evaluate the empirical accuracy of the prediction.

If the accuracy is unacceptably low, then the theory is probably not a good model of reality. 

Back to the drawing board.

Best,
Fred

Eric-Jan Wagenmakers

Aug 5, 2015, 6:46:07 PM
to openscienc...@googlegroups.com

Hi Fred,

Our current quantitative theories do not address effect size predictively. I am not sure why they don't, but perhaps these models are a reflection of what researchers seek to understand about reality. And for the researchers I know, that reality is simply not captured in context-dependent effect sizes. It is often already difficult enough to demonstrate the mere presence of an effect.

EJ

Fred Hasselman

Aug 5, 2015, 7:31:45 PM
to openscienc...@googlegroups.com
I agree completely with the evaluation of the status of current theories, but Meehl was trying to pull "us" out of that status quo. Maybe he chose the wrong strategy (confrontational, meta-theoretical perspective).

I think there are many solid arguments that should be able to convince researchers to be more concerned about incorporating measurement context into claims about observing effects ... a priori.

One of them is the classical problem of establishing a rank order (of weights) of predictors/factors that holds across all contexts. E.g., many "effects" in psycholinguistic studies are under "experimental" control and can be shown to appear or disappear depending on the context created by "filler" stimuli.

I guess I am saying: I believe we should try to figure out if and how we can make theories produce the predictions about measurement outcomes they seem to imply.

Best,
Fred




Disclaimer: Sent from my iPhone

hardw...@gmail.com

Aug 6, 2015, 5:12:22 AM
to Open Science Framework
Hi Shauna,

There's a fantastic resource on the OSF where several researchers have shared course syllabi related to open science and research methods - you will find plenty of interesting papers in there:
https://osf.io/vkhbt/wiki/home/

A few of my personal favourite papers on this topic from the last few years are:
  • Nosek's "Scientific Utopia" papers:
  • Wagenmakers' "Agenda for Purely Confirmatory Research":
  • The Open Science Collaboration's "Maximising the reproducibility of your research" paper (for more practical advice):

And there is plenty of brilliant content and discussion to be found on the blogs of:

Best wishes,
Tom
____________________________________________________

Tom Hardwicke
PhD Candidate (Cognitive, Perceptual, and Brain Sciences)
Department of Experimental Psychology | University College London | 26 Bedford Way, London | WC1H 0AP

E-mail:     t.hardw...@ucl.ac.uk
Website:  http://www.tomhardwicke.co.uk/
Twitter:    @Tom_Hardwicke

Shauna Gordon-McKeon

Aug 10, 2015, 8:31:10 PM
to openscienc...@googlegroups.com
This sparked an unexpected debate!  Thanks for the recommendations, everyone. 
