We have skeleton information in the CFP and the current review form.
We can compile the discussion as it resolves into a page in the group.
Ian
As a starting point, I would like to quote my original message on
criteria for experimental results:
In IR, we have always been proud of our standards in experimentation.
However, looking back over the almost 30 years that I have been active in
this field, the only progress I can see (besides using larger document
collections) is the application of significance tests. In terms of
scientific methodology, most evaluations that we see (and accept for
SIGIR) suffer from various deficiencies:
1. Criteria: Most evaluations are system-oriented, only very few
involve real users. This point has been raised frequently before, but
we still do business as usual. I think we should reward papers
describing experiments with real users - especially with regard to the
much higher experimental effort.
2. Metrics: Most measures in IR lack a measurement-theoretic
foundation. For example, hardly anyone can explain what MAP actually
measures.
3. Validity: Does the experiment use representative samples? For
example, TREC queries are chosen such that they have a medium
generality. What about more general or very specific queries?
4. Reliability: We do not consider whether a paper is based on open
or on proprietary data. Earlier Gordon pointed out that e.g. KDD is
rather strict on this point ("Repeatability guideline: Repeatability
is a cornerstone of any scientific and engineering endeavor.."),
asking authors to make their data and software available to others.
SIGMOD follows a similar line. To be more specific: From my point of
view, if a paper uses proprietary data, then reliability of its
experimental results is flawed - nobody outside the authors'
institutions can verify the results or compare them with her/his own
methods.
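For reference on point 2, here is a minimal sketch of how MAP is conventionally computed: the mean, over topics, of average precision under binary relevance. The function names and the toy data are hypothetical, and the sketch only shows the arithmetic; it does not settle what the number means in measurement-theoretic terms.

```python
def average_precision(ranking, relevant):
    """Average precision for one topic under binary relevance.

    ranking:  list of document ids in ranked order
    relevant: set of document ids judged relevant for the topic
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # precision at the rank where this relevant document appears
            precision_sum += hits / rank
    # relevant documents that were never retrieved contribute zero
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision over all judged topics."""
    scores = [average_precision(runs.get(topic, []), rel)
              for topic, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy data, made up for illustration only.
qrels = {"t1": {"d1", "d3"}, "t2": {"b", "c"}}
runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["a", "b"]}
print(mean_average_precision(runs, qrels))
```

Note that each topic's score is normalised by the total number of relevant documents, so a relevant document that is never retrieved lowers the score just as a badly ranked one does.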
I think it is time that we lean back for a moment and think about the
scientific methodology we see in the papers that make it to SIGIR.
In scientific terms, given the limited experimental standards
employed, we should think about re-balancing our evaluation scheme.
Especially, what is the value of experimental results if they suffer
from the issues described above? How do we value a solid theoretical
foundation vs. 'good' performance figures?
As a more specific point, Ricardo and Hugo have pointed out that
people use very poor baselines - see
http://www.cs.mu.oz.au/~wew/papers/amwz09_cikm.pdf
http://barcelona.research.yahoo.net/dokuwiki/doku.php?id=baselines
So there are two major questions:
1. How can we improve the review criteria, so that authors get credit
only for solid experimental results?
2. How do we balance experimental results vs. other results (e.g.
theoretical ones) in reviews, i.e. what criteria should be formulated
to reward the non-experimental aspects of a paper?
Thanks, Norbert, for helping drive this along!
Ian
> people use very poor baselines - see
> http://www.cs.mu.oz.au/~wew/papers/amwz09_cikm.pdf
> http://barcelona.research.yahoo.net/dokuwiki/doku.php?id=baselines
Overall, I think we should be pushing both for validation
on public datasets and for validation on real datasets,
which may not be public. That was one of the major motivators
for the TREC Spam Track -- to have a methodology that
could work on both.
I have a paper at SIGIR 2009 that uses both public
data and Hotmail data. But I think there would still
be a result there, even without the public data.
In fact, I have never seen the Hotmail data. I formed
a hypothesis, wrote some software, and the software
was run in a sandbox on the Hotmail data. All I ever
got was summary results, but they were consistent with
my prediction.
I argue that my result was reproducible because any
ESP could have conducted it and, presumably, gotten
the same result. If they conducted it, and didn't get
the same result, that would be a publishable negative
result.
As it happens, the same results were achieved with
public data and crowdsourcing. I, personally, would
not have placed much credence in the result, without
the Hotmail experiment.
So let's be careful not to institutionalize overfitting
to available datasets that may or may not be representative.
[These are two issues -- overfitting and non-representativeness --
that will always be inherent in public datasets. It doesn't
render them useless, but they certainly shouldn't be blessed
as The Whole Truth and Nothing But the Truth.]
How do we balance studies with real users against the demands of
repeatability and validity?
Tests with real users are more expensive than batch system-oriented
experiments. As a result, these tests tend to be run on small sets of
users, often raising questions of the representativeness of the test
user population. These tests can be hard to reproduce, because many
of the details on the set-up of the tests are omitted in conference
publications.
I do believe that tests with users are an important component of
evaluation, but at this point I don't feel that we, as a field, are
good at this type of evaluation. Very few of us know how to set up
valid user tests and what to report in publications to increase
reproducibility. How do we go about educating ourselves about these
methods? How do we make sure our reviewers are able to provide
competent reviews when our evaluations include these methods?
> These tests can be hard to reproduce, because many
> of the details on the set-up of the tests are omitted in conference
> publications.
I think it essential, particularly if we are to continue
to do blind refereeing, that we have a repository where
test details and raw data go. Perhaps expanded proofs
and explanations, as well.
Then we absolutely should insist that enough details
be provided that (a) the analysis could be repeated;
(b) the analysis could be included in a systematic
review or meta-analysis; (c) the experiment itself
could be reproduced, given the necessary resources.
This is what they do in health sciences.
The 8-page limit is very constraining.
I think that the issues of baselines and publication of datasets are
closely linked to each other.
1. Concerning the baseline issue (as raised by the paper by
Armstrong, Moffat et al), we should have rigid reviewing criteria:
Authors should compare their results to the best results published for
this collection. In case no such results have been published yet, a
standard IR engine like e.g. Indri or Terrier should be run with half
a dozen 'good' configurations, and results be given. If authors fail
to do so, results will be regarded as meaningless.
2. Publication of datasets: SIGIR should take action to provide means
for publishing datasets. Besides the 'classical' format of test
collections, also the primary data of user experiments should be
published, so that others can analyze the results and compare with
their own experiments. As an example, a PhD student of mine
investigated performance predictors from user behavior on XML
documents (like reading time, link following etc.) and found
substantial differences to the results reported by others. Since we
don't have their data, it is impossible to find explanations for these
differences.
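As a rough illustration of the baseline comparison in point 1, combined with the significance testing mentioned earlier in the thread, the following sketch runs a paired randomization (sign-flip) test on per-topic scores of a baseline and a new system. It is one possible check, not a proposed SIGIR rule; the function name, the number of trials, and the scores are all hypothetical.

```python
import random


def paired_randomization_test(baseline, system, trials=10000, seed=0):
    """Two-sided paired randomization (sign-flip) test.

    baseline, system: per-topic scores (e.g. average precision),
    aligned so that position i refers to the same topic in both lists.
    Returns an estimated p-value for the observed mean difference.
    """
    if len(baseline) != len(system):
        raise ValueError("score lists must cover the same topics")
    diffs = [s - b for s, b in zip(system, baseline)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the two labels are exchangeable per topic,
        # so each difference keeps or flips its sign with probability 1/2.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            extreme += 1
    # add-one smoothing keeps the estimate strictly above zero
    return (extreme + 1) / (trials + 1)


# Made-up per-topic scores for ten topics, for illustration only.
baseline_ap = [0.21, 0.35, 0.10, 0.42, 0.28, 0.33, 0.19, 0.50, 0.27, 0.31]
system_ap = [b + 0.10 for b in baseline_ap]
print(paired_randomization_test(baseline_ap, system_ap))
```

A small p-value only says that the difference from this particular baseline is unlikely to be chance; it says nothing about whether the baseline was a strong one, which is the point raised by Armstrong, Moffat et al.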
Overall, I think that SIGIR needs some policy on usage of proprietary
datasets. I understand that my rigid point of view does not find a
majority at the moment, but we should find a way to judge the
repeatability of experiments on proprietary data: by a detailed
description of the data, by releasing the source code, by publishing
results for reference engines as in 1., etc. Again, these detailed
reports should go into a public repository.
Regards,
Norbert
I'm all for better methodology, but "rigid criteria" reminds
me of the infamous "zero tolerance" policies which actually
translate to "zero discretion" and have the opposite of the
intended effect.
The question is how to change the cultural norms so that
good methodology is expected, and that comparisons must
illustrate a useful contribution to knowledge.
Let me draw another analogy: smoking. Over the last
30 years, the culture has changed. There has been
legislation to be sure, but always supported by and
in support of a campaign to change social mores.
I don't believe there's any consensus -- even among senior
SIGIR people -- as to what good science is. I see that
trying to achieve -- and communicate -- such consensus
is a necessary step.
As to your specific example: At my university, if I
do a user study my ethics panel insists that I put
an end date on any collected data and keep it closely
controlled pending that end date. I could not possibly
release it -- only specific statistical summaries agreed
to at the time of ethics approval. "Archival collection"
is not in their vocabulary.
The reproducibility of results has, historically, been
determined partly by referees and partly by third parties
actually trying to reproduce results. For the study you
state, it sounds like the referees might have required
more disclosure of the methods used to create the
data. Then you should have the opportunity to write
a correspondence to the publication venue, describing
your best effort -- and failure -- to reproduce the results.
The original author could rebut, or provide more information,
or stand corrected.
This best system policy has a clear bias toward creating new tasks, vs.
focusing on basic research of existing tasks.
For example, for basic retrieval model research to get published, you
need to show that it is comparable to the most effective (and often
complex) methods used in TREC.
Maybe I misunderstood the point, but from my understanding, it's not
only unfair, but also unnecessary.
I think comparing against the most effective TREC results (or even
against learning-to-rank methods, which may be more effective) is the flavor of one
particular style of research, definitely not a guideline for all
research results.
My answer to the baseline question: http://blog.codalism.com/?p=1029
Le
So, instead of pretending that we make progress ('Improvements that
don't add up'), research should focus on two major areas:
1. Instead of *how*, research should focus on *why*. The first is the
engineering approach (which has been preferred by SIGIR reviewers in
recent years), whereas the latter is the scientific approach. In fact,
there are very few papers addressing this issue (and hardly any of
them make it through the reviewing process). Let me compare it to
mechanical engineering: until 1950 or so, it was mainly about *how*,
but then the method of finite elements was developed, which was able
to specify the conditions under which a part doesn't break. Language
models brought some progress to our field, but research during the
last 10 years again has been mainly on the engineering side. I think
we should challenge the researchers to focus on this scientific part
of our field: Even if you are not able to give us better results, then
give us better explanations for the behavior observed. Most papers
specify a new model (or mainly a variant of an existing one), but
hardly anyone actually tests if this model really is a better
description of reality. Instead, we only look at the results, and if
they are better than a (carefully chosen, poor) baseline, then this is
a good paper.
2. We should widen our focus - see my first posting on this issue. Of
course, it is easy to use some standard TREC collection and run
whatever method on it. But how do these results relate to the actual
retrieval experience we and all the users face everyday? There are so
many issues involved - why do we restrict research to the very core of
the IR engine, and ignore all other aspects (it's like testing
automobiles - would you believe in a test that only regards the
engine?). Now that we see that the emperor is naked, wouldn't it be
time to address the 'real' issues?
Overall, I really think that the SIGIR conference is at a crossroads:
either its focus changes, or it will become a sectarian conference
which will hardly be taken seriously by the rest of the CS community.
Regards,
Norbert
I think you've raised a fair point. What we really need are publications
that will/may have a general impact on other IR researchers, or other
communities.
To achieve that, the only way is to focus on core problems, and to focus
on the WHY and actively discuss potential general impacts of the work.
Conference reviewers can't tell people what to work on, so at least, we
should let people focus on the WHY and generalization parts.
So I still think comparing to the best TREC algorithm is not necessary,
although I must say, it is one way of doing really impactful work.
Cheers,
Le
This is SIGIR, experiments on test collections aren't going to go
away. I don't just say this from professional interest. Dwelling on
this point is useless with respect to our goal here -- to craft review
guidelines for SIGIR. We're not trying to steer the field, we're
trying to decide how we decide what to publish from the researchers
who are steering the field.
The point has been made by Alistair and others that because Cranfield
only supports comparative evaluation, baselines should be as much a
part of the test collection as topics and relevance judgments. It's a
methodological question as to what those baselines should be. We
don't have to solve it. We can just say that papers that conduct test
collection experiments need to include modern baselines that support
the claims in the paper. It's up to the paper authors to make a
compelling case that their comparison is supportive. And up to the
reviewer to decide if they buy that.
End of discussion, please?
Ian
Two rocks to inspire ripples:
1) list sets of characteristics we look for in an acceptable publication
2) list a set of criteria we look for in a good review
e.g.
1. Points out what needs to be done or written, that's NECESSARY for
acceptance.
(Pointing out something that would be nice to have DOES NOT count as a
good review. Given the 8 page limit, nobody can be perfect.)
2. Points out in which direction the work should go, as if the reviewer
were the author.
This applies to acceptance reviews as well.
(possibly suggest a way out, e.g. venues to publish it.)
Generally speaking, the reviewer should put themselves in the authors'
shoes.
By making these things explicit, we can examine them more carefully.
Le
I am not at all against comparative evaluations - as long as they show
real progress
- either wrt. performance improvements, which leads us back to the
baselines (how)
- or wrt. better explanations (why).
Norbert
On Apr 1, 11:13 pm, Ian Soboroff <isobor...@gmail.com> wrote:
> This discussion about baselines is getting needlessly esoteric.