Hi!
So far, in YodaQA development I use a TREC-based question dataset
split 1:1 into 430 training and 430 test questions:
https://github.com/brmson/dataset-factoid-curated
(by the way, I've split the dataset out into a separate repo to make it
hopefully more vendor-independent; I'd like everyone to use it for
benchmarking and cross-system comparison, and to contribute to it!)
However, I'm now looking at making a large+noisy variant of the
dataset (mainly in the hope of fighting some overfitting issues), and
I wonder whether this split is the best choice. For example, Jacana
uses a 1:10:1 dev-train-test split.
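For concreteness, this is roughly how I imagine generating such a
1:10:1 split reproducibly (a minimal Python sketch; the file names and
JSON format are just placeholders for illustration, not what the
curated repo or Jacana actually use):

  # Sketch: reproducible 1:10:1 dev/train/test split of a question list.
  # "questions.json" is an assumed input file holding a JSON array of
  # question records; adjust loading/saving to the real dataset format.
  import json
  import random

  SEED = 1234          # fixed seed so the split can be regenerated
  RATIOS = (1, 10, 1)  # dev : train : test

  with open("questions.json") as f:
      questions = json.load(f)

  random.Random(SEED).shuffle(questions)

  total = sum(RATIOS)
  n_dev = len(questions) * RATIOS[0] // total
  n_test = len(questions) * RATIOS[2] // total

  dev = questions[:n_dev]
  test = questions[n_dev:n_dev + n_test]
  train = questions[n_dev + n_test:]  # the remainder goes to train

  for name, split in (("dev", dev), ("train", train), ("test", test)):
      with open("curated-%s.json" % name, "w") as f:
          json.dump(split, f, indent=2)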
Does anyone know of a good analysis of the tradeoffs involved, or
some guidelines on splitting such datasets? I'm a little nervous about
shrinking my test set radically, fearing that measurements on it may
not be very representative.
Thanks,
Petr Baudis