Hi!
So far, in YodaQA development I use a TREC-based question dataset
split 1:1 into 430 training and 430 test questions:
https://github.com/brmson/dataset-factoid-curated
(by the way, I've split the dataset out into a separate repo to make it
hopefully more vendor-independent; I'd like everyone to use it for
benchmarking and cross-system comparison, and to contribute to it!)
However, I'm now looking at making a large+noisy variant of the
dataset (mainly in the hope of fighting some overfitting issues), and
I wonder whether this split is the best choice. For example, Jacana
uses a 1:10:1 dev-train-test split.
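For concreteness, this is roughly how I imagine generating such a
1:10:1 split reproducibly (a minimal Python sketch; the file names and
JSON format are just placeholders for illustration, not what the
curated repo or Jacana actually use):

  # Sketch: reproducible 1:10:1 dev/train/test split of a question list.
  # "questions.json" is an assumed input file holding a JSON array of
  # question records; adjust loading/saving to the real dataset format.
  import json
  import random

  SEED = 1234          # fixed seed so the split can be regenerated
  RATIOS = (1, 10, 1)  # dev : train : test

  with open("questions.json") as f:
      questions = json.load(f)

  random.Random(SEED).shuffle(questions)

  total = sum(RATIOS)
  n_dev = len(questions) * RATIOS[0] // total
  n_test = len(questions) * RATIOS[2] // total

  dev = questions[:n_dev]
  test = questions[n_dev:n_dev + n_test]
  train = questions[n_dev + n_test:]  # the remainder goes to train

  for name, split in (("dev", dev), ("train", train), ("test", test)):
      with open("curated-%s.json" % name, "w") as f:
          json.dump(split, f, indent=2)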
Does anyone know of a good analysis of the tradeoffs involved, or
some guidelines on splitting such datasets? I'm a little nervous about
shrinking my test set radically, fearing that measurements on it may
not be very representative.
Thanks,
Petr Baudis