That's what we thought too, initially, and it turns out that it doesn't work. Sometimes the problems are minor: the datasets contain errors, which is annoying. More difficult are the issues surrounding small grammars and small corpora, e.g. "child-directed speech" (CDS). One might naively think that CDS is just adult-English grammar with a smaller vocabulary, but there's a hint that, mathematically, that is just not true. It becomes particularly obvious when the training corpus is so small that you can run the calculations with pencil and paper: the probabilities really come out quite different.
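Just to give a flavor of the arithmetic (the sentences below are made up; only the shape of the calculation matters): on a corpus of a handful of utterances, the maximum-likelihood word-pair probabilities swing wildly with every added sentence, in a way they never would on a large corpus.

    from collections import Counter

    def pair_probs(sentences):
        """Maximum-likelihood probabilities of adjacent word pairs."""
        counts = Counter()
        for s in sentences:
            w = s.split()
            counts.update(zip(w, w[1:]))
        total = sum(counts.values())
        return {pair: c / total for pair, c in counts.items()}

    tiny = ["the dog ran", "the dog sat", "the cat sat"]
    print(pair_probs(tiny)[("the", "dog")])   # 2 of 6 pairs: 0.333...

    tiny.append("the cat ran")                # one more utterance...
    print(pair_probs(tiny)[("the", "dog")])   # 2 of 8 pairs: 0.25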
One soon trips over other problems, too. Some are at the front end: capitalization, punctuation, anything to do with tokenization (imagine irregular French verbs!). But there are also problems near the back end -- problems with multiple meanings, synonyms, etc. One sees many interesting things that are both correct and wrong at the same time: e.g. it turns out that "houses" and "cities" are quite similar: both can be entered, both provide shelter, both can be a home... and yet a house is not a city. I saw many unexpected but interesting groupings like this... reminiscent of corpus-linguistics results.
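Here's a toy illustration of why such "correct and wrong at the same time" groupings show up (the counts below are invented, and this is not the actual pipeline, just a sketch): if "house" and "city" share most of their observed contexts, any distributional-similarity score will call them near-neighbors.

    import math

    def cosine(u, v):
        """Cosine similarity of two sparse context-count vectors."""
        dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm

    # context -> co-occurrence count; the numbers are invented
    house = {"entered_the": 12, "provides_shelter": 7, "is_a_home": 9, "painted_the": 4}
    city  = {"entered_the": 10, "provides_shelter": 5, "is_a_home": 6, "mayor_of": 8}

    print(cosine(house, city))   # ~0.82: near-neighbors, yet a house is not a city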
Oh, and corpus problems: Wikipedia lacks "action verbs" -- run, jump, kick, punch -- since Wikipedia is describing things (X is a Y which Z when W). It also has vast amounts of foreign words, product names, geographical names, and obscure terminology: words which appear only once or twice, ever.
Project Gutenberg is mostly 19th- and early-20th-century English, which is very unlike modern English. We can read it and understand it, but there are many quite strange and unusual sentences in there, which no one would ever say today. Jane Austen's Pride & Prejudice is a good example.
It became clear that existing corpora and datasets are almost useless for evaluating quality. I want to be able to control for the size of the vocabulary, the size of the grammar, the density and distribution of the different parts-of-speech, and, most importantly, the distribution of different meanings ("I saw": the past tense of "to see" vs. "to saw", i.e. "to cut") ... and then evaluate the algos as each of these parameters is changed.
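Roughly, the kind of corpus generator I have in mind looks like the sketch below. Everything in it -- the grammar, the knobs, the function names -- is invented for illustration; it is not a description of any existing tooling.

    import random

    def make_vocab(n_nouns, n_verbs, ambiguous_fraction):
        """Build an artificial vocabulary.  A fraction of the verbs is flagged
        as 'ambiguous'; a fuller generator would give those words two distinct
        usage patterns -- here the flag is only returned so the caller knows
        which knob setting produced the corpus."""
        nouns = ["n%d" % i for i in range(n_nouns)]
        verbs = ["v%d" % i for i in range(n_verbs)]
        ambiguous = set(verbs[: int(n_verbs * ambiguous_fraction)])
        return nouns, verbs, ambiguous

    def generate_sentence(nouns, verbs, transitive_prob):
        """Sample from a deliberately tiny grammar: S -> NP V NP | NP V."""
        subj, verb = random.choice(nouns), random.choice(verbs)
        if random.random() < transitive_prob:   # knob: part-of-speech density
            return "%s %s %s" % (subj, verb, random.choice(nouns))
        return "%s %s" % (subj, verb)

    def generate_corpus(n_sentences, n_nouns=100, n_verbs=30,
                        ambiguous_fraction=0.1, transitive_prob=0.7):
        nouns, verbs, ambiguous = make_vocab(n_nouns, n_verbs, ambiguous_fraction)
        return ([generate_sentence(nouns, verbs, transitive_prob)
                 for _ in range(n_sentences)], ambiguous)

    if __name__ == "__main__":
        corpus, _ = generate_corpus(1000)
        print("\n".join(corpus[:5]))

The point is that every knob (vocabulary size, grammar size, transitivity, ambiguity) is an explicit parameter, so the learner can be re-run as each one is varied.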
The idea is that the learning system is like a transducer: grammar -> corpus -> grammar. So it's like a microphone, or a high-fidelity stereo system: we want to evaluate how well it works. Of course, you can "listen to it" and "see if it sounds good", but really, I'd rather measure and be precise, and for that we really need inputs that can be controlled, so we can see how the system responds as we "turn the knob".
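For example -- and this is just one possible measurement, not a settled design -- one could compare the word-pair statistics of the input corpus against those of a corpus re-generated from the learned grammar, and watch how the mismatch changes as each knob is turned:

    from collections import Counter
    import math

    def bigram_distribution(corpus):
        """Maximum-likelihood word-pair (bigram) probabilities from a list of
        sentences given as plain strings."""
        counts = Counter()
        for sentence in corpus:
            words = sentence.split()
            counts.update(zip(words, words[1:]))
        total = sum(counts.values())
        return {pair: c / total for pair, c in counts.items()}

    def kl_divergence(p, q, floor=1e-9):
        """KL(p || q); bigrams unseen in q get a small floor probability."""
        return sum(pv * math.log(pv / q.get(pair, floor))
                   for pair, pv in p.items())

    # The closer to zero, the more faithfully the learned grammar reproduces
    # the statistics of the grammar that generated its input, e.g.:
    #   score = kl_divergence(bigram_distribution(original_corpus),
    #                         bigram_distribution(regenerated_corpus))

Bigrams and KL divergence are just stand-ins here; any distance between the generating and the learned grammar would serve as the "fidelity" read-out.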
-- Linas