http://www.natcorp.ox.ac.uk/corpus/index.xml?ID=numbers

Ariel

Nov 8, 2011, 9:50:09 AM

to NELL: Never-Ending Language Learner

Hi, had you thought of working through this corpus? It comprises:

Text type                        Percent
Spoken demographic                 10.08
Spoken context-governed             7.07
All Spoken                         17.78
Written books and periodicals      72.75
Written-to-be-spoken                1.98
Written miscellaneous               8.09
All Written                        82.82

A good opportunity to process written books and periodicals?

bw

Ariel

Bryan Kisiel

Nov 10, 2011, 4:35:56 PM

to NELL: Never-Ending Language Learner

Hi Ariel,

Thank you for the suggestion. However, according to the website, the total
size of that corpus is a little over 5GB. The corpus that we're using
currently is maybe around 17,500GB. So I'm afraid we wouldn't get a whole
lot of benefit compared to the cost to purchase a copy and time to
reformat it for NELL to read.

On the topic of books, we were recently given a copy of most or all of the
books from the Internet Archive, which looks to be about 1000GB in size.
Actually, it's still sitting on disk waiting to be reorganized and
reformatted... But it will be interesting to see if that is big enough to
make a noticeable difference in NELL's learning.
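
For a rough sense of scale, the sizes quoted above can be compared directly. This is only a minimal sketch: the figures are the approximate ones from this message ("a little over 5GB", "maybe around 17,500GB", "about 1000GB"), and the names `relative_size`, `BNC_GB`, `CURRENT_CORPUS_GB`, and `INTERNET_ARCHIVE_GB` are made up for illustration.

```python
# Approximate sizes in GB as quoted in this thread; none are exact measurements.
BNC_GB = 5                     # "a little over 5GB"
CURRENT_CORPUS_GB = 17_500     # "maybe around 17,500GB"
INTERNET_ARCHIVE_GB = 1_000    # "about 1000GB"


def relative_size(candidate_gb: float, baseline_gb: float) -> float:
    """Return the candidate size as a percentage of the baseline size."""
    return 100.0 * candidate_gb / baseline_gb


if __name__ == "__main__":
    for name, size_gb in [
        ("Suggested corpus", BNC_GB),
        ("Internet Archive books", INTERNET_ARCHIVE_GB),
    ]:
        share = relative_size(size_gb, CURRENT_CORPUS_GB)
        print(f"{name}: about {share:.2f}% of the current corpus")
```

On those estimates the suggested corpus would add roughly 0.03% to the existing text, while the Internet Archive books would add somewhere around 5.7%.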

bki...@cs.cmu.edu
