Hi Stephen,
One way I can think of to begin compiling such a dataset is to use the existing test cases in OpenNARS or ONA. For example, here's nal1.0.nal:
```
'Revision ------
'Bird is a type of swimmer.
<bird --> swimmer>.
'Bird is probably not a type of swimmer.
<bird --> swimmer>. %0.10;0.60%
1
'Bird is very likely to be a type of swimmer.
''outputMustContain('<bird --> swimmer>. %0.87;0.91%')
```
Most test cases consist of some Narsese preceded by a comment line in English. It would be trivial to write a script that scrapes Narsese-English pairs from all the test cases and compiles a small-ish dataset this way. One could then post-process the resulting set to expand it with more synthetic data, for example with a simple search and replace that turns "Bird is a type of swimmer" into "Bird is a type of flyer", "Fish is a type of swimmer", etc.
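To make that concrete, here is a rough sketch of such a scraper in Python. It assumes the layout seen in the example above: a comment line starting with a single `'`, followed by a Narsese line. The function names, the `tests/` directory in the usage note, and the choice to also harvest the expected output from `''outputMustContain(...)` lines are my own assumptions, not something taken from the OpenNARS or ONA tooling.

```python
import re
from pathlib import Path

# Assumed shape of an expected-output directive, as in the example above.
EXPECTED = re.compile(r"''outputMustContain\('(.+)'\)\s*$")

def scrape_pairs(text):
    """Pair each English comment line with the Narsese that follows it."""
    pairs = []
    comment = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.isdigit():
            continue  # blank line or a "run N cycles" line
        if line.startswith("''"):
            # Directive line: keep the expected Narsese if it has a comment.
            match = EXPECTED.match(line)
            if match and comment is not None:
                pairs.append((comment, match.group(1)))
            comment = None
        elif line.startswith("'"):
            # A later comment replaces an earlier one (e.g. a section title).
            comment = line[1:].strip()
        else:
            if comment is not None:
                pairs.append((comment, line))
            comment = None
    return pairs

def scrape_dir(root):
    """Collect pairs from every .nal file under root (hypothetical layout)."""
    pairs = []
    for path in sorted(Path(root).rglob("*.nal")):
        pairs.extend(scrape_pairs(path.read_text(encoding="utf-8")))
    return pairs

def swap_term(pair, old, new):
    """Naive augmentation: replace one term in both the English and the Narsese."""
    english, narsese = pair
    english = re.sub(rf"\b{re.escape(old)}\b", new, english)
    english = re.sub(rf"\b{re.escape(old.capitalize())}\b", new.capitalize(), english)
    narsese = re.sub(rf"\b{re.escape(old)}\b", new, narsese)
    return english, narsese

SAMPLE = """'Revision ------
'Bird is a type of swimmer.
<bird --> swimmer>.
'Bird is probably not a type of swimmer.
<bird --> swimmer>. %0.10;0.60%
1
'Bird is very likely to be a type of swimmer.
''outputMustContain('<bird --> swimmer>. %0.87;0.91%')
"""

pairs = scrape_pairs(SAMPLE)
augmented = [swap_term(p, "bird", "fish") for p in pairs]
```

On a checkout of the test files one would call something like `scrape_dir("tests/")` instead of using `SAMPLE`. The search and replace in `swap_term` is deliberately dumb: it knows nothing about plurals or about whether the new sentence is actually true, so the synthetic pairs would need a sanity check.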
This would be just a start of course, and there's also the question of whether you want to keep the truth values in the set or are only interested in the statements. In the latter case you'd probably want to do something special to differentiate between "Bird is a type of swimmer" and "Bird is probably not a type of swimmer", which in the example above both translate to `<bird --> swimmer>.` but with different truth values.
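As a sketch of how one might spot those cases, the snippet below splits a trailing `%frequency;confidence%` off each Narsese line and groups the English sentences by the bare statement. It only handles the two-number form shown in the example, and `split_truth` and `find_collisions` are hypothetical helper names of mine.

```python
import re
from collections import defaultdict

# Matches a trailing truth value such as "%0.10;0.60%" (two-number form only).
TRUTH = re.compile(r"\s*%(\d*\.?\d+);(\d*\.?\d+)%\s*$")

def split_truth(narsese):
    """Return (statement, (frequency, confidence)), or (statement, None)."""
    match = TRUTH.search(narsese)
    if match is None:
        return narsese.strip(), None
    truth = (float(match.group(1)), float(match.group(2)))
    return narsese[:match.start()].strip(), truth

def find_collisions(pairs):
    """Group English sentences that map to the same statement without truth value."""
    by_statement = defaultdict(list)
    for english, narsese in pairs:
        statement, truth = split_truth(narsese)
        by_statement[statement].append((english, truth))
    return {s: v for s, v in by_statement.items() if len(v) > 1}

example_pairs = [
    ("Bird is a type of swimmer.", "<bird --> swimmer>."),
    ("Bird is probably not a type of swimmer.", "<bird --> swimmer>. %0.10;0.60%"),
]
collisions = find_collisions(example_pairs)
```

Here `collisions` has a single key, `<bird --> swimmer>.`, listing both English sentences with their truth values (or `None` where the test omits one). What to do with such groups, e.g. keep the truth value as part of the target or drop all but one sentence, depends on what you want the dataset for.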