Building a corpus for language learning


Andrew Buck

Apr 4, 2017, 2:59:22 PM

to opencog

I posted a couple of replies to a message from Linus in the thread titled "Questions on a Knowledge Representation Standard for AGI - Help me not waste my time :-)". I am starting this separate thread to continue that discussion without taking the original one too far off topic.

I started expanding a bit on the idea I outlined there about making files filled with statements all on a particular topic. I wanted something a bit more concrete to serve as an example of the kind of thing I want to create. Using the topic "breakfast" that I mentioned in the other thread, I started a text file where I laid out a few statements about breakfast, and then for each one I spent a minute or two "riffing" on the basic idea of the sentence: building other similar sentences and replacing different words or phrases while keeping the same overall theme. I will include the file as it exists now at the end of the email. Bear in mind that this file only took a few minutes to put together, so a volunteer spending an hour or two on such a file would be able to create hundreds, if not thousands, of similar sentences on a subject.

As I understand it, the current thinking in the OpenCog project is to try to learn language by running various statistical analyses on large corpora of un-annotated text. I think something like this corpus would be ideal for the initial stages of that analysis. Although the example file is nowhere near Wikipedia scale, it is much more tightly focused around a single idea than a general corpus like Wikipedia is. Additionally, because there are many more repeated words, and the words are all used in a similar context, the patterns observed in this corpus merit a much higher "weighting" than those in a typical body of text you would find on Wikipedia or on a web page. Although it will be orders of magnitude smaller than something like Wikipedia, the "signal to noise ratio" will be much higher.

Also notice how in the text I have put one sentence on each line, but I have left a blank line between each block of sentences with a common theme. If you just ignore these blank lines you get a more challenging learning problem than if you take note of them and "bias" the weightings assigned when learning from sentences within a block since you know that all of them express the same basic information just with different phraseology. This lets you learn from just a couple sentences that "early in the morning" and "at the start of the day" likely have very similar, if not identical, meanings without having to parse hundreds of uses of these phrases. You could also do something like putting a * at the beginning of a sentence that has a meaning opposite to the one before it in some sense. Again, you could either parse these and ignore the extra markings, or use the markings to influence the weighting of learned meaning.
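As a minimal sketch of how a learner might read this layout, the Python below splits a file into blocks at blank lines and records whether a line started with a *. The names Sentence and parse_blocks are hypothetical, and the starred sentence in the sample is made up for illustration; it is not in the example file.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    opposite: bool = False  # True if the line began with "*"


def parse_blocks(corpus_text):
    """Split corpus text into blocks of sentences separated by blank lines."""
    blocks = []
    current = []
    for raw in corpus_text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current block, if there is one.
            if current:
                blocks.append(current)
                current = []
            continue
        opposite = line.startswith("*")
        if opposite:
            # Drop the marker but remember that it was there.
            line = line[1:].lstrip()
        current.append(Sentence(line, opposite))
    if current:
        blocks.append(current)
    return blocks


# The starred line is a hypothetical example of the "opposite meaning" marker.
sample = """Breakfast is a meal.

Breakfast is a small meal.
Breakfast is a light meal.
* Breakfast is a large meal.
"""
blocks = parse_blocks(sample)
```

A learner that wants the harder problem can flatten the blocks and ignore the opposite flag; one that wants the easier problem can use both to bias its weightings.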

This is what I mean when I say that this corpus will be a highly redundant body of text. Relative to the number of ideas it covers, its word count will be very large compared with Wikipedia: it will be based around a comparatively small number of ideas/concepts but will explore the language surrounding those concepts in much more varied ways than you would find in an everyday body of text.
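To illustrate how this redundancy could be exploited, here is a rough sketch that compares two sentences from the same block word by word and pulls out the phrases that differ. It uses difflib from the Python standard library, which is just one simple way to do this and not anything OpenCog does today; the function name differing_phrases is hypothetical.

```python
from difflib import SequenceMatcher


def differing_phrases(sentence_a, sentence_b):
    """Return (phrase_a, phrase_b) pairs where two sentences diverge."""
    a = sentence_a.rstrip(".").split()
    b = sentence_b.rstrip(".").split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" opcode marks a span of words in a that was swapped
        # for a different span of words in b.
        if tag == "replace":
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


# Two sentences taken from the same block of the example file.
pairs = differing_phrases(
    "Breakfast is a small meal eaten early in the day.",
    "Breakfast is a small meal eaten at the start of the day.",
)
print(pairs)
```

For these two sentences the shared words on either side drop out and "early in" is paired with "at the start of", which is the kind of candidate equivalence a learner could weight more heavily because both sentences sit in the same block.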

-AndrewBuck

Below is the example file:

Breakfast is a meal.

Breakfast is a small meal.
Breakfast is a light meal.
Breakfast is a simple meal.
Breakfast is usually something that is easy to make.
Breakfast is usually something that is easy to prepare.
Breakfast is usually something that is easy to cook.
Many breakfasts are foods you don't have to cook, like cereal or an energy bar.

Breakfast is eaten in the morning.
Breakfast is a small meal eaten in the morning.
Breakfast is a small meal eaten early in the day.
Breakfast is a small meal eaten at the start of the day.
Breakfast is a small meal eaten after you wake up.
Breakfast is a small meal eaten soon after you wake up.
Breakfast is a small meal eaten just after you wake up.
Many people eat breakfast to start their day.
Many people start their day by eating breakfast.
Eating breakfast helps people wake up.
Eating breakfast helps people get going in the morning.

A common breakfast consists of toast and cereal.
A common breakfast consists of eggs and bacon.
A common breakfast consists of bacon and eggs.
A common breakfast consists of beans on toast.
A common breakfast consists of cereal and orange juice.
A common breakfast consists of pancakes.
A common breakfast consists of waffles.
A common breakfast consists of waffles or pancakes.
People commonly eat toast or cereal for breakfast.

Coffee, milk, or orange juice are common beverages served with breakfast.
People often drink coffee with their breakfast.
People drink coffee with their breakfast to wake them up.
