A proposal for capitalization handling using LG rules

ami...@gmail.com

Aug 29, 2014, 10:48:33 PM

to link-g...@googlegroups.com

Currently, capitalization handling is hard-coded in C, and is English-centric.

In addition, there are two regexes that match capitalized words (CAPITALIZED-WORDS and PL-CAPITALIZED-WORDS). Because (at least currently) only the first regex that matches is used, regex-based suffix guessing doesn't work for capitalized words.

My idea is to shift the handling of capitalized words from the domain of hard-coded rules in C, to the domain of the LG rules.

Requirements:

- Minimal handling by program code - the rest will be done by the LG rules.

- Flexibility - as little language dependency as possible.

To that end I propose to consider a capitalized word as composed of a non-capitalized one preceded by an initial "virtual" null morpheme that signifies its capitalization.

So that the LG rules can select the proper word meaning, two alternatives are to be generated:

Input word: Qwerty

alt1: nonCAP.ZZZ qwerty

alt2: 1stCAP.ZZZ qwerty

If the word is all capitals (to be used by languages that use such words), maybe:

alt2: allCAP.ZZZ qwerty

In the English dictionary, the LEFT-WALL, colon and bullets will have the proper connectors for selecting the appropriate form of the word. The ZZZ null morphemes will get discarded after the linkage step. However, the program will re-capitalize the words appropriately for display if needed.
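To make the tokenizer side concrete, here is a minimal Python sketch of the idea (the real tokenizer is in C). The function names `cap_alternatives` and `strip_markers` are made up for illustration, and keeping the nonCAP alternative for all-capital words is an assumption, since the proposal only says "maybe" for that case.

```python
# Marker names are the ones proposed above; everything else is hypothetical.
NON_CAP = "nonCAP.ZZZ"
FIRST_CAP = "1stCAP.ZZZ"
ALL_CAP = "allCAP.ZZZ"


def cap_alternatives(word):
    """Return the alternatives for one input word, each a list of tokens."""
    lower = word.lower()
    if word == lower:
        # Nothing capitalized: a single, unchanged alternative.
        return [[word]]
    if len(word) > 1 and word.isupper():
        # All capitals (assumption: the nonCAP alternative is kept too).
        return [[NON_CAP, lower], [ALL_CAP, lower]]
    if word[0].isupper() and word[1:] == lower[1:]:
        # Only the first letter is capitalized.
        return [[NON_CAP, lower], [FIRST_CAP, lower]]
    # Mixed case such as "McDonald" is left alone in this sketch.
    return [[word]]


def strip_markers(tokens):
    """Discard the ZZZ null morphemes after the linkage step."""
    return [t for t in tokens if not t.endswith(".ZZZ")]


for alt in cap_alternatives("Qwerty"):
    print(alt, "->", strip_markers(alt))
```

Re-capitalization for display is not shown; one simple option would be for the program to remember the original input word next to each alternative.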

I'm weak in LG rules so this proposal doesn't include a suggestion in that regard, but hopefully it can be done.

Amir

Linas Vepstas

Aug 30, 2014, 3:38:26 PM

to link-grammar

On 29 August 2014 17:48, <ami...@gmail.com> wrote:

Currently, capitalization handling is hard-coded in C, and is English-centric.

[...]

To that end I propose to consider a capitalized word as composed of a non-capitalized one preceded by an initial "virtual" null morpheme that signifies its capitalization.

!! Brilliant insight! But of course ... and now that you've said it, it's forehead-slappingly obvious. Much like the phonetic a->an shift, which depends on the first letter of the following word, so it is with capitalization in English: if a word follows a period or a colon, it's capitalized. So, yes, that's perfectly consistent with LG grammar.

So that the LG rules can select the proper word meaning, two alternatives are to be generated:

Input word: Qwerty

alt1: nonCAP.ZZZ qwerty
alt2: 1stCAP.ZZZ qwerty

I think that should be

alt2: 1stCAP.ZZZ Qwerty

in case Qwerty is a proper noun; otherwise, it won't be found in the dictionary.
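A small Python sketch of why the capitalized form matters for the 1stCAP alternative. The toy dictionary and the function names are invented for illustration; the real dictionary lookup is of course more involved.

```python
def alternatives_keeping_proper_form(word):
    """Alternatives for a word with an initial capital (assumed non-empty).

    The nonCAP alternative carries the down-cased word; the 1stCAP
    alternative keeps the word as written, so proper nouns can be found.
    """
    lowered = word[0].lower() + word[1:]
    return [["nonCAP.ZZZ", lowered], ["1stCAP.ZZZ", word]]


def viable(alternatives, dictionary):
    """Keep only alternatives whose real word (last token) is in the dictionary."""
    return [alt for alt in alternatives if alt[-1] in dictionary]


# Hypothetical toy dictionary: one common noun, one proper noun.
TOY_DICT = {"table", "London"}

print(viable(alternatives_keeping_proper_form("London"), TOY_DICT))
print(viable(alternatives_keeping_proper_form("Table"), TOY_DICT))
```

With the down-cased form in both alternatives, "London" would survive in neither of them.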

If the word is all capitals (to be used by languages that use such words), maybe:
alt2: allCAP.ZZZ qwerty

In the English dictionary, the LEFT-WALL, colon and bullets will have the proper connectors for selecting the appropriate form of the word. The ZZZ null morphemes will get discarded after the linkage step. However, the program will re-capitalize the words appropriately for display if needed.

I'm weak in LG rules so this proposal doesn't include a suggestion in that regard, but hopefully it can be done.

Easy enough. Currently, the W link connects the wall to the first word (well, actually, it can be Wd or Wi or many others, but I simplify slightly here), and so the current dict is:

LEFT-WALL: W+;

<most-nouns-and-some-other-things>: (W- & <other-links>) or <other-stuff>;

In the new scheme, it's more complicated, and is reminiscent of the phonetic solution:

http://www.abisource.com/projects/link-grammar/dict/section-PH.html

so:

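% The wall MUST have an FP link to something! FP is a new link type; it stands for "First after Punctuation".
% FP+ is listed first because connectors further left in an expression link to nearer words, and the ZZZ marker sits between the wall and the word.

LEFT-WALL: FP+ & W+;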

% A capitalized common noun MUST link to the wall with FP, and MUST link with UPd (a new link type indicating a down-cased word).

nonCAP.ZZZ: FP- & UPd+;

% A capitalized proper noun optionally links to the wall with FP. It's optional, because proper nouns might occur mid-sentence, where such a link is unwanted. The UPu+ link is mandatory, so that it attaches to something.

% Note that the tokenizer will also produce nonCAP.ZZZ in the middle of sentences, but these will fail to link unless they are first in the sentence. Which is exactly what we want.

1stCAP.ZZZ: {FP-} & UPu+;

% If a common noun attaches to the wall, it must have the UPd link to enforce capitalization.

<most-nouns-and-some-other-things>: (UPd- & W- & <other-links>) or <other-stuff>;

% A proper noun MUST always have a UPu link, so that 1stCAP.ZZZ is linked, no matter what position it takes in the sentence.

<proper-nouns>: UPu- & ((W- & <other-links>) or <other-stuff>);

Fully inserting and debugging the above will take a little bit of work. To avoid breaking existing code, it can be staged by making FP and UP always optional at first, and making them mandatory, as needed, only once everything has been converted.
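The staging could be as simple as wrapping the new connectors in `{}` (the LG notation for "optional") until the conversion is done. Below is a rough Python sketch of that rewrite, applied to rule text as plain strings; the helper name `make_optional` is made up, and it only handles the simple case where a connector is not already directly preceded by `{`.

```python
import re

# Matches FP+/FP- and UP+/UP- with optional lower-case subscripts (UPd+, UPu-),
# unless the connector is already directly preceded by an opening brace.
NEW_CONNECTOR = re.compile(r"(?<!\{)\b(?:FP|UP)[a-z]*[+-]")


def make_optional(rule):
    """Wrap every bare FP/UP connector in a rule line in {} to make it optional."""
    return NEW_CONNECTOR.sub(r"{\g<0>}", rule)


print(make_optional("nonCAP.ZZZ: FP- & UPd+;"))
print(make_optional("1stCAP.ZZZ: {FP-} & UPu+;"))
```

Once all the dictionary entries have been converted, the braces would be removed again, entry by entry, wherever the link has to be mandatory.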