Both adjective and noun?

Mike Dowd

Feb 3, 2022, 11:59:02 AM
to link-grammar
From 4.0.dict for English
% A+: "It has twice the percent value"
percent.u parts.u:
  (<noun-modifiers> &
    ((ND- & {DD-} & <noun-rel-x> & (<noun-main-x> or B*x+)) or
    <noun-main-p> or
    (ND- & {DD-} & <noun-and-x>) or
    U-))
  or (ND- & (OD- or AN+ or YS+))
  or ({E- or EA-} & A+);

If I read this correctly, this expression is saying that "percent" and "parts" can appear as either a noun (AN+ noun-modifier) or an adjective (A+). In some other parts of the dictionary that I've looked at, different classifications for the same word would be handled with separate categories, i.e. one for the noun case and one for the adjective case. The word would appear twice in the dictionary. Is there something special about this situation that it didn't get handled that way?

This is actually relevant to what I'm working on, but sadly I can't go into details.

Linas Vepstas

Feb 6, 2022, 2:46:00 PM
to link-grammar
On Thu, Feb 3, 2022 at 10:59 AM Mike Dowd <mike...@gmail.com> wrote:
From 4.0.dict for English
% A+: "It has twice the percent value"
percent.u parts.u:
  (<noun-modifiers> &
    ((ND- & {DD-} & <noun-rel-x> & (<noun-main-x> or B*x+)) or
    <noun-main-p> or
    (ND- & {DD-} & <noun-and-x>) or
    U-))
  or (ND- & (OD- or AN+ or YS+))
  or ({E- or EA-} & A+);

If I read this correctly, this expression is saying that "percent" and "parts" can appear as either a noun (AN+ noun-modifier) or an adjective (A+).

Yes, that's correct.

In some other parts of the dictionary that I've looked at, different classifications for the same word would be handled with separate categories, i.e. one for noun case and one for the adjective case. The word would appear twice in the dictionary. Is there something special about this situation that it didn't get handled that way?

No. You will discover that there are hundreds, if not thousands of words handled much like the above.  Here are the guidelines:

The .u at the end of percent.u is called a "word subscript".  Subscripts have no grammatical role or function; they are used only as a debugging aid.  In the dictionary, the subscripts are used haphazardly and inconsistently, and should not be trusted to convey any kind of meaningful grammatical information. Trusting them will lead to hair-pulling and assorted errors.

The correct way to identify the grammatical role of a word is to look at its disjunct.  For example:

linkparser> !disj
Display of disjuncts used turned on.
linkparser> It has twice the percent value
            LEFT-WALL     2.000  hWd+ RW+
                   it     0.000  Wd- Ss+
                has.v     2.000  Ss- @MV+ O*n+
              twice.e     0.000  MVa-
                  the     0.000  D+
            percent.u     0.000  A+
              value.s     0.000  @A- Ds**x- Os-
           RIGHT-WALL     0.000  RW-


Here, the "A+" makes it clear that "percent" is being used as an adjective.  Similarly, the "Ss- @MV+ O*n+" makes it clear that "has" is being used as a transitive verb with a singular subject.  Each disjunct should be thought of as a hyper-fine, detailed "part of speech" or "grammatical category".

If you find this level of detail to be too much, you should create filters to sort disjuncts into broader categories (e.g. nouns, verbs, adjectives...). It's up to you to design these filters, since different linguists apparently have different tastes with regard to these. There is no broad-category assignment done by link-grammar (other than what you get if you turn on the phrase-structure output).
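
A minimal sketch of such a filter, in Python. The rules are illustrative assumptions keyed to the disjuncts printed in the session above, not anything defined by link-grammar; `link_types`, `broad_category` and the category labels are hypothetical names, and a real filter would need many more cases (here, for instance, pronouns simply fall under "noun").

```python
import re

# Connector syntax assumed here: optional "@" (multi-connector), optional
# lowercase h/d (head/dependent) mark, uppercase link type, optional
# lowercase or "*" subscript, then the direction "+" or "-".
CONNECTOR = re.compile(r"^@?[hd]?([A-Z]+)[a-z*]*([+-])$")


def link_types(disjunct):
    """Return the set of (link type, direction) pairs in a disjunct string."""
    pairs = set()
    for token in disjunct.split():
        match = CONNECTOR.match(token)
        if match:
            pairs.add((match.group(1), match.group(2)))
    return pairs


def broad_category(disjunct):
    """Map one disjunct to a coarse label. The rules are hand-picked guesses."""
    links = link_types(disjunct)
    if ("S", "-") in links:          # takes a subject on its left
        return "verb"
    if ("A", "+") in links:          # modifies a noun on its right
        return "adjective"
    if links & {("D", "-"), ("O", "-"), ("S", "+")}:
        return "noun"                # takes a determiner, or is an object/subject
    if ("D", "+") in links:
        return "determiner"
    if ("MV", "-") in links:
        return "adverbial"
    return "other"


# Disjuncts copied from the parse of "It has twice the percent value".
for word, disjunct in [
    ("has.v", "Ss- @MV+ O*n+"),
    ("percent.u", "A+"),
    ("value.s", "@A- Ds**x- Os-"),
]:
    print(word, "->", broad_category(disjunct))
```

Note that the filter never looks at the subscript (.v, .u, .s); it works from the disjunct alone.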

I'm saying all this because the word subscripts do seem to resemble broad categories, and so everyone loves to think that is what they are. They're not.  You really, really want to focus entirely on the disjuncts and what they are saying.


This is actually relevant to what I'm working on, but sadly I can't go into details.

? Aww. Why not?

-- linas

--
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.
 

Mike Dowd

Feb 7, 2022, 2:05:38 AM
to link-grammar
I get it about the subscripts. My confusion was thinking that the 1716 categories in the English dictionary were the final word on grammatical roles. I.e. the dictionary defined 1716 fine-grained parts-of-speech. And that's not wrong. But as you have clarified, it's not until a category's expression has been executed that the actual parts-of-speech for a sentence emerge. Thanks.

My project is still in the proof of concept phase and has been declared hush-hush.

Linas Vepstas

Feb 9, 2022, 12:25:01 AM
to link-grammar
Hi Mike,

On Mon, Feb 7, 2022 at 1:05 AM Mike Dowd <mike...@gmail.com> wrote:
I get it about the subscripts. My confusion was thinking that the 1716 categories in the English dictionary were the final word on grammatical roles. I.e. the dictionary defined 1716 fine-grained parts-of-speech.

I don't know how you counted 1716. I was referring to disjuncts, and there are literally millions of them. Maybe tens of millions, I dunno.  It's hard to count.

Perhaps you were counting the number of semicolons in 4.0.dict?  These are NOT parts of speech; they are simply classes of words that behave similarly enough that the human curators of the grammar can manage them. Nothing more.

For example,  the word "saw" can be both a noun and a verb, and if I make a dictionary entry

saw.xxx: (S- & O+) or (D- & S+);

that is a valid dictionary entry encoding two parts of speech: a transitive verb (S- & O+) and a common noun (D- & S+), and there is just one semicolon there. Perhaps it is the only dictionary entry that looks exactly like that, but that does not imply that "saw" is in some unique grammatical category. It is simply a word that can be a noun or a verb, and it was convenient to write it that way.  I could have written

saw.yyy: (S- & O+);
saw.zzz: (D- & S+);

and things would be the same as before, except for the subscript usage.
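
The equivalence can be checked with a toy expander. This is a sketch only: expressions are modelled as nested Python tuples (an assumed representation, not the dictionary file format), costs and @ connectors are ignored, and `expand` is a hypothetical helper, not part of link-grammar.

```python
from itertools import product


def expand(expr):
    """Return the set of disjuncts (tuples of connectors) of a toy expression.

    A bare string is a single connector; ("or", ...) is a choice,
    ("and", ...) is a conjunction, ("opt", x) models {x}.
    """
    if isinstance(expr, str):
        return {(expr,)}
    op, *args = expr
    if op == "or":
        return set().union(*(expand(arg) for arg in args))
    if op == "opt":
        return expand(args[0]) | {()}
    if op == "and":
        result = {()}
        for arg in args:
            result = {left + right
                      for left, right in product(result, expand(arg))}
        return result
    raise ValueError(f"unknown operator: {op}")


# saw.xxx: (S- & O+) or (D- & S+);
saw_xxx = ("or", ("and", "S-", "O+"), ("and", "D-", "S+"))

# saw.yyy: (S- & O+);   saw.zzz: (D- & S+);
saw_yyy = ("and", "S-", "O+")
saw_zzz = ("and", "D-", "S+")

one_entry = expand(saw_xxx)
two_entries = expand(saw_yyy) | expand(saw_zzz)
print(one_entry == two_entries)
```

Either way of writing the entry yields the same two disjuncts, which is why the number of entries says little about the number of grammatical categories.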

The !! command tells you about the disjuncts.

For example, pick a word, any word, say "Southern" with capital S. At the prompt:

linkparser> !!Southern
String splits to:
 Southern southern

Token "Southern" matches:
    Southern                      23554  disjuncts <en/words/entities.organizations.sing>

Token "southern" matches:
    southern.a                       4  disjuncts

so this word alone has 23K "fine-grained parts of speech".  Some of these disjuncts have @ connectors which are variable in number, so you have to consider zero, one, two ... or more ... times the number of @ connectors in each disjunct.

Here's another:

linkparser> !!jumping
Token "jumping" matches:
    jumping.g                    38398  disjuncts <en/words/words.v.6.5>

    jumping.v                     1604  disjuncts <en/words/words.v.6.4>

Even if you filter for duplicates in these lists, you get zillions quite rapidly.
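
A rough counting sketch shows where the large numbers come from. It assumes a toy model in which `or` adds the counts of its branches, `&` multiplies them, and an optional `{...}` contributes one extra (empty) alternative; duplicates are not filtered, so the result is an upper bound. The tuple representation, `count_disjuncts`, and the connector names X0+ ... X9+ are made up for illustration.

```python
from math import prod


def count_disjuncts(expr):
    """Upper bound on the number of disjuncts of a toy expression."""
    if isinstance(expr, str):          # a single connector
        return 1
    op, *args = expr
    if op == "or":                     # alternatives add up
        return sum(count_disjuncts(arg) for arg in args)
    if op == "opt":                    # {x}: x, or nothing at all
        return count_disjuncts(args[0]) + 1
    if op == "and":                    # conjunction multiplies
        return prod(count_disjuncts(arg) for arg in args)
    raise ValueError(f"unknown operator: {op}")


# Ten optional connectors joined by "&": each one doubles the count.
ten_optionals = ("and", *[("opt", f"X{i}+") for i in range(10)])

# A small nested expression: (A- or B- or C-) & {D+ or E+} & F+
nested = ("and", ("or", "A-", "B-", "C-"), ("opt", ("or", "D+", "E+")), "F+")

print(count_disjuncts(ten_optionals), count_disjuncts(nested))
```

Real dictionary entries nest macros several levels deep, so products like these compound quickly.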

-- Linas
 

Mike Dowd

Feb 9, 2022, 2:16:42 AM
to link-grammar
I was referring to the rules, the expressions that are executed to produce the disjuncts. The dictionary looks at them as categories. I got a list of them from the dictionary and could see what the fully expanded expressions look like. There are 1716 of those. And I understand that I was mistaken about them being the fine-grained parts-of-speech.

I had not seen the !! feature of the parser yet. Thanks for pointing that out.

It will take me some time to wrap my head around how that many disjuncts are produced. 

Thank you for patiently explaining these concepts to me.

Linas Vepstas

Feb 9, 2022, 3:45:22 PM
to link-grammar
On Wed, Feb 9, 2022 at 1:16 AM Mike Dowd <mike...@gmail.com> wrote:
I was referring to the rules, the expressions that are executed to produce the disjuncts. The dictionary looks at them as categories. I got a list of them from the dictionary and could see what the fully expanded expressions look like. There are 1716 of those.

I have no clue what you mean by "category", nor "how to get a list of them", or how you counted 1716.

If I count the number of semicolons in 4.0.dict, I get about 2.2K which is close to your number of 1716. But those semicolons are both for macro definitions, and for word-lists.

The factorization of 4.0.dict is into three matrices, L D R, where L and R are sparse and D is dense.  The L matrix consists of word-lists; the R matrix consists of named macros (and their expansion to disjuncts).  The matrix D is the association of word-lists to expressions that include unexpanded macros.

The dimension of L is W x N
The dimension of D is N x M
The dimension of R is M x S

W = number of words, approx 100K
S = number of disjuncts, approx 4 million
M = number of macros, approx 100 or 200
N = number of "rules" in 4.0.dict, approx 2K (maybe 1716???)

The number of semi-colons in 4.0.dict is approx M+N

Determining M and N exactly is tricky, because multiple factorizations are possible. Not everything that could be a macro has been made into one. Many of the rules should be factored into multiple distinct parts.  Keeping this factorization under control and reasonably well-pruned is the hard part of maintaining 4.0.dict.
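
As a back-of-the-envelope check of these dimensions, the sketch below confirms that the shapes chain together and compares the size of a direct word-by-disjunct table with the three factors. It assumes the factorization can be read as an ordinary matrix product, uses the approximate figures quoted above (taking M = 200 and N = 2000), and counts cells as if all three factors were dense; `product_shape` is a hypothetical helper.

```python
W = 100_000      # words (approximate)
S = 4_000_000    # disjuncts (approximate)
M = 200          # macros, upper end of "100 or 200" (assumption)
N = 2_000        # rules, "approx 2K" (assumption)

shapes = {"L": (W, N), "D": (N, M), "R": (M, S)}


def product_shape(*matrices):
    """Shape of a chained matrix product; raises if inner sizes differ."""
    rows, cols = matrices[0]
    for next_rows, next_cols in matrices[1:]:
        if next_rows != cols:
            raise ValueError("inner dimensions do not match")
        cols = next_cols
    return rows, cols


# L D R relates words (rows) to disjuncts (columns).
ldr_shape = product_shape(shapes["L"], shapes["D"], shapes["R"])

direct_cells = W * S                      # one big word-by-disjunct table
factored_cells = W * N + N * M + M * S    # the three factors, counted dense

print(ldr_shape, direct_cells, factored_cells)
```

Because L and R are sparse, the number of entries actually stored would be smaller still than the dense count for the factors.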

--linas
 

Mike Dowd

Feb 9, 2022, 8:02:34 PM
to link-grammar

link_experimental_api(const Category *)
    dictionary_get_categories(const Dictionary dict);

/* List of words in a dictionary category. */
typedef struct
{
    unsigned int num_words;
    const char* name;
    Exp *exp;             /* <-- fully expanded dictionary entry */
    char const ** word;
} Category;

First expression:

(XXXENTITY+) or (({G-} & {[MG+]} & (({DG- or [GN-]2.100 or [[{@A-} & {D-}]]} & (({@MX+} & {NMr+} & (JG- or (((Ss*s+ & ({((({@hCOd- or dHM-} & (C- or ((dRJrc- or dRJlc+)))) or ({hCO-} & {[@hCO-]} & Wd-))) or [Rn-]})) or SIs- or (Js- & {Mf+}) or (Os*e- & {Sg+ or Sj+}) or ([{[Bsj+]} & Xd- & Xc+ & MX-]0.100))) or ((({@M+} & dSJls+) or ({[@M+]} & dSJrs-))))) or YS+ or YP+)) or AN+ or ({@A-} & {OH-} & Wa-) or (OH- & SIs-) or G+))) or (({[[Wa-]]} & (({OH-} & Wc- & {MG+} & (Xc+ or [()]1.200) & (Qd+ or Wq+)) or ({Xd-} & {OH-} & (Xc+ or [[()]]) & [dCOa+])))) or (({OH-} & Xc+ & S**i+))

It's the expansion of this dictionary entry:

% Words that are also given names
% Cannot take A or D links.
% Art Bell Bill Bob Buck Bud
%
% The bisex dict includes names that can be given to both
% men and women.
/en/words/entities.given-bisex.sing
/en/words/entities.given-female.sing
/en/words/entities.given-male.sing
/en/words/entities.goddesses
/en/words/entities.gods:
  <marker-entity> or <given-names> or <directive-opener> or <directive-subject>;

All the words in this category are proper names.

Linas Vepstas

Feb 9, 2022, 10:31:46 PM
to link-grammar
Oh, ah Heh. That is an experimental API, evolving to fit the grammar learning project. The number there, 1716, is kind-of grammatically meaningless; it is indeed the number "N" of the matrix factorization described in the previous email.

--linas
