Re: Using the medical vocabulary with the Link Grammar Parser


Peter Szolovits

Jul 5, 2011, 12:56:28 PM
to Can Bruce, link-g...@googlegroups.com
Dr. Bruce, those instructions are many years old, and I had not recently checked to see if they still work.  It used to be that the instructions I gave worked to extend the parser.  However, there have been many changes since that time, and something may work quite differently now. I did reproduce your experiment, with the same result.  I also tried to move the EXTRA.k files into the words subfolder, but to no avail.  I am posting your query to the link grammar discussion group in case one of the people who has worked on changing the way dictionaries are loaded can suggest an appropriate work-around.  Linas Vepstas some years ago said that he had incorporated a number of the new dictionary entries that my work created into the standard dictionaries, but he had omitted a lot of terms that he thought would be very uncommon in ordinary text. I am sure that "pinealomas" belongs to that group.

--Pete Szolovits

On Jul 5, 2011, at 12:19 PM, Can (John) Bruce wrote:

Dear Dr. Szolovits,
 
I am setting up the medical vocabulary addition to the Link Grammar Parser and am having a problem. I was wondering whether you might be able to help me or could direct me to someone who could assist.
 
I have followed the instructions on the Web page http://groups.csail.mit.edu/medg/projects/text/lexicon.html but link-parser does not seem to recognize the new medical terms:
 
[cb442@baklava en]$ pwd
/home/cb442/perlmods/link-grammar-4.7.4/data/en
[cb442@baklava en]$ cp 4.0.dict 4.0.dict.bkp
[cb442@baklava en]$ cat 4.0.dict extra.dict > 4.0.dict.new
[cb442@baklava en]$ mv 4.0.dict.new 4.0.dict
[cb442@baklava en]$ ls
4.0.affix                  4.0.dict          4.0.fixes.batch  EXTRA.1   EXTRA.13  EXTRA.17  EXTRA.5  EXTRA.9      Makefile.in
4.0.batch                  4.0.dict.bkp      4.0.knowledge    EXTRA.10  EXTRA.14  EXTRA.2   EXTRA.6  extra.dict   README
4.0.biolg.batch            4.0.dict.m4       4.0.regex        EXTRA.11  EXTRA.15  EXTRA.3   EXTRA.7  Makefile     tiny.dict
4.0.constituent-knowledge  4.0.enwiki.batch  4.0.voa.batch    EXTRA.12  EXTRA.16  EXTRA.4   EXTRA.8  Makefile.am  words
[cb442@baklava en]$ head -1 EXTRA.2
pinealomas.n leucoencephalitides.n Michaels.n
[cb442@baklava data]$ cd ../link-grammar
[cb442@baklava link-grammar]$ ./link-parser
link-grammar: Info: Dictionary found at /usr/local/share/link-grammar/en/4.0.dict
link-grammar: Info: Dictionary version 4.7.4.
link-grammar: Info: Library version link-grammar-4.7.4. Enter "!help" for help.
linkparser> pinealomas occur in the pineal gland.
Found 1 linkage (1 had no P.P. violations)
        Unique linkage, cost vector = (UNUSED=0 DIS=0 FAT=0 AND=0 LEN=9)
 
    +--------------------------Xp-------------------------+
    |                              +--------Js-------+    |
    |                              |  +------Ds------+    |
    +-----Wd-----+-----Sp----+-MVp-+  |      +---A---+    |
    |            |           |     |  |      |       |    |
LEFT-WALL pinealomas[!].n occur.v in the pineal.a gland.n .
 
In the above example, the parser does not recognize the term “pinealomas” from EXTRA.2
 
Do you know what might be the problem?
 
Can (John) Bruce, Ph.D.
Associate Director,
Bioinformatics Resource
Keck Foundation Biotechnology Resource Laboratory
Yale University
 
Office: TAC N-226
Mail Address: Yale University, MB&B Dept., 333 Cedar St., New Haven, CT 06512
 

Linas Vepstas

Jul 5, 2011, 6:39:47 PM
to link-g...@googlegroups.com, Can Bruce
Hi,

A very quick reply below; I won't be able to validate anything for
a few weeks.

On 5 July 2011 11:56, Peter Szolovits <p...@mit.edu> wrote:
> Dr. Bruce, those instructions are many years old, and I had not recently
> checked to see if they still work.  It used to be that the instructions I
> gave worked to extend the parser.  However, there have been many changes
> since that time, and something may work quite differently now. I did
> reproduce your experiment, with the same result.  I also tried to move the
> EXTRA.k files into the words subfolder, but to no avail.  I am posting your
> query to the link grammar discussion group in case one of the people who has
> worked on changing the way dictionaries are loaded can suggest an
> appropriate work-around.  Linas Vepstas some years ago said that he had
> incorporated a number of the new dictionary entries that my work created
> into the standard dictionaries, but he had omitted a lot of terms that he
> thought would be very uncommon in ordinary text. I am sure that "pinealomas"
> belongs to that group.

The instructions look reasonable, with the following notes:
-- Be sure not to skip step 5, "The contents of the file extra.dict
need to be appended to the end of the file 4.0.dict". Make sure that
you are editing the correct copy of 4.0.dict: i.e. after changing it
in your local directory, you will need to "make install" again.
Alternatively, go to /usr/local/share/link-grammar/en (or wherever it
was installed) and edit the 4.0.dict there. I'm guessing this is the
root cause of the problem.
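
For example (untested, and assuming the default /usr/local prefix that
your transcript shows), either of these should put the edited
dictionary where the parser will actually load it:

$ cd link-grammar-4.7.4/data/en
$ cat extra.dict >> 4.0.dict    # append the new entries to the source copy
$ cd ../.. && make install      # reinstall so the change reaches /usr/local/share/link-grammar/en

or edit the installed copy directly (I'm assuming the EXTRA.* word
files also have to sit next to the installed 4.0.dict so it can find
them):

$ cat extra.dict >> /usr/local/share/link-grammar/en/4.0.dict
$ cp EXTRA.* /usr/local/share/link-grammar/en/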

-- Some, but not all, of the EXTRA* files were merged. Some were
merged (and since then re-arranged), some were just renamed, and
some were left out. My notes are below. Anyway, to avoid complaints
during dictionary loading, you will want to comment out those parts
of step 5 (above) that refer to the merged dictionaries.
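
The dictionary files treat "%" as a comment character, so disabling a
merged entry in extra.dict would look roughly like this (the exact
shape of the extra.dict lines is from memory):

% /en/EXTRA.1: ... ;    % already merged upstream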

Hope that helps.

== Linas

/extra.1: -- done already

/extra.2: -- skip, too big
/extra.3: -- skip, too big

/extra.4: -- /en/words/words-medical.v.4.2
/extra.5: -- /en/words/words-medical.v.4.1
/extra.6: -- /en/words/words-medical.adj.2
/extra.7: -- /en/words/words-medical.n.p

/extra.8: -- skip, too big
/extra.9: -- skip, random names

/extra.10: -- /en/words/words-medical.adv.1
/extra.11: -- /en/words/words-medical.v.4.5
/extra.12: -- skip, too big
/extra.13: -- /en/words/words-medical.v.4.3
/extra.14: -- /en/words/words-medical.prep.1
/extra.15: -- /en/words/words-medical.adj.3
/extra.16: -- /en/words/words-medical.v.2.1
/extra.17: -- skip, too big

Linas Vepstas

Jul 5, 2011, 6:43:19 PM
to link-g...@googlegroups.com, Can Bruce
To be clear: for step 5, gut out the contents of "extra.dict" except
for the lines that refer to extra.2, 3, 8, 9, 12, and 17, and append
only those to 4.0.dict. It should work, as long as you edit the
4.0.dict that is actually used. The command-line client should
print the file path of the 4.0.dict being used.
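
Something like the following should do it (untested; it assumes each
of those file references sits on its own line in extra.dict):

$ grep -iE 'extra\.(2|3|8|9|12|17)\b' extra.dict >> /usr/local/share/link-grammar/en/4.0.dict
$ link-parser    # the banner will print the path of the 4.0.dict it loaded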

--linas

Peter Szolovits

Jul 8, 2011, 10:35:20 PM
to link-g...@googlegroups.com, Can Bruce
I've been spending some time cleaning up the EXTRA files in my 2003 extensions to LGP based on the UMLS Specialist lexicon. Linas has incorporated a lot of the words that are not specifically medical (and rare) in the main LGP dictionaries, which means that loading the additions as originally described at http://www.medg.lcs.mit.edu/projects/text/ leads to many errors showing duplications. Therefore, I have built a new
version of my additions, which remove these duplicates. If you want to use these with the current (4.7.4) version of LGP, please look at http://groups.csail.mit.edu/medg/projects/text/lexicon.html.

Unfortunately, I have only been able to semi-automate this process, so as additional changes are made to the LGP dictionary, it may be necessary to re-do it.

In the process, however, I have run into a few questions. These mainly concern how one is supposed to represent proper names:

1. I note that the current LGP dictionaries include 7 files of male, female and ambiguous names, and singular locations, nations, organizations and states. I'm a bit confused about how ambiguity is handled. For example, Austin is listed as an ambiguous given name (.b), so "We went to Austin, Texas." parses with that interpretation. Does that simply not create problems?

2. Arctic is a location, but Antarctic is an organization.

3. America is a female name (.f) and a location (.l). So why isn't Austin handled similarly?

4. Atlantic is not in the dictionary. The heuristics that I used in 2003 created both an Atlantic.n (as in "I sail the Atlantic.") and an Atlantic.a (as in "I fly over the Atlantic Ocean."). That was just a dumb computer doing it, and it leads to a combinatorial multiplication of parses. How is one supposed to deal with productive problems like this? Comparably, North is only an organization, and gets an AN link to Pole.n in "North Pole".

Alas, there are tons of such questions. Do we just get by with them, or is there a systematic way to think about these?

Thanks. --Pete Sz.


CanBruce

Jul 11, 2011, 12:02:20 PM
to link-grammar
Thanks for cleaning up the medical term extension for the lexicon.
With the updated directions I have been able to apply the extensions
with no errors.

Regarding your questions about disambiguation (e.g., Austin the name
vs. the location), you may want to collect and create lists of all
given names, geographic names, etc. Wikipedia would be a convenient
source.

I suppose it would be efficient to have the program scan proper-noun
lists only if there is a likelihood of finding a match within them.
For example, if a sentence has a term that could carry a GN link
(e.g., "I fly over the Atlantic Ocean"), the program would then look
at the proper-noun lists and give that parse an extra boost if the
term (e.g., "Atlantic Ocean") does exist in, say, the geographic-name
list. In some cases the ambiguity would be unavoidable ("I went to
Austin"), but in others ("I went to Austin, Texas") this would assist
the parsing.
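
As a crude sketch of the lookup step (the word-list file name here is
hypothetical):

$ grep -qx 'Atlantic Ocean' geographic-names.txt && echo "prefer the GN-linked parse"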

Can Bruce

On Jul 8, 10:35 pm, Peter Szolovits <p...@mit.edu> wrote:
> I've been spending some time cleaning up the EXTRA files in my 2003 extensions to LGP based on the UMLS Specialist lexicon.  Linas has incorporated a lot of the words that are not specifically medical (and rare) in the main LGP dictionaries, which means that loading the additions as originally described at http://www.medg.lcs.mit.edu/projects/text/ leads to many errors showing duplications.  Therefore, I have built a new
> version of my additions, which remove these duplicates.  If you want to use these with the current (4.7.4) version of LGP, please look at http://groups.csail.mit.edu/medg/projects/text/lexicon.html.

Linas Vepstas

Jul 22, 2011, 10:04:19 PM
to link-g...@googlegroups.com, Peter Szolovits
Hi Peter,

Sorry for the late reply; I have been out on vacation.

On 8 July 2011 21:35, Peter Szolovits <p...@mit.edu> wrote:
> I've been spending some time cleaning up the EXTRA files in my 2003 extensions to LGP based on the UMLS Specialist lexicon.  Linas has incorporated a lot of the words that are not specifically medical (and rare) in the main LGP dictionaries, which means that loading the additions as originally described at http://www.medg.lcs.mit.edu/projects/text/ leads to many errors showing duplications.  Therefore, I have built a new
> version of my additions, which remove these duplicates.  If you want to use these with the current (4.7.4) version of LGP, please look at http://groups.csail.mit.edu/medg/projects/text/lexicon.html.

At one point, you had suggested that perhaps the parsing accuracy
was not improved by adding these rare terms, i.e. that the default
guesser for unknown words worked quite well. Do you have
new/better data to clarify this?


> 1. I note that the current LGP dictionaries include 7 files of male, female and ambiguous names, and singular locations, nations, organizations and states.  I'm a bit confused about how ambiguity is handled. For example, Austin is listed as an ambiguous given name (.b), so "We went to Austin, Texas." parses that with this interpretation.  Does that simply not create problems?

Ugh. Yes, it is a problem if you interpret suffixes as anything other
than vague hints. Really, the correct way to determine the part of
speech of a word is to look at the set of links that the word participates
in. The .b and .l tags were attempts to provide a stronger hint for
those cases that seemed relatively straightforward, but, as you note,
they don't work well (or at all) when a word has multiple meanings.

> 2. Arctic is a location, but Antarctic is an organization.

Oh, that's dumb ... I just fixed this.

The core problem was that both "arctic" and "antarctic", lower-case,
are listed as adjectives; this causes problems when these are
used upper-case, and so must be explicitly added to the
dictionaries in upper-case form.

> 3. America is a female name (.f) and a location (.l).  So why isn't Austin handled similarly?

No good reason. Note, however, that listing a name both ways
will, in most cases, just double the number of parses found; one
parse will use America.l and another will use America.f. I find this
annoying.


> 4. Atlantic is not in the dictionary. In the heuristics that I used in 2003, they created both an Atlantic.n (as in "I sail the Atlantic.") and Atlantic.a (as in "I fly over the Atlantic Ocean.").  That was just a dumb computer doing it, and it leads to a combinatorial multiplication of parses.  How is one supposed to deal with productive problems like this?  Comparably, North is only an organization, and gets an AN link to Pole.n in "North Pole".

Capitalized words, if not otherwise found in the dictionary,
are treated as if they were proper names; this allows them
to get AN links, so that chains of capitalized words are recognized
as the names of things. Things get a little trickier when such
words also appear lower-case in the dictionaries, in which case
I believe the upper-case variant must be explicitly included
in order for things to work (I'd have to double-check).

So, since lower-case 'atlantic' is not in the dictionary, the
upper-case Atlantic falls under the capital-letter rule, and
everything works fine. However, there is a lower-case north
in the dictionary, necessitating an upper-case North to be added.

The "capitalized word" detector is implemented as a regex pattern.
Note that other regex patterns are used to identify words ending
in -ium as nouns (e.g. chromium, etc.) This avoids the need to list
large numbers of rare latinate technical terms in the dictionary;
a blanket rule covers then all.

My suggestion is that, as you refine the "biomedical terms"
dictionary, if there are other classes of words with such patterns,
e.g. other Latin-derived terms, we should add a regex for them
rather than adding individual terms to the dictionary.
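
Roughly, such a rule pairs a pattern in 4.0.regex with a matching
entry in 4.0.dict; a made-up example (LATIN-NOUN-WORDS is a
hypothetical class name, and I've elided the dictionary-side link
expression, which would be that of an ordinary noun):

% in 4.0.regex:
LATIN-NOUN-WORDS: /ium$/

% in 4.0.dict:
LATIN-NOUN-WORDS: ... ;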

> Alas, there are tons of such questions.  Do we just get by with them, or is there a systematic way to think about these?

Best to bring them up on a case-by-case basis. Some are done in a
certain way for a reason, others are just plain bugs, and some are
just difficult to deal with, or possibly have been neglected.

--linas
