Query apparent anomalies in en/words/words-medical*

Paul McQuesten

Mar 16, 2011, 5:29:36 PM

to link-grammar

I have a few questions about words-medical. There is detailed data
after the numbered questions, along with the corresponding egrep.

1. None of the words in words-medical.adv.1 and words-medical.prep.1
have subscripts. Is this okay?

2. Some words in words-medical.v* have no subscript, although most do.

3. Most words in words-medical.v.4.[1234] have .v subscript.

4. words-medical.v.4.4 begins with a blank line.

5. Most words in words-medical.v.4.5 have .g subscript; 19 have none.

6. x_ray and X_ray differ only in initial capital. Does this impact
sentence detection and/or entity recognition?

0. Are such matters appropriate for this group?

// -------------------------------------------------------------------------------------
// 2. words-medical.v: some words with no subscript
en/words> egrep -c -v '[.][a-z]$' words-medical.v*
words-medical.v.4.1:11
words-medical.v.4.2:8
words-medical.v.4.3:21
words-medical.v.4.4:20
words-medical.v.4.5:19

// -------------------------------------------------------------------------------------
// 3. words-medical.v.4.[1234] supposed to have .v subscript??
// words-medical.v.4.4 begins with a blank line; okay??
en/words> egrep -v '[.]v$' words-medical.v.4.[1234] >nov
words-medical.v.4.1:adenosine_diphosphate-ribosylate
words-medical.v.4.1:cross_dress
words-medical.v.4.1:cross_fertilize
words-medical.v.4.1:cross_match
words-medical.v.4.1:freeze_dry
words-medical.v.4.1:jack_knife
words-medical.v.4.1:over_weight
words-medical.v.4.1:portacaval_shunt
words-medical.v.4.1:scar_cicatrise
words-medical.v.4.1:scar_cicatrize
words-medical.v.4.1:tap_dance

words-medical.v.4.2:adenosine_diphosphate-ribosylates
words-medical.v.4.2:cross_fertilizes
words-medical.v.4.2:cross_matches
words-medical.v.4.2:freeze_dries
words-medical.v.4.2:jack_knifes
words-medical.v.4.2:over_weights
words-medical.v.4.2:portacaval_shunts
words-medical.v.4.2:scar_cicatrises

words-medical.v.4.3:adenosine_diphosphate-ribosylated
words-medical.v.4.3:bandpass_filtered
words-medical.v.4.3:bench_pressed
words-medical.v.4.3:catalase_tested
words-medical.v.4.3:C_banded
words-medical.v.4.3:cross_fertilized
words-medical.v.4.3:cross_matched
words-medical.v.4.3:cryo_sectioned
words-medical.v.4.3:field_tested
words-medical.v.4.3:free_grafted
words-medical.v.4.3:freeze_dried
words-medical.v.4.3:immuno_assayed
words-medical.v.4.3:jack_knifed
words-medical.v.4.3:Nissl_stained
words-medical.v.4.3:over_weighted
words-medical.v.4.3:portacaval_shunted
words-medical.v.4.3:scar_cicatrised
words-medical.v.4.3:skate_boarded
words-medical.v.4.3:ventricular_hypertrophied
words-medical.v.4.3:x_rayed
words-medical.v.4.3:X_rayed

words-medical.v.4.4:
words-medical.v.4.4:adenosine_diphosphate-ribosylating
words-medical.v.4.4:bandpass_filtering
words-medical.v.4.4:bench_pressing
words-medical.v.4.4:catalase_testing
words-medical.v.4.4:cross_fertilizing
words-medical.v.4.4:cryo_sectioning
words-medical.v.4.4:field_testing
words-medical.v.4.4:free_grafting
words-medical.v.4.4:freeze_drying
words-medical.v.4.4:immuno_assaying
words-medical.v.4.4:jack_knifing
words-medical.v.4.4:Nissl_staining
words-medical.v.4.4:over_weighting
words-medical.v.4.4:portacaval_shunting
words-medical.v.4.4:scar_cicatrising
words-medical.v.4.4:scar_cicatrizing
words-medical.v.4.4:ventricular_hypertrophying
words-medical.v.4.4:x_raying
words-medical.v.4.4:X_raying

// -------------------------------------------------------------------------------------
// 5. words-medical.v.4.5 supposed to have .g subscript??
en/words> egrep -v '[.]g$' words-medical.v.4.5
adenosine_diphosphate-ribosylating
bandpass_filtering
bench_pressing
catalase_testing
cross_fertilizing
cryo_sectioning
field_testing
free_grafting
freeze_drying
immuno_assaying
jack_knifing
Nissl_staining
over_weighting
portacaval_shunting
scar_cicatrising
scar_cicatrizing
ventricular_hypertrophying
x_raying
X_raying

Linas Vepstas

Mar 18, 2011, 8:06:07 PM

to link-g...@googlegroups.com, Paul McQuesten

On 16 March 2011 16:29, Paul McQuesten <mcqu...@gmail.com> wrote:
> I have a few questions about words-medical. There is detailed data
> after the numbered questions, along with the corresponding egrep.
>
> 1. None of the words in words-medical.adv.1 and words-medical.prep.1
> have subscripts. Is this okay?

Yes. Subscripts aren't really needed; they are used for only one
reason: to distinguish multiple dictionary entries for the same word.
That they also happen to be kind-of-like part-of-speech tags is
a lucky accident for the user.
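
To see where subscripts actually do that job, a minimal sketch along these lines lists base words that carry more than one subscripted entry. The function name, the sample entries, and the assumed subscript shape (a dot followed by lowercase letters or hyphens, e.g. .v or .v-d) are illustrative assumptions, not taken from the dictionary code.

```shell
#!/bin/sh
# Sketch: list base words that have more than one subscripted entry.
# The sample entries below are invented for illustration.

multi_entry_words() {
    # De-duplicate whole entries, strip a trailing subscript such as .v or .g
    # (printing only lines that had one), then report repeated base words.
    cat "$@" | LC_ALL=C sort -u |
        sed -n 's/[.][a-z][a-z-]*$//p' |
        LC_ALL=C sort | uniq -d
}

tmp=$(mktemp -d)
printf '%s\n' 'suturing.v' 'ablate.v' 'cross_dress' > "$tmp/sample.v.4"
printf '%s\n' 'suturing.g' 'cross_dress' > "$tmp/sample.v.5"

multi_entry_words "$tmp/sample.v.4" "$tmp/sample.v.5"   # prints: suturing
rm -rf "$tmp"
```

Entries without a subscript (such as the underscore idioms) are skipped, so the output only shows words where the subscript is what tells the entries apart.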

> 2. Some words in words-medical.v* have no subscript, although most do.

Feel free to send in patches, if you wish.

> 3. Most words in words-medical.v.4.[1234] have .v subscript.

Presumably, they're verbs. Past-tense verbs, though, should be
tagged with .v-d so that people who care about past tense
(e.g. the stanford-parser compat mode in relex) get it right.

> 4. words-medical.v.4.4 begins with a blank line.

Shouldn't matter.

> 5. Most words in words-medical.v.4.5 have .g subscript; 19 have none.
>
> 6. x_ray and X_ray differ only in initial capital. Does this impact
> sentence detection and/or entity recognition?

Possibly. I'd have to experiment & play with this. The capital-letter
logic is frustratingly delicate.
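
For anyone who wants to experiment, a rough sketch like the following finds entries that differ only in letter case, which gives a list of candidates to try in test sentences. The function name and the sample entries are made up for illustration; it does not touch the parser itself.

```shell
#!/bin/sh
# Sketch: report groups of entries that differ only in letter case.
# The sample entries below are invented for illustration.

case_only_variants() {
    # Drop exact duplicates, group entries by their lowercased form,
    # and print each group that has more than one spelling.
    cat "$@" | LC_ALL=C sort -u |
        awk 'NF { k = tolower($0); n[k]++
                  list[k] = (n[k] > 1) ? list[k] " " $0 : $0 }
             END { for (k in n) if (n[k] > 1) print list[k] }' |
        LC_ALL=C sort
}

tmp=$(mktemp -d)
printf '%s\n' 'x_ray' 'X_ray' 'x_rayed' 'X_rayed' \
    'Nissl_stained' 'cross_match' > "$tmp/sample.v"

case_only_variants "$tmp/sample.v"
rm -rf "$tmp"
```

On the sample data this prints one line per group, "X_ray x_ray" and "X_rayed x_rayed". Sorting under LC_ALL=C keeps the order of spellings within a group predictable (capitals first).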

> 0. Are such matters appropriate for this group?

Yes.

> -------------------------------------------------------------------------------------
> // 3. words-medical.v.4.[1234] supposed to have .v subscript??
> // words-medical.v.4.4 begins with a blank line; okay??
> en/words> egrep -v '[.]v$' words-medical.v.4.[1234] >nov
> words-medical.v.4.1:adenosine_diphosphate-ribosylate
> words-medical.v.4.1:cross_dress
> words-medical.v.4.1:cross_fertilize

Nothing with an underscore can ever have a subscript;
this is due to a deep technical limitation.

--linas

Paul McQuesten

Mar 18, 2011, 8:29:55 PM

to link-grammar

On Mar 18, 5:06 pm, Linas Vepstas <linasveps...@gmail.com> wrote:

> On 16 March 2011 16:29, Paul McQuesten <mcques...@gmail.com> wrote:

> > -------------------------------------------------------------------------------------
> > // 3. words-medical.v.4.[1234] supposed to have .v subscript??
> > // words-medical.v.4.4 begins with a blank line; okay??
> > en/words> egrep -v '[.]v$' words-medical.v.4.[1234] >nov
> > words-medical.v.4.1:adenosine_diphosphate-ribosylate
> > words-medical.v.4.1:cross_dress
> > words-medical.v.4.1:cross_fertilize
>
> Nothing with an underscore can ever have a subscript;
> this is due to a deep technical limitation.

Ah, the mysterious_idiom code!

>
> > 2. Some words in words-medical.v* have no subscript, although most do.
> Feel free to send in patches, if you wish.

I just noticed that all the missing subscripts are on idioms, so the
above covers this case.
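
A quick way to confirm that observation is to list any entry that lacks a subscript and also contains no underscore; empty output means every unsubscripted entry is an idiom. This is only a sketch: the function name and the sample data are invented, and the subscript pattern is the same '[.][a-z]$' used in the egrep commands earlier in the thread.

```shell
#!/bin/sh
# Sketch: list entries that lack a subscript AND contain no underscore.
# Empty output means every unsubscripted entry is an idiom.
# The sample data is made up; point the function at real files such as
# en/words/words-medical.v.4.1 instead.

unsubscripted_non_idioms() {
    # Drop blank lines, drop entries ending in a subscript like .v or .g,
    # then drop anything containing an underscore (the idioms).
    cat "$@" | grep -v '^$' | grep -E -v '[.][a-z]$' | grep -v '_'
}

tmp=$(mktemp -d)
printf '%s\n' 'ablate.v' 'cross_dress' 'freeze_dry' 'intubate.v' > "$tmp/sample.v.1"
printf '%s\n' '' 'ablating.v' 'x_raying' 'suturing' > "$tmp/sample.v.4"

unsubscripted_non_idioms "$tmp/sample.v.1" "$tmp/sample.v.4"   # prints: suturing
rm -rf "$tmp"
```

Reading the files through cat rather than passing them to grep directly keeps file names out of the output, so an underscore in a directory name cannot hide a real hit.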

> > 5. Most words in words-medical.v.4.5 have .g subscript; 19 have none.

I should have pointed out that several of them appear in both v.4.4
and v.4.5. Which rule do they get? I have been unable to devise a test
case, since I do not know how to use
adenosine_diphosphate-ribosylating in a sentence ;-)
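
The overlap itself is easy to list. A minimal sketch, assuming one entry per line in each file; the function name and sample entries are hypothetical, and with the real dictionary the two arguments would be words-medical.v.4.4 and words-medical.v.4.5.

```shell
#!/bin/sh
# Sketch: print entries that occur in both of two word files.
# comm(1) needs sorted input, so each file is sorted into a temp copy first.
# The sample entries below are invented for illustration.

common_entries() {
    tmp_a=$(mktemp)
    tmp_b=$(mktemp)
    grep -v '^$' "$1" | LC_ALL=C sort -u > "$tmp_a"   # skip blank lines
    grep -v '^$' "$2" | LC_ALL=C sort -u > "$tmp_b"
    LC_ALL=C comm -12 "$tmp_a" "$tmp_b"   # -12: keep only lines in both files
    rm -f "$tmp_a" "$tmp_b"
}

tmp=$(mktemp -d)
printf '%s\n' '' 'bench_pressing' 'x_raying' 'X_raying' 'suturing.v' > "$tmp/sample.v.4"
printf '%s\n' 'bench_pressing' 'suturing.g' 'x_raying' 'X_raying' > "$tmp/sample.v.5"

common_entries "$tmp/sample.v.4" "$tmp/sample.v.5"
rm -rf "$tmp"
```

On the sample data this prints X_raying, bench_pressing and x_raying, one per line. The subscripted entries (suturing.v, suturing.g) are not reported because they differ as whole strings.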

Linas Vepstas

Apr 10, 2011, 7:15:00 PM

to link-g...@googlegroups.com, Paul McQuesten

On 18 March 2011 19:29, Paul McQuesten <mcqu...@gmail.com> wrote:
>> > 5. Most words in words-medical.v.4.5 have .g subscript; 19 have none.
> I should have pointed out that several of them appear in both v.4.4
> and v.4.5. Which rule do they get? I have been unable to devise a test
> case, since I do not know how to use
> adenosine_diphosphate-ribosylating in a sentence ;-)

That depends. The *v.*.4 words are gerunds and the *v.*.5 words are
present participles (or maybe the other way around), so which one is
used depends on how it's used.

--linas
