Yes. Subscripts aren't really needed; they are used for only one
reason: to distinguish multiple dictionary entries for the same word.
That they also happen to be kind-of-like part-of-speech tags is
a lucky accident for the user.
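
To make the "distinguish multiple entries for the same word" point concrete, here is a minimal sketch in Python. It is not how the parser itself reads the dictionary. It assumes the subscript is simply whatever follows the last period in an entry, and the helper name split_subscript and the sample entries are made up for illustration.

```python
def split_subscript(entry):
    """Split a dictionary entry into (word, subscript).

    Assumption for this sketch: the subscript is whatever follows the
    last period, e.g. "run.v" -> ("run", "v"). An entry with no period
    (or nothing before the period) is treated as having no subscript.
    """
    word, sep, subscript = entry.rpartition(".")
    if not sep or not word:
        return entry, None
    return word, subscript


# Two entries for the same word, told apart only by the subscript.
for entry in ["run.v", "run.n", "x_ray"]:
    print(entry, "->", split_subscript(entry))
```

The subscript here only has to keep the two "run" entries apart. That it also reads like a part-of-speech tag is the lucky accident.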
> 2. Some words in words-medical.v* have no subscript, although most do.
Feel free to send in patches, if you wish.
> 3. Most words in words-medical.v.4.[1234] have .v subscript.
Presumably, they're verbs. Past-tense verbs, though, should be
tagged with .v-d, so that anything that cares about past tense
(e.g. the stanford-parser compat mode in relex) gets it right.
> 4. words-medical.v.4.4 begins with a blank line.
Shouldn't matter.
> 5. Most words in words-medical.v.4.5 have .g subscript; 19 have none.
>
> 6. x_ray and X_ray differ only in initial capital. Does this impact
> sentence detection and/or entity recognition?
Possibly. I'd have to experiment & play with this. The capital-letter
logic is frustratingly delicate.
> 0. Are such matters appropriate for this group?
Yes.
> -------------------------------------------------------------------------------------
> // 3. words-medical.v.4.[1234] supposed to have .v subscript??
> // words-medical.v.4.4 begins with a blank line; okay??
> en/words> egrep -v '[.]v$' words-medical.v.4.[1234] >nov
> words-medical.v.4.1:adenosine_diphosphate-ribosylate
> words-medical.v.4.1:cross_dress
> words-medical.v.4.1:cross_fertilize
Nothing with an underscore can ever have a subscript;
this is due to a deep technical limitation.
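
Given that, a check like the egrep above is more useful if it sets the underscore entries aside. A rough sketch in Python follows. It assumes one entry per line (as the quoted egrep output suggests), the function names are hypothetical, and "ablate.v" and "abrade" are made-up sample entries. Unlike the egrep, it skips blank lines.

```python
def missing_subscript(entries, subscript="v"):
    """Return the entries that do not end in '.<subscript>'.

    Blank lines are skipped; the quoted egrep would have reported them.
    """
    suffix = "." + subscript
    result = []
    for line in entries:
        entry = line.strip()
        if entry and not entry.endswith(suffix):
            result.append(entry)
    return result


def partition_by_underscore(entries):
    """Split entries into (with_underscore, without_underscore)."""
    with_underscore = [e for e in entries if "_" in e]
    without_underscore = [e for e in entries if "_" not in e]
    return with_underscore, without_underscore


# "cross_dress" and "cross_fertilize" come from the quoted output;
# "ablate.v" and "abrade" are invented for this example.
sample = ["", "ablate.v", "cross_dress", "cross_fertilize", "abrade"]
expected, candidates = partition_by_underscore(missing_subscript(sample))
print("untagged because of the underscore:", expected)
print("possibly worth a patch:", candidates)
```

Only the second list is worth looking at for patches; the first one is untagged by necessity.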
--linas
That depends. The *v.*.4 words are gerunds and the *v.*.5 words are
present participles (or maybe the other way around), so which one is
used depends on how it's used.
--linas