Semcor gives different output than is shown in the docs

Mahesh Abnave

Sep 3, 2021, 7:12:19 PM
to nltk-users

I was trying out the SemCor corpus in NLTK.

I found this code in the docs:

>>> list(map(str, semcor.tagged_chunks(tag='both')[:3]))
['(DT The)', "(Lemma('group.n.01.group') (NE (NNP Fulton County Grand Jury)))", "(Lemma('state.v.01.say') (VB said))"]

I tried the same on colab (check last cell in this notebook):

>>> list(map(str, semcor.tagged_chunks(tag='both')[:3]))
['(DT The)', '(group.n.01 (NE (NNP Fulton County Grand Jury)))', '(say.v.01 (VB said))']

Note that in the docs output, the tag Lemma('state.v.01.say') contains the sense stem state.
In my output, the tag is just say.v.01, with no sense stem.

Why is the output different? What am I missing here?

(In fact, I have a Colab notebook in which a line of code executed yesterday printed the sense stem, but the same line, in the same notebook, executed today did not. I compared the output of pip list from yesterday and today, and they are exactly the same, so I was wondering what is going wrong within the same notebook. I did not give the details here because I don't want to complicate things and want to focus on the question above. Just to point out: my colleague is also not getting the sense stem in a Jupyter notebook running on his local machine. Fortunately, I am getting the sense stem in my local Jupyter. I hope it will not vanish.)

PS: I wrote the same post earlier, but I cannot find it in the group. I don't recall whether I clicked the Post message button or the delete button. If posts take some time to appear in the group, then this message will be a duplicate. Apologies in advance. This is my post here.

Mahesh Abnave

Sep 5, 2021, 4:09:15 PM
to nltk-users

I knew that SemCor uses WordNet senses to tag a subset of the Brown corpus. But I was not aware that the semcor API works both with and without WordNet downloaded beforehand, and that it returns tags in a different format in each case. I honestly feel that the semcor API documentation should at least mention this.

So, without wordnet predownloaded, it does not return sense stems:

>>> import nltk
>>> nltk.download('semcor')
>>> ! unzip -o /root/nltk_data/corpora/semcor.zip -d /root/nltk_data/corpora # in colab / jupyter
>>> from nltk.corpus import semcor
>>> semcor.tagged_sents(tag='sem')
[[['The'], Tree('group.n.01', [Tree('NE', ['Fulton', 'County', 'Grand', 'Jury'])]), Tree('say.v.01', ['said']), Tree('friday.n.01', ['Friday']), ['an'], Tree('investigation.n.01', ['investigation']), ['of'], Tree('atlanta.n.01', ['Atlanta']), ["'s"], Tree('recent.s.02', ['recent']), Tree('primary_election.n.01', ['primary', 'election']), Tree('produce.v.04', ['produced']), ['``'], ['no'], Tree('evidence.n.01', ['evidence']), ["''"], ['that'], ['any'], Tree('irregularity.n.01', ['irregularities']), Tree('take_place.v.01', ['took', 'place']), ['.']], [['The'], Tree('jury.n.01', ['jury']), Tree('far.r.02', ['further']), Tree('say.v.01', ['said']), ['in'], Tree('term.n.02', ['term']), Tree('end.n.02', ['end']), Tree('presentment.n.01', ['presentments']), ['that'], ['the'], Tree('group.n.01', [Tree('NE', ['City', 'Executive', 'Committee'])]), [','], ['which'], Tree('have.v.04', ['had']), Tree('overall.s.02', ['over-all']), Tree('charge.n.06', ['charge']), ['of'], ['the'], Tree('election.n.01', ['election']), [','], ['``'], Tree('deserve.v.01', ['deserves']), ['the'], Tree('praise.n.01', ['praise']), ['and'], Tree('thanks.n.01', ['thanks']), ['of'], ['the'], Tree('location.n.01', [Tree('NE', ['City', 'of', 'Atlanta'])]), ["''"], ['for'], ['the'], Tree('manner.n.01', ['manner']), ['in'], ['which'], ['the'], Tree('election.n.01', ['election']), ['was'], Tree('conduct.v.01', ['conducted']), ['.']], ...]

With wordnet pre-downloaded, it does return sense stems:

>>> import nltk
>>> nltk.download('wordnet')
>>> nltk.download('semcor')
>>> ! unzip -o /root/nltk_data/corpora/semcor.zip -d /root/nltk_data/corpora # in colab / jupyter
>>> from nltk.corpus import semcor
>>> semcor.tagged_sents(tag='sem')
[[['The'], Tree(Lemma('group.n.01.group'), [Tree('NE', ['Fulton', 'County', 'Grand', 'Jury'])]), Tree(Lemma('state.v.01.say'), ['said']), Tree(Lemma('friday.n.01.Friday'), ['Friday']), ['an'], Tree(Lemma('probe.n.01.investigation'), ['investigation']), ['of'], Tree(Lemma('atlanta.n.01.Atlanta'), ['Atlanta']), ["'s"], Tree(Lemma('late.s.03.recent'), ['recent']), Tree(Lemma('primary.n.01.primary_election'), ['primary', 'election']), Tree(Lemma('produce.v.04.produce'), ['produced']), ['``'], ['no'], Tree(Lemma('evidence.n.01.evidence'), ['evidence']), ["''"], ['that'], ['any'], Tree(Lemma('abnormality.n.04.irregularity'), ['irregularities']), Tree(Lemma('happen.v.01.take_place'), ['took', 'place']), ['.']], [['The'], Tree(Lemma('jury.n.01.jury'), ['jury']), Tree(Lemma('far.r.02.far'), ['further']), Tree(Lemma('state.v.01.say'), ['said']), ['in'], Tree(Lemma('term.n.02.term'), ['term']), Tree(Lemma('end.n.02.end'), ['end']), Tree(Lemma('presentment.n.01.presentment'), ['presentments']), ['that'], ['the'], Tree(Lemma('group.n.01.group'), [Tree('NE', ['City', 'Executive', 'Committee'])]), [','], ['which'], Tree(Lemma('own.v.01.have'), ['had']), Tree(Lemma('overall.s.02.overall'), ['over-all']), Tree(Lemma('mission.n.03.charge'), ['charge']), ['of'], ['the'], Tree(Lemma('election.n.01.election'), ['election']), [','], ['``'], Tree(Lemma('deserve.v.01.deserve'), ['deserves']), ['the'], Tree(Lemma('praise.n.01.praise'), ['praise']), ['and'], Tree(Lemma('thanks.n.01.thanks'), ['thanks']), ['of'], ['the'], Tree(Lemma('location.n.01.location'), [Tree('NE', ['City', 'of', 'Atlanta'])]), ["''"], ['for'], ['the'], Tree(Lemma('manner.n.01.manner'), ['manner']), ['in'], ['which'], ['the'], Tree(Lemma('election.n.01.election'), ['election']), ['was'], Tree(Lemma('conduct.v.01.conduct'), ['conducted']), ['.']], ...]
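For code that has to run in both setups, one option is to normalise the chunk label before using it. The following is only a minimal sketch, not part of NLTK: describe_label is a hypothetical helper name, and it assumes that a Lemma label exposes synset().name() and name() (as NLTK's WordNet Lemma objects do) and that the string labels have the form lemma.pos.sense-number. The second assumption is inferred from the two outputs above, where 'recent.s.02' corresponds to Lemma('late.s.03.recent'), so the string does not appear to be a synset name.

```python
def describe_label(label):
    """Return (synset_name, lemma_name) for a SemCor sense label.

    Hypothetical helper, not an NLTK function.
    Assumption: a Lemma-like label has .synset().name() and .name();
    anything else is treated as the plain-string form, e.g. 'say.v.01'.
    """
    if hasattr(label, "synset") and hasattr(label, "name"):
        # WordNet was available: the label is a Lemma object,
        # e.g. Lemma('state.v.01.say').
        return label.synset().name(), label.name()

    # WordNet was not available: the label is a plain string that looks
    # like "<lemma>.<pos>.<sense number>". The synset cannot be recovered
    # from it without WordNet, so None is returned in its place.
    lemma_name = str(label).rsplit(".", 2)[0]
    return None, lemma_name
```

It could be applied to each chunk as describe_label(chunk.label()) for the Tree items returned by semcor.tagged_sents(tag='sem'), skipping the untagged items, which are plain lists.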
