Re: Why is Stanford POS tag_sents tagging individual letters?


Raabia Asif

Jun 30, 2018, 8:29:06 AM
to nltk-users
Hey,

Have you been able to resolve this issue? If so, please share the solution; I am facing the same problem.

On Wednesday, April 12, 2017 at 1:17:41 AM UTC+5, ico...@gmail.com wrote:
Hi all,

I can't seem to get 'tag_sents' to work in the expected manner. If I segment to sentences and then tag the words using 'tag', I don't have any issues. But each time I try to do "batches" of sentences, the tagger splits up the input sentences into single letters which are then tagged. I've tried to track down the issue using print statements and everything appears to work as expected right up until it enters the Stanford POS black box. Any clues? 
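The single-letter output described above is consistent with each sentence being passed as a raw string where a list of tokens is expected: iterating a Python string yields individual characters, so each character ends up tagged as a "word". A minimal pure-Python sketch of the effect (no Stanford tagger involved; the variable names are illustrative):

```python
sentence = "Hello world."

# What a sentence tagger expects: a list of word tokens.
tokens = ["Hello", "world", "."]

# What it effectively receives if given the raw string: iterating
# a string produces its characters, one by one.
characters = list(sentence)

print(tokens)      # ['Hello', 'world', '.']
print(characters)  # ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '.']
```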

I've included my working code below. I want to adapt it to use 'pos.tag_sents' instead of 'pos.tag'. Don't be too harsh on it; I'm new to Python and my C days are well behind me. Let me know if there's a better or more efficient way of coding anything.

Best,
iconseq

# includes
import nltk
import textmining
from nltk.tokenize import sent_tokenize
from nltk.tag.stanford import StanfordPOSTagger
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import os

# setJavaPath
java_path = "<path here>"
os.environ['JAVAHOME'] = java_path

# readFiles
with open('corpus.txt') as f:
    corpusText = f.read()

# tokenizer
tokenized = sent_tokenize(corpusText)

# stanfordPOS
stanford_dir = '<path here>/Python/stanford-postagger-full-2016-10-31/'
modelfile = stanford_dir + 'models/english-left3words-distsim.tagger'
jarfile = stanford_dir + 'stanford-postagger.jar'
pos = StanfordPOSTagger(model_filename=modelfile, path_to_jar=jarfile)

# wnLemmatizer
wn_lemmatizer = WordNetLemmatizer()

# tdmMatrix
termDocuMatrix = textmining.TermDocumentMatrix()

# globalVariables
indexGlobal = 0


def processContent(indexGlobal):  # processFunction: tag the sentence at indexGlobal
    try:
        # The loop body returns on its first iteration, so this
        # effectively tags only tokenized[indexGlobal].
        for sentence in tokenized[indexGlobal:]:
            print(sentence)
            words = nltk.word_tokenize(sentence)
            return pos.tag(words)

    except Exception as e:
        print(str(e))


def getWordnetPOS(popPOS):  # retagFunction
    if popPOS.startswith('J'):
        return wordnet.ADJ
    elif popPOS.startswith('V'):
        return wordnet.VERB
    elif popPOS.startswith('N'):
        return wordnet.NOUN
    elif popPOS.startswith('R'):
        return wordnet.ADV
    else:
        # WordNetLemmatizer rejects an empty POS string, so default to noun
        return wordnet.NOUN


while True:  # while not end of file
    try:
        bitSentence = []  # list must be in scope for doc-term matrix init
        taggedWords = processContent(indexGlobal)
        nouns = [tag for tag in taggedWords if tag[1] in ('NN', 'NNP', 'NNS')]
        # add 'JJ', 'JJR', 'JJS' to the tuple above to keep adjectives as well
        for popWord, popPOS in nouns:  # iterate in order; no pop/insert dance needed
            wnPOS = getWordnetPOS(popPOS)
            lemmatized = wn_lemmatizer.lemmatize(popWord.lower(), wnPOS)
            bitSentence.append(lemmatized)

        strSentence = ' '.join(bitSentence)
        termDocuMatrix.add_doc(strSentence)
        indexGlobal += 1
    except Exception:
        # processContent returns None once indexGlobal passes the last
        # sentence; the resulting TypeError ends the loop
        break

termDocuMatrix.append_csv("doc_termmatrix.csv", cutoff=1)

for row in termDocuMatrix.rows(cutoff=1):
    print(row)

Raabia Asif

Jun 30, 2018, 8:29:06 AM
to nltk-users
Okay, I have been able to solve the issue.
tag_sents calls the tag method on each sentence, and tag expects an already-tokenized sentence (a list of tokens) as input. So the solution is to pass tokenized sentences to tag_sents instead of raw string sentences. Here is the code:

from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize, sent_tokenize
doc = """I am John. I love Lahore."""
sentences = sent_tokenize(doc)
sentencesTokenized = [word_tokenize(sentence) for sentence in sentences]
jarPOS = 'C:/Users/Raabia/Desktop/phd course work/tools/NERs/stanford tools/stanford-postagger-2018-02-27/stanford-postagger.jar'
modelPOS = 'C:/Users/Raabia/Desktop/phd course work/tools/NERs/stanford tools/stanford-postagger-2018-02-27/models/english-bidirectional-distsim.tagger'

pos_tagger = StanfordPOSTagger(modelPOS, jarPOS, encoding='utf8')
text = pos_tagger.tag_sents(sentencesTokenized)
print(text)


And here is the output it prints:
[[('I', 'PRP'), ('am', 'VBP'), ('John', 'NNP'), ('.', '.')], [('I', 'PRP'), ('love', 'VBP'), ('Lahore', 'NNP'), ('.', '.')]]

