4. Can you get a sense of the complexity of this work? If so, what level of expertise do you think would be necessary to do it?
--
You received this message because you are subscribed to the Google Groups "nltk-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nltk-users+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
File "<stdin>", line 1
wordlists = PlaintextCorpuReader(corpus_root, '.*', encoding='utf-8' , errors='ignore)
^
SyntaxError: EOL while scanning string literal
I'm guessing that this is an impossible attempt to change how PlaintextCorpusReader (a module?) handles the files?
Is there a way to skip or ignore undecodable bytes like this within NLTK, or will I need to clean up my files beforehand?
Again, apologies for my inarticulate and probably very roundabout way of stating my problems.
Jembatan
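On the question of cleaning the files beforehand: as far as I can tell, the PlaintextCorpusReader constructor takes an encoding argument but no errors argument, and the traceback in this thread shows the decode running in 'strict' mode, so pre-processing may be the simplest route. Below is a minimal sketch using only the standard library. The function name clean_corpus and the directory arguments are made up for illustration; it copies every file into a second directory, replacing any byte that is not valid UTF-8.

```python
import os


def clean_corpus(src_root, dst_root, encoding="utf-8", errors="replace"):
    """Copy every file under src_root into dst_root as valid UTF-8.

    Hypothetical helper, not part of NLTK. Bytes that cannot be decoded
    are replaced with U+FFFD (errors="replace") or dropped
    (errors="ignore"), so some information is lost either way.
    """
    written = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            # Read raw bytes so that decoding happens under our control.
            with open(src, "rb") as f:
                raw = f.read()
            text = raw.decode(encoding, errors=errors)
            # newline="" keeps the original line endings untouched.
            with open(dst, "w", encoding="utf-8", newline="") as f:
                f.write(text)
            written.append(rel)
    return sorted(written)
```

After running something like clean_corpus(corpus_root, cleaned_root), the reader can be pointed at the cleaned copy with PlaintextCorpusReader(cleaned_root, '.*', encoding='utf-8'). If the files are really in another encoding (0xc8 is a letter in Latin-1, for instance, though that is only a guess about this corpus), passing that encoding to the reader instead would avoid losing characters.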
>>> wordlists.words()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/collections.py", line 226, in __repr__
    for elt in self:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 402, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/plaintext.py", line 122, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1148, in readline
    new_chars = self._read(readsize)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1380, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1411, in _incr_decode
    return self.decode(bytes, 'strict')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc8 in position 10: invalid continuation byte
>>>
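The UnicodeDecodeError above says that some file in the corpus contains a byte (0xc8) that is not valid UTF-8 at that point, but it does not say which file. A rough way to track it down, sketched with the standard library only, is to try decoding each file and report the ones that fail. The function name find_undecodable is hypothetical, and the sketch assumes the corpus is an ordinary directory of files.

```python
import os


def find_undecodable(root, encoding="utf-8"):
    """Return (relative_path, byte_offset, byte_value) for each file under
    root whose contents cannot be decoded with the given encoding.

    Hypothetical helper for diagnosis; only the first bad byte in each
    file is reported.
    """
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                raw = f.read()
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first offending byte.
                rel = os.path.relpath(path, root)
                problems.append((rel, exc.start, raw[exc.start]))
    return sorted(problems)
```

Calling find_undecodable(corpus_root) and printing the result should show whether the problem is confined to one or two stray files or runs through the whole corpus, which in turn suggests whether to fix individual files or to treat the corpus as being in a different encoding.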