Strange tokens showing up in dictionary

Kenneth Orton

May 26, 2016, 6:52:00 AM

to gensim

I'm using wikicorpus.py to create a corpus/dictionary, and the logging output shows some odd-looking tokens being kept in the dictionary.

Some of the tokens look like this:

u'inicia\xe7\xe3o/JJ', u'\u7f8e\u6728\u591a\u99c5/NN'

Radim Řehůřek

May 26, 2016, 8:16:57 AM

to gensim

Hi Kenneth,

not sure which part you consider strange, but:

iniciação/JJ = iniciação (word) + JJ (POS tag)

The "/JJ" part is there because you're using the lemmatizer (which uses the pattern library under the hood). It assumes English by default, so it probably won't assign meaningful POS categories to foreign words like iniciação.
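To make the word + POS tag structure concrete, here is a minimal sketch that splits such tokens back into their two parts. The helper name `split_tagged_token` is hypothetical (it is not part of gensim or pattern), and it assumes the tag is whatever follows the last "/" in the token.

```python
def split_tagged_token(token):
    """Split 'word/TAG' into (word, tag). Hypothetical helper, not a gensim API."""
    # rpartition splits on the LAST slash, so a slash inside the word survives.
    word, sep, tag = token.rpartition('/')
    if not sep:
        # No slash at all: treat the whole token as an untagged word.
        return token, None
    return word, tag


# The two tokens from the log output, written with the same escapes.
tokens = ['inicia\xe7\xe3o/JJ', '\u7f8e\u6728\u591a\u99c5/NN']
pairs = [split_tagged_token(t) for t in tokens]

for word, tag in pairs:
    print(word, tag)
```

The escapes in the string literals are only a way of typing the characters; the resulting words are ordinary Unicode strings.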

Let me know if that answers it,

Radim

Kenneth Orton

May 26, 2016, 8:46:03 AM

to gensim

I was more concerned with the backslash sequences that show up inside the words, before the POS tags.

After grep'ing the sample wiki, I think I've come to the conclusion that either the hash between the <sha1></sha1> tags is getting picked up, or these are tags left over from allowing the jpeg description to be used as a token. It happens even without lemmatization, and so far the words haven't gone into the model's vocabulary because of filter_extremes.

It's happened quite often with this run of corpus/dict creation. The only thing I've changed since the last run is the size of the vocab, from 100000 to 170000.

cat simplewiki-latest-pages-articles.xml | grep xe0

>International Business Times, "Harold Camping Says End did come May 21, spiritually; Predicts New Date: October 21" [http://au.ibtimes.com/articles/150707/20110524/harold-camping-says-end-did-come-may-21-spiritually-predicts-new-date-october-21.htm] Retrieved May 23, 2011</ref><ref name="1994?">[http://www.youtube.com/watch?v=OT0Y2lxe00I video about the book "1994?"]</ref>

<sha1>mubanmrxe05huuyyf63atmry1qukcpf</sha1>

<sha1>incl0mcpvb9vwcesxe0ovdhczp50eu1</sha1>

<sha1>jcvrxe0q4td1tpaomtdta04jguati8i</sha1>
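A short sketch of why this grep only turns up hashes and URLs: it searches for the three literal characters "x", "e", "0", while the accented letters in the dump are stored as encoded bytes, not as a backslash-x sequence. The UTF-8 encoding of the dump is an assumption here, and the variable names are illustrative only.

```python
# One of the lines grep returned, and the word from the dictionary log.
sha1_line = '<sha1>mubanmrxe05huuyyf63atmry1qukcpf</sha1>'
word = 'inicia\xe7\xe3o'  # the escapes are just a way of typing the accented letters

# grep matches plain ASCII text, which happens to occur inside the hash.
literal_hit = 'xe0' in sha1_line

# The word itself contains no literal "xe7"; it contains a single accented character.
word_hit = 'xe7' in word

# Assuming the dump is UTF-8, that one character is stored as two bytes.
utf8_bytes = '\xe7'.encode('utf-8')

print(literal_hit, word_hit, utf8_bytes)
```

So the matches in the hashes are coincidental and unrelated to the tokens in the dictionary.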

Kenneth Orton

May 26, 2016, 9:00:34 AM

to gensim

Never mind, these are all just Unicode characters.

I guess that is the way pattern parses the text:

>>> text = 'Iniciação (do latim initiatio) é um termo que remete a começo, entrada: iniciar um evento, ação, circunstância ou acontecimento. Também tem um significado de ascensão de um nível (abandonado) de existência para um outro nível superior.'
>>> from pattern.en import parse
>>> parsed = parse(text, lemmata=True, collapse=False)
/usr/local/lib/python2.7/dist-packages/Pattern-2.6-py2.7.egg/pattern/text/__init__.py:979: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  and tokens[j] in ("'", "\"", u"”", u"’", "...", ".", "!", "?", ")", EOS):
>>> print(parsed)
[[[u'Inicia\xe7\xe3o', u'NNP', 'B-NP', 'O', u'inicia\xe7\xe3o'], [u'(', u'(', 'O', 'O', u'('], [u'do', u'VB', 'B-VP', 'O', u'do'], [u'latim', u'JJ', 'B-NP', 'O', u'latim'], [u'initiatio', u'NN', 'I-NP', 'O', u'initiatio'], [u')', u')', 'O', 'O', u')'], [u'\xe9', u'FW', 'O', 'O', u'\xe9'], [u'um', u'FW', 'O', 'O', u'um'], [u'termo', u'FW', 'O', 'O', u'termo'], [u'que', u'FW', 'O', 'O', u'que'], [u'remete', u'VB', 'B-VP', 'O', u'remete'], [u'a', u'DT', 'B-NP', 'O', u'a'], [u'come\xe7o', u'NN', 'I-NP', 'O', u'come\xe7o'], [u',', u',', 'O', 'O', u','], [u'entrada', u'NN', 'B-NP', 'O', u'entrada'], [u':', u':', 'O', 'O', u':'], [u'iniciar', u'JJ', 'B-ADJP', 'O', u'iniciar'], [u'um', u'FW', 'O', 'O', u'um'], [u'evento', u'IN', 'B-PP', 'O', u'evento'], [u',', u',', 'O', 'O', u','], [u'a\xe7\xe3o', u'VBP', 'B-VP', 'O', u'a\xe7\xe3o'], [u',', u',', 'O', 'O', u','], [u'circunst\xe2ncia', u'NN', 'B-NP', 'O', u'circunst\xe2ncia'], [u'ou', u'NN', 'I-NP', 'O', u'ou'], [u'acontecimento', u'IN', 'B-PP', 'O', u'acontecimento'], [u'.', u'.', 'O', 'O', u'.']], [[u'Tamb\xe9m', u'NN', 'B-NP', 'O', u'tamb\xe9m'], [u'tem', u'NN', 'I-NP', 'O', u'tem'], [u'um', u'FW', 'O', 'O', u'um'], [u'significado', u'FW', 'O', 'O', u'significado'], [u'de', u'FW', 'O', 'O', u'de'], [u'ascens\xe3o', u'FW', 'O', 'O', u'ascens\xe3o'], [u'de', u'FW', 'O', 'O', u'de'], [u'um', u'FW', 'O', 'O', u'um'], [u'n\xedvel', u'NN', 'B-NP', 'O', u'n\xedvel'], [u'(', u'(', 'O', 'O', u'('], [u'abandonado', u'NN', 'B-NP', 'O', u'abandonado'], [u')', u')', 'O', 'O', u')'], [u'de', u'IN', 'B-PP', 'B-PNP', u'de'], [u'exist\xeancia', u'NN', 'B-NP', 'I-PNP', u'exist\xeancia'], [u'para', u'FW', 'O', 'O', u'para'], [u'um', u'FW', 'O', 'O', u'um'], [u'outro', u'JJ', 'B-NP', 'O', u'outro'], [u'n\xedvel', u'NN', 'I-NP', 'O', u'n\xedvel'], [u'superior', u'JJ', 'B-ADJP', 'O', u'superior'], [u'.', u'.', 'O', 'O', u'.']]]

Kenneth Orton

May 26, 2016, 9:15:22 AM

to gensim

It looks like simple_preprocess does something similar. I'm not concerned about non-English words either way. It's strange that the English Wikipedia dump would have entries in it that are in a completely different language.

>>> text = 'Entre os objetivos de alguns tipos de iniciação, destacam-se o aprendizado de valores fundamentais para a vida no nível seguinte (adulto). O iniciado deve aprender a se fortalecer com o isolamento, sobreviver em condições precárias, estar preparado para as dificuldades da vida (por exemplo, muitas iniciações exigem que o iniciado construa a cabana em que ficará isolado durante o ritual), aprender a caçar, pescar, conhecer a fauna e flora etc.'
>>> from gensim import utils
>>> print(utils.simple_preprocess(text))
[u'entre', u'os', u'objetivos', u'de', u'alguns', u'tipos', u'de', u'inicia\xe7\xe3o', u'destacam', u'se', u'aprendizado', u'de', u'valores', u'fundamentais', u'para', u'vida', u'no', u'n\xedvel', u'seguinte', u'adulto', u'iniciado', u'deve', u'aprender', u'se', u'fortalecer', u'com', u'isolamento', u'sobreviver', u'em', u'condi\xe7\xf5es', u'prec\xe1rias', u'estar', u'preparado', u'para', u'as', u'dificuldades', u'da', u'vida', u'por', u'exemplo', u'muitas', u'inicia\xe7\xf5es', u'exigem', u'que', u'iniciado', u'construa', u'cabana', u'em', u'que', u'ficar\xe1', u'isolado', u'durante', u'ritual', u'aprender', u'ca\xe7ar', u'pescar', u'conhecer', u'fauna', u'flora', u'etc']

Radim Řehůřek

May 26, 2016, 10:42:46 AM

to gensim

Nothing to do with Pattern, actually; this is just how Python displays non-ASCII Unicode characters in a string's repr: a backslash followed by the character's hex code. It's just like \n standing for a newline without actually printing one.
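The point can be checked directly. The sessions above are Python 2.7, where printing a list shows each string's escaped repr. The sketch below assumes Python 3, where repr() shows the characters themselves, so it uses the built-in ascii() to reproduce the escaped form.

```python
word = 'inicia\xe7\xe3o'               # the same string as 'iniciação'
station = '\u7f8e\u6728\u591a\u99c5'   # the second token from the log, minus its tag

# ascii() escapes every non-ASCII character, like Python 2's repr() did.
escaped = ascii(word)
print(escaped)          # 'inicia\xe7\xe3o'
print(ascii(station))

# The escapes are display only: each one is a single character in the string.
print(len(word))        # 9

# Printing the string itself shows the real characters.
print(word)
```

Nothing in the data is corrupted; only the displayed form differs from the stored string.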

-rr
