Loading WikiCorpus


Tomáš Holler

Mar 23, 2024, 6:03:24 PM
to Gensim
Hello Gensim team,

I have a question regarding WikiCorpus. One of your examples loads and preprocesses the Wiki dump like this:

import logging

import smart_open
from gensim.corpora.wikicorpus import WikiCorpus, tokenize

wiki = WikiCorpus(
    "path/to/file",
    tokenizer_func=tokenize,
    metadata=True,
    dictionary={},
)

with smart_open.open("path/to/new_file", "w", encoding='utf8') as fout:
    for article_no, (content, (page_id, title)) in enumerate(wiki.get_texts()):
        title = ' '.join(title.split())
        if article_no % 500000 == 0:
            logging.info("processing article #%i: %r (%i tokens)", article_no, title, len(content))
        fout.write(f"{title}\t{' '.join(content)}\n")

This works fine, no problem with that. However, I would like to get the text (sentences) of each article as a single string instead of as tokenized words, and with no preprocessing such as lowercasing or tokenization; I only want to clean up the text by filtering out and removing the markup. So the question is: is it possible to use gensim.corpora.wikicorpus.filter_wiki or gensim.corpora.wikicorpus.remove_markup, or both, for this? I am not sure how this could be implemented for my goal.
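
To illustrate the kind of output I am after, here is a toy sketch of what I hope these functions do; the sample markup string is made up and I have not tried this on a real dump:

from gensim.corpora.wikicorpus import filter_wiki

# made-up snippet of wiki markup: bold text plus a piped link
raw = "'''Gensim''' is a [[Python (programming language)|Python]] library."

# filter_wiki strips the markup but keeps casing and sentence structure
print(filter_wiki(raw))
# expected output, roughly: Gensim is a Python library.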

Thank you and best regards,
Tomas Holler

Gordon Mohr

Mar 28, 2024, 3:48:59 PM
to Gensim
I haven't tested this, but have you tried specifying a no-op `tokenizer_func` that fits the signature specified in the docs (https://radimrehurek.com/gensim/corpora/wikicorpus.html#gensim.corpora.wikicorpus.WikiCorpus) and simply returns the passed-in text string? E.g.:

    ..., tokenizer_func=lambda text, token_min_len, token_max_len, lower: text, ...
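
Untested sketch of how that could plug into the loop from your first message (paths are placeholders, and `passthrough` is just a name I made up):

    import logging

    import smart_open
    from gensim.corpora.wikicorpus import WikiCorpus

    # no-op tokenizer matching the documented signature: return the
    # already markup-filtered text as one string instead of a token list
    def passthrough(text, token_min_len, token_max_len, lower):
        return text

    wiki = WikiCorpus(
        "path/to/file",  # placeholder path to the wiki dump
        tokenizer_func=passthrough,
        metadata=True,
        dictionary={},
    )

    with smart_open.open("path/to/new_file", "w", encoding='utf8') as fout:
        for article_no, (content, (page_id, title)) in enumerate(wiki.get_texts()):
            title = ' '.join(title.split())
            # `content` is now a single cleaned string per article, so no join
            # is needed; note it may still contain newlines, which would break
            # this one-line-per-article output format
            fout.write(f"{title}\t{content}\n")

One caveat, if I remember the code right: `get_texts()` prunes short articles by calling `len()` on whatever the tokenizer returns, so with a plain string that threshold counts characters rather than words. I believe the `article_min_tokens` constructor argument lets you adjust that threshold if it becomes a problem.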

- Gordon