Hello Gensim team,
I have a question regarding the WikiCorpus. I'm following one of your examples for loading and preprocessing a Wiki dump:
import logging

import smart_open
from gensim.corpora.wikicorpus import WikiCorpus, tokenize

wiki = WikiCorpus(
    "path/to/file",
    tokenizer_func=tokenize,
    metadata=True,
    dictionary={},
)
with smart_open.open("path/to/new_file", "w", encoding='utf8') as fout:
    for article_no, (content, (page_id, title)) in enumerate(wiki.get_texts()):
        title = ' '.join(title.split())
        if article_no % 500000 == 0:
            logging.info("processing article #%i: %r (%i tokens)", article_no, title, len(content))
        fout.write(f"{title}\t{' '.join(content)}\n")
This works fine, no problem with that. However, I would like to get the text (sentences) of each article as a single string, instead of a list of tokenized words, and with no preprocessing such as lowercasing or tokenization: only cleanup, i.e. filtering out non-article pages and removing the wiki markup. So my question is: is it possible to use gensim.corpora.wikicorpus.filter_wiki, gensim.corpora.wikicorpus.remove_markup, or both, for this? I'm not sure how they should be wired in to achieve my goal.
Thank you and best regards,
Tomas Holler