Gensim sentences extraction (or some other method of sentence extraction?)

Tedo Vrbanec

Mar 18, 2022, 4:46:14 AM

to Gensim

I found that nltk.tokenize.sent_tokenize splits sentences incorrectly when it encounters abbreviations such as "i.e.", "e.g.", and "etc."

So I looked for alternatives and was thrilled to find that Gensim also has (or had) such a feature (https://radimrehurek.com/gensim_3.8.3/summarization/textcleaner.html), but it no longer works or exists.

In Gensim 4, I don't see any results when searching for "sent" (https://radimrehurek.com/gensim/search.html?q=sent&check_keywords=yes&area=default#).

What is the best way to solve my problem?

Thanks!

Tedo Vrbanec

Mar 18, 2022, 7:05:29 AM

to Gensim

I tried several modules and the best results came from sentence-splitter (https://github.com/mediacloud/sentence-splitter).

Gordon Mohr

Mar 18, 2022, 1:14:57 PM

to Gensim

I suspect that if you tried the old (`gensim.summarization`) sentence-splitter, you'd find its behavior similar to, or worse than, NLTK. It used a single regex for sentence-splitting, which you can view (& re-use if against all odds it works well on your corpus) from: https://github.com/RaRe-Technologies/gensim/blob/3.8.3/gensim/summarization/textcleaner.py#L37
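To illustrate why a single-regex approach tends to struggle with abbreviations, here is a minimal sketch using only the standard library. The pattern and the `naive_split` helper are hypothetical, written for this example; they are not the regex from gensim's `textcleaner.py`, which you would need to copy from the link above.

```python
import re

# Hypothetical pattern for illustration only; this is NOT the regex used in
# gensim 3.8.3's textcleaner.py.
# It splits on whitespace that follows '.', '!' or '?' and precedes a capital letter.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def naive_split(text):
    """Split text into candidate sentences with a single regex."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


sample = "We compared several tools, e.g. NLTK and spaCy. The results varied."
for sentence in naive_split(sample):
    print(sentence)
```

With this sample, the rule breaks the text after "e.g." because the abbreviation ends in a period and happens to be followed by a capitalized word, which is the kind of error described in the original question.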

I'm unfamiliar with the `sentence-splitter` library you mention, but would also be sure to evaluate some of the options in `spaCy` – either the default sentence-segmentation of its `DependencyParser` or via its alternate `Sentencizer` class. See: