Find abusing words in corpus


Arnay Joshi

Sep 23, 2017, 12:35:48 PM
to nltk-users
We were reading chat logs for analysis and found tons of abusing words.
Is there a clear way to read and clean these words, as we do for stop words? Can we get language support for this?

Dimitriadis, A. (Alexis)

Sep 24, 2017, 6:18:24 AM
to nltk-...@googlegroups.com
You can filter out any list of words, just like you filter stopwords. Adapting the code is quite straightforward. The nltk is a programming toolkit, so don’t expect all possible functionality to be pre-wrapped in a function.

But you’ll need to come up with a list of the “abusing” words that you don’t want to see. There isn’t one among the nltk’s data offerings, as far as I know, for English or for any other language.
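
As a rough illustration of the stopword-style filtering described above, here is a minimal sketch. The block list, the `tokenize` helper, and the `remove_blocked` function are all hypothetical names made up for this example; the regex tokenizer is only a stand-in for whatever tokenizer you already use (an NLTK tokenizer would slot in the same way), and placeholder words are used instead of real profanity.

```python
import re

# Hypothetical block list: you have to supply your own,
# since the nltk data does not include one.
BLOCKED_WORDS = {"darn", "heck"}

def tokenize(text):
    """Split text into word tokens (simple stand-in for an NLTK tokenizer)."""
    return re.findall(r"[A-Za-z']+", text)

def remove_blocked(tokens, blocked=BLOCKED_WORDS):
    """Drop tokens whose lowercase form is in the block list,
    the same way stopwords are usually filtered."""
    return [tok for tok in tokens if tok.lower() not in blocked]

line = "Well darn, that HECK of a game was fun"
clean_tokens = remove_blocked(tokenize(line))
print(clean_tokens)
```

Keeping the list in a set makes each membership check cheap, and lowercasing the token before the lookup means "HECK" and "heck" are treated alike.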

But you should be aware that if you censor chats in real time, people will just find ways to disguise the “bad” words, like writing “f**k” etc. So your mileage may vary…
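
To make the disguising problem concrete, here is one possible sketch that also catches simple masked spellings, assuming the disguise only replaces letters after the first with a symbol. The names `MASK`, `disguise_pattern`, and `contains_blocked` and the placeholder words are assumptions for illustration. It will not catch every trick (inserted spaces, look-alike letters, and so on), so treat it as a starting point rather than a solution.

```python
import re

# Hypothetical block list with placeholder words.
BLOCKED_WORDS = ["darn", "heck"]

# Symbols assumed to stand in for a hidden letter.
MASK = r"[*#@$!]"

def disguise_pattern(word):
    """Build a regex that matches the word with any letter after
    the first optionally replaced by a mask symbol."""
    first, rest = word[0], word[1:]
    body = "".join(f"(?:{re.escape(ch)}|{MASK})" for ch in rest)
    return re.escape(first) + body

# Require a non-letter (or the text boundary) on both sides so that
# longer words containing a blocked word are not flagged.
DISGUISED = re.compile(
    r"(?<![A-Za-z])(?:"
    + "|".join(disguise_pattern(w) for w in BLOCKED_WORDS)
    + r")(?![A-Za-z])",
    re.IGNORECASE,
)

def contains_blocked(text):
    """Return True if the text has a blocked word, plain or masked."""
    return DISGUISED.search(text) is not None

print(contains_blocked("d**n it"))
print(contains_blocked("darning socks"))
```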

Alexis

