Is there a way to separate numbers with characters?


Weihua Fan

Dec 12, 2018, 11:45:00 PM
to open-korean-text
We have a lot of short chat messages that need to be tokenized before processing. I noticed that numbers and the characters that follow them are often not separated as expected.
For example: 8시간 뒤에

Here, 8 and 시간 should be separated into two tokens.

Any suggestions are welcome. Thank you.

Hohyon Ryu

Dec 13, 2018, 5:41:08 PM
to open-korean-text
Hi Weihua,

You can separate them with a simple regex after tokenization. Twitter needed the number and its unit kept attached so that together they form one meaningful chunk.
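
For illustration, here is a minimal sketch of that kind of post-processing in Python. It assumes the tokenizer output has already been collected into a plain list of strings; the names `split_number_unit` and `split_tokens` are hypothetical and not part of open-korean-text, and the pattern only handles a leading number followed by Hangul syllables, so it may need adjusting for other unit spellings.

```python
import re

# A leading number (optionally with "," or "." separators, e.g. 1,000 or 2.5)
# followed only by Hangul syllables (U+AC00..U+D7A3). This pattern is an
# assumption for illustration, not something defined by open-korean-text.
_NUMBER_UNIT = re.compile(r"^(\d+(?:[.,]\d+)*)([가-힣]+)$")


def split_number_unit(token):
    """Split a token such as '8시간' into ['8', '시간'].

    Tokens that do not match the number+Hangul shape are returned unchanged
    as a one-element list.
    """
    match = _NUMBER_UNIT.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


def split_tokens(tokens):
    """Apply split_number_unit to every token and flatten the result."""
    result = []
    for token in tokens:
        result.extend(split_number_unit(token))
    return result
```

Calling `split_tokens` on a token list that contains `8시간` would yield `8` and `시간` as separate entries and leave the other tokens untouched. If you also need part-of-speech tags or offsets, you would have to carry those along yourself when splitting.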

Weihua Fan

Dec 14, 2018, 1:46:54 AM
to open-korean-text
Thank you
