Non printable utf-8 chars


Forud A

Aug 30, 2014, 4:09:23 AM
to golan...@googlegroups.com
Hi,
I want to remove any non-printable chars from my string, and by ANY I mean more characters than [:print:] covers, which is ASCII only (for example this one: http://play.golang.org/p/cvrnd6Al2e ).
In PHP there is a "u" modifier, so the old code uses '\w\s' plus some other characters like #$% in the pattern. The real working pattern to replace is: '/[^\w\s !@#$%^&*()+-=]+/ui'

How can I do this with the regexp package?

Thank you!

Ibrahim M. Ghazal

Aug 30, 2014, 4:33:08 AM
to Forud A, golang-nuts
You could use strings.Map [1] together with unicode.IsPrint [2] or
unicode.IsGraphic [3].

[1] http://golang.org/pkg/strings/#Map
[2] http://golang.org/pkg/unicode/#IsPrint
[3] http://golang.org/pkg/unicode/#IsGraphic
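
A minimal sketch of the strings.Map suggestion, assuming "non-printable" means whatever unicode.IsPrint rejects. The helper name stripNonPrintable and the sample input are made up for illustration:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripNonPrintable (hypothetical helper) drops every rune that
// unicode.IsPrint rejects. Returning a negative value from the mapping
// function tells strings.Map to omit that rune from the result.
func stripNonPrintable(s string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsPrint(r) {
			return r
		}
		return -1
	}, s)
}

func main() {
	// Persian text with an embedded NUL and a byte order mark (U+FEFF).
	in := "سلام\x00 دنیا\ufeff!"
	fmt.Printf("%q\n", stripNonPrintable(in))
}
```

Note that unicode.IsPrint counts only the ASCII space as printable whitespace, so tabs and newlines are dropped too. It also rejects format characters such as U+200C (zero width non-joiner), which Persian text uses, so you may want to let that one through explicitly.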

Forud A

Aug 30, 2014, 5:58:02 AM
to golan...@googlegroups.com, fzero...@gmail.com

Thank you. But is there any regexp solution to this problem?

Tamás Gulácsi

Aug 30, 2014, 6:56:21 AM
to golan...@googlegroups.com
A regexp will be slower for this (checking each rune) than strings.Map.
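
For comparison, a regexp version might look like the sketch below. Go's regexp package has no "printable" class, so this assumes a negated set of Unicode categories (letters, marks, numbers, punctuation, symbols, plus the ASCII space) is close enough to what unicode.IsPrint accepts. The names nonPrintable and stripWithRegexp are hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
)

// nonPrintable matches runs of runes that are outside the letter, mark,
// number, punctuation, and symbol categories and are not the ASCII space.
var nonPrintable = regexp.MustCompile(`[^\p{L}\p{M}\p{N}\p{P}\p{S} ]+`)

// stripWithRegexp (hypothetical helper) deletes every such run.
func stripWithRegexp(s string) string {
	return nonPrintable.ReplaceAllString(s, "")
}

func main() {
	fmt.Printf("%q\n", stripWithRegexp("سلام\x00 دنیا\ufeff!"))
}
```

Compile the pattern once at package level as shown; recompiling it on every call would add avoidable cost on top of the matching itself.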

Lars Seipel

Aug 30, 2014, 7:06:01 AM
to Forud A, golan...@googlegroups.com
On Sat, Aug 30, 2014 at 02:58:02AM -0700, Forud A wrote:
>
> Thank you. But is there any regexp solution to this problem?

Why? What would be the benefit of using regular expressions? It's a
straightforward problem: loop over the string and build a new one,
dropping the stuff you don't want (preferably using a byte slice to
avoid creating garbage on each iteration). When done convert back to
string and return. There's even a library function that does this (and
more) for you, the mentioned strings.Map.
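
The hand-written loop described above could be sketched roughly like this. It assumes Go 1.18 or later for utf8.AppendRune, and the choice to keep tabs and newlines is only an example of a custom rule; cleanText is a made-up name:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// cleanText (hypothetical helper) copies the runes we want to keep into a
// byte slice and converts back to a string once at the end.
func cleanText(s string) string {
	buf := make([]byte, 0, len(s)) // one up-front allocation in the common case
	for _, r := range s {
		// Example policy: keep printable runes plus tab and newline.
		if unicode.IsPrint(r) || r == '\t' || r == '\n' {
			buf = utf8.AppendRune(buf, r)
		}
	}
	return string(buf)
}

func main() {
	fmt.Printf("%q\n", cleanText("line one\x07\n\tline two\x00"))
}
```

Writing the loop yourself mainly pays off when the keep/drop rule needs more context than a single rune; otherwise strings.Map does the same job.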

Forud A

Aug 30, 2014, 8:01:10 AM
to golan...@googlegroups.com, fzero...@gmail.com
For this problem, yes, it is better to use Map, but not everywhere. I want to select a word in any UTF-8 language with a regex, and THIS is one of my everyday tasks since I am not a native English speaker (I need to handle Persian, Arabic, and Urdu). Writing every available character into my regexp is not an option. HOW can I handle that? Is that even possible, like the 'u' modifier in PHP, or not?

Tamás Gulácsi

Aug 30, 2014, 9:00:55 AM
to golan...@googlegroups.com
For splitting, a simple function is better, imho.
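
One way such a splitting function might look, as a rough sketch: it treats letters, combining marks, and numbers as word characters and splits on everything else. The names isWordRune and splitWords are invented here, and the rule is deliberately naive:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isWordRune reports whether r counts as part of a word under this
// simple rule: any letter, combining mark, or number in any script.
func isWordRune(r rune) bool {
	return unicode.IsLetter(r) || unicode.IsMark(r) || unicode.IsNumber(r)
}

// splitWords (hypothetical helper) splits s at every run of non-word runes.
func splitWords(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool { return !isWordRune(r) })
}

func main() {
	fmt.Printf("%q\n", splitWords("سلام دنیا، hello-world 42"))
}
```

This rule also splits at U+200C (zero width non-joiner), which occurs inside Persian words, so real text may need extra cases in isWordRune.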

But for regexp, can't you use Unicode character classes such as \p{Greek}?
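
A short sketch of that idea. Go's regexp package always works on UTF-8, so there is no 'u' modifier to switch on, but \w and \s stay ASCII-only; Unicode-aware matching goes through \p{...} classes instead. The patterns and variable names below are illustrative choices, not the only option:

```go
package main

import (
	"fmt"
	"regexp"
)

// word matches runs of letters, combining marks, and numbers in any script.
var word = regexp.MustCompile(`[\p{L}\p{M}\p{N}]+`)

// arabicWord matches runs of Arabic-script characters only
// (the script used by Persian, Arabic, and Urdu).
var arabicWord = regexp.MustCompile(`\p{Arabic}+`)

func main() {
	s := "Hello, دنیا! 123"
	fmt.Printf("%q\n", word.FindAllString(s, -1))
	fmt.Printf("%q\n", arabicWord.FindAllString(s, -1))
}
```

Script classes such as \p{Arabic} or \p{Greek} and category classes such as \p{L} can be mixed inside one bracket expression, and \P{...} (capital P) negates a class.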

Taru Karttunen

Aug 30, 2014, 10:14:15 AM
to Forud A, golan...@googlegroups.com
On 30.08 05:01, Forud A wrote:
> For this problem, yes, it is better to use Map, but not everywhere. I
> want to select a word in any UTF-8 language with a regex, and THIS is one
> of my everyday tasks since I am not a native English speaker (I need to
> handle Persian, Arabic, and Urdu). Writing every available character into
> my regexp is not an option. HOW can I handle that? Is that even possible,
> like the 'u' modifier in PHP, or not?

Word-splitting international things is quite complex, and simple
Unicode handling is not enough. For more information see
http://www.unicode.org/reports/tr29/

- Taru Karttunen

Silvan Jegen

Sep 2, 2014, 5:54:05 AM
to golan...@googlegroups.com, fzero...@gmail.com
On Saturday, August 30, 2014 2:01:10 PM UTC+2, Forud A wrote:
> For this problem, yes, it is better to use Map, but not everywhere. I want to select a word in any UTF-8 language with a regex, and THIS is one of my everyday tasks since I am not a native English speaker (I need to handle Persian, Arabic, and Urdu). Writing every available character into my regexp is not an option. HOW can I handle that? Is that even possible, like the 'u' modifier in PHP, or not?

As Taru mentioned, for international text this issue is non-trivial.

In natural language processing this problem is known as tokenization: http://en.wikipedia.org/wiki/Tokenization

There may be some Go tokenization libraries that you can use. Here is one for Japanese for example:

  