Capture two-character country codes other than CN and KR with the match function

Hongyi Zhao

Dec 1, 2016, 1:44:27 AM

Hi all,

I use gawk's built-in match function to capture two-character country
codes other than CN and KR. Currently, I use the following code:

awk 'match($0, /...([A-Z]{2}).../, a) {
    if (a[1] != "CN" && a[1] != "KR") {
        do_something
    }
}'

I am looking for a more direct method that does this job without the
``if''.

Is there a more concise regexp for this?

Regards
--
.: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

Ed Morton

Dec 1, 2016, 10:35:56 AM

On 12/1/2016 12:44 AM, Hongyi Zhao wrote:
> Hi all,
>
> I use gawk built-in match function to capture two char Country codes
> other than CN and KR. Currently, I use the following code:
>
>
> awk 'match($0, /...([A-Z]{2}).../, a ) {
> if ( a[1] != "CN" && a[1] != "KR" ) {
> do_something
> }
>
> I try to find the direct method without using the ``if'' to do this job.
>
> Is there some other more concise regexp for this?

[A-BD-JL-Z][A-Z]|C[A-MO-Z]|K[A-QS-Z]

Read a book and THINK; there's really no substitute. Also, unless each line of
input is exactly 8 characters, your regexp will produce false matches, since
it's not using word boundaries or anchors.

Ed.
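
To illustrate the point about anchors, here is a minimal sketch in shell and awk. It assumes, purely for illustration, that every record is exactly 8 characters long with the country code in columns 4 and 5; the function name other_codes and the sample records are hypothetical. It sticks to POSIX awk features, so substr() stands in for gawk's capture array.

```sh
# Print the country code of each 8-character record whose code is not CN or KR.
# Assumption: the code sits in columns 4-5 of an exactly 8-character line.
other_codes() {
    LC_ALL=C awk '
        # ^ and $ tie the match to the whole line, so a longer line
        # cannot match at some arbitrary offset.
        /^...([A-BD-JL-Z][A-Z]|C[A-MO-Z]|K[A-QS-Z])...$/ {
            print substr($0, 4, 2)
        }'
}

printf '%s\n' abcUSxyz abcCNxyz abcKRxyz abcCAxyz xxabcCNxyzUSabc | other_codes
```

With these sample records the sketch prints US and CA. The last record contains a valid 8-character window ("xyzUSabc"), which the unanchored form would have accepted.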

Kaz Kylheku

Dec 1, 2016, 12:18:07 PM

On 2016-12-01, Ed Morton <morto...@gmail.com> wrote:
> Read a book and THINK, there's really no substitute.

Well, yes there kind of is, namely: read a *newsgroup* and THINK.

Based on how well that is working, I don't estimate great results for
read-a-book-and-think.

Lorenz

Dec 2, 2016, 3:32:13 AM

Hongyi Zhao wrote:

>Hi all,
>
>I use gawk built-in match function to capture two char Country codes
>other than CN and KR. Currently, I use the following code:
>
>
> awk 'match($0, /...([A-Z]{2}).../, a ) {
> if ( a[1] != "CN" && a[1] != "KR" ) {
> do_something
>}
>
>I try to find the direct method without using the ``if'' to do this job.

What about

> awk 'match($0, /...([A-Z]{2}).../, a) && a[1] != "CN" && a[1] != "KR" {
>     do_something with a[1]
> }'

or

> awk 'match($0, /...([A-Z]{2}).../, a) && !/...(CN|KR).../ {
>     do_something with a[1]
> }'
--

Lorenz
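
The idea of moving the test into the pattern can be sketched without gawk extensions. The version below is an illustration only: it uses the two-argument POSIX match() with RSTART and substr() instead of the capture array, and [A-Z][A-Z] instead of [A-Z]{2} for awks without interval support. The function name codes_except_cn_kr and the sample records are hypothetical.

```sh
# Print the two-letter code of every line whose code is not CN or KR.
codes_except_cn_kr() {
    LC_ALL=C awk '
        # match() sets RSTART; the code is the 4th and 5th character
        # of the matched text. The whole test lives in the pattern,
        # so the action needs no if statement.
        match($0, /...[A-Z][A-Z].../) && (code = substr($0, RSTART + 3, 2)) != "CN" && code != "KR" {
            print code
        }'
}

printf '%s\n' abcUSxyz abcCNxyz abcKRxyz abcJPxyz | codes_except_cn_kr
```

For these four records the sketch prints US and JP.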

Hongyi Zhao

Dec 2, 2016, 6:34:53 AM

On Fri, 02 Dec 2016 08:27:52 +0000, Lorenz wrote:

> what about
>
>> awk 'match($0, /...([A-Z]{2}).../, a ) && a[1] != "CN" && a[1] !=
>> "KR" ) {
>> do_something with a[1]
>> }
>
> or
>
>> awk 'match($0, /...([A-Z]{2}).../, a ) && !/...(CN|KR).../ {
>> do_something with a[1]
>> }
> --
>
> Lorenz

Thanks.

Anton Treuenfels

Dec 11, 2016, 4:14:38 PM

"Hongyi Zhao" <hongy...@gmail.com> wrote in message
news:o1ogs9$k25$1...@aspen.stu.neva.ru...

> Hi all,
>
> I use gawk built-in match function to capture two char Country codes
> other than CN and KR. Currently, I use the following code:
>
>
> awk 'match($0, /...([A-Z]{2}).../, a ) {
> if ( a[1] != "CN" && a[1] != "KR" ) {
> do_something
> }
>
> I try to find the direct method without using the ``if'' to do this job.
>
> Is there some other more concise regexp for this?

if ( match($0, /...([^CK].|C[^N]|K[^R]).../, a) ) {
    do_something
}

Note that this will produce false matches if those two middle characters
are not always both uppercase alphabetic.

- Anton Treuenfels
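
The false matches mentioned above are easy to reproduce. The following sketch feeds a few hypothetical 8-character records through the same negated-bracket expression; the function name negated_match is made up for the illustration, and only POSIX awk features are used.

```sh
# Print every line accepted by the negated-bracket expression.
negated_match() {
    LC_ALL=C awk '/...([^CK].|C[^N]|K[^R]).../ { print }'
}

printf '%s\n' abcUSxyz abc12xyz abcusxyz abcC1xyz abcCNxyz abcKRxyz | negated_match
```

CN and KR are rejected as intended, but "abc12xyz", "abcusxyz" and "abcC1xyz" are accepted alongside "abcUSxyz", because a negated bracket such as [^N] matches any character other than N, not just uppercase letters.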

Hongyi Zhao

Dec 12, 2016, 8:22:45 PM

On Sun, 11 Dec 2016 15:14:26 -0600, Anton Treuenfels wrote:

> if ( match($0, /...([^CK].|C[^N]|K[^R]).../, a) ) {
> do_something
> }
>
> which will produce false matches if those two middle characters are not
> always both uppercase alphabetic.

Thanks a lot.

>
> - Anton Treuenfels

Hongyi Zhao

Dec 13, 2016, 9:01:44 AM

On Sun, 11 Dec 2016 15:14:26 -0600, Anton Treuenfels wrote:

> if ( match($0, /...([^CK].|C[^N]|K[^R]).../, a) ) {
> do_something
> }
>
> which will produce false matches if those two middle characters are not
> always both uppercase alphabetic.

Thanks. On second thought, I have two additional notes based on your
solution:

[1] The `if' is not needed here: a nonzero return value is true, so
match() can serve directly as the pattern. From the match function's
description in the manual:

Return the position in s where the regular
expression r occurs, or 0 if r is not present...

[2] To work around the failure case, I revised your code into the
following form:

match($0, /...([^CK][A-Z]|C[^N]|K[^R]).../, a) {
    do_something
}

Regards

>
> - Anton Treuenfels

Hongyi Zhao

Dec 13, 2016, 9:05:30 AM

On Tue, 13 Dec 2016 14:01:43 +0000, Hongyi Zhao wrote:

> match($0, /...([^CK][A-Z]|C[^N]|K[^R]).../, a) {
> do_something
> }

Should be:

match($0, /...([ABD-JL-Z][A-Z]|C[^N]|K[^R]).../, a) {
do_something
}

Regards

Hongyi Zhao

Dec 13, 2016, 9:13:18 AM

On Thu, 01 Dec 2016 09:35:51 -0600, Ed Morton wrote:

> [A-BD-JL-Z][A-Z]|C[A-MO-Z]|K[A-QS-Z]

I think there is no need for the `-' between A and B, since [A-B] and [AB]
match the same two characters:

[ABD-JL-Z][A-Z]|C[A-MO-Z]|K[A-QS-Z]

Thanks again.
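
One way to check that both spellings behave the same, and that they reject only CN and KR, is to run every two-letter uppercase code through the expression. The sketch below does that in POSIX awk; the function name rejected_codes is hypothetical, and the expression is passed in as a dynamic regexp wrapped in ^( ... )$ so that it must cover the whole code.

```sh
# Print every two-letter uppercase code that the given expression rejects.
rejected_codes() {
    LC_ALL=C awk -v re="^($1)\$" 'BEGIN {
        letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        # Enumerate all 26 * 26 = 676 candidate codes, AA through ZZ.
        for (i = 1; i <= 26; i++)
            for (j = 1; j <= 26; j++) {
                code = substr(letters, i, 1) substr(letters, j, 1)
                if (code !~ re)
                    print code
            }
    }'
}

rejected_codes '[A-BD-JL-Z][A-Z]|C[A-MO-Z]|K[A-QS-Z]'
rejected_codes '[ABD-JL-Z][A-Z]|C[A-MO-Z]|K[A-QS-Z]'
```

Each call prints CN and KR and nothing else, so the two bracket expressions are interchangeable here.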

Anton Treuenfels

Dec 14, 2016, 11:05:40 AM

"Hongyi Zhao" <hongy...@gmail.com> wrote in message

news:o2ov78$nsg$2...@aspen.stu.neva.ru...

> On Tue, 13 Dec 2016 14:01:43 +0000, Hongyi Zhao wrote:
>
> match($0, /...([ABD-JL-Z][A-Z]|C[^N]|K[^R]).../, a) {
> do_something
> }
>

This still doesn't quite get around the problem of non-alphabetic characters
completely. For instance, the character following 'C' or 'K' could be
numeric or lower case. Maybe a second pattern would help:

/...[A-Z]{2}.../ && match( $0, /...([^CK].|C[^N]|K[^R]).../, a ) {
do_something
}

Of course if the only reason the array 'a' exists is to check for 'CN' or
'KR', it can be discarded as no longer necessary:

/...[A-Z]{2}.../ && /...([^CK].|C[^N]|K[^R]).../ {
do_something
}

- Anton Treuenfels
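
The two-pattern approach can be tried out with the sketch below. Two things in it are assumptions added for the illustration rather than part of the posted code: the ^ and $ anchors, which force both patterns to test the same columns of an 8-character record, and [A-Z][A-Z] in place of [A-Z]{2} for awks without interval support. The function name valid_other_codes and the sample records are hypothetical.

```sh
# Print records whose middle pair is two uppercase letters other than CN or KR.
valid_other_codes() {
    LC_ALL=C awk '
        # First pattern: both middle characters are uppercase letters.
        # Second pattern: the pair is neither CN nor KR.
        /^...[A-Z][A-Z]...$/ && /^...([^CK].|C[^N]|K[^R])...$/ { print }'
}

printf '%s\n' abcUSxyz abcC1xyz abcusxyz abcCNxyz abcKExyz | valid_other_codes
```

Only "abcUSxyz" and "abcKExyz" are printed: the first pattern filters out "abcC1xyz" and "abcusxyz", and the second filters out "abcCNxyz".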