Capture two char Country codes other than CN and KR with match function.

Hongyi Zhao

Dec 1, 2016, 1:44:27 AM
Hi all,

I use gawk's built-in match function to capture two-character country
codes other than CN and KR. Currently, I use the following code:


awk 'match($0, /...([A-Z]{2}).../, a ) {
    if ( a[1] != "CN" && a[1] != "KR" ) {
        do_something
    }
}'

I'm trying to find a direct method that does this job without the ``if''.

Is there a more concise regexp for this?

Regards
--
.: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

Ed Morton

Dec 1, 2016, 10:35:56 AM
On 12/1/2016 12:44 AM, Hongyi Zhao wrote:
> Hi all,
>
> I use gawk built-in match function to capture two char Country codes
> other than CN and KR. Currently, I use the following code:
>
>
> awk 'match($0, /...([A-Z]{2}).../, a ) {
> if ( a[1] != "CN" && a[1] != "KR" ) {
> do_something
> }
>
> I try to find the direct method without using the ``if'' to do this job.
>
> Is there some other more concise regexp for this?

[A-BD-JL-Z][A-Z]|C[A-MO-Z]|K[A-QS-Z]

Read a book and THINK, there's really no substitute. Also, unless each line of
input is exactly 8 characters your regexp will produce false matches since it's
not using word boundaries or anchors.
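
A minimal sketch of that alternation in use, with anchors added so that only whole 8-character lines are considered. The sample lines and the helper name filter_codes are made up for illustration. It uses POSIX two-argument match with RSTART and substr in place of gawk's array argument, purely so that it also runs under other awks.

```shell
# filter_codes: hypothetical helper. Prints the two-letter code from each
# input line of the form xxxCCxxx, skipping CN and KR.
filter_codes() {
  awk '
    # Anchored with ^ and $ so that only whole 8-character lines can match.
    match($0, /^...([ABD-JL-Z][A-Z]|C[A-MO-Z]|K[A-QS-Z])...$/) {
      # The code starts right after the three leading characters.
      print substr($0, RSTART + 3, 2)
    }
  '
}

# Made-up sample input.
printf '%s\n' 'abcUSdef' 'abcCNdef' 'abcKRdef' 'abcCAdef' 'abcKPdef' 'abcus123' | filter_codes
# prints:
# US
# CA
# KP
```

CN and KR are rejected by the character classes themselves, so no separate comparison is needed.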

Ed.

Kaz Kylheku

Dec 1, 2016, 12:18:07 PM
On 2016-12-01, Ed Morton <morto...@gmail.com> wrote:
> Read a book and THINK, there's really no substitute.

Well, yes there kind of is, namely: read a *newsgroup* and THINK.

Based on how well that is working, I don't estimate great results for
read-a-book-and-think.

Lorenz

Dec 2, 2016, 3:32:13 AM
Hongyi Zhao wrote:

>Hi all,
>
>I use gawk built-in match function to capture two char Country codes
>other than CN and KR. Currently, I use the following code:
>
>
> awk 'match($0, /...([A-Z]{2}).../, a ) {
> if ( a[1] != "CN" && a[1] != "KR" ) {
> do_something
>}
>
>I try to find the direct method without using the ``if'' to do this job.


what about

> awk 'match($0, /...([A-Z]{2}).../, a ) && a[1] != "CN" && a[1] != "KR" {
> do_something with a[1]
> }

or

> awk 'match($0, /...([A-Z]{2}).../, a ) && !/...(CN|KR).../ {
> do_something with a[1]
> }
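
The two variants above are not quite equivalent once a line is longer than eight characters: the second one tests the whole line for CN or KR, not the position that match found. A minimal sketch of the difference follows, using made-up sample lines and a hypothetical helper name, compare_variants. It uses POSIX two-argument match with RSTART and substr in place of gawk's array argument, and writes [A-Z][A-Z] for [A-Z]{2}, purely so that it also runs under other awks.

```shell
# compare_variants: hypothetical helper. For each line with a match, prints
# what the first variant and the second variant would act on ("-" = skipped).
compare_variants() {
  awk '
    match($0, /...[A-Z][A-Z].../) {
      # Two-letter code at the leftmost match, after three leading characters.
      code = substr($0, RSTART + 3, 2)
      # Variant 1: compare the captured code itself.
      v1 = (code != "CN" && code != "KR") ? code : "-"
      # Variant 2: reject the line if CN or KR appears in that shape anywhere.
      v2 = ($0 !~ /...(CN|KR).../) ? code : "-"
      print v1, v2
    }
  '
}

# Made-up sample input.
printf '%s\n' 'abcUSdef' 'abcCNdef' 'abcUSdefCNghi' | compare_variants
# prints:
# US US
# - -
# US -
```

On the third line the captured code is US, yet the second variant skips the line because CN occurs later in it.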
--

Lorenz

Hongyi Zhao

Dec 2, 2016, 6:34:53 AM
On Fri, 02 Dec 2016 08:27:52 +0000, Lorenz wrote:

> what about
>
>> awk 'match($0, /...([A-Z]{2}).../, a ) && a[1] != "CN" && a[1] != "KR" {
>> do_something with a[1]
>> }
>
> or
>
>> awk 'match($0, /...([A-Z]{2}).../, a ) && !/...(CN|KR).../ {
>> do_something with a[1]
>> }
> --
>
> Lorenz

Thanks.

Anton Treuenfels

Dec 11, 2016, 4:14:38 PM

"Hongyi Zhao" <hongy...@gmail.com> wrote in message
news:o1ogs9$k25$1...@aspen.stu.neva.ru...
> Hi all,
>
> I use gawk built-in match function to capture two char Country codes
> other than CN and KR. Currently, I use the following code:
>
>
> awk 'match($0, /...([A-Z]{2}).../, a ) {
> if ( a[1] != "CN" && a[1] != "KR" ) {
> do_something
> }
>
> I try to find the direct method without using the ``if'' to do this job.
>
> Is there some other more concise regexp for this?

if ( match($0, /...([^CK].|C[^N]|K[^R]).../, a) ) {
do_something
}

which will produce false matches if those two middle characters are not
always both uppercase alphabetic.

- Anton Treuenfels

Hongyi Zhao

Dec 12, 2016, 8:22:45 PM
On Sun, 11 Dec 2016 15:14:26 -0600, Anton Treuenfels wrote:

> if ( match($0, /...([^CK].|C[^N]|K[^R]).../, a) ) {
> do_something
> }
>
> which will produce false matches if those two middle characters are not
> always both uppercase alphabetic.

Thanks a lot.

>
> - Anton Treuenfels

Hongyi Zhao

Dec 13, 2016, 9:01:44 AM
On Sun, 11 Dec 2016 15:14:26 -0600, Anton Treuenfels wrote:

> if ( match($0, /...([^CK].|C[^N]|K[^R]).../, a) ) {
> do_something
> }
>
> which will produce false matches if those two middle characters are not
> always both uppercase alphabetic.

Thanks. On second thought, I have two additional notes on your
solution:

[1] The `if' is not needed here: the value returned by match can serve
directly as the pattern. From the function's description in the manual:

Return the position in s where the regular
expression r occurs, or 0 if r is not present...


[2] To work around the failure case, I revised your code into the
following form:

match($0, /...([^CK][A-Z]|C[^N]|K[^R]).../, a) {
do_something
}

Regards

>
> - Anton Treuenfels

Hongyi Zhao

Dec 13, 2016, 9:05:30 AM
On Tue, 13 Dec 2016 14:01:43 +0000, Hongyi Zhao wrote:

> match($0, /...([^CK][A-Z]|C[^N]|K[^R]).../, a) {
> do_something
> }

Should be:

match($0, /...([ABD-JL-Z][A-Z]|C[^N]|K[^R]).../, a) {
do_something
}

Regards

Hongyi Zhao

Dec 13, 2016, 9:13:18 AM
On Thu, 01 Dec 2016 09:35:51 -0600, Ed Morton wrote:

> [A-BD-JL-Z][A-Z]|C[A-MO-Z]|K[A-QS-Z]

I think there is no need for the `-' between A and B:

[ABD-JL-Z][A-Z]|C[A-MO-Z]|K[A-QS-Z]

Thanks again.

Anton Treuenfels

Dec 14, 2016, 11:05:40 AM

"Hongyi Zhao" <hongy...@gmail.com> wrote in message
news:o2ov78$nsg$2...@aspen.stu.neva.ru...
> On Tue, 13 Dec 2016 14:01:43 +0000, Hongyi Zhao wrote:
>
> match($0, /...([ABD-JL-Z][A-Z]|C[^N]|K[^R]).../, a) {
> do_something
> }
>

This still doesn't quite get around the problem of non-alphabetic characters
completely. For instance, the character following 'C' or 'K' could be
numeric or lower case. Maybe a second pattern would help:

/...[A-Z]{2}.../ && match( $0, /...([^CK].|C[^N]|K[^R]).../, a ) {
do_something
}

Of course if the only reason the array 'a' exists is to check for 'CN' or
'KR', it can be discarded as no longer necessary:

/...[A-Z]{2}.../ && /...([^CK].|C[^N]|K[^R]).../ {
do_something
}
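
A minimal sketch comparing the single-pattern revision with the two-pattern form. The sample lines and the helper name check_lines are made up. Both forms are anchored here on the assumption that each line is exactly eight characters, so the two patterns are sure to test the same columns. [A-Z][A-Z] stands in for [A-Z]{2} so the sketch does not depend on interval-expression support.

```shell
# check_lines: hypothetical helper. Prints each line followed by two flags:
# whether the single pattern accepts it, and whether the two-pattern form does.
check_lines() {
  awk '
    {
      # Single pattern: C[^N] and K[^R] also accept "C1" or "Kr".
      single = ($0 ~ /^...([ABD-JL-Z][A-Z]|C[^N]|K[^R])...$/)
      # Two patterns: require two uppercase letters, then exclude CN and KR.
      both = ($0 ~ /^...[A-Z][A-Z]...$/ && $0 ~ /^...([^CK].|C[^N]|K[^R])...$/)
      print $0, single, both
    }
  '
}

# Made-up sample input.
printf '%s\n' 'abcUSdef' 'abcCNdef' 'abcC1def' 'abcKrdef' | check_lines
# prints:
# abcUSdef 1 1
# abcCNdef 0 0
# abcC1def 1 0
# abcKrdef 1 0
```

Without the anchors, the two patterns are evaluated independently and may match at different positions on longer lines.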

- Anton Treuenfels
