Nice trim/strip/chop in awk

Ian Zimmerman

Jul 8, 2017, 12:30:00 AM

Is there a concise oneliner, in _standard_ awk, to do what strip does in
python and chop does in perl? Not only on $0 but separately on each
field.

I can see a way with the match function with the nonstandard gawk third
argument, but otherwise awk's matching apparatus seems to suffer
severely from the lack of backreferences.

Motivation: I have input like

foo : bar

foobar : baz
# This can happen, too
blah:etch

and I need to generate output where whitespace is significant (so must
be absent, in the above example).

--
Please don't Cc: me privately on mailing lists and Usenet,
if you also post the followup to the list or newsgroup.
Do obvious transformation on domain to reply privately _only_ on Usenet.

Janis Papanagnou

Jul 8, 2017, 2:02:36 AM

On 08.07.2017 06:29, Ian Zimmerman wrote:
> Is there a concise oneliner, in _standard_ awk, to do what strip does in
> python and chop does in perl? Not only on $0 but separately on each
> field.

What do those functions do in those languages? (Strip leading/trailing
whitespace?)

>
> I can see a way with the match function with the nonstandard gawk third
> argument, but otherwise awk's matching apparatus seems to suffer
> severely from the lack of backreferences.
>
> Motivation: I have input like
>
> foo : bar
>
> foobar : baz

What is the field separator in your samples here; a colon, or colon
with surrounding blanks, or is the above just one field?

> # This can happen, too
> blah:etch
>
> and I need to generate output where whitespace is significant (so must
> be absent, in the above example).

*Significant* whitespace must be absent? - It would be helpful to show
expected output for your samples.

Generally, if you want to work on individual fields in standard awk you
iterate over the fields and use a gsub() call to remove undesired data:
gsub(/^ +| +$/,"",$i)
Another (a bit bulky) option is to avoid a loop and work directly on
the whole line with a constructed dynamic regexp with FS and spaces to
be replaced by your desired OFS.
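
The loop approach described above can be sketched as follows. This is only an illustration: the helper name trim_fields is made up, and ":" is assumed for both FS and OFS.

```shell
# trim_fields is a hypothetical helper name; FS and OFS are both assumed to be ":".
trim_fields() {
    awk -F: -v OFS=: '{
        # Strip leading and trailing blanks from each field in turn.
        for (i = 1; i <= NF; i++)
            gsub(/^ +| +$/, "", $i)
        print
    }'
}

printf 'foo : bar\n  foobar : baz  \nblah:etch\n' | trim_fields
```

Assigning to a field makes awk rebuild $0 with OFS, which is why OFS is set to ":" here.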

Janis

Ben Bacarisse

Jul 8, 2017, 6:32:07 AM

Ian Zimmerman <i...@primate.usenet-nospam-remove.net> writes:

> Is there a concise oneliner, in _standard_ awk, to do what strip does in
> python and chop does in perl? Not only on $0 but separately on each
> field.
>
> I can see a way with the match function with the nonstandard gawk third
> argument, but otherwise awk's matching apparatus seems to suffer
> severely from the lack of backreferences.
>
> Motivation: I have input like
>
> foo : bar
>
> foobar : baz
> # This can happen, too
> blah:etch
>
> and I need to generate output where whitespace is significant (so must
> be absent, in the above example).

Some thoughts...

If the data in the fields never contain spaces you can switch the field
separator to a single space and use counting to ignore what used to be
the separators (the fields become $1, $3, $5 etc).

If you make the field separator [[:space:]]*:[[:space:]]*, then you only
need to trim leading space from $1 and trailing space from $NF.
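
A rough sketch of that regexp-separator idea, assuming only blanks and tabs need handling ([ \t] is used instead of [[:space:]] so that it also runs on older awks without POSIX classes). The helper name split_on_padded_colon is made up for the example.

```shell
# split_on_padded_colon is a hypothetical helper name.
split_on_padded_colon() {
    awk -F'[ \t]*:[ \t]*' -v OFS=: '{
        sub(/^[ \t]+/, "", $1)    # leading whitespace can only survive in $1
        sub(/[ \t]+$/, "", $NF)   # trailing whitespace can only survive in $NF
        $1 = $1                   # force $0 to be rebuilt with OFS
        print
    }'
}

printf 'foo : bar\n  foobar : baz  \nblah:etch\n' | split_on_padded_colon
```

The `$1 = $1` assignment is there because a line with nothing to trim would otherwise be printed with its original separators.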

If neither will fly, you'll probably need to use sub and/or gsub on the
fields.

--
Ben.

Ed Morton

Jul 8, 2017, 9:04:15 AM

On 7/7/2017 11:29 PM, Ian Zimmerman wrote:
> Is there a concise oneliner, in _standard_ awk, to do what strip does in
> python and chop does in perl? Not only on $0 but separately on each
> field.

Tell us what those functions do and provide concise, testable sample input and
expected output. You'll find a lot of awk experts here but YMMV with python and
perl experience.

> I can see a way with the match function with the nonstandard gawk third
> argument, but otherwise awk's matching apparatus seems to suffer
> severely from the lack of backreferences.
>
> Motivation: I have input like
>
> foo : bar
>
> foobar : baz
> # This can happen, too
> blah:etch
>
> and I need to generate output where whitespace is significant (so must
> be absent, in the above example).
>

"is significant and so must be absent" doesn't make sense. Do you mean "is
insignificant and so must be absent" or "is significant and so must be present"
or something else?

Ed.

Ben Bacarisse

Jul 8, 2017, 9:45:15 AM

Ed Morton <morto...@gmail.com> writes:

> On 7/7/2017 11:29 PM, Ian Zimmerman wrote:

<snip>

>> Motivation: I have input like
>>
>> foo : bar
>>
>> foobar : baz
>> # This can happen, too
>> blah:etch
>>
>> and I need to generate output where whitespace is significant (so must
>> be absent, in the above example).
>
> "is significant and so must be absent" doesn't make sense. Do you mean
> "is insignificant and so must be absent" or "is significant and so
> must be present" or something else?

I'd put money on "I need to generate an output format where whitespace
is significant, so leading and trailing whitespace must be removed from
the source data".

--
Ben.

Ed Morton

Jul 8, 2017, 8:29:18 PM

OK, that makes sense. Then given this input and assuming ":" is the field separator:

$ cat file
foo : bar
foobar : baz
blah:etch

with a POSIX awk you'd do:

$ awk '{gsub(/[[:space:]]+:[[:space:]]+/,":");
gsub(/^[[:space:]]+|[[:space:]]+$/,"")}1' file
foo:bar
foobar:baz
blah:etch

and with GNU awk:

$ awk '{$0=gensub(/\s+(:)\s+|^\s+|\s+$/,"\\1","g")}1' file
foo:bar
foobar:baz
blah:etch
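
One caveat, offered as a sketch rather than a correction: both commands above only match a colon that has whitespace on both sides, so a line such as "foo: bar" would keep its space. If one-sided padding can occur in the data (an assumption, since the sample input does not show it), using * instead of + around the colon covers it. The helper name strip_around_colon is made up.

```shell
# strip_around_colon is a hypothetical helper name.
strip_around_colon() {
    awk '{
        gsub(/[[:space:]]*:[[:space:]]*/, ":")     # padding on either side, or none
        gsub(/^[[:space:]]+|[[:space:]]+$/, "")    # leading and trailing whitespace
        print
    }'
}

printf 'foo: bar\nfoobar :baz\n  blah : etch  \n' | strip_around_colon
```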

Ed.

Ian Zimmerman

Jul 9, 2017, 10:56:57 AM

On 2017-07-08 19:29, Ed Morton wrote:

> On 7/8/2017 8:45 AM, Ben Bacarisse wrote:

> > I'd put money on "I need to generate an output format where
> > whitespace is significant, so leading and trailing whitespace must
> > be removed from the source data".

Yes, that nails it precisely. Thanks for expanding my overly terse
language, Ben.

> with a POSIX awk you'd do:
>
> $ awk '{gsub(/[[:space:]]+:[[:space:]]+/,":");
> gsub(/^[[:space:]]+|[[:space:]]+$/,"")}1' file
> foo:bar
> foobar:baz
> blah:etch

Not quite as short as I hoped, but it will do. Thanks!

Kaz Kylheku

Jul 9, 2017, 11:07:34 AM

On 2017-07-09, Ian Zimmerman <i...@primate.usenet-nospam-remove.net> wrote:
> On 2017-07-08 19:29, Ed Morton wrote:
>
>> On 7/8/2017 8:45 AM, Ben Bacarisse wrote:
>
>> > I'd put money on "I need to generate an output format where
>> > whitespace is significant, so leading and trailing whitespace must
>> > be removed from the source data".
>
> Yes, that nails it precisely. Thanks for expanding my overly terse
> language, Ben.
>
>> with a POSIX awk you'd do:
>>
>> $ awk '{gsub(/[[:space:]]+:[[:space:]]+/,":");
>> gsub(/^[[:space:]]+|[[:space:]]+$/,"")}1' file
>> foo:bar
>> foobar:baz
>> blah:etch
>
> Not quite as short as I hoped, but it will do. Thanks!

The [[:space:]] verbosity denotes some poorly defined set of characters
influenced by the dark magic of locale. You can replace that with a
literal space character to just match spaces or else a character class
nominating the exact set of whitespace characters you care about.

In awk, you can specify a tab as \t, so [\t ]+ is possible for matching
tabs and spaces.
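
As an illustration of naming the exact set, the bracket expression can be passed in as a variable and used to build the regexps dynamically. This is a minimal sketch: trim_ws and its ws variable are hypothetical names, and the default set of blank plus tab is an assumption.

```shell
# trim_ws is a hypothetical helper; its optional argument is the bracket
# expression to treat as whitespace (default: blank and tab).
trim_ws() {
    ws=${1:-'[ \t]'}
    awk -v ws="$ws" '{
        gsub(ws "*:" ws "*", ":")        # padding around the separator
        gsub("^" ws "+|" ws "+$", "")    # leading and trailing padding
        print
    }'
}

printf 'foo\t:\tbar\n' | trim_ws        # tabs count as whitespace
printf 'foo\t: bar \n' | trim_ws '[ ]'  # only blanks count, the tab is kept
```

awk expands \t in a -v assignment, so the dynamic regexp ends up containing a literal tab inside the brackets.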

Janis Papanagnou

Jul 9, 2017, 11:28:41 AM

Indeed. Most likely (in context of a FS=":")

{ gsub(/^ +| +$/,"") ; gsub(/ +: +/,":") }

will suffice, or (in context of a FS=" *: *", so that lines like "blah:etch" are split as well) even

{ gsub(/^ +| +$/,"") }

which, I suppose, should be terse enough for the OP. It should be noted
that the latter will clean up the fields in the first place, and if the
spurious spaces are still necessary in the awk program in another place
then the former variant should be used.
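
To make the difference visible, here is a small sketch of the second variant that prints both $0 and the fields. The helper name show_fields is made up, and " *: *" is assumed as the separator so that unpadded colons split too.

```shell
# show_fields is a hypothetical helper name.
show_fields() {
    awk -F' *: *' '{
        gsub(/^ +| +$/, "")    # trims $0; awk then re-splits the fields
        print "$0=[" $0 "] $1=[" $1 "] $2=[" $2 "]"
    }'
}

printf '  foo : bar  \n' | show_fields
# prints: $0=[foo : bar] $1=[foo] $2=[bar]
```

The fields come out clean, but $0 still carries the blanks around the colon until a field is assigned and $0 is rebuilt with OFS.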

Janis

Ed Morton

Jul 9, 2017, 12:56:33 PM

The [[:space:]] is to accommodate any white space, including newlines when the
RS is something other than the default. Unless someone says "the only white
space in my data is blank chars" then use [[:space:]].
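
A small sketch of the newline case, assuming paragraph mode (RS="") and made-up input in which a record wraps after the colon. The helper name strip_paragraph is hypothetical.

```shell
# strip_paragraph is a hypothetical helper name.
strip_paragraph() {
    awk 'BEGIN { RS = "" } {
        # In paragraph mode a record can contain newlines, which
        # [[:space:]] matches but a literal blank does not.
        gsub(/[[:space:]]*:[[:space:]]*/, ":")
        gsub(/^[[:space:]]+|[[:space:]]+$/, "")
        print
    }'
}

printf 'foo :\n bar\n\nblah:etch\n' | strip_paragraph
```

With a blank-only pattern the first record would keep the embedded newline after the colon.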

Ed.

Janis Papanagnou

Jul 9, 2017, 2:07:36 PM

On 09.07.2017 18:56, Ed Morton wrote:
> On 7/9/2017 10:28 AM, Janis Papanagnou wrote:
>>

>> Indeed. Most likely [...] will suffice, or [...]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>
> The [[:space:]] is to accommodate any white space, including newlines when the
> RS is something other than the default. Unless someone says "the only white
> space in my data is blank chars" then use [[:space:]].

The OP's posting was not very precise or reliable. I certainly don't expect
him to have newlines in his data as you propose. He wanted something terse
for sure. OTOH he gave me the impression that he already knows how to extend
a sub-expression from " " to "[ \t]", or to [[:char-class:]] _if_ necessary.

Janis

>
> Ed.

Kaz Kylheku

Jul 9, 2017, 2:38:31 PM

Nonsense. Usually when people say "space" they do not mean "also include
a double-width Chinese character space, a Klingon space, the U+00A0 hard
space, and whatever else".

By using the [[:space:]], you're quite likely inventing requirements which
aren't there.

If you suspect that there is a requirement for those spaces, that needs
to be clarified. Without clarification, the better assumption is
conservative.

Ed Morton

Jul 9, 2017, 3:49:10 PM

On 7/9/2017 1:07 PM, Janis Papanagnou wrote:
> On 09.07.2017 18:56, Ed Morton wrote:
>> On 7/9/2017 10:28 AM, Janis Papanagnou wrote:
>>>
>>> Indeed. Most likely [...] will suffice, or [...]
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> The [[:space:]] is to accommodate any white space, including newlines when the
>> RS is something other than the default. Unless someone says "the only white
>> space in my data is blank chars" then use [[:space:]].
>
> The OP's posting was not very precise or reliable. I certainly don't expect
> him to have newlines in his data as you propose.

I frequently have newlines in my data, e.g. when using RS="" or RS="some
regexp", so I personally don't think it's useful to avoid using [[:space:]] but
YMMV I suppose. All I saw was that he wanted "to generate output where
whitespace ... must be absent" and removing [[:space:]] characters seems like
the obvious way to do that rather than guessing at which white space characters
(s)he might have in their data.

> He wanted something terse
> for sure. OTOH he gave me the impression that he already knows how to extend
> a sub-expression from " " to "[ \t]", or to [[:char-class:]] _if_ necessary.

That's fine, whatever works for him/her.

Ed.
>
> Janis
>
>>
>> Ed.
>