
split field by delimiter


moonhkt

Dec 12, 2008, 1:50:12 AM

Hi All

Data
cat grep.txt
"BA" "TSI-000000" 1 "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" ?


There is one space inside "ZOBHA PSL/G".

How can I split the fields on the "space" delimiter using awk?

I tried this, but it does not work:
cat grep.text | awk '{printf("%3s ", NR); $1=$1}1' RS=" " OFS="\n"

Result should be
1 "BA"
2 "TSI-000000"
3 1
4 "ZOBHA PSL/G"
5 3.3
6 33
7 14.28
8 0
9 0
10 0
11 0
12 ""
13 ?

Dave B

Dec 12, 2008, 4:44:42 AM
moonhkt wrote:

Basically, you need a dedicated parser here. Assuming there is exactly a
single space between fields, and no spaces at the beginning or end of line,
see if this works (adapted from http://awk.freeshell.org/AwkTips#toc6):

{
  $0=$0" ";
  while($0) {
    match($0,/"[^"]*" |[^ ]* /);
    sf=f=substr($0,RSTART,RLENGTH);
    sub(/ $/,"",f);
    print "Field " ++c " is " f;
    sub(sf,"");
  }
}

With your input, the above code outputs

Field 1 is "BA"
Field 2 is "TSI-000000"
Field 3 is 1
Field 4 is "ZOBHA PSL/G"
Field 5 is 3.3
Field 6 is 33
Field 7 is 14.28
Field 8 is 0
Field 9 is 0
Field 10 is 0
Field 11 is 0
Field 12 is ""
Field 13 is ?

--
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=O""O;while(X++<=x+o+o)c=c"%c";
printf c,(x-O)*(x-O),x*(x-o)-o,x*(x-O)+x-O-o,+x*(x-O)-x+o,X*(o*o+O)+x-O,
X*(X-x)-o*o,(x+X)*o*o+o,x*(X-x)-O-O,x-O+(O+o+X+x)*(o+O),X*X-X*(x-O)-x+O,
O+X*(o*(o+O)+O),+x+O+X*o,x*(x-o),(o+X+x)*o*o-(x-O-O),O+(X-x)*(X+O),x-O}'

moonhkt

Dec 12, 2008, 9:56:25 AM
> see if this works (adapted from http://awk.freeshell.org/AwkTips#toc6):

cat grep.text
"BA" "TSI-000000" 1 "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" 4 ?

cat grep.text | awk '{
  $0=$0" ";
  while($0) {
    match($0,/"[^"]*" |[^ ]* /);
    sf=f=substr($0,RSTART,RLENGTH);
    sub(/ $/,"",f);
    print "Field " ++c " is " f;
    sub(sf,"");
  }
}'


Field 1 is "BA"
Field 2 is "TSI-000000"
Field 3 is 1
Field 4 is "ZOBHA PSL/G"
Field 5 is 3.3
Field 6 is 33
Field 7 is 14.28
Field 8 is 0
Field 9 is 0
Field 10 is 0
Field 11 is 0
Field 12 is ""
Field 13 is 4
Field 14 is ?
awk: 0602-521 There is a regular expression error.
?*+ not preceded by valid expression.

The input line number is 1.
The source line number is 8.

The "?" causes the error above.

Dave B

Dec 12, 2008, 10:31:37 AM
moonhkt wrote:

Yes, that is because of the final sub(sf,""), where "sf" is used as a computed
regex. At some point your input makes it contain a regex metacharacter ("?")
in a position where it is not valid, so the regex engine of your awk
implementation complains. GNU awk works fine in this specific case (but mawk
and Bell Labs awk complain), so either make sure "sf" is properly escaped
before using it in the sub (which can however become a bit complicated), or
just use substr (example follows):

$0=$0" ";
while($0) {
  match($0,/"[^"]*" |[^ ]* /);
  sf=f=substr($0,RSTART,RLENGTH);
  sub(/ $/,"",f);
  print "Field " ++c " is " f;
  $0=substr($0,RLENGTH+1);
}

I will try to contact the owners of the wiki to fix their code.
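
One way to avoid the escaping problem entirely is to never hand the chunk to the regex engine. The following is only a minimal sketch, not code from the thread: remove_literal is a hypothetical helper name, and it uses nothing beyond the standard index(), substr() and length() functions to delete the first literal occurrence of a string.

```sh
# Sketch: delete the first literal occurrence of $2 from $1.
# remove_literal is a hypothetical name used only for illustration.
# Values passed with -v have backslash escapes processed, so this
# wrapper is only meant for simple demo strings without backslashes.
remove_literal() {
    awk -v s="$1" -v t="$2" '
    BEGIN {
        p = index(s, t)              # literal search, no metacharacters
        if (p)
            s = substr(s, 1, p - 1) substr(s, p + length(t))
        print s
    }'
}

# The chunk that broke sub(sf,"") is removed without complaint:
remove_literal '? "x" 4 ' '? '
```

Because index() compares characters literally, a chunk such as "? " or "+b?" needs no special treatment, which is the same reason the substr version above is robust.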

moonhkt

Dec 12, 2008, 11:01:20 AM

Hi Dave

The code works now.

"BA" "TSI-000000" 1 ? "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" 4
"BA" "D"

cat grep.text | awk '
BEGIN { LN=1
  print "Line " LN;
}
{
  $0=$0" ";
  while($0){
    match($0,/"[^"]*" |[^ ]* /);
    sf=f=substr($0,RSTART,RLENGTH);
    sub(/ $/,"",f);
    print "Field " ++c " is " f;
    $0=substr($0,RLENGTH+1);
    if ($0 == "") {
      ++LN ;
      print "Line " LN;
      c=0
    }
  }
}'

The result is below:
Line 1
Field 1 is "BA"
Field 2 is "TSI-000000"
Field 3 is 1
Field 4 is ?
Field 5 is "ZOBHA PSL/G"
Field 6 is 3.3
Field 7 is 33
Field 8 is 14.28
Field 9 is 0
Field 10 is 0
Field 11 is 0
Field 12 is 0
Field 13 is ""
Field 14 is 4
Line 2
Field 1 is "BA"
Field 2 is "D"
Line 3

Dave B

Dec 12, 2008, 11:18:08 AM
moonhkt wrote:

> The code work now.
>
> "BA" "TSI-000000" 1 ? "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" 4
> "BA" "D"
>
> cat grep.text | awk '
> BEGIN { LN=1
> print "Line " LN;
> }
> {
> $0=$0" ";
> while($0){
> match($0,/"[^"]*" |[^ ]* /);
> sf=f=substr($0,RSTART,RLENGTH);
> sub(/ $/,"",f);
> print "Field " ++c " is " f;
> $0=substr($0,RLENGTH+1);
> if ($0 == "") {
> ++LN ;
> print "Line " LN;
> c=0
> }
> }
> }'

You don't need the LN complication, since awk already has the builtin
variable NR which contains the current line number:

cat grep.text | awk '
{
  c=0;print "Line "NR;
  $0=$0" ";
  while($0){
    match($0,/"[^"]*" |[^ ]* /);
    sf=f=substr($0,RSTART,RLENGTH);
    sub(/ $/,"",f);
    print "Field " ++c " is " f;
    $0=substr($0,RLENGTH+1);
  }
}'

Ed Morton

Dec 12, 2008, 4:17:49 PM
On Dec 12, 10:18 am, Dave B <da...@addr.invalid> wrote:
> moonhkt wrote:
> > The code work now.
>
> > "BA" "TSI-000000" 1 ? "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" 4
> > "BA" "D"
>
> > cat grep.text | awk '
> > BEGIN { LN=1
> >   print "Line " LN;
> >  }
> > {
> >   $0=$0" ";
> >   while($0){
> >      match($0,/"[^"]*" |[^ ]* /);
> >      sf=f=substr($0,RSTART,RLENGTH);
> >      sub(/ $/,"",f);
> >      print "Field " ++c " is " f;
> >      $0=substr($0,RLENGTH+1);
> >      if ($0 == "") {
> >         ++LN ;
> >         print "Line " LN;
> >         c=0
> >      }
> >    }
> > }'
>
> You don't need the LN complication, since awk already has the builtin
> variable NR which contains the current line number:
>
> cat grep.text | awk '

OT, but that's a UUOC.

> {
>   c=0;print "Line "NR;
>   $0=$0" ";
>   while($0){
>      match($0,/"[^"]*" |[^ ]* /);
>      sf=f=substr($0,RSTART,RLENGTH);
>      sub(/ $/,"",f);
>      print "Field " ++c " is " f;
>      $0=substr($0,RLENGTH+1);
>   }
>
> }'
>

I find it easier to just rebuild $0 using a different field separator
from the text inside the quotes, e.g. using the awk SUBSEP character:

$ cat decsv.awk
BEGIN{ FS=OFS=SUBSEP }
{
   n = split($0,f,"")
   $0=""
   for (i=1;i<=n;i++) {
      inFld = (f[i] ~ /"/ ? !inFld : inFld)
      $0 = $0 (!inFld && (f[i] ~ /[[:space:]]/) ? FS : f[i])
   }

   print "Line "NR
   for (i=1;i<=NF;i++)
      print "Field " i " is " $i
}
$ awk -f decsv.awk file


Line 1
Field 1 is "BA"
Field 2 is "TSI-000000"
Field 3 is 1
Field 4 is ?
Field 5 is "ZOBHA PSL/G"
Field 6 is 3.3
Field 7 is 33
Field 8 is 14.28
Field 9 is 0
Field 10 is 0
Field 11 is 0
Field 12 is 0
Field 13 is ""
Field 14 is 4
Line 2
Field 1 is "BA"
Field 2 is "D"

That way you can use $1, etc. in the rest of your script exactly as
you would normally. If you can have chains of white space between
fields and you're using " " as the FS, just add a
gsub(SUBSEP"+",SUBSEP) after the first "for" loop, and if your field
separator is something other than the default " ", just tweak the
code...
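
The white-space tweak described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the posted script: split_ws is a hypothetical name, substr() is swapped in for split($0,f,"") (whose character-splitting behaviour is not guaranteed by POSIX), only spaces and tabs are treated as white space, and the quote state is reset for every record.

```sh
# Sketch of the rebuild-the-record idea with the suggested gsub added.
# split_ws is a hypothetical name; see the assumptions in the lead-in.
split_ws() {
    awk '
    BEGIN { FS = OFS = SUBSEP }
    {
        n = length($0); rec = ""; inFld = 0
        for (i = 1; i <= n; i++) {
            ch = substr($0, i, 1)
            if (ch == "\"") inFld = !inFld      # toggle on every quote
            rec = rec ((!inFld && (ch == " " || ch == "\t")) ? FS : ch)
        }
        gsub(SUBSEP "+", SUBSEP, rec)           # collapse runs of separators
        $0 = rec                                # awk re-splits on SUBSEP
        for (i = 1; i <= NF; i++)
            print "Field " i " is " $i
    }'
}

printf '%s\n' 'a   "b  c"    d' | split_ws
```

With this input the run of spaces inside the quotes is kept while the runs between fields are collapsed, so three fields come out. Leading white space would still yield an empty first field, since the rebuilt record then starts with a separator.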

Ed.

Dave B

Dec 12, 2008, 4:50:52 PM
Ed Morton wrote:

>> {
>> c=0;print "Line "NR;
>> $0=$0" ";
>> while($0){
>> match($0,/"[^"]*" |[^ ]* /);
>> sf=f=substr($0,RSTART,RLENGTH);
>> sub(/ $/,"",f);
>> print "Field " ++c " is " f;
>> $0=substr($0,RLENGTH+1);
>> }
>>
>> }'
>>
>
> I find it easier to just rebuild $0 using a different field separator
> from the text inside the quotes, e.g. using the awk SUBSEP character:
>
> $ cat decsv.awk
> BEGIN{ FS=OFS=SUBSEP }
> {
> n = split($0,f,"")
> $0=""
> for (i=1;i<=n;i++) {
> inFld = (f[i] ~ /"/ ? !inFld : inFld)
> $0 = $0 (!inFld && (f[i] ~ /[[:space:]]/) ? FS : f[i])
> }
>
> print "Line "NR
> for (i=1;i<=NF;i++)
> print "Field " i " is " $i
> }

>[snip]

> That way you can use $1, etc. in the rest of your script exactly as
> you would normally. If you can have chains of white space between
> fields and you're using " " as the FS, just add a gsub
> (SUBSEP"+",SUBSEP) after the first "for" loop and if your field
> separator is something other than the default " ", just tweak the
> code...

Good point, thanks for bringing that out.

For completeness, it might be worth mentioning that the original algorithm
can be adapted to do the same thing:

BEGIN { FS=SUBSEP }
{
  print "Line "NR;
  $0=$0" ";
  while($0){
    match($0,/"[^"]*" |[^ ]* /);
    f=substr($0,RSTART,RLENGTH);
    sub(/ $/,"",f);
    line=line s f; s=FS;
    $0=substr($0,RLENGTH+1);
  }

  $0=line;
  for(i=1;i<=NF;i++)print "Field " i " is " $i
}

Ed Morton

Dec 12, 2008, 6:21:08 PM

Add:

line=""

>   while($0){
>     match($0,/"[^"]*" |[^ ]* /);
>     f=substr($0,RSTART,RLENGTH);
>     sub(/ $/,"",f);
>     line=line s f; s=FS;
>     $0=substr($0,RLENGTH+1);
>   }
>
>   $0=line;
>   for(i=1;i<=NF;i++)print "Field " i " is " $i
>
> }'
>

Even after init-ing "line", the above doesn't quite work as it adds a
blank first field at the start of the second line:

$ cat file
"BA" "TSI-000000" 1 ? "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" 4
"BA" "D"

$ cat decsv2.awk
BEGIN { FS=SUBSEP }
{
  print "Line "NR;
  $0=$0" ";

  line=""

  while($0){
    match($0,/"[^"]*" |[^ ]* /);
    f=substr($0,RSTART,RLENGTH);
    sub(/ $/,"",f);
    line=line s f; s=FS;
    $0=substr($0,RLENGTH+1);
  }

  $0=line;
  for(i=1;i<=NF;i++)print "Field " i " is <" $i ">"
}

$ awk -f decsv2.awk file


Line 1
Field 1 is <"BA">
Field 2 is <"TSI-000000">
Field 3 is <1>
Field 4 is <?>
Field 5 is <"ZOBHA PSL/G">
Field 6 is <3.3>
Field 7 is <33>
Field 8 is <14.28>
Field 9 is <0>
Field 10 is <0>
Field 11 is <0>
Field 12 is <0>
Field 13 is <"">
Field 14 is <4>
Line 2
Field 1 is <>
Field 2 is <"BA">
Field 3 is <"D">

Ed.

Ed Morton

Dec 12, 2008, 6:24:25 PM

Never mind, you just have to init "s" too to solve that problem:

BEGIN { FS=SUBSEP }
{
  print "Line "NR;
  $0=$0" ";

  line=s=""

  while($0){
    match($0,/"[^"]*" |[^ ]* /);
    f=substr($0,RSTART,RLENGTH);
    sub(/ $/,"",f);
    line=line s f; s=FS;
    $0=substr($0,RLENGTH+1);
  }

  $0=line;
  for(i=1;i<=NF;i++)print "Field " i " is <" $i ">"
}

Ed.

Anton Treuenfels

Dec 12, 2008, 9:27:20 PM

"moonhkt" <moo...@gmail.com> wrote in message
news:efbd5c5d-2e13-409b...@e25g2000vbe.googlegroups.com...

>
> Hi All
>
> Data
> cat grep.txt
> "BA" "TSI-000000" 1 "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" ?
>
>
> A One Space inside "ZOBHA PSL/G"
>
> How to split field by delimiter "space" using awk

In addition to the interesting solutions presented so far, another approach
(untested):

{
  print "Line " NR
  f = 0
  while ( $0 ) {
    if ( !match($0, /^[ ]+/) ) {
      if ( !match($0, /^"[^"]*"/) )
        match( $0, /^[^ ]+/ )
      print "Field " ++f " is " substr( $0, 1, RLENGTH )
    }
    $0 = substr( $0, RLENGTH+1 )
  }
}

- Anton Treuenfels


Anton Treuenfels

Dec 12, 2008, 9:40:53 PM

"Dave B" <da...@addr.invalid> wrote in message
news:ghtbic$phg$1...@news.motzarella.org...

> {
> $0=$0" ";
> while($0) {
> match($0,/"[^"]*" |[^ ]* /);
> sf=f=substr($0,RSTART,RLENGTH);
> sub(/ $/,"",f);
> print "Field " ++c " is " f;
> sub(sf,"");
> }
> }

I'm trying to puzzle out what happens here if the original record ends with
one or more spaces, i.e., there is a trailing field separator after the last
field and before the record separator. I am not certain but I think the loop
never terminates because match() repeatedly fails when it reaches this point
and $0 never becomes null. Is that right?

- Anton Treuenfels


Steffen Schuler

Dec 13, 2008, 9:58:02 AM

A shorter and tested version (works with any POSIX awk):

{ i = 0
  while (match($0, /"([^"]|"")*"|[^" \t]+/)) {
    printf "%2d %s\n", ++i, substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
  }
}
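
The "" alternative inside the first branch of that regex appears intended to let a quoted field carry a doubled quote, CSV style. The sketch below wraps the same awk program in a shell function (split_fields is a hypothetical name, and the sample input is made up) to show that case together with leading and trailing white space.

```sh
# Sketch: the regex-driven splitter, fed a field containing doubled quotes.
# split_fields is a hypothetical name; the awk body is the posted program.
split_fields() {
    awk '
    { i = 0
      while (match($0, /"([^"]|"")*"|[^" \t]+/)) {
          printf "%2d %s\n", ++i, substr($0, RSTART, RLENGTH)
          $0 = substr($0, RSTART + RLENGTH)
      }
    }'
}

printf '%s\n' '  "say ""hi"" now"  x ' | split_fields
```

Since match() returns the leftmost, longest match, the whole quoted string including its doubled quotes comes out as field 1 and x as field 2, and the surrounding spaces are skipped because RSTART is used when trimming the record.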

--
Steffen

Dave B

Dec 13, 2008, 4:05:07 PM

No, AFAICT the above code assumes exactly one space between fields, so
additional spaces just delimit empty fields (code incorporates fixes made so
far):

$ echo 'a b c "d e" f g  ' | awk '
{
  $0=$0" ";
  c=0;
  while($0) {
    match($0,/"[^"]*" |[^ ]* /);
    f=substr($0,RSTART,RLENGTH);
    sub(/ $/,"",f);
    print "Field " ++c " is " f;
    $0=substr($0,RLENGTH+1);
  }
}'
Field 1 is a
Field 2 is b
Field 3 is c
Field 4 is "d e"
Field 5 is f
Field 6 is g
Field 7 is
Field 8 is

Dave B

Dec 13, 2008, 4:09:39 PM
Ed Morton wrote:

> Never mind, you just have to init "s" too to solve that problem:
>
> BEGIN { FS=SUBSEP }
> {
> print "Line "NR;
> $0=$0" ";
> line=s=""
> while($0){
> match($0,/"[^"]*" |[^ ]* /);
> f=substr($0,RSTART,RLENGTH);
> sub(/ $/,"",f);
> line=line s f; s=FS;
> $0=substr($0,RLENGTH+1);
> }
>
> $0=line;
> for(i=1;i<=NF;i++)print "Field " i " is <" $i ">"
>
> }

Yes, I realized shortly after posting that my code was working correctly
only for an input of a single line (that's how I tested it), for more lines
(as you correctly did) you have to reinit "s" and "line". I just thought it
was a minor problem that anyone could fix by himself and not worth an
additional post (and I had already turned off the computer!), however thank
you for making the necessary corrections.

Hai Vu

Dec 13, 2008, 6:00:56 PM
I know many others are trying to provide the solution in AWK, but
perhaps if you can use bash shell, then the job would be much easier:

#!/bin/sh
for arg
do
    echo $arg
done

I believe bash is the right tool for this job.

Anton Treuenfels

Dec 14, 2008, 1:47:14 AM

"Dave B" <da...@addr.invalid> wrote in message
news:gi17qc$inn$1...@news.motzarella.org...

> Anton Treuenfels wrote:
> > "Dave B" <da...@addr.invalid> wrote in message
> > news:ghtbic$phg$1...@news.motzarella.org...
> >
> >> {
> >> $0=$0" ";
> >> while($0) {
> >> match($0,/"[^"]*" |[^ ]* /);
> >> sf=f=substr($0,RSTART,RLENGTH);
> >> sub(/ $/,"",f);
> >> print "Field " ++c " is " f;
> >> sub(sf,"");
> >> }
> >> }
> >
> > I'm trying to puzzle out what happens here if the original record ends with
> > one or more spaces, ie., there is a trailing field separator after the last
> > field and before the record separator. I am not certain but I think the loop
> > never terminates because match() repeatedly fails when it reaches this point
> > and $0 never becomes null. Is that right?
>
> No, AFAICT the above code assumes exactly one space between fields, so
> additional spaces just delimit empty fields (code incorporates fixes made so
> far):

Ah, okay, now I see where I misunderstood. Oops!

- Anton Treuenfels


Ed Morton

Dec 14, 2008, 10:08:06 AM

In addition to being OT in this NG, that's printing arguments to a
shell script, not parsing the contents of a file. You can print
arguments to an awk script that way too but that's not the problem
we're trying to solve.

Ed.

Ed Morton

Dec 14, 2008, 10:31:56 AM
On Dec 13, 8:58 am, Steffen Schuler <schuler.stef...@googlemail.com>
wrote:

It also correctly ignores leading and trailing spaces unlike the
solutions Dave and I had posted. I think we have a winner... Here it
is updated to rebuild $0:

BEGIN{ FS=SUBSEP }
{ i = 0
  rec = sep = ""

  while (match($0, /"([^"]|"")*"|[^" \t]+/)) {
    fld = substr($0, RSTART, RLENGTH)
    rec = rec sep fld
    sep = FS
    $0 = substr($0, RSTART + RLENGTH)
  }

  $0 = rec

  for (i=1;i<=NF;i++) {
    print "Field " i " is <" $i ">"
  }
}

Ed.

Rajan

Dec 14, 2008, 8:58:13 PM
"Ed Morton" <morto...@gmail.com> wrote in message
news:c7dd522b-7b0d-4ac2...@w24g2000prd.googlegroups.com...

1. Would while (match($0, /"[^"]*"|[^" \t]+/)) work?

> fld = substr($0, RSTART, RLENGTH)
> rec = rec sep fld
> sep = FS
> $0 = substr($0, RSTART + RLENGTH)
> }
> $0 = rec
> for (i=1;i<=NF;i++) {
> print "Field " i " is <" $i ">"
> }
> }
>
> Ed.

2. Would this work as well?

BEGIN{OFS=FS="\""}
{
  i=0
  for (i=1;i<=NF;i++) if (!(i%2)) gsub(" ",SUBSEP,$i)
  nf=split($0,arr," ")
  for (i=1;i<=nf;i++) {
    gsub(SUBSEP," ",arr[i])
    print "Field " i " is <" arr[i] ">"
  }
}

Thanks

hpt

Dec 15, 2008, 2:50:35 AM

Could someone tell me why "$1 = $1" is needed here?

Ed Morton

Dec 15, 2008, 8:48:29 AM
> Could someone tell me why should "$1 = $1" here?

It's not useful in that script, but any modification of a field causes
awk to rebuild the current record, so all occurrences of FS get
changed to OFS; assigning any field to itself therefore does that without
changing any fields.
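
A small illustration of that effect, as a sketch only (rebuild is a hypothetical function name and the input line is made up): the record is printed once as read and once after the self-assignment.

```sh
# Sketch: show the record before and after "$1 = $1" with OFS set to "-".
# rebuild is a hypothetical name used only for this illustration.
rebuild() {
    awk 'BEGIN { OFS = "-" } { print; $1 = $1; print }'
}

printf '%s\n' 'a b   c' | rebuild
```

The first line of output is the untouched record, a b   c, and the second is a-b-c: with the default FS the run of spaces counts as a single separator, and the rebuilt record joins the three fields with OFS.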

Ed.

Hai Vu

Dec 15, 2008, 3:03:21 PM
With all due respect, I agree that my message of using bash instead of
AWK is off-topic. However, I am trying to show a solution that might
work and is simpler to implement. I admit that my original submission
was not on the money. Here is the revised solution, which reads a file and
breaks each line into tokens, as the original message asked for:

#!/bin/sh
# Attempt to read the file grep.txt and parse the parameters

while read line
do
    eval set '$line'
    #echo "\nLine: $line"
    #echo "Number of tokens: $#"
    for arg
    do
        echo $arg
    done
done < grep.txt
# End of file

Dave B

Dec 15, 2008, 3:50:02 PM
Hai Vu wrote:

> With all due respect, I agree that my message of using bash instead of
> AWK is off-topic. However, I am trying to show a solution that might
> work and is simpler to implement. I admit that my original submission
> was not on the money. Here is the revised solution, which read a file,
> break each line into tokens, as the original message asked for:
>
> #!/bin/sh
> # Attempt to read the file quote.txt and parse the parameters
>
> while read line
> do
> eval set '$line'
> #echo "\nLine: $line"
> #echo "Number of tokens: $#"
> for arg
> do
> echo $arg
> done
> done < grep.txt
> # End of file

This will not work if some fields contain spaces, e.g.

field1 "field 2" field3 "a b c d"

your program will print

field1
"field
2"
field3
"a
b
c
d"

In other words, your program is not able to recognize when a field appears
in double quotes, in which case everything inside the quotes (and the quotes
themselves) must be taken as a single field, even if there are spaces.

(and yes, it's a bit offtopic here too, although in this case, at least
imho, that is not the main problem)

Ed Morton

Dec 15, 2008, 4:30:35 PM
On Dec 15, 2:03 pm, Hai Vu <haivu2...@gmail.com> wrote:
> With all due respect, I agree that my message of using bash instead of
> AWK is off-topic. However, I am trying to show a solution that might
> work and is simpler to implement.

I understand that, but many people using awk aren't using it on UNIX
so a UNIX-specific solution isn't a good one for awk users in general,
and this isn't the NG where most UNIX experts hang out so a UNIX
solution won't be reviewed/criticised/enhanced by the appropriate
people, unlike UNIX solutions proposed at comp.unix.shell.

<OT>


> I admit that my original submission
> was not on the money. Here is the revised solution, which read a file,
> break each line into tokens, as the original message asked for:
>
> #!/bin/sh
> # Attempt to read the file quote.txt and parse the parameters
>
> while read line
> do
>     eval set '$line'
>     #echo "\nLine: $line"
>     #echo "Number of tokens: $#"
>     for arg
>     do
>         echo $arg
>     done
> done < grep.txt
> # End of file

The above doesn't work either. Change it to:

while IFS= read -r line
do
    eval set "$line"
    #echo "\nLine: $line"
    #echo "Number of tokens: $#"
    for arg
    do
        echo "$arg"
    done
done < grep.txt

and it's probably close.
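
To see what "close" means here, the loop can be tried on a sample line. This is a sketch under assumptions: tokens is a hypothetical function name, input comes from stdin rather than grep.txt, the "--" after set is an addition so that an empty line or one starting with "-" is not taken as options, and the angle brackets are only there to make the token boundaries visible.

```sh
# Sketch: the eval-based tokenizer reading from stdin.
# tokens is a hypothetical name; "--" and the <...> markers are additions.
tokens() {
    while IFS= read -r line
    do
        eval set -- "$line"      # the shell itself parses the quotes
        for arg
        do
            printf '<%s>\n' "$arg"
        done
    done
}

printf '%s\n' 'field1 "field 2" field3 "a b c d"' | tokens
```

This yields four tokens, with the quoted ones kept whole, but the shell removes the double quotes, so the output differs from the awk versions, which keep them. Because eval runs each line as shell code, an unquoted ? in the data is open to pathname expansion and anything like $(...) would be executed, so this only suits trusted input.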
</OT>

Ed.

Hai Vu

Dec 15, 2008, 6:32:31 PM
Ed,
Your point is well taken. By the way, I am learning quite a bit from
this thread. Thanks.
Hai