Data
cat grep.txt
"BA" "TSI-000000" 1 "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" ?
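Note the single space inside "ZOBHA PSL/G".
How can I split the line into fields on the space delimiter using awk,
keeping each quoted string together as one field?
I tried this, but it does not work:
cat grep.text | awk '{printf("%3s ", NR); $1=$1}1' RS=" " OFS="\n"
The result should be: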
1 "BA"
2 "TSI-000000"
3 1
4 "ZOBHA PSL/G"
5 3.3
6 33
7 14.28
8 0
9 0
10 0
11 0
12 ""
13 ?
Basically, you need a dedicated parser here. Assuming there is exactly a
single space between fields, and no spaces at the beginning or end of line,
see if this works (adapted from http://awk.freeshell.org/AwkTips#toc6):
{
$0=$0" ";
while($0) {
match($0,/"[^"]*" |[^ ]* /);
sf=f=substr($0,RSTART,RLENGTH);
sub(/ $/,"",f);
print "Field " ++c " is " f;
sub(sf,"");
}
}
With your input, the above code outputs
Field 1 is "BA"
Field 2 is "TSI-000000"
Field 3 is 1
Field 4 is "ZOBHA PSL/G"
Field 5 is 3.3
Field 6 is 33
Field 7 is 14.28
Field 8 is 0
Field 9 is 0
Field 10 is 0
Field 11 is 0
Field 12 is ""
Field 13 is ?
cat grep.text
"BA" "TSI-000000" 1 "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" 4 ?
cat grep.text | awk '{
$0=$0" ";
while($0) {
match($0,/"[^"]*" |[^ ]* /);
sf=f=substr($0,RSTART,RLENGTH);
sub(/ $/,"",f);
print "Field " ++c " is " f;
sub(sf,"");
}
}'
Field 1 is "BA"
Field 2 is "TSI-000000"
Field 3 is 1
Field 4 is "ZOBHA PSL/G"
Field 5 is 3.3
Field 6 is 33
Field 7 is 14.28
Field 8 is 0
Field 9 is 0
Field 10 is 0
Field 11 is 0
Field 12 is ""
Field 13 is 4
Field 14 is ?
awk: 0602-521 There is a regular expression error.
?*+ not preceded by valid expression.
The input line number is 1.
The source line number is 8.
The ? in the input causes the error above.
Yes, that is because of the final sub(sf,""), where "sf" is a computed
regex. Since your input at some point makes it contain a regex metacharacter
("?") which appears to be incorrectly used, the regex engine of your awk
implementation complains. GNU awk works fine in this specific case (but mawk
and Bell Labs awk complain), so either make sure "sf" is properly escaped
before using it in the sub (which can however become a bit complicated), or
just use substr (example follows):
$0=$0" ";
while($0) {
match($0,/"[^"]*" |[^ ]* /);
sf=f=substr($0,RSTART,RLENGTH);
sub(/ $/,"",f);
print "Field " ++c " is " f;
$0=substr($0,RLENGTH+1);
}
I will try to contact the owners of the wiki to fix their code.
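For the escaping route mentioned above, here is a minimal sketch (not the wiki's code): it puts a backslash in front of each regex metacharacter in "sf" before handing it to sub(). The names esc and parse_escaped are made up for this illustration, and the metacharacter list is an assumption that should cover POSIX extended regular expressions. The substr() version remains the simpler fix.

```shell
# Hypothetical wrapper: the sub()-based loop, but with sf escaped first.
parse_escaped() {
    awk '
    # esc(): hypothetical helper; prefixes each ERE metacharacter with a backslash
    function esc(s,    i, ch, out) {
        out = ""
        for (i = 1; i <= length(s); i++) {
            ch = substr(s, i, 1)
            if (index("\\.^$*+?()[]{}|", ch)) out = out "\\" ch
            else out = out ch
        }
        return out
    }
    {
        $0 = $0 " "
        n = 0
        while (length($0) > 0 && match($0, /"[^"]*" |[^ ]* /)) {
            sf = f = substr($0, RSTART, RLENGTH)
            sub(/ $/, "", f)
            print "Field " ++n " is " f
            sub(esc(sf), "")        # sf is now matched literally
        }
    }'
}

printf '%s\n' '"BA" 4 ? "a b"' | parse_escaped
```

The loop condition uses length($0) rather than the bare while($0), an extra precaution added in this sketch so that the test never depends on whether the remaining text looks numeric.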
Hi Dave,
The code works now.
"BA" "TSI-000000" 1 ? "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" 4
"BA" "D"
cat grep.text | awk '
BEGIN { LN=1
print "Line " LN;
}
{
$0=$0" ";
while($0){
match($0,/"[^"]*" |[^ ]* /);
sf=f=substr($0,RSTART,RLENGTH);
sub(/ $/,"",f);
print "Field " ++c " is " f;
$0=substr($0,RLENGTH+1);
if ($0 == "") {
++LN ;
print "Line " LN;
c=0
}
}
}'
The result is as below:
Line 1
Field 1 is "BA"
Field 2 is "TSI-000000"
Field 3 is 1
Field 4 is ?
Field 5 is "ZOBHA PSL/G"
Field 6 is 3.3
Field 7 is 33
Field 8 is 14.28
Field 9 is 0
Field 10 is 0
Field 11 is 0
Field 12 is 0
Field 13 is ""
Field 14 is 4
Line 2
Field 1 is "BA"
Field 2 is "D"
Line 3
You don't need the LN complication, since awk already has the builtin
variable NR which contains the current line number:
cat grep.text | awk '
{
c=0;print "Line "NR;
$0=$0" ";
while($0){
match($0,/"[^"]*" |[^ ]* /);
sf=f=substr($0,RSTART,RLENGTH);
sub(/ $/,"",f);
print "Field " ++c " is " f;
$0=substr($0,RLENGTH+1);
}
}'
OT, but that's a UUOC.
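For readers who have not met the abbreviation: UUOC stands for "useless use of cat". awk can open the file itself, so the pipe from cat is unnecessary. A small sketch of the equivalence follows; the temporary file and the sample lines are made up for the illustration.

```shell
# Create a throwaway input file (hypothetical sample data).
tmp=$(mktemp)
printf '%s\n' '"BA" "D"' 'x y' > "$tmp"

# Both commands read the same records; NR counts them either way.
with_cat=$(cat "$tmp" | awk '{ print "Line " NR ": " $0 }')
without_cat=$(awk '{ print "Line " NR ": " $0 }' "$tmp")

printf '%s\n' "$without_cat"
rm -f "$tmp"
```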
I find it easier to just rebuild $0 using a different field separator
from the text inside the quotes, e.g. using the awk SUBSEP character:
$ cat decsv.awk
BEGIN{ FS=OFS=SUBSEP }
{
n = split($0,f,"")
$0=""
for (i=1;i<=n;i++) {
inFld = (f[i] ~ /"/ ? !inFld : inFld)
$0 = $0 (!inFld && (f[i] ~ /[[:space:]]/) ? FS : f[i])
}
print "Line "NR
for (i=1;i<=NF;i++)
print "Field " i " is " $i
}
$ awk -f decsv.awk file
Line 1
Field 1 is "BA"
Field 2 is "TSI-000000"
Field 3 is 1
Field 4 is ?
Field 5 is "ZOBHA PSL/G"
Field 6 is 3.3
Field 7 is 33
Field 8 is 14.28
Field 9 is 0
Field 10 is 0
Field 11 is 0
Field 12 is 0
Field 13 is ""
Field 14 is 4
Line 2
Field 1 is "BA"
Field 2 is "D"
That way you can use $1, etc. in the rest of your script exactly as
you would normally. If you can have chains of white space between
fields and you're using " " as the FS, just add a gsub
(SUBSEP"+",SUBSEP) after the first "for" loop and if your field
separator is something other than the default " ", just tweak the
code...
Ed.
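The gsub suggestion for chains of white space can be sketched as follows. This is a variant, not Ed's script as posted: it walks the record with substr() instead of split($0,f,""), because splitting on an empty separator is, as far as I know, not specified by POSIX, and it tests for space and tab explicitly instead of using [[:space:]]. The names decsv_ws, rec and inq are made up for the illustration.

```shell
# Hypothetical wrapper around a variant of the SUBSEP approach.
decsv_ws() {
    awk '
    BEGIN { FS = OFS = SUBSEP }
    {
        line = $0
        rec = ""
        inq = 0                              # reset quote state for every record
        for (i = 1; i <= length(line); i++) {
            ch = substr(line, i, 1)
            if (ch == "\"") inq = !inq
            rec = rec ((!inq && (ch == " " || ch == "\t")) ? SUBSEP : ch)
        }
        gsub(SUBSEP "+", SUBSEP, rec)        # collapse runs of separators into one
        $0 = rec                             # awk re-splits $0 on FS here
        for (i = 1; i <= NF; i++) print "Field " i " is <" $i ">"
    }'
}

printf '%s\n' 'a   "b  c"  d' | decsv_ws
```

Note that white space at the very start of a line still produces an empty first field with this sketch, since the collapsed separator remains in front of the first real field.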
Good point, thanks for bringing that out.
For completeness, it might be worth mentioning that the original algorithm
can be adapted to do the same thing:
BEGIN { FS=SUBSEP }
{
print "Line "NR;
$0=$0" ";
while($0){
match($0,/"[^"]*" |[^ ]* /);
f=substr($0,RSTART,RLENGTH);
sub(/ $/,"",f);
line=line s f; s=FS;
$0=substr($0,RLENGTH+1);
}
$0=line;
for(i=1;i<=NF;i++)print "Field " i " is " $i
}
Add this before the while loop, so that "line" is reset for each record:
line=""
Even after init-ing "line", the above doesn't quite work as it adds a
blank first field at the start of the second line:
$ cat file
"BA" "TSI-000000" 1 ? "ZOBHA PSL/G" 3.3 33 14.28 0 0 0 0 "" 4
"BA" "D"
$ cat decsv2.awk
BEGIN { FS=SUBSEP }
{
print "Line "NR;
$0=$0" ";
line=""
while($0){
match($0,/"[^"]*" |[^ ]* /);
f=substr($0,RSTART,RLENGTH);
sub(/ $/,"",f);
line=line s f; s=FS;
$0=substr($0,RLENGTH+1);
}
$0=line;
for(i=1;i<=NF;i++)print "Field " i " is <" $i ">"
}
$ awk -f decsv2.awk file
Line 1
Field 1 is <"BA">
Field 2 is <"TSI-000000">
Field 3 is <1>
Field 4 is <?>
Field 5 is <"ZOBHA PSL/G">
Field 6 is <3.3>
Field 7 is <33>
Field 8 is <14.28>
Field 9 is <0>
Field 10 is <0>
Field 11 is <0>
Field 12 is <0>
Field 13 is <"">
Field 14 is <4>
Line 2
Field 1 is <>
Field 2 is <"BA">
Field 3 is <"D">
Ed.
Never mind, you just have to init "s" too to solve that problem:
BEGIN { FS=SUBSEP }
{
print "Line "NR;
$0=$0" ";
line=s=""
while($0){
match($0,/"[^"]*" |[^ ]* /);
f=substr($0,RSTART,RLENGTH);
sub(/ $/,"",f);
line=line s f; s=FS;
$0=substr($0,RLENGTH+1);
}
$0=line;
for(i=1;i<=NF;i++)print "Field " i " is <" $i ">"
}
Ed.
In addition to the interesting solutions presented so far, another approach
(untested):
{
print "Line " NR
f = 0
while ( $0 ) {
if ( !match($0,/^[ ]+/) ) {
if ( !match($0,/^"[^"]*"/) )
match( $0, /^[^ ]+/ )
print "Field " ++f " is " substr( $0, 1, RLENGTH )
}
$0 = substr( $0, RLENGTH+1 )
}
}
- Anton Treuenfels
> {
> $0=$0" ";
> while($0) {
> match($0,/"[^"]*" |[^ ]* /);
> sf=f=substr($0,RSTART,RLENGTH);
> sub(/ $/,"",f);
> print "Field " ++c " is " f;
> sub(sf,"");
> }
> }
I'm trying to puzzle out what happens here if the original record ends with
one or more spaces, ie., there is a trailing field separator after the last
field and before the record separator. I am not certain but I think the loop
never terminates because match() repeatedly fails when it reaches this point
and $0 never becomes null. Is that right?
- Anton Treuenfels
A shorter and tested version (works with any POSIX awk):
{ i = 0
while (match($0, /"([^"]|"")*"|[^" \t]+/)) {
printf "%2d %s\n", ++i, substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + RLENGTH)
} }
--
Steffen
No, AFAICT the above code assumes exactly one space between fields, so
additional spaces just delimit empty fields (code incorporates fixes made so
far):
$ echo 'a b c "d e" f g ' | awk '
{
$0=$0" ";
c=0;
while($0) {
match($0,/"[^"]*" |[^ ]* /);
f=substr($0,RSTART,RLENGTH);
sub(/ $/,"",f);
print "Field " ++c " is " f;
$0=substr($0,RLENGTH+1);
}}'
Field 1 is a
Field 2 is b
Field 3 is c
Field 4 is "d e"
Field 5 is f
Field 6 is g
Field 7 is
Field 8 is
> Never mind, you just have to init "s" too to solve that problem:
>
> BEGIN { FS=SUBSEP }
> {
> print "Line "NR;
> $0=$0" ";
> line=s=""
> while($0){
> match($0,/"[^"]*" |[^ ]* /);
> f=substr($0,RSTART,RLENGTH);
> sub(/ $/,"",f);
> line=line s f; s=FS;
> $0=substr($0,RLENGTH+1);
> }
>
> $0=line;
> for(i=1;i<=NF;i++)print "Field " i " is <" $i ">"
>
> }
Yes, I realized shortly after posting that my code was working correctly
only for an input of a single line (that's how I tested it), for more lines
(as you correctly did) you have to reinit "s" and "line". I just thought it
was a minor problem that anyone could fix by himself and not worth an
additional post (and I had already turned off the computer!), however thank
you for making the necessary corrections.
#!/bin/sh
for arg
do
echo $arg
done
I believe bash is the right tool for this job.
Ah, okay, now I see where I misunderstood. Oops!
- Anton Treuenfels
In addition to being OT in this NG, that's printing arguments to a
shell script, not parsing the contents of a file. You can print
arguments to an awk script that way too but that's not the problem
we're trying to solve.
Ed.
It also correctly ignores leading and trailing spaces unlike the
solutions Dave and I had posted. I think we have a winner... Here it
is updated to rebuild $0:
BEGIN{ FS=SUBSEP }
{ i = 0
rec = sep = ""
while (match($0, /"([^"]|"")*"|[^" \t]+/)) {
fld = substr($0, RSTART, RLENGTH)
rec = rec sep fld
sep = FS
$0 = substr($0, RSTART + RLENGTH)
}
$0 = rec
for (i=1;i<=NF;i++) {
print "Field " i " is <" $i ">"
}
}
Ed.
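A quick way to check the claim about leading and trailing spaces is to feed the rebuilt-$0 version a padded line and then use NF and $2 as usual. The wrapper name fields_of and the sample input are made up for this sketch.

```shell
# Hypothetical wrapper; the awk body follows the rebuilt-$0 version above.
fields_of() {
    awk '
    BEGIN { FS = SUBSEP }
    {
        rec = sep = ""
        while (match($0, /"([^"]|"")*"|[^" \t]+/)) {
            rec = rec sep substr($0, RSTART, RLENGTH)
            sep = FS
            $0 = substr($0, RSTART + RLENGTH)
        }
        $0 = rec                    # fields are now split on SUBSEP
        print NF " fields, second is <" $2 ">"
    }'
}

printf '%s\n' '  a "b c"   d  ' | fields_of
```

With this input the sketch reports three fields, and the quoted string arrives intact as $2.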
1. Would while (match($0, /"[^"]*"|[^" \t]+/)) work?
> fld = substr($0, RSTART, RLENGTH)
> rec = rec sep fld
> sep = FS
> $0 = substr($0, RSTART + RLENGTH)
> }
> $0 = rec
> for (i=1;i<=NF;i++) {
> print "Field " i " is <" $i ">"
> }
> }
>
> Ed.
2. Would this work as well?
BEGIN{OFS=FS="\""}
{
i=0
for (i=1;i<=NF;i++) if (!(i%2)) gsub(" ",SUBSEP,$i)
nf=split($0,arr," ")
for (i=1;i<=nf;i++) {
gsub(SUBSEP," ",arr[i])
print "Field " i " is <" arr[i] ">"
}
}
Thanks
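On question 1, an observation rather than a verdict: the two patterns appear to differ only when a quoted field contains a doubled quote (""), the CSV-style way of embedding a quote character. The sketch below runs both patterns on such a line; split_with, next_field and the sample input are made up for the illustration.

```shell
# Hypothetical helper: mode "doubled" uses the pattern with the "" alternative,
# mode "simple" uses the shorter pattern from question 1.
split_with() {
    awk -v mode="$1" '
    function next_field(s) {
        if (mode == "doubled") return match(s, /"([^"]|"")*"|[^" \t]+/)
        return match(s, /"[^"]*"|[^" \t]+/)
    }
    {
        rest = $0
        while (next_field(rest)) {
            printf "<%s>\n", substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }'
}

input='"a ""b"" c" d'
printf '%s\n' "$input" | split_with doubled
printf '%s\n' "$input" | split_with simple
```

The "doubled" mode keeps "a ""b"" c" together as one field, while the "simple" mode breaks it into three quoted pieces. On input without doubled quotes, such as the sample data in this thread, both should give the same fields.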
Could someone tell me why "$1 = $1" is used here?
It's not useful in that script, but any modification of a field causes
awk to rebuild the current record, so all occurrences of FS get
changed to OFS. Assigning any field to itself therefore does that
without changing any fields.
Ed.
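The effect is easy to see on a one-line input. The sample text and the choice of "-" as OFS are arbitrary.

```shell
# Without the assignment $0 is printed unchanged; with it, awk rebuilds $0 using OFS.
printf '%s\n' 'a b  c' | awk -v OFS='-' '{ print }'
printf '%s\n' 'a b  c' | awk -v OFS='-' '{ $1 = $1; print }'
```

The first command prints the line as read, double space included; the second prints a-b-c, because the record is reassembled from its fields with OFS between them.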
#!/bin/sh
# Attempt to read the file quote.txt and parse the parameters
while read line
do
eval set '$line'
#echo "\nLine: $line"
#echo "Number of tokens: $#"
for arg
do
echo $arg
done
done < grep.txt
# End of file
> With all due respect, I agree that my message of using bash instead of
> AWK is off-topic. However, I am trying to show a solution that might
> work and is simpler to implement. I admit that my original submission
> was not on the money. Here is the revised solution, which read a file,
> break each line into tokens, as the original message asked for:
>
> #!/bin/sh
> # Attempt to read the file quote.txt and parse the parameters
>
> while read line
> do
> eval set '$line'
> #echo "\nLine: $line"
> #echo "Number of tokens: $#"
> for arg
> do
> echo $arg
> done
> done < grep.txt
> # End of file
This will not work if some fields contain spaces. For example, with the input
field1 "field 2" field3 "a b c d"
your program will print
field1
"field
2"
field3
"a
b
c
d"
In other words, your program is not able to recognize when a field appears
in double quotes, in which case everything inside the quotes (and the quotes
themselves) must be taken as a single field, even if there are spaces.
(and yes, it's a bit offtopic here too, although in this case, at least
imho, that is not the main problem)
I understand that, but many people using awk aren't using it on UNIX
so a UNIX-specific solution isn't a good one for awk users in general,
and this isn't the NG where most UNIX experts hang out so a UNIX
solution won't be reviewed/criticised/enhanced by the appropriate
people, unlike UNIX solutions proposed at comp.unix.shell.
<OT>
> I admit that my original submission
> was not on the money. Here is the revised solution, which read a file,
> break each line into tokens, as the original message asked for:
>
> #!/bin/sh
> # Attempt to read the file quote.txt and parse the parameters
>
> while read line
> do
> eval set '$line'
> #echo "\nLine: $line"
> #echo "Number of tokens: $#"
> for arg
> do
> echo $arg
> done
> done < grep.txt
> # End of file
The above doesn't work either. Change it to:
while IFS= read -r line
do
eval set "$line"
#echo "\nLine: $line"
#echo "Number of tokens: $#"
for arg
do
echo "$arg"
done
done < grep.txt
and it's probably close.
</OT>
Ed.
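Two caveats about the eval approach, shown as a sketch under stated assumptions: eval runs each input line as shell code, so it is only safe on trusted data, and a bare ? in the data is a glob pattern unless globbing is switched off. The helper name tokenize_line and the sample line are made up for the illustration.

```shell
# Hypothetical helper: tokenize one line using the shell quoting rules.
# WARNING: eval executes the line as shell code, so use it only on trusted input.
tokenize_line() {
    (
        set -f                      # no globbing, so a bare ? stays a literal ?
        eval "set -- $1"
        for arg in "$@"; do
            printf '<%s>\n' "$arg"
        done
    )
}

tokenize_line 'field1 "field 2" ? ""'
```

Unlike the awk solutions in this thread, the shell strips the double quotes from each token, so the output does not match the format the original question asked for.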