
How to select an awk record when one field does not match the end of another field


Larry W. Virden

Jun 15, 2016, 10:17:49 AM
Hello!

Assume a data file such as this

"20";"aaa20"
"21";"abc21"
"23";"xyz00"
"24";"pdq24"


I want to write an awk expression that outputs the record if the value in column 1 does not match the end of the record's column 2.

Complicating things further is the fact that each field is quoted.

I am trying

awk '-F;' '\
$2 !~ /${1}$/ { print $0 ; }
' myfile


But I haven't figured out how to remove the quotes from the fields in the expression.

Does anyone have any ideas on how to do this?

Marc de Bourget

Jun 15, 2016, 10:46:44 AM
On Wednesday, June 15, 2016 at 16:17:49 UTC+2, Larry W. Virden wrote:
> Does anyone have any ideas on how to do this?

Probably something like:
BEGIN {
    FS = ";"
}

{
    sub(/^\"/, "", $1)
    sub(/\"$/, "", $1)
    field1value = $1

    sub(/^\"/, "", $2)
    sub(/\"$/, "", $2)
    field2value = substr($2, length($2)-length($1)+1, length($1))

    if (field1value != field2value)
        print $0
}

Janis Papanagnou

Jun 15, 2016, 11:14:10 AM
Maybe the easiest[*] is to define an appropriate regexp as field separator
to access the right portions of the fields; e.g.

awk -F $'("|";")' 'match($3,$2"$") { print "matching", NR }'

will output with your sample data

matching 1
matching 2
matching 4


Janis

[*] Depending on the regularity of the actual data, and whether there may be
regexp metacharacters in your fields or not.
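
If such metacharacters could occur, one possible guard - just a sketch, and
the set of characters escaped below is my own assumption - is to neutralize
them before matching:

BEGIN { FS = "(\"|\";\")" }
{
    pat = $2
    gsub(/[][^$.*?+{}()|\\]/, "\\\\&", pat)   # backslash-escape ERE metacharacters
    if (!match($3, pat "$")) print            # print records where $2 doesn't end $3
}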

Marc de Bourget

Jun 15, 2016, 11:26:06 AM
Yes, regexp comparison might be easier but string comparison is more secure.

Janis Papanagnou

Jun 15, 2016, 12:32:31 PM
On 15.06.2016 17:26, Marc de Bourget wrote:
> On Wednesday, June 15, 2016 at 17:14:10 UTC+2, Janis Papanagnou wrote:
>>
>> awk -F $'("|";")' 'match($3,$2"$") { print "matching", NR }'
[...]
>>
>> [*] Depending on the regularity of the actual data, and whether there may be
>> regexp metacharacters in your fields or not.
>
> Yes, regexp comparison might be easier but string comparison is more secure.

As said, it depends on the data. If, as is so often the case - and certainly
in the OP's case, where there are just two-digit numbers - you don't have
regexp metacharacters, it's completely "secure".

Note that even if you prefer string comparisons (as you proposed in your
answer) you can significantly simplify your program by specifying the FS
in a more sensible way; then you can avoid all the sub() function calls.

Janis

Marc de Bourget

Jun 15, 2016, 2:49:28 PM
The OP asked how to remove the quotes, so I added code to remove the quotes.
I can't get your one-line AWK program to work on the Windows command line.
Can you rewrite it as a full program (at least BEGIN { FS = ... }), please?
Further, I haven't yet understood why you use $3 for a two-column text file.

Janis Papanagnou

Jun 15, 2016, 3:51:43 PM
On 15.06.2016 20:49, Marc de Bourget wrote:
>
> I can't get your one line AWK program to work on the Windows command line.
> Can you rewrite it as a full program (at least BEGIN { FS = ... }), please?

In the BEGIN clause you need to escape the quotes, and my approach would
then become (for the original requirement to print non-matching lines)...

BEGIN { FS = "(\"|\";\")" }
!match($3,$2"$")

And my proposal of your version with a changed FS would then become

BEGIN { FS = "(\"|\";\")" }
$2 != substr($3,length($3)-length($2)+1,length($2))


> Further, I haven't yet understood why using $3 for a two columns text file?

The field separators _separate_ fields, so if we define the /"/ and the /";"/
both as field separators we get an empty field $1 before the leading " and an
empty field $NF after the final ".

"AA";"BBBCC" e.g. would be split in <empty> " AA ";" BBBCC " <empty> ,
with $1=="", $2=="AA", $3=="BBBCC", $4="", and NF==4.

Janis

Janis Papanagnou

Jun 15, 2016, 3:57:10 PM
On 15.06.2016 21:51, Janis Papanagnou wrote:
> On 15.06.2016 20:49, Marc de Bourget wrote:
>>
>> I can't get your one line AWK program to work on the Windows command line.
>> Can you rewrite it as a full program (at least BEGIN { FS = ... }), please?
>
> In the BEGIN clause you need to escape the quotes, and my approach would
> then become (for the original requirement to print non-matching lines)...
>
> BEGIN { FS = "(\"|\";\")" }
> !match($3,$2"$")
>
> And my proposal of your version with a changed FS would then become
>
> BEGIN { FS = "(\"|\";\")" }
> $2 != substr($3,length($3)-length($2)+1,length($2))

And it just occurred to me that we can simplify this one even more like this

BEGIN { FS = "(\"|\";\")" }
$2 != substr($3,length($3)-length($2)+1)

since the third parameter of substr(), the length, is unnecessary if we want
the substring to the end of the field.
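
A tiny check of both forms (a sketch):

BEGIN {
    s = "abc21"
    print substr(s, 4, 2)   # "21" - explicit length
    print substr(s, 4)      # "21" - two-argument form, to the end of the string
}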

Janis

> [...]


Marc de Bourget

Jun 15, 2016, 4:51:40 PM
Yes, I do know that the third parameter actually isn't needed, but I wrote it for demonstration purposes, to show the exact cut made by the substr() function.

As a matter of personal taste, I still don't much like a field-separator solution that splits two fields into three when obviously only two fields exist. Using $3 for a field which is actually $2 looks slightly strange :-) So I have adjusted my solution a bit, as an alternative that does not change the original values. I assume this is what the OP wanted as a result:

function csvdequote(str) {
    sub(/^\"/, "", str)
    sub(/\"$/, "", str)
    return str
}

BEGIN {
    FS = ";"
}

{
    field1val = csvdequote($1)
    field2val = csvdequote($2)
    field2val = substr(field2val, length(field2val)-length(field1val)+1)

    if (field1val != field2val)
        print $0
}

Of course the CSV dequote function is only very basic. It gets even more complicated if there are semicolons inside a field as part of the field values. AWK isn't well suited for this. Therefore, I always use (and would recommend it to everyone) tab (ASCII 9) as field separator, with no need for escaping quotes. This is the best and simplest solution.
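
With such tab-separated, unquoted input (assumed here to look like
20<TAB>aaa20), the OP's check shrinks to a sketch like

BEGIN { FS = "\t" }
$2 !~ ($1 "$")    # print records where field 1 does not end field 2

again assuming no regexp metacharacters in field 1.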

Of course Janis's solution is very good. It's good to have several solutions.
BTW, OP where are you? Did we solve your problem? Are you satisfied with our work :-) ?

Janis Papanagnou

Jun 15, 2016, 6:36:58 PM
On 15.06.2016 22:51, Marc de Bourget wrote:
>
> As a matter of personal taste, I still don't much like a field-separator
> solution that splits two fields into three when obviously only two fields
> exist. [...]

What's a field in Awk is completely up to you. The insight is that with a
proper field separator defined you can write an Awk program that does what
Awk is best at: operating directly on fields, without bulky workarounds.
(Note that gawk goes even farther with its feature to define patterns not
only for the separators but also for the data itself (FPAT), which is often
more appropriate.)
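
For illustration, a gawk-only sketch of that data-pattern feature (FPAT,
available since gawk 4.0) applied to the OP's input:

BEGIN { FPAT = "\"[^\"]*\"" }            # a field is one quoted string
{
    f1 = substr($1, 2, length($1) - 2)   # strip the surrounding quotes
    f2 = substr($2, 2, length($2) - 2)
    if (f2 !~ (f1 "$")) print            # print the non-matching records
}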

> [...]
>
> Of course the CSV dequote function is only very basic. It gets even more
> complicated if there are semicolons inside a field as part of the field
> values. AWK isn't well suited for this. Therefore, I always use (and would
> recommend it to everyone) tab (ASCII 9) as field separator, with no need
> for escaping quotes. This is the best and simplest solution.

Your observation that Awk isn't good with CSV data is correct. And if you
have control over the input data generation you are free to use appropriate
unique field separators, but generally you have to process what you get.

(The ASCII Tab, BTW, has a lot of other disadvantages, and is also generally
not a good choice. For example you can't see multi-Tab sequences, and it
provokes hard-to-detect bugs if one line has, say, two Tabs to align data
and the program thus interprets it as a spurious additional empty field.)

Janis

Robert Mesibov

Jun 15, 2016, 7:39:50 PM
> (The ASCII Tab, BTW, has a lot of other disadvantages, and is also generally
> not a good choice. For example you can't see multi-Tab sequences, and it
> provokes hard-to-detect bugs if one line has, say, two Tabs to align data
> and the program thus interprets it as a spurious additional empty field.)
>
> Janis

I don't see other disadvantages, and I agree with Marc that tab is the most recommendable field separator. Among other advantages, tab is the default separator for cut, paste and nl, and in a terminal or text editor it is simple to display tab-separated columns clearly (with 'tabs' command, or with tab length in an editor).

The hardest part in converting a CSV to a TSV is working out how double quotes were used in the CSV - there are too many different cases!

Kenny McCormack

Jun 15, 2016, 7:42:44 PM
In article <881169da-4e5d-4abd...@googlegroups.com>,
Robert Mesibov <robert....@gmail.com> wrote:
Yes. Janis is clearly wrong here.

--
Modern Conservative: Someone who can take time out from flashing her
wedding ring around and bragging about her honeymoon to complain that a
fellow secretary who keeps a picture of her girlfriend on her desk is
"flauting her sexuality" and "forcing her lifestyle down our throats".

Janis Papanagnou

Jun 15, 2016, 9:05:25 PM
On 16.06.2016 01:42, Kenny McCormack wrote:
> In article <881169da-4e5d-4abd...@googlegroups.com>,
> Robert Mesibov <robert....@gmail.com> wrote:
>>> (The ASCII Tab, BTW, has a lot of other disadvantages, and is also generally
>>> not a good choice. For example you can't see multi-Tab sequences, and it
>>> provokes hard-to-detect bugs if one line has, say, two Tabs to align data
>>> and the program thus interprets it as a spurious additional empty field.)
>>
>> I don't see other disadvantages,

You maybe don't see them, but there are. Besides the mentioned multi-Tabs there
is the dependency on editors with hand-created files, where editors expand Tabs
to Spaces or to mixtures of Tab and Space, even depending on the specific
Tab-width setting.

>> and I agree with Marc that tab is the most recommendable field separator.

Any unique _visible_ separator is preferable. YMMV. A typical choice is the
pipe symbol, which is rarely used inside the payload data and clearly visible.
I've seen a lot of advanced software products in the past three decades that
have used that choice.
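
E.g., on pipe-delimited data - a sketch, assuming input like 20|aaa20 and an
awk that treats a single-character FS literally:

awk -F'|' '$2 !~ ($1 "$")' myfile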

Nonetheless it always depends on the application data context. So any
religious or dogmatic statement about the One True Separator is nonsense.
The fact that my hint and my explanations about the inherent problems with
Tab - and about why it's unsuited as a general recommendation (remember, I
replied to the statement: "I [Marc] always use (and would recommend it to
everyone) tab (ASCII 9) as Field separator") - get brushed aside shows me
that this thread is going to become a religious war again. I will abstain.

>> Among other advantages, tab is the default separator for cut, paste and nl,

Which is irrelevant for Awk (and comp.lang.awk), since in Awk Tab is not
the default. <OT> Actually, the way that tools like cut handle delimiters
is extremely primitive, and that they (as opposed to awk) don't work with
arbitrary delimiters is more of a pain than an enlightenment. Obviously
there *is* an argument for Tab; to be able to process the same files with
other tools than Awk as well, and since those tools are so primitive you
need to resort to a very primitive one-character delimiter. </OT> But one
insight that experienced Awk users should have is that if you use Awk you
can avoid single invocations or pipelines of those primitive tools and be
better off.

>> and in a terminal or text editor it is simple to
>> display tab-separated columns clearly (with 'tabs' command, or with tab length in
>> an editor).

Since in editors there's no single static configuration, that is rather a
problem than an advantage (as illustrated above).

<OT>
>> The hardest part in converting a CSV to a TSV is working out how double quotes
>> were used in the CSV - there are too many different cases!

There's no doubt about the problems with CSV; we seem to have agreed on that.
There are even more problems with it; e.g. it's non-standardized, there are
many "CSV" formats, and the escaping (of quotes or delimiters) inside the data
is incoherent across existing versions. Substituting a comma or semicolon
delimiter in CSV data by a Tab does not solve the inherent problems of those
CSV formats.

I don't think that converting CSV is the way to go, since if you can convert
it reliably you can also just process it the same way the converter does.
With an explicit converter you may think you spare duplicated effort, but
given that there's no single standard (as mentioned already) it's better to
create the data in an appropriate format in the first place.
</OT>

Luckily we have (as opposed to those primitive Unix tools) the option to
specify powerful separators in Awk to handle most cases of Real World Data.

Janis

Marc de Bourget

Jun 16, 2016, 4:21:30 AM
You should use a CSV editor instead of a text editor for working with CSV or TSV files or viewing them. At least for Windows there are great CSV editors available, like "Ron's Editor".

A pipe is only good for visualizing fields in a simple text file format. I also use it, but only for demonstration purposes in the AWK Google group. I can't use it elsewhere because the data in our company contains "|" as a field value.

So, in real professional life, tab (aka "\t") is the only secure field separator. I use it daily for database import and export without any issues. I preprocess and postprocess the data with AWK. The tab separator is the best and simplest solution, with no need for escaping anything with quotes. Therefore, it is best suited for native AWK field handling ($1, $2, ...).

Kenny McCormack

Jun 16, 2016, 7:53:03 AM
In article <8fae91ef-ad21-4b33...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
>You should use a CSV editor instead of a text editor for working with CSV or
>TSV files or viewing them. At least for Windows there are great CSV
>editors available, like "Ron's Editor".

Yes, correct. The idea that you should be able to visually inspect and/or
use a conventional text editor on a delimited file is archaic. In fact,
when all is said and done, the best "CSV editor" is Excel.

BTW, well done on changing the Subject: header. I always do that when
needed, but most people don't. Would be good if more people followed
(y)our lead on this.

>A pipe is only good for visualizing fields in a simple text file format. I
>also use it, but only for demonstration purposes in the AWK Google group. I
>can't use it elsewhere because the data in our company contains "|" as a
>field value.

Exactly.

>So, in real professional life, tab (aka "\t") is the only secure field separator.

Yeah, it is pretty clear that Janis just made an error here.
It's OK. We all make mistakes from time to time. He just needs to accept
it and move on.

--
They say compassion is a virtue, but I don't have the time!

- David Byrne -

Kaz Kylheku

Jun 16, 2016, 10:29:30 AM
On 2016-06-16, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> In article <8fae91ef-ad21-4b33...@googlegroups.com>,
> Marc de Bourget <marcde...@gmail.com> wrote:
>>You should use a CSV editor instead of a text editor for working with CSV or
>>TSV files or viewing them. At least for Windows there are great CSV
>>editors available, like "Ron's Editor".
>
> Yes, correct. The idea that you should be able to visually inspect and/or
> use a conventional text editor on a delimited file is archaic. In fact,
> when all is said and done, the best "CSV editor" is Excel.

A very good field separator is the null byte, using two
consecutive null bytes to terminate the record.

A text editor can be easily adapted to work with this.

Except that it won't nicely break lines on the double nulls, "vim -b"
will handle this.

Mike Sanders

Jun 16, 2016, 10:58:49 AM
Kenny McCormack <gaz...@shell.xmission.com> wrote:

> ...the best "CSV editor" is Excel.

Under Windows/Wine, see also CSVpad (GPL'd):

<http://www.trustfm.net/software/utilities/CSVpad.php>

--
Mike Sanders
www.peanut-software.com

Marc de Bourget

Jun 16, 2016, 11:12:18 AM
On Thursday, June 16, 2016 at 16:29:30 UTC+2, Kaz Kylheku wrote:
> A very good field separator is the null byte, using two
> consecutive null bytes to terminate the record.
>
> A text editor can be easily adapted to work with this.
>
> Except that it won't nicely break lines on the double nulls, "vim -b"
> will handle this.

I have never heard of anyone using the null byte as field separator.
This separator really sounds extremely inconvenient and cumbersome :-)

Kenny McCormack

Jun 16, 2016, 12:35:27 PM
In article <6328b632-4de2-4b92...@googlegroups.com>,
Marc de Bourget <marcde...@gmail.com> wrote:
>On Thursday, June 16, 2016 at 16:29:30 UTC+2, Kaz Kylheku wrote:
>> A very good field separator is the null byte, using two
>> consecutive null bytes to terminate the record.
>>
>> A text editor can be easily adapted to work with this.
>>
>> Except that it won't nicely break lines on the double nulls, "vim -b"
>> will handle this.
>
>I have never heard of anyone using the null byte as field separator.

What does it matter whether or not you've heard of it?

There are lots of perfectly valid and useful things in the world that any
one of us may happen to be unfamiliar with.

--
I'm building a wall.

Marc de Bourget

Jun 16, 2016, 1:18:43 PM
On Thursday, June 16, 2016 at 18:35:27 UTC+2, Kenny McCormack wrote:
> What does it matter whether or not you've heard of it?
>
> There are lots of perfectly valid and useful things in the world that any
> one of us may happen to be unfamiliar with.
>
> --
> I'm building a wall.

Of course if this is a good solution for Kaz he shall use it and be happy :-)

Bruce Horrocks

Jun 16, 2016, 2:15:32 PM
On 16/06/2016 15:29, Kaz Kylheku wrote:
> On 2016-06-16, Kenny McCormack <gaz...@shell.xmission.com> wrote:
>> In article <8fae91ef-ad21-4b33...@googlegroups.com>,
>> Marc de Bourget <marcde...@gmail.com> wrote:
>>> You should use a CSV editor instead of a text editor for working with CSV or
>>> TSV files or viewing them. At least for Windows there are great CSV
>>> editors available, like "Ron's Editor".
>>
>> Yes, correct. The idea that you should be able to visually inspect and/or
>> use a conventional text editor on a delimited file is archaic. In fact,
>> when all is said and done, the best "CSV editor" is Excel.

Back in 1967, those nice people who invented ASCII included codes for
record separator (code 30) and unit (field) separator (code 31). But for
whatever reason, most terminal manufacturers failed to create visible
representations for them, which meant that a listed or edited file
typically showed them as a space, or worse, didn't show anything at all
so adjacent fields ran together.

If these codes had been shown in some visible manner by the terminal by
default then I suspect we would still be using them today and there
would be no problem.

> A very good field separator is the null byte, using two
> consecutive null bytes to terminate the record.

I humbly suggest that ASCII 31 is a very good field separator since it
was expressly intended for that purpose. :-)

> A text editor can be easily adapted to work with this.
>
> Except that it won't nicely break lines on the double nulls, "vim -b"
> will handle this.

Vim will happily display ASCII 31 (as the ^_ digraph) and those equally
nice Unicode people have defined representations for ASCII codes zero to
31 in the U+2400 block, so any Unicode-compliant file viewer or editor
should consider using those when displaying low-ASCII.
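
In awk that would be a sketch like the following, assuming ASCII 31 between
fields and ordinary newline-terminated records:

BEGIN { FS = "\037" }    # octal 037 == decimal 31, the ASCII unit separator
{ print NF " fields; first is " $1 }

(RS = "\036" would select the ASCII record separator instead of newline.)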

--
Bruce Horrocks
Surrey
England
(bruce at scorecrow dot com)

Kaz Kylheku

Jun 16, 2016, 3:46:17 PM
On 2016-06-16, Kenny McCormack <gaz...@shell.xmission.com> wrote:
> In article <6328b632-4de2-4b92...@googlegroups.com>,
> Marc de Bourget <marcde...@gmail.com> wrote:
>>On Thursday, June 16, 2016 at 16:29:30 UTC+2, Kaz Kylheku wrote:
>>> A very good field separator is the null byte, using two
>>> consecutive null bytes to terminate the record.
>>>
>>> A text editor can be easily adapted to work with this.
>>>
>>> Except that it won't nicely break lines on the double nulls, "vim -b"
>>> will handle this.
>>>
>>I have never heard of anyone using the null byte as field separator.
>
> What does it matter whether or not you've heard of it?

On Linux, /proc/self/environ dumps null-terminated entries.
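
A gawk-only sketch to read it (POSIX awk does not guarantee NUL handling in
RS or in the data):

gawk 'BEGIN { RS = "\0" } { print NR ": " $0 }' /proc/self/environ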

In the Microsoft world, a multi-string Registry value consists of
null terminated strings, with a double null at the end.

This is also the representation of environment variables returned,
as a single block of memory, by the Win32 GetEnvironmentStrings
function:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms683187(v=vs.85).aspx

Because C has made null-terminated strings so widespread, you're
unlikely to run into a situation where nulls have to be part of the
data. So there is no need to escape them.

Where this is cumbersome is if you have to quote the data
in text documents. For that, CSV can be used, with all the escaping.

The reason it is cumbersome in those situations is precisely that NUL
is not expected to be a part of text, which is the virtue that
makes it convenient for communication and storage.

Kaz Kylheku

Jun 16, 2016, 3:47:05 PM
There is an obvious problem with using a double null as a record
separator in that it creates an ambiguity with empty fields.

Something more workable is:

* NUL-NUL terminates every field.
* NUL-NL terminates records.
* NUL-<anything-else> is erroneous

Thus this is a complete record, whose fields correspond
to the CSV "abc", "def", "":

abc[NUL][NUL]def[NUL][NUL][NUL][NUL][NUL][NL]

pjfarley3

Jun 17, 2016, 12:10:30 AM
On Wednesday, June 15, 2016 at 10:17:49 AM UTC-4, Larry W. Virden wrote:
> Hello!
>
> Assume a data file such as this
>
> "20";"aaa20"
> "21";"abc21"
> "23";"xyz00"
> "24";"pdq24"
<Snipped>
> Does anyone have any ideas on how to do this?

Your data looks like pretty regular CSV format. Perhaps this can help.

Search for "csvsplit" in the archives of this forum for a little CSV function originally authored by Harlan Grove (a past contributor to this forum) and edited a little bit by me. Look for the debug-clean version that I posted in 2014 rather than the old 1999 version which has a number of unneeded debugging lines.

Invoke it as "ncsv = csvsplit()". When it is done the data for the current awk input line will be in array elements f[1], f[2], ... f[ncsv] with quotes stripped.
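
A usage sketch, assuming the archived csvsplit() function is pasted into the
same program:

BEGIN { FS = ";" }
{
    ncsv = csvsplit()
    for (i = 1; i <= ncsv; i++)
        print NR, i, f[i]    # the dequoted field values
}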

HTH

Peter

Marc de Bourget

Jun 17, 2016, 5:02:46 AM
Do you mean this link?
https://groups.google.com/forum/#!searchin/comp.lang.awk/csvsplit/comp.lang.awk/2a03yKWuaFg/HZtY2jYpx94J

It seems to work for reading data. For writing changed data, I'm not sure. E.g. after changing the unquoted version of f[4], how do I write it back to a quoted version for $0?

However, for reading it seems to work. I have added an optional debug parameter to CSVSPLIT-2.AWK:
call without debug: csvsplit($0), with debug: csvsplit($0, 1)

function csvsplit(n, debug,    newf, oldf, fs) {
    fs = FS
    oldf = newf = 0
    while ( oldf < NF ) {
        f[++newf] = $(++oldf)
        if ( f[newf] ~ /^"/ ) {
            while ( gsub(/"/, "\"", f[newf]) % 2 ) {
                if ( oldf >= NF ) {
                    if ((getline) > 0) {
                        if (debug) print NR "func:" $0 "."
                        oldf = 0
                        fs = "\n"
                    }
                    else break
                } else fs = FS
                f[newf] = f[newf] fs $(++oldf)
                tmpf = f[newf]
                tmpn = gsub(/"/, "\"", tmpf)
                if (debug) print "Debug-NR:" NR ". f[newf]:" f[newf] "."
                if (debug) print "(cont'd):" NR ". gsub(/\"/, \"\\\"\", f[newf]):" tmpn \
                                 ". % 2:" (tmpn % 2) "."
            }
            sub(/^"/, "", f[newf])
            sub(/"$/, "", f[newf])
            gsub(/""/, "\"", f[newf])
        }
        n = length(f[newf])
    }
    return newf
}

BEGIN {
    FS = ";"
    debug = 0
}

{
    n = csvsplit($0)
    print "Number of fields: " n " f[1]: #" f[1] "#"
    print "Number of fields: " n " f[2]: #" f[2] "#"
}

=> Result:
Number of fields: 2 f[1]: #20#
Number of fields: 2 f[2]: #aaa20#
Number of fields: 2 f[1]: #21#
Number of fields: 2 f[2]: #abc21#
Number of fields: 2 f[1]: #23#
Number of fields: 2 f[2]: #xyz00#
Number of fields: 2 f[1]: #24#
Number of fields: 2 f[2]: #pdq24#

Marc de Bourget

Jun 17, 2016, 7:36:49 AM
My "debug = 0" initialisation was of course not correct.
This is another variable than the parameter name "debug".

Marc de Bourget

Jun 17, 2016, 7:55:38 AM
Here is the amended version, without the false and unneeded "debug" initialisation:

{
    n = csvsplit($0)
    print "Number of fields: " n " f[1]: #" f[1] "#"
    print "Number of fields: " n " f[2]: #" f[2] "#"
}

I haven't yet scrutinized the original csvsplit algorithm for correctness.

Marc de Bourget

Jun 18, 2016, 5:09:06 AM
OFS = "\t"
}

{
n = csvsplit($0)
for (i=1; i<=n; i++) {
if (i < n)
printf("%s%s", f[i], OFS)
else
printf("%s\n", f[i])
}
}

The csvsplit author's code can in any case be useful for converting a quoted semicolon "csv" version to an unquoted tab "tsv" version, see example above.

If I have time, I'll write a csvquote function for writing unquoted content back to quoted content with several options.
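
A minimal sketch of what such a csvquote step might look like (the function
name, the semicolon delimiter and the quote-doubling rule are assumptions
about the target format):

function csvquote(n,    i, v, out) {
    out = ""
    for (i = 1; i <= n; i++) {
        v = f[i]
        gsub(/"/, "\"\"", v)                      # double any embedded quotes
        out = out (i > 1 ? ";" : "") "\"" v "\""
    }
    return out    # e.g. $0 = csvquote(n) to rebuild the record
}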

pjfarley3

Jun 19, 2016, 9:52:51 AM
On Friday, June 17, 2016 at 5:02:46 AM UTC-4, Marc de Bourget wrote:
> On Friday, June 17, 2016 at 06:10:30 UTC+2, pjfarley3 wrote:
<Snipped>
That's the first one from 1999; I really wanted you to find this one from 2014:

https://groups.google.com/forum/?hl=en#!searchin/comp.lang.awk/csvsplit/comp.lang.awk/HdBfRmyT1Og/YrKyzEYBo4QJ

That 2014 post has the "clean" version.

As for writing it out as CSV again, I never thought to use csvsplit() for that function. When I have needed to use csvsplit() I usually am either filtering or summarizing or converting the data into some other format anyway, so outputting as CSV has not been a frequent need.

The csvsplit function does not change the contents of $0, $1, ..., $NF, so you can always just replace the field(s) you changed, re-constitute $0 and write that out again.

HTH

Peter

charlemagn...@gmail.com

Jun 19, 2016, 10:43:22 AM
> When I have needed to use csvsplit() I usually am either filtering or summarizing or converting the data into some other format anyway, so outputting as CSV has not been a frequent need.

That's how I've used it, with the jq program to convert JSON to CSV and then csvsplit() to process the data in awk. Easier than parsing JSON in awk.
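
Hypothetical shape of that pipeline (the field names .id and .name are made
up, and csvsplit() from this thread is assumed to be included; jq's @csv
emits quoted, comma-separated output, hence FS = ","):

jq -r '.[] | [.id, .name] | @csv' data.json |
awk 'BEGIN { FS = "," }
     { n = csvsplit(); print f[1] "\t" f[2] }'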

Marc de Bourget

Jun 19, 2016, 12:56:45 PM
OK. Great, thank you Peter!
Then the final solution for the OP's original question, with your 2014 csvsplit function version, may look like this:

function csvsplit(    newf, oldf, fs) {
    fs = FS
    oldf = newf = 0
    while ( oldf < NF ) {
        f[++newf] = $(++oldf)
        if ( f[newf] ~ /^"/ ) {
            while ( gsub(/"/, "\"", f[newf]) % 2 ) {
                if ( oldf >= NF ) {
                    if ((getline) > 0) {
                        oldf = 0
                        fs = "\n"
                    }
                    else break
                }
                else fs = FS
                f[newf] = f[newf] fs $(++oldf)
            }
            sub(/^"/, "", f[newf])
            sub(/"$/, "", f[newf])
            gsub(/""/, "\"", f[newf])
        }
    }
    return newf
}

BEGIN {
    FS = ";"
}

{
    csvsplit()
    if (f[1] != substr(f[2], length(f[2])-length(f[1])+1))
        print $0
}
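
Run against the OP's sample file, an invocation sketch like

awk -f final.awk myfile

(the file name final.awk is made up) should print only the one non-matching
record:

"23";"xyz00"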