Identifying the Delimiter

Doc Trins O'Grace

unread,

Nov 25, 2015, 4:20:17 PM11/25/15

to

Any of you all have a clever way of determining what kind of delimiter a given file contains? I'm imagining an awk program that could scan a file and recommend coma, pipe, space, etc.

By the way, I have appreciate your help from time to time. Thank you for being willing to share your expertise with the rest of us.

Have a very happy and blessed Thanksgiving for those you who are states side.

Janis Papanagnou

unread,

Nov 25, 2015, 5:01:33 PM11/25/15

to

On 25.11.2015 22:20, Doc Trins O'Grace wrote:
> Any of you all have a clever way of determining what kind of delimiter a
> given file contains? I'm imagining an awk program that could scan a file
> and recommend coma, pipe, space, etc.

x,| ;y,| ;z,| ;q

What is the delimiter of a file like this? (And this is even regular, i.e.
a comparably simple case.)

Either you'd have to know in advance how the delimiters are defined or you
have to know how the data is defined. If you can say that, e.g., data only
contains alpha-numeric characters then all you have to do is matching any
(or the first) non-alpha-numeric character. Delimiters may also be sequences
of charecters, but since those can be expressed as regular expressions the
solution is all the same; e.g. use match() or gsub(), and substr() on the
result.

Janis

Joe User

unread,

Nov 25, 2015, 11:23:24 PM11/25/15

to

Doc Trins O'Grace wrote:

> Any of you all have a clever way of determining what kind of
delimiter a
> given file contains?

Not really.

The program 'file' will give a good idea about some file formats.

But then, your script has to be ready for any of the thousands of
file type descriptions that file can produce.

Janis Papanagnou

unread,

Nov 25, 2015, 11:47:53 PM11/25/15

to

On 26.11.2015 05:23, Joe User wrote:
> Doc Trins O'Grace wrote:
>
>> Any of you all have a clever way of determining what kind of
>> delimiter a given file contains?
>
> Not really.
>
> The program 'file' will give a good idea about some file formats.

But not about the delimiters, as requested by the OP. Whether the data
is pipe-delimited, comma-delimited, semicolon-delimited, or whatnot, my
file(1) command says only it's "ASCII text". It doesn't even identify it
[vaguely] as "CSV".

Janis

> [...]

Bruce Horrocks

unread,

Dec 4, 2015, 6:57:53 PM12/4/15

to

That's a bit pessimistic. The whole point of using delimited files is
that they are supposed to be consistent[*]. So you can have a good guess
simply by looking at the first two rows and seeing which delimiter
appears the same number of times in both.

NR == 1 { row1 = $0 }
NR == 2 { row2 = $0 }
NR >= 2 { exit } # guess from first 2 rows only
END {
delims = ",:| " # comma, colon, pipe, whitespace
for (i = 1; i <= length(delims); i++) {
s = substr(delims, i, 1)
n1 = split(row1, arr, s)
n2 = split(row2, arr, s)
if ( n1 > 1 && n1 == n2 ) {
likely = likely s
}
}
print "Likely delimiters are: " likely
}

[*] CSV excepted :-(

--
Bruce Horrocks
Surrey
England
(bruce at scorecrow dot com)

Janis Papanagnou

unread,

Dec 4, 2015, 9:08:06 PM12/4/15

to

On 05.12.2015 00:57, Bruce Horrocks wrote:
> On 25/11/2015 22:01, Janis Papanagnou wrote:
>> On 25.11.2015 22:20, Doc Trins O'Grace wrote:
>>> Any of you all have a clever way of determining what kind of delimiter a
>>> given file contains? I'm imagining an awk program that could scan a file
>>> and recommend coma, pipe, space, etc.
>>
>> x,| ;y,| ;z,| ;q
>>
>> What is the delimiter of a file like this? (And this is even regular, i.e.
>> a comparably simple case.)
>>
>> Either you'd have to know in advance how the delimiters are defined or you
>> have to know how the data is defined. If you can say that, e.g., data only
>> contains alpha-numeric characters then all you have to do is matching any
>> (or the first) non-alpha-numeric character. Delimiters may also be sequences
>> of charecters, but since those can be expressed as regular expressions the
>> solution is all the same; e.g. use match() or gsub(), and substr() on the
>> result.
>
> That's a bit pessimistic.

(This was just condensed experience from practice. Sadly it doesn't leave
much room for optimism.)

> The whole point of using delimited files is that
> they are supposed to be consistent[*].

There's a lot consistent data around (beyond all those CSV variations),
where you can't tell the separators without context information just by
a simple algorithm like the one you suggested.

> So you can have a good guess simply by
> looking at the first two rows and seeing which delimiter appears the same
> number of times in both.

No, sadly you can't use such simple approach. In case you don't like my
artificial example I posted upthread to demontrate the inherent problem,
just consider this very consistent format, e.g.,

x , y , z , q

Is the delimiter /,/ or / / or / , / ? From visual inspection it seems
obvious, but technically it's hard to make the right decision (on unknown
files); your program below (as far as I understand the code correctly)
won't recognize it.

Or look at

Papanagnou, Janis,Europe,Third stone from the Sun

Are there three or four fields here? - If assuming consistent separators
it's three since "Papanagnou, Janis" is a common way to format a name
attribute and the other attributes don't have that space as part of the
field separator.

And then you have the very common "CSV" (whatever one means by that in
any specific context). (But you excepted this in your argumentation.)

Yes, if you can except some strange forms, or if you can make assumptions
(like your "one character delimiter" and "delimiter not part of the data"
implicit assumption) you can write programs (like yours) to handle subsets
of delimited data formats.

Another observation is; plain heuristics are not guaranteed to work in
the general case.

But generally, if you need reliability, it all boils down to what I said
upthread:

"Either you'd have to know in advance how the delimiters are defined
or you have to know how the data is defined."

That all said and going farther; I'm confident that you can elaborate a
heuristic pattern matching algorithm, one that is more complex than the
code below, to more reliably (but still not reliable) find consistently
defined separators. This is certainly an interesting theoretical exercise.

In practise it's often simpler to demand a specification of the data.
(Or manually inspect the data you've got to define a hopefully fitting
heuristic that is still valid with the next data shipment. - Been there,
abandoned all hope.)

Janis

Bruce Horrocks

unread,

Dec 6, 2015, 8:58:20 AM12/6/15

to

On 05/12/2015 02:08, Janis Papanagnou wrote:
> On 05.12.2015 00:57, Bruce Horrocks wrote:

>> So you can have a good guess simply by
>> looking at the first two rows and seeing which delimiter appears the same
>> number of times in both.
>
> No, sadly you can't use such simple approach. In case you don't like my
> artificial example I posted upthread to demontrate the inherent problem,
> just consider this very consistent format, e.g.,
>
> x , y , z , q
>
> Is the delimiter /,/ or / / or / , / ? From visual inspection it seems
> obvious, but technically it's hard to make the right decision (on unknown
> files); your program below (as far as I understand the code correctly)
> won't recognize it.

I understood the original post to be asking for a way to check a set of
pre-defined possible delimiters, not for a 'general purpose delimiter
detector'. Reading the OP again I'm no longer so sure.

If the OP was asking for a GPDD then I agree with what you say.

If the reason behind the question was, say, a need to scan a few hundred
files in a directory and identify those that might be pipe delimited
versus those that might be comma delimited versus those that are just
plain text then the simple approach gets you somewhere. Not perfect, of
course, but one step on from the output of 'files'.

[snip]

>
> Another observation is; plain heuristics are not guaranteed to work in
> the general case.
>
> But generally, if you need reliability, it all boils down to what I said
> upthread:
>
> "Either you'd have to know in advance how the delimiters are defined
> or you have to know how the data is defined."
>
>
> That all said and going farther; I'm confident that you can elaborate a
> heuristic pattern matching algorithm, one that is more complex than the
> code below, to more reliably (but still not reliable) find consistently
> defined separators. This is certainly an interesting theoretical exercise.
>
> In practise it's often simpler to demand a specification of the data.
> (Or manually inspect the data you've got to define a hopefully fitting
> heuristic that is still valid with the next data shipment. - Been there,
> abandoned all hope.)

I've encountered systems where the XML isn't compatible with the stated
schema!

Doc Trins O'Grace

unread,

Dec 8, 2015, 11:53:45 AM12/8/15

to

Bruce, your "likely delimiter" was what I had in mind. Thank you!

Also, thanks for the rest of the discussion, folks, that is informative.