On 05.12.2015 00:57, Bruce Horrocks wrote:
> On 25/11/2015 22:01, Janis Papanagnou wrote:
>> On 25.11.2015 22:20, Doc Trins O'Grace wrote:
>>> Any of you all have a clever way of determining what kind of delimiter a
>>> given file contains? I'm imagining an awk program that could scan a file
>>> and recommend comma, pipe, space, etc.
>>
>> x,| ;y,| ;z,| ;q
>>
>> What is the delimiter of a file like this? (And this is even regular, i.e.
>> a comparably simple case.)
>>
>> Either you'd have to know in advance how the delimiters are defined or you
>> have to know how the data is defined. If you can say that, e.g., data only
>> contains alpha-numeric characters then all you have to do is match any
>> (or the first) non-alpha-numeric character. Delimiters may also be sequences
>> of characters, but since those can be expressed as regular expressions the
>> solution is all the same; e.g. use match() or gsub(), and substr() on the
>> result.
>
> That's a bit pessimistic.

(This was just condensed experience from practice. Sadly it doesn't leave
much room for optimism.)

> The whole point of using delimited files is that
> they are supposed to be consistent[*].

There's a lot of consistent data around (beyond all those CSV variations),
where you can't tell the separators without context information just by
a simple algorithm like the one you suggested.

> So you can have a good guess simply by
> looking at the first two rows and seeing which delimiter appears the same
> number of times in both.

No, sadly you can't use such a simple approach. In case you don't like the
artificial example I posted upthread to demonstrate the inherent problem,
just consider this very consistent format, e.g.,

    x , y , z , q

Is the delimiter /,/ or / / or / , / ? From visual inspection it seems
obvious, but technically it's hard to make the right decision (on unknown
files); your program below (if I understand the code correctly)
won't recognize it.
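
To make the ambiguity concrete, here is a minimal sketch of the "same count
in the first two rows" idea (in Python rather than awk, purely for
illustration). The candidate list, the function name consistent_candidates,
and the second sample row are assumptions of this sketch; it is not the
program referred to above.

```python
# Hypothetical sketch of the "same count in both rows" heuristic.
# The candidate list is an assumption; a real file may use none of these.
CANDIDATES = [",", " ", " , ", "|", ";", "\t"]

def consistent_candidates(row1, row2, candidates=CANDIDATES):
    """Return every candidate that occurs at least once and equally
    often in both rows (non-overlapping occurrences)."""
    return [c for c in candidates
            if row1.count(c) > 0 and row1.count(c) == row2.count(c)]

# The second row is invented for illustration.
rows = ["x , y , z , q", "a , b , c , d"]
print(consistent_candidates(*rows))  # [',', ' ', ' , ']
```

Three candidates survive the test, so counting alone cannot decide between
/,/ and / / and / , / here.
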
Or look at

    Papanagnou, Janis,Europe,Third stone from the Sun

Are there three or four fields here? If we assume consistent separators
it's three, since "Papanagnou, Janis" is a common way to format a name
attribute and the other attributes don't have that space as part of the
field separator.
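
The two readings can be compared side by side. This is only a sketch (again
in Python, not awk); the variable names are made up, and the "comma not
followed by a space" rule is just one possible way to express the second
reading.

```python
import re

line = "Papanagnou, Janis,Europe,Third stone from the Sun"

# Reading 1: every comma is a field separator.
by_comma = line.split(",")

# Reading 2: only a comma NOT followed by a space is a separator,
# so the ", " inside the name stays part of the data.
by_bare_comma = re.split(r",(?! )", line)

print(len(by_comma), by_comma)
print(len(by_bare_comma), by_bare_comma)
```

The first reading yields four fields (with a leading space in " Janis"),
the second yields three. Nothing in the line itself says which one the
producer of the file intended.
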
And then you have the very common "CSV" (whatever one means by that in
any specific context). (But you excluded this in your argument.)

Yes, if you can exclude some strange forms, or if you can make assumptions
(like your "one character delimiter" and "delimiter not part of the data"
implicit assumption) you can write programs (like yours) to handle subsets
of delimited data formats.

Another observation: plain heuristics are not guaranteed to work in
the general case.

But generally, if you need reliability, it all boils down to what I said
upthread:
"Either you'd have to know in advance how the delimiters are defined
or you have to know how the data is defined."
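
As a sketch of the second branch, "knowing how the data is defined": if the
data is known to be purely alphanumeric (ASCII), the first run of other
characters must be the delimiter. The example is in Python rather than awk,
and the function name first_delimiter is hypothetical; in awk the same idea
would use match() and substr().

```python
import re

def first_delimiter(line):
    """Assuming fields contain only ASCII letters and digits, return the
    first run of other characters, or None if the line has none."""
    m = re.search(r"[^A-Za-z0-9]+", line)
    return m.group(0) if m else None

line = "x,| ;y,| ;z,| ;q"          # the example from upthread
delim = first_delimiter(line)
fields = line.split(delim)

print(repr(delim))   # ',| ;'
print(fields)        # ['x', 'y', 'z', 'q']
```

This only works because of the stated assumption about the data; drop that
assumption and the same line is ambiguous again.
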
All that said, and going further: I'm confident that you can elaborate a
heuristic pattern matching algorithm, one that is more complex than the
code below, to more reliably (but still not reliably) find consistently
defined separators. This is certainly an interesting theoretical exercise.
In practice it's often simpler to demand a specification of the data.
(Or manually inspect the data you've got to define a hopefully fitting
heuristic that is still valid with the next data shipment. - Been there,
abandoned all hope.)

Janis