Hi 4tH-ers!
Let's be honest - XML and JSON are a horror to parse, unless you add some hefty library to do the heavy lifting. CSV is a bit better, but it has some quirks that don't make it easy either, unless you use the 4tH libraries.
And there are a few dedicated CSV libraries. PARSING allows you to parse a CSV field - but you need CSVFROM to get the double quotes right. And even then, all these libs cannot handle embedded line breaks (which RFC 4180 requires). So no - it's not a walk in the park.
I've read about TSV, an informal format that uses TABs instead of commas as delimiters. But the problem is - what if a TAB is part of a field?
Now - I've been playing with that idea for quite a while. What can we do to improve on that? And yesterday I sat down and implemented it.
When offered a field, the TSV format requires you to escape the following characters:
0 (null), 7 (bell), 8 (backspace), 9 (tab), 10 (linefeed), 11 (vertical tab), 12 (formfeed), 13 (carriage return), 27 (escape), 92 (backslash).
So - if a field contains a tab, it is escaped. Line breaks are no problem either, for the same reason. AFAIK, UTF-8 works fine with this scheme as well.
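To make the scheme concrete, here is a minimal sketch in Python (not 4tH) of escaping and unescaping a field. The exact escape letters are an assumption on my part - C-style mnemonics for the ten characters listed above; the actual 4tH library may encode them differently.

```python
# One possible encoding of the escape scheme: each of the ten listed
# characters becomes a backslash plus a C-style mnemonic letter.
# NOTE: the mnemonic letters are an assumption, not taken from TSV-W.4tH.
ESCAPES = {
    0x00: '0',   # null
    0x07: 'a',   # bell
    0x08: 'b',   # backspace
    0x09: 't',   # tab
    0x0A: 'n',   # linefeed
    0x0B: 'v',   # vertical tab
    0x0C: 'f',   # formfeed
    0x0D: 'r',   # carriage return
    0x1B: 'e',   # escape
    0x5C: '\\',  # backslash
}
UNESCAPES = {v: chr(k) for k, v in ESCAPES.items()}

def escape_field(s: str) -> str:
    """Escape every listed character; everything else passes through."""
    return ''.join('\\' + ESCAPES[ord(c)] if ord(c) in ESCAPES else c
                   for c in s)

def unescape_field(s: str) -> str:
    """Reverse escape_field; the result is never longer than the input,
    which is why unescaping in place is safe."""
    out, i = [], 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s) and s[i + 1] in UNESCAPES:
            out.append(UNESCAPES[s[i + 1]])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)
```

Since every escape replaces one character with two, unescaping only ever shrinks the string - which is the property that makes in-place unescaping safe.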
The ESCAPE library is fully capable of "unescaping" fields - which can be done safely, because nothing is actually expanded.
You can reduce the "parsing" part of this format to a simple 9 PARSE. If you want to know whether you have reached EOL, simply use this definition:
: Field?> >in @ 9 parse s>escape rot >in @ < ;
This adds a flag to the parsed string (false at EOL). The library in question is called TSV-W.4tH and is in SVN.
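For those who want to see the read side spelled out, here is a hedged sketch in Python (not 4tH): because linefeeds inside fields are escaped, a record always fits on one physical line, so parsing really does reduce to split-on-tab plus unescaping each field - the Python analogue of 9 PARSE followed by S>ESCAPE. The escape letters (C-style mnemonics) are again my assumption, not the library's documented encoding.

```python
# Read side of the scheme: one record per physical line, tab is the only
# unescaped delimiter. The mnemonic letters below are an assumption.
UNESCAPES = {'0': '\0', 'a': '\a', 'b': '\b', 't': '\t', 'n': '\n',
             'v': '\v', 'f': '\f', 'r': '\r', 'e': '\x1b', '\\': '\\'}

def unescape_field(s: str) -> str:
    """Turn backslash escapes back into the original control characters."""
    out, i = [], 0
    while i < len(s):
        if s[i] == '\\' and i + 1 < len(s) and s[i + 1] in UNESCAPES:
            out.append(UNESCAPES[s[i + 1]])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return ''.join(out)

def parse_record(line: str) -> list:
    """Split one line into fields: every real tab is a delimiter,
    since tabs inside field data were escaped on the write side."""
    return [unescape_field(f) for f in line.rstrip('\n').split('\t')]
```

Note there is no quoting state machine and no look-ahead across lines - exactly the simplification that RFC 4180-style CSV cannot offer.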
Tell me what you think!
Hans Bezemer