UTF-8 support


matrixik

Jul 14, 2010, 1:22:15 PM
to csvfix
Hello, first of all I want to thank you for this excellent program.

I read that you don't plan to implement UTF-8 support, but maybe
you will change your mind when you
read the official Google post about UTF-8 on the internet:
http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html

What does that post show? That UTF-8 is now more commonly used on the
internet than ASCII and is still growing rapidly. By the way, that post is
2 years old.

Thank you again.

Cheers
Dobrosław Żybort

Jonathan Leffler

Jul 15, 2010, 2:41:11 AM
to csv...@googlegroups.com
What do you think is needed for csvfix to support UTF-8?

UTF-8 is carefully designed so that naïve programs that are modestly careful do not run into problems with it.

The only time I can think of where UTF-8 support would matter is if the field delimiter (normally comma), record delimiter (normally newline) or field quote (normally double quote) characters needed to be a multi-byte UTF-8 character rather than a single-byte character as all the default values are.
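To illustrate the point about single-byte delimiters, here is a minimal Python sketch (not csvfix's own code; `split_fields` is a hypothetical helper, and quoting is ignored). It assumes only the UTF-8 property that every byte of a multi-byte sequence is 0x80 or above, so a byte-oriented split on an ASCII comma cannot cut through a non-ASCII character.

```python
# Sketch only: a byte-oriented split on an ASCII comma applied to UTF-8 data.
# split_fields is a hypothetical helper, not part of csvfix; quoting is ignored.

def split_fields(line: bytes, delimiter: bytes = b",") -> list:
    """Split one record on a single-byte delimiter without decoding it first."""
    return line.split(delimiter)

text = "Dobrosław,Żybort,Kraków"
record = text.encode("utf-8")

# Multi-byte UTF-8 sequences use only bytes 0x80..0xFF, so the number of
# comma bytes (0x2C) in the encoded record equals the number of commas
# in the original text.
assert record.count(b",") == text.count(",")

# Each field is still valid UTF-8 after the byte-level split.
fields = [field.decode("utf-8") for field in split_fields(record)]
print(fields)  # ['Dobrosław', 'Żybort', 'Kraków']
```

The same reasoning applies to the default quote (0x22) and newline (0x0A) bytes.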

And, for that to be a practical problem, I think you would need to provide convincing evidence that there's a major software system that is routinely configured in some locale to use non-default values for the format.  It might be that somewhere in Asia, that is the case - you should document this carefully, and explain what the default values are in that locale.

Absent such evidence, I don't see a compelling reason for csvfix to do anything different from what it does now.

--
Jonathan Leffler <jonathan...@gmail.com>  #include <disclaimer.h>
Guardian of DBD::Informix - v2008.0513 - http://dbi.perl.org
"Blessed are we who can laugh at ourselves, for we shall never cease to be amused."

matrixik

Aug 14, 2010, 12:50:25 PM
to csvfix
Old but good article:

The Absolute Minimum Every Software Developer Absolutely, Positively
Must Know About Unicode and Character Sets (No Excuses!)
by Joel Spolsky
http://www.joelonsoftware.com/articles/Unicode.html

Cheers


Jonathan Leffler

Aug 14, 2010, 2:50:18 PM
to csv...@googlegroups.com
On Sat, Aug 14, 2010 at 9:50 AM, matrixik <matr...@gmail.com> wrote:
> Old but good article:
>
> The Absolute Minimum Every Software Developer Absolutely, Positively
> Must Know About Unicode and Character Sets (No Excuses!)
> by Joel Spolsky
> http://www.joelonsoftware.com/articles/Unicode.html

An excellent article; one which I have read several times.

My question to you stands unanswered:

* What do you think is needed for csvfix to support UTF-8?

I run csvfix on a system using UTF-8; I've not run into problems.
How have you run into problems?

Or, in other words, csvfix already interprets valid UTF-8 data.

csvfix supports UTF-8 for all character sets as long as the delimiter, quote and newline characters are from the single-byte (ASCII) portion of Unicode (U+0000..U+007F). It does not attempt to diagnose invalid UTF-8, and it does not support multi-byte delimiter, quote or newline characters (U+0080..U+10FFFF).
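As a rough illustration of those two limits, the Python sketch below checks whether a candidate delimiter fits in one UTF-8 byte and locates the first invalid byte in a buffer. Both helpers (`is_single_byte_in_utf8`, `first_invalid_utf8_offset`) are hypothetical names made up for this example; nothing here is taken from csvfix.

```python
# Sketch only: hypothetical helpers, not part of csvfix.
from typing import Optional

def is_single_byte_in_utf8(ch: str) -> bool:
    """True if ch is one character that encodes to exactly one UTF-8 byte."""
    return len(ch) == 1 and len(ch.encode("utf-8")) == 1

def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

print(is_single_byte_in_utf8(","))       # True: U+002C is one byte
print(is_single_byte_in_utf8("\u00e9"))  # False: U+00E9 takes two bytes
print(is_single_byte_in_utf8("\u3001"))  # False: U+3001 takes three bytes
print(first_invalid_utf8_offset("Żybort".encode("utf-8")))  # None
print(first_invalid_utf8_offset(b"abc\xff,def"))            # 3
```

A tool that wanted to reject bad input up front could run a check like the second one before parsing; whether that is worth the cost is a separate design question.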

 


--
Jonathan Leffler <jonathan...@gmail.com>  #include <disclaimer.h>
Guardian of DBD::Informix - v2008.0513 - http://dbi.perl.org