Strings in Fancy

ckruse

Jun 15, 2010, 5:47:51 AM

to The Fancy Programming Language

Hi there,

Did you think about string encodings in Fancy? Did you think about
source code encodings?

I think it's important to define things like that at an early
stage of development, to avoid backward-incompatible changes and to
provide fully transparent support for string encodings.

A look at your source code for strings (string.cc, string.fnc,
bootstrap/string.cc) tells me that there is no encoding support yet.
What do you think about defining the input encoding for source files
as UTF-8 (the most common Unicode encoding) and adding some more or
less basic encoding support for strings? We also have to add some
support for IO routines, don't we?

Greetings,
CK

Christopher Bertels

Jun 15, 2010, 7:53:09 AM

to fancy-lang

Hi,

Excerpts from ckruse's message of Tue Jun 15 11:47:51 +0200 2010:

> did you think about string encodings in fancy? Did you think about
> source code encodings?

Yes, I have thought about it but haven't implemented it in code yet.

> I think it's important to define things like that in the early
> development stage to avoid backward incompatible changes and to
> provide fully transparent support for string encodings.

Good point. I must admit, I haven't worked on string encoding stuff
before. Would you like to add support for different encodings for strings?

> A look at your source code for strings (string.cc, string.fnc,
> bootstrap/string.cc) tells me that there is no encoding support yet.
> What do you think about defining the input encoding for source files
> as UTF-8 (the most common Unicode encoding) and adding some more or
> less basic encoding support for strings? We also have to add some
> support for IO routines, don't we?

Yeah, UTF-8 sounds good. That's also what I would have suggested.
What exactly do you mean by "support for IO routines"?

Cheers,
Christopher.
--
================================
Christopher Bertels
http://www.fancy-lang.org
http://www.adztec-independent.de
GPG Key ID: 0x2345b203

Christian Kruse

Jun 15, 2010, 8:11:01 AM

to fancy...@googlegroups.com

Hi,

On 15.06.2010 at 13:53, Christopher Bertels wrote:

> Excerpts from ckruse's message of Tue Jun 15 11:47:51 +0200 2010:
>> did you think about string encodings in fancy? Did you think about
>> source code encodings?
>
> Yes, I have thought about it but haven't implemented it in code yet.

Can you explain what you had in mind? Something like Java's encoding support (everything is transparently converted to UTF-16BE), or more like Ruby's encoding support (every string object can have its own encoding, and operations like concatenation convert strings between encodings, if necessary and possible)?

>> I think it's important to define things like that in the early
>> development stage to avoid backward incompatible changes and to
>> provide fully transparent support for string encodings.
>
> Good point. I must admit, I haven't worked on string encoding stuff
> before. Would you like to add support for different encodings for strings?

Sounds very interesting, so: yes, I'd like to ;)

> What exactly do you mean by "support for IO routines"?

Well, strings read by IO routines have to have an encoding, too. We should define how this works: should the user have to provide the encoding of a file? Or should it be optional, with a default encoding used if none is specified? I don't really know; Ruby and Java do it this way (you CAN specify an encoding, but you don't have to; if you don't, the default encoding is used). I think this is a reasonable way, isn't it?
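
To make the "optional encoding with a default" idea concrete, here is a minimal sketch in Python (not Fancy). The names read_text and DEFAULT_ENCODING are made up for illustration, and UTF-8 as the default is only an assumption at this point:

```python
import os
import tempfile

# Assumption: UTF-8 is the default when the caller names no encoding.
DEFAULT_ENCODING = "utf-8"


def read_text(path, encoding=None):
    """Read a file as text, falling back to DEFAULT_ENCODING if none is given."""
    with open(path, "r", encoding=encoding or DEFAULT_ENCODING) as handle:
        return handle.read()


def demo():
    """Write the same text in two encodings and read both files back."""
    text = "Gr\u00fc\u00dfe"
    with tempfile.TemporaryDirectory() as tmp:
        utf8_path = os.path.join(tmp, "utf8.txt")
        latin1_path = os.path.join(tmp, "latin1.txt")
        with open(utf8_path, "wb") as handle:
            handle.write(text.encode("utf-8"))
        with open(latin1_path, "wb") as handle:
            handle.write(text.encode("iso-8859-1"))
        # The first call relies on the default, the second names the encoding.
        return read_text(utf8_path), read_text(latin1_path, encoding="iso-8859-1")
```

Both calls in demo() return the same text; only the second one has to say which encoding the file uses.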

Best regards,
CK

Christopher Bertels

Jun 15, 2010, 8:51:40 AM

to fancy-lang

Excerpts from Christian Kruse's message of Tue Jun 15 14:11:01 +0200 2010:

> Can you explain what you had in mind? Something like Java's encoding support (everything is transparently converted to UTF-16BE), or more like Ruby's encoding support (every string object can have its own encoding, and operations like concatenation convert strings between encodings, if necessary and possible)?

I don't know the details of how Java deals with this stuff and what
its limitations are. Having one basic encoding that everything gets
converted to seems straightforward, but might it cause problems if new
encodings are added that don't map well to the standard
one? Or am I wrong with this assumption?

So I suppose Ruby's approach might be better?

> Sounds very interesting, so: yes, I'd like to ;)

Cool, go for it then and tell me when you've got it somewhat working ;)

> > What exactly do you mean by "support for IO routines"?
>
> Well, strings read by IO routines have to have an encoding, too. We should define how this works: should the user have to provide the encoding of a file? Or should it be optional, with a default encoding used if none is specified? I don't really know; Ruby and Java do it this way (you CAN specify an encoding, but you don't have to; if you don't, the default encoding is used). I think this is a reasonable way, isn't it?

Ah OK. Yeah, I agree. Having a default encoding if none is specified
sounds reasonable. UTF-8 would be a good choice, right?

Christian Kruse

Jun 15, 2010, 9:09:20 AM

to fancy...@googlegroups.com

Hi there,

On 15.06.2010 at 14:51, Christopher Bertels wrote:

> I don't know the details of how Java deals with this stuff and what
> its limitations are. Having one basic encoding that everything gets
> converted to seems straightforward, but might it cause problems if new
> encodings are added that don't map well to the standard
> one? Or am I wrong with this assumption?

To get around this, one uses something like UTF-16 or UTF-32 internally. Every code point defined by the Unicode standard can be encoded in these encodings. UTF-16 has the disadvantage of variable-length characters (some need 16 bits, some need 32 bits), and UTF-32 has the disadvantage of consuming a lot of memory (32 bits for every single character).
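
A quick way to see the size trade-off is to count the bytes each encoding needs for the same text. This is just an illustrative Python snippet; encoded_sizes is a made-up helper name, and the byte order is spelled out so that no BOM is counted:

```python
def encoded_sizes(text):
    """Return the number of bytes the text needs in each Unicode encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # explicit byte order, no BOM
        "utf-32": len(text.encode("utf-32-be")),  # explicit byte order, no BOM
    }


# Five ASCII characters.
ascii_sizes = encoded_sizes("fancy")
# One code point outside the Basic Multilingual Plane (U+1D11E, musical G clef).
astral_sizes = encoded_sizes("\U0001D11E")

print(ascii_sizes)   # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
print(astral_sizes)  # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

The ASCII text costs four times as much in UTF-32 as in UTF-8, while the single character outside the BMP needs two 16-bit units in UTF-16.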

> So I suppose Ruby's approach might be better?

I like the Ruby way, too; it fits the TIMTOWTDI paradigm better. But it has other disadvantages. E.g., what should happen if a Big5-encoded string is concatenated with a, well, ISO-8859-8-encoded string? These encodings are really incompatible. Ruby throws an exception in these cases.

>>> What exactly do you mean by "support for IO routines"?
>>
>> Well, strings read by IO routines have to have an encoding, too. We should define how this works: should the user have to provide the encoding of a file? Or should it be optional, with a default encoding used if none is specified? I don't really know; Ruby and Java do it this way (you CAN specify an encoding, but you don't have to; if you don't, the default encoding is used). I think this is a reasonable way, isn't it?
>
> Ah OK. Yeah, I agree. Having a default encoding if none is specified
> sounds reasonable. UTF-8 would be a good choice, right?

UTF-8 as default with graceful degradation to US-ASCII if there are byte sequences that are not valid UTF-8? That seems more robust to me.
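
One possible reading of that fallback, sketched in Python. decode_with_fallback is a hypothetical name, and "degrading" is assumed here to mean that ASCII bytes are kept and every other byte becomes the replacement character U+FFFD; other policies are just as conceivable:

```python
def decode_with_fallback(raw):
    """Decode bytes as UTF-8; if that fails, degrade to US-ASCII.

    Returns the decoded text together with the name of the encoding used.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: bytes outside US-ASCII are replaced by U+FFFD
        # instead of aborting the read.
        return raw.decode("ascii", errors="replace"), "us-ascii"


valid = decode_with_fallback(b"caf\xc3\xa9")   # well-formed UTF-8
broken = decode_with_fallback(b"caf\xe9")      # ISO-8859-1 bytes, not valid UTF-8
```

Returning the encoding name along with the text lets the caller notice that the input was not valid UTF-8.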

Greetings,
CK

Christopher Bertels

Jun 15, 2010, 9:26:33 AM

to fancy-lang

Excerpts from Christian Kruse's message of Tue Jun 15 15:09:20 +0200 2010:

> To get around this, one uses something like UTF16 or UTF32 internally.
> Every single code point defined by the unicode standard can be coded in these encodings.
> UTF16 has the disadvantage of having variable length characters (some need 16 bit,
> some need 32 bit) and UTF32 has the disadvantage of consuming much memory (32 bit for
> every single character).

32 bits for every character really is a lot and can hurt
performance in the long run, especially when only 8-bit characters are
needed and the programmer expects a smaller memory footprint for his code.

> > So I suppose Ruby's approach might be better?
>
> I like the Ruby way, too, it matches more the TIMTOWTDI paradigm.
> But it has other disadvantages. E.G. what should happen, if a Big5
> coded string should be concatenated with a, well, ISO-8859-8 coded string?
> These codings are really incompatible. Ruby throws an exception in these cases.

Hmm, well I guess that's something a programmer has to cope with. If
they really are that incompatible, the system should reflect it imho.

> > Ah OK. Yeah, I agree. Having a default encoding if none is specified
> > sounds reasonable. UTF-8 would be a good choice, right?
>
> UTF-8 as default with graceful degradation to US-ASCII if there are
> byte sequences that are not valid UTF-8? That seems more robust to me.

Yeah, that sounds good. A lot of code really only needs UTF-8 or ASCII
most of the time, so this should be fine.

Christian Kruse

Jun 15, 2010, 9:45:09 AM

to fancy...@googlegroups.com

Hi there,

On 15.06.2010 at 15:26, Christopher Bertels wrote:

> Excerpts from Christian Kruse's message of Tue Jun 15 15:09:20 +0200 2010:
>> To get around this, one uses something like UTF16 or UTF32 internally.
>> Every single code point defined by the unicode standard can be coded in these encodings.
>> UTF16 has the disadvantage of having variable length characters (some need 16 bit,
>> some need 32 bit) and UTF32 has the disadvantage of consuming much memory (32 bit for
>> every single character).
>
> 32 bits for every character really is a lot and can hurt
> performance in the long run, especially when only 8-bit characters are
> needed and the programmer expects a smaller memory footprint for his code.

Yes, true. That's 4 bytes per character (sic!).

>>> So I suppose Ruby's approach might be better?
>>
>> I like the Ruby way, too, it matches more the TIMTOWTDI paradigm.
>> But it has other disadvantages. E.G. what should happen, if a Big5
>> coded string should be concatenated with a, well, ISO-8859-8 coded string?
>> These codings are really incompatible. Ruby throws an exception in these cases.
>
> Hmm, well I guess that's something a programmer has to cope with. If
> they really are that incompatible, the system should reflect it imho.

Ok, so we throw an exception, too?

We could also do something like an automatic conversion to UTF-8 of the new string object when needed. E.g., we create a new UTF-8-encoded string that contains the concatenated Big5 and ISO-8859-8 strings. I would prefer this method, since there is no need to bother about encodings until it comes to I/O. I think it should be fully transparent to the programmer if he wants it to be, but he should have access to this information if he needs it.
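
Roughly what I have in mind, as a Python sketch rather than Fancy code. EncodedString is a made-up type for illustration only: it stores raw bytes plus the name of their encoding, and concatenation transcodes to UTF-8 only when the two encodings differ:

```python
class EncodedString:
    """Raw bytes plus the name of the encoding they are stored in (hypothetical)."""

    def __init__(self, data, encoding="utf-8"):
        self.data = data
        self.encoding = encoding

    def __add__(self, other):
        # Same encoding on both sides: the bytes can simply be joined.
        if self.encoding == other.encoding:
            return EncodedString(self.data + other.data, self.encoding)
        # Different encodings: decode both sides and store the result as UTF-8.
        text = self.data.decode(self.encoding) + other.data.decode(other.encoding)
        return EncodedString(text.encode("utf-8"), "utf-8")

    def __str__(self):
        return self.data.decode(self.encoding)


# A Big5 string and an ISO-8859-8 string, as in the example above.
chinese = EncodedString("\u4e2d\u6587".encode("big5"), "big5")
hebrew = EncodedString("\u05e9\u05dc\u05d5\u05dd".encode("iso-8859-8"), "iso-8859-8")
combined = chinese + hebrew
```

The programmer can still inspect combined.encoding when it matters, but does not have to care about it before the string reaches I/O.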

>>> Ah OK. Yeah, I agree. Having a default encoding if none is specified
>>> sounds reasonable. UTF-8 would be a good choice, right?
>>
>> UTF-8 as default with graceful degradation to US-ASCII if there are
>> byte sequences that are not valid UTF-8? That seems more robust to me.
>
> Yeah, that sounds good. A lot of code really only needs UTF-8 or ASCII
> most of the time, so this should be fine.

Fine! :)

Greetings,
CK

Christopher Bertels

Jun 15, 2010, 10:00:35 AM

to fancy-lang

Excerpts from Christian Kruse's message of Tue Jun 15 15:45:09 +0200 2010:

> Ok, so we throw an exception, too?

I'd say yes.

> We could also do something like an automatic conversion to UTF-8 of
> the new string object when needed. E.g., we create a new
> UTF-8-encoded string that contains the concatenated Big5 and ISO-8859-8
> strings. I would prefer this method, since there is no need to bother
> about encodings until it comes to I/O. I think it should be fully
> transparent to the programmer if he wants it to be, but he should have
> access to this information if he needs it.

Alright, you can handle all the details. I'm pretty much fine as long
as the current Fancy code works the same way as before ;)