From: Richard Carlsson
Date: Tue, 31 Jul 2012 10:02:58 +0200
To: erlang-questi...@erlang.org
Subject: Re: [erlang-questions] unicode in string literals

On 07/31/2012 09:32 AM, Michel Rijnders wrote:
> IMO this doesn't solve the problem, and only confuses the issue;
> consider the following:
>
> test() ->
>     io:format("~w~n", ["Just my €0.02"]),
>     io:format("~w~n", [lists:reverse("Just my €0.02")]).
>
>> test().
> [74,117,115,116,32,109,121,32,226,130,172,48,46,48,50]
> [50,48,46,48,172,130,226,32,121,109,32,116,115,117,74]

Yes, this is what happens today, because all involved parts (including
the call to io:format with ~w) assume Latin-1 and just pass all the
bytes straight through.
Basically, it's your editor and terminal that are lying by displaying a
particular sequence of 3 bytes as € although the program is really using
Latin-1. They conspire against you to make you think that things are
working correctly.

> If the list data was kept as UTF-8 then the output of the second
> statement should be:
> [50,48,46,48,226,130,172,32,121,109,32,116,115,117,74]

That would only be the result if you used a single code point
representation for the input to reverse, and then converted the result
back to a byte encoding (e.g. by printing with ~ts).

> The above of course depends on whether you view strings as lists of
> bytes vs lists of characters.

Strings are lists of characters (code points), so when your example gets
through tokenization, the encoding from the file would already be
forgotten, and you'd have a single integer for the €. (The same goes for
atoms and variable names, by the way, so the answer to Vlad's question
is that these will also get a greater range of available characters.)
String manipulation functions should assume they are working on single
code points, not on a byte encoding.

    /Richard
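
To make the bytes-versus-code-points distinction concrete, here is a minimal sketch. The module name euro_demo and the function demo/0 are made up for illustration; the only library calls are unicode:characters_to_list/2, unicode:characters_to_binary/3 and lists:reverse/1. The UTF-8 bytes are written out as integers so that the result does not depend on how the source file is encoded.

```erlang
-module(euro_demo).
-export([demo/0]).

%% Hypothetical helper, not part of OTP: decode, reverse, re-encode.
demo() ->
    %% The UTF-8 byte sequence from the quoted example; the euro sign
    %% is the three bytes 226,130,172.
    Bytes = [74,117,115,116,32,109,121,32,226,130,172,48,46,48,50],
    %% Decode the bytes into a list of code points. The three euro
    %% bytes collapse into the single integer 8364.
    CodePoints = unicode:characters_to_list(list_to_binary(Bytes), utf8),
    %% Reversing now operates on characters, so the euro sign stays whole.
    Reversed = lists:reverse(CodePoints),
    %% Convert the reversed string back to a UTF-8 byte encoding.
    Encoded = binary_to_list(
                unicode:characters_to_binary(Reversed, unicode, utf8)),
    {CodePoints, Reversed, Encoded}.
```

The third element of the returned tuple should be the byte list Michel expected, with 226,130,172 in the right order, because the reversal was done on code points and the byte encoding was only applied afterwards.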