On 30/12/11 00:24, Luka Djigas wrote:
> On Thu, 29 Dec 2011 18:08:27 +0100, Andreas Prilop
> <prilo...@trashmail.net> wrote:
>
>> On Thu, 29 Dec 2011, Luka Djigas wrote:
>>
>>> http://i44.tinypic.com/120tr93.jpg
>>
>> This means that the letters are actually encoded in UTF-8,
>> but incorrectly interpreted as Windows-1250.
>> You can see the same effect on
>> http://www.user.uni-hannover.de/nhtcapri/multilingual1.html#latin
>> when you manually select the encoding Central European Windows-1250
>> in your browser.
>
> :))
>
>> Try
>> set fileencoding=cp1250
>> instead and keep
>> set encoding=utf-8
>>
> Same thing (or something visually very similar).
>
>>> But it is weird, since UTF8 is supposed to be a much larger
>>> character set, and it is supposed to include those characters
>>> as well. I would like to keep using it as a value for encoding,
Yes, UTF-8 can represent anything in cp1250, and many things that cp1250
cannot represent, but it doesn't represent them the same way:
- the 128 7-bit characters in US-ASCII are mapped at the same Unicode
codepoint numbers (U+0000 to U+007F) _and_ UTF-8 represents them the
same way, by a single byte each.
- the 256 8-bit characters in Latin1 aka ISO-8859-1 are mapped at the
same Unicode codepoints (U+0000 to U+00FF) but only the first half
(common with US-ASCII) are represented the same way. Codepoints U+0080
to U+07FF require two bytes in UTF-8, higher codepoints need even more.
See http://www.unicode.org/charts/ where you can browse or download
PDF code charts for the various sections of the Unicode codepoint range.
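To make the two points above concrete, here is a small Python sketch (Python is only my choice for illustration, and the three sample characters are arbitrary picks) that prints how the same character comes out as bytes in Latin1, cp1250 and UTF-8. The helper name encoded_hex is made up for this example.

```python
# Compare the byte sequences produced by three encodings for a few
# sample characters (the samples are arbitrary illustrations).
samples = ["A", "\u00e9", "\u010d"]  # A, e-acute, c-caron


def encoded_hex(ch, encoding):
    """Return the bytes of ch in the given encoding as a hex string,
    or None if that encoding cannot represent the character."""
    try:
        return ch.encode(encoding).hex()
    except UnicodeEncodeError:
        return None


for ch in samples:
    print("U+%04X" % ord(ch),
          "latin-1:", encoded_hex(ch, "latin-1"),
          "cp1250:", encoded_hex(ch, "cp1250"),
          "utf-8:", encoded_hex(ch, "utf-8"))
```

The ASCII letter gives the same single byte everywhere, the Latin1 letter needs two bytes in UTF-8, and the c-caron has no Latin1 byte at all.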
UTF-8 represents Unicode codepoints as follows:
U+0000 to U+007F are one byte whose top bit is a zero bit.
The rest are two or more bytes, of which:
- the first byte (leading byte) has as many one-bits at top as there are
bytes in the whole multibyte sequence, then one zero bit, the rest are
data bits
- the other bytes (trailer bytes) have their two top bits set to 10, the
rest is six data bits
- the data bits are ordered "big end first" among the various bytes.
For instance, the Chinese "number one" character, an ideogram consisting
of just one horizontal stroke, is U+4E00, or 100.1110.0000.0000 in
binary. This is fifteen bits, more than the eleven data bits a two-byte
sequence can carry, so three bytes will be necessary. In UTF-8 it is
represented as 1110.0100 10.111000 10.000000 (where I separate the bytes
by a space and status bits from data bits by a dot), or E4 B8 80.
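The bit layout described above can be sketched by hand in a few lines of Python. This is only an illustration under simplifying assumptions: the function name utf8_encode is made up, and it does not reject surrogate codepoints the way a real encoder would.

```python
def utf8_encode(codepoint):
    """Encode one Unicode codepoint using the lead-byte / trailer-byte
    layout: lead byte = N one-bits, a zero bit, then data bits;
    each trailer byte = 10 followed by six data bits."""
    if codepoint < 0x80:
        return bytes([codepoint])  # single byte, top bit is zero
    # Data bits available: 2 bytes -> 5+6, 3 bytes -> 4+12, 4 bytes -> 3+18
    for nbytes, max_bits in ((2, 11), (3, 16), (4, 21)):
        if codepoint < (1 << max_bits):
            break
    else:
        raise ValueError("codepoint out of range")
    trailers = []
    for _ in range(nbytes - 1):
        trailers.append(0x80 | (codepoint & 0x3F))  # 10xxxxxx, low bits first
        codepoint >>= 6
    lead_marker = (0xFF << (8 - nbytes)) & 0xFF  # nbytes one-bits, then zeros
    # Trailers were collected low-end first, so reverse them ("big end first").
    return bytes([lead_marker | codepoint] + trailers[::-1])


print(utf8_encode(0x4E00).hex())  # the "number one" ideogram from the text
```

Comparing the result against Python's built-in chr(cp).encode("utf-8") is an easy way to check the sketch for any codepoint you care about.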
Since the main East-European accented Latin characters are among the
"Latin Extended-A" characters at U+0100 to U+017F, they are represented
in UTF-8 by two bytes each, between C4 80 and C5 BF. If there is a
misunderstanding between Vim and Windows about how the clipboard is
coded, you could get two characters for each codepoint, the first of
which would be C4 or C5, i.e. (if misinterpreted as Latin1) Ä
(A-umlaut) or Å (A-ring).
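The effect is easy to reproduce outside Vim. The following Python sketch (the three sample letters are my own choice, not taken from the screenshot) encodes some Latin Extended-A characters as UTF-8 and then deliberately misreads the bytes as Latin1:

```python
original = "\u010d\u0161\u017e"          # c-caron, s-caron, z-caron
utf8_bytes = original.encode("utf-8")    # two bytes per character
garbled = utf8_bytes.decode("latin-1")   # each byte misread as one character

# Every second character of the garbled text is the misread lead byte.
lead_chars = garbled[0::2]
print(len(original), len(garbled), lead_chars)

# As long as no byte was lost, the damage is reversible:
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

Each original letter turns into two characters, and the first of each pair is Ä or Å, which is the pattern to look for when diagnosing this kind of mix-up.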
> available in Latin1 & Co. character sets) ...
>
>> I have no practical experience with the Windows version of Vim,
>> only with Linux Vim.
>
> Do you think there is a difference when it comes to this?
>
> -- Luka
OK, another try:
In the vimrc:
    if has('multi_byte')
      if &enc !~? '^u'    " caret-u, not control-u
        if &tenc == ""
          " avoid clobbering keyboard locale
          let &tenc = &enc
        endif
        set enc=utf-8
      endif
      set fencs=ucs-bom,utf-8,cp1250
      " anything after cp1250 (which is 8-bit)
      " would be ignored anyway
    endif
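For what it's worth, the idea behind the 'fencs' list in the snippet above (try each candidate in turn, keep the first one that fits) can be sketched in Python. This is not how Vim implements it: the function guess_decode is hypothetical, it only recognizes the UTF-8 BOM where Vim's ucs-bom also knows the UTF-16 and UTF-32 ones, and Python's cp1250 codec rejects a handful of undefined bytes.

```python
import codecs


def guess_decode(data, candidates=("utf-8", "cp1250")):
    """Return (encoding_name, text) for the first candidate that can
    decode data without error, checking for a UTF-8 BOM first."""
    if data.startswith(codecs.BOM_UTF8):  # rough stand-in for ucs-bom
        return "utf-8 (with BOM)", data[len(codecs.BOM_UTF8):].decode("utf-8")
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding, try the next
    raise ValueError("no candidate encoding fits")


print(guess_decode("\u010d".encode("utf-8")))  # valid UTF-8, first match wins
print(guess_decode(b"\xe8"))                   # invalid UTF-8, falls through
```

Because almost any byte sequence is acceptable to an 8-bit encoding, the search practically always stops at cp1250 if it gets that far, which is why listing anything after it would be pointless.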
If (and only if) it still doesn't work, try adding
    language ctype Polish_Poland.65001
(according to http://en.wikipedia.org/wiki/Windows_code_pages#List ,
65001 is the "Windows code page number" for UTF-8).
If it _still_ doesn't work, you may have to set your "Country settings"
(or whatever they are called) on Windows to use code page 65001 "Unicode
(UTF-8)" (or some such). But this could give problems in other programs,
so it is only a last resort.
Best regards,
Tony.
--
You're not drunk if you can lie on the floor without holding on.
-- Dean Martin