The spelling dictionaries only contain ASCII quotes and Vim doesn't
change that. This has annoyed me as well but not enough to fix the
problem. To do so you can preprocess the word list to add variants with
curly quotes or someone could write a Vim patch that automatically does
this internally (a better solution but more complex).
- Peter Odding
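The preprocessing idea above could look roughly like the following. This is a minimal sketch, assuming a plain one-word-per-line word list; real .dic files carry affix flags and would need more care. The name add_curly_variants is made up for illustration and is not part of Vim.

```python
APOSTROPHE = "'"       # U+0027, the only apostrophe in the shipped word lists
CURLY = "\u2019"       # U+2019 RIGHT SINGLE QUOTATION MARK


def add_curly_variants(words):
    """Return the words, each followed by a U+2019 variant if it contains U+0027."""
    result = []
    for word in words:
        result.append(word)
        if APOSTROPHE in word:
            result.append(word.replace(APOSTROPHE, CURLY))
    return result


sample = ["don't", "begin", "let's"]
expanded = add_curly_variants(sample)
print(expanded)
```

The obvious cost is that every word with an apostrophe is stored twice, which is why handling this inside Vim would be the cleaner fix.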
You are using weird quotes from cp1252. The spell checker works with
latin1 quotes. The equivalent of cp1252 0x92 is 0x2019 in Unicode.
They are not the same, thus Vim says it's an error to use that.
Please don't use cp1252, it's Windows-only stuff.
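The byte values mentioned above can be checked with a few lines of Python. This is only an illustration of the code page mapping, using Python's standard codecs; it says nothing about how Vim handles the byte internally.

```python
byte = bytes([0x92])

# In cp1252, 0x92 is the right single quotation mark, U+2019.
as_cp1252 = byte.decode("cp1252")

# In latin1 (ISO-8859-1), the same byte is a C1 control character, U+0092.
as_latin1 = byte.decode("latin-1")

# The apostrophe the word lists use is plain U+0027 in both code pages.
ascii_quote = "'".encode("latin-1")

print(hex(ord(as_cp1252)), hex(ord(as_latin1)), ascii_quote)
```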
--
Are leaders born or made? And if they're made, can we return them under
warranty?
(Scott Adams - The Dilbert principle)
/// Bram Moolenaar -- Br...@Moolenaar.net -- http://www.Moolenaar.net \\\
/// sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\ an exciting new programming language -- http://www.Zimbu.org ///
\\\ help me help AIDS victims -- http://ICCF-Holland.org ///
> You are using weird quotes from cp1252. The spell checker works with
> latin1 quotes. The equivalent of cp1252 0x92 is 0x2019 in Unicode.
> They are not the same, thus Vim says it's an error to use that.
> Please don't use cp1252, it's Windows-only stuff.
I don’t use cp1252, nor do I use the (almost) equally horrible latin1
(ISO-8859-1). I use Unicode, specifically U+2018, U+2019, U+201C, and
U+201D.
spelllang=en encoding=utf-8
:spelldump gives me
/regions=usaucagbnz
# file: /usr/share/vim/vim73/spell/en.utf-8.spl
so everything seems to be in order, except for the fact that ‘’’ isn’t
being recognized. :spelldump also lists
don't
but not
don’t
nor any other word with a ‘’’.
If the spell lists are in 7-bit ASCII then applying a Unicode to ASCII
conversion should map U+2019 to U+0027 and make spell DWIM.
TTFN
Mike
--
.egassem terces eht dnuof evah ouY !snoitalutargnoC
> If the spell lists are in 7-bit ASCII then applying a Unicode to ASCII
> conversion should map U+2019 to U+0027 and make spell DWIM.
What do you mean by “applying a Unicode to ASCII conversion”?
Passing the Unicode encoded text through something like iconv to produce
an ASCII version of it. There are standard mapping tables for the
punctuation symbols that would handle most of the general punctuation
block U+2000-206F. For example left and right double quotes would both
be mapped to the quotation character.
It will depend on how any Unicode characters that have no mapping are
treated. In a conversion you normally are allowed to specify a
replacement character for them. Ideally you would pick a character that
would not be picked up by the spell code.
HTH
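A conversion of the kind described above might be sketched like this. The mapping table is a tiny hand-picked subset rather than a standard table, and to_ascii and its replacement parameter are hypothetical names used only for this example.

```python
# Hand-picked subset of the general punctuation block; an assumption,
# not a complete or standard mapping table.
PUNCT_MAP = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201C": '"',    # left double quotation mark
    "\u201D": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
}


def to_ascii(text, replacement="?"):
    """Map known punctuation to ASCII; anything else non-ASCII becomes `replacement`."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append(PUNCT_MAP.get(ch, replacement))
    return "".join(out)


converted = to_ascii("\u201CLet\u2019s begin \u2026\u201D")
unmapped = to_ascii("na\u00EFve")
print(converted, unmapped)
```

The replacement character matters: as noted above, it should be one the spell code does not treat as part of a word.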
> Passing the Unicode encoded text through something like iconv to produce an
> ASCII version of it. There are standard mapping tables for the punctuation
> symbols that would handle most of the general punctuation block U+2000-206F.
> For example left and right double quotes would both be mapped to the
> quotation character.
Why would I want to do that? I want my Unicode-encoded text.
I figured that you were suggesting some sort of modification of the
spelling tables.
If the spelling tables are ASCII then I am just pointing out there is a
means of having Unicode encoded text spell checked against them. The
spell checking code could convert the buffer contents to temporary
storage before running. I imagine there will be some fun in mapping
back from the temporary storage to the original buffer contents to
highlight spelling issues.
I imagine this would also work for anyone else working in Unicode where
the spelling checklists are not in Unicode.
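The "mapping back" step could be handled by recording, for every converted character, the index it came from. The following is a sketch under that assumption; normalize_with_offsets and TABLE are invented for the example and do not reflect how Vim's spell code is organized.

```python
def normalize_with_offsets(text, table):
    """Return (normalized, origin); origin[i] is the index in `text`
    of the character that produced normalized[i]."""
    normalized = []
    origin = []
    for index, ch in enumerate(text):
        # One source character may expand to several output characters.
        for out_ch in table.get(ch, ch):
            normalized.append(out_ch)
            origin.append(index)
    return "".join(normalized), origin


TABLE = {"\u2019": "'", "\u2026": "..."}
text = "\u2026 don\u2019t"
normalized, origin = normalize_with_offsets(text, TABLE)

# Pretend the checker flagged "don't" in the temporary copy.
start = normalized.index("don't")
end = start + len("don't") - 1

# Translate the flagged range back to the original buffer text.
source_span = (origin[start], origin[end])
flagged = text[source_span[0]:source_span[1] + 1]
print(source_span, flagged)
```

Because the ellipsis expands to three characters, positions in the temporary copy shift, and the origin list is what lets the highlight land on the right characters in the buffer.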
There is a solution to the problem. The question is whether it is
worth implementing. Someone who knows the spell checking code would be
able to say.
Mike
--
A clean room is a sure sign of a broken computer.
> >> Writing “Let’s begin …” marks the ‘s’ as a spelling
> >> error. Writing “Let's begin …” works fine. Is this a bug,
> >> or am I missing something?
>
> > You are using weird quotes from cp1252. The spell checker works with
> > latin1 quotes. The equivalent of cp1252 0x92 is 0x2019 in Unicode.
> > They are not the same, thus Vim says it's an error to use that.
> > Please don't use cp1252, it's Windows-only stuff.
>
> I don’t use cp1252, nor do I use the (almost) equally horrible latin1
> (ISO-8859-1). I use Unicode, specifically U+2018, U+2019, U+201C, and
> U+201D.
Your message header had:
Content-Type: text/plain; charset=windows-1252
This one has:
Content-Type: text/plain; charset=UTF-8
> spelllang=en encoding=utf-8
>
> :spelldump gives me
>
> /regions=usaucagbnz
> # file: /usr/share/vim/vim73/spell/en.utf-8.spl
>
> so everything seems to be in order, except for the fact that ‘’’ isn’t
> being recognized. :spelldump also lists
>
> don't
>
> but not
>
> don’t
>
> nor any other word with a ‘’’.
Right, only latin1 quotes are supported. Note that the first 256
characters of Unicode are latin1.
--
Managers are like cats in a litter box. They instinctively shuffle things
around to conceal what they've done.
Not according to GMail (this is what it sent):
MIME-Version: 1.0
Sender: nikolai...@gmail.com
Received: by 10.220.190.204 with HTTP; Tue, 30 Nov 2010 03:40:55 -0800 (PST)
Date: Tue, 30 Nov 2010 12:40:55 +0100
Delivered-To: nikolai...@gmail.com
X-Google-Sender-Auth: Okja3V51CPh3aY-A-4p7RBH6WCQ
Message-ID: <AANLkTi=9T5WdogfP5QiGDvQOf...@mail.gmail.com>
Subject: =?UTF-8?B?U3BlbGxpbmcgc3VwcG9ydCBkb2VzbuKAmXQgZGVhbCB3aXRoIOKAmOKAmeKAmSBjb3JyZQ==?=
=?UTF-8?B?Y3RseQ==?=
From: Nikolai Weibull <n...@bitwi.se>
To: Vim Developers <vim...@googlegroups.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Perhaps Google Groups messed with the message along the way.
> Right, only latin1 quotes are supported.
OK, so let’s fix that. How do we fix that?
Also, I don’t understand why you say latin1 quotes, as it would be a
lot clearer if you said ASCII quotes. (Latin1 doesn’t add any
additional quotes. That’s one of the main differences between latin1
and cp1252.)
>> Right, only latin1 quotes are supported.
> OK, so let’s fix that. How do we fix that?
>
> Also, I don’t understand why you say latin1 quotes, as it would be a
> lot clearer if you said ASCII quotes. (Latin1 doesn’t add any
> additional quotes. That’s one of the main differences between latin1
> and cp1252.)
Still waiting for a response to this question.
> On Wed, Dec 1, 2010 at 22:00, Nikolai Weibull <n...@bitwi.se> wrote:
> > On Wed, Dec 1, 2010 at 21:12, Bram Moolenaar <Br...@moolenaar.net> wrote:
> >>
> >> Nikolai Weibull wrote:
> >>
> >>> Writing “Let’s begin …” marks the ‘s’ as a spelling
> >>> error.  Writing “Let's begin …” works fine.  Is this a bug,
> >>> or am I missing something?
>
> >> Right, only latin1 quotes are supported.
>
> > OK, so let’s fix that.  How do we fix that?
> >
> > Also, I don’t understand why you say latin1 quotes, as it would be a
> > lot clearer if you said ASCII quotes.  (Latin1 doesn’t add any
> > additional quotes.  That’s one of the main differences between latin1
> > and cp1252.)
>
> Still waiting for a response to this question.
I don't know how to fix this (well, don't have time to look into it). I
don't even know what the fix would do anyway. We don't want to allow
just any quotes, only the ones that are appropriate for the language.
Probably this needs to be done in the spell files themselves. Or with
an option in the affix file.
--
Did you ever stop to think... and forget to start again?
-- Steven Wright
The hunspell doc is not very clear but I think this is what the
ICONV directive of Hunspell is for. Looking at this English
dictionary of OpenOffice 3.x at:
http://extensions.services.openoffice.org/en/project/dict-en-fixed
... the en_US.aff file contains:
ICONV 6
ICONV ’ '
ICONV ﬃ ffi
ICONV ﬄ ffl
ICONV ﬀ ff
ICONV ﬁ fi
ICONV ﬂ fl

OCONV 1
OCONV ' ’
My understanding is that ICONV converts the input before the
dictionary is probed, so the fancy quote U+2019 becomes a regular
quote (among other conversions). So "Let’s" and "Let's" are both
recognized as correct.
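As a rough model of that behaviour (not Hunspell's actual implementation), the input conversion can be pictured as a string replacement applied before the lookup. The table below assumes the left-hand sides of the f-entries are the Unicode ligatures U+FB00 to U+FB04; the two-word DICTIONARY and the function names are made up for the example.

```python
# Input conversion table modelled on the ICONV entries in en_US.aff.
ICONV = {
    "\u2019": "'",    # right single quotation mark -> ASCII apostrophe
    "\uFB03": "ffi",  # ligature ffi
    "\uFB04": "ffl",  # ligature ffl
    "\uFB00": "ff",   # ligature ff
    "\uFB01": "fi",   # ligature fi
    "\uFB02": "fl",   # ligature fl
}

# Stand-in for the real dictionary, which only stores ASCII apostrophes.
DICTIONARY = {"let's", "office"}


def convert_input(word, table=ICONV):
    """Apply every input conversion to the word."""
    for source, target in table.items():
        word = word.replace(source, target)
    return word


def is_correct(word):
    """Convert the input first, then probe the dictionary."""
    return convert_input(word).lower() in DICTIONARY


print(is_correct("Let\u2019s"), is_correct("Let's"), is_correct("Lets"))
```

The dictionary itself stays untouched, which is the attraction compared with storing curly-quote variants of every word.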
But Vim currently still uses dictionaries from OpenOffice-2.x and
does not support ICONV either.
I found the following patch which adds support for Hunspell
dictionaries to Vim:
https://bugzilla.redhat.com/show_bug.cgi?id=219777
I tried the patch and it still works with latest vim-7.3.138
(I did not test it extensively yet). With some clean up, it
could be a good addition to Vim-7.4.
How about adding options such as:
" Comma separated directories where to search for
" Hunspell dictionaries (*.aff and *.dic).
set hunspelldir=~/hunspell,/usr/share/hunspell
" Boolean option to use Hunspell dictionaries directly
" rather than Vim spelling dictionaries.
set hunspell
Using Hunspell dictionaries directly solves several issues.
I never managed to convert the latest French dictionary from
OpenOffice-3.x from Hunspell to Vim. The dictionaries from
OpenOffice-2.x are quite out of date (at least for French).
I wish I could use the latest dictionary from OpenOffice-3.x.
Regards
-- Dominique
Personally, I'd prefer to use something like Enchant[0] over a specific
spelling library, if the effort is going to be taken to use something
other than Vim's internal spell-checking. I've done some work in a
local branch to integrate Enchant, but ran into some larger questions
that need to be addressed and haven't had time to draft a reasonable
email about them yet.
[0]: http://www.abisource.com/projects/enchant/
--
James
GPG Key: 1024D/61326D40 2003-09-02 James Vega <jame...@jamessan.com>
I did not know about Enchant until now but the description looks
convincing considering the number of backends it supports. I'm
mostly interested in better support of Hunspell but Enchant would
provide that along with support of other spelling systems.
In short, it looks promising.
-- Dominique
Why are you echoing what I already said above?
Saying “latin1 quotes” adds zero clarity. It actually muddles the
facts, especially since cp1252 does add quotes and, again especially,
since there was some confusion about what quotes (and encoding) I
(well, Google) was using in my e-mails.
But this is a big “whatever”. As latin1 (or, more appropriately,
iso-8859-1) is a superset of ASCII and Unicode is a superset of
latin1, then what I really care about is having support for Unicode
quotes. Or, Unicode apostrophes, to be exact (not U+0027, but
U+2019), as it’s not ‘’’’s role as a right single quotation mark, but
as an apostrophe, that I care about.
This has been discussed before, e.g. here:
http://groups.google.com/group/vim_dev/msg/fd9a82ef07460726
regards,
Christian
> But this is a big "whatever". As latin1 (or, more appropriately,
> iso-8859-1) is a superset of ASCII and Unicode is a superset of
> latin1, then what I really care about is having support for Unicode
> quotes.
Latin1 is a superset of ASCII, but Unicode is not a superset of
latin1. Unicode supports a larger set of characters than latin1 and
shares some character encodings in common with latin1 but it is a
different encoding.
Regards,
Gary
Unicode is a superset of Latin1 in the sense that every Latin1 character
is also a Unicode codepoint, and at the same ordinal position (the first
256 Unicode codepoints are the 256 Latin1 characters in the same order).
However no Unicode encoding represents Latin1 characters higher than
0x7F *on disk* by the same binary value that Latin1 does (UTF-8, but not
the other Unicode encodings except maybe --I'm not sure-- GB18030,
represents the 128 US-ASCII characters the same way as both US-ASCII and
Latin1).
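Both halves of that statement can be checked with Python's standard codecs. This is just a quick demonstration of the point, with variable names chosen for the example.

```python
# Every Latin1 byte value decodes to the Unicode codepoint with the same ordinal.
decoded = bytes(range(256)).decode("latin-1")
same_ordinals = all(ord(ch) == i for i, ch in enumerate(decoded))

# Above 0x7F the bytes on disk differ: U+00E9 is one byte in Latin1, two in UTF-8.
e_acute = "\u00E9"
latin1_form = e_acute.encode("latin-1")
utf8_form = e_acute.encode("utf-8")

# The 128 US-ASCII characters are stored identically in all three.
ascii_same = "A".encode("ascii") == "A".encode("latin-1") == "A".encode("utf-8")

print(same_ordinals, latin1_form, utf8_form, ascii_same)
```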
<encyclopedia>
The above paragraph implies that Unicode is not *one* encoding, even
though Vim represents all Unicode codepoints the same way *in memory*.
Rather, Unicode should be seen as a way of classifying all known writing
systems as a one-dimensional list going from zero to "something high" by
integer steps or "codepoints". These codepoints may be coded as bytes in
different ways:
* UTF-8, which uses one or more bytes per codepoint, and where the byte
0x00 can only represent the codepoint U+0000 (the null codepoint) so
it's useful for a representation using C strings. The first byte used
for any codepoint tells how many bytes there will be in all, the other
ones (if any) have values which cannot happen in the first byte, so
synchronization is easy even if corrupt bytes become embedded in the text.
* UCS-2, which uses one two-byte word (big-endian or little-endian) per
codepoint and cannot represent any codepoint higher than U+FFFF
* UTF-16, which extends UCS-2 up to U+10FFFF by means of "surrogate
codepoints", using two words for codepoints higher than U+FFFF
* UCS-4 aka UTF-32, which can be big-endian or little-endian (or even,
I've been told, ordered 2143 or 3412) and uses one four-byte doubleword
per codepoint. It simply stores each codepoint as its ordinal value
expressed as one unsigned 32-bit integer.
* GB18030, which is skewed in favour of Chinese; it allows
representation of any Unicode codepoint but the conversion in either
direction between it and other Unicode encodings requires bulky tables.
Conversion between any of the above except GB18030 is trivial; Vim does
it with no need for the iconv library. For UCS-2, UTF-16 and UTF-32,
when the endianness is omitted, big-endian is implied, even on
little-endian processors such as the Intel ones used in all Windows PCs,
most Linux ones, and many of those equipped with Mac OSX.
</encyclopedia>
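To make the list of encodings concrete, here is how two codepoints come out in several of them, using Python's built-in codecs. U+2019 is the apostrophe this thread is about; U+1D11E (a musical symbol) is picked only because it lies above U+FFFF and so needs a surrogate pair in UTF-16.

```python
apostrophe = "\u2019"
clef = "\U0001D11E"

utf8 = apostrophe.encode("utf-8")         # three bytes
utf16be = apostrophe.encode("utf-16-be")  # one 16-bit word, big-endian
utf16le = apostrophe.encode("utf-16-le")  # same word, bytes swapped
utf32be = apostrophe.encode("utf-32-be")  # the ordinal as a 32-bit integer

# Above U+FFFF, UTF-16 needs two words (a surrogate pair).
clef_utf16 = clef.encode("utf-16-be")
clef_utf8 = clef.encode("utf-8")

# GB18030 also keeps US-ASCII characters as single, unchanged bytes.
gb_ascii = "A".encode("gb18030")

for name, data in [("utf8", utf8), ("utf16be", utf16be), ("utf16le", utf16le),
                   ("utf32be", utf32be), ("clef_utf16", clef_utf16),
                   ("clef_utf8", clef_utf8), ("gb_ascii", gb_ascii)]:
    print(name, data.hex())
```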
Best regards,
Tony.
--
Champagne don't make me lazy.
Cocaine don't drive me crazy.
Ain't nobody's business but my own.
-- Taj Mahal