If the encoding of the edited text file differs from the system/Vim encoding,
it is inconvenient for the default HTML charset to come from 'encoding'. In
that case, after ':TOhtml', we have to modify the generated HTML file so that
its file encoding matches the HTML charset.
For example, suppose the system/Vim encoding is 'utf-8' but a text file's
encoding is 'latin1'. If the default HTML charset comes from 'encoding', then
after ':TOhtml' we have to either change the HTML charset to 'iso-8859-1' or
save the generated HTML file with ':w ++enc=utf-8'. But if the default HTML
charset comes from 'fileencoding', nothing needs to be done after ':TOhtml'.
The changes are attached.
Best regards,
Yanwei.
--
I have encountered the opposite: while editing existing files,
'fileencoding' is often different from 'encoding'.
If 'fileencoding' is empty (buffer-locally), Vim will save the file with
'encoding'. This is documented behaviour.
For details, see http://vim.wikia.org/wiki/Working_with_Unicode and the
help topics listed there.
Best regards,
Tony.
--
Meeting, n.:
An assembly of people coming together to decide what person or
department not represented in the room must solve a problem.
You got it right, and it does indeed make sense.
One consideration is that anything can be represented in UTF-8, including
text from the latest edit that has not yet been saved and may be
incompatible with the 'fileencoding'; such text is of course in error,
and will cause an error if one tries to save it.
>
> You say you need to do nothing to the TOhtml output if we set the
> charset to the file encoding. But, don't we also need to ensure that
> the file encoding of the new html file is the same as the file
> encoding of the source file? The file encoding could be different from
> file to file, whereas Vim's encoding is always the same. I can picture
> this causing problems, if the charset says one thing, but the file
> encoding is different.
HTML metadata can be written in ASCII. If needed, one can use &#nnnnn;
entities in text (where nnnnn is the decimal representation of the
Unicode codepoint number; recent browsers also accept &#xnnnn;, where x
is the letter x as in X-Ray and nnnn is the hex representation), or
percent-escaping in URLs. Even in a Latin1 HTML page, percent-escaping
always escapes each byte of the UTF-8 representation separately, with a
% sign followed by exactly two hex digits: for instance, U+00E9 (Latin
small letter e with acute) would be represented as %C3%A9, and U+4E00
(the Chinese "number one" horizontal-stroke sign) as %E4%B8%80 in a URL,
including in the query text if any.
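As a small aside (my own illustration, not part of the original point), Vim
can produce both entity forms itself when 'encoding' is utf-8: the first
command below gives &#233; and the second gives &#xE9;.

    :echo printf('&#%d;', char2nr('é'))
    :echo printf('&#x%X;', char2nr('é'))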
>
> By the way, until this is fixed...you can use the g:html_use_encoding
> option to override the normal detection mechanisms, rather than
> manually editing the generated HTML file.
>
Best regards,
Tony.
--
If you put garbage in a computer nothing comes out but garbage. But
this garbage, having passed through a very expensive machine, is
somehow ennobled and none dare criticize it.
Ok, I think I'll make the edit, then.
Your response gives me an idea to fix something else that's been
bothering me. Currently, if Vim cannot determine the correct charset
to use, it defaults to not including one at all. I think I'll have it
default the charset and file encoding to UTF-8 if neither the
fileencoding nor the encoding option gives a valid charset. The user
should be able to manually leave out the charset and manually set the
encoding if desired.
Here's what I'm thinking in more detail:
For one buffer:
1. If the user specified a charset, try to determine 'fileencoding' from
that charset. If this fails, warn the user that they will need to set the
fileencoding manually.
2. If no charset is specified, try to determine a charset from the
'fileencoding' option. If successful, use the same 'fileencoding' and
the associated charset in the generated buffer.
3. If a charset could not be determined from 'fileencoding', try again
with 'encoding'. If successful, set 'fileencoding' to blank in the new
HTML buffer and use the charset from the 'encoding' option.
4. If a charset could not be determined from either 'fileencoding' or
'encoding', default to UTF-8 and warn the user.
Multiple buffers in diff mode will be handled similarly, except that we
will determine the charset as above for ALL buffers. If they differ,
set 'fileencoding' to blank and use the charset from 'encoding' (or
UTF-8 if no charset can be determined from 'encoding'). A rough sketch
of this fallback chain follows.
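In Vim script, that fallback chain might look roughly like this (the names
are hypothetical, not the actual plugin code; step 1, a user-specified
charset, is left out):

    function! s:CharsetFor(enc) abort
      " hypothetical lookup table; the real plugin would map many more names
      let l:known = {'utf-8': 'UTF-8', 'latin1': 'ISO-8859-1', 'cp1252': 'Windows-1252'}
      return get(l:known, a:enc, '')
    endfunction

    function! s:DetermineCharset() abort
      " step 2: try 'fileencoding' first; keep it in the generated buffer
      let l:charset = s:CharsetFor(&fileencoding)
      if !empty(l:charset)
        return [l:charset, &fileencoding]
      endif
      " step 3: fall back on 'encoding', leaving 'fileencoding' blank
      let l:charset = s:CharsetFor(&encoding)
      if !empty(l:charset)
        return [l:charset, '']
      endif
      " step 4: last resort, default to UTF-8 and warn the user
      echohl WarningMsg
      echomsg 'TOhtml: could not determine charset, defaulting to UTF-8'
      echohl None
      return ['UTF-8', 'utf-8']
    endfunction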
What do you think? Or maybe this is too complicated and I should just
use 'encoding' as is done currently?
I think you're on the right track. Maybe a little too complicated but
I'm not sure. I would just use 'fileencoding', or if empty (or if it can
be ascertained that the current buffer contains characters which are
invalid for it) then fall back on 'encoding' (by leaving 'fileencoding'
empty in the tohtml output buffer). But go ahead if you think you can
refine it more or make it better.
I don't know what is being done ATM, but I'd always include the line
<meta http-equiv="Content-Type" content="text/html; charset=whatever" />
(replacing "whatever" by the charset name) somewhere near the start of
the <head> element. You may want to use a synonym, e.g. iso-8859-1 for
Latin1, but that's just the finishing touch.
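A minimal sketch of emitting that line from Vim script (l:charset and
s:head_lnum are just placeholder names for the chosen charset and the line
number inside <head>):

    let l:meta = printf('<meta http-equiv="Content-Type" content="text/html; charset=%s" />', l:charset)
    call append(s:head_lnum, l:meta)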
Best regards,
Tony.
--
"In defeat, unbeatable; in victory, unbearable."
-- Winston Churchill, of Montgomery
Yes, that's mostly what it does now, except that it omits the line if it
could not determine the charset, always uses 'encoding' instead of
'fileencoding', and, when the optional XHTML output is used, specifies the
encoding in the <?xml line instead. I think using utf-8 as a fallback
instead of leaving the charset out entirely would be a better idea.
The user can specify the charset now, but then the fileencoding will
be wrong unless the user remembers to set it manually (or unless it gets
inherited...'fileencoding' seems to act like a "global-local" option).
Well, for existing files, 'fileencoding' will be set locally by the
'fileencodings' (plural) heuristic if the latter option is set. For new
files, you can :setg fenc=something and it will be used when creating a
new file.
If 'fileencoding' (singular) is the empty string for a file (which is
the default for new files) you'll inherit the value of 'encoding'.
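For instance (the values here are just an illustration):

    " detection heuristic used when reading existing files
    :set fileencodings=ucs-bom,utf-8,latin1
    " global value of 'fileencoding', used when creating new files
    :setglobal fileencoding=utf-8
    " query the buffer-local value; if empty, the file is written with 'encoding'
    :setlocal fileencoding?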
Best regards,
Tony.
--
Said a swinging young chick named Lyth
Whose virtue was largely a myth,
"Try as hard as I can,
I can't find a man
That it's fun to be virtuous with."
Additionally, it will now support a lot more encodings, and
automatically set the file encoding of the new file to match the
charset.
All encodings that are both native to Vim (listed by name in :help
encoding-names) and appear in the IANA registry (
http://www.iana.org/assignments/character-sets ) are supported. Note
that not all of these encodings are supported by major web browsers or
the w3c validator. New options are provided to override specific
encodings in the charset detection, or there is still
g:html_use_encoding to override all automatic detection. It is
probably a good idea to use this option if publishing to a web page.
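For example (the value here is arbitrary), overriding all automatic
detection looks like this:

    :let g:html_use_encoding = 'UTF-8'
    :TOhtml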
There may be some charsets that previously were automatically detected
that no longer are, and there are some encodings supported by Vim
which I could not find in the IANA registry.
Unfortunately, I could not find a list of widely supported charsets,
so I just used all the ones that appear in both Vim and the IANA registry,
as mentioned previously. If there is such a list, would it be a good idea to limit
the automatically detected charsets to those in the list? Along those
lines, it could be a good idea to automatically use UTF-8 in place of
UTF-16 and UTF-32. Currently these charsets are selected as-is.
So, consider this a beta release. PLEASE test and comment; I expect
some changes may be needed before final submission.
Patch is attached, or the files are available for download at the site
I use for the TOhtml test suite: