Character encoding .wiki help files

Samuel Murray

Mar 28, 2009, 6:18:31 AM
to wikidpad-devel
G'day everyone

I notice that the .wiki help files are all in UTF-8 (with BOM) except
for the one about international characters, which appears to be in
codepage 1252 (also called ANSI, sometimes called Latin-1, although
it's not quite ISO-8859-1 which is also called Latin-1). Is there a
reason for this use of 1252 instead of UTF-8? I'm just concerned that
if people open this page in a text editor that fails to distinguish
between 1252 and ISO-8859-1, they will experience corrupted
characters.
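
To illustrate the concern, here is a small Python sketch of where the two encodings disagree. The sample bytes are made up for illustration; only bytes in the 0x80-0x9F range decode differently.

```python
# Bytes 0x80-0x9F are where Windows-1252 and ISO-8859-1 disagree.
raw = b"\x80 \x93quoted\x94"

as_cp1252 = raw.decode("cp1252")      # euro sign and curly quotes
as_latin1 = raw.decode("iso-8859-1")  # C1 control characters instead

print(repr(as_cp1252))
print(repr(as_latin1))

# Outside that range the two encodings agree, e.g. 0xE9 is e-acute in both.
assert b"\xe9".decode("cp1252") == b"\xe9".decode("iso-8859-1") == "\u00e9"
```

An editor that treats the file as ISO-8859-1 would therefore show accented letters correctly but lose characters such as the euro sign and typographic quotes.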

I also noticed that WikidPad doesn't read UTF-16 LE, although this is a
format that some Windows programs may default to (e.g. MS Word). Are
there any plans to support this?

These aren't crucial issues, though.

Samuel

Michael Butscher

Mar 29, 2009, 5:29:22 AM
to wikidpa...@googlegroups.com
Samuel Murray wrote:
> G'day everyone
>
> I notice that the .wiki help files are all in UTF-8 (with BOM) except
> for the one about international characters, which appears to be in
> codepage 1252 (also called ANSI, sometimes called Latin-1, although
> it's not quite ISO-8859-1 which is also called Latin-1). Is there a
> reason for this use of 1252 instead of UTF-8? I'm just concerned that
> if people open this page in a text editor that fails to distinguish
> between 1252 and ISO-8859-1, they will experience corrupted
> characters.

Thank you for the report. The reason is that older versions of WikidPad
only supported the current system codepage (often 1252). When wikis were
updated, the pages were not automatically converted to UTF-8.

Instead, they are only converted when they are modified, but the
international characters page has evidently not been modified since
those ancient times.

In the next version this page will be encoded properly in UTF-8.

Further encoding inconsistencies in other pages (there are many more
of them, as Christian Ziemski told me) will be fixed in a later version,
as I first have to write a little tool for that.
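
A minimal Python sketch of what such a conversion tool might do, assuming the legacy pages are in codepage 1252 and that converted pages should carry a UTF-8 BOM like the other help files. The helper name `to_utf8_with_bom` is hypothetical and not part of WikidPad.

```python
import codecs


def to_utf8_with_bom(data: bytes, legacy_encoding: str = "cp1252") -> bytes:
    """Return data re-encoded as UTF-8 with a BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data  # already marked as UTF-8, leave untouched
    try:
        text = data.decode("utf-8")  # BOM-less UTF-8, including pure ASCII
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy codepage. Python's cp1252
        # codec raises for the undefined bytes 0x81, 0x8D, 0x8F, 0x90, 0x9D.
        text = data.decode(legacy_encoding)
    return codecs.BOM_UTF8 + text.encode("utf-8")
```

Trying UTF-8 first is a heuristic: text in codepage 1252 that contains non-ASCII characters is rarely valid UTF-8 by accident, but the check is not a guarantee.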


> I also noticed that WikidPad doesn't read UTF-16 LE, although this is a
> format that some Windows programs may default to (e.g. MS Word). Are
> there any plans to support this?

The next 2.0 version will be able to read UTF-16 LE and BE files with
appropriate BOMs.
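
As an illustration only (not WikidPad's actual code), Python's generic "utf-16" codec shows how a BOM resolves the byte order: the decoder reads the BOM, picks little or big endian accordingly, and drops the BOM from the result. The sample text is made up.

```python
import codecs

sample = "Gr\u00fc\u00dfe"

# The same text stored in both byte orders, each preceded by its BOM.
little = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + sample.encode("utf-16-be")

# The "utf-16" decoder uses the BOM to choose the byte order and strips it.
from_little = little.decode("utf-16")
from_big = big.decode("utf-16")

print(from_little == from_big == sample)
```

Telling UTF-16 apart from UTF-8 or a legacy codepage in the first place still requires looking at the leading bytes of the file.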


Michael

Christian Ziemski

Mar 29, 2009, 7:21:28 AM
to wikidpa...@googlegroups.com
On 29.03.2009 11:29 Michael Butscher wrote:

> In the next version this page will be encoded properly in UTF-8.

> [...] as I first have to write a little tool for that.

You may want to use the Linux tools:

'iconv' (old) or 'recode' (newer)


Regarding the BOM in many of those files:

According to
- http://unicode.org/faq/utf_bom.html#bom5
- http://en.wikipedia.org/wiki/Byte_Order_Mark
and other sources it seems to be a good idea to
*not* use a BOM in UTF-8 files.

(In the help files and in the Wiki pages generally.)


Christian

Michael Butscher

Mar 29, 2009, 9:18:51 AM
to wikidpa...@googlegroups.com

Unfortunately, there may be occasions where ANSI-encoded data (or
UTF-16 BE/LE encoded data) from other sources goes into the wiki. I
assume that some people may modify wiki pages outside of WikidPad or add
new ones. So I can't be sure about the encoding of a file without a BOM.

Michael

Samuel Murray

Mar 29, 2009, 9:21:14 AM
to wikidpad-devel

On Mar 29, 1:21 pm, Christian Ziemski <cz...@gmx.de> wrote:

> According to
>  -http://unicode.org/faq/utf_bom.html#bom5
>  -http://en.wikipedia.org/wiki/Byte_Order_Mark
> and other sources it seems to be a good idea to
> *not* use a BOM in UTF-8 files.

Whether to use a BOM or not for UTF-8 largely depends on the tools you
use to edit it :-) and what operating system you're on. It makes no
difference to me -- whichever the developer is happy with. The
WikidPad POT files are in UTF-8 with BOM, and I assume Michael did not
painstakingly add the BOMs manually. What would be interesting is
whether his PO tools can handle PO files in UTF-8 that do not have a
BOM, e.g. if a translator uses a tool that strips the BOM.

Most of the arguments against using BOMs are not relevant to .wiki
and .pot/.po files. The no-BOM arguments usually centre around POSIX
systems being unable to parse text-based executable files that open
with hashbangs. Neither of the two links you mention attempts to make
a case against BOMs -- they simply explain that a BOM in UTF-8 is a
signature only, and why it is bad to use it *for all files* on a POSIX
system.

Sincerely
Samuel

Michael Butscher

Mar 29, 2009, 3:08:31 PM
to wikidpa...@googlegroups.com
Samuel Murray wrote:
>
> On Mar 29, 1:21 pm, Christian Ziemski <cz...@gmx.de> wrote:
>
>> According to
>> -http://unicode.org/faq/utf_bom.html#bom5
>> -http://en.wikipedia.org/wiki/Byte_Order_Mark
>> and other sources it seems to be a good idea to
>> *not* use a BOM in UTF-8 files.
>
> Whether to use a BOM or not for UTF-8 largely depends on the tools you
> use to edit it :-) and what operating system you're on. It makes no
> difference to me -- whichever the developer is happy with. The
> WikidPad POT files are in UTF-8 with BOM, and I assume Michael did not
> painstakingly add the BOMs manually. What would be interesting is
> whether his PO tools can handle PO files in UTF-8 that do not have a
> BOM, e.g. if a translator uses a tool that strips the BOM.

PO files must always be treated as UTF-8, even if no BOM is detected,
because they are meant to be installed on many other systems with
different default encodings.

The .wiki files are more likely to stay on the same system, so assuming
the default encoding for the BOM-less ones is a good guess.
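
The policy described here could be sketched in Python roughly as follows. This is an assumption-laden illustration, not WikidPad's implementation: the function names are hypothetical, and `locale.getpreferredencoding(False)` stands in for "the default encoding" of the system.

```python
import codecs
import locale


def decode_file_bytes(data: bytes, bomless_encoding: str) -> str:
    """Decode by BOM if present, else with the caller's chosen fallback."""
    for bom, encoding in (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ):
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(bomless_encoding)


def decode_po(data: bytes) -> str:
    # PO files travel between systems: treat BOM-less data as UTF-8.
    return decode_file_bytes(data, "utf-8")


def decode_wiki(data: bytes) -> str:
    # .wiki files tend to stay put: guess the system default encoding.
    return decode_file_bytes(data, locale.getpreferredencoding(False))
```

The only difference between the two file types is the fallback used when no BOM is found; a file that does carry a BOM is decoded the same way in both cases.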

Michael
