Bug#1053983: dos2unix: Man page should mention that UTF-16 is converted to UTF-8 by default

Ben Wong

Oct 15, 2023, 5:30:05 AM

Package: dos2unix
Version: 7.5.1-1
Severity: normal
X-Debbugs-Cc: bugs.de...@wongs.net

Dear Maintainer,

The dos2unix man page claims that the default mode is "ASCII" and that
in ASCII mode only line endings will be changed. This is no longer
true. In the default mode, UTF-16 is converted to UTF-8 and the BOM is
removed.
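
As a rough illustration of the behavior described above, here is a minimal Python sketch. It is not dos2unix's actual implementation; the function name `default_mode_sketch` is hypothetical, and the sketch assumes the input is identified as UTF-16 by its BOM and that the output encoding is UTF-8.

```python
import codecs


def default_mode_sketch(data: bytes) -> bytes:
    """Illustrative only: mimic the default behavior reported in this bug."""
    # Assumption: UTF-16 input is recognized by its byte order mark.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec uses the BOM to pick the byte order and drops it,
        # so re-encoding as UTF-8 yields output without any BOM.
        data = data.decode("utf-16").encode("utf-8")
    # The documented part of the default mode: CRLF line breaks become LF.
    return data.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "héllo\r\nworld\r\n".encode("utf-16-le")
    print(default_mode_sketch(sample))
```

The point of the sketch is that more than line endings change: the byte encoding changes and the BOM disappears, which is what the "only line breaks are converted" wording does not cover.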

I do not know if this is still considered an "ASCII" mode or if the
default is some new UTF-8 mode. Please consider updating the
documentation to match the current behavior.

Thanks!


-- System Information:
Debian Release: trixie/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 6.5.0-1-amd64 (SMP w/8 CPU threads; PREEMPT)
Kernel taint flags: TAINT_WARN
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages dos2unix depends on:
ii libc6 2.37-12

dos2unix recommends no packages.

dos2unix suggests no packages.

-- no debconf information

tony mancill

Oct 15, 2023, 11:40:05 AM

Hi Ben,

On Sun, Oct 15, 2023 at 02:22:43AM -0700, Ben Wong wrote:
> Package: dos2unix
> Version: 7.5.1-1
> Severity: normal
> X-Debbugs-Cc: bugs.de...@wongs.net
>
> Dear Maintainer,
>
> The dos2unix man page claims that the default mode is "ASCII" and that
> in ASCII mode only line endings will be changed. This is no longer
> true. In the default mode, UTF-16 is converted to UTF-8 and the BOM is
> removed.
>
> I do not know if this is still considered an "ASCII" mode or if the
> default is some new UTF-8 mode. Please consider updating the
> documentation to match the current behavior.

Thank you for your bug report.

I believe the portion of the manpage you are referring to is:

CONVERSION MODES
ascii
In mode "ascii" only line breaks are converted. This is the default
conversion mode. [**Missing information about UTF-16 behavior.**]

Although the name of this mode is ASCII, which is a 7 bit standard,
the actual mode is 8 bit. Use always this mode when converting
Unicode UTF-8 files.

Is this where you are expecting to see the manpage updated?

It is perhaps somewhat hidden in the manpage, but I think this at least
partially addresses the use case you describe:

-u, --keep-utf16
Keep the original UTF-16 encoding of the input file. The output
file will be written in the same UTF-16 encoding, little or big
endian, as the input file. This prevents transformation to UTF-8.
An UTF-16 BOM will be written accordingly. This option can be
disabled with the "-ascii" option.

That is, the use of -ascii (the default) negates --keep-utf16 and thus
*does* perform the transformation to UTF-8 and *does not* write the
UTF-16 BOM.
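
The difference between the two behaviors can be sketched in Python as follows. This is only an illustration under stated assumptions, not how dos2unix is implemented: the function `convert_sketch` and its `keep_utf16` parameter are hypothetical stand-ins for the default mode and for --keep-utf16, and UTF-16 input is assumed to be detected by its BOM.

```python
import codecs

# BOM -> codec name for the two UTF-16 byte orders.
_UTF16_BOMS = (
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def convert_sketch(data: bytes, keep_utf16: bool = False) -> bytes:
    """Illustrative only: default conversion vs. a --keep-utf16 style conversion."""
    for bom, codec in _UTF16_BOMS:
        if data.startswith(bom):
            # Decode without the BOM and convert CRLF line breaks to LF.
            text = data[len(bom):].decode(codec).replace("\r\n", "\n")
            if keep_utf16:
                # Same UTF-16 byte order as the input, BOM written back.
                return bom + text.encode(codec)
            # Default: transform to UTF-8 and write no BOM.
            return text.encode("utf-8")
    # Input without a UTF-16 BOM: only the line breaks are converted.
    return data.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    src = codecs.BOM_UTF16_BE + "a\r\nb\r\n".encode("utf-16-be")
    print(convert_sketch(src))
    print(convert_sketch(src, keep_utf16=True))
```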

I will forward the report to the upstream author.

Thank you,
tony

Ben Wong

Oct 16, 2023, 12:50:05 PM

Hello Tony,

On Sun, Oct 15, 2023 at 8:26 AM tony mancill <tman...@debian.org> wrote:
> Hi Ben,
>
> I believe the portion of the manpage you are referring to is:
>
> CONVERSION MODES
>   ascii
>     In mode "ascii" only line breaks are converted. This is the default
>     conversion mode.  [**Missing information about UTF-16 behavior.**]
>
>     Although the name of this mode is ASCII, which is a 7 bit standard,
>     the actual mode is 8 bit. Use always this mode when converting
>     Unicode UTF-8 files.

Yes, that is the section I was considering. The first sentence is quite definite that _only_ line breaks are converted, so the conversion of UTF-16 was a surprise (a pleasant one, yes, but still a surprise). 

Ideally, I'd like to see the option renamed and the man page say something like:

CONVERSION MODES
  unix
    In the default mode, "unix", line breaks are converted to newlines, Unicode characters are encoded in UTF-8, and any BOM header is stripped. This mode was originally called "ascii" and that name still works as an alias for backwards compatibility.

And, yes, I had missed the reference to how -ascii actually works in the --keep-utf16 section. It is good to see that dos2unix actually does what I wanted rather than what it claims it does.

Thank you.

Ben Wong
