
The running order of iconv and dos2unix.


Hongyi Zhao

Jul 4, 2020, 8:22:50 AM
Hi,

I have a text file which is generated on Microsoft Windows OS using the GBK encoding. Now, on Ubuntu 20.04, I want to convert this file into utf-8 encoding.

For this job, the following commands should be used:

$ dos2unix the-file.old
$ iconv -f GBK -t UTF8 the-file.old -o the-file.new

But I can't figure out which command I should run first. Any hints for this problem?

John McCue

Jul 4, 2020, 3:06:18 PM
Hongyi Zhao <hongy...@gmail.com> wrote:
> Hi,
>
[snip]
>
> For this job, the following commands should be used:
>
> $ dos2unix the-file.old
> $ iconv -f GBK -t UTF8 the-file.old -o the-file.new
>
> But I can't figure out which command should I run first.
> Any hints for this problem?

It does not matter, see what dos2unix does vs what is in
UTF-8. You will see order will be unimportant.

PS, please wrap your lines. It makes it hard for me to
reply and going forward I am planning on ignoring posts from
people who do not wrap lines in posts.
Thanks

http://www.mugsy.org/asa_faq/getting_along/usenet.shtml

Dmitry Alexandrov

Jul 4, 2020, 3:10:31 PM
In general: reencoding first, of course. While dos2unix(1) tries to be smarter than a simple awk -v RS='\r\n' -v ORS='\n' '1', it does not succeed much in that.

I have no idea what GBK is, though. Maybe itʼs ASCII-compatible; then there is no difference.

That is:

$ iconv -f gbk -t utf-8 INFILE | dos2unix > OUTFILE
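
For GBK specifically, a rough Python sketch suggests why either order ends up with the same bytes. GBK leaves ASCII bytes as they are, and the second byte of a two-byte GBK character is 0x40 or above, so it can never be CR (0x0D) or LF (0x0A). The `crlf_to_lf` helper is a hypothetical, naive stand-in for a line-ending filter (not what dos2unix does internally), and the sample text is made up.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Naive byte-level CRLF -> LF filter (hypothetical stand-in, not dos2unix)."""
    return data.replace(b"\r\n", b"\n")


# Made-up sample: two lines of Chinese text with Windows line endings.
sample = "第一行\r\n第二行\r\n"
gbk_bytes = sample.encode("gbk")

# Order 1: fix the line endings on the GBK bytes, then re-encode to UTF-8.
eol_first = crlf_to_lf(gbk_bytes).decode("gbk").encode("utf-8")

# Order 2: re-encode to UTF-8 first, then fix the line endings.
reencode_first = crlf_to_lf(gbk_bytes.decode("gbk").encode("utf-8"))

print(eol_first == reencode_first)  # True for this GBK sample
```

This only shows that the two orders agree for GBK input; it says nothing about encodings whose code units are wider than one byte.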

Dmitry Alexandrov

Jul 4, 2020, 3:32:45 PM
John McCue <jmc...@obsd2.mhome.org> wrote:
> Hongyi Zhao <hongy...@gmail.com> wrote:
>> $ dos2unix the-file.old
>> $ iconv -f GBK -t UTF8 the-file.old -o the-file.new
>>
>> But I can't figure out which command should I run first.
>> Any hints for this problem?
>
> It does not matter, see what dos2unix does

Various things. Normally, itʼs used as a simple filter that substitutes "\r\n" with "\n", though.

> vs what is in UTF-8.

There is also a third component.

> You will see order will be unimportant.

Perhaps. However, itʼs hardly a good practice to use an algorithm that only works under certain conditions, such as a specific combination of encodings, when it can be easily generalized.
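
As an illustration of an encoding where the order does matter, here is a small Python sketch using UTF-16LE. The byte-level filter is a deliberately naive stand-in (real dos2unix has its own handling for UTF-16 input), and the sample string is made up: it contains no CRLF at all, yet its bytes contain the pair 0D 0A straddling two characters.

```python
def crlf_to_lf_bytes(data: bytes) -> bytes:
    """Naive byte-level CRLF -> LF filter (hypothetical stand-in, not dos2unix)."""
    return data.replace(b"\r\n", b"\n")


# U+0D05 (MALAYALAM LETTER A) followed by a plain LF; there is no CR in the text.
sample = "\u0d05\n"
utf16 = sample.encode("utf-16-le")  # bytes: 05 0D 0A 00

# Filtering the raw bytes first removes the 0D that belongs to U+0D05.
filtered_first = crlf_to_lf_bytes(utf16)

# Decoding first and filtering the text leaves the sample untouched.
decoded_first = utf16.decode("utf-16-le").replace("\r\n", "\n")

print(len(utf16), len(filtered_first))  # 4 3
print(decoded_first == sample)          # True
```

The filtered byte string has an odd length, so it is no longer valid UTF-16; re-encoding first and converting line endings afterwards avoids that class of problem.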

> PS, please wrap your lines.

Please, stop wrapping your lines.

> It makes it hard for me to reply

It makes it hard for me to read.

> and going forward I am planning on ignoring posts from people who do not wrap lines in posts.

Who knows, maybe, thatʼs for the better. :-)

Janis Papanagnou

Jul 4, 2020, 4:09:11 PM
On 04.07.2020 21:32, Dmitry Alexandrov wrote:
> John McCue <jmc...@obsd2.mhome.org> wrote:
>> Hongyi Zhao <hongy...@gmail.com> wrote:
>>> [...]
>
>> PS, please wrap your lines.
>
> Please, stop wrapping your lines.

Why? - Have Usenet conventions silently been invalidated with invention
of Google Groups or else when posting to a Usenet group?

Janis

Dmitry Alexandrov

Jul 4, 2020, 5:49:44 PM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 04.07.2020 21:32, Dmitry Alexandrov wrote:
>> John McCue <jmc...@obsd2.mhome.org> wrote:
>>> please wrap your lines.
>>
>> Please, stop wrapping your lines.
>
> Why?

Because there is no sane reason to do that. (Besides gaining a pretext to look down on those noobs, of course.)

>>> It makes it hard for me to reply
>>
>> It makes it hard for me to read.

> - Have Usenet conventions silently been invalidated with invention of Google Groups

No, theyʼve been useless since invention of MIME. And been plainly harmful since invention of handheld PC.

Dmitry Alexandrov

Jul 4, 2020, 6:02:43 PM
Dmitry Alexandrov <d...@gnui.org> wrote:
> And been plainly harmful since invention of handheld PC.

And have been mildly harmful even earlier, as many MUAs / newsreaders have been trying to follow them too aggressively, while being unable to do that right, producing a quoting mess from time to time. Your Mozilla Thunderbird is one of them, IIRC.

Janis Papanagnou

Jul 4, 2020, 6:13:33 PM
What's the issue/mess with Thunderbird? It's not "mine", BTW, but if there's
really any inherent issue with my posts when using Thunderbird I'd certainly
be interested to hear.

Janis

Janis Papanagnou

Jul 4, 2020, 6:14:06 PM
I read that that you mean you think they are useless.

Seems more of a religious spin-off of this thread. Thanks, I'll abstain.

Janis

Kaz Kylheku

Jul 4, 2020, 8:03:10 PM
On 2020-07-04, Dmitry Alexandrov <d...@gnui.org> wrote:
> --=-=-=
> Content-Type: text/plain; charset=utf-8
> Content-Transfer-Encoding: quoted-printable

Lol! Someone posting quoted-printable garbage to Usenet is
educating about encoding matters.

Dmitry Alexandrov

Jul 4, 2020, 9:20:46 PM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 05.07.2020 00:02, Dmitry Alexandrov wrote:
>> Dmitry Alexandrov <d...@gnui.org> wrote:
>>> And been plainly harmful since invention of handheld PC.
>>
>> And have been mildly harmful even earlier, as many MUAs / newsreaders have been trying to follow them too aggressively, while being unable to do that right, producing a quoting mess from time to time. Your Mozilla Thunderbird is one of them, IIRC.
>
> What's the issue/mess with Thunderbird?

By default (out of the box) it hardwraps lines at a certain column automatically. It does it without taking context into consideration. Which is generally impossible without AI, of course, but some other tools do better guesses.

Besides mangling a code snippet (which is obvious) it can, for instance, break a long URL. Or, as I said, mess with quoting, namely, produce something like:

>>>> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
> tempor
>>>> incididunt ut labore et dolore magna aliqua.
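
A small Python sketch of the difference between wrapping a quoted line blindly and wrapping it with the quote prefix in mind. This is only an illustration of the idea, not how Thunderbird or any other client actually works; the function names and the 72-column width are assumptions.

```python
import re
import textwrap


def naive_wrap(line: str, width: int = 72) -> list:
    """Wrap at a fixed column without looking at the quote prefix."""
    return textwrap.wrap(line, width=width)


def quote_aware_wrap(line: str, width: int = 72) -> list:
    """Strip the '>' prefix, wrap the body, then put the prefix on every line."""
    prefix = re.match(r"(?:> ?)*", line).group(0)
    body = line[len(prefix):]
    return textwrap.wrap(body, width=width,
                         initial_indent=prefix, subsequent_indent=prefix)


quoted = (">>>> Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
          "sed do eiusmod tempor")

for wrapped_line in naive_wrap(quoted):
    print(wrapped_line)        # the continuation line loses its ">>>> " prefix
print()
for wrapped_line in quote_aware_wrap(quoted):
    print(wrapped_line)        # every line keeps the ">>>> " prefix
```

The naive version is what produces the stray unquoted fragment in the example above; the quote-aware version keeps the attribution level intact at the cost of a little parsing.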

All of this is avoidable with a certain care, of course, yet too many people realise that their letter is messed up only when itʼs sent (or never).

But above all is that all these nuisances serve no good purpose.

> It's not "mine", BTW

Sorry, is there any wrong implication in ‘your’ said about a program one uses? I have not meant anything but that.

> but if there's really any inherent issue with my posts

I am not going to scan Usenet for your posts right now, but sure, I will notify you upon encounter.

Dmitry Alexandrov

Jul 4, 2020, 9:26:58 PM
Janis Papanagnou <janis_pa...@hotmail.com> wrote:
> On 04.07.2020 23:49, Dmitry Alexandrov wrote:
>> No, theyʼve been useless since invention of MIME.
>
> I read that that you mean you think they are useless.

Does that elaboration really add anything?

Or do you mean, you _know_ any use hardwraps have? Except being a part of subculture.

> Seems more of a religious spin-off of this thread.

A spin-off, of course. But I see nothing religious in it.

> Thanks, I'll abstain.

As you wish.

Janis Papanagnou

Jul 5, 2020, 4:55:07 AM
On 05.07.2020 03:20, Dmitry Alexandrov wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
[...]
>> What's the issue/mess with Thunderbird?
>
> By default (out of the box) it hardwraps lines at a certain column
> automatically. It does it without taking context into consideration.
> Which is generally impossible without AI, of course, but some other tools
> do better guesses.

Ah, you mean the tool is not optimal if compared to others. Well, I'm sure
you're right; there are so many tools out there.

> Besides mangling a code snippet (which is obvious) it can, for instance,
> break a long URL. Or, as I said, mess with quoting, namely, produce
> something like:
>
>>>>> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
>>>>> eiusmod
>> tempor
>>>>> incididunt ut labore et dolore magna aliqua.

Actually (from my perspective), Thunderbird does well with quoting if the
base text (the original one) conforms to certain (Usenet) standards. If I
have to reformat some post that violates even basic standards (and has some
other formatting issues as well) that may indeed fail, but in those cases
I wouldn't blame "my" tool but the other end of the communication channel.

A specific problem is - as you also say, in the context of text and code -
if folks use (non-standard) long lines and code interspersed in one post.
If I want to reformat such posts with Thunderbird I cannot just "^A" "^R"
the whole post but I'd have to individually select the text parts (using
slow procedures like the mouse), and here some formattings of the original
posts may syntactically result in a malformatted quote. It wouldn't be an
issue if the original post had conformed to the Usenet standards in
the first place. (Often I just ignore such posts.)

With long URLs I observe that a quoted URL, while being *displayed* with a
break in a followup post client-side (to fit in a window and to make it
unnecessary to scroll inside the window to see it), is still sent
intact as one component so that all see it in one piece (on one line) and
are able to use it consistently. No issue here (from my perspective).

> All of this is avoidable with a certain care, of course, yet too many
> people realise that their letter is messed up only when itʼs sent (or
> never).
>
> But above all is that all these nuisances serve no good purpose.
>
>> It's not "mine", BTW
>
> Sorry, is there any wrong implication in ‘your’ said about a program one uses?
> I have not meant anything but that.
>
>> but if there's really any inherent issue with my posts
>
> I am not going to scan Usenet for your posts right now, but sure, I will
> notify you upon encounter.

I haven't intended to impose any task on you. Only since you said there's
problems I thought to ask since I assumed you have something specific in
mind and since you specifically mentioned "my" tool. Specifically I have
not seen or heard of issues, so I doubt that you have a point, but CMIIW.

But more important I think there are generally also no issues when using
(Usenet-)standard conforming posts and any real newsreader. A problem
may arise once folks try new devices - say, smartphones - in conjunction
with old (Usenet) protocols (and [sensible] conventions). It's probably
better for those (often younger) folks to visit a web forum, a medium that
better addresses their expectation of User Experience and dynamic display
presentation. As long as these folks post here they should stay standard
conforming if possible. YMMV. On the other hand, their posts might otherwise
just be ignored and considered noise; also no issue, just a minor nuisance.

(Still, the standard here in Usenet is "follow the conventions", and the
suggestion to use a Real Newsreader to not unnecessarily impose rigors.)

Janis

Janis Papanagnou

Jul 5, 2020, 5:06:47 AM
On 05.07.2020 03:26, Dmitry Alexandrov wrote:
> Janis Papanagnou <janis_pa...@hotmail.com> wrote:
>> On 04.07.2020 23:49, Dmitry Alexandrov wrote:
>>> No, theyʼve been useless since invention of MIME.
>>
>> I read that that you mean you think they are useless.
>
> Does that elaboration really add anything?
>
> Or do you mean, you _know_ any use hardwraps have? Except being a part of subculture.

Erm, yes. Increase readability for example, and to make it unnecessary
to scroll texts using slow User Interface procedures with standard tools.
And, simply speaking, to conform to standards that tools relied and still
rely on.

See, introducing (as you say) MIME should (IMO; YMMV) not disrupt common
use of standards and conventions. (I have a strong Deja Vu with respect
to introduction of HTML mail when I read your MIME statement.)

Substitute your informal word "subculture" with a (concrete) "standard"
and "convention" and it makes more sense.

Janis

Michael Bäuerle

Jul 5, 2020, 5:31:20 AM
This is a different case and clearly defined by RFC 5536 [1]:
|
| User agents MUST meet the definition of MIME conformance in [RFC2049]
| and MUST also support [RFC2231]. [...]

Line wrapping is a different case. RFC 5322 says in [2]:
|
| There are two limits that this specification places on the number of
| characters in a line. Each line of characters MUST be no more than
| 998 characters, and SHOULD be no more than 78 characters, excluding
| the CRLF. [...]
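
The two limits quoted above can be sketched as a small Python check. This
is only an illustration under simplifying assumptions: the function name
is made up, the body is taken as a CRLF-delimited string, and length is
counted in Python characters rather than octets.

```python
def check_line_lengths(body: str) -> list:
    """Return (line_number, length, verdict) for each CRLF-delimited line."""
    report = []
    for number, line in enumerate(body.split("\r\n"), start=1):
        length = len(line)  # the CRLF itself is not counted
        if length > 998:
            verdict = "violates MUST"
        elif length > 78:
            verdict = "exceeds SHOULD"
        else:
            verdict = "ok"
        report.append((number, length, verdict))
    return report


# Made-up article body: one short line, one 100-character line, one 1000-character line.
article = "short line\r\n" + "x" * 100 + "\r\n" + "y" * 1000
for entry in check_line_lengths(article):
    print(entry)
```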

RFC 2119 defines the meaning of "MUST" and "SHOULD" [3]:
|
| 1. MUST This word, or the terms "REQUIRED" or "SHALL", mean that the
| definition is an absolute requirement of the specification.
| [...]
| 3. SHOULD This word, or the adjective "RECOMMENDED", mean that there
| may exist valid reasons in particular circumstances to ignore a
| particular item, but the full implications must be understood and
| carefully weighed before choosing a different course.

Before thinking about "valid reasons" to ignore the "SHOULD", remember
that a scheme for flowed text (Paragraph Text) is already defined by
RFC 3676 [4]. Therefore "there is no defined way to do it" is no valid
reason for normal text. Valid reasons may be:
1) URIs
   Some newsreaders cannot present URIs in a clickable way if they
   contain line breaks.
2) Tables
   Rewrapping the lines makes the table unreadable.

Traditionally the text was intended to be displayed with a monospace
font on 80 column width text terminals. RFC 3676 [4] defines this as
the default, if there is no "format" parameter present in the MIME
"Content-Type" header field.
This means that a newsreader should not rewrap the lines for "fixed"
format by default. And this is what makes it annoying to read articles
from Google (with very long lines and no "flowed" declaration).
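
As a rough sketch of that default, a reader could look up the "format"
parameter with Python's standard email package and fall back to "fixed"
when it is absent. The function name and the sample headers below are
made up for illustration.

```python
from email import message_from_string


def text_format(raw_article: str) -> str:
    """Return the Content-Type 'format' parameter, defaulting to 'fixed'."""
    msg = message_from_string(raw_article)
    # get_param() returns the fallback when the header or the parameter is missing.
    return str(msg.get_param("format", "fixed")).lower()


# Made-up sample articles.
flowed = "Content-Type: text/plain; charset=utf-8; format=flowed\n\nbody\n"
undeclared = "Content-Type: text/plain; charset=utf-8\n\nbody\n"
no_header = "Subject: test\n\nbody\n"

print(text_format(flowed))      # flowed
print(text_format(undeclared))  # fixed
print(text_format(no_header))   # fixed
```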

It is clear that devices with small displays may want to rewrap all
articles for display purposes, regardless of the RFC 3676 declarations.
If they use a proportional font the line length is independent from the
number of characters anyway. And there is no problem with that for the
receiving direction in general.
But for the sending direction all articles should follow RFC 5322 (no
lines longer than 78 characters) or use RFC 3676 "flowed" format.
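
A minimal Python sketch of the core idea of "flowed" text: a line that
ends with a space is a soft break and is joined with the next line, and
lines starting with a space, ">" or "From " are space-stuffed when sent.
It is a simplification under stated assumptions: quoted text and the
DelSp parameter are ignored, the function names are made up, and it is
not a complete RFC 3676 implementation.

```python
import textwrap


def flow(paragraph: str, width: int = 72) -> list:
    """Soft-wrap one unquoted paragraph; soft breaks end with a space."""
    pieces = textwrap.wrap(paragraph, width=width,
                           break_long_words=False, break_on_hyphens=False)
    lines = []
    for index, piece in enumerate(pieces):
        if index < len(pieces) - 1:
            piece += " "                      # trailing space marks a soft break
        if piece.startswith((" ", ">", "From ")):
            piece = " " + piece               # space-stuffing
        lines.append(piece)
    return lines


def unflow(lines: list) -> list:
    """Rejoin soft-wrapped, unquoted lines into paragraphs."""
    paragraphs, current = [], ""
    for line in lines:
        if line.startswith(" "):
            line = line[1:]                   # undo space-stuffing
        if line == "-- ":                     # signature separator is never flowed
            if current:
                paragraphs.append(current)
                current = ""
            paragraphs.append(line)
            continue
        current += line
        if not line.endswith(" "):            # hard break ends the paragraph
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return paragraphs


# Made-up sample paragraph, wrapped narrowly to show the effect.
text = ("From here on, every line is soft-wrapped by the sender and rejoined "
        "by the reader, which keeps long paragraphs readable on any screen width.")
sent = flow(text, width=30)
for sent_line in sent:
    print(repr(sent_line))
print(unflow(sent) == [text])
```

With something like this on both ends, the sender keeps lines short while
the receiver is still free to rewrap paragraphs to its own display width.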


______________
[1] <https://tools.ietf.org/html/rfc5536#section-2.3>
[2] <https://tools.ietf.org/html/rfc5322#section-2.1.1>
[3] <https://tools.ietf.org/html/rfc2119>
[4] <https://tools.ietf.org/html/rfc3676>

Kaz Kylheku

Jul 6, 2020, 12:18:12 PM
On 2020-07-05, Michael Bäuerle <michael....@gmx.net> wrote:
> Kaz Kylheku wrote:
>> On 2020-07-04, Dmitry Alexandrov <d...@gnui.org> wrote:
>> >
>> > --=-=-=
>> > Content-Type: text/plain; charset=utf-8
>> > Content-Transfer-Encoding: quoted-printable
>>
>> Lol! Someone posting quoted-printable garbage to Usenet is
>> educating about encoding matters.
>
> This is a different case and clearly defined by RFC 5536 [1]:

MIME shit is off-topic in this newsgroup.

>|
>| User agents MUST meet the definition of MIME conformance in [RFC2049]
>| and MUST also support [RFC2231]. [...]
>
> Line wrapping is a different case. RFC 5322 says in [2]:
>|
>| There are two limits that this specification places on the number of
>| characters in a line. Each line of characters MUST be no more than
>| 998 characters, and SHOULD be no more than 78 characters, excluding
>| the CRLF. [...]
>

They got it wrong; the limit is actually 72. Though that figure
historically relates to irrelevant line length limitations of American
TTY's, there is a reason for it in Usenet: it leaves a bit of room for
several levels of "> " quoting.

A reasonable exception has to be made for preformatted material like
code and tables.

RFC 5322 is from 2009. Any Usenet RFC which is that late to the party
can safely be ignored. Usenet didn't require any new RFC in 2009;
and issuing one didn't save it from decline. (Here is a question: how
many of the authors named in that RFC participate in Usenet today?)

The man page for my news reader (installed as a package from a standard
Ubuntu 18 repository) is from 2008.
I think that is perfectly fine.

Posting MIME to comp.* discussion newsgroups is a netiquette violation.
Whether or not MIME is codified by an RFC is neither here nor there.

Independently of MIME, using quoted-printable in 2020 is moronic, since
UTF-8 has become ubiquitous. Usenet agents from before RFC 5322 can
handle UTF-8 (all they have to do is pass 8 bit text to and from the
editor, and terminal window, if they run in one).

You can use special characters, like math symbols, via UTF-8, without
resorting to MIME and quoted-printable.

For the English apostrophe like in "it's", for pete's sake, just use the
ASCII apostrophe (character code 39, decimal).
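
To make the difference concrete, here is a small Python sketch comparing
raw 8-bit UTF-8 with its quoted-printable form, using the standard quopri
module. The sample word is an assumption: "its" written with U+02BC
(MODIFIER LETTER APOSTROPHE) instead of the ASCII apostrophe.

```python
import quopri

# Made-up sample: "it" + U+02BC (modifier letter apostrophe) + "s".
text = "it\u02bcs"
utf8 = text.encode("utf-8")

# Quoted-printable escapes each non-ASCII byte as =XX.
qp = quopri.encodestring(utf8)

print(utf8)       # b'it\xca\xbcs'
print(qp)
print(ord("'"))   # 39, the ASCII apostrophe
```

A reader that does not decode the transfer encoding shows the =XX escapes
literally, whereas the raw UTF-8 bytes display correctly in any 8-bit
clean reader running in a UTF-8 terminal.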