Re: sed error message reports byte position instead of char position when program contains UTF-8

Eli Zaretskii

May 14, 2013, 4:23:04 PM

to Camion SPAM, bon...@gnu.org, bug-gn...@gnu.org

> Date: Tue, 14 May 2013 15:22:48 +0100 (BST)
> From: Camion SPAM <camion_sp...@yahoo.fr>
>
> In case of error in a sed program, sed error message reports byte position instead of char position when program contains UTF-8.
>
> $ sed 's/qwerty///'
> sed: -e expression #1, char 11: unknown option to `s'
> $ sed 's/qwérty///'
> sed: -e expression #1, char 12: unknown option to `s'

Options to the 's' command are all pure-ASCII single-byte characters,
so Sed is correct in this case, I think.

Paul Jarc

May 15, 2013, 2:36:47 PM

to Camion SPAM, bon...@gnu.org, bug-gn...@gnu.org

Eli Zaretskii <el...@gnu.org> wrote:
> Options to the 's' command are all pure-ASCII single-byte characters,
> so Sed is correct in this case, I think.

I don't think so. The example doesn't violate that rule. The unknown
option is "/". But due to a multibyte character in the pattern, the
offset is either miscalculated (should say "11" instead of "12") or
badly described (should say "byte" instead of "char").

paul
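
To make the discrepancy concrete, here is a minimal Python sketch (not part of sed; the helper name `offsets` is made up for illustration) that counts both characters and UTF-8 bytes for the two programs from the report. It assumes the "é" is the single precomposed code point U+00E9.

```python
# Illustrative only: compare character and byte lengths of the two
# sed programs from the original report. Assumes UTF-8 encoding and a
# precomposed U+00E9 for the accented letter.
def offsets(program):
    """Return (character count, UTF-8 byte count) of a sed program."""
    return len(program), len(program.encode("utf-8"))


for program in ("s/qwerty///", "s/qw\u00e9rty///"):
    chars, size = offsets(program)
    print(f"{program}: {chars} characters, {size} bytes")
```

Both programs are 11 characters long, but the second occupies 12 bytes, which matches the "char 12" in the error message.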

Eli Zaretskii

May 15, 2013, 5:00:45 PM

to Paul Jarc, bon...@gnu.org, bug-gn...@gnu.org, camion_sp...@yahoo.fr

> From: p...@case.edu (Paul Jarc)
> Date: Wed, 15 May 2013 14:36:47 -0400

"Byte" is the only viable alternative, but that leaves the burden of
counting bytes on the user.

Camion SPAM

May 16, 2013, 7:42:20 AM

to Eli Zaretskii, Paul Jarc, bon...@gnu.org, bug-gn...@gnu.org

> Eli Zaretskii <el...@gnu.org> wrote:

> "Byte" is the only viable alternative, but that leaves the burden of
> counting bytes on the user.

If you use a very long sed script, this can be a problem.
I happen to write sed scripts that are more than 1000 characters long.
My last one is currently 1762 characters long and still growing.
That's why I wrote a little bash function which would show me
the nth character, but this gave wrong positions on scripts with
UTF-8 chars.
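
The bash function itself is not shown in this thread. As a rough sketch of the same idea, the following Python helper (the name `char_position` is hypothetical) maps the 1-based byte position that sed reports onto the 1-based character position in the script, assuming the script is UTF-8 text:

```python
# Hypothetical helper, not the bash function mentioned above: translate
# a 1-based byte position (as reported by sed) into the 1-based
# character position, assuming the script is encoded as UTF-8.
def char_position(script, byte_pos):
    """Return (character index, character) containing byte `byte_pos`."""
    if byte_pos < 1:
        raise ValueError("byte_pos must be >= 1")
    consumed = 0  # bytes covered by the characters seen so far
    for index, ch in enumerate(script, start=1):
        consumed += len(ch.encode("utf-8"))
        if consumed >= byte_pos:
            return index, ch
    raise ValueError("byte_pos is past the end of the script")


# The failing program from the report: sed says "char 12".
print(char_position("s/qw\u00e9rty///", 12))
```

For the program from the report, byte 12 falls on character 11, the third "/".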

The workaround is to change LC_CTYPE to C around the string
processing part in my bash function, but I believe that since sed
supports multibyte characters, the error message should count
characters and not bytes. BTW, the error message states that the
position is "char" and not "byte":

sed: -e expression #1, char 12: unknown option to `s'

Jose E. Marchesi

May 16, 2013, 7:47:53 AM

to Camion SPAM, bug-gn...@gnu.org, bon...@gnu.org, Paul Jarc

> > "Byte" is the only viable alternative, but that leaves the burden of
> > counting bytes on the user.
>
> If you use a very long sed script, this can be a problem.

I agree. I will take a look at this.

Eli Zaretskii

May 16, 2013, 8:54:58 AM

to Camion SPAM, p...@case.edu, bug-gn...@gnu.org, bon...@gnu.org

> Date: Thu, 16 May 2013 12:42:20 +0100 (BST)
> From: Camion SPAM <camion_sp...@yahoo.fr>
> Cc: "bon...@gnu.org" <bon...@gnu.org>,
> "bug-gn...@gnu.org" <bug-gn...@gnu.org>

>
> > "Byte" is the only viable alternative, but that leaves the burden of
> > counting bytes on the user.
>
> If you use a very long sed script, this can be a problem.

> I happen to write sed scripts that are more than 1000 characters long.
> My last one is currently 1762 characters long and still growing.
> That's why I wrote a little bash function which would show me
> the nth character, but this gave wrong positions on scripts with
> UTF-8 chars.
>
> The workaround is to change LC_CTYPE to C around the string
> processing part in my bash function, but I believe that since sed
> supports multibyte characters, the error message should count
> characters and not bytes. BTW, the error message states that the
> position is "char" and not "byte":
>
> sed: -e expression #1, char 12: unknown option to `s'

How do you expect Sed to know what character set is being used for the
command line? Are we again going to limit ourselves to the current
locale's charset?

John Cowan

May 16, 2013, 9:33:42 AM

to Eli Zaretskii, bon...@gnu.org, p...@case.edu, bug-gn...@gnu.org, Camion SPAM

Eli Zaretskii scripsit:

> How do you expect Sed to know what character set is being used for the
> command line? Are we again going to limit ourselves to the current
> locale's charset?

I think it is a reasonable assumption that the command line uses the
same encoding that is used for the input files. Nothing else will
make common cases like "sed 's/föö*/bär/'" work correctly.

--
John Cowan co...@ccil.org http://ccil.org/~cowan
Objective consideration of contemporary phenomena compel the conclusion
that optimum or inadequate performance in the trend of competitive
activities exhibits no tendency to be commensurate with innate capacity,
but that a considerable element of the unpredictable must invariably be
taken into account. --Ecclesiastes 9:11, Orwell/Brown version

Eli Zaretskii

May 16, 2013, 9:44:58 AM

to John Cowan, bon...@gnu.org, p...@case.edu, bug-gn...@gnu.org, camion_sp...@yahoo.fr

> Date: Thu, 16 May 2013 09:33:42 -0400
> From: John Cowan <co...@mercury.ccil.org>
> Cc: Camion SPAM <camion_sp...@yahoo.fr>, p...@case.edu,
> bug-gn...@gnu.org, bon...@gnu.org

>
> Eli Zaretskii scripsit:
>
> > How do you expect Sed to know what character set is being used for the
> > command line? Are we again going to limit ourselves to the current
> > locale's charset?
>
> I think it is a reasonable assumption that the command line uses the
> same encoding that is used for the input files.

Yes, mostly. But how do you know what is the encoding of the input
files?

John Cowan

May 16, 2013, 10:21:39 AM

to Eli Zaretskii, bon...@gnu.org, p...@case.edu, bug-gn...@gnu.org, camion_sp...@yahoo.fr

Eli Zaretskii scripsit:

> Yes, mostly. But how do you know what is the encoding of the input
> files?

If you don't know that, you don't know how to interpret regular
expressions against the text of the file, because you don't know what
characters it contains. Even in seemingly trivial cases, like "sed
s/abc/def/", you have no idea what to do if you don't know whether the
file is ASCII or EBCDIC. For that matter, even "sed 2p" will not work
correctly if you don't know the encoding of the newline character.

So either you do just use the locale, or sed needs an option to specify
the file encoding (in which case it should also provide the command
line encoding).
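
The ASCII/EBCDIC point can be illustrated with a small Python sketch. It uses Python's `cp037` codec as a stand-in for "EBCDIC" (one of several EBCDIC code pages, chosen here only as an example) and shows that the same text produces different bytes, so a byte-level search for the ASCII pattern finds nothing:

```python
# Illustration only: the same characters have different byte values in
# ASCII and in EBCDIC (here Python's cp037 codec, one EBCDIC variant).
ascii_bytes = "abc".encode("ascii")
ebcdic_bytes = "abc".encode("cp037")

print(ascii_bytes.hex(), ebcdic_bytes.hex())

# A byte-wise search for the ASCII pattern fails on EBCDIC data.
found = ascii_bytes in ebcdic_bytes

# The newline is also encoded differently, which is why even a
# line-oriented command needs to know the encoding.
newline_differs = "\n".encode("cp037") != "\n".encode("ascii")
```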

--
John Cowan co...@ccil.org http://www.ccil.org/~cowan
Well, I have news for our current leaders and the leaders of tomorrow:
the Bill of Rights is not a frivolous luxury, in force only during
times of peace and prosperity. We don't just push it to the side when
the going gets tough. --Molly Ivins

Eli Zaretskii

May 16, 2013, 11:28:26 AM

to John Cowan, bon...@gnu.org, p...@case.edu, bug-gn...@gnu.org, camion_sp...@yahoo.fr

> Date: Thu, 16 May 2013 10:21:39 -0400
> From: John Cowan <co...@mercury.ccil.org>
> Cc: camion_sp...@yahoo.fr, p...@case.edu, bug-gn...@gnu.org,
> bon...@gnu.org

>
> Eli Zaretskii scripsit:
>
> > Yes, mostly. But how do you know what is the encoding of the input
> > files?
>
> If you don't know that, you don't know how to interpret regular
> expressions against the text of the file, because you don't know what
> characters it contains.

AFAIK, Sed uses bytes, not characters.

John Cowan

May 16, 2013, 11:41:00 AM

to Eli Zaretskii, bon...@gnu.org, p...@case.edu, bug-gn...@gnu.org, camion_sp...@yahoo.fr

Eli Zaretskii scripsit:

> AFAIK, Sed uses bytes, not characters.

Definitely not. Look at the following:

$ echo $LANG
en_US.UTF-8
$ cat >foo
föö
(Ctrl-D)
$ wc -c foo
6 foo
(including the newline; therefore the file is UTF-8)
$ sed -n '/^...$/p' <foo
föö
$ sed -n '/^.....$/p' <foo
$

So the regex matches 3 characters, not 5 bytes.
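
The same contrast can be reproduced outside sed. The sketch below uses Python's `re` module purely as an analogy (it says nothing about how sed is implemented): matching against decoded text counts characters, while matching against the raw UTF-8 bytes counts bytes.

```python
import re

# Analogy only: Python's re module, not sed. "föö" is 3 characters
# but 5 bytes once encoded as UTF-8.
text = "f\u00f6\u00f6"
raw = text.encode("utf-8")

# Character-wise matching (str pattern on str subject).
chars_match_3 = re.fullmatch(r"...", text) is not None
chars_match_5 = re.fullmatch(r".....", text) is not None

# Byte-wise matching (bytes pattern on bytes subject).
bytes_match_3 = re.fullmatch(rb"...", raw) is not None
bytes_match_5 = re.fullmatch(rb".....", raw) is not None

print(chars_match_3, chars_match_5, bytes_match_3, bytes_match_5)
```

The sed session above behaves like the character-wise case: three dots match, five do not.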

--
John Cowan co...@ccil.org
We call nothing profound that is not wittily expressed.
--Northrop Frye (improved)

Errembault Philippe

May 16, 2013, 12:24:44 PM

to Eli Zaretskii, John Cowan, p...@case.edu, bug-gn...@gnu.org, bon...@gnu.org

Eli Zaretskii wrote:

> How do you expect Sed to know what character set is being used for the
> command line? Are we again going to limit ourselves to the current
> locale's charset?

John Cowan wrote:

> I think it is a reasonable assumption that the command line uses the
> same encoding that is used for the input files.

Eli Zaretskii wrote:
> Yes, mostly. But how do you know what is the encoding of the input
> files?

What are you talking about?!
This is about neither the current character set nor the processed file.
The only way to do this is to have sed respect the values of the current
locale variables, LANG and LC_*, which can be changed as needed.
That's what every localized program does.
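
As a sketch of what "respecting the locale variables" means for character handling, the following Python function (the name `effective_ctype` is hypothetical) applies the standard POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG, and empty values count as unset. The fallback to "C" when nothing is set is an assumption made for the example.

```python
# Sketch of the POSIX precedence for the character-type locale:
# LC_ALL, then LC_CTYPE, then LANG. Empty values are treated as unset.
# Falling back to "C" is an assumption made for this example.
def effective_ctype(env):
    """Return the locale name that governs character handling."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"


print(effective_ctype({"LANG": "en_US.UTF-8"}))
print(effective_ctype({"LANG": "en_US.UTF-8", "LC_CTYPE": "C"}))
```

The second call corresponds to the workaround mentioned earlier in the thread: setting LC_CTYPE to C switches character handling to bytes even though LANG names a UTF-8 locale.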