Accented characters made with Notepad

Jean-Pierre Coulon

Nov 2, 2011, 9:29:05 AM

A colleague has written a text containing accented characters in Notepad under XP.
How can I include this text in a LaTeX document? \usepackage[cp850]{inputenc}
didn't work, and neither did [latin1].

When I open the Notepad document with the old DOS EDIT each accented character
is represented by *two* strange characters.

Regards,

--
Jean-Pierre Coulon (here "cacas.pam" is what others call "nospam")

Lars Madsen

Nov 2, 2011, 9:39:35 AM

Jean-Pierre Coulon wrote, On 2011-11-02 14:29:
> A colleague has written a text with accented characters with Notepad
> under XP. How can I include this text into a LaTeX document?
> \usepackage[cp850]{inputenc} didn't work. Same with [latin1].
>
> When I open the Notepad document with the old DOS EDIT each accented
> character is represented by *two* strange characters.
>
> Regards,
>
>

ansinew or perhaps utf8?

(do not remember if cp850 = ansinew)

Your last comment sounds like it might be utf8

--

/daleif (remove RTFSIGNATURE from email address)

Memoir and mh bundle maintainer
LaTeX FAQ: http://www.tex.ac.uk/faq
LaTeX book: http://www.imf.au.dk/system/latex/bog/ (in Danish)
Remember to post minimal examples, see URL below
http://www.minimalbeispiel.de/mini-en.html

Jean-Pierre Coulon

Nov 2, 2011, 10:29:58 AM

On Wed, 2 Nov 2011, Lars Madsen wrote:

> ansinew or perhaps utf8?

> (do not remember if cp850 = ansinew)
>
> Your last comment sounds like it might be utf8

ansinew fails on every accented character. utf8 works except for the character
I would write as \^u in LaTeX. The error message is:

! Package inputenc Error: Unicode char \u8:Ã" not set up for use with LaTeX.

Jussi Piitulainen

Nov 2, 2011, 10:48:37 AM

Jean-Pierre Coulon writes:

> Ansinew fails at all accented characters. utf8 works except for the
> character I would write \^u in LaTeX. The error message is:
>
> ! Package inputenc Error: Unicode char \u8:Ã" not set up for use with LaTeX.

Try \usepackage[utf8x]{inputenc} (utf8x is an inputenc option provided by the
ucs package, not a package of its own). It worked for me in a similar situation.

Or perhaps try xelatex. It seems to be a native speaker of UTF-8,
among other things.
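
A minimal sketch of the XeLaTeX route, assuming the input file really is valid UTF-8 and that the default fonts loaded by fontspec are installed. It will not repair a file whose bytes are already damaged.

```latex
% Compile with xelatex (or lualatex) instead of pdflatex.
\documentclass{article}
\usepackage{fontspec} % Unicode font loading; no inputenc needed
\begin{document}
représente sûrement
\end{document}
```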

Lars Madsen

Nov 2, 2011, 10:48:45 AM

Jean-Pierre Coulon wrote, On 2011-11-02 15:29:
> On Wed, 2 Nov 2011, Lars Madsen wrote:
>
>> ansinew or perhaps utf8?
>
>> (do not remember if cp850 = ansinew)
>>
>> Your last comment sounds like it might be utf8
>
> Ansinew fails at all accented characters. utf8 works except for the
> character I would write \^u in LaTeX. The error message is:
>
> ! Package inputenc Error: Unicode char \u8:Ã" not set up for use with
> LaTeX.
>

Then I guess you will have to define it yourself, e.g. with
\DeclareUnicodeCharacter. Not all chars are covered by the utf8 setup.

Was your colleague using xelatex instead?

Robin Fairbairns

Nov 2, 2011, 11:10:28 AM

Lars Madsen <dal...@RTFMSIGNATUREimf.au.dk> writes:

> Jean-Pierre Coulon wrote, On 2011-11-02 15:29:
>> On Wed, 2 Nov 2011, Lars Madsen wrote:
>>
>>> ansinew or perhaps utf8?
>>>
>>> (do not remember if cp850 = ansinew)
>>>
>>> Your last comment sounds like it might be utf8
>>
>> Ansinew fails at all accented characters. utf8 works except for the
>> character I would write \^u in LaTeX. The error message is:
>>
>> ! Package inputenc Error: Unicode char \u8:Ã" not set up for use
>> with LaTeX.
>
> Then I guess you will have to define it. There is some information
> somewhere. Not all chars are covered by the utf8 setup.

utf8 is supposed to cover things that are covered by "official" latex
encodings; \^u is so covered (it's in the t1 encoding).

if this is confirmed, i would judge it a good topic for a latex bug
report (those do get resolved, eventually ... it's just babel bug
reports that are ignored).
--
Robin Fairbairns, Cambridge
my address is @cl.cam.ac.uk, regardless of the header. sorry about that.

Ulrike Fischer

Nov 2, 2011, 11:46:54 AM

Am Wed, 02 Nov 2011 15:10:28 +0000 schrieb Robin Fairbairns:

>>>> ansinew or perhaps utf8?
>>>>
>>>> (do not remember if cp850 = ansinew)
>>>>
>>>> Your last comment sounds like it might be utf8
>>>
>>> Ansinew fails at all accented characters. utf8 works except for the
>>> character I would write \^u in LaTeX. The error message is:
>>>
>>> ! Package inputenc Error: Unicode char \u8:Ã" not set up for use
>>> with LaTeX.
>>
>> Then I guess you will have to define it. There is some information
>> somewhere. Not all chars are covered by the utf8 setup.
>
> utf8 is supposed to cover things that are covered by "official" latex
> encodings; \^u is so covered (it's in the t1 encoding).

Works fine for me

\documentclass{scrartcl}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
û Û
\end{document}

But I don't think that Ã" can represent a valid UTF-8 character (the code
of " is too low for it to be part of a multi-byte UTF-8 sequence). So I
think something else is wrong. Perhaps the encoding got mixed up.

--
Ulrike Fischer

Jean-Pierre Coulon

Nov 2, 2011, 11:35:27 AM

I noticed that the faulty character is encoded with C3 22 in hexadecimal.
Isn't there any way to say something like \DeclareUnicodeCharacter{C322}{\^u} or
similar?

Heiko Oberdiek

Nov 2, 2011, 12:35:37 PM

Jean-Pierre Coulon <cou...@cacas.pam.obs-nice.fr> wrote:

> I noticed that the faulty character is encoded with C3 22 in hexadecimal.

No, that's not valid UTF-8, the second byte must be between 128 and
191.

> Isn't there any way to say something like \DeclareUnicodeCharacter{C322}{\^u} or
> similar?

\^u is U+00FB, or as hex bytes in UTF-8: 0xC3 0xBB.
It is declared in t1enc.dfu, which is usually loaded by default
(see the .log file).
Try to make a minimal example that shows the problem and
inspect the problematic line with a hex editor.
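
For reference, a minimal sketch of how \DeclareUnicodeCharacter is normally used: its first argument is the Unicode code point in hexadecimal, not the UTF-8 byte sequence, so something like {C322} cannot work. The declaration below is redundant with T1 loaded and is shown only to illustrate the syntax.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
% First argument: the code point U+00FB in hex,
% not the UTF-8 bytes C3 BB.
\DeclareUnicodeCharacter{00FB}{\^u}
\begin{document}
% ^^c3^^bb feeds the two UTF-8 bytes of U+00FB to inputenc.
s^^c3^^bbrement
\end{document}
```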

The hexadecimal notation can even be used in TeX:

\documentclass{article}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}

\^u = ^^c3^^bb
\end{document}

Advantage: that survives the recodings of mail and news programs.

BTW, in Notepad's "Save As" dialog the encoding can also be changed
to "ANSI"; then \usepackage[ansinew]{inputenc} can be used, unless of
course characters outside ANSI are needed.
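
A minimal sketch of the ANSI route, assuming the file was re-saved from Notepad as "ANSI" (Windows-1252). The accented letters are written in ^^ notation here only so that the example survives recoding; in the real file they would be the single bytes 0xE9 and 0xFB.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[ansinew]{inputenc}
\begin{document}
% In ansinew, 0xE9 is e-acute and 0xFB is u-circumflex.
repr^^e9sente s^^fbrement
\end{document}
```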

--
Heiko Oberdiek

Philipp Stephani

Nov 2, 2011, 4:02:35 PM

Jean-Pierre Coulon <cou...@cacas.pam.obs-nice.fr> writes:

> I noticed that the faulty character is encoded with C3 22 in
> hexadecimal.

This is illegal in UTF-8. Where does this come from? What does Notepad
show? If you save a file with Notepad as UTF-8 text, you should not see
such illegal sequences.

--
Change “LookInSig” to “tcalveu” to answer by mail.

Jean-Pierre Coulon

Nov 3, 2011, 12:35:47 AM

On Wed, 2 Nov 2011, Heiko Oberdiek wrote:

> Try to make a minimal example that shows the problem and
> inspect the problematic line with a hex editor.

Here is how it comes when I copy-paste it into my mailer:

\documentclass[12pt]{article}

\usepackage[utf8]{inputenc}
\begin{document}
représente sÃ"rement
\end{document}

The first accented character works, the second fails.

Heiko Oberdiek

Nov 3, 2011, 12:52:02 AM

Jean-Pierre Coulon <cou...@cacas.pam.obs-nice.fr> wrote:

> On Wed, 2 Nov 2011, Heiko Oberdiek wrote:
>
> > Try to make a minimal example that shows the problem and
> > inspect the problematic line with a hex editor.
>
> Here is how it comes when I copy-paste it into my mailer:
>
> \documentclass[12pt]{article}
> \usepackage[utf8]{inputenc}
> \begin{document}

> représente sÃ"rement

> \end{document}
>
> The first accented character works, the second fails.

As you have already said, the hex code of the second is 0xC3 0x22,
expected would be 0xC3 0xbb, thus the file is broken.

Some speculation: 0xBB is guillemotright (the right-pointing double angle
quotation mark) in latin1/ansinew. Perhaps some program in between
(mail, editor, ...), unaware of the right encoding, has converted these
quotation marks to straight ones?
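
The difference can be reproduced without depending on how the file is transported. This is only a sketch, using ^^ notation for the raw bytes; the second line is expected to stop with an inputenc error whose exact wording depends on the LaTeX version.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
% Intact: 0xC3 0xBB is the UTF-8 encoding of U+00FB.
s^^c3^^bbrement

% Damaged: 0xC3 0x22, i.e. the lead byte followed by a straight quote.
% inputenc is expected to report an error on this line.
s^^c3^^22rement
\end{document}
```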

--
Heiko Oberdiek

Peter Flynn

Nov 3, 2011, 5:17:51 PM

On 03/11/11 04:35, Jean-Pierre Coulon wrote:
> On Wed, 2 Nov 2011, Heiko Oberdiek wrote:
>
>> Try to make a minimal example that shows the problem and
>> inspect the problematic line with a hex editor.
>
> Here is how it comes when I copy-paste it into my mailer:
>
> \documentclass[12pt]{article}
> \usepackage[utf8]{inputenc}
> \begin{document}

> représente s�"rement
> \end{document}

>
> The first accented character works, the second fails.

That "character" is bogus, so it won't work (anywhere, AFAICS). If it
was done on a Windows machine, there may be a bug or "feature" which
substitutes a u-circumflex at that point, but it would be non-standard
and completely unprocessable anywhere else.

///Peter