Notepad displays garbage; Wordpad works fine.

Ryan Westafer

Jul 31, 2003, 3:30:00 PM

Every now and again a data collection application
generates a file that Notepad simply cannot display.
However Wordpad, command-line edit.exe, vi on UNIX, etc.
will open the file just fine.

Is this a bug in Notepad?
(The result was verified on multiple XP Professional
machines.)

You may find a sample file at the following address:

http://www.ihmc.us/NotepadFails.txt.zip
(it is simply a zipped text file)

-Ryan

Ken Wickes [MSFT]

Jul 31, 2003, 4:39:59 PM

Looks like Notepad interprets the file as Unicode. In fact if you put a
breakpoint on advapi32!IsTextUnicode it will return true. This can happen as
Notepad does have to guess whether it is ANSI or Unicode.

You could write the file as Unicode or UTF-8 and add a BOM. I'm not sure if
there is a BOM for ASCII, although a UTF-8 BOM would probably work.
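
For anyone generating such files from a script, here is a minimal sketch of the BOM idea in Python. The function name write_with_utf8_bom and the file name are made up for illustration. The only library behavior relied on is Python's "utf-8-sig" codec, which prepends the bytes EF BB BF when writing. Whether a given Notepad version then skips its guessing is an assumption, not something this code guarantees.

```python
import codecs


def write_with_utf8_bom(path: str, text: str) -> None:
    """Write text as UTF-8 with a leading BOM (EF BB BF).

    Hypothetical helper for illustration. newline="" keeps any CR/LF
    already present in the text exactly as it is.
    """
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)


if __name__ == "__main__":
    # Made-up sample data: tab-delimited numbers, CR/LF line endings.
    write_with_utf8_bom("sample_output.txt", "100\t200\t300\r\n")
    with open("sample_output.txt", "rb") as f:
        print(f.read().startswith(codecs.BOM_UTF8))
```

For plain ASCII data the bytes after the BOM are unchanged, so tools that ignore the BOM should still see the same text.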

--

Ken Wickes [MSFT]
This posting is provided "AS IS" with no warranties, and confers no rights.

"Ryan Westafer" <rwes...@hotmail.com> wrote in message
news:01ff01c3579a$21d98730$a501...@phx.gbl...

Alan J. McFarlane

Aug 1, 2003, 3:56:33 PM

Ken Wickes [MSFT] <ken...@online.microsoft.com> wrote:
> "Ryan Westafer" <rwes...@hotmail.com> wrote in message
> news:01ff01c3579a$21d98730$a501...@phx.gbl...

>> Every now and again a data collection application
>> generates a file that Notepad simply cannot display.
>> However Wordpad, command-line edit.exe, vi on UNIX, etc.
>> will open the file just fine.
>>

[cut]

> Looks like Notepad interprets the file as Unicode. In fact if you put
> a breakpoint on advapi32!IsTextUnicode it will return true. This can
> happen as Notepad does have to guess whether it is ANSI or Unicode.
>

I would guess that the presence of the TAB characters misleads the function
into thinking the file is Unicode 16-bit. As a workaround, insert a single
space character before the first TAB; this seems to cause Notepad to
correctly load it as 8-bit.
--
Alan J. McFarlane
http://homepage.ntlworld.com/alanjmcf/
Please follow-up in the newsgroup for the benefit of all.

Craig Kelly

Aug 1, 2003, 5:04:57 PM

"Alan J. McFarlane" <alan...@yahoo.com> wrote:
> Ken Wickes [MSFT] <ken...@online.microsoft.com> wrote:
> > "Ryan Westafer" <rwes...@hotmail.com> wrote in message
> > news:01ff01c3579a$21d98730$a501...@phx.gbl...
>
>>> Every now and again a data collection application
>>> generates a file that Notepad simply cannot display.
>>> However Wordpad, command-line edit.exe, vi on UNIX, etc.
>>> will open the file just fine.
>>>
>> [cut]
>> Looks like Notepad interprets the file as Unicode. In fact if you put
>> a breakpoint on advapi32!IsTextUnicode it will return true. This can
>> happen as Notepad does have to guess whether it is ANSI or Unicode.
>>
> I would guess that the presence of the TAB characters misleads the
> function into thinking the file is Unicode 16-bit. As a workaround,
> insert a single space character before the first TAB; this seems to
> cause Notepad to correctly load it as 8-bit.

I've never played with IsTextUnicode, so this was the perfect excuse :).

As Ken points out, it appears that the tabs are causing the file to look
like Unicode to the IS_TEXT_UNICODE_STATISTICS test (which MSDN points out
might give a false positive).

In addition to Ken's suggestion (which never even crossed my mind and sounds
like the simplest solution), you could:

- Not use Notepad (or write a replacement :)

- Output all your files as Unicode

- Check IsTextUnicode on your output files and convert them to Unicode if it
appears that Notepad will get confused.

- Use commas instead of tabs to delimit your records (I've verified that
this does work).
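
The "check IsTextUnicode on your output files" option could be sketched roughly as follows in Python via ctypes. This is Windows-only, and the helper name looks_unicode_to_windows is hypothetical. Which flags Notepad itself passes is not known from this thread, so the sketch asks only for the IS_TEXT_UNICODE_STATISTICS test; treat the result as a hint rather than a prediction of what Notepad will do.

```python
import ctypes
import sys

# Flag value as defined in the Windows SDK headers.
IS_TEXT_UNICODE_STATISTICS = 0x0002


def looks_unicode_to_windows(data: bytes):
    """Ask advapi32!IsTextUnicode whether data passes the statistics test.

    Returns True or False on Windows, and None on other platforms where
    the API does not exist. Hypothetical helper for illustration.
    """
    if sys.platform != "win32":
        return None
    if not data:
        return False
    advapi32 = ctypes.WinDLL("advapi32")
    advapi32.IsTextUnicode.argtypes = [
        ctypes.c_void_p,
        ctypes.c_int,
        ctypes.POINTER(ctypes.c_int),
    ]
    advapi32.IsTextUnicode.restype = ctypes.c_int
    # lpiResult is in/out: on input it selects the tests to run.
    flags = ctypes.c_int(IS_TEXT_UNICODE_STATISTICS)
    buf = ctypes.create_string_buffer(data, len(data))
    result = advapi32.IsTextUnicode(buf, len(data), ctypes.byref(flags))
    return bool(result)


if __name__ == "__main__":
    # Made-up sample data; the real file from the thread is not reproduced.
    print(looks_unicode_to_windows(b"100\t200\t300\r\n"))
```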

Craig

Ken Wickes [MSFT]

Aug 1, 2003, 6:36:19 PM

"Craig Kelly" <cnk...@worldnet.att.net> wrote in message
news:ZBAWa.80332$3o3.5...@bgtnsc05-news.ops.worldnet.att.net...

The TABs weren't my idea; that was the other poster. I don't really think
it's the tabs; they don't look like Unicode tabs. Also, I think adding the
extra char only works in this case because it makes the buffer an odd
length, which would be illegal in Unicode. Looking at the code, for this
file it's making a purely statistical judgement. I don't understand the
formula well enough to see why it breaks down in this case. I'm also
confused as to why the ASCII CR/LF doesn't convince it to call it ASCII.

Craig Kelly

Aug 1, 2003, 7:35:57 PM

"Ken Wickes [MSFT]" <ken...@online.microsoft.com> wrote:

[snip]

> The TABs weren't my idea; that was the other poster. I don't really think
> it's the tabs; they don't look like Unicode tabs. Also, I think adding the
> extra char only works in this case because it makes the buffer an odd
> length, which would be illegal in Unicode. Looking at the code, for this
> file it's making a purely statistical judgement. I don't understand the
> formula well enough to see why it breaks down in this case. I'm also
> confused as to why the ASCII CR/LF doesn't convince it to call it ASCII.

I am such a fool! My apologies to all for mis-attributing... it's
especially embarrassing since my mistake was only 11 words from my quote of
Alan's post.

Anyway, to clarify my experimentation with the file in question... After I
saw this in MSDN

<quote for IsTextUnicode>
For example, if lpBuffer points to the ASCII string 0x41, 0x0A, 0x0D, 0x1D
(A\n\r^Z), the string passes the IS_TEXT_UNICODE_STATISTICS test, though
failure would be preferable
</quote>

I looked at character frequencies in the file, and the only ASCII value that
occurred more often than a tab (0x09) was '0' (0x30). So... I replaced the
tab characters with commas (but I verified that just about any
alphanumeric character worked, too) and IsTextUnicode's statistical test
worked fine (and I could open the file in Notepad).
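
A frequency count like the one described can be sketched in a few lines of Python. The helper name most_common_bytes and the sample bytes are made up for illustration; the real file from the thread is not reproduced here.

```python
from collections import Counter


def most_common_bytes(data: bytes, n: int = 5):
    """Return the n most frequent byte values as (value, count) pairs."""
    return Counter(data).most_common(n)


if __name__ == "__main__":
    # Made-up tab-delimited sample with CR/LF line endings.
    sample = b"100\t200\t300\r\n400\t500\t600\r\n"
    for value, count in most_common_bytes(sample, 3):
        print(f"0x{value:02X}: {count}")
```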

However, as you point out, an odd number of bytes would work as well, so
perhaps a simple solution to the problem would be to add a trailing byte of
some kind on creation if the file has an even number of bytes.
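
As a rough sketch of that idea in Python: the helper name pad_to_odd_length and the choice of a newline as the filler byte are assumptions for illustration, not something tested against Notepad here.

```python
def pad_to_odd_length(data: bytes, filler: bytes = b"\n") -> bytes:
    """Append one filler byte when data has an even number of bytes.

    An odd byte count cannot be a whole number of 16-bit code units,
    which is the property the thread suggests relying on.
    """
    if len(filler) != 1:
        raise ValueError("filler must be exactly one byte")
    if len(data) % 2 == 0:
        return data + filler
    return data


if __name__ == "__main__":
    print(len(pad_to_odd_length(b"ab")))   # even input gains one byte
    print(len(pad_to_odd_length(b"abc")))  # odd input is left alone
```

Note that an empty input counts as even here and also gets a filler byte; callers who want empty files left empty would need to special-case that.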

Craig

Ken Wickes [MSFT]

Aug 1, 2003, 9:39:33 PM

"Craig Kelly" <cnk...@worldnet.att.net> wrote in message

news:xPCWa.80582$3o3.5...@bgtnsc05-news.ops.worldnet.att.net...

Putting an FF FF in the file seems to work too as it is illegal in Unicode.