StreamWriter outfile = new StreamWriter(args[1], true, Encoding.UTF8);
StreamReader infile = new StreamReader(args[0], Encoding.UTF8);
string strTemp;
while ((strTemp = infile.ReadLine()) != null)
{
    // Do substring stuff to generate comma-delimited fields in output file
}
Here's the thing: this file will occasionally have tilde n's or umlaut
o's (ASCII 164 and ASCII 148 respectively). When I investigate strTemp
after it reads one of those lines, I find that it didn't read that
character at all. It's not that it changed it to a question mark or
anything, it's that the string length in strTemp is one less than it
should be.
I had been using StreamReader infile = File.OpenText(args[0]); to open
my input file but switched to the method above when I saw that I could
be more specific as to the encoding. I've tried changing it to
Unicode but get the exact same results.
Any help with this is appreciated: this goes beyond the success of this
little application I'm writing; it really has me mystified about the
framework! Am I going to have to use a binary reader for this?
Mike Lerch
Have you tried either of these?

//Western European (ISO)
Encoding.GetEncoding("iso-8859-1");
// OR
//Western European (Windows)
Encoding.GetEncoding( 1252 );
Chris A. R.
"Mike Lerch" <mlerchNOS...@nycap.rr.com> wrote in message
news:blll3vgbkd95tg155...@4ax.com...
>Have you tried either of these?
>
>//Western European (ISO)
>Encoding.GetEncoding("iso-8859-1");
>// OR
>//Western European (Windows)
>Encoding.GetEncoding( 1252 );
I just gave both of those a shot and neither worked (both got the same
results I described earlier, where they wouldn't read the tilde n
ASCII 164 ñ). So frustrating! The file doesn't have any characters
that aren't in the regular 256 "extended ASCII" or "ANSI" character
set that I'm used to seeing. Argh!
Mike Lerch
There is no such thing as ASCII 164 or ASCII 148, or "extended ASCII" -
ASCII is 7-bit, plain and simple.
You need to find out which encoding you *really* mean. A good first step
would be to find out which Unicode character you want the result to be -
see http://www.unicode.org to find that out.
--
Jon Skeet - <sk...@pobox.com>
http://www.pobox.com/~skeet/
If replying to the group, please do not mail me too
> I just gave both of those a shot and neither worked (both got the same
> results I described earlier, where they wouldn't read the tilde n
> ASCII 164 ñ). So frustrating! The file doesn't have any characters
> that aren't in the regular 256 "extended ASCII" or "ANSI" character
> set that I'm used to seeing. Argh!
Well, as others have noted, extended ASCII is not real. But FWIW, here is some background info on
164 (a.k.a. 0xa4) across various Windows code pages:
In any case, the one you REALLY want is code page 437 -- the Windows US OEM code page:
Code page 437:
  0xa4 (164)  U+00F1 (LATIN SMALL LETTER N WITH TILDE)
  0xa5 (165)  U+00D1 (LATIN CAPITAL LETTER N WITH TILDE)
So try GetEncoding(437) for the conversion.
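
For illustration, here is a minimal sketch of what decoding with code page 437 does to the two byte values mentioned earlier in the thread (164 and 148). The class and method names (Cp437Demo, DecodeCp437) are made up for this example. The RegisterProvider call is an assumption about the runtime: it is needed on .NET Core / .NET 5 and later, while on the classic .NET Framework the legacy code pages are available without it.

```csharp
using System;
using System.Text;

public static class Cp437Demo
{
    // Hypothetical helper: decodes raw bytes as OEM United States (code page 437).
    public static string DecodeCp437(byte[] raw)
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+, where legacy code pages
        // have to be registered before GetEncoding(437) can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(437).GetString(raw);
    }
}
```

With this in place, Cp437Demo.DecodeCp437(new byte[] { 164, 148 }) should give a two-character string containing U+00F1 (n with tilde) and U+00F6 (o with umlaut). That only helps if the file really was written in code page 437, which is worth confirming by looking at the raw bytes.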
--
MichKa
This posting is provided "AS IS" with
no warranties, and confers no rights.
>There is no such thing as ASCII 164 or ASCII 148, or "extended ASCII" -
>ASCII is 7-bit, plain and simple.
>
>You need to find out which encoding you *really* mean. A good first step
>would be to find out which Unicode character you want the result to be -
>see http://www.unicode.org to find that out.
Thanks, Jon, Chris and Michael. I took the advice above, first
looking at the file from a binary perspective to realize that the
tilde n wasn't being stored as 164, it was 241. A little searching on
the web showed that to indicate codepage 1252, one of Chris'
suggestions. It did work, contrary to my earlier post: I had changed
the GetEncoding call in the wrong spot.
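
A rough sketch of the "look at the bytes first" approach described above, in case it helps someone else: list every byte value of 0x80 or higher, then decode with the code page those values point to. ByteInspector, HighBytes and DecodeCp1252 are names invented for this example, and the RegisterProvider call again assumes .NET Core / .NET 5 or later (it is not needed on the classic .NET Framework).

```csharp
using System;
using System.Text;

public static class ByteInspector
{
    // Hypothetical helper: returns the hex values of all bytes >= 0x80,
    // separated by spaces, so non-ASCII bytes in the input stand out.
    public static string HighBytes(byte[] raw)
    {
        var sb = new StringBuilder();
        foreach (byte b in raw)
        {
            if (b >= 0x80)
            {
                if (sb.Length > 0) sb.Append(' ');
                sb.Append(b.ToString("X2")); // decimal 241 is shown as "F1"
            }
        }
        return sb.ToString();
    }

    // Hypothetical helper: decodes raw bytes as Windows-1252 (Western European).
    public static string DecodeCp1252(byte[] raw)
    {
        // Assumption: .NET Core 3.0+ / .NET 5+, where code page 1252 must be registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(raw);
    }
}
```

For a real file, the bytes could come from File.ReadAllBytes(args[0]). Once the code page is confirmed, passing Encoding.GetEncoding(1252) as the second argument to the StreamReader constructor (in place of Encoding.UTF8) is the change that matters.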
I'm still somewhat surprised by the behavior that the code exhibited:
instead of translating the unknown 241 character to a question mark or
space or whatever, it simply didn't include a character at all,
leaving the string one short of its expected length.
Thanks again for everyone's help.
Mike Lerch
That sounds very strange - are you sure you weren't using a multi-byte
encoding at the time (e.g. UTF-8) and it was deciding that the 241 was
just part of a multi-byte character?
(Having just done a test with Encoding.ASCII, I'm somewhat surprised to
see that it just appears to do a bitmask with 0x7f - i.e. it incorrectly
decodes values >= 0x80, rather than giving '?'. Admittedly the docs
don't state what should happen in this case, which is bad enough to
start with...)
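
One way to test the multi-byte theory is to ask the UTF-8 decoder to fail loudly instead of quietly coping with bad input. The sketch below is only an illustration: StrictUtf8 and IsValidUtf8 are made-up names, and it assumes .NET 2.0 or later, where the UTF8Encoding constructor with throwOnInvalidBytes set to true raises DecoderFallbackException on malformed data. A lone byte 241 (0xF1) is the lead byte of a four-byte UTF-8 sequence, so when it is followed by plain ASCII the data is not well-formed UTF-8.

```csharp
using System;
using System.Text;

public static class StrictUtf8
{
    // Hypothetical helper: returns true only if the bytes are well-formed UTF-8.
    public static bool IsValidUtf8(byte[] raw)
    {
        // No BOM on output, and throw on invalid input bytes instead of
        // dropping or replacing them.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(raw);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

If IsValidUtf8 returns false for the file's bytes, the file is not UTF-8 and a single-byte code page is the thing to try. What a non-strict decoder does with such bytes (dropping them, or substituting a replacement character) may differ between framework versions, so it is safer not to rely on it.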