
A Preferred or Better Way?


MikeCopeland

Sep 4, 2015, 6:14:49 PM
I am processing a file that's "named" .csv, but which isn't. It's a
file with values that are separated by tab characters...but there's a
lot of data that isn't data at all. For example, every character is
followed by a '\0' character, and there are non-ASCII characters at the
"front" of the data stream record.
I want to process this file as though it's a comma-separated file,
and I'm cleaning it up with the following code (which works - the data
line goes from 1289 characters to 389 characters):

    size_t ttt = 0;
    while(ttt < str.length()) // clean up line; replace Tabs w/commas
    {
        if((str.at(ttt) > 127) || (str.at(ttt) < 1)) str.erase(ttt, 1);
        else
        {
            if(str.at(ttt) == '\t') str.at(ttt++) = ',';
            else ttt++;
        }
    } // while

There may be other, better ways to do this. I assume that using an
iterator is one, but I don't quite know how to do so (when to advance
the iterator, etc.). Are there other, better ways? If so, can someone
show me how? TIA
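
For reference, the iterator-based form of this kind of cleanup is usually written with the erase-remove idiom, which sidesteps the "when to advance the iterator" question because the standard algorithm does the advancing. The following is only a minimal sketch: clean_line is a hypothetical name, and it assumes the goal is the same as the loop above (drop NUL bytes and bytes outside 1..127, then turn tabs into commas). It does not decode the file's real encoding.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (name not from the original post): drops NUL bytes
// and bytes outside 1..127, then replaces tabs with commas.
std::string clean_line(std::string str)
{
    // std::remove_if moves the kept characters to the front in one pass and
    // returns an iterator to the new logical end; erase() trims the tail.
    str.erase(std::remove_if(str.begin(), str.end(),
                             [](char c) {
                                 // Compare as unsigned so the test does not
                                 // depend on whether plain char is signed.
                                 const unsigned char u = static_cast<unsigned char>(c);
                                 return u == 0 || u > 127;
                             }),
              str.end());

    // Tabs become commas in a second linear pass.
    std::replace(str.begin(), str.end(), '\t', ',');
    return str;
}
```

Calling str = clean_line(str); would stand in for the while loop. Each pass is linear in the line length, whereas erasing one character at a time shifts the rest of the string on every erase.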


Paavo Helde

Sep 4, 2015, 6:41:30 PM
MikeCopeland <mrc...@cox.net> wrote in news:MPG.3053c5acd6186f3f9896b0@news.eternal-september.org:

> I am processing a file that's "named" .csv, but which isn't. It's a
> file with values that are separated by tab characters...but there's a
> lot of data that isn't data at all. For example, every character is
> followed by a '\0' character, and there are non-ASCII characters at the
> "front" of the data stream record.

If (almost) every second byte is zero and there is a BOM at the
beginning, then the file is most probably in UCS-2 or UTF-16 encoding.
On Windows you should use std::wifstream and std::wstring for
processing it. Your current approach corrupts any non-ASCII
characters in the file, apart from other problems.

If you have your file read into a std::wstring, you can translate tabs to
commas with a single line (std::replace is declared in <algorithm>):

std::replace(str.begin(), str.end(), L'\t', L',');
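
To make the encoding point concrete, here is a rough, platform-neutral sketch of the same idea. It uses std::u16string in place of std::wstring, and it assumes the bytes were read in binary mode and are little-endian UTF-16 with a leading FF FE byte order mark. That byte order is an assumption, since the original post does not show the actual bytes. The names decode_utf16le and tabs_to_commas are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: interpret raw bytes as UTF-16LE code units.
// Assumes little-endian byte order; a trailing odd byte is ignored.
std::u16string decode_utf16le(const std::string& bytes)
{
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned char lo = static_cast<unsigned char>(bytes[i]);
        const unsigned char hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    // Drop a leading byte order mark (U+FEFF) if one is present.
    if (!out.empty() && out.front() == static_cast<char16_t>(0xFEFF))
        out.erase(out.begin());
    return out;
}

// Same one-line replacement as above, on 16-bit code units.
std::u16string tabs_to_commas(std::u16string text)
{
    std::replace(text.begin(), text.end(), u'\t', u',');
    return text;
}
```

Non-ASCII characters survive this as ordinary code units instead of being erased. Surrogate pairs are passed through untouched, which is enough for swapping tabs for commas.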



hth
Paavo

