
how to parse a binary stream in C++


DE

Feb 6, 2012, 8:55:07 PM
Hi,

I'm faced with this issue. Suppose I have a binary stream received
from network, in network format. And I know the format of the stream.
For example:
1st byte: a char
2nd and 3rd: an int (a short)
4th-7th: an int
8th-15th: a double

Here are my questions:
1) should I use signed or unsigned char[]? Those int's and double's
could be negative. I used (signed) char[]. Not sure if this is right.
2) how do I convert the stream to host format from network format? I
know I can use ntohl or ntohs functions. But my question is more
about should I call ntohl on the whole stream or should I call it only
on numerical values?
3) also a really basic question, is it safe for me to use ntohl on
double type?
4) how to efficiently parse the numerics? I wrote some function like
below for ints
int getInt(const char[]& buffer, const size_t& index, const size_t&
numberOfBytes)
{
int nResult = 0;
for( size_t i = 0; i < numberOfBytes; ++i )
nResult += buffer[i] * pow(2, numberOfBytes - 1 - i); // this assumes big endian
}
I believe there is a better way for it. Also, this method doesn't work
for doubles...

Many thanks for your suggestions.

Johann Klammer

Feb 7, 2012, 12:57:57 AM
Hello,

For two byte shorts and 4 byte ints...
Not sure if this works,
but it _does_ use ntohs...

int getInt(const char[]& buffer, const size_t& index, const size_t&
numberOfBytes)
{
uint16_t a;
uint32_t b;
if(numberOfBytes==2)
{
a=ntohs(*(reinterpret_cast<uint16_t*>(buffer)));
nResult=(int) reinterpret_cast<short>(a);
}
else
{
b=ntohl(*(reinterpret_cast<uint32_t*>(buffer)));
nResult=reinterpret_cast<int>(b);
}
return nResult;
}


Sorry, it's been some time since I last used C++,
JK

Joshua Maurice

Feb 7, 2012, 6:38:18 PM
This won't work in C++ nor C. You're running afoul of the strict
aliasing rule. You're accessing memory through a pointer of the
"wrong" type. gcc commonly "breaks" code exactly like this, and it's
allowed to because it has "undefined behavior" according to the C and
C++ standards. See:
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fstrict_002daliasing-849
http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html

In short, almost any use of reinterpret_cast is undefined behavior.
There are exceptions, such as
1- casting /to/ unsigned char* or char*, but not from unsigned char*
or char* to other pointer types,
2- casting to an integer of sufficient size and casting back to the
/exact/ same pointer type,
3- C's rule about reading the first common member from different
structs, which C++ has too IIRC,
and so on. However, I won't go into them in full detail because you
shouldn't be using them if you're learning them merely from this
post.

Also because there's some lack of consensus on the issues, the
standards are less than clear, and even I'm not sure if I know it all
correctly. I've made a couple of posts to comp.std.c++ about it and
various presumably unintended consequences of the current rules, but I
didn't get any replies. My comp.std.c discussion was a little more
lively, but still no resolution.
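
For readers who still want to use ntohl/ntohs, here is a minimal sketch of one way to do it without the pointer cast: copy the bytes into a real integer object with memcpy first, then convert. The helper names read_u32_net and read_u16_net are hypothetical, and <arpa/inet.h> is the POSIX header (Windows puts these functions in <winsock2.h>). Since memcpy works on bytes, the source pointer does not need any particular alignment either.

```cpp
#include <arpa/inet.h>  // ntohl, ntohs (POSIX)
#include <cstdint>
#include <cstring>      // std::memcpy

// Hypothetical helper: read a 32-bit value stored in network byte order.
// memcpy copies the bytes into a genuine uint32_t object, so the buffer
// itself is never accessed through a uint32_t lvalue.
std::uint32_t read_u32_net(const unsigned char* p)
{
    std::uint32_t raw;
    std::memcpy(&raw, p, sizeof raw);
    return ntohl(raw);
}

// Same idea for a 16-bit value.
std::uint16_t read_u16_net(const unsigned char* p)
{
    std::uint16_t raw;
    std::memcpy(&raw, p, sizeof raw);
    return ntohs(raw);
}
```

Compilers generally turn a fixed-size memcpy like this into a plain load, so this should not cost anything compared to the cast.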

Rainer Weikusat

Feb 8, 2012, 9:14:34 AM
Joshua Maurice <joshua...@gmail.com> writes:
> On Feb 6, 9:57 pm, Johann Klammer <klamm...@NOSPAM.a1.net> wrote:
>> Hello,
>>
>> For two byte shorts and 4 byte ints...
>> Not sure if this works,
>> but it _does_ use ntohs...
>>
>> int getInt(const char[]& buffer, const size_t& index, const size_t&
>> numberOfBytes)
>> {
>>    uint16_t a;
>>    uint32_t b;
>>    if(numberOfBytes==2)
>>    {
>>      a=ntohs(*(reinterpret_cast<uint16_t*>(buffer)));
>>      nResult=(int) reinterpret_cast<short>(a);
>>    }
>>    else
>>    {
>>      b=ntohl(*(reinterpret_cast<uint32_t*>(buffer)));
>>      nResult=reinterpret_cast<int>(b);
>>    }
>>    return nResult;
>>
>> }
>>
>> Sorry, it's been some time since I last used C++,
>> JK
>
> This won't work in C++ nor C.

This is not strictly compliant ISO-C. That's something rather
different than "it won't work".

> You're running afoul of the strict aliasing rule. You're accessing
> memory through a pointer of the "wrong" type. gcc commonly "breaks"
> code exactly like this, and it's allowed to because it has
> "undefined behavior" according to the C and C++ standards. See:

"It is not explicitly prohibited" and "it is allowed" are two very
much different things. For gcc, the behaviour can actually be selected
at compilation time with the -fno-strict-aliasing switch. This is
pretty much a requirement since people write stuff like operating
system kernels, malloc implementations, code using UNIX(*) memory
management interfaces and code for embedded use in C.

Ben Kaufman

Feb 8, 2012, 10:22:02 AM
Something to the effect of the below is a lot more efficient.

int getInt(const unsigned char * ptr)
{
    return (int) ((unsigned int) ptr[0] << 24 |
                  (unsigned int) ptr[1] << 16 |
                  (unsigned int) ptr[2] << 8 |
                  (unsigned int) ptr[3]);
}
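
The same shift-and-or idea can be written with fixed-width types so it does not depend on the size of int, and extended to the 2-byte case. This is only a sketch: read_i32_be and read_i16_be are hypothetical names, and the final unsigned-to-signed conversion assumes a two's complement machine (it is implementation-defined before C++20).

```cpp
#include <cstdint>

// Hypothetical helper: 4 big-endian bytes (most significant first)
// to a signed 32-bit value.
std::int32_t read_i32_be(const unsigned char* p)
{
    const std::uint32_t u = (static_cast<std::uint32_t>(p[0]) << 24) |
                            (static_cast<std::uint32_t>(p[1]) << 16) |
                            (static_cast<std::uint32_t>(p[2]) << 8)  |
                             static_cast<std::uint32_t>(p[3]);
    // Assumes two's complement for values above INT32_MAX.
    return static_cast<std::int32_t>(u);
}

// Hypothetical helper: 2 big-endian bytes to a signed 16-bit value.
std::int16_t read_i16_be(const unsigned char* p)
{
    const std::uint16_t u = static_cast<std::uint16_t>((p[0] << 8) | p[1]);
    return static_cast<std::int16_t>(u);
}
```

Using unsigned char for the buffer matters here: with plain (signed) char, a byte like 0xFE would be sign-extended before the shift and corrupt the result.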

Ben

Ian Collins

Feb 8, 2012, 2:01:07 PM
On 02/ 8/12 12:38 PM, Joshua Maurice wrote:
> On Feb 6, 9:57 pm, Johann Klammer<klamm...@NOSPAM.a1.net> wrote:
>> Hello,
>>
>> For two byte shorts and 4 byte ints...
>> Not sure if this works,
>> but it _does_ use ntohs...
>>
>> int getInt(const char[]& buffer, const size_t& index, const size_t&
>> numberOfBytes)
>> {
>> uint16_t a;
>> uint32_t b;
>> if(numberOfBytes==2)
>> {
>> a=ntohs(*(reinterpret_cast<uint16_t*>(buffer)));
>> nResult=(int) reinterpret_cast<short>(a);
>> }
>> else
>> {
>> b=ntohl(*(reinterpret_cast<uint32_t*>(buffer)));
>> nResult=reinterpret_cast<int>(b);
>> }
>> return nResult;
>>
>> }
>>
>> Sorry, it's been some time since I last used C++,
>> JK
>
> This won't work in C++ nor C. You're running afoul of the strict
> aliasing rule. You're accessing memory through a pointer of the
> "wrong" type. gcc commonly "breaks" code exactly like this, and it's
> allowed to because it has "undefined behavior" according to the C and
> C++ standards.

A more obvious problem is alignment. If buffer isn't correctly aligned
for a uint32_t or uint16_t, bad things will happen on some platforms.
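
In the layout from the original question, the short starts at byte offset 1 and the int at byte offset 3, so neither field is naturally aligned. A minimal sketch of a parser that sidesteps this by assembling each field one byte at a time follows. Record, read_be, parse_record and the field names are all hypothetical, and the double handling assumes both sides use IEEE 754 binary64.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical record type for the 15-byte layout in the original question.
struct Record {
    char         tag;           // byte 0
    std::int16_t short_field;   // bytes 1-2
    std::int32_t int_field;     // bytes 3-6
    double       double_field;  // bytes 7-14
};

// Assemble n big-endian bytes (n <= 8) into an unsigned integer.
// Reads one byte at a time, so p needs no particular alignment.
std::uint64_t read_be(const unsigned char* p, std::size_t n)
{
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < n; ++i)
        v = (v << 8) | p[i];
    return v;
}

// buf must point to at least 15 readable bytes.
Record parse_record(const unsigned char* buf)
{
    Record r;
    r.tag = static_cast<char>(buf[0]);
    // Narrow to the unsigned type first, then reinterpret as signed
    // (assumes two's complement).
    r.short_field = static_cast<std::int16_t>(
        static_cast<std::uint16_t>(read_be(buf + 1, 2)));
    r.int_field = static_cast<std::int32_t>(
        static_cast<std::uint32_t>(read_be(buf + 3, 4)));
    // Assumes the double is IEEE 754 binary64 on both sides.
    const std::uint64_t bits = read_be(buf + 7, 8);
    std::memcpy(&r.double_field, &bits, sizeof r.double_field);
    return r;
}
```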

--
Ian Collins

BGB

Feb 23, 2012, 12:19:53 AM
a few times I have wished there were types like "int32leu_t" and
"int32beu_t" and similar which would have the property of both having a
defined endianness and safely allowing unaligned access.

then, one could validly write things like:
int i;
void *p;
...
i=*(int32leu_t *)p;

rather than needing to manually read the value using bytes and shifts or
similar (and, also giving the compiler a little more chance to optimize it).

likewise, forms without the 'u' would assume aligned access, as normal.


but, sadly, one can't really get one's hopes up here...

Mark

Feb 23, 2012, 5:15:13 AM
On Mon, 6 Feb 2012 17:55:07 -0800 (PST), DE <oo...@hotmail.com> wrote:

>Hi,
>
>I'm faced with this issue. Suppose I have a binary stream received
>from network, in network format. And I know the format of the stream.
>For example:
>1st byte: a char
>2nd and 3rd: an int (a short)
>4th-7th: an int
>8th-15th: a double
>
>Here are my questions:
>1) should I use signed or unsigned char[]?

It depends on what data they contain.

>Those int's and double's
>could be negative. I used (signed) char[]. Not sure if this is right.

Huh?

>2) how do I convert the stream to host format from network format? I
>know I can use ntohl or ntohs functions. But my question is more
>about should I call ntohl on the whole stream or should I call it only
>on numerical values?

On numerical values only.

>3) also a really basic question, is it safe for me to use ntohl on
>double type?

NO! Look at the prototypes for these functions.
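
ntohl takes and returns a 32-bit unsigned integer, so passing a double would convert its value rather than reorder its bytes. A sketch of one alternative: collect the 8 bytes into a 64-bit integer, then copy that bit pattern into a double. The name read_f64_be is hypothetical, and this assumes the sender wrote an IEEE 754 binary64 value most significant byte first, that the host double is also binary64, and that the host stores doubles in the same byte order as its integers.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: decode 8 big-endian bytes as a double.
double read_f64_be(const unsigned char* p)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch expects a 64-bit double");
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | p[i];             // most significant byte first
    double value;
    std::memcpy(&value, &bits, sizeof value);  // copy the bit pattern, not the value
    return value;
}
```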

>4) how to efficiently parse the numerics? I wrote some function like
>below for ints
>int getInt(const char[]& buffer, const size_t& index, const size_t&
>numberOfBytes)
>{
> int nResult = 0;
> for( size_t i = 0; i < numberOfBytes; ++i )
> nResult += buffer[i] * pow(2, numberOfBytes - 1 - i); // this assumes big endian
>}

Ugh!

>I believe there is a better way for it. Also, this method doesn't work
>for doubles...

You could locate the correct location in the byte stream and then cast
it to the pointer of the correct type.

>Many thanks for your suggestions.

If you can change the format of the byte stream then just send the
data in a different format, like XML. Encode all numeric values as
strings. It's much easier to handle.
--
(\__/) M.
(='.'=) If a man stands in a forest and no woman is around
(")_(") is he still wrong?

A. W. Dunstan

Feb 23, 2012, 2:55:53 PM
If it's not too late, would XDR be useful? I've used that to read & write
network streams in years past, and I think it's meant for doing RPC kinds of
things.
--
Al Dunstan, Software Engineer
OptiMetrics, Inc.
3115 Professional Drive
Ann Arbor, MI 48104-5131

"There are two ways of constructing a software design. One way is to
make it so simple that there are obviously no deficiencies. And the
other way is to make it so complicated that there are no obvious
deficiencies."
- C. A. R. Hoare

Joshua Maurice

Feb 23, 2012, 8:21:42 PM
On Feb 23, 2:15 am, Mark <i...@dontgetlotsofspamanymore.invalid>
wrote:
> You could locate the correct location in the byte stream and then cast
> it to the pointer of the correct type.

As explained else-thread, if I understand your suggestion correctly,
this will lead to undefined behavior from violating the strict
aliasing rule, and potentially lead to undefined behavior from
violating alignment requirements. See Ben Kaufman's post for something
that works if you're willing to assume a certain int size.

If you know the serialization will be temporary, and only to this
computer, then you can "hack it" (without undefined behavior) with
something like:

#include <fstream>
using namespace std;

int main()
{
//read from file
ifstream fin("a", ios::binary);
char buf[sizeof(int)];
fin.read(buf, sizeof buf);
int x;
char* c2 = reinterpret_cast<char*>(&x);
for (size_t i=0; i<sizeof(int); ++i)
c2[i] = buf[i];

//write to file
ofstream fout("a", ios::binary);
int y = 2;
fout.write(reinterpret_cast<char*>(&y), sizeof y);
}

Remember, in C and C++ you are allowed to read from or write to any
POD object through a char lvalue or an unsigned char lvalue. Do not
think in terms of the allowed casting. Think instead of the type of
the expression (lvalue) used to access the object. You can access an
int object through a char lvalue, but you cannot access a char object
through an int lvalue. You should probably look up the actual rules -
they're somewhat arcane and involved, and I've left out a lot of
details. Offhand, I again suggest:
http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fstrict_002daliasing-849
http://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html
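
To make the lvalue rule concrete, here is a small sketch (int_to_bytes and bytes_to_int are hypothetical names). Both directions only ever touch the int object through unsigned char or through memcpy, which is the permitted direction. The bytes are in host order, so this is only good for the same-machine case.

```cpp
#include <cstddef>
#include <cstring>

// Allowed: read an int's object representation through unsigned char.
// out must have room for sizeof(int) bytes.
void int_to_bytes(int value, unsigned char* out)
{
    const unsigned char* src = reinterpret_cast<const unsigned char*>(&value);
    for (std::size_t i = 0; i < sizeof(int); ++i)
        out[i] = src[i];  // unsigned char lvalue reading an int object
}

// Allowed: rebuild the int by copying bytes into a real int object.
// in must point to sizeof(int) bytes produced on the same machine.
int bytes_to_int(const unsigned char* in)
{
    int value;
    std::memcpy(&value, in, sizeof value);
    return value;
}

// Not allowed (deliberately left as a comment):
//   int v = *reinterpret_cast<const int*>(in);
// This would access unsigned char objects through an int lvalue.
```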

If you know the serialization will be used between computers, and you
want to write portable code, then you need to use a portable
serialization scheme. See Boost's serialization library as a starting
place, and see Google as well.
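
For a sense of what the simplest portable scheme looks like for integers, here is a sketch that fixes the byte order on the wire by writing and reading individual bytes, so the result is the same on any host. The names write_u32_be and read_u32_be are hypothetical, and this is not how Boost's library is implemented. It covers only a 32-bit unsigned value.

```cpp
#include <cstdint>

// Hypothetical encoder: store v as 4 bytes, most significant first,
// regardless of the host's byte order. out must have room for 4 bytes.
void write_u32_be(std::uint32_t v, unsigned char* out)
{
    out[0] = static_cast<unsigned char>(v >> 24);
    out[1] = static_cast<unsigned char>(v >> 16);
    out[2] = static_cast<unsigned char>(v >> 8);
    out[3] = static_cast<unsigned char>(v);
}

// Matching decoder for the same 4-byte big-endian encoding.
std::uint32_t read_u32_be(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8)  |
            static_cast<std::uint32_t>(p[3]);
}
```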

Mark

Feb 24, 2012, 4:54:32 AM
On Thu, 23 Feb 2012 17:21:42 -0800 (PST), Joshua Maurice
<joshua...@gmail.com> wrote:

>On Feb 23, 2:15 am, Mark <i...@dontgetlotsofspamanymore.invalid>
>wrote:
>> You could locate the correct location in the byte stream and then cast
>> it to the pointer of the correct type.
>
>As explained else-thread, if I understand your suggestion correctly,
>this will lead to undefined behavior from violating the strict
>aliasing rule, and potentially lead to undefined behavior from
>violating alignment requirements. See Ben Kaufman's post for something
>that works if you're willing to assume a certain int size.

It can be done but I wasn't recommending it.

>If you know the serialization will be used between computers, and you
>want to write portable code, then you need to use a portable
>serialization scheme.

I would always recommend using portable code.