
Reading a C struct in java


Mark

Sep 29, 2009, 4:46:48 AM
Hi,
I am writing an app in Java which reads data from a socket from a C
language program. The data is encoded as a C struct of char arrays,
something like this:

typedef struct {
char type[1];
char length[6];
char acknowledge[1];
char duplicate[1];
...
} type_t;

How can I decode a structure like this in Java without using JNI (a
requirement)?
--
(\__/) M.
(='.'=) Due to the amount of spam posted via googlegroups and
(")_(") their inaction to the problem. I am blocking most articles
posted from there. If you wish your postings to be seen by
everyone you will need use a different method of posting.
[Reply-to address valid until it is spammed.]

Lothar Kimmeringer

Sep 29, 2009, 5:33:29 AM
Mark wrote:

> I am writing an app in java which reads data from a socket from a C
> language program. The data is encoded as a C struct of char arrays
> something like this;
>
> typedef struct {
> char type[1];
> char length[6];
> char acknowledge[1];
> char duplicate[1];
> ...
> } type_t;
>
> How can I decode a structure like this in Java without using JNI (a
> requirement)?

Do you need a parser that creates Java classes, or do you simply
want to read in the data? If the latter, you have to find out
the endianness of the system if you read data consisting
of more than one byte (e.g. an int), and how many bytes an int
has (this varies with the processor architecture).
The rest can be done by simply reading from the stream.

The above struct can be read the following way:

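// "8859_1" (ISO-8859-1) maps every byte to exactly one char
InputStreamReader is = new InputStreamReader(socket.getInputStream(),
    "8859_1");
int type = is.read();
char[] length = new char[6];
// read(char[]) may return fewer chars than requested, so loop until full
for (int n = 0; n < length.length; ) {
    int r = is.read(length, n, length.length - n);
    if (r < 0) throw new EOFException("stream ended inside length field");
    n += r;
}
int acknowledge = is.read();
int duplicate = is.read();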

This assumes that you only have chars in that struct (i.e.
unsigned bytes). If you have mixed types you should use an
InputStream and create a String out of the byte array:

DataInputStream is = new DataInputStream(socket.getInputStream());
int type = is.read();
byte[] buf = new byte[6];
// readFully blocks until all 6 bytes arrive; read(buf) may return fewer
is.readFully(buf);
String length = new String(buf, 0, 6, "8859_1");
int acknowledge = is.read();
int duplicate = is.read();
...

If you need a parser you should check out Google. There are a
couple of results but I didn't take the time to check them in
detail, so I can't suggest anything.


Regards, Lothar
--
Lothar Kimmeringer E-Mail: spam...@kimmeringer.de
PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

Always remember: The answer is forty-two, there can only be wrong
questions!

Alessio Stalla

Sep 29, 2009, 6:34:45 AM
On Sep 29, 11:33 am, Lothar Kimmeringer <news200...@kimmeringer.de>
wrote:

Why use a Reader? A C struct is a pack of bytes, not characters. Also,
Java chars are not ASCII chars and are thus not represented as a
single byte, so if the struct used "char" to mean "byte", you might
want to use a byte in Java, with the caveat that Java bytes are always
signed.
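
To make the signedness caveat concrete, here is a minimal sketch. The class and method names are made up for illustration; the only real point is that masking with 0xFF recovers the value a C program would see in an unsigned char.

```java
// Hypothetical helper, not part of any library.
final class UnsignedBytes {
    // A Java byte ranges from -128 to 127; masking widens it to 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }
}
```

For plain ASCII text the distinction never shows up, since every ASCII code is below 128.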

> this assumes that you only have chars in that struct (i.e.
> unsigned bytes). If you have mixed types you should use an
> InputStream and create String out of the byte-array:

Again, why create a String if you are dealing with an array of bytes?
If you deal with C strings, you have to take into account the \0
string terminator, but this is not the case of the OP, as far as I
understand.

> InputStream is = socket.getInputStream();
> int type = is.read();
> byte[] buf = new byte[6];
> is.read(buf);
> String length = new String(buf, 0, 6, "8859_1");
> int acknowledge = is.read();
> int duplicate = is.read();
> ...
>
> If you need a parser you should check out Google. There are a
> couple of results but I didn't take the time to check them in
> detail, so I can't suggest anything.
>
> Regards, Lothar
> --

> Lothar Kimmeringer                E-Mail: spamf...@kimmeringer.de

Alessio Stalla

Sep 29, 2009, 6:43:13 AM
On Sep 29, 10:46 am, Mark <i...@dontgetlotsofspamanymore.invalid>
wrote:

> Hi,
> I am writing an app in java which reads data from a socket from a C
> language program.  The data is encoded as a C struct of char arrays
> something like this;
>
> typedef struct {
>     char type[1];
>     char length[6];
>     char acknowledge[1];
>     char duplicate[1];
>     ...
>
> } type_t;
>
> How can I decode a structure like this in Java without using JNI (a
> requirement)?

Take a look at java.io.DataInputStream. You can use that to build your
own parser. I'd advise replicating the struct in Java as a class and
using the parser to populate an instance of that class, so you properly
abstract over the details of the binary format of the struct.
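
A rough sketch of that approach follows. The class name TypeT and its read method are invented for illustration, and the code assumes single-byte text and no padding between the four fields shown in the question; the elided fields are not handled.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical Java counterpart of the four fields shown in type_t.
final class TypeT {
    final char type;
    final String length;
    final char acknowledge;
    final char duplicate;

    private TypeT(char type, String length, char acknowledge, char duplicate) {
        this.type = type;
        this.length = length;
        this.acknowledge = acknowledge;
        this.duplicate = duplicate;
    }

    // Assumes single-byte text and no padding between the fields.
    static TypeT read(InputStream source) throws IOException {
        DataInputStream in = new DataInputStream(source);
        char type = (char) in.readUnsignedByte();
        byte[] length = new byte[6];
        in.readFully(length); // blocks until all six bytes have arrived
        char acknowledge = (char) in.readUnsignedByte();
        char duplicate = (char) in.readUnsignedByte();
        return new TypeT(type, new String(length, StandardCharsets.ISO_8859_1),
                acknowledge, duplicate);
    }
}
```

Callers then work with the fields of TypeT and never touch offsets or raw bytes.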

Alessio

Mark

Sep 29, 2009, 7:10:14 AM
On Tue, 29 Sep 2009 11:33:29 +0200, Lothar Kimmeringer
<news2...@kimmeringer.de> wrote:

>Mark wrote:
>
>> I am writing an app in java which reads data from a socket from a C
>> language program. The data is encoded as a C struct of char arrays
>> something like this;
>>
>> typedef struct {
>> char type[1];
>> char length[6];
>> char acknowledge[1];
>> char duplicate[1];
>> ...
>> } type_t;
>>
>> How can I decode a structure like this in Java without using JNI (a
>> requirement)?
>
>Do you need a parser creating Java-classes or do you simply
>want to read in the data?

I don't want a parser. The exact structure is known at compile time.

>If the latter you have to find out
>the endianess of the system if you have to read in data consisting
>of more than one byte (e.g. int) and how many bytes an int
>has (varies in dependence of the processor architecture).

Endian-ness is not an issue. All the fields are char arrays.

>The rest can be done with simply reading in from the stream

I don't have access to the stream. The data arrives in my code
already in a byte array.

I guess I want to know if there is an easy way to map the C structure
to a java class or whether I will have to dissect the byte array
completely manually.

Lew

Sep 29, 2009, 7:33:21 AM

You have to control how the C struct is written. Otherwise things like the
padding between struct elements can vary and trip you up.

--
Lew

bugbear

Sep 29, 2009, 7:43:53 AM
Mark wrote:
>
> I don't want a parser. The exact structure is known at compile time.

Yes. Some parsers read fixed formats. No contradiction there.

BugBear

Mark

Sep 29, 2009, 8:37:26 AM

This is already done. I just need to know how to get at the data and
make it useable in Java.

I am an experienced programmer in C/C++ but am new to Java BTW.

bugbear

Sep 29, 2009, 8:46:32 AM
Mark wrote:

>
> This is already done. I just need to know how to get at the data and
> make it useable in Java.
>
> I am an experienced programmer in C/C++ but am new to Java BTW.

Just to be explicit, I guess the thing "in your head" is the C
idiom read(stream, &mystruct, sizeof(mystruct))

You definitely can't do this in Java, and (IMHO) you
shouldn't do it in C.

I've been burnt too many times by alignment, padding and byte
order. And when your boss SWEARS your code will have a short
life and never be ported, (s)he's lying.

BugBear

Mark

Sep 29, 2009, 10:04:20 AM
On Tue, 29 Sep 2009 13:46:32 +0100, bugbear
<bugbear@trim_papermule.co.uk_trim> wrote:

>Mark wrote:
>
>>
>> This is already done. I just need to know how to get at the data and
>> make it useable in Java.
>>
>> I am an experienced programmer in C/C++ but am new to Java BTW.
>
>Just to be explicit, I guess the thing "in your head" is the C
>idiom read(stream, &mystruct, sizeof(mystruct)
>
>You definitely can't do this in Java, and (IMHO) you
>shouldn't do it in C.

I know you can't do this in Java, hence my question. I'm beginning to
figure it out, but I am not yet familiar with all the Java classes ;-)

>I've been burnt too many times by alignment, padding and byte
>order. And when your boss SWEARS your code will have a short
>life and never be ported, (s)he's lying.

All the C/C++ compilers we use have pragmas to force known
alignment/padding etc. Byte order is not an issue in this layer,
since all parameters are encoded in strings.

Kenneth P. Turvey

Sep 29, 2009, 11:10:36 AM
On Tue, 29 Sep 2009 12:10:14 +0100, Mark wrote:

> I don't have access to the stream. The data arrives to my code in
> already in a byte array.

How does it arrive in your code? Is this the result of a native call?

--
Kenneth P. Turvey <evot...@gmail.com>

Patricia Shanahan

Sep 29, 2009, 11:18:48 AM
Mark wrote:
> Hi,
> I am writing an app in java which reads data from a socket from a C
> language program. The data is encoded as a C struct of char arrays
> something like this;
>
> typedef struct {
> char type[1];
> char length[6];
> char acknowledge[1];
> char duplicate[1];
> ...
> } type_t;
>
> How can I decode a structure like this in Java without using JNI (a
> requirement)?

I expect this to map most directly to four byte[] arrays in Java.

Here's an idea for a method that might be useful:

***** DANGER! DANGER! UNTESTED CODE AHEAD *****

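// needs java.io.DataInputStream, java.io.InputStream, java.io.IOException
public byte[][] getRaw(InputStream in, int... sizes) throws IOException {
    // readFully blocks until each field is complete; a bare
    // InputStream.read(byte[]) may return fewer bytes than requested.
    DataInputStream data = new DataInputStream(in);
    byte[][] result = new byte[sizes.length][];
    for (int i = 0; i < sizes.length; i++) {
        result[i] = new byte[sizes[i]];
        data.readFully(result[i]);
    }
    return result;
}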

If I've got this right, the result of calling:

getRaw(in, 1, 6, 1, 1)

should be a four element array of byte[], with each element containing
the raw data from one of the fields. That could be used, for example, as
a constructor parameter for a class that represents type_t in Java, with
each field converted to a more useful type. For example, if duplicate
is logically a boolean, the class would have a boolean field,
initialized according to how the C char represents booleans.
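
Such a class might look like the sketch below. Everything in it is an assumption made for illustration: the class name, the convention that 'Y' means true, and the idea that the length field holds ASCII decimal digits. None of that is stated in the question.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class built from the four raw fields returned by getRaw.
final class TypeTFields {
    final char type;
    final int length;
    final boolean acknowledge;
    final boolean duplicate;

    TypeTFields(byte[][] raw) {
        type = (char) (raw[0][0] & 0xFF);
        // Assumption: the length field is ASCII decimal digits, e.g. "000021".
        length = Integer.parseInt(new String(raw[1], StandardCharsets.US_ASCII).trim());
        // Assumption: the C side writes 'Y' for true and anything else for false.
        acknowledge = raw[2][0] == 'Y';
        duplicate = raw[3][0] == 'Y';
    }
}
```

The rest of the program then sees an int and two booleans rather than char arrays.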

Patricia

Eric Sosman

Sep 29, 2009, 11:27:06 AM

What you "have" to do is decide what kind of a Java object (or
set of objects) you want to create from this bag of bytes. Then do
whatever's needed to create the object(s). Stop thinking about
representation, and start thinking about information, about values.

Keep in mind that Java has nothing precisely analogous to a C
struct. Also, keep in mind that Java's `char' is not C's `char',
unless you happen to be using one of those rare C implementations
that uses a 16-bit `char' with Unicode encoding. Finally, keep
in mind that Java's String is not C's string (I can't tell from
your description whether any of those C arrays are supposed to
hold C strings, but if they do, they're not Java Strings).

--
Eric....@sun.com

Alessio Stalla

Sep 29, 2009, 11:29:53 AM
On Sep 29, 1:10 pm, Mark <i...@dontgetlotsofspamanymore.invalid>
wrote:

> On Tue, 29 Sep 2009 11:33:29 +0200, Lothar Kimmeringer
>
>
>
> <news200...@kimmeringer.de> wrote:
> >Mark wrote:
>
> >> I am writing an app in java which reads data from a socket from a C
> >> language program.  The data is encoded as a C struct of char arrays
> >> something like this;
>
> >> typedef struct {
> >>     char type[1];
> >>     char length[6];
> >>     char acknowledge[1];
> >>     char duplicate[1];
> >>     ...
> >> } type_t;
>
> >> How can I decode a structure like this in Java without using JNI (a
> >> requirement)?
>
> >Do you need a parser creating Java-classes or do you simply
> >want to read in the data?
>
> I don't want a parser.  The exact structure is known at compile time.

...by the C compiler, and by you. However since a C struct is really
an unstructured byte array, you need a parser to extract its parts, or
if you prefer a Java class with methods like

public char getType() {
    return (char) myStructArray[0];
}

public void setType(char c) {
    myStructArray[0] = (byte) c;
}

> >If the latter you have to find out
> >the endianess of the system if you have to read in data consisting
> >of more than one byte (e.g. int) and how many bytes an int
> >has (varies in dependence of the processor architecture).
>
> Endian-ness is not an issue.  All the fields are char arrays.  
>
> >The rest can be done with simply reading in from the stream
>
> I don't have access to the stream.  The data arrives to my code in
> already in a byte array.

You can create a stream that reads from a byte array:
java.io.ByteArrayInputStream.
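
For instance, a small sketch that splits a received array into fields of known widths. The class and method names are hypothetical, and the widths are whatever the struct declaration says.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical helper: cuts a byte array into consecutive fixed-width fields.
final class ByteArrayFields {
    static byte[][] split(byte[] raw, int... widths) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
        byte[][] fields = new byte[widths.length][];
        for (int i = 0; i < widths.length; i++) {
            fields[i] = new byte[widths[i]];
            in.readFully(fields[i]); // throws EOFException if raw is too short
        }
        return fields;
    }
}
```

Calling split(data, 1, 6, 1, 1) yields the four fields of the example struct, and a short array fails loudly instead of producing half-filled fields.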

> I guess I want to know if there is an easy way to map the C structure
> to a java class or whether I will have to dissect the byte array
> completely manually.

The second one you said. In Java, you have no way to know how classes
are laid out in memory, and you cannot take a byte array and interpret
it as a Java object like C does.

What you could do in principle - if you had to map a huge number of C
structs - would be to write a translator that, statically, read the
struct declarations from C code and automatically generated Java
classes with methods to parse the structures and access their fields.
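
The first step of such a translator could be as small as the sketch below. It is only an illustration under strong assumptions: it recognises nothing but `char name[N];` members, assumes the struct is packed with no padding, and the class name StructLayout is invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical first stage of a generator: derive field offsets from C source.
final class StructLayout {
    // Matches members of the form: char name[N];
    private static final Pattern FIELD =
            Pattern.compile("char\\s+(\\w+)\\s*\\[\\s*(\\d+)\\s*\\]\\s*;");

    // Returns field name -> {offset, width}, in declaration order.
    static Map<String, int[]> offsets(String structSource) {
        Map<String, int[]> layout = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(structSource);
        int offset = 0;
        while (m.find()) {
            int width = Integer.parseInt(m.group(2));
            layout.put(m.group(1), new int[] {offset, width});
            offset += width; // packed layout assumed: no padding
        }
        return layout;
    }
}
```

A real generator would go on to emit Java source from this table; for a handful of structs, writing the classes by hand is probably less work.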

Alessio

Roedy Green

Sep 29, 2009, 11:53:50 AM
On Tue, 29 Sep 2009 11:33:29 +0200, Lothar Kimmeringer
<news2...@kimmeringer.de> wrote, quoted or indirectly quoted someone
who said :

>
>InputStreamReader is = new InputStreamReader(socket.getInputStream(),
> "8859_1");
>int type = is.read();
>char[] length = new char[6];
>is.read(length);
>int acknowledge = is.read();
>int duplicate = is.read();

and if it is binary data, create a great string of
DataInputStream.readInt calls or LEDataInputStream.readInt calls (for
little endian). You must skip any padding or alignment bytes C
inserted.
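
As an illustration only, here is the same idea using java.nio.ByteBuffer from the standard library instead of LEDataInputStream. The record layout is hypothetical: a one-byte tag, three padding bytes, then a 32-bit little-endian int.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical layout: struct { char tag; /* 3 padding bytes */ int count; }
final class BinaryRecord {
    static int readCount(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        buf.get();          // consume the tag byte
        buf.position(4);    // skip the three padding bytes the C compiler inserted
        return buf.getInt();
    }
}
```

The padding offset has to come from knowledge of the C compiler's layout; nothing on the Java side can discover it.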

Ugly? You betcha, as the lady from Alaska would say.

You might write a program that parsed the C source code for the struct
and generated the Java source.

see http://mindprod.com/project/readc.html


--
Roedy Green Canadian Mind Products
http://mindprod.com

On two occasions I have been asked [by members of Parliament], "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
~ Charles Babbage (born: 1791-12-26 died: 1871-10-18 at age: 79)

Roedy Green

Sep 29, 2009, 11:59:06 AM
On Tue, 29 Sep 2009 12:10:14 +0100, Mark
<i...@dontgetlotsofspamanymore.invalid> wrote, quoted or indirectly
quoted someone who said :

>I don't have access to the stream. The data arrives to my code in


>already in a byte array.

You presumably have a map of where various 8-bit strings start and end
in that byte[]. Your job is to convert each 8-bit string to a Java 16-bit
string with whatever encoding the C program that wrote them originally
used.

see http://mindprod.com/jgloss/conversion.html#BYTETOSTRING
http://mindprod.com/jgloss/encoding.html
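
A minimal sketch of that conversion, with a made-up class name. The charset is whatever the C program used; ISO-8859-1 in the usage note is only an example.

```java
import java.nio.charset.Charset;

// Hypothetical helper: decodes one fixed-width field of a record.
final class FieldDecoder {
    static String decode(byte[] record, int offset, int width, Charset charset) {
        // String(byte[], int, int, Charset) decodes only the given slice.
        return new String(record, offset, width, charset);
    }
}
```

For the example struct, decode(record, 1, 6, StandardCharsets.ISO_8859_1) would return the six-character length field.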

Roedy Green

Sep 29, 2009, 12:02:31 PM
On Tue, 29 Sep 2009 12:10:14 +0100, Mark
<i...@dontgetlotsofspamanymore.invalid> wrote, quoted or indirectly
quoted someone who said :

>>Do you need a parser creating Java-classes or do you simply


>>want to read in the data?
>
>I don't want a parser. The exact structure is known at compile time.

You still might want a parser, if there are great reams of this crud,
or if you think you may get another such problem handed to you next
month. The parser does not scan the data but the source code for the
Struct.
see http://mindprod.com/project/readc.html


It could be a klutzy "parser" where the data is massaged in some way
to make it easy to read, e.g. converted to CSV.

Lothar Kimmeringer

Sep 29, 2009, 1:05:59 PM
Alessio Stalla wrote:

>> InputStreamReader is = new InputStreamReader(socket.getInputStream(),
>>     "8859_1");
>> int type = is.read();
>> char[] length = new char[6];
>> is.read(length);
>> int acknowledge = is.read();
>> int duplicate = is.read();
>
> Why use a Reader? A C struct is a pack of bytes, not characters. Also,
> Java chars are not ASCII chars and are thus not represented as a
> single byte, so if the struct used "char" to mean "byte", you might
> want to use a byte in Java, with the caveat that Java bytes are always
> signed.

I was assuming that the data in the struct is actually text. So
using a Reader allows you to create a String just by passing
a char-array instead of doing a conversion from a byte-array
to String by specifying the encoding over and over again.

>> this assumes that you only have chars in that struct (i.e.
>> unsigned bytes). If you have mixed types you should use an
>> InputStream and create String out of the byte-array:
>
> Again, why create a String if you are dealing with an array of bytes?

It has been some years since I programmed in C, but char arrays
(as used in the example struct) are in general used to
represent (printable) text. So why _not_ use Strings if you
have to deal with text?


Regards, Lothar
--
Lothar Kimmeringer E-Mail: spam...@kimmeringer.de

Roedy Green

Sep 29, 2009, 4:33:41 PM
On Tue, 29 Sep 2009 08:59:06 -0700, Roedy Green
<see_w...@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>You presumably have a map of where various 8-bit strings start and end


>in that byte[]. Your job is convert each 8-bit string to a java 16-bit
>string with whatever encoding the C program that wrote them originally
>used.

In addition to the fixed-size allocation marking the boundaries,
there is a terminating null byte to mark the end of the string. You
will have to scan for that to find the actual length of the string in
bytes.
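
Scanning for the terminator could look like this sketch. The class and method names are invented, and a field that contains no NUL is treated as filling its whole width.

```java
// Hypothetical helper for C strings stored in fixed-width fields.
final class CStrings {
    // Returns the number of bytes before the first NUL in the field,
    // or the full width if the field contains no NUL.
    static int cStringLength(byte[] record, int offset, int width) {
        for (int i = 0; i < width; i++) {
            if (record[offset + i] == 0) {
                return i;
            }
        }
        return width;
    }
}
```

The returned length can then be passed to a String constructor together with the field's offset.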

It is conceivable, though unlikely, that there are no such null markers
and you have no multi-byte encodings. Then you could decode the entire
byte[] struct to a String in one fell swoop, then use String.substring
to pick out the fields. You are safer to extract the fields at the
byte level. Then the code will work even if there are multi-byte
encodings and nulls.
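
The one-pass variant could be sketched as follows, assuming ISO-8859-1 (exactly one byte per char), no NUL terminators, and the 1/6/1/1 field widths from the original question. The class name is made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical one-pass decoder; only valid for single-byte encodings.
final class WholeRecord {
    static String[] fields(byte[] raw) {
        // With ISO-8859-1, char index i in the String corresponds to byte i.
        String all = new String(raw, StandardCharsets.ISO_8859_1);
        return new String[] {
            all.substring(0, 1),  // type
            all.substring(1, 7),  // length
            all.substring(7, 8),  // acknowledge
            all.substring(8, 9)   // duplicate
        };
    }
}
```

With a multi-byte encoding such as UTF-8 the char indexes would no longer line up with the byte offsets, which is why the byte-level approach is the safer default.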

Roedy Green

Sep 29, 2009, 4:38:33 PM
On Tue, 29 Sep 2009 08:18:48 -0700, Patricia Shanahan <pa...@acm.org>

wrote, quoted or indirectly quoted someone who said :

>should be a four element array of byte[], with each element containing


>the raw data from one of the fields. That could be used, for example, as
>a constructor parameter for a class that represents type_t in Java, with
>each fields converted to a more useful type. For example, if duplicate
>is logically a boolean, the class would have a boolean field,
>initialized according to how the C char represents booleans.

This is what DataInputStream and LEDataInputStream do for a living.

Lew

Sep 29, 2009, 9:17:01 PM
Mark wrote:
> All the C/C++ compilers we use have pragmas to force known
> alignment/padding etc. Byte order is not an issue in this layer,
> since all parameters are encoded in strings.

And you think that makes you safe. Chuckle.

If you succeed in writing a reader that counts on those pragma settings, you
can count on someone sending it data that doesn't conform to them at some
point in the probably-not-distant future. That stakeholder will then complain
loud and long for an "enhancement" to deal with their data, and likely will
refuse to use the "old" pragmas just to be compatible with the reader.

Sometimes the overhead of writing the reader is low enough that you don't mind
that scenario. You just write a new reader for the different settings and
tell them to use that one instead.

If you then encounter a third format, you might begin to lament the early
decision to rely on one particular specific set of pragma settings.

--
Lew

Arne Vajhøj

Sep 29, 2009, 9:27:55 PM

I am not convinced.

I would not call fread a parser just because it can read
a struct.

http://en.wikipedia.org/wiki/Parser says:

<quote>
In computer science and linguistics, parsing, or, more formally,
syntactic analysis, is the process of analyzing a text, made of a
sequence of tokens (for example, words), to determine its grammatical
structure with respect to a given (more or less) formal grammar.

Parsing is also an earlier term for the diagramming of sentences of
natural languages, and is still used for the diagramming of inflected
languages, such as the Romance languages or Latin. The term parsing
comes from Latin pars (ōrātiōnis), meaning part (of speech).[1][2]</quote>

But it is obviously more terminology than software.

Arne

Arne Vajhøj

Sep 29, 2009, 9:30:41 PM
Mark wrote:
> On Tue, 29 Sep 2009 13:46:32 +0100, bugbear
> <bugbear@trim_papermule.co.uk_trim> wrote:
>> Mark wrote:
>>> This is already done. I just need to know how to get at the data and
>>> make it useable in Java.
>>>
>>> I am an experienced programmer in C/C++ but am new to Java BTW.
>> Just to be explicit, I guess the thing "in your head" is the C
>> idiom read(stream, &mystruct, sizeof(mystruct)
>>
>> You definitely can't do this in Java, and (IMHO) you
>> shouldn't do it in C.
>
> I know you can't do this in Java, hence my question. I'm beginning to
> figure it out, but I am not yet familiar with all the Java classes ;-)

ByteArrayInputStream and DataInputStream would be my suggestion for
builtin Java functionality (*).

Combined with simple code to convert endianness and the
String(byte[], charset) constructor to convert bytes to String.
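
For the endianness part, a minimal sketch: DataInputStream.readInt always reads big-endian, so a little-endian value can be recovered with Integer.reverseBytes. The class name is invented for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical helper for reading ints written by a little-endian C program.
final class Endian {
    static int readLittleEndianInt(DataInputStream in) throws IOException {
        // readInt assembles the four bytes big-endian; swapping them
        // gives the little-endian interpretation.
        return Integer.reverseBytes(in.readInt());
    }
}
```

None of this is needed when every field is a char array, as in the struct under discussion.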


Arne

*) I do have an alternative that I will get back to later.

Arne Vajhøj

Sep 29, 2009, 9:32:44 PM

It makes him as safe as any other solution.

There is a spec. You have some code that follows that spec.

If someone is not following the spec and is able to force you to
support their alternative spec, then you need to write some new code.

It does not matter if it is C, Java or Cobol.

Arne

Eric Sosman

Sep 29, 2009, 10:37:25 PM
Mark wrote:
> On Tue, 29 Sep 2009 13:46:32 +0100, bugbear
> <bugbear@trim_papermule.co.uk_trim> wrote:
>
>> Mark wrote:
>>
>>> This is already done. I just need to know how to get at the data and
>>> make it useable in Java.
>>>
>>> I am an experienced programmer in C/C++ but am new to Java BTW.
>> Just to be explicit, I guess the thing "in your head" is the C
>> idiom read(stream, &mystruct, sizeof(mystruct)
>>
>> You definitely can't do this in Java, and (IMHO) you
>> shouldn't do it in C.
>
> I know you can't do this in Java, hence my question. I'm beginning to
> figure it out, but I am not yet familiar with all the Java classes ;-)

It's not a matter of memorizing a whole bucketload of Java
classes, it's about realizing what a Java class *is*: A Java class
is a collection of variables and methods and constructors. Period.

There are a few subtleties in the above that might not jump
out at you right away. One is the word "collection:" the code
and the data are just "collected," not "organized" or "ordered"
or "arranged" in any way. There is no notion of one variable
being "before" or "after" or "adjacent to" another; they're just
somewhere in the collection -- so you can forget about any notion
you might have had about "mapping" a bunch of raw bytes into a
Java object instance.

The same applies even to the individual data items themselves:
You know that a Java `int' is 32 bits wide, and that a Java `byte'
is eight bits wide. It does *not* follow that an `int' consists of
four `byte's -- that's true in C (often), but not in Java. As far
as Java is concerned, the `int' is an indivisible 32-bit atomic
value; it is *not* four individual `byte's, and most definitely not
an array of four `byte's.
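
In practice that means an `int` has to be computed from bytes rather than overlaid on them. A small sketch, with an invented class name, assuming the four bytes arrive most significant first:

```java
// Hypothetical helper: builds an int value from four separate bytes.
final class IntAssembly {
    // Big-endian: b[off] is the most significant byte.
    static int fromBigEndian(byte[] b, int off) {
        // The & 0xFF masks stop sign extension of negative bytes.
        return ((b[off] & 0xFF) << 24)
                | ((b[off + 1] & 0xFF) << 16)
                | ((b[off + 2] & 0xFF) << 8)
                | (b[off + 3] & 0xFF);
    }
}
```

The byte order is a choice the code makes explicitly; it is not a property of the `int` itself.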

Here's another thing: In C, an array is just a bunch of memory
slots arranged contiguously. In Java, an array is a full-fledged
object -- that is, Object -- with behaviors of its own. Use a bad
index on a C array and there's no telling what might happen; do so
in Java and you get an exception thrown into your teeth, reliably.
Similar remarks apply to Strings versus strings: A string is just
an array with a conventional structure, but a String is an object
with methods and data and so on. A String is *not* an array of
`char' -- try "Hello"[0] in Java, and I guarantee you will not
get an 'H'.
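
A tiny illustration of both points, with made-up names: characters of a String come out through charAt, and a bad array index produces an exception instead of undefined behaviour.

```java
// Hypothetical demo class for the String and array remarks.
final class StringsAndArrays {
    // "Hello"[0] does not compile in Java; charAt is the accessor.
    static char firstChar(String s) {
        return s.charAt(0);
    }

    // Returns true if Java rejects the index with an exception.
    static boolean indexIsRejected(byte[] a, int i) {
        try {
            byte unused = a[i];
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }
}
```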

> All the C/C++ compilers we use have pragmas to force known
> alignment/padding etc.

That's unfortunate. It's probably tempted you to write bad
C code, *slow* C code, in an attempt to duck the issues. Your
compilers offer crutches; when you use them, you hobble like a
cripple. You will never walk, much less run, until you visit
Lourdes and discard your crutches there.

> Byte order is not an issue in this layer,
> since all parameters are encoded in strings.

See the remarks above about the difference between strings
and Strings. (Also, I doubt your assertion: Three of the four
arrays you showed us are only one char long, and hence can hold
only one C string value: "". It's possible, I guess, that your
protocol involves transmitting multiple known-to-be-constant and
hence information-empty bytes, but I doubt it. Most likely, you
are not dealing with strings, not entirely at any rate.)

--
Eric Sosman
eso...@ieee-dot-org.invalid

Mark

Sep 30, 2009, 4:14:38 AM
On Tue, 29 Sep 2009 11:27:06 -0400, Eric Sosman <Eric....@sun.com>
wrote:

I should have phrased my question better. I know this, I just need to
know how to implement it in Java.

My main problem is not knowing the Java class library very well.

Mark

Sep 30, 2009, 4:08:54 AM
On 29 Sep 2009 15:10:36 GMT, "Kenneth P. Turvey" <evot...@gmail.com>
wrote:

>On Tue, 29 Sep 2009 12:10:14 +0100, Mark wrote:
>
>> I don't have access to the stream. The data arrives to my code in
>> already in a byte array.
>
>How does it arrive in your code? Is this the result of a native call?

No. There must be no native code. In fact the project is a port from
native code.

The data arrives from another Java package which handles the network
protocol.

Mark

Sep 30, 2009, 4:17:08 AM
On Tue, 29 Sep 2009 08:29:53 -0700 (PDT), Alessio Stalla
<alessi...@gmail.com> wrote:

>What you could do in principle - if you had to map a huge number of C
>structs - would be to write a translator that, statically, read the
>struct declarations from C code and automatically generated Java
>classes with methods to parse the structures and access their fields.

I have actually already done that very thing for a different project.
However, the bulk of the work there is done in native code. There were
hundreds of data structures in that case. This project only has a few.

RedGrittyBrick

Sep 30, 2009, 5:44:46 AM

Mark wrote:
> On Tue, 29 Sep 2009 11:33:29 +0200, Lothar Kimmeringer
> <news2...@kimmeringer.de> wrote:
>
>> Mark wrote:
>>
>>> I am writing an app in java which reads data from a socket from a C
>>> language program. The data is encoded as a C struct of char arrays
>>> something like this;
>>>
>>> typedef struct {
>>> char type[1];
>>> char length[6];
>>> char acknowledge[1];
>>> char duplicate[1];
>>> ...
>>> } type_t;
>>>
>>> How can I decode a structure like this in Java without using JNI (a
>>> requirement)?
[...]

> The data arrives to my code in already in a byte array.
>
> I guess I want to know if there is an easy way to map the C structure
> to a java class or whether I will have to dissect the byte array
> completely manually.

Someone please put me out of my misery, I'm missing something obvious,
why can't Mark do something like

-------------------------------------8<-----------------------------------
import java.util.Arrays;

public class DecodeStruct {
    public static void main(String[] args) {

        char[] struct = {'t','0','0','0','0','2','1','N','A'};

        Foo foo = new Foo(struct);

        System.out.println("length is " + foo.getLengthAsString());
    }
}

class Foo {
    private char type;
    private char[] length;
    private char acknowledge;
    private char duplicate;

    Foo (char[] struct) {
        // Offsets follow the field order of the C struct:
        // type[1], length[6], acknowledge[1], duplicate[1].
        type = struct[0];
        length = Arrays.copyOfRange(struct, 1, 7);
        acknowledge = struct[7];
        duplicate = struct[8];
    }

    public String getLengthAsString() {
        return new String(length);
    }
}
-------------------------------------8<-----------------------------------
Output:
length is 000021

--
RGB

Arne Vajhøj

Sep 30, 2009, 7:32:35 AM


Does it "map the C structure to a java class" ?

Arne

Joshua Cranmer

Sep 30, 2009, 7:47:12 AM
On 09/30/2009 05:44 AM, RedGrittyBrick wrote:
> class Foo {
> private char type;
> private char[] length;
> private char duplicate;
> private char acknowledge;

If the OP was using char because he wanted individual bytes and not
because he was using textual data, byte would be more appropriate here,
especially when he tries to read in or write out data...

--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth

Martin Gregorie

Sep 30, 2009, 8:36:34 AM
On Wed, 30 Sep 2009 09:08:54 +0100, Mark wrote:

> On 29 Sep 2009 15:10:36 GMT, "Kenneth P. Turvey" <evot...@gmail.com>
> wrote:
>

.../....

>>How does it arrive in your code? Is this the result of a native call?
>
> No. There must be no native code. In fact the project is a port from
> native code.
>
> The data arrives from another Java package which handles the network
> protocol.
>

In that case why are you talking about a C struct? Just swipe the code
from the Java class that replaced the C struct.


--
martin@ | Martin Gregorie
gregorie. | Essex, UK
org |

RedGrittyBrick

Sep 30, 2009, 8:40:37 AM

Not really I suppose, at least, not dynamically. Nor is it different
from the "dissect the byte array completely manually" that Mark wished
to avoid.

However it is the first thing I'd think of in order to instantiate a
Java class from a byte array whose contents originated from a C struct.
It just doesn't seem that bad to me.

--
RGB

markspace

Sep 30, 2009, 9:57:53 AM
Mark wrote:

> On Tue, 29 Sep 2009 11:27:06 -0400, Eric Sosman <Eric....@sun.com>

>> What you "have" to do is decide what kind of a Java object (or


>> set of objects) you want to create from this bag of bytes. Then do
>> whatever's needed to create the object(s). Stop thinking about
>> representation, and start thinking about information, about values.


> I should have phrased my question better. I know this, I just need to
> know how to implement it in Java.


Well, can you tell us then? What kind of Java object do you want to
create? Also tell us what you have now (an array of bytes, I'm
guessing) and we might be able to give you some better advice.


> My main problem is not knowing the Java class library very well.


I don't think there's a library class that will do this for you. Maybe
serialization, if you wrote special methods to handle it. Dunno if
that's the best idea though.


Mark

Sep 30, 2009, 11:46:19 AM
On Wed, 30 Sep 2009 12:36:34 +0000 (UTC), Martin Gregorie
<mar...@address-in-sig.invalid> wrote:

>On Wed, 30 Sep 2009 09:08:54 +0100, Mark wrote:
>
>> On 29 Sep 2009 15:10:36 GMT, "Kenneth P. Turvey" <evot...@gmail.com>
>> wrote:
>>
>.../....
>
>>>How does it arrive in your code? Is this the result of a native call?
>>
>> No. There must be no native code. In fact the project is a port from
>> native code.
>>
>> The data arrives from another Java package which handles the network
>> protocol.
>>
>In that case why are you talking about a C struct? Just swipe the code
>from the Java class that replaced the C struct.

The message format is described in terms of a C structure. There is
no extant code to "replace" the C struct.

Mark

Sep 30, 2009, 11:48:21 AM

Yes. Something like this.

Lew

Sep 30, 2009, 12:07:18 PM
On Sep 30, 11:46 am, Mark <i...@dontgetlotsofspamanymore.invalid>

wrote:
> On Wed, 30 Sep 2009 12:36:34 +0000 (UTC), Martin Gregorie
>
>
>
> <mar...@address-in-sig.invalid> wrote:
> >On Wed, 30 Sep 2009 09:08:54 +0100, Mark wrote:
>
> >> On 29 Sep 2009 15:10:36 GMT, "Kenneth P. Turvey" <evotur...@gmail.com>

> >> wrote:
>
> >.../....
>
> >>>How does it arrive in your code?  Is this the result of a native call?
>
> >> No.  There must be no native code.  In fact the project is a port from
> >> native code.
>
> >> The data arrives from another Java package which handles the network
> >> protocol.
>
> >In that case why are you talking about a C struct? Just swipe the code
> >from the Java class that replaced the C struct.
>
> The message format is described in terms of a C structure.  There is
> no extant code to "replace" the C struct.  

This contradicts what you said upthread:


> The data arrives from another Java package which handles the network
> protocol.
>

--
Lew

markspace

unread,
Sep 30, 2009, 12:23:00 PM9/30/09
to
Lew wrote:

> This contradicts what you said upthread:

> On Sep 30, 11:46 am, Mark <i...@dontgetlotsofspamanymore.invalid>


>> The data arrives from another Java package which handles the network
>> protocol.


While true, I'd bet dollars to donuts that the package just returns an
array of bytes or something similar.

Thinking outside the OP's problem statement a little bit, I wonder if
there's a general best practice for this sort of thing. One has legacy
byte stream and needs to unmarshal it to some Java objects. What's the
best way to go about this?

Assuming the number of different objects is small, I think
RedGrittyBrick's solution to just pass-in a buffer and offset for each
class built is probably the best. It's simple and easy to understand.
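
A rough sketch of that buffer-and-offset style, assuming the all-character layout from the original post (type 1 byte, length 6 bytes, acknowledge 1 byte, duplicate 1 byte) and ASCII content. The class name `MessageHeader` and the reading of `length` as decimal digits are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical value class built from a slice of a raw byte array. */
final class MessageHeader {
    static final int SIZE = 1 + 6 + 1 + 1; // type + length + acknowledge + duplicate

    final char type;
    final int length;
    final char acknowledge;
    final char duplicate;

    /** Decodes SIZE bytes starting at offset; assumes every field is ASCII text. */
    MessageHeader(byte[] buf, int offset) {
        if (offset < 0 || buf.length - offset < SIZE) {
            throw new IllegalArgumentException("need " + SIZE + " bytes at offset " + offset);
        }
        type = (char) (buf[offset] & 0xFF);
        // Assumption: the six-byte length field holds decimal digits, e.g. "000012".
        length = Integer.parseInt(new String(buf, offset + 1, 6, StandardCharsets.US_ASCII));
        acknowledge = (char) (buf[offset + 7] & 0xFF);
        duplicate = (char) (buf[offset + 8] & 0xFF);
    }

    public static void main(String[] args) {
        // Two leading filler bytes show that decoding can start at any offset.
        byte[] raw = "xxA00001201".getBytes(StandardCharsets.US_ASCII);
        MessageHeader h = new MessageHeader(raw, 2);
        System.out.println(h.type + " " + h.length + " " + h.acknowledge + " " + h.duplicate);
    }
}
```

Each further struct would get its own class with the same constructor shape, so a caller can walk one byte array by advancing the offset by `SIZE`.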

For larger numbers of different objects, some kind of factory and parser
would be better I think. Any ideas where to start there? We kind of
have a problem similar to grz01, where the factory needs to return
almost a random assortment of objects. I wonder if an
ObjectOutputStream would help here?
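
One possible starting point for the factory idea is a lookup table keyed on the message's type byte. This is only a sketch: the type codes 'D' and 'A', the class names, and the assumption that the first byte identifies the message kind are all invented for illustration and are not taken from Mark's protocol.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical factory: the first byte of a message selects the parser for the rest. */
final class LegacyMessageFactory {
    /** Common supertype for decoded messages; the subclasses are invented examples. */
    static abstract class Message {
        final String payload;
        Message(String payload) { this.payload = payload; }
    }
    static final class DataMessage extends Message {
        DataMessage(String payload) { super(payload); }
    }
    static final class AckMessage extends Message {
        AckMessage(String payload) { super(payload); }
    }

    private final Map<Character, Function<String, Message>> parsers = new HashMap<>();

    LegacyMessageFactory() {
        // Assumption: 'D' marks a data message and 'A' an acknowledgement.
        parsers.put('D', DataMessage::new);
        parsers.put('A', AckMessage::new);
    }

    /** Looks up the parser for the leading type byte and applies it to the remainder. */
    Message parse(byte[] raw) {
        if (raw.length == 0) {
            throw new IllegalArgumentException("empty message");
        }
        String text = new String(raw, StandardCharsets.US_ASCII);
        Function<String, Message> parser = parsers.get(text.charAt(0));
        if (parser == null) {
            throw new IllegalArgumentException("unknown message type: " + text.charAt(0));
        }
        return parser.apply(text.substring(1));
    }

    public static void main(String[] args) {
        Message m = new LegacyMessageFactory().parse("Dhello".getBytes(StandardCharsets.US_ASCII));
        System.out.println(m.getClass().getSimpleName() + " " + m.payload);
    }
}
```

Adding a message kind then means registering one more entry in the table rather than touching the dispatch code.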

Patricia Shanahan

unread,
Sep 30, 2009, 12:23:42 PM9/30/09
to
Mark wrote:
> On Wed, 30 Sep 2009 12:36:34 +0000 (UTC), Martin Gregorie
> <mar...@address-in-sig.invalid> wrote:
>
>> On Wed, 30 Sep 2009 09:08:54 +0100, Mark wrote:
>>
>>> On 29 Sep 2009 15:10:36 GMT, "Kenneth P. Turvey" <evot...@gmail.com>
>>> wrote:
>>>
>> .../....
>>
>>>> How does it arrive in your code? Is this the result of a native call?
>>> No. There must be no native code. In fact the project is a port from
>>> native code.
>>>
>>> The data arrives from another Java package which handles the network
>>> protocol.
>>>
>> In that case why are you talking about a C struct? Just swipe the code
>>from the Java class that replaced the C struct.
>
> The message format is described in terms of a C structure. There is
> no extant code to "replace" the C struct.

I strongly suggest getting away from describing it in terms of a C
structure, because it just does not give useful data. Several of us have
effectively answered different questions because of the lack of clarity.

Here is an example of how I might describe one interpretation of your
data format:

type: 1 byte, ascii, valid values 'a', 'b', 'c'.
length: 6 bytes, unsigned decimal in ascii, digits '0' through '9'.
acknowledge: 1 byte, binary, zero represents false, all other values
accepted and represent true.
duplicate: 1 byte, binary, zero represents false, all other values
accepted and represent true.

Can you describe your data stream as it really is in that sort of form?

Patricia

markspace

unread,
Sep 30, 2009, 12:35:17 PM9/30/09
to
Patricia Shanahan wrote:
> Several of us have
> effectively answered different questions because of the lack of clarity.


I think I read about this somewhere. "Garbage in, garbage out" perhaps?

Arne Vajhøj

unread,
Sep 30, 2009, 1:52:15 PM9/30/09
to
Mark wrote:
> On Wed, 30 Sep 2009 12:36:34 +0000 (UTC), Martin Gregorie
> <mar...@address-in-sig.invalid> wrote:
>> On Wed, 30 Sep 2009 09:08:54 +0100, Mark wrote:
>>> On 29 Sep 2009 15:10:36 GMT, "Kenneth P. Turvey" <evot...@gmail.com>
>>> wrote:
>>>> How does it arrive in your code? Is this the result of a native call?
>>> No. There must be no native code. In fact the project is a port from
>>> native code.
>>>
>>> The data arrives from another Java package which handles the network
>>> protocol.
>>>
>> In that case why are you talking about a C struct? Just swipe the code
>>from the Java class that replaced the C struct.
>
> The message format is described in terms of a C structure.

Hopefully with additional text about implementation of C data
types and padding.

Arne

Martin Gregorie

unread,
Sep 30, 2009, 2:45:14 PM9/30/09
to
On Wed, 30 Sep 2009 09:35:17 -0700, markspace wrote:

> I think I read about this somewhere. "Garbage in, garbage out" perhaps?
>

I realise that we're talking about messages generated by some dark-age
legacy system here, but the real answer is to design the message format
properly in the first place, something all too few designers do.

In this context 'properly' implies implementation independence. If
possible it should also be human readable since debugging is usually
easier if you can easily read messages displayed by Wireshark or a
datascope. Failing that, at least use a standard representation: CSV, a
SWIFT knockoff, ASN.1 or even XML all spring to mind. I've listed these
in order of descending personal preference but ymmv.

The overhead of converting between a program's internal storage
structures and the transmission format is fairly minimal if a sensible
message format is used.

RedGrittyBrick

unread,
Oct 1, 2009, 5:03:33 AM10/1/09
to

Martin Gregorie wrote:
> I realise that we're talking about messages generated by some dark-age
> legacy system here, but the real answer is to design the message format
> properly in the first place, something all too few designers do.
>
> In this context 'properly' implies implementation independence. If
> possible it should also be human readable since debugging is usually
> easier if you can easily read messages displayed by Wireshark or a
> datascope. Failing that, at least use a standard representation: CSV, a
> SWIFT knockoff, ASN.1 or even XML all spring to mind. I've listed these
> in order of descending personal preference but ymmv.
>

That's an interesting list.

CSV - Simple, ubiquitous, venerable.

SWIFT - what is that? Not "Society for Worldwide Interbank Financial
Telecommunication"?

ASN.1 - my immediate reaction is Eek, but having studied Wikipedia for a
few seconds - maybe I should reconsider. You probably mean DER, XER or
PER for the actual data stream.

XML - I think you have the position right. I use it a lot and have
become steadily more disillusioned with XML as a data format.

I'd add EDIFACT at the end of the list, just to show how horrible data
representations can be. I'd add JSON after CSV.


--
RGB

Mark

unread,
Oct 1, 2009, 5:26:19 AM10/1/09
to
On Wed, 30 Sep 2009 09:23:42 -0700, Patricia Shanahan <pa...@acm.org>
wrote:

>Mark wrote:
>> On Wed, 30 Sep 2009 12:36:34 +0000 (UTC), Martin Gregorie
>> <mar...@address-in-sig.invalid> wrote:
>>
>>> On Wed, 30 Sep 2009 09:08:54 +0100, Mark wrote:
>>>
>>>> On 29 Sep 2009 15:10:36 GMT, "Kenneth P. Turvey" <evot...@gmail.com>
>>>> wrote:
>>>>
>>> .../....
>>>
>>>>> How does it arrive in your code? Is this the result of a native call?
>>>> No. There must be no native code. In fact the project is a port from
>>>> native code.
>>>>
>>>> The data arrives from another Java package which handles the network
>>>> protocol.
>>>>
>>> In that case why are you talking about a C struct? Just swipe the code
>>>from the Java class that replaced the C struct.
>>
>> The message format is described in terms of a C structure. There is
>> no extant code to "replace" the C struct.
>
>I strongly suggest getting away from describing it in terms of a C
>structure, because it just does not give useful data. Several of us have
>effectively answered different questions because of the lack of clarity.

I only used C structures because that was the information given to me.


>Here is an example of how I might describe one interpretation of your
>data format:
>
>type: 1 byte, ascii, valid values 'a', 'b', 'c'.
>length: 6 bytes, unsigned decimal in ascii, digits '0' through '9'.
>acknowledge: 1 byte, binary, zero represents false, all other values
>accepted and represent true.
>duplicate: 1 byte, binary, zero represents false, all other values
>accepted and represent true.
>
>Can you describe your data stream as it really is in that sort of form?

Yes. It's pretty much as you describe, except there is no binary
data, only ASCII. So "acknowledge" would be set to '0' or '1', for
example.

Anyway I think I have found a solution using the ByteBuffer class.

Thanks to everyone who has tried to help.
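
Mark does not say how he used ByteBuffer, so the following is only a guess at one possible shape. It assumes the all-ASCII layout discussed above (type 1 byte, length 6 decimal digits, acknowledge and duplicate as '0' or '1'); the class and helper names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class ByteBufferHeaderDemo {
    /** Reads the next n bytes from the buffer as an ASCII string, advancing its position. */
    static String readAscii(ByteBuffer buf, int n) {
        byte[] field = new byte[n];
        buf.get(field); // throws BufferUnderflowException if fewer than n bytes remain
        return new String(field, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Stand-in for the byte array handed over by the network layer.
        ByteBuffer buf = ByteBuffer.wrap("A00001210".getBytes(StandardCharsets.US_ASCII));
        String type = readAscii(buf, 1);
        int length = Integer.parseInt(readAscii(buf, 6));
        boolean acknowledge = readAscii(buf, 1).equals("1");
        boolean duplicate = readAscii(buf, 1).equals("1");
        System.out.println(type + " " + length + " " + acknowledge + " " + duplicate);
    }
}
```

Because the buffer tracks its own position, the fields are read in declaration order without any manual offset arithmetic.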

RedGrittyBrick

unread,
Oct 1, 2009, 5:42:44 AM10/1/09
to

markspace wrote:
> Mark wrote:
>
>> On Tue, 29 Sep 2009 11:27:06 -0400, Eric Sosman <Eric....@sun.com>
>
>>> What you "have" to do is decide what kind of a Java object (or
>>> set of objects) you want to create from this bag of bytes. Then do
>>> whatever's needed to create the object(s). Stop thinking about
>>> representation, and start thinking about information, about values.
>
>
>> I should have phrased my question better. I know this, I just need to
>> know how to implement it in Java.
>
>
> Well, can you tell us then? What kind of Java object do you want to
> create? Also tell us what you have now (an array of bytes, I'm
> guessing) and we might be able to give you some better advice.

In an earlier message Mark said that the data was indeed delivered as an
array of bytes. Presumably the byte array contains all the data from the
whole C struct that corresponds to the payload of a network message.

>
>
>> My main problem is not knowing the Java class library very well.
>
>
> I don't think there's a library class that will do this for you. Maybe
> serialization, if you wrote special methods to handle it. Dunno if
> that's the best idea though.

That's an interesting idea, I wonder if it is possible to design a Java
class such that one can deserialize Mark's byte array into an instance
of that class?

ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(bytes));
Foo foo = (Foo) in.readObject();
in.close();

Martin Gregorie mentioned ASN.1. I'm vaguely aware that ASN.1 is long
beloved of network protocol designers/developers, so it might be
especially appropriate.

If one could describe Mark's struct in ASN.1 (which seems like a
respectable thing to do) then, with luck, the byte array might
correspond to an encoding (PER?) of data matching that ASN.1
description. In which case it might be possible to use something like
this:
http://harmony.apache.org/subcomponents/classlibrary/asn1_framework.html

I've only glanced at this ASN.1/Harmony stuff, so I may be rather lost
in the woods, so to speak. I don't think Harmony consumes ASN.1 and a
byte array and emits Java objects so it is probably a red-herring - at
least in terms of minimising the amount of Java to be written and
libraries to be understood.

In another message Mark said he has only a *few* classes to map to
structs. So I'd use the hand-crafted approach I outlined earlier. I'd
save the heavy lifting machinery for when it is really needed.

Just my £0.02 worth.

--
RGB

Alessio Stalla

unread,
Oct 1, 2009, 6:25:06 AM10/1/09
to
On Oct 1, 11:42 am, RedGrittyBrick <RedGrittyBr...@spamweary.invalid>
wrote:

> >> My main problem is not knowing the Java class library very well.
>
> > I don't think there's a library class that will do this for you.  Maybe
> > serialization, if you wrote special methods to handle it.  Dunno if
> > that's the best idea though.
>
> That's an interesting idea, I wonder if it is possible to design a Java
> class such that one can deserialize mark's byte array into an instance
> of that class?

AFAIK Java serialization always includes a reference to the Java class
before the actual serialized data, so you cannot take an arbitrary
byte stream and deserialize it. That said, if you can augment the
stream with the class information, it could be made to work.
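
A small experiment along these lines, using only standard java.io classes: opening an ObjectInputStream over plain ASCII bytes fails at the stream header check, while bytes written by ObjectOutputStream are accepted. The sample payload "A00001210" and the class and method names are made up for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.StreamCorruptedException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class SerializationHeaderDemo {
    /** Returns true if an ObjectInputStream can be opened over the given bytes. */
    static boolean opensAsObjectStream(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return true;
        } catch (StreamCorruptedException e) {
            return false; // the stream header did not match
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Serializes a value with the standard Java mechanism. */
    static byte[] serialize(Object value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] legacy = "A00001210".getBytes(StandardCharsets.US_ASCII);
        byte[] serialized = serialize("A00001210");
        System.out.println("legacy bytes accepted: " + opensAsObjectStream(legacy));
        System.out.println("serialized bytes accepted: " + opensAsObjectStream(serialized));
        // The serialized form begins with a fixed magic number and version.
        System.out.printf("first bytes: %02X %02X%n", serialized[0] & 0xFF, serialized[1] & 0xFF);
    }
}
```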

Lew

unread,
Oct 1, 2009, 7:52:03 AM10/1/09
to
RedGrittyBrick wrote:
> XML - I think you have the position right. I use it a lot and have
> become steadily more disillusioned with XML as a data format.

I've been using XML a lot - a whole lot - for over ten years, and I am still
becoming steadily more enamored of it as a data format.

--
Lew

Mark

unread,
Oct 1, 2009, 8:09:00 AM10/1/09
to
On Wed, 30 Sep 2009 09:07:18 -0700 (PDT), Lew <l...@lewscanon.com>
wrote:

It's not a contradiction. The network layer is separate (as it should
be). The network layer just presents a byte array, with no kind of
structure.

Martin Gregorie

unread,
Oct 1, 2009, 9:35:20 AM10/1/09
to
On Thu, 01 Oct 2009 10:03:33 +0100, RedGrittyBrick wrote:

> SWIFT - what is that? Not "Society for Worldwide Interbank Financial
> Telecommunication"?
>

The very same.

It's a really simple tagged-field format that was designed so it could be
hand-transmitted by teletype. Yes, it goes back that far!

These days SWIFT uses specialised header and trailer blocks to deal with
message routing and checksums, etc, but the body format is what I was
thinking of here. Each body field consists of a tag enclosed in colons
and a value which is terminated by a newline followed by the next field
or the end of the body. The tags are usually numeric. They define the
meaning and format of a field, so a 33B field always contains a currency
code and amount:

:33B:USD3.95

while a 32A field contains a value date, currency code and amount:

:32A:091001GBP999.00

The field tag meaning is independent of the message type. You get the
picture, the point being that splitting a message into fields is about as
easy as dealing with CSV but with the added advantage that you can drive
field decoding with a field dictionary, keyed on tag value, that contains
both field meaning and parsing rules. Add a message dictionary to ensure
that the message has the right set of fields and message assembly and
parsing become really easy to automate.
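
To illustrate how little code the field splitting needs, here is a sketch that breaks a body of ":tag:value" lines into a map. It is deliberately simplified: it assumes every value fits on one line and every tag occurs at most once, which is an assumption of this example rather than a statement about real SWIFT messages. The class name is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TaggedFieldParser {
    /** Splits a body of ":tag:value" lines into an ordered tag-to-value map. */
    static Map<String, String> parse(String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : body.split("\\R")) {
            if (line.isEmpty()) {
                continue; // tolerate blank lines between fields
            }
            int end = line.indexOf(':', 1); // closing colon of the tag
            if (line.charAt(0) != ':' || end < 0) {
                throw new IllegalArgumentException("not a tagged field: " + line);
            }
            fields.put(line.substring(1, end), line.substring(end + 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse(":33B:USD3.95\n:32A:091001GBP999.00");
        fields.forEach((tag, value) -> System.out.println(tag + " -> " + value));
    }
}
```

A field dictionary would then be a second map from tag to a decoder for that tag's value, for example one that splits a 32A value into date, currency code and amount.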



> ASN.1 - my immediate reaction is Eek, but having studied Wikipedia for a
> few seconds - maybe I should reconsider. You probably mean DER, XER or
> PER for the actual data stream.
>

ASN.1 messages are compact and fairly easy to read given a hex dump. It's a
very common format in the telecoms industry: GSM network components tend
to communicate via ASN.1 messages. Messages are fast and easy to assemble
and parse because 'asn.1 compilers', which generate source code from the
message definitions, are readily available. That said, I've also
hand-built syntax recognisers that disassembled ASN.1 messages with the aid
of a simple field dictionary.



> XML - I think you have the position right. I use it a lot and have
> become steadily more disillusioned with XML as a data format.
>

Agreed. All too often there's no DTD available, which sort of defeats the
whole object of the exercise. The message constructors I've seen have
merely replaced tokens with field values in predefined templates that
contained the structure and tags.

> I'd add EDIFACT at the end of the list, just to show how horrible data
> representations can be. I'd add JSON after CSV.
>

I haven't been so 'lucky', but nothing I've read about it has been
exactly glowing.

RedGrittyBrick

unread,
Oct 1, 2009, 10:02:45 AM10/1/09
to

Canonicalisation.
Exclusive?
Namespaces?
Namespace prefixes?

Digital signatures.
Embedded, Detached?
Certs, PKCS12!

Schema definition languages.
XSD insufficient so Schematron too!

Attributes vs elements.

Aaaaaaaaaaaaieeeeee!

--
R…G…B

Lew

unread,
Oct 1, 2009, 10:40:07 AM10/1/09
to
RedGrittyBrick wrote:
>>> XML - I think you have the position right. I use it a lot and have
>>> become steadily more disillusioned with XML as a data format.
>

Lew wrote:
>> I've been using XML a lot - a whole lot - for over ten years, and I am
>> still becoming steadily more enamored of it as a data format.
>

RGB:
> Canonicalisation.
>    Exclusive?
>

What do you mean by "exclusive"?

>    Namespaces?
>    Namespace prefixes?
>

Is this supposed to be a complaint about or criticism of XML? Because
I would argue that these are XML strengths.

> Digital signatures.
>    Embedded, Detached?
>

This is a criticism of XML itself? I don't think so.

>    Certs, PKCS12!
>

Also not inherent to XML.

> Schema definition languages.
>    XSD insufficient so Schematron too!
>

XSD has always sufficed for my needs.

> Attributes vs elements.
>

Again, this is a criticism? How is this a criticism?

> …
>

Ellipsis so that we can imagine there are more things to say, albeit
you couldn't think of them.

> Aaaaaaaaaaaaieeeeee!
>

Programming is hard, that's why they pay us the big bucks.

Consider the alternatives to XML:

- EDI - talk about "Aaaaaaaaaaaaieeeeee!"
- fixed-width formats - layout has semantics. Inflexible. Not human-
readable.
- Java 'Serializable' - specialized
- CSV - virtually randomly specified - not consistent, not
regularizable.

I have yet to find a format that is as clean or flexible or extensible
or maintainable for data exchange as XML, let alone more so.

--
Lew

Mayeul

unread,
Oct 1, 2009, 11:41:33 AM10/1/09
to
RedGrittyBrick wrote:
>
> Lew wrote:
>> RedGrittyBrick wrote:
>>> XML - I think you have the position right. I use it a lot and have
>>> become steadily more disillusioned with XML as a data format.
>>
>> I've been using XML a lot - a whole lot - for over ten years, and I am
>> still becoming steadily more enamored of it as a data format.
>>
>
> Canonicalisation.
> Exclusive?
> Namespaces?
> Namespace prefixes?

Out of curiosity, what are the uses of a data format canonicalization?

Canonical class names, host names, URLs, file path and the like, are of
obvious use to me. This, not. Also, either I don't actually know the
point of canonicalization, or I can't see a problem with namespaces
and their prefixes.

> Digital signatures.
> Embedded, Detached?
> Certs, PKCS12!
>
> Schema definition languages.
> XSD insufficient so Schematron too!

Err, would anything short of a Turing-complete data description language
ever be sufficient? I'd wager that's what programming languages are for.

> Attributes vs elements.

Has only sometimes bothered me in a few niche cases. What are the problems?

--
Mayeul

Bent C Dalager

unread,
Oct 1, 2009, 11:53:32 AM10/1/09
to
On 2009-10-01, Mayeul <mayeul....@free.fr> wrote:

> RedGrittyBrick wrote:
>>
>> Canonicalisation.
>> Exclusive?
>> Namespaces?
>> Namespace prefixes?
>
> Out of curiosity, what are the uses of a data format canonicalization?

XML canonicalization is useful for calculating checksums over a block
of XML data. As an example, two XML streams that differ only in
whitespace are (often) considered semantically equal but their
checksums would be different if you didn't apply some canonicalization
strategy.

Cheers,
Bent D
--
Bent Dalager - b...@pvv.org - http://www.pvv.org/~bcd
powered by emacs

RedGrittyBrick

unread,
Oct 1, 2009, 12:07:48 PM10/1/09
to

Lew wrote:
> RedGrittyBrick wrote:
>>>> XML - I think you have the position right. I use it a lot and have
>>>> become steadily more disillusioned with XML as a data format.
>
> Lew wrote:
>>> I've been using XML a lot - a whole lot - for over ten years, and I am
>>> still becoming steadily more enamored of it as a data format.
>
> RGB:
>> Canonicalisation.
>> Exclusive?
>>
>
> What do you mean by "exclusive"?

http://www.w3.org/TR/xml-exc-c14n/
Why do I have to choose? How is it sensible to have more than one
*canonical* form?


>
>> Namespaces?
>> Namespace prefixes?
>>
>
> Is this supposed to be a complaint about or criticism of XML?

Yes it is.

Two XML documents can be equivalent even though they use different
prefixes for the same namespaces and therefore have differing canonical
forms. This is clearly a problem for canonicalisation. I also dislike
all the tedious moving about and multiplication of xmlns attributes that
becomes necessary.


> Because I would argue that these are XML strengths.

They are a necessary evil. Mostly I find they get in my way.


>> Digital signatures.
>> Embedded, Detached?
>>
>
> This is a criticism of XML itself? I don't think so.

OK. It's a criticism of the whole caboodle of XML stuff that one often
has to deal with when using XML.


>
>> Certs, PKCS12!
>>
>
> Also not inherent to XML.

I'll concede that. It seems to be unavoidable in big organisations for
authentication, integrity and non-repudiability of XML. Smaller
companies are often a bit more flexible.


>
>> Schema definition languages.
>> XSD insufficient so Schematron too!
>>
>
> XSD has always sufficed for my needs.

Alas, government departments don't agree with you. They like to codify
constraints that XSD can't express.


>
>> Attributes vs elements.
>>
>
> Again, this is a criticism?

Yes.


> How is this a criticism?

When designing an XML schema you very often have the choice of
representing a given datum as either an attribute <foo bar="x" /> or
as an element <foo><bar>x</bar></foo>. I'm not sure there is a good
enough reason to have two ways to do the same thing. Often the choices
people made seem odd.


>> …


>>
>
> Ellipsis so that we can imagine there are more things to say, albeit
> you couldn't think of them.
>

… Yes dammit!

At the time :-)

Another thing I don't like is the clumsy way that XML parsers have to
preserve useless whitespace (indentation and such). Of all the things
that break XML canonicalisation, I find this the most ridiculous.


>> Aaaaaaaaaaaaieeeeee!
>>
>
> Programming is hard, that's why they pay us the big bucks.

Hard I don't mind. Pointlessly hard irritates me.


>
> Consider the alternatives to XML:
>
> - EDI - talk about "Aaaaaaaaaaaaieeeeee!"

Goodness me, we agree about something.


> - fixed-width formats - layout has semantics. Inflexible. Not human-
> readable.
> - Java 'Serializable' - specialized
> - CSV - virtually randomly specified - not consistent, not
> regularizable.

JSON?


> I have yet to find a format that is as clean or flexible or extensible
> or maintainable for data exchange as XML, let alone more so.

With the exception of EDIFACT I've not found one so complex, obscure and
troublesome in actual use :-)

You are right that much of the trouble I have with XML is not to do with
the basic grammar of tags and elements but with the Byzantine forest of
stuff that has grown up around it.


--
RGB

Mayeul

unread,
Oct 1, 2009, 12:10:39 PM10/1/09
to
Bent C Dalager wrote:
> On 2009-10-01, Mayeul <mayeul....@free.fr> wrote:
>> RedGrittyBrick wrote:
>>> Canonicalisation.
>>> Exclusive?
>>> Namespaces?
>>> Namespace prefixes?
>> Out of curiosity, what are the uses of a data format canonicalization?
>
> XML canonicalization is useful for calculating checksums over a block
> of XML data. As an example, two XML streams that differ only in
> whitespace are (often) considered semantically equal but their
> checksums would be different if you didn't apply some canonicalization
> strategy.

Err, sure but, hence the question: when does it matter?
I am not sure, when I want a checksum, that I want to ignore
modifications in attributes ordering, newline conventions, in-tag
whitespaces, or any kind of modification in the list of bytes (or
characters).

--
Mayeul

markspace

unread,
Oct 1, 2009, 12:12:51 PM10/1/09
to
Mark wrote:

>
> It's not a contradiction. The network layer is separate (as it should
> be). The network layer just presents a byte array, with no kind of
> structure.


Some of us would think of that as not quite enough abstraction, or design,
or something. In an object oriented language, it's ok to have your
network layer return objects. Byte arrays are a little "no type here,
move along" for many folks.

Mayeul

unread,
Oct 1, 2009, 12:28:40 PM10/1/09
to
RedGrittyBrick wrote:

>
> Lew wrote:
>> - EDI - talk about "Aaaaaaaaaaaaieeeeee!"
>
> Goodness me, we agree about something.
>
>
>> - fixed-width formats - layout has semantics. Inflexible. Not human-
>> readable.
>> - Java 'Serializable' - specialized
>> - CSV - virtually randomly specified - not consistent, not
>> regularizable.
>
> JSON?

I for one, actually use JSON in a lot of situations where it seems a
time-saver.

Still, it is not really extensible once designed, and not human-readable
but for the most trivial cases (cases where going XML might have been
overkill in the first place.)

This leads to a lot of shortcomings if compared to XML, while losing
the XML behemoth toolchains (unless we convert the JSON to an XML subset
somewhere in the chain.)

> You are right that much of the trouble I have with XML is not to do with
> the basic grammar of tags and elements but with the Byzantine forest of
> stuff that has grown up around it.

Do you have trouble just ignoring them?

(Admittedly, sometimes you need to comply with some atrocity designed
around some ridiculously complex XML format, but that is hardly a valid
criticism against XML the data format. For now at least.)

--
Mayeul

RedGrittyBrick

unread,
Oct 1, 2009, 12:54:39 PM10/1/09
to

The thinking seems to be as follows.

I send you an order for 3000 copies of your two latest books. I send it
as an XML stream. This stream typically has an XML envelope that says
who I am and contains proof of who I am. The stream also contains an XML
body that contains the order details. In order that you can be sure the
body isn't corrupted or tampered with, I make a checksum and digitally
sign the checksum with an asymmetric key certified by some authority.

Apparently the people who worked on the standards for this thought it
likely that the XML body might suffer from being rearranged as part of
the normal handling of the XML. For example my order contains
<book author="Mayeul" isbn="123" />
but in processing it could conceivably be harmlessly rearranged as
<book isbn="123" author="Mayeul"/>
Horror - the signature no longer matches. Transaction cancelled.

So, before you sign XML, you canonicalize it. Part of that process is
ordering the attributes alphabetically (probably involving locales, I
forget). Before checking signatures, ditto.
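
The attribute-order problem is easy to reproduce. The sketch below does not perform XML canonicalization; it only shows that the two spellings of the book element hash differently as raw text, while a DOM comparison treats them as the same element. SHA-256 stands in for whatever digest a real signature scheme would use, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

final class AttributeOrderDemo {
    /** Digest of the text exactly as written, with no normalisation at all. */
    static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses the text and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String sent = "<book author=\"Mayeul\" isbn=\"123\" />";
        String received = "<book isbn=\"123\" author=\"Mayeul\"/>";
        System.out.println("raw digests match: "
                + Arrays.equals(sha256(sent), sha256(received)));
        System.out.println("DOM elements equal: "
                + parseRoot(sent).isEqualNode(parseRoot(received)));
    }
}
```

Canonicalization exists to close that gap: both parties rewrite the XML into one agreed byte sequence before the digest is computed.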

In my experience, XML doesn't get rearranged in this way. At least, not
up to the point where you'd sensibly want to check signatures, which is
early on in any processing. Which means that canonicalisation is usually
more of an awkward annoyance than a useful thing to do, but you have to
do it because everyone expects it.

This is just the tip of the XML canonicalization iceberg.

In order that my signature could remain valid if the XML body were
removed from one envelope and inserted in another for onward processing,
clever things have to be done with namespace declarations in the
envelope that are used in the body. Sometimes this means that one
namespace declaration in the envelope becomes copied into numerous tags
in the body.

In my experience, no one needs signatures of XML bodies to be preserved
outside the context of the original XML envelope. In fact, sometimes the
part that is signed includes parts of the XML that are actually part of
the container and not the payload. E.g. the signature is of the <body>
element which contains the payload, not of the payload's top element.
This means it is impossible for the signed payload to actually be
embedded in a different envelope as it still has the grisly stumps of
the prior container attached to it.
My experience isn't that broad but it seems to me that this portability
of signature is likely to be useful in only very rare circumstances. For
me it is a hindrance.

--
RGB

RedGrittyBrick

unread,
Oct 1, 2009, 1:12:33 PM10/1/09
to

Mayeul wrote:
>> You are right that much of the trouble I have with XML is not to do
>> with the basic grammar of tags and elements but with the Byzantine
>> forest of stuff that has grown up around it.
>
> Do you have trouble just ignoring them?

When transmitting XML to departments of various very large
organisations, I would have trouble ignoring the specs they probably
took three years and £3,000,000 writing and which are already agreed
internationally :-)


> (Admittedly, sometimes you need to comply with some atrocity designed
> around some ridiculously complex XML format, but that is hardly a valid
> criticism against XML the data format. For now at least.)

There exist standards saying XML should be canonicalized before being
signed. Therefore the consultants employed by the aforementioned very
large organisations write specs that expect XML to be canonicalized
before being signed. Even if they get byte for byte what I transmit and
canonicalisation actually serves no purpose. In their shoes I'd probably
do much the same.

XML is surrounded by stuff that gets used even in circumstances where it
serves no purpose. Academically this is not the fault of XML. For
practical purposes it seems to me, in my line of work, to be inevitable
when using XML. YMMV.

--
RGB

Bent C Dalager

unread,
Oct 1, 2009, 5:24:07 PM10/1/09
to

People have found that when they checksum XML documents they often
want this feature. They have therefore developed techniques for XML
canonicalization.

Whether you would want it or not is an entirely different matter and
largely irrelevant to the reason for it having been developed. If you
don't need or want it then you simply don't use it. Most programmers
probably couldn't care less about XML canonicalization(*) but this
doesn't make it any less useful for those who do need it.

Cheers,
Bent D

* I wasn't aware of its existence until I was tasked with implementing
an XML signing library a few years back, and haven't really needed
it since. It was an invaluable tool for that signing library however.

Mayeul

unread,
Oct 2, 2009, 2:45:14 AM10/2/09
to
RedGrittyBrick wrote:

> Mayeul wrote:
>> (Admittedly, sometimes you need to comply with some atrocity designed
>> around some ridiculously complex XML format, but that is hardly a
>> valid criticism against XML the data format. For now at least.)
>
> There exist standards saying XML should be canonicalized before being
> signed. Therefore the consultants employed by the aforementioned very
> large organisations write specs that expect XML to be canonicalized
> before being signed. Even if they get byte for byte what I transmit and
> canonicalisation actually serves no purpose. In their shoes I'd probably
> do much the same.

I'd first check whether the XML toolchains I know in most languages
offer a simple way to format canonical XML. (Never needed that, so
dunno, by the way. I've just seen a few CANONICAL public constants here
and there.)
If it does exist everywhere reliably enough, then I can't see a problem
and I'd go for the same specs.
If canonicalization actually is an overcomplication, I'd encourage the
simplification point of XML rather than the theoretical beauty of
document equivalence.

Ah, if the toolchains or some libraries directly offer a way to sign XML
and automatically canonicalize it in the process, that would be even
simpler and better, of course.

> XML is surrounded by stuff that gets used even in circumstances where it
> serves no purpose. Academically this is not the fault of XML. For
> practical purposes it seems to me, in my line of work, to be inevitable
> when using XML. YMMV.

It seems it does vary, but admittedly I don't have all that much
experience yet.

Complications seem very avoidable to me, and that is kind of the point
of XML in my mind. This is especially true when I get to define
protocols or data format, or when I get to negotiate them because their
usage is not implemented yet.

Now admittedly, I have had to support some XML-based specifications
that were complicated and, in my opinion, rather pointless. But I never
had the impression that XML needed to be involved for that to happen.
The same kind of overkill seems to exist pretty much everywhere.
When they were XML-based, at least I always saved time thanks to XML's
strict rules. That's a lot more than I can say for when anything else
was chosen.

Bottom line: YMMV indeed.

--
Mayeul

Nigel Wade

Oct 2, 2009, 5:31:41 AM

I'm pretty sure that that would not be possible. You can't just make up an
OIS from an arbitrary sequence of bytes. An OIS has, as I understand it,
some preamble which is read from the underlying Stream when the OIS is
constructed/opened. Also, I'm pretty sure there is more to the OIS
contents than just the object members.

Surely all it requires is a class with a method which will populate the
members of that class when given the relevant byte[] (this could be the
constructor). Within that method it's just a matter of picking the right
elements from the byte[] and converting/storing them in the relevant
members of the object.
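
A minimal sketch of that approach, for the four fields shown in the
original struct. The class name TypeRecord is hypothetical, and the code
assumes that each field holds ASCII text and that length is a decimal
number padded with zeros, spaces or NULs; the thread does not say what the
fields actually contain, and the elided fields are not decoded.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical Java counterpart of the first four fields of type_t. */
final class TypeRecord {
    // Total size of the fields decoded here: 1 + 6 + 1 + 1 bytes.
    static final int SIZE = 9;

    final char type;
    final int length;
    final char acknowledge;
    final char duplicate;

    /** Populates the members from the first SIZE bytes of raw. */
    TypeRecord(byte[] raw) {
        if (raw.length < SIZE) {
            throw new IllegalArgumentException(
                    "need at least " + SIZE + " bytes, got " + raw.length);
        }
        // Assumption: every field is ASCII text, one byte per character.
        type = (char) (raw[0] & 0xFF);
        // Assumption: length is decimal text; trim() drops space/NUL padding.
        String digits = new String(raw, 1, 6, StandardCharsets.US_ASCII).trim();
        length = Integer.parseInt(digits);
        acknowledge = (char) (raw[7] & 0xFF);
        duplicate = (char) (raw[8] & 0xFF);
    }

    public static void main(String[] args) {
        // Made-up sample record, not data from the real C program.
        byte[] sample = "D000042YN".getBytes(StandardCharsets.US_ASCII);
        TypeRecord r = new TypeRecord(sample);
        System.out.println(r.type + " " + r.length + " "
                + r.acknowledge + " " + r.duplicate);
    }
}
```

Because the struct contains only char arrays, there is no padding or
byte order to worry about; that would change if it held int or other
multi-byte members.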

If all that's required is a quick and dirty solution to match the style
of the original C, then that ought to get the job done. But without any
information regarding what those char (which C often uses in place of
byte) fields actually represent I don't see that it's possible to provide
any better solution.

--
Nigel Wade

RedGrittyBrick

Oct 2, 2009, 6:29:45 AM

I had a look at the bytes produced by serializing an object and came to
the same conclusion: It isn't possible.
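
The preamble can be observed with a small experiment. The sketch below
(class and method names are made up for illustration) captures what an
ObjectOutputStream writes before any object is stored, then tries to open
an ObjectInputStream over bytes that merely look like the C record.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.StreamCorruptedException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class StreamHeaderDemo {
    /** Returns the bytes an ObjectOutputStream emits before any object. */
    static byte[] emptyStreamBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.flush(); // the constructor has already written the header
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Returns true if an ObjectInputStream can be opened over raw. */
    static boolean opensAsObjectStream(byte[] raw) {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return true;
        } catch (StreamCorruptedException e) {
            return false; // header check failed in the constructor
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : emptyStreamBytes()) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println("stream header: " + hex.toString().trim());

        // Made-up sample record standing in for data from the C program.
        byte[] record = "D000042YN".getBytes(StandardCharsets.US_ASCII);
        System.out.println("opens as object stream: "
                + opensAsObjectStream(record));
    }
}
```

The header is the stream magic number followed by the stream version, so
the raw struct bytes are rejected before any field could be read.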

> Surely all it requires is a class with a method which will populate the
> members of that class when given the relevant byte[] (this could be the
> constructor). Within that method it's just a matter of picking the right
> elements from the byte[] and converting/storing them in the relevant
> members of the object.

That was my first thought:
<http://groups.google.com/group/comp.lang.java.programmer/msg/0a66b2d5a2c37a27>


> If all that's required is a quick and dirty solution to match the style
> of the original C, then that ought to get the job done. But without any
> information regarding what those char (which C often uses in place of
> byte) fields actually represent I don't see that it's possible to provide
> any better solution.

There are some suggestions elsethread.

--
RGB

Thomas Pornin

Oct 2, 2009, 8:35:30 AM
According to Mayeul <mayeul....@free.fr>:

> Err, sure but, hence the question: when does it matter?
> I am not sure, when I want a checksum, that I want to ignore
> modifications in attributes ordering, newline conventions, in-tag
> whitespaces, or any kind of modification in the list of bytes (or
> characters).

XML is a syntax for representing some data organized as a tree. However,
it is also _text_, designed so that it could be handled as plain text,
sent by email, edited by hand and so on. The whole point is that XML is
robust with regard to whatever abuse may be thrown at it; a prime
example of such abuse being the automatic conversions of line
terminators (CR, LF, CR+LF...), but other whitespace characters also
suffer (e.g. copy-pasting a tab may result in a bunch of spaces).
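
To make the line-terminator case concrete, here is a small sketch with
made-up XML content and hypothetical class and method names. It hashes
the same document with LF and with CR+LF terminators, then again after
normalizing the terminators. The normalization step is only a crude
stand-in for real XML canonicalization, which covers much more than line
endings.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class LineEndingDigestDemo {
    /** SHA-256 over the UTF-8 bytes of text. */
    static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Crude stand-in for canonicalization: only line terminators. */
    static String normalizeLineEndings(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String unix = "<msg>\n  <type>D</type>\n</msg>\n";
        String dos = unix.replace("\n", "\r\n"); // what a mail gateway might do

        boolean rawMatch = Arrays.equals(sha256(unix), sha256(dos));
        boolean normalizedMatch = Arrays.equals(
                sha256(normalizeLineEndings(unix)),
                sha256(normalizeLineEndings(dos)));

        System.out.println("raw digests match: " + rawMatch);
        System.out.println("normalized digests match: " + normalizedMatch);
    }
}
```

A signature is computed over such a digest, so a signature made over the
raw bytes would no longer verify after the conversion, even though the
XML data is unchanged.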

XML makes little sense in situations where such abuse is known not to
happen. If you can faithfully transmit bytes, without seeing a single
bit changed, then binary encodings are much superior. ASN.1+BER (at
least a carefully chosen subset thereof) is more compact (especially for
big blobs of data -- e.g. embedded images -- where XML must resort to
Base64) and much easier to parse, and a small parser will be very
efficient at it.
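
The Base64 cost is easy to put a number on. The sketch below (the class
name is hypothetical, and the blob is just zero bytes standing in for an
image) measures how many bytes a blob occupies once encoded as Base64
text; it says nothing about the rest of the XML or BER framing.

```java
import java.util.Base64;

final class Base64OverheadDemo {
    /** Bytes needed to carry payloadSize raw bytes as Base64 text. */
    static int encodedSize(int payloadSize) {
        // The basic encoder adds no line breaks; content does not matter here.
        return Base64.getEncoder().encode(new byte[payloadSize]).length;
    }

    public static void main(String[] args) {
        int raw = 3000;
        System.out.println(raw + " raw bytes -> "
                + encodedSize(raw) + " Base64 bytes");
    }
}
```

Every 3 input bytes become 4 output characters, so the blob grows by
about a third before any markup is added around it.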

In brief, if you digitally sign XML data and find that you have no
need for canonicalization, then, most probably, you have no need for
the specificities of XML either, although you are paying for them.


--Thomas Pornin

Mayeul

Oct 2, 2009, 9:14:35 AM

Err, I do know about a lot of shortcomings in XML, and therefore can
admit the whole idea of XML bashing.

But I still can't believe you could state this with a straight face.

--
Mayeul

Mayeul

Oct 2, 2009, 9:50:21 AM

Well, thank you for sharing the experience.

I think I'm with RedGrittyBrick here. I have the impression most of the
time this canonicalization is needed, it is an artificial need and
signing the byte flow or character flow would have been enough and
immediate.

Besides, how hard can it be to design a library that makes XML
canonicalization immediate? If that was a common need, I'd think it
would just be there to use.

--
Mayeul

Bent C Dalager

Oct 2, 2009, 12:00:56 PM
On 2009-10-02, Mayeul <mayeul....@free.fr> wrote:
> Bent C Dalager wrote:
>> On 2009-10-01, Mayeul <mayeul....@free.fr> wrote:
>>
>> * I wasn't aware of its existence until I was tasked with implementing
>> an XML signing library a few years back, and haven't really needed
>> it since. It was an invaluable tool for that signing library however.
>
> Well, thank you for sharing the experience.
>
> I think I'm with RedGrittyBrick here. I have the impression most of the
> time this canonicalization is needed, it is an artificial need and
> signing the byte flow or character flow would have been enough and
> immediate.

In my case this failed spectacularly, prompting my need for (and
discovery of) canonicalization.

> Besides, how hard can it be to design a library that makes XML
> canonicalization immediate? If that was a common need, I'd think it
> would just be there to use.

I'm not sure what you mean. As far as I can recall all I ended up
having to do was call the canonicalization algorithm in the XML
library with an appropriate constant (there's a couple different algos
you can choose between) and that was basically it.

Cheers,
Bent D

Thomas Pornin

Oct 2, 2009, 12:04:12 PM
According to Mayeul <mayeul....@free.fr>:

> Err, I do know about a lot of shortcomings in XML, and therefore can
> admit the whole idea of XML bashing.

If you interpreted my message as "XML bashing" then you badly
misunderstood me. Maybe I did not write clearly enough, maybe you did
not read carefully enough. Probably both.


--Thomas Pornin

Jeff Higgins

Oct 15, 2009, 9:39:56 AM
Mark wrote:
> Hi,
> I am writing an app in java which reads data from a socket from a C
> language program. The data is encoded as a C struct of char arrays
> something like this;
>
> typedef struct {
> char type[1];
> char length[6];
> char acknowledge[1];
> char duplicate[1];
> ...
> } type_t;
>
> How can I decode a structure like this in Java without using JNI (a
> requirement)?

I see no one has mentioned javolution.io.Struct
<http://javolution.org/target/site/apidocs/javolution/io/Struct.html>

I don't know if it would help you.

JH
