offset-based hash table for ASCII data


Rex Mottram

Apr 15, 2008, 8:15:06 AM

I'm looking for an offset-based data structure to hold character data.
Background: I'm working with an app whose server is written in Java
while the client is in C. The server needs to package up a bunch of data
and ship it to the client where it will be traversed, read-only, many
many times. There's actually a working implementation which sends the
data in an XML format, but it's too slow and profiling reveals that it
spends almost all its time in XML parsing, so I need to replace the XML
with something more natural to the purpose.

The client starts thousands of child processes which each start by
parsing the XML document in order to place its data in hash tables,
after which it is accessed via the hash table only. So the XML is very
much an intermediate format, and the obvious win would be to have the
Java server package the data in a form which is directly usable by the
clients. Presumably the data would arrive from the server and be stored
in a file: each client would then map the file into its address space
and treat it as a hash table (or similar data structure) right away.

The traditional issue with transportable data structures is that since
the client can't reliably control what address the data is mapped to,
all addressing must be made relative to the starting point. Does anyone
know of an implementation of such a format which can be generated in
Java and consumed in C code?

TIA,
RM

shakah...@gmail.com

Apr 15, 2008, 8:23:54 AM

Can the client parse the XML and share it with the child processes
(thus avoiding having each child parse the XML)?

Also, if the XML parsing is only done once by each child process, why
is the parsing performance so important?

Rex Mottram

Apr 15, 2008, 9:59:26 AM

shakah...@gmail.com wrote:
> Can the client parse the XML and share it with the child processes
> (thus avoiding having each child parse the XML)?

Yes, that's also an option but it doesn't change the underlying
complication much, while leaving in an extra stage which would be better
avoided.

The fundamental need is for an offset-based data structure since the
child processes can't be guaranteed of being able to map the file at a
constant location. This would still be a problem between the client and
its children. So given that someone needs to place the data in an
offset-based table, it would be far preferable to do the ADT work in
Java and send it straight to the children. Yes, the server could send
XML and the client could rework it to a native format and pass it on to
the children but there's just more moving parts to break that way, not
to mention more code written in C to core dump. In general I, along with
I think most people in the situation, prefer to push as much coding as
possible over to the Java side where you have garbage collection,
exceptions, and a really nice debugger. Not to mention that the server
process typically runs on a "server" (bigger, faster) machine.

> Also, if the XML parsing is only done once by each child process, why
> is the parsing performance so important?

The short and somewhat obnoxious answer would be that "why" doesn't
matter - profiling has revealed that XML is the major cost and profiling
doesn't lie. The more substantive response is that many of these child
processes will end up doing either very little or nothing at all;
basically they start up, parse the XML data, and compare current reality
against what the server says it should be. In the majority of cases
current reality is fine, in which case the process can exit immediately.
Thus the constant factor of XML processing can become the dominant part
of child performance when multiplied by thousands of them.

RM

shakah...@gmail.com

Apr 15, 2008, 10:03:40 AM

On Apr 15, 9:59 am, Rex Mottram <r...@not.here> wrote:

Fair enough.

I guess I read your OP as describing a client launching many
long-lived (and busy) children, where the initial XML parse wouldn't
matter so much.

Ben Bacarisse

Apr 15, 2008, 11:07:56 AM

Rex Mottram <re...@not.here> writes:

> I'm working with an app whose server is written in
> Java while the client is in C.

<snip>


> The traditional issue with transportable data structures is that since
> the client can't reliably control what address the data is mapped to,
> all addressing must be made relative to the starting point. Does
> anyone know of an implementation of such a format which can be
> generated in Java and consumed in C code?

This is done a lot by Remote Procedure Call systems (though your
format may be more complex than they can handle), but I would look at
using some form of RPC, or at least the library that marshals the
data. Since Sun used RPC a lot for its networking, I'd imagine there is
good Java support/bindings (but that is just a wild guess).

If you fancy standards, there is always ASN.1. Old, and it used to be
thought of as rather "heavy", but that was before everyone started
using XML for everything.

--
Ben.

Til

Apr 15, 2008, 11:43:16 AM

Maybe a binary XML format would be the simplest way to improve the
performance. For example, there is wbxml4j for Java and libbxml for C. I
never used them, but it seems worth a try.
If your XML structure is not too complex you could also implement your
own simple binary format, so that the clients just read a kind of
nested structs from shared memory that use offsets to point to children
and siblings.

RedGrittyBrick

Apr 15, 2008, 11:53:20 AM

Rex Mottram wrote:
> I'm working with an app whose server is written in Java
> while the client is in C. The server needs to package up a bunch of data
> and ship it to the client where it will be traversed, read-only, many
> many times. There's actually a working implementation which sends the
> data in an XML format, but it's too slow and profiling reveals that it
> spends almost all its time in XML parsing, so I need to replace the XML
> with something more natural to the purpose.

Have you considered JSON?

--
RGB

Rainer Weikusat

Apr 15, 2008, 12:28:29 PM

Ben Bacarisse <ben.u...@bsb.me.uk> writes:
> Rex Mottram <re...@not.here> writes:
>> I'm working with an app whose server is written in
>> Java while the client is in C.
> <snip>
>> The traditional issue with transportable data structures is that since
>> the client can't reliably control what address the data is mapped to,
>> all addressing must be made relative to the starting point. Does
>> anyone know of an implementation of such a format which can be
>> generated in Java and consumed in C code?
>
> This is done a lot by Remote Procedure Call systems (though your
> format may be more complex than they can handle) but I would look
> using some form of RPC, or at least the library that marshals the
> data.

The easy way to solve such a problem is

a) get over the completely insane idea that designing data
exchange formats would be something network programmers should
not care about.

b) define a (presumably simple) binary message format
containing the information you want to communicate.

Regarding b, if you want to make your life easy, avoid numbers not
composed of an integral number of octets and try to preserve 'natural
alignment' of the members, ie let four-octet quantities (32-bit
integers) start on a four-octet boundary relative to the start of the
message.

For a structure which absolutely must contain a lot of pointers,
introducing a 'fixup' step could be sensible: Store the offsets of all
contained pointers in some array, traverse this array and add the
start address of the buffer the message is contained in to the value
stored at the location start + <current offset> in the buffer.



> If you fancy standards, there is always ASN.1. Old, and it used to
> thought of as rather "heavy" but that was before everyone started
> using XML for everything.

ASN.1 means 'abstract syntax notation one' and it is (mostly) a
standardized high-level language for declaring structured data types.
It is not a definition of an actual 'wire format' for data. That would
be the purpose of some set of encoding rules for ASN.1, e.g. BER (basic
encoding rules) or PER (packed encoding rules). BER encoding is
already fairly byzantine, and the fact that it transmits an encoded
length and an encoded type for each 'information quantity' is useless
for applications where producer and consumer 'know' how the messages
are supposed to be structured.

According to one of the people responsible for the PER rules, these
were initially conceived by a couple of frustrated people busy with
getting drunk, the frustration resulting from the fact that too
many members of some committee had provided negative, constructive
feedback regarding their previous attempt at defining an 'improved
BER'. PER was designed to address this problem, and the person I am
referring to (author of a freely downloadable book on ASN.1)
recommends that people don't even try to understand the
specification.

Mark Space

Apr 15, 2008, 2:03:22 PM

Rex Mottram wrote:
> I'm looking for an offset-based data structure to hold character data.

I'm not sure what you mean by "offset based data structure." I think I
missed that chapter when doing my school work, where the official offset
based data structure that all languages inter-operate with was described.

It sounds like you want just plain old binary data. This is a huge
mistake, imo, since your client and server combination is going to be a
lot less modifiable than just one system by itself. Better to send the
XML to the client, parse the whole thing once into some binary format,
then spawn the child processes. Getting two systems to work together in
a binary format is going to be very hard to maintain.

But anyway, look at DataOutputStream for Java. Send the info to a
network stream, or you can buffer it to memory by hooking the DataOutput
Stream to a ByteArrayOutputStream, and then send the buffer at your leisure.

Think very hard about how you will modify that binary data in the future
when you implement this.

If we knew a little more about what was being parsed here, we might be
able to help you further. My advice is "don't" but you seem wedded to
the idea, so hopefully we can save you from disaster. It's still really
unclear to me how sending data in binary is going to present some huge
savings on the client. Networks are slow, CPUs are fast. Servers also
tend to be heavily loaded. Parsing once in C on the client seems like
the obvious answer and much less headache prone.

Anyway, good luck.


> The traditional issue with transportable data structures is that since
> the client can't reliably control what address the data is mapped to,
> all addressing must be made relative to the starting point. Does anyone
> know of an implementation of such a format which can be generated in
> Java and consumed in C code?

Re-reading this, I'm still totally unclear on what you are actually
asking here. You need "addressing?" Are you saying you don't know what
indirection is in C? Ever hear of a look-up table? How about a hash table?

That may come off as condescending, but that's what your description
sounds like to me. Maybe a clearer explanation of how you think the
data will be parsed/accessed in binary will help. I think part of the
problem is you are a bit unclear yourself.

Rainer Weikusat

Apr 15, 2008, 2:30:56 PM

Mark Space <mark...@sbc.global.net> writes:
> Rex Mottram wrote:
>> I'm looking for an offset-based data structure to hold character
>> data.
>
> I'm not sure what you mean by "offset based data structure."

This fairly obviously means 'replacing pointers by offsets relative to
the start of the data structure' (because pointer values usually
cannot even be meaningfully communicated between different processes
running on the same machine, let alone processes running on different
machines).

[...]

> It sounds like you want just plain old binary data.

That was indeed what was intended: use a data format which can be
'readily prepared' on the server for use as-is, because the overhead
of parsing an XML-steganogram of a structured message into a
structured message had been _proven_ to be too expensive.

> This is a huge mistake, imo, since your client and server
> combination is going to be a lot less modifiable than just one
> system by itself.

At worst, it will need twice the effort for modifying one system.

> Better to send the XML to the client, parse the whole thing once
> into some binary format, then spawn the child processes. Getting
> two systems to work together in a binary format is going to be very
> hard to maintain.

Why would it? The code itself isn't structurally different: For the
most general case, you have a 'native' representation on system #1,
which is encoded into a transport representation by an encoding
routine, transmitted to system #2 and then decoded into a native
representation by a decoding routine.

[...]

> If we knew a little more about what was being parsed here, we might be
> able to help you further. My advice is "don't" but you seem wedded to
> the idea, so hopefully we can save you from disaster. It's still
> really unclear to me how sending data in binary is going to present
> some huge savings on the client.

Did it occur to you that your lack of understanding in this respect
means that you are a less-than-ideal person regarding having a
sensible opinion on the topic?

The original problem was that _measurements_ had proven that the need
to parse an XML representation was too expensive for this particular
application. Assuming this is true, not parsing the data on the
client, or at least decoding something significantly less noisy than
XML (e.g. a binary interchange format or a simpler text format), is an
obvious solution. It will reduce the load on the server, too
(constructing XML is not cheaper than deconstructing XML) and will
even lead to less bandwidth waste during transmission, assuming the
'other message format' is designed with a better SDU/PDU ratio than
XML can provide (which will be easy[*]).

[*] No, compression is not the solution: it's a resource-intensive
workaround for the original problem.

shakah...@gmail.com

Apr 15, 2008, 2:46:24 PM

On Apr 15, 2:30 pm, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
>
> The original problem was that _measurements_ had proven that the need
> to parse an XML-representation was too expensive for this particular
> application. ...

Slight quibble here in that while measurements may have
shown parsing to be too expensive for a parse-in-every-child
implementation, the same might not hold
true in an implementation where parsing is done once on the
client and then shared with the children (particularly
if those are child processes in the UNIX sense). The existence
of "a working implementation which sends the data
in an XML format" (implying existing C parsing code and internal
representation) would only make me look further in that direction.

Mark Space

Apr 15, 2008, 5:09:42 PM

Rainer Weikusat wrote:
> Mark Space <mark...@sbc.global.net> writes:
>> Rex Mottram wrote:
>>> I'm looking for an offset-based data structure to hold character
>>> data.
>> I'm not sure what you mean by "offset based data structure."
>
> This fairly obviously means 'replacing pointers by offsets relative to
> the start of the data structure' (because pointer values usually
> cannot even be meaningfully communicated between different processes
> running on the same machine, let alone processes running on different
> machines).

Is that all you need? The Java version should be a tad easier, actually.
Note the last loop uses nothing but offsets from the buffer to print
out the data.

If you were hoping for the magic library that does this for you, I
think it's called "hands on keyboard."


lut_test filename a 12 longer_string_test C

Total buffer size: 63
Number of entries: 5
String 0: filename
String 1: a
String 2: 12
String 3: longer_string_test
String 4: C

/*
 * File: lut_test.c
 *
 * Created on April 15, 2008, 12:43 PM
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct pre_parsed {
    int size;
    int length;
    int indexes[];
};

int main(int argc, char** argv) {

    struct pre_parsed *buffer;
    char **argv_copy = argv;

    if( argc < 2 )
    {
        fprintf( stderr, "Usage: lut_test outfile string_list\n");
        return EXIT_FAILURE;
    }
    size_t total_string_size = 0;

    while( *++argv )
    {
        size_t len = strlen( *argv );
        total_string_size += len + 1;
    }

    // Build the buffer: header, index table, then the strings themselves.

    buffer = malloc( sizeof (struct pre_parsed) + sizeof (int) *
            (argc-1) + total_string_size );

    (*buffer).length = argc - 1;
    (*buffer).size = sizeof (struct pre_parsed) + sizeof (int) *
            (argc-1) + total_string_size;
    int index = 0;
    size_t offset = sizeof (struct pre_parsed)
            + sizeof (int) * (argc-1);
    while( *++argv_copy )
    {
        (*buffer).indexes[index] = offset;
        // Byte-wise arithmetic: cast to char* first, or the offset
        // would be scaled by sizeof (struct pre_parsed).
        strcpy( (char*)buffer + offset, *argv_copy );
        offset += strlen( *argv_copy ) + 1;
        index++;
    }

    // Read back from the buffer using nothing but offsets.

    printf( "Total buffer size: %d\n", (*buffer).size );
    printf( "Number of entries: %d\n", (*buffer).length );
    int i;
    for( i = 0; i < (*buffer).length; i++ )
    {
        printf( "String %d: %s\n", i,
                (char*)buffer + (*buffer).indexes[i] );
    }

    free( buffer );
    return (EXIT_SUCCESS);
}

rossum

Apr 15, 2008, 5:11:42 PM

On Tue, 15 Apr 2008 08:15:06 -0400, Rex Mottram <re...@not.here> wrote:

>I'm looking for an offset-based data structure to hold character data.
>Background: I'm working with an app whose server is written in Java
>while the client is in C. The server needs to package up a bunch of data
>and ship it to the client where it will be traversed, read-only, many
>many times.

If it is read-only then why do you need to parse it many times? Parse
it once at the client end and let all the child processes access the
result.

>There's actually a working implementation which sends the
>data in an XML format, but it's too slow and profiling reveals that it
>spends almost all its time in XML parsing, so I need to replace the XML
>with something more natural to the purpose.

Is that actually CPU use in parsing or in network IO? If IO then
compression might be faster - probably a long shot.


>
>The client starts thousands of child processes which each start by
>parsing the XML document in order to place its data in hash tables,
>after which it is accessed via the hash table only. So the XML is very
>much an intermediate format, and the obvious win would be to have the
>Java server package the data in a form which is directly usable by the
>clients. Presumably the data would arrive from the server and be stored
>in a file: each client would then map the file into its address space
>and treat it as a hash table (or similar data structure) right away.

Why are you hashing to memory addresses (or pointers)? Why not hash
to array indexes which immediately makes the whole structure
relocatable in memory since everything is indexed from the start of
the array. You would need fixed-length data for this. Would
fixed-length data be a problem? A C union could handle otherwise
different types in a single array.

Would comma separated be simpler to parse than XML? It does not have
as much overhead and is generally less verbose.

>
>The traditional issue with transportable data structures is that since
>the client can't reliably control what address the data is mapped to,
>all addressing must be made relative to the starting point. Does anyone
>know of an implementation of such a format which can be generated in
>Java and consumed in C code?

Heap, sorted array + binary search, hash to an array address are all
possibilities.

I suspect that the biggest saving would be parse once rather than
parse many. If you only parse once then you can do a lot of
complicated setup at the client end and the work is amortised over the
many children.

rossum

>
>TIA,
>RM

Mark Space

Apr 15, 2008, 6:41:43 PM

rossum wrote:

> Would comma separated be simpler to parse than XML? It does not have
> as much overhead and is generally less verbose.

That's actually a pretty good idea. Much simpler and more portable.

Rex Mottram

Apr 15, 2008, 7:49:02 PM

Mark Space wrote:
> Rex Mottram wrote:
>> I'm looking for an offset-based data structure to hold character data.
> It sounds like you want just plain old binary data.

Possibly you missed the phrase "ASCII data" cleverly hidden in the
subject line, and likewise "character data" in the first sentence? :-)

Anyway, in response to you and a few other posters, yes, parsing the XML
once into some other format at the client end would indeed solve the
problem mentioned. But although you argue that would be more flexible, I
think it would result in a Rube Goldberg type of system with two
different representations needing to be kept in sync. Not very robust
programming practice IMHO.

Also, again in response to you and a few others, network bandwidth is
not an issue. I.e. moving the XML document, even if it's large (and it
often is) is not a significant cost, and for the record it's already
transported in compressed form. It's the multi-parsing that's a problem.

I could of course write up my own index-based ADT in C, but would much
prefer not to. There are a number of free ADT packages out there - I'm
currently using one called Kazlib which is very nice and I have notes on
6-10 others which also look good. But all these use pointers, with the
cited problem.

There's actually an open source C library which was written to be
offset-based. I remember seeing it a few years ago. I haven't gone
digging for it yet because I was hoping to find something I could build
up from the Java side. I'll go see if I can find that next.

RM

Mark Space

Apr 15, 2008, 9:38:54 PM

Rex Mottram wrote:

> I could of course write up my own index-based ADT in C, but would much
> prefer not to. There are a number of free ADT packages out there - I'm
> currently using one called Kazlib which is very nice and I have notes on
> 6-10 others which also look good. But all these use pointers, with the
> cited problem.


Ah ha. And specifically you want to serialize and deserialize your
ADTs. Between C and Java. Nothing like picking the implementation
first then going looking for solutions. ;)

What you're asking for is called pointer swizzling, btw, and the general
case seems to be very hard, which may be why you aren't finding any libraries.

And no I don't know any Java packages for that. Java programmers seem
to prefer general solutions that always work (XML) to light weight
solutions that may impose some restrictions. Sorry about that.

If you find anything interesting, let us know.

You might try some custom de-/serialization on the Java side, so you can
control the format, and produce something easier to parse in C. Just a
thought.

Logan Shaw

Apr 15, 2008, 9:45:35 PM

Rex Mottram wrote:
> I'm looking for an offset-based data structure to hold character data.
> Background: I'm working with an app whose server is written in Java
> while the client is in C. The server needs to package up a bunch of data
> and ship it to the client where it will be traversed, read-only, many
> many times. There's actually a working implementation which sends the
> data in an XML format, but it's too slow and profiling reveals that it
> spends almost all its time in XML parsing, so I need to replace the XML
> with something more natural to the purpose.

Let me see if I can understand the problem statement. It sounds like
you do not need a hash table specifically, but what you actually need
is a way to map string values to other string values. As far as I know,
you do not say you need to be able to iterate over the keys, although
it's a common requirement, so you might need it.

If it is true that that is all you actually need, removing the artificial
constraint of a hash might open up some implementation possibilities.

For example, you could send a sorted list of pairs of C strings concatenated
together, by which I mean 8-bit bytes terminated by null bytes. The client
side would be very easy to implement:

* load the entire image (i.e. message) into a contiguous area of memory
* make a pass through it to count pairs of strings (which is equivalent
to counting the zero bytes in the block of memory, then dividing by two)
* allocate an array of pairs of pointers (of type "const char *")
* make another pass and write the memory address of every string into
the array of pointers
* do key lookup via binary search

Note that this approach does not require encoding any integers or offsets
at all into the message. It is O(N) to parse it, but that is no worse
asymptotically than the time required to receive it over the network.

Another alternative would be to build a trie on the server side, where
states in the trie's state machine (I think of a trie as a finite state
machine that recognizes strings) are represented by byte offsets from the
beginning of the image. The client could then do lookups on the structure
directly (except that to move to the next state, it would have to add an
offset, but that is no big deal).

- Logan

Mark Space

Apr 15, 2008, 9:50:18 PM

Mark Space wrote:

> You might try some custom de-/serialization on the Java side, so you can
> control the format, and produce something easier to parse in C. Just a
> thought.

Oh, and the hash table itself in a Java hash map isn't serializable,
iirc. So if you're expecting to send the hash table over directly,
rather than just the objects it contains, you'll have to roll your own
implementation or do something similar.

Lew

Apr 16, 2008, 12:30:09 AM

Mark Space wrote:
> Mark Space wrote:
>
>> You might try some custom de-/serialization on the Java side, so you
>> can control the format, and produce something easier to parse in C.
>> Just a thought.
>
> Oh, and the hash table itself in a Java hash map isn't serializable, iirc.

You don't.

public class HashMap<K,V>
extends AbstractMap<K,V>
implements Map<K,V>, Cloneable, Serializable

> So if you're expecting to send the hash table over directly,
> rather than just the objects it contains, you'll have to roll your own
> implementation or do something similar.

Or just use the HashMap built-in implementation of Serializable.

Every listed subclass of AbstractMap except WeakHashMap is Serializable.

--
Lew

Rainer Weikusat

Apr 16, 2008, 7:05:26 AM

Rex Mottram <re...@not.here> writes:
> Mark Space wrote:
>> Rex Mottram wrote:
>>> I'm looking for an offset-based data structure to hold character
>>> data.
>> It sounds like you want just plain old binary data.
>
> Possibly you missed the phrase "ASCII data" cleverly hidden in the
> subject line, and likewise "character data" in the first sentence?
> :-)

Possibly, you missed the 'hash table' cleverly hidden in the subject
line. This means 'plain old binary data', ie a table of pointers.

> Anyway, in response to you and a few other posters, yes, parsing the
> XML once into some other format at the client end would indeed solve
> the problem mentioned. But although you argue that would be more
> flexible, I think it would result in a Rube Goldberg type of system
> with two different representations needing to be kept in sync.

If it would, then this would be the type of system you already have,
because you already have one representation on the server, a transport
representation and a representation on the client. That you
(presently) construct identical client representations x times in
parallel and manage to overwhelm the 'client machine' with this
nonsense doesn't change the structure of the application, just its
"simple-mindedness". Parsing the XML-steganogram once and re-using it
x - 1 times would be by far the simplest solution from a coding
standpoint (as someone else already wrote).

> Also, again in response to you and a few others, network bandwidth is
> not an issue. I.e. moving the XML document, even if it's large (and it
> often is) is not a significant cost,

... because, as you need to understand, this is MY PRIVATE GIG-E LAN
which has been solely provided to transport MY XML (and it has excess
bandwidth for that) ...

It is not possible to assess the significance of the 'cost' of a
particular communication isolated. Networks are usually used by
multiple (and often, many) computers for multiple (often many)
applications in parallel.

> and for the record it's already transported in compressed form.

If network usage is not of concern, why do you bother to compress the
XML? Not doing so would be another easy way to make the processing on
both client and server much simpler.

> It's the multi-parsing that's a problem.

Well, then don't.

Rex Mottram

Apr 16, 2008, 10:10:16 AM

Rainer Weikusat wrote:
> If it would, then this would be the type of system you already have,
> because you already have one representation on the server, a transport
> representation and a representation on the client. That you
> (presently) construct identical client representations x times in
> parallel and manage to overwhelm the 'client machine' with this
> nonsense doesn't change the structure of the application, just its
> "simple-mindedness".

You may have me (the OP) mixed up with someone else. I write not to
defend the current implementation but to improve it.

> Parsing the XML-steganogram once and re-using it
> x - 1 times would be by far the simplest solution from a coding
> standpoint (as someone else already wrote).

Yes, with the single exception of the alternative, which is to deliver
it in pre-parsed form and thus remove one more intermediate representation.

N < N + 1. QED.

>> Also, again in response to you and a few others, network bandwidth is
>> not an issue. I.e. moving the XML document, even if it's large (and it
>> often is) is not a significant cost,
>
> ... because, as you need to understand, this is MY PRIVATE GIG-E LAN
> which has been solely provided to transport MY XML (and it has excess
> bandwidth for that) ...

You are swinging (and missing) especially wildly today. The
insignificance of transport is primarily due to the fact that it's a
fairly rare event which occurs just once per client/server transaction,
while the client fires children, as noted, potentially thousands of
times per transaction.

The largest XML file I've measured to date is around 5 megabytes.
Because XML is so wordy it compresses quite well, and thus the gzipped
version which travels over the wire is around 150K. This may happen
about once every half hour. I don't know what your network looks like
but this is well within the noise factor for mine.

More than that, the raison d'etre of this application is to keep the
clients from having to run unnecessary jobs. I.e. it's an optimizer. So
whatever network traffic it generates will be paid back by orders of
magnitude in other traffic which doesn't happen later.

> If network usage is not of concern, why do you bother to compress the
> XML? Not doing so would be another easy way to make the processing on
> both client and server much simpler.

Arg. I do it because it's trivially easy and the right thing to do. On
the server side it's as easy as constructing a "new GZIPOutputStream"
object. The client side uses libcurl which has built-in knowledge of
compression so it's as easy as adding an Accept-Encoding header.

It really couldn't be simpler than that. The XML parser doesn't have to
worry about compression because it's uncompressed by then. See, it's
compressed, sent over the wire, uncompressed, then parsed. All with two
extra lines of code, one in each language.

>> It's the multi-parsing that's a problem.
>
> Well, then don't.

Arg. If you didn't spend so much energy looking for someone to bully,
you might have more fun. I know we would.

RM

Rainer Weikusat

unread,
Apr 16, 2008, 10:41:23 AM4/16/08
to
Rex Mottram <re...@not.here> writes:
> Rainer Weikusat wrote:
>> If it would, then this would be the type of system you already have,
>> because you already have one representation on the server, a transport
>> representation and a representation on the client. That you
>> (presently) construct identical client representations x times in
>> parallell and manage to overwhelm the 'client machine' with this
>> nonsense doesn't change the structure of the application, just its
>> "simple-mindedness".
>
> You may have me (the OP) mixed up with someone else.

No, I was being sarcastic.

>> Parsing the XML-steganogram once and re-using it
>> x - 1 times would be by far the simplest solution from a coding
>> standpoint (as someone else already wrote).
>
> Yes, with the single exception of the alternative, which is to deliver
> it in pre-parsed form and thus remove one more intermediate
> representation.

Assuming a UNIX(*)-like 'application structure', the necessary changes
would roughly amount to forking after the XML has been parsed instead of
before it has been parsed.

>>> Also, again in response to you and a few others, network bandwidth is
>>> not an issue. I.e. moving the XML document, even if it's large (and it
>>> often is) is not a significant cost,
>> ... because, as you need to understand, this is MY PRIVATE GIG-E LAN
>> which has been solely provided to transport MY XML (and it has excess
>> bandwidth for that) ...
>
> You are swinging (and missing) especially wildly today.
> The insignificance of transport is primarily due to the fact that it's a
> fairly rare event which occurs just once per client/server
> transaction,

Less creative editing could aid your understanding greatly:

,----
| It is not possible to assess the significance of the 'cost' of a
| particular communication in isolation. Networks are usually used by
| multiple (and often, many) computers for multiple (often many)
| applications in parallel.
`----

See. This time, I even explained the sarcasm. Note that this was a
general remark regarding the (common) misconception of the 'isolated,
insignificant event', which really isn't an 'isolated event' but a
contributing factor.

[...]

>> If network usage is not of concern, why do you bother to compress the
>> XML? Not doing so would be another easy way to make the processing on
>> both client and server much simpler.
>
> Arg. I do it because it's trivially easy and the right thing to do.

Something which is 'trivially easy' for you to code is not necessarily
'trivially easy' for a computer to execute. In this particular
situation, it is certain that the process is computationally expensive.
You (again) stated that 'network usage' is of no concern to you, while
you (again) stated that 'CPU usage on the client' is of concern to
you. Yet you try to save 'network bandwidth' at the expense of
'processing time' by using compression.

This doesn't exactly make sense together.

[...]

>>> It's the multi-parsing that's a problem.
>> Well, then don't.
>
> Arg. If you didn't spend so much energy looking for someone to bully,
> you might have more fun. I know we would.

That you know that some or all of you would have more fun if some or
all of you spent less energy looking for someone to bully means
nothing about me.

Rex Mottram

unread,
Apr 16, 2008, 11:40:00 AM4/16/08
to
Rainer Weikusat wrote:
> Assuming a UNIX(*)-like 'application structure', the necessary changes
> would roughly amount to forking after the XML has been parsed instead of
> before it has parsed.

No, the "necessary changes" would include being able to remove an entire
XML-processing state machine, many associated static variables, remove
dependence on an external XML parsing library, not to mention (again)
the extra representation. These are very significant differences. I
truly think you've switched your POV here, either without realizing it
or out of pure orneriness.

> Less creative editing could aid your understanding greatly:
>
> ,----
> | It is not possible to assess the significance of the 'cost' of a
> | particular communication in isolation. Networks are usually used by
> | multiple (and often, many) computers for multiple (often many)
> | applications in parallel.
> `----
>
> See. This time, I even explained the sarcasm. Note that this was a
> general remark regarding the (common) misconception of the 'isolated,
> insignificant event', which really isn't an 'isolated event' but a
> contributing factor.

Speaking of creative editing, I gather you chose to remove the parts of
my response which addressed this. Unlike you I know things about how my
network is used and have done measurements. Sending the XML compressed
is, empirically, an insignificant cost. Sending it uncompressed is
probably insignificant too. The only reason I said anything about the
transport cost and compressed vs not is that some respondents mistakenly
thought that was the problem and I wanted to set them straight.

There is no possible way that the cost of transporting this data packet
is a significant fraction of the overall network burden, or network
profile, of this application or the environment it operates in. You may
repeat mantras like the above to your heart's content, but though they
are certainly true in isolation they are not applicable here. This whole
network bandwidth discussion is a red herring.

>>> If network usage is not of concern, why do you bother to compress the
>>> XML? Not doing so would be another easy way to make the processing on
>>> both client and server much simpler.
>> Arg. I do it because it's trivially easy and the right thing to do.
>
> Something which is 'trivially easy' for you to code is not necessarily
> 'trivially easy' for a computer to execute. In this particular
> situation, it is certain that the process is computationally expensive.
> You (again) stated that 'network usage' is of no concern to you, while
> you (again) stated that 'CPU usage on the client' is of concern to
> you. Yet you try to save 'network bandwidth' at the expense of
> 'processing time' by using compression.
>
> This doesn't exactly make sense together.

For you to accuse me of creative editing and in the *same post* put a
phrase in quotes, attributed directly to me, which I did not say is
truly breathtaking. Search for the phrase 'CPU usage on the client' -
its first appearance is when you "quoted" it.

The point here is that all the one-time costs are swamped by the
aggregate costs of parsing. This applies equally to the compression,
transmission, and decompression of the XML, and many other constant
costs. They are just not measurable. I don't know why you insist on
talking about them.

RM

Rex Mottram

unread,
Apr 16, 2008, 12:01:24 PM4/16/08
to
Mark Space wrote:
> If you find anything interesting, let us know.

I apologize for letting other parts of this thread devolve into a
useless squabble. But for those of you with a substantive interest, let
me report back that I'm getting moderately interested in/excited about
the CDB ("constant database") family.

Having realized that what I'm looking at can be thought of as a constant
(meaning write once, read many) database, I did some searching and
found CDB. The original CDB package is C code which is unchanged since
2000 but still works fine. There's also a "TinyCDB" fork of the project
which is newer and ported to Windows and still format-compatible with
the original.

Most interestingly, there are APIs for a number of languages including a
jar file for Java (this version is called sg-cdb). So it looks like I
can use the sg-cdb API to write a cdb-format file directly from Java and
send it to the client(s), which can link with libcdb.a and navigate it
directly.

Whether I can encode my data into the key-value style of CDB (which is a
dbm-like model) is an open question but that's my problem. For the sake
of readers here and subsequent archive searches I thought I'd mention
that the combination of sg-cdb on the server side and tinycdb on the
client side may make a lot of sense for certain applications.

Licensing is also simple: TinyCDB is in the public domain and sg-cdb is
BSD licensed.

RM

rossum

unread,
Apr 16, 2008, 12:01:18 PM4/16/08
to
On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram <re...@not.here> wrote:

>The point here is that all the one-time costs are swamped by the
>aggregate costs of parsing.

Then change the parsing from an aggregate cost to a one-time cost.
Parse once at the receiving end and let all the child processes have
access to the read-only parsed version. It would not be too complex
to set up version numbering or reference counting if required.

I am not clear as to why you cannot do this, it seems such an obvious
solution.

rossum

Rex Mottram

unread,
Apr 16, 2008, 12:18:33 PM4/16/08
to
rossum wrote:
> On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram <re...@not.here> wrote:
>
>> The point here is that all the one-time costs are swamped by the
>> aggregate costs of parsing.
> Then change the parsing from an aggregate cost to a one-time cost.
> Parse once at the receiving end and let all the child processes have
> access to the read-only parsed version. It would not be too complex
> to set up version numbering or reference counting if required.
>
> I am not clear as to why you cannot do this, it seems such an obvious
> solution.

I thought I had explained this in excruciating detail but will try once
more. There are two problems with this approach, not necessarily
unsolvable but still valid problems:

1. It means double complexity, as the data has to be extracted from
POJOs on the server side and converted to XML ("format A"), then parsed
and placed into an alternate representation ("format B") on the client
side. The preferable plan would be to have the server place the data
directly into format B. I am not clear on why this is complex or
controversial; it seems beyond argument to me. Particularly since I've
already found one API for doing so (CDB) as mentioned elsewhere in the
thread.

2. In what format would that "read-only parsed version" be? It cannot be
a traditional hash table or other ADT since they can't be shared between
processes, as explained previously.

The whole point is that I'm looking for a format in which the data can
be shared. It looks like the CDB format might be a candidate, which
allows me to turn the question around: whyever would I want to write the
data to XML on the server and convert it to CDB on the client when
there's an API for writing the CDB format directly on the server? I can
see no argument at all for using two steps when one would suffice.

RM

Rainer Weikusat

unread,
Apr 16, 2008, 12:33:52 PM4/16/08
to
rossum <ross...@coldmail.com> writes:
> On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram <re...@not.here> wrote:
>
>>The point here is that all the one-time costs are swamped by the
>>aggregate costs of parsing.
> Then change the parsing from an aggregate cost to a one-time cost.
> Parse once at the receiving end and let all the child processes have
> access to the read-only parsed version. It would not be too complex
> to set up version numbering or reference counting if required.
>
> I am not clear as to why you cannot do this, it seems such an obvious
> solution.

The short answer is that the guy is a complete moron.

Mark Space

unread,
Apr 16, 2008, 1:07:30 PM4/16/08
to
Rex Mottram wrote:
> Mark Space wrote:
>> If you find anything interesting, let us know.

> me report back that I'm getting moderately interested in/excited about

> the CDB ("constant database") family.

> Most interestingly, there are APIs for a number of languages including a

> jar file for Java (this version is called sg-cdb). So it looks like I

Thanks for the report. I learned some things from this thread, which is
always valuable to me, and this project you found will be a cool tool I
can keep in my back pocket in case I ever need such a thing.

I hope this project goes well for you.

moi

unread,
Apr 16, 2008, 2:23:30 PM4/16/08
to
On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:


> 2. In what format would that "read-only parsed version" be? It cannot be
> a traditional hash table or other ADT since they can't be shared between
> processes, as explained previously.

No need to parse it.
Just a native OS flatfile. Since your data + keys are ASCII,
this flatfile is an ASCII file.

> The whole point is that I'm looking for a format in which the data can
> be shared. It looks like the CDB format might be a candidate, which
> allows me to turn the question around: whyever would I want to write the
> data to XML on the server and convert it to CDB on the client when
> there's an API for writing the CDB format directly on the server? I can
> see no argument at all for using two steps when one would suffice.

You can store an array of {offset, hash, (key)len, next} in
another flatfile.
This second flatfile needs to be binary, but can be created on the target
platform. The next field in the struct is not a pointer but an index into
the array itself (for the overflow chain). (NB: use -1 or >= N as the
sentinel value; zero won't work ;-)

Using mmap on the target platform will make the lookup extremely fast
(typically 2 page faults). Allocating two big arrays and reading them in
at program startup will be faster (or equally fast), but the startup will
be slower.

Combining the two files is a possible extension, too.

HTH,
AvK

jellybean stonerfish

unread,
Apr 17, 2008, 12:45:31 AM4/17/08
to
On Wed, 16 Apr 2008 16:41:23 +0200, Rainer Weikusat wrote:

> Rex Mottram <re...@not.here> writes:
>> Arg. I do it because it's trivially easy and the right thing to do.
>
> Something which is 'trivially easy' for you to code is not necessarily
> 'trivially easy' for a computer to execute. In this particular
> situation, it is certain that the process is computationally expensive.
> You (again) stated that 'network usage' is of no concern to you, while
> you (again) stated that 'CPU usage on the client' is of concern to you.
> Yet you try to save 'network bandwidth' at the expense of 'processing
> time' by using compression.
>
> This doesn't exactly make sense together.
>

Maybe it is faster for the client to decompress the little file than it
would be to download the uncompressed version?

Rainer Weikusat

unread,
Apr 17, 2008, 2:16:43 AM4/17/08
to

It would be much faster to just not download any file ;-), i.e. the code
on the client (presumably) doesn't do something because it is 'faster'
than doing something else, but to accomplish some useful purpose. The
same way 'the network' is shared among all applications trying to use
it, the host CPU is shared among all applications running on a
particular host. Download of a large file could go on as a background
task, using very little host resources, more of which would then
be available to 'other active tasks', while downloading a smaller file
faster would have a more 'bursty' host resource usage pattern, and
decompressing a small file to a large file is a resource-intensive
operation, although the absolute time needed by it may be short enough
that measuring its influence could be difficult.

This is all, of course, mostly 'thin air' without much more
information about the actual application. But a statement like 'I do
not care about transmitting much more data than I actually need,
except that I compress these data to make them smaller, but anyway, my
actual problem is that preprocessing of the downloaded data is too
expensive for a particular situation' doesn't make much sense
altogether. To me, it sounds a lot like 'I am trading off processor
time for transmission time and now my programs run too slow'. Duh.

If doing it all 'in the right way' doesn't work, then it isn't the
right way.


Rainer Weikusat

unread,
Apr 17, 2008, 5:56:03 AM4/17/08
to
moi <ro...@invalid.address.org> writes:
> On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:
>> 2. In what format would that "read-only parsed version" be? It cannot be
>> a traditional hash table or other ADT since they can't be shared between
>> processes, as explained previously.
>
> No need to parse it. Just a native OS-flatfile. Since your data +
> keys is ASCII, this flatfile is an ASCII file.

[...]

> You can store an array of {offset, hash,(key)len,next} in
> another flatfile. This second flatfile needs to be binary, but can
> be created on the target platform.

Which requires the data to be parsed on the target platform to create
the index. It just (presumably) requires less complicated parsing
than parsing the XML file. The most sensible way to do this would
still be to preprocess the transport format once and share the result
among all client instances. The OP has repeatedly gone ballistic at
the mere suggestion that this could be a sensible strategy.

Not to mention that a non-XML transport format would still need to be
designed. But this would (likely) lead to reducing the amount of data
transmitted, and so, again, cause Mr Rex's mind to blow up in anger.


moi

unread,
Apr 17, 2008, 8:05:20 PM4/17/08
to
On Thu, 17 Apr 2008 11:56:03 +0200, Rainer Weikusat wrote:

> moi <ro...@invalid.address.org> writes:
>> On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:
>>> 2. In what format would that "read-only parsed version" be? It cannot
>>> be a traditional hash table or other ADT since they can't be shared
>>> between processes, as explained previously.
>>
>> No need to parse it. Just a native OS-flatfile. Since your data + keys
>> is ASCII, this flatfile is an ASCII file.
>
> [...]
>
>> You can store an array of {offset, hash,(key)len,next} in another
>> flatfile. This second flatfile needs to be binary, but can be created
>> on the target platform.
>
> Which requires the data to be parsed on the target platform to create
> the index. It just (presumably) requires less complicated parsing than
> parsing the XML file. The most sensible way to do this would still be to
> preprocess the transport format once and share the result among all
> client instances. The OP has repeatedly gone ballistic at the mere
> suggestion that this could be a sensible strategy.

I am afraid you are right. As an experiment, I used /usr/dict (~5MB, >500K
words) and created the hashtable in approx 2 seconds walltime. Once the
hashtable is present on disk, searching one key takes 2 ms or so (most of
it program startup, of course). All using mmap.
Creating a binary file for another architecture is not impossible to do
(even dd has options to swap words). Most hardware is little-endian these
days, and most people don't get beyond x86 anyway.

Agreed, my parser is simple (just one sweep, looking for space and CR; an
XML parse may need more than one pass, and a lot more state). But programs
like these are governed by I/O cost, and one can easily perform 10 scans
over each buffer being read in (once it *is* read in ;-)

>
> Not to mention that a non-XML transport format would still need to be
> designed. But this would (likely) lead to reducing the amount of data
> transmitted, so, again cause Mr Rex' mind to blow up in anger.

XML is the problem. Sticking to it is more than a problem. It is a
disease.

AvK

Charles Coldwell

unread,
Apr 20, 2008, 8:17:49 AM4/20/08
to
Rex Mottram <re...@not.here> writes:

> The traditional issue with transportable data structures is that since
> the client can't reliably control what address the data is mapped to,
> all addressing must be made relative to the starting point. Does
> anyone know of an implementation of such a format which can be
> generated in Java and consumed in C code?

IIUC, what you need is a hash table. Traditionally, a hash table is
implemented as an array of pointers, where the pointers point to linked
lists of objects that hash to the same value, and that value is the
index into the array.

So the fundamental problem is that pointers on the server contain
addresses that are meaningless on the client.

I've often seen linked lists implemented using array indices instead of
pointers. For example:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

struct list {
    int next;
    const char *data;
};

static struct list pool[1024];
static int freeidx;

static int pool_alloc(void)
{
    int tmp = freeidx;

    if (freeidx != -1)
        freeidx = pool[freeidx].next;

    return tmp;
}

static void pool_free(int idx)
{
    pool[idx].next = freeidx;
    freeidx = idx;
}

static int hashtable[256];

const char *strings[] = {
    "hello",
    "world",
    "this",
    "is",
    "a",
    "hashtable",
    "example"
};

static unsigned char hashfunc(const char *str)
{
    unsigned int hash;

    for (hash = 0; *str; ++str)
        hash += *str;

    return hash & 0xff;
}

static void hash_insert(const char *str)
{
    int hidx = hashfunc(str);
    int pidx = pool_alloc();

    if (pidx == -1) {
        fputs("out of memory\n", stderr);
        exit(1);
    }

    pool[pidx].data = str;
    pool[pidx].next = hashtable[hidx];
    hashtable[hidx] = pidx;
}

static int hash_find(const char *str)
{
    int hidx = hashfunc(str);
    int iter;

    for (iter = hashtable[hidx]; iter != -1; iter = pool[iter].next)
        if (strcmp(str, pool[iter].data) == 0)
            break;

    return iter != -1;
}

int main(int argc, char *argv[])
{
    int i;

    /* initialize the pool */
    for (i = 0; i < (int)(sizeof(pool)/sizeof(pool[0])); i++)
        pool[i].next = i + 1;
    pool[i-1].next = -1;

    /* initialize the hash table */
    memset(hashtable, -1, sizeof(hashtable));

    for (i = 0; i < (int)(sizeof(strings)/sizeof(strings[0])); i++)
        hash_insert(strings[i]);

    for (i = 1; i < argc; i++)
        printf("%s%s found in table\n", argv[i],
               (hash_find(argv[i]) ? "" : " not"));

    return 0;
}


--
Charles M. "Chip" Coldwell
"Turn on, log in, tune out"
GPG Key ID: 852E052F
GPG Key Fingerprint: 77E5 2B51 4907 F08A 7E92 DE80 AFA9 9A8F 852E 052F

Robert Klemme

unread,
Apr 20, 2008, 8:27:58 AM4/20/08
to

I tried to find a post of yours in the thread where you explain what
kind of data structure you need but could not find one. The term
"offset based hash table for ASCII data" is far too generic for me to
get a concrete idea. Can you elaborate a bit or show a sample XML if it
is not too large?

Kind regards

robert

Rex Mottram

unread,
Apr 20, 2008, 1:19:26 PM4/20/08