
offset-based hash table for ASCII data


Rex Mottram

Apr 15, 2008, 8:15:06 AM
I'm looking for an offset-based data structure to hold character data.
Background: I'm working with an app whose server is written in Java
while the client is in C. The server needs to package up a bunch of data
and ship it to the client where it will be traversed, read-only, many
many times. There's actually a working implementation which sends the
data in an XML format, but it's too slow and profiling reveals that it
spends almost all its time in XML parsing, so I need to replace the XML
with something more natural to the purpose.

The client starts thousands of child processes which each start by
parsing the XML document in order to place its data in hash tables,
after which it is accessed via the hash table only. So the XML is very
much an intermediate format, and the obvious win would be to have the
Java server package the data in a form which is directly usable by the
clients. Presumably the data would arrive from the server and be stored
in a file: each client would then map the file into its address space
and treat it as a hash table (or similar data structure) right away.

The traditional issue with transportable data structures is that since
the client can't reliably control what address the data is mapped to,
all addressing must be made relative to the starting point. Does anyone
know of an implementation of such a format which can be generated in
Java and consumed in C code?

TIA,
RM

shakah...@gmail.com

Apr 15, 2008, 8:23:54 AM

Can the client parse the XML and share it with the child processes
(thus avoiding having each child parse the XML)?

Also, if the XML parsing is only done once by each child process, why
is the parsing performance so important?

Rex Mottram

Apr 15, 2008, 9:59:26 AM
shakah...@gmail.com wrote:
> Can the client parse the XML and share it with the child processes
> (thus avoiding having each child parse the XML)?

Yes, that's also an option but it doesn't change the underlying
complication much, while leaving in an extra stage which would be better
avoided.

The fundamental need is for an offset-based data structure since the
child processes can't be guaranteed to be able to map the file at a
constant location. This would still be a problem between the client and
its children. So given that someone needs to place the data in an
offset-based table, it would be far preferable to do the ADT work in
Java and send it straight to the children. Yes, the server could send
XML and the client could rework it to a native format and pass it on to
the children but there are just more moving parts to break that way, not
to mention more code written in C to core dump. In general I, along with
I think most people in the situation, prefer to push as much coding as
possible over to the Java side where you have garbage collection,
exceptions, and a really nice debugger. Not to mention that the server
process typically runs on a "server" (bigger, faster) machine.

> Also, if the XML parsing is only done once by each child process, why
> is the parsing performance so important?

The short and somewhat obnoxious answer would be that "why" doesn't
matter - profiling has revealed that XML is the major cost and profiling
doesn't lie. The more substantive response is that many of these child
processes will end up doing either very little or nothing at all;
basically they start up, parse the XML data, and compare current reality
against what the server says it should be. In the majority of cases
current reality is fine, in which case the process can exit immediately.
Thus the constant factor of XML processing can become the dominant part
of child performance when multiplied by thousands of them.

RM

shakah...@gmail.com

Apr 15, 2008, 10:03:40 AM
On Apr 15, 9:59 am, Rex Mottram <r...@not.here> wrote:

Fair enough.

I guess I read your OP as describing a client launching many
long-lived (and busy) children, where the initial XML parse wouldn't
matter so much.

Ben Bacarisse

Apr 15, 2008, 11:07:56 AM
Rex Mottram <re...@not.here> writes:

> I'm working with an app whose server is written in
> Java while the client is in C.

<snip>


> The traditional issue with transportable data structures is that since
> the client can't reliably control what address the data is mapped to,
> all addressing must be made relative to the starting point. Does
> anyone know of an implementation of such a format which can be
> generated in Java and consumed in C code?

This is done a lot by Remote Procedure Call systems (though your
format may be more complex than they can handle), so I would look at
using some form of RPC, or at least the library that marshals the
data. Since Sun used RPC a lot for its networking, I'd imagine there
is good Java support/bindings (but that is just a wild guess).

If you fancy standards, there is always ASN.1. Old, and it used to be
thought of as rather "heavy", but that was before everyone started
using XML for everything.

--
Ben.

Til

Apr 15, 2008, 11:43:16 AM
Maybe a binary XML format would be the simplest way to improve the
performance. For example there is wbxml4j for Java and libbxml for C. I
never used them, but it seems worth a try.
If your XML structure is not too complex you could also implement your
own simple binary format, so that the clients just read a kind of
nested structs from shared memory that use offsets to point to children
and siblings.

RedGrittyBrick

Apr 15, 2008, 11:53:20 AM
Rex Mottram wrote:
> I'm working with an app whose server is written in Java
> while the client is in C. The server needs to package up a bunch of data
> and ship it to the client where it will be traversed, read-only, many
> many times. There's actually a working implementation which sends the
> data in an XML format, but it's too slow and profiling reveals that it
> spends almost all its time in XML parsing, so I need to replace the XML
> with something more natural to the purpose.

Have you considered JSON?

--
RGB

Rainer Weikusat

Apr 15, 2008, 12:28:29 PM
Ben Bacarisse <ben.u...@bsb.me.uk> writes:
> Rex Mottram <re...@not.here> writes:
>> I'm working with an app whose server is written in
>> Java while the client is in C.
> <snip>
>> The traditional issue with transportable data structures is that since
>> the client can't reliably control what address the data is mapped to,
>> all addressing must be made relative to the starting point. Does
>> anyone know of an implementation of such a format which can be
>> generated in Java and consumed in C code?
>
> This is done a lot by Remote Procedure Call systems (though your
> format may be more complex than they can handle) but I would look
> using some form of RPC, or at least the library that marshals the
> data.

The easy way to solve such a problem is

a) get over the completely insane idea that designing data
exchange formats would be something network programmers should
not care about.

b) define a (presumably simple) binary message format
containing the information you want to communicate.

Regarding b, if you want to make your life easy, avoid numbers not
composed of an integral number of octets and try to preserve 'natural
alignment' of the members, ie let four-octet quantities (32-bit
integers) start on a four-octet boundary relative to the start of the
message.

For a structure which absolutely must contain a lot of pointers,
introducing a 'fixup' step could be sensible: Store the offsets of all
contained pointers in some array, traverse this array and add the
start address of the buffer the message is contained in to the value
stored at the location start + <current offset> in the buffer.



> If you fancy standards, there is always ASN.1. Old, and it used to
> thought of as rather "heavy" but that was before everyone started
> using XML for everything.

ASN.1 means 'abstract syntax notation one' and it is (mostly) a
standardized high-level language for declaring structured data types.
It is not a definition of an actual 'wire format' for data. That would
be the purpose of some set of encoding rules for ASN.1, eg BER (basic
encoding rules) or PER (packed encoding rules). BER encoding is
already fairly byzantine, and that it transmits an encoded length and
an encoded type for each 'information quantity' is useless for
applications where producer and consumer 'know' how the messages are
supposed to be structured.

According to one of the people responsible for the PER-rules, these
were initially conceived by a couple of frustrated people busy with
getting drunk, with the frustration resulting from the fact that too
many members of some committee had provided negative, constructive
feedback regarding their previous attempt at defining an 'improved
BER'. PER was designed to address this problem and the person I am
referring to (author of a freely downloadable book on ASN.1)
recommends that people don't even try to understand the
specification.

Mark Space

Apr 15, 2008, 2:03:22 PM
Rex Mottram wrote:
> I'm looking for an offset-based data structure to hold character data.

I'm not sure what you mean by "offset based data structure." I think I
missed that chapter when doing my school work, where the official offset
based data structure that all languages inter-operate with was described.

It sounds like you want just plain old binary data. This is a huge
mistake, imo, since your client and server combination is going to be a
lot less modifiable than just one system by itself. Better to send the
XML to the client, parse the whole thing once into some binary format,
then spawn the child processes. Getting two systems to work together in
a binary format is going to be very hard to maintain.

But anyway, look at DataOutputStream for Java. Send the info to a
network stream, or you can buffer it to memory by hooking the
DataOutputStream to a ByteArrayOutputStream, and then send the buffer
at your leisure.

Think very hard about how you will modify that binary data in the future
when you implement this.

If we knew a little more about what was being parsed here, we might be
able to help you further. My advice is "don't" but you seem wedded to
the idea, so hopefully we can save you from disaster. It's still really
unclear to me how sending data in binary is going to present some huge
savings on the client. Networks are slow, CPUs are fast. Servers also
tend to be heavily loaded. Parsing once in C on the client seems like
the obvious answer and much less headache prone.

Anyway, good luck.


> The traditional issue with transportable data structures is that since
> the client can't reliably control what address the data is mapped to,
> all addressing must be made relative to the starting point. Does anyone
> know of an implementation of such a format which can be generated in
> Java and consumed in C code?

Re-reading this, I'm still totally unclear on what you are actually
asking here. You need "addressing?" Are you saying you don't know what
indirection is in C? Ever hear of a look-up table? How about a hash table?

That may come off as condescending, but that's what your description
sounds like to me. Maybe a clearer explanation of how you think the
data will be parsed/accessed in binary will help. I think part of the
problem is you are a bit unclear yourself.

Rainer Weikusat

Apr 15, 2008, 2:30:56 PM
Mark Space <mark...@sbc.global.net> writes:
> Rex Mottram wrote:
>> I'm looking for an offset-based data structure to hold character
>> data.
>
> I'm not sure what you mean by "offset based data structure."

This fairly obviously means 'replacing pointers by offsets relative to
the start of the data structure' (because pointer values usually
cannot even be meaningfully communicated between different processes
running on the same machine, let alone processes running on different
machines).

[...]

> It sounds like you want just plain old binary data.

That was indeed what was intended: Use a data format which can be
'readily prepared' on the server for use as-is, because the overhead
of parsing an XML-steganogram of a structured message into a
structured message had been _proven_ to be too expensive.

> This is a huge mistake, imo, since your client and server
> combination is going to be a lot less modifiable than just one
> system by itself.

At worst, it will need twice the effort for modifying one system.

> Better to send the XML to the client, parse the whole thing once
> into some binary format, then spawn the child processes. Getting
> two systems to work together in a binary format is going to be very
> hard to maintain.

Why would it? The code itself isn't structurally different: For the
most general case, you have a 'native' representation on system #1,
which is encoded into a transport representation by an encoding
routine, transmitted to system #2 and then decoded into a native
representation by a decoding routine.

[...]

> If we knew a little more about what was being parsed here, we might be
> able to help you further. My advice is "don't" but you seem wedded to
> the idea, so hopefully we can save you from disaster. It's still
> really unclear to me how sending data in binary is going to present
> some huge savings on the client.

Did it occur to you that your lack of understanding in this respect
means that you are a less-than-ideal person regarding having a
sensible opinion on the topic?

The original problem was that _measurements_ had proven that the need
to parse an XML-representation was too expensive for this particular
application. Assuming this is true, not parsing the data on the
client, or at least decoding something significantly less noisy than
XML (eg a binary interchange format or a simpler text format) is an
obvious solution. It will reduce the load on the server, too
(constructing XML is not cheaper than deconstructing XML) and will
even lead to less bandwidth waste during transmission, assuming the
'other message format' is designed with a better SDU/PDU ratio than
XML can provide (which will be easy[*]).

[*] No, compression is not the solution: It's a resource-intensive
workaround for the original problem.

shakah...@gmail.com

Apr 15, 2008, 2:46:24 PM
On Apr 15, 2:30 pm, Rainer Weikusat <rweiku...@mssgmbh.com> wrote:
>
> The original problem was that _measurements_ had proven that the need
> to parse an XML-representation was too expensive for this particular
> application. ...

Slight quibble here in that while measurements may have
shown parsing to be too expensive for a parse-in-every-child
implementation, the same might not hold
true in an implementation where parsing is done once on the
client and then shared with the children (particularly
if those are child processes in the UNIX sense). The existence
of "a working implementation which sends the data
in an XML format" (implying existing C parsing code and internal
representation) would only make me look further in that direction.

Mark Space

Apr 15, 2008, 5:09:42 PM
Rainer Weikusat wrote:
> Mark Space <mark...@sbc.global.net> writes:
>> Rex Mottram wrote:
>>> I'm looking for an offset-based data structure to hold character
>>> data.
>> I'm not sure what you mean by "offset based data structure."
>
> This fairly obviously means 'replacing pointers by offsets relative to
> the start of the data structure' (because pointer values usually
> cannot even be meaningfully communicated between different processes
> running on the same machine, let alone processes running on different
> machines).

Is that all you need? The Java version should be a tad easier, actually.
Note the last loop uses nothing but offsets from the buffer to print
out the data.

If you were hoping for the magic library that does this for you, I
think it's called "hands on keyboard."


lut_test filename a 12 longer_string_test C

Total buffer size: 63
Number of entries: 5
String 0: filename
String 1: a
String 2: 12
String 3: longer_string_test
String 4: C

/*
 * File: lut_test.c
 *
 * Created on April 15, 2008, 12:43 PM
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct pre_parsed {
    int size;
    int length;
    int indexes[];
};

int main(int argc, char** argv) {

    struct pre_parsed *buffer;
    char ** argv_copy = argv;

    if( argc < 2 )
    {
        fprintf( stderr, "Usage: lut_test outfile string_list\n");
        return EXIT_FAILURE;
    }
    size_t total_string_size = 0;

    while( *++argv )
    {
        size_t len = strlen( *argv );
        total_string_size += len+1;
    }

    // Test some values in debugger

    size_t struct_size = sizeof (struct pre_parsed);
    size_t int_size = sizeof (int);
    size_t array_size = sizeof (int) * (argc-1);

    // Build the buffer

    buffer = malloc( sizeof (struct pre_parsed) + sizeof (int) *
        (argc-1) + total_string_size );

    buffer->length = argc - 1;
    buffer->size = sizeof (struct pre_parsed) + sizeof (int) *
        (argc-1) + total_string_size;
    int index = 0;
    size_t offset = sizeof (struct pre_parsed)
        + sizeof (int) * (argc-1);
    while( *++argv_copy )
    {
        buffer->indexes[index] = offset;
        // Cast to char* first so the offset counts bytes, not
        // multiples of sizeof (struct pre_parsed).
        strcpy( (char*)buffer + offset, *argv_copy );
        offset += strlen( *argv_copy ) + 1;
        index++;
    }

    // Read back from the buffer

    printf( "Total buffer size: %d\n", buffer->size );
    printf( "Number of entries: %d\n", buffer->length );
    int i;
    for( i = 0; i < buffer->length; i++ )
    {
        printf( "String %d: %s\n", i,
            (char*)buffer + buffer->indexes[i] );
    }

    return (EXIT_SUCCESS);
}

rossum

Apr 15, 2008, 5:11:42 PM
On Tue, 15 Apr 2008 08:15:06 -0400, Rex Mottram <re...@not.here> wrote:

>I'm looking for an offset-based data structure to hold character data.
>Background: I'm working with an app whose server is written in Java
>while the client is in C. The server needs to package up a bunch of data
>and ship it to the client where it will be traversed, read-only, many
>many times.

If it is read-only then why do you need to parse it many times? Parse
it once at the client end and let all the child processes access the
result.

>There's actually a working implementation which sends the
>data in an XML format, but it's too slow and profiling reveals that it
>spends almost all its time in XML parsing, so I need to replace the XML
>with something more natural to the purpose.

Is that actually CPU use in parsing or in network IO? If IO then
compression might be faster - probably a long shot.


>
>The client starts thousands of child processes which each start by
>parsing the XML document in order to place its data in hash tables,
>after which it is accessed via the hash table only. So the XML is very
>much an intermediate format, and the obvious win would be to have the
>Java server package the data in a form which is directly usable by the
>clients. Presumably the data would arrive from the server and be stored
>in a file: each client would then map the file into its address space
>and treat it as a hash table (or similar data structure) right away.

Why are you hashing to memory addresses (or pointers)? Why not hash
to array indexes which immediately makes the whole structure
relocatable in memory since everything is indexed from the start of
the array. You would need fixed-length data for this. Would
fixed-length data be a problem? A C union could handle otherwise
different types in a single array.

Would comma-separated data be simpler to parse than XML? It does not
have as much overhead and is generally less verbose.

>
>The traditional issue with transportable data structures is that since
>the client can't reliably control what address the data is mapped to,
>all addressing must be made relative to the starting point. Does anyone
>know of an implementation of such a format which can be generated in
>Java and consumed in C code?

Heap, sorted array + binary search, hash to an array address are all
possibilities.

I suspect that the biggest saving would be parse once rather than
parse many. If you only parse once then you can do a lot of
complicated setup at the client end and the work is amortised over the
many children.

rossum

>
>TIA,
>RM

Mark Space

Apr 15, 2008, 6:41:43 PM
rossum wrote:

> Would comma separated be simpler to parse than XML? It does not have
> as much overhead and is generally less verbose.

That's actually a pretty good idea. Much simpler and more portable.

Rex Mottram

Apr 15, 2008, 7:49:02 PM
Mark Space wrote:
> Rex Mottram wrote:
>> I'm looking for an offset-based data structure to hold character data.
> It sounds like you want just plain old binary data.

Possibly you missed the phrase "ASCII data" cleverly hidden in the
subject line, and likewise "character data" in the first sentence? :-)

Anyway, in response to you and a few other posters, yes, parsing the XML
once into some other format at the client end would indeed solve the
problem mentioned. But although you argue that would be more flexible, I
think it would result in a Rube Goldberg type of system with two
different representations needing to be kept in sync. Not very robust
programming practice IMHO.

Also, again in response to you and a few others, network bandwidth is
not an issue. I.e. moving the XML document, even if it's large (and it
often is) is not a significant cost, and for the record it's already
transported in compressed form. It's the multi-parsing that's a problem.

I could of course write up my own index-based ADT in C, but would much
prefer not to. There are a number of free ADT packages out there - I'm
currently using one called Kazlib which is very nice and I have notes on
6-10 others which also look good. But all these use pointers, with the
cited problem.

There's actually an open source C library which was written to be
offset-based. I remember seeing it a few years ago. I haven't gone
digging for it yet because I was hoping to find something I could build
up from the Java side. Will go see if I can find that next.

RM

Mark Space

Apr 15, 2008, 9:38:54 PM
Rex Mottram wrote:

> I could of course write up my own index-based ADT in C, but would much
> prefer not to. There are a number of free ADT packages out there - I'm
> currently using one called Kazlib which is very nice and I have notes on
> 6-10 others which also look good. But all these use pointers, with the
> cited problem.


Ah ha. And specifically you want to serialize and deserialize your
ADTs. Between C and Java. Nothing like picking the implementation
first then going looking for solutions. ;)

What you're asking for is called pointer swizzling, btw, and the general
case seems to be very hard, which may be why you aren't finding any libraries.

And no I don't know any Java packages for that. Java programmers seem
to prefer general solutions that always work (XML) to light weight
solutions that may impose some restrictions. Sorry about that.

If you find anything interesting, let us know.

You might try some custom de-/serialization on the Java side, so you can
control the format, and produce something easier to parse in C. Just a
thought.

Logan Shaw

Apr 15, 2008, 9:45:35 PM
Rex Mottram wrote:
> I'm looking for an offset-based data structure to hold character data.
> Background: I'm working with an app whose server is written in Java
> while the client is in C. The server needs to package up a bunch of data
> and ship it to the client where it will be traversed, read-only, many
> many times. There's actually a working implementation which sends the
> data in an XML format, but it's too slow and profiling reveals that it
> spends almost all its time in XML parsing, so I need to replace the XML
> with something more natural to the purpose.

Let me see if I can understand the problem statement. It sounds like
you do not need a hash table specifically, but what you actually need
is a way to map string values to other string values. As far as I know,
you do not say you need to be able to iterate over the keys, although
it's a common requirement, so you might need it.

If it is true that that is all you actually need, removing the artificial
constraint of a hash might open up some implementation possibilities.

For example, you could send a sorted list of pairs of C strings concatenated
together, by which I mean 8-bit bytes terminated by null bytes. The client
side would be very easy to implement:

* load the entire image (i.e. message) into a contiguous area of memory
* make a pass through it to count pairs of strings (which is equivalent
to counting the zero bytes in the block of memory, then dividing by two)
* allocate an array of pairs of pointers (of type "const char *")
* make another pass and write the memory address of every string into
the array of pointers
* do key lookup via binary search

Note that this approach does not require encoding any integers or offsets
at all into the message. It is O(N) to parse it, but that is no worse
asymptotically than the time required to receive it over the network.

Another alternative would be to build a trie on the server side, where
states in the trie's state machine (I think of a trie as a finite state
machine that recognizes strings) are represented by byte offsets from the
beginning of the image. The client could then do lookups on the structure
directly (except that to move to the next state, it would have to add an
offset, but that is no big deal).

- Logan

Mark Space

Apr 15, 2008, 9:50:18 PM
Mark Space wrote:

> You might try some custom de-/serialization on the Java side, so you can
> control the format, and produce something easier to parse in C. Just a
> thought.

Oh, and the hash table itself in a Java hash map isn't serializable,
iirc. So if you're expecting to send the hash table over directly,
rather than just the objects it contains, you'll have to roll your own
implementation or do something similar.

Lew

Apr 16, 2008, 12:30:09 AM
Mark Space wrote:
> Mark Space wrote:
>
>> You might try some custom de-/serialization on the Java side, so you
>> can control the format, and produce something easier to parse in C.
>> Just a thought.
>
> Oh, and the hash table itself in a Java hash map isn't serializable, iirc.

You don't.

public class HashMap<K,V>
extends AbstractMap<K,V>
implements Map<K,V>, Cloneable, Serializable

> So if you're expecting to send the hash table over directly,
> rather than just the objects it contains, you'll have to roll your own
> implementation or do something similar.

Or just use the HashMap built-in implementation of Serializable.

Every listed subclass of AbstractMap except WeakHashMap is Serializable.

--
Lew

Rainer Weikusat

Apr 16, 2008, 7:05:26 AM
Rex Mottram <re...@not.here> writes:
> Mark Space wrote:
>> Rex Mottram wrote:
>>> I'm looking for an offset-based data structure to hold character
>>> data.
>> It sounds like you want just plain old binary data.
>
> Possibly you missed the phrase "ASCII data" cleverly hidden in the
> subject line, and likewise "character data" in the first sentence?
> :-)

Possibly, you missed the 'hash table' cleverly hidden in the subject
line. This means 'plain old binary data', ie a table of pointers.

> Anyway, in response to you and a few other posters, yes, parsing the
> XML once into some other format at the client end would indeed solve
> the problem mentioned. But although you argue that would be more
> flexible, I think it would result in a Rube Goldberg type of system
> with two different representations needing to be kept in sync.

If it would, then this would be the type of system you already have,
because you already have one representation on the server, a transport
representation and a representation on the client. That you
(presently) construct identical client representations x times in
parallell and manage to overwhelm the 'client machine' with this
nonsense doesn't change the structure of the application, just its
"simple-mindedness". Parsing the XML-steganogram once and re-using it
x - 1 times would be by far the simplest solution from a coding
standpoint (as someone else already wrote).

> Also, again in response to you and a few others, network bandwidth is
> not an issue. I.e. moving the XML document, even if it's large (and it
> often is) is not a significant cost,

... because, as you need to understand, this is MY PRIVATE GIG-E LAN
which has been solely provided to transport MY XML (and it has excess
bandwidth for that) ...

It is not possible to assess the significance of the 'cost' of a
particular communication in isolation. Networks are usually used by
multiple (and often, many) computers for multiple (often many)
applications in parallel.

> and for the record it's already transported in compressed form.

If network usage is not of concern, why do you bother to compress the
XML? Not doing so would be another easy way to make the processing on
both client and server much simpler.

> It's the multi-parsing that's a problem.

Well, then don't.

Rex Mottram

Apr 16, 2008, 10:10:16 AM
Rainer Weikusat wrote:
> If it would, then this would be the type of system you already have,
> because you already have one representation on the server, a transport
> representation and a representation on the client. That you
> (presently) construct identical client representations x times in
> parallell and manage to overwhelm the 'client machine' with this
> nonsense doesn't change the structure of the application, just its
> "simple-mindedness".

You may have me (the OP) mixed up with someone else. I write not to
defend the current implementation but to improve it.

> Parsing the XML-steganogram once and re-using it
> x - 1 times would be by far the simplest solution from a coding
> standpoint (as someone else already wrote).

Yes, with the single exception of the alternative, which is to deliver
it in pre-parsed form and thus remove one more intermediate representation.

N < N + 1. QED.

>> Also, again in response to you and a few others, network bandwidth is
>> not an issue. I.e. moving the XML document, even if it's large (and it
>> often is) is not a significant cost,
>
> ... because, as you need to understand, this is MY PRIVATE GIG-E LAN
> which is been solely provided to transport MY XML (and it has excess
> bandwidth for that) ...

You are swinging (and missing) especially wildly today. The
insignificance of transport is primarily due to the fact that it's a
fairly rare event which occurs just once per client/server transaction,
while the client fires children, as noted, potentially thousands of
times per transaction.

The largest XML file I've measured to date is around 5 megabytes.
Because XML is so wordy it compresses quite well, and thus the gzipped
version which travels over the wire is around 150K. This may happen
about once every half hour. I don't know what your network looks like
but this is well within the noise factor for mine.

More than that, the raison d'etre of this application is to keep the
clients from having to run unnecessary jobs. I.e. it's an optimizer. So
whatever network traffic it generates will be paid back by orders of
magnitude in other traffic which doesn't happen later.

> If network usage is not of concern, why do you bother to compress the
> XML? Not doing so would be another easy way to make the processing on
> both client and server much simpler.

Arg. I do it because it's trivially easy and the right thing to do. On
the server side it's as easy as constructing a "new GZIPOutputStream"
object. The client side uses libcurl which has built-in knowledge of
compression so it's as easy as adding an Accept-Encoding header.

It really couldn't be simpler than that. The XML parser doesn't have to
worry about compression because it's uncompressed by then. See, it's
compressed, sent over the wire, uncompressed, then parsed. All with two
extra lines of code, one in each language.

>> It's the multi-parsing that's a problem.
>
> Well, then don't.

Arg. If you didn't spend so much energy looking for someone to bully,
you might have more fun. I know we would.

RM

Rainer Weikusat

Apr 16, 2008, 10:41:23 AM
Rex Mottram <re...@not.here> writes:
> Rainer Weikusat wrote:
>> If it would, then this would be the type of system you already have,
>> because you already have one representation on the server, a transport
>> representation and a representation on the client. That you
>> (presently) construct identical client representations x times in
>> parallell and manage to overwhelm the 'client machine' with this
>> nonsense doesn't change the structure of the application, just its
>> "simple-mindedness".
>
> You may have me (the OP) mixed up with someone else.

No, I was being sarcastic.

>> Parsing the XML-steganogram once and re-using it
>> x - 1 times would be by far the simplest solution from a coding
>> standpoint (as someone else already wrote).
>
> Yes, with the single exception of the alternative, which is to deliver
> it in pre-parsed form and thus remove one more intermediate
> representation.

Assuming a UNIX(*)-like 'application structure', the necessary changes
would roughly amount to forking after the XML has been parsed instead of
before it has been parsed.

>>> Also, again in response to you and a few others, network bandwidth is
>>> not an issue. I.e. moving the XML document, even if it's large (and it
>>> often is) is not a significant cost,
>> ... because, as you need to understand, this is MY PRIVATE GIG-E LAN
>> which is been solely provided to transport MY XML (and it has excess
>> bandwidth for that) ...
>
> You are swinging (and missing) especially wildly today.
> The insignificance of transport is primarily due to the fact that it's a
> fairly rare event which occurs just once per client/server
> transaction,

Less creative editing could greatly aid your understanding:

,----
| It is not possible to assess the significance of the 'cost' of a
| particular communication in isolation. Networks are usually used by
| multiple (and often, many) computers for multiple (often many)
| applications in parallel.

`----

See. This time, I even explained the sarcasm. Note that this was a
general remark regarding the (common) misconception of the 'isolated,
insignificant event', which really isn't an 'isolated event' but a
contributing factor.

[...]

>> If network usage is not of concern, why do you bother to compress the
>> XML? Not doing so would be another easy way to make the processing on
>> both client and server much simpler.
>
> Arg. I do it because it's trivially easy and the right thing to do.

Something which is 'trivially easy' for you to code is not necessarily
'trivially easy' for a computer to execute. In this particular
situation, it is certain that the process is computationally expensive.
You (again) stated that 'network usage' is of no concern to you, while
you (again) stated that 'CPU usage on the client' is of concern to
you. Yet you try to save 'network bandwidth' at the expense of
'processing time' by using compression.

This doesn't exactly make sense together.

[...]

>>> It's the multi-parsing that's a problem.
>> Well, then don't.
>
> Arg. If you didn't spend so much energy looking for someone to bully,
> you might have more fun. I know we would.

That you know that some or all of you would have more fun if some or
all of you spent less energy looking for someone to bully means
nothing about me.

Rex Mottram

Apr 16, 2008, 11:40:00 AM
Rainer Weikusat wrote:
> Assuming a UNIX(*)-like 'application structure', the necessary changes
> would roughly amount to forking after the XML has been parsed instead of
> before it has been parsed.

No, the "necessary changes" would include being able to remove an entire
XML-processing state machine, many associated static variables, remove
dependence on an external XML parsing library, not to mention (again)
the extra representation. These are very significant differences. I
truly think you've switched your POV here, either without realizing it
or out of pure orneriness.

> Less creative editing could greatly aid your understanding:
>
> ,----
> | It is not possible to assess the significance of the 'cost' of a
> | particular communication in isolation. Networks are usually used by
> | multiple (and often, many) computers for multiple (often many)
> | applications in parallel.
> `----
>
> See. This time, I even explained the sarcasm. Note that this was a
> general remark regarding the (common) misconception of the 'isolated,
> insignificant event', which really isn't an 'isolated event' but a
> contributing factor.

Speaking of creative editing, I gather you chose to remove the parts of
my response which addressed this. Unlike you I know things about how my
network is used and have done measurements. Sending the XML compressed
is, empirically, an insignificant cost. Sending it uncompressed is
probably insignificant too. The only reason I said anything about the
transport cost and compressed vs not is that some respondents mistakenly
thought that was the problem and I wanted to set them straight.

There is no possible way that the cost of transporting this data packet
is a significant fraction of the overall network burden, or network
profile, of this application or the environment it operates in. You may
repeat mantras like the above to your heart's content, but though they
are certainly true in isolation they are not applicable here. This whole
network bandwidth discussion is a red herring.

>>> If network usage is not of concern, why do you bother to compress the
>>> XML? Not doing so would be another easy way to make the processing on
>>> both client and server much simpler.
>> Arg. I do it because it's trivially easy and the right thing to do.
>
> Something which is 'trivially easy' for you to code is not necessarily
> 'trivially easy' for a computer to execute. In this particular
> situation, it is certain that the process is computationally expensive.
> You (again) stated that 'network usage' is of no concern to you, while
> you (again) stated that 'CPU usage on the client' is of concern to
> you. Yet you try to save 'network bandwidth' at the expense of
> 'processing time' by using compression.
>
> This doesn't exactly make sense together.

For you to accuse me of creative editing and in the *same post* put a
phrase in quotes, attributed directly to me, which I did not say is
truly breathtaking. Search for the phrase 'CPU usage on the client' -
its first appearance is when you "quoted" it.

The point here is that all the one-time costs are swamped by the
aggregate costs of parsing. This applies equally to the compression,
transmission, and decompression of the XML, and many other constant
costs. They are just not measurable. I don't know why you insist on
talking about them.

RM

Rex Mottram

Apr 16, 2008, 12:01:24 PM
Mark Space wrote:
> If you find anything interesting, let us know.

I apologize for letting other parts of this thread devolve into a
useless squabble. But for those of you with a substantive interest, let
me report back that I'm getting moderately interested in/excited about
the CDB ("constant database") family.

Having realized that what I'm looking at can be thought of as a constant
(meaning write once, read many) database, I did some searching and
found CDB. The original CDB package is C code which is unchanged since
2000 but still works fine. There's also a "TinyCDB" fork of the project
which is newer and ported to Windows and still format-compatible with
the original.

Most interestingly, there are APIs for a number of languages including a
jar file for Java (this version is called sg-cdb). So it looks like I
can use the sg-cdb API to write a cdb-format file directly from Java and
send it to the client(s), which can link with libcdb.a and navigate it
directly.

Whether I can encode my data into the key-value style of CDB (which is a
dbm-like model) is an open question but that's my problem. For the sake
of readers here and subsequent archive searches I thought I'd mention
that the combination of sg-cdb on the server side and tinycdb on the
client side may make a lot of sense for certain applications.

Licensing is also simple: TinyCDB is in the public domain and sg-cdb is
BSD licensed.

RM

rossum

Apr 16, 2008, 12:01:18 PM
On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram <re...@not.here> wrote:

>The point here is that all the one-time costs are swamped by the
>aggregate costs of parsing.

Then change the parsing from an aggregate cost to a one-time cost.
Parse once at the receiving end and let all the child processes have
access to the read-only parsed version. It would not be too complex
to set up version numbering or reference counting if required.

I am not clear as to why you cannot do this, it seems such an obvious
solution.

rossum

Rex Mottram

Apr 16, 2008, 12:18:33 PM
rossum wrote:
> On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram <re...@not.here> wrote:
>
>> The point here is that all the one-time costs are swamped by the
>> aggregate costs of parsing.
> Then change the parsing from an aggregate cost to a one-time cost.
> Parse once at the receiving end and let all the child processes have
> access to the read-only parsed version. It would not be too complex
> to set up version numbering or reference counting if required.
>
> I am not clear as to why you cannot do this, it seems such an obvious
> solution.

I thought I had explained this in excruciating detail but will try once
more. There are two problems with this approach, not necessarily
unsolvable but still valid problems:

1. It means double complexity, as the data has to be extracted from
POJOs on the server side and converted to XML ("format A"), then parsed
and placed into an alternate representation ("format B") on the client
side. The preferable plan would be to have the server place the data
directly into format B. I am not clear on why this is complex or
controversial; it seems beyond argument to me. Particularly since I've
already found one API for doing so (CDB) as mentioned elsewhere in the
thread.

2. In what format would that "read-only parsed version" be? It cannot be
a traditional hash table or other ADT since they can't be shared between
processes, as explained previously.

The whole point is that I'm looking for a format in which the data can
be shared. It looks like the CDB format might be a candidate, which
allows me to turn the question around: whyever would I want to write the
data to XML on the server and convert it to CDB on the client when
there's an API for writing the CDB format directly on the server? I can
see no argument at all for using two steps when one would suffice.

RM

Rainer Weikusat

Apr 16, 2008, 12:33:52 PM
rossum <ross...@coldmail.com> writes:
> On Wed, 16 Apr 2008 11:40:00 -0400, Rex Mottram <re...@not.here> wrote:
>
>>The point here is that all the one-time costs are swamped by the
>>aggregate costs of parsing.
> Then change the parsing from an aggregate cost to a one-time cost.
> Parse once at the receiving end and let all the child processes have
> access to the read-only parsed version. It would not be too complex
> to set up version numbering or reference counting if required.
>
> I am not clear as to why you cannot do this, it seems such an obvious
> solution.

The short answer is that the guy is a complete moron.

Mark Space

Apr 16, 2008, 1:07:30 PM
Rex Mottram wrote:
> Mark Space wrote:
>> If you find anything interesting, let us know.

> me report back that I'm getting moderately interested in/excited about

> the CDB ("constant database") family.

> Most interestingly, there are APIs for a number of languages including a

> jar file for Java (this version is called sg-cdb). So it looks like I

Thanks for the report. I learned some things in this thread, which is
always valuable to me, and this project you found will be a cool tool I
can keep in my back pocket in case I ever need such a thing.

I hope this project goes well for you.

moi

Apr 16, 2008, 2:23:30 PM
On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:


> 2. In what format would that "read-only parsed version" be? It cannot be
> a traditional hash table or other ADT since they can't be shared between
> processes, as explained previously.

No need to parse it.
Just use a native OS flatfile. Since your data + keys are ASCII,
this flatfile is an ASCII file.

> The whole point is that I'm looking for a format in which the data can
> be shared. It looks like the CDB format might be a candidate, which
> allows me to turn the question around: whyever would I want to write the
> data to XML on the server and convert it to CDB on the client when
> there's an API for writing the CDB format directly on the server? I can
> see no argument at all for using two steps when one would suffice.

You can store an array of {offset, hash, (key)len, next} in
another flatfile.
This second flatfile needs to be binary, but can be created on the target
platform. The next-field in the struct is not a pointer but an index into
the array itself (for the overflow chain). (NB: use -1 or >= N as the
sentinel value; zero won't work ;-)

Using mmap on the target platform will make the lookup extremely fast
(typically 2 page faults). Allocating two big arrays and reading them in
at program startup will be faster (or equally fast), but the startup will
be slower.

Combining the two files is a possible extension, too.

HTH,
AvK

jellybean stonerfish

Apr 17, 2008, 12:45:31 AM
On Wed, 16 Apr 2008 16:41:23 +0200, Rainer Weikusat wrote:

> Rex Mottram <re...@not.here> writes:
>> Arg. I do it because it's trivially easy and the right thing to do.
>
> Something which is 'trivially easy' for you to code is not necessarily
> 'trivially easy' for a computer to execute. In this particular
> situation, it is certain that the process is computationally expensive.
> You (again) stated that 'network usage' is of no concern to you, while
> you (again) stated that 'CPU usage on the client' is of concern to you.
> Yet you try to save 'network bandwidth' at the expense of 'processing
> time' by using compression.
>
> This doesn't exactly make sense together.
>

Maybe it is faster for the client to decompress the little file than it
would take to download the uncompressed version?

Rainer Weikusat

Apr 17, 2008, 2:16:43 AM

It would be much faster to just not download any file ;-), i.e. the code
on the client (presumably) doesn't do something because it is 'faster'
than doing something else, but to accomplish some useful purpose. The
same way 'the network' is shared among all applications trying to use
it, the host CPU is shared among all applications running on a
particular host. Download of a large file could go on as a background
task, using very little host resources, more of which would then
be available to 'other active tasks', while downloading a smaller file
faster would have a more 'bursty' host resource usage pattern, and
decompressing a small file to a large file is a resource-intensive
operation, although the absolute time needed by it may be short enough
that measuring its influence could be difficult.

This is all, of course, mostly 'thin air' without much more
information about the actual application. But a statement like 'I do
not care about transmitting much more data than I actually need,
except that I compress these data to make it smaller, but anyway, my
actual problem is that preprocessing of the downloaded data is too
expensive for a particular situation' doesn't make much sense
altogether. To me, it sounds a lot like 'I am trading off processor
time for transmission time and now my programs run too slowly'. Duh.

If doing it all 'in the right way' doesn't work, then it isn't the
right way.


Rainer Weikusat

Apr 17, 2008, 5:56:03 AM
moi <ro...@invalid.address.org> writes:
> On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:
>> 2. In what format would that "read-only parsed version" be? It cannot be
>> a traditional hash table or other ADT since they can't be shared between
>> processes, as explained previously.
>
> No need to parse it. Just a native OS-flatfile. Since your data +
> keys is ASCII, this flatfile is an ASCII file.

[...]

> You can store an array of {offset, hash,(key)len,next} in
> another flatfile. This second flatfile needs to be binary, but can
> be created on the target platform.

Which requires the data to be parsed on the target platform to create
the index. It just (presumably) requires less complicated parsing
than parsing the XML file. The most sensible way to do this would
still be to preprocess the transport format once and share the result
among all client instances. The OP has repeatedly gone ballistic at
the mere suggestion that this could be a sensible strategy.

Not to mention that a non-XML transport format would still need to be
designed. But this would (likely) lead to reducing the amount of data
transmitted, and so, again, cause Mr Rex's mind to blow up in anger.


moi

Apr 17, 2008, 8:05:20 PM
On Thu, 17 Apr 2008 11:56:03 +0200, Rainer Weikusat wrote:

> moi <ro...@invalid.address.org> writes:
>> On Wed, 16 Apr 2008 12:18:33 -0400, Rex Mottram wrote:
>>> 2. In what format would that "read-only parsed version" be? It cannot
>>> be a traditional hash table or other ADT since they can't be shared
>>> between processes, as explained previously.
>>
>> No need to parse it. Just a native OS-flatfile. Since your data + keys
>> is ASCII, this flatfile is an ASCII file.
>
> [...]
>
>> You can store an array of {offset, hash,(key)len,next} in another
>> flatfile. This second flatfile needs to be binary, but can be created
>> on the target platform.
>
> Which requires the data to be parsed on the target platform to create
> the index. It just (presumably) requires less complicated parsing than
> parsing the XML file. The most sensible way to do this would still be to
> preprocess the transport format once and share the result among all
> client instances. The OP has repeatedly went ballistic at the mere
> suggestion that this could be a sensible strategy.

I am afraid you are right. As an experiment, I used /usr/dict (~5MB, >500K
words) and created the hashtable in approx 2 seconds walltime. Once the
hashtable is present on disk, searching one key takes 2 ms or so (most of
it program startup, of course). All using mmap.
Creating a binary file for another architecture is not impossible to do
(even dd has options to swap words). Most hardware is little-endian
these days, and most people don't get beyond x86 anyway.

Agreed, my parser is simple (just one sweep, looking for space and CR);
an XML parse may need more than one pass, and a lot more state. But
programs like these are governed by I/O cost, and one can easily perform
10 scans over each buffer being read in. (once it *is* read in ;-)

>
> Not to mention that a non-XML transport format would still need to be
> designed. But this would (likely) lead to reducing the amount of data
> transmitted, so, again cause Mr Rex' mind to blow up in anger.

XML is the problem. Sticking to it is more than a problem. It is a
disease.

AvK

Charles Coldwell

Apr 20, 2008, 8:17:49 AM
Rex Mottram <re...@not.here> writes:

> The traditional issue with transportable data structures is that since
> the client can't reliably control what address the data is mapped to,
> all addressing must be made relative to the starting point. Does
> anyone know of an implementation of such a format which can be
> generated in Java and consumed in C code?

IIUC, what you need is a hash table. Traditionally, a hash table is
implemented as an array of pointers, where the pointers point to linked
lists of objects that hash to the same value, and that value is the
index into the array.

So the fundamental problem is that pointers on the server contain
addresses that are meaningless on the client.

I've often seen linked lists implemented using array indices instead of
pointers. For example:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

struct list {
int next;
const char *data;
};

static struct list pool[1024];
static int freeidx;

static int pool_alloc(void)
{
int tmp = freeidx;

if (freeidx != -1)
freeidx = pool[freeidx].next;

return tmp;
}

static void pool_free(int idx)
{
pool[idx].next = freeidx;
freeidx = idx;
}

static int hashtable[256];

const char *strings[] = {
"hello",
"world",
"this",
"is",
"a",
"hashtable",
"example"
};

static unsigned char hashfunc(const char *str)
{
unsigned int hash = 0;

for (hash = 0; *str; ++str)
hash += *str;

return hash & 0xff;
}

static void hash_insert(const char *str)
{
int hidx = hashfunc(str);
int pidx = pool_alloc();

if (pidx == -1) {
fputs("out of memory\n", stderr);
exit(1);
}

pool[pidx].data = str;
pool[pidx].next = hashtable[hidx];
hashtable[hidx] = pidx;
}

static int hash_find(const char *str)
{
int hidx = hashfunc(str);
int iter;

for (iter = hashtable[hidx]; iter != -1; iter = pool[iter].next)
if (strcmp(str, pool[iter].data) == 0)
break;

return iter != -1;
}

int main(int argc, char *argv[])
{
int i;

/* initialize the pool */
for (i=0; i<sizeof(pool)/sizeof(pool[0]); i++)
pool[i].next = i+1;
pool[i-1].next = -1;

/* initialize the hash table */
memset(hashtable, -1, sizeof(hashtable));

for (i=0; i<sizeof(strings)/sizeof(strings[0]); i++)
hash_insert(strings[i]);

for (i=1; i<argc; i++)
printf ("%s%s found in table\n", argv[i],
(hash_find(argv[i]) ? "" : " not"));

return 0;
}


--
Charles M. "Chip" Coldwell
"Turn on, log in, tune out"
GPG Key ID: 852E052F
GPG Key Fingerprint: 77E5 2B51 4907 F08A 7E92 DE80 AFA9 9A8F 852E 052F

Robert Klemme

Apr 20, 2008, 8:27:58 AM

I tried to find a post of yours in the thread where you explain what
kind of data structure you need but could not find one. The term
"offset based hash table for ASCII data" is far too generic for me to
get a concrete idea. Can you elaborate a bit or show a sample XML if it
is not too large?

Kind regards

robert

Rex Mottram

Apr 20, 2008, 1:19:26 PM
Charles Coldwell wrote:
> Rex Mottram <re...@not.here> writes:
>
>> The traditional issue with transportable data structures is that since
>> the client can't reliably control what address the data is mapped to,
>> all addressing must be made relative to the starting point. Does
>> anyone know of an implementation of such a format which can be
>> generated in Java and consumed in C code?
>
> IIUC, what you need is a hash table. Traditionally, a hash table is
> implemented as an array of pointers, where the pointers point to linked
> lists of objects that hash to the same value, and that value is the
> index into the array.
>
> So the fundamental problem is that pointers on the server contain
> addresses that are meaningless on the client.

Almost, but not quite. While it's true that pointers on the server don't
map to pointers on the client, the problem is worse than that: you
cannot share pointers between processes *even on the same system*.

In order to share a hash table it would be necessary to mmap() the file
containing it (or use a shared or malloc-ed memory region but the issues
would be exactly the same). Unfortunately you cannot reliably ask for
any of these addresses to be mapped at a preset place, which makes
perfect sense when you think about it - what if something else is
already mapped at that location? Thus the documentation sets for mmap
and the equivalent Win32 APIs have clear warnings to the effect that
"you may request mapping at a particular virtual address but you cannot
count on that request being honored".

So the problem is not just between server and client but between all
processes whether on server or client (and in my case there are multiple
client-side processes sharing the data). This is where the "ship the
data to the client in XML, then parse it into a hash table and share it
around" proposal made by many people falls down.

Anyway, a good solution has been found which I will describe in another
sub-thread.

RM

moi

Apr 20, 2008, 1:34:42 PM
On Sun, 20 Apr 2008 13:19:26 -0400, Rex Mottram wrote:

> Charles Coldwell wrote:

> In order to share a hash table it would be necessary to mmap() the file
> containing it (or use a shared or malloc-ed memory region but the issues
> would be exactly the same). Unfortunately you cannot reliably ask for
> any of these addresses to be mapped at a preset place, which makes

No, you don't *need* to mmap the array to a fixed address.
Just make sure all its members are offsets/indexes, not pointers.
Period.
[ see my previous posts (and others) in comp.programming ]

> perfect sense when you think about it - what if something else is
> already mapped at that location? Thus the documentation sets for mmmap
> and the equivalent Win32 APIs have clear warnings to the effect that
> "you may request mapping at a particular virtual address but you cannot
> count on that request being honored".
>
> So the problem is not just between server and client but between all
> processes whether on server or client (and in my case there are multiple
> client-side processes sharing the data). This is where the "ship the
> data to the client in XML, then parse it into a hash table and share it
> around" proposal made by many people falls down.

The problem is that you failed to describe your problem.

> Anyway, a good solution has been found which I will describe in another

The solution is trivial. Presenting the problem was much harder.

AvK

Rex Mottram

Apr 20, 2008, 2:36:15 PM
Robert Klemme wrote:
> I tried to find a post of yours in the thread where you explain what
> kind of data structure you need but could not find one. The term
> "offset based hash table for ASCII data" is far too generic for me to
> get a concrete idea. Can you elaborate a bit or show a sample XML if it
> is not too large?

I don't know how many other people will be interested in reading to this
depth but for the record, here's a (long) synopsis. There are 3
elemental concepts: Activities, States, and Transactions. Activities are
operations which read some States and modify others. Every time an
Activity takes place, all affected states are logged (in a server-side
database) as a Transaction.

Now, these Activities can be pretty slow/expensive to run. Therefore we
look for opportunities to skip them. Specifically, if a previous
Transaction was run on the identical set of input States, we can skip
the new Activity and instead log a pointer back to the previous
Transaction. This is a _really_ big win when possible.

Since XML must be parsed sequentially, and since we felt that a DOM model
would be too expensive, we went to some lengths to define a schema which
could accommodate the above logic in a single (SAX) pass. The result was
something like the following 3 stanzas (naturally this will have lots of
angle brackets and I can't promise it will render correctly in HTML
readers):

<transactions>
<tx id="1" ... />
<tx id="2" ... />
.
.
.
</transactions>


<activities>
<activity id="1" ... />
<activity id="2" ... />
.
.
.
</activities>


<states>
<state id="1" ... />
<state id="2" ... />
.
.
.
</states>


This leaves out many details but the basic structure is fairly clear.
The client code parses through the Transaction section, storing a
reference to each transaction in a local hash table known as the TX
table. Then it walks through the set of activities, doing a
(necessarily) linear search for the current Activity. Once that is
matched it sets a "found" flag and chews through the rest of the
Activity stanza until reaching the States stanza.

Next, each State is compared vs current reality. Whenever a State
mismatch is encountered, the Transactions associated with the mismatched
State are deleted from the TX table. If we reach the end of the States
and the TX table is not empty, what remains are matches. If at any
intermediate point we find that the TX table has become empty, we can
abort the parse and just go run the Activity.

What makes this preferable to a DOM design is the opportunity to abort
the parse as soon as potential matches are exhausted; DOM would have
required a full parse followed by some memory accesses.

This is the current model and it works quite stably, but unfortunately
not fast enough. XML parsing proves to be the dominant time cost. Since
the Transaction stanza is small relative to the other two, which can
grow quite large, it seems quite likely that it's the linear search
aspect of the latter two stanzas that's the problem.

Thus the key requirement for the "XML replacement technology" is that it
be random access. A hash table is the obvious candidate, and that's
exactly what CDB turned out to be, but a tree-based structure such as a
red-black tree might have been fine too. Anyway, CDB turned out to be
everything I was looking for.

RM

Rex Mottram

Apr 20, 2008, 2:44:14 PM
Rex Mottram wrote:
> Having realized that what I'm looking at can be thought of as a constant
> (meaning write once, read many) database, I did some searching and
> found CDB. The original CDB package is C code which is unchanged since
> 2000 but still works fine. There's also a "TinyCDB" fork of the project
> which is newer and ported to Windows and still format-compatible with
> the original.
>
> Most interestingly, there are APIs for a number of languages including a
> jar file for Java (this version is called sg-cdb). So it looks like I
> can use the sg-cdb API to write a cdb-format file directly from Java and
> send it to the client(s), which can link with libcdb.a and navigate it
> directly.
>
> Whether I can encode my data into the key-value style of CDB (which is a
> dbm-like model) is an open question but that's my problem. For the sake
> of readers here and subsequent archive searches I thought I'd mention
> that the combination of sg-cdb on the server side and tinycdb on the
> client side may make a lot of sense for certain applications.
>
> Licensing is also simple: TinyCDB is in the public domain and sg-cdb is
> BSD licensed.

Epilogue: CDB has worked out beautifully. In essence it's exactly what
the subject line specified: a hash table contained within a file which
can be mapped and used at any memory location because it doesn't rely on
pointers. The fact that it can be written via a pure Java API (sg-cdb)
and accessed via a simple C API (TinyCDB) which handles its own file
mapping logic is a nice bonus. As are the easy license terms.

RM

Robert Klemme

Apr 20, 2008, 6:44:00 PM
On 20.04.2008 20:36, Rex Mottram wrote:
> Robert Klemme wrote:
>> I tried to find a post of yours in the thread where you explain what
>> kind of data structure you need but could not find one. The term
>> "offset based hash table for ASCII data" is far too generic for me to
>> get a concrete idea. Can you elaborate a bit or show a sample XML if
>> it is not too large?
>
> I don't know how many other people will be interested in reading to this
> depth but for the record, here's a (long) synopsis.

Thanks for elaborating!

> There are 3
> elemental concepts: Activities, States, and Transactions. Activities are
> operations which read some States and modify others. Every time an
> Activity takes place, all affected states are logged (in a server-side
> database) as a Transaction.

By "affected states" do you mean prior to or after the activity (or
both)? Is the transaction list you transfer to clients always the
complete history?

> Now, these Activities can be pretty slow/expensive to run. Therefore we
> look for opportunities to skip them. Specifically, if a previous
> Transaction was run on the identical set of input States, we can skip
> the new Activity and instead log a pointer back to the previous
> Transaction. This is a _really_ big win when possible.

So you need a quick match based on (activity, (input state1, ...input
stateN)), right?

> Since XML must be parsed sequentially and feeling that a DOM model would
> be too expensive, we went to some lengths to define a schema which could
> accommodate the above logic in a single (SAX) pass. The result was
> something like the following 3 stanzas (naturally this will have lots of
> angle brackets and I can't promise it will render correctly in HTML
> readers):
>
> <transactions>
> <tx id="1" ... />
> <tx id="2" ... />
> .
> .
> .
> </transactions>

You mention that transactions involve state but I do not see it here.

> <activities>
> <activity id="1" ... />
> <activity id="2" ... />
> .
> .
> .
> </activities>
>
>
> <states>
> <state id="1" ... />
> <state id="2" ... />
> .
> .
> .
> </states>
>
>
> This leaves out many details but the basic structure is fairly clear.
> The client code parses through the Transaction section, storing a
> reference to each transaction in a local hash table known as the TX
> table. Then it walks through the set of activities, doing a
> (necessarily) linear search for the current Activity.

How do you determine from the structure above which is the current one?
There must be something (I am) missing.

> Once that is
> matched it sets a "found" flag and chews through the rest of the
> Activity stanza until reaching the States stanza.
>
> Next, each State is compared vs current reality.

So you have an additional input, current state(s)?

> Whenever a State
> mismatch is encountered, the Transactions associated with the mismatched
> State are deleted from the TX table. If we reach the end of the States
> and the TX table is not empty, what remains are matches. If at any
> intermediate point we find that the TX table has become empty, we can
> abort the parse and just go run the Activity.

In other words: if after looking at all states the TX table is non-empty,
you do not need to run the activity (and save a lot of time).
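The elimination loop described above can be sketched as follows. This is a toy model with an invented layout (int matrices rather than the real parsed data), just to show the early-abort structure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: tx_state[t][s] holds the precondition value transaction t
 * recorded for state s; current[s] is the value observed right now. */
#define NTX 3
#define NSTATES 4

/* Returns true if at least one prior transaction matches every current
 * state value, i.e. the activity can be skipped. */
static bool can_skip(const int tx_state[NTX][NSTATES],
                     const int current[NSTATES])
{
    bool live[NTX];
    size_t remaining = NTX;
    for (size_t t = 0; t < NTX; t++)
        live[t] = true;

    for (size_t s = 0; s < NSTATES; s++) {
        for (size_t t = 0; t < NTX; t++) {
            if (live[t] && tx_state[t][s] != current[s]) {
                live[t] = false;      /* state mismatch: drop this tx */
                if (--remaining == 0)
                    return false;     /* TX table empty: abort early */
            }
        }
    }
    return true;                      /* survivors are matches */
}
```

The early return is the point: with a random-access container there is no need to finish scanning once the candidate set is empty.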

> What makes this preferable to a DOM design is the opportunity to abort
> the parse as soon as potential matches are exhausted; DOM would have
> required a full parse followed by some memory accesses.

I find DOM rarely the best choice. Even for other types of XML
processing (that do not involve early exit), SAX is usually a much better
(faster and more memory-efficient) choice.

> This is the current model and it works quite stably, but unfortunately
> not fast enough. XML parsing proves to be the dominant time cost. Since
> the Transaction stanza is small relative to the other two, which can
> grow quite large, it seems quite likely that it's the linear search
> aspect of the latter two stanzas that's the problem.
>
> Thus the key requirement for the "XML replacement technology" is that it
> be random access. A hash table is the obvious candidate, and that's
> exactly what CDB turned out to be, but a tree-based structure such as a
> red-black tree might have been fine too. Anyway, CDB turned out to be
> everything I was looking for.

And apparently you solved the encoding issue. That's good to hear.

(reading CDB)

My recommendations would have gone in a similar direction: define a
binary format that satisfies your lookup requirements and write
read/write code in Java and read-only code in C. If the files are small
enough you can memory-map them; otherwise you have to use an approach
based on indexing techniques, as CDB does. But since you found your
solution we don't need to waste more brain cycles.

Thanks again. And please let us know the outcome (aka speed
improvement) or any issues you find along the way. That way we can all
learn a bit. :-)

Kind regards

robert

Rex Mottram
Apr 20, 2008, 7:04:47 PM
Robert Klemme wrote:
> With affected states do you mean prior or post activity (or both)? Is
> the transaction list you transfer to clients always the complete history?

We _store_ both precondition states and result states of each
transaction. But of course when looking for a prior match we need only
compare preconditions.

> So you need a quick match based on (activity, (input state1, ...input
> stateN)), right?

Precisely.

> You mention that transactions involve state but I do not see it here.

I did say I was leaving out many details ... yes, in the XML there are
attributes mapping each transaction to a list of states. In fact many
attributes (represented by "...") were left out.

>> This leaves out many details but the basic structure is fairly clear.
>> The client code parses through the Transaction section, storing a
>> reference to each transaction in a local hash table known as the TX
>> table. Then it walks through the set of activities, doing a
>> (necessarily) linear search for the current Activity.
>
> How do you determine from the structure above which is the current one?
> There must be something (I am) missing.

Yes, I left that out as an implementation detail. The client determines
its current activity and then chews through the Activities stanza
looking for a match.

> So you have an additional input, current state(s)?

Both current States and current Activity are discoverable by the client
on its own, so whether that's considered an input or not is debatable
(but not worth debating).

> In other words: if after looking at all states the TX table is non-empty,
> you do not need to run the activity (and save a lot of time).

Yes.

> Thanks again. And please let us know the outcome (aka speed
> improvement) or any issues you find along the way. That way we can all
> learn a bit. :-)

Will do.

RM

Mark Space
Apr 20, 2008, 9:17:22 PM
Rex Mottram wrote:

>
> I don't know how many other people will be interested in reading to this
> depth but for the record, here's a (long) synopsis. There are 3

I'm still reading this thread too. I hope I don't need anything besides
an XML parser for exchanging data, but if I do I'm glad to know there
are other solutions out there.

Laziness is a virtue. Never code your own solution when you can
download one instead.

Rex Mottram
Apr 25, 2008, 12:05:05 PM
Rex Mottram wrote:
>> Thanks again. And please let us know the outcome (aka speed
>> improvement) or any issues you find along the way. That way we can
>> all learn a bit. :-)
>
> Will do.

As promised, the final results:

A. I've had no problems with CDB in either of the versions I'm using; it
works quite well, is clearly documented, etc. It took a few minutes to
port TinyCDB to Windows but no big deal, and I've offered the diffs to
the author.

B. In the particular test case I was profiling, a typical run of the XML
version averaged around 19-20 minutes. The CDB version is around *2*
minutes. I consider this a rousing success.

I was a little surprised, FWIW, to find that the CDB file was typically
no smaller than the XML file. Given how "wordy" XML is I'd have expected
almost any other format to be smaller, but CDB came in around the same
size. Of course these details will vary considerably with many factors
so this is just one data point.

RM

Robert Maas, http://tinyurl.com/uh3t
Apr 25, 2008, 11:18:20 PM
> From: Rex Mottram <r...@not.here>

> I'm looking for an offset-based data structure to hold character data.

I'm not sure what you mean by "offset-based data structure".
Do you mean that there are links (pointers) from one place to
another, such as a link from a *place* within one block of data to
the *front* of another block of data? For example:
Block1: imhihmgihiothmctio[ptr]pfoevpogsip
                            |
        /-------------------/
        V
Block2: erjevpigihgipgipughipru
And instead of [ptr] showing the absolute address of Block2, it
shows the difference between the address of Block2 and the address
of the pointer itself?
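(For what it's worth, that style of link is often called a self-relative
pointer. A minimal C illustration, my own toy example rather than anything
from the libraries discussed here:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A self-relative "pointer": the slot stores the distance from the slot
 * itself to the target, so the whole block works at any base address. */
struct blob {
    int32_t link;        /* offset from &link to the payload it refers to */
    char payload[12];
};

static void set_rel(int32_t *slot, const void *target)
{
    *slot = (int32_t)((const char *)target - (const char *)slot);
}

static void *get_rel(int32_t *slot)
{
    return (char *)slot + *slot;
}
```

Copying the struct "relocates" it, and the link still resolves inside the
copy, which is exactly the property needed for a mappable file format.)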

I definitely don't know what you mean by "character data" in this
context. Except for pointers linking a place within one block to
the start of another block (or whatever else you might have in mind
that I can't guess), is the rest of the content of each block a
sequence of characters? Are these characters fixed-size, such as
US-ASCII or Latin-1 or UCS-2, or variable size such as UTF-8?

> Background: I'm working with an app whose server is written in
> Java while the client is in C. The server needs to package up a
> bunch of data and ship it to the client where it will be traversed,
> read-only, many many times.

Java is certainly powerful enough to build a data-structure with
genuine pointers between the various pieces, then compile each
atomic block of data (not including any links) into a pure array of
data, then create an abstract layout into a linear sequence which
includes handles to each atomic block and tokens where each link
from one place to another block is needed, then measure the sizes
of all those atomic blocks and use that information to compute
where each block and each link-place would start if laid out per
that linear model, hence compute the offset for each link, and
finally allocate a large block and copy all the atomic blocks and
link-offsets into the large block per the linear model. And then
it's trivial to write that array out to a file or communication
channel verbatim.

Can you please give me a small example of a structure that you
might need to process in this way? Perhaps three to five atomic
blocks of data, no more than ten or so bytes of data within each
atomic block, with two to four (N-1) links connecting them into a
structure, or maybe more than N-1 links if you allow some blocks
to have more than one link pointing to them. We could use that example for
illustration of the linear-model algorithm, and eventually as test
data for any code that would be written.

> There's actually a working implementation which sends the data in
> an XML format, but it's too slow and profiling reveals that it
> spends almost all its time in XML parsing, so I need to replace the
> XML with something more natural to the purpose.

When you say "IT" is too slow, and when you say "IT" spends almost
all its time in XML parsing, does "IT" mean the C code to receive
the XML data and convert to internal data structure?

> The client starts thousands of child processes which each start
> by parsing the XML document in order to place its data in hash
> tables, after which it is accessed via the hash table only.

Are all the child processes on one machine, or distributed over
many hosts all over the Internet?

I can see how parsing XML more than once for copies of exactly the
same data structure would be a wasteful practice.

> So the XML is very much an intermediate format, and the obvious
> win would be to have the Java server package the data in a form
> which is directly usable by the clients.

You need to tell us what forms of data *are* usable by the clients.
For example, if a C program were given the total size of all the
atomic blocks of data plus all the machine-absolute pointers needed
for all the links from a place within a non-atomic block to the
start of another block, would the C program be able to malloc a
single block of that size, then sub-allocate to build all the
blocks and links between them? Then would the rest of the program
be able to run using that in-RAM layout of data? If so, it seems to
me the following protocol would suffice:
- Transmit an integer telling the total size of the data block needed.
- Transmit a block of bytes of that size, containing all the atomic
blocks of data in correct relative position, with zero bytes for
all the places where pointers are needed.
- Transmit an integer telling the total number of links.
- Transmit a block consisting of that many (place,target) pairs,
each relative to the start of the overall block.
The C client then reads that first integer, mallocs a block of that
size, and block-reads that many bytes into it. Next it reads the second
integer, computes the size of an array holding that many pairs,
allocates a second block of that size, and block-reads into it. It then
steps through the second block, fetching each (relPlace,relTarget) pair,
adding each value to the address of the start of the first block to
yield (absPlace,absTarget), and executing *absPlace=absTarget. Finally
the second block is freed. So that's 6 lines of code to do all the
input, a loop of 3 lines of code (or one line with a nested expression)
to convert and patch all the links, and one line of code to free the
second block.
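The patch loop can be sketched like this (my own toy encoding of the
(relPlace,relTarget) pairs; memcpy sidesteps any alignment issues at
the pointer slots):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Each link is a (relPlace, relTarget) pair: both are offsets from the
 * start of the transmitted block.  Patching writes the absolute address
 * base+relTarget into the pointer-sized slot at base+relPlace. */
struct link { size_t rel_place, rel_target; };

static void patch_links(char *base, const struct link *links, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        char *abs_target = base + links[i].rel_target;
        memcpy(base + links[i].rel_place, &abs_target, sizeof abs_target);
    }
}
```

After patching, the rest of the client can chase ordinary pointers in
the block without knowing the data ever lived anywhere else.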

> Presumably the data would arrive from the server and be stored in
> a file: each client would then map the file into its address space

You seem to be implying that all the clients are running on just
one computer, possibly multiple CPUs sharing RAM, whereby it's
possible to map one block of RAM into *all* the CPUs?

> and treat it as a hash table (or similar data structure) right
> away.


Why a hash table??? Why do you need a hash table???
You need to explain the abstract model of your data structure
better so that we can decide whether a simple linked structure (as
I assumed above), or a hash table is most appropriate for your
needs, or possibly multiple independent linked structures with
hash-table entries for each, or possibly multiple linked structures
that share some of their structure with a hash table that points
into various blocks within the overall structure, is most
appropriate.

> The traditional issue with transportable data structures is that
> since the client can't reliably control what address the data is
> mapped to, all addressing must be made relative to the starting
> point.

Agreed, but as soon as you have a hash table you can't even say
what the relative locations are between various parts of the
overall data corpus. If a hash table is used *only* for
key-to-location lookup by the user interface, and all the corpus of
data is pure linked-structure, then all you need is one large block
of binary data (per the model I described earlier) plus a list of
(key,location) pairs which can then be installed in the hash table.
relLocation values would be transmitted, and the client would
convert each to absLocation.

I think you definitely need to create a *complete* toy example of
what kind of data corpus structure you wish to transmit from server
to client and build into disk file and thence into RAM on client,
so that we can see what you are really wanting.

> Does anyone know of an implementation of such a format which can
> be generated in Java and consumed in C code?

Before that can be answered, we need a better idea precisely what
kind of data corpus structure you wish to transmit from server to
client. It would do no good for me to spend one hour writing the
server code and five minutes to write the client code, for the
model I described above, if that wouldn't serve your needs.

Robert Klemme
Apr 27, 2008, 6:11:55 AM
On 25.04.2008 18:05, Rex Mottram wrote:
> Rex Mottram wrote:
>>> Thanks again. And please let us know the outcome (aka speed
>>> improvement) or any issues you find along the way. That way we can
>>> all learn a bit. :-)
>>
>> Will do.
>
> As promised, the final results:

Thank you for the summary!

> A. I've had no problems with CDB in either of the versions I'm using; it
> works quite well, is clearly documented, etc. It took a few minutes to
> port TinyCDB to Windows but no big deal, and I've offered the diffs to
> the author.

Thank you, that way the community can benefit as well.

> B. In the particular test case I was profiling, a typical run of the XML
> version averaged around 19-20 minutes. The CDB version is around *2*
> minutes. I consider this a rousing success.

Indeed!

> I was a little surprised, FWIW, to find that the CDB file was typically
> no smaller than the XML file. Given how "wordy" XML is I'd have expected
> almost any other format to be smaller, but CDB came in around the same
> size. Of course these details will vary considerably with many factors
> so this is just one data point.

IIRC CDB uses hashing. At least, in-memory hash tables usually allocate
more slots than there are entries, so that might be the case here as
well. I am sure a closer look at the documentation of the file format
would reveal the reason, but since you said that bandwidth was not an
issue I guess it's not worth bothering. Still a good hint for others.

Kind regards

robert
