Out of line strings


Paul Moore

Sep 12, 2016, 10:46:16 AM
to Construct
I'm trying to define a structure that contains a fixed-length block of data, followed by a block of "string data". In C, what I want is something like

    struct {
        size_t exe_offset;
        size_t cwd_offset;
        char data[1];
    }

In Python, I'd have an object:

    data = object()
    data.exe = b'/the/name/of/the/exe'
    data.cwd = b'/the/cwd/to/use'

and what I want is a Struct definition that results in

    0, /* offset of exe from the start of data */
    21, /* offset of cwd */
    "/the/name/of/the/exe\0/the/cwd/to/use\0" /* The character data */

(Ideally, I'd also like None to translate to an offset of 0 and no character data, but I think I can work out how to do that myself.)

I can't find anything obvious in the docs on how to do something like this. Can anyone help?


Arek Bulski

Sep 26, 2016, 10:29:25 AM
to Construct
Look at Struct() with two Int32ub fields and one CString() or Bytes().

Paul Moore

Oct 3, 2016, 11:40:14 AM
to Construct
On Monday, 26 September 2016 15:29:25 UTC+1, Arek Bulski wrote:
Look at Struct() with two Int32ub fields and one CString() or Bytes().

That seems like it would just expose the offsets and the data block. What I was hoping for was something that would expose the content as two strings.

If all I can get is 2 numbers and some data, and I have to do the slicing of the data block myself, then so be it. But I don't need Construct for that, I can do it just as well with the stdlib struct module.
Paul

Paul Moore

Oct 3, 2016, 11:52:40 AM
to Construct
Sorry, I should probably give a better example.

What I was hoping for is something along the lines of

    >>> format = Struct(
    >>>     # ... something here that does what I want, defining fields "exe", and "cwd"
    >>> )
 
    >>> format.build(dict(exe=b"a/b/c", cwd=b"d/e/f"))
    b'\x00\x00\x00\x00\x00\x00\x00\x06a/b/c\0d/e/f\0'

    >>> c = format.parse(b'\x00\x00\x00\x00\x00\x00\x00\x06a/b/c\0d/e/f\0')
    >>> c.exe
    b'a/b/c'
    >>> c.cwd
    b'd/e/f'

(Bonus points for being able to specify in the format object that exe and cwd are strings encoded in UTF-8 and then be able to just access c.exe and c.cwd and get Python string objects, and being able to pass strings in the build call).

Paul

Benjamin

Oct 3, 2016, 2:23:57 PM
to construct3
Paul, your C struct defines where in a stream of data the pieces you're looking for are. Construct defines the content of the stream of data in its entirety.

Now, based on your example, assuming the two strings are null-terminated, adjacent, and null prefixed, up until the (redundant) offset:

In [18]: s = Struct(
    ...:     GreedyRange(Const(b'\x00')),
    ...:     'redundant_offset' / Byte,
    ...:     'exe' / CString(encoding='utf8'),
    ...:     'cwd' / CString(encoding='utf8'))

In [19]: c = s.parse(b'\x00\x00\x00\x00\x00\x00\x00\x06a/b/c\0d/e/f\0')

In [20]: c.exe
Out[20]: 'a/b/c'

In [21]: c.cwd
Out[21]: 'd/e/f' 

In [22]: c.redundant_offset
Out[22]: 6



Paul Moore

Oct 4, 2016, 3:55:23 AM
to Construct
On Monday, 3 October 2016 19:23:57 UTC+1, Benjamin Riggs wrote:
Paul, your C struct defines where in a stream of data the pieces you're looking for are. Construct defines the content of the stream of data in its entirety.

Now, based on your example, assuming the two strings are null-terminated, adjacent, and null prefixed, up until the (redundant) offset:

In [18]: s = Struct(
    ...:     GreedyRange(Const(b'\x00')),
    ...:     'redundant_offset' / Byte,
    ...:     'exe' / CString(encoding='utf8'),
    ...:     'cwd' / CString(encoding='utf8'))

In [19]: c = s.parse(b'\x00\x00\x00\x00\x00\x00\x00\x06a/b/c\0d/e/f\0')


No, sorry, you've missed the point. There's no guarantee in the data block that exe comes before cwd. Or that there isn't unneeded data in the data block. The structure is defined as:

1. An offset within the data block where "exe" starts.
2. An offset within the data block where "cwd" starts.
3. The data block.

The offsets are not redundant, they define which parts of the data block are relevant. And the data block is meaningless without the offsets.

All of the following data blocks should produce the same result: exe = b'a/b/c', cwd = b'd/e/f'

    \x00\x00\x00\x00\x00\x00\x00\x06a/b/c\x00d/e/f\x00
    \x00\x00\x00\x06\x00\x00\x00\x00d/e/f\x00a/b/c\x00
    \x00\x00\x00\x06\x00\x00\x00\x10xxxxxxd/e/f\x00xxxxa/b/c\x00

The structure is defined in such a way that it's easy to process in C - it's a fixed format structure with a single varying length block at the end. And all the pointers that you need can be calculated by adding offsets to the pointer to the start of the variable length portion. But it's less straightforward to process in Python, hence my hope that Construct would do the work for me :-)
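
A minimal sketch of the layout described above, using only the stdlib struct module rather than Construct. It assumes the two offsets are 4-byte big-endian unsigned integers (which is what the example bytes suggest) and that each offset counts from the start of the data block. The names parse_blob and cstring_at are hypothetical, chosen for illustration.

```python
import struct

# Assumption: two 4-byte big-endian unsigned offsets, then the data block.
HEADER = struct.Struct(">II")


def parse_blob(blob):
    """Return (exe, cwd) as bytes, resolving each offset inside the data block."""
    exe_offset, cwd_offset = HEADER.unpack_from(blob, 0)
    data = blob[HEADER.size:]

    def cstring_at(offset):
        # A string runs from its offset up to (not including) the next NUL.
        end = data.index(b"\x00", offset)  # ValueError if unterminated
        return data[offset:end]

    return cstring_at(exe_offset), cstring_at(cwd_offset)


first = b"\x00\x00\x00\x00\x00\x00\x00\x06a/b/c\x00d/e/f\x00"
second = b"\x00\x00\x00\x06\x00\x00\x00\x00d/e/f\x00a/b/c\x00"

print(parse_blob(first))   # (b'a/b/c', b'd/e/f')
print(parse_blob(second))  # (b'a/b/c', b'd/e/f')
```

The order of the strings in the data block does not matter here, because each string is located purely by its offset.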

Paul

Benjamin

Oct 4, 2016, 12:44:14 PM
to construct3
Those three examples aren't consistent; nothing declarative could parse all of those. (Nothing that doesn't make repeated guesses could parse those, in fact.) First and foremost, the 'offset' values clearly aren't encoded in your data. Without those very, very important data points, the information required to parse this isn't self-contained. Construct is a parser, not a heuristics engine.

If you were to make the first two bytes of the data-structure be the two offset values (from the start of the data), what you're suggesting would be simple:

In [23]: s = Struct(
    ...:     'e' / Byte,
    ...:     'c' / Byte,
    ...:     'exe' / Pointer(this.e, CString(encoding='utf8')),
    ...:     'cwd' / Pointer(this.c, CString(encoding='utf8')))

In [24]: c = s.parse(b'\x18\x0e\x00\x06\x00\x00\x00\x10xxxxxxd/e/f\x00xxxxa/b/c\x00')

In [25]: c.exe
Out[25]: 'a/b/c'

In [26]: c.cwd
Out[26]: 'd/e/f'


Paul Moore

Oct 5, 2016, 10:29:03 AM
to Construct
On Tuesday, 4 October 2016 17:44:14 UTC+1, Benjamin Riggs wrote:
Those three examples aren't consistent; nothing declarative could parse all of those. (Nothing that doesn't make repeated guesses could parse those, in fact.) First and foremost, the 'offset' values clearly aren't encoded in your data. Without those very, very important data points, the information required to parse this isn't self-contained. Construct is a parser, not a heuristics engine.

Having the offsets named in the output structure is fine for me. I just don't have any interest in *using* the offsets directly so I didn't give them names in my description.

The examples are, I believe, all consistent. 2 4-byte offsets, then a data block with 0-terminated strings at the given offsets. The offsets differ but the strings (which is what I care about) are the same in all 3 cases.

 
If you were to make the first two bytes of the data-structure be the two offset values (from the start of the data), what you're suggesting would be simple:

In [23]: s = Struct(
    ...:     'e' / Byte,
    ...:     'c' / Byte,
    ...:     'exe' / Pointer(this.e, CString(encoding='utf8')),
    ...:     'cwd' / Pointer(this.c, CString(encoding='utf8')))


Apart from using 1-byte offsets rather than 4-byte, that's what I was looking for. Thanks, and sorry if my explanation was confusing - I swear I didn't think it was, but of course if you didn't understand it, it was confusing by definition :-(

It's not something I care about in the particular case I needed this for, but can that definition *build* the structure as well? As in s.build(dict(exe=b'a/b', cwd=b'c/d'))? In particular, I'd like to be able to omit e and c, and have Construct infer them (in the "obvious" way - the first one, e, is 0, the second, c, is e + len(exe) + 1 (for the null byte)). Clearly the logic of how to calculate e and c would need to be written out, but can that be done in the definition of s, rather than each time in the call to build? It's not that big a deal if it can't, it's just that I see Construct as "defining a two-way mapping between binary structures and Python objects" and that seems a natural thing to expect from that viewpoint.
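
To make the "obvious" offset rule concrete, here is a small stdlib-only sketch of the arithmetic. It is not a Construct definition; it only shows the bytes such a build would be expected to produce, assuming 4-byte big-endian offsets measured from the start of the data block. The function name build_blob is hypothetical.

```python
import struct


def build_blob(exe, cwd):
    """Pack exe and cwd back to back, with offsets inferred from their lengths."""
    e = 0                  # exe starts at the beginning of the data block
    c = e + len(exe) + 1   # cwd starts right after exe and its NUL terminator
    # Assumption: offsets are 4-byte big-endian unsigned integers.
    return struct.pack(">II", e, c) + exe + b"\x00" + cwd + b"\x00"


print(build_blob(b"a/b", b"c/d"))
# b'\x00\x00\x00\x00\x00\x00\x00\x04a/b\x00c/d\x00'
```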

Paul

Benjamin

Oct 5, 2016, 4:13:58 PM
to construct3
I believe I was confused because, in your third example, the pointer values are reversed, so I didn't realize they were supposed to be int32s counting from after the two pointers. With your given examples (ignoring the flipped pointers), this works:

In [27]: s = Struct(
    ...:     'e' / Int32ub,
    ...:     'c' / Int32ub,
    ...:     'exe' / Pointer(this.e + 8, CString(encoding='utf8')),
    ...:     'cwd' / Pointer(this.c + 8, CString(encoding='utf8')))

In [28]: c = s.parse(b'\x00\x00\x00\x06\x00\x00\x00\x10xxxxxxd/e/f\x00xxxxa/b/c\x00')

In [29]: c.exe
Out[29]: 'd/e/f'

In [30]: c.cwd
Out[30]: 'a/b/c'

In [31]: c = s.parse(b'\x00\x00\x00\x06\x00\x00\x00\x00d/e/f\x00a/b/c\x00')

In [32]: c.exe
Out[32]: 'a/b/c'

In [33]: c.cwd
Out[33]: 'd/e/f'

While it technically may be possible to create something in construct which would figure out the pointers, it's much more sane to simply make a helper function to do it (just watch out for off-by-1 errors):

In [38]: def data(exe, cwd):
    ...:     return {'exe': exe, 'cwd': cwd, 'e': 0, 'c': len(exe)+1}
    ...: 

In [39]: s.build(data('a/b/c', 'd/e/f'))
Out[39]: b'\x00\x00\x00\x00\x00\x00\x00\x06a/b/c\x00d/e/f\x00'

Or:

In [40]: def data(exe, cwd):
    ...:     return {'exe': exe, 'cwd': cwd, 'e': len(cwd)+1, 'c': 0}
    ...: 

In [41]: s.build(data('a/b/c', 'd/e/f'))
Out[41]: b'\x00\x00\x00\x06\x00\x00\x00\x00d/e/f\x00a/b/c\x00'

Interestingly, Construct will even fill extra space with null:

In [42]: def data(exe, cwd):
    ...:     return {'exe': exe, 'cwd': cwd, 'e': 5+len(cwd)+1, 'c': 2}
    ...: 

In [43]: s.build(data('a/b/c', 'd/e/f'))
Out[43]: b'\x00\x00\x00\x0b\x00\x00\x00\x02\x00\x00d/e/f\x00\x00\x00\x00a/b/c\x00'
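
The null fill probably comes from the underlying stream rather than from any padding logic in Construct. The sketch below reproduces the same bytes with a plain io.BytesIO, on the assumption that build() writes into an in-memory stream and that Pointer seeks to its target offset before writing. When a BytesIO is written to at a position past its current end, the gap is filled with zero bytes.

```python
import io

stream = io.BytesIO()

# Header: e = 11 and c = 2, as two 4-byte big-endian integers.
stream.write(b"\x00\x00\x00\x0b\x00\x00\x00\x02")

# exe is written at 8 + e, past the current end of the stream,
# so BytesIO fills the gap with zero bytes.
stream.seek(8 + 11)
stream.write(b"a/b/c\x00")

# cwd is written at 8 + c, overwriting part of that zero-filled gap.
stream.seek(8 + 2)
stream.write(b"d/e/f\x00")

result = stream.getvalue()
print(result)
# b'\x00\x00\x00\x0b\x00\x00\x00\x02\x00\x00d/e/f\x00\x00\x00\x00a/b/c\x00'
```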




Paul Moore

Oct 6, 2016, 4:23:05 AM
to Construct
On Wednesday, 5 October 2016 21:13:58 UTC+1, Benjamin Riggs wrote:
I believe I was confused because, in your third example, the pointer values are reversed, so I didn't realize they were supposed to be int32s counting from after the two pointers. With your given examples (ignoring the flipped pointers), this works:

Whoops. You're right - I was trying to demonstrate that padding would be ignored, but the padding I used confused me as to which offset was which. My apologies. One of the reasons I don't want to have to construct binary data structures by hand, I guess :-)
 
While it technically may be possible to create something in construct which would figure out the pointers, it's much more sane to simply make a helper function to do it (just watch out for off-by-1 errors):

In [38]: def data(exe, cwd):
    ...:     return {'exe': exe, 'cwd': cwd, 'e': 0, 'c': len(exe)+1}
    ...: 

In [39]: s.build(data('a/b/c', 'd/e/f'))

That's a good point, and one I hadn't thought of. That is indeed a perfectly sensible solution.

Thanks for your patience and help with this.
Paul