Parse unicode paths

Mark Woan

Sep 6, 2013, 3:50:50 PM

to const...@googlegroups.com

Hi,

I am trying to parse Windows prefetch files, which contain a set of Unicode strings located at a specific offset. Each string has the usual two-null-byte terminator, and then there is a further null. I do have a total count of the paths involved.

An example path is:

\DEVICE\HARDDISKVOLUME2\WINDOWS\SYSTEM32\NTDLL.DLL

I have tried the following:

Pointer(lambda ctx: ctx.start + ctx.path_offsets, RepeatUntil(lambda obj, ctx: len(obj.file_name) == 0, AccessedFile))

Where AccessedFile is:

AccessedFile = Struct("accessed_file",
    CString(name='file_name', encoding='utf16')
)

But that doesn't work because the data is UTF-16: all I get back is '\\', since CString uses a single null as its default terminator and works by comparing one character at a time.
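To illustrate why only '\\' comes back: in little-endian UTF-16 every ASCII character is followed by a zero byte, so a byte-wise search for a single null stops right after the first character. The following is a minimal plain-Python sketch of that effect. It does not use Construct at all, and `read_until_single_null` is a made-up helper that only mimics a single-byte terminator search.

```python
# Hypothetical helper: mimics a parser that stops at the first single null byte.
def read_until_single_null(data):
    end = data.find(b"\x00")
    return data if end == -1 else data[:end]

path = "\\DEVICE\\HARDDISKVOLUME2\\WINDOWS\\SYSTEM32\\NTDLL.DLL"
# Encode as little-endian UTF-16 and append the two-byte terminator.
raw = path.encode("utf-16-le") + b"\x00\x00"

first_chunk = read_until_single_null(raw)
print(raw[:4])       # b'\\\x00D\x00': each character is followed by a zero byte
print(first_chunk)   # b'\\': the search stopped at the zero byte of the first character
```

Any terminator check for this data therefore has to look at whole two-byte code units rather than single bytes.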

Then I tried the following, but then I end up with characters from the next path!

AccessedFile = Struct("accessed_file",
    RepeatUntil(lambda obj, ctx: obj == '\x00\x00\x00', Field("garbage", 3)),
)

I suspect there is an easy way that I am missing?

Thanks

Mark

Tomer Filiba

Sep 7, 2013, 10:28:03 AM

to Mark Woan, const...@googlegroups.com

Sounds like you should be using the Tunnel adapter, to first decode the stream as UTF-16 and then process it. Tunnel lets you do that: it makes two passes over the data. But if you don't know the length of the data in advance, you'll have to make yet another pass to identify the strings first. In such a case, I'd suggest writing a custom construct class, e.g., UTF16CString. It sounds to me like that will be easier (and more efficient).
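The core of such a custom class would be a reader that consumes the stream two bytes at a time and stops at a 0x0000 code unit. Below is a rough sketch in plain Python, assuming little-endian UTF-16 and not tied to any particular Construct version. `read_utf16_cstring` is a hypothetical name, and the second sample path is invented for the example.

```python
import io

def read_utf16_cstring(stream):
    """Read UTF-16LE code units from `stream` until a 0x0000 unit (or end of data)."""
    units = bytearray()
    while True:
        unit = stream.read(2)  # always consume one whole two-byte code unit
        if len(unit) < 2 or unit == b"\x00\x00":
            break
        units += unit
    return units.decode("utf-16-le")

# Sample data: two null-terminated paths followed by one extra terminator.
paths = [
    "\\DEVICE\\HARDDISKVOLUME2\\WINDOWS\\SYSTEM32\\NTDLL.DLL",
    "\\DEVICE\\HARDDISKVOLUME2\\WINDOWS\\SYSTEM32\\KERNEL32.DLL",  # invented
]
blob = b"".join(p.encode("utf-16-le") + b"\x00\x00" for p in paths) + b"\x00\x00"

stream = io.BytesIO(blob)
parsed = [read_utf16_cstring(stream) for _ in range(len(paths))]
print(parsed)
```

Because the reader never looks at a single byte in isolation, the zero high bytes inside the text cannot be mistaken for a terminator, and the stream is left positioned right after each string.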

-tomer

-----------------------------------------------------------------

Tomer Filiba
tomerfiliba.com


Mark Woan

Sep 9, 2013, 6:50:21 AM

to const...@googlegroups.com, Mark Woan

Thanks for the reply. I guess I assumed that someone might have already run into the same problem and had a solution that used the defined objects within the framework!

In the end my solution was:

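from construct import Construct

def unpack_wstring(buffer, offset):
    # Find the first pair of null bytes at or after offset.
    end = buffer.find(b"\x00\x00", offset)
    if end - 2 <= offset:
        return ""
    length = end - offset
    try:
        return buffer[offset:offset + length].decode("utf16").partition("\x00")[0]
    except UnicodeDecodeError:
        # Decoding failed (e.g. an odd-length slice), so retry with one more byte.
        return buffer[offset:offset + length + 1].decode("utf16").partition("\x00")[0]

class UTF16CString(Construct):
    def _parse(self, stream, context):
        # Works on the whole buffer, so the stream must support getvalue().
        contents = stream.getvalue()
        accessed_files = []
        offset = context.path_offsets
        for i in range(context.path_count):
            path = unpack_wstring(contents, offset)
            # Skip the string: two bytes per character plus the two-byte terminator.
            length = (len(path) * 2) + 2
            offset += length
            accessed_files.append(path)
            # Stop once the offset runs past the end of the path block.
            if offset > context.path_offsets + context.size:
                break
        return accessed_files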
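A note on the byte-level search: `find` with a two-byte pattern can match at an odd position, where the zero high byte of the last character meets the first byte of the terminator. That appears to be the case the `UnicodeDecodeError` fallback above deals with. As a hedged alternative, the sketch below steps through the buffer one code unit at a time. `find_utf16_terminator` is a hypothetical helper, not part of Construct or of the solution above.

```python
def find_utf16_terminator(buffer, offset):
    """Return the index of the first 0x0000 code unit at or after `offset`,
    checking only positions a whole number of code units away from it."""
    for pos in range(offset, len(buffer) - 1, 2):
        if buffer[pos:pos + 2] == b"\x00\x00":
            return pos
    return -1

raw = "AB".encode("utf-16-le") + b"\x00\x00"   # bytes: 41 00 42 00 00 00

naive = raw.find(b"\x00\x00", 0)               # matches across 'B' and the terminator
aligned = find_utf16_terminator(raw, 0)

print(naive, aligned)                          # 3 4
print(raw[:aligned].decode("utf-16-le"))       # AB
```

With an aligned search the slice before the terminator always has an even length, so it can be decoded directly.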

Arek Bulski

Sep 26, 2016, 11:41:11 AM

to Construct

In Construct 2.8 the string fields are kind of broken with respect to utf16 and utf32. If you submit a feature request and provide a clear example, fixes will start rolling.
