Convert raw binary file to ascii

r2

Jul 27, 2009, 8:29:23 AM

I have a memory dump from a machine I am trying to analyze. I can view
the file in a hex editor to see text strings in the binary code. I
don't see a way to save these ascii representations of the binary, so
I went digging into Python to see if there were any modules to help.

I found one I think might do what I want it to do - the binascii
module. Can anyone describe to me how to convert a raw binary file to
an ascii file using this module? I've tried. Boy, I've tried.

Am I correct in assuming I can get the converted binary to ascii text
I see in a hex editor using this module? I'm new to this forensics
thing and it's quite possible I am mixing technical terms. I am not
new to Python, however. Thanks for your help.

Peter Otten

Jul 27, 2009, 9:06:17 AM

r2 wrote:

> I have a memory dump from a machine I am trying to analyze. I can view
> the file in a hex editor to see text strings in the binary code. I
> don't see a way to save these ascii representations of the binary, so
> I went digging into Python to see if there were any modules to help.
>
> I found one I think might do what I want it to do - the binascii
> module. Can anyone describe to me how to convert a raw binary file to
> an ascii file using this module? I've tried. Boy, I've tried.

That won't work: binascii converts between binary data and ASCII encodings
of it, such as hex or base64; it does not pick out text. A text editor
doesn't need any help to convert bytes into characters. If it expects ASCII,
it will simply be puzzled by bytes that are not valid ASCII. It will also
happily display byte sequences that happen to be valid ASCII but look like
gibberish to you, because the program that wrote them meant them as binary
data.
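
To make the point concrete, here is a small Python 3 sketch of what binascii does with a few bytes. The sample bytes are made up for illustration and do not come from a real dump; hexlify() and unhexlify() are the standard binascii functions.

```python
import binascii

# Made-up sample: a short text fragment followed by three non-text bytes.
raw = b"PATH=/bin\x00\x01\xfe"

# hexlify() rewrites every byte as two hex digits. It re-encodes the
# whole input; it does not filter out the readable parts.
hex_text = binascii.hexlify(raw)
print(hex_text)

# unhexlify() reverses the encoding and returns the original bytes.
round_trip = binascii.unhexlify(hex_text)
print(round_trip == raw)
```

The result is a longer, purely hexadecimal copy of the same data, which is further from readable text than the dump itself.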

> Am I correct in assuming I can get the converted binary to ascii text
> I see in a hex editor using this module? I'm new to this forensics
> thing and it's quite possible I am mixing technical terms. I am not
> new to Python, however. Thanks for your help.

Unix has the "strings" command-line tool to extract text from a binary.
Get hold of a copy of the MinGW tools if you are on Windows.

Peter

Grant Edwards

Jul 27, 2009, 10:11:21 AM

On 2009-07-27, r2 <rlichl...@gmail.com> wrote:

> I have a memory dump from a machine I am trying to analyze. I can view
> the file in a hex editor to see text strings in the binary code. I
> don't see a way to save these ascii representations of the binary,

$ strings memdump.binary >memdump.strings

$ hexdump -C memdump.binary >memdump.hex+ascii
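
For comparison, a rough Python 3 sketch in the spirit of the hexdump command above. The function name hexdump_lines and its width parameter are made up for this example, and the layout only approximates what hexdump -C prints.

```python
def hexdump_lines(data, width=16):
    """Yield one text line per `width` bytes: offset, hex bytes, ASCII column."""
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("%02x" % b for b in chunk)
        # Printable ASCII is shown as-is; every other byte becomes a dot.
        text_part = "".join(chr(b) if 0x20 <= b < 0x7f else "." for b in chunk)
        # Pad the hex column so the ASCII column lines up on a short last line.
        yield "%08x  %-*s  |%s|" % (offset, width * 3 - 1, hex_part, text_part)
```

It could be fed with open("memdump.binary", "rb").read(), printing each yielded line. That reads the whole file into memory, which may not be practical for a large dump.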

--
Grant Edwards grante Yow! I'm ZIPPY the PINHEAD
at and I'm totally committed
visi.com to the festive mode.

r2

Jul 27, 2009, 1:25:18 PM

Okay. Thanks for the guidance. I have a machine with Linux, so I
should be able to do what you describe above. Could Python extract the
strings from the binary as well? Just wondering.

r2

Jul 27, 2009, 1:25:44 PM

On Jul 27, 10:11 am, Grant Edwards <invalid@invalid> wrote:

Grant,

Thanks for the commands!

Peter Otten

Jul 27, 2009, 2:07:37 PM

r2 wrote:

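As a special service for you, here is a naive implementation to build upon:

#!/usr/bin/env python
import sys

# Translation table: keep printable ASCII and tab, map every other byte to NUL.
wanted_chars = ["\0"]*256
for i in range(32, 127):
    wanted_chars[i] = chr(i)
wanted_chars[ord("\t")] = "\t"
wanted_chars = "".join(wanted_chars)

THRESHOLD = 4  # minimum length of a run worth printing

# Unwanted bytes are now NULs, so splitting on NUL yields the text runs.
for s in sys.stdin.read().translate(wanted_chars).split("\0"):
    if len(s) >= THRESHOLD:
        print s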
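
The script above is written for Python 2. The same translate-and-split idea might look like this in Python 3, where the dump has to be handled as bytes; the names KEEP, TABLE and extract_strings are made up for this sketch.

```python
THRESHOLD = 4  # same minimum run length as in the script above

# 256-entry table: printable ASCII and tab map to themselves, the rest to NUL.
KEEP = set(range(32, 127)) | {ord("\t")}
TABLE = bytes(b if b in KEEP else 0 for b in range(256))

def extract_strings(data, threshold=THRESHOLD):
    """Return the runs of wanted bytes in `data` that are long enough."""
    runs = data.translate(TABLE).split(b"\0")
    return [run.decode("ascii") for run in runs if len(run) >= threshold]
```

To use it as a filter, the input would come from sys.stdin.buffer.read() rather than sys.stdin.read(), since the text-mode stream would try to decode the binary data.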

Peter

r2

Jul 27, 2009, 3:33:21 PM

Perfect! Thanks.

Dave Angel

Jul 27, 2009, 4:07:16 PM

to r2, pytho...@python.org

Yes, you could do the same thing in Python easily enough. And with the
advantage that you could define your own meanings for "characters."

The memory dump could be storing characters that are strictly ASCII. Or
it could have EBCDIC, or UTF-8. It could also be Unicode in a 16-bit or
32-bit encoding, big-endian or little-endian. Or the characters could be
in some other format specific to a particular program.
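
As one example of such a variation, here is a Python 3 sketch that looks for 16-bit little-endian text. It assumes the text is UTF-16LE limited to printable ASCII, so that each character appears as one printable byte followed by a NUL byte; the names WIDE_RUN and wide_strings are made up for this sketch.

```python
import re

# Assumption: each character is a printable ASCII byte followed by NUL,
# which is how plain English text looks when stored as UTF-16LE.
WIDE_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){4,}")

def wide_strings(data):
    """Return runs of four or more such characters, decoded to str."""
    return [m.group().decode("utf-16-le") for m in WIDE_RUN.finditer(data)]
```

A big-endian variant would swap the two bytes in the pattern and decode with "utf-16-be". Neither pattern finds non-ASCII characters.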

However, it's probably very useful to see what a "strings" program might
look like, because you can quickly code variations on it, to suit your
particular data.
Something like the following (totally untested):

def isprintable(char):
    # char is a 1-character string; 0x7f (DEL) is not printable
    return 0x20 <= ord(char) < 0x7f

def string(filename):
    data = open(filename, "rb").read()
    count = 0
    line = ""
    for ch in data:
        if isprintable(ch):
            count += 1
            line = line + ch
        else:
            # cutoff: don't print strings smaller than this,
            # because they're probably just coincidence
            if count > 4:
                print line
            count = 0
            line = ""
    if count > 4:  # a run that reaches the end of the file
        print line

Now you can change the definition of what's "printable" and the minimum
length that you care about. And of course you can fine-tune things like
the maximum line length.

DaveA

Jan Kaliszewski

Jul 27, 2009, 5:09:14 PM

to Grant Edwards, pytho...@python.org

Hello Friends,

It's my first post to python-list, so first let me introduce myself...
* my name is Jan Kaliszewski,
* country -- Poland,
* occupation -- composer (studied at the F. Chopin Academy of Music in
  Warsaw) and programmer (currently at the Record System company, working
  on Anakonda -- an ERP system for big companies [developed in Python + WX
  + Postgres]).

Now, to the matter...

27-07-2009 Grant Edwards <inv...@invalid.chopin.edu.pl> wrote:

> On 2009-07-27, r2 <rlichl...@gmail.com> wrote:
>
>> I have a memory dump from a machine I am trying to analyze. I can view
>> the file in a hex editor to see text strings in the binary code. I
>> don't see a way to save these ascii representations of the binary,
>
> $ strings memdump.binary >memdump.strings
>
> $ hexdump -C memdump.binary >memdump.hex+as

Do you (r2) want to get ASCII substrings (i.e. extract only those pieces
of the file that consist of ASCII codes, that is, 7-bit values in the
range 0...127), or rather a "possibly readable ASCII representation" of
the whole file, with printable ASCII characters preserved as is and
non-printable/non-ASCII characters replaced with their codes (e.g. in
'\x...' notation)?

If the latter, you probably want something like this:

import codecs
with open('memdump.binary', 'rb') as source:
    with open('memdump.txt', 'w') as target:
        for quasiline in codecs.iterencode(source, 'string_escape'):
            target.write(quasiline)
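
The 'string_escape' codec used above exists only in Python 2. A roughly similar result in Python 3 might be obtained as sketched below; the function name escape_bytes is made up, and the output is not guaranteed to match 'string_escape' in every detail (quote characters, for instance, are left unescaped here).

```python
def escape_bytes(data):
    """Return `data` as text, with non-printable bytes written as escapes."""
    # latin-1 maps each byte value 0..255 to the code point of the same
    # value, so unicode_escape then writes \xNN (or \t, \n, \r) for
    # everything outside printable ASCII and doubles any backslash.
    return data.decode("latin-1").encode("unicode_escape").decode("ascii")
```

The returned text could then be written to a file such as memdump.txt, opened in text mode.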

--
Jan Kaliszewski (zuo) <z...@chopin.edu.pl>