
extract text from binary file

Gabor Grothendieck

Feb 9, 2013, 8:59:49 PM
Is it possible to extract the text from a binary file using a cmd script? In UNIX one would do this:

tr -d '[\000-\011\013-\037\177-\377]' < infile > outfile

I am aware that one can get a tr program for Windows but was wondering if this is possible just with what comes with Windows.

foxidrive

Feb 9, 2013, 10:22:48 PM
There are different ways: either use tokens in a FOR command (the tokens and delimiters
depend on the exact text),

or split the string up with SET:

set var2=%var:~0,3% %var:~10,12%



--
foxi

Gabor Grothendieck

Feb 9, 2013, 10:44:39 PM
I was hoping for a general solution similar to the tr one. If that is not feasible, my particular situation is this: I don't know the exact text or its position, but I do know it represents a semicolon-separated path like this:

{app}\bin;{app}\gcc-4.6.3\bin;

Any other text and binary data in the file is not of interest. I am trying to extract the path.


foxidrive

Feb 9, 2013, 11:01:58 PM
On 10/02/2013 2:44 PM, Gabor Grothendieck wrote:
> On Saturday, February 9, 2013 10:22:48 PM UTC-5, foxidrive wrote:
>> On 10/02/2013 12:59 PM, Gabor Grothendieck wrote:
>>
>>> Is it possible to extract the text from a binary file using a cmd script? In UNIX one would do
>>> this: tr -d '[\000-\011\013-\037\177-\377]' < infile > outfile

>> There are different ways - either using tokens in a for command from the exact text (and it depends
>> on the exact text to define the tokens and delimiters)
>>
>> or splitting the string up with SET
>>
>> set var2=%var:~0,3% %var:~10,12%
>
> I was hoping for a general solution similar to the tr one but if that is not feasible then the
> particular situation is that I don't know the exact text or its position but I do know it represents a
> semicolon separated path like this:
>
> {app}\bin;{app}\gcc-4.6.3\bin;
>
> Any other text and binary data in the file is not of interest. I am trying to extract the path.

I didn't take note that it is a binary file. Batch isn't so good with the full range of 8-bit characters.
VBS could help here though, possibly PowerShell too.

Your tr command implies that you know the position of the text, is that right?


--
foxi

Gabor Grothendieck

Feb 10, 2013, 12:23:47 AM
The tr command deletes every character in the indicated ranges, i.e. everything except newline and the printable ASCII characters. That would leave me with just text, and I would then have to search that text for the desired string. I don't know the position of the string. I assume that each component of the string always starts with the 5 characters {app} and is terminated by the first following semicolon. In the example I gave there are two such components.

If I used gawk then this seems to work, but I am aiming for something more self-contained, in a single file, and not dependent on software that does not come with Windows:

BEGIN { FPAT = "{app}[^;]*;" } # {app} followed by non-;'s followed by ;
NF { for (i = 1; i <= NF; i++) printf("%s", $i); print "" }
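
To make the two steps concrete (delete the non-text bytes, then pick out the {app}...; components), here is a minimal sketch in Python. Python does not come with Windows, so it does not meet the constraint stated above; it only illustrates the logic. The names strip_binary and extract_path and the sample bytes are hypothetical. The sketch also keeps literal [ and ] characters, whereas some tr implementations (GNU tr, as far as I know) treat the brackets in the set as ordinary characters to delete.

```python
import re

# Bytes the tr command leaves alone: newline (octal 012) and
# printable ASCII (octal 040 through 176).
KEEP = bytes([0x0A]) + bytes(range(0x20, 0x7F))
# Everything else is deleted: octal 000-011, 013-037 and 177-377.
DELETE = bytes(b for b in range(256) if b not in KEEP)

# {app}, then any run of non-semicolons, then the closing semicolon.
COMPONENT = re.compile(rb"\{app\}[^;]*;")


def strip_binary(data: bytes) -> bytes:
    """Drop every byte that is not newline or printable ASCII."""
    return data.translate(None, DELETE)


def extract_path(data: bytes) -> str:
    """Return all {app}...; components found in the text, concatenated."""
    text = strip_binary(data)
    return b"".join(COMPONENT.findall(text)).decode("ascii")


# Hypothetical sample: the path surrounded by binary junk.
sample = b"\x00\x01Inno\xff\x00{app}\\bin;{app}\\gcc-4.6.3\\bin;\x00\x90\x7f"
print(extract_path(sample))
```

Stripping first and matching afterwards mirrors the tr-then-search order described above; matching on the raw bytes would also work as long as no binary junk falls inside a component.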

Frank Westlake

Feb 10, 2013, 5:38:07 AM
On 2013-02-09 19:44, Gabor Grothendieck wrote:
> On Saturday, February 9, 2013 10:22:48 PM UTC-5, foxidrive wrote:
>> On 10/02/2013 12:59 PM, Gabor Grothendieck wrote:
>>> Is it possible to extract the text from a binary file using a cmd
>>> script?

Sometimes.

> I was hoping for a general solution similar to the tr one but if that
> is not feasible then the particular situation is that I don't know
> the exact text or its position but I do know it represents a
> semicolon separated path like this:
>
> {app}\bin;{app}\gcc-4.6.3\bin;


No general solution, you will need to customize an extraction which
would succeed with some binaries and fail with others.

Start by extracting something which you know will always be there:

TYPE "%file%" | FINDSTR /C:";"

Look at the result and find part of it which can be used to reduce the
output, then add that to the pipe:

TYPE "%file%" | FINDSTR /C:";" | FINDSTR /C:"bin"

You included a GCC path so if that will always be there then this would
be a good start:

TYPE "%file%" | FINDSTR /C:"\gcc-"

After you've narrowed it down to a single line then place it in a FOR
with no delimiters and capture the line into a variable:

FOR /F "delims=" %%a in (
'TYPE "%file%" ^| FINDSTR /C:"\gcc-"'
) Do (
SET "WORK=%%a"
)

Now you can split the variable by replacing the semicolon with a space,
but since paths can contain spaces you will need to add quotes.
Something like this:

SET WORK="!WORK:;=" "!"

I'm not certain that will work without modification, I'm not testing
any of this as I write it.

Now you have a space delimited string and you can examine each quoted
component one at a time for the one you need:

FOR %%a in (!WORK!) Do (
Set "component=%%~a"
If "!component:gcc=!" NEQ "!component!" (
REM Then we have the component containing GCC.
)
)

Frank

Frank Westlake

Feb 10, 2013, 5:47:09 AM
On 2013-02-10 02:38, Frank Westlake wrote:
> Start by extracting something which you know will always be there:
>
> TYPE "%file%" | FINDSTR /C:";"
>
> Look at the result and find part of it which can be used to reduce the
> output, then add that to the pipe:
>
> TYPE "%file%" | FINDSTR /C:";" | FINDSTR /C:"bin"
>
> You included a GCC path so if that will always be there then this would
> be a good start:
>
> TYPE "%file%" | FINDSTR /C:"\gcc-"

It would be better to include the file as a parameter for FINDSTR:

FINDSTR /C:";" "%file%"

FINDSTR /C:";" "%file%" | FINDSTR /C:"bin"

FINDSTR /C:"\gcc-" "%file%"

Frank

foxidrive

Feb 10, 2013, 8:01:20 AM
On 10/02/2013 4:23 PM, Gabor Grothendieck wrote:
> On Saturday, February 9, 2013 11:01:58 PM UTC-5, foxidrive wrote:
>> On 10/02/2013 2:44 PM, Gabor Grothendieck wrote:

>>> I was hoping for a general solution similar to the tr one but if that is not feasible then the
>>> particular situation is that I don't know the exact text or its position but I do know it represents a
>>> semicolon separated path like this:
>>
>>> {app}\bin;{app}\gcc-4.6.3\bin;
>>
>>> Any other text and binary data in the file is not of interest. I am trying to extract the path.
>>
>> I didn't take note that it is a binary file. Batch isn't so good with a full range of 8 bit characters.
>> VBS could help here though, possibly powershell too.
>
> The tr command deletes any character in the indicated ranges, i.e. outside the range of ascii characters. That would leave me with just text. Then I would have to search for the desired string among the text. I don't know the position of the string. I assume that each component in the string always starts with the 5 characters {app} and is terminated by the first following semicolon. In the example I gave there are two such components.

Ahh, I mistook the function of tr.

There is a tool, downloadable from Microsoft, which might help you locate the string.

Strings v2.41
Copyright (C) 1999-2009 Mark Russinovich
Sysinternals - www.sysinternals.com

usage: strings.exe [-a] [-b bytes] [-n length] [-o] [-q] [-s] [-u] <file or directory>
-a Ascii-only search (Unicode and Ascii is default)
-b Bytes of file to scan
-o Print offset in file string was located
-n Minimum string length (default is 3)
-q Quiet (no banner)
-s Recurse subdirectories
-u Unicode-only search (Unicode and Ascii is default)


Is this binary file an executable where the string will always be in the same place?

> If I used gawk then this seems to work but I am aiming for something more self contained in a single file and not dependent on software that does not come with Windows:
>
> BEGIN { FPAT = "{app}[^;]*;" } # {app} followed by non-;'s followed by ;
> NF { for (i = 1; i <= NF; i++) printf("%s", $i); print "" }

You can use regexp of a sort in VBS scripts. I'm not sure about PowerShell scripts but that's worth
looking at too.



--
foxi

Gabor Grothendieck

Feb 10, 2013, 10:15:32 AM
foxidrive,

The strings program is nifty and

strings -a | findstr {app}

returns the desired path. That would be the basis of a solution if I were not limiting myself to programs that come with Windows.

Frank, I think the binary nature of the file is interfering with the use of cmd scripting but if I can figure out some way of extracting the ascii first then I am sure your approach would work. I may have to give up on a pure scripting solution.

findstr /c:"{app}"

does seem to give the required line which is a combination of binary junk and the desired text.

Frank Westlake

Feb 10, 2013, 10:56:54 AM
On 2013-02-10 07:15, Gabor Grothendieck wrote:
> findstr /c:"{app}"
>
> does seem to give the required line which is a combination of binary
> junk and the desired text.

That's normal. The following script is the general solution which may
need to be customized for your binary; you could first choose delimiters
which reduce some of the binary clutter and nibble the string down to
something that can be broken at semicolons.

If it is you who are creating the binaries then it would help if you
formatted the string so that it could more easily be parsed by inserting
a preceding and following semicolon: "path=;path\a;path\b;".

@Echo OFF
SetLocal EnableExtensions EnableDelayedExpansion
Set "knownString={app}"
Set "file=your binary"

FOR /F "delims=" %%a in (
  'FINDSTR /C:"%knownString%" "%file%"'
) Do (
  SET "WORK=%%a"
)
SET WORK="!WORK:;=" "!"
FOR %%a in (!WORK!) Do (
  Set "component=%%~a"
  If "!component:%knownString%=!" NEQ "!component!" (
    Echo "%%~a" is it.
  )
)

Frank

Gabor Grothendieck

Feb 10, 2013, 11:54:29 AM
Thanks. WORK does not seem to get set. If you want to try it with the actual file run this installer:

http://cran.r-project.org/bin/windows/Rtools/Rtools30.exe

and in the top level directory it will create a file called unins000.dat . That is the binary file of interest.

Frank Westlake

Feb 10, 2013, 11:58:35 AM
On 2013-02-10 07:15, Gabor Grothendieck wrote:
> Frank, I think the binary nature of the file is interfering with the
> use of cmd scripting but if I can figure out some way of extracting
> the ascii first then I am sure your approach would work.

There is also a much more involved procedure which will work more reliably:

1. Save the line which has the desired text with the undesired
binary data to a file.

FINDSTR /C:"%string%" "%file%" >"%TEMP%\app.bin"

2. Use FC.exe to create an ASCII-HEX dump of the file.

For %%a in ("%TEMP%\app.bin") Do Set size=%%~za
fsUtil file createnew "%TEMP%\app.zero" %size%
FC /B "%TEMP%\app.bin" "%TEMP%\app.zero" > "%TEMP%\app.hex"

3. Go through the ASCII-HEX file line by line with FOR/F skipping
everything until you get to hex values for your "{app}" string, then
convert what's needed into ASCII again.

Frank
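
Step 3 can be pictured as follows. FC /B lists only the bytes that differ, so comparing against an all-zero file of the same size lists every non-zero byte together with its offset. Below is a minimal sketch in Python, used only to illustrate the parsing (the thread itself is after a batch-only route). It assumes each difference line has the form OFFSET: XX YY, with XX taken from the first file; the name rebuild_from_fc and the sample dump are hypothetical.

```python
import re

# One FC /B difference line (assumed layout): 8 hex digits of offset,
# a colon, the byte from the first file, the byte from the second file.
DIFF_LINE = re.compile(r"^([0-9A-Fa-f]{8}): ([0-9A-Fa-f]{2}) ([0-9A-Fa-f]{2})\s*$")


def rebuild_from_fc(dump_lines, size):
    """Rebuild the first file's bytes from FC /B output against a zero file."""
    data = bytearray(size)  # zero bytes never show up as differences
    for line in dump_lines:
        match = DIFF_LINE.match(line)
        if match is None:
            continue  # banner lines and other messages are skipped
        offset = int(match.group(1), 16)
        if offset < size:
            data[offset] = int(match.group(2), 16)
    return bytes(data)


# Hypothetical dump for the 8 bytes 00 7B 61 70 70 7D 00 3B.
sample_dump = [
    "Comparing files app.bin and APP.ZERO",
    "00000001: 7B 00",
    "00000002: 61 00",
    "00000003: 70 00",
    "00000004: 70 00",
    "00000005: 7D 00",
    "00000007: 3B 00",
]
print(rebuild_from_fc(sample_dump, 8))
```

In batch the same idea means reading the second token of each line with FOR /F and mapping each hex pair back to a character.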

foxidrive

Feb 10, 2013, 12:39:27 PM
On 11/02/2013 3:54 AM, Gabor Grothendieck wrote:
> Thanks. WORK does not seem to get set. If you want to try it with the actual file run this installer:
>
> http://cran.r-project.org/bin/windows/Rtools/Rtools30.exe
>
> and in the top level directory it will create a file called unins000.dat . That is the binary file of interest.

It would help if you uploaded the unins000.dat file to a webspace so we could download just that file.

--
foxi

Frank Westlake

Feb 10, 2013, 12:44:20 PM
On 2013-02-10 08:54, Gabor Grothendieck wrote:
> http://cran.r-project.org/bin/windows/Rtools/Rtools30.exe

That sort of problem can't be solved without magic. Here's the magic:

@Echo OFF
SetLocal EnableExtensions EnableDelayedExpansion

Set "string={app}"
Set "file=\rtools\unins000.dat"

For /F "delims=" %%a in (
  'findstr /C:"%string%" "%file%"^|MORE'
) Do (
  Set "$=%%~a"
  If /I "!$:~0,5!" EQU "%string%" (
    Set $=!$:;=" "!
    For %%b in ("!$!") Do (
      Set "#=%%~b"
      If "!#:~0,5!" EQU "%string%" (
        CALL :work !#!
      )
    )
  )
)
Goto :EOF

:work
Echo %~1
Goto :EOF

Do your work in the subroutine ":work", which you create. I get the two
strings

{app}\bin
{app}\gcc-4.6.3\bin

Note that the strings are sent to the subroutine in quotes and that I
dequote them in the ECHO statement.

Frank

Frank Westlake

Feb 10, 2013, 12:46:59 PM
On 2013-02-10 09:44, Frank Westlake wrote:
> CALL :work !#!

Somehow I boobooed. That should be quoted

CALL :work "!#!"

because it might contain spaces.

Frank

Gabor Grothendieck

Feb 10, 2013, 12:56:05 PM
Yes! It works! Excellent.

Herbert Kleebauer

Feb 10, 2013, 9:17:42 PM
Because there is an editor in Windows, you could write a few lines of C code.
But here is a (slow) batch only solution:



@echo off
setlocal enabledelayedexpansion

::::::::::::::::::::::::::::::::::::::::::::::::::::

set infile=infile.bin
set outfile=outfile.bin
set tmpfile=_.tmp

::::::::::::::::::::::::::::::::::::::::::::::::::::
:: setup conversion table: print only ascii 32-126
:: and CR, LF, TAB

for %%i in (0 1) do for %%j in (0 1 2 3 4 5 6 7 8 9 a b c d e f) do set [%%i%%j]=
for %%i in (2 3 4 5 6 7) do for %%j in (0 1 2 3 4 5 6 7 8 9 a b c d e f) do set [%%i%%j]=%%i%%j
for %%i in (8 9 a b c d e f) do for %%j in (0 1 2 3 4 5 6 7 8 9 a b c d e f) do set [%%i%%j]=
set [09]=09
set [0d]=0d
set [0a]=0a
set [7f]=
set [ ]=

::::::::::::::::::::::::::::::::::::::::::::::::::::

if exist "%tmpfile%" del "%tmpfile%"


certutil -f -encodehex "%infile%" "%outfile%"

for /f "tokens=1*" %%i in (%outfile%) do (
set a=%%j

set c=%%[!a:~0,2!]%% %%[!a:~3,2!]%% %%[!a:~6,2!]%% %%[!a:~9,2!]%% %%[!a:~12,2!]%% %%[!a:~15,2!]%%
set c=!c! %%[!a:~18,2!]%% %%[!a:~21,2!]%% %%[!a:~25,2!]%% %%[!a:~28,2!]%% %%[!a:~31,2!]%%
set c=!c! %%[!a:~34,2!]%% %%[!a:~37,2!]%% %%[!a:~40,2!]%% %%[!a:~43,2!]%% %%[!a:~46,2!]%%
call set c=!c!
echo.!c!>>%tmpfile%)

certutil -f -decodehex "%tmpfile%" "%outfile%"

if exist "%tmpfile%" del "%tmpfile%"
goto :eof

::::::::::::::::::::::::::::::::::::::::::::::::::::
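
For readers who find the lookup-table trick hard to follow, here is a sketch of what the loop does to one line of the hex dump, written in Python purely as an illustration. It assumes the dump layout that the substring offsets in the script imply: an address, eight hex pairs, an extra space, eight more hex pairs, then an ASCII column. The name filter_dump_line and the sample lines are hypothetical.

```python
# Column offsets of the 16 hex pairs, the same ones the batch script uses
# (note the jump from 21 to 25 caused by the extra space in the middle).
OFFSETS = (0, 3, 6, 9, 12, 15, 18, 21, 25, 28, 31, 34, 37, 40, 43, 46)

# Hex pairs that survive: printable ASCII 0x20-0x7e plus TAB, LF and CR.
KEEP = {f"{b:02x}" for b in range(0x20, 0x7F)} | {"09", "0a", "0d"}


def filter_dump_line(line: str) -> bytes:
    """Return only the text bytes encoded in one hex dump line."""
    parts = line.split(None, 1)  # like "tokens=1*": address, then the rest
    if len(parts) < 2:
        return b""
    rest = parts[1]
    pairs = (rest[o:o + 2].lower() for o in OFFSETS)
    # Pairs that are blank (short last line) or not in KEEP are dropped,
    # which is what the empty table entries achieve in the batch version.
    return bytes.fromhex("".join(p for p in pairs if p in KEEP))


line1 = "0000\t49 6e 6e 6f 20 53 65 74  75 70 20 55 6e 69 6e 73   Inno Setup Unins"
line2 = "0010\t74 61 6c 6c 20 4c 6f 67  20 28 62 29 00 00 00 00   tall Log (b)...."
print(filter_dump_line(line1) + filter_dump_line(line2))
```

The batch version writes the surviving hex pairs to a temporary file and lets certutil -decodehex turn them back into bytes, whereas the sketch converts them directly.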

Gabor Grothendieck

Feb 11, 2013, 12:49:50 AM
That's quite amazing!

Frank Westlake

Feb 11, 2013, 4:31:32 AM
Very nice work. Thank you for publishing this, I can use it here in this
group. But this won't help Gabor Grothendieck because he needs it to
function on Windows XP.

Frank

Frank Westlake

Feb 11, 2013, 4:36:42 AM
For those of you with Windows XP, this script prints a hex dump which
looks like this:


0000 49 6e 6e 6f 20 53 65 74 75 70 20 55 6e 69 6e 73 Inno Setup Unins
0010 74 61 6c 6c 20 4c 6f 67 20 28 62 29 00 00 00 00 tall Log (b)....
0020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0040 52 74 6f 6f 6c 73 00 00 00 00 00 00 00 00 00 00 Rtools..........
0050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0070 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

And etcetera (why is etcetera not in my dictionary?).

Frank

Frank Westlake

Feb 11, 2013, 5:07:27 AM
On 2013-02-11 01:36, Frank Westlake wrote:
> For those of you with Windows XP, this script prints a hex dump which
> looks like this:

I was mistaken. Apparently I caused the script to abort early and I read
only what CERTUTIL outputs, but the script continues to extract only the
ASCII strings and saves those to another file.

Frank

Herbert Kleebauer

Feb 11, 2013, 12:21:39 PM
On 11.02.2013 10:36, Frank Westlake wrote:
> For those of you with Windows XP, this script prints a hex dump which
> looks like this:
>
>
> 0000 49 6e 6e 6f 20 53 65 74 75 70 20 55 6e 69 6e 73 Inno Setup Unins
> 0010 74 61 6c 6c 20 4c 6f 67 20 28 62 29 00 00 00 00 tall Log (b)....

certutil can generate different formats, for example:


certutil -f -encodehex infile outfile 4

ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48
00 48 00 00 ff db 00 43 00 01 01 01 01 01 01 01
01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01


certutil -f -encodehex infile outfile 10

0000 ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48
0010 00 48 00 00 ff db 00 43 00 01 01 01 01 01 01 01
0020 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01


certutil -f -encodehex infile outfile 12

ffd8ffe000104a46494600010101004800480000ffdb0043000101010101010101010101010101010101010101




// certenrolld_begin -- CRYPT_STRING_*
#define CRYPT_STRING_BASE64HEADER 0x00000000
#define CRYPT_STRING_BASE64 0x00000001
#define CRYPT_STRING_BINARY 0x00000002
#define CRYPT_STRING_BASE64REQUESTHEADER 0x00000003
#define CRYPT_STRING_HEX 0x00000004
#define CRYPT_STRING_HEXASCII 0x00000005
#define CRYPT_STRING_BASE64_ANY 0x00000006
#define CRYPT_STRING_ANY 0x00000007
#define CRYPT_STRING_HEX_ANY 0x00000008
#define CRYPT_STRING_BASE64X509CRLHEADER 0x00000009
#define CRYPT_STRING_HEXADDR 0x0000000a
#define CRYPT_STRING_HEXASCIIADDR 0x0000000b
#define CRYPT_STRING_HEXRAW 0x0000000c

#define CRYPT_STRING_NOCRLF 0x40000000
#define CRYPT_STRING_NOCR 0x80000000
// certenrolld_end

// CryptBinaryToString uses the following flags
// CRYPT_STRING_BASE64HEADER - base64 format with certificate begin
// and end headers
// CRYPT_STRING_BASE64 - only base64 without headers
// CRYPT_STRING_BINARY - pure binary copy
// CRYPT_STRING_BASE64REQUESTHEADER - base64 format with request begin
// and end headers
// CRYPT_STRING_BASE64X509CRLHEADER - base64 format with x509 crl begin
// and end headers
// CRYPT_STRING_HEX - only hex format
// CRYPT_STRING_HEXASCII - hex format with ascii char display
// CRYPT_STRING_HEXADDR - hex format with address display
// CRYPT_STRING_HEXASCIIADDR - hex format with ascii char and address display
//
// CryptBinaryToString accepts CRYPT_STRING_NOCR or'd into one of the above.
// When set, line breaks contain only LF, instead of CR-LF pairs.

// CryptStringToBinary uses the following flags
// CRYPT_STRING_BASE64_ANY tries the following, in order:
// CRYPT_STRING_BASE64HEADER
// CRYPT_STRING_BASE64
// CRYPT_STRING_ANY tries the following, in order:
// CRYPT_STRING_BASE64_ANY
// CRYPT_STRING_BINARY -- should always succeed
// CRYPT_STRING_HEX_ANY tries the following, in order:
// CRYPT_STRING_HEXADDR
// CRYPT_STRING_HEXASCIIADDR
// CRYPT_STRING_HEXASCII
// CRYPT_STRING_HEX
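
The trailing numbers 4, 10 and 12 in the certutil examples line up with CRYPT_STRING_HEX (0x04), CRYPT_STRING_HEXADDR (0x0a) and CRYPT_STRING_HEXRAW (0x0c) in the list above. As a rough sketch of how each output could be turned back into bytes, here is a Python illustration. The sample text is copied from the first two lines of the examples in this post, the function names are hypothetical, and for format 10 it assumes the address is the first whitespace-separated token on each line.

```python
def decode_hex(text: str) -> bytes:
    """Format 4 (CRYPT_STRING_HEX): hex pairs separated by whitespace."""
    return bytes.fromhex("".join(text.split()))


def decode_hexaddr(text: str) -> bytes:
    """Format 10 (CRYPT_STRING_HEXADDR): an address, then hex pairs."""
    pairs = []
    for line in text.splitlines():
        tokens = line.split()
        pairs.extend(tokens[1:])  # skip the leading address token
    return bytes.fromhex("".join(pairs))


def decode_hexraw(text: str) -> bytes:
    """Format 12 (CRYPT_STRING_HEXRAW): one unbroken run of hex digits."""
    return bytes.fromhex("".join(text.split()))


FORMAT_4 = (
    "ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48\n"
    "00 48 00 00 ff db 00 43 00 01 01 01 01 01 01 01\n"
)
FORMAT_10 = (
    "0000 ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48\n"
    "0010 00 48 00 00 ff db 00 43 00 01 01 01 01 01 01 01\n"
)
FORMAT_12 = "ffd8ffe000104a46494600010101004800480000ffdb00430001010101010101\n"

# All three decode to the same 32 bytes.
print(decode_hex(FORMAT_4) == decode_hexaddr(FORMAT_10) == decode_hexraw(FORMAT_12))
```

Format 12 is the easiest to post-process in a script because there are no addresses or column gaps to step over.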


Frank Westlake

Feb 11, 2013, 3:06:44 PM
On 2013-02-11 09:21, Herbert Kleebauer wrote:
> certutil can generate different formats, for example:

Thank you for that information.

Frank

Stanley Daniel de Liver

Feb 12, 2013, 10:53:51 AM
On Mon, 11 Feb 2013 17:21:39 -0000, Herbert Kleebauer <kl...@unibwm.de>
wrote:

> On 11.02.2013 10:36, Frank Westlake wrote:
>> For those of you with Windows XP, this script prints a hex dump which
>> looks like this:
>>
XP Pro; I don't have certutil
--
[dash dash space newline 4line sig]

Money/Life question

foxidrive

Feb 12, 2013, 11:06:32 AM
On 13/02/2013 2:53 AM, Stanley Daniel de Liver wrote:
> On Mon, 11 Feb 2013 17:21:39 -0000, Herbert Kleebauer <kl...@unibwm.de>
> wrote:
>
>> On 11.02.2013 10:36, Frank Westlake wrote:
>>> For those of you with Windows XP, this script prints a hex dump which
>>> looks like this:
>>>
> XP Pro; I don't have certutil

You have 1 year and 2 months left of security updates. That'll go in a flash!



--
foxi

Stanley Daniel de Liver

Feb 12, 2013, 12:14:25 PM
On Tue, 12 Feb 2013 16:06:32 -0000, foxidrive <n...@this.address.invalid>
wrote:
I'm getting into linux

Herbert Kleebauer

Feb 12, 2013, 12:31:08 PM
On 12.02.2013 16:53, Stanley Daniel de Liver wrote:
> On Mon, 11 Feb 2013 17:21:39 -0000, Herbert Kleebauer <kl...@unibwm.de>
> wrote:

> XP Pro; I don't have certutil

But then you are one of the few lucky ones who has
a 16 bit subsystem where such things are trivial.

