
Does any one recognize this binary data storage format


geske...@hotmail.com

Aug 9, 2005, 1:29:14 PM
I am hoping someone can help me solve a bit of a puzzle.

We are working on a data file reader and extraction tool for an old
MS-DOS accounting system dating back to the mid 80's.

In the data files, the text information is stored in clearly readable
ASCII text, so I am comfortable that this file isn't EBCDIC. However,
some of the numbers are stored in a format that we can't seem to
recognize or unpack using the standard Python tools (struct, binascii)
... or at least our understanding of how these tools work!


Any assistance would be appreciated.

Here are a few examples of telephone numbers;

Example 1:

Phone 1: 5616864700
Hex On Disk: C0DBA8ECF441

Phone 2: 5616885403
Hex on Disk: B0E9ADECF4F1

Another example:
Phone 1: 8003346488
Hex On Disk: 800396d0fd41

Phone2: 9544261331
Hex On Disk: F8f50ec70142

Phone3: 9544278601
Hex On Disk: 481211c70142


TIA.

Christopher Subich

Aug 9, 2005, 2:18:37 PM
geske...@hotmail.com wrote:
> I am hoping someone can help me solve a bit of a puzzle.
>
> We are working on a data file reader and extraction tool for an old
> MS-DOS accounting system dating back to the mid 80's.
>
> In the data files, the text information is stored in clearly readable
> ASCII text, so I am comfortable that this file isn't EBCDIC. However,
> some of the numbers are stored in a format that we can't seem to
> recognize or unpack using the standard Python tools (struct, binascii)
> ... or at least our understanding of how these tools work!
>
>
> Any assistance would be appreciated.
>
> Here are a few examples of telephone numbers;
>
> Example 1:
>
> Phone 1: 5616864700
> Hex On Disk: C0DBA8ECF441
>
> Phone 2: 5616885403
> Hex on Disk: B0E9ADECF4F1

Is this value a typo instead of ...F441?

Dejan Rodiger

Aug 9, 2005, 2:42:34 PM
geske...@hotmail.com said the following on 9.08.2005 19:29:

> We are working on a data file reader and extraction tool for an old
> MS-DOS accounting system dating back to the mid 80's.

Could you tell us what is the extension of those files?

Could you post full 5-10 records (ASCII + HEX)?

--
Dejan Rodiger - PGP ID 0xAC8722DC
Delete wirus from e-mail address

Dejan Rodiger

Aug 9, 2005, 2:45:40 PM
geske...@hotmail.com said the following on 9.08.2005 19:29:
> Phone 1: 5616864700
> Hex On Disk: C0DBA8ECF441

5616864700(10)=14ECA8DBC(16)
14 EC A8 DB C leftshift by 4 bits (it will add 0 on last C)
C0 DB A8 EC 14 00 write bytes from right to left
C0 DB A8 EC F4 41 Add E041

> Phone 1: 8003346488
> Hex On Disk: 800396d0fd41

8003346488(10)=1DD096038(16)
1D D0 96 03 8
80 03 96 D0 1D 00
80 03 96 d0 fd 41 Add E041

But works only for Phone 1 :-)
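
The steps above can be reproduced with plain integer arithmetic. This is only a sketch of the heuristic as described in this post; the function name `guess_encode` is made up for illustration, and as already noted the rule is known to fit only the first phone number of each group.

```python
def guess_encode(phone):
    """Heuristic from the post: shift left by 4 bits, store the result
    little-endian in 6 bytes, and add E0 41 to the last two bytes."""
    # Adding the bytes E0 41 at positions 4 and 5 of a little-endian
    # value is the same as adding 0x41E0 shifted up by 32 bits.
    value = (phone << 4) + (0x41E0 << 32)
    return value.to_bytes(6, "little").hex().upper()

for phone in (5616864700, 8003346488, 9544261331):
    print(phone, guess_encode(phone))
```

The first two numbers come out as C0DBA8ECF441 and 800396D0FD41, matching the bytes on disk. The third does not match F8F50EC70142, which is the "works only for Phone 1" problem.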

geske...@hotmail.com

Aug 9, 2005, 3:57:02 PM
The extension on the files is *.mas, but I am pretty sure it is not
relevant. I believe it is used by the application.

I can't post records as it would take up too much space.
But all three phone numbers are stored in 8 bytes with null bytes (i.e.
00) stored in the leading positions (i.e. the left-hand side).

I do have some more examples;

I have inserted the leading null bytes and separated with spaces for
clarity.

Ex #1) 333-3333
Hex On disk: 00 00 00 80 6a 6e 49 41

Ex #2) 666-6666
Hex On disk: 00 00 00 80 6a 6e 59 41

Ex#3) 777-7777
Hex On Disk: 00 00 00 40 7C AB 5D 41

Ex#4) 123-4567
Hex On Disk: 00 00 00 00 87 D6 32 41

Ex#5) 000-0001
Hex On disk: 00 00 00 00 00 00 F0 3F

Ex#6) 999-9999
Hex On disk: 00 00 00 E0 CF 12 63 41

Christopher Subich

Aug 9, 2005, 4:04:14 PM
Dejan Rodiger wrote:
> 8003346488(10)=1DD096038(16)
> 1D D0 96 03 8
> 80 03 96 D0 1D 00
> 80 03 96 d0 fd 41 Add E041

I'm pretty sure that the last full byte is a parity check of some sort.
I still think that Phone2 (..F1) is a typo and should be 41. Even if
it's not, it could be a more detailed parity (crc-like?) check.

If the F1/41 is a typo, the last byte is ..41 if the parity of the other
40 bits is odd, and ..42 if the parity is even. (Since ..41 and ..42
each have two 1s, it does not change the parity of the entire string).
If not, Lucy has some 'splaining to do.

Taking the last byte out of the equation entirely, 40 bits for 10
decimal digits is 4 bits / digit, meaning there is some redundancy
still in the remainder (the full 10-digit number can be expressed with
room to spare in 36 bits).

Thinking like an 80s Mess-Dos programmer, 32-bit math is out of the
question since the CPU doesn't support it. Five decimal digits already
pushes the 16-bit boundary, so thinking of using the full phone number
or any computation is insane.

#1/#2 and #4/#5 share both the first five digits of the real phone
number and the last 16 bits of the remaining expression. Both pairs
*also* share bits 5-8 (the second hex digit).

Therefore, we may possibly conclude that digits 5-10 are stored in bits
5-8 and 21-36. This is an even 20 of the 40 data-bits. In fact, bits
6-8 of all examples given were all 0, but since I can't find an
equivalent always-x set for the other 5 digits I'm not sure that this is
significant.

Therefore:
95442 = 8c701 = 1 + c701 (?)
56168 = 0ECF4 = 0 + ecf4 (?)

I'm not coming up with a terribly useful algorithm from this, though. :/
My guess is that somewhere, there's a boolean check based on whether a
digit is >= 6 [maybe 3?] (to prevent running afoul of 16-bitness). I'm
also 90% sure that the first and second halves of the phone number are
processed separately, at least mostly, for the same reason.

Roel Schroeven

Aug 9, 2005, 4:17:16 PM
Dejan Rodiger wrote:
> geske...@hotmail.com said the following on 9.08.2005 19:29:
>
>>Phone 1: 5616864700
>>Hex On Disk: C0DBA8ECF441
>
>
> 5616864700(10)=14ECA8DBC(16)
> 14 EC A8 DB C leftshift by 4 bits (it will add 0 on last C)
> C0 DB A8 EC 14 00 write bytes from right to left
> C0 DB A8 EC F4 41 Add E041
>
>
>>Phone 1: 8003346488
>>Hex On Disk: 800396d0fd41
>
>
> 8003346488(10)=1DD096038(16)
> 1D D0 96 03 8
> 80 03 96 D0 1D 00
> 80 03 96 d0 fd 41 Add E041
>
> But works only for Phone 1 :-)

E041 is some kind of checksum perhaps?

--
If I have been able to see further, it was only because I stood
on the shoulders of giants. -- Isaac Newton

Roel Schroeven

geske...@hotmail.com

Aug 9, 2005, 4:17:52 PM
You are correct, that was a typo.
the second example should end in F441.

Thanks.

Grant Edwards

Aug 9, 2005, 4:33:54 PM

> I can't post records as it would take up too much space. But all
> three phone numbers are stored in 8 bytes with null bytes (i.e.
> 00) stored in the leading positions (i.e. the left-hand side).
>
> I do have some more examples;
>
> I have inserted the leading null bytes and separated with spaces for
> clarity.
>
> Ex #1) 333-3333
> Hex On disk: 00 00 00 80 6a 6e 49 41
>
> Ex #2) 666-6666
> Hex On disk: 00 00 00 80 6a 6e 59 41

So there's only a 1-bit difference between the on-disk
representation of 333-3333 and 666-6666.

That sounds pretty unlikely. Are you 100% sure you're looking
at the correct bytes?

--
Grant

geske...@hotmail.com

Aug 9, 2005, 4:45:52 PM
Yes I double checked as I appreciate any help, but that is what is
stored on disk.

If it helps, we modified Ex#3. to be 777-777-7777
On disk this is now 00 00 10 87 77 F9 Fc 41

All the input fields are filled in this new example.

Dejan Rodiger

Aug 9, 2005, 5:28:35 PM
geske...@hotmail.com said the following on 9.08.2005 22:45:

So for numbers with 10 digits you could say that it is:
7777777777(10)=1CF977871(16)
1CF977871 SHL 4 bits = 1C F9 77 87 10
write them from right to left and shift left for 8 bits
10 87 77 f9 1C 00
And then add F0 41

Could you also give some examples with nine to one digits?

Scott David Daniels

Aug 9, 2005, 5:33:43 PM
Grant Edwards wrote:
>>Ex #1) 333-3333
>>Hex On disk: 00 00 00 80 6a 6e 49 41
>>
>>Ex #2) 666-6666
>>Hex On disk: 00 00 00 80 6a 6e 59 41
>
> So there's only a 1-bit difference between the on-disk
> representation of 333-3333 and 666-6666.
>
> That sounds pretty unlikely. Are you 100% sure you're looking
> at the correct bytes?

Perhaps the one bit is an exponent -- some kind of floating point
based format? That matches the doubling of all digits.
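
A quick way to test this guess is to pack the two numbers as floating-point values and compare the bytes. The sketch below assumes an 8-byte little-endian IEEE 754 double, which at this point is only a hypothesis; `double_hex` is a helper name made up for the example.

```python
import struct

def double_hex(x):
    """Return x packed as a little-endian IEEE 754 double, in hex."""
    return struct.pack("<d", x).hex()

a = double_hex(3333333.0)
b = double_hex(6666666.0)

# XOR the two encodings and count the bits that differ.
diff = int(a, 16) ^ int(b, 16)
print(a)
print(b)
print("differing bits:", bin(diff).count("1"))
```

Doubling a value leaves the mantissa untouched and adds one to the exponent, so a single changed bit is what this format would produce for 333-3333 versus 666-6666.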

--Scott David Daniels
Scott....@Acm.Org

Dejan Rodiger

Aug 9, 2005, 5:30:34 PM
Dejan Rodiger said the following on 9.08.2005 23:28:

> geske...@hotmail.com said the following on 9.08.2005 22:45:
>
>>Yes I double checked as I appreciate any help, but that is what is
>>stored on disk.
>>
>>If it helps, we modified Ex#3. to be 777-777-7777
>>On disk this is now 00 00 10 87 77 F9 Fc 41
>>
>>All the input fields are filled in this new example.
>>
>
>
> So for numbers with 10 digits you could say that it is:
> 7777777777(10)=1CF977871(16)
> 1CF977871 SHL 4 bits = 1C F9 77 87 10
> write them from right to left and shift left for 8 bits
> 10 87 77 f9 1C 00
> And then add F0 41

And add E041 (not F0 41)

Grant Edwards

Aug 9, 2005, 5:50:06 PM

That would just be sick. I can't imagine anybody on an 8-bit
CPU using FP for a phone number.

--
Grant

Christopher Subich

Aug 9, 2005, 7:55:26 PM
Grant Edwards wrote:
> That would just be sick. I can't imagine anybody on an 8-bit
> CPU using FP for a phone number.

Nobody on an 8-bit CPU would have a FPU, so I'll guarantee that this is
done using only 8 or 16-bit (probably 8) integer math.

Grant Edwards

Aug 9, 2005, 8:42:14 PM

And I'll guarantee that the difference between 333-3333 and
666-6666 has to be more than 1-bit. There's no way that can be
the correct data unless it's something like an index into a
different table or a pointer or something along those lines.

--
Grant Edwards (grante at visi.com)
Yow! ANN JILLIAN'S HAIR makes LONI ANDERSON'S HAIR look like RICARDO
MONTALBAN'S HAIR!

Christopher Subich

Aug 9, 2005, 9:13:34 PM
Grant Edwards wrote:
> And I'll guarantee that the difference between 333-3333 and
> 666-6666 has to be more than 1-bit. There's no way that can be
> the correct data unless it's something like an index into a
> different table or a pointer or something along those lines.

Absolutely. I hadn't even taken a good look at those datapoints yet.

The dataset that I'd like to see:
000-000-0001
000-000-0010
(etc)
000-000-0002
000-000-0004
000-000-0008
000-000-0016
(etc)

I also wonder if the last 8-16 bits involves, at least in part, a count
of the length of the phone number, or at least a flag to distinguish 7
from 10 digits.

Bengt Richter

Aug 9, 2005, 11:47:06 PM

>>> def double_binary_lehex_to_double(dhex):
...     "convert little-endian hex of ieee double binary to double"
...     assert len(dhex)==16, (
...         "hex of double in binary must be 8 bytes (hex pairs in little-endian order)")
...     dhex = ''.join(reversed([dhex[i:i+2] for i in xrange(0,16,2)]))
...     m = int(dhex, 16)
...     x = ((m>>52)&0x7ff) - 0x3ff - 52
...     s = (m>>63)&0x1
...     f = (m & ((1<<52)-1))|((m and 1 or 0)<<52)
...     return (1.0,-1.0)[s]*f*2.0**x
...
>>> double_binary_lehex_to_double('000000806a6e4941')
3333333.0
>>> double_binary_lehex_to_double('000000806a6e5941')
6666666.0
>>> double_binary_lehex_to_double('0000108777F9Fc41')
7777777777.0
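
For readers on Python 3, where `xrange` no longer exists, the same bit-level decoding could be sketched as below. This is a rough port, not the code from this post: the name `le_hex_to_double` and the explicit handling of zero, subnormal, infinity and NaN are additions made for the example.

```python
import struct

def le_hex_to_double(dhex):
    """Decode 16 hex chars (8 bytes, little-endian) as an IEEE 754 double."""
    if len(dhex) != 16:
        raise ValueError("expected 16 hex characters (8 bytes)")
    m = int.from_bytes(bytes.fromhex(dhex), "little")
    sign = (m >> 63) & 0x1              # top bit
    exponent = (m >> 52) & 0x7FF        # next 11 bits, biased by 1023
    fraction = m & ((1 << 52) - 1)      # low 52 bits
    if exponent == 0x7FF:
        raise ValueError("infinity and NaN are not handled in this sketch")
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1 bit.
        value = fraction * 2.0 ** -1074
    else:
        # Normal number: restore the implicit leading 1 bit.
        value = (fraction | (1 << 52)) * 2.0 ** (exponent - 1023 - 52)
    return -value if sign else value

for h in ("000000806a6e4941", "000000806a6e5941", "0000108777F9Fc41"):
    # Cross-check against the struct module.
    assert le_hex_to_double(h) == struct.unpack("<d", bytes.fromhex(h))[0]
    print(h, le_hex_to_double(h))
```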

;-)

Regards,
Bengt Richter

Bengt Richter

Aug 10, 2005, 12:29:27 AM

Now the easy way ;-)

>>> import struct
>>> def d2d(h):
...     return struct.unpack('d',''.join(chr(int(h[i:i+2],16)) for i in xrange(0,16,2)))[0]
...
>>> d2d('000000806a6e4941')
3333333.0
>>> d2d('000000806a6e5941')
6666666.0
>>> d2d('0000108777F9Fc41')
7777777777.0
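
On Python 3 the same idea might look like the sketch below. Two things differ from the session above and are choices made for this example: `bytes.fromhex` replaces the `chr`/`join` step, and the format is written as `'<d'` so the result does not depend on the byte order of the machine running it. The six-byte values from the first post are padded with the two leading null bytes mentioned later in the thread.

```python
import struct

def d2d(h):
    """Decode 16 hex characters (8 bytes, little-endian) as a double."""
    return struct.unpack("<d", bytes.fromhex(h))[0]

print(d2d("000000806a6e4941"))   # Ex #1
print(d2d("0000000087D63241"))   # Ex #4
print(d2d("0000C0DBA8ECF441"))   # first phone number, zero-padded
```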

Regards,
Bengt Richter

John Machin

Aug 10, 2005, 12:28:05 AM

Well done, Scott & Bengt!!
I've just verified that this works with all 12 corrected examples posted
by the OP.

Grant, MS-DOS implies 16 bits at least; and yes there was an FPU (the
8087). And yes there are a lot of sick people who store things as
numbers (whether integer or FP) when the only arithmetic operations that
can be applied to them stuff them up mightily (like losing leading
zeroes off post-codes, having NEGATIVE tax file numbers, etc) and it's
still happening on the best OSes and 64-bit CPUS. Welcome to the real
world :-)
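
The leading-zero problem shows up in this very data: Ex#5 (000-0001) is stored as 1.0, so the digits have to be padded back when the value is displayed. A minimal sketch, assuming the field width (7 or 10 digits) is known from somewhere else, since the stored number does not record it; `format_phone` is a hypothetical helper.

```python
def format_phone(value, digits=7):
    """Format a decoded phone number, restoring leading zeros."""
    s = f"{int(value):0{digits}d}"      # zero-pad to the assumed width
    if digits == 7:
        return f"{s[:3]}-{s[3:]}"
    if digits == 10:
        return f"{s[:3]}-{s[3:6]}-{s[6:]}"
    return s

print(format_phone(1.0))
print(format_phone(7777777777.0, digits=10))
```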

Cheers,
John

geske...@hotmail.com

Aug 10, 2005, 8:30:37 AM
Thanks so much for this. It is exactly what I was looking for.

If I am simply reading the bytes from disk, would I still need to
convert these values to hex characters first with hexlify, or is there
a more direct route?
I.e., convert them to the double float directly from the byte values?

Bengt Richter

Aug 10, 2005, 9:23:22 AM

Yes. Use struct.unpack ;-)

BTW, my second post was doing ''.join(chr(int(h[i:i+2],16)) for i in xrange(0,16,2))
to undo the hexlify you had done (I'd forgotten that there's a binascii.unhexlify ;-)
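
A minimal sketch of that direct route, written for Python 3. The record layout of the .mas files is not given in this thread, so the offset and the `read_phone` helper are placeholders, and an in-memory buffer stands in for the real file.

```python
import io
import struct

DOUBLE = struct.Struct("<d")    # little-endian IEEE 754 double, 8 bytes

def read_phone(f, offset):
    """Read one 8-byte double at `offset` and return it as an int."""
    f.seek(offset)
    raw = f.read(DOUBLE.size)
    if len(raw) != DOUBLE.size:
        raise EOFError("record is shorter than expected")
    (value,) = DOUBLE.unpack(raw)
    return int(value)

# Stand-in for a record: 8 bytes of text followed by the number field.
record = io.BytesIO(b"ACME    " + bytes.fromhex("0000C0DBA8ECF441"))
print(read_phone(record, 8))
```

With a real file, the same function would be called on an object opened in binary mode, e.g. `open(path, "rb")`, so that the bytes reach `struct` untouched.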

Regards,
Bengt Richter

Grant Edwards

Aug 10, 2005, 9:31:53 AM
On 2005-08-10, Bengt Richter <bo...@oz.net> wrote:
> On Tue, 09 Aug 2005 21:50:06 -0000, Grant Edwards <gra...@visi.com> wrote:
>
>>On 2005-08-09, Scott David Daniels <Scott....@Acm.Org> wrote:
>>> Grant Edwards wrote:
>>>>>Ex #1) 333-3333
>>>>>Hex On disk: 00 00 00 80 6a 6e 49 41
>>>>>
>>>>>Ex #2) 666-6666
>>>>>Hex On disk: 00 00 00 80 6a 6e 59 41
>>>>
>>>> So there's only a 1-bit difference between the on-disk
>>>> representation of 333-3333 and 666-6666.
>>>>
>>>> That sounds pretty unlikely. Are you 100% sure you're looking
>>>> at the correct bytes?
>>>
>>> Perhaps the one bit is an exponent -- some kind of floating point
>>> based format? That matches the doubling of all digits.
>>
>>That would just be sick. I can't imagine anybody on an 8-bit
>>CPU using FP for a phone number.

> >>> def double_binary_lehex_to_double(dhex):
> ...     "convert little-endian hex of ieee double binary to double"
> ...     assert len(dhex)==16, (
> ...         "hex of double in binary must be 8 bytes (hex pairs in little-endian order")
> ...     dhex = ''.join(reversed([dhex[i:i+2] for i in xrange(0,16,2)]))
> ...     m = int(dhex, 16)
> ...     x = ((m>>52)&0x7ff) - 0x3ff - 52
> ...     s = (m>>63)&0x1
> ...     f = (m & ((1<<52)-1))|((m and 1 or 0)<<52)
> ...     return (1.0,-1.0)[s]*f*2.0**x
> ...
> >>> double_binary_lehex_to_double('000000806a6e4941')
> 3333333.0
> >>> double_binary_lehex_to_double('000000806a6e5941')
> 6666666.0
> >>> double_binary_lehex_to_double('0000108777F9Fc41')
> 7777777777.0
>
> ;-)

Damn.

I still say that's just plain sick.

--
Grant Edwards (grante at visi.com)
Yow! NEWARK has been REZONED!! DES MOINES has been REZONED!!

Grant Edwards

Aug 10, 2005, 9:35:12 AM
On 2005-08-10, John Machin <sjma...@lexicon.net> wrote:

>>>>Perhaps the one bit is an exponent -- some kind of floating point
>>>>based format? That matches the doubling of all digits.
>>>
>>>That would just be sick. I can't imagine anybody on an 8-bit
>>>CPU using FP for a phone number.

>> >>> double_binary_lehex_to_double('000000806a6e4941')
>> 3333333.0
>> >>> double_binary_lehex_to_double('000000806a6e5941')
>> 6666666.0
>> >>> double_binary_lehex_to_double('0000108777F9Fc41')
>> 7777777777.0
>>
>> ;-)
>

> Well done, Scott & Bengt!!
> I've just verified that this works with all 12 corrected examples posted
> by the OP.
>
> Grant, MS-DOS implies 16 bits at least;

You're right. For some reason I was thinking you had said CP/M.

> and yes there was an FPU (the 8087).

I never met an MS-DOS box that had an 8087 (though I did write
firmware for an 8086+8087 fire-control computer once upon a
time).

> And yes there are a lot of sick people who store things as
> numbers (whether integer or FP) when the only arithmetic
> operations that can be applied to them stuff them up mightily
> (like losing leading zeroes off post-codes, having NEGATIVE
> tax file numbers, etc) and it's still happening on the best
> OSes and 64-bit CPUS. Welcome to the real world :-)

I've been in the real world for a long time, and the dumb
things people (including myself) do still surprise me.

--
Grant Edwards (grante at visi.com)
Yow! Hello, GORRY-O!! I'm a GENIUS from HARVARD!!

geske...@hotmail.com

Aug 10, 2005, 11:31:31 AM
Thanks again.
Sort of threw me off, but it is working perfectly now.

Calvin Spealman

Aug 10, 2005, 1:05:32 PM
to pytho...@python.org
> --
> http://mail.python.org/mailman/listinfo/python-list
>

Original Poster should send this off to thedailywtf.com

Christopher Subich

Aug 10, 2005, 2:08:25 PM
Calvin Spealman wrote:
>
> Original Poster should send this off to thedailywtf.com

I absolutely agree. This is a terrible programming practice.

Christos Georgiou

Oct 5, 2005, 4:24:48 AM
On Wed, 10 Aug 2005 13:23:22 GMT, rumours say that bo...@oz.net (Bengt
Richter) might have written:

>BTW, my second post was doing ''.join(chr(int(h[i:i+2],16)) for i in xrange(0,16,2))
>to undo the hexlify you had done (I'd forgotten that there's a binascii.unhexlify ;-)

And there's also str.decode('hex'), at least after 2.3.
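
Note that str.decode('hex') exists only in Python 2; in Python 3, str has no decode method. A short sketch of two Python 3 ways to do the same conversion, using one of the values from this thread:

```python
import binascii
import struct

h = "000000806a6e4941"

raw_a = bytes.fromhex(h)          # built-in, no import needed
raw_b = binascii.unhexlify(h)     # same result via binascii

print(raw_a == raw_b)
print(struct.unpack("<d", raw_a)[0])
```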
--
TZOTZIOY, I speak England very best.
"Dear Paul,
please stop spamming us."
The Corinthians
