We are working on a data file reader and extraction tool for an old
MS-DOS accounting system dating back to the mid-80s.
In the data files, the text information is stored as clearly readable
ASCII text, so I am comfortable that this file isn't EBCDIC. However,
some of the numbers are stored in a format that we can't seem to
recognize or unpack using the standard Python tools (struct, binascii)
... or at least our understanding of how these tools work!
Any assistance would be appreciated.
Here are a few examples of telephone numbers:
Example 1:
Phone 1: 5616864700
Hex On Disk: C0DBA8ECF441
Phone 2: 5616885403
Hex on Disk: B0E9ADECF4F1
Another example:
Phone 1: 8003346488
Hex On Disk: 800396d0fd41
Phone2: 9544261331
Hex On Disk: F8f50ec70142
Phone3: 9544278601
Hex On Disk: 481211c70142
TIA.
Is this value (...F4F1 for Phone 2) a typo for ...F441?
Could you tell us what is the extension of those files?
Could you post full 5-10 records (ASCII + HEX)?
--
Dejan Rodiger - PGP ID 0xAC8722DC
Delete wirus from e-mail address
5616864700(10)=14ECA8DBC(16)
14 EC A8 DB C leftshift by 4 bits (it will add 0 on last C)
C0 DB A8 EC 14 00 write bytes from right to left
C0 DB A8 EC F4 41 Add E041
> Phone 1: 8003346488
> Hex On Disk: 800396d0fd41
8003346488(10)=1DD096038(16)
1D D0 96 03 8
80 03 96 D0 1D 00
80 03 96 d0 fd 41 Add E041
But it works only for Phone 1 :-)
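The recipe above can be written down as a short sketch. This is Python 3 (the rest of the thread uses Python 2), and the name `shift_and_add` is made up for illustration. It shifts the number left by 4 bits, adds the constant whose little-endian bytes end in E0 41, and returns the six bytes in on-disk order. It reproduces both Phone 1 values but not Phone 2 of the second example.

```python
def shift_and_add(number):
    """Apply the shift-by-4 / add-E041 recipe; return 6 bytes as on-disk hex."""
    shifted = number << 4        # e.g. 14ECA8DBC -> 14ECA8DBC0
    constant = 0x41E0 << 32      # lands in the last two bytes (E0 41) when written little-endian
    return (shifted + constant).to_bytes(6, "little").hex().upper()

print(shift_and_add(5616864700))  # C0DBA8ECF441
print(shift_and_add(8003346488))  # 800396D0FD41
print(shift_and_add(9544261331) == "F8F50EC70142")  # False
```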
I can't post full records as they would take up too much space.
But all three phone numbers are stored in 8 bytes with null bytes (i.e.
00) stored in the leading positions (i.e. the left-hand side).
I do have some more examples.
I have inserted the leading null bytes and separated the bytes with
spaces for clarity.
Ex #1) 333-3333
Hex On disk: 00 00 00 80 6a 6e 49 41
Ex #2) 666-6666
Hex On disk: 00 00 00 80 6a 6e 59 41
Ex#3) 777-7777
Hex On Disk: 00 00 00 40 7C AB 5D 41
Ex#4) 123-4567
Hex On Disk: 00 00 00 00 87 D6 32 41
Ex#5) 000-0001
Hex On disk: 00 00 00 00 00 00 F0 3F
Ex#6) 999-9999
Hex On disk: 00 00 00 E0 CF 12 63 41
I'm pretty sure that the last full byte is a parity check of some sort.
I still think that Phone2 (..F1) is a typo and should be 41. Even if
it's not, it could be a more detailed parity (crc-like?) check.
If the F1/41 is a typo, the last byte is ..41 if the parity of the other
40 bits is odd, and ..42 if the parity is even. (Since ..41 and ..42
each have two 1s, it does not change the parity of the entire string).
If not, Lucy has some 'splaining to do.
Taking the last byte out of the equation entirely, 40 bits for 10
decimal digits is 4 bits / digit, meaning there is some redundancy
still in the remainder (the full 10-digit number can be expressed with
room to spare in 36 bits).
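A quick check of the arithmetic in this paragraph, as a Python 3 sketch: one decimal digit per 4-bit nibble (BCD-style) needs 40 bits for ten digits, while the largest ten-digit number stored as a plain binary integer needs 34 bits, so 36 bits is enough with room to spare.

```python
largest = 10**10 - 1                 # 9999999999, the largest ten-digit number
bits_binary = largest.bit_length()   # bits needed as a plain binary integer
bits_bcd = 4 * 10                    # bits needed at one nibble per decimal digit
print(bits_binary, bits_bcd)         # 34 40
```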
Thinking like an 80s Mess-Dos programmer, 32-bit math is out of the
question since the CPU doesn't support it. Five decimal digits already
pushes the 16-bit boundary, so thinking of using the full phone number
or any computation is insane.
#1/#2 and #4/#5 share both the first five digits of the real phone
number and the last 16 bits of the remaining expression. Both pairs
*also* share bits 5-8 (the second hex digit).
Therefore, we may possibly conclude that digits 1-5 are stored in bits
5-8 and 21-36. This is an even 20 of the 40 data-bits. In fact, bits
6-8 of all examples given were all 0, but since I can't find an
equivalent always-x set for the other 5 digits I'm not sure that this is
significant.
Therefore:
95442 = 8c701 = 1 + c701 (?)
56168 = 0ECF4 = 0 + ecf4 (?)
I'm not coming up with a terribly useful algorithm from this, though. :/
My guess is that somewhere, there's a boolean check based on whether a
digit is >= 6 [maybe 3?] (to prevent running afoul of 16-bitness). I'm
also 90% sure that the first and second halves of the phone number are
processed separately, at least mostly, for the same reason.
E041 is some kind of checksum perhaps?
--
If I have been able to see further, it was only because I stood
on the shoulders of giants. -- Isaac Newton
Roel Schroeven
Thanks.
> I can't post full records as they would take up too much space. But all
> three phone numbers are stored in 8 bytes with null bytes (i.e.
> 00) stored in the leading positions (i.e. the left-hand side).
>
> I do have some more examples.
>
> I have inserted the leading null bytes and separated the bytes with
> spaces for clarity.
>
> Ex #1) 333-3333
> Hex On disk: 00 00 00 80 6a 6e 49 41
>
> Ex #2) 666-6666
> Hex On disk: 00 00 00 80 6a 6e 59 41
So there's only a 1-bit difference between the on-disk
representation of 333-3333 and 666-6666.
That sounds pretty unlikely. Are you 100% sure you're looking
at the correct bytes?
--
Grant
If it helps, we modified Ex #3 to be 777-777-7777.
On disk this is now 00 00 10 87 77 F9 FC 41
All the input fields are filled in this new example.
So for 10-digit numbers you could say that it is:
7777777777(10)=1CF977871(16)
1CF977871 SHL 4 bits = 1C F9 77 87 10
write the bytes from right to left and shift left by 8 bits
10 87 77 f9 1C 00
And then add F0 41
Could you also give some examples with one to nine digits?
Perhaps the one bit is an exponent -- some kind of floating point
based format? That matches the doubling of all digits.
--Scott David Daniels
Scott....@Acm.Org
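A small Python 3 sketch of why doubling would flip only one bit if the field were an IEEE 754 double: multiplying by two bumps the exponent by one and leaves the fraction bits alone. The helper name `le_hex` is made up for illustration, and using `struct` with the `"<d"` format assumes the little-endian 8-byte double layout.

```python
import struct

def le_hex(value):
    """Little-endian IEEE 754 double bytes of value, as spaced hex."""
    return " ".join(f"{b:02x}" for b in struct.pack("<d", value))

a = struct.pack("<d", 3333333.0)
b = struct.pack("<d", 6666666.0)
# XOR the two bit patterns and count the bits that differ.
diff_bits = bin(int.from_bytes(a, "little") ^ int.from_bytes(b, "little")).count("1")

print(le_hex(3333333.0))  # 00 00 00 80 6a 6e 49 41
print(le_hex(6666666.0))  # 00 00 00 80 6a 6e 59 41
print(diff_bits)          # 1
```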
And add E041 (not F0 41)
That would just be sick. I can't imagine anybody on an 8-bit
CPU using FP for a phone number.
--
Grant
Nobody on an 8-bit CPU would have an FPU, so I'll guarantee that this is
done using only 8 or 16-bit (probably 8) integer math.
And I'll guarantee that the difference between 333-3333 and
666-6666 has to be more than 1-bit. There's no way that can be
the correct data unless it's something like an index into a
different table or a pointer or something along those lines.
--
Grant Edwards, grante at visi.com
Yow! ANN JILLIAN'S HAIR makes LONI ANDERSON'S HAIR look like
RICARDO MONTALBAN'S HAIR!
Absolutely. I hadn't even taken a good look at those datapoints yet.
The dataset that I'd like to see:
000-000-0001
000-000-0010
(etc)
000-000-0002
000-000-0004
000-000-0008
000-000-0016
(etc)
I also wonder if the last 8-16 bits involve, at least in part, a count
of the length of the phone number, or at least a flag to distinguish 7
from 10 digits.
>>> def double_binary_lehex_to_double(dhex):
...     "convert little-endian hex of ieee double binary to double"
...     assert len(dhex)==16, (
...         "hex of double in binary must be 8 bytes (hex pairs in little-endian order")
...     dhex = ''.join(reversed([dhex[i:i+2] for i in xrange(0,16,2)]))
...     m = int(dhex, 16)
...     x = ((m>>52)&0x7ff) - 0x3ff - 52
...     s = (m>>63)&0x1
...     f = (m & ((1<<52)-1))|((m and 1 or 0)<<52)
...     return (1.0,-1.0)[s]*f*2.0**x
...
>>> double_binary_lehex_to_double('000000806a6e4941')
3333333.0
>>> double_binary_lehex_to_double('000000806a6e5941')
6666666.0
>>> double_binary_lehex_to_double('0000108777F9Fc41')
7777777777.0
;-)
Regards,
Bengt Richter
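For readers on Python 3, here is a hedged sketch of the same bit-field arithmetic, split out so the three IEEE 754 fields are visible. The name `split_double_fields` is made up for illustration, and the reconstruction shown only covers normal (non-zero, non-denormal) values.

```python
def split_double_fields(hex_le):
    """Split 8 little-endian bytes (given as hex) into sign, exponent, fraction."""
    m = int.from_bytes(bytes.fromhex(hex_le), "little")
    sign = m >> 63                          # 1 bit
    exponent = ((m >> 52) & 0x7FF) - 1023   # 11 bits, bias removed
    fraction = m & ((1 << 52) - 1)          # 52 bits, implicit leading 1 not included
    return sign, exponent, fraction

for h in ("000000806a6e4941", "000000806a6e5941"):
    sign, exponent, fraction = split_double_fields(h)
    # Normal numbers only: value = (-1)^sign * 1.fraction * 2^exponent
    value = (-1) ** sign * (1 + fraction / 2**52) * 2.0**exponent
    print(h, sign, exponent, hex(fraction), value)
```

The two inputs share the same fraction and differ only in the exponent (21 versus 22), which is the single-bit difference noticed earlier in the thread.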
Now the easy way ;-)
>>> import struct
>>> def d2d(h):
...     return struct.unpack('d',''.join(chr(int(h[i:i+2],16)) for i in xrange(0,16,2)))[0]
...
>>> d2d('000000806a6e4941')
3333333.0
>>> d2d('000000806a6e5941')
6666666.0
>>> d2d('0000108777F9Fc41')
7777777777.0
Regards,
Bengt Richter
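A Python 3 sketch of the same idea, run against the examples posted in this thread. The name `decode_field` is made up for illustration. It assumes the 6-byte dumps from the first post are missing their two leading null bytes, and it uses the explicit `"<d"` format so the result does not depend on the byte order of the machine running it.

```python
import struct

def decode_field(hex_on_disk):
    """Decode a hex dump of a stored number as a little-endian IEEE 754 double."""
    raw = bytes.fromhex(hex_on_disk.replace(" ", ""))
    raw = raw.rjust(8, b"\x00")   # restore leading null bytes omitted from 6-byte dumps
    return struct.unpack("<d", raw)[0]

examples = {
    "C0DBA8ECF441": 5616864700,
    "B0E9ADECF441": 5616885403,   # with the suspected F1 -> 41 typo corrected
    "800396d0fd41": 8003346488,
    "481211c70142": 9544278601,
    "00 00 00 80 6a 6e 49 41": 3333333,
    "00 00 00 80 6a 6e 59 41": 6666666,
    "00 00 00 40 7C AB 5D 41": 7777777,
    "00 00 00 00 87 D6 32 41": 1234567,
    "00 00 00 00 00 00 F0 3F": 1,
    "00 00 00 E0 CF 12 63 41": 9999999,
    "00 00 10 87 77 F9 Fc 41": 7777777777,
}
for hex_on_disk, expected in examples.items():
    print(hex_on_disk, decode_field(hex_on_disk), decode_field(hex_on_disk) == expected)

# The bytes posted for 9544261331 decode to a number that differs in one digit.
print(decode_field("F8f50ec70142"))  # 9544261311.0
```

The last line suggests that either the phone number or the hex for that example was mistyped in the original post, much like the F1/41 case.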
Well done, Scott & Bengt!!
I've just verified that this works with all 12 corrected examples posted
by the OP.
Grant, MS-DOS implies 16 bits at least; and yes there was an FPU (the
8087). And yes there are a lot of sick people who store things as
numbers (whether integer or FP) when the only arithmetic operations that
can be applied to them stuff them up mightily (like losing leading
zeroes off post-codes, having NEGATIVE tax file numbers, etc) and it's
still happening on the best OSes and 64-bit CPUs. Welcome to the real
world :-)
Cheers,
John
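For the extraction tool, the leading-zero problem mentioned here has a practical consequence: the decoded double has to be padded back out to a phone number. A minimal Python 3 sketch follows. The function name `format_phone` and its `digits` parameter are made up for illustration, and how to tell a 7-digit number from a 10-digit one is an open question that the stored value alone cannot answer.

```python
def format_phone(value, digits):
    """Render a decoded double as a zero-padded phone number string.

    `digits` (7 or 10) has to come from somewhere else: the stored number
    alone cannot distinguish 000-0001 from 000-000-0001.
    """
    text = str(int(value)).zfill(digits)   # put back the leading zeroes
    if digits == 7:
        return f"{text[:3]}-{text[3:]}"
    return f"{text[:3]}-{text[3:6]}-{text[6:]}"

print(format_phone(1.0, 7))             # 000-0001
print(format_phone(7777777777.0, 10))   # 777-777-7777
```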
If I am simply reading the bytes from disk, would I still need to
convert these values to hex characters first with hexlify, or is there
a more direct route?
I.e. convert them to double floats directly from the byte values?
Yes. Use struct.unpack ;-)
BTW, my second post was doing ''.join(chr(int(h[i:i+2],16)) for i in xrange(0,16,2))
to undo the hexlify you had done (I'd forgotten that there's a binascii.unhexlify ;-)
Regards,
Bengt Richter
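A Python 3 sketch of the direct route: read the 8 bytes and hand them straight to `struct.unpack`, with no hex step in between. The record layout here (a 4-byte text field followed by the number) and the offset are made up for illustration, and `io.BytesIO` stands in for the real data file opened in binary mode.

```python
import io
import struct

# Stand-in for an open data file: a made-up 4-byte text field, then one stored number.
record = io.BytesIO(b"ACME" + bytes.fromhex("000000806a6e4941"))

record.seek(4)                       # hypothetical offset of the number within the record
raw = record.read(8)                 # the 8 bytes exactly as they sit on disk
(phone,) = struct.unpack("<d", raw)  # "<d" = little-endian 8-byte double
print(phone)                         # 3333333.0
```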
> >>> def double_binary_lehex_to_double(dhex):
> ...     "convert little-endian hex of ieee double binary to double"
> ...     assert len(dhex)==16, (
> ...         "hex of double in binary must be 8 bytes (hex pairs in little-endian order")
> ...     dhex = ''.join(reversed([dhex[i:i+2] for i in xrange(0,16,2)]))
> ...     m = int(dhex, 16)
> ...     x = ((m>>52)&0x7ff) - 0x3ff - 52
> ...     s = (m>>63)&0x1
> ...     f = (m & ((1<<52)-1))|((m and 1 or 0)<<52)
> ...     return (1.0,-1.0)[s]*f*2.0**x
> ...
> >>> double_binary_lehex_to_double('000000806a6e4941')
> 3333333.0
> >>> double_binary_lehex_to_double('000000806a6e5941')
> 6666666.0
> >>> double_binary_lehex_to_double('0000108777F9Fc41')
> 7777777777.0
>
> ;-)
Damn.
I still say that's just plain sick.
--
Grant Edwards, grante at visi.com
Yow! NEWARK has been REZONED!! DES MOINES has been REZONED!!
>>>>Perhaps the one bit is an exponent -- some kind of floating point
>>>>based format? That matches the doubling of all digits.
>>>
>>>That would just be sick. I can't imagine anybody on an 8-bit
>>>CPU using FP for a phone number.
>> >>> double_binary_lehex_to_double('000000806a6e4941')
>> 3333333.0
>> >>> double_binary_lehex_to_double('000000806a6e5941')
>> 6666666.0
>> >>> double_binary_lehex_to_double('0000108777F9Fc41')
>> 7777777777.0
>>
>> ;-)
>
> Well done, Scott & Bengt!!
> I've just verified that this works with all 12 corrected examples posted
> by the OP.
>
> Grant, MS-DOS implies 16 bits at least;
You're right. For some reason I was thinking you had said CP/M.
> and yes there was an FPU (the 8087).
I never met an MS-DOS box that had an 8087 (though I did write
firmware for an 8086+8087 fire-control computer once upon a
time).
> And yes there are a lot of sick people who store things as
> numbers (whether integer or FP) when the only arithmetic
> operations that can be applied to them stuff them up mightily
> (like losing leading zeroes off post-codes, having NEGATIVE
> tax file numbers, etc) and it's still happening on the best
> OSes and 64-bit CPUS. Welcome to the real world :-)
I've been in the real world for a long time, and the dumb
things people (including myself) do still surprise me.
--
Grant Edwards, grante at visi.com
Yow! Hello, GORRY-O!! I'm a GENIUS from HARVARD!!
Original Poster should send this off to thedailywtf.com
I absolutely agree. This is a terrible programming practice.
>BTW, my second post was doing ''.join(chr(int(h[i:i+2],16)) for i in xrange(0,16,2))
>to undo the hexlify you had done (I'd forgotten that there's a binascii.unhexlify ;-)
And there's also str.decode('hex'), at least since Python 2.3.
--
TZOTZIOY, I speak England very best.
"Dear Paul,
please stop spamming us."
The Corinthians
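For anyone reading this with Python 3, where strings no longer have a `decode('hex')` method, a brief sketch of two ways to undo a hexlify. Both `binascii.unhexlify` and `bytes.fromhex` are standard library calls; the variable names are just for illustration.

```python
import binascii

h = "000000806a6e4941"
via_unhexlify = binascii.unhexlify(h)   # works in Python 2 and 3
via_fromhex = bytes.fromhex(h)          # Python 3 built-in
print(via_unhexlify == via_fromhex)     # True
```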