
Division of really huge numbers


bv_schornak

Feb 2, 2003, 11:09:50 AM
Hi!

I've been working for some years now on a BCD calculator which handles
numbers with up to 256 BCD digits - let's call it a "hobby within my
hobby". Another thread about division of big numbers reminded me of a
"little" problem which has stalled development of the calculator for a
while now.

Existing parts:
---------------

The two operands are stored in the buffers OP1 and OP2; in addition there
are the result buffer RES and a common buffer COM for temporary results.
Two auxiliary buffers are available if needed. The entire "calculator"
storage is part of my system's memory for numeric data. The memory usage
(3 pages of 4096 bytes each; the 1st page is stored in a file, the 2nd and
3rd pages exist only at runtime) is defined as:

0000 - 0BBF   numeric data (752 dwords)

0BC0 - 0BFF   4 * 16 byte storage for the data of the 4 math buffers

0C00 - 0CFF   COM   common

0D00 - 0DFF   OP1   operand 1

0E00 - 0EFF   OP2   operand 2

0F00 - 0FFF   RES   result

1000 - 2FFF   conversion tables, additional buffers, runtime data (1024 dwords)

Each buffer has a size of 256 bytes. Each byte holds one valid BCD digit
(lower nibble). All bytes above the most significant digit are 0xFF
(the buffer is first filled with 0xFF's, then the number is written).
Operands are stored with the most significant digit first; the least
significant digit *always* occupies the last byte in the buffer (this looks
like [...FF FF FF FF FF FF 01 02 03 04] for the number 1234).

The format is defined as: a leading header, including the exponent
(4 digits, -9999 ... +9999) and some additional data such as the signs,
the number of digits, a rounding bit and some more info, followed by a
string of up to 256 digits.
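
For illustration, a rough C picture of such an operand buffer - the struct
and field names here are made up for the sketch, not the actual layout:

#include <string.h>

#define BCD_DIGITS 256           /* one unpacked BCD digit per byte          */
#define BCD_EMPTY  0xFF          /* filler above the most significant digit  */

struct bcd_op {
    int exponent;                /* -9999 ... +9999                          */
    int sign;                    /* the remaining header data (sign, digit   */
    int ndigits;                 /* count, rounding bit, ...) kept abstract  */
    unsigned char d[BCD_DIGITS]; /* MSD first, LSD always in d[255]          */
};

/* store the number 1234: ... FF FF 01 02 03 04 */
static void store_1234(struct bcd_op *op)
{
    static const unsigned char n[] = { 1, 2, 3, 4 };
    memset(op->d, BCD_EMPTY, BCD_DIGITS);
    memcpy(op->d + BCD_DIGITS - sizeof n, n, sizeof n);
    op->ndigits = (int)sizeof n;
}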

The calculator:
---------------

Addition, subtraction and multiplication routines are coded and working;
they are used, for example, in the hex <-> dec conversion routines for huge
numbers (all routines can produce valid results for any base 2...16).

The multiplication is done by adding OP1 to COM (initially zero), then
searching through OP2 for "1"s. Whenever a "1" is found, COM is added
to RES (initially zero) with some prior shifting (the index registers are
set to other offsets). When the first 0xFF in OP2 is detected (i.e. the end
of OP2's digits), the routine adds OP1 to COM again and searches OP2 for the
next digit value, until OP2 has finally been searched for all "9"s.

The problem:
------------

Because a division is essentially an automated (repeated) subtraction, it
isn't possible to simply take the larger operand as OP1 (as is done when
preparing the multiplication routine, so that we get the lowest possible
number of loops for the digit search).

I know that I have to "shift" OP2 (or OP1!) until both MSDs have the
same offset. Then I compare OP1 against OP2, subtract OP2 as long as OP1
is greater than OP2, and repeat the "shift / compare / subtract" sequence
until OP1 is equal to zero. The problem is that it takes up to 9
comparisons / subtractions until OP1 becomes smaller than OP2. This
applies to all further "shifts", too. The worst case would be
256 * 9 = 2304 comparisons / subtractions, if all digits in OP1 are 9
and OP2 = 1.

One thing I could do is to allocate another page and store all multiples
(2...9) of OP2 in separate buffers, so that repeated subtractions are
avoided. In this case, the comparison of OP1 against the multiples of
OP2 could start with OP2 * 5 (to find the direction), then continue with
OP2 * 6 (7, 8, 9) or OP2 * 4 (3, 2, 1) until the matching multiple is found.
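
A rough C sketch of this scheme: schoolbook long division against
precomputed multiples 1*OP2 ... 9*OP2, with both operands padded to the
same length n with leading zeros. The linear scan from 9 downwards could
of course be replaced by the *5-first probing described above; names and
layout are illustrative only:

#include <string.h>

enum { ND = 256 };                          /* maximum number of digits     */

/* compare two equal-length digit arrays (MSD first): -1, 0 or 1            */
static int dcmp(const int *a, const int *b, int len)
{
    for (int i = 0; i < len; i++)
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    return 0;
}

static void dsub(int *a, const int *b, int len)   /* a -= b, requires a >= b */
{
    int borrow = 0;
    for (int i = len - 1; i >= 0; i--) {
        int d = a[i] - b[i] - borrow;
        borrow = d < 0;
        a[i] = d + (borrow ? 10 : 0);
    }
}

/* quotient = dividend / divisor; all are n-digit arrays, MSD first,
   padded with leading zeros; the divisor must be non-zero                   */
static void bcd_div(const int *dividend, const int *divisor,
                    int *quotient, int n)
{
    int mult[10][ND + 1] = { 0 };           /* mult[k] holds k * divisor    */
    int rem[ND + 1]      = { 0 };           /* running remainder            */

    for (int k = 1; k <= 9; k++) {          /* nine multiples, built by add */
        memcpy(mult[k], mult[k - 1], sizeof mult[k]);
        int carry = 0;
        for (int i = n - 1; i >= -1; i--) { /* right-aligned add of divisor */
            int s = mult[k][i + 1] + (i >= 0 ? divisor[i] : 0) + carry;
            mult[k][i + 1] = s % 10;
            carry = s / 10;
        }
    }

    for (int i = 0; i < n; i++) {           /* schoolbook long division     */
        memmove(rem, rem + 1, n * sizeof rem[0]);        /* rem *= 10       */
        rem[n] = dividend[i];                            /* next digit      */
        int q = 9;
        while (q > 0 && dcmp(mult[q], rem, n + 1) > 0)   /* largest fit     */
            q--;
        if (q > 0) dsub(rem, mult[q], n + 1);
        quotient[i] = q;                                 /* one digit done  */
    }
}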

The question:
-------------

Is there another way to do a division with numbers of this size?

It's a theoretical question - maybe I'll get a clue how to do it in a more
efficient way before I finally start to code it.


Greetings from Augsburg

Bernhard Schornak

Richard Pavlicek

Feb 2, 2003, 1:21:49 PM
Bernhard Schornak wrote:

If the divisor is dword-size, you can use the DIV instruction.
Just start with the highest dword of the dividend, and divide each
successive dword carrying the remainders down. That is, the high
dword of edx:eax starts as zero, then edx contains the previous
remainder.
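
In C terms (64-bit arithmetic standing in for DIV on edx:eax) the scheme
looks roughly like this; the limb layout and names are illustrative:

#include <stdint.h>

/* a[] = big dividend, 32-bit limbs, most significant limb first;
   q[] receives the quotient limbs; the final remainder is returned         */
static uint32_t div_by_u32(const uint32_t *a, uint32_t *q,
                           int nlimbs, uint32_t d)
{
    uint64_t rem = 0;                        /* plays the role of edx       */
    for (int i = 0; i < nlimbs; i++) {
        uint64_t cur = (rem << 32) | a[i];   /* edx:eax                     */
        q[i] = (uint32_t)(cur / d);          /* DIV d                       */
        rem  = cur % d;                      /* remainder carried down      */
    }
    return (uint32_t)rem;
}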

If the divisor is greater than dword-size, I don't think there is
an effective algorithm using the DIV instruction. You have to use
the old "Russian peasant" (shift and subtract) scheme.

Years ago, I played around with this to write a DOS "Super Calculator"
(up to 60,408 decimal digits). If anyone is interested:

Brief Explanation: http://www.rpbridge.net/t/sc.txt
Download (only 9K): http://www.rpbridge.net/z/sc10.zip

E-mail me if you want the source code (MASM style).

--
Richard Pavlicek
Web site: http://www.rpbridge.net


bv_schornak

Feb 2, 2003, 1:42:58 PM
Richard Pavlicek wrote:

>Years ago, I played around with this to write a DOS "Super Calculator"
>(up to 60,408 decimal digits). If anyone is interested:
>
> Brief Explanation: http://www.rpbridge.net/t/sc.txt
> Download (only 9K): http://www.rpbridge.net/z/sc10.zip
>
>E-mail me if you want the source code (MASM style).
>

Thanks for the offer - sorry, I'm using GAS! But I will read the text;
it's always a good idea to learn something.

Since both operands may have up to 256 digits, the DIV version isn't the
right thing. As far as I can see, my "Casio" calculator can't compete with
your "Quartium" machine... ;)

giuseppe2

Feb 3, 2003, 4:27:56 PM
Hi,
This is how I see the divide subject:
someone in comp.lang.c suggested seeing numbers like

typedef struct
{
    unsigned len;
    unsigned char *num;
} numero;

or in asm
1234 = 4 1 2 3 4
or better
1234 = 4 4 3 2 1

Implementing divide_base_10 with these numbers is not impossible for a
newbie. (ma mi ci sono volute sette camicie - Italian: "it took me seven
shirts", i.e. a lot of sweat)
Then we have to write a proc that writes numbers in different bases.

Someone said that numbers are better in base 256 = 0x100,
so
1234 = 2 210 4
and we have to rewrite divide_base_256.

And if the base is 0x10000?
1234 = 1 1234
and we have to rewrite divide_base_65536.
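
A small C sketch of that representation (length first, then the least
significant "digit" first) for an arbitrary base; to_base is a made-up name:

#include <stdio.h>

static void to_base(unsigned long long v, unsigned base)
{
    unsigned long long digits[64];
    unsigned len = 0;

    do {                                      /* peel off digits, LSD first */
        digits[len++] = v % base;
        v /= base;
    } while (v);

    printf("%u", len);                        /* length first ...           */
    for (unsigned i = 0; i < len; i++)
        printf(" %llu", digits[i]);           /* ... then the digits        */
    printf("\n");
}
/* to_base(1234, 10)    -> 4 4 3 2 1
   to_base(1234, 256)   -> 2 210 4
   to_base(1234, 65536) -> 1 1234                                           */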

Paper and pencil and NGs have helped me more than the internet and
**illegible** bignum libraries.

wolfgang kern

Feb 3, 2003, 5:17:38 PM
Bernhard asked :

[about large BCD calc.]



| Is there another way to do a division with numbers of this size?
|
| It's a theoretical question - maybe I get a clue how to do it a more
| effective way, before I finally start to code it.

My os contains a 256-bit calculator which gives up to 78 digits precision.
At the very beginning I had the idea to calculate and store BCD,
but as I always optimise for speed and size, I chose the binary form.

But the divide (and the BIN->ASCII output) problem is the same:

Either up to nine compare/subtract cycles after D.P. adjust and NZ check,

or use a semi-logarithm look-up table
(as I use for precise integer/fraction results up to 10^78 [9*78*32 bytes!]),

or just subtract the log() of both operands,
 by using a 256-digit precision log table (fast, but awfully large),
 or with a selectable-precision log/unlog calculation (which may be slow),
 so you can avoid the preceding D.P.-match circuit.

I remember now a BCD solution I once wrote for the Z80:
it just had one "1x1" (multiplication) table and worked like XLAT for
multiplication, and it used 'successive approximation' (starting at half...)
for divide. This little program calculated 3 times faster
on a 4 MHz Z80 than the Win 3.11 calculator did on a 20 MHz PC.

256 digits??
I use 64 bytes (512 bits) for trig & log constants and result buffers.
This gives ~155 digits of precision.
I support numeric variables of up to 32 mantissa plus 4 exponent bytes.
And I'm asked much too often: "za woos prauch i des?"
(Austrian dialect: "what do I need that for?")

Your target is to hit the left eye of a fly on Sirius III with a laser gun? :)

But yes, a neat, cheap and interesting hobby.
__
wolfgang

bv_schornak

Feb 3, 2003, 7:25:07 PM
giuseppe2 wrote:

>Hi,
>This is how I see the divide subject:
>someone in comp.lang.c suggest to see numbers like
>
>typedef struct
>{unsigned len;
> unsigned char* num;
>}numero;
>
>or in asm
>1234= 4 1 2 3 4
>or better
>1234= 4 4 3 2 1
>
>Implement divide_base_10 with this numbers is not impossible for a
>newbie. (ma mi ci sono volute sette camicie)
>

The translation would be of interest... ;)

>Then we have to write a proc that write numbers in different bases.
>
>Someone said that numbers are better in base 256=0x100
>so
>1234= 2 210 4
>we have to rewrite divide_base_256
>
>and if is base=0x10000?
>1234= 1 1234
>we have to rewrite divide_base_65536
>

First - thanks for offering help.

But I think you misunderstood something. The BCD calculator is already
coded in assembler - except for the division routine. As I said, I have
already found a "solution" myself - but I have my doubts about its
"quality" and efficiency.

>Paper and pencil and NGs have help me more than internet and
>**illegible** bignums library.
>

Much truth in this statement. Even though I'm used to finding solutions by
myself, sometimes my limited math knowledge needs to be updated with
some additional data. Thanks again!

bv_schornak

Feb 3, 2003, 7:52:04 PM
wolfgang kern wrote:

>My os contains a 256-bit calculator which gives up to 78 digits precision.
>At the very beginning I had the idea to calculate and store BCD,
>but as I always optimise for speed and size, I chose the binary form.
>
>But the divide (and the BIN->ASCII output) problem is equal:
>
>Either up to nine compare/subtract-cycles after D.P.-adjust and NZ-check.
>
>Or use a semi-logarithm look-up-table
>(as I use for precise integer/fraction-results up to 10^78 [9*78*32 bytes!]),
>
>Or just subtract the log() of both operands
> by using a 256-digit precision log-table? (fast, but awful large),
> or with a selectable precision log/unlog-calc (that may be slow),
> so you can avoid the preceding D.P.-match circuit.
>
>I remember now a BCD-solution I once wrote for Z80:
>it just had one "1x1"-table and worked like XLAT for multiplication,
>and it used 'successive approximation' (starting at half..) for divide.
>This little program calculated 3 times faster
>on a 4 MHz Z80 than the win3.11 calculator on a 20 MHz PC.
>

So maybe my own solution with the 9 buffers for all multiples 1...9
isn't that bad?

All the other solutions you offered are not that good in my case - don't ask
me anything about log()! I'm happy that I'm able to do some simple dB
calculations for CB radio, one of my other hobbies. And I have to stay
within the definitions of my system. The system includes memory
management, a database engine and some additional functions (about 100
or so) as a base to simplify the coding of applications. Because of this, I
can't add overly large components. The BCD calculator should offer very
accurate calculations, but it is only one part of the entire system...

I will think about all the things I have heard from you and the others right now!

>256 digits??
>I use 64 bytes(512 bits) for trig&log-constants and result buffers.
>This gives ~155 digits precision.
>I support numeric variables up to 32 mantissa plus 4 exponent bytes.
>And I'm asked much too often: "za woos prauch i des?"
>
>Your target is to hit the left eye of a fly on Sirius III with a laser gun? :)
>

Actually only a limited area with some defective facets... ;)

It's just an obsession I have had for about 20 years now. But it still isn't
finished, because my math knowledge isn't very good. I just never learned
all that stuff...

>But yes, a neat, cheap and interesting hobby.
>

That it is! :)


Thanks for help and greetings from Augsburg

Bernhard Schornak

wolfgang kern

Feb 4, 2003, 11:14:33 AM

"Bernhard" wrote:

| So maybe my own solution with the 9 buffers for all multiples 1...9
| isn't that bad?

It seems to be an easy way, even if it costs some time to fill the buffers.
[These nine values are a "semi-log table"!]

| All other solution you offered are not that good in my case
| - don't ask me anything about log()!

In 1970 I bought "Das grosse Handbuch der Mathematik"
("The Big Handbook of Mathematics"; Universität München, Verlag BZL, Köln).

I think a calculator should contain scientific stuff as well.
And a log/unlog for large figures makes life much easier.
I make use of it without any loss of precision,
as my constants and buffers are twice as precise as my variables.

[dB audio:]
20 dB = x10,
3 dB = 5% AFAIR?
may be a bit confusing due to the 2*log10() scale.

| >...to hit the left eye of a fly on Sirius III with a laser gun? :)


| Actually only a limited area with some defective facets... ;)

I see, the remote physician..... :)

| It's just an obsession I have for about 20 years now. But it's still not
| realized 'til the end, because my math knowledge isn't very good.
| I just never learned all that stuff...

I'm old, and I'm crazy because I just can't stop learning every day....

__
wolfgang


bv_schornak

Feb 6, 2003, 2:37:14 PM
wolfgang kern wrote:

>It seems to be an easy way, even it costs some time to fill the buffers.
>[this nine values are a "semi-log table"!]
>

Always learning - we didn't speak the same "language" ¹ here - math
terms are a book with seven seals for me... ;)

¹ "buffers" vs. "semi-log table"

>1970 I bought "Das grosse Handbuch der Mathematik",
> (Universität München, Verlag BZL, Köln)
>

Still available? I'll visit our bookstores. Thanks!

>I think a calculator should contain scientific stuff also.
>And a log/unlog for large figures makes life much easier.
>I make use of it without any precision losses,
>as my constants and buffers are twice as precise as variables.
>

Yes, it *should* - but I have to read the book, first... ;)

Maybe I open another thread, if I have learned a little bit more?

>[dB audio:]
>20 dB = x10,
> 3 dB = 5% AFAIR?
> may be a bit confusing due the 2*log10() scale.
>

Same as in electronics: Ratios for power are log10() while voltages are
2*log10()...

>I see, the remote physician..... :)
>

Oh, that was the coarse version, of course. Now we're working on the
"one facet" improvement... ;)

[ Definitely the end of the joke, we can't go down to the atomic level
... or, maybe ... ok, ready to continue this one! ]

>I'm old, and I'm crazy due I just can't stop learning every day....
>

I may have left a wrong impression - of course I'm always "burning" to
learn. It was just the insight that some things cannot be re-done with
success if the conditions stay the same. One condition is my limited
knowledge, the other one is the "wrong" way teachers at school explain
things. So I didn't learn too much, because I missed the clue. It never
was my thing to know something "from memory"; I always asked about the
underlying matters to understand "How does it work?". With this
knowledge in mind - whenever I need a formula or rule in physics or
math, I am able to derive it on my own ... I don't have to know all the
stuff by heart. That saves a lot of storage for important things... :)

Permanent activity of the brain preserves your youth. It's always a
pleasure to talk with people who are (a little bit) older than me. They
know so many things which I still have to learn. But - it seems, that
most people have forgotten this important way of passing knowledge
between the generations. Nowadays, older people (40 and above) are seen
as a burden, a "question of costs", rather than the valuable pool of
experience and knowledge they are for real...


P.S.: Sorry for the delay - urgent delivery in Sweden (3200 km in 50
hours, the snow slowed me down too much)...

Message has been deleted

wolfgang kern

Feb 6, 2003, 7:26:53 PM

"Bernhard" answered:

| >It seems to be an easy way, even it costs some time to fill the buffers.
| >[this nine values are a "semi-log table"!]

| Always learning - we didn't speak the same "language" ¹ here - math
| terms are a book with seven seals for me... ;)

Ok, the term "semi-log" isn't mentioned in the books,
I think they call it "partial" or similar.

A small overview about log:
similar to the use of MUL/DIV instead of repeated ADD/SUB
[factor=count, and shift-add if count is non-integer],
you can ADD/SUB the exponents instead of MUL/DIV.
But you need to adjust both operands to an equal "base".

ie:
base10 (some books name it "lg", but I use log10)
1E4 * 1E7 = 1E11 (that's easy) log10(4+7)
and 1E4 / 1E7 = 1E-3 log10(4-7)

and 123E4 * 456E7 = 123*456E11 (still easy)
123E4 / 456E7 = 123/456E-3 =0.2697368421E-3 =>269.7E-6
(easy, but division and d.p.-adjustment needed)

now lets use log10 for divide:
log10(123E4) = 6.089905112
log10(456E7) = 9.658964843
sub for divide =-3.569059731
now 10^x = 10^(-3.569059731) = 269.7368421E-6

but this 10^(-3.56...) may be a bit confusing;
the main trick is to split every decimal digit into "nine" values,
the logarithm:

10^0 = 1
----
10^0.1 = 1.258925412
10^0.2 = 1.5848
10^0.3 = 1.9952
10^0.4 = 2.5118
10^0.5 = 3.1622
10^0.6 = 3.9810
10^0.7 = 5.0118
10^0.8 = 6.3095
10^0.9 = 7.9432
----
10^1 = 10
10^1.1 = 12.58925412 ! compare this with the 10^0.1 !

got the idea behind?
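
These values are simply pow(10, k/10); a few lines of C (double precision
only) reproduce the table:

#include <math.h>
#include <stdio.h>

int main(void)
{
    for (int k = 0; k <= 11; k++)            /* 10^0.0 ... 10^1.1           */
        printf("10^%.1f = %.9f\n", k / 10.0, pow(10.0, k / 10.0));
    return 0;
}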

base"e" works also:
ln (123E4) = 14.022524736 123E4 = 2.718281828^14.02
ln (456E7) = 22.240588463 456E7 = 2.718281828^22.24
sub for divide = -8.218063733
now e^x = 269.736842E-6 = 2.718281828^-8.21


or log10(2) = 1/log2(10) = 0.301029996
these two constants ease bin<=>dec conversions a lot.

ie: how many decimal digits are needed to represent a 53 bit binary?
exact: 1 + int(53 * 0.30103)= 16
close: [IMUL x,imm 09A2; shr x,13; inc x] exact up to 300 digits.
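
The same trick in C, just as a cross-check (my own illustration):
0x9A2 / 2^13 = 2466/8192 = 0.301025..., close to log10(2):

/* 1 + int(bits * 0.30103), using 0x9A2/8192 as the factor                  */
static int dec_digits_for_bits(int bits)
{
    return ((bits * 0x9A2) >> 13) + 1;
}
/* dec_digits_for_bits(53) == 16, matching the exact formula above          */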

| ¹ "buffers" vs. "semi-log table"

It's very similar:
I create a complete table during power-up,
you calculate part of it when needed.


My log-table (22Kb) just for example:
all entries are equal size [32 bytes(256 bits)]

entry# binary value

1 1 [1..9 * 10^0]
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10 [1..9 * 10^1]
11 20
12 30
13 40
14 50
15 60
16 70
17 80
18 90
19 100 [1..9 * 10^2]
20 200
21 300
22 400
23 500
...
691 8x10^76
692 9x10^76
693 1x10^77 maximum (2^256= 1,1579....E+77 )

For fast BIN->BCD/ASCII conversion:
known: value (non-zero) of the MSD, power10 of the MSD

look-up at: [power10 * 9 + value-1] * entry-size + table-location,
and that's part of the log10 rules.

ie: a "8" in digit#76 76*9 +7 = 691
(see above, at this entry the binary form of 8E76 is stored)
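
The look-up rule as a one-liner in C (names are mine, 0-based indexing of
the entries assumed):

/* address of the table entry for the MSD 'msd_value' (1..9) standing at
   decimal position 'power10'                                               */
static const unsigned char *table_entry(const unsigned char *table,
                                        int entry_size,
                                        int power10, int msd_value)
{
    return table + (power10 * 9 + msd_value - 1) * entry_size;
}
/* e.g. an "8" at digit #76: 76*9 + 8 - 1 = 691, as in the example above    */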

In the case of BCD you don't need a format-conversion table,
but you may use a table for a top-down comparison before subtracting,
and then you already point at the value to subtract.
There is no divide operation in the whole story then,
and your division will become much faster.

For your 256 BCD digits a semi-log table would have
256 * 9 entries, 128 bytes/entry = <300 KB.

| >1970 I bought "Das grosse Handbuch der Mathematik",
| > (Universität München, Verlag BZL, Köln)

| Still available? I'll visit our bookstores. Thanks!

I don't know if it is still available,
but it is a complete set which answers all math questions
(over 800 tiny-font pages, clearly and briefly commented).

| >I think a calculator should contain scientific stuff also......

| Yes, it *should* - but I have to read the book, first... ;)
| Maybe I open another thread, if I have learned a little bit more?

| >[dB audio:...]


| Same as in electronics: Ratios for power are log10() while voltages are
| 2*log10()...

...isn't that logically correct? "P=U^2/R"

| >I see, the remote physician..... :)

| Oh, that was the coarse version, of course. Now we're working on the
| "one facet" improvement... ;)

:)

| [ Definitely the end of the joke, we can't go down to the atomic level


| ... or, maybe ... ok, ready to continue this one! ]

Many things are smaller than electrons,
but as they are awful fast,
all we can measure is their (shitty) tracks.

| >I'm old, and I'm crazy due I just can't stop learning every day....

| I may have left a wrong impression - of course I'm always "burning" to
| learn. It was just the insight, that some things cannot be re-done with
| success, if the conditions stay the same. One condition is my limited
| knowledge, the other one is the "wrong" way teachers at school explain
| things. So I din't learn too much, because I missed the clue. It never
| was my thing to know something "from memory", I always asked for the
| underlying matters to understand "How does it work?". With this
| knowledge in mind - whenever I need a formula or rule in physics or
| math, then I am able to develop it by my own ... I don't have to know
| all the stuff by heart. Saves a lot of storage for important things... :)

Yes, even sometimes it's annoying to not have things ready by heart,
but it's always funny to check on the ability to recreate a logical path.

| Permanent activity of the brain preserves your youth. It's always a
| pleasure to talk with people who are (a little bit) older than me. They
| know so many things which I still have to learn. But - it seems, that
| most people have forgotten this important way of passing knowledge
| between the generations. Nowadays, older people (40 and above) are seen
| as a burden, a "question of costs", rather than the valuable pool of
| experience and knowledge they are for real...

This may be because many 'older' folks believe they are already educated enough.

| P.S.: Sorry for the delay - urgent delivery in Sweden (3200 km in 50
| hours, the snow slowed me down too much)...

Don't care, I'm not online every day either.
Reminds me of a job about 25 years ago (one million km's in 8 years)....

__
wolfgang


bv_schornak

Feb 6, 2003, 7:48:51 PM
Buffalo wrote:

>Speaking as both a burden and a question of costs (I'm 55), may I give
>you something from my limited store of knowledge?
>

You're welcome! :)

Being 46, I see myself as a member of the mentioned group (except my
lack of knowledge)... ;)

>If your numbers are as long as they are, I don't think a shift and
>subtract algorithm is really good enough. It would be far too slow.
>You really have to use Div. The problem is when the divisor is bigger
>than a d-word can contain. But there is an algorithm for four-word
>division, using div, that exists as a library download, by someone
>called Roger Moser. I can't give you the URL, but you might try a
>Google search. His routine uses div and is probably easily adaptable
>for your bigger numbers. That's one option.
>

I will look for that one.

The problem is that I have unpacked BCD numbers... So I would have to
convert the input before and the output after the division. It is worth
thinking about, because it could work faster than the method I have in
mind at the moment. Maybe there will be some more suggestions, and I will
think about it for a while before I finally start writing code.

>The other option is to get the app that I've been using to learn
>assembly language over the last year- Ketman Maximaster -
>http://www.geocities.com/ketmanweb - because it has a long and
>detailed tutorial on this question, as a separate exe file called
>"Bignums.exe". The advantage is that it isn't just source text with
>commentary, but a full tutorial. It walks you through the whole
>process of division by numbers of up to 4 words in length. The
>algorithm that is finally produced is much faster than Moser's. I'm
>quite sure that it would be easily adaptable for much longer numbers.
>Ketman only provides for 16-bit programming, but if you don't mind
>that, you'll probably find it a good source of info about your
>project.
>

Took a short look and will study it closer within the next days (running
out of free time at the moment...).

See <http://schornak.de/english/pgms/code/bcd/index.htm> - it's an
overview of the existing functions and routines for the BCD calculator...

bv_schornak

Feb 6, 2003, 10:28:58 PM
Hi Wolfgang!

Wow! Give me one or two days to learn it ... I come back with a detailed
reply!

Thanks a lot!

bv_schornak

Feb 10, 2003, 5:07:08 PM
wolfgang kern wrote:

Heavy stuff. Got stuck somewhere...

>base10 (some books name it "lg", but I use log10)
> 1E4 * 1E7 = 1E11 (that's easy) log10(4+7)
>and 1E4 / 1E7 = 1E-3 log10(4-7)
>

Ok - up to here, I did know something about it...

>and 123E4 * 456E7 = 123*456E11 (still easy)
>

This one is clear, too. It's just like

0.5 * 0.5 = (5 * 5)E-2 = 25 E-2 = 0.25

> 123E4 / 456E7 = 123/456E-3 =0.2697368421E-3 =>269.7E-6
> (easy, but division and d.p.-adjustment needed)
>

Got it - it's funny that I have used these methods before without knowing
what I was doing. I think I get the clue of the log calculations - it's
nothing else than shifting the operands left or right as we need them,
then updating the exponent with the corrected value.

>now lets use log10 for divide:
>log10(123E4) = 6.089905112
>log10(456E7) = 9.658964843
>sub for divide =-3.569059731
>now 10^x = 10^3.5690059731 = 269.7368421E-6
>

A little bit hard to get the clue where the numbers are coming from, but
finally ...

1.23 E6 -> 6.08...
4.56 E9 -> 9.65...

... I see: the integer part is the power of 10. The digits behind the
decimal point are still a problem. The result is

1 / 1.0 E3.569... (which is the same as 1.0 E-3.569...).

>but this 10^(3.56) may be a bit confusing,
>the main trick is to split every decimal digit into "nine" values,
>the logarithm:
>
>10^0 = 1
>----
>10^0.1 = 1.258925412
>10^0.2 = 1.5848
>10^0.3 = 1.9952
>10^0.4 = 2.5118
>10^0.5 = 3.1622
>10^0.6 = 3.9810
>10^0.7 = 5.0118
>10^0.8 = 6.3095
>10^0.9 = 7.9432
>----
>10^1 = 10
>10^1.1 = 12.58925412 ! compare this with the 10^0.1 !
>
>got the idea behind?
>

I know how to use these floating point numbers, but not how to calculate
them! Of course, there are tables from which I can cut & paste them - but
that is just using something without knowing how it "works". It would make
it much easier to understand the whole story if I knew where all the
values are coming from.

As I guess (or remember?):

10^(1/n) = n-th root of 10   ???

(Sorry to be unpolite, but I only know the German terms: "Zehn hoch (eins
geteilt durch n) ist gleich der n-ten Wurzel aus 10" - i.e. "ten to the
power of (one divided by n) equals the n-th root of 10".)

so

1.0 E 0.50 = 1.0 E 1/2 = "2nd root"(10)
1.0 E 0.33 = 1.0 E 1/3 = "3rd root"(10)
1.0 E 0.25 = 1.0 E 1/4 = "4th root"(10)

and so on (read "n-th root(10)" as "n-te Wurzel aus 10"). Is that right?

If so - isn't it exactly the same "inaccurate" rounding we have to do
with floating point numbers in binary calculators?

Example: 0.4 -> 0.33 + 0.0625 + ... (we probably never reach the proper
value of 0.4).


<cut off some vital parts, but they are not forgotten! First I have to
understand the above...>

>In case of BCD you don't need a format-conversion-table,
>but you may use a table for top-down comparison before subtract,
>and you already point to the value to subtract.
>There is no divide-operation in all the story then.
>And your division will become much faster.
>
>For your 256 BCD digits a semi-log table would have
>256 *9 entries, 128 bytes/entry = <300Kb
>

That would be a case for my database engine. But each comparison is a loop
through up to 256 digits, too - it may cost the same time as an add or
sub loop.

>...isn't that logical correct? "P=U^2/R"
>

... came to my mind because of your sentence about 2*log10() in dB audio ...

>Many things are smaller than electrons,
>but as they are awful fast,
>all we can measure is their (shitty) tracks.
>

Oh well... (maybe I should unsubscribe from alt.fan.madonna...) ;)

The last level of this one would have been the story about causing some
"pollution" in the lattice of the eye (you know how ICs or transistors
are produced - there is a very small amount of other atoms in the Si
lattice, which give the component a N or P polarity). So the eye would
be mis-used to build an organic computer. Thus, the fly could be
controlled and used as a bio-camera to transfer pictures to earth...

Next to the speed, there is the "Unschaerfe-Relation" - we can't see
things which are smaller than the size of the beam we use to make them
visible. My knowledge isn't very accurate here, but Heisenberg's theory
says something in that direction.

>Yes, even sometimes it's annoying to not have things ready by heart,
>but it's always funny to check on the ability to recreate a logical path.
>

Surely doesn't exclude everything - simple formulas like "R = U / I" and
similar stuff is stored in my brain, of course. It depends on how often
I use a formula, before I store (or better: learn) it. But - if I don't
know something, there are some books and magazines in reach... ;)

>This may be due many 'older' folks believe in being already educated enough.
>

Let's say - some... But it is a general problem of "modern" society.
If you're older than 35 ... 40, your "worth" decreases from year to
year. A wrong attitude, but who is able to do something against a common
prejudice? The same goes for the saying that *all* unemployed people are
just too lazy to work - if there are no jobs (4.5 million unemployed
here...), then that can't be the truth.

>Don't care, I'm not online every day either.
>Reminds me of a job about 25 years ago (one million km's in 8 years)....
>

Another driver? Hi there, keep the bumpers clean!

25 years ago it was really hard work to keep larger vehicles moving.
Today the tech doesn't leave much real work to do (fully automatic
drives - no clutch, gears shifted with a joystick, power steering
better than in limousines, et cetera). But fixed dates and the
individual traffic eat up all the advantages - comparable to the
latest PC, which will always be too slow for the latest OS (nope, no
names; even Linux isn't that fast if you use all the "nice" (animated)
graphic features)...

Ben Peddell

Feb 11, 2003, 1:24:30 AM
OK. What about the general Newton-Raphson recurrence for the reciprocal?
It goes:
Z[i+1] = Z[i] * (2 - b * Z[i])

First, convert each 128-digit number to a 512-bit long integer or a 512.1536
or something fixed-point number. Perhaps your log table would be good for
this.
Then, bsr through the divisor until you find the MSB. This'll create an
exponent. Then, use some of the upper bits (say 16 bits) to index into a
lookup table. Then, take this lookup (in fixed point 0.16 format or
something), and shift it right by the exponent. This'll give a basic
approximation of the reciprocal. Then refine this approximation by
re-iterating the general Newton-Raphson recurrence for the reciprocal.
After the reciprocal is refined enough, you can multiply it by the
dividend.
Then, convert the number back into an unpacked BCD number. Your table would
be good for this again.
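
A tiny double-precision C illustration of the recurrence, just to show how
quickly it converges (a real implementation would of course iterate in the
big fixed-point format described above):

#include <stdio.h>

int main(void)
{
    double b = 456e7;                    /* divisor                         */
    double z = 1e-10;                    /* rough first guess for 1/b       */

    for (int i = 0; i < 8; i++) {
        z = z * (2.0 - b * z);           /* Z[i+1] = Z[i] * (2 - b*Z[i])    */
        printf("iteration %d: z = %.15g\n", i, z);
    }
    printf("123e4 / 456e7 = %.10g\n", 123e4 * z);   /* ~269.7e-6            */
    return 0;
}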

I'd estimate that there would be about 8192 32-bit multiplies, about 8192
32-bit adds and about 64 32-bit subtracts per iteration, and that about 128
iterations would be required to get a suitably refined divisor. Then, there
would be another 4096 or so 32-bit multiplies and about the same number of
32-bit adds to do the final 2048-bit multiply.
This is a total of perhaps 11 million instruction cycles for the actual
division (assuming that a multiply takes 10 cycles, and an add takes 1
cycle).
The memory requirement would depend on how good you want the initial
approximation to be. In fact, the lookup table may be omitted. However, the
Log table is really needed (even one entry for each power of ten - 32768
bytes for 128 digit to 2048-bit fixed point), unless you are really short on
memory and have lots of time to spare.

wolfgang kern <now...@nevernet.at> wrote in message
news:b1uui2$h95$1...@newsreader1.netway.at...

bv_schornak

Feb 11, 2003, 6:50:02 PM
Ben Peddell wrote:

Wow!

I never learned this stuff at school, so I have to find out what it is
all about. All these terms and methods you mention - I have the
impression that it is too complicated for my dumb head to mix all this
data together into a working routine...

Why should I convert, reconvert and do all the other operations? As I said
in the original posting, my own idea for the routine would produce a
maximum of 2304 subtract / compare loops. With a 2048-byte table (OP2 *
2... OP2 * 9) it would be a maximum of, let's say, 512...2048 loops,
depending on how the comparison algorithm is "optimized".

If it really speeds up the division to do it in binary, septimal or
octal mode, then I would convert / reconvert the operands, of course.
But I have no clue how to do that, because I didn't even think
about it until now. The reason for BCD is *accuracy* (+/- 1 in the least
significant digit) - I can't reach the same results with binary floating
point numbers.

Nevertheless - as we have started the lesson about logarithms, I'm
trying to learn this stuff! Maybe with fewer terms I don't know (as you
may have seen, I probably know many of the German terms, but it's very
difficult to translate them). Learning - for me - is understanding how
things "work", and there are some things I still don't understand (how to
calculate the floating point numbers)...

I just zipped the currently available sources, the makefile, some additional
text and the bcd.lib for download <http://schornak.de/download/BCD.zip>
(18,913 bytes). If the doc contains false statements, please tell me - it
will be corrected! The texts are written with OS/2 E.EXE. You need an editor
which is able to handle the original PC character set, codepage 850, to
view them, e.g. EDIT.COM (they should look like
<http://schornak.de/pics/scnshot.png>).

wolfgang kern

Feb 11, 2003, 6:07:18 PM

"Ben Peddell" wrote:

As I already answered Bernhard regarding his unpacked BCD solution,
and I'm using a 256-bit binary integer mantissa tagged with a 10^x exponent,
I'll answer just for that case.



| OK. What about the general Newton-Raphson recurrence for the reciprocal?
| It goes:
| Z[i+1] = Z[i] * (2 - b * Z[i])

Could work, even if it is two MULs per bit - but isn't that the way the FPU
works? With 0.1 as a periodic figure? (which is awfully slow and inexact)
I prefer exact integers and let the user do the rounding as desired.


| First, convert each 128 digit number to a 512-bit long integer or a 512.1536

| or something fixed-point number. Perhaps your Log table would be good for this.

My mantissa is integer already.
And my buffers are twice as large as variables.

| Then, bsr through the divisor, until you find the MSB. This'll create an
| exponent.

Yes, an exponent power of 2, which I convert to decimal as:
(done for both operands)

XOR   eax,eax
BSR   eax,[oper1+size]     ; bit index of the most significant set bit
IMUL  eax,000009A2h        ; * 2466   (2466/8192 is close to log10(2))
SHR   eax,0dh              ; / 8192
JZ    +2                   ; zero -> skip the INC
INC   eax                  ; 1 + int(bit_index * 0.30103)

which actually does a sharp enough multiply by 0.30103 = log10(2),
so the result is the 10^x value of the MSB.
This avoids unnecessary zero comparisons and look-ups.

Then I multiply the dividend by the (already pointed-to) power-10 value of
the divisor, to get maximum precision and to make the dividend larger than
the divisor, so that an integer result is produced.

| Then, use some of the upper bits (say 16 bits) to index into a lookup table.
| Then, take this lookup (in fixed point 0.16 format or something),
| and shift it right by the exponent.

| This'll give a basic approximation of the exponent.

As I use 'strict integer' and my exponents are powers of 10,
I can avoid shifts here, and the result exponent is already known exactly.

| Then refine this approximation by re-iterating the general Newton-Raphson
| recurrence for the reciprocal.
| After the reciprocal is refined enough, then you can multiply it by the
| dividend.
| Then, convert the number back into an unpacked BCD number. Your table would
| be good for this again.

My variables are stored as integers,
but yes I use this table for conversion to ASCII (display/print/text)

| I'd estimate that there would be about 8192 32-bit multiplies, about 8192
| 32-bit adds and about 64 32-bit subtracts per iteration, and that about 128
| iterations would be required to get a suitably refined divisor. Then, there
| would be another 4096 or so 32-bit multiplies and about the same number of
| 32-bit adds to do the final 2048-bit multiply.
| This is a total of perhaps 11 million instruction cycles for the actual
| division (assuming that a multiply takes 10 cycles, and an add takes 1
| cycle).
| The memory requirement would depend on how good you want the initial
| approximation to be. In fact, the lookup table may be omitted. However, the
| Log table is really needed (even one entry for each power of ten - 32768
| bytes for 128 digit to 2048-bit fixed point), unless you are really short on
| memory and have lots of time to spare.


I did play around with the 1/x multiply in the past.
But as I support more than 40 different numeric types,
from signed byte up to (any byte size) unsigned 256-bit and others,
and all of these types shall participate in calculations without conversion,
I first decided on a compare/subtract-loop division,
and finally added the log/exp calculation to be used for large-figure
divisions as well.

The speed comparison favours compare/subtract for num-types below 8 bytes,
while the partial log look-up is faster for larger types.

I didn't recalculate your estimated 11 million cycles,
but I divide two 256-bit values with 32-bit exponents within 150K cycles.

Perhaps I look again at the 1/x solution,
in the integer form 10^(MSD-power of mantissa x)/x....

__
wolfgang


wolfgang kern

Feb 11, 2003, 9:50:30 PM

"Bernhard" answered:

Yes.
[...]


| I know how to use, but not how to calculate this floating point numbers!

| Of course, there are tables where I can cut & paste them - but it is
| just using something without knowing how it "works". Would make it much
| easier to understand the whole story, if I knew, where all the values
| are coming from.

See further below.

| As I guess (or remember?):

| 1/n n_ ____
| 10 = \/ 10 ???

| (Sorry to be unpolite, but I only know the German terms: "Zehn hoch
| (eins geteilt durch n) ist gleich der n-ten Wurzel aus 10".)

| so
| 1.0 E 0.50 = 1.0 E 1/2 = "2nd root"(10)
| 1.0 E 0.33 = 1.0 E 1/3 = "3rd root"(10)
| 1.0 E 0.25 = 1.0 E 1/4 = "4th root"(10)
| and so on (read "n-th root(10)" as "n-te Wurzel aus 10"). Is that right?

This is correct! ;except spell:"impolite" :)

And already the next step in using log:
for x^y and "x-th root of y" [x,y nonintegers also]
where the POWER/ROOT calculation can be
replaced by MUL/DIV log(x),log(y) if equal base,
and even easy to convert if different base:

logA(n)=logA(B)*logB(n)

log10(n)=1/ln(10)*ln(n)
constants: ln(10) = 2.3025851... 1/ln(10)= 0.4342945..
so:
log10(n)= ln(n)*0.4342945

But my math.-knowledge/-terminology is German based as well...
A native English speaker may correct us if we abuse the terms. :)

But the log (and trigonometric) table creation is performed by
endless series (or until: tired, precise enough, overflow, doomsday):

ln(1+x) = x -x^2/2 +x^3/3 -x^4/4 + ...±x^n/n

e^x = 1 +x/1! +x^2/2! +x^3/3! ... x^n/n!

sin x = x -x^3/3! +x^5/5! -x^7/7! +...±x^n/n! [n odd]

the "!" means factorial (German "Fakultät"), which is 1*2*3...*n, ie: 6! = 720

(hope I typed it correctly)
Some special tricks to short-cut the last term are mentioned in the books,
but they may be wrong for 256 digits.
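
For what it's worth, a tiny C check of the first series (double precision
only; the series converges for -1 < x <= 1, and the real job would of
course run in the 256-digit format):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 0.5, term = x, sum = 0.0;

    for (int n = 1; n <= 40; n++) {          /* x - x^2/2 + x^3/3 - ...     */
        sum += (n % 2 ? term : -term) / n;
        term *= x;
    }
    printf("series: %.12f   ln(1.5): %.12f\n", sum, log(1.0 + x));
    return 0;
}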

| If so - isn't it exactly the same "inaccurate" rounding we have to do
| with floating point numbers in binary calculators?

| Example: 0.4 -> 0.33 + 0.0625 + ...
| (we probably never reach the proper value of 0.4).

You're right, but even integers may produce fractions when divided.
What will you display as the result for 4/3 ? Or, even worse, 1/7, 1/31 ?
(a numerator/denominator pair would be the only precise result)
But as you have already defined your buffers at twice the needed size,
the rounding can start at the very, very least significant digit.

| <cut off some vital parts, but they are not forgotten! First I have to
| understand the above...>


Sorry, forget this table - it just covers the binary 10^x values,
which you don't need for BCD.

My thought was just within my binary solution.
A complete log look-up table for a 256-digit mantissa would have
10^256 - 1 entries, 256/512 bytes each -
and that's a lot!
And a semi-log table won't help too much in your case.

Btw: compare loops should start at the MSD and may compare 32 bits at once.
Only if the first/previous compare produces "=" does it need to loop further.
A no-match will end the iteration anyway.

But instead of the MUL-fill of your nine buffers and so on,
you may calculate the log up to the desired precision.
Then you wouldn't need a table at all.
This just depends on the main target, which, as always, is "speed or memory".


-------
|......So the eye would be mis-used to build an organic computer.


| Thus, the fly could be controlled and used as a bio-camera to transfer
| pictures to earth...

Nice idea, perhaps not 'that' far away from being possible.
I know about attempts to produce organic semiconductors;
they work and can be controlled with 10^-18 A or even light,
but the problem was stability - they became "old" too fast.

| Next to the speed, there is the "Unschaerfe-Relation" - we can't see
| things which are smaller than the size of the beam we use to make them
| visible. My knowledge isn't very accurate here, but Heisenberg's theory
| says something in that direction.

I once read several things on that and was really interested in more,
but my bank account wasn't very pleased with ....

| Another driver? Hi there, keep the bumpers clean!

I didn't drive trucks, but emergency medical equipment repair kept me on
the roads, and several boxes of spare parts and tools made my arms a bit
longer :)

__
wolfgang
btw: did you realise we use the same dummy

bv_schornak

Feb 13, 2003, 3:54:25 PM
wolfgang kern wrote:

[Stuff I did understand so far...]

>...except spell:"impolite" :)
>

I assumed I could turn it into the opposite with a preceding "un-" (which
works in many cases)...

[Thanks! I appreciate to hear about faults - the only way to learn
something!]


[Problems and no end...]

1st - thanks for the time you take to teach me all these things!

I have visited some pages with additional information (lots of graphics and
more text). As I have seen, all this stuff is the most complex kind of
calculation you can do in math. So please be patient if I need some
time to understand all the things you tell me (my head is smoking
without a burning cigarette around... :) ).

If you learn this at school, you don't know anything about the further stuff
or where the things you learn will be used later on. And this could be my
problem: I already know a little bit about math, I'm doing calculations
without knowing that they are part of this stuff, I know where I could
need it, and so on. Thus, it is very hard to shift down a gear and fill
some holes in my storage with the missing knowledge...

>And already the next step in using log:
> for x^y and "x-th root of y" [x,y nonintegers also]
>where the POWER/ROOT calculation can be
>replaced by MUL/DIV log(x),log(y) if equal base,
>and even easy to convert if different base:
>
>logA(n)=logA(B)*logB(n)
>
>log10(n)=1/ln(10)*ln(n)
> constants: ln(10) = 2.3025851... 1/ln(10)= 0.4342945..
>so:
>log10(n)= ln(n)*0.4342945
>

Ok, let me guess:

e^ln(10) = 10

(could be...) so

e^ln(x) = x   ???

>But the log-(and trigonometric) table-creation is performed by:
> endless (or until: tired, enough precise, overflow, doomsday)
> rows:
>
>ln(1+x) = x -x^2/2 +x3/3 -x^4/4 + ...±x^n/n
>

At first I read it as

X * -X^(2/2) * X^(3/3) * -X^(4/4) ... Ouch.

but this doesn't make sense, because X^1 = X ... so there are some other
ways to read it:

X * -(X^2)/2 * +(X^3)/3 * -(X^4)/4 ...   or

X - (X^2)/2 + (X^3)/3 - (X^4)/4 ...?


<Let's skip the following for later discussion...>

>e^x = 1 +x/1! +x^2/2! +x^3/3! ... x^n/n!
>
>sin x = x -x^3/3! +x^5/5! -x^7/7! +...±x^n/n! [n odd]
>
>the "!" means faculty, which is 1*2*3...*n ie: 6! = 720
>

>(hope I typed it correct)
>some special tricks to short-cut the last element are mentioned in the books, but this may be wrong for 256 digits.
>

If we can develop a rule which is valid and works for 5 digits, then
it will still be valid (and work) for 50,000 digits. This is how I test
my calculator functions - if they work with "2 +-*/ 2", then they will
work with "123,456,789 +-*/ 987,654,321", too. And if they work with that
second pair of numbers, then they work with all possible numbers... ;)


[Precision...]

>| If so - isn't it exactly the same "inaccurate" rounding we have to do
>| with floating point numbers in binary calculators?
>| Example: 0.4 -> 0.33 + 0.0625 + ...
>| (we probably never reach the proper value of 0.4).
>
>You're right, but even integers may produce fractions when divided.
>What will you display as result for 4/3 ? or more worse 1/7, 1/31 ?
>(a nominator/denominator pair would be the only precise result)
>But as you already defined your buffers twice of needed size,
>the rounding can start on the very, very least significant digit.
>

The results of 4/3 et cetera are periodic. There are some tricks to keep
precise results with one or two bits (at least - one byte is enough to
store the amount of periodic digits)...


[No needs for semi-log table...]

>Btw: compare loops should start at MSD and may compare 32 bits at once.
>Only if the first/previous compare produces "=" it needs to loop further.
>A no-match will end the iteration anyway.
>

Problem, because:

OP1 .. FF FF FF FF 01 02 03 04 ( 1 234)
OP2 .. FF FF FF 05 06 07 08 09 (56 789)

-> using standard comparison results in OP1 > OP2...

The existing compare routine first counts the number of digits of both
operands. A real top-down compare is done only if both operands are of
the same size. But you're right - dword compares should be used
to speed up the routine. It's on my to-do list!

The reason for the FF is that FF is not a valid digit (all routines
can do calculations in any base 2...16, assuming that the input operands
*are* valid numbers in the defined base).

Since all routines work backwards, they couldn't distinguish whether a "00"
is a valid digit within the number or whether we have already passed the
MSD. BTW - the FF's are included in the corresponding XLAT tables for
ASCII <-> BCD or hex, too (8 tables of 256 bytes each).

Whenever an FF is detected in one of the operands, the routine stops the
running calculation.

>But instead of the MUL-fill your nine buffers and so on,
>you may calculate the log up to desired precision.
>So you wouldn't need a table at all.
>This just depends on the main target, as always is "speed or memory".
>

The answer would be "the optimum" - highest possible speed and lowest
possible memory consumption. In real life, we can't have it all, and
with today's computers, the memory problem isn't that important. Speed
is the goal, because I don't know, how many calculations an application
programmer wants to perform in a row - so my system should be optimized
for speed, of course.

To optimize for speed, it needs some tables with a lot of entries, I
guess. We have to translate to "natural" logarithm (do we?), calculate
and re-transform to the given base (default is 10, but a base 7 could be
possible as well¹). Needs at least two tables with high precision, a
case for my database engine...

¹ The concept of the entire math unit is that the basic operations -
compare, add, sub, mul and div - can be used outside the calculator,
too. They are able to handle bases 2...16, so everybody could add their own
calculator for, let's say, base 11, if that is what s/he needs. This is
OpenSource software, so it should be expandable...

[The eye of the fly...]

>Nice idea, perhaps not 'that' far away from being possible.
>I know about attempts to produce organic semiconductors,
>they are working and can be controlled with 10^-18 Amp or even light,
>but the problem was stability, they became "old" too fast.
>

There's a lot of research in bio-electronics nowadays. But I only know
the one or other thing I can read in c't magazine, e.g. about
bio-displays. The entire field of electronic components has grown so fast
that you have to specialize in one or another sub-field, because you
can't keep everything in mind. My special fields are audio (amps,
effects) and CB (amateur) radio.

[ Heisenberg's "Unschaerfe-Relation" (Uncertainty Principle)]

>I once read several things on that and were really interested in more,
>but my bank-accountancy wasn't very pleased with ....
>

It's a little bit more complex than I said in simple words, see
<http://www.aip.org/history/heisenberg/>. Much math... ;)

>| Another driver? Hi there, keep the bumpers clean!
>
>I didn't ride trucks, but emergency medical equipment repair kept my on the roads, and several boxes of spare-parts and tools made
>my arms a bit longer :)
>

I know that effect ... as long as the arms don't touch the ground... ;)

I drove trailer trucks (2 units of 7.15 m each) and semi-trailers (1 unit
of 13.7 m) for several years before I found my current job. For almost 1.5
years now my default vehicle has been a Peugeot Boxer (remember - the white
flash (up to 150 km/h)). I drive "real" trucks whenever the company we get
our orders from is in need of a spare driver (holidays, vacancies). It took
me almost 15 years to reach my first million km (due to the speed and time
limits for trucks)...

>btw: did you realise we use the same dummy
>

Right! I took the name from a hidden OS/2 system folder (the shredder =>
NULL device). "nowhere" is one of 50 e-mail addresses (50 MB each)
included in the domain service. It is my spam container, but I check its
contents at least once a day... ;) (Replace it with "sho" for better
response times...)

wolfgang kern

Feb 14, 2003, 7:30:43 PM

"Bernhard" wrote:

| I assumed to turn it to the opposite with a preceding "un-" (matching


| in many cases)...
| [Thanks! I appreciate to hear about faults - the only way to learn
| something!]

Me too! And there are many bugs in my verbal expressions.....
I learned to spell "illogical", "impolite", "nonconforming", "inaccurate",
but there are, besides others, many opposite terms I don't know.

| [Problems and no end...]

| 1st - thanks for the time you take to teach me all this things!

You're welcome - as I am trying to optimise my calculator right now,
the discussion helps me to see it from different points of view.
And it may improve my knowledge base if I'm able to explain it.

| I have visited some pages with additional information (much graphics and
| more text). As I have seen, all this stuff is the most complex kind of
| calculations you can do in math. So please keep patient, if I need some
| time to understand all the things you tell me ( my head is smoking
| without a burning cigarette around... :) ).

Don't worry, 'log' is easy compared to 'infinitesimal matrices' or
'spherical trigonometry'.
If you catch the main sense of it, the rest will be easy.

| If you learn this at school, you don't know anything about further stuff
| or where the things you learn are used later on. And this could be my
| problem: I already know a little bit about math, I'm doing calculations
| without knowing that they are part of this stuff, I know where I could
| need it, and so on. Thus, it is very hard to switch down a gear and fill
| some holes in my storage with the missing knowledge...

| >And already the next step in using log:
| > for x^y and "x-th root of y" [x,y nonintegers also]
| >where the POWER/ROOT calculation can be
| >replaced by MUL/DIV log(x),log(y) if equal base,
| >and even easy to convert if different base:

| >logA(n)=logA(B)*logB(n)

| >log10(n)=1/ln(10)*ln(n)
| > constants: ln(10) = 2.3025851... 1/ln(10)= 0.4342945..
| >so:
| >log10(n)= ln(n)*0.4342945

| Ok, let me guess:

| e^ln(10) = 10
| (could be...) so

| e^ln(x) = x ???

Exact!

| > But the log-(and trigonometric) table-creation is performed by:
| > endless (or until: tired, enough precise, overflow, doomsday)
| > rows:

| >ln(1+x) = x -x^2/2 +x3/3 -x^4/4 + ...±x^n/n

| At first I read it as
|
| X * -X^(2/2) * X^(3/3) * -X^(4/4) ... Ouch.

Sorry, it would have helped if I had put brackets around the terms,
but the general order-of-operations rules say:
MUL/DIV before ADD/SUB ("Punkt vor Strich" - "dot before dash"),
EXP/LOG before MUL/DIV (highest-order operation first).

| but this doesn't make sense, because X^1 = X ... so there are some other
| ways to read it:

| X * -(X^2)/2 * +(X^3)/3 * -(X^4)/4 ...   or

| X - (X^2)/2 + (X^3)/3 - (X^4)/4 ...?

Yes - the last one is how it should look.

| <Let's skip the following for later discussion...>

OK.


| >e^x = 1 +x/1! +x^2/2! +x^3/3! ... x^n/n!

| >sin x = x -x^3/3! +x^5/5! -x^7/7! +...±x^n/n! [n odd]

| >the "!" means faculty, which is 1*2*3...*n ie: 6! = 720

| >(hope I typed it correct)
| >some special tricks to short-cut the last element are mentioned in the
| >books, but this may be wrong for 256 digits.


| If we can develop a rule which is valid and working for 5 digits, then
| it will still be valid (and working) for 50,000 digits. This way I test
| my calculator functions - if they work with "2 +-*/ 2", then they will
| work with "123,456,789 +-*/ 987,654,321", too. And if they work with the | 2nd two numbers, then they work with all possible
numbers... ;)

--the small form--
I see; let's start this idea with just one digit:
ie: we decide to create a table with nine entries,
ln(1e32) ... ln(9e32):

ln(1e32)= 73.68272298
2 74.37587016
3 74.78133526
4 75.06901734
...
9 75.87994755
Now we got 9 values for digit#32.

Can we easily convert them for use with digit #31 ?
We would need to divide the values by 10 - or, as we already know, better:
we compute ln(1e31) = ln(1e32) - ln(10)
ln(10) = 2.302585093 (a constant)
73.68272298 - 2.302585093 =
ln(1e31) = 71.38013788

This is not a coincidence; it will work for all digits.

So a short table may contain:
9 values for the digit values,
256 entries for the various power-10 values.

Even though every digit must be handled and summed up to get the final
ln(x), only ADD/SUB is needed here, and the table is comfortably small.

Instead of the power-10 entries we may also use the relationship:

logA(n) = logB(n) * logA(B)

ln(10)  = 2.302585093
ln(100) = 4.605170186 = 2*ln(10) = log10(100)*ln(10)   interesting?
ln(1000)= 6.907755279 = 3*ln(10)
The table shrinks to 10 entries now, although one additional MUL is needed.
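
The relation behind the table rows is ln(d * 10^k) = ln(d) + k*ln(10), so
nine digit values plus the constant ln(10) reproduce every entry; a small
double-precision C check (my own illustration):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double ln_digit[10], ln10 = log(10.0);   /* 2.302585093                 */

    for (int d = 1; d <= 9; d++)
        ln_digit[d] = log((double)d);        /* the nine digit values       */

    /* digit 2 at position 32, i.e. 2e32 - compare with the table above:    */
    printf("ln(2e32) = %.8f\n", ln_digit[2] + 32 * ln10);  /* 74.37587016   */
    return 0;
}
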
-------
The commonly used check is "1/7" (not rounded) followed by *7, which must
produce a figure in which all digits are "9".
[ the 142857142857142857.. test ]

| [Precision...]
|
| >| If so - isn't it exactly the same "inaccurate" rounding we have to do
| >| with floating point numbers in binary calculators?
| >| Example: 0.4 -> 0.33 + 0.0625 + ...
| >| (we probably never reach the proper value of 0.4).

| >You're right, but even integers may produce fractions when divided.
| >What will you display as result for 4/3 ? or more worse 1/7, 1/31 ?
| >(a nominator/denominator pair would be the only precise result)
| >But as you already defined your buffers twice of needed size,
| >the rounding can start on the very, very least significant digit.

| The results of 4/3 et cetera are periodic. There are some tricks to keep
| precise results with one or two bits (at least - one byte is enough to
| store the amount of periodic digits)...

The length of the period depends on the value of the "prime" divisor:
3   -> 1
7   -> 6
13  -> 6
31  -> 15
111 -> 3
511 -> 24
How large will the period be for primes > 250 digits?
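
The period is the multiplicative order of 10 modulo the divisor (after
removing any factors 2 and 5); for a prime p it always divides p - 1, so
for the 250-digit primes asked about it can itself be hundreds of digits
long. A brute-force C check for small values (my own illustration):

#include <stdio.h>

/* length of the repeating part of 1/n in decimal (0 = terminating)         */
static int decimal_period(unsigned n)
{
    while (n % 2 == 0) n /= 2;               /* factors 2 and 5 only move   */
    while (n % 5 == 0) n /= 5;               /* the decimal point           */
    if (n <= 1) return 0;

    unsigned r = 10 % n;
    int k = 1;
    while (r != 1) {                          /* order of 10 modulo n       */
        r = (unsigned)(((unsigned long long)r * 10) % n);
        k++;
    }
    return k;
}

int main(void)
{
    unsigned tests[] = { 3, 7, 13, 31, 111, 511 };
    for (int i = 0; i < 6; i++)
        printf("1/%u repeats every %d digits\n", tests[i],
               decimal_period(tests[i]));
    return 0;
}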


| >Btw: compare loops should start at MSD and may compare 32 bits at once.
| >Only if the first/previous compare produces "=" it needs to loop further.
| >A no-match will end the iteration anyway.

| Problem, because:
|
| OP1 .. FF FF FF FF 01 02 03 04 ( 1 234)
| OP2 .. FF FF FF 05 06 07 08 09 (56 789)
|
| -> using standard comparison results in OP1 > OP2...

Hm, I would see the no-match in this example
as " OP2 > OP1 ", because MSD#2 > MSD#1.

| The existing compare routine first counts the amount of digits of both

| operands. A real top down compare is done only if both operands are of


| the same size. But you're right. The usage of dword compares should be
| implied to speed up the routine. It's on my to-do list!

| The reason for the FF is, that FF is no valid expression (all routines
| can do calculations in any base 2...16, assuming that the input operands
| *are* valid numbers of the defined base).

| Since all routines work backwards, they couldn't distinguish, if a "00"
| is a valid digit within the number, or if we already passed the MSD. BTW
| - the FF's are included in the corresponding XLAT-tables for ASCII <->
| BCD or hex, too (8 tables à 256 byte).

Yes, start with the LSD for ADD/SUB/MUL,
but you'll start with the MSD for CMP or DIV.
As you have already counted the valid digits, the number of leading zeros
is known.

?? Wouldn't one 16-byte string be enough for HEX/BCD->ASCII conversion ??
Unpacked BCD->ASCII needs just an OR x,30h - or an AND x,0Fh for ASCII->BCD.

I see, your "FF"s can avoid leading-zero handling.
But wouldn't compares and calculations (think of the digit carry-over)
be a bit easier if you filled the unused digits with zeros?

A REPZ SCAS (or any faster equivalent) can find the MSD, which also gives
the power-10 value of the number and the leading zero count as well.

| Whenever a FF is detected in one of the operands, the routine stops the
| running calculation.

I would do this a bit different, like pass on the 'used' operand-size,
which actually is identical to the MSD's power of 10.

| >But instead of the MUL-fill your nine buffers and so on,
| >you may calculate the log up to desired precision.
| >So you wouldn't need a table at all.
| >This just depends on the main target, as always is "speed or memory".

| The answer would be "the optimum" - highest possible speed and lowest
| possible memory consumption. In real life, we can't have it all, and
| with today's computers, the memory problem isn't that important. Speed
| is the goal, because I don't know, how many calculations an application
| programmer wants to perform in a row - so my system should be optimised
| for speed, of course.

| To optimise for speed, it needs some tables with a lot of entries, I
| guess. We have to translate to "natural" logarithm (do we?), calculate
| and re-transform to the given base (default is 10, but a base 7 could be
| possible as well¹). Needs at least two tables with high precision, a
| case for my database engine...

The natural logarithm is best for trigonometric functions,
conversion into log10 can be performed with a single multiply by a constant.
So just two tables are needed, one for ln(x) and one for e^x,
but very large tables if they shall give full precision.

For 256 digits, the table entries shall have at least 260 digits,
and a full size look-up must have 10^256 entries, so perhaps a version of
the small-table concept spanning several decades would be a good compromise.
i.e.: if you decide on 16 blocks, the table shrinks to 10^16 entries.


| ¹ The concept of the entire math unit is, that the basic operations -
| compare, add, sub, mul and div - can be used outside the calculator,
| too. They are able to handle bases 2...16, so everybody could add an own
| calculator for let's say base 11, if that is what s/he needs. This is
| OpenSource software, so it should be expandable...

I see your point,
but the few people with 11 fingers may also use the decimal system :)
I think binary log2
decimal log10
natural log e
will be the only useful ones.
Remember: if you have one log-base defined,
it's easy to convert it into any base with just a constant multiplication.

| ...My special fields are audio (amps, effects) and CB (amateur) radio.
| My main fields are medical/analysing machinery and (hobby) true KI.

| [ Heisenberg's "Unschaerfe-Relation" (Uncertainty Principle)]

| >I once read several things on that and were really interested in more,
| >but my bank-accountancy wasn't very pleased with ....

| It's a little bit more complex than I said in simple words, see
| <http://www.aip.org/history/heisenberg/>. Much math... ;)

YES, needs some spare time to go through it,
and as mentioned before "time vs. money".

__
wolfgang

Ben Peddell

unread,
Feb 14, 2003, 10:09:58 AM2/14/03
to

bv_schornak <now...@schornak.de> wrote in message
news:b2c24g$baa$02$1...@news.t-online.com...

I only learned it from an AMD 3DNow! document, which has a description at
the end of how the reciprocal and reciprocal-square-root approximations
work.

The equation (in a loop) for the reciprocal approximation would be:
x = x * (2 - b * x)
where x is your reciprocal approximation, and b is the number that you want
the reciprocal (1 / b) of.
The stars are multiplies.

If you don't want to alter any of your functions to take binary, why not use
your existing multiplier and subtractor.

First, check that the divisor is non-zero. If it is zero, then set eax to 1
and return.

Then, convert up to 8 digits of the divisor to a floating point number.
Then, divide 1 by this number. Then, convert the result back into BCD, with
the decimal point just after the most significant digit. This'll give a
reciprocal approximation to start off with.

Now, for the multiplies, you'll probably have to somehow increase their size
to double the normal size (say 512 characters) so that you can multiply an
integer by a fraction. The subtract need not be changed, as it cannot return
results greater than 2.0.
Now an iteration starts.
Multiply the previous reciprocal approximation by the divisor. Then,
subtract this result from 2.0. Then multiply this result by the previous
reciprocal approximation. This result will be the new reciprocal
approximation.
Do this again, up to 10 times, depending on how close the first multiply in
the iteration comes to 1.0.
Now that the reciprocal approximation has been calculated, multiply it by
the dividend. Since you only want an integer result, chop off the part of
the result after the decimal point.
If this result is zero, then you'd set eax to 3 and return.
I don't really see how the divide can ever overflow, since the result is the
same size as the inputs, and a divisor of less than 1 is not possible
without a divide by zero.
If all goes well, then you'd set eax to zero.
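
A minimal Python sketch of the scheme just described, with the decimal module standing in for the BCD buffers (precision, names and iteration count are my own choices). The float 1/b provides the coarse start, and each x = x * (2 - b*x) pass roughly doubles the number of correct digits.

------------------------------------------
from decimal import Decimal, getcontext

getcontext().prec = 300              # a bit more than the 256 BCD digits

def reciprocal(b, iterations=10):
    """Newton-Raphson reciprocal: x = x * (2 - b*x)."""
    b = Decimal(b)
    x = Decimal(1.0 / float(b))      # coarse start, e.g. from the FPU
    two = Decimal(2)
    for _ in range(iterations):
        x = x * (two - b * x)        # correct digits roughly double per pass
    return x

def divide(a, b):
    """a / b as a * (1/b); chop the fraction if only the integer part is wanted."""
    return Decimal(a) * reciprocal(b)

print(divide(10**20, 7))             # 14285714285714285714.2857...
------------------------------------------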

I'll see about trying to get a routine up and running for you, and get back
to you in a few days.


bv_schornak

unread,
Feb 15, 2003, 6:01:36 PM2/15/03
to
Ben Peddell wrote:

>I only learned it from an AMD 3Dnow! document, which has a description in
>the end of it of how the reciprocal and reciprocal-square-root
>approximations work.
>

My Acrobat reader (4.01, latest OS/2 version) shows only 1/3 page - with
one formula and some text (probably incomplete)...

The math operations use 256 digit integers, yes. But the calculator is
able to handle numbers - 999.9... E -9999 through + 999.9... E 9999 !
Handling of the floating point and the exponents should be part of the
calculator rather than of the base operations. The calculator should be
used in real world applications, so the needed amount of digits may be
far below the capacity of the routines, anyway.

Just looked at some pages about the Newton-Raphson method. Very
complicated, as I could see. The principle is clear, but the way... ;) A
similar method I had in mind, as I thought about implementing a square
root function. But my version would have done a "guess" and thousands of
calculations to find a matching result (1st guess a number (done with
the FPU - coarse, but near enough), then change each digit until the
square of the "guess" is equal to the input operand). But your method
only needs 5 ... 10 calculations!

At the moment, I'm a little bit confused with all these methods and
calculations. Would be a good thing, if we could put it all together in
one - an enhancement with the logarithm stuff *and* your method, too.
Then the calculator could be used for all math operations with high
accuracy and optimized speed.

To be fair - all of my code is OpenSource under (L)GPL (may save you a
lot of work, if you don't like OpenSource software). I probably can code
it myself, if I would understand how the underlying math "works".
(Understanding is important...)

At the moment - I already get stuck with this one "x = x * (2 - b * x)",
because I can't do more than

0 = ( x * (2 - (b * x)) ) / x

with it. This should be the point where the graph of the function hits
zero. I think, that we could skip the 1st x (-> x / x ? Have to update
my knowledge about this simple math rules), so only the (2 - (b * x)) is
divided by x. Needs some time to think about it!

Not to forget - thanks a lot for the time you spend with this problem!


Going to relax a little bit until tomorrow -

Have a nice weekend

Bernhard Schornak

bv_schornak

unread,
Feb 16, 2003, 1:39:54 PM2/16/03
to
wolfgang kern wrote:

>Don't worry, 'log' is easy related to 'infinitesimal matrix' or 'spherical trigonometric'.
>If you catch the main sense of it the rest will be easy then.
>

I'm on tenterhooks... :)

I have a big favour to ask you: If it doesn't take too much of your time
- could you give me some more examples with real numbers along with the
formulas? Most times I'm able to figure out, how something "works", if I
see a practical example. A formula is very good, after I understood how
to use it, but until I understand it, I can realize the relations much
better if I see something I know (the numbers) together with the formula.

A little bit like real life - the people who invented all these formulas
had lots of numbers first, then they saw a "law" for the relations
between some of this numbers, finally they developed a formula which is
valid for every number... :)


[Some formulas]

logA(n) = logA(B) * logB(n)

log10(n) = 1 / ( ln(10) * ln(n) )

e^ln(x) = x

[constants]

ln(10)     = 2.302585.......
1 / ln(10) = 0.4342945......
e          = 2.7182818284...


[Current state]

ln(1 + x) = x - x^2/2 + x^3/3 - x^4/4 ... +/- x^n/n


e^x = 1 + x/1! + x^2/2! + x^3/3! ... + x^n/n!


sin x = x - x^3/3! + x^5/5! - x^7/7! ... +/- x^n/n!


(BTW - Ever noticed, that every x! with (x >= 5) is a multiple of 5?)


>MUL/DIV before ADD/SUB, ("Punkt vor Strich")
>EXP/LOG before MUL/DIV, (high order operation first)
>

Sometimes it looks like I'm a new born child...

>--the small form--
>I see, let's start this idea with just one digit:
>ie: we decide to create a table with nine entries
>ln 1...9 *10^32
>
>ln(1e32)= 73.68272298
> 2 74.37587016
> 3 74.78133526
> 4 75.06901734

>....


> 9 75.87994755
>Now we got 9 values for digit#32.
>
>Can we easy convert them to use for digit#31 ?
>we need to divide by 10, or as we know already better:
>we subtract ln(1e31)= ln(1e32)-ln(10)
>ln(10)= 2.302585093 (is a constant)
> 73.68272298 - 2.302585093 =
>ln(1e31)= 71.38013788
>
>This is not a coincidence, it will work for all digits.
>

Now, let me see if I understand it right:

Let's assume a number 123. This would be solved with:

3 * 10 E 0 -> 1.09861232
+ 2 * 10 E 1 -> 2.99573281
+ 1 * 10 E 2 -> 4.60517020
-----------------------------

-> 8.69951533 ???

(table-entry) - (n * ln(10))

This result can't be true - the number whose ln is 8.69951533 lies somewhere
between 6000 and 10000 (guessed). As I can see, the addition is the wrong
operation to put the numbers together.

A solution using a similar way would be very fast...

>So a short table may contain:
> 9 values for the digit values,
>256 entries for the various power-of-10 values.
>
>Even though every digit must be handled and summed together to get the final ln(x),
>only ADD/SUB is needed here and the table is comfortably small.
>

We would need 265 entries à 256 byte = 67,840 byte. No problem with that...

>Instead of the power10 entries we may also use the relationship:
>
>logA(n)=logB(n)*logA(B)
>
>ln(10) = 2.302585093
>ln(100) = 4.605170186 =2*ln(10) = log10(100)*ln(10) interesting?
>ln(1000)= 6.907755279 =3*ln(10)
>the table shrinks to 10 entries now, although one additional MUL is needed.
>-------
>The commonly used check is "1/7" (not rounded) followed by *7, which must produce a figure where all digits are "9".
>[ the 142857142857142857.. test ]
>

The multiplication 2*ln(10), 3*ln(10) is clear - it's similar to the dB
stuff...

I think, that this way is more flexible than calculation with base 10,
only. Would need a table with 16 entries + the constant for each base n
(2 ... 16). Only 16 sectors on a HD...

>The results of 4/3 et cetera are periodic. There are some tricks to keep precise results with one or two bits (at least - one
>byte is enough to store the amount of periodic digits)...
>
>The periodic size depends on the value of the "prime" divisor.
>  3 -> 1
>  7 -> 6
> 13 -> 6
> 31 -> 15
>111 -> 3
>511 -> 24
>how large will the period be for primes > 250 digits?
>

Is there an answer? As you say, it depends on the divisor (-> 111 vs. 31)...

>| OP1 .. FF FF FF FF 01 02 03 04 ( 1 234)
>| OP2 .. FF FF FF 05 06 07 08 09 (56 789)
>

>Hm, I would see the no-match in this example
>as " OP2 > OP1 " due MSD#2 > MSD#1.
>

Really?

mov eax, 0xFFFFFFFF # load OP1
cmp eax, 0xFFFFFF05 # compare against OP2

So OP1 < OP2 ???

A trick I use sometimes is to do three byte compares, then the needed
amount (here 1) of dword compares. But my example wasn't very clever,
anyway. Since OP1 has less digits, it must be smaller than OP2...

>| The reason for the FF is, that FF is no valid expression (all routines
>| can do calculations in any base 2...16, assuming that the input operands
>| *are* valid numbers of the defined base).
>
>| Since all routines work backwards, they couldn't distinguish, if a "00"
>| is a valid digit within the number, or if we already passed the MSD. BTW
>| - the FF's are included in the corresponding XLAT-tables for ASCII <->
>| BCD or hex, too (8 tables à 256 byte).
>
>Yes, start with LSD for ADD/SUB/MUL,
>but you'll start with MSD when CMP or DIV.
>As you already counted valid digits, amount of leading zeros is known.
>

Digits are calculated in the compare loop at runtime. The results are
not stored in the current code. The routine uses REPNE SCASB with AL =
0xFF to find the 1st invalid digit, then I do a (0x0100 - ECX)...

>?? wouldn't one 16 byte string be enough for HEX/BCD->ASCII conversion??
>unpacked BCD ->ASCII needs just an OR x,30h,or AND x,0Fh for ASCII->BCD.
>

Historical reasons. The "fundamental" concept of the calculator dates back
to 1993. The system memory already is defined (meanwhile the system's
memory was expanded, so different other areas sit on top of the math
buffers) and the tables exist, so I leave this stuff as it is. The main
reason for the tables was (and is), that XLAT performs faster and with
less code than any other conversion method.

Also: 0x1fEc3cBA -> 0x1FEC3CBA (I hate it, if hexadecimals are not
written in capital letters...) (-> 22 entries in the 2nd XLAT-table,
there are 8 of them defined...)

Last, but not least: The table is thought to convert user input. Do you
know what a user is? If it is someone like me, a routine to mask out
false input is badly needed... ;)

>I see, your "FFs" can avoid leading zero handling.
>But wouldn't compares and calculation (think of digit-carry-over)
>a bit easier if you fill the unused digits with zeros.
>

Snippet of the addition "core":

(AT&T syntax - read left to right!)

      ...
      std                    # work backwards
      movb _D_BASE,%dl       # numeric base for RES
ADD0: lodsb                  # get 1st OP digit
      cmpb $0xFF,%al         # end of OP ?
      jne  1f
      incb %dh               # end of OP1 detected
      xorb %al,%al           # zero, if OP1 end
1:    addb %al,%ah           # add to carry
      xchg %esi,%ebx
      lodsb                  # get 2nd OP digit
      cmpb $0xFF,%al         # end of OP ?
      jne  2f
      xorb %al,%al           # zero, if OP2 end
      incb %dh               # end of OP2 detected
2:    addb %ah,%al           # add OPs
      movb $0,%ah            # clear carry
3:    cmpb %dl,%al           # AL >= base ?
      jb   4f
      incb %ah               # set carry
      subb %dl,%al           # AL -= base
      jmp  3b
4:    xchg %esi,%ebx
      cmpb $2,%dh            # if there were two FFs, we
      je   6f                # reached end of both OPs
      stosb                  # store result digit
      movb $0,%dh            # clear counter
      decl %ecx
      je   5f                # process the carry in %ah
      jmp  ADD0              # loop through OPs
      ...

Not the most efficient code, but it is able to add operands of any base
to the defined base for the result. As I said at the beginning, it's
just a "hobby within my hobby", so some very "unusual" things are
implemented...

The basic concept of this routines is based on old A86 stuff I wrote in
1992 - and the concept wasn't developed further since that time (the
routines were replaced with other code, of course). It worked really
fast on an 20 MHz 80386, though... ;)
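
For comparison, a rough Python model of the same digit loop (layout and names are mine): one digit per byte, MSD first, LSD last, 0xFF padding above the MSD, carry handled per digit against an arbitrary base.

------------------------------------------
def bcd_add(op1, op2, base=10, width=16):
    """Add two 0xFF-padded digit buffers, working backwards from the LSD."""
    res, carry = [0xFF] * width, 0
    for i in range(width - 1, -1, -1):
        if op1[i] == 0xFF and op2[i] == 0xFF and carry == 0:
            break                                  # passed both MSDs, done
        d1 = 0 if op1[i] == 0xFF else op1[i]
        d2 = 0 if op2[i] == 0xFF else op2[i]
        carry, res[i] = divmod(d1 + d2 + carry, base)
    return res

def pack(n, width=16):
    digits = [int(c) for c in str(n)]
    return [0xFF] * (width - len(digits)) + digits

print(bcd_add(pack(1234), pack(56789)))   # ...255, 5, 8, 0, 2, 3  ->  58023
------------------------------------------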

>a REPZ SCAS (or any faster equivalent) can find the MSD which also gives the power10 value of the number and the leading zero-count
>as well.
>

Yes, you are right here. It is just a flaw in my concept, because I
didn't think too much about the division routine. You don't need the
data of the MSD and power in the add, sub or mul routines at all (they
stop, if both digits are 0xFF; compare only is used to check, if the
operands for sub / mul should be exchanged) - but the division needs
much more work (and known data) than all of the other operations together...

>| Whenever a FF is detected in one of the operands, the routine stops the
>| running calculation.
>
>I would do this a bit different, like pass on the 'used' operand-size,
>which actually is identical to the MSD's power of 10.
>

True. My concept follows the thought, that this isn't part of the basic
operations. It should be done in the calculator logic, which is nothing
else than the "management" for the operations and handling of the
floating points, exponents and signs. There is a 64 byte area, where all
vital data of each math buffer (OP1, OP2, COM, RES) should be stored. As
I started with the calculator, I thought, that it would be better, if I
first code the math operations, then the calculator, which holds the
logic and management for the base operations. (Also I needed some
functions for a program, so I had to code this part...)

Putting control and management functions into the base operations would
slow them down markedly. The question is how to split up the single
tasks, so we have fast base operations and fast management / control of
the more complex calculations. I think, that the only real problem still
is the division, because it is much more complicated than any of the
other operations. The reason is, that we can't exchange the operands to
simplify the operation. Worse - division is the only operation which may
produce floating points - the real problem with division!

I just think, that I have to change my concept a little bit...

>The natural logarithm is best for trigonometric functions,
>conversion into log10 can be performed with a single multiply by a constant.
>So just two tables are needed, one for ln(x) and one for e^x,
> but very large tables if they shall give full precision.
>
>For 256 digits, the table entries shall have at least 260 digits,
>and a full size look-up must have 10^256 entries, perhaps a several decades spanning version of the small-table-concept would be a
>good compromise.
>ie: if you decide for 16 blocks, the table will shrinks to 10^16 entries.
>

Only problem is, that my database is limited to something about 250 MB,
which is the maximum amount of memory I can allocate for a single
process (the database can handle 4 GB fields (32 bit addresses), but is
limited to the available memory)...

>| ¹ The concept of the entire math unit is, that the basic operations -
>| compare, add, sub, mul and div - can be used outside the calculator,
>| too. They are able to handle bases 2...16, so everybody could add an own
>| calculator for let's say base 11, if that is what s/he needs. This is
>| OpenSource software, so it should be expandable...
>
>I see your point,
> but the few people with 11 fingers may also use the decimal system :)
>

What about the ones who lost one finger? ;)

AFAIK, there are several things which are calculated with octal numbers.
Also, some cryptographic routines may use algorithms which are based on
other numeric systems. As I said, I do very "unusual" things (one may
replace unusual with weird from time to time)...

>I think binary log2
> decimal log10
> natural log e
>will be the only useful.
>Remember: if you have one log-base defined,
>it's easy to convert it into any base with just a constant multiplication.
>

That's right. One table and the constants should be enough...

>YES, needs some spare time to go through it,
>and as mentioned before "time vs. money".
>

Time and money - the most important problems in our "modern" world... ;)


P.S.: As you may have seen Ben's posting, what about putting all this
stuff together?

The approximation method produces accurate results within 10 loops (I
still don't know how to calculate it, but the idea behind it is great).
I think, that you have much more knowledge about math than me, so you
probably will see much better, if there is a chance to "simplify" the
division by using the Newton-Raphson method. Maybe I throw away my
current code and start another try. I probably would do that, if I would
be able to realize how all these formulas can be split up into single
tasks for small routines. The problem is the missing practice with all
the math stuff ( to change gears I don't need a slide rule ;) )...

Ed Beroset

unread,
Feb 16, 2003, 3:06:13 PM2/16/03
to
bv_schornak <now...@schornak.de> wrote:

>(BTW - Ever noticed, that every x! with (x >= 5) is a multiple of 5?)

Every x! is a multiple of every integer n such that (n <= x), so the
observation is more general than just for the number 5.
I'm sure that if you ponder the definition of ! for a moment, you'll
see why this must necessarily be so.

Ed

bv_schornak

unread,
Feb 16, 2003, 5:30:52 PM2/16/03
to
Ed Beroset wrote:

You're right. It's logical - whenever I multiply n with a number m, then
every following multiplication must be a multiple of m. Regardless of
the amount of following multiplications, it's true for each of them. The
quoted rule came to my mind, as I did some of the calculations in my
head. If I do calculations, I often see relations between all the
numbers. But most times I don't recognize, that the rules behind this
relations could be important to understand some mathematical "laws"...

Ed Beroset

unread,
Feb 16, 2003, 9:13:29 PM2/16/03
to
bv_schornak <now...@schornak.de> wrote:

Here's one to chew on then: 5041 is the highest known square which is
one more than a factorial. That is 5041 = 71^2 = 7! + 1. Are there
any others? Nobody knows.

Ed

wolfgang kern

unread,
Feb 17, 2003, 2:06:47 AM2/17/03
to

"Bernhard" wrote:

| >Don't worry, 'log' is easy related to 'infinitesimal matrix' or 'spherical trigonometric'.
| >If you catch the main sense of it the rest will be easy then.

| I'm on tenterhooks... :)

| I have a big favour to ask you: If it doesn't take too much of your time
| - could you give me some more examples with real numbers along with the
| formulas? Most times I'm able to figure out, how something "works", if I
| see a practical example. A formula is very good, after I understood how
| to use it, but until I understand it, I can realize the relations much
| better if I see something I know (the numbers) together with the formula.
|
| A little bit like real life - the people who invented all this formulas
| had lots of numbers first, then they saw a "law" for the relations
| between some of this numbers, finally they developed a formula which is
| valid for every number... :)

When you think of the poor guys who spent a whole life on creating a five-digits precise log-table all by hand......

| [Some formulas]
------------------------------------------
1.) logA(n) = logA(B) * logB(n)

Ok,
let A=2, B=10, n=8000h (=2^15 = 32768)

log2(n) = log2(10) * log10(n)

log2(32768) = 15
log2(10) = 1/log10(2) =3.321928095
log10(32768)= 4.515449935

4.515449935 * 3.321928095 = 14.99999999... ~ 15
------------------------------------------
2.) log10(n) = (1 / ln(10)) * ln(n)

[corrected: log10(n) = 1/ln(10) * ln(n),
            not       1 / ( ln(10)*ln(n) ) ]
two steps:
let n=2
A)
log10(n) = 1 / logn(10)

log10(2) = 1/log2(10)
0.3010299 = 1 / 3.321928095
B)
log10(2) = 1/ ln(10) * ln(2)
0.3010299 = 0.434294482 * 0.693147181
--------------------------------------------
3.)
| e^ln(x) = x

let x=3
ln(3) = 1.098612289

2,7182818284^ 1.098612289 = 3


| [constants]

| ln(10) = 2.302585.......
| 1 / ln(10) = 0.4342945...... =1 / 2.302..

see ln(x)
----------------------------
| e = 2.7182818284...

e = 1 + 1 + 1/2! + 1/3! + 1/4! + 1/5! ... + 1/n! + delta e/(n+1)!
  = 2 + 0.5 + 0.16667 + 0.041667 + 0.0083333 + 0.00138889...
sum 2.0 ; 2.5 ; 2.66667 ; 2.70833 ; 2.71667 ; 2.71806 ;

| [Current state]
|
| ln(1 + x) = x - x^2/2 + x^3/3 - x^4/4 ... +/- x^n/n

ie: ln(1.1)
ln(1+0.1) = 0.1 - 0.01/2 + 0.001/3 - 0.0001/4 + 0.00001/5 .....
                - 0.005  + 0.000333 - 0.000025 + 0.000002
sum 0.095 ; 0.095333 ; 0.095308 ; 0.095310 ; ....
~ 0.0953101798
and
ln(1 - x)= - x - x^2/2 - x^3/3 - x^4/4...-
and
ln( (1+x)/(1-x) )= 2*(x +x^3/3 +x^5/5 +x^7/7....)

[all valid for x < 1 ]
These are just a few of many different ways to calculate...

--------or just one of my old BASIC-examples--
CALC_LN10:
'LNval = ln(10)
x = 0.9                'x = 1 - 1/10   ' in general: x = 1 - 1/n for n > 1
temp = x
for c = 2 to 40        'for 20 digits, 30 iterations are quite enough
  add temp,(x^c)/c     'add is a macro for temp = temp + (x^c)/c
next c
if n < 1 then toggle sign temp
LNval = temp
return
----
according to this working routine:

ln(10)= 0.9 + [0.9^n /n];n=2 to whatever;

0.9
+ 0.81/2 0.405
+ 0.729/3 0.243
+ 0.6561/4 0.164025
+ 0.59049/5 0.118098
+ 0.531441/6 0.0885735
+ 0.4782969/7 0.068328129 1.987
+ 0.43046721/8 0.053808401 2.0408
+ 0.387420489/9 0.043046721 2.08387
............ and so on, slowly converging towards ln(10) = 2.30258509
--------------------------------------------------
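
Since the x = 0.9 series converges slowly, here is a small Python/decimal sketch (precision and names are my own choices) of the ln( (1+x)/(1-x) ) variant quoted above; with x = (a-1)/(a+1) it yields ln(a) directly and needs noticeably fewer terms per digit:

------------------------------------------
from decimal import Decimal, getcontext

getcontext().prec = 60                     # working digits, an arbitrary choice

def ln_series(a):
    """ln(a) = 2*(x + x^3/3 + x^5/5 + ...) with x = (a-1)/(a+1)."""
    eps = Decimal(10) ** -getcontext().prec
    x = (Decimal(a) - 1) / (Decimal(a) + 1)
    x2, term, total, n = x * x, x, Decimal(0), 1
    while abs(term) > eps:                 # stop once terms drop below precision
        total += term / n
        term *= x2
        n += 2
    return 2 * total

print(ln_series(10))     # 2.30258509299404568401799145468...
------------------------------------------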


| 2 3 n
| x x x x x
| e = 1 + ---- + ---- + ---- ... + ----
| 1! 2! 3! n!

let x = 1.098612289

e^1.098612289 = 1 + 1.098612289/1 + 1.098612289^2/2 + 1.098612289^3/6
                            + 0.6034744804 + 0.2209948
sum 1 ; 2.098612289 ; 2.7020867694 ; 2.9230816
and after some more elements added, it will result close to 3.000 (as 3.above)
--------------------------------------------------


| sin x = x - x^3/3! + x^5/5! - x^7/7! ... +/- x^n/n!

As the log-calculation, several other ways are possible....
let x=pi/6 = 0.523598776 (=30°)

sin(pi/6) = 0.5236 - 0.143547577/6 + 0.039354383/120 - 0.010789228/5040

- 0.023924596 + 0.000327953 - 0.000002141......
sum 0.5236 ; 0.499675404 ; 0.500003357 ; 0.500001233.....
close enough to see sin(30°) will become 0.5 ?
-----------------------------------------------------
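
The same kind of check in Python for the sine series (argument and term counts are my own picks): summing x - x^3/3! + x^5/5! - ... at x = pi/6 homes in on 0.5.

------------------------------------------
import math

def sin_series(x, terms):
    """sin(x) = x - x^3/3! + x^5/5! - ..., summed term by term."""
    total, term = 0.0, x
    for k in range(terms):
        total += term
        term *= -x * x / ((2 * k + 2) * (2 * k + 3))   # next odd-power term
    return total

x = math.pi / 6
print(sin_series(x, 1), sin_series(x, 2), sin_series(x, 4))
# roughly 0.52359878  0.49967418  0.49999999
------------------------------------------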


| (BTW - Ever noticed, that every x! with (x >= 5) is a multiple of 5?)

Yes, and every x! (x>=3) is a multiple of 3 and so on...
;perhaps 'this' is the sense of it?

| >MUL/DIV before ADD/SUB, ("Punkt vor Strich")
| >EXP/LOG before MUL/DIV, (high order operation first)

| Sometimes it looks like I'm a new born child...

:)


| >--the small form--
| >I see, let's start this idea with just one digit:
| >ie: we decide to create a table with nine entries
| >ln 1...9 *10^32

| >ln(1e32)= 73.68272298
| > 2 74.37587016
| > 3 74.78133526
| > 4 75.06901734
| >....
| > 9 75.87994755
| >Now we got 9 values for digit#32.

| >Can we easy convert them to use for digit#31 ?
| >we need to divide by 10, or as we know already better:
| >we subtract ln(1e31)= ln(1e32)-ln(10)
| >ln(10)= 2.302585093 (is a constant)
| > 73.68272298 - 2.302585093 =
| >ln(1e31)= 71.38013788

| >ln(9e31)= 73.57736246


| >This is not a coincidence, it will work for all digits.

| Now, let me see if I understand it right:
|
| Lets assume a number 123. This would be solved with:
|
| 3 * 10 E 0 -> 1.09861232
| + 2 * 10 E 1 -> 2.99573281
| + 1 * 10 E 2 -> 4.60517020
| -----------------------------
|
| -> 8.69951533 ???
|
| (table-entry) - (n * ln(10))

| This result can't be true - it is somewhere between 6000 and 10000
| (guessed). As I can see, the addition is the wrong operation to put the
| numbers together.

Right, my fault, see my apology below......
Sure, we cannot add the 'ln'-values if we need ln(100 + 20 + 3).
This addition works with powers of ten for already defined log-entries only.

| A solution using a similar way would be very fast...

| >So a short table may contain:
| > 9 values for the digit values,
| >256 entries for the various power-of-10 values.

| >Even though every digit must be handled and summed together to get the final ln(x),
| >only ADD/SUB is needed here and the table is comfortably small.

I need to correct this:

Sorry, it will work in my binary 10^x environment only,
where it just converts the MSD and either loops CMP/SUB for small figures,
or creates the needed log-value just until desired precision for huge ones.
You can't add the ln(x) values "that" easily.
An infinite series calculation is needed for that.

| >Instead of the power10 entries we may also use the relationship:
| >logA(n)=logB(n)*logA(B)
| >ln(10) = 2.302585093
| >ln(100) = 4.605170186 =2*ln(10) = log10(100)*ln(10) interesting?
| >ln(1000)= 6.907755279 =3*ln(10)
| >the table shrinks to 10 entries now, although one additional MUL is needed.

The above example shall explain the relations only.

YEAH, but the precision is very low with a one digit table!
But I'm thinking of a table-conversion routine, which is adaptable to the
actual current needs and must work in steps of the lowest possible power
of the LSD(s).
At the moment this seems to be a time-consuming ultra-iterative story,
but let me check on that.


| >-------
| >The commonly used check is "1/7" (not rounded) followed by *7, which must produce a figure where all digits are "9".
| >[ the 142857142857142857.. test ]
| >
|
| The multiplication 2*ln(10), 3*ln(10) is clear - it's similar to the dB
| stuff...
|
| I think, that this way is more flexible than calculation with base 10,
| only. Would need a table with 16 entries + the constant for each base n
| (2 ... 16). Only 16 sectors on a HD...

| >how large will the period be for primes > 250 digits?


| Is there an answer? As you say, it depends on the divisor (-> 111 vs. 31)...

I'll search through the already detected (known for sure) primes......

| >| OP1 .. FF FF FF FF 01 02 03 04 ( 1 234)
| >| OP2 .. FF FF FF 05 06 07 08 09 (56 789)

| >Hm, I would see the no-match in this example
| >as " OP2 > OP1 " due MSD#2 > MSD#1.

| Really?

| mov eax, 0xFFFFFFFF # load OP1
| cmp eax, 0xFFFFFF05 # compare against OP2

| So OP1 < OP2 ???

| A trick I use sometimes is to do three byte compares, then the needed
| amount (here 1) of dword compares. But my example wasn't very clever,
| anyway. Since OP1 has less digits, it must be smaller than OP2...

That's what I meant.

| >| The reason for the FF is, that FF is no valid expression (all routines
| >| can do calculations in any base 2...16, assuming that the input operands
| >| *are* valid numbers of the defined base).
| >
| >| Since all routines work backwards, they couldn't distinguish, if a "00"
| >| is a valid digit within the number, or if we already passed the MSD. BTW
| >| - the FF's are included in the corresponding XLAT-tables for ASCII <->
| >| BCD or hex, too (8 tables à 256 byte).

| >Yes, start with LSD for ADD/SUB/MUL,
| >but you'll start with MSD when CMP or DIV.
| >As you already counted valid digits, amount of leading zeros is known.

| Digits are calculated in the compare loop at runtime. The results are
| not stored in the current code. The routine uses REPNE SCASB with AL =
| 0xFF to find the 1st invalid digit, then I do a (0x0100 - ECX)...

| >?? wouldn't one 16 byte string be enough for HEX/BCD->ASCII conversion??
| >unpacked BCD ->ASCII needs just an OR x,30h,or AND x,0Fh for ASCII->BCD.

| Historical reasons. The "fundamental" concept of the calculator is based
| in 1993. The system memory already is defined (meanwhile the system's
| memory was expanded, so different other areas sit on top of the math
| buffers) and the tables exist, so I leave this stuff as it is. The main
| reason for the tables was (and is), that XLAT performs faster and with
| less code than any other conversion method.

| Also: 0x1fEc3cBA -> 0x1FEC3CBA (I hate it, if hexadecimals are not
| written in capital letters...) (-> 22 entries in the 2nd XLAT-table,
| there are 8 of them defined...)

I still use for HEX and BCD display:

odd:
SHR al,4
even:
AND al,0F
CMP al,0A
JC  +2       ;digit 0..9: skip the 2-byte ADD al,07
ADD al,07
ADD al,30

| Last, but not least: The table is thought to convert user input. Do you
| know what a user is? If it is someone like me, a routine to mask out
| false input is badly needed... ;)

I define input-fields as strict NUM or HEX and check against settable limits.

| >I see, your "FFs" can avoid leading zero handling.
| >But wouldn't compares and calculation (think of digit-carry-over)
| >a bit easier if you fill the unused digits with zeros.

| Snippet of the addition "core":

| (AT&T syntax - read left to right!)
|
| ...
| std # work backwards
| movb _D_BASE,%dl # numeric base for RES
| ADD0:lodsb # get 1st OP digit
| cmpb $0xFF,%al # end of OP ?
| jne 1f

| incb %dh # end of OP1 detected

I'm sure you can do this faster anyway,
my way to ADD "any variables" 8 to 256-bit binaries, looks like:
-----------------
movzx ecx,bytes_op1 ;mantissa only; 1..32
;assuming exponent already adjusted
movzx eax,bytes_op2
mov edi,op1 ;lin adr es=ds=flat
mov esi,op2
bsr [edi+ecx],edx ;get MSBit op1
bsr [esi+eax],ebx ;get MSBit op2
.... ;save edx,ebx,
shr edx,3 ;byte-offset now
shr ebx,3
.... ;if bl<dl swap (so bl holds larger operand size)
.... ;calc and save leading zero for formatted output
xor ecx,ecx ;start at LSB ,clr-Cy
ror ah,1 ;copy carry
ADD1:
mov al,[edi+ecx]
rol ah,1 ;restore carry
adc al,[esi+ecx] ;all calculations use three operand mode
mov [ecx+buffer],al ;32bit displacement, selfmodified, destination
ror ah,1 ;copy carry
inc ecx
cmp cl,20h ;32-byte (+1 carry-over)limit
j> done
cmp cl,bl ;until MSB and carry-over included, and then quit
j<= ADD1
rol ah,1 ;restore carry
jc ADD1
done:
-----------


| >a REPZ SCAS (or any faster equivalent) can find the MSD which also gives
| >the power10 value of the number and the leading zero-count as well.

| Yes, you are right here. It is just a flaw in my concept, because I
| didn't think too much about the division routine. You don't need the
| data of the MSD and power in the add, sub or mul routines at all (they
| stop, if both digits are 0xFF; compare only is used to check, if the
| operands for sub / mul should be exchanged) - but the division needs
| much more work (and known data) than all of the other operations together...
|
| >| Whenever a FF is detected in one of the operands, the routine stops the
| >| running calculation.

| >I would do this a bit different, like pass on the 'used' operand-size,
| >which actually is identical to the MSD's power of 10.


| True. My concept follows the thought, that this isn't part of the basic
| operations. It should be done in the calculator logic, which is nothing
| else than the "management" for the operations and handling of the
| floating points, exponents and signs. There is a 64 byte area, where all
| vital data of each math buffer (OP1, P2, COM, RES) should be stored. As

| I started with the calculator, I thought, that it would be better, if I

Seems a full precision table is impossible (except stored on disk),
and the split into parts will cost additional handling.
So how about the first idea to calculate the logarithm when needed,
just to the current required precision.

[...]


| What about the ones who lost one finger? ;)
|
| AFAIK, there are several things which are calculated with octal numbers.
| Also, some cryptographic routines may use algorithms which are based on
| other numeric systems. As I said, I do very "unusual" things (one may
| replace unusual with weird from time to time)...

| >I think binary log2
| > decimal log10
| > natural log e
| >will be the only useful.
| >Remember: if you have one log-base defined,
| >it's easy to convert it into any base with just a constant multiplication.

| That's right. One table and the constants should be enough...

| P.S.: As you may have seen Ben's posting, what about putting all this
| stuff together?

| The approximation method produces accurate results within 10 loops (I
| still don't know how to calculate it, but the idea behind it is great).
| I think, that you have much more knowledge about math than me, so you
| probably will see much better, if there is a chance to "simplify" the
| division by using the Newton-Raphson method. Maybe I throw away my
| current code and start another try. I probably would do that, if I would
| be able to realize how all this formulas can be split up into single
| tasks for small routines. The problem is the missing practice with all
| the math stuff ( to change gears I don't need a slide rule ;) )...

A closer look at the formula reveals it as a logarithmic function,
and I'm afraid 10 iterations will be far away from your 256 and even for my much lesser 78 digits (256-bits) precision.
I think the FPU(18 digits) and the win98 calculator(39 digits) uses this method, with an awful rounding overhead.
But yes, a 1/x or better a 10^max/x conversion isn't a bad idea,
if it's fast 'and' precise.

__
wolfgang


bv_schornak

unread,
Feb 17, 2003, 2:55:15 PM2/17/03
to
Ed Beroset wrote:

>Here's one to chew on then: 5041 is the highest known square which is
>one more than a factorial. That is 5041 = 71^2 = 7! + 1. Are there
>any others? Nobody knows.
>

1st guess - there must be another number. After some calculations:

Time to write a little program (maybe using the new calc)... ;)
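
Such a little program could look like this in Python (the search bound is an arbitrary choice); it brute-forces n! + 1 against perfect squares and, as expected, only n = 4, 5 and 7 turn up:

------------------------------------------
import math

def brocard_search(limit=200):
    """Find n where n! + 1 is a perfect square (the 7! + 1 = 71^2 puzzle)."""
    hits, fact = [], 1
    for n in range(1, limit + 1):
        fact *= n                       # running n!
        root = math.isqrt(fact + 1)     # exact integer square root
        if root * root == fact + 1:
            hits.append((n, root))
    return hits

print(brocard_search())                 # [(4, 5), (5, 11), (7, 71)]
------------------------------------------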

bv_schornak

unread,
Feb 17, 2003, 6:23:39 PM2/17/03
to
wolfgang kern wrote:

>When you think of the poor guys who spent a whole life on creating a five-digits precise log-table all by hand......
>

And me's poor, too - I have to calculate 256 digit precision...
I think, this is a very, very huge problem. To calculate *any*
logarithm we need the division routine first!

>| [Some formulas]
>

Thanks a lot for your examples! Now I see some more light in a
dark room! To decrease the size of our document a little bit -
I skip the entire formulas and examples here - it's not out of
my mind, even if I deleted it for the smaller size!

>--------or just one of my old BASIC-examples--
>CALC_LN10:
>'LNval=ln(10)
>temp=0.9 '1-1/10 ' if n>1: 1-1/n
>for c=2 to 40 'for 20 digits, 30 iterations are quite enough
>add temp,(x^c)/c
>next c
>if n<1 then toggle sign temp
>LNval = temp
>return
>----
>

You use some "non-standard" functions (add x,y)? I did my last
Basic program back in 1980 or so. (But - I got the point!)

>| (BTW - Ever noticed, that every x! with (x >= 5) is a multiple of 5?)
>
>Yes, and every x! (x>=3) is a multiple of 3 and so on...
> ;perhaps 'this' is the sense if it?
>

As you may have seen Ed's postings, there is a lot of stuff to
learn. But that's what I meant: If you see the numbers working,
then you get the clue behind the story much easier than with a
formula...

>Sure, we cannot add the 'ln'-values if we need ln(100 + 20 + 3).
>This addition works with powers of ten for already defined log-entries only.
>

Would have been the optimized solution. It's a pity, that this
doesn't work...

>Sorry, it will work in my binary 10^x environment only,
>where it just converts the MSD and either loops CMP/SUB for small figures,
>or creates the needed log-value just until desired precision for huge ones.
>You can't add the ln(x) values "that" easy.
>An infinite row calculation is needed for that.
>

Ok, then we castrate the BCD calculator to do *only* 10^x ...?
(Would be the main task of the calculator, anyway.)

>But I think of a table-conversion-routine, which is adaptable to actual current needs and it must be in steps of the lowest possible
>power of the LSD(s).
>At the moment this seems to be a time-consuming ultra-iterative story,
>but let me check on that.
>

It's getting more complicated, I guess? ;)

>| >how large will the period be for primes > 250 digits?
>| Is there an answer? As you say, it depends on the divisor (-> 111 vs. 31)...
>
>I'll search the for sure already detected primes......
>

I think, that it isn't predictable - right?

>| But my example wasn't very clever,
>| anyway. Since OP1 has less digits, it must be smaller than OP2...
>
>That's what I meant.
>

Got it! ;)

>I still use for HEX and BCD display:
>
>odd:
>SHR al,4
>even:
>AND al,0F
>CMP al,0A
>JC  +2       ;digit 0..9: skip the 2-byte ADD al,07
>ADD al,07
>ADD al,30
>

0:      lodsb
        xlatb
        cmpb $0xFF,%al
        je   1f
        stosb
1:      decl %ecx
        jne  0b

Depending on a flag (setting EBX), the snippet converts:

1. string -> decimal
2. string -> hexadecimal
3. decimal -> string
4. hexadecimal -> string

Invalid digits from input are truncated with the comparison. Without the
0xFF test we would save 2 more lines. Switching decimal > hexadecimal is
done by adding 0x0100 to EBX...

Do you see, why I prefer the XLATB? You only need *one* routine for back
*and* forth conversion! And - come on - any PC has at least 64 MB of RAM
today, so the 512 byte for both tables are nothing to worry about...

>I'm sure you can do this faster anyway,
>my way to ADD "any variables" 8 to 256-bit binaries, looks like:
>-----------------
>movzx ecx,bytes_op1 ;mantissa only; 1..32
> ;assuming exponent already adjusted
>movzx eax,bytes_op2
>mov edi,op1 ;lin adr es=ds=flat
>mov esi,op2
>bsr [edi+ecx],edx ;get MSBit op1
>bsr [esi+eax],ebx ;get MSBit op2

>..... ;save edx,ebx,


>shr edx,3 ;byte-offset now
>shr ebx,3

>..... ;if bl<dl swap (so bl holds larger operand size)
>..... ;calc and save leading zero for formatted output


>xor ecx,ecx ;start at LSB ,clr-Cy
>ror ah,1 ;copy carry
>ADD1:
>mov al,[edi+ecx]
>rol ah,1 ;restore carry
>adc al,[esi+ecx] ;all calculations use three operand mode
>mov [ecx+buffer],al ;32bit displacement, selfmodified, destination
>ror ah,1 ;copy carry
>inc ecx
>cmp cl,20h ;32-byte (+1 carry-over)limit
>j> done
>cmp cl,bl ;until MSB and carry-over included, and then quit
>j<= ADD1
>rol ah,1 ;restore carry
>jc ADD1
>done:
>-----------
>

My routine is able to add numbers of *any* base 2...16. With some
additional code it could add numbers of *any* base to a number of
a *different* base... It isn't optimized code, just coded to have
a working routine, that's clear... The real difference is, that I
have a "nibble calculator" where you use bytes (dwords would be a
little bit faster?). With some trickery - I could use the FPU, it
handles 18 digits in one gulp...

>Seems a full precision table is impossible (except stored on disk),
>and the split into parts will cost additional handling.
>So how about the first idea to calculate the logarithm when needed,
>just to the current required precision.
>

As I have seen until now, we cannot create a table without a
division routine. So - how to feed the calculator with a non
existing table, to calculate the needed table? I see a *big*
problem with that...

>A closer look at the formula reveals it as a logarithmic function,
>and I'm afraid 10 iterations will be far away from your 256 and even for my much lesser 78 digits (256-bits) precision.
>I think the FPU(18 digits) and the win98 calculator(39 digits) uses this method, with an awful rounding overhead.
>But yes, a 1/x or better a 10^max/x conversion isn't a bad idea,
>if it's fast 'and' precise.
>

Reading some additional pages about the method, 5 iterations
produce a 10 digit precision - 10 iterations may give enough
precision for about 40...60 digits. The big question is, how
complex the routine would be. And the next problem: There is
still *no* accurate division routine available which *could*
do the necessary calculations!

If you use the method in binary calculators, it must be much
more accurate than in a BCD calculator. 18 / 39 digits isn't
very different from 256. The used methods stay the same, and
it is only a question of the needed time, how many digits we
process. My own routines are coded in a way, that they could
be expanded to 1024, 2048 and more digits with a few changes
of the counter...

Jerry Coffin

unread,
Feb 17, 2003, 7:56:42 PM2/17/03
to
In article <b2olqq$c7c$01$1...@news.t-online.com>, now...@schornak.de
says...

[ ... ]

> (BTW - Ever noticed, that every x! with (x >= 5) is a multiple of 5?)

More generally, every X! with X >= N is a multiple of N. If you think
of N! as (N-1)! x N, it tends to make this fact fairly obvious.

This is very similar to a fact used in an early proof that prime numbers
form an infinite set. If we take N!' as meaning the product of all
prime numbers up to N, then N!' is a multiple of each prime number up to
N. N!' +1 then leaves a remainder of 1 when divided by any of the prime
numbers up to N (since N!' is a multiple of all of them). Therefore,
N!' + 1 must itself either be a prime number, or else must be divisible by a
prime larger than N.

--
Later,
Jerry.

The universe is a figment of its own imagination.

wolfgang kern

unread,
Feb 18, 2003, 4:19:34 PM2/18/03
to

"Jerry Coffin" replied:

[about the primes]

| More generally, every X! with X >= N is a multiple of N. If you think
| of N! as (N-1)! x N, it tends to make this fact fairly obvious.
|
| This is very similar to a fact used in an early proof that prime numbers
| form an infinite set. If we take N!' as meaning the product of all
| prime numbers up to N, then N!' is a multiple of each prime number up to
| N. N!' +1 then leaves a remainder of 1 when divided by any of the prime
| numbers up to N (since N!' is a multiple of all of them). Therefore,
| N!' must itself either be a prime number, or else must be divisible by a
| prime larger than N.

Your statement fits what's written in my books.
Do you know a way to figure out the 1/int(x) periodic size?

I think it may be something similar to the 'integer dividable' rules.

as 1/1111 got 4
1/111 3
1/11 2
1/3 1
1/7 6
1/77 6
1/13 6

__
wolfgang

wolfgang kern

unread,
Feb 18, 2003, 4:08:15 PM2/18/03
to

"Bernhard" wrote:

| And me's poor, too - I have to calculate 256 digit precision...


| I think, this is a very, very huge problem. To calculate *any*
| logarithm we need the division routine first!

For table- and constant-creation a slow CMP/SUB divide is enough.
But if we look at the

ln(x) = ±[x^n /n] n=2 to 256

we just need one small table for the 1/n values 1/2 ... 1/256
to calculate log during 'runtime-divisions' without any divide:

ln(x) = ± [x^n * (1/n)]

and another small table for the 1/n! values for e^x calculation.

e^x =1+[x^n /n!] => 1+[x^n * (1/n!)]

But be aware that the factorial grows extremely fast.
ie: 66! = 5.44...*10^92
so for my 256-bit (78 digit) the maximum integer is 57!

In fact you don't need to cycle the calculation up to 256 times.
I stop the 'infinite' loop when the element result is already smaller than
the desired precision (compare the desired LSB# with the result's MSB#).
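
A minimal Python sketch of that table idea (scaling, sizes and names are my own choices): build the 1/n and 1/n! tables once as integers scaled by 10^PREC, then the series needs only multiplications and shifts at runtime, no divide:

------------------------------------------
PREC  = 40                                 # working digits, an arbitrary choice
SCALE = 10 ** PREC

# one-time tables, built with integer division only
INV_N = {n: SCALE // n for n in range(1, 257)}          # ~1/n
INV_FACT, f = {}, 1
for n in range(1, 58):                     # 57! is the largest factorial < 2^256
    f *= n
    INV_FACT[n] = SCALE // f               # ~1/n!, used the same way for e^x

def ln1p_series(x_scaled):
    """ln(1+x) = x - x^2*(1/2) + x^3*(1/3) - ..., x given scaled by 10^PREC."""
    total, power, sign = 0, x_scaled, 1
    for n in range(1, 257):
        total += sign * (power * INV_N[n]) // SCALE     # only MUL and shift
        power  = (power * x_scaled) // SCALE            # next x^n, still scaled
        sign   = -sign
        if power == 0:
            break
    return total

print(ln1p_series(SCALE // 10) / SCALE)    # ln(1.1) ~ 0.0953101798...
------------------------------------------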

|....I deleted it for the smaller size!
Agreed.

[basic syntax]


| You use some "non-standard" functions (add x,y)

Seems I built a 'macro' then to avoid the disturbing 'x=x+y' syntax-junk.
And this example actually uses the formula:

ln(x) = x^n/n + x^(n+1)/(n+1) + ... rather than the ±[ ],
due to the preceding x = 1 - 1/n for 'n>1' (all terms positive).

| As you may have seen Ed's postings, there is a lot of stuff to
| learn. But that's what I meant: If you see the numbers working,
| then you get the clue behind the story much easier than with a
| formula...

| >Sure, we cannot add the 'ln'-values if we need ln(100 + 20 + 3).
| >This addition works with powers of ten for already defined log-entries only.

| Would have been the optimised solution. It's a pity, that this
| doesn't work...

| Ok, then we castrate the BCD calculator to do *only* 10^x ...?


| (Would be the main task of the calculator, anyway.)

The BCD's digit# already points out the 10^x value.
But if I think of the 1/x (better 10^digits/x) conversion,
a table with just the power10 values gives the ln(dividend) already.

| >But I think of a table-conversion-routine, which is adaptable to actual
| >current needs and it must be in steps of the lowest possible
| >power of the LSD(s).
| >At the moment this seems to be a time-consuming ultra-iterative story,
| >but let me check on that.

| It's getting more complicated, I guess? ;)

Let's look at how a division works.
ie:
123/4 = (100+20+3)/4 = 100/4 + 20/4 + 3/4 = 25 + 5 + 0.75 = 30.75

or--------------------------------------------------------

123/4 = e^(ln(100)-ln(4)) + e^(ln(20)-ln(4)) + e^(ln(3)-ln(4))

               ln(100)        ln(20)         ln(3)
               4.605170186    2.995732274    1.098612289
   - ln(4)    -1.386294361   -1.386294361   -1.386294361
              -------------  -------------  -------------
               3.218875825    1.609437912   -0.287682072

now e^x:       25           + 5            + 0.75
===========================================
looks good, but this is math-masturbation till now.
We still need a table with 999 entries for these three digits.
So it's still equal to "ln(123)-ln(4)".

The quest for a convertible minimal table is:
convert ln(100) into ln(1) + log10(100)*ln(10)   (this works due to the "1"),
        ln(20)  into ln(2) + log10(10)*ln(10)    (here some work is needed).

BUT
in terms of log10:
log10(200)= 2.301029996
log10(20) = 1.301029996
log10(2) = 0.301029996
A table using log10 seems to be easily convertible for every digit's power.
let's try again:
(I use the shorter "lg" for log10 from here on)
-------------------------------------------------------------
123/4 = 10^(lg(100)-lg(4)) + 10^(lg(20)-lg(4)) + 10^(lg(3)-lg(4))
      = 10^(lg(100)+lg(1)-lg(4)) + 10^(lg(10)+lg(2)-lg(4)) + 10^(lg(3)-lg(4))
      = 10^(2+lg(1)-lg(4)) + 10^(1+lg(2)-lg(4)) + 10^(0+lg(3)-lg(4))

   digit#        2              1              0
   lg(digit)     0.0            0.301029996    0.477121255
   - lg(4)      -0.602059991   -0.602059991   -0.602059991
                -------------  -------------  -------------
                +1.397940009   +0.698970004   -0.124938737

now 10^x:        25.00000000  + 5.000000000   +0.75  =  30.75
===================================================

now only a log-table with 9 entries is required,
and even though every digit must be calculated, I see only ADD and SUB here.

Oops, surprise, surprise this log10 table seems to work!
But now a 10^x function is needed in addition:

the signed integer part adjusts the decimal point.
And the fraction shall produce a 'several' digits spanning number.


-----interested in some homework exercise? -------
Are you ready to give this digit-based 10^x a try ?
hint:
somehow opposite to the above.
can the e^x formula help here?

You're already able to create the nine log10(2..9) table-entries?


---------------------------------------------------
this minimal table can also be extended to cover 001 to 999
or any other comfortable size.

---------------------------------------------------


| >| >how large will the period be for primes > 250 digits?
| >| Is there an answer? As you say, it depends on the divisor (-> 111 vs. 31)...

Sorry, 111 isn't a prime! It is a multiple of 3, obvious :)

| >I'll search the for sure already detected primes......
| I think, that it isn't predictable - right?

Some different ways to find primes are known,
but none of them says anything about the 1/prime period-size.
Would be an interesting side-job to figure out a formula.
But in our case, it would be enough to list all primes up to the needed size
(a table of primes may be helpful for many other math-jobs too)
and then check out the period-sizes and store them somewhere.

[conversions..]


| 0:lodsb
| xlatb
| cmp al,0xFF
| je 1f
| stosb
| 1:decl ecx
| jne 0b
|
| Depending on a flag (setting EBX), the snippet converts:
|
| 1. string -> decimal
| 2. string -> hexadecimal
| 3. decimal -> string
| 4. hexadecimal -> string

| Invalid digits from input are truncated with the comparison. Without the
| 0xFF test we would save 2 more lines. Switching decimal > hexadecimal is
| done by adding 0x0100 to EBX...
|
| Do you see, why I prefer the XLATB? You only need *one* routine for back
| *and* forth conversion! And - come on - any PC has at least 64 MB of RAM
| today, so the 512 byte for both tables are nothing to worry about...

Ok, I'm a bit outside bounds with my 'standards',
as numeric figures are treated completely separately from text.
Even though 'text to num' is supported, a numeric type must be chosen beforehand.

Yes, we got different I/O-targets (would be boring if all use equal stuff),
so let's just talk about the calculations.

| a *different* base... It isn't optimised code, just coded to have


| a working routine, that's clear... The real difference is, that I
| have a "nibble calculator" where you use bytes (dwords would be a
| little bit faster?). With some trickery - I could use the FPU, it
| handles 18 digits in one gulp...

If you take a closer look at my code, it supports any (byte-aligned)
numeric types with an early 'done' and any digit overflow is covered.
So it can add a one BYTE variable to a 32-byte variable and loops
just as long as carry-over needs to.
(I support var-types from 'signed byte' to 'unsigned 32 byte' integers
with optional up to 32 bit decimal valued exponents).

As you use unpacked BCD, you may work with byte-offsets as well.

I could use dwords for variables which are sized in multiples of 32-bit.

Forget about the FPU (it's slow and rounding overhead is awful),
perhaps some of the 64/128-bit mm-register-'Boolean's may be useful.

| As I have seen until now, we cannot create a table without a
| division routine. So - how to feed the calculator with a non
| existing table, to calculate the needed table? I see a *big*
| problem with that...

Answered above...

| >A closer look at the formula reveals it as a logarithmic function,
| >and I'm afraid 10 iterations will be far away from your 256 and even
| >for my much lesser 78 digits (256-bits) precision.
| >I think the FPU(18 digits) and the win98 calculator(39 digits) uses
| >this method, with an awful rounding overhead.
| >But yes, a 1/x or better a 10^max/x conversion isn't a bad idea,
| >if it's fast 'and' precise.

| Reading some additional pages about the method, 5 iterations
| produce a 10 digit precision - 10 iterations may give enough
| precision for about 40...60 digits. The big question is, how
| complex the routine would be. And the next problem: There is
| still *no* accurate division routine available which *could*

| do the necessary calculations!

| If you use the method in binary calculators, it must be much
| more accurate than in a BCD calculator. 18 / 39 digits isn't
| very different from 256. The used methods stay the same, and
| it is only a question of the needed time, how many digits we
| process. My own routines are coded in a way, that they could
| be expanded to 1024, 2048 and more digits with a few changes
| of the counter...

As our common target seems to be an integer solution rather than
the 'so beloved' floating-point, where the values are always 2>x>1
and a lot of rounding procedures are needed to overcome the fact that
decimal 0.1 is a periodic figure with 2^n exponents,
we better check and continue on our path.
I'm sure our final solution will work somehow similar,
but with an integer mantissa......

__
wolfgang

Jerry Coffin

unread,
Feb 18, 2003, 10:07:26 PM2/18/03
to
In article <b2u80a$1cf$2...@newsreader1.netway.at>, now...@nevernet.at
says...

[ ... ]

> Your statement fits what's written in my books.
> Do you know a way to figure out the 1/int(x) periodic size?
>
> I think it may be something similar to the 'integer dividable' rules.

Yes. You take the denominator, and find the smallest multiple of it
consisting of nines followed by zeroes (e.g. 999000) -- and interestingly
enough, there will always be such a number.

The number of nines in that number will be the number of digits in a
group before the decimal repeats.

You're also correct that divisibility is the key to finding this number.
If the denominator is a prime, the period will be either 1 less than the
prime, or else a factor of that number (e.g. 1/11 would have a maximum
possible period of 10, but in reality has a period of 2, which is a
factor of 10).

If you want to get into all the details of finding the period, just
about any decent book on number theory should give the method.
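
That rule translates into a few lines of Python (naming is mine): strip the factors of 2 and 5 from the denominator (they only produce leading zeroes), then count how many nines are needed before 10^k - 1 is divisible by what is left; that count is the period.

------------------------------------------
def period_length(den):
    """Length of the repeating part of 1/den in decimal (0 if it terminates)."""
    for p in (2, 5):                    # factors of 10 only delay the repetend
        while den % p == 0:
            den //= p
    if den == 1:
        return 0                        # terminating fraction, no period
    k = 1
    while (10**k - 1) % den:            # smallest k with den dividing k nines
        k += 1
    return k

for d in (3, 7, 11, 13, 31, 37, 111):
    print(d, period_length(d))          # periods: 1, 6, 2, 6, 15, 3, 3
------------------------------------------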

Another interesting point is that the period is affected by the
denominator, not the numerator (not really surprising, when you consider
that something like 2/7 can be viewed as 2 * 1/7). This multiplication
doesn't really change the period.

What you might find even more interesting, however, is that in cases of
maximum period (i.e. the denominator is a prime, and the period is one
less than that) multiples repeat the same group of digits, and each
multiple just changes where in the group you start. For example:

1/7 = .142857 142857 ...
2/7 = .285714 285714 ...
3/7 = .428571 428571 ...

IOW, they all have 142857... as a group -- all that changes is which of
those digits comes right after the decimal point.

The murky depths of number theory hide some fascinating little tidbits.

wolfgang kern

unread,
Feb 19, 2003, 1:13:57 AM2/19/03
to

"Jerry Coffin" replied:

[determine period size]

Great, thanks a lot,
now I know where to search for details in my books.
And I remember now the 'xxxx/9999' equivalent for periodic parts,
so an opposite view may reveal the size.....

__
wolfgang

Jerry Coffin

unread,
Feb 19, 2003, 10:25:15 AM2/19/03
to
In article <b2v7al$dje$1...@newsreader1.netway.at>, now...@nevernet.at
says...

[ ... ]

> Great, thanks a lot,
> now I know where to search for details in my books.
> And I remember now the 'xxxx/9999' equivalent for periodic parts,
> so an opposite view may reveal the size.....

A little looking online shows that:

http://mathforum.org/library/drmath/view/51549.html

gives the algorithm quite understandably.

wolfgang kern

unread,
Feb 19, 2003, 10:59:33 AM2/19/03
to

"Jerry Coffin" saved me a lot of time:
[....]

|
| A little looking online shows that:
|
| http://mathforum.org/library/drmath/view/51549.html
|
| gives the algorithm quite understandably.

That's a good link indeed.
Thanks!

__
wolfgang


bv_schornak

unread,
Feb 19, 2003, 4:56:36 PM2/19/03
to
Jerry Coffin wrote:

>What you might find even more interesting, however, is that in cases of
>maximum period (i.e. the denominator is a prime, and the period is one
>less than that) multiples repeat the same group of digits, and each
>multiple just changes where in the group you start. For example:
>
>1/7 = .142857 142857 ...
>2/7 = .285714 285714 ...
>3/7 = .428571 428571 ...
>
>IOW, they all have 142857... as a group -- all that changes is which of
>those digits comes right after the decimal point.
>
>The murky depths of number theory hide some fascinating little tidbits.
>

That's really fascinating (and could be of use for the calculator)! I
guess, that the string in your signature is based on mathematical theories?

Ben Peddell

unread,
Feb 20, 2003, 11:52:47 PM2/20/03
to
I think I'll just upload updates to my homepage
(http://users.bigpond.com/killer.lightspeed/), and post here when I've
finished it.


Jerry Coffin

unread,
Feb 22, 2003, 5:14:25 PM2/22/03
to
In article <b30ufe$224$03$1...@news.t-online.com>, now...@schornak.de
says...

[ ... ]

> >IOW, they all have 142857... as a group -- all that changes is which of
> >those digits comes right after the decimal point.

[ ... ]

> That's really fascinating

Isn't it though? There are a lot more interesting little bits like
that. Consider:

1/11 = .0909090909
2/11 = .1818181818
3/11 = .2727272727
4/11 = .3636363636

and so on --- each time you increase the numerator, the first number in
the repeating pair increases by one and the second decreases by one.
Just FWIW, 37 as a denominator acts much the same way.

> (and could be of use for the calculator)!

If you say so -- it's certainly fun to play with _using_ a high-
precision calculator. A while back I wrote such a thing, more or less;
it's written in lex/yacc instead of assembly, and it's not infinite
precision, but it handles tens of thousands of digits quite nicely
anyway.

> I
> guess, that the string in your signature is based on mathematical theories?

Naw...nothing so philosophical. When I was in college, I first heard
the line about "gravity is imaginary -- really the world just sucks" --
my response was something like "everything is imaginary", and somebody
else asked what was doing the imagining if it was all imaginary. It
took me a while, but I came up with what's now my tagline as a response
to that. My favorite response was on Fidonet when somebody (after I'd
used this tagline for five years or so) had a tagline something like:

Jerry Coffin changes tagline! World to end soon!

Ben Peddell

unread,
Feb 23, 2003, 7:48:57 AM2/23/03
to

Jerry Coffin <jco...@taeus.com> wrote in message
news:MPG.18c1a4695104100098983a@news...

> In article <b30ufe$224$03$1...@news.t-online.com>, now...@schornak.de
> says...
>
> [ ... ]
>
> > >IOW, they all have 142857... as a group -- all that changes is which of
> > >those digits comes right after the decimal point.
>
> [ ... ]
>
> > That's really fascinating
>
> Isn't it though? There are a lot more interesting little bits like
> that. Consider:
>
> 1/11 = .0909090909
> 2/11 = .1818181818
> 3/11 = .2727272727
> 4/11 = .3636363636
>
> and so on --- each time you increase the numerator, the first number in
> the repeating pair increases by one and the second decreases by one.
> Just FWIW, 37 as a denominator acts much the same way.

And they have repeating patterns, regardless of what base you look at them
in.
For example,
1/3 is
0.33... in decimal
0.55... in hexadecimal
0.2525... in octal
0.0101... in binary

1/7 is
0.142857142857... in decimal,
0.249249... in hexadecimal
0.11... in octal
0.001001... in binary

1/11 is
0.0909... in decimal
0.1745D1745D... in hexadecimal
0.05642721350564272135... in octal
0.00010111010001011101... in binary
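(A small C sketch of the same long division in an arbitrary base, the cycle
detected by watching for a repeated remainder; the limits and names are
made up for this example:)

#include <stdio.h>

/* Print 1/n in base b (2..16) by plain long division, putting the
   repeating group in parentheses.  A remainder that shows up twice
   marks the start of the cycle.  Assumes n < 1024. */
static void expand(unsigned n, unsigned b)
{
    static const char dig[] = "0123456789ABCDEF";
    unsigned seen[1024] = {0};   /* seen[r] = 1 + position where remainder r appeared */
    char out[1024];
    unsigned r = 1, pos = 0;

    while (r != 0 && seen[r] == 0 && pos < sizeof out - 1) {
        seen[r] = pos + 1;
        r *= b;
        out[pos++] = dig[r / n]; /* next digit of 1/n in base b */
        r %= n;
    }
    out[pos] = '\0';

    if (r == 0)
        printf("1/%u in base %-2u = 0.%s\n", n, b, out);
    else
        printf("1/%u in base %-2u = 0.%.*s(%s)\n",
               n, b, (int)(seen[r] - 1), out, out + seen[r] - 1);
}

int main(void)
{
    unsigned bases[] = {10, 16, 8, 2};
    for (int i = 0; i < 4; i++) expand(7, bases[i]);
    for (int i = 0; i < 4; i++) expand(11, bases[i]);
    return 0;
}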

wolfgang kern

unread,
Feb 23, 2003, 1:58:19 PM2/23/03
to

The solution???:

the example assumes a single digit based LUT,
but the table may be extended to fit in memory,
a 'grouped' calculation seems to be possible.

needed tables:
1/n ;n 2...precision^2 entries
1/n! ;n 2...precision max.
lg(x) ;x 2...9

the formulas:
------------------------
ln(1+x) = x - x^2/2 + x^3/3 - .. ;[1>x>0]

ln(x) = 1-1/x + [(1-1/x)^n * 1/n] + .. ;[x>1/2]

y=1-1/x

ln(x) = y + [y^n * (1/n)]
;n = 2 to precision^2
!!! but there are still divisions here !!!
So it can be used with slow CMP/SUB divide for table creation only.
------------------------
the term "log10" is called "lg" now
lg(x) = ln(x)*1/ln(10)
----------------------
lg relations:
lg(200) = 2.301029996 = -lg(0.005)
lg(20) = 1.301029996 = -lg(0.05)
lg(2) = 0.301029996 = -lg(0.5)
lg(0.2) =-0.698970004 = -(1 - 0.301029996) = -lg(5)
lg(0.02)=-1.698970004 = -(2 - 0.301029996) = -lg(50)
------------------------
1/x = e^(-ln(x))
= 10^(-lg(x))
------------------------
e^x = 1 + [x^n /n!]
= 1 + [x^n * (1/n!)]
using the 1/n! table
;n = 1 to [until desired precision match to n! - n/5
(trailing zeros won't increase precision)]
------------------------
10^x = e^(x*ln(10))
------------------------
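(A table-creation sketch in plain C doubles -- the real thing would run on
the big BCD operands. The only divisions are the 1/x and 1/n needed to
build the tables once, which is exactly what the slow CMP/SUB divide is
reserved for above:)

#include <stdio.h>

#define TERMS 200

int main(void)
{
    /* 1/n table, built once (divisions allowed here) */
    double inv_n[TERMS + 1];
    for (int n = 1; n <= TERMS; n++)
        inv_n[n] = 1.0 / n;

    /* ln(x) = y + y^2*(1/2) + y^3*(1/3) + ...  with y = 1 - 1/x */
    double ln_of[11];
    for (int x = 2; x <= 10; x++) {
        double y = 1.0 - 1.0 / x;        /* slow divide, table creation only */
        double z = y, sum = 0.0;
        for (int n = 1; n <= TERMS; n++) {
            sum += z * inv_n[n];         /* y^n * (1/n): multiply/add only */
            z *= y;
        }
        ln_of[x] = sum;
    }

    /* lg(x) = ln(x) * 1/ln(10) -- reproduces the lg(2)..lg(9) table below */
    for (int x = 2; x <= 9; x++)
        printf("lg(%d) = %.9f\n", x, ln_of[x] / ln_of[10]);
    return 0;
}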

So let's create tables:
-------------------
lg(1) = 0
lg(2) = 0.301029996 = ln(2)*1/ln(10) = 0.69314 *0.43429
lg(3) = 0.477121255 = ln(3)*.......
lg(4) = 0.602059991 = also lg(2)+lg(2)
lg(5) = 0.698970004
lg(6) = 0.778151250 = also lg(2)+lg(3)
lg(7) = 0.845098040
lg(8) = 0.903089987 = also 3*lg(2)
lg(9) = 0.954242509 = also lg(3)+lg(3)
lg(10)= 1
-------------------
10^0 = 1
10^0.1= 1.258925412
10^0.2= 1.584893192
10^0.3= 1.995262315
10^0.4= 2.511886432
10^0.5= 3.162277660
10^0.6= 3.981071706
10^0.7= 5.011872336
10^0.8= 6.309573445
10^0.9= 7.943282347
10^1 =10
-------------------

Let's look how a division works.
ie:

546/35 = (500+40+6)/35 = 500/35 + 40/35 + 6/35 =
= 14.2857 + 1.142857 + 0.17142857 = 15.6
------------------------------------------------------------
546/35 = 10^(lg(500)-lg(35)) + 10^(lg(40)-lg(35)) + 10^(lg(6)-lg(35))
-------------------------------------------------------------
546/35 = 10^(lg(100)+lg(5)-lg(35)) + 10^(lg(10)+lg(4)-lg(35)) + 10^(lg(6)-lg(35))
-------------------------------------------------------------
546/35 = 10^(2+lg(5)-lg(35)) + 10^(1+lg(4)-lg(35)) + 10^(lg(6)-lg(35))
-------------------------------------------------------------
first calculate lg(35)

lg(35) = ln(35)*1/ln(10)
but calculation will need divide,
so we use look-up, see later following

digit# 2 1 0
lg(digit) 0.698970004 0.602059991 0.778151250
lg(35) -1.544068044 -1.544068044 -1.544068044
------------ ------------- -------------
1.154901960 0.057991947 -0.765916794

after 10^x:
14.2857142857 + 1.142857142857 + 0.17142857142857 = 15.599999999..
==============
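(The same walk-through as a C sketch in doubles, with log10()/pow() standing
in for the lg(1..9) table and the 10^x routine; each dividend digit d at
position p contributes 10^(p + lg(d) - lg(35)):)

#include <stdio.h>
#include <math.h>    /* log10/pow stand in for the table look-ups */

int main(void)
{
    int digits[] = {5, 4, 6};            /* 546, most significant digit first */
    double lg_div = log10(35.0);
    double quotient = 0.0;

    for (int i = 0; i < 3; i++) {
        int pos = 2 - i;                 /* power of ten of this digit */
        double e = pos + log10((double)digits[i]) - lg_div;
        quotient += pow(10.0, e);        /* 14.2857... + 1.142857... + 0.171428... */
    }
    printf("546/35 via lg = %.10f\n", quotient);   /* ~15.6 */
    return 0;
}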
-------------------
10^x function example:
(using previous example value here lg(25.0) = 1.397940009)

the signed integer part adjusts the decimal point only.


the fraction produces a number spanning several digits

y= x*ln(10) ; 0.397940009 * 2.3026.. = 0.916290732

10^x = 1 + y + [y^2 *1/2!]+[y^3*1/3!]+[ ]

(z=y temporary, so y^n is just a multiplication in every element)

10^x = 1 + y + [z(=z*y) *1/2!]+[z(=z*y) *1/3!]+[ ]

n z 1/n! sum
1
1 0.916290732 0.916290732
2 0.839588705 * 0.5 0.419794353
3 0.769307349 * 0.166666666 0.128217892
4 0.704909194 * 0.041666666 0.029371216
5 0.645901761 * 0.008333333 0.005382515
6 0.591833798 * 0.001388888 0.000821991
7 0.542291824 * 0.000198413 0.000107598
8 0.496896972 * 0.000024802 0.000012324
9 0.455302090 * 0.000002756 0.000001255
-----------
after nine iterations already 2.499999874 *10^1 ~ 25.00

If the decimal point is adjusted in front of this,
the 'sum' can be used for all other digits and are "the quotient" already.
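(The 10^x step above as a C sketch in doubles: split off the integer part --
it only moves the decimal point -- and run the e^y series with a 1/n! table
so the loop is multiply/add only. The constants are the ones from the
example:)

#include <stdio.h>

#define TERMS 20

int main(void)
{
    /* 1/n! table, built once */
    double inv_fact[TERMS + 1];
    inv_fact[0] = 1.0;
    for (int n = 1; n <= TERMS; n++)
        inv_fact[n] = inv_fact[n - 1] / n;

    double x = 1.397940009;              /* lg(25) from the example */
    int    ip = (int)x;                  /* integer part: decimal-point shift */
    double y  = (x - ip) * 2.302585093;  /* fraction * ln(10) */

    double z = 1.0, sum = 1.0;           /* e^y = 1 + y + y^2/2! + y^3/3! + ... */
    for (int n = 1; n <= TERMS; n++) {
        z *= y;                          /* y^n by one multiply per term */
        sum += z * inv_fact[n];
    }
    printf("10^%.9f = %.9f * 10^%d\n", x, sum, ip);   /* ~2.5 * 10^1 = 25 */
    return 0;
}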

-------------
another way would be look-up using the lg(2..9)-table:
again example lg(25)
1.397940009 ;
extract integer 1 ;10^1 [MSD position = #0]
0.397940009
look-up [>=lg(2)] 0.301029996 ; [first digit = 2#.]
[ <lg(3)] 0.477121255
------------

0.397940009
sub lg(2) -0.301029996
------------
0.096910013 (remaining factor to "20")

(this is lg(1.25), due to lg(20)+lg(1.25) = lg(20 * 1.25) = lg("25"))

-1 [/10]
------------
-0.903089987 look-up: = 1/8
20 * (1 + 1/8) = 20 *1.25 = "25"
sub lg(1/8) -(-0.903089987)
-------------
~0 continue with next /10 if not zero.
(the remaining factor to "25")

first digit is done, but we don't like the divide [20/8];
even though the divisor can be just an integer 2..9,
we may use the 1/n-table and multiply instead.
------------------
now the example lookup lg(35) in a lg(1..9)-table? :

first digit ="3" *10^1

look-up lg(3) 0.477121255
*10^1 +1
------------
1.477121255 ; this is lg(30)
next digit ="5"
look-up lg(5) 0.698970004

get the relation 30/5 lg(30)-lg(5) is 35/5 - 1
1.477121255
-0.698970004
------------
0.778151251 look-up: (>=6)


now adjust lg (30) to become lg(30)+lg(35/30) [30*(1+1/6)=35]
1.477121255 ; this is lg(30)
lg(6) -0.778151250 ; /6
lg(7) +0.845098040 ; *7 (1+1/6 = 7/6)
------------
lg (35) = 1.544068045
========================
Be aware that many sub-cycles are needed if there are more digits.
__
wolfgang


bv_schornak

unread,
Feb 23, 2003, 2:23:50 PM2/23/03
to
First: sorry for the long delay - at the moment I'm running out
of time...

There is a big flaw in my concept. I just thought that I could
add digits of any base, then simply subtract the base (e.g. 4) until
it becomes a valid digit - total nonsense, of course,
but I had to do some calculations on paper to get the point...

Thinking about all this stuff, I decided to recode my routines.
All of the current overhead for the base handling is removed...
Also, I consider to use packed BCDs, see below.

- - -

wolfgang kern wrote:

>For table- and constant- creation a slow CMP/SUB divide is enough,
>But if we look at the
>
>ln(x) = ±[x^n /n] n=2 to 256
>
>we just need one small table for the 1/n values 1/2 ... 1/256
>to calculate log during 'runtime-divisions' without any divide:
>
>ln(x) = ± [x^n * (1/n)]
>
>and another small table for the 1/n! values for e^x calculation.
>
>e^x =1+[x^n /n!] => 1+[x^n * (1/n!)]
>

Needs two tables à 64 kB - I prefer the solution below!

>But be aware of faculty grows extremely fast.
>ie: 66! = 5.44...*10^92
>so for my 256-bit (78 digit) the maximum integer is 57!
>
>In fact you don't need to cycle the calculation up to 256 times.
>I stop the 'infinite' loop when the element-result is already smaller than the desired precision (compare desired LSB# with results
>MSB#).
>

At the moment - my calculations end, if the current operation
produces an over- or underflow condition...

>| Ok, then we castrate the BCD calculator to do *only* 10^x ...?
>| (Would be the main task of the calculator, anyway.)
>
>The BCD's digit# already points out the 10^x value.
>But if I think of the 1/x (better 10^digits/x) conversion,
>a table with just the power10 values gives the ln(dividend) already.
>

As I told, I started to redesign my routines. As you said, the
entire code which handles base 2...16 should be removed. Slows
down the entire calculation loops. Also, it would be faster if
I use packed BCDs or hexadecimals rather than nibbles...

>Let's look how a division works.
>ie:
>123/4 = (100+20+3)/4 = 100/4 + 20/4 + 3/4 = 25 + 5 + 0.75 = 30.75
>
>or--------------------------------------------------------
>123/4 = e^(ln(100)-ln(4)) + e^(ln(20)-ln(4)) + e^(ln(3)-ln(4))
>----------------------------------------------------------
> 4.605170186 2.995732274 1.098612289
> ln(4) -1.386294361 -1.386294361 -1.386294361
> -------------------------------------------
> 3.218875825 1.609437912 -0.287682072
>
>now e^x: 25 + 5 + 0.75
> ===========================================
>looks good, but this is math-masturbation till now.
>We still need a table with 999 entries for this three digits.
>So it's still equal to "ln(123)-ln(4)".
>
>The quest for a convertible minimal table is
> convert ln(100) into ln(1) +log10(100)*ln(10) this works due "1".
> ln (20) into ln(2) +log10(10)*ln(10) here are some needs.
>
>BUT
> in terms of log10:
>log10(200)= 2.301029996
>log10(20) = 1.301029996
>log10(2) = 0.301029996
>

Looks like lg(2 * 10 E n) = n.301029996 - right?

As we have the exponents of both operands stored in a separate
place - we just have to take them as the integer part, and the
fractions are stored in the table... If it really is that easy
(seeing the "0.75" - it seems, that it isn't *that* easy)...

>-----interested in some homework exercise? -------
>Are you ready do give this digit-based 10^x a try ?
>hint:
>somehow opposite to the above.
>can the e^x formula help here?
>
>You're already able to create the nine log10(2..9) table-entries?
>
>
>---------------------------------------------------
>this minimal table can also be extended to cover 001 to 999
>or any other comfortable size.
>

No time for homework at the moment... I think I will print out
all important parts of our postings, so it is easier to follow
the red thread. (Very time consuming to search through several
postings to find the right formulas and descriptions...)

Homework is not forgotten - you *will* get 256 digit tables as
soon as possible!

Ahem - 999 values are about 256 kB...


[Prime numbers...]

>---------------------------------------------------
>| >| >how large will the period be for primes > 250 digits?
>| >| Is there an answer? As you say, it depends on the divisor (-> 111 vs. 31)...
>
>Sorry, 111 isn't a prime! is a multiple of 3 , obvious :)
>

37 * 3, of course. I didn't calculate it - I just compared the
amount of digits you gave for the periodic part. ;)

>Some different ways to find primes are known,
>but none of them says anything about the 1/prime period-size.
>Would be a interesting side-job to figure out a formula.
>But in our case, it would be enough to list all primes up to needed,
>(a table of primes may be helpful for many other math-jobs too)
>and then check out the period-sizes and store it anywhere.
>

Reading the document Jerry gave us the link, it's astonishing,
what working with numbers reveals...

The rules from the document might be of use for the calculator,
too. As I've seen, there is a simple way to calculate numbers
with endless precision - using a simple calculator...

>Yes, we got different I/O-targets (would be boring if all use equal stuff),
>so let's just talk about the calculations.
>

Everybody sees a problem from a different view, the solutions
will differ, too...


[Lots of code snipped...]

>If you take closer look on my code it supports any (byte-aligned)
>numeric types with an early 'done' and any digit overflow is covered.
>So it can add a one BYTE variable to a 32-byte variable and loops
>just as long as carry-over needs to.
>(I support var-types from 'signed byte' to 'unsigned 32 byte' integers
> with optional up to 32 bit decimal valued exponents).
>

Operand sizes are covered by my code as well as all the other
stuff. Due to my "weird" idea to handle numbers of different
bases together with my "nibble concept", your routine is much
more sophisticated... ;)

>As you use unpacked BCD, you may work with byte-offsets as well.
>
>I could use dwords for variables which are sized in multiples of 32-bit.
>
>Forget about the FPU (it's slow and rounding overhead is awful),
>perhaps some of the 64/128-bit mm-register-'Boolean's may be useful.
>

What about

...
MOV ECX, 16
MOV EDI, offset [lsdOP1 - 8]
MOV ESI, offset [lsdOP2 - 8]
MOV EBX, offset [lsdRES - 8]
0:FBLD EDI
FBLD ESI
FADD
FBSTP
SUB EDI, 0x0000000A
SUB ESI, 0x0000000A
DEC ECX
JNE 0b
...

The operands are split up into 16 chunks à 8 byte + 2 leading
zero bytes. It only needs 16 loops for the addition. Load and
store are vector path, add is direct path...

Because only 16 of the digits are used, there are no problems
with overflow conditions. The 17th digit is added to the next
LSD, if we convert the split up operands back to a number, or
I expand the above snippet. Checking both operands before the
addition loop will save unnecessary load / store operations,
et cetera...
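(Not the FPU/BCD variant -- just a C sketch of the same chunk-and-carry
bookkeeping, using 32-bit limbs that each hold 8 decimal digits; the names
and layout are made up for this example:)

#include <stdio.h>
#include <stdint.h>

#define LIMBS 32                 /* 32 limbs * 8 digits = 256 digits */
#define BASE  100000000u         /* one limb = 8 decimal digits */

/* limb[0] is the least significant chunk; the carry out of one chunk
   (the "17th digit" of the FPU version) is at most 1 and is simply
   added into the next chunk. */
static void add256(const uint32_t *a, const uint32_t *b, uint32_t *res)
{
    uint32_t carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint32_t s = a[i] + b[i] + carry;   /* at most 199999999, no 32-bit overflow */
        carry = (s >= BASE);
        res[i] = carry ? s - BASE : s;
    }
    /* a carry left here would mean the 256-digit result overflowed */
}

int main(void)
{
    uint32_t a[LIMBS] = {99999999u, 99999999u};   /* 9999999999999999 */
    uint32_t b[LIMBS] = {1u};
    uint32_t r[LIMBS];
    add256(a, b, r);
    printf("%08u %08u %08u\n", (unsigned)r[2], (unsigned)r[1], (unsigned)r[0]);
    /* prints 00000001 00000000 00000000, i.e. 10^16 */
    return 0;
}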

>As our common target seems to be an integer solution rather than
>the 'so beloved' floating-point, where the values are always 2>x>1
>and a lot of rounding procedures are needed to overcome the fact that
>decimal 0.1 is a periodic figure with 2^n exponents,
>we'd better check and continue on our path.
>I'm sure our final solutions will work somehow similarly,
>but with an integer mantissa......
>

That was the intention behind my calculator - 0.1 should *be*
0.1 rather than 0.099...25! ;)

Having 256 digits, there is no need for a floating point. The
floating point only exists in the exponent, nowhere else!

Just tell me, what you think about the new version of the add
loop. I start coding, if I hear your opinion, maybe there are
better ways. Packed BCDs are no solution, if the calculations
are done via ADD / ADC / SUB / SBC.

DAA / DAS are vector path, and we need them for each byte, so
unpacked nibbles are a better choice? The use of hexadecimals
is ok, but we have to convert them twice. It's a challenge to
find out the best way to have both: good performance and high
precision...

bv_schornak

unread,
Feb 23, 2003, 3:18:44 PM2/23/03
to
Jerry Coffin wrote:

>Isn't it though? There are a lot more interesting little bits like
>that. Consider:
>
>1/11 = .0909090909
>2/11 = .1818181818
>3/11 = .2727272727
>4/11 = .3636363636
>
>and so on --- each time you increase the numerator, the first number in
>the repeating pair increases by one and the second decreases by one.
>Just FWIW, 37 as a denominator acts much the same way.
>

They are all multiples of 9, so it should be that way... ;)

1 | 9 = 10 - 1
2 | 18 = 20 - 2
3 | 27 = 30 - 3
4 | 36 = 40 - 4
5 | 45 = 50 - 5
6 | 54 = 60 - 6
7 | 63 = 70 - 7
8 | 72 = 80 - 8
9 | 81 = 90 - 9

This row works for all other digits, too. But there we have to
subtract numbers > 9 - so rows of smaller digits have at least
one 1st digit which occurs twice (or more often, if the digits
are smaller, because the subtracted numbers grow).

n * m = ( 10 * n ) - ( n * ( 10 - m ) )

(Hope the formula is ok, I'm no math-crack...)

>>I guess, that the string in your signature is based on mathematical theories?
>>
>
>Naw...nothing so philosophical. When I was in college, I first heard
>the line about "gravity is imaginary -- really the world just sucks" --
>my response was something like "everything is imaginary", and somebody
>else asked what was doing the imagining if it was all imaginary. It
>took me a while, but I came up with what's now my tagline as a response
>to that. My favorite response was on Fidonet when somebody (after I'd
>used this tagline for five years or so) had a tagline something like:
>
>Jerry Coffin changes tagline! World to end soon!
>

Who knows? Everything is possible, it just needs a sufficient
amount of imagination... ;)

wolfgang kern

unread,
Feb 23, 2003, 6:28:01 PM2/23/03
to

"Bernhard" replied:

| First: sorry for the long delay - at the moment I'm running out
| of time...

Meanwhile I tried to come around with the minimal table solution,
See parallel post.

| There is a big flaw in my concept. I just thought, that I could
| add digits of any base, then only subtract the base (e.g. 4) as
| long as it becomes a valid digit - totally nonsense, of course,
| but I had to do some calculations on paper to get the point...
|
| Thinking about all this stuff, I decided to recode my routines.
| All of the current overhead for the base handling is removed...
| Also, I consider to use packed BCDs, see below.
|
| - - -

| >we just need one small table for the 1/n values 1/2 ... 1/256
| >to calculate log during 'runtime-divisions' without any divide:
| >ln(x) = ± [x^n * (1/n)]
| >and another small table for the 1/n! values for e^x calculation.
| >e^x =1+[x^n /n!] => 1+[x^n * (1/n!)]

| Needs two tables à 64 Kb - I prefer the solution below!

But you'll see later the need for 1/x and 1/x! again.

|... stop the 'infinite' loop when the element-result is already smaller
| than the desired precision (compare desired LSB# with results >MSB#).

| At the moment - my calculations end, if the current operation
| produces an over- or underflow condition...

??

| Looks like lg(2 * 10 E n) = n.301029996 - right?

Yes for positive n. See for negative values in my parallel post.


| As I told, I started to redesign my routines. As you said, the
| entire code which handles base 2...16 should be removed. Slows
| down the entire calculation loops. Also, it would be faster if
| I use packed BCDs or hexadecimals rather than nibbles...

The fastest and memory-friendly calculation will be with 2^n binaries,
(hexadecimal is just a short-cut view of it),
but if your target is fast updated display, then unpacked BCD is best.

[...Let's look how a division works]
example 123/4 = 30.75

| As we have the exponents of both operands stored in a separate
| place - we just have to take them as the integer part, and the
| fractions are stored in the table... If it really is that easy
| (seeing the "0.75" - it seems, that it isn't *that* easy)...

A divide will either need an additional fractions-field
or you multiply the dividend by the power10 of the divisor before the
division, to get an integer result.


| No time for homework at the moment... I think I will print out
| all important parts of our postings, so it is easier to follow
| the red thread. (Very time consuming to search through several
| postings to find the right formulas and descriptions...)
|
| Homework is not forgotten - you *will* get 256 digit tables as
| soon as possible!

| Ahem - 999 values are about 256 Kb...

If BCD

| [Prime numbers...]

| Reading the document Jerry gave us the link, it's astonishing,
| what working with numbers reveals...

| The rules from the document might be of use for the calculator,
| too. As I've seen, there is a simple way to calculate numbers
| with endless precision - using a simple calculator...

Yes, a good link indeed.

| .different I/O-targets

| Everybody sees a problem from a different view, the solutions will differ,
| too...

Sure!
I hope our two versions will 'work' finally.....

| [Lots of code snipped...]

stored on disk anyway....

| ...your routine is much more sophisticated... ;)

Oh!

| What about
| ...
| MOV ECX, 16
| MOV EDI, offset [lsdOP1 - 8]
| MOV ESI, offset [lsdOP2 - 8]
| MOV EBX, offset [lsdRES - 8]
| 0:FBLD EDI
| FBLD ESI
| FADD
| FBSTP
| SUB EDI, 0x0000000A
| SUB ESI, 0x0000000A
| DEC ECX
| JNE 0b
| ...

it would produce a FPU-stack overflow after 4 iterations? :)
ok, I see what you mean....

| The operands are split up into 16 chunks à 8 byte + 2 leading
| zero bytes. It only needs 16 loops for the addition. Load and
| store are vector path, add is direct path...
|
| Because only 16 of the digits are used, there are no problems
| with overflow conditions. The 17th digit is added to the next
| LSD, if we convert the split up operands back to a number, or
| I expand the above snippet. Checking both operands before the
| addition loop will save unnecessary load / store operations,
| et cetera...

In terms of code-size it's a beauty.
I already replaced all FBLD/FBSTP instructions in my programs
with the 10^x semi-log table converter.
As the FPU uses a complete 18 digit BIN<=>BCD conversion w/o LUT,
and a never correct rounding feature,
and the instructions are vectored too, my byte by byte method is
much faster, even some more bytes and a table (22Kb for 78digits)
are needed.

If you use multiple of 32 bit variables only,
this will be quite faster than my byte by byte way.

| >As our common target seems to be an integer solution rather than
| >the 'so beloved' floating-point, where the values are always 2>x>1
| >and a lot of rounding procedures are needed to overcome the fact that
| >decimal 0.1 is a periodic figure with 2^n exponents,
| >we'd better check and continue on our path.
| >I'm sure our final solutions will work somehow similarly,
| >but with an integer mantissa......

| That was the intention behind my calculator - 0.1 should *be*
| 0.1 rather than 0.099...25! ;)

Exact!

| Having 256 digits, there is no need for a floating point. The
| floating point only exists in the exponent, nowhere else!

Except in case of a partial division-result?

| Just tell me, what you think about the new version of the add
| loop. I start coding, if I hear your opinion, maybe there are
| better ways. Packed BCDs are no solution, if the calculations
| are done via ADD / ADC / SUB / SBC.

| DAA / DAS are vector path, and we need them for each byte, so
| unpacked nibbles are a better choice? The use of hexadecimals
| is ok, but we have to convert them twice. It's a challenge to
| find out the best way to have both: good performance and high
| precision...

Forget DAA/DAS and similar, they are awful slow MUL/DIV stories.

As above,..
But why convert twice, I store and calculate all numeric figures
including the 10^x valued exponent in the binary form.
So I've chosen small memory, fast calculation,
but not the fastest display/print-routines,
due already touching the physical PCI-bus-limit (66 MHZ)
with unformatted graphics-text.
And my next PC will be faster anyway.

I think you should separate storage/calculation decision from
display/print format decision.
There is no reason to calculate in the display-format.
As mentioned above, this depends on the main target
[display speed vs. memory-image and calculation-speed].

1024 bit binaries (128 byte) gives a precision of 308 digits!
__
wolfgang

bv_schornak

unread,
Feb 26, 2003, 3:49:41 PM2/26/03
to
wolfgang kern wrote:

>Meanwhile I tried to come around with the minimal table solution,
>See parallel post.
>

Now waiting in my "drafts" folder... ;)

>|... stop the 'infinite' loop when the element-result is already smaller
>| than the desired precision (compare desired LSB# with results >MSB#).
>
>| At the moment - my calculations end, if the current operation
>| produces an over- or underflow condition...
>
>??
>

That is, whenever the offset runs out of the buffer's borders, we
don't need to go further... (see below)

>| As I told, I started to redesign my routines. As you said, the
>| entire code which handles base 2...16 should be removed. Slows
>| down the entire calculation loops. Also, it would be faster if
>| I use packed BCDs or hexadecimals rather than nibbles...
>
>The fastest and memory-friendly calculation will be with 2^n binaries,
>(hexadecimal is just a short-cut view of it),
>but if your target is fast updated display, then unpacked BCD is best.
>

Display was a minor reason (-> storage format is packed BCD). The
calculator was designed in a way I can handle the operations. ;)

>[...Let's look how a division works]
> example 123/4 = 30.75
>
>| As we have the exponents of both operands stored in a separate
>| place - we just have to take them as the integer part, and the
>| fractions are stored in the table... If it really is that easy
>| (seeing the "0.75" - it seems, that it isn't *that* easy)...
>
>A divide will either need an additional fractions-field
>or you multiply the dividend by the power10 of the divisor before the
>division, to get an integer result.
>

Hm, maybe we talk about different things here.

In the current concept - operations with floating points are done
by manipulating the exponents and the offsets within the buffers.

This doesn't work for add and sub, if the difference between both
exponents exceeds 256. If so, the result is the *larger* operand,
because the smaller operand's 1st digit is located behind the LSD
of the buffer - no calculation is needed.

For mul and div we need a control logic, which shifts both input
operands to the right places (including the correction of the ex-
ponents), so we can calculate something.

EXAMPLE:

OP1 1.00 E 250
OP2 1.00 E -200

OP1 + OP2 = 1.00 E 250 (next "1" 450 digits behind the FP!)
OP1 - OP2 = 1.00 E 250 (next "1" 450 digits behind the FP!)
OP1 * OP2 = 1.00 E 50
OP1 / OP2 = 1.00 E 450

>| The rules from the document might be of use for the calculator,
>| too. As I've seen, there is a simple way to calculate numbers
>| with endless precision - using a simple calculator...
>
>Yes, a good link indeed.
>

Thought of Dr. Math's text:

>You can use an ordinary 10-digit calculator to do the divisions you
>want to do, too. You want to find the decimal expansion of 1/N for
>some values of N less than, say, 10000000 = 10^7. Enter 1000000000
>into your calculator. That is your first dividend, D1. Divide by N.
>The integer part of the quotient is the first part of the decimal
>expansion. Call that Q1. Then enter D1 and subtract Q1*N. That will
>give you the remainder R1. Now enter R1 followed by as many 0's as
>will fit in your calculator. That is your next dividend D2. Divide by
>N. The integer part of the quotient is the next part of the decimal
>expansion. Call that Q2. Then enter D2 and subtract Q2*N. That will
>give you the remainder R2. Continue this process as long as you like.
>
>Example: Find the decimal expansion of 1/127. N = 127.
>D1 = 1000000000. D1/N = 1000000000/127 = 7874015.748...
> so Q1 = 7874015, and then D1 - Q1*N = R1 = 95.
>D2 = 9500000000. D2/N = 9500000000/127 = 74803149.61...
> so Q2 = 74803149, and then D2 - Q2*N = R2 = 77.
>D3 = 7700000000. D3/N = 7700000000/127 = 60629921.25...
> so Q3 = 60629921, and then D3 - Q3*N = R3 = 33.
>D4 = 3300000000. D4/N = 3300000000/127 = 25984251.97...
> so Q4 = 25984251, and then D4 - Q4*N = R4 = 123.
>D5 = 1230000000. D5/N = 1230000000/127 = 9685039.370...
> so Q5 = 9685039, and then D5 - Q5*N = R5 = 47.
>D6 = 4700000000. D6/N = 4700000000/127 = 37007874.01...
> so Q6 = 37007874.
>The decimal expansion we have found so far is
>
>1/127 = 0.007874015748031496062992125984251968503937 007874...
>

Looks like an accurate and fast way to do a division using 32-bit
DIV and MUL instructions. We don't need to calculate a remainder,
because EDX already holds it. Flaw: N is limited to 32 bit. OTOH,
we could calculate the log-tables using this method...
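(A C sketch of exactly that: the Dr. Math steps with the machine divide
doing the work, nine digits of 1/N per division. N has to fit the divide,
which is the flaw mentioned:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t N = 127;
    uint64_t r = 1;                       /* numerator of 1/N */

    printf("1/%u = 0.", (unsigned)N);
    for (int step = 0; step < 6; step++) {
        uint64_t d = r * 1000000000ull;   /* remainder with nine zeroes appended */
        printf("%09llu", (unsigned long long)(d / N));   /* next nine digits */
        r = d % N;                        /* what DIV leaves in EDX, conceptually */
    }
    printf("...\n");    /* 0.007874015748031496062992125984251968503937... */
    return 0;
}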

>Sure!
>I hope our two versions will 'work' finally.....
>

I hope, I find the right format... ;)

>| ...your routine is much more sophisticated... ;)
>
>Oh!
>

Of course (you've done this before, I do it the first time)! :)

>| 0:FBLD EDI
>| FBLD ESI
>| FADD
>| FBSTP
>| SUB EDI, 0x0000000A
>| SUB ESI, 0x0000000A
>| DEC ECX
>| JNE 0b
>

>it would produce a FPU-stack overflow after 4 iterations? :)
>ok, I see what you mean....
>

My book says: "FADD - adds ST0 to ST1, the result overwrites ST1.
ST0 is popped, so the result now is stored in ST0."

Because there is a FBSTP after the FADD, the FPU is in the *same*
state as it was before entering label 0 - except ST0 and ST7 have
another contents now. What should cause the stack overflow after
the 4th run?

BTW - loading could be optimized to pre-load all 8 FPU registers,
then load the next 2 within the routine. I should read the Athlon
manuals again... ;)

>In terms of code-size it's a beauty.
>I already replaced all FBLD/FBSTP instructions in my programs
>with the 10^x semi-log table converter.
>As the FPU uses a complete 18 digit BIN<=>BCD conversion w/o LUT,
>and a never correct rounding feature,
>and the instructions are vectored too, my byte by byte method is
>much faster, even some more bytes and a table (22Kb for 78digits)
>are needed.
>

I never checked any of my routines until now - I guess, they need
a lot of cycles... :(

Ok, next try... ;) Here's an "optimized" version of my addition:

.globl _addOPS
_addOPS:
pushl %ecx
pushl %edx
movl $0x0100,%ecx
xorl %edx,%edx
xorl %eax,%eax
movb _D_BASE,%dl
ADD0:movb -0x0100(%ebx,%ecx,1),%al # AL = digit OP1
cmpb $0xFF,%al # OP1 end?
jne 0f
xorb %al,%al # yes -> zero
incb %dh # stop += 1
0:cmpb $0xFF,-0x0100(%esi,%ecx,1) # OP2 end?
je 1f
addb -0x0100(%esi,%ecx,1),%al # AL = digit RES
jmp 2f
1:incb %dh # stop += 1
2:addb %ah,%al # add carry
xorb %ah,%ah
3:cmpb %dl,%al # valid digit?
jb 4f
subb %dl,%al # no, correct it
incb %ah
jmp 3b
4:cmpb $2,%dh
je 6f # reached end of both OPs

movb %al,-0x0100(%edi,%ecx,1) # store result digit
xorb %dh,%dh
decl %ecx
je 5f # loop through OPs
jmp ADD0
5:movl $0x000000FF,%eax # ERROR - OVERFLOW
jmp 8f
6:cmpw $0,%ax
je 7f
movb %al,-0x0100(%edi,%ecx,1) # store result digit
cmpb $0,%ah # carry ?
je 7f
movb %ah,-0x0100(%edi,%ecx,1) # store result digit
7:xorl %eax,%eax # no error occured
8:popl %edx
popl %ecx
ret

I replaced vector path (lodsb, stosb, et cetera) with direct path
operations. The loop has 24 DP instructions (which may execute in
less than 24 cycles (-> parallel, pipes)). If we loop through all
256 digits, the result is returned after 6164 cycles. Probably my
subtraction routine will not be much longer. The base checking is
still implemented (I had forgotten, that the routines are used in
my conversion routines, too)...

>If you use multiple of 32 bit variables only,
>this will be quite faster than my byte by byte way.
>

If! I'm still not sure, which way is the best. See below...

>| Having 256 digits, there is no need for a floating point. The
>| floating point only exists in the exponent, nowhere else!
>
>Except in case of a partial division-result?
>

The buffers always hold integer numbers. The floating point only
is determined in the corresponding data set - where exponents and
signs are stored, and the current power of 10 for the calculation
in progress *could* be stored.

The floating point is nothing else than a "marker". It stands for
a defined power of 10 within this number (the both operands *are*
integers). If we shift an operand left, then the exponent must be
decreased. If we shift it right, the exponent must be increased.

>Forget DAA/DAS and similar, they are awful slow MUL/DIV stories.
>
>As above,..
>But why convert twice, I store and calculate all numeric figures
> including the 10^x valued exponent in the binary form.
>So I've chosen small memory, fast calculation,
> but not the fastest display/print-routines,
> due already touching the physical PCI-bus-limit (66 MHZ)
> with unformatted graphics-text.
> And my next PC will be faster anyway.
>
>I think you should separate storage/calculation decision from
> display/print format decision.
>There is no reason to calculate in the display-format.
>As mentioned above, this depends on the main target
> [display speed vs. memory-image and calculation-speed].
>
>1024 bit binaries (128 byte) gives a precision of 308 digits!
>

Ok - let's talk about the used format!

Binary numbers in computers are very easy to handle. No reason to
convert them, all math operations are supported on CPU level. The
read and write access can be handled in the fastest way supported
by the hardware. Sounds great. But is it really *that* great?

Here are some disadvantages:

First, we need conversion routines to display the numbers, so the
average human can read the output. Not a big problem, we have our
routines for this purpose - same for input.

Now let's do some math. Calculations like add and sub are done by
reading a dword of the 1st operand, then adding the corresponding
dword of the 2nd operand to it, finally store the result. This is
no problem, and it is the fastest way we can do it.

But what about mul and div? In theory this are simple shift/add -
shift/sub operations. But with this "simple shift/xxx" operations
we start to become contra-productive. It isn't shifting of bytes,
words or dwords, it is shifting of *bits*! Shifting of bits might
be okay for 32 bit operations, but is it still a good idea, if we
have to shift 1024 bit? And mul or div surely need a lot of shift
(and bit test) operations...

Might be, that the overall performance of my BCD concept isn't in
the list of the fastest or best math solutions. But all the above
mentioned shifting is done by adding | subtracting one value to |
from an index register. Much easier to do...
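(A toy C sketch of that "shift by index" idea on unpacked decimal digits --
the partial product for multiplier digit i is added starting at result
offset i, so the only "shift" is index arithmetic. The layout is made up
for the example: least significant digit at index 0, unlike the buffers
described earlier:)

#include <stdio.h>
#include <string.h>

#define NDIG 32

/* res must hold 2*NDIG digits; a[] and b[] hold NDIG digits each, LSD first. */
static void mul_digits(const int *a, const int *b, int *res)
{
    memset(res, 0, 2 * NDIG * sizeof *res);
    for (int i = 0; i < NDIG; i++) {       /* one partial product per b-digit */
        int carry = 0;
        for (int j = 0; j < NDIG; j++) {
            int t = res[i + j] + a[j] * b[i] + carry;   /* "shifted" by index i */
            res[i + j] = t % 10;
            carry = t / 10;
        }
        res[i + NDIG] += carry;            /* carry past the partial product */
    }
}

int main(void)
{
    int a[NDIG] = {0}, b[NDIG] = {0}, r[2 * NDIG];
    a[0] = 4; a[1] = 3; a[2] = 2; a[3] = 1;   /* 1234, LSD first */
    b[0] = 6; b[1] = 5;                       /* 56 */
    mul_digits(a, b, r);
    for (int i = 2 * NDIG - 1; i >= 0; i--) putchar('0' + r[i]);
    putchar('\n');                            /* ...00069104 = 1234 * 56 */
    return 0;
}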

I think, that I have to do some calculations, how many cycles are
needed for each of the operations, so we can compare both methods
in terms of execution time and performance.


Maybe we should develop a base for the calculator, first:

I think of a definition which allows to expand or shrink the size
of the operands to the needed amount of bytes (= multiples of 2).
The adjustment should be easy - e.g. by changing a counter.

All needed information - exponent, the FP position, the amount of
digits, the signs, et cetera - should be stored at a defined area
in memory (outside the buffers).

Multiplication should use "shifting" via the offsets, rather than
real bit shifts - my method could be rewritten to use hexadecimal
numbers. Would be shorter, because we can skip the base testing.

Division should use the fastest way. If it can be done with a few
log-tables, then we should use this way. If it is faster to do it
the simple way, then we should prefer to use it and implement the
higher math functions with external routines.

I think, that we have invested a lot of time and work until now -
so it should lead to some useful results. But I am still not sure
about the format and some basic definitions.

wolfgang kern

unread,
Feb 27, 2003, 6:22:38 PM2/27/03
to

"Bernhard" wrote:

| >|... stop the 'infinite' loop when the element-result is already smaller
| >| than the desired precision (compare desired LSB# with results >MSB#).

| >| At the moment - my calculations end, if the current operation
| >| produces an over- or underflow condition...

| >??

| That is, whenever the offset runs out of the buffer's borders, we
| don't need to go further... (see below)

what will a 100000.... (1=MSD max.)
+900000....
produce then?
I either adjust the exponent and round off the LSD,
or signal an overflow condition if no destination exponent is defined.


| >..for. fast updated display, then unpacked BCD is best.



| Display was a minor reason (-> storage format is packed BCD). The
| calculator was designed in a way I can handle the operations. ;)

| >| As we have the exponents of both operands stored in a separate
| >| place - we just have to take them as the integer part, and the
| >| fractions are stored in the table... If it really is that easy
| >| (seeing the "0.75" - it seems, that it isn't *that* easy)...

| >A divide will either need a additional fractions-field
| >or you multiply the dividend by the power10 of the divisor before the
| >division, to get an integer result.

| Hm, maybe we talk about different things here.

I think the result of 1/7 can be interpreted and stored as:

142857 E-6 (period=6) 1E7/7
where you need to know the period-size in advance,
or make a copy of all remainders and compare...

or extended as
142857142857142857.......1428 E-256
1*10^(256 + divisor.digits - dividend.digits) /7
I would start with the last option and have the first method for
later append (more correctly: precede).

| In the current concept - operations with floating points are done
| by manipulating the exponents and the offsets within the buffers.

| This doesn't work for add and sub, if the difference between both
| exponents exceeds 256. If so, the result is the *larger* operand,
| because the smaller operand's 1st digit is located behind the LSD
| of the buffer - no calculation is needed.

In my case the selected rounding precision will determine if a

1000 - 1E-4 produce "1000" or "9999999 E-4"


| For mul and div we need a control logic, which shifts both input
| operands to the right places (including the correction of the
| exponents), so we can calculate something.

As mul and div may produce more digits than needed,
I use MUL up to all digits and then round to desired precision,
and DIV up to desired precision and then round the resulting LSD.
That's the reason for my buffers are twice as large as variables.
This allows almost correct rounding (is programmable anyway).

A shift in case of mul/div isn't necessary,
add the MSD-number (10^#) to exponent is all to do,
but something similar to shift (index-offsets) is needed for add/sub.


| EXAMPLE:
|
| OP1 1.00 E 250
| OP2 1.00 E -200
|
| OP1 + OP2 = 1.00 E 250 (next "1" 450 digits behind the FP!)
| OP1 - OP2 = 1.00 E 250 (next "1" 450 digits behind the FP!)
| OP1 * OP2 = 1.00 E 50
| OP1 / OP2 = 1.00 E 450

Yes, but be aware that the truncated digits may contain ".999.."



| >| The rules from the document might be of use for the calculator,
| >| too. As I've seen, there is a simple way to calculate numbers
| >| with endless precision - using a simple calculator...
| >
| >Yes, a good link indeed.


| Thought of Dr. Math's text:

[snip, I copied the whole doc. as well ]



| Looks like an accurate and fast way to do a division using 32-bit
| DIV and MUL instructions. We don't need to calculate a remainder,
| because EDX already holds it. Flaw: N is limited to 32 bit. OTOH,
| we could calculate the log-tables using this method...

Yes, while I still search for a method to calculate log without any divide. :)

| >| 0:FBLD EDI
| >| FBLD ESI
| >| FADD
| >| FBSTP
| >| SUB EDI, 0x0000000A
| >| SUB ESI, 0x0000000A
| >| DEC ECX
| >| JNE 0b
| >
| >it would produce a FPU-stack overflow after 4 iterations? :)
| >ok, I see what you mean....

| My book says: "FADD - adds ST0 to ST1, the result overwrites ST1.
| ST0 is popped, so the result now is stored in ST0."

Ok, my syntax will see "FADDP 1,0" for this,
but assemblers may default different if no parameters follow.
-----------------
from Intel's doc:
Opcode Instruction Description
D8 /0 FADD m32 real Add m32real to ST(0) and store result in ST(0)
DC /0 FADD m64real Add m64real to ST(0) and store result in ST(0)
D8 C0+i FADD ST(0), ST(i) Add ST(0) to ST(i) and store result in ST(0)
DC C0+i FADD ST(i), ST(0) Add ST(i) to ST(0) and store result in ST(i)
DE C0+i FADDP ST(i), ST(0) Add ST(0) to ST(i), store result in ST(i), and pop the
register stack
DE C1 FADDP Add ST(0) to ST(1), store result in ST(1), and pop the
register stack
DA /0 FIADD m32int Add m32int to ST(0) and store result in ST(0)
DE /0 FIADD m16int Add m16int to ST(0) and store result in ST(0)
------------------


| Because there is a FBSTP after the FADD, the FPU is in the *same*
| state as it was before entering label 0 - except ST0 and ST7 have
| another contents now. What should cause the stack overflow after
| the 4th run?

In my terms the FADD won't pop, but Ok if your compiler does it that way.

| BTW - loading could be optimised to pre-load all 8 FPU registers,
| then load the next 2 within the routine. I should read the Athlon
| manuals again... ;)

I think mmx load will not translate the 2^n biased exponent.


| >In terms of code-size it's a beauty.
| >I already replaced all FBLD/FBSTP instructions in my programs
| >with the 10^x semi-log table converter.
| >As the FPU uses a complete 18 digit BIN<=>BCD conversion w/o LUT,
| >and a never correct rounding feature,
| >and the instructions are vectored too, my byte by byte method is
| >much faster, even some more bytes and a table (22Kb for 78digits)
| >are needed.

| I never checked any of my routines until now - I guess, they need
| a lot of cycles... :(

Surround your code with RDTSC's (code 0F 31; -> edx:eax)
it reveals the perfect truth about cycles.

| Ok, next try... ;) Here's an "optimised" version of my addition:


|
| .globl _addOPS
| _addOPS:
| pushl %ecx
| pushl %edx
| movl $0x0100,%ecx
| xorl %edx,%edx
| xorl %eax,%eax
| movb _D_BASE,%dl
| ADD0:movb -0x0100(%ebx,%ecx,1),%al # AL = digit OP1
| cmpb $0xFF,%al # OP1 end?
| jne 0f
| xorb %al,%al # yes -> zero
| incb %dh # stop += 1
| 0:cmpb $0xFF,-0x0100(%esi,%ecx,1) # OP2 end?
| je 1f
| addb -0x0100(%esi,%ecx,1),%al # AL = digit RES
| jmp 2f
| 1:incb %dh # stop += 1
| 2:addb %ah,%al # add carry
| xorb %ah,%ah

*** add and then clear AH ?


| 3:cmpb %dl,%al # valid digit?
| jb 4f
| subb %dl,%al # no, correct it
| incb %ah
| jmp 3b
| 4:cmpb $2,%dh
| je 6f # reached end of both OPs
| movb %al,-0x0100(%edi,%ecx,1) # store result digit

?this is MOV AL,DS:[EDI+ECX+FFFFFFF00]?

| xorb %dh,%dh
| decl %ecx
| je 5f # loop through OPs
| jmp ADD0
| 5:movl $0x000000FF,%eax # ERROR - OVERFLOW
| jmp 8f
| 6:cmpw $0,%ax
| je 7f
| movb %al,-0x0100(%edi,%ecx,1) # store result digit
| cmpb $0,%ah # carry ?
| je 7f
| movb %ah,-0x0100(%edi,%ecx,1) # store result digit
| 7:xorl %eax,%eax # no error occured
| 8:popl %edx
| popl %ecx
| ret
|
| I replaced vector path (lodsb, stosb, et cetera) with direct path
| operations. The loop has 24 DP instructions (which may execute in
| less than 24 cycles (-> parallel, pipes)). If we loop through all
| 256 digits, the result is returned after 6164 cycles. Probably my
| subtraction routine will not be much longer. The base checking is
| still implemented (I had forgotten, that the routines are used in
| my conversion routines, too)...

I didn't check if your code is working,
but it looks a lot faster than the previous version.
(what assembler needs such an awfully over-defined syntax?
ie: "push ecx" should be clear as a dword/reg operation,
and "add al,ah" is obviously a byte/register story,
and I had some trouble with the "jmp label#-back/forth" notation,
as I'm used to reading "jmp 8f" as the hex-notation "relative -8F",
but at least I got the main sense.)



| >If you use multiple of 32 bit variables only,
| >this will be quite faster than my byte by byte way.

| If! I'm still not sure, which way is the best. See below...

| >| Having 256 digits, there is no need for a floating point. The
| >| floating point only exists in the exponent, nowhere else!

| >Except in case of a partial division-result?

| The buffers always hold integer numbers. The floating point only
| is determined in the corresponding data set - where exponents and
| signs are stored, and the current power of 10 for the calculation
| in progress *could* be stored.

| The floating point is nothing else than a "marker". It stands for
| a defined power of 10 within this number (the both operands *are*
| integers). If we shift an operand left, then the exponent must be
| decreased. If we shift it right, the exponent must be increased.

Yes.

[..]

| >1024 bit binaries (128 byte) gives a precision of 308 digits!

| Ok - let's talk about the used format!

| Binary numbers in computers are very easy to handle. No reason to
| convert them, all math operations are supported on CPU level. The
| read and write access can be handled in the fastest way supported
| by the hardware. Sounds great. But is it really *that* great?

| Here are some disadvantages:

| First, we need conversion routines to display the numbers, so the
| average human can read the output. Not a big problem, we have our
| routines for this purpose - same for input.

| Now let's do some math. Calculations like add and sub are done by
| reading a dword of the 1st operand, then adding the corresponding
| dword of the 2nd operand to it, finally store the result. This is
| no problem, and it is the fastest way we can do it.

| But what about mul and div? In theory this are simple shift/add -
| shift/sub operations. But with this "simple shift/xxx" operations
| we start to become contra-productive. It isn't shifting of bytes,
| words or dwords, it is shifting of *bits*! Shifting of bits might
| be okay for 32 bit operations, but is it still a good idea, if we
| have to shift 1024 bit? And mul or div surely need a lot of shift
| (and bit test) operations...

If you think in terms of double-indexed addressing and byte grouping,
then shifts will be replaced by inc/dec index only.
The "BT()"(0F A3 group)-instructions work up to segment-limit
(+/-2 Gbits) offset, rather than just 32 bits.


| Might be, that the overall performance of my BCD concept isn't in
| the list of the fastest or best math solutions. But all the above
| mentioned shifting is done by adding | subtracting one value to |
| from an index register. Much easier to do...

My table creation routines use slow iterative ADD for multiply,
and even slower CMP/SUB loops for divide.
But these are working on byte(or dw)-level rather than bit by bit.
And needed only during install and stored to disk for next reboot.



| I think, that I have to do some calculations, how many cycles are
| needed for each of the operations, so we can compare both methods
| in terms of execution time and performance.

| Maybe we should develop a base for the calculator, first:

| I think of a definition which allows to expand or shrink the size
| of the operands to the needed amount of bytes (= multiples of 2).
| The adjustment should be easy - e.g. by changing a counter.

At the moment I use three user-definable variables:
_precision.max.display_ < _precision.max.round_ < _precision.max.calc_
where the last is limited by the systems _precision.max.numvar.type_*2,
which is the result of the memory-image size in bytes (2^b-1 limits).
Even this are decimal digit limits, the needed bytes are known then.

| All needed information - exponent, the FP position, the amount of
| digits, the signs, et cetera - should be stored at a defined area
| in memory (outside the buffers).

Yes, I have this (together with range-clamping and display-format)
apart as well.



| Multiplication should use "shifting" via the offsets, rather than
| real bit shifts - my method could be rewritten to use hexadecimal
| numbers. Would be shorter, because we can skip the base testing.

I mentioned all this just above... should have read the trail first :)



| Division should use the fastest way. If it can be done with a few
| log-tables, then we should use this way. If it is faster to do it
| the simple way, then we should prefer to use it and implement the
| higher math functions with external routines.

I still try to figure out the 10^px/x w/o any divide,
similar to the Newton-Raphson story,
and if I look at the two-digit lg(35) calculation in the example 546/35
(other post) it already reminds me a bit of this formula.


| I think, that we have invested a lot of time and work until now -
| so it should lead to some useful results. But I am still not sure
| about the format and some basic definitions.

I'm sure we are already close to the final decision(s).

__
wolfgang

bv_schornak

unread,
Mar 1, 2003, 9:42:44 PM3/1/03
to
wolfgang kern wrote:


[Overflow and stop conditions]

>what will an 100000.... (1=MSD max.)
> +900000....
>produce then?
>I either adjust the exponent und round off the LSD,
>or message an overflow condition if no destination exponent is defined.
>

At the moment (with the routine below) the result is 256 * 0x00, and EAX
holds 0x000000FF to signal an overflow. Because the only possible digit
in front of the old MSD is a 1 [9 + n = 10 + (n-1)], we could still correct
the result by shifting the buffer 1 byte down, setting the MSD to 1, then
incrementing the exponent (I would use 4/5 rounding, if I skip the
last digit)...
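(That fix-up as a short C sketch; the layout is made up for the example --
digit[0] = MSD, digit[NDIG-1] = LSD -- and a tiny 8-digit buffer keeps the
demo readable:)

#include <stdio.h>
#include <string.h>

#define NDIG 8

/* On a carry out of the MSD: every digit moves one place down, the freed
   MSD becomes the 1 that fell out of the addition, the exponent goes up
   by one, and the dropped LSD is 4/5-rounded into the new LSD. */
static void renormalize(unsigned char *digit, int *exponent)
{
    unsigned char dropped = digit[NDIG - 1];
    memmove(digit + 1, digit, NDIG - 1);    /* shift one digit position down */
    digit[0] = 1;                           /* the carry digit */
    (*exponent)++;
    if (dropped >= 5) {                     /* simple 4/5 rounding of the lost digit */
        int i = NDIG - 1;
        while (digit[i] == 9) digit[i--] = 0;
        digit[i]++;                         /* stops at the MSD at the latest (it is 1) */
    }
}

int main(void)
{
    /* 99999999 + 7 = 100000006: the buffer holds the low 8 digits,
       the leading 1 is the carry to renormalize. */
    unsigned char d[NDIG] = {0, 0, 0, 0, 0, 0, 0, 6};
    int exp10 = 0;
    renormalize(d, &exp10);
    for (int i = 0; i < NDIG; i++) putchar('0' + d[i]);
    printf(" E%d\n", exp10);                /* 10000001 E1 */
    return 0;
}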


[Format considerations...]

>I think the result of 1/7 can be interpreted and stored as:
>
>142857 E-6 (period=6) 1E7/7
>where you need to know the period-size in advance,
>or make a copy of all remainders and compare...
>
>or extended as
>142557142857142857.......1428 E-256
>1*10^(256 + divisor.digits - dividend.digits) /7
>I would start with the last option and have the first method for
>later append (more correct: preceed).
>

I would prefer the latter solution. Even if it takes more time to
calculate it, only one calculation is done. The first model would
need to calculate (detect) the amount of digits of the periodical
part, too...


>| In the current concept - operations with floating points are done
>| by manipulating the exponents and the offsets within the buffers.
>
>| This doesn't work for add and sub, if the difference between both
>| exponents exceeds 256. If so, the result is the *larger* operand,
>| because the smaller operand's 1st digit is located behind the LSD
>| of the buffer - no calculation is needed.
>
>In my case the selected rounding precision will determine if a
>
>1000 - 1E-4 produce "1000" or "9999999 E-4"
>

The example is within the 256 digit limitation. Same result in my
routines (999.9999 E 0 (I use the engineering notation))...

>As mul and div may produce more digits than needed,
>I use MUL up to all digits and then round to desired precision,
>and DIV up to desired precision and then round the resulting LSD.
>That's the reason for my buffers are twice as large as variables.
>This allows almost correct rounding (is programmable anyway).
>
>A shift in case of mul/div isn't necessary,
>add the MSD-number (10^#) to exponent is all to do,
>but something similar to shift (index-offsets) is needed for add/sub.
>

Agreed in the MUL case, but why do you need to shift ADD and SUB?
The operands start at the *last* byte in the buffer (LSD) and end
somewhere in the buffer (MSD). Only if the difference between the
exponents is too large, we need to extend or shrink the operands.
But this is a task of the calculator logic, not to be implemented
in the calculation routines...

>| EXAMPLE:
>|
>| OP1 1.00 E 250
>| OP2 1.00 E -200
>|
>| OP1 + OP2 = 1.00 E 250 (next "1" 450 digits behind the FP!)
>| OP1 - OP2 = 1.00 E 250 (next "1" 450 digits behind the FP!)
>| OP1 * OP2 = 1.00 E 50
>| OP1 / OP2 = 1.00 E 450
>
>Yes, but be aware that the truncated digits may contain ".999.."
>

Which still is far outside the calculator's 256 digits? BTW - the
larger number has a higher priority than the smaller one, because
the larger one holds the MSD.


[Dr. Math...]

>| Thought of Dr. Math's text:
>[snip, I copied the whole doc. as well ]
>
>| Looks like an accurate and fast way to do a division using 32-bit
>| DIV and MUL instructions. We don't need to calculate a remainder,
>| because EDX already holds it. Flaw: N is limited to 32 bit. OTOH,
>| we could calculate the log-tables using this method...
>
>Yes, while I still search for a method to calculate log without any divide. :)
>

Just a thought... Would be a little bit annoying to calculate the
tables "per hand" and write the results on paper... ;)


[Using the FPU...]

>Ok, my syntax will see "FADDP 1,0" for this,
>but assemblers may default different if no parameters follow.
>

My book is "Assembler Referenz" [ISBN 3-7723-7505-7 by Franzis'].
It says, that FADD without an additional operand is equal to:

FADDP ST(1),ST(0) ->

>DE C1 FADDP Add ST(0) to ST(1), store result in ST(1), and pop the
>register stack
>

>| Because there is a FBSTP after the FADD, the FPU is in the *same*
>| state as it was before entering label 0 - except ST0 and ST7 have
>| another contents now. What should cause the stack overflow after
>| the 4th run?
>
>In my terms the FADD won't pop, but Ok if your compiler does it that way.
>

The code was developed for my reply. GAS wouldn't accept the used
code, I bet...

>| BTW - loading could be optimised to pre-load all 8 FPU registers,
>| then load the next 2 within the routine. I should read the Athlon
>| manuals again... ;)
>
>I think mmx load will not translate the 2^n biased exponent.
>

A repeated FBLD xxxx would do it. The preload logic could lead to
parallel execution - so we could compensate the needed cycles for
FBLD a little bit... ;)

>| I never checked any of my routines until now - I guess, they need
>| a lot of cycles... :(
>
>Surround your code with RDTSC's (code 0F 31; -> edx:eax)
>it reveals the perfect truth about cycles.
>

Doesn't work on AMD processors... ;)

BTW - my routines are "optimized" for Athlon, if ever - a "political"
decision... ;)


[TRANSLATION]

>.globl _addOPS
>_addOPS:
> push ecx
> push edx
> mov ecx, 0x0100
> xor edx,edx
> xor eax,eax
> mov dl, byte[D_BASE]
> ADD0:mov al, byte[-0x0100(ebx, ecx, 1)] # AL = digit OP1
> cmp al, 0xFF # OP1 end?
> jne 0(forwards)
> xor al,al # yes -> zero
> inc dh # stop += 1
> 0:cmp byte[-0x0100(esi, ecx, 1)],0xFF # OP2 end?
> je 1 (forwards)
> add al,byte[-0x0100(esi, ecx, 1)] # AL = digit RES
> jmp 2 (forwards)
> 1:inc dh # stop += 1
> 2:add al,ah # add carry
> xor ah,ah
>
- - -

>*** add and then clear AH ?
>

AH is the carry - after adding the carry of the last operation we
should clear it... The next lines may increment the carry, if the
result is greater than the given base!

- - -

> 3:cmp al,dl # valid digit?
> jb 4 (forwards) # DL is 0x0A...
> sub al,dl # no, correct it
> inc ah
> jmp 3b
> 4:cmp dh, 2
> je 6 (forwards) # reached end of both OPs
> mov byte[-0x0100(edi, ecx, 1)],al # store result digit
>
- - -

>?this is MOV AL,DS:[EDI+ECX+FFFFFFF00]?
>

I don't know how to translate this into other assembler dialects.
It is the byte with the offset -0x0100 to a memory location which
is calculated at runtime:

EDI = address
ECX = counter
MUL = multiplier (1, 2, 4 or 8), here 1

Address = (EDI + (ECX * MUL)) - 0x0100

The routine needs some tricks here, because we have to do the add
backwards. This is done with decrementing ECX, so the calculation
of the address produces backwards steps in memory...

- - -

> xor dh,dh
> dec ecx
> je 5 (forwards) # loop through OPs
> jmp ADD0
> 5:mov eax, 0x000000FF # ERROR - OVERFLOW
> jmp 8 (forwards)
> 6:cmp ax, 0
> je 7 (forwards)
> mov byte[-0x0100(edi, ecx, 1)],al # store result digit
> cmp ah, 0 # carry ?
> je 7 (forwards)
> mov byte[-0x0100(edi, ecx, 1)],ah # store result digit
> 7:xor eax,eax # no error occured
> 8:pop edx
> pop ecx
> ret
>

>I didn't check if your code is working,
>but it looks a lot faster than the previous version.
>

It should be at least 20 % faster than the old version...

>(what assembler needs a "that awful" over-defined syntax?
> ie: "push ecx" should be clear as dw/reg operation,
> and "add al,ah" is obvious a byte/register story,
> and I got some troubles with "jmp label#-back/forth",
> as I'm use to read "jmp 8f" as hex-notation "relative -8F"
> but at least I got the main sense.)
>

Presenting the first routine, I added a statement that it is GAS
(AT&T syntax). In AT&T you have to write hex numbers with a leading
"0x", so there's no problem with the direction postfixes...

In general it isn't necessary to write add"b" %al,%ah (BTW, this
is add AL to AH!), but the source file compiles faster if I give
the operand size along with the registers (GAS doesn't have to look
for the right opcode).
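
For example, both of these assemble to the same instruction - the
explicit suffix just spares GAS the size lookup:

        add    %al, %ah          # AT&T order: source first, AH = AH + AL
        addb   %al, %ah          # the same, with an explicit byte suffix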


[Formats and other considerations...]

>If you think in terms of double-indexed addressing and byte grouping,
>then shifts will be replaced by inc/dec index only.
>The "BT()"(0F A3 group)-instructions work up to segment-limit
>(+/-2 Gbits) offset, rather than just 32 bits.
>

BT* is direct path if used to test registers - for testing memory
it is vector path. I'm not sure, if you can use it for more than
32 bit. I remember something about "masking out" values above 32,
but this may be done before testing registers, only ... in my new
routines I will use hexadecimals rather than bits - it's the best
compromise to keep the routines small and fast...
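
A small sketch of the memory form (the label "bitmap" is only a
hypothetical scratch buffer): with a memory operand the bit index in
the register is not masked to 0...31, so it can reach bits in the
following bytes:

        .data
bitmap: .space 32                # 256 bits of scratch
        .text
        movl   $100, %eax        # bit number 100
        btl    %eax, bitmap      # CF = bit 100 of the buffer (byte 12, bit 4)
        setc   %al               # AL = 1 if that bit was set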

>My table creation routines use slow iterative ADD for multiply,
>and even slower CMP/SUB loops for divide.
>

Ok, I think that my improved multiply may speed up your routines,
too (saves lots of repeated additions)...

>| I think of a definition which allows to expand or shrink the size
>| of the operands to the needed amount of bytes (= multiples of 2).
>| The adjustment should be easy - e.g. by changing a counter.
>
>At the moment I use three user-definable variables:
>_precision.max.display_ < _precision.max.round_ < _precision.max.calc_
>where the last is limited by the systems _precision.max.numvar.type_*2,
>which is the result of the memory-image size in bytes (2^b-1 limits).
>Even this are decimal digit limits, the needed bytes are known then.
>

Remember, that my calculator only is one small part of a complete
system. I have to define every needed or used byte, because other
parts of the system occupy the memory below and above.

Maybe I "outsource" the calculator to its own allocated memory...

>| Division should use the fastest way. If it can be done with a few
>| log-tables, then we should use this way. If it is faster to do it
>| the simple way, then we should prefer to use it and implement the
>| higher math functions with external routines.
>
>I still try to figure out the 10^px/x w/o any divide,
>similar to the Newton-Raphson story,
>and if I look at the two digits lg(35)-calc in the example 456/35
>(other post) it already reminds a bit of this formula.
>

Sorry, but my job occupies too much of my time at the moment - no
time left to do some work with all the logarithm stuff. I am busy
with so many things (including our discussion) at the moment, and
every additional working hour is one hour less for the sum of all
other things I do (including my private life, too)...

>| I think, that we have invested a lot of time and work until now -
>| so it should lead to some useful results. But I am still not sure
>| about the format and some basic definitions.
>
>I'm sure we are already close to the final decision(s).
>

Meanwhile, I thought a little bit about the format - if I replace
the unpacked BCDs with packed hexadecimals:

IN: EBX = base address OP1
ESI = OP2
EDI = RES

_addOPS:
push ecx
xor eax,eax
push edx
mov al, byte[size_of_OP1]
xor ecx,ecx
cmp al, byte[size_of_OP2]
jbe 0 (forwards)
mov al, byte[size_of_OP2]
0:mov cl,al
mov edx, 0x0000007C
shr cl, 2
and al, 0x03
je 1 (forwards)
inc cl
1:clc
2:mov eax, dword[-0x80(ebx, edx, 4)] # bad construct
addc eax, dword[-0x80(esi, edx, 4)] # maybe I should
mov dword[-0x80(edi, edx, 4)], eax # add some NOPs
dec edx
dec edx # SUB EDX, 4 changes carry, this
dec edx # is the fastest alternative way
dec edx
dec ecx
jne 2 (backwards)
addc eax, 0
mov dword[-0x80(edi, edx, 4)], eax
pop edx
pop ecx
ret

This one needs about 308 cycles, if all 32 additions are done. The
BCD routine needs something around 6144 cycles, so the new routine
is 20 times faster...

What about the new solution?

bv_schornak

Mar 2, 2003, 8:19:12 AM
bv_schornak wrote:

I forgot to mention that the new routines are
hexadecimal ... so we will need logarithm-tables based
on hexadecimal digits, too... ;)

wolfgang kern

Mar 3, 2003, 1:30:33 PM

"Bernhard" wrote:

[Overflow and rounding]

| At the moment (below routine) the result is 256 * 0x00, EAX holds
| 0x000000FF to signal an overflow. Because the only possible value
| in MSD - 1 can be 1 [9 + n = 10 + (n-1)], we still could correct
| the result by shifting the buffer 1 byte down, set MSD to 1, then
| increment the exponent (I would use a 4/5 rounding, if I skip the
| last digit)...

OK.

[Format considerations...]

[period detection vs. straight division]

| I would prefer the latter solution. Even if it takes more time to
| calculate it, only one calculation is done. The first model would
| need to calculate (detect) the amount of digits of the periodical
| part, too...

Yes, I'll keep the idea of period-detection for later as well.

| >A shift in case of mul/div isn't necessary,
| >add the MSD-number (10^#) to exponent is all to do,
| >but something similar to shift (index-offsets) is needed for add/sub.

| Agreed in the MUL case, but why do you need to shift ADD and SUB?

e.g.: 1E1 + 1E-1 = 101 E-1 ;=10.1

Of course, the values in this example won't happen to be a problem
as "we" don't allow negative exponents to be stored anyway,
(I think we both use the "pseudo FP" or commonly called "fixpoint")
but if the exponents are different due any partial result:

BCD with BCD exponent:
easy digit-alignment by exponent-difference.

binary mantissa:
the operation depends how the exponent is valuated.

2^n-exponent:
LSD alignment is just a bit-shift (or bit index-offset adjust)
using the difference of the exponents.
But here the "0.1 = 0.0999.."-effect will occur.

10^n-exponent:
We need to adjust the operands by multiply the larger operand with
the difference of the exponents (from 10^x table) first.
This is the price to pay for "0.1" doesn't become a periodic figure.

| The operands start at the *last* byte in the buffer (LSD) and end
| somewhere in the buffer (MSD). Only if the difference between the
| exponents is too large, we need to extend or shrink the operands.
| But this is a task of the calculator logic, not to be implemented
| in the calculation routines...

Rounding and truncation on 'every' operation may lead to a large
precision-loss with iterative formulas (as statistics,log,trig... ),
therefore my "ACCUs" are double-sized.

| [Dr. Math...]
| >| ....we could calculate the log-tables using this method...


| >Yes, while I still search for a method to calculate log without any divide. :)
| Just a thought... Would be a little bit annoying to calculate the
| tables "per hand" and write the results on paper... ;)

This is done already by Newton/Gauss/... :) ,but just up to 12 digits.
But to check 'first' results, any alternate method will help.

| [FPU syntax defaults]

| My book is "Assembler Referenz" [ISBN 3-7723-7505-7 by Franzis'].
| It says, that FADD without an additional operand is equal to:
|
| FADDP ST(1),ST(0) ->

| The code was developed for my reply. GAS wouldn't accept the used
| code, I bet...

I'm not sure if all compilers will default to equal code.
So after being cheated too often, and with the experience from 'inline coding',
I don't need compilers at all since long,
I write my code in hex and use the remains
of my brain for immediate compiling.

[mmx-load]


| >I think mmx load will not translate the 2^n biased exponent.

| A repeated FBLD xxxx would do it. The preload logic could lead to
| parallel execution - so we could compensate the needed cycles for
| FBLD a little bit... ;)

| >| I never checked any of my routines until now - I guess, they need
| >| a lot of cycles... :(

| >Surround your code with RDTSC's (code 0F 31; -> edx:eax)
| >it reveals the perfect truth about cycles.

| Doesn't work on AMD processors... ;)

??? It should,
perhaps your compiler doesn't know about,
just try with inline code 0F 31; followed by save/compare edx:eax.

| BTW - my routines are "optimised" for Athlon, if ever - a "political"
| decision... ;)

Guess what's written on 'my' CPU, it reads "Athlon 500, K7",
and yes primary due a 'political decision'.
Even I got a set of several other PC's around here,
including Intel's P4 and XP, I prefer to work on the slower machine,
it shows the speed difference to 'common' programs more drastically :)

[TRANSLATION]
[..snipped code]
Thanks for the additional work, now it makes sense also to me :)
I was confused by the destin.<-> source syntax

|...(BTW, this is add AL to AH!)...
That was confusing me, but thanks now I learned how to read (AT&T).

[BT*-instruction]

| BT* is direct path if used to test registers - for testing memory

| it is vector path. I'm not sure, if you can use it for more than
| 32 bit.

I don't believe in books, so I checked (beside all others):
[AMD K7:]
code 0F BA xx i8 BT imm8 ;bit number masked by 1Fh
0F A3 xx BT m/r32 ;PM: bit number limited by segment limit
;RM: works up to bit offset FFFF0 w/o seg-overrun!
;this looks like a bug, but maybe a useful one :)

The '0F A3' is vector-path (8),
while FBLD is(91) and FBSTP is vectored(196)!

| I remember something about "masking out" values above 32,
| but this may be done before testing registers, only ... in my new
| routines I will use hexadecimals rather than bits - it's the best
| compromise to keep the routines small and fast...

So you will use nibble- or byte-grouping?

| >My table creation routines use slow iterative ADD for multiply,
| >and even slower CMP/SUB loops for divide.

| Ok, I think that my improved multiply may speed up your routines,
| too (saves lots of repeated additions)...

Could be, let's count the clocks :)

| >| I think of a definition which allows to expand or shrink the size
| >| of the operands to the needed amount of bytes (= multiples of 2).
| >| The adjustment should be easy - e.g. by changing a counter.

All my numeric var-types (more than 40) got global definition-fields:

maximum memory-image (bytes), mantissa-size(bytes), exponent-size(bytes),
mantissa-size(decimal digits), exponent-size(decimal digits),....



| >At the moment I use three user-definable variables:
| >_precision.max.display_ < _precision.max.round_ < _precision.max.calc_
| >where the last is limited by the systems _precision.max.numvar.type_*2,
| >which is the result of the memory-image size in bytes (2^b-1 limits).
| >Even this are decimal digit limits, the needed bytes are known then.

| Remember, that my calculator only is one small part of a complete
| system. I have to define every needed or used byte, because other
| parts of the system occupy the memory below and above.

| Maybe I "outsource" the calculator to its own allocated memory...

Dedicated memory for maximum on needed buffers and tables
is a good idea for sure.

| >| Division should use the fastest way. If it can be done with a few
| >| log-tables, then we should use this way. If it is faster to do it
| >| the simple way, then we should prefer to use it and implement the
| >| higher math functions with external routines.

| >I still try to figure out the 10^px/x w/o any divide,
| >similar to the Newton-Raphson story,
| >and if I look at the two digits lg(35)-calc in the example 456/35
| >(other post) it already reminds a bit of this formula.

| Sorry, but my job occupies too much of my time at the moment - no
| time left to do some work with all the logarithm stuff. I am busy
| with so many things (including our discussion) at the moment, and
| every additional working hour is one hour less for the sum of all
| other things I do (including my private life, too)...

Same for me, I've been off-line for a few days, as you may have noticed.
We don't need to hurry....

| >| I think, that we have invested a lot of time and work until now -
| >| so it should lead to some useful results. But I am still not sure
| >| about the format and some basic definitions.

| >I'm sure we are already close to the final decision(s).

| Meanwhile, I thought a little bit about the format - if I replace
| the unpacked BCDs with packed hexadecimals:

Hmm, packed hexadecimal = (byte-organised) binary.
I see "hex" just as another display-format.

| IN: EBX = base address OP1
| ESI = OP2
| EDI = RES

| _addOPS:
| push ecx
| xor eax,eax
| push edx
| mov al, byte[size_of_OP1]
| xor ecx,ecx
| cmp al, byte[size_of_OP2]
| jbe 0 (forwards)
| mov al, byte[size_of_OP2]
| 0:mov cl,al
| mov edx, 0x0000007C

| shr cl, 2 ; ok, ecx = larger operand (DWs)
| and al, 0x03
| je 1 (forwards) ; round up if remainder

| inc cl
| 1:clc
| 2:mov eax, dword[-0x80(ebx, edx, 4)] # bad construct

? 0x7C * 4 ?; should be *1 ?


| addc eax, dword[-0x80(esi, edx, 4)] # maybe I should
| mov dword[-0x80(edi, edx, 4)], eax # add some NOPs

; depends on code alignment too
; check clock count..

| dec edx
| dec edx # SUB EDX, 4 changes carry, this
| dec edx # is the fastest alternative way
| dec edx

how about LEA edx,[edx-4] ; no flags altered!
; LEA r/r is direct-path


| dec ecx
| jne 2 (backwards)
| addc eax, 0
| mov dword[-0x80(edi, edx, 4)], eax

if this should be the MSB carry over, edx still valid?

| pop edx
| pop ecx
| ret
|
| This one needs about 308 cycles, if all 32 additions are done. The
| BCD routine needs something around 6144 cycles, so the new routine
| is 20 times faster...

I can't follow the '308' cycles,
if I assume 4*32 = 128 bytes to add,
then I see 32 iterations, with about 15 clock-periods each,
perhaps you assume one operand in L1 cache already?

| What about the new solution?

The code looks fast and short.

But I'm not sure I interpret your storage format correctly:
if it shall start with LSB and loop until MSB of largest operand,
then I would see "a wrong Endian" here :)
[I use higher address = higher value, I think this is the 'tiny tribe' :)
if you prefer the other Indian, then the loop direction may be correct]
(the negative 0x80 shifts the start-point,
but the order depends on INC/- vs. DEC/-loop)

I modified your example to:

push...
mov dl, [size_of_OP1]
xor ecx,ecx
cmp dl, [size_of_OP2]
jbe 0 (f)
mov dl, [size_of_OP2]
mov dh, [size_max_result]
0: ; dl = largest operand (bytes)
1: mov eax,0x80
sub edi,eax ; if you actually need this here?
sub esi,eax ;
sub ebx,eax ;

mov eax,ecx ; = clear cy yet ,start with ecx=0
1 2: rcr al,1 ;b0 al -> cy
3 mov eax,(ebx, ecx*1)
3 adc eax,(esi, ecx*1)
3 mov (edi, ecx*1),eax
1 rcl al,1 ;cy -> b0 al
1 inc ecx
1 cmp cl,dh ;OV-error if > destination bytes
1 jg (err) ;
1 cmp cl,dl ;end if ecx > needed bytes
1 jbe 2 (b)
1 rcr al,1 ;check for carry-over
4 adc dword (edi, ecx*4),0 ;or adc "byte" may be enough

pop..
ret

the loop needs 14 (+2 for the additional destination-maximum) clocks,
is a bit shorter and uses little Endian format.

__
wolfgang


bv_schornak

Mar 4, 2003, 3:23:26 PM
wolfgang kern wrote:

>e.g.: 1E1 + 1E-1 = 101 E-1 ;=10.1
>
>Of course, the values in this example won't happen to be a problem
>as "we" don't allow negative exponents to be stored anyway,
>(I think we both use the "pseudo FP" or commonly called "fixpoint")
>but if the exponents are different due any partial result:
>
>BCD with BCD exponent:
> easy digit-alignment by exponent-difference.
>

Only valid for BCD numbers...

>binary mantissa:
> the operation depends how the exponent is valuated.
>
>2^n-exponent:
> LSD alignment is just a bit-shift (or bit index-offset adjust)
> using the difference of the exponents.
> But here the "0.1 = 0.0999.."-effect will occur.
>
>10^n-exponent:
> We need to adjust the operands by multiply the larger operand with
> the difference of the exponents (from 10^x table) first.
> This is the price to pay for "0.1" doesn't become a periodic figure.
>

Nice... Either we use BCD for exponent *and* mantissa, or we have the
darn rounding for the binary or hexadecimal format...

>| The operands start at the *last* byte in the buffer (LSD) and end
>| somewhere in the buffer (MSD). Only if the difference between the
>| exponents is too large, we need to extend or shrink the operands.
>| But this is a task of the calculator logic, not to be implemented
>| in the calculation routines...
>
>Rounding and truncation on 'every' operation may lead to a large
>precision-loss with iterative formulas (as statistics,log,trig... ),
>therefore my "ACCUs" are double-sized.
>

As mine are now, if I use hexadecimals...

>| [FPU syntax defaults]


>
>I'm not sure if all compilers will default to equal code.
>So after being cheated too often, and with the experience from 'inline coding',
>I don't need compilers at all since long,
>I write my code in hex and use the remains
>of my brain for immediate compiling.
>

My respect! I prefer the "comfort" of GCC/2, with the 80386 opcodes -
expanded with a few useful patches (BSWAP, AAD and AAM with immediate
values instead of 0x0A, et cetera)...


[RDTSC]

>|| Surround your code with RDTSC's (code 0F 31; -> edx:eax)
>|| it reveals the perfect truth about cycles.
>
>| Doesn't work on AMD processors... ;)
>
>??? It should,
> perhaps your compiler doesn't know about,
> just try with inline code 0F 31; followed by save/compare edx:eax.
>

If I have some time to test it, I will try out...

>| BTW - my routines are "optimised" for Athlon, if ever - a "political"
>| decision... ;)
>
>Guess what's written on 'my' CPU, it reads "Athlon 500, K7",
>and yes primary due a 'political decision'.
>Even I got a set of several other PC's around here,
>including Intel's P4 and XP, I prefer to work on the slower machine,
>it shows the speed difference to 'common' programs more drastically :)
>

XP = eXPerimental Windows? ;)

If so: Does it run on the K7? :-D

>[TRANSLATION]
>[..snipped code]
>Thanks for the additional work, now it makes sense also to me :)
>I was confused by the destin.<-> source syntax
>

You haven't seen Ben's sources? It's AT&T, too - so I could read them
without any problems. ;)

>[BT*-instruction]
>
>| BT* is direct path if used to test registers - for testing memory
>| it is vector path. I'm not sure, if you can use it for more than
>| 32 bit.
>
>I don't believe in books, so I checked (beside all others):
>[AMD K7:]
>code 0F BA xx i8 BT imm8 ;bit number masked by 1Fh
> 0F A3 xx BT m/r32 ;PM: bit number limited by segment limit
> ;RM: works up to bit offset FFFF0 w/o seg-overrun!
> ;this looks like a bug, but maybe a useful one :)
>
>The '0F A3' is vector-path (8),
>while FBLD is(91) and FBSTP is vectored(196)!
>

Ok - forget about the FPU... It was just an alternative thought - but
if I see the cycles now, then it probably is slower than the "byte by
byte" routine...

>| I remember something about "masking out" values above 32,
>| but this may be done before testing registers, only ... in my new
>| routines I will use hexadecimals rather than bits - it's the best
>| compromise to keep the routines small and fast...
>
>So you will use nibble- or byte-grouping?
>

Nibbles would be a good solution for fast access in my multiplication
and division routines. Packed bytes can be accessed in dword steps, a
faster way to do calculations. Since unpacking can be done within two
clocks, I prefer the packed format for hexadecimals!

>| >My table creation routines use slow iterative ADD for multiply,
>| >and even slower CMP/SUB loops for divide.
>
>| Ok, I think that my improved multiply may speed up your routines,
>| too (saves lots of repeated additions)...
>
>Could be, let's count the clocks :)
>

The "improvement" is to use an additional buffer, holding the current
multiple of the larger operand (2...(base-1)). Instead of adding the
larger operand again and again for each digit of the smaller operand,
we search the smaller operand for matching digits. Whenever a digit's
equal to the current multiple, the contents of the additional buffer
is added to the result (with the correct "shifting", done by changing
the index register). This is repeated, until (base - 1) is done...

Because a search loop is done much faster than an addition loop, this
should speed up multiplication markedly. The amount of additions is

(base - 1) * (amount digits in smaller operand),

the amount of needed comparisons is equal to the above - if you do it
the "hard way", you need ((base - 1) * (base - 1)) addition loops, if
all digits are (base - 1)!

>| >| I think of a definition which allows to expand or shrink the size
>| >| of the operands to the needed amount of bytes (= multiples of 2).
>| >| The adjustment should be easy - e.g. by changing a counter.
>
>All my numeric var-types (more than 40) got global definition-fields:
>
> maximum memory-image (bytes), mantissa-size(bytes), exponent-size(bytes),
> mantissa-size(decimal digits), exponent-size(decimal digits),....
>

There's only one type in my calculator... ;)

>| Maybe I "outsource" the calculator to its own allocated memory...
>
>Dedicated memory for maximum on needed buffers and tables
>is a good idea for sure.
>

It's on my to-do list...

>Same for me, I've been off-line for a few days, as you may have noticed.
>We don't need to hurry....
>

Even Rome wasn't built in a day (or two, or three)... ;)

>| Meanwhile, I thought a little bit about the format - if I replace
>| the unpacked BCDs with packed hexadecimals:
>
>Hmm, packed hexadecimal = (byte-organised) binary.
>I see "hex" just as another display-format.
>

It's better than to handle bits...


[New routine...]

A bug, a bug - correction added!

>| IN: EBX = base address OP1
>| ESI = OP2
>| EDI = RES
>
>| _addOPS:
>| push ecx
>| xor eax,eax
>| push edx
>| mov al, byte[size_of_OP1]
>| xor ecx,ecx
>| cmp al, byte[size_of_OP2]
>| jbe 0 (forwards)
>| mov al, byte[size_of_OP2]
>| 0:mov cl,al

>| mov edx, 0x0000001F
>

EDX now is 0x7C / 4 !!!

>| shr cl, 2 ; ok, ecx = larger operand (DWs)
>

Actually it is DDs (ECX >> 2) == (ECX / 4)

>| and al, 0x03
>| je 1 (forwards) ; round up if remainder
>| inc cl
>| 1:clc
>| 2:mov eax, dword[-0x80(ebx, edx, 4)] # bad construct
> ? 0x7C * 4 ?; should be *1 ?
>

No! Now EDX is corrected...

I just practised "buggy" thinking... ;)

>| addc eax, dword[-0x80(esi, edx, 4)] # maybe I should
>| mov dword[-0x80(edi, edx, 4)], eax # add some NOPs
> ; depends on code alignment too
> ; check clock count..
>

LSD always ends at a dword boundary!

>| dec edx
>

Corrected "bug"...

> how about LEA edx,[edx-4] ; no flags altered!
> ; LEA r/r is direct-path
>

I know the LEA, but I don't know how to use it in GAS / AT&T...

>| dec ecx
>| jne 2 (backwards)
>| addc eax, 0
>| mov dword[-0x80(edi, edx, 4)], eax
> if this should be the MSB carry over, edx still valid?
>

With EDX = 0xFFFFFFFF -> EDX * 4 = 0xFFFFFFFC

-> -0x80(EDI, 0xFFFFFFFC) -> -0x84(EDI) ???

Because the operands occupy half of the space now, I start add or sub
with all operands in the middle of the buffer - so there are 64 bytes
before and behind the operand. If addition produces an overflow, then
it is stored at [MSD - 1].

Further processing should be done by the calculator logic, not by the
base calculation routines. These routines might be used by "outside"
functions as well. In case of overflow, the calculator would truncate
the last digit after rounding, then use [MSD - 1] as new MSD...

>| pop edx
>| pop ecx
>| ret
>|
>| This one needs about 308 cycles, if all 32 additions are done. The
>| BCD routine needs something around 6144 cycles, so the new routine
>| is 20 times faster...
>
>I can't follow the '308' cycles,
>if I assume 4*32 = 128 bytes to add,
>then I see 32 iterations, with about 15 clock-periods each,
>perhaps you assume one operand in L1 cache already?
>

I counted all instructions as direct path (my book says they are DP)...

>| What about the new solution?
>
>The code looks fast and short.
>

Now it is 3 instructions shorter than before! ;)

>But I'm not sure I interpret your storage format correctly:
>if it shall start with LSB and loop until MSB of largest operand,
>then I would see "a wrong Endian" here :)
>

"Wrong" for the iNTEL fans, only - I prefer to write operands as they
are written in real life (I'm a 68k fan - and the missing BCD support
in the x86 architecture is a pity)!

>[I use higher address = higher value, I think this is the 'tiny tribe' :)
>if you prefer the other Indian, then the loop direction may be correct]
>(the negative 0x80 shifts the start-point,
> but the order depends on INC/- vs. DEC/-loop)
>

Not for real... ;)

My routine starts at [LSD - 4], reading the dword with the LSD as byte
# 3 (order: 0-1-2-3). No difference if you step up or down in memory.

>I modified your example to:
>
> push...
> mov dl, [size_of_OP1]
> xor ecx,ecx
> cmp dl, [size_of_OP2]
> jbe 0 (f)
> mov dl, [size_of_OP2]
> mov dh, [size_max_result]
>

Where this can't be larger than 129 byte (with 128 byte operands)?

The original code reads the size of OP1, then compares it against the
size of OP2. The larger one is taken as loop counter. Because we want
to perform dword operations, we divide the loop count by 4. If it was
not a multiple of 4, we increment the division result, 'cause there's
one more step to do (we shifted this step out before!). So we perform
the needed amount of additions, only (w/o test within the loop)... ;)

> 0: ; dl = largest operand (bytes)
> 1: mov eax,0x80
> sub edi,eax ; if you actually need this here?
> sub esi,eax ;
> sub ebx,eax ;
>
> mov eax,ecx ; = clear cy yet ,start with ecx=0
>1 2: rcr al,1 ;b0 al -> cy
>3 mov eax,(ebx, ecx*1)
>3 adc eax,(esi, ecx*1)
>3 mov (edi, ecx*1),eax
>

Some remarks...

1. My book says direct path for all operations?
2. Is this byte-by-byte add (ECX * 1)?
3. If so, then it should be AL instead of EAX?

The above code adds each byte 4 times (shifted bytewise - except LSD)
to the result...

>1 rcl al,1 ;cy -> b0 al
>1 inc ecx
>1 cmp cl,dh ;OV-error if > destination bytes
>1 jg (err) ;
>1 cmp cl,dl ;end if ecx > needed bytes
>1 jbe 2 (b)
>1 rcr al,1 ;check for carry-over
>4 adc dword (edi, ecx*4),0 ;or adc "byte" may be enough
>
> pop..
> ret
>
>the loop needs 14 (+2 for the additional destination-maximum) clocks,
>is a bit shorter and uses little Endian format.
>

But it isn't kosher (including the "wrong" format)? ;)

The only "flaw" in my routine is, that I need two counters - one loop
counter and one counter for calculating the current address. But it's
better than to use the "wrong" endian - I would accept it for all the
pre-defined formats (word, dword, qword or floats), but never for any
number with a larger size than 8 byte - I hate all unnecessary extra
byte swapping (more than often to do with iNTEL's "nice" formats)...

wolfgang kern

Mar 7, 2003, 12:19:28 PM

"Bernhard" wrote:

[exponent: BCD/ 10^n/ 2^n]

| Nice... Either we use BCD for exponent *and* mantissa, or we have the
| darn rounding for the binary or hexadecimal format...

Not necessarily,
I use binary 'integers' for mantissa and
a binary-coded (rather than BCD), but "decimal valuated" exponent.
So an exponent eg: FFFFh (-1) means 10^-1, instead of the usual 2^-1.
All numeric figures (within mantissa precision-range)
can decimal represented w/o additional rounding then.
While negative 2^n exponents won't produce exact decimal fractions and become periodic in many cases.

[...]
[RDTSC]


| If I have some time to test it, I will try out...

Note on RDTSC:
run the sequence under test at least two times and have interrupts disabled;
the first run will show up much slower due to the needed cache-line fills,
and this may not be the figure you want to compare.
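
A minimal sketch of such a measurement in GAS/AT&T (the labels tsc_lo
and code_under_test are only placeholders; as noted above, the first
run should be discarded):

        .data
tsc_lo: .long  0
        .text
measure:
        rdtsc                    # TSC -> EDX:EAX
        movl   %eax, tsc_lo      # remember the low 32 bits
        call   code_under_test   # the sequence to be timed
        rdtsc
        subl   tsc_lo, %eax      # EAX = elapsed cycles (low 32 bits)
        ret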


| XP = eXPerimental Windows? ;)

It's the family's "new X-mas PC",
but my little son doesn't allow me to play around with it yet. :)
So I've run only a few tests till now.



| If so: Does it run on the K7? :-D

It does,
but I couldn't check whether all configurations produce
only the well-known M$ bugs (documented special conditions).

| Ok - forget about the FPU... It was just an alternative thought - but
| if I see the cycles now, then it probably is slower than the "byte by
| byte" routine...

In fact it is.

| >So you will use nibble- or byte-grouping?

| Nibbles would be a good solution for fast access in my multiplication
| and division routines. Packed bytes can be accessed in dword steps, a
| faster way to do calculations. Since unpacking can be done within two
| clocks, I prefer the packed format for hexadecimals!

"packed hexadecimals"? , lets call them "bytes" ;)

[multiply:]


| The "improvement" is to use an additional buffer, holding the current
| multiple of the larger operand (2...(base-1)). Instead of adding the
| larger operand again and again for each digit of the smaller operand,
| we search the smaller operand for matching digits. Whenever a digit's
| equal to the current multiple, the contents of the additional buffer
| is added to the result (with the correct "shifting", done by changing
| the index register). This is repeated, until (base - 1) is done...

Just to avoid confusion, I see the "digits" as bytes now.

I'm not sure if the additional buffer will save too much time,
due this partial result needs to be calculated apart before,
but your idea is worth to give it a try.



| Because a search loop is done much faster than an addition loop, this
| should speed up multiplication markedly. The amount of additions is

| (base - 1) * (amount digits in smaller operand),

Plus the carry-over handling.... [99999999..999 + 2]



| the amount of needed comparisons is equal to the above - if you do it
| the "hard way", you need ((base - 1) * (base - 1)) addition loops, if
| all digits are (base - 1)!

I think there are equal ADD-counts for either '9999 * 5' or '5 * 9999',
but yes, there is a 50% chance for an early "no-more-carry"-end,
if the smaller is added to the larger value.

In terms of decimal digits or hex nibbles there is a great chance to find several equal digits.
But if we now talk about bytes or dwords, this chance will decrease drastically.

I like to use LUTs whenever the speed/memory relation allows it.
A byte-wide multiply-LUT (byte*byte=word) would just cost 128Kb.
It could be used for DIV (cmp-scan) as well.

While searching my docs I found the 'F7' MUL (edx:eax=eax*r/m) is a
vectored (6/9) instruction.
Unfortunately the faster '0F AF' exists in the signed IMUL form only.

But I could not find any useful instruction in the mmx/3Dnow sections;
even though PMULHUW etc. are fast,
there would be too much overhead to convert the result into any useful format.

So give me some days to figure out which of the above-mentioned methods
will be the fastest multiply.

[numeric types]

| There's only one type in my calculator... ;)

already defined?
128 byte binary, up to 308 decimal digits precision...?

| >| Meanwhile, I thought a little bit about the format - if I replace
| >| the unpacked BCDs with packed hexadecimals:

| >Hmm, packed hexadecimal = (byte-organised) binary.
| >I see "hex" just as another display-format.

| It's better than to handle bits...

Yes, but it's just a four bit-grouping for easier human view.
In terms of calculation and storage "hex"="binary".

| [New routine...]
|
| A bug, a bug - correction added!
|
| >| IN: EBX = base address OP1
| >| ESI = OP2
| >| EDI = RES
| >
| >| _addOPS:
| >| push ecx
| >| xor eax,eax
| >| push edx
| >| mov al, byte[size_of_OP1]
| >| xor ecx,ecx
| >| cmp al, byte[size_of_OP2]
| >| jbe 0 (forwards)
| >| mov al, byte[size_of_OP2]
| >| 0:mov cl,al
| >| mov edx, 0x0000001F

| >| shr cl, 2 ; ok, ecx = larger operand (DWs)
| Actually it is DDs (ECX >> 2) == (ECX / 4)

?? byte-count/4 = dw-count!

| >| and al, 0x03
| >| je 1 (forwards) ; round up if remainder
| >| inc cl
| >| 1:clc
| >| 2:mov eax, dword[-0x80(ebx, edx, 4)] # bad construct

| >| addc eax, dword[-0x80(esi, edx, 4)] # maybe I should
| >| mov dword[-0x80(edi, edx, 4)], eax # add some NOPs
| >| ; depends on code alignment too
| >| ; check clock count..
| LSD always ends at a dword boundary!

I meant code-alignment, so label"2" should 'sit' on a 16-byte boundary.
| >| dec edx
----[LEA]-----not used yet, but:
If your compiler can't do 'LEA' you should look for an upgrade..
inline code with addressing parts needs exact code knowledge...
LEA edx,(edx-4) would be coded as '8D 52 FC'
---------


| >| dec ecx
| >| jne 2 (backwards)
| >| addc eax, 0
| >| mov dword[-0x80(edi, edx, 4)], eax
| > if this should be the MSB carry over, edx still valid?

| With EDX = 0xFFFFFFFF -> EDX * 4 = 0xFFFFFFFC
| -> -0x80(EDI, 0xFFFFFFFC) -> -0x84(EDI) ???

Ok.

| Because the operands occupy half of the space now, I start add or sub
| with all operands in the middle of the buffer - so there are 64 bytes
| before and behind the operand. If addition produces an overflow, then
| it is stored at [MSD - 1].

I actually don't copy all operands to buffers, just the result is buffered.



| Further processing should be done by the calculator logic, not by the
| base calculation routines. These routines might be used by "outside"
| functions as well. In case of overflow, the calculator would truncate
| the last digit after rounding, then use [MSD - 1] as new MSD...

Ok.

| >I can't follow the '308' cycles,
| >if I assume 4*32 = 128 bytes to add,
| >then I see 32 iterations, with about 15 clock-periods each,
| >perhaps you assume one operand in L1 cache already?

| I counted all instructions as direct path (my book says they are DP)...

Assuming (operands and buffers) memory already in cache,
all 'mem'-instructions will need 3 clocks, even direct path,
due to dependence, the pipes won't act in parallel.


| Now it is 3 instructions shorter than before! ;)

Well done! but see below also....

| >...wrong Endian" here :)

| "Wrong" for the iNTEL fans, only - I prefer to write operands as they
| are written in real life (I'm a 68k fan - and the missing BCD support
| in the x86 architecture is a pity)!

I'm not an Intel-fan either, (I grew up with Z-80),
but the little Endian form got the advantage of a direct 'power-of'/'offset' relation which may save many bytes.
Perhaps the human way to write numbers is logically wrong?

But either way, the storage-form-decision is on you.


| >[I use higher address = higher value, I think this is the 'tiny tribe' :)
| >if you prefer the other Indian, then the loop direction may be correct]
| >(the negative 0x80 shifts the start-point,
| > but the order depends on INC/- vs. DEC/-loop)

| Not for real... ;)

| My routine starts at [LSD - 4], reading the dword with the LSD as byte
| # 3 (order: 0-1-2-3). No difference if you step up or down in memory.

So you have little Endians in the dwords?
Otherwise the CPU will produce wrong results with 'ADD EAX'

and in terms of next digit(byte/dw) and carry-over ?


| >I modified your example to:
| > push...
| > mov dl, [size_of_OP1]
| > xor ecx,ecx
| > cmp dl, [size_of_OP2]
| > jbe 0 (f)
| > mov dl, [size_of_OP2]
| > mov dh, [size_max_result]
| >
|
| Where this can't be larger than 129 byte (with 128 byte operands)?

Ok. If your figures are all equal-sized you may skip this.



| The original code reads the size of OP1, then compares it against the
| size of OP2. The larger one is taken as loop counter. Because we want
| to perform dword operations, we divide the loop count by 4. If it was
| not a multiple of 4, we increment the division result, 'cause there's
| one more step to do (we shifted this step out before!). So we perform
| the needed amount of additions, only (w/o test within the loop)... ;)

Ok. If you don't have variable destination sizes.


| > 0: ; dl = largest operand (bytes)
| > 1: mov eax,0x80
| > sub edi,eax ; if you actually need this here?
| > sub esi,eax ;
| > sub ebx,eax ;
| >
| > mov eax,ecx ; = clear cy yet ,start with ecx=0
| >1 2: rcr al,1 ;b0 al -> cy
| >3 mov eax,(ebx, ecx*1) ;
| >3 adc eax,(esi, ecx*1)
| >3 mov (edi, ecx*1),eax


| Some remarks...

| 1. My book says direct path for all operations?

YES. Direct-path, but "dependent" memory-access needs three clocks.

| 2. Is this byte-by-byte add (ECX * 1)?

No, 'eax' means a 32-bit operation.
Sorry, the counter-bug is corrected below.

| 3. If so, then it should be AL instead of EAX?

not so :)



| The above code adds each byte 4 times (shifted bytewise - except LSD)
| to the result...

Perhaps syntax interpretation?

it does: [EDI+ECX]dw = [EBX+ECX]dw + [ESI+ECX]dw + CY
next iteration: ECX+4


| >1 rcl al,1 ;cy -> b0 al

inc ecx
replaced by add ecx,4


| >1 cmp cl,dh ;OV-error if > destination bytes
| >1 jg (err) ;
| >1 cmp cl,dl ;end if ecx > needed bytes
| >1 jbe 2 (b)
| >1 rcr al,1 ;check for carry-over

| >4 adc dword (edi, ecx*1),0 ;or adc "byte" may be enough

| >the loop needs 14 (+2 for the additional destination-maximum) clocks,
| >is a bit shorter and uses little Endian format.


| But it isn't kosher (including the "wrong" format)? ;)

Yeah, corrected now;
I previously had, as in your code, (ecx*4) and inc ecx,
but I changed it to [ecx*1] and add ecx,4 to avoid
the SHL,2 and remainder-roundup on top.



| The only "flaw" in my routine is, that I need two counters - one loop
| counter and one counter for calculating the current address.

Perhaps as a price for big Endian?

| But it's better than to use the "wrong" endian - I would accept it for all the
| pre-defined formats (word, dword, qword or floats), but never for any

| number with a larger size than 8 byte - I hate all unnecessary extra


| byte swapping (more than often to do with iNTEL's "nice" formats)...

I prefer and use the little Endian, and I never needed any swap routines.
It's just an 'opposite order thinking',
and as the CPU's ALU uses little Endian....?

For big Endian I would do it like:
push...
mov al, [size_of_OP1]
mov ecx,0x80 ;this is your LSB-offset
cmp al, [size_of_OP2]
jbe 0 (f)
mov al, [size_of_OP2]
0: mov dl,cl
sub dl,al ;dl = start - largest operand (bytes)
mov ah,0 ; = clear cy yet
;start with LSB ecx=0x80
1 2: rcr ah,1 ;b0 ah -> cy
3 mov al,(ebx, ecx*1) ;you cannot use 32-bit-ADD if big Endian!
3 adc al,(esi, ecx*1)
3 mov (edi, ecx*1),al
1 rcl ah,1 ;cy -> b0 ah
1 dec ecx
1 js 3 (f) ;an opposite direction penalty


1 cmp cl,dl ;end if ecx > needed bytes

1 jnb 2 (b)
1 3: rcr ah,1 ;check for carry-over
4 adc byte (edi, ecx*1),0 ;this is either MSB or overflow.
pop..

up to 128 iterations, 15 clocks each.

To keep big Endian on a little Endian CPU seems not to be a wise decision :)
If you choose little Endian dwords within big Endian numerics it may work,
but I can't see any advantage in doing so.

__
wolfgang


arargh...@not.at.enteract.com

Mar 7, 2003, 7:16:46 PM
On Fri, 7 Mar 2003 18:19:28 +0100, "wolfgang kern"
<now...@nevernet.at> wrote:

>
>"Bernhard" wrote:
>
>[exponent: BCD/ 10^n/ 2^n]
>
>| Nice... Either we use BCD for exponent *and* mantissa, or we have the
>| darn rounding for the binary or hexadecimal format...
>
>Not necessarily,
>I use binary 'integers' for mantissa and
>a binary-coded (rather than BCD), but "decimal valuated" exponent.
>So an exponent eg: FFFFh (-1) means 10^-1, instead of the usual 2^-1.
>All numeric figures (within mantissa precision-range)
>can decimal represented w/o additional rounding then.
>While negative 2^n exponents won't produce exact decimal
>fractions and become periodic in many cases.

That method of representing floating point BCD sounds very much like
the method used in IRIS Business Basic (a minicomputer OS), IIRC. If
I can find my reference, I will check.
<snip>

--
Arargh (at arargh dot com) http://www.arargh.com
To reply by email, change the domain name, and remove the garbage.
(Enteract can keep the spam, they are gone anyway)

bv_schornak

Mar 8, 2003, 10:34:59 AM
Wolfgang wrote:

>"Bernhard" wrote:
>

Why in quotes? It *is* my real name... ;)

(My bad manners - in most cases (except Hugo!) I don't change the 1st
line, which is automatically generated by Mozilla, and I forget about
changing the 1st letters to capital ones...)

>[exponent: BCD/ 10^n/ 2^n]
>
>| Nice... Either we use BCD for exponent *and* mantissa, or we have the
>| darn rounding for the binary or hexadecimal format...
>
>Not necessarily,
>I use binary 'integers' for mantissa and
>a binary-coded (rather than BCD), but "decimal valuated" exponent.
>So an exponent eg: FFFFh (-1) means 10^-1, instead of the usual 2^-1.
>All numeric figures (within mantissa precision-range)
>can decimal represented w/o additional rounding then.
>While negative 2^n exponents won't produce exact decimal fractions and become periodic in many cases.
>

Contemplating about BCD vs. HEX, my conclusion was, that there's *no*
difference between integers of any numeric base - a hexadecimal digit
can hold 6 more values, that's all. The only "forbidden" thing is the
floating point - we have to keep it outside of the mantissa!


[Nibbles vs. Bytes]

>| >So you will use nibble- or byte-grouping?
>
>| Nibbles would be a good solution for fast access in my multiplication
>| and division routines. Packed bytes can be accessed in dword steps, a
>| faster way to do calculations. Since unpacking can be done within two
>| clocks, I prefer the packed format for hexadecimals!
>
>"packed hexadecimals"? , lets call them "bytes" ;)
>

Of course - the usual term is *byte* - not "packed nibbles"... ;)

>[multiply:]
>| The "improvement" is to use an additional buffer, holding the current
>| multiple of the larger operand (2...(base-1)). Instead of adding the
>| larger operand again and again for each digit of the smaller operand,
>| we search the smaller operand for matching digits. Whenever a digit's
>| equal to the current multiple, the contents of the additional buffer
>| is added to the result (with the correct "shifting", done by changing
>| the index register). This is repeated, until (base - 1) is done...
>
>Just to avoid confusion, I see the "digits" as bytes now.
>
>I'm not sure if the additional buffer will save too much time,
>due this partial result needs to be calculated apart before,
>but your idea is worth to give it a try.
>

In work - using all "advantages" of the i...LETN hardware... ;)

>| Because a search loop is done much faster than an addition loop, this
>| should speed up multiplication markable. The amount of additions is
>
>| (base - 1) * (amount digits in smaller operand),
>
>Plus the carry-over handling.... [99999999..999 + 2]
>

Would produce 99999999..99B! ;)

An Overflow cannot produce any other result than a 1 in [MSB - 1] per
operation. This *still* should be handled outside of the loop!

>| the amount of needed comparisons is equal to the above - if you do it
>| the "hard way", you need ((base - 1) * (base - 1)) addition loops, if
>| all digits are (base - 1)!
>
>I think there are equal ADD-counts for either '9999 * 5' or '5 * 9999',
>but yes, there is a 50% chance for an early "no-more-carry"-end,
>if the smaller is added to the larger value.
>

This is the idea behind the MUL "improvement(s)"...

- - - - - - - - - - - - - - - - - - - -
Example            1234 * 5    5 * 1234
- - - - - - - - - - - - - - - - - - - -
Digit additions        4           3
Compare loops          1           4
Result additions       1           4
- - - - - - - - - - - - - - - - - - - -
Total                  6          11
- - - - - - - - - - - - - - - - - - - -

BTW - taking the smaller operand as OP2 *only* is done in the SUB and
MUL routines. For SUB I avoid the "turning of the sign" & underflows,
for MUL it speeds up the routine in most cases - except the odd 1 * 1
example.

>In terms of decimal digits or hex nibbles there is a great chance to find several equal digits.
>But if we now talk about bytes or dwords, this chance will decrease drastically.
>

Wait for my code... ;)

The search loop uses byte steps. It needs some tricks to handle both
nibbles (16 * n / 1 * n). Thinking it over, I now need *2* additional
buffers to handle the "packed" format. One buffer holds the original
result of the addition, the other holds the same contents "shifted" 4
bit upwards (faster than multiply by 16). After the buffers are set,
the routine starts to read the LSB. 1st the high nibble is masked out
and the low nibble is checked. If it's equal to the current processed
digit, the 1st buffer is added to the result. Then the high nibble is
checked. The 2nd buffer is added to the result, if it is equal, too.

Now we decrease the index of the result by one - then start the above
process again, until we reach the MSB. One alternative would be, that
we do 256 adds with one buffer. In this case we save the time for the
shifting and the memory for the 2nd buffer. Because we need much more
addition loops for it, it can't be faster than the 1st way...
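
A rough sketch of the per-byte nibble test (assuming ESI points to
the packed operand, ECX is the byte index and BL holds the digit
currently searched for - the adds themselves are left out):

        movb   (%esi,%ecx,1), %al  # packed byte = two hex digits
        movb   %al, %ah
        andb   $0x0F, %al          # AL = low nibble
        shrb   $4, %ah             # AH = high nibble
        cmpb   %bl, %al            # low nibble == current digit?
        jne    1f
        # (add the unshifted buffer to the result here)
1:      cmpb   %bl, %ah            # high nibble == current digit?
        jne    2f
        # (add the 4-bit-shifted buffer to the result here)
2:      decl   %ecx                # next byte towards the MSB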

>I like to use LUTs whenever the speed/memory relation allows to.
>A byte wide muliply-LUT (byte*byte=word) would just cost 128Kb.
>It could be used as for DIV (cmp-scan) as well.
>
>While searching my docs I found the 'F7' MUL (edx:eax=eax*r/m) is a
>vectored (6/9) instruction.
>Unfortunately the faster '0F AF' exists in the signed IMUL form only.
>
>But I could not find any useful instruction in the mmx/3Dnow sections,
>even PMULHUW etc. are fast,
> there will be too much overhead to convert it into any useful format.
>
>So give some days to figure out which of the above mentioned methods
>will be the fastest multiply.
>

I would appreciate it if the tables used as little memory as possible -
my calculator still isn't a stand-alone application. It is only a small
part of a programming system...

>[numeric types]
>| There's only one type in my calculator... ;)
>
>already defined?
>128 byte binary, up to 308 decimal digits precision...?
>

Enough to convert *any* known type into it?

>| >| Meanwhile, I thought a little bit about the format - if I replace
>| >| the unpacked BCDs with packed hexadecimals:
>
>| >Hmm, packed hexadecimal = (byte-organised) binary.
>| >I see "hex" just as another display-format.
>
>| It's better than to handle bits...
>Yes, but it's just a four bit-grouping for easier human view.
> In terms of calculation and storage "hex"="binary".
>

Sounds a little bit like the old "still one half left" / "already one
half consumed" story... ;)

To be honest, I'm used to handling hexadecimal numbers, but I don't use
binary numbers in general (I can read them, of course)... ;)

>| [New routine...]
>|
>| A bug, a bug - correction added!
>|
>| >| IN: EBX = base address OP1
>| >| ESI = OP2
>| >| EDI = RES
>| >
>| >| _addOPS:
>| >| push ecx
>| >| xor eax,eax
>| >| push edx
>| >| mov al, byte[size_of_OP1]
>| >| xor ecx,ecx
>| >| cmp al, byte[size_of_OP2]
>| >| jbe 0 (forwards)
>| >| mov al, byte[size_of_OP2]
>| >| 0:mov cl,al
>| >| mov edx, 0x0000001F
>| >| shr cl, 2 ; ok, ecx = larger operand (DWs)
>| Actually it is DDs (ECX >> 2) == (ECX / 4)
>?? byte-count/4 = dw-count!
>

Do we use different definitions?

Definition: DB = 8 bit = 1 byte -> byte
DW = 16 bit = 2 byte -> word
DD = 32 bit = 4 byte -> double word
DQ = 64 bit = 8 byte -> quad word

Following the above definition:

(8 >> 1) = 8 / 2 -> 4 * DW
(8 >> 2) = 8 / 4 -> 2 * DD
(8 >> 3) = 8 / 8 -> 1 * DQ

(Read ">>" as "SHR" (C style))

>| >| and al, 0x03
>| >| je 1 (forwards) ; round up if remainder
>| >| inc cl
>| >| 1:clc
>| >| 2:mov eax, dword[-0x80(ebx, edx, 4)] # bad construct
>| >| addc eax, dword[-0x80(esi, edx, 4)] # maybe I should
>| >| mov dword[-0x80(edi, edx, 4)], eax # add some NOPs
>| >| ; depends on code alignment too
>| >| ; check clock count..
>| LSD always ends at a dword boundary!
>I meant code-alignment, so label"2" should 'sit' on a 16-byte boundary.
>

I've got the clue as I was reading my (already sent) posting again -
one of my problems is that I don't know how *long* each instruction
is (I should have purchased the Addison-Wesley assembler book instead
of the crap I bought)...

>| >| dec edx
>----[LEA]-----not used yet, but:
>If your compiler can't do 'LEA' you should look for an upgrade..
>inline code with addressing parts needs exact code knowledge...
>LEA edx,(edx-4) would be coded as '8D 52 FC'
>

No, no - GAS knows LEA, of course. I just don't know how to translate
the Intel form into AT&T syntax...
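
For the record, a one-line sketch of the AT&T form:

        leal   -4(%edx), %edx    # Intel: lea edx, [edx - 4] ; no flags altered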

>| >| dec ecx
>| >| jne 2 (backwards)
>| >| addc eax, 0
>| >| mov dword[-0x80(edi, edx, 4)], eax
>| > if this should be the MSB carry over, edx still valid?
>
>| With EDX = 0xFFFFFFFF -> EDX * 4 = 0xFFFFFFFC
>| -> -0x80(EDI, 0xFFFFFFFC) -> -0x84(EDI) ???
>Ok.
>| Because the operands occupy half of the space now, I start add or sub
>| with all operands in the middle of the buffer - so there are 64 bytes
>| before and behind the operand. If addition produces an overflow, then
>| it is stored at [MSD - 1].
>
>I actually don't copy all operands to buffers, just the result is buffered.
>

Yep. The routines can handle operands from everywhere. But they must
be in the same format as the buffer is (64 bytes leading and trailing
zeroes and 128 leading zeroes for divide).

>| >I can't follow the '308' cycles,
>| >if I assume 4*32 = 128 bytes to add,
>| >then I see 32 iterations, with about 15 clock-periods each,
>| >perhaps you assume one operand in L1 cache already?
>
>| I counted all instructions as direct path (my book says they are DP)...
>
>Assuming (operands and buffers) memory already in cache,
> all 'mem'-instructions will need 3 clocks, even direct path,
>due to dependence, the pipes won't act in parallel.
>

Maybe I should invent a CPU next, then I would understand all of this
stuff much better... ;)

>| >...wrong Endian" here :)
>
>| "Wrong" for the iNTEL fans, only - I prefer to write operands as they
>| are written in real life (I'm a 68k fan - and the missing BCD support
>| in the x86 architecture is a pity)!
>
>I'm not an Intel-fan either, (I grew up with Z-80),
>but the little Endian form got the advantage of a direct 'power-of'/'offset' relation which may save many bytes.
>Perhaps the human way to write numbers is logically wrong?
>

Only in the western world - but now we are used to writing it this way...

>But either way, the storage-form-decision is on you.
>

Thanks a lot! ;)

>| >[I use higher address = higher value, I think this is the 'tiny tribe' :)
>| >if you prefer the other Indian, then the loop direction may be correct]
>| >(the negative 0x80 shifts the start-point,
>| > but the order depends on INC/- vs. DEC/-loop)
>
>| Not for real... ;)
>
>| My routine starts at [LSD - 4], reading the dword with the LSD as byte
>| # 3 (order: 0-1-2-3). No difference if you step up or down in memory.
>
>So you have little Endians in the dwords?
>Otherwise the CPU will produce wrong results with 'ADD EAX'
>
>and in terms of next digit(byte/dw) and carry-over ?
>

That's a good argument.

F*** the i...LETN¹ machines... :(

Just add two BSWAPs to each routine...

Read as dwords... ;)
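
Roughly like this per dword (a sketch only, assuming the digits are
stored MSB-first in memory):

        movl   (%ebx,%edx,1), %eax   # dword as stored, digits MSB-first
        bswap  %eax                  # -> little-endian order for the ALU
        # (adc against the likewise swapped OP2 dword goes here)
        bswap  %eax                  # swap back to the storage order
        movl   %eax, (%edi,%edx,1)   # store the result dword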

>| >I modified your example to:
>| > push...
>| > mov dl, [size_of_OP1]
>| > xor ecx,ecx
>| > cmp dl, [size_of_OP2]
>| > jbe 0 (f)
>| > mov dl, [size_of_OP2]
>| > mov dh, [size_max_result]
>| >
>|
>| Where this can't be larger than 129 byte (with 128 byte operands)?
>Ok. If your figures all equal sized you may skip this.
>
>| The original code reads the size of OP1, then compares it against the
>| size of OP2. The larger one is taken as loop counter. Because we want
>| to perform dword operations, we divide the loop count by 4. If it was
>| not a multiple of 4, we increment the division result, 'cause there's
>| one more step to do (we shifted this step out before!). So we perform
>| the needed amount of additions, only (w/o test within the loop)... ;)
>
>Ok. If you don't have variable destination sizes.
>

Maybe we should start to write an operating system for our new (to be
developed) CPU, first? ;)

As mentioned in my 1st posting, my math formats are stored as dynamic
strings...

>| > 0: ; dl = largest operand (bytes)
>| > 1: mov eax,0x80
>| > sub edi,eax ; if you actually need this here?
>| > sub esi,eax ;
>| > sub ebx,eax ;
>| >
>| > mov eax,ecx ; = clear cy yet ,start with ecx=0
>| >1 2: rcr al,1 ;b0 al -> cy
>| >3 mov eax,(ebx, ecx*1) ;
>| >3 adc eax,(esi, ecx*1)
>| >3 mov (edi, ecx*1),eax
>
>| Some remarks...
>
>| 1. My book says direct path for all operations?
>YES. Direct-path, but "dependent" memory-access needs three clocks.
>| 2. Is this byte-by-byte add (ECX * 1)?
>No, 'eax' means a 32-bit operation.
>Sorry, the counter-bug is corrected below.
>| 3. If so, then it should be AL instead of EAX?
>not so :)
>
>| The above code adds each byte 4 times (shifted bytewise - except LSD)
>| to the result...
>Perhaps syntax interpretation?
>

Or translation errors... ;)

>it does: [EDI+ECX]dw = [EBX+ECX]dw + [ESI+ECX]dw + CY
>

How do you interpret (ECX*1)? Our misunderstanding definitely is the
result of the different interpretation of this term...

The corresponding part of the GAS documentation says:

_____________________________________________________________________

An Intel syntax indirect memory reference of the form

section:[base + index*scale + disp]

is translated into the AT&T syntax

section:disp(base, index, scale)

where base and index are the optional 32-bit base and index registers,
disp is the optional displacement, and scale, taking the values 1, 2,
4, and 8, multiplies index to calculate the address of the operand. If
no scale is specified, scale is taken to be 1. section specifies the
optional section register for the memory operand, and may override the
default section register (see a 80386 manual for section register
defaults). Note that section overrides in AT&T syntax /must/ be
preceded by a `%'. If you specify a section override which coincides
with the default section register, |as| does /not/ output any section
register override prefixes to assemble the given instruction. Thus,
section overrides can be specified to emphasize which section register
is used for a given memory operand.

Here are some examples of Intel and AT&T style memory references:

AT&T: `-4(%ebp)', Intel: `[ebp - 4]'
base is `%ebp'; disp is `-4'.section is missing, and the default
section is used (`%ss' for addressing with `%ebp' as the base
register). index, scale are both missing.
AT&T: `foo(,%eax,4)', Intel: `[foo + eax*4]'
index is `%eax' (scaled by a scale 4); disp is `foo'. All other
fields are missing. The section register here defaults to `%ds'.
AT&T: `foo(,1)'; Intel `[foo]'
This uses the value pointed to by `foo' as a memory operand.
Note that base and index are both missing, but there is only
/one/ `,'. This is a syntactic exception.
AT&T: `%gs:foo'; Intel `gs:foo'
This selects the contents of the variable `foo' with section
register section being `%gs'.

Absolute (as opposed to PC relative) call and jump operands must be
prefixed with `*'. If no `*' is specified, |as| always chooses PC
relative addressing for jump/call labels.

Any instruction that has a memory operand /must/ specify its size
(byte, word, or long) with an opcode suffix (`b', `w', or `l',
respectively).
_____________________________________________________________________


I often use this kind of addressing, and it works the way I mentioned
it somewhere above... ;)

> next iteration: ECX+4
>| >1 rcl al,1 ;cy -> b0 al
>inc ecx
>replaced by add ecx,4
>| >1 cmp cl,dh ;OV-error if > destination bytes
>| >1 jg (err) ;
>| >1 cmp cl,dl ;end if ecx > needed bytes
>| >1 jbe 2 (b)
>| >1 rcr al,1 ;check for carry-over
>| >4 adc dword (edi, ecx*1),0 ;or adc "byte" may be enough
>
>| >the loop needs 14 (+2 for the additional destination-maximum) clocks,
>| >is a bit shorter and uses little Endian format.
>
>
>| But it isn't kosher (including the "wrong" format)? ;)
>Yeah, corrected now;
>I previously had, as in your code, (ecx*4) and inc ecx,
>but I changed it to [ecx*1] and add ecx,4 to avoid
>the SHL,2 and remainder-roundup on top.
>

ECX starts at zero ... so with ECX + 4 the routine runs forever? Or do
we just store the amount of digits and compare ECX (initially zero)
against it?

Ok, if we use big endian (if!), and I recode the routine core to:

EBX, ESI and EDI = base address buffer

_addOPS:
push ecx
xor ecx,ecx
clc
3 0:mov eax, dword[ebx + ecx * 4 + 0]
3 addc eax, dword[esi + ecx * 4 + 0]
3 mov dword[edi + ecx * 4 + 0], eax
1 inc ecx
1 cmp ecx, 0x0000001F
1 jbe 0 (backwards)
xor eax,eax
adc al,0
mov byte[edi + ecx * 4 + 0], al
pop ecx
ret

Then we have 32 loops à 13 clocks (1920 : 416) (BTW - the shortest of
all addition loops in this thread until now)...
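
Spelled out in GAS/AT&T syntax, the same core could look roughly like
this (only a sketch, assuming 32 little-endian dwords per operand and
EBX/ESI/EDI set up as above; the final carry handling is simplified):

_addOPS:
        pushl  %ecx
        pushl  %edx
        movl   $32, %ecx             # 32 dwords = 128 byte per operand
        xorl   %edx, %edx            # EDX = running byte offset
        clc                          # no carry into the first dword
0:      movl   (%ebx,%edx,1), %eax   # OP1 dword
        adcl   (%esi,%edx,1), %eax   # + OP2 dword + carry
        movl   %eax, (%edi,%edx,1)   # -> RES dword
        leal   4(%edx), %edx         # next dword; LEA leaves the carry alone
        decl   %ecx                  # DEC alters all flags except CF
        jnz    0b
        movl   $0, %eax              # MOV doesn't touch the flags either
        adcl   $0, %eax              # EAX = final carry (0 or 1)
        movl   %eax, (%edi,%edx,1)   # overflow dword above the MSD
        popl   %edx
        popl   %ecx
        ret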

BTW - I still don't get the clue of the rotate instructions. What the
heck are they good for?

- - -

I see, that it would be the best, but I *hate* this solution! Damned
i...LETN hardware. Would be no big deal to add *one* move instruction
to load data in the right order (BSWAP is just an excuse)... :(

The only big problem: I wanted to build a 40 tons truck, and now I am
the proud owner of a nice and tiny racing powerboat (which *only* can
drive backwards - we need another vehicle to pull it forwards)...

Now I have to write another announcement: "BCD-Calc died. It couldn't
survive in the hostile i...LETN environment!"... ;)

Coming up soon: The one and only "Bin Lobster" calculator!

wolfgang kern

unread,
Mar 8, 2003, 7:53:12 PM3/8/03
to

Bernhard wrote:

| Why in quotes? It *is* my real name... ;)

Sorry, I'm just too familiar with the quoted names... better yet?

| (My bad manners - in most cases (except Hugo!) I don't change the 1st
| line, which is automatically generated by Mozilla, and I forget about
| changing the 1st letters to capital ones...)

My OE adds too much crazy stuff to the headline, so I replace it altogether.

[exponent: BCD/ 10^n/ 2^n]

| >| Nice... Either we use BCD for exponent *and* mantissa, or we have the
| >| darn rounding for the binary or hexadecimal format...

| >Not necessarily,
| >I use binary 'integers' for mantissa and
| >a binary-coded (rather than BCD), but "decimal valuated" exponent.
| >So an exponent eg: FFFFh (-1) means 10^-1, instead of the usual 2^-1.
| >All numeric figures (within mantissa precision-range)
| >can decimal represented w/o additional rounding then.
| >While negative 2^n exponents won't produce exact decimal fractions
| >and become periodic in many cases.


| Contemplating about BCD vs. HEX, my conclusion was, that there's *no*
| difference between integers of any numeric base - a hexadecimal digit
| can hold 6 more values, that's all. The only "forbidden" thing is the
| floating point - we have to keep it outside of the mantissa!

Yes integers are integers in all formats, but temporary(divide) results may contain fractions and we need to handle them as well.

| [Nibbles vs. Bytes]

| >| Nibbles would be a good solution for fast access in my multiplication
| >| and division routines. Packed bytes can be accessed in dword steps, a
| >| faster way to do calculations. Since unpacking can be done within two
| >| clocks, I prefer the packed format for hexadecimals!

| >"packed hexadecimals"? , lets call them "bytes" ;)

| Of course - the usual term is *byte* - not "packed nibbles"... ;)

| >[multiply:]
| >| The "improvement" is to use an additional buffer, holding the current
| >| multiple of the larger operand (2...(base-1)). Instead of adding the
| >| larger operand again and again for each digit of the smaller operand,
| >| we search the smaller operand for matching digits. Whenever a digit's
| >| equal to the current multiple, the contents of the additional buffer
| >| is added to the result (with the correct "shifting", done by changing
| >| the index register). This is repeated, until (base - 1) is done...

| >Just to avoid confusion, I see the "digits" as bytes now.

| >I'm not sure if the additional buffer will save too much time,
| >due this partial result needs to be calculated apart before,
| >but your idea is worth to give it a try.

| In work - using all "advantages" of the i...LETN hardware... ;)

:)


| >| Because a search loop is done much faster than an addition loop, this
| >| should speed up multiplication markable. The amount of additions is

| >| (base - 1) * (amount digits in smaller operand),

| >Plus the carry-over handling.... [99999999..999 + 2]

| Would produce 99999999..99B! ;)
Ok, my example is decimal, try again with the hexadecimal equivalent :)

| An Overflow cannot produce any other result than a 1 in [MSB - 1] per
| operation. This *still* should be handled outside of the loop!

!! I'm not talking about overflow here.
If we add the first LSB [or group] (let's use the decimal example above),
the smaller operand is just 1 digit, but you can't stop adding the "carrys"
(not sure if this plural is correct) after 1 digit is done,
you need to ADC,0 in a loop until you reach a no-more-carry or overflow.
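
Just to make that point concrete, here is a rough C sketch of the carry
ripple - binary bytes instead of BCD digits, and the buffer layout and
names are only my assumptions for illustration, not the actual routine:

#include <stdint.h>
#include <stddef.h>

/* Add a single byte value `v` at position `pos` of a little endian byte
   string and ripple the carry upward - the software equivalent of
   repeating ADC ...,0 until there is no more carry (or the buffer ends). */
static int add_byte(uint8_t *num, size_t len, size_t pos, unsigned v)
{
    unsigned carry = v;
    for (size_t i = pos; i < len && carry; i++) {
        unsigned t = num[i] + carry;   /* at most 255 + 255 = 510          */
        num[i] = (uint8_t)t;
        carry  = t >> 8;               /* 0 or 1, feeds the next byte      */
    }
    return carry != 0;                 /* nonzero = carry ran off the top  */
}

For an operand that is all 0xFF plus an added 2 the ripple really does run
over every byte - the binary counterpart of the 99999999..999 + 2 example.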

All you need is SHL,4 for odd nibbles and AND,0F for even.

| Thinking it over, I now need *2* additional
| buffers to handle the "packed" format. One buffer holds the original
| result of the addition, the other holds the same contents "shifted" 4
| bit upwards (faster than multiply by 16). After the buffers are set,
| the routine starts to read the LSB. 1st the high nibble is masked out
| and the low nibble is checked. If it's equal to the current processed
| digit, the 1st buffer is added to the result. Then the high nibble is
| checked. The 2nd buffer is added to the result, if it is equal, too.

| Now we decrease the index of the result by one - then start the above
| process again, until we reach the MSB. One alternative would be, that
| we do 256 adds with one buffer. In this case we save the time for the
| shifting and the memory for the 2nd buffer. Because we need much more
| addition loops for it, it can't be faster than the 1st way...

But wouldn't that need some 'tag-buffer' to know which nibbles are already done?

---
I checked table-mul vs. F6/F7 MUL:

* The 'byte*byte' table solution will only be faster if these 128 Kb are
locked in cache, but this is already all cache I got;
and it must do a byte by byte MUL and word-ADDs.
A nibble oriented table with only 256 bytes and all the needed
byte-split/combine stuff won't be any faster because the iterations double.
* My current MUL-routine is a 'byte by byte-MUL/ADD' and uses
the 'F6 xx xx' MUL (AX=AL*[ptr + idx]).

So I tried the worst case MUL (128 * 128 bytes; all bits set),
and even though I don't support variables larger than 32 bytes, the
256 byte result was correct: = "FFFF......FFFE0000......01" (surprise,
never checked it before).

That was actually 16384 iterations and it needed,
including everything (setup buffers, size count, overflow-check,...),
less than 400000 clocks (about 25 clocks/iteration).
After copying some commonly used code-parts, I could remove one 'nearCALL'
within the loop, which reduced the loop by 5 clock-cycles.
Some more code-aligning may save one or two cycles in addition.

Now we got just about 20 clocks/iteration,
and if you decide to use the 'F7 xx xx' MUL (EDX:EAX=EAX*[ptr+idx]),
which needs one additional clock,
the iterations shrink from 16384 to moderate 1024,
add another few clocks for the ADC [ptr+idx],EDX:EAX
and a worst case multiply with 256 bytes result
will be done within max. 30000 clock-cycles.

Take this value to compare against..
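
For reference, the same limb-by-limb MUL/ADD idea looks roughly like this
in C - 32-bit limbs, little endian, with the widening multiply standing in
for F7 MUL's EDX:EAX result. Names and buffer layout are my assumptions,
not the actual routine:

#include <stdint.h>
#include <stddef.h>

/* Schoolbook multiply of two little endian arrays of 32-bit limbs.
   res must have room for na + nb limbs and be zeroed by the caller. */
static void mul_limbs(const uint32_t *a, size_t na,
                      const uint32_t *b, size_t nb, uint32_t *res)
{
    for (size_t i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            /* 32x32 -> 64 bit multiply, like EDX:EAX = EAX * [ptr+idx] */
            uint64_t t = (uint64_t)a[i] * b[j] + res[i + j] + carry;
            res[i + j] = (uint32_t)t;
            carry      = t >> 32;
        }
        res[i + nb] = (uint32_t)carry;    /* carry out of this row */
    }
}

With 32-bit limbs, two 128-byte operands are 32 x 32 = 1024 inner
iterations, which matches the iteration count given above for the
'F7 xx xx' variant.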

| >[numeric types]
| >| There's only one type in my calculator... ;)

| >already defined?
| >128 byte binary, up to 308 decimal digits precision...?

| Enough to convert *any* known type into it?

Sure, but a waste if mainly smaller are needed.

| >| >Hmm, packed hexadecimal = (byte-organised) binary.
| >| >I see "hex" just as another display-format.

| >| It's better than to handle bits...
| >Yes, but it's just a four bit-grouping for easier human view.
| > In terms of calculation and storage "hex"="binary".

| Sounds a little bit like the old "still one half left" / "already one
| half consumed" story... ;)

:)


| To be honest, I'm used to handle hexadecimal numbers, but I don't use
| binary numbers in general (I can read them, of course)... ;)

You won't find much binary notation in my notes either.

The difference in 'binary numbers' and 'binary values'
seems a bit confusing?

Just to align our terms:
'binary numbers' and 'hexadecimal numbers' differ in the representation only,
but are both equal (2^n based) "binary values".

[code]

[DD vs. dw]


| Do we use different definitions?

Obviously YES!

| Definition: DB = 8 bit = 1 byte -> byte
| DW = 16 bit = 2 byte -> word
| DD = 32 bit = 4 byte -> double word
| DQ = 64 bit = 8 byte -> quad word

Sorry I forgot you use SYNTAX-definitions,
I'm more used to not saying 'DATA' in front of a size-attribute,
but I use lower-case characters only:
b = byte
w = word
q = dw = double word (quad)
dq = qw = quad word
qq = 16 byte (the cuckoo)
dqq = 32 byte (my numeric limit)


| Following the above definition:
|
| (8 >> 1) = 8 / 2 -> 4 * DW
| (8 >> 2) = 8 / 4 -> 2 * DD
| (8 >> 3) = 8 / 8 -> 1 * DQ
|
| (Read ">>" as "SHR" (C style))

Ok, I'll keep that in mind.

[code]

| >I meant code-alignment, so label"2" should 'sit' on a 16-byte boundary.

| I've got the clue, as I was reading my (already sent) posting again -
| one of my problems is, that I don't know, how *long* each instruction
| is (I should have purchased the Addison-Wesley assembler book instead
| of the crap I bought)...

Intel's volume 2 contains it all (most is valid for AMD too, anyway)

[LEA]


| No, no - GAS knows LEA, of course. I just don't know how to translate
| the Intel form into AT&T syntax...

Perhaps same syntax as for memory-addressed operands?

| >I actually don't copy all operands to buffers, just the result is buffered.

| Yep. The routines can handle operands from everywhere. But they must
| be in the same format as the buffer is (64 bytes leading and trailing
| zeroes and 128 leading zeroes for divide).

Wouldn't this waste a lot of memory?
Have a look to my homepage, topic "standards" for my numerics definition.
http://web.utanet.at/schw1285/KESYS/index.htm

[clock counts..]

| Maybe I should invent a CPU next, then I would understand all of this
| stuff much better... ;)

...just a few tables on AMD-CD:
"AMD AthlonTM processor x86 Code Optimization"
Appendix F: "instruction dispatch, execution resources and timing"

[wrong Endian]
[code..]

| Maybe we should start to write an operating system for our new (to be
| developed) CPU, first? ;)

"KESYS" exists since 1997, the alt.os-CPU concept was worked out 1999... :)

[..]


| How do you interprete (ECX*1)? Our misunderstanding definitely is the
| result of the different interpretation of this term...

I see [EDI+ECX*1] as: EDI = offset-pointer; '+ECX*1' or '+ECX' = index scaled (by 1),
since all two-register addresses (in 32-bit addressing) have an index scale field.

| The corresponding part of the GAS documentation says:
| _____________________________________________________________________
|
| An Intel syntax indirect memory reference of the form
|
| section:[base + index*scale + disp]
|
| is translated into the AT&T syntax

| section:disp(base, index, scale)

| where base and index are the optional 32-bit base and index registers,
| disp is the optional displacement, and scale, taking the values 1, 2,
| 4, and 8, multiplies index to calculate the address of the operand. If
| no scale is specified, scale is taken to be 1. section specifies the
| optional section register for the memory operand, and may override the
| default section register (see a 80386 manual for section register
| defaults). Note that section overrides in AT&T syntax /must/ be
| preceded by a `%'. If you specify a section override which coincides
| with the default section register, |as| does /not/ output any section
| register override prefixes to assemble the given instruction. Thus,
| section overrides can be specified to emphasize which section register
| is used for a given memory operand.

Yes, we are talking about the same thing;
Intel calls the 'sections' segments.

| Here are some examples of Intel and AT&T style memory references:
|
| AT&T: `-4(%ebp)', Intel: `[ebp - 4]'
| base is `%ebp'; disp is `-4'. section is missing, and the default
| section is used (`%ss' for addressing with `%ebp' as the base
| register). index, scale are both missing.
| AT&T: `foo(,%eax,4)', Intel: `[foo + eax*4]'
| index is `%eax' (scaled by a scale 4); disp is `foo'. All other
| fields are missing. The section register here defaults to `%ds'.
| AT&T: `foo(,1)'; Intel `[foo]'
| This uses the value pointed to by `foo' as a memory operand.
| Note that base and index are both missing, but there is only
| /one/ `,'. This is a syntactic exception.
| AT&T: `%gs:foo'; Intel `gs:foo'
| This selects the contents of the variable `foo' with section
| register section being `%gs'.
|
| Absolute (as opposed to PC relative) call and jump operands must be
| prefixed with `*'. If no `*' is specified, |as| always chooses PC
| relative addressing for jump/call labels.

In Intel terms, the Program-Counter is called IP/EIP (instruction pointer).

| Any instruction that has a memory operand /must/ specify its size
| (byte, word, or long) with an opcode suffix (`b', `w', or `l',
| respectively).

And that's why I use a (code-side editable) disassembler instead of any
"illogical crazy syntax" asm/compiler tools.
But thanks for the info, it will help to avoid misinterpretation.
| _____________________________________________________________________

| I often use this kind of addressing, and it works the way I mentioned
| it somewhere above... ;)

[code]

| ECX goes zero ... so ECX + 4 -> routine works forever? Or just store
| the amount of digits and compare ECX (initial zero) against it?

No,
yes, the 'cmp cl,dl' followed by the conditional branch ends the loop.

[...]


| Ok, if we use big endian (if!), and I recode the routine core to:

I think you mean the "little one" here.

| EBX, ESI and EDI = base address buffer
|
| _addOPS:
| push ecx
| xor ecx,ecx

| clc ;xor clears cy anyway
0: ;initial/restore cy
| 3 mov eax, dword[ebx + ecx * 4 + 0] ;displacements are optional,
| 3 adc eax, dword[esi + ecx * 4 + 0] ;not needed if 0


| 3 mov dword[edi + ecx * 4 + 0], eax

;save carry
| 1 inc ecx
| 1 cmp ecx, 0x0000001F ;this will destroy cy
| 1 jbe 0 (backwards) ;BTW: cmp cl,0x1F is 3 bytes shorter
;restore carry
| xor eax,eax ;this will clear cy
;movzx eax,al won't (0F B6 C0)
| adc al,0 ;but how about just:
| mov byte[edi + ecx * 4 + 0], al ;adc byte[edi+ecx*4],0
| pop ecx ;
| ret

| Then we have 32 loops à 13 clocks (1920 : 416) (BTW - the shortest of
| all addition loops in this thread until now)...

Yes, an "early out" feature may save on iterations, but enlarges the loop.
But add two clocks for carry save/restore....

| BTW - I still don't get the clue of the rotate instructions. What the
| heck are they good for?

Use it to save/restore the CARRY-bit in bit 0 of any unused (8-bit) register,
because any 'compare' will alter the carry.
PUSHF/POPF would also do it, but will cost much more time (vectored 4/15).
| - - -

[Intels Endianess..]


| I see, that it would be the best, but I *hate* this solution! Damned
| i...LETN hardware. Would be no big deal to add *one* move instruction
| to load data in the right order (BSWAP is just an excuse)... :(

Only a point of view; I see Motorola's/etc. big Endians as the wrong order.
Everything is upside-down there. :)

| The only big problem: I wanted to build a 40 tons truck, and now I am
| the proud owner of a nice and tiny racing powerboat (which *only* can
| drive backwards - we need another vehicle to pull it forwards)...

| Now I have to write another announcement: "BCD-Calc died. It couldn't
| survive in the hostile i...LETN environment!"... ;)

It suffers from the 18 digit limit and the
"compatible down to the museum" philosophy.

| Coming up soon: The one and only "Bin Lobster" calculator!

And it's a product from "nowhere" :)

| Have a nice weekend
same to you,

__
wolfgang


Beth

unread,
Mar 10, 2003, 10:24:04 AM3/10/03
to
Bernhard wrote:
> A little bit like real life - the people who invented all these formulas
> had lots of numbers first, then they saw a "law" for the relations
> between some of these numbers, finally they developed a formula which
> is valid for every number... :)

Debatable; Don't you know? "Real" mathematicians don't use
numbers...hehehe :)

No, seriously...this would have been the way some of the first
mathematical relationships were found...but, these days, it's all
determined algebraically from all the already established
formulae...it's easier and you can be _assured_ that you're 100%
correct (as long as you haven't slipped up in your calculations and
the original formulae you were using are also correct to begin with
:)...whereas, if you work from sets of numbers directly...well, then
that's _statistical_ and there's always the possibility with
statistics that you'll discover something that isn't there (by pure
coincidence, a "pattern" emerged in random data...highly unlikely but
perfectly possible :)...or you'll discover a relationship but have
left out an all important factor from your equations (the "sample" you
took - again, by pure coincidence - happened to be in a range where
this other "factor" was at a minimum and it simply couldn't be
detected amongst all the natural statistical error in experimental
measurement...this is certainly possible as it was Einstein being able
to "predict" what people had just thought was "statistical error" in
Newton's equations that proved to the world that Einstein was onto
something with all this "curved spacetime" nonsense ;)...

The image of a scientist heating chemicals over a bunsen burner is
persuasive...but there's a whole breed of scientist - pure
mathematicians - that never go near any ugly "statistics" or use
numbers...to a pure mathematician, numbers are, quite frankly,
"unclean" and a bit "vulgar"...hehehe :)

Beth :)


Beth

unread,
Mar 10, 2003, 11:23:05 AM3/10/03
to
Bernhard wrote:
> Ed Beroset wrote:
> >>(BTW - Ever noticed, that every x! with (x >= 5) is a multiple of 5?)
> >
> >Every x! is a multiple of every integer n such that (n <= x), so the
> >observation is more general than just for the number 5.
> >I'm sure that if you ponder the definition of ! for a moment, you'll
> >see why this must necessarily be so.
>
> You're right. It's logical - whenever I multiply n with a number m, then
> every following multiplication must be a multiple of m.

And, for exactly the same reasons, it'll also be a multiple of n
too...isn't that useful? :)

> Regardless of
> the amount of following multiplications, it's true for each of them. The
> quoted rule came to my mind, as I did some of the calculations in my
> head. If I do calculations, I often see relations between all the
> numbers. But most times I don't recognize, that the rules behind these
> relations could be important to understand some mathematical "laws"...

*ahem* Now you see why I say in the other post that "real"
mathematicians never use numbers...and what I meant about "another
factor being present that you failed to detect"...if you were thinking
"algebraically" and played around with your formulae then it would
have been "obvious"...probably...maybe...hehehe :)

Beth :)


laura fairhead

unread,
Mar 10, 2003, 1:05:21 PM3/10/03
to
On Sun, 23 Feb 2003 22:48:57 +1000, "Ben Peddell" <lights...@hotmail.com>
wrote:

>
>Jerry Coffin <jco...@taeus.com> wrote in message
>news:MPG.18c1a4695104100098983a@news...
>> In article <b30ufe$224$03$1...@news.t-online.com>, now...@schornak.de
>> says...
>>
>> [ ... ]
>>
>> > >IOW, they all have 4287514... as a group -- all that changes is which
>of
>> > >those digits comes right after the decimal point.
>>
>> [ ... ]
>>
>> > That's really fascinating
>>
>> Isn't it though? There are a lot more interesting little bits like
>> that. Consider:
>>
>> 1/11 = .0909090909
>> 2/11 = .1818181818
>> 3/11 = .2727272727
>> 4/11 = .3636363636
>>
>> and so on --- each time you increase the numerator, the first number in
>> the repeating pair increases by one and the second decreases by one.
>> Just FWIW, 37 as a denominator acts much the same way.
>
>And they have repeating patterns, regardless of what base you look at them
>in.

Not every base though, e.g. 1/3 = 0.1 (base 3), 1/7 = 0.1 (base 7).

>For example,
>1/3 is
>0.33... in decimal
>0.55... in hexadecimal
>0.2525... in octal
>0.0101... in binary
>
>1/7 is
>0.142857142857... in decimal,
>0.249249... in hexadecimal
>0.11... in octal
>0.001001... in binary
>
>1/11 is
>0.0909... in decimal
>0.1745D1745D... in hexadecimal
>0.05642721350564272135... in octal
>0.00010111010001011101... in binary
>
>>
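
Those expansions come straight out of long division; a small C sketch
(the function name and printing are mine, and it assumes bases up to 16
so one digit fits a single hex character):

#include <stdio.h>

/* Print the first `ndigits` base-`b` digits of 1/n after the radix point,
   by ordinary long division: carry the remainder from step to step.      */
static void expand_reciprocal(unsigned n, unsigned b, unsigned ndigits)
{
    unsigned r = 1;                      /* current remainder of 1/n       */
    printf("1/%u in base %u: 0.", n, b);
    for (unsigned i = 0; i < ndigits && r != 0; i++) {
        r *= b;                          /* shift one digit position       */
        printf("%X", r / n);             /* next digit (bases up to 16)    */
        r %= n;                          /* remainder feeds the next step  */
    }
    printf(r ? "...\n" : "\n");
}

int main(void)
{
    expand_reciprocal(3, 3, 12);         /* terminates: 0.1                */
    expand_reciprocal(7, 16, 12);        /* repeats:    0.249249...        */
    expand_reciprocal(11, 10, 12);       /* repeats:    0.0909...          */
    return 0;
}

1/n terminates exactly when every prime factor of n also divides the base,
which is why 1/3 terminates in base 3 but repeats in base 10.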

bestwishes
laura


--
alt.fan.madonna |news, interviews, discussion, writings
|chat, exchange merchandise, meet fans....
|Get into the groove baby you've got to... check us out!

bv_schornak

unread,
Mar 10, 2003, 3:32:56 PM3/10/03
to
wolfgang kern wrote:

>| Why in quotes? It *is* my real name... ;)
>Sorry, I'm just too familiar with the quoted names....,better yet?
>

It was only a "side note", not to be taken seriously at all... ;)

>| (My bad manners - in most cases (except Hugo!) I don't change the 1st
>| line, which is automatically generated by Mozilla, and I forget about
>| changing the 1st letters to capital ones...)
>
>My OE adds too much crazy stuff to the headline, so I replace it altogether.
>

No perfect newsreaders available out there... :)

>[exponent: BCD/ 10^n/ 2^n]


>
>| Contemplating about BCD vs. HEX, my conclusion was, that there's *no*
>| difference between integers of any numeric base - a hexadecimal digit
>| can hold 6 more values, that's all. The only "forbidden" thing is the
>| floating point - we have to keep it outside of the mantissa!
>
>Yes integers are integers in all formats, but temporary(divide) results may contain fractions and we need to handle them as well.
>

A relict of my "ancient" philosophy ... I *love* the idea of "pseudo"
floating points. Every number is an integer - we only need the buffer
with a sufficient size to hold it... ;)

>| [Nibbles vs. Bytes]


>
>| >| Because a search loop is done much faster than an addition loop, this
>| >| should speed up multiplication markable. The amount of additions is
>
>| >| (base - 1) * (amount digits in smaller operand),
>
>| >Plus the carry-over handling.... [99999999..999 + 2]
>
>| Would produce 99999999..99B! ;)
>
>Ok, my example is decimal, try again with the hexadecimal equivalent :)
>

I did - before I wrote the quoted words...

Oh well... ;)

>| An Overflow cannot produce any other result than a 1 in [MSB - 1] per
>| operation. This *still* should be handled outside of the loop!
>
>!! I'm not talking about overflow here.
>If we add the first LSB [or group] (let's use the decimal example above),
>the smaller operand is just 1 digit, but you can't stop adding the "carrys"
>(not sure if this plural is correct) after 1 digit is done,
>you need to ADC,0 in a loop until you reach a no-more-carry or overflow.
>

Will be handled by the routines as expected - and overflow conditions
are impossible, because my buffer size is _larger_ than 2 * 112 byte.
Overflow is handled by the calculator logic - but there's room for at
least 100 continuous overflows... ;)

"Carry over" is performed with one additional addition / subtraction.


["Improved" multiply]

I think, that you didn't get the point of the improvement - so I will
give a more detailed example:

1st step:

Compare OP1 against OP2 -> Compare returns 2...
OP1 0000001234 OP2 0000056789 RES 0000000000 TMP 0000000000

2nd step:

Exchange operands -> Exchange contents of index registers...
OP1 0000056789 OP2 0000001234 RES 0000000000 TMP 0000000000

3rd step:

Digit loop 1 - addition TMP = TMP + OP1
OP1 0000056789 OP2 0000001234 RES 0000000000 TMP 0000056789

Search loop - look for "1"
- if match, add TMP at current destination index
- increment destination index
- loop until MSB reached
OP1 0000056789 OP2 0000001234 RES 0056789000 TMP 0000056789
Done digit count + 1 (1 match found) -> 1 < 4 -> continue

Digit loop 2 - addition TMP = TMP + OP1
OP1 0000056789 OP2 0000001234 RES 0056789000 TMP 0000113578

Search loop - look for "2"
OP1 0000056789 OP2 0000001234 RES 0068146800 TMP 0000113578
Done digit count + 1 (1 match found) -> 2 < 4 -> continue

Digit loop 3 - addition TMP = TMP + OP1
OP1 0000056789 OP2 0000001234 RES 0068146800 TMP 0000170367

Search loop - look for "3"
OP1 0000056789 OP2 0000001234 RES 0069850470 TMP 0000170367
Done digit count + 1 (1 match found) -> 3 < 4 -> continue

Digit loop 4 - addition TMP = TMP + OP1
OP1 0000056789 OP2 0000001234 RES 0069850470 TMP 0000227156

Search loop - look for "4"
OP1 0000056789 OP2 0000001234 RES 0070077626 TMP 0000227156
Done digit count + 1 (1 match found) -> 4 = 4 -> stop

4th step:

Exchange operands -> Exchange contents of index registers...
OP1 0000001234 OP2 0000056789 RES 0070077626 TMP 0000227156

RESULT: 1 234 * 56 789 = 70 077 626

The operand sizes and formats used are taken from the former BCD calc
for better "readability". For the definitions of the new format, see the bottom!

This "multiplication" needs a maximum of 224 + 15 = 239 additions, 15
"shift 4 upwards" loops and 1680 comparison loops with two full sized
112 byte (224 digit) operands for a result with 224 byte (448 digit).

See above! If the routine is coded, I publish an object file for you,
so you can check it against your routine... ;)

As you may see, this is *not* a real multiplication - *only* additions
and compares are done.
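
For what it's worth, here is a rough C sketch of this search-and-add
scheme, written with one decimal digit per byte and the least significant
digit first (so the array index is the shift). It is only an illustration
of the steps above - the buffer sizes, names and the missing "done digit
count" early-out are my simplifications, not the real routine:

#include <stdint.h>
#include <stddef.h>

/* Add `alen` digits of `add` into `res` at digit offset `off` (base 10),
   rippling the carry; res holds rlen digits, least significant first.   */
static void add_at(uint8_t *res, size_t rlen, size_t off,
                   const uint8_t *add, size_t alen)
{
    unsigned carry = 0;
    for (size_t i = 0; (i < alen || carry) && off + i < rlen; i++) {
        unsigned t = res[off + i] + carry + (i < alen ? add[i] : 0);
        res[off + i] = (uint8_t)(t % 10);
        carry        = t / 10;
    }
}

/* op1 = larger operand (n1 digits, n1 <= 256), op2 = smaller operand
   (n2 digits), res = n1 + n2 digits, zeroed by the caller.             */
static void search_add_mul(const uint8_t *op1, size_t n1,
                           const uint8_t *op2, size_t n2, uint8_t *res)
{
    uint8_t tmp[257] = {0};                /* current multiple d * OP1 (TMP) */
    size_t  tlen = n1 + 1;                 /* one spare digit for the carry  */

    for (unsigned d = 1; d <= 9; d++) {    /* digit loop 1..9                */
        add_at(tmp, sizeof tmp, 0, op1, n1);          /* TMP = TMP + OP1     */
        for (size_t p = 0; p < n2; p++)    /* search loop over OP2           */
            if (op2[p] == d)               /* match: shifted add into RES    */
                add_at(res, n1 + n2, p, tmp, tlen);
    }
}

Feeding it the digits of 56789 (as op1) and 1234 (as op2) reproduces the
intermediate RES/TMP values in the walk-through above and ends with
70 077 626 - it just keeps looping d = 5..9 with no matches instead of
stopping at the "done digit count".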

>[numeric types]


>
>| Enough to convert *any* known type into it?
>
>Sure, but a waste if mainly smaller are needed.
>

Optimization + speed + flexibility are taking their toll...

BTW - at the moment, the reserved memory isn't used at all!

>| To be honest, I'm used to handle hexadecimal numbers, but I don't use
>| binary numbers in general (I can read them, of course)... ;)
>
>You won't find much binary notation in my notes either.
>
>The difference in 'binary numbers' and 'binary values'
>seems a bit confusing?
>
>Just to align our terms:
>'binary numbers' and 'hexadecimal numbers' differ in the representation only,
>but are both equal (2^n based) "binary values".
>

As I learned at school - "A number can hold all values in the defined
"set", while a value is assigned to one number, which represents this
value and nothing else..."

("set theory" = "Mengenlehre", so I assume, that "set" = "Menge"?)

>[code]
>
>[DD vs. dw]
>| Do we use different definitions?
>
>Obviously YES!
>
>| Definition: DB = 8 bit = 1 byte -> byte
>| DW = 16 bit = 2 byte -> word
>| DD = 32 bit = 4 byte -> double word
>| DQ = 64 bit = 8 byte -> quad word
>
>Sorry I forgot you use SYNTAX-definitions,
>

Now I got the clue... ;)

Not better or worse than the D* story - these definitions are used for
most iNTEL assemblers, so I got used to it. Could be confusing, too -
a *double*word is 4 bytes in size...

>I'm more familiar to not say 'DATA' in front of a size-attribut,
>but I use lower-case characters only:
> b = byte
> w = word
> q = dw = double word (quad)
> dq = qw = quad word
> qq = 16 byte (the cuckoo)
> dqq = 32 byte (my numeric limit)
>

I will have it in mind from now on... ;)

"q" is the GAS postfix for quad word (8 byte). ;)

>[code]
>
>| >I meant code-alignment, so label"2" should 'sit' on a 16-byte boundary.
>
>| I've got the clue, as I was reading my (already sent) posting again -
>| one of my problems is, that I don't know, how *long* each instruction
>| is (I should have purchased the Addison-Wesley assembler book instead
>| of the crap I bought)...
>
>Intel's volume 2 contains it all (most is valid for AMD too, anyway)
>

I *have* the PDF manuals, but no time to study them with care...

>[LEA]
>| No, no - GAS knows LEA, of course. I just don't know how to translate
>| the Intel form into AT&T syntax...
>
>Perhaps same syntax as for memory-addressed operands?
>

I will check it...

>| >I actually don't copy all operands to buffers, just the result is buffered.
>
>| Yep. The routines can handle operands from everywhere. But they must
>| be in the same format as the buffer is (64 bytes leading and trailing
>| zeroes and 128 leading zeroes for divide).
>
>Wouldn't this waste a lot of memory?
>Have a look to my homepage, topic "standards" for my numerics definition.
> http://web.utanet.at/schw1285/KESYS/index.htm
>

Confusing! My data base "knows" DB, DW, DD, DQ, fixed length data and
two dynamic string types (16 and 32 bit offsets) - enough for *every*
task. If user defined types have different sizes than DB, DW, DD, DQ,
then you can use the static data format, all other "variable" formats
(e.g. Bin Lobster) should use dynamic strings...

My format definition changed again, see bottom!

>[clock counts..]
>
>| Maybe I should invent a CPU next, then I would understand all of this
>| stuff much better... ;)
>
> ...just a few tables on AMD-CD:
> "AMD AthlonTM processor x86 Code Optimization"
> Appendix F: "instruction dispatch, execution resources and timing"
>

This is available on CD, too? I only know some PDF files, and they're
sleeping well in one of my info folders (I don't like PDF!)...

>[wrong Endian]
>[code..]
>
>| Maybe we should start to write an operating system for our new (to be
>| developed) CPU, first? ;)
>
>"KESYS" exists since 1997, the alt.os-CPU concept was worked out 1999... :)
>

The point was, that I started with the wish to add another routine to
my BCD calculator, but meanwhile we're going to reinvent the wheel...
Not to mourn about it, but it went a different way than I expected
a while ago, when I started this thread... ;)


[ iNTEL vs. AT&T - again... ;) ]

>And that's why I use a (code-side editable) disassembler instead of any
>"illogical crazy syntax" asm/compiler tools.
>

If you are using it long enough (since 1994 in my case), you get used
to it... ;)

>But thanks for the info, it will help to avoid misinterpretation.
>

You're welcome!

I just wondered about our misunderstandings, so I thought it would be
a good idea to add some more info here... ;)

Meanwhile the routine needs exactly the amount of additions which are
necessary. The code is executed within the initialization phase. Why
should it be part of the loop? I use EDX as storage for the amount of
needed additions (including the MSD + carry thing).

I just missed, that CMP changes the carry, too. There is no other way
than your RCL - RCR trick, I guess... ;)

>[Intels Endianess..]
>| I see, that it would be the best, but I *hate* this solution! Damned
>| i...LETN hardware. Would be no big deal to add *one* move instruction
>| to load data in the right order (BSWAP is just an excuse)... :(
>
>Only a point of view; I see Motorola's/etc. big Endians as the wrong order.
>Everything is upside-down there. :)
>

As they say - Austr(al)ia is "down under"... ;)

(Greetings from near the border...)

>| The only big problem: I wanted to build a 40 tons truck, and now I am
>| the proud owner of a nice and tiny racing powerboat (which *only* can
>| drive backwards - we need another vehicle to pull it forwards)...
>
>| Now I have to write another announcement: "BCD-Calc died. It couldn't
>| survive in the hostile i...LETN environment!"... ;)
>
>It suffers from the 18 digit limit and the
>"compatible down to the museum" philosophy.
>

Why 18 digits? Actually it was 256 digits! Please forget that I ever
mentioned something about using the FPU... ;)

As told before, MC68000 (and above?) supports BCD arithmetic with its own
opcodes! This still is a very important issue in the financial world!
I bet, that a 68k BCD calculator could be coded in a more dense *and*
faster way! And it would be a *real* BCD calculator...

>| Coming up soon: The one and only "Bin Lobster" calculator!
>And it's a product from "nowhere" :)
>

Of course - and you will be mentioned in the contributors list! ;)

- - -

Now - some *new* definitions for the "Bin Lobster" calculator...

The format is used now as internal *and* external format - the former
format was packed before storing it and unpacked before calculations.

Operands are now defined as:

byte contents
00 flags
01 amount digits (mantissa)
02-03 exponent (-32768 ... 32767)
04
... mantissa (up to 112 byte)
73
74
... zeroes
FF

Flags are defined as:

bit contents
7 multi-purpose
6 internal error
5 overflow
4 rounded
3 exponent too large
2 mantissa too large
1 sign exponent (redundant, might be redefined)
0 sign mantissa
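
If it helps, the layout and flag bits above could be written down as a C
sketch like this (the field and macro names are mine, and the packed
struct and exponent byte order are assumptions, not part of the format):

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t flags;            /* byte 00: flag bits, see the list above   */
    uint8_t digits;           /* byte 01: amount of mantissa digits       */
    int16_t exponent;         /* bytes 02-03: -32768 ... 32767            */
    uint8_t mantissa[112];    /* bytes 04-73: mantissa                    */
    uint8_t pad[140];         /* bytes 74-FF: zeroes                      */
} bl_operand;                 /* 256 bytes = one calculator buffer slot   */
#pragma pack(pop)

#define BL_SIGN_MANTISSA   0x01   /* bit 0 */
#define BL_SIGN_EXPONENT   0x02   /* bit 1 (redundant, might be redefined) */
#define BL_MANT_TOO_LARGE  0x04   /* bit 2 */
#define BL_EXP_TOO_LARGE   0x08   /* bit 3 */
#define BL_ROUNDED         0x10   /* bit 4 */
#define BL_OVERFLOW        0x20   /* bit 5 */
#define BL_INTERNAL_ERROR  0x40   /* bit 6 */
#define BL_MULTI_PURPOSE   0x80   /* bit 7 */

The first four fields add up to the 116 bytes mentioned below as the
minimum operand size; the padding fills the slot out to a 256 byte buffer.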

Operand memory must have a size of at least 116 bytes - padded with
leading zeroes - if it shall be used in the calculator routines. If
not, the operand must be stored in an internal buffer (prior to the
call of the calculation routines).

Buffers still are part of the system numerics (BNR). The offsets are:

COM = BNR + 0x0C00
OP1 = BNR + 0x0D00
OP2 = BNR + 0x0E00
RES = BNR + 0x0F00
T_0 = BNR + 0x1800 (temporary buffer # 0)
T_1 = BNR + 0x1900 (temporary buffer # 1)
T_2 = BNR + 0x1A00 (temporary buffer # 2)

If we use tables, they're stored in fields of my data base and loaded
into a separate memory area (as long as the calculator is active).

Operands are in reversed order (aka big endian)...

[sigh] ... as Hugo would say... ;)

laura fairhead

unread,
Mar 10, 2003, 3:37:13 PM3/10/03
to
On Mon, 10 Mar 2003 15:24:04 -0000, "Beth" <BethS...@hotmail.NOSPICEDHAM.com>
wrote:

<g>

I always say that in Mathematics there are maybe 3 numbers: 0, 1 and 2;
the rest is symbols and letters, especially Greek ones ;)

>
>Beth :)
>
>

byefornow

bv_schornak

unread,
Mar 10, 2003, 6:23:07 PM3/10/03
to
Beth wrote:

>> You're right. It's logical - whenever I multiply n with a number m,
>
>>then every following multiplication must be a multiple of m.
>>
>
>And, for exactly the same reasons, it'll also be a multiple of n
>too...isn't that useful? :)
>

n * m = m * n ... it's called the commutative rule (or law?)
or something like that, AFAIK.

Useful e.g. in my multiplication routine, because I use the
smaller operands for time consuming tasks, while the larger
ones are added [base - 1] times to themselves, only ... see
reply to Wolfgang from today.

>>Regardless of the amount of following multiplications,
>>it's true for each of them. The quoted rule came to my
>>mind, as I did some of the calculations in my head. If I
>>do calculations, I often see relations between all the numbers.
>>But most times I don't recognize, that the rules behind these
>>relations could be important to understand some mathematical "laws"...
>>
>
>*ahem* Now you see why I say in the other post that "real"
>mathematicians never use numbers...and what I meant about "another
>factor being present that you failed to detect"...if you were thinking
>"algebraically" and played around with your formulae then it would
>have been "obvious"...probably...maybe...hehehe :)
>

See reply to your other posting - same topic, one answer... ;)

bv_schornak

unread,
Mar 10, 2003, 8:56:27 PM3/10/03
to
Beth wrote:

>Bernhard wrote:
>
>
>>A little bit like real life - the people who invented all these formulas
>>had lots of numbers first, then they saw a "law" for the relations
>>between some of these numbers, finally they developed a formula which
>>is valid for every number... :)
>>
>
>Debatable; Don't you know? "Real" mathematicians don't use
>numbers...hehehe :)
>

Maybe you've forgotten (I mentioned it somewhere a very long time ago,
must be three or four months ago) - my profession is "truck driver",
not "math genius"... ;)

(Do I have to set some <nonsense> marks?)

>No, seriously...this would have been the way some of the first
>mathematical relationships were found...but, these days, it's all
>determined algebraically from all the already established
>formulae...
>

Ok - if you prefer to learn formulae by heart instead of "exploring"
the rules by yourself, then it may be the perfect way for you. But I
have some difficulty learning this stuff by heart, because I'm too
bound to practical thinking. I had kind of the same problems at
school - that doesn't mean I can't think about abstract things, but
in the math case my abstraction (better: imagination) is not related
to all these weird letters... ;)

If my truck has to "climb" a mountain, I do not need formulae and my
calculator to compute a matching gear - I need experience, good ears
and the "feeling" for the machine under my feet. No formulae and all
the masses of microprocessors in modern trucks, with the latest
programs written by engineers who know everything about motor management
and math - nothing can replace the "feeling" component.

I've been driving several Scania and Mercedes trucks with electronic
gear shifting, but they never act like a real driver, because the
electronics cannot see the mountain in front. So the automatic shifts
the gear at a moment when it is much too late, and the truck slows
down. So much for calculations, reality and wasted fuel... ;)

>...it's easier and you can be _assured_ that you're 100%
>correct (as long as you haven't slipped up in your calculations and
>the original formulae you were using are also correct to begin with
>:)...whereas, if you work from sets of numbers directly...well, then
>that's _statistical_ and there's always the possibility with
>statistics that you'll discover something that isn't there (by pure
>coincidence, a "pattern" emerged in random data...highly unlikely but
>perfectly possible :)...or you'll discover a relationship but have
>left out an all important factor from your equations (the "sample" you
>took - again, by pure coincidence - happened to be in a range where
>this other "factor" was at a minimum and it simply couldn't be
>detected amongst all the natural statistical error in experimental
>measurement...
>

If it were this way, then we still would not know much more than
1+1=2 - or is even this simple example statistically "wrong"?

Never trust a statistic you haven't falsified yourself... ;)

If you play around with numbers, you probably check the validity for
your found "golden rule" with several other sets of numbers - before
you declare your "rule" as the new "universal law".

In my understanding, a formula isn't much more than a "template", to
be used by inserting real numbers into the variables, to calculate a
result which may be the value of a resistor in a circuit or the size
of a small part in a machine...

If you're bound to only formulae - it is the same as you're fixed to
numbers, only. The latter is sufficient to do basic real world jobs,
e.g. exchanging coins against wares, while the first, done as an end
in itself, isn't much more than wasted time. Only the combination of
both will be a benefit for mankind!

BTW - I never learned "higher math" at school. If I want to learn it
now (it's more than 26 years since I left school), I have to "back up"
all my former knowledge before I'm able to learn some new stuff. It's
a fact that I understand it better with some practical examples.
Sorry if I'm too dumb to learn it by knowing everything by heart, a
thing I never liked too much - because it means not understanding
the basic rules and connections behind a formula. It's like reciting
a poem from memory, but not understanding the words...

>...this is certainly possible as it was Einstein being able
>to "predict" what people had just thought was "statistical error" in
>Newton's equations that proved to the world that Einstein was onto
>something with all this "curved spacetime" nonsense ;)...
>

But - Einstein wasn't a "math genius", either - he was a thinker and
(first and foremost) a human being (as he would have said himself).

>The image of a scientist heating chemicals over a bunsen burner is
>persuasive...but there's a whole breed of scientist - pure
>mathematicians - that never go near any ugly "statistics" or use
>numbers...
>

All the theories about the nature of our universe nowadays are based
on mathematical calculations - the only way to describe things which
are beyond the limitations of our imagination.

I have my doubts, that even one of the theories is telling something
about the real nature of our universe. There are so many unknown and
"random" factors - reality isn't predictable in any way...

>...to a pure mathematician, numbers are, quite frankly,
>"unclean" and a bit "vulgar"...hehehe :)
>

So am I - a perfect match? ;)

wolfgang kern

unread,
Mar 11, 2003, 6:56:42 PM3/11/03
to

Bernhard wrote:

| >[exponent: BCD/ 10^n/ 2^n]
| > ...temporary(divide) results may contain fractions

| A relict of my "ancient" philosophy ... I *love* the idea of "pseudo"
| floating points. Every number is an integer - we only need the buffer
| with a sufficient size to hold it... ;)

We could precede every divide with a 10^x multiply to produce integers.


| "Carry over" is performed with one additional addition / subtraction.

But the "one" may again produce a carry, so it must loop until no more carry.
I had some troubles with this as I started my first attempts....
As you have now decided to renounce the 'early out', this affects just my program.

Yes, this "classical" shift-add
will make sense for BCD (1..9) and hex-nibbles (1..15),
but if I think of bytes(255) or dwords(4G) ?

224 bytes -> 448 digits?
if you work with hex-strings,
 224 bytes may produce 539 decimal digits [2^(224*8)-1],
 112 bytes           269 decimal digits,
  96 bytes           231 decimal digits.

[max.30000 clock-cycles
using 'F7 xx' for MUL of two 128-byte numbers with all bits set]

Let's take this value to compare against..



| See above! If the routine is coded, I publish an object file for you,
| so you can check it against your routine... ;)

| As you may see, this is *not* a real multiplikation, *only* additions
| and compares are done.

Yes, but too many in my opinion yet,
so feel free to change my mind,
I'm looking forward to compare....

Sorry for I cannot interpret object-files,
I would need to convert it by hand,
all I can handle direct is pure code (the only thruth).

BTW: the integer MUL (edx:eax=eax*...) needs just 9 clock-cycles and
covers 8 * 8 nibbles => 16 nibbles including carry-over already.


[numeric types]


| Optimization + speed + flexibility are taking their toll...

Unfortunately.


| BTW - at the moment, the reserved memory isn't used at all!

Wait until the non-destroying DIV joins in the fun...

| >Just to align our terms:
| >'binary numbers' and 'hexadecimal numbers' differ in the representation only,
| >but are both equal (2^n based) "binary values".

| As I learned at school - "A number can hold all values in the defined
| "set", while a value is assigned to one number, which represents this
| value and nothing else..."

Correct,
A 'hex' figure of 5A can be displayed as 'bin' 0101 1010 (5;A),
both mean the identical value of 90 decimal.

All I try to say is:
"hex" is only another, shorter display-form of binary.

You may see 'hex' as 16^n also, but when it comes to larger hex-strings,
this may lead more easily to confusion than the use of the
bit-number to value relation [value = 2^bit#],
which is valid for all hex-strings regardless of their nibble-count.

I would be a bit confused if I needed to calculate the nibble's power:
16^nibble# is also valid and possible,
but I would need to translate it to 2^n to get a clue of the value.

Perhaps powers of two are more convenient due to experience with them,
even though my very first 'CPU' was a four-bit MCU.

| ("set theory" = "Mengenlehre", so I assume, that "set" = "Menge"?)

I think so, the term "set of.." sounds familiar to me.
A hex-nibble is a group of four bits...
so a bit is an element of 'hex' also...

[DD vs. dw]
Now we wrote the first chapter of the 'asm/kesys' syntax translation.

[Intels volume 2]


| I *have* the PDF manuals, but no time to study them with care...

There are just a few tables in the appendix...

[ KESYS-standards/ numerics definition.

| Confusing! My data base "knows" DB, DW, DD, DQ, fixed length data and
| two dynamic string types (16 and 32 bit offsets) - enough for *every*
| task. If user defined types have different sizes than DB, DW, DD, DQ,
| then you can use the static data format, all other "variable" formats
| (e.g. Bin Lobster) should use dynamic strings...

| My format definition changed again, see bottom!

[clock counts..]

| > "AMD AthlonTM processor x86 Code Optimization"
| > Appendix F: "instruction dispatch, execution resources and timing"

| This is available on CD, too? I only know some PDF files, and they're
| sleeping well in one of my info folders (I dont't like PDF!)...

Not sure if the (free) CD is still available,
but anyway, it contains .pdf-files only
and the newest stuff is to be found on the net.

[wrong Endian...]


| The point was, that I started with the wish to add another routine to
| my BCD calculator, but meanwhile we're going to reinvent the wheel...
| Not to mourn about, but it went another (different) way as I expected
| a while ago, as I started this thread... ;)

Even KESYS03 is already out to upgrade previous releases,
I can't stop fiddle on the performance for future versions.

In opposition to the 'large' companies, where it's enough to
have sold somehow working code and release upgrades just to
make money out from previous bugs,
I will always check every part of my code for doing smarter,
faster, shorter... ..seems to be a never ending story.
And BTW: my 256-bit variables are a 'not needed at all' feature
til now, while 128-bit integers are already in use.

another try:

EBX, ESI and EDI = base address buffer (zero-extended fills)

_addOPS:
push ecx
push edx

xor ecx,ecx ;this clears cy also
mov dl,0x20 ;0x1C for 112 bytes
3 0:mov eax, dword [ebx + ecx * 4]
3 adc eax, dword [esi + ecx * 4]
3 mov dword [edi + ecx * 4], eax
1 inc ecx
1 dec dl
1 jne 0 (b)
adc byte[edi + ecx * 4],0 ;NZ indicates overflow then,
pop edx ;(assuming cleared before)
pop ecx
ret

I think this should do it...

And an alternate direct "ADD [EDI],[ESI]" will be shorter by three clocks.

[..."early out" feature]

| Meanwhile the routine needs exactly the amount of additions which are
| necessary. The code is executed within the initialization phase. Why
| should it be part of the loop? I use EDX as storage for the amount of
| needed additions (including the MSD + carry thing).

Yes you don't need,
only I still need to support all odd-sized numeric types w/o
zero-extended buffering.


| I just missed, that CMP changes the carry, too. There is no other way
| than your RCL - RCR trick, I guess... ;)

You can build the loop from opposite side and 'DEC' until zero,
but then a second 'INC' counter is needed (look to the code now).

[..Endianess..]
| >...Everthing is upside-down there. :)

| As they say - Austr(al)ia is "down under"... ;)
| (Greetings from near the border...)

I keep that as a bonus for one future 'pick on Germans'-spell :)

| As told before, MC68000 (and above?) supports BCD arithmetic with its own
| opcodes! This still is a very important issue in the financial world!
| I bet, that a 68k BCD calculator could be coded in a more dense *and*
| faster way! And it would be a *real* BCD calculator...

The Z-80 has a few low/high nibble-addressing opcodes,
but the same Endian as Intel.

| >| Coming up soon: The one and only "Bin Lobster" calculator!
| >And it's a product from "nowhere" :)
| Of course - and you will be mentioned in the contributors list! ;)

That's nice!

You mean for sure the more logical:
higher address = higher value order? :)

| [sigh] ... as Hugo would say... ;)

I never met him.....

Your definition looks good,
even though you may miss 'the eye of the fly' with only two exponent bytes ;)
but 32K-zeros should be enough by far for everything else.

What stands the 'Lobster' for?
Will it have two pairs of scissors?

You didn't mention your run-time environment.
I use FLAT, (un-)PROTECTED32, no paging, all code in the (KESYS) kernel,
no external code will be executed (debugger only).
I have my buffers in the kernel's stack-space, which saves a few cycles.

__
wolfgang

bv_schornak

unread,
Mar 14, 2003, 7:09:12 AM3/14/03
to
wolfgang kern wrote:

>| A relict of my "ancient" philosophy ... I *love* the idea of "pseudo"
>| floating points. Every number is an integer - we only need the buffer
>| with a sufficient size to hold it... ;)
>
>We could precede every divide with a 10^x multiply to produce integers.
>

To be defined, first?

As you are the expert here: What exponent format? AFAIR you suggested
to use the entire hex number as equivalent of the n-th power of ten -
a very "pleasant" definition... ;)

>| "Carry over" is performed with one additional addition / subtraction.
>
>But the "one" may again produce a carry, so it must loop until no more carry.
>I had some troubles with this as I started my first attempts....
>As you have now decided to renounce the 'early out', this affects just my program.
>

Is added, see final code. I just do another addition, as long as the
carry is set...


["Improved" multiply]

>| This "multiplication" needs a maximum of 224 + 15 = 239 additions, 15
>| "shift 4 upwards" loops and 1680 comparison loops with two full sized
>| 112 byte (224 digit) operands for a result with 224 byte (448 digit).
>
>Yes, this "classical" shift-add
>will make sense for BCD (1..9) and hex-nibbles (1..15),
>but if I think of bytes(255) or dwords(4G) ?
>
>224 bytes -> 448 digits?
>if you work with hex-strings,
> 224 bytes may produce 539 decimal digits [2^(224*8)-1],
> 112 bytes           269 decimal digits,
>  96 bytes           231 decimal digits.
>

Surplus digits have to be rounded after the final calculation. Until
the calculation is finished, we have some extra digits to improve the
overall accuracy.

>| As you may see, this is *not* a real multiplikation, *only* additions
>| and compares are done.
>
>Yes, but too many in my opinion yet,
>so feel free to change my mind,
>I'm looking forward to compare....
>
>Sorry for I cannot interpret object-files,
>I would need to convert it by hand,
>all I can handle direct is pure code (the only thruth).
>
>BTW: the integer MUL (edx:eax=eax*...) needs just 9 clock-cycles and
> covers 8 * 8 nibbles => 16 nibbles including carry-over already.
>

I know that it is possible to use the built-in MUL instructions - but
I never checked how they work, so I can't use them... :(

>[numeric types]
>| Optimization + speed + flexibility are taking their toll...
>Unfortunately.
>| BTW - at the moment, the reserved memory isn't used at all!
>Wait until the non-destroying DIV joins in the fun...
>

As long as the Bin Lobster library isn't bound to any of my programs,
the reserved memory is "sleeping"... ;)

>| As I learned at school - "A number can hold all values in the defined
>| "set", while a value is assigned to one number, which represents this
>| value and nothing else..."
>
>Correct,
>A 'hex' figure of 5A can be displayed as 'bin' 0101 1010 (5;A),
>both mean the identical value of 90 decimal.
>
>All I try to say is:
> "hex" is only another, shorter display-form of binary.
>
>You may see 'hex' as 16^n also, but when it comes to larger hex-strings,
>this may lead more easily to confusion than the use of the
> bit-number to value relation [value = 2^bit#],
>which is valid for all hex-strings regardless of their nibble-count.
>
>I would be a bit confused if I needed to calculate the nibble's power:
> 16^nibble# is also valid and possible,
>but I would need to translate it to 2^n to get a clue of the value.
>
>Perhaps powers of two are more convenient due to experience with them,
>even though my very first 'CPU' was a four-bit MCU.
>

Seen this way, even BCDs are expressions of 2^n values (this is fact,
if we use BCDs in an x86 environment). If we could do all the odd bit
shifting very fast, then I would use it. But - my impression is, that
it takes much more time than my current way.

OTOH - it would need 7 buffers with the "shifted" number = 7 loops to
shift the number. After this step we need [size * 8] "bit tests" with
the possible addition, and that's all. I will think about it a little
bit longer - sounds good to me...
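
A rough C sketch of that bit-test variant, just to check the idea
(little endian byte strings; the buffer sizes, names and the 7+1
pre-shifted copies are my assumptions, not a finished routine):

#include <stdint.h>
#include <stddef.h>

/* Binary add of `alen` bytes of `add` into `res` at byte offset `off`. */
static void add_bytes_at(uint8_t *res, size_t rlen, size_t off,
                         const uint8_t *add, size_t alen)
{
    unsigned carry = 0;
    for (size_t i = 0; (i < alen || carry) && off + i < rlen; i++) {
        unsigned t = res[off + i] + carry + (i < alen ? add[i] : 0);
        res[off + i] = (uint8_t)t;
        carry        = t >> 8;
    }
}

/* Shift-and-add multiply: 7 pre-shifted copies of OP1 (plus the original),
   then [size * 8] bit tests over OP2; a set bit k adds the copy shifted by
   k % 8 bits at byte offset k / 8.  Assumes n1 <= 256; res = n1 + n2 bytes,
   zeroed by the caller.                                                    */
static void shift_add_mul(const uint8_t *op1, size_t n1,
                          const uint8_t *op2, size_t n2, uint8_t *res)
{
    uint8_t shifted[8][257] = {{0}};       /* op1 << 0 ... op1 << 7          */
    size_t  slen = n1 + 1;                 /* one spare byte per copy        */

    for (unsigned s = 0; s < 8; s++) {     /* the 7 (+1) shift loops         */
        unsigned carry = 0;
        for (size_t i = 0; i < n1; i++) {
            unsigned t = (op1[i] << s) | carry;
            shifted[s][i] = (uint8_t)t;
            carry         = t >> 8;
        }
        shifted[s][n1] = (uint8_t)carry;
    }

    for (size_t k = 0; k < n2 * 8; k++)    /* size * 8 bit tests             */
        if (op2[k / 8] & (1u << (k % 8)))
            add_bytes_at(res, n1 + n2, k / 8, shifted[k % 8], slen);
}

Whether the 7 shift loops plus size * 8 bit tests actually beat the
digit-search version is exactly the timing question left open above.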

>| ("set theory" = "Mengenlehre", so I assume, that "set" = "Menge"?)
>I think so, the term "set of.." sounds familiar to me.
>A hex-nibble is a group of four bits...
>so a bit is an element of 'hex' also...
>

Since a bit is nothing else than the expression for 1, it is included
in *every* number, except zero! Go one step further - every number is
built out of a specific amount of 1's...

>[DD vs. dw]
>Now we wrote the first chapter of the 'asm/kesys' syntax translation.
>

To be continued... ? ;)

>[wrong Endian...]
>| The point was, that I started with the wish to add another routine to
>| my BCD calculator, but meanwhile we're going to reinvent the wheel...
>| Not to mourn about, but it went another (different) way as I expected
>| a while ago, as I started this thread... ;)
>
>Even KESYS03 is already out to upgrade previous releases,
>I can't stop fiddle on the performance for future versions.
>
>In opposition to the 'large' companies, where it's enough to
>have sold somehow working code and release upgrades just to
>make money out from previous bugs,
>I will always check every part of my code for doing smarter,
>faster, shorter... ..seems to be a never ending story.
>And BTW: my 256-bit variables are a 'not needed at all' feature
> til now, while 128-bit integers are already in use.
>

I know! Whenever I finish re-coding of my database engine (meanwhile
about 3500 lines) - I could start again at once, because there are so
many things to improve... ;)

Because I don't sell my software - there is plenty of time to develop
it until it reaches a "good enough" level to publish it. Better to do
some more improvements than to publish "unready" code, as is the
common practice of most software companies today... ;)

>another try:
>
>EBX, ESI and EDI = base address buffer (zero-extended fills)
>
>_addOPS:
> push ecx
> push edx
> xor ecx,ecx ;this clears cy also
> mov dl,0x20 ;0x1C for 112 bytes
> 3 0:mov eax, dword [ebx + ecx * 4]
>3 adc eax, dword [esi + ecx * 4]
>3 mov dword [edi + ecx * 4], eax
>1 inc ecx
>1 dec dl
> 1 jne 0 (b)
> adc byte[edi + ecx * 4],0 ;NZ indicates overflow then,
> pop edx ;(assuming cleared before)
> pop ecx
> ret
>
>I think this should do it...
>

See below...

>And an alternate direct "ADD [EDI],[ESI]" will be shorter by three clocks.
>

Some reasons, why I don't use it:

1. My version of GAS (1994...) does not support this instruction,
at least one general register must be used directly; ...

2. ... I do ADC rather than ADD; ...

3. ... the index registers must be set to the next dword, thus we
need additional code (and cycles) - also the contents of these
registers are changed, so we need to store / restore them!

>[..."early out" feature]
>
>| Meanwhile the routine needs exactly the amount of additions which are
>| neccessary. The code is executed within the initialization phase. Why
>| should it be part of the loop? I use EDX as storage for the amount of
>| needed additions (including the MSD + carry thing).
>
>Yes you don't need,
> only I still need to support all odd-sized numeric types w/o
> zero-extended buffering.
>

Would it be faster to extend them before the calculation?

>[..Endianess..]
>| >...Everthing is upside-down there. :)
>
>| As they say - Austr(al)ia is "down under"... ;)
>| (Greetings from near the border...)
>I keep that as a bonus for one future 'pick on Germans'-spell :)
>

Sorry, but I couldn't hold myself back...

Bonus is granted!

BTW - there's the right of free speech, anyway... ;)

>| As told before, MC68000 (and above?) supports BCD arithmetic with own
>| opcodes! This still is a very important issue in the financial world!
>| I bet, that a 68k BCD calculator could be coded in a more dense *and*
>| faster way! And it would be a *real* BCD calculator...
>
>The Z-80 has a few low/high nibble-addressing opcodes,
> but the same Endian as Intel.
>

Yep. When I started posting on Usenet, somebody told me that
DEC invented this (IMO braindead) format ... especially with huge
numbers like we use 'em, it's a mess...

>| >| Coming up soon: The one and only "Bin Lobster" calculator!
>| >And it's a product from "nowhere" :)
>| Of course - and you will be mentioned in the contributors list! ;)
>That's nice!
>

The least thing I can do!

>| Operands are now defined as:
>|
>| byte contents
>| 00 flags
>| 01 amount digits (mantissa)
>| 02-03 exponent (-32768 ... 32767)
>| 04
>| ... mantissa (up to 112 byte)
>| 73
>| 74
>| ... zeroes
>| FF
>|
>| Flags are defined as:
>|
>| bit contents
>| 7 multi-purpose
>| 6 internal error
>| 5 overflow
>| 4 rounded
>| 3 exponent too large
>| 2 mantissa too large
>| 1 sign exponent (redundant, might be redefined)
>| 0 sign mantissa
>|

>| Buffers still are part of the system numerics (BNR). The offsets are:
>|
>| COM = BNR + 0x0C00
>| OP1 = BNR + 0x0D00
>| OP2 = BNR + 0x0E00
>| RES = BNR + 0x0F00
>| T_0 = BNR + 0x1800 (temporary buffer # 0)
>| T_1 = BNR + 0x1900 (temporary buffer # 1)
>| T_2 = BNR + 0x1A00 (temporary buffer # 2)
>|

>| Operands are in reversed order (aka big endian)...
>
>You mean for sure the more logical:
> higher address = higher value order? :)
>

Higher address = higher value = lots of additional work!

But I will get used to it. Since I've learned to drive backwards for
miles with an 18.75 m truck - it should be possible with tiny toys
like bits and bytes, too... ;)

>| [sigh] ... as Hugo would say... ;)
>I never met him.....
>

You told him you could erase his entire harddrive? ;)

It is the name I gave our "nice" spamming person, because her/his
"real name" changes every hour...

>Your definition looks good,
>even though you may miss 'the eye of the fly' with only two exponent bytes ;)
>but 32K-zeros should be enough by far for everything else.
>

48 bit would be enough? Or we overdo it a little bit, and define
112 bit? Operands would meet a paragraph boundary, if the latter
one would be taken...

>What stands the 'Lobster' for?
>

Actually for itself. In popular terms, a lobster walks backwards
(it has more of a sidewards movement, but that's a matter of the
"right" interpretation). "Bin Lobster" is the backwards working
"Bin"ary calculator... ;)

>Will it have two pairs of scissors?
>

You probably meant "one pair" or "two units", didn't you? ;)

Of course - all four blades sharp as a katana! You know, its job
is to cut bytes into bits...

>You didn't mention your run-time environment.
>I use FLAT,(un-)PROTECTED32, no paging, all code in the(KESYS) kernel,
>no external code will be executed (debugger only).
>I have my buffers in the kernels stack-space, which saves a few cycles.
>

I'm *still* running OS/2 (on a 1 GHz Athlon)... ;)

It's 32 bit FLAT, and applications are running at ring 3. Paging
(= swapping?), task-switching, et cetera _can't_ be changed from
application level.

The ST-system is launched after the frame and main window(s) are
created, but before starting the message dispatch loop. I cannot
choose another starting point, because the window size + position
is stored in the SystemNumerics.

First an 8192 byte area is allocated for the "LoaderTable", where
the data for up to 256 MemHandles is stored. My "Loader" is just
an excuse for a real memory manager... ;)

Its main purpose is allocation of memory (including auto-loading
of files, if they are "data fields" of my database), resizing of
allocated areas and freeing of 'em (including the auto-saving of
altered "data fields"). A "checker" may scan the LoaderTable in
pre-defined intervals, it removes datafields which were not used
for a pre-defined time (switched off by setting the STATIC flag
in the MemHandle of the corresponding datafield).

Now the "Loader" starts its work, next step is the allocation of
12288 byte for the SystemNumerics (including the 7 math buffers)
and the needed amount of pages for the SystemStrings. Depending
on the linked libraries, there may be some more requests to load
system files to memory (little static memory is allocated - most of
my functions allocate memory in dynamic mode and free it as soon
as possible).

After initialization of the ST-system, the message dispatch loop
starts its work as usual and my system is ready to be used...


Now my current work... ;)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

1. New comparison routine:

IN : EBX base address OP1
ESI OP2

OUT: 0 operands are equal
1 OP1 > OP2
2 OP1 < OP2

.globl _chkOPS
_chkOPS:


push ecx
push edx
xor ecx,ecx

xor edx,edx
mov cl, 0x71
mov dl, 0x71
mov eax, _BNR
/*
get size OP1
*/
0:dec cl
je 1 (forwards)
cmp byte[ebx + ecx * 1 + 0x03],0x00 # EBX = base OP1
jne 0 (backwards)
/*
get size OP2
*/
1:dec dl
je 2 (forwards)
cmp byte[esi + edx * 1 + 0x03],0x00 # ESI = base OP2
jne 1 (backwards)
/*
store operand sizes
*/
2:mov byte[ebx + 0x01],cl # save operand sizes
mov byte[esi + 0x01],dl # to temporary storage
xor eax,eax
cmp dl,cl
ja 4 (forwards)
jb 5 (forwards)
/*
sizes are equal
*/
3:mov al,byte[ebx + ecx * 1 + 0x00]
cmp byte[esi + ecx * 1 + 0x00],al
ja 4 (forwards)
jb 5 (forwards)
dec cl
jne 3 (backwards)
/*
set output to 0...2
*/
xor al,al
jmp 6 (forwards)
4:mov al, 0x01
jmp 6 (forwards)
5:mov al, 0x02
6:pop edx
pop ecx
ret

Might be extended to test the entire buffer. With an overflow in
one of the operands the current version sets the operand size to
a false value (0x70)...

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

2. Final addition routine:

IN : EBX base address OP1
ESI OP2
EDI RES

OUT: nothing

.globl _addOPS
_addOPS:
push edx
push ecx
xor edx,edx
/*
get size of larger operand
*/
mov dl, byte[ebx + 0x01]
cmp dl, byte[esi + 0x01]
jae 0 (forwards)
mov dl, byte[esi + 0x01]
0:add dl,0x03 # grow to next dword
shr dl, 0x02 # size / 4
xor ecx,ecx # also clears the carry before the ADC loop
/*
addition loop
*/
1:mov eax,dword[ebx + ecx * 4 + 0x04] # get OP1 dword
adc eax,dword[esi + ecx * 4 + 0x04] # add OP2 dword
mov dword[edi + ecx * 4 + 0x04],eax # store RES
inc cl
dec dl # reached end?
jns 1 (backwards)
/*
"left carry" handling
*/
jb 1 (backwards) # carry still set...
/*
overflow detection
*/
cmp cl, 0x1C # overflow?
jbe 2 (forwards)
bts byte[edi + 0x00], 0x05 # set flag in result
2:pop ecx
xor eax,eax
pop edx
ret
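
For illustration, the same dword-wise add-with-carry loop as a minimal
C sketch (hypothetical names; not the routine above):

#include <stdint.h>

/* Sketch only: add two little-endian arrays of 32-bit limbs,
   return the carry out of the top limb (nonzero = overflow).   */
uint32_t limb_add(uint32_t *res, const uint32_t *op1,
                  const uint32_t *op2, int nlimbs)
{
    uint64_t carry = 0;
    for (int i = 0; i < nlimbs; i++) {
        uint64_t sum = (uint64_t)op1[i] + op2[i] + carry;
        res[i] = (uint32_t)sum;     /* low 32 bits of the limb sum */
        carry  = sum >> 32;         /* carry into the next limb    */
    }
    return (uint32_t)carry;
}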

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This final solution covers all discussed possibilities, I think.
Now the calculator becomes a somewhat _real_ thing. I will check
the multiplication with the shift buffers, isn't the worst idea!

Might be, that it's coded until the next reply, but there's some
other time to waste with my "friend" Hugo (lots of paperwork)...

Ben Peddell

unread,
Mar 14, 2003, 11:39:50 AM3/14/03
to
laura fairhead <LoveMrs...@madonnaweb.com> wrote in message
news:3e6cd373...@NEWS.CIS.DFN.DE...

OK.
Take an integer fraction (a number that is the ratio between two integers, or
a rational number). It could be exactly representable (i.e. have a finite
number of digits) in a certain base. On the other hand, it could have an
infinite number of digits in that base, and have a pattern to those digits.

If you take an irrational number (a number that is not the ratio between two
integers, e.g. pi), then it will have an infinite number of digits and have
no pattern in those digits. If it had a finite number of digits, or had a
pattern in its digits, then it would by definition be a rational number.

One example I did not include was 1/5, which is

0.00110011... in binary
0.01210121... in trinary
0.0303... in base 4
0.1 in base 5
0.11... in base 6
0.12541254... in base 7
0.14631463... in octal
0.1717... in base 9
0.2 in decimal
0.22... in base 11
0.24972497... in base 12
0.27A527A5... in base 13
0.2B2B... in base 14
0.3 in base 15
0.33... in hexadecimal

If you take an integer fraction with a large prime number as its
denominator, it'll have a large number of digits in its pattern, but that
pattern would repeat ad infinitum.
e.g. 1/97 is
0.
0103092783505154639175257731958762886597938144329896907216494845360824742268
04123711340206185567
0103092783505154639175257731958762886597938144329896907216494845360824742268
04123711340206185567
... in decimal. It has 96 digits in its pattern.
It's 0.02A3A0FD5C5F02A3A0FD5C5F... in hexadecimal. Only 12 digits in its
pattern.

Oh, and if you're wondering how I ended up with the decimal number, I
noticed a sub-pattern repeating every 11 digits, and it would multiply by 5
every iteration.
If the Microsoft Calculator did not have 32 digits of precision, I would not
have seen this sub-pattern.

Here's a C example of how to convert a fraction into an ascii decimal
string:
/*
* buffer: buffer for output
* a: numerator
* b: denominator
* numchars: size of buffer
*/
void frac2string (char *buffer, int a, int b, int numchars){
    int n, o, p;
    char buf[10];

    o = a / b;
    for (p = 0; o != 0; p++){         /* collect integer digits, LSD first */
        buf[p] = (o % 10) + '0';
        o /= 10;
    }
    if (p == 0)                       /* integer part is zero: leading "0" */
        buf[p++] = '0';
    for (n = 0; n < p; n++){          /* copy back in MSD-first order */
        buffer[n] = buf[p - 1 - n];
    }
    buffer[n++] = '.';
    o = a % b;                        /* remainder drives the fraction digits */
    for (; n < numchars - 1; n++){
        o *= 10;
        buffer[n] = o / b + '0';
        o %= b;
    }
    buffer[n] = 0;
}
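
For example, a call like the following would reproduce the 1/97 pattern
shown above (illustrative only):

char out[120];
frac2string(out, 1, 97, sizeof out);
/* out now starts with "0.010309278350515463917525773195..." */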

wolfgang kern

unread,
Mar 16, 2003, 1:52:25 PM3/16/03
to

Bernhard answered:

| >We could preceed every divide with a 10^x multiply to produce integers.

| To be defined, first?

Multiply dividend's mantissa by desired precision 10^x,
(you may adjust this factor by the delta-MSB,
or set it to a fixed maximum)
and add this factor to the divisors (10^x)exponent.
[remember my 10^x binary LUT]

The resulting integer mantissa will then have a
desired precision in decimal digits, even it is a 2^n figure.

ie(dec): 12345E0 / 67890E0 = 0.181838268..
= 123450000000E0 / 67890E7 = 1818382E-7 (if 7 digits chosen)
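
The same scaling trick, spelled out as a tiny illustrative C snippet
(64-bit integers, 7 digits of precision assumed; not code from the thread):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int64_t dividend = 12345, divisor = 67890;
    int64_t scale    = 10000000;              /* 10^7                     */
    int64_t mantissa = dividend * scale / divisor;
    int     exponent = -7;                    /* the -7 goes to the exponent */
    printf("%lldE%d\n", (long long)mantissa, exponent);   /* 1818382E-7 */
    return 0;
}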

|.. What exponent format? AFAIR you suggested


| to use the entire hex number as equivalent of the n-th power of ten -
| a very "pleasant" definition... ;)

Not the entire number, just the exponent,
the mantissa bytes are 2^n based.

Your format definition says "2 bytes for exponent"
(one to four bytes in my case),
which may be defined as either 10^±32767 or 2^±32767.
I think you've chosen the 10^n variant (due 2^n won't do exact "0.1").


["Improved" multiply]

| > 224 bytes may produce 539 decimal digits [2^(224*8)-1].

| Surplus digits have to be rounded after the final calculation. Unless


| the calculation is finished, we have some extra digits to improve the
| overall accuracy.

? a integer byte (FF) will be displayed as 3 decimal digits(255),
[decimal digits = 1 + int(bits*lg(2) ; int(1+8*0.30103) = 3]
you sure don't mean to round that to two digits (255 -> 26E1).
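
That rule of thumb as one line of C (illustrative sketch, lg = log10):

#include <math.h>

/* decimal digits needed to display an n-bit unsigned integer */
int dec_digits(int bits) { return 1 + (int)(bits * log10(2.0)); }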

Do you still think in terms of BCD when "224 bytes are 448 digits"?
Ok, I think you mean hex-nibbles rather than decimal digits here.
I was talking about the display-able decimal values.

[..]


| >BTW: the integer MUL (edx:eax=eax*...) needs just 9 clock-cycles and
| > covers 8 * 8 nibbles => 16 nibbles including carry-over already.

| I know that it is possible to use the build in MUL instructions - but
| I never checked how it is working, so I can't use it... :(

What do you mean with "build"?
The "MUL" does a integer 32bit * 32bit and produce a 64bit result,
there is nothing else to do, and produce the same result as
a looping 8*8 nibbles multiply (incl. nibble carry-over handling).

[hex and bin]

| Seen this way, even BCDs are expressions of 2^n values (this is fact,
| if we use BCDs in an x86 environment). If we could do all the odd bit
| shifting very fast, then I would use it. But - my impression is, that
| it takes much more time than my current way.

| OTOH - it would need 7 buffers with the "shifted" number = 7 loops to
| shift the number. After this step we need [size * 8] "bit tests" with
| the possible addition, and that's all. I will think about it a little
| bit longer - sounds good to me...

I think BCD had its time when BIN<->ASCII conversion-time was a matter
of concern, but as the main problem these days seems to be storage-size
and calculation speed, only a few people demand BCD-arithmetic.
It's the common human view only which asks for easier programming and
understanding, but after you become familiar with binaries, you'll
find it faster, shorter and as convenient as BCD.
And you won't care about the storage order then :)

| >..so a bit is an element of 'hex' also...

| Since a bit is nothing else than the expression for 1, it is included
| in *every* number, except zero! Go one step further - every number is
| build out of a specific amount of 1's...

Sure, and a bit is the smallest possible numeric element.

| >Now we wrote the first chapter of the 'asm/kesys' syntax translation.

| To be continued... ? ;)

... whenever required :)

[...but it went another (different) way...]

| I know! Whenever I finish re-coding of my database engine (meanwhile
| about 3500 lines) - I could start again at once, because there are so
| many things to improve... ;)

| Because I don't sell my software - there is plenty of time to develop
| it until it reached a "good enough" level to publish it. Better to do
| some more improvements than to publish "unready" code, like it is the
| common practice of most software companies today... ;)

I'm far away from "sell first, search bugs later",
perhaps due I'm much too small to risk the early money way.

[another try: code snip]

| >And an alternate direct "ADD [EDI],[ESI]" will be shorter by three clocks.

| Some reasons, why I don't use it:

[1,2,3...]
Ok, my fault:
This isn't a valid instruction, I meant the "routines function" here,
[I better had used "ADD op1,op2 (by pointer)"] where no buffers are used.

| >[..."early out" feature]

| >... only I still need to support all odd-sized numeric types w/o
| > zero-extended buffering.

| Would be faster to extend them before the calculation?

It would reduce the loop-size, but it would also need buffer-fill
and write-back which also needs some time.
I can't extend a variable located in an external memory-structure.


| >| As they say - Austr(al)ia is "down under"... ;)

[..]


| Bonus is granted!
| BTW - there's the right of free speech, anyway... ;)

Sure, and I keep this bonus for I sure will need it....

[definition block...]

| Higher address = higher value = lots of additional work!

Only once, while changing, but sure lesser work after.

The address-offset/value relation allow easy look-up,
conversions, contiguous graphics pixel-orientation, and more...

| But I will get used to it. As I've learned to drive backwards for
| miles with a 18.75 m truck - it should be possible with tiny toys
| like bits and bytes, too... ;)

I'm sure you can.
Look at your monitor, the left-top pixel got the address 0 ;)

[Hugo..]


| You told him you could erase his entire harddrive? ;)

This warning didn't help too much,
but I won't spend the necessary time yet,
anyway I just don't "see" any post from "AOL"
(except "SciptKid") right now.
As the Bavarians would say: "des is firn Hugo! (aka Oasch)".

| >Your definition looks good,
| >even you may miss 'the eye of the fly' with only two exponent-bytes ;)
| >but 32K-zeros should be enough by far for everything else.

| 48 bit would be enough? Or we overdo it a little bit, and define
| 112 bit? Operands would meet a paragraph boundary, if the latter
| one would be taken...

Actually this was a joke, I can't think of an exponent >10^32k to be
useful anyway, even I got a maximum of 32bits (10^2G),
just due all exponent calculation can be done easy by the CPU's ALU.

Yes, paragraph (qq) aligned variables are best.

| >What stands the 'Lobster' for?

| ...a lobster walks backwards...
I see :)

....two pairs of scissors?


| You probably meant "one pair" or "two units", didn't you? ;)

Yes, I always make this mistake...seems to be burned in.

| Of course - all four blades sharp as a katana! You know, its job
| is to cut bytes into bits...

:)

[your run-time environment.....]

I see, there may be some time losses due the protection rings,
but you can avoid instructions which are affected
(PUSHF/POPF/SEG-override/ etc).

Don't continue calculating already "overflowed" operands?
Or adjust the (buffered) size whenever an overflow occures,
the 'enlarged' value is a valid figure anyway.
Perhaps a top-level "buffer-overflow" is needed in addition.

Yes, looks like working code.
And I'm already familiar with your trailing exponent, flags and size,
I think it's a good way for really large numbers.

[I have the exponent on top, which reflects the original idea
for easier comparison, but with the 10^x things changed]


What are the means of the eax="_BNR" ? ; eax is xor-ed lateron.

The main difference to my unsigned integer stuff is:
* my routine ends with carry/zero setup according to the comparison
so the caller can use conditional branch or cmov already,
instead of the paired ja 4 (f)
jb 5 (f) I have a single "jne 6",
it is "NZ" and the carry-bit says ">" or "<" already.
* If I compare buffers, the leading zero-bytes are detected during
the buffer-fill already (but that's not an important issue),
but it also indicates an operand "is zero" during that,
which you can do by the "je" after dec cl/dl in the get-size loops
to set your "is zero" flag (IIRC the exponent sign isn't needed).
* exponents are included in my compare routine,
if they are different I need some tricky stuff as:
copy to buffer, multiply by the exponent difference [10^X-LUT]...
(123 E7 = 1230000000E0; both storage forms are possible in my world)

Yes, assuming the result-buffer is cleared (zeroed) before,
but I'd replace the "BTS[..],0x05" with "OR[..],0x20",
it's shorter, faster and also sets bit 5.
Any reason to clear eax at the end?

| This final solution covers all discussed possibilities, I think.
| Now the calculator becomes a somewhat _real_ thing. I will check
| the multiplication with the shift buffers, isn't the worst idea!

The idea is interesting, but I'm afraid the loop-count and the shifts
will eat up many clocks....
[LMFAO: just associated Disney's "Strauss" who swallowed an alarm-clock].

| Might be, that it's coded until the next reply, but there's some
| other time to waste with my "friend" Hugo (lots of paperwork)...

Once a war is in progress....
Except my single warning, I saved some time by just ignore ...
... and blank 'em out so I don't see anymore.
__
wolfgang


bv_schornak

unread,
Mar 20, 2003, 6:11:57 PM3/20/03
to
wolfgang kern wrote:

>Multiply dividend's mantissa by desired precision 10^x,
>(you may adjust this factor by the delta-MSB,
> or set it to a fixed maximum)
>and add this factor to the divisors (10^x)exponent.
>[remember my 10^x binary LUT]
>

What the heck is a "delta-MSB" - sounds like the 4th quality level of
an "alpha-MSB"? ;)

>The resulting integer mantissa will then have a
>desired precision in decimal digits, even it is a 2^n figure.
>
>ie(dec): 12345E0 / 67890E0 = 0.181838268..
> = 123450000000E0 / 67890E7 = 1818382E-7 (if 7 digits chosen)
>

Which would be a 0xF9FF for the exponent and a BC A1 D6 0A 00 ... for
the mantissa, already using reversed order...

>|.. What exponent format? AFAIR you suggested
>| to use the entire hex number as equivalent of the n-th power of ten -
>| a very "pleasant" definition... ;)
>
>Not the entire number, just the exponent,
>the mantissa bytes are 2^n based.
>
>Your format definition says "2 bytes for exponent"
>(one to four bytes in my case),
>which may be defined as either 10^±32767 or 2^±32767.
>I think you've chosen the 10^n variant (due 2^n won't do exact "0.1").
>

Sorry for the confusing statement - "the entire number" is the word @
offset 0x02 in the operand buffer, of course. It should be a power of
10 converted to a signed hexadecimal word (see above example):

7FFF -> 10 E +32767
...
0001 -> 10 E +1
0000 -> 10 E 0
FFFF -> 10 E -1
...
8000 -> 10 E -32768
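
Reading the stored word back is then just a signed 16-bit reinterpretation;
a tiny illustrative C snippet (using the value -7 from the running example,
byte order aside):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t raw = 0xFFF9;          /* exponent word value, not byte order */
    int16_t  e   = (int16_t)raw;    /* two's complement view: -7           */
    printf("10^%d\n", e);           /* prints: 10^-7                       */
    return 0;
}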

>["Improved" multiply]


>
>| Surplus digits have to be rounded after the final calculation. Unless
>| the calculation is finished, we have some extra digits to improve the
>| overall accuracy.
>
>? a integer byte (FF) will be displayed as 3 decimal digits(255),
> [decimal digits = 1 + int(bits*lg(2) ; int(1+8*0.30103) = 3]
> you sure don't mean to round that to two digits (255 -> 26E1).
>

With a 7 / 8 rounding - why not? I think it would be similar to the 4 / 5
rounding in base 10?

78 9A BC 00 .. E -228 => BD 00 .. E -224

so we still have the +/- 1 error in the last digit, while the rest of the
number is 100 % accurate.

>Do you still think in terms of BCD when "224 bytes are 448 digits"?
>Ok, I think you mean hex-nibbles rather than decimal digits here.
>I was talking about the display-able decimal values.
>

The operands are defined as hexadecimals in reverse order aka big endian.
Thus, one byte is equal to two hex nibbles. Nevertheless, one byte always
is equal to two nibbles (or "digits") - regardless of the used base... ;)

The conversion actually is using my old BCD calculator. Some byte...qword
conversion functions are available in my libraries, but they can't handle
larger numbers.

>[..]
>| >BTW: the integer MUL (edx:eax=eax*...) needs just 9 clock-cycles and
>| > covers 8 * 8 nibbles => 16 nibbles including carry-over already.
>
>| I know that it is possible to use the build in MUL instructions - but
>| I never checked how it is working, so I can't use it... :(
>
>What do you mean with "build"?
>The "MUL" does a integer 32bit * 32bit and produce a 64bit result,
>there is nothing else to do, and produce the same result as
>a looping 8*8 nibbles multiply (incl. nibble carry-over handling).
>

It was "build-in" and means all versions of MUL and IMUL. I've seen code
where continuous multiplies were done, but I never got a clue which spell
they used to invoke the demon who is doing the work... ;)

How does it work? Each multiply is a "partial" result - but how are those
results put together?

>[hex and bin]


>
>And you won't care about the storage order then :)
>

I *ever* will! As I started using it - it does not mean I started to like
it! It's just the only possible way to use the damned iNTEL design in the
most effective way - by ...{ehem}... walking backwards... ;)

>[another try: code snip]


>
>This isn't a valid instruction, I meant the "routines function" here,
>[I better had used "ADD op1,op2 (by pointer)"] where no buffers are used.
>

Buffers are no "must". You could pass the address of every memory area in
EBX, ESI and EDI. This area only has to be of sufficient size, and unused
bytes must be zero...

>| >[..."early out" feature]
>
>| >... only I still need to support all odd-sized numeric types w/o
>| > zero-extended buffering.
>
>| Would be faster to extend them before the calculation?
>
>It would reduce the loop-size, but it would also need buffer-fill
> and write-back which also needs some time.
>I can't extend a variable located in an external memory-structure.
>

If you put it on the stack? Something like

sub ESP, 32
... copy operand to ESP + n
... do calculation
add ESP, 32

>Sure, and I keep this bonus for I sure will need it....
>

Do you know me? ;)

>[definition block...]
>
>| Higher address = higher value = lots of additional work!
>
>Only once, while changing, but sure lesser work after.
>

Only - really _only_ - on iNTEL machines...

>The address-offset/value relation allow easy look-up,
>conversions, contiguous graphics pixel-orientation, and more...
>

Hey - is this an offer for a contribution to TheGame?

>Look at your monitor, the left-top pixel got the address 0 ;)
>

OS/2 uses bitmaps in the graphical subsystem, where pel [0,0] is _bottom_
left (pel = pixel element). IBM learned a lot from the iNTEL guys?

>As the Bavarians would say: "des is firn Hugo! (aka Oasch)".
>

For sure (count the AOL webmasters in - e-mail still isn't delivered
since 16th, 15:00 - seems, that many "Hugos" are out there)... ;)

>| 48 bit would be enough? Or we overdo it a little bit, and define
>| 112 bit? Operands would meet a paragraph boundary, if the latter
>| one would be taken...
>
>Actually this was a joke, I can't think of an exponent >10^32k to be
> useful anyway, even I got a maximum of 32bits (10^2G),
>just due all exponent calculation can be done easy by the CPU's ALU.
>

The 48 / 112 bit are my "counter-joke"... ;)

Nevertheless, it...

>Yes, paragraph (qq) aligned variables are best.
>

...would be a good idea to define the operand's first byte at offset 0x10
in the buffer!

>[your run-time environment.....]
>
>I see, there may be some time losses due the protection rings,
>but you can avoid instructions which are affected
>(PUSHF/POPF/SEG-override/ etc).
>

No reason to use these instructions (except PUSHF/POPF in a few cases)...

[New compare routine]

>Don't continue calculating already "overflowed" operands?
>Or adjust the (buffered) size whenever an overflow occures,
>the 'enlarged' value is a valid figure anyway.
>Perhaps a top-level "buffer-overflow" is needed in addition.
>

The routine is "upgraded" now, see attachement - the code gets too large,
so I put it into a separate file from now on (txt saves bandwith compared
to the eml message format (something like HTML code)).

>[I have the exponent on top, which reflects the original idea
>for easier comparison, but with the 10^x things changed]
>

Sorry, the memory beyond the MSB is reserved for possible overflows.

>What are the means of the eax="_BNR" ? ; eax is xor-ed lateron.
>

Short: Remains of the old BCD code - remove it without fear... ;)

Long: _BNR is the base address of my SystemNumerics. My old routine was
loading the D_BASE and S_BASE variables from there, if I remember
right...

>The main difference to my unsigned integer stuff is:
>* my routine ends with carry/zero setup according to the comparison
> so the caller can use conditional branch or cmov already,
> instead of the paired ja 4 (f)
> jb 5 (f) I have a single "jne 6",
> it is "NZ" and the carry-bit says ">" or "<" already.
>

And it can be called from everywhere - e.g. a C-function which needs the
comparison result 0, 1 or 2? The return code of chkOPS is

0 operands equal
1 OP1 > OP2
2 OP1 < OP2,

also the operand size is set to the current size - including the setting
of the overflow flags. Just have a look at the new code, but I think
checking ">" or "<" can't be avoided somewhere, so - why not here?

>* If I compare buffers, the leading zero-bytes are detected during
> the buffer-fill already (but that's not an important issue),
> but it also indicates an operand "is zero" during that,
> which you can do by the "je" after dec cl/dl in the get-size loops
> to set your "is zero" flag (IIRC the exponent sign isn't needed).
>

A zero flag is implemented now. It will be set, if the mantissa is zero.

>* exponents are included in my compare routine,
> if they are different I need some tricky stuff as:
> copy to buffer, multiply by the exponent difference [10^X-LUT]...
> (123 E7 = 1230000000E0; both storage forms are possible in my world)
>

stored as string "00 01 00 07 - 7B 00 00 00",
in calculator 00 04 00 00 - 80 4F 50 49 00 00 00 00 ...

>Yes, assuming the result-buffer is cleared (zeroed) before,
>but I'd replace the "BTS[..],0x05" with "OR[..],0x20",
>it's shorter, faster and also sets bit 5.
>Any reason to clear eax at the end?
>

Just following the old convention - returning zero means "no error". You
may leave it out, if you don't like it. "BT*" instructions were replaced
by "OR" / "AND"...

>| I will check the multiplication with the shift buffers...


>
>The idea is interesting, but I'm afraid the loop-count and the shifts
>will eat up many clocks....
>[LMFAO: just associated Disney's "Strauss" who swallowed an alarm-clock].
>

:-D (My CB skip is Roadrunner - "meep, meep"...)

Exactly 7 shifts (already coded) plus 112 * 8 bit tests and the possible
addition of the buffer.

------------------------------------------------------------------------

Now something we didn't think about until now - the sign. After starting
to recode the subtraction routine, I found out, that the sign _handling_
isn't as trivial as I assumed:

p = positive, n = negative, 0 = zero,
s[OP1 x OP2] s = sign of result, x: + = addition. - = subtraction

__________________________________________________________________
OP1 = OP2 | OP1 > OP2 | OP1 < OP2
_____________________|______________________|_____________________
p + p = +[OP1 + OP2] | p + p = +[OP1 + OP2] | p + p = +[OP1 + OP2]
p + n = 0 | p + n = +[OP1 - OP2] | p + n = -[OP2 - OP1]
n + p = 0 | n + p = -[OP1 - OP2] | n + p = +[OP2 - OP1]
n + n = -[OP1 + OP2] | n + n = -[OP1 + OP2] | n + n = -[OP1 + OP2]
_____________________|______________________|_____________________
p - p = 0 | p - p = +[OP1 - OP2] | p - p = -[OP2 - OP1]
p - n = +[OP1 + OP2] | p - n = +[OP1 + OP2] | p - n = +[OP1 + OP2]
n - p = -[OP1 + OP2] | n - p = -[OP1 + OP2] | n - p = -[OP1 + OP2]
n - n = 0 | n - n = -[OP1 - OP2] | n - n = +[OP2 - OP1]
_____________________|______________________|_____________________

The OP1 = OP2 case can be handled as OP1 > OP2 - but I handle them in an
extra case, because there's a 50 % chance of a resulting zero.

The basic iADD and iSUB routines are reduced to addition or subtraction,
in this latest version. The addOPS and subOPS routines do all necessary
handling of the signs, then call the appropriate routine (iADD or iSUB),
following the rules defined in the above table.
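
Reading the table as code: like signs (after folding a subtraction into
OP2's sign) mean a magnitude addition; unlike signs mean subtracting the
smaller magnitude from the larger, with the result taking the sign of the
larger operand. A hedged C sketch of that dispatch (hypothetical names;
the magnitude routines themselves are assumed to exist elsewhere):

typedef enum { DO_ZERO, DO_ADD, DO_SUB_1_2, DO_SUB_2_1 } mag_op;

/* Sketch only: s1/s2 are the operand signs (+1 or -1), op is +1 for
   addition and -1 for subtraction, cmp is the magnitude comparison
   (0 equal, 1 |OP1| > |OP2|, 2 |OP1| < |OP2|), as chkOPS returns it. */
mag_op sign_dispatch(int s1, int s2, int op, int cmp, int *result_sign)
{
    int s2eff = s2 * op;               /* subtraction flips OP2's sign */
    if (s1 == s2eff) {                 /* like signs: magnitudes add   */
        *result_sign = s1;
        return DO_ADD;
    }
    if (cmp == 0) {                    /* equal magnitudes cancel      */
        *result_sign = +1;
        return DO_ZERO;
    }
    if (cmp == 1) {                    /* |OP1| > |OP2|: OP1 - OP2     */
        *result_sign = s1;
        return DO_SUB_1_2;
    }
    *result_sign = s2eff;              /* |OP1| < |OP2|: OP2 - OP1     */
    return DO_SUB_2_1;
}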

I also added a routine chkCND, which evaluates the given operand. chkCND
counts the amount of bytes and sets the overflow and zero flag depending
on the operand's conditions. It can be used to test the result of any of
the math operations or "raw" input.

Don't worry about the grown code. Execution time should be very fast, if
we compare it against the "worker" routines.

I added the first part of the new multiplication routine. The both loops
which still are missing (get byte / test bit and add OP, if bit set) may
be available soon... ;)

New flag definitions (more practical than before):

07 exponent too large
06 mantissa too large
05 overflow
04 zero
03 index exchanged
02 -
01 -
00 sign operand


P.S.: Translation GAS -> iNTEL increases response time...

BinLobster.txt

wolfgang kern

unread,
Mar 22, 2003, 1:43:56 PM3/22/03
to

Bernhard wrote:


| >Multiply dividend's mantissa by desired precision 10^x,
| >(you may adjust this factor by the delta-MSB,
| > or set it to a fixed maximum)
| >and add this factor to the divisors (10^x)exponent.
| >[remember my 10^x binary LUT]

| What the heck is a "delta-MSB" - sounds like the 4th quality level of
| an "alpha-MSB"? ;)

:)
'delta' is the commonly used term for the absolute value of a difference,
[BASIC: delta_MSB = abs(MSB1 - MSB2]


| >The resulting integer mantissa will then have a
| >desired precision in decimal digits, even it is a 2^n figure.
| >
| >ie(dec): 12345E0 / 67890E0 = 0.181838268..
| > = 123450000000E0 / 67890E7 = 1818382E-7 (if 7 digits chosen)

| Which would be a 0xF9FF for the exponent and a BC A1 D6 0A 00 ... for
| the mantissa, already using reversed order...

The hex-notation for -7 still reads 0xFFF9 (a value, regardless of Endian),
and the hex byte string for 1818382 should show up as 0E BF 1B 00..
LSB first (same order as a dump view).
You calculated the 181838268 which would need 9 digits adjustment,
but yes, the order is correct here.

| .. exponent format?

| Sorry for the confusing statement - "the entire number" is the word @
| offset 0x02 in the operand buffer, of course. It should be a power of
| 10 converted to a signed hexadecimal word (see above example):

| 7FFF -> 10 E +32767
| ...
| 0001 -> 10 E +1
| 0000 -> 10 E 0
| FFFF -> 10 E -1
| ...
| 8000 -> 10 E -32768

Yes, but no conversion is needed, just use it as signed integer word.


| >["Improved" multiply]
| >| Surplus digits have to be rounded after the final calculation. Unless
| >| the calculation is finished, we have some extra digits to improve the
| >| overall accuracy.

| >? a integer byte (FF) will be displayed as 3 decimal digits(255),
| > [decimal digits = 1 + int(bits*lg(2) ; int(1+8*0.30103) = 3]
| > you sure don't mean to round that to two digits (255 -> 26E1).

| With a 7 / 8 rounding - why not? I think it would be similar to the 4 / 5
| rounding in base 10?

I prefer to keep integers as they are,
and let the user define the rounding precision.


| 78 9A BC 00 .. E -228 => BD 00 .. E -224

Trapped !? :)
four nibbles reduction won't result in E-4 ( it's 1/65536)


| so we still have the +/- 1 error in the last digit, while the rest of the
| number is 100 % accurate.

I would use self-rounding only if the figure exceeds the destination size.



| The conversion actually is using my old BCD calculator. Some byte...qword
| conversion functions are available in my libraries, but they can't handle
| larger numbers.

I think only display(input/print) will need conversion,
storage and calculation can all be done in the chosen numeric format,
exponent = signed integer; mantissa = unsigned 2^n string.

| >| >BTW: the integer MUL (edx:eax=eax*...) needs just 9 clock-cycles and
| >| > covers 8 * 8 nibbles => 16 nibbles including carry-over already.

| >| I know that it is possible to use the build in MUL instructions - but
| >| I never checked how it is working, so I can't use it... :(

| >What do you mean with "build"?
| >The "MUL" does a integer 32bit * 32bit and produce a 64bit result,
| >there is nothing else to do, and produce the same result as
| >a looping 8*8 nibbles multiply (incl. nibble carry-over handling).

| It was "build-in" and means all versions of MUL and IMUL. I've seen code
| where continuous multiplies were done, but I never got a clue which spell
| they used to invoke the demon who is doing the work... ;)

| How does it work? Each multiply is a "partial" result - but how are those
| results put together?

By adding the 'part' to the result buffer, starting at the correct location
(byte-offset = power of) and loop as long as carry.
You can add the partial results right after the multiply where the index
already points to the correct result-offset.

MUL and IMUL are hard-wired instructions (code 'F6/F7' group),
I don't know which syntax your GAS will use,
I would do:

assuming esi+4,edi+4 holds mantissa factors 1 and 2
and result buffer is cleared.
push ebx
push ebp
push edx ; MUL uses edx
xor ebp,ebp
:outerloop ;
1 xor ecx,ecx
:innerloop
3 MOV eax,[esi+ebp+04] ; ebp = factor1 mantissa offset
9 MUL dw [edi+ecx+04] ; ecx = factor2 mantissa offset
; edx:eax = eax*mem32
3 ADD dw [ebx+ecx+04],eax ;32+
3 ADC dw [ebx+ecx+08],edx ;32 = 64 bit ADD
1 jnb continue_loop

1 mov edx,ecx ; save ecx
:x1 ; adc loop
3 ADD dw [ebx+ecx+0Ch],+1 ;(code 83 form)
1 jnb continue
1 add ecx,4
1 cmp cl,0xE0 ; 2 times 0x70
1 jbe x1 ;
:continue
1 mov ecx,edx ;restore ecx

:continue_loop
1 add ecx,4
1 cmp cl,0x70
1 jb innerloop ; next 32 bits factor2
1 add ebp,4
1 add ebx,4 ; adjust result offset
1 cmp ebp,+0x70 ; (code 83 form)
1 jb outerloop ; next 32 bits factor1
pop edx
pop ebp
pop ebx
ret

this is already tested code,
ecx is destroyed (is 0x70),
but it fits exactly one cache-line (64-bytes),
and needs [worst case: all bits set] about 13500
clock-cycles for a 96*96 byte multiply with a 192 byte result.
(70 clks/result-byte)
and this is much lesser than my previous estimation:
"...and a worst case multiply of 128*128 bytes with 256 bytes result
will be done within max. 30000 clock-cycles." ;(117 clks/result-byte)

Will be hard to beat that... :) , but don't hesitate to change my mind.
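
The same schoolbook scheme once more in C, for readers who find the
register notation hard to follow (a sketch with hypothetical names,
32-bit little-endian limbs; not wolfgang's routine):

#include <stdint.h>

/* res must hold na + nb limbs and be zeroed by the caller,
   mirroring the cleared result buffer assumed above.        */
void limb_mul(uint32_t *res, const uint32_t *a, int na,
              const uint32_t *b, int nb)
{
    for (int i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < nb; j++) {
            /* 32x32 -> 64 bit partial product, accumulated at offset i+j */
            uint64_t cur = (uint64_t)a[i] * b[j] + res[i + j] + carry;
            res[i + j] = (uint32_t)cur;
            carry      = cur >> 32;
        }
        res[i + nb] = (uint32_t)carry;   /* this slot is still zero here */
    }
}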

| >And you won't care about the storage order then :)

| I *ever* will! As I started using it - it does not mean I started to like
| it! It's just the only possible way to use the damned iNTEL design in the
| most effective way - by ...{ehem}... walking backwards... ;)

:) let's hear your opinion after everything is working....



| >[I better had used "ADD op1,op2 (by pointer)"] where no buffers are used.

| Buffers are no "must". You could pass the address of every memory area in
| EBX, ESI and EDI. This area only has to be of sufficient size, and unused
| bytes must be zero...

Yes.

[... zero-extended buffering.


| >| Would be faster to extend them before the calculation?

| >It would reduce the loop-size, but it would also need buffer-fill
| > and write-back which also needs some time.
| >I can't extend a variable located in an external memory-structure.

| If you put it on the stack? Something like

| sub ESP, 32
| ... copy operand to ESP + n
| ... do calculation
| add ESP, 32

I checked this before, zero-extended copies will shrink the loop,
but the gained time is lost by copy and writeback.
For your numbers, which are four times larger, this may look different.


| >[definition block...]
| >| Higher address = higher value = lots of additional work!

| >Only once, while changing, but sure lesser work after.
| Only - really _only_ - on iNTEL machines...

I wrote code for about 40 different CPUs,
only a third use the low-address = high-value Endian.

| >The address-offset/value relation allow easy look-up,
| >conversions, contiguous graphics pixel-orientation, and more...

| Hey - is this an offer for a contribution to TheGame?

Just a few advantage points...

| >Look at your monitor, the left-top pixel got the address 0 ;)
| OS/2 uses bitmaps in the graphical subsystem, where pel [0,0] is _bottom_
| left (pel = pixel element). IBM learned a lot from the iNTEL guys?

Opposite ordered graphic-cards? Reversed '.bmp'-files?
I haven't seen any yet, IIRC, even the C-64 had to loop upside down
to show a sine-curve with correct polarity.

| >Yes, paragraph (qq) aligned variables are best.
| ...would be a good idea to define the operand's first byte at offset 0x10
| in the buffer!

Yes, just locate the mantissa-buffers at aligned address.

| >[your run-time environment.....]

| >I see, there may be some time losses due the protection rings,
| >but you can avoid instructions which are affected
| >(PUSHF/POPF/SEG-override/ etc).

| No reason to use these instructions (except PUSHF/POPF in a few cases)...

Ok.

| [New compare routine]

| >Don't continue calculating already "overflowed" operands?
| >Or adjust the (buffered) size whenever an overflow occures,
| >the 'enlarged' value is a valid figure anyway.
| >Perhaps a top-level "buffer-overflow" is needed in addition.

| The routine is "upgraded" now, see attachment - the code gets too large,
| so I put it into a separate file from now on (txt saves bandwidth compared
| to the eml message format (something like HTML code)).

I have my newsreader set to read plain text only anyway.


| ------------------------------------------------------------------------
|
| Now something we didn't think about until now - the sign. After starting
| to recode the subtraction routine, I found out, that the sign _handling_
| isn't as trivial as I assumed:

Yes, if you look at my 'numerics-standard',
signed variables are treated differently in many cases and use other routines.
[mainly just different entry-points]

In opposition to your definition I use true signed mantissa:
"FFFFFFF.....FFFFC" is "-4", but this needs strict size definition
and NEGation for signed MUL/DIV.

So I got additional signed 'CMP/ADD/SUB'- and NEG-routines,
the main changes in CMP are just in the conditional branches.

But I also use a [-]operator in front of unsigned figures,
which is almost the same sign-thing you do.


| p = positive, n = negative, 0 = zero,
| s[OP1 x OP2] s = sign of result, x: + = addition. - = subtraction

| __________________________________________________________________
| OP1 = OP2 | OP1 > OP2 | OP1 < OP2
| _____________________|______________________|_____________________
| p + p = +[OP1 + OP2] | p + p = +[OP1 + OP2] | p + p = +[OP1 + OP2]
| p + n = 0 | p + n = +[OP1 - OP2] | p + n = -[OP2 - OP1]
| n + p = 0 | n + p = -[OP1 - OP2] | n + p = +[OP2 - OP1]
| n + n = -[OP1 + OP2] | n + n = -[OP1 + OP2] | n + n = -[OP1 + OP2]
| _____________________|______________________|_____________________
| p - p = 0 | p - p = +[OP1 - OP2] | p - p = -[OP2 - OP1]
| p - n = +[OP1 + OP2] | p - n = +[OP1 + OP2] | p - n = +[OP1 + OP2]
| n - p = -[OP1 + OP2] | n - p = -[OP1 + OP2] | n - p = -[OP1 + OP2]
| n - n = 0 | n - n = -[OP1 - OP2] | n - n = +[OP2 - OP1]
| _____________________|______________________|_____________________
|
| The OP1 = OP2 case can be handled as OP1 > OP2 - but I handle them in an
| extra case, because there's a 50 % chance of a resulting zero.

Looks correct.



| The basic iADD and iSUB routines are reduced to addition or subtraction,
| in this latest version. The addOPS and subOPS routines do all necessary
| handling of the signs, then call the appropriate routine (iADD or iSUB),
| following the rules defined in the above table.

The iADD/iSUB look Ok.

| I also added a routine chkCND, which evaluates the given operand. chkCND
| counts the amount of bytes and sets the overflow and zero flag depending
| on the operand's conditions. It can be used to test the result of any of
| the math operations or "raw" input.

Seems to work.



| Don't worry about the grown code. Execution time should be very fast, if
| we compare it against the "worker" routines.

| I added the first part of the new multiplication routine. The both loops
| which still are missing (get byte / test bit and add OP, if bit set) may
| be available soon... ;)

I'm not sure about the "LOOP shift 7" part, what shall it do?


| New flag definitions (more practical than before):
|
| 07 exponent too large
| 06 mantissa too large
| 05 overflow
| 04 zero
| 03 index exchanged
| 02 -
| 01 -
| 00 sign operand

If bit7 is used as the sign bit, some of your code may shrink.

eg:
mov al [flags1] ;(replace 'flags' with addressing)
mov ah [flags2]
xor al,ah ;only al destroyed, ah remain valid
jns equal_signs
: ;signs are different
cmp al,ah ;only one got a set bit 7
jb ;op2 is negative, op1<op2
jmp ;op1 is negative, op1>op2
...

:equal_signs
cmp ah,0x80 ;is same sign as al
jnbe ;both are negative, needs further comparison
jmp ;both are positive, -"-


| P.S.: Translation GAS -> iNTEL increases response time...

I appreciate you for the better "readable" form,
I may already be able to interpret AT&T code,
but don't expect me to write like that.

[code stored aside...]

__
wolfgang


bv_schornak

unread,
Mar 23, 2003, 4:34:36 PM3/23/03
to
wolfgang kern wrote:

>'delta' is the commonly used term for the absolute value of a difference,
>[BASIC: delta_MSB = abs(MSB1 - MSB2]
>

I should have known...

>| Which would be a 0xF9FF for the exponent and a BC A1 D6 0A 00 ... for
>| the mantissa, already using reversed order...
>
>The hex-notation for -7 still reads 0xFFF9 (a value, regardless of Endian),
>and the hex byte string for 1818382 should show up as 0E BF 1B 00..
>LSB first (same order as a dump view).
>You calculated the 181838268 which would need 9 digits adjustment,
>but yes, the order is correct here.
>

Maybe it is too complex to explain the simple things ... it is just a
"snapshot" of the memory starting at 0x02[buffer].

>Yes, but no conversion is needed, just use it as signed integer word.
>

That's what I _wanted_ to say - but I missed to find the right words,
again...

>| With a 7 / 8 rounding - why not? I think it would be similar to the 4 / 5
>| rounding in base 10?
>
>I prefer to keep integers as they are,
> and let the user define the rounding precision.
>

Would be a mess - imagine the mass of dialogs like:

------------------------------------------------------------------
| BUY SURPLUS DIGITS AT EBAY! |
|------------------------------------------------------------------|
| |
| There are 13 surplus digits in this number: |
| |
| -123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF |
| F0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCD |
| E + 0404 |
| |
| What do you want to do today? |
| |
| [truncate] [increment] [round 7/8] [round 5/4] [use random data] |
------------------------------------------------------------------
(Ok, still 8 days until 2003/04/01...) :-D

Nope, I prefer to round all surplus digits, whenever it is necessary
to get rid of them. It provides the best accuracy...

>| 78 9A BC 00 .. E -228 => BD 00 .. E -224
>
>Trapped !? :)
> four nibbles reduction won't result in E-4 ( it's 1/65536)
>

Wasn't my best day, I guess... ;)

>| so we still have the +/- 1 error in the last digit, while the rest of the
>| number is 100 % accurate.
>
>I would use self-rounding only if the figure exceeds the destination size.
>

Which is the _only_ reason to start rounding, anyway...

>| The conversion actually is using my old BCD calculator. Some byte...qword
>| conversion functions are available in my libraries, but they can't handle
>| larger numbers.
>
>I think only display(input/print) will need conversion,
>storage and calculation can all be done in the chosen numeric format,
>exponent = signed integer; mantissa = unsigned 2^n string.
>

Yep, that's what I was talking about... ;)


[Integer multiply]

> push ebx
> push ebp
> push edx ; MUL uses edx
> xor ebp,ebp
> :outerloop ;
>1 xor ecx,ecx
> :innerloop
>3 MOV eax,[esi+ebp+04] ; ebp = factor1 mantissa offset
>9 MUL dw [edi+ecx+04] ; ecx = factor2 mantissa offset
> ; edx:eax = eax*mem32
>3 ADD dw [ebx+ecx+04],eax ;32+
>

EBX is the base of the result buffer?

>3 ADC dw [ebx+ecx+08],edx ;32 = 64 bit ADD
>1 jnb continue_loop
>
>1 mov edx,ecx ; save ecx
> :x1 ; adc loop
>3 ADD dw [ebx+ecx+0Ch],+1 ;(code 83 form)
>

The "+1" is the carry?

>1 jnb continue
>1 add ecx,4
>1 cmp cl,0xE0 ; 2 times 0x70
>1 jbe x1 ;
> :continue
>1 mov ecx,edx ;restore ecx
>
> :continue_loop
>1 add ecx,4
>1 cmp cl,0x70
>1 jb innerloop ; next 32 bits factor2
>1 add ebp,4
>1 add ebx,4 ; adjust result offset
>1 cmp ebp,+0x70 ; (code 83 form)
>1 jb outerloop ; next 32 bits factor1
> pop edx
> pop ebp
> pop ebx
> ret
>

I see...

These are 32*32 (1024) loops, if I count right.

Using EBP in mixed code could be a problem, if parameters were passed
on the stack (in C it's common practice to pass data this way)...

>this is already tested code,
> ecx is destroyed (is 0x70),
>but it fits exactly one cache-line (64-bytes),
>and needs [worst case: all bits set] about 13500
> clock-cycles for a 96*96 byte multiply with a 192 byte result.
> (70 clks/result-byte)
> and this is much lesser than my previous estimation:
> "...and a worst case multiply of 128*128 bytes with 256 bytes result
> will be done within max. 30000 clock-cycles." ;(117 clks/result-byte)
>
>Will be hard to beat that... :) , but don't hesitate to change my mind.
>

Thinking about the different solutions...

>| >And you won't care about the storage order then :)
>
>| I *ever* will! As I started using it - it does not mean I started to like
>| it! It's just the only possible way to use the damned iNTEL design in the
>| most effective way - by ...{ehem}... walking backwards... ;)
>
>:) let's hear your opinion after everything is working....
>

Life is tough enough without this stuff... ;)


[... zero-extended buffering]

>| sub ESP, 32
>| ... copy operand to ESP + n
>| ... do calculation
>| add ESP, 32
>
>I checked this before, zero-extended copies will shrink the loop,
>but the gained time is lost by copy and writeback.
>For your numbers, which are four times larger, this may look different.
>

If the operands are passed from a C-function, they're on the stack...
In any other case, I wouldn't use the stack to store larger operands.


[definition block...]

>I wrote code for about 40 different CPUs,
>only a third use the low-address = high-value Endian.
>

I just believe you (even if it is unbelievable)... ;)

>| >Look at your monitor, the left-top pixel got the address 0 ;)
>| OS/2 uses bitmaps in the graphical subsystem, where pel [0,0] is _bottom_
>| left (pel = pixel element). IBM learned a lot from the iNTEL guys?
>
>Opposite ordered graphic-cards? Reversed '.bmp'-files?
>I haven't seen any yet, IIRC, even the C-64 had to loop upside down
>to show a sine-curve with correct polarity.
>

No relation to the graphic card - just a definition for the graphical
subsystem of the OS. If I understand the redbooks right, bitmaps are
stored with the lower left pel first and upper right pel last (as you
get the location displayed in each picture manipulating program)...

>Yes, just locate the mantissa-buffers at aligned address.
>

Is done. Now we have no byte left for overflows _after_ a multiply...

>| The routine is "upgraded" now, see attachement - the code gets too large,
>| so I put it into a separate file from now on (txt saves bandwith compared
>| to the eml message format (something like HTML code)).
>
>I have my newsreader set to read plain text only anyway.
>

No relation to newsreaders, the 10 kByte *.txt file grows at least to
a 15 kByte *.eml file! It saves bandwidth and storage, if I attach the
code in a separate text file (AFAIK you may edit it like the original
*.eml message, anyway)...

[Sign handling]

>In opposition to your definition I use true signed mantissa:
> "FFFFFFF.....FFFFC" is "-4", but this needs strict size definition
> and NEGation for signed MUL/DIV.
>

I don't even think about 2's complement numbers. This might be ok for
small numbers up to the size of a DQ, but isn't very handy for larger
operands. The current sign handling is much faster than an extra loop
which calculates the 2's complement (costs at least 384 cycles).

>| I added the first part of the new multiplication routine. The both loops
>| which still are missing (get byte / test bit and add OP, if bit set) may
>| be available soon... ;)
>
>I'm not sure about the "LOOP shift 7" part, what shall it do?
>

In short form: It calculates 7 partial results, where each buffer is
[previous buffer * 2]. Then we read one byte and evaluate each bit -
if the bit is set, the appropriate buffer is added to the result. It
is nothing else than an automated "shift-multiply". Because the bit
pattern repeats after one byte, we only have to increment the index
register -> equal to a shift by 8 bits. But...

Forget it! This solution would need 904 loops à 384 cycles. Much too
much, if I compare it against my first routine, which only needs 256
loops to do the same work. Using nibbles reduces the amount of loops
to 224 (nibbles) + 32 (buffers 1* / 16*).
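
For the record, the byte-wise shift-buffer scheme described above, as a
hedged C sketch (hypothetical names and sizes; not code from the thread):

#include <stdint.h>
#include <string.h>

#define NB 112                         /* mantissa bytes, as defined above */

/* add part[0..len-1] into res at byte offset ofs, propagating the carry */
static void add_at(uint8_t *res, const uint8_t *part, int len, int ofs)
{
    unsigned carry = 0;
    for (int i = 0; i < len; i++) {
        unsigned s = res[ofs + i] + part[i] + carry;
        res[ofs + i] = (uint8_t)s;
        carry = s >> 8;
    }
    for (int i = ofs + len; carry; i++) {        /* leftover carry */
        unsigned s = res[i] + carry;
        res[i] = (uint8_t)s;
        carry = s >> 8;
    }
}

/* res: 2*NB bytes, zeroed. part[k] holds op1 shifted left by k bits;
   for every set bit k of multiplier byte b, add part[k] at offset b. */
void shiftbuf_mul(uint8_t *res, const uint8_t *op1, const uint8_t *op2)
{
    uint8_t part[8][NB + 1];
    memset(part, 0, sizeof part);
    memcpy(part[0], op1, NB);
    for (int k = 1; k < 8; k++)                  /* each buffer = previous * 2 */
        for (int i = NB; i >= 0; i--)
            part[k][i] = (uint8_t)((part[k - 1][i] << 1) |
                                   (i ? part[k - 1][i - 1] >> 7 : 0));
    for (int b = 0; b < NB; b++)
        for (int k = 0; k < 8; k++)
            if (op2[b] & (1 << k))
                add_at(res, part[k], NB + 1, b);
}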

Compared to your above code with 1024 loops - I'm not sure, which is
the best. It could be my version with "digit based" additions, but I
should code it, before I make promises...

>| New flag definitions (more practical than before):
>|
>| 07 exponent too large
>| 06 mantissa too large
>| 05 overflow
>| 04 zero
>| 03 index exchanged
>| 02 -
>| 01 -
>| 00 sign operand
>
>If bit7 is used as the sign bit, some of your code may shrink.
>
>eg:
>mov al [flags1] ;(replace 'flags' with addressing)
>mov ah [flags2]
>xor al,ah ;only al destroyed, ah remain valid
>jns equal_signs
>: ;signs are different
>cmp al,ah ;only one got a set bit 7
>jb ;op2 is negative, op1<op2
>jmp ;op1 is negative, op1>op2

>....


>
>:equal_signs
>cmp ah,0x80 ;is same sign as al
>jnbe ;both are negative, needs further comparison
>jmp ;both are positive, -"-
>

As you may have noticed, there are 5 more flags stored in this byte.
If you really want to speed things up, then you may RCR the bit into
the carry flag or another register.

Nevertheless, I will have a look at my code, again - it always takes
some time, before I make the one or other decision... ;)

>| P.S.: Translation GAS -> iNTEL increases response time...
>
>I appreciate you for the better "readable" form,
>I may already be able to interpret AT&T code,
>but don't expect me to write like that.
>

Thinking of the 0...1 people who might read this thread, too... ;)

Meanwhile I'm able to read iNTEL "style" code (could easily win each
award for the most confusing and illogical programming language) and
I also can write text which _looks_ as weird as iNTEL syntax... ;)

I'm coming back if I finished my MUL routine - at the moment, my job
occupies huge parts of my spare time, so I am not always in the mood
to sit down and write some lines of code. Too complex stuff to split
the entire routine into small and handy parts...


Greetings from Augsburg

Bernhard Schornak
--
<http://electroniciraq.net> - 1st hand information...

wolfgang kern

unread,
Mar 24, 2003, 2:52:54 PM3/24/03
to

Bernhard wrote:


| Maybe it is too complex to explain the simple things ... it is just a
| "snapshot" of the memory starting at 0x02[buffer].

Ok, 1818382E-7 ="xx 03 F9 FF 0E BF 1B 00.."
|fl|sz| exp |mantissa

| >I prefer to keep integers as they are,
| > and let the user define the rounding precision.
| Would be a mess - imagine the mass of dialogs like:

| ------------------------------------------------------------------
| | BUY SURPLUS DIGITS AT EBAY! |
| |------------------------------------------------------------------|
| | |
| | There are 13 surplus digits in this number: |
| | |
| | -123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF |
| | F0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCD |
| | E + 0404 |
| | |
| | What do you want to do today? |
| | |
| | [truncate] [increment] [round 7/8] [round 5/4] [use random data] |
| ------------------------------------------------------------------
| (Ok, still 8 days until 2003/04/01...) :-D
|

| Nope, I prefer to round all surplus digits, whenever it is necessary


| to get rid of them. It provides the best accuracy...

:)
Not that way, I use four user-definable global values:
[all in decimal digits]
1 calculation precision (this gives the buffer-size),
2 display precision and
3 digits involved for rounding (easier view),
4 writeback precision.


| >| 78 9A BC 00 .. E -228 => BD 00 .. E -224
| >
| >Trapped !? :)
| > four nibbles reduction won't result in E-4 ( it's 1/65536)

| Wasn't my best day, I guess... ;)

:) I had a bad week, searching for a strange silly bug,
finally I found a routine which used a wrong parameter-block.
I just searched for something else....

| >I would use self-rounding only if the figure exceeds the destination size.

| Which is the _only_ reason to start rounding, anyway...

Yes, except for display short-cut.

| >| The conversion actually is using my old BCD calculator. Some byte...qword
| >| conversion functions are available in my libraries, but they can't handle
| >| larger numbers.

| >I think only display(input/print) will need conversion,
| >storage and calculation can all be done in the chosen numeric format,
| >exponent = signed integer; mantissa = unsigned 2^n string.

| Yep, that's what I was talking about... ;)

My bin->ASCII conversion uses scan/sub in the 10^x semilog-LUT.
As your largest integer would be 2^767 (96 bytes) = 230 dec.digits,
a table would have 1 + 230*9 = 2071 entries, 96 bytes each (<200Kb).
Without a table you need to loop a divide by 10 for conversion.
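
The divide-by-10 fallback, as a minimal C sketch (hypothetical names,
32-bit little-endian limbs; the limb array is consumed in place):

#include <stdint.h>

/* Sketch only: write the decimal digits of a multi-limb integer into
   out (NUL-terminated). Returns the digit count, or -1 if out is too
   small. The limbs are destroyed by the repeated division.            */
int to_decimal(char *out, int outsize, uint32_t *limbs, int nlimbs)
{
    int len = 0;
    for (;;) {
        uint64_t rem = 0;
        int zero = 1;
        for (int i = nlimbs - 1; i >= 0; i--) {   /* whole number / 10 */
            uint64_t cur = (rem << 32) | limbs[i];
            limbs[i] = (uint32_t)(cur / 10);
            rem      = cur % 10;
            if (limbs[i]) zero = 0;
        }
        if (len >= outsize - 1) return -1;
        out[len++] = (char)('0' + rem);           /* digits come LSD first */
        if (zero) break;
    }
    for (int i = 0; i < len / 2; i++) {           /* reverse to MSD first */
        char t = out[i]; out[i] = out[len - 1 - i]; out[len - 1 - i] = t;
    }
    out[len] = 0;
    return len;
}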

[Integer multiply]

| > push ebx
| > push ebp
| > push edx ; MUL uses edx
| > xor ebp,ebp
| > :outerloop ;
| >1 xor ecx,ecx
| > :innerloop
| >3 MOV eax,[esi+ebp+04] ; ebp = factor1 mantissa offset
| >9 MUL dw [edi+ecx+04] ; ecx = factor2 mantissa offset
| > ; edx:eax = eax*mem32
| >3 ADD dw [ebx+ecx+04],eax ;32+

| >3 ADC dw [ebx+ecx+08],edx ;32 = 64 bit ADD
| >1 jnb continue_loop
| >1 mov edx,ecx ; save ecx
| > :x1 ; adc loop
| >3 ADD dw [ebx+ecx+0Ch],+1 ;(code 83 form)

| EBX is the base of the result buffer?

Yes.


| The "+1" is the carry?

Yes, "immediate-sign-extended-byte-ADD",
so the carry itself doesn't need to be saved.

| >1 jnb continue
| >1 add ecx,4

| >1 cmp cl,0xE0 ; 2 times 0x70
| >1 jbe x1 ;

The above two lines can be replaced with 'jmp x1'
as the buffer must be cleared before,
and this 0xE0 limit will be wrong after "add ebx,4" anyway.

| > :continue
| >1 mov ecx,edx ;restore ecx
| >
| > :continue_loop
| >1 add ecx,4
| >1 cmp cl,0x70
| >1 jb innerloop ; next 32 bits factor2
| >1 add ebp,4
| >1 add ebx,4 ; adjust result offset
| >1 cmp ebp,+0x70 ; (code 83 form)
| >1 jb outerloop ; next 32 bits factor1
| > pop edx
| > pop ebp
| > pop ebx
| > ret

| I see...

| These are 32*32 (1024) loops, if I count right.

In fact the count is less than that:
28*28 plus the value-dependent carry-over loop-count.

| Using EBP in mixed code could be a problem, if parameters were passed
| on the stack (in C it's common practice to pass data this way)...

Shouldn't be a problem as the EBP-value is preserved here anyway.

| >this is already tested code,
| > ecx is destroyed (is 0x70),
| >but it fits exactly one cache-line (64-bytes),
| >and needs [worst case: all bits set] about 13500
| > clock-cycles for a 96*96 byte multiply with a 192 byte result.
| > (70 clks/result-byte)
| > and this is much lesser than my previous estimation:
| > "...and a worst case multiply of 128*128 bytes with 256 bytes result
| > will be done within max. 30000 clock-cycles." ;(117 clks/result-byte)

| >Will be hard to beat that... :) , but don't hesitate to change my mind.

| Thinking about the different solutions...

Let's see..

[... zero-extended buffering]

| If the operands are passed from a C-function, they're on the stack...
| In any other case, I wouldn't use the stack to store larger operands.

Ok, if you will support C, I don't support any HLL-stuff at all.

| >Opposite ordered graphic-cards? Reversed '.bmp'-files?

| >I haven't seen any yet, IIRC, even the C-64 had to loop upside down


| >to show a sine-curve with correct polarity.

| No relation to the graphic card - just a definition for the graphical

| subsystem of the OS. If I understand the Redbooks right, bitmaps are


| stored with the lower left pel first and upper right pel last (as you
| get the location displayed in each picture manipulating program)...

Ok, bitmaps may exist in various storage formats,
and the different tools show different zero-points,
eg: with pixel-edit(Labview) you can select your preferred Y-direction.

| >Yes, just locate the mantissa-buffers at aligned address.
| Is done. Now we have no byte left for overflows _after_ a multiply...

A multiplication cannot overflow as long as
the sum of the factor byte counts <= the result byte count [FF*FF = FE01],
therefore I sized the buffers at twice the size of the largest variables.
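
A quick check of that bound in C (illustrative only): the product of an
m-byte and an n-byte value always fits in m + n bytes.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    printf("%X\n",   0xFFu * 0xFFu);                   /* FE01             */
    printf("%llX\n", 0xFFFFFFFFull * 0xFFFFFFFFull);   /* FFFFFFFE00000001 */
    return 0;
}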

[txt app]
Ok.

| [Sign handling]

| >In opposition to your definition I use true signed mantissa:
| > "FFFFFFF.....FFFFC" is "-4", but this needs strict size definition
| > and NEGation for signed MUL/DIV.

| I don't even think about 2's complement numbers. This might be ok for
| small numbers up to the size of a DQ, but isn't very handy for larger
| operands. The current sign handling is much faster than an extra loop
| which calculates the 2's complement (costs at least 384 cycles).

Right, perhaps I'll remove the 'large' signed variables in my next version,
even though I'd like to support calculations of all type combinations,
like signed byte * 32 byte.

| >| I added the first part of the new multiplication routine. The both loops
| >| which still are missing (get byte / test bit and add OP, if bit set) may
| >| be available soon... ;)

| >I'm not sure about the "LOOP shift 7" part, what shall it do?

| In short form: It calculates 7 partial results, where each buffer is
| [previous buffer * 2]. Then we read one byte and evaluate each bit -
| if the bit is set, the appropriate buffer is added to the result. It
| is nothing else than an automated "shift-multiply". Because the bit
| pattern repeats after one byte, we only have to increment the index
| register -> equal to a shift by 8 bits. But...

| Forget it! This solution would need 904 loops à 384 cycles. Much too
| much, if I compare it against my first routine, which only needs 256
| loops to do the same work. Using nibbles reduces the amount of loops
| to 224 (nibbles) + 32 (buffers 1* / 16*).

| Compared to your above code with 1024 loops - I'm not sure, which is
| the best. It could be my version with "digit based" additions, but I
| should code it, before I make promises...
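
[C sketch - bit-driven shift-and-add]
Just to have the discussed scheme written down once for comparison: the
operand plus seven partial buffers, each the previous one doubled, then
one full-width addition per set bit of OP2 (the byte position of OP2 takes
the place of the shift by 8 bits). A minimal C sketch with an invented
operand size, not meant as the final routine.

  #include <stdint.h>
  #include <string.h>
  #include <stddef.h>

  #define NB 128                                     /* invented operand size    */

  /* dst[off..] += src (little-endian), with carry ripple                        */
  static void add_at(uint8_t *dst, const uint8_t *src, size_t len, size_t off)
  {
      unsigned carry = 0;
      for (size_t i = 0; i < len || carry; i++) {
          unsigned s = dst[off + i] + (i < len ? src[i] : 0) + carry;
          dst[off + i] = (uint8_t)s;
          carry = s >> 8;
      }
  }

  /* res must be zeroed and 2*NB bytes long                                      */
  static void mul_shift_add(const uint8_t *op1, const uint8_t *op2, uint8_t *res)
  {
      uint8_t part[8][NB + 1];                       /* op1 * 1, 2, 4, ..., 128   */
      memset(part, 0, sizeof part);
      memcpy(part[0], op1, NB);
      for (int b = 1; b < 8; b++) {                  /* each buffer = previous*2  */
          add_at(part[b], part[b - 1], NB + 1, 0);
          add_at(part[b], part[b - 1], NB + 1, 0);
      }
      for (size_t i = 0; i < NB; i++)                /* byte index = shift by 8*i */
          for (int b = 0; b < 8; b++)
              if (op2[i] & (1u << b))
                  add_at(res, part[b], NB + 1, i);   /* one add per set bit       */
  }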

[your flags]


| >If bit7 is used as the sign bit, some of your code may shrink.

| >eg:
| >mov al [flags1] ;(replace 'flags' with addressing)
| >mov ah [flags2]
| >xor al,ah ;only al destroyed, ah remain valid
| >jns equal_signs

| >: ;sign are different
| >cmp al,ah ;only one got a set bit 7
| >jb ;op2 is negative, op1<op2
| >jmp ;op1 is negative, op1>op2
| >....

| >:equal_signs
| >cmp ah,0x80 ;is same sign as al
| >jnbe ;both are negative, needs further comparison
| >jmp ;both are positive, -"-

| As you may have noticed, there are 5 more flags stored in this byte.
| If you really want to speed things up, then you may RCR the bit into
| the carry flag or another register.

Also possible.
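
[C sketch - sign compare via bit 7]
The bit-7 idea in C, reduced to the bare logic (flag layout invented here,
only the sign bit matters; magnitude_cmp stands for the unsigned comparison
of the two mantissas, <0 / 0 / >0):

  #include <stdint.h>

  #define SIGN_BIT 0x80u              /* assumed: bit 7 = "operand is negative"    */

  static int compare_signed(uint8_t flags1, uint8_t flags2, int magnitude_cmp)
  {
      if ((flags1 ^ flags2) & SIGN_BIT)            /* XOR sets bit 7: signs differ */
          return (flags1 & SIGN_BIT) ? -1 : +1;    /* the negative operand is less */
      return (flags1 & SIGN_BIT) ? -magnitude_cmp  /* both negative: order reverses */
                                 :  magnitude_cmp; /* both positive: order as-is    */
  }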

| Nevertheless, I will have a look at my code, again - it always takes
| some time, before I make the one or other decision... ;)

| >| P.S.: Translation GAS -> iNTEL increases response time...
| >I appreciate you for the better "readable" form,
| >I may already be able to interpret AT&T code,
| >but don't expect me to write like that.

| Thinking of the 0...1 people who might read this thread, too... ;)

They may learn about tri-lingual posting ? :)

| Meanwhile I'm able to read iNTEL "style" code (could easily win each
| award for the most confusing and illogical programming language) and
| I also can write text which _looks_ as weird as iNTEL syntax... ;)

My disassembler, which is just there to immediately show the meaning of
my hex-input, uses its very own syntax (common to different CPUs),
eg: it says LD and ST instead of MOV and displays "AL=CY" for code D6,
and it always shows the segments involved and the action performed in addition.
So I often have to look up how the original Intel-syntax names things,
especially for the illogical conditional branch terminology,
ie: jnbe (code 77) -> "jr ncnz >"
    jnle (code 7F) -> "jr nz S>".

| I'm coming back if I finished my MUL routine - at the moment, my job
| occupies huge parts of my spare time, so I am not always in the mood
| to sit down and write some lines of code. Too complex stuff to split
| the entire routine into small and handy parts...

Ok, I'm looking forward....
__
wolfgang

bv_schornak

unread,
Mar 30, 2003, 1:53:08 PM3/30/03
to
wolfgang kern wrote:

>Ok, 1818382E-7 ="xx 03 F9 FF 0E BF 1B 00.."
> |fl|sz| exp |mantissa
>

Yep, seen from the buffer's 1st byte... ;)

>| ------------------------------------------------------------------
>| | BUY SURPLUS DIGITS AT EBAY! |
>| |------------------------------------------------------------------|
>| | |
>| | There are 13 surplus digits in this number: |
>| | |
>| | -123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF |
>| | F0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCD |
>| | E + 0404 |
>| | |
>| | What do you want to do today? |
>| | |
>| | [truncate] [increment] [round 7/8] [round 5/4] [use random data] |
>| ------------------------------------------------------------------
>

It's just a joke...

>| (Ok, still 8 days until 2003/04/01...) :-D
>

This is a discreet notice, that the above message box is a joke (for
1st of April)...

>| Nope, I prefer to round all surplus digits, whenever it is necessary
>| to get rid of them. It provides the best accuracy...
>:)
>Not that way, I use four user-definable global values:
>[all in decimal digits]
>1 calculation precision (this gives the buffer-size),
>2 display precision and
>3 digits involved for rounding (easier view),
>4 writeback precision.
>

Okay, I keep #1 and #2 for BinLobster... ;)
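
[C sketch - precision settings]
In C those four global values are nothing more than a small settings record
(names invented here, all values in decimal digits):

  struct precision_cfg {
      unsigned calc_digits;        /* 1: calculation precision -> buffer size */
      unsigned display_digits;     /* 2: display precision                    */
      unsigned rounding_digits;    /* 3: digits involved in rounding          */
      unsigned writeback_digits;   /* 4: writeback precision                  */
  };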

>| Wasn't my best day, I guess... ;)
>:) I had a bad week, searching for a strange silly bug,
> finally I found a routine which used a wrong parameter-block.
> I just searched for something else....
>

Reminds me of a long search for an error. Finally I found out
that it was caused by clobbered registers. They were not restored in an
OS/2 function which was called during execution of my own code. Now
I have "wrappers" for every OS/2 and C function I'm using...

>| >I would use self-rounding only if the figure exceeds the destination size.
>
>| Which is the _only_ reason to start rounding, anyway...
>Yes, except for display short-cut.
>

Right. Would be annoying for the user to scroll through 256 digits!

>| >| The conversion actually is using my old BCD calculator. Some byte...qword
>| >| conversion functions are available in my libraries, but they can't handle
>| >| larger numbers.
>
>| >I think only display(input/print) will need conversion,
>| >storage and calculation can all be done in the chosen numeric format,
>| >exponent = signed integer; mantissa = unsigned 2^n string.
>
>| Yep, that's what I was talking about... ;)
>
>My bin->ASCII conversion use scan/sub in the 10^x semilog-LUT.
>As your largest integer would be 2^767 (96 bytes) = 230 dec.digits,
>a table would have 1 + 230*9 = 2071 entries, 96 bytes each (<200Kb).
>Without a table you need to loop a divide by 10 for conversion.
>

It's still (and always will be) 112 bytes. :)

If I remember right - my conversion works with repeated addition and
subtraction sequences (equal to your division)...
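
[C sketch - conversion without a table]
The table-less way mentioned above ("loop a divide by 10") is short to
write down in C: divide the little-endian mantissa by 10 in place, collect
the remainders as decimal digits, reverse them. Invented names, only meant
to show the loop.

  #include <stdint.h>
  #include <stddef.h>

  /* divide m[0..n-1] (little-endian) by 10 in place, return the remainder */
  static unsigned div10(uint8_t *m, size_t n)
  {
      unsigned rem = 0;
      for (size_t i = n; i-- > 0; ) {              /* most significant byte first */
          unsigned cur = rem * 256 + m[i];
          m[i] = (uint8_t)(cur / 10);
          rem  = cur % 10;
      }
      return rem;
  }

  static int is_zero(const uint8_t *m, size_t n)
  {
      for (size_t i = 0; i < n; i++)
          if (m[i]) return 0;
      return 1;
  }

  /* destroys m; out must have room for all digits plus the terminating 0 */
  static void to_decimal(uint8_t *m, size_t n, char *out)
  {
      char   tmp[700];                             /* enough for a 256-byte mantissa */
      size_t k = 0;
      do {
          tmp[k++] = (char)('0' + div10(m, n));    /* least significant digit first  */
      } while (!is_zero(m, n));
      for (size_t i = 0; i < k; i++)               /* reverse into the output        */
          out[i] = tmp[k - 1 - i];
      out[k] = '\0';
  }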


[Integer multiply]


>The above two lines can be replaced with 'jmp x1'
>as the buffer must be cleared before,
>and this 0xE0 limit will be wrong after "add ebx,4" anyway.
>

If I use it, I would make the index registers static, that is - only
the counters are changing their values. As you may have noticed, the
amount of registers is too small (iNTEL's outstanding design)... ;)

>| These are 32*32 (1024) loops, if I count right.
>
>In fact the count is less than that:
>28*28 plus the value-dependent carry-over loop count.
>

Ok, your format has 4 double words less than BinLobster's...

>| Using EBP in mixed code could be a problem, if parameters were passed
>| on the stack (in C it's common practice to pass data this way)...
>
>Shouldn't be a problem as the EBP-value is preserved here anyway.
>

Logisch... (oops - "natural", of course).

>| >this is already tested code,
>| > ecx is destroyed (is 0x70),
>| >but it fits exactly one cache-line (64-bytes),
>| >and needs [worst case: all bits set] about 13500
>| > clock-cycles for a 96*96 byte multiply with a 192 byte result.
>| > (70 clks/result-byte)
>| > and this is much lesser than my previous estimation:
>| > "...and a worst case multiply of 128*128 bytes with 256 bytes result
>| > will be done within max. 30000 clock-cycles." ;(117 clks/result-byte)
>
>| >Will be hard to beat that... :) , but don't hesitate to change my mind.
>
>| Thinking about the different solutions...
>
>Let's see..
>

Yesterday - the first day this week - I had more than three hours of
spare time. No time left for serious work right now. At the moment I
am creating a fan-page for my former band, so there's even less time
left... ;)

>[... zero-extended buffering]
>
>| If the operands are passed from a C-function, they're on the stack...
>| In any other case, I wouldn't use the stack to store larger operands.
>
>Ok, if you will support C, I don't support any HLL-stuff at all.
>

ST-system is using C for anything related to the windowing system or
OS/2 specific functions. Would be a lot of work, if I would have to
write all this stuff in assembler. The "main program core" always is
written in C. The assembler functions are called from the C program,
and they all follow the C definition to pass parameters - pushing in
reverse order, so we have all parameters in the "right" order on the
stack...

>| No relation to the graphic card - just a definition for the graphical
>| subsystem of the OS. If I understand the Redbooks right, bitmaps are
>| stored with the lower left pel first and upper right pel last (as you
>| get the location displayed in each picture manipulating program)...
>
>Ok, bitmaps may exist in various storage formats,
>and the different tools show different zero-points,
>eg: with pixel-edit(Labview) you can select your preferred Y-direction.
>

No, there is only _one_ definition for all bitmap formats. And yes -
each program may handle the [X,Y] parameters in a different way. I'm
using PMview and PicturePublisher, and both are using the lower left
corner as [0,0]...

>| >Yes, just locate the mantissa-buffers at aligned address.
>| Is done. Now we have no byte left for overflows _after_ a multiply...
>
>A multiplication cannot overflow as long as
> the sum of the factor byte counts <= the result byte count [FF*FF = FE01],
> therefore I sized the buffers twice as large as the largest variables.
>

That's the reason why I limited the amount of (OP1 + OP2) digits to
256 in my first BCD routine. But if we do some more additions, then
we might exceed the 224 bytes. After reading this again, I see that
there still are 16 bytes left -> 224 + 16 = 240...
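
[C check - product size bound]
A two-line check of the "cannot overflow" rule: an m-byte by n-byte product
always fits in m+n bytes, the single-byte worst case being FF*FF = FE01.

  #include <assert.h>
  #include <stdint.h>

  int main(void)
  {
      assert((uint16_t)(0xFFu * 0xFFu) == 0xFE01u);      /* 1+1 -> 2 bytes        */
      uint64_t a = 0xFFFFFFu;                            /* largest 3-byte value  */
      assert(a * a <= 0xFFFFFFFFFFFFu);                  /* 3+3 -> 6 bytes        */
      return 0;
  }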

>They may learn about tri-lingual posting ? :)
>

Including C, German, Bavarian and Austrian - septa-lingual? ;)

>My disassembler, which is just there to immediately show the meaning of
>my hex-input, uses its very own syntax (common to different CPUs),
>eg: it says LD and ST instead of MOV and displays "AL=CY" for code D6,
>and it always shows the segments involved and the action performed in addition.
>So I often have to look up how the original Intel-syntax names things,
>especially for the illogical conditional branch terminology,
>ie: jnbe (code 77) -> "jr ncnz >"
> jnle (code 7F) -> "jr nz S>".
>

I think I even would believe that you're poking holes into pieces of
paper to save the costs for punch cards... ;)

>| I'm coming back if I finished my MUL routine - at the moment, my job
>| occupies huge parts of my spare time, so I am not always in the mood
>| to sit down and write some lines of code. Too complex stuff to split
>| the entire routine into small and handy parts...
>
>Ok, I'm looking forward...
>

I will sit down this week to finish it - promised!

wolfgang kern

unread,
Apr 1, 2003, 1:24:01 AM4/1/03
to

Bernhard wrote:
[...]
[... bin->ASCII conversion]

| It's still (and always will be) 112 bytes. :)

Your mantissa is now 112 bytes large?

| If I remember right - my conversion works with repeated addition and
| subtraction sequences (equal to your division)...

I'm still in search for a better divide solution,
a modified Newton-Raphson method works also with integers,
but needs too many iterations,
and my partial log with the remaining factor calculation
is even worse for large BCD figures.
Next I'll try a byte based log-table (instead of 1..9 decimal),
a log-(base 256)-table will need just 256 entries,....
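
[C sketch - Newton-Raphson reciprocal]
For the record, the Newton-Raphson step for a reciprocal is
x(k+1) = x(k) * (2 - d*x(k)); a toy double-precision version shows the
quadratic convergence. The integer/fixed-point variant for the big
mantissas is the hard part - this only shows the shape of the iteration.

  #include <stdio.h>

  int main(void)
  {
      double d = 7.0;                    /* divide by d via its reciprocal     */
      double x = 0.1;                    /* rough first guess for 1/d          */
      for (int k = 0; k < 6; k++) {
          x = x * (2.0 - d * x);         /* error roughly squares each step    */
          printf("step %d: x = %.15f\n", k, x);
      }
      printf("35 / 7 = %.15f\n", 35.0 * x);
      return 0;
  }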


| [Integer multiply]

| If I use it, I would make the index registers static, that is - only
| the counters are changing their values. As you may have noticed, the
| amount of registers is too small (iNTEL's outstanding design)... ;)

Yes, there are a lot of registers in the CPU (unused MSR; MMX, ...),
but just not connected to the ALU or the address-generator.

You can loop "ADC [edi]; 4 times INC edi" as well.

| Yesterday - the first day this week - I had more than three hours of
| spare time. No time left for serious work right now. At the moment I
| am creating a fan-page for my former band, so there's even less time
| left... ;)


| >[... zero-extended buffering]
| >| If the operands are passed from a C-function, they're on the stack...
| >| In any other case, I wouldn't use the stack to store larger operands.

| >Ok, if you will support C, I don't support any HLL-stuff at all.

| ST-system is using C for anything related to the windowing system or
| OS/2 specific functions. Would be a lot of work, if I would have to
| write all this stuff in assembler. The "main program core" always is
| written in C. The assembler functions are called from the C program,
| and they all follow the C definition to pass parameters - pushing in
| reverse order, so we have all parameters in the "right" order on the
| stack...

As I'm not bound to any "big-one", I did it the opposite way,
my kernel and all functions are written in fast 'direct code',
while some applications allow the user to define functions
in a literal form.

| >Ok, bitmaps may exist in various storage formats,
| >and the different tools show different zero-points,
| >eg: with pixel-edit(Labview) you can select your preferred Y-direction.

| No, there is only _one_ definition for all bitmap formats.

??? Ever tried to edit a 16-color win2.0-bmp with win98 tools ???
"not a valid bitmap format!?"

| And yes -
| each program may handle the [X,Y] parameters in a different way. I'm
| using PMview and PicturePublisher, and both are using the lower left
| corner as [0,0]...

Ok.

| >A multiplication cannot overflow as long as
| > the sum of the factor byte counts <= the result byte count [FF*FF = FE01],
| > therefore I sized the buffers twice as large as the largest variables.

| That's the reason, why I limited the amount of (OP1 + OP2) digits to
| 256 in my first BCD routine. But if we do some more additions, then
| we might exceed the 224 bytes. After reading this again, I see, that
| there still are 16 bytes left -> 224 + 16 = 240...

| > They may learn about tri-lingual posting ? :)
| Including C, German, Bavarian and Austrian - septa-lingual? ;)

:)
"mea wia drei is fei zfü !" -> "multy-lingual"

| I think I even would believe that you're poking holes into pieces of
| paper to save the costs for punch cards... ;)

You're not that wrong,
if you see the fuses in a logic-array as punch-holes.
It's not a matter of costs, why I don't use the compilers "comfort",
I just couldn't find any which work as I would expect.

How would you convince your compiler to fit my memory-model:

"FLAT-DATA,
ANY-code (mixed real16/Pm16/Pm32, selfmodifying),
MONO-stack (code-aligned SS, only exceptions may switch ESP),
unprotected (all privilege0)".

And what will your debugger do
if you single step a changing 'LIDT' or PM/REAL switch-instruction?

| >| I'm coming back if I finished my MUL routine - at the moment, my job
| >| occupies huge parts of my spare time, so I am not always in the mood
| >| to sit down and write some lines of code. Too complex stuff to split
| >| the entire routine into small and handy parts...
| >
| >Ok, I'm looking forward...
| >
|
| I will sit down this week to finish it - promised!

[I just keep in mind:
integer MUL: 64 bytes code, 70 clock-cycles per result-byte]
__
wolfgang

bv_schornak

unread,
Apr 5, 2003, 12:47:35 PM4/5/03
to
wolfgang kern wrote:

>| It's still (and always will be) 112 bytes. :)
>
>Your mantissa is now 112 bytes large?
>

Long time ago, it was 256 bytes, until you "convinced" me to use hex
numbers. The first design with 128 bytes was redesigned to 112 bytes
because of the resulting amount of BCD digits (_should_ be something
around 260 digits now)... ;)

>| If I remember right - my conversion works with repeated addition and
>| subtraction sequences (equal to your division)...
>
>I'm still in search for a better divide solution,
>a modified Newton-Raphson method works also with integers,
>but needs too many iterations,
>and my partial log with the remaining factor calculation
>is even worse for large BCD figures.
>Next I'll try a byte based log-table (instead of 1..9 decimal),
>a log-(base 256)-table will need just 256 entries,....
>

Maybe I'm trying to code it with subtractions, only - then we have a
"reference design" to compare other solutions with...

>| [Integer multiply]
>
>| If I use it, I would make the index registers static, that is - only
>| the counters are changing their values. As you may have noticed, the
>| amount of registers is too small (iNTEL's outstanding design)... ;)
>
>Yes, there are a lot of registers in the CPU (unused MSR; MMX, ...),
>but just not connected to the ALU or the address-generator.
>

There are some more (DRx and CRx), but they have one thing in common
- you can't use them for "ordinary" work...

>You can loop "ADC [edi]; 4 times INC edi" as well.
>

If avoidable - I don't want to change the index registers!

BTW, it doesn't make sense to replace _one_ instruction with _four_!

>| ST-system is using C for anything related to the windowing system or
>| OS/2 specific functions. Would be a lot of work, if I would have to
>| write all this stuff in assembler. The "main program core" always is
>| written in C. The assembler functions are called from the C program,
>| and they all follow the C definition to pass parameters - pushing in
>| reverse order, so we have all parameters in the "right" order on the
>| stack...
>
>As I'm not bound to any "big-one", I did it the opposite way,
> my kernel and all functions are written in fast 'direct code',
> while some applications allow the user to define functions
> in a literal form.
>

Depends on what you are doing and how much time you have to code it.
I am running OS/2 and my programs are running on this platform, too.
So it's natural, that I use the common tools which are available for
my platform (for free!)...

>| No, there is only _one_ definition for all bitmap formats.
>
>??? Ever tried to edit a 16-color win2.0-bmp with win98 tools ???
>"not a valid bitmap format!?"
>

Here you find _all_ definitions for bitmap (*.bmp) file formats:
<http://www.tanveer.freeservers.com/programming/bitmap_htm.htm>!

If Windoze can't read it, try PMview! ;)

>:)
>"mea wia drei is fei zfü !" -> "multy-lingual"
>

Fui z'fui (but I speak 6 of them and understand 'em all)... ;)

>| I think I even would believe that you're poking holes into pieces of
>| paper to save the costs for punch cards... ;)
>
>You're not that wrong,
> if you see the fuses in a logic-array as punch-holes.
>It's not a matter of costs, why I don't use the compilers "comfort",
>I just couldn't find any which work as I would expect.
>

As I'm not smart enough to learn the hex codes for each instruction,
I have to live with my GCC/2 from 1994. But I prefer it (rather than
other compilers), because I didn't have to visit a university before
I could use it...

>How would you convince your compiler to fit my memory-model:
>
>"FLAT-DATA,
> ANY-code (mixed real16/Pm16/Pm32, selfmodifying),
> MONO-stack (code-aligned SS, only exceptions may switch ESP),
> unprotected (all privilege0)".
>

It wouldn't know what to do with all that weird stuff. Remember that
I'm still running OS/2?

But - if you have something better than OS/2, how much is a license?
Would be nice if it can run LE executables! ;)

>And what will your debugger do
>if you single step a changing 'LIDT' or PM/REAL switch-instruction?
>

What the heck is a debugger? I'm using my "register dump" whenever I
need to find an error (together with a lot of _patience_ and a small
amount of _braincells_)... ;)

>| I will sit down this week to finish it - promised!
>
>[I just keep in mind:
> integer MUL: 64 bytes code, 70 clock-cycles per result-byte]
>

Ok, I finished the MUL today, but had no time to compile and test it
until now. It needs 30 addition loops for the table and up to 32 * 2
loops for the "multiplication" - which is a maximum of 94 loops à 32
additions...

BinLobster.txt

wolfgang kern

unread,
Apr 6, 2003, 7:48:10 AM4/6/03
to

Bernhard wrote:

| Long time ago, it was 256 bytes, until you "convinced" me to use hex
| numbers. The first design with 128 bytes was redesigned to 112 bytes
| because of the resulting amount of BCD digits (_should_ be something
| around 260 digits now)... ;)

Ok. 112 bytes hold a maximum of 269 full decimal digits (2^896 - 1).
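
[C check - bytes vs. decimal digits]
The relation is just bits * log10(2): 269 full decimal digits always fit
into 896 bits, while the largest value 2^896 - 1 itself has 270 digits.
A quick stand-alone check, nothing from the calculator code:

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      int bits = 112 * 8;                                              /* 896-bit mantissa */
      printf("guaranteed digits: %d\n", (int)floor(bits * log10(2.0)));      /* 269 */
      printf("digits of 2^896-1: %d\n", (int)floor(bits * log10(2.0)) + 1);  /* 270 */
      return 0;
  }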

| >| If I remember right - my conversion works with repeated addition and
| >| subtraction sequences (equal to your division)...

| >I'm still in search for a better divide solution,
| >a modified Newton-Raphson method works also with integers,
| >but needs too many iterations,
| >and my partial log with the remaining factor calculation
| >is even worse for large BCD figures.
| >Next I'll try a byte based log-table (instead of 1..9 decimal),
| >a log-(base 256)-table will need just 256 entries,....

| Maybe I'm trying to code it with subtractions, only - then we have a
| "reference design" to compare other solutions with...

Yes, I also use my slow, but working CMP/SUB division for this.

| >| [Integer multiply]
| >
| >| If I use it, I would make the index registers static, that is - only
| >| the counters are changing their values. As you may have noticed, the
| >| amount of registers is too small (iNTEL's outstanding design)... ;)
| >
| >Yes, there are a lot of registers in the CPU (unused MSR; MMX, ...),
| >but just not connected to the ALU or the address-generator.

| There are some more (DRx and CRx), but they have one thing in common
| - you can't use them for "ordinary" work...

My BIOS uses DR0..3 and CR3 as scratch-pads during power-up.

| >You can loop "ADC [edi]; 4 times INC edi" as well.
| If avoidable - I don't want to change the index registers!

Push/pop?


| BTW, it doesn't make sense to replace _one_ instruction with _four_!

Yes, except this additional three clocks would save the carry-status.

| >| ST-system is using C for anything related to the windowing system or
| >| OS/2 specific functions. Would be a lot of work, if I would have to
| >| write all this stuff in assembler. The "main program core" always is
| >| written in C. The assembler functions are called from the C program,
| >| and they all follow the C definition to pass parameters - pushing in
| >| reverse order, so we have all parameters in the "right" order on the
| >| stack...

| >As I'm not bound to any "big-one", I did it the opposite way,
| > my kernel and all functions are written in fast 'direct code',
| > while some applications allow the user to define functions
| > in a literal form.

| Depends on what you are doing and how much time you have to code it.
| I am running OS/2 and my programs are running on this platform, too.
| So it's natural, that I use the common tools which are available for
| my platform (for free!)...

Ok.

| Here you find _all_ definitions for bitmap (*.bmp) file formats:
| <http://www.tanveer.freeservers.com/programming/bitmap_htm.htm>!

A neat site, I added this one to my floating papers stack.

| >"mea wia drei is fei zfü !" -> "multi-lingual"


| Fui z'fui (but I speak 6 of them and understand 'em all)... ;)

Six? Hm.. "hex-oral"? :}

| >| I think I even would believe that you're poking holes into pieces of
| >| paper to save the costs for punch cards... ;)

| >You're not that wrong,
| > if you see the fuses in a logic-array as punch-holes.
| >It's not a matter of costs, why I don't use the compilers "comfort",
| >I just couldn't find any which work as I would expect.

| As I'm not smart enough to learn the hex codes for each instruction,
| I have to live with my GCC/2 from 1994. But I prefer it (rather than
| other compilers), because I didn't have to visit a university before
| I could use it...

No need to be very smart for work with hex-codes,
I wrote all IA-32 codes incl. FPU-codes and timing onto one piece of
paper (two A4 pages), a second page shows all addressing codes
and a third all possible segment-descriptors.
I haven't finished my short-info for MMX and 3Dnow yet,
due the AMD-appendix-pages are short enough anyway.

| >How would you convince your compiler to fit my memory-model:
| >"FLAT-DATA,
| > ANY-code (mixed real16/Pm16/Pm32, selfmodifying),
| > MONO-stack (code-aligned SS, only exceptions may switch ESP),
| > unprotected (all privilege0)".

| It wouldn't know what to do with all that weird stuff.
| Remember that I'm still running OS/2?
| But - if you have something better than OS/2, how much is a license?
| Would be nice if it can run LE executables! ;)

I don't know much about OS/2 programming environment.
But you wont find any assembler/compiler which fits my demands.

KESYS-applications,-tools,-kernel,.. are incompatible to any other Os,
and no foreign code will ever be executed.
License comes together with an application-order only.
Sorry for my tools are "not for sale (yet)".

I once worked on an executable conversion routine,
but as most existing code is written(compiled) "that" ill,
I decided to write all code myself.

| >And what will your debugger do
| >if you single step a changing 'LIDT' or PM/REAL switch-instruction?

| What the heck is a debugger? I'm using my "register dump" whenever I
| need to find an error (together with a lot of _patience_ and a small
| amount of _braincells_)... ;)

I remember that story,
1. write a lot of source text
2. compile it to an executable
3. start a debugger (the one with the register-dump)
4. set break-points
5. run executable from the debugger
6. repeat until nerves exhausted...

Now I just type in:
hex-code, see immediate the disassembled image of it,
and may immediate single step/run(or trace until whatever)that code,
or start at any address and view/edit register-values at any time.

| >| I will sit down this week to finish it - promised!

| >[I just keep in mind:
| > integer MUL: 64 bytes code, 70 clock-cycles per result-byte]

| Ok, I finished the MUL today, but had no time to compile and test it
| until now. It needs 30 addition loops for the table and up to 32 * 2
| loops for the "multiplication" - which is a maximum of 94 loops à 32
| additions...

Give me some days to check the timing....

__
wolfgang


wolfgang kern

unread,
Apr 6, 2003, 11:19:35 AM4/6/03
to

| Give me some days to check the timing....

[BinLobster code]

I assume there are no typos in your code,
but I have some problems to check the functionality without typing in the code.
Must do it anyway to see the clock-cycles consumed.
I think there will be much more iterations than you estimated...

Nevertheless I added some optimise notes (can't hold myself...)
__
wolfgang

PS: appended text-files appear twice,
seems my OE or whoever automated appends *.txt files to the post.
=========================================
/*
---------------------------------------
mulOPS OP1 * OP2 = RES
---------------------------------------
-> EBX address LSB 1st OP
ESI 2nd OP
EDI RESULT
---------------------------------------
<- EAX 0000 0000 ok
08xx 0000 ERROR LDreq()
0801 xxxx ERROR OS/2 LDreq()
0Axx 0000 ERROR LDfre()
0A01 xxxx ERROR OS/2 LDfre()
---------------------------------------
*/

.globl _mulOPS
_mulOPS:

/*
---------------
allocate buffer
---------------
*/


push ecx
xor eax,eax
push edx

xor ecx,ecx
push eax # ld.AmtBy = 0
push ecx # ld.Foffs = 0
push eax # ld.Moffs = 0
push ecx # ld.FileN = not used
push eax # ld.MMoff = OUTPUT
push 0x00000033 # ld.LdCtl = (see description)
push 0xFFFFFFC0 # ld.FldNr = standard mem only
push 0x00001000 # ld.MemSz = 0x1000 byte
push ecx # ld.MemEA = OUTPUT
call _LDreq # request for memory allocation
pop ecx
add esp,0x0000000C
pop edx
add esp,0x00000010
/*
----------------
check for errors
----------------
*/
cmp eax,0x00000000 ; or eax,eax is shorter
je 0 (forwards) ; jne iMULy (code 0F 85 xx xx xx xx)
jmp iMULy ; -1 clk and no flush
/*
------------------
initialize buffers
------------------
*/
0:push edx # ST-MemHandle
push ecx # base address buffers
call _chkOPS ; ?
/*
--------------
check, if zero
--------------
*/
bt byte[ebx],0x04
jae 0 (forwards)
jmp iMULy ; btw: why not long cc-branches?
0:bt byte[esi],0x04 ; jc iMULy (code 0F 82 xx xx xx xx)
jae 0 (forwards)
jmp iMULy
/*
-------------
exchange OPs?
-------------
*/
0:cmp al,0x02
jne 1 (forwards)
xchg ebx,esi # exchange index registers
or byte[edi],0x08 # set exchanged flag
/*
------------------
set sign of result
------------------
*/
1:bt byte[ebx],0x00
jae 2 (forwards)
bt byte[esi],0x00
jb 3 (forwards)
or byte[edi],0x01 # -OP1 * +OP2
jmp 4 (forwards)
2:bt byte[esi],0x00
jae 3 (forwards)
or byte[edi],0x01 # +OP1 * -OP2
jmp 4 (forwards)
3:and byte[edi],0xFE # signs are equal
/*
--------------
create buffers
--------------
*/
4:push edi
xor edx,edx
push ebx # save EBX
mov dl,byte[ebx + 0x01]
push esi # save ESI
add dl,0x04
mov edi,dword[esp +0x0C] ;ok. previous pushed ecx
mov dh,0x0F
mov esi,edi # ESI = buffer "00"
push ebp
shr dl,0x02
add edi,0x0080 # EDI = buffer "01"
push edx # store EDX
xor ecx,ecx
/*
------------------
LOOP multiples * 1
------------------
*/
5:mov eax,dword[ebx + ecx * 4 + 0x10]
adc eax,dword[esi + ecx * 4 + 0x00]
mov dword[edi + ecx * 4 + 0x00],eax
inc cl
dec dl
jne 5 (backwards)
dec dh
je 6 (forwards)
mov dl,byte[esp] ;ok. last pushed edx/dl
mov esi,edi # ESI = last result
xor ecx,ecx
add edi,0x0080 # EDI = next result
jmp 5 (backwards)
/*
-------------------
prep multiples * 10
-------------------
*/
6:mov esi,edi # ESI = buffer "0F"
mov dl,byte[esp]
add edi,0x0100 # EDI = buffer "10"
xor ecx,ecx
7:mov eax,dword[ebx + ecx * 4 + 0x10]
adc eax,dword[esi + ecx * 4 + 0x00]
mov dword[edi + ecx * 4 + 0x00],eax
inc cl
dec dl
jne 7 (backwards)
mov ebx,edi # EBX = new base
mov dx,word[esp]
inc byte[esp] # now 1 digit more
mov esi,edi # ESI = buffer "10"
dec dh
add edi,0x0080 # EDI = buffer "20"
xor ecx,ecx
/*
-------------------
LOOP multiples * 10
-------------------
*/
8:mov eax,dword[ebx + ecx * 4]
adc eax,dword[ebx + ecx * 4]
mov dword[edi + ecx * 4],eax
inc cl
dec dl
jne 8 (backwards)
dec dh
je 9 (forwards)
mov esi,edi # ESI = last result
mov dl,byte[esp]
add edi,0x0080 # EDI = next result
xor ecx,ecx
jmp 8 (backwards)
/*
-------------
prepare LOOPS
-------------
*/
9:mov ebx,dword[esp + 0x08] # EBX = base OP2
xor edx,edx
mov edi,dword[esp + 0x10] # EDI = base RES
xor ebp,ebp
mov dl,byte[esp] # EDX = digits table
mov dh,byte[ebx + 0x01] # DH = digits OP2
mov esi,dword[esp + 0x14] # ESI = base table
dec dl
/*
----------------
LOOP through OP2
----------------
-----------
lower digit
-----------
*/
iMULw:push ebp # store byte count
mov cl,dword[ebx + ebp * 1 + 0x10] # get byte
xor eax,eax ; why?
and cl,0x0F # lower digit
je 2 (forwards)
shl ecx,0x05 # * 0x20!
xor ebp,ebp # EBP = temp count!
/*
-----------
01 addition
-----------
*/
1:mov eax,dword[edi + ebp * 4 + 0x10] # get dword RES
adc eax,dword[esi + ecx * 4 + 0x00] # add dword table
mov dword[edi + ebp * 4 + 0x10],eax # store result
inc cl ; inc ecx : 0F*20 = 01E0
inc ebp
dec dl
jne 1 (backwards)

;; as destination and one operand are identical, you may save one line:
;; 1: mov eax,dword[esi + ecx * 4 + 0x00]
;; adc dword[edi + ebp * 4 + 0x10],eax ; add to result
;; inc ecx
;; inc ebp
;; dec dl
;; jne 1(b)
/*
-----------
upper digit
-----------
*/
2:mov ebp,dword[esp] # read byte count
mov cl,byte[ebx + ebp * 1 + 0x10] # get byte
shr cl,0x04 # upper digit
je 4 (forwards)
shl ecx,0x05 # * 0x20!
;? b0..4 will always be 0 ?
mov dl,byte[esp + 0x04] # DL = digits table
add ecx,0x0200 # + 200 dwords!
xor ebp,ebp # EBP = temp count!
/*
-----------
10 addition
-----------
*/
3:mov eax,dword[edi + ebp * 4 + 0x10] # get dword RES
adc eax,dword[esi + ecx * 4 + 0x00] # add dword table
mov dword[edi + ebp * 4 + 0x10],eax # store result
inc cl
inc ebp
dec dl
jne 3 (backwards)
;; same note as to "01 addition"
/*
----------
loop logic
----------
*/
4:pop ebp # restore byte count
dec dh
je 5 (forwards)
inc ebp
jmp iMULw
/*
--------
clean up
--------
*/
5:pop edx
pop ebp
pop esi
pop ebx
pop edi
/*
-----------------
free mem and exit
-----------------
*/
btr byte[edi],0x04 # reset exchanged flag
jae 6 (forwards)
xchg ebx,esi # exchange index registers
6:add esp,0x04 # remove base address
call _LDfre # free memory
add esp,0x04 # remove MemHandle
iMULy:pop edx
pop ecx
ret
========== a copy of my translation help ====================================
Conditional branches description
syntax:|ill-|ill-|logic|math|
0F 80 cw/cd JO rel16/32 Jump near if overflow (OF=1)
0F 81 cw/cd JNO rel16/32 Jump near if not overflow (OF=0)
0F 82 cw/cd JB JNAE JC < rel16/32 Jump near if not above or equal (CF=1)
0F 83 cw/cd JNB JAE JNC >= rel16/32 Jump near if not below (CF=0)
0F 84 cw/cd JE JZ = rel16/32 Jump near if equal (ZF=1)
0F 85 cw/cd JNE JNZ <> rel16/32 Jump near if not equal (ZF=0)
0F 86 cw/cd JNA JBE <= rel16/32 Jump near if not above (CF=1 or ZF=1)
0F 87 cw/cd JA JNBE > rel16/32 Jump near if not below or equal (CF=0 and ZF=0)
0F 88 cw/cd JS <0 rel16/32 Jump near if sign (SF=1)
0F 89 cw/cd JNS >=0 rel16/32 Jump near if not sign (SF=0)
0F 8A cw/cd JPE JP rel16/32 Jump near if parity even (PF=1)
0F 8B cw/cd JPO JNP rel16/32 Jump near if parity odd (PF=0)
0F 8C cw/cd JNGE JL S < rel16/32 Jump near if not greater or equal (SF<>OF)
0F 8D cw/cd JGE JNL S>= rel16/32 Jump near if not less (SF=OF)
0F 8E cw/cd JNG JLE S<= rel16/32 Jump near if not greater (ZF=1 or SF<>OF)
0F 8F cw/cd JG JNLE S > rel16/32 Jump near if greater (ZF=0 and SF=OF)

the better known rel8 conditional branches (code 70..7F) and
CMOV (code 0F 40..4F) also use this 16 conditions in equal order.

FCMOV conditions work different:
DA C0+i FCMOVB ST(0), ST(i) Move if below (CF=1)
DB C0+i FCMOVNB ST(0), ST(i) Move if not below (CF=0)
DA C8+i FCMOVE ST(0), ST(i) Move if equal (ZF=1)
DB C8+i FCMOVNE ST(0), ST(i) Move if not equal (ZF=0)
DA D0+i FCMOVBE ST(0), ST(i) Move if below or equal (CF=1 or ZF=1)
DB D0+i FCMOVNBE ST(0), ST(i) Move if not below or equal (CF=0 and ZF=0)
DA D8+i FCMOVU ST(0), ST(i) Move if unordered (PF=1)
DB D8+i FCMOVNU ST(0), ST(i) Move if not unordered (PF=0)
============================================================================

bv_schornak

unread,
Apr 6, 2003, 1:49:30 PM4/6/03
to
wolfgang kern wrote:


[operand size]

>Ok. 112 bytes hold a maximum of 269 full decimal digits (2^896 - 1).
>

Yep! ( Back to earth? ;) )


[DIV]

>| Maybe I'm trying to code it with subtractions, only - then we have a
>| "reference design" to compare other solutions with...
>
>Yes, I also use my slow, but working CMP/SUB division for this.
>

Ok, in progress (if I find the time to code it)!


[unused / unusable registers]

>| There are some more (DRx and CRx), but they have one thing in common
>| - you can't use them for "ordinary" work...
>
>My BIOS uses DR0..3 and CR3 as scratch-pads during power-up.
>

Are there any tricks I have to care about? Or could I use them like
the general registers (like EDI, ESI, etc.)? I never found any info
about this stuff until now...


[MUL - Wolfgang's version]

>| >You can loop "ADC [edi]; 4 times INC edi" as well.
>| If avoidable - I don't want to change the index registers!
>Push/pop?
>

I'm dreaming of a solution where only the counters (multipliers) are
changing values. Other people are dreaming of a white christmas - so
I am not too far off... ;)

>| BTW, it doesn't make sense to replace _one_ instruction with _four_!
>Yes, except this additional three clocks would save the carry-status.
>

And - if paired with the proper instructions - could be used to fill
the pipes for better performance...


[bitmaps]

>| Here you find _all_ definitions for bitmap (*.bmp) file formats:
>| <http://www.tanveer.freeservers.com/programming/bitmap_htm.htm>!
>
>A neat site, I added this one to my floating papers stack.
>

I have it in my bookmarks -> ST/info/programming...


[gas -> LETNi translation]

>| >"mea wia drei is fei zfü !" -> "multi-lingual"
>| Fui z'fui (but I speak 6 of them and understand 'em all)... ;)
>
>Six? Hm.. "hex-oral"? :}
>

:-D :-D :-D (Translation: "Hahaha!")

Hard to top this one - you win! ;)


[compiler vs. brain]

>| As I'm not smart enough to learn the hex codes for each instruction,
>| I have to live with my GCC/2 from 1994. But I prefer it (rather than
>| other compilers), because I didn't have to visit a university before
>| I could use it...
>
>No need to be very smart for work with hex-codes,
> I wrote all IA-32 codes incl. FPU-codes and timing onto one piece of
> paper (two A4 pages), a second page shows all addressing codes
> and a third all possible segment-descriptors.
> I haven't finished my short-info for MMX and 3Dnow yet,
> due the AMD-appendix-pages are short enough anyway.
>

Ok, your method has the advantage, that you've complete control over
every byte in your executable - which is not guaranteed if you use a
compiler (every compiler makes decisions for you - a behaviour which
isn't wanted by some programmers).

One disadvantage is, that you have to learn this method (and all the
opcodes - at least the most used ones). Another "giant" disadvantage
is the time you need for coding.

Thus - deep respect, if you're able to write code this way! But it's
not the thing I have in mind for myself. As my time is very limited,
I can't waste it this way, so I "look away", if the compiler inserts
code I did not write and things like that, because it saves a lot of
my time. And - I'm still alive, if coding finally is done... ;)

>I don't know much about OS/2 programming environment.
>But you wont find any assembler/compiler which fits my demands.
>

Time to write one by your own?

>KESYS-applications,-tools,-kernel,.. are incompatible to any other Os,
>and no foreign code will ever be executed.
>License comes together with an application-order only.
>Sorry for my tools are "not for sale (yet)".
>

I probably couldn't afford it... ;)

>I once worked on an executable conversion routine,
>but as most existing code is written(compiled) "that" ill,
>I decided to write all code myself.
>

That's the reason why I started programming, anyway...


[debugger vs. manual debugging]

>| What the heck is a debugger? I'm using my "register dump" whenever I
>| need to find an error (together with a lot of _patience_ and a small
>| amount of _braincells_)... ;)
>
>I remember that story,
>1. write a lot of source text
>2. compile it to an executable
>3. start a debugger (the one with the register-dump)
>4. set break-points
>5. run executable from the debugger
>6. repeat until nerves exhausted...
>

In my case, it looks like this:

1.0. Write source code.
2.0. Run compiler.
3.0. Ready... or
3.1. Insert some "call _monTST" lines.
4.0. Run Compiler.
5.0. Repeat step 3.0. as often as necessary!

_monTST stores the current contents of EAX, EBX, ECX, EDX, EDI, ESI,
EBP and ESP in a 32 byte line within an allocated memory block. Each
call increments a counter, so up to 128 "dumps" can be stored. It is
possible to store the contents of the memory block in a file or it's
displayed in a dialog box.

I don't need more than the register's contents to determine, where a
bug might be... ;)
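
[C sketch - trace dump]
The same idea in C terms - a fixed number of slots, one line per call, dump
on request. This is not the real _monTST, just its shape with invented names:

  #include <stdint.h>
  #include <stdio.h>

  #define SLOTS 128

  static uint32_t trace[SLOTS][8];       /* one 32-byte line: eight 32-bit values */
  static unsigned trace_count;

  void mon_tst(const uint32_t regs[8])   /* call it where a breakpoint would go   */
  {
      if (trace_count < SLOTS) {
          for (int i = 0; i < 8; i++)
              trace[trace_count][i] = regs[i];
          trace_count++;
      }
  }

  void mon_dump(FILE *out)               /* write the collected lines             */
  {
      for (unsigned n = 0; n < trace_count; n++) {
          fprintf(out, "%3u:", n);
          for (int i = 0; i < 8; i++)
              fprintf(out, " %08X", (unsigned)trace[n][i]);
          fputc('\n', out);
      }
  }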

>Now I just type in:
> hex-code, see immediate the disassembled image of it,
> and may immediate single step/run(or trace until whatever)that code,
> or start at any address and view/edit register-values at any time.
>

Sounds really comfortable...


[latest MUL]

>| Ok, I finished the MUL today, but had no time to compile and test it
>| until now. It needs 30 addition loops for the table and up to 32 * 2
>| loops for the "multiplication" - which is a maximum of 94 loops à 32
>| additions...
>
>Give me some days to check the timing....
>

Of course! Maybe I find some time to compile and check it myself. My
only problem ... I first have to recode routines and functions which
are involved with the calling and preparation.

bv_schornak

unread,
Apr 6, 2003, 4:52:07 PM4/6/03
to
wolfgang kern wrote:

[BinLobster code]

>I assume there are no typos in your code,
>but I have some problems to check the functionality without typing in the code.
>Must do it anyway to see the clock-cycles consumed.
>I think there will be much more iterations than you estimated...
>

Oops...

>Nevertheless I added some optimise notes (can't hold myself...)
>

Ok! I'm far away from being perfect...

>PS: appended text-files appear twice,
> seems my OE or whoever automated appends *.txt files to the post.
>

Mozilla shows only one copy (news.t-online.de and
news.freenet.de) - might be your Newsreader...

> cmp eax,0x00000000 ; or eax,eax is shorter
>

Altered!

> je 0 (forwards) ; jne iMULy (code 0F 85 xx xx xx xx)
> jmp iMULy ; -1 clk and no flush
>

...

> jmp iMULy ; btw: why not long cc-branches?
> 0:bt byte[esi],0x04 ; jc iMULy (code 0F 82 xx xx xx xx)
> jae 0 (forwards)
> jmp iMULy
>

jne iMULy

would be compiled as

je 0 (forwards)
jmp iMULy
0: ...

If I do this myself, I have better control over the
code which is compiled for real...


> mov dl,byte[esp] ;ok. last pushed edx/dl
>

Hope so! I made a plan of my stack, so I could read
the proper addresses from paper!

> iMULw:push ebp # store byte count
> mov cl,dword[ebx + ebp * 1 + 0x10] # get byte
> xor eax,eax ; why?
> and cl,0x0F # lower digit
>

Just for fun... ;)

As I could read in several articles, it "should" be
avoided to manipulate registers in two instructions
which follow each other... So I always try _not_ to
move data into a register and then manipulate it in
the next instruction.

Rumours? (If so, just delete the line!)

> inc cl ; inc ecx : 0F*20 = 01E0
> inc ebp
> dec dl
> jne 1 (backwards)
>

A blackout? Sudden attack of a wave of stupidity?
Who knows...

>;; as destination and one operand are identical, you may save one line:
>;; 1: mov eax,dword[esi + ecx * 4 + 0x00]
>;; adc dword[edi + ebp * 4 + 0x10],eax ; add to result
>

And here - Ladies and Gentlemen - you can see the
results of a modern "copy and paste" error, where
the author was too lazy to switch on his brain...

> -----------
> upper digit
> -----------
>*/
> 2:mov ebp,dword[esp] # read byte count
> mov cl,byte[ebx + ebp * 1 + 0x10] # get byte
> shr cl,0x04 # upper digit
> je 4 (forwards)
> shl ecx,0x05 # * 0x20!
> ;? b0..4 will always be 0 ?
>

ECX is used as a "pointer" to the current byte in
the appropriate _table_ entry. Because this is the
loop for the upper digits, it should be somewhere
in the range of 0x0800 ... 0x0FFF. (See below!)

>========== a copy of my translation help ====================================
>Conditional branches description
> syntax:|ill-|ill-|logic|math|
>0F 80 cw/cd JO rel16/32 Jump near if overflow (OF=1)
>0F 81 cw/cd JNO rel16/32 Jump near if not overflow (OF=0)
>0F 82 cw/cd JB JNAE JC < rel16/32 Jump near if not above or equal (CF=1)
>0F 83 cw/cd JNB JAE JNC >= rel16/32 Jump near if not below (CF=0)
>0F 84 cw/cd JE JZ = rel16/32 Jump near if equal (ZF=1)
>0F 85 cw/cd JNE JNZ <> rel16/32 Jump near if not equal (ZF=0)
>0F 86 cw/cd JNA JBE <= rel16/32 Jump near if not above (CF=1 or ZF=1)
>0F 87 cw/cd JA JNBE > rel16/32 Jump near if not below or equal (CF=0 and ZF=0)
>0F 88 cw/cd JS <0 rel16/32 Jump near if sign (SF=1)
>0F 89 cw/cd JNS >=0 rel16/32 Jump near if not sign (SF=0)
>0F 8A cw/cd JPE JP rel16/32 Jump near if parity even (PF=1)
>0F 8B cw/cd JPO JNP rel16/32 Jump near if parity odd (PF=0)
>0F 8C cw/cd JNGE JL S < rel16/32 Jump near if not greater or equal (SF<>OF)
>0F 8D cw/cd JGE JNL S>= rel16/32 Jump near if not less (SF=OF)
>0F 8E cw/cd JNG JLE S<= rel16/32 Jump near if not greater (ZF=1 or SF<>OF)
>0F 8F cw/cd JG JNLE S > rel16/32 Jump near if greater (ZF=0 and SF=OF)
>
>the better known rel8 conditional branches (code 70..7F) and
>CMOV (code 0F 40..4F) also use this 16 conditions in equal order.
>

My version of GAS only knows the 70...7F opcodes,
the other ones are not part of the header file! I
could add the opcodes and recompile GAS, but I am
not sure if they will run on an 80486 - which is
a "must be" condition!

---

About the MUL routine:

First - a 4096 byte block of memory is allocated.

Second - some work is done to detect if one operand
is zero, the signs are tested and the larger
operand is used to fill the table (maybe I should
do it the other way, because it could be a little
bit faster).

Third - the table is filled with "x * 1" results,
starting with 01, ending with 0F. Then we have to
set the results index to the base of the "x * 10"
buffer and put the result of "(x * F) + (x * 1)"
into that buffer. Now we set the OP1 index to the
base of the "x * 10" buffer (for the 1st multiply
OP2 is equal to OP1).

Fourth - the result index is restored to the base
of the result buffer, the OP1 index is set to the
base of OP2 (the table contains multiples of OP1)
and the OP2 index is set to the table's base.

Fifth - the outer loop reads a byte from OP2. The
upper digit is masked out, so the remaining value
is equal to the lower digit - which is multiplied
by 0x80, the size of one entry. Thus, the counter
points to the base of the entry which is added to
the result buffer. The same procedure is repeated
with the upper digit, but now we add 0x800 to the
counter (actually 0x200, because dwords are used)
for the "x * 10". multiples. This is done, until
all bytes of OP2 are processed.

---

Sorry! Writing the last paragraph, I checked that
I have forgotten to increment the result index...

Thus, the MUL adds every digit to 0x00[RES]! What
a shame...

I will post my new / corrected version as soon as
it is "debugged" - I try to fix it until tomorrow
evening!
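
[C sketch - nibble-LUT multiply]
Spelled out in C, the scheme described above - two tables of fifteen
multiples each (OP1*1..F and OP1*10h..F0h), one addition per non-zero
nibble of OP2, and the per-byte result offset that went missing - is
roughly the following. Operand size and names are invented; it is not the
BinLobster code itself.

  #include <stdint.h>
  #include <string.h>
  #include <stddef.h>

  #define NB 112                                   /* invented operand size          */

  /* dst[off..] += src (little-endian), with carry ripple                            */
  static void add_at(uint8_t *dst, const uint8_t *src, size_t len, size_t off)
  {
      unsigned carry = 0;
      for (size_t i = 0; i < len || carry; i++) {
          unsigned s = dst[off + i] + (i < len ? src[i] : 0) + carry;
          dst[off + i] = (uint8_t)s;
          carry = s >> 8;
      }
  }

  /* res must be zeroed and 2*NB bytes long                                          */
  static void mul_nibble_lut(const uint8_t *op1, const uint8_t *op2, uint8_t *res)
  {
      uint8_t lo[16][NB + 1], hi[16][NB + 1];      /* lo[d] = op1*d, hi[d] = op1*16d */
      uint8_t base16[NB + 1];
      memset(lo, 0, sizeof lo);
      memset(hi, 0, sizeof hi);

      for (int d = 1; d < 16; d++) {               /* build OP1*1 .. OP1*15          */
          memcpy(lo[d], lo[d - 1], NB + 1);
          add_at(lo[d], op1, NB, 0);
      }
      memcpy(base16, lo[15], NB + 1);              /* OP1*16 = OP1*15 + OP1          */
      add_at(base16, op1, NB, 0);
      for (int d = 1; d < 16; d++) {               /* build OP1*10h .. OP1*F0h       */
          memcpy(hi[d], hi[d - 1], NB + 1);
          add_at(hi[d], base16, NB + 1, 0);
      }

      for (size_t i = 0; i < NB; i++) {            /* i is the per-byte result offset */
          unsigned byte = op2[i];
          if (byte & 0x0F)
              add_at(res, lo[byte & 0x0F], NB + 1, i);
          if (byte >> 4)
              add_at(res, hi[byte >> 4], NB + 1, i);
      }
  }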

bv_schornak

unread,
Apr 7, 2003, 10:08:50 AM4/7/03
to
On Sun, 6 Apr 2003 20:52:07 UTC, bv_schornak <now...@schornak.de>
wrote:

> Sorry! Writing the last paragraph, I checked that
> I have forgotten to increment the result index...
>
> Thus, the MUL adds every digit to 0x00[RES]! What
> a shame...
>
> I will post my new / corrected version as soon as
> it is "debugged" - I try to fix it until tomorrow
> evening!


Done!

Has to be extended, because the routine now loops
three times with an "unaligned" EDI... I'll try to fix
it in the next few days!

Mariusz Supernak

unread,
Apr 6, 2003, 6:00:01 AM4/6/03
to

hgfujhgjhjhjh

--
With regards

[ : Mariusz Supernak :::: [r2...@klub.chip.pl] :::: : ]
[ : :::: [r2...@poczta.onet.pl] :::: : ]
[ : :::: [ http://mariuszsupernak.prv.pl] :::: : ]
[ : :::: [ http://licznik.cjb.net/ ] :::: : ]

r2b2

unread,
Apr 9, 2003, 4:45:01 AM4/9/03
to
Mariusz Supernak <r2...@klub.chip.pl> wrote:
[cut]
> hgfujhgjhjhjh
I'm sorry for the above-mentioned text. I was testing the Hamster program on a
local server.
I write programs in MASM32 and I read your group on and off.
I do not like HLA, because it requires MASM32.
--
Greetings from Poland

wolfgang kern

unread,
Apr 10, 2003, 9:31:40 AM4/10/03
to

Bernhard wrote:
[answer to both notes]

| [BinLobster code]

| >I assume there are no typos in your code,
| >but I have some problems to check the functionality without typing in the code.
| >Must do it anyway to see the clock-cycles consumed.
| >I think there will be much more iterations than you estimated...
| Oops...

| >Nevertheless I added some optimise notes (can't hold myself...)
| Ok! I'm far away from being perfect...

Nobody is.... I always find something when reviewing my code...

| Mozilla shows only one copy (news.t-online.de and
| news.freenet.de) - might be your Newsreader...

Ok.


| > mov dl,byte[esp] ;ok. last pushed edx/dl
| Hope so! I made a plan of my stack, so I could read
| the proper addresses from paper!

I just confirmed stack-order to better understand.

| > mov cl,dword[ebx + ebp * 1 + 0x10] # get byte
| > xor eax,eax ; why?
| > and cl,0x0F # lower digit
| Just for fun... ;)

| As I could read in several articles, it "should" be
| avoided to manipulate registers in two instructions
| which follow each other... So I always try _not_ to
| move data into a register and then manipulate it in
| the next instruction.
|
| Rumours? (If so, just delete the line!)

Only direct pipe dependencies may result in a clock penalty.
Just calculating operations (mul/div/shifts/add,or...),
where the result is needed in the next instruction will wait
for completion, while 'moves' won't need to wait.
(see AMD K7 pipes organisation for details, there are more than two..)
So strict instruction pairing doesn't always gain speed,
memory-alignment together with cache-aligned code-parts
(as 16-byte boundary and max.64/128-byte loops) gain much more.

| > inc cl ; inc ecx : 0F*20 = 01E0

| A blackout? Sudden attack of a wave of stupidity?
| Who knows...

Probably a very common trap .... :)

| >;; as destination and one operand are identical, you may save one line:
| >;; 1: mov eax,dword[esi + ecx * 4 + 0x00]
| >;; adc dword[edi + ebp * 4 + 0x10],eax ; add to result

| And here - Ladies and Gentlemen - you can see the
| results of a modern "copy and paste" error, where
| the author was too lazy to switch on his brain...

:)

| > shr cl,0x04 # upper digit
| > je 4 (forwards)
| > shl ecx,0x05 # * 0x20!
| > ;? b0..4 will always be 0 ?

| ECX is used as a "pointer" to the current byte in
| the appropriate _table_ entry. Because this is the
| loop for the upper digits, it should be somewhere
| in the range of 0x0800 ... 0x0FFF. (See below!)

Ok, and cl,0xF0
je 4(f)
shl ecx,1 ;is one byte shorter.

[cc-branches]

| My version of GAS only knows the 70...7F opcodes,
| the other ones are not part of the header file! I
| could add the opcodes and recompile GAS, but I am
| not sure if they will run on an 80486 - which is
| a "must be" condition!

IIRC the '0F 80' branches came with 32-bit architecture (i386),
(I skipped 386), but my old AMD-486 knows them for sure.
You see the disadvantage of being dependent on a compiler's skills,
...and most compilers don't even know "all" 286 instructions.


| ---
| About the MUL routine:

| First - a 4096 byte block of memory is allocated.

| Second - some work is done to detect if one operand is zero,
| the signs are tested and the larger
| operand is used to fill the table (maybe I should
| do it the other way, because it could be a little
| bit faster).

Ok, but shouldn't both be zero-extended anyway?

| Third - the table is filled with "x * 1" results,
| starting with 01, ending with 0F.

Up to here I understand:
fifteen table entries contain OP1*1 .. OP1*15, 128 bytes each.

| Then we have to set the results index to the base of the "x * 10"
| buffer and put the result of "(x * F) + (x * 1)"
| into that buffer. Now we set the OP1 index to the
| base of the "x * 10" buffer (for the 1st multiply
| OP2 is equal to OP1).

I don't catch the sense of the 16th entry,
it will be just a "four bit shift left" of OP1.
Seems I missed that these are again 15 entries,
OP1*10h .. OP1*F0h.

| Fourth - the result index is restored to the base
| of the result buffer, the OP1 index is set to the
| base of OP2 (the table contains multiples of OP1)
| and the OP2 index is set to the table's base.

| Fifth - the outer loop reads a byte from OP2. The
| upper digit is masked out, so the remaining value
| is equal to the lower digit - which is multiplied
| by 0x80, the size of one entry. Thus, the counter
| points to the base of the entry which is added to
| the result buffer.

Ok, this is the "nibble-MUL-LUT ADD".

| The same procedure is repeated
| with the upper digit, but now we add 0x800 to the
| counter (actually 0x200, because dwords are used)
| for the "x * 10". multiples. This is done, until
| all bytes of OP2 are processed.

Now it got a face ...
It avoids all large shift operations by using two sets of tables.
| ---

| Sorry! Writing the last paragraph, I checked that
| I have forgotten to increment the result index...

| Thus, the MUL adds every digit to 0x00[RES]! What
| a shame...

| I will post my new / corrected version as soon as
| it is "debugged" - I try to fix it until tomorrow
| evening!

After I now know the meaning of it, I can start typing.
This time I'll count keystrokes and the time needed for my input,
just to compare against HLL-asm :)

SUB-thread combined

| [unused / unusable registers]

| >| There are some more (DRx and CRx), but they have one thing in common
| >| - you can't use them for "ordinary" work...
| >
| >My BIOS uses DR0..3 and CR3 as scratch-pads during power-up.

| Are there any tricks I have to care about? Or could I use them like
| the general registers (like EDI, ESI, etc.)? I never found any info
| about this stuff until now...

No way to use it with ALU or for address-generation.
CR3 can't be (ab)used while paging is enabled.
DR0..3 may be used for temporary storage as long as the debugger doesn't
use them for 'true' (code/data/IO) break-points.

| [MUL - Wolfgang's version]

| >| >You can loop "ADC [edi]; 4 times INC edi" as well.
| >| If avoidable - I don't want to change the index registers!
| >Push/pop?

| I'm dreaming of a solution where only the counters (multipliers) are
| changing values. Other people are dreaming of a white christmas - so
| I am not too far off... ;)

A few days ago there was actually snow on our meadow.
Whenever needed or if free registers become rare,
I use two index-pointers "and" a base within one instruction

ie: adc dw [ecx+edx+displ32],eax

the two registers are used as variable index pointers,
the base will be stored within the code
as the 32-bit displacement.

Not sure your compiler can handle self-modifying code.

| >| BTW, it doesn't make sense to replace _one_ instruction with _four_!
| >Yes, except this additional three clocks would save the carry-status.

| And - if paired with the proper instructions - could be used to fill
| the pipes for better performance...

In some cases only, see other post also on this.

| [compiler vs. brain]


|
| >| As I'm not smart enough to learn the hex codes for each instruction,
| >| I have to live with my GCC/2 from 1994. But I prefer it (rather than
| >| other compilers), because I didn't have to visit a university before
| >| I could use it...

| >No need to be very smart for work with hex-codes,
| > I wrote all IA-32 codes incl. FPU-codes and timing onto one piece of
| > paper (two A4 pages), a second page shows all addressing codes
| > and a third all possible segment-descriptors.
| > I haven't finished my short-info for MMX and 3Dnow yet,
| > due the AMD-appendix-pages are short enough anyway.

| Ok, your method has the advantage, that you've complete control over
| every byte in your executable - which is not guaranteed if you use a
| compiler (every compiler makes decisions for you - a behaviour which
| isn't wanted by some programmers).

| One disadvantage is, that you have to learn this method (and all the
| opcodes - at least the most used ones). Another "giant" disadvantage
| is the time you need for coding.

You would be surprised how fast my code-work is done,
(far fewer key-strokes needed),
the really time-consuming stuff is:

search for best concepts,
data organisation decisions,
on screen help text,
optimise already working code,
user-level text (application manuals),
keep track of the history of my own work,
detailed program comments and documentation,
up-grade compatibility checks,
and support my body with enough cigarettes, coffee....

| Thus - deep respect, if you're able to write code this way! But it's
| not the thing I have in mind for myself. As my time is very limited,
| I can't waste it this way, so I "look away", if the compiler inserts
| code I did not write and things like that, because it saves a lot of
| my time. And - I'm still alive, if coding finally is done... ;)

| >I don't know much about OS/2 programming environment.
| >But you wont find any assembler/compiler which fits my demands.

| Time to write one by your own?

I never planned to, and from my point of view
I see not too much sense in talking to a machine in human terms,
as this will always lead to misunderstandings.
As long as I find myself somehow smarter than a CPU,
I'd better give the handicapped machine a fair chance to understand
my wishes without doubt, by talking to it in its language.

| >Sorry for my tools are "not for sale (yet)".

| I probably couldn't afford it... ;)

Machine-code monitors have been around since the very beginning of the
computer-age, and they aren't expensive.
But it would need many books to be written and delivered with it.
My (quite different) disassembler syntax will confuse everyone,
it contains math-symbols and uses its very own font.

| >I once worked on an executable conversion routine,
| >but as most existing code is written(compiled) "that" ill,
| >I decided to write all code myself.

| That's the reason why I started programming, anyway...

| [debugger vs. manual debugging]
|


| >| What the heck is a debugger? I'm using my "register dump" whenever I
| >| need to find an error (together with a lot of _patience_ and a small
| >| amount of _braincells_)... ;)

| >I remember that story,
| >1. write a lot of source text
| >2. compile it to an executable
| >3. start a debugger (the one with the register-dump)
| >4. set break-points
| >5. run executable from the debugger

| >6. repeat until nerves exhausted...


| In my case, it looks like this:
|
| 1.0. Write source code.
| 2.0. Run compiler.
| 3.0. Ready... or
| 3.1. Insert some "call _monTST" lines.
| 4.0. Run Compiler.

| 5.0. Repeat step 3.0. as often as necessary!


|
| _monTST stores the current contents of EAX, EBX, ECX, EDX, EDI, ESI,
| EBP and ESP in a 32 byte line within an allocated memory block. Each
| call increments a counter, so up to 128 "dumps" can be stored. It is
| possible to store the contents of the memory block in a file or it's
| displayed in a dialog box.
|

| I don't need more than the register's contents to determine, where a
| bug might be... ;)

No EFL-info? I couldn't live without it.
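
For illustration only, such a register-dump log could look like the
following C sketch - the names, the extra flags field and the layout
here are assumptions of mine, not the actual _monTST routine:

#include <stdio.h>

#define MON_SLOTS 128

struct mon_dump {                     /* 8 x 4 = 32 bytes, as described  */
    unsigned eax, ebx, ecx, edx, edi, esi, ebp, esp;
    unsigned eflags;                  /* optional extra (see above)      */
};

static struct mon_dump mon_buf[MON_SLOTS];
static unsigned        mon_count;

void mon_tst(unsigned eax, unsigned ebx, unsigned ecx, unsigned edx,
             unsigned edi, unsigned esi, unsigned ebp, unsigned esp,
             unsigned eflags)
{
    if (mon_count < MON_SLOTS) {      /* keep at most 128 "dump lines"   */
        struct mon_dump d = { eax, ebx, ecx, edx, edi, esi, ebp, esp, eflags };
        mon_buf[mon_count++] = d;
    }
}

void mon_print(void)                  /* dump to stdout (or a file)      */
{
    for (unsigned i = 0; i < mon_count; ++i)
        printf("%3u: EAX=%08X EBX=%08X ECX=%08X EDX=%08X "
               "EDI=%08X ESI=%08X EBP=%08X ESP=%08X EFL=%08X\n",
               i, mon_buf[i].eax, mon_buf[i].ebx, mon_buf[i].ecx,
               mon_buf[i].edx, mon_buf[i].edi, mon_buf[i].esi,
               mon_buf[i].ebp, mon_buf[i].esp, mon_buf[i].eflags);
}

int main(void)                        /* tiny demo call                  */
{
    mon_tst(1, 2, 3, 4, 5, 6, 7, 8, 0x246);
    mon_print();
    return 0;
}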

| >Now I just type in:
| > hex-code, see immediate the disassembled image of it,
| > and may immediate single step/run(or trace until whatever)that code,
| > or start at any address and view/edit register-values at any time.

| Sound really comfortable...

Yes it is, and I can write code for any (non-IA) CPU as well,
but my planned emulator is still in the theoretical phase.
I know that Bochs can 'act as' many CPUs,
but unfortunately only with an ill M$/C-shit syntax/environment.

| [latest MUL]


|
| >| Ok, I finished the MUL today, but had no time to compile and test it
| >| until now. It needs 30 addition loops for the table and up to 32 * 2
| >| loops for the "multiplication" - which is a maximum of 94 loops à 32
| >| additions...

| >Give me some days to check the timing....

| Of course! Maybe I find some time to compile and check it myself. My
| only problem ... I first have to recode routines and functions which
| are involved with the calling and preparation.

__
wolfgang

bv_schornak

unread,
Apr 12, 2003, 9:17:23 PM4/12/03
to
wolfgang kern wrote:

>| >Nevertheless I added some optimise notes (can't hold myself...)
>| Ok! I'm far away from being perfect...
>
>Nobody is.... I always find something when reviewing my code...
>

Never mind - this might be a very common problem...

>| > mov cl,dword[ebx + ebp * 1 + 0x10] # get byte
>| > xor eax,eax ; why?
>| > and cl,0x0F # lower digit
>

>| As I could read in several articles, it "should" be
>| avoided to manipulate registers in two instructions
>| which follow each other... So I always try _not_ to
>| move data into a register and then manipulate it in
>| the next instruction.
>

>Only direct pipe dependencies may result in a clock penalty.
>Just calculating operations (mul/div/shifts/add,or...),
>where the result is needed in the next instruction will wait
> for completion, while 'moves' won't need to wait.
>(see AMD K7 pipes organisation for details, there are more than two..)
>So strict instruction pairing doesn't always gain speed,
>memory-alignment together with cache-aligned code-parts
>(as 16-byte boundary and max.64/128-byte loops) gain much more.
>

Complicated stuff. I should read the docs, but time
is a very valuable and rare item in my life...

>| > inc cl ; inc ecx : 0F*20 = 01E0
>| A blackout? Sudden attack of a wave of stupidity?
>| Who knows...
>
>Probably a very common trap .... :)
>

Or a flaw in my multitasking abilities...

>| > shr cl,0x04 # upper digit
>| > je 4 (forwards)
>| > shl ecx,0x05 # * 0x20!
>| > ;? b0..4 will always be 0 ?
>
>| ECX is used as a "pointer" to the current byte in
>| the appropriate _table_ entry. Because this is the
>| loop for the upper digits, it should be somewhere
>| in the range of 0x0800 ... 0x0FFF. (See below!)
>
>Ok, and cl,0xF0
> je 4(f)
> shl ecx,1 ;is one byte shorter.
>

Right - changed now!
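
For the record, the replacement works because, with the upper ECX bits
already cleared, ((c >> 4) << 5) and ((c & 0xF0) << 1) give the same
entry offset - index times 0x20. A one-minute C check:

#include <stdio.h>

int main(void)
{
    for (unsigned c = 0; c < 256; ++c)               /* all byte values */
        if (((c >> 4) << 5) != ((c & 0xF0u) << 1))
            printf("mismatch at %02X\n", c);
    printf("both forms give the same table offset\n");
    return 0;
}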

>[cc-branches]
>
>| My version of GAS only knows the 70...7F opcodes,
>| the other ones are not part of the header file! I
>| could add the opcodes and recompile GAS, but I am
>| not shore, if they will run on a 80486 - which is
>| a "must be" condition!
>
>IIRC the '0F 80' branches came with 32-bit architecture (i386),
>(I skipped 386), but my old AMD-486 knows them for sure.
>You see the disadvantage of being depended on compilers skills,
> ...and most compilers not even know "all" 286 instructions.
>

Investigating further - I would have to re-code the
GAS sources heavily to implement it - not worth the
time, I think. Could write my own assembler within
the same time, I guess...

[MUL]

>| First - a 4096 byte block of memory is allocated.
>
>| Second - some work is done to detect, if one operand is zero,
>| the signs are tested and the larger
>| operand is used to fill the table (maybe I should
>| do it the other way, because it could be a little
>| bit faster).
>
>Ok, but shouldn't be both zero-extended anyway ?
>

Per definition - they are! If the programmer passes
operands, which are not, the result is garbage. And
the allocated memory always is filled with zeroes.

>| Third - the table is filled with "x * 1" results,
>| starting with 01, ending with 0F.
>
>Until here I understand,
>fifteen table entries contain OP1*1 .. OP1*15 ; 128 bytes each.
>

True!

>| Then we have to set the results index to the base of the "x * 10"
>| buffer and put the result of "(x * F) + (x * 1)"
>| into that buffer. Now we set the OP1 index to the
>| base of the "x * 10" buffer (for the 1st multiply
>| OP2 is equal to OP1).
>
>I don't catch the sense of the 16th entry,
>it will be just a "four bit shift left" of OP1.
>Seems I missed that this are again 15 entries
> OP1*10h .. OP1*F0h
>

A version with "shift" instead of "add" is slower -
needs more cycles (actually 6 for add and some more
for the counters).

To describe the step between "01"'s and "10"'s:

-------------
| 0E | 0F |
-------------
| -- | 10 |
-------------

Because there's a "hole" between "0F" and "10",
we have to add the difference. "10" is the last
entry, where "1 * OP1" is added. After that, we
have to add "10 * OP1" instead. Thus, the index
for OP1 is set to the "10" entry, 'cause we add
"10 * OP1" now...

>| The same procedure is repeated
>| with the upper digit, but now we add 0x800 to the
>| counter (actually 0x200, because dwords are used)
>| for the "x * 10". multiples. This is done, until
>| all bytes of OP2 are processed.
>
>Now it got a face ...
>It avoids all large shift operations by using two sets of tables.
>

Meanwhile there are 8 tables... ;)

>After I now know the meaning of it, I can start typing.
>This time I'll count keystrokes and the time needed for my input,
>just to compare against HLL-asm :)
>

I hope you recognized my last posting - the new
(latest) version is included here!

Sorry for the missing attachment. I was testing
another - native - newsreader, so my attachment
got lost somewhere...


[special registers]

>No way to use it with ALU or for address-generation.
>CR3 can't be (ab)used while paging is enabled.
>DR0..3 may be used for temporary storage as long the debugger don't
>use them for 'true' (code/data/IO) break-points.
>

As they might be used by the taskmanager ... it
probably is a better idea to leave them alone!


[snow in April]

>| ... Other people are dreaming of a white christmas - so
>| I am not too far off... ;)
>
>A few days ago there was actually snow on our meadow.
>

We had snow here throughout the past two weeks!
It melted during the day, but is this the right
time for snow? In April?


[constant index registers]

>Whenever needed or if free registers become rare,
> I use two index-pointers "and" a base within one instruction
>
>ie: adc dw [ecx+edx+displ32],eax
>
>the two registers are used as variable index pointers,
>the base will be stored within the code
>as the 32-bit displacement.
>

Ok, you can't avoid it in every case...

>Not sure your compiler can work selfmodifying code.
>

Possible, but of no practical use. Whenever the
application tries to access the "text" segment,
OS/2 will kill it. The memory where the code is
stored belongs to the program manager, so it is
an "access violation", if you try to alter data
in this area (SYS 3175, the best known error in
OS/2 - if you run buggy applications)...


[compiler vs. brain]


>| One disadvantage is, that you have to learn this method (and all the
>| opcodes - at least the most used ones). Another "giant" disadvantage
>| is the time you need for coding.
>
>You would be surprised how fast my code-work is done,
>(much lesser key-strokes needed),
>the really time-consuming stuff is:
>
>search for best concepts,
>data organisation decisions,
>on screen help text,
>optimise already working code,
>user-level text (application manuals),
>keep track of the history of my own work,
>detailed program comments and documentation,
>up-grade compatibility checks,
>

Depends on the kind of software you're writing,
and where it is used. If you write applications
for modern operating systems, then it could be-
come a messy job, if you're using your method.

Application manuals are the real challenge (for
a real programmer)...

History, documentation and comments: I write it
while coding, so I don't have to think about...

Compatibility: Whenever I change my definitions
there's a program which converts old stuff into
new one...

>and support my body with enough cigarettes, coffee....
>

Programmers (and truck drivers) fuel... ;)


[compiler]

>| Time to write one by your own?
>
>I never planned to do, and from my point of view,
>I see not too much sense in talking to a machine in human terms,
>this will always lead to misunderstanding.
>As long I find myself somehow smarter than a CPU,
>I better give the handicapped machine a fair chance to doubtfree
>understand my wishes by talking to it in its language.
>

The only problem with machines in general. Most
difficult to reduce a problem into simple tasks
which are "understood" by the machine!

>| >Sorry for my tools are "not for sale (yet)".
>| I probably couldn't afford it... ;)
>
>Machine-code monitors are around since the very begin of the
>computer-age, and they aren't expensive.
>But it would need many books to write and deliver with it.
>My (quite different) disassembler syntax will confuse everyone,
>it contains math-symbols and use its very own font.
>

I don't think, that I really want that... ;)


[debugger vs. manual debugging]


>| I don't need more than the register's contents to determine, where a
>| bug might be... ;)
>
>No EFL-info? I couldn't live without it.
>

EFL??? Never heard of it...

[latest MUL, V 0.0.2.]

Fixed all errors, I hope. These "dword aligned"
routines need many time consuming extra-cycles!
Would be the question, if the "misalignment" is
using up more cycles than the extra-code. Maybe
you know? If not, then I would prefer penalties
over extra-cycles to copy the table 3 times...

BinLobster.txt

wolfgang kern

unread,
Apr 13, 2003, 9:41:29 PM4/13/03
to

Bernhard wrote:

[pairing and pipes]


| Complicated stuff. I should read the docs, but time
| is a very valuable and rare item in my life...

And every new CPU-version adds new different features...

[long cc-branches]


| Investigating further - I would have to re-code the
| GAS sources heavily to implement it - not worth the
| time, I think. Could write my own assembler within
| the same time, I guess...

Perhaps newer versions of GAS know +486 code?


| [MUL]
[...]


| >Ok, but shouldn't be both zero-extended anyway ?

| Per definition - they are! If the programmer passes
| operands, which are not, the result is garbage. And
| the allocated memory always is filled with zeroes.

I see, so I add the 4Kb zero-fill to the clock-count.

[...]


| >Seems I missed that this are again 15 entries
| > OP1*10h .. OP1*F0h

| A version with "shift" instead of "add" is slower -
| needs more cycles (actually 6 for add and some more
| for the counters).

Yes.


| Because there's a "hole" between "0F" and "10",
| we have to add the difference. "10" is the last
| entry, where "1 * OP1" is added. After that, we
| have to add "10 * OP1" instead. Thus, the index
| for OP1 is set to the "10" entry, 'cause we add
| "10 * OP1" now...

Ok.
[...]


| >Now it got a face ...
| >It avoids all large shift operations by using two sets of tables.

| Meanwhile there are 8 tables... ;)

I see that now,
but the chance to have two tables with 4Kb in cache
is somehow larger than using eight with 16Kb.
Depends how many other tasks or interrupts use memory.

I use tables whenever this make sense,
but I'm afraid the time spent on creating this 30(120) entries
will make your MUL-method a bit slow, see below.

| >After I now know the meaning of it, I can start typing.
| >This time I'll count keystrokes and the time needed for my input,
| >just to compare against HLL-asm :)


I still used the 30 entries version:
                                         clocks   bytes
clear 4Kb memory                       =   2000      16
OP-check: zero, used size  13+2*112*5  =   1133      32
sign set                                      5      18
01..0F              15*(7+28*12)       =   5145      35
01 +0F               1*()                   343      38
10..F0              15*()                  5145      35
                                          -----     ---
                             creation:    13771     174
nibble-add LUT-MUL:
calc table offset   2*112*(10)             2240    2*18
add table-entry       2*28*(10)             560    2*11
loop ctrl               28*(14)             392      16
                                          -----    ----
                                total:    16963     248

Now if I divide 16963 by 224 result bytes = 75.73 ,
not bad for 4 Kb, but a bit above the "70" with only 64 bytes.

Coding time: ~500 key-strokes took me about 90 minutes.
[I'm not a good typist? :) ]
And I played another half hour to check for clock-count details.

Not included are compiler-specific overhead needs.
I put the "end" somewhere in the middle of the 248 bytes,
so all cc-branches are straight and short form.
The timing was checked with buffers cached already (by clear)
and interrupts disabled.
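
For comparison, a minimal sketch of taking such a cycle count with RDTSC
via the GCC __rdtsc() intrinsic - obviously not the original test bed,
and on newer CPUs out-of-order execution and frequency scaling blur the
raw number:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

static uint8_t buf[4096];

int main(void)
{
    memset(buf, 0, sizeof buf);          /* "clear 4Kb" also warms the cache    */

    uint64_t t0 = __rdtsc();
    volatile uint32_t sum = 0;           /* stand-in for the routine under test */
    for (int i = 0; i < 4096; ++i)
        sum += buf[i];
    uint64_t t1 = __rdtsc();

    printf("elapsed: %llu reference cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}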


| I hope you recognized my last posting - the new
| (latest) version is included here!

I combined the sub-threads...


| [special registers]
|
| >No way to use it with ALU or for address-generation.
| >CR3 can't be (ab)used while paging is enabled.
| >DR0..3 may be used for temporary storage as long the debugger don't
| >use them for 'true' (code/data/IO) break-points.

| As they might be used by the taskmanager ... it
| probably is a better idea to leave them alone!

Yes!

| [snow in April]


| It melted during the day, but is this the right
| time for snow? In April?

I remember snow in may 1979, driving on highway with 30 kmh.

| [constant index registers]
|
| >Whenever needed or if free registers become rare,
| > I use two index-pointers "and" a base within one instruction

| >ie: adc dw [ecx+edx+displ32],eax

| >the two registers are used as variable index pointers,
| >the base will be stored within the code
| >as the 32-bit displacement.

| Ok, you can't avoid it in every case...

Right.


| >Not sure your compiler can work selfmodifying code.

| Possible, but of no practical use. Whenever the
| application tries to access the "text" segment,
| OS/2 will kill it. The memory where the code is
| stored belongs to the program manager, so it is
| an "access violation", if you try to alter data
| in this area (SYS 3175, the best known error in
| OS/2 - if you run buggy applications)...

My trick is CS/SS alignment
(the code is at the bottom of the stack),
so I can write to the code-segment any time.
But OS/2 sure got a stack-manager as well...

| [compiler vs. brain]
| >| One disadvantage is, that you have to learn this method (and all the
| >| opcodes - at least the most used ones). Another "giant" disadvantage
| >| is the time you need for coding.

| >You would be surprised how fast my code-work is done,
| >(much lesser key-strokes needed),
| >the really time-consuming stuff is:

| >search for best concepts,
| >data organisation decisions,
| >on screen help text,
| >optimise already working code,
| >user-level text (application manuals),
| >keep track of the history of my own work,
| >detailed program comments and documentation,
| >up-grade compatibility checks,


| Depends on the kind of software you're writing,
| and where it is used. If you write applications
| for modern operating systems, then it could be-
| come a messy job, if you're using your method.

For PC-applications I write for my own crap only,
I won't do anything for M$/boring-land/C±shit.
Other CPU(8/16bit)-programs are mainly ROM-based and use
already collected routines with few modifications.
Yes, it would be "horrible" to write HLL-based winAPI stuff.

BTW:
I think KESYS-2003 is as modern as its version# says.
Ok, it's not made for EDP or net-hosts and
it's far away from being comparable with OS/2 or winXX.

"modern" stands for:

Multi-tasking ; lock everything while waiting
Object-oriented ; is just a HLL-interpreter
Down to museum ; compatible
Erroneous ; more bugs than functions
Restricted ; due don't know how a CPU really works
Negate ; users expectation/demand

| Application manuals are the real challenge (for
| a real programmer)...

In fact I hate this job, but who else can do it...

| History, documentation and comments: I write it
| while coding, so I don't have to think about...

Yes, I have the opportunity to write documentation while
coding (to a tag-file), but very often I find the code
self-explanatory and just do it after everything is
finished and tested.

| Compatibility: Whenever I change my definitions
| there's a program which converts old stuff into
| new one...

Unfortunately "I" have to write my conversion programs.

| >and support my body with enough cigarettes, coffee....
| Programmers (and truck drivers) fuel... ;)

Fuel? I prefer milk, or was it beer...? :)

| [compiler]

| >| >... my tools are "not for sale (yet)".
| >..it contains math-symbols and use its very own font.

| I don't think, that I really want that... ;)

Yippee! I may keep it for myself!

[debugger]


| EFL??? Never heard of it...

This is where the "conditions" come from.... :)

The flag-register, carry,zero,sign,step,direct,irq,iopl,...
not all of the 32 bits are used.
I have capital letters for set bits "IPFAVR-NLLODITSZAPC"
and lower case characters for zeroed bits.


| [latest MUL, V 0.0.2.]

| Fixed all errors, I hope. These "dword aligned"
| routines need many time consuming extra-cycles!
| Would be the question, if the "misalignment" is
| using up more cycles than the extra-code. Maybe
| you know? If not, then I would prefer penalties
| over extra-cycles to copy the table 3 times...

A misaligned dw-access will result in a one-clock penalty
if already in cache; if not, the cache-line fill will
need three extra clocks (worst case).
I think these extra tables cost more than they may gain.

[code stored aside]
I couldn't view your latest code in detail yet,
I'll be off-line for a few days (back on Friday).
__
wolfgang

bv_schornak

unread,
Apr 16, 2003, 1:24:23 PM4/16/03
to
wolfgang kern wrote:

[pairing and pipes]

>And every new CPU-version adds new different features...
>

Wouldn't sell too much new processors, if brand-new
"features" would be the same as in the last (older)
generation... ;)

>[long cc-branches]
>| Investigating further - I would have to re-code the
>| GAS sources heavily to implement it - not worth the
>| time, I think. Could write my own assembler within
>| the same time, I guess...
>
>Perhaps newer versions of GAS know +486 code?
>

PGCC/2 is at the latest level (including the P4 and
the Athlon). But much more complicated to handle. I
don't like to learn "compiler¹" every few years.

¹ See as a language in itself... ;)


| [MUL]

>| Per definition - they are! If the programmer passes
>| operands, which are not, the result is garbage. And
>| the allocated memory always is filled with zeroes.
>
>I see, so I add the 4Kb zero-fill to the clock-count.
>

Zeroes are applied by OS/2 - I do not know, if this
memory area is "cleared" while the allocation is in
progress or after freeing the memory area (would be
the better way, because it could be done in another
thread).

>| Meanwhile there are 8 tables... ;)
>
>I see that now,
>but the chance to have two tables with 4Kb in cache
>is somehow larger than using eight with 16Kb.
>Depends how many other tasks or interrupts use memory.
>
>I use tables whenever this make sense,
>but I'm afraid the time spent on creating this 30(120) entries
>will make your MUL-method a bit slow, see below.
>

Ok, the table shrank back down to 4 KB.

>I still used the 30 entries version:
> clocks bytes
>clear 4Kb memory = 2000 16
>OP-check: zero, used size 13 + 2*112*5 = 1133 32
>sign set 5 18
>01..0F 15*(7+ 28*12)= 5145 35
>01 +0F 1*() 343 38
>10..F0 15*() 5145 35
> ----- ---
> creation: 13771 174
>nibble-add LUT-MUL:
>calc table offset 2*112*(10) 2240 2*18
>add table-entry 2*28*(10) 560 2*11
>loop ctrl 28*(14) 392 16
> ----- ----
> total: 16963 248
>
>Now if I divide 16963 by 224 result bytes = 75.73 ,
>not bad for 4 Kb, but a bit above the "70" with only 64 bytes.
>

Assuming, that I could allocate the memory outside,
there would be 2000 cycles less.

14,963 / 224 = 66.80 ;)

I would have assumed, that the "nibble-add LUT-MUL"
needs much more cycles?

>Coding time: ~500 key-strokes took me about 90 minutes.
>[I'm not a good typist? :) ]
>

Depends - if you were typing with your feet... ;)

(10.8 s / stroke)

>And I played another half hour to check for clock-count details.
>

Thanks a lot!

>Not included are compiler-specific overhead needs.
>I put the "end" somewhere in the middle of the 248 bytes,
>so all cc-branches are straight and short form.
>The timing was checked with buffers cached already (by clear)
>and interrupts disabled.
>

Should be a little bit shorter now. Is the clearing
better or worse? Should I leave it in for the cache
filling, or isn't it important?

>| I hope you recognized my last posting - the new
>| (latest) version is included here!
>
>I combined the sub-threads...
>

Seen. No, there was a short note, that V 0.0.1. was
a little bit buggy. Hope the results of the routine
wasn't used to compute targets of cruise missiles!


[snow in April]

>| It melted during the day, but is this the right
>| time for snow? In April?
>I remember snow in may 1979, driving on highway with 30 kmh.
>

And May 1986 (or was it 1987?) with -40°C...

<brrrrr - my Diesel was frozen...>

Yikes - you drive _that_ fast in Austria? German's
meanwhile are used to creep much slower on the fast
lane (without snow!)...

[self-modifying code]

>| Possible, but of no practical use. Whenever the
>| application tries to access the "text" segment,
>| OS/2 will kill it. The memory where the code is
>| stored belongs to the program manager, so it is
>| an "access violation", if you try to alter data
>| in this area (SYS 3175, the best known error in
>| OS/2 - if you run buggy applications)...
>
>My trick is CS/SS alignment
>(the code is at the bottom of the stack),
>so I can write to the code-segment any time.
>But OS/2 sure got a stack-manager as well...
>

Never heard of. But you can't access CS/SS, either.
If the task ends, the system will crash if it tries
to restore all registers...

BTW - your code probably is running in Ring 0. OS/2
applications are running in Ring 3!

>| Depends on the kind of software you're writing,
>| and where it is used. If you write applications
>| for modern operating systems, then it could be-
>| come a messy job, if you're using your method.
>
>For PC-applications I write for my own crap only,
>I won't do anything for M$/boring-land/C±shit.
>Other CPU(8/16bit)-programs are mainly ROM-based and use
>already collected routines with few modifications.
>Yes, it would be "horrible" to write HLL-based winAPI stuff.
>

:-D

>BTW:
> I think KESYS-2003 is as modern as its version# says.
> Ok, it's not made for EDP or net-hosts and
> it's far away from being comparable with OS/2 or winXX.
>

Never was a question! I got the clue that your code
is something for _special_ purposes, so it wouldn't
make sense to imply some Windows-like "pippifax"...

[ Nope - no translation for this one... ;) ]

>"modern" stands for:
>
> Multi-tasking ; lock everything while waiting
> Object-oriented ; is just a HLL-interpreter
> Down to museum ; compatible
> Erroneous ; more bugs than functions
> Restricted ; due don't know how a CPU really works
> Negate ; users expectation/demand
>

:-D

M... - Very sophisticated!

O... - Use HLA (claims to be an assembler)?

D... - Compatibility always sounds very up-to-date!

E... - You probably need "_tstMON"?

R... - You're kidding! ;)

N... - If you keep it, nobody will buy it...

>| Application manuals are the real challenge (for
>| a real programmer)...
>
>In fact I hate this job, but who else can do it...
>

Busy with a spiritual debatte: Your "ghost"-writer?

>| History, documentation and comments: I write it
>| while coding, so I don't have to think about...
>
>Yes, I got to opportunity to write documentation while
>coding (to a tag-file), but very often I find the code
>self-explaining and just do it after everything is
>finished and tested.
>

Thrown overboard. Stumbling upon some very old programs
without documentation, I learned not to wait with a
text until I've forgotten the contents... ;)

>| Compatibility: Whenever I change my definitions
>| there's a program which converts old stuff into
>| new one...
>
>Unfortunately "I" have to write my conversion programs.
>

Who do you think is writing mine? ;)

>| >and support my body with enough cigarettes, coffee....
>| Programmers (and truck drivers) fuel... ;)
>
>Fuel? I prefer milk, or was it beer...? :)
>

Both "Grundnahrungsmittel" (basic food) in Bavaria!


[compiler]

>| I don't think, that I really want that... ;)
>
>Yippee! I may keep it for myself!
>

Until somebody is interested? ;)


[debugger]

>| EFL??? Never heard of it...
>
>This is where the "conditions" come from.... :)
>
>The flag-register, carry,zero,sign,step,direct,irq,iopl,...
>not all of the 32 bits are used.
>I have capital letters for set bits "IPFAVR-NLLODITSZAPC"
>and lower case characters for zeroed bits.
>

I see... If ever needed, I will code it!


[BinLobster - MUL V 0.0.2.]

>| Fixed all errors, I hope. These "dword aligned"
>| routines need many time consuming extra-cycles!
>| Would be the question, if the "misalignment" is
>| using up more cycles than the extra-code. Maybe
>| you know? If not, then I would prefer penalties
>| over extra-cycles to copy the table 3 times...
>
>A misaligned dw-access will result in a one clock penalty
>if already in cache, if not, the cache-line fill will
>need extra three clocks (worst case).
>I think this extra tables cost more than they may gain.
>

Seen and re-coded again -> V 0.0.3.!

>[code stored aside]
>I couldn't view your latest code in detail yet,
>I'll be off-line for a few days (back on Friday).
>

So you had good luck, because the _latest_ revision
is attached here and now! I hope, this one is kept
for a while... ;)

BinLobster.txt

wolfgang kern

unread,
Apr 18, 2003, 8:00:51 AM4/18/03
to

Bernhard wrote:

| [pairing and pipes]

| >And every new CPU-version adds new different features...

| Wouldn't sell too much new processors, if brand-new
| "features" would be the same as in the last (older)
| generation... ;)

The visible side of the coin...
Much too often the new features come on cost of some
useful "old" things
ie: LOADALL (286/386) disappeared with 486,
16-bit BSWAP (undocumented 486) don't work on +586,
strict instruction pairing won't make sense above pentium,
single byte 'INC/DEC,reg' are "abused" on IA-64-CPUs,
prefix-meaning changed with SSE (F3),
opcode-bits are merged into displacement-area with MMX,...

| >[long cc-branches]


| >Perhaps newer versions of GAS know +486 code?

| PGCC/2 is at the latest level (including the P4 and
| the Athlon). But much more complicated to handle. I
| don't like to learn "compiler¹" every few years.

| ¹ See as a language in itself... ;)

You see..., whenever I like to add a new CPU(or version)
to my tools, only a few changes in the disassembler are
needed, and later it also can tell me in detail for which
CPU-type a program is written (foreign code analysis).

[clear buffers ]

| Zeroes are applied by OS/2 - I do not know, if this
| memory area is "cleared" while the allocation is in
| progress or after freeing the memory area (would be
| the better way, because it could be done in another
| thread).

Result may be needed from following process?

No, you can't be sure that allocation clears the already cached RAM.
And buffers may be used again and again in the calculator.
So we should not drop the clear-steps,
but yes, all allocation can be done when the calculator is loaded.

| I would have assumed, that the "nibble-add LUT-MUL"
| needs much more cycles?

Yes, I estimated more here also,
but as it just adds one memory operand to a buffer,
there are just 56 dword ADCs in the loop... mmh?? bug! ..STOP!

Oops!
I owe you an apology, so please take it.

I fooled myself here with the "worst-case" all bits set MUL,
so I got a correct looking result after one add already.
But it was just looking similar for I ignored the top-byte
(01 FF FF...FE 00 00 00...00).
But the result must read 00 FF...FF FF FE 000000..01 of course.

In fact we must loop the add-table section (28*10clks) 112 times also.
Obviously my test jumped to the end too early.

Sorry for the "published" bug and even more sorry for
the now corrected clock-cycles count:

                             creation:    13771     174
nibble-add LUT-MUL:
calc table offset   2*112*(10)             2240    2*18
add table-entry     2*112*28*(10)         62720    2*11
loop ctrl               28*(14)             392      16
                                          -----    ----
                                total:    79123     248
A cycle-count over the whole story gave a few less than this,
probably a better alignment without the RDTSC inserted.

But now ~350 clocks/byte, seven times more (plus 4200 bytes)
than the 64-bit MUL.

So we better archive the "nibble-add LUT-MUL" for CPUs which
don't have a somehow fast MUL-instruction.


| >Coding time: ~500 key-strokes took me about 90 minutes.
| >[I'm not a good typist? :) ]

| Depends - if you were typing with your feet... ;)

Only my main power-switch is a "kick"-type (underneath my desk).

| (10.8 s / stroke)
Far away from 600 strokes/minute...

| >And I played another half hour to check for clock-count details.
| Thanks a lot!

Don't mention, I really was interested to see the speed of it.
And my bug wasn't a typo, it just was wrong thinking about the
loop logic.

| >Not included are compiler-specific overhead needs.
| >I put the "end" somewhere in the middle of the 248 bytes,
| >so all cc-branches are straight and short form.
| >The timing was checked with buffers cached already (by clear)
| >and interrupts disabled.

| Should be a little bit shorter now. Is the clearing
| better or worse? Should I leave it in for the cache
| filling, or isn't it important?

In general,
recently accessed memory can be assumed to be in cache,
as long as there is enough of it free and other tasks or
interrupts don't evict it - assuming it's a cacheable region anyway.

| >I combined the sub-threads...
| Seen. No, there was a short note, that V 0.0.1. was
| a little bit buggy. Hope the results of the routine
| wasn't used to compute targets of cruise missiles!

If they added my bug too,
the wrong target will be hit on 1.4.2023
(this date is marked in my top-down diary
as the end of the European Uniting War).

| [snow in April]
|>..., driving on highway with 30 kmh.


| And May 1986 (or was it 1987?) with -40°C...
| <brrrrr - my Diesel was frozen...>
| Yikes - you drive _that_ fast in Austria? German's
| meanwhile are used to creep much slower on the fast
| lane (without snow!)...

I know the slow fellow as NNW-German,
["the non-flying Dutch" usual with a huge camping trailer]
hunting for a toilet near the Mediterranean Sea. :)
(Sorry Hervey, this is just a joke!)

[self-modifying code]

| >My trick is CS/SS alignment
| >(the code is at the bottom of the stack),
| >so I can write to the code-segment any time.
| >But OS/2 sure got a stack-manager as well...

| Never heard of. But you can't access CS/SS, either.
| If the task ends, the system will crash if it tries
| to restore all registers...

?? A task-switch will save/restore SS:ESP anyway...
I don't know if it's OS/2 or your compiler which will
not allow access/create GDT/LDT-entries.

| BTW - your code probably is running in Ring 0. OS/2
| applications are running in Ring 3!

Exact, all my PM-code runs with PL=0.
My protection is different but simple:
"never execute foreign code".

["pippifax"...]


| [ Nope - no translation for this one... ;) ]

The term may be of British origin anyway (Monty Python?),
Beth can answer this better.

[terminated: 'modern'-joke]

[Application manuals...]


| >In fact I hate this job, but who else can do it...

| Busy with a spiritual debate: Your "ghost"-writer?

Must be an identical clone, but then he will hate it also.

[ delayed documentation ]
|>.. just do it after everything is finished and tested.

| Thrown aboard. Slipping upon some very old programs
| without documentation, I learned not to wait with a
| text until I've forgotten the contents... ;)

As I'm very familiar with analysing driver software
without any source-code, I easily find my way through my own
creations. But yes, not always immediately.

[Compatibility]


| >Unfortunately "I" have to write my conversion programs.
| Who do you think is writing mine? ;)

Your compiler? :)

| [compiler]
| >| I don't think, that I really want that... ;)
| >Yippee! I may keep it for myself!
| Until somebody is interested? ;)

Depends how much trailing zeros (before DP) on the check...
No serious, I think about to 'free' some of my tools
after I finished my new version (G-styled tool-box),
the previous stuff may become shareware
and the new tools will be part of a developers-package.

| [debugger]
| >| EFL??? Never heard of it...

| >...The flag-register...


| I see... If ever needed, I will code it!

The most needed are carry, zero and sign.
(LAHF instruction)

[BinLobster - MUL V 0.0.2.]

| >...I think this extra tables cost more than they may gain.


| Seen and re-coded again -> V 0.0.3.!

| >[code stored aside]
| >I couldn't view your latest code in detail yet,
| >I'll be off-line for a few days (back on Friday).

| So you had good luck, because the _latest_ revision
| is attached here and now! I hope, this one is kept
| for a while... ;)

It will, [V 0.0.3 text and code-block gone to archive]

Ok, no reason to be disappointed about the timing,
we tried, we learned,...
Perhaps this method can gain speed for divide.

Just for info (your figures are dw-aligned anyway):
I updated my byte-oriented MUL as:
Zero-extend to dword (register only) if smaller operand
or unaligned size.
So my MUL works with dwords now and became much faster.

BTW:
Some calculation needs factor-summing ("A= A + B*C"),
in this case the MUL-routine can be entered after the
'clear result buffer' (an additional MUL&ADD label).
------------------
I'll continue with the (still on-topic) divide problem.
Your previous DIV-idea was somewhat similar to the latest MUL-version.

Let's assume the worst case is 1/FF FF...FF [1/(2^896-1)];
perhaps even worse would be FF FF...FE/FF FF...FF (we'd better check both),
and the result shall be at least 896 bits (112 bytes) precise,
with the (10^x)-exponent adjusted accordingly.

The DIV 64 (code F7.. edx:eax/m32) is a vectored [24/40] instruction.
Together with all necessary zero/limit checks,
(needed to avoid DIV-exceptions),
I estimate about 600 clock-cycles per result-byte for it.
A multiplication in front of it may be necessary to produce full
integer precision results.
And the 10^x exponent handling will add some extra clock-cycles.
Rounding may be done elsewhere.

I'll modify my CMP/SUB divide to work on dwords,
and I also will try to use the DIV instruction using quad-words.
If you can reverse your MUL to become a 'nibble=SUB_LUT_DIV'
we can compare the timing of the three variants.
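
For reference, the 'nibble-SUB LUT-DIV' idea in C - a rough sketch of
my own with tiny operands, not either of our routines: one table of
divisor multiples D*1..D*15, then per quotient nibble one table scan
and at most one subtraction instead of up to fifteen:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NB 16                                  /* bytes per number (small demo)  */

typedef struct { uint8_t b[NB]; } Big;         /* big-endian byte array          */

static int big_cmp(const Big *x, const Big *y)
{
    int r = memcmp(x->b, y->b, NB);            /* unsigned, big-endian compare   */
    return (r > 0) - (r < 0);
}

static void big_sub(Big *x, const Big *y)      /* x -= y, caller ensures x >= y  */
{
    int borrow = 0;
    for (int i = NB - 1; i >= 0; --i) {
        int d = x->b[i] - y->b[i] - borrow;
        borrow = d < 0;
        x->b[i] = (uint8_t)(d & 0xFF);
    }
}

static void big_add(Big *x, const Big *y)      /* x += y (no overflow in demo)   */
{
    int carry = 0;
    for (int i = NB - 1; i >= 0; --i) {
        int s = x->b[i] + y->b[i] + carry;
        carry = s >> 8;
        x->b[i] = (uint8_t)s;
    }
}

static void big_shl4(Big *x, int nibble)       /* x = x*16 + nibble              */
{
    int carry = nibble & 0x0F;
    for (int i = NB - 1; i >= 0; --i) {
        int v = (x->b[i] << 4) | carry;
        carry = x->b[i] >> 4;
        x->b[i] = (uint8_t)v;
    }
}

int main(void)
{
    Big n = {{0}}, d = {{0}}, rem = {{0}}, q = {{0}}, mult[16];
    n.b[NB-3] = 0x01; n.b[NB-2] = 0xE2; n.b[NB-1] = 0x40;   /* N = 123456 (0x01E240) */
    d.b[NB-1] = 0x07;                                       /* D = 7                 */

    memset(&mult[0], 0, sizeof mult[0]);                    /* table: D*0 .. D*15    */
    for (int i = 1; i < 16; ++i) { mult[i] = mult[i-1]; big_add(&mult[i], &d); }

    for (int i = 0; i < NB; ++i)                            /* two nibbles per byte   */
        for (int shift = 4; shift >= 0; shift -= 4) {
            big_shl4(&rem, (n.b[i] >> shift) & 0x0F);       /* bring down one nibble  */
            int digit = 15;                                 /* largest D*digit <= rem */
            while (digit > 0 && big_cmp(&mult[digit], &rem) > 0) --digit;
            big_sub(&rem, &mult[digit]);                    /* one SUB per nibble     */
            big_shl4(&q, digit);                            /* append quotient digit  */
        }

    printf("quotient: .. %02X %02X %02X  remainder: %02X\n",  /* 00 44 E4 r 04       */
           q.b[NB-3], q.b[NB-2], q.b[NB-1], rem.b[NB-1]);
    return 0;
}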

The 1/x Newton-Raphson method is still on my desk,
but it needs too many iterations yet, perhaps I find a short-cut
by playing around with logarithm-formulas.
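
The iteration meant here is x_{k+1} = x_k*(2 - d*x_k), which roughly
doubles the number of correct digits per step. A scalar double sketch
only - the multi-precision version with a decent first guess is the
actual open problem:

#include <stdio.h>

int main(void)
{
    double d = 7.0;
    double x = 0.1;                         /* rough first guess for 1/7 */
    for (int k = 0; k < 6; ++k) {
        x = x * (2.0 - d * x);              /* Newton-Raphson step       */
        printf("iteration %d: x = %.17g\n", k + 1, x);
    }
    printf("reference  : 1/d = %.17g\n", 1.0 / d);
    return 0;
}
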
__
wolfgang


bv_schornak

unread,
Apr 20, 2003, 4:27:24 PM4/20/03
to
wolfgang kern wrote:


[new processor features]

>| Wouldn't sell too much new processors, if brand-new
>| "features" would be the same as in the last (older)
>| generation... ;)
>
>The visible side of the coin...
>

Use a mirror... Then you can see _both_ sides at the
same time. But - you should be able to read mirrored
letters... ;)

>Much too often the new features come on cost of some
>useful "old" things
>ie: LOADALL (286/386) disappeared with 486,
> 16-bit BSWAP (undocumented 486) don't work on +586,
> strict instruction pairing won't make sense above pentium,
> single byte 'INC/DEC,reg' are "abused" on IA-64-CPUs,
> prefix-meaning changed with SSE (F3),
> opcode-bits are merged into displacement-area with MMX,...
>

As I said long ago - I should have kept the 68k line
straight on...


[complicated compiler invocation]

>You see..., whenever I like to add a new CPU(or version)
>to my tools, only a few changes in the disassembler are
>needed, and later it also can tell me in detail for which
>CPU-type a program is written (foreign code analysis).
>

How do you write your sources? As text files, which
are translated into binary files, or directly with a
hex editor? We've been talking for several weeks about
the advantages and disadvantages of _compilers_, but you
never said a word about the way you create your
executables...


[clear buffers]

>| Zeroes are applied by OS/2 - I do not know, if this
>| memory area is "cleared" while the allocation is in
>| progress or after freeing the memory area (would be
>| the better way, because it could be done in another
>| thread).
>
>Result may be needed from following process?
>

Perhaps I should "outsource" allocation and clearing
to another part of the calculator. The buffer should
be allocated, whenever the calculator is in use. The
best idea would be to put it into the initialisation
of the application. Clearing of the buffer could be
done with a method which saves some clock cycles...
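
In outline - a sketch only, the real thing would go through the OS/2
allocation API rather than malloc():

#include <stdlib.h>
#include <string.h>

static unsigned char *mul_tables;      /* 4 KB table block, allocated once   */

int calc_init(void)                    /* called once from application init  */
{
    mul_tables = malloc(4096);
    return mul_tables != NULL;
}

void calc_mul_prepare(void)            /* called per multiplication          */
{
    memset(mul_tables, 0, 4096);       /* cheap clear, also warms the cache  */
}

void calc_shutdown(void)
{
    free(mul_tables);
    mul_tables = NULL;
}

int main(void)                         /* minimal usage                      */
{
    if (!calc_init()) return 1;
    calc_mul_prepare();
    calc_shutdown();
    return 0;
}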


[another try...]

>| I would have assumed, that the "nibble-add LUT-MUL"
>| needs much more cycles?
>
>Yes, I estimated more here also,
>but as it just adds one memory operand to a buffer,
> there are just 56 dword ADCs in the loop... mmh?? bug! ..STOP!
>
>Oops!
>I owe you an apologize, so please take it.
>

Aaah - it's too heavy, can't hold it much longer! ;)

>I fooled myself here with the "worst-case" all bits set MUL,
>so I got a correct looking result after one add already.
>But it was just looking similar for I ignored the top-byte
>(01 FF FF...FE 00 00 00...00).
>But the result must read 00 FF...FF FF FE 000000..01 of course.
>
>In fact we must loop the add-table section (28*10clks) 112 times also.
>Obviously my test jumped to the end too early.
>

Probably you did not read my additional reply, where
I told you, that there's a bug in the routine... ;)

>Sorry for the "published" bug and even more sorry for
>the now corrected clock-cycles count:
>
> creation: 13771 174
>nibble-add LUT-MUL:
>calc table offset 2*112*(10) 2240 2*18
>add table-entry 2*112*28*(10) 62720 2*11
>loop ctrl 28*(14) 392 16
> ----- ----
> total: 79123 248
>A cycle-count over the whole story gave a few less than this,
>probably a better alignment without the RDTSC inserted.
>
>But now ~350 clocks/byte, seven times more (plus 4200 bytes)
> than the 64-bit MUL.
>
>So we better archive the "nibble-add LUT-MUL" for CPUs which
> don't have a somehow fast MUL-instruction.
>

Oh no ... I throw away my PC and get a Mac instead!

(350 / 70) = 7 ??? ;)


[less serious stuff]

>| Depends - if you were typing with your feet... ;)
>
>Only my main power-switch is a "kick"-type (underneath my desk).
>

Then - what took you that long?

>| (10.8 s / stroke)
>Far away from 600 strokes/minute...
>

Hmmmm ... speaking of a typing robot?

>| >And I played another half hour to check for clock-count details.
>| Thanks a lot!
>
>Don't mention, I really was interested to see the speed of it.
>And my bug wasn't a typo, it just was wrong thinking about the
>loop logic.
>

I was wondering about the "good" result... ;)

>| >I combined the sub-threads...
>| Seen. No, there was a short note, that V 0.0.1. was
>| a little bit buggy. Hope the results of the routine
>| wasn't used to compute targets of cruise missiles!
>
>If they added my bug too,
> the wrong target will be hit on 1.4.2023
>(this date is marked in my top-down diary
> as the end of the European Uniting War).
>

And the missile destroys the building, where they're
signing the contract? Terrorist! ;)

>| [snow in April]
>|>..., driving on highway with 30 kmh.
>| And May 1986 (or was it 1987?) with -40°C...
>| <brrrrr - my Diesel was frozen...>
>| Yikes - you drive _that_ fast in Austria? German's
>| meanwhile are used to creep much slower on the fast
>| lane (without snow!)...
>
>I know the slow fellow as NNW-German,
>["the non-flying Dutch" usual with a huge camping trailer]
>hunting for a toilet near the Mediterranean Sea. :)
>(Sorry Hervey, this is just a joke!)
>

Ah, I must have forgotten that they come through the
Alps after they leave Germany... ;)

Who's Harvey? The figure from the famous movie where
James Stewart played the guy who always talked to an
invisible "partner"?


[self-modifying code]

>| ... But you can't access CS/SS, either.


>| If the task ends, the system will crash if it tries
>| to restore all registers...
>
>?? A task-switch will save/restore SS:ESP anyway...
>I don't know if it's OS/2 or your compiler which will
> not allow access/create GDT/LDT-entries.
>

AFAIK, SS:ESP is pushed on the stack of the new task
before it is executed. So if you change SS/CS...

The access to _everything_ which belongs to the OS/2
kernel - e.g. memory which was not allocated for the
current thread - isn't possible. If you try it, then
your application is killed by OS/2 at once! This be-
haviour is called "Crash Protection"...

>| BTW - your code probably is running in Ring 0. OS/2
>| applications are running in Ring 3!
>
>Exact, all my PM-code runs with PL=0.
>My protection is different but simple:
> "never execute foreign code".
>

So you have access to everything - in Ring 3 you are
much more limited...

>["pippifax"...]
>| [ Nope - no translation for this one... ;) ]
>The term may be of British origin anyway (Monty Python?),
> Beth can answer this better.
>

Still believe in Santa Claus? ("Pipi" = "wee wee"!)

>[terminated: 'modern'-joke]
>

Killer! ;)

>[Application manuals...]
>| >In fact I hate this job, but who else can do it...
>| Busy with a spiritual debate: Your "ghost"-writer?
>
>Must be an identical clone, but then he will hate it also.
>

With modern genetic manipulation techniques we might
be able to eliminate the "hate component"... ;)

>| Thrown aboard. Slipping upon some very old programs
>| without documentation, I learned not to wait with a
>| text until I've forgotten the contents... ;)
>
>As I'm very familiar with analysing driver software
>without any source-code, I easy find my way through my own
>creations. But yes, not always as of immediate.
>

I have about 30 Kbyte of very old code to check, how
much € would you take? :-D

>[Compatibility]
>| >Unfortunately "I" have to write my conversion programs.
>| Who do you think is writing mine? ;)
>
>Your compiler? :)
>

...is only able to compile already existing sources.
If there are none, then it can't create them. Thus -
I have to code it myself...

>| [compiler]
>| >| I don't think, that I really want that... ;)
>| >Yippee! I may keep it for myself!
>| Until somebody is interested? ;)
>
>Depends how much trailing zeros (before DP) on the check...
>

One? Without another number in front of it?

BTW: Leading -> 0.0000 <- Trailing ...

>No serious, I think about to 'free' some of my tools
>after I finished my new version (G-styled tool-box),
>the previous stuff may become shareware
>and the new tools will be part of a developers-package.
>

So there's a problem with the MUL routine. Myself is
an old chaotic freedom fighter, giving code away for
free. Wouldn't be a fair thing, if I steal your idea
and share it with everybody for nothing...

>| So you had good luck, because the _latest_ revision
>| is attached here and now! I hope, this one is kept
>| for a while... ;)
>
>It will, [V 0.0.3 text and code-block gone to archive]
>

As a warning, how we _never_ should do it...

>Ok, no reason to be disappointed about the timing,
> we tried, we learned,...
>Perhaps this method can gain speed for divide.
>

"Perhaps, perhaps..." having an old song in my mind,
but do not know who was the singer (very nice female
voice, anyway)...

>Just for info (your figures are dw-aligned anyway):
>I updated my byte-oriented MUL as:
>Zero-extend to dword (register only) if smaller operand
>or unaligned size.
>So my MUL work with dw-words now and became much faster.
>
>BTW:
>Some calculation needs factor-summing ("A= A + B*C"),
>in this case the MUL-routine can be entered after the
>'clear result buffer' (an additional MUL&ADD label).
>

See above about "copyright" considerations! (If I am
not very "serious" in most cases - this one's an ex-
ception. It is a serious issue!)


[upcoming: "the DIV" - horror spectacle w/o end]

>I'll continue with the (still topic) divide problem.
>Your previous DIV-idea was somehow similar to the latest MUL-version.
>
>Let's assume the worst case is 1/FF FF...FF [1/(2^896-1)]
>perhaps more worse will be FF FF...FE/FF FF...FF (we better check both),
>and the result shall be at least 896 bits (112 bytes) precise,
>with the (10^x)-exponent adjusted accordingly.
>

Which grows into "unsolvable" problems? While there's
no DIV, we can't do a divide. A big disadvantage, if
we use hexadecimals instead of BCDs.

A wonderful life it was - there were no sins and the
earth was peaceful and nice ... but now we have these
darn hexadecimals to handle...

>The DIV 64 (code F7.. edx:eax/m32) is a vectored [24/40] instruction.
>Together with all necessary zero/limit checks,
>(needed to avoid DIV-exceptions),
>I estimate about 600 clock-cycles per result-byte for it.
>A multiplication in front of it may be necessary to produce full
>integer precision results.
>And the 10^x exponent handling will add some extra clock-cycles.
>Rounding may be done elsewhere.
>
>I'll modify my CMP/SUB divide to work on dwords,
>and I also will try to use the DIV instruction using quad-words.
>If you can reverse your MUL to become a 'nibble=SUB_LUT_DIV'
>we can compare the timing of the three variants.
>

And then throw it away... ;) ;) ;) (= "hahaha")

If I'm in the mood - I'll try to code it by tomorrow
evening. There's still a lot of work on some other
important things, so I probably won't have much time
to think about the DIV - it is really complicated to
handle a base10 exponent from a base16 routine. Will
need an extra conversion routine. Give me some hours
to get in touch with all those problems...

>The 1/x Newton-Raphson method is still on my desk,
>but it needs too many iterations yet, perhaps I find a short-cut
>by playing around with logarithm-formulas.
>

Will make things even more complicated? ;)


Have a nice Holiday

Bernhard Schornak

bv_schornak

unread,
Apr 21, 2003, 4:09:13 AM4/21/03
to
myself wrote:

> ... about the DIV - it is really complicated to


> handle a base10 exponent from a base16 routine. Will
> need an extra conversion routine. Give me some hours
> to get in touch with all those problems...


As I thought - more than complicated...

Let's take a really simple example:

(1 / 2) = ?

And now? We have to multiply OP1 by 10. Not that big
a problem? Then - what about

(1,000,000,000 / 2,000,000,000) = ?

In this case we have to multiply OP1 by this "small"
number: "10,000,000,000"! Thus, we need a table with
65536 entries à 128 bytes = 8,388,608 bytes. Obviously
not the thing I wanted.

Next problem.

To "normalize" both operands (get the first digit at
the same byte position in the buffer), a hex-dec-hex
conversion is necessary. How can we do that w/o an
already existing DIV? Is this extra work really the
thing I wanted?

Looking at this - either I should go back to my slow
(but very accurate) BCD solution, _or_ I have to use
a hexadecimal exponent, too. With all side-effects I
don't want to have (e.g. 0.9 = 0.5+0.25+0.125...).
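
That side-effect is easy to demonstrate - 0.9 has no finite binary (or
hexadecimal) expansion, so a binary mantissa can only approximate it:

#include <stdio.h>

int main(void)
{
    printf("0.9 stored as a binary double: %.20f\n", 0.9);
    /* typically prints 0.90000000000000002220..., not 0.9 exactly */
    return 0;
}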

But I want to have an _accurate_ calculator! So this
hexadecimal solution is _not_ a solution at all.

My former BCD MUL didn't need a table at all - while
the new one needs about half of the calculation time
to create a 4 Kbyte table. For DIV it might be a bit
better - I would create a table with _10_ entries to
speed up the BCD DIV, too. The hexadecimal DIV needs
a table with _16_ entries (0...F) and probably a lot
of shift operations, too. With each shift we have to
do a hex-dec-hex conversion, because we are calcula-
ting powers of 10 (that is: 1st convert, then shift,
then convert back)...

BTW: Thinking about this, my MUL could be simplified
by using the same "trick" as in my BCD version. The
buffer isn't needed then. It would need 32 add loops
with 112 search for digit loops per add loop - might
be faster. At least, if not all digits are set in an
operand (probably true in most cases)...

Arguments?

wolfgang kern

unread,
Apr 21, 2003, 8:29:03 PM4/21/03
to

Bernhard wrote:

(answering both notes)

[new processor features]

| >| Wouldn't sell too much new processors, if brand-new
| >| "features" would be the same as in the last (older)
| >| generation... ;)

| >The visible side of the coin...
| Use a mirror... Then you can see _both_ sides at the
| same time. But - you should be able to read mirrored
| letters... ;)

:)


| >Much too often the new features come on cost of some
| >useful "old" things
| >ie: LOADALL (286/386) disappeared with 486,
| > 16-bit BSWAP (undocumented 486) don't work on +586,
| > strict instruction pairing won't make sense above pentium,
| > single byte 'INC/DEC,reg' are "abused" on IA-64-CPUs,
| > prefix-meaning changed with SSE (F3),
| > opcode-bits are merged into displacement-area with MMX,...

| As I said long ago - I should have kept the 68k line
| straight on...

Yeah, and I'm really sad that the Zilogs didn't make it into PCs.

| [complicated compiler invocation]

| >You see..., whenever I like to add a new CPU(or version)
| >to my tools, only a few changes in the disassembler are
| >needed, and later it also can tell me in detail for which
| >CPU-type a program is written (foreign code analysis).

| How do you write your sources? As text files, which
| are translated into binary files, or directly with a
| hexeditor? We're talking several weeks about all the
| advantages and disadvantages of _compilers_, but you
| never told a word about the way you create your exe-
| cutables...

"executables" like .com/.exe/.dll won't run on KESYS,
but sometimes I analyse, edit or create M$-compatible stuff.

KESYS-applications don't contain any code,
there are just parameter/command-strings which come "compiled"
(a better term is compressed) from a tool-box.

I write directly into the hex "opcode"-field of my disassembler:
address (always true physical) and mode (16/32-bit)
can be changed at any time.
Source-text, branch-/call-info, and re-engineering-flags
are created by the disassembler after every key-stroke (full screen update).
The comment-field can be written to as well and is stored in a tag-file.

ie:(real 16); fields are larger than shown here

address |opcode |source-text |call |branch|comments |flags
FFFF:0100 E9FD02 JMP 0300 initialise: BR
FFFF:0103 6650 PUSH EAX FN31xx: r0q S-4
FFFF:0105 2E668706300F EX EAX CS:[0F30]q swap ptr XMq
FFFF:010B FFD0 CALLw AX var hiw=param Cr0w S-2
FFFF:010D 90......xF2 NOP (242d) module 0
FFFF:01FF CF RET INTw BF S+6
FFFF:0200 E8DD7D CALL 7FE0 enterPM32 Cw+R
0028:00000203 31C9 CLR ECX PM32 now Fr1q
0028 00000205 F6450E20 TEST SS:[EBP+0E]b,20 b5 recurs FMb
.....

I can move the cursor to any byte within the opcode and immediately
step/run/trace/move the code in memory, or save/load any part to
or from disk.
An alternate dump-view allows direct ASCII-string input.

My debugger (kernel module) can show "all" CPU-info
incl. FPU/MMX/TSC/MSRs/stack/.../PCI-device-list,
and it may even debug itself (dual set).


| [clear buffers]
|
| >| Zeroes are applied by OS/2 - I do not know, if this
| >| memory area is "cleared" while the allocation is in
| >| progress or after freeing the memory area (would be
| >| the better way, because it could be done in another
| >| thread).

| >Result may be needed from following process?

| Perhaps I should "outsource" allocation and clearing
| to another part of the calculator. The buffer should
| be allocated, whenever the calculator is in use. The
| best idea would be to put it into the initialisation
| of the application. Clearing of the buffer could be
| done with a method which saves some clock cycles...

Yes.

| [another try...]

| >| I would have assumed, that the "nibble-add LUT-MUL"
| >| needs much more cycles?

| >Yes, I estimated more here also,
| >but as it just adds one memory operand to a buffer,
| > there are just 56 dword ADCs in the loop... mmh?? bug! ..STOP!

| >Oops!
| >I owe you an apologize, so please take it.

| Aaah - it's too heavy, can't hold it much longer! ;)

:)

| >I fooled myself here with the "worst-case" all bits set MUL,
| >so I got a correct looking result after one add already.
| >But it was just looking similar for I ignored the top-byte
| >(01 FF FF...FE 00 00 00...00).
| >But the result must read 00 FF...FF FF FE 000000..01 of course.
| >In fact we must loop the add-table section (28*10clks) 112 times also.
| >Obviously my test jumped to the end too early.

| Probably you did not read my additional reply, where
| I told you, that there's a bug in the routine... ;)

I read it, but I didn't use your coding, I just took the idea...

| >Sorry for the "published" bug and even more sorry for
| >the now corrected clock-cycles count:

| > creation: 13771 174
| >nibble-add LUT-MUL:
| >calc table offset 2*112*(10) 2240 2*18
| >add table-entry 2*112*28*(10) 62720 2*11
| >loop ctrl 28*(14) 392 16
| > ----- ----
| > total: 79123 248
| >A cycle-count over the whole story gave a few less than this,
| >probably a better alignment without the RDTSC inserted.
| >
| >But now ~350 clocks/byte, seven times more (plus 4200 bytes)
| > than the 64-bit MUL.
| >
| >So we better archive the "nibble-add LUT-MUL" for CPUs which
| > don't have a somehow fast MUL-instruction.

| Oh no ... I throw away my PC and get a Mac instead!

The method won't be faster there...

| (350 / 70) = 7 ??? ;)

Got me, just read it as "5" :)
Seems a short vacation was highly necessary already.

| [less serious stuff]
| >| Depends - if you were typing with your feet... ;)
| >Only my main power-switch is a "kick"-type (underneath my desk).
| Then - what took you that long?

Don't know, perhaps day-dreaming of a "perfect CPU" ?

| >| (10.8 s / stroke)
| >Far away from 600 strokes/minute...

| Hmmmm ... speaking of a typing robot?

No, a former office clerk was near the world-record.

| >| >And I played another half hour to check for clock-count details.
| >| Thanks a lot!

| >Don't mention, I really was interested to see the speed of it.
| >And my bug wasn't a typo, it just was wrong thinking about the
| >loop logic.

| I was wondering about the "good" result... ;)

Me too.

| >| >I combined the sub-threads...
| >| Seen. No, there was a short note, that V 0.0.1. was
| >| a little bit buggy. Hope the results of the routine
| >| wasn't used to compute targets of cruise missiles!

| >If they added my bug too,
| > the wrong target will be hit on 1.4.2023
| >(this date is marked in my top-down diary
| > as the end of the European Uniting War).

| And the missile destroys the building, where they're
| signing the contract? Terrorist! ;)

Yes, we are guilty in advance,
but my reverse history don't mention this fact.

[snow in April,...]


| Ah, I must have forgotten that they come through the
| Alps after they leave Germany... ;)

:)


| Who's Harvey? The figure from the famous movie where
| James Steward played the guy who always talked to an
| invisible "partner"?

No, Hervey is a NG-fellow from "Netherlands",
but yes, invisible to me.

[self-modifying code..]

| AFAIK, SS:ESP is pushed on the stack of the new task
| before it is executed. So if you change SS/CS...

A 'true' task-switch saves all registers incl. seg-regs
in the task-segment (a dedicated memory block)
rather than on the caller's stack.

| The access to _everything_ which belongs to the OS/2
| kernel - e.g. memory which was not allocated for the
| current thread - isn't possible. If you try it, then
| your application is killed by OS/2 at once! This be-
| haviour is called "Crash Protection"...

Yes, "save" but restricted...

| >...Exact, all my PM-code runs with PL=0.


| >My protection is different but simple:
| > "never execute foreign code".
| So you have access to everything - in Ring 3 you are
| much more limited...

Sure.

| >["pippifax"...]
| >| [ Nope - no translation for this one... ;) ]
| >The term may be of British origin anyway (Monty Python?),
| > Beth can answer this better.

| Still believe in Santa Claus? ("Pipi" = "wee wee"!)

I remember another meaning from some nasty kid-rhymes.
When I was 8 years old I spent some time in the UK.

| >[terminated: 'modern'-joke]
| Killer! ;)

Oh no, I just had no more arguments....

| >[Application manuals...]
| >| ...."ghost"-writer?


| >Must be an identical clone, but then he will hate it also.
| With modern genetic manipulation techniques we might
| be able to eliminate the "hate component"... ;)

So I wait for completion...

[]


| I have about 30 Kbyte of very old code to check, how
| much € would you take? :-D

[I cannot type an Euro-sign with my OE, ALT-128="Ç"]

To check which CPU and roughly what it's about: 300.-
Full analysis (documented asm source) : 2000,-
this is an offer on a private basis,
don't tell this to my sales-agent :)

| >[Compatibility]
| >| >Unfortunately "I" have to write my conversion programs.
| >| Who do you think is writing mine? ;)
| >Your compiler? :)
| ...is only able to compile already existing sources.
| If there are none, then it can't create them. Thus -
| I have to code it myself...

Wasn't meant that serious.. :)

[compiler...]


| >Depends how much trailing zeros (before DP) on the check...
| One? Without another number in front of it?

"Des is a weng zweng!"


| BTW: Leading -> 0.0000 <- Trailing ...

before DP -> . <- after

| >No serious, I think about to 'free' some of my tools
| >after I finished my new version (G-styled tool-box),
| >the previous stuff may become shareware
| >and the new tools will be part of a developers-package.

| So there's a problem with the MUL routine. Myself is
| an old chaotic freedom fighter, giving code away for
| free. Wouldn't be a fair thing, if I steal your idea
| and share it with everybody for nothing...

All code-hints I publish in NGs are for free anyway,
including bugs..., unless noted otherwise.

....I hope, this one is kept for a while... ;)


| >It will, [V 0.0.3 text and code-block gone to archive]
| As a warning, how we _never_ should do it...

| >Ok, no reason to be disappointed about the timing,
| > we tried, we learned,...
| >Perhaps this method can gain speed for divide.

| "Perhaps, perhaps..." having an old song in my mind,
| but do not know who was the singer (very nice female
| voice, anyway)...

| >Just for info (your figures are dw-aligned anyway):
| >I updated my byte-oriented MUL as:
| >Zero-extend to dword (register only) if smaller operand
| >or unaligned size.

| >So my MUL work with dwords now and became much faster.

| >BTW:
| >Some calculation needs factor-summing ("A= A + B*C"),
| >in this case the MUL-routine can be entered after the
| >'clear result buffer' (an additional MUL&ADD label).

| See above about "copyright" considerations! (If I am
| not very "serious" in most cases - this one's an ex-
| ception. It is a serious issue!)

I don't care about the copyright for these few bytes,
it may help some programs to get a touch of useful code.
The NG-tracking files will show who published it first,
so nobody else can claim a copyright protection for it.
And every experienced programmer will "own" a similar routine anyway.
But you're right,
a working fast DIV-solution will ask for 'some' hours.
As we only share the theory and our code-solutions will be quite
different due to completely different coding ways and environments
(my final code won't run on your OS and vice versa),
I see no problem with copyrights.
---------------------------------


[upcoming: "the DIV" - horror spectacle w/o end]

| >I'll continue with the (still topic) divide problem.
| >Your previous DIV-idea was somehow similar to the latest MUL-version.

| >Let's assume the worst case is 1/FF FF...FF [1/(2^896-1)]
| >perhaps more worse will be FF FF...FE/FF FF...FF (we better check both),
| >and the result shall be at least 896 bits (112 bytes) precise,
| >with the (10^x)-exponent adjusted accordingly.

| Which grows to "unsolvable" problems? While there's
| no DIV, we can't do a divide. A big disadvantage, if
| we use hexadecimals instead of BCDs.

| A wonderful life it was - there were no sins and the
| earth was peaceful and nice ... but now we have this
| darn hexadecimals to handle...

Where is a problem with HEX which isn't also with BCD?

| >The DIV 64 (code F7.. edx:eax/m32) is a vectored [24/40] instruction.
| >Together with all necessary zero/limit checks,
| >(needed to avoid DIV-exceptions),
| >I estimate about 600 clock-cycles per result-byte for it.
| >A multiplication in front of it may be necessary to produce full
| >integer precision results.
| >And the 10^x exponent handling will add some extra clock-cycles.
| >Rounding may be done elsewhere.

| >I'll modify my CMP/SUB divide to work on dwords,
| >and I also will try to use the DIV instruction using quad-words.
| >If you can reverse your MUL to become a 'nibble=SUB_LUT_DIV'
| >we can compare the timing of the three variants.

| And then throw it away... ;) ;) ;) (= "hahaha")

Just the slowest two versions will enter the trash-can. :)

| If I'm in the mood - I try to code it until tomorrow
| evening. There's still a lot of work with some other
| important things, so probably I've not too much time
| to think about the DIV - it is really complicated to
| handle a base10 exponent from a base16 routine. Will
| need an extra conversion routine. Give me some hours
| to get in touch with all those problems...

| >The 1/x Newton-Raphson method is still on my desk,
| >but it needs too many iterations yet, perhaps I find a short-cut
| >by playing around with logarithm-formulas.
| Will make things even more complicated? ;)

I'm afraid, yes..

-----------------------------
| myself wrote:
|
| > ... about the DIV - it is really complicated to
| > handle a base10 exponent from a base16 routine. Will
| > need an extra conversion routine. Give me some hours
| > to get in touch with all those problems...

| As I thought - more than complicated...

| Let's take a really simple example:

| (1 / 2) = ?
|
| And now? We have to multiply OP1 by 10. Not that big
| problem? Then - what about
|
| (1,000,000,000 / 2,000,000,000) = ?

| In this case we have to multiply OP1 by this "small"
| number: "10,000,000,000"! Thus, we need a table with
| 65536 entries a 128 byte = 8,388,608 byte. Obviously
| not the thing I wanted.

I see this differently:
the largest integer from 112 bytes produces 269 decimal digits,
all you need is a table with 269 powers of ten as binaries,
112 (or 128) bytes each. That's just 269*128 = 34432 bytes.

10^1 = 0A
10^2 = 064
10^3 = 03E8
...
10^269 = (112 byte hex-string)

This table can be used for BIN->ASCII conversion also.

And if you extend this table to a semi-log LUT
(9*34432 = 309888 bytes in total)

1*10^1 2*10^1 3*10^1 .... 9*10^1
....
1*10^268 ..... 8*10^268 9*10^268

you get an easy decimal-digit-LUT for binary figures.
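
Just to make the table idea concrete, here a rough C sketch (not my
KESYS code, all names and sizes are invented for the example) which
builds the 10^x entries with nothing but byte-adds and carries, so no
DIV and no MUL instruction is needed for it:

#include <stdint.h>
#include <string.h>

#define NBYTES 112                 /* one entry = 112 byte (896 bit)     */
#define NPOW   270                 /* 10^0 ... 10^269 still fit in there */

static uint8_t pow10_tab[NPOW][NBYTES];   /* little-endian byte strings  */

void build_pow10_tab(void)
{
    memset(pow10_tab, 0, sizeof pow10_tab);
    pow10_tab[0][0] = 1;                          /* 10^0 = 1            */

    for (int k = 1; k < NPOW; k++) {              /* entry k = entry k-1 * 10 */
        unsigned carry = 0;
        for (int i = 0; i < NBYTES; i++) {
            unsigned v = pow10_tab[k-1][i] * 10u + carry;
            pow10_tab[k][i] = (uint8_t)v;         /* low byte            */
            carry = v >> 8;                       /* carry into next byte */
        }
    }
}

The 2*10^x ... 9*10^x rows of the semi-log LUT fall out the same way,
just by adding the 10^x entry to itself a few times.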

| Next problem.
|
| To "normalize" both operands (get the first digit at
| the same byte position in the buffer), a hex-dec-hex
| conversions is necessary. How can we do that w/o an
| already existing DIV? Is this extra work really the
| thing I wanted?

Normalisation is needed on SUB/ADD with different exponents only.
No, not a conversion,
just a comparison in the 10^x LUT is needed to find
a factor which produces the maximal integer precision.

| Looking at this - either I should go back to my slow
| (but very accurate) BCD solution, _or_ I have to use
| a hexadecimal exponent, too. With all side-effects I
| don't want to have (e.g. 0.9 = 0.5+0.25+0.125...).

| But I want to have an _accurate_ calculator! So this
| hexadecimal solution is _not_ a solution at all.

As long as the mantissa is an integer (decimally representable)
it does not matter if it's HEX (binary) or BCD.
The binary way allows dword handling and full use of the
CPU's capabilities, while BCD needs nibble- or byte-
oriented handling and the ">9" correcting stuff.

The only important thing in our quest is the exponent
to hold 10^x values rather than 2^x ones.
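
So the storage would look something like this (just an illustration,
the names and the limb count are made up here, my real layout is
different):

#include <stdint.h>

#define LIMBS 28                     /* 28 dwords = 112 byte mantissa          */

struct bignum {
    uint32_t mant[LIMBS];            /* integer mantissa, little-endian dwords */
    int32_t  exp10;                  /* value = mant * 10^exp10 (decimal!)     */
    int      sign;                   /* +1 or -1                               */
};

Binary mantissa for the CPU, decimal exponent for the user.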

| My former BCD MUL didn't need a table at all - while
| the new one needs about half of the calculation time
| to create a 4 Kbyte table. For DIV it might be a bit
| better - I would create a table with _10_ entries to
| speed up the BCD DIV, too. The hexadecimal DIV needs
| a table with _16_ entries (0...F) and probably a lot
| of shift operations, too. With each shift we have to
| do a hex-dec-hex conversion, because we are calcula-
| ting powers of 10 (that is: 1st convert, then shift,
| then convert back)...

See below.

| BTW: Thinking about this, my MUL could be simplified
| by using the same "trick" as in my BCD version. The
| buffer isn't needed then. It would need 32 add loops
| with 112 search for digit loops per add loop - might
| be faster. At least, if not all digits are set in an
| operand (probably true in most cases)...

| Arguments?

please don't panic,
it's not _that_ worse... ;)
-----
Example (JUST 32 BITS for easier explanation):

it shall do:
DIV ... 00FFFFFF by .. 00FFFFFE both exponents 0 or equal yet
16777215 / 16777214 = 0.99999994
------------------------------------------------
1. set result exponent to dividends minus divisors exponent
2. determine adjustment factor

either:
CMP-scan DIVISOR in 10^x LUT
00 3B9ACA00 > 00FFFFFE >= 05F5E100
decimal 10^9 10^8
GOT >=10^8
or:
use the MSBit-number to determine decimal power.
[MSBit-NR. * 0.30103] = power of ten value.
or:
determine power10 of both operands and use the difference.
or:
decide per definition how many decimal digits are desired.

(I use 8 decimal digits yet)
3. MUL DIVIDEND, 10^8 00FF FFFF * 05F5E100 = 5F5E0FA0A1F00
SUB result exponent,8

4.NOW DIVide
5F5E0FA0A1F00 / 00FFFFFE = 05F5E0FA
=99999994 E-8
------------------------------------------------
If the operands exponents are different
then add the difference to the adjustment (MUL and DEC result EXP).
------------------------------------
Seems to be easy that way.....
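
For the 'determine adjustment factor' step, here a rough C sketch of
the CMP-scan variant, using the pow10_tab sketch from above (again not
real code, and the digit-count rule is just one of the four choices
listed):

#include <stdint.h>

/* NBYTES, NPOW and pow10_tab[][] as in the 10^x table sketch above */

/* compare two NBYTES strings, most significant byte first: -1, 0, +1 */
static int big_cmp_bytes(const uint8_t *a, const uint8_t *b)
{
    for (int i = NBYTES - 1; i >= 0; i--)
        if (a[i] != b[i]) return a[i] > b[i] ? 1 : -1;
    return 0;
}

/* number of decimal digits of x: largest k with 10^k <= x, plus one */
int dec_digits(const uint8_t *x)
{
    for (int k = NPOW - 1; k > 0; k--)
        if (big_cmp_bytes(x, pow10_tab[k]) >= 0)
            return k + 1;
    return 1;
}

/* then: MUL the dividend by pow10_tab[dec_digits(divisor)],
   SUB that digit count from the result exponent,
   and the integer divide gives roughly that many significant digits. */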

__
wolfgang


bv_schornak
Apr 26, 2003, 12:57:30 PM
[remarkable]

Have a look at this one! They don't deliver goods to the
USA and Britain...

<http://www.mayr-tools.com/>

Seen while looking for a translation of "Dampfstrahler"!


wolfgang kern wrote:

[new processor features]

>| As I said long ago - I should have kept the 68k line
>| straight on...
>
>Yeah, and I'm really sad for the Zilogs didn't make it for PCs.
>

Don't forget the TMS 9900! The TI 99 was a nice machine,
not too good for real work, but the game cartridges were
offering very _sophisticated_ games. I remember, we were
kind of addicted to beating each other's high score in
this "Munchman" game (a "Pacman" clone, much better than
Atari's original "Pacman")...


[a mystery called KESYS] ;)

>"executables" like .com/.exe/.dll won't run on KESYS,
>but sometimes I analyse, edit or create M$-compatible stuff.
>
>KESYS-applications don't contain any code,
>there are just parameter/command-strings which come "compiled"
>(a better term is compressed) from a tool-box.
>
>I write direct into the hex "opcode"-field of my disassembler:
>address (always true physical)and mode(16/32-bit)
>can be changed at any time,
>Source-text, branch-/call-info, and re-engineering-flags
>are created by the disassembler after every key-stroke (full screen update).
>The comment-field can be written also and is stored in a tag-file.
>
>ie:(real 16); fields are larger than shown here
>
>address |opcode |source-text |call |branch|comments |flags
>FFFF:0100 E9FD02 JMP 0300 initialise: BR
>FFFF:0103 6650 PUSH EAX FN31xx: r0q S-4
>FFFF:0105 2E668706300F EX EAX CS:[0F30]q swap ptr XMq
>FFFF:010B FFD0 CALLw AX var hiw=param Cr0w S-2
>FFFF:010D 90......xF2 NOP (242d) module 0
>FFFF:01FF CF RET INTw BF S+6
>FFFF:0200 E8DD7D CALL 7FE0 enterPM32 Cw+R
>0028:00000203 31C9 CLR ECX PM32 now Fr1q
>0028 00000205 F6450E20 TEST SS:[EBP+0E]b,20 b5 recurs FMb

>......


>
>I can move the cursor to any byte within the opcode and immediate
>step/run/trace/move the code in memory or save/load any part to
>or from disk.
>An alternate dump-view allow direct ASCII-string inputs.
>
>My debugger (kernel module) can show "all" CPU-info
>incl. FPU/MMX/TSC/MSRs/stack/.../PCI-device-list,
>and I may even debug itself (dual set).
>

As long as you check these things yourself... ;)

Reading with care... is your code stored in a ROM, or is
it loaded from media into RAM prior to the execution? It
sounds like a ROM based thing...

[clear buffers]

>| ... Clearing of the buffer could be
>| done with a method which saves some clock cycles...
>
>Yes.
>

I will have a look at the AMD optimisation guide - there
are some examples how to save clock cycles...


[another try...]

(

>| Probably you did not read my additional reply, where
>| I told you, that there's a bug in the routine... ;)
>
>I read it, but I didn't use your coding,
>

)

> I just took the idea...
>

Thief! :-D

Don't take it serious in any way - thinking of all prior
statements about copyrights - just to "share" some funny
thoughts which came to my mind while reading... ;)

>| Oh no ... I throw away my PC and get a Mac instead!
>
>The method wont be faster there...
>

With numbers in the right order, real BCD support and an
"used apple" as logo? Don't believe it...

>| (350 / 70) = 7 ??? ;)
>Got me, just read it as "5" :)
>

I did - could have been, that you meant (490 / 70)... ;)

>Seems a short vacation was highly necessary already.
>

What the heck is "vacation"?

[less serious stuff]

>| >| Depends - if you were typing with your feet... ;)
>| >Only my main power-switch is a "kick"-type (underneath my desk).
>| Then - what took you that long?
>
>Don't know, perhaps day-dreaming of a "perfect CPU" ?
>

From the new company "Nobody"?

>| >| (10.8 s / stroke)
>| >Far away from 600 strokes/minute...
>
>| Hmmmm ... speaking of a typing robot?
>No, a former office clerk was near the world-record.
>

What should it be good for? No one can speak that fast!

>| I was wondering about the "good" result... ;)
>
>Me too.
>

But I was the first who published it... :-P

>| And the missile destroys the building, where they're
>| signing the contract? Terrorist! ;)
>
>Yes, we are guilty in advance,
> but my reverse history don't mention this fact.
>

Poor me - they will strangle the author...

> [snow in April,...]
>| Ah, I must have forgotten that they come through the
>| Alps after they leave Germany... ;)
>:)
>| Who's Harvey? The figure from the famous movie where
>| James Steward played the guy who always talked to an
>| invisible "partner"?
>No, Hervey is a NG-fellow from "Netherlands",
>but yes, invisible to me.
>

That is - killfiled? ;)


[self-modifying code..]

>| AFAIK, SS:ESP is pushed on the stack of the new task
>| before it is executed. So if you change SS/CS...
>
>A 'true' task-switch saves all registers incl. seg-regs
> in the task-segment(a dedicated memory-block)
> rather than on the callers stack.
>

Do not nail me to a wall. I never wasted a thought about
this stuff, I just assumed it...

>| The access to _everything_ which belongs to the OS/2
>| kernel - e.g. memory which was not allocated for the
>| current thread - isn't possible. If you try it, then
>| your application is killed by OS/2 at once! This be-
>| haviour is called "Crash Protection"...
>
>Yes, "save" but restricted...
>

Which is the _opposite_ of unrestricted and insecure? As
Windows 9x tends to be?

>| >["pippifax"...]
>| >| [ Nope - no translation for this one... ;) ]
>| >The term may be of British origin anyway (Monty Python?),
>| > Beth can answer this better.
>
>| Still believe in Santa Claus? ("Pipi" = "wee wee"!)
>I remember another meaning in some nasty kid-rhymes.
>As I was 8 years old I spent some time in UK.
>

Sorry. Another meaning of what? Pippifax, Monty Python,
British origin, Santa Claus, pipi, wee wee???? And there
still is Astrid Lindgren's _Pippi_ Langstrumpf... ??? ;)

>| >[terminated: 'modern'-joke]
>| Killer! ;)
>Oh no, I just had no more arguments....
>

So - I win this one? ;)

>| >[Application manuals...]
>| >| ...."ghost"-writer?
>| >Must be an identical clone, but then he will hate it also.
>| With modern genetic manipulation techniques we might
>| be able to eliminate the "hate component"... ;)
>So I wait for completion...
>

Perhaps they need some of your cells, first? ;)

>| I have about 30 Kbyte of very old code to check, how
>| much € would you take? :-D
>
>[I cannot type an Euro-sign with my OE, ALT-128="Ç"]
>

Don't be sad! In OS/2, the € sign hasn't existed for
more than 3 or 4 years now - before that it must have
been Ç as well (which _is_ ASCII 128). I just use the €
from my keyboard, and it looks like € (Mozilla doesn't
know this € sign, as well)... ;)

(An advantage of an OS which is used in banks and insur-
ances...)

>To check what CPU and roughly what's about: 300.-
>Full analysis (documented asm source) : 2000,-
> this is an offer on a private base,
> don't tell this to my sales-agent :)
>

Wow! I think I should look for some more code, so I save
some shipping costs... ;)

>| >[Compatibility]
>| >| >Unfortunately "I" have to write my conversion programs.
>| >| Who do you think is writing mine? ;)
>| >Your compiler? :)
>| ...is only able to compile already existing sources.
>| If there are none, then it can't create them. Thus -
>| I have to code it myself...
>Wasn't meant that serious.. :)
>

Same to all of my above quoted words... ;)

> [compiler...]
>| >Depends how much trailing zeros (before DP) on the check...
>| One? Without another number in front of it?
>"Des is a weng zweng!"
>

"Meah is ned drin in mei'n Geldbeid'l, nua massig hoasse
Luft!"

>| BTW: Leading -> 0.0000 <- Trailing ...
> before DP -> . <- after
>

Zero remains nothing at all... (Jethro Tull ???)


[copyrights]

>| So there's a problem with the MUL routine. Myself is
>| an old chaotic freedom fighter, giving code away for
>| free. Wouldn't be a fair thing, if I steal your idea
>| and share it with everybody for nothing...
>
>All code-hints I publish in NGs are for free anyway,
>including bugs..., unless noted otherwise.
>

So you don't copyright your bugs???

>| See above about "copyright" considerations! (If I am
>| not very "serious" in most cases - this one's an ex-
>| ception. It is a serious issue!)
>
>I don't care for the copyright for this few bytes,
>it may help some programs to get a touch of useful code.
>The NG-tracking files will show who published it first,
>so nobody else can claim a copyright protection for it.
>And every experienced programmer will "owe" a similar routine anyway.
>But you're right,
> a working fast DIV-solution will ask for 'some' hours.
>As we only share the theory and our code-solutions will be quite
> different due completely different coding ways and environments,
>(my final code wont run on your OS and vice versa)
>I see no problem with copyrights.
>

Not as _trivial_ as it seems. In a few words: I wouldn't
like it too much, if I give my code away for _free_, and
someone else takes it and sells slightly modified copies
of it. Would be contrary to my intention, that I want to
support people with apps who can't afford the commercial
(too expensive) stuff. That's the reason why I copyright
my software - as long as derived stuff is given away for
free, it's okay. I just see my work as property of every
being on this planet...

["The DIV" - nightmare of mine]

>| Which grows to "unsolvable" problems? While there's
>| no DIV, we can't do a divide. A big disadvantage, if
>| we use hexadecimals instead of BCDs.
>
>| A wonderful life it was - there were no sins and the
>| earth was peaceful and nice ... but now we have this
>| darn hexadecimals to handle...
>
>Where is a problem with HEX which isn't also with BCD?
>

For me, it is! I have to think backwards, in hexadecimal
numbers which "represent" decimal numbers, but cannot be
handled the same way... ;)

Seriously: This is really complicated stuff for me. As I
coded my BCD calculator, I had a concept (or imagination
how it should work). At the moment - I have none. I have
to get in touch with this reversed hexadecimals. Not the
big problem, if they fit into a register - overwhelming,
if they are that large. The DIV adds multiple problems -
I probably have to think about it for a _longer_ time to
get the clue behind it. In the end, it's a logical task.
I just have to find a solution, how to split it into se-
veral sub-tasks I _can_ understand. May take some time.

>| >The DIV 64 (code F7.. edx:eax/m32) is a vectored [24/40] instruction.
>| >Together with all necessary zero/limit checks,
>| >(needed to avoid DIV-exceptions),
>| >I estimate about 600 clock-cycles per result-byte for it.
>| >A multiplication in front of it may be necessary to produce full
>| >integer precision results.
>| >And the 10^x exponent handling will add some extra clock-cycles.
>| >Rounding may be done elsewhere.
>
>| >I'll modify my CMP/SUB divide to work on dwords,
>| >and I also will try to use the DIV instruction using quad-words.
>| >If you can reverse your MUL to become a 'nibble=SUB_LUT_DIV'
>| >we can compare the timing of the three variants.
>
>| And then throw it away... ;) ;) ;) (= "hahaha")
>
>Just the slowest two versions will enter the trash-can. :)
>

Technical knock-out in round 10! ;)

>| >The 1/x Newton-Raphson method is still on my desk,
>| >but it needs too many iterations yet, perhaps I find a short-cut
>| >by playing around with logarithm-formulas.
>| Will make things even more complicated? ;)
>I'm afraid, yes..
>

I knew... ;)

>| (1 / 2) = ?
>|
>| And now? We have to multiply OP1 by 10. Not that big
>| problem? Then - what about
>|
>| (1,000,000,000 / 2,000,000,000) = ?
>
>| In this case we have to multiply OP1 by this "small"
>| number: "10,000,000,000"! Thus, we need a table with
>| 65536 entries a 128 byte = 8,388,608 byte. Obviously
>| not the thing I wanted.
>
>I see this different:
>the largest integer from 112 bytes produce 269 decimal digits,
>all you need is a table with 269 power of ten binaries,
>112 (or 128) bytes each. This are just 269*128 = 34432 bytes.
>
>10^1 = 0A
>10^2 = 064
>10^3 = 03E8
>....
>10^269 = (112 byte hex-string)
>

What about

((C.1E3C... E-0xD002) / (1.2345... E-0xD02E)) or
((7.3425... E 0x4F00) / (A.CECF... E 0x4F2F)) ?

Still covered by the 269 entries in the table?

>This table can be used for BIN->ASCII conversion also.
>
>And if you extend this table to a semi-log LUT
> (9*34432 = 309888 bytes in total)
>
>1*10^1 2*10^1 3*10^1 .... 9*10^1
>.....
>1*10^268 ..... 8*10^268 9*10^268
>
>you get an easy decimal-digit-LUT for binary figures.
>

Which would be essential for the calculator...

>| Next problem.
>|
>| To "normalize" both operands (get the first digit at
>| the same byte position in the buffer), a hex-dec-hex
>| conversions is necessary. How can we do that w/o an
>| already existing DIV? Is this extra work really the
>| thing I wanted?
>
>Normalisation is needed on SUB/ADD with different exponents only.
>No, not a conversion,
> just a comparision in the 10^x LUT is needed to find
>a factor which produces the maximal integer precision.
>

Probably I mismatch something, but if I shift an operand
right, I have to _multiply_ it with a power of ten, if I
shift it left, I have to divide it by a power of ten (or
multiply with a negative power). No problem, if a number
is base 10. But I would have assumed, that I need a con-
version first, before I can shift it, then do the shift,
finally convert it back to hex...

Example:

OP1 EXP 0000 NUM 0A 00 00 00 ( 10)
/
OP2 EXP 0000 NUM 00 01 00 00 ( 256)

Now - we need to shift it 2 digits up:

OP1 EXP 0000 NUM 00 0A 00 00 (2 560)
/
OP2 EXP 0000 NUM 00 01 00 00 ( 256)

Thus, we get a result of 0x0A without remainder. This is
not the true result (should be 0.0390625). The shifts we
did must be recognized somewhere. The shifting of OP1 is
one of the _most_ used operations in the DIV routine, so
we have to do it very often...

>| But I want to have an _accurate_ calculator! So this
>| hexadecimal solution is _not_ a solution at all.
>
>As long mantissa is integer (decimal representable)
>it does not matter if it's HEX(binary) or BCD.
>The binary way allow dword handling and full use of the
>CPU's capabilities, while BCD needs nibble- or byte-
>oriented handling and the ">9" correcting stuff.
>
>The only important thing in our quest is the exponent
> to hold 10^x values rather than 2^x ones.
>

See example... ;)

>| Arguments?
>
>please don't panic,
>it's not _that_ worse... ;)
>

If I could believe it... ;)

>Example (JUST 32 BITS for easier explanation):
>
>it shall do:
>DIV ... 00FFFFFF by .. 00FFFFFE both exponents 0 or equal yet
> 16777215 / 16777214 = 0.99999994
>------------------------------------------------
>

Probably the result should be 1.00000006... ;)
(In result buffer: 00 04 FF F8 06 E1 F5 05, not a very
accurate result...)

-------------------------------------------------
Which would be as easy as (since OP1 > OP2):
-------------------------------------------------
OP1 00 03 00 00 FF FF FF 00 00 00 00 00 00 00 00
OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 00 00

RES 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
---------------------------------------------------------
after 1st SUB
-------------------------------------------------
OP1 00 01 00 00 01 00 00 00 00 00 00 00 00 .....
OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....

RES 00 7F 00 00 ..... 00 00 00 00 00 00 00 00 10
-------------------------------------------------
And now we should shift OP1 six digits up, where we meet
our well known problem again...

shift 7 up (too small, so we do a further shift)
[maybe I should use a base 16 shift?]
-------------------------------------------------
OP1 00 04 00 00 00 E1 F5 05 00 00 00 00 00 .....
OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....

RES 00 7F FF F8 ..... 00 00 00 00 00 00 00 00 10
-------------------------------------------------
after SUB:
-------------------------------------------------
OP1 00 03 00 00 0A E1 F5 00 00 00 00 00 00 .....
OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....

RES 00 7F FF F8 ..... 00 00 00 00 05 00 00 00 10
-------------------------------------------------
shift 1 up
-------------------------------------------------
OP1 00 04 00 00 64 CA 9A 09 00 00 00 00 00 .....
OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....

RES 00 7F FF F7 ..... 00 00 00 00 05 00 00 00 10
-------------------------------------------------
after SUB:
-------------------------------------------------
OP1 00 03 00 00 76 CA 9A 00 00 00 00 00 00 .....
OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....

RES 00 7F FF F7 ..... 00 00 00 00 95 00 00 00 10
-------------------------------------------------
shift 1 up
-------------------------------------------------
OP1 00 04 00 00 9C E8 0B 06 00 00 00 00 00 .....
OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....

RES 00 7F FF F6 ..... 00 00 00 00 95 00 00 00 10
-------------------------------------------------
after SUB:
-------------------------------------------------
OP1 00 03 00 00 A8 E8 0B 00 00 00 00 00 00 .....
OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....

RES 00 7F FF F6 ..... 00 00 00 06 95 00 00 00 10
-------------------------------------------------

Where I obviously do something wrong, because I'm adding
my results (how often we can subtract OP2 from OP1) _to_
the result buffer without any further conversion...

Compare it with my BCD solution:

-------------------------------------------------
OP1 00 08 00 00 ..... FF 01 06 07 07 07 02 01 05
OP2 00 08 00 00 ..... FF 01 06 07 07 07 02 01 04

RES 00 FF 00 00 00 00 00 00 00 00 00 00 00 .....
-------------------------------------------------
SUB
-------------------------------------------------
OP1 00 01 00 00 ..... FF 00 00 00 00 00 00 00 01
OP2 00 08 00 00 ..... FF 01 06 07 07 07 02 01 04

RES 00 FF 00 00 01 00 00 00 00 00 00 00 00 .....
-------------------------------------------------
Shift 8 up
-------------------------------------------------
OP1 00 09 00 00 ..... 01 00 00 00 00 00 00 00 00
OP2 00 08 00 00 ..... FF 01 06 07 07 07 02 01 04

RES 00 FF FF F8 01 00 00 00 00 00 00 00 00 .....
-------------------------------------------------
SUB
-------------------------------------------------
OP1 00 08 00 00 ..... FF 01 06 01 01 03 09 03 00
OP2 00 08 00 00 ..... FF 01 06 07 07 07 02 01 04

RES 00 FF FF F8 01 00 00 00 00 00 00 00 05 .....
-------------------------------------------------
Shift 1 up
-------------------------------------------------
OP1 00 09 00 00 ..... 01 06 01 01 03 09 03 00 00
OP2 00 08 00 00 ..... FF 01 06 07 07 07 02 01 04

RES 00 FF FF F7 01 00 00 00 00 00 00 00 05 00 00
-------------------------------------------------
SUB
-------------------------------------------------
OP1 00 08 00 00 ..... FF 01 00 01 04 04 03 07 04
OP2 00 08 00 00 ..... FF 01 06 07 07 07 02 01 04

RES 00 FF FF F7 01 00 00 00 00 00 00 00 05 09 00
-------------------------------------------------
Shift 1 up
-------------------------------------------------
OP1 00 09 00 00 ..... 01 00 01 04 04 03 07 04 00
OP2 00 08 00 00 ..... FF 01 06 07 07 07 02 01 04

RES 00 FF FF F6 01 00 00 00 00 00 00 00 05 09 00
-------------------------------------------------

At this point we already have a result which is accurate
in 10 digits, without _any_ further conversions or other
annoying and time consuming tasks. In the end, the "old"
BCD solution could be much faster than _any_ hex routine
which has to perform several conversions for each single
operation...

BTW - my hex example throws the same result - but should
be hexadecimal rather than decimal (as it is!)...

What I like the most - I check all of the operations I'm
doing (in contrast to this hex solution, where I start
to figure out "What am I doing?" all the time).

I am still unsure about the _best_ solution. Where are
all the arguments pro hex? Vanished into dust, as there
are too many unsolved problems?

I'll try to find out where the error in my thoughts may
be buried...

wolfgang kern
Apr 27, 2003, 7:05:46 AM

Bernhard wrote:
[remarkable]
| Have a look at this one! They don't deliver goods to the
| USA and Britain...
| <http://www.mayr-tools.com/>
| Seen while looking for a translation of "Dampfstrahler"!
I'll have a look afterwards; 'steam-cleaner'? (Kärcher)

| [new processor features]
| >| As I said long ago - I should have kept the 68k line
| >| straight on...
| >Yeah, and I'm really sad for the Zilogs didn't make it for PCs.
| Don't forget the TMS 9900! The TI 99 was a nice machine,
| not too good for real work, but the game cartridges were
| offering very _sophisticated_ games. I remember, we were
| kind of dependend to beat the highscore of each other in
| this "Munchman" game (a "Pacman" clone, much better than
| Atari's original "Pacman")...

Yes, it had an interesting instruction-set.

[a mystery called KESYS] ;)

[....]


| As long as you check these things yourself... ;)

It's a good feeling to see almost everything...

| Reading with care... is your code stored in a ROM, or is
| it loaded from media into RAM prior to the execution? It
| sounds like a ROM based thing...

Self-modifying code won't do very well in ROMs...
Code and programs are stored on a hard-disk, of course...

But right,
the 'grandmother' of KESYS is a selfmade (1981)
ROM-based Eprom-burner with two Z-80 CPUs,
1Mb static battery-powered RAM, colour-TV-(SVGA)card,
C100 tape-drive, 5.25 FD, RS232, 128-bit parallel port,...
My first disassembler for different CPUs was born there.
(I also wrote a BASIC for it, but rarely used it.)

I still use the equipment for RD/WR ROMs, GALs, etc.
but now it is remote-controlled from my PC,
so I got all data on the harddisk and I don't
have to lift my arse when changing tools.

| I will have a look at the AMD optimisation guide - there
| are some examples how to save clock cycles...

Yes there are many examples in the appendix,
unfortunately nothing for larger operands.

| [another try...]


| >| Probably you did not read my additional reply, where
| >| I told you, that there's a bug in the routine... ;)
| >I read it, but I didn't use your coding,

| > I just took the idea...
| Thief! :-D

Guilty!

| Don't take it serious in any way - thinking of all prior
| statements about copyrights - just to "share" some funny
| thoughts which came to my mind while reading... ;)

| >| Oh no ... I throw away my PC and get a Mac instead!
| >The method wont be faster there...

| With numbers in the right order, real BCD support and an
| "used apple" as logo? Don't believe it...

The 68000 BCD support is not what you really want...

| What the heck is "vacation"?

That's what I call the time when the PC is powered down.

| [less serious stuff]


| >| Then - what took you that long?
| >Don't know, perhaps day-dreaming of a "perfect CPU" ?
| From the new company "Nobody"?

I still have the ALT.OS-CPU in my mind,
needs a lot of time "and" money to see it once alive.

[typing speed]


| >No, a former office clerk was near the world-record.
| What should it be good for?
| No one can speak that fast!

..except my former mother-in-law!

| >| I was wondering about the "good" result... ;)
| >Me too.
| But I was the first who published it... :-P

Sure, the rights on this are yours :)

[the buggy missile]


| Poor me - they will strangle the author...

So you should use your time best for the remaining 20 years :)

| >No, Hervey is a NG-fellow from "Netherlands",
| >but yes, invisible to me.
| That is - killfiled? ;)

Hope he don't kill my files after the 'Dutch-joke'.

[task-switches]


| Do not nail me to a wall. I never wasted a thought about
| this stuff, I just assumed it...

I wouldn't do that anyway, sounds like heavy work... :)

[OS/2]


| >Yes, "save" but restricted...
| Which is the _opposite_ of unrestricted and insecure? As
| Windows 9x tends to be?

Seems to be that way...

| >| >["pippifax"...]


| >I remember another meaning in some nasty kid-rhymes.
| >As I was 8 years old I spent some time in UK.

| Sorry. Another meaning of what? Pippifax, Monthy Python,
| British origin, Santa Claus, pipi, wee wee???? And there
| still is Astrid Lindgren's _Pippi_ Langstrumpf... ??? ;)

"My 'pippy' easy 'f***s'
and give me some relax..."
and the kids rhyme ends with "..right after pippy f***s"

| >| >[terminated: 'modern'-joke]
| >| Killer! ;)
| >Oh no, I just had no more arguments....
| So - I win this one? ;)

Sure, take this point in your account... :)

| >| >[Application manuals...]
| >| >| ...."ghost"-writer?
| >| >Must be an identical clone, but then he will hate it also.
| >| With modern genetic manipulation techniques we might
| >| be able to eliminate the "hate component"... ;)
| >So I wait for completion...

| Perhaps they need some of your cells, first? ;)

I already lost many almost everywhere...

| >| I have about 30 Kbyte of very old code to check, how
| >| much € would you take? :-D

| >[I cannot type an Euro-sign with my OE, ALT-128="Ç"]

| Don't be sad! In OS/2, the € sign also doesn't exist for
| more than 3 or 4 years now - before it must have been Ç,
| either (which _is_ ASCII 128). I just use the € from my
| keyboard, and it looks like € (Mozilla doesn't know this
| € sign, as well)... ;)
| (An advantage of an OS which is used in banks and insur-
| ances...)

All of my keyboards are US-style and I never used German
layouts [where are the square brackets there?
and I run crazy with locked caps and three-finger keys]
I'm able to convince my OE to show a Euro sign,
copied from clipboard: € € € € € € € € € € € €

| >To check what CPU and roughly what's about: 300.-
| >Full analysis (documented asm source) : 2000,-
| > this is an offer on a private base,
| > don't tell this to my sales-agent :)

| Wow! I think I should look for some more code, so I save
| some shipping costs... ;)

I hope you can hold the amount within a certain range...

[trailing zeros ]


| >"Des is a weng zweng!"
| "Meah is ned drin in mei'n Geldbeid'l, nua massig hoasse
| Luft!"

"Des is a wäu meah wia in mein, meins is grod gauns floch!"

| Zero remains nothing at all... (Jethro Tull ???)

Except for the space it occupies to show it.

| [copyrights]

| >| So there's a problem with the MUL routine. Myself is
| >| an old chaotic freedom fighter, giving code away for
| >| free. Wouldn't be a fair thing, if I steal your idea
| >| and share it with everybody for nothing...

| >All code-hints I publish in NGs are for free anyway,
| >including bugs..., unless noted otherwise.

| So you don't copyright your bugs???

Should I? Perhaps as a joke collection...

| >| See above about "copyright" considerations! (If I am
| >| not very "serious" in most cases - this one's an ex-
| >| ception. It is a serious issue!)

| >I don't care for the copyright for this few bytes,
| >it may help some programs to get a touch of useful code.
| >The NG-tracking files will show who published it first,
| >so nobody else can claim a copyright protection for it.
| >And every experienced programmer will "owe" a similar routine anyway.
| >But you're right,
| > a working fast DIV-solution will ask for 'some' hours.
| >As we only share the theory and our code-solutions will be quite
| > different due completely different coding ways and environments,
| >(my final code wont run on your OS and vice versa)
| >I see no problem with copyrights.

| Not as _trivial_ as it seems. In a few words: I wouldn't
| like it too much, if I give my code away for _free_, and
| someone else takes it and sells slightly modified copies
| of it. Would be contrary to my intention, that I want to
| support people with apps who can't afford the commercial
| (too expensive) stuff. That's the reason why I copyright
| my software - as long as derived stuff is given away for
| free, it's okay. I just see my work as property of every
| being on this planet...

Of course you should not publish a final solution
in the form of 'usable source code' if you like to
copyright it first.

----------------------------------------------------------


| ["The DIV" - nightmare of mine]

| >| Which grows to "unsolvable" problems? While there's
| >| no DIV, we can't do a divide. A big disadvantage, if
| >| we use hexadecimals instead of BCDs.

| >| A wonderful life it was - there were no sins and the
| >| earth was peaceful and nice ... but now we have this
| >| darn hexadecimals to handle...

| >Where is a problem with HEX which isn't also with BCD?

| For me, it is! I have to think backwards, in hexadecimal
| numbers which "represent" decimal numbers, but cannot be
| handled the same way... ;)

| Seriously: This is really complicated stuff for me. As I
| coded my BCD calculator, I had a concept (or imagination
| how it should work). At the moment - I have none. I have
| to get in touch with this reversed hexadecimals. Not the
| big problem, if they fit into a register - overwhelming,
| if they are that large. The DIV adds multiple problems -
| I probably have to think about it for a _longer_ time to
| get the clue behind it. In the end, it's a logical task.
| I just have to find a solution, how to split it into se-
| veral sub-tasks I _can_ understand. May take some time.

I think it's not the reversed order which is confusing you,
perhaps it's a missing 'feeling' for what binaries are.

Actually it does not matter whether an integer is BCD or binary coded.
All SUB/ADD/MUL/DIV work the same, except carry-overs will
take place at other "values".

For binary strings don't think in terms of digits or nibbles.
See it the same way as the CPU does, in dword quantities.

FFFFFFFF + 00000001 = 100000000 the carry-over at dword bounds (0..4G),
09 + 01 = 0A with BCDs (0...9).
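
In C terms the whole ADD is just this (a sketch, the limb count n is
an assumption) - nothing but the ADC loop over dwords:

#include <stdint.h>

/* a += b; both are little-endian arrays of n dwords; returns final carry */
uint32_t big_add(uint32_t *a, const uint32_t *b, int n)
{
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;
        a[i]  = (uint32_t)s;             /* low 32 bits                   */
        carry = s >> 32;                 /* carry-over at 2^32, not at 10 */
    }
    return (uint32_t)carry;
}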

| >| >I'll modify my CMP/SUB divide to work on dwords,
| >| >and I also will try to use the DIV instruction using quad-words.
| >| >If you can reverse your MUL to become a 'nibble=SUB_LUT_DIV'
| >| >we can compare the timing of the three variants.

| >| And then throw it away... ;) ;) ;) (= "hahaha")
| >Just the slowest two versions will enter the trash-can. :)
| Technical knock-out in round 10! ;)

Lets see...

| >| (1 / 2) = ?
| >|
| >| And now? We have to multiply OP1 by 10. Not that big
| >| problem? Then - what about
| >|
| >| (1,000,000,000 / 2,000,000,000) = ?
| >
| >| In this case we have to multiply OP1 by this "small"
| >| number: "10,000,000,000"! Thus, we need a table with
| >| 65536 entries a 128 byte = 8,388,608 byte. Obviously
| >| not the thing I wanted.

| >I see this different:
| >the largest integer from 112 bytes produce 269 decimal digits,
| >all you need is a table with 269 power of ten binaries,
| >112 (or 128) bytes each. This are just 269*128 = 34432 bytes.

| >10^1 = 0A
| >10^2 = 064
| >10^3 = 03E8
| >....
| >10^269 = (112 byte hex-string)

| What about
|
| ((C.1E3C... E-0xD002) / (1.2345... E-0xD02E)) or
| ((7.3425... E 0x4F00) / (A.CECF... E 0x4F2F)) ?

| Still covered by the 269 entries in the table?

Yes, if 112 bytes is the maximum result precision.
But a decimal point cannot occur,
the mantissa will be an integer in all cases.

| >This table can be used for BIN->ASCII conversion also.
| >And if you extend this table to a semi-log LUT
| > (9*34432 = 309888 bytes in total)
| >1*10^1 2*10^1 3*10^1 .... 9*10^1
| >.....
| >1*10^268 ..... 8*10^268 9*10^268
| >you get an easy decimal-digit-LUT for binary figures.
| Which would be essential for the calculator...

Yes, especially for display.

| >| Next problem.
| >| To "normalize" both operands (get the first digit at
| >| the same byte position in the buffer), a hex-dec-hex
| >| conversions is necessary. How can we do that w/o an
| >| already existing DIV? Is this extra work really the
| >| thing I wanted?

| >Normalisation is needed on SUB/ADD with different exponents only.
| >No, not a conversion,
| > just a comparision in the 10^x LUT is needed to find
| >a factor which produces the maximal integer precision.

| Probably I mismatch something, but if I shift an operand
| right, I have to _multiply_ it with a power of ten, if I
| shift it left, I have to divide it by a power of ten (or
| multiply with a negative power). No problem, if a number
| is base 10. But I would have assumed, that I need a con-
| version first, before I can shift it, then do the shift,
| finally convert it back to hex...

Negative "shifts" actually effects the exponent value only.
Just by subtract.

We need to multiply the dividend by an 10^x if the difference
of the operands mantissa values is less than desired result precision.


| Example:
| OP1 EXP 0000 NUM 0A 00 00 00 ( 10)
| /
| OP2 EXP 0000 NUM 00 01 00 00 ( 256)
| Now - we need to shift it 2 digits up:

you multiplied with 256 !


| OP1 EXP 0000 NUM 00 0A 00 00 (2 560)
| /
| OP2 EXP 0000 NUM 00 01 00 00 ( 256)

| Thus, we get a result of 0x0A without remainder.
| This is| not the true result (should be 0.0390625).

As you multiplied the dividend with 256 (SHL one byte),
you need to correct the result by shift right one byte:

this would result in a 00.0A ,
which is correct (one byte precise) in terms of 2^n notation
.00001010 = 2^-5 + 2^-7 = 0.03125 + 0.0078125 = 0.0390625

Ok, even correct this is not what we wanted.
The result shall be an integer.

So when you 'shift' the dividend left,
you actually multiply it by a power of two!
The problem then will be to adjust our 10^x exponent accordingly.

Therefore we replace the "shift" with a 10^x multiply
and set the result precision (in decimal digits) there.

your example:


OP1 EXP 0000 NUM 0A 00 00 00 (10)
/
OP2 EXP 0000 NUM 00 01 00 00 (256)

choose any method to determine the result precision (see pt.2 below),
let's assume we got 10^8 (7 decimal digits desired).

LUT 10^8 = 05F5E100
mul OP1,10^8
OP1 EXP 0000 NUM 00 CA 9A 3B 00 (10^9)
/
OP2 EXP 0000 NUM 00 01 00 00 (256)

[we now could reduce the trailing zeros also,
but that's another story]

result after divide
exponent = (dividend exponent) - (divisor exponent) - 8
           0 - 0 - 8 = -8
BUF EXP FFF8 NUM CA 9A 3B 00 (3906250 E-8 = 0.03906250)

this is one digit more than needed, but that's what we asked for.
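
Or the same thing squeezed into plain 64-bit C, just to check the
numbers (nothing to do with the real buffer code, everything here is
only for this small example):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t op1 = 10, op2 = 256;          /* both stored with exponent 0 */
    int      scale = 8;                    /* scaled by 10^8, as above    */
    uint64_t p10   = 100000000ULL;         /* 10^8, taken from the LUT    */

    uint64_t mant  = op1 * p10 / op2;      /* 1000000000 / 256 = 3906250  */
    int      exp10 = 0 - 0 - scale;        /* result exponent = -8        */

    printf("%llu E%d\n", (unsigned long long)mant, exp10);  /* 3906250 E-8 */
    return 0;
}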


| The shifts we did must be recognized somewhere.
| The shifting of OP1 is one of the _most_ used operations
| in the DIV routine, so we have to do it very often...

Forget about shifts here.....
Compare and subtract the same way you would do with a BCD-string,
except you loop dwords rather than nibbles.
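
As a sketch in C (names invented, limbs are little-endian dwords), the
CMP and the SUB you need for it look like this:

#include <stdint.h>

/* compare two n-dword numbers: -1, 0, +1 (most significant dword first) */
int big_cmp(const uint32_t *a, const uint32_t *b, int n)
{
    for (int i = n - 1; i >= 0; i--)
        if (a[i] != b[i]) return a[i] > b[i] ? 1 : -1;
    return 0;
}

/* a -= b; returns the final borrow (0 if a was >= b) - the SBB loop */
uint32_t big_sub(uint32_t *a, const uint32_t *b, int n)
{
    uint64_t borrow = 0;
    for (int i = 0; i < n; i++) {
        uint64_t d = (uint64_t)a[i] - b[i] - borrow;
        a[i]   = (uint32_t)d;
        borrow = (d >> 32) & 1;            /* 1 if we wrapped below zero */
    }
    return (uint32_t)borrow;
}

Exactly the loops you already have for the BCD string, only the borrow
happens at 2^32 instead of at 10.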

| >| But I want to have an _accurate_ calculator! So this
| >| hexadecimal solution is _not_ a solution at all.

See above!

| >As long mantissa is integer (decimal representable)
| >it does not matter if it's HEX(binary) or BCD.
| >The binary way allow dword handling and full use of the
| >CPU's capabilities, while BCD needs nibble- or byte-
| >oriented handling and the ">9" correcting stuff.

| >The only important thing in our quest is the exponent
| > to hold 10^x values rather than 2^x ones.

| See example... ;)
Yeah shifts, still stuck to BCD? :)

| >| Arguments?
| >please don't panic, it's not _that_ worse... ;)
| If I could believe it... ;)


[the 1/x example]


| Probably the result should be 1.00000006... ;)

Exact! I'm already bound to the 1/x problem :)

Again:


Example (JUST 32 BITS for easier explanation):
it shall do:

DIV ... 00FFFFFE by .. 00FFFFFF both exponents 0 or equal yet
16777214 / 16777215 = 0.99999994
------------------------------------------------


1. set result exponent to dividends minus divisors exponent

is zero yet

2. determine adjustment factor
either:
CMP-scan DIVISOR in 10^x LUT
00 3B9ACA00 > 00FFFFFF >= 05F5E100
decimal 10^9 10^8
GOT >=10^8
or:
use the MSBit-number to determine decimal power.
[MSBit-NR. * 0.30103] = power of ten value.
or:
determine power10 of both operands and use the difference.
or:
decide per definition how many decimal digits are desired.

(I use 8 decimal digits yet)

3. MUL DIVIDEND, 10^8 00FF FFFE * 05F5E100 = 5F5E0F4143E00
SUB result exponent,8
4. NOW DIVide
5F5E0F4143E00 / 00FFFFFF = 05F5E0FA
=99999994 E-8
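
(A quick check in C, 64 bits are enough for this 32-bit example;
again just an illustration, not the buffer code:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t dividend = 0xFFFFFEULL;                /* 16777214           */
    uint64_t divisor  = 0xFFFFFFULL;                /* 16777215           */
    uint64_t q = dividend * 100000000ULL / divisor; /* scale by 10^8, DIV */

    printf("%llX = %llu E-8\n",
           (unsigned long long)q, (unsigned long long)q);
    /* prints 5F5E0FA = 99999994 E-8, i.e. 0.99999994 */
    return 0;
}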


-------------------------------------------
| >If the operands exponents are different
| >then add the difference to the adjustment (MUL and DEC result EXP).
| >------------------------------------
| >Seems to be easy that way.....


| -------------------------------------------------
| Which would be as easy as (since OP1 > OP2):
| -------------------------------------------------
| OP1 00 03 00 00 FF FF FF 00 00 00 00 00 00 00 00
| OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 00 00
|
| RES 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
| ---------------------------------------------------------
| after 1st SUB
| -------------------------------------------------
| OP1 00 01 00 00 01 00 00 00 00 00 00 00 00 .....
| OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....
|
| RES 00 7F 00 00 ..... 00 00 00 00 00 00 00 00 10
| -------------------------------------------------

why does this produce a 2^(127+4) ??
FFFFFF/FFFFFE = 1 ;rem 1


| And now we should shift OP1 six digits up, where we meet
| our well known problem again...
| shift 7 up (too small, so we do a further shift)
| [maybe I should use a base 16 shift?]

Why shift?
multiply by 10^8 (or the next fitting) instead


| -------------------------------------------------
| OP1 00 04 00 00 00 E1 F5 05 00 00 00 00 00 .....
| OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....
|
| RES 00 7F FF F8 ..... 00 00 00 00 00 00 00 00 10
| -------------------------------------------------

the 05F5E100 are 10^8
the partial result of the CMP/SUB is 1 * 10^8
and the remainder is 00FFFFFF*05F5E100 - 1*05F5E100


| after SUB:
| -------------------------------------------------
| OP1 00 03 00 00 0A E1 F5 00 00 00 00 00 00 .....
| OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....
|
| RES 00 7F FF F8 ..... 00 00 00 00 05 00 00 00 10
| -------------------------------------------------
| shift 1 up

you mean * 0A

No, the subtractions are at least eight times faster.
Again I don't see any reason for conversions (except display).

| BTW - my hex example throws the same result - but should
| be hexadecimal rather than decimal (as it is!)...

Perhaps you treat the binaries like decimal digits?
You actually count the subtractions and store it as BCD,
good for immediate output...

| What I like the most - I check all of the operations I'm
| doing (in concurence to this hex solution, where I start
| to figure out "What am I doing?" all the time).

| I am still insecure about the _best_ solution. Where are
| all the arguments pro hex? Vanished into dust, as there
| are too many unsolved problems?

The pro-hex arguments are dword operations
and no need for the 99/66h carry-over corrections ...


| I'll try to find out, where the error in my thoughts may

| be buried...

I think you still see hexadecimals
as "digits" instead of "consecutive-value" binary strings.
You can combine all your "shifts" into one single preceding
multiplication.
---------------------------------

__
wolfgang

bv_schornak
Apr 27, 2003, 4:12:51 PM
wolfgang kern wrote:


[remarkable]

>| Have a look at this one! They don't deliver goods to the
>| USA and Britain...
>| <http://www.mayr-tools.com/>
>| Seen while looking for a translation of "Dampfstrahler"!
>I'll view after, 'steam-cleaner'? (Kärcher)
>

I found "hot water high pressure cleaner"... But I meant
something else with remarkable... ;)

>| [new processor features]
>| >| As I said long ago - I should have kept the 68k line
>| >| straight on...
>| >Yeah, and I'm really sad for the Zilogs didn't make it for PCs.
>| Don't forget the TMS 9900! The TI 99 was a nice machine,
>| not too good for real work, but the game cartridges were
>| offering very _sophisticated_ games. I remember, we were
>| kind of dependend to beat the highscore of each other in
>| this "Munchman" game (a "Pacman" clone, much better than
>| Atari's original "Pacman")...
>Yes, it had an interesting instruction-set.
>

Yep - but the least "sophisticated" processor did win in
the end... :(

> [a mistery called KESYS] ;)
>[....]
>| As long as you check these things yourself... ;)
>
>It's a good feeling to see almost everything...
>

You win this one... ;)

>| Reading with care... is your code stored in a ROM, or is
>| it loaded from media into RAM prior to the execution? It
>| sounds like a ROM based thing...
>
>Self-modifying code wont do very well in ROMs...
>Code and programs are stored on a hard-disk, of course...
>

I see. I was thinking of embedded systems...

>But right,
>the 'grandmother' of KESYS is a selfmade (1981)
>ROM-based Eprom-burner with two Z-80 CPUs,
>1Mb static battery-powered RAM, colour-TV-(SVGA)card,
>C100 tape-drive, 5.25 FD, RS232, 128-bit parallel port,...
>My first disassembler for different CPUs was born there.
>(I also wrote a BASIC for it, but rarely used it.)
>
>I still use the equipment for RD/WR ROMs, GALs, etc.
>but now it is remote-controlled from my PC,
>so I got all data on the harddisk and I don't
> have to lift my arse when changing tools.
>

2:0... Found nothing to top that! ;)

>| I will have a look at the AMD optimisation guide - there
>| are some examples how to save clock cycles...
>
>Yes there are many examples in the appendix,
>unfortunately nothing for larger operands.
>

Waiting for the Opteron (super, as c't wrote)...

>| [another try...]
>| >| Probably you did not read my additional reply, where
>| >| I told you, that there's a bug in the routine... ;)
>| >I read it, but I didn't use your coding,
>| > I just took the idea...
>| Thief! :-D
>
>Guilty!
>

Oh no ... it really is gone?

>| With numbers in the right order, real BCD support and an
>| "used apple" as logo? Don't believe it...
>
>The 68000 BCD support is not what you really want...
>

I did like it, as I was young and dumb ... Nowadays I am
no more young... ;)

>| What the heck is "vacation"?
>
>That's what I call the time when the PC is powered down.
>

Ah! Working time! But - it still is Sunday!

>| [less serious stuff]
>| >| Then - what took you that long?
>| >Don't know, perhaps day-dreaming of a "perfect CPU" ?
>| From the new company "Nobody"?
>
>I still have the ALT.OS-CPU in my mind,
>needs a lot of time "and" money to see it once alive.
>

Never heard of it...

>[typing speed]
>| >No, a former office clerk was near the world-record.
>| What should it be good for?
>| No one can speak that fast!
>

>...except my former mother in law!
>

Yikes... ;)

>| >| I was wondering about the "good" result... ;)
>| >Me too.
>| But I was the first who published it... :-P
>
>Sure, the rigths on this are yours :)
>

I don't want to have them (I would have preferred faster
execution times instead)...

>[the buggy missile]
>| Poor me - they will strangle the author...
>
>So you should use your time best for the remaining 20 years :)
>

To look for a tree with strong arms? ;)

>| >No, Hervey is a NG-fellow from "Netherlands",
>| >but yes, invisible to me.
>| That is - killfiled? ;)
>Hope he don't kill my files after the 'Dutch-joke'.
>

Would prove no humour at all...

>[task-switches]
>| Do not nail me to a wall. I never wasted a thought about
>| this stuff, I just assumed it...
>
>I wouldn't do that anyway, sounds like heavy work... :)
>

You probably have a robot from "KUKA" to hold the hammer
(controlled from your PC, no reason to leave your luxury
seat)... ;)

>[OS/2]
>| >Yes, "save" but restricted...
>| Which is the _opposite_ of unrestricted and insecure? As
>| Windows 9x tends to be?
>
>Seems to be that way...
>

But they made million$$ with this "software"...

>| >| >["pippifax"...]
>| >I remember another meaning in some nasty kid-rhymes.
>| >As I was 8 years old I spent some time in UK.
>
>| Sorry. Another meaning of what? Pippifax, Monthy Python,
>| British origin, Santa Claus, pipi, wee wee???? And there
>| still is Astrid Lindgren's _Pippi_ Langstrumpf... ??? ;)
>
>"My 'pippy' easy 'f***s'
> and give me some relax..."
>and the kids rhyme ends with "..right after pippy f***s"
>

At the tender age of 8??? ;)

>| >| >[terminated: 'modern'-joke]
>| >| Killer! ;)
>| >Oh no, I just had no more arguments....
>| So - I win this one? ;)
>Sure, take this point in your account... :)
>

Too late... ;)


[cloning, cells, etc.]

>| Perhaps they need some of your cells, first? ;)
>I already lost many almost everywhere...
>

The "Boris Becker" syndrom? ;)


[the € symbol]

>All of my keyboards are US-style and I never used German
> layouts [where are the rectangle brackets there?
>

[ -> AltGr + 8
] -> AltGr + 9

> and I run crazy with locked caps and three finger keys]
>I'm able to convince my OE to show an EURO-sign,
>copied from clipboard: € € € € € € € € € € € €
>

So - Windows still needs the help of a real OS... ;)

>| >To check what CPU and roughly what's about: 300.-
>| >Full analysis (documented asm source) : 2000,-
>| > this is an offer on a private base,
>| > don't tell this to my sales-agent :)
>
>| Wow! I think I should look for some more code, so I save
>| some shipping costs... ;)
>
>I hope you can hold the amount within a certain range...
>

Maybe I should collect the code of some others, too? ;)


[Mundart - our impolite slang corner]

>[trailing zeros ]
>| >"Des is a weng zweng!"
>| "Meah is ned drin in mei'n Geldbeid'l, nua massig hoasse
>| Luft!"
>"Des is a wäu meah wia in mein, meins is grod gauns floch!"
>

"Do legst di nieda - daa brauchatst dös Teil eigntlich a
goar need..."

>| Zero remains nothing at all... (Jethro Tull ???)
>Except for the space it occupies to show it.
>

So the mathematical definition isn't _that_ true...


[copyrights]

>| >All code-hints I publish in NGs are for free anyway,
>| >including bugs..., unless noted otherwise.
>
>| So you don't copyright your bugs???
>
>
>Should I? Perhaps as a joke collection...
>

Why? Take Microsoft as a shining example - and sell 'em!

>| Not as _trivial_ as it seems. In a few words: I wouldn't
>| like it too much, if I give my code away for _free_, and
>| someone else takes it and sells slightly modified copies
>| of it. Would be contrary to my intention, that I want to
>| support people with apps who can't afford the commercial
>| (too expensive) stuff. That's the reason why I copyright
>| my software - as long as derived stuff is given away for
>| free, it's okay. I just see my work as property of every
>| being on this planet...
>
>Of course you should not publish a final solution
>in the form of 'usable source code' if you like to
>copyright it first.
>

Stored! I will have it in mind...


[DIV - and no end... ;) ]

>I think it's not the reversed order which is confusing you,
>perhaps it's a missing clue for the 'feeling' what binaries are.
>
>Actually it does not matter if an integer is BCD or binary coded.
>All SUB/ADD/MUL/DIV works the same, except carry-overs will
>take place at other "values".
>
>For binary strings don't think in terms of digits or nibbles.
>See it the same way as the CPU, in dword quantities.
>
>FFFFFFFF + 00000001 = 100000000 the carry-over at dword bounds (0..4G),
> 09 + 01 = 0A with BCDs (0...9).
>

Ok, I think I'm smarter than any CPU, because I can read
hex - a CPU only "sees" bits. Nevertheless, the problems
remain... ;)

>| >| And then throw it away... ;) ;) ;) (= "hahaha")
>| >Just the slowest two versions will enter the trash-can. :)
>| Technical knock-out in round 10! ;)
>
>Lets see...
>

Why should I start fighting against somebody like Arnold
Schwarzenegger? I _know_ that he can't win! (Muscles vs.
brain...)

>| >I see this different:
>| >the largest integer from 112 bytes produce 269 decimal digits,
>| >all you need is a table with 269 power of ten binaries,
>| >112 (or 128) bytes each. This are just 269*128 = 34432 bytes.
>
>| >10^1 = 0A
>| >10^2 = 064
>| >10^3 = 03E8
>| >....
>| >10^269 = (112 byte hex-string)
>
>| What about
>|
>| ((C.1E3C... E-0xD002) / (1.2345... E-0xD02E)) or
>| ((7.3425... E 0x4F00) / (A.CECF... E 0x4F2F)) ?
>
>| Still covered by the 269 entries in the table?
>
>Yes, if 112 bytes is the maximum result precision.
>But a decimal point cannot occur,
>mantissa will be integer in all cases.
>

Sure, but what about the difference between 10 E 1437
and 10 E -1437? All of them are handled by the powers
10 E 0 ... 10 E 269 ???

These are quite different hexadecimal patterns, as far
as I can remember...

>| >This table can be used for BIN->ASCII conversion also.
>| >And if you extend this table to a semi-log LUT
>| > (9*34432 = 309888 bytes in total)
>| >1*10^1 2*10^1 3*10^1 .... 9*10^1
>| >.....
>| >1*10^268 ..... 8*10^268 9*10^268
>| >you get an easy decimal-digit-LUT for binary figures.
>| Which would be essential for the calculator...
>
>Yes, especially for display.
>

Ok! Still my question: "Where is 10 E -3245 ?"...

>| Probably I mismatch something, but if I shift an operand
>| right, I have to _multiply_ it with a power of ten, if I
>| shift it left, I have to divide it by a power of ten (or
>| multiply with a negative power). No problem, if a number
>| is base 10. But I would have assumed, that I need a con-
>| version first, before I can shift it, then do the shift,
>| finally convert it back to hex...
>
>Negative "shifts" actually effects the exponent value only.
>Just by subtract.
>
>We need to multiply the dividend by an 10^x if the difference
>of the operands mantissa values is less than desired result precision.
>

Ok. As far as I can follow your words - see the examples. Sorry
about that, but I just don't get the clue at the moment.

>| Example:
>| OP1 EXP 0000 NUM 0A 00 00 00 ( 10)
>| /
>| OP2 EXP 0000 NUM 00 01 00 00 ( 256)
>| Now - we need to shift it 2 digits up:
>
>you multiplied with 256 !
>

Indeed. I shifted the 0x0A two digits up...

>| OP1 EXP 0000 NUM 00 0A 00 00 (2 560)
>| /
>| OP2 EXP 0000 NUM 00 01 00 00 ( 256)
>
>| Thus, we get a result of 0x0A without remainder.
>| This is| not the true result (should be 0.0390625).
>
>As you multiplied the dividend with 256 (SHL one byte),
> you need to correct the result by shift right one byte:
>
>this would result in a 00.0A ,
> which is correct (one byte precise) in terms of 2^n notation
> .00001010 = 2^-5 + 2^-7 = 0.03125 + 0.0078125 = 0.0390625
>

Got this one!

>Ok, even correct this is not what we wanted.
>The result shall be an integer.
>
>So, you should shift left the dividend,
>you actually multiplied by a power of two!
>The problem then will be to adjust our 10^x exponent accordingly.
>
>Therefore we replace the "shift" with a 10^x multiply
>and set the result precision (in decimal digits) there.
>

Where the precision is always 112 bytes, or the number of
digits we need for an accurate result - which depends on
the result itself - an unpredictable value of (112 - x)!

>your example:
>OP1 EXP 0000 NUM 0A 00 00 00 (10)
>/
>OP2 EXP 0000 NUM 00 01 00 00 (256)
>
>chose any method to determine result precision (see pt.2 below)
>lets assume we got 10^8 for (7 decimal digits desired).
>

Sorry - but how does the routine "know" this factor? See
my note above - it depends _only_ on the result. If this
were an endless sequence (e.g. 0.33333...), then it would
need at least 112 + 1 (for the rounding!) bytes in the
result buffer!

>LUT 10^8 = 05F5E100
>mul OP1,10^8
>OP1 EXP 0000 NUM 00 CA 9A 3B 00 (10^9)
>/
>OP2 EXP 0000 NUM 00 01 00 00 (256)
>[we now could reduce the trailing zeros also,
> but that's another story]
>
>result after divide
>exponent= (power10 of OP1) - (power10 of OP2) -8
> 0 - 0 -8 = -8
>BUF EXP FFF8 NUM CA 9A 3B 00 (3906250 E-8 = 0.03906250)
>
>this is one digit more than needed, but we asked for.
>
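
A tiny C sketch of this 10/256 example, shrunk to 64-bit
integers so it fits into native types (the real routine runs
on the 112-byte strings and its own 10^x table; the pow10u
helper and the variable names are only illustrative):

#include <stdio.h>
#include <stdint.h>

/* 10^x for small x - stands in for the 10^x lookup table */
static uint64_t pow10u(unsigned x)
{
    uint64_t p = 1;
    while (x--)
        p *= 10;
    return p;
}

int main(void)
{
    uint64_t op1 = 10, op2 = 256;  /* both with 10^x exponent 0 */
    int      e1  = 0,  e2  = 0;
    unsigned scale = 8;            /* 7 decimal digits desired  */

    uint64_t mant = (op1 * pow10u(scale)) / op2; /* 1e9 / 256  */
    int      exp  = e1 - e2 - (int)scale;        /* 0 - 0 - 8  */

    printf("%llu E%d\n", (unsigned long long)mant, exp);
    return 0;
}

This prints "3906250 E-8", i.e. the expected 0.0390625.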

Might work, as long as we use simple examples - but what
about random input with random sizes? The routine should
handle that, too!

>| The shifts we did must be recognized somewhere.
>| The shifting of OP1 is one of the _most_ used operations
>| in the DIV routine, so we have to do it very often...
>
>Forget about shifts here.....
>Compare and subtract the same way you would do with a BCD-string,
>except you loop dwords rather than nibbles.
>

The dword vs. nibbles thing isn't the problem!

>| >As long mantissa is integer (decimal representable)
>| >it does not matter if it's HEX(binary) or BCD.
>| >The binary way allow dword handling and full use of the
>| >CPU's capabilities, while BCD needs nibble- or byte-
>| >oriented handling and the ">9" correcting stuff.
>
>| >The only important thing in our quest is the exponent
>| > to hold 10^x values rather than 2^x ones.
>
>| See example... ;)
>Yeah shifts, still stuck to BCD? :)
>

Of course - I did not work with all this stuff until the
2nd of February this year! I've learned a lot of things
by now, but I don't have a clue about what I am doing at
the moment... :(

Ooops - did I forget to tell you, that the MSB is stored
at the rightmost position, so we will have enough "room"
for "surplus" digits? If there is a method to _know_ the
size of the result within a few clock cycles, I will use
it at once!

>| And now we should shift OP1 six digits up, where we meet
>| our well known problem again...
>| shift 7 up (too small, so we do a further shift)
>| [maybe I should use a base 16 shift?]
>
>Why shift?
> multiply by 10^8 (or the next fitting) instead
>

Do you really want to know, why I prefer 3.something clocks
over 350 clocks per processed byte?

>| -------------------------------------------------
>| OP1 00 04 00 00 00 E1 F5 05 00 00 00 00 00 .....
>| OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....
>|
>| RES 00 7F FF F8 ..... 00 00 00 00 00 00 00 00 10
>| -------------------------------------------------
>the 05F5E100 are 10^8
>the partial result of the CMP/SUB is 1 * 10^8
>and the remainder is 00FFFFFF*05F5E100 - 1*05F5E100
>

The "1" is the MSB (see above)!

>
>| after SUB:
>| -------------------------------------------------
>| OP1 00 03 00 00 0A E1 F5 00 00 00 00 00 00 .....
>| OP2 00 03 00 00 FE FF FF 00 00 00 00 00 00 .....
>|
>| RES 00 7F FF F8 ..... 00 00 00 00 05 00 00 00 10
>| -------------------------------------------------
>| shift 1 up
>you mean * 0A
>

I definitely meant "shift" (at least 100 times faster)!

What if the shift is done with only one change of the index
register? That takes 1 or 2 clock cycles. Where is the SUB
routine which handles 112 bytes within 0.25 clocks? ;)
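
What such a "shift" looks like in practice - a little C
sketch, assuming the operands are plain byte strings stored
LSB first (the names and the helper are only illustrative):

#include <stdint.h>

/* Compare op1 with (op2 shifted up by 'shift' digit positions) */
/* without moving a single byte - the shift is only an index    */
/* offset. Returns <0, 0, >0; n digits, LSB first; assumes the  */
/* top 'shift' digits of op2 are zero.                          */
int cmp_shifted(const uint8_t *op1, const uint8_t *op2,
                int n, int shift)
{
    int i;
    for (i = n - 1; i >= 0; i--) {
        uint8_t a = op1[i];
        uint8_t b = (i >= shift) ? op2[i - shift] : 0;
        if (a != b)
            return (a > b) ? 1 : -1;
    }
    return 0;
}

The subtraction can use the same offset trick, so "shifting"
really costs no more than loading an index.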

>| BTW - my hex example throws the same result - but should
>| be hexadecimal rather than decimal (as it is!)...
>
>Perhaps you treat the binaries like decimal digits?
>You actually count the subtractions and store it as BCD,
>good for immediate output...
>

Not BCD - I would store any 0x0A...0x0F as well!

So the 1st solution (in the 10/256 example) would be the
proper (and fastest) way...

>| What I like the most - I check all of the operations I'm
>| doing (in concurence to this hex solution, where I start
>| to figure out "What am I doing?" all the time).
>
>| I am still insecure about the _best_ solution. Where are
>| all the arguments pro hex? Vanished into dust, as there
>| are too many unsolved problems?
>
>The pro-hex are dword-operations
>and no need for 99/66h carry-over corrections ...
>

Wouldn't bother me, as long as I know how it works. You
may compare my current "thinking" a little bit with the
"apples and peaches" story - I count apples, but use the
peaches as the counter for the dozens. The problem is not
to mix up all those apples and peaches...

Calculate the _amount_ of clock cycles we would need for
the entire routine:

1. Determine the result size.
2. Clear at least 2 buffers for the multiply,
create 30 entries for the table, multiply.
3. Create 30 entries for the DIV table.
4. Do a subtraction.
5. Multiply (see step 2!) the resulting digit
with the proper power of 10 (which must be
loaded from another table).
6. Add this result to the DIV result.
7. Repeat step 4...6 - until OP1 is less than
1 * OP2. If OP1 still is not zero, repeat
from step 1, but take a larger factor (the
current result is garbage)...

I guess, it will take some thousands of clock cycles per
result digit (one multiply is 350 cycles per byte).

>| I'll try to find out, where the error in my thoughts may
>| be buried...
>
>I think you still see hexadecimals
>as "digits" instead of "consecutive-value"-binary strings.
>You can combine all your "shifts" to one preceding single
>multiplication.
>

Might be - but there's no such thing like the separation
of both. They belong to each other - valid for any other
base you may use, too. And as long as I am a human, I am
bound to the abilities of my brain. I may handle a deci-
mal number with 8 digits easier than a 2 digit hexadeci-
mal number, because I'm used to handle decimals for some
46 years now. Please do not forget, that I only have an
ordinary math knowledge, and I didn't think about _math_
with _hexadecimals_ before I started this thread (BTW, I
started it with BCD, not with hex; I wouldn't have taken
notice of a thread about hexadecimal math).

There's something missing in my thoughts - this stuff is
not _that_ complicated (as I see it at the moment). It's
like running against a wall, where the door is two steps
away. Someone has nailed a huge plank to my forehead. It
might be, that I did that myself... ;)


Thinking in progress ;)

Bernhard Schornak

wolfgang kern

unread,
Apr 29, 2003, 9:00:03 AM4/29/03
to

Bernhard wrote:

[sorry for the snip, short in time right now...]

[DIV - and no end... ;)]

| >| >the largest integer from 112 bytes produce 269 decimal digits,


| >| >all you need is a table with 269 power of ten binaries,
| >| >112 (or 128) bytes each. This are just 269*128 = 34432 bytes.

| >| >10^1 = 0A
| >| >10^2 = 064
| >| >10^3 = 03E8
| >| >....
| >| >10^269 = (112 byte hex-string)

| >| What about
| >| ((C.1E3C... E-0xD002) / (1.2345... E-0xD02E)) or
| >| ((7.3425... E 0x4F00) / (A.CECF... E 0x4F2F)) ?

| >| Still covered by the 269 entries in the table?

| >Yes, if 112 bytes is the maximum result precision.

| >But a decimal point cannot occur,


| >mantissa will be integer in all cases.

| Sure, but what about the difference between 10 E 1437
| and 10 E -1437? All of them are handled by the powers
| 10 E 0 ... 10 E 269 ???

YES.
The difference of the exponents will be reflected in the results exponent.
Only the mantissa are subject of a division,
the exponents are just subtracted:

1 E-1437 / 1 E+1437 = 1 E-2874 and

9876543210 E-12345 / 123456789 E+12345 = 9876543210/123456789 E-24690

but this last example would produce a non-integer mantissa =80.000000729..

therefore we multiply the 9876543210 (mantissa only!) by
10 to the power of (desired result precision) minus
(the difference of the two mantissa 10^x powers),
ie:
you wish 269 decimal digits precision (112 bytes integer)
dividend's mantissa power is 10 *)
divisor's mantissa power is 9
So the factor is 10^(269-(10-9)) = 10^268.

[*)determined by either scan-compare in the 10^x-LUT
or calculated by (most significant bit number) * (0.30103) ,
which is just a 32-bit integer story (see earlier posts on that)]

Now multiply the (binary) dividend with the (binary) 10^268 table entry.
[the product may be up to 224 bytes, but less here]

Now perform a (binary) division,
the (binary) result corresponds to a decimal integer with 269 decimal digits.

The 10^x exponent is calculated as:
(dividends exponent) minus (divisors exponent) minus (the "factor")
-12345
- 12345
--------
= -24690
- 268
--------
= -24958
=========
so the 112 bytes result is 80000000729...........
and the exponent is E-24958

This single multiplication avoids all 10^x-shifts in the
binary divide routine.
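
The bookkeeping part of this scheme fits into a few lines of
C; a sketch with the figures from the example above (digit
counts and exponents only - the big-number MUL and DIV them-
selves are left out, and all names are just illustrative):

#include <stdio.h>

int main(void)
{
    int precision = 269;  /* desired result precision, decimal digits */
    int d1 = 10;          /* decimal digits of the dividend mantissa  */
    int d2 =  9;          /* decimal digits of the divisor mantissa   */
    int e1 = -12345;      /* 10^x exponent of the dividend            */
    int e2 =  12345;      /* 10^x exponent of the divisor             */

    int factor = precision - (d1 - d2);  /* 269 - (10 - 9) = 268      */
    /* big-number part (not shown):                                   */
    /*   mant = (dividend_mantissa * table[factor]) / divisor_mantissa */
    int e_res  = e1 - e2 - factor;       /* -12345 - 12345 - 268      */

    printf("multiply dividend by 10^%d, result exponent E%d\n",
           factor, e_res);
    return 0;
}

It prints factor 268 and E-24958, matching the figures above.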


| Ok! Still my question: "Where is 10 E -3245 ?"...

Mantissa is "1" the exponent is 0xF353 ...
or "1 000 000 000" 0xF34A (-3254)...
or ......
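
(The exponent field is read here as a plain 16-bit two's-
complement value; a quick C check of the two encodings,
assuming the header really stores the exponent that way:)

#include <stdio.h>

int main(void)
{
    int e1 = 0xF353 - 0x10000;  /* -> -3245 */
    int e2 = 0xF34A - 0x10000;  /* -> -3254 */
    printf("%d %d\n", e1, e2);
    return 0;
}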

[snipped the confusing ... :) ]


| Calculate the _amount_ of clock cycles we would need for
| the entire routine:

Ok, I estimate:



| 1. Determine the result size.

neglected, a few ...

| 2. Clear at least 2 buffers for the multiply,

You could use just one result- and
one remainder(MODulo)-buffer:
[OP1mem]*[10^x entry]-> [rem-BUF]
SUB[rem-BUF],...
ADD [RESULT],...

counts for 4Kb included in next

| create 30 entries for the table, multiply.

13771 , or use the faster 64-bit MUL ..?

| 3. Create 30 entries for the DIV table.

13771 ,

| 4. Do a subtraction.

28*15

| 5. Multiply (see step 2!) the resulting digit
| with the proper power of 10 (which must be
| loaded from another table).

Only one MUL needed at start,
as you're doing a BINARY divide here,
don't care about 10^x yet.

| 6. Add this result to the DIV result.

28*15

| 7. Repeat step 4...6 - until OP1 is less than
| 1 * OP2.

up to 224 times (28+28)*15 = 188160
+~30000 ahead
add some loop/logic/carry
-------
220000
220000/112 = ~2000 per result-byte


| If OP1 still is not zero, repeat
| from step 1, but take a larger factor (the
| current result is garbage)...

No, already fits due the MUL on start...


| I guess, it will take some thousands of clock cycles per
| result digit (one multiply is 350 cycles per byte).

Yes "~2000", so the nibble-wise-LUT DIVide will be the slowest.
Even if you use the 64-bit-ALU MUL-version.

The 64-bit-ALU DIV needs about 600 cycles/byte
And the dword CMP/SUB division is fastest with about 500/byte.


| >| I'll try to find out, where the error in my thoughts may
| >| be buried...

| >I think you still see hexadecimals
| >as "digits" instead of "consecutive-value"-binary strings.
| >You can combine all your "shifts" to one preceding single
| >multiplication.

| Might be - but there's no such thing like the separation
| of both. They belong to each other - valid for any other
| base you may use, too. And as long as I am a human, I am
| bound to the abilities of my brain. I may handle a deci-
| mal number with 8 digits easier than a 2 digit hexadeci-
| mal number, because I'm used to handle decimals for some
| 46 years now. Please do not forget, that I only have an
| ordinary math knowledge, and I didn't think about _math_
| with _hexadecimals_ before I started this thread (BTW, I
| started it with BCD, not with hex; I wouldn't have taken
| notice of a thread about hexadecimal math).

| There's something missing in my thoughts - this stuff is
| not _that_ complicated (as I see it at the moment). It's
| like running against a wall, where the door is two steps
| away. Someone has nailed a huge plank to my forehead. It
| might be, that I did that myself... ;)

| Thinking in progress ;)

Don't think too complex, ...
I use my old TI-36x calculator (HEX/DEC up to 39 bits)
whenever I'm not sure about figures.

Even though your idea with the nibble-tables is not the fastest,
it will still be faster and shorter than any BCD-solution.

I already have to think twice when counting the coins
returned after I bought cigarettes... , damned shit decimals... :)

__
wolfgang


bv_schornak

unread,
May 1, 2003, 12:18:49 PM5/1/03
to
wolfgang kern wrote:

>[sorry for the snip, short in time right now...]
>

Poof! It became really hard to produce this nonsense,
anyway... ;)

[another try with DIV...]

>| Sure, but what about the difference between 10 E 1437
>| and 10 E -1437? All of them are handled by the powers
>| 10 E 0 ... 10 E 269 ???
>
>YES.
>The difference of the exponents will be reflected in the results exponent.
>Only the mantissa are subject of a division,
>the exponents are just subtracted:
>
>1 E-1437 / 1 E+1437 = 1 E-2874 and
>
>9876543210 E-12345 / 123456789 E+12345 = 9876543210/123456789 E-24690
>

I finally got it!

>but this last example would produce a non-integer mantissa =80.000000729..
>
>therefore we multiply the 9876543210 (mantissa only!) by
>10 to the power of (desired result precision) minus
> (the difference of the two mantissa 10^x powers),
> ie:
> you wish 269 decimal digits precision (112 bytes integer)
> dividend's mantissa power is 10 *)
> divisor's mantissa power is 9
>So the factor is 10^(269-(10-9)) = 10^268.
>

I should study this one with a few examples!

>[*)determined by either scan-compare in the 10^x-LUT
> or calculated by (most significant bit number) * (0.30103) ,
> which is just a 32-bit integer story (see earlier posts on that)]
>

Where the latter one looks like it would perform much
faster...

>Now multiply the (binary) dividend with the (binary) 10^268 table entry.
>[the product may be up to 224 bytes, but less here]
>
>Now perform a (binary) division,
>the (binary) result corresponds to a decimal integer with 269 decimal digits.
>
>The 10^x exponent is calculated as:
> (dividends exponent) minus (divisors exponent) minus (the "factor")
> -12345
> - 12345
> --------
>= -24690
> - 268
> --------
>= -24958
>=========
>so the 112 bytes result is 80000000729...........
>and the exponent is E-24958
>

Looks really good!

>This single multiplication avoids all 10^x-shifts in the
>binary divide routine.
>

Ok - I think I got it now. Thanks for your patience!

>| Calculate the _amount_ of clock cycles we would need for
>| the entire routine:
>
>Ok, I estimate:
>
>| 1. Determine the result size.
>
>neglected, a few ...
>

Ho - checked!

>| 2. Clear at least 2 buffers for the multiply,
>
>You could use just one result- and
> one remainder(MODulo)-buffer:
>[OP1mem]*[10^x entry]-> [rem-BUF]
> SUB[rem-BUF],...
> ADD [RESULT],...
>
>counts for 4Kb included in next
>
>| create 30 entries for the table, multiply.
>
>13771 , or use the faster 64-bit MUL ..?
>

My MUL without a table still is pending. Could be bit
faster than the table version, but will not beat your
"hardware" multiply...

>| 3. Create 30 entries for the DIV table.
>
>13771 ,
>

Might be, that the 10...F0 entries could be skipped?

>| 4. Do a subtraction.
>
>28*15
>

"Pippifax"! (Compared to the lengthy jobs...)

>| 5. Multiply (see step 2!) the resulting digit
>| with the proper power of 10 (which must be
>| loaded from another table).
>
>Only one MUL needed at start,
>as you're doing a BINARY divide here,
> don't care about 10^x yet.
>

Yippieeh!

>| 6. Add this result to the DIV result.
>
>28*15
>
>| 7. Repeat step 4...6 - until OP1 is less than
>| 1 * OP2.
>
>up to 224 times (28+28)*15 = 188160
> +~30000 ahead
>add some loop/logic/carry
> -------
> 220000
>220000/112 = ~2000 per result-byte
>

Hmmm - I don't like this result...

>| If OP1 still is not zero, repeat
>| from step 1, but take a larger factor (the
>| current result is garbage)...
>
>No, already fits due the MUL on start...
>
>| I guess, it will take some thousands of clock cycles per
>| result digit (one multiply is 350 cycles per byte).
>
>Yes "~2000", so the nibble-wise-LUT DIVide will be the slowest.
>Even if you use the 64-bit-ALU MUL-version.
>

Where there is some room for development!

>The 64-bit-ALU DIV needs about 600 cycles/byte
>And the dword CMP/SUB division is fastest with about 500/byte.
>

Sorry - I just don't get the point - what the heck is
CMP/SUB compared to the thing I had in mind?


[complex stuff needs simple thinking?]

>| There's something missing in my thoughts - this stuff is
>| not _that_ complicated (as I see it at the moment). It's
>| like running against a wall, where the door is two steps
>| away. Someone has nailed a huge plank to my forehead. It
>| might be, that I did that myself... ;)
>
>| Thinking in progress ;)
>
>Don't think too complex, ...
>I use my old TI-36x calculator (HEX/DEC up to 39 bits)
>whenever I'm not sure about figures.
>
>Even though your idea with the nibble-tables is not the fastest,
>it will still be faster and shorter than any BCD-solution.
>

Starting some serious work Friday evening!

>I already have to think twice when counting the coins
>returned after I bought cigarettes... , damned shit decimals... :)
>

Another case for your TI-36x? Take care - AFAIK, each
pack of cigarettes only contains a limited amount of
0x13 cigarettes (I prefer tobacco and papers)... ;)

wolfgang kern

unread,
May 3, 2003, 4:59:39 AM5/3/03
to

Bernhard wrote:

| >[sorry for the snip, short in time right now...]

| Poof! It became really hard to produce this nonsense, anyway... ;)

We may continue the jokes later on...
Actually I have to steal the time for posting right now,
I fed my guests with time-consuming holiday pictures
just to get a few minutes on the net.

| [another try with DIV...]

| >The difference of the exponents will be reflected in the results exponent.
| >Only the mantissa are subject of a division,
| >the exponents are just subtracted:

| I finally got it!

| >but this last example would produce a non-integer mantissa ...

| I should study this one with a few examples!

Let me know if you find figures which need extra adjustment,
I found problems only near the maximum value,
due 112 bytes (all bits set) are less than 0.9999..999 E+269.
Avoidable if you limit the 'valid' range to 1 E+268 (0.999..E+268).



| >[*)determined by either scan-compare in the 10^x-LUT
| > or calculated by (most significant bit number) * (0.30103) ,
| > which is just a 32-bit integer story (see earlier posts on that)]

| Where the latter one looks like it would perform much faster...

Yes, even the MSBit determination will be a (dw) loop.


| Looks really good!
| >This single multiplication avoids all 10^x-shifts in the
| >binary divide routine.
| Ok - I think I got it now. Thanks for your patience!

So finally I was able to spell out what I mean ...

[estimated clock-cycles]

| >... or use the faster 64-bit MUL ..?

| My MUL without a table still is pending. Could be bit
| faster than the table version, but will not beat your
| "hardware" multiply...

As you probably paid for your hardware, you may use it :)
Don't hesitate to use my MUL-64 example,
"my" MUL already mutated to 46 clock-cycles per result-byte,
using L1-cache lines and other tricks
which OS/2 and other common OS won't never allow.


| >| 3. Create 30 entries for the DIV table.
|

| Might be, that the 10...F0 entries could be skipped?

But then you would need 4-bit shifting?



| >220000/112 = ~2000 per result-byte

| Hmmm - I don't like this result...

It's really not that bad, compared to existing Divide-routines.

| >| If OP1 still is not zero,....

Then treat it as the remainder!



| >Yes "~2000", so the nibble-wise-LUT DIVide will be the slowest.
| >Even if you use the 64-bit-ALU MUL-version.

| Where there is some room for development!

| >The 64-bit-ALU DIV needs about 600 cycles/byte
| >And the dword CMP/SUB division is fastest with about 500/byte.
| Sorry - I just don't get the point - what the heck is
| CMP/SUB compared to the thing I had in mind?

The main differences are:
[dword CMP/SUB]
No table creation (the 10^x table is on disk).
Works with dword-CMP rather than on nibbles.
Subtract the divisor "at" and "with" dword-steps.

| [complex stuff needs simple thinking?]

Absolutely a Yes!
Cut the story down into "easy thinking, doubtfree" steps.



| Starting some serious work Friday evening!

Good luck!
or better Good Logic! :)

[..damned shit decimals... :)


| Another case for your TI-36x? Take care - AFAIK, each

| pack of cigarettes only contains a limited amount of
| 0x13 cigarettes (I prefer tobacco and papers)... ;)

This never happened in Austria (packet-size = 0x14 ),
price increase here will always fit the slot-machines by
proper rounding (up, never saw it down-rounded).
And while working on the PC I also 'roll' my cigarettes,
(with filters) otherwise I would smoke too much by far.

__
wolfgang

bv_schornak

unread,
May 6, 2003, 4:51:15 PM5/6/03
to
Hi Wolfgang!

Hope you survived!


[DIV, DIV and DIV... (stolen from M.P.)]

>| I should study this one with a few examples!
>
>Let me know if you find figures which need extra adjustment,
>I found problems only near the maximum value,
> due 112 bytes (all bits set) are less than 0.9999..999 E+269.
> Avoidable if you limit the 'valid' range to 1 E+268 (0.999..E+268).
>

In the end, E+256 would be enough, the surplus digits
were only thought to produce a higher accuracy and my
buffer is filled with entire paragraphs, if I use 112
bytes...

>| >[*)determined by either scan-compare in the 10^x-LUT
>| > or calculated by (most significant bit number) * (0.30103) ,
>| > which is just a 32-bit integer story (see earlier posts on that)]
>
>| Where the latter one looks like it would perform much faster...
>
>Yes, even the MSBit determination will be a (dw) loop.
>

For practical work:

1. Get MSB number (however we do it).
2. (MSB * 1000) / 3322 -> power of 10?

[3.322 is the rounded result of (1 / 0.30103).]
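
A throwaway C check of this integer approximation (the real
routine would of course do it in two or three instructions;
the loop and names are only for illustration):

#include <stdio.h>

int main(void)
{
    /* decimal power below a given most significant bit number, */
    /* using 3322 ~ 1000 / log10(2)                             */
    int bits[] = { 31, 895 };  /* 2^31, and the top bit of 112 bytes */
    int i;
    for (i = 0; i < 2; i++)
        printf("bit %3d -> 10^%d\n", bits[i], (bits[i] * 1000) / 3322);
    return 0;
}

It prints 10^9 for bit 31 and 10^269 for bit 895.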

How accurate is the factor 0.30103?

>| Ok - I think I got it now. Thanks for your patience!
>
>So finally I was able to spell what I mean ...
>

It seems! I still have some "minor" difficulties with
all this stuff, but I figure it out sooner or later!


[considerations...]

>| My MUL without a table still is pending. Could be bit
>| faster than the table version, but will not beat your
>| "hardware" multiply...
>
>As you probably paid for your hardware, you may use it :)
>Don't hesitate to use my MUL-64 example,
>"my" MUL already mutated to 46 clock-cycles per result-byte,
>using L1-cache lines and other tricks
> which OS/2 and other common OS won't never allow.
>

Beware - the mutants are coming... ;)

I still don't have decided to use your MUL - the easy
way isn't half as challenging as the harder ones.

>| >| 3. Create 30 entries for the DIV table.
>| Might be, that the 10...F0 entries could be skipped?
>
>But then you would need 4-bit shifting?
>

True. Unfortunately not the way to save some clocks!

>| >220000/112 = ~2000 per result-byte
>
>| Hmmm - I don't like this result...
>
>It's really not that bad, compared to existing Divide-routines.
>

Let's code it first... ;)

>| >| If OP1 still is not zero,....
>
>Then treat it as the remainder!
>

Might be skipped, if it already exceeded the LSB.

>| >The 64-bit-ALU DIV needs about 600 cycles/byte
>| >And the dword CMP/SUB division is fastest with about 500/byte.
>| Sorry - I just don't get the point - what the heck is
>| CMP/SUB compared to the thing I had in mind?
>
>The main differences are:
> [dword CMP/SUB]
> No table creation (the 10^x table is on disk).
> Works with dword-CMP rather than on nibbles.
> Subtract the divisor "at" and "with" dword-steps.
>

Sounds nice, but I still don't get, what I should SUB
then. The OP2 (shifted to match the MSB of OP1) or an
entry of the "10 E x"-table (which doesn't make sense
in my current imagination of the DIV routine)...

>| [complex stuff needs simple thinking?]
>
>Absolutely a Yes!
>Cut the story down into "easy thinking, doubtfree" steps.
>
>| Starting some serious work Friday evening!
>
>Good luck!
>or better Good Logic! :)
>

Thanks a lot!

I started to write a little testing range with my hex
dump (already available), so I may get a clue how the
single steps of the calculator work.

At the moment the entire calc only exists in my mind,
so it's quite hard to understand the one or other re-
lation. If I see something working step by step, then
I check it much better...

It should provide functions like clear operand, input
data, single step and complete execution - may take a
while, but it is important. The more complicated this
gets, the more I need to "see" what is going on. Just
have to search for the functions and cut & paste them
together...

>[..damned shit decimals... :)
>| Another case for your TI-36x? Take care - AFAIK, each
>| pack of cigarettes only contains a limited amount of
>| 0x13 cigarettes (I prefer tobacco and papers)... ;)
>
>This never happened in Austria (packet-size = 0x14 ),
>price increase here will always fit the slot-machines by
>proper rounding (up, never saw it down-rounded).
>And while working on the PC I also 'roll' my cigarettes,
>(with filters) otherwise I would smoke too much by far.
>

Wow - I never could roll a cigarette with a filter at
one end. Needs master skills, my respect! Do not know
too much about prices and package sizes, I buy my to-
bacco in 200 gram cans ("Rothhändle", what else), so
I don't even know how much the small (40 gram) pack
of tobacco costs...

wolfgang kern

unread,
May 8, 2003, 9:23:21 AM5/8/03
to

Bernhard wrote:

| Hope you survived!
Yes, and another 'heavy' weekend is near...
several old ladies are expected...

| [DIV, DIV and DIV... (stolen from M.P.)]

Who is M.P. ?


| >| I should study this one with a few examples!

| >Let me know if you find figures which need extra adjustment,
| >I found problems only near the maximum value,
| > due 112 bytes (all bits set) are less than 0.9999..999 E+269.
| > Avoidable if you limit the 'valid' range to 1 E+268 (0.999..E+268).

| In the end, E+256 would be enough, the surplus digits
| were only thought to produce a higher accuracy and my
| buffer is filled with entire paragraphs, if I use 112
| bytes...

Ok.

[determine MSB value]
| For practical work:

| 1. Get MSB number (however we do it).
| 2. (MSB * 1000) / 3322 -> power of 10?

I would use a factor which allows a right-shift instead of a divide.
ie:
top bit value = 2^895 [ (112*8)-1 ]
895 = 037F [ max. 10 bits ]
so if we like to operate with just 32 bits,
the factor may have up to 22 bits.
<"probier-grammier:">
2^21 * log10(2) = 2097152 * 0.30103 = 631305.5675 = 0009A209 (20 bits)
2^22 = 00134412 (21 bits)
2^23 = 2525222.63 = 00268826 (22 bits)
</"that's it"> FACTOR= 0x00268826
==================
So the maximum is 037F * 00268826 and this fits into 32 bits.
;eax = bit-nr.
IMUL eax,0x00268826 ;immediate 32 form
SHR eax,0x17 ; DIV by 2^23

now let's see if all the truncation and rounding affects the result:
I got eax = 0000010D = 269 for bit# 895.

let's also check bit# 31 2^31 = 2147483648 = 2.14.. E+9
001F * 00268826 = 04AA7C9A
SHR 0x17
= 00000009 OK.

| How accurate is the factor 0.30103?

The 0.30103 is log10(2), my TI shows 0.3010299957..
even our factor is a bit off: 2525222/8388608 = 0.3010299206..
if rounded up: 2525223/8388608 = 0.3010300398..
I recommend using the smaller value (a checked, working solution).
I think this method works correctly for a range much larger than ever needed.
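
In C the whole estimate is a one-liner; a small sketch using
the constants derived above (the helper name is made up):

#include <stdio.h>
#include <stdint.h>

/* decimal power below a given bit number:                   */
/* floor(bit_nr * log10(2)), with log10(2) ~ 0x268826 / 2^23 */
static uint32_t dec_power(uint32_t bit_nr)
{
    return (bit_nr * 0x268826u) >> 23;
}

int main(void)
{
    printf("%u\n", dec_power(895)); /* top bit of 112 bytes -> 269 */
    printf("%u\n", dec_power(31));  /* 2^31 ~ 2.14 E+9      ->   9 */
    return 0;
}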


| [considerations...]
|
| >| My MUL without a table still is pending. Could be bit
| >| faster than the table version, but will not beat your
| >| "hardware" multiply...

| >As you probably paid for your hardware, you may use it :)
| >Don't hesitate to use my MUL-64 example,
| >"my" MUL already mutated to 46 clock-cycles per result-byte,
| >using L1-cache lines and other tricks
| > which OS/2 and other common OS won't never allow.

| Beware - the mutants are coming... ;)

My hard-disk is packed and sealed gas-tight, so no-one can escape..

| I still don't have decided to use your MUL - the easy
| way isn't half as challenging as the harder ones.

:)


| >| Might be, that the 10...F0 entries could be skipped?
| >But then you would need 4-bit shifting?

| True. Unfortunately not the way to save some clocks!

| >| >220000/112 = ~2000 per result-byte
| >| Hmmm - I don't like this result...
| >It's really not that bad, compared to existing Divide-routines.
| Let's code it first... ;)

| >| >| If OP1 still is not zero,....
| >Then treat it as the remainder!
| Might be skipped, if it already exceeded the LSB.

Yes, but very useful for "MOD".

| >| >The 64-bit-ALU DIV needs about 600 cycles/byte
| >| >And the dword CMP/SUB division is fastest with about 500/byte.
| >| Sorry - I just don't get the point - what the heck is
| >| CMP/SUB compared to the thing I had in mind?

| >The main differences are:
| > [dword CMP/SUB]
| > No table creation (the 10^x table is on disk).
| > Works with dword-CMP rather than on nibbles.
| > Subtract the divisor "at" and "with" dword-steps.

| Sounds nice, but I still don't get, what I should SUB
| then. The OP2 (shifted to match the MSB of OP1) or an
| entry of the "10 E x"-table (which doesn't make sense
| in my current imagination of the DIV routine)...

Sub OP2, shifted to be just less than OP1,...
many iterations, but the loop can be kept short.

The 10^x table join in the fun for bin2ASCII and bin2BCD-conversion,
which are divide-functions also.

| I started to write a little testing range with my hex
| dump (already available), so I may get a clue how the
| single steps of the calculator work.
|
| At the moment the entire calc only exists in my mind,
| so it's quite hard to understand the one or other re-
| lation. If I see something working step by step, then
| I check it much better...
|
| It should provide functions like clear operand, input
| data, single step and complete execution - may take a
| while, but it is important. The more complicated this
| gets, the more I need to "see" what is going on. Just
| have to search for the functions and cut & paste them
| together...

Yes, "YSWYG", a hex-dump will show the truth.

[smokers corner..]


| Wow - I never could roll a cigarette with a filter at
| one end. Needs master skills, my respect! Do not know
| too much about prices and package sizes, I buy my to-

| bacco in 200 gram cans ("Rothhändle", what else), so
| I don't even know, how much the small (40 gram) pack
| of tobacco costs...

There is really no magic with the filters, just try it.
I pay 88.- for 1000 gram (Maverick, isn't "that" heavy).

__
wolfgang

bv_schornak

unread,
May 9, 2003, 4:39:37 PM5/9/03
to
wolfgang kern wrote:

>| [DIV, DIV and DIV... (stolen from M.P.)]
>
>Who is M.P. ?
>

Spam, Spam and Spam - Monty Python...

Valid HTML? The Validator says no - but it's funny! ;)

Got the clue - stored to be used...

>| How accurate is the factor 0.30103?
>
>The 0.30103 is log10(2), my TI shows 0.3010299957..
>even our factor is a bit off: 2525222/8388608 = 0.3010299206..
> if rounded up: 2525223/8388608 = 0.3010300398..
>I recommend to use the smaller value (checked working solution).
>I think this method work correct for a range much larger than ever needed.
>

It's part of BinLobster, now!

>| [considerations...]


>| >As you probably paid for your hardware, you may use it :)
>| >Don't hesitate to use my MUL-64 example,
>| >"my" MUL already mutated to 46 clock-cycles per result-byte,
>| >using L1-cache lines and other tricks
>| > which OS/2 and other common OS won't never allow.
>
>| Beware - the mutants are coming... ;)
>
>My hard-disk is packed and sealed gas-tight, so no-one can escape..
>

Gosh - didn't think of _that_... ;)

>| >| >| If OP1 still is not zero,....
>| >Then treat it as the remainder!
>| Might be skipped, if it already exceeded the LSB.
>Yes, but very useful for "MOD".
>

What the heck is MOD -> modulo?

>| >| >The 64-bit-ALU DIV needs about 600 cycles/byte
>| >| >And the dword CMP/SUB division is fastest with about 500/byte.
>| >| Sorry - I just don't get the point - what the heck is
>| >| CMP/SUB compared to the thing I had in mind?
>
>| >The main differences are:
>| > [dword CMP/SUB]
>| > No table creation (the 10^x table is on disk).
>| > Works with dword-CMP rather than on nibbles.
>| > Subtract the divisor "at" and "with" dword-steps.
>
>| Sounds nice, but I still don't get, what I should SUB
>| then. The OP2 (shifted to match the MSB of OP1) or an
>| entry of the "10 E x"-table (which doesn't make sense
>| in my current imagination of the DIV routine)...
>
>Sub OP2, shifted to be just less than OP1,...
>many iterations, but the loop can be kept short.
>

And every right shift of OP2 is one left shift in the
result? Should be...

>The 10^x table join in the fun for bin2ASCII and bin2BCD-conversion,
>which are divide-functions also.
>

I may see the light, if my "playground" is done.

>| I started to write a little testing range with my hex
>| dump (already available), so I may get a clue how the
>| single steps of the calculator work.
>|
>| At the moment the entire calc only exists in my mind,
>| so it's quite hard to understand the one or other re-
>| lation. If I see something working step by step, then
>| I check it much better...
>|
>| It should provide functions like clear operand, input
>| data, single step and complete execution - may take a
>| while, but it is important. The more complicated this
>| gets, the more I need to "see" what is going on. Just
>| have to search for the functions and cut & paste them
>| together...
>
>Yes, "YSWYG", a hex-dump will show the truth.
>

Badly needed! My imagination in this case is not what
I would call sufficient...

>[smokers corner..]
>| Wow - I never could roll a cigarette with a filter at
>| one end. Needs master skills, my respect! Do not know
>| too much about prices and package sizes, I buy my to-
>| bacco in 200 gram cans ("Rothhändle", what else), so
>| I don't even know, how much the small (40 gram) pack
>| of tobacco costs...
>
>There is really no magic with the filters, just try it.
>I pay 88.- for 1000 gram (Maverick, isn't "that" heavy).
>

I don't need filters. ;)

"Rothhändle" isn't really heavy stuff - compared with
Gauloises or Gitanes - 200 gram cans come at € 14.95,
no "jumbo cans" with 1 kg available here (OCB come at
€ 1.50 for 100 pieces)...

wolfgang kern

unread,
May 12, 2003, 10:27:46 PM5/12/03
to

Bernhard wrote:

[determine MSB value]
... </"that's it"> nominator= 0x00268826 ; denominator = 2^23


| Valid HTML? The Validator says no - but it's funny! ;)

Everything is definable in the crazy advanced HTML-extensions,
(see self...@teamone.de , be aware of many, many Mbytes)
but here I just used it as block-marks.


| Got the clue - stored to be used...

| It's part of BinLobster, now!

Ok, I mark it as "done".

| What the heck is MOD -> modulo?

Right, sorry I'm lazy too often...

[CMP/SUB loop DIV]

| >Sub OP2, shifted to be just less than OP1,...
| >many iterations, but the loop can be kept short.

| And every right shift of OP2 is one left shift in the
| result? Should be...

Mmh.., actually I add the 'count' _at_ divisors LSB-position.

ie: 1/9 ;9 digits precise
(1/9 => 1000000000/9) = (111111111 E-9)

0x3B9ACA00 / 0x00000009 = 0x069F6BC7

pos: 4 3 2 1 0
3B9ACA00
no: 9 ;too large, next
3B9ACA00
sub: 9 06 times
add loop count at result pos 3 (+06000000)
059ACA00 remain after loop
sub: 9 9F times
add loop count at result pos 2 (+009F0000)
0003CA00 remain after loop
sub: 9 6B times
add loop count at result pos 1 (+00006B00)
00000700 remain after loop
sub 9 C7 times
add loop count at result pos 0 (+000000C7)
00000001 remainder = 1 E-9

This byte-wise method may look slow,
but it works with every operand size.
The inner SUBtract-loop can work with registers only,
so it needs just a few clock-cycles.

[Not with this 64-bit example of course
(it will produce a DIV-overflow anyway),
but it's faster than the iterative use of
DIV-instructions for larger figures.]
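
A small C model of this byte-wise CMP/SUB loop, run on the
1/9 example above (32-bit values for brevity; a real version
would walk the 112-byte strings dword-wise):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t rem = 0x3B9ACA00u;  /* 1 000 000 000 */
    uint32_t div = 0x00000009u;  /* divisor       */
    uint64_t res = 0;
    int      pos;

    for (pos = 4; pos >= 0; pos--) {                 /* positions 4..0  */
        uint64_t d     = (uint64_t)div << (8 * pos); /* shifted divisor */
        uint64_t count = 0;
        while (rem >= d) {                           /* inner SUB loop  */
            rem -= (uint32_t)d;
            count++;
        }
        res += count << (8 * pos);  /* add loop count at this position */
    }
    printf("result %08llX remainder %08X\n",
           (unsigned long long)res, rem);  /* 069F6BC7, 00000001 */
    return 0;
}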

I still think log/exp will be the fastest divide solution,
and I'll implement it for my 256-bit calculator.
Your 896-bit story unfortunately can't use full log-tables,
and as long as we don't have a faster 1/x solution you cannot
calculate the log directly.

__
wolfgang

bv_schornak

unread,
May 13, 2003, 12:03:35 PM5/13/03
to
wolfgang kern wrote:

>[determine MSB value]
>.... </"that's it"> nominator= 0x00268826 ; denominator = 2^23


>| Valid HTML? The Validator says no - but it's funny! ;)
>
>Everything is definable in the crazy advanced HTML-extensions,
>(see self...@teamone.de , be aware of many, many Mbytes)
>but here I just used it as block-marks.
>

Guess, how I learned the HTML syntax to write all the
(don't even know, how much) pages for my homepage? It
still is under development, so it will grow much more
until it's ready one fine day (with program and music
downloads it's about 47 MB now - good to know, that I
still have 153 MB left)... ;)

>| Got the clue - stored to be used...
>| It's part of BinLobster, now!
>
>Ok, I mark it as "done".
>

<done> ;)

>| What the heck is MOD -> modulo?
>Right, sorry I'm lazy too often...
>

As long as I don't know something, I will ask you!

Now I see some light at the end of the way - it looks
a little bit like the conversion routine I coded with
my 1st steps in x86 assembler (A86 syntax).

And I get the clue, that my table might speed up this
routine a little bit.

My playground got stuck, because the display isn't of
use. Either I code a new one or leave it be. Maybe it
would be better to put some more effort into the DIV.
I think that I can do it now - let's have a look at how
I'm in the mood to do the one or the other (where the 1st
one looks like much more work, but would be a base for
the calculator itself)... ;)

wolfgang kern

unread,
May 16, 2003, 10:34:19 AM5/16/03
to

Bernhard wrote:

[HTML]

| Guess, how I learned the HTML syntax to write all the
| (don't even know, how much) pages for my homepage? It
| still is under development, so it will grow much more
| until it's ready one fine day (with program and music
| downloads it's about 47 MB now - good to know, that I
| still have 153 MB left)... ;)

Even my home-page is smaller by far, it seems never to become finished...

[CMP/SUB loop DIV]
[...]


| Now I see some light at the end of the way - it looks
| a little bit like the conversion routine I coded with
| my 1st steps in x86 assembler (A86 syntax).

| And I get the clue, that my table might speed up this
| routine a little bit.

Yes, it could ....



| My playground got stuck, because the display isn't of
| use. Either I code a new one or leave it be. Maybe it
| would be better to put some more effort into the DIV.
| I think that I can do it now - let's have a look at how
| I'm in the mood to do the one or the other (where the 1st
| one looks like much more work, but would be a base for
| the calculator itself)... ;)

Good luck!
I'll let you know about a faster 1/x or log solution whenever I find one ...

__
wolfgang

bv_schornak

unread,
Jun 5, 2003, 7:57:58 PM6/5/03
to
wolfgang kern wrote:

- - -
<private note>

Hallo Wolfgang!

I've let those people take me for a ride long enough now -
sorry that it took so long until I finally saw how things
really work. Some people just never get it, no matter what
you tell them. A pity about all the wasted time...

Back from the puppet theatre! ;)

Bernhard

</private note>
- - -

Now - there's a BinLobster waiting, isn't there? Let's give
it a new try!

Finally I decided to use your MUL routine - my own solutions
can't compete with it. To make a new start (a long time has
passed, so I have to go through all the postings again to
get back into the matter) I post my new code (for correction).

The entire BinLobster (AT&T syntax) can be downloaded at

<http://schornak.de/download/BinLobster.txt>.

(Maybe you should have a look at that, too. I hope you agree
with my copyright notes. If not: Please give me a text which
I can put in there!)

CAUTION: All prior given links are not valid anymore!

Here's the code (translated to iNTEL style):


/*
---------------------------
mulOPS OP1 * OP2 = RES
---------------------------
-> EBX address LSB 1st OP
ESI 2nd OP
EDI RESULT
---------------------------
<- EAX 0000 0000 ok
---------------------------
*/

.globl _mulOPS
_mulOPS:
/*
save registers
*/
push edi
push ebp
push ecx
push edx
/*
check, if zero
*/
bt dword [ebx + 0],4 # bit 4 set -> OP1 is zero
jae 0 forwards
jmp iMULy
0:bt dword [esi + 0],4 # bit 4 set -> OP2 is zero
jae 1 forwards
jmp iMULy
/*
set sign of result
*/
1:bt dword [ebx + 0],0 # bit 0 = sign of OP1
jae 2 forwards
bt dword [esi + 0],0 # bit 0 = sign of OP2
jb 3 forwards
or byte [edi + 0],1 # -OP1 * +OP2
jmp 4 forwards
2:bt dword [esi + 0],0
jae 3 forwards
or byte [edi + 0],1 # +OP1 * -OP2
jmp 4 forwards
3:and byte [edi + 0],0xFE # signs are equal
/*
MUL loop
*/
4:xor ebp,ebp
5:xor ecx,ecx
6:mov eax,dword [(ebx + 4) + ebp]   # EBP = factor1 mantissa offset
mul dword [(esi + 4) + ecx]         # ECX = factor2 mantissa offset
add dword [(edi + 4) + ecx],eax     # EDX:EAX = EAX*MEM32
adc dword [(edi + 8) + ecx],edx     # 32 + 32 = 64 bit ADD
jnb 9 forwards
mov edx,ecx                         # save ECX
7:add dword [(edi + 0x0C) + ecx],1  # ADC loop (carry propagation)
jnb 8 forwards
add ecx,4
jmp 7 backwards
8:mov ecx,edx                       # restore ECX
9:add ecx,4                         # byte offsets, dword steps
cmp cl,0x70                         # 28 dwords = 112 bytes done?
jb 6 backwards                      # next 32 bits of factor2
add ebp,4
add edi,4                           # adjust result offset
cmp ebp,0x70                        # (code 83 form)
jb 5 backwards                      # next 32 bits of factor1
/*
restore registers
*/
iMULy:pop edx
pop ecx
pop ebp
pop edi
ret
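
For checking the routine, a plain C reference model of the
same dword-wise schoolbook multiply might help (little-endian
dword arrays, 28 dwords = 112 bytes of mantissa; header and
sign handling left out, all names are mine):

#include <stdint.h>

#define NDW 28                    /* 112-byte mantissa = 28 dwords */

/* res must hold 2*NDW dwords and be cleared by the caller */
void mul_mantissa(const uint32_t *op1, const uint32_t *op2,
                  uint32_t *res)
{
    int i, j;
    for (i = 0; i < NDW; i++) {
        uint64_t carry = 0;
        for (j = 0; j < NDW; j++) {
            uint64_t t = (uint64_t)op1[i] * op2[j]
                       + res[i + j] + carry;
            res[i + j] = (uint32_t)t;
            carry      = t >> 32;
        }
        res[i + NDW] += (uint32_t)carry;
    }
}

Feeding it the same operands and comparing the 224-byte
product against the assembly output should show quickly
whether the offsets and the carry loop behave as intended.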


Ok - that's all for now. I think that the first version for
the DIV will be ready by Monday evening (a holiday in Bava-
ria and, AFAIK, in Austria, too) - no more distractions in
sight... ;)
