128bit Square Root

Kurt White

unread,

May 26, 2015, 8:30:03 PM5/26/15

to

Background:
I am running a program that does massive calculations in the 128bit integer number space. One of the important functions in the program is SQRT. LCC does a nice job of using the Intel x86 64bit SQRT assembly instruction. Initially I was unable to find the equivalent for the 128bit number space. So I wrote a custom 128bit integer Square Root function. It finds the exact or closest value in as little as 3 iterations and at most 8. I used the Euclid method from Wikipedia with my custom mod for picking a sweet seed value. As I am exhausting the 64bit number space and further exploring into 128bit territory, I have found my algorithm is significantly slow compared to the hardware sqrt function. Duh!
Bignum and Qfloat are not viable as they implement variable precision. Clearly even rewriting my code in assembler will not produce optimal execution. Comparing the 10's of thousands of clock ticks for any "interpreted" solution against the 200 to 400 clock ticks using a direct hardware SQRT instruction points to the proper solution. What I need is access to the 128bit sqrtpd Intel x86 Instruction provided in all (Haswell) processors since 2013.
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Page 873
Intel C/C++ compiler definition and routine.
_mm_sqrt_pd (m128d a);
http://www.felixcloutier.com/x86/SQRTPD.html
Microsoft MSDN VS definition and routine.
__m128d _mm_sqrt_pd(__m128d a);
https://msdn.microsoft.com/en-us/library/57yx633e(v=vs.90).aspx

Questions:
Can we have the 128bit hardware SQRTxx function in LCC-64 Int128 library?
If not, suggestions on which assembler generates compatible code for linking with LCC-64? (and maybe a few tips on how-to?)
I do not know if the MSDB VS SDK libs contain this function or if they are free to download, or if the library containing this function is even compatible with LCC-64?

Thanks,
KW

jacobnavia

unread,

May 27, 2015, 8:57:04 AM5/27/15

to

Le 27/05/2015 02:30, Kurt White a écrit :
> Background:
> I am running a program that does massive calculations in the 128bit integer number space.

INTEGER!

One of the important functions in the program is SQRT. LCC does a nice
job of using the Intel x86 64bit SQRT assembly instruction.

Thanks

Initially I was unable to find the equivalent for the 128bit number space.

There is NONE.

So I wrote a custom 128bit integer Square Root function.

OK.

It finds the exact or closest value in as little as 3 iterations and at
most 8.

Probably using Newton's method isn't it?

I used the Euclid method from Wikipedia with my custom mod for picking a
sweet seed value.

As I am exhausting the 64bit number space and further exploring into
128bit territory,

I have found my algorithm is significantly slow compared to the hardware
sqrt function. Duh!

Well, lcc-win generates code for:

1: Read higher 64 bit, multiply it by 2^64 and add lower 64 bits
2) Take the sqrt of a long double format
3) Write the result as a 64 bit integer. Since the square root of an
int128 MUST fit in 64 bits zero the high part.

> Bignum and Qfloat are not viable as they implement variable precision.

They are probably too slow compared to lcc win's solution.

Clearly even rewriting my code in assembler will not produce optimal
execution.

lcc-win writes in assembler! :-)

Comparing the 10's of thousands of clock ticks for any "interpreted"
solution against the

200 to 400 clock ticks using a direct hardware SQRT instruction points
to the proper solution.

What I need is access to the 128bit sqrtpd Intel x86 Instruction
provided in all (Haswell) processors since 2013.

You are confusing SQRTPD with sqrt(int128). Sorry to tell you this but
SQRTPD takes two roots of two 64 bit floating point numbers! That is NOT
what you want.

Following your message I have optimized the generated code a bit for
sqrt(int128). But I do not see any other algorithm as the one I wrote above

If you have a better one tell me and I will implement it

jacobnavia

unread,

May 27, 2015, 4:21:13 PM5/27/15

to

I improved somewhat the sqrt(int128) operation. It takes now 5ns
to take the square root of a 128 bit integer (in my achine, 1.5
years old)

Kurt White

unread,

May 27, 2015, 6:21:17 PM5/27/15

to

From:
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/4th-gen-core-family-desktop-vol-1-datasheet.pdf
"Intel Advanced Vector Extensions 2.0 (Intel AVX2) is the latest expansion of the Intel instruction set. Intel AVX2 extends the Intel Advanced Vector Extensions (Intel AVX)with 256-bit integer instructions, floating-point fused multiply add (FMA) instructions,and gather operations. The 256-bit integer vectors benefit math, codec, image, and digital signal processing software.
Bummer apparently no 128bit SQRT function.
I've opened up an engineering request to implement this in the next revision.
he he he he he. . . At least get back to me with what they really gots.
Other note: Long Double is 80 bits, right? I'm already up to 23 digits, the manual mentions 19, but I calculate 24 digits for 80 bit math... I'm just probably way off. Looking like I'm really going to have to implement my method in assembler. I will go back and find out if it is Newton or not.
Regards,
KW

jacobnavia

unread,

May 28, 2015, 3:02:10 AM5/28/15

to

Le 28/05/2015 00:21, Kurt White a écrit :
> Other note: Long Double is 80 bits, right? I'm already up to 23 digits, the manual mentions 19, but I calculate 24 digits for 80 bit math...

I'm just probably way off.

YES.

If you want more precision (always using floating point) you can use the
float128_t floating point numbers, but then the time goes
from 5 ns to 120 ns.

Will try to find a 128 sqrt integer approx.

Looking like I'm really going to have to implement my method in assembler.

lcc-win IS assembler!

Kurt White

unread,

May 28, 2015, 3:31:56 AM5/28/15

to

http://en.wikipedia.org/wiki/Integer_square_root
Newton's method :)
I'm using the following inline code.
Int k, the rest are I128
/* target is c2 */
for (k = 0, x = c2; x != 0; k++) x = x / 10;
for (x = c2, k = (k + 1) / 2; k > 1; k--) x = x / 10;
do {
t1 = c2 / x;
t2 = x + t1;
x1 = x;
x = t2 / 2;
}
while ((x1 - x) > 1);
/* x is the resulting sqrt */
Assembler generated by optimized lcc-win for the loop is already assembler! :)
I bet a $ the seed generation could be optimized. I selected base 10 because base 16 increases the center median, while using base 2 introduces larger order of magnitude deltas. I wonder if I should chop down by 10 and 100 simultaneously!?
Any suggestions ?

Kurt White

unread,

May 28, 2015, 3:43:48 AM5/28/15

to

WHAT? - from include64/math.h
// sqrt
long double sqrtl(long double);
double _sqrti(int);
double _sqrtu(unsigned);
double _sqrtll(long long);
double _sqrtull(unsigned long long);
float sqrtf(float);
double sqrt(double);
//
the biggest value I can supply is a long double which is 96 bits?
5ns !?!? Oh please enlighten this think and sometimes dense head.
It has been right in front of me all this time!? It just works!?
<head down, chuckling>

Kurt White

unread,

May 28, 2015, 3:45:54 AM5/28/15

to

Downloading and reinstalling now...
:)

Kurt White

unread,

May 28, 2015, 4:44:01 AM5/28/15

to

On Thursday, May 28, 2015 at 12:45:54 AM UTC-7, Kurt White wrote:
> Downloading and reinstalling now...
> :)

sqrt(int128) Problem:

int128 c, c2;
int main(int argc,char *argv[]){
c = 3125000320;
c2 = c * c;
printf(" c = %I128d c2 = %I128d \n", c ,c2);
c = sqrt(c2);
printf(" c = %I128d \n", c);
}
Results:
c = 3125000320 c2 = 9765627000000102400
c = 9223372036854775808

Kurt White

unread,

May 28, 2015, 5:05:23 AM5/28/15

to

>
> sqrt(int128) Problem:
c = 312500032012345 c2 = 97656270007716649790232399025
c = 312500031982830
I'm seeing loss of accuracy when the 64bit boundary is crossed.
if that helps.

Keith Thompson

unread,

May 28, 2015, 4:40:13 PM5/28/15

to

Kurt White <willy...@gmail.com> writes:
> Background:
> I am running a program that does massive calculations in the 128bit
> integer number space. One of the important functions in the program is
> SQRT. LCC does a nice job of using the Intel x86 64bit SQRT assembly
> instruction. Initially I was unable to find the equivalent for the
> 128bit number space. So I wrote a custom 128bit integer Square Root
> function. It finds the exact or closest value in as little as 3
> iterations and at most 8. I used the Euclid method from Wikipedia with
> my custom mod for picking a sweet seed value. As I am exhausting the
> 64bit number space and further exploring into 128bit territory, I have
> found my algorithm is significantly slow compared to the hardware sqrt
> function. Duh!

[...]

If you convert your 128-bit integer to long double, pass the result to
sqrtl(), and convert back to a 128-bit integer, you'll get a close
approximation of the integer square root. (It's bound to be a little
off for any value that won't fit in the mantissa of a long double, and
could be off by 1 even for smaller values.)

If you used that approximation, you could probably get an exact result
by applying a much smaller number of iterations of whatever method
you're using for pure integer square roots.

For example, for an input that takes 8 iterations to compute its square
root, converting and calling sqrtf() *might* be a faster way to do the
first, say, 6 or 7 iterations.

This is almost pure speculation on my part, but it might be worth
exploring.

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Kurt White

unread,

May 29, 2015, 12:06:42 AM5/29/15

to

>
> If you convert your 128-bit integer to long double, pass the result to
> sqrtl(), and convert back to a 128-bit integer, you'll get a close
> approximation of the integer square root. (It's bound to be a little
> off for any value that won't fit in the mantissa of a long double, and
> could be off by 1 even for smaller values.)
>
> If you used that approximation, you could probably get an exact result
> by applying a much smaller number of iterations of whatever method
> you're using for pure integer square roots.
>
> For example, for an input that takes 8 iterations to compute its square
> root, converting and calling sqrtf() *might* be a faster way to do the
> first, say, 6 or 7 iterations.
>
> This is almost pure speculation on my part, but it might be worth
> exploring.
>
> --
> Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
> Working, but not speaking, for JetHead Development, Inc.
> "We must do something. This is something. Therefore, we must do this."
> -- Antony Jay and Jonathan Lynn, "Yes Minister"

Thanks Keith,
I just tried that. the sample program I provided in my post "sqrt(int128) does just that. But a long double is only 96 bits, which translates to approx 24 digits, where as 64 bits get you 21 digits, and 128bits gets you 42 digits!
currently when calling that function, the assembler code generated is:
; 172 c = sqrt(c2);
.line 172
push $64
fildl (%rsp)
leaq c2,%rcx
fildq 8(%rcx)
fscale
fildq (%rcx)
faddp
fstp %st(1)
fsqrt
fisttpq (%rsp)
popq %rax
xorl %edx,%edx
movq %rax,c
leaq c,%rax
movq %rdx,8(%rax)
Notice that fsqrt which does 96bit floating point. ( usually only 250 clock ticks!) WOWeee. I thought I found a 128bit SQRT instruction, but Jacob pointed out I got the wrong one. Intel engineering has not responded to me yet either.
In my application I need exact integer square roots, if they exist.
I'm using the Newton method with a tweak to get me guaranteed <=9 of the square root if it exists. (see above) the Newton method gets you 50% closer on each iteration, so if you start close you get there quick! The normal way is to start with the number you are targeting! That assumption leads to unacceptable performance when in my case I have to (in reality) do a billion trillion square root calls when I get half way into the 128bit number space. Eeeek! I need every clock tick to count! But anyways for now, I'm stuck with doing it manually, as fast as I can figure. I think Jacob has something brewing which might save the day.
:)
Regards,
KW

Keith Thompson

unread,

May 29, 2015, 6:18:21 PM5/29/15

to

Kurt White <willy...@gmail.com> writes:
>> If you convert your 128-bit integer to long double, pass the result to
>> sqrtl(), and convert back to a 128-bit integer, you'll get a close
>> approximation of the integer square root. (It's bound to be a little
>> off for any value that won't fit in the mantissa of a long double, and
>> could be off by 1 even for smaller values.)
>>
>> If you used that approximation, you could probably get an exact result
>> by applying a much smaller number of iterations of whatever method
>> you're using for pure integer square roots.
>>
>> For example, for an input that takes 8 iterations to compute its square
>> root, converting and calling sqrtf() *might* be a faster way to do the
>> first, say, 6 or 7 iterations.
>>
>> This is almost pure speculation on my part, but it might be worth
>> exploring.
>

I'm not very familiary with assembly language, so I can't quite tell
whether you understood my suggestion.

I'm not suggesting that a floating-point square root will give you the
exact result you want. I'm suggesting that it will give you a close
approximation, *and then* you can iterate on that to get the exact
result. The only sample program of yours that I found just does a
floating-point sqrt; it doesn't iterate beyond that.

And a floating-point square root might get you closer to the exact
result than you might expect, since sqrt(x) only needs about half as
many bits as x.

Sorry about the noise if you already understood that.

> The normal way is to start with the number you are targeting!

And I'm suggesting that you instead start with the close approximation
that a floating-point sqrt call gives you (since it's apparently much
faster than your existing method).

> That assumption leads to unacceptable performance when in my case I
> have to (in reality) do a billion trillion square root calls when I
> get half way into the 128bit number space. Eeeek! I need every clock
> tick to count! But anyways for now, I'm stuck with doing it manually,
> as fast as I can figure. I think Jacob has something brewing which
> might save the day.

Are you computing square roots for *every* integer value within a large
range?

A billion trillion is about 2**69. 128 bits worth of integers is about
340 trillion trillion trillion, which is vastly bigger (and beyond what
you can reasonably calculate).

But if you're computing integer square roots for a large contiguous
range of integer values, it's much faster to do it backwards, by
iterating over the square roots rather than over the squares:

1 is the square root of each number in the range 1..3
2 is the square root of each number in the range 4..8
3 is the square root of each number in the range 9..15
4 is the square root of each number in the range 16..24

And so on. You don't even have to do any multiplication; each square
can be computed from the previous one with just addition.

Even if this isn't what you're doing, a bit more description might lead
to a considerably more efficient solution.

Kurt White

unread,

May 31, 2015, 3:53:08 PM5/31/15

to

> But if you're computing integer square roots for a large contiguous
> range of integer values, it's much faster to do it backwards, by
> iterating over the square roots rather than over the squares:
>

> Even if this isn't what you're doing, a bit more description might lead
> to a considerably more efficient solution.
>
> --

> Keith Thompson (The_Other_Keith) kst...@mib.org <http://www.ghoti.net/~kst>

> Working, but not speaking, for JetHead Development, Inc.
> "We must do something. This is something. Therefore, we must do this."
> -- Antony Jay and Jonathan Lynn, "Yes Minister"

Keith Thompson: You are brilliant!
This approach reminded me when I first started this adventure seeking an answer to the as yet unsolved problem back in 1986. I wrote my program in Fortran and ran it for several months (until I got busted) on an IBM 360 32bit mainframe. I wrote the program a second time to support 100 digits after exhausting the 32bit number space.
So, yesterday I wrote the code to implement the "Thompson" method of scanning to
the next expected result using squares instead of Square Roots. (just as I did so long ago). I was able to reduce the scanning code to the following Single C Statement:
int128 i, c, c2;
i = bla(starting);
c2 = bla(target);
for ( c = i; (c*c) < c2; c++ );
/* test for exact */
Turns out this works fantastic until the numbers get big, (64bits) then it bogs.
Scanning ultimately takes longer than any SQRT function.
SQRT is KING.

The following data are hardly accurate results, BUT, yields a "feel" for my misery. :)
All tests done on lcc-win-64 with various configurations for accomplishing Square Root.
(includes at least 5% housekeeping and a few writes to disk)
67,10,18,982 64bit Integer SQRT Assembler on all 10 digit numbers - 18 seconds.
Fantastic speed but limited to the 64bit number space!
In all tests the FSQRT function takes longer except when an exact result exists.
and also there is the overhead from long float to I128 translation.
671,018,982 128 Integer Sqrt on 10 digit numbers using modified Newton method.
(the C code for this is listed in a previous post.)
174,464,936,096 128bit Sqrts Mod Newton method on 11-12 digit nums 87 hours 51 minutes
249,999,999,998 128bit Sqrts Mod Newton method on 12 digit numbers 85 hours 43 minutes
299,999,999,997 128bit Sqrts Mod Newton method on 12 digit numbers 124 Hours 19 minutes

Apples and Oranges:
lcc-win 64bit/128bit tests were run on:
Intel Core i7 975 Extreme Edition Bloomfield 4.2GHz DDR3 2600
in a VMware Workstation VM (Total system overhead already at 42%)
lcc-win 128bit modified Newton method were run on:
Intel Xeon W3680 3.33GHz DDR3 1333
in a VMware ESXi 6.1 VM (Total system overhead already at 56%)

My Haswell total screamer is arriving next week! Oops, ordered before it was
pointed out to me that the AVS enhanced assembler instruction SQRTpd does not do 128bits! DOH!
Now, I'm looking at storing previous results in a GIANT array for successive
passes on new numbers. This would use the "Thompson" method as an index into results!
Thinking out loud here, there are 1,099,511,627,776 bytes in a gigabyte.
Therefore there are 68,719,476,736 128bit integers in a Gigabyte.
So, there are 2,199,023,255,552 128bit integers storable in 32Gigs of RAM.
and 8,796,093,022,208 128 integers storable in 128Gigs of RAM.
(Side note: a Database is out of the question even if it is on an SSD,
the overhead still doesn't even come close to a one reference RAM array)
I could use the RAM array beginning at the 64bit boundary just like to do now
by switching to the Mod Newton method when the assembler 64bit SQRT runs out of gas.
An GIANT (several Gazillion) array of 128bit results does provide some leeway.
The question becomes how to implement in lcc-win64 ?!?
int128 Big128[ 8796093022208 ];
or
malloc ( 549755813888 * sizeof(int128) ];
ROFL man... are we having fun yet? shouldn't do multiple allocs as that turns the objective into a kind of BTree solution. More and more indexes...

What is the largest static array of 128bit ints that can be defined in lcc-win64?

What is the largest contiguous dynamic memory allocation that can be done in lcc-win64?

Regards,
KW

Keith Thompson

unread,

May 31, 2015, 9:51:15 PM5/31/15

to

Kurt White <willy...@gmail.com> writes:
>> But if you're computing integer square roots for a large contiguous
>> range of integer values, it's much faster to do it backwards, by
>> iterating over the square roots rather than over the squares:
>>
>> Even if this isn't what you're doing, a bit more description might lead
>> to a considerably more efficient solution.
>

> Keith Thompson: You are brilliant!

Thank you!

> This approach reminded me when I first started this adventure seeking
> an answer to the as yet unsolved problem back in 1986. I wrote my
> program in Fortran and ran it for several months (until I got busted)
> on an IBM 360 32bit mainframe. I wrote the program a second time to
> support 100 digits after exhausting the 32bit number space.
> So, yesterday I wrote the code to implement the "Thompson" method of scanning to
> the next expected result using squares instead of Square Roots. (just
> as I did so long ago). I was able to reduce the scanning code to the
> following Single C Statement:
> int128 i, c, c2;
> i = bla(starting);
> c2 = bla(target);
> for ( c = i; (c*c) < c2; c++ );
> /* test for exact */
> Turns out this works fantastic until the numbers get big, (64bits) then it bogs.
> Scanning ultimately takes longer than any SQRT function.
> SQRT is KING.

Avoiding the multiplication (c*c) might make it faster. For example:

#include <stdio.h>
int main(void) {
long long x = 1, y = 1;
while (1) {
printf("%lld * %lld = %lld\n", x, x, y);
y += x + x + 1;
x ++;
}
}

(Warning: This is an infinite loop.)

[...]

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>

Kurt White

unread,

Jun 1, 2015, 12:18:55 PM6/1/15

to

> > Scanning ultimately takes longer than any SQRT function.
> > SQRT is KING.
>
> Avoiding the multiplication (c*c) might make it faster. For example:
>
> #include <stdio.h>
> int main(void) {
> long long x = 1, y = 1;
> while (1) {
> printf("%lld * %lld = %lld\n", x, x, y);
> y += x + x + 1;
> x ++;
> }
> }
>
> (Warning: This is an infinite loop.)
>
> [...]
>
> --

> Keith Thompson (The_Other_Keith) mib.org <http://www.ghoti.net/~kst>

> Working, but not speaking, for JetHead Development, Inc.
> "We must do something. This is something. Therefore, we must do this."
> -- Antony Jay and Jonathan Lynn, "Yes Minister"

I am doing the (c * c) because I'm looking for exact Square Roots.
I can scan down or up using (c * c) or scan up or down using SQRT(c2)
If we do not do (c * c) then I HAVE to do SQRT.
Regards,
KW

jacobnavia

unread,

Jun 1, 2015, 1:57:07 PM6/1/15

to

I do not understand

lcc-win implements exact roots now!

please download the latest version!

Kurt White

unread,

Jun 1, 2015, 5:42:05 PM6/1/15

to

On Monday, June 1, 2015 at 10:57:07 AM UTC-7, jacobnavia wrote:

> I do not understand
>
> lcc-win implements exact roots now!
>
> please download the latest version!

I downloaded the new version per your instructions on 5/28.
I mentioned the problem I found in a previous post.
I just now download and uninstalled and re-installed fresh.
I might still not understand, but is this a valid problem?
99.999% of the time the results are correct!
Do you get these results?

#include <int128.h>
#include <stdio.h>
#include <math.h>

int128 i, i2;

int main(int argc,char *argv[]){

i = 49;
i2 = i * i;
printf(" i = %I128d i2 = %I128d \n", i ,i2);
i = sqrt(i2);
printf(" sqrt(i2) = %I128d \n\n", i);

i = 3125000320;
i2 = i * i;
printf(" i = %I128d i2 = %I128d \n", i ,i2);
i = sqrt(i2);
printf(" sqrt(i2) = %I128d \n\n", i);

i = 3906250256;
i2 = i * i;
printf(" i = %I128d i2 = %I128d \n", i ,i2);
i = sqrt(i2);
printf(" sqrt(i2) = %I128d \n\n", i);

i = 4000000250;
i2 = i * i;
printf(" i = %I128d i2 = %I128d \n", i ,i2);
i = sqrt(i2);
printf(" sqrt(i2) = %I128d \n\n", i);
}

Results:
i = 49 i2 = 2401
sqrt(i2) = 49

i = 3125000320 i2 = 9765627000000102400
sqrt(i2) = 4611686018427387904

i = 3906250256 i2 = 15258791062500065536
sqrt(i2) = 4611686018427387904

i = 4000000250 i2 = 16000002000000062500
sqrt(i2) = 4611686018427387904

Thanks,
KW

Kurt White

unread,

Jun 1, 2015, 6:04:23 PM6/1/15

to

int128 i, j, i2;
long long k;

int main(int argc,char *argv[]){

/* 2^63 - 1 = first failure? then in a few seconds BOOM */
for (k = 4294967295; k < 42949672950; k++) {
i = k;

i2 = i * i;

j = sqrt(i2);
if (j != i) {

printf(" i = %I128d i2 = %I128d \n", i ,i2);

printf(" sqrt(i2) = %I128d \n\n", j);
}
}

WOW! Killer Kudos!, it is running at least Two Orders of Magnitude
Faster than my modified Newton method!

jacobnavia

unread,

Jun 2, 2015, 5:00:49 AM6/2/15

to

I forgot that the lower part of the 128 bit number is UNSIGNED. But
there is NO INSTRUCTION to load an unsigned 64 bit number into the FPU
nor to load an unsigned one.

That means that when you have a 128 bit numer that has the sign bit ON
in position 63 like 9765627000000102400 the loading fails. This can be
corrected if I load the upper 32 bits, the scale by 32, then load the
lower 32 bits and add both.

Now, this problem is solved but I have yet to correct the storing of the
64 bit lower part lof the result correctly.

Will keep you posted.

jacob

CRNG

unread,

Jun 2, 2015, 5:06:14 AM6/2/15

to

On Tue, 02 Jun 2015 11:00:47 +0200, jacobnavia
<ja...@jacob.remcomp.fr> wrote in <mkjr9o$d2$1...@dont-email.me>

>Now, this problem is solved but I have yet to correct the storing of the
>64 bit lower part lof the result correctly.
>
>Will keep you posted.

Thanks
--
Web based forums are like subscribing to 10 different newspapers
and having to visit 10 different news stands to pickup each one.
Email list-server groups and USENET are like having all of those
newspapers delivered to your door every morning.

jacobnavia

unread,

Jun 2, 2015, 9:20:24 AM6/2/15

to

Le 02/06/2015 11:05, CRNG a écrit :
> On Tue, 02 Jun 2015 11:00:47 +0200, jacobnavia
> <ja...@jacob.remcomp.fr> wrote in <mkjr9o$d2$1...@dont-email.me>
>
>> Now, this problem is solved but I have yet to correct the storing of the
>> 64 bit lower part lof the result correctly.
>>
>> Will keep you posted.
>
> Thanks
>

please download again

It should work now

Kurt White

unread,

Jun 2, 2015, 2:41:11 PM6/2/15

to

> please download again

I think there is still some issue.
I suspected the hi bit last time, but holding my trap this time.
:)

#include <int128.h>
#include <stdio.h>
#include <math.h>

int128 i, i2;
int main(int argc,char *argv[]){

i = 2000002000001;

i2 = i * i;
printf(" i = %I128d i2 = %I128d \n", i ,i2);
i = sqrt(i2);
printf(" sqrt(i2) = %I128d \n\n", i);

}
Results:

i = 2000002000001 i2 = 4000008000008000004000001
sqrt(i2) = 2000002000006

Thanks so much for your efforts!
KW

Kurt White

unread,

Jun 2, 2015, 11:10:13 PM6/2/15

to

NICE !
i = 1844674407370955161 i2 = 3402823669209384632420136785472535921
sqrt(i2) = 1844674407370955161

KW

jacobnavia

unread,

Jun 3, 2015, 2:02:10 PM6/3/15

to

This is corrected, but I have another problem.

When writing the result, the code now can't write a square root bigger
than 2^63 since it is a signed result that will be written. There is no
instruction to write an unsigned 64 bit number from the FPU.

To be bigger than 2^63 for a square root, the number must be bigger than
2^126, i.e; 2^126 < N <= 2^127. Now, I am looking for an efficient ay to
implement this without penalizing all other square roots where this
doesn't matter.

Kurt White

unread,

Jun 3, 2015, 7:51:02 PM6/3/15

to

>
> This is corrected, but I have another problem.
>

Fascinating stuff, thanks for the insider information.
Regards,
KW