G77 + ASM (80x86) interface 101 ?

NimbUs

Jun 21, 2017, 11:06:29 AM

Hullo, Group ! I'd appreciate any quick-start pointers as
guidance in the form of specifications, templates, primers...
showing how best to interface a subroutine in 80x86-X32
assembler, being intended to be called from a G77-Fortran main
program. The goal here is to have the ASM subs do intensive
multi-precision arithmetic on very large integers - which
cannot be coded in Fortran easily and efficiently, - with a
Main shell written in Fortran, serving as a user interface,
especially in/out and formatting, setting up parameters and
calling the ASM subs which would do the "real" work - since
flexible I/O and user-interface is annoying and time consuming
to program in ASM.

Many /thanks/ in advance !

--
N.

herrman...@gmail.com

Jun 21, 2017, 2:18:08 PM

Well, first there are already some existing such libraries,
I believe callable from g77. If not from g77, then from gfortran
using C interoperability.

Except for division by a multiple-precision divisor, the operations
are all fairly easy to do in Fortran. One thing that makes them easier
and more efficient is the ability to multiply integers with a double
length product, and divide with a double length dividend. That is the
way the hardware usually works, but not many high-level languages
expose it.

Assembly routines to do those two would allow the rest to be done
in standard Fortran reasonably easily and efficiently.
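
A minimal sketch of why the double length product matters, assuming
16-bit limbs held in default 32-bit integers and a compiler that
offers INTEGER*8 (g77 and gfortran do on x86, as far as I know).
The routine name MPMULS and its argument list are made up for
illustration. It multiplies a multi-limb number in place by one
small factor; the limb product can reach nearly 2**32, so it is
formed in an INTEGER*8 temporary instead of relying on wrap-around.

```fortran
! Hypothetical helper: multiply the n-limb number a in place by m.
! Limbs are base 2**16 (0..65535), limb 1 most significant, and
! 0 <= m <= 65535. carry returns the limb that overflows the top.
      subroutine mpmuls(n, a, m, carry)
      integer n, a(n), m, carry
      integer i
      integer*8 t
      carry = 0
      do i = n, 1, -1
! The product can approach 2**32, too big for a signed 32-bit
! integer, so it is accumulated in the 64-bit temporary t.
         t = a(i)
         t = t * m + carry
         carry = t / 65536
         a(i) = t - 65536 * (t / 65536)
      end do
      end
```

With an assembly routine returning the double length product
directly, the INTEGER*8 temporary would not be needed and the limbs
could be full 32-bit words.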

But otherwise, the usual calling sequence is addresses of each
argument on the stack. The two common methods are called STDCALL
and CDECL, where the former uses the optional ability of RET to
pop the arguments off the stack, and the latter assumes that the
caller will pop them off. (C allows for variable number of
arguments, such that one can't depend on the count being constant.)

For 32 bit code, they will be 32 bit addresses.

herrman...@gmail.com

Jun 21, 2017, 2:19:31 PM

On Wednesday, June 21, 2017 at 8:06:29 AM UTC-7, NimbUs wrote:

> Hullo, Group ! I'd appreciate any quick-start pointers as
> guidance in the form of specifications, templates, primers...
> showing how best to interface a subroutine in 80x86-X32
> assembler, being intended to be called from a G77-Fortran main
> program.

One way to find out the calling sequence is to write a small
Fortran subroutine, then compile it with the option to save the
generated assembly code. Look at that, and modify accordingly.
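
As a hedged sketch of that approach: the subroutine below (the name
MPTEST and its arguments are made up for illustration) takes a
scalar, an array, and an output scalar. Compiling it with the -S
option, for example "g77 -S mptest.f" or "gfortran -S mptest.f",
writes the generated assembly to mptest.s instead of producing an
object file. The external symbol name (often the routine name with
an underscore appended) and the exact code will depend on the
compiler version and target.

```fortran
! mptest.f: a throwaway routine whose only purpose is to show, in
! the generated mptest.s, how the arguments are reached.
      subroutine mptest(n, a, s)
      integer n, a(n), s
      integer i
! Sum the array so the generated code has to load N through its
! address, index into A, and store the result through S.
      s = 0
      do i = 1, n
         s = s + a(i)
      end do
      end
```

Compiling without optimization first tends to give assembly that is
easier to match against the source.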

NimbUs

Jun 21, 2017, 5:20:05 PM

Herrmanns Feldt wrote as news:d1831807-fb43-4e71-b1c9-
01175b...@googlegroups.com:

Indeed, shall do. I had been hoping to profit from earlier
expertise of others in the form of an example or template
though !

Re. your other reply in thread, operations on (very) long
integers are not efficiently (for speed) programmed in
Fortran, primarily because the language does not have unsigned
integer types AFAIK, and does not grant us access to CPU
arithmetic registers and flags (carry, overflow). Maybe the
existing libraries you have in mind were coded, themselves, in
a more low level language than Fortran, like C ?

There is a second reason I want to have my own "long
arithmetic", is because the application I have in mind needs
only a handful of well defined specialized operations - not
the general range of arithmetic - and I can hand craft optimal
algorithms for those operations that will be faster than any
general purpose library however clever it is.

> But otherwise, the usual calling sequence is addresses of
> each argument on the stack. The two common methods are
> called STDCALL and CDECL, where the former uses the optional
> ability of RET to pop the arguments off the stack, and the
> latter assumes that the caller will pop them off. (C allows
> for variable number of arguments, such that one can't depend
> on the count being constant.)
> For 32 bit code, they will be 32 bit addresses.

Yes, I am aware of the general notions of call conventions
used by "high level" languages, I'm more looking for the
details. I'm afraid I'll have to peek at compiler output and
do some trial/error then...

Thanks

--
NimbUs

herrman...@gmail.com

Jun 21, 2017, 7:34:39 PM

On Wednesday, June 21, 2017 at 2:20:05 PM UTC-7, NimbUs wrote:

(snip)

> Re. your other reply in thread, operations on (very) long
> integers are not efficiently (for speed) programmed in
> Fortran, primarily because the language does not have unsigned
> integer types AFAIK, and does not grant us access to CPU
> arithmetic registers and flags (carry, overflow). Maybe the
> existing libraries you have in mind were coded, themselves, in
> a more low level language than Fortran, like C ?

To do it in Fortran or C, you make an array of integers, each
holding a value half the width of the largest you can operate on
normally.

There is a slight complication that comes from using signed
(twos complement) integers, but that is fixable.

There is also the assumption that operations follow normal
twos complement arithmetic, including wrap on overflow.

> There is a second reason I want to have my own "long
> arithmetic", is because the application I have in mind needs
> only a handful of well defined specialized operations - not
> the general range of arithmetic - and I can hand craft optimal
> algorithms for those operations that will be faster than any
> general purpose library however clever it is.

For add and subtract, you go down the array from right to left,
do the operation on each element, then propagate the carry.
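
A minimal sketch of that loop, assuming limbs in base 2**16 stored
one per default 32-bit integer with the most significant limb first.
The name MPADD and the argument list are made up for illustration.
Because each limb uses only half the word, the sum of two limbs plus
a carry never overflows, so no wrap-around behaviour is assumed.

```fortran
! Hypothetical helper: c = a + b for n-limb numbers, base 2**16.
! Limb 1 is the most significant, so the loop runs right to left.
! carry returns the carry out of the top limb (0 or 1).
      subroutine mpadd(n, a, b, c, carry)
      integer n, a(n), b(n), c(n), carry
      integer i, t
      carry = 0
      do i = n, 1, -1
! Each limb is below 65536, so t is at most 131071 and cannot
! overflow a 32-bit integer.
         t = a(i) + b(i) + carry
         c(i) = mod(t, 65536)
         carry = t / 65536
      end do
      end
```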

Dividing by a small enough integer is easy, as you go down the
array left to right, and propagate the remainder to the next
element.
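
The same layout gives a short sketch of division by a small
integer. The name MPDIVS is made up for illustration, and the
restriction d < 32768 is an assumption chosen so that the
intermediate value fits in a signed 32-bit integer.

```fortran
! Hypothetical helper: divide the n-limb number a (base 2**16,
! limb 1 most significant) in place by d, with 0 < d < 32768.
! r returns the remainder.
      subroutine mpdivs(n, a, d, r)
      integer n, a(n), d, r
      integer i, t
      r = 0
      do i = 1, n
! r < d, so t < d * 65536, which stays below 2**31 for d < 32768.
         t = r * 65536 + a(i)
         a(i) = t / d
         r = mod(t, d)
      end do
      end
```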

That is enough to compute pi or e to many digits, those being
the favorites.

herrman...@gmail.com

Jun 21, 2017, 7:43:28 PM

On Wednesday, June 21, 2017 at 2:20:05 PM UTC-7, NimbUs wrote:

(snip)

> There is a second reason I want to have my own "long
> arithmetic", is because the application I have in mind needs
> only a handful of well defined specialized operations - not
> the general range of arithmetic - and I can hand craft optimal
> algorithms for those operations that will be faster than any
> general purpose library however clever it is.

A few years ago, I computed pi to 5000 digits on an IBM 360/20
(in about a day), and e to 7000 digits.

Looking for a source to check my e calculation, I found that
someone had computed one million digits on a VAX(sic) Alpha.

(Note that VAX and Alpha are two different processors sold
by DEC.)

After some e-mail, it turned out that he thought that VAX
was the company that made Alpha, but then sent me the Fortran
program to do the computation.

I then used it to compute one million digits of e on an actual
VAX 11/785 in a few days.

There are some additional inefficiencies.

First, the calculation is done in a base that is a power
of 10, avoiding a final binary to decimal conversion.
With VAX Fortran having 32 bit integer as its largest, there
is a limit on how large the value for each array element can be.

I could have done it in Macro-32 (VAX assembler) but it was
an interesting enough problem in Fortran, and I had plenty
of time to do it. There wasn't much else being done on the
machine.

herrman...@gmail.com

Jun 21, 2017, 7:49:01 PM

On Wednesday, June 21, 2017 at 2:20:05 PM UTC-7, NimbUs wrote:

(snip)

> Indeed, shall do. I had been hoping to profit from earlier
> expertise of others in the form of an example or template
> though !

By the way, you might also find the book:

"Introduction to Precise Numerical Methods"

interesting. Not only does it do high precision arithmetic,
but also keeps track of the precision of the result.
(That is, any accumulated rounding errors.)

More traditional is to compute plenty of extra digits, and use
a simple formula for average round-off.

When adding and subtracting values rounded to integers,
the expected error is half sqrt(N), where N is the number
of numbers added/subtracted.

NimbUs

Jun 22, 2017, 5:08:05 AM

Herrmanns Feldt wrote in news:bf591a86-2642-4e24-b16a-
50f0be...@googlegroups.com:

> By the way, you might also find the book:

> "Introduction to Precise Numerical Methods"

> interesting. Not only does it do high precision arithmetic,
> but also keeps track of the precision of the result.
> (That is, any accumulated rounding errors.)

> More traditional is to compute plenty of extra digits, and
> use a simple formula for average round-off.

(snipping)

Thank you but I have some mathematical and computational
experience and formal training going back, well, half a
century - including numerical analysis and algorithms. I did
my first own PI calculation even earlier on an IBM 1620 coded
in assembly(SPS), computing 1000 digits took 25 hours IIRC
(!). I'm /not/ attempting PI records any more :=) The point of
the thread is the specifics of interfacing G77 Fortran to X86
assembler.

--
N.

herrman...@gmail.com

Jun 22, 2017, 5:19:45 PM

On Thursday, June 22, 2017 at 2:08:05 AM UTC-7, NimbUs wrote:
> Herrmanns Feldt wrote in news:bf591a86-2642-4e24-b16a-
> 50f0be...@googlegroups.com:

> > By the way, you might also find the book:

> > "Introduction to Precise Numerical Methods"

> > interesting. Not only does it do high precision arithmetic,
> > but also keeps track of the precision of the result.
> > (That is, any accumulated rounding errors.)

> > More traditional is to compute plenty of extra digits,
> > and use a simple formula for average round-off.

> (snipping)

> Thank you but I have some mathematical and computational
> experience and formal training going back, well, half a
> century - including numerical analysis and algorithms.

Oops, I wasn't intending to imply otherwise.

For one, the book is interesting in its treatment of
precision. But also, it includes a disk of programs
that you can try yourself.

You might find it interesting, even if it isn't at
all useful for your current project.

> I did
> my first own PI calculation even earlier on an IBM 1620 coded
> in assembly(SPS), computing 1000 digits took 25 hours IIRC
> (!). I'm /not/ attempting PI records any more :=) The point of
> the thread is the specifics of interfacing G77 Fortran to X86
> assembler.

Some will wonder about the interest in g77 instead of the more
usual gfortran. I believe that the calling sequence for Fortran 77
code compiled with gfortran is the same, but am not quite sure
enough without looking it up.

As mentioned, there are two common choices, STDCALL and CDECL.

CDECL is convenient when calling C routines, and I believe
is (mostly) used by g77 and gfortran.

I have mostly only written one x86 assembler routine in about
the last 10 years, (but written it about 10 times), which is
a routine to return the value of the time stamp counter, RDTSC.

It turns out that 32 bit code uses EDX:EAX to return 64 bit
function values, such that the instructions

rdtsc
ret

are enough for a function declared to return a 64 bit integer.
That, and whatever else the assembler needs is all.

(It is more complicated in 64 bit code, where the function
result is returned in rax, but rdtsc still stores it in edx:eax.
I did figure that out once.)

But mostly, I was never very good at writing x86 assembly.

For S/360 assembly, I can mostly write down the instructions,
then after looking up some opcodes, enter it directly into
memory, with a good chance of it working. Instruction formats
are nice and regular, and otherwise obvious.

Not so easy with x86, so I try to minimize what I do
in assembly, and call it from Fortran or C. Instruction formats
still have some left over from the 8080, and then with each new
generation they squeeze into the available opcode space.
Only certain registers are allowed for some operations, and you
have to know all the different ways that those work.

For 32 bit code, index up from ESP (or copy it to EBP) to find
the address of each argument. Load that address into an appropriate
register, then load or store through it for scalars. For arrays, it
is the address of the start of the array, so compute an offset to
index array elements.

Gary Scott

Jun 22, 2017, 8:41:50 PM

i've often noted that assembly for older processors seemed more
logically designed and straightforward than x86. I suppose they were
severely constrained or something. I liked to program in 360/370, DG
Nova, and Datacraft/Harris/Concurrent H.

herrman...@gmail.com

Jun 23, 2017, 2:55:55 AM

On Thursday, June 22, 2017 at 5:41:50 PM UTC-7, Gary Scott wrote:

(snip, I wrote)

> > For S/360 assembly, I can mostly write down the instructions,
> > then after looking up some opcodes, enter it directly into
> > memory, with a good chance of it working. Instruction formats
> > are nice and regular, and otherwise obvious.

> > Not so easy with x86, so I try to minimize what I do
> > in assembly, and call it from Fortran or C. Instruction formats
> > still have some left over from the 8080, and then with each new
> > generation they squeeze into the available opcode space.
> > Only certain registers are allowed for some operations, and you
> > have to know all the different ways that those work.

(snip)

> i've often noted that assembly for older processors seemed more
> logically designed and straightforward than x86. I suppose they were
> severely constrained or something. I liked to program in 360/370, DG
> Nova, and Datacraft/Harris/Concurrent H.

For the computers in the 1950's, they had to simplify the hardware,
which mostly resulted in fairly simple processors. One accumulator,
one MQ register to help with multiply and divide, maybe some index
registers.

In the 1960's and 1970's, mainframe computers, such as the PDP-10
and IBM S/360 got more registers, more addressing modes, and were
otherwise designed to make assembly programming easier. Compilers
and operating systems were mostly in assembler, so ease of
programming was important. About this time, minicomputers
started to appear similar to earlier mainframes, and then
microcomputers similar to earlier minicomputers.

Also, IBM S/360, using microcode, was able to have a fairly
high level user instruction set, with lower models using
simpler hardware to run slower, but still supply the full
instruction set.

The VAX, leading the superminicomputer days, had an instruction
set designed even more for assembly programmers, unfortunately
just at the time that less and less was written in assembly.

The change to RISC happened when enough code was being written in
high-level languages, with compilers that can easily find a good
sequence of simpler instructions to do the work of a complex one.
Ease of human assembly programming is no longer important.

But then there is the 8080. Similar to earlier minicomputers,
though maybe 8 bit instead of 16, the 8080 isn't so hard to
program. When an improved version was needed, it was convenient
to allow for assembler source compatibility. A fairly simple
program can convert 8080 assembler source to 8086 assembler
source. Then came the 80286, 80386, and 80486, extending the
instruction set using unused opcodes, with new modes changing
the way older instructions worked. If the 80486 had been
designed without the need to be compatible with earlier
processors, it might have been very different.

Around this time, IBM S/370 was extended to XA/370 and
then ESA/390, mostly without the inconveniences that went
into the 486. Many instructions added to ESA/390 were still
to make assembly programming easier. (Though dealing with the
complications of 24 bit and 31 bit addressing modes isn't always
so easy.)

Oh well.

campbel...@gmail.com

Jun 23, 2017, 5:25:12 AM

On Friday, June 23, 2017 at 7:19:45 AM UTC+10, herrman...@gmail.com wrote:
> I have mostly only written one x86 assembler routine in about
> the last 10 years, (but written it about 10 times), which is
> a routine to return the value of the time stamp counter, RDTSC.
>

I'd be interested in 64bit gFortran (Windows) code for RDTSC, if you have it available.
It is my impression that QueryPerformanceCounter = RDTSC/1024, ie a 10-bit shift

Another related question for recent Intel i7 processors; when they appear to increase their clock rate (as reported in task manager), does the RDTSC rate change ?

James Van Buskirk

Jun 23, 2017, 8:13:19 AM

wrote in message
news:2e8d296d-09c1-4f98...@googlegroups.com...

> On Friday, June 23, 2017 at 7:19:45 AM UTC+10, herrman...@gmail.com wrote:
> > I have mostly only written one x86 assembler routine in about
> > the last 10 years, (but written it about 10 times), which is
> > a routine to return the value of the time stamp counter, RDTSC.

> I'd be interested in 64bit gFortran (Windows) code for RDTSC, if you
> have it available.
> It is my impression that QueryPerformanceCounter = RDTSC/1024, ie a
> 10-bit shift

The code is just
RDTSC
SHL RDX, 32
OR RAX, RDX
RET

But gfortran doesn't have an inline assembler so I wrote code that
pokes the appropriate machine code into memory and creates
procedure pointers to it. Posted at

https://groups.google.com/d/msg/comp.lang.fortran/G5B-O3tvNGE/D-xgSru6KrUJ

NimbUs

Jun 23, 2017, 11:43:56 AM

herrmannsfeldt@g... wrote:

> Oops, I wasn't intending to imply otherwise.

Absolutely zero offence taken, was only trying to pull the
thread back to its original intent.
(...)

> Some will wonder about the interest in g77 instead of the more
> usual gfortran. I believe that the calling sequence for Fortran 77
> code compiled with gfortran is the same, but am not quite sure

Simply, I wanted a reasonably modern compiler that still can
compile the old style FORTRAN (capitals intended!) dialect
which I knew quite well in the old days and can remember off
my aging head.

> As mentioned, there are two common choices, STDCALL and CDECL.
> CDECL is convenient when calling C routines, and I believe
> is (mostly) used by g77 and gfortran.

OK, I think I have a grasp of the calling convention used by
GNU (and possibly other) Fortran compilers for X32 targets.
Just a technicality, before I can really "trial" my
understanding of it, what -link- (object) format do GNU
compilers produce, and linkers understand ? My favorite X86
ASM is TASM, which outputs old Intel/Microsoft "OMF" object
files.
Will the GNU linker accept OMF modules, or do I need "COFF"
instead, or some other format altogether ? If using TASM is
not possible or practical, which assembler would you suggest ?
I've seen NASM mentioned, which I'd prefer to avoid. Also no
ATT-like syntax for me, IF at all avoidable :=)

Does FASM sound like the right tool ?

Thanks again, regards

--
NimbUs

herrman...@gmail.com

Jun 23, 2017, 4:09:43 PM

On Friday, June 23, 2017 at 8:43:56 AM UTC-7, NimbUs wrote:

(snip, I wrote)

> > Some will wonder about the interest in g77 instead
> > of the more usual gfortran.
> > I believe that the calling sequence for Fortran 77
> > code compiled with gfortran is the same, but am not quite
> > sure

> Simply, I wanted a reasonably modern compiler that still can
> compile the old style FORTRAN (capitals intended!) dialect
> which I knew quite well in the old days and can remember off
> my aging head.

For some years, my favorite C and Fortran compilers for
DOS/Windows (and OS/2) have been the Watcom compilers.

For some time, I would even use the Watcom linker with
MS compilers. Among others, it has a nicer overlay feature
than the MS linker.

Otherwise, I believe that the usual Fortran 77 code is
still accepted by gfortran. There are a small number of
features removed from Fortran 77, that may or may not
have been removed from gfortran. One that I actually use
once in a while is REAL variables in DO loops. Yes you
have to be careful when you use them, but in those cases
where it comes up, you usually know that anyway.

(snip on STDCALL and CDECL)

> OK, I think I have a grasp of the calling convention used by
> GNU (and possibly other) Fortran compilers for X32 targets.
> Just a technicality, before I can really "trial" my
> understanding of it, what -link- (object) format do GNU
> compilers produce, and linkers understand ? My favorite X86
> ASM is TASM, which outputs old Intel/Microsoft "OMF" object
> files.

As above, I recommend the Watcom linker.

> Will the GNU linker accept OMF modules, or do I need "COFF"
> instead, or some other format altogether ? If using TASM is
> not possible or practical, which assembler would you suggest ?
> I've seen NASM mentioned, which I'd prefer to avoid. Also no
> ATT-like syntax for me, IF at all avoidable :=)

(snip)

herrman...@gmail.com

Jun 23, 2017, 4:16:25 PM

On Friday, June 23, 2017 at 5:13:19 AM UTC-7, James Van Buskirk wrote:
> wrote in message
> news:2e8d296d-09c1-4f98...@googlegroups.com...

(snip on 64 bit RDTSC)

> The code is just
> RDTSC
> SHL RDX, 32
> OR RAX, RDX
> RET

It was some years ago, but I suspect that I might have
cleared RAX first. Does RDTSC clear the high half?

> But gfortran doesn't have an inline assembler so I wrote code that
> pokes the appropriate machine code into memory and creates
> procedure pointers to it.

I did it as an external function, conveniently named rdtsc.
(I believe lower case for calling from C.)

I even used it called from Java, using JNI, and it worked
well. I wasn't sure if there might be other effects from
JRE, but it seemed to work fine when I needed it.

RDTSC is supposed to count clock cycles, so the count rate
should change with the clock rate. On the other hand, when
determining the execution time (cycles) for some code, it
should be independent of clock rate. If I want to find where
the bottleneck is, it works well.

campbel...@gmail.com

Jun 24, 2017, 3:19:13 AM

Hi James,

Thanks for your suggestion, although I find the reference a bit complex.
I was hoping for code for: INTEGER*8 FUNCTION RDTSC_tick ()
I would like to run it with 64-bit gFortran from http://www.equation.com, which
I have found to be a fairly robust windows version of the gFortran compiler.

I suspect that it is contained in the code you referenced, but I would prefer something more concise.

There are two other questions you may be able to reply to:

1) Does RDTSC need some initialisation ? A version of RDTSC interface I have for ifort (2013) does use an initialisation code. I also use QueryPerformanceCounter so suspect RDTSC is being initialised via these calls.

2) I am puzzled by the speed up of the processor in my i7-6700HQ or i7-4790K. ( as reported by task manager) SYSTEM_CLOCK appears to provide accurate clock times, but I am wondering about RDTSC and also QueryPerformanceCounter, which appear to be RDTSC / 1024.

The cycle rate of RDTSC is also a problem, which I approximate by using QueryPerformanceFrequency * 1024

For Silverfrost FTN95 /64, RDTSC_VAL@() is a 64 bit replacement for REAL*10 CPU_CLOCK@ and appears to work for 32 bit also. As INTEGER*8, I find this is a better option.

RDTSC rate is not documented and appears to be the processor rate. My approach for RDTSC_tick_rate works on all pc's I have available to test.
(I do not know what happens on processors that have a turbo boost or are over-clocked. If someone can test this I would like to know)
http://forums.silverfrost.com/viewtopic.php?t=3271&postdays=0&postorder=asc&start=0

If you are able to comment, it would be appreciated,

regards,

John

herrman...@gmail.com

Jun 24, 2017, 3:49:43 AM

On Saturday, June 24, 2017 at 12:19:13 AM UTC-7, campbel...@gmail.com wrote:
> On Friday, June 23, 2017 at 10:13:19 PM UTC+10, James Van Buskirk wrote:

(snip)

> I suspect that it is contained in the code you referenced,
> but I would prefer something more concise.

Two instructions in 32 bit mode is about as small as you
can make a function do something useful.

In 32 bit mode, functions return 64 bit values in EDX:EAX,
that is, the high bits in EDX and low bits in EAX, as
returned by RDTSC.

In 64 bit mode, the function result is in rax, but RDTSC
still returns it in EDX:EAX.

It seems, as I asked before, that in 64 bit mode it does
clear the upper half of rax and rdx. I think when I did
this (some years ago), I wasn't sure and cleared them
to be sure.

> There are two other questions you may be able to reply to:

> 1) Does RDTSC need some initialisation ? A version of
> RDTSC interface I have for ifort (2013) does use an
> initialisation code. I also use QueryPerformanceCounter
> so suspect RDTSC is being initialised via these calls.

The usual RDTSC starts counting from zero when the processor
starts. With multicore systems, there is a question about
synchronizing different cores. But the OS, not you, should
do that.

> 2) I am puzzled by the speed up of the processor in my
> i7-6700HQ or i7-4790K. ( as reported by task manager)
> SYSTEM_CLOCK appears to provide accurate clock times,
> but I am wondering about RDTSC and also
> QueryPerformanceCounter, which appear to be RDTSC / 1024.

Earlier processors counted clock cycles. After they
started building varying clock rate chips, it still
counted clock cycles, but somewhat later, they switched to a
counter independent of clock rate.

I suspect the ones I remember using were the earlier
ones.

> The cycle rate of RDTSC is also a problem, which I
> approximate by using QueryPerformanceFrequency * 1024

> For Silverfrost FTN95 /64, RDTSC_VAL@() is a 64 bit
> replacement for REAL*10 CPU_CLOCK@ and appears to work
> for 32 bit also. As INTEGER*8, I find this is a
> better option.

I only use it for relative timing, to find out which
parts of a program are using a big fraction of the time.

> RDTSC rate is not documented and appears to be the
> processor rate. My approach for RDTSC_tick_rate works
> on all pc's I have available to test.

> (I do not know what happens on processors that have
> a turbo boost or are over-clocked. If someone can test
> this I would like to know)

As above, older processors counted the variable clock,
newer ones count a fixed rate clock.

Thomas Koenig

Jun 24, 2017, 4:25:29 AM

James Van Buskirk <not_...@comcast.net> schrieb:

> But gfortran doesn't have an inline assembler so I wrote code that
> pokes the appropriate machine code into memory and creates
> procedure pointers to it.

Ouch :-)

You would be much better off creating your own assembly file and
leaving the assembly and linking to the gfortran command.
You can just do something like

$ gfortran foo.s main.f90

and it will do the right thing. Translate an example foo.f90 with
the -S option to see what the assembly looks like.

An alternative would be to use inline assembly, which is an
extension that gcc supports for C, and call the function via
C interop.

Thomas Koenig

Jun 24, 2017, 5:39:42 AM

NimbUs <nim...@xxx.invalid> schrieb:

> Simply, I wanted a reasonably modern compiler that still can
> compile the old style FORTRAN (capitals intended!) dialect
> which I knew quite well in the old days and can remember off
> my aging head.

One nice thing about Fortran is its backwards compatibility.
Even things that were deleted from the standard tend to be
supported for backwards compatibility.

gfortran also has quite a few extensions, including
Cray pointers and DEC extensions (since 7.1).

I'd say gfortran is a much more sensible choice than g77.

Thomas Koenig

Jun 24, 2017, 5:53:08 AM

James Van Buskirk <not_...@comcast.net> schrieb:

> The code is just
> RDTSC
> SHL RDX, 32
> OR RAX, RDX
> RET

> But gfortran doesn't have an inline assembler so I wrote code that
> pokes the appropriate machine code into memory and creates
> procedure pointers to it.

As an addendum, here is what I would prefer:

$ cat rdtsc.s
        .file   "rdtsc.s"
        .text
        .globl  rdtsc_
        .type   rdtsc_, @function
rdtsc_:
.LFB0:
        .cfi_startproc
        rdtsc
        shl     $32, %edx
        or      %edx, %eax
        ret
        .cfi_endproc
.LFE0:
        .size   rdtsc_, .-rdtsc_
        .section        .note.GNU-stack,"",@progbits
$ cat main.f90
program main
  interface
     function rdtsc()
       integer(kind=8) :: rdtsc
     end function rdtsc
  end interface

  integer(kind=8) :: t1, t2
  t1 = rdtsc()
  print *,"Hello, world!"
  t2 = rdtsc()
  print *,t2-t1, "Cycles used."
end program
$ gfortran rdtsc.s main.f90 && ./a.out
Hello, world!
83979 Cycles used.

Jos Bergervoet

Jun 24, 2017, 7:10:49 AM

Can you explain why I get sometimes a negative number?
Printing t1, t2, t2-t1 shows this:
Hello, world!
569114599 569110223 -4376 Cycles used.

Repeating, numbers like 253934, 12024, 3856, 288776, -2180
are found for t2-t1. (Consistently always even!)

And 'cat /proc/cpuinfo' starts like this: (there are 40
in total)

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
stepping : 4
microcode : 1064
cpu MHz : 2999.825
cache size : 25600 KB
physical id : 0
siblings : 20
core id : 0
cpu cores : 10
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb
xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep
erms
bogomips : 5999.65
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
...
...
processor : 39
...

--
Jos

Thomas Koenig

Jun 24, 2017, 7:52:01 AM

Jos Bergervoet <jos.ber...@xs4all.nl> schrieb:

The simple reason is that I got the assembly wrong for 64-bit
registers.

Here is a version that _should_ work for 64 bits:

$ cat rdtsc.s
        .file   "rdtsc.s"
        .text
        .globl  rdtsc_
        .type   rdtsc_, @function
rdtsc_:
.LFB0:
        .cfi_startproc
        rdtsc                   # low 32 bits in %eax, high 32 in %edx

        shl     $32, %rdx       # move the high half into bits 63..32
        or      %rdx, %rax      # combine; %rax holds the result

        ret
        .cfi_endproc
.LFE0:
        .size   rdtsc_, .-rdtsc_
        .section        .note.GNU-stack,"",@progbits
$ cat main.f90
program main
  interface
     function rdtsc()
       integer(kind=8) :: rdtsc
     end function rdtsc
  end interface

  integer(kind=8) :: t1, t2
  t1 = rdtsc()
  print *,"Hello, world!"
  t2 = rdtsc()

  print *,t1, t2, t2-t1, "Cycles used."
end program
$ !gf
gfortran -g -O3 rdtsc.s main.f90
$ ./a.out
Hello, world!
193714148621315 193714148698679 77364 Cycles used.
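
The difference between the two listings can be sketched in Fortran.
RDTSC leaves the low half of the counter in EAX and the high half in
EDX, and the 64-bit value is high * 2**32 + low with the low half
treated as unsigned. The function name JOIN64 is made up for
illustration, and integer kinds 4 and 8 are assumed to be 32 and 64
bits wide, as they are in gfortran. With 32-bit register names, x86
masks the shift count of 32 down to 0, so the first listing
effectively ORed the two halves together into a 32-bit value, which
would explain the occasional negative differences.

```fortran
! Hypothetical helper: build the 64-bit counter value from the two
! 32-bit halves that RDTSC delivers in EDX (hi) and EAX (lo).
      function join64(hi, lo)
      integer(kind=8) join64
      integer(kind=4) hi, lo
      integer(kind=8) mask
      parameter (mask = 4294967295_8)
! Widen both halves first; mask the low half so that a set top bit
! is not sign-extended into the upper 32 bits.
      join64 = ior(ishft(int(hi, 8), 32), iand(int(lo, 8), mask))
      end
```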

Jos Bergervoet

Jun 24, 2017, 8:03:23 AM

Indeed, now reporting 69824, 71044, 63092, 62356, 77432, etc.

--
Jos

James Van Buskirk

Jun 24, 2017, 7:57:40 PM

wrote in message
news:3949f35f-fb8c-4f97...@googlegroups.com...

> On Friday, June 23, 2017 at 10:13:19 PM UTC+10, James Van Buskirk wrote:
> > wrote in message
> > news:2e8d296d-09c1-4f98...@googlegroups.com...

> > > I'd be interested in 64bit gFortran (Windows) code for RDTSC, if you
> > > have it available.
> > > It is my impression that QueryPerformanceCounter = RDTSC/1024, ie a
> > > 10-bit shift

> > The code is just
> > RDTSC
> > SHL RDX, 32
> > OR RAX, RDX
> > RET

> > But gfortran doesn't have an inline assembler so I wrote code that
> > pokes the appropriate machine code into memory and creates
> > procedure pointers to it. Posted at

> > https://groups.google.com/d/msg/comp.lang.fortran/G5B-O3tvNGE/D-xgSru6KrUJ

> Thanks for your suggestion, although I find the reference a bit complex.
> I was hoping for code for: INTEGER*8 FUNCTION RDTSC_tick ()
> I would like to run it with 64-bit gFortran from http://www.equation.com,
> which I have found to be a fairly robust windows version of the gFortran
> compiler.

Some of the complexity is there because there are multiple functions
included in the example, and some of it is due to the use of ambivalent
machine code that includes a branch to select a 32- or 64-bit version as
well as both versions.

I pared it down a bit to include only RDTSC, and did the selection between
32- or 64-bit code in the Fortran part, so the machine code is now much more
readable. Also I used an initializer for my function pointer so that on the
first invocation the code is poked into memory and the procedure pointer
is pointed at the poked code which is then invoked so that the user doesn't
have to initiate that step himself. Unfortunately ifort 16.0 can't handle
that initializer so I was only able to test with gfortran (both 32- and
64-bits). Can someone confirm that it works with more recent ifort?
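
The first-call trick described here can be tried in isolation, without any
machine code. What follows is only a sketch of the pattern, not the rdtsc
module itself: the names lazy_mod, next_count, first_call and real_count
are made up for illustration, and a plain counter stands in for the timer.
Like the real thing it relies on initializing a procedure pointer to a
module procedure, so a compiler that rejects that initializer will
presumably reject this too.

```fortran
module lazy_mod
   implicit none
   private
   abstract interface
      function counter_iface() result(n)
         integer :: n
      end function counter_iface
   end interface
   ! The pointer starts out aimed at the one-time setup routine
   procedure(counter_iface), pointer, public :: next_count => first_call
   ! Number of times the setup routine has run (for demonstration only)
   integer, public :: init_calls = 0
   integer :: calls_served = 0
contains
   ! Runs on the first invocation only: does the setup, re-points the
   ! pointer at the real function, then forwards the call to it
   function first_call() result(n)
      integer :: n
      init_calls = init_calls + 1
      next_count => real_count
      n = next_count()
   end function first_call

   ! The function every later call reaches directly
   function real_count() result(n)
      integer :: n
      calls_served = calls_served + 1
      n = calls_served
   end function real_count
end module lazy_mod

program lazy_demo
   use lazy_mod
   implicit none
   integer :: a, b, c
   a = next_count()   ! goes through first_call
   b = next_count()   ! goes straight to real_count
   c = next_count()
   print '(*(g0,1x))', a, b, c, init_calls
end program lazy_demo
```

This should print 1 2 3 1: three calls served, but the setup routine ran
only once.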

D:\gfortran\clf\rdtsc>type hello.f90
module rdtsc_mod
use ISO_C_BINDING
implicit none
! We will not export anything but the pointer to the rdtsc function
private
! Interface for rdtsc function
abstract interface
function rdtsc_iface() bind(C)
import
implicit none
integer(C_INT64_T) rdtsc_iface
end function rdtsc_iface
end interface
! Define pointer to rdtsc function and initialize to point
! at initialization function
procedure(rdtsc_iface), pointer, public :: rdtsc => rdtsc_init
! Constants required for VirtualAlloc and VirtualProtect
integer(C_INT32_T), parameter :: &
MEM_COMMIT = int(Z'00001000',C_INT32_T), &
MEM_RESERVE = int(Z'00002000',C_INT32_T), &
PAGE_READWRITE = int(Z'04',C_INT32_T), &
PAGE_EXECUTE = int(Z'10',C_INT32_T)
! Interfaces for VirtualAlloc, GetLastError, and VirtualProtect
interface
function VirtualAlloc(lpAddress,dwSize,flAllocationType, &
flProtect) bind(C,name='VirtualAlloc')
import
implicit none
!DEC$ ATTRIBUTES STDCALL :: VirtualAlloc
!GCC$ ATTRIBUTES STDCALL :: VirtualAlloc
type(C_PTR) VirtualAlloc
type(C_PTR), value :: lpAddress
integer(C_SIZE_T), value :: dwSize
integer(C_INT32_T), value :: flAllocationType
integer(C_INT32_T), value :: flProtect
end function VirtualAlloc

function GetLastError() bind(C,name='GetLastError')
import
implicit none
!DEC$ ATTRIBUTES STDCALL :: GetLastError
!GCC$ ATTRIBUTES STDCALL :: GetLastError
integer(C_INT32_T) GetLastError
end function GetLastError

function VirtualProtect(lpAddress,dwSize,flNewProtect, &
lpflOldProtect) bind(C,name='VirtualProtect')
import
implicit none
!DEC$ ATTRIBUTES STDCALL :: VirtualProtect
!GCC$ ATTRIBUTES STDCALL :: VirtualProtect
integer(C_INT32_T) VirtualProtect
type(C_PTR), value :: lpAddress
integer(C_SIZE_T), value :: dwSize
integer(C_INT32_T), value :: flNewProtect
integer(C_INT32_T) :: lpflOldProtect
end function VirtualProtect
end interface
contains
! Initialization procedure for rdtsc. It will be called on the
! first invocation of rdtsc and sets up our real rdtsc function
function rdtsc_init() bind(C)
integer(C_INT64_T) rdtsc_init
! Machine code for 32-bit function
integer(C_INT8_T), target :: BAD_STUFF_32(3)
data BAD_STUFF_32 / &
Z'0F', Z'31', & ! rdtsc
Z'C3' / ! ret
! Machine code for 64-bit function
integer(C_INT8_T), target :: BAD_STUFF_64(10)
data BAD_STUFF_64 / &
Z'0F', Z'31', & ! rdtsc
Z'48', Z'C1', Z'E2', Z'20', & ! shl rdx, 32
Z'48', Z'09', Z'D0', & ! or rax, rdx
Z'C3' / ! ret
! Address the OS allocates for our function via VirtualAlloc
type(C_PTR) rdtsc_address
! Last error code
integer(C_INT32_T) last
! Fortran pointer to write our function to
integer(C_INT8_T), pointer :: rdtsc_code(:)
! Error status from VirtualProtect
integer(C_INT32_T) status
! Variable to store old memory protection code
integer(C_INT32_T) OldProtect

! Get writable address from OS to put our function in
! Need different sizes for 32- and 64-bit modes
if(bit_size(0_C_INTPTR_T) == 32) then
rdtsc_address = VirtualAlloc( &
lpAddress = C_NULL_PTR, &
dwSize = size(BAD_STUFF_32,kind=C_SIZE_T), &
flAllocationType = ior(MEM_COMMIT,MEM_RESERVE), &
flProtect = PAGE_READWRITE)
else
rdtsc_address = VirtualAlloc( &
lpAddress = C_NULL_PTR, &
dwSize = size(BAD_STUFF_64,kind=C_SIZE_T), &
flAllocationType = ior(MEM_COMMIT,MEM_RESERVE), &
flProtect = PAGE_READWRITE)
end if
! If something goes wrong, print out error code and abort
if(.NOT.C_ASSOCIATED(rdtsc_address)) then
last = GetLastError()
write(*,'(*(g0))') &
'rdtsc_init failed in VirtualAlloc with code ', last
stop
end if
! Get Fortran pointer to allocated memory and poke our
! function into it. Then mark it as executable. Need
! to poke in code appropriate to address size.
if(bit_size(0_C_INTPTR_T) == 32) then
call C_F_POINTER(rdtsc_address,rdtsc_code, &
[size(BAD_STUFF_32)])
rdtsc_code = BAD_STUFF_32
status = VirtualProtect( &
lpAddress = rdtsc_address, &
dwSize = size(BAD_STUFF_32,kind=C_SIZE_T), &
flNewProtect = PAGE_EXECUTE, &
lpflOldProtect = OldProtect)
else
call C_F_POINTER(rdtsc_address,rdtsc_code, &
[size(BAD_STUFF_64)])
rdtsc_code = BAD_STUFF_64
status = VirtualProtect( &
lpAddress = rdtsc_address, &
dwSize = size(BAD_STUFF_64,kind=C_SIZE_T), &
flNewProtect = PAGE_EXECUTE, &
lpflOldProtect = OldProtect)
end if
! If something goes wrong, print out error code and abort
if(status == 0) then
last = GetLastError()
write(*,'(*(g0))') &
'rdtsc_init failed in VirtualProtect with code ', last
stop
end if
! Point the function pointer at the function we just poked into memory
call C_F_PROCPOINTER(transfer(rdtsc_address,C_NULL_FUNPTR), &
rdtsc)
! We still have to return the TSC value for transparency
rdtsc_init = rdtsc()
end function rdtsc_init
end module rdtsc_mod

program hello
use rdtsc_mod
use ISO_C_BINDING, only: C_INT64_T
implicit none
integer(C_INT64_T) t0, tf

t0 = rdtsc()
write(*,'(*(g0))') 'Hello, world'
tf = rdtsc()
write(*,'(*(g0))') 'Time = ',tf-t0
t0 = rdtsc()
tf = rdtsc()
write(*,'(*(g0))') 'Time = ',tf-t0
endprogram hello

D:\gfortran\clf\rdtsc>gfortran -fno-range-check hello.f90 -ohello

D:\gfortran\clf\rdtsc>hello
Hello, world
Time = 438947
Time = 36
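
A side note on the -fno-range-check flag in the compile line above: it is
presumably needed because byte values such as Z'C3' (195) do not fit the
signed range of integer(C_INT8_T). One possible way to do without the flag,
sketched here under that assumption, is to hold the bytes as default
integers and fold them into the signed range explicitly. The names byte_mod
and to_byte are hypothetical and not part of the posted module.

```fortran
module byte_mod
   use iso_c_binding, only: c_int8_t
   implicit none
contains
   ! Map an unsigned byte value 0..255 onto the signed 8-bit range,
   ! keeping the same bit pattern (195 becomes -61, i.e. Z'C3')
   elemental function to_byte(v) result(b)
      integer, intent(in) :: v
      integer(c_int8_t) :: b
      b = int(merge(v - 256, v, v > 127), c_int8_t)
   end function to_byte
end module byte_mod

program byte_demo
   use byte_mod
   implicit none
   ! rdtsc ; ret  (the 32-bit byte sequence from the listing)
   integer, parameter :: raw(3) = [int(z'0F'), int(z'31'), int(z'C3')]
   integer(c_int8_t) :: code(3)
   code = to_byte(raw)
   print '(*(g0,1x))', code
end program byte_demo
```

This should print 15 49 -61, the same three bytes seen as signed values.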

As can be seen in the above example (note copious comments) all
the user need do is USE the rdtsc_mod module and then the
rdtsc function works transparently.

> I suspect that it is contained in the code you referenced, but
> I would prefer something more concise.

Check

> There are two other questions you may be able to reply to:

> 1) Does RDTSC need some initialisation ? A version of RDTSC
> interface I have for ifort (2013) does use an initialisation code.
> I also use QueryPerformanceCounter so suspect RDTSC is being
> initialised via these calls.

No, RDTSC needs no initialization. It's possible to write to the TSC,
but that's a privileged operation, so mostly one just reads two
values and takes their difference. Since my
example doesn’t link to a *.LIB or *.OBJ file it needs some
preliminaries where, with the help of the OS, Fortran can set
up a function which can subsequently be invoked.

> 2) I am puzzled by the speed up of the processor in my i7-6700HQ
> or i7-4790K. ( as reported by task manager) SYSTEM_CLOCK
> appears to provide accurate clock times, but I am wondering
> about RDTSC and also QueryPerformanceCounter, which appear
> to be RDTSC / 1024.

Yeah, I haven't played around with the consequences of the
processor varying its speed. You are sort of on your own here.

> The cycle rate of RDTSC is also a problem, which I approximate
> by using QueryPerformanceFrequency * 1024

I normally use RDTSC only for timing short chunks of code where
other timers don't have enough precision. Relative times are what
are important to me, so I'm not so interested in wall-clock time.
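
For relative timings of short chunks like this, a common habit is to repeat
the measurement and keep the smallest difference, since interrupts and
cache misses only ever add time. The sketch below assumes nothing from the
rdtsc module: SYSTEM_CLOCK stands in for rdtsc() so that it runs anywhere,
and best_time_mod, best_ticks and the sum-of-squares kernel are made-up
names for illustration.

```fortran
module best_time_mod
   use iso_fortran_env, only: int64
   implicit none
   ! Result of the timed kernel, stored so the loop is not optimized away
   real :: checksum = 0.0
contains
   ! Time a small kernel ntrials times and return the smallest tick count
   function best_ticks(ntrials) result(best)
      integer, intent(in) :: ntrials
      integer(int64) :: best, t0, t1
      integer, parameter :: n = 100000
      real, save :: x(n)
      real :: s
      integer :: trial, i
      call random_number(x)
      best = huge(best)
      do trial = 1, ntrials
         call system_clock(t0)        ! rdtsc() would go here
         s = 0.0
         do i = 1, n
            s = s + x(i)*x(i)
         end do
         call system_clock(t1)        ! and here
         checksum = s
         best = min(best, t1 - t0)
      end do
   end function best_ticks
end module best_time_mod

program best_time_demo
   use best_time_mod
   use iso_fortran_env, only: int64
   implicit none
   integer(int64) :: rate, best
   call system_clock(count_rate=rate)
   best = best_ticks(20)
   print '(a,g0,a)', 'best of 20 trials: ', best, ' ticks'
   print '(a,g0)', 'ticks per second:  ', rate
end program best_time_demo
```

Swapping the two SYSTEM_CLOCK calls for rdtsc() from the module above would
give the same structure with cycle resolution.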

> For Silverfrost FTN95 /64, RDTSC_VAL@() is a 64 bit replacement
> for REAL*10 CPU_CLOCK@ and appears to work for 32 bit also.
> As INTEGER*8, I find this is a better option.

> RDTSC rate is not documented and appears to be the processor
> rate. My approach for RDTSC_tick_rate works on all pc's I have
> available to test.

RDTSC is documented in Intel's Software Developer's Manual
(325462.pdf). The rate is the processor rate. It should be noted
that the TSC advances one bus clock at a time, counted in processor
clocks, so you may find that all timings have a common divisor
like 5 or 13 or something.

> (I do not know what happens on processors that have a turbo
> boost or are over-clocked. If someone can test this I would like to know)
> http://forums.silverfrost.com/viewtopic.php?t=3271&postdays=0&postorder=asc&start=0

> If you are able to comment, it would be appreciated,

Check

campbel...@gmail.com

Jun 24, 2017, 11:12:35 PM

Hi James,

Thanks very much for this "pared down" example. I have tested it on Win7 using 64-bit gFortran 6.3.0 and confirm that it does appear to work well.

I modified your program hello to provide some more timing reports and to look for any variation in run times.

program hello
use rdtsc_mod
use ISO_C_BINDING, only: C_INT64_T
implicit none
integer(C_INT64_T) t0, tf

integer i

tf = rdtsc() ! first call should clean up delays
!
t0 = rdtsc()
do i = 1,5
tf = rdtsc() - t0
write(*,'(*(g0))') 'Time for do loop update = ',tf

t0 = rdtsc()
write(*,'(*(g0))') 'Hello, world'

tf = rdtsc() - t0
write(*,'(*(g0))') 'Time for write statement = ',tf

t0 = rdtsc()
tf = rdtsc()

write(*,'(*(g0))') 'Time for rdtsc call only = ',tf - t0

t0 = rdtsc()
tf = rdtsc() - t0
write(*,'(*(g0))') 'Time for rdtsc call - t0 = ',tf

t0 = rdtsc()
end do
tf = rdtsc() - t0
write(*,'(*(g0))') 'Time for do loop exit = ',tf

endprogram hello

{hello.tce} the results of the run show:
Time for do loop update = 77
Hello, world
Time for write statement = 21749
Time for rdtsc call only = 77
Time for rdtsc call - t0 = 77
Time for do loop update = 105
Hello, world
Time for write statement = 9450
Time for rdtsc call only = 49
Time for rdtsc call - t0 = 49
Time for do loop update = 49
Hello, world
Time for write statement = 9422
Time for rdtsc call only = 49
Time for rdtsc call - t0 = 49
Time for do loop update = 49
Hello, world
Time for write statement = 8967
Time for rdtsc call only = 49
Time for rdtsc call - t0 = 49
Time for do loop update = 49
Hello, world
Time for write statement = 9324
Time for rdtsc call only = 49
Time for rdtsc call - t0 = 49
Time for do loop exit = 133

This shows consistently 49 cycles for a rdtsc() call, although your test showed only 36 cycles.
Either way this is a small overhead for the rdtsc call and provides much better accuracy than QueryPerformanceCounter.
The next test will be to link this into a larger code.

Based on the test above, I would recommend all to consider this as a useful timer for small packets of code.

Thanks again James for this example of using rdtsc with gFortran
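
As a footnote to the figures above: every value in that listing looks like
a multiple of 7, which would fit the earlier remark that timings tend to
share a common divisor. A small check, offered only as a sketch (gcd_mod,
gcd and gcd_all are illustrative names; the numbers are the distinct values
copied from the run above):

```fortran
module gcd_mod
   implicit none
contains
   ! Greatest common divisor of two integers (Euclid's algorithm)
   pure function gcd(a, b) result(g)
      integer, intent(in) :: a, b
      integer :: g, r, tmp
      g = abs(a)
      r = abs(b)
      do while (r /= 0)
         tmp = mod(g, r)
         g = r
         r = tmp
      end do
   end function gcd

   ! Greatest common divisor of all elements of an array
   pure function gcd_all(v) result(g)
      integer, intent(in) :: v(:)
      integer :: g, i
      g = 0
      do i = 1, size(v)
         g = gcd(g, v(i))
      end do
   end function gcd_all
end module gcd_mod

program timing_gcd
   use gcd_mod
   implicit none
   ! Distinct cycle counts reported in the run above
   integer, parameter :: t(9) = &
      [77, 21749, 105, 9450, 49, 9422, 8967, 9324, 133]
   print '(a,g0)', 'common divisor = ', gcd_all(t)
end program timing_gcd
```

This should print "common divisor = 7".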

James Van Buskirk

Jun 25, 2017, 12:46:29 PM

wrote in message
news:ca711c91-f09c-4416...@googlegroups.com...

> Based on the test above, I would recommend all to consider
> this as a useful timer for small packets of code.

Thanks for the endorsement. It occurred to me last night
that I should have created a Fortran pointer to BAD_STUFF_32
or BAD_STUFF_64 according as the code was being compiled
in a 32- or 64-bit context. Doing so avoids the duplication of
complex code under separate branches of two IF blocks,
which is a good thing for readability and testability
of the resulting code. It also makes the blob of Fortran code
shorter.

D:\gfortran\clf\rdtsc>type hello2.f90

! Pointer to machine code appropriate to address size
integer(C_INT8_T), pointer :: code_ptr(:)
! Size of machine code
integer(C_SIZE_T) code_size

! Address the OS allocates for our function via VirtualAlloc
type(C_PTR) rdtsc_address
! Last error code
integer(C_INT32_T) last
! Fortran pointer to write our function to
integer(C_INT8_T), pointer :: rdtsc_code(:)
! Error status from VirtualProtect
integer(C_INT32_T) status
! Variable to store old memory protection code
integer(C_INT32_T) OldProtect

! Point machine code pointer at code appropriate to
! address size and get code size
if(bit_size(0_C_INTPTR_T) == 32) then
code_ptr => BAD_STUFF_32
else
code_ptr => BAD_STUFF_64
end if
code_size = size(code_ptr,KIND=C_SIZE_T)

! Get writable address from OS to put our function in
rdtsc_address = VirtualAlloc( &
lpAddress = C_NULL_PTR, &
dwSize = code_size, &
flAllocationType = ior(MEM_COMMIT,MEM_RESERVE), &
flProtect = PAGE_READWRITE)
! If something goes wrong, print out error code and abort
if(.NOT.C_ASSOCIATED(rdtsc_address)) then
last = GetLastError()
write(*,'(*(g0))') &
'rdtsc_init failed in VirtualAlloc with code ', last
stop
end if
! Get Fortran pointer to allocated memory and poke our
! function into it. Then mark it as executable
call C_F_POINTER(rdtsc_address,rdtsc_code,[code_size])
rdtsc_code = code_ptr
status = VirtualProtect( &
lpAddress = rdtsc_address, &
dwSize = code_size, &
flNewProtect = PAGE_EXECUTE, &
lpflOldProtect = OldProtect)
! If something goes wrong, print out error code and abort
if(status == 0) then
last = GetLastError()
write(*,'(*(g0))') &
'rdtsc_init failed in VirtualProtect with code ', last
stop
end if
! Point the function pointer at the function we just poked into memory
call C_F_PROCPOINTER(transfer(rdtsc_address,C_NULL_FUNPTR), &
rdtsc)
! We still have to return the TSC value for transparency
rdtsc_init = rdtsc()
end function rdtsc_init
end module rdtsc_mod

program hello2

use rdtsc_mod
use ISO_C_BINDING, only: C_INT64_T
implicit none
integer(C_INT64_T) t0, tf

t0 = rdtsc()
write(*,'(*(g0))') 'Hello, world'
tf = rdtsc()
write(*,'(*(g0))') 'Time = ',tf-t0
t0 = rdtsc()
tf = rdtsc()
write(*,'(*(g0))') 'Time = ',tf-t0

end program hello2

D:\gfortran\clf\rdtsc>gfortran -fno-range-check hello2.f90 -ohello2

D:\gfortran\clf\rdtsc>hello2
Hello, world
Time = 365351
Time = 39

jfh

Jun 25, 2017, 6:10:21 PM

On Saturday, June 24, 2017 at 8:09:43 AM UTC+12, herrman...@gmail.com wrote:
>
> Otherwise, I believe that the usual Fortran 77 code is
> still accepted by gfortran. There are a small number of
> features removed from Fortran 77, that may or may not
> have been removed from gfortran.

A problem of the opposite kind when using old FORTRAN code with a more modern compiler: you may have a subroutine or function that was not known to the older compiler but is intrinsic in the newer one. You may not have had to specify EXTERNAL with the older compiler but you must with the newer one. For example if an f66 program has a function called LEN, you must have a statement saying EXTERNAL LEN when using an f77 or later compiler, to avoid invoking the f77 intrinsic function LEN. There are many more cases like that with f90, f95, ...
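
A minimal sketch of the LEN case, in case it helps. The user function here
is invented for illustration (it ignores trailing blanks, unlike the
intrinsic); the only point is the effect of the EXTERNAL attribute.

```fortran
! User-written function that happens to share its name with an intrinsic.
! Unlike the intrinsic LEN, this one ignores trailing blanks.
integer function len(s)
   implicit none
   character(*), intent(in) :: s
   len = len_trim(s)
end function len

program external_demo
   implicit none
   ! EXTERNAL makes the reference below resolve to the user's function;
   ! with a plain "integer :: len" the intrinsic LEN would be used instead
   integer, external :: len
   print '(g0)', len('abc   ')
end program external_demo
```

With the EXTERNAL attribute this should print 3; the intrinsic would give 6
for the same six-character argument.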

herrman...@gmail.com

Jun 25, 2017, 8:55:45 PM

Note that this problem isn't new, though.

Fortran 66 and 77 compilers can have extensions, including
intrinsic functions that are not in the standard.

And in addition to the newer intrinsic functions, newer
compilers can also have extensions, as intrinsic functions.

Also, many functions included in the older standards
were included as external functions. That is, you were
allowed to pass them as actual arguments to other routines.
I believe that the standard still allows for this, but
maybe not in all the same cases.

In the unlucky case where it doesn't, you would have to
write a wrapper function to call the intrinsic function.
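
Such a wrapper might look like the sketch below. MAX is used as the example
because, being generic with no specific name, it cannot be passed as an
actual argument; wrap_mod, my_max and apply2 are made-up names.

```fortran
module wrap_mod
   implicit none
contains
   ! Wrapper: gives the generic intrinsic MAX a specific, passable form
   integer function my_max(a, b)
      integer, intent(in) :: a, b
      my_max = max(a, b)
   end function my_max

   ! Some routine that expects a function as an actual argument
   integer function apply2(f, a, b)
      interface
         integer function f(a, b)
            integer, intent(in) :: a, b
         end function f
      end interface
      integer, intent(in) :: a, b
      apply2 = f(a, b)
   end function apply2
end module wrap_mod

program wrap_demo
   use wrap_mod
   implicit none
   ! Passing MAX itself here is not allowed; the wrapper is fine
   print '(g0)', apply2(my_max, 3, 7)
end program wrap_demo
```

This should print 7.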

Gary Scott

Jun 25, 2017, 9:36:25 PM

My personal coding practice makes it unlikely that this situation would
arise, as I typically prefix each procedure with something like an
application id or a module id.

herrman...@gmail.com

Jun 25, 2017, 10:02:37 PM

On Sunday, June 25, 2017 at 6:36:25 PM UTC-7, Gary Scott wrote:

(snip on collisions between intrinsic routine names and user
routine names)

> My personal coding practice makes it unlikely that this situation would
> arise, as I typically prefix each procedure with something like an
> application id or a module id.

I suppose, but following the trend of this thread, my rdtsc
functions are always called rdtsc.

In the Java case, I had class rdtsc and static method rdtsc,
such that the call is:

rdtsc.rdtsc();

There is an IBM convention of assigning each project, such
as each compiler or (separately) run-time library a three
letter code. All the external routine names start with those
three letters, as do all the error messages.

I am not quite that organized, but also don't normally use
names that are likely existing intrinsic names.

Gary Scott

Jun 26, 2017, 8:59:28 AM

That IBM practice is likely where I picked that up.