
Code blocks available for generic use


Rick C. Hodgin

unread,
Jan 26, 2017, 4:30:42 PM1/26/17
to
Does C have any kind of syntax which allows code to be constructed that
can be used for generic purposes, but without following the standard
prologue and epilogue code through a function?

I'm thinking I want to design a block of code like this:

block xyz
{
    int a, c;

    input b;        // By block {..} protocol, uses an input register

    a = myfunc1();
    c = a + b;

    output c;       // By block {..} protocol, uses an output register
}

Then in my code I can use this:

int main(int argc, char* argv[])
{
    int j, k;

    j = argc * sizeof(argv[0]);
    k = xyz(j);
}

In this example, the block of code in xyz is explicitly inlined.
It does not dispatch a function call.

In the following example, I am able to take a defined block and
copy it into another function so as to prevent the dispatch of
the callback function call:

block my_qsort_compare
{
    input left as SStruct1*, right as SStruct1*;

    if (left->value > right->value)      result = 1;
    else if (left->value < right->value) result = -1;
    else                                 result = 0;

    output result;
}

Then in a modified qsort_block() version, I could provide parameters:

qsort_block(&data[0], count, data_size,
            my_qsort_compare, sizeof(my_qsort_compare));

The qsort_block() function copies in the block of code so that it
doesn't have to issue function-call dispatches, but instead inlines
the logic through some kind of established input and output protocol,
where parameters enter and exit without going through the stack,
referring instead to wherever they already reside in the specific
instance of use.

Is there a way to do something like this in C?

Thank you,
Rick C. Hodgin

Robert Wessel

unread,
Jan 26, 2017, 4:52:23 PM1/26/17
to
An inline function. Of course, it's up to the compiler to decide
exactly what code to generate. A compiler might inline a function not
explicitly marked inline, and it may refuse to inline a function that
is (for example, most compilers will not inline recursive functions).
Many compilers have a more emphatic "inline" directive as well (MSVC's
"__forceinline" and gcc's "__attribute__((always_inline))", for
example).
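A minimal sketch of what Robert describes, assuming GCC or Clang (the
function names are illustrative; MSVC spells the forced variant
__forceinline):

```c
/* A small helper the optimizer is free to inline at -O2. */
static inline int add_scaled(int a, int b, int scale)
{
    return a + b * scale;
}

/* GCC/Clang spelling of the "emphatic" form: inlined even at -O0. */
__attribute__((always_inline))
static inline int add_scaled_forced(int a, int b, int scale)
{
    return a + b * scale;
}
```

The hint and the attribute only affect code generation; both functions
behave identically to an ordinary call at the source level.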

Ian Collins

unread,
Jan 27, 2017, 2:05:42 AM1/27/17
to
On 01/27/17 10:30 AM, Rick C. Hodgin wrote:
> Does C have any kind of syntax which allows code to be constructed that
> can be used for generic purposes, but without following the standard
> prologue and epilogue code through a function?
>
> I'm thinking I want to design a block of code like this:
>
> block xyz
> {
> int a, c;
>
> input b; // By block {..} protocol, uses an input register
>
> a = myfunc1();
> c = a + b;
>
> output c; // By block {..} protocol, uses an output register
> }
>
> Then in my code I can use this:
>
> int main(int argc, char* argv[])
> {
> int j, k;
>
> j = argc * sizeof(argv[0]);
> k = xyz(j);
> }
>
> In this example, the block of code in xyz is explicitly inlined.
> It does not dispatch a function call.

If you want C++, you know where to find it!

--
Ian

luser droog

unread,
Jan 27, 2017, 2:18:49 AM1/27/17
to
Or Smalltalk. This is exactly the behavior of Smalltalk's
blocks.

Rick C. Hodgin

unread,
Jan 27, 2017, 9:05:45 AM1/27/17
to
On Friday, January 27, 2017 at 2:05:42 AM UTC-5, Ian Collins wrote:
> On 01/27/17 10:30 AM, Rick C. Hodgin wrote:
> > Does C have any kind of syntax which allows code to be constructed that
> > can be used for generic purposes, but without following the standard
> > prologue and epilogue code through a function?
> >
> > ...
> >
> > In the following example, I am able to take a defined block and
> > copy it into another function so as to prevent the dispatch of
> > the callback function call:
> >
> > block my_qsort_compare
> > {
> > input left as SStruct1*, right as SStruct1*;
> >
> > if (left->value > right->value) result = 1;
> > else if (left->value < right->value) result = -1;
> > else result = 0;
> >
> > output result;
> > }
> >
> > Then in a modified qsort_block() version, I could provide parameters:
> >
> > qsort_block(&data[0], count, data_size,
> > my_qsort_compare, sizeof(my_qsort_compare));
> >
> > The qsort_block() function copies in the block of code so that it
> > doesn't have to issue function call dispatches, but inlines the
> > logic through some kind of established input and output protocols
> > where parameters enter and exit knowingly without going through
> > the stack, but reference wherever they already are in the specific
> > instance use.
> >
> > Is there a way to do something like this in C?
>
> If you want C++, you know where to find it!

I can see how the example you cited works in C++, but the idea I
have extends to this second example.

How does this work in C++? How specifically does the qsort_block()
example work with the use of my_qsort_compare() function being able
to be copied into another function like that?

David Brown

unread,
Jan 27, 2017, 10:10:38 AM1/27/17
to
The quick answer is "templates". Details should probably be taken to
C++, if you really are interested in knowing about them. Templates
don't exist in C, and I believe they are a part of C++ that you are not
intending to copy to CAlive.

Rick C. Hodgin

unread,
Jan 27, 2017, 10:37:38 AM1/27/17
to
On Friday, January 27, 2017 at 10:10:38 AM UTC-5, David Brown wrote:
> On 27/01/17 15:05, Rick C. Hodgin wrote:
> > How does this work in C++? How specifically does the qsort_block()
> > example work with the use of my_qsort_compare() function being able
> > to be copied into another function like that?
>
> The quick answer is "templates" ... Templates don't exist in C, and I
> believe they are a part of C++ that you are not intending [in] CAlive.

Correct.

The purpose of block {..} would be to allow the compiler to compile
logic into a block that can be used, but doesn't have to follow a
particular pattern of use (such as compiling, linking, and being
bound to a particular binary executable wrapper).

The compiler is a tool, and it should be able to compile logic down
to a binary form for someone, without the required protocol of a
target OS. It should simply be able to generate code targeting a
given ISA.

supe...@casperkitty.com

unread,
Jan 27, 2017, 10:54:21 AM1/27/17
to
On Friday, January 27, 2017 at 9:37:38 AM UTC-6, Rick C. Hodgin wrote:
> The purpose of block {..} would be to allow the compiler to compile
> logic into a block that can be used, but doesn't have to follow a
> particular pattern of use (such as compile, link, and be bound to
> a particular binary executable wrapper.

I've often thought that a C-like language would benefit from having
something a bit like the C preprocessor, but operating at a somewhat
higher level. There's presently a significant gap between what inline
functions can do and what function-like macros can do, and it would
be helpful to have a construct that combined the benefits of both.
Is that what you're after with "block"?
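A small illustration of that gap, as a sketch: the macro re-evaluates
its arguments, while the inline function evaluates each exactly once
but is fixed to one argument type.

```c
/* Function-like macro: pure textual substitution, so an argument
 * with a side effect may be evaluated twice. */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* Inline function: each argument evaluated exactly once, with
 * normal type checking, but it cannot be type-generic the way a
 * macro can. */
static inline int max_inline(int a, int b)
{
    return a > b ? a : b;
}
```

With `int i = 5;`, `MAX_MACRO(i++, 3)` yields 6 and leaves i at 7
(double evaluation), whereas `max_inline(i++, 3)` yields 5 and leaves
i at 6.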

Rick C. Hodgin

unread,
Jan 27, 2017, 10:59:32 AM1/27/17
to
I address the pre-processor limitations of C by introducing a fully
compliant CAlive source code pre-processor step. I introduce a host
of new functions and abilities (the higher level you mention) to be
able to access source code, perform searches, replace lines, suppress
lines for a single instance of a compilation, embed generated content
in a hard form in a source file, etc.

This block {..} engine would be able to work from the same code base
as the one used in the preprocessor, but it wasn't intended to be
used explicitly for that purpose.

David Brown

unread,
Jan 27, 2017, 11:37:56 AM1/27/17
to
On 27/01/17 16:37, Rick C. Hodgin wrote:
> On Friday, January 27, 2017 at 10:10:38 AM UTC-5, David Brown wrote:
>> On 27/01/17 15:05, Rick C. Hodgin wrote:
>>> How does this work in C++? How specifically does the qsort_block()
>>> example work with the use of my_qsort_compare() function being able
>>> to be copied into another function like that?
>>
>> The quick answer is "templates" ... Templates don't exist in C, and I
>> believe they are a part of C++ that you are not intending [in] CAlive.
>
> Correct.

(I see you have made a post in c.l.c++. I haven't answered it yet - I
might do if no one else does so.)

>
> The purpose of block {..} would be to allow the compiler to compile
> logic into a block that can be used, but doesn't have to follow a
> particular pattern of use (such as compile, link, and be bound to
> a particular binary executable wrapper.

A C compiler does not have to follow this pattern either, when dealing
with code it knows "everything" about. Static functions, in particular,
can be inlined, outlined, partially inlined, cloned, and otherwise
jumbled up in any way the compiler wants. And with link-time
optimisation, /everything/ that doesn't come as a pre-compiled object
code lump, and that does not need to be exported with a specific calling
convention, can be handled in this way.
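A sketch of that last point (the file layout and function names are
illustrative): with link-time optimisation enabled, even an extern
function defined in a different translation unit can be inlined into
its callers at link time.

```c
/* util.c (illustrative): an ordinary extern function. */
int triple(int x)
{
    return 3 * x;
}

/* main.c (illustrative): a caller in another translation unit.
 * With "gcc -O2 -flto util.c main.c", the link-time optimizer sees
 * both bodies and can inline triple() into use_triple(), even
 * though neither is marked inline. */
int use_triple(void)
{
    return triple(14);
}
```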

Rick C. Hodgin

unread,
Jan 27, 2017, 11:58:13 AM1/27/17
to
On Friday, January 27, 2017 at 11:37:56 AM UTC-5, David Brown wrote:
> On 27/01/17 16:37, Rick C. Hodgin wrote:
> > On Friday, January 27, 2017 at 10:10:38 AM UTC-5, David Brown wrote:
> >> On 27/01/17 15:05, Rick C. Hodgin wrote:
> >>> How does this work in C++? How specifically does the qsort_block()
> >>> example work with the use of my_qsort_compare() function being able
> >>> to be copied into another function like that?
> >>
> >> The quick answer is "templates" ... Templates don't exist in C, and I
> >> believe they are a part of C++ that you are not intending [in] CAlive.
> >
> > Correct.
>
> (I see you have made a post in c.l.c++. I haven't answered it yet - I
> might do if no one else does so.)
>
> >
> > The purpose of block {..} would be to allow the compiler to compile
> > logic into a block that can be used, but doesn't have to follow a
> > particular pattern of use (such as compile, link, and be bound to
> > a particular binary executable wrapper.
>
> A C compiler does not have to follow this pattern either, when dealing
> with code it knows "everything" about. Static functions, in particular,
> can be inlined, outlined, partially inlined, cloned, and otherwise
> jumbled up in any way the compiler wants. And with link-time
> optimisation, /everything/ that doesn't come as a pre-compiled object
> code lump, and that does not need to be exported with a specific calling
> convention, can be handled in this way.

I have never experienced this apart from inlining via explicit
declaration or compiler optimization.

I know of no way in any C compiler (except for one which came with
the book "Developing Your Own 32-bit Operating System" by Richard
Burgess from back in the 1990s, which was simply a translating tool
which took C code and generated assembly source code) to have it
generate code that has a generic input protocol, and a generic
output protocol, and is relocatable.

I am aware of many implementations that generate functions which
require a calling convention using registers or the stack, and
explicitly return from the call.

Can you give me an example in GCC or MSVC that produces code which
can be inserted generically into the middle of a large block of NOPs,
for example, and which then conducts work provided the input is
correct and the output is specified?

Best regards,
Rick C. Hodgin

anti...@math.uni.wroc.pl

unread,
Jan 27, 2017, 12:23:21 PM1/27/17
to
Rick C. Hodgin <rick.c...@gmail.com> wrote:
What can your "block" do besides having different syntax and
ensuring that code is expanded inline? In particular, what is
the difference between your "block" and a hypothetical
"really_inline" keyword for function definitions?

--
Waldek Hebisch

Rick C. Hodgin

unread,
Jan 27, 2017, 12:33:42 PM1/27/17
to
The block {..} could allow its own local variables by decreasing
the actual stack pointer and then referencing them as [sp+Nn] for
their various offsets. It would receive parameters in registers,
or at known locations relative to something like the base pointer
(where [bp] is not necessarily the [bp] of the actual program, but
just a pointer to the base of the struct holding the local variables
which are used as input (and optionally output) to the block {..}
code).

It would be a way to take C source code and generate either asm
source code, or raw binary output, based on the logic contained
within, but without the need to follow a calling convention or an
OS protocol for wrapping (such as being in an executable format,
a DLL, LIB, etc.).

For me, it would be a way to create truly relocatable code blocks
which could be injected anywhere (specific to my LiveCode (edit-
and-continue) ability) in CAlive, as then each block of code is
viewed as a chunk or "block" ... and it's just a matter of then
re-arranging the blocks for LiveCode compilation. The only truly
intensive operation is reconfiguring the stack to address changed
data types which may inflate or deflate everything contained
within, and to be able to inject "inquiries" whenever something
has changed which, upon return, will not have been initialized
properly.

It's all working out ... it's just a lot of work in design to get
every i dotted, and every t crossed.

BartC

unread,
Jan 27, 2017, 12:42:20 PM1/27/17
to
On 26/01/2017 21:30, Rick C. Hodgin wrote:
> Does C have any kind of syntax which allows code to be constructed that
> can be used for generic purposes, but without following the standard
> prologue and epilogue code through a function?
>
> I'm thinking I want to design a block of code like this:
>
> block xyz
> {
> int a, c;
>
> input b; // By block {..} protocol, uses an input register
>
> a = myfunc1();
> c = a + b;
>
> output c; // By block {..} protocol, uses an output register
> }
>
> Then in my code I can use this:
>
> int main(int argc, char* argv[])
> {
> int j, k;
>
> j = argc * sizeof(argv[0]);
> k = xyz(j);
> }
>
> In this example, the block of code in xyz is explicitly inlined.
> It does not dispatch a function call.

Macros? (I think even #include can be employed, depending on what
interface is needed, but then each lot of block code will be in a
separate file.)
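For what it's worth, GCC's statement-expression extension gets close
to the "block" idea at the source level: a macro body may contain
full statements, including an if chain, and still yield a value. A
sketch, assuming GNU C (the `({ ... })` form is a GCC/Clang
extension, not standard C):

```c
struct SStruct1 { int value; };

/* The comparison "block" from upthread, approximated as a GNU C
 * statement expression: evaluates to -1, 0, or 1, and is expanded
 * inline at every use site with no call dispatched. */
#define MY_QSORT_COMPARE(left, right) ({          \
    int result_;                                  \
    if ((left)->value > (right)->value)           \
        result_ = 1;                              \
    else if ((left)->value < (right)->value)      \
        result_ = -1;                             \
    else                                          \
        result_ = 0;                              \
    result_;                                      \
})
```

Like any macro, though, it is expanded at compile time; it cannot be
handed to an already compiled qsort() as a runtime argument.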

> In the following example, I am able to take a defined block and
> copy it into another function so as to prevent the dispatch of
> the callback function call:
>
> block my_qsort_compare
> {
> input left as SStruct1*, right as SStruct1*;
>
> if (left->value > right->value) result = 1;
> else if (left->value < right->value) result = -1;
> else result = 0;
>
> output result;
> }
>
> Then in a modified qsort_block() version, I could provide parameters:
>
> qsort_block(&data[0], count, data_size,
> my_qsort_compare, sizeof(my_qsort_compare));
>
> The qsort_block() function copies in the block of code so that it
> doesn't have to issue function call dispatches, but inlines the
> logic through some kind of established input and output protocols
> where parameters enter and exit knowingly without going through
> the stack, but reference wherever they already are in the specific
> instance use.

I can't make sense of your second example. my_qsort_compare involves an
if-statement, but an if-statement isn't usually allowed in a
function-call argument.

And what does sizeof(my_qsort_compare) mean? What is the input to
my_qsort_compare?

Maybe you can show the expanded version of your qsort_block() call so we
can see how it's supposed to work.

--
Bartc

Rick C. Hodgin

unread,
Jan 27, 2017, 1:07:56 PM1/27/17
to
On Friday, January 27, 2017 at 12:42:20 PM UTC-5, Bart wrote:
> On 26/01/2017 21:30, Rick C. Hodgin wrote:
> > Does C have any kind of syntax which allows code to be constructed that
> > can be used for generic purposes, but without following the standard
> > prologue and epilogue code through a function?
> >
> > I'm thinking I want to design a block of code like this:
> >
> > block xyz
> > {
> > int a, c;
> >
> > input b; // By block {..} protocol, uses an input register
> >
> > a = myfunc1();
> > c = a + b;
> >
> > output c; // By block {..} protocol, uses an output register
> > }
> >
> > Then in my code I can use this:
> >
> > int main(int argc, char* argv[])
> > {
> > int j, k;
> >
> > j = argc * sizeof(argv[0]);
> > k = xyz(j);
> > }
> >
> > In this example, the block of code in xyz is explicitly inlined.
> > It does not dispatch a function call.
>
> Macros? (I think even #include can be employed, depending on what
> interface is needed, but then the each lot of block code will be in a
> separate file.)

Macros would work for this first example, but they would handle it a
little differently than a true block {..} would (as per my proposed
design, mostly described in detail only in my head thus far).

Sure. See below.

Source code:

01: block my_qsort_compare
02: {
03:     input left as SStruct1*, right as SStruct1*;
04:
05:     if (left->value > right->value) result = 1;
06:     else if (left->value < right->value) result = -1;
07:     else result = 0;
08:
09:     output result;
10: }

Compiled down to:

01: block my_qsort_compare
02: {
03:     input left as SStruct1*, right as SStruct1*;
        // By block {..} protocol, use left=eax, right=ebx
04:
05:     if (left->value > right->value) result = 1;
            mov  edx,[eax+offset SStruct1::value]   // left
            cmp  edx,[ebx+offset SStruct1::value]   // right
            jng  @F
            mov  eax,1
            jmp  end
06:     else if (left->value < right->value) result = -1;
        @@:
            jnl  @F
            mov  eax,-1
            jmp  end
07:     else result = 0;
        @@:
            xor  eax,eax
08:
09:     output result;
        end:
        // By block protocol, return result in eax
10: }

Now, this block of code was compiled by the compiler. It exists.
It contains logic, those inputs and outputs. It can be copied to
anywhere it's needed. It now exists basically as the equivalent
of:

unsigned char my_qsort_compare[] = { /* opcode bytes */ };

So the reference to "my_qsort_compare" returns its address, and the
reference to sizeof(my_qsort_compare) is the size of the code
contained within.

In order for the qsort_block() version to work, it would need to
have a large enough block of code within that it can be copied.
And then it can create a jmp offset and inject the equivalent of
this assembly code immediately after the block above:

jmp known_hard_offset - sizeof(my_qsort_compare)

In that way it would not have to pass parameters on the stack,
or make the function call, saving overhead for spill / fill
operations.

And lots of other types of callback code could be done this way
if it's appropriate. The caller could even provide another input
which is a reusable block to generate the output to so that on
subsequent passes it doesn't need to redesign the qsort_block()
function body, but could simply reuse that which was first
assembled.

The OS would have to provide support for this, of course, but
if it's an oft-used function, the removal of call overhead would
be desirable. And this would be a way to do it in the cases of
static library functions linked against at runtime, rather than
compile-time.

BartC

unread,
Jan 27, 2017, 1:14:20 PM1/27/17
to
I'm not sure it's much clearer!

What would be the equivalent of your example in C? Ie. what does someone
have to write /now/ to do the logical equivalent of the example?

--
Bartc

Rick C. Hodgin

unread,
Jan 27, 2017, 1:27:00 PM1/27/17
to
Oh dear. Well ... Hmm...

> What would be the equivalent of your example in C? Ie. what does someone
> have to write /now/ to do the logical equivalent of the example?

I don't think the ability exists today in C or C++. I think this is a
new idea.

supe...@casperkitty.com

unread,
Jan 27, 2017, 1:52:26 PM1/27/17
to
On Friday, January 27, 2017 at 11:33:42 AM UTC-6, Rick C. Hodgin wrote:
> The block {..} could allow its own local variables. By decreasing
> the actual stack pointer, and then referencing them as [sp+Nn] for
> their various offsets. It would receive parameters in registers,
> or at known locations relative to something like the base pointer
> (where [bp] is not necessarily the [bp] of the actual program, but
> is just a pointer to the base of the struct holding the local vars
> which are used as input (and optionally output) to the block {..}
> code).

Even in the 1980s, C compilers seldom made many promises about how they
will treat the stack pointer except at certain key points in the program.
Being able to have the compiler put or leave things on the stack at its
leisure has long been recognized as an easy and useful optimization, and
exposing push and pop operations would undermine it. With the exception
of automatic variables whose address has been taken and which are still
alive, the contents of the stack are best regarded as Unspecified even
when using conventional architectures.

Rick C. Hodgin

unread,
Jan 27, 2017, 2:01:46 PM1/27/17
to
Correct. I used the stack reference only as an example as the stack is
fairly ubiquitous among mainstream architectures. I also introduce a
new stack feature in my Arxoda CPU, called a parameter stack (with the
spp and bpp registers).

This parameter stack is a separate physical stack frame. It is a
mechanism designed to allow pre-packaged parameters to be pointed to
without having to populate them manually at runtime (if they're not
typically changed). And it provides a way to aggregate stack local
and heap variables into a discrete packet which allows them to be
assembled into a logical single block. To access the actual local
variables requires indirection, but because it is being accessed through
the parameter stack, Arxoda is aware of this condition and handles the
job of indirection for you. The underlying code simply accesses the
address of their offset in the parameter stack. The CPU automatically
handles access to their true offsets.

It simplifies code at the expense of a little additional complexity in
hardware, and on behalf of the OS designers. A strong trade off when
you consider how many application developers there are, compared to
how many hardware designers and OS developers. The weight needs to
be in favor of the larger base, as those involved in the true inner
nitty gritty details will have little extra difficulty sorting through
the heavier knowledge base requirements.

Rick C. Hodgin

unread,
Jan 27, 2017, 2:14:41 PM1/27/17
to
On Friday, January 27, 2017 at 1:52:26 PM UTC-5, supe...@casperkitty.com wrote:
One other idea I'd like to throw out there:

I've discussed before on comp.std.c, and on comp.lang.c, the idea of
the standard C compiler operating on a standard C emulator. I had
the idea of defining the typical machine (little endian, two's
complement, 64-bit integer, and so on). I met with unending resistance
and abandoned the idea in favor of devoting my attention to a break-
away effort (CAlive).

Still, the underlying philosophy I was working with at that time
remains:

We must not look at the language to define or limit our ability
to use hardware. As someone once wrote, "Nobody writes C code
on a whiteboard." The goal is to actually produce real software
running on real hardware.

As such, our goals should be to look to the machine, and mold
the language around the machine. And to take that a step
further, to mold the language around the machine with a strong
and consistent deference to the needs of regular human beings
who must approach these tasks and conduct the work within them.

-----
People are absolutely amazing. We can achieve the most complex things
imaginable ... because God put us together to be unlike every other of
His created beings. Not even angels are able to know all of the things
we know, or to be able to live the lives we will live.

But there is a difference between operating at our optimum, and being
able to have a lesser knowledge base and still do 99% of the work we'd
need to do.

In order to learn C++, for example, it's an exhaustive effort. You
have to learn so many things to truly know C++. But, in knowing 10%
of C++ you can probably do 80% of the things that need to be done.

With CAlive, my goal is to get that number up to about 98%, and to do
so by making as many things as possible more conducive to the way we
already think. I want our natural conclusions about things to be the
way CAlive operates so that there isn't as much of a learning curve,
but instead ... "it just works naturally."

In that way, that same 10% investment in knowledge learning time in
C++ would translate to a nearly 98% knowledge base in CAlive. It
would yield developers able to do much more much more quickly, and
to be able to go in and debug code later with greater ease because
you don't really have to get in there and hammer out by rigorous
means what's going on ... everything will be much clearer.

I'm prepared to sacrifice a little bit of machine optimization for
this, though I plan to re-engineer the hardware with my Arxoda CPU
to compensate for much of that (by working in hardware the way the
needs of the CAlive data expression mandates in software), and I'm
willing to invest extra time in developing the language, IDE, and
other features people will interact with to make their efforts that
much easier, even if my efforts in creating the tools are that much
harder.

I believe this philosophy is a correct one. And I believe it is the
one which must endure on any system that intends to survive for an
extended period of time.

And to be clear about this, I believe that had something like CAlive
been around during most of C's lifetime, we would not have C in its
widely used form today; rather, CAlive (or a derivative thereof)
would have been used, because it is still very low-level like C, but
it works more the way human beings think, and it still contains a
subset of C for those who like to program in C for various aspects
of their code, while still providing the more human-usable,
higher-level portions where performance is not as much of a factor.

anti...@math.uni.wroc.pl

unread,
Jan 27, 2017, 3:28:43 PM1/27/17
to
Above you describe one (of several possible) implementations of
inline functions. But I did not ask about implementation.
My question was about semantics. Maybe I will ask it differently:
suppose that I have written a huge program in CAlive using blocks,
and I consider changing all blocks to functions. What will I
have to change besides the syntactic change?

>
> It would be a way to take C source code and generate either asm
> source code, or raw binary output, based on the logic contained
> within, but without the need to follow a calling convention or an
> OS protocol for wrapping (such as being in an executable format,
> a DLL, LIB, etc.).
>
> For me, it would be a way to create truly relocatable code blocks
> which could be injected anywhere (specific to my LiveCode (edit-
> and-continue) ability) in CAlive, as then each block of code is
> viewed as a chunk or "block" ... and it's just a matter of then
> re-arranging the blocks for LiveCode compilation. The only truly
> intensive operation is reconfiguring the stack to address changed
> data types which may inflate or deflate everything contained
> within, and to be able to inject "inquirys" whenever something
> has changed which, upon return, will not have been initialized
> properly.

It looks like you want to implement inlining via copying of
machine code. I have not heard of any system doing this,
as the benefits seem quite small: with a proper calling convention,
such inlining saves you two machine instructions (call and
return). In both cases you need to arrange that arguments
are in the correct place (register or stack) and retrieve the
produced value from the right place. AFAIK the main gain from
inlining is that it enables extra optimization. For example,
the following code:

typedef void (*work_fun)(void *, void *, void *);
typedef void (*step_fun)(void *);
typedef int (*end_fun)(void *, void *);

void
map(void * a, step_fun sf, end_fun ef, work_fun wf,
    void * iter, void * limit, void * extra)
{
    for(; ef(iter, limit); sf(iter)) {
        wf(a, iter, extra);
    }
}

void my_stepper(void * sp)
{
    int * ip = (int *)sp;
    (*ip)++;
}

int my_end(void * sp, void * limit)
{
    return *(int *)sp < *(int *)limit;
}

void my_work(void * a, void * iter, void * extra)
{
    int val = ((int *)a)[*(int *)iter];
    int * ip = (int *)extra;
    (*ip) += val;
}

int my_sum(int * a, int n)
{
    int i = 0;
    int res = 0;
    map((void *)a, my_stepper, my_end, my_work, (void *)&i,
        (void *)&n, (void *)&res);
    return res;
}

Using gcc 5.3 for ARM with options -S -Wall -mthumb -mcpu=cortex-m0 -O3
produces the following assembly for 'my_sum':

        .global my_sum
        .code   16
        .thumb_func
        .type   my_sum, %function
my_sum:
        cmp     r1, #0
        ble     .L12
        lsls    r1, r1, #2
        movs    r3, r0
        adds    r1, r0, r1
        movs    r0, #0
.L11:
        ldmia   r3!, {r2}
        adds    r0, r0, r2
        cmp     r1, r3
        bne     .L11
.L10:
        @ sp needed
        bx      lr
.L12:
        movs    r0, #0
        b       .L10
        .size   my_sum, .-my_sum
        .ident  "GCC: (GNU) 5.3.0"

This is essentially the code that you would get from a hand-written
summing loop. Note that 'map' can iterate over anything (given
appropriate arguments). I could use it to iterate over lists
or, say, an inorder traversal of a binary tree. So we get generic
code that can do a lot of things, but there is a cost:
a naive compiler will produce inefficient code.
But using gcc, all that abstraction bloat vanishes
after inlining. Merely copying and linking machine-code
snippets would retain most of the bloat.

BTW: gcc-4.9.3 for amd64 produces a lot of cruft from the
same source. I am not sure if this is due to different
target or due to different version.

--
Waldek Hebisch

Rick C. Hodgin

unread,
Jan 27, 2017, 3:42:32 PM1/27/17
to
On Friday, January 27, 2017 at 3:28:43 PM UTC-5, anti...@math.uni.wroc.pl wrote:
> Above you describe (one of sevareal possible) implementation of
> inline functions. But I did not ask about implementation.
> My question was about semantics...

Inlining is not something that a user has control over in the
binary expression. The compiler will take a function and inline
it and potentially completely refactor the way it works internally
to meet the optimization needs of whatever it's inlined into. An
inlined function in func1() may look completely different in its
assembly language expression, when compared to an inline function
in func2() because it was able in func2() to make use of optimizations
not possible in func1(), or vice-versa.

With the block {..}, you're creating a stand-alone piece of code
which has a rigid and expected input protocol, and a rigid and
expected output protocol. It generates optimized code for its
own block, and its own use of registers, but cannot make more
general or global assumptions about whatever it will wind up
running in at some later point, but the idea is that the gains
made by saving the spill + fill and function call are more than
worth the trade off in doing it this way.

The purpose of creating block {..} code blocks isn't to use them
as functions. It's to generate a thing which can be carried to the
place where you need it, injected there directly in the middle of
other code, without having to use a function call to access it.
It's doing runtime linking after the program has already launched.

In addition, without including the code for qsort(), you cannot
inline its compare callback function. The qsort() has to issue
the callback because it's typically located in an external library
that knows nothing of the app it will ultimately be used in. And
where there is some allowance for statically linked files, it's
unlikely to be inlined because you'd have multiple qsort() copies
being made, each with their own peculiarities.
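For reference, the standard-C baseline being discussed looks like
this: a separately compiled qsort() must reach the comparison through
an indirect call on every compare, which is exactly the dispatch
overhead the "block" idea tries to remove.

```c
#include <stdlib.h>

struct SStruct1 { int value; };

/* qsort(3) callback: invoked indirectly through a function pointer,
 * so a qsort living in an external library cannot inline it. */
static int compare_sstruct1(const void *l, const void *r)
{
    const struct SStruct1 *left  = l;
    const struct SStruct1 *right = r;
    if (left->value > right->value) return 1;
    if (left->value < right->value) return -1;
    return 0;
}

/* Sorts an array of SStruct1 ascending by value. */
static void sort_sstruct1(struct SStruct1 *data, size_t count)
{
    qsort(data, count, sizeof data[0], compare_sstruct1);
}
```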

In the case of a qsort_block(), however, it would be designed in
such a way that it could make use of the external block() to save
the overhead of the callback. And then you get into multiple
places in programming where, having this generic ability to inject
binary code into already existing algorithms to optimize them even
more, allows for possibilities that don't exist today.

I would like to see this ability added to C and C++. It would
allow blocks of binary code to be constructed, and then able to
be referenced like data at runtime, copied to where they can be
injected into other binary blocks which have allocated space
for their insertion (and if they're larger than the allocated
space, have it go ahead and perform the call, with the epilogue
and prologue for the block {..} code done in the original).

For multi-threaded apps, you'd need to allocate buffers in your
thread into which the code would be assembled when needed, and run
from, and then re-used in its "lazy run-time assembled form" until
no longer needed. For single-threaded apps, it could use a single
static copy of the algorithm that's updated at each call.

anti...@math.uni.wroc.pl

Jan 27, 2017, 5:41:50 PM
Rick C. Hodgin <rick.c...@gmail.com> wrote:
> On Friday, January 27, 2017 at 3:28:43 PM UTC-5, anti...@math.uni.wroc.pl wrote:
> > Above you describe (one of several possible) implementations of
> > inline functions. But I did not ask about implementation.
> > My question was about semantics...
>
> Inlining is not something that a user has control over in the
> binary expression.

You write the compiler, so you have complete control over
inlining in your compiler. If somebody implements
a CAlive compiler with blocks implemented differently
than you imagine (for example, using a normal function call),
then users of this clone compiler would lose the guarantee that
you want to make.

> With the block {..}, you're creating a stand-alone piece of code
> which has a rigid and expected input protocol, and a rigid and
> expected output protocol. It generates optimized code for its
> own block, and its own use of registers, but cannot make more
> general or global assumptions about whatever it will wind up
> running in at some later point,

Exactly like a non-inlined function.
> but the idea is that the gains
> made by saving the spill + fill and function call are more than
> worth the trade off in doing it this way.

Well, assume that it works nicely. Now a bad guy comes along
and implements function calls so that he adds an extra argument
for the return address to the block, and puts a branch to the
code implementing the block in the call space. For the return
after the block code he adds a jump back to the return address.
Assuming that the protocol involves the stack, a single
call instruction will put the return address on the stack
and transfer control. A single ret will clean up the
stack and transfer control back. So compared to a normal
call your block has a saving of two instructions per
call.

Extra remark: IIUC you want to copy code to a buffer.
Copied code will have variable length, so AFAICS
you will need a jump back (you can not fall through the
end of the buffer because there may be junk after the code
of the block). You may arrange the code of the function using
the block so that it falls through into the start of the buffer,
but that is not possible if you want to use it more than
once (unless you use several buffers). Also, in the multithreaded
case you will want a buffer separate from the function area.
So quite possibly you will have a jump to transfer
control to the block and a jump to transfer control
back: exactly the two jumps that you are trying to
save.

> The purpose of creating block {..} code blocks isn't to use them
> as functions. It's to generate a thing which can be wielded to
> the place where you need it, injected there directly in the middle
> of other code, without having to use a function call to access it.
> It's doing a runtime linking after the program has already launched.

People have been doing runtime linking with functions for ages.
On the system I am using now (not C) I can redefine any
function on the fly and the system will use it. When the system
stops on an error I can redefine a function from the debugger
and continue execution. However, that last ability
is much less useful to me than you might think: after an
error the data structures are normally wrong. So it
saves me time to start from the beginning rather than to think
about whether by chance the data structures are intact and
continuing is safe. I could also fix the data structures
from the debugger, but restarting is faster...

If you look at functional programming you will notice
that people use functions to "inject" code where it is
needed, exactly like you want to use your blocks.

You may wish to look at the Oberon system: its technical
goals are very much as you describe for CAlive, and it is a
small and well-documented system. Or at least think about
why people are using C while Oberon is almost
forgotten.

OK, I see that you really want to implement this
micro-optimization. So do it and tell us if it
was worth the effort.

--
Waldek Hebisch

Rick C. Hodgin

Jan 27, 2017, 7:18:48 PM
On Friday, January 27, 2017 at 5:41:50 PM UTC-5, anti...@math.uni.wroc.pl wrote:
> OK, I see that you really want to implement this
> micro-optimization. So do it and tell us if it
> was worth the effort.

It's more about taking a natural ability made available by the compiler,
constrained only by the protocol of building functions, removing
that constraint so the code can be injected without a function
call, and proceeding forward with the rest of my life.

I'll let you know how it turns out.

Rick C. Hodgin

Jan 27, 2017, 7:37:38 PM
On Friday, January 27, 2017 at 5:41:50 PM UTC-5, anti...@math.uni.wroc.pl wrote:
> Rick C. Hodgin <rick.c...@gmail.com> wrote:
> > With the block {..}, you're creating a stand-alone piece of code
> > which has a rigid and expected input protocol, and a rigid and
> > expected output protocol. It generates optimized code for its
> > own block, and its own use of registers, but cannot make more
> > general or global assumptions about whatever it will wind up
> > running in at some later point, but the idea is that the gains
> > made by saving the spill + fill and function call are more than
> > worth the trade off in doing it this way.
>
> Well assume that it work nicely. Now bad guy comes and
> implement function calls so that he adds an extra argument
> for return address to the block, and puts branch to the
> code implementing block in the call space. For return
> after block code he adds jump back to return addres.
> Assuming that the protocol involves stack a single
> call instruction will put return address on the stack
> and transfer control. A single ret will clean up the
> stack and transfer control back. So compared to normal
> call your block have saving of two instructions per
> call.

There's more to the idea than just this software protocol. I call
it "SMC blocks," and it relates to a new hardware function which
allows thread-specific blocks to be overlain atop the original
instruction code, without actually altering the code, but by adding
the blocks by protocol at runtime. The blocks are like stickers
which are temporarily put on top of the code with new instructions,
and are then removed when the SMC block is deleted or goes out of
scope (as they are stack sensitive).

Hardware Architecture and Self-Modifying Code
https://groups.google.com/d/msg/comp.arch/-T_Gam9_TvY/xZ4hyl1FAAAJ

> Extra remark: IIUC you want to copy code to a buffer.
> Copied code will have variable length, so AFAICS
> you will need jump back (you can not fall-trough the
> end of buffer because there may be junk after code
> of block). You may arrange code of function using
> the block so that it falls trough the start of buffer,
> but that is not possible if you want to use it more than
> once (unless you use several buffers). Also, in multithreaded
> case you will want buffer separate from function area.
> So quite possible that you will have jump to transfer
> control to the block and jump to transfer control
> back: exactly the two jumps that you are trying to
> save.

The hardware overhead between a jump and call is somewhat different.
Unconditional jumps are able to be traversed in the pipeline (that
assumes no memory stalls).

The idea is you would always inject code into a fixed size buffer
of relatively small/average size for the task at hand. If the code
required is larger than the buffer, inject a standard call to your
function, or inject a custom block with prologue code, a call to
the block which is also modified to contain some epilogue to return,
making it essentially a function call as it is today.

Robert Wessel

Jan 28, 2017, 12:13:44 AM
On Fri, 27 Jan 2017 16:37:31 -0800 (PST), "Rick C. Hodgin"
<rick.c...@gmail.com> wrote:

>
>The hardware overhead between a jump and call is somewhat different.
>Unconditional jumps are able to be traversed in the pipeline (that
>assumes no memory stalls).


There is little difference in performance for most processors between
static calls (and the associated returns) and jumps. OTOH, indirect
jumps (IOW via an address in a register or some such, as commonly
happens with function pointers) do often have pipeline hits.

Robert Wessel

Jan 28, 2017, 1:10:14 AM
On Fri, 27 Jan 2017 12:42:15 -0800 (PST), "Rick C. Hodgin"
<rick.c...@gmail.com> wrote:

>On Friday, January 27, 2017 at 3:28:43 PM UTC-5, anti...@math.uni.wroc.pl wrote:
>> Above you describe (one of several possible) implementations of
>> inline functions. But I did not ask about implementation.
>> My question was about semantics...
>
>Inlining is not something that a user has control over in the
>binary expression. The compiler will take a function and inline
>it and potentially completely refactor the way it works internally
>to meet the optimization needs of whatever it's inlined into. An
>inlined function in func1() may look completely different in its
>assembly language expression, when compared to an inline function
>in func2() because it was able in func2() to make use of optimizations
>not possible in func1(), or vice-versa.


That specialization is, of course, the major advantage of inlining
code, not the trivial overhead of the call itself (except in the case
of very short functions).


>With the block {..}, you're creating a stand-alone piece of code
>which has a rigid and expected input protocol, and a rigid and
>expected output protocol. It generates optimized code for its
>own block, and its own use of registers, but cannot make more
>general or global assumptions about whatever it will wind up
>running in at some later point, but the idea is that the gains
>made by saving the spill + fill and function call are more than
>worth the trade off in doing it this way.
>
>The purpose of creating block {..} code blocks isn't to use them
>as functions. It's to generate a thing which can be wielded to
>the place where you need it, injected there directly in the middle
>of other code, without having to use a function call to access it.
>It's doing a runtime linking after the program has already launched.
>
>In addition, without including the code for qsort(), you cannot
>inline its compare callback function. The qsort() has to issue
>the callback because it's typically located in an external library
>that knows nothing of the app it will ultimately be used in. And
>where there is some allowance for statically linked files, it's
>unlikely to be inlined because you'd have multiple qsort() copies
>being made, each with their own peculiarities.


Which is an implementation detail. First, the library needs to be
distributed in source form, or the library needs to be distributed in
whatever intermediate format gets used when LTCG is specified. Then
the compiler just has to be willing to inline aggressively across
translation unit boundaries at LTCG time.

In any event, there's nothing wrong with multiple specializations of
(say) qsort.

For the following, I implemented "inssort()", an insertion sort, for
simplicity, using the same parameters as the library qsort.

MSVC, for example, can be prodded into doing that fairly easily if the
source for both functions (say inssort and the comparison function)
is in the same source module, and then into generating multiple
specializations of inssort for different calls to the outer routine.
He can also specialize an inssort with a comparison function from a
different translation unit, but he falls back to not duplicating
inssort if there are calls with two different comparison functions.
And even with LTCG there are few ways to prod him into more
aggressive inlining.

So he can clearly do all that, he just isn't. A bit more
aggressiveness in deciding to do so, and he'd have it (or a stronger
version of the __forceinline attribute that he paid attention to at
LTCG time).


>In the case of a qsort_block(), however, it would be designed in
>such a way that it could make use of the external block() to save
>the overhead of the callback. And then you get into multiple
>places in programming where, having this generic ability to inject
>binary code into already existing algorithms to optimize them even
>more, allows for possibilities that don't exist today.
>
>I would like to see this ability added to C and C++. It would
>allow blocks of binary code to be constructed, and then able to
>be referenced like data at runtime, copied where they can be
>injected into other binary blocks which have allocated space
>for their insertion (and if they're larger than the allocated
>space, have it go ahead and perform the call with epilogue and
>prolog for the block {..} code done in the original.


As above, avoiding just the call throws away much of the value of
inlining, and compilers can already perform essentially that sort of
cross-translation unit inlining with LTCG, even if you can't convince
them to do it in all cases.

BartC

Jan 28, 2017, 6:39:58 AM
> qsort_block(&data[0], count, data_size,
> my_qsort_compare, sizeof(my_qsort_compare));
>

The nearest to C in the way it's used would be if my_qsort_compare() was
a normal function. Then my_qsort_compare would be a pointer to that
function. In that case:

* The declaration of qsort_block would contain a signature for
my_qsort_compare giving the parameter and return types

* There would be no need for sizeof

* The function in question could equally be a pre-compiled block of
code perhaps in another module, the source for which is not visible
(so unlike macros or templates)

* The invocation of my_qsort_compare inside qsort_block would be
just like a regular function call

You haven't shown a full example of the equivalent using these code blocks.

Presumably you intend that the [relocatable] machine code comprising a
particular my_qsort_compare argument (as it could be different on each
call) is physically copied into some memory owned by qsort_block, and
somehow linked in to its code so that it can 'step into' the execution
of it rather than have to jump to or call the start of it.

There is a parameter interface defined in terms of registers.

But all this so that you can avoid call/return (or goto/goto, perhaps
via label pointers, which is another way of doing it), a /bit/ like
inline functions.

However with inline functions, the compiler will usually have sight of
the source code for the function, so that it could be more closely
merged into the caller's code - parameter passing overheads can be
eliminated; local variables can be combined; constant arguments can be
folded into the caller's code... a lot of stuff can be done.

I think you need to synthesise some code so that it works the way you
intend these blocks will work, and compare performance with a normal
function-pointer call.

That's assuming the whole point of this is to save that call/return
hit, as I can't see the advantage from a coding point of view.

--
Bartc

Rick C. Hodgin

Jan 28, 2017, 8:43:41 AM
On Saturday, January 28, 2017 at 6:39:58 AM UTC-5, Bart wrote:
> I think you need to synthesise some code so that it works the way you
> intend these blocks will work, and compare performance with a normal
> function-pointer call.

Good idea! I shall do so, Mr. Bart, as I'm genuinely curious myself.

Rick C. Hodgin

Jan 28, 2017, 11:57:09 AM
I took qsort.c from this address (no idea if this is an efficient
implementation or not, it's just the first pure-C version I could
find that compiled without error first try):

https://raw.githubusercontent.com/01org/linux-sgx/master/sdk/tlibc/stdlib/qsort.c

I wasn't able to modify it quite like I had hoped because it uses
the callback compare function a half dozen times or more per test,
and it's used in multiple places. So, I instead inlined my compare
function for this test.

I'll create another test where I create a macro to simulate the
overlay ability in multiple places.

Results for 100K, 10K, 1K element int array sorted 50, 500, 5K times:

-----
32-bit: 100K Native best = 469, average = 667
100K Inline best = 313, average = 354
100K Block best = 327, average = 367

10K Native best = 482, average = 543
10K Inline best = 247, average = 303
10K Block best = 233, average = 333

1K Native best = 374, average = 402
1K Inline best = 173, average = 236
1K Block best = 141, average = 245

-----
64-bit: 100K Native best = 499, average = 581
100K Inline best = 330, average = 369
100K Block best = 328, average = 358

10K Native best = 422, average = 488
10K Inline best = 250, average = 320
10K Block best = 265, average = 301

1K Native best = 267, average = 344
1K Inline best = 218, average = 254
1K Block best = 157, average = 220

Source code compiled in Win7 using Visual Studio 2015 Version
14.0.25431.01 Update 3, Microsoft .NET Framework 4.6.01055, on
a Dell Latitude E6530 (Intel Core i7-3520M 2.9 GHz).

Source Code: http://pastebin.com/pJB7n1Eu

Note: I originally had a much different reporting format and
had edited the comments at the top with the results.
I forgot to change them before posting. The correct
results are above.

Note: Also, this may not be the most efficient way to do this.
I want to write an assembly version, but it requires
intimate knowledge of how each compilation takes place
as the registers in use change. CAlive would be able
to know this, or use an established protocol.

-----
The results show that by using blocks you do not need to have compile-
time optimizations on the function. You can, for example, create a
static compile that is then overlain with key components being updated
with the block {..} code, and achieve nearly the same gains as you can
with a full inline compile.

BartC

Jan 28, 2017, 12:52:22 PM
I got this with gcc -m64 -O3:

100K Native best = 343, average = 382
100K Inline best = 361, average = 392
100K Block best = 373, average = 391

10K Native best = 281, average = 339
10K Inline best = 278, average = 328
10K Block best = 266, average = 324

1K Native best = 202, average = 251
1K Inline best = 221, average = 267
1K Block best = 203, average = 247

And with gcc -m32 -O3 (and #define GetTickCount64 GetTickCount):

100K Native best = 406, average = 433
100K Inline best = 358, average = 472
100K Block best = 281, average = 385

10K Native best = 326, average = 360
10K Inline best = 312, average = 359
10K Block best = 328, average = 360

1K Native best = 221, average = 270
1K Inline best = 219, average = 268
1K Block best = 231, average = 279

If 'Block' mode is supposed to emulate what your proposal does, then
this is not very convincing. Sometimes it will be a load faster, but
usually it is about the same or even slower.

But I don't know if it is equivalent: the compiler can see the source
for the 'block' code, and can make use of that. The way you're planning
to do it (and I don't think we've seen the invocation code yet where the
block is passed to a function), it will be an anonymous block of machine
code which has to be patched in at runtime.

I think, if you're going to be using such patching, then why not patch
directly to a normal function? Then it will execute 'call F' instead of
'call R0'. Some ABIs will already use registers for parameters. You
don't need to block-copy machine-code. And it will be less work overall.
You may not even need an explicit new feature, so no code changes either.

--
Bartc

Rick C. Hodgin

Jan 28, 2017, 1:07:57 PM
Bartc wrote:
> I think, if you're going to be using such patching, then why
> not patch directly to a normal function?

The goal is not to call, but to integrate usable code from a published
library where the innards of even the compiled code are known,
allowing easy insertion with near-optimal results, without
needing the source code to achieve it, just knowledge of the
source code.

I've had some extended ideas about this concept. I will
post more about it later.

Rick C. Hodgin

Jan 29, 2017, 2:34:37 AM
I found a flaw in my posted code. The bottom of the myqsort_block() function
needs to call itself. Instead, it's calling myqsort_inline() due to a copy/
paste bug.

Rick C. Hodgin

Jan 29, 2017, 2:56:18 AM
On Sunday, January 29, 2017 at 2:34:37 AM UTC-5, Rick C. Hodgin wrote:
> I found a flaw in my posted code. The bottom of the myqsort_block() function
> needs to call itself. Instead, it's calling myqsort_inline() due to a copy/
> paste bug.

Changing the compare function to:

    int compare(const void* l, const void* r)
    {
        if (*(int*)l < *(int*)r)        return(-1);     // Left is less
        else if (*(int*)l > *(int*)r)   return(1);      // Left is greater
        else                            return(0);      // Equal
    }


Changing around line 364, in myqsort_block(), near the end:
- myqsort_inline(a, r / es, es);
+ myqsort_block(a, r / es, es);

-----
New results and running on batteries instead of AC power:

64-bit:
100K Native best = 467, average = 519
100K Inline best = 281, average = 312
100K Block best = 295, average = 307

10K Native best = 390, average = 436
10K Inline best = 232, average = 273
10K Block best = 265, average = 294

1K Native best = 267, average = 333
1K Inline best = 171, average = 218
1K Block best = 124, average = 215

32-bit:
100K Native best = 531, average = 587
100K Inline best = 277, average = 308
100K Block best = 312, average = 380

10K Native best = 467, average = 503
10K Inline best = 234, average = 288
10K Block best = 279, average = 310

1K Native best = 310, average = 359
1K Inline best = 189, average = 229
1K Block best = 202, average = 237

David Brown

Jan 29, 2017, 8:06:04 AM
It is typically not something you control explicitly, although you might
be able to influence it with pragmas, attributes, etc. Usually it is
something the compiler will do based on optimisation settings and
possibly profile information.

A compiler typically deals with a standard ABI or calling convention.
In the DOS/Windows x86 world, there is no standard (or a multitude of
standards, depending on your viewpoint) with different tools having
different ideas, and often having multiple variations within the same
tool. On almost every other platform (cpu/OS combination), there is a
well-defined standard ABI that all tools follow. But with functions
that the compiler has full control over, it does not need to follow
those conventions.

As a simple case, an ABI may say that registers r0..r5 are "volatile",
or "caller preserved". That means that a function can freely change
these registers - if the caller function wants to keep the values that
are in those registers over the function call, it must preserve them
itself (typically on the stack). But if the called function does not
actually /use/ those registers, this stack/unstack is wasted. If the
compiler is generating the code for the caller and the callee functions
at the same time, it knows that the caller can safely skip the
stack/unstack.

In another case, the compiler may know that one of the parameters for
the called function is always the same constant value - because it knows
every piece of code that calls that function. This information can be
used to eliminate the parameter entirely, and generate a more efficient
called function. Perhaps the compiler can see the parameter is always
one of two different constant values - and decides to make two "clones",
each optimised for a specific constant value.

With these sorts of optimisations, compilers can end up re-arranging
code to give almost completely different object code structure from the
original source code.

>
> I know of no way in any C compiler (except for one which came with
> the book "Developing Your Own 32-bit Operating System" by Richard
> Burgess from back in the 1990s, which was simply a translating tool
> which took C code and generated assembly source code) to have it
> generate code that has a generic input protocol, and a generic
> output protocol, and is relocatable.
>
> I am aware of many implementations that generate functions which
> require a calling convention using registers or the stack, and
> explicitly return from the call.
>
> Can you give me an example in GCC or MSVC that produces code which
> can be inserted generically into the middle of large block of NOPs,
> for example, which then conduct work provided the input is correct,
> and the output is specified?
>

That would be an almost totally useless idea, so I can't think of any
reason to do it. A large block of NOPs is expensive - it is a waste of
cache space, pre-fetch buffers, and decoding time even though the
operations do nothing. A simple call/return pair is pretty cheap on
most processors, especially in cases where the compiler can see both
sides at once and can avoid unnecessary stacking or movement of
registers. And replacing the NOPs at run-time would play havoc with
instruction caches and buffers - self-modifying code is extremely
inefficient.

A replaceable /short/ sequence of NOPs is sometimes useful for
debugging, as a way to patch or replace an existing function.


David Brown

Jan 29, 2017, 8:15:36 AM
This looks like you are compiling your "block" as a normal function, just
skipping the "return" instruction at the end. And inside your "caller"
function, you are replacing the normal "call my_qsort_compare" with a
copy of that generated machine code, excluding the return instruction.

Is that about right?

Given that normal call/return (with static addresses) is basically free
in most cpus because it is swallowed early in the decode pipeline, what
is the gain? And you completely miss out on the useful part of inlining
functions, which is that the optimiser can use the caller function
context and parameters to generate a more efficient version of the
inlined code.



Rick C. Hodgin

Jan 29, 2017, 8:17:50 AM
David Brown wrote:
> [snip]

You're responding to what is now a somewhat stale idea. As I indicated
to Bart, I've had some extended ideas about this concept. I will post more
about it later. My mind is contemplating possibilities.

Kenny McCormack

Jan 29, 2017, 8:30:52 AM
In article <76e3dce5-0650-4ccb...@googlegroups.com>,
Rick C. Hodgin <rick.c...@gmail.com> wrote:
>David Brown wrote:
>> [snip]
>
>You're responding to what is now a somewhat stale idea. As I indicated
>to Bart, I've had some extended ideas about this concept. I will post more
>about it later. My mind is contemplating possibilities.

Actually, the defining characteristic of religious nutters is that their
minds do *NOT* contemplate any new possibilities. Their whole mindset is
set in the Stone (or is it the Iron?) Age, and can never change.

--
1/20/17: A great day for all those people who are sick of being told they don't know
how to spell "you're" (or "there").

Rick C. Hodgin

Jan 29, 2017, 8:42:56 AM
Kenny McCormack wrote:
> [snip]

I am sad for you, Kenny. Were you to die today, you would weep fountains
over what you've thumbed your nose at. Your soul would cry out fully
for what you've mocked, but it would be too late. Your sin and
rebellion would consume you.

I pray that changes so you can know real love, real peace, real forgiveness.

Rick C. Hodgin

Jan 31, 2017, 9:40:29 AM
On Thursday, January 26, 2017 at 4:30:42 PM UTC-5, Rick C. Hodgin wrote:
> Does C have any kind of syntax which allows code to be constructed that
> can be used for generic purposes, but without following the standard
> prologue and epilogue code through a function?
>
> I'm thinking I want to design a block of code like this:
>
> block xyz
> {
> int a, c;
>
> input b; // By block {..} protocol, uses an input register
>
> a = myfunc1();
> c = a + b;
>
> output c; // By block {..} protocol, uses an output register
> }
>
> [snip]
>
> In the following example, I am able to take a defined block and
> copy it into another function so as to prevent the dispatch of
> the callback function call:
>
> block my_qsort_compare
> {
> input left as SStruct1*, right as SStruct1*;
>
> if (left->value > right->value) result = 1;
> else if (left->value < right->value) result = -1;
> else result = 0;
>
> output result;
> }
>
> Then in a modified qsort_block() version, I could provide parameters:
>
> qsort_block(&data[0], count, data_size,
> my_qsort_compare, sizeof(my_qsort_compare));
>
> The qsort_block() function copies in the block of code so that it
> doesn't have to issue function call dispatches, but inlines the
> logic through some kind of established input and output protocols
> where parameters enter and exit knowingly without going through
> the stack, but reference wherever they already are in the specific
> instance use.
>
> Is there a way to do something like this in C?

I've had an extended idea about this. The idea as above has been
planned for implementation in hardware, so that each thread can make
in-flight, runtime corrections to local code, allowing overlays to be
used which optimize published algorithms based on the dynamics of the
runtime environment.
That will all work fine and it is planned for my Arxoda CPU design.

The extended idea I've had is to take the concept of creating each
algorithm back a step, and leave it at a point of purposely
incomplete compilation (even in standard library form) so that the
final stages of compilation can be completed at runtime, with a nod
given to instance loads.

The idea is to have an algorithm broken out into pre-compiled template
components. The algorithm is fully coded, but it is missing key or
crucial sections; the unique features of the runtime environment are
needed to fill in those spots with the code that will make them run
optimally for that need.

In this way, at compile time the code is compiled to a series of parts.
Each part is then stored in the binary or library. The compiler adds
startup code which then assembles those parts into a final form based
on the dynamic needs of the runtime environment, even doing so in a
lazy form, such as when an unassembled algorithm is called the first
time, it then assembles itself into an instance for that call. That
instance then persists so that future references to it are dispatched
properly without the later need for assembly.

By creating the compiled code portions of an algorithm in steps, with
only the key missing components put into place dynamically at runtime,
standard dynamic link libraries can be constructed which are able to
be optimized with the local code at runtime, with only a minimal hit
on load, or on first use.

I'll create a simple example of this at some point and publish it.
But in your mind, think of qsort(). It requires logic to navigate
the list. That logic does not change. But, the testing criteria
within does change. So, rather than creating a single qsort()
algorithm with a callback, or creating a qsort_block() algorithm
with padding to allow the insertion of code, create a qsort_un()
algorithm which is unassembled, which contains pointers to the
various parts which can be assembled. Then, in the source code
which references qsort_un() it has instance information based on
that which needs to be injected (as provided for by the IDE + CAlive
compiler, which informs the user of what's needed to make it work,
as by published specs of qsort_un().

In code it might be used like this:

    qsort_un(data, count, size) {
        connect left as int*;
        connect right as int*;

        return (*left < *right ? -1 : (*left > *right ? 1 : 0));
    }

Then at runtime this code, which was also pre-compiled, can be
linked to the qsort_un() algorithm, inserting that block at all of
the connects and returns for the runtime instance. The compiler
could even have left the register selection in flux, so that it
could be determined based on the runtime instance needs of the
algorithm, so it could be better optimized right then.

I see real potential for this idea in terms of breaking down algorithms
to more than mere functions or exportable stable/solid components.
We can now have algorithms broken out into fundamental blocks, able
to be assembled dynamically at runtime into a form that is more
optimized for the runtime needs of that particular instance of the app,
based on the particular data it's processing (perhaps the particular
qsort_un() block above was never traversed; in a lazy model it would
never have been assembled, so only the other qsort_un() blocks would've
been created). And so on.

In any event, I think I'm going to incorporate this ability into
CAlive. I'll probably call it "incoding" (meaning it's in the
"process of being coded, but is not yet complete").

An incoded algorithm can be published (like qsort_un()), which allows
other algorithms to incode themselves to that reference during editing,
so they are compiled with that known reference, their connects, and
returns (plural), so that they are all connected at runtime to create
the final form.

This will move a small portion of the compiler and linker into the
executable. However, with a proper binary loader for this new ABI,
that portion will not be in the executable, just a small table of
information the loader will read which then directs the connecting
process at launch, and also at runtime for lazy models.

David Brown
Jan 31, 2017, 10:13:35 AM
On 31/01/17 15:40, Rick C. Hodgin wrote:
> On Thursday, January 26, 2017 at 4:30:42 PM UTC-5, Rick C. Hodgin wrote:

<snip>

Doing all this at run-time with copy-and-paste of pre-generated code
blocks hobbles your compiler's chances of optimisation, and it makes
your system /much/ more complicated.

There are four ways that I can think of to get basically the same effect:

1. Use function calls. Your qsort "algorithm" function can call the
appropriate "compare" function via a function pointer. Function call
overhead is not going to be much more than a cut-and-paste binary copy
would be - the overhead is mainly in moving data into and out of the ABI
registers, as the call and return are handled early in the processor's
pipeline.

2. Use proper compile-time code generation and optimisation, basically
like C++ templates. That lets your compiler generate better code -
sometimes /much/ better code, if the actual compare "function" is
trivial in assembly.

3. If you need run-time patching, put a /small/ group of NOP's at the
start of each generated function which can be replaced by a
jump-to-patch instruction. I believe MSVC does this. This gives you a
simple and consistent method for replacing code at a function level.

4. Rather than making half-baked ideas that can be used in a few
specific circumstances with complicated load-time and run-time code,
make a JIT compiler.


Rick C. Hodgin
Jan 31, 2017, 10:38:53 AM
On Tuesday, January 31, 2017 at 10:13:35 AM UTC-5, David Brown wrote:
> On 31/01/17 15:40, Rick C. Hodgin wrote:
> > On Thursday, January 26, 2017 at 4:30:42 PM UTC-5, Rick C. Hodgin wrote:
> > <snip>
>
> Doing all this at run-time with copy-and-paste of pre-generated code
> blocks hobbles your compiler's chances of optimisation, and it makes
> your system /much/ more complicated.

The "system" only has to be designed once, and then we can expose these
abilities through GUI interfaces which make it easy for human beings.

I envision a GUI where windows pop-up to the side with inputs to funcs
broken out line-by-line. We can change them in source code, or through
the mini-forms interfaces. We can also take complex statements and view
them through tools which visualize them in various ways, breaking out
logic visibly, and so on.

Our GUIs are only getting better, and they should become the key
component of doing software development.

> There are four ways that I can think of to get basically the same effect:
>
> 1. Use function calls. Your qsort "algorithm" function can call the
> appropriate "compare" function via a function pointer. Function call
> overhead is not going to be much more than a cut-and-paste binary copy
> would be - the overhead is mainly in moving data into and out of the ABI
> registers, as the call and return are handled early in the processor's
> pipeline.

I proved earlier in this thread that there is a notable performance
increase from inlining, and a nearly commensurate performance increase
is also shown by directly injecting code into the stream without
inlining (though the actual test would require more of what I propose
here, with connects and returns and an incoding model).

> 2. Use proper compile-time code generation and optimisation, basically
> like C++ templates. That lets your compiler generate better code -
> sometimes /much/ better code, if the actual compare "function" is
> trivial in assembly.

Incoding is a new way to do a similar thing. And with a proper design,
there will be no great loss of optimization ability, only minor edges
which, if required, can be overcome through direct inlining. But, any
developer who cares that much about performance will know that and the
tools and code will be all "components" to them anyway.

> 3. If you need run-time patching, put a /small/ group of NOP's at the
> start of each generated function which can be replaced by a
> jump-to-patch instruction. I believe MSVC does this. This gives you a
> simple and consistent method for replacing code at a function level.

I essentially have that above with block {..} abilities. However, it's
not really helpful in a multi-threaded environment unless you're hard-
replacing an entire function with a new one.

What I'm talking about are creating instances of each one (like C++'s
templates) which are tailored to their need, but only when needed,
and only at run-time. This gives the effect of every possible
template having been created at compile-time and then used at runtime,
but without actually creating every one: only the incode function is
created, missing a few key parts which are then inserted at runtime.

> 4. Rather than making half-baked

I don't think it is a half-baked idea, David, and I resent the
implication. It is a rather comprehensive idea that spans across
hardware and software to the system I am envisioning.

> ideas that can be used in a few
> specific circumstances with complicated load-time and run-time code,
> make a JIT compiler.

I think JITs are an incorrect approach as an overall solution. I
think JITs are great for many things, but I'm thinking of something
a little deeper here, a migration of the toolset from single points
in the software scheme, into broader coverage in both software and
hardware.

Rick C. Hodgin
Jan 31, 2017, 10:50:41 AM
On Tuesday, January 31, 2017 at 10:38:53 AM UTC-5, Rick C. Hodgin wrote:
> What I'm talking about are creating instances of each one (like C++'s
> templates) which are tailored to their need, but only when needed,
> and only at run-time. This allows something like every possible
> template to be created at compile-time, and then used at runtime,
> but without creating every possible template, but only creating the
> incode function which is missing a few key parts which are then
> inserted at runtime.

To be more clear about this point, it's like creating every possible
template at compile-time, and sticking it in a static library that
can be linked to at runtime. However, it does so without creating
every possible template, just the general template, which then has
the key components added which make its instance unique.

It's a minor savings, but it's a much more flexible and powerful
feature, in that less code does more with minimal loss compared to
a truly optimal presentation.

David Brown
Jan 31, 2017, 4:05:09 PM
Inlining provides /far/ more possibilities for optimising than directly
injecting pre-compiled code. The best you can do for comparison is
perhaps to use gcc with optimisation disabled (even though you can't
disable /all/ optimisation) and compare the speed when the "compare"
function is marked __attribute__((always_inline)) against no marking.

>
>> 2. Use proper compile-time code generation and optimisation, basically
>> like C++ templates. That lets your compiler generate better code -
>> sometimes /much/ better code, if the actual compare "function" is
>> trivial in assembly.
>
> Incoding is a new way to do a similar thing.

I agree it is a new idea - and it is vaguely similar. I just don't see
any advantages, but many disadvantages.

I think it is great that you are thinking of new ideas and suggesting
them - I don't want to sound too negative here. Just don't make the
mistake of thinking that "new" is inherently "better". What you are
suggesting would be perfectly possible as an optimisation for an
existing C compiler - no change to the language would be needed to get
pretty much the same effect. But compilers don't implement it - they
implement inlining, because it is /better/.

Note that this is in regard to the implementation of your idea. As for
the syntax - it is different from C inline functions, and different from
C++ templates. Personally, I am not a fan - but I am not a fan of most
of your new syntax ideas. That's okay, of course, since you are not
making the language for me - if you think the syntax works for you and
your CAlive users, that's fine. Your new syntax could be implemented as
traditional inlining, or as "code injection" - I think inlining would be
a great deal better.

> And with a proper design,
> there will be no great loss of optimization ability, only minor edges
> which, if required, can be overcome through direct inlining. But, any
> developer who cares that much about performance will know that and the
> tools and code will be all "components" to them anyway.
>
>> 3. If you need run-time patching, put a /small/ group of NOP's at the
>> start of each generated function which can be replaced by a
>> jump-to-patch instruction. I believe MSVC does this. This gives you a
>> simple and consistent method for replacing code at a function level.
>
> I essentially have that above with block {..} abilities. However, it's
> not really helpful in a multi-threaded environment unless you're hard-
> replacing an entire function with a new one.
>
> What I'm talking about are creating instances of each one (like C++'s
> templates) which are tailored to their need, but only when needed,
> and only at run-time. This allows something like every possible
> template to be created at compile-time, and then used at runtime,
> but without creating every possible template, but only creating the
> incode function which is missing a few key parts which are then
> inserted at runtime.
>
>> 4. Rather than making half-baked
>
> I don't think it is a half-baked idea, David, and I resent the
> implication. It is a rather comprehensive idea that spans across
> hardware and software to the system I am envisioning.

You have got as far as a discussion in a Usenet thread, thinking of new
additions or changes all the way. "Half-baked" is an exaggeration - it
is not nearly as complete as that for now. But it is not an insult - a
new concept has to pass through "half-baked" on its way to specification
and implementation. And the "half-baked" stage is exactly the point at
which you need to collect ideas and opinions and try to figure out if it
really is a good idea and worth carrying through.

You are very quick to leap from initial "brainwave" to deciding that the
idea is the best thing since sliced bread, and will revolutionise
programming. This is not a good thing. You should understand that
designing a language is a long-term project, and for every good idea
that makes it into the final system you will have two dozen ideas that
sound good at the start, but don't make the final cut.

Rick C. Hodgin
Jan 31, 2017, 4:18:20 PM
On Tuesday, January 31, 2017 at 4:05:09 PM UTC-5, David Brown wrote:
> [snip]

I appreciate your input over the years, David.

BartC
Jan 31, 2017, 4:46:21 PM
On 31/01/2017 21:18, Rick C. Hodgin wrote:
> On Tuesday, January 31, 2017 at 4:05:09 PM UTC-5, David Brown wrote:
>> [snip]
>
> I appreciate your input over the years, David.

That sounds more like a curt dismissal of all his input!

David Brown
Jan 31, 2017, 5:00:25 PM
It's not the first time. I keep trying to give Rick different ways to
think about things, but if my suggestions are too much in contradiction
to what he has already decided to be the truth, they just bounce off.

It stops things going round in circles, however, so there is merit in it.

Rick C. Hodgin
Jan 31, 2017, 5:33:12 PM
Bart wrote:
> Rick C. Hodgin wrote:
> > I appreciate your input over the years, David.
> That sounds more like a curt dismissal of all his input!

That's funny because it was about my 4th draft. :-) And, I wrote it for
this very reason: "It stops things going round in circles."

And, I do appreciate David's input. I just disagree with almost
all of it. And, I don't want to argue. I just want to complete the
product and then people can see it all firsthand, thereby removing
any potential misunderstandings.

I have had most of these thoughts about hardware and software
design since the 90s. They are coalescing now in spurts, but not
all of them are new. Just newly published.