Hi Dzemal. This newsgroup is pretty much dead, which is why you aren't
getting any responses. Try posting in news:comp.lang.asm.x86 - that's a
moderated group, and you *might* have to cc clax-...@crayne.org to
get through. Also news:alt.lang.asm is a lot more active than this one,
tho not limited to x86.
I don't think I'll be able to help you very much.
> Ok, here's the c++ function:
In the first place, I don't know C++ :) (but this doesn't look too
tough)
> inline void _stdcall SetPixel(int y,int x,char color)
> {
> __int16 q=y>>1;
> (pBitmapRow[x])[q] &= 0xF << (y & 1 ? 4 : 0);
> (pBitmapRow[x])[q] |= color << (y & 1 ? 0 : 4);
> }
In the second place, I'm not sure I understand what we're doing here.
What video mode is this? (I only know the really simple-minded ones) It
*looks* to me like maybe a 16-color mode - that is, 4 bits per pixel. Is
that right? I think I'm confused, because it looks to me that while we
set the color of an "odd" pixel, we zero out the "even" pixel, and vice
versa. I think I'm missing something - no great surprise. Wait, we're
basing this on whether the *row* is odd or even. I'm just confused.
It's really important to understand exactly what we *need* to do here,
because the first step is to optimize the algorithm, before we even
*think* about asm. A crappy algorithm implemented in exquisite assembly
is still crappy code!
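If the guess above is right (4 bits per pixel, two pixels per byte), the packing can be sketched in C++ roughly like this. PutNibble and GetNibble are made-up names, and "row" stands in for one pBitmapRow[x] buffer; the nibble order (even y in the high nibble, odd y in the low nibble) is read off the posted SetPixel, so treat it as an assumption about the real bitmap.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not from the original post.
// Two 4-bit pixels share one byte: even index -> high nibble, odd -> low nibble.
inline void PutNibble(std::uint8_t* row, int y, std::uint8_t color)
{
    assert(color < 16);                 // only 4 bits fit in a nibble
    std::uint8_t& b = row[y >> 1];      // the byte holding pixel y
    if (y & 1)
        b = static_cast<std::uint8_t>((b & 0xF0) | color);        // keep high, set low
    else
        b = static_cast<std::uint8_t>((b & 0x0F) | (color << 4)); // keep low, set high
}

inline std::uint8_t GetNibble(const std::uint8_t* row, int y)
{
    return static_cast<std::uint8_t>((y & 1) ? (row[y >> 1] & 0x0F)
                                             : (row[y >> 1] >> 4));
}
```

Seen this way, the AND keeps the *other* pixel's nibble and clears ours, and the OR then drops the new color into the cleared nibble, so neither pixel gets zeroed by accident.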
> VC++ disassembly window (debug build of course therefore unoptimized) says
> this translate to:
Can't you get VC++ to spit out asm for an optimized compile?
> mov eax,dword ptr [ebp+0Ch]
> sar eax,1
> mov word ptr [ebp-4],ax
Okay, this is just calculating "q" and storing it in a temporary
variable. I'm not sure we need to do this at all (store it, that is...).
I think we'd be better off if we hadn't made it int16, too - the
processor is generally happier operating on its "native" size.
> movsx edx,word ptr [ebp-4]
Move "q" into edx - we do this repeatedly, and I don't think we need to.
I think we could just use edx in the first place, and leave it there.
> mov eax,dword ptr [ebp+10h]
"x"
> mov eax,dword ptr [eax*4+4369E8h]
"*x", if I understand it (that is, pBitmapRow[x])...
> mov ecx,dword ptr [ebp+0Ch]
"y"
> and ecx,1
> neg ecx
> sbb ecx,ecx
> and ecx,4
Make cl either 4 or 0 ...
> mov ebx,0Fh
> shl ebx,cl
Make ebx either 0F0h or 0Fh...
> mov cl,byte ptr [eax+edx]
Get our "destination" byte.
> and cl,bl
Mask out just the nibble we want (whyever we want it! :)
> movsx edx,word ptr [ebp-4]
Get "q" into edx - wasn't it already there?
> mov eax,dword ptr [ebp+10h]
> mov eax,dword ptr [eax*4+4369E8h]
"x" and "*x"
> mov byte ptr [eax+edx],cl
Move our "anded" byte back to "destination".
> movsx edx,word ptr [ebp-4]
Get "q"... again.
> mov eax,dword ptr [ebp+10h]
> mov eax,dword ptr [eax*4+4369E8h]
"*x"
> movsx ebx,byte ptr [ebp+14h]
Color into ebx. (bl is all we really need)
> mov ecx,dword ptr [ebp+0Ch]
> and ecx,1
> neg ecx
> sbb ecx,ecx
> and ecx,0FCh
> add ecx,4
Make ecx our shift count depending on "odd" or "even"...
> shl ebx,cl
Shift the color - or not.
> mov cl,byte ptr [eax+edx]
Our destination byte.
> or cl,bl
Or it with our shifted color.
> movsx edx,word ptr [ebp-4]
Get "q", in case we mislaid it.
> mov eax,dword ptr [ebp+10h]
> mov eax,dword ptr [eax*4+4369E8h]
"*x" again.
> mov byte ptr [eax+edx],cl
Move our byte back into its destination.
> Now I don't know a first thing about asm but still, 36 INSTRUCTIONS ???!!
Hehe! The "first thing about asm" is that it's most likely *going* to
take a lot of instructions :) Don't put too much stock in the
instruction count - often speed-optimized code is a *lot* longer than
the size-optimized version. But I think an optimized version of this
would be shorter just by eliminating duplicate code. I'd think in terms
of calculating the destination address *once* and hanging on to it.
Likewise, the "shift-count" could probably be figured just once - I
still don't get why we're basing this on "y"! (so my analysis above may
be totally off-base)
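As a rough illustration of "calculate the address once, figure the shift count once", here is a C++ sketch. SetPixelOnce is a made-up name, and the pBitmapRow table below (4 buffers of 8 bytes) is invented just so the example stands alone; the real table's type and size are not shown in the post.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: pBitmapRow is a table of pointers to packed 4bpp buffers.
// The dimensions and backing storage here are hypothetical.
static std::uint8_t storage[4][8];
static std::uint8_t* pBitmapRow[4] = { storage[0], storage[1], storage[2], storage[3] };

inline void SetPixelOnce(int y, int x, std::uint8_t color)
{
    assert(color < 16);
    std::uint8_t* p = &pBitmapRow[x][y >> 1];  // destination address, computed once
    const int shift = (~y & 1) << 2;           // 4 for even y, 0 for odd y, computed once
    const unsigned keep = 0xF0u >> shift;      // 0x0F for even y, 0xF0 for odd y
    // One read, one write: clear our nibble, then OR in the shifted color.
    *p = static_cast<std::uint8_t>((*p & keep) | (static_cast<unsigned>(color) << shift));
}
```

Whether a given compiler turns this into fewer instructions than the original is something to check in the optimized listing rather than assume.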
> I figure there must be a more optimal way to encode this in assembler (I
> will use C++ inline assembler)
I don't know the syntax for C++ inline assembler (I'm a devout Nasmist),
so I'm not going to be able to help you there. It's a problem, because I
can't readily test any "bright ideas" I might come up with :)
> So, can anyone PLEASE help me with better asm code- I desperately need this
> routine to run as fast as possible.
You may need to re-think what you're doing "from the top". If you're
filling any horizontal "runs" of pixels, for example, there are faster
ways of doing this than by calling setpixel repeatedly - and other
speedups. You might want to store your data in a different manner for
faster access - "offset thinking" is faster than "row-column thinking".
At the very least, we can get rid of some of the unnecessary code!
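To make the "runs" idea concrete, here is a hedged C++ sketch that fills a run of consecutive pixels inside one packed buffer. FillRun is a hypothetical helper, "row" stands in for one pBitmapRow[x] entry, and the nibble order is assumed to match the posted SetPixel (even index in the high nibble). In that layout the pixels sharing a byte are neighbours in y, so the run here is a range of y values.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: fill pixels [y0, y1) of one packed 4bpp buffer.
inline void FillRun(std::uint8_t* row, int y0, int y1, std::uint8_t color)
{
    assert(color < 16 && 0 <= y0 && y0 <= y1);
    if (y0 == y1) return;                        // empty run
    if (y0 & 1) {                                // leading odd pixel: low nibble only
        row[y0 >> 1] = static_cast<std::uint8_t>((row[y0 >> 1] & 0xF0) | color);
        ++y0;
    }
    if (y1 & 1) {                                // trailing even pixel: high nibble only
        --y1;
        row[y1 >> 1] = static_cast<std::uint8_t>((row[y1 >> 1] & 0x0F) | (color << 4));
    }
    if (y1 > y0)                                 // whole bytes: two pixels per store
        std::memset(row + (y0 >> 1), (color << 4) | color, (y1 - y0) >> 1);
}
```

Only the two ends of the run need the read-modify-write; everything in between is a plain byte fill, which is where the savings over repeated setpixel calls would come from.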
It would be interesting to see what your compiler comes up with for
optimized code, if you can get it to do it.
But there are surely optimized setpixel's available, if you know where
to look. And folks more knowledgeable than I in a livelier newsgroup.
I'm going to cross-post this to alt.lang.asm - I hope you can look for
replies there(?). We *desperately* need some asm to talk about over
there! :)
Best,
Frank
On Thu, 21 Feb 2002 00:08:13 GMT, Frank Kotler spake thus:
>Dzemal Kulenovic wrote:
>
>Hi Dzemal. This newsgroup is pretty much dead, which is why you aren't
>getting any responses. Try posting in news:comp.lang.asm.x86 - that's a
>moderated group, and you *might* have to cc clax-...@crayne.org to
>get through. Also news:alt.lang.asm is a lot more active than this one,
>tho not limited to x86.
Hi :) As Frank says, c.l.a is not the liveliest of assembly language
groups; I read this post in a.l.a.
>
>> Ok, here's the c++ function:
>
>In the first place, I don't know C++ :) (but this doesn't look too
>tough)
>
>> inline void _stdcall SetPixel(int y,int x,char color)
>> {
>> __int16 q=y>>1;
>> (pBitmapRow[x])[q] &= 0xF << (y & 1 ? 4 : 0);
>> (pBitmapRow[x])[q] |= color << (y & 1 ? 0 : 4);
>> }
If I understand this correctly, we are saying (in PseudoCode):
{
int q = y shr 1 ; y is at ebp+0Ch
define z = (pBitmapRow[x])[q]
if((y AND 1) != 0) ; y is odd
{
z = z AND 0xF0
}
else
{
z = z AND 0xF
}
if((y AND 1) != 0) ; y is odd
{
z = z OR color
}
else
{
z = z OR (color shl 4)
}
}
I'm not sure how you would code (pBitmapRow[x])[q] in assembler, so
I'll leave that part of it the same as in the original code (you'll
have to convert to make use of the library code).
inline void _stdcall SetPixel(int y,int x,char color)
{
__int32 q=y>>1;
(pBitmapRow[x])[q] &= 0xF << (y & 1 ? 4 : 0);
(pBitmapRow[x])[q] |= color << (y & 1 ? 0 : 4);
}
I changed q to a 32-bit int, for the same reason Frank gave (smaller,
faster code).
>
>> mov eax,dword ptr [ebp+0Ch]
>> sar eax,1
#define z = (pBitmapRow[x])[q] ; this is just to make the code easier
; to show you
mov eax,dword ptr[ebp+0Ch] ; y
mov ecx,eax ; less memory reads
sar eax,1 ; ecx = y, eax = q
>> mov word ptr [ebp-4],ax
>> movsx edx,word ptr [ebp-4]
mov esi,eax ; 2 less memory accesses
; esi = q
>> mov eax,dword ptr [ebp+10h]
mov eax,dword ptr[ebp+10h] ; x
>> mov eax,dword ptr [eax*4+4369E8h]
mov eax,dword ptr[eax*4+4369E8h]
; eax = (pBitmapRow[x])
; the above constant is set up by the system and is dependent on the
; position in memory of the resident code. I don't know how you would
; code that in assembler.
>> mov ecx,dword ptr [ebp+0Ch]
>> and ecx,1
and ecx,1 ; ecx = y & 1
>> neg ecx
>> sbb ecx,ecx
>> and ecx,4
shl ecx,2 ; ecx = (y & 1 ? 4 : 0)
>> mov ebx,0Fh
>> shl ebx,cl
mov ebx,0Fh
shl ebx,cl ; ebx = 0xF << (y&1 ? 4:0)
>> mov cl,byte ptr [eax+edx]
>> and cl,bl
xor edx,edx ; clear edx so its upper bits are zero
mov dl,byte ptr[eax+esi] ; byte load needs a byte register; dl = z
and edx,ebx ; edx = z & (0xF << (y&1 ? 4:0))
The following code makes little sense to me (basically because it
makes no attempt to optimise as a routine), so I'll ignore it and just
show you how I would write it without trying to compare to the code
MSVC gave you.
xor cl,4 ; ecx = (y & 1 ? 0 : 4)
mov ebx,[ebp+14h] ; ebx low byte = color
; the processor would have pushed as
; a dword, so this is OK.
shl ebx,cl ; ebx = color<<(y&1 ? 0:4)
or edx,ebx ; z = z OR color<<(y&1 ? 0:4)
; edx = dl = z
mov byte ptr[eax+esi],dl
>> movsx edx,word ptr [ebp-4]
>> mov eax,dword ptr [ebp+10h]
>> mov eax,dword ptr [eax*4+4369E8h]
>> mov byte ptr [eax+edx],cl
>> movsx edx,word ptr [ebp-4]
>> mov eax,dword ptr [ebp+10h]
>> mov eax,dword ptr [eax*4+4369E8h]
>> movsx ebx,byte ptr [ebp+14h]
>> mov ecx,dword ptr [ebp+0Ch]
>> and ecx,1
>> neg ecx
>> sbb ecx,ecx
>> and ecx,0FCh
>> add ecx,4
>> shl ebx,cl
>> mov cl,byte ptr [eax+edx]
>> or cl,bl
>> movsx edx,word ptr [ebp-4]
>> mov eax,dword ptr [ebp+10h]
>> mov eax,dword ptr [eax*4+4369E8h]
>> mov byte ptr [eax+edx],cl
>
>Move our byte back into it's destination.
>
>> Now I don't know a first thing about asm but still, 36 INSTRUCTIONS ???!!
I'm sure I haven't provided the most optimal way of doing it, but 18
instructions with 6 memory accesses, instead of 36 instructions and 21
memory accesses!
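The two shift-count shortcuts used above ("shl ecx,2" in place of the neg/sbb/and sequence, and "xor cl,4" to get the opposite count) can be sanity-checked in C++. MaskShift, ColorShift and CheckShiftCounts are names made up for this sketch; it only checks the arithmetic, not the assembler itself.

```cpp
#include <cassert>

// Shift count for the mask: 4 when y is odd, 0 when y is even
// (mirrors "and ecx,1" followed by "shl ecx,2").
constexpr int MaskShift(int y)  { return (y & 1) << 2; }

// Shift count for the color: the opposite value, obtained by flipping bit 2
// (mirrors "xor cl,4").
constexpr int ColorShift(int y) { return MaskShift(y) ^ 4; }

// Compare both against the ?: expressions from the posted C++ function.
inline void CheckShiftCounts()
{
    for (int y = 0; y < 16; ++y) {
        assert(MaskShift(y)  == ((y & 1) ? 4 : 0));
        assert(ColorShift(y) == ((y & 1) ? 0 : 4));
    }
}
```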
The part that you will have to work out (if nobody else is able to
help) is how to get the address for (pBitmapRow[x]), as it won't be
4369E8h every time you run it.
>
>> I figure there must be a more optimal way to encode this in assembler (I
>> will use C++ inline assembler)
>
>I don't know the syntax for C++ inline assembler (I'm a devout Nasmist),
>so I'm not going to be able to help you there. It's a problem, because I
>can't readily test any "bright ideas" I might come up with :)
Me neither, but I figure that using the syntax in the posted code will
provide something that should work :)
>
>> So, can anyone PLEASE help me with better asm code- I desperately need this
>> routine to run as fast as possible.
Like Frank I'm a committed nasm user, and I haven't got a lot of idea
how to do things with other assemblers. I pick it up as I go, by
looking at how others write code for those other assemblers (and, of
course, by looking at code samples online) :) The above should do the
same as the code VC produced, and it should be a lot faster (probably
more than twice the speed, given the number of memory accesses saved
as well as being half the number of instructions), though I haven't
timed it.
I agree with what Frank said about rethinking the algorithm if you are
writing a lot of pixels. A line can be drawn far faster than writing
it pixel by pixel, whether it's horizontal, vertical or diagonal, as
coding the entire line in assembler allows you to write multiple
pixels in one write (if they are adjacent horizontally) and to reuse
values in registers without saving in memory between calls when they
are not horizontally adjacent.
--
Debs
de...@dwiles.nospam.demon.co.uk
----
If you're not part of the solution, start another problem!
It seemed to be dead.
Then we get this post from PacBell and discover that
it's really been alive, but stuff has been getting lost.
Shades of comp.lang.asm.x86!
Randy Hyde