some comments

65 views
Skip to first unread message

Tom Ehlert

unread,
May 4, 2019, 8:50:58 AM5/4/19
to Night Kernel
Hi,

I just located

        ; add one to the task number, then mask to make sure we stay under 255
        inc ebx
        and ebx, 0x000000FF


        ; We use a couple tricks here below for speed. First, we're calling an internal List Manager function directly instead
        ; of going through the slower public interface. This bypasses the parameter checking that's normally performed, but
        ; that's okay since we are managing the input internally, so we know it will be correct. Second, we use the EBX
        ; register to track the current task number we're testing. Why EBX? Because LM_Internal_ElementAddressGet() doesn't
        ; modify it, so we don't need to save it here, saving us a handful of CPU cycles.

        ; get that task number's starting address
        push ebx
        push dword [tSystem.listTasks]
        call LM_Internal_ElementAddressGet
        pop esi      

let me share some observations:

1st, efficiency.
this loop is executed on average 256/numberOfActiveTasks times.
assuming 5 active tasks, this is 50 iterations to find the task to switch to.
this is inefficient, even if you count the cycles of each single instruction.

2nd, bug
ebx will eventually be 255.
after
     inc ebx, and ebx, 0x000000FF
ebx will be 0. unfortunately LM_Internal_ElementAddressGet will not return a valid
task, but some (more or less) random location somewhere. as this is treated as a task
this is a bug.

3rd, calling conventions

        call LM_Internal_ElementAddressGet
        pop esi 

returning results on the stack is inefficient. return results through a register. EAX has been the  register
to be used for resultgs for ages.

4th, more calling conventions
        ; Why EBX? Because LM_Internal_ElementAddressGet() doesn't
        ; modify it, so we don't need to save it here, saving us a handful of CPU cycles.

standard C calling conventions require that EBP, ESI, EDI are preserved across calls.
this serves the same purpose (efficiency), but has not to rely on undocumented features of
LM_Internal_ElementAddressGet.

5th, use efficient assembler (LM_Internal_ElementAddressGet)

replace
    ; check that the element requested is within range
    ; so first we get the number of elements from the list itself
    mov esi, [ebp + 8]
    mov eax, [tListInfo.elementCount]

    ; adjust eax by one since if a list has, say, 10 elements, they would actually be numbered 0 - 9
    dec eax

    ; get the size of each element in this list
    mov eax, [tListInfo.elementSize]

    ; calculate the new destination address
    mov edx, [ebp + 12]
    mul edx
    add eax, esi
    add eax, 16

with

    ; check that the element requested is within range
    ; so first we get the number of elements from the list itself
    mov esi, [ebp + 8]
          ; get the size of each element in this list
    mov eax, [tListInfo.elementSize]

    ; calculate the new destination address
    mul [ebp + 12]
    lea eax, [eax + esi + 16]

6th, use efficient assembler

replace
    ; set address of the thing to print info on what we found,
    mov eax, [tSystem.configBits]
    and eax, 000000000000000000000000000000010b
    cmp eax, 000000000000000000000000000000010b
    jne .NoPrint
by
    ; set address of the thing to print info on what we found,
    test [tSystem.configBits], 000000000000000000000000000000010b
    jne .NoPrint


7th, kFalse and kTrue are constants unlikely to ever change
replace
    kTrue                                            dd 0x00000001
    kFalse                                            dd 0x00000000
   
    mov eax, [kTrue]
    ...
    cmp eax, [kTrue]

by
    %define    kTrue  1
    %define    kFalse 0
   
    mov eax, kTrue
    ...
    cmp eax, kTrue

accessing memory is inefficient, compared to constants.

8th, bug
LMElementValidate will *always* return [kTrue].

Tom



Mercury Thirteen

unread,
May 6, 2019, 3:40:48 AM5/6/19
to Night Kernel
On Saturday, May 4, 2019 at 8:50:58 AM UTC-4, Tom Ehlert wrote:
Hi,

Greetings!


I just located

        ; add one to the task number, then mask to make sure we stay under 255
        inc ebx
        and ebx, 0x000000FF


        ; We use a couple tricks here below for speed. First, we're calling an internal List Manager function directly instead
        ; of going through the slower public interface. This bypasses the parameter checking that's normally performed, but
        ; that's okay since we are managing the input internally, so we know it will be correct. Second, we use the EBX
        ; register to track the current task number we're testing. Why EBX? Because LM_Internal_ElementAddressGet() doesn't
        ; modify it, so we don't need to save it here, saving us a handful of CPU cycles.

        ; get that task number's starting address
        push ebx
        push dword [tSystem.listTasks]
        call LM_Internal_ElementAddressGet
        pop esi      

let me share some observations:

1st, efficiency.
this loop is executed on average 256/numberOfActiveTasks times.
assuming 5 active tasks, this is 50 iterations to find the task to switch to.
this is inefficient, even if you count the cycles of each single instruction.

I ended up completely changing this code to something much more streamlined which completely forgoes the call to LM_Internal_ElementAddressGet() based on one of your later suggestions.

 
2nd, bug
ebx will eventually be 255.
after
     inc ebx, and ebx, 0x000000FF
ebx will be 0. unfortunately LM_Internal_ElementAddressGet will not return a valid
task, but some (more or less) random location somewhere. as this is treated as a task
this is a bug.

Ehh... not quite a bug, in that this condition is handled, albeit indirectly. Granted, task #0 is considered invalid due to how the Memory Manager handles block allocation, but EBX becoming null is actually handled due to a combination of other conditions. Slot zero of the Task List remains empty so it will never be used for a valid task and therefore never gets switched to, making the EBX = 0 condition a non-issue for stability. Although it seems inefficient to let the code proceed on if EBX is null only to fail a later test, it's still better than checking every pass through the loop to see if a condition exists which will only happen once out of every 256 passes through the loop.


3rd, calling conventions

        call LM_Internal_ElementAddressGet
        pop esi 

returning results on the stack is inefficient. return results through a register. EAX has been the  register
to be used for resultgs for ages.

I have considered having values passed in registers, especially since I hear this is practically a requirement of the modern SYSCALL interface. However, that may prove problematic since many routines in Night need all the registers they can get to maintain decent speed and clarity of code. As a compromise, I've made an effort to use the traditional (or ancient, depending on who you ask, lol) method of using the stack for parameters and simply eliminating the need for function calls as often as possible, especially in (most) loops. Not the fastest possible solution, I know - and a future version of Night may yet change it - but for now, it'll do.

 
4th, more calling conventions
        ; Why EBX? Because LM_Internal_ElementAddressGet() doesn't
        ; modify it, so we don't need to save it here, saving us a handful of CPU cycles.

standard C calling conventions require that EBP, ESI, EDI are preserved across calls.
this serves the same purpose (efficiency), but has not to rely on undocumented features of
LM_Internal_ElementAddressGet.

As EBP is used only for pointing to parameters and local variables, every function call will preserve it upon entry and restore it upon exit. ESI and EDI, however, are not dedicated to any specific purpose (in Night, that is) but are commonly used as indexes into strings and, in the case of ESI, as a pointer to the beginning of a structure. Since the x86 architecture is a bit register-starved to begin with, I don't see an easy way to free up ESI and EDI to allow the source to be more C-compatible. But since the kernel proper is pure Assembly anyway, this shouldn't be an issue.

 
5th, use efficient assembler (LM_Internal_ElementAddressGet)

replace
    ; check that the element requested is within range
    ; so first we get the number of elements from the list itself
    mov esi, [ebp + 8]
    mov eax, [tListInfo.elementCount]

    ; adjust eax by one since if a list has, say, 10 elements, they would actually be numbered 0 - 9
    dec eax

    ; get the size of each element in this list
    mov eax, [tListInfo.elementSize]

    ; calculate the new destination address
    mov edx, [ebp + 12]
    mul edx
    add eax, esi
    add eax, 16

with

    ; check that the element requested is within range
    ; so first we get the number of elements from the list itself
    mov esi, [ebp + 8]
          ; get the size of each element in this list
    mov eax, [tListInfo.elementSize]

    ; calculate the new destination address
    mul [ebp + 12]
    lea eax, [eax + esi + 16]

This resulted in an unbelievable 63.27% speedup (at least according to the built-in performance monitor thingy as run under VirtualBox) versus the original. Great suggestion!!

This also made me realize a couple other lines in that function were unnecessary leftovers from before the parameter checking was separated out into a separate routine. After cleaning that up as well, the code runs even faster.

 
6th, use efficient assembler

replace
    ; set address of the thing to print info on what we found,
    mov eax, [tSystem.configBits]
    and eax, 000000000000000000000000000000010b
    cmp eax, 000000000000000000000000000000010b
    jne .NoPrint
by
    ; set address of the thing to print info on what we found,
    test [tSystem.configBits], 000000000000000000000000000000010b
    jne .NoPrint

Curiously, that breaks the function of the original code.
Edit: I modified the JNE to JZ and it works.

 
7th, kFalse and kTrue are constants unlikely to ever change
replace
    kTrue                                            dd 0x00000001
    kFalse                                            dd 0x00000000
   
    mov eax, [kTrue]
    ...
    cmp eax, [kTrue]

by
    %define    kTrue  1
    %define    kFalse 0
   
    mov eax, kTrue
    ...
    cmp eax, kTrue

accessing memory is inefficient, compared to constants.
 
Makes perfect sense. Done!


8th, bug
LMElementValidate will *always* return [kTrue].

Tom

 Haha! A stupid oversight. I also found a similar error in LMListValidate(), which will also now be fixed in the next version.


Thanks, Tom, for some overall solid suggestions! This kind of input is always appreciated. :)

Tom Ehlert

unread,
May 6, 2019, 1:02:44 PM5/6/19
to night-do...@googlegroups.com


On Monday, May 6, 2019 at 9:40:48 AM UTC+2, Mercury Thirteen wrote:
On Saturday, May 4, 2019 at 8:50:58 AM UTC-4, Tom Ehlert wrote:
Hi,

.
     inc ebx, and ebx, 0x000000FF
ebx will be 0. unfortunately LM_Internal_ElementAddressGet will not return a valid
task, but some (more or less) random location somewhere. as this is treated as a task
this is a bug.

Ehh... not quite a bug, in that this condition is handled, albeit indirectly. Granted, task #0 is considered invalid due to how the Memory Manager handles block allocation, but EBX becoming null is actually handled due to a combination of other conditions.
actually, it's not.
you are accessing task #-1, somewhere at 0xffffffffffffffa0, and treat it as a possible task.


3rd, calling conventions

        call LM_Internal_ElementAddressGet
        pop esi 

returning results on the stack is inefficient. return results through a register. EAX has been the  register
to be used for resultgs for ages.

I have considered having values passed in registers,

I was talking about *returned* values.

Most of your functions end up in

    mov dword [ebp + 8], edx

    mov esp, ebp
    pop ebp
    ret
with next instruction
    pop eax
    cmp eax, kTrue

replacing this with

    mov eax, edx

    mov esp, ebp
    pop ebp
    ret
with next instruction
    cmp eax, kTrue

(or even generating the returned value to eax directly) is simply more efficient.



Mercury Thirteen

unread,
May 7, 2019, 1:50:12 AM5/7/19
to Night Kernel


On Monday, May 6, 2019 at 1:02:44 PM UTC-4, Tom Ehlert wrote:


On Monday, May 6, 2019 at 9:40:48 AM UTC+2, Mercury Thirteen wrote:
On Saturday, May 4, 2019 at 8:50:58 AM UTC-4, Tom Ehlert wrote:
Hi,

.
     inc ebx, and ebx, 0x000000FF
ebx will be 0. unfortunately LM_Internal_ElementAddressGet will not return a valid
task, but some (more or less) random location somewhere. as this is treated as a task
this is a bug.

Ehh... not quite a bug, in that this condition is handled, albeit indirectly. Granted, task #0 is considered invalid due to how the Memory Manager handles block allocation, but EBX becoming null is actually handled due to a combination of other conditions. 
actually, it's not.
you are accessing task #-1, somewhere at 0xffffffffffffffa0, and treat it as a possible task.

I still disagree on this lol

First, EBX is a 32-bit register, making an access to 0xFFFFFFFFFFFFFFA0 impossible. Second, task numbers are unsigned byte values, making a negative impossible as well. Third, this task switching code is always run from within an interrupt handler (e.g. interrupts will be temporarily off) making it impossible for the value to be used as-is before it is corrected. Fourth, even if a switch was somehow performed to a bogus task, the kernel's GPF handler would step in and shut it down before it got too far.

Slots 0 and 1 in the Task List are marked "used" by writing 0xFFFFFFFF to the beginning of the slot for both, ensuring they will never be used for tasks because they are not empty. However, if either is evaluated as a possible task to which to switch, the switch will harmlessly fail due to both having a null kernel stack pointer.

The Task List itself has 256 elements numbered 0 through 255, so obtaining the address of task zero will correctly point to the 0th entry of the Task List, however there is simply nothing there for the task switching code to use, causing it to be skipped. If the address of element 256 was somehow requested, that definitely would cause problems since it's beyond the end of the list, but as I said earlier, the value is incremented and then immediately ANDed to drop it from 256 to 0 with no possibility of it being used before this adjustment occurs.

Like I said before, if my assumptions are incorrect, feel free to explain more on how you reached your conclusion... maybe there's something I'm missing here and I just haven't gotten the AHA moment yet!

Edit: In the interest of optimization, I changed

; add one to the task number, then mask to make sure we stay under 255
inc ebx
and ebx, 0x000000FF


to 

; increment the task number using the low byte only to make sure we stay under 255
inc bl


which averts the extra math altogether.


3rd, calling conventions

        call LM_Internal_ElementAddressGet
        pop esi 

returning results on the stack is inefficient. return results through a register. EAX has been the  register
to be used for resultgs for ages.

I have considered having values passed in registers,

I was talking about *returned* values.

Most of your functions end up in

    mov dword [ebp + 8], edx

    mov esp, ebp
    pop ebp
    ret
with next instruction
    pop eax
    cmp eax, kTrue

replacing this with

    mov eax, edx

    mov esp, ebp
    pop ebp
    ret
with next instruction
    cmp eax, kTrue

(or even generating the returned value to eax directly) is simply more efficient.

Very true. When considering this before, I had it in my mind that the method of passing values should match the method for returning them. Why was that in my head? No idea. I guess I just thought the uniformity would be a good thing and perhaps avert possible unforeseen complications of mixing methods. Since you mentioned this, I did some research and I must say... I never realized how common the mixed method was, especially when taking into account other platforms besides x86. I suppose migrating to register returns soon would be a smart move. The whole focus of this kernel is performance, after all.

Also, I suppose different types of return values could be returned in different registers, based on for what they'll be used. For example, the List Manager calls which return the address of a list element may do so in ESI since that's likely the register you'll need the result in anyway, and other general routines can return miscellaneous data in EAX.

Thanks for the idea!

Reply all
Reply to author
Forward
0 new messages