Pushing and popping segment registers on context switch


James Harris

Mar 11, 2022, 8:07:22 AM
No need to reply but you may find this lot interesting.

My (Pmode) interrupt-handling currently includes code of the form

pusha
push ds
push es
push fs
push gs

... handle interrupt

pop gs
pop fs
pop es
pop ds
popa

That simple approach has led to some queries.


===> 1. When pushed in that way does each segment register take up two
bytes or four bytes on the stack? Intel seem remarkably unclear about
this. Their manuals say such as

"PUSH decrements the stack pointer by 2 if the operand-size attribute
of the instruction is 16 bits; otherwise, it decrements the stack
pointer by 4."

http://www.scs.stanford.edu/05au-cs240c/lab/i386/PUSH.htm

IMO that's confusing since the size of the operands in this case is
unavoidably 16 bits but the D bit of the CS descriptor will be set
indicating a default operand size of 32 bits.

I have, I think, satisfied myself that in PM32 the segment registers
will always be pushed as four bytes but that leads to the second point.


===> 2. Since two-byte values are being pushed as four bytes which bytes
hold the value and what gets put in the other two bytes?

If all we are doing is popping them later, as above, then it doesn't
matter but it could be important when priming the stack for a new task
and it could matter to an interrupt handler which accessed the
interrupted task's registers, e.g. a debug-exception handler.
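
For anyone priming a stack or reading a saved selector, here is a minimal sketch in C, assuming 32-bit protected mode and the push order shown above (DS, ES, FS, GS, so GS ends up at the lowest address). The struct `saved_segs` and the helpers `saved_selector` and `primed_slot` are hypothetical names for illustration, not taken from any real kernel. x86 is little-endian, so the selector occupies the low two bytes of each 4-byte slot; the sketch never trusts the upper two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the four segment slots, lowest address first.
 * Assumes each segment push moved ESP by 4 (32-bit operand size). */
struct saved_segs {
    uint32_t gs;   /* pushed last, so it sits at the lowest address */
    uint32_t fs;
    uint32_t es;
    uint32_t ds;   /* pushed first of the four */
};

/* The layout only works if there is no padding between the slots. */
static_assert(sizeof(struct saved_segs) == 16, "four 4-byte slots expected");
static_assert(offsetof(struct saved_segs, ds) == 12, "ds is the highest slot");

/* Read a selector from a slot, ignoring the upper 16 bits. */
uint16_t saved_selector(uint32_t slot)
{
    return (uint16_t)(slot & 0xFFFFu);
}

/* Build a slot for a new task's initial stack: selector in the low
 * half, upper half cleared so the contents are at least predictable. */
uint32_t primed_slot(uint16_t selector)
{
    return (uint32_t)selector;
}
```

A debug-exception handler would then call something like `saved_selector(frame->ds)` rather than comparing the whole 32-bit slot against a selector value.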


===> 3. Since loading segment registers is slow would it be faster to
wrap the loads in tests such that each segreg is only reloaded if it has
changed as in

if current GS does not equal the GS saved on the stack
pop GS from the stack
else
pop the stack into an unused general register such as EAX


===> 4. Most often it will be a user-mode task which is interrupted so
the DS segreg will have to change to the kernel's data segment. But
there will be times when DS does not need to change such as interrupting
a kernel-mode task or, perhaps more importantly, when there's an
'interrupt storm' and higher-priority IRQs are interrupting
lower-priority ones. Given that, should DS also only be reloaded if it
has changed?


===> 5. Finally, since DS is most likely to be used first by the code
once the interrupt returns would it be better to carry out the reload of
DS (whether wrapped or not) /before/ the other segment registers as in

pop ds ;Do DS first
pop es
pop fs
pop gs

so that DS is more likely to be ready soonest?


As I say, no need to reply unless interested. This has been bugging me
for days but I think I am making progress.


--
James Harris

Scott Lurndal

Mar 15, 2022, 12:19:45 PM
James Harris <james.h...@gmail.com> writes:
>No need to reply but you may find this lot interesting.
>
>My (Pmode) interrupt-handling currently includes code of the form
>
> pusha
> push ds
> push es
> push fs
> push gs
>
> ... handle interrupt
>
> pop gs
> pop fs
> pop es
> pop ds
> popa
>
>That simple approach has led to some queries.
>
>
>===> 1. When pushed in that way does each segment register take up two
>bytes or four bytes on the stack? Intel seem remarkably unclear about
>this. Their manuals say such as
>

(StackAddrSize as defined for current Stack Segment)
(OperandSize as defined for current Code Segment and/or override).

IF StackAddrSize = 64
    THEN
        IF OperandSize = 64
            THEN
                RSP <= (RSP - 8);
                IF (SRC is FS or GS)
                    THEN
                        TEMP = ZeroExtend64(SRC);
                    ELSE IF (SRC is IMMEDIATE)
                        TEMP = SignExtend64(SRC); FI;
                    ELSE
                        TEMP = SRC;
                FI
                SS:RSP <= TEMP; (* Push quadword *)
            ELSE (* OperandSize = 16; 66H used *)
                RSP <= (RSP - 2);
                SS:RSP <= SRC; (* Push word *)
        FI;
    ELSE IF StackAddrSize = 32
        THEN
            IF OperandSize = 32
                THEN
                    ESP <= (ESP - 4);
                    IF (SRC is FS or GS)
                        THEN
                            TEMP = ZeroExtend32(SRC);
                        ELSE IF (SRC is IMMEDIATE)
                            TEMP = SignExtend32(SRC); FI;
                        ELSE
                            TEMP = SRC;
                    FI;
                    SS:ESP <= TEMP; (* Push doubleword *)
                ELSE (* OperandSize = 16 *)
                    ESP <= (ESP - 2);
                    SS:ESP <= SRC; (* Push word *)
            FI;
        ELSE (* StackAddrSize = 16 *)
            IF OperandSize = 16
                THEN
                    SP <= (SP - 2);
                    SS:SP <= SRC; (* Push word *)
                ELSE (* OperandSize = 32 *)
                    SP <= (SP - 4);
                    SS:SP <= SRC; (* Push doubleword *)
            FI;
    FI;
FI;

>
>===> 2. Since two-byte values are being pushed as four bytes which bytes
>hold the value and what gets put in the other two bytes?

two byte values are zero extended for 32 or 64-bit StackAddrSize,
except when pushing an immediate value, which will be sign extended.
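
As a rough illustration of the two extension rules in the pseudocode, here is a small sketch in C. The function names `zero_extend32`, `sign_extend32` and `push_decrement` are made up for this example, and the sketch models only what the pseudocode says, not what any particular processor actually writes to the upper half of the stack slot.

```c
#include <assert.h>
#include <stdint.h>

/* Segment selector case in the pseudocode: upper 16 bits become zero. */
uint32_t zero_extend32(uint16_t selector)
{
    return (uint32_t)selector;
}

/* Immediate case: an 8-bit immediate is widened by copying its sign bit. */
uint32_t sign_extend32(int8_t imm)
{
    return (uint32_t)(int32_t)imm;
}

/* The stack pointer moves by the operand size, whatever the source width. */
unsigned push_decrement(unsigned operand_size_bits)
{
    assert(operand_size_bits == 16 || operand_size_bits == 32 ||
           operand_size_bits == 64);
    return operand_size_bits / 8u;
}
```

So under these rules a selector of 0xFFF8 would be stored as 0x0000FFF8, while an immediate of -8 would be stored as 0xFFFFFFF8.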


>===> 3. Since loading segment registers is slow would it be faster to
>wrap the loads in tests such that each segreg is only reloaded if it has
>changed as in
>
> if current GS does not equal the GS saved on the stack
> pop GS from the stack
> else
> pop the stack into an unused general register such as EAX

You probably won't save any significant number of cycles, even
if you get a cache hit on the stack access for the compare.

James Harris

Mar 22, 2022, 4:27:26 AM
On 15/03/2022 16:19, Scott Lurndal wrote:
> James Harris <james.h...@gmail.com> writes:

...

>> ===> 3. Since loading segment registers is slow would it be faster to
>> wrap the loads in tests such that each segreg is only reloaded if it has
>> changed as in
>>
>> if current GS does not equal the GS saved on the stack
>> pop GS from the stack
>> else
>> pop the stack into an unused general register such as EAX
>
> You probably won't save any significant number of cycles, even
> if you get a cache hit on the stack access for the compare.

I don't see a cache influence. The value has to be read in both cases so
any difference should cancel out.

What matters is that in Protected Mode loading a segreg has a lot of
work to do. In particular, loading a selector causes the CPU to also
load and check the full descriptor. The checks can happen in parallel
but how long does loading a segreg take? Based on

https://www.agner.org/optimize/instruction_tables.pdf

one could estimate the work as taking from 8 to over 20 cycles,
depending on CPU. By contrast, it looks as though wrapping the segload
in a test could take 2 cycles.

Here's how the code might look for FS.

pop eax ;Saved FS
mov ebx, fs ;Current FS
cmp ax, bx ;Compare them
je fs_ready ;If FS has changed
mov fs, eax ; then, and only then, reload it
fs_ready:

The estimate of 2 cycles has POP and MOV happening in one cycle, CMP and
JE happening in the other. If that's right then the code would execute
in 2 cycles if FS has remained unchanged and would add 2 cycles on top of
the reload if FS has changed.

Doesn't that make it worth doing for segment registers which are seldom
changed in interrupt handlers, such as ES, FS and GS?
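
To put numbers on that, here is a back-of-envelope model in C. It is only a sketch under the estimates above (about 2 cycles for the test, 8 to 20-odd cycles for a reload); the function names `guarded_cost` and `break_even_p` are invented for this example, and the model deliberately ignores branch misprediction and any overlap with surrounding instructions.

```c
#include <assert.h>

/* Average cycles for the guarded sequence: the test is always paid,
 * the reload only with probability p_changed. */
double guarded_cost(double test_cycles, double reload_cycles, double p_changed)
{
    assert(p_changed >= 0.0 && p_changed <= 1.0);
    return test_cycles + p_changed * reload_cycles;
}

/* Probability of change below which the guarded sequence beats an
 * unconditional reload, i.e. test + p * reload < reload. */
double break_even_p(double test_cycles, double reload_cycles)
{
    assert(reload_cycles > 0.0);
    return (reload_cycles - test_cycles) / reload_cycles;
}
```

With a 2-cycle test and an 8-cycle reload, `guarded_cost(2.0, 8.0, 0.25)` comes to 4 cycles against 8 for the plain POP, and `break_even_p(2.0, 8.0)` is 0.75, so on these figures the guard only loses if the register changes more than three times in four.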

Perhaps a more interesting question is whether it should be done for DS.


--
James Harris

James Harris

Mar 22, 2022, 4:49:47 AM
On 15/03/2022 16:19, Scott Lurndal wrote:
> James Harris <james.h...@gmail.com> writes:
>> No need to reply but you may find this lot interesting.
>>
>> My (Pmode) interrupt-handling currently includes code of the form
>>
>> pusha
>> push ds
>> push es
>> push fs
>> push gs
>>
>> ... handle interrupt
>>
>> pop gs
>> pop fs
>> pop es
>> pop ds
>> popa
>>
>> That simple approach has led to some queries.
>>
>>
>> ===> 1. When pushed in that way does each segment register take up two
>> bytes or four bytes on the stack? Intel seem remarkably unclear about
>> this. Their manuals say such as
>>
>
> (StackAddrSize as defined for current Stack Segment)
> (OperandSize as defined for current Code Segment and/or override).

Agreed. Pushing and popping a segreg uses the OperandSize. That means
(in 32-bit Pmode) they would adjust the stack pointer by 4 bytes even
though segment registers are just 2 bytes.

...

>>
>> ===> 2. Since two-byte values are being pushed as four bytes which bytes
>> hold the value and what gets put in the other two bytes?
>
> two byte values are zero extended for 32 or 64-bit StackAddrSize,
> except when pushing an immediate value, which will be sign extended.

You might find this interesting. For an OS developer it turns out to be
not as simple as you suggest. Instead, what's left on the stack after a
segreg push will depend on the processor! Check out this from Intel:

"When pushing a segment selector onto the stack, the Pentium 4, Intel
Xeon, P6 family, and Intel486 processors decrement the ESP register by
the operand size and then write 2 bytes. If the operand size is 32-bits,
the upper two bytes of the write are not modified. The Pentium processor
decrements the ESP register by the operand size and determines the size
of the write by the operand size. If the operand size is 32-bits, the
upper two bytes are written as 0s."

That's from 22.31.1 Selector Pushes and Pops in Intel Vol 3B from June
2013, Order Number: 325462-047US.

IOW, some Intel CPUs will zero extend, others will write just two bytes
of the four.
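
Here is a small simulation in C of the two behaviours, purely as an illustration. The type `slot_t` and the function names are hypothetical, and the byte array stands in for a little-endian 4-byte stack slot that already holds stale data before the push.

```c
#include <assert.h>
#include <stdint.h>

/* One 4-byte stack slot, byte 0 at the lowest address (little-endian). */
typedef struct { uint8_t b[4]; } slot_t;

/* Behaviour described for the 486, P6 family and Pentium 4:
 * write 2 bytes, leave the upper 2 bytes as they were. */
void push_sel_16bit_write(slot_t *s, uint16_t sel)
{
    assert(s);
    s->b[0] = (uint8_t)(sel & 0xFFu);
    s->b[1] = (uint8_t)(sel >> 8);
}

/* Behaviour described for the Pentium: upper 2 bytes written as zeros. */
void push_sel_zero_extended(slot_t *s, uint16_t sel)
{
    push_sel_16bit_write(s, sel);
    s->b[2] = 0;
    s->b[3] = 0;
}

/* The whole slot as a 32-bit value; depends on which behaviour ran. */
uint32_t slot_value(const slot_t *s)
{
    return (uint32_t)s->b[0] | ((uint32_t)s->b[1] << 8) |
           ((uint32_t)s->b[2] << 16) | ((uint32_t)s->b[3] << 24);
}

/* Portable read: only the low 2 bytes, the same on either kind of CPU. */
uint16_t slot_selector(const slot_t *s)
{
    return (uint16_t)(s->b[0] | (s->b[1] << 8));
}
```

If the slot held 0xDEADBEEF beforehand and the selector is 0x0010, the first behaviour leaves 0xDEAD0010 and the second leaves 0x00000010, yet `slot_selector` returns 0x0010 for both.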


--
James Harris

Scott Lurndal

Mar 22, 2022, 11:41:19 AM
I don't find processors prior to the Pentium particularly interesting,
myself, and I don't expect that there are many operating system
developers (of which number I have been counted for four decades now)
that will be targeting 286, 386 or 486 processors (outside of simulators
or using compatibility mode in modern processors).

But your point stands that the selector is not always zero-extended.
It would be interesting to look at qemu to see if it always
extends the selector or if it treats it differently for pre-P5
processors (assuming it claims to simulate pre-P5 processors
with some degree of fidelity).

James Harris

Mar 23, 2022, 1:20:41 PM
Some further info, this time about more-recent processors, from the
current Intel manuals (dated December 2021):

"if the operand size is 32-bits, either a zero-extended value is pushed
on the stack or the segment selector is written on the stack using a
16-bit move. For the last case, all recent Intel Core and Intel Atom
processors perform a 16-bit move, leaving the upper portion of the stack
location unmodified."

Ref: https://cdrdv2.intel.com/v1/dl/getContent/671110

It's curious that the Pentium switched to pushing four bytes but later
processors switched back to writing just two. I'd have thought that
writing four would be more efficient overall as the words of stacks are
commonly accessed in sequence and it could prevent the processor having
to read a fully written line into cache.

AMD maybe behaves differently.

Either way, the point remains: if code is to be portable then the upper
two bytes of a segreg push cannot be relied upon.


--
James Harris

Scott Lurndal

Mar 23, 2022, 2:43:50 PM
James Harris <james.h...@gmail.com> writes:
>On 22/03/2022 15:41, Scott Lurndal wrote:
>> James Harris <james.h...@gmail.com> writes:
>>> On 15/03/2022 16:19, Scott Lurndal wrote:
>>>> James Harris <james.h...@gmail.com> writes:
>
>...

>
>It's curious that the Pentium switched to pushing four bytes but later
>processors switched back to writing just two. I'd have thought that
>writing four would be more efficient overall as the words of stacks are
>commonly accessed in sequence and it could prevent the processor having
>to read a fully written line into cache.

So long as caching is enabled and the accessed physical
address is marked WB/WT in the MTRRs, the processor must
bring the entire line into the L1 cache before modifying it.

Only the non-temporal instructions bypass the cache.

wolfgang kern

Mar 25, 2022, 2:45:35 AM
On 23/03/2022 18:20, James Harris wrote:
[about segreg image on stack]
>
> It's curious that the Pentium switched to pushing four bytes but later
> processors switched back to writing just two. I'd have thought that
> writing four would be more efficient overall as the words of stacks are
> commonly accessed in sequence and it could prevent the processor having
> to read a fully written line into cache.
>
> AMD maybe behaves differently.

What I saw in the past on AMD was the previous stack contents in the upper
word, but I haven't checked recent versions (it should read zeros now).

> Either way, the point remains: if code is to be portable then the upper
> two bytes of a segreg push cannot be relied upon.

Yes, that's it. And with any luck the upper word is rarely needed anyway :)
__
wolfgang

James Harris

Mar 25, 2022, 5:57:09 AM
If a program stored (e.g. with a series of pushes) the words of an
entire cache line are you saying that with WB caching the processor will
read the line from memory even though it is about to overwrite the whole
thing?

If so, why would it do that? Something to do with MESI/MOESI
notifications of other CPUs, perhaps?


--
James Harris

Scott Lurndal

Mar 25, 2022, 10:02:17 AM
James Harris <james.h...@gmail.com> writes:
>On 23/03/2022 18:43, Scott Lurndal wrote:
>> James Harris <james.h...@gmail.com> writes:
>
>
>>> It's curious that the Pentium switched to pushing four bytes but later
>>> processors switched back to writing just two. I'd have thought that
>>> writing four would be more efficient overall as the words of stacks are
>>> commonly accessed in sequence and it could prevent the processor having
>>> to read a fully written line into cache.
>>
>> So long as caching is enabled and the accessed physical
>> address is marked WB/WT in the MTRRs, the processor must
>> bring the entire line into the L1 cache before modifying it.
>>
>> Only the non-temporal instructions bypass the cache.
>
>If a program stored (e.g. with a series of pushes) the words of an
>entire cache line are you saying that with WB caching the processor will
>read the line from memory even though it is about to overwrite the whole
>thing?

Generally yes. It depends on the store buffer implementation.

For some intel processors:

"For stores to WB space, the store data stays in the
store buffer until after the retirement of the stores.
Once retired, data can be written to the L1 Data Cache
(if the line is present and has write permission), otherwise
an LFB is allocated for the store miss. The LFB will eventually
receive the "current" copy of the cache line so that it can be
installed in the L1 Data Cache and the store data can be written to the cache."

Scott Lurndal

Mar 25, 2022, 10:04:30 AM