I am just wondering if the Intel (Dual Core) instruction set includes
an instruction that allows me to flush the INSTRUCTION cache in one
go? I know there is a CFLUSH instruction out there, which allows fine-
grained deleting of single cache lines. However, looping over each
cache line takes quite a while. I think I once found an instruction
that allows me to flush the whole data cache at once, but right now I
am more interested in the instruction cache!
Many thanks!
I think you mean the CLFLUSH instruction; it can be used
on instructions or data, but may raise a #GP(0) fault. I do not
know how it reacts to code pages, which are typically read-only.
Otherwise, there are the older INVD / WBINVD instructions,
but these need to be run in Ring0 or in real-mode code.
For dual core, you will need to issue the instructions on
both cores to clear L1 caches.
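FWIW, a minimal sketch of the line-by-line CLFLUSH loop the OP
described, in GCC inline asm. The 64-byte line size is an assumption
of mine; real code should query it via CPUID (leaf 1, EBX[15:8]*8):

#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64  /* assumed; query CPUID for the real value */

static void flush_range(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(LINE_SIZE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += LINE_SIZE)        /* one cache line per pass */
        asm volatile("clflush (%0)" : : "r"(p) : "memory");
    asm volatile("mfence" : : : "memory"); /* order the flushes */
}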
-- Robert
Yes, CLFLUSH is the instruction I was referring to.
> Otherwise, there are the older INVD / WBINVD instructions,
> but these need to be run in Ring0 or in real-mode code.
I also came across these instructions, but if I understand correctly
they are only intended to flush the data cache and leave the
instruction cache untouched?
> For dual core, you will need to issue the instructions on
> both cores to clear L1 caches.
Good point ;)
I don't read the Intel manual that way. They refer to flushing
data from cache_s_, but that data could be instructions. The data
and instruction caches are separate things at least at L1.
-- Robert
Alright, that could make sense, as the L2 and L3 caches store
instructions as well as data. If it flushes those L2 & L3 caches, then
you could be right that the data held in both the L1 instruction cache
and the L1 data cache gets flushed as well.
But I have to admit I would feel more confident if I could read
somewhere in writing that the WBINVD instruction ACTUALLY "flushes the
instruction cache" ;)
WBINVD is documented to "write back and flush internal caches". That is
ALL internal caches.
-hpa
> WBINVD is documented to "write back and flush internal caches". That is
> ALL internal caches.
>
>         -hpa
Thanks for your response! Now when you say ALL internal caches, does
this mean the instruction caches of both cores on a DUAL core machine,
or do I still have to issue the WBINVD instruction on each core
separately? On another note, do I have to set any specific flags in
some control register before I can use this instruction?
Thanks!
Each core separately; and the only requirement is that you're in CPL 0
as it is a privileged instruction.
-hpa
Thanks for the clarification, H. Peter. At the moment I have disabled
one core, so I have just one core activated on my dual-core machine.
Am I right in assuming that issuing 'asm volatile(" wbinvd ")' does
the business and invalidates all the caches, including the L1 cache on
the enabled core? I am wondering how I could issue this instruction in
such a way that it is executed on both cores in case both CPUs are
enabled?
Many thanks
You implement something for CPU communication and ask the other
processor to execute this instruction as well. You can send an inter-
processor interrupt (IPI) using the APIC. There must be a shared
memory area that would carry the parameters associated with such a
request. Those would describe the workload to do.
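If this is Linux kernel code, a hedged sketch: the kernel's SMP
helpers already wrap the APIC/IPI machinery, so you do not have to
roll your own. I am assuming the on_each_cpu() API here; its argument
list has changed between kernel versions, so check yours.

#include <linux/smp.h>

static void do_wbinvd(void *unused)
{
    /* runs at CPL 0 on whichever CPU the callback is dispatched to */
    asm volatile("wbinvd" : : : "memory");
}

static void wbinvd_all_cpus(void)
{
    /* execute do_wbinvd on every online CPU and wait for completion */
    on_each_cpu(do_wbinvd, NULL, 1);
}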
Alex
> You implement something for CPU communication and ask the other
> processor to execute this instruction as well. You can send an inter-
> processor interrupt (IPI) using the APIC. There must be a shared
> memory area that would carry the parameters associated with such a
> request. Those would describe the workload to do.
Thanks for that, I will have a look at how it works in detail.
One other thing bothers me, unfortunately.
In the documentation it says: "Flushes internal cache, then signals the
external cache to write back current data followed by a signal to flush
the external cache."
So in other words, after the execution of WBINVD the L1 cache will be
invalidated before the next instruction is executed. Moreover, the
documentation also says that this instruction takes 5 cycles. However,
when I did some performance measurements with the TSC, calling the
WBINVD instruction makes a difference of 2 million (!!!) clock cycles.
It can't take that long to invalidate internal 32KB caches, can it?
Thanks
Martin wrote:
> One other thing bothers me, unfortunately.
> In the documentation it says: "Flushes internal cache, then signals the
> external cache to write back current data followed by a signal to flush
> the external cache."
> So in other words, after the execution of WBINVD the L1 cache will be
> invalidated before the next instruction is executed. Moreover, the
> documentation also says that this instruction takes 5 cycles. However,
> when I did some performance measurements with the TSC, calling the
> WBINVD instruction makes a difference of 2 million (!!!) clock cycles.
> It can't take that long to invalidate internal 32KB caches, can it?
Yes, this is the expected timing penalty ...
WBINVD takes at least (min latency) ~2000 cycles plus the time needed
to write-back all dirty pending cache lines.
you can use INVD (invalidate without write-back), but then you should
be aware of what you may lose ...
write-through may be a solution if you can stand some penalties on writes.
I'd also look at the various new flush instructions, besides the few
cache-bypassing write opportunities (the non-temporal MOVNT* stores).
and there are the so-called serialising things like CPUID, which
may be of help for more exact time measurements by RDTSC.
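A sketch of such a serialised measurement in C (GCC inline asm; the
CPUID before each RDTSC keeps out-of-order execution from moving the
counter reads around):

#include <stdint.h>

static inline uint64_t rdtsc_serialised(void)
{
    uint32_t lo, hi;
    /* CPUID serialises; RDTSC then reads the time-stamp counter */
    asm volatile("cpuid\n\t"
                 "rdtsc"
                 : "=a"(lo), "=d"(hi)
                 : "a"(0)
                 : "ebx", "ecx");
    return ((uint64_t)hi << 32) | lo;
}

uint64_t measure_wbinvd(void)    /* needs CPL 0 for WBINVD */
{
    uint64_t t0 = rdtsc_serialised();
    asm volatile("wbinvd" : : : "memory");
    uint64_t t1 = rdtsc_serialised();
    return t1 - t0;
}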
__
wolfgang
> Yes, this is the expected timing penalty ...
Excellent Wolfgang, just wanted to make sure that this timing difference
is coming from invalidating the L1 cache.
> WBINVD takes at least (min latency) ~2000 cycles plus the time needed
> to write-back all dirty pending cache lines.
Where did you get this figure from? As I mentioned, in the documentation
it said something about 5 cycles...
> you can use INVD (invalidate without write-back), but then you should
> be aware of what you may lose ...
> write-through may be a solution if you can stand some penalties on writes.
Yes, I am aware of this option, but the thing is I need to write back
some useful intermediate results ;)
> and there are the so-called serialising things like CPUID, which
> may be of help for more exact time measurements by RDTSC.
Thanks, I have heard of this; out-of-order execution can make those
timings a bit imprecise.
Martin asked:
>> Yes, this is the expected timing penalty ...
> Excellent Wolfgang, just wanted to make sure that this timing difference
> is coming from invalidating the L1 cache.
>> WBINVD takes at least (min latency) ~2000 cycles plus the time needed
>> to write-back all dirty pending cache lines.
> Where did you get this figure from? As I mentioned, in the documentation
> it said something about 5 cycles...
Newer AMD optimisation guides won't show this '~2000' figure for WBINVD
anymore, but I remember that an earlier version of
"Instruction Dispatch and Execution Resources/Timing" (part of 22007.pdf)
had it in there, and I can also confirm it from my own experience.
__
wolfgang
5 cycles might be the dispatch time, i.e. if the next instruction
doesn't touch memory at all, it might start 5 cycles later.
However, as soon as you need memory access, you'll have to wait until
all the cache flushing has finished.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Terje Mathisen corrected:
> Wolfgang Kern wrote:
>> Martin asked:
...
>>>> WBINVD takes at least (min latency) ~2000 cycles plus the time needed
>>>> to write-back all dirty pending cache lines.
>>> Where did you get this figure from? As I mentioned, in the documentation
>>> it said something about 5 cycles...
>> Newer AMD optimisation guides won't show this '~2000' figure for WBINVD
>> anymore, but I remember that an earlier version of
>> "Instruction Dispatch and Execution Resources/Timing" (part of 22007.pdf)
>> had it in there, and I can also confirm it from my own experience.
> 5 cycles might be the dispatch time, i.e. if the next instruction
> doesn't touch memory at all, it might start 5 cycles later.
> However, as soon as you need memory access, you'll have to wait until
> all the cache flushing has finished.
Yes, 'might...', but hard to see in a debugger w/o memory access :)
OTOH, I measure >2000 cycles on AMD K8 with:
CPUID ;serialising
RDTSC
MOV ebx,eax
MOV ecx,edx
WBINVD ;a serialising instruction as well
RDTSC
SUB eax,ebx
SBB edx,ecx
INT3 ;...read eax:edx in debug view here
__
wolfgang