PCIe Driver read problem

shingo

Nov 21, 2009, 3:18:02 AM

Under 32-bit Windows XP SP2, I am developing a driver for a custom PCIe
device that uses an FPGA as its PCIe module.
With WDK 7200, I chose "KMDF" as the driver model.
First "MmMapIoSpace" maps the device DDR into my virtual address space, then I
use "RtlCopyMemory" to read from the mapped virtual address. I also tried
"READ_REGISTER_BUFFER_XXX" for the physical address.
The two functions behave the same. Is there any problem here?

When I read 2 DWORDs, the FPGA gets two separate requests, each for
1 DWORD. This makes the read operation too slow, only 4MBps (I found out
that a PCIe 4x device can reach 5GBps).
I can only suppose that the PCI bus driver (pci.sys) splits my request. Why
can't the PCI bus driver send a request for 2 DWORDs at a time? Or is there
anything wrong in my driver or FPGA code?
Thanks for your attention. Looking forward to your reply.

--
shingo for windows driving & winCE driving

David Craig

Nov 21, 2009, 4:57:04 AM

Why is anyone using the WDK 7200? It is not a released version and not
useful any more. If your hardware and the supporting bridge do not
support larger reads, then it won't happen. It will be broken up. What is
a 4x device? Maybe x4 for four lanes? If so that would be 20Gbps and not
20GBps.
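
For reference, the lane arithmetic behind such figures can be sketched as below. This is only a rough model: it assumes PCIe 1.x or 2.0 signalling (2.5 or 5 GT/s per lane, 8b/10b encoding), the function names are invented for illustration, and the results are per-direction link maxima before any packet overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Raw signalling rate of the whole link in Mbit/s.
   mt_per_s_per_lane: 2500 for PCIe 1.x, 5000 for PCIe 2.0 (assumed inputs). */
uint32_t pcie_raw_mbits_per_s(uint32_t mt_per_s_per_lane, uint32_t lanes)
{
    assert(lanes > 0);
    return mt_per_s_per_lane * lanes;
}

/* Usable bandwidth in MB/s per direction after 8b/10b encoding,
   which puts 10 bits on the wire for every payload byte. */
uint32_t pcie_mbytes_per_s(uint32_t mt_per_s_per_lane, uint32_t lanes)
{
    assert(lanes > 0);
    return mt_per_s_per_lane / 10u * lanes;
}
```

With these assumptions a 5 GT/s x4 link is 20 Gbit/s raw, or about 2 GB/s of bytes per direction, and a 2.5 GT/s x4 link is about 1 GB/s.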

"shingo" <shi...@zju.edu.cn> wrote in message
news:2D7D3499-3F41-415C...@microsoft.com...

Yongqiang Shi

Nov 21, 2009, 7:49:02 AM

Sorry, I made a mistake. I got the latest WDK from Microsoft.
After installing the WDK, there is a folder named "7600.16385.0".
The version of the WDK is 7600.
The PCIe device is inserted in a slot that supports PCI Express 4x, and
the hardware designer claims that the device supports reads larger than 1
DWORD.
Is there any more information about the PCI bus driver that I can get? The
WDK documentation does not give much information about how the PCI bus
driver deals with read/write IRPs.
Thanks for your help anyway.

Charles Gardiner

Nov 21, 2009, 7:57:54 AM

Hi Shingo,

if you want to go faster than about 4MB/s with a PCIe device you will
have to put a DMA controller in your FPGA. The DMA controller must move
the data and not the processor.

Take a look at the PciDrv or PLX9... examples in the WDK.

Processors typically use a load-and-store model, which means:

    loop
        read location n;
        write location m;
    end loop;

Even READ_REGISTER_BUFFER_XXX is just a wrapper around such a sequence.
You will find it very very hard to get any processor to burst across a
PCIe connection. I'm sure custom solutions are possible if you have a
processor supporting block-move/block-copy instructions, but this will
not be portable.
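
The load-and-store loop described above can be sketched in plain C as follows. This is an illustration only, not the WDK implementation: the function name is made up, and an ordinary array stands in for the device window that a real driver would get from MmMapIoSpace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `count` DWORDs from a device window into ordinary memory.
   `reg` is volatile, so the compiler must emit one 32-bit load per
   element and may not merge or drop any of them. */
void copy_dwords_from_device(uint32_t *dst,
                             const volatile uint32_t *reg,
                             size_t count)
{
    assert(dst != NULL && reg != NULL);
    for (size_t i = 0; i < count; ++i) {
        dst[i] = reg[i];    /* one separate DWORD load per iteration */
    }
}
```

When `reg` points at uncached device memory, each of those loads has to finish before the next one is issued, which is why the device sees a stream of single-DWORD requests rather than a burst.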

Alexander Grigoriev

Nov 21, 2009, 11:43:45 AM

The device register read operations don't go through pci.sys.
READ_REGISTER_ULONG or READ_REGISTER_LONGLONG are simple memory accesses,
combined with read barrier.

4 MB/s means 1 us per read. This seems too much. I suspect it's your FPGA
that cannot handle it fast enough.
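
The arithmetic behind that figure, as a minimal sketch: if reads are strictly serialized, throughput is just bytes per read divided by time per read. The function name is invented for illustration, and kB here means 1000 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in kB/s when each read returns `bytes_per_read` bytes,
   takes `latency_ns` nanoseconds, and reads never overlap. */
uint64_t serialized_read_kbytes_per_s(uint32_t bytes_per_read,
                                      uint32_t latency_ns)
{
    assert(latency_ns > 0);
    /* bytes per ns * 1e9 ns/s / 1e3 bytes/kB = bytes * 1e6 / ns */
    return (uint64_t)bytes_per_read * 1000000u / latency_ns;
}
```

With 4-byte reads, 1000 ns per read gives 4000 kB/s, i.e. 4 MB/s; at 1500 ns per read it drops to roughly 2.7 MB/s.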

"shingo" <shi...@zju.edu.cn> wrote in message
news:2D7D3499-3F41-415C...@microsoft.com...

Charles Gardiner

Nov 21, 2009, 12:32:14 PM

Alexander Grigoriev wrote:

>
> 4 MB/s means 1 us per read. This seems too much. I suspect it's your FPGA
> that cannot handle it fast enough.
>

Hi Alexander,

1us per read would in fact be quite fast. Typically a single read will
cost you in the region of 1.4 us to 1.8 us, at least that has been the
case on all PCs I have measured (with Logic Analyzer). There are two issues:

1) Read is a split response. i.e. the PC sends a request packet to the
FPGA. The FPGA responds with a completion packet.

2) The operating system seems to have quite a lot of overhead

I suspect current chip-sets /OSes handle a processor read from a
peripheral by just setting up the registers in the I/O controller host
(ICH) and then suspending the thread while waiting for an interrupt. For
an interrupt alone, I have often measured about 300ns response time.
That's just the time needed to call the ISR. The DPC hasn't even run
yet. I don't think there is a whole lot of difference here between Port
(I/O space) and Memory space reads, from a timing point of view.

I have done a few simulations with FPGAs and the hardware alone would
typically require below 800 ns for the round trip (request packet,
process request, completion packet). Sometimes with optimal buffer
settings, credit settings etc. maybe even as good as 450 ns.

Single requests are a terrible waste of bandwidth (if you are in a
hurry). For a single DWORD data (read), you transfer a total of 7 or 8
DWORDS.
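
One plausible way to arrive at the 7 or 8 DWORD figure is to count transaction-layer packet contents only: a read request header of 3 DWORDs (32-bit address) or 4 DWORDs (64-bit address), plus a completion with a 3-DWORD header and the data. The sketch below works that out. The breakdown is an assumption, since it is not spelled out above, and it ignores framing, sequence numbers and link CRC as well as completions split into several packets. The function names are hypothetical.

```c
#include <assert.h>

/* DWORDs crossing the link for one memory read of `data_dw` DWORDs,
   counting TLP headers and payload only (assumed breakdown). */
unsigned read_total_dwords(unsigned data_dw, int addr64)
{
    assert(data_dw > 0);
    unsigned request_hdr = addr64 ? 4u : 3u;  /* memory read request header */
    unsigned completion_hdr = 3u;             /* completion-with-data header */
    return request_hdr + completion_hdr + data_dw;
}

/* Share of the traffic that is actual payload, in percent (rounded down). */
unsigned read_efficiency_percent(unsigned data_dw, int addr64)
{
    return 100u * data_dw / read_total_dwords(data_dw, addr64);
}
```

Under these assumptions a single-DWORD read is about 14% payload, while a 32-DWORD read in one completion would be about 84%.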

Regards,
Charles

Don Burn

Nov 21, 2009, 12:44:53 PM

For a normal read as described here with READ_REGISTER_XXX, the operation
is a direct processor read. Your description is far from reality: this is
just a read of a memory location existing on the PCIe bus (so it may take a
few cycles). There is in essence no OS intervention and no interrupts.

--
Don Burn (MVP, Windows DKD)
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr
Remove StopSpam to reply

"Charles Gardiner" <inv...@invalid.invalid> wrote in message
news:he986v$rcp$01$1...@news.t-online.com...

Charles Gardiner

Nov 21, 2009, 2:13:59 PM

Don Burn wrote:

> For a normal read as described here with READ_REGISTER_XXX, the operation
> is a direct processor read. Your description is far from reality: this is
> just a read of a memory location existing on the PCIe bus (so it may take a
> few cycles). There is in essence no OS intervention and no interrupts.

Hi Don,

regarding reality, I guess we'll have to disagree on what reality is. I
had a customer pestering me for accurate figures, so I set up the entire
measurement scenario myself. The figures/scenario are very real and very
reproducible. In fact a few months ago I met a chap at a training who
had very similar figures with his own scenario.

My Scenario:
1 x PCIe FPGA Demo Board
1 x KMDF Device driver
1 x Windows XP
1 x GUI application (with CodeGear RAD Studio)

The customer could type in the burst size he wanted in the GUI.
Otherwise there were two buttons, one for read and one for write, and edit
boxes to enter/display the write/read data. The memory available was a
32K x 32 embedded RAM array inside the FPGA. Internal time to read a single
DWord, about 24 ns (i.e. 3 cycles @ 8 ns).

On the driver side (leaning strongly on the PciDrv and PLX9xxx examples in
the WDK), I converted the IRPs to READ_REGISTER_BUFFER_ULONG or
WRITE_REGISTER_BUFFER_ULONG with the 'count' field being the value that
the user entered in the GUI mask. This was also verified with TraceView.

I measured the time between consecutive reads /writes by bringing
internal FPGA signals out to test pins. The observation was that the PC
never bursts and memory writes are up to four times as fast as
non-posted (mem read, I/O rd/wr) PCIe requests, since here there is no
completion. What is important, the time measured here is independent of
the GUI and user part of the operating system. I just measure the time
used by the READ_REGISTER_BUFFER_ULONG part of a single IRP e.g. how
long do I need to read 5 consecutive DWords.

What admittedly is only conjecture (i.e. not tangible reality, at least
not for me), is where all this time overhead arises. From my real
simulations, I know it is not in the FPGA (here max ~900ns round trip,
typical more like 600ns). Whether the overhead is in the operating
system as written by Microsoft or in the chip-set driver, probably as
written by Intel, I indeed can't say for sure. From a HW or user point
of view, I would however consider both together as 'the operating
system'. My assumption on the interrupt etc. is based on descriptions I
have seen regarding how Intel chip-sets generally implement port
input/output requests. I'm assuming they do much the same for memory
reads/writes. I am also sure that real memory (as in internal RAM)
reads/writes are handled much more effectively as these go through the
MCH chip in 'standard' Intel chip-sets. The newer Server single-chip
companion chip E3xxx (or some number like that) with natively attached
PCIe lanes for the peripherals are probably also faster but I don't have
such a system (yet).

By the way, any real information you have regarding where the overhead
arises would be much appreciated.

Regards,
Charles

Don Burn

Nov 21, 2009, 2:48:48 PM

"Charles Gardiner" <inv...@invalid.invalid> wrote in message

news:he9e5n$707$01$1...@news.t-online.com...

> What admittedly is only conjecture (i.e. not tangible reality, at least
> not for me), is where all this time overhead arises. From my real
> simulations, I know it is not in the FPGA (here max ~900ns round trip,
> typical more like 600ns). Whether the overhead is in the operating
> system as written by Microsoft or in the chip-set driver, probably as
> written by Intel, I indeed can't say for sure. From a HW or user point
> of view, I would however consider both together as 'the operating
> system'. My assumption on the interrupt etc. is based on descriptions I
> have seen regarding how Intel chip-sets generally implement port
> input/output requests. I'm assuming they do much the same for memory
> reads/writes. I am also sure that real memory (as in internal RAM)
> reads/writes are handled much more effectively as these go through the
> MCH chip in 'standard' Intel chip-sets. The newer Server single-chip
> companion chip E3xxx (or some number like that) with natively attached
> PCIe lanes for the peripherals are probably also faster but I don't have
> such a system (yet).

If you are referring to the READ_REGISTER_ULONG operation, then your
conjecture is way off. If you look at the include file that defines this
operation, it is a memory barrier followed by a memory read, or just a
volatile memory read, so talking about the OS or the chipset driver getting
in the middle is nonsense. These reads are non-cached, they do go through
the Intel chipset, and you are issuing one operation at a time, but it is
not some mysterious driver that is getting in the way.

--
Don Burn (MVP, Windows DKD)
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr

Alexander Grigoriev

Nov 21, 2009, 3:38:04 PM

"Charles Gardiner" <inv...@invalid.invalid> wrote in message

news:he9e5n$707$01$1...@news.t-online.com...

>
> What admittedly is only conjecture (i.e. not tangible reality, at least
> not for me), is where all this time overhead arises. From my real
> simulations, I know it is not in the FPGA (here max ~900ns round trip,
> typical more like 600ns). Whether the overhead is in the operating

Do you mean 600-900 ns is your FPGA's roundtrip time? Do you have a PCIe
analyser? It would come in very handy to debug such issues.

Charles Gardiner

Nov 21, 2009, 4:10:28 PM

> If you are referring to the READ_REGISTER_ULONG operation, then your
> conjecture is way off. If you look at the include file that defines this
> operation, it is a memory barrier followed by a memory read, or just a
> volatile memory read, so talking about the OS or the chipset driver getting
> in the middle is nonsense. These reads are non-cached, they do go through
> the Intel chipset, and you are issuing one operation at a time, but it is
> not some mysterious driver that is getting in the way.
>

To be precise, I'm referring to
READ_REGISTER_BUFFER_ULONG(Register, Buffer, Count).

which is defined in wdm.h as

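/* Inline variant: compiler barrier, then a string copy through the
   __movsd intrinsic (rep movsd). */
__forceinline
VOID
READ_REGISTER_BUFFER_ULONG (
    PULONG Register,
    PULONG Buffer,
    ULONG Count
    )
{
    _ReadWriteBarrier();
    __movsd(Buffer, Register, Count);
    return;
}

/* Macro variant for another target architecture (__mf is the Itanium
   memory fence intrinsic): one volatile ULONG read per element. */
#define READ_REGISTER_BUFFER_ULONG(x, y, z) {                            \
    PULONG registerBuffer = x;                                           \
    PULONG readBuffer = y;                                               \
    ULONG readCount;                                                     \
    __mf();                                                              \
    for (readCount = z; readCount--; readBuffer++, registerBuffer++) {   \
        *readBuffer = *(volatile ULONG * const)(registerBuffer);         \
    }                                                                    \
}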

Are you saying that in reality the processor just sits there for about
1.5 us per iteration waiting for PCIe to come back with a single data
DWORD (because that is what I measure at the hardware)? Surely not.

1.5 us is one heck of a long time for a processor which is supposed
to be running at say 2.x GHz. Assuming then that there is no
'mysterious driver' involvement, what is happening in reality?
- Thread suspension + polling of some status register in the ICH to say
when PCIe has finished?
- Indefinite thread suspension until ICH signals PCIe completion over an
FSB interrupt?
- Something else?
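
For a sense of scale, here is the cycle count that such a latency corresponds to, assuming the core does nothing useful while the load is outstanding. The helper name is hypothetical and the clock frequency is just an example value.

```c
#include <assert.h>
#include <stdint.h>

/* Core clock cycles that elapse during `latency_ns` nanoseconds
   on a core running at `core_mhz` MHz. */
uint64_t cycles_elapsed(uint32_t latency_ns, uint32_t core_mhz)
{
    assert(core_mhz > 0);
    /* ns * (cycles per us) / (ns per us) */
    return (uint64_t)latency_ns * core_mhz / 1000u;
}
```

At 2.5 GHz, 1500 ns is 3750 core clock cycles per DWORD.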

Charles Gardiner

Nov 21, 2009, 4:58:48 PM

Alexander Grigoriev wrote:

>
> Do you mean 600-900 ns is your FPGA's roundtrip time? Do you have a PCIe
> analyser? It would come in very handy to debug such issues.
>
>

The figures are:
- roughly 450 ns for the first few packets, i.e. buffers empty plenty of
credits
- typical 600 ns, DMA traffic but no credit stalls
- worst 900 ns, heavy DMA and some credit stalls

With the round-trip time, I mean first header DWORD into PCIe core in
requestor to last DWORD of completion packet arriving at requestor. This
was measured in simulation with two identical PCIe cores connected
back-to-back (Aldec VHDL/Verilog simulator, Lattice ECP2M FPGA). i.e.
this is the time that user logic in the PCIe end-point would see if the
completer was a pure hardware implementation and could deliver data as
soon as the request had been received.

Packet transmission is normally pretty fast. It's the reception that's
slow since the packet has to be checked by the data-link layer before
passing it on to the user/application logic. Switches in the path often
use transparent mode i.e. the data layer checks on the fly and issues a
'nullify' if it unexpectedly detects a link CRC error.

The assumption here is of course that all chips have much the same
overhead in the data-link/physical layers. From the figures I have from
different chips or heard from people on different projects, this is the
case. In PCIe Gen 1.x, your byte time is 4 ns (UI 400 ps).
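
The byte-time figure can be checked with a little arithmetic. The sketch below assumes 8b/10b encoding (10 unit intervals per byte) and ignores framing symbols and link-layer overhead. The function names are made up for illustration.

```c
#include <assert.h>

/* Unit interval (one bit time on a lane) in picoseconds. */
unsigned unit_interval_ps(unsigned mt_per_s)
{
    assert(mt_per_s > 0);
    return 1000000u / mt_per_s;
}

/* Time to send one byte on one lane: 10 bits per byte with 8b/10b. */
unsigned byte_time_ps(unsigned mt_per_s)
{
    return 10u * unit_interval_ps(mt_per_s);
}

/* Serialization time in ns for `bytes` bytes striped across `lanes` lanes. */
unsigned wire_time_ns(unsigned bytes, unsigned lanes, unsigned mt_per_s)
{
    assert(lanes > 0);
    return bytes * byte_time_ps(mt_per_s) / lanes / 1000u;
}
```

At 2.5 GT/s this gives a 400 ps unit interval and a 4 ns byte time. Serializing 28 bytes (7 DWORDs) takes about 112 ns on one lane or 28 ns on four, so the time on the wire is only a small part of a 450 to 900 ns round trip.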

Don Burn

Nov 21, 2009, 5:01:48 PM

The processor is blocked, waiting for the PCIe read transaction to
complete. There is no magic polling or interrupt here. This is a function
of the path from processor to PCIe to device and back. You may want to
believe "surely not", but there is no software in this, just the hardware.

--
Don Burn (MVP, Windows DKD)
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr

"Charles Gardiner" <inv...@invalid.invalid> wrote in message

news:he9l04$eci$03$1...@news.t-online.com...

Alexander Grigoriev

Nov 21, 2009, 6:54:24 PM

So from the processor issuing the read request to PCIe, and the device
sending the completion to the processor, it might take 400-600-900 ns? No
surprise you're getting only 4MB/s. This is even slower than 33 MHz 32 bit
PCI.

"Charles Gardiner" <inv...@invalid.invalid> wrote in message

news:he9nr2$rf$00$1...@news.t-online.com...

shingo

Nov 22, 2009, 12:07:01 AM

Thanks to all of you.
I have learned a lot from your discussion, since I'm a rookie in Windows
driver development.
My project is mostly as Charles describes, except that the FPGA only has
DDR RAM attached, without a DMA adapter.
Do you mean that it is not possible to speed up reads without DMA?
Do I have to persuade the hardware designer to add a DMA adapter to the
board? Or can the FPGA implement the function of a DMA adapter?

--
shingo for windows driving & winCE driving

"Alexander Grigoriev" wrote:

Charles Gardiner

Nov 22, 2009, 10:42:28 AM

shingo wrote:

> Do you mean that it is not possible to speed up reads without DMA?

Yes, unfortunately you will not get any decent transfer rates without DMA.

> Do I have to persuade the hardware designer to add a DMA adapter to the
> board? Or can the FPGA implement the function of a DMA adapter?
>

I'd put the DMA controller in the FPGA.
I'm sure nearly every FPGA manufacturer has some DMA controller in his
IP-library, some are probably even free. It's not the most complicated
piece of logic to do yourself either. Essentially, your driver must feed
the DMA controller with the scatter/gather lists describing your IRP
buffer. The PciDrv and PLX9xxx examples in the WDK are great for getting
started.
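
To give an idea of what such a scatter/gather list looks like, here is a small user-space sketch. Everything in it is an assumption for illustration: the descriptor layout, the 4 KiB page size and the function name are invented, and a real DMA engine defines its own descriptor format. In a KMDF driver you would not normally build the list by hand, since the framework passes a SCATTER_GATHER_LIST to the driver's EvtProgramDma callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SG_PAGE_BYTES 4096u   /* assumed page size */

/* Hypothetical descriptor: one physically contiguous fragment. */
typedef struct {
    uint64_t bus_addr;   /* address the device should access */
    uint32_t length;     /* bytes in this fragment */
} sg_entry;

/* Describe a buffer that starts `offset` bytes into page_addrs[0] and is
   `length` bytes long. page_addrs[i] is the page-aligned bus address of the
   buffer's i-th page. Fragments that happen to be contiguous are merged.
   Returns the number of entries written, or 0 if the buffer is empty, the
   pages do not cover it, or `max_entries` is too small. */
size_t build_sg_list(const uint64_t *page_addrs, size_t page_count,
                     uint32_t offset, uint32_t length,
                     sg_entry *out, size_t max_entries)
{
    assert(page_addrs != NULL && out != NULL && offset < SG_PAGE_BYTES);
    size_t n = 0;
    for (size_t page = 0; page < page_count && length > 0; ++page) {
        uint32_t chunk = SG_PAGE_BYTES - offset;   /* bytes left in page */
        if (chunk > length) {
            chunk = length;
        }
        uint64_t addr = page_addrs[page] + offset;
        if (n > 0 && out[n - 1].bus_addr + out[n - 1].length == addr) {
            out[n - 1].length += chunk;            /* extend previous entry */
        } else {
            if (n == max_entries) {
                return 0;                          /* list too small */
            }
            out[n].bus_addr = addr;
            out[n].length = chunk;
            ++n;
        }
        length -= chunk;
        offset = 0;                                /* later pages start at 0 */
    }
    return (length == 0) ? n : 0;
}
```

For a buffer of 0x2000 bytes starting 0x800 bytes into a page at 0x10000, followed by pages at 0x11000 and 0x40000, this yields two entries: 0x1800 bytes at 0x10800 and 0x800 bytes at 0x40000. The driver would then write those entries into the FPGA's DMA controller and let it move the data.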

shingo

Nov 22, 2009, 8:19:01 PM

Thanks.
I will add DMA support to my FPGA as soon as possible.
I am glad to have learned so much from you all.
I sincerely appreciate all your help.