When I read 2 DWORDs, the FPGA gets two separate requests, each
for 1 DWORD. This makes the read operation too slow, only 4 MBps (I found out
that a PCIe x4 device can reach 5 GBps).
I can only suppose that the PCI bus driver (pci.sys) splits my request. Why can't
the PCI bus driver send a request for 2 DWORDs at a time? Or is there anything
wrong in my driver or FPGA code?
Thanks for your attention. Looking forward to your reply.
--
shingo for windows driving & winCE driving
"shingo" <shi...@zju.edu.cn> wrote in message
news:2D7D3499-3F41-415C...@microsoft.com...
Sorry, I made a mistake. I got the latest WDK from Microsoft.
After installing the WDK, there is a folder named "7600.16385.0".
The version of the WDK is 7600.
The PCI device is inserted in a slot that supports PCI Express x4, and
the hardware designer claims that the device supports reads larger than 1
DWORD.
Is there any more information about the PCI bus driver that I can get? The WDK
documentation does not give much information about how the PCI bus driver deals
with read/write IRPs.
Thanks for your help anyway.
If you want to go faster than about 4 MB/s with a PCIe device, you will
have to put a DMA controller in your FPGA. The DMA controller, not the
processor, must move the data.
Take a look at the PciDrv or PLX9... examples in the WDK.
Processors typically use a load-and-store model, which means

    loop
        read Location n;
        write Location M;
    end loop;

Even READ_REGISTER_BUFFER_XXX is just a wrapper around such a sequence.
You will find it very very hard to get any processor to burst across a
PCIe connection. I'm sure custom solutions are possible if you have a
processor supporting block-move/block-copy instructions, but this will
not be portable.
4 MB/s means 1 us per read. This seems too much. I suspect it's your FPGA
that cannot handle it fast enough.
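
The arithmetic behind that figure can be sketched in a few lines of C. This is only a back-of-the-envelope model, assuming that each read returns one 4-byte DWORD and that reads never overlap; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in bytes per second when every read returns bytes_per_read
 * bytes, takes latency_ns nanoseconds, and the next read only starts once
 * the previous one has completed (assumed model, no overlap). */
static uint64_t pio_bytes_per_sec(uint32_t bytes_per_read, uint32_t latency_ns)
{
    assert(latency_ns > 0);
    return (uint64_t)bytes_per_read * 1000000000ULL / latency_ns;
}
```

Under these assumptions, pio_bytes_per_sec(4, 1000) gives 4000000, i.e. the 4 MB/s mentioned above.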
"shingo" <shi...@zju.edu.cn> wrote in message
news:2D7D3499-3F41-415C...@microsoft.com...
>
> 4 MB/s means 1 us per read. This seems too much. I suspect it's your FPGA
> that cannot handle it fast enough.
>
Hi Alexander,
1 us per read would in fact be quite fast. Typically a single read will
cost you in the region of 1.4 us to 1.8 us, at least that has been the
case on all PCs I have measured (with a logic analyzer). There are two issues:
1) A read is a split transaction, i.e. the PC sends a request packet to the
FPGA and the FPGA responds with a completion packet.
2) The operating system seems to have quite a lot of overhead.
I suspect current chip-sets/OSes handle a processor read from a
peripheral by just setting up the registers in the I/O controller hub
(ICH) and then suspending the thread while waiting for an interrupt. For
an interrupt alone, I have often measured about 300ns response time.
That's just the time needed to call the ISR. The DPC hasn't even run
yet. I don't think there is a whole lot of difference here between Port
(I/O space) and Memory space reads, from a timing point of view.
I have done a few simulations with FPGAs and the hardware alone would
typically require below 800 ns for the round trip (request packet,
process request, completion packet). Sometimes with optimal buffer
settings, credit settings etc. maybe even as good as 450 ns.
Single requests are a terrible waste of bandwidth (if you are in a
hurry). For a single DWORD data (read), you transfer a total of 7 or 8
DWORDS.
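
One way to arrive at the 7 or 8 DWORDs is to add up the TLP headers and the payload. The breakdown below is an assumption (a 3-DWORD request header for 32-bit addressing or a 4-DWORD header for 64-bit addressing, plus a 3-DWORD completion header), and it ignores framing, sequence numbers, LCRC and DLLP traffic. The function names are illustrative only.

```c
#include <assert.h>

/* DWORDs crossing the link for one memory read: request header,
 * completion header and the returned payload. Header sizes are passed in
 * because they are an assumption, not something the post specifies. */
static unsigned read_total_dw(unsigned req_hdr_dw, unsigned cpl_hdr_dw,
                              unsigned payload_dw)
{
    return req_hdr_dw + cpl_hdr_dw + payload_dw;
}

/* Share of the transferred DWORDs that is payload, in whole percent
 * (rounded down). */
static unsigned read_payload_percent(unsigned req_hdr_dw, unsigned cpl_hdr_dw,
                                     unsigned payload_dw)
{
    unsigned total = read_total_dw(req_hdr_dw, cpl_hdr_dw, payload_dw);
    assert(total > 0);
    return payload_dw * 100u / total;
}
```

With these assumed header sizes, read_total_dw(3, 3, 1) is 7 and read_total_dw(4, 3, 1) is 8. In the 7-DWORD case only about 14% of the traffic is payload, whereas a 32-DWORD burst would be about 84% payload.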
Regards,
Charles
--
Don Burn (MVP, Windows DKD)
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr
Remove StopSpam to reply
"Charles Gardiner" <inv...@invalid.invalid> wrote in message
news:he986v$rcp$01$1...@news.t-online.com...
Hi Don,
Regarding reality, I guess we'll have to disagree on what reality is. I
had a customer pestering me for accurate figures, so I set up the entire
measurement scenario myself. The figures/scenario are very real and very
reproducible. In fact, a few months ago I met a chap at a training course who
had very similar figures with his own scenario.
My Scenario:
1 x PCIe FPGA Demo Board
1 x KMDF Device driver
1 x Windows XP
1 x GUI application (with CodeGear RAD Studio)
The customer could type in the burst size he wanted in the GUI.
Otherwise there were two buttons, one for read and one for write, and edit
boxes to enter/display the write/read data. The memory available was a
32K x 32 embedded RAM array inside the FPGA. Internal time to read a single
DWord: about 24 ns (i.e. 3 cycles @ 8 ns).
On the driver side (leaning heavily on the PciDrv, PCL90x0 examples in
the WDK), I converted the IRPs to READ_REGISTER_BUFFER_ULONG or
WRITE_REGISTER_BUFFER_ULONG calls, with the 'count' field being the value that
the user entered in the GUI mask. This was also verified with TraceView.
I measured the time between consecutive reads/writes by bringing
internal FPGA signals out to test pins. The observation was that the PC
never bursts and that memory writes are up to four times as fast as
non-posted (mem read, I/O rd/wr) PCIe requests, since posted writes need no
completion. What is important is that the time measured here is independent of
the GUI and user part of the operating system. I just measure the time
used by the READ_REGISTER_BUFFER_ULONG part of a single IRP, e.g. how
long I need to read 5 consecutive DWords.
What admittedly is only conjecture (i.e. not tangible reality, at least
not for me), is where all this time overhead arises. From my real
simulations, I know it is not in the FPGA (here max ~900ns round trip,
typical more like 600ns). Whether the overhead is in the operating
system as written by Microsoft or in the chip-set driver, probably as
written by Intel, I indeed can't say for sure. From a HW or user point
of view, I would however consider both together as 'the operating
system'. My assumption on the interrupt etc. is based on descriptions I
have seen regarding how Intel chip-sets generally implement port
input/output requests. I'm assuming they do much the same for memory
reads/writes. I am also sure that real memory (as in internal RAM)
reads/writes are handled much more effectively as these go through the
MCH chip in 'standard' Intel chip-sets. The newer Server single-chip
companion chip E3xxx (or some number like that) with natively attached
PCIe lanes for the peripherals are probably also faster but I don't have
such a system (yet).
By the way, any real information you have regarding where the overhead
arises would be much appreciated.
Regards,
Charles
If you are referring to the READ_REGISTER_ULONG operation, then your conjecture
is way off. If you look at the include file that defines this operation,
it is a memory barrier operation and then a memory read, or just a volatile
memory read, so talking about the OS or the chipset driver getting in the
middle is nonsense. These reads are non-cached, they do go through the Intel
chipset, and you are issuing one operation at a time, but it is not some
mysterious driver that is getting in the way.
--
Don Burn (MVP, Windows DKD)
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr
Do you mean 600-900 ns is your FPGA's round-trip time? Do you have a PCIe
analyser? It would come in very handy to debug such issues.
To be precise, I'm referring to
READ_REGISTER_BUFFER_ULONG(Register, Buffer, Count).
which is defined in wdm.h as
/* Variant built on the x86/x64 __movsd intrinsic (rep movsd). */
__forceinline
VOID
READ_REGISTER_BUFFER_ULONG (
    PULONG Register,
    PULONG Buffer,
    ULONG Count
    )
{
    _ReadWriteBarrier();
    __movsd(Buffer, Register, Count);
    return;
}

/* Variant built on the Itanium __mf() memory-fence intrinsic. */
#define READ_REGISTER_BUFFER_ULONG(x, y, z) {                            \
    PULONG registerBuffer = x;                                           \
    PULONG readBuffer = y;                                               \
    ULONG readCount;                                                     \
    __mf();                                                              \
    for (readCount = z; readCount--; readBuffer++, registerBuffer++) {   \
        *readBuffer = *(volatile ULONG * const)(registerBuffer);         \
    }                                                                    \
}
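
On uncached device memory, both variants appear to end up as one load per DWORD, which would match the observation that the PC never bursts. A portable stand-in (not the WDK code, and with a made-up function name) makes that visible and can be tried against an ordinary array instead of a mapped BAR:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the copy loop: every iteration performs exactly one
 * volatile 32-bit load, so the compiler may not merge or drop accesses.
 * On real hardware, reg would point into a mapped device BAR. */
static void read_register_buffer_u32(const volatile uint32_t *reg,
                                     uint32_t *buf, size_t count)
{
    assert(count == 0 || (reg != NULL && buf != NULL));
    while (count--) {
        *buf++ = *reg++;
    }
}
```

Each of those loads would turn into its own non-posted read request on the link, so the per-read latency is paid once per DWORD.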
Are you saying that in reality the processor just sits there for about
1.5 us per iteration, waiting for PCIe to come back with a single data
DWORD (because that is what I measure at the hardware)? Surely not.
But 1.5 us is one heck of a long time for a processor which is supposed
to be running at, say, 2.x GHz. Assuming then that there is no
'mysterious driver' involvement, what is happening in reality?
- Thread suspension + polling of some status register in the ICH to say
when PCIe has finished?
- Indefinite thread suspension until ICH signals PCIe completion over an
FSB interrupt?
- Something else?
>
> Do you mean 600-900 ns is your FPGA's round-trip time? Do you have a PCIe
> analyser? It would come in very handy to debug such issues.
>
>
The figures are:
- roughly 450 ns for the first few packets, i.e. buffers empty, plenty of
credits
- typically 600 ns, DMA traffic but no credit stalls
- worst case 900 ns, heavy DMA and some credit stalls
By round-trip time, I mean the time from the first header DWORD entering the
PCIe core in the requestor to the last DWORD of the completion packet arriving
back at the requestor. This
was measured in simulation with two identical PCIe cores connected
back-to-back (Aldec VHDL/Verilog simulator, Lattice ECP2M FPGA). i.e.
this is the time that user logic in the PCIe end-point would see if the
completer was a pure hardware implementation and could deliver data as
soon as the request had been received.
Packet transmission is normally pretty fast. It's the reception that's
slow since the packet has to be checked by the data-link layer before
passing it on to the user/application logic. Switches in the path often
use transparent mode i.e. the data layer checks on the fly and issues a
'nullify' if it unexpectedly detects a link CRC error.
The assumption here is of course that all chips have much the same
overhead in the data-link/physical layers. From the figures I have from
different chips or heard from people on different projects, this is the
case. In PCIe Gen 1.x, your byte time is 4 ns (UI 400 ps).
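
The byte time follows from the unit interval and the 8b/10b line code used by Gen 1: ten 400 ps bit times per byte. A small sketch, with illustrative function names and ignoring framing symbols, DLLPs and lane skew, gives a feel for how much of the round trip is pure wire time:

```c
#include <assert.h>

/* Time to send one byte on one lane, in picoseconds: with 8b/10b coding
 * each byte occupies bits_per_symbol (10) unit intervals. */
static unsigned byte_time_ps(unsigned ui_ps, unsigned bits_per_symbol)
{
    return ui_ps * bits_per_symbol;
}

/* Wire time in picoseconds for dwords DWORDs striped across lanes lanes.
 * Rounds up to whole bytes per lane; framing is not counted. */
static unsigned wire_time_ps(unsigned dwords, unsigned lanes,
                             unsigned byte_ps)
{
    assert(lanes > 0);
    unsigned bytes = dwords * 4u;
    unsigned bytes_per_lane = (bytes + lanes - 1u) / lanes;
    return bytes_per_lane * byte_ps;
}
```

byte_time_ps(400, 10) is 4000 ps, i.e. 4 ns. Seven DWORDs on a single lane then take 28 x 4 ns = 112 ns, and 28 ns on a x4 link, which suggests that most of the 450 to 900 ns round trip is spent outside the wire itself.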
--
shingo for windows driving & winCE driving
"Alexander Grigoriev" wrote:
> .
>
> Do you guys mean that it is not possible to accelerate the read speed
> without DMA involved?
Yes, unfortunately you will not get any decent transfer rates without DMA.
> Do I have to persuade the hardware designer to add a DMA adapter on the
> board? Or can the FPGA implement the function of a DMA adapter?
>
I'd put the DMA controller in the FPGA.
I'm sure nearly every FPGA manufacturer has some DMA controller in its
IP library, and some are probably even free. It's not the most complicated
piece of logic to do yourself either. Essentially, your driver must feed
the DMA controller with the scatter/gather lists describing your IRP
buffer. The PciDrv and PLX9xxx examples in the WDK are great for getting
started.
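
To give an idea of what feeding the DMA controller involves, here is a minimal sketch in plain C. The descriptor layout, the flag and all names are hypothetical; a real DMA core defines its own format, and in a KMDF driver the fragments would come from the scatter/gather list that the framework hands to the driver rather than from a hand-built array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_DESC_LAST 0x1u  /* hypothetical flag: last descriptor in chain */

/* Hypothetical descriptor as the FPGA's DMA engine might fetch it. */
typedef struct {
    uint64_t host_addr;  /* bus address of one buffer fragment */
    uint32_t length;     /* fragment length in bytes */
    uint32_t flags;      /* DMA_DESC_LAST on the final entry */
} dma_desc;

/* One fragment of the request buffer (address/length pair). */
typedef struct {
    uint64_t addr;
    uint32_t len;
} sg_elem;

/* Translate n scatter/gather fragments into descriptors in out.
 * Returns the number of descriptors written, or 0 if n is zero or the
 * descriptor array (capacity cap) is too small. */
static size_t build_desc_chain(const sg_elem *sg, size_t n,
                               dma_desc *out, size_t cap)
{
    if (n == 0 || n > cap) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        out[i].host_addr = sg[i].addr;
        out[i].length = sg[i].len;
        out[i].flags = (i == n - 1) ? DMA_DESC_LAST : 0u;
    }
    return n;
}
```

The driver would then hand the descriptor array to the FPGA (for example by writing its bus address to a device register), and the DMA engine, not the CPU, would move the data in bursts.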