On Monday, December 10, 2012 7:42:17 PM UTC-5, MG wrote:
>> So I learned how the IP stacks do their stuff. They
>> use a kernel mode API known as VCI. That was lovingly
>> documented in April 1993 and there is a random web
>> site that has it all as an old school vms HTML help.
>> It's very nice.
>
> Fascinating. Thank you, I will definitely look into that!
The VCI API documentation can be found here.
http://starlet.deltatel.ru/vci1.html
The Digital employee that wrote that documentation clearly cared
about the API. That manual was a labor of love. It's quite detailed
and incredibly accurate. Not bad for something that was written
nearly 20 years ago! I owe that employee some beer. I could
never have written what I did without that documentation.
>> At this level of the system, I would say VCI
>> is pretty good. The receive handler would take
>> in the Ethernet frame in interrupt mode and
>> shuffle it off to another process. I found that
>> could be done in about 8 microseconds on
>> an rx8640. Decent. I suspect the C compiler
>> didn't produce the best ia64 code.
> What kind of optimizations did you attempt?
My UDP stack was aimed at a specific internal application on
Itanium so the optimization choices were aimed at those needs.
So some of what I will describe will seem like it's not useful
for a generalized usage pattern.
In this case, the application was going to exchange only one
UDP packet in a request/response fashion. While there would
be more than one instance of the application at a time, each one
would have a monogamous relationship with its Linux-based buddy.
And the Linux-based process was going to be in the driver's seat
with the VMS host process acting as responsive slave. So one
process to one process. One request. One reply.
VCI takes an Ethernet frame and presents it in the form of a
VCRP to the receive function that you provide. It is the responsibility
of that function to then deallocate that VCRP.
The deallocation requires taking out IOLOCK8. In looking at the
performance, I had wondered if there was value to bunching up
those frees. Let the receive function accumulate a few (8?)
and then do the free en masse. You can't leak those VCRPs as
they are quite precious.
But in my case, I was going to receive a steady stream of them.
The cost of freeing them might be cheaper when amortized out in
that fashion. Especially since acquiring IOLOCK8 seemed expensive
compared to the free.
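
To make the batching idea concrete, here is a minimal sketch. It is
not VCI code: packet_buf, io_lock(), io_unlock() and buf_free_locked()
are hypothetical stand-ins for a VCRP, acquiring and releasing IOLOCK8,
and the real deallocation call. The lock is simulated with a counter so
the amortization is visible.

```c
#include <assert.h>
#include <stddef.h>

#define FREE_BATCH 8   /* the "few (8?)" above; the right value needs measuring */

/* Hypothetical stand-ins; none of these names come from the VCI manual. */
typedef struct packet_buf { int in_use; } packet_buf;

static unsigned long lock_acquisitions;  /* how often the "lock" was taken */
static unsigned long buffers_freed;      /* how many buffers were released */

static void io_lock(void)   { lock_acquisitions++; }  /* placeholder for taking IOLOCK8 */
static void io_unlock(void) { }                        /* placeholder for releasing it */
static void buf_free_locked(packet_buf *b) { b->in_use = 0; buffers_freed++; }

typedef struct free_batch {
    packet_buf *pending[FREE_BATCH];
    size_t      count;
} free_batch;

/* Release everything queued so far under a single lock acquisition. */
static void batch_flush(free_batch *fb)
{
    if (fb->count == 0)
        return;
    io_lock();
    for (size_t i = 0; i < fb->count; i++)
        buf_free_locked(fb->pending[i]);
    io_unlock();
    fb->count = 0;
}

/* Queue one buffer; flush automatically once the batch is full. */
static void batch_free(free_batch *fb, packet_buf *b)
{
    fb->pending[fb->count++] = b;
    if (fb->count == FREE_BATCH)
        batch_flush(fb);
}
```

With 20 buffers this takes the lock 3 times instead of 20, provided
something calls batch_flush() when traffic goes quiet. Without that
final flush the tail of the batch stays held, which is exactly the
kind of leak that has to be avoided.
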
In my case, there's also the issue of harvesting the bytes from the
VCRP. At times, it looked like the memcpy out of the VCRP was expensive.
So I had played with transferring that responsibility to the receiving
process. I would keep the VCRP, poke the process with $wake, and then
let the receiving process harvest the bytes and free the VCRP through
the API I provided to the IP stack.
This reduced the amount of time spent in the receive function which
frees it up to process another packet. But if the receiving VMS process
went away or was unresponsive, you would leak the VCRP unless one had
another mechanism in place to harvest those.
I think part of my issue was caused by working with a 10 GigE card in a
PCI-X slot. So copying large amounts of data on and off that card was
relatively expensive. Placing the card in a PCI Express v2 slot would
have been a better choice, but that wasn't a possibility for me at
the time.
My whole line of thinking is that the receive function is a game of hot
potato. Get in, get out, and STFU. Any farting around in the receive
function can't be tolerated. It backs up everything. Every line of code
has to be defended or it goes. There may be a million microseconds in
a second, but every one of them is precious and you have to care about
each one. Don't waste them.
My sending side seemed to be pretty pokey at times, but the corporate
support for the project evaporated before I could really dig in.
I felt that the C code to compute the IP header checksum was slower than
I expected. Macro-32 might do better in this area.
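
For reference, the IP header checksum is the ones'-complement sum
described in RFC 1071. The sketch below is a plain, unoptimized C
version of that algorithm, not the code from my stack; the function
name inet_checksum is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones'-complement Internet checksum (RFC 1071) over len bytes.
 * Bytes are combined as big-endian 16-bit words, so the result is the
 * numeric value to store in the header, high byte first.
 * The 32-bit accumulator cannot overflow for inputs up to 128 KiB. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len == 1)                      /* odd trailing byte, padded with zero */
        sum += (uint32_t)data[0] << 8;

    while (sum >> 16)                  /* fold the carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

On the send side you zero the checksum field, run this over the 20
header bytes, and store the result. On the receive side, running it
over the header as received gives 0 when the header is intact.
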
I spent some time making sure I had VCRPs pre-created prior to sending.
Since I would only send at most one UDP packet at a time, I could get away
with that given the target application.
I skipped checking the IP header checksum as I recall. Again, the target
usage was a data center application. The idea that the packet is damaged
in transit in that environment seemed absurd to me.
And because I was using IPv4, I could opt out of the UDP data checksum,
which is optional in that case. Computing that checksum for 64k worth of
data seemed awfully expensive and not necessary given modern environments.
Again, a Macro-32 implementation might be better for Itanium than C.
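
Opting out is simple on the wire: per RFC 768, a UDP checksum field of
all zeros over IPv4 means the sender generated no checksum. A minimal
sketch of building such a header follows; the function name and the
constants are made up for the example, and it assumes an IPv4 header
without options.

```c
#include <assert.h>
#include <stdint.h>

#define UDP_HEADER_LEN   8u
#define UDP_MAX_PAYLOAD  65507u  /* 65535 - 20 (IPv4 header, no options) - 8 */

/* Write an 8-byte UDP header in network byte order with the checksum
 * field left as zero, i.e. "no checksum generated" (IPv4 only). */
static void udp_header_no_checksum(uint8_t out[UDP_HEADER_LEN],
                                   uint16_t src_port, uint16_t dst_port,
                                   uint16_t payload_len)
{
    assert(payload_len <= UDP_MAX_PAYLOAD);
    uint16_t udp_len = (uint16_t)(UDP_HEADER_LEN + payload_len);

    out[0] = (uint8_t)(src_port >> 8);
    out[1] = (uint8_t)(src_port & 0xFF);
    out[2] = (uint8_t)(dst_port >> 8);
    out[3] = (uint8_t)(dst_port & 0xFF);
    out[4] = (uint8_t)(udp_len >> 8);   /* length covers header + payload */
    out[5] = (uint8_t)(udp_len & 0xFF);
    out[6] = 0;                          /* checksum: zero = not computed */
    out[7] = 0;
}
```

The receiving side has to be willing to accept a zero checksum, which
is something to confirm for the particular peer stack and its settings.
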
IP fragmentation is painful when doing the send side. In that sense,
jumbo frames can be a huge win. You can go from 40 sends to 9. Fewer
VCRPs to manage, and less overhead in total.
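
The fragment arithmetic can be sketched as below. The function name is
made up for the example and it assumes a 20-byte IPv4 header with no
options. The exact counts depend on the datagram size and on the MTU
your NIC and switch agree on, so they will not necessarily match the
rough figures above.

```c
#include <assert.h>
#include <stdint.h>

#define IPV4_HEADER_LEN 20u   /* assumes no IP options */

/* Number of IPv4 packets needed to carry ip_payload_len bytes
 * (UDP header + data) over a link with the given MTU.
 * Every fragment except the last must carry a multiple of 8 bytes. */
static uint32_t ipv4_fragment_count(uint32_t ip_payload_len, uint32_t mtu)
{
    assert(mtu >= IPV4_HEADER_LEN + 8u);

    uint32_t max_last     = mtu - IPV4_HEADER_LEN;   /* last fragment limit */
    uint32_t per_fragment = (max_last / 8u) * 8u;    /* all other fragments */

    if (ip_payload_len <= max_last)
        return 1u;                                   /* fits unfragmented */

    return 1u + (ip_payload_len - max_last + per_fragment - 1u) / per_fragment;
}
```

For a maximum-size UDP datagram (65507 data bytes plus the 8-byte UDP
header, so 65515 bytes of IP payload) this gives 45 packets at an MTU
of 1500 and 8 at an MTU of 9000.
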
You can find some example VCI code in an old VMS freeware CD. I think
it was v5 or v6 that had a working pcapvcm.c/vci_jacket.mar for the tcpdump
utility. That one came in two flavors. One used SYS$QIO and the
other used VCI. I believe there were a few bugs in the VCI usage which
caused crashes and that code was ultimately pulled. If you are good with
the great google, you can find it.
Hope that helps a bit.
EJ