Obviously, it raises the question: 1. Did anyone try to work with lwIP on Linux/Win32, and what throughput have you been able to achieve?
I totally agree with Simon, at least with the RAW API – I can’t speak for NETCONN performance. My “embedded” system is like yours (667 MHz CPU and 256 MB 667 MHz DDR2) and is zero-copy with an optimized checksum (aligned accesses, in assembly). I can reach 500-600+ Mbit/s with TCP. In my testing of UDP, the PC in some instances was not able to keep pace receiving. I do not run lwIP on the PC – we use Win32 sockets, so far without issues.
It would be nice to see a standard Win32/Linux program to communicate with a simple lwIP test program (RAW, NETCONN, sockets) to create a list of lwIP platforms and the results of the tests.
From:
lwip-users-bounces+bauerbach=arrayon...@nongnu.org
[mailto:lwip-users-bounces+bauerbach=arrayon...@nongnu.org] On Behalf
Of Simon Goldschmidt
Sent: Tuesday, February 07, 2012 12:57 AM
To: Mailing list for lwIP users
Subject: Re: [lwip-users] LWIP/WIN32 UDP performance (throughput)
Zayaz Volk <papu...@hotmail.com> wrote:
Obviously, it raises the questions
1. Did anyone try to work with lwIP on Linux/Win32, and what throughput have you been able to achieve?
In my opinion, the win32 port is mainly limited by the netif driver: winpcap doesn't seem to give the best performance, since we have to copy packets twice on RX (and I don't remember if it's once or twice on TX). Also, lwIP doesn't really benefit from multiple CPU cores. Instead, you're right that it is targeted at low resource usage rather than throughput.
Still, I think lwIP should be able to achieve the performance you want if:
A) your netif driver correctly supports zero copy and
B) you handle the stack "correctly" so that zero copy is possible (i.e. data passed to a TX function must be left unchanged in memory until the netif driver has actually sent it out).
Simon
> Still, I think lwIP should be able to achieve the performance you want if:
> A) your netif driver correctly supports zero copy
A few months ago(*) we discussed zero-copy for DMA drivers,
and you wrote:
> For the RX side, using a *custom* PBUF_REF would be the best solution.
> That's a pbuf that has a 'freed' callback and references external
> memory. However, that doesn't work, yet (though I planned to add support
> for it as I can see it's one possible solution to implement DMA MAC
> drivers). The problem here is that pbuf_header() can't grow such pbufs
> (as it doesn't know the original length). This would have to be fixed by
> changing the struct pbuf (if only for PBUF_REF pbufs).
Is this mode supported in the trunk?
If yes, do you plan to port it to 1.4.1?
If no, that means I have to memcpy every incoming frame, right?
> As to the TX side: normally, TX pbufs are allocated as PBUF_RAM, the
> memory for that is taken from the heap by calling mem_malloc(). Now the
> simple solution would be to just replace mem.c by your own code
> allocating and freeing from your TX pool: with the correct settings,
> mem_malloc() isn't used for anything else but PBUF_RAM pbufs. The only
> problem might be that you don't know in advance how big the memory
> blocks are (and thus how big your TX buffer entries should be), but by
> using TCP_OVERSIZE, you can just use the maximum ethernet frame size (if
> you don't mind wasting some RAM for smaller packets).
But when I looked,
it didn't seem possible to use mem_malloc as a packet
buffer allocator, as there are other types of uses, e.g.
dhcp = (struct dhcp *)mem_malloc(sizeof(struct dhcp));
autoip = (struct autoip *)mem_malloc(sizeof(struct autoip));
What I'd like is to be able to "redirect" pbuf_alloc to use
my buffer allocator, but the pbuf struct and the payload buffer
would not be contiguous, which violates the implicit assumptions
of PBUF_RAM pbufs, IIUC.
Have things changed on that front?
If no, that means I have to memcpy every outgoing frame, right?
(*) Date: Thu, 06 Oct 2011 16:10:07 +0200
http://lists.gnu.org/archive/html/lwip-users/2011-10/msg00008.html
--
Regards.
> Did somebody use a UDP on any hardware using NETCONN API (zero copy,
> multithreaded environment) ? What was your maximum throughput ? What
> is the characteristic of your system ?
I can max out a Fast Ethernet link on a 450-MHz SH-4 system with
lots of CPU cycles to spare.
Stas,
Yes: RAW API and NO_SYS=1. You don’t need threads with the RAW API if you have a “big loop” type of program that spins, processing lwIP timers and other system events.
Single threaded? Yes and no, I guess. We use a cooperative RTOS, so there are many tasks, but one task handles Ethernet, lwIP and its callbacks. The other tasks do send TCP data, but being cooperative in nature there is never a worry as to when they need to send something. Being mostly event driven, lwIP’s task gets all the CPU time left over after interrupt processing, so there is always all the bandwidth we need. This is a real-time system and can push 50-60 MB/s (TCP); at this speed 30% of the processor is spent processing interrupts (not Ethernet interrupts, which are polled).
Bill
From:
lwip-users-bounces+bauerbach=arrayon...@nongnu.org
[mailto:lwip-users-bounces+bauerbach=arrayon...@nongnu.org] On Behalf
Of Zayaz Volk
Sent: Tuesday, February 07, 2012 11:32 AM
To: lwip-...@nongnu.org
Subject: Re: [lwip-users] LWIP/WIN32 UDP performance (throughput)
I did find this information on the lwip wiki
http://lwip.wikia.com/wiki/Writing_a_device_driver#Notes_on_Zero-Copy_Network_interface_drivers
But I think it's a little bit dated?
Mason <mpeg...@free.fr> wrote:
> > I don't think it's possible to use mem_malloc as a packet
> > buffer allocator, as there are other types of uses, e.g.
> >
> > dhcp = (struct dhcp *)mem_malloc(sizeof(struct dhcp));
> > autoip = (struct autoip *)mem_malloc(sizeof(struct autoip));
These are not used if you init autoip and dhcp correctly (dhcp_set_struct(), autoip_set_struct()). I know that's not a good solution and we should add pools for it.
> > What I'd like is to be able to "redirect" pbuf_alloc to use
> > my buffer allocator, but the pbuf struct and the payload buffer
> > would not be contiguous, which violates the implicit assumptions
> > of PBUF_RAM pbufs, IIUC.
> >
> > Have things changed on that front?
No, not yet. What's missing there are two things:
a) redirecting pbuf_alloc() to a custom function and
b) having a PBUF_REF-like type where pbuf_header works in both directions.
It shouldn't be too hard to implement, but I haven't even found the time to work much on the 1.4.1 release-related things, lately, so it might take some time until we get there :-(
> > If no, that means I have to memcpy every outgoing frame, right?
Not if you redirect mem_malloc(). However, your platform would then need to allow the pbuf struct and payload to be in contiguous memory.
> I did find this information on the lwip wiki
> http://lwip.wikia.com/wiki/Writing_a_device_driver#Notes_on_Zero-Copy_Network_interface_drivers
>
> But I think it's a little bit dated?
Yeah, that doesn't really sum it up right. I don't think I have ever seen that article before...
Simon
> Sorry for not answering that post, I meant to, but somehow it slipped through...
I was afraid to bother you by asking too many questions ;-)
> Mason <mpeg...@free.fr> wrote:
>
>> I don't think it's possible to use mem_malloc as a packet
>> buffer allocator, as there are other types of uses, e.g.
>>
>> dhcp = (struct dhcp *)mem_malloc(sizeof(struct dhcp));
>> autoip = (struct autoip *)mem_malloc(sizeof(struct autoip));
>
> These are not used if you init autoip and dhcp correctly
> (dhcp_set_struct(), autoip_set_struct()). I know that's not a good
> solution and we should add pools for it.
This is all voodoo and black magic to me. It's very cool that you
are so nice to explain stuff when questions are asked, but it
really feels like there is a lot of high-level documentation
missing, wouldn't you agree?
>> What I'd like is to be able to "redirect" pbuf_alloc to use
>> my buffer allocator, but the pbuf struct and the payload buffer
>> would not be contiguous, which violates the implicit assumptions
>> of PBUF_RAM pbufs, IIUC.
>>
>> Have things changed on that front?
>
> No, not yet. What's missing there are two things:
> a) redirecting pbuf_alloc() to a custom function and
> b) having a PBUF_REF-like type where pbuf_header works in both directions.
>
> It shouldn't be too hard to implement, but I haven't even found the
> time to work much on the 1.4.1 release-related things, lately, so it
> might take some time until we get there :-(
I think I'll err on the safe side, and make my port do the extra
memcpy in the initial version. Then I can profile the different
use-cases, and if there is a performance problem, then I'll see
how to get rid of the copies. (Probably in 2-3 months)
>> If no, that means I have to memcpy every outgoing frame, right?
>
> Not if you redirect mem_malloc(). However, your platform would then need
> to allow the pbuf struct and payload to be in contiguous memory.
I've been meaning to talk to you about this issue.
Since I use DMA, there are hard constraints on the location of the buffer.
It must start at an address multiple of 32 (cache line aligned) and the
length must be a multiple of 32, and we must purge the appropriate lines
from cache before the DMA operation.
I suppose I could allocate 32 more bytes to hold the pbuf struct and
fudge the pointers to make the payload start on the right boundary,
but you've said that pbuf_header could move the start of payload,
or something like that... More voodoo stuff.
Like someone said on the list, I think lwip wants control over
the packet buffers, which is a problem when a driver also needs
control over the buffers.
--
Regards.
This is what I've done on 2 platforms - both use DMA and zero-copy (one uses
chained DMA and needs cache maintenance). I've done this "fudging". I
aligned the payload and padded it out to the required size.
>Like someone said on the list, I think lwip wants control over
>the packet buffers, which is a problem when a driver also needs
>control over the buffers.
It's not a problem that can't be overcome with lwIP as is. If this pbuf
driver management was done from the onset, all the better. But it wasn't
and they would risk breaking compatibility with a lot of lwIP programs if
pbuf use and the API for it is changed.
Bill
Of course I agree! It's just as always, writing the code is one thing, but writing documentation... :-)
> I think I'll err on the safe side, and make my port do the extra
> memcpy in the initial version. Then I can profile the different
> use-cases, and if there is a performance problem, then I'll see
> how to get rid of the copies. (Probably in 2-3 months)
That's probably the best idea.
> >> If no, that means I have to memcpy every outgoing frame, right?
> >
> > Not if you redirect mem_malloc(). However, your platform would then need
> > to allow the pbuf struct and payload to be in contiguous memory.
>
> I've been meaning to talk to you about this issue.
> Since I use DMA, there are hard constraints on the location of the buffer.
> It must start at an address multiple of 32 (cache line aligned) and the
> length must be a multiple of 32, and we must purge the appropriate lines
> from cache before the DMA operation.
>
> I suppose I could allocate 32 more bytes to hold the pbuf struct and
> fudge the pointers to make the payload start on the right boundary,
> but you've said that pbuf_header could move the start of payload,
> or something like that... More voodoo stuff.
And that is something I need, too: using a different alignment constraint for pbufs (16 bytes) than for the rest (4 bytes). I might have to add that to 1.4.1 so we can use a clean version in our products :-)
Simon
_______________________________________________
Hi,
I need 32-byte alignment on Blackfin for L1 cache
and even 128-byte alignment on C674x for L2 cache.
I just add this in my cc.h:
// PBUF_POOL needs to be aligned to cache line size
#ifndef __cplusplus
extern u8_t memp_memory_PBUF_POOL_base[] __attribute__ ((aligned (32)));
#endif
And in my lwipopts.h:
#define PBUF_POOL_BUFSIZE 1520
// + sizeof(pbuf) = 1536 = 48 cache lines
But I agree it could be easier with something like:
#define PBUF_ALIGNMENT 32
--
Stephane
But then struct pbuf is still only 16 bytes long, so the payload might start in the middle of a 32-byte boundary, which leads to faults when flushing the cache (since the struct pbuf members are still used cached), or am I wrong there?
> But I agree it could be easier with something like:
> #define PBUF_ALIGNMENT 32
That's what I imagined (only with another name maybe).
Simon
_______________________________________________
I don't have the problem, because:
1. I don't need to align my payload. I just need to ensure that a cache line cannot contain payload of pbuf N and struct of pbuf N+1.
I do this by setting (PBUF_POOL_BUFSIZE + sizeof(struct pbuf)) = multiple of cache line size
2. on RX, the pbuf is owned by the driver so I can safely flush the cache to get the payload, because I know the stack will not modify a value in the struct.
> > But I agree it could be easier with something like:
> > #define PBUF_ALIGNMENT 32
>
> That's what I imagined (only with another name maybe).
Ok, maybe we need 2 settings:
1. alignment of the pbuf struct
2. alignment of the pbuf payload (may lead to wasted memory)
--
Stéphane Lesage
>> But then struct pbuf is still only 16 bytes long, so payload might
>> start in the middle of a 32-byte border, which leads to faults when
>> flushing cache (since the struct pbuf members are still used cached) or
>> am I wrong there?
>
> I don't have the problem, because:
> 1. I don't need to align my payload. I just need to ensure that a cache line cannot contain payload of pbuf N and struct of pbuf N+1.
> I do this by setting (PBUF_POOL_BUFSIZE + sizeof(struct pbuf)) = multiple of cache line size
>
> 2. on RX, the pbuf is owned by the driver so I can safely flush the cache to get the payload, because I know the stack will not modify a value in the struct.
>
Correct me if I am wrong, but I think for this to work you have to invalidate a pbuf before passing it to the RX engine. When the payload starts on a 32-byte boundary, you can leave a TX pbuf invalidated and pass it to RX.
Simon
I did this like you did, Stephane (modified the #defines), and I also agree
that both of these would be useful if built-in.
Bill
I'm not sure I understand what you mean.
Yes, the driver needs to invalidate after allocating and before DMA RX.
This is a good time to publish my drivers:
http://lwip.wikia.com/wiki/Available_device_drivers#lwIP_1.4.1
Here's how they work:
1. init:
- prepare TX descriptors list
- init RX descriptors list: allocate pbuf and WriteBack+Invalidate the
payload buffer
- init interrupts, start DMA, start MAC
2. low_level_output:
- we need 1 contiguous buffer: pbuf_ref() or pbuf_copy()
- get a TX descriptor from the "free" queue
- setup descriptor and WriteBack the payload buffer
- append to "queued" descriptors
- if DMA not active, move "queued" to "active" and start DMA
3. RX interrupt:
- move received descriptors to "complete" queue
- update "active" queue: remove "complete", add "queued" descriptors and
eventually restart DMA
- tcpip_trycallback(rxmsg);
4. RX callback in the tcpip thread:
- loop to extract descriptors from "complete" queue
- update statistics
- give the packet to the stack with ethernet_input()
- try to allocate a pbuf for the newly free descriptor:
If the pbuf has been reused in the meantime -> move to "free".
If pbuf re-allocated:
- WriteBack+Invalidate the payload buffer
- move to "queued" descriptors
- if DMA not active, move "queued" to "active" and start DMA
5. TX interrupt:
- move sent descriptors to "complete" queue
- update "active" queue: remove "complete", add "queued" descriptors and
eventually restart DMA
- tcpip_trycallback(txmsg);
6. TX callback in the tcpip thread:
- loop to extract descriptors from "complete" queue
- update statistics
- free the pbuf
- move back the descriptor to "free" queue
- on exit, try to allocate pbuf for RX "free" descriptors to move to
"queued" descriptors
--
Stephane