Obviously, it raises the question: 1. Did anyone try to work with lwIP on Linux/Win32, and what throughput have you been able to achieve?
I totally agree with Simon, at least with the RAW API – I can’t speak for NETCONN performance. My “embedded” system is like yours (667 MHz CPU and 256 MB 667 MHz DDR2) and is zero-copy with an optimized checksum (aligned accesses, in assembly). I can reach 500-600+ Mbit/s with TCP. In my testing of UDP, the PC in some instances was not able to keep pace receiving. I do not run lwIP on the PC – we use Win32 sockets, so far without issues.
It would be nice to see a standard Win32/Linux program to communicate with a simple lwIP test program (RAW, NETCONN, sockets) to create a list of lwIP platforms and the results of the tests.
From:
lwip-users-bounces+bauerbach=arrayon...@nongnu.org
[mailto:lwip-users-bounces+bauerbach=arrayon...@nongnu.org] On Behalf
Of Simon Goldschmidt
Sent: Tuesday, February 07, 2012 12:57 AM
To: Mailing list for lwIP users
Subject: Re: [lwip-users] LWIP/WIN32 UDP performance (throughput)
Zayaz Volk <papu...@hotmail.com> wrote:
Obviously, it raises the questions
1. Did anyone try to work with lwIP on Linux/Win32, and what throughput have you been able to achieve?
In my opinion, the win32 port is mainly limited by the netif driver: winpcap doesn't seem to give the best performance, since we have to copy packets twice on RX (and I don't remember if it's once or twice on TX). Also, lwIP doesn't really benefit from multiple CPU cores. Instead, you're right that it is targeted at low resource usage rather than throughput.
Still, I think lwIP should be able to achieve the performance you want if:
A) your netif driver correctly supports zero copy and
B) you handle the stack "correctly" so that zero copy is possible (i.e. data passed to a TX function must be left unchanged in memory until the netif driver has actually sent it out).
Simon
> Still, I think lwIP should be able to achieve the performance you want if:
> A) your netif driver correctly supports zero copy
A few months ago(*) we discussed zero-copy for DMA drivers,
and you wrote:
> For the RX side, using a *custom* PBUF_REF would be the best solution.
> That's a pbuf that has a 'freed' callback and references external
> memory. However, that doesn't work, yet (though I planned to add support
> for it as I can see it's one possible solution to implement DMA MAC
> drivers). The problem here is that pbuf_header() can't grow such pbufs
> (as it doesn't know the original length). This would have to be fixed by
> changing the struct pbuf (if only for PBUF_REF pbufs).
Is this mode supported in the trunk?
If yes, do you plan to port it to 1.4.1?
If no, that means I have to memcpy every incoming frame, right?
> As to the TX side: normally, TX pbufs are allocated as PBUF_RAM, the
> memory for that is taken from the heap by calling mem_malloc(). Now the
> simple solution would be to just replace mem.c by your own code
> allocating and freeing from your TX pool: with the correct settings,
> mem_malloc() isn't used for anything else but PBUF_RAM pbufs. The only
> problem might be that you don't know in advance how big the memory
> blocks are (and thus how big your TX buffer entries should be), but by
> using TCP_OVERSIZE, you can just use the maximum ethernet frame size (if
> you don't mind wasting some RAM for smaller packets).
But when I looked,
it didn't seem possible to use mem_malloc as a packet
buffer allocator, as there are other types of uses, e.g.
dhcp = (struct dhcp *)mem_malloc(sizeof(struct dhcp));
autoip = (struct autoip *)mem_malloc(sizeof(struct autoip));
What I'd like is to be able to "redirect" pbuf_alloc to use
my buffer allocator, but the pbuf struct and the payload buffer
would not be contiguous, which violates the implicit assumptions
of PBUF_RAM pbufs, IIUC.
Have things changed on that front?
If no, that means I have to memcpy every outgoing frame, right?
(*) Date: Thu, 06 Oct 2011 16:10:07 +0200
http://lists.gnu.org/archive/html/lwip-users/2011-10/msg00008.html
--
Regards.
> Did somebody use a UDP on any hardware using NETCONN API (zero copy,
> multithreaded environment) ? What was your maximum throughput ? What
> is the characteristic of your system ?
I can max out a Fast Ethernet link on a 450-MHz SH-4 system with
lots of CPU cycles to spare.
Stas,
Yes: RAW API and NO_SYS=1. You don’t need threads with the RAW API if you have a “big loop” type of program that spins, processing lwIP timers and other system events.
Single threaded? Yes and no, I guess. We use a cooperative RTOS, so there are many tasks, but one task handles Ethernet, lwIP and its callbacks. The other tasks do send TCP data, but being cooperative in nature there is never a worry as to when they need to send something. Being mostly event driven, lwIP’s task gets all the CPU time left over after interrupt processing, so there is always all the bandwidth we need. This is a real-time system and can push 50-60 MB/s (TCP); at this speed 30% of the processor is spent processing interrupts (not Ethernet interrupts, which are polled).
Bill
From:
lwip-users-bounces+bauerbach=arrayon...@nongnu.org
[mailto:lwip-users-bounces+bauerbach=arrayon...@nongnu.org] On Behalf
Of Zayaz Volk
Sent: Tuesday, February 07, 2012 11:32 AM
To: lwip-...@nongnu.org
Subject: Re: [lwip-users] LWIP/WIN32 UDP performance (throughput)
I did find this information on the lwip wiki
http://lwip.wikia.com/wiki/Writing_a_device_driver#Notes_on_Zero-Copy_Network_interface_drivers
But I think it's a little bit dated?
Mason <mpeg...@free.fr> wrote:
> > I don't think it's possible to use mem_malloc as a packet
> > buffer allocator, as there are other types of uses, e.g.
> >
> > dhcp = (struct dhcp *)mem_malloc(sizeof(struct dhcp));
> > autoip = (struct autoip *)mem_malloc(sizeof(struct autoip));
These are not used if you init autoip and dhcp correctly (dhcp_set_struct(), autoip_set_struct()). I know that's not a good solution and we should add pools for it.
> > What I'd like is to be able to "redirect" pbuf_alloc to use
> > my buffer allocator, but the pbuf struct and the payload buffer
> > would not be contiguous, which violates the implicit assumptions
> > of PBUF_RAM pbufs, IIUC.
> >
> > Have things changed on that front?
No, not yet. What's missing there are two things:
a) redirecting pbuf_alloc() to a custom function and
b) having a PBUF_REF-like type where pbuf_header works in both directions.
It shouldn't be too hard to implement, but I haven't even found the time to work much on the 1.4.1 release-related things, lately, so it might take some time until we get there :-(
> > If no, that means I have to memcpy every outgoing frame, right?
Not if you redirect mem_malloc(). However, your platform would then need to allow the pbuf struct and payload to be in contiguous memory.
> I did find this information on the lwip wiki
> http://lwip.wikia.com/wiki/Writing_a_device_driver#Notes_on_Zero-Copy_Network_interface_drivers
>
> But I think it's a little bit dated?
Yeah, that doesn't really sum it up right. I don't think I have ever seen that article before...
Simon
> Sorry for not answering that post, I meant to, but somehow it slipped through...
I was afraid to bother you by asking too many questions ;-)
> Mason <mpeg...@free.fr> wrote:
>
>> I don't think it's possible to use mem_malloc as a packet
>> buffer allocator, as there are other types of uses, e.g.
>>
>> dhcp = (struct dhcp *)mem_malloc(sizeof(struct dhcp));
>> autoip = (struct autoip *)mem_malloc(sizeof(struct autoip));
>
> These are not used if you init autoip and dhcp correctly
> (dhcp_set_struct(), autoip_set_struct()). I know that's not a good
> solution and we should add pools for it.
This is all voodoo and black magic to me. It's very cool that you
are so nice to explain stuff when questions are asked, but it
really feels like there is a lot of high-level documentation
missing, wouldn't you agree?
>> What I'd like is to be able to "redirect" pbuf_alloc to use
>> my buffer allocator, but the pbuf struct and the payload buffer
>> would not be contiguous, which violates the implicit assumptions
>> of PBUF_RAM pbufs, IIUC.
>>
>> Have things changed on that front?
>
> No, not yet. What's missing there are two things:
> a) redirecting pbuf_alloc() to a custom function and
> b) having a PBUF_REF-like type where pbuf_header works in both directions.
>
> It shouldn't be too hard to implement, but I haven't even found the
> time to work much on the 1.4.1 release-related things, lately, so it
> might take some time until we get there :-(
I think I'll err on the safe side, and make my port do the extra
memcpy in the initial version. Then I can profile the different
use-cases, and if there is a performance problem, then I'll see
how to get rid of the copies. (Probably in 2-3 months)
>> If no, that means I have to memcpy every outgoing frame, right?
>
> Not if you redirect mem_malloc(). However, your platform would then need
> to allow the pbuf struct and payload to be in contiguous memory.
I've been meaning to talk to you about this issue.
Since I use DMA, there are hard constraints on the location of the buffer.
It must start at an address multiple of 32 (cache line aligned) and the
length must be a multiple of 32, and we must purge the appropriate lines
from cache before the DMA operation.
I suppose I could allocate 32 more bytes to hold the pbuf struct and
fudge the pointers to make the payload start on the right boundary,
but you've said that pbuf_header could move the start of payload,
or something like that... More voodoo stuff.
Like someone said on the list, I think lwip wants control over
the packet buffers, which is a problem when a driver also needs
control over the buffers.
--
Regards.
This is what I've done on 2 platforms - both use DMA and zero-copy (one uses
chained DMA and needs cache maintenance). I've done this "fudging". I
aligned the payload and padded it out to the required size.
>Like someone said on the list, I think lwip wants control over
>the packet buffers, which is a problem when a driver also needs
>control over the buffers.
It's not a problem that can't be overcome with lwIP as is. If this pbuf
driver management was done from the onset, all the better. But it wasn't
and they would risk breaking compatibility with a lot of lwIP programs if
pbuf use and the API for it is changed.
Bill
Of course I agree! It's just as always, writing the code is one thing, but writing documentation... :-)
> I think I'll err on the safe side, and make my port do the extra
> memcpy in the initial version. Then I can profile the different
> use-cases, and if there is a performance problem, then I'll see
> how to get rid of the copies. (Probably in 2-3 months)
That's probably the best idea.
> >> If no, that means I have to memcpy every outgoing frame, right?
> >
> > Not if you redirect mem_malloc(). However, your platform would then need
> > to allow the pbuf struct and payload to be in contiguous memory.
>
> I've been meaning to talk to you about this issue.
> Since I use DMA, there are hard constraints on the location of the buffer.
> It must start at an address multiple of 32 (cache line aligned) and the
> length must be a multiple of 32, and we must purge the appropriate lines
> from cache before the DMA operation.
>
> I suppose I could allocate 32 more bytes to hold the pbuf struct and
> fudge the pointers to make the payload start on the right boundary,
> but you've said that pbuf_header could move the start of payload,
> or something like that... More voodoo stuff.
And that is something I need, too: using a different alignment constraint for pbufs (16 bytes) than for the rest (4 bytes). I might have to add that to 1.4.1 so we can use a clean version in our products :-)
Simon
_______________________________________________
Hi,
I need 32-byte alignment on Blackfin for L1 cache
and even 128-byte alignment on C674x for L2 cache.
I just add this in my cc.h:
// PBUF_POOL needs to be aligned to cache line size
#ifndef __cplusplus
extern u8_t memp_memory_PBUF_POOL_base[] __attribute__ ((aligned (32)));
#endif
And in my lwipopts.h:
#define PBUF_POOL_BUFSIZE 1520
// + sizeof(pbuf) = 1536 = 48 cache lines
But I agree it could be easier with something like:
#define PBUF_ALIGNMENT 32
--
Stephane
But then struct pbuf is still only 16 bytes long, so the payload might start in the middle of a 32-byte boundary, which leads to faults when flushing the cache (since the struct pbuf members are still used cached), or am I wrong there?
> But I agree it could be easier with something like:
> #define PBUF_ALIGNMENT 32
That's what I imagined (only with another name maybe).
Simon
_______________________________________________
I don't have the problem, because:
1. I don't need to align my payload. I just need to ensure that a cache line cannot contain payload of pbuf N and struct of pbuf N+1.
I do this by setting (PBUF_POOL_BUFSIZE + sizeof(struct pbuf)) = multiple of cache line size
2. on RX, the pbuf is owned by the driver so I can safely flush the cache to get the payload, because I know the stack will not modify a value in the struct.
> > But I agree it could be easier with something like:
> > #define PBUF_ALIGNMENT 32
>
> That's what I imagined (only with another name maybe).
Ok, maybe we need 2 settings:
1. alignment of the pbuf struct
2. alignment of the pbuf payload (may lead to wasted memory)
--
Stéphane Lesage
>> But then struct pbuf is still only 16 bytes long, so payload might
>> start in the middle of a 32-byte border, which leads to faults when
>> flushing cache (since the struct pbuf members are still used cached) or
>> am I wrong there?
>
> I don't have the problem, because:
> 1. I don't need to align my payload. I just need to ensure that a cache line cannot contain payload of pbuf N and struct of pbuf N+1.
> I do this by setting (PBUF_POOL_BUFSIZE + sizeof(struct pbuf)) = multiple of cache line size
>
> 2. on RX, the pbuf is owned by the driver so I can safely flush the cache to get the payload, because I know the stack will not modify a value in the struct.
>
Correct me if I am wrong, but I think for this to work you have to invalidate a pbuf before passing it to the RX engine. When the payload starts on a 32-byte boundary, you can leave a TX pbuf invalidated and pass it to RX.
Simon
I did this like you did, Stephane (modified the #defines), and I also agree
that both of these would be useful if built-in.
Bill
I'm not sure I understand what you mean.
Yes, the driver needs to invalidate after allocating and before DMA RX.
This is a good time to publish my drivers:
http://lwip.wikia.com/wiki/Available_device_drivers#lwIP_1.4.1
Here's how they work:
1. init:
- prepare TX descriptors list
- init RX descriptors list: allocate pbuf and WriteBack+Invalidate the
payload buffer
- init interrupts, start DMA, start MAC
2. low_level_output:
- we need 1 contiguous buffer: pbuf_ref() or pbuf_copy()
- get a TX descriptor from the "free" queue
- setup descriptor and WriteBack the payload buffer
- append to "queued" descriptors
- if DMA not active, move "queued" to "active" and start DMA
3. RX interrupt:
- move received descriptors to "complete" queue
- update "active" queue: remove "complete", add "queued" descriptors and
eventually restart DMA
- tcpip_trycallback(rxmsg);
4. RX callback in the tcpip thread:
- loop to extract descriptors from "complete" queue
- update statistics
- give the packet to the stack with ethernet_input()
- try to allocate a pbuf for the newly free descriptor:
If the pbuf has been reused in the meantime -> move to "free".
If pbuf re-allocated:
- WriteBack+Invalidate the payload buffer
- move to "queued" descriptors
- if DMA not active, move "queued" to "active" and start DMA
5. TX interrupt:
- move sent descriptors to "complete" queue
- update "active" queue: remove "complete", add "queued" descriptors and
eventually restart DMA
- tcpip_trycallback(txmsg);
6. TX callback in the tcpip thread:
- loop to extract descriptors from "complete" queue
- update statistics
- free the pbuf
- move back the descriptor to "free" queue
- on exit, try to allocate pbuf for RX "free" descriptors to move to
"queued" descriptors
--
Stephane