[lwip-users] byte alignment


Tyrel Newton

May 4, 2010, 2:18:38 AM
to lwip-...@nongnu.org
For the system I'm using, when an Ethernet frame is transmitted, it has
to be copied into a 32-bit-aligned contiguous buffer within the MAC.
Ideally, then, the beginning of the Ethernet frame (i.e. the
destination address) should be aligned on a 32-bit boundary so that it
can be copied verbatim with aligned reads/writes. Now, I realize this
leaves the contained IP/TCP headers unaligned, and I've read that
setting ETH_PAD_SIZE=2 improves performance on 32-bit machines by
(presumably) making the contained IP/TCP headers 32-bit aligned
(assuming MEM_ALIGNMENT=4).

For my case, it would seem that un-aligning the contained IP/TCP
headers would be better, so that the final copy to the MAC is aligned
at the start of the Ethernet frame. I'm wondering whether this
reasoning is valid and whether it has any merit performance-wise. If
the Ethernet frame being sent is not 32-bit aligned, I basically have
to do byte reads into a temporary buffer, which is then written to the
MAC.
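
Roughly, the two copies I'm comparing look like this (just a sketch;
the MAC buffer pointer is a stand-in for my actual hardware, the byte
order within each word depends on the bus, and len is assumed to be
padded to a multiple of 4):

#include <stddef.h>
#include <stdint.h>

/* Fast path: frame start is 32-bit aligned, copy word by word. */
static void copy_frame_aligned(volatile uint32_t *mac_buf,
                               const void *frame, size_t len)
{
  const uint32_t *src = (const uint32_t *)frame; /* needs 4-byte alignment */
  size_t i;
  for (i = 0; i < len / 4; i++) {
    mac_buf[i] = src[i];
  }
}

/* Slow path: frame start is unaligned, gather each word from bytes. */
static void copy_frame_unaligned(volatile uint32_t *mac_buf,
                                 const uint8_t *frame, size_t len)
{
  size_t i;
  for (i = 0; i < len / 4; i++) {
    mac_buf[i] = ((uint32_t)frame[4 * i]     << 24) |
                 ((uint32_t)frame[4 * i + 1] << 16) |
                 ((uint32_t)frame[4 * i + 2] << 8)  |
                  (uint32_t)frame[4 * i + 3];
  }
}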

My other question is about actually getting pbuf->payload to be 32-bit
aligned at the start of the generated Ethernet frame. I would think
that setting ETH_PAD_SIZE=0 and MEM_ALIGNMENT=4 would produce 32-bit
aligned Ethernet frames in the resulting PBUF_RAM-type pbufs. However,
this doesn't seem to be the case in my testing: p->payload in
PBUF_RAM-type pbufs appears to always be 16-bit aligned with those
settings. What settings do I need to create 32-bit aligned Ethernet
frames for output? Btw, I'm using the malloc/free provided by my libc
build.

Thanks,
Tyrel



Simon Goldschmidt

May 4, 2010, 5:38:57 AM
to new...@tethers.com, Mailing list for lwIP users
What you're seeing is basically a misdesign of the Ethernet MAC or its DMA engine: you can either spend too many processor cycles copying data to/from unaligned memory, or spend too many processor cycles accessing unaligned TCP/IP headers. Which solution is better depends on your processor; you might have to test both.

I have the same problem, but I am a bit better off since I can do 16-bit DMA copies of unaligned data; still, it makes the DMA copy slower.

As to the aligned pbuf payload: I think the code currently relies on mem_malloc returning aligned data (and that should be OK with your current settings), so you might want to check the return values of your libc malloc.
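
If in doubt, a standalone check is cheap (nothing lwIP-specific here;
libc malloc almost certainly returns at least 4-byte-aligned blocks,
but it's easy to verify):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int i;
  for (i = 0; i < 8; i++) {
    void *p = malloc(1514);  /* typical maximum Ethernet frame */
    printf("malloc -> %p, addr %% 4 = %u\n",
           p, (unsigned)((uintptr_t)p & 3u));
    free(p);
  }
  return 0;
}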

Simon

Bill Auerbach

May 4, 2010, 9:19:13 AM
to Mailing list for lwIP users
>For the system I'm using, when an Ethernet frame is transmitted, it has
>to be copied into a 32-bit aligned contiguous buffer within the mac.

It might help some of us to help you if you can tell us about the system
you're using. Unless this is a custom processor, someone here has probably
used what you're using and can tell you what is best to do. If your MAC has
some facility to read in a packet with a +2 offset (it's still 32-bit
aligned; they just prefix zeros to the frame), then you can receive a frame
on a 32-bit boundary and the actual packet is at offset +2, which makes all
of the headers 32-bit aligned when ETH_PAD_SIZE is 2.
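
That is exactly the pattern the ethernetif.c skeleton driver shipped
with lwIP uses on the receive side. A sketch (read_frame_from_mac() is
a made-up helper, and this assumes the frame fits in a single pbuf):

#include "lwip/opt.h"
#include "lwip/pbuf.h"

extern void read_frame_from_mac(void *dst, u16_t len);

static struct pbuf *low_level_input_sketch(u16_t len)
{
  struct pbuf *p;

#if ETH_PAD_SIZE
  len += ETH_PAD_SIZE;              /* allow room for the padding word */
#endif

  p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
  if (p != NULL) {
#if ETH_PAD_SIZE
    pbuf_header(p, -ETH_PAD_SIZE);  /* skip the pad: the frame lands at a
                                       32-bit boundary + 2, so the IP
                                       header ends up 32-bit aligned */
#endif
    read_frame_from_mac(p->payload, p->len);
#if ETH_PAD_SIZE
    pbuf_header(p, ETH_PAD_SIZE);   /* reclaim the pad for the stack */
#endif
  }
  return p;
}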

Bill

Tyrel Newton

May 4, 2010, 12:44:37 PM
to Mailing list for lwIP users

> It might help some of us to help you if you can tell us about the system
> you're using. Unless this is a custom processor, someone here has probably
> used what you're using and can tell you what is best to do. If your MAC has
> some facility to read in a packet with a +2 offset (it's still 32-bit
> aligned; they just prefix zeros to the frame), then you can receive a frame
> on a 32-bit boundary and the actual packet is at offset +2, which makes all
> of the headers 32-bit aligned when ETH_PAD_SIZE is 2.
>

I doubt it will help that much, because I'm using the xps_ethernetlite
IP core with a MicroBlaze embedded processor on a Spartan-3A DSP. Most
people using this type of setup don't care about making it actually
work well; they seem happy with the sub-standard software provided by
Xilinx. The xps_ethernetlite core is a 100 Mbps MAC without DMA. Xilinx
also offers a gigabit MAC with DMA, which quite naturally consumes many
more FPGA resources (and costs more). The problem I'm seeing is that
the xps_ethernetlite core is unnecessarily limited by this alignment
issue.

Tyrel

Tyrel Newton

May 4, 2010, 6:26:32 PM
to Mailing list for lwIP users

> As to the aligned pbuf payload: I think the code currently relies on mem_malloc returning aligned data (and that should be OK with your current settings), so you might want to check the return values of your libc malloc.
>

As the pbuf code is written (I think I'm running the latest stable,
1.3.2), there is no way to guarantee a 32-bit aligned payload pointer at
the start of the Ethernet frame with MEM_ALIGNMENT=4. This is because in
pbuf_alloc, the payload pointer for PBUF_RAM types is initialized at an
offset that is itself memory-aligned (this offset is equal to the size
of the pbuf structure plus the various header lengths). When the 14-byte
Ethernet header is eventually uncovered, it will always be 16-bit
aligned, since the original payload pointer was 32-bit aligned. This of
course assumes PBUF_LINK_HLEN=14.
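
To make the arithmetic concrete, here is a standalone sketch of the
1.3.2 offset math as I read it (simplified, not the literal source; the
struct size is illustrative and build-dependent):

#include <stdio.h>

#define MEM_ALIGNMENT       4
#define SIZEOF_STRUCT_PBUF  16  /* illustrative */
#define LWIP_MEM_ALIGN_SIZE(size) \
  (((size) + MEM_ALIGNMENT - 1) & ~(MEM_ALIGNMENT - 1))

int main(void)
{
  unsigned offset  = 14 + 20 + 20;  /* link + IP + TCP headers = 54 */
  unsigned payload = LWIP_MEM_ALIGN_SIZE(SIZEOF_STRUCT_PBUF + offset);

  printf("payload offset     %% 4 = %u\n", payload % 4);        /* 0 */
  printf("frame-start offset %% 4 = %u\n", (payload - 14) % 4); /* 2 */
  return 0;
}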

The moral for me is that I actually see higher throughput by setting
MEM_ALIGNMENT=2, which guarantees that when the Ethernet header is
uncovered, it will be 32-bit aligned. Even though the TCP/IP headers are
unaligned, the final copy to the MAC's transmit buffer is much faster if
the source pointer is 32-bit aligned, i.e. at the start of the actual
Ethernet frame.
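
In lwipopts.h terms, the combination that measured fastest for me on
this MAC boils down to (a sketch of my settings, not general advice):

#define MEM_ALIGNMENT   2   /* payloads only 16-bit aligned, but...   */
#define ETH_PAD_SIZE    0   /* ...the frame start (payload - 14) then
                               comes out 32-bit aligned here */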

Btw, this is also assuming the outgoing data is copied into the stack
such that all the outgoing pbufs are PBUF_RAM-type.

Interesting results, but pretty esoteric since this is not an oft-used
platform (MicroBlaze w/ xps_ethernetlite IP core).

Tyrel



gold...@gmx.de

May 5, 2010, 11:53:40 AM
to Mailing list for lwIP users
Tyrel Newton wrote:
>> As to the aligned pbuf payload: I think the code currently relies on mem_malloc returning aligned data (and that should be OK with your current settings), so you might want to check the return values of your libc malloc.
>
> As the pbuf code is written (I think I'm running the latest stable,
> 1.3.2), there is no way to guarantee a 32-bit aligned payload pointer at
> the start of the Ethernet frame with MEM_ALIGNMENT=4. This is because in
> pbuf_alloc, the payload pointer for PBUF_RAM types is initialized at an
> offset that is itself memory-aligned (this offset is equal to the size
> of the pbuf structure plus the various header lengths). When the 14-byte
> Ethernet header is eventually uncovered, it will always be 16-bit
> aligned, since the original payload pointer was 32-bit aligned. This of
> course assumes PBUF_LINK_HLEN=14.
I see... I must say I hadn't checked that yet. And since my code itself requires the payload aligned (or I would have to use packed structs to access the contents), I just ended up with 16-bit DMA transfers (using an Altera NIOS-II system with a standard Altera RAM-to-RAM DMA engine). I always planned to write my own DMA engine in VHDL that can do 32-bit transfers from 16-bit aligned data, but I haven't gotten to it yet.

Anyway, if there is a requirement to let pbuf_alloc produce an unaligned payload so that the outer header is aligned, please file a bug report or patch at Savannah!

> The moral for me is that I actually see higher throughput by setting
> MEM_ALIGNMENT=2, which guarantees that when the Ethernet header is
> uncovered, it will be 32-bit aligned. Even though the TCP/IP headers are
> unaligned, the final copy to the MAC's transmit buffer is much faster if
> the source pointer is 32-bit aligned, i.e. at the start of the actual
> Ethernet frame.
The question is whether the final copy is what matters, or the rest of the processing: when the final copy is done in the background by a DMA engine, the misalignment might not even be harmful. While it is true that the transfer takes longer, it only has to finish before the previous frame is done being sent; at 100 Mbit/s, for example, a full-size 1518-byte frame occupies the wire for roughly 120 microseconds, which is the whole budget. The only difference then is how long the DMA transfer generates a background load on the RAM bus, and whether it uses too much RAM bandwidth for the processor to work normally.

However, if the processor does the final copy (without a DMA engine), then unaligned data is indeed a bad thing. But you should be able to include a DMA engine in your FPGA, so...

> Btw, this is also assuming the outgoing data is copied into the stack
> such that all the outgoing pbufs are PBUF_RAM-type.
Single PBUF_RAM pbufs or chained pbufs?

> Interesting results, but pretty esoteric since this is not an oft-used
> platform (MicroBlaze w/ xps_ethernetlite IP core).
Not that different from my own platform ;-) And after all, we need examples for task #7896 (Support zero-copy drivers), and this is one example to start with.

Simon

Tyrel Newton

May 5, 2010, 6:01:04 PM
to Mailing list for lwIP users


On 5/5/2010 8:53 AM, gold...@gmx.de wrote:
> Tyrel Newton wrote:
>>> As to the aligned pbuf payload: I think the code currently relies on mem_malloc returning aligned data (and that should be OK with your current settings), so you might want to check the return values of your libc malloc.
>>
>> As the pbuf code is written (I think I'm running the latest stable,
>> 1.3.2), there is no way to guarantee a 32-bit aligned payload pointer at
>> the start of the Ethernet frame with MEM_ALIGNMENT=4. This is because in
>> pbuf_alloc, the payload pointer for PBUF_RAM types is initialized at an
>> offset that is itself memory-aligned (this offset is equal to the size
>> of the pbuf structure plus the various header lengths). When the 14-byte
>> Ethernet header is eventually uncovered, it will always be 16-bit
>> aligned, since the original payload pointer was 32-bit aligned. This of
>> course assumes PBUF_LINK_HLEN=14.
>
> I see... I must say I hadn't checked that yet. And since my code itself requires the payload aligned (or I would have to use packed structs to access the contents), I just ended up with 16-bit DMA transfers (using an Altera NIOS-II system with a standard Altera RAM-to-RAM DMA engine). I always planned to write my own DMA engine in VHDL that can do 32-bit transfers from 16-bit aligned data, but I haven't gotten to it yet.
>
> Anyway, if there is a requirement to let pbuf_alloc produce an unaligned payload so that the outer header is aligned, please file a bug report or patch at Savannah!
I thought of that, but it depends on "what" you want aligned when the pbuf is created--the Ethernet frame itself or the actual payload within the TCP frame. Suppose I filled the TCP frame with lots of 32-bit aligned data from mainline software but then used a 16-bit aligned DMA to move the frame to the MAC (or a zero-copy MAC that can access individual bytes from memory).

>> The moral for me is that I actually see higher throughput by setting
>> MEM_ALIGNMENT=2, which guarantees that when the Ethernet header is
>> uncovered, it will be 32-bit aligned. Even though the TCP/IP headers are
>> unaligned, the final copy to the MAC's transmit buffer is much faster if
>> the source pointer is 32-bit aligned, i.e. at the start of the actual
>> Ethernet frame.
>
> The question is whether the final copy is what matters, or the rest of the processing: when the final copy is done in the background by a DMA engine, the misalignment might not even be harmful. While it is true that the transfer takes longer, it only has to finish before the previous frame is done being sent. The only difference then is how long the DMA transfer generates a background load on the RAM bus, and whether it uses too much RAM bandwidth for the processor to work normally.
>
> However, if the processor does the final copy (without a DMA engine), then unaligned data is indeed a bad thing. But you should be able to include a DMA engine in your FPGA, so...
Xilinx provides a gigabit mac with a built-in DMA (at an additional cost of course), so I definitely have options. I could also definitely write my own DMA, or for that matter, my own non-DMA Ethernet mac that simply accepts and discards a two-byte pad. But all of that is outside the scope (and priority) of my current effort. At the moment, I'm not terribly concerned about Ethernet performance as long as it works and isn't horrendously slow. My investigations into this issue came from re-writing the horrible lwIP driver provided by Xilinx. By re-writing the code in a reasonably intelligent manner, I managed to increase the throughput 4x along with making the system more stable. C-code is easier to change than VHDL . . .

>> Btw, this is also assuming the outgoing data is copied into the stack
>> such that all the outgoing pbufs are PBUF_RAM-type.
>
> Single PBUF_RAM pbufs or chained pbufs?
Single PBUF_RAM pbufs. Looking through the TCP code, if the data is being copied into the stack (i.e. via NETCONN_COPY), I'm not even sure how chained pbufs would be created (assuming malloc returns a block big enough for an Ethernet frame).

>> Interesting results, but pretty esoteric since this is not an oft-used
>> platform (MicroBlaze w/ xps_ethernetlite IP core).
>
> Not that different from my own platform ;-) And after all, we need examples for task #7896 (Support zero-copy drivers), and this is one example to start with.
I wouldn't say the system I'm using (at the moment at least) is zero-copy, because once I receive the frame from lwIP, I pbuf_ref it, queue it up for transmit, and then eventually copy its payload to the MAC's transmit buffer, after which I do a pbuf_free. Although I guess this is still zero-copy from the stack's frame of reference . . . it's probably worth distinguishing somewhere between zero-copy MACs and zero-copy drivers.
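
The flow is roughly this (a sketch; the tx_queue_* and mac_* helpers
are hypothetical glue, not lwIP or Xilinx API):

#include "lwip/netif.h"
#include "lwip/pbuf.h"

extern void tx_queue_put(struct pbuf *p);
extern struct pbuf *tx_queue_get(void);
extern void mac_copy_words(u16_t offset, const void *src, u16_t len);
extern void mac_start_transmit(u16_t frame_len);

static err_t low_level_output(struct netif *netif, struct pbuf *p)
{
  (void)netif;
  pbuf_ref(p);      /* keep the chain alive after the stack returns */
  tx_queue_put(p);  /* defer the copy until the MAC buffer is free  */
  return ERR_OK;
}

/* Called later, e.g. when the previous transmit completes: */
static void tx_send_next(void)
{
  struct pbuf *p = tx_queue_get();
  struct pbuf *q;
  u16_t pos = 0;

  for (q = p; q != NULL; q = q->next) {
    mac_copy_words(pos, q->payload, q->len);  /* the copy discussed above */
    pos += q->len;
  }
  mac_start_transmit(pos);
  pbuf_free(p);     /* drop the reference taken in low_level_output() */
}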

Tyrel



gold...@gmx.de

May 6, 2010, 12:41:20 AM
to Mailing list for lwIP users
Tyrel Newton wrote:
>
>> However, if the processor does the final copy (without a DMA
>> engine), then unaligned data is indeed a bad thing. But you
>> should be able to include a DMA engine in your FPGA, so...
> Xilinx provides a gigabit mac with a built-in DMA (at an additional
> cost of course), so I definitely have options. I could also definitely
> write my own DMA, or for that matter, my own non-DMA Ethernet mac that
> simply accepts and discards a two-byte pad. But all of that is outside
> the scope (and priority) of my current effort. At the moment, I'm not
> terribly concerned about Ethernet performance as long as it works and
> isn't horrendously slow. My investigations into this issue came from
> re-writing the horrible lwIP driver provided by Xilinx. By re-writing
> the code in a reasonably intelligent manner, I managed to increase the
> throughput 4x along with making the system more stable. C-code is
> easier to change than VHDL . . .
I meant just including a standard RAM-to-RAM DMA controller (at least
Altera provides something like that for free) and letting it copy from
your real RAM to the MAC's transmit-buffer RAM. For me, it was only
about an hour's work to add the DMA controller and recompile the FPGA;
the code to use it is quite simple, and it is a lot faster than a
processor memcpy.
>> Single PBUF_RAM pbufs or chained pbufs?
> Single PBUF_RAM pbufs. Looking through the TCP code, if the data is
> being copied into the stack (i.e. via NETCONN_COPY), I'm not even sure
> how chained pbufs would be created.
Not for the netconn layer, no.
> I wouldn't say the system I'm using (at the moment at least) is
> zero-copy because once I receive the frame from lwIP, I pbuf_ref it,
> queue it up for transmit, and then eventually copy its payload to the
> mac's transmit buffer, after which I do a pbuf_free. Although I guess
> this is still zero-copy from the stack's frame of reference . . . it's
> probably worth distinguishing somewhere between zero-copy macs and
> zero-copy drivers.
It is zero-copy, but without delayed transmit, and therefore a bit out
of the scope of that task. It would, however, be non-zero-copy (and
within the scope of that task) if you were to first copy the data to an
aligned buffer and then copy that buffer to the MAC.

Simon


