icmp echo to a host with smaller mtu on same link

rahul....@gmail.com

Jun 12, 2013, 6:19:23 AM

Guys,

I have two linux machines (A and B) connected to the same link (ethernet). The MTU supported is 1500.

If I change the MTU of one of the hosts (say B) to 1280, then can I send an ICMP echo packet from node A (of max size as per its mtu which is 1500) to node B ?

Will host B be able to receive a packet whose size is greater than its MTU ?

Is there a way host A can discover the decreased MTU of B and then can send an ICMP packet of lower size ?


thanks,
rahul

Jorgen Grahn

Jun 12, 2013, 8:53:31 AM

On Wed, 2013-06-12, rahul....@gmail.com wrote:
> Guys,
>
> I have two linux machines (A and B) connected to the same link
> (ethernet). The MTU supported is 1500.

> If I change the MTU of one of the hosts (say B) to 1280, then can I send

My take on this is: you cannot really change the MTU on an interface
-- you can only tell the interface the actual MTU on the link.
If you lie to it, anything can happen. Or, rather, I assume anything
can happen.

So I wouldn't set the MTU to X unless I could be sure all other hosts
were set to X too, and that X is in fact supported by the medium.

Are you doing this out of curiosity, or are you trying to solve a
problem?

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Barry Margolin

Jun 12, 2013, 10:08:26 AM

In article <slrnkrgrq9.3...@frailea.sa.invalid>,
Jorgen Grahn <grahn...@snipabacken.se> wrote:

> On Wed, 2013-06-12, rahul....@gmail.com wrote:
> > Guys,
> >
> > I have two linux machines (A and B) connected to the same link
> > (ethernet). The MTU supported is 1500.
>
> > If I change the MTU of one of the hosts (say B) to 1280, then can I send
>
> My take on this is: you cannot really change the MTU on an interface
> -- you can only tell the interface the actual MTU on the link.
> If you lie to it, anything can happen. Or, rather, I assume anything
> can happen.

I'd expect B to receive the packet successfully, but it might not be
able to reply because the reply doesn't fit in the MTU.

--
Barry Margolin
Arlington, MA

rahul....@gmail.com

Jun 12, 2013, 10:23:39 AM

> > I have two linux machines (A and B) connected to the same link
> > (ethernet). The MTU supported is 1500.
>
> > If I change the MTU of one of the hosts (say B) to 1280, then can I send
>
> My take on this is: you cannot really change the MTU on an interface
> -- you can only tell the interface the actual MTU on the link.
> If you lie to it, anything can happen. Or, rather, I assume anything
> can happen.
>
> So I wouldn't set the MTU to X unless I could be sure all other hosts
> were set to X too, and that X is in fact supported by the medium.
>
> Are you doing this out of curiosity, or are you trying to solve a
> problem?

Thanks for your reply. I am just doing it out of curiosity. I am trying to understand how path mtu discovery works.

Can we have two hosts with different mtu connected to the same link ? Ethernet supports both 1500 and jumbo frames (9000). So, can we have one host with mtu = 1500 and the other with mtu = 9000 on the same link?
If yes, what would be the pmtu for the path between these nodes ?
Can node A find the mtu of node B on the same link ?

thanks for any help in advance.

Rick Jones

Jun 12, 2013, 12:57:40 PM

rahul....@gmail.com wrote:
> Thanks for your reply. I am just doing it out of curiosity. I am
> trying to understand how path mtu discovery works.

> Can we have two hosts with different mtu connected to the same link
> ? Ethernet supports both 1500 and jumbo frames (9000). So, can we
> have one host with mtu = 1500 and the other with mtu = 9000 on the same
> link?

> If yes, what would be the pmtu for the path between these nodes ?
> Can node A find the mtu of node B on the same link ?

The starting point is this, in pseudo IEEEspeak (which I've probably
botched) with a smattering of IETF style:

All stations in the same broadcast domain MUST have the same MTU.

Ethernet has no way to communicate frame size between peers. If a
frame larger than the station is prepared to receive arrives, that
frame will be dropped.

Path MTU is up at the IP layer and uses ICMP messages to communicate
MTU between hosts. The PathMTU logic is run when a router looks to
forward the IP datagram - when it goes to send it. That means it must
have received it in the first place. For IP to "receive" the datagram
it must first be received at the layer below it - in this case
Ethernet.

However, if the interface on which it was going to receive the
datagram has an MTU/framesize smaller than the size of the datagram
sent, the datagram won't be "received" by the IP layer so it cannot be
resent, so the Path MTU logic cannot trigger.

Thus the reason why all stations (hosts, systems, what you will) in
the broadcast domain (everything joined at layer 2 eg ethernet) MUST
have the same MTU.

Now, if you have a router (a device making forwarding decisions at
layer three - eg IP) it will have a foot in two different broadcast
domains. So long as its feet are the correct size for each broadcast
domain, you can have different MTUs on each side of the router. Then,
PathMTU discovery will be able to do its thing. But when all there
are between the two hosts are switches (a device making forwarding
decisions at layer 2 - eg Ethernet) there is no way for PathMTU
discovery to become involved in the first place.
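A minimal sketch of the router-side decision described above, assuming IPv4 without options. `Datagram` and `forward` are hypothetical names for illustration, not any real stack's API. The sketch shows that the "Fragmentation Needed" reply (ICMP Type 3, Code 4, which under RFC 1191 carries the next-hop MTU) can only be produced for a datagram the router has already received.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Datagram:              # hypothetical type, illustration only
    total_len: int           # IPv4 total length (header + payload), bytes
    dont_fragment: bool      # the DF bit from the IP header


def forward(dgram: Datagram, next_hop_mtu: int) -> Tuple[str, Optional[int]]:
    """Hypothetical helper: what a router does with a datagram it has ALREADY received.

    Returns ("send", None) if the datagram fits the next hop,
    ("fragment", None) if it is too big but DF is clear, and
    ("icmp_frag_needed", next_hop_mtu) if it is too big and DF is set.
    RFC 1191 has the router report the next-hop MTU in that ICMP
    Destination Unreachable, Fragmentation Needed (Type 3, Code 4) message.
    """
    if dgram.total_len <= next_hop_mtu:
        return ("send", None)
    if not dgram.dont_fragment:
        return ("fragment", None)
    return ("icmp_frag_needed", next_hop_mtu)


# A 1500-byte DF datagram heading onto a 1280-byte link:
print(forward(Datagram(1500, True), 1280))  # ('icmp_frag_needed', 1280)
```

On a single switched Ethernet, no router runs this logic, so no ICMP message is ever generated.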

rick jones
--
oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...

rahul....@gmail.com

Jun 12, 2013, 1:38:10 PM

Thanks a lot Rick for such an elaborate reply. This is really useful.

Jorgen Grahn

Jun 12, 2013, 3:14:01 PM

On Wed, 2013-06-12, rahul....@gmail.com wrote:
...
>> Are you doing this out of curiosity, or are you trying to solve a
>> problem?
>>
> Thanks for your reply. I am just doing it out of curiosity. I am
> trying to understand how path mtu discovery works.

Ah, but that is more than curiosity! Learning about PMTU discovery is
a good thing.

I wish I could suggest a better way to do this ...
(Perhaps the others who responded did; I haven't read those responses
carefully.)

When Stevens wrote his famous "TCP/IP Illustrated" books he had access
to several networks and a SLIP link, and he used this in his examples
(SLIP has a much smaller MTU than Ethernet).

Maybe some kind of tunneling protocol could help you. Tunneling
usually (always?) creates a new link with a lower MTU.
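To put numbers on the tunneling idea, here is a rough sketch of the arithmetic. It assumes an IPv4 outer header without options and a basic 4-byte GRE header (no key, checksum, or sequence number). The dictionary and function names are illustrative only.

```python
BASE_MTU = 1500  # underlying Ethernet link

# Per-packet encapsulation overhead (assumptions: outer IPv4 header without
# options; GRE with no optional key/checksum/sequence fields).
OVERHEAD = {
    "ipip": 20,      # outer IPv4 header only
    "gre": 20 + 4,   # outer IPv4 header + basic GRE header
}


def tunnel_mtu(kind: str, base_mtu: int = BASE_MTU) -> int:
    """Hypothetical helper: largest inner IP packet that fits in one outer packet."""
    return base_mtu - OVERHEAD[kind]


print(tunnel_mtu("ipip"))  # 1480
print(tunnel_mtu("gre"))   # 1476
```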

Or maybe you can use iptables to drop packets larger than size N and
generate an ICMP "too big" response? If you can get that to work, you
can also change N over time, and see if PMTU discovery discovers the
change.

But I think you need more than one Ethernet to do this well. PMTU
discovery is about detecting things which happen more than one hop
away from the source. Maybe you can use virtual machines and simulated
networks.

Moe Trin

Jun 12, 2013, 3:23:54 PM

On Wed, 12 Jun 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in article
<barmar-046B4B....@news.eternal-september.org>, Barry Margolin wrote:

>Jorgen Grahn <grahn...@snipabacken.se> wrote:

>> My take on this is: you cannot really change the MTU on an interface
>> -- you can only tell the interface the actual MTU on the link.

[fermi ~]$ uname -o
GNU/Linux
[fermi ~]$ whatis ip ifconfig
ip (7) - Linux IPv4 protocol implementation
ip (8) - show / manipulate routing, devices, policy routing
and tunnels
ifconfig (8) - configure a network interface
[fermi ~]$

It's going to depend on the O/S - or more correctly, the network
stack, but go for it. Most IPv4 stacks allow MTU to be set as easily
as network mask or IP address. If you only want to transmit packets up
to size $FOO - that's your business as long as it's within the range of
the minimum/maximum RECEIVING size of the other end of the link. Note
that I'm only referring to IPv4 here. On 'fermi', as root I used
'ifconfig eth0 mtu 1280' to set the MTU on the Ethernet interface to
1280 - look at the tcpdump below.

>I'd expect B to receive the packet successfully, but it might not be
>able to reply because the reply doesn't fit in the MTU.

Try it. BTW, know what a "-s 8000" does in ping? Pay attention to
the 'length' variables reported in the tcpdump below:

[jade ~]$ ping -c 2 -s 8000 fermi
PING fermi.phx.az.us (192.168.1.11) 8000(8028) bytes of data.
8008 bytes from fermi.phx.az.us (192.168.1.11): icmp_seq=1 ttl=64 time=1.76 ms
8008 bytes from fermi.phx.az.us (192.168.1.11): icmp_seq=2 ttl=64 time=1.72 ms

--- fermi.phx.az.us ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1003ms
rtt min/avg/max/mdev = 1.719/1.735/1.762/0.038 ms
[jade ~]$ /sbin/ifconfig eth0 | grep MTU
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
[jade ~]$

[fermi ~]# ifconfig eth0 | grep MTU
UP BROADCAST RUNNING MULTICAST MTU:1280 Metric:1
[fermi ~]# tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
12:08:12.516627 IP jade.phx.az.us > fermi.phx.az.us: ICMP echo request, id 2018, seq 1, length 1480
12:08:12.516720 IP jade.phx.az.us > fermi.phx.az.us: icmp
12:08:12.516851 IP jade.phx.az.us > fermi.phx.az.us: icmp
12:08:12.516966 IP jade.phx.az.us > fermi.phx.az.us: icmp
12:08:12.517089 IP jade.phx.az.us > fermi.phx.az.us: icmp
12:08:12.517142 IP jade.phx.az.us > fermi.phx.az.us: icmp
12:08:12.517194 IP fermi.phx.az.us > jade.phx.az.us: ICMP echo reply, id 2018, seq 1, length 1256
12:08:12.517197 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:12.517199 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:12.517201 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:12.517203 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:12.517204 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:12.517206 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:13.517767 IP jade.phx.az.us > fermi.phx.az.us: ICMP echo request, id 2018, seq 2, length 1480
12:08:13.517890 IP jade.phx.az.us > fermi.phx.az.us: icmp
12:08:13.518015 IP jade.phx.az.us > fermi.phx.az.us: icmp
12:08:13.518138 IP jade.phx.az.us > fermi.phx.az.us: icmp
12:08:13.518261 IP jade.phx.az.us > fermi.phx.az.us: icmp
12:08:13.518315 IP jade.phx.az.us > fermi.phx.az.us: icmp
12:08:13.518343 IP fermi.phx.az.us > jade.phx.az.us: ICMP echo reply, id 2018, seq 2, length 1256
12:08:13.518345 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:13.518347 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:13.518349 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:13.518350 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:13.518352 IP fermi.phx.az.us > jade.phx.az.us: icmp
12:08:13.518353 IP fermi.phx.az.us > jade.phx.az.us: icmp
^C
28 packets captured
28 packets received by filter
0 packets dropped by kernel
[fermi ~]#
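The fragment trains in the tcpdump above can be reproduced with a little arithmetic. This is a sketch that assumes 20-byte IPv4 headers without options. ping's -s 8000 plus the 8-byte ICMP header gives 8008 bytes of IP payload, which accounts for ping's "8000(8028)" and "8008 bytes from". Every fragment except the last must carry a multiple of 8 bytes. The helper name is hypothetical.

```python
from typing import List

IP_HEADER = 20    # IPv4 header, no options assumed
ICMP_HEADER = 8   # ICMP echo request/reply header


def fragment_payload_sizes(ip_payload: int, mtu: int) -> List[int]:
    """Hypothetical helper: per-fragment IP payload sizes for a given MTU.

    All fragments but the last must carry a multiple of 8 bytes because
    the fragment offset field counts in 8-byte units.
    """
    per_frag = (mtu - IP_HEADER) // 8 * 8
    sizes = []
    remaining = ip_payload
    while remaining > per_frag:
        sizes.append(per_frag)
        remaining -= per_frag
    sizes.append(remaining)
    return sizes


icmp_len = 8000 + ICMP_HEADER                  # ping -s 8000
print(icmp_len + IP_HEADER)                    # 8028
print(fragment_payload_sizes(icmp_len, 1500))  # [1480, 1480, 1480, 1480, 1480, 608]
print(fragment_payload_sizes(icmp_len, 1280))  # [1256, 1256, 1256, 1256, 1256, 1256, 472]
```

That gives 6 fragments from jade (MTU 1500, first reported as length 1480) and 7 from fermi (MTU 1280, first reported as length 1256). Both counts match the request and reply trains in the capture. The same arithmetic gives 44 fragments for the 65000-byte ping further down.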

And this works equally well in the opposite direction with the same
set-up. In the early 1990s, the standard test we ran after connecting
a new box was to ping the snot out of a victim down the hall

ping -c25 -s8000 victim

and we expected at least 24 replies out of 25. This was being done
on Sun SPARCstations and various Macs using classic thick-net, but
some hosts were using 10BaseT. The only problem we ran into was
some crappy network stack running on PCs which would crash on what it
deemed "oversized" pings (different from the classic "Ping of Death").
For them, the test was 'ping -c150 -s1475 victim' and 144 out of 150.

Old guy

Moe Trin

Jun 12, 2013, 3:25:26 PM

On Wed, 12 Jun 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in article
<d44c0899-c908-4c21...@googlegroups.com>, rahul....@gmail.com
wrote:

NOTE: Posting from groups.google.com (or some web-forums) dramatically
reduces the chance of your post being seen. Find a real news server.

>I have two linux machines (A and B) connected to the same link
>(ethernet). The MTU supported is 1500.

>If I change the MTU of one of the hosts (say B) to 1280, then can I
>send an ICMP echo packet from node A (of max size as per its mtu
>which is 1500) to node B ?

Why limit it to 1500? Look up "fragmentation" in RFC1122, and then
look at the man page for 'ping' to see what this command is doing:

[fermi ~]$ ping -c 2 -s 65000 jade
PING jade.phx.az.us (192.168.101.11) 65000(65028) bytes of data.
65008 bytes from jade.phx.az.us (192.168.101.11): icmp_seq=1 ttl=64 time=11.3 ms
65008 bytes from jade.phx.az.us (192.168.101.11): icmp_seq=2 ttl=64 time=11.3 ms

--- jade.phx.az.us ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1013ms
rtt min/avg/max/mdev = 11.310/11.328/11.347/0.108 ms
[fermi ~]$

>Will host B be able to receive a packet whose size is greater
>than its MTU ?

MTU is the maximum size packet your interface will TRANSMIT. It is
not the maximum size it will RECEIVE. See my response to Barry in
this thread and look at the tcpdump data.

Is there any particular reason you didn't bother to TRY IT YOURSELF?

>Is there a way host A can discover the decreased MTU of B and then
>can send an ICMP packet of lower size ?

You mean something like RFC1191, RFC1435 and RFC1981? Try reading
RFC2923 for concepts/problems/hints. You may also want to read
RFC1122 for some fundamentals.

Old guy

Moe Trin

Jun 12, 2013, 3:47:30 PM

On Wed, 12 Jun 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in article
<kpa9a4$ddl$1...@usenet01.boi.hp.com>, Rick Jones wrote:

>The starting point is this, in pseudo IEEEspeak (which I've probably
>botched) with a smattering of IETF style:

> All stations in the same broadcast domain MUST have the same MTU.

Wrong

>Ethernet has no way to communicate frame size between peers. If a
>frame larger than the station is prepared to receive arrives, that
>frame will be dropped.

Nope - if you want to transmit packets no larger than 500 octets in a
link where everyone else is doing 1500, that's your business. See
RFC1122 and read about fragmentation and reassembly. Then look at
my reply to Barry - specifically the output of 'ifconfig' and the
tcpdump. One system with MTU 1280, one with 1500 and no problems.

>However, if the interface on which it was going to receive the
>datagram has an MTU/framesize smaller than the size of the datagram
>sent, the datagram won't be "received" by the IP layer so it cannot be
>resent, so the Path MTU logic cannot trigger.

>Thus the reason why all stations (hosts, systems, what you will) in
>the broadcast domain (everything joined at layer 2 eg ethernet) MUST
>have the same MTU.

Rick, what are you smoking? Can I get some? ;-)

Old guy

Jorgen Grahn

Jun 12, 2013, 4:05:46 PM

On Wed, 2013-06-12, Moe Trin wrote:
> On Wed, 12 Jun 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in article
> <barmar-046B4B....@news.eternal-september.org>, Barry Margolin wrote:
>
>>Jorgen Grahn <grahn...@snipabacken.se> wrote:
>
>>> My take on this is: you cannot really change the MTU on an interface
>>> -- you can only tell the interface the actual MTU on the link.
>
> [fermi ~]$ uname -o
> GNU/Linux
> [fermi ~]$ whatis ip ifconfig
> ip (7) - Linux IPv4 protocol implementation
> ip (8) - show / manipulate routing, devices, policy routing
> and tunnels
> ifconfig (8) - configure a network interface
> [fermi ~]$
>
> It's going to depend on the O/S - or more correctly, the network
> stack, but go for it. Most IPv4 stacks allow MTU to be set as easily
> as network mask or IP address.

Perhaps I was unclear -- I never claimed there were no tools for
"changing" the MTU, just that I don't expect meaningful results when
the endpoints on an Ethernet disagree about it.

When I've done this in Linux (to create and capture a TCP stream with
small segments, so I later could replay it over a link with actual
lower MTU) things stopped working in unexpected ways. Lots of trouble
and wasted time.

(Later I discovered the route(8) 'mss' option -- a much better fit for
my problem.)
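The MSS approach caps the TCP payload per segment and leaves the link MTU alone. That is why it avoids the breakage of a mismatched MTU. A sketch of the usual IPv4 arithmetic follows, assuming neither IP nor TCP options. The function name is hypothetical.

```python
IPV4_HEADER = 20  # assumed: no IP options
TCP_HEADER = 20   # assumed: no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Hypothetical helper: largest TCP payload that fits in one IPv4 packet of `mtu` bytes."""
    return mtu - IPV4_HEADER - TCP_HEADER


print(mss_for_mtu(1500))  # 1460
print(mss_for_mtu(1280))  # 1240
```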

Rick Jones

Jun 12, 2013, 7:09:13 PM

Moe Trin <ibup...@painkiller.example.tld.invalid> wrote:
> On Wed, 12 Jun 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in article
> <kpa9a4$ddl$1...@usenet01.boi.hp.com>, Rick Jones wrote:

> >The starting point is this, in pseudo IEEEspeak (which I've probably
> >botched) with a smattering of IETF style:

> > All stations in the same broadcast domain MUST have the same MTU.

> Wrong

Are we arguing about my conflating MTU and frame size together?

> >Ethernet has no way to communicate frame size between peers. If a
> >frame larger than the station is prepared to receive arrives, that
> >frame will be dropped.

> Nope - if you want to transmit packets no larger than 500 octets in a
> link where everyone else is doing 1500, that's your business. See
> RFC1122 and read about fragmentation and reassembly. Then look at
> my reply to Barry - specifically the output of 'ifconfig' and the
> tcpdump. One system with MTU 1280, one with 1500 and no problems.

> >However, if the interface on which it was going to receive the
> >datagram has an MTU/framesize smaller than the size of the datagram
> >sent, the datagram won't be "received" by the IP layer so it cannot be
> >resent, so the Path MTU logic cannot trigger.

> >Thus the reason why all stations (hosts, systems, what you will) in
> >the broadcast domain (everything joined at layer 2 eg ethernet) MUST
> >have the same MTU.

> Rick, what are you smoking? Can I get some? ;-)

JumboFrames I suspect.

Starting point - two systems, back-to-back cable 1500 byte MTU at both
ends:

raj@raj-8510w:~$ ifconfig eth0
eth0 Link encap:Ethernet HWaddr 00:1f:29:7a:d1:32
inet addr:192.168.1.3 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::21f:29ff:fe7a:d132/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:16256551 errors:0 dropped:5683 overruns:0 frame:0
TX packets:16251404 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1155190603 (1.1 GB) TX bytes:1154104068 (1.1 GB)
Interrupt:22 Memory:e8000000-e8020000

raj@tardy:~$ ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:1c:c4:47:d3:f9
inet addr:192.168.1.4 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::21c:c4ff:fe47:d3f9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:35423059 errors:0 dropped:0 overruns:0 frame:0
TX packets:35449304 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2518538585 (2.5 GB) TX bytes:2522670698 (2.5 GB)
Interrupt:38 Memory:f4000000-f4020000

Perform your 8000 byte ping test:

raj@tardy:~$ ping -c 2 -s 8000 192.168.1.3
PING 192.168.1.3 (192.168.1.3) 8000(8028) bytes of data.
8008 bytes from 192.168.1.3: icmp_req=1 ttl=64 time=0.279 ms
8008 bytes from 192.168.1.3: icmp_req=2 ttl=64 time=0.236 ms

--- 192.168.1.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.236/0.257/0.279/0.026 ms

And the tcpdump taken on the sender:

raj@tardy:~$ sudo tcpdump -i eth1 ip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
15:47:27.120359 IP tardy.local > the-laptop: ICMP echo request, id 10123, seq 1, length 1480
15:47:27.120370 IP tardy.local > the-laptop: icmp
15:47:27.120373 IP tardy.local > the-laptop: icmp
15:47:27.120374 IP tardy.local > the-laptop: icmp
15:47:27.120376 IP tardy.local > the-laptop: icmp
15:47:27.120377 IP tardy.local > the-laptop: icmp
15:47:27.120595 IP the-laptop > tardy.local: ICMP echo reply, id 10123, seq 1, length 1480
15:47:27.120605 IP the-laptop > tardy.local: icmp
15:47:27.120607 IP the-laptop > tardy.local: icmp
15:47:27.120610 IP the-laptop > tardy.local: icmp
15:47:27.120612 IP the-laptop > tardy.local: icmp
15:47:27.120614 IP the-laptop > tardy.local: icmp
15:47:28.120505 IP tardy.local > the-laptop: ICMP echo request, id 10123, seq 2, length 1480
15:47:28.120519 IP tardy.local > the-laptop: icmp
15:47:28.120521 IP tardy.local > the-laptop: icmp
15:47:28.120523 IP tardy.local > the-laptop: icmp
15:47:28.120525 IP tardy.local > the-laptop: icmp
15:47:28.120526 IP tardy.local > the-laptop: icmp
15:47:28.120687 IP the-laptop > tardy.local: ICMP echo reply, id 10123, seq 2, length 1480
15:47:28.120699 IP the-laptop > tardy.local: icmp
15:47:28.120701 IP the-laptop > tardy.local: icmp
15:47:28.120712 IP the-laptop > tardy.local: icmp
15:47:28.120714 IP the-laptop > tardy.local: icmp
15:47:28.120716 IP the-laptop > tardy.local: icmp
^C
24 packets captured
24 packets received by filter

Now up the MTU on the sender to 9000 bytes:
[sudo] password for raj:
raj@tardy:~$ ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:1c:c4:47:d3:f9
inet6 addr: fe80::21c:c4ff:fe47:d3f9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
RX packets:35423102 errors:0 dropped:0 overruns:0 frame:0
TX packets:35449362 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2518588525 (2.5 GB) TX bytes:2522723299 (2.5 GB)
Interrupt:38 Memory:f4000000-f4020000

Run the ping again (yes, I reapplied the IP address to eth1 :-)

raj@tardy:~$ ping -c 2 -s 8000 192.168.1.3
PING 192.168.1.3 (192.168.1.3) 8000(8028) bytes of data.

--- 192.168.1.3 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1006ms

And the tcpdump trace, again from the sender:

raj@tardy:~$ sudo tcpdump -i eth1 ip
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
15:50:25.675807 IP tardy.local > the-laptop: ICMP echo request, id 10163, seq 1, length 8008
15:50:26.682668 IP tardy.local > the-laptop: ICMP echo request, id 10163, seq 2, length 8008

As it happens, ethtool -S eth0 on the destination shows
"rx_long_length_errors: 2" and that incremented to 4 when I ran the
ping test again.
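The drop is consistent with simple frame-size arithmetic. Assume an untagged Ethernet II frame, with preamble and minimum-size padding ignored. The unfragmented 8028-byte IP packet becomes an 8046-byte frame, far over the classic 1518-byte limit for a 1500-byte MTU. The receiver check below is a hypothetical model: real NICs and drivers differ, and the counter name (rx_long_length_errors here) is driver-specific.

```python
ETH_HEADER = 14  # destination MAC + source MAC + EtherType; untagged frame assumed
ETH_FCS = 4      # frame check sequence (CRC)


def frame_len(ip_total_len: int) -> int:
    """Ethernet II frame length carrying one IP packet (preamble and padding ignored)."""
    return ETH_HEADER + ip_total_len + ETH_FCS


def receiver_accepts(ip_total_len: int, max_frame: int = 1518) -> bool:
    """Hypothetical NIC check: frames longer than max_frame are dropped.

    1518 is the classic untagged maximum for a 1500-byte MTU; a NIC set up
    for jumbo frames would be configured with a larger limit.
    """
    return frame_len(ip_total_len) <= max_frame


print(frame_len(8028), receiver_accepts(8028))  # 8046 False
print(frame_len(1500), receiver_accepts(1500))  # 1518 True
```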

Now, perhaps in your situation when you shrank the MTU to < 1500
bytes, the Ethernet controller and NIC were still willing to take in a
larger frame, and IP was willing to accept it. In my setup, I have reset my sender to
a 1500 byte MTU, and then set the MTU to 1280 on my receiver. In that
case the 8000 byte ping still worked, but I am not sure it would be
good to rely on that. I could see, and have some vague recollections
about, systems whereon if one dropped the MTU/frame size below 1500
bytes, it would not accept packets above the new MTU and still below
1500.

rick

Barry Margolin

Jun 12, 2013, 8:01:30 PM

In article <slrnkrhj25.s...@fermi.phx.az.us>,
Moe Trin <ibup...@painkiller.example.tld.invalid> wrote:

> On Wed, 12 Jun 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in
> article
> <d44c0899-c908-4c21...@googlegroups.com>,
> rahul....@gmail.com
> wrote:
>
> NOTE: Posting from groups.google.com (or some web-forums) dramatically
> reduces the chance of your post being seen. Find a real news server.
>
> >I have two linux machines (A and B) connected to the same link
> >(ethernet). The MTU supported is 1500.
>
> >If I change the MTU of one of the hosts (say B) to 1280, then can I
> >send an ICMP echo packet from node A (of max size as per its mtu
> >which is 1500) to node B ?
>
> Why limit it to 1500? Look up "fragmentation" in RFC1122, and then
> look at the man page for 'ping' to see what this command is doing:

Silly me, when I answered earlier I had somehow gotten it into my head
that ICMP couldn't be fragmented.

Moe Trin

Jun 12, 2013, 10:21:20 PM

On Wed, 12 Jun 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in article
<kpav2o$kcs$1...@usenet01.boi.hp.com>, Rick Jones wrote:

>Moe Trin <ibup...@painkiller.example.tld.invalid> wrote:

>>> All stations in the same broadcast domain MUST have the same MTU.

>> Wrong

>Are we arguing about my conflating MTU and frame size together?

Perhaps - normally I don't see a (software) knob to diddle with level
2 packet lengths, much less a level 2 variable called MTU. But as
mentioned, mucking with the (level 3) MTU is quite common - because of
tunneling for example.

>> Rick, what are you smoking? Can I get some? ;-)

>JumboFrames I suspect.

My doctor says they're fattening and tells me to avoid them. Yes, I
can see what you're referring to in that regard.

>Starting point - two systems, back-to-back cable 1500 byte MTU at both
>ends:

Looks fine

>Now up the MTU on the sender to 9000 bytes:

Don't have GigE here (I'm assuming that's the case), and at work
where we did, we left the MTU at 1500 to avoid some confusion, as some
systems were 100BaseT on slower ports off the network switches. It
says here that's costing efficiency, but it wasn't that big of a deal
considering that those were slower systems anyway.

>As it happens, ethtool -S eth0 on the destination shows
>"rx_long_length_errors: 2" and that incremented to 4 when I ran the
>ping test again.

You said they're back-to-back, so the only thing I can think of is this
is an artifact of the receiving system not being aware that jumbos were
in use. I don't know.

>Now, perhaps in your situation when you shrank the MTU to < 1500
>bytes, the Ethernet controller and NIC were still willing to take in a
>larger frame, and IP was willing to accept it.

That's part of what I meant in my reply to Jorgen (via Barry) by O/S
dependent. A quick scan through RFC1122 doesn't find it, but I recall
a bit where the minimum "maximum" packet is 576 bytes, but I think
most Ethernet stacks are built to expect 1500 or so. For that matter,
the MTU on the loopback is 16436 on my Linux boxes.

>In my setup, I have reset my sender to a 1500 byte MTU, and then set
>the MTU to 1280 on my receiver. In that case the 8000 byte ping
>still worked, but I am not sure it would be good to rely on that.

The reason the 8k ping works is that the stack is fragmenting it, as
shown by all those extra packets in the tcpdump. The "transmitting"
system is sending at its MTU, and the "receiving" system is accepting
that - possibly up to the nominal 1500 or so bytes. What might be
interesting to try is bumping up the MTU to something like 1550 or so,
rather than the giant leap to 9k.

>I could see, and have some vague recollections about, systems whereon
>if one dropped the MTU/frame size below 1500 bytes, it would not
>accept packets above the new MTU and still below 1500.

Might this be related to PMTU black-holes? There's certainly plenty
of evidence for that with PPPo? where the user is also running a
firewall blocking those ICMP Type 3 Code 4 mal-ware packets rather
than relying on RFC3514.

Old guy

rahul....@gmail.com

Jun 13, 2013, 1:05:00 AM

> And this works equally well in the opposite direction with the same
> set-up. In the early 1990s, the standard test we ran after connecting
> a new box was to ping the snot out of a victim down the hall
>
> ping -c25 -s8000 victim

Thanks for your reply Moe Trin.

I found this link "https://www.kernel.org/doc/Documentation/networking/netdevices.txt".

As per this link,
"MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as mechanism to size receive
buffers".

So, if we decrease the mtu (say 1280), the driver may or may not receive the packet. It depends on the implementation, I guess.

For linux, I believe the receive buffer size was not decreased with the decrease in mtu size and so it was able to receive the 1500 size packet.

But, we should not rely on this behavior.
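The two outcomes allowed by that kernel note can be modeled as two driver policies. This is a hypothetical model, not real driver code: the `Nic` class and its `rx_sized_from_mtu` flag are invented for illustration. Each policy decides whether host B, after lowering its MTU to 1280, still accepts a full 1500-byte packet from A.

```python
from dataclasses import dataclass


@dataclass
class Nic:                   # hypothetical model, illustration only
    mtu: int
    # Hypothetical policy knob: True if the driver sizes receive buffers
    # from the MTU, False if it keeps 1500-byte receive capacity even when
    # a smaller MTU is configured.
    rx_sized_from_mtu: bool

    def rx_capacity(self) -> int:
        """Largest IP packet (bytes) this modeled NIC will accept."""
        if self.rx_sized_from_mtu:
            return self.mtu
        return max(self.mtu, 1500)

    def accepts(self, ip_total_len: int) -> bool:
        return ip_total_len <= self.rx_capacity()


print(Nic(1280, rx_sized_from_mtu=False).accepts(1500))  # True
print(Nic(1280, rx_sized_from_mtu=True).accepts(1500))   # False
```

The "it worked in my test" results in this thread correspond to the first policy. Nothing guarantees it, which is the point made above.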




rahul....@gmail.com

Jun 13, 2013, 1:12:16 AM

On Thursday, June 13, 2013 4:39:13 AM UTC+5:30, Rick Jones wrote:

> Now, perhaps in your situation when you shrank the MTU to < 1500
> bytes, the Ethernet controller and NIC were still willing to take in a
> larger frame, and IP was willing to accept it. In my setup, I have reset my sender to
> a 1500 byte MTU, and then set the MTU to 1280 on my receiver. In that
> case the 8000 byte ping still worked, but I am not sure it would be
> good to rely on that. I could see, and have some vague recollections
> about, systems whereon if one dropped the MTU/frame size below 1500
> bytes, it would not accept packets above the new MTU and still below
> 1500.

Agree with you Rick.

Please see "https://www.kernel.org/doc/Documentation/networking/netdevices.txt".

One more question:

For directly connected hosts, what should be the PMTU ? Should the sender always assign pmtu = interface mtu of sender?

Or, should it send an ICMP echo starting with its interface mtu. If it gets the reply, assign pmtu = interface mtu. Else, decrease the icmp echo size and try again till it gets the reply and use that value as pmtu ?
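The probing idea can be sketched as a search. The sketch swaps the linear decrease for a binary search, which needs fewer probes. It assumes a hypothetical `probe(size)` callback that sends one DF-marked packet of that size and reports whether a reply arrived. It also assumes that if a size gets through, every smaller size does too. Rick argues against probing a directly connected peer in his reply. RFC 4821 (Packetization Layer Path MTU Discovery) describes a more careful form of probing.

```python
from typing import Callable, Optional


def probe_pmtu(probe: Callable[[int], bool], low: int, high: int) -> Optional[int]:
    """Hypothetical helper: largest size in [low, high] for which probe(size) is True.

    Assumes monotonicity (if a size works, all smaller sizes work).
    Returns None if even `low` fails.
    """
    if not probe(low):
        return None
    best = low
    lo, hi = low + 1, high
    while lo <= hi:
        mid = (lo + hi) // 2
        if probe(mid):
            best = mid        # mid got through; look for something larger
            lo = mid + 1
        else:
            hi = mid - 1      # mid was lost; look smaller
    return best


# Simulated peer that silently drops anything above 1280 bytes:
print(probe_pmtu(lambda size: size <= 1280, 68, 1500))  # 1280
```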

thanks,
rahul

Rick Jones

Jun 14, 2013, 1:33:27 PM

Moe Trin <ibup...@painkiller.example.tld.invalid> wrote:
> On Wed, 12 Jun 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in article
> <kpav2o$kcs$1...@usenet01.boi.hp.com>, Rick Jones wrote:

> >As it happens, ethtool -S eth0 on the destination shows
> >"rx_long_length_errors: 2" and that incremented to 4 when I ran the
> >ping test again.

> You said they're back-to-back, so the only thing I can think of is this
> is an artifact of the receiving system not being aware that jumbos were
> in use. I don't know.

Indeed - it is an example of two stations in the broadcast domain who
were using different frame/MTU sizes :)

> >Now, perhaps in your situation when you shrank the MTU to < 1500
> >bytes, the Ethernet controller and NIC were still willing to take in a
> >larger frame, and IP was willing to accept it.

> That's part of what I meant in my reply to Jorgen (via Barry) by O/S
> dependent. A quick scan through RFC1122 doesn't find it, but I
> recall a bit where the minimum "maximum" packet is 576 bytes, but
> I think most Ethernet stacks are built to expect 1500 or so. For
> that matter, the MTU on the loopback is 16436 on my Linux boxes.

The "minimum 'maximum'" IP datagram size relates to IP fragment
reassembly. A conforming IP(v4) implementation must be able to
reassemble IP datagrams of at least 576 bytes. That is distinct from
the minimum MTU for an IPv4 network, which if I recall correctly is 68
bytes.
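RFC 791 explains the 68-byte figure: the IPv4 header can be up to 60 bytes (a 4-bit IHL counting 32-bit words), and the smallest fragment carries 8 bytes of data. A small sketch of that arithmetic follows. The constant and function names are hypothetical.

```python
IHL_MAX_WORDS = 15                     # IHL is a 4-bit field counting 32-bit words
MAX_IPV4_HEADER = IHL_MAX_WORDS * 4    # 60 bytes: header with maximum options
MIN_FRAGMENT_DATA = 8                  # fragment offsets are in 8-byte units

MIN_IPV4_MTU = MAX_IPV4_HEADER + MIN_FRAGMENT_DATA  # RFC 791 forwarding minimum
MIN_REASSEMBLY = 576                   # every host must be able to receive this much


def usable_for_ipv4(link_mtu: int) -> bool:
    """Hypothetical helper: True if a link MTU meets the IPv4 minimum."""
    return link_mtu >= MIN_IPV4_MTU


print(MIN_IPV4_MTU)           # 68
print(usable_for_ipv4(1280))  # True
```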

> >I could see, and have some vague recollections about, systems
> >whereon if one dropped the MTU/frame size below 1500 bytes, it
> >would not accept packets above the new MTU and still below 1500.

> Might this be related to PMTU black-holes? There's certainly plenty
> of evidence for that with PPPo? where the user is also running a
> firewall blocking those ICMP Type 3 Code 4 mal-ware packets rather
> than relying on RFC3514.

It could be I suppose. PMTU Discovery depends on receipt of the "too
large" IP datagram at a router in the first place, so if there were a
hop along the way where the sending side of that hop had an
MTU/framesize/whatnot larger than the receiving side of that hop, the
traffic could/would get bitbucketed and not generate any ICMP
Destination Unreachable, Fragmentation Needed messages.

It has always been my understanding though that PMTU black holes were,
99 times out of 10, triggered by overly aggressive ICMP filtering.
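From the sender's side, both causes look the same: the large DF datagram vanishes and no ICMP comes back. Below is a simplified simulation of RFC 1191-style discovery over a chain of hops. `discover_pmtu` is a hypothetical function, and the model has the first too-small hop report its MTU unless ICMP is filtered.

```python
from typing import List, Optional


def discover_pmtu(start_mtu: int, hop_mtus: List[int],
                  icmp_filtered: bool, max_tries: int = 5) -> Optional[int]:
    """Hypothetical simulation of PMTU discovery.

    The sender transmits with DF set at its current estimate. The first hop
    whose MTU is too small drops the datagram; unless ICMP is filtered, it
    reports its MTU (ICMP Type 3, Code 4) and the sender lowers its estimate.
    Returns the discovered PMTU, or None for a black hole.
    """
    pmtu = start_mtu
    for _ in range(max_tries):
        too_small = next((m for m in hop_mtus if m < pmtu), None)
        if too_small is None:
            return pmtu          # the datagram crossed every hop
        if icmp_filtered:
            continue             # silent drop: sender learns nothing, retries blindly
        pmtu = too_small         # lower the estimate as the ICMP message says
    return None


path = [1500, 1400, 1280]
print(discover_pmtu(1500, path, icmp_filtered=False))  # 1280
print(discover_pmtu(1500, path, icmp_filtered=True))   # None
```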
rick jones
--
firebug n, the idiot who tosses a lit cigarette out his car window

Rick Jones

Jun 14, 2013, 1:44:55 PM

rahul....@gmail.com wrote:
> One more question:

> For directly connected hosts, what should be the PMTU ? Should the
> sender always assign pmtu = interface mtu of sender?

> Or, should it send an ICMP echo starting with its interface mtu. If
> it gets the reply, assign pmtu = interface mtu. Else, decrease the
> icmp echo size and try again till it gets the reply and use that
> value as pmtu ?

Given I start from the premise of "Every station in a given broadcast
domain MUST have the same MTU (framesize)" sending the ICMP Echo
Request is (should be) unnecessary. And I wouldn't suggest it.

While the conditions are not identical, there was an amplification
attack possible against some TCP stacks (one of which used to be near
and dear to my paycheck) which did something similar with non-local
destinations. When speaking with the remote destination, they would
send an ICMP Echo Request with the DF bit set in the IP header, while
the rest of the traffic being sent to the destination was sent with
the DF bit cleared. The idea was to try to see if there was effective
PathMTU discovery possible along the path to the remote. Only once
there was an ICMP Echo Reply received would the DF bit start being set
on the "real" traffic. A rather clever thing to do, save for one
thing...

Trouble was, someone sending say TCP SYNchronize segments with a
spoofed source IP address would get the receiving stack to send a full
local MTU-sized ICMP Echo request to the spoofed source IP along with
the TCP SYN|ACK.

rick jones
--
The computing industry isn't as much a game of "Follow The Leader" as
it is one of "Ring Around the Rosy" or perhaps "Duck Duck Goose."
- Rick Jones

Moe Trin

Jun 14, 2013, 4:00:15 PM

On Fri, 14 Jun 2013, in the Usenet newsgroup comp.protocols.tcp-ip, in
article <kpfk57$68q$2...@usenet01.boi.hp.com>, Rick Jones wrote:

>Moe Trin <ibup...@painkiller.example.tld.invalid> wrote:

>> You said they're back-to-back, so the only thing I can think of is
>> this is an artifact of the receiving system not being aware that
>> jumbos were in use. I don't know.

>Indeed - it is an example of two stations in the broadcast domain who
>were using different frame/MTU sizes :)


Frame, yes. MTU? The receiving stack didn't even see it, so we
haven't gotten to the MTU level.

>It has always been my understanding though that PMTU black holes were,
>99 times out of 10, triggered by overly aggressive ICMP filtering.

That's pretty much what RFC1435 and RFC2923 indicate.

Old guy

Martijn Lievaart

Jun 15, 2013, 10:14:03 AM

On Wed, 12 Jun 2013 19:14:01 +0000, Jorgen Grahn wrote:

> But I think you need more than one Ethernet to do this well. PMTU
> discovery is about detecting things which happen more than one hop away
> from the source. Maybe you can use virtual machines and simulated
> networks.

Using VMs takes some setup, but the reward is an almost infinite
possibility to play with this kind of stuff. Greatly recommended.

M4

Martijn Lievaart

Jun 20, 2013, 10:53:49 AM

On Fri, 14 Jun 2013 17:44:55 +0000, Rick Jones wrote:

> While the conditions are not identical, there was an amplification
> attack possible against some TCP stacks (one of which used to be near
> and dear to my paycheck) which did something similar with non-local
> destinations. When speaking with the remote destination, they would
> send an ICMP Echo Request with the DF bit set in the IP header, while
> the rest of the traffic being sent to the destination was sent with the
> DF bit cleared. The idea was to try to see if there was effective
> PathMTU discovery possible along the path to the remote. Only once there
> was an ICMP Echo Reply received would the DF bit start being set on the
> "real" traffic. A rather clever thing to do, save for one thing...

Ugh. I assume you are talking about HP/UX? This behavior caused lots of
trouble for us, the least of it being that ping was blocked everywhere in
our network, so it did not work at all. Sadly I cannot remember what other
troubles it caused, but it turned a lot of architects against PMTUD, even
though there was no real problem with it and a huge problem without it.

Although I still think the HP/UX implementation was moronic, the response
by architects and admins to the PMTU troubles in said network was even
more moronic. All token ring interfaces had their MTU set to 1500. I did
predict that the first tunnel would cause no end of trouble, which it did.

In the end, as we were responsible for said tunnel, we just forcibly
stripped off the DF bit, as it was impossible to convince the network
owners of the correct solutions.

In our part of the network PMTUD did work as designed, except for one
stupid firewall that did not pass fragmentation-needed errors in any
intelligent way by default, and could not even be configured to allow
them without allowing all ICMP. This was high-end, expensive
stuff...

M4

-- The "fun" I had with PMTU, yuck! --

Rick Jones

Jun 20, 2013, 12:40:19 PM

Martijn Lievaart <m...@rtij.nl.invlalid> wrote:
> Ugh. I assume you are talking about HP/UX?

No. HP-UX :) And Solaris 2 and at least one version of MacOS (9?) -
anything based on Mentat's TCP/IP stack.

> This behavior caused lots of troubles for us. The least being that
> ping was blocked everywhere in our network, so it did not work at
> all. Sadly I cannot remember what other troubles it caused, but it
> made a lot of architects against PMTUD where there was no real
> problem with it, but a huge problem without it.

> Although I still think the HP/UX implementation was moronic, the

Difference of opinion on "moronic" - while I think what Mentat tried
was ultimately ill-advised, it was a clever way to see if PathMTU
discovery could "work" to the destination. If no ICMPs came back in
response to those ICMP Echo Requests with the DF bit set in the IP
header, the "real" traffic would continue to flow with DF cleared,
without having to rely on some sort of timeout-based PMTU black hole
detection.

> response by architects and admins to the PMTU troubles in said
> network was even more moronic. All token ring interfaces had their
> MTU set to 1500. I did predict that the first tunnel would cause no
> end of trouble, which it did.

You had Ethernet and TokenRing networks joined just at Layer2?

> In the end, as we were responsible for said tunnel, we just forcibly
> stripped off the DF bit as it was impossible to convince the network
> owners of the correct solutions.

adde parvum parvo magnus acervus erit (add little to little and there will be a great heap) :)

Arbitrarily clearing the DF bit? You might as well have set
ip_pmtu_strategy to 0 on the HP-UX systems and disabled PathMTU
discovery at the root. Or, if things were otherwise OK apart from
what you described below, there was always the value of 1, which
always set DF without sending any of the ICMP Echo Requests (sending
those was value 2).

> In our part of the network PMTUD did work as designed, except for
> one stupid firewall that neither allowed fragmentation-needed errors
> by default in an intelligent way, in general, or even could be made
> to allow them at all without allowing all ICMP. This was high-end,
> expensive stuff...

rick

An excerpt from my old "annotated_ndd.txt"

ip_pmtu_strategy:

Set the Path MTU Discovery strategy: 0 disables Path MTU
Discovery; 1 enables Strategy 1; 2 enables Strategy 2.

Because of problems encountered with some firewalls, hosts, and
low-end routers, IP provides for selection of either of two
discovery strategies, or for completely disabling the
algorithm. The tunable parameter ip_pmtu_strategy controls the
selection.

Strategy 1: All outbound datagrams have the "Don't Fragment" bit
set. This should result in notification from any intervening
gateway that needs to forward a datagram down a path that would
require additional fragmentation. When the ICMP "Fragmentation
Needed" message is received, IP updates its MTU for the remote
host. If the responding gateway implements the recommendations for
gateways in RFC 1191, then the next hop MTU will be included in
the "Fragmentation Needed" message, and IP will use it. If the
gateway does not provide next hop information, then IP will reduce
the MTU to the next lower value taken from a table of "popular"
media MTUs.
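The excerpt does not say which table of "popular" MTUs HP-UX used, so the sketch below uses the plateau table from RFC 1191, section 7, as a stand-in. That choice is an assumption. The sketch shows the Strategy 1 reaction to a "Fragmentation Needed" message. If the router reported a next-hop MTU, it is used. Otherwise the estimate steps down to the next lower plateau. The function and constant names are hypothetical:

```python
# Plateau table suggested in RFC 1191, section 7, largest first
# (assumed stand-in for HP-UX's table of "popular" media MTUs).
PLATEAUS = [65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68]


def new_pmtu(rejected_length: int, next_hop_mtu: int = 0) -> int:
    """PMTU estimate after an ICMP Fragmentation Needed message."""
    if next_hop_mtu >= 68:
        # RFC 1191-style router: trust the reported next-hop MTU.
        return next_hop_mtu
    # Older router (next-hop MTU field is zero): take the largest plateau
    # strictly smaller than the datagram that was rejected.
    for plateau in PLATEAUS:
        if plateau < rejected_length:
            return plateau
    return 68  # never go below the IPv4 minimum MTU
```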

Strategy 2: When a new routing table entry is created for a
destination on a locally connected subnet, the "Don't Fragment"
bit is never turned on. When a new routing table entry for a
non-local destination is created, the "Don't Fragment" bit is not
immediately turned on. Instead,

o An ICMP "Echo Request" of full MTU size is generated and
sent out with the "Don't Fragment" bit on.

o The datagram that initiated creation of the routing table
entry is sent out immediately, without the "Don't Fragment"
bit. Traffic is not held up waiting for a response to the "Echo
Request".

o If no response to the "Echo Request" is received, the
"Don't Fragment" bit is never turned on for that route; IP
won't time-out or retry the ping. If an ICMP "Fragmentation
Needed" message is received in response to the "Echo Request",
the Path MTU is reduced accordingly, and a new "Echo Request"
is sent out using the updated Path MTU. This step repeats as
needed.

o If a response to the "Echo Request" is received, the
"Don't Fragment" bit is turned on for all further packets for
the destination, and Path MTU discovery proceeds as for
Strategy 1.
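Below is a minimal Python sketch of the Strategy 2 steps as the ndd text above describes them, written as per-destination state. It is not HP-UX or Mentat code, and the class and method names are hypothetical. "Sending" a probe is modelled by recording its size:

```python
class Strategy2Route:
    """Per-destination state for a Strategy 2-style PMTU discovery sketch."""

    def __init__(self, link_mtu: int, is_local: bool):
        self.pmtu = link_mtu
        self.is_local = is_local
        self.df = False          # DF starts cleared for every new route
        self.probes_sent = []    # sizes of DF-set Echo Requests "sent"
        if not is_local:
            # Non-local route: fire a full-MTU Echo Request with DF set.
            # The datagram that created the route goes out at once without DF.
            self._probe()

    def _probe(self):
        self.probes_sent.append(self.pmtu)

    def on_echo_reply(self):
        # The path passes DF traffic: from now on behave like Strategy 1.
        if not self.is_local:
            self.df = True

    def on_frag_needed(self, next_hop_mtu: int):
        # Reduce the estimate; while still probing, re-probe at the new size.
        if next_hop_mtu < self.pmtu:
            self.pmtu = next_hop_mtu
            if not self.df:
                self._probe()

    # No reply at all: nothing happens, DF simply stays off (no retry/timeout).
```

Each new non-local route costs one full-MTU Echo Request. That cost is why the spoofed-SYN amplification Rick describes earlier in the thread worked against this strategy.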

Assuming that all routers properly implement Path MTU Discovery,
Strategy 1 is generally better - there is no extra overhead for
the ICMP "Echo Request" and response. Strategy 2 is available only
because some routers, or firewalls, or end hosts have been
observed simply to drop packets that have the DF bit on without
issuing the "Fragmentation Needed" message. Strategy 2 is more
conservative in that IP will never fail to communicate when using
it. [0,2] Default: Strategy 2

Since ip_pmtu_strategy was created, it has been determined that having
ip_pmtu_strategy set to a value of two can allow malicious people to
use the system as part of a Denial of Service attack on other
systems. For this reason, an HP Security Advisory was issued that
suggests the value be either one or zero. In fact, using a value of
two is considered "bad" enough that the option is removed from the 11i
version of ndd and the default changed to 1, though the ndd -h text
may not have been updated in time for the release of 11i (11.11)...

Setting the value to one will mean that IP datagrams will always have
the DF bit set. This is generally fine, but there are still some
broken setups out there that will filter out ICMP "Fragmentation
Needed" messages. Trying to send IP datagrams with the DF bit set
through such setups will create a "black hole" beyond which systems
are unreachable.

Setting the value to zero will mean that TCP will have to fall back on
other strategies to ensure that its segments are not fragmented along
the path to the destination. This could result in TCP using a Maximum
Segment Size (MSS) smaller than the maximum possible along that
path. This can lead to decreased performance.

In the HP-UX 11 patch stream, it is possible to set a value of "3" for
the ip_pmtu_strategy. This value will result in the DF bit in the IP
header being cleared, but will still have TCP select an MSS based on
the link-local MTU. In effect, it is a way for the network
administrator to tell the transport that all (sub)nets are local or
that the network administrator is not at all concerned if traffic from
this host happens to become fragmented along the way.

--
Wisdom Teeth are impacted, people are affected by the effects of events.

Jorgen Grahn

Jun 20, 2013, 5:00:40 PM

On Thu, 2013-06-20, Rick Jones wrote:
> Martijn Lievaart <m...@rtij.nl.invlalid> wrote:
>> Ugh. I assume you are talking about HP/UX?
>
> No. HP-UX :) And Solaris 2 and at least one version of MacOS (9?) -
> anything based on Mentat's TCP/IP stack.

You all bought the stack from somewhere? I assumed at least Sun had
enough IP hackers to write one from scratch, or one based off of
whatever they already had running. "The network is the computer", and
all.

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Martijn Lievaart

Jun 20, 2013, 5:05:40 PM

On Thu, 20 Jun 2013 16:40:19 +0000, Rick Jones wrote:

>> response by architects and admins to the PMTU troubles in said network
>> was even more moronic. All token ring interfaces had their MTU set to
>> 1500. I did predict that the first tunnel would cause no end of
>> trouble, which it did.
>
> You had Ethernet and TokenRing networks joined just at Layer2?

No, layer 3. Hence the PMTU blackholes.

>
>> In the end, as we were responsible for said tunnel, we just forcibly
>> stripped off the DF bit as it was impossible to convince the network
>> owners of the correct solutions.
>
> adde parvum parvo magnus acervus erit (add little to little and there will be a great heap) :)
>
> Arbitrarily clearing the DF bit - might as well have set
> ip_pmtu_strategy to 0 on the HP-UX systems and disabled PathMTU
> discovery at the root. Or, if things were otherwise OK apart from what
> you described below, there was always the value of 1 which set DF always
> without any of the ICMP Echo Requests being sent (which was value 2)

Besides the fact that we did not control said stuff, we could not
convince the "architects" that what they mandated was plain dead wrong.
That means you have to resort to solutions like this. (There was also
some politics involved....)

M4

Rick Jones

Jun 20, 2013, 8:24:16 PM

Jorgen Grahn <grahn...@snipabacken.se> wrote:
> You all bought the stack from somewhere? I assumed at least Sun had
> enough IP hackers to write one from scratch, or one based off of
> whatever they already had running. "The network is the computer",
> and all.

The Solaris 2 TCP/IP stack was certainly based on Mentat. HP-UX got
its Mentat-based stack a bit later (it first shipped with HP-UX 11.0).
That is why both HP-UX and Solaris have ndd and a very similar set of
ndd settings. It has always been my understanding that Sun took a
"one-time" copy of the Mentat stack and started hacking from there, but
that may simply be because I only noticed after they may (or may not)
have split.
HP maintained a continuing relationship (tracking Mentat "upstream")
through Mentat's purchase by Packeteer, then somewhere after that
(perhaps after Packeteer was itself purchased by Blue Coat?) brought
it fully in-house. I have often referred to the Solaris and HP-UX
networking stacks as "cousins."

Web searching for "Mentat TCP/IP" or "Mentat Hewlett-Packard" etc. will
find some interesting, related links:

http://www.thefreelibrary.com/Mentat+Announces+Networking+Technology+Agreement+with+HP%3B+HP+Licenses...-a020788508

http://www.zoominfo.com/p/Mentat-SkyX/689744294

rick jones
--
web2.0 n, the dot.com reunion tour...

Pascal Hambourg

Jun 22, 2013, 9:53:14 AM

Hello,

rahul....@gmail.com wrote:
>
> I found this link "https://www.kernel.org/doc/Documentation/networking/netdevices.txt".
>
> As per this link,
> "MTU is symmetrical and applies both to receive and transmit. A device
> must be able to receive at least the maximum size packet allowed by
> the MTU. A network device may use the MTU as mechanism to size receive
> buffers".
>
> So, if we decrease the mtu (say 1280), the driver may or may not receive the packet. It depends on the implementation, I guess.
>
> For linux, I believe the receive buffer size was not decreased with the decrease in mtu size and so it was able to receive the 1500 size packet.
>
> But, we should not rely on this behavior.

Indeed, this seems to depend on the driver or the hardware type.
On one ethernet interface type, setting the MTU did not limit the
receive size. On another one, setting the MTU limited the receive size.
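Pascal's observation can be modelled with two hypothetical driver behaviours. One sizes its receive buffers from the configured MTU, which the kernel's netdevices.txt quoted above permits. The other keeps a fixed buffer large enough for a standard Ethernet frame. The 18-byte overhead assumes an untagged frame, meaning a 14-byte header plus a 4-byte FCS. The names are illustrative:

```python
ETH_OVERHEAD = 18  # assumed: 14-byte Ethernet header + 4-byte FCS, no VLAN tag


def accepts(frame_len: int, mtu: int, sizes_buffers_from_mtu: bool,
            fixed_buffer: int = 1518) -> bool:
    """Would this (hypothetical) driver pass a received frame up the stack?"""
    if sizes_buffers_from_mtu:
        buffer = mtu + ETH_OVERHEAD   # shrinks when the MTU is lowered
    else:
        buffer = fixed_buffer         # unchanged by MTU configuration
    return frame_len <= buffer
```

A 1500-byte IP datagram arrives as a 1518-byte frame. With the MTU set to 1280, the first kind of driver drops it, and the second still accepts it.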

Pascal Hambourg

Jun 22, 2013, 9:54:47 AM

Barry Margolin wrote:
>> My take on this is: you cannot really change the MTU on an interface
>> -- you can only tell the interface the actual MTU on the link.
>> If you lie to it, anything can happen. Or, rather, I assume anything
>> can happen.
>
> I'd expect B to receive the packet successfully, but it might not be
> able to reply because the reply doesn't fit in the MTU.

The reply can be sent with fragmentation.
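Suppose B, with MTU 1280, answers a 1500-byte echo request. The reply would be fragmented roughly as in the Python sketch below. It assumes a 20-byte IP header with no options. Every fragment except the last must carry a multiple of 8 data bytes, because the offset field counts 8-byte units:

```python
def fragment(total_length: int, mtu: int, header_len: int = 20):
    """Split an IPv4 datagram so every fragment fits in mtu.

    Returns (offset_in_8_byte_units, data_len, more_fragments) tuples.
    """
    payload = total_length - header_len
    # Largest data size per fragment, rounded down to a multiple of 8.
    per_frag = (mtu - header_len) // 8 * 8
    frags, offset = [], 0
    while offset < payload:
        size = min(per_frag, payload - offset)
        more = offset + size < payload
        frags.append((offset // 8, size, more))
        offset += size
    return frags
```

`fragment(1500, 1280)` gives two fragments. The first carries 1256 data bytes at offset 0 with More Fragments set. The second carries the remaining 224 bytes at offset 157 (1256 / 8).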

Pascal Hambourg

Jun 22, 2013, 10:00:04 AM

Moe Trin wrote:
>
> Then look at
> my reply to Barry - specifically the output of 'ifconfig' and the
> tcpdump. One system with MTU 1280, one with 1500 and no problems.

You've been lucky. See my first reply to Rahul in this thread.

Pascal Hambourg

Jun 22, 2013, 10:12:13 AM

rahul....@gmail.com wrote:
>
> I have two linux machines (A and B) connected to same link (ethernet). The MTU supported is 1500.
>
> If I change the mtu to 1280 of one the host (say B), then can I send an ICMP echo packet from node A (of max size as per its mtu which is 1500) to node B ?
>
> Will host B be able to receive a packet whose size is greater than its MTU ?

It depends on B's implementation. Some will, some won't.

> Is there a way host A can discover the decreased MTU of B and then can send an ICMP packet of lower size ?

You may try to use a technique similar to path MTU discovery (PMTUD).
But it is not the same, as PMTUD relies on intermediate routers sending
ICMP "fragmentation needed but don't fragment flag set" error messages.
There is no intermediate router between two hosts on the same link.

Send ICMP echo packets with the DF (don't fragment) flag set and
increasing sizes. If the receiver discards packets bigger than its MTU,
of course it won't reply. If it accepts the packet but tries to reply
with DF set too, it won't be able to reply. But again, this is
implementation dependent.
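Here is a sketch of that probing idea in Python. It swaps Pascal's linearly increasing sizes for a binary search. That swap assumes answering is monotonic: if a given size gets a reply, every smaller size does too. `ping_argv` builds an iputils ping command in which `-M do` sets DF and `-s` gives the ICMP payload size, so 28 bytes of IP and ICMP headers are subtracted. The `replies` callback is a hypothetical stand-in you would supply, for example one that runs that command with `subprocess` and checks the exit status. As noted above, a receiver that discards the probe and one that cannot send its reply look the same from A:

```python
from typing import Callable, List

IP_HEADER = 20
ICMP_HEADER = 8


def ping_argv(host: str, ip_size: int) -> List[str]:
    """iputils ping command for one DF-set Echo Request of ip_size bytes total."""
    # -M do: prohibit fragmentation (DF set); -s: ICMP data bytes.
    return ["ping", "-M", "do", "-c", "1", "-s",
            str(ip_size - IP_HEADER - ICMP_HEADER), host]


def largest_answered(replies: Callable[[int], bool], low: int, high: int) -> int:
    """Largest total IP size in [low, high] that still gets an Echo Reply.

    Assumes replies(low) is True and that answering is monotonic in size.
    """
    while low < high:
        mid = (low + high + 1) // 2   # round up so the loop always shrinks
        if replies(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Against a simulated B that answers only up to 1280 bytes, `largest_answered(lambda s: s <= 1280, 68, 1500)` settles on 1280 after about a dozen probes.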

Barry Margolin

Jun 22, 2013, 10:14:47 AM

In article <kq4ab6$kvs$2...@saria.nerim.net>,
Pascal Hambourg <boite-...@plouf.fr.eu.org> wrote:

> Barry Margolin wrote:

Someone else pointed that out, and I acknowledged a week ago that I
forgot that ICMP could be fragmented; it was a stupid mistake. Was it
really necessary to dredge it up again?

Pascal Hambourg

Jun 22, 2013, 10:52:14 AM

Barry Margolin wrote:
>
> Someone else pointed that out, and I acknowledged a week ago that I
> forgot that ICMP could be fragmented, it was a stupid mistake. Was it
> really necessary to dredge it up again?

Sorry, I just discovered the thread today, didn't notice the first posts
were over one week old and started to reply before reading all replies.

glen herrmannsfeldt

Jun 22, 2013, 12:08:48 PM

Rick Jones <rick....@hp.com> wrote:
> Moe Trin <ibup...@painkiller.example.tld.invalid> wrote:

(snip)
>> You said they're back-to-back, so the only thing I can think of is this
>> is an artifact of the receiving system not being aware that jumbos were
>> in use. I don't know.

> Indeed - it is an example of two stations in the broadcast domain who
> were using different frame/MTU sizes :)

>> >Now, perhaps in your situation when you shrank the MTU to < 1500
>> >bytes, the ethernet controller and NIC were still willing to take-in a
>> >frame larger and IP accept it.

(snip)

Some years ago, I had a SLIP line running office to home at 9600 baud
using the usual SLIP MTU, which I believe is 1008. (That was just
at the beginning of the web, so mostly telnet, ftp, mail.)

> The "minimum 'maximum'" IP datagram size relates to IP fragment
> reassembly. A conforming IP(v4) implementation must be able to
> reassemble IP datagrams of at least 576 bytes. That is distinct from
> the minimum MTU for an IPv4 network, which if I recall correctly is 68
> bytes.

As I understand it, there are hosts, usually routers, that can
route larger datagrams, but can't process ones addressed to them
larger than 576 bytes.

I did find some hosts that didn't understand PMTUD while running
that SLIP line. It might be that at one time I changed all my
home hosts to 1008.

-- glen

glen herrmannsfeldt

Jun 22, 2013, 12:19:47 PM

Rick Jones <rick....@hp.com> wrote:
> rahul....@gmail.com wrote:

(snip)
>> If yes, what would be the pmtu for the path between these nodes ?
>> Can node A find the mtu of node B on the same link ?

> The starting point is this, in pseudo IEEEspeak (which I've probably
> botched) with a smattering of IETF style:

> All stations in the same broadcast domain MUST have the same MTU.

Is this still true if you have more than one IP net in the same
broadcast domain?

I haven't thought of this for some time, but it seems to me that
you should be able to run separate (sub)nets on the same ethernet,
with different MTU for the different subnets. You then need a router
that can route between them.

> Ethernet has no way to communicate frame size between peers. If a
> frame larger than the station is prepared to receive arrives, that
> frame will be dropped.

> Path MTU is up at the IP layer and uses ICMP messages to communicate
> MTU between hosts. The PathMTU logic is run when a router looks to
> forward the IP datagram - when it goes to send it. That means it must
> have received it in the first place. For IP to "receive" the datagram
> it must first be received at the layer below it - in this case
> Ethernet.

Which gets interesting with different interfaces on different
types of nets, or different datagram limits on the same type
of network. (The latter being the case for Ethernet with and
without jumbo frames, the former maybe for Ethernet-FDDI or
Ethernet-Token Ring.)

> However, if the interface on which it was going to receive the
> datagram has an MTU/framesize smaller than the size of the datagram
> sent, the datagram won't be "received" by the IP layer so it cannot be
> resent, so the Path MTU logic cannot trigger.

Yes, but there are routers that can route larger packets than
they can receive if addressed to them.

> Thus the reason why all stations (hosts, systems, what you will) in
> the broadcast domain (everything joined at layer 2 eg ethernet) MUST
> have the same MTU.

(snip)

-- glen

Pascal Hambourg

Jun 22, 2013, 12:53:30 PM

glen herrmannsfeldt wrote:
>
> I did find some hosts that didn't understand PMTUD

What do you mean by "understand PMTUD" ?
Path MTU discovery is a technique, not a protocol. It relies on some
features of IP routers, but these routers do not need to specifically
understand PMTUD.

Pascal Hambourg

Jun 22, 2013, 1:01:47 PM

glen herrmannsfeldt wrote:
> Rick Jones <rick....@hp.com> wrote:
>
>> All stations in the same broadcast domain MUST have the same MTU.
>
> Is this still true if you have more than one IP net in the same
> broadcast domain?

Yes.

> I haven't thought of this for some time, but it seems to me that
> you should be able to run separate (sub)nets on the same ethernet,

If they share the same ethernet, they are not separate. They are just
different subnets.

> with different MTU for the different subnets. You then need a router
> that can route between them.

You also need to disable ICMP redirect, otherwise hosts on different
subnets may by-pass the router and start to communicate directly with
each other when they learn the destination is directly reachable.

glen herrmannsfeldt

Jun 22, 2013, 1:40:32 PM

Pascal Hambourg <boite-...@plouf.fr.eu.org> wrote:
> glen herrmannsfeldt wrote:

Well, that was about 1992 and some hosts/routers didn't do it right.

I don't remember now the details, but some things didn't work.

-- glen

glen herrmannsfeldt

Jun 22, 2013, 1:52:21 PM

Pascal Hambourg <boite-...@plouf.fr.eu.org> wrote:

>> Rick Jones <rick....@hp.com> wrote:

>>> All stations in the same broadcast domain MUST have the same MTU.

(snip, I wrote)
>> Is this still true if you have more than one IP net in the same
>> broadcast domain?

> Yes.

>> I haven't thought of this for some time, but it seems to me that
>> you should be able to run separate (sub)nets on the same ethernet,

> If they share the same ethernet, they are not separate. They are just
> different subnets.

>> with different MTU for the different subnets. You then need a router
>> that can route between them.

> You also need to disable ICMP redirect, otherwise hosts on different
> subnets may by-pass the router and start to communicate directly with
> each other when they learn the destination is directly reachable.

Some time ago, I had systems on a two-subnet Ethernet with the
same MTU as part of a network transition. In that case, it was
possible to communicate directly. I had an HP-UX machine running
gated (instead of routed that most everyone else ran), and gated
allows one to configure a metric 0 route.

In the case of different MTU, it should be possible for smaller
MTU hosts to send directly, as long as the large MTU hosts don't
try to do it. Also, hosts could have addresses on both nets.

-- glen

or...@pwr.wroc.pl

Jun 23, 2013, 1:36:48 PM

On 12.06.2013, Rick Jones <rick....@hp.com> wrote:
> rahul....@gmail.com wrote:
> All stations in the same broadcast domain MUST have the same MTU.

It's not true. Both sides of any communication in the same broadcast
domain must send, and be able to receive, with the same MTU.

For example, we have a network segment with one IP subnet. The MTU was
increased to almost 9000 because of a lot of NFS traffic. One host
serving an important service runs on an older machine that cannot
increase its MTU (and we cannot replace the server either). All other
hosts in the subnet have a static route to it with the MTU set (in
Linux the MTU can come from the routing table ;) and it has worked
well for almost 2 years.
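The post does not show the routes themselves. On Linux they would typically be iproute2 host routes carrying an `mtu` attribute. The sketch below builds those commands as argument lists, which you could run as root, for example with `subprocess.run`. The interface `eth0` and the documentation address 192.0.2.10 for the small-MTU server are hypothetical:

```python
from typing import Dict, List


def mtu_route_argv(dest: str, dev: str, mtu: int) -> List[str]:
    """iproute2 command that pins the MTU used toward one on-link host."""
    # A /32 host route is more specific than the connected subnet route,
    # so only traffic to `dest` uses the smaller MTU.
    return ["ip", "route", "add", f"{dest}/32", "dev", dev, "mtu", str(mtu)]


def routes_for(legacy_hosts: Dict[str, int], dev: str) -> List[List[str]]:
    """One command per small-MTU host on an otherwise jumbo-frame segment."""
    return [mtu_route_argv(ip, dev, mtu)
            for ip, mtu in sorted(legacy_hosts.items())]
```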

--
Regards
orcus

Rick Jones

Jun 24, 2013, 2:12:49 PM

Indeed, there are corner cases or exceptions or workarounds, but I
still prefer to start with that assertion above. Much of the time, if
the communication is TCP, the TCP MSS exchange will paper over a
mismatched frame/MTU size in the same broadcast domain. But it
isn't really something one should count on.
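This is why the MSS exchange usually hides the mismatch. Each side advertises an MSS derived from its own MTU, and each sender uses the smaller of its own limit and the peer's advertisement. Here is a Python sketch that assumes 20-byte IPv4 and TCP headers with no options:

```python
TCP_IP_HEADERS = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def advertised_mss(mtu: int) -> int:
    """MSS a host would announce in its SYN, derived from its own MTU."""
    return mtu - TCP_IP_HEADERS


def effective_mss(local_mtu: int, peer_advertised_mss: int) -> int:
    """Largest segment this side sends: capped by its own MTU and the peer's MSS."""
    return min(advertised_mss(local_mtu), peer_advertised_mss)
```

When a 9000-byte host talks to a 1500-byte host, both directions end up with 1460-byte segments, so no datagram exceeds 1500 bytes. UDP and ICMP have no equivalent exchange, which is one reason not to count on it.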

rick jones
--
denial, anger, bargaining, depression, acceptance, rebirth...
where do you want to be today?