10k Clients + Routing Table --> Bad Performance


Christian Kreuzberger

Jan 19, 2016, 4:31:24 PM
to ns-3-users
Hi. I have a scenario with 10k TCP clients, each of them having their own pre-defined link capacity (installed using P2PLinkhelper). This resulted in having ONE router with 10k links and 10k different IP addresses for the clients.
While these 10k clients are not always active at the same time, they exist from the beginning to the end of the simulation. I have populated the routing tables with 

Ipv4GlobalRoutingHelper::PopulateRoutingTables ();




The scenario looks as follows:
Server <--BOTTLENECK--> Router <---> Each Client
The clients only need to communicate with the server, and nothing else. 

I am experiencing really bad performance even with only 10 of the 10k clients active. With only 500 clients in the scenario, I can have all 500 active without any issues. I guess this comes from the routing tables, and I'm wondering whether I can improve this somehow. I'm sure my code is far from optimal, so any hints and suggestions are welcome. If necessary, I'll upload the code.

Best regards and thanks in advance
Christian

Tommaso Pecorella

Jan 19, 2016, 6:03:01 PM
to ns-3-users
Hi,

this is a known limitation of GlobalRouting. This is not a surprise tbh, as 10k nodes would kill OSPF as well, and GlobalRouting is based on OSPF.
Use NixVectorRouting, it's much faster for the use-case you have.
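For reference, enabling it is just a matter of setting the routing helper before installing the stack; a minimal sketch, following the nix-simple example shipped with ns-3 (the node setup here is a placeholder, not code from this thread):

#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/internet-module.h"
#include "ns3/nix-vector-routing-module.h"

using namespace ns3;

int main (int argc, char *argv[])
{
  NodeContainer nodes;
  nodes.Create (10000);

  // Replace the default (global) routing with nix-vector routing.
  Ipv4NixVectorHelper nixRouting;
  InternetStackHelper stack;
  stack.SetRoutingHelper (nixRouting);  // has effect on the next Install ()
  stack.Install (nodes);

  // ... create links, assign addresses, install applications as usual.
  // Note: no Ipv4GlobalRoutingHelper::PopulateRoutingTables () call is
  // needed; nix-vector routes are computed on demand, per flow.

  Simulator::Run ();
  Simulator::Destroy ();
  return 0;
}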

T.

Christian Kreuzberger

Jan 20, 2016, 4:43:18 AM
to ns-3-users
Hi. Thanks for the reply. Nix Vector Routing is definitely the solution for scaling in my scenario.

However, while Nix Vector Routing did significantly speed up the initialization process and is also much faster than OSPF/Global Routing with 10k static clients, it is still much slower (by a factor of 10-20) than what I experienced before, when only 500 clients were connected in total.
I will try to put together a minimal example scenario and upload it, so people can compare the performance. I'll also dig deeper into the problem and see if I can improve it on my own. Stay tuned.

Christian

Nat P

Jan 20, 2016, 5:01:57 AM
to ns-3-users
I don't know how much it can help, but with many nodes I have always created the initial tables by hand. Automatic handling of routing tables often inserts redundant paths and duplicate entries, and it is practically impossible to manage.

Nat

Christian Kreuzberger

Jan 21, 2016, 6:39:13 AM
to ns-3-users
Agreed. I've started to manually add the static routing entries. It is much faster than GlobalRouting, both when initializing the routing tables and when using them in production.
It also seems to be faster than NixVector Routing, which makes me wonder whether it would make sense to put some thought/manpower into another strategy that combines the benefits of static routing and dynamic routing.
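For illustration, here is a minimal sketch of what these manual entries look like in my Server <--> Router <--> Client setup (addresses, interface indices, and the wrapper function are placeholders; GetStaticRouting, SetDefaultRoute, and AddHostRouteTo are the standard Ipv4StaticRouting calls):

#include "ns3/network-module.h"
#include "ns3/internet-module.h"

using namespace ns3;

// Sketch: a default route on a client, plus per-client host routes on the
// router. All parameters below are made up for illustration.
void
InstallStaticRoutes (Ptr<Node> client, Ptr<Node> router,
                     Ipv4Address routerAddr, Ipv4Address clientAddr,
                     Ipv4Address serverAddr,
                     uint32_t clientIf, uint32_t routerToClientIf,
                     uint32_t routerToServerIf)
{
  Ipv4StaticRoutingHelper staticHelper;

  // Client: everything goes towards the router.
  Ptr<Ipv4StaticRouting> clientRouting =
      staticHelper.GetStaticRouting (client->GetObject<Ipv4> ());
  clientRouting->SetDefaultRoute (routerAddr, clientIf);

  // Router: one /32 host route per client, plus one towards the server.
  Ptr<Ipv4StaticRouting> routerRouting =
      staticHelper.GetStaticRouting (router->GetObject<Ipv4> ());
  routerRouting->AddHostRouteTo (clientAddr, routerToClientIf);
  routerRouting->AddHostRouteTo (serverAddr, routerToServerIf);
}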

Also, I've looked into Static Routing (namely the LookupStatic method). It uses a sequential search implemented as a for loop:
  for (NetworkRoutesI i = m_networkRoutes.begin ();
       i != m_networkRoutes.end ();
       i++)
    {
..........
    }

Considering that this method is called EVERY time a packet needs to be forwarded, this is, and I'm trying to stay polite here, not very performant.

Tommaso Pecorella

Jan 21, 2016, 9:03:16 AM
to ns-3-users
Hi Christian,

scanning through the routing table to find the longest matching prefix is *the* classic IP routing problem, and real routers solve it with specialized hardware memory (TCAMs).
If you have an idea on how to speed up the lookup without breaking the longest-match case, I'll be happy to implement it.
Keep in mind that the normal use-case doesn't involve extremely large routing tables, so there was no need to optimize the lookup algorithms.

Cheers,

T.

Christian Kreuzberger

Jan 21, 2016, 9:32:49 AM
to ns-3-users
I'm not an expert in routing or IP, and without having everything 100% planned out, I would propose the following:

The routing table would be stored in a tree structure, where you follow the path that matches the destination IP address. Each entry is (obviously) associated with a metric, a NetDevice, and all the other data that is currently associated with a route.

Root node: 0.0.0.0/0
Child nodes: i) 10.0.0.0/8, ii) 15.0.0.0/8, iii) 23.0.0.0/8, iv) 127.0.0.1/32, v) 192.168.0.0/16, ...
Grandchildren: i.1) 10.1.0.0/16, i.2) 10.2.0.0/16, ..., v.1) 192.168.12.0/24, v.2) 192.168.13.0/24

Every time you look up a route, you start at the root node and check whether any of its children is applicable to the destination. If NONE is applicable, you stay at the current node. If one is applicable, you follow it and examine that node's children in turn.


Let me try to apply this to my scenario, which looks as follows:
Link 0:
Server: 20.0.0.1
Router: 20.0.0.2

Link 1:
Router: 10.0.0.x_1
Client:  10.0.0.y_1
Where x_1 and y_1 are integers from 0 to 254, x_1 != y_1

Link 2:
Router: 10.0.0.x_2
Client: 10.0.0.y_2
Where x_2 and y_2 are integers from 0 to 254, x_2 != y_2, x_2 != x_1, x_2 != y_1, y_2 != x_1, y_2 != y_1

...
Link n:
Router: 10.0.0.x_n
Client: 10.0.0.y_n
Where x_n and y_n are integers from 0 to 254, x_n != y_n, and so on....

Link n+1:
Router: 10.0.1.x_{n+1}
Client: 10.0.1.y_{n+1}

etc...


Now the "flat" routing table on "Router" looks like this:
Host/Network, Device
20.0.0.1, 0
10.0.0.y_1, 1
10.0.0.y_2, 2
10.0.0.y_3, 3
...
10.0.0.y_n, n
10.0.1.y_{n+1}, n+1
...

so potentially very long.

Now just assume that the first entry, 20.0.0.1 on device 0, is at the very end of that list. This means that every time this host is contacted, the list needs to be searched sequentially to the very end.

Now just assume you had an index structure looking like this:

Root: 0.0.0.0/0
Child 1: 20.0.0.0/8 - Sub Child 1.1: 20.0.0.1/32 Link 0
Child 2: 10.0.0.0/8 - Sub Child 2.1: 10.0.0.0/24, Sub Child 2.2: 10.0.1.0/24, Sub Child 2.3: 10.0.2.0/24, Sub Child 2.4: 10.0.3.0/24, etc...

And then Sub Child 2.1 would contain the information for 256 clients, Sub Child 2.2 for another 256, etc...
So worst case you have to sequentially follow the path down the tree, which is probably short (within 2-3 steps), and then perform a sequential search of 256 clients (which can also be further improved).
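To make this concrete, here is a minimal, self-contained sketch of such a prefix tree as a binary trie over the 32 address bits (my illustration, not ns-3 code; a real version would store the full route entry instead of a bare interface index):

#include <cstdint>
#include <iostream>
#include <memory>

// Binary trie over the address bits. Each node may carry a route
// (here just an output interface index; -1 means "no route stored here").
struct TrieNode
{
  std::unique_ptr<TrieNode> child[2];
  int route = -1;
};

class PrefixTrie
{
public:
  // Store a route for prefix/len, e.g. Insert (0x0A000000, 8, 1) == 10.0.0.0/8.
  void Insert (uint32_t prefix, int len, int route)
  {
    TrieNode *n = &m_root;
    for (int i = 0; i < len; ++i)
      {
        int bit = (prefix >> (31 - i)) & 1;
        if (!n->child[bit])
          {
            n->child[bit] = std::make_unique<TrieNode> ();
          }
        n = n->child[bit].get ();
      }
    n->route = route;
  }

  // Longest-prefix match: walk the destination bits as deep as possible and
  // remember the last route seen, which is the longest matching prefix.
  int Lookup (uint32_t dest) const
  {
    const TrieNode *n = &m_root;
    int best = n->route;
    for (int i = 0; i < 32 && n != nullptr; ++i)
      {
        int bit = (dest >> (31 - i)) & 1;
        n = n->child[bit].get ();
        if (n != nullptr && n->route != -1)
          {
            best = n->route;
          }
      }
    return best;
  }

private:
  TrieNode m_root;
};

int main ()
{
  PrefixTrie t;
  t.Insert (0x14000001, 32, 0); // 20.0.0.1/32 -> device 0 (the server link)
  t.Insert (0x0A000000,  8, 1); // 10.0.0.0/8  -> device 1
  t.Insert (0x0A000100, 24, 2); // 10.0.1.0/24 -> device 2 (more specific)
  std::cout << t.Lookup (0x14000001) << std::endl; // 0: exact /32 match
  std::cout << t.Lookup (0x0A000105) << std::endl; // 2: 10.0.1.5, /24 beats /8
  std::cout << t.Lookup (0x0A020001) << std::endl; // 1: 10.2.0.1, only /8 matches
  return 0;
}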


I am probably completely missing something that is very important for IP/routing. My approach is based on a very simplistic idea and is most likely not going to work in all scenarios. But I hope it is a good starting point for a discussion.


Hope this helps,
Christian

Christian Kreuzberger

Jan 23, 2016, 3:46:18 AM
to ns-3-users
I did some more digging, and I found something that is even worse than routing. I can now easily control the routing part with some scheduling magic and static routing, which increased performance by a factor of 5.
But what is even worse is the method GetInterfaceForDevice:


int32_t
Ipv4L3Protocol::GetInterfaceForDevice (
  Ptr<const NetDevice> device) const
{
  NS_LOG_FUNCTION (this << device);
  int32_t interface = 0;

  for (Ipv4InterfaceList::const_iterator i = m_interfaces.begin ();
       i != m_interfaces.end ();
       i++, interface++)
    {
      if ((*i)->GetDevice () == device)
        {
          return interface;
        }
    }

  return -1;
}


Turns out, when you have 10k links, this method is really far from optimal... Looking at the profiler, it is currently the single biggest bottleneck in the implementation :(




Tommaso Pecorella

Jan 23, 2016, 5:23:10 AM
to ns-3-users
Hi,

yes, I was thinking of re-implementing that with a reverse hash table, but I felt it wasn't really necessary after all; the normal use-case isn't affected by this.
Unfortunately, changing this lookup means quite a change in the code. However, if you're up for it, we can prepare a patch and you can test it. I'd be curious about the speedup.

Cheers,


T.

Christian Kreuzberger

Jan 23, 2016, 7:15:17 AM
to ns-3-users
I'm up for helping out here. I am wondering whether NetDevice::GetAddress could be utilized for that?

Tommaso Pecorella

Jan 23, 2016, 7:39:44 AM
to ns-3-users
Hi,

not really. It is not guaranteed that NetDevice addresses are unique within a node. It's a super-conservative assumption: addresses must be unique in a local network (because otherwise the network would not work), but they could be duplicated within a node.

T.

Christian Kreuzberger

Jan 23, 2016, 8:26:39 AM
to ns-3-users
But that sounds like a perfect input for a hash function. Even if you find duplicates, at least you have reduced the number of items to iterate over.

Tommaso Pecorella

Jan 23, 2016, 9:31:50 AM
to ns-3-users
True, but the pointer value is already unique... :)
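For example, a minimal sketch of what such a reverse lookup might look like, keyed directly on the device pointer (the class and member names here are illustrative, not the actual patch; ns-3's Ptr<> compares by the underlying pointer, so it works as a map key):

#include <map>
#include "ns3/network-module.h"

using namespace ns3;

// Hypothetical illustration of a NetDevice -> interface-index reverse map.
class ReverseIfIndex
{
public:
  // Called once per interface, when the interface is added to the node.
  void Add (Ptr<const NetDevice> device, uint32_t interface)
  {
    m_reverse[device] = interface;
  }

  // O(log n) lookup instead of a linear scan over all interfaces.
  int32_t GetInterfaceForDevice (Ptr<const NetDevice> device) const
  {
    std::map<Ptr<const NetDevice>, uint32_t>::const_iterator it =
      m_reverse.find (device);
    return (it == m_reverse.end ()) ? -1 : static_cast<int32_t> (it->second);
  }

private:
  std::map<Ptr<const NetDevice>, uint32_t> m_reverse;
};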

T.

Christian Kreuzberger

Jan 23, 2016, 11:17:53 AM
to ns-3-users
A different approach, which might be quicker to do (at least for me), is the following:
I already have the interface index in 

Ipv4StaticRouting::LookupStatic (Ipv4Address dest, Ptr<NetDevice> oif)



when the following is called

rtentry = Create<Ipv4Route> ();
rtentry->SetDestination (route->GetDest ());
rtentry->SetSource (SourceAddressSelection (interfaceIdx, route->GetDest ()));
rtentry->SetGateway (route->GetGateway ());
rtentry->SetOutputDevice (m_ipv4->GetNetDevice (interfaceIdx));


So here we could just set the output device interface index in rtentry, and then we would not have to query it again in the other methods. 

Tommaso Pecorella

Jan 23, 2016, 5:54:07 PM
to ns-3-users
Hi,

let me understand better. Your proposal is to add another field to Ipv4RoutingTableEntry: the output interface index. Right?
Technically it can be done, but its effects would be limited to StaticRouting (or to any routing protocol "rewired" to use it).

Moreover, there are a number of other places where the Ipv4RoutingTableEntry is not available, and yet GetInterfaceForDevice is used.

As a consequence, I'd rather optimize GetInterfaceForDevice, its effects will be immediate and widespread.
I'll prepare a patch, I'd be happy if you can test it.

Cheers,

T.

Tommaso Pecorella

Jan 23, 2016, 8:07:32 PM
to ns-3-users
Hi Christian,

can you test the attached patch?
It should have some beneficial effects... if you know what I mean.

Mind, it's untested and I don't know whether everything works AND whether it generates a memory leak. It shouldn't, but you never know.

Cheers,

T.
patch-78.diff

Christian Kreuzberger

Jan 24, 2016, 5:46:17 AM
to ns-3-users
Thank you a lot for your efforts! Initial tests look good; I can definitely see a performance boost (results are in the middle of this post). I did not see a negative impact on memory, so the patch does not seem to have introduced a memory leak.

I believe there is another method in this class that can be improved with this new type of lookup:

void 
Ipv4L3Protocol::Receive ( Ptr<NetDevice> device, Ptr<const Packet> p, uint16_t protocol, const Address &from,
                          const Address &to, NetDevice::PacketType packetType)

where the list is also iterated sequentially:

  int32_t interface = 0;  // declared just above this loop in the source
  Ptr<Ipv4Interface> ipv4Interface;
  for (Ipv4InterfaceList::const_iterator i = m_interfaces.begin (); 
       i != m_interfaces.end (); 
       i++, interface++)
    {
      ipv4Interface = *i;
      if (ipv4Interface->GetDevice () == device)
        {
          if (ipv4Interface->IsUp ())
            {
              m_rxTrace (packet, m_node->GetObject<Ipv4> (), interface);
              break;
            }
          else
            {
              NS_LOG_LOGIC ("Dropping received packet -- interface is down");
              Ipv4Header ipHeader;
              packet->RemoveHeader (ipHeader);
              m_dropTrace (ipHeader, packet, DROP_INTERFACE_DOWN, m_node->GetObject<Ipv4> (), interface);
              return;
            }
        }
    }

I think it would make sense to also use the lookup method here, like this (although feel free to change it to something that looks better or works better):

  int32_t interface = GetInterfaceForDevice(device);

  Ptr<Ipv4Interface> ipv4Interface;

  if (interface != -1)
    {
      ipv4Interface = m_interfaces[interface];

      if (ipv4Interface->IsUp ())
        {
          m_rxTrace (packet, m_node->GetObject<Ipv4> (), interface);
        }
      else
        {
          NS_LOG_LOGIC ("Dropping received packet -- interface is down");
          Ipv4Header ipHeader;
          packet->RemoveHeader (ipHeader);
          m_dropTrace (ipHeader, packet, DROP_INTERFACE_DOWN, m_node->GetObject<Ipv4> (), interface);
          return;
        }
    }

I did some "short" testing to see the performance improvement (10k nodes connected to one router, 5 active nodes), and here are the details about it:

Without your patch: 470 seconds
With your patch: 350 seconds
With my additional change: 300 seconds

-------------------------------------------------------------

Also, there are some other bits and pieces that I am going to tweak for my specific scenario. For instance, in
Ipv4L3Protocol::Send

the comments say the following:
  // Handle a few cases:
  // 1) packet is destined to limited broadcast address
  // 2) packet is destined to a subnet-directed broadcast address
  // 3) packet is not broadcast, and is passed in with a route entry
  // 4) packet is not broadcast, and is passed in with a route entry but route->GetGateway is not set (e.g., on-demand)
  // 5) packet is not broadcast, and route is NULL (e.g., a raw socket call, or ICMP)

For me, it is pretty much always case 3 (I'm not using broadcast at all). However, case 2 wastes a lot of CPU time, as it does a loop over all 10k interfaces:
  // 2) check: packet is destined to a subnet-directed broadcast address
  uint32_t ifaceIndex = 0;
  for (Ipv4InterfaceList::iterator ifaceIter = m_interfaces.begin ();
       ifaceIter != m_interfaces.end (); ifaceIter++, ifaceIndex++)
    {
      Ptr<Ipv4Interface> outInterface = *ifaceIter;
      for (uint32_t j = 0; j < GetNAddresses (ifaceIndex); j++)
        {
          Ipv4InterfaceAddress ifAddr = GetAddress (ifaceIndex, j);          
          if (destination.IsSubnetDirectedBroadcast (ifAddr.GetMask ()) &&
              destination.CombineMask (ifAddr.GetMask ()) == ifAddr.GetLocal ().CombineMask (ifAddr.GetMask ())   )
            {
              ...
              return;
            }
        }
    }

Mind you, in my scenario the inner if condition is always false. I'm leaving this here as a comment in the hope that it helps to further improve ns-3 performance. However, I am not sure whether this code is there on purpose, or whether a condition could be used to prevent this loop from being executed. For myself, I am going to comment this part of the code out for now.

Initial testing showed that this did not reduce runtime, though I guess it will have a positive impact over the long run of my simulation when there are more interfaces up and running.


Another thing that puzzled me was in Static Routing: 
Ipv4StaticRouting::LookupStatic (Ipv4Address dest, Ptr<NetDevice> oif)

Here the for loop goes over all routing entries. But I'm wondering whether this is still necessary IF the routing entry already found has a mask of /32 (in the case of IPv4).
The only condition in the loop that is relevant here is this one:
          if (masklen < longest_mask) // Not interested if got shorter mask
            {
              NS_LOG_LOGIC ("Previous match longer, skipping");
              continue;
            }

So if the current masklen is smaller than the longest_mask found so far, we ignore the current routing entry and continue with the next item. However, can there be a mask greater than 32 (read: if a direct route to the host is in the routing table, there cannot be a more specific route)? If not, then you have already found the "optimal" entry in the routing table and you can stop! So something like this:

          Ipv4RoutingTableEntry* route = (j);
          uint32_t interfaceIdx = route->GetInterface ();
          rtentry = Create<Ipv4Route> ();
          rtentry->SetDestination (route->GetDest ());
          rtentry->SetSource (SourceAddressSelection (interfaceIdx, route->GetDest ()));
          rtentry->SetGateway (route->GetGateway ());
          rtentry->SetOutputDevice (m_ipv4->GetNetDevice (interfaceIdx));

          if (masklen == 32) 
            {
              break;
            }

This small change also did not significantly impact the runtime, though I'm expecting it to have a greater impact as my routing tables grow during the scenario (I'm having up to 500 concurrent clients, meaning 500 entries in the routing table at some point).

Apologies for this very long post; I hope I was not too bitchy about things and was actually able to help improve the performance of ns-3 a bit. I will, at some point, write a blog post about some of the other optimizations I did (like the static routing entries) to improve my scenario's performance.

Thanks,
Christian

Christian Kreuzberger

Jan 24, 2016, 7:06:10 AM
to ns-3-users
Yet another sequential search loop that I found is in Ipv4StaticRouting::RouteInput. 

  for (uint32_t j = 0; j < m_ipv4->GetNInterfaces (); j++)
    {
      for (uint32_t i = 0; i < m_ipv4->GetNAddresses (j); i++)
        {
          Ipv4InterfaceAddress iaddr = m_ipv4->GetAddress (j, i);
          Ipv4Address addr = iaddr.GetLocal ();
          if (addr.IsEqual (ipHeader.GetDestination ()))
            {
              if (j == iif)
                {
                  NS_LOG_LOGIC ("For me (destination " << addr << " match)");
                }
              else
                {
                  NS_LOG_LOGIC ("For me (destination " << addr << " match) on another interface " << ipHeader.GetDestination ());
                }
              lcb (p, ipHeader, iif);
              return true;
            }
          if (ipHeader.GetDestination ().IsEqual (iaddr.GetBroadcast ()))
            {
              NS_LOG_LOGIC ("For me (interface broadcast address)");
              lcb (p, ipHeader, iif);
              return true;
            }
          NS_LOG_LOGIC ("Address "<< addr << " not a match");
        }
    }

Unfortunately this loop seems to be needed by something, though it is not needed for my router with 10k links on it. So I was able to tweak it by wrapping it in an if condition:
  if (m_ipv4->GetNInterfaces () < 100)
    {
      // ... the above code ...
    }





Tommaso Pecorella

Jan 24, 2016, 10:16:57 AM
to ns-3-users
Hi,

long but productive posts, indeed.
First and foremost, thanks for helping with the profiling. As I said before, many of these functions are a tradeoff between simplicity and speed, with a preference for simplicity. It's unusual to have a node with so many links.

Let's go in reverse order.
1) The Ipv4StaticRouting::RouteInput "optimization". That is more or less a bug, but one you can afford in your own code. That code is responsible for delivering packets to the local host; without it, the node will "simply" be a switch, and it will not even be able to reply to ICMP. Most probably it also won't be able to use protocols like RIP, which requires UDP.
Still, if you're using Static or Global routing, you can afford this "bug".

2) Ipv4StaticRouting::LookupStatic. The idea to check the mask length is good; I'll add this as well.

3) Unfortunately, the checks for broadcasts are needed, so they have to stay.
There was a discussion about some "glitches" in broadcast handling, and we should probably triple-check the logic, but it's not that easy. In the process, we could keep an eye on optimizing it as well. Nevertheless, subnet-directed broadcasts are a b**ch, as you can have multiple addresses on each NetDevice, so... bummer.

4) Ipv4L3Protocol::Receive. You're damn right; I added the code.

Let me know when you don't spot any more optimization points (so I can push the code), and please write me a private mail (I need it for the code attribution).

Cheers,

T.

Christian Kreuzberger

Jan 26, 2016, 7:38:29 AM
to ns-3-users
Thanks a lot, the simulations are running so much faster now.

I did not find any other parts that could be improved. I already sent you a private e-mail as well.

Best regards
Christian

Tommaso Pecorella

Jan 26, 2016, 9:30:17 AM
to ns-3-users
Hi Christian,

I've seen the mail, ty. All the improvements will be added to the codebase soon.

As a side note, and thinking about what you said regarding the RouteInput local delivery: one could add a flag to block any connection (L4 packets) except the ones on a particular port.
The problem is that this would be a proto-firewall, and I'd rather have a real netfilter than this.

Cheers,

T.

Yolanda Xu

May 9, 2017, 11:36:02 PM
to ns-3-users
Hi, I am working on routing table design for a large number of nodes, and now I also have a problem with it. My current design uses static routing inside each AS, and dynamic routing to connect the different ASs. Do you think that works? Could you share your design with me, if it is convenient?