8.8.4.4 Down

Annette Fazzari

Jul 26, 2024, 3:31:50 AM

to Menpo Users

I'm running two ISPs on a FortiGate. I haven't updated its firmware yet. The WAN1 connection goes through a fiber media converter, then a router, and then into the FortiGate. Here is the error. Please let me know how to avoid this problem.

As per the logs, the alert mail concerns the performance SLA, which you must have configured in SD-WAN.
It does not say that the interface itself is down. It says the link from wan1 is down, which is why the default route via wan1 is being removed from the routing table.
From the logs I could see that you have configured source IP

The Fortinet Security Fabric brings together the concepts of convergence and consolidation to provide comprehensive cybersecurity protection for all users, devices, and applications and across all network edges.

It looks as if the outage was limited to a specific region and was resolved very quickly. We noticed a 4.7% drop in traffic (pageviews being tracked) at 11.29. Over two years ago, Google DNS was serving over 70 billion requests per day, so just one minute of downtime would mean 49 million failed requests.
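The per-minute figure follows directly from the daily one. A quick check in Python, using only the 70 billion requests per day quoted above (the function name is just for illustration):

```python
# Daily request volume quoted in the post above.
REQUESTS_PER_DAY = 70_000_000_000
MINUTES_PER_DAY = 24 * 60


def requests_per_minute(per_day: int = REQUESTS_PER_DAY) -> float:
    """Average requests per minute, assuming traffic is spread evenly over the day."""
    return per_day / MINUTES_PER_DAY


print(f"{requests_per_minute() / 1e6:.1f} million requests per minute")
# prints "48.6 million requests per minute"
```

That rounds to the roughly 49 million failed requests mentioned, under the simplifying assumption that load is uniform across the day.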

When I see in the status page that an interface is down or disabled, the policy will stay on the failed connection for several minutes before finally switching to the backup interface. I do want it to stay on backup until the primary/higher priority connection is online, but I also want to fail to a backup connection quickly.

Several minutes ... I have a functionally similar setup that I have been testing for a while, with 2 interfaces (wan, wwan). In the normal state, both interfaces are up. wan (eth1) is the default, wwan (qmi) the backup. Switchover happens in about a minute or so, although I do not really know when exactly it happens, because of the various states and transitions logged by mwan3, or what switchover exactly means for mwan3. In my understanding, it should be the instant at which another default gateway becomes effective. I tried to enable 'debug' logging for mwan3, but found that it does not actually work. The code may need to be patched. A full log of a switchover might be helpful here, to compare the activities and the timing.

The Starlink modem rule just forces all traffic for the Starlink status page out the Starlink uplink, regardless of which default route policy is active. This is so I can see if the modem is having a problem when it has failed over to cellular.

I think the issue is that mwan3 doesn't ping all test IPs in one ping interval; it pings each one X seconds apart. So if there are 4 IPs and the failover is set to fail after 5 tests at a 10 second interval, with a reliability of 2, it won't fail until 400 seconds have passed and at least 3 IPs don't respond in each 80 second test of 4.

I improved this by setting the fail count to 1, with 1 second in between tests, which means it will take at most 8 seconds to fail. Then I set the recovery score to 10, so it won't fail back until 36 seconds of successful tests.
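The 400 second and 8 second figures can be reproduced with a small model. This is only a sketch of the arithmetic in these two posts, not of mwan3's actual code: it assumes the IPs are probed one after another and that each failed probe costs the interval plus an equal timeout, which is what makes the numbers come out. The function and parameter names are made up for illustration, and the 36 second recovery figure is not modelled.

```python
def worst_case_failover_seconds(track_ips, interval_s, fail_rounds, timeout_s=None):
    """Worst-case seconds before a dead link is declared down.

    Assumptions (not taken from mwan3 itself): probes run sequentially,
    one IP at a time, and every failed probe costs interval_s + timeout_s.
    """
    if timeout_s is None:
        timeout_s = interval_s  # assumed equal, to match the posted numbers
    seconds_per_round = track_ips * (interval_s + timeout_s)
    return fail_rounds * seconds_per_round


# 4 IPs, 10 s interval, down after 5 failed rounds
print(worst_case_failover_seconds(4, 10, 5))  # prints 400
# 4 IPs, 1 s interval, down after 1 failed round
print(worst_case_failover_seconds(4, 1, 1))   # prints 8
```

Under this model the detection time scales with the number of tracked IPs, which is why trimming the IP list or the interval helps more than it first appears.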

Mwan3 is pretty damn good for free software, but it definitely isn't built for fast failover detection. I'm still happy with the results after understanding how it really works.
I would prefer if the tests were run in parallel. It makes more sense to test 4 IPs at the same time than 4 IPs X seconds apart to know if the link is failing at a particular moment.

A month ago the service provider told me their ONT was rebooting at the times I provided. They replaced their equipment and the problem disappeared for a couple of weeks, but then came back. They now claim my firewall is the problem, so I am trying to figure out a way to prove they are still having issues. My XTM330 shows online, and my logs show the external interface receives no connections during the time I see the outage. Here is the question: how can I prove the problem is or isn't related to the ONT or the XTM330? I currently have five offices which all need Fireboxes, but this customer is only willing if I can prove this is not related to the XTM330.

The only way that I can think of is to have a switch between your ISP device and your firewall external interface, have a switch port mirroring what is going to/from the ISP interface and have a laptop recording the traffic.
That way you are recording the traffic outside the firewall.
So if incoming packets stop from the ISP, then it is their cause.

Thanks Bruce. Yeah, I started to put that in place over the weekend, but the computer wasn't getting an IP address, and it didn't occur to me that the computer could still sniff packets without an IP address assigned.

-Open WSM and log into your firewall using the status user by going file -> connect to device.
-Once you're logged into your firewall, right click on it and select Firebox System Manager.
-Once FSM opens, go to Tools -> Diagnostic Tasks.
-In the Network tab, in the Task drop down menu, choose TCP DUMP.
-Tick the advanced options checkbox at the bottom of the window.
-In arguments type in "-nei eth0" without the quotes.
-Tick the "stream data to file" checkbox, and choose a place to save the file.
-Click RUN TASK when the issue starts.
-Click STOP TASK when you have enough data (I'd suggest the whole two minutes if you have time/space to do this.) This may take a moment to stop, only click it once.

You can then use Wireshark (wireshark.org) to look at what was being sent/received on the external interface. Using a filter like "ip.addr==4.2.2.1" in wireshark would filter for the IP 4.2.2.1. Use the IP you were pinging to.

If you can see that traffic leaving the firewall, it's very likely not the firewall causing the issue. You could also potentially see something like the firewall ARPing for the gateway -- if the gateway is not responding, that also points at the ISP device.

As a side note, I would suggest pushing back on the ISP to provide details (as in how specifically they're determining that this is a Firewall/Router issue.) Any information they might have can help you troubleshoot this, and if they don't have any I'd suggest you get it escalated on their side until you can find out that information.

If you notice, during the interruption above there are no external packets "denied" at the external interface, which looks to me like they aren't coming through the ONT. I don't think we ever see 1 minute go by without some denied packet.

Pardon me if I missed you already doing this, but I suggest running more scripts at the same time you run a PowerShell loop recording results and timestamp while pinging 8.8.8.8 and 8.8.4.4. Run one to ping the LAN IP of the 330, one to ping the WAN IP, one to ping the ISP's gateway, and one to ping the ISP's DNS servers (which should be closer to your 330 than Google's DNS).

Ideally, you could run the pings to their DNS and gateway, plus your WAN, from an external computer, also. Set the 330 to trust your home computer's IP and run the scripts from there as well as from behind the 330. Make sure that the computers running the scripts have the exact same time.
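The posts above describe a PowerShell loop. The same idea as a minimal Python sketch, in case that is easier to run on both the inside and outside machines: it shells out to the system `ping` command and records a timestamp, a label, the target and the result. The function names and the tab-separated log format are my own choices, not anything from WatchGuard.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping; True if ping exited with 0.

    Caveat: Windows ping can exit 0 on a "Destination host unreachable"
    reply, so treat a lone "ok" there with some suspicion.
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        # Linux iputils syntax; macOS interprets -W in milliseconds.
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def run_probes(targets, rounds, interval_s=1.0, probe=ping_once,
               sleep=time.sleep, clock=datetime.now):
    """Probe every target once per round; return tab-separated log lines."""
    lines = []
    for _ in range(rounds):
        for label, host in targets.items():
            ok = probe(host)
            stamp = clock().isoformat(timespec="seconds")
            lines.append(f"{stamp}\t{label}\t{host}\t{'ok' if ok else 'FAIL'}")
        sleep(interval_s)
    return lines
```

To use it, call something like `run_probes({"google-a": "8.8.8.8", "google-b": "8.8.4.4", "firebox-lan": "192.0.2.1"}, rounds=60)` and write the returned lines to a file. The 192.0.2.1 address is a placeholder; add your own entries for the 330's LAN IP, its WAN IP, the ISP gateway and the ISP's DNS servers. Which label fails first tells you how far out the traffic gets.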

@funkywinkerbean
If it's random and you can't be on site, you'll want to use a port mirror and wireshark to catch it. Trying to run a packet capture via the WSM tool is going to chew up a lot of resources on the firewall for an extended period, and also possibly fill whatever storage device you're using.

Update: On a computer on the secure (trusted) network of the XTM330, I'm running a PowerShell loop that logs the results of a ping to an external site, a timestamp, and an nmap -sP scan of the local network. Lately the ping results fail at 10:51am and are back up by 10:53am. I notice that when the nmap scan and pings run, nmap usually logs 15 devices. When the ping fails, the nmap scan displays 2 devices (itself and one other computer) until the network comes back. I now believe this outage is on my network, not the ISP's. I'm running Wireshark and logging, and during the outage I'm seeing packets originating only from the machine running the network ping loops. I'm not sure if this is an XTM330 problem or what is happening at this point... continuing to dig.
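To pin down windows like the 10:51 to 10:53 one from a log of timestamped ping results, something along these lines would do. It is a sketch that assumes the log has already been parsed into (timestamp, ok) pairs in time order; the function name is made up.

```python
from datetime import datetime


def outage_windows(samples):
    """Return (first_failure, first_success_after) pairs from ordered samples.

    samples: iterable of (datetime, bool) where the bool is True when the
    ping succeeded. An outage still open at the end of the log is
    reported with None as its end.
    """
    windows = []
    start = None
    for stamp, ok in samples:
        if not ok and start is None:
            start = stamp                    # outage begins
        elif ok and start is not None:
            windows.append((start, stamp))   # outage ends
            start = None
    if start is not None:
        windows.append((start, None))
    return windows
```

Running this over the logs from each ping target, and over the nmap device counts treated the same way, makes it easy to see whether the windows line up day after day at the same time.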

Local devices on the same subnet should not use the firewall to respond to packets.
So the nmap scan going from 15 to 2 devices suggests to me that the switch to which the trusted network is connected is involved.
Is this a managed switch, with port status & logs etc.?

@funkywinkerbean
If you have any free ports on the firewall, configuring another interface as trusted and plugging directly into it (to bypass the switches) can help determine if that blip is the firewall itself.

Bruce, I have Wireshark running on the trusted side (thinking this is an internal issue). Are you referring to, as you mentioned previously, the anonymous capture connected to a switch between the ONT and the Firebox?
Appreciate Ya'll,
David

Question, gentlemen: if, as James recommended above, I configure another port as trusted, connect only one computer to it, and run a few loop tests, then if it doesn't have an issue at the same time, that would confirm the problem is on the internal network. If it has the same errors, the problem would be the XTM or the external ISP. Agreed?
James, please let me know if I missed your point.
Thank You,
David

My recent suggestion on the use of Wireshark was for a trusted PC, to see if something is saturating your trusted switch, or if during or just after the timeout you see a bunch of ARP packets from the switch or the firewall.

OK, I am here to update y'all. The final answer turned out to be that a switch was plugged into an APC Smart-UPS 1400 which was running a daily self test, and this was causing the switch to restart. I can't explain why some days we didn't see the outage... Anyway, I really appreciate your time and beg your pardon for reading more into this than was necessary. In the end I pulled the UPS and it worked fine. I don't know how the UPS went into this daily test mode... and can't say I really like the concept.

Even with a daily test, it should not drop the switch if the battery is good. HOWEVER, testing a battery that frequently can shorten its life. A normal schedule may be once a month. I would replace the battery if the UPS is less than three years old, or replace the whole UPS if more than three years old. A LOT of techs I know have changed from APC to Eaton for their UPS purchases.
