NXFilter server sizing


Giorgio Catena

Apr 11, 2014, 2:14:51 AM
to nxfil...@googlegroups.com
Hi Jinhee,
I know you already wrote down some indicative minimum requirements to have NxFilter up and running, but given the impact NxFilter has on normal usage with a huge number of clients, it would be useful to have a better idea of suggested sizing versus user numbers.
What I mean is: starting from a simple config with a single node, you could go up in node count and single-node hardware size as the user number grows.
E.g.:
Single office under 50 users: single node, at least xxx CPUs/vCPUs, xx GB of disk to retain logs for at least xx days, xxx MB of RAM suggested, or 2 nodes, etc.

With the direct experience I gained during these weeks of testing, I'm thinking of dedicating 2 CentOS 6 x32 VMs to NxFilter (dedicated machines).
For my company's size (under 1,000 users) I plan to use 2 nodes for reliability, making the master primary and the slave secondary.
The machines will be sized as follows:
1 vCPU
1 GB RAM
40 GB disk
1 Gbit-class Ethernet card

Would it be a correct sizing?

Regards

Jinhee

Apr 11, 2014, 3:58:33 AM
to nxfil...@googlegroups.com
Sorry, actually you know better than me. :)
I think we'd better keep this thread at the top,
so that other people can share their experience with user numbers, hardware, and scalability.

But I guess your RAM would be enough, as NxFilter doesn't consume much memory.
The CPU is also enough; NxFilter doesn't require much CPU power.

About the disk size:
I assume our demo site has the amount of traffic of about 30 users, and it's been running for 20 days.
The size of the traffic DB is 88MB.
So it works out to roughly 10 users * 10 days = 14.7MB.
1,000 users * 10 days = 1.47GB.
1,000 users * 60 days = 8.8GB.

So that'd be enough by my calculation, though it's based on some assumptions.
If anybody can share their actual experience with NxFilter's scalability, that would be great for everybody.
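The arithmetic above can be scripted so anyone can re-run it with their own user count and retention period. This is only a sketch: the per-user-day constant is derived from the single demo-site data point quoted above (88 MB for ~30 users over 20 days).

```shell
# Estimate NxFilter traffic-DB size from the demo-site sample:
# 88 MB for ~30 users over 20 days => ~0.147 MB per user per day.
users=1000
days=60
awk -v u="$users" -v d="$days" 'BEGIN {
    per_user_day = 88 / (30 * 20)          # MB per user-day (assumed constant)
    printf "%.2f GB\n", u * d * per_user_day / 1000
}'
# For 1000 users over 60 days this prints about 8.80 GB
```

Real consumption will vary with browsing patterns, so treat the result as an order-of-magnitude guide, not a guarantee.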

Jinhee

James Gardner

Apr 11, 2014, 12:58:39 PM
to nxfil...@googlegroups.com
I am currently running 5 individual instances of NXF 2.0.3 and 1 cluster of three 2.0.3 nodes.

I am running into a problem where the name service stops responding and the java process spikes CPU utilization to 100%, fixable only by stopping the service and/or killing the java process. This occurs once or twice daily and seems to happen when load spikes dramatically. At peak, the cluster is serving 25k to 40k requests every ten minutes. This is consistent across installations of the latest CentOS 5 and CentOS 6 with JVM 1.7.51. The VMs are running on multi-core Xeon servers with 1+GB RAM each.

Not sure if this is a scalability issue; memory, and Java's use of it, doesn't appear to be maxed out. Or is this just a mismatch of JVM versions? This was occurring in the 1.7.x versions, but has become worse in 2.0.x.

Open to suggestions.  Love the product!  Saved us when OpenDNS showed us the door.

-James

Jinhee

Apr 11, 2014, 8:26:57 PM
to nxfil...@googlegroups.com
Hi James,

Your report is very interesting to me.

You might be under attack; it could be DDoS or from a botnet/malware.
Look at the difference between request-sum and request-cnt. If there's
a big difference, it means someone is sending queries against the same
domain many times.

NXF keeps logging data in memory and writes it to its traffic DB once
a minute. This saves disk space and reduces the number of disk
accesses. If it sees the same query again, it only increases the
counter for that domain, so if there are 1,000 queries for the same
domain within a minute, it inserts only one row with a count of 1,000.
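The collapsing described here can be mimicked with a quick shell pipeline. This is only an illustration of the reduction; NxFilter does it in memory, not with these tools, and the domain names are made up:

```shell
# Collapse repeated queries into one row per domain with a count,
# the same reduction NxFilter applies before writing its traffic DB.
printf 'example.com\nexample.com\nexample.com\napple.com\n' \
  | sort | uniq -c
# apple.com comes out with count 1, example.com with count 3
```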

That means you can have some difference between request-sum and
request-cnt, and if the difference is big, someone is making multiple
queries against a domain. That's why you get 35K queries at your peak
time. Or, worse, they may be sending queries against fake domains.

One of NXF's features is its local cache. Because it has its own
local cache, when you use Google DNS as your resolving DNS you don't
send too many queries to Google DNS; from Google's side it looks like
just one or several users. From my experience, I'd guess Google blocks
you if you send too many DNS queries to them.

But if you're under attack and these attackers are sending queries for
fake or forged domains generated by their program, then you will send
lots of queries to Google DNS and they will block you; they would
think you're the attacker.

Now you need to find out who the attacker is. Is your NXF exposed to
the Internet? In that case you'd better put it behind your firewall,
or use 'Config > allowed-ip' to block traffic from attackers. If it is
just serving your local queries, then try to find out which user is
sending that amount of queries, using reporting and logging.

However, there are 2 things to think about. One is when you use
NxClient for remote filtering: then you can't use an IP-based ACL, as
it uses a whitelist-only method. That's why I am thinking of adding a
blacklist-based ACL for DNS queries. The other thing is that with V2 I
stopped logging anonymous usernames with their client IP, so without
authentication you can't use 'Top 5 user' in reporting to find out who
the attacker is.

So I will add these 2 things.

  - Blacklist based ACL for your NXF DNS port.
  - Client IP based top 5 chart.

Can you tell me how many users you have as well?

Again, it's a very interesting report, and you've helped me a lot in
figuring out what's going on in the real world with NXF.

Lastly, if you can help other people find NXF to save them from their
frustration at being shown the OpenDNS door, that'd be even better for
all of us.

Jinhee

James Gardner

Apr 11, 2014, 11:48:41 PM
to nxfil...@googlegroups.com
Here are example daily stats from the internal-only cluster of 3:
request-sum = 1150919 , request-cnt = 785645 , block-sum = 229806 , block-cnt = 89576 , domain = 57185 , user = 1 , client-ip = 966
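As a quick check of the request-sum / request-cnt gap Jinhee described, a stats line like the one above can be parsed with awk. The field layout is assumed from this printout, not from any documented NxFilter format:

```shell
# A ratio above 1 means repeated queries for the same domains within a minute.
stats='request-sum = 1150919 , request-cnt = 785645 , block-sum = 229806 , block-cnt = 89576'
echo "$stats" | awk -F'[ ,]+' '{
    for (i = 1; i <= NF; i++) {
        if ($i == "request-sum") sum = $(i + 2)   # value sits two fields after the name
        if ($i == "request-cnt") cnt = $(i + 2)
    }
    printf "repeat ratio: %.2f\n", sum / cnt      # here: about 1.46
}'
```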

We are an all-Mac shop, ~1200 users, in a K-12 school environment. No authentication yet, but we're glad OpenLDAP support is back.
Tons of requests from nearly all clients to:

About 100k between the two on that day; CDNs for Apple, apparently. I'm wondering if putting in a local redirect would speed this up? Nearly every upstream DNS throttles us noticeably.

I do see forged/fake queries on the public, external-facing NXFs. I'm dropping most of that with iptables, using hit-count and pattern-match blocks.
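A hit-count rule along those lines can be sketched with the iptables `recent` match. These are illustrative values, not James's actual rules; note the stock `recent` module tracks at most 20 packets per entry, so `--hitcount` cannot exceed 20 without raising the module's `ip_pkt_list_tot` parameter:

```shell
# Drop any source IP that has sent more than 20 UDP DNS queries in the
# last 10 seconds (rule order matters: the check/DROP rule comes first,
# then the rule that records the packet in the "dnsflood" list).
iptables -A INPUT -p udp --dport 53 -m recent --name dnsflood \
         --update --seconds 10 --hitcount 20 -j DROP
iptables -A INPUT -p udp --dport 53 -m recent --name dnsflood --set
```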

Any other thoughts on the Java/CPU spike issue?  Or is it just load?

Thanks in advance!

-James

Jinhee

Apr 12, 2014, 12:46:23 AM
to nxfil...@googlegroups.com
Try whitelisting those domains with the bypass_auth, bypass_filter, and bypass_logging options.
Your CPU spike seems related to the 40K requests in 10 minutes.
Even if you had 4,000 users, 40K would not be a normal number.
We'd better try to reduce that number first,
or at least exclude these requests from NXF processing.

Jinhee

Jinhee

Apr 12, 2014, 12:53:42 AM
to nxfil...@googlegroups.com
And you can try to increase the TTL from NXF, or change it to '0' so that NXF doesn't need to touch the TTL at all.
I made NXF manipulate the TTL to set it to 60 sec.
This is because I didn't want your client PCs to keep their DNS cache too long:
when you change your policy, you want to see the result sooner.
But at 60 sec, even if your DNS cache is supposed to be alive for 24 hours, it expires in 60 sec due to NXF's manipulation.
And that increases the number of requests from your client side.
So the first thing you can do is actually set it to 0 or increase it.

Read this.

  http://nxfilter.org/faq.php#many_user

Jinhee

James Gardner

Apr 12, 2014, 9:32:16 AM
to nxfil...@googlegroups.com
I will whitelist those domains. The client TTL had already been adjusted to 1800.

If the whitelist doesn't work by itself, I'll adjust the TTL to 14400 or turn it off. We also channel all traffic through a transparent proxy on-site for immediate filtering needs.

With the public NXFs I'll see changes this weekend, but the big test for the cluster is Monday.

-James

Jinhee

Apr 12, 2014, 10:13:52 PM
to nxfil...@googlegroups.com
OK, that'd be the correct procedure.
But the TTL thing might not be useful, as those domains have short TTLs.

Is it happening on a regular time basis?
Does it happen at the same time, or randomly?

Jinhee

Jinhee

Apr 12, 2014, 10:33:40 PM
to nxfil...@googlegroups.com
Try the whitelist way first; then your last resort would be local_domain bypassing.

  http://nxfilter.org/faq.php#bypass_local

Don't worry.
If nothing works, I will come up with something else.

Jinhee

James Gardner

Apr 13, 2014, 1:24:15 AM
to nxfil...@googlegroups.com
The timing of the crashes is not regular. Prior to tonight, it seemed correlated with load.

Tonight, one of the public-facing NXFs crashed with virtually no load (<400 queries/10 mins). NXF logs showed no unusual or complicated queries. System logs were normal too. Do you recommend JVM 1.6.x or 1.7.x? All of mine run on 1.7.51.

Whitelisting is cleaning things up, but doesn't seem to be the solution.

-James

Jinhee

Apr 13, 2014, 2:28:12 AM
to nxfil...@googlegroups.com
You have to get an error log if it's a runtime error.
When it crashed, do you mean it stayed stopped until you restarted it?
Was it happening with the 1.7.x version of NXF as well?
I know of someone running NXF v1.7.x for almost 3 months without restarting,
and he has several thousand users.

Anyway, if it's a runtime error, I need to see an error message.
If it happens everywhere, maybe you can turn on debug messages on one of your NXF instances with a lighter load:
in /conf/log4j.properties, change INFO to DEBUG.
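The log-level switch can be done with sed. The install path and the INFO token follow the post above; the `.bak` suffix just keeps a rollback copy, and the exact property layout in log4j.properties is assumed to follow standard log4j 1.x conventions:

```shell
cd /nxfilter                                         # install dir assumed; adjust to yours
sed -i.bak 's/INFO/DEBUG/g' conf/log4j.properties    # keeps conf/log4j.properties.bak
grep -n DEBUG conf/log4j.properties                  # confirm the new level took
```

Restart NxFilter afterwards, and remember to restore the `.bak` copy once you have captured the error, since DEBUG logging is verbose.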

If I had to choose between JVM 1.6 and 1.7,
I would say 1.6 is the more stable.
But I don't think that's the reason.

And what does this mean exactly?


  Whitelisting is cleaning things up, but doesn't seem to be the solution.

You didn't get the log but you still have the problem?

Jinhee

Jinhee

Apr 13, 2014, 2:30:29 AM
to nxfil...@googlegroups.com
Do you see anything in common before they crash,
like the same kind of log entries prior to the crashes?

Jinhee

Jinhee

Apr 13, 2014, 2:39:53 AM
to nxfil...@googlegroups.com
How much memory do they use?
As I remember, there's a difference in memory usage depending on the JVM version.
And currently NXF is limited to 512MB.

In /bin/startup.sh:

      nohup java -Djava.net.preferIPv4Stack=true -Xmx512m -Djava.security.egd=file:/dev/./urandom -cp $NX_HOME/nxd.jar:$NX_HOME//lib/*: nxd.Main > /dev/null 2>&1 &

You see the 512m?
You might need to increase it if you see NXF using more than about 400MB.
In my testing environment it's always less than 200MB, but I'm not sure about the real world.
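One way to check the actual footprint is to read the java process's resident set size. `nxd.Main` is the class name from startup.sh above; the `ps` flags assume a Linux procps, and RSS is reported in kilobytes:

```shell
# Show resident memory of the NxFilter java process in MB, for
# comparison against the -Xmx cap in startup.sh.
ps -o rss=,args= -C java \
  | grep nxd.Main \
  | awk '{printf "java RSS: %.0f MB\n", $1 / 1024}'
```

If the figure creeps toward the `-Xmx512m` cap, raising `-Xmx` in bin/startup.sh is the fix Jinhee describes.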

Jinhee

Jinhee

Apr 13, 2014, 3:16:58 AM
to nxfil...@googlegroups.com
And one way of finding out exactly what's happening would be to run it in foreground mode:
start it without the '-d' option on your console.
If it crashes with a runtime error, you'd know what happened.

Or you can edit your startup.sh:


  nohup java -Djava.net.preferIPv4Stack=true -Xmx512m -Djava.security.egd=file:/dev/./urandom -cp $NX_HOME/nxd.jar:$NX_HOME//lib/*: nxd.Main > /dev/null 2>&1 &

Instead of '> /dev/null 2>&1 &' you can redirect to a filename.
Like,

  nohup java -Djava.net.preferIPv4Stack=true -Xmx512m -Djava.security.egd=file:/dev/./urandom -cp $NX_HOME/nxd.jar:$NX_HOME//lib/*: nxd.Main > /home/jinhee/monitor.txt 2>&1 &

Jinhee

James Gardner

Apr 14, 2014, 11:01:16 AM
to nxfil...@googlegroups.com
The java/CPU crashes were happening under NXF 1.7.x too, but much less frequently: under 1.7.x once a week, under 2.0.x several times daily.

Whitelisting (no filter/no log) the Akamai CDNs that Apple uses, along with several other high-query sites, has cleaned up the dashboard and restored the ability to see which sites are being, or need to be, filtered. This has no doubt also taken load off the server, as it is now filtering only 50% (or less) of the queries it was seeing before.

At your suggestion, I have also rolled the JVM back from 1.7.51 to 1.6.30 on all sites and activated console logging (instead of the pipe to /dev/null) on two of the lower-volume instances. I also increased the JVM max memory from 512 to 768. Since making these changes, the web :9443 interface seems to respond much more quickly and, more importantly, there have been zero crashes, and we are well into the school/business day on Monday.

I will patch to 2.0.4 Monday night.

Thank you for your continued quick assistance.

-James

Carl Miller

Apr 14, 2014, 12:52:46 PM
to nxfil...@googlegroups.com
I am using a BeagleBone Black as my DNS server. I use a 32GB SD card and boot off of that, so I have sufficient drive space. We currently have around 50 devices and 25 users. I expect those numbers to double in the next year, so I hope it is still sufficient. I have been using it for about 4 months with no problems. I have a very basic NXF setup: I filter by IP, not login, and we do not use AD.

What is BeagleBone Black?

BeagleBone Black is a $45 MSRP community-supported development platform for developers and hobbyists. Boot Linux in under 10 seconds and get started on development in less than 5 minutes with just a single USB cable.

Processor: AM335x 1GHz ARM® Cortex-A8

  • 512MB DDR3 RAM
  • 2GB 8-bit eMMC on-board flash storage
  • 3D graphics accelerator
  • NEON floating-point accelerator
  • 2x PRU 32-bit microcontrollers

Connectivity

  • USB client for power & communications
  • USB host
  • Ethernet
  • HDMI
  • 2x 46 pin headers

Software Compatibility

  • Ångström Linux
  • Android
  • Ubuntu
  • Cloud9 IDE on Node.js w/ BoneScript library
  • plus much more

Jinhee

Apr 14, 2014, 7:25:42 PM
to nxfil...@googlegroups.com
@James:

OK, we can keep watching it for a while,
but I think it's most likely from memory consumption.
V2 has more features, and it may require more memory.

In particular, logging/reporting requires a lot of memory, and V2 added new logging/reporting features.
This is one of the reasons why I don't like to have too many features in NXF.
But these days memory is one of the cheapest resources, so if it's a memory problem, that's still OK.

Let me know if it works after several days.

Jinhee

Jinhee

Apr 14, 2014, 7:26:25 PM
to nxfil...@googlegroups.com
@Carl:

Thanks for the report.

Jinhee

Dm Ms

Apr 15, 2014, 3:05:59 PM
to nxfil...@googlegroups.com
FYI

I run NxFilter 1.7.6 (still haven't made the move to 2.0.3) on 1 vCPU with 1.5 GB RAM assigned to it, serving approx. 120 users. That server also runs a small SQL server instance, central AV software, and a UniFi AP central controller: ~70% memory utilization and 30% CPU. I think if I were to run it by itself on a Linux guest, 512MB would be more than enough for my environment.

Jinhee

Apr 15, 2014, 9:55:07 PM
to nxfil...@googlegroups.com
@Dm Ms

Yeah 512MB would be enough if there are just 120 users.
Thanks for the info.

Jinhee

Viktor Sosic

Apr 23, 2014, 11:39:53 AM
to nxfil...@googlegroups.com
Hi James,

I have noticed that Apple devices on my network seem to flood my router, and my router thinks they are trying to SYN flood an external source.

Have you found a reason for this, or a way to stop Apple devices doing this?

James Gardner

Apr 23, 2014, 1:52:14 PM
to nxfil...@googlegroups.com
See this article regarding the Bonjour service advertisements; this change could be pushed with a management tool.

Also, Apple devices love to phone home to a variety of Akamai sites. Some of this is SoftwareUpdate, some proactive AV and malware filtering, and some, who knows. I had to bypass_filter and bypass_logging these sites, as they would crowd my top 5 in NxFilter every day.

Update on server scaling:
It does appear to be a memory issue. JVM 1.7 uses more than JVM 1.6, so that compounded the problem. Upgrading my busiest cluster to 1024M for Java appears to be enough. These are CentOS VMs with 1.5G RAM. This cluster of 3 serves between 500k and 1200k requests daily. Memory is cheap. Ten days of zero crashes across the cluster and 5 separate instances, all now properly scaled.

Jinhee

Apr 23, 2014, 7:40:17 PM
to nxfil...@googlegroups.com
Hi James,

Thanks for the report.
It's good to hear that your NxFilter works without crash.

However, it's a bit weird.
There are bigger sites with more users,
and I am not sure if they modified their startup script to increase the memory size,
because you are actually the first user to bring up the memory issue.

I remember writing a Java program that ran on several platforms.
That was almost 7 years ago, and the Java version was 1.4.
As I recall, Java performed better on Windows and Solaris than on Linux.
I don't know if that's still true.

Like you said, memory is the cheapest resource these days,
and I don't think it's too weird to use 1G of memory if you have more than 1,000 users,
especially since NxFilter runs as a single process.
But I think there might be a way of reducing the memory consumption.

I will test it myself on CentOS when I have some free time.

Jinhee

mark page

Apr 23, 2014, 8:14:42 PM
to nxfil...@googlegroups.com
James,

I'm running NxFilter on 2 Ubuntu 12.04 VMs under ESXi, each with 1 socket / 2 cores and 2GB RAM. We're getting just over 1.5M requests per day, per VM, without issue. One of the NxFilter boxes authenticates Windoze users on a wired 10.x/16 network (lower grades K-3 are filtered through an "allow only" whitelist), while the other sits behind a NAT bridge from our wireless network, made up of 8 172.x/16 VLANs. Typically, I'll see about 600 wireless clients and 2000+ wired clients at any time of day. We've got over 6000 users total =)

Just a thought: are you having hard crashes with Java, or does the machine become unresponsive? If it's the latter, tail kern.log and see if there are any "neighbour table overflow" errors. I've had to tweak sysctl and bump up the ARP neighbour thresholds on both my Ubuntu servers.


Mark

Jinhee

Apr 23, 2014, 8:42:07 PM
to nxfil...@googlegroups.com
Thanks Mark for the reporting.

So you didn't modify startup.sh?

Jinhee

mark page

Apr 23, 2014, 9:29:26 PM
to nxfil...@googlegroups.com
No, I'm running the stock script from under init.d.

Mark

Jinhee

Apr 23, 2014, 10:09:07 PM
to nxfil...@googlegroups.com
So there might be some difference between Ubuntu and CentOS,
or in the JVM implementation.

James Gardner

Apr 24, 2014, 12:07:10 AM
to nxfil...@googlegroups.com
Mark,

Thanks for the tuning advice. Java doesn't crash per se; CPU utilization goes to 100% and the app/NXF becomes unresponsive.

My VMs are all running under OpenVZ, with four of the previously problematic ones on one host. Sysctl under OpenVZ only allows those ARP variables to be tuned at the host level, and sure enough, the low baseline parameters were in use.

These adjustments were recommended:
-------------
# Force gc to clean-up quickly
net.ipv4.neigh.default.gc_interval = 3600

# Set ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600

# Setup DNS threshold for arp
net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh1 = 1024
---------------
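After loading the settings (e.g. with `sysctl -p`), the running values can be read back from procfs to confirm they took effect. The path is standard Linux; as noted above, under OpenVZ these must be read on the host, not in a container:

```shell
# Print the current ARP garbage-collection thresholds for comparison
# with the recommended values above.
base=/proc/sys/net/ipv4/neigh/default
for key in gc_thresh1 gc_thresh2 gc_thresh3; do
    printf '%s = %s\n' "$key" "$(cat "$base/$key")"
done
```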

I will keep the group updated with any changes.

mark page

Apr 24, 2014, 5:36:23 AM
to nxfil...@googlegroups.com

James Gardner

Apr 24, 2014, 9:32:47 AM
to nxfil...@googlegroups.com
I had not. That very error is what I was seeing with all the CentOS 5 NXF instances under OpenVZ. It doesn't appear with CentOS 6 or Debian 7.

Thanks!

Jinhee

Apr 24, 2014, 10:28:42 AM
to nxfil...@googlegroups.com
So 512MB is enough for NXF to handle several thousand users?
If so, Java is not so bad, actually.

Jinhee