Nomulus - is there a guide for GCP resource sizing and cost projection


Ian Sison

unread,
Mar 1, 2017, 6:10:18 AM3/1/17
to nomulus-discuss
Hi - sorry I need to ask this, as I haven't been able to find any document that gives a definite answer.  Is there any particular guide to sizing (and, in turn, approximating the monthly cost of) the GCP resources needed for Nomulus?  This is an important first step for us in considering adopting Nomulus as our registry.

Jovenet Consulting

unread,
Mar 1, 2017, 8:30:44 AM3/1/17
to nomulus-discuss
Hello Ian, here are a few answers that I have already received: http://www.gtld.club/2017/02/brand-new-gtlds-answers-from-nomulus.html (hope this helps a little). If you receive answers to your question, please share them with me.

Ben McIlwain

unread,
Mar 2, 2017, 5:19:33 PM3/2/17
to nomulus-discuss
On Wednesday, March 1, 2017 at 6:10:18 AM UTC-5, Ian Sison wrote:
Hi - sorry I need to ask this, as I haven't been able to find any document that gives a definite answer.  Is there any particular guide to sizing (and, in turn, approximating the monthly cost of) the GCP resources needed for Nomulus?  This is an important first step for us in considering adopting Nomulus as our registry.

If you look at the appengine-web.xml files that we ship with Nomulus, you'll see that we use B4_1G instances, which cost $0.30 per instance hour.  You'll also see that the default files we ship with use 50 (!!) instances per service, and there are three services (frontend aka default, backend, and tools), for 150 total, though honestly we should tune that way down for Nomulus.  We only have them turned up that high because they're free for us, though of course they wouldn't be free for anyone else.

You can get away with a substantially smaller number of instances for all services than what we ship for default.  As little as 1 or 2 would likely be sufficient for tools, for instance.  Everything on backend is asynchronous, so you could also tune that way down until it's just barely able to keep up with its tasks.  Frontend is the most performance-sensitive, but still completely depends on the total volume of traffic you get across all your TLDs.
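
To make that concrete, the knobs in question live in each service's appengine-web.xml.  A stripped-down sketch of the relevant bits looks something like this (the count is a placeholder, not a recommendation, and depending on the scaling mode in the file you're editing you'll see either <basic-scaling><max-instances> or <manual-scaling><instances>):

  <appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
    <application>your-project-id</application>
    <module>backend</module>  <!-- newer SDKs call this <service> -->
    <instance-class>B4_1G</instance-class>
    <basic-scaling>
      <!-- Shipped default is 50; tune this down to what your load actually needs. -->
      <max-instances>5</max-instances>
    </basic-scaling>
  </appengine-web-app>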

I will say that our costs are dominated by instance hours.  Yes, BigQuery, Datastore, Cloud Storage, network traffic, et al also have some costs, but they're insignificant compared to instance hour costs, as fundamentally the kind of data we deal with in the domain name registry business just isn't that big.  You can do a good first order approximation of cost by only considering instance hours.

Running ten B4_1G instances, at full load, works out to about USD 200 per month.  That might be enough to run all three services for a smaller registry.  For someone like Donuts, with ~200 TLDs and millions of domains (??, just guessing) under management, it'd obviously be larger.  These are all guesstimates however, and I'd encourage you to spin up Nomulus for yourself, play around with some representative data, and see how many instances you actually need for your anticipated volume.

Ian Sison

unread,
Mar 2, 2017, 9:02:36 PM3/2/17
to nomulus-discuss
Thanks Ben.  This is very helpful.

Hans Ridder

unread,
Mar 8, 2017, 11:55:39 AM3/8/17
to Ben McIlwain, nomulus-discuss
On Thu, Mar 2, 2017 at 2:19 PM, 'Ben McIlwain' via nomulus-discuss <nomulus...@googlegroups.com> wrote:
If you look at the appengine-web.xml files that we ship with Nomulus, you'll see that we use B4_1G instances, which cost $0.30 per instance hour.

Running ten B4_1G instances, at full load, works out to about USD 200 per month.  

Either I'm doing something wrong, or I think you may have slipped a decimal point:

24 * 30 * $0.30 = $216 per instance per month. So 10 instances would be $2160.

-hans

Ben McIlwain

unread,
Mar 8, 2017, 1:26:23 PM3/8/17
to nomulus-discuss, mcil...@google.com
Ouch, good catch.  Thank you.

I've been thinking more about this and I think switching to auto-scaling likely makes a lot of sense, at least for backend and tools, which don't have strict latency requirements because they aren't in any serving paths that affect end users.  So for the vast, vast majority of the time, you don't need any running tools instances, and you only need backend instances occasionally, when background batch mapreduces kick off.

The $216 per month per instance is for an instance that is running flat-out for the entire month.  In practice, that is very unlikely to happen unless you are using basic scaling.  I'm going to start experimenting with auto-scaling on our non-production instances (and possibly roll it out to production too if all goes well) so I can offer a more informed opinion on costs and how to control them.
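
For the record, the sort of change I have in mind is swapping a service's appengine-web.xml from B* instances with manual/basic scaling over to something like the following (an untested sketch; note that automatic scaling uses the F* instance classes rather than the B* ones):

  <instance-class>F4_1G</instance-class>
  <automatic-scaling>
    <min-idle-instances>0</min-idle-instances>  <!-- don't keep any resident instances warm -->
    <max-idle-instances>1</max-idle-instances>
    <max-pending-latency>automatic</max-pending-latency>
  </automatic-scaling>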

Hans Ridder

unread,
Mar 8, 2017, 6:40:27 PM3/8/17
to Ben McIlwain, nomulus-discuss
On Wed, Mar 8, 2017 at 10:26 AM, 'Ben McIlwain' via nomulus-discuss <nomulus...@googlegroups.com> wrote:
Ouch, good catch.  Thank you.

;-)
 
I've been thinking more about this and I think switching to auto-scaling likely makes a lot of sense, at least for backend and tools, which don't have strict latency requirements because they aren't in any serving paths that affect end users.  So for the vast, vast majority of the time, you don't need any running tools instances, and you only need backend instances occasionally, when background batch mapreduces kick off.

The $216 per month per instance is for an instance that is running flat-out for the entire month.

Agreed.

On a related note, according to the GC Console, our biggest cost is "Backend Instance Hours" (more below). Now, I thought AE "backends" were dead, and at least I can't find them in the pricing, but regardless the console shows the price as $0.05/hour. Our appengine-web.xml has the instance-class as B4_1G for frontend and tools, and B4 for backend (I think this is basically random). On the pricing page, B4_1G is $0.30 and B4 is $0.20.

I don't know where the $0.05 comes from, but you/they seem to be calculating our billing using it. With that in mind, 24 * 30 * .05 = $36 per backend instance per month. But I agree, most of the time there's no need for a large backend fleet.
 
In practice, that is very unlikely to happen unless you are using basic scaling.  

I'll take your word for it, except that the documentation suggests that with basic scaling, idle instances should be shut down. We don't actually see that happening: we're seeing "max-instances" of the backend running all the time, even with a small request rate. But I'd think that some of the instances should become idle and shut down, even with basic scaling. This is something I've looked into on and off, and even briefly covered with Brian (in the context of RDE upload/report "just in case" cron requests that will never complete). What I haven't looked closely enough at is how many of the minute-ish cron tasks are mapreduces that end up keeping the whole fleet warmed up.
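
(For anyone following along, the knob I'd expect to matter here is the idle-timeout under basic scaling in appengine-web.xml, something like:

  <basic-scaling>
    <max-instances>10</max-instances>
    <idle-timeout>10m</idle-timeout>  <!-- instances idle longer than this should be stopped -->
  </basic-scaling>

...but as noted, in practice we still see max-instances running continuously.)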
 
I'm going to start experimenting with auto-scaling on our non-production instances (and possibly roll it out to production too if all goes well) so I can offer a more informed opinion on costs and how to control them.

Sounds good, looking forward to what you find out and to trying it here. I'm willing to try some stuff out too, though I can't guarantee getting back to you quickly. :-(

-h
 



Brian Mountford

unread,
Mar 9, 2017, 2:56:00 PM3/9/17
to Hans Ridder, Ben McIlwain, nomulus-discuss
Hans,

When I was looking at this last month, I came across documentation indicating that all types of instances appear to be charged at 5 cents an hour, but with the hours bumped up. I realized that this must be the case, because the hourly totals I was seeing were way higher than they could possibly be. That was because our 30-cent-an-hour instance was actually being billed as 6 hours of a 5-cent-an-hour instance. The hours were six times what they really were, but at one-sixth the rate.

When you look at your statistics, does that seem possible?

Brian

Ben McIlwain

unread,
Mar 9, 2017, 5:43:17 PM3/9/17
to nomulus-discuss, ha...@donuts.email, mcil...@google.com
I've been thinking more about how to save money on instance hours.  One idea would be to carefully examine the cron configuration and remove, or reduce the running frequency of, the tasks that run frequently (i.e. every hour or more often).  Some good candidates for this right off the top of my head that could likely be reduced or removed: syncGroupMembers, syncRegistrarsSheet, and sessioncleanup.  Tasks that you wouldn't want to remove, but could run less frequently: deleteContactsAndHosts, refreshDnsOnHostRename, and possibly commitLogCheckpoint.

Here's a deeper look at each of these tasks and my thoughts on them, in descending order of frequency (a cron.xml sketch showing what a schedule change looks like follows the list):
  • readDnsQueue (every 1 minute) -- This is what syncs DNS changes to the DNS provider.  You likely would not want to de-tune this too much on a production environment, probably no more than 5 minutes maximum, but on development environments, especially ones not syncing to DNS, you could even turn it off entirely.
  • commitLogCheckpoint (every 1 minute) -- You can probably detune this to a decent degree even on production, and by a lot on dev environments.  Basically, it's a question of how much data loss you're willing to endure if Datastore gets entirely nuked, which for non-production environments should be a lot, and even for less busy production environments, there likely aren't mutations every minute unless you are big or run a lot of TLDs (as Donuts does).
  • deleteContactsAndHosts (every 5 minutes) -- This can be detuned as well.  It will increase the latency of asynchronous contact and host deletions, but that probably doesn't affect much.  Also, this is a batched operation, where handling N operations is about as costly as handling just 1, so if there is a decent volume of work (to the point where it's usually doing something every time it runs), running it at half the frequency will still get all the work done (just with higher latency) while costing only half as much.
  • refreshDnsOnHostRename (every 5 minutes) -- This is the same kind of situation as deleteContactsAndHosts, and can likely be detuned down to the same value.
  • sessioncleanup (every 15 minutes) -- This just cleans up some expired _ah_SESSION entities, which are small and don't take up much space, and which can be removed more efficiently by simply nuking them all using Datastore Admin (fine for dev environments, but it has the unfortunate side effect of logging out all registrars in production, so don't do it there).
  • syncGroupMembers (every 60 minutes) -- You can likely turn this off in dev environments, and even prod environments if not used.
  • syncRegistrarsSheet (every 60 minutes) -- Same as syncGroupMembers; odds of not being used are good.
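
To make the detuning part concrete, each of the above is just a <cron> entry in cron.xml, and reducing the frequency is a one-line schedule change.  For example (the URL below is a placeholder; keep whatever path your cron.xml already has for the task):

  <cron>
    <url>/_dr/cron/commitLogCheckpoint</url>  <!-- placeholder path -->
    <description>Checkpoints the commit logs.</description>
    <schedule>every 5 minutes</schedule>  <!-- was: every 1 minutes -->
    <target>backend</target>
  </cron>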

In addition to all of that, as previously discussed, reducing the number of instances will help a lot too -- for a dev environment you likely only need 1 or 2 for frontend, 1 for tools, and a handful for backend.

And then one last bit that hasn't come up yet, but that would help control costs too: detuning the number of commit log and EPP resource index buckets.  The default values for those are, respectively, 100 and 997, as seen in the default config file.  You'd override them by following the procedure laid out in the configuration doc (i.e. do not just overwrite the values in the default file!).  Note that these values cannot be changed once there is data in the system -- well, they can be, but things will be screwy until you delete all of the existing data.  This is perfectly fine for dev environments, not so nice for prod environments, so you'd want to make sure to choose suitable values during the initial install.

The reason these options are important is that they govern how many worker shards are kicked off by mapreduces, with the # of shards being equal to the number of buckets.  The # of commit log buckets isn't worth focusing on at the moment, because there's only one mapreduce that iterates over them all, DeleteOldCommitLogs, and that has been disabled for a while.

There are, however, a number of mapreduces that regularly run across all EPP resource index buckets, including the aforementioned deleteContactsAndHosts and refreshDnsOnHostRename (which only check whether they should run every 5 minutes; they only actually run a mapreduce when there is work to do).  Other mapreduces that iterate across all EPP resource index buckets include RDE, DeleteProberData, and VerifyEntityIntegrity.  We chose 997 as a suitable figure to be able to handle any conceivable traffic load (e.g. something like a .com), so it could definitely be tuned down for smaller registries and development environments.  The max QPS of your system works out to something like 1-10 QPS per bucket, though let's go with the low end to be safe (and avoid costly entity group contention) and say that it's 1 QPS per bucket.  The amount of work that the mentioned mapreduces do scales roughly linearly with the number of buckets, so there are a lot of cost gains to be had by tuning that down.  I'd caution against tuning too far down, though, because the lower the number of buckets, the higher the chance of Datastore contention, and conflicts induce load too.
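
As a rough worked example: at that conservative 1 QPS per bucket, a registry whose peak mutation rate is on the order of 10 writes per second would be comfortably covered by a few tens of buckets, rather than 997.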

I'm sure some of my teammates will have some thoughts to chime in with on this, everything possibly ranging from "That's a great idea" to "You're crazy!"  Eagerly awaiting the responses.

Ben McIlwain

unread,
Mar 9, 2017, 5:55:25 PM3/9/17
to nomulus-discuss, ha...@donuts.email
On Thu, Mar 9, 2017 at 5:43 PM 'Ben McIlwain' via nomulus-discuss <nomulus...@googlegroups.com> wrote:
We chose 997 as a suitable figure to be able to handle any conceivable traffic load (e.g. something like a .com), so it could definitely be tuned down for smaller registries and development environments.  The max QPS of your system works out to something like 1-10 QPS per bucket, though let's go with the low end to be safe (and avoid costly entity group contention) and say that it's 1 QPS per bucket.  The amount of work that the mentioned mapreduces do scales roughly linearly with the number of buckets, so there are a lot of cost gains to be had by tuning that down.  I'd caution against tuning too far down, though, because the lower the number of buckets, the higher the chance of Datastore contention, and conflicts induce load too.

I'm sure some of my teammates will have some thoughts to chime in with on this, everything possibly ranging from "That's a great idea" to "You're crazy!"  Eagerly awaiting the responses.

I forgot to include my crazy suggestion: You could likely set these values all the way down to 1 for dev environments and be perfectly fine.  For production environments for small TLDs, you'd definitely want more than 1, but still something logarithmically closer to 1 than it is to 997.

Ben McIlwain

unread,
Mar 9, 2017, 6:01:16 PM3/9/17
to nomulus-discuss, ha...@donuts.email
On Thursday, March 9, 2017 at 5:55:25 PM UTC-5, Ben McIlwain wrote:
And in combination with doing this, you'd want to ensure that all of the aforementioned mapreduces are set to run on a number of shards that maxes out at the number of buckets.  I believe we default to 100 right now, which is not what you'd want to be doing with only one bucket's worth of entities to process (Donuts, I believe you had a problem related to this back when you were running an environment with 2 EPP resource index buckets that had tens of thousands of entities in it). 

Ben McIlwain

unread,
Mar 10, 2017, 10:29:09 AM3/10/17
to nomulus-discuss, ha...@donuts.email
... although 10 is probably a more reasonable number, and will more accurately model what will happen in production environments.  You also probably don't want to go below the number of backend instances, at least not until I have a chance to look at it further.  It looks like all of the EPP resource index traversing mapreduces default to 100 mapper shards, which is likely not what you'd want if you have fewer than 100 EPP resource index buckets.  I'm investigating whether it'd make sense to change that default to MIN(100, N * # of EPP resource index buckets), for some small value of N in the range of 1-2.

Brian Mountford

unread,
Mar 10, 2017, 11:02:23 AM3/10/17
to Ben McIlwain, nomulus-discuss, Hans Ridder
Regarding frontend versus backend instance classes, maybe the information I remembered seeing came from the O'Reilly book "Programming Google App Engine with Java", which I read a couple months ago to try and improve my understanding of the monster. Looking now, I see the following on page 114 of the second edition:

"Modules with manual or basic scaling use a different set of instance classes than modules with automatic scaling: B1, B2, B4, B4_1G, and B8. These are similar to the corresponding F* classes, with the addition of B8, which has 1GB of memory and a 4.8 GHz CPU."

Modules are what are now called services; the book is out of date on some things. So it sounds like, if you are using manual or basic scaling, you are always going to be using the backend instance classes no matter what, and don't need to worry about what "frontend" means at all.

Brian


Brian Mountford

unread,
Mar 10, 2017, 11:08:02 AM3/10/17
to Ben McIlwain, nomulus-discuss, Hans Ridder
And here is the help page that describes the 5 cents versus 30 cents thing:


A few inches below the list of the instance classes is the following:

Important: When you are billed for instance hours, you will not see any instance classes in your billing line items. Instead, you will see the appropriate multiple of instance hours. For example, if you use an F4 instance for one hour, you do not see "F4" listed, but you see billing for four instance hours at the F1 rate.

So it always uses the base rate of 5 cents, and just multiplies the instance hours by the factor appropriate to the instance class. In our case, using the B4_1G instances, we see a rate of 5 cents an hour, but with 6 hours billed for every hour actually used.
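
To put numbers on it: a single B4_1G running flat-out for a 24-hour day shows up as 24 * 6 = 144 billed instance hours at $0.05, i.e. $7.20 for the day, which is the same money as 24 hours at the $0.30 B4_1G rate, just expressed differently.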


Ben McIlwain

unread,
Apr 27, 2017, 11:00:23 AM4/27/17
to nomulus-discuss, mcil...@google.com, ha...@donuts.email
I've started looking more into costs and the following resources that I've used will likely prove helpful:

The TL;DR is that you first have to set up billing exports to BigQuery if you haven't already done so (which unfortunately won't retroactively export past billing data), and then you can graph it using some slick tools.  You can then easily see the impact of billing-affecting changes over time.  I'm experimenting with de-tuning one of our non-production environments to see what a reasonable configuration looks like, as 50 instances for all services is definitely not reasonable.

Ben McIlwain

unread,
Nov 11, 2019, 11:04:47 AM11/11/19
to nomulus-discuss
Reviving this thread since the question is still of interest.  And keep in mind that everything I'm saying here is my best guess, informed by 5 years' experience of running our registry platform, of course, but nothing here is set in stone.

There are four different App Engine services in Nomulus: frontend ("default"), pubapi, backend, and tools.  These account for the majority of GCP costs in running Nomulus, so just by optimizing them you can really bring the cost of running Nomulus way down.  Taking these in turn:

frontend: For a registry of roughly 150k domains, you could likely get away with 1-2 frontend instances.  This service handles EPP commands, of which there are not many on an ongoing basis unless you're running .com.  It also runs the registrar web console, but that's quite low load.

pubapi: If you plan on running WHOIS or RDAP in Nomulus (which is not a given for your specific use case as I understand it), then you'll need another few instances running pubapi, sized to your WHOIS load (as RDAP is not widely used yet).  If you don't plan on using WHOIS/RDAP, then you won't need any instances.

backend: These run dynamically, so you're configuring a maximum number of instances.  A lot of stuff runs on backend: DNS updates, RDE exports, and any number of asynchronous export tasks.  You can turn off the ones you won't need by removing them from the cron.xml file (e.g. as a ccTLD you likely don't need RDE).  It would probably be safe to bring the max number of instances here down to 5-10, knowing that asynchronous batch jobs aren't usually running at any given time, and thus the average number of running instances is probably only a couple.

tools: These also run dynamically, and are only needed to service requests from the command-line admin tool.  It's fine to let this max out at 1, knowing that the vast majority of the time 0 would be running.
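
To put the above into configuration terms, and assuming basic scaling, the scaling stanzas in the respective appengine-web.xml files would look roughly like this (the numbers are just my guesses from above, not tested recommendations):

  <!-- backend -->
  <basic-scaling>
    <max-instances>5</max-instances>
  </basic-scaling>

  <!-- tools -->
  <basic-scaling>
    <max-instances>1</max-instances>
  </basic-scaling>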


You can see reported costs on a daily basis, so if anything is wrong it should be easy to stop things before they get too out of hand.  If you want, we can help review proposed configuration settings before you push them, to ensure that things are tuned down to a cost-efficient level for testing (e.g. a minimal number of instances, with background services that you wouldn't initially use, and might never use as a ccTLD, turned off).