Using NATS for RPC


Alexander Staubo

Feb 10, 2016, 8:18:15 PM
to nats
As others have discovered, NATS looks like a good fit for RPC. Is anyone here using it in production for this? I'd love to hear success stories (or horror stories).

One of my concerns with using a decoupled, asynchronous request/response model for RPC is handling latency and network load when the request represents work that can take some time (>1 second), combined with retry logic, which might lead to dogpiling scenarios.

For example, consider this scenario:

1. Client sends request A
2. Wait a while
3. Gets no response (due to network blip, heavy load, I/O)
4. Retries by sending request A again as request B
5. Meanwhile, the server has in fact received request A despite the timeout, processes the request, and emits a response.
6. Client gets the response to A, which it must ignore
7. Server receives B
etc.

You end up doing twice the work/CPU, and the duplicate request has potentially delayed other requests from executing. In theory, such blips shouldn't occur, but in practice they do, quite often, which could have cascading effects, especially if the request involved a literal cascade of services notifying each other of an event. Probably nothing huge, but it could result in one of those "why is this thing spiking mysteriously 4 times a day" Heisenbug scenarios.

With point-to-point TCP communication, such a scenario should be less likely, given the tight coupling between the TCP connection and the handling logic. You can still get into situations where the client disconnects/fails while the server is still chewing on a request, of course, but in my mind that should be rarer.

Thoughts?

Larry McQueary

Feb 10, 2016, 9:54:39 PM
to nat...@googlegroups.com, Alexander Staubo
Hi Alexander,

I understand the premise of your concern/comment, but I think this problem is mitigated by the design of NATS and can be further mitigated by judicious use of the provided API(s).

In the scenario you describe below, you would ideally use the Request(subject, data, timeout) convenience API, specifying a timeout value. The Request() API ensures that the client only receives 1 (one) response from gnatsd through the use of AutoUnsubscribe(). It also unsubscribes from the reply inbox subject immediately when the timeout expires (if no message was received).

So, in step 5, while the “replier” may indeed answer this stale request, interest at the NATS server (gnatsd) has already been pruned and therefore this response message will never be sent from gnatsd to the requestor, so step 6 never happens.
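
For concreteness, here's a minimal sketch of that pattern in Go (the subject name and payload are illustrative, and the import path is the current one for the Go client, which has moved over the years):

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Request publishes with a unique reply inbox and waits for at most
	// one response. If the timeout fires first, the client drops its
	// subscription on the inbox, so a late reply is pruned at the server
	// and never delivered.
	msg, err := nc.Request("svc.work", []byte("payload"), 2*time.Second)
	if err != nil {
		// Typically nats.ErrTimeout; note the server may still have
		// processed the request.
		log.Printf("no response: %v", err)
		return
	}
	log.Printf("got response: %s", msg.Data)
}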

-Larry

Alexander Staubo

Feb 10, 2016, 10:02:49 PM
to nat...@googlegroups.com, Larry McQueary
On 10 Feb 2016, at 21:54, Larry McQueary <la...@apcera.com> wrote:
> I understand the premise of your concern/comment, but I think this problem is mitigated by the design of NATS and can be further mitigated by judicious use of the provided API(s).
>
> In the scenario you describe below, you would ideally use the Request(subject, data, timeout) convenience API, specifying a timeout value. The Request() API ensures that the client only receives 1 (one) response from gnatsd through the use of AutoUnsubscribe(). It also unsubscribes from the reply inbox subject immediately when the timeout expires (if no message was received).
>
> So, in step 5, while the “replier” may indeed answer this stale request, interest at the NATS server (gnatsd) has already been pruned and therefore this response message will never be sent from gnatsd to the requestor, so step 6 never happens.

Hi, thanks for the explanation. I’m actually less concerned about step 6 than about step 4, which is where the unnecessary duplicate request happens — is that also mitigated by the convenience API? The root cause is that the client doesn’t know whether the message ever reached the server.

Tyler Treat

Feb 10, 2016, 11:24:36 PM
to nats, la...@apcera.com
Cascading failures and retries are a common problem and certainly not specific to NATS as an RPC transport. The problem persists whether you're using NATS or a direct socket connection (network delays, partitions, etc.—they all happen). I don't think there's a way to magically prevent this other than designing your system to be robust, e.g. with exponential backoffs, retry budgets, circuit breakers, etc. I've said it before and I'll say it again: usually the biggest cause of a DoS attack is yourself. :)
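
As a rough illustration of that advice (the function, attempt budget, and numbers here are made up for the example, not any NATS API), a retry helper with exponential backoff and jitter might look like this in Go:

package rpcretry

import (
	"math/rand"
	"time"

	"github.com/nats-io/nats.go"
)

// requestWithBackoff retries a NATS request with exponential backoff and
// full jitter, giving up after a small attempt budget.
func requestWithBackoff(nc *nats.Conn, subj string, data []byte) (*nats.Msg, error) {
	const maxAttempts = 4
	backoff := 100 * time.Millisecond
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		msg, err := nc.Request(subj, data, 2*time.Second)
		if err == nil {
			return msg, nil
		}
		lastErr = err
		// Sleep a random duration in [0, backoff), then double the cap;
		// the jitter keeps a crowd of clients from retrying in lockstep.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
	return nil, lastErr // retry budget exhausted; let a circuit breaker trip upstream
}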

We are using NATS as an RPC transport. There are certainly advantages and disadvantages. One of the disadvantages in my experience so far is it becomes slightly more difficult to debug issues in production. Debugging distributed systems is already a challenge, though we can usually get by with distributed tracing. NATS adds yet another "hop" so to speak, and it's hard to know what went wrong without turning on full debugging. For example, I sent this request but it appears the server never received it. Did NATS receive it? Did it forward it? Was there an issue with the route between two NATS servers? I think this is the biggest challenge we've been facing, but there are plenty of benefits too.

Alexander Staubo

Feb 11, 2016, 12:26:01 PM
to nat...@googlegroups.com, Larry McQueary, ttre...@gmail.com
On 10 Feb 2016, at 23:24, Tyler Treat <ttre...@gmail.com> wrote:
> Cascading failures and retries is a common problem and certainly not specific to NATS as an RPC transport. The problem persists whether you're using NATS or a direct socket connection (network delays, partitions, etc.—they all happen). I don't think there's a way to magically prevent this other than designing your system to be robust, e.g. with exponential backoffs, retry budgets, circuit breakers, etc. I've said it before and I'll say it again, usually the biggest cause of a DoS attack is yourself. :)

Good advice. My example was merely the first scenario I could think of that would be specific to a decoupled pub/sub model and less of a concern in a point-to-point scenario. There are of course tons of other concerns.

Which makes me sad that nobody with a lot of development resources has sat down and spent the effort to create a standard, resilient, reusable RPC layer. People are probably wary of standard protocols after the disasters that were CORBA and SOAP, but that doesn’t mean RPC is inherently bad — most of us are stuck doing poor man’s RPC with REST.

You guys might be interested in this discussion I (tried to) kick off on HN yesterday:

https://news.ycombinator.com/item?id=11076890

NATS is one of the options I mention.

> We are using NATS as an RPC transport. There are certainly advantages and disadvantages. One of the disadvantages in my experience so far is it becomes slightly more difficult to debug issues in production. Debugging distributed systems is already a challenge, though we can usually get by with distributed tracing. NATS adds yet another "hop" so to speak, and it's hard to know what went wrong without turning on full debugging. For example, I sent this request but it appears the server never received it. Did NATS receive it? Did it forward it? Was there an issue with the route between two NATS servers? I think this is the biggest challenge we've been facing, but there are plenty of benefits too.

That’s true. I think tracing is actually a bigger and somewhat orthogonal concern, one that’s required even when doing point-to-point. We’ve had problems with re-entrancy in the past, where queue limits cause deadlocks, for example.

Not to mention you really want to collect hop metrics; we have some experimental code that reports [from, to] pairs to Statsd, but we never put it into production.
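
(For illustration: since statsd speaks a plain-text UDP protocol, reporting a [from, to] pair can be as small as the sketch below. The bucket naming, service names, and address are hypothetical.)

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// statsd accepts plain-text metrics over UDP, one per datagram.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Increment a counter keyed by the (from, to) service pair; in
	// practice this would be emitted by middleware around each RPC hop.
	from, to := "billing", "ledger"
	fmt.Fprintf(conn, "rpc.hops.%s.%s:1|c", from, to)
}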

Tyler Treat

Feb 11, 2016, 2:05:51 PM
to nats, la...@apcera.com, ttre...@gmail.com
We investigated several options for RPC. gRPC looks nice, but it's early, doesn't have much of a track record, and is pretty tightly coupled to HTTP/2. Finagle seems nice and has lots of resiliency stuff built in, but it's largely Scala/Java-ish. Uber's tchannel looks interesting too, though I'm not sure how proven it is. We looked into Thrift and what we like about it is the separation between the RPC layer, the transports, and the protocols. We started by implementing a NATS Thrift transport. We quickly realized there are some fundamental problems with Thrift:

- Head-of-line blocking: a single, slow request will block any following requests for a client.

- Out-of-order responses: an out-of-order response puts a Thrift transport in a bad state, requiring it to be torn down and reestablished. E.g. if a slow request times out at the client, the client issues a subsequent request, and a response comes back for the first request, the client blows up.

- Concurrency: a Thrift client cannot be shared between multiple threads of execution, requiring each thread to have its own client issuing requests sequentially. This, combined with head-of-line blocking, is a major performance killer.

- RPC timeouts: Thrift does not provide good facilities for per-request timeouts, instead opting for a global transport read timeout.

- Request headers: Thrift does not provide support for request metadata, making it difficult to implement things like authentication and authorization. Instead, you are required to bake these things into your IDL. The problem with this is it puts the onus on service providers rather than allowing an API gateway or middleware to perform these functions in a centralized way.

- RPC-only: Thrift has limited support for asynchronous messaging patterns, and even asynchronous RPC is largely language-dependent and susceptible to the head-of-line blocking and out-of-order response problems.

As a result, we ended up writing our own (groan, NIH) based on Thrift. It's actually an extension/superset of Thrift which addresses the issues above and implements support for the languages we need (Go, Java, Python, Dart). What's nice about it is, since it just extends Thrift, existing Thrift transports and protocols "just work." We're using NATS as the underlying transport which also solves some discovery problems, and we added IDL code generation for pub/sub APIs too. Working on async APIs soon.
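
To make the head-of-line and ordering points above concrete: with plain NATS request/reply, each in-flight request is matched to its own reply inbox, so one shared connection can serve many goroutines, and a slow reply can't wedge or desynchronize a fast one. A sketch (the subjects are invented):

package main

import (
	"log"
	"sync"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	var wg sync.WaitGroup
	for _, subj := range []string{"svc.slow", "svc.fast"} {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			// The reply to "svc.fast" can arrive before "svc.slow"
			// without putting any shared transport into a bad state.
			if msg, err := nc.Request(s, nil, 2*time.Second); err != nil {
				log.Printf("%s: %v", s, err)
			} else {
				log.Printf("%s -> %s", s, msg.Data)
			}
		}(subj)
	}
	wg.Wait()
}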

I'm doing a talk at the NATS meetup next month on how we're using NATS, which will cover some of this in greater detail.

Alexander Staubo

Feb 11, 2016, 2:23:46 PM
to nat...@googlegroups.com, la...@apcera.com, ttre...@gmail.com
On 11 Feb 2016, at 14:05, Tyler Treat <ttre...@gmail.com> wrote:
[…]
> As a result, we ended up writing our own (groan, NIH) based on Thrift. It's actually an extension/superset of Thrift which addresses the issues above and implements support for the languages we need (Go, Java, Python, Dart). What's nice about it is, since it just extends Thrift, existing Thrift transports and protocols "just work." We're using NATS as the underlying transport which also solves some discovery problems, and we added IDL code generation for pub/sub APIs too. Working on async APIs soon.

Your Thrift solution actually sounds pretty good, and like something others could benefit from. Is this something you have open-sourced or plan to open source?

I’m surprised you tried to shoehorn the whole Thrift stack into NATS. My first idea would be to simply use the Thrift serialization/wire formats and use NATS as the transport directly, not via the Thrift client/server layer. But I don’t know nearly enough about Thrift to make a judgement.
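
Something like the following is what I have in mind (a hypothetical outline only: Thrift’s Go API has changed across releases, and EchoRequest stands in for whatever struct your Thrift IDL generates):

package thriftnats

import (
	"time"

	"github.com/apache/thrift/lib/go/thrift"
	"github.com/nats-io/nats.go"
)

// callOverNATS serializes a Thrift-generated struct with TBinaryProtocol
// and ships the bytes over plain NATS request/reply, bypassing Thrift's
// client/server layer entirely. EchoRequest is assumed to come from
// Thrift codegen elsewhere.
func callOverNATS(nc *nats.Conn, req *EchoRequest) (*nats.Msg, error) {
	buf := thrift.NewTMemoryBuffer()
	oprot := thrift.NewTBinaryProtocolTransport(buf)
	if err := req.Write(oprot); err != nil { // older codegen signature
		return nil, err
	}
	// The response would be decoded the same way in reverse.
	return nc.Request("svc.echo", buf.Bytes(), 2*time.Second)
}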

> I'm doing a talk at the NATS meetup next month on how we're using NATS, which will talk about some of this in greater detail.

Would be awesome if you could capture it (video, transcript, blog post, anything). We’re located on the wrong coast.

Tyler Treat

Feb 11, 2016, 3:01:06 PM
to nats, la...@apcera.com, ttre...@gmail.com
We definitely intend to open source once we feel we've reached a good point, which shouldn't be too far off.

> Would be awesome if you could capture it (video, transcript, blog post, anything). We’re located on the wrong coast. 

I think the Apcera folks usually try to record the talks, but will have to defer that one to Brian. :)

Larry McQueary

Feb 11, 2016, 3:38:09 PM
to nat...@googlegroups.com, Alexander Staubo, ttre...@gmail.com
Alexander,


Which makes me sad that nobody with a lot of development resources has sat down and spent the effort to create a standard, resilient, reusable RPC layer. People are probably wary of standard protocols after the disasters that were CORBA and SOAP, but that doesn’t mean RPC is inherently bad — most of us are stuck doing poor man’s RPC with REST.

TL;DR: RPC isn’t bad. Attempting a “standard protocol” by today’s definition is mostly folly.

(Here’s where I stroke my long, flowing beard and wax philosophical): I think you have hit the nail on the head. In 25+ years in this business, I have seen a ton of effort put into creating this "standard, resilient, reusable RPC layer”. CORBA succeeded! Well, not in being standard, resilient and reusable (or even usable, depending on when and what you were implementing). It succeeded in making a lot of money for a lot of people, though… authors, publishers, software companies, integrators/consultants… everybody but the end-users/business stakeholders it was meant to benefit. It was evolutionary, not revolutionary, and it was both (a) too early for the problem(s) it was trying to solve, and (b) horribly over-engineered. Which gets me to my principal point:

The issue isn’t RPC, it is the eagerness of software engineers to 'gild the lily'. One tenet of modern design (not just software design) is something that Steve Jobs is often noted for. Although he didn’t invent the concept, he eventually made Apple famous for it: zen-like simplicity; to measure the success of a design not by what *is* there, but by the absence of cruft — the things that are intentionally absent because they don’t need to be there. As software engineers, that void — the absence of cruft — is something we have an overwhelming drive to fill, and without adequate oversight, we *will* fill it with more stuff… stuff that is as likely to be complex/unnecessary/slow as it is simple/useful/performant. If you invent the fish, I will invent a bicycle for it. Or vice versa. This applies to standards as much as it does to the technologies implementing the standard(s).

“But Larry, haven’t you just pointed out the solution? Proper oversight, right?”

No, actually I think I’ve pointed out the other part of the problem, and probably the larger part. When you hear the word “oversight” in this context, what’s the other word that automatically pops into your head? Committee! Standards committees are great at keeping individuals from adding unnecessary bits and pieces to protocols… so that they can wedge their own unnecessary bits and pieces in! I’ll leave that discussion for another day, but I will recommend reading the following old but still-relevant article by Pieter Hintjens (http://www.imatix.com/articles:whats-wrong-with-amqp/). It’s a bit of a lengthy rant (not unlike this one, just 10x longer and more thorough), and is specific to AMQP, but it makes a number of excellent observations/points about the pitfalls of the standards process as we know it.

Anyway, no, RPC isn’t inherently bad at all, just the “standard protocol” bit :) 

(Meanwhile, I sit comfortably in my Utopian ivory tower, employed by the benevolent corporate sponsor of an open-source messaging project with an open protocol, managed by a small oligarchy, open to criticism, suggestion and contribution, but free to invent or prevent 'cruft' as we see fit :))

-Larry

(who, for the record, does not have a long flowing beard… and whose opinions are solely his own and *not* those of Apcera Inc.)

Alexander Staubo

Feb 15, 2016, 1:57:37 PM
to Larry McQueary, nat...@googlegroups.com
On 11 Feb 2016, at 15:38, Larry McQueary <la...@apcera.com> wrote:
>> Which makes me sad that nobody with a lot of development resources has sat down and spent the effort to create a standard, resilient, reusable RPC layer. People are probably wary of standard protocols after the disasters that were CORBA and SOAP, but that doesn’t mean RPC is inherently bad — most of us are stuck doing poor man’s RPC with REST.
> TL;DR RPC isn’t bad. Attempting a “standard protocol” by today’s definition is mostly folly
[…]

Well, I don’t disagree!

Perhaps “de facto standard” would be a better way to phrase it. For example, the REST bandwagon took off pretty quickly, and now RPC over REST is the de facto standard, which turns out to be an evolutionary dead end.

I think it’s hilarious and tragic that in this day and age, apps still can’t talk to each other in a simple manner. However you look at it, JSON over HTTP is neither simple nor flexible enough for all use cases; pure TCP alone doesn’t solve the service discoverability problem; TCP and HTTP don’t have sufficient mechanisms for fault tolerance; and so on. So every solution ends up being proprietary and ad hoc.

I just want an opinionated stack that works across a fairly broad set of languages.

As an aside, I’m also sad that Tom Preston-Werner’s BERT-RPC (http://bert-rpc.org) effort never got anywhere. Github was behind it, a spec appeared, a bunch of bindings were launched, and then: Dead.

Brian Flannery

Feb 16, 2016, 9:24:35 AM
to nats, la...@apcera.com, ttre...@gmail.com


On Thursday, February 11, 2016 at 3:01:06 PM UTC-5, Tyler Treat wrote:
We definitely intend to open source once we feel we've reached a good point, which shouldn't be too far off.

> Would be awesome if you could capture it (video, transcript, blog post, anything). We’re located on the wrong coast. 

I think the Apcera folks usually try to record the talks, but will have to defer that one to Brian. :)
We will live stream Tyler's talk, and then make recordings available afterward ;)