REST API Caching

3,158 views

Sahil

May 24, 2012, 03:04:47
to API Craft
Hi all,
I am new to the REST world. I have been studying REST for around a
month, and I have lots of doubts.
First, a small one: how do you technically achieve what the term "API
Response Caching" describes, so that it maximizes scalability?

Daniel Roop

May 24, 2012, 22:38:15
to api-...@googlegroups.com
Sahil,

I am not sure how far you have gone down the rabbit hole, so I will start up high.

The first important principle is to understand how HTTP caching works. There are basically two parts: TTL (Cache-Control) and stale check (ETag). When a resource is generated by an origin server, you need to think of it as gone. You no longer have control over it; you only get to make suggestions to the client about what to do with it. The two mechanisms you have are the TTL (how long the client should keep the object in cache before checking back) and the stale check (a version identifier of the resource that was returned), which can be sent with a new GET request to the origin server to say "Hey, I have this version; is it still good?" This gives the origin server the opportunity to say "yep, keep using that one" and provide a new TTL, if it is still valid.
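That TTL-plus-stale-check flow can be sketched as follows. This is a minimal illustration of the origin-server side of a conditional GET, not any particular framework's API; the function name, ETag value, and TTL are made up:

```python
def handle_get(resource_body, current_etag, if_none_match):
    """Return (status, headers, body) for a possibly-conditional GET."""
    headers = {
        "ETag": current_etag,
        "Cache-Control": "max-age=60",  # a fresh TTL is granted either way
    }
    if if_none_match == current_etag:
        # "Yep, keep using that one": 304 Not Modified, body not resent.
        return 304, headers, b""
    return 200, headers, resource_body

# First request: the client has no cached copy yet.
status1, headers1, body1 = handle_get(b'{"id": 1}', '"v42"', None)
# Revalidation: the client sends If-None-Match with the ETag it stored.
status2, headers2, body2 = handle_get(b'{"id": 1}', '"v42"', '"v42"')
```

The 304 path is what makes the stale check pay off: the server skips resending the body and the client keeps its cached copy for another TTL window.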

You need to use these two controls in different ways to get the effects you want. For instance, when serving files that will never change (like the CSS for a build), you can set a really long TTL and no ETag. For something that doesn't change very often, but needs to propagate quickly when it does (like the party members on a reservation), you would set a low TTL (like 1 minute) and an ETag. In this second example, the low TTL of 1 minute keeps bursts from clients from overwhelming the origin server (scale), and the ETag allows the origin server to skip constructing the reservation object, if it has a way to verify the current valid ETag faster than constructing the entire reservation. Another example would be something that doesn't change often and can propagate slowly when it does (like a user's ad recommendation profile): you can set a higher TTL (like 6 hours) and not worry so much about an ETag (although it would still be useful).
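The three policies just described, expressed as response headers. This is a sketch; the concrete max-age values and the ETag are illustrative assumptions, not prescriptions:

```python
CACHE_POLICIES = {
    # Never changes (e.g. the CSS for a build): long TTL, no ETag.
    "static_asset": {"Cache-Control": "max-age=31536000"},
    # Rarely changes but must propagate fast (reservation party members):
    # a short TTL absorbs bursts, and the ETag makes revalidation cheap.
    "reservation": {"Cache-Control": "max-age=60", "ETag": '"res-v7"'},
    # Rarely changes and may propagate slowly (ad recommendation profile):
    # long TTL, ETag optional.
    "ad_profile": {"Cache-Control": "max-age=21600"},
}
```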

This helps you scale by allowing components along the network path to cache the documents reducing the load on the origin server (thus allowing you to scale).  You can basically cache it in every client application of the origin server, and also in the network (using products like Apache Traffic Server, Squid, Varnish etc..).  This allows you to scale because it consumes less resources to serve a static file than to build a dynamic resource, and because you can push the majority of the work to the edge of your network where you can setup edges around the world (or at least closer to your users) reducing network latency and/or balancing load around the world.

So to answer your question, the first place you should look to add API caching would be a product like
https://www.varnish-cache.org/

If you are really interested in this, I would recommend reading REST in Practice, as it has a good overview of this topic and of how to apply REST in general.

Daniel Roop

Sam Ramji

May 25, 2012, 21:47:47
to api-...@googlegroups.com
I wish I could "favorite" posts in Google Groups.  This is the canonical answer for REST API caching IMHO.

Cheers,

Sam

Maxim Mass

Jun 12, 2012, 07:18:28
to api-...@googlegroups.com
Nice summary indeed. The little star next to the post reply button on the left should do the trick.

Mike Kelly

Jun 12, 2012, 07:33:05
to api-...@googlegroups.com
There is another area of consideration when it comes to web caching,
and that is cache invalidation.

I actually did some work on this a couple of years ago; you can read a
write-up of my slide deck here:

http://restafari.blogspot.co.uk/2010/04/link-header-based-invalidation-of.html

Since then Mark Nottingham and myself have published the invalidation
mechanism as an Internet-Draft:

http://tools.ietf.org/html/draft-nottingham-linked-cache-inv-02

If anyone has any q's about the content let me know

Cheers,
M
--
Mike

http://twitter.com/mikekelly85
http://github.com/mikekelly
http://linkedin.com/in/mikekelly123

Daniel Roop

Jun 13, 2012, 22:45:04
to api-...@googlegroups.com
Mike,

I hadn't seen that before; it seems useful. Is there a reason you felt the need to decouple the normal cache max-age from inv-maxage? It seems like you could have leveraged the existing property and just specified a new Cache-Control directive indicating the invalidation aspect, or even just relied on the existence of Links with the rel you are defining.

-Daniel

P.S. I am glad my overview was useful.

Mike Kelly

Jun 14, 2012, 04:30:33
to api-...@googlegroups.com
On Thu, Jun 14, 2012 at 3:45 AM, Daniel Roop <dan...@danielroop.com> wrote:
> Mike,
>
> I hadn't seen that before, that seems useful.  Is there a reason you felt to
> decouple normal cache maxage from the inv-maxage.  Seems like you could have
> leveraged the existing property, and just specified a new cache-control that
> indicated the invalidated aspect, or even just rely on the existance of
> Links with the rel you are definine.
>

The TTL of invalidatable responses should be longer than that of those
which aren't, but we need to account for caches in the request stream
that aren't aware of the mechanism and still need to see a 'normal'
TTL. Basically, this means that max-age, s-maxage, and inv-maxage are
all necessary for one HTTP response to be able to control different
types of cache with different TTL values.
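A single response can then carry all three directives at once, each addressed to a different class of cache. A sketch with made-up values (the exact numbers are illustrative):

```python
# Private caches honor max-age; ordinary shared caches prefer s-maxage;
# an invalidation-aware shared cache (per the linked-cache-inv draft) may
# hold the response much longer, because it will be told when it changes.
header = "max-age=60, s-maxage=300, inv-maxage=86400"

directives = dict(part.strip().split("=") for part in header.split(","))
```

A cache that does not implement the draft simply ignores `inv-maxage` and falls back to the standard directives, which is what makes the mechanism safe to deploy incrementally.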

Cheers,
M

John Paul Hayes

Jun 27, 2012, 07:50:32
to api-...@googlegroups.com
Hi Sam,

You can star a post in Google Groups... just an FYI.

Jp

Mark Nottingham

Jul 3, 2012, 18:52:36
to api-...@googlegroups.com
Just to add a word or two, I gave a talk about how we did API caching when I was at Yahoo!:
(a bit hard to follow from slides; I should really do a blog entry to cover this)

...and there's also the Web caching tutorial (getting a bit old) which also applies to API caching:

Finally, the HTTPbis caching spec breaks out RFC2616's caching and specifies it a lot more cleanly:

Hope this helps,

Sahil

Jul 12, 2012, 17:03:08
to api-...@googlegroups.com
Daniel, thanks a lot for the nice explanation, and thanks to the others too for providing valuable info.
I will try to follow it as described while developing my mobile app and RESTful services.

Brock Allen

Jul 19, 2012, 12:20:28
to api-...@googlegroups.com

Daniel --

I have a question that I've been struggling with for some time. Caching is great and helps things scale and be more responsive, but confirm for me (if you will) my concern: caching is only useful or viable for unauthenticated requests. When the request/response is either 1) authenticated or 2) contains sensitive information, we always require SSL, and as a result caching is essentially gone from the equation (modulo Cache-Control: private).

The reason I ask/mention this is that so much hype goes into the benefits of caching, but most of the web API (or even browser-based) apps I see or work on require authentication, and this promise of scale due to caching seems to be a lie.

Thoughts? TIA

-Brock

Mike Kelly

Jul 19, 2012, 13:34:45
to api-...@googlegroups.com
On Thu, Jul 19, 2012 at 5:20 PM, Brock Allen <brock...@gmail.com> wrote:
>
> Daniel --
>
> I have a question that I've been struggling with this for some time. Caching
> is great and helps things scale and be more responsive, but confirm for me
> (if you will) my concerns: Caching is only useful or viable for
> unauthenticated requests. When the request/response is either 1)
> authenticated or 2) contains sensitive information we always require SSL and
> as a result caching is essentially gone form the equation (modulo
> Cache-Control:private).

Not necessarily; you can use reverse proxy/gateway caching on the
server side, behind your SSL endpoint (2) and potentially behind an
Auth layer (1).

"Shared caching" is not really appropriate for (1) or (2).

> The reason I ask/mention this is so much hype goes into the benefits of
> caching but most of the web api (or even browser-based) apps I see or work
> on require authentication and this promise of scale due to caching seems to
> be a lie.

See above.

Either way, thinking about cacheability makes you approach things
from an "intermediary perspective", and this puts you in the right
frame of mind when deciding what resources your system should expose
and how.

Cheers,
M

Daniel Roop

Jul 19, 2012, 18:57:46
to api-...@googlegroups.com
Brock,

As Mike suggested, even if you can't use the cache all the way to your client, you can distribute it through your network using reverse proxies. So if you are using an API Gateway like Apigee, Layer 7, or Mashery, and you deploy them in EC2 or Rackspace, you can have them all over the world providing authentication really close to the user, and have the origin server in one datacenter.

Another thing to point out is that SSL does not prevent caching. This is an odd urban myth that has spread because old browsers disabled caching when a site was served over SSL. Nothing in the HTTP specification (or the SSL specification) says anything about treating HTTPS communication differently from HTTP in this regard. All modern browsers (I think from IE7 up) allow you to cache HTTPS requests. And as far as I know, all reverse proxy caches (Apache Traffic Server, Zeus, Squid, etc.) have no problem with this either.

Where you do have a potential problem: according to HTTP caching semantics, a request with an Authorization header should not be cached by "shared" caches unless certain conditions are met (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.8). This doesn't need to worry you if the reverse proxy sits inside your network and the OAuth layer sits outside of it. And your client is usually considered not a "shared" cache but a private cache. This is where, in theory, Cache-Control: private plays a part: it tells the private cache that it may keep this data cached.
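A rough sketch of the shared-vs-private distinction just described. The real rules (RFC 2616 sec. 14.8) have more cases; this only captures the two points made here, and the substring matching on directives is a simplification:

```python
def shared_cache_may_store(request_headers, response_headers):
    """May a shared (proxy) cache store this response?"""
    cc = response_headers.get("Cache-Control", "")
    if "private" in cc or "no-store" in cc:
        return False  # reserved for the user's own private cache
    if "Authorization" in request_headers:
        # Storable only if the response explicitly permits it
        # (the exceptions listed in RFC 2616 sec. 14.8).
        return any(d in cc for d in ("public", "s-maxage", "must-revalidate"))
    return True

def private_cache_may_store(response_headers):
    """A browser cache may keep a Cache-Control: private response."""
    return "no-store" not in response_headers.get("Cache-Control", "")
```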

I must be honest: I have never tried to test the private cache idea, although at my office we are getting close to needing it. If I find anything interesting I will post it here. But in theory this should address your concerns.

Daniel

Brock Allen

Jul 19, 2012, 19:40:02
to api-...@googlegroups.com

As Mike suggested, even if you can't use the cache all the way to your client, you can distribute it through your network using reverse proxies.  So if you are using a API Gateway like Apigee, Layer7 or Mashery and you deploy them in EC2 or Rackspace, you can have them all over the world providing the authentication really close to the user, and have the origin server in one datacenter.

Hmm, ok... you guys are obviously coming at this with a different mindset than me. I'm thinking about security and network sniffing, and you guys are assuming caching is a must (I don't mean to put words into your mouth, though) :)

My assumption is that anywhere along the call from the client to the server we could have malicious agents eavesdropping on the packets, so we'd never allow this; thus SSL all the way, end-to-end.

Another thing to point out is that SSL does not prevent caching.  This is a odd urban myth that has spread due to old browsers disabling caching when the site was over SSL.  Nothing in the HTTP Specification (or SSL) specification says anything about treating HTTPS communication any different than HTTP in this regard.  All modern browsers (i think from IE7 up) allow you to cache HTTPS request.  And as far as I know all reverse proxy caches (Apache Traffic Server, Zeus, Squid etc..) have no problem with this either.

I understand that Cache-Control:private would work, of course, but I don't understand how a proxy or intermediary can decrypt the traffic to do caching -- the key exchange was between the client and server.
 
Where you do have a potential problem is according to the HTTP Caching Semantics a Request with an Authorization header should not be cached by "shared" caches unless certain conditions are met (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.8).  This doesn't need to worry you if you use the Reverse Proxy thing for inside your network and put the OAuth outside of that.  And your client is not considered a "shared" cache usually but a private cache.  This is where in theory the Cache-Control: private plays a part.  This tells the private cache that it can keep this data cached.

Again, no one in their right mind would send Authorization headers without SSL -- heck, I run the ForceTLS addon in my browser because I don't trust websites to do things the right way. Think of how Facebook, or even Google, used to allow authenticated requests without SSL. Seems insane to me.

Maybe I'm missing something, so please educate me if I am.
 
I must be honest I have never tried to test the private cache idea, although at my office we are getting close to needing this.  If I find anything interesting I will look to posting it here.  But in theory this should account for your concerns.
 
So yeah, private caching (unless I'm missing something, and I probably am) is simply user-agent caching, right? But it's not the "internet scale" caching we all learned about with HTTP. Again, unless there are other moving parts I'm just not aware of.

Thanks for your replies though -- I enjoy the conversation :)

Daniel Roop

Jul 19, 2012, 19:48:15
to api-...@googlegroups.com
Brock,

You can do SSL from the client to the origin server, but you can't do a single SSL connection all the way down. Instead, each component has to negotiate its own connection at each layer.


User Agent ======> API Gateway ====> Origin Server

So given the above setup, you would need an SSL negotiation between the user agent and the API gateway, and a second one between the gateway and the origin server. Most compliance bodies require you to do SSL to the datacenter; within the datacenter you just need appropriate access rules. Many security folks, however (probably rightly so), believe you should go all the way down the chain. There are tradeoffs there, because between the gateway and the origin server there are typically multiple instances of the origin server, and each needs its own SSL cert to be set up. This makes it more difficult in some environments to spin up new origin servers dynamically to handle load (but it is totally possible).

So I think that solves your concern about how the intermediary decrypts and caches: it controls the SSL on both ends, so it has the unencrypted value to cache in between. However, you are right that if you wanted a single SSL connection from user agent to origin server, you could not have intermediary caches.

Daniel

Mark Nottingham

Jul 19, 2012, 19:48:18
to api-...@googlegroups.com
Shared, third-party caching isn't possible with SSL/TLS. Caching by the client, the server, or an intermediary acting on behalf of one of them is possible (i.e., the intermediary terminates the TLS connection).

That said, a third-party intermediary *can* be injected (and this is becoming depressingly more common); e.g., <http://wiki.squid-cache.org/Features/SslBump> (just one example of many).

Cheers,
--
Mark Nottingham http://www.mnot.net/



Brock Allen

Jul 19, 2012, 19:58:17
to api-...@googlegroups.com
 
So I think that solves your concern about how the intermediary decrypts and caches.  The answer is it controls the SSL on both ends so it has the non-encrypted value to cache in between.  However you are right if you wanted a single SSL connection from User Agent to Origin Server, you could not have intermediary caches.

Ok, yeah, so I think we are saying/thinking the same thing. From the client's perspective (at least from DNS's and TLS's perspective), the gateway is the host I'm establishing the TLS connection with; how you manage the network once I've reached your host is your matter. So I suppose when you were answering earlier about caching, you were suggesting doing it in the gateway.

We don't always have those gateways, though. I'm from the IIS world, and most commonly for the size of sites I run and work on (small to medium) we do SSL on the web server itself.

So this raises another question -- if you were caching in the gateway, you'd need to key the cache entry on whatever authentication token is being presented, yes? Then the cache entry is specific to the individual client, yes? If both are true, is the cache really that useful?

TIA

Mark Nottingham

Jul 19, 2012, 20:09:22
to api-...@googlegroups.com
It depends very much on how you do it.

If you put the user's identity (or a facsimile of it) in the URL, that gets it into the cache key, and then you only need to enforce authentication at the gateway (which is supported in most products).
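A minimal sketch of the identity-in-the-URL idea; the URL scheme here is hypothetical, the point is only that a plain URL-keyed cache then partitions entries per user automatically:

```python
def cache_key(user_id, path):
    # Identity becomes part of the URL, hence part of the cache key.
    # Authentication itself is enforced upstream at the gateway.
    return f"/users/{user_id}{path}"

# Two users asking for the "same" resource get distinct cache entries,
# so one user's cached response can never be served to another.
alice_key = cache_key("alice", "/recommendations")
bob_key = cache_key("bob", "/recommendations")
```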

I built a system in the late 90s where we cached authenticated content at the "edge", but arranged cache control so that every request was validated, thereby checking authentication on the origin server. Since the responses were big PDFs, this helped quite a bit.
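That pattern can be sketched as edge headers plus an origin-side check. The header values, ETag, and function names are illustrative, not from the actual system:

```python
# The edge stores the big body but must revalidate on every request, so
# the origin re-checks authorization each time; the full PDF crosses the
# wide-area link only when the cached copy is actually stale.
EDGE_HEADERS = {
    "Cache-Control": "no-cache",  # cacheable, but revalidate every time
    "ETag": '"pdf-v1"',
}

def origin_check(if_none_match, authorized, current_etag='"pdf-v1"'):
    """Origin response to a revalidation request from the edge."""
    if not authorized:
        return 401  # auth re-checked on every single request
    return 304 if if_none_match == current_etag else 200
```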

Cheers,

Brock Allen

Jul 19, 2012, 20:22:32
to api-...@googlegroups.com

If you put the users' identity (or a facsimile of it) in the URL, that gets it into the cache key, and then you only need to enforce authentication at the gateway (which is supported in most products).

Hmm, that's not the first place I'd think to put it :) I mean... isn't this what the Authorization header is for?

Also, I'm appreciating how different our styles or mindsets are on this (and I don't mean this in a bad way at all -- it's just interesting to me). So often in the coding I do, identity information is mandatory in the server itself so that authorization decisions can be made -- typically because inputs affect those decisions, or because a database needs to be consulted that isn't accessible (or designed to be accessible) from the gateway. But then again, this is typically with RPC-style apps. Maybe designing a RESTful system will somehow allow better decoupling and separation. I can see how, with OAuth-style authorization, this would be much easier to separate out... but that's because the resource that requires authorization is the thing at the end of some URL (IOW, the application semantics shifting to a more resource-oriented system allows the authorization semantics to also revolve around the identifier for the resource).

I built a system in the late 90's where we cached authenticated content on the "edge", but arranged cache control so that it was validated for every request, thereby checking authentication on the origin server. Since the responses were big PDFs, this helped quite a bit.

So you always forced a request to get to the origin, but it just returned a 304 if authorization was still granted? So you're saving on the bandwidth?

Again, many thanks. For me this conversation is very compelling.

Mark Nottingham

Jul 19, 2012, 20:31:03
to api-...@googlegroups.com

On 20/07/2012, at 10:22 AM, Brock Allen wrote:

>
>> If you put the users' identity (or a facsimile of it) in the URL, that gets it into the cache key, and then you only need to enforce authentication at the gateway (which is supported in most products).
>>
> Hmm, that's not the first place I'd think to put it :) I mean... isn't this what the Authorization header is for?

Yep. A gateway can be set up to handle the authentication; e.g., in Squid, it's a combination of:
http://wiki.squid-cache.org/ConfigExamples/Reverse/BasicAccelerator
http://wiki.squid-cache.org/Features/Authentication


> Also, I'm appreciating how different our styles or mindsets are to this (and I don't mean this in a bad way at all -- it's just interesting to me). So often in the coding I do identity information is mandatory in the server itself so that authorization decisions can be made -- typically due to inputs affecting those decisions or a database needs to get consulted which isn't accessible (or designed to be) from the gateway. But then again, this is typically with RPC style apps. Maybe designing as a RESTful system will somehow allow a better decoupling and separation. I can see with an OAuth style authorization where this would be much easier to separate out... but that's because the resource that requires authorization is the thing at the end of some URL (IOW the application semantics shifting to more of a resource oriented system allow these authorization semantics to also revolve around the identifier for the resource).

Right. Where you need to use the identity on the back end, it's a matter of establishing trust between the gateway and the back end, and then conveying the user identity to it (whether that's in the URL, a header, etc.).


>> I built a system in the late 90's where we cached authenticated content on the "edge", but arranged cache control so that it was validated for every request, thereby checking authentication on the origin server. Since the responses were big PDFs, this helped quite a bit.
>>
> So you always forced a request to get to the origin, but it just returned a 304 if authorization was still granted? So you're saving on the bandwidth?

Exactly.


> Again, many thanks. For me this conversation is very compelling.

No worries!