PROPOSAL : Rewriting links in generated and proxied content


Louis Ryan

May 6, 2008, 7:48:00 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
Hi,

Many containers offer the ability to rewrite links to use their proxy loading mechanism via gadgets.io.getProxyUrl(). Typically the generated proxy URL can be expected to give better download performance for users of that particular container and may also reduce the load on gadget developer backends. Currently, however, there is no standard way for a gadget to have links in its generated markup rewritten to use the proxy URL format. I would like to propose creating a new feature to allow for rewriting of links in generated gadget content.

Create a new standard gadget feature called proxy-rewriter which allows gadgets to control whether they want content re-writing enabled. Containers can choose to turn the feature on by default for all gadgets and gadgets can use this mechanism to opt-out.

An 'include' param is used to control which URLs to rewrite:
  • ALL - All URLs in the content
  • DEFAULT - URLs with recognized static file extensions such as .js, .png, .gif ...
  • NONE - disables the feature even if it is enabled by default by the container
An 'include-pattern' and an 'exclude-pattern' can be specified to implement more exact filtering rules. Patterns are applied to the URL being considered for rewriting; excludes are processed after includes.

An 'apply-to' param is used to specify the comma-separated list of MIME types that the rewriter should recognize and rewrite. By default the list is text/html,text/xml,application/xml (suggestions welcome here).

     <Optional feature="proxy-rewriter">
        <Param name="include">DEFAULT</Param>
        <Param name="include-pattern">.*\/mystaticcontent\/.*</Param>
        <Param name="exclude-pattern">.*\/mynonstaticcontent\/.*</Param>
        <Param name="apply-to">text/html</Param>
     </Optional>

This feature not only affects the content generated when the gadget is rendered but also controls whether content fetched through makeRequest, Preload and proxied URLs is rewritten.

It is probably also worth mentioning how rewriting a URL to be proxied affects the caching behavior of the content. Containers will cache content fetched through the proxy. In general, containers are likely to favor a simpler expires/max-age style cache-control policy over Last-Modified/If-Modified-Since, which is more complicated to implement and more latency sensitive, and which is the default mechanism Apache and other web servers use when serving static files. Containers are free to make policy decisions about how to alter the cache-control headers of content fetched through their proxy.

A sample policy might look like:
- Pragma: no-cache and Cache-Control: no-cache,no-store are always respected, and this content will never be cached by the proxy.
- Expires and Cache-Control: max-age are always respected. If the expiration is shorter than one day and the URL is a recognized static file type, then expiration is forced to a minimum of 1 day.
- ETag and Last-Modified are stripped if Expires or Cache-Control is set or the URL is a recognized static file type.

Such a policy attempts to balance the needs of sophisticated users of real If-Modified-Since requests against the default behavior of common static file serving configurations.
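
To make the sample policy concrete, here is a minimal sketch of the three rules in Python. It is illustrative only and is not taken from any container's implementation: the function names, the tuple it returns, and the treatment of a missing or unparsable expiry are all assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

ONE_DAY = 24 * 60 * 60  # seconds


def parse_cache_control(value):
    """Split a Cache-Control header into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().lower().partition("=")
        if name:
            directives[name.strip()] = arg.strip().strip('"')
    return directives


def proxy_cache_policy(headers, is_static, now=None):
    """Return (cacheable, ttl_seconds, strip_validators) for one response."""
    now = now or datetime.now(timezone.utc)
    h = {k.lower(): v for k, v in headers.items()}
    cc = parse_cache_control(h.get("cache-control", ""))

    # Rule 1: explicit no-cache / no-store is always respected.
    if "no-cache" in h.get("pragma", "").lower() or "no-cache" in cc or "no-store" in cc:
        return (False, 0, False)

    # Rule 2: respect max-age / Expires ...
    ttl = None
    if cc.get("max-age", "").isdigit():
        ttl = int(cc["max-age"])
    elif "expires" in h:
        try:
            expires = parsedate_to_datetime(h["expires"])
            ttl = max(0, int((expires - now).total_seconds()))
        except (TypeError, ValueError):
            ttl = 0  # assumption: an unparsable Expires means "already expired"
    # ... but lift recognized static files to a one-day minimum.
    if is_static and (ttl is None or ttl < ONE_DAY):
        ttl = ONE_DAY  # assumption: a missing expiry on a static file also gets the floor

    # Rule 3: strip ETag / Last-Modified when expiry headers exist or the file is static.
    strip_validators = is_static or "expires" in h or "cache-control" in h
    return (True, ttl, strip_validators)
```

With this sketch, a static file served with Cache-Control: max-age=3600 comes back with a TTL of 86400 and its validators stripped, while the same header on a non-static URL keeps its 3600-second TTL.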

Thoughts?

-Louis



Brian Eaton

May 6, 2008, 8:27:37 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
On Tue, May 6, 2008 at 4:48 PM, Louis Ryan <lr...@google.com> wrote:
> Many containers offer the ability to rewrite links to use their proxy
> loading mechanism using gadgets.io.getProxyUrl() Typically the generated
> proxy URL can be expected to give better download performance for users of
> that particular container and also potentially reducing the load on gadget
> developer backends. Currently however there is no standard way for a gadget
> to have links in its generated markup be rewritten to use the proxy URL
> format. I would like to propose creating a new feature to allow for
> rewriting of links in generated gadget content

How would this be implemented?

Louis Ryan

May 6, 2008, 9:31:51 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
In Shindig this would be a combination of Servlet filters and added code in the ProxyHandler. I've already done a proof of concept implementation here at Google. I'm hoping to get time to put together a patch that supports this and add it to the Shindig JIRA for feedback.

Brian Eaton

May 6, 2008, 11:40:13 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
On Tue, May 6, 2008 at 6:31 PM, Louis Ryan <lr...@google.com> wrote:
> In Shindig this would be a combination of Servlet filters and added code in
> the ProxyHandler. I've already done a proof of concept implementation here
> at Google. I'm hoping to get time to put together a patch that supports this
> and add it to the Shindig JIRA for feedback.

I guessed the answer would involve code. =) I was curious about what
the code does. Are you parsing HTML looking for links, or is there
more to it?

Louis Ryan

May 7, 2008, 12:35:56 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
The parsing would need to be pretty simple for performance reasons, so probably not full HTML parsing with funk tolerance. More likely simple searches for
src="<url>" and href="<url>" in easily recognizable tag patterns, and similar for CSS and JSON structures. If people have suggestions on the implementation details I'd love to hear them.

In my experience, trying to use Java's regex facility for this will probably not meet the performance goals.
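
As a rough illustration of the "simple searches" idea, here is a sketch in Python that scans for src="..." and href="..." with plain substring searches rather than a regex or a full HTML parser. The proxy URL format and the function names are hypothetical, and this is not the Shindig code.

```python
from urllib.parse import quote

# Hypothetical proxy endpoint; a real container defines its own URL format.
PROXY_PREFIX = "https://proxy.example.com/proxy?url="


def to_proxy_url(url):
    """Wrap an absolute URL in the (assumed) proxy URL format."""
    return PROXY_PREFIX + quote(url, safe="")


def rewrite_links(markup, attributes=("src", "href")):
    """Rewrite absolute URLs found in attr="..." using substring searches only."""
    out = []
    pos = 0
    while True:
        # Find the nearest occurrence of any attr=" marker at or after pos.
        best, best_len = -1, 0
        for attr in attributes:
            needle = attr + '="'
            i = markup.find(needle, pos)
            if i != -1 and (best == -1 or i < best):
                best, best_len = i, len(needle)
        if best == -1:
            break
        start = best + best_len
        end = markup.find('"', start)
        if end == -1:
            break  # unterminated attribute: fail gracefully, leave the rest as-is
        url = markup[start:end]
        out.append(markup[pos:start])
        if url.startswith(("http://", "https://")):
            out.append(to_proxy_url(url))
        else:
            out.append(url)  # relative or other schemes are left untouched
        pos = end
    out.append(markup[pos:])
    return "".join(out)
```

The sketch is deliberately naive: it is case sensitive, ignores single-quoted attributes, and would also match attribute names that merely end in "src" or "href", so a real implementation needs to check the surrounding tag.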

Brian Eaton

May 7, 2008, 12:50:20 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
On Wed, May 7, 2008 at 9:35 AM, Louis Ryan <lr...@google.com> wrote:
> The parsing would need to be pretty simple for performance reasons, so
> probably not full HTML parsing with funk tolerance. More likely simple
> searches for
> src="<url>" and href="<url> in easily recognizable tag patterns and similar
> for CSS and JSON structures. If people have suggestions on the
> implementation details Id love to hear them.

Doing this server-side is error prone, so plan on failing gracefully.
What about client-side solutions? Traversing document.links might
work, or walking the DOM.

Graham Spencer

May 7, 2008, 7:40:14 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
Thanks Louis. A few minor questions:

[1] I assume DEFAULT should also include CSS files. It might make sense to base DEFAULT not on pattern matching targets but rather on the context (e.g. <IMG>, CSS, etc.).

[2] Will CSS files be parsed to extract background images and such?

[3] Should we give developers control over timeouts?

--g

Louis Ryan

May 7, 2008, 8:22:21 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
1. Yes
2. Yes, my sample implementation is now doing this.
3. To some extent. Developers have a lot of control using the cache-control headers. Containers are likely to require a minimum value.
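
For point 2, a minimal sketch of what rewriting url(...) references inside CSS could look like. It uses Python's re module for brevity, which is a different approach from the plain string searching discussed earlier in the thread, and the proxy prefix is a hypothetical placeholder.

```python
import re
from urllib.parse import quote

# Matches url(...), with optional single or double quotes around the target.
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def rewrite_css_urls(css, proxy_prefix="https://proxy.example.com/proxy?url="):
    """Point absolute url(...) references at the (assumed) proxy prefix."""

    def replace(match):
        quote_char, url = match.group(1), match.group(2)
        if not url.startswith(("http://", "https://")):
            return match.group(0)  # leave relative and data: URLs alone
        return "url(" + quote_char + proxy_prefix + quote(url, safe="") + quote_char + ")"

    return CSS_URL.sub(replace, css)
```

Background images referenced by absolute URL are rewritten; everything else in the stylesheet passes through unchanged.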

Louis Ryan

May 16, 2008, 7:47:11 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
An initial implementation of this is now available as a patch for Shindig:

https://issues.apache.org/jira/browse/SHINDIG-276

Paul Walker

May 16, 2008, 10:10:12 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org, Bryan Green

Our thoughts at MySpace were to parse and persist external resources on our end, and the developer would simply opt in by supplying a standardized query parameter on the URI of the resource. I think we can understand why developers would want us to persist/cache some images/swfs for them and not others. We need a way for users to opt in individual files, and while the pattern matching would work, the parameter just seems easier. I suppose for users that create apps on our developer site, we could publish the use of this parameter as a pattern match and still be compliant with the pattern matching detail of this proposal.

We’d like to obey caching headers, but would need to have a significant floor. We will be storing the parsed resources on Akamai and would have to run a job that updated these files, and I don’t like the messiness of that. I don’t see developers having a huge issue versioning URIs, as is typically done w/ JS libraries, and updating their gadget XML, so I’m not so concerned about this. I’m more concerned about the 90%+ of developers that don’t properly use cache control headers. Developers that are sophisticated enough, and have the need, to invalidate these resources periodically are sophisticated enough to use URI versioning and update their gadget XML. Again, they also have the option of hosting the resource themselves and controlling any complicated/varying user-agent caching they have on the resource.

We pre-process all markup when it is saved/published, and the live version for installed users is retrieved already gzip-compressed (for users that don’t already have it cached in their user-agent), so it sounds like we don’t have the same performance implications as Shindig. I assume that Shindig could store a lookup of cached URLs and replace where it finds matches; where it does not, it could hand the work of fetching/caching to a background task and not rewrite those URLs until that has completed. But I’ll leave that up to the Shindig experts.

Cc’ed Bryan Green, the main developer who has already finished most of our implementation, if someone would like to discuss. We’re keen on this support, so a general BIG +1 here. Could we separate the cache control portion of this proposal and move forward with resource parsing/persisting w/out it for now?

 

Thx…

~Paul

Louis Ryan

May 22, 2008, 5:11:25 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org, Bryan Green
Paul,

You certainly can use the include rule to specify the parameter matching implementation and just document that for your container. The implementation in Shindig no longer cares about recognized file extensions but rather uses the referencing HTML tag (script, link, img & embed currently) to determine whether rewriting should occur, so the revised proposal looks like:

<Optional feature="content-rewrite">
        <Param name="include-tags">script,link,img,embed</Param>
        <Param name="include-pattern">.*</Param>
        <Param name="exclude-pattern"></Param>
</Optional>

Params are shown with their default values; defaults are used if a param is omitted.

Exclude pattern overrides include pattern if exclude pattern is defined. The feature can be turned off by explicitly setting include-tags to an empty list or exclude-pattern to '.*'.
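
The decision logic of the revised params can be sketched as follows. This is a hypothetical helper, not the Shindig implementation; in particular, treating the patterns as whole-URL matches (rather than substring searches) is an assumption the proposal does not spell out.

```python
import re


def should_rewrite(tag, url,
                   include_tags="script,link,img,embed",
                   include_pattern=".*",
                   exclude_pattern=""):
    """Decide whether a URL referenced from `tag` should be rewritten.

    Defaults mirror the param defaults shown in the revised proposal.
    """
    tags = {t.strip().lower() for t in include_tags.split(",") if t.strip()}
    if tag.lower() not in tags:
        return False  # an empty include-tags list turns the feature off
    # Exclude overrides include, but only when an exclude pattern is defined.
    if exclude_pattern and re.fullmatch(exclude_pattern, url):
        return False
    return re.fullmatch(include_pattern, url) is not None
```

For example, should_rewrite("img", "http://a.com/x.png") is true with the defaults, and passing exclude_pattern=".*" disables rewriting for every URL.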

I didn't mean to imply that the cache-control rules are defined by this feature; that's an implementation detail that containers are free to change at their discretion. I provided the discussion to give folks some idea of what a typical container might do. Completely agree that good developers will version content and that containers are free to enforce this policy by setting cache-control to eons on rewritten content.

-Louis

Paul Walker

May 28, 2008, 3:39:19 AM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org, Bryan Green

Great news. Without caching rules, we are definitely prepared to support this.

However, speaking of Shindig and the process of proposals…since you have contributed the code to Shindig, is it just expected to be approved as a proposal without voting?

And on the specifics of what has been contributed to Shindig, what does it actually do? Is this specific to your provider? Where does the code you contributed actually persist the content, and where does it rewrite the URLs to?

 

Thanks,

Paul Walker

May 28, 2008, 3:45:21 AM
to opensocial-an...@googlegroups.com

 

Simple,

Instead of Container, we use Provider. Container is demeaning and self-serving. Provider gives a sense of obligation to the client.

 

~Paul

Kevin Brown

May 28, 2008, 4:12:36 AM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org, Bryan Green
On Wed, May 28, 2008 at 12:39 AM, Paul Walker <pwa...@myspace.com> wrote:

> Great news.  Without caching rules, we are definitely prepared to support this.
>
> However, speaking of Shindig and the process of proposals…since you have contributed the code to Shindig, is it just expected to be approved as a proposal without voting?

No -- Shindig will always implement whatever the standard specifies, so if the standard deviates from Louis' current proposal Shindig will be modified to support it. Shindig is a useful test bed for ironing out issues in a proposal before they are finalized, with enough users to easily test a wide range of container deployment needs.

> And on the specifics of what has been contributed to Shindig, what does it actually do?

What Louis has contributed so far rewrites links as proposed on this thread. The content is examined for any links and those links are rewritten to go through a caching proxy. Several optimizations are done to try to reduce http overhead, such as merging contiguous http requests into one. We've deployed this to the Orkut sandbox as well to get some real-world feedback from developers. You can take a look at the code here: http://svn.apache.org/viewvc/incubator/shindig/trunk/java/gadgets/src/main/java/org/apache/shindig/gadgets/rewrite/

> Is this a specific to your provider?  Where does the code you contributed actually persist the content and rewrite the URLs to?


Shindig's proxy provides a cache interface that users can implement to store the data however they like. A default in memory LRU cache is provided.
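
To illustrate the shape of such an arrangement, here is a minimal sketch of a pluggable cache interface with an in-memory LRU default. The class and method names are invented for this example and do not correspond to Shindig's actual Java interface.

```python
from abc import ABC, abstractmethod
from collections import OrderedDict


class ProxyCache(ABC):
    """Hypothetical storage interface a container could implement."""

    @abstractmethod
    def get(self, url):
        """Return the cached content for url, or None if absent."""

    @abstractmethod
    def put(self, url, content):
        """Store content under url."""


class LruProxyCache(ProxyCache):
    """In-memory default: evicts the least recently used entry when full."""

    def __init__(self, capacity=128):
        self._capacity = capacity
        self._entries = OrderedDict()

    def get(self, url):
        if url not in self._entries:
            return None
        self._entries.move_to_end(url)  # mark as most recently used
        return self._entries[url]

    def put(self, url, content):
        self._entries[url] = content
        self._entries.move_to_end(url)
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # drop the oldest entry
```

A container that wants to persist content elsewhere would supply its own ProxyCache subclass in place of the LRU default.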

Louis Ryan

May 28, 2008, 1:21:29 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org, Bryan Green
Glad to hear it. Currently the implementation in Shindig is a work in progress, though it already provides useful functionality. In particular, because the configurable rewriting rules need to carry across a chain of requests emanating from a gadget, we need to be able to look up the rule throughout that chain; this is not yet implemented. Nor, for that matter, is the ability to include or exclude URLs by regex, though that will be added soon.

Luke

Jun 2, 2008, 8:57:00 PM
to OpenSocial and Gadgets Specification Discussion
This relates more to implementation of the proxying server itself
rather than the gadget spec... I would like the proxying to be useful
for people who do Flash development. The proxying, if not implemented
with Flash in mind, will be a security hole for Flash developers. We
use crossdomain.xml as a security mechanism to prevent unwanted access,
and including *.gmodules.com, *.msappspace.com, or *.hi5modules.com as
an allowed domain in our crossdomain.xml isn't secure if those domains
proxy arbitrary URLs. We may as well allow access from all domains.

I see 2 ways we can get around the problem:

1) Create URLs for proxied content using the original domain as part
of the subdomain. Thus, a URL like http://static.myminilife.com/someswf.swf
would be cached at static.myminilife.gmodules.com/... . Then we can
allow static.myminilife.gmodules.com specifically in the crossdomain
file. The proxying server would have to ensure that the subdomain
matches the URL being proxied.

2) Filter the request's HTTP host against the proxied URL being asked
for. With that in place, I can create a CNAME pointing one of our
subdomains to the server doing the proxying, i.e.
gmodules.myminilife.com points to gmodules.com. When I have a swf at
static.myminilife.com/someswf.swf, I can replace that URL with one on
gmodules.myminilife.com, which will retrieve the content from the
proxying server. As above, the proxying server has to ensure that the
request's HTTP host matches the content being proxied.
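
A rough sketch of the host check in option 2, in Python. The helper names are hypothetical, and comparing only the last two labels of each host name is a deliberate simplification: a real server would need a public-suffix-aware comparison (for example to handle domains under .co.uk).

```python
from urllib.parse import urlsplit


def naive_site(host):
    """Last two DNS labels of a host name (assumption: good enough for .com-style names)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def may_proxy(request_host, target_url):
    """Allow proxying only when the request host and the target share a site."""
    target_host = urlsplit(target_url).hostname
    if not target_host:
        return False
    request_host = request_host.split(":")[0]  # drop any port from the Host header
    return naive_site(request_host) == naive_site(target_host)
```

With this check, a request arriving on gmodules.myminilife.com may fetch http://static.myminilife.com/someswf.swf but not a swf hosted on an unrelated domain.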


Kevin Brown

Jun 2, 2008, 11:08:57 PM
to opensocial-an...@googlegroups.com
I think the best option is probably to just opt out of rewriting for the swf files. Automatically rewriting Flash isn't very practical anyway due to IE's limitations, so use of gadgets.flash.embedFlash would be the recommended way to handle this.

Louis Ryan

Jun 4, 2008, 7:18:50 AM
to opensocial-an...@googlegroups.com
Yes, for this kind of security you should just use a regex to exclude the swf from the proxy. Many apps use swfs for simple animations and other non-security-sensitive displays for which proxying would be perfectly fine. If you want to use a secure & trusted communication channel you may want to use the Flash-to-JS bridge to call signed makeRequest, or do the equivalent yourself in Flash.

Louis Ryan

Jul 16, 2008, 4:43:01 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
Hi,


Now that we've had some experience with the rewriter I'd like to propose an additional feature to allow developers to control the caching behavior of the open-proxy through which rewritten content is directed. The new param 'expires' can have a value of either HTTP or a positive integer that represents the TTL of the content in the browser cache and in any cache the open-proxy may use internally. If the value is HTTP then the cache-headers on the original content are respected by the cache used by the proxy and carried through to the browser.

This gives gadget developers finer-grained control over their content's TTL while still getting the latency benefit of the cache. It is also useful for developers that host content at sites where they don't have explicit control of the cache headers. Containers are still free to enforce a minimum TTL if they so choose and should document it for developers.

<Optional feature="content-rewrite">
        <Param name="expires">86400</Param>

        <Param name="include-tags">script,link,img,embed</Param>
        <Param name="include-pattern">.*</Param>
        <Param name="exclude-pattern"></Param>
</Optional>
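
A small sketch of how a container might resolve the 'expires' param into an effective TTL. The function name and arguments are hypothetical, and applying the container's minimum to both the HTTP and the integer case is an assumption.

```python
def effective_ttl(expires_param, origin_ttl, container_min_ttl=0):
    """Resolve the 'expires' param to the TTL the proxy would apply.

    expires_param: "HTTP" or a positive integer given as a string.
    origin_ttl: TTL derived from the original content's cache headers.
    container_min_ttl: floor the container chooses to enforce (assumed).
    """
    value = expires_param.strip()
    if value.upper() == "HTTP":
        ttl = origin_ttl  # respect the origin's cache headers
    else:
        ttl = int(value)  # raises ValueError for non-numeric input
        if ttl <= 0:
            raise ValueError("expires must be HTTP or a positive integer")
    return max(ttl, container_min_ttl)
```

For instance, with the param value 86400 shown above, effective_ttl("86400", 60) returns 86400 regardless of the origin's headers, while effective_ttl("HTTP", 60) returns 60 unless the container enforces a higher minimum.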

Thoughts?

-Louis

Kevin Brown

Jul 16, 2008, 4:59:00 PM
to opensocial-an...@googlegroups.com, shind...@incubator.apache.org
This is a great addition, and it ensures that developers can control the caching behavior at all levels of the stack (makeRequest, getProxyUrl, and now rewritten content).