Bringing the proxy discussion to dev list from the forums

Andrew Anderson

Aug 17, 2020, 10:56:53 PM
to zotero-dev

For the benefit of others, I'm bringing a discussion that started on the forums to a better venue to hash out.

Apologies in advance for a very long post, but this is a complex topic that all started with work that I am doing to add support for Muse Knowledge Proxy, which is a rewriting proxy that behaves similarly to EZProxy, but uses very different rewriting schemes.

Following Dan's suggestion, I instrumented the browser connector a bit over the weekend to trace the execution of URL handling, and confirmed that the Proxy.scheme pattern that I have is correct and it is in fact doing the right thing, up until the point it enters Zotero.Proxy.prototype.toProxy(), and that is where things go sideways.

The pattern in use is https://%a-y-https-%h.proxy.example.edu/%p, and when this pattern is used in .toProxy(), the output is https://-y-https-www-vendor-com.proxy.example.edu/path/to/content, because that function does not have sufficient context to build a valid URL (in this case, it does not have any information available to stuff into the %a placeholder, thus skips the %a token completely and starts with a "-").  Based on what I see, any use of %a is not supported in .toProxy() at this time, nor do I think it would be reasonable to try to add it, as I will explain below.
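To make the failure mode above concrete, here is a minimal sketch of placeholder substitution. It is not the connector's actual code: fillScheme() and its parts argument are hypothetical, and only the %a, %h and %p tokens are handled. It assumes the host has already had its dots converted to hyphens.

```javascript
// Hypothetical, simplified stand-in for the placeholder substitution in toProxy().
// Only %a, %h and %p are handled here.
function fillScheme(scheme, parts) {
    return scheme.replace(/%[ahp]/g, (token) => {
        switch (token) {
            case '%h': return parts.host || '';
            case '%p': return parts.path || '';
            // %a has no source value when starting from an unproxied URL,
            // so it collapses to an empty string.
            default: return parts.any || '';
        }
    });
}

const scheme = 'https://%a-y-https-%h.proxy.example.edu/%p';
const result = fillScheme(scheme, {
    host: 'www-vendor-com',
    path: 'path/to/content'
});
console.log(result);
// https://-y-https-www-vendor-com.proxy.example.edu/path/to/content
```

With nothing to put in %a, the hostname starts with "-", which is the invalid URL described above.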

First, a sidebar: here is a quick primer on how rewriting platforms other than EZProxy manage their URLs, in case you have not read the forum post or have not worked with those proxy platforms before, so that we have a common reference for further discussion:

I am differentiating between the supported proxy entry points (the Starting URL) and what Zotero will see in window.location.href once the content is loaded (the Location URL):

EZProxy (proxy by hostname):


Muse Knowledge Proxy (rewrite by host):


Muse Knowledge Proxy (rewrite by path):

Pulse Secure / Juniper SSL VPN: [from memory, but it should be close]

Location URL: https://proxy.example.edu,DanaInfo=/www.vendor.com,/path?arg1=value1

OpenAthens: [based on available docs]


Innovative Interfaces WAM: [based on available docs]


For EZProxy and WAM, it would be fairly simple to build a Starting URL from the information available to the .toProxy() function; however, the function would likely need to be updated to understand the WAM port prefix.  But even with that, there are some BIG assumptions about authentication handling being made here that I will go into a little later.

For Muse Knowledge Proxy, an XHR technique almost identical to the existing one for EZProxy can be used to obtain the login URL, but that information is lost after the .learn() function that builds the Proxy.scheme has completed, and is not available in .toProxy().

For OpenAthens, loading the proxied link is going to take the user to a federation WAYF page to sign in using a SAMLRequest that cannot be re-used, nor do I see any way to divine which institution the user came from in order to construct a Starting URL using their redirector entry point:


So, with that background information on the different access platforms available today, I believe that there is an advantage to extending the proxies schema in the Zotero application to include what is known as a "proxy prefix".  Proxy prefix support is available for all of these platforms, and it simply appends encoded URIs (e.g. encodeURIComponent() output) to the prefix value:


If these proxy prefixes were stored alongside the scheme, then toProxy() would become trivial:

Zotero.Proxy.prototype.toProxy = function(uri) {
    if (typeof uri == "string") {
        uri = url.parse(uri);
        // If there's no path it is set to null, but we need
        // at least an empty string to avoid doing many checks
        uri.path = uri.path || '';
    }
    var loginURI = this.prefix+encodeURIComponent(uri.href);
    return loginURI;
}    
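As a standalone illustration of the prefix approach, here is a minimal sketch. The prefix value is an assumed example, and toProxiedURL() is a hypothetical helper, not part of the connector.

```javascript
// Hypothetical helper: prepend a stored proxy prefix to a percent-encoded target URL.
function toProxiedURL(prefix, targetURL) {
    // encodeURIComponent escapes ":", "/", "?" and "=" so that the target URL
    // survives as a single query-string value.
    return prefix + encodeURIComponent(targetURL);
}

const examplePrefix = 'https://proxy.example.edu/login?qurl='; // assumed value
const proxied = toProxiedURL(examplePrefix, 'https://www.vendor.com/path?arg1=value1');
console.log(proxied);
// https://proxy.example.edu/login?qurl=https%3A%2F%2Fwww.vendor.com%2Fpath%3Farg1%3Dvalue1
```

Because the target is fully encoded, its own query string cannot be confused with the proxy's login arguments.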

I have tested this with hard-coded values for the prefix, and this is one of the last issues that I am working through for adding full Muse Knowledge support.

In proxy.js, this is the style of change that would be needed at the Detector level:

EZProxy:

    if (loginHostIsProxiedHost && proxiedAndLoginPortsDiffer) {
        // Proxy by port
        proxy = new Zotero.Proxy({
            autoAssociate: false,
            scheme: proxiedURI.host+"/%p",
            prefix: loginURI.protocol+'//'+loginURI.host+'/login?qurl=',
            hosts: [properURI.host],
            dotsToHyphens: false
        });
    } else if (!loginHostIsProxiedHost && proxiedHostContainsProperHost) {
        // Proxy by host
        proxy = new Zotero.Proxy({
            autoAssociate: true,
            scheme: proxiedURI.host.replace(properURI.hostname, "%h")+"/%p",
            prefix: loginURI.protocol+'//'+loginURI.host+'/login?qurl=',
            hosts: [properURI.host],
            dotsToHyphens: true
        });
    }

Juniper:

    return new Zotero.Proxy({
        autoAssociate: true,
        scheme: m[1]+"/%d"+",DanaInfo=%h%a+%f",
        prefix: 'https://'+loginURI.host+'/dana/home/launch.cgi?url=',
        hosts: [m[3]]
    });

Muse Knowledge Proxy:

    return new Zotero.Proxy({
        autoAssociate: true,
        scheme: zoteroRegex,
        prefix: 'https://'+proxyHost+'/'+loginPath+'?'+loginArgs+'&qurl=',
        hosts: [targetURL.host],
        dotsToHyphens: dotsToHyphens
    });

I could make an educated guess at the OpenAthens and III WAM behaviors, but I have no access to those systems to ensure that it would work. (The Juniper prefix is going off memory and what I could find on the web, and should work.)

As a side-benefit of this, knowing these proxy prefixes means that users could also be given a choice of which proxy to use to retrieve their content in a future application release.

The use cases for this are:

Scenario #1:

A user is a doctoral student at University A and adjunct faculty at College B; there is no connection between these two institutions.  The user has one set of subscriptions available at University A and a different set at College B, with no overlap. However, all of the content comes from ProQuest, so the only host token available is search.proquest.com.  In this scenario, Zotero does not have the context it needs to know which proxy to select for that host.

Scenario #2:

A user attended a community college for their associate's degree, a state university for their baccalaureate, a private college for their master's, and a private university for their doctorate.  This user could have 4 completely different access methods available to them as an alumnus of 4 different institutions.  Again, if all 4 were ProQuest subscribers and each had access to different subscription content, there is not enough context from the hostname alone to make an intelligent decision about which proxy to use.

Looking at the schema in 5.0.89, I do not see any foreign key relationships between proxies and/or proxyHosts to itemData, so I do not see any current relationship made within the application between an item and a proxy.  Correct me if I am wrong, but there is an assumption currently being made that there is a 1:1 mapping between hostnames and proxies, which may not be valid for all cases, such as the two I gave above.

Also, some vendor platforms (mainly ebooks) have special authentication provisions in proxy servers that need to be activated.  These generally work with deep linking URLs, but there have been times over the years when these did not work correctly if the starting URL was not used.

And finally, to add one more wrinkle to the mix, proxy servers (even EZProxy) may be configured to have special entry points to accommodate different methods of authentication.  For example, a university uses SSO, so has SAML configured on their EZProxy server, but needs to have non-SSO user access (guest patrons, visiting scholars, etc.) available.   One way to do this is to pass "auth=shibboleth" into the EZProxy login so that the authentication processing can branch into the correct authentication path.  (See Mixing Traditional EZproxy Authentication with Shibboleth).   I do not believe this is exposed by the current XHR method of proxy discovery for EZProxy; instead, the generic '/login?[q]url=' Location header is returned, with no arguments that would preserve the authentication entry point.

Muse Knowledge Proxy also supports this kind of authentication branching logic, via a "groupID" parameter; however, it does expose the groupID in the XHR request, so that it is available to the .learn() function for that platform and would be available as part of the stored prefix today.
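One way a user-tweaked prefix could carry such an authentication argument is sketched below. This is only an illustration under assumptions: addLoginArg() is a hypothetical helper, the prefix is assumed to end in "qurl=", and the exact argument order a given proxy server expects is not confirmed here.

```javascript
// Hypothetical helper: insert an extra login argument ahead of the trailing
// "qurl=" so that the encoded target URL can still be appended last.
function addLoginArg(prefix, name, value) {
    const marker = 'qurl=';
    if (!prefix.endsWith(marker)) {
        throw new Error('prefix is expected to end with "' + marker + '"');
    }
    const head = prefix.slice(0, -marker.length);
    return head + encodeURIComponent(name) + '=' + encodeURIComponent(value) + '&' + marker;
}

const ssoPrefix = addLoginArg('https://proxy.example.edu/login?qurl=', 'auth', 'shibboleth');
console.log(ssoPrefix);
// https://proxy.example.edu/login?auth=shibboleth&qurl=
```

The same helper would work for a prefix that already carries other arguments, since it only touches the text immediately before "qurl=".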

From my investigation so far, here is what I think would need to happen to at least support proxy prefixes in the connector:

1) Update the schema in the application to:

CREATE TABLE proxies (
    proxyID INTEGER PRIMARY KEY,
    multiHost INT,
    autoAssociate INT,
    scheme TEXT,
    prefix TEXT
);

Discussion: should there be a flag for active/inactive status so that users can de-select proxies without deleting them?  Use case: during hiatus between terms when classes are offered, adjunct faculty may lose access to the proxy server when not actively teaching a course.

2) Update src/common/proxy.js with prefix support.  I have a good start on this, but found that I need application support to store the information permanently before I can continue.

This includes .toProxy(), the Detectors, and probably a few other places I have not found yet.

3) Update the connector preferences to expose both Scheme and Prefix values in the UI. I have not located this code yet, but it should be more or less a duplicate of the Scheme handling.  This would allow users to tweak the proxy prefix as required (e.g. the auth=shibboleth case).

Thoughts?

Andrew

Dan Stillman

Aug 18, 2020, 1:37:46 AM
to zoter...@googlegroups.com
On 8/17/20 10:53 PM, Andrew Anderson wrote:
> As a side-benefit of this, knowing these proxy prefixes means that
> users could also be given a choice of which proxy to use to retrieve
> their content in a future application release.

As I mentioned in the forums, this is really a separate issue — you're
seeing it as related only because your proxy server isn't currently
supported. Zotero already stores URLs unproxied for supported proxies,
and that will be the case for any additional proxy servers we add
support for. A user could, currently, have the same host entered for
more than one EZproxy. Right now the behavior is probably undefined in
that case, and there aren't great alternatives. Prefix support wouldn't
change that.

The Zotero app is irrelevant here. Proxy redirection is a Connector
feature, not a Zotero feature, and it applies whenever a user loads a
URL — there's no guarantee or assumption that the URL is being opened
from Zotero, which may or may not even be open or even installed. (The
'proxies' table in the app is mostly a holdover from Zotero for Firefox.
It might still be used to share proxy data between Connector versions —
I can't remember if that's currently hooked up — but otherwise is not
exposed or used in the app and doesn't really make sense there.)

In some browsers we might be able to prompt for the proxy to redirect
through when there's more than one possibility for a host, but 1) that's
sort of annoying, particularly since the user may not know which proxy
would work ahead of time and it's entirely possible more than one would
without needing to bother the user with it and 2) it won't work in at
least Safari 14, which won't support blocking web request interception.

One option would be to make Reload via Proxy (which already shows all
available proxies) a bit smarter so that you could use it immediately
after a proxy redirection to reload the just-redirected URL via a
different proxy if the first one was unsuccessful, and then either move
the host from one to the other so that it's tried first next time or, if
we add temporary proxy disabling, assign a priority so that it tries the
most recently used enabled proxy for that host.

Andrew Anderson

Aug 18, 2020, 4:01:48 PM
to zoter...@googlegroups.com
On Tue, Aug 18, 2020 at 1:37 AM Dan Stillman <dsti...@zotero.org> wrote:
> On 8/17/20 10:53 PM, Andrew Anderson wrote:
> > As a side-benefit of this, knowing these proxy prefixes means that
> > users could also be given a choice of which proxy to use to retrieve
> > their content in a future application release.
>
> As I mentioned in the forums, this is really a separate issue — you're
> seeing it as related only because your proxy server isn't currently
> supported.

Once the .toProxy() piece is sorted out, I think several other parts of the URL handling will start working as intended.  I just have a different idea on URL retrieval that I am not yet communicating well, I think.
 
> Zotero already stores URLs unproxied for supported proxies,
> and that will be the case for any additional proxy servers we add
> support for.

Absolutely agree with this.
 
> A user could, currently, have the same host entered for
> more than one EZproxy. Right now the behavior is probably undefined in
> that case, and there aren't great alternatives. Prefix support wouldn't
> change that.

Yes, it is separate, but related, and prefix support in the application could change the behavior to be defined and robust.  I suspect that the disconnect that we are having right now is that I believe you are looking to the connector and its proxy/host mapping to do all the heavy lifting, while I am proposing to make the application proxy prefix aware, while still maintaining the unproxied URLs in the item records, and putting control of how the item URL is loaded in the hands of the user.  The majority of users would only have 1 proxy, and would not know/care about this, but for power users this could be very useful.

In the application right now, there is the "View Online" context menu option for the items.  This works if the user does not require a proxy (public library GeoIP use case), uses IP access or a transparent/browser configured proxy (corporate/academic campus use case), or has only 1 rewriting proxy configured (current connector use case that assumes a 1:1 host/proxy mapping).  If a user has more than 1 proxy, or has a mix of use cases (uses some items from the public library, some items from corporate/campus, some items from a college with a rewriting proxy), then this model breaks down.

What I'm suggesting is to take a look at the "View Online" menu option handling in the application.  What if it were "View Online" with a sibling menu item of "Load via ...", with a list of proxy servers available as a sub-menu?  Then the user would have the option of loading the URL directly via "View Online", or via a known proxy through "Load via...", and THAT is where knowledge of proxy prefixes in the application comes into play.  To keep the impact even more minimal, the "Load via..." item and sub-menu could only be added and/or activated if there are proxies actually configured.  Think of this as an application-side, proactive "Reload via Proxy" option that prefixes and encodes the URL in one step, instead of requiring the user to load the URL, then "Reload via Proxy" as a second step.

Scenario #1: No proxies; same behavior as today

View Online
-------------
Show in Library
...

Scenario #2: Only one proxy; add "Load via" with proxy label

View Online
Load via <proxy label>
--------------------------
Show in Library
...

Scenario #3: Multiple proxies; add "Load via..." with proxy labels in sub-menu

View Online
Load via ->
    <proxy1 label>
    <proxy2 label>
--------------------
Show in Library
...
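The three scenarios above could be derived from the list of configured proxies roughly as follows. This is a sketch only: buildViewMenu() and the shape of the menu items are hypothetical, and it assumes each proxy object carries a label.

```javascript
// Hypothetical sketch: derive the context-menu entries from the configured proxies.
function buildViewMenu(proxies) {
    const items = [{ label: 'View Online' }];
    if (proxies.length === 1) {
        // Scenario #2: a single proxy gets its own top-level entry.
        items.push({ label: 'Load via ' + proxies[0].label, proxy: proxies[0] });
    } else if (proxies.length > 1) {
        // Scenario #3: several proxies go into a sub-menu.
        items.push({
            label: 'Load via',
            submenu: proxies.map((proxy) => ({ label: proxy.label, proxy }))
        });
    }
    // Scenario #1: with no proxies, only "View Online" is returned.
    return items;
}

console.log(buildViewMenu([{ label: 'School' }, { label: 'Work' }]));
```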
 
> The Zotero app is irrelevant here. Proxy redirection is a Connector
> feature, not a Zotero feature, and it applies whenever a user loads a
> URL — there's no guarantee or assumption that the URL is being opened
> from Zotero, which may or may not even be open or even installed.

OK, I missed that then, because I was looking at the callMethod('proxies') and thought that the connector was talking to the application to pull the proxy data from the proxies table in sqlite. Now that you pointed that out, I see the proxies array in the browser preferences, which is nice for connector changes, because this means that the data there is not constrained by a SQL schema elsewhere, and gives more freedom in making changes to that structure.
 
> (The 'proxies' table in the app is mostly a holdover from Zotero for Firefox.
> It might still be used to share proxy data between Connector versions —
> I can't remember if that's currently hooked up — but otherwise is not
> exposed or used in the app and doesn't really make sense there.)

So for proxy prefix support to work in the application, this would need to be revived and updated so that the connector could educate the application on what proxies it encounters, and the application could present the proxy list to the user.  If this is revived, I would suggest just storing the JSON arrays in SQLite so that the application will have the same flexibility as the connector does.  If there is absolutely no desire for proxy prefix support in the application as I propose above, then I will focus my energies on the connector instead.
 
> In some browsers we might be able to prompt for the proxy to redirect
> through when there's more than one possibility for a host, but 1) that's
> sort of annoying, particularly since the user may not know which proxy
> would work ahead of time and it's entirely possible more than one would
> without needing to bother the user with it and 2) it won't work in at
> least Safari 14, which won't support blocking web request interception.

Right, my intent for user choice was more focused on the application side when going to pull up a saved item, and less so on the browser connector.  Like you, I think trying to intercept the URL load event and present a "how would you like to load this?" dialog for every known host is a bad idea, and not what I was proposing at all.  The "Reload via Proxy" concept is better, but could benefit from some tuning.
 
> One option would be to make Reload via Proxy (which already shows all
> available proxies) a bit smarter so that you could use it immediately
> after a proxy redirection to reload the just-redirected URL via a
> different proxy if the first one was unsuccessful, and then either move
> the host from one to the other so that it's tried first next time or, if
> we add temporary proxy disabling, assign a priority so that it tries the
> most recently used enabled proxy for that host.

Reload via Proxy does need to show more context data for the multi-tenant cases (Muse Knowledge and OpenAthens) so that the user can differentiate between them in the machine-generated labels.  I think that the label also needs to be attached to the proxy object, just like scheme is, and not built dynamically the way that it is today, because a lot of the information that is needed for that is only available within the Listener functions.  The label should also be made editable in the same way as scheme is, so that the machine-generated labels can be tweaked by the user to have more meaning for them (e.g. "Work", "School", etc.), as well as for prefixes that would need to be manually configured, as in the OpenAthens redirector case.

I still think the assumption that there is a 1:1 mapping between proxies and hosts is a bad idea, though, and I will cite the aggregator platforms as one of the primary reasons why I feel this way.  Use case: A user has access to EBSCO, ProQuest, or Gale via state library contracts for the public library, and also has access to those same platforms via university library subscriptions.  Which databases are proxied and which ones are not?  Depending on the terms of the state contract, loading public library resources via proxy may or may not work, and loading university resources without the proxy will not work, but you cannot tell this based on the hostname alone.  If the code tries to be too smart in this case, it is going to be adding/removing hosts every time the user switches between public library and university library resources, and the connector behavior will seem inconsistent or random to the user as a result.

The priority concept is interesting, but the potential for complexity concerns me there, as I assume that it would involve trying to load the content and sniffing the resulting status code to determine if the load was successful (2xx) or not (non-2xx), and would run afoul of authentication redirection for initial page loads when the user does not already have a proxy session established.  In the redirection case, there would be no way to know at that moment if the content would have been successfully loaded or not, and could present as a false failure.

I think the proxy/host mapping needs some more thought, but it is beyond the scope of what I hope to achieve at the moment; it can be revisited after Muse Knowledge support and the other changes land.  Putting aside the application-side proxy prefix support discussion for the moment, are there any other concerns or questions about the changes I propose to add prefix support to the connector and improve the data and labeling for Reload via Proxy?  If not, I will continue to work on those two areas and submit a pull request once I think the code is close to integration ready.

Dan Stillman

Aug 19, 2020, 12:22:30 AM
to zoter...@googlegroups.com
On 8/18/20 4:01 PM, Andrew Anderson wrote:
> > A user could, currently, have the same host entered for
> > more than one EZproxy. Right now the behavior is probably undefined in
> > that case, and there aren't great alternatives. Prefix support wouldn't
> > change that.
>
> Yes, it is separate, but related, and prefix support in the application could change the behavior to be defined and robust.  I suspect that the disconnect that we are having right now is that I believe you are looking to the connector and its proxy/host mapping to do all the heavy lifting, while I am proposing to make the application proxy prefix aware, while still maintaining the unproxied URLs in the item records, and putting control of how the item URL is loaded in the hands of the user.

It's not that there's no prefix support in the application currently. There's no web proxy support in the application at all. Zotero's web proxy support is a Zotero Connector feature. You don't need Zotero open or installed to use it. The point of the feature is not just to open URLs from Zotero items. It's to redirect unproxied URLs through configured proxies when you load them in your browser, whether you open them from Zotero or follow a link from Twitter or click a link on a webpage or resolve a DOI that redirects to a publisher URL. It wouldn't make sense to create some confusing situation where there are different ways to load things through proxies in different places. The browser is where URLs are loaded, and that's where proxy redirection belongs.


> I still think the assumption that there is a 1:1 mapping between proxies and hosts is a bad idea, though

This just doesn't come up for most of our users, which is why it works the way it does, but I've already suggested several different ways that we could support having hosts associated with multiple proxies, so I'm not sure what you're responding to here. There's no magic option here — either we need to ask on each load or we need to automatically use the last-used proxy for a given host, and make it easy to choose another (possibly with "always" and "this time" options).


> The priority concept is interesting, but the potential for complexity concerns me there, as I assume that it would involve trying to load the content and sniffing the resulting status code to determine if the load was successful (2xx) or not (non-2xx), and would run afoul of authentication redirection for initial page loads when the user does not already have a proxy session established.  In the redirection case, there would be no way to know at that moment if the content would have been successfully loaded or not, and could present as a false failure.

No, it wouldn't try to look at the status code. The priority would only be changed when the user chose to reload via another proxy ("always", if that's a separate option).
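A rough sketch of that user-driven priority model, under stated assumptions: HostProxyPriority, its method names, and the use of plain proxy IDs are all hypothetical, and nothing here inspects HTTP status codes.

```javascript
// Hypothetical in-memory model: each host maps to an ordered list of proxy IDs,
// and the first enabled entry is the one tried first.
class HostProxyPriority {
    constructor() {
        this.byHost = new Map();
    }

    // Record that a proxy is known to serve a host (appended at lowest priority).
    associate(host, proxyID) {
        const list = this.byHost.get(host) || [];
        if (!list.includes(proxyID)) list.push(proxyID);
        this.byHost.set(host, list);
    }

    // Called only when the user explicitly reloads via another proxy.
    promote(host, proxyID) {
        const list = (this.byHost.get(host) || []).filter((id) => id !== proxyID);
        list.unshift(proxyID);
        this.byHost.set(host, list);
    }

    // Returns the highest-priority enabled proxy for the host, or undefined.
    preferred(host, isEnabled = () => true) {
        return (this.byHost.get(host) || []).find(isEnabled);
    }
}

const priorities = new HostProxyPriority();
priorities.associate('search.proquest.com', 'universityA');
priorities.associate('search.proquest.com', 'collegeB');
priorities.promote('search.proquest.com', 'collegeB');
console.log(priorities.preferred('search.proquest.com')); // collegeB
```

The isEnabled callback is where temporary proxy disabling could plug in: a disabled proxy keeps its position but is skipped until it is enabled again.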

Adomas Venčkauskas

Aug 19, 2020, 8:14:25 AM
to zotero-dev
Sorry it took me a while to jump into this discussion!

So I've always known that the proxy situation in Zotero could be improved to support a greater variety of schemas (and thus proxies), but I've only ever had access to hostname-based EZProxy for testing, and there have barely been any requests or questions about proxy support from non-EZProxy users (most with rewrite schemas too crazy for us to bother with handling). To be frank, this is likely because the proxy functionality in the Connector is essentially not advertised at all and is supposed to "just work" for a casual user, and in most cases it either does and those users are happy, or it does not, because proxies are not properly detected and those users are not even aware of its existence.

So the first priority of improvement in the proxy department is certainly getting more proxy providers to "just work" with the Connector. This requires 2 separate steps:
1. Implementing the detection logic. We could do it from various different proxy docs, but without having a way to test the logic it's a bit of a pain. It's also never been clear to me which proxy providers should be prioritized, given that we do not really get many consistent requests from our users for any particular one.
2. We'd need to be able to handle more complex schemas better. I have never given it much thought other than that we'd need to support an additional schema field for entry-points vs already proxied urls, although I had never experimented code-wise whether that would work well.

Thanks for doing all the digging and thinking on this, and confirming that this would indeed work. We'd certainly take a PR for this as well as any detection code for different proxy vendors.

As Dan said above, the proxy code in the client has misled you. It is left over from the Zotero for Firefox days: upon switching all the Zotero for Firefox users to Standalone + Connector, we added the proxy functionality to the Connectors, as well as some code to pull existing Zotero for Firefox proxies from the client database into the Connector for an easier user transition. That's the callMethod('proxies') call you saw in the connector codebase. The proxy code in the client also has one other use: item saves from the Connector may pass a proxy option, which is instantiated in the client to deproxify the final item URL as well as any attachment URLs. The reason we're doing the final deproxying in the client rather than passing unproxied URLs from the connector is that the client needs to load snapshot and PDF pages separately, and the proxied URL along with cookies from the Connector are used to retrieve that content.

So the way to add this would be to modify the relevant proxy.js code and the proxy UI code to include the new option in the Connector. I also think adding a Label field would be useful, which would then need to be used in the UI update code. The changes to toProper() would also need to be made in the proxy.js file for the client (and _loadFromRow(), see the note for db query throw for the dotsToHyphens property) such that deproxying would work fine in the client from the passed-in "temporary" proxies from the Connector for item saves.

Do note that, at least from our past experience, getting proxy support right is hard, especially when going by the spec, as corner cases or undocumented vendor behaviours crop up. Changes to this code will need a beta-testing period and likely some fine-tuning once they're out for release. So if you're willing to contribute a PR, it would be nice to know that you'll be able to support it through to a polished version, with minor fixes in response to user bug reports (if any) for proxies that we do not have access to (Muse, etc.).

------

As for other, more advanced proxy uses:
1. Opening links from Zotero via a proxy could be a nice feature, but given how fairly niche it would be (i.e. only really ever useful for users with multiple proxies) and the required additional code complexity (either attaching items to a proxy schema, or sharing connector proxies with the client) as well as context menu complexity creep, I'm not really in favour of it. An extra page-load and/or a few clicks in the connector seems like an absolutely reasonable compromise.
2. Support for multiple proxies providing partial access on the same host. It would be nice to handle this more gracefully, but we'd rather avoid an overly complex solution, since this is once again rather niche and would eventually become code that requires maintenance without the main Zotero developers being able to test it properly. So something like what Dan suggests, where proxies for a host are prioritized based on which one the user chose to load via last, seems like a good solution. Anything that would require sniffing request content or requests seems like overkill.