For the benefit of others, I'm bringing a discussion that started on the
forums to a better venue to hash out.
Apologies in advance for a very long post, but this is a complex topic. It all started with work that I am doing to add support for Muse Knowledge Proxy, a rewriting proxy that behaves similarly to EZProxy but uses very different rewriting schemes.
Following Dan's suggestion, I instrumented the browser connector a bit over the weekend to trace the execution of URL handling, and confirmed that the Proxy.scheme pattern that I have is correct and it is in fact doing the right thing, up until the point it enters Zotero.Proxy.prototype.toProxy(), and that is where things go sideways.
The pattern in use is https://%a-y-https-%h.proxy.example.edu/%p, and when this pattern is used in .toProxy(), the output is https://-y-https-www-vendor-com.proxy.example.edu/path/to/content, because that function does not have sufficient context to build a valid URL (in this case, it does not have any information available to stuff into the %a placeholder, thus skips the %a token completely and starts with a "-"). Based on what I see, any use of %a is not supported in .toProxy() at this time, nor do I think it would be reasonable to try to add it, as I will explain below.
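To make the failure concrete, here is a minimal sketch of scheme expansion. It is not the actual Zotero.Proxy code: the function name expandScheme and the shape of the params object are assumptions made purely for illustration. It only shows what the result looks like when nothing is available to fill %a.

```javascript
// Hypothetical illustration only; this is NOT the real Zotero.Proxy code.
// Expands %a, %h and %p in a scheme; a token with no value becomes "".
function expandScheme(scheme, params) {
    return scheme.replace(/%([ahp])/g, function (match, token) {
        var value = params[token];
        return value === undefined ? "" : value;
    });
}

var scheme = "https://%a-y-https-%h.proxy.example.edu/%p";

// What .toProxy() has to work with: the host (dots already converted
// to hyphens) and the path. Nothing is available for %a.
var withoutA = expandScheme(scheme, {
    h: "www-vendor-com",
    p: "path/to/content"
});

console.log(withoutA);
// https://-y-https-www-vendor-com.proxy.example.edu/path/to/content
```

The leading "-" in the hostname is simply the literal text that followed the empty %a token.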
First, a sidebar: here is a quick primer on how rewriting platforms other than EZProxy manage their URLs, in case you have not read the forum post or have not worked with those proxy platforms before, so that we have a common reference for further discussion:
I am differentiating between the supported proxy entry points (the Starting URL) and what Zotero will see in window.location.href once the content is loaded (the Location URL):
EZProxy (proxy by hostname):
Muse Knowledge Proxy (rewrite by host):
Muse Knowledge Proxy (rewrite by path):
Pulse Secure / Juniper SSL VPN: [from memory, but it should be close]
OpenAthens: [based on available docs]
Innovative Interfaces WAM: [based on available docs]
For EZProxy and WAM, it would be fairly simple to build a Starting URL from the information available to the .toProxy() function; however, the function would likely need to be updated to understand the WAM port prefix. But even with that, there are some BIG assumptions about authentication handling being made here that I will go into a little later.
For Muse Knowledge Proxy, an XHR technique almost identical to the existing one for EZProxy can be used to obtain the login URL, but that information is lost after the .learn() function that builds the Proxy.scheme has completed, and is not available in .toProxy().
For OpenAthens, loading the proxied link is going to take the user to a federation WAYF page to sign in using a SAMLRequest that cannot be re-used, nor do I see any way that one can divine which institution the user came from to construct a starting URL using their redirector entry point:
So, with that background information on the different access platforms available today, I believe that there is an advantage to extending the proxies schema in the Zotero application to include what is known as a "proxy prefix". Proxy prefix support is available for all of these platforms, and it simply appends encoded URIs (e.g. encodeURIComponent() output) to the prefix value:
If these proxy prefixes were stored alongside the scheme, then toProxy() would become trivial:
Zotero.Proxy.prototype.toProxy = function(uri) {
    if (typeof uri == "string") {
        uri = url.parse(uri);
        // If there's no path it is set to null, but we need
        // at least an empty string to avoid doing many checks
        uri.path = uri.path || '';
    }
    var loginURI = this.prefix+encodeURIComponent(uri.href);
    return loginURI;
}
I have tested this with hard-coded values for the prefix, and this is one of the last issues that I am working through for adding full Muse Knowledge support.
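For anyone who wants to try the idea outside the connector, here is a standalone sketch of the same prefix technique. The helper name buildLoginURL, the proxy hostname, and the target URL are all hypothetical; only the "prefix plus encodeURIComponent() output" rule comes from the discussion above.

```javascript
// Standalone sketch; buildLoginURL is a hypothetical helper, not Zotero API.
function buildLoginURL(prefix, targetURL) {
    // The whole target URL is percent-encoded and appended to the prefix.
    return prefix + encodeURIComponent(targetURL);
}

// Hypothetical prefix and target, for illustration only.
var loginURL = buildLoginURL(
    "https://proxy.example.edu/login?qurl=",
    "https://www.vendor.com/path/to/content?id=1&lang=en"
);

console.log(loginURL);
// https://proxy.example.edu/login?qurl=https%3A%2F%2Fwww.vendor.com%2Fpath%2Fto%2Fcontent%3Fid%3D1%26lang%3Den
```

Because the target is fully encoded, its own query string cannot be confused with the proxy's login arguments.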
In proxy.js, this is the style of change that would be needed at the Detector level:
EZProxy:
if (loginHostIsProxiedHost && proxiedAndLoginPortsDiffer) {
    // Proxy by port
    proxy = new Zotero.Proxy({
        autoAssociate: false,
        scheme: proxiedURI.host+"/%p",
        prefix: loginURI.protocol+'://'+loginURI.host+'/login?qurl=',
        hosts: [properURI.host],
        dotsToHyphens: false
    });
} else if (!loginHostIsProxiedHost && proxiedHostContainsProperHost) {
    // Proxy by host
    proxy = new Zotero.Proxy({
        autoAssociate: true,
        scheme: proxiedURI.host.replace(properURI.hostname, "%h")+"/%p",
        prefix: loginURI.protocol+'://'+loginURI.host+'/login?qurl=',
        hosts: [properURI.host],
        dotsToHyphens: true
    });
}
Juniper:
return new Zotero.Proxy({
    autoAssociate: true,
    scheme: m[1]+"/%d"+",DanaInfo=%h%a+%f",
    prefix: 'https://'+loginURI.host+'/dana/home/launch.cgi?url=',
    hosts: [m[3]]
});
Muse Knowledge Proxy:
return new Zotero.Proxy({
    autoAssociate: true,
    scheme: zoteroRegex,
    prefix: 'https://'+proxyHost+'/'+loginPath+'?'+loginArgs+'&qurl=',
    hosts: [targetURL.host],
    dotsToHyphens: dotsToHyphens
});
I could make an educated guess at the OpenAthens and III WAM behaviors, but I have no access to those systems to ensure that it would work. (The Juniper prefix is going off memory and what I could find on the web, and should work.)
As a side-benefit of this, knowing these proxy prefixes means that users could also be given a choice of which proxy to use to retrieve their content in a future application release.
The use cases for this are:
Scenario #1:
A user is a doctoral student at University A and an adjunct faculty member at College B; there is no connection between these two institutions. The user has one set of subscriptions available at University A and a different set available at College B, with no overlap. However, all of the content comes from ProQuest, so the only host token available is search.proquest.com. In this scenario, Zotero does not have the context it needs to know which proxy to select for that host.
Scenario #2:
A user attended a community college for their associate's degree, a state university for their baccalaureate, a private college for their master's, and a private university for their doctorate. This user could have 4 completely different access methods available to them as an alumnus of 4 different institutions. Again, if all 4 were ProQuest subscribers and each had access to different subscription content, there is not enough context in the hostname alone to make an intelligent decision about which proxy to use.
Looking at the schema in 5.0.89, I do not see any foreign key relationships from proxies or proxyHosts to itemData, so I do not see any current relationship made within the application between an item and a proxy. Correct me if I am wrong, but there is an assumption currently being made that there is a 1:1 mapping between hostnames and proxies, which may not be valid for all cases, such as the two I gave above.
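As a rough sketch of what a one-to-many relationship could look like, assuming prefixes were stored per proxy: everything below (the Map, the helper names, the institution hostnames) is hypothetical and is not how Zotero stores proxies today.

```javascript
// Hypothetical data model: one host can be associated with several proxies.
var proxiesByHost = new Map();

function associate(host, proxy) {
    if (!proxiesByHost.has(host)) {
        proxiesByHost.set(host, []);
    }
    proxiesByHost.get(host).push(proxy);
}

// Returns every proxy that could serve the host, so that the caller
// (or the user) can choose, instead of assuming exactly one match.
function candidatesFor(host) {
    return proxiesByHost.get(host) || [];
}

// Scenario #1: two unrelated institutions, same vendor host.
associate("search.proquest.com", {
    name: "University A",
    prefix: "https://proxy.univ-a.example.edu/login?qurl="
});
associate("search.proquest.com", {
    name: "College B",
    prefix: "https://proxy.college-b.example.edu/login?qurl="
});

var choices = candidatesFor("search.proquest.com").map(function (proxy) {
    return proxy.prefix + encodeURIComponent("https://search.proquest.com/");
});

console.log(choices);
```

With the prefix stored on each proxy, both login URLs can be built and offered to the user, rather than silently picking one.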
Also, some vendor platforms (mainly ebooks) have special authentication provisions in proxy servers that need to be activated. These generally work with deep linking URLs, but there have been times over the years when these did not work correctly if the starting URL was not used.
And finally, to add one more wrinkle to the mix, proxy servers (even EZProxy) may be configured to have special entry points to accommodate different methods of authentication. For example, a university uses SSO, so it has SAML configured on its EZProxy server, but it needs to have non-SSO user access (guest patrons, visiting scholars, etc.) available. One way to do this is to pass "auth=shibboleth" into the EZProxy login so that the authentication processing can branch into the correct authentication path. (See "Mixing Traditional EZproxy Authentication with Shibboleth".) This is something that I do not believe is exposed by the current XHR method of proxy discovery for EZProxy; instead, the generic '/login?[q]url=' Location header is returned, with no arguments that would preserve the authentication entry point.
Muse Knowledge Proxy also supports this kind of authentication branching logic, via a "groupID" parameter. However, it does expose the groupID in the XHR request, so it is available to the .learn() function for that platform and would be available as part of the stored prefix today.
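If the prefix were stored and editable, the auth=shibboleth branching could be expressed as a small tweak to the stored value. The sketch below assumes a prefix of the '/login?qurl=' shape; the helper name addPrefixArgument and the hostname are hypothetical, and I have not verified the resulting URL against a live EZProxy server.

```javascript
// Hypothetical helper: inserts an extra query argument at the start of a
// stored prefix's query string, so the trailing "qurl=" (or "url=")
// parameter stays last and the encoded target can still be appended.
function addPrefixArgument(prefix, name, value) {
    var queryStart = prefix.indexOf("?");
    if (queryStart === -1) {
        return prefix; // No query string: leave the prefix untouched.
    }
    var argument = encodeURIComponent(name) + "=" + encodeURIComponent(value) + "&";
    return prefix.slice(0, queryStart + 1) + argument + prefix.slice(queryStart + 1);
}

var ssoPrefix = addPrefixArgument(
    "https://proxy.example.edu/login?qurl=",
    "auth",
    "shibboleth"
);

console.log(ssoPrefix);
// https://proxy.example.edu/login?auth=shibboleth&qurl=
```

A user who needs the non-SSO path would simply keep the unmodified prefix.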
From my investigation so far, here is what I think would need to happen to at least support proxy prefixes in the connector:
1) Update the schema in the application to:
CREATE TABLE proxies (
    proxyID INTEGER PRIMARY KEY,
    multiHost INT,
    autoAssociate INT,
    scheme TEXT,
    prefix TEXT
);
Discussion: should there be a flag for active/inactive status so that users can de-select proxies without deleting them? Use case: during hiatus between terms when classes are offered, adjunct faculty may lose access to the proxy server when not actively teaching a course.
2) Update src/common/proxy.js with prefix support. I have a good start on this, but found that I need application support to store the information permanently before I can continue.
This includes .toProxy(), the Detectors, and probably a few other places I have not found yet.
3) Update the connector preferences to expose both Scheme and Prefix values in the UI. I have not located this code yet, but it should be more or less a duplicate of the Scheme handling. This would allow users to tweak the proxy prefix as required (e.g. the auth=shibboleth case).
Thoughts?
Andrew