Hello, I am trying to use a Globus public link with the genome browser. I was made aware of this conversation:
https://groups.google.com/a/soe.ucsc.edu/g/genome/c/_a6HS_nOgco?pli=1
It seems to show that the latest version of Globus, which I am running, will allow a link to be directly available with no redirects that confuse the genome browser tool. In fact, if I run this command in a shell I get:
$ curl https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt
hub Helena_Ovol1
shortLabel Ovol1
longLabel 5umchir_cas9_gfpsgrna
…
…
If I use ‘wget’ to get the file it is downloaded with all the proper formatting as well.
However, hubCheck gives an error:
$ ./hubCheck https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt
Found 1 problem:
Missing required setting 'hub' from https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt
The web browser gives the same error. Has anyone found a way to work around this? The Globus collection link seems to be working fine, but hubCheck and the web tool seem to be getting tripped up by something I can’t figure out. Thanks
Hello Chris,
Thank you for writing to Genome Browser support and we appreciate your patience with this delayed answer.
It appears that there are a few unusual server configurations with the Globus site that are causing our site to give an error when retrieving the data. That error is somewhat misleading and is a sign that data is not being sent from Globus successfully. This is probably something you will need to discuss with the Globus Network Administrators. Here are the issues our engineers found and their comments:
If the user can remove the redirection, or replace it with redirection directly in the Apache HTTP server's header response, it will work better.
It is also modifying the URL by adding "?download=1" to the URL, which is not expected and could confuse our processing code.
http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html#Hosting
I hope this was helpful. If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.
All the best,
Daniel Schmelter
UCSC Genome Browser
--
---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/BY5PR07MB68856223C282D31A489EAE9ECAA29%40BY5PR07MB6885.namprd07.prod.outlook.com.
Hello- I’ve been continuing discussions with the Globus group at U of Chicago regarding using URLs to load hub files. I got this from their technical support. It’s interesting, I’d like to see if anyone has run into this:
QUOTE -----
Chris-
Looks like hubCheck uses the user agent genome.ucsc.edu/net.c which we added support for a while back. That user agent (along with curl/ and Wget/) allow access to the data without the redirect. So the correct test to validate access via curl (but still mimicking hubCheck) is:
$ curl -A "genome.ucsc.edu/net.c" https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt
descriptionUrl https://g-60ce76.a78b8.36fe.data.globus.org/test.html
hub Helena_Ovol1
shortLabel Ovol1
longLabel 5umchir_cas9_gfpsgrna
genomesFile https://g-60ce76.a78b8.36fe.data.globus.org/globus/genomes.txt
email hbug...@usc.edu
That seems to work fine. What I'm not sure of is what else hubCheck is doing and why it is reporting this error:
$ ./hubCheck -verbose=2 https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt
### kent source version 420 ###
Found 1 problem:
Missing required setting 'hub' from https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt
END QUOTE ------
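The user-agent behaviour described in the quote above can be sketched as a small shell helper. This is only an illustration: the function name bypasses_redirect is hypothetical, and prefix matching is an assumption, since the quote does not say how Globus actually matches user agents.

```sh
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the given User-Agent string starts
# with one of the three agents that, per the Globus reply quoted above, are
# served the file directly instead of a redirect.
# Prefix matching is an assumption, not documented Globus behaviour.
bypasses_redirect() {
    case "$1" in
        genome.ucsc.edu/net.c*|curl/*|Wget/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: report how a few example agents would be treated under this assumption.
for ua in "genome.ucsc.edu/net.c" "curl/7.79.1" "Mozilla/5.0"; do
    if bypasses_redirect "$ua"; then
        echo "$ua: direct"
    else
        echo "$ua: redirect"
    fi
done
```

If this assumption holds, a plain curl test and a hubCheck run should both be served the file without a redirect, which is why the redirect alone may not explain the hubCheck error.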
Hello, Chris.
Thank you for your interest in the Genome Browser and sending your inquiry.
Globus seems to be having issues with loading data. We tried to load the bam data in the hub as a custom track on mm10 (genome.ucsc.edu/cgi-bin/hgCustom?db=mm10), and it gives the following error:
Error : failed to read index file (.bai) corresponding to https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam
We downloaded the files then hosted the data on our development server. The data is formatted correctly since the bam and bam.bai data on mm10 can be seen in the following session:
http://genome.ucsc.edu/s/brianlee/DataHostedOnDev
An engineer of ours shares that it looks like a certificates issue:
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Gerardo Perez
UCSC Genomics Institute
That’s strange- I’m not getting that when I look at the certificate:
$ curl --insecure -vvI https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam.bai 2>&1 | awk 'BEGIN { cert=0 } /^\* SSL connection/ { cert=1 } /^\*/ { if (cert) print }'
* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
* Server certificate:
* subject: CN=a78b8.36fe.data.globus.org
* start date: Aug 09 06:10:08 2021 GMT
* expire date: Nov 07 06:10:06 2021 GMT
* common name: a78b8.36fe.data.globus.org
* issuer: CN=R3,O=Let's Encrypt,C=US
* Connection #0 to host g-60ce76.a78b8.36fe.data.globus.org left intact
I’ll see what the people at U of Chicago say. Thanks,
Chris
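For anyone repeating the certificate check above, a single field can be pulled out of the verbose curl output with a short filter. This is a minimal sketch: the function name cert_field is hypothetical, and it assumes curl prints certificate details as "* name: value" lines, as in the output shown above.

```sh
#!/bin/sh
# Hypothetical helper: read `curl -v` output on stdin and print the value of
# one "* name: value" line, e.g. the certificate's expire date.
# $1 is the field name exactly as curl prints it (must not contain "/").
cert_field() {
    sed -n "s/^\* *$1: *//p"
}

# Usage (not run here), reusing the URL from the message above:
#   curl --insecure -vI "https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam.bai" 2>&1 \
#       | cert_field "expire date"
```

Comparing the printed expire date with the current date is a quick way to rule out an expired certificate before looking at client-side causes.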
I guess it's some problem with wget 1.14. It works with 1.20 – would you be able to confirm?
$ wget https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam.bai
--2021-10-06 12:45:59-- https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam.bai
Resolving g-60ce76.a78b8.36fe.data.globus.org (g-60ce76.a78b8.36fe.data.globus.org)... 68.181.11.4, 68.181.11.5
Connecting to g-60ce76.a78b8.36fe.data.globus.org (g-60ce76.a78b8.36fe.data.globus.org)|68.181.11.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘9_831_Six2TGC_MecomCCf_ucsc.bam.bai.2’
9_831_Six2TGC_MecomCCf_ucsc.bam.bai.2 [ <=> ] 15.12K --.-KB/s in 0.007s
2021-10-06 12:46:00 (2.23 MB/s) - ‘9_831_Six2TGC_MecomCCf_ucsc.bam.bai.2’ saved [15480]
$ wget -V
GNU Wget 1.20.3 built on linux-gnu.
I’ve been doing some more work with Globus at U. of Chicago. They used the hubCheck binary to go to the URL provided and it seems to be getting the right data. Here is what Globus tech support said:
…..quote…..
Jasonalt, Oct 11, 2021, 11:24 CDT:
I can confirm with the hubCheck binary (using strace) that hubCheck is not receiving a redirect, it is getting the contents of the file. IE it sees:
\r\n--4c7b2264-5d9d-423e-a80e-e74467b14efe\r\nContent-Type: text/plain\r\nContent-Range: 0-234/*\r\n\r\ndescriptionUrl https://g-60ce76.a78b8.36fe.data.globus.org/test.html\nhub Helena_Ovol1\nshortLabel Ovol1\nlongLabel 5umchir_cas9_gfpsgrna\ngenome
-Jason
……quote…..
Content-Disposition: inline
Connection: close
Content-Type: multipart/byteranges; boundary=c3b7e1e9-6586-426d-9459-96a41dd950e7

--c3b7e1e9-6586-426d-9459-96a41dd950e7
Content-Type: text/plain
Content-Range: 0-5/*

descri
--c3b7e1e9-6586-426d-9459-96a41dd950e7--
They should not be using a multipart response for a single-range request, which is all we ever send.
Multipart is the only way to satisfy a client that asks for multiple ranges in a single request,
but we never ask for more than one range at a time, so multipart is not appropriate here
and cannot be made to work. Our software asks for data at some offset,
then closes the connection once it has read all the data it wants, without reading the rest of the response.
Combining multipart with open-ended range requests is especially disastrous.
An open-ended range specifies the start but not the end.
100-500 has an end specified, so it is a closed range.
100- has no end specified, so a server using a multipart response will send everything from that offset to the end of the file, which can be nearly the entire file.
Imagine how inefficient this is for both their server
and ours: when we read a piece near the beginning of a file with 3GB of data,
the server tries to send back a huge, unwanted response of nearly 3GB.
That is too slow and inefficient to be usable.
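The difference between closed and open-ended ranges can be illustrated with a small classifier. This is a sketch only: the function name range_kind is hypothetical, and it looks at a single range spec (the part after "bytes=" in a Range header), nothing more.

```sh
#!/bin/sh
# Hypothetical helper: classify one byte-range spec.
#   "100-500" -> closed      (start and end given)
#   "100-"    -> open-ended  (start given, no end)
# Anything else is reported as "other".
range_kind() {
    case "$1" in
        ""|*[!0-9-]*|*-*-*) echo "other" ;;
        [0-9]*-[0-9]*)      echo "closed" ;;
        [0-9]*-)            echo "open-ended" ;;
        *)                  echo "other" ;;
    esac
}

range_kind "100-500"
range_kind "100-"

# With curl, the two kinds of request would look like this (not run here):
#   curl -r 100-500 "$url"   # closed range
#   curl -r 100-    "$url"   # open-ended range
```

A client that sends an open-ended range and then closes the connection early only works well if the server streams a plain single-part body.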
Globus needs to configure or program their HTTP server to give us a single, non-multipart response
optimized for open-ended byte-range requests.
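One way to check whether a server answers a single-range request with a multipart response is to inspect the response headers. The following is a minimal sketch, not part of any UCSC tool: the function name is_multipart_byteranges is hypothetical, and it only examines the Content-Type header passed to it on standard input.

```sh
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and succeed
# (exit 0) if the Content-Type is multipart/byteranges.
is_multipart_byteranges() {
    # Strip carriage returns, then match the header name case-insensitively.
    tr -d '\r' | grep -i -q '^content-type:[[:space:]]*multipart/byteranges'
}

# Usage (not run here): request a single small range with the hubCheck
# user agent and look only at the headers.
#   curl -s -o /dev/null -D - -A "genome.ucsc.edu/net.c" -r 0-5 \
#       "https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt" \
#       | is_multipart_byteranges && echo "multipart reply to a single range"
```

A server behaving as requested above would return the requested bytes directly as a single-part body, so this check would fail.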
Please let us know if you have more questions.
Thanks!
Galt Barber
Senior Software Engineer
UCSC Genome Browser