RE: Genome Browser and Globus collection links

Chris Taylor

Sep 22, 2021, 3:18:27 PM

to gen...@soe.ucsc.edu

Hello, I am trying to use a Globus public link with the genome browser. I was made aware of this conversation:

https://groups.google.com/a/soe.ucsc.edu/g/genome/c/_a6HS_nOgco?pli=1

It suggests that the latest version of Globus, which I am running, serves a link directly, without the redirects that confuse the Genome Browser. Indeed, if I run this command in a shell I get:

$ curl https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt

hub Helena_Ovol1

shortLabel Ovol1

longLabel 5umchir_cas9_gfpsgrna

…

If I use ‘wget’ to get the file it is downloaded with all the proper formatting as well.

However, hubCheck gives an error:

$ ./hubCheck https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt

Found 1 problem:

Missing required setting 'hub' from https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt

The Genome Browser web interface gives the same error. Has anyone found a way to work around this? The Globus collection link seems to be working fine, but hubCheck and the web tool are getting tripped up by something I can’t figure out. Thanks
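For readers following along, the check that fails here can be approximated offline. The sketch below is a minimal Python illustration, not hubCheck's actual code: it reads the first blank-line-delimited stanza of a hub.txt and reports which settings are missing. The helper names and the REQUIRED_SETTINGS list are assumptions made for this example; the sample text is the hub.txt content quoted in this thread, with the email address truncated as archived.

```python
# Hypothetical helpers for illustration only; this is not hubCheck's implementation.
# Assumption: these are the settings a hub.txt is expected to carry.
REQUIRED_SETTINGS = ("hub", "shortLabel", "longLabel", "genomesFile", "email")


def first_stanza(text):
    """Return the first blank-line-delimited stanza as a dict of setting -> value."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            if settings:
                break      # a blank line ends the stanza
            continue       # skip leading blank lines
        if line.startswith("#"):
            continue       # skip comment lines
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings


def missing_settings(text):
    """List the required settings that the first stanza does not define."""
    found = first_stanza(text)
    return [name for name in REQUIRED_SETTINGS if name not in found]


hub_txt = """descriptionUrl https://g-60ce76.a78b8.36fe.data.globus.org/test.html
hub Helena_Ovol1
shortLabel Ovol1
longLabel 5umchir_cas9_gfpsgrna
genomesFile https://g-60ce76.a78b8.36fe.data.globus.org/globus/genomes.txt
email hbug...@usc.edu
"""

print(missing_settings(hub_txt))
```

If the text handed to such a check is not the raw file contents, a "missing" report can appear even though the file on disk is well formed.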

Dan Schmelter

Sep 23, 2021, 8:39:09 PM

to Chris Taylor, gen...@soe.ucsc.edu

Hello Chris,

Thank you for writing to Genome Browser support and we appreciate your patience with this delayed answer.

It appears that a few unusual server configurations on the Globus site are causing our site to give an error when retrieving the data. The error message is somewhat misleading: it is a sign that the data is not being received from Globus successfully. This is probably something you will need to discuss with the Globus network administrators. Here are the issues our engineers found and their comments:

The Globus site uses a JavaScript redirect, which our networking client library does not support.

If the redirect can be removed, or replaced with a redirect sent directly in the HTTP response header (for example, by the Apache HTTP Server), it will work better.

The server changes the data based on the user-agent, changing the URL.

It is also modifying the URL by adding "?download=1" to the URL, which is not expected and could confuse our processing code.

$ curl -A "Mozilla/5.0" 'https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt'
<html><head><title>Redirecting to data</title><head><body>Downloading <a href="https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt?download=1">/globus/hub.txt</a><script>window.location.replace("https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt?download=1");</script></body>

They're using a distributed multi-system server rather than a normal web server, and it may be too slow to be compatible with the Browser. This is similar to something like Google Drive, which doesn't work as a hosting site. If you would like to use a different hosting service, please refer to our guide to data hosting for the Genome Browser:

http://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html#Hosting
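The JavaScript redirect described above can be detected programmatically, which may help when testing how a host responds to different user agents. A minimal Python sketch follows; the function name and regular expression are hypothetical, and the sample body is abbreviated from the curl output quoted above.

```python
import re

# Hypothetical pattern: matches window.location.replace("...") and captures the target URL.
JS_REDIRECT = re.compile(r'window\.location\.replace\(\s*"\s*([^"]+?)\s*"\s*\)')


def find_js_redirect(body):
    """Return the target of a window.location.replace() call, or None if absent."""
    match = JS_REDIRECT.search(body)
    return match.group(1) if match else None


redirect_page = (
    "<html><head><title>Redirecting to data</title><head><body>"
    '<script>window.location.replace("'
    "https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt?download=1"
    '");</script></body>'
)

print(find_js_redirect(redirect_page))
print(find_js_redirect("hub Helena_Ovol1\nshortLabel Ovol1\n"))
```

A client that does not execute JavaScript receives only this HTML page, so a check like this one distinguishes "got a redirect page" from "got the hub file".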

I hope this was helpful. If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.

All the best,

Daniel Schmelter
UCSC Genome Browser

--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/BY5PR07MB68856223C282D31A489EAE9ECAA29%40BY5PR07MB6885.namprd07.prod.outlook.com.

Chris Taylor

Sep 30, 2021, 4:09:33 PM

to Dan Schmelter, gen...@soe.ucsc.edu

Hello- I’ve been continuing discussions with the Globus group at U of Chicago regarding using URLs to load hub files. I got this from their technical support. It’s interesting, and I’d like to know whether anyone has run into this:

QUOTE -----

Chris-

Looks like hubCheck uses the user agent genome.ucsc.edu/net.c, which we added support for a while back. That user agent (along with curl/ and Wget/) allows access to the data without the redirect. So the correct test to validate access via curl (but still mimicking hubCheck) is:

$ curl -A "genome.ucsc.edu/net.c" https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt

descriptionUrl https://g-60ce76.a78b8.36fe.data.globus.org/test.html

hub Helena_Ovol1

shortLabel Ovol1

longLabel 5umchir_cas9_gfpsgrna

genomesFile https://g-60ce76.a78b8.36fe.data.globus.org/globus/genomes.txt

email hbug...@usc.edu

That seems to work fine. What I'm not sure of is what else hubCheck is doing and why it is reporting this error:

$ ./hubCheck -verbose=2 https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt

### kent source version 420 ###

Found 1 problem:

Missing required setting 'hub' from https://g-60ce76.a78b8.36fe.data.globus.org/globus/hub.txt

END QUOTE ------
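The same test can be reproduced from Python by setting the User-Agent header explicitly. This is a sketch only: the user-agent string is the one quoted by Globus support above, while the function names are made up for illustration, and the fetch performs a plain GET, which may differ from the requests hubCheck actually issues.

```python
import urllib.request

# User-agent string quoted by Globus support in this thread.
HUBCHECK_USER_AGENT = "genome.ucsc.edu/net.c"


def build_request(url, user_agent=HUBCHECK_USER_AGENT):
    """Build a GET request that identifies itself with the given user agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, user_agent=HUBCHECK_USER_AGENT, timeout=30):
    """Fetch a URL and return the body as text (performs a network request)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_text() once with the default user agent and once with "Mozilla/5.0" would show whether the server returns different bodies to different clients.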

Gerardo Perez

Oct 6, 2021, 2:05:42 PM

to Chris Taylor, gen...@soe.ucsc.edu

Hello, Chris.

Thank you for your interest in the Genome Browser and sending your inquiry.

Globus seems to be having issues with loading data. We tried to load the BAM data in the hub as a custom track on mm10 (genome.ucsc.edu/cgi-bin/hgCustom?db=mm10), and it gives the following error:

Error : failed to read index file (.bai) corresponding to https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam

We downloaded the files then hosted the data on our development server. The data is formatted correctly since the bam and bam.bai data on mm10 can be seen in the following session:
http://genome.ucsc.edu/s/brianlee/DataHostedOnDev

One of our engineers reports that it looks like a certificate issue:

wget https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam.bai

--2021-10-05 10:28:49--  https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam.bai
Resolving g-60ce76.a78b8.36fe.data.globus.org... 68.181.11.4, 68.181.11.5
Connecting to g-60ce76.a78b8.36fe.data.globus.org|68.181.11.4|:443... connected.
ERROR: cannot verify g-60ce76.a78b8.36fe.data.globus.org's certificate, issued by …:
  Issued certificate has expired.
To connect to g-60ce76.a78b8.36fe.data.globus.org insecurely, use `--no-check-certificate'.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Gerardo Perez
UCSC Genomics Institute

Chris Taylor

Oct 6, 2021, 3:02:30 PM

to Gerardo Perez, gen...@soe.ucsc.edu

That’s strange- I’m not getting that when I look at the certificate:

$ curl --insecure -vvI https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam.bai 2>&1 | awk 'BEGIN { cert=0 } /^\* SSL connection/ { cert=1 } /^\*/ { if (cert) print }'

* SSL connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384

* Server certificate:

* subject: CN=a78b8.36fe.data.globus.org

* start date: Aug 09 06:10:08 2021 GMT

* expire date: Nov 07 06:10:06 2021 GMT

* common name: a78b8.36fe.data.globus.org

* issuer: CN=R3,O=Let's Encrypt,C=US

* Connection #0 to host g-60ce76.a78b8.36fe.data.globus.org left intact

I’ll see what the people at U of Chicago say. Thanks,

Chris
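One way to double-check the dates reported by curl is to compare them with the time of the failing wget run. The sketch below does only that: it does not validate the certificate chain or the client's trust store, either of which could still produce an "expired" error on some clients. The function names are hypothetical, and the wget timestamp is assumed to be UTC, which the log does not state.

```python
from datetime import datetime, timezone

# Date layout used in the curl -v certificate output quoted above.
CURL_DATE_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_curl_date(text):
    """Parse a date such as 'Aug 09 06:10:08 2021 GMT' into an aware datetime."""
    return datetime.strptime(text, CURL_DATE_FORMAT).replace(tzinfo=timezone.utc)


def is_within_validity(start, expire, when):
    """Return True if 'when' falls inside the certificate's validity window."""
    return parse_curl_date(start) <= when <= parse_curl_date(expire)


start_date = "Aug 09 06:10:08 2021 GMT"
expire_date = "Nov 07 06:10:06 2021 GMT"
# Timestamp of the failing wget run; treating it as UTC is an assumption.
wget_run = datetime(2021, 10, 5, 10, 28, 49, tzinfo=timezone.utc)

print(is_within_validity(start_date, expire_date, wget_run))
```

For these values the server certificate itself is inside its validity window, which is consistent with the problem lying on the client side rather than with the certificate served for this host.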

Chris Taylor

Oct 6, 2021, 3:59:44 PM

to Gerardo Perez, gen...@soe.ucsc.edu

I guess it's some problem with wget 1.14. It works with 1.20 – would you be able to confirm?

$ wget https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam.bai

--2021-10-06 12:45:59-- https://g-60ce76.a78b8.36fe.data.globus.org/globus/9_831_Six2TGC_MecomCCf_ucsc.bam.bai

Resolving g-60ce76.a78b8.36fe.data.globus.org (g-60ce76.a78b8.36fe.data.globus.org)... 68.181.11.4, 68.181.11.5

Connecting to g-60ce76.a78b8.36fe.data.globus.org (g-60ce76.a78b8.36fe.data.globus.org)|68.181.11.4|:443... connected.

HTTP request sent, awaiting response... 200 OK

Length: unspecified

Saving to: ‘9_831_Six2TGC_MecomCCf_ucsc.bam.bai.2’

9_831_Six2TGC_MecomCCf_ucsc.bam.bai.2 [ <=> ] 15.12K --.-KB/s in 0.007s

2021-10-06 12:46:00 (2.23 MB/s) - ‘9_831_Six2TGC_MecomCCf_ucsc.bam.bai.2’ saved [15480]

$ wget -V

GNU Wget 1.20.3 built on linux-gnu.

Chris Taylor

Oct 11, 2021, 1:54:41 PM

to Gerardo Perez, gen...@soe.ucsc.edu

I’ve been doing some more work with Globus at U. of Chicago. They used the hubCheck binary to go to the URL provided and it seems to be getting the right data. Here is what Globus tech support said:

…..quote…..

Jasonalt, Oct 11, 2021, 11:24 CDT:

I can confirm with the hubCheck binary (using strace) that hubCheck is not receiving a redirect; it is getting the contents of the file, i.e., it sees:

\r\n--4c7b2264-5d9d-423e-a80e-e74467b14efe\r\nContent-Type: text/plain\r\nContent-Range: 0-234/*\r\n\r\ndescriptionUrl https://g-60ce76.a78b8.36fe.data.globus.org/test.html\nhub Helena_Ovol1\nshortLabel Ovol1\nlongLabel 5umchir_cas9_gfpsgrna\ngenome

-Jason

……quote…..

Galt Barber

Oct 13, 2021, 2:48:28 AM

to Chris Taylor, Gerardo Perez, gen...@soe.ucsc.edu

Here is what Globus is doing when given a byte-range request.

They are inappropriately returning a multipart/byteranges response.

Content-Disposition: inline
Connection: close
Content-Type: multipart/byteranges; boundary=c3b7e1e9-6586-426d-9459-96a41dd950e7

--c3b7e1e9-6586-426d-9459-96a41dd950e7
Content-Type: text/plain
Content-Range: 0-5/*

descri
--c3b7e1e9-6586-426d-9459-96a41dd950e7--

They should not be using a multipart response for a single request, which is all we ever ask for.
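To make the problem concrete, the following Python sketch shows how a reader expecting plain file contents might interpret the multipart body above. It is a simplified, hypothetical stanza reader, not the Genome Browser's actual parsing code: the first stanza it sees consists of the boundary line and the part headers, so no 'hub' setting is found there. The unwrap_single_part helper shows the extra step a client would need in order to recover the payload.

```python
def first_stanza_keys(text):
    """Return the first word of each line in the first blank-line-delimited stanza."""
    keys = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            if keys:
                break      # a blank line ends the stanza
            continue       # skip leading blank lines
        keys.append(line.split(None, 1)[0])
    return keys


def unwrap_single_part(body, boundary):
    """Return the payload of the first part in a multipart body."""
    delimiter = "--" + boundary
    start = body.index(delimiter) + len(delimiter)
    end = body.index(delimiter, start)
    # Part headers and payload are separated by an empty line (CRLF CRLF).
    _headers, _, payload = body[start:end].partition("\r\n\r\n")
    if payload.endswith("\r\n"):
        payload = payload[:-2]  # drop the CRLF that precedes the closing delimiter
    return payload


boundary = "c3b7e1e9-6586-426d-9459-96a41dd950e7"
multipart_body = (
    "\r\n--" + boundary + "\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Range: 0-5/*\r\n"
    "\r\n"
    "descri"
    "\r\n--" + boundary + "--\r\n"
)

print(first_stanza_keys(multipart_body))
print(unwrap_single_part(multipart_body, boundary))
```

The first call lists the boundary and header names rather than any hub settings, which is one plausible way to arrive at a "Missing required setting 'hub'" report even though the file itself is intact.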

Yes, multipart is the only way to satisfy a user who asks for multiple ranges in a single request. However, we never ask for more than one range at a time, and multipart is not appropriate and cannot be made to work. Our software asks for data at some offset, and then does not read it all but instead closes the connection after it has read all the data it wants.

Combining multipart with open-ended range requests is especially disastrous.

An open-ended range is one that specifies the start but not the end.

100-500 has an end specified and is a closed range specification.

But 100- has no end specified, and so the server with multipart response will supply the rest of the enormous file up to the entire file size.
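The difference between the two range forms can be sketched in a few lines of Python. The helper names are hypothetical; the header syntax follows the standard HTTP Range header, and the file size is the 3GB figure used as an example in this message (treated here as 3,000,000,000 bytes for simplicity).

```python
def range_header(start, end=None):
    """Format an HTTP Range header value; end=None produces an open-ended range."""
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    return f"bytes={start}-" if end is None else f"bytes={start}-{end}"


def bytes_covered(start, end, file_size):
    """Number of bytes a range covers if the server sends all of it."""
    last = file_size - 1 if end is None else min(end, file_size - 1)
    return max(0, last - start + 1)


file_size = 3_000_000_000  # illustrative size for a "3GB" file

print(range_header(100, 500), bytes_covered(100, 500, file_size))
print(range_header(100), bytes_covered(100, None, file_size))
```

With a closed range the server knows exactly how little to send. With an open-ended range the full remainder of the file is in scope, so a client that reads a small piece and then closes the connection depends on the server streaming the data rather than preparing the whole remainder up front.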

Imagine how inefficient this is for both their server and ours: when we read a piece near the beginning of a file with 3GB of data, the server tries to send back that huge, unwanted 3GB response. It is slow, horribly inefficient, and not usable.

Globus needs to configure or program their HTTP server to give us a single, non-multipart response optimized for open-ended byte-range requests.

Please let us know if you have more questions.

Thanks!

Galt Barber

Senior Software Engineer

UCSC Genome Browser

On Mon, Oct 11, 2021 at 10:54, Chris Taylor <chri...@usc.edu> wrote:
