Downloading large files

Philipp at UiT

Jan 9, 2022, 3:55:30 AM
to Dataverse Users Community
We have some users who experience trouble when trying to download larger files from DataverseNO. We have suggested using Wget from the command line (which works fast and without problems), but the users would prefer a simple browser-based method. On the file landing page, users are informed to "use the Download URL in a Wget command or a download manager to avoid interrupted downloads, time outs or other failures." I tried to make this work using Cyberduck (https://cyberduck.io/), but without luck. When I select HTTPS as the download method, I'm asked to add credentials, but I don't want users to have to add credentials to download openly available files, and when I de-select credentials, the download doesn't work either. Has anyone managed to download larger files from a Dataverse repository using a download manager like Cyberduck? We'd highly appreciate your help! Thanks!
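
(For reference, the kind of Wget call we suggest is roughly the following – the URL is only a placeholder for the Download URL shown on the file landing page:

  # placeholder URL – substitute the Download URL from the file landing page
  wget -c --tries=0 -O myfile.zip "https://dataverse.no/api/access/datafile/XXXXX"

where -c tells Wget to continue a partial download instead of starting over, and --tries=0 lets it retry until it succeeds.)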

Best, Philipp

Péter Király

Jan 9, 2022, 5:44:26 AM
to dataverse...@googlegroups.com
Hi Philipp,

I would say this is the most important problem for us. I investigated it,
and what I found is that there is a session timeout variable in Payara,
which by default is about 15 minutes. After that, Payara breaks the
running thread. It is a general setting; there is no way to apply it
only to file upload/download operations. I don't know what the negative
side effects would be if I set it to, say, 1 day.
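
If it is the setting I think it is – the HTTP request timeout, whose
default of 900 seconds matches the 15 minutes – it should be changeable
with asadmin, roughly like this (the exact dotted name and listener name
may differ on your installation, so please double-check against the
Payara documentation before relying on it):

  # assumption: the relevant setting is the http request-timeout-seconds (default 900)
  ./asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600
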
As a workaround I have created a download directory in Apache and
created symbolic links to large files from there. Every time a large
file is on the horizon, I suggest that the uploader add this Apache
link to the description of the dataset and/or the files, so their
users - provided they read the metadata - can use that link to
download the file without interruption, because Apache does not have
the timeout setting that breaks the connection. See e.g.
https://doi.org/10.25625/7OZ1SP
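
The setup is roughly the following – the paths, host name and file name
below are just placeholders for my own installation, and the download
directory must be served by Apache with symlinks allowed (FollowSymLinks):

  # sketch only – paths, host names and file names are placeholders
  mkdir -p /var/www/html/download
  ln -s /srv/dataverse/files/10.25625/7OZ1SP/largefile.zip /var/www/html/download/largefile.zip
  # then add https://data.example.org/download/largefile.zip to the dataset/file description
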

Another solution is to use S3, because it seems it is possible to
configure the S3 connection so that downloads do not go through Payara.
Since the core developer team and the largest Dataverse installations
use S3, this problem doesn't occur for them. I am aware of efforts to
fix this issue.
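
If I read the guides correctly, the relevant option is the store's
download-redirect flag, which makes Dataverse redirect download requests
to pre-signed S3 URLs instead of streaming the bytes through Payara.
Roughly (assuming the S3 store id is "s3" – please verify against the
Dataverse installation guide for your version):

  # assumption: the S3 store is configured with id "s3"
  ./asadmin create-jvm-options "-Ddataverse.files.s3.download-redirect=true"
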

Best,
Péter


--
Péter Király
software developer
GWDG, Göttingen - Europeana - eXtensible Catalog - The Code4Lib Journal
http://linkedin.com/in/peterkiraly

Philipp at UiT

Jan 9, 2022, 10:10:21 AM
to Dataverse Users Community
Hi Péter,

Thanks for sharing this! I now remember you had mentioned your approach to this challenge in a previous discussion thread. I think we may consider a similar solution until we have migrated to S3-compatible storage.

Best, Philipp

James Myers

Jan 9, 2022, 11:44:27 AM
to dataverse...@googlegroups.com

I think the only thing wget can do that browser downloads don’t is to auto-restart after a failure in order to get the rest of the file. If that’s the case, I wonder if browser extensions like ‘Auto Resume’ in the Chrome store would help. (There are more sophisticated ones that download in parallel (like the S3 direct upload mechanism), which might also improve performance.) I’m not sure whether the new Range header support in v5.9 is required for, or would help with, any of this, but I would suspect that if wget succeeds, a browser plugin of this type might be able to do the same thing.

 

-- Jim

Philipp at UiT

Jan 11, 2022, 9:53:02 AM
to Dataverse Users Community
Thanks, Jim. That's interesting information.

A couple of follow-up questions:
How does the resume feature in Wget and other tools work? For example: will a download with auto-resume also work if it takes more than 15 minutes (cf. the general timeout setting in Payara) to download the file?

Are there any known issues with expanding the general timeout setting in Payara to, let's say, 1 hour (cf. Péter's question)?

Best, Philipp

James Myers

Jan 11, 2022, 10:05:56 AM
to dataverse...@googlegroups.com

In general, when tools request a file over HTTPS they get info about the size, and then bytes start streaming and get written to disk. If the connection fails (for any reason – timeouts, cables getting unplugged, etc.) before the correct number of bytes are written, tools can then re-request the file, starting from the point (byte offset) where things failed, so that they only have to retrieve the bytes they hadn’t gotten the first time. This means that, even with a timeout, one can keep getting new bytes and eventually get the whole file. Google Chrome and other browsers don’t do this automatically, but they do keep the partial download and, if the user requests it, they can restart the download to get just the new bytes. My understanding is that the plugins just automate that request to restart and continue downloading.
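
To illustrate the mechanism (the URL, file id and byte offset below are made up, and the server has to honor Range requests): a resuming client re-requests only the missing tail, either with an explicit Range header or by letting the tool work out the offset from the partial file already on disk:

  # hypothetical URL and offset, for illustration only
  curl -H "Range: bytes=1073741824-" -o bigfile.part https://dataverse.example.org/api/access/datafile/12345
  # or let the tool compute the offset from the partial file on disk:
  curl -C - -o bigfile.zip https://dataverse.example.org/api/access/datafile/12345

wget -c does essentially the same thing as the second curl call.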

 

As for timeouts, I think the general reason for them is to be able to kill stuck connections and/or to avoid people maliciously tying up connections (denial of service), e.g. by asking for many files and reading them at 1 byte/second to keep all of the available connections busy. When you know there are valid long connections, it is usually reasonable to lengthen the timeout – you may just want to watch for any denial-of-service activity and be ready to shorten the timeout if/when needed.

Philipp at UiT

Jan 11, 2022, 10:41:16 AM
to Dataverse Users Community
Thanks, Jim. I'll discuss this with our colleagues at the IT department who are responsible for running our servers.
Best, Philipp

Janet McDougall - Australian Data Archive

Jan 17, 2022, 9:55:25 PM
to Dataverse Users Community
Hi Philipp,
What size do you consider as "larger" files when you experience trouble downloading?
Thanks,
Janet


Philipp at UiT

Jan 18, 2022, 2:25:41 AM
to Dataverse Users Community
Hi Janet,
I guess for people relying on a low-capacity network and not using any download resume feature, download trouble already starts with files from 2-3 GB and upwards. We have now added alternative download links to a couple of datasets with larger files (up to approx. 10 GB). Once we have migrated our instance to the cloud, we'll enable direct download from S3 storage.
Best, Philipp