Large files via API (& python client)


Adam Ginsburg

Nov 30, 2015, 5:35:36 AM
to Dataverse Users Community
I'm trying to upload large files to Dataverse via the API because I cannot copy the files to my local machine and don't have easy access to a web browser. Using the Python client, I get the exception:
"LargeZipFile: Filesize would require ZIP64 extensions"
which probably means the ZIP64 extensions need to be enabled. Is that something that could be added to the Python client? Relatedly, would it be possible to stream the upload (send one chunk at a time) and include a progress bar? For very large files, this would be helpful.

I'm using the python client in particular because I don't want to create a zip file, duplicating my large data on disk.  Is it, or will it be, possible to send files other than ZIP files via the SWORD API?
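The streaming upload asked about above could be built around a plain generator. This is only a sketch of the idea, not part of the actual Python client; the `report` callback and chunk size are illustrative.

```python
import os

def chunked_reader(path, chunk_size=1024 * 1024, report=None):
    """Yield a file in fixed-size chunks, optionally reporting progress."""
    total = os.path.getsize(path)
    sent = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            sent += len(chunk)
            if report is not None:
                report(sent, total)  # e.g. update a progress bar
            yield chunk
```

With the requests library, passing such a generator as `data=` streams the body using chunked transfer encoding, so the file is never held in memory whole; whether the SWORD endpoint accepts a chunked body is a separate question.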

Thanks,
Adam



Philip Durbin

Nov 30, 2015, 9:10:38 AM
to dataverse...@googlegroups.com
Hi Adam,

I'm sorry to hear you're having trouble.

One thing that changed recently for the Harvard Dataverse is that the size limit was lowered from 10 GB to 2 GB: https://github.com/IQSS/dataverse/commit/979efca

That is to say, your problem may be specific to uploading large files to https://dataverse.harvard.edu

I'd be curious to know how large your zip file is and whether you're able to upload it to a test server such as https://apitest.dataverse.org

Other people have asked about uploading non-zip files via the API and you are welcome to join the discussion at https://github.com/IQSS/dataverse/issues/1612

With regard to specific features of the Python client, such as a progress bar, you are welcome to open issues at https://github.com/IQSS/dataverse-client-python/issues

I hope this helps.

Phil


--
You received this message because you are subscribed to the Google Groups "Dataverse Users Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dataverse-commu...@googlegroups.com.
To post to this group, send email to dataverse...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dataverse-community/b13071c2-37e7-4960-8dae-d4bcbe1b3822%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Eleni Castro

Nov 30, 2015, 9:28:04 AM
to Dataverse Users Community
Hi Adam,

One short-term solution to get your data into Dataverse as soon as possible would be to break up that large zip file into smaller zip files that meet our new file limit and send them one at a time. I know this is not optimal, but it would work if you need to get the dataset up there.

Cheers
Eleni

-- 

Eleni Castro
Research Coordinator, Data Curation and Outreach
IQSS, Harvard University
617-496-0703
http://www.iq.harvard.edu/people/eleni-castro 
http://orcid.org/0000-0001-9767-8536

Adam Ginsburg

Nov 30, 2015, 9:59:42 AM
to Dataverse Users Community
Hi Phil,
     Thanks, the Harvard Dataverse limit is an issue. However, last week (<10 days ago), I successfully uploaded an 18 GB file via the web interface... though a day or two later, I had a 4 GB file rejected. I'm not entirely clear what's going on there, unless maybe that commit was put into production in the few hours between those two uploads. In any case, that's a major problem for me, since most of the files I want to preserve are >2 GB. I guess I'll just have to upload these files the very hard way.

     Note that I cannot upload zip files above a certain size anywhere using the Python client, because the failure I reported is on the client side: it never tries to send anything to the server before crashing (at least, as far as I understand it).

Eleni - I'm afraid I don't understand your suggestion.  I actually have many single files that are >2GB, and I want to send them to dataverse via the API.  I cannot break these files up unless there is a clean way to restitch them together on the other end.
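For what it's worth, restitching split files is mechanical as long as part order is preserved. Here is a minimal sketch; the `.partNNNN` naming scheme is my own, nothing Dataverse provides, and each part would still land in the dataset as a separate file.

```python
def split_file(path, part_size):
    """Split path into path.part0000, path.part0001, ...; return the part names."""
    parts = []
    with open(path, "rb") as fh:
        index = 0
        while True:
            chunk = fh.read(part_size)
            if not chunk:
                break
            part = "%s.part%04d" % (path, index)
            with open(part, "wb") as out:
                out.write(chunk)
            parts.append(part)
            index += 1
    return parts

def join_files(parts, dest):
    """Concatenate the parts, in order, back into a single file at dest."""
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as fh:
                out.write(fh.read())
```

A checksum of the reassembled file against the original would be the obvious sanity check after downloading the parts.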

Eleni Castro

Nov 30, 2015, 10:03:10 AM
to dataverse...@googlegroups.com

Sorry for my misunderstanding. Can you zip these individual >2GB files so that they can be uploaded individually?

Cheers
Eleni


Adam Ginsburg

Nov 30, 2015, 10:18:24 AM
to Dataverse Users Community



No, that is the original problem: with the Python client, zip files larger than 2 GB cannot be created because the Python zipfile library (at least in Python 2) requires the allowZip64 flag to be set for such files: https://docs.python.org/2/library/zipfile.html#zipfile-objects
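For anyone hitting this, the flag in question is a one-argument change when creating the archive directly with the standard library. The paths here are just placeholders for illustration.

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "bigfile.dat")
zip_path = os.path.join(workdir, "bigdata.zip")

# Small stand-in for the large data file (illustrative only).
with open(data_path, "wb") as fh:
    fh.write(b"x" * 1024)

# allowZip64=True permits archives and members beyond the classic ZIP
# limits; in Python 2's zipfile it defaults to False, which is what
# raises LargeZipFile for big inputs.
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED, allowZip64=True) as zf:
    zf.write(data_path, arcname="bigfile.dat")
```

So the fix on the client side would presumably be to pass this flag wherever the client constructs its ZipFile.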


Amber Leahey

Dec 1, 2015, 2:52:48 PM
to Dataverse Users Community, philip...@harvard.edu
Hi Philip,

We are trying to develop FAQs for our Dataverse instance, and this question about large files has come up. I thought I'd jump on this thread since it is really relevant, but I just want to clarify things for the FAQs; apologies if this repeats a lot of information.

1. Are the file size limits the same for uploads through the user interface and the API? Is it a 2 GB file size limit for both?

2. For file uploads through the API, does the file have to be a .zip?

3. To your knowledge, did the file size limits change at all from 3.6 to 4.x?
We are currently using 3.6, but we would like to know if there are changes in 4.

4. And aside from the UI or API, is there a backend upload option at all for files larger than 2 GB?

Many thanks for any answers, 
Amber Leahey




Philip Durbin

Dec 1, 2015, 4:03:11 PM
to dataverse...@googlegroups.com
Hi Amber,

The MaxFileUploadSizeInBytes setting at http://guides.dataverse.org/en/4.2.1/installation/installation-main.html#maxfileuploadsizeinbytes applies to both the UI and the API. Nothing says the limit has to be 2 GB. If you remove that setting, Dataverse will not limit the size of uploaded files at all. (I meant to reply earlier to Adam that this is not a code change but a configuration setting.)
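For reference, database settings like this are typically managed through the admin API in Dataverse 4. A sketch, with an example host and value; the exact endpoint should be checked against the installation guide for your version:

```shell
# Set a 5 GB limit (the value is in bytes; localhost:8080 is an example host):
curl -X PUT -d 5368709120 http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes

# Or remove the setting entirely so uploads are not size-limited:
curl -X DELETE http://localhost:8080/api/admin/settings/:MaxFileUploadSizeInBytes
```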

Yes, files uploaded via the API currently have to be zipped. If we change this, it will be part of https://github.com/IQSS/dataverse/issues/1612

Generally speaking, the UI for DVN 3.x had a bit of a natural limit of 2 GB for files because the technology we were using (an old version of ICEfaces) seemed to fail on larger files. Upload via SWORD has always allowed larger files, but we haven't done a ton of testing of how large the files can be. There's a 28.6 GB file at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/8DRZHU that was added after we migrated the Harvard Dataverse from DVN 3 to Dataverse 4, but I don't know whether it was uploaded via the UI or the API.

I just left a longish comment about large files at https://github.com/IQSS/dataverse/issues/952#issuecomment-161002345 that lays out several use cases that have been brought up. I'd love to hear whether your use case falls into one or more of those buckets. I tried to link to all the related issues that exist on GitHub.

I hope this helps,

Phil



Amber Leahey

Dec 3, 2015, 12:54:41 PM
to Dataverse Users Community, philip...@harvard.edu
Thanks, Phil. We are going to test changing the setting and uploading larger files (5 GB).

It makes sense that we can control the file size limit, but 2 GB might be a bit limiting for some disciplines. Overall, it hasn't posed a problem for our users, though.

Thanks,
Amber