Natalia
I think there are a few problems here, and you may need to run the download outside of GenePattern. I can help you with that if you want to try it. The problems I have identified so far are:
1. Because of the large number of files in your manifest (7178) and their size, you are overflowing the disk space available for your job on our default compute nodes. One of my test runs (still going) is at 40GB after downloading 2395 files. If the growth is linear, we will hit ~120GB, while our default compute nodes only have 80GB. We do have an alternative queue (in the advanced parameters) called RNASeq_large_disk that has 500GB of space. However, using it runs into a second issue.
2. With the larger queue, about half of the downloads fail, which appears to be because of access controls on some of the files:
2022-10-27 18:38:26,244: INFO: Successfully downloaded: 3569
2022-10-27 18:38:26,244: INFO: Failed downloads: 3609
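For reference, the ~120GB figure in point 1 is just a straight-line extrapolation from my partial test run. A minimal sketch of that arithmetic is below; the function name is made up for illustration, and it assumes the remaining files are about the same average size as the ones already downloaded, which may not hold.

```python
def estimate_total_gb(gb_so_far, files_so_far, total_files):
    """Linearly extrapolate total disk use from a partial download.

    Assumes the remaining files have the same average size as the
    files downloaded so far (an assumption, not a guarantee).
    """
    return gb_so_far / files_so_far * total_files


# Numbers from the test run: 40GB after 2395 of 7178 files.
estimate = estimate_total_gb(40, 2395, 7178)
print(f"Estimated total: {estimate:.0f}GB")
print("Fits on default node (80GB):", estimate <= 80)
print("Fits on RNASeq_large_disk (500GB):", estimate <= 500)
```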
Options:
If you are certain that there is no controlled-access data, the other possibility is that the GDC is throttling the request and rejecting the downloads after roughly the first 3569 files have been retrieved. In this case you can try breaking your manifest into several smaller parts, submitting them individually, and then joining the gct and cls files together after all downloads are done.
If you think that there is controlled-access data, then you cannot use GenePattern for the download, since it is not a HIPAA-secure environment. I could provide you with an altered Docker container for this module that would allow you to provide an access-control token (see https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/), which you could run in an appropriately secure environment. We cannot run it this way on the GenePattern servers, though, since we do not allow controlled-access or protected data, and passwords or access tokens should never be passed as module parameters.
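If you want to try the first option (splitting the manifest), here is a rough sketch of one way to do it in Python. It is not part of the GenePattern module; the function name, output file naming, and chunk size are my own choices for illustration. It assumes the manifest is a plain tab-separated text file whose first line is a header row that needs to be repeated in every part.

```python
from pathlib import Path


def split_manifest(manifest_path, chunk_size, out_dir):
    """Split a manifest into parts of at most chunk_size entries each.

    Assumes the first non-empty line is a header row; it is copied
    to the top of every part. Returns the list of part file paths.
    """
    manifest_path = Path(manifest_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Drop blank lines so a trailing newline does not create an empty entry.
    lines = [ln for ln in manifest_path.read_text().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]

    parts = []
    for start in range(0, len(rows), chunk_size):
        part_number = start // chunk_size + 1
        part_path = out_dir / f"{manifest_path.stem}_part{part_number:03d}.txt"
        chunk = rows[start:start + chunk_size]
        part_path.write_text("\n".join([header, *chunk]) + "\n")
        parts.append(part_path)
    return parts
```

For example, `split_manifest("gdc_manifest.txt", 1000, "manifest_parts")` (with a placeholder file name) would turn a 7178-entry manifest into eight parts, each of which you could submit as its own job. The chunk size of 1000 is a guess; I do not know at what point the GDC starts rejecting requests.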
Please let me know how you would like to proceed,
Ted