New dataset - Download speed


Guilherme Santos

Jun 19, 2023, 2:52:50 PM
to physionet-challenges
Hello, thanks for sharing the new dataset. 

I know the trouble you must have gone through to obtain it. However, I have been experiencing very slow downloads: the speed is no more than 1 MB/s. I noticed the same problem in the first (unofficial) phase, but I still managed to download the 17 GB in a few hours. Now, at 1.3 TB, things change, and the estimate is 12-15 days of continuous downloading. My internet connection is much faster than that, and I've tried downloading from other locations with the same result. I don't know what to do to get the complete dataset. Are you aware of this, and what is being done about it? Thanks.

PhysioNet Challenge

Jun 19, 2023, 3:03:30 PM
to physionet-challenges
Dear Guilherme,

Thanks for your kind words, and apologies for the delayed response.

I'm sorry that you're having difficulty downloading the updated training set. The full training set for the unofficial phase was 24 GB, and the full training set for the official phase is 1.6 TB, so we expect the downloads to take more time (but hopefully not weeks!). If other teams are having the same issue or other issues with downloading the data, then please let us know.

Depending on why your download speeds are slow, here are a couple of steps that you (and other teams) can try to download the training set more quickly.

Some of the patients have weeks of data, but you do not need weeks of data from each patient to develop your initial models. You can reduce the total file size and download time by only downloading the first 72 hours of data after return of spontaneous circulation (ROSC), which is the time window that we're scoring for the Challenge. Of course, you can continue downloading the rest of the training set after you've downloaded the first 72 hours of data; we are sharing more than 72 hours of data so that you can decide whether or not to use it for your models:
https://physionetchallenges.org/2023/#accessing-data
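For example, here is a rough sketch of that approach with wget (assumptions: the top-level RECORDS file lists one patient directory per line, such as training/0284, each patient directory contains its own RECORDS file listing the record names, and the record names end with a zero-padded hour field followed by the channel group):

# Download only the .hea and .mat files recorded within 72 hours of ROSC.
wget -N -q "https://physionet.org/files/i-care/2.0/RECORDS"
while IFS= read -r patient; do
    mkdir -p "${patient}"
    wget -q -O - "https://physionet.org/files/i-care/2.0/${patient}/RECORDS" |
        awk -F'_' '($(NF-1) + 0) < 72' |
        while IFS= read -r record; do
            for ext in hea mat; do
                wget -N -c -q -P "${patient}" "https://physionet.org/files/i-care/2.0/${patient}/${record}.${ext}"
            done
        done
done < RECORDS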

You can also run the wget commands in parallel (wget will run in the background with the "-b" argument) or use a similar tool that supports simultaneous downloads. This approach should allow you to download the data for different time points, channel groups, or patients in parallel. For example, this command downloads the files for patient 0284 (the first patient in the training set):
wget -b -r -N -c -np -q "https://physionet.org/files/i-care/2.0/training/0284"

The RECORDS file has a list of all of the patients in the training set, so you can iterate over the file and run the download commands simultaneously.
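For example (a sketch, assuming the top-level RECORDS file from the sketch above has already been downloaded, and that eight simultaneous downloads are acceptable):

# Run up to 8 recursive wget downloads at a time, one per patient directory listed in RECORDS.
xargs -P 8 -I{} wget -r -N -c -np -q "https://physionet.org/files/i-care/2.0/{}" < RECORDS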

Best,
Matt
(On behalf of the Challenge team.)

Please post questions and comments in the forum. However, if your question reveals information about your entry, then please email info at physionetchallenge.org. We may post parts of our reply publicly if we feel that all Challengers should benefit from it. We will not answer emails about the Challenge to any other address. This email is maintained by a group. Please do not email us individually.

Felix Krones

Jun 19, 2023, 3:34:58 PM
to physionet-challenges
Dear Matt, 

Thanks a lot; I think the idea of using wget in parallel is really helpful.
Otherwise, I see the same issue as Guilherme: I am also not getting above 1 MB/s.

Best,
Felix

Guilherme Santos

Jun 20, 2023, 8:23:20 AM
to physionet-challenges
Hello Matt and Felix,

Thanks for contributing to the discussion. I created a virtual machine on a cloud platform with a very fast internet connection, capable of downloading files at very high speeds, yet the rate at which the PhysioNet repository serves the data is still very low. This could be a problem for other teams that haven't spoken up or haven't started downloading the data yet, so I ask you to look into it. Since yesterday things have improved a little, and I'm now managing to download at about 7 MB/s, which is not bad compared to what we had before. It is important to keep monitoring this so that other teams are not affected. Thank you very much.

Allan Moser

Jun 20, 2023, 11:56:22 AM
to physionet-challenges

I've also found the download speed to be very slow (as slow as 500 KB/s). Running several wget jobs in parallel helps, but I cannot run more than 15 at a time; additional jobs have their connections refused.

PhysioNet Challenge

Jun 20, 2023, 12:06:10 PM
to physionet-challenges
Hi everyone,

Thanks for sharing, and we apologize for the slow downloads.

We are working to host a copy of the data on a public cloud service to make it (much) faster to download the training set, so please stay tuned for an update in the coming days.

Until then, you may want to try downloading data for a subset of the patients and/or timestamps, which should allow you to start developing your models.
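For example, here is a small sketch that grabs only the first five patients listed in the top-level RECORDS file (this assumes that the RECORDS file mentioned earlier in the thread has already been downloaded to the current directory):

# Download the first five patient directories listed in RECORDS, one at a time.
head -n 5 RECORDS | xargs -I{} wget -r -N -c -np -q "https://physionet.org/files/i-care/2.0/{}"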

Best,
Matt
(On behalf of the Challenge team.)

Please post questions and comments in the forum. However, if your question reveals information about your entry, then please email info at physionetchallenge.org. We may post parts of our reply publicly if we feel that all Challengers should benefit from it. We will not answer emails about the Challenge to any other address. This email is maintained by a group. Please do not email us individually.

Richard Povinelli

Jun 20, 2023, 4:23:21 PM
to physionet-challenges
Greetings All,

I'm also getting ~430 KB/s on an internet connection that can pull down at 20+ MB/s. One idea would be to create a torrent for the files, which would allow everyone to share their bandwidth by distributing the files. Here is a link to a Python torrent-creation tool: https://py3createtorrent.readthedocs.io/en/latest/user.html

Best regards,
Richard

PhysioNet Challenge

Jun 20, 2023, 4:26:41 PM
to physionet-challenges
Hi Richard,

I'm getting similar speeds (~0.5 MB/s) on both academic and residential connections with much higher bandwidth limits.

We discussed creating a torrent, which the 2018 Challenge did, but we know that some institutions impose limits on running torrent software. However, it's a good idea, and we may still want to do it.

We're working to upload a copy of the data to GCP, which should support much faster download speeds later this week. Please stay tuned, and apologies again, everyone.

Best,
Matt
(On behalf of the Challenge team.)

Please post questions and comments in the forum. However, if your question reveals information about your entry, then please email info at physionetchallenge.org. We may post parts of our reply publicly if we feel that all Challengers should benefit from it. We will not answer emails about the Challenge to any other address. This email is maintained by a group. Please do not email us individually.

PhysioNet Challenge

Jun 22, 2023, 2:24:17 PM
to physionet-challenges
Hi everyone,

We are hosting a copy of the training set on GCP. Please see the "Files" section of the data project for download information:
https://physionet.org/content/i-care/2.0/#files

I was able to download the data at ~30 MB/s using gsutil from the GCP bucket instead of 0.5 MB/s using wget from PhysioNet.org. Depending on your internet connection, you may see even faster download speeds. Here is an example command, which downloads the data for patient 0284 to the current directory:

gsutil -m cp -r gs://i-care-2.0.physionet.org/training/0284 .

If you would like to download the data using command-line tools but have not used gsutil before (or not recently), then please see these setup instructions:
https://cloud.google.com/storage/docs/gsutil_install

Thank you again for your patience!

Best,
Matt
(On behalf of the Challenge team.)

Please post questions and comments in the forum. However, if your question reveals information about your entry, then please email info at physionetchallenge.org. We may post parts of our reply publicly if we feel that all Challengers should benefit from it. We will not answer emails about the Challenge to any other address. This email is maintained by a group. Please do not email us individually.

Ján Pavlus

Jun 23, 2023, 12:54:50 PM
to physionet-challenges

Hi,

The gsutil option works only if you have a billing account on Google, and I don't want to add billing to my Google account. The zip file and the normal wget link have really poor speed, so the download would take 15 days or more... The script on the official page for downloading only part of the files does not work; it reports an invalid variable.

Please do something about it; otherwise, there will not be much experimenting and machine learning before the deadline, just engineering around downloading the data...

PhysioNet Challenge

Jun 23, 2023, 12:56:32 PM
to physionet-challenges
Dear Ján,

I don't know why gsutil requires a billing account, but the PhysioNet.org team said that they will be billed for the downloads, and I did not see any charges from GCP when I downloaded the data with gsutil.

Which command from the official phase does not work? I was able to download the data using gsutil, but maybe I used a different command than you did. Please note that some of the commands include a dot or period that is part of the command but easy to miss.


Best,
Matt
(On behalf of the Challenge team.)

Please post questions and comments in the forum. However, if your question reveals information about your entry, then please email info at physionetchallenge.org. We may post parts of our reply publicly if we feel that all Challengers should benefit from it. We will not answer emails about the Challenge to any other address. This email is maintained by a group. Please do not email us individually.

sone sone (Sone)

Jun 25, 2023, 9:37:46 AM
to physionet-challenges
Dear PhysioNet Challenge,

I hope this email finds you well. I would like to ask which command to use with gsutil to download the patient metadata and the EEG data from the first 72 hours after return of spontaneous circulation (ROSC). Could you please provide the appropriate command?

Thank you in advance.

Best regards,
Sone

PhysioNet Challenge

Jun 25, 2023, 9:46:32 AM
to physionet-challenges
Dear Sone,

The following commands download the RECORDS files, which describe all of the records (the .hea and .mat files) in the training set:

wget "https://physionet.org/files/i-care/2.0/RECORDS"
while IFS= read -r line; do
    wget -q -P "$line" "https://physionet.org/files/i-care/2.0/$line/RECORDS"
done < RECORDS

You can use the RECORDS files to create a list of all of the files that you want to download, and then use gsutil cp and/or wget commands to download them. The records follow the naming convention "patient_count_hour_group", so, for example, the hour field is less than 72 for the files within 72 hours of ROSC (e.g., the record 0284_001_004_EEG has hour 004).

Much like the UNIX cp command, the gsutil cp command has the format "cp source destination", e.g.,

gsutil -m cp -r "gs://i-care-2.0.physionet.org/training/0284/0284.txt" "training/0284/0284.txt"
gsutil -m cp -r "gs://i-care-2.0.physionet.org/training/0284/0284_001_004_EEG.hea" "training/0284/0284_001_004_EEG.hea"
gsutil -m cp -r "gs://i-care-2.0.physionet.org/training/0284/0284_001_004_EEG.mat" "training/0284/0284_001_004_EEG.mat"

where the first argument is the source in the Google Cloud Storage bucket and the second argument is the destination on your machine.
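
Putting these together, here is a rough sketch (assuming the per-patient RECORDS files downloaded above and the naming convention described earlier) that copies only the records from within 72 hours of ROSC for a single patient:

# Copy the metadata and the first 72 hours of records for patient 0284 from the bucket.
mkdir -p training/0284
gsutil cp "gs://i-care-2.0.physionet.org/training/0284/0284.txt" training/0284/
awk -F'_' '($(NF-1) + 0) < 72' training/0284/RECORDS |
    while IFS= read -r record; do
        gsutil -m cp "gs://i-care-2.0.physionet.org/training/0284/${record}.hea" \
                     "gs://i-care-2.0.physionet.org/training/0284/${record}.mat" \
                     training/0284/
    done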


Best,
Matt
(On behalf of the Challenge team.)

Please post questions and comments in the forum. However, if your question reveals information about your entry, then please email info at physionetchallenge.org. We may post parts of our reply publicly if we feel that all Challengers should benefit from it. We will not answer emails about the Challenge to any other address. This email is maintained by a group. Please do not email us individually.

Jackson Bean

Jun 29, 2023, 4:14:44 PM
to physionet-challenges
Just for reference -- here is code for grabbing only the first 72 hours of EEG data (plus the patient metadata) using Google Cloud Storage:

# Copy every patient's metadata (.txt) file.
gsutil -m cp -r gs://i-care-2.0.physionet.org/training/*/*.txt .

# Copy the EEG records (.hea and .mat files) for hours 000 through 072.
for ((i=0; i<=72; i++)); do
    j=$(printf "%03d" $i)
    gsutil -m cp -r gs://i-care-2.0.physionet.org/training/*/*_${j}_EEG.* .
done

Xiaobin Shen

Jul 4, 2023, 10:57:56 PM
to physionet-challenges
Well, after being confused for a long time (I'm completely new to gsutil), I think the command provided on https://physionet.org/content/i-care/2.0/#files should not include the -u YOUR_PROJECT_ID part, and this might be the reason why Ján needed a billing account on Google. If I understand things correctly, we are simply copying files to our local directory, not to a GCP bucket or anything else (which might require a project ID and billing).

I'll just share my experience, FYI:
After installing gsutil following the “Installing gsutil” section of https://cloud.google.com/storage/docs/gsutil_install (you can ignore the part about "Setting Up Credentials to Access Protected Data"), I am able to get a download speed of ~10 MB/s using the command

gsutil -m cp -r gs://i-care-2.0.physionet.org DESTINATION

where the only thing we need to specify is DESTINATION, the directory where we want to store the data. I wasted a lot of time on this; I hope this message helps people who are as confused as I was.

Nate Riek

Aug 14, 2023, 10:53:01 AM
to physionet-challenges
Hi all,
I was able to download the data weeks ago using the gsutil commands on Windows. I have a few files that downloaded incorrectly and need to try again. This time, I get the following error:

"BadRequestException: 400 Bucket is a requester pays bucket but no user project provided."

Any ideas on how to fix this?

Thanks,
Nate

PhysioNet Challenge

Aug 14, 2023, 4:48:35 PM
to physionet-challenges
Dear Nate,

I am replying just so the whole group can see it: this issue with downloading via gsutil was resolved. See this post for more information:

Best,
James