We've been developing the MLCommons Training Algorithms benchmark, which contains several different neural net workloads, including one based on the Criteo 1TB Click Logs dataset. As far as I can tell, the MLCommons Training benchmark also uses this dataset (link). We recently realized that Criteo has once again changed their data hosting solution (website) to WeTransfer, which requires opening a website in a browser and clicking through buttons to download the data. This breaks our setup: we had scripts to perform dataset setup on cloud VMs (which have no graphical interface), and those scripts no longer work. Does the MLCommons Training benchmark have a solution for this?
P.S. From a usability perspective, we don't want users to have to transfer 340GB+ to a personal machine and then move it to a cloud VM, or to copy the authenticated URLs out of a Chrome download window and paste them into our setup script (it's also unclear how long those URLs remain valid).