Re: Criteo1TB benchmark download issues


Zack Nado

Jul 12, 2023, 12:17:28 PM
to Frank Schneider, George Dahl, R. Eschenhagen, pub...@mlcommons.org


On Tue, Jul 11, 2023 at 5:30 PM Zack Nado <zn...@google.com> wrote:
Hey!

We've been developing the MLCommons Training Algorithms benchmark, which contains several neural net workloads, including one based on the Criteo 1TB Click Logs dataset. As far as I can tell, the MLCommons Training benchmark also uses this dataset (link). We recently realized that Criteo has once again changed its data hosting solution (website), this time to WeTransfer, where you have to open a page in a browser and click buttons to download the data. This breaks our scripts, which set up the dataset on cloud VMs that have no graphical interface. Does the MLCommons Training benchmark have a solution for this?

Thank you for your help!
Zack

P.S. From a usability perspective, we do not want to make users transfer 340GB+ to a personal machine and then move it to a cloud VM, or copy the authenticated URLs out of a Chrome download window and paste them into our setup script (it is also unclear how long those URLs remain valid).

Cesar Octavio Ma Avalos Baddouh

Jul 12, 2023, 12:36:35 PM
to Frank Schneider, George Dahl, R. Eschenhagen, pub...@mlcommons.org, Zack Nado
Hey all,

We've recently had luck using https://github.com/iamleot/transferwee to download the files from "WeTransfer" using only the CLI, with no user interaction. 
The command was something like:
"python transferwee.py download https://criteo.wetransfer.com/downloads/4bbea9b4a54baddea549d71271a38e2c20230428071257/d4f0d2" (I think this still works)

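In case it helps with wiring this into a setup script on a headless VM, here is a rough Python sketch that wraps the command above. It assumes transferwee.py has already been fetched from the repository linked above, that its own dependencies are installed, and that it saves the file into the current working directory (I have not verified that last point). The function names and the dest_dir parameter are made up for illustration, and the URL is the one quoted above, which may no longer be valid.

```python
import subprocess
import sys
from pathlib import Path

# URL quoted in this thread; it may have expired since it was posted.
CRITEO_URL = (
    "https://criteo.wetransfer.com/downloads/"
    "4bbea9b4a54baddea549d71271a38e2c20230428071257/d4f0d2"
)


def build_download_command(url, script="transferwee.py"):
    """Return the argv list for `python transferwee.py download <url>`."""
    # sys.executable keeps the call on the same interpreter as this script.
    return [sys.executable, str(script), "download", url]


def download(url, script="transferwee.py", dest_dir="."):
    """Run transferwee non-interactively from inside dest_dir (hypothetical helper)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Resolve the script path first, because the working directory changes below.
    script_path = Path(script).resolve()
    # check=True raises CalledProcessError if transferwee exits with an error,
    # so a broken or expired link stops the setup script instead of passing silently.
    subprocess.run(build_download_command(url, script_path), cwd=dest, check=True)
```

Calling download(CRITEO_URL, dest_dir="criteo_raw") from the setup script would then start the transfer without opening a browser, provided the link is still live.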

From: 'Zack Nado' via public <pub...@mlcommons.org>
Sent: Tuesday, July 11, 2023 5:32 PM
To: Frank Schneider <f.sch...@uni-tuebingen.de>; George Dahl <gd...@google.com>; R. Eschenhagen <re...@cam.ac.uk>; pub...@mlcommons.org <pub...@mlcommons.org>
Subject: Re: Criteo1TB benchmark download issues
 
