Request to access gs://mlperf-llm-public2


kk Huang

Jul 25, 2023, 4:42:44 AM
to public
Hello Shriya,

I would like to run the LLM training benchmark using Paxml.

I am following the reference at https://github.com/mlcommons/training/tree/master/large_language_model/paxml

According to that page, the required dataset is on GCS (the resplit dataset uses 3.0.4 as its version to differentiate it from the original 3.0.1 version), but the GCS bucket is private and I cannot access it.

Could you make this data public or grant me access?

Thanks

Ritika Borkar

Jul 25, 2023, 3:48:44 PM
to kk Huang, public

Hi,

You can now access the LLM dataset from this public S3 bucket with instructions here: https://github.com/mlcommons/training/tree/master/large_language_model/megatron-lm#6-other

Let me know if this doesn't work for you.

Thanks,

Ritika

kk Huang

Jul 26, 2023, 3:07:48 PM
to Ritika Borkar, public
Hi Ritika:

Thanks for your reply. I can see the data when I use the awscli commands from your link to access the S3 bucket.
I have a few questions:
1) The introduction at https://github.com/mlcommons/training/tree/master/large_language_model/paxml says: "The resplit dataset uses 3.0.4 as its version to differentiate from the original 3.0.1 version, and it's available on GCS".
The Paxml benchmark uses 3.0.4, but the data in the S3 bucket is 3.0.1. Are they the same?
2) The data in the S3 bucket is in bin and idx format. As I understand it, the preprocessing script converted the JSON into these two formats for Megatron. Can the Paxml benchmark use this format as well?
3) The Paxml benchmark needs three initial checkpoints (listed in the table below), but they do not exist in the S3 bucket.
Pipeline | Checkpoint
No pipeline | https://console.cloud.google.com/storage/browser/mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000
4 stages | https://console.cloud.google.com/storage/browser/mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000_pipeline_4stages
8 stages | https://console.cloud.google.com/storage/browser/mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000_pipeline
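
For reference, the console links above point into the same bucket named in the subject line, so each one has a gs:// equivalent. Here is a minimal Python sketch of that mapping. It assumes the usual console URL layout (.../storage/browser/<bucket>/<path>), and the helper name console_url_to_gs_uri is my own, not something from the Paxml README. The bucket still has to be readable for the resulting URIs to be of any use.

```python
from urllib.parse import urlparse, unquote

CONSOLE_HOST = "console.cloud.google.com"
BROWSER_PREFIX = "/storage/browser/"


def console_url_to_gs_uri(url: str) -> str:
    """Map a Cloud Storage console browser URL to its gs:// URI.

    Hypothetical helper: assumes the path after /storage/browser/
    is "<bucket>/<object path>".
    """
    parsed = urlparse(url)
    if parsed.netloc != CONSOLE_HOST or not parsed.path.startswith(BROWSER_PREFIX):
        raise ValueError(f"not a Cloud Storage browser URL: {url}")
    # Drop the console prefix; what remains is bucket plus object path.
    remainder = unquote(parsed.path[len(BROWSER_PREFIX):]).strip("/")
    if not remainder:
        raise ValueError(f"no bucket in URL: {url}")
    return "gs://" + remainder


BASE = ("https://console.cloud.google.com/storage/browser/mlperf-llm-public2/"
        "gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/")

# The three initial checkpoints listed in the table.
checkpoints = {
    "No pipeline": BASE + "checkpoint_00004000",
    "4 stages": BASE + "checkpoint_00004000_pipeline_4stages",
    "8 stages": BASE + "checkpoint_00004000_pipeline",
}

for name, url in checkpoints.items():
    print(f"{name}: {console_url_to_gs_uri(url)}")
```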


Appreciate your help


Ritika Borkar <rbo...@nvidia.com> wrote on Wed, Jul 26, 2023 at 3:20 AM: