Issues with reproducing MLPerf Training & Inference 0.7


Jeff Liu

Jan 16, 2021, 1:02:28 PM
to public
Hi All,

We have seen potential customers asking for MLPerf benchmark numbers for GPU servers, which is an excellent sign that the benchmark is becoming the de facto standard! To make the MLPerf benchmark one of the GPU system specs, we are trying to come up with standard procedures our production team can use on the floor. However, we are encountering many issues when trying to reproduce the closed-division benchmark procedures:
1) datasets - either we don't have access (ImageNet) or the download is too slow
2) repos - many of the recently submitted closed-division repos are no longer valid, and 404 errors are common; sometimes the same URL serves what may be a newer version of the packages

We can develop workarounds, but that defeats the purpose of "repeatable methodologies" for a near "apples-to-apples" comparison. I want to get your input on whether it is a good idea to make the MLPerf benchmark part of the spec for all GPU-based systems, so that at least when customers ask, we have the numbers ready to present. Shall we "freeze" the datasets and repos whenever an MLPerf closed-division round is completed? What is the best way to give customers more confidence to incorporate the benchmark into their product selection processes?
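For example, "freezing" a repo could be as simple as recording the exact commit at submission time and always checking that commit out on the production floor; a minimal sketch in Python (the repo URL and commit placeholder are illustrative assumptions, not an official MLPerf mechanism):

    import subprocess

    # Hypothetical example: pin a submission repo to the exact commit
    # that was current at submission time, so every rerun sees the same code.
    repo = "https://github.com/mlcommons/inference_results_v0.7.git"  # assumed URL
    commit = "COMMIT_SHA"  # placeholder: the SHA recorded when the round closed

    subprocess.run(["git", "clone", repo], check=True)
    subprocess.run(["git", "-C", "inference_results_v0.7", "checkout", commit], check=True)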

Thanks and have a good, safe weekend!

Jeff 
 


Paulius Micikevicius

Jan 19, 2021, 12:38:32 PM
to Jeff Liu, public

Thank you for the feedback.

 

Regarding the datasets, would you be able to enumerate the datasets you’re having difficulty with, along with the problem for each? If the problem is a slow download, and the reason is the server that stores the dataset, then I’m not sure there is much we can do. The good news is that once you download a dataset you’ll be able to use it for a long time - in MLPerf we do not change the datasets very often. If the problem is access because of license agreements, then I’d say double-check with your legal team. Also, if license agreements are a repeated issue, then it’s good for us to know - we may approach the curators of those datasets to clarify the rules around benchmarking (sometimes a dataset license doesn’t allow the use of the data for commercial products, but whether benchmark numbers constitute a ‘product’ isn’t obvious and is a topic for lawyers to figure out).

 

Regarding the repos, we should work to correct that. I’d say file GitHub issues against those repos - that will more explicitly indicate to the corresponding submitters what needs addressing. We’ll take a look at the submission guidance as well and see whether it makes sense to introduce a requirement for how long after submission the repos must remain available, to minimize such issues in the future.

 

Paulius

 



Henrique Mendonça

Feb 24, 2021, 12:53:56 PM
to public, paul...@nvidia.com, yajun...@gmail.com
Hey guys,
I missed this conversation before, but just to add my 2 cents.
The original ImageNet 2012 can be found on Academic Torrents: https://academictorrents.com/details/a306397ccf9c2ead27155983c254227c0fd938e2
Perhaps it would be a good idea to add all the required datasets to that directory as well, if that legally suits MLPerf and the involved parties(?)
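Once downloaded, verifying the archive against the checksum published alongside the torrent would give everyone confidence they are benchmarking the same frozen data; a minimal sketch in Python (the filename is just an example):

    import hashlib

    def md5sum(path, chunk_size=1 << 20):
        # Compute the MD5 of a large file without loading it all into memory.
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                digest.update(block)
        return digest.hexdigest()

    # Compare the result against the reference checksum published with the dataset.
    print(md5sum("ILSVRC2012_img_train.tar"))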
Cheers,
Henrique

Paulius Micikevicius

Feb 24, 2021, 12:53:59 PM
to Henrique Mendonça, public, yajun...@gmail.com

Dataset location is up to the organizations providing the datasets; whenever obtaining a dataset, one must follow the corresponding license agreement.

Jean-Louis QUEGUINER

Mar 8, 2021, 11:44:18 AM
to public, paul...@nvidia.com, yajun...@gmail.com, henrique...@cscs.ch
Hi all,
I am also facing major issues when it comes to MLPerf inference:
the code is mainly reproducible only for NVIDIA's Xavier, Tesla, and Ampere generations.
I feel it is missing a way to reproduce results more broadly on other GPU generations.

NVIDIA announced some v0.7 performance numbers for Volta on its website, but the code is nowhere to be found (unfortunately).

It's a pity, because the V100, for instance, is TensorRT compatible.

Anyway, as Jeff said, it is really good news that customers are now looking at benchmarks to make the best hardware choice, so MLPerf will surely have a strong position in the future.

Jean-Louis QUEGUINER

Mar 8, 2021, 11:44:20 AM
to public, paul...@nvidia.com, yajun...@gmail.com, henrique...@cscs.ch
Hi all,
I was facing the same issue as Jeff, and honestly the ImageNet access-request process is a pain.
I am still waiting... it's been 3.5 years now...

[screenshot attachment: Capture d’écran 2021-02-25 à 09.27.26.png]


Jean-Louis QUEGUINER

Apr 13, 2021, 12:46:30 PM
to public, Jean-Louis QUEGUINER, paul...@nvidia.com, yajun...@gmail.com, henrique...@cscs.ch
Hello guys,
Did you face this type of issue?
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)

It seems to come from the TensorRT version; however, I don't see which TensorRT version was used for MLPerf 0.7 inference.

Do you know which version was used?
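In the meantime, here is a minimal sketch (assuming the standard TensorRT Python bindings; the workspace attribute below is the TensorRT 7-era API) to print the installed version and cap the builder workspace, which is the usual first mitigation for this CUDA out-of-memory error:

    import tensorrt as trt

    # Report the exact TensorRT version so results can be tied to it.
    print("TensorRT version:", trt.__version__)

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()
    # Cap the scratch memory the builder may allocate so engine building
    # does not exhaust GPU memory; tune the limit for your GPU.
    config.max_workspace_size = 1 << 30  # 1 GiB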

Thanks a lot.
Best