On measuring distributed performance


Ralf G.

Aug 13, 2018, 11:28:21 AM
to mlbench
Some thoughts on measuring distributed machine learning performance without the measurement itself impacting that performance. This is a continuation of our discussion from last Friday.

For comparison, Dawnbench and MLPerf both include test-set evaluation in their runtime measurements, with MLPerf only evaluating every 10-15 epochs.

I'm still unsure whether we should exclude test-set evaluation from our measurements, since it would likely be run in a real-life setting as well. And "stopping the clock" for workers can become arbitrarily complex, since one worker can potentially block another worker, which in turn blocks another, and so on, so discounting evaluation time quickly gets complicated.

I think we should include evaluation time, but just try to keep its impact small so as to have fair comparisons.

Time to Accuracy

  • To measure time to accuracy, we need to evaluate on the test set every x epochs / every x units of time / every x steps, and this operation introduces some overhead (see the sketch after this list)
  • we would like the measurements to be as fair as possible, i.e. discounting the overhead of evaluation
  • we don't want to train much longer than necessary (x cannot be too big)
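
Roughly what I have in mind, as a minimal sketch (train_step, evaluate and target_accuracy are hypothetical placeholders, not actual mlbench code):

    import time

    def time_to_accuracy(model, train_step, evaluate, target_accuracy, eval_every=100):
        """Train until the test accuracy reaches target_accuracy.
        Returns elapsed wall-clock time, including the evaluation overhead."""
        start = time.time()
        step = 0
        while True:
            train_step(model)              # one optimization step (placeholder)
            step += 1
            if step % eval_every == 0:
                acc = evaluate(model)      # test-set evaluation: this is the overhead
                if acc >= target_accuracy:
                    return time.time() - start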

Synchronized SGD all-reduce

This case is trivial: we can evaluate at the end of each (or every x-th) all-reduce phase. All workers are locked and waiting to start anyway, so evaluation can just run on a single (or each?) worker. In this case it's also possible to discount evaluation time, since the clock can be stopped while evaluating.
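
A minimal sketch of what I mean, assuming torch.distributed-style collectives (evaluate() is a hypothetical helper, and the details would differ in mlbench):

    import time
    import torch.nn.functional as F
    import torch.distributed as dist

    def train_sync(model, optimizer, batches, evaluate, eval_every=10):
        """Synchronous all-reduce SGD where evaluation time is kept off the clock."""
        rank, world_size = dist.get_rank(), dist.get_world_size()
        train_time, eval_log = 0.0, []
        for i, (x, y) in enumerate(batches, start=1):
            t0 = time.time()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            for p in model.parameters():              # average gradients across workers
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size
            optimizer.step()
            optimizer.zero_grad()
            train_time += time.time() - t0            # clock stops here
            if i % eval_every == 0:
                if rank == 0:
                    eval_log.append((train_time, evaluate(model)))   # off the clock
                dist.barrier()                        # others wait for rank 0 anyway
        return train_time, eval_log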


Async Centralized SGD (with a Parameter Server)

Only the performance on the Parameter Server (PS) has to be measured, since it represents the current model state.
Evaluating on the PS directly would block the async workers and is not feasible for any decently sized test set.

Dumping a checkpoint of the PS every x updates and evaluating on a separate node (maybe the master node) is more feasible, but checkpointing still incurs an overhead.
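
A rough sketch of that checkpoint idea, assuming the PS holds a dict of numpy-style parameter arrays and some hypothetical send_to_eval_node() transport; the PS only pays for an in-memory copy, and evaluation happens elsewhere:

    import copy

    class ParameterServer:
        def __init__(self, params, send_to_eval_node, checkpoint_every=1000):
            self.params = params                          # dict of parameter arrays
            self.send_to_eval_node = send_to_eval_node    # hypothetical transport to the master node
            self.checkpoint_every = checkpoint_every
            self.num_updates = 0

        def apply_update(self, gradient, lr=0.01):
            """Apply one asynchronous update coming from a worker."""
            for k in self.params:
                self.params[k] -= lr * gradient[k]
            self.num_updates += 1
            if self.num_updates % self.checkpoint_every == 0:
                # only the copy runs on the PS; evaluation happens on the separate node
                self.send_to_eval_node(self.num_updates, copy.deepcopy(self.params))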

Updates from workers could be kept in memory and applied once checkpointing is done, with the workers receiving stale information in the meantime. But that sounds like a nightmare to implement and is quite questionable; I wouldn't recommend it.

Blocking workers while checkpointing is going on, and stopping their clocks as well, could be feasible, but it's unclear how this would impact the measurements. The impact might be negligible if it's only done rarely, but not if it's done after every update.

I can't think of a perfectly fair solution that doesn't influence results and still stops after a certain accuracy is reached. But at least discounting eval time is still doable.

Async Decentralized SGD

This is the case of e.g. gossip-style implementations: each worker only communicates with a fixed set of adjacent workers (usually around 2-3) and there is no single main node.

Since all updates take at most some constant time (depending on the number of nodes) to reach all nodes, a single node could be used to represent the performance of all nodes, with some lag (measured accuracy probably lags behind actual accuracy).

I wonder if a dedicated eval node that receives weight updates but doesn't send any would be the fairest implementation. But if this node blocks the nodes communicating with it, which in turn block other workers, etc., it becomes almost impossible to discount eval time.
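
Sketching that eval-node idea, the neighbouring workers would have to hand off their weights without ever waiting for the eval node, e.g. (an in-process stand-in for whatever transport we'd actually use; evaluate() is hypothetical):

    import copy
    import queue

    eval_queue = queue.Queue(maxsize=4)   # stand-in for a non-blocking channel to the eval node

    def push_to_eval_node(step, params):
        """Called by a worker adjacent to the eval node; must never block."""
        try:
            eval_queue.put_nowait((step, copy.deepcopy(params)))
        except queue.Full:
            pass  # eval node is lagging behind; drop this snapshot instead of blocking

    def eval_node_loop(evaluate, results):
        """Runs on the dedicated eval node, which receives weights but never sends any."""
        while True:
            step, params = eval_queue.get()
            results.append((step, evaluate(params)))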

Another solution would be to just have each worker checkpoint itself every x batches, but this would lead to more blocking and could introduce chaotic virtual bottlenecks where the whole cluster is waiting for a single worker to checkpoint due to bad timing.

Either way, I'm not sure what other approach could be used. I don't think there is any good procedure to discount evaluation time in this scenario.

Accuracy after time

This metric is a lot easier to implement and evaluate. We train for a fixed amount of time, then stop and evaluate the result. This way, no continuous measurement of test performance is needed.
But we also don't get nice training plots and don't see how implementations differ over time, let alone whether the curve had flattened out at the end. So we would get a single number per model that could be used for comparison, but I doubt it would be a meaningful comparison.
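
For reference, the fixed-budget variant is basically just this (train_step and evaluate are again hypothetical placeholders):

    import time

    def accuracy_after_time(model, train_step, evaluate, budget_seconds):
        """Train for a fixed wall-clock budget, then evaluate once at the end."""
        deadline = time.time() + budget_seconds
        while time.time() < deadline:
            train_step(model)
        return evaluate(model)   # single number per model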

If we do checkpoints at regular intervals, it leads to the same problems as with Time to Accuracy. If we have a separate, dedicated node to evaluate checkpoints, both metrics could reduce to the same implementation.
We'd still have to deal with the latency introduced by checkpointing.


Open Questions

Do we want to exclude evaluation on the test set from our metrics?

Which of the two metrics do we implement? Which one do we implement first?

How do we implement them? How do we minimize the impact?

What are your thoughts?

Fabian Pedregosa

Aug 15, 2018, 10:48:30 PM
to Ralf G., mlbench



On 08/13/2018 08:28 AM, Ralf G. wrote:
I'm still unsure whether we should exclude test-set evaluation from our measurements, since it would likely be run in a real-life setting as well. And "stopping the clock" for workers can become arbitrarily complex, since one worker can potentially block another worker, which in turn blocks another, and so on, so discounting evaluation time quickly gets complicated.

One thing I've done in the past to overcome this difficulty is to only dump a snapshot of the time and the vector of coefficients during the algorithm, and to compute the relevant measures (objective function, train/test set performance) from the stored snapshots only once the algorithm has finished running. Most of the time, making a copy of the vector of coefficients is negligible with respect to the overall cost of the algorithm, which avoids the need to stop the clock, something that can be very tricky in asynchronous algorithms.
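
In code it would look roughly like this (a sketch with placeholder names, nothing library-specific):

    import copy
    import time

    def run_with_snapshots(params, update_step, num_steps, snapshot_every=100):
        """Run the algorithm, storing only (elapsed time, copy of coefficients) pairs."""
        snapshots, start = [], time.time()
        for step in range(num_steps):
            update_step(params)                      # the actual optimization step
            if step % snapshot_every == 0:
                # cheap: just copy the coefficient vector, no evaluation here
                snapshots.append((time.time() - start, copy.deepcopy(params)))
        return snapshots

    def evaluate_offline(snapshots, evaluate):
        """After the run: compute objective / train / test metrics from the snapshots."""
        return [(t, evaluate(p)) for t, p in snapshots]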

Best,

Fabian


lie.he

Aug 16, 2018, 5:08:40 AM
to mlbench
This metric is a lot easier to implement and evaluate. We train for a fixed amount of time, then stop and evaluate the result. This way, no continuous measurement of test performance is needed.
But we also don't get nice training plots and don't see how implementations differ over time, let alone whether the curve had flattened out at the end. So we would get a single number per model that could be used for comparison, but I doubt it would be a meaningful comparison.

A recent post in the MLPerf group (https://groups.google.com/forum/#!topic/mlperf/b7LxTvvGfbQ) suggests using Multi-Thresholds Time to Accuracy (MTTTA), because an arbitrary choice of a single threshold (like a target accuracy) may lead to unfair comparisons between models.
In this sense, a continuous measurement is also necessary.
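
For example, given a log of (elapsed time, test accuracy) pairs collected during training, the multi-threshold metric could be computed afterwards along these lines (the thresholds here are arbitrary examples):

    def multi_threshold_time_to_accuracy(log, thresholds=(0.90, 0.92, 0.94)):
        """log: list of (elapsed_seconds, test_accuracy) in time order.
        Returns the first time each threshold was reached (None if never)."""
        return {th: next((t for t, acc in log if acc >= th), None) for th in thresholds}

    # e.g. multi_threshold_time_to_accuracy([(60, 0.89), (120, 0.93)])
    # -> {0.9: 120, 0.92: 120, 0.94: None}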