ml.p3.16xlarge


Paula Shuffleburg

Aug 4, 2024, 11:36:21 PM
to eneninfog
I am trying to run the example notebook above on SageMaker using the gpt-j-xl model with the suggested instance of an ml.p3.16xlarge. However, I keep running into an out-of-memory error. I have tried other suggested instances (e.g., ml.g4dn.12xlarge) as well but get the same error. I've attached the latest error below. I've tried setting the train and val batch sizes as low as 2 and still run into OOM issues. Any guidance would be appreciated.

Great catch, I missed that. I'll run it through again to see if that fixes the issue. I ended up getting it to run, but had to decrease the batch size significantly along with a few other tweaks. I also had to adjust the processes-per-host value because that threw an error as well. I'll report back on what I find.


Required. The number of partitions to split the model into. In the case of pipeline_parallel_degree for PyTorch, this is the number of devices over which pipeline parallelism will be performed.


Determines the distribution mechanism of transformer layers. If optimizing speed, there will be less communication across tensor-parallel ranks and layer normalization will not be distributed; however, there will be duplicate activations stored across tensor-parallel ranks. If optimizing memory, there will be no redundant activations stored, but this will result in more communication overhead across tensor-parallel ranks.


Determines the mapping of model partitions onto physical devices. When hybrid model/data parallelism is used, cluster places a single model replica in neighboring device IDs, whereas spread places a model replica as far apart as possible. For more information, see Ranking Basics without Tensor Parallelism.


In the permutation letters, D stands for reduced-data parallelism, P stands for pipeline parallelism, and T stands for tensor parallelism. spread is equivalent to "TPD", and cluster is equivalent to "DPT". For more information, see Placement Strategy with Tensor Parallelism.
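The letter mapping above can be sketched as a small lookup. This is a minimal illustration of the naming convention only; the strategy names and permutations come from the text above:

```python
# Placement-strategy shorthand described above: each named strategy
# expands to a permutation of the three parallelism letters.
# D = reduced-data parallelism, P = pipeline parallelism, T = tensor parallelism.
PLACEMENT_ALIASES = {
    "spread": "TPD",
    "cluster": "DPT",
}

def expand_placement(strategy: str) -> str:
    """Return the three-letter permutation for a named strategy,
    or the permutation itself if one is passed directly."""
    return PLACEMENT_ALIASES.get(strategy, strategy)

print(expand_placement("cluster"))  # DPT
```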


The weight of memory balancing in the auto-partitioning objective, as opposed to balancing computational load. If 0.0, the library only tries to balance computation; if 1.0, the library only tries to balance memory use. Any value in between interpolates between these extremes.


This is the maximum number of microbatches that are simultaneously in execution during pipelining. Jointly scaling batch size and number of microbatches can often mitigate the pipeline bubble overhead, but that can lead to increased memory usage if too many microbatches are simultaneously in execution. In such cases, setting the number of active microbatches to a lower value can help control memory usage. By default, this is set to two plus the number of partitions of the model.
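The default described above (two plus the number of model partitions) can be sketched as follows. Treat the configuration key names here as assumptions based on the surrounding descriptions, not verified API:

```python
# Sketch of a model-parallel configuration. Per the text above, the
# default number of active microbatches is two plus the number of
# model partitions; key names are assumptions from the descriptions.
def default_active_microbatches(partitions: int) -> int:
    return partitions + 2

smp_parameters = {
    "partitions": 4,    # number of model partitions
    "microbatches": 8,  # microbatches per batch for pipelining
}
smp_parameters["active_microbatches"] = default_active_microbatches(
    smp_parameters["partitions"]
)
print(smp_parameters["active_microbatches"])  # 6
```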


Enables activation offloading. To improve GPU memory usage, use activation offloading only when (1) the microbatches and active_microbatches are greater than 1, and (2) activation checkpointing is enabled for at least one module in the model.


Specify the number of pipeline tasks. This determines how early the activations should be loaded back to the GPU, expressed in number of pipeline tasks. A smaller value indicates that activations are loaded closer in time to when they are needed for the backward pass. Setting this value too small might improve memory usage, but might cause throughput loss and GPU bottlenecks during the CPU-to-GPU data transfer.


To run FP16 training, add "fp16": True to the smp configuration. Other APIs remain the same between FP16 and FP32. If fp16 is enabled, when the user calls smp.DistributedModel the model will be wrapped with FP16_Module, which converts the model to FP16 dtype and handles the forward pass in FP16. If fp16 is enabled, when the user calls smp.DistributedOptimizer the optimizer will be wrapped with FP16_Optimizer.
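As a configuration sketch of the above: the only change from an FP32 setup is the "fp16" flag, since the FP16_Module / FP16_Optimizer wrapping described above happens inside the library. The other key names are illustrative assumptions:

```python
# Minimal FP16 configuration sketch per the description above.
# Only the "fp16" flag differs from FP32 training; smp.DistributedModel
# and smp.DistributedOptimizer then apply FP16_Module / FP16_Optimizer
# wrapping internally. Other keys are illustrative assumptions.
smp_parameters = {
    "partitions": 2,
    "microbatches": 4,
    "fp16": True,  # switch the whole training job to FP16
}
print(smp_parameters["fp16"])  # True
```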


If True, the library shards the optimizer state of all parameters across the data parallel processes which hold the same parameter. This optimizer state sharding happens in a balanced manner. Note that when sharding optimizer state, full optimizer saving is not currently supported; save partial optimizer state instead. For more information about saving and loading checkpoints with optimizer state sharding, see Instructions for Checkpointing with Tensor Parallelism.


If True and when smp.nn.DistributedTransformerLMHead is used (this is typically used for GPT-2 or GPT-3 models), the library assumes that the devices in the same tensor parallelism group receive the same input data. Otherwise, it is assumed that they receive different examples. To learn more, see Prescaled Batch.


To run a training job using sharded data parallelism, add this parameter and specify a number greater than 1. Sharded data parallelism is a memory-saving distributed training technique that splits the training state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group. For more information, see Sharded Data Parallelism.


Specifies the size, in number of elements, of a parameter tensor that can persist at each GPU. Sharded data parallelism splits each parameter tensor across GPUs of a data parallel group. If the number of elements in the parameter tensor is smaller than this threshold, the parameter tensor is not split; this helps reduce communication overhead because the parameter tensor is replicated across data-parallel GPUs.


Specifies the maximum number of parameters that can simultaneously be in a recombined training state during the forward and backward pass. Parameter fetching with the AllGather operation pauses when the number of active parameters reaches the given threshold. Note that increasing this parameter increases the memory footprint.


If set to True, the AllGather operation runs hierarchically: it runs within each node first, and then runs across nodes. For multi-node distributed training jobs, the hierarchical AllGather operation is automatically activated.


Specifies a threshold for clipping the L2 norm of the gradients before propagating them backward through the model parameters. When sharded data parallelism is activated, gradient clipping is also activated. The default threshold is 1.0. Adjust this parameter if you have the exploding gradients problem.
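Putting the sharded-data-parallelism settings above together, a configuration sketch might look like the following. The key names are assumptions inferred from the descriptions above; verify them against the library's documentation before use:

```python
# Sketch of a sharded-data-parallelism configuration combining the
# parameters described above. Values are illustrative; key names are
# assumptions based on the surrounding descriptions.
sdp_parameters = {
    "sharded_data_parallel_degree": 8,        # >1 activates sharding
    "sdp_param_persistence_threshold": 1024,  # tensors below this many elements stay replicated
    "sdp_max_live_parameters": 1_000_000,     # cap on simultaneously regathered parameters
    "sdp_hierarchical_allgather": True,       # AllGather within each node first, then across nodes
    "sdp_gradient_clipping": 1.0,             # default L2-norm clipping threshold
}
print(sdp_parameters["sharded_data_parallel_degree"])  # 8
```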


"processes_per_host": Specifies the number of processes MPI should launch on each host. In SageMaker, a host is a single Amazon EC2 ml instance. The SageMaker distributed model parallel library maintains a one-to-one mapping between processes and GPUs across model and data parallelism. This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process. If you are using PyTorch, you must restrict each process to its own device using torch.cuda.set_device(smp.local_rank()). To learn more, see Modify a PyTorch Training Script.


For example, if you use one instance with 4-way model parallelism and 2-way data parallelism, then processes_per_host should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs, such as an ml.p3.16xlarge.
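The sizing rule above can be checked with a few lines of arithmetic. The GPU counts in the table below are illustrative assumptions for the instance types mentioned in this thread:

```python
# Sizing rule from the example above: each process owns exactly one GPU,
# so processes_per_host = model_parallel_degree * data_parallel_degree,
# and the instance must have at least that many GPUs.
def processes_per_host(model_parallel_degree: int, data_parallel_degree: int) -> int:
    return model_parallel_degree * data_parallel_degree

# Illustrative GPU counts (assumption) for the instances in this thread.
gpus_per_instance = {"ml.p3.16xlarge": 8, "ml.g4dn.12xlarge": 4}

needed = processes_per_host(4, 2)  # 4-way model x 2-way data parallelism
print(needed)  # 8
print(needed <= gpus_per_instance["ml.p3.16xlarge"])  # True
```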


The local_rank of a process is the rank of the process among the processes in the same instance. This can range from 0 up to the number of GPUs in the instance, but can be lower if fewer processes than GPUs are launched in the instance. For instance, in the preceding example, the local_ranks of the processes range from 0 to 7, since there are 8 GPUs in a p3dn.24xlarge instance.


When model parallelism is used together with data parallelism (Horovod for TensorFlow and DDP for PyTorch), the library partitions the set of processes into disjoint mp_groups. An mp_group is a subset of all processes that together hold a single, partitioned model replica.


For instance, if a single-node job is launched with 8 local processes with partitions=2 (meaning the model will be split into 2), there are four mp_groups. The specific sets of processes that form the mp_groups can be adjusted by the placement_strategy option.


If placement_strategy is spread, then the four mp_groups are [0, 4], [1, 5], [2, 6], [3, 7]. The mp_rank is the rank of a process within its mp_group. For example, the mp_rank is 0 for processes 0, 1, 2, and 3, and the mp_rank is 1 for processes 4, 5, 6, and 7.


Analogously, the library defines dp_groups as sets of processes that all hold the same model partition and perform data parallelism among each other. If placement_strategy is spread, there are two dp_groups: [0, 1, 2, 3] and [4, 5, 6, 7].


Each process within a dp_group holds the same partition of the model, and the processes in a dp_group make allreduce calls among themselves. Allreduce for data parallelism does not take place across dp_groups. dp_rank is defined as the rank of a process within its dp_group. In the preceding example, the dp_rank of process 6 is 2.
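The groupings in the preceding example can be reconstructed with a short sketch. This is a minimal derivation of the "spread" grouping from the numbers given above, not the library's actual code:

```python
# Reconstructing the rank groups from the example above: 8 processes,
# partitions=2, "spread" placement. A minimal sketch of how the
# grouping can be derived (not the library's implementation).
def spread_groups(n_processes: int, partitions: int):
    dp_degree = n_processes // partitions
    # Each mp_group holds one full model replica, spread far apart.
    mp_groups = [[k + j * dp_degree for j in range(partitions)]
                 for k in range(dp_degree)]
    # Each dp_group holds the same model partition.
    dp_groups = [[j * dp_degree + k for k in range(dp_degree)]
                 for j in range(partitions)]
    return mp_groups, dp_groups

mp_groups, dp_groups = spread_groups(8, 2)
print(mp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(dp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups[1].index(6))  # 2 -- the dp_rank of process 6
```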


In addition to the two placement strategies introduced in the previous section, the library provides additional placement strategies for extended tensor parallelism features for PyTorch. The additional placement strategies (parallelism types) are denoted as follows:


For a given permutation of the three letters, the library performs the parallelism type represented by the right-most letter over the global ranks in ascending order, i.e., over neighboring ranks. Conversely, the parallelism type represented by the left-most letter is performed over the ranks that are as distant as possible.


Because the neighboring ranks are placed on the same instance with high-bandwidth NVLinks, it is recommended to place the parallelism type that has higher bandwidth requirements for your model in the right-most position of the placement_strategy string. Because tensor parallelism often requires frequent communication, placing T in the right-most position is recommended (as in the default "cluster" strategy). For many large models, keeping the default "cluster" results in the best performance.


The way tensor parallelism works is that when a module is distributed, the inputs to the distributed module in different tp_ranks get shuffled around in a way that is sliced by the hidden dimension and scaled by the batch dimension. For example, if the tensor parallel degree is 8, the inputs to DistributedTransformer (a tensor with shape [B, S, H], where B = batch size, S = sequence length, H = hidden width) in different tp_ranks will be communicated around, and the shapes become [8B, S, H/8]. Each tp_rank has the batch from all the peer tp_ranks, but only the slice that interacts with its local partition of the module.
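The shape bookkeeping above is easy to verify with a few lines. This is just arithmetic on the example given in the text; the dimension values are illustrative:

```python
# Shape transformation from the example above: with tensor parallel
# degree tp, an input of shape [B, S, H] to a distributed module
# becomes [tp*B, S, H // tp] on each tp_rank (full peer batch, but
# only the local slice of the hidden dimension).
def tp_local_shape(batch: int, seq_len: int, hidden: int, tp_degree: int):
    assert hidden % tp_degree == 0, "hidden width must divide evenly across tp_ranks"
    return (tp_degree * batch, seq_len, hidden // tp_degree)

print(tp_local_shape(4, 1024, 4096, 8))  # (32, 1024, 512)
```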
