We encountered an error while training the model on your data. AlgorithmError: ExecuteUserScriptError: ExitCode 1 ErrorMessage "raise TypeError(TypeError: Invalid function argument. Expected parameter tensor to be of type torch.Tensor. Traceback (most recent call last) File "/opt/ml/code/llama_finetuning.py", line 335, in fire.Fire(main)
The computation works as follows: all text is processed, concatenated, and then split into samples, each of length equal to the max input length. The samples are then batched according to the batch size. If you are training on a machine with 8 GPUs, you need at least 8 non-empty batches. In other words, you need enough data to produce 8 batches, or you need to decrease the batch size, or you need to reduce the max input length.
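The arithmetic above can be checked before launching a job. The following is a minimal sketch, not the training script's actual logic: the function names are hypothetical, and it assumes that a trailing partial sample is dropped while a trailing partial batch still counts as non-empty.

```python
def count_batches(total_tokens: int, max_input_length: int, batch_size: int) -> int:
    """Estimate how many non-empty batches the data yields.

    Assumption: text is concatenated and cut into samples of exactly
    max_input_length tokens, and a leftover shorter sample is dropped.
    """
    num_samples = total_tokens // max_input_length
    # Ceiling division: a final, partially filled batch is still non-empty.
    return -(-num_samples // batch_size)


def has_enough_batches(total_tokens: int, max_input_length: int,
                       batch_size: int, num_gpus: int = 8) -> bool:
    """True when every GPU can receive at least one non-empty batch."""
    return count_batches(total_tokens, max_input_length, batch_size) >= num_gpus


# 20,000 tokens at max input length 2048 give 9 samples, i.e. 3 batches of 4.
too_few = has_enough_batches(20_000, 2048, batch_size=4)
# Reducing the max input length to 512 gives 39 samples, i.e. 10 batches of 4.
shorter_inputs = has_enough_batches(20_000, 512, batch_size=4)
# Alternatively, keeping 2048 but using batch size 1 gives 9 batches.
smaller_batches = has_enough_batches(20_000, 2048, batch_size=1)
```

With these illustrative numbers, the first configuration falls short of 8 batches, and either remedy from the paragraph above (shorter inputs or a smaller batch size) fixes it.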
torch.distributed supports three built-in backends, each with different capabilities. The table below shows which functions are available for use with CPU / CUDA tensors. MPI supports CUDA only if the implementation used to build PyTorch supports it.
The PyTorch distributed package supports Linux (stable), MacOS (stable), and Windows (prototype). By default for Linux, the Gloo and NCCL backends are built and included in PyTorch distributed (NCCL only when building with CUDA). MPI is an optional backend that can only be included if you build PyTorch from source (e.g. building PyTorch on a host that has MPI installed).
Use NCCL, since it currently provides the best distributed GPU training performance, especially for multiprocess single-node or multi-node distributed training. If you encounter any problem with NCCL, use Gloo as the fallback option. (Note that Gloo currently runs slower than NCCL for GPUs.)
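The rule of thumb above can be written as a small selection helper. This is a sketch under stated assumptions: `choose_backend` is a hypothetical name, and falling back to Gloo when no GPU is present is an assumption added here, not something the paragraph above states.

```python
def choose_backend(cuda_available: bool, nccl_available: bool = True) -> str:
    """Pick a torch.distributed backend name.

    Prefer NCCL for GPU training; fall back to Gloo when NCCL cannot be
    used (assumption: this includes the CPU-only case).
    """
    if cuda_available and nccl_available:
        return "nccl"
    return "gloo"


gpu_choice = choose_backend(cuda_available=True)
fallback_choice = choose_backend(cuda_available=True, nccl_available=False)
cpu_choice = choose_backend(cuda_available=False)
```

In a real training script the two flags would typically come from `torch.cuda.is_available()` and `torch.distributed.is_nccl_available()`, and the result would be passed as the `backend` argument of `init_process_group()`.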
By default, both the NCCL and Gloo backends will try to find the right network interface to use. If the automatically detected interface is not correct, you can override it using the following environment variables (applicable to the respective backend):
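The variables in question are `NCCL_SOCKET_IFNAME` for NCCL and `GLOO_SOCKET_IFNAME` for Gloo. A minimal sketch of setting one from Python follows; the helper name and the interface name `eth0` are placeholders chosen for illustration, and the variable has to be set before the process group is initialized.

```python
import os

# Environment variable each backend reads when choosing a network interface.
IFNAME_ENV_VARS = {
    "nccl": "NCCL_SOCKET_IFNAME",
    "gloo": "GLOO_SOCKET_IFNAME",
}


def pin_network_interface(backend: str, interface: str) -> str:
    """Set the interface override for the given backend and return the variable name."""
    variable = IFNAME_ENV_VARS[backend.lower()]
    os.environ[variable] = interface
    return variable


# "eth0" is a placeholder; use the interface that is correct on your hosts.
variable_set = pin_network_interface("gloo", "eth0")
```

The same effect can be had by exporting the variable in the shell that launches the training processes.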
You may also use NCCL_DEBUG_SUBSYS to get more details about a specific aspect of NCCL. For example, NCCL_DEBUG_SUBSYS=COLL would print logs of collective calls, which may be helpful when debugging hangs, especially those caused by collective type or message size mismatch. In case of topology detection failure, it would be helpful to set NCCL_DEBUG_SUBSYS=GRAPH to inspect the detailed detection result and save it as a reference if further help from the NCCL team is needed.
The torch.distributed package provides PyTorch support and communication primitives for multiprocess parallelism across several computation nodes running on one or more machines. The class torch.nn.parallel.DistributedDataParallel() builds on this functionality to provide synchronous distributed training as a wrapper around any PyTorch model. This differs from the kinds of parallelism provided by the torch.multiprocessing package and torch.nn.DataParallel() in that it supports multiple network-connected machines and in that the user must explicitly launch a separate copy of the main training script for each process.
In the single-machine synchronous case, torch.distributed or the torch.nn.parallel.DistributedDataParallel() wrapper may still have advantages over other approaches to data-parallelism, including torch.nn.DataParallel():
Each process maintains its own optimizer and performs a complete optimization step with each iteration. While this may appear redundant, since the gradients have already been gathered together and averaged across processes and are thus the same for every process, this means that no parameter broadcast step is needed, reducing time spent transferring tensors between nodes.
The package needs to be initialized using the torch.distributed.init_process_group() or torch.distributed.device_mesh.init_device_mesh() function before calling any other methods. Both block until all processes have joined.
Otherwise, torch.distributed does not expose any other APIs. Currently, torch.distributed is available on Linux, MacOS and Windows. Set USE_DISTRIBUTED=1 to enable it when building PyTorch from source. Currently, the default value is USE_DISTRIBUTED=1 for Linux and Windows, USE_DISTRIBUTED=0 for MacOS.
init_device_mesh follows the SPMD programming model, meaning the same PyTorch Python program runs on all processes/ranks in the cluster. Ensure mesh_shape (the dimensions of the nD array describing device layout) is identical across all ranks. An inconsistent mesh_shape may lead to hanging.
The existence of the TORCHELASTIC_RUN_ID environment variable is used as a proxy to determine whether the current process was launched with torchelastic. This is a reasonable proxy since TORCHELASTIC_RUN_ID maps to the rendezvous id, which is always a non-null value indicating the job id for peer discovery purposes.
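The check described above amounts to looking for one environment variable. The sketch below mirrors that idea with a hypothetical helper that accepts an explicit mapping so it can be exercised without touching the real environment; it is an illustration of the proxy, not the library's source.

```python
import os
from typing import Mapping, Optional


def launched_with_torchelastic(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when TORCHELASTIC_RUN_ID is present in the environment."""
    env = os.environ if environ is None else environ
    return env.get("TORCHELASTIC_RUN_ID") is not None


# Simulated environments: "job-1" is a made-up rendezvous id.
under_elastic = launched_with_torchelastic({"TORCHELASTIC_RUN_ID": "job-1"})
plain_launch = launched_with_torchelastic({})
```

In application code, `torch.distributed.is_torchelastic_launched()` performs this check for the current process.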
There are two ways to initialize using TCP, both requiring a network address reachable from all processes and a desired world_size. The first way requires specifying an address that belongs to the rank 0 process. This initialization method requires that all processes have manually specified ranks.
This method will always create the file and try its best to clean up and remove the file at the end of the program. In other words, each initialization with the file init method needs a brand new empty file in order to succeed. If a file used by a previous initialization (which happened not to get cleaned up) is used again, this is unexpected behavior and can often cause deadlocks and failures. Therefore, even though this method will try its best to clean up the file, if the auto-delete happens to be unsuccessful, it is your responsibility to ensure that the file is removed at the end of training to prevent the same file from being reused the next time. This is especially important if you plan to call init_process_group() multiple times on the same file name. In other words, if the file is not removed/cleaned up and you call init_process_group() again on that file, failures are expected. The rule of thumb is to make sure that the file is non-existent or empty every time init_process_group() is called.
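One way to follow this rule of thumb is to delete any leftover file before building the init method string. The helper below is a hypothetical sketch; it assumes a POSIX-style path and that it runs once (for example in the launcher) before any worker calls init_process_group(), since deleting the file while other ranks are already rendezvousing would itself cause failures.

```python
import os
import tempfile


def fresh_file_init_method(path: str) -> str:
    """Remove a stale rendezvous file, if any, and return a file:// init method."""
    if os.path.exists(path):
        os.remove(path)
    return "file://" + os.path.abspath(path)


# Simulate a file left behind by a previous run, then clean it up.
init_dir = tempfile.mkdtemp()
init_path = os.path.join(init_dir, "rendezvous")
open(init_path, "w").close()
init_method = fresh_file_init_method(init_path)
```

The returned string is intended to be passed as `init_method` to `init_process_group()`, together with explicit `rank` and `world_size` values.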
This class can be directly called to parse the string, e.g., Backend(backend_str) will check if backend_str is valid, and return the parsed lowercase string if so. It also accepts uppercase strings, e.g., Backend("GLOO") returns "gloo".
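A short check of the parsing behaviour just described, assuming a PyTorch build with distributed support installed:

```python
from torch.distributed import Backend

# Uppercase input is normalized to the lowercase backend name.
parsed = Backend("GLOO")
# The class also exposes the built-in names as attributes.
same_as_attribute = parsed == Backend.GLOO
```

The parsed value behaves like a plain string, so it can be compared directly against `"gloo"` or passed on as the `backend` argument.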
The simplest pattern to follow is to destroy every process group and backend by calling destroy_process_group() with the default value of None for the group argument, at a point in the training script where communications are no longer needed, usually near the end of main(). The call should be made once per trainer-process, not at the outer process-launcher level.
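The pattern can be sketched with a single-process job. The example assumes a PyTorch build with the Gloo backend on a POSIX system, uses file-based initialization with a freshly created temporary path, and sets `world_size=1` only so that it runs without a launcher; a real job would take rank and world size from its launcher.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def main() -> float:
    # A new temporary directory guarantees the rendezvous file does not exist yet.
    init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=0,
        world_size=1,
    )
    try:
        tensor = torch.ones(1)
        # Sum across all ranks; with a single rank the value is unchanged.
        dist.all_reduce(tensor)
        return tensor.item()
    finally:
        # Called once per trainer process, with the default group=None,
        # after the last collective.
        dist.destroy_process_group()


result = main()
```

Wrapping the training body in `try`/`finally` is a design choice made here so that the teardown also runs when training raises.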
The distributed package comes with a distributed key-value store, which can be used to share information between processes in the group as well as to initialize the distributed package in torch.distributed.init_process_group() (by explicitly creating the store as an alternative to specifying init_method). There are 3 choices for key-value stores: TCPStore, FileStore, and HashStore.
A TCP-based distributed key-value store implementation. The server store holds the data, while the client stores can connect to the server store over TCP and perform actions such as set() to insert a key-value pair, get() to retrieve a key-value pair, etc. There should always be one server store initialized because the client store(s) will wait for the server to establish a connection.
Retrieves the value associated with the given key in the store. If key is not present in the store, the function will wait for timeout, which is defined when initializing the store, before throwing an exception.
The first call to add for a given key creates a counter associated with key in the store, initialized to amount. Subsequent calls to add with the same key increment the counter by the specified amount. Calling add() with a key that has already been set in the store by set() will result in an exception.
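The set(), get(), and add() calls described above can be tried in a single process. This sketch uses HashStore rather than TCPStore purely so that it needs no server address or port; the key names are arbitrary, and a PyTorch build with distributed support is assumed.

```python
import torch.distributed as dist

# In-process store; TCPStore and FileStore expose the same methods.
store = dist.HashStore()

store.set("first_key", "first_value")
value = store.get("first_key")  # values come back as bytes

# The first add() creates the counter; later calls increment it.
after_first_add = store.add("counter", 1)
after_second_add = store.add("counter", 6)
```

Note that `get()` returns `bytes`, so the stored string needs decoding before it is compared with a `str`.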
Inserts the key-value pair into the store based on the supplied key and performs a comparison between expected_value and desired_value before inserting. desired_value will only be set if expected_value for the key already exists in the store or if expected_value is an empty string.
Returns the number of keys set in the store. Note that this number will typically be one greater than the number of keys added by set() and add(), since one key is used to coordinate all the workers using the store.
When used with the FileStore, num_keys returns the number of keys written to the underlying file. If the store is destructed and another store is created with the same file, the original keys will be retained.
By default, collectives operate on the default group (also called the world) and require all processes to enter the distributed function call. However, some workloads can benefit from more fine-grained communication. This is where distributed groups come into play. The new_group() function can be used to create new groups, with arbitrary subsets of all processes. It returns an opaque group handle that can be given as a group argument to all collectives (collectives are distributed functions to exchange information in certain well-known programming patterns).
This function requires that all processes in the main group (i.e. all processes that are part of the distributed job) enter this function, even if they are not going to be members of the group. Additionally, groups should be created in the same order in all processes.
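A minimal sketch of creating a group and passing its handle to a collective follows. It assumes a PyTorch build with the Gloo backend on a POSIX system and runs as a single process (`world_size=1`) so that it works without a launcher; in a multi-process job every rank would make the same new_group() calls in the same order, including ranks that are not members.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def run_subgroup_demo() -> float:
    init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=0,
        world_size=1,
    )
    try:
        # Every process in the job calls new_group(), member or not.
        group = dist.new_group(ranks=[0])
        tensor = torch.tensor([3.0])
        # The handle restricts the collective to the group's members.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
        return tensor.item()
    finally:
        dist.destroy_process_group()


subgroup_sum = run_subgroup_demo()
```

With one member the sum equals the input; with more ranks listed in `ranks`, only those ranks would contribute to the reduction.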
Using multiple process groups with the NCCL backend concurrently is not safe, and the user should perform explicit synchronization in their application to ensure only one process group is used at a time. This means collectives from one process group should have completed execution on the device (not just enqueued, since CUDA execution is async) before collectives from another process group are enqueued. See Using multiple NCCL communicators concurrently for more details.
DeviceMesh is a higher-level abstraction that manages process groups (or NCCL communicators). It allows users to easily create inter-node and intra-node process groups without worrying about how to set up the ranks correctly for different sub process groups, and it helps manage those distributed process groups. The init_device_mesh() function can be used to create a new DeviceMesh, with a mesh shape describing the device topology.