Autocast Simulation Software Free Download

Carmen Kalua

Aug 4, 2024, 6:39:01 PM

to taimititer

The latest version, AutoCAST-X1, incorporates many innovations, including part thickness analysis, multi-cavity mould layout, multi-neck feeder design, the vector element method for solidification simulation, automatic optimisation of feeders and the gating system, tooling cost estimation and methods report generation. The database includes all major metal families and casting processes, and can be customised to accommodate new alloys and mould materials.

The software has a 20-year history of technology development at the Indian Institute of Technology Bombay, supported by several Ph.D. and Masters-level researchers. It is maintained and supported by 3D Foundry Tech, a company incubated at the institute. The software is used in industry (especially SME foundries) for quality and yield improvement, and in academic institutes for simulation labs.

I was wondering whether, for automatic mixed precision, PyTorch benefits from or requires the input to be in the more precise format (e.g. float32) rather than the smaller one (e.g. float16). The reason I am asking is that storing the input to my model is very expensive, and if I could downcast the input to float16 and operate on that, I would. Of course, I could just run the whole model in fp16, but that might give less accuracy, which is why I am interested in automatic mixed precision.

Therefore, does PyTorch downcast the input to fp16 before applying e.g. a linear layer to it inside torch.autocast()? If it already does that, could I store the fp16 representation myself and use automatic mixed precision to still get fp32 accumulation and train as if the input were fp32? Otherwise, does PyTorch require the input to be fp32, or would it benefit from it (in terms of performance, perhaps in backpropagation)?
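One way to reason about the question above, as a rough sketch rather than a statement about PyTorch internals: if the first operation applied to the input is one that autocast runs in float16 (a linear layer on CUDA is such an op), then rounding the input to float16 ahead of time and rounding it at run time give the same values, because rounding to float16 is idempotent. The snippet below only illustrates that property with the standard library; the helper `to_fp16` and the sample values are made up for illustration, and Python floats stand in for float32.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest float16 value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Illustrative input values (hypothetical data).
inputs_full_precision = [0.1, 3.14159, 1234.567, -0.000123]

# Path 1: keep the full-precision input and cast it when the layer runs.
cast_at_runtime = [to_fp16(x) for x in inputs_full_precision]

# Path 2: store float16 up front; casting again later changes nothing.
stored_fp16 = [to_fp16(x) for x in inputs_full_precision]
cast_again = [to_fp16(x) for x in stored_fp16]

same_values = cast_at_runtime == cast_again

# The information lost by storing float16 is the rounding error of that cast.
rounding_errors = [abs(a - b) for a, b in zip(inputs_full_precision, stored_fp16)]
```

This says nothing about inputs that first pass through an operation autocast keeps in float32 (for example some normalisation or preprocessing steps); in that case the float32 input could still carry extra information.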

You used to be able to disable these effects, because they acted like "set autocast of X to level Y proc rate Z" when equipped, and "set autocast of X to level Y proc rate 0" when taken off - so if you put on two things that did the same skill, then took off one of them, the autocast on the first didn't work until you changed maps. This trick doesn't work anymore though (relatively recent change).

I suspect that the rest of the autocast while attacking mechanics were unchanged by this update - and multiple autocasts of the same skill do not stack - however, I don't feel like I can say that with 100% certainty unless someone has tested it after the patch where they changed it so you can't bug/disable autocasts.

Whether you have one of the items that do this, or two, or two hundred - As long as they autocast the same skill while doing the same thing, they do not stack, they do not give multiple chances of activation. That is absolutely, 100%, how autocast while attacking effects have always worked.

HOWEVER this is based on data from before a recent patch to a related aspect of autocast mechanics, and as far as I know, nobody has verified that it still works the same way - so do not take the above statement as the word of god, until someone verifies it.

This is a set of easy-to-use tools that implement the techniques described in Gary King, Michael Tomz, and Jason Wittenberg's "Making the Most of Statistical Analyses: Improving Interpretation and Presentation". Winner of the Okidata Best Research Software Award. These tools use Monte Carlo simulations to compute interpretable quantities from regression models and perform inference on them.

The H100 GPU introduced support for a new datatype, FP8 (8-bit floating point), enabling higher throughput of matrix multiplies and convolutions. In this example we will introduce the FP8 datatype and show how to use it with Transformer Engine.

When training neural networks, both FP8 types (E4M3 and E5M2) may be utilized. Typically forward activations and weights require more precision, so the E4M3 datatype is best used during the forward pass. In the backward pass, however, gradients flowing through the network are typically less susceptible to loss of precision, but require a higher dynamic range. Therefore they are best stored using the E5M2 data format. H100 TensorCores support any combination of these types as inputs, enabling us to store each tensor using its preferred precision.
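To make the precision versus range trade-off concrete, here is a small sketch that derives the representable limits of the two formats from their bit layouts. It assumes the commonly documented encodings: E5M2 follows IEEE conventions (the top exponent field is reserved for infinities and NaNs), while E4M3 uses the top exponent field for ordinary numbers and reserves only the all-ones mantissa there for NaN. The function name `fp8_limits` is made up for this illustration.

```python
def fp8_limits(exp_bits: int, man_bits: int, ieee_like: bool):
    """Return (largest finite, smallest normal, smallest subnormal) magnitudes."""
    bias = 2 ** (exp_bits - 1) - 1
    top_exp_field = 2 ** exp_bits - 1
    if ieee_like:
        # Top exponent field is reserved for inf/NaN, so the largest usable
        # exponent field is one below it, with an all-ones mantissa.
        max_exp = (top_exp_field - 1) - bias
        max_significand = 2 - 2 ** -man_bits
    else:
        # Top exponent field holds normal numbers; only the all-ones mantissa
        # is NaN, so the largest mantissa has its lowest bit cleared.
        max_exp = top_exp_field - bias
        max_significand = 2 - 2 ** -(man_bits - 1)
    largest = float(max_significand * 2 ** max_exp)
    smallest_normal = 2.0 ** (1 - bias)
    smallest_subnormal = 2.0 ** (1 - bias - man_bits)
    return largest, smallest_normal, smallest_subnormal


e4m3 = fp8_limits(exp_bits=4, man_bits=3, ieee_like=False)
e5m2 = fp8_limits(exp_bits=5, man_bits=2, ieee_like=True)
```

Under these assumptions E4M3 tops out at 448 with three mantissa bits, while E5M2 reaches 57344 with only two, which matches the description of E4M3 as the higher-precision type and E5M2 as the higher-range type.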

Choosing the operations to be performed in FP16 precision requires analysis of the numerical behavior of the outputs with respect to inputs of the operation as well as the expected performance benefit. This enables marking operations like matrix multiplies, convolutions and normalization layers as safe, while leaving norm or exp operations as requiring high precision.
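As a small illustration of why an operation like exp is usually kept in higher precision, the sketch below checks when its output stops fitting into FP16. It relies only on the largest finite FP16 value (65504); the helper name `fits_in_fp16` is made up for this example.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def fits_in_fp16(x: float) -> bool:
    """Conservative check that x does not exceed the finite float16 range."""
    return abs(x) <= FP16_MAX


# exp() leaves the FP16 range for fairly small inputs.
overflow_threshold = math.log(FP16_MAX)

exp_of_11_fits = fits_in_fp16(math.exp(11.0))
exp_of_12_fits = fits_in_fp16(math.exp(12.0))
```

An input only slightly above 11 already overflows, so the output range of exp is easy to exceed even when the inputs themselves are comfortably representable.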

Dynamic loss scaling enables avoiding both over- and underflows of the gradients during training. Those may happen since, while the dynamic range of FP16 is enough to store the distribution of the gradient values, this distribution may be centered around values too high or too low for FP16 to handle. Scaling the loss shifts those distributions (without affecting numerics by using only powers of 2) into the range representable in FP16.
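A minimal sketch of the idea, using only the standard library: a small gradient underflows to zero in FP16, but survives when multiplied by a power-of-two scale, and the scale itself is halved on overflow and doubled after a run of clean steps. The class `DynamicLossScaler`, its default values, and the helper `to_fp16` are illustrative assumptions, not any framework's API.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to float16; values beyond the finite range become +/-inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


class DynamicLossScaler:
    """Hypothetical scaler: back off on overflow, grow after clean steps."""

    def __init__(self, init_scale: float = 2.0 ** 16, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, scaled_grads) -> bool:
        """Return True if the optimizer step should be applied."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= 2.0          # overflow: shrink the scale, skip the step
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            self.scale *= 2.0          # long clean run: try a larger scale
            self._clean_steps = 0
        return True


tiny_grad = 1e-8
unscaled = to_fp16(tiny_grad)                    # lost to underflow
scaler = DynamicLossScaler()
scaled = to_fp16(tiny_grad * scaler.scale)       # shifted into the FP16 range
recovered = scaled / scaler.scale                # unscale in higher precision
```

Because the scale is a power of two, multiplying and dividing by it only changes the exponent, so the only error in `recovered` comes from the FP16 rounding itself.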

While the dynamic range provided by the FP8 types is sufficient to store any particular activation or gradient, it is not sufficient for all of them at the same time. This makes the single loss scaling factor strategy, which worked for FP16, infeasible for FP8 training and instead requires using distinct scaling factors for each FP8 tensor.
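The following sketch illustrates the point with made-up numbers: two tensors whose largest magnitudes are far apart cannot share one scaling factor in an 8-bit format, whereas a per-tensor factor maps each tensor's peak to the top of the representable range. For simplicity both tensors use the E4M3 limits (maximum 448, smallest subnormal 2^-9); the amax values and the helper `scale_for` are illustrative assumptions.

```python
import math

E4M3_MAX = 448.0
E4M3_MIN_SUBNORMAL = 2.0 ** -9


def scale_for(amax: float, fp8_max: float = E4M3_MAX) -> float:
    """Scaling factor that maps a tensor's largest magnitude onto fp8_max."""
    return fp8_max / amax


# Hypothetical peak magnitudes of two tensors in the same network.
amax_by_tensor = {"activations": 1.0e3, "gradients": 1.0e-5}

# One shared factor has to be chosen so that the largest tensor does not overflow.
shared_scale = scale_for(max(amax_by_tensor.values()))
grad_peak_shared = amax_by_tensor["gradients"] * shared_scale

# Anything below half the smallest subnormal rounds to zero.
lost_with_shared_scale = grad_peak_shared < E4M3_MIN_SUBNORMAL / 2

# A distinct factor per tensor places every peak at the top of the range.
per_tensor_scale = {name: scale_for(amax) for name, amax in amax_by_tensor.items()}
scaled_peaks = {name: amax_by_tensor[name] * per_tensor_scale[name] for name in amax_by_tensor}
```

With the shared factor even the largest gradient value is flushed to zero, while the per-tensor factors keep both tensors inside the representable range.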

just-in-time scaling. This strategy chooses the scaling factor based on the maximum of absolute values (amax) of the tensor being produced. In practice it is infeasible, as it requires multiple passes through data - the operator produces and writes out the output in higher precision, then the maximum absolute value of the output is found and applied to all values in order to obtain the final FP8 output. This results in a lot of overhead, severely diminishing gains from using FP8.

delayed scaling. This strategy chooses the scaling factor based on the maximums of absolute values seen in some number of previous iterations. This enables full performance of FP8 computation, but requires storing the history of maximums as additional parameters of the FP8 operators.

Figure 3: Delayed scaling strategy. The FP8 operator uses scaling factor obtained using the history of amaxes (maximums of absolute values) seen in some number of previous iterations and produces both the FP8 output and the current amax, which gets stored in the history.

As one can see in Figure 3, the delayed scaling strategy requires both storing the history of amaxes and choosing a recipe for converting that history into the scaling factor used in the next iteration.
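The loop described above can be sketched in plain Python. This is a simplified illustration, not Transformer Engine's implementation: the class `DelayedScaler` is hypothetical, clamping stands in for the actual FP8 cast (no mantissa rounding is modelled), and taking the maximum over the history, divided down by an optional power-of-two margin, is just one possible recipe.

```python
from collections import deque


class DelayedScaler:
    """Hypothetical FP8 operator state: amax history plus the current scale."""

    def __init__(self, fp8_max: float = 448.0, history_len: int = 16, margin: int = 0):
        self.fp8_max = fp8_max
        self.margin = margin
        self.amax_history = deque(maxlen=history_len)  # oldest entries fall off
        self.scale = 1.0

    def quantize(self, values):
        """Scale with the factor from previous iterations, in a single pass."""
        scaled = [max(-self.fp8_max, min(self.fp8_max, v * self.scale)) for v in values]
        current_amax = max(abs(v) for v in values)
        # Record the amax seen now; it only influences later iterations.
        self.amax_history.append(current_amax)
        self._update_scale()
        return scaled, current_amax

    def _update_scale(self):
        amax = max(self.amax_history)
        if amax > 0:
            self.scale = self.fp8_max / amax / (2 ** self.margin)


scaler = DelayedScaler(history_len=2)
out1, amax1 = scaler.quantize([1.0, -4.0])   # uses the initial scale of 1.0
out2, amax2 = scaler.quantize([0.5, 2.0])    # uses the scale derived from amax1
```

The output of each call never depends on the amax of the tensor being produced, which is what removes the extra pass over the data that just-in-time scaling needs.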

The DelayedScaling recipe from the transformer_engine.common.recipe module stores all of the options required for FP8 training: the length of the amax history to use for scaling factor computation, the FP8 data format, etc.

Not every operation is safe to perform in FP8. All of the modules provided by the Transformer Engine library were designed to provide the maximum performance benefit from the FP8 datatype while maintaining accuracy. To enable FP8 operations, TE modules need to be wrapped inside the fp8_autocast context manager.

Support for FP8 in the Linear layer of Transformer Engine is currently limited to tensors with shapes where both dimensions are divisible by 16. In terms of the input to the full Transformer network, this typically requires padding the sequence length to a multiple of 16.
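A small helper for the padding step might look like the sketch below. The function names and the choice of 0 as the padding token id are assumptions made for illustration; a real model would use its own pad id and an attention mask for the padded positions.

```python
def pad_to_multiple(length: int, multiple: int = 16) -> int:
    """Round length up to the next multiple (ceiling division, then scale back)."""
    return -(-length // multiple) * multiple


def pad_sequence(token_ids, pad_id: int = 0, multiple: int = 16):
    """Return a copy of token_ids padded on the right to a multiple of `multiple`."""
    target = pad_to_multiple(len(token_ids), multiple)
    return list(token_ids) + [pad_id] * (target - len(token_ids))


padded = pad_sequence(list(range(1, 22)))  # 21 tokens become 32
```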

When a model is run inside the fp8_autocast region, especially in multi-GPU training, some communication is required in order to synchronize the scaling factors and amax history. To perform that communication without introducing much overhead, the fp8_autocast context manager aggregates the tensors before performing the communication.

Due to this aggregation the backward call needs to happen outside of the fp8_autocast context manager. It has no impact on the computation precision - the precision of the backward pass is determined by the precision of the forward pass.

That happens because in the FP8 case both the input and weights are cast to FP8 before the computation. We can see this if instead of the original inputs we use the inputs representable in FP8 (using a function defined in quickstart_utils.py):