Onnxruntime Optimizer


Emerson Mata

Jul 26, 2024, 3:21:04 AM7/26/24
to naejorheto

While ONNX Runtime automatically applies most optimizations when loading transformer models, some of the latest optimizations have not yet been integrated into ONNX Runtime. These additional optimizations can be applied using the transformer optimization tool to tune models for the best performance. This optimization tool provides an offline capability to optimize transformer models in scenarios where ONNX Runtime does not apply the optimization at load time.

Most optimizations require an exact match of a subgraph. Any layout change in the subgraph might cause some optimizations to not work. Note that different versions of the training or export tool might lead to different graph layouts. It is recommended to use the latest released versions of PyTorch and Transformers.

First, you need to install the onnxruntime or onnxruntime-gpu package for CPU or GPU inference. To use onnxruntime-gpu, it is required to install CUDA and cuDNN and add their bin directories to the PATH environment variable. See the Python installation instructions.

If your BERT model has three inputs (like input_ids, token_type_ids and attention_mask), a script, compare_bert_results.py, can be used to do a quick verification. The tool will generate some fake input data and compare results from both the original and optimized models. If the outputs are all close, it is safe to use the optimized model.
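The parity check boils down to an elementwise tolerance comparison. The sketch below shows the idea with numpy only; the placeholder arrays stand in for the outputs you would get by running both models on the same fake input data:

```python
import numpy as np

def outputs_match(baseline, optimized, rtol=1e-3, atol=1e-4):
    """Return True if every corresponding output pair is elementwise close."""
    return all(np.allclose(b, o, rtol=rtol, atol=atol)
               for b, o in zip(baseline, optimized))

# Placeholder outputs; in practice these come from session.run() on the
# original and optimized models with identical inputs.
base = [np.array([0.1234, -0.5678], dtype=np.float32)]
opt = [np.array([0.1235, -0.5677], dtype=np.float32)]
print(outputs_match(base, opt))  # tiny drift from fusions stays within tolerance
```

The tolerances are a judgment call: fp16 or fused kernels legitimately perturb low-order bits, so exact equality is the wrong test.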

The first command will generate ONNX models (both before and after optimization), but not run performance tests since the batch size is 0. The other three commands will run a performance test on each of three engines: ONNX Runtime, PyTorch, and PyTorch+TorchScript.

If your GPU (like V100 or T4) has Tensor Cores, you can append -p fp16 to the above commands to enable mixed precision. In some decoder-only (e.g., GPT-2) generative models, you can enable strict mode for the SkipLayerNormalization op on the CUDA EP to achieve better accuracy. However, performance will drop a bit.

ONNX is an open graph format to represent machine learning models. ONNX Runtime is a cross-platform machine-learning model accelerator, with a flexible interface to integrate hardware-specific libraries.

OnnxModelOptimizer optimizes an ONNX model by fusing nodes. Fusing nodes involves merging multiple nodes in a model into a single node to reduce the computational cost and improve the performance of the model. The optimization process involves analyzing the structure of the ONNX model and identifying nodes that can be fused.

While ONNX Runtime automatically applies most optimizations when loading transformer models, some of the latest optimizations have not yet been integrated into ONNX Runtime. OrtTransformersOptimization provides an offline capability to optimize transformer models in scenarios where ONNX Runtime does not apply the optimization at load time. These optimizations are provided by onnxruntime through onnxruntime.transformers. Please refer to the corresponding documentation for more details on the optimizations done by this tool.
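In an Olive workflow config, the pass might be declared as in the sketch below (the model_type, num_heads, and hidden_size values are model-specific assumptions here, shown for a BERT-base-like model):

```json
"passes": {
    "transformers_optimization": {
        "type": "OrtTransformersOptimization",
        "model_type": "bert",
        "num_heads": 12,
        "hidden_size": 768,
        "float16": false
    }
}
```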

AppendPrePostProcessingOps also supports pre/post-processing ops by leveraging the onnxruntime-extensions steps and PrePostProcessor. You can refer to here to see how to leverage PrePostProcessor to customize pre- and post-processing ops.

The tool_command_args will be used to describe the input parameters to create the PrePostProcessor instance. It is a list of PrePostProcessorInput. The name is the tensor name. The data_type and shape will be used to create the tensor type. The shape can be a list of integers or a list of strings.
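A sketch of one such input description (the tensor name and the symbolic dimension are placeholders for an image-preprocessing scenario):

```json
"tool_command_args": [
    {
        "name": "image",
        "data_type": "uint8",
        "shape": ["num_bytes"]
    }
]
```

Using a string like "num_bytes" in shape declares a symbolic (dynamic) dimension, while an integer pins it to a fixed size.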

Users who write their own pre/post-processing steps need to know whether each step uses operators that are supported built-in or supported in onnxruntime-extensions. For example, some ops like ConvertImageToBGR, which require other extensions, may be incompatible with ort-web; users need to exclude such ops to generate proper models.

Quantization is a technique to compress deep learning models by reducing the precision of the model weights from 32 bits to 8 bits. This technique is used to reduce the memory footprint and improve the inference performance of the model. Quantization can be applied to the weights of the model, the activations of the model, or both.

Dynamic Quantization: Dynamic quantization calculates the quantization parameters (scale and zero point) for activations dynamically, which means there is no requirement for a calibration dataset.

Static Quantization: The static quantization method runs the model using a set of inputs called calibration data. In this way, the user must provide a calibration dataset to calculate the quantization parameters (scale and zero point) for activations before quantizing the model.

Olive consolidates dynamic and static quantization into a single pass called OnnxQuantization, and provides the user with the ability to tune both quantization methods and hyperparameters at the same time. If the user desires to tune only dynamic or static quantization, Olive also supports them through OnnxDynamicQuantization and OnnxStaticQuantization respectively.
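A sketch of the consolidated pass in an Olive config (the pass key and the data_config name are placeholders; the exact field names should be checked against the Olive pass reference):

```json
"passes": {
    "quantization": {
        "type": "OnnxQuantization",
        "quant_mode": "static",
        "data_config": "calibration_data"
    }
}
```

Dropping quant_mode lets the search explore both dynamic and static variants; pinning it restricts tuning to one method.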

Note: If the target execution provider is QNN EP, the model might need to be preprocessed before quantization. This preprocessing step fuses operators unsupported by QNN EP and inserts necessary operators to make the model compatible with QNN EP. Please refer to QnnPreprocess for more details about the pass and its config parameters.

Olive consolidates the Intel Neural Compressor dynamic and static quantization into a single pass called IncQuantization, and provides the user with the ability to tune both quantization methods and hyperparameters at the same time. If the user desires to tune only dynamic or static quantization, Olive also supports them through IncDynamicQuantization and IncStaticQuantization respectively.

ONNX Runtime provides high performance across a range of hardware options through its Execution Providers interface for different execution environments. For each model running with each execution provider, there are settings that can be tuned (e.g. number of threads, execution mode, etc.) to improve performance. OrtPerfTuning covers basic knobs that can be leveraged to find the best performance for your model and hardware.

Converting a model to use Float16 instead of Float32 can decrease the model size and improve performance on some GPUs. The OnnxFloatToFloat16 pass uses the float16 converter from onnxruntime to convert the model to float16, which converts most nodes/operators to use Float16 instead of Float32.

Conversion to Float16 is often exposed at multiple stages of optimization, including model conversion and transformer optimization. This stand-alone pass is best suited for models that are not transformer architectures, where fusions may rely on specific data types in node patterns.

If float16 conversion is giving poor results, you can convert most of the ops to float16 but leave some in float32. The OrtMixedPrecision pass finds a minimal set of ops to skip while retaining a certain level of accuracy.

Note: input_dim and dim_value should have the same length, and input_name and input_shape should have the same length. Also, the input_dim & dim_value pair and the input_name & input_shape pair are exclusive to each other; the user cannot specify both pairs at the same time.
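A sketch of one of the two option styles in a pass config (the pass name DynamicToFixedShape is an assumption based on Olive's shape-fixing pass, and the input name and shape are placeholders):

```json
"passes": {
    "to_fixed_shape": {
        "type": "DynamicToFixedShape",
        "input_name": ["input_ids"],
        "input_shape": [[1, 128]]
    }
}
```

The other style would instead name a symbolic dimension via the dim/value pair; per the note above, only one of the two styles may appear.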

LoRA, QLoRA and related techniques allow us to fine-tune a pre-trained model by adding a small number of trainable matrices called adapters. The same base model can be used for multiple tasks by adding different adapters for each task. To support using multiple adapters with the same optimized ONNX model, the ExtractAdapters pass extracts the adapter weights from the model and saves them to a separate file. The model graph is then modified in one of the following ways:

Adapter weights are set as external tensors pointing to a non-existent file. The ONNX model is thus invalid by itself, as it cannot be loaded. In order to create an inference session using this model, the adapter weights must be added to a session options object using add_initializer or add_external_initializers.

Olive also provides a command line tool to export adapters saved after peft fine-tuning to a format compatible with a model that has been optimized with the ExtractAdapters pass. More details on the olive export-adapters command can be found at Command Line Tools.

When the computational graph is loaded, i.e. when you create an InferenceSession, onnxruntime allocates memory for all tensors needed to execute the model. If the model was exported with dynamic inputs, onnxruntime does not yet know how much memory to reserve for all of the input, intermediate, and output tensors. So it initially just loads the graph, and then does the tensor allocations during the first run call.

However, if you run the model again with inputs bigger than the previous call, it needs to do the whole previous step again to accommodate the larger memory requirement. So what you're seeing in the last block is that every time random.randint hits a number larger than the previous maximum, onnxruntime has to reallocate a lot of GPU memory. This goes on until randint outputs one thousand; larger inputs are then no longer possible, and thus all subsequent calls run smoothly.

If you have been keeping up with the fast-moving world of AI, you surely know that in recent years Transformer models have taken over the state-of-the-art in many vital tasks on NLP, Computer Vision, time series analysis, and much more.

Although ubiquitous, many of these models are very big and slow at inference time, requiring billions of operations to process inputs. Ultimately, this affects user experience and increases infrastructure costs as heaps of memory and computing power are needed to run these models. Even our planet is affected by this, as more computing power demands more energy, which could translate to more pollution.

The answer to these problems is optimization. This blog post presents a step-by-step guide with overviews of various techniques, plus a notebook hosted on Google Colab so you can easily reproduce it on your own!
