One of the most popular optimization techniques is quantization.
Running a machine learning model for inference on mobile or embedded devices comes with challenges: resources such as memory, power, and data storage are limited.
It is therefore critical to deploy optimized machine learning models on these edge devices so that they run efficiently. Quantization is one such optimization technique. In the last article, we saw how to use the TFLite Converter to optimize a model for edge devices without modifying the weight and activation types.
What is Quantization?
The term quantization comes from mathematics and digital signal processing. Below is the Wikipedia definition.
Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set (often a continuous set) to output values in a (countable) smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes.
In deep learning, quantization refers to reducing the number of bits used to represent a number. The dominant numerical format for research and deployment has so far been 32-bit floating point (FP32). Quantization converts FP32 weights and output activations to the nearest 8-bit integer, and sometimes to 4, 2, or 1 bit.
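To make the idea concrete, here is a minimal sketch of mapping a few float weights to 8-bit integers and back. It assumes a simple symmetric scheme with one shared scale; the function names and the sample weights are made up for illustration, and this is not how TFLite implements it internally.

```python
def quantize_int8(values):
    """Map floats to int8 codes using one shared scale (symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


weights = [0.82, -1.27, 0.003, 0.5]   # illustrative values only
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
print(codes)     # integer codes, one byte each instead of four
print(restored)  # close to the original weights, but not identical
```

The small gap between `weights` and `restored` is the quantization error, which is the price paid for the smaller representation.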
Quantization optimizes the model by reducing the precision of its weights and activations. TFLite uses quantization to speed up inference on edge devices, and the TFLite converter is how we produce a deep learning model that runs at lower precision. Now that you know what quantization is, let us dive deeper:
Quantization dramatically reduces both the memory requirement and computational cost of using neural networks.
Quantizing a deep learning model relies on techniques that allow reduced-precision representations of weights and, optionally, activations for both storage and computation.
TFLite provides several levels of support for quantization:
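A quick back-of-the-envelope calculation shows where the memory saving comes from. The parameter count below is hypothetical, and the helper only counts raw weight storage, ignoring any file-format overhead.

```python
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "int8": 1}


def weight_storage_mb(num_parameters, dtype):
    """Raw storage needed for the weights alone, in megabytes."""
    return num_parameters * BYTES_PER_VALUE[dtype] / 1e6


num_parameters = 25_000_000  # hypothetical model size, for illustration only
for dtype in ("float32", "float16", "int8"):
    print(dtype, weight_storage_mb(num_parameters, dtype), "MB")
```

Going from FP32 to 8-bit integers cuts the weight storage by a factor of four, and float16 cuts it by half.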
- Post-training quantization
- Quantization aware training.
Below is a table that shows the benefits of model quantization for some CNN models.
As the name implies, post-training quantization is applied after your model has been trained. It quantizes the weights and, optionally, the activations. This technique can reduce the model size and also improve CPU and hardware accelerator latency. There are different options, such as weight-only and full integer quantization, and we can choose among them based on our requirements.
The TensorFlow team provides a decision tree that can help us make this decision.
The simplest form of post-training quantization quantizes only the weights, from floating point to 8 bits of precision. This option is available in the TFLite converter. At inference, the weights are converted from 8 bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency. To improve latency further, use hybrid operators.
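    import tensorflow as tf

    # saved_model_dir: path to the trained SavedModel directory
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Newer TensorFlow releases deprecate OPTIMIZE_FOR_SIZE; it behaves like tf.lite.Optimize.DEFAULT
    converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
    tflite_quant_model = converter.convert()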
At the time of conversion, set the optimizations flag to optimize for model size.
This optimization provides latencies close to fully fixed-point inference. However, the outputs are still stored in floating point.
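The "convert once and cache" behaviour of weight-only quantization can be pictured with a small sketch. This is only a conceptual illustration: the class name, the single per-layer scale, and the sample numbers are assumptions, and TFLite's real kernels are implemented differently.

```python
class WeightOnlyQuantizedLayer:
    """Stores int8 weights; dequantizes them once and reuses the float copy."""

    def __init__(self, int8_weights, scale):
        self.int8_weights = int8_weights
        self.scale = scale
        self._float_cache = None
        self.dequantize_calls = 0  # counts how often the conversion really runs

    def _float_weights(self):
        if self._float_cache is None:
            self.dequantize_calls += 1
            self._float_cache = [w * self.scale for w in self.int8_weights]
        return self._float_cache

    def __call__(self, inputs):
        # Floating-point dot product using the cached float weights
        return sum(x * w for x, w in zip(inputs, self._float_weights()))


layer = WeightOnlyQuantizedLayer([82, -127, 0, 50], scale=0.01)
first = layer([1.0, 1.0, 1.0, 1.0])
second = layer([2.0, 0.0, 0.0, 0.0])
print(first, second, layer.dequantize_calls)
```

The weights are stored in one byte each, yet the arithmetic stays in floating point, and the conversion cost is paid only on the first call.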
Full integer quantization:
We can get further latency improvements, reductions in peak memory usage, and access to integer-only hardware accelerators by making sure all model math is quantized. Full integer quantization requires measuring the dynamic range of activations and inputs, so you need to supply a representative dataset, which you can create with an input data generator.
    import tensorflow as tf

    def representative_dataset_gen():
        # num_calibration_steps: number of sample inputs used for calibration
        for _ in range(num_calibration_steps):
            # Yield one sample input array, wrapped in a list
            yield [input]

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset_gen
    tflite_quant_model = converter.convert()
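To see why the converter needs representative data, here is a rough sketch of range calibration. It is only an illustration under simple assumptions (plain min/max tracking and an unsigned 8-bit grid), not the algorithm TFLite actually runs; `calibrate_range` and `scale_and_zero_point` are hypothetical helper names and the batches are made-up numbers.

```python
def calibrate_range(batches):
    """Track the smallest and largest value seen across calibration batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi


def scale_and_zero_point(lo, hi, num_bits=8):
    """Derive affine parameters that map [lo, hi] onto an unsigned integer grid."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 inside the range
    qmax = 2 ** num_bits - 1
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)      # integer code that represents 0.0
    return scale, zero_point


batches = [[-0.5, 0.2, 1.0], [0.1, 2.05, -1.0]]  # stand-in activation values
lo, hi = calibrate_range(batches)
scale, zero_point = scale_and_zero_point(lo, hi)
print(lo, hi, scale, zero_point)
```

The more closely the calibration samples resemble real inputs, the better the measured range fits what the model sees in production.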
The resulting model should be fully quantized, but any ops that do not have a quantized implementation are left in floating point. Full integer-only execution yields a model with even lower latency, a smaller size, and compatibility with integer-only accelerators.
You can enforce full integer quantization for all ops, and use integer input and output, by adding the following lines before you convert.
The converter throws an error if it encounters an operation it cannot currently quantize.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.uint8
    converter.inference_output_type = tf.uint8
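With integer input and output, the application has to convert real values to uint8 itself. Below is a minimal sketch, assuming the usual affine relation real = scale * (q - zero_point). The scale and zero point values are made up for illustration (in practice they would be read from the converted model), and both function names are hypothetical.

```python
def quantize_input(x, scale, zero_point):
    """Convert a real value to a uint8 code, clamping to [0, 255]."""
    q = round(x / scale) + zero_point
    return max(0, min(255, q))


def dequantize_output(q, scale, zero_point):
    """Convert a uint8 code back to an approximate real value."""
    return scale * (q - zero_point)


scale, zero_point = 0.02, 128  # hypothetical quantization parameters

print(quantize_input(1.0, scale, zero_point))      # in-range value
print(quantize_input(-3.0, scale, zero_point))     # below range, clamps to 0
print(dequantize_output(178, scale, zero_point))   # back to a real value
```

Values outside the representable range saturate at 0 or 255, so the parameters need to cover the inputs you expect to feed the model.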
Float 16 Quantization example:
Float16 is the IEEE standard format for 16-bit floating-point numbers. We can reduce the size of a floating-point model by quantizing the weights to float16. This technique halves the model size with minimal loss of accuracy compared to other techniques. When running on the CPU, the model "dequantizes" the weight values to float32.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # tf.float16 is the TensorFlow 2.x spelling; older releases used tf.lite.constants.FLOAT16
    converter.target_spec.supported_types = [tf.float16]
    tflite_quant_model = converter.convert()
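The effect of storing weights as float16 can be checked with nothing but Python's standard `struct` module, whose `e` format code is IEEE half precision. This is just an illustration with made-up weights; `to_float16_and_back` is a hypothetical helper, not part of TFLite.

```python
import struct


def to_float16_and_back(x):
    """Round a Python float to IEEE half precision and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weights = [0.1, -0.75, 3.14159]  # illustrative values only
half_weights = [to_float16_and_back(w) for w in weights]

# Same weights packed as float32 ("f") and as float16 ("e")
bytes_fp32 = len(struct.pack(f"<{len(weights)}f", *weights))
bytes_fp16 = len(struct.pack(f"<{len(weights)}e", *weights))

print(half_weights)            # slightly rounded versions of the originals
print(bytes_fp32, bytes_fp16)  # float16 needs half the bytes
```

Values that are exactly representable in half precision survive unchanged, while the others pick up a small rounding error.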
We have now seen the different post-training quantization techniques. Float16 quantization may not be a good choice if you need maximum performance; full integer quantization to fixed-point math would be better in that case. Weight quantization is the most basic form of quantization. Since the weights are quantized after training, there could be an accuracy loss, particularly for smaller networks.
Quantization aware Training:
Post-training quantization can cost some accuracy. If you do not want to compromise model accuracy, use quantization-aware training instead. As we have learned, post-training quantization is applied after the model has been trained; quantization-aware training overcomes its drawbacks by making quantization part of training. This technique ensures that the forward pass matches precision for both training and inference. TensorFlow provides a flow in which, while constructing the graph, you insert fake quantization nodes in each layer. These nodes simulate the effect of quantization in the forward and backward passes and learn the ranges during training, for each layer separately.
There are two aspects of this technique:
- Operator fusion at inference time is accurately modeled at training time.
- Quantization effects at inference are modeled at training time.
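What a fake quantization node does in the forward pass can be sketched as "quantize, then immediately dequantize", so the value stays a float but carries the rounding error that inference will see. This is a simplified illustration with a fixed range; `fake_quantize` is a hypothetical name, and range learning and the backward pass are left out.

```python
def fake_quantize(x, range_min, range_max, num_bits=8):
    """Simulate quantization in floating point: quantize, then dequantize."""
    levels = 2 ** num_bits - 1
    scale = (range_max - range_min) / levels
    clamped = max(range_min, min(range_max, x))   # saturate outside the range
    q = round((clamped - range_min) / scale)      # integer grid position
    return range_min + q * scale                  # back to a float value


print(fake_quantize(0.3, -1.0, 1.0))   # close to 0.3, snapped to the 8-bit grid
print(fake_quantize(5.0, -1.0, 1.0))   # out of range, saturates near 1.0
```

Because the training loss is computed on these snapped values, the network can adapt its weights to the precision it will have at inference time.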
The underlying quantization operation in TensorFlow is tf.quantization.quantize:

    tf.quantization.quantize(
        input, min_range, max_range, T, mode='MIN_COMBINED',
        round_mode='HALF_AWAY_FROM_ZERO', name=None
    )

In the default MIN_COMBINED mode, each output value is computed as:

    out[i] = (in[i] - min_range) * range(T) / (max_range - min_range)
    if T == qint8: out[i] -= (range(T) + 1) / 2.0

In MIN_FIRST mode, the computation is:

    num_discrete_values = 1 << (# of bits in T)
    range_adjust = num_discrete_values / (num_discrete_values - 1)
    range = (range_max - range_min) * range_adjust
    range_scale = num_discrete_values / range
    quantized = round(input * range_scale) - round(range_min * range_scale) + numeric_limits<T>::min()
    quantized = max(quantized, numeric_limits<T>::min())
    quantized = min(quantized, numeric_limits<T>::max())
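The MIN_COMBINED formula can be tried out in plain Python. The sketch below follows the formula as written, with two assumptions of my own: inputs are clamped to [min_range, max_range] first, and the result is rounded to the nearest integer. The function name and the sample range are hypothetical, and this is not the TensorFlow implementation.

```python
def quantize_min_combined(values, min_range, max_range, dtype="quint8"):
    """Apply the MIN_COMBINED formula to a list of floats."""
    limits = {"quint8": (0, 255), "qint8": (-128, 127)}
    t_min, t_max = limits[dtype]
    range_t = t_max - t_min  # range(T) in the formula, 255 for 8-bit types
    out = []
    for v in values:
        v = max(min_range, min(max_range, v))  # assumption: clamp first
        q = (v - min_range) * range_t / (max_range - min_range)
        if dtype == "qint8":
            q -= (range_t + 1) / 2.0           # shift into the signed range
        out.append(round(q))                   # assumption: round to nearest
    return out


values = [0.0, 2.0, 6.0]
print(quantize_min_combined(values, 0.0, 6.0, "quint8"))
print(quantize_min_combined(values, 0.0, 6.0, "qint8"))
```

With a float range of [0.0, 6.0], each value is scaled by 255/6, and the signed variant additionally subtracts 128.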