Quantization

Quantization refers the process of reducing the number of bits that represent a number. In a DL context, weights and activations can be represented using 8-bit integers (INT8) to compress the model size of a trained neural network without any significant loss in model accuracy. INT8 is one kind of quantization. Compared with 32-bit floating point (FP32), using arithmetic with lower precision, such as INT8, to calculate weights and activation requires less memory.

Implementing a quantized model with nGraph

To implement a quantized model with nGraph, provide a partially (or fully) quantized model (where the convolution layer in the model is replaced with a quantized convolution, for example) to the nGraph Library along with quantized parameters: weights, activations, scale, and zero point.

Note

As of version 0.28, only quantization for inference is supported.

nGraph Quantized Operators (Ops)

nGraph uses scale and zero point (also used by ONNX) to map real values to quantized values. All quantized ops use scale and zero point and can be used just like any other nGraph op.

Scale: the quantization scale of the tensor

Zero point: the zero point of the tensor

Round mode: used in combination with scale and zero point to round real values to quantized values

Quantization Ops

Op

Description

Quantize

Maps real values (r) to quantized values (q) using scale (s), zero point (z), and round mode; produces a quantized tensor.

Dequantize

Maps quantized values (q) to real values (r) using scale (s) and zero point (z); converts a quantized tensor to a floating-point tensor.

FakeQuantize

Performs element-wise linear quantization.

QuantizedConvolution

Performs 8-bit convolution.

QuantizedDot

Performs 8-bit dot.

Some frameworks such as TensorFlow* have fused ops. nGraph provides optional operations to help users easily translate (map) any quantized model created from frameworks with fused ops to nGraph. Unlike builders, experimental ops take scale and zero point instead of min and max.

Experimental Quantized Ops (optional)

Operator

Description

QuantizedConvolutionBias

This experimental op can be fused with a ReLU op.

QuantizedConvolutionBiasAdd

This experimental op constructs a quantized convolution with bias and optional ReLU. And then takes input for the add operation.

QuantizedConvolutionBiasSignedAdd

Same as QuantizedConvolutionBiasAdd but with signed add.

QuantizedConvolutionRelu

This experimental op is designed for a particular use case that would require convolution and ReLU to be combined.

QuantizedDotBias

This experimental op can be fused with a ReLU op.

nGraph Quantization Design

The goal of nGraph quantization is to flexibly support a wide variety of frameworks and users. The use of scale and zero point as well as quantized builders in the nGraph design helps to achieve this goal.

Scale and Zero Point

Using scale and zero point allows nGraph to be framework agnostic (i.e., it can equally support all deep learning frameworks). nGraph Bridges will automatically convert min and max (provided by a DL framework) to scale and zero point as needed. Quantized builders are available to help the bridges perform this calculation. However, if users are directly using nGraph (and not using a bridge), they are required to provide scale and zero point for quantized ops.

Another advantage of using scale and zero point to express quantization parameters is that users can flexibly implement quantized ops into various nGraph backends. When implementing quantized ops, all current nGraph backends will directly use scale and zero point (and not min and max) to perform the quantized computation.

Quantized Builders

Quantized builders are helper utilities to assist framework integrators to enable quantized models with nGraph. They serve as an API (interface) between framework bridges and nGraph, allowing framework bridges to directly construct ops in the nGraph Abstraction Layer.

Quantized builders help nGraph framework bridges by:

  • Breaking down a fused quantized operator in the framework to a subgraph (of quantized and non-quantized operators) in the nGraph core IR

  • Converting from min and max to scale and zero point based on the quantization mode described by the DL framework

Note

Fused ops and quantized builders serve the same purpose. In the future, fused ops will replace quantized builders.

nGraph Quantized Builders

Category

Builder

Description

Scaled Mode Min / Max Builders

ScaledQuantize

Converts min and max to scale and zero point using a scaled mode calculation and then constructs and returns an nGraph Quantize operator.

ScaledDequantize

Converts min and max to scale and zero point using a scaled mode calculation and then constructs and returns an nGraph Dequantize operator.

Quantized Convolution and Variants

ScaledQuantizedConvolution

Constructs a quantized convolution with an optional ReLU.

ScaledQuantizedConvolutionBias

Constructs a quantized convolution with bias and an optional ReLU.

ScaledQuantizedConvolutionBiasAdd

Constructs a quantized convolution with bias and an optional ReLU, where the output is added to the output of another convolution (sum_input).

Quantized Dot (Matmul) and Variants

ScaledQuantizedDot

Constructs a quantized dot (Matmul) with an optional ReLU.

ScaledQuantizedDotBias

Constructs a quantized dot (Matmul) with bias and an optional ReLU.

Quantized Concat

ScaledQuantizedConcat

Constructs a quantized concatenation.