Easy Quantization in PyTorch Using Fine-Grained FX

Story At a Glance

In a previous blog, we introduced Intel® Neural Compressor, an open-source Python* library for model compression.
In this blog, we illustrate one of its new features, which makes it easy to quantize a broader range of models.
Hugging Face is a community and data science platform where data scientists, researchers, and machine learning engineers can collaborate.
Optimum Intel is the interface between the Hugging Face Transformers library and the different Intel-optimized tools and libraries purpose-built to accelerate end-to-end pipelines on Intel® architectures.

Xin He
Hardware Design Engineer

Haihao Shen
AI Frameworks Engineer

Feng Tian
AI Frameworks Engineer

Intel Corporation

The Issue

PyTorch provides an FX toolkit for developers to transform a torch.nn.Module into a torch.fx.GraphModule. With the generated GraphModule, FX can execute static quantization by automatically inserting quantize and dequantize operations.

Converting an imperative model into a graph model is useful because the graph form enables optimizations, such as post-training static quantization, that improve performance.

However, FX cannot handle dynamic control flow automatically, and this blocks the transformation of many models.
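The limitation can be reproduced with a small sketch. The two toy modules and the `is_traceable` helper below are hypothetical names introduced for illustration; only `torch.fx.symbolic_trace` comes from PyTorch. The first module has no data-dependent branching and traces cleanly, while the second branches on a tensor value and makes tracing fail.

```python
import torch
from torch import fx


class StaticBlock(torch.nn.Module):
    """No data-dependent branching, so FX can trace it."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))


class DynamicBlock(torch.nn.Module):
    """Branches on a tensor value, which FX cannot resolve at trace time."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        if x.sum() > 0:  # data-dependent control flow
            return self.linear(x)
        return x


def is_traceable(module):
    """Hypothetical helper: True if torch.fx.symbolic_trace succeeds."""
    try:
        fx.symbolic_trace(module)
        return True
    except Exception:  # FX raises an error when a branch depends on a traced value
        return False


static_ok = is_traceable(StaticBlock())
dynamic_ok = is_traceable(DynamicBlock())
print(static_ok, dynamic_ok)
```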

The Solution: Fine-Grained FX

Fine-grained FX makes quantization easy for models with dynamic control flow. It is integrated into the pytorch_fx backend of Intel Neural Compressor and supports three popular quantization methods:

post-training dynamic quantization
post-training static quantization
quantization-aware training

PyTorch recommends post-training dynamic quantization for NLP models because its scales and zero-points, which are computed at run time for each input, give stable accuracy after quantization.

Post-training static quantization uses fixed scales and zero-points. It supports consecutive quantized modules, which avoids redundant quantization and dequantization operations between them.

In theory, static quantization performs better than dynamic quantization. Quantization-aware training for static quantization requires an additional training process that adjusts the model weights to reduce quantization loss. It can provide high accuracy while keeping the performance of static quantization.
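The difference between fixed and run-time scales and zero-points can be illustrated with a minimal pure-Python sketch of 8-bit affine quantization. The function names, the calibration values, and the sample input are all made up for illustration, and this is not the code path PyTorch or Intel Neural Compressor uses. The static case derives its parameters once from calibration data, while the dynamic case derives them from the input itself.

```python
def affine_params(lo, hi, qmin=0, qmax=255):
    """Scale and zero-point mapping the float range [lo, hi] onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers, clamping to the representable range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in qvalues]


def max_error(values, scale, zero_point):
    """Largest round-trip error for the given parameters."""
    restored = dequantize(quantize(values, scale, zero_point), scale, zero_point)
    return max(abs(a - b) for a, b in zip(values, restored))


calibration = [-1.0, 0.5, 2.0]   # made-up calibration activations
runtime_input = [-0.2, 0.1, 0.3]  # made-up input seen at inference time

# Static: parameters are fixed ahead of time from the calibration data.
static_scale, static_zp = affine_params(min(calibration), max(calibration))

# Dynamic: parameters are recomputed from each input at run time.
dynamic_scale, dynamic_zp = affine_params(min(runtime_input), max(runtime_input))

static_err = max_error(runtime_input, static_scale, static_zp)
dynamic_err = max_error(runtime_input, dynamic_scale, dynamic_zp)
print(static_scale, dynamic_scale)
print(static_err, dynamic_err)
```

In this toy case the dynamic scale is smaller because it fits the narrower range of the actual input, at the cost of computing the range on every call.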

Because the imperative model consists of several blocks, fine-grained FX will aggressively and recursively detect these blocks for module transformation.

Two examples are shown below: natural language processing (Figure 1) and object detection (Figure 2).

The darker green blocks are detected as suitable for module transformation because they are the largest blocks without any control flow. We apply the FX toolkit to these blocks and quantize them automatically. Because the processed blocks are reassembled using the original control flow, the resulting model keeps the same behavior and gains performance from INT8.

Figure 1. Fine-grained FX for BERT natural language processing

Figure 2. Fine-grained FX for YOLO-V2 object detection
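The recursive block detection described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the implementation in Intel Neural Compressor: the `trace_largest_blocks` helper and the toy `Gated` model are hypothetical, and only `torch.fx.symbolic_trace` is a real PyTorch call. The helper tries to trace a module as a whole and, when that fails, recurses into its children, so it returns the largest submodules that FX can handle.

```python
import torch
from torch import fx


def trace_largest_blocks(module, prefix=""):
    """Hypothetical helper: map qualified names to traced GraphModules.

    Tries the whole module first; on failure, recurses into its children.
    """
    try:
        return {prefix or "<root>": fx.symbolic_trace(module)}
    except Exception:
        pass  # this level contains control flow FX cannot trace
    blocks = {}
    for name, child in module.named_children():
        child_prefix = f"{prefix}.{name}" if prefix else name
        blocks.update(trace_largest_blocks(child, child_prefix))
    return blocks


class Gated(torch.nn.Module):
    """Toy model whose forward pass branches on a tensor value."""

    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU())
        self.head = torch.nn.Linear(4, 2)

    def forward(self, x):
        h = self.encoder(x)
        if h.sum() > 0:  # dynamic control flow stays in the parent module
            return self.head(h)
        return self.head(-h)


blocks = trace_largest_blocks(Gated())
print(sorted(blocks))
```

For the toy model, the parent cannot be traced, but its `encoder` and `head` children can. Each traced block could then be quantized and placed back under its original name, leaving the parent's `forward` and its branching untouched.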

Adopting Our Solution

Intel® Neural Compressor for NLP

We provide two kinds of examples for natural language processing models based on Hugging Face Transformers. You can easily replace the input model with your own and quantize it with fine-grained FX.

Hugging Face Optimum-Intel

Optimum-Intel is an extension of Transformers that enables the use of popular compression techniques such as quantization and pruning via Intel Neural Compressor. All tasks in Optimum-Intel support fine-grained FX: language modeling, multiple choice, question answering, summarization, text classification, token classification, and translation.

We also uploaded several INT8 models to the Hugging Face model hub. They can be easily initialized and used with Intel Neural Compressor, for example:

from neural_compressor.utils.load_huggingface import OptimizedModel
int8_model = OptimizedModel.from_pretrained(
    'Intel/distilbert-base-uncased-finetuned-sst-2-english-int8-static',
)

Model Name	Approach	INT8 Accuracy (accuracy/F1)	FP32 Accuracy (accuracy/F1)	Relative Loss	INT8 Model Size (MB)	FP32 Model Size (MB)	Compression Ratio
Albert-base-v2 [INT8/FP32]	Post-training static quantization	0.9255	0.9232	-0.249%	25	44.6	1.784
Bart-large [INT8/FP32]	Post-training dynamic quantization	0.9051	0.912	0.757%	547	1556.48	2.845
Bert-base [INT8/FP32]	Post-training static quantization	0.7838	0.7915	0.973%	133	418	3.143
Bert-base [INT8/FP32]	Post-training dynamic quantization	0.8997	0.9042	0.498%	174	418	2.402
Bert-base [INT8/FP32]	Quantization-aware training	0.9142	0.9042	-1.106%	107	418	3.907
Bert-base [INT8/FP32]	Post-training static quantization	0.8997	0.9042	0.498%	120	418	3.483
Camembert-base [INT8/FP32]	Post-training dynamic quantization	0.8843	0.8928	0.952%	180	422	2.344
Distilbert-base [INT8/FP32]	Post-training static quantization	0.9859	0.9882	0.233%	64.5	253	3.922
Distilbert-base [INT8/FP32]	Post-training static quantization	0.9037	0.9106	0.758%	65	255	3.923
Electra-small-discriminator [INT8/FP32]	Post-training static quantization	0.9007	0.8983	-0.267%	14	51.8	3.700
Roberta-base [INT8/FP32]	Post-training static quantization	0.9247	0.9138	-1.193%	121	476	3.934
Xlnet-base [INT8/FP32]	Post-training static quantization	0.8893	0.8897	0.045%	215	448	2.084
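The Relative Loss and Compression Ratio columns appear to be derived from the other four columns. The formulas in the sketch below are inferred from the table values rather than stated in the text, and the function names are made up for illustration. A negative relative loss means the INT8 model scored higher than the FP32 model.

```python
def relative_loss(fp32_metric, int8_metric):
    """Fractional drop from FP32 to INT8; negative means INT8 scored higher."""
    return (fp32_metric - int8_metric) / fp32_metric


def compression_ratio(fp32_size_mb, int8_size_mb):
    """How many times smaller the INT8 model is than the FP32 model."""
    return fp32_size_mb / int8_size_mb


# Bert-base, post-training dynamic quantization row from the table
loss = relative_loss(0.9042, 0.8997)
ratio = compression_ratio(418, 174)
print(f"{loss:.3%} {ratio:.3f}")  # prints "0.498% 2.402"
```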

Future Work

The vision of fine-grained FX is to improve the productivity of PyTorch quantization, especially of the static quantization approach. We are continuously uploading INT8 models to the Hugging Face model hub for quick deployment.

We invite users to:

Try Intel Neural Compressor and Hugging Face Optimum-Intel and share your models on the model hub.
Check out Intel’s other AI Tools and Framework optimizations.
Learn about the unified, open, standards-based oneAPI programming model that forms the foundation of Intel’s AI Software Portfolio.

See Related Content

On-Demand Webinars

Accelerate AI Workloads with Intel® Optimization for PyTorch
Improve IoT Inference with Quantization Techniques
Accelerate AI Inference without Sacrificing Accuracy

Get the Software

Intel® AI Analytics Toolkit
Accelerate end-to-end machine learning and data science pipelines with optimized deep learning frameworks and high-performing Python* libraries.
