Quantization

Solving Machine Learning Performance Anti-Patterns: a Systematic Approach

June 24, 2021

Nsight Systems, NVTX, Optimization, TensorRT, TorchScript, Quantization

This article is a high-level introduction to an efficient workflow for optimizing the runtime performance of machine learning systems running on GPUs. Using Nsight Systems traces from real production scenarios, I introduce a set of common utilization patterns and outline effective approaches to improving performance.

Object Detection at 2530 FPS with TensorRT and 8-Bit Quantization

December 31, 2020

Visual Analytics, Machine Learning Productionization

SSD300, PyTorch, Object Detection, Optimization, TensorRT, Quantization, ONNX, Nsight Systems

This article is a deep dive into the techniques needed to get SSD300 object detection throughput to 2530 FPS. We will rewrite PyTorch model code, perform ONNX graph surgery, optimize a TensorRT plugin, and finally quantize the model to an 8-bit representation. We will also examine how far the quantized model's accuracy diverges from that of the full-precision model.
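To give a rough idea of what quantizing to an 8-bit representation involves, here is a minimal pure-Python sketch of symmetric per-tensor quantization. It is an illustration under simple assumptions, not the calibration procedure TensorRT uses or the method from the article; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to signed 8-bit integers.

    Assumption: a single scale is derived from the largest absolute value,
    which is the simplest possible choice and not a calibrated one.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 1.0, -0.25]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Divergence from the full-precision values, measured as the worst-case error.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("quantized:", q)
print("scale:", scale)
print("max abs error:", max_error)
```

With round-to-nearest, the error for any in-range value is bounded by half a quantization step (`scale / 2`). That is why the choice of scale matters: small values such as `0.003` collapse to zero when one large weight dominates the range.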