NVTX

Solving Machine Learning Performance Anti-Patterns: a Systematic Approach

June 24, 2021
Machine Learning Productionization
Nsight Systems, NVTX, Optimization, TensorRT, TorchScript, Quantization

This article is a high-level introduction to an efficient workflow for optimizing the runtime performance of machine learning systems running on the GPU. Using traces from Nsight Systems to show real production scenarios, I introduce a set of common utilization patterns and outline effective approaches to improve performance.

Object Detection at 1840 FPS with TorchScript, TensorRT and DeepStream

October 17, 2020
Visual Analytics, Machine Learning Productionization
SSD300, PyTorch, Object Detection, Optimization, DeepStream, TorchScript, TensorRT, ONNX, NVTX, Nsight Systems

In this article we take the performance of the SSD300 model even further, leaving Python behind and moving towards true production deployment technologies: TorchScript, TensorRT and DeepStream. We also identify and explain several limitations in Nvidia’s DeepStream framework, and then remove them by modifying how the nvinfer element works.

Object Detection from 9 FPS to 650 FPS in 6 Steps

September 30, 2020
Visual Analytics, Machine Learning Productionization
SSD300, PyTorch, Object Detection, GStreamer, NVTX, Optimization, Nsight Systems

Making code run fast on GPUs requires a very different approach from making code run fast on CPUs, because the hardware architecture is fundamentally different. Machine learning engineers of all kinds should care about squeezing performance from their models and hardware, not just for production purposes but also for research and training. In research as in development, a fast iteration loop leads to faster improvement. This article is a practical deep dive into making a specific deep learning model (Nvidia’s SSD300) run fast on a powerful GPU server, but the general principles apply to all GPU programming.


© Paul Bridger 2020