Pytorch

PyTorch Memory Tuning

July 20, 2023

This article focuses on minimizing GPU memory footprint for both optimization and inference workloads. Throughput and latency usually get all the attention, but reducing memory consumption without sacrificing model architecture is often just as valuable.

PyTorch Performance Features and How They Interact

April 14, 2023

Machine Learning Productionization

Pytorch, Optimization

PyTorch in 2023 is a complex beast, with many great performance features hidden away. Simple top-N lists are weak content, so I’ve empirically tested the most important PyTorch tuning techniques and settings in all combinations. I’ve benchmarked inference across a handful of different model architectures and sizes, different versions of PyTorch and even different Docker containers.

Object Detection at 2530 FPS with TensorRT and 8-Bit Quantization

December 31, 2020

Visual Analytics, Machine Learning Productionization

SSD300, Pytorch, Object Detection, Optimization, TensorRT, Quantization, ONNX, Nsight Systems

This article is a deep dive into the techniques needed to get SSD300 object detection throughput to 2530 FPS. We will rewrite PyTorch model code, perform ONNX graph surgery, optimize a TensorRT plugin, and finally quantize the model to an 8-bit representation. We will also examine how far the quantized model's accuracy diverges from that of the full-precision model.
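To give a rough idea of what an 8-bit representation involves, here is a minimal pure-Python sketch of symmetric per-tensor quantization. It is an illustration only, not the TensorRT calibration procedure the article uses; the function names and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127].

    Assumption: one scale for the whole list, derived from the largest
    absolute value. Real toolchains may pick the scale differently.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [x * scale for x in q]


# Hypothetical weights, chosen only to illustrate the round trip.
weights = [0.02, -1.27, 0.5, 0.731, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(scale, max_error)
```

The gap between `weights` and `restored` is the kind of divergence from full precision that has to be checked after quantizing a real model.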

Mastering TorchScript: Tracing vs Scripting, Device Pinning, Direct Graph Modification

October 29, 2020

Machine Learning Productionization

Pytorch, TorchScript, TensorRT, ONNX, Nsight Systems

TorchScript is one of the most important parts of the PyTorch ecosystem, allowing portable, efficient and nearly seamless deployment. With just a few lines of torch.jit code and some simple model changes, you can export an asset that runs anywhere libtorch does. It’s an important toolset to master if you want to run your models outside the lab at high efficiency. This article is a collection of topics that go beyond the basics of your first export.

Object Detection at 1840 FPS with TorchScript, TensorRT and DeepStream

October 17, 2020

Visual Analytics, Machine Learning Productionization

SSD300, Pytorch, Object Detection, Optimization, DeepStream, TorchScript, TensorRT, ONNX, NVTX, Nsight Systems

In this article we take performance of the SSD300 model even further, leaving Python behind and moving towards true production deployment technologies: TorchScript, TensorRT and DeepStream. We also identify and understand several limitations in Nvidia’s DeepStream framework, and then remove them by modifying how the nvinfer element works.

Object Detection from 9 FPS to 650 FPS in 6 Steps

September 30, 2020

Visual Analytics, Machine Learning Productionization

SSD300, Pytorch, Object Detection, Gstreamer, NVTX, Optimization, Nsight Systems

Making code run fast on GPUs requires a very different approach from making code run fast on CPUs, because the hardware architecture is fundamentally different. Machine learning engineers of all kinds should care about squeezing performance from their models and hardware, not just for production purposes but also for research and training. In research as in development, a fast iteration loop leads to faster improvement. This article is a practical deep dive into making a specific deep learning model (Nvidia’s SSD300) run fast on a powerful GPU server, but the general principles apply to all GPU programming.

A Simple and Flexible Pytorch Video Pipeline

September 23, 2020

Visual Analytics

SSD300, Pytorch, Object Detection, Gstreamer

Taking machine learning models into production for video analytics doesn’t have to be hard. A reasonably efficient pipeline can be created very quickly just by plugging together the right libraries. In this post we’ll create a video pipeline with a focus on flexibility and simplicity using two main libraries: GStreamer and PyTorch.