<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Deployment | Gabriel Humpire-Mamani</title><link>https://gabrielhumpire.github.io/tags/deployment/</link><atom:link href="https://gabrielhumpire.github.io/tags/deployment/index.xml" rel="self" type="application/rss+xml"/><description>Deployment</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Thu, 01 Jun 2023 00:00:00 +0000</lastBuildDate><image><url>https://gabrielhumpire.github.io/media/icon_hu_645fa481986063ef.png</url><title>Deployment</title><link>https://gabrielhumpire.github.io/tags/deployment/</link></image><item><title>From Research to Production: Deploying Deep Learning on Embedded Hardware</title><link>https://gabrielhumpire.github.io/post/research_production/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://gabrielhumpire.github.io/post/research_production/</guid><description>&lt;p&gt;After years in research, I moved to Instech Netherlands, where I spent several years deploying deep learning models into production, including on embedded hardware like the NVIDIA Jetson AGX Xavier. This post shares the practical lessons that research papers don&amp;rsquo;t cover.&lt;/p&gt;
&lt;h2 id="the-gap-between-benchmark-and-production"&gt;The Gap Between Benchmark and Production&lt;/h2&gt;
&lt;p&gt;A model that scores well on your validation set is the beginning, not the end. Production deployment introduces constraints that fundamentally change the engineering problem:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Latency budgets&lt;/strong&gt;: A CT scan analysis that takes 10 minutes is useless in an emergency workflow.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory constraints&lt;/strong&gt;: Embedded devices have a fraction of the RAM of a workstation GPU.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reproducibility&lt;/strong&gt;: Your Docker container must produce identical results on a Jetson as on a cloud A100.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id="the-tensorrt-pipeline"&gt;The TensorRT Pipeline&lt;/h2&gt;
&lt;p&gt;The standard path I used: train in PyTorch or TensorFlow → export to ONNX → compile to TensorRT engine.&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-gdscript3" data-lang="gdscript3"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;PyTorch&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;onnx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;export&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;ONNX&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;trtexec&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;TensorRT&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Each step has failure modes. ONNX export can silently drop dynamic shapes. TensorRT compilation fails on unsupported ops. The key is to validate intermediate outputs at each stage against a known-good reference.&lt;/p&gt;
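&lt;p&gt;As a minimal sketch of that reference check (the toy model, file name, and tolerance are illustrative; any exported network works the same way, assuming &lt;code&gt;onnxruntime&lt;/code&gt; is installed):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: validate an ONNX export against its PyTorch reference.
import numpy as np
import torch
import onnxruntime as ort

# Toy stand-in for the real network.
model = torch.nn.Sequential(torch.nn.Conv2d(1, 4, 3), torch.nn.ReLU()).eval()
dummy = torch.randn(1, 1, 32, 32)
torch.onnx.export(model, dummy, "model.onnx", input_names=["x"], output_names=["y"])

with torch.no_grad():
    ref = model(dummy).numpy()

sess = ort.InferenceSession("model.onnx")
cand = sess.run(None, {"x": dummy.numpy()})[0]

# Compare values, not just shapes - silent export bugs show up here.
max_err = np.abs(ref - cand).max()
assert max_err &amp;lt; 1e-4, f"ONNX output diverges from PyTorch: {max_err}"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The same check repeats after the TensorRT build, with a looser tolerance once reduced precision is enabled.&lt;/p&gt;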
&lt;h2 id="quantisation-trade-offs"&gt;Quantisation Trade-offs&lt;/h2&gt;
&lt;p&gt;FP16 quantisation is almost always safe for computer vision models - I rarely saw more than 0.5% drop in Dice score on segmentation tasks. INT8 requires careful calibration and more validation, but the latency gains are significant on devices without FP16 tensor cores.&lt;/p&gt;
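&lt;p&gt;To put a number on that drop, a quick sketch of the comparison (synthetic probabilities stand in here for the real FP32 and FP16 engine outputs on the same validation volume):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: quantify the Dice cost of FP16 on a segmentation mask.
import numpy as np

def dice(a, b, eps=1e-7):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

# Synthetic stand-ins for FP32 and FP16 engine outputs.
rng = np.random.default_rng(0)
probs_fp32 = rng.uniform(0.0, 1.0, size=(512, 512)).astype(np.float32)
probs_fp16 = probs_fp32.astype(np.float16).astype(np.float32)  # simulated cast

mask_fp32 = probs_fp32 &amp;gt; 0.5
mask_fp16 = probs_fp16 &amp;gt; 0.5

drop = 1.0 - dice(mask_fp32, mask_fp16)
print(f"Dice drop from FP16: {drop:.5f}")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In production the two inputs come from the FP32 and quantised engines; the casting here only simulates the precision loss near the decision threshold.&lt;/p&gt;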
&lt;p&gt;For medical imaging specifically: validate quantised models on a representative slice of edge cases, not just average accuracy. A small accuracy drop on average can hide large errors on the cases that matter most.&lt;/p&gt;
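&lt;p&gt;In practice that means gating acceptance on the worst stratum rather than the mean. A hypothetical sketch (the case tags, Dice scores, and threshold are made up for illustration):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;# Sketch: gate a quantised model on its worst stratum, not the mean.
from collections import defaultdict

# Hypothetical per-case (stratum tag, Dice) results.
results = [
    ("routine", 0.94), ("routine", 0.95), ("routine", 0.93),
    ("metal_artifact", 0.81), ("metal_artifact", 0.78),
    ("low_dose", 0.90), ("low_dose", 0.88),
]

by_stratum = defaultdict(list)
for tag, score in results:
    by_stratum[tag].append(score)

mean_all = sum(s for _, s in results) / len(results)
worst = min((sum(v) / len(v), k) for k, v in by_stratum.items())

print(f"mean Dice: {mean_all:.3f}")
print(f"worst stratum: {worst[1]} at {worst[0]:.3f}")

# A healthy-looking mean can hide a failing stratum - gate on the worst.
assert worst[0] &amp;gt;= 0.75, "quantised model fails on an edge-case stratum"
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;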
&lt;h2 id="docker-as-the-deployment-contract"&gt;Docker as the Deployment Contract&lt;/h2&gt;
&lt;p&gt;Every model I deployed was wrapped in a Docker container with pinned versions of CUDA, cuDNN, TensorRT, and the application code. This single decision eliminated an entire class of &amp;ldquo;works on my machine&amp;rdquo; bugs. On the Jetson, we used NVIDIA&amp;rsquo;s L4T base images to guarantee hardware compatibility.&lt;/p&gt;
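&lt;p&gt;A minimal sketch of such a container for a Jetson (the base-image tag and file names are illustrative, not the exact ones we shipped):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-docker" data-lang="docker"&gt;# Pinning the L4T base image fixes CUDA, cuDNN and TensorRT to one
# JetPack release; the tag below is an example, not a recommendation.
FROM nvcr.io/nvidia/l4t-tensorrt:r8.5.2-runtime

# Pin Python dependencies too - a floating version can silently change
# numerical behaviour between builds.
COPY requirements.txt /app/requirements.txt
RUN pip3 install --no-cache-dir -r /app/requirements.txt

COPY . /app
WORKDIR /app
ENTRYPOINT ["python3", "inference.py"]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The point is that the image tag, not documentation, is the source of truth for which CUDA stack the model was validated against.&lt;/p&gt;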
&lt;h2 id="lessons"&gt;Lessons&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Profile before optimising - the bottleneck is rarely where you expect.&lt;/li&gt;
&lt;li&gt;ONNX is a good intermediate format but inspect the graph; some exporters produce redundant nodes.&lt;/li&gt;
&lt;li&gt;Build your validation pipeline before you build your model. It will save you weeks.&lt;/li&gt;
&lt;/ol&gt;</description></item><item><title>EV Battery Cell Segmentation via CT</title><link>https://gabrielhumpire.github.io/project/evbattery_segmentation/</link><pubDate>Thu, 26 Jan 2023 00:00:00 +0000</pubDate><guid>https://gabrielhumpire.github.io/project/evbattery_segmentation/</guid><description>&lt;p&gt;Automated 3D segmentation of lithium-ion battery cells from industrial CT scans, deployed in production at Nuctech. Pipeline built with TensorRT and ONNX for optimised inference, enabling fast quality control of electric vehicle battery packs.&lt;/p&gt;</description></item></channel></rss>