A custom NVIDIA DeepStream GStreamer plugin that runs AprilTag detection entirely on the GPU using CUDA, targeting Jetson Orin NX / AGX with DeepStream 7.1.
Tested with 8 simultaneous 1080p RTSP streams on a Jetson Orin NX 16 GB — CPU usage stays below 20%, GPU usage up to ~40%.
Target platform: NVIDIA Jetson Orin NX / AGX — JetPack 6.x, DeepStream 7.1, CUDA 12.x
The GPU detection core is built on top of the excellent apriltags_cuda library by FRC Team 766 / Team 971, which implements the entire AprilTag detection pipeline (thresholding, connected-component labeling, quad fitting, tag decoding) as CUDA kernels — no CPU involvement in the detection path.
This repository wraps that library as a native NVIDIA DeepStream GStreamer plugin (nvdsapriltagcuda) and includes a set of modifications that were necessary to make it build and run inside a DeepStream Docker container on Jetson without a full Jetson SDK on the host machine.
| Change | Details |
|---|---|
| CMakeLists_minimal.txt | Strips WPILib, NetworkTables, OpenCV, and seasocks; builds with just CUDA + apriltag C lib + glog |
| EGL / NVMM interop | CudaEglFrame maps NvBufSurface NVMM buffers directly into CUDA without any CPU copy |
| RGBA to YUYV kernel | Custom rgba_to_yuyv_kernel — the frc971 detector expects YUYV input; DeepStream delivers RGBA |
| Docker stub libraries | Mock .so files for libnvbufsurface allow link-time resolution inside Docker without a Jetson SDK |
| DeepStream metadata | Results written as NvDsObjectMeta (class ID, object ID, bounding boxes, text labels) |
| FIFO JSON IPC | Each detection serialised as JSON to /tmp/apriltag_detections_cam_<CAMERA_ID> for companion processes |
| detection-interval property | GObject property to detect every N frames and reuse cached results in between |
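The "DeepStream metadata" row mentions bounding boxes. As a rough illustration of how a detected quad could be reduced to an axis-aligned rectangle with left/top/width/height fields (the layout used by DeepStream rectangle parameters), here is a minimal host-side sketch. The `Point2f` and `PixelRect` types and the `bounding_rect` function are hypothetical names introduced for this example and are not taken from the plugin source.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical types for illustration only; not the plugin's own structs.
struct Point2f { float x; float y; };
struct PixelRect { float left; float top; float width; float height; };

// Axis-aligned bounding rectangle of four tag corners, clamped to the frame.
PixelRect bounding_rect(const std::array<Point2f, 4>& corners,
                        float frame_width, float frame_height) {
    assert(frame_width > 0.0f && frame_height > 0.0f);

    float min_x = corners[0].x, max_x = corners[0].x;
    float min_y = corners[0].y, max_y = corners[0].y;
    for (const Point2f& c : corners) {
        min_x = std::min(min_x, c.x);
        max_x = std::max(max_x, c.x);
        min_y = std::min(min_y, c.y);
        max_y = std::max(max_y, c.y);
    }

    // Keep the rectangle inside the frame so on-screen drawing stays valid.
    min_x = std::max(0.0f, std::min(min_x, frame_width));
    max_x = std::max(0.0f, std::min(max_x, frame_width));
    min_y = std::max(0.0f, std::min(min_y, frame_height));
    max_y = std::max(0.0f, std::min(max_y, frame_height));

    return PixelRect{min_x, min_y, max_x - min_x, max_y - min_y};
}
```

For a 1920 x 1080 frame and corners (10, 20), (30, 18), (32, 40), (8, 42), this gives left 8, top 18, width 24, height 24.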
nvds_apriltag_cuda/
├── plugin/ # GStreamer DeepStream plugin
│ ├── gstnvdsapriltagcuda.cu # Main GStreamer element (CUDA TU)
│ ├── cuda_utils.cu # EGL interop + RGBA->YUYV CUDA kernel
│ ├── cuda_utils.h
│ └── CMakeLists.txt
├── apriltags_cuda/ # Modified GPU detection library
│ ├── CMakeLists_minimal.txt # Key contribution: minimal build for DeepStream/Docker
│ ├── install_deps.sh # Native host dependency installer
│ └── src/ # Core GPU detection pipeline (CUDA kernels)
├── Dockerfile # Full build: apriltag lib + plugin in one image
└── README.md
The element is a GstBaseTransform in-place plugin registered as nvdsapriltagcuda.
- The incoming NvBufSurface (NVMM/RGBA) is mapped as an EGL image via NvBufSurfaceMapEglImage
- CudaEglFrame registers it as a CUDA graphics resource (cuGraphicsEGLRegisterImage)
- A custom CUDA kernel (rgba_to_yuyv_kernel) converts RGBA → YUYV on the GPU
- frc971::apriltag::GpuDetector::Detect() runs the full detection pipeline on-device
- Results are written as NvDsObjectMeta (class/object IDs, bounding boxes, text labels)
- Each detection is also serialised as JSON and written to a named FIFO pipe for consumption by a companion process
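To make the RGBA → YUYV step concrete, the following is a CPU reference sketch of the packing, not the plugin's actual kernel. It assumes BT.601 limited-range integer coefficients and chroma averaged over each horizontal pixel pair; the README does not state which colour matrix rgba_to_yuyv_kernel uses. The function name `rgba_to_yuyv_reference` is hypothetical. In a CUDA version, the body of the inner loop would typically be the work of one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// BT.601 limited-range luma (16..235).
// ASSUMPTION: illustrative coefficients; the plugin's kernel may differ.
static inline std::uint8_t luma601(int r, int g, int b) {
    return static_cast<std::uint8_t>(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);
}

// Converts packed RGBA (4 bytes/pixel) to packed YUYV (2 bytes/pixel).
// Every pair of horizontal pixels shares one U and one V sample.
std::vector<std::uint8_t> rgba_to_yuyv_reference(const std::vector<std::uint8_t>& rgba,
                                                 int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0);
    assert(rgba.size() == static_cast<std::size_t>(width) * height * 4);

    std::vector<std::uint8_t> yuyv(static_cast<std::size_t>(width) * height * 2);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; x += 2) {
            const std::size_t pixel = static_cast<std::size_t>(y) * width + x;
            const std::uint8_t* p0 = &rgba[pixel * 4];  // left pixel of the pair
            const std::uint8_t* p1 = p0 + 4;            // right pixel of the pair

            // Chroma is computed from the average colour of the pair.
            const int r = (p0[0] + p1[0]) / 2;
            const int g = (p0[1] + p1[1]) / 2;
            const int b = (p0[2] + p1[2]) / 2;

            // 32896 = 128 * 256 (chroma offset) + 128 (rounding); adding it
            // before the shift keeps the operand non-negative.
            const int u = (-38 * r - 74 * g + 112 * b + 32896) >> 8;
            const int v = (112 * r - 94 * g - 18 * b + 32896) >> 8;

            std::uint8_t* out = &yuyv[pixel * 2];
            out[0] = luma601(p0[0], p0[1], p0[2]);  // Y0
            out[1] = static_cast<std::uint8_t>(u);  // U
            out[2] = luma601(p1[0], p1[1], p1[2]);  // Y1
            out[3] = static_cast<std::uint8_t>(v);  // V
        }
    }
    return yuyv;
}
```

Under these assumptions, a white pixel followed by a black pixel packs to the bytes 235, 128, 16, 128. The output is half the size of the input, since the alpha channel is dropped and chroma is subsampled 2:1 horizontally.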
The detection-interval property lets you run detection every N frames and reuse cached results in between, reducing GPU load on high-framerate streams.
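The caching behaviour behind detection-interval can be sketched as a small wrapper that runs the detector on every Nth frame and returns the stored result otherwise. This is a hypothetical illustration: `TagDetection`, `IntervalDetector`, and `OnFrame` are names invented for the example, and the plugin's own bookkeeping may differ (for instance in which frame of each group triggers detection).

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical detection record; the plugin's real type is not shown here.
struct TagDetection {
    int id;
    float center_x;
    float center_y;
};

// Runs `detect` on frames 0, N, 2N, ... and reuses the cached result between.
class IntervalDetector {
public:
    using DetectFn = std::function<std::vector<TagDetection>()>;

    IntervalDetector(unsigned interval, DetectFn detect)
        : interval_(interval), detect_(std::move(detect)) {
        assert(interval_ >= 1);      // an interval of 1 means "every frame"
        assert(detect_ != nullptr);  // a detector callback is required
    }

    // Call once per frame; returns fresh or cached detections.
    const std::vector<TagDetection>& OnFrame() {
        if (frame_index_ % interval_ == 0) {
            cached_ = detect_();
        }
        ++frame_index_;
        return cached_;
    }

private:
    unsigned interval_;
    DetectFn detect_;
    unsigned long frame_index_ = 0;
    std::vector<TagDetection> cached_;
};
```

With an interval of 3, seven consecutive frames trigger the detector on frames 0, 3, and 6 only; the other four frames reuse the most recent result.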
| Dependency | Version |
|---|---|
| Jetson platform | Orin NX or AGX (sm_87), or any CUDA GPU |
| JetPack | 6.x |
| DeepStream | 7.1 |
| CUDA toolkit | 12.x (nvcc) |
| CMake | 3.18+ |
| libgoogle-glog | any recent version |
| apriltag C library | 3.3.0 (cgpadwick fork) |
| NVIDIA CCCL | v2.3.2 (CUB / Thrust headers) |
The Dockerfile handles every step: CMake install, apriltag library, CCCL headers, stub libs, and the plugin itself.
# Build from the nvds_apriltag_cuda/ directory
docker build -t nvds-apriltag-cuda:latest .
# Run with GPU access
docker run --rm --runtime=nvidia --gpus all nvds-apriltag-cuda:latest
CUDA architecture is set to 87 (Jetson Orin). Change -DCMAKE_CUDA_ARCHITECTURES=87 in the Dockerfile if targeting a different GPU.
sudo apt-get install -y \
build-essential pkg-config git wget \
libgstreamer1.0-dev libgstreamer-plugins-base1.0-dev \
libgoogle-glog-dev
git clone --depth 1 --branch 3.3.0 https://github.com/cgpadwick/apriltag.git
cd apriltag && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_EXAMPLES=OFF ..
make -j$(nproc) && sudo make install && sudo ldconfig
git clone --depth 1 --branch v2.3.2 https://github.com/NVIDIA/cccl.git /opt/cccl
cd apriltags_cuda
cp CMakeLists_minimal.txt CMakeLists.txt
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=87 \
-DCCCL_DIR=/opt/cccl \
..
make -j$(nproc) && sudo make install && sudo ldconfig
cd plugin
mkdir build && cd build
cmake ..
make -j$(nproc)
sudo make install
The .so installs to /opt/nvidia/deepstream/deepstream/lib/gst-plugins/.
export GST_PLUGIN_PATH=/opt/nvidia/deepstream/deepstream/lib/gst-plugins:$GST_PLUGIN_PATH
gst-inspect-1.0 nvdsapriltagcuda
gst-launch-1.0 \
rtspsrc location=rtsp://<ip>:<port>/<path> latency=100 ! \
rtph264depay ! h264parse ! nvv4l2decoder ! \
nvstreammux name=mux batch-size=1 width=1920 height=1080 ! \
nvdsapriltagcuda detection-interval=3 ! \
nvdsosd ! \
nvegltransform ! nveglglessink
Set detection-interval to 1 to detect on every frame, or higher values (e.g. 3-5) to reduce GPU load on high-framerate streams while still reusing the last cached result for OSD rendering.
Every detection is serialised as a JSON object and written to a named pipe at:
/tmp/apriltag_detections_cam_<CAMERA_ID>
where CAMERA_ID is read from the CAMERA_ID environment variable (defaults to 0). A companion process can open this pipe for reading to consume detections in real time without modifying the GStreamer pipeline. The FIFO path is currently hardcoded in gstnvdsapriltagcuda.cu; if you need it to be runtime-configurable, expose it as a GObject property.
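A companion process could consume the pipe along the following lines. This is a minimal sketch that assumes each detection arrives as one JSON object per line; the README does not specify the framing or the JSON fields, so records are passed through as raw strings. The names `detection_fifo_path` and `consume_detections` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <functional>
#include <string>

// Builds the pipe path from CAMERA_ID, falling back to "0" when it is unset.
std::string detection_fifo_path() {
    const char* camera_id = std::getenv("CAMERA_ID");
    return std::string("/tmp/apriltag_detections_cam_") + (camera_id ? camera_id : "0");
}

// Reads records from `path` until end-of-file and hands each non-empty line
// to `on_record`. Returns the number of records delivered.
// ASSUMPTION: records are newline-delimited; adjust if the plugin frames
// its JSON differently.
std::size_t consume_detections(const std::string& path,
                               const std::function<void(const std::string&)>& on_record) {
    assert(on_record != nullptr);

    std::ifstream in(path);  // on a FIFO, opening blocks until a writer appears
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) {
            continue;  // skip blank separators
        }
        on_record(line);
        ++count;
    }
    return count;
}
```

A typical call would be `consume_detections(detection_fifo_path(), handler)`, where `handler` parses each record with whichever JSON library the companion process already uses. The function returns when the writing side closes the pipe, so a long-running consumer would call it in a loop.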
Tested with 8 simultaneous 1080p RTSP streams on a Jetson Orin NX 16 GB running JetPack 6 / DeepStream 7.1:
| Metric | Measured value |
|---|---|
| CPU usage (all cores) | < 20% |
| GPU usage | up to ~40% |
| RAM usage | 6 - 8 GB |
| Streams | 8 x 1080p RTSP |
| Detection interval | 3 (detect every 3rd frame) |
The low CPU usage is the key advantage over CPU-based AprilTag libraries: thresholding, connected-component labeling, quad fitting, and tag decoding all run as CUDA kernels. The CPU is only involved in GStreamer buffer management and metadata writing.
- apriltags_cuda — original GPU AprilTag library by FRC Team 766 / Team 971 that this plugin is built on
- deepstream-apriltag-vpi — companion plugin using NVIDIA VPI 3 and the PVA co-processor instead of CUDA cores (4 streams, zero CUDA core usage)
- NVIDIA DeepStream SDK
- NVIDIA CUDA on Jetson
GPU detection core: apriltags_cuda by FRC Team 766 / Team 971, originally based on the apriltag work from FRC Team 971 (Spartan Robotics). The original library is licensed under the MIT License.
DeepStream integration, EGL/NVMM interop, RGBA-to-YUYV kernel, Docker stub library technique, and CMakeLists_minimal.txt by t-teja.