Log GPU memory usage #3077

Merged
MMathisLab merged 39 commits into main from maxim/log_gpu_memory
Sep 18, 2025
Conversation

@maximpavliv
Contributor

Description

This PR adds GPU memory usage logging during both training and inference.
It helps users diagnose and prevent out-of-memory (OOM) errors, which have been reported several times (e.g. #2942, #2983). By logging GPU usage per process, users can see how much memory DeepLabCut is reserving without external tools.

Implementation

  • Training (train.py): GPU usage is appended to log messages at each epoch.
    Example:
Epoch 1/100 (lr=0.0001), train loss 0.10713, GPU: 2798.0/11011.5 MiB
Epoch 2/100 (lr=0.0001), train loss 0.02403, GPU: 3166.0/11011.5 MiB
  • Inference (videos.py): GPU usage is shown in tqdm progress bars.
    Example:
Running pose prediction with batch size 4
 11%|███▏                        | 1280/11178 [00:21<02:49, 58.23it/s, GPU=3232.0/11011.5 MiB]
  • Logged metrics:

    • torch.cuda.memory_reserved() (per-process reserved memory)
    • torch.cuda.get_device_properties(0).total_memory (total device memory)
  • Why reserved memory?

    • Shows memory this process holds (not available to others).
    • Alternatives considered:
      • torch.cuda.memory_allocated() → counts only memory occupied by live tensors, excluding the cached buffers the allocator has reserved.
      • pynvml → global GPU usage across all processes.
    • Reserved memory gives the clearest view per process; tools like nvitop remain better for global monitoring.
  • When CUDA is unavailable: the GPU field is simply omitted from the logs, which also makes it clear at a glance whether a GPU is engaged.

@MMathisLab MMathisLab merged commit be6c8f9 into main Sep 18, 2025
3 of 5 checks passed
@MMathisLab MMathisLab deleted the maxim/log_gpu_memory branch September 18, 2025 17:55