Log GPU memory usage #3077

Merged
MMathisLab merged 39 commits into main from maxim/log_gpu_memory
Sep 18, 2025
Conversation

@maximpavliv
Contributor

Description

This PR adds GPU memory usage logging during both training and inference.
It helps users diagnose and prevent out-of-memory (OOM) errors, which have been reported several times (e.g. #2942, #2983). By logging GPU usage per process, users can see how much memory DeepLabCut is reserving without external tools.

Implementation

  • Training (train.py): GPU usage is appended to log messages at each epoch.
    Example:
Epoch 1/100 (lr=0.0001), train loss 0.10713, GPU: 2798.0/11011.5 MiB
Epoch 2/100 (lr=0.0001), train loss 0.02403, GPU: 3166.0/11011.5 MiB
  • Inference (videos.py): GPU usage is shown in tqdm progress bars.
    Example:
Running pose prediction with batch size 4
 11%|███▏                        | 1280/11178 [00:21<02:49, 58.23it/s, GPU=3232.0/11011.5 MiB]
  • Logged metrics:

    • torch.cuda.memory_reserved() (per-process reserved memory)
    • torch.cuda.get_device_properties(0).total_memory (total device memory)
  • Why reserved memory?

    • Shows memory this process holds (not available to others).
    • Alternatives considered:
      • torch.cuda.memory_allocated() → counts only memory occupied by live tensors, excluding the cached buffers the allocator has reserved.
      • pynvml → global GPU usage across all processes.
    • Reserved memory gives the clearest view per process; tools like nvitop remain better for global monitoring.
  • When CUDA is unavailable: the GPU field is simply omitted from the logs, which also makes it clear at a glance whether a GPU is engaged.

@MMathisLab MMathisLab merged commit be6c8f9 into main Sep 18, 2025
3 of 5 checks passed
@MMathisLab MMathisLab deleted the maxim/log_gpu_memory branch September 18, 2025 17:55