
PyTorch: Improve inference batching speed#3099

Merged
AlexEMG merged 1 commit into DeepLabCut:main from arashsm79:arash/improve_inference_batch_collation
Sep 21, 2025

Conversation

Contributor

@arashsm79 arashsm79 commented Sep 17, 2025

Summary

This PR replaces incremental tensor concatenation ( $O(n^2)$ ) during inference with list-based accumulation. Final stacking now happens only once per batch, avoiding repeated reallocation and copying.

(Depends on #3094)

Main points:

  • Appending images is now $O(1)$ amortized.
  • Single torch.stack per processed batch.
  • Reduced peak allocator churn and CPU overhead.
  • No public API changes.
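The new pattern can be sketched roughly as follows (a minimal illustration under stated assumptions, not the actual DeepLabCut runner code; the class and method names here are hypothetical):

```python
import torch

class BatchAccumulator:
    """Minimal sketch of list-based batch accumulation (hypothetical class,
    not DeepLabCut's actual inference runner)."""

    def __init__(self):
        # A plain Python list replaces the incrementally torch.cat'ed tensor.
        self._batch_list = []

    def append(self, frame: torch.Tensor) -> None:
        # Amortized O(1): no tensor reallocation or copying on append.
        self._batch_list.append(frame)

    def pop_batch(self) -> torch.Tensor:
        # Single torch.stack per processed batch: one copy, done once.
        batch = torch.stack(self._batch_list, dim=0)
        self._batch_list = []
        return batch

acc = BatchAccumulator()
for _ in range(8):
    acc.append(torch.zeros(3, 128, 128))
print(acc.pop_batch().shape)  # torch.Size([8, 3, 128, 128])
```

The same shape comes out either way; only the number of intermediate copies changes.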

Details

The previous pattern for batching images during inference:

```python
self._batch = torch.cat([self._batch, inputs], dim=0)
```

caused $O(n^2)$ total memory movement for $n$ appended images. This was a bottleneck for larger batches, causing allocator churn and CPU overhead.
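To see why the old pattern is quadratic, here is a pure-Python analogy (lists stand in for tensors; the copy counts are illustrative, not profiler output). Each `torch.cat`-style rebuild copies the whole batch, so appending $n$ frames moves $1 + 2 + \dots + n \approx n^2/2$ elements in total, while list appends followed by one stack move only $O(n)$:

```python
def batch_by_concat(frames):
    """Old pattern: rebuild the batch on every append, like torch.cat."""
    batch, elements_copied = [], 0
    for frame in frames:
        batch = batch + [frame]        # copies the whole batch each time
        elements_copied += len(batch)
    return batch, elements_copied

def batch_by_append(frames):
    """New pattern: accumulate in a list, copy once at batch time."""
    batch_list = []
    for frame in frames:
        batch_list.append(frame)       # amortized O(1), no copying
    batch = list(batch_list)           # plays the role of one torch.stack
    return batch, len(batch)

frames = list(range(1000))
_, concat_copies = batch_by_concat(frames)
_, append_copies = batch_by_append(frames)
print(concat_copies, append_copies)    # 500500 1000
```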

Profiling

This was confirmed by benchmarking the inference procedure with the PyTorch profiler and Scalene.

PyTorch Profiler
For large batch sizes, the GPU stalls waiting for the producer CPU thread to finish preprocessing the images.

Scalene
Statistical profiling shows that the concatenation is one of the main hot spots of the inference procedure, due to repeated reallocation and copying of the immutable tensor.

Results

These changes fix the inefficiency at large batch sizes (thanks @maximpavliv for running the benchmark):

(FPS vs. batch size plots: fps_vs_batchsize_128x128, fps_vs_batchsize_256x256, fps_vs_batchsize_512x512)
And the batching pipeline no longer appears among Scalene's hot spots.

@maximpavliv maximpavliv self-requested a review September 17, 2025 14:17
@arashsm79 arashsm79 marked this pull request as ready for review September 17, 2025 21:40
@arashsm79 arashsm79 changed the title from [WIP] Improve inference speed to Improve inference speed Sep 17, 2025
@AlexEMG
Member

AlexEMG commented Sep 18, 2025

This is excellent @arashsm79 -- thanks for the contribution!

Contributor

@maximpavliv maximpavliv left a comment


Thanks for fixing the batching mechanism! 🚀
The fix significantly improves performance for larger batch sizes, which is a big win.

Not directly related to this PR, but I realized that the CTDInferenceRunner is lacking the multithreading scheme (preprocessing and batching performed by a producer thread, prediction performed by a consumer thread). Let's address this in a future PR.

@MMathisLab
Member

Looks great, let's fix merge conflicts and then merge this @arashsm79

Member

@AlexEMG AlexEMG left a comment


Fantastic @arashsm79 -- let's fix the conflicts and we're ready for rc13!

@arashsm79 arashsm79 changed the title from Improve inference speed to Improve inference batching speed Sep 19, 2025
Use list accumulation for inference batches to eliminate O(n^2) torch.cat

- Replaced incremental tensor _batch with list _batch_list
- Stack only at batch processing time
- Updated sequential and async inference paths
@arashsm79 arashsm79 force-pushed the arash/improve_inference_batch_collation branch from 85402e3 to 61c5d68 September 19, 2025 09:48
@AlexEMG AlexEMG changed the title from Improve inference batching speed to PyTorch: Improve inference batching speed Sep 19, 2025
@AlexEMG AlexEMG merged commit 5815229 into DeepLabCut:main Sep 21, 2025
4 checks passed