Summary
I propose accelerating PyPose by replacing, or adding an alternative set of implementations for, the functions in pypose/lietensor/operation.py using NVIDIA Warp. These functions are currently implemented in PyTorch and are relatively inefficient since no kernel fusion is performed.
By replacing the operations, or providing an additional set of Warp-backed LieType implementations, we can accelerate PyPose ops by 2-10x on both CPU and CUDA.
Improvements
Significant speedup, especially on complex operators like AdjXa and Log, and lower latency for robotics stacks built with PyPose.
Risks
- NVIDIA Warp will not (in the foreseeable future) support devices other than CPUs and NVIDIA GPUs.
- NVIDIA Warp does not support the bf16 datatype as of today; only fp16, fp32, and fp64 are supported.
- Since we will implement a dedicated kernel for each operator, we may not be able to support tensor inputs with arbitrarily many dimensions. However, we can support up to a reasonable number of batch dimensions by explicitly listing them in the codebase (e.g. up to 4 batch dimensions).
- Launching Warp kernels incurs additional overhead, so on small inputs these operations can be slightly slower.
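To make the batch-dimension limit concrete, here is a minimal sketch (not PyPose or Warp API; the `MAX_BATCH_DIMS` constant is a hypothetical limit matching the proposal) of the shape logic a per-operator kernel wrapper would need: broadcast the batch shapes of the two operands, rejecting inputs beyond the supported number of batch dimensions. Shapes here exclude the trailing Lie-element dimension (e.g. 4 for SO3 quaternions).

```python
MAX_BATCH_DIMS = 4  # hypothetical limit; one kernel variant per batch rank

def broadcast_batch_shape(a, b):
    """NumPy-style broadcast of two batch shapes (tuples of ints)."""
    if len(a) > MAX_BATCH_DIMS or len(b) > MAX_BATCH_DIMS:
        raise NotImplementedError(
            f"kernels only cover up to {MAX_BATCH_DIMS} batch dims")
    # Right-align the shapes and pad the shorter one with 1s.
    n = max(len(a), len(b))
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    out = []
    for da, db in zip(a, b):
        if da != db and 1 not in (da, db):
            raise ValueError(f"incompatible batch shapes {a} and {b}")
        out.append(max(da, db))
    return tuple(out)
```

The explicit cap is what lets each batch rank map to a concrete, pre-compiled kernel signature instead of a fully dynamic one.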
Involved components
LieType implementations.
Preliminary results
I've conducted some preliminary experiments by implementing a warp_SO3_Type that inherits from LieType and replaces the operators (forward and backward) with Warp functions.
While results on relatively simple operators like SO3_Act are mixed, significant speedups show up on SO3_Log and SO3_AdjXa. (All kernels broadcast correctly with up to 4 batch dimensions, and their results are tested against the PyPose reference.)
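For reference, timing comparisons like the ones above can be made with a minimal harness along these lines (a generic sketch, not the exact script used; `fn` stands in for any op under test):

```python
import time

def bench(fn, warmup=10, iters=100):
    """Average wall-clock seconds per call of fn() after a warmup phase."""
    for _ in range(warmup):   # warmup absorbs one-time JIT/kernel compilation
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    # Note: for CUDA ops, fn itself should synchronize the device
    # (e.g. torch.cuda.synchronize()) so kernel time is actually measured.
    return (time.perf_counter() - t0) / iters
```

The warmup phase matters here because Warp compiles kernels on first launch, which would otherwise dominate the measurement.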
I'm currently using the following interface to keep compatibility with existing PyPose code in my project:
import warp as wp
import pypose as pp
from pypose.lietensor.lietensor import LieType
from .ltype import warpSO3_type

wp.init()

_BACKEND_LIST: list[tuple[LieType, LieType | None]] = [
    # (PyPose LieType, Warp LieType)
    (pp.SO3_type  , warpSO3_type),
    (pp.SE3_type  , None),
    (pp.Sim3_type , None),
    (pp.RxSO3_type, None),
]
_PP_TO_WP = {pp_ltype: wp_ltype for pp_ltype, wp_ltype in _BACKEND_LIST}
_WP_TO_PP = {wp_ltype: pp_ltype for pp_ltype, wp_ltype in _BACKEND_LIST}

def to_warp_backend(x: pp.LieTensor) -> pp.LieTensor:
    """Swap the LieTensor backend for accelerated compute."""
    if is_warp_backend(x):
        return x
    wp_ltype = _PP_TO_WP[x.ltype]
    if wp_ltype is None:
        raise NotImplementedError(
            f"Warp backend not implemented for pypose LieType {x.ltype}.")
    return pp.LieTensor(x.tensor(), ltype=wp_ltype)

def to_pypose_backend(x: pp.LieTensor) -> pp.LieTensor:
    """Swap the LieTensor backend for better op coverage."""
    if is_pypose_backend(x):
        return x
    return pp.LieTensor(x.tensor(), ltype=_WP_TO_PP[x.ltype])

def is_warp_backend(x: pp.LieTensor) -> bool:
    return x.ltype in {warpSO3_type}

def is_pypose_backend(x: pp.LieTensor) -> bool:
    return x.ltype in {pp.SE3_type, pp.SO3_type, pp.RxSO3_type, pp.Sim3_type}
Optional: Intended side effects
Optional: Missing test coverage
Additional unit tests to ensure the "warp backend" matches the original PyPose behavior, in both shape and numeric results.
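The alignment tests could follow a pattern like this sketch (the `reference_op`/`warp_op` names are stand-ins; in the real suite they would be the PyTorch-backed op and the Warp-backed op evaluated on identical inputs):

```python
import math

# Stand-ins for the two implementations under test. In the real suite these
# would be e.g. pp.Log on the PyTorch backend vs. the Warp-backed kernel.
def reference_op(xs):
    return [math.sin(x) for x in xs]

def warp_op(xs):
    return [math.sin(x) for x in xs]

def check_alignment(shapes=((1,), (3,), (2, 3), (2, 1, 3)), atol=1e-6):
    """Assert shape and numeric agreement across a grid of batch shapes."""
    for shape in shapes:
        n = math.prod(shape)
        data = [i / 7.0 for i in range(n)]  # deterministic test inputs
        ref, got = reference_op(data), warp_op(data)
        assert len(ref) == len(got), f"shape mismatch for {shape}"
        assert all(abs(r - g) <= atol for r, g in zip(ref, got)), \
            f"numeric mismatch for {shape}"
```

Sweeping batch shapes up to the supported rank, plus the fp16/fp32/fp64 dtypes Warp supports, would cover both failure modes listed above.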