$ JULIA_CUDA_USE_BINARYBUILDER=false nsys launch julia nsys_OMEinsum.jl
Then open nsys UI on your local host.
Run your code remotely on your GPU host.
$ sudo JULIA_CUDA_USE_BINARYBUILDER=false /home/ubuntu/.local/bin/ncu -o profile /home/ubuntu/.local/bin/julia permutedims-ncu.jl
Download the profile output and run the following locally:
$ ncu-ui profile.ncu-rep
Analyse the profile results. The "Registers Per Thread" value matters a lot; it should be below 64 for good performance.
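To see why the register count matters, here is a rough back-of-the-envelope sketch of how registers per thread cap the number of threads resident on one SM. The hardware numbers are assumptions, not values from these notes: 65536 32-bit registers per SM, at most 2048 resident threads per SM, and a warp size of 32 (typical of e.g. V100/A100; check your own GPU). The function names are hypothetical. It ignores register allocation granularity, shared memory, and block-size limits, so treat it as an upper bound only.

```python
# Rough upper bound on occupancy from register pressure alone.
# ASSUMED hardware limits (not from the notes above; vary by GPU):
#   65536 registers per SM, 2048 resident threads per SM, warp size 32.

def max_resident_threads(regs_per_thread, regs_per_sm=65536,
                         max_threads_per_sm=2048, warp_size=32):
    """Threads that can be resident on one SM, limited by registers only."""
    # Registers are consumed by whole warps, so count how many warps fit.
    warps_by_regs = regs_per_sm // (regs_per_thread * warp_size)
    threads_by_regs = warps_by_regs * warp_size
    # The SM's own thread limit still applies.
    return min(threads_by_regs, max_threads_per_sm)


def occupancy(regs_per_thread, max_threads_per_sm=2048, **kwargs):
    """Fraction of the SM's thread limit that register usage allows."""
    threads = max_resident_threads(
        regs_per_thread, max_threads_per_sm=max_threads_per_sm, **kwargs)
    return threads / max_threads_per_sm


if __name__ == "__main__":
    for regs in (32, 64, 128, 255):
        print(regs, max_resident_threads(regs), occupancy(regs))
```

Under these assumed limits, a kernel using 64 registers per thread can keep at most 1024 threads resident per SM, half of the assumed maximum, and every doubling of the register count halves that bound again.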
$ nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --stop-on-range-end=true --cudabacktrace=true -x true -o my_profile python benchmark_pytorch.py profilegpu