CUDA out of memory when running train.sh #8

@rfuruta

Description

Hi, when I try to run train.sh on two GPUs (TITAN RTX) with 24GB VRAM, I get the following errors:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 1; 23.65 GiB total capacity; 21.76 GiB already allocated; 24.75 MiB free; 21.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
RuntimeError: CUDA out of memory. Tried to allocate 148.00 MiB (GPU 0; 23.65 GiB total capacity; 21.18 GiB already allocated; 144.31 MiB free; 21.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried to reduce the GPU memory usage by changing some hyperparameters in v1-finetune_cocogit.yaml such as data.params.batch_size->1, data.params.num_workers->1, and model.params.first_stage_config.params.ddconfig.resolution->64, but the above errors still occur.
Is there any way to further reduce the GPU memory usage to run train.sh? Also, how much VRAM is required to run it with default hyperparameters?
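The OOM message itself suggests setting `max_split_size_mb` to reduce allocator fragmentation. A minimal sketch of applying that hint before launching training (the value 128 is an assumption, not something from this repo; tune it per the PyTorch memory-management docs):

```shell
# Cap the CUDA caching allocator's split size, as suggested by the OOM
# message, to reduce fragmentation. 128 MiB is a guessed starting point.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Then launch training as usual, e.g.:
# bash train.sh
```

Note this only mitigates fragmentation; it does not shrink the model's actual footprint, so if allocated memory is already near 24 GB, further measures (e.g. mixed precision or gradient checkpointing, if the training code supports them) would still be needed.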
