This repository contains the final project for the CS677: Deep Learning class at NJIT, titled "A study in audio-visual scene classification".
The model is trained on the TAU Audio-Visual Urban Scenes 2021 dataset as per the DCASE 2021 Task 1B instructions.
To set up the environment on a macOS machine, run the following commands in order:
conda install -c conda-forge ffmpeg
conda install pytorch torchvision torchaudio -c pytorch
pip install pandas tqdm h5py scikit-learn seaborn tabulate soundfile opencv-python
pip install mir_eval
pip install pyyaml
Installing in this order helps conda resolve dependency conflicts introduced by the ffmpeg installation.
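To sanity-check the installation, a one-liner like the following (an illustration, not part of the repo's own instructions) should print the installed PyTorch version without errors:

python -c "import torch, torchvision, torchaudio; print(torch.__version__)"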
For a Linux or Windows machine, install the PyTorch libraries matching your CUDA version from the official PyTorch installation page.
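For example, at the time of writing, the PyTorch page suggests a command of the following form for a conda install against CUDA 11.8 (treat this as an illustration and copy the exact command for your CUDA version from the page):

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia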
To train a model, run
srun python main.py --features_dir <path_to_features_directory> --config <path_to_config_file>
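For example, with hypothetical paths (substitute your own feature and config locations):

srun python main.py --features_dir data/features --config config.yaml

Note that srun is SLURM's job launcher; if you are running locally rather than on a cluster, drop the srun prefix.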
Alternatively, you can modify the gpu-train.sh script to run the program on a cluster.
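As a rough sketch, a SLURM batch script along the lines of gpu-train.sh might look like the following (the job name and resource requests are assumptions; adapt them to your cluster):

#!/bin/bash
#SBATCH --job-name=avsc-train   # hypothetical job name
#SBATCH --gres=gpu:1            # request one GPU
#SBATCH --mem=32G               # assumed memory request
#SBATCH --time=24:00:00         # assumed wall-clock limit

srun python main.py --features_dir <path_to_features_directory> --config <path_to_config_file>

Submit it with sbatch gpu-train.sh.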
The program expects the features directory to contain audio_features and video_features sub-directories, as sketched below; this layout is produced automatically if you download the features from this link.
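Concretely, the expected layout is:

<path_to_features_directory>/
├── audio_features/
└── video_features/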
By default, the config file specifies an audio+video model.
To train an audio-only or video-only model, set MODE to audio or video in the config file.
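As a hypothetical illustration, assuming the config file is YAML (pyyaml is among the dependencies; the actual key names and allowed values may differ), the relevant line might look like:

# hypothetical config fragment; actual keys may differ
MODE: audio   # set to audio or video for a single-modality model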
For evaluation, run
srun python evaluate.py --features_path <path_to_features_directory> --model_type audio_video
Change the --model_type argument to audio or video to evaluate an audio-only or video-only model.
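For example, to evaluate the audio-only model:

srun python evaluate.py --features_path <path_to_features_directory> --model_type audio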
The evaluate script expects the trained model weights to be in the models/ directory.
The confusion matrices for the audio+video, audio-only, and video-only models are in the outs/ directory.