This repository contains the preprocessing, alignment, and augmentation workflows used for Audio-Visual Speech Recognition (AVSR) experiments built around AV-HuBERT.
The repo is centered on three working areas:
- timit_preperation for TCD-TIMIT preprocessing
- lrs3_preperation for LRS3 preprocessing
- augmentation for interpolation and smart blur experiments
Two conda environments are documented in this repository:
- augmentation/aligner_env.yml: minimal MFA-only environment for alignment runs.
- avsr_aug.yml: preparation + augmentation environment (notebook and script workflows).
Create and activate the preparation/augmentation environment from the project root:

    conda env create -f avsr_aug.yml
    conda activate avsr_aug

This environment is built from imports used in the preparation and augmentation workflows, including OpenCV, NumPy/Pandas/SciPy, SoundFile, tqdm, and MFA/TextGrid tooling.
Before running preprocessing or augmentation, make sure these external dependencies and dataset layouts are available.
- `ffmpeg` and `ffprobe` must be callable from your shell `PATH` (or configured explicitly in notebook/script cells).
- MFA requires the `aligner` conda env and model resources used by the alignment script.
- dlib landmark models are required for TCD-TIMIT landmark extraction in the preprocessing notebook.
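
A quick way to confirm the `ffmpeg`/`ffprobe` requirement before starting a long preprocessing run is to look the tools up on `PATH` from Python. This is a minimal sketch using only the standard library; the helper name `find_missing_tools` is hypothetical and not part of this repository.

```python
import shutil


def find_missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names in `tools` that cannot be resolved on the current PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("ffmpeg and ffprobe are available.")
```

If a tool is reported as missing, either add its install directory to `PATH` or set the explicit binary path in the notebook/script cell that calls it.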
Expected root-level data layouts used by the workflows:
- TCD-TIMIT root: `.../TCD_TIMIT/{volunteers|lipspeakers}/<speaker>/Clips/...`
- LRS3 root: `.../lrs3/{trainval,test,pretrain}/<speaker>/<utt>.mp4`
- LRS3 landmarks: `.../lrs3/landmark/<split>/<speaker>/<utt>.pkl`
- LRS3 TextGrid alignment output: next to each `.wav`/`.lab` pair under the selected split folder
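
The LRS3 layout above implies a fixed mapping from each video file to its landmark file. The sketch below spells that mapping out and lists videos whose landmark file is missing. It assumes the directory layout exactly as listed; the function names `landmark_path_for` and `videos_missing_landmarks` are hypothetical helpers, not functions from this repository.

```python
from pathlib import Path


def landmark_path_for(video_path, lrs3_root):
    """Map <root>/<split>/<speaker>/<utt>.mp4 to <root>/landmark/<split>/<speaker>/<utt>.pkl.

    Raises ValueError if the video is not exactly three levels below the root.
    """
    video_path, lrs3_root = Path(video_path), Path(lrs3_root)
    split, speaker, filename = video_path.relative_to(lrs3_root).parts
    return lrs3_root / "landmark" / split / speaker / Path(filename).with_suffix(".pkl")


def videos_missing_landmarks(lrs3_root, split):
    """Return the videos in one split that have no landmark file yet."""
    root = Path(lrs3_root)
    return [
        video
        for video in sorted((root / split).glob("*/*.mp4"))
        if not landmark_path_for(video, root).is_file()
    ]
```

Running `videos_missing_landmarks(root, "test")` before the crop stage gives a list of clips that still need landmark extraction.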
The usual path is: prepare a dataset, generate landmarks, crop a stable mouth ROI, then run augmentation on the prepared clips.
For the preprocessing stages, the media is first converted into model-friendly formats and then turned into mouth-centric inputs. The landmark step is what connects the full-face videos to the crop stage, and the crop stage is what produces the final AV-HuBERT-ready clips. The landmark visualization below shows the kind of point layout the cropper works from.
Once the base clips are ready, the augmentation notebooks introduce controlled temporal perturbations. Interpolation-based augmentation creates additional intermediate frames, while smart blur focuses on viseme or phoneme regions inside the clip.
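
To make the idea of intermediate frames concrete, the sketch below inserts linearly blended frames between each pair of consecutive frames. This is an illustration only: the notebooks in `augmentation` may use a different interpolation method, and the function name `interpolate_frames` and its `inserts_per_gap` parameter are hypothetical. Frames are represented as flat lists of pixel values to keep the example dependency-free; with NumPy arrays the same blend is `(1 - alpha) * a + alpha * b`.

```python
def interpolate_frames(frames, inserts_per_gap=1):
    """Return a new frame list with linearly blended frames inserted between neighbors.

    `frames` is a sequence of equally sized flat pixel sequences.
    """
    if inserts_per_gap < 0:
        raise ValueError("inserts_per_gap must be non-negative")
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(list(a))  # keep the original frame
        for k in range(1, inserts_per_gap + 1):
            alpha = k / (inserts_per_gap + 1)  # blend weight of the later frame
            out.append([(1 - alpha) * x + alpha * y for x, y in zip(a, b)])
    if frames:
        out.append(list(frames[-1]))  # the last frame has no successor
    return out
```

A clip of `n` frames becomes `n + (n - 1) * inserts_per_gap` frames, so the audio track or the playback rate has to be adjusted accordingly if the two streams must stay aligned.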
After interpolation, the pipeline moves from global temporal changes to spatially focused modifications. In practice, this means selecting the most speech-relevant area of each frame and applying targeted augmentation where it has the highest impact on lip articulation cues. The mouth-region view below illustrates how these localized regions are defined before smart blur is applied.
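
As a rough sketch of how such a localized region might be derived, the function below computes a padded bounding box around the mouth landmarks of one frame. It assumes the common 68-point facial landmark layout, in which indices 48 to 67 are the mouth points; whether the LRS3 `.pkl` landmarks in this repository follow that layout is an assumption, and `mouth_bbox` and its `margin` parameter are hypothetical names.

```python
import math


def mouth_bbox(landmarks, frame_width, frame_height, margin=0.2):
    """Return (x0, y0, x1, y1) around the mouth points, clipped to the frame.

    `landmarks` is a sequence of (x, y) points in the 68-point layout.
    x1 and y1 are exclusive, so frame[y0:y1, x0:x1] selects the region.
    """
    mouth = landmarks[48:68]  # assumed mouth indices in the 68-point layout
    if len(mouth) == 0:
        raise ValueError("expected at least 49 landmark points")
    xs = [p[0] for p in mouth]
    ys = [p[1] for p in mouth]
    pad_x = margin * (max(xs) - min(xs))  # padding proportional to mouth width
    pad_y = margin * (max(ys) - min(ys))  # padding proportional to mouth height
    x0 = max(0, math.floor(min(xs) - pad_x))
    y0 = max(0, math.floor(min(ys) - pad_y))
    x1 = min(frame_width, math.ceil(max(xs) + pad_x) + 1)
    y1 = min(frame_height, math.ceil(max(ys) + pad_y) + 1)
    return x0, y0, x1, y1
```

A blur could then be applied only inside that box on the frames that belong to the selected viseme or phoneme interval, leaving the rest of the frame untouched.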
If you are working on TCD-TIMIT, start in timit_preperation/README.md. If you are working on LRS3, start in lrs3_preperation/README.md. If you already have prepared clips and want to experiment with augmentation, go straight to augmentation/README.md.
- Run the relevant preprocessing notebook.
- Check that the landmark and crop outputs look correct.
- Run the augmentation notebooks on the prepared clips.
- Compare training or evaluation results.
The table below summarizes a small set of representative LRS3 results: the two control runs first (baseline and AV-HuBERT image augmentation), followed by selected top-performing interpolation and viseme blur configurations.
| Run Type | WER | Relative Improvement |
|---|---|---|
| No Augmentation | 4.11% | 0.00% |
| AV-HuBERT Image Augmentation | 4.02% | 2.09% |
| 200 ms Lag Interpolation | 3.76% | 8.48% |
| 100 ms Lag Interpolation | 3.88% | 5.60% |
| Viseme Blur Highest Visibility (B,C,D,E,F) - Word-Level | 3.83% | 6.76% |
| Low Visibility Viseme Blur (G,H,I,J,K) - Word-Level | 3.90% | 5.04% |
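
The source does not state how the Relative Improvement column is computed. A plausible reading, shown here as an assumption, is the relative WER reduction against the No Augmentation baseline. The helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline_wer, wer):
    """Relative WER reduction versus the baseline, in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - wer) / baseline_wer
```

Applied to the rounded WERs in the table, this gives values close to the listed ones (for example, about 5.60% for a WER of 3.88%), but not identical in every row. That would be consistent with the table having been computed from unrounded WERs, which is again an assumption.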
The repository also contains supporting scripts, notes, and diagrams outside these main workflow folders, but the three folders above are the intended entry points.


