| | |
|---|---|
| 🎓 Affiliation | MPhil student, University of Macau |
| 🔬 Research | Multimodal LLMs · Speech Models · Vision-Language |
| 🛠️ Stack | Python, PyTorch, HuggingFace, CUDA |
| 📍 Location | Macau, China |
I work on making multimodal models actually understand what they hear and see — not just pattern-match on text. Mostly this means fighting with audio tokenizers, writing data pipelines at 2am, and questioning my life choices when another training run diverges.
Currently thinking about: how to build better speech benchmarks, why audio LLMs underperform vision LLMs by so much, and whether we can close that gap with better discrete representations.
`Speech LLMs` · `Multimodal Benchmarking` · `Vision-Language Data` · `Audio Tokenization` · `Efficient Inference`
| Project | What it does |
|---|---|
| speech-star | Benchmark testing whether speech LLMs actually need the audio, or just its transcript (probe sketched below) |
| audiotoken-bridge | Framework for injecting discrete speech tokens into LLMs via LoRA finetuning (see the PEFT sketch below) |
| vl-caption-engine | Automated pipeline for generating and filtering vision-language instruction data (filtering sketch below) |
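The core probe behind speech-star, roughly: score the same question set under both input conditions and see whether the audio actually buys anything. A minimal sketch; `model_answer` is a hypothetical stand-in for whatever inference call a given speech LLM exposes, and the record fields are assumptions, not the benchmark's real schema.

```python
# Hypothetical sketch of the audio-vs-transcript probe, not the real
# speech-star harness. Each record is assumed to carry the audio, its
# transcript, a question, and a gold answer.

def accuracy(records, condition, model_answer):
    """Fraction of questions answered correctly under one input condition."""
    correct = 0
    for r in records:
        context = r["audio"] if condition == "audio" else r["transcript"]
        pred = model_answer(context, r["question"])
        correct += int(pred.strip().lower() == r["answer"].strip().lower())
    return correct / len(records)

# If these two numbers come out close, the model never needed to listen:
# acc_audio = accuracy(records, "audio", model_answer)
# acc_text  = accuracy(records, "transcript", model_answer)
```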
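The general shape of the audiotoken-bridge idea, sketched with HuggingFace `transformers` and `peft`. This is not the repo's actual code; the base model, token count, and target module names are assumptions:

```python
# Sketch: graft discrete speech tokens onto a text LLM, then finetune
# only LoRA adapters plus the new embedding rows. Assumed setup, not
# the actual audiotoken-bridge implementation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "Qwen/Qwen2-0.5B"  # placeholder base model
N_SPEECH = 1024           # e.g. one token per codec codebook entry

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# New tokens <speech_0> ... <speech_1023>, one per discrete speech unit.
tokenizer.add_tokens([f"<speech_{i}>" for i in range(N_SPEECH)])
model.resize_token_embeddings(len(tokenizer))

# Low-rank adapters on the attention projections; embeddings and LM head
# stay fully trainable so the new speech rows can actually learn.
cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, cfg)
model.print_trainable_parameters()
```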
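For the filtering half of vl-caption-engine, one common trick is CLIP-score gating: drop generated captions whose image-text similarity is too low. The sketch below shows that generic technique, not necessarily what the repo does, and the threshold is a placeholder:

```python
# Sketch: CLIP-score filter for generated captions. Generic technique,
# not necessarily vl-caption-engine's pipeline; 0.25 is an arbitrary cutoff.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_caption(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    """Keep a caption only if its CLIP similarity to the image clears the cutoff."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    sim = torch.cosine_similarity(out.image_embeds, out.text_embeds).item()
    return sim >= threshold
```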