Three years ago, I didn’t have any involvement in the Filipino NLP space. It just so happened that I was the only Filipino on the spaCy team, and yet there was no Tagalog language support in the library. Perhaps it was nice to represent: it felt like I was alone in a grocery aisle and spotted a cereal box on the floor. I’m already there, I’m quite capable, so why not pick it up?
Fortunately, I realized that I wasn’t alone. I still remember cold-emailing Blaise and sharing the initial results of an NER pipeline (which soon became a core component of calamanCy). Through him, I got connected with Joseph, whose work I’d been reading for quite a while. I still remember us meeting in person back in 2023, and I guess the rest is history!
Fast-forward to 2025: across many life events and other collaborations, I also met Ely and Conner through cold emails. We started with an annotation project that didn’t really materialize, but we pivoted to a more impactful project on language model evaluation. That project soon turned into FilBench, and together with Joseph and Blaise, it was published in the EMNLP main conference!
I know it’s confusing to call both the evaluation benchmark and this collective FilBench, though I find it quite apt (and fortuitous, if I may add!). Back in my undergrad, I remember the idea of “org benches” where people would hang out to do work. There was something special about having a designated bench where people could gather, plan, and collaborate. That’s what I think FilBench represents: a bench for Filipino NLP researchers to call home.
In my FilBench-Eval blog post, I mentioned that hopefully next time there would be more of us. I’m happy to see it come to fruition: our first meeting to plan our future projects had a nice turnout! Who would’ve thought that in just three years, there would be more than ten (!) people willing to collaborate with us? We also have a very chill Discord group with other Filipino NLP researchers.
This collective is not a non-profit, a for-profit, or a research institute. We are PhD students, software engineers, and enthusiasts who share an interest in improving the state of Filipino NLP through open research, tools, and datasets. I’ve seen grassroots efforts work well in other language communities such as Masakhane, SEACrowd, and IndoNLP, and I believe it’s possible to do the same here. I hope the momentum continues.
Kaya, tara! (So, come on!) Come sit with us at the FilBench!
In this work, we plan to create a language model benchmark by annotating specific examples from other benchmarks such as XCOPA, XStoryCloze, Belebele, and MMLU.
We want to start in late December and finish in early 2026, just in time for the May ARR cycle for EMNLP.
| Document | Description |
|---|---|
| Project Introduction | Overview of the project and its goals. |
| Project Journal | Running log of project progress. |
| Google Drive | Shared folder containing other materials. |
We’re specifically looking for the following contributors:
Since most foundation model providers rely on open-source datasets to train their models, we can indirectly influence their development pipelines by contributing a high-quality instruction-tuning dataset to the open ecosystem. This effort also paves the way for training our own Filipino-centric language models.
Specifically, we will curate a high-quality instruction-tuning dataset for the four to six (4-6) most widely spoken Philippine languages: Tagalog, Bisaya, Hiligaynon, Ilokano, Cebuano, and Bikolano.
Ultimately, we want to answer the following question:
How does post-training data composition (synthetic, human-annotated, or web-crawled) affect LLM performance on FilBench under a low annotation budget?
By doing so, we aim to explore the following aspects:
Data sourcing and composition: Where can we find high-quality instruction data for Philippine languages? Should we prioritize synthetic generation, community platforms like Reddit, existing datasets like Aya, or a combination of these sources? What is the optimal mix of data sources to maximize quality and diversity? (See the sketch after this list for one way to start probing this.)
Data efficiency: How much instruction-tuning data is needed to achieve strong performance on Filipino NLP benchmarks such as FilBench? Can we identify diminishing returns to guide efficient data collection efforts?
Task relevance: Which tasks and capabilities are most valuable for Filipino-centric use cases? How can we ensure our instruction dataset covers the linguistic and cultural nuances that matter most to Filipino language users?
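To make the data-sourcing question a bit more concrete, here is a minimal sketch of how one might pull candidate Philippine-language examples from an existing open dataset such as Aya using the Hugging Face `datasets` library. The repository id, the `language` column, and which Philippine languages actually appear in Aya are all assumptions to be verified, not a description of our final pipeline.

```python
# Minimal sketch (assumptions: the "CohereForAI/aya_dataset" repo id and its
# "language" column; only some of the languages below may actually be present).
from collections import Counter

from datasets import load_dataset

PH_LANGUAGES = {"Tagalog", "Cebuano", "Hiligaynon", "Ilocano", "Bikol"}

aya = load_dataset("CohereForAI/aya_dataset", split="train")
ph_subset = aya.filter(lambda row: row["language"] in PH_LANGUAGES)

# Rough sense of how much instruction data each language contributes.
print(Counter(ph_subset["language"]))
```

A similar pass over synthetic or web-crawled sources would let us compare data mixes under the same annotation budget.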
We have a loose timeline, but we plan to officially start the project in mid-2026. We don’t have a publication target yet (likely sometime in 2027); most of our findings will be shared through a technical report on arXiv. If you’re interested in contributing or collaborating, you can read the full research brief (and other materials) in the links below:
| Document | Description |
|---|---|
| Research Brief | Research proposal. |
| Project Journal | Running log of project progress. |
| Google Drive | Shared folder containing other materials. |
Reach out to me (Lj) first! Although the official project will start in mid-2026, I plan to run some experiments as early as January. During that time, this will be a smaller effort compared to the benchmark project, and I prefer a more focused team of three people, including me. I’m also happy to receive support in the form of compute credits and grants (if you know of any, please point them our way)!