Three years ago, I didn’t have any involvement in the Filipino NLP space. It just so happened that I was the only Filipino on the spaCy team, and yet there was no Tagalog language support in the library. Perhaps it was nice to represent: it felt like I was alone in a grocery aisle and spotted a cereal box on the floor. I’m already there, I’m quite capable, so why not pick it up?
Fortunately, I realized that I wasn’t alone. I still remember cold-emailing Blaise and sharing the initial results of an NER pipeline (which soon became a core component of calamanCy). Through him, I got connected with Joseph, whose work I’d been reading for quite a while. I still remember us meeting in person back in 2023, and I guess the rest is history!
Fast-forward to 2025: across many life events and other collaborations, I also met Ely and Conner through cold emails. We started with an annotation project that didn’t really materialize, but we pivoted to a more impactful project on language model evaluation. That project soon turned into FilBench, and together with Joseph and Blaise, it was published in the EMNLP main conference!
I know it’s confusing to call both the evaluation benchmark and this collective FilBench, though I find it quite apt (and fortuitous, if I may add!). Back in my undergrad, I remember the idea of “org benches” where people would hang out to do work. There was something special about having a designated bench where people could gather, plan, and collaborate. That’s what I think FilBench represents: a bench for Filipino NLP researchers to call home.
In my FilBench-Eval blog post, I mentioned that hopefully next time there would be more of us. I’m happy to see it come to fruition: our first meeting to plan our future projects had a nice turnout! Who would’ve thought that in just three years, there would be more than ten (!) people willing to collaborate with us? We also have a very chill Discord group with other Filipino NLP researchers.
This collective is not a non-profit, a for-profit, or a research institute. We are PhD students, software engineers, and enthusiasts who share an interest in improving the state of Filipino NLP through open research, tools, and datasets. I’ve seen grassroots efforts work well in other language communities such as Masakhane, SEACrowd, and IndoNLP, and I believe it’s possible to do the same here. I hope the momentum continues.
Kaya, tara! (So, come on!) Come sit with us at the FilBench!
In this work, we plan to create a language model benchmark by annotating specific examples from other benchmarks such as XCOPA, XStoryCloze, Belebele, and MMLU.
We want to start in late December and finish in early 2026, just in time for the May ARR cycle for EMNLP.
| Document | Description |
|---|---|
| Project Introduction | Overview of the project and its goals. |
| Project Journal | Running log of project progress. |
| Google Drive | Shared folder containing other materials. |
We’re specifically looking for the following contributors:
Since most foundation model providers rely on open-source datasets to train their models, we can indirectly influence their development pipelines by contributing a high-quality instruction-tuning dataset to the open ecosystem. This effort also paves the way for training our own Filipino-centric language models.
Specifically, we will curate a high-quality instruction-tuning dataset for the four to six (4-6) most widely spoken Philippine languages: Tagalog, Bisaya, Hiligaynon, Ilokano, Cebuano, and Bikolano.
Ultimately, we want to answer the following question:
How does post-training data composition (synthetic, human-annotated, or web-crawled) affect LLM performance on FilBench under a low annotation budget?
By doing so, we aim to explore the following aspects:
Data sourcing and composition: Where can we find high-quality instruction data for Philippine languages? Should we prioritize synthetic generation, community platforms like Reddit, existing datasets like Aya, or a combination of these sources? What is the optimal mix of data sources to maximize quality and diversity? (See the sketch after this list for one way to start probing this.)
Data efficiency: How much instruction-tuning data is needed to achieve strong performance on Filipino NLP benchmarks such as FilBench? Can we identify diminishing returns to guide efficient data collection efforts?
Task relevance: Which tasks and capabilities are most valuable for Filipino-centric use cases? How can we ensure our instruction dataset covers the linguistic and cultural nuances that matter most to Filipino language users?
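To make the data-sourcing question a bit more concrete, here is a minimal sketch of how one might pull candidate Philippine-language examples from an existing open dataset such as Aya using the Hugging Face `datasets` library. The repository id, the `language` column, and which Philippine languages actually appear in Aya are all assumptions to be verified, not a description of our final pipeline.

```python
# Minimal sketch (assumptions: the "CohereForAI/aya_dataset" repo id and its
# "language" column; only some of the languages below may actually be present).
from collections import Counter

from datasets import load_dataset

PH_LANGUAGES = {"Tagalog", "Cebuano", "Hiligaynon", "Ilocano", "Bikol"}

aya = load_dataset("CohereForAI/aya_dataset", split="train")
ph_subset = aya.filter(lambda row: row["language"] in PH_LANGUAGES)

# Rough sense of how much instruction data each language contributes.
print(Counter(ph_subset["language"]))
```

A similar pass over synthetic or web-crawled sources would let us compare data mixes under the same annotation budget.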
We have a loose timeline, but we plan to officially start the project in mid-2026. We don’t have a publication target yet (likely sometime in 2027); most of our findings will be shared through a technical report on arXiv. If you’re interested in contributing or collaborating, you can read the full research brief (and other materials) in the links below:
| Document | Description |
|---|---|
| Research Brief | Research proposal. |
| Project Journal | Running log of project progress. |
| Google Drive | Shared folder containing other materials. |
Reach out to me (Lj) first! Although the official project will start in mid-2026, I plan to run some experiments as early as January. During that time, this will be a smaller effort compared to the benchmark project, and I prefer a more focused team of three people, including me. I’m also happy to receive support in the form of compute credits and grants (if you know of any, please point them our way)!