This dataset is a hybrid collection of 6,660 samples (human-generated + synthetic data) specifically curated to fine-tune CLAIRE (Conversational Language AI for Resolution & Engagement).
CLAIRE is an intelligent banking assistant designed for secure, inclusive, and real-time customer support in the Philippine financial sector. This dataset and the development of CLAIRE are part of the BPI Datawave 2025.
🔗 Project Repository: PROJECT-CLAIRE
The core knowledge base is sourced from: BPI Help Center - General Questions
The dataset structure is inspired by the RAFT (Retrieval-Augmented Fine-Tuning) methodology (arXiv:2403.10131). RAFT trains models to discern relevant information from "noise" and reduces the likelihood of hallucinations in specialized domains.
- Question (Q): The user's query or the question CLAIRE is being trained to resolve.
- Documents (D): A set of 4 documents retrieved from the knowledge base:
- Golden Documents (D*): Factual information required to correctly answer the question.
- Distractor Documents (Dk): Irrelevant documents used to teach the model to ignore non-pertinent information.
- Note: The position of the Golden Document is randomized within each set of four to prevent positional bias.
- Answer (A): A factual response grounded in the Golden Documents, reflecting both the correct information and the appropriate empathetic tone.
Following RAFT guidelines, we utilize a strategic split to balance "open-book" and "closed-book" learning:
- P = 0.8 (The Open-Book Scenario): 80% of the data includes a golden document. CLAIRE learns to search through the context, identify the relevant fact, and formulate a well-grounded answer.
- 1-P = 0.2 (The Closed-Book Challenge): 20% of the data contains only distractors. This forces CLAIRE to recognize when the answer is not present. Instead of hallucinating, CLAIRE learns to identify that the context is insufficient and responds with a fallback or escalation message.
| Metric | Value |
|---|---|
| Total Samples | 6,660 |
| Category | general_questions |
| Avg. Documents per Sample | 4.00 |
To ensure inclusivity in the Philippine market, the dataset is perfectly balanced across three linguistic styles:
- English: 2,220 (33.3%)
- Tagalog: 2,220 (33.3%)
- Taglish: 2,220 (33.3%)
CLAIRE is trained to adapt its tone based on the user's emotional state. Each language contains exactly 370 samples per emotion:
- Neutral: 1,110 (16.7%)
- Urgent: 1,110 (16.7%)
- Grateful: 1,110 (16.7%)
- Frustrated: 1,110 (16.7%)
- Confused: 1,110 (16.7%)
- Worried: 1,110 (16.7%)
- Samples with 1 'Golden' Document: 5,328 (80.0%)
- Samples with 'Distractor' Documents Only: 1,332 (20.0%)
This dataset is designed for fine-tuning CLAIRE to achieve:
- Contextual Precision: High accuracy in filtering out irrelevant noise.
- Linguistic Inclusivity: Seamless transitions between English, Tagalog, and Taglish.
- Emotional Resonance: Tailored responses that de-escalate frustration and provide comfort to worried users.
- Operational Trust: Significant reduction in hallucinations through the 20% "Closed-Book" training.
Developed by Team MangTomas under Track 3 of BPI Datawave 2025.