Overview

This dataset is a hybrid collection of 6,660 samples, combining human-generated and synthetic data, curated to fine-tune CLAIRE (Conversational Language AI for Resolution & Engagement).

CLAIRE is an intelligent banking assistant designed for secure, inclusive, and real-time customer support in the Philippine financial sector. This dataset and the development of CLAIRE are part of BPI Datawave 2025.

🔗 Project Repository: PROJECT-CLAIRE

The core knowledge base is sourced from: BPI Help Center - General Questions

Methodology: RAFT Inspiration

The dataset structure is inspired by the RAFT (Retrieval-Augmented Fine-Tuning) methodology (arXiv:2403.10131). RAFT trains models to discern relevant information from "noise" and reduces the likelihood of hallucinations in specialized domains.

Key Components: Q, D, A

Question (Q): The user's query or the question CLAIRE is being trained to resolve.
Documents (D): A set of 4 documents retrieved from the knowledge base:
- Golden Documents (D*): Factual information required to correctly answer the question.
- Distractor Documents (Dk): Irrelevant documents used to teach the model to ignore non-pertinent information.
- Note: The position of the Golden Document is randomized within each set of four to prevent positional bias.
Answer (A): A factual response grounded in the Golden Documents, reflecting both the correct information and the appropriate empathetic tone.
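As a minimal sketch of how one Q, D, A sample could be assembled, the snippet below places the golden document at a random position among three distractors. The `RaftSample` class, its field names, and the `build_sample` helper are hypothetical and are not taken from the dataset's actual JSON schema.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RaftSample:
    # Hypothetical field names; the real dataset schema may differ.
    question: str
    documents: List[str]
    answer: str
    golden_index: Optional[int]  # None when the set holds only distractors


def build_sample(question, answer, golden, distractors, rng, k=4):
    """Assemble k documents, inserting the golden one at a random position."""
    if golden is None:
        # Distractor-only sample: draw all k documents from the distractor pool.
        return RaftSample(question, rng.sample(distractors, k), answer, None)
    docs = rng.sample(distractors, k - 1)
    position = rng.randrange(k)  # uniform over 0..k-1 to avoid positional bias
    docs.insert(position, golden)
    return RaftSample(question, docs, answer, position)


rng = random.Random(42)
sample = build_sample(
    question="How do I reset my online banking password?",
    answer="Placeholder answer grounded in the golden document.",
    golden="GOLDEN: placeholder help-center passage",
    distractors=["distractor A", "distractor B", "distractor C", "distractor D"],
    rng=rng,
)
print(len(sample.documents), sample.documents[sample.golden_index])
```

Storing the golden position alongside the documents is a design choice for this sketch only; it makes it easy to audit that positions are spread evenly across the four slots.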

The 1-P Ratio (P = 0.8)

Following RAFT guidelines, we use a fixed split to balance "open-book" and "closed-book" learning:

P = 0.8 (The Open-Book Scenario): 80% of the data includes a golden document. CLAIRE learns to search through the context, identify the relevant fact, and formulate a well-grounded answer.
1-P = 0.2 (The Closed-Book Challenge): 20% of the data contains only distractors. This forces CLAIRE to recognize when the answer is not present. Instead of hallucinating, CLAIRE learns to identify that the context is insufficient and responds with a fallback or escalation message.
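The split can be illustrated with a short sketch that marks which samples receive a golden document. The function name `assign_open_book` and the fixed seed are assumptions made for illustration; the source does not describe how the assignment was actually performed.

```python
import random


def assign_open_book(n_samples, p=0.8, seed=0):
    """Return shuffled flags: True means the sample includes the golden document."""
    n_open = round(n_samples * p)
    flags = [True] * n_open + [False] * (n_samples - n_open)
    random.Random(seed).shuffle(flags)  # shuffle so the two groups are interleaved
    return flags


flags = assign_open_book(6660)
print(sum(flags), len(flags) - sum(flags))  # 5328 1332
```

Fixing the counts first and then shuffling gives an exact 80/20 split, whereas drawing each sample independently with probability 0.8 would only approximate it.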

Dataset Analytics

1. General Statistics

Metric	Value
Total Samples	6,660
Category	`general_questions`
Avg. Documents per Sample	4.00

2. Linguistic Distribution

To support inclusivity in the Philippine market, the dataset is evenly balanced across three linguistic styles:

English: 2,220 (33.3%)
Tagalog: 2,220 (33.3%)
Taglish: 2,220 (33.3%)

3. Emotional Intelligence (EQ) Distribution

CLAIRE is trained to adapt its tone based on the user's emotional state. Each language contains exactly 370 samples per emotion, which gives 1,110 samples per emotion across the three languages:

Neutral: 1,110 (16.7%)
Urgent: 1,110 (16.7%)
Grateful: 1,110 (16.7%)
Frustrated: 1,110 (16.7%)
Confused: 1,110 (16.7%)
Worried: 1,110 (16.7%)
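The language and emotion figures are mutually consistent, which a few lines of arithmetic can confirm. This sketch only recomputes the totals from the stated 370 samples per language and emotion pair; it does not read the dataset file, and the variable names are illustrative.

```python
from collections import Counter
from itertools import product

LANGUAGES = ["English", "Tagalog", "Taglish"]
EMOTIONS = ["Neutral", "Urgent", "Grateful", "Frustrated", "Confused", "Worried"]
PER_CELL = 370  # samples per (language, emotion) pair, as stated above

per_language = Counter()
per_emotion = Counter()
for language, emotion in product(LANGUAGES, EMOTIONS):
    per_language[language] += PER_CELL
    per_emotion[emotion] += PER_CELL

total = sum(per_language.values())
print(total)                     # 6660
print(per_language["Taglish"])   # 2220
print(per_emotion["Worried"])    # 1110
```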

4. Document Relevance (RAFT Split)

Samples with 1 'Golden' Document: 5,328 (80.0%)
Samples with 'Distractor' Documents Only: 1,332 (20.0%)
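A check like the following could be used to audit the split on loaded records. It assumes each record carries a hypothetical `golden_index` field that is `None` for distractor-only samples; the published JSON may encode this differently, so the key would need to be adapted.

```python
def raft_split(records):
    """Count records with a golden document versus distractors only."""
    with_golden = sum(1 for r in records if r["golden_index"] is not None)
    only_distractors = len(records) - with_golden
    share = with_golden / len(records) if records else 0.0
    return with_golden, only_distractors, share


# Toy records standing in for the real data.
toy = [{"golden_index": 2}] * 4 + [{"golden_index": None}]
print(raft_split(toy))  # (4, 1, 0.8)
```

On the full dataset, the expected result under the figures above would be 5,328 and 1,332 with a share of 0.8.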

Technical Use Case

This dataset is designed for fine-tuning CLAIRE to achieve:

Contextual Precision: High accuracy in filtering out irrelevant noise.
Linguistic Inclusivity: Seamless transitions between English, Tagalog, and Taglish.
Emotional Resonance: Tailored responses that de-escalate frustration and provide comfort to worried users.
Operational Trust: Fewer hallucinations, targeted through the 20% "Closed-Book" samples that teach CLAIRE to fall back or escalate when the context is insufficient.

Developed by Team MangTomas under Track 3 of BPI Datawave 2025.

