OasisSimp: An Open-source Asian–English Sentence Simplification Dataset

Hannah Liu¹* Muxin Tian¹* Iqra Ali² Haonan Gao³ Qiaoyiwen Wu¹
Blair Yang⁴ Uthayasanker Thayasivam⁵ En-Shiun Annie Lee¹ Pakawat Nakwijit²
Surangika Ranathunga⁷ Ravi Shekhar⁸

¹ University of Toronto ² Queen Mary University of London ³ Yale University ⁴ CoolWei AI Lab
⁵ University of Moratuwa ⁶ Ontario Tech University ⁷ Massey University ⁸ University of Essex
* Equal Contribution

Long paper, oral presentation at LREC 2026 (paper, dataset)

OasisSimp Dataset Overview

Abstract

Sentence simplification aims to make complex text more accessible by reducing linguistic complexity while preserving the original meaning. However, progress in this area remains limited for mid-resource and low-resource languages due to the scarcity of high-quality data. To address this gap, we introduce the OasisSimp dataset, a multilingual dataset for sentence-level simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai. Among these, no prior sentence simplification datasets exist for Thai, Pashto, and Tamil, while limited data is available for Sinhala. Each language simplification dataset was created by trained annotators who followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness. We evaluate eight open-weight multilingual Large Language Models (LLMs) on the OasisSimp dataset and observe substantial performance disparities between high-resource and low-resource languages, highlighting the simplification challenges in multilingual settings. The OasisSimp dataset thus provides both a valuable multilingual resource and a challenging benchmark, revealing the limitations of current LLM-based simplification methods and paving the way for future research in low-resource sentence simplification. The dataset is available at https://OasisSimpDataset.github.io/.

Dataset

We are releasing the version of the OasisSimp dataset used in our LREC'26 work for others to use:

OasisSimp Dataset: here
Each language has its own folder containing two JSONL files, corresponding to the validation and test sets.

The dataset schema is as follows:

        {
          "id": "Unique identifier for the segment",
          "complex": "Complex sentence",
          "simple": "List of simplified versions of the complex sentence"
        }
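
As a minimal sketch of how records in this schema might be read, the Python snippet below parses a JSONL file line by line and flattens each record into one (id, complex, simple) triple per simplification. It assumes that "simple" holds a JSON list of strings, as the schema describes; the helper names read_jsonl and to_pairs are illustrative and not part of any released tooling.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def to_pairs(records):
    """Flatten records into (id, complex, simple) triples.

    Assumes each record's "simple" field is a list of strings,
    one entry per simplified version of the complex sentence.
    """
    pairs = []
    for record in records:
        for simple in record["simple"]:
            pairs.append((record["id"], record["complex"], simple))
    return pairs
```

A call such as to_pairs(read_jsonl("thai/test.jsonl")) would then yield one triple per simplification; the path shown is hypothetical, since the exact folder and file names are those in the downloaded dataset.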

For any clarification, contact the OasisSimp Team and Ravi.

Citation

If you use the OasisSimp dataset in your work, please consider citing our LREC'26 paper (download bibtex):

@inproceedings{liu-etal-2026-oasissimp,
title = {{OasisSimp: An Open-source Asian-English Sentence Simplification Dataset}},
author = "Liu, Hannah and Tian, Muxin and Ali, Iqra and Gao, Haonan and Wu, Qiaoyiwen and Yang, Blair and Thayasivam, Uthayasanker and Lee, En-Shiun Annie and Nakwijit, Pakawat and Ranathunga, Surangika and Shekhar, Ravi",
booktitle = "Proceedings of the Fifteenth biennial Language Resources and Evaluation Conference (LREC 2026)",
month = "May",
year = "2026"
}

Acknowledgement

This work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant RGPIN-2024-06887, the NSERC Discovery Launch Supplement DGECR-2024-00008, and the Digital Research Alliance of Canada (formerly Compute Canada) Grant RRG no. 5397 on "Multilingual multicultural NLP and LLMs". We also thank CoolWei AI Lab for providing GPU resources that enabled this research. This work was partially supported by the ELOQUENCE project (grant number 101070558) funded by the UKRI and the European Union. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the UKRI, European Union or European Commission-EU. Neither the European Union nor the granting authority can be held responsible for them.