OpenSyntheticCC

Introduction

OpenSyntheticCC is a repository for fine-tuning language models on synthetic Chain-of-Thought (CoT) and code datasets. It provides scripts and configurations for distributed training with torchrun and DeepSpeed, and supports large-scale supervised fine-tuning (SFT).

Features

  • Fine-tuning on synthetic CoT and code datasets
  • Distributed training with DeepSpeed and torchrun
  • Customizable training parameters via shell scripts
  • Data collation and tokenization for instruction-following tasks (see the sketch after this list)
  • Example scripts for quick start
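
The data collation mentioned above typically masks the instruction tokens out of the loss so that training only supervises the response. The snippet below is a generic sketch of that idea using a Hugging Face tokenizer; it is not the exact implementation in finetune.py, and the model name is only a placeholder.

    # Generic sketch of instruction-masked collation (not the repository's exact code).
    from transformers import AutoTokenizer

    IGNORE_INDEX = -100  # labels set to this value are excluded from the loss

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

    def build_example(instruction: str, response: str):
        # Tokenize prompt and response separately so we know where the prompt ends.
        prompt_ids = tokenizer(instruction, add_special_tokens=False)["input_ids"]
        response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
        response_ids = response_ids + [tokenizer.eos_token_id]

        input_ids = prompt_ids + response_ids
        # Mask the instruction so the loss is computed only on the response tokens.
        labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
        return {"input_ids": input_ids, "labels": labels}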

Installation

  1. Clone the repository:
    git clone https://github.com/richardodliu/OpenSyntheticCC.git
    cd OpenSyntheticCC
  2. Install dependencies:
    pip install -r requirements.txt

Usage

1. Prepare your dataset

  • Format: JSONL; each line should contain instruction and response fields (see the example below).
  • Note: For privacy reasons, the dataset is not open-sourced at this time. Scripts for generating synthetic datasets will be released in the future.
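
A single line of such a file might look like the following. The content is purely illustrative; only the instruction and response field names come from the format described above.

    {"instruction": "Write a Python function that returns the sum of a list of numbers.", "response": "def list_sum(xs):\n    return sum(xs)"}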

2. Fine-tune a model

  • Edit sft.sh to set your model path, data path, and output directory.
  • Run the script:
    bash sft.sh
  • The script uses torchrun and DeepSpeed for distributed training. Training parameters (batch size, learning rate, etc.) can be modified in sft.sh.
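
The exact contents of sft.sh depend on your environment; the sketch below only illustrates a typical torchrun launch of finetune.py with a DeepSpeed configuration. Apart from --model_name_or_path, --data_path, and --output_dir, which appear in the Custom Training section below, the flag names, hyperparameter values, and GPU count are assumptions for illustration.

    # Illustrative launch command -- adapt paths, flags, and world size to your setup.
    torchrun --nnodes=1 --nproc_per_node=8 finetune.py \
        --model_name_or_path <MODEL_PATH> \
        --data_path <DATA_PATH> \
        --output_dir <OUTPUT_DIR> \
        --deepspeed deepspeed.json \
        --num_train_epochs 3 \
        --per_device_train_batch_size 4 \
        --learning_rate 2e-5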

3. Custom Training

  • You can also run finetune.py directly:
    python finetune.py --model_name_or_path <MODEL_PATH> --data_path <DATA_PATH> --output_dir <OUTPUT_DIR> ...
  • See sft.sh for a full example of arguments.

Distributed Training

  • DeepSpeed configuration is provided in deepspeed.json.
  • The script supports multi-node and multi-GPU training.
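
The actual deepspeed.json in this repository may differ; the following is only a minimal illustrative ZeRO stage-2 configuration of the kind commonly used for supervised fine-tuning, with assumed values.

    {
      "train_micro_batch_size_per_gpu": 4,
      "gradient_accumulation_steps": 1,
      "gradient_clipping": 1.0,
      "bf16": { "enabled": true },
      "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true
      }
    }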

File Overview

  • finetune.py: Main training script for supervised fine-tuning.
  • sft.sh: Example shell script for distributed training.
  • deepspeed.json: DeepSpeed configuration for efficient large model training.
  • git.sh: Helper script for quick git add/commit/push.
  • .gitignore: Ignores logs, archives, and Java-related files.

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

This project is licensed under the MIT License.
