Install the dependencies with pip install -r requirements.txt
Note: This requirements.txt originates from Stanford Alpaca. If you are using a different code base that already has PyTorch installed, you do not need to install from requirements.txt; we recommend manually installing the packages below instead:
pip install tqdm
pip install scikit-learn
- Select Pre-Experienced Data
python3 test_selection/data_analysis.py \
--data_path data/code_alpaca_2k.json \
--save_path alpaca_data_pre.pt \
--model_name_or_path microsoft/Phi-3-mini-4k-instruct \
--max_length 512 \
--prompt alpaca \
--mod pre
--data_path: The targeted dataset in the Alpaca format
--save_path: The path to save the .pt file containing embeddings or scores
--prompt: The prompt type used for training and selecting data; either alpaca or wiz
--mod: Set to pre to compute the embeddings or scores needed for selecting pre-experienced samples
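Before running the analysis script, it can help to confirm that the dataset really is in the Alpaca format. The sketch below assumes the common three-field layout from Stanford Alpaca (`instruction`, `input`, `output`, with `input` allowed to be an empty string). The helper names `is_alpaca_record` and `check_alpaca_file` are hypothetical and are not part of this repository.

```python
import json

# Keys assumed from the Stanford Alpaca data layout; "input" may be "".
ALPACA_KEYS = ("instruction", "input", "output")


def is_alpaca_record(record):
    """Return True if record is a dict whose Alpaca keys all hold strings."""
    return isinstance(record, dict) and all(
        isinstance(record.get(key), str) for key in ALPACA_KEYS
    )


def check_alpaca_file(path):
    """Load a JSON list and return the indices of records that fail the check."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [i for i, record in enumerate(data) if not is_alpaca_record(record)]


# Hypothetical record for illustration, not taken from data/code_alpaca_2k.json.
example = {
    "instruction": "Write a function that reverses a string.",
    "input": "",
    "output": "def reverse(s):\n    return s[::-1]",
}
print(is_alpaca_record(example))
```

Calling `check_alpaca_file("data/code_alpaca_2k.json")` would return an empty list if every record matches this assumed layout.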
python3 test_selection/data_by_cluster.py \
--pt_data_path alpaca_data_pre.pt \
--json_data_path data/code_alpaca_2k.json \
--json_save_path alpaca_data_pre.json \
--sample_num 10 \
--kmeans_num_clusters 100 \
--low_th 25 \
--up_th 75
--pt_data_path: The .pt file from previous step containing needed embeddings or scores
--json_data_path: The targeted dataset in the Alpaca format
--json_save_path: The path to save the selected pre-experienced samples
--sample_num: How many samples will be selected in each cluster
--kmeans_num_clusters: How many clusters will be generated by K-Means
--low_th and --up_th: The lower and upper thresholds for selecting samples within each cluster
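The selection step described by these arguments can be sketched roughly as follows. This is not the code in data_by_cluster.py: it assumes that --low_th and --up_th are percentiles of a per-sample score within each cluster, and that cluster labels have already been produced (for example by scikit-learn's `KMeans` with `n_clusters` set to the --kmeans_num_clusters value). The function names and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def select_per_cluster(labels, scores, sample_num, low_th=25, up_th=75, seed=0):
    """Pick up to sample_num indices per cluster from the middle score band."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, label in enumerate(labels):
        by_cluster[label].append(idx)

    selected = []
    for label in sorted(by_cluster):
        members = by_cluster[label]
        # Assumption: small clusters are kept whole rather than filtered.
        if len(members) <= sample_num:
            selected.extend(members)
            continue
        cluster_scores = [scores[i] for i in members]
        low = percentile(cluster_scores, low_th)
        high = percentile(cluster_scores, up_th)
        middle = [i for i in members if low <= scores[i] <= high]
        selected.extend(rng.sample(middle, min(sample_num, len(middle))))
    return selected


# Toy data: cluster 0 has eight samples, cluster 1 has two.
labels = [0] * 8 + [1] * 2
scores = [1, 2, 3, 4, 5, 6, 7, 8, 0.5, 0.9]
picked = select_per_cluster(labels, scores, sample_num=2)
print(picked)
```

Under these assumptions, samples with very low or very high scores inside a cluster are skipped, and the random draw is what keeps the per-cluster count at --sample_num.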
- Train Pre-Experienced Model
- Select Cherry Data
python3 test_selection/data_analysis.py \
--data_path data/code_alpaca_2k.json \
--save_path alpaca_data_cherry.pt \
--model_name_or_path microsoft/Phi-3-mini-4k-instruct \
--max_length 512 \
--prompt alpaca \
--mod cherry
python3 test_selection/data_by_IFD.py \
--pt_data_path alpaca_data_cherry.pt \
--json_data_path data/code_alpaca_2k.json \
--json_save_path alpaca_data_cherry.json \
--max_length 512 \
--sample_rate 0.06 \
--prompt alpaca
--sample_rate: The fraction of the dataset to select as cherry samples (0.06 selects 6%). Alternatively, use --sample_number to set the exact number of samples.
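As a rough illustration of how a rate or an exact count could translate into a selection, the sketch below resolves the number of samples and keeps the highest-scoring ones. It is an assumption-laden sketch, not the logic of data_by_IFD.py: it assumes that --sample_number takes precedence when given, that the rate is truncated to an integer count, and that higher scores are preferred. Both function names are hypothetical.

```python
def resolve_sample_count(dataset_size, sample_rate=None, sample_number=None):
    """Turn a rate or an exact number into a sample count."""
    # Assumption: an explicit number wins over the rate when both are given.
    if sample_number is not None:
        return min(sample_number, dataset_size)
    if sample_rate is None:
        raise ValueError("provide sample_rate or sample_number")
    return int(dataset_size * sample_rate)


def top_scored(scores, count):
    """Return the indices of the count highest scores, best first."""
    # Assumption: a higher score marks a more desirable sample.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:count]


# Toy scores for illustration only.
scores = [0.2, 0.9, 0.5, 0.7]
count = resolve_sample_count(len(scores), sample_rate=0.5)
print(top_scored(scores, count))
```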
- Train Cherry Model
- Barry's Machine
The Lenovo Legion Pro 5i laptop has a mobile RTX 4070 from 2023. It has an Intel Core i9 with 24 cores (32 threads) and 32 GB RAM. The CPU clock speed is 4.1 GHz, up to 5.1 GHz; the exact figure varies because the chip mixes E-cores and P-cores.
GPU: 8 GB memory, 0.2 TFLOPS at 64-bit precision, 15.6 TFLOPS at 16-bit precision.
| Dataset | Pre-training Time | Re-training Time |
|---|---|---|
| Alpaca_2k | 47 min | 1 hr 30 min |
| Alpaca_20k | 8 hr | |