deeptiGuptaUIUC/testLLM

Install

Install the dependencies with pip install -r requirements.txt

Note: This requirements.txt originates from Stanford Alpaca. If you are using a different code base that already has PyTorch installed, we recommend manually installing the packages below instead of installing from requirements.txt.

pip install tqdm

pip install scikit-learn

Run Code

  1. Select Pre-Experienced Data
python3 test_selection/data_analysis.py \
    --data_path data/code_alpaca_2k.json \
    --save_path alpaca_data_pre.pt \
    --model_name_or_path microsoft/Phi-3-mini-4k-instruct \
    --max_length 512 \
    --prompt alpaca \
    --mod pre

--data_path: The targeted dataset in the Alpaca format
--save_path: The path to save the .pt file containing embeddings or scores
--prompt: The prompt type used for training and selecting data; choose either alpaca or wiz
--mod: Use pre to compute the embeddings or scores needed for selecting pre-experienced samples
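
As a quick sanity check before running the script, the sketch below validates records in the Alpaca format and builds the prompt text for one of them. The templates are the ones published with Stanford Alpaca; whether data_analysis.py uses exactly these strings for --prompt alpaca is an assumption, and check_records and build_alpaca_prompt are hypothetical helper names, not functions from this repository.

```python
import json

# Stanford Alpaca prompt templates (with and without an "input" field).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

# Assumption: "input" may be absent or empty, the other two keys are required.
REQUIRED_KEYS = {"instruction", "output"}


def check_records(records):
    """Raise ValueError if any record lacks a required key."""
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return records


def build_alpaca_prompt(record):
    """Return the prompt text that precedes the model's response."""
    if record.get("input", "").strip():
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


# A one-record dataset, parsed the same way a JSON file would be.
sample = json.loads(
    '[{"instruction": "Reverse a string in Python.", "input": "", "output": "s[::-1]"}]'
)
check_records(sample)
prompt = build_alpaca_prompt(sample[0])
print(prompt)
```

To check a real file, replace the json.loads call with json.load on the opened file passed as --data_path.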

python3 test_selection/data_by_cluster.py \
    --pt_data_path alpaca_data_pre.pt \
    --json_data_path data/code_alpaca_2k.json \
    --json_save_path alpaca_data_pre.json \
    --sample_num 10 \
    --kmeans_num_clusters 100 \
    --low_th 25 \
    --up_th 75

--pt_data_path: The .pt file from the previous step containing the needed embeddings or scores
--json_data_path: The targeted dataset in the Alpaca format
--json_save_path: The path to save the selected pre-experienced samples
--sample_num: How many samples will be selected in each cluster
--kmeans_num_clusters: How many clusters will be generated by K-Means
--low_th and --up_th: The lower and upper thresholds for selecting samples within each cluster
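
A rough sketch of the per-cluster filtering these flags describe, in plain Python. It assumes, without confirmation from the script itself, that every sample already has a cluster label (for example from a K-Means fit) and a scalar score, and that --low_th and --up_th are percentiles of the scores inside each cluster. The function names, the random draw, and the rule of keeping small clusters whole are all hypothetical.

```python
import random
from collections import defaultdict


def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def select_per_cluster(labels, scores, sample_num, low_th, up_th, seed=0):
    """Return sorted indices of the samples kept from every cluster."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    selected = []
    for members in clusters.values():
        # Assumption: clusters no larger than sample_num are kept whole.
        if len(members) <= sample_num:
            selected.extend(members)
            continue
        cluster_scores = [scores[i] for i in members]
        low = percentile(cluster_scores, low_th)
        up = percentile(cluster_scores, up_th)
        # Keep only samples whose score lies between the two thresholds.
        middle = [i for i in members if low <= scores[i] <= up]
        if len(middle) > sample_num:
            middle = rng.sample(middle, sample_num)
        selected.extend(middle)
    return sorted(selected)


# Toy data: cluster 0 has eight samples, cluster 1 has two.
labels = [0] * 8 + [1] * 2
scores = [1, 2, 3, 4, 5, 6, 7, 8, 5, 6]
picked = select_per_cluster(labels, scores, sample_num=2, low_th=25, up_th=75)
print(picked)
```

With --sample_num 10 and --kmeans_num_clusters 100, a scheme like this would return at most 1000 samples.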

  2. Train Pre-Experienced Model

  3. Select Cherry Data

python3 test_selection/data_analysis.py \
    --data_path data/code_alpaca_2k.json \
    --save_path alpaca_data_cherry.pt \
    --model_name_or_path microsoft/Phi-3-mini-4k-instruct \
    --max_length 512 \
    --prompt alpaca \
    --mod cherry

python3 test_selection/data_by_IFD.py \
    --pt_data_path alpaca_data_cherry.pt \
    --json_data_path data/code_alpaca_2k.json \
    --json_save_path alpaca_data_cherry.json \
    --max_length 512 \
    --sample_rate 0.06 \
    --prompt alpaca

--sample_rate: The fraction of the dataset to select as cherry samples (0.06 selects 6%). You can also use --sample_number to set the exact number of samples.
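
The relationship between a rate and an exact count can be sketched as follows. This is an illustration only: it assumes each sample has one score where higher means more desirable, and that an explicit count overrides the rate. Neither assumption is confirmed for data_by_IFD.py, and select_top is a hypothetical name.

```python
def select_top(scores, sample_rate=None, sample_number=None):
    """Return indices of the highest-scoring samples, best first."""
    if sample_number is None:
        if sample_rate is None:
            raise ValueError("give either sample_rate or sample_number")
        # Convert the fraction into a count; round avoids float truncation.
        sample_number = round(len(scores) * sample_rate)
    # Sort indices by score, highest first, and keep the requested count.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:sample_number]


scores = [0.1, 0.9, 0.5, 0.7]
by_rate = select_top(scores, sample_rate=0.5)
by_number = select_top(scores, sample_number=1)
print(by_rate, by_number)
```

Under these assumptions, a rate of 0.06 on a 2,000-sample dataset corresponds to 120 samples.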

  4. Train Cherry Model

Barry's Machine

Lenovo Legion Pro 5i laptop with a mobile RTX 4070 from 2023. It has an Intel Core i9 with 24 cores (32 threads) and 32 GB of RAM. Clock speed is 4.1 GHz up to 5.1 GHz; the exact figure differs between E-cores and P-cores.

GPU: 8 GB memory, 0.2 TFLOPS at 64-bit precision, 15.6 TFLOPS at 16-bit precision.

Dataset       Pre-Training Time   Re-training Time
Alpaca_2k     47 min              1 hr 30 min
Alpaca_20k    8 hr
