GitHub - deeptiGuptaUIUC/testLLM

Install

Install the dependencies with pip install -r requirements.txt

Note: This requirements.txt originates from Stanford Alpaca. If you are using a different code base that already has PyTorch installed, we recommend installing the packages below manually instead of installing from requirements.txt.

pip install tqdm

pip install scikit-learn

Run Code

Select Pre-Experienced Data

python3 test_selection/data_analysis.py \
    --data_path data/code_alpaca_2k.json \
    --save_path alpaca_data_pre.pt \
    --model_name_or_path microsoft/Phi-3-mini-4k-instruct \
    --max_length 512 \
    --prompt alpaca \
    --mod pre

--data_path: The targeted dataset in the Alpaca format
--save_path: The path to save the .pt file containing embeddings or scores
--prompt: The prompt type used for training and selecting data, either alpaca or wiz
--mod: Set to pre to compute the embeddings or scores needed for selecting pre-experienced samples
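
Before running the analysis step, it can help to confirm that the file passed to --data_path really is in the Alpaca format. The sketch below assumes the three-field record layout used by Stanford Alpaca (instruction, input, output, where input may be an empty string). The helper names are hypothetical and not part of this repository.

```python
import json

# Keys of one Alpaca-format record (assumed layout); "input" may be "".
ALPACA_KEYS = {"instruction", "input", "output"}


def find_malformed_records(records):
    """Return the indices of records that are missing any Alpaca key."""
    return [
        i for i, rec in enumerate(records)
        if not isinstance(rec, dict) or not ALPACA_KEYS.issubset(rec)
    ]


def check_alpaca_file(path):
    """Load a JSON list from `path` and return malformed record indices."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return find_malformed_records(records)


# Tiny in-memory example (hypothetical records, not taken from the dataset).
sample = [
    {"instruction": "Reverse a string.", "input": "abc", "output": "cba"},
    {"instruction": "Say hi.", "output": "hi"},  # missing "input"
]
bad = find_malformed_records(sample)
print(bad)
```

For a real file, call check_alpaca_file("data/code_alpaca_2k.json"); an empty list means every record has all three keys.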

python3 test_selection/data_by_cluster.py \
    --pt_data_path alpaca_data_pre.pt \
    --json_data_path data/code_alpaca_2k.json \
    --json_save_path alpaca_data_pre.json \
    --sample_num 10 \
    --kmeans_num_clusters 100 \
    --low_th 25 \
    --up_th 75

--pt_data_path: The .pt file from the previous step containing the needed embeddings or scores
--json_data_path: The targeted dataset in the Alpaca format
--json_save_path: The path to save the selected pre-experienced samples
--sample_num: How many samples will be selected in each cluster
--kmeans_num_clusters: How many clusters will be generated by K-Means
--low_th and --up_th: The lower and upper thresholds for selecting samples within each cluster
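
The per-cluster selection described by these options can be sketched as follows. This is an illustration only, under two loud assumptions: that --low_th and --up_th are percentiles of a per-sample score within each cluster, and that cluster labels are already available (for example from scikit-learn's KMeans.fit_predict). The function names are hypothetical, and data_by_cluster.py may work differently.

```python
import random
from collections import defaultdict


def percentile(sorted_vals, pct):
    """Linear-interpolation percentile of an already sorted list."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = (len(sorted_vals) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def select_per_cluster(scores, labels, sample_num=10, low_th=25, up_th=75, seed=0):
    """Pick up to `sample_num` indices per cluster from the middle score band.

    Assumption: thresholds are percentiles computed inside each cluster.
    """
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    selected = []
    for label in sorted(clusters):
        members = clusters[label]
        vals = sorted(scores[i] for i in members)
        lower, upper = percentile(vals, low_th), percentile(vals, up_th)
        # Keep only samples whose score lies inside the [lower, upper] band.
        band = [i for i in members if lower <= scores[i] <= upper]
        selected.extend(rng.sample(band, min(sample_num, len(band))))
    return sorted(selected)


# Hypothetical scores for eight samples split across two clusters.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
print(select_per_cluster(scores, labels))
```

Under this reading, each cluster contributes at most sample_num samples, so the selection stays spread across clusters while dropping the extremes at both ends of the score range.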

Train Pre-Experienced Model

Select Cherry Data

python3 test_selection/data_analysis.py \
    --data_path data/code_alpaca_2k.json \
    --save_path alpaca_data_cherry.pt \
    --model_name_or_path microsoft/Phi-3-mini-4k-instruct \
    --max_length 512 \
    --prompt alpaca \
    --mod cherry

python3 test_selection/data_by_IFD.py \
    --pt_data_path alpaca_data_cherry.pt \
    --json_data_path data/code_alpaca_2k.json \
    --json_save_path alpaca_data_cherry.json \
    --max_length 512 \
    --sample_rate 0.06 \
    --prompt alpaca

--sample_rate: The fraction of the dataset to select as cherry samples. Alternatively, use --sample_number to set the exact number of samples.
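
The relationship between --sample_rate and --sample_number can be illustrated with a short sketch. It assumes that an explicit count takes precedence over the rate, that the rate is rounded to the nearest whole sample, and that the highest-scoring samples are kept. These are illustrative assumptions with hypothetical function names, not a description of what data_by_IFD.py does internally.

```python
def num_cherry_samples(total, sample_rate=None, sample_number=None):
    """Resolve how many samples to keep from a rate or an exact count."""
    if sample_number is not None:
        # Assumption: an explicit count wins over the rate.
        return min(sample_number, total)
    if sample_rate is None:
        raise ValueError("give sample_rate or sample_number")
    return min(total, round(total * sample_rate))


def top_by_score(scores, k):
    """Indices of the k highest scores, best first (assumed ordering)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# A rate of 0.06 on a hypothetical 2000-sample dataset keeps 120 samples.
k = num_cherry_samples(2000, sample_rate=0.06)
print(k)
print(top_by_score([0.2, 0.9, 0.5], 2))
```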

Train Cherry Model

Barry's Machine

A 2023 Lenovo Legion Pro 5i laptop with a mobile RTX 4070. It has an Intel Core i9 with 24 cores (32 threads) and 32 GB RAM. The clock speed ranges from 4.1 GHz up to 5.1 GHz; it's complicated because the CPU mixes E-cores and P-cores.

8 GB RAM, 0.2 TFLOPS at 64-bit precision, 15.6 TFLOPS at 16-bit precision

Dataset	Pre Training Time	Re-training Time
Alpaca_2k	47 min	1 hr 30 min
Alpaca_20k	8 hr
