This is a repo I use to run HumanEval on code models; adjust it as needed. Some scripts are adapted from the WizardCoder repo. The code is duplicated across scripts, mostly to handle per-model edge cases around tokenizing and loading (sketched below the results table); I might eventually clean it up.
| model | size | pass@1 | pass@10 |
|---|---|---|---|
| WizardCoder-15B-V1.0 | 15B | 57.0% | 68.9% |
| openchat/opencoderplus | 15B | 27.3% | 43.9% |
| teknium/Replit-v1-CodeInstruct-3B | 3B | 25.8% | 42.6% |
| teknium/Replit-v2-CodeInstruct-3B | 3B | 21.5% | 31.0% |
| replit-code-v1-3b | 3B | 15.1% | 27.4% |
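The per-model scripts differ mainly in how they construct the tokenizer and model. Here is a minimal sketch of the kind of loading logic involved, assuming Hugging Face `transformers`; the `load_model` helper and its `trust_remote_code` heuristic are illustrative, not copied from the scripts:

```python
# Illustrative sketch of per-model loading; not lifted from the repo's scripts.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model(name: str):
    # Replit's models ship custom tokenizer/model code on the Hub,
    # so they need trust_remote_code=True; most other models load normally.
    trust = "replit" in name.lower()
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=trust)
    model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=trust)
    return model, tokenizer

model, tokenizer = load_model("replit/replit-code-v1-3b")
```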
Create a Python environment:

```bash
python -m venv env && source env/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Run the eval script:
```bash
# replace script file name for various models:
# eval_wizard.py
# eval_opencode.py
# eval_replit.py
# eval_replit_instruct.py
python eval_wizard.py
```
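Each eval script follows roughly the same shape: generate several completions per HumanEval problem and write them to a jsonl file. A minimal sketch of that loop, using the `human-eval` package's `read_problems`/`write_jsonl` helpers; `generate_one` is a hypothetical stand-in for the model-specific prompting and generation code that differs between scripts:

```python
# Rough shape of the generation loop the eval scripts share.
# generate_one is a hypothetical placeholder for the model-specific
# prompt formatting + generation that differs per script.
from human_eval.data import read_problems, write_jsonl

def generate_one(prompt: str) -> str:
    # e.g. tokenize the prompt, model.generate(...), decode, strip the prompt
    raise NotImplementedError

problems = read_problems()
num_samples_per_task = 10  # 10 samples per task supports a pass@10 estimate

samples = [
    dict(task_id=task_id, completion=generate_one(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("results/wizard/eval.jsonl", samples)
```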
Process the jsonl file to extract code samples from the model completions. Note: the Replit base and instruct models do not go through this step.
```bash
# replace args for various models:
# --path results/wizard --out_path results/wizard/eval.jsonl
# --path results/opencode --out_path results/opencode/eval.jsonl
python process_eval.py --path results/wizard --out_path results/wizard/processed.jsonl --add_prompt
```

Then get the results:
```bash
# replace args for various models:
# results/wizard/processed.jsonl
# results/opencode/processed.jsonl
# results/replit_instruct/eval.jsonl
# results/replit/eval.jsonl
evaluate_functional_correctness results/wizard/processed.jsonl
```
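`evaluate_functional_correctness` runs each sample against the HumanEval unit tests and reports the unbiased pass@k estimator from the Codex paper, pass@k = E[1 - C(n-c, k) / C(n, k)] over tasks with n samples and c correct. If you want to recompute pass@k from raw per-task correctness counts, the `human-eval` package exposes the estimator directly; the counts below are made up for illustration:

```python
# Recompute pass@k from per-task sample counts using the estimator
# bundled with the human-eval package. Counts here are illustrative.
import numpy as np
from human_eval.evaluation import estimate_pass_at_k

num_samples = np.array([10, 10, 10])  # samples generated per task
num_correct = np.array([3, 0, 10])    # samples that passed per task
print(estimate_pass_at_k(num_samples, num_correct, k=1).mean())
```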