This is the code repository for the paper *TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar* (ACL 2026 Main). TokDrift is a framework for applying semantic-preserving rewrite rules to code and measuring their influence on code-related tasks.
Run the setup script to install the dependencies:

```bash
bash prepare-env.sh
```

Alternatively, run the provided setup script to create a conda environment with the dependencies:

```bash
bash prepare-env-conda.sh
```

Activate the environment for running the experiments:

```bash
conda activate tokdrift
```

For testing generated code, set up a virtual environment with the required dependencies:

```bash
mkdir venv && cd venv
uv venv ./python3_8 --python 3.8
uv pip install --python ./python3_8/bin/python numpy scipy networkx
cd ..
```

This environment is used to execute and validate generated code during the experiments.
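The isolated Python 3.8 interpreter above is what executes candidate programs during validation. A minimal sketch of how a generated snippet could be run against it with `subprocess` (the helper name `run_generated` is hypothetical, not part of the repo; the interpreter path comes from the setup step above):

```python
import subprocess

# Path of the isolated interpreter created by the setup step above.
VENV_PYTHON = "./venv/python3_8/bin/python"

def run_generated(code: str, stdin: str = "", timeout: int = 10,
                  python: str = VENV_PYTHON) -> tuple:
    """Execute `code` in a subprocess and return (exit_code, stdout)."""
    proc = subprocess.run(
        [python, "-c", code],
        input=stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout
```

Running code in a separate interpreter keeps the generated program's dependencies (numpy, scipy, networkx) apart from the experiment environment.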
Download the Avatar dataset from Hugging Face:

```bash
cd datasets
# requires Git LFS installed
GIT_LFS_SKIP_SMUDGE=0 git clone https://huggingface.co/datasets/iidai/avatar
cd ..
# Normalize the dataset
python scripts/split_avatar.py
cp ./data/input/avatar/base.py ./datasets/avatar/base/base.py
```

Generate the rewrite dataset for the Avatar tasks only:

```bash
python -m src.tokdrift.data_generator --process_avatar
# Copy the dataset config file to the datasets folder
cp ./data/input/avatar/var.py ./datasets/avatar/var/var.py
```

(Optional) Generate the rewrite dataset for all tasks (already prepared for the humaneval and codenet tasks):

```bash
python -m src.tokdrift.data_generator --all
```

Two example scripts for running baseline and variant experiments are provided in the scripts directory:

- `baseline_example.sh` - Example for running baseline experiments
- `variant_example.sh` - Example for running variant experiments
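The rewrite datasets apply semantic-preserving transformations: the code's meaning is unchanged while its surface form (and hence its subword tokenization) shifts. As a toy illustration of one naming-variant rewrite — not the framework's actual grammar-based implementation — a snake_case to camelCase rename can be sketched as:

```python
import re

def snake_to_camel(identifier: str) -> str:
    """Rewrite one snake_case identifier to camelCase."""
    head, *rest = identifier.split("_")
    return head + "".join(part.capitalize() for part in rest)

def rewrite_names(code: str) -> str:
    """Apply the rename to every snake_case identifier in `code`.
    A real implementation works on the parse tree and skips strings
    and comments; this regex version is only illustrative."""
    return re.sub(r"\b[a-z]+(?:_[a-z0-9]+)+\b",
                  lambda m: snake_to_camel(m.group(0)), code)
```

For example, `rewrite_names("total_sum = item_count + 1")` yields `"totalSum = itemCount + 1"`: same semantics, different token stream.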
Please refer to the following sections for detailed usage instructions.
## Running Experiments
Set these variables before running experiments:

```bash
MODEL="your-model-name"
MAX_LENGTH_GENERATION=1024
PROMPT="prompt-type"
LANGUAGE="python"
PRECISION="bf16"
DO_SAMPLE=False
N_SAMPLES=1
BATCH_SIZE=1
SAVE_GENERATIONS_PATH="path/to/save/generations"
METRIC_OUTPUT_PATH="path/to/save/metrics"
HIDDEN_STATES_SAVE_PATH="path/to/save/hidden_states"
```

Note that tokenizer behavior varies across model series; some special tokenizers may raise errors when analyzing the results (fragment analysis).
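The tokenizer caveat above matters because subword segmentation is sensitive to small surface changes. A toy greedy longest-match tokenizer (with a hypothetical vocabulary, not any real model's) shows how adding one space shifts the subword boundaries of the same code fragment:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match subword segmentation (BPE-like toy)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Hypothetical vocabulary; real BPE vocabularies are model-specific.
VOCAB = {"a", ")", ";", ");", " ", " )", " ;"}

print(greedy_tokenize("a);", VOCAB))   # ');' stays one merged token
print(greedy_tokenize("a) ;", VOCAB))  # the space splits the pair apart
```

This boundary drift between semantically identical inputs is exactly what the framework measures.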
Following the setup in bigcode-evaluation-harness, the HumanEval Explain task requires two stages: describe, then synthesize.

Stage 1: Describe

```bash
TASK_1_NAME="humanevalexplaindescribe"
accelerate launch --num_processes 1 -m src.tokdrift.run_experiments \
--model $MODEL \
--max_length_generation $MAX_LENGTH_GENERATION \
--prompt $PROMPT \
--tasks $TASK_1_NAME-$LANGUAGE \
--precision $PRECISION \
--do_sample $DO_SAMPLE \
--n_samples $N_SAMPLES \
--batch_size $BATCH_SIZE \
--allow_code_execution \
--trust_remote_code \
--save_generations \
--save_generations_path $SAVE_GENERATIONS_PATH \
--generation_only \
--hidden_states_save_path $HIDDEN_STATES_SAVE_PATH \
--max_memory_per_gpu "auto"
```

Stage 2: Synthesize
```bash
TASK_2_NAME="humanevalexplainsynthesize"
LOAD_GENERATIONS_PATH="path/from/stage1"
accelerate launch --num_processes 1 -m src.tokdrift.run_experiments \
--model $MODEL \
--max_length_generation $MAX_LENGTH_GENERATION \
--prompt $PROMPT \
--load_data_path $LOAD_GENERATIONS_PATH \
--tasks $TASK_2_NAME-$LANGUAGE \
--precision $PRECISION \
--do_sample $DO_SAMPLE \
--n_samples $N_SAMPLES \
--batch_size $BATCH_SIZE \
--allow_code_execution \
--trust_remote_code \
--save_generations \
--save_generations_path $SAVE_GENERATIONS_PATH \
--metric_output_path $METRIC_OUTPUT_PATH \
--max_memory_per_gpu "auto"
```

For tasks like CodeNet Translate, Avatar Translate, and HumanEval Fix Tests:
```bash
# Choose one:
TASK_1_NAME="codenettranslate"
# TASK_1_NAME="avatartranslate"
# TASK_1_NAME="humanevalfixtests"
accelerate launch --num_processes 1 -m src.tokdrift.run_experiments \
--model $MODEL \
--max_length_generation $MAX_LENGTH_GENERATION \
--prompt $PROMPT \
--tasks $TASK_1_NAME-$LANGUAGE \
--precision $PRECISION \
--do_sample $DO_SAMPLE \
--n_samples $N_SAMPLES \
--batch_size $BATCH_SIZE \
--allow_code_execution \
--trust_remote_code \
--save_generations \
--save_generations_path $SAVE_GENERATIONS_PATH \
--metric_output_path $METRIC_OUTPUT_PATH \
--hidden_states_save_path $HIDDEN_STATES_SAVE_PATH \
--max_memory_per_gpu "auto"
```

Note: storing hidden states is currently not supported when using data parallelism.
## Task Variants
For HumanEval Fix tasks with variants:

```bash
TASK_NAME="humanevalfixtests"
COMBINED_TOKEN_VARIANT="snake_case" # or any variant from the list below
accelerate launch --num_processes 1 -m src.tokdrift.run_experiments \
--model $MODEL \
--tasks $TASK_NAME-$LANGUAGE-$COMBINED_TOKEN_VARIANT-fix \
[other parameters...]
```

For CodeNet Translate, Avatar Translate, and other tasks:

```bash
TASK_1_NAME="codenettranslate" # or avatartranslate
SPECIFIC_CASE="snake_case" # or any variant from the list below
accelerate launch --num_processes 1 -m src.tokdrift.run_experiments \
--model $MODEL \
--tasks $TASK_1_NAME-$LANGUAGE-$SPECIFIC_CASE \
[other parameters...]
```

Available naming variants:

- snake_case
- pascal_case
- screaming_snake_case
- camel_case

Available spacing variants:

- op_dash
- op_lsquarebracket
- rparentheses_period
- rsquarebracket_rparentheses
- op_rsquarebracket
- op_lparentheses
- lsquarebracket_name
- double_plus_rparentheses
- period_asterisk
- rparentheses_colon
- rparentheses_semicolon
- op_semicolon
- rparentheses_rparentheses
- lparentheses_rparentheses
- period_name
- lparentheses_name
- op_name
- op_all
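The task identifier passed to `--tasks` composes the task name, language, and variant as shown in the commands above. A small sketch for sweeping several variants (the helper `task_id` is hypothetical; the join pattern follows the examples above, including the trailing `-fix` for HumanEval Fix):

```python
# Naming variants taken from the list above.
NAMING = ["snake_case", "pascal_case", "screaming_snake_case", "camel_case"]

def task_id(task: str, language: str, variant: str, fix: bool = False) -> str:
    """Compose a --tasks identifier: task-language-variant[-fix]."""
    suffix = "-fix" if fix else ""
    return f"{task}-{language}-{variant}{suffix}"

tasks = [task_id("codenettranslate", "python", v) for v in NAMING]
fix_tasks = [task_id("humanevalfixtests", "python", v, fix=True) for v in NAMING]
```

For instance, `tasks[0]` is `"codenettranslate-python-snake_case"`, which could then be substituted for the `--tasks` argument in a loop over experiments.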
After running the experiments, analyze the results using the following commands.

First, extract all result datapoints from the log files in the output directory:

```bash
python -m src.tokdrift.result_extractor --all
```

This processes all tasks, models, naming variants, and spacing variants to generate evaluation JSON files with detailed result datapoints.

Generate CSV summary files for all results:

```bash
python -m src.tokdrift.result_extractor --sum_to_csv
```

This creates comprehensive CSV files in ./data/output/ containing:

- Accuracy results
- Accuracy deltas comparing baseline vs. variant
- Sensitivity analysis across all variants
- Per-task and per-model breakdowns
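The accuracy deltas in those CSVs are simply per-task differences between each variant and the baseline. A stdlib sketch of that computation (the column names `task`, `variant`, `accuracy` are assumptions for illustration; check the actual headers in ./data/output/):

```python
import csv
import io

def accuracy_deltas(csv_text: str) -> dict:
    """Map (task, variant) -> accuracy minus the task's baseline accuracy."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    baseline = {r["task"]: float(r["accuracy"])
                for r in rows if r["variant"] == "baseline"}
    return {(r["task"], r["variant"]): float(r["accuracy"]) - baseline[r["task"]]
            for r in rows if r["variant"] != "baseline"}

# Hypothetical two-row example: snake_case drops accuracy by 0.06.
example = """task,variant,accuracy
codenettranslate,baseline,0.80
codenettranslate,snake_case,0.74
"""
print(accuracy_deltas(example))
```

A negative delta means the variant hurt the model relative to the semantically identical baseline.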
Use result_evaluator.py to gather all result datapoints for sensitivity analysis.

Get summary and sensitivity results:

```bash
python -m src.tokdrift.result_evaluator --diff
```

This outputs:

- The total number of processed tasks across all experiments
- Sensitivity results showing how naming and spacing variants affect task results, including a breakdown by fragment change type (merged, split, mixed, unchanged)

Output files are saved to:

- ./data/output/sensitivity/ - Sensitivity percentages
- ./data/output/sample_info/ - Sample counts and statistics
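The merged/split/mixed/unchanged breakdown compares how a fragment's subword boundaries change between the baseline and a variant. A simplified token-count heuristic conveys the idea (this is an illustrative sketch, not necessarily the repo's exact classification logic):

```python
def classify_fragment(base_tokens, var_tokens):
    """Classify how subword segmentation changed for one code fragment.
    Simplified heuristic for illustration; the repo's analysis may differ."""
    if base_tokens == var_tokens:
        return "unchanged"
    if len(var_tokens) < len(base_tokens):
        return "merged"   # fewer tokens: subword pieces fused together
    if len(var_tokens) > len(base_tokens):
        return "split"    # more tokens: subword pieces broken apart
    return "mixed"        # same count, different boundaries

print(classify_fragment(["fo", "o"], ["foo"]))  # merged
print(classify_fragment(["foo"], ["fo", "o"]))  # split
```

Sensitivity is then reported per change type, showing whether merges or splits perturb model behavior more.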
Wilcoxon Signed-Rank Test:

Test the statistical significance of performance differences between model sizes within one model series:

```bash
python -m src.tokdrift.result_evaluator --wilcoxon_test
```

This compares small, medium, and large model variants (e.g., Llama-3 3B vs. 8B vs. 70B) to determine whether smaller models show significantly different sensitivity.
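The test pairs per-task results from two models, ranks the absolute differences, and compares the positive and negative rank sums. A minimal pure-Python version of the W statistic, assuming paired samples (in practice `scipy.stats.wilcoxon` also reports a p-value):

```python
def wilcoxon_w(x, y):
    """Wilcoxon signed-rank W statistic for paired samples:
    min of the positive- and negative-difference rank sums.
    Zero differences are dropped; tied |differences| share averaged ranks."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

For example, with differences [1, -2, 3, -4, 5] the positive rank sum is 1+3+5 = 9 and the negative one is 2+4 = 6, so W = 6.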
Vector Visualization:

To visualize differences in hidden-state representations, check vector_visualizer.py for more details:

```bash
# Generate 2D t-SNE visualizations
python -m src.tokdrift.vector_visualizer --vector --model "Qwen2.5-Coder-32B-Instruct"
# Generate 3D PCA visualizations
python -m src.tokdrift.vector_visualizer --vector_3d --model "Llama-3.3-70B-Instruct"
# Generate similarity plots
python -m src.tokdrift.vector_visualizer --similarity --model "deepseek-coder-33b-instruct"
```

### Example: 2D Visualizations (t-SNE)
- Naming variants visualization from Qwen2.5-Coder-32B-Instruct (middle layer)
- Spacing variants visualization from Qwen2.5-Coder-32B-Instruct (middle layer)

### Example: 3D Visualizations (PCA)

- Naming variants visualization from Llama-3.1-8B-Instruct (last layer)
- Spacing variants visualization from deepseek-coder-33b-instruct (last layer)
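At their core, the similarity plots compare hidden-state vectors of baseline and variant inputs, typically via cosine similarity. A stdlib sketch of that metric (an assumed measure for illustration; see vector_visualizer.py for the actual one used):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

A similarity near 1.0 means the model represents the rewritten code almost identically to the baseline; lower values indicate representation drift.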
This project builds upon and is inspired by several excellent codebases:

- bigcode-evaluation-harness - Evaluation framework for code generation models
- antlr4 - Parser generator
- grammars-v4 - Grammars written for ANTLR v4
Please open an issue if you run into any problems or have any suggestions.
```bibtex
@article{YinxiETAL25TokDrift,
  title={{TokDrift}: When {LLM} Speaks in Subwords but Code Speaks in Grammar},
  author={Yinxi Li and Yuntian Deng and Pengyu Nie},
  journal={arXiv preprint arXiv:2510.14972},
  year={2025},
  url={https://arxiv.org/abs/2510.14972},
}
```




