This is the code repository for the paper *TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar* (ACL 2026 Main). TokDrift is a framework for applying semantic-preserving rewrite rules to code and measuring their influence on code-related tasks.
Run the setup script to install the dependencies:

```bash
bash prepare-env.sh
```

Alternatively, run the provided setup script to create a conda environment with the dependencies:

```bash
bash prepare-env-conda.sh
```

Activate the environment for running the experiments:

```bash
conda activate tokdrift
```

For testing generated code, set up a virtual environment with the required dependencies:

```bash
mkdir venv && cd venv
uv venv ./python3_8 --python 3.8
uv pip install --python ./python3_8/bin/python numpy scipy networkx
cd ..
```

This environment is used to execute and validate generated code during the experiments.
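The isolated Python 3.8 interpreter above is what executes candidate programs during validation. A minimal sketch of how a generated snippet could be run against it with `subprocess` (the helper name `run_generated` is hypothetical, not part of the repo; the interpreter path comes from the setup step above):

```python
import subprocess

# Path of the isolated interpreter created by the setup step above.
VENV_PYTHON = "./venv/python3_8/bin/python"

def run_generated(code: str, stdin: str = "", timeout: int = 10,
                  python: str = VENV_PYTHON) -> tuple:
    """Execute `code` in a subprocess and return (exit_code, stdout)."""
    proc = subprocess.run(
        [python, "-c", code],
        input=stdin,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout
```

Running code in a separate interpreter keeps the generated program's dependencies (numpy, scipy, networkx) apart from the experiment environment.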
Download the Avatar dataset from Hugging Face:

```bash
cd datasets
# requires Git LFS installed
GIT_LFS_SKIP_SMUDGE=0 git clone https://huggingface.co/datasets/iidai/avatar
cd ..
# Normalize the dataset
python scripts/split_avatar.py
cp ./data/input/avatar/base.py ./datasets/avatar/base/base.py
```

Generate the rewrite dataset for the Avatar tasks only:

```bash
python -m src.tokdrift.data_generator --process_avatar
# Copy the dataset config file to the datasets folder
cp ./data/input/avatar/var.py ./datasets/avatar/var/var.py
```

(Optional) Generate the rewrite dataset for all tasks (already prepared for the humaneval and codenet tasks):

```bash
python -m src.tokdrift.data_generator --all
```

Two example scripts for running baseline and variant experiments are provided in the scripts directory:

- `baseline_example.sh` - Example for running baseline experiments
- `variant_example.sh` - Example for running variant experiments
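The rewrite datasets apply semantic-preserving transformations: the code's meaning is unchanged while its surface form (and hence its subword tokenization) shifts. As a toy illustration of one naming-variant rewrite — not the framework's actual grammar-based implementation — a snake_case to camelCase rename can be sketched as:

```python
import re

def snake_to_camel(identifier: str) -> str:
    """Rewrite one snake_case identifier to camelCase."""
    head, *rest = identifier.split("_")
    return head + "".join(part.capitalize() for part in rest)

def rewrite_names(code: str) -> str:
    """Apply the rename to every snake_case identifier in `code`.
    A real implementation works on the parse tree and skips strings
    and comments; this regex version is only illustrative."""
    return re.sub(r"\b[a-z]+(?:_[a-z0-9]+)+\b",
                  lambda m: snake_to_camel(m.group(0)), code)
```

For example, `rewrite_names("total_sum = item_count + 1")` yields `"totalSum = itemCount + 1"`: same semantics, different token stream.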
Please refer to the following sections for detailed usage instructions.
## Running Experiments
Set these variables before running experiments:

```bash
MODEL="your-model-name"
MAX_LENGTH_GENERATION=1024
PROMPT="prompt-type"
LANGUAGE="python"
PRECISION="bf16"
DO_SAMPLE=False
N_SAMPLES=1
BATCH_SIZE=1
SAVE_GENERATIONS_PATH="path/to/save/generations"
METRIC_OUTPUT_PATH="path/to/save/metrics"
HIDDEN_STATES_SAVE_PATH="path/to/save/hidden_states"
```

Note that tokenizer behavior varies across model series; some special tokenizers may raise errors when analyzing the results (fragment analysis).
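The tokenizer caveat above matters because subword segmentation is sensitive to small surface changes. A toy greedy longest-match tokenizer (with a hypothetical vocabulary, not any real model's) shows how adding one space shifts the subword boundaries of the same code fragment:

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match subword segmentation (BPE-like toy)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Hypothetical vocabulary; real BPE vocabularies are model-specific.
VOCAB = {"a", ")", ";", ");", " ", " )", " ;"}

print(greedy_tokenize("a);", VOCAB))   # ');' stays one merged token
print(greedy_tokenize("a) ;", VOCAB))  # the space splits the pair apart
```

This boundary drift between semantically identical inputs is exactly what the framework measures.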
Following the setup in bigcode-evaluation-harness, the HumanEval Explain task requires two stages: describe, then synthesize.

Stage 1: Describe

```bash
TASK_1_NAME="humanevalexplaindescribe"
accelerate launch --num_processes 1 -m src.tokdrift.run_experiments \
--model $MODEL \
--max_length_generation $MAX_LENGTH_GENERATION \
--prompt $PROMPT \
--tasks $TASK_1_NAME-$LANGUAGE \
--precision $PRECISION \
--do_sample $DO_SAMPLE \
--n_samples $N_SAMPLES \
--batch_size $BATCH_SIZE \
--allow_code_execution \
--trust_remote_code \
--save_generations \
--save_generations_path $SAVE_GENERATIONS_PATH \
--generation_only \
--hidden_states_save_path $HIDDEN_STATES_SAVE_PATH \
--max_memory_per_gpu "auto"
```

Stage 2: Synthesize
```bash
TASK_2_NAME="humanevalexplainsynthesize"
LOAD_GENERATIONS_PATH="path/from/stage1"
accelerate launch --num_processes 1 -m src.tokdrift.run_experiments \
--model $MODEL \
--max_length_generation $MAX_LENGTH_GENERATION \
--prompt $PROMPT \
--load_data_path $LOAD_GENERATIONS_PATH \
--tasks $TASK_2_NAME-$LANGUAGE \
--precision $PRECISION \
--do_sample $DO_SAMPLE \
--n_samples $N_SAMPLES \
--batch_size $BATCH_SIZE \
--allow_code_execution \
--trust_remote_code \
--save_generations \
--save_generations_path $SAVE_GENERATIONS_PATH \
--metric_output_path $METRIC_OUTPUT_PATH \
--max_memory_per_gpu "auto"
```

For tasks like CodeNet Translate, Avatar Translate, and HumanEval Fix Tests:
```bash
# Choose one:
TASK_1_NAME="codenettranslate"
# TASK_1_NAME="avatartranslate"
# TASK_1_NAME="humanevalfixtests"
accelerate launch --num_processes 1 -m src.tokdrift.run_experiments \
--model $MODEL \
--max_length_generation $MAX_LENGTH_GENERATION \
--prompt $PROMPT \
--tasks $TASK_1_NAME-$LANGUAGE \
--precision $PRECISION \
--do_sample $DO_SAMPLE \
--n_samples $N_SAMPLES \
--batch_size $BATCH_SIZE \
--allow_code_execution \
--trust_remote_code \
--save_generations \
--save_generations_path $SAVE_GENERATIONS_PATH \
--metric_output_path $METRIC_OUTPUT_PATH \
--hidden_states_save_path $HIDDEN_STATES_SAVE_PATH \
--max_memory_per_gpu "auto"
```

Note: storing hidden states is currently not supported when using data parallelism.
## Task Variants
For HumanEval Fix tasks with variants:

```bash
TASK_NAME="humanevalfixtests"
COMBINED_TOKEN_VARIANT="snake_case" # or any variant from the list below
accelerate launch --num_processes 1 -m src.tokdrift.run_experiments \
--model $MODEL \
--tasks $TASK_NAME-$LANGUAGE-$COMBINED_TOKEN_VARIANT-fix \
[other parameters...]
```

For CodeNet Translate, Avatar Translate, and other tasks:

```bash
TASK_1_NAME="codenettranslate" # or avatartranslate
SPECIFIC_CASE="snake_case" # or any variant from the list below
accelerate launch --num_processes 1 -m src.tokdrift.run_experiments \
--model $MODEL \
--tasks $TASK_1_NAME-$LANGUAGE-$SPECIFIC_CASE \
[other parameters...]
```

Available naming variants:

- snake_case
- pascal_case
- screaming_snake_case
- camel_case

Available spacing variants:

- op_dash
- op_lsquarebracket
- rparentheses_period
- rsquarebracket_rparentheses
- op_rsquarebracket
- op_lparentheses
- lsquarebracket_name
- double_plus_rparentheses
- period_asterisk
- rparentheses_colon
- rparentheses_semicolon
- op_semicolon
- rparentheses_rparentheses
- lparentheses_rparentheses
- period_name
- lparentheses_name
- op_name
- op_all
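The task identifier passed to `--tasks` composes the task name, language, and variant as shown in the commands above. A small sketch for sweeping several variants (the helper `task_id` is hypothetical; the join pattern follows the examples above, including the trailing `-fix` for HumanEval Fix):

```python
# Naming variants taken from the list above.
NAMING = ["snake_case", "pascal_case", "screaming_snake_case", "camel_case"]

def task_id(task: str, language: str, variant: str, fix: bool = False) -> str:
    """Compose a --tasks identifier: task-language-variant[-fix]."""
    suffix = "-fix" if fix else ""
    return f"{task}-{language}-{variant}{suffix}"

tasks = [task_id("codenettranslate", "python", v) for v in NAMING]
fix_tasks = [task_id("humanevalfixtests", "python", v, fix=True) for v in NAMING]
```

For instance, `tasks[0]` is `"codenettranslate-python-snake_case"`, which could then be substituted for the `--tasks` argument in a loop over experiments.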
After running the experiments, analyze the results using the following commands.

First, extract all result datapoints from the log files in the output directory:

```bash
python -m src.tokdrift.result_extractor --all
```

This processes all tasks, models, naming variants, and spacing variants to generate evaluation JSON files with detailed result datapoints.

Generate CSV summary files for all results:

```bash
python -m src.tokdrift.result_extractor --sum_to_csv
```

This creates comprehensive CSV files in ./data/output/ containing:

- Accuracy results
- Accuracy deltas comparing baseline vs. variant
- Sensitivity analysis across all variants
- Per-task and per-model breakdowns
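The accuracy deltas in those CSVs are simply per-task differences between each variant and the baseline. A stdlib sketch of that computation (the column names `task`, `variant`, `accuracy` are assumptions for illustration; check the actual headers in ./data/output/):

```python
import csv
import io

def accuracy_deltas(csv_text: str) -> dict:
    """Map (task, variant) -> accuracy minus the task's baseline accuracy."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    baseline = {r["task"]: float(r["accuracy"])
                for r in rows if r["variant"] == "baseline"}
    return {(r["task"], r["variant"]): float(r["accuracy"]) - baseline[r["task"]]
            for r in rows if r["variant"] != "baseline"}

# Hypothetical two-row example: snake_case drops accuracy by 0.06.
example = """task,variant,accuracy
codenettranslate,baseline,0.80
codenettranslate,snake_case,0.74
"""
print(accuracy_deltas(example))
```

A negative delta means the variant hurt the model relative to the semantically identical baseline.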
Use result_evaluator.py to gather all result datapoints for sensitivity analysis.

Get summary and sensitivity results:

```bash
python -m src.tokdrift.result_evaluator --diff
```

This outputs:

- The total number of processed tasks across all experiments
- Sensitivity results showing how naming and spacing variants affect task results, including a breakdown by fragment change type (merged, split, mixed, unchanged)

Output files are saved to:

- ./data/output/sensitivity/ - Sensitivity percentages
- ./data/output/sample_info/ - Sample counts and statistics
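The merged/split/mixed/unchanged breakdown compares how a fragment's subword boundaries change between the baseline and a variant. A simplified token-count heuristic conveys the idea (this is an illustrative sketch, not necessarily the repo's exact classification logic):

```python
def classify_fragment(base_tokens, var_tokens):
    """Classify how subword segmentation changed for one code fragment.
    Simplified heuristic for illustration; the repo's analysis may differ."""
    if base_tokens == var_tokens:
        return "unchanged"
    if len(var_tokens) < len(base_tokens):
        return "merged"   # fewer tokens: subword pieces fused together
    if len(var_tokens) > len(base_tokens):
        return "split"    # more tokens: subword pieces broken apart
    return "mixed"        # same count, different boundaries

print(classify_fragment(["fo", "o"], ["foo"]))  # merged
print(classify_fragment(["foo"], ["fo", "o"]))  # split
```

Sensitivity is then reported per change type, showing whether merges or splits perturb model behavior more.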
Wilcoxon Signed-Rank Test:

Test the statistical significance of performance differences between model sizes within one model series:

```bash
python -m src.tokdrift.result_evaluator --wilcoxon_test
```

This compares small, medium, and large model variants (e.g., Llama-3 3B vs. 8B vs. 70B) to determine whether smaller models show significantly different sensitivity.
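The test pairs per-task results from two models, ranks the absolute differences, and compares the positive and negative rank sums. A minimal pure-Python version of the W statistic, assuming paired samples (in practice `scipy.stats.wilcoxon` also reports a p-value):

```python
def wilcoxon_w(x, y):
    """Wilcoxon signed-rank W statistic for paired samples:
    min of the positive- and negative-difference rank sums.
    Zero differences are dropped; tied |differences| share averaged ranks."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

For example, with differences [1, -2, 3, -4, 5] the positive rank sum is 1+3+5 = 9 and the negative one is 2+4 = 6, so W = 6.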
Vector Visualization:

To visualize differences in hidden-state representations, check vector_visualizer.py for more details:

```bash
# Generate 2D t-SNE visualizations
python -m src.tokdrift.vector_visualizer --vector --model "Qwen2.5-Coder-32B-Instruct"
# Generate 3D PCA visualizations
python -m src.tokdrift.vector_visualizer --vector_3d --model "Llama-3.3-70B-Instruct"
# Generate similarity plots
python -m src.tokdrift.vector_visualizer --similarity --model "deepseek-coder-33b-instruct"
```

### Example: 2D Visualizations (t-SNE)
- Naming variants visualization from Qwen2.5-Coder-32B-Instruct (middle layer)
- Spacing variants visualization from Qwen2.5-Coder-32B-Instruct (middle layer)

### Example: 3D Visualizations (PCA)

- Naming variants visualization from Llama-3.1-8B-Instruct (last layer)
- Spacing variants visualization from deepseek-coder-33b-instruct (last layer)
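At their core, the similarity plots compare hidden-state vectors of baseline and variant inputs, typically via cosine similarity. A stdlib sketch of that metric (an assumed measure for illustration; see vector_visualizer.py for the actual one used):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

A similarity near 1.0 means the model represents the rewritten code almost identically to the baseline; lower values indicate representation drift.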
This project builds upon and is inspired by several excellent codebases:

- bigcode-evaluation-harness - Evaluation framework for code generation models
- antlr4 - Parser generator
- grammars-v4 - Grammars written for ANTLR v4
Please open an issue if you run into any problems or have any suggestions.
```bibtex
@article{YinxiETAL25TokDrift,
  title={{TokDrift}: When {LLM} Speaks in Subwords but Code Speaks in Grammar},
  author={Yinxi Li and Yuntian Deng and Pengyu Nie},
  journal={arXiv preprint arXiv:2510.14972},
  year={2025},
  url={https://arxiv.org/abs/2510.14972},
}
```




