🧠 PILOT-Bench Classification

A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks

This repository provides code, configuration, and evaluation scripts for the PILOT-Bench paper:

Yehoon Jang*, Chaewon Lee*, Hyun-seok Min, and Sungchul Choi (2025)
PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks
Paper | Dataset


🧩 Overview

PILOT-Bench evaluates the legal reasoning capability of large language models (LLMs) within the U.S. Patent Trial and Appeal Board (PTAB) domain.
This repository focuses on three IRAC-aligned classification tasks:

| Task | IRAC Stage | Label Type | # Labels | Metric Type |
|---|---|---|---|---|
| Issue Type | Issue | Multi-label | 5 | Exact Match / Macro-F1 / Micro-F1 |
| Board Authorities | Rule | Multi-label | 10 | Exact Match / Macro-F1 / Micro-F1 |
| Subdecision | Conclusion | Multi-class | 23 (fine) / 6 (coarse) | Accuracy / Macro-F1 / Weighted-F1 |

All experiments follow a zero-shot evaluation protocol, using standardized prompts and unified input settings:

  • Split (Base): Appellant vs. Examiner roles separated
  • Merge: Role-neutral concatenation
  • Split + Claim: Role-split inputs with appended claim text
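A minimal sketch of how these three input settings could assemble a model input from one record. The field names appellant_arguments and examiner_findings follow the dataset description below; the "claims" key, the section headers, and the setting names are illustrative assumptions, not the repository's actual implementation.

```python
def build_input(record: dict, setting: str) -> str:
    """Assemble the prompt input under one of the three input settings.

    Field names follow the dataset description; "claims" and the
    bracketed section headers are assumptions for illustration.
    """
    if setting == "base":          # Split: Appellant vs. Examiner kept separate
        return ("[Appellant]\n" + record["appellant_arguments"] + "\n\n"
                "[Examiner]\n" + record["examiner_findings"])
    if setting == "merge":         # role-neutral concatenation
        return record["appellant_arguments"] + "\n" + record["examiner_findings"]
    if setting == "split_claim":   # role-split input plus appended claim text
        return build_input(record, "base") + "\n\n[Claims]\n" + record["claims"]
    raise ValueError(f"unknown input setting: {setting}")

record = {
    "appellant_arguments": "The rejection under 103 is improper...",
    "examiner_findings": "Claim 1 is obvious over the cited references...",
    "claims": "1. A method comprising...",
}
print(build_input(record, "split_claim"))
```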

๐Ÿ“ Directory Structure

pilot-bench/
│
├── config/                # JSON configuration files for prediction and evaluation
│
├── data/                  # Input / output data (PTAB opinion-split JSONs)
│
├── src/
│   ├── evaluation/        # Task-specific evaluation scripts
│   │   ├── board_ruling_evaluate.py
│   │   ├── issue_type_evaluate.py
│   │   ├── subdecision_evaluate.py
│   │   ├── subdecision_coarse_evaluate.py
│   │   └── evaluate.sh
│   │
│   ├── inference/         # Inference scripts for each classification task
│   │   ├── board_ruling_predict.py
│   │   ├── decision_predict.py
│   │   └── issue_predict.py
│   │
│   ├── llms/              # LLM client wrappers (OpenAI, Gemini, Claude, etc.)
│   │
│   └── utils/             # Utility modules
│
└── README.md              # Project documentation

โš™๏ธ Installation

git clone https://github.com/TeamLab/pilot-bench.git
cd pilot-bench
pip install -r requirements.txt

Python ≥ 3.10 is recommended.


🚀 Usage

🔹 1. Zero-shot / Prompt-based Inference

1.1. Async

python /Path/root/dir/src/inference/board_ruling_predict.py --config "/Path/root/dir/config/board_ruling_predict.json" --prompt "board_ruling" --wandb_entity "pilot-bench" --wandb_project "board_authorities_predict" --input_setting "base" --model "qwen" --mode "async"

1.2. Batch

python /Path/root/dir/src/inference/board_ruling_predict.py --config "/Path/root/dir/config/board_ruling_predict.json" --prompt "board_ruling" --wandb_entity "pilot-bench" --wandb_project "board_authorities_predict" --input_setting "base" --model "gpt" --mode "batch"

🔹 2. Evaluate

python /Path/root/dir/src/evaluation/subdecision_evaluate.py

🔹 3. Evaluation Metrics

All metrics are computed using sklearn.metrics (with zero_division=0 to handle undefined cases).
Each evaluation script also reports coverage statistics to indicate the ratio of evaluated samples relative to the ground truth and predictions.

✅ Multi-label tasks (Issue Type / Board Authorities)

  • exact_match – Subset accuracy, i.e., the proportion of samples whose predicted label set exactly matches the ground truth (accuracy_score on binarized matrices).
  • micro_precision, micro_recall, micro_f1 – Micro-averaged metrics aggregating TP/FP/FN across all labels (average="micro").
  • macro_precision, macro_recall, macro_f1 – Macro-averaged metrics computed as the unweighted mean across all labels (average="macro").
  • hamming_loss – The fraction of incorrect labels among all possible label assignments.
  • coverage_vs_gt – Ratio of evaluated samples to total ground-truth samples (n_eval_used / n_gt_total).
  • coverage_vs_pred – Ratio of evaluated samples to total prediction files (n_eval_used / n_pred_files).

Additional diagnostic statistics:

  • n_gt_total – Number of ground-truth entries.
  • n_pred_files – Number of prediction JSON files.
  • n_eval_used – Number of samples successfully merged for evaluation.
  • n_labels – Number of unique labels in the task.
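As a reference for the definitions above, here is a minimal pure-Python sketch of the main multi-label metrics on binarized 0/1 matrices. The actual scripts use sklearn.metrics; the sample values below are illustrative only.

```python
def multilabel_metrics(y_true, y_pred):
    """y_true, y_pred: equal-shape lists of 0/1 rows (samples x labels)."""
    n = len(y_true)
    k = len(y_true[0])
    # exact_match: predicted label set identical to the ground-truth set
    exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # micro averaging: pool TP/FP/FN over every (sample, label) cell
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += t and p
            fp += (not t) and p
            fn += t and (not p)
    micro_p = tp / (tp + fp) if tp + fp else 0.0   # mirrors zero_division=0
    micro_r = tp / (tp + fn) if tp + fn else 0.0
    micro_f1 = (2 * micro_p * micro_r / (micro_p + micro_r)
                if micro_p + micro_r else 0.0)
    # hamming_loss: fraction of wrong cells among all n * k assignments
    wrong = sum(t != p for t_row, p_row in zip(y_true, y_pred)
                for t, p in zip(t_row, p_row))
    return {"exact_match": exact_match, "micro_f1": micro_f1,
            "hamming_loss": wrong / (n * k)}

gt   = [[1, 1, 0], [0, 1, 0]]   # e.g. issue sets {102, 103} and {103}
pred = [[1, 1, 0], [1, 1, 0]]
print(multilabel_metrics(gt, pred))
```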

✅ Multi-class tasks (Subdecision)

  • accuracy – Standard classification accuracy (accuracy_score).
  • balanced_acc – Balanced accuracy, the average of per-class recall values (balanced_accuracy_score).
  • macro_precision, macro_recall, macro_f1 – Macro-averaged metrics across all classes (average="macro").
  • micro_f1 – Micro-averaged F1 score over all instances (average="micro").
  • weighted_f1 – Weighted F1, averaging class-level F1 scores weighted by class support (average="weighted").
  • coverage_vs_gt – Ratio of evaluated samples to total ground-truth samples.
  • coverage_vs_pred – Ratio of evaluated samples to total prediction files.

Additional diagnostic statistics:

  • n_gt_total, n_pred_files, n_eval_used
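The multi-class metrics above can likewise be sketched in a few lines of pure Python (the scripts themselves call sklearn.metrics; class labels here are illustrative):

```python
def multiclass_metrics(y_true, y_pred):
    """y_true, y_pred: equal-length lists of class labels."""
    classes = sorted(set(y_true))
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    recalls, f1s, supports = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn) if tp + fn else 0.0      # mirrors zero_division=0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
        supports.append(tp + fn)                          # class support in y_true
    return {
        "accuracy": accuracy,
        "balanced_acc": sum(recalls) / len(classes),      # mean per-class recall
        "weighted_f1": sum(f * s for f, s in zip(f1s, supports)) / n,
    }

gt   = ["Affirmed", "Affirmed", "Reversed", "Affirmed-in-Part"]
pred = ["Affirmed", "Reversed", "Reversed", "Affirmed-in-Part"]
print(multiclass_metrics(gt, pred))
```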

🧠 Tasks Summary

Each task corresponds to one stage of IRAC reasoning in PTAB appeals:

  1. Issue Type (IRAC–Issue)
    Identify contested statutory grounds under 35 U.S.C. (§101/102/103/112/Others).

  2. Board Authorities (IRAC–Rule)
    Predict which procedural provisions under 37 C.F.R. (§ 41.50 variants, etc.) are cited by the Board.

  3. Subdecision (IRAC–Conclusion)
    Predict the Board's final outcome (e.g., Affirmed, Reversed, Affirmed-in-Part).


📊 Evaluated Models

Closed-source (commercial)
Claude-Sonnet-4 · Gemini-2.5-Pro · GPT-4o · GPT-o3 · Solar-Pro2

Open-source
LLaMA-3.1-8B-Instruct · Qwen-3-8B · Mistral-7B-Instruct · T5-Gemma-2B


💾 Dataset Access

The full dataset, metadata, and opinion-split JSONs are available at
👉 [TeamLab/pilot-bench](To be updated)

Each PTAB case is aligned with its corresponding USPTO patent and contains:

  • appellant_arguments, examiner_findings, ptab_opinion
  • standardized labels for Issue Type, Board Authorities, and Subdecision tasks

🧮 Example Output

Issue Type:

{
  "issue_type": ["102", "103"]
}

Board Authorities:

{
  "board_ruling": ["37 CFR 41.50", "37 CFR 41.50(a)"]
}

Subdecision (fine-grained):

{
  "decision_number": 0,
  "decision_type": "Affirmed"
}

Subdecision (coarse-grained):

{
  "decision_number": 4,
  "decision_type": "Reversed"
}
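Before scoring, predicted label sets like the ones above have to be binarized against the task's label inventory so the multi-label metrics can operate on 0/1 matrices. A small sketch, using the Issue Type labels from the task summary (the exact binarization code in the repository may differ):

```python
import json

# Fixed label inventory for the Issue Type task
# (35 U.S.C. 101/102/103/112/Others, per the task summary).
LABELS = ["101", "102", "103", "112", "Others"]

def binarize(label_sets):
    """label_sets: iterable of lists/sets of label strings -> 0/1 rows."""
    return [[int(label in set(s)) for label in LABELS] for s in label_sets]

pred = json.loads('{"issue_type": ["102", "103"]}')
print(binarize([pred["issue_type"]])[0])   # -> [0, 1, 1, 0, 0]
```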

๐Ÿง‘โ€โš–๏ธ Citation

If you use this repository or dataset, please cite:

@inproceedings{jang2025pilotbench,
  title     = {PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks},
  author    = {Yehoon Jang and Chaewon Lee and Hyun-seok Min and Sungchul Choi},
  year      = {2025},
  booktitle = {Proceedings of the EMNLP 2025 (NLLP Workshop)},
  url       = {https://github.com/TeamLab/pilot-bench}
}

โš–๏ธ License

Released under CC BY 4.0 for research and educational purposes only.
This repository and dataset must not be used to provide or automate legal advice, adjudication, or PTAB decision-making.


💬 Contact

For research inquiries or collaborations:

Yehoon Jang   : [email protected]  
Chaewon Lee   : [email protected]
Hyun-seok Min : [email protected]  
Sungchul Choi : [email protected]

🧩 Acknowledgments

This work was supported by

  • National Research Foundation of Korea (NRF) – Grant No. RS-2024-00354675 (70%)
  • IITP (ICT Challenge and Advanced Network of HRD) – Grant No. IITP-2023-RS-2023-00259806 (30%)
    under the supervision of the Ministry of Science and ICT (MSIT), Korea.
