Nondeterministic Polynomial Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

To build a successful benchmark for LLMs, we propose Nondeterministic Polynomial-time Problem Challenge (NPPC), which possesses the ever-scalingness with the following four desiderata for a benchmark:

uncrushable, which requires the scaling over the complexity, i.e., the benchmark can generate the problems with continually increasing complexities;
unhackable, which requires the scaling over instances, i.e., the benchmark can generate an infinite number of instances to avoid the exploitation;
auto-verifiable, which requires the scaling over the scalable oversight, i.e., the benchmark can verify the correctness of the solutions efficiently for the problems with any complexity;
general, which requires the scaling over the coverage, i.e., the problems covered by the benchmark should be highly relevant to the real-world problems, rather than some puzzles or rare problems.

To the best of our knowledge, NPPC is the first ever-scaling reasoning benchmark of LLMs, serving as the uncrushable and unhackable testbed for LLMs towards artificial general intelligence (AGI).

Outline

NPPC: An Ever-Scaling Reasoning Benchmark for LLMs
- Outline
- Overview
  - npgym
  - npsolver
  - npeval
- Tutorial
- File Structure
- License

Overview

We introduce the Nondeterministic Polynomial Problem Challenge (NPPC) , an ever-scaling reasoning benchmark for LLMs. There are three main components in NPPC:

npgym, which provides a unified interface of 25 well-known NPC problems and can generate any number of instances and verify the solution with any levels of complexities;
npsolver: which provides a unified interface to evaluate the problem instances with both online and offline models via APIs and local deployments, respectively, to facilitate the users to evaluate their own models;
npeval: which provides the comprehensive and ready-to-use tools to analyze the performances of LLMs over problems, the number of tokens, the ``aha moments'', the reasoning and solution errors, providing the in-depth analysis of the LLMs' reasoning capabilities.

npgym

npgym generates the instance $I$ with the given configuration, and the LLM receives the instance and generate the solution $s$, then the solution is verified by npgym with the output ${\textbf{true}, \textbf{false}}$.

Specifically, for each problem, npgym implements two functions: generate_instance(·) and verify_solution(·).

Different NPC problems exhibit distinct combinatorial structures and computational characteristics. To establish a standardized metric for quantifying the computational complexity of these problems, npgym implements the difficulty levels.

All the NPC problems and the difficulty level configurations are stored in config.py.

#	NPC Problem	Type	Functions	#Levels
1	3-Satisfiability (3SAT)	core	three_sat.py	10
2	Vertex Cover	core	vertex_cover.py	10
3	3-Dimensional Matching (3DM)	core	three_dimension_matching.py	10
4	Travelling Salesman (TSP)	core	tsp.py	10
5	Hamiltonian Cycle	core	hamiltonian_cycle.py	10
6	Graph 3-Colourability (3-COL)	core	graph_three_colorability.py	10
7	Bin Packing	core	bin_packing.py	10
8	Maximum Leaf Spanning Tree	core	max_leaf_span_tree.py	10
9	Quadratic Diophantine Equations (QDE)	core	quad_diop_equ.py	10
10	Minimum Sum of Squares	core	min_sum_square.py	10
11	Shortest Common Superstring	core	shortest_common_superstring.py	10
12	Bandwidth	core	bandwidth.py	10
13	Clique	extension	clique.py	13
14	Independent Set	extension	independent_set.py	13
15	Dominating Set	extension	domninating_set.py	12
16	Set Splitting	extension	set_splitting.py	11
17	Set Packing	extension	set_packing.py	11
18	Exact Cover by 3-Sets (X3C)	extension	exact_cover.py	10
19	Minimum Cover	extension	minimum_cover.py	16
20	Partition	extension	partition.py	13
21	Subset Sum	extension	subset_sum.py	13
22	Hitting String	extension	hitting_string.py	10
23	Quadratic Congruences	extension	quadratic_congruence.py	10
24	Betweenness	extension	betweenness.py	6
25	Clustering	extension	clustering.py	10

npsolver

npsolver uses a simple and consistent prompt template across all problems that includes: problem description, context examples, target instance, and solution format requirements in JSON. The solver suite handles prompt generation, LLM completion (via LiteLLM for online APIs and vLLM for offline models), solution extraction using regex parsing, and standardized error reporting. This unified interface enables seamless evaluation across different LLM deployment types.

The nppc_template and all the NPC problems' definations are stored in prompt.py.

npeval

npeval implements a statistically sound sampling strategy by evaluating 30 instances per seed across multiple seeds for each difficulty level. The suite offers four key performance measures: inter-quantile mean (IQM), mean, median, and optimality gap, along with stratified bootstrap confidence intervals for reliable interval estimation. NPEval analyzes both prompt and completion tokens across problems and difficulty levels, tracks "aha moments" in reasoning processes, and examines both solution verification errors from npgym and reasoning errors in LLMs' internal problem-solving processes. This systematic approach enables thorough performance analysis while managing computational costs and accounting for randomness in both instance generation and LLM responses.

Tutorial

Configure API Keys

Put your API keys into the folder api_keys

nppc/
├── api_keys/                
│   ├── openai_api_key.txt/
│   ├── huoshan_api_key.txt/             
│   ├── deepseek_api_key.txt/
|   ├── claude_api_key.txt/
│   └── maas_api_key.txt

Run Evaluation

python3 main_nppc_final.py --seed 64 --model claude --problem 11 --result_folder results

Argument	Definition	Values
--seed	seed
--model	name for LLM	example models
--problem	the problem name idx	configs
--level	difficulty level	configs
--n_shots	number of in-context examples
--n_trials	number of trials for each level
--batch_size	the problem name
--max_tries	max tries for one batch
--result_folder	folder path to store the results
--verbose	verbose output

Plot Results

cd plottings
python main_performance_all.py

File Structure

nppc/
├── npgym/
│   ├── npc/                
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── betweenness.py
│   │   ├── bin_packing.py
│   │   ├── bipartite_subgraph.py
│   │   ├── clique.py
│   │   ├── clustering.py
│   │   ├── cubic_subgraph.py
│   │   ├── dominating_set.py
│   │   ├── edge_colouring.py
│   │   ├── exact_cover.py
│   │   ├── graph_three_colorability.py
│   │   ├── hamiltonian_cycle.py
│   │   ├── hamming_center.py
│   │   ├── hitting_string.py
│   │   ├── independent_set.py
│   │   ├── k_closure.py
│   │   ├── kernel.py
│   │   ├── longest_circuits.py
│   │   ├── longest_path.py
│   │   ├── max_leaf_span_tree.py
│   │   ├── min_sum_square.py
│   │   ├── mini_test_set.py
│   │   ├── minimum_cover.py
│   │   ├── multiprocessor_scheduling.py
│   │   ├── n_queen_completion.py
│   │   ├── open_hemisphere.py
│   │   ├── optimal_linear_arrangement.py
│   │   ├── partition_into_triangle.py
│   │   ├── partition.py
│   │   ├── quad_diop_equ.py
│   │   ├── quadratic_congruence.py
│   │   ├── rural_postman.py
│   │   ├── set_packing.py
│   │   ├── set_splitting.py
│   │   ├── shortest_common_superstring.py
│   │   ├── simulanteous_incongruences.py
│   │   ├── solitarie.py
│   │   ├── subset_product.py
│   │   ├── subset_sum.py
│   │   ├── three_dimension_matching.py
│   │   ├── three_partition.py
│   │   ├── three_sat.py
│   │   ├── tsp.py
│   │   ├── vertex_cover.py
│   │   └── x3c.py
│   └── nphard/            
│   │   ├── __init__.py
│   │   └── bin_packing.py
│   ├── __init__.py
│   ├── configs.py
│   └── core.py
├── npsolver/              
│   ├── __init__.py    
│   ├── prompt.py
│   └── solver.py
├── npeval/                
│   ├── __init__.py    
│   ├── interval.py
│   ├── library.py
│   ├── metrics.py
│   └── plot_utils.py
├── plottings/
│   ├── __init__.py
│   ├── main_aha.py
│   ├── main_error_type.py
│   ├── main_full_results.py
│   ├── main_interval.py
│   ├── main_performance_all.py
│   ├── main_rank.py
│   └── main_tokens.py
├── scripts/          
│   └── gen_scripts.py
├── README.MD
├── eval.sh
└── main_nppc_final.py

License

This project is licensed under the MIT License - see the LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Nondeterministic Polynomial Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

Outline

Overview

npgym

npsolver

npeval

Tutorial

Configure API Keys

Run Evaluation

Plot Results

File Structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 214 Commits
assets		assets
npeval		npeval
npgym		npgym
npsolver		npsolver
plottings		plottings
scripts		scripts
supports		supports
LICENSE		LICENSE
README.MD		README.MD
main_nppc_final.py		main_nppc_final.py

Folders and files

Latest commit

History

Repository files navigation

Nondeterministic Polynomial Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

Outline

Overview

npgym

npsolver

npeval

Tutorial

Configure API Keys

Run Evaluation

Plot Results

File Structure

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages