A Python library for comparing forecast accuracy between different models using statistical tests.
- Five Test Variants: Diebold-Mariano (standard & Harvey-corrected), Clark-West, Sign Test, and Wilcoxon Signed-Rank
- Flexible API: Initialize with predictions or provide them dynamically
- Automatic Warnings: Sample size recommendations for each test
- Error Metrics: Built-in RMSE, MAE, and MSE calculations
- Clean Output: Formatted test results with significance indicators
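The built-in error metrics reduce to simple NumPy reductions. A minimal sketch of equivalent formulas (illustrative only, not the library's actual implementation):

```python
import numpy as np

def mse(actual, pred):
    """Mean squared error."""
    a, p = np.asarray(actual), np.asarray(pred)
    return np.mean((a - p) ** 2)

def rmse(actual, pred):
    """Root mean squared error."""
    return np.sqrt(mse(actual, pred))

def mae(actual, pred):
    """Mean absolute error."""
    a, p = np.asarray(actual), np.asarray(pred)
    return np.mean(np.abs(a - p))
```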
```shell
pip install -r requirements.txt
```

Requirements:
- numpy >= 1.21.0
- scipy >= 1.7.0
```python
from src.evaluation import Evaluation

# Your data
actual = [100.5, 102.3, 101.8, 103.5, 105.2, ...]
pred1 = [100.2, 102.1, 101.5, 103.8, 105.0, ...]  # Advanced model
pred2 = [100.0, 100.5, 102.3, 101.8, 103.5, ...]  # Baseline model

# Initialize the evaluator (avoid the name `eval`, which shadows a builtin)
evaluator = Evaluation(actual, pred1, pred2,
                       model1_name='Advanced Model',
                       model2_name='Baseline')

# Check error metrics
evaluator.summary()

# Run statistical tests
evaluator.evaluate('diebold_mariano_test', h=1, harvey_correction=True)
evaluator.evaluate('clark_west_test', h=1)
evaluator.evaluate('sign_test')
evaluator.evaluate('wilcoxon_test')
```

Use when: Comparing non-nested models
Sample size: T ≥ 30 (or use harvey_correction=True for T < 30)
```python
evaluator.evaluate('diebold_mariano_test',
                   h=1,                     # Forecast horizon
                   loss='mse',              # Loss function: 'mse' or 'mae'
                   harvey_correction=True)  # Small-sample correction
```

Use when: Comparing nested models (e.g., restricted vs. unrestricted)
Sample size: T ≥ 30
```python
evaluator.evaluate('clark_west_test', h=1)
```

Use when: Very small samples, no distributional assumptions needed
Sample size: Any (even T < 10)
Note: Low statistical power
```python
evaluator.evaluate('sign_test')
```

Use when: Small/medium samples with non-normal errors
Sample size: T ≥ 10-20
```python
evaluator.evaluate('wilcoxon_test')
```

| Sample Size | Normal Errors | Non-Normal Errors | Nested Models |
|---|---|---|---|
| T < 10 | DM (Harvey) | Sign Test | N/A |
| 10 ≤ T < 30 | DM (Harvey) | Wilcoxon | N/A |
| T ≥ 30 | DM | DM / Wilcoxon | Clark-West |
Legend: DM = Diebold-Mariano, Harvey = harvey_correction=True
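For intuition, the Diebold-Mariano statistic with the Harvey-Leybourne-Newbold small-sample correction selected in the table above can be sketched in a few lines of NumPy/SciPy. This is an illustrative standalone function (the `dm_test` name and its exact interface are hypothetical, not the library's implementation):

```python
import numpy as np
from scipy import stats

def dm_test(actual, pred1, pred2, h=1, harvey_correction=True):
    """Two-sided Diebold-Mariano test on squared-error loss (sketch)."""
    actual, pred1, pred2 = map(np.asarray, (actual, pred1, pred2))
    # Loss differential: negative values favor model 1
    d = (actual - pred1) ** 2 - (actual - pred2) ** 2
    T = d.size
    d_bar = d.mean()
    # Long-run variance of d_bar: autocovariances up to lag h-1
    gamma = [np.sum((d[k:] - d_bar) * (d[:T - k] - d_bar)) / T
             for k in range(h)]
    var_d_bar = (gamma[0] + 2 * sum(gamma[1:])) / T
    dm = d_bar / np.sqrt(var_d_bar)
    if harvey_correction:
        # HLN adjustment; compare against t(T-1) instead of N(0, 1)
        dm *= np.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)
        p_value = 2 * stats.t.sf(abs(dm), df=T - 1)
    else:
        p_value = 2 * stats.norm.sf(abs(dm))
    return dm, p_value
```

A negative statistic indicates that model 1 has the smaller squared-error loss; significance is read off the two-sided p-value.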
You can evaluate different model pairs without reinitializing:
```python
# Initialize with just the actual values
evaluator = Evaluation(actual)

# Compare different model pairs on the fly
evaluator.evaluate('diebold_mariano_test',
                   pred1=model_a_pred,
                   pred2=model_b_pred,
                   model1_name='Model A',
                   model2_name='Model B',
                   h=12,
                   harvey_correction=True)
```

- example.py: Command-line examples with synthetic data
- example.ipynb: Interactive Jupyter notebook with visualizations
| Indicator | Meaning |
|---|---|
| `***` | p < 0.01 (highly significant) |
| `**` | p < 0.05 (significant) |
| `*` | p < 0.10 (marginally significant) |
| `n.s.` | p ≥ 0.10 (not significant) |
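The mapping above can be expressed as a small helper; the `significance_label` function below is hypothetical (not part of the library's API), shown only to make the thresholds explicit:

```python
def significance_label(p):
    """Map a p-value to the significance indicator used in the output."""
    if p < 0.01:
        return '***'
    if p < 0.05:
        return '**'
    if p < 0.10:
        return '*'
    return 'n.s.'
```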