gianluigilopardo/forecast-evaluation
Forecast Evaluation Framework

A Python library for comparing forecast accuracy between different models using statistical tests.

Features

  • Five Test Variants: Diebold-Mariano (standard & Harvey-corrected), Clark-West, Sign Test, and Wilcoxon Signed-Rank
  • Flexible API: Initialize with predictions or provide them dynamically
  • Automatic Warnings: Sample size recommendations for each test
  • Error Metrics: Built-in RMSE, MAE, and MSE calculations
  • Clean Output: Formatted test results with significance indicators

Installation

pip install -r requirements.txt

Requirements:

  • numpy >= 1.21.0
  • scipy >= 1.7.0

Quick Start

from src.evaluation import Evaluation

# Your data
actual = [100.5, 102.3, 101.8, 103.5, 105.2, ...]
pred1 = [100.2, 102.1, 101.5, 103.8, 105.0, ...]  # Advanced model
pred2 = [100.0, 100.5, 102.3, 101.8, 103.5, ...]  # Baseline model

# Initialize evaluator
evaluator = Evaluation(actual, pred1, pred2,
                       model1_name='Advanced Model',
                       model2_name='Baseline')

# Check error metrics
evaluator.summary()

# Run statistical tests
evaluator.evaluate('diebold_mariano_test', h=1, harvey_correction=True)
evaluator.evaluate('clark_west_test', h=1)
evaluator.evaluate('sign_test')
evaluator.evaluate('wilcoxon_test')
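
The error metrics reported by `summary()` (RMSE, MAE, MSE) are standard point-forecast measures. A minimal sketch of the arithmetic, assuming NumPy; the helper name `error_metrics` is illustrative, not part of the library:

```python
import numpy as np

def error_metrics(actual, pred):
    """Compute standard point-forecast error metrics (illustrative helper)."""
    err = np.asarray(actual, dtype=float) - np.asarray(pred, dtype=float)
    mse = float(np.mean(err ** 2))               # mean squared error
    return {"MSE": mse,
            "RMSE": mse ** 0.5,                  # root mean squared error
            "MAE": float(np.mean(np.abs(err)))}  # mean absolute error
```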

Available Tests

1. Diebold-Mariano Test

Use when: Comparing non-nested models
Sample size: T ≥ 30 (or use harvey_correction=True for T < 30)

evaluator.evaluate('diebold_mariano_test',
                   h=1,                     # Forecast horizon
                   loss='mse',              # Loss function: 'mse' or 'mae'
                   harvey_correction=True)  # Small-sample correction
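
For intuition about what this computes, here is a self-contained sketch of the textbook DM statistic with the Harvey, Leybourne & Newbold (1997) small-sample correction. This is not the library's source; `dm_test` and the exact long-run variance estimator (truncated autocovariances up to lag h−1) are assumptions:

```python
import numpy as np
from scipy import stats

def dm_test(actual, pred1, pred2, h=1, loss="mse", harvey_correction=False):
    """Diebold-Mariano test of equal predictive accuracy (sketch)."""
    actual, pred1, pred2 = (np.asarray(a, dtype=float)
                            for a in (actual, pred1, pred2))
    e1, e2 = actual - pred1, actual - pred2
    # Loss differential: negative values mean model 1 has the lower loss
    d = e1**2 - e2**2 if loss == "mse" else np.abs(e1) - np.abs(e2)
    T = len(d)
    d_bar = d.mean()
    # Long-run variance of d: autocovariances up to lag h-1 (rectangular window)
    gamma = [np.sum((d[:T - k] - d_bar) * (d[k:] - d_bar)) / T for k in range(h)]
    lrv = gamma[0] + 2.0 * sum(gamma[1:])
    dm = d_bar / np.sqrt(lrv / T)
    if harvey_correction:
        # Harvey, Leybourne & Newbold (1997) small-sample adjustment;
        # compare against a Student-t with T-1 degrees of freedom
        dm *= np.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)
        p_value = 2 * stats.t.sf(abs(dm), df=T - 1)
    else:
        p_value = 2 * stats.norm.sf(abs(dm))
    return dm, p_value
```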

2. Clark-West Test

Use when: Comparing nested models (e.g., restricted vs unrestricted)
Sample size: T ≥ 30

evaluator.evaluate('clark_west_test', h=1)
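
A sketch of the Clark & West (2007) adjusted statistic, assuming model 1 is the restricted model and using a simple (non-HAC) standard error, which suits h = 1; the function name and signature are illustrative, not the library's API:

```python
import numpy as np
from scipy import stats

def clark_west_test(actual, pred_restricted, pred_unrestricted):
    """Clark-West test for nested forecast comparisons (sketch, h = 1)."""
    y, yr, yu = (np.asarray(a, dtype=float)
                 for a in (actual, pred_restricted, pred_unrestricted))
    er, eu = y - yr, y - yu
    # Adjusted loss differential: the (yr - yu)^2 term removes the
    # estimation-noise penalty the larger (unrestricted) model carries
    f = er**2 - (eu**2 - (yr - yu)**2)
    T = len(f)
    se = f.std(ddof=1) / np.sqrt(T)  # h > 1 would need a HAC estimator here
    cw = f.mean() / se
    p_value = stats.norm.sf(cw)      # one-sided: H1 = unrestricted model better
    return cw, p_value
```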

3. Sign Test (Non-parametric)

Use when: Very small samples, no distribution assumptions needed
Sample size: Any (even T < 10)
Note: Low statistical power

evaluator.evaluate('sign_test')
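
The sign test reduces to a binomial test on how often one model's absolute error is smaller. A sketch using `scipy.stats.binomtest` (the wrapper name is illustrative):

```python
import numpy as np
from scipy.stats import binomtest

def sign_test(actual, pred1, pred2):
    """Binomial sign test on which model has the smaller error (sketch)."""
    e1 = np.abs(np.asarray(actual, dtype=float) - np.asarray(pred1, dtype=float))
    e2 = np.abs(np.asarray(actual, dtype=float) - np.asarray(pred2, dtype=float))
    diff = e1 - e2
    diff = diff[diff != 0]            # drop exact ties
    wins = int(np.sum(diff < 0))      # periods where model 1 is more accurate
    # Under H0 (equal accuracy), wins ~ Binomial(n, 0.5)
    return binomtest(wins, n=len(diff), p=0.5).pvalue
```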

4. Wilcoxon Signed-Rank Test (Non-parametric)

Use when: Small or medium samples with non-normal errors
Sample size: T ≥ 10 (ideally 20 or more)

evaluator.evaluate('wilcoxon_test')
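
A sketch of applying `scipy.stats.wilcoxon` to the per-period loss differential (an illustrative wrapper, not the library's code):

```python
import numpy as np
from scipy.stats import wilcoxon

def wilcoxon_forecast_test(actual, pred1, pred2, loss="mse"):
    """Wilcoxon signed-rank test on the loss differential (sketch)."""
    e1 = np.asarray(actual, dtype=float) - np.asarray(pred1, dtype=float)
    e2 = np.asarray(actual, dtype=float) - np.asarray(pred2, dtype=float)
    d = e1**2 - e2**2 if loss == "mse" else np.abs(e1) - np.abs(e2)
    # H0: the loss differential is symmetric around zero
    stat, p_value = wilcoxon(d)
    return stat, p_value
```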

Test Selection Guide

| Sample Size | Normal Errors | Non-Normal Errors | Nested Models |
|-------------|---------------|-------------------|---------------|
| T < 10      | DM (Harvey)   | Sign Test         | N/A           |
| 10 ≤ T < 30 | DM (Harvey)   | Wilcoxon          | N/A           |
| T ≥ 30      | DM            | DM / Wilcoxon     | Clark-West    |

Legend: DM = Diebold-Mariano, Harvey = harvey_correction=True
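
The selection table can be encoded as a small lookup helper; `recommend_test` is purely illustrative and not part of the library:

```python
def recommend_test(T, normal_errors=True, nested=False):
    """Return a test name following the selection table (illustrative)."""
    if nested:
        # The table only covers nested comparisons for T >= 30
        return "clark_west_test" if T >= 30 else None
    if T >= 30:
        return "diebold_mariano_test"
    if normal_errors:
        return "diebold_mariano_test (harvey_correction=True)"
    return "wilcoxon_test" if T >= 10 else "sign_test"
```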

Dynamic Evaluation

You can evaluate different model pairs without reinitializing:

# Initialize with just actual values
evaluator = Evaluation(actual)

# Compare different models on the fly
evaluator.evaluate('diebold_mariano_test',
                   pred1=model_a_pred,
                   pred2=model_b_pred,
                   model1_name='Model A',
                   model2_name='Model B',
                   h=12,
                   harvey_correction=True)

Examples

  • example.py: Command-line examples with synthetic data
  • example.ipynb: Interactive Jupyter notebook with visualizations

Output Interpretation

P-value Significance Levels

  • *** p < 0.01 (highly significant)
  • ** p < 0.05 (significant)
  • * p < 0.10 (marginally significant)
  • n.s. p ≥ 0.10 (not significant)
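
These thresholds map to markers as in the following sketch (the helper name is illustrative):

```python
def significance_label(p_value):
    """Map a p-value to the significance marker used in the output."""
    if p_value < 0.01:
        return "***"   # highly significant
    if p_value < 0.05:
        return "**"    # significant
    if p_value < 0.10:
        return "*"     # marginally significant
    return "n.s."      # not significant
```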
