Allora Research - Latest posts
https://research.allora.network

CZAR loss function for returns prediction topics

Yes indeed. Very small gradients and hessians can lead to some rather unwanted behaviour in gradient boosting models (due to Newton descent during boosting iterations), so alpha=1 is more stable during training. Small alpha values result in a very large change in the gradient/hessian function between the positive and negative directions.

This is a bit different from inference synthesis or model validation/evaluation, where large gradients can overly punish predictions in the wrong direction (relative to the reward for getting the direction right). So in these cases we want alpha ~ 0. It’s also preferable to use the logarithm of the loss in these cases because it provides better contrast for accurate predictions (see the true vs predicted returns figures above), but it cannot be used for training as log(loss) is not a convex function.
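The non-convexity of log(loss) is easy to see with a stand-in convex loss (a minimal sketch; the quadratic loss and the small offset here are illustrative, not the CZAR form):

```python
import math

# Stand-in convex loss: squared error about a true value of zero (not the CZAR form)
def loss(z):
    return z**2 + 1e-6  # tiny offset keeps log(loss) finite at the minimum

a, b = 0.5, 3.0
mid = 0.5 * (a + b)

# loss is convex: the value at the midpoint lies below the chord
assert loss(mid) <= 0.5 * (loss(a) + loss(b))

# log(loss) violates the same inequality, so it is not convex
assert math.log(loss(mid)) > 0.5 * (math.log(loss(a)) + math.log(loss(b)))
```

Since a convex function composed with the concave logarithm need not stay convex, Newton-style boosting steps on log(loss) are not guaranteed to descend.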

I’ll also add here that we have slightly modified the gradient and hessian functions (mainly in “Region 2” between zero and the true value) to achieve a smooth function for gradient boosting. This doesn’t affect cases where we only use the loss function (inference synthesis, model validation/evaluation).

Without this, boosting steps would increase when approaching the true value from zero:

https://research.allora.network/t/czar-loss-function-for-returns-prediction-topics/155#post_5 Mon, 09 Feb 2026 02:38:51 +0000 research.allora.network-post-375
CZAR loss function for returns prediction topics

This is great @joel! So for model training we set alpha=1? I suppose there might be a difference between training and validation steps?

https://research.allora.network/t/czar-loss-function-for-returns-prediction-topics/155#post_4 Sun, 08 Feb 2026 09:28:55 +0000 research.allora.network-post-374
CZAR loss function for returns prediction topics

The CZAR (Composite Zero-Agnostic Return) loss function addresses all of these points.
The gradient function 1 / (1 + x^2) was chosen to approximately match the ZPTAE function, but the concave parts of the ZPTAE function are replaced with quadratic functions of varying gradients so that it remains convex (hessian > 0) and can be used for model training. For predictions in the correct direction, the gradients decrease with increasing true values, so that overestimates are punished less than underestimates. There is one parameter, alpha, which controls the value of the constant hessian; it has no effect on inference synthesis results so long as it is small (<~0.1; the default is 0.01 to help suppress the influence of very large outliers). alpha applies a horizontal shift to the arctangent function so that we get a smooth transition in the gradient and hessian functions across y=0. alpha=0 gives linear behaviour for large errors, and alpha=1 gives the maximum constant for the MSE term. This simultaneously addresses Points 1, 2, 4 and 5.

The softening normalisation factor is constrained by the requirement that the loss for a predicted return = 0 does not decrease as the absolute true value increases. The solid black line shows the function that keeps the loss at predicted return=0 constant, but this becomes negative when true returns > epsilon (the softening scale at true returns = 0). Therefore we pass the ‘ideal’ normalisation through a hinge function (softplus) to prevent negative normalisation, with the sharpness of the hinge parameterised by tau. This addresses Point 3.
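The softplus hinge can be sketched in isolation (a minimal illustration; the implementation in this thread normalises the softplus output slightly differently): tau sets the sharpness, and as tau → 0 the hinge approaches a hard max(x, 0).

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x))
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def hinge(x, tau):
    # Soft version of max(x, 0); sharpness controlled by tau
    return tau * softplus(x / tau)

x = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
sharp = hinge(x, tau=1e-3)   # nearly hard: ~max(x, 0)
smooth = hinge(x, tau=0.5)   # rounded transition near zero

assert np.allclose(sharp, np.maximum(x, 0.0), atol=1e-2)
assert np.all(smooth >= np.maximum(x, 0.0))  # softplus upper-bounds the hard hinge
```

This is what prevents the 'ideal' normalisation from going negative while keeping the transition smooth.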

This leaves us with two free parameters, epsilon (softening scale) and tau (hinge scale), which we can optimise by comparing the mean log losses for the ‘same PDF’ model (random sampling from true returns PDF) and a model predicting constant returns=0. The difference between the models is minimal for epsilon ~ 0.75-1 and tau <~ 0.1. For simplicity and to avoid sharp transitions in the normalisation, we therefore chose default values of epsilon=1 and tau=0.05.

We can visualise the different behaviour of the ZPTAE and CZAR loss functions by comparing colourmaps of the loss (left panels) and log loss (right panels) as a function of the true and predicted returns (for standard deviation=1). Both are asymmetric functions where the region of low losses increases as |true_returns| increases and predictions in the wrong direction receive the highest losses. However, CZAR has higher losses than ZPTAE when true returns are close to the mean (= 0), so provides less benefit to predictions that are easier to make.

Alternatively, comparing loss versus the predicted return for different true values highlights the steeper behaviour of CZAR (solid lines) compared to ZPTAE (dotted lines) for predictions with large errors from the true values, depending on the chosen value of alpha.

https://research.allora.network/t/czar-loss-function-for-returns-prediction-topics/155#post_3 Fri, 06 Feb 2026 14:55:34 +0000 research.allora.network-post-373
CZAR loss function for returns prediction topics

As detailed in previous discussions about losses in returns prediction topics, we identified issues with the ZTAE (Z-score Tanh Absolute Error) loss function where losses saturate at extreme inference values. The ZPTAE loss (Z-score Power-Tanh Absolute Error) was introduced to address this by replacing the hyperbolic tangent with a modified version that transitions to a power law at large values, preventing saturation for outliers. However, further testing revealed additional considerations for inference synthesis that motivated the development of CZAR loss.

In returns prediction topics, the network often puts too much weight on constant predictions with values near the mean (similarly in machine learning model training). The most obvious illustration of this is the fact that predicting zero returns is technically better than drawing predictions from the same PDF as the ground truth (a direct consequence of increasing the dimensionality of the problem):

Another way to visualise this is by comparing the integrated expected log loss (since inference synthesis uses log loss) for a Gaussian “true” returns distribution as a function of constant predicted returns values for various loss functions. Common loss functions (e.g. MSE, MAE, etc.) have a clear minimum in the expected log loss at predicted returns=0 (solid lines). Circle points indicate the integrated loss for a model that randomly draws predictions from the same PDF as the true returns distribution. In general, most loss functions clearly favour predicting zero over drawing from the same PDF.
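The zero-vs-same-PDF effect is easy to reproduce for MSE (a minimal sketch with synthetic Gaussian returns, not the network's actual data): predicting the constant mean gives an expected squared error of sigma^2, while an independent draw from the true PDF gives 2*sigma^2.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
y_true = rng.normal(0.0, sigma, 100_000)

# Constant model: always predict the mean (zero returns)
mse_zero = np.mean(y_true**2)             # -> sigma^2

# 'Same PDF' model: independent draws from the true distribution
y_same = rng.normal(0.0, sigma, y_true.size)
mse_same = np.mean((y_true - y_same)**2)  # -> 2 * sigma^2

assert mse_zero < mse_same
assert abs(mse_same / mse_zero - 2.0) < 0.1
```

The same doubling of the error variance is what penalises honest sampling-based models under most standard losses.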

ZTAE/ZPTAE are better loss functions in this regard because they are designed to better reward close predictions when the true value is far from the mean, and indeed, their expected log loss is flatter near zero. This is due to the way the asymmetry of the loss functions shifts depending on the true return value. However, both loss functions have issues:

  • ZTAE flattens for large values, meaning extreme outliers can receive relatively low losses if they are in the right direction.
  • ZPTAE was created to address the issue with ZTAE, but it (like ZTAE) is largely a concave function (hessian < 0), so it cannot be used for model training.

The zero-returns issue can be somewhat alleviated by adding a constant ‘smoothing’ term to the loss function (dashed lines in the above figure). It increases the integrated log loss most at predicted returns = 0, but doesn’t sufficiently flatten the integrated expected losses to disfavour predicting the mean.

We can take this idea further by introducing smoothing that scales inversely with the absolute true returns value, i.e. the smoothing is maximal at the mean true value and decreases as the absolute value of the true return increases. Applying this to the ZPTAE loss function shows how it results in a non-zero floor in the loss for true returns near zero:

Adding this to the integrated log loss test (zptae_scaled_smooth), we see it is so effective that there is now a local peak at returns=0, and the curve is otherwise very flat between +/- 1. So with adaptive smoothing we can level the playing field between a zero-returns model and a ‘same PDF’ model (dots) in the network.
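A hypothetical sketch of such an adaptive smoothing term (the exponential profile and the constants here are purely illustrative, not the form CZAR actually uses): the added term is maximal at z_true = 0 and decays as |z_true| grows.

```python
import numpy as np

def adaptive_smoothing(z_true, c=0.5, scale=1.0):
    # Illustrative profile only: maximal at the mean, decaying with |z_true|
    return c * np.exp(-np.abs(z_true) / scale)

z = np.linspace(-3, 3, 7)
smooth = adaptive_smoothing(z)

assert smooth.argmax() == 3             # peak at z_true = 0 (middle of the grid)
assert np.all(np.diff(smooth[3:]) < 0)  # strictly decreasing for z_true > 0
```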

To summarise, we want a loss function for returns topics that:

  1. Is asymmetric and rewards predictions when the true value is far from the mean (ZTAE/ZPTAE-like)

  2. Trends to ~infinity for large differences in predicted and true values to adequately handle large outliers

  3. Has adaptive softening to down-weight constant returns ~= 0 models

  4. Is convex, so can be used for training models

We tested versions of asymmetric linear and quadratic functions (by modifying the gradients for predictions on opposite sides of the true value) and a sigmoid gradient function (that shifts vertically depending on the true value), both with adaptive smoothing, but found they did not outperform the ZPTAE function. We attributed this to the steepness of the loss function about the true value (i.e. the loss functions were too wide), so to the above points we can add:

  5. Has a sharp change in the gradient function about the true value (ZTAE/ZPTAE-like)
https://research.allora.network/t/czar-loss-function-for-returns-prediction-topics/155#post_2 Fri, 06 Feb 2026 14:26:24 +0000 research.allora.network-post-372
CZAR loss function for returns prediction topics

The CZAR (Composite Zero-Agnostic Return) loss function is designed to address limitations observed in previous loss formulations used for inference synthesis in returns prediction topics. This post documents the motivation, functional form, and recommended parameter settings for CZAR loss. This function will replace ZPTAE in returns prediction topics.

Full loss function, including gradient and hessian for model training

import numpy as np

def derivative(x):
    return 1.0 / (1.0 + x**2)

def antiderivative(x):
    return np.arctan(x)

def double_derivative(x):
    return 2.0 * np.abs(x) / (1.0 + x**2)**2

def eps_effective(eps, delta):
    # Rescale epsilon so that 1 - loss(z_true, 0) / loss(0, epsilon) crosses zero at epsilon
    if abs(delta) == 0:
        return np.arctan(eps)

    A = (1 + delta**2) * (antiderivative(eps + delta) - antiderivative(delta))
    beta = delta / (1 + delta**2)  # coefficient on eps_eff^2 in loss(0, eps_eff, 1)

    # Solve beta*x^2 + x - A = 0 for positive x
    return (-1 + np.sqrt(1 + 4 * beta * A)) / (2 * beta)

def softplus(x):
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def norm_smooth(z_true, eps, delta, tau):
    # Minimum value of the normalisation at z_true, set by the limit that loss(z_true,0)
    # does not decrease as z_true increases.
    # Simplified from: 1 - loss(z_true, 0) / loss(0, epsilon)
    a = np.abs(z_true)
    d2p1 = delta**2 + 1
    num = d2p1 * (antiderivative(a + delta) - antiderivative(delta))
    denom = eps + delta / d2p1 * eps**2
    norm_min = 1.0 - num / denom

    if tau <= 0:
        # Hard transition
        return np.maximum(norm_min, 0.0)

    # Smooth transition when norm drops below zero
    # Scale tau_eff by |norm_inf| so the asymptote is invariant across eps, delta
    # Asymptotic value of norm_min as |z_true| -> inf
    num_inf  = d2p1 * (0.5*np.pi - antiderivative(delta))
    norm_inf = 1.0 - num_inf / denom
    tau_eff = np.abs(tau) * np.abs(norm_inf)
    return softplus(norm_min / tau_eff) / softplus(1 / tau_eff)

def czar_loss(y_true, y_pred, std, mean=0, alpha=0.01, epsilon=1, tau=0.05):
    """
    Composite Zero-Agnostic Return Loss

    Asymmetric, piecewise function that is
        * Linear (alpha=0) or quadratic (alpha>0) when y_pred has opposite sign to y_true
        * Linear (alpha=0) or quadratic (alpha>0) when |y_pred| > |y_true|, with a decreasing gradient as |z_true| increases
        * Arctangent transition from 0 < |y_pred| < |y_true|

    Args:
        y_true: True returns
        y_pred: Predicted returns
        std: Standard deviation of true returns
        mean: Mean of true returns
        alpha: MSE term constant (alpha=0 is linear only, alpha=1 is maximum gradient)
        epsilon: Loss softening scale, in units of standard deviation. Optimum is eps~1
        tau: Scaling for softening hinge function
    Returns:
        Value of loss
    """

    if alpha < 0 or alpha > 1:
        raise ValueError(f'alpha must be between 0 and 1, got {alpha}')

    z_true = (y_true - mean) / std
    z_pred = (y_pred - mean) / std

    s = np.where(z_true == 0, 1, np.sign(z_true))
    s_pred = np.where(z_pred == 0, 1, np.sign(z_pred))
    a = np.abs(z_true)
    u = s * z_pred

    # Apply horizontal shift to function for smooth change in gradient
    # Alpha should be between 0 and 1. 1/sqrt(3) shifts to the peak of the hessian function
    delta = alpha / np.sqrt(3)
    d2p1 = delta**2 + 1

    d_true = z_true + s * delta
    d_pred = z_pred + s_pred * delta

    h1 = d2p1 * double_derivative(delta)
    h3 = d2p1 * double_derivative(d_true)

    # Region 1: opposite sign (u <= 0): grad = -s + MSE term
    # Constant so that the middle branch hits zero at z_pred = z_true
    C = s * d2p1 * (antiderivative(d_true) - antiderivative(s * delta))
    L1 = 0.5 * h1 * z_pred**2 - s * z_pred + C

    # Region 2: same sign, before threshold (0 < u <= a): grad = -s * derivative(z_pred)
    # antiderivative(d_true) term so that the middle branch hits zero at z_pred = z_true
    L2 = s * d2p1 * (antiderivative(d_true) - antiderivative(d_pred))

    # Region 3: past threshold (u > a): grad = s * derivative(z_true) + MSE term
    dz = z_pred - z_true
    L3 = 0.5 * np.minimum(h3, h1) * dz**2 + s * d2p1 * derivative(d_true) * dz

    # Softening term
    if epsilon > 0:
        eps_eff = eps_effective(epsilon, delta)
        softening_0 = czar_loss(0, eps_eff, 1., epsilon=0, alpha=alpha)
        norm = norm_smooth(z_true, eps_eff, delta, tau)
        Lsoft = norm * softening_0
    else:
        Lsoft = 0

    return np.where(u <= 0, L1, np.where(u <= a, L2, L3)) + Lsoft

def czar_gradient(y_true, y_pred, std, mean=0, alpha=1):
    z_true = (y_true - mean) / std
    z_pred = (y_pred - mean) / std

    s = np.where(z_true == 0, 1, np.sign(z_true))
    s_pred = np.where(z_pred == 0, 1, np.sign(z_pred))
    a = np.abs(z_true)
    u = s * z_pred

    # Apply horizontal shift to function for smooth change in gradient
    # Alpha should be between 0 and 1. 1/sqrt(3) shifts to the peak of the hessian function
    delta = alpha / np.sqrt(3)
    d2p1 = delta**2 + 1

    d_true = z_true + s * delta
    d_pred = z_pred + s_pred * delta

    h1 = d2p1 * double_derivative(delta)
    h3 = d2p1 * double_derivative(d_true)

    # Region 1: opposite sign (u <= 0): grad = -s + MSE term
    # Actual gradient:
    #   G1 = h1 * z_pred - s
    # Pseudo gradient for numerical stability:
    G1 = h1 * z_pred - np.sign(z_true)

    # Region 2: same sign, before threshold (0 < u <= a): grad = -s * derivative(z_pred)
    G2 = -s * d2p1 * derivative(d_pred)

    # Region 3: past threshold (u > a): grad = s * derivative(z_true) + MSE term
    # Actual gradient:
    #   G3 = np.minimum(h3, h1) * (z_pred - z_true) + s * d2p1 * derivative(d_true)
    # Pseudo gradient for numerical stability:
    G3 = np.minimum(h3, h1) * (z_pred - z_true)

    return np.where(u <= 0, G1, np.where(u <= a, G2, G3)) / std

def czar_hessian(y_true, y_pred, std, mean=0, alpha=1):
    z_true = (y_true - mean) / std
    z_pred = (y_pred - mean) / std

    s = np.where(z_true == 0, 1.0, np.sign(z_true))
    s_pred = np.where(z_pred == 0, 1.0, np.sign(z_pred))
    a = np.abs(z_true)
    u = s * z_pred

    # Alpha should be between 0 and 1. 1/sqrt(3) shifts to the peak of the hessian function
    delta = alpha / np.sqrt(3)
    d2p1 = delta**2 + 1

    d_true = s * (np.abs(z_true) + delta)
    d_pred = s_pred * (np.abs(z_pred) + delta)

    # Region 1: opposite sign (u <= 0): grad = -s + MSE term
    h1 = d2p1 * double_derivative(delta)
    H1 = np.full_like(d_pred, h1)

    # Region 2: same sign, before threshold (0 < u <= a): grad = -s * derivative(z_pred)
    # Actual hessian:
    #   H2 = double_derivative(d_pred) * d2p1
    # Pseudo hessian for numerical stability
    H2 = (1.0 + d_pred**2) * double_derivative(d_pred)

    # Region 3: past threshold (u > a): grad = s * derivative(z_true) + MSE term
    # Actual hessian:
    #   h3 = double_derivative(d_true) * d2p1
    # Consistent with H2 pseudo hessian:
    h3 = (1.0 + d_true**2) * double_derivative(d_true)
    H3 = np.full_like(d_pred, np.minimum(h1, h3))

    return np.where(u <= 0, H1, np.where(u <= a, H2, H3)) / std**2
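As a sanity check on the piecewise structure, the alpha=0, epsilon=0 limit (so delta=0 and the quadratic terms vanish) reduces, for z_true > 0, to a simple closed form that is continuous across the region boundaries and zero at the true value. This is my own reduction of the functions above, so treat it as a sketch:

```python
import numpy as np

def czar_alpha0(z_true, z_pred):
    # alpha=0, epsilon=0 limit of czar_loss, valid for z_true > 0
    if z_pred <= 0:                       # Region 1: wrong direction, linear
        return np.arctan(z_true) - z_pred
    if z_pred <= z_true:                  # Region 2: arctangent transition
        return np.arctan(z_true) - np.arctan(z_pred)
    return (z_pred - z_true) / (1 + z_true**2)  # Region 3: overshoot, linear

z_true = 1.5
assert czar_alpha0(z_true, z_true) == 0.0  # zero at the true value
# Continuous at z_pred = 0 (Region 1 -> 2 boundary)
assert abs(czar_alpha0(z_true, 0.0) - np.arctan(z_true)) < 1e-12
# Continuous at z_pred = z_true (Region 2 -> 3 boundary)
eps = 1e-8
assert abs(czar_alpha0(z_true, z_true + eps) - czar_alpha0(z_true, z_true - eps)) < 1e-6
```

The Region 3 slope 1/(1 + z_true^2) makes overshoots past large true values cheaper than undershoots, matching the asymmetry described above.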
https://research.allora.network/t/czar-loss-function-for-returns-prediction-topics/155#post_1 Fri, 06 Feb 2026 07:56:07 +0000 research.allora.network-post-371
Handling ground truth granularity

For anyone reading this wondering how to access the nonce time: if you use the Allora SDK, the SDK provides a nonce and then you call get_block_time(nonce, network_config) to get the corresponding time.

More details at allora_sdk on PyPI.

https://research.allora.network/t/handling-ground-truth-granularity/116#post_5 Tue, 04 Nov 2025 18:27:56 +0000 research.allora.network-post-366
Losses in returns prediction topics

Maybe useful to add now, because we use it so much these days: we also evaluate models by calculating the weighted ZPTAE improvement relative to predicting zero log-returns.

Using the above function power_tanh, this works as follows:

def wzptae_improvement(y_true, y_pred, alpha=0.25, beta=2):

    stdev = np.std(y_true)
    weights = np.abs(y_true)

    # Assuming mean=0 for log-returns
    pt_true = power_tanh(y_true/stdev, alpha=alpha, beta=beta)
    pt_zero = power_tanh(0/stdev, alpha=alpha, beta=beta)
    pt_pred = power_tanh(y_pred/stdev, alpha=alpha, beta=beta)

    zptae_baseline = np.sum(weights*np.abs(pt_true - pt_zero))/np.sum(weights)
    zptae_model = np.sum(weights*np.abs(pt_true - pt_pred))/np.sum(weights)

    return (zptae_baseline - zptae_model) / zptae_baseline * 100  # in %

The analogous calculation for the weighted RMSE improvement relative to predicting zero log-returns is:

def wrmse_improvement(y_true, y_pred):

    weights = np.abs(y_true)

    mse_zero = y_true**2
    mse_pred = (y_true - y_pred)**2

    wrmse_baseline = np.sqrt(np.sum(weights * mse_zero) / np.sum(weights))
    wrmse_model = np.sqrt(np.sum(weights * mse_pred) / np.sum(weights))

    return (wrmse_baseline - wrmse_model) / wrmse_baseline * 100 # in %
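As a worked check of the metric (the function is reproduced here so the snippet is self-contained): a model that always predicts half of the true return halves the weighted RMSE relative to predicting zero, i.e. a 50% improvement.

```python
import numpy as np

def wrmse_improvement(y_true, y_pred):
    # Weighted RMSE improvement relative to predicting zero log-returns (as above)
    weights = np.abs(y_true)
    mse_zero = y_true**2
    mse_pred = (y_true - y_pred)**2
    wrmse_baseline = np.sqrt(np.sum(weights * mse_zero) / np.sum(weights))
    wrmse_model = np.sqrt(np.sum(weights * mse_pred) / np.sum(weights))
    return (wrmse_baseline - wrmse_model) / wrmse_baseline * 100  # in %

y_true = np.array([0.02, -0.01, 0.03, -0.015])
assert abs(wrmse_improvement(y_true, 0.5 * y_true) - 50.0) < 1e-9
assert abs(wrmse_improvement(y_true, 0.0 * y_true)) < 1e-9  # zero model = baseline
```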
https://research.allora.network/t/losses-in-returns-prediction-topics/120#post_9 Tue, 04 Nov 2025 10:12:43 +0000 research.allora.network-post-365
Paper "Context-Aware Inference via Performance Forecasting in Decentralized Learning Networks" Thread for discussion of Context-Aware Inference via Performance Forecasting in Decentralized Learning Networks.

Optimizing Decentralized Online Learning for Supervised Regression and Classification Problems

ADI 2, 40-56; October 9, 2025

Joel Pfeffer, J. M. Diederik Kruijssen, Clément Gossart, Mélanie Chevance, Diego Campo Millan, Florian Stecker, Steven N. Longmore

In decentralized learning networks, predictions from many participants are combined to generate a network inference. While many studies have demonstrated performance benefits of combining multiple model predictions, existing strategies using linear pooling methods (ranging from simple averaging to dynamic weight updates) face a key limitation. Dynamic prediction combinations that rely on historical performance to update weights are necessarily reactive. Due to the need to average over a reasonable number of epochs (e.g. with moving averages or exponential weighting), they tend to be slow to adjust to changing circumstances (e.g. phase or regime changes). In this work, we develop a model that uses machine learning to forecast the performance of predictions by models at each epoch in a time series. This enables ‘context-awareness’ by assigning higher weight to models that are likely to be more accurate at a given time. We show that adding a performance forecasting worker in a decentralized learning network, following a design similar to the Allora network, can improve the accuracy of network inferences. Specifically, we find that forecasting models that predict regret (performance relative to the network inference) or regret z-score (performance relative to other workers) show greater improvement than models predicting losses, which often do not outperform the naive network inference (historically weighted average of all inferences). Through a series of optimization tests, we show that the performance of the forecasting model can be sensitive to choices in the feature set and number of training epochs. These properties may depend on the exact problem and should be tailored to each domain. Although initially designed for a decentralized learning network, using performance forecasting for prediction combination may be useful in any situation where predictive rather than reactive model weighting is needed.

https://research.allora.network/t/paper-context-aware-inference-via-performance-forecasting-in-decentralized-learning-networks/150#post_1 Mon, 06 Oct 2025 09:42:19 +0000 research.allora.network-post-363
Feature engineering experiments 1: add log-returns-focused features

Hey Steve, thanks a lot for the detailed suggestions!

Since I’m only using past data from the test batch, I don’t think rolling features would cause data leakage. Correct me if I’m wrong, but I think that issue would only arise if I included data that comes after the test batch.

As for the log-returns in the training data right before the test batch, that could indeed cause leakage in this setup. I’ve avoided this by inserting a gap between the training batch and the test batch. So, to be more precise, the scheme should actually look like this:

https://research.allora.network/t/feature-engineering-experiments-1-add-log-returns-focused-features/134#post_9 Wed, 01 Oct 2025 17:59:09 +0000 research.allora.network-post-362
Feature engineering experiments 1: add log-returns-focused features

@t-hossein @its_theday I really like how the two windowing approaches complement each other. Moving windows are great for testing adaptation and responsiveness to changing market conditions. Expanding windows are better for testing durability and stability over longer periods.

I wonder if there’s a potential hybrid approach? Use the expanding windows to learn how much history is needed to get a robust signal from features like skewness and kurtosis (since higher-order moments need more data to stabilise). Feed that insight into the moving-window scheme to improve responsiveness while still leveraging the shape-based features effectively. It’d be interesting to see if this could boost DA around volatility spikes and trend reversals, where responsiveness really matters.

https://research.allora.network/t/feature-engineering-experiments-1-add-log-returns-focused-features/134#post_8 Wed, 27 Aug 2025 11:32:49 +0000 research.allora.network-post-361
Feature engineering experiments 1: add log-returns-focused features

@its_theday your work is impressively thorough, especially backing up the ZPTAE improvement with a DM test. That gives a lot of weight to the result and is something I’d encourage everyone to do.

I’d be very interested to hear how the ablation tests went. Were you able to rank features or groups of features (lags-only, bandwidth-only, shape-only) by delta ZPTAE?

Cool to see potential signal in the higher-order statistical moments. Since skewness and kurtosis generally need more data to estimate robustly, I’m curious if you see their feature importance increasing as the training size grows across your expanding folds? Or does their contribution stay roughly constant?

[Oh, and similar to my previous post, just checking that the 24h embargo is longer than the rolling lookback time, i.e., embargo ≥ max(horizon, lookback)].

https://research.allora.network/t/feature-engineering-experiments-1-add-log-returns-focused-features/134#post_7 Wed, 27 Aug 2025 11:19:10 +0000 research.allora.network-post-360
Feature engineering experiments 1: add log-returns-focused features

@t-hossein great idea to create a common dataset which everyone can use to help compare model results!

https://research.allora.network/t/feature-engineering-experiments-1-add-log-returns-focused-features/134#post_6 Wed, 27 Aug 2025 10:11:48 +0000 research.allora.network-post-359
Feature engineering experiments 1: add log-returns-focused features

Really nice work @t-hossein. Thanks for sharing so much detail on your methodology. It really helps to understand and interpret the results.

I have a question about the moving-window setup and general data splits you were using. Since your prediction horizon is 24 hours and many of the engineered features (e.g. RSI, skew/kurtosis, Bollinger widths) use rolling lookbacks of up to ~1 day, there’s a subtle risk of data leakage at the train/validation boundaries if a purge gap isn’t applied inside each fold.

Firstly, because a data point from 24h in the future is needed to calculate the log return, there is a danger that there is label overlap between the training and test/val datasets. (This is a general difference of using [log] returns vs price to watch out for.)

Secondly, there can be feature overlap if the test/val datasets need rolling data from the training period.

To avoid this, the usual approach is to apply a purge gap ≥ max(prediction horizon, longest lookback) inside every walk-forward fold, i.e., probably a gap of at least 1 day, possibly 2 if any features look back further.

I wasn’t sure from your description whether you’ve already accounted for this in your moving-window CV. If you have, great. It’d be useful to confirm so others can align. If not, it might be worth re-running with a purge gap, since it can shift metrics like ZPTAE and directional accuracy.
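A minimal sketch of what such purged walk-forward folds could look like (index arithmetic only; the function and parameter names are mine):

```python
import numpy as np

def purged_walk_forward(n, n_test, purge, n_folds):
    """Yield (train_idx, test_idx) with a purge gap before each test block."""
    for k in range(n_folds):
        test_end = n - k * n_test
        test_start = test_end - n_test
        train_end = test_start - purge  # drop `purge` points before the test block
        if train_end <= 0:
            break
        yield np.arange(train_end), np.arange(test_start, test_end)

# 24h horizon at 5-min bars -> purge gap of 288 points
folds = list(purged_walk_forward(n=10_000, n_test=288, purge=288, n_folds=3))
for train_idx, test_idx in folds:
    assert test_idx.min() - train_idx.max() - 1 >= 288  # gap of >= 288 points
```

With purge = max(prediction horizon, longest lookback), neither labels nor rolling features can straddle the train/test boundary.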

https://research.allora.network/t/feature-engineering-experiments-1-add-log-returns-focused-features/134#post_5 Wed, 27 Aug 2025 10:09:07 +0000 research.allora.network-post-358
Feature engineering experiments 1: add log-returns-focused features

I applied TA on log-returns: ETH 24h LR

gML! I was curious what would happen if I took all the TA stuff I usually throw at price (RSI, MACD, Bollinger, ATR) and instead applied it directly to log-returns. Would it help, or just add noise? So I set up an A/B test on ETH 1-day log-return and here’s what I saw.


TL;DR

  • 360 days of ETH 5-min OHLCV from Tiingo, horizon = 288 (24h).
  • Baseline: TA on price.
  • Variant: same TA but applied to 1-step log-returns r1.
  • CV: expanding folds with 24h embargo.
  • Metric: ZPTAE (σ from 100 non-overlapping daily returns).
  • Result: mean ZPTAE dropped from 0.5797 → 0.5610 (~–3.2%).
  • DM test: t=2.317, p=0.0205 → statistically significant.
  • Directional accuracy still ~51% (24h is hard), but stability improved.

1. Data & target

  • Source: 360 days of ETH 5-min OHLCV from Tiingo.
  • Target: y = log(close_{t+288}) − log(close_t).

I went with ZPTAE normalized by σ from 100 non-overlapping daily returns. With a 24h horizon, scale drifts a lot, so this made the loss feel more stable.
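Computing the target and the sigma used for normalization might look like this (a sketch with a synthetic close series; the 288-bar horizon and 100 non-overlapping daily returns follow the setup above):

```python
import numpy as np

H = 288  # 24h horizon in 5-minute bars
# Synthetic close series standing in for the Tiingo OHLCV data
close = np.exp(np.cumsum(np.random.default_rng(1).normal(0, 1e-3, 50_000)))

# Target: 24h log-return
y = np.log(close[H:]) - np.log(close[:-H])

# Sigma from 100 non-overlapping daily returns
daily = y[::H][:100]
sigma = np.std(daily)

z = y / sigma  # normalized returns fed into the ZPTAE metric
assert y.size == close.size - H
assert daily.size == 100
```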


2. Features

Baseline (price-focused): EMAs/SMA, RSI, MACD, Bollinger bands, ATR & realized vol, plus calendar effects (hour, weekday, weekend).

Returns-focused (the new bit):

  • Lags of r1
  • Rolling stats: mean, EMA, std, sum, |r| mean, r² mean, skew, kurtosis
  • RSI, MACD, Bollinger on r1
  • “Returns energy/force”: rolling r1², Δr1

I skipped volume or open/close-based features since they don’t make much sense for returns.


3. Cross-validation

Expanding CV with three folds, each with a 24h embargo before test. I wanted to be extra careful not to leak.


4. Results

Model             RMSE    MAE     Corr   DirAcc  ZPTAE
Baseline (price)  0.0441  0.0328  0.053  0.515   0.5797
Returns-focused   0.0433  0.0317  0.035  0.512   0.5610

DM test on ZPTAE said the difference wasn’t just noise.


5. What the charts told me

Figure 1 — Rolling ZPTAE (EWMA-100)

=> The orange line (returns) sits a bit lower than the blue baseline in many spots, especially during volatile patches. That gave me some confidence this is real.

Figure 2a — Prediction vs Truth (returns-focused)

Figure 2b — Prediction vs Truth (baseline)

=> Shapes are similar, but returns-focused scatters hug zero a bit tighter. That explains why corr didn’t move much but ZPTAE improved.

Figure 3 — Residuals

=> Residuals are fat-tailed, as expected. Returns-focused tightens the center a little, which matches the lower ZPTAE.

Figure 4 — Timeline (returns-focused)

=> On big swings, my preds stay too flat. I sacrificed amplitude for variance control.

Figure 5 — Feature importance (Baseline)

=> Price model leaned on EMAs/SMAs and calendar stuff (weekend effect stood out).

Figure 6 — Feature importance (Returns)

=> With returns features, new ones show up: r_ma_144, bb_width_r_144, r_skew_288, r_kurt_288. The model is actually picking up on returns shape, not just levels.


6. What I learned

  • Just mirroring TA onto returns gave me a small but real gain.
  • The juice came from higher-moment and clustering features (skew, kurt, band width), not just lags.
  • Conservative preds = more stability. That’s good if you care about regret-style losses, but it means you under-react on big moves.

7. Next steps

  • Run ablations: check exactly which returns features drive the gain.
  • Try different σ definitions to see if the improvement holds.
  • Layer in multi-resolution returns (5m/15m/1h).
  • Maybe cross-asset returns (BTC, SOL) as extra context.
  • And I want to try a light calibrator to boost amplitude without wrecking stability.

Closing

I went in not sure if “RSI on returns” was going to be useful. Turns out, it’s not magic, but it does shave off loss and shows up in significance tests. Feels like a small but honest step forward.
Curious if anyone else sees the same thing or if you’ve tried skew/kurt features on returns and found them useful in different horizons.
Big thanks to @Apollo11 and the team for nudging me into the returns-focused path.
This little experiment already gave me something real to think about, and I’m planning to run a few more trials (ablations, different σ definitions, maybe cross-asset returns SOL-BTC-ETH x PAXG x USDT) to see how far this approach can go.

https://research.allora.network/t/feature-engineering-experiments-1-add-log-returns-focused-features/134#post_4 Wed, 20 Aug 2025 12:52:54 +0000 research.allora.network-post-354
Feature engineering experiments 1: add log-returns-focused features

I’ve put together a two-year ETH OHLCV dataset (5-minute intervals, from Tiingo) that we could all use as a shared benchmark for testing our models.

It covers 2023-08-14 to 2025-08-13 and is already split chronologically:

  • Train/Validation: 2023-08-15 to 2024-08-12 (~1 year)

  • Test: 2024-08-12 to 2025-08-12 (~1 year)

Both parts include log_return and target_log_return for daily return prediction. The split is designed to avoid data leakage and reflect the real-world case where we predict the future from the past.

Why this matters:

  • Consistent methodology – Everyone benchmarks models on the same dataset and time split.

  • Fair comparisons – Eliminates variability from random sampling.

  • Realism – Chronological split mirrors actual trading conditions.

How to load the dataset:

import pandas as pd
df = pd.read_csv('dataset.csv', index_col='datetime', parse_dates=True)

The idea:

  • Use this dataset (or similar ones) as a common benchmark.

  • Stick to consistent evaluation metrics.

  • Share results and approaches so we can actually learn from each other’s work instead of guessing why numbers are different.

Note 1:
Accuracy should always be reported on the test set.

Note 2:
If you’re using a rolling-window approach for training and testing, insert a gap of 288 data points (equivalent to 1 day) between the train and test segments. This buffer helps prevent data leakage by ensuring that no information from the test period leaks into model training.

]]>
https://research.allora.network/t/feature-engineering-experiments-1-add-log-returns-focused-features/134#post_3 Wed, 13 Aug 2025 12:48:16 +0000 research.allora.network-post-353
Feature engineering experiments 1: add log-returns-focused features Although I haven’t yet evaluated the impact of individual log-return-derived features on the model, the following is a report on how applying a selected set of these features improves performance. I will continue this analysis and provide a detailed report on the effect of individual features soon.

I planned my research process as follows:
I downloaded two full years (730 days) of ETH price data from Tiingo at 5-minute intervals and modified the dataset as described here. The data was then split into two parts—validation and test sets—ensuring that each contained sufficient context for training.

I began by fitting a linear regression model using varying look-back windows—our “context lengths”—to identify the optimal amount of historical data for forecasting. Treating context length L as a hyperparameter, I evaluated each candidate value on the validation set. For a given L, the model was retrained on the most recent L observations and then used to predict the next batch of returns. Each batch comprised 288 consecutive five-minute bars (one trading day), after which the window advanced by 288 points and the process repeated. This moving-window scheme ensured that the model continuously learned from fresh data while allowing us to empirically select the context length that maximized out-of-sample performance.
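The moving-window scheme described above can be sketched as follows (a simplified illustration with a plain linear regression; `walk_forward` and its arguments are hypothetical names, not the author's code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def walk_forward(X, y, context_len, batch=288):
    """Retrain on the most recent `context_len` rows, predict the next
    `batch` rows (one day of 5-minute bars), then advance the window
    by `batch` and repeat."""
    preds, trues = [], []
    start = context_len
    while start + batch <= len(y):
        model = LinearRegression().fit(
            X[start - context_len:start], y[start - context_len:start]
        )
        preds.append(model.predict(X[start:start + batch]))
        trues.append(y[start:start + batch])
        start += batch
    return np.concatenate(preds), np.concatenate(trues)
```

Sweeping `context_len` over candidate values and scoring the concatenated out-of-sample predictions reproduces the hyperparameter search described above.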

I believe that this walk-forward approach to retraining—rather than training on a fixed dataset and making all subsequent predictions—enables the model to continually adapt to new and emerging patterns. This is particularly important when working with time-series data, where market behavior can shift over time.

To determine the optimal context length, the model was retrained at each step using varying context sizes just before the test data. Three evaluation metrics were calculated across all batches (~6 months of validation data, totaling 50,000 prediction points) for each context length: directional accuracy (DA), relative absolute return-weighted RMSE improvement, and Z-transformed power tanh absolute error (ZPTAE). The results are as follows:



The optimal context length identified from this analysis was 20,000 (approximately 70 days), which yielded a directional accuracy (DA) of ~54.4%, a ZPTAE of 1.25, and a relative absolute return-weighted RMSE improvement of -25%.
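Of these metrics, directional accuracy is the simplest to reproduce; a minimal sketch (treating a sign mismatch on exactly-zero returns as a miss, which is a simplifying assumption):

```python
import numpy as np

def directional_accuracy(y_true, y_pred):
    """Fraction of predictions whose sign matches the realised return."""
    return float(np.mean(np.sign(y_true) == np.sign(y_pred)))
```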

Following this, I incorporated several log-return-based features into the dataset, including:

  • Bollinger Bands: Upper and lower bands derived from the rolling mean and standard deviation of returns over a 48-hour window.
  • Relative Strength Index (RSI): Calculated over 6-hour and 24-hour windows.
  • Moving Average Convergence Divergence (MACD): Computed using default parameters from Python’s TA package.
  • KST (Know Sure Thing): Also calculated with the default settings from Python’s TA package.
  • Simple and Exponential Moving Averages (SMA & EMA): Derived from the return series over 6h, 12h, 24h, 48h, and 96h windows to capture various trend horizons.
  • SMA Differences: Such as SMA(12h) − SMA(6h) and SMA(24h) − SMA(12h), to highlight momentum shifts.
  • Exponentially Weighted Standard Deviation: Computed using smoothing factors α = 0.01 and 0.05.
  • Trend Deviation: Defined as the difference between the 24-hour EMA of return and the current return.
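A few of these features can be sketched directly from a 5-minute log-return series (an illustrative subset, not the author's exact implementation; window sizes follow the descriptions above, with 12 bars per hour):

```python
import numpy as np
import pandas as pd

def return_features(ret, bars_per_hour=12):
    """Illustrative subset of the listed log-return-derived features."""
    f = pd.DataFrame(index=ret.index)
    w48h = 48 * bars_per_hour
    mu, sd = ret.rolling(w48h).mean(), ret.rolling(w48h).std()
    f["boll_upper"] = mu + 2 * sd                       # Bollinger bands on returns
    f["boll_lower"] = mu - 2 * sd
    f["sma_6h"] = ret.rolling(6 * bars_per_hour).mean()
    f["sma_12h"] = ret.rolling(12 * bars_per_hour).mean()
    f["sma_diff_12h_6h"] = f["sma_12h"] - f["sma_6h"]   # momentum shift
    f["ewm_std_001"] = ret.ewm(alpha=0.01).std()        # EW std, alpha = 0.01
    f["trend_dev"] = ret.ewm(span=24 * bars_per_hour).mean() - ret
    return f
```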

Upon reevaluation with these features, the model’s directional accuracy improved noticeably across nearly all context lengths, while the relative RMSE improvement and ZPTAE loss remained largely unchanged.

It can be interpreted from the above plots that, when we extend the context window, both relative RMSE improvement and our custom ZPTAE metric rise steadily and then level off around the optimal look-back length, but DA doesn’t follow that pattern, and even declines with longer horizons. This suggests that DA on its own can be misleading, as you might correctly predict direction less than half the time yet still profit if your correct calls coincide with large market moves, a nuance that ZPTAE captures by scaling errors with volatility and return size. The drop in DA beyond the ideal window may also hint at overfitting to outdated patterns, whereas RMSE-based measures remain stable. Overall, this highlights the value of error metrics that weight by return magnitude, rather than relying solely on raw hit rates—especially when the model or its features aren’t yet strong enough to consistently make high-confidence directional predictions.

I then evaluated the model on the test set using three versions of the dataset: one without return-derived features, one with them, and one with return-derived features after applying feature reduction. The results below reflect the model’s performance when predicting log-returns over one year of test data—approximately 100,000 predictions—using a context length of 20,000 data points:

These results demonstrate that log-return-derived features significantly enhance the model’s performance, with feature reduction offering further improvements.

Finally, I trained the model on these datasets and simulated daily trading over a 2-year period, resulting in a total of 663 trades. The cumulative returns were as follows:

  • 260% for the model without log-return-derived features,
  • 530% with log-return-derived features, and
  • 708% with log-return-derived features plus feature reduction.

For comparison, simply holding the asset yielded a return of 122%, while random trading resulted in a -1% return. These results demonstrate the promising potential of log-return-derived features in enhancing model performance and trading outcomes.

]]>
https://research.allora.network/t/feature-engineering-experiments-1-add-log-returns-focused-features/134#post_2 Tue, 05 Aug 2025 11:36:22 +0000 research.allora.network-post-347
Price/returns topic feature engineering We will continue to use this thread for organising and coordinating all feature engineering experiments. We’ll spin off a new thread for each feature engineering experiment to make sure the discussions don’t become intractable.

I have created a new thread for ongoing work and results on the returns-focused feature variables.

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_17 Thu, 17 Jul 2025 12:18:08 +0000 research.allora.network-post-339
Feature engineering experiments 1: add log-returns-focused features In this experiment, we test the impact of adding a returns-focused feature set (all quantities you can calculate for price, but for log-returns). This is a relatively small amount of work (applying the transformations you are already using to another variable). Just be sure they’re sensible in this context – a log-return is a two-point quantity (expressing a difference between two moments), whereas a price is a one-point quantity (exists at any given moment in time). For instance, it makes sense to apply moving averages, RSI, MACD, Bollinger (and many other TA indicators) to log-returns, but maybe some other indicators relying on e.g. volume information or open-close data do not.

The way we should go about these is to perform an A/B test, i.e.:

  • use your own default model;
  • record its performance across a set of (sufficiently long) time intervals (more than one to achieve statistical significance);
  • develop one of the above modifications;
  • add this to your own model;
  • record the performance of the modified model across the same set of (sufficiently long) time intervals;
  • quantify any differences and compare statistical significance.

We can collectively define some of the unknowns in the above plan (e.g. which time intervals, how long, which metrics) and I suggest you just propose what you’d like to use.

We invite Allora Forge participants and model builders to participate in this experiment!

The coordination of all feature engineering experiments takes place in this thread. The thread here is intended for ongoing work and results on the returns-focused feature variables.

]]>
https://research.allora.network/t/feature-engineering-experiments-1-add-log-returns-focused-features/134#post_1 Thu, 17 Jul 2025 12:16:15 +0000 research.allora.network-post-338
Price/returns topic feature engineering

Thanks all for voting in the poll! Looks like we have a clear list of three priorities, so let’s get working on these:

  • Including force and energy features (i.e. multi-timeframe Δ[close-open]/Δt, [close-open]**2, difference from MA, multi-timeframe linear gradients)
  • Add returns-focused feature set (all quantities you can calculate for price, but for log-returns)
  • Modifying the training evaluation metric to match the ZPTAE loss function

The way we should go about these is to perform an A/B test, i.e.:

  • use your own default model;
  • record its performance across a set of (sufficiently long) time intervals (more than one to achieve statistical significance);
  • develop one of the above modifications;
  • add this to your own model;
  • record the performance of the modified model across the same set of (sufficiently long) time intervals;
  • quantify any differences and compare statistical significance.

We can collectively define some of the unknowns in the above plan (e.g. which time intervals, how long, which metrics) and I suggest you just propose what you’d like to use.

It’s great that we have three model builders involved in the discussion already (@t-hossein @phamhung3589 @its_theday). Given that each of your models is quite different (and uses a different feature set), can I maybe suggest that we work through the above ideas simultaneously? So then we pick one, all do the A/B test for that, and compare results. That way, we also test the robustness of these ideas under differing modelling approaches and I think that could be very useful. Given that we’re looking at historical data for these tests, we can continue to use the PAXG target, but if any of you would like to switch to the target of one of the new Forge topics (e.g. BTC), please let us know. Of course, more model builders are welcome to join at any time!

I then would like to suggest we start with Add returns-focused feature set (all quantities you can calculate for price, but for log-returns). My reasoning is that this is a relatively small amount of work to try (applying the transformations you are already using to another variable). Just be sure they’re sensible in this context – a log-return is a two-point quantity (expressing a difference between two moments), whereas a price is a one-point quantity (exists at any given moment in time). For instance, it makes sense to apply moving averages, RSI, MACD, Bollinger (and many other TA indicators) to log-returns, but maybe some other indicators relying on e.g. volume information or open-close data do not.

If you think this is a good plan and you’ll participate, just like this post and let’s get going!

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_16 Thu, 17 Jul 2025 12:01:21 +0000 research.allora.network-post-337
Price/returns topic feature engineering Thanks a lot for the thoughtful response @steve
I haven’t formally explored market regimes yet, but I plan to segment by volatility (e.g high vol vs low vol periods) & test whether BTC features are more predictive under certain conditions, HMM or simple thresholds might be a good starting point. I’m also starting to log residuals by time-of-day & volatility, which could help with context-aware weighting - a concept that really aligns with Allora’s meta-learning approach. Your point about assigning dynamic weights to signals or models based on performance is spot on.
On the feature side, I’ve begun adding macro signals from US and AU (like DXY, AUD/USD), given their potential influence on gold, early stage, but excited to see what patterns emerge.
Thanks again - will keep sharing as I dig deeper

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_15 Wed, 16 Jul 2025 15:58:34 +0000 research.allora.network-post-334
Price/returns topic feature engineering Thanks for joining the discussion @its_theday! I like the systematic approach to feature construction and model setup you’ve applied when building out the pipeline. The inclusion of BTCUSD-derived signals is particularly interesting, and while the marginal gains were modest, that’s perhaps not unexpected given the mixed correlation dynamics between Bitcoin and gold-backed assets like PAXG. Have you thought about exploring how these correlations behave under different market regimes (e.g. high volatility, macro-driven risk-off periods)? I could imagine that incorporating regime classifiers or volatility clustering might help surface more conditional relationships where BTC features become more predictive.

As you noted, expanding to additional macro and crypto signals (ETHUSD, DXY, VIX) is a natural next step. I’d be very keen to see how this affects model performance! From an Allora-inspired perspective, this opens the door to a more context-aware forecasting framework — not just ensembling models, but assigning weights to signals or model outputs based on their expected contribution in a given context. That mirrors how Allora combines “workers” by forecasting their error, and it could be approximated here by tracking model performance across different volatility/time segments and learning dynamic weights accordingly. Looking forward to seeing where you take this — the structure you’ve built seems well-suited to supporting a richer meta-learning layer.

You mentioned using time-based features—hour, day_of_week, is_weekend—which is great. In earlier threads there was interest in exploring whether returns differ by time of day, month, or season. Have you noticed any meaningful signal in those temporal features? For instance, do certain hours consistently offer stronger predictability, or do weekend returns behave differently? Identifying such patterns could guide a regime-focused approach—e.g., applying context-aware weights when you detect statistically significant time-based effects.

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_14 Wed, 16 Jul 2025 14:50:35 +0000 research.allora.network-post-333
Price/returns topic feature engineering I built a pipeline to predict the 24-hour log return of PAXGUSD using resampled hourly data. In addition to technical features from the PAXGUSD time series, I also experimented with incorporating external features from BTCUSD price movement, since gold and Bitcoin sometimes show inverse or lagged correlations.

1. Data & Target

Data source: OHLC 1-hour API

import numpy as np
# df is the resampled hourly PAXGUSD OHLC DataFrame
df['target_close'] = df['close'].shift(-24)
df['log_return'] = np.log(df['target_close'] / df['close'])

2. Feature Engineering

The following features were created from raw data:

Momentum & Oscillators

  • RSI (Relative Strength Index): rsi_14, rsi_24, rsi_48
  • ROC (Rate of Change): roc_12, roc_24, roc_48
  • MACD: macd_line, signal_line, macd_histogram
  • Williams %R: measures the closing price relative to the high-low range over the past 24 hours

Lag & Change
Created lag-based features:

for lag in [1, 2, 3, 4, 5, 12, 24]:
    df[f'close_lag_{lag}'] = df['close'].shift(lag)
    df[f'close_delta_{lag}'] = df['close'] - df[f'close_lag_{lag}']
    df[f'close_ratio_{lag}'] = df['close'] / df[f'close_lag_{lag}']

Trend & Volatility

  • EMA: ema_12, ema_26, ema_50, ema_100
  • Volatility: rolling standard deviation (normalized)
  • ATR (Average True Range)
  • Bollinger Band Width: captures expansion/contraction of price

Time Features

  • hour, day_of_week, is_weekend
    => Designed to capture seasonal/time-based patterns

Cross-Market Features from BTCUSD
I fetched hourly BTCUSD data from a separate API and resampled it to align with the PAXGUSD timestamps. Then I created parallel technical indicators:

btc_df['log_return'] = np.log(btc_df['close'] / btc_df['close'].shift(1))
btc_df['rsi_14'] = compute_rsi(btc_df['close'], window=14)
btc_df['ema_12'] = btc_df['close'].ewm(span=12).mean()
btc_df['roc_12'] = btc_df['close'].pct_change(periods=12) * 100

These features were merged into the main dataframe:

df = df.merge(btc_df[['timestamp', 'log_return', 'rsi_14', 'ema_12', 'roc_12']], on='timestamp', suffixes=('', '_btc'))

The idea is to let the model learn from recent Bitcoin volatility or momentum and whether that leads or lags gold token behavior (PAXGUSD).

3. Modeling

  • Model: XGBRegressor with TimeSeriesSplit
  • Scaler: StandardScaler applied to all features
  • Loss: RMSE

Incorporating BTCUSD signals added modest performance gains in backtesting. It’s worth exploring other macro or crypto-related cross-asset signals (like ETH, DXY, or VIX) as feature inputs.
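The cross-validated setup can be sketched as follows (using scikit-learn's `TimeSeriesSplit`, where each fold trains on the past and validates on the segment that follows; a `Ridge` regressor stands in for `XGBRegressor` to keep the example dependency-free):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge  # stand-in for XGBRegressor

def cv_rmse(X, y, n_splits=5):
    """Mean walk-forward RMSE across TimeSeriesSplit folds, with
    StandardScaler applied to all features as described above."""
    rmses = []
    for tr, va in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = make_pipeline(StandardScaler(), Ridge())
        model.fit(X[tr], y[tr])
        err = model.predict(X[va]) - y[va]
        rmses.append(np.sqrt(np.mean(err ** 2)))
    return float(np.mean(rmses))
```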

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_13 Wed, 16 Jul 2025 07:06:24 +0000 research.allora.network-post-331
Price/returns topic feature engineering Thank you for joining the discussion @phamhung3589, and thanks much for sharing your further thoughts @t-hossein!

In addition to what has been said, I was also thinking that many of the feature classes (price data, technical indicators, statistics) shouldn’t only be calculated for price, but also for the returns themselves. Given that the target variable is typically log-returns in price prediction topics, it is probably important to calculate RSI, MACD, OBV, Bollinger, stochastic oscillator, ATR, ADX, CCI, and all kinds of MAs for log-returns in addition to those for price. This isn’t a very hard engineering step and might yield stronger signal.

Based on the many ideas that are floating around in this thread now, would it make sense to prioritise and/or split off tasks for quantitative testing?

If I summarise the above, I see the following initiatives:

  • Time granularity (5m vs 24h)
  • Including force and energy features (i.e. multi-timeframe Δ[close-open]/Δt, [close-open]**2, difference from MA, multi-timeframe linear gradients)
  • Including external price drivers (e.g. spot XAUUSD, DXY, 10y yield, GLD ETF flows – obviously these are specific to gold and not always generalisable to other topics, except maybe BTC?)
  • Including labelled time-of-day and real-valued time-of-week
  • Performing feature reduction
  • Modifying the training evaluation metric to match the ZPTAE loss function
  • Add returns-focused feature set (all quantities you can calculate for price, but for log-returns)
    (Losses in returns prediction topics - #8 by joel)

For ease of prioritisation, let’s do a poll on what we think are the high ROI things to test first (max 3 votes/person):

  • Time granularity
  • Force & energy features
  • External price drivers
  • Improved time-of-day & time-of-week
  • Feature reduction
  • ZPTAE evaluation metric
  • Returns-focused feature set


Let’s make it run for 24h after this post so that we don’t slow down too much here. Great stuff everyone!

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_12 Wed, 16 Jul 2025 06:44:29 +0000 research.allora.network-post-329
Price/returns topic feature engineering These are the features I use for topic 60 - 24 hour PAXG/USD Log-Return Prediction

  • Price Features

    • Close (1, 5, 10, 20 hours)
    • Log returns
    • High/low ratios, close/open ratios
  • Technical Indicators

    • Moving Averages (SMA, EMA)
    • RSI with multiple periods
    • Bollinger Bands with position and width
    • MACD with signal and histogram
    • Stochastic Oscillator
    • ATR (Average True Range)
    • ADX (Average Directional Index)
    • CCI (Commodity Channel Index)
  • Time-Based Features

    • Hour, day of week, month extraction
    • Cyclical encoding for time components
    • Weekend flags
    • Market open/close proximity
  • Statistical Features

    • Rolling moments (mean, std, skew, kurtosis)
    • Percentiles and ranges
    • Price statistics across multiple windows
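The cyclical encoding mentioned above typically maps each periodic component (hour, day of week, month) onto a sine/cosine pair so that boundary values wrap smoothly, e.g. hour 23 sits next to hour 0. A minimal sketch (illustrative, not the author's code):

```python
import numpy as np
import pandas as pd

def cyclical_encode(values, period):
    """Map a periodic integer feature to a sin/cos pair so the
    boundary wraps (23:00 lands next to 00:00 in feature space)."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

hours = pd.Series([0, 6, 12, 23])
hour_sin, hour_cos = cyclical_encode(hours, period=24)
```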
]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_11 Wed, 16 Jul 2025 06:25:35 +0000 research.allora.network-post-327
Losses in returns prediction topics Looking into the ZPTAE loss function again, we decided to slightly modify it by adding a “penalty” term for outliers to the loss function. This leaves the main functionality unchanged for reasonable inferences, but further penalises extremely large outliers (obviously unrealistic values). The aim is to make outliers more obvious in losses/regrets which should help with inference synthesis and allow the forecasters to better take outliers into account.

import numpy as np

def power_tanh(x, alpha=0.25, beta=2):
    return x / (1 + np.abs(x)**beta)**((1 - alpha) / beta)

def loss_zptae(y_true, y_pred, sigma, mean, alpha=0.25, beta=2, gamma=4, penalty_norm=0.01):
    # Z power-tanh absolute error with an outlier penalty term

    z_true = (y_true - mean) / sigma
    z_pred = (y_pred - mean) / sigma

    pt_true = power_tanh(z_true, alpha=alpha, beta=beta)
    pt_pred = power_tanh(z_pred, alpha=alpha, beta=beta)

    main_term = np.abs(pt_pred - pt_true)
    penalty_term = (penalty_norm * np.abs(z_pred - z_true))**gamma

    return main_term + penalty_term

Visualisation of the ZPTAE loss function with (solid lines) and without (dotted lines) a penalty term.

]]>
https://research.allora.network/t/losses-in-returns-prediction-topics/120#post_8 Wed, 16 Jul 2025 02:25:53 +0000 research.allora.network-post-325
Price/returns topic feature engineering Hey @joel — thank you so much for the fantastic tips!

I’ll definitely try implementing the additional features you and @Apollo11 mentioned — they sound very promising.

You’re right — I haven’t applied any feature reduction in this model so far, but given the relatively small number of independent data points (around 250), I agree it’s something I should incorporate.

I also plan to expand my evaluation to include more models like LightGBM, CatBoost, and others. I’ve mostly worked with XGBoost, but I’ll make sure to include a broader comparison in my next update.

Thanks as well for pointing out ZPTAE — I’ve come across it in the forums and did a preliminary read, but I’ll go over it more carefully and try using it for evaluation in my next iteration. If I run into any confusion, I’ll definitely reach out!

Thanks again — this is super helpful.

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_10 Tue, 15 Jul 2025 19:48:30 +0000 research.allora.network-post-324
Price/returns topic feature engineering Hey @florian!

Not exactly — many of my features, like EMA_100 and SMA_100, are moving averages computed over the past 100 5-minute intervals. In the final dataset, the open value corresponds to the price exactly 24 hours ago, while the close value reflects the current price. So any significant change in close is still captured — both directly and indirectly through indicators like EMAs and SMAs.

The only notable information loss comes from low and high values that aren’t the absolute min/max over any 24-hour window — those finer movements do get missed.

As for the 1-minute granularity: I chose 5-minute intervals because the epoch length in this specific competition (topic 60) is 5 minutes. I structured the data accordingly so the model could learn with the temporal resolution it will be evaluated on.

You’re absolutely right about the log-return definition — that was a typo in my original post. The correct formulation used during training was:
Log-Return(t) = ln(close(t + 1 day) / close(t))
just as you mentioned. I don’t think I can update the original post at this point, but I’ll see if there’s a way to fix it.

Also, the dataset includes a return feature representing the return from yesterday to now:
Log-Return(t) = ln(close(t) / close(t − 1 day))

I’ve uploaded a sample dataset here, which includes 3 days of data.

I’ve also created a repo called Allora_Data_Fetcher, where I’ve uploaded the code used to generate this dataset. It’s still a work in progress, so apologies in advance for any issues :slightly_smiling_face:.

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_9 Tue, 15 Jul 2025 19:19:41 +0000 research.allora.network-post-323
Price/returns topic feature engineering Hey Apollo, thanks a lot for the feedback!

Regarding the loss of momentum — in hindsight, I agree that using 24-hour candles may result in the loss of important short-term dynamics like momentum. That said, I think relying entirely on 5-minute data points could introduce excessive noise. A good compromise might be to derive features over intermediate timeframes — for example, 4-hour windows. This could help retain meaningful trends while minimizing noise.

I really like the idea of incorporating physics-inspired concepts like momentum and force into feature engineering. I’m especially interested in how derivative-based features (e.g., Δ[close - open]/Δt) might influence model performance, since they capture the rate of change over time. I believe this approach could be central to designing a set of informative features.

Your point about time-of-day seasonality is also very compelling. Handling seasonality differently across key time zones seems like a smart direction and could improve predictive accuracy.

I’ll also experiment with incorporating external price drivers, as you suggested. One challenge, though, is the inconsistency in data granularity and completeness across sources. For instance, Yahoo Finance offers a broad range of indices and assets, but often lacks data for off-market hours or holidays. Additionally, Yahoo’s data is typically at 1-hour granularity, which doesn’t align well with our Tiingo OHLCV data that’s available at 5-minute intervals and has far fewer missing values (which can be interpolated or forward/backfilled if needed). Thus, this is also something to consider regarding the integration of data outside Tiingo.

Thanks again — these are great suggestions and give me plenty to explore further!

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_8 Tue, 15 Jul 2025 18:24:04 +0000 research.allora.network-post-322
Price/returns topic feature engineering Thanks for sharing @t-hossein!

In addition to the features you already use, I’ve found gradients (of a linear fit over some window), acceleration/force and difference from moving average to be other very useful features that are often among the most important.
For time encoding, perhaps time in a week (e.g. in hours) could be useful to capture weekly cycles (like due to weekends).

Do you do any feature reduction? That’s quite a lot of features for 250 independent data points so you could reduce pairs of very highly correlated features, remove features that are consistently of least importance, etc.

For the ML model, LightGBM and CatBoost may also be worth testing. I tend to find LightGBM a bit better than XGBoost most of the time (though don’t have much experience with CatBoost).

You could also consider modifying the evaluation metric to give larger true returns more weight in the minimisation. “Z-transformed Power-Tanh Absolute Error” (ZPTAE) is used in returns topics which has this behaviour. Let me know if you want any more info on that.

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_7 Tue, 15 Jul 2025 11:25:26 +0000 research.allora.network-post-321
Price/returns topic feature engineering @t-hossein this is great! I also love your detailed explanation.

So do I understand it right that all your features are derived from your modified (added over 288 intervals) OHLCV data? Don’t you lose a lot of your data by smoothing it out like this? Like, if the gold price makes sudden jumps within a 5 minute interval, that probably means something? I would have even tried to include the 1 minute intervals from Tiingo, just to have more data to work with. But I don’t know much about this, maybe there’s a reason you don’t do that?

Also I’m noticing you define log returns as Log-Return(t) = ln(close(t) / close(t + 1day)) but I believe the Allora topic (the reputers, to be precise) defines it as Log-Return(t) = ln(close(t + 1day) / close(t)), which would give exactly the negative result. I could be wrong, but maybe you want to double check that.

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_6 Thu, 10 Jul 2025 14:55:32 +0000 research.allora.network-post-320
Price/returns topic feature engineering This is great input @t-hossein, thank you for sharing your approach! Really appreciate the systematic nature of what you’re doing.

I wondered if there is no loss of critical momentum information by compressing the 5-minute data into 24h candles. On the momentum side, on 5-minute timescales I could imagine calculating momentum (close-open), its derivative (Δ[close-open]/Δt, which is analogous to a force), and its square ([close-open]**2, which is analogous to energy and a proxy for the volatility) more meaningfully contribute. For these types of features (and EMAs thereof), the fine-grained 5-minute data would probably be important. If these are calculated using 24h pseudo-candles, then the information driving the price action over that timeframe is lost.

I also wondered if it’s useful to integrate external gold price drivers (both original and lagging), such as:

  • Spot XAUUSD London PM fix;
  • DXY (USD index);
  • 1-day change in real 10-yr Treasury yield;
  • GLD ETF net flows.

Along these lines, gold (and also other assets) have time-of-day seasonality that isn’t sinusoidal. Specifically, the US day/night time could matter quite a bit. So it might be worth subdividing through labels, e.g. Asia (00-08 UTC), Europe (08-13 UTC), US (13-20 UTC).
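The session labelling could be sketched as follows (function name illustrative; the "off" label for 20-24 UTC is an added assumption, since that window isn't named above):

```python
import pandas as pd

def session_label(ts):
    """Label a UTC timestamp with the trading session suggested above:
    Asia (00-08 UTC), Europe (08-13 UTC), US (13-20 UTC)."""
    h = ts.hour
    if h < 8:
        return "asia"
    if h < 13:
        return "europe"
    if h < 20:
        return "us"
    return "off"

labels = pd.date_range("2025-07-04", periods=24, freq="h").map(session_label)
```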

Have you experimented with some of these features yet? I would also imagine that integrating these types of higher-order features would make it worth revisiting other model architectures (e.g. XGBoost).

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_5 Fri, 04 Jul 2025 14:39:52 +0000 research.allora.network-post-319
Price/returns topic feature engineering The following analysis pertains to my model development for the network’s 24-hour PAXG/USD Log-Return Prediction topic (Topic 60):

  1. Base Data: To construct my dataset, I first retrieved historical price data (OHLCV — Open, High, Low, Close, Volume) from Tiingo, covering the past 250 days. This spans from October 25, 2024, to July 2, 2025. Since the topic updates every 5 minutes, I collected data at a 5-minute interval to ensure the model remains responsive to short-term fluctuations.
    However, this raw data required modification, as our prediction horizon is 1 day (1440 minutes), not 5 minutes. (Fig.1)

Thus, I modified the dataset so that each datapoint represents the past 1-day window. For example, for the datapoint at 00:05 AM on July 2 (Fig.2):

  • Open: The price at 00:05 AM on July 1 (exactly one day prior).
  • High: The highest price between 00:05 AM on July 1 and 00:05 AM on July 2.
  • Low: The lowest price between 00:05 AM on July 1 and 00:05 AM on July 2.
  • Close: The price at 00:05 AM on July 2.
  • Volume: The sum of trading volumes from all 5-minute intervals between 00:05 AM on July 1 and 00:05 AM on July 2.
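This rolling pseudo-candle construction can be sketched as follows (an illustrative reading that uses the close of the bar one window ago as each window's open; names are hypothetical, not the author's code):

```python
import numpy as np
import pandas as pd

def rolling_daily_candles(ohlcv, window=288):
    """One 24h pseudo-candle per 5-minute step: open = price `window`
    bars ago, high/low = extremes over the window, close = current
    price, volume = sum over the window (288 bars = 1 day)."""
    out = pd.DataFrame(index=ohlcv.index)
    out["open"] = ohlcv["close"].shift(window)
    out["high"] = ohlcv["high"].rolling(window).max()
    out["low"] = ohlcv["low"].rolling(window).min()
    out["close"] = ohlcv["close"]
    out["volume"] = ohlcv["volume"].rolling(window).sum()
    return out.dropna()
```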

The target variable was defined as: Log-Return(t) = ln(close(t) / close(t + 288)),
where t represents the current datapoint. The close price at t + 288 is used because one day consists of 288 intervals at 5-minute resolution (1 day = 288 × 5 minutes).

2. Feature Engineering:
To enhance the predictive power of the model, I engineered a variety of features capturing traditional technical indicators and statistical properties of the time series. The numbers next to the feature names represent the window length in terms of datapoints. Below is a breakdown of the feature categories:

  • OHLCV Basics:
    • open, high, low, close, volume, volumeNotional, tradesDone
      These capture standard daily trading metrics and activity.
  • Technical Indicators:
    • Volatility and Momentum:
      • Bollinger_High, Bollinger_Low: Bollinger Bands to measure price volatility relative to a moving average.
      • RSI_10, RSI_100: Relative Strength Index at short and long windows to measure overbought/oversold conditions.
      • MACD, KST: Momentum indicators that highlight trend shifts.
      • OBV: On-Balance Volume, combining price movement and volume to detect accumulation or distribution.
    • Moving Averages:
      • SMA_20, SMA_100, SMA_200, SMA_500, SMA_1000: Simple moving averages over different timeframes to capture medium- and long-term trends.
      • EMA_20, EMA_100, EMA_200, EMA_500, EMA_1000: Exponential moving averages which give more weight to recent prices.
      • Difference-based indicators:
        • EMA_100-10: Difference between EMA_100 and EMA_10 for short-vs-long-term momentum.
        • EMA_200-100: Slope between EMA_200 and EMA_100.
        • EMA_100-SMA_100: To contrast exponential vs simple moving average trends.
  • Volatility Metrics:
    • std_0.05, std_0.1: Exponentially weighted standard deviations to assess micro-level volatility. Here, the numbers (0.05 and 0.1) are the alpha parameter of the pandas DataFrame ewm (exponentially weighted moving) function.
  • Price Relations (Candle Dynamics):
    • diff_trend, high-low, high-open, low-open, close-open: Capture intraday price range, spread, and behavior.
  • Statistical Features:
    • mean, log_volume: Basic statistics and transformed versions of core metrics to normalize scale.
  • Lag Features (Autoregressive Memory):
    • return_open, return, open-close_return, 2_lag_return through 10_lag_return:
      These include raw return, return relative to open or close, and up to 10-day lagged returns to capture autocorrelation and past signal memory.
  • Seasonality & Time Encoding:
    • seasonal_decomposition: Additive seasonal decomposition components of returns over fixed lags
    • second_of_day_sin, second_of_day_cos: Cyclical encoding of time of day to capture intra-day periodicity.
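
A few of these features can be sketched in pandas (synthetic prices for illustration; window and parameter values follow the naming above):

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute close series for illustration
idx = pd.date_range("2025-07-01", periods=500, freq="5min")
close = pd.Series(100 + np.cos(np.arange(500) / 30.0), index=idx)

feats = pd.DataFrame(index=idx)
feats["SMA_20"] = close.rolling(20).mean()
feats["EMA_20"] = close.ewm(span=20, adjust=False).mean()
feats["SMA_100"] = close.rolling(100).mean()
feats["EMA_100"] = close.ewm(span=100, adjust=False).mean()
feats["EMA_100-SMA_100"] = feats["EMA_100"] - feats["SMA_100"]
feats["std_0.05"] = close.ewm(alpha=0.05, adjust=False).std()  # ewm alpha, as above
# Cyclical encoding of the time of day
sec = idx.hour * 3600 + idx.minute * 60 + idx.second
feats["second_of_day_sin"] = np.sin(2 * np.pi * sec / 86400)
feats["second_of_day_cos"] = np.cos(2 * np.pi * sec / 86400)
```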

3. Feature Importance Analysis:
Mutual information (MI) between each feature and the target was calculated

Feature MI Score
close-open 2.015588
EMA_20 1.419642
SMA_20 1.293194
EMA_200-100 0.864370
high-low 0.662957
low 0.570482
SMA_1000 0.567136
high 0.564218
EMA_1000 0.549707
OBV 0.534179

This analysis suggests that a mix of momentum indicators, lagged returns, and seasonality components contributes most significantly to predictive performance. Interestingly, seasonal decomposition also appeared among the top features here.
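
MI scores of this kind can be computed with scikit-learn's mutual_info_regression; a toy sketch with one informative and one pure-noise feature:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 2000
informative = rng.normal(size=n)   # feature carrying signal
noise = rng.normal(size=n)         # pure-noise feature
X = np.column_stack([informative, noise])
y = informative + 0.1 * rng.normal(size=n)

mi = mutual_info_regression(X, y, random_state=0)
# mi[0] (informative) should far exceed mi[1] (noise)
```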

4. Model Training:

Next, I trained three models on the engineered dataset: 1) Linear Regression, 2) XGBoost, and 3) Transformer. I used scikit-learn’s TimeSeriesSplit function to split the data into training and testing sets. The gap parameter was set to 288 to prevent data leakage during training. Interestingly, the Linear Regression model achieved the best performance, with a directional accuracy of 56% (Fig.3):

Number of samples: 39998
Directional Accuracy: 0.5570
p-value: 0.000000
95% Confidence Interval for accuracy: 0.5547-0.5594
Correlation t-Test:
Pearson corr. coeff.: 0.1381 (must be >0.05)
p-value: 0.000000 (must be <0.05)
Relative absolute return-weighted RMSE improvement: 3.69% (must be >10%)

The statistical tests confirm the significance of the model’s evaluation metrics.
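
The split with a one-day gap could be reproduced with scikit-learn's TimeSeriesSplit (the array length here is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.zeros((2000, 1))  # placeholder feature matrix
tscv = TimeSeriesSplit(n_splits=5, gap=288)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # The 288-sample gap (one day of 5-minute bars) prevents any
    # 1-day-forward training target from overlapping the test window.
    assert test_idx.min() - train_idx.max() > 288
```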

5. Feature Importance in Linear Regression:
After fitting the Linear Regression model, feature importance was calculated based on the absolute values of the standardized coefficients:

Feature Importance
EMA_1000 0.0316
SMA_1000 0.0074
EMA_500 0.0074
EMA_200 0.0071
EMA_100 0.0071
SMA_100 0.0071

This indicates that long-term moving averages, especially EMA_1000, are the most predictive of the 1-day forward log-return using the Linear Regression model. These features likely capture long-term price direction or trend momentum influencing the return over the next day.

On the other hand, raw price components and volume features such as open, close, low, and volume had relatively low importance (e.g., 0.0014 or less), suggesting that absolute values are less informative than derived trend-based features.

Indicators like OBV, EMA_100-10, and diff_trend showed negligible or near-zero importance, potentially due to high correlation with stronger signals or a lack of linear relationship with the target.

Seasonality features had essentially zero importance in this experiment.

Final Notes:
I also tried using PCA components as features, but they didn’t improve the model performance at all.

I’d love to know everyone’s thoughts on this! Any idea on how to improve my dataset and create better features?

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_4 Wed, 02 Jul 2025 18:12:59 +0000 research.allora.network-post-318
Price/returns topic feature engineering Looking at participant performance in Allora’s Forge programme, I have the impression that feature engineering is the bottleneck in several of the current price/return prediction topics.

Markets are like physical systems, with action and reaction. If you think about them that way, you can identify which variables might carry signal. As a data scientist you can then test that hypothesis. But just being a data scientist isn’t enough – you need that physics perspective to first understand where the signal may be.

Good ideas don’t come by staring at a screen on your own. They come from collaboration and discussion. I think it could help if participants got together more to work on the feature engineering, and discuss what may or may not work.

There is a clear incentive to collaborate: on mainnet, the rewards paid out in a topic will be set by the topic weight, which is calculated using the stake in a topic and the revenue that it generates. Obviously, performant topics generate more revenue. So while you will be competing for rewards, you will also collectively be competing against other topics.

This thread is aimed at carrying out research on feature engineering for financial price forecasting (i.e. log-returns topics). I would suggest we first make an inventory of commonly-used features and carry out importance analysis for the main ones. Additionally, we should reason where we expect to see the strongest signal based on our experience with the markets, engineer new features, and test their predictive power.

From the Allora research team, there may be involvement from myself, @florian, @joel, and @steve. Others might join later too.

]]>
https://research.allora.network/t/price-returns-topic-feature-engineering/123#post_1 Fri, 27 Jun 2025 13:56:22 +0000 research.allora.network-post-311
Losses in returns prediction topics So, for future reference, then the final form of the ZPTAE loss (inserting P for power law) is:

def smooth_power_tanh_general(x, alpha=0.25, beta=2.0, x0=1.0):
    y = x / (1 + (x / x0)**beta)**((1 - alpha) / 2)
    return y
]]>
https://research.allora.network/t/losses-in-returns-prediction-topics/120#post_7 Wed, 04 Jun 2025 07:59:20 +0000 research.allora.network-post-310
Losses in returns prediction topics Perfect! We can keep alpha (and plausibly some other moving parts) as free parameters, set alpha = 0.25 as default, and keep a close eye on how this improves the network inference. Glad we got this sorted so quickly!

]]>
https://research.allora.network/t/losses-in-returns-prediction-topics/120#post_6 Wed, 04 Jun 2025 07:39:54 +0000 research.allora.network-post-309
Losses in returns prediction topics Nice, that could be a really simple solution!

Here are some tests replacing tanh in the ZTAE loss function with a power-tanh function.
Regarding the best value for alpha, it seems there’s a trade-off between larger loss values for inferences far from the ground truth (higher alpha) and keeping the asymmetric behaviour of the tanh function (lower alpha).

Another way is to look at a version of the above figure that’s ‘folded’ about the true value, so for each function the lower line is in the same direction as the true value.
Based on this I think alpha=0.1 is too close to the initial tanh function, so will still have much of the same issues because it is still quite flat past the ‘knee’ of the function. alpha=0.5 might be too close to a power law function, there’s not much difference in the positive and negative directions so may not sufficiently reward predictions in the right direction.
That leaves alpha ~= 0.25 as a nice middle ground?
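
A quick numerical check of this trade-off: for beta = 2 the tail of the power-tanh grows as ~x**alpha, so its log-log slope at large x recovers alpha (here 0.25):

```python
import numpy as np

def smooth_power_tanh_general(x, alpha=0.25, beta=2.0, x0=1.0):
    # Linear (tanh-like) core for |x| << x0; power-law tail beyond the knee
    return x / (1 + (x / x0)**beta)**((1 - alpha) / 2)

x = np.array([1e3, 1e4])
y = smooth_power_tanh_general(x, alpha=0.25)
slope = np.log10(y[1] / y[0])  # log-log slope over one decade, ~= alpha
```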

If we apply this power-tanh loss function to the initial returns topic data we get the following median losses, which I think is a good improvement to the current tanh function!

]]>
https://research.allora.network/t/losses-in-returns-prediction-topics/120#post_5 Wed, 04 Jun 2025 02:05:39 +0000 research.allora.network-post-308
Losses in returns prediction topics Oh wow, that’s a great visualisation: so if x_true = 1, then x = +infinity receives a better loss than x < 0.5 in this case. That’s the somewhat ridiculous consequence of the tanh saturation.

Sounds like a PL modification of the tanh could do wonders. For instance, if we use this:

import numpy as np
import matplotlib.pyplot as plt

def smooth_power_tanh_general(x, alpha=0.5, beta=2.0, x0=1.0):
    y = x / (1 + (x / x0)**beta)**((1 - alpha) / 2)
    return y

xtest = np.linspace(-5, 5, 1000)
plt.plot(xtest, np.tanh(xtest), ':k', lw=2, label='tanh')
plt.ylim((-5, 5))
for alpha_test in (0.25, 0.5, 1.0):
    ytest = smooth_power_tanh_general(xtest, alpha=alpha_test)
    plt.plot(xtest, ytest, label='alpha = ' + str(alpha_test))
plt.legend()
plt.show()

then the function looks like:


which gets rid of the saturation.

Maybe less important, but then we can also change the transition point (would keep this at x0 = 1 tbh), here shown adopting alpha = 0.5:

I guess something like this could be better? Then we’d just need to decide on the value of alpha.

]]>
https://research.allora.network/t/losses-in-returns-prediction-topics/120#post_4 Tue, 03 Jun 2025 08:15:36 +0000 research.allora.network-post-306
Losses in returns prediction topics I see. I made a quick figure of the ZTAE loss function for different true values with mean=0 and standard deviation=1. So the function becomes more asymmetric as the true value increases from the mean. But because it flattens off, “infinite” values in the correct direction (relative to the mean) can receive quite low losses, which is what we’re seeing above.

This is how losses look for a topic with an MSE loss function. The scatter around the expected loss function is, I think, related to uncertainty in how the ground truth is defined (see here), plus differences between data providers. So the “true” values I’ve used (from Tiingo, rounded to the nearest minute) might be slightly different from what the reputers used, which is why the scatter increases as the difference from the ground truth decreases. Still, inferences with differences much larger than the ground truth uncertainty should largely be unaffected, and indeed they have the largest losses.

]]>
https://research.allora.network/t/losses-in-returns-prediction-topics/120#post_3 Tue, 03 Jun 2025 02:31:55 +0000 research.allora.network-post-305
Losses in returns prediction topics Yeah we originally took the loss function for log-returns topics from the OpenGradient page here. It could indeed be related to the functional form of the (M)ZTAE loss function, because it flattens at large absolute deviations. It uses a tanh, which is symmetric and saturates at ±1.

Probably we don’t want it to saturate, but instead exhibit continued but shallower power-law behaviour. The knee should then stay at the same sigma as it is now, and the power-law slope should be a parameter we control. But all of this depends on whether this hypothesis is correct. What would the above figure look like for a standard MSE loss?

]]>
https://research.allora.network/t/losses-in-returns-prediction-topics/120#post_2 Mon, 02 Jun 2025 14:34:49 +0000 research.allora.network-post-304
Losses in returns prediction topics I’ve noticed some strange behaviour of the losses in topics predicting log-returns. Some workers occasionally provide extremely large inferences (up to ~ 10^12) compared to typical returns values (<0.1), but still have reasonable values for their losses:


So at some point the losses flatten out, and inferences very far from the true value can have a lower loss than those close to the true value. That seems like unintended behaviour to me. Could it be an issue with the loss function?

]]>
https://research.allora.network/t/losses-in-returns-prediction-topics/120#post_1 Mon, 02 Jun 2025 10:39:37 +0000 research.allora.network-post-301
Thorough testing of new forecaster model Now that we understand which forecasting models are likely to perform well and why, it is time to apply them to real data. For this we apply the same methodology as in the previous experiment, but now for the ETH/USD and BTC/USD 5min prediction topics. The testing here is off-chain, so the forecasters are not contributing to the network inferences.

The results here reflect what we already saw in the synthetic network experiment:

  • Per-inferer models outperform global/combined models.
  • Regret z-score provides the best target variable, with raw regrets the next best. Both outperform the naive network inference and the best inferer.
  • Shorter EMA lengths generally outperform longer lengths.

However, there are also some differences between the real topics and the synthetic one:

  • Raw regrets models are much closer in performance to the z-score models.
  • Per-inferer models with loss as a target perform worse than the best inferer (both ETH and BTC topics) and the naive network inference (ETH topic).

ETH/USD 5min Prediction:

BTC/USD 5min Prediction:

Differences between the experiments suggest there may not be a single optimal model, but that it may depend on the situation. For this reason, we suggest a suite of per-inferer forecasters (each predicting losses, regrets and z-scores) may provide a better solution, so that the network can identify the best performing forecaster model for its case.

]]>
https://research.allora.network/t/thorough-testing-of-new-forecaster-model/117#post_10 Thu, 29 May 2025 11:55:49 +0000 research.allora.network-post-299
Thorough testing of new forecaster model We can use this experiment as a way to optimise the forecaster and identify the best combination of parameters. We ran forecasters on the synthetic network with parameters from the following sets: model=(combined/global, per-inferer), target=(regret, loss, z-score), EMA set=([3], [7], [3,7], [7,14,30]), autocorrelation=(True, False). For this benchmarking exercise, we used 200 testing epochs.

This figure summarises the results of all tests by showing the mean log loss for each model combination (smaller values indicate better performance), and comparing them to the naive network inference (i.e. no forecaster input; black dashed line) and the best worker in the network (grey dashed line). In these tests all forecaster models significantly outperformed the naive network inference and best worker, i.e. all models would improve the network inference. The main take-aways from the figure are:

  • Per-inferer models (coloured dashed lines; training separate forecasting models for each inferer) outperform a global/combined model (coloured solid lines; one single model with inferer ID as a feature variable). This allows the forecaster to tailor models for each inferer and prevents it predicting the mean of all inferers.
  • The regret z-score models (loss z-score models are identical) outperform models predicting raw regret or raw loss.
  • The models are less sensitive to EMA lengths (indicated by line colour), but generally short lengths ([3], [7] or [3, 7]) perform best due to the sensitivity to recent changes.
  • Autocorrelation (left and right panels show with and without autocorrelation) does not significantly change the results, which is not surprising as there were no periodic variables built into the experiment.

To gain some insight as to why the per-inferer z-score models perform best we compare the predicted and true values for each target variable, using the forecaster models with EMA=[3,7] and autocorrelation as an example. In the figures smaller loss, larger regret and larger z-score indicate better performance. The three outperforming workers (downtrending, uptrending, crabbing) are indicated in the legends (blue, orange and green points, respectively). As previously, unfilled squares show the medians and dashed lines show linear fits for each worker.

Global/combined models:

Per-inferer models:

In general, the per-inferer models show better differentiation between workers. The combined models tend to put all ‘bad’ (random) inferers on similar relations, but the per-inferer models can distinguish some differences between them. Similarly, the linear fits for the per-inferer models tend to be closer to the ideal 1:1 line; i.e. the per-inferer models are more context aware, being better able to predict out- or under-performance for each individual worker.

For the target variable, the loss forecaster has the most difficult task: it needs to try and predict the absolute performance of each worker, which can vary dramatically from epoch to epoch. The predicted losses then need to be converted to regrets (difference from the full network loss, i.e. a measure of the expected outperformance relative to the network inference) for the weighting calculation. This can be simplified by instead directly predicting regrets, which provides a more stable property to predict by removing systematic epoch-to-epoch variations in losses, and indeed regret forecasters tend to outperform the loss forecasters.

One potential issue with regrets, which can be seen above for the crabbing worker, is that if the network inference becomes close to the worker inference (i.e. the network has identified it as an outperforming worker) its regret will trend to zero, and it will no longer clearly be recognised as outperforming. For this reason we considered the regret z-score (difference from the mean regret of all inferers, divided by the standard deviation) as an alternative prediction target, as it identifies outperformance relative to other workers. Dividing by the standard deviation allows performance to be normalised across different epochs, which can be seen in the more consistent minimum and maximum true values between different workers. As we find, these properties allow z-score forecasters to significantly outperform both the regret and loss forecasters for this test.

]]>
https://research.allora.network/t/thorough-testing-of-new-forecaster-model/117#post_9 Thu, 29 May 2025 01:07:26 +0000 research.allora.network-post-298
Thorough testing of new forecaster model Exactly. So in this test the forecaster has learnt to use various engineered properties of the market value (rate of change, difference from moving average, etc.) to identify which models are performing better or worse than others.

]]>
https://research.allora.network/t/thorough-testing-of-new-forecaster-model/117#post_8 Thu, 29 May 2025 00:00:52 +0000 research.allora.network-post-297
Thorough testing of new forecaster model This is extremely rich @joel, thank you!

I love how you’re building up from simple periodic outperformance to more contextual periods of outperformance.

If I liberally translate, are you saying that forecasters should be able to identify which models outperform based on market conditions (bull/bear/consolidation)?

That would be very powerful.

]]>
https://research.allora.network/t/thorough-testing-of-new-forecaster-model/117#post_7 Wed, 28 May 2025 09:14:20 +0000 research.allora.network-post-295
Thorough testing of new forecaster model As a more realistic test, we create an experiment which uses geometric brownian motion (GBM) to generate ‘true market values’, with randomly modulated periods of drift. We then create inferers which outperform in different circumstances (uptrends, downtrends, crabbing), along with control inferers which predict random returns. This benchmark tests the ability of forecaster models to distinguish workers and connect outperformance with variations in the ground truth. In the test we start with an initial value of 1000 and volatility of 0.01. The drift parameter is randomly modulated between -0.01, 0 and 0.01 (downward drift, no drift and upward drift) for periods with a typical length of 5 epochs (with each period length drawn from a Poisson distribution). For the test we generate 2000 epochs of data, with 1900 used for training and 100 used for testing.
This figure highlights the drift periods in data generated for forecaster testing, with downward drifts shaded blue and upward drifts shaded red.

We start by comparing the inferences and regrets for the three outperforming workers, along with the predictions for a forecaster predicting raw regrets. These workers generate returns with an uncertainty drawn from a Gaussian distribution. The standard deviation for each worker is randomly drawn between 0.001-0.003 for accurate periods and between 0.005-0.01 for inaccurate periods. The other “random” workers draw standard deviations between 0.002-0.012.

The “downtrending” worker has three accurate periods beginning at blocks 1906, 1930 and 1947. The forecaster reasonably identifies each outperformance period where the worker’s regret increases to >0, with the predicted regret increasing from ~ -1 (during inaccurate periods) to ~0.5 (accurate periods).


The “uptrending” worker has two accurate periods beginning at blocks 1914 and 1984. The predictions from the forecaster have a similar behaviour to those for the downtrending worker.


The “crabbing” worker is accurate during periods of no drift. The true regrets for this worker have a more subtle behaviour, with the regret during accurate periods often barely increasing from the typical value of ~ -0.2 during inaccurate periods. This is because the “random” workers perform better during low drift periods, competing with the crabbing worker. Still, the forecaster is able to identify some periods where the crabbing worker will outperform, such as at blocks > 1992.


Putting it all together, we can use the mean log loss over the testing period to determine if the forecast-implied inference has improved upon naive network inference. In this test, the forecast-implied inference from the standard inferer (log loss=1.078) beats both the naive network inference (log loss=1.620) and the best worker in the network (log loss=1.641).

]]>
https://research.allora.network/t/thorough-testing-of-new-forecaster-model/117#post_6 Wed, 28 May 2025 05:29:08 +0000 research.allora.network-post-294
Thorough testing of new forecaster model Expanding on the sinusoidal periodic benchmark, we now test a benchmark where a worker outperforms for 1 epoch at regularly-spaced intervals, and otherwise shows random performance. This tests the sensitivity of the forecaster to non-continuous periodic variables. We first test a single worker with random regrets from -0.5 to 0.5, where the regret is increased by 1 every 10 epochs. The standard forecaster performs poorly on this test, failing to identify the outperformance epochs. Using EMAs/rolling means with combinations of different span lengths did not improve the performance of the model.

A simple solution to the problem is to add autocorrelation as part of feature engineering. In this case, regret with a lag of 10 epochs becomes the dominant feature, allowing the forecaster to ‘flag’ upcoming epochs of outperformance. The forecaster predicts a regret of ~1 in the outperformance periods and ~0 at other epochs, i.e. the mean in each case. This is the optimal strategy for this benchmark given the absence of any other contextual information (by design).

In practice we use both an autocorrelation function (ACF) and partial autocorrelation function (PACF) during feature engineering, and only select lags that are significant in both (>95% confidence). The PACF is used to remove multiples of shorter lags (i.e. 20, 30, 40, etc. in the above tests), but it can sometimes identify lags that were not found in the standard ACF.
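
The lag-selection idea can be illustrated with a plain-numpy sketch of the benchmark series (the actual pipeline uses ACF and PACF at >95% confidence, as described):

```python
import numpy as np

# Regrets uniform in [-0.5, 0.5], boosted by 1 every 10 epochs
rng = np.random.default_rng(0)
r = rng.uniform(-0.5, 0.5, 1000)
r[::10] += 1.0

def autocorr(x, lag):
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

band = 1.96 / np.sqrt(len(r))  # approximate 95% white-noise band
sig_lags = [k for k in range(1, 41) if abs(autocorr(r, k)) > band]
# Lag 10 and its multiples stand out; the PACF step then prunes
# the redundant multiples (20, 30, 40).
```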

]]>
https://research.allora.network/t/thorough-testing-of-new-forecaster-model/117#post_5 Tue, 27 May 2025 23:09:37 +0000 research.allora.network-post-293
Thorough testing of new forecaster model We start with a simple, controlled test with a worker that follows sinusoidal outperformance in regret, aiming to test the sensitivity of features to periodic variables. In this test the sine function has an amplitude of 1 and a period of 10, and a random error between 0 and 1 is added at each step. These figures show the true and predicted regret from a global model forecaster that predicts raw regret, and the relative importance of the top 20 features in the model.
The forecaster model picks up this periodic outperformance well, even with randomness of similar order to the sine amplitude, mainly through the rolling mean and difference from the moving average.


The next step is to increase the complexity of the test by adding another sinusoidal outperformer with a different period (17 epochs) and amplitude (1.5), along with other inferers that have random performance as a control. Here we can see that the regrets for the outperforming workers are being predicted independently, and inferer ID has become the most important feature.



In this case a per-inferer model might perform better than a global model, so we compare the predicted versus true regrets for the global and per-inferer models. Each set of coloured points in the figures indicates a different inferer (blue for the inferer with an amplitude of 1, orange for the inferer with an amplitude of 1.5, the rest predict random values from -1 to 1), with the unfilled squares showing the medians and dashed lines showing linear fits. The black dashed line indicates a perfect 1:1 correlation.
As indicated in the figures, the per-inferer model is marginally better than the global model according to the root mean square error (RMSE), though not the mean absolute error (MAE). However, this is perhaps due to the larger scatter in the random inferers in the per-inferer model. If we only consider the sinusoidal inferers then the per-inferer model outperforms in both metrics. This can be seen in the linear fits for the sinusoidal inferers, with the per-inferer model having fits closer to the 1:1 line, particularly the amplitude=1.5 worker (orange points). I.e. the per-inferer model is better able to distinguish the performance of the two sinusoidal workers.

Global model:

Per-inferer model:

]]>
https://research.allora.network/t/thorough-testing-of-new-forecaster-model/117#post_4 Tue, 27 May 2025 06:19:10 +0000 research.allora.network-post-292
Thorough testing of new forecaster model We first identified a number of potential improvements to the forecaster model.

Model structure: The first version of the forecaster made use of a global model, where inferer addresses were used as a categorical feature in training. However, if the forecaster cannot adequately distinguish information from different inferers with the address feature, then a global model may simply predict the mean for all inferers. As an alternative we consider a per-inferer method, with separate forecasting models for each inferer.

Target variable: As discussed in the Allora Whitepaper, the first model forecasts the losses for each inferer, which are then converted to regrets to be passed to the weighting function. A potential drawback of this method is that the loss-to-regret conversion must use the network loss at the previous epoch (i.e. R_forecast = log L_network,prev - log L_forecast) rather than the actual network regret (which is not yet available). If the network regret changes significantly from epoch to epoch, this affects the final weighting. However, a benefit of forecasting losses is that they are independent for each inferer, and therefore do not depend on the makeup of the active set of inferers (see details about merit-based sortition here).

As alternatives, we consider models that instead forecast the regret or z-score of the regret for each inferer. This way, the forecaster only needs to predict the relative accuracy for each inferer, rather than the absolute accuracy (as for losses). However, these methods could then be sensitive to changes in the active set of inferers if the network loss changes significantly.
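
The candidate targets can be sketched for a single epoch (the loss values are illustrative):

```python
import numpy as np

log_losses = np.log(np.array([0.8, 1.2, 0.5, 2.0]))  # forecast per-inferer losses
log_network_prev = np.log(0.9)                       # network loss, previous epoch

# Loss forecast converted to regret via the previous network loss
regrets = log_network_prev - log_losses

# Regret z-score: relative outperformance within the active set
z_scores = (regrets - regrets.mean()) / regrets.std()
```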

Feature engineering: A number of engineered properties (exponential moving averages and standard deviations, rolling means, gradients) require an epoch span to be defined for the calculation. As the optimal span length or combination of spans is not obvious (e.g. shorter or longer spans), we will test the combinations that produce the best outcomes. We will also test if the current features are sufficient to detect periodic outperformance, or whether further feature engineering is required.
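
A sketch of these span-based features for a single regret series (the spans of 3 and 7 are illustrative choices):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
regret = pd.Series(rng.normal(size=200))  # hypothetical regret history

feats = pd.DataFrame(index=regret.index)
for span in (3, 7):
    ema = regret.ewm(span=span, adjust=False).mean()
    feats[f"ema_{span}"] = ema
    feats[f"ema_std_{span}"] = regret.ewm(span=span, adjust=False).std()
    feats[f"diff_ema_{span}"] = regret - ema  # difference from moving average
feats["gradient_3"] = regret.diff(3) / 3      # simple 3-epoch gradient
```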

Next, we must decide on a series of tests to identify the best performing model and feature set.

]]>
https://research.allora.network/t/thorough-testing-of-new-forecaster-model/117#post_3 Tue, 27 May 2025 04:13:28 +0000 research.allora.network-post-291
Handling ground truth granularity Thank you very much for this great discussion @florian! This should help define clearly for what timestamps all network participants (inferers, forecasters, reputers) should be sourcing their data and generating their predictions.

]]>
https://research.allora.network/t/handling-ground-truth-granularity/116#post_4 Mon, 26 May 2025 12:58:08 +0000 research.allora.network-post-290
Thorough testing of new forecaster model Thank you for making this thread @joel!

Accurate forecasting is critical for Allora’s performance and I’m really looking forward to reading about your findings here.

]]>
https://research.allora.network/t/thorough-testing-of-new-forecaster-model/117#post_2 Mon, 26 May 2025 12:56:37 +0000 research.allora.network-post-289