Project goal: Build a reproducible baseline model that predicts next-day percentage returns (a regression task) for individual stocks traded on the New York Stock Exchange (NYSE), using historical daily price data and annual fundamentals. The model helps a short-term trader optimize risk-reward decisions when selecting stocks.
Target user:
- A short-term trader in 2015 who seeks to maximize returns while controlling for risk.
- The trader is interested in selecting subsets of stocks and building models for different stock categories or investment strategies, rather than modeling the entire universe of stocks.
Objective:
- Use historical data to build models that balance risk and reward, helping the trader identify promising stocks for a moderate-risk portfolio.
Dataset:
- Covers 2010–2016.
- Consists of four raw files:
- `prices.csv` — daily open, high, low, close, and volume (trading days only).
- `prices-split-adjusted.csv` — daily prices adjusted for stock splits.
- `fundamentals.csv` — annual fundamentals per stock.
- `securities.csv` — stock metadata.
Notes on data:
- Prices and adjusted prices are daily time series for trading days, excluding holidays.
- Fundamentals are annual.
- Securities file contains one row per stock with metadata.
Use case:
- The trader builds models for different stock categories based on investment strategy.
- The goal is to construct a moderate-risk, high-return portfolio, where the focus is on risk-adjusted performance, not exhaustive coverage of all stocks.
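As a concrete illustration of the regression target, a next-day percentage return can be derived from daily closes as below. The column and ticker names are illustrative, not the project's exact schema:

```python
import pandas as pd

# Toy frame standing in for the daily price data (values are made up)
prices = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "AAA"],
    "close":  [100.0, 102.0, 101.0, 103.0],
})

# Daily percentage return per ticker, then shift back one day so each
# row's target is the *next* day's return
prices["return_pct"] = prices.groupby("symbol")["close"].pct_change() * 100
prices["target"] = prices.groupby("symbol")["return_pct"].shift(-1)
```

The final trading day of each ticker has no next-day return, so its target is NaN and is dropped before training.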
ML-Team-Project/
├─ data/
│ ├─ raw/ # raw CSV data files
│ └─ processed/ # processed data generated by notebooks; not tracked in Git, only appears locally
├─ experiments/ # notebooks & results; subfolders for different team members' work
├─ models/
│ ├─ lr/ # Linear Regression models for clusters of stocks
│ ├─ rf/ # Random Forest models for clusters of stocks
│ ├─ xgb/ # XGBoost models for clusters of stocks
│ └─ sarimax/ # SARIMA models for individual stocks
├─ src/
│ ├─ utils/
│ │ └─ logger.py # centralized logging
│ └─ experiment_tracking/ # MLflow, MinIO, Postgres artifacts, and python env file for experiment tracking and path resolution
├─ requirements.txt # all required packages and dependencies
├─ README.md
└─ .gitignore
To get started, first clone the repository to your local machine and navigate to the project folder:
```bash
# clone the repository
git clone <repo-url>
cd <project-folder>

# create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate       # macOS / Linux
source .venv/Scripts/activate   # Windows Git Bash

# install required packages
pip install -r requirements.txt
```

To run the experiment tracking services, navigate to the tracking folder and start the Docker environment:
```bash
cd src/experiment_tracking
# ensure Docker is running
docker compose build
docker compose up -d
```

Services started:
| Service | Default Port | Description |
|---|---|---|
| PgAdmin | 5051 | Web UI for the Postgres experiment database |
| MinIO | 9000 (API), 9001 (Console) | Object storage for ML artifacts |
| MLflow | 5001 | Experiment tracking server |
Verify services:
```bash
docker ps
docker logs -f nsye_minio
```

Access dashboards:
- MLflow: http://localhost:5001
- MinIO Console: http://localhost:9001
- PgAdmin: http://localhost:5051
Stop all services:
```bash
docker compose down
```

Typical workflow:

- Prepare your data (e.g., `src/data_processing.py`).
- Generate features (e.g., `src/features.py`).
- Train and evaluate models (e.g., `src/train_and_eval.py`).
MLflow will automatically log parameters, metrics, and artifacts to MinIO.
Ensure your virtual environment is active, Docker is running, and your MLflow client points to the correct tracking URI (`http://localhost:5001`).
Optionally, organize runs using `mlflow.set_experiment("Experiment_Name")`.
This project includes several exploratory and modeling notebooks that build toward multi‑stock time‑series prediction. Below is a concise, professional summary of the outcomes.
Two initial notebooks focus on understanding and forecasting AMZN daily return percentage:
- Notebook 01 — Target: Daily Return %
  Uses engineered features from split-adjusted price data and evaluates Linear Regression, ARIMA, XGBoost, and LSTM.
  Outcome: LSTM performed best after checking for seasonality and stabilizing the series.
- Notebook 02 — Target: Next-Day Close (converted to return)
  Models the closing price directly and computes next-day return from predictions.
  Outcome: Linear Regression achieved the strongest baseline performance.
These experiments help define expectations and serve as baselines for scaling to multiple stocks.
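The conversion step in Notebook 02 is simple arithmetic: given today's close and a predicted next-day close, the implied next-day return follows directly (the numbers below are made up):

```python
# Predicted next-day close -> implied next-day return %
close_today = 102.0
pred_close_tomorrow = 103.53
pred_return_pct = (pred_close_tomorrow - close_today) / close_today * 100  # approx. 1.5
```

Evaluating in return space rather than price space makes the two notebooks' metrics directly comparable.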
Three main notebooks form the multi‑stock workflow:
- `03b_data_preprocessing.ipynb`
  Combines `securities.csv`, `fundamentals.csv`, and split-adjusted prices to build an enriched dataset.
  Generates engineered features and produces final processed datasets saved locally under `data/processed/`.
- `04_clustering_models.ipynb`
  Groups stocks into meaningful clusters to support category-specific modeling (useful for traders selecting balanced portfolios).
- `05_time-series-cluster-training.ipynb`
  Trains forecasting models for each cluster, storing results in the `models/` directory.
  Supports risk-adjusted portfolio construction by allowing different model families per stock group.
This workflow provides clean preprocessing, cluster‑aware modeling, and reproducible experiments suited for multi‑stock prediction tasks.
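The clustering step can be sketched as follows. The two per-ticker features used here (mean return and volatility) and the cluster count are assumptions for illustration, not necessarily the notebook's exact configuration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows = tickers, columns = [mean daily return, return volatility]
# (random stand-in values; the real features come from data/processed/)
rng = np.random.default_rng(42)
ticker_features = rng.normal(size=(20, 2))

# Assign each ticker to one of three clusters for category-specific modeling
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(ticker_features)
```

Each cluster's tickers are then routed to their own model family in the training notebook.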
| Model | Notes | Performance |
|---|---|---|
| Linear Regression Baseline | Grouped by ticker | MAE: 0.4905, RMSE: 0.8112 |
| Random Forest Baseline | Grouped by ticker | MAE: 0.4750, RMSE: 0.7987 |
| ARIMA/SARIMA | Traditional time-series models (individual stock models) | MAE: 0.9160, RMSE: 1.3037 (underperformed) |
| XGBoost Baseline (Best) | Grouped by ticker | MAE: 0.4630, RMSE: 0.7558 |
The XGBoost baseline achieved the best predictive accuracy among the models explored and serves as a reliable benchmark before exploring more advanced deep learning approaches.
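For reference, the two reported metrics are mean absolute error and root mean squared error over predicted versus actual next-day returns; a worked toy example (arrays are made up):

```python
import numpy as np

y_true = np.array([0.5, -1.2, 0.3])  # actual next-day returns (%)
y_pred = np.array([0.4, -1.0, 0.1])  # model predictions (%)

mae = np.mean(np.abs(y_true - y_pred))           # approx. 0.1667
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # approx. 0.1732
```

RMSE penalizes large errors more than MAE, which is why the two rankings in the table agree but the values differ.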
All trained models are saved as `.pkl` files in the `models/` directory under the respective sub-folders (e.g., `models/xgb/`, `models/sarimax/`), enabling reuse for future training, evaluation, or deployment.
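Reusing a saved model follows the usual pickle round trip; the path and tiny stand-in model below are illustrative (actual files live under the per-family sub-folders):

```python
import pickle
import numpy as np
from pathlib import Path
from sklearn.linear_model import LinearRegression

# Fit a tiny stand-in model and save it the way the project stores models
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()
model = LinearRegression().fit(X, y)

out_dir = Path("models/lr")  # illustrative sub-folder
out_dir.mkdir(parents=True, exist_ok=True)
with (out_dir / "demo.pkl").open("wb") as f:
    pickle.dump(model, f)

# Reload later for evaluation or deployment
with (out_dir / "demo.pkl").open("rb") as f:
    reloaded = pickle.load(f)
```

Note that a pickled model should be reloaded with the same library versions it was trained with, which is one reason `requirements.txt` pins the project's dependencies.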
Building on the current baselines, future work will focus on exploring advanced deep learning approaches tailored to the clustered stock datasets.
- LSTM and RNN models: These architectures are designed to capture sequential patterns in stock returns and are expected to better model temporal dependencies across trading days. They will be applied to individual clusters to see if cluster-specific patterns improve predictive accuracy.
- Convolutional Neural Networks (CNNs): CNNs can detect local patterns and short-range structure in price movements, similar to their use in image classification. They may be useful for modeling interactions between features such as moving averages, volumes, and returns within each cluster.
- Transformer-based models: Transformers can capture long-range dependencies efficiently and model complex interactions between multiple stocks or features over time.
- Cluster-specific exploration: While the current work focused on large tech stocks that performed well post-financial crisis, future work will explore additional clusters representing mid-cap, small-cap, or high-volatility stocks to evaluate model robustness across different investment strategies.
These approaches aim to complement the existing baselines, providing more precise, cluster-aware predictions for day traders while balancing risk and return.