Project goal: Build a reproducible baseline model that predicts next-day percentage returns (a regression task) for individual stocks traded on the New York Stock Exchange (NYSE), using historical daily price data and annual fundamentals. The model helps a short-term trader optimize risk-reward decisions when selecting stocks.
Target user:
- A short-term trader in 2015 who seeks to maximize returns while controlling for risk.
- The trader is interested in selecting subsets of stocks and building models for different stock categories or investment strategies, rather than modeling the entire universe of stocks.
Objective:
- Use historical data to build models that balance risk and reward, helping the trader identify promising stocks for a moderate-risk portfolio.
Dataset:
- Covers 2010–2016.
- Consists of four raw files:
- `prices.csv` — daily open, high, low, close, and volume (trading days only).
- `prices-split-adjusted.csv` — daily prices adjusted for stock splits.
- `fundamentals.csv` — annual fundamentals per stock.
- `securities.csv` — stock metadata.
Notes on data:
- Prices and adjusted prices are daily time series for trading days, excluding holidays.
- Fundamentals are annual.
- Securities file contains one row per stock with metadata.
Use case:
- The trader builds models for different stock categories based on investment strategy.
- The goal is to construct a moderate-risk, high-return portfolio, where the focus is on risk-adjusted performance, not exhaustive coverage of all stocks.
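As a concrete illustration of the regression target, a next-day percentage return can be derived from daily closes as below. The column and ticker names are illustrative, not the project's exact schema:

```python
import pandas as pd

# Toy frame standing in for the daily price data (values are made up)
prices = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "AAA"],
    "close":  [100.0, 102.0, 101.0, 103.0],
})

# Daily percentage return per ticker, then shift back one day so each
# row's target is the *next* day's return
prices["return_pct"] = prices.groupby("symbol")["close"].pct_change() * 100
prices["target"] = prices.groupby("symbol")["return_pct"].shift(-1)
```

The final trading day of each ticker has no next-day return, so its target is NaN and is dropped before training.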
ML-Team-Project/
├─ data/
│ ├─ raw/ # raw CSV data files
│ └─ processed/ # processed data generated by notebooks; not tracked in Git, only appears locally
├─ experiments/ # notebooks & results; subfolders for different team members' work
├─ models/
│ ├─ lr/ # Linear Regression models for clusters of stocks
│ ├─ rf/ # Random Forest models for clusters of stocks
│ ├─ xgb/ # XGBoost models for clusters of stocks
│ └─ sarimax/ # SARIMA models for individual stocks
├─ src/
│ ├─ utils/
│ │ └─ logger.py # centralized logging
│ └─ experiment_tracking/ # MLflow, MinIO, Postgres artifacts, and python env file for experiment tracking and path resolution
├─ requirements.txt # all required packages and dependencies
├─ README.md
└─ .gitignore
To get started, first clone the repository to your local machine and navigate to the project folder:
```bash
# clone the repository
git clone <repo-url>
cd <project-folder>

# create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate       # macOS / Linux
source .venv/Scripts/activate   # Windows Git Bash

# install required packages
pip install -r requirements.txt
```

To run the experiment tracking services, navigate to the tracking folder and start the Docker environment:
```bash
cd src/experiment_tracking
# ensure Docker is running
docker compose build
docker compose up -d
```

Services started:
| Service | Default Port | Description |
|---|---|---|
| PgAdmin | 5051 | Web UI for the Postgres experiment database |
| MinIO | 9000 (API), 9001 (Console) | Object storage for ML artifacts |
| MLflow | 5001 | Experiment tracking server |
Verify services:
```bash
docker ps
docker logs -f nsye_minio
```

Access dashboards:
- MLflow: http://localhost:5001
- MinIO Console: http://localhost:9001
- PgAdmin: http://localhost:5051
Stop all services:
```bash
docker compose down
```

Typical workflow:

- Prepare your data (e.g., `src/data_processing.py`).
- Generate features (e.g., `src/features.py`).
- Train and evaluate models (e.g., `src/train_and_eval.py`).
MLflow will automatically log parameters, metrics, and artifacts to MinIO.
Ensure your virtual environment is active, Docker is running, and your MLflow client points to the correct tracking URI (`http://localhost:5001`).
Optionally, organize runs using `mlflow.set_experiment("Experiment_Name")`.
This project includes several exploratory and modeling notebooks that build toward multi‑stock time‑series prediction. Below is a concise, professional summary of the outcomes.
Two initial notebooks focus on understanding and forecasting AMZN daily return percentage:
- Notebook 01 — Target: Daily Return %
  Uses engineered features from split-adjusted price data and evaluates Linear Regression, ARIMA, XGBoost, and LSTM.
  Outcome: LSTM performed best after checking for seasonality and stabilizing the series.
- Notebook 02 — Target: Next-Day Close (converted to return)
  Models the closing price directly and computes next-day return from predictions.
  Outcome: Linear Regression achieved the strongest baseline performance.
These experiments help define expectations and serve as baselines for scaling to multiple stocks.
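The conversion step in Notebook 02 is simple arithmetic: given today's close and a predicted next-day close, the implied next-day return follows directly (the numbers below are made up):

```python
# Predicted next-day close -> implied next-day return %
close_today = 102.0
pred_close_tomorrow = 103.53
pred_return_pct = (pred_close_tomorrow - close_today) / close_today * 100  # approx. 1.5
```

Evaluating in return space rather than price space makes the two notebooks' metrics directly comparable.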
Three main notebooks form the multi‑stock workflow:
- `03b_data_preprocessing.ipynb`
  Combines `securities.csv`, `fundamentals.csv`, and split-adjusted prices to build an enriched dataset.
  Generates engineered features and produces final processed datasets saved locally under `data/processed/`.
- `04_clustering_models.ipynb`
  Groups stocks into meaningful clusters to support category-specific modeling (useful for traders selecting balanced portfolios).
- `05_time-series-cluster-training.ipynb`
  Trains forecasting models for each cluster, storing results in the `models/` directory.
  Supports risk-adjusted portfolio construction by allowing different model families per stock group.
This workflow provides clean preprocessing, cluster‑aware modeling, and reproducible experiments suited for multi‑stock prediction tasks.
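The clustering step can be sketched as follows. The two per-ticker features used here (mean return and volatility) and the cluster count are assumptions for illustration, not necessarily the notebook's exact configuration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows = tickers, columns = [mean daily return, return volatility]
# (random stand-in values; the real features come from data/processed/)
rng = np.random.default_rng(42)
ticker_features = rng.normal(size=(20, 2))

# Assign each ticker to one of three clusters for category-specific modeling
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(ticker_features)
```

Each cluster's tickers are then routed to their own model family in the training notebook.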
| Model | Notes | Performance |
|---|---|---|
| Linear Regression Baseline | Grouped by ticker | MAE: 0.4905, RMSE: 0.8112 |
| Random Forest Baseline | Grouped by ticker | MAE: 0.4750, RMSE: 0.7987 |
| ARIMA/SARIMA | Traditional time-series models (individual stock models) | MAE: 0.9160, RMSE: 1.3037 (underperformed) |
| XGBoost Baseline (Best) | Grouped by ticker | MAE: 0.4630, RMSE: 0.7558 |
The XGBoost baseline achieved the best predictive accuracy among the models explored and serves as a reliable benchmark before exploring more advanced deep learning approaches.
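For reference, the two reported metrics are mean absolute error and root mean squared error over predicted versus actual next-day returns; a worked toy example (arrays are made up):

```python
import numpy as np

y_true = np.array([0.5, -1.2, 0.3])  # actual next-day returns (%)
y_pred = np.array([0.4, -1.0, 0.1])  # model predictions (%)

mae = np.mean(np.abs(y_true - y_pred))           # approx. 0.1667
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # approx. 0.1732
```

RMSE penalizes large errors more than MAE, which is why the two rankings in the table agree but the values differ.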
All trained models are saved as `.pkl` files in the `models/` directory under the respective sub-folders (e.g., `models/xgb/`, `models/sarimax/`), enabling reuse for future training, evaluation, or deployment.
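Reusing a saved model follows the usual pickle round trip; the path and tiny stand-in model below are illustrative (actual files live under the per-family sub-folders):

```python
import pickle
import numpy as np
from pathlib import Path
from sklearn.linear_model import LinearRegression

# Fit a tiny stand-in model and save it the way the project stores models
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()
model = LinearRegression().fit(X, y)

out_dir = Path("models/lr")  # illustrative sub-folder
out_dir.mkdir(parents=True, exist_ok=True)
with (out_dir / "demo.pkl").open("wb") as f:
    pickle.dump(model, f)

# Reload later for evaluation or deployment
with (out_dir / "demo.pkl").open("rb") as f:
    reloaded = pickle.load(f)
```

Note that a pickled model should be reloaded with the same library versions it was trained with, which is one reason `requirements.txt` pins the project's dependencies.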
Building on the current baselines, future work will focus on exploring advanced deep learning approaches tailored to the clustered stock datasets.
- LSTM and RNN models: These architectures are designed to capture sequential patterns in stock returns and are expected to better model temporal dependencies across trading days. They will be applied to individual clusters to see if cluster-specific patterns improve predictive accuracy.
- Convolutional Neural Networks (CNNs): CNNs can detect local patterns and short-range structure in price movements, similar to their use in image classification. They may be useful for modeling interactions between features such as moving averages, volumes, and returns within each cluster.
- Transformer-based models: Transformers can capture long-range dependencies efficiently and model complex interactions between multiple stocks or features over time.
- Cluster-specific exploration: While the current work focused on large tech stocks that performed well post-financial crisis, future work will explore additional clusters representing mid-cap, small-cap, or high-volatility stocks to evaluate model robustness across different investment strategies.
These approaches aim to complement the existing baselines, providing more precise, cluster-aware predictions for day traders while balancing risk and return.