Data + ML pipeline for discovering, enriching, and scoring newly created liquidity pools on Solana DEXs. It continuously ingests pool metadata, augments it with technical analysis (TA) signals from Binance futures, and applies trained ML models to identify potentially interesting token pairs.
The project is designed for:
- Data collection (new pools, metadata, price/volume/txns)
- Feature engineering (TA indicators, time features, text embeddings)
- Model training (classification + regression)
- Real-time scoring of new pools with alerting hooks
Disclaimer: This project is for research and data analysis. It does not provide financial advice.
- `db_collector.py` — continuously pulls new pools from GeckoTerminal, resolves details via DexScreener, and stores rows in SQLite.
- `new_pool_sniper.py` — older CSV-based pipeline (still useful for quick datasets).
- `reclassify.py` — updates/enriches historical data and produces `data/tokens_raw_reclassified.csv` for modeling.
- `snipe.py` — live scoring of new pools using trained models + TA features; can send Telegram alerts and open URLs.
- `db_expose.py` — FastAPI server to download the DB or stream CSV extracts.
- `data_preprocessing.ipynb` — end-to-end feature engineering and dataset preparation.
- `TA_maps.ipynb` — build TA "maps" for indicators.
- `cls_modeling.ipynb` / `reg_modeling.ipynb` — train classification/regression models.
- Python 3.11.x
- CUDA-capable GPU (recommended for embedding + model inference)
- Dependencies listed in `requirements.txt`
```
git clone https://github.com/philipzabicki/solanaDEXtokenCollector.git
cd solanaDEXtokenCollector
pip install -r requirements.txt
```

Create a `credentials.py` file in the project root with the required keys:
```python
# Binance (TA features)
binance_API_KEY = "your_binance_key"
binance_SECRET_KEY = "your_binance_secret"

# Telegram (alerts)
TELEGRAM_TOKEN = "your_telegram_bot_token"
TELEGRAM_CHAT_IDs = ["123456789"]

# Optional: for reclassify.py (remote DB bootstrap)
URL = "https://your-host/tokens_raw.db"
INCREMENTAL_URL = "https://your-host/tokens-since"
```

```
python db_collector.py
```

This creates/updates `data/tokens_raw.db` and downloads token images into `data/imgs/`.
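As a rough sketch of the collection step, fetching newly created Solana pools and flattening one pool record might look like the snippet below. The endpoint path follows GeckoTerminal's public v2 API, and the field names are assumptions about its payload, not the project's exact schema:

```python
import json
from urllib.request import urlopen

# GeckoTerminal v2 endpoint for newly created pools on Solana.
NEW_POOLS_URL = "https://api.geckoterminal.com/api/v2/networks/solana/new_pools"

def fetch_new_pools(page=1, timeout=10):
    """Fetch one page of newly created Solana pools (makes a network call)."""
    with urlopen(f"{NEW_POOLS_URL}?page={page}", timeout=timeout) as resp:
        return json.load(resp).get("data", [])

def parse_pool(item):
    """Flatten one GeckoTerminal-style pool record into a row-like dict.

    The attribute names here are illustrative; check the live payload
    before relying on them.
    """
    attrs = item.get("attributes", {})
    return {
        "pool_address": attrs.get("address"),
        "name": attrs.get("name"),
        "base_token_price_usd": attrs.get("base_token_price_usd"),
        "created_at": attrs.get("pool_created_at"),
    }
```

Rows produced by `parse_pool` can then be inserted into SQLite; the real collector also resolves extra details via DexScreener before storing.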
```
python reclassify.py
```

This produces `data/tokens_raw_reclassified.csv` with current metrics and metadata.
Run the notebooks in this order:
1. `TA_maps.ipynb`
2. `data_preprocessing.ipynb`
3. `cls_modeling.ipynb`
4. `reg_modeling.ipynb`
Outputs are stored under `models/` and `data/modeling/`.
Ensure the required artifacts exist in `models/`:

- `final_cls_model.joblib`
- `final_reg_model.joblib`
- `pca_name.joblib`
- `pca_symbol.joblib`
- `important_name_pca_indices.joblib`
- `important_symbol_pca_indices.joblib`
- `feature_medians.joblib`
- `selected_indicators.json`
Then run:
```
python snipe.py
```

The script will:
- Pull fresh pools
- Build features (TA + embeddings + time features)
- Run classification + regression models
- Emit Telegram alerts when thresholds are met
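Two of the steps above can be sketched in isolation. The median-imputation helper mirrors the role `feature_medians.joblib` presumably plays, and the threshold check is illustrative; the names and threshold values are assumptions, not the project's tuned defaults:

```python
def impute_with_medians(features, medians):
    """Fill missing feature values with medians stored at training time.

    `medians` defines the expected feature set; any feature that is
    absent or None in `features` falls back to its training median.
    """
    return {
        name: features.get(name) if features.get(name) is not None else median
        for name, median in medians.items()
    }

def should_alert(cls_prob, reg_pred, prob_threshold=0.8, ret_threshold=1.5):
    """Alert only when both model outputs clear their (illustrative) thresholds."""
    return cls_prob >= prob_threshold and reg_pred >= ret_threshold
```

In the real script the classification probability and regression prediction come from the `.joblib` models listed earlier, and a positive `should_alert`-style check triggers the Telegram message.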
```
python db_expose.py
```

Endpoints:
- `GET /download-db` — download the SQLite file
- `GET /dump` — full CSV export
- `GET /tokens` — paginated JSON
- `GET /token/{pair_address}` — single token by address
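Under the hood, the paginated `/tokens` endpoint amounts to a `LIMIT`/`OFFSET` query against the SQLite file. A minimal sketch of that query, with assumed table and column names rather than the project's actual schema, might be:

```python
import sqlite3

def get_tokens(conn, page=1, per_page=50):
    """Return one page of token rows, newest first.

    The `tokens` table and its columns are assumptions about the
    project's schema, used here only to illustrate the pagination.
    """
    offset = (page - 1) * per_page
    cur = conn.execute(
        "SELECT pair_address, name FROM tokens ORDER BY rowid DESC LIMIT ? OFFSET ?",
        (per_page, offset),
    )
    return [dict(zip(("pair_address", "name"), row)) for row in cur.fetchall()]
```

The FastAPI handler would open `data/tokens_raw.db`, call a function like this with query parameters, and return the list as JSON.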
- Collect new pools → GeckoTerminal
- Resolve details → DexScreener
- Store & enrich → SQLite + missing metadata + images
- Reclassify → add current metrics
- Engineer features → TA + time + name/symbol embeddings
- Train models → classification + regression
- Score live → alert when thresholds are met
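The "time" part of the feature-engineering step above can be sketched as cyclical encodings of the pool's creation timestamp, so that hour 23 and hour 0 end up close in feature space. This is a common pattern and an assumption about the project's exact features:

```python
import math
from datetime import datetime

def time_features(created_at_iso):
    """Cyclical hour-of-day and day-of-week features from an ISO timestamp."""
    dt = datetime.fromisoformat(created_at_iso.replace("Z", "+00:00"))
    hour = dt.hour + dt.minute / 60  # fractional hour in [0, 24)
    dow = dt.weekday()               # 0 = Monday
    return {
        "hour_sin": math.sin(2 * math.pi * hour / 24),
        "hour_cos": math.cos(2 * math.pi * hour / 24),
        "dow_sin": math.sin(2 * math.pi * dow / 7),
        "dow_cos": math.cos(2 * math.pi * dow / 7),
    }
```

Encoding each cycle as a sin/cos pair keeps the features continuous across the midnight and week boundaries, which plain integer hour/weekday columns do not.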
- The project expects CUDA for the embedding + ML steps in the notebooks and `snipe.py`.
- Data collection runs continuously; consider running it in a screen/tmux session.
- The alert logic and thresholds live in `snipe.py` and can be tuned.