kay-ou/SimTradeData

SimTradeData - Quantitative Trading Data Downloader

Python 3.10+ | License: AGPL-3.0 | DuckDB | Parquet | Code Style: Black | Poetry

BaoStock + Mootdx + EastMoney + yfinance Multi-Source | China A-Shares + US Stocks | PTrade Compatible | DuckDB + Parquet Storage

SimTradeData is a data download tool built for SimTradeLab. It covers China A-shares (via BaoStock, Mootdx, and EastMoney) and US stocks (via yfinance), and automatically assigns each data type to the source best suited for it. Downloaded data is stored in DuckDB as an intermediate store, which supports incremental updates and SQL queries, and is then exported to Parquet.


Recommended Combo: SimTradeData + SimTradeLab

Fully PTrade Compatible | A-Shares + US Stocks | 20x+ Backtesting Speedup

SimTradeLab

No PTrade Strategy Code Changes Needed | Ultra-Fast Local Backtesting | Zero-Cost Solution


Key Features

Efficient Storage Architecture

  • DuckDB Intermediate Storage: High-performance columnar database with SQL queries and incremental updates
  • Parquet Export Format: Highly compressed, cross-platform compatible, ideal for large-scale data analysis
  • Automatic Incremental Updates: Intelligently detects existing data, only downloads new records

Comprehensive Data Coverage

  • Market Data: OHLCV daily bars with limit-up/down prices and previous close
  • Valuation Metrics: PE/PB/PS/PCF/Turnover Rate/Total Shares/Float Shares
  • Financial Data: 23 quarterly financial indicators + automatic TTM calculation
  • Corporate Actions: Dividends, bonus shares, rights offerings (with forward adjustment factors)
  • Metadata: Stock info, trading calendar, index constituents, ST/suspension status
  • US Stock Support: 6,000+ US common stocks, S&P 500 / NASDAQ-100 index constituents

Data Quality Assurance

  • Auto-Validation: Data integrity validation before writes
  • Export-Time Calculation: Limit prices, TTM metrics computed at export for consistency
  • Detailed Logging: Comprehensive error logs and warnings

Generated Data Structure

data/
├── cn.duckdb                    # DuckDB database - A-shares (download source)
├── us.duckdb                    # DuckDB database - US stocks (download source)
└── export/                      # Exported Parquet files (by market)
    ├── cn/                      # A-shares export
    │   ├── stocks/              # Daily bars (one file per stock)
    │   │   ├── 000001.SZ.parquet
    │   │   └── 600519.SS.parquet
    │   ├── exrights/            # Corporate action events
    │   ├── fundamentals/        # Quarterly financials (with TTM)
    │   ├── valuation/           # Valuation metrics (daily)
    │   ├── metadata/            # Metadata
    │   └── manifest.json
    └── us/                      # US stocks export
        ├── stocks/
        │   ├── AAPL.US.parquet
        │   └── MSFT.US.parquet
        ├── exrights/
        ├── fundamentals/
        ├── valuation/
        ├── metadata/
        └── manifest.json
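
After copying or exporting data, it can be handy to confirm that a market directory has the layout shown above. The following is a minimal sketch using only the standard library; the function name `missing_export_entries` is hypothetical and not part of SimTradeData.

```python
from pathlib import Path

# Subdirectories listed in the tree above; both markets use the same set.
EXPECTED_DIRS = ("stocks", "exrights", "fundamentals", "valuation", "metadata")


def missing_export_entries(export_root: Path, market: str) -> list[str]:
    """Return the expected entries that are absent under export_root/market."""
    market_dir = Path(export_root) / market
    missing = [name for name in EXPECTED_DIRS if not (market_dir / name).is_dir()]
    if not (market_dir / "manifest.json").is_file():
        missing.append("manifest.json")
    return missing
```

For example, `missing_export_entries(Path("data/export"), "cn")` returns an empty list when the A-shares export is complete.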

Prerequisites

  • Python: 3.10 or higher
  • Poetry: Installation guide
  • Network: Required for downloading data from BaoStock/Mootdx/EastMoney/yfinance (China mainland network recommended for A-share data)

Quick Start

Option 1: Download Pre-Built Data (Recommended)

Download the latest data from Releases:

  • A-shares: data-cn-v* release → extract to data/cn/ under your SimTradeLab directory
  • US stocks: data-us-v* release → extract to data/us/ under your SimTradeLab directory

# A-shares
mkdir -p /path/to/SimTradeLab/data/cn
tar -xzf simtradelab-data-cn-*.tar.gz -C /path/to/SimTradeLab/data/cn/

# US stocks
mkdir -p /path/to/SimTradeLab/data/us
tar -xzf simtradelab-data-us-*.tar.gz -C /path/to/SimTradeLab/data/us/

Option 2: Download Data Yourself

1. Install Dependencies

# Clone the project
git clone https://github.com/kay-ou/SimTradeData.git
cd SimTradeData

# Install dependencies
poetry install

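# Activate virtual environment (optional: every command below already uses `poetry run`;
# on Poetry 2.x, `poetry shell` needs the poetry-plugin-shell plugin)
poetry shell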

2. Download Data

Recommended: Unified Download Command

A single command downloads all data, automatically orchestrating Mootdx and BaoStock for their respective strengths:

# Full download (recommended)
# Mootdx: market data, corporate actions, bulk financials, trading calendar, benchmark index
# BaoStock: valuation metrics, ST/suspension status, index constituents
poetry run python scripts/download.py

# Fast first-time download: import TDX daily package first, then supplement with corporate actions etc.
# (6,000+ stocks OHLCV reduced from hours to minutes)
poetry run python scripts/download.py --tdx-download --source mootdx --skip-fundamentals

# Use an already-downloaded TDX ZIP file
poetry run python scripts/download.py --tdx-source data/downloads/hsjday.zip --source mootdx

# Check data status
poetry run python scripts/download.py --status

# Skip financial data (faster)
poetry run python scripts/download.py --skip-fundamentals

# Run Mootdx phase only
poetry run python scripts/download.py --source mootdx

# Run BaoStock phase only
poetry run python scripts/download.py --source baostock

Data Source Division of Labor

| Data Type | Source | Reason |
|---|---|---|
| OHLCV Market Data (first time) | TDX Daily Package | Fastest, ~500MB bulk import of full history |
| OHLCV Market Data (incremental) | Mootdx | Fast, local network |
| Corporate Actions (XDXR) | Mootdx | More complete data |
| Bulk Financial Data | Mootdx | One ZIP = all stocks, far better than per-stock queries |
| Valuation PE/PB/PS/Turnover | BaoStock | Exclusive data |
| ST/Suspension Status | BaoStock | Exclusive data |
| Index Constituents | BaoStock | Exclusive data |
| Trading Calendar | Mootdx | Comes with market data |
| Benchmark Index | Mootdx | Comes with market data |
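
The division of labor above amounts to a static lookup from data type to source. As a rough illustration (the dictionary keys and the helper `data_types_for` are hypothetical, and this is not the contents of the project's route_config.py), it could be written as:

```python
# Hypothetical lookup mirroring the table above.
SOURCE_BY_DATA_TYPE = {
    "ohlcv_first_time": "tdx_daily_package",
    "ohlcv_incremental": "mootdx",
    "corporate_actions": "mootdx",
    "bulk_financials": "mootdx",
    "valuation": "baostock",
    "st_suspension_status": "baostock",
    "index_constituents": "baostock",
    "trading_calendar": "mootdx",
    "benchmark_index": "mootdx",
}


def data_types_for(source: str) -> list[str]:
    """List the data types assigned to one source, in alphabetical order."""
    return sorted(
        data_type
        for data_type, assigned in SOURCE_BY_DATA_TYPE.items()
        if assigned == source
    )
```

Read this way, `--source baostock` corresponds to the three BaoStock rows, and `--source mootdx` to the Mootdx rows.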

Using Individual Data Sources

# BaoStock (includes valuation data, but slower)
poetry run python scripts/download_efficient.py
poetry run python scripts/download_efficient.py --skip-fundamentals
poetry run python scripts/download_efficient.py --valuation-only  # Valuation + status only

# Mootdx (faster, but no valuation data)
poetry run python scripts/download_mootdx.py
poetry run python scripts/download_mootdx.py --skip-fundamentals

EastMoney Complementary Data (Money Flow, Dragon Tiger Board, Margin Trading)

# Download last 30 days of complementary data (requires existing market data)
poetry run python scripts/download_daily_extras.py

# Specify number of days (the Dragon Tiger Board (LHB) API retains only ~30 days, so run this regularly)
poetry run python scripts/download_daily_extras.py --days 7

US Stock Data (yfinance)

Free US stock data via yfinance, no API key required:

# Full download (6,000+ US stocks with OHLCV + financials + valuation + metadata)
poetry run python scripts/download_us.py

# Specific symbols (small-scale testing)
poetry run python scripts/download_us.py --symbols AAPL,MSFT,GOOGL

# Market data only (skip time-consuming per-stock financials and metadata)
poetry run python scripts/download_us.py --skip-fundamentals --skip-metadata

# Specify start date
poetry run python scripts/download_us.py --start-date 2020-01-01

US tickers use the same {code}.{market} format as A-shares: AAPL.US, like 600000.SS. US data is stored in a separate database, data/us.duckdb.
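
To make the `{code}.{market}` convention concrete, here is a small sketch that splits a ticker and picks the matching database file. The names `Ticker`, `parse_ticker`, and `database_for` are hypothetical; only the suffixes and database paths come from this README.

```python
from typing import NamedTuple


class Ticker(NamedTuple):
    code: str
    market: str


# Suffixes that appear in this README; other suffixes are not covered here.
DATABASE_BY_SUFFIX = {
    "SS": "data/cn.duckdb",
    "SZ": "data/cn.duckdb",
    "US": "data/us.duckdb",
}


def parse_ticker(symbol: str) -> Ticker:
    """Split on the last dot so codes that contain dots stay intact."""
    code, sep, market = symbol.rpartition(".")
    if not sep or not code or market not in DATABASE_BY_SUFFIX:
        raise ValueError(f"unsupported ticker: {symbol!r}")
    return Ticker(code, market)


def database_for(symbol: str) -> str:
    """Return the DuckDB file that holds data for this ticker's market."""
    return DATABASE_BY_SUFFIX[parse_ticker(symbol).market]
```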

TDX Official Data Package (Fastest Way to Get Full Historical Data)

# Auto-download official TDX Shanghai/Shenzhen/Beijing daily data package (~500MB)
poetry run python scripts/download_tdx_day.py

# Force re-download
poetry run python scripts/download_tdx_day.py --force-download

# Use an already-downloaded file
poetry run python scripts/download_tdx_day.py --file hsjday.zip

3. Export to Parquet

# Export A-shares → data/export/cn/
poetry run python scripts/export_parquet.py

# Export US stocks → data/export/us/
poetry run python scripts/export_parquet.py --market us

# Custom output directory
poetry run python scripts/export_parquet.py --market cn --output /custom/path

4. Release to GitHub (Maintainer)

# Release A-shares data
bash scripts/release_data.sh --market cn

# Release US stock data
bash scripts/release_data.sh --market us

# Specify version
bash scripts/release_data.sh --market cn 1.3.0

5. Use in SimTradeLab

# Copy exported data to SimTradeLab data directory
rsync -a data/export/cn/ /path/to/SimTradeLab/data/cn/
rsync -a data/export/us/ /path/to/SimTradeLab/data/us/

Project Architecture

SimTradeData/
├── scripts/
│   ├── download.py                # Unified download entry (recommended for A-shares)
│   ├── download_efficient.py      # BaoStock download script
│   ├── download_mootdx.py         # Mootdx (TDX API) download script
│   ├── download_daily_extras.py   # EastMoney complementary data download script
│   ├── download_tdx_day.py        # TDX official daily data package download/import
│   ├── download_us.py             # US stock download script (yfinance)
│   ├── import_tdx_day.py          # TDX .day file import script
│   ├── export_parquet.py          # Parquet export script
│   └── release_data.sh            # GitHub Release publishing script
├── simtradedata/
│   ├── router/
│   │   ├── smart_router.py      # SmartRouter - smart data source routing
│   │   ├── route_config.py      # Route table configuration
│   │   └── exceptions.py        # Router exceptions
│   ├── fetchers/
│   │   ├── base_fetcher.py      # Base Fetcher class
│   │   ├── baostock_fetcher.py  # BaoStock data fetching
│   │   ├── unified_fetcher.py   # BaoStock unified fetching (optimized)
│   │   ├── mootdx_fetcher.py    # Mootdx basic data fetching
│   │   ├── mootdx_unified_fetcher.py  # Mootdx unified data fetching
│   │   ├── mootdx_affair_fetcher.py   # Mootdx financial data fetching
│   │   ├── eastmoney_fetcher.py # EastMoney complementary data fetching
│   │   └── yfinance_fetcher.py  # yfinance US stock data fetching
│   ├── processors/
│   │   └── data_splitter.py     # Data stream splitting
│   ├── writers/
│   │   └── duckdb_writer.py     # DuckDB write and export
│   ├── validators/
│   │   └── data_validator.py    # Data quality validation
│   ├── config/
│   │   ├── field_mappings.py    # A-share field mapping config
│   │   ├── us_field_mappings.py # US stock field mapping config
│   │   └── mootdx_finvalue_map.py  # Mootdx financial field mapping
│   └── utils/
│       ├── code_utils.py        # Stock code conversion
│       └── ttm_calculator.py    # Quarterly range calculation
├── data/                        # Data directory (gitignored)
│   ├── cn.duckdb                # A-shares DuckDB source
│   ├── us.duckdb                # US stocks DuckDB source
│   └── export/                  # Parquet exports
│       ├── cn/                  # A-shares export
│       └── us/                  # US stocks export
└── docs/                        # Documentation
    ├── PTRADE_PARQUET_FORMAT.md # Parquet format specification
    └── PTrade_API_mini_Reference.md

Core Modules

1. SmartRouter - Smart Data Source Router

  • Unified data access API, automatically selects the best data source by data type and market
  • Static priority + health-aware: auto fallback to backup sources when primary fails
  • Integrates Phase 1 circuit breaker, skips unhealthy sources

from simtradedata.router import SmartRouter

with SmartRouter() as router:
    # Auto-selects best source: mootdx → eastmoney → baostock
    df = router.get_daily_bars("600000.SS", "2024-01-01", "2024-12-31")

    # Single-source data also goes through router for unified API
    mf = router.get_money_flow("600000.SS", "2024-01-01", "2024-12-31")

    # US stocks auto-route to yfinance
    us = router.get_daily_bars("AAPL.US", "2024-01-01", "2024-12-31")
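
The "static priority + health-aware" behavior described above can be sketched in a few lines. This is an illustration of the general technique, not SmartRouter's actual code: `CircuitBreaker`, `fetch_with_fallback`, and the failure threshold are all assumptions.

```python
from typing import Any, Callable


class CircuitBreaker:
    """Marks a source unhealthy after max_failures consecutive errors."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures: dict[str, int] = {}

    def is_healthy(self, source: str) -> bool:
        return self.failures.get(source, 0) < self.max_failures

    def record(self, source: str, ok: bool) -> None:
        # A success resets the counter; a failure increments it.
        self.failures[source] = 0 if ok else self.failures.get(source, 0) + 1


def fetch_with_fallback(
    priority: list[str],
    fetchers: dict[str, Callable[..., Any]],
    breaker: CircuitBreaker,
    *args: Any,
) -> tuple[str, Any]:
    """Try sources in priority order, skipping any the breaker marks unhealthy."""
    errors: dict[str, Exception] = {}
    for source in priority:
        if not breaker.is_healthy(source):
            continue
        try:
            result = fetchers[source](*args)
        except Exception as exc:
            breaker.record(source, ok=False)
            errors[source] = exc
            continue
        breaker.record(source, ok=True)
        return source, result
    raise RuntimeError(f"all sources failed or were skipped: {errors}")
```

This sketch never re-admits an unhealthy source; a production breaker would typically retry it after a cool-down period.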

2. UnifiedDataFetcher - Unified Data Fetching

  • Single API call fetches market, valuation, and status data
  • Reduces API calls by 33%

3. DuckDBWriter - Data Storage and Export

  • Efficient incremental writes (upsert)
  • Computes limit prices and TTM metrics at export time
  • Forward-fills quarterly data to daily frequency

4. DataSplitter - Data Stream Splitting

  • Routes unified data to appropriate tables by type
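
The splitting and upsert steps described for the storage modules can be illustrated together. The sketch below uses Python's built-in `sqlite3` as a stand-in for DuckDB (DuckDB accepts a similar `ON CONFLICT ... DO UPDATE` clause); the table schemas, column groups, and function names are assumptions made for this example, not the project's real schema.

```python
import sqlite3

# Hypothetical schemas; both tables are keyed by (symbol, date).
SCHEMA = """
CREATE TABLE stocks (symbol TEXT, date TEXT, open REAL, close REAL,
                     volume INTEGER, PRIMARY KEY (symbol, date));
CREATE TABLE valuation (symbol TEXT, date TEXT, pe_ttm REAL, pb REAL,
                        PRIMARY KEY (symbol, date));
"""
TABLE_COLUMNS = {
    "stocks": ("open", "close", "volume"),
    "valuation": ("pe_ttm", "pb"),
}


def split_record(record: dict) -> dict[str, dict]:
    """Split one unified row into per-table rows that share the same key."""
    key = {"symbol": record["symbol"], "date": record["date"]}
    return {
        table: {**key, **{col: record[col] for col in cols if col in record}}
        for table, cols in TABLE_COLUMNS.items()
    }


def upsert(conn: sqlite3.Connection, table: str, row: dict) -> None:
    """Insert the row, or overwrite the row that has the same (symbol, date)."""
    cols = list(row)
    updates = ", ".join(
        f"{c} = excluded.{c}" for c in cols if c not in ("symbol", "date")
    )
    action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
    placeholders = ", ".join("?" for _ in cols)
    conn.execute(
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders}) "
        f"ON CONFLICT (symbol, date) {action}",
        [row[c] for c in cols],
    )
```

Writing the same `(symbol, date)` twice leaves one row holding the latest values, which is what makes re-running a download safe.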

Data Field Reference

stocks/ - Daily Stock Bars

| Field | Description |
|---|---|
| date | Trading date |
| open/high/low/close | OHLC prices |
| high_limit/low_limit | Limit-up/down prices (computed at export) |
| preclose | Previous close price |
| volume | Trading volume (shares) |
| money | Trading amount (CNY) |
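
Since high_limit and low_limit are derived from preclose at export time, a sketch of one plausible calculation may help. This README does not state the exact rule, so the 10% band and half-up rounding to the cent used below are assumptions, and `limit_prices` is a hypothetical name.

```python
from decimal import Decimal, ROUND_HALF_UP


def limit_prices(preclose: float, pct: str = "0.10") -> tuple[float, float]:
    """Return (high_limit, low_limit) for a previous close.

    pct is a parameter because the band is an assumption here and may
    differ by board or stock status.
    """
    base = Decimal(str(preclose))  # go through str to avoid binary float noise
    ratio = Decimal(pct)
    cent = Decimal("0.01")
    high = (base * (1 + ratio)).quantize(cent, rounding=ROUND_HALF_UP)
    low = (base * (1 - ratio)).quantize(cent, rounding=ROUND_HALF_UP)
    return float(high), float(low)
```

Using `Decimal` keeps values such as 13.585 from rounding the wrong way due to floating-point representation.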

valuation/ - Valuation Metrics (Daily)

| Field | Description |
|---|---|
| pe_ttm/pb/ps_ttm/pcf | Valuation ratios |
| roe/roe_ttm/roa/roa_ttm | Profitability metrics (forward-filled from quarterly reports) |
| naps | Net asset per share (computed at export) |
| total_shares/a_floats | Total shares / float shares |
| turnover_rate | Turnover rate |
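
Forward-filling quarterly values to daily frequency means each trading day carries the most recent quarterly value available on or before that day. A minimal standard-library sketch follows; `forward_fill` is a hypothetical name, and which date a report becomes effective on (period end or publication) is left to the caller.

```python
from bisect import bisect_right
from typing import Optional


def forward_fill(
    trading_days: list[str],
    quarterly: list[tuple[str, float]],
) -> dict[str, Optional[float]]:
    """Map each trading day to the latest quarterly value dated on or before it.

    Dates are ISO strings (YYYY-MM-DD), which sort correctly as text.
    quarterly must be sorted by date. Days before the first report get None.
    """
    report_dates = [d for d, _ in quarterly]
    filled: dict[str, Optional[float]] = {}
    for day in trading_days:
        idx = bisect_right(report_dates, day) - 1
        filled[day] = quarterly[idx][1] if idx >= 0 else None
    return filled
```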

fundamentals/ - Financial Data (Quarterly)

Contains 23 financial indicators and their TTM versions. See PTRADE_PARQUET_FORMAT.md for details.
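
As a rough illustration of a TTM (trailing twelve months) value, the sketch below sums the latest four single-quarter figures. It assumes the inputs are already single-quarter values in chronological order; cumulative year-to-date figures would need to be differenced first. The function name `ttm` is hypothetical, and the project's own calculation may differ.

```python
from typing import Optional


def ttm(quarterly_values: list[float]) -> list[Optional[float]]:
    """Rolling sum over four consecutive quarters; None until four exist."""
    out: list[Optional[float]] = []
    for i in range(len(quarterly_values)):
        window = quarterly_values[max(0, i - 3): i + 1]
        out.append(sum(window) if len(window) == 4 else None)
    return out
```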

Configuration

Edit scripts/download_efficient.py:

# Date range
START_DATE = "2017-01-01"
END_DATE = None  # None = current date

# Output directory
OUTPUT_DIR = "data"

# Batch size
BATCH_SIZE = 20
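
To show how settings like these are typically consumed, here is a hedged sketch. It assumes END_DATE = None resolves to today's date and that BATCH_SIZE controls how many stocks are processed per chunk; the helpers `resolve_end_date` and `batches` are hypothetical and are not taken from download_efficient.py.

```python
from datetime import date
from typing import Iterator, Optional

START_DATE = "2017-01-01"
END_DATE = None  # None = current date
BATCH_SIZE = 20


def resolve_end_date(end_date: Optional[str], today: Optional[date] = None) -> str:
    """Return end_date unchanged, or today's date in ISO format when it is None."""
    if end_date is not None:
        return end_date
    return (today or date.today()).isoformat()


def batches(symbols: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive chunks of at most `size` symbols."""
    for start in range(0, len(symbols), size):
        yield symbols[start:start + size]
```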

Documentation

| Document | Description |
|---|---|
| PTRADE_PARQUET_FORMAT.md | Parquet data format specification |
| PTrade_API_mini_Reference.md | PTrade API reference |

Notes

Data Source Comparison

| Feature | BaoStock | Mootdx API | EastMoney | TDX Official Package | yfinance (US) |
|---|---|---|---|---|---|
| Market | A-shares | A-shares | A-shares | A-shares | US stocks |
| Speed | Slower | Fast | Fast | Fastest (bulk download) | Medium |
| Valuation Data | Yes (PE/PB/PS etc.) | No | No | No | Yes (computed) |
| Financial Data | Yes (per-stock query) | Yes (bulk ZIP, faster) | No | No | Yes (per-stock query) |
| Money Flow | No | No | Yes (exclusive) | No | No |
| Dragon Tiger Board | No | No | Yes (exclusive) | No | No |
| Margin Trading | No | No | Yes (exclusive) | No | No |
| History Start | 2015 | 2015 | 2015 | Full history | Full history |
| API Key | Not required | Not required | Not required | N/A | Not required |

Recommended: use the unified scripts/download.py command. It assigns market data and financials to Mootdx and valuation and status data to BaoStock, so each source handles what it does best.

Incremental Update Mechanism

  • Market Data: Checks for new trading days; skips in seconds when no new data
  • Financial Data: Incremental checks based on remote file hash; only downloads changed quarters
  • Index Constituents: Tracks downloaded months; only downloads new months
  • Interrupt Recovery: Financial data and its download progress are committed in the same transaction, so an interrupted run resumes from the last committed point
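
The hash check and the single-transaction commit from the list above can be sketched together. This example uses the built-in `sqlite3` module as a stand-in for DuckDB, and the table names, columns, and function `import_quarter` are assumptions made for illustration only.

```python
import sqlite3
from typing import Optional

# Hypothetical tables: one for data, one recording the hash last imported.
SCHEMA = """
CREATE TABLE financials (quarter TEXT NOT NULL, symbol TEXT NOT NULL, value REAL);
CREATE TABLE progress (quarter TEXT PRIMARY KEY, file_hash TEXT NOT NULL);
"""


def import_quarter(
    conn: sqlite3.Connection,
    quarter: str,
    remote_hash: str,
    rows: list[tuple[Optional[str], float]],
) -> bool:
    """Import one quarter unless its hash is unchanged. Returns True if imported."""
    stored = conn.execute(
        "SELECT file_hash FROM progress WHERE quarter = ?", (quarter,)
    ).fetchone()
    if stored and stored[0] == remote_hash:
        return False  # remote file unchanged, nothing to download
    with conn:  # one transaction: commit on success, roll back on any error
        conn.execute("DELETE FROM financials WHERE quarter = ?", (quarter,))
        conn.executemany(
            "INSERT INTO financials (quarter, symbol, value) VALUES (?, ?, ?)",
            [(quarter, symbol, value) for symbol, value in rows],
        )
        conn.execute(
            "INSERT OR REPLACE INTO progress (quarter, file_hash) VALUES (?, ?)",
            (quarter, remote_hash),
        )
    return True
```

Because the progress row is written in the same transaction as the data, a crash midway leaves neither behind, and the next run simply retries that quarter.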

Incremental Update Workflow

# 1. Incremental download (fetches only new data, automatically skips existing)
poetry run python scripts/download.py

# 2. Export to Parquet
poetry run python scripts/export_parquet.py              # CN → data/export/cn/
poetry run python scripts/export_parquet.py --market us  # US → data/export/us/

Step 1 automatically detects the latest date of existing data in DuckDB and only downloads the delta. When there are no new trading days, all stocks are skipped in seconds.
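
The delta detection in step 1 boils down to comparing the latest stored date with today. A simplified sketch follows; it works in calendar days, whereas the real check presumably consults the trading calendar, and `next_download_range` with its default start date is an assumption.

```python
from datetime import date, timedelta
from typing import Optional


def next_download_range(
    latest_stored: Optional[date],
    today: date,
    default_start: date = date(2017, 1, 1),
) -> Optional[tuple[date, date]]:
    """Return the (start, end) range still to download, or None if up to date."""
    if latest_stored is None:
        start = default_start  # nothing stored yet: full download
    else:
        start = latest_stored + timedelta(days=1)
    if start > today:
        return None  # already current, skip this stock
    return start, today
```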

Data Quality

  • Data is sourced from the third-party services listed above (BaoStock, Mootdx, EastMoney, TDX, yfinance)
  • For research and educational purposes only

Testing

# Unit tests (no network required)
poetry run pytest tests/ -v

# SmartRouter routing and fallback tests
poetry run pytest tests/router/ -v

# SmartRouter live integration test (requires network)
poetry run python scripts/test_smart_router_live.py

Version History

See CHANGELOG.md for the full version history.

Latest: v1.2.0 (2026-03-13) - Smart Data Source Router

💖 Sponsor

If this project helps you, consider sponsoring!

WeChat Pay Alipay

License

This project is licensed under AGPL-3.0. See the LICENSE file for details.


Status: Production Ready | Version: v1.2.0 | Last Updated: 2026-03-13

About

SimTradeData is a utility library supporting SimTradeDesk, SimTradeLab and simtradeML with reliable, high-quality simulated trading data for model training, backtesting, and performance evaluation.
