A scalable Big Data system built on Lambda Architecture for detecting arbitrage opportunities between cryptocurrency spot markets and crypto-backed ETFs.
The system analyzes real-time data streams to identify moments when the market price of a crypto ETF (e.g. IBIT) significantly diverges from its underlying asset value (e.g. Bitcoin), accounting for the units-per-share exchange ratio. It also collects historical data to support in-depth analysis of market liquidity and sentiment.
Binance WebSocket API — Provides a real-time order book stream (All Market Tickers) for the following pairs:
- BTC/USDT, ETH/USDT, SOL/USDT
Each record includes bid/ask prices, quantities, and a timestamp.
Alpaca Markets WebSocket API — Provides real-time quotes for crypto-backed ETFs tracking Bitcoin, Ethereum, and Solana. Monitored tickers include IBIT, BITB, BTC, GBTC, ETHA, ETHW, ETHE, BSOL, and GSOL.
Web Scraping (units_per_share) — Since no public API exposes this ratio, it is scraped cyclically (every 6–24 hours) from official ETF issuer websites: iShares (BlackRock), Bitwise, and Grayscale.
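Since issuer pages differ in markup, the scraper's extraction step can only be sketched. The fragment below parses an illustrative HTML snippet with the standard library; the `key-fund-facts` markup and the "BTC per Share" label are assumptions, not the actual page structure of any issuer.

```python
import re

# Illustrative HTML fragment; real issuer pages (iShares, Bitwise, Grayscale)
# differ in markup, and the "BTC per Share" label here is an assumption.
SAMPLE_HTML = """
<div class="key-fund-facts">
  <span class="label">BTC per Share</span>
  <span class="value">0.00056841</span>
</div>
"""

def extract_units_per_share(html: str) -> float:
    """Pull the units-per-share ratio out of a fund-facts fragment."""
    match = re.search(
        r'per Share</span>\s*<span class="value">([\d.]+)</span>', html
    )
    if match is None:
        raise ValueError("units_per_share not found on page")
    return float(match.group(1))

ratio = extract_units_per_share(SAMPLE_HTML)
print(ratio)  # 0.00056841
```

In the real flow this value is refreshed every 6–24 hours and written out as reference data for stream enrichment.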
Historical ETF Data (Alpaca) — OHLCV candlestick data at 1-minute resolution for all tracked ETF tickers, covering 2024–2025. Stored in Parquet format (Snappy compression) on HDFS. Initial load is performed ad-hoc; incremental updates run on the 1st of each month at 04:00.
Historical Crypto Data (Binance) — K-line (candlestick) data at 1-minute resolution for BTC/USDT, ETH/USDT, and SOL/USDT. Includes taker buy volumes as a demand-pressure indicator. Same storage and update strategy as the ETF batch layer.
The system is based on Lambda Architecture, combining a Speed Layer (low-latency stream processing) with a Batch Layer (complete historical processing) and a Serving Layer for analytics.
Data Sources → Apache NiFi → [ Kafka → Spark Streaming → HBase ]  (Speed Layer)
                           → [ HDFS ]                             (Batch Layer)
                                 ↓
                     Apache Hive (Serving Layer)
Apache NiFi (Ingestion Layer) handles all data ingestion: normalization across sources, reference data enrichment, routing to Kafka (streaming) and HDFS (batch), and transformation/validation. It manages three flow paths: real-time streams, reference data updates (web scraping), and historical batch loading.
Apache Kafka (Speed Layer — Buffering) decouples ingestion from analytics. Two topics are used: binance-quotes and etf-quotes, each with 3 partitions (keyed by instrument symbol for ordering guarantees), Snappy compression, 24-hour data retention, and a 10 GB per-partition size limit.
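Keying messages by instrument symbol turns Kafka's per-partition ordering guarantee into a per-symbol one. The sketch below illustrates the idea in plain Python; Kafka's default partitioner actually uses murmur2 on the key bytes, for which a stable CRC32 stands in here.

```python
import zlib

NUM_PARTITIONS = 3  # matches the 3-partition topic layout

def partition_for(symbol: str) -> int:
    # Stand-in for Kafka's murmur2-based default partitioner.
    return zlib.crc32(symbol.encode()) % NUM_PARTITIONS

quotes = ["BTCUSDT", "ETHUSDT", "BTCUSDT", "SOLUSDT", "BTCUSDT"]
assignments = [partition_for(s) for s in quotes]

# Every quote for a given symbol lands on the same partition, so consumers
# always see that symbol's quotes in publication order.
assert len({p for s, p in zip(quotes, assignments) if s == "BTCUSDT"}) == 1
print(list(zip(quotes, assignments)))
```

Ordering only holds within a symbol; quotes for different symbols on different partitions may interleave arbitrarily, which is acceptable because the downstream join is windowed per instrument.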
Apache Spark Structured Streaming (Speed Layer — Processing) is the core analytical engine (PySpark). It reads both Kafka topics in parallel, normalizes fields, and synchronizes the two streams using 500 ms tumbling time windows with a 10-second watermark to bound late arrivals. For each matched ETF–Crypto pair it computes:
- Mid prices for both markets
- Implied price:
  P_implied = P_mid_ETF / units_per_share
- Tracking delta:
  Δ% = (P_implied / P_mid_Crypto) − 1.0
Results are routed to three output Kafka topics: active-arbitrage-signals (signals exceeding 0.1% deviation), tracking_quality_hours (hourly aggregates), and etf_mispricing_ranking (raw metrics for all instruments).
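The window matching and the two metrics can be traced in plain Python. The `units_per_share` value and both quotes below are illustrative numbers, not actual fund or market data.

```python
UNITS_PER_SHARE = 0.00056841   # BTC backing one hypothetical ETF share
SIGNAL_THRESHOLD = 0.001       # 0.1% deviation triggers a signal

def window_start(ts_ms: int, width_ms: int = 500) -> int:
    """Floor a millisecond timestamp to its tumbling-window start."""
    return ts_ms - ts_ms % width_ms

def mid(bid: float, ask: float) -> float:
    return (bid + ask) / 2.0

# Two quotes landing in the same 500 ms window get matched.
etf_quote    = {"ts": 1704067200123, "bid": 59.98,    "ask": 60.02}
crypto_quote = {"ts": 1704067200456, "bid": 105200.0, "ask": 105300.0}
assert window_start(etf_quote["ts"]) == window_start(crypto_quote["ts"])

p_mid_etf    = mid(etf_quote["bid"], etf_quote["ask"])        # 60.0
p_mid_crypto = mid(crypto_quote["bid"], crypto_quote["ask"])  # 105250.0
p_implied    = p_mid_etf / UNITS_PER_SHARE
delta        = p_implied / p_mid_crypto - 1.0

print(f"implied={p_implied:.2f} delta={delta:+.4%} "
      f"signal={abs(delta) > SIGNAL_THRESHOLD}")
```

With these numbers the implied price sits roughly 0.3% above the crypto mid, so the record would be routed to active-arbitrage-signals as well as to the raw-metrics topic.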
Apache HBase (Speed Layer — Storage) stores the current operational state in three tables: realtime_prices (live ETF vs. crypto price pairs), arbitrage_signals (a persistent log of detected opportunities), and tracking_metrics_hourly (hourly aggregated statistics).
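For tables like realtime_prices, latest-quote lookups benefit from a row-key design that sorts newest rows first. The scheme below (`symbol#<Long.MAX − timestamp>`) is a common HBase pattern shown as a hypothetical sketch; the actual table schema is not specified in this document.

```python
# Hypothetical row-key scheme: prefix-scanning a symbol returns the
# newest quote first, because timestamps are stored reversed.
LONG_MAX = 2**63 - 1

def row_key(symbol: str, ts_ms: int) -> bytes:
    reversed_ts = LONG_MAX - ts_ms
    # Zero-padding keeps lexicographic byte order equal to numeric order.
    return f"{symbol}#{reversed_ts:020d}".encode()

older = row_key("IBIT", 1704067200000)
newer = row_key("IBIT", 1704067200500)

# HBase sorts rows lexicographically, so the newer quote comes first.
assert newer < older
print(newer.decode())
```

This avoids full-table scans when the Speed Layer only needs the most recent ETF/crypto price pair per instrument.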
HDFS (Batch Layer) stores all historical data in partitioned Parquet files, organized by month.
Apache Hive (Serving Layer) provides SQL analytics over the HDFS data via external tables partitioned by date. Analytical views include daily OHLC summaries, hourly market activity with sentiment metrics (taker buy %), ETF liquidity indicators, and intraday seasonality analysis. Queries are executed via Apache Tez for low-latency performance.
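The logic behind a daily OHLC rollup from 1-minute bars can be shown in plain Python. This mirrors what such a view computes conceptually; the actual daily_summary Hive view definition may differ, and the bars are illustrative.

```python
from collections import defaultdict

minute_bars = [
    # (symbol, date, minute_of_day, open, high, low, close, volume)
    ("BTCUSDT", "2024-08-01", 0, 64200.0, 64250.0, 64180.0, 64240.0, 10.2),
    ("BTCUSDT", "2024-08-01", 1, 64240.0, 64310.0, 64230.0, 64300.0,  8.7),
]

daily = defaultdict(lambda: {"open": None, "high": float("-inf"),
                             "low": float("inf"), "close": None,
                             "volume": 0.0})
for sym, day, minute, o, h, l, c, v in sorted(minute_bars,
                                              key=lambda b: b[2]):
    row = daily[(sym, day)]
    if row["open"] is None:
        row["open"] = o                 # first minute's open
    row["high"] = max(row["high"], h)
    row["low"]  = min(row["low"], l)
    row["close"] = c                    # last minute's close wins
    row["volume"] += v

print(daily[("BTCUSDT", "2024-08-01")])
```

In Hive the same rollup is a GROUP BY over the date partition, which is why partitioning the external tables by date keeps these queries fast under Tez.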
The system detected short-lived deviations of 0.1%–0.2% between ETF market prices and their net asset value (NAV), consistent with the premium/discount to NAV typically observed for highly liquid ETF instruments.
These discrepancies are actively corrected by Authorized Participants (APs) — investment banks and specialized trading firms such as Jane Street and JPMorgan — who exploit the spread through ETF creation and redemption mechanisms, thereby restoring market efficiency.
For retail investors, deviations at this scale are insufficient to generate profit after accounting for transaction costs, bid-ask spreads, and execution delays. For market makers and APs, however, given the average daily trading volume of funds like IBIT (~$2–3 billion, reaching up to $4–5 billion during high-volatility periods), even fractional price differences translate to significant arbitrage profits.
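The scale argument can be made concrete with back-of-the-envelope arithmetic. All numbers below are hypothetical round figures chosen for illustration, not measurements from the system.

```python
# Why a 0.15% deviation is dust for retail but meaningful at AP scale
# (all figures hypothetical).
deviation = 0.0015           # 0.15% premium to NAV
retail_notional = 10_000     # $10k retail trade
ap_notional = 50_000_000     # $50M create/redeem basket

retail_gross = retail_notional * deviation          # ~ $15 edge
retail_costs = retail_notional * (0.001 + 0.0005)   # spread + fees, ~ $15
ap_gross = ap_notional * deviation                  # ~ $75,000 edge

print(f"retail edge after costs: ${retail_gross - retail_costs:,.2f}")
print(f"AP gross edge:           ${ap_gross:,.2f}")
```

Even modest per-trade costs fully consume the retail edge, while an Authorized Participant operating at basket scale clears a meaningful gross profit before its much lower marginal costs.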
Five functional tests were conducted to verify the system end-to-end:
- Binance historical load — Verified that 24 monthly Parquet files (2024–2025) for BTC/USDT, ETH/USDT, and SOL/USDT were correctly written to HDFS.
- Alpaca historical load — Verified that 24 monthly Parquet files for all 9 ETF tickers were correctly written to HDFS.
- Web scraping (units_per_share) — Verified that 9 JSON files with exchange parity data were correctly scraped for the last trading day across all ETF issuers.
- Hive queries — Binance data — Verified the daily_summary view for August 2024, confirming 93 rows (31 per trading pair: BTC, ETH, SOL).
- Hive queries — Alpaca data — Verified the avg_hourly_stats view for October 2025, confirming correct aggregation sorted by average trade count.