- Docker compose containing:
- Postgres
- pgAdmin (a GUI toolset for Postgres)
- Automatically run initial SQL migrations to load sample data
- A /data/ dir with sample data CSVs
- A /migrations/ dir with scripts to create schemas (src, dim, fact, obt) and load CSVs into src tables
- A /models/ dir with sample data models (dim, fact, obt)
- Slides from btc++ presentation
- Additional resources and ideas for further development
- Create star schemas:
  - Dimensions: date, pool, block
  - Facts: block, network_stats_1d, pool_stats_1d, price_1d
- Create denormalized datasets:
  - OBTs: block, network_stats_1d, pool_stats_1d
- From raw Bitcoin block data, derive metrics including:
  - estimated_hashrate (network-level)
  - hashvalue
  - hashprice (using daily BTC:USD price)
  - mining_luck (using pool reported hashrates)
- To keep file sizes small, all source data is limited to calendar year 2024, plus one month prior (Dec 2023) and one month after (Jan 2025), in case any look-back or look-forward is needed.
- Bitcoin block data was sourced from a Bitcoin Core node (via Mempool Space's API): https://mempool.space/docs/api/rest#get-blocks-bulk. The attribute list returned from this endpoint has been reduced to keep the sample data files small. NOTE: The bulk blocks endpoint is not enabled on the public Mempool Space site, but it can be enabled on a self-hosted instance through the API config.
- Mining pool data was sourced from Mempool Space's open source repository of pools: https://github.com/mempool/mining-pools/blob/master/pools-v2.json
- Bitcoin spot price data was sourced from Coinmetrics community (free tier) API: https://docs.coinmetrics.io/api/v4/.
- Pool hashrate data was obtained from several mining pools who provide this data freely, without needing accounts or API keys.
- Initial build: `docker compose up -d --build`
- If already built: `docker compose up -d`
- With deleting data volumes: `docker compose down -v`
- Without deleting data volumes: `docker compose down`
- Open your browser and go to: http://localhost:5050/browser/
- Sign in and create server connection using the credentials in the .env file
- Add the CSV file to the `./data/` folder
- Create a new `.sql` file inside `./migrations/` with a `copy` command like: `copy <table>(<columns>) from '/data/<filename>.csv' delimiter ',' csv header;`
- Mount this SQL file in `docker-compose.yml` under the `postgres` service: `./migrations/<sequence>-<name>.sql:/docker-entrypoint-initdb.d/<sequence>-<name>.sql`
For the purpose of the workshop, we will be creating models by running SQL directly in pgAdmin. In a production environment, it is highly recommended to materialize models through migration files with idempotency, or to use a data modeling framework such as dbt: https://www.getdbt.com/.
Example SQL is given in the /models/ directory as a starting point. In the workshop we will materialize DIM and FACT models as tables, and OBT models as views. In a production environment, it is also recommended to add unique key constraints, column indexes, and other common database optimizations.
NOTE: The sample data provided was already pre-cleaned and validated. One notable callout with the Bitcoin block data is to build in checks for stale blocks (using blockheight, block_hash and prev_block_hash), left out here for simplicity.
The date dimension model is part of the initial migrations and will be built automatically. This model contains one row per calendar date, with many helpful columns that can be used for analysis (e.g. day_of_week, month_start_date). This is a unique model where the primary surrogate key (date_id) is a stringified date, formatted as 'yyyymmdd'.
Primary key: `date_id`
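Although this model ships with the initial migrations, a minimal sketch of how such a table can be generated in Postgres is shown below. The table and column names beyond `date_id` are assumptions for illustration; the actual migration in /migrations/ may differ.

```sql
-- Illustrative date dimension: one row per calendar date, keyed by a
-- stringified 'yyyymmdd' surrogate. Date range matches the sample data.
create table if not exists dim.date as
select
    to_char(d, 'YYYYMMDD')          as date_id,          -- surrogate key
    d::date                         as date,
    extract(isodow from d)::int     as day_of_week,      -- 1 = Monday
    date_trunc('month', d)::date    as month_start_date
from generate_series('2023-12-01'::date, '2025-01-31'::date,
                     interval '1 day') as d;
```

`generate_series` makes this model cheap to rebuild and extend; adding a new helper column is a one-line change.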
Create a pool dimension model with a formatted display name and url, as well as categorical flags which are helpful for analysis, such as is_antpool_friend.
Primary key: `pool_id`
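A possible shape for this model, assuming the source pools data has landed in a `src.pools` table with `slug` and `link` columns (these names, and the flag membership list, are illustrative, not the exact workshop schema):

```sql
-- Illustrative pool dimension with a surrogate key, display formatting,
-- and a categorical flag. Flag membership below is a placeholder.
create table dim.pool as
select
    row_number() over (order by slug)   as pool_id,       -- surrogate key
    slug,
    initcap(replace(slug, '-', ' '))    as display_name,
    link                                as url,
    slug in ('exampleslug')             as is_antpool_friend  -- placeholder list
from src.pools;
```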
Create a Bitcoin block dimension model to hold any categorical data pertaining to each block, such as block_hash, is_subsidy_halving, is_difficulty_adjustment.
block_id
Create a Bitcoin block fact model with the blockheight, timestamp, and numerical data such as block_size, reward_subsidy. NOTE: if granularity below the block level is never needed (i.e. no transaction-level data), it is possible to include all attributes from the Bitcoin block dimension model directly in this fact. This practice is commonly referred to as 'degenerate dimension' attributes.
Primary key: `block_id`
Foreign keys: `dim.block.block_id`, `dim.date.date_id`, `dim.pool.pool_id`
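A sketch of this fact, assuming a `src.blocks` table with `height`, `timestamp` (epoch seconds), `size`, and `pool_slug` columns; all of these source column names are assumptions:

```sql
-- Illustrative block fact: one row per block, keyed by height, with
-- foreign keys into the date and pool dimensions.
create table fact.block as
select
    b.height                                        as block_id,    -- natural key reused as surrogate
    to_char(to_timestamp(b.timestamp), 'YYYYMMDD')  as date_id,     -- FK to dim.date
    p.pool_id,                                                      -- FK to dim.pool
    b.height                                        as blockheight,
    b.size                                          as block_size,
    50.0 / power(2, b.height / 210000)              as reward_subsidy  -- halves every 210,000 blocks
from src.blocks b
left join dim.pool p on p.slug = b.pool_slug;
```

The `reward_subsidy` expression relies on Postgres integer division flooring `height / 210000`, which gives the halving epoch directly.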
Create a daily network stats fact model including block_count, difficulty_weighted_avg (a blended difficulty to account for adjustments), and est_hashrate (the estimated total network hashrate).
Primary key: `date_id`
Foreign key: `dim.date.date_id`
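The network hashrate can be estimated from difficulty alone, since each block at difficulty D represents roughly D × 2^32 expected hashes. A sketch, assuming the block fact carries a `difficulty` column (an assumption here):

```sql
-- Illustrative daily network stats: blended difficulty and estimated
-- hashrate in hashes/second, derived from block-level data.
create table fact.network_stats_1d as
select
    date_id,
    count(*)                        as block_count,
    avg(difficulty)                 as difficulty_weighted_avg,   -- block-weighted blend
    count(*) * avg(difficulty)
        * power(2, 32) / 86400.0    as est_hashrate               -- expected hashes / seconds in a day
from fact.block
group by date_id;
```

Weighting by block (rather than by time) keeps the blend simple and handles days that straddle a difficulty adjustment.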
Create a daily pool stats fact model including block_count, reported_hashrate (for those pools who provide it).
Primary key: `date_id`, `pool_id`
Foreign keys: `dim.date.date_id`, `dim.pool.pool_id`
Create a daily Bitcoin price fact model including price_open, price_close, and deriving price_change, price_spread, etc. NOTE: For simplicity, this only contains BTC-USD price data. If other coins or tickers are involved, a coin dimension table should also be built.
Primary key: `date_id`
Foreign key: `dim.date.date_id`
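A sketch of the derivations, assuming the source price data landed in a `src.prices` table with `price_time`, `price_open`, `price_close`, `price_high`, and `price_low` columns (these source names are assumptions):

```sql
-- Illustrative daily price fact: passes through OHLC values and derives
-- simple change and spread metrics.
create table fact.price_1d as
select
    to_char(price_time::date, 'YYYYMMDD')   as date_id,
    price_open,
    price_close,
    price_close - price_open                as price_change,
    price_high - price_low                  as price_spread
from src.prices;
```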
Create a Bitcoin block OBT model which combines attributes from the Bitcoin block fact model, and the Bitcoin block, date, and pool dimension tables.
Primary key: `block_id`
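Per the workshop convention, OBT models are materialized as views. A sketch of the join pattern, with column choices assumed for illustration:

```sql
-- Illustrative block OBT: the block fact widened with attributes from
-- the block, date, and pool dimensions.
create view obt.block as
select
    f.block_id,
    f.blockheight,
    d.date,
    d.day_of_week,
    p.display_name    as pool_name,
    db.block_hash,
    f.block_size,
    f.reward_subsidy
from fact.block f
join dim.date  d  on d.date_id   = f.date_id
join dim.pool  p  on p.pool_id   = f.pool_id
join dim.block db on db.block_id = f.block_id;
```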
Create a daily network stats OBT model which combines attributes from the daily network stats and price fact models, and the date dimension model. By joining the network stats and price fact models, USD amounts can be derived (including hashprice).
Primary key: `date_id`
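One common definition of hashprice is USD earned per TH/s per day: daily BTC issuance valued in USD, divided by network hashrate expressed in TH/s. A sketch under that definition, using a fixed post-halving 3.125 BTC subsidy and omitting fees for simplicity (both simplifications are assumptions):

```sql
-- Illustrative network OBT: joins network stats to daily price to
-- derive USD metrics, including a simplified hashprice.
create view obt.network_stats_1d as
select
    n.date_id,
    n.block_count,
    n.est_hashrate,
    p.price_close,
    (n.block_count * 3.125 * p.price_close)   -- USD issued that day (fees omitted)
        / (n.est_hashrate / 1e12)             -- hashrate in TH/s
        as hashprice
from fact.network_stats_1d n
join fact.price_1d p using (date_id);
```

A fuller version would take the subsidy from the block fact (so halvings are handled) and add transaction fees to the revenue term.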
Create a daily pool stats OBT model which combines attributes from the pool stats, network stats, and price fact models, and the date and pool dimension models. By joining pool and network stats models, it is possible to derive percent of network totals, expected block count, and mining luck (for those who provide reported hashrate).
Primary key: `date_id`, `pool_id`
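Mining luck falls out of this join directly: a pool's expected block count is the network's block count scaled by the pool's share of hashrate, and luck is actual over expected. A sketch, with fact column names assumed:

```sql
-- Illustrative pool OBT: derives share-of-network, expected blocks,
-- and mining luck (only meaningful where reported_hashrate is present).
create view obt.pool_stats_1d as
select
    ps.date_id,
    ps.pool_id,
    ps.block_count,
    ps.block_count::numeric / ns.block_count   as pct_of_network_blocks,
    ns.block_count * ps.reported_hashrate
        / ns.est_hashrate                      as expected_block_count,
    ps.block_count
        / nullif(ns.block_count * ps.reported_hashrate
                     / ns.est_hashrate, 0)     as mining_luck   -- actual / expected
from fact.pool_stats_1d ps
join fact.network_stats_1d ns using (date_id);
```

Luck above 1.0 means the pool found more blocks than its reported hashrate would predict; the `nullif` guards against division by zero for pools without reported hashrate.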
- An OBT model which contains one row per difficulty or subsidy epoch, with aggregate metrics pertaining to each epoch.
- An OBT model which contains one row per mining pool, which aggregates all time metrics pertaining to each pool.
- An OBT model which contains more sophisticated statistical modeling around mining 'luck' (aka variability).
- Bitcoin transactions or address balances, for financial accounting or on-chain forensics (this is when the Bitcoin block dimension table is required).
- Stratum job templates or Bitcoin node logs, to analyze mining pool centralization or block relay efficiency.
- Power consumption or price history, to analyze hashrate correlation or mining hardware efficiency.
- What is Data Modeling?: https://aws.amazon.com/what-is/data-modeling/
- Types of Data Models: https://en.wikipedia.org/wiki/Database_model
- About Columnar Databases: https://databasetown.com/columnar-databases/
- dbt: https://docs.getdbt.com/
- What is dbt?: https://docs.getdbt.com/docs/introduction
- dbt Resources: https://github.com/Hiflylabs/awesome-dbt
NOTE: These tools are listed for reference only; no hands-on experience with or recommendation of them is implied.
- SQLMesh: https://sqlmesh.com/
- Coalesce: https://coalesce.io/
- SqlDBM: https://sqldbm.com/Home/
- Websocket (monitor for blocks): https://mempool.space/docs/api/websocket#live-data
- Bulk Blocks (not available on public server): https://mempool.space/docs/api/rest#get-blocks-bulk