infra

BitCell Production Infrastructure

Production-grade infrastructure for multi-region BitCell deployment with monitoring, alerting, and chaos testing.

Overview

This infrastructure setup provides:

Multi-Region Deployment: 4+ geographic regions for high availability
Prometheus Monitoring: Comprehensive metrics collection
Grafana Dashboards: Visual monitoring and alerting
Alertmanager: Intelligent alert routing to Slack/PagerDuty
HAProxy Load Balancing: Automatic failover between regions
Chaos Engineering: Automated resilience testing
Incident Response Runbooks: Operational procedures

Quick Start

Prerequisites

Docker 24.0+ and Docker Compose 2.20+
OR Kubernetes 1.28+ cluster
8GB+ RAM per node
100GB+ SSD storage per node

Deploy with Docker Compose

# Clone repository
git clone https://github.com/Steake/BitCell.git
cd BitCell

# Build node image
docker build -f infra/docker/Dockerfile -t bitcell-node:latest .

# Start infrastructure
cd infra/docker
docker-compose up -d

# Verify deployment
docker-compose ps

Access Monitoring

Grafana: http://localhost:3000 (admin/)
Prometheus: http://localhost:9999
Alertmanager: http://localhost:9093
HAProxy Stats: http://localhost:8404

Architecture

Regional Deployment

┌─────────────────────────────────────────────────────────────┐
│                       Load Balancer                          │
│                         (HAProxy)                            │
└────────────┬────────────┬────────────┬─────────────────┬────┘
             │            │            │                 │
    ┌────────▼───────┐ ┌─▼────────┐ ┌─▼────────────┐ ┌─▼──────────┐
    │   US-East      │ │ US-West  │ │  EU-Central  │ │AP-Southeast│
    │   Region       │ │  Region  │ │   Region     │ │   Region   │
    │  (2 nodes)     │ │(2 nodes) │ │  (2 nodes)   │ │  (1 node)  │
    └────────────────┘ └──────────┘ └──────────────┘ └────────────┘
             │
    ┌────────▼────────────────────────────────────────────────┐
    │               Monitoring Stack                          │
    │  Prometheus → Alertmanager → Grafana                   │
    └─────────────────────────────────────────────────────────┘

Node Configuration

Each node exposes:

P2P Port: 9000-9010 (peer communication)
RPC Port: 8545-8555 (JSON-RPC API)
Metrics Port: 9090-9100 (Prometheus metrics)

Directory Structure

infra/
├── docker/
│   ├── docker-compose.yml       # Multi-region Docker setup
│   ├── Dockerfile               # Node container image
│   └── entrypoint.sh            # Container startup script
├── kubernetes/
│   └── deployment.yaml          # K8s StatefulSets for production
├── monitoring/
│   ├── prometheus.yml           # Metrics collection config
│   ├── alerts.yml               # Alert rules
│   ├── alertmanager.yml         # Alert routing config
│   ├── haproxy.cfg             # Load balancer config
│   └── grafana/
│       ├── provisioning/        # Auto-provisioning configs
│       └── dashboards/          # Pre-built dashboards
├── chaos/
│   └── chaos_test.py           # Chaos engineering test suite
└── runbooks/
    ├── incident-response.md    # Incident handling procedures
    ├── deployment-guide.md     # Deployment documentation
    └── oncall-guide.md         # On-call rotation guide

Monitoring

Metrics Collected

Chain Metrics:

bitcell_chain_height: Current blockchain height
bitcell_sync_progress: Sync progress percentage
bitcell_txs_processed_total: Total transactions processed
bitcell_pending_txs: Pending transaction count

Network Metrics:

bitcell_peer_count: Connected peers
bitcell_dht_peer_count: DHT peer count
bitcell_bytes_sent_total: Network traffic sent
bitcell_bytes_received_total: Network traffic received

Proof Metrics:

bitcell_proofs_generated_total: ZK proofs generated
bitcell_proofs_verified_total: ZK proofs verified
bitcell_proof_gen_time_ms: Proof generation time

EBSL Metrics:

bitcell_active_miners: Active eligible miners
bitcell_banned_miners: Banned miners

Alert Rules

Critical Alerts:

NodeDown: Node unresponsive >2 minutes
RegionDown: All nodes in region down
NoPeers: Node has 0 connected peers
NoActiveMiners: No miners available for block production
ChainNotProgressing: Block height not increasing

Warning Alerts:

LowPeerCount: <2 connected peers
HighProofGenerationTime: Proof gen >30 seconds
HighPendingTransactions: >1000 pending transactions
NodeOutOfSync: Sync progress <95%

Dashboards

Production Overview:

Node status by region
Chain height progression
Transaction throughput (TPS)
Network traffic
Active vs banned miners

Node Details:

Per-node performance metrics
Resource utilization
Peer connectivity
Proof generation times

Regional Health:

Regional availability
Cross-region latency
Failover status
Load distribution

Chaos Engineering

Running Chaos Tests

# Install dependencies
pip3 install requests

# Run all chaos tests
python3 infra/chaos/chaos_test.py

# Run specific scenario
python3 infra/chaos/chaos_test.py --scenario node_failure

Available Scenarios

Node Failure: Single node crash and recovery
Regional Failure: Entire region goes down
Network Partition: Split-brain scenarios
High Latency: Network delay injection
Resource Exhaustion: CPU/memory constraints

Acceptance Criteria

✅ Network survives regional failures (>50% nodes up)
✅ Automatic failover to healthy regions (<30s)
✅ Data consistency maintained during partitions
✅ Performance degradation graceful under load
✅ Recovery automatic without intervention

Deployment

Docker Compose (Development/Testing)

# Start all services
docker-compose -f infra/docker/docker-compose.yml up -d

# Scale a region
docker-compose -f infra/docker/docker-compose.yml up -d --scale node-us-east-2=3

# Stop all services
docker-compose -f infra/docker/docker-compose.yml down

# View logs
docker-compose -f infra/docker/docker-compose.yml logs -f node-us-east-1

Kubernetes (Production)

# Create namespace
kubectl create namespace bitcell-production

# Deploy infrastructure
kubectl apply -f infra/kubernetes/deployment.yaml

# Scale nodes
kubectl scale statefulset bitcell-node-us-east --replicas=3 -n bitcell-production

# View status
kubectl get pods -n bitcell-production
kubectl get svc -n bitcell-production

# View logs
kubectl logs -f statefulset/bitcell-node-us-east -n bitcell-production

Operations

Health Checks

# Check all nodes
for port in 9090 9091 9092 9093 9094 9095 9096; do
    curl -s http://localhost:$port/health | head -1
done

# Check Prometheus targets
curl http://localhost:9999/api/v1/targets | jq '.data.activeTargets[] | {job, health}'

# Check Alertmanager
curl http://localhost:9093/api/v1/alerts | jq '.data[] | {labels, state}'

Common Operations

Restart a node:

docker-compose restart node-us-east-1

Update node configuration:

# Edit docker-compose.yml
vim infra/docker/docker-compose.yml

# Apply changes
docker-compose up -d node-us-east-1

Silence an alert:

curl -X POST http://localhost:9093/api/v1/silences \
    -H "Content-Type: application/json" \
    -d '{
        "matchers": [{"name": "alertname", "value": "NodeDown", "isRegex": false}],
        "startsAt": "2024-12-09T00:00:00Z",
        "endsAt": "2024-12-09T23:59:59Z",
        "createdBy": "oncall",
        "comment": "Planned maintenance"
    }'

Performance Tuning

Optimize for latency:

# docker-compose.yml
environment:
  - NETWORK_LATENCY_OPTIMIZATION=true
  - PEER_CONNECTION_TIMEOUT=5s
  - SYNC_BATCH_SIZE=100

Optimize for throughput:

environment:
  - MAX_CONCURRENT_PROOFS=4
  - MEMPOOL_SIZE=10000
  - BLOCK_GAS_LIMIT=30000000

Troubleshooting

Nodes Not Connecting

Symptoms: Low peer count, isolated nodes

Fix:

# Check network
docker network inspect bitcell_bitcell-net

# Verify bootstrap nodes are reachable
docker exec bitcell-node-us-east-1 ping -c 3 node-us-west-1

# Restart node
docker-compose restart node-us-east-1

High Latency

Symptoms: Slow RPC responses, sync delays

Diagnostic:

# Measure latency between regions
for node in node-us-east-1 node-us-west-1 node-eu-central-1; do
    docker exec bitcell-node-us-east-1 ping -c 10 $node | tail -1
done

Fix:

Optimize network routes
Increase node resources
Enable caching

Prometheus Not Scraping

Symptoms: Missing metrics in Grafana

Fix:

# Check Prometheus targets
curl http://localhost:9999/api/v1/targets

# Verify metrics endpoint
curl http://localhost:9090/metrics

# Restart Prometheus
docker-compose restart prometheus

Database Full

Symptoms: Disk space warnings, slow queries

Fix:

# Check disk usage
docker exec bitcell-node-us-east-1 df -h /data/bitcell

# Prune old data (when available in RC2)
docker exec bitcell-node-us-east-1 bitcell-node prune --keep-recent 10000

# Or increase disk size
# Resize volume in cloud provider

Security

Network Security

Firewall Rules:

# Allow P2P
iptables -A INPUT -p tcp --dport 9000:9010 -j ACCEPT

# Allow RPC (restricted to known IPs)
iptables -A INPUT -p tcp --dport 8545 -s 10.0.0.0/8 -j ACCEPT

# Allow metrics (monitoring subnet only)
iptables -A INPUT -p tcp --dport 9090:9100 -s 172.20.0.0/16 -j ACCEPT

Secrets Management

Do NOT store in git:

API keys
Database passwords
SSL certificates
Signing keys

Use environment variables or secrets manager:

# Docker secrets
docker secret create api_key api_key.txt
docker service update --secret-add api_key bitcell-node

# Kubernetes secrets
kubectl create secret generic bitcell-secrets \
    --from-literal=api-key=$API_KEY \
    -n bitcell-production

SSL/TLS

Enable HTTPS for RPC:

# haproxy.cfg
frontend rpc_frontend
    bind *:443 ssl crt /etc/ssl/certs/bitcell.pem
    redirect scheme https if !{ ssl_fc }

Performance Benchmarks

Target Metrics (RC3 Requirements)

Metric	Target	Current
Cross-region latency	<200ms	~150ms
Node failure recovery	<60s	~45s
Regional failover	<120s	~90s
Transaction throughput	100 TPS	75 TPS
Proof generation	<10s	~25s
Network uptime	99.9%	99.5%

Load Testing

# Run load test
./scripts/load-test.sh --duration 60 --tps 100

# Monitor during test
watch -n 1 'curl -s http://localhost:9090/metrics | grep bitcell_txs'

Cost Estimation

Cloud Provider Costs (Monthly)

AWS:

7× t3.xlarge instances: ~$1,200
700GB EBS storage: ~$70
Data transfer: ~$200
Total: ~$1,500/month

GCP:

7× n2-standard-4 instances: ~$1,100
700GB SSD storage: ~$120
Network egress: ~$150
Total: ~$1,400/month

Optimization:

Use spot instances for 40% savings
Enable autoscaling
Regional storage for backups
CDN for static content

Support

Documentation

Deployment Guide: runbooks/deployment-guide.md
Incident Response: runbooks/incident-response.md
On-Call Guide: runbooks/oncall-guide.md

Communication

Issues: https://github.com/Steake/BitCell/issues
Discussions: https://github.com/Steake/BitCell/discussions
Discord: https://discord.gg/bitcell
Email: [email protected]

Emergency Contacts

On-Call: See PagerDuty schedule
Platform Team: #platform-team on Slack
Security Team: [email protected]

Contributing

Improvements to infrastructure are welcome:

Test changes locally with Docker Compose
Run chaos tests to verify resilience
Update documentation
Submit PR with description

License

MIT OR Apache-2.0

Last Updated: 2024-12-09
Maintained By: BitCell Platform Team
Status: Production Ready (RC3)

Name		Name	Last commit message	Last commit date
parent directory ..
chaos		chaos
docker		docker
kubernetes		kubernetes
monitoring		monitoring
runbooks		runbooks
.gitignore		.gitignore
IMPLEMENTATION_SUMMARY.md		IMPLEMENTATION_SUMMARY.md
README.md		README.md
SECURITY.md		SECURITY.md
TESTING_RESULTS.md		TESTING_RESULTS.md

FilesExpand file tree

infra

Directory actions

More options

Directory actions

More options

Latest commit

History

infra

Folders and files

parent directory

README.md

BitCell Production Infrastructure

Overview

Quick Start

Prerequisites

Deploy with Docker Compose

Access Monitoring

Architecture

Regional Deployment

Node Configuration

Directory Structure

Monitoring

Metrics Collected

Alert Rules

Dashboards

Chaos Engineering

Running Chaos Tests

Available Scenarios

Acceptance Criteria

Deployment

Docker Compose (Development/Testing)

Kubernetes (Production)

Operations

Health Checks

Common Operations

Performance Tuning

Troubleshooting

Nodes Not Connecting

High Latency

Prometheus Not Scraping

Database Full

Security

Network Security

Secrets Management

SSL/TLS

Performance Benchmarks

Target Metrics (RC3 Requirements)

Load Testing

Cost Estimation

Cloud Provider Costs (Monthly)

Support

Documentation

Communication

Emergency Contacts

Contributing

License