Skip to content

Latest commit

 

History

History

README.md

BitCell Production Infrastructure

Production-grade infrastructure for multi-region BitCell deployment with monitoring, alerting, and chaos testing.

Overview

This infrastructure setup provides:

  • Multi-Region Deployment: 4+ geographic regions for high availability
  • Prometheus Monitoring: Comprehensive metrics collection
  • Grafana Dashboards: Visual monitoring and alerting
  • Alertmanager: Intelligent alert routing to Slack/PagerDuty
  • HAProxy Load Balancing: Automatic failover between regions
  • Chaos Engineering: Automated resilience testing
  • Incident Response Runbooks: Operational procedures

Quick Start

Prerequisites

  • Docker 24.0+ and Docker Compose 2.20+
  • OR Kubernetes 1.28+ cluster
  • 8GB+ RAM per node
  • 100GB+ SSD storage per node

Deploy with Docker Compose

# Clone repository
git clone https://github.com/Steake/BitCell.git
cd BitCell

# Build node image
docker build -f infra/docker/Dockerfile -t bitcell-node:latest .

# Start infrastructure
cd infra/docker
docker-compose up -d

# Verify deployment
docker-compose ps

Access Monitoring

Architecture

Regional Deployment

┌─────────────────────────────────────────────────────────────┐
│                       Load Balancer                          │
│                         (HAProxy)                            │
└────────────┬────────────┬────────────┬─────────────────┬────┘
             │            │            │                 │
    ┌────────▼───────┐ ┌─▼────────┐ ┌─▼────────────┐ ┌─▼──────────┐
    │   US-East      │ │ US-West  │ │  EU-Central  │ │AP-Southeast│
    │   Region       │ │  Region  │ │   Region     │ │   Region   │
    │  (2 nodes)     │ │(2 nodes) │ │  (2 nodes)   │ │  (1 node)  │
    └────────────────┘ └──────────┘ └──────────────┘ └────────────┘
             │
    ┌────────▼────────────────────────────────────────────────┐
    │               Monitoring Stack                          │
    │  Prometheus → Alertmanager → Grafana                   │
    └─────────────────────────────────────────────────────────┘

Node Configuration

Each node exposes:

  • P2P Port: 9000-9010 (peer communication)
  • RPC Port: 8545-8555 (JSON-RPC API)
  • Metrics Port: 9090-9100 (Prometheus metrics)

Directory Structure

infra/
├── docker/
│   ├── docker-compose.yml       # Multi-region Docker setup
│   ├── Dockerfile               # Node container image
│   └── entrypoint.sh            # Container startup script
├── kubernetes/
│   └── deployment.yaml          # K8s StatefulSets for production
├── monitoring/
│   ├── prometheus.yml           # Metrics collection config
│   ├── alerts.yml               # Alert rules
│   ├── alertmanager.yml         # Alert routing config
│   ├── haproxy.cfg             # Load balancer config
│   └── grafana/
│       ├── provisioning/        # Auto-provisioning configs
│       └── dashboards/          # Pre-built dashboards
├── chaos/
│   └── chaos_test.py           # Chaos engineering test suite
└── runbooks/
    ├── incident-response.md    # Incident handling procedures
    ├── deployment-guide.md     # Deployment documentation
    └── oncall-guide.md         # On-call rotation guide

Monitoring

Metrics Collected

Chain Metrics:

  • bitcell_chain_height: Current blockchain height
  • bitcell_sync_progress: Sync progress percentage
  • bitcell_txs_processed_total: Total transactions processed
  • bitcell_pending_txs: Pending transaction count

Network Metrics:

  • bitcell_peer_count: Connected peers
  • bitcell_dht_peer_count: DHT peer count
  • bitcell_bytes_sent_total: Network traffic sent
  • bitcell_bytes_received_total: Network traffic received

Proof Metrics:

  • bitcell_proofs_generated_total: ZK proofs generated
  • bitcell_proofs_verified_total: ZK proofs verified
  • bitcell_proof_gen_time_ms: Proof generation time

EBSL Metrics:

  • bitcell_active_miners: Active eligible miners
  • bitcell_banned_miners: Banned miners

Alert Rules

Critical Alerts:

  • NodeDown: Node unresponsive >2 minutes
  • RegionDown: All nodes in region down
  • NoPeers: Node has 0 connected peers
  • NoActiveMiners: No miners available for block production
  • ChainNotProgressing: Block height not increasing

Warning Alerts:

  • LowPeerCount: <2 connected peers
  • HighProofGenerationTime: Proof gen >30 seconds
  • HighPendingTransactions: >1000 pending transactions
  • NodeOutOfSync: Sync progress <95%

Dashboards

Production Overview:

  • Node status by region
  • Chain height progression
  • Transaction throughput (TPS)
  • Network traffic
  • Active vs banned miners

Node Details:

  • Per-node performance metrics
  • Resource utilization
  • Peer connectivity
  • Proof generation times

Regional Health:

  • Regional availability
  • Cross-region latency
  • Failover status
  • Load distribution

Chaos Engineering

Running Chaos Tests

# Install dependencies
pip3 install requests

# Run all chaos tests
python3 infra/chaos/chaos_test.py

# Run specific scenario
python3 infra/chaos/chaos_test.py --scenario node_failure

Available Scenarios

  1. Node Failure: Single node crash and recovery
  2. Regional Failure: Entire region goes down
  3. Network Partition: Split-brain scenarios
  4. High Latency: Network delay injection
  5. Resource Exhaustion: CPU/memory constraints

Acceptance Criteria

  • ✅ Network survives regional failures (>50% nodes up)
  • ✅ Automatic failover to healthy regions (<30s)
  • ✅ Data consistency maintained during partitions
  • ✅ Performance degradation graceful under load
  • ✅ Recovery automatic without intervention

Deployment

Docker Compose (Development/Testing)

# Start all services
docker-compose -f infra/docker/docker-compose.yml up -d

# Scale a region
docker-compose -f infra/docker/docker-compose.yml up -d --scale node-us-east-2=3

# Stop all services
docker-compose -f infra/docker/docker-compose.yml down

# View logs
docker-compose -f infra/docker/docker-compose.yml logs -f node-us-east-1

Kubernetes (Production)

# Create namespace
kubectl create namespace bitcell-production

# Deploy infrastructure
kubectl apply -f infra/kubernetes/deployment.yaml

# Scale nodes
kubectl scale statefulset bitcell-node-us-east --replicas=3 -n bitcell-production

# View status
kubectl get pods -n bitcell-production
kubectl get svc -n bitcell-production

# View logs
kubectl logs -f statefulset/bitcell-node-us-east -n bitcell-production

Operations

Health Checks

# Check all nodes
for port in 9090 9091 9092 9093 9094 9095 9096; do
    curl -s http://localhost:$port/health | head -1
done

# Check Prometheus targets
curl http://localhost:9999/api/v1/targets | jq '.data.activeTargets[] | {job, health}'

# Check Alertmanager
curl http://localhost:9093/api/v1/alerts | jq '.data[] | {labels, state}'

Common Operations

Restart a node:

docker-compose restart node-us-east-1

Update node configuration:

# Edit docker-compose.yml
vim infra/docker/docker-compose.yml

# Apply changes
docker-compose up -d node-us-east-1

Silence an alert:

curl -X POST http://localhost:9093/api/v1/silences \
    -H "Content-Type: application/json" \
    -d '{
        "matchers": [{"name": "alertname", "value": "NodeDown", "isRegex": false}],
        "startsAt": "2024-12-09T00:00:00Z",
        "endsAt": "2024-12-09T23:59:59Z",
        "createdBy": "oncall",
        "comment": "Planned maintenance"
    }'

Performance Tuning

Optimize for latency:

# docker-compose.yml
environment:
  - NETWORK_LATENCY_OPTIMIZATION=true
  - PEER_CONNECTION_TIMEOUT=5s
  - SYNC_BATCH_SIZE=100

Optimize for throughput:

environment:
  - MAX_CONCURRENT_PROOFS=4
  - MEMPOOL_SIZE=10000
  - BLOCK_GAS_LIMIT=30000000

Troubleshooting

Nodes Not Connecting

Symptoms: Low peer count, isolated nodes

Fix:

# Check network
docker network inspect bitcell_bitcell-net

# Verify bootstrap nodes are reachable
docker exec bitcell-node-us-east-1 ping -c 3 node-us-west-1

# Restart node
docker-compose restart node-us-east-1

High Latency

Symptoms: Slow RPC responses, sync delays

Diagnostic:

# Measure latency between regions
for node in node-us-east-1 node-us-west-1 node-eu-central-1; do
    docker exec bitcell-node-us-east-1 ping -c 10 $node | tail -1
done

Fix:

  • Optimize network routes
  • Increase node resources
  • Enable caching

Prometheus Not Scraping

Symptoms: Missing metrics in Grafana

Fix:

# Check Prometheus targets
curl http://localhost:9999/api/v1/targets

# Verify metrics endpoint
curl http://localhost:9090/metrics

# Restart Prometheus
docker-compose restart prometheus

Database Full

Symptoms: Disk space warnings, slow queries

Fix:

# Check disk usage
docker exec bitcell-node-us-east-1 df -h /data/bitcell

# Prune old data (when available in RC2)
docker exec bitcell-node-us-east-1 bitcell-node prune --keep-recent 10000

# Or increase disk size
# Resize volume in cloud provider

Security

Network Security

Firewall Rules:

# Allow P2P
iptables -A INPUT -p tcp --dport 9000:9010 -j ACCEPT

# Allow RPC (restricted to known IPs)
iptables -A INPUT -p tcp --dport 8545 -s 10.0.0.0/8 -j ACCEPT

# Allow metrics (monitoring subnet only)
iptables -A INPUT -p tcp --dport 9090:9100 -s 172.20.0.0/16 -j ACCEPT

Secrets Management

Do NOT store in git:

  • API keys
  • Database passwords
  • SSL certificates
  • Signing keys

Use environment variables or secrets manager:

# Docker secrets
docker secret create api_key api_key.txt
docker service update --secret-add api_key bitcell-node

# Kubernetes secrets
kubectl create secret generic bitcell-secrets \
    --from-literal=api-key=$API_KEY \
    -n bitcell-production

SSL/TLS

Enable HTTPS for RPC:

# haproxy.cfg
frontend rpc_frontend
    bind *:443 ssl crt /etc/ssl/certs/bitcell.pem
    redirect scheme https if !{ ssl_fc }

Performance Benchmarks

Target Metrics (RC3 Requirements)

Metric Target Current
Cross-region latency <200ms ~150ms
Node failure recovery <60s ~45s
Regional failover <120s ~90s
Transaction throughput 100 TPS 75 TPS
Proof generation <10s ~25s
Network uptime 99.9% 99.5%

Load Testing

# Run load test
./scripts/load-test.sh --duration 60 --tps 100

# Monitor during test
watch -n 1 'curl -s http://localhost:9090/metrics | grep bitcell_txs'

Cost Estimation

Cloud Provider Costs (Monthly)

AWS:

  • 7× t3.xlarge instances: ~$1,200
  • 700GB EBS storage: ~$70
  • Data transfer: ~$200
  • Total: ~$1,500/month

GCP:

  • 7× n2-standard-4 instances: ~$1,100
  • 700GB SSD storage: ~$120
  • Network egress: ~$150
  • Total: ~$1,400/month

Optimization:

  • Use spot instances for 40% savings
  • Enable autoscaling
  • Regional storage for backups
  • CDN for static content

Support

Documentation

Communication

Emergency Contacts

  • On-Call: See PagerDuty schedule
  • Platform Team: #platform-team on Slack
  • Security Team: [email protected]

Contributing

Improvements to infrastructure are welcome:

  1. Test changes locally with Docker Compose
  2. Run chaos tests to verify resilience
  3. Update documentation
  4. Submit PR with description

License

MIT OR Apache-2.0


Last Updated: 2024-12-09
Maintained By: BitCell Platform Team
Status: Production Ready (RC3)