This document describes the testing strategy for GoZen, with a focus on maintaining high reliability for the daemon proxy (P0 component) while avoiding CI instability.
Location: internal/*/ test files
Run on: Every PR, every push to main
Characteristics:
- Fast (< 5 minutes total)
- Stable (no flakiness)
- No external dependencies
- Race detection enabled
Coverage Requirements:
- internal/config: ≥80%
- internal/proxy: ≥80%
- internal/proxy/transform: ≥80%
- internal/web: ≥80%
- internal/bot: ≥80%
- internal/daemon: ≥50%
- internal/update: ≥50%
- internal/sync: ≥50%
Run locally:
go test -race -short ./...

Location: tests/integration/
Run on: Every PR, every push to main
Characteristics:
- Moderate speed (< 3 minutes)
- Stable and controlled
- Minimal external dependencies
- Skips flaky tests in CI (via the CI=true env var)
What's tested:
- Proxy failover and load balancing
- Health checking and metrics
- Configuration management
- Provider disable/enable
- Connection pool cleanup
- Timeout handling
Run locally:
go test -race ./tests/integration/...
Run in CI mode (skips flaky tests):
SKIP_FLAKY_TESTS=true go test -race ./tests/integration/...

Location: tests/integration/daemon_*_test.go
Run on:
- Manual trigger (workflow_dispatch)
- Nightly (2 AM UTC)
- After merge to main (workflow_run)
Characteristics:
- Slow (up to 10 minutes)
- May be flaky due to:
  - Process spawning and signals
  - Port binding races
  - Timing-dependent behavior
  - GitHub runner environment variations
- Tests real daemon binary behavior
What's tested:
- Daemon auto-restart after crash
- Signal handling (SIGTERM, SIGINT)
- Port takeover logic
- PID file management
- Full daemon lifecycle
Current limitations (documented in test comments):
- TestAutoRestart: Only tests daemon startup, not actual restart after crash
- TestDaemonAutoRestart: Tests fatal error/signal handling, not crash recovery
- TestDaemonCrashRecovery: Only tests error classification, not real crash injection
Future work:
- Real crash injection and recovery verification
- Restart loop with exponential backoff validation
- Max restart limit enforcement testing
These limitations are acceptable because:
- Core daemon stability is validated by Layer 1 & 2 tests
- Crash detection logic (IsFatalError) is tested
- Port takeover and process management are tested
- Real crash recovery requires complex process injection
Run locally:
go test -v -timeout 600s ./tests/...
Run specific test:
go test -v -run TestDaemonAutoRestart ./tests/integration/

Jobs:
- Unit Tests - Fast unit tests with coverage checks
- Integration Tests - Stable integration tests (flaky tests skipped via CI=true)
- Web UI Tests - Frontend tests with coverage
- Website Build - Documentation build verification
Status: All jobs are required checks for PR merge
Triggers:
- Manual: Go to Actions → E2E Tests → Run workflow
- Nightly: Runs at 2 AM UTC daily
- Post-merge: Runs after CI passes on main branch
Status: Not a required check - failures don't block PRs
Notifications:
- Failed E2E runs after merge will comment on the merged PR
- Check Actions tab for detailed results
Tests that are flaky in CI should check the SKIP_FLAKY_TESTS environment variable:
func TestDaemonAutoRestart(t *testing.T) {
	// Skip in CI environment - these tests are flaky on GitHub runners
	if os.Getenv("SKIP_FLAKY_TESTS") == "true" {
		t.Skip("skipping daemon auto-restart test (SKIP_FLAKY_TESTS=true)")
	}
	// Test implementation...
}

go test -short ./...
go test -race ./...
go test -race ./tests/integration/...
go test -v -timeout 600s ./tests/...
go test -cover ./internal/proxy/

- Add to internal/<package>/<file>_test.go
- Must be fast and stable
- No external dependencies
- Will run on every PR
- Add to tests/integration/<feature>_test.go
- Should be stable and controlled
- If potentially flaky, add skip check:
  if os.Getenv("SKIP_FLAKY_TESTS") == "true" {
      t.Skip("skipping in CI environment (SKIP_FLAKY_TESTS=true)")
  }
- Add to tests/integration/daemon_*_test.go or similar
- Can be slow and timing-dependent
- Should skip in main CI:
  if os.Getenv("SKIP_FLAKY_TESTS") == "true" {
      t.Skip("skipping E2E test (SKIP_FLAKY_TESTS=true)")
  }
For the daemon proxy (P0 component):
- Confidence comes from stable tests, not flaky E2E tests
- Main CI must stay green: green means mergeable, with no yellow/red noise
- E2E tests provide additional signal but don't block development
- Local testing is the primary validation - CI is a safety net
This approach ensures:
- Fast feedback on PRs
- No false positives blocking merges
- Comprehensive coverage without CI instability
- Clear signal when tests fail (real issues, not flakiness)