Comprehensive testing strategy with unit tests, integration tests, and E2E tests.
Test expert creation with minimal cost (~$0.004):

```bash
# Windows
tests\test_keyboards_cli.bat

# Linux/Mac
./tests/test_keyboards_cli.sh

# Validate learning
python tests/validate_expert_learning.py "Keyboards Test"
```

See TEST_EXPERT_CREATION.md and TESTING_SUMMARY.md for details.
After experiencing 4 failed API submissions due to untested code (Oct 30, 2025), we implemented a three-layer testing strategy:
- Unit Tests - Fast, free, validate parameters with mocks
- Integration Tests - Real API calls, validate contracts, use Deepr to improve Deepr
- E2E Tests - Full workflows, validate user experience
Key Insight: Unit tests tell you if code runs. Integration tests tell you if it works.
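The insight can be made concrete with a mock. The names below (`submit`, `api.create`) are illustrative stand-ins, not Deepr's actual API: a call-count assertion passes even though the payload is wrong, and only inspecting the real parameters (or hitting the real API) exposes the bug.

```python
# Sketch: why a unit test can pass while the code is broken.
# `api` and `submit` are illustrative, not Deepr's real interface.
from unittest.mock import MagicMock

def submit(api):
    # Bug: attaches an unsupported `container` key to web_search_preview.
    api.create(tools=[{"type": "web_search_preview", "container": "auto"}])

api = MagicMock()
submit(api)

api.create.assert_called_once()  # unit-style check: passes despite the bug

# Inspecting the actual payload reveals the problem:
sent = api.create.call_args.kwargs["tools"][0]
assert "container" in sent  # the bad parameter went through undetected
```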
```text
tests/
├── unit/                                    # Unit tests (fast, free, always run)
│   ├── test_providers/
│   │   ├── test_openai_tool_validation.py   # Parameter validation
│   │   ├── test_gemini_provider.py          # Gemini comprehensive
│   │   └── test_openai_provider.py          # OpenAI basic
│   ├── test_queue/                          # Queue operations (95% coverage)
│   ├── test_storage/                        # Storage operations (85% coverage)
│   └── test_costs.py                        # Cost calculations (93% coverage)
│
├── integration/                             # Integration tests (real APIs, costs money)
│   ├── test_research_modes_comprehensive.py # All 4 modes + dogfooding
│   ├── test_file_upload_api.py              # File upload workflows
│   ├── test_all_providers.py               # Provider comparison
│   └── test_cli_commands.py                 # CLI integration
│
├── test_e2e_cheap.py                        # E2E tests (expensive, run rarely)
└── data/
    └── research_outputs/                    # Saved research for review
```
```bash
# Run all unit tests
pytest -m unit

# Run with coverage
pytest -m unit --cov=deepr --cov-report=html

# Quick smoke test
pytest -m unit --maxfail=1
```

```bash
# Run cheap integration tests (~$1-2)
pytest -m "integration and not expensive"

# Run specific research mode tests (~$5-10)
pytest -m research_modes

# Run file upload tests (~$1-2)
pytest -m file_upload

# Run all integration tests (~$10-20)
pytest -m integration
```

```bash
# Run all E2E tests (~$20-50)
pytest -m e2e

# Run everything (unit + integration + e2e)
pytest
```

```bash
# Test only OpenAI
pytest -k "openai"

# Test only Gemini
pytest -k "gemini"

# Test all providers
pytest -k "provider"
```

| Marker | Purpose | Cost | When to Run |
|---|---|---|---|
| `unit` | Unit tests, no API calls | $0 | Every commit |
| `integration` | Real API calls, cheap queries | $0.10-1 | Before PR |
| `e2e` | Full workflows | $1-10 | Before release |
| `requires_api` | Needs API key | Varies | Manual/CI |
| `file_upload` | Tests file upload | ~$1 | After file changes |
| `research_modes` | Tests all 4 modes | ~$5-10 | Weekly |
| `expensive` | Costs >$1 | >$1 | Sparingly |
| `slow` | Takes >1 second | Varies | As needed |
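These markers need to be registered so `pytest -m ...` selections don't trigger "unknown marker" warnings. One way to do that (a sketch, not necessarily how Deepr's repo is configured; descriptions are paraphrased from the table above) is a `pytest_configure` hook in `conftest.py`:

```python
# conftest.py sketch: register the custom markers used in this suite.
# Descriptions paraphrase the marker table above.
MARKERS = [
    "unit: no API calls, run on every commit",
    "integration: real API calls, cheap queries",
    "e2e: full workflows",
    "requires_api: needs an API key",
    "file_upload: exercises file upload",
    "research_modes: runs all 4 research modes",
    "expensive: costs more than $1",
    "slow: takes more than 1 second",
]

def pytest_configure(config):
    # addinivalue_line is pytest's supported way to register markers in code.
    for marker in MARKERS:
        config.addinivalue_line("markers", marker)
```

Equivalently, the same lines can live under `markers =` in `pytest.ini`.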
We test all 4 research modes with OpenAI (full capability), and a limited set for Gemini and Grok:

1. **Focus Mode** (`test_openai_focus_mode_self_improvement`) - Quick research queries
   - Example: Research Python CLI best practices
   - Cost: ~$0.50-1.00
   - Duration: 5-10 min
2. **Documentation Mode** (`test_openai_docs_mode_api_research`) - Technical API documentation
   - Example: Research OpenAI's own API to validate our usage
   - Cost: ~$0.50-1.00
   - Duration: 5-10 min
3. **Project Mode** (`test_openai_project_mode_multi_phase`) - Multi-phase adaptive research
   - Example: Analyze Deepr's testing strategy (dogfooding)
   - Cost: ~$2-5
   - Duration: 10-20 min
4. **Team Mode** (`test_openai_team_mode_diverse_perspectives`) - Multiple perspectives synthesized
   - Example: Strategic decisions with diverse viewpoints
   - Cost: ~$3-8
   - Duration: 15-30 min

Gemini:
- Focus mode: Basic capability check
- Docs mode: API documentation research
- No multi-phase or team modes (not supported)

Grok:
- Focus mode: Basic capability check
- Limited deep research capabilities
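For budgeting, the per-mode ranges above can be rolled into a rough estimate for one full pass over the four OpenAI mode tests. This is simple arithmetic on the quoted ranges, a planning estimate rather than a measured figure:

```python
# Rough cost model for one run of all four OpenAI research-mode tests,
# using the dollar ranges quoted above (planning estimate, not measurement).
MODE_COST_USD = {
    "focus":   (0.50, 1.00),
    "docs":    (0.50, 1.00),
    "project": (2.00, 5.00),
    "team":    (3.00, 8.00),
}

def full_suite_estimate():
    """Return (low, high) dollar bounds for one pass over all four modes."""
    low = sum(lo for lo, _ in MODE_COST_USD.values())
    high = sum(hi for _, hi in MODE_COST_USD.values())
    return low, high

# One full pass over all four modes costs roughly $6-15.
```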
Our integration tests double as a learning feedback loop:
Examples:
- `test_openai_focus_mode_self_improvement` - Research Python CLI best practices
- `test_openai_docs_mode_api_research` - Research the OpenAI API to validate our implementation
- `test_openai_project_mode_multi_phase` - Research testing strategy improvements
- `test_provider_comparison_same_query` - Compare providers on the same question
Benefits:
- Validates research quality (if Deepr can't improve Deepr, how can it help you?)
- Generates actionable insights saved to `tests/data/research_outputs/`
- Tests real-world use cases, not toy examples
- Creates a feedback loop for continuous improvement

Output Location: All research outputs are saved to `tests/data/research_outputs/` for review.
- Unit tests: $0 (no API calls)
- Integration tests (CI): ~$10/month
- Integration tests (manual): ~$10/month
- E2E tests (releases): ~$20/month
- Total: ~$40/month
Each test documents its expected cost in the docstring:

```python
@pytest.mark.integration
async def test_something():
    """Test description.

    Cost: ~$0.50
    Duration: 5 minutes
    """
```

To keep spend under control:
- Run unit tests locally (free)
- Run integration tests only when needed
- Use `pytest -m "not expensive"` to skip high-cost tests
- Run expensive tests only before releases
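One way to enforce the skip-high-cost rule automatically is a `conftest.py` hook that skips `expensive`-marked tests unless explicitly opted in. This is a sketch; the `RUN_EXPENSIVE` environment variable is a hypothetical convention, not part of Deepr today:

```python
# conftest.py sketch: auto-skip `expensive` tests unless RUN_EXPENSIVE=1.
# The RUN_EXPENSIVE variable is a hypothetical opt-in convention.
import os

import pytest

def should_skip_expensive(marker_names, env=None):
    """Return True when a test carrying `expensive` should be skipped."""
    env = os.environ if env is None else env
    return "expensive" in marker_names and env.get("RUN_EXPENSIVE") != "1"

def pytest_collection_modifyitems(config, items):
    skip = pytest.mark.skip(reason="expensive; set RUN_EXPENSIVE=1 to run")
    for item in items:
        names = {mark.name for mark in item.iter_markers()}
        if should_skip_expensive(names):
            item.add_marker(skip)
```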
Pre-commit hook (unit tests on every commit):

```bash
# .git/hooks/pre-commit
pytest -m unit --tb=short --maxfail=3
```

Pull request checks:

```yaml
# .github/workflows/pr-tests.yml
- name: Run unit tests
  run: pytest -m unit --cov=deepr --cov-fail-under=70

# Only on main branch
- name: Run cheap integration tests
  if: github.ref == 'refs/heads/main'
  run: pytest -m "integration and not expensive"
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Release checks:

```yaml
# .github/workflows/release.yml
- name: Run all tests
  run: pytest --maxfail=5
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
```

Unit test pattern (mock the API client and assert on the actual parameters sent):

```python
import pytest
from unittest.mock import MagicMock, patch

@pytest.mark.asyncio
async def test_something():
    """Test description."""
    provider = OpenAIProvider(api_key="test")
    with patch.object(provider.client.responses, "create") as mock:
        mock.return_value = MagicMock()
        await provider.submit_research(request)

        # KEY: Validate actual parameters
        call_kwargs = mock.call_args.kwargs
        assert call_kwargs["tools"][0] == expected_format
```
Integration test pattern (real API calls, cost documented, skipped without a key):

```python
import os

import pytest

@pytest.mark.asyncio
@pytest.mark.integration
@pytest.mark.requires_api
async def test_something():
    """Test description.

    Cost: ~$0.50
    Duration: 5 minutes
    """
    if not os.getenv("OPENAI_API_KEY"):
        pytest.skip("API key not set")

    provider = OpenAIProvider()  # Real provider
    request = ResearchRequest(
        prompt="Test query",
        model="o4-mini-deep-research",
        tools=[ToolConfig(type="web_search_preview")],
        background=True,
    )
    job_id = await provider.submit_research(request)
    # Poll for completion and validate
```

| Module | Current | Target | Status |
|---|---|---|---|
| Providers | 51% | 70% | In Progress |
| Queue | 95% | 90% | Good |
| Storage | 61% | 80% | Needs Work |
| CLI | 0% | 50% | Not Started |
| Core | 27% | 60% | Needs Work |
| Overall | 14% | 70% | In Progress |
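The integration pattern above ends at "poll for completion and validate". A minimal polling loop might look like the sketch below; the `get_status`/`get_result` coroutine names are hypothetical stand-ins for whatever status API the provider actually exposes:

```python
# Sketch: poll a background research job until it completes or times out.
# `get_status` / `get_result` are hypothetical coroutine names.
import asyncio

async def wait_for_completion(provider, job_id, max_wait=600, interval=15):
    """Return the job result, or raise on failure or timeout."""
    elapsed = 0
    while elapsed < max_wait:
        status = await provider.get_status(job_id)
        if status == "completed":
            return await provider.get_result(job_id)
        if status == "failed":
            raise RuntimeError(f"Job {job_id} failed")
        await asyncio.sleep(interval)
        elapsed += interval
    raise TimeoutError(f"Job {job_id} did not finish within {max_wait}s")
```

The `max_wait` default matches the 10-minute timeout suggested in the troubleshooting section below.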
Bad:

```python
mock_api.assert_called_once()  # Just checks it was called
```

Good:

```python
call_kwargs = mock_api.call_args.kwargs
assert "container" not in call_kwargs["tools"][0]  # Validates actual parameters
```

Bad: "Test query 123"
Good: "What are the latest Python CLI best practices in 2025?"

Why: Real queries validate actual use cases and provide learning feedback.
```python
from pathlib import Path

output_file = Path("tests/data/research_outputs/focus_result.md")
output_file.write_text(response_text)
```

This creates a corpus of research we can review and learn from.
Always include cost and duration estimates in test docstrings.
When you fix a bug, add a test that would have caught it:
```python
class TestToolParameterRegressions:
    def test_regression_oct_30_2025_container_bug(self):
        """Regression: web_search_preview should NOT have container.

        Bug discovered: Oct 30, 2025 (4 failed submissions)
        """
        # Test implementation
```

Set environment variables:
```bash
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
export GROK_API_KEY="..."
```

Or create a `.env` file:

```bash
cp .env.example .env
# Edit .env with your keys
```

Increase the timeout in the test:

```python
max_wait = 600  # 10 minutes
```

Or skip slow tests:

```bash
pytest -m "not slow"
```

Run only cheap tests:

```bash
pytest -m "integration and not expensive"
```

Or run unit tests only:

```bash
pytest -m unit
```

Run with a coverage report:

```bash
pytest --cov=deepr --cov-report=html
open htmlcov/index.html  # View detailed report
```

- docs/TESTING_STRATEGY.md - Comprehensive testing philosophy
- docs/TEST_COVERAGE_IMPROVEMENT_PLAN.md - Roadmap to 70% coverage
- docs/BUGFIX_CONTAINER_PARAMETER.md - The bug that motivated this
See the main README.md or open an issue.