
CSM Model Integration Plan for EchoForge

This document outlines the plan for integrating the Conversational Speech Model (CSM) from Sesame AI Labs into the EchoForge application, with the goal of achieving feature parity with the tts_poc project while improving robustness, testing, and documentation.

1. Core Model Integration

1.1 Model Implementation

  • Create CSM model class in app/models/csm_model.py
  • Implement model loading with proper error handling
  • Add GPU/CPU detection and fallback mechanisms
  • Implement watermarking integration (or mock)
  • Create placeholder model for when CSM is unavailable
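
The load-or-fallback pattern above can be sketched as follows; this is an illustration only, and the names `PlaceholderModel` and `load_csm_model` are assumptions, not the actual EchoForge API:

```python
class PlaceholderModel:
    """Stand-in model used when the real CSM model is unavailable."""

    sample_rate = 24000  # matches the 24 kHz output documented in testing

    def generate(self, text: str) -> list:
        # One second of silence per request, regardless of input text,
        # so downstream audio handling can still be exercised.
        return [0.0] * self.sample_rate


def load_csm_model(loader):
    """Call loader() to build the real model; fall back on any failure.

    Returns (model, is_real) so callers and diagnostics can report
    whether the real CSM model or the placeholder is active.
    """
    try:
        return loader(), True
    except Exception:
        return PlaceholderModel(), False
```

Returning a flag rather than raising keeps the application importable and testable on machines where the CSM checkpoint is missing.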

1.2 Voice Generation

  • Implement VoiceGenerator class in app/api/voice_generator.py
  • Add support for different voice parameters (temperature, top-k)
  • Create voice cloning functionality
  • Implement audio post-processing utilities
  • Add proper error handling and diagnostics
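
As one concrete example of the post-processing utilities above, a peak-normalization helper scales samples so the loudest one reaches a target amplitude. This is a hedged sketch; the real utility names and signatures may differ:

```python
def peak_normalize(samples, target: float = 0.95):
    """Scale audio samples so the largest absolute value equals target.

    Keeps output safely inside [-1.0, 1.0]; silent input is returned
    unchanged to avoid division by zero.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    scale = target / peak
    return [s * scale for s in samples]
```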

2. API Implementation

2.1 REST API Endpoints

  • Implement health check endpoint (/api/health)
  • Add system diagnostics endpoint (/api/diagnostic)
  • Create voice listing endpoint (/api/voices)
  • Add speech generation endpoint (/api/generate)
  • Implement task status endpoint (/api/tasks/{task_id})
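
The request shape for /api/generate can be seen in the curl examples later in this document (text, speaker_id, temperature, top_k, style, device). A sketch of server-side validation for that payload follows; the field names come from those examples, while the defaults and accepted ranges are assumptions:

```python
VALID_DEVICES = {"auto", "cuda", "cpu"}


def validate_generate_request(payload: dict) -> dict:
    """Validate and normalize a /api/generate payload.

    Raises ValueError with a message suitable for a 400 response.
    """
    text = str(payload.get("text", "")).strip()
    if not text:
        raise ValueError("text must be a non-empty string")

    temperature = float(payload.get("temperature", 0.7))
    if not 0.0 < temperature <= 2.0:  # assumed sampling range
        raise ValueError("temperature must be in (0.0, 2.0]")

    top_k = int(payload.get("top_k", 50))
    if top_k < 1:
        raise ValueError("top_k must be a positive integer")

    device = payload.get("device", "auto")
    if device not in VALID_DEVICES:
        raise ValueError(f"device must be one of {sorted(VALID_DEVICES)}")

    return {
        "text": text,
        "speaker_id": int(payload.get("speaker_id", 1)),
        "temperature": temperature,
        "top_k": top_k,
        "style": payload.get("style", "default"),
        "device": device,
    }
```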

2.2 Background Task System

  • Create task queue for handling generation requests
  • Implement progress tracking and status reporting
  • Add proper concurrency handling
  • Create error recovery mechanisms
  • Implement resource limiting
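
A minimal in-memory sketch of the task registry implied by the bullets above; the production version would add persistence, concurrency limits, and cleanup, and all names here are assumptions:

```python
import threading
import uuid


class TaskRegistry:
    """Thread-safe registry backing status queries like /api/tasks/{task_id}."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = {}

    def create(self) -> str:
        """Register a new pending task and return its id."""
        task_id = uuid.uuid4().hex
        with self._lock:
            self._tasks[task_id] = {"status": "pending", "progress": 0}
        return task_id

    def update(self, task_id, *, status=None, progress=None):
        """Update status and/or progress for an existing task."""
        with self._lock:
            task = self._tasks[task_id]
            if status is not None:
                task["status"] = status
            if progress is not None:
                task["progress"] = progress

    def get(self, task_id) -> dict:
        """Return a copy of the task record, or a not_found marker."""
        with self._lock:
            return dict(self._tasks.get(task_id, {"status": "not_found"}))
```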

3. Web Interface

3.1 Character Showcase

  • Create character profile component
  • Implement voice sample playback
  • Add text-to-speech generation interface
  • Create filtering by gender and voice style
  • Implement responsive design

3.2 UI Enhancements

  • Add voice parameter adjustment controls
  • Implement audio playback controls
  • Create loading indicators for long operations
  • Add error display and handling
  • Implement light/dark mode support

4. Testing

4.1 Unit Tests

  • Write tests for CSM model
  • Create tests for voice generator
  • Add tests for task system
  • Implement API route tests
  • Write utility function tests
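
To illustrate the intended test granularity, here is a self-contained example in pytest style; `validate_params` is a hypothetical helper defined inline so the example runs on its own:

```python
def validate_params(temperature: float, top_k: int) -> None:
    """Hypothetical parameter check of the kind the voice generator needs."""
    if not 0.0 < temperature <= 2.0:
        raise ValueError("temperature out of range")
    if top_k < 1:
        raise ValueError("top_k must be positive")


def test_accepts_documented_defaults():
    # 0.7 / 50 are the defaults used in the API examples in this plan.
    validate_params(temperature=0.7, top_k=50)  # must not raise


def test_rejects_non_positive_top_k():
    try:
        validate_params(temperature=0.7, top_k=0)
    except ValueError:
        pass
    else:
        raise AssertionError("top_k=0 should be rejected")
```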

4.2 Integration Tests

  • Test model integration with API
  • Create end-to-end voice generation tests
  • Test background task system
  • Implement web interface tests
  • Add performance benchmarks

5. Documentation

5.1 Code Documentation

  • Add docstrings to all classes and methods
  • Create README updates
  • Document configuration options
  • Add inline comments for complex logic
  • Create module dependency documentation

5.2 User Documentation

  • Write API documentation with examples
  • Create user guide for web interface
  • Add installation and setup guide
  • Document voice generation parameters
  • Create troubleshooting guide

6. Implementation Phases

Phase 1: Core Model Integration

  • Set up basic project structure
  • Implement CSM model integration
  • Create voice generation functionality
  • Add basic test coverage
  • Document core components

Phase 2: API and Task System

  • Implement health check and diagnostic endpoints
  • Create voice listing and generation endpoints
  • Implement task status endpoint
  • Add voice management functionality
  • Write integration tests

Phase 3: Web Interface

  • Create character showcase UI
  • Implement voice filtering and playback
  • Add text-to-speech generation interface
  • Create responsive design
  • Fix JavaScript errors and edge cases
  • Write UI tests

Phase 4: Device Selection and Testing

  • Add device selection dropdown to generation interface
  • Update API to accept device parameter
  • Implement proper handling of different device options (auto, cuda, cpu)
  • Add CSS styling for new UI elements
  • Ensure backward compatibility
  • Test CPU-only generation through API
  • Test CUDA-accelerated generation through API
  • Test automatic device selection through API
  • Compare output files between different device options
  • Verify consistent audio quality across devices
  • Test device selection through web interface
  • Measure performance differences between device options
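
The device-option handling above can be sketched as a pure function. This is a simplified illustration: the availability flag is passed as a parameter so the logic is testable without a GPU, whereas the application would obtain it from `torch.cuda.is_available()`:

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a requested device ("auto", "cuda", "cpu") to a usable one."""
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    if requested == "cuda" and not cuda_available:
        # Fall back rather than fail, preserving backward compatibility
        # for clients that always request CUDA.
        return "cpu"
    return requested
```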

8.3 Generation Testing Results

  • Script-based generation: Successfully generated voice using generate_voice.py script with CPU device selection. Created WAV file at /home/tdeshane/echoforge/generated/voice_1_7_50.wav (16-bit PCM mono audio at 24000 Hz).

    # Command used for script-based generation with CPU
    python -m scripts.generate_voice --text "This is a test of voice generation using CPU." --device cpu --verbose
  • API-based generation: Successfully generated voice files with all three device options:

    • CPU: /tmp/echoforge/voices/voice_1742111628_dfc2cf8a.wav
    • CUDA: /tmp/echoforge/voices/voice_1742111789_ca813d54.wav
    • Auto: /tmp/echoforge/voices/voice_1742111897_f5ae76d7.wav
    # Command used for API-based generation with CPU
    curl -X POST http://localhost:8765/api/generate -H "Content-Type: application/json" \
      -d '{"text": "Testing voice generation with CPU device selection.", "speaker_id": 1, "temperature": 0.7, "top_k": 50, "style": "default", "device": "cpu"}'
      
    # Command used for API-based generation with CUDA
    curl -X POST http://localhost:8765/api/generate -H "Content-Type: application/json" \
      -d '{"text": "Testing voice generation with CUDA device selection.", "speaker_id": 1, "temperature": 0.7, "top_k": 50, "style": "default", "device": "cuda"}'
      
    # Command used for API-based generation with auto device selection
    curl -X POST http://localhost:8765/api/generate -H "Content-Type: application/json" \
      -d '{"text": "Testing voice generation with auto device selection.", "speaker_id": 1, "temperature": 0.7, "top_k": 50, "style": "default", "device": "auto"}'
  • Task status checking: Verified task status updates through API:

    # Command used for checking task status
    curl -X GET http://localhost:8765/api/tasks/{task_id} -s | python -m json.tool
  • File comparison: All generated files were identical (confirmed via cosine similarity and direct comparison). Each file was 480,078 bytes with 240,000 samples at 24,000 Hz and duration of 10 seconds.

    # Commands used for comparing files
    python -c "import torchaudio, torch; cpu_audio, _ = torchaudio.load('/tmp/echoforge/voices/voice_1742111628_dfc2cf8a.wav'); cuda_audio, _ = torchaudio.load('/tmp/echoforge/voices/voice_1742111789_ca813d54.wav'); auto_audio, _ = torchaudio.load('/tmp/echoforge/voices/voice_1742111897_f5ae76d7.wav'); print(f'CPU-CUDA identical: {torch.all(cpu_audio == cuda_audio).item()}'); print(f'CPU-Auto identical: {torch.all(cpu_audio == auto_audio).item()}'); print(f'CUDA-Auto identical: {torch.all(cuda_audio == auto_audio).item()}')"
  • Audio properties: All files showed consistent sample statistics (Min: -1.0, Max: ~1.0, Mean: ~0, Std: ~0.65).

    # Command used for analyzing audio properties
    python -c "import torchaudio, torch, numpy as np; cpu_audio, _ = torchaudio.load('/tmp/echoforge/voices/voice_1742111628_dfc2cf8a.wav'); print(f'First 10 samples: {cpu_audio[0, :10]}'); print(f'Min value: {cpu_audio.min().item()}, Max value: {cpu_audio.max().item()}'); print(f'Mean: {cpu_audio.mean().item()}, Std: {cpu_audio.std().item()}')"
  • Hardware detection: System correctly detected NVIDIA GeForce RTX 3090 GPU and made it available for generation.

    # Command used for checking CUDA availability
    python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'Current device: {torch.cuda.current_device()}'); print(f'Device name: {torch.cuda.get_device_name(0)}'); print(f'Device count: {torch.cuda.device_count()}')"
  • Task management: System completed all generation tasks successfully with no failures, properly updating task status.

9. Admin Interface

9.1 Admin Interface Implementation

  • Create admin dashboard route and template
  • Implement authentication for admin access
  • Add system status overview and monitoring panel
  • Create model management section (load/unload/restart)
  • Implement voice management tools (add/remove/modify)
  • Add task management interface (view/cancel/retry)
  • Create diagnostic tools and logs viewer
  • Implement configuration management interface
  • Add performance metrics and utilization graphs
  • Create user management (if applicable)

9.2 Admin API Endpoints

  • Implement admin authentication endpoint
  • Create system control endpoints (restart services)
  • Add model management endpoints
  • Implement voice management endpoints
  • Create configuration update endpoints
  • Add log retrieval endpoints
  • Implement performance metrics endpoints
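
As a sketch of the core check behind the admin authentication endpoint (a bearer-token scheme is assumed here; the actual mechanism is still to be decided):

```python
import hmac


def is_admin_authorized(auth_header: str, expected_token: str) -> bool:
    """Check an Authorization header against the configured admin token."""
    prefix = "Bearer "
    if not auth_header.startswith(prefix):
        return False
    supplied = auth_header[len(prefix):]
    # Constant-time comparison prevents timing attacks on the token.
    return hmac.compare_digest(supplied, expected_token)
```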

9.3 Admin Features

  • Dashboard: Overview of system status, active tasks, resource usage
  • Model Management: Load/unload models, change model parameters
  • Voice Testing: Interface for quick voice testing with different parameters
  • Batch Processing: Tools for batch voice generation
  • System Logs: Real-time log viewer with filtering
  • Performance Monitoring: CPU/GPU/memory usage graphs
  • Configuration Editor: Web interface for editing application settings
  • Voice Library: Tools to manage and organize voice samples
  • User Management: Control access permissions (if applicable)

Progress Tracking

Overall Progress:

  • Phase 1: Core Model Integration (100% complete)
  • Phase 2: API and Task System (100% complete)
  • Phase 3: Web Interface (90% complete)
  • Phase 4: Device Selection and Testing (100% complete)
  • Phase 5: Admin Interface (0% complete)
  • Phase 6: Documentation and Refinement (10% complete)

Current Focus:
Phase 5: Admin Interface - Creating admin dashboard and management tools

Next Milestone:

  • Admin Dashboard Implementation
  • Model Management Controls
  • System Status Monitoring

Project Completion Milestones:

  • Complete Admin Interface
  • Finalize API Documentation
  • Optimize Performance
  • User Guide Creation
  • Security Enhancements