## Why I Built This

I spent months watching my tutor friends burn out writing lesson plans. They'd spend 2-3 hours per week on planning when they could be teaching. Every AI tutoring tool I found either gave generic templates or couldn't adapt to individual students. I wanted agents that actually get smarter the more you use them.

## What It Does

Three AI agents work together:

- Strategy Planner generates 4-week learning plans
- Lesson Creator builds detailed lessons from strategy weeks
- Activity Creator generates React code for interactive simulations

The twist: they all evaluate themselves, learn from tutor edits, and improve their prompts based on patterns.

## Self-Improvement Loop

When tutors edit AI-generated content, they write why they changed it. That's the key: not just "what changed" but "student needed hands-on activities because quiz format was too passive." Every 6 hours, a reflection agent analyzes:

- Self-evaluation scores (what scored low?)
- Edit notes (what do tutors commonly fix?)
- Success patterns (what works?)
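A minimal sketch of how the reflection pass might aggregate edit notes into insights, assuming a hypothetical schema where each note is a `(category, free_text)` pair (the real tables and categories surely differ):

```python
from collections import Counter

def summarize_edits(edit_notes, min_count=3):
    """Count tagged edit reasons and surface frequent patterns as
    prompt hints. Hypothetical schema: each note is (category, text)."""
    counts = Counter(category for category, _ in edit_notes)
    total = len(edit_notes)
    insights = []
    for category, n in counts.most_common():
        if n >= min_count:
            # e.g. "tutors flag 'quiz_too_passive' in 8/10 edits"
            insights.append(f"tutors flag '{category}' in {n}/{total} edits")
    return insights
```

Insights like these can then be appended to the next generation's prompt so the agents stop repeating the same mistakes.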

It generates insights like "tutors replace quizzes with simulations 8/10 times" and feeds them into future prompts. The next generation automatically uses simulations.

## Technical Stack

- Backend: Python + FastAPI, Supabase PostgreSQL
- Models: Google LearnLM, Perplexity Sonar, Qwen3 Coder 480B (via W&B Inference)
- Deployment: Daytona sandboxes for React activities
- Observability: Weave traces every AI call

## Challenges

- Daytona SDK confusion: the docs showed `upload_file(path, content)`, but the actual method is `upload_file(content, path)`. It took 3 hours of debugging JSON errors before I realized the parameter order was wrong.
- Self-evaluation parsing: LearnLM sometimes returns markdown code blocks around JSON. I built a robust parser that strips `` ```json `` markers and handles malformed responses with regex fallbacks.
- Auto-debugging reliability: the first version would infinite-loop on syntax errors. Capping it at 3 attempts with exponential backoff raised the success rate from 40% to 85%.
- Context handoff between agents: originally each agent called Perplexity separately (slow and expensive). I refactored to store `knowledge_context` in the database so the Activity Creator reuses research from Lessons, cutting API calls by 60%.

## What I Learned
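The fence-stripping parser described in the challenges above can be sketched roughly like this (a simplified version; the actual parser presumably handles more malformed cases):

```python
import json
import re

def parse_model_json(raw: str):
    """Parse JSON from a model response that may be wrapped in
    markdown code fences or buried in surrounding prose.
    Returns a Python object, or None if nothing parseable is found."""
    text = raw.strip()
    # Strip a ```json ... ``` (or bare ```) fence if present.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Regex fallback: grab the first {...} span and try again.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None
```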
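The capped auto-debug loop can be sketched as follows, with hypothetical `generate`/`execute` callables standing in for the real model call and sandbox run:

```python
import time

def run_with_retries(generate, execute, max_attempts=3, base_delay=1.0):
    """Generate code, try to run it, and on failure feed the error
    back for a fix -- capped at max_attempts with exponential backoff
    so a stubborn syntax error can't loop forever."""
    error = None
    for attempt in range(max_attempts):
        code = generate(error)       # error is None on the first try
        ok, result = execute(code)   # (success flag, output or error text)
        if ok:
            return result
        error = result
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, ...
    raise RuntimeError(f"auto-debug failed after {max_attempts} attempts: {error}")
```

Passing the previous error back into `generate` is what lets the model actually fix its own syntax mistakes instead of regenerating blindly.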

- Self-evaluation is powerful but needs structure. A generic "rate 1-10" doesn't work; specific criteria (pedagogical soundness, engagement, code quality) give actionable feedback.
- Tutor edit notes are gold. One field in the database unlocked the entire learning system.
- Daytona sandboxes are fast (90 ms creation), but error logs need parsing; raw output is messy.
- Weave tracing saved debugging time. Seeing exact prompts and responses made optimization obvious.
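A rough sketch of what criteria-based self-evaluation could look like, using the criteria named above (the prompt wording and JSON schema here are illustrative, not the project's actual prompt):

```python
# Hypothetical rubric: specific criteria instead of a generic 1-10 rating.
CRITERIA = {
    "pedagogical_soundness": "Are concepts scaffolded from simple to complex?",
    "engagement": "Would a student actively participate, not just read?",
    "code_quality": "Does the generated activity code run without errors?",
}

def build_eval_prompt(content: str) -> str:
    """Ask the model to score each criterion 1-10 with a reason,
    returning JSON -- so a low score points at a concrete fix."""
    lines = [f'- "{name}": {question}' for name, question in CRITERIA.items()]
    return (
        "Score the following content on each criterion (1-10) and explain "
        'each score. Respond as JSON mapping criterion name to '
        '{"score": int, "reason": str}.\n'
        + "\n".join(lines)
        + f"\n\nContent:\n{content}"
    )
```

A per-criterion reason is what turns a score into something the reflection agent can act on.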
