An AI-powered conversational English practice platform for ESL (English as a Second Language) learners and their teachers. Built by Alexandr Kurilin and Dmitry Stavisky between December 2022 and April 2023.
Language learners need frequent speaking practice to build fluency, but access to conversation partners is limited, expensive, and often anxiety-inducing. Teachers lack the bandwidth to provide one-on-one practice time for every student, and traditional homework assignments don't develop speaking skills.
TutorBox gives students a private, always-available AI conversation partner that adapts to their level and curriculum. Teachers create assignments with specific scenarios, vocabulary, and difficulty levels. Students practice speaking through voice conversations with the AI, and both students and teachers receive detailed performance analytics afterward.
- A teacher creates an assignment — selecting a conversation scenario (e.g. ordering at a restaurant, a job interview), a CEFR difficulty level (A1–C2), and target vocabulary words
- The teacher shares the assignment link with students
- The student opens the link, and a real-time voice conversation begins with the AI tutor
- The student speaks into their microphone — their speech is transcribed in real-time and sent to GPT-3.5 for a contextual response
- The AI's response is read aloud using neural text-to-speech
- After the session, both student and teacher receive detailed performance reports
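The assignment the teacher creates in the first step can be modeled roughly as follows. This is an illustrative sketch — the actual type and field names in the codebase may differ:

```typescript
// Hypothetical data model for a teacher-created assignment (field names are illustrative).
type CEFRLevel = "A1" | "A2" | "B1" | "B2" | "C1" | "C2";

interface Assignment {
  scenario: string;     // e.g. "ordering at a restaurant"
  level: CEFRLevel;     // target difficulty
  vocabulary: string[]; // words the AI should work into the conversation
}

// Build a shareable practice link for students from an assignment id.
function buildShareLink(baseUrl: string, assignmentId: string): string {
  return `${baseUrl}/practice?assignment=${encodeURIComponent(assignmentId)}`;
}
```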
```
Browser (React)
  |
  |-- WebSocket ----------> Google Cloud Speech-to-Text (real-time transcription)
  |-- REST (Next.js API) -> OpenAI GPT-3.5 (conversation + analysis)
  |-- REST (Next.js API) -> Google Cloud TTS / Unreal Speech (voice synthesis)
  |
Next.js API Routes
  |-- OpenAI API (chat completion, grammar scoring, CEFR detection)
  |-- Google Cloud TTS + Unreal Speech (dual TTS engine support)
  |-- SendGrid (email notifications to teachers)
  |-- Clerk (authentication + user management)
  |-- PostgreSQL (user data, subscription tiers)
  |-- PostHog (product analytics)
  |-- Logtail (structured logging)
```
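The API routes in the diagram above mostly follow a proxy pattern: the browser never talks to OpenAI or the TTS providers directly. A minimal sketch of that pattern, with the outbound call injected so it can be exercised without credentials (the handler shape and names are illustrative, not the actual code):

```typescript
// Simplified sketch of a Next.js-style API route proxying chat requests to OpenAI.
// The outbound call is injected so the handler can be tested without network access.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface ChatRequest { messages: ChatMessage[] }
interface ChatResponse { reply: string }

type OpenAICall = (body: unknown) => Promise<{ choices: { message: ChatMessage }[] }>;

async function chatHandler(req: ChatRequest, callOpenAI: OpenAICall): Promise<ChatResponse> {
  const completion = await callOpenAI({
    model: "gpt-3.5-turbo",
    messages: req.messages,
  });
  // Return only what the browser needs; the API key never leaves the server.
  return { reply: completion.choices[0].message.content };
}
```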
- Framework: Next.js 13, TypeScript, React 18
- Styling: Tailwind CSS, Headless UI
- Database: PostgreSQL, Liquibase migrations
- Auth: Clerk (OAuth)
- AI/ML: OpenAI GPT-3.5 (conversation + analysis), Google Cloud Speech-to-Text (real-time STT via WebSocket), Google Cloud TTS + Unreal Speech (dual TTS engines)
- Infrastructure: Vercel, PostHog, SendGrid, Logtail
- Payments: Stripe (subscription management)
The core technical challenge was building a low-latency voice conversation loop: microphone input → real-time transcription → AI response → speech synthesis → audio playback → automatic microphone restart. This required coordinating the Web Audio API, AudioWorklet processors, WebSocket streaming to Google Cloud STT, and careful state management to prevent race conditions between recording, playback, and UI updates.
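The state coordination described above can be sketched as a small finite-state machine: each stage of the loop only reacts to the event that legitimately ends it, which is what prevents races such as the microphone restarting mid-playback. The state and event names here are illustrative, not the actual implementation:

```typescript
// Illustrative finite-state machine for the voice loop:
// recording -> awaiting AI -> speaking -> recording ...
type LoopState = "idle" | "recording" | "awaitingAI" | "speaking";
type LoopEvent = "micStart" | "utteranceEnd" | "aiResponseReady" | "playbackDone";

function nextState(state: LoopState, event: LoopEvent): LoopState {
  switch (state) {
    case "idle":       return event === "micStart" ? "recording" : state;
    case "recording":  return event === "utteranceEnd" ? "awaitingAI" : state;
    case "awaitingAI": return event === "aiResponseReady" ? "speaking" : state;
    // When TTS playback finishes, the microphone restarts automatically.
    case "speaking":   return event === "playbackDone" ? "recording" : state;
  }
}
```

Events that arrive in the wrong state (e.g. a stale AI response while already recording) are simply ignored, which is one way to keep recording, playback, and UI updates from stepping on each other.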
Student performance reports run multiple OpenAI API calls in parallel to evaluate different dimensions of language proficiency: fluency (words per minute), grammar accuracy (1–10), comprehension (logical coherence of responses), idiomaticity (natural language use), vocabulary usage tracking, and CEFR level detection. Each metric uses a specialized prompt with calibrated temperature and token limits. Grammar corrections are visualized using a diff algorithm that renders inline strikethrough/addition formatting.
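The fan-out over per-metric prompts can be sketched as below. The metric names come from the report; the prompt text, temperatures, and token limits are placeholders, and the model call is injected so the flow can be shown without an API key:

```typescript
// Sketch: run per-metric evaluations in parallel, each with its own prompt settings.
interface MetricConfig { name: string; prompt: string; temperature: number; maxTokens: number }
type ModelCall = (cfg: MetricConfig, transcript: string) => Promise<number>;

const METRICS: MetricConfig[] = [
  { name: "fluency",       prompt: "Estimate speaking pace...",  temperature: 0.0, maxTokens: 50 },
  { name: "grammar",       prompt: "Score grammar 1-10...",      temperature: 0.2, maxTokens: 100 },
  { name: "comprehension", prompt: "Rate logical coherence...",  temperature: 0.2, maxTokens: 100 },
];

async function buildReport(transcript: string, call: ModelCall): Promise<Record<string, number>> {
  // Promise.all fans the metric evaluations out concurrently.
  const scores = await Promise.all(METRICS.map((m) => call(m, transcript)));
  return Object.fromEntries(METRICS.map((m, i) => [m.name, scores[i]]));
}
```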
The system prompt dynamically assembles scenario context, target vocabulary, and CEFR level constraints to keep AI responses pedagogically appropriate. Lower CEFR levels produce simpler sentence structures and vocabulary; higher levels allow more complex language. The AI is encouraged to naturally incorporate the teacher's assigned vocabulary words into conversation, creating organic practice opportunities rather than rote drilling.
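That assembly step might look roughly like the following pure function. The wording is illustrative, not the production prompt:

```typescript
// Sketch of dynamic system-prompt assembly from assignment parameters.
function buildSystemPrompt(scenario: string, level: string, vocabulary: string[]): string {
  return [
    `You are an English conversation tutor. Role-play this scenario: ${scenario}.`,
    `The student is at CEFR level ${level}; match your sentence complexity and vocabulary to that level.`,
    // Vocabulary is woven in as a soft instruction rather than a drill list.
    vocabulary.length > 0
      ? `Naturally work these words into the conversation: ${vocabulary.join(", ")}.`
      : "",
  ].filter(Boolean).join(" ");
}
```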
The platform supports both Google Cloud Neural TTS and Unreal Speech, with multiple voice profiles for different speaker roles (male/female, bot/human). This provided fallback reliability and the ability to compare cost and quality tradeoffs between providers.
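The fallback behavior can be sketched as a thin abstraction over the two engines. The provider names match the stack, but the interface is illustrative:

```typescript
// Sketch of a dual-engine TTS abstraction with fallback on failure.
interface TTSEngine {
  name: string;
  synthesize(text: string, voice: string): Promise<Uint8Array>;
}

async function speakWithFallback(
  text: string,
  voice: string,
  primary: TTSEngine,
  fallback: TTSEngine,
): Promise<{ engine: string; audio: Uint8Array }> {
  try {
    return { engine: primary.name, audio: await primary.synthesize(text, voice) };
  } catch {
    // Primary engine failed (quota, outage): fall back to the second provider.
    return { engine: fallback.name, audio: await fallback.synthesize(text, voice) };
  }
}
```

Routing per-request this way also makes it easy to A/B the providers against each other on cost and quality.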
In February 2023, we prototyped a fully WhatsApp-native version of TutorBox (see the whatsapp branch). The idea was to eliminate the web app entirely and meet students where they already were — inside WhatsApp, the dominant messaging platform in our target Latin American market. The integration used the WhatsApp Cloud API to receive text and voice messages via webhook, transcribe voice notes through Google Cloud STT, generate conversational replies with OpenAI, synthesize audio responses with Google Cloud TTS, and send them back as WhatsApp voice messages — all in a single request cycle. Conversation history was persisted in PostgreSQL to maintain context across messages. This was an early experiment in using a chat platform as the primary interface for AI interaction, predating the broader industry's move toward conversational AI inside messaging apps by roughly two years.
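The single-request cycle described above can be sketched with all external services injected, so the flow reads end to end without credentials. The interface and function names are illustrative, not the actual webhook code:

```typescript
// Sketch of the WhatsApp voice-note cycle: transcribe -> reply -> persist -> synthesize -> send.
interface WhatsAppDeps {
  transcribe(audio: Uint8Array): Promise<string>;              // Google Cloud STT
  reply(history: string[], userText: string): Promise<string>; // OpenAI
  synthesize(text: string): Promise<Uint8Array>;               // Google Cloud TTS
  saveTurn(userText: string, botText: string): Promise<void>;  // PostgreSQL history
  sendVoiceNote(audio: Uint8Array): Promise<void>;             // WhatsApp Cloud API
}

async function handleVoiceNote(
  audio: Uint8Array,
  history: string[],
  deps: WhatsAppDeps,
): Promise<string> {
  const userText = await deps.transcribe(audio);
  const botText = await deps.reply(history, userText);
  await deps.saveTurn(userText, botText); // keep context for the next message
  await deps.sendVoiceNote(await deps.synthesize(botText));
  return botText;
}
```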
| Route | Description |
|---|---|
| / | Marketing landing page |
| /practice | Main conversation interface — real-time voice dialogue with the AI tutor |
| /assign | Teacher tool for creating assignments with scenario, CEFR level, and vocabulary |
| /student-report | Post-session analytics dashboard for students |
| /teacher-report | Detailed student performance report for teachers (includes full transcript with audio playback) |
```
pages/           # Next.js pages and API routes
  api/           # Backend endpoints (OpenAI proxy, TTS proxy, corrections, email)
components/      # React components (chat log, transcriber, TTS, scenario selector)
logic-frontend/  # Client-side logic (prompts, TTS helpers, report calculations)
logic-backend/   # Server-side logic (database, analytics, CORS)
logic-shared/    # Shared types and data (scenarios, CEFR levels, utilities)
db/              # PostgreSQL migrations (Liquibase) and seed data
```
This project is no longer actively developed. It was built to explore the intersection of AI, speech technology, and language education during the early wave of GPT-3.5 and neural TTS becoming accessible to individual developers. The early-stage prototype was marketed to and trialed by English language schools and academies in South America. The codebase represents roughly five months of work, from ideation through a functional product with real users.