
akurilin/tutorbox


TutorBox

An AI-powered conversational English practice platform for ESL (English as a Second Language) learners and their teachers. Built from December 2022 to April 2023.

Built by Alexandr Kurilin and Dmitry Stavisky.

The Problem

Language learners need frequent speaking practice to build fluency, but access to conversation partners is limited, expensive, and often anxiety-inducing. Teachers lack the bandwidth to provide one-on-one practice time for every student, and traditional homework assignments don't develop speaking skills.

The Solution

TutorBox gives students a private, always-available AI conversation partner that adapts to their level and curriculum. Teachers create assignments with specific scenarios, vocabulary, and difficulty levels. Students practice speaking through voice conversations with the AI, and both students and teachers receive detailed performance analytics afterward.

How a Session Works

  1. A teacher creates an assignment — selecting a conversation scenario (e.g. ordering at a restaurant, a job interview), a CEFR difficulty level (A1–C2), and target vocabulary words
  2. The teacher shares the assignment link with students
  3. The student opens the link, and a real-time voice conversation begins with the AI tutor
  4. The student speaks into their microphone — their speech is transcribed in real time and sent to GPT-3.5 for a contextual response
  5. The AI's response is read aloud using neural text-to-speech
  6. After the session, both student and teacher receive detailed performance reports
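Steps 4–6 form the per-turn loop. A minimal sketch of one turn, with the external services (STT, GPT-3.5, TTS) injected as stand-in functions — all names here are hypothetical, not the actual client code:

```typescript
// One conversational turn: transcribe the student, get an AI reply, play it aloud.
// The three dependencies stand in for Google Cloud STT, OpenAI, and neural TTS.
async function runTurn(deps: {
  transcribe: () => Promise<string>;          // step 4: mic → real-time transcription
  respond: (text: string) => Promise<string>; // step 4: transcript → GPT-3.5 reply
  speak: (text: string) => Promise<void>;     // step 5: reply read aloud via TTS
}): Promise<{ student: string; tutor: string }> {
  const student = await deps.transcribe();
  const tutor = await deps.respond(student);
  await deps.speak(tutor);
  return { student, tutor }; // step 6: turns accumulate into the performance report
}
```

Each completed turn's transcript pair would feed the post-session reports described below.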

Architecture

Browser (React)
  |
  |-- WebSocket ---------> Google Cloud Speech-to-Text (real-time transcription)
  |-- REST (Next.js API) -> OpenAI GPT-3.5 (conversation + analysis)
  |-- REST (Next.js API) -> Google Cloud TTS / Unreal Speech (voice synthesis)
  |
Next.js API Routes
  |-- OpenAI API (chat completion, grammar scoring, CEFR detection)
  |-- Google Cloud TTS + Unreal Speech (dual TTS engine support)
  |-- SendGrid (email notifications to teachers)
  |-- Clerk (authentication + user management)
  |-- PostgreSQL (user data, subscription tiers)
  |-- PostHog (product analytics)
  |-- Logtail (structured logging)
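The browser never calls OpenAI directly — the Next.js API routes attach the secret key server-side and forward the request. A sketch of that proxy pattern (route path, model parameters, and helper name are assumptions for illustration):

```typescript
// Shape of a chat message as sent to the OpenAI chat completions endpoint.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Pure helper: build the upstream request so the API key never reaches the client.
function buildChatRequest(messages: ChatMessage[], apiKey: string) {
  return {
    url: "https://api.openai.com/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model: "gpt-3.5-turbo", messages }),
    },
  };
}

// A Next.js API route (e.g. pages/api/chat.ts — path assumed) would then do:
//   const { url, init } = buildChatRequest(req.body.messages, process.env.OPENAI_API_KEY!);
//   const data = await (await fetch(url, init)).json();
//   res.status(200).json({ reply: data.choices[0].message.content });
```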

Tech Stack

  • Framework: Next.js 13, TypeScript, React 18
  • Styling: Tailwind CSS, Headless UI
  • Database: PostgreSQL, Liquibase migrations
  • Auth: Clerk (OAuth)
  • AI/ML: OpenAI GPT-3.5 (conversation + analysis), Google Cloud Speech-to-Text (real-time STT via WebSocket), Google Cloud TTS + Unreal Speech (dual TTS engines)
  • Infrastructure: Vercel, PostHog, SendGrid, Logtail
  • Payments: Stripe (subscription management)

Key Technical Challenges

Real-Time Voice Pipeline

The core technical challenge was building a low-latency voice conversation loop: microphone input → real-time transcription → AI response → speech synthesis → audio playback → automatic microphone restart. This required coordinating the Web Audio API, AudioWorklet processors, WebSocket streaming to Google Cloud STT, and careful state management to prevent race conditions between recording, playback, and UI updates.
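The central race to prevent is the microphone capturing the bot's own synthesized voice during playback. An illustrative sketch (not the actual code) of the phase-guard logic:

```typescript
// The voice loop cycles through three phases; mic frames are only streamed
// over the STT WebSocket while recording. Late or out-of-order events from
// the network are ignored rather than allowed to corrupt the state.
type PipelinePhase = "recording" | "awaiting-ai" | "playing";

class VoiceLoopController {
  private phase: PipelinePhase = "recording";

  get canSendAudioFrames(): boolean {
    return this.phase === "recording";
  }

  onFinalTranscript(): void {
    if (this.phase !== "recording") return; // drop late STT events
    this.phase = "awaiting-ai";
  }

  onAiResponse(): void {
    if (this.phase !== "awaiting-ai") return;
    this.phase = "playing"; // TTS playback begins; mic stays muted
  }

  onPlaybackEnded(): void {
    if (this.phase !== "playing") return;
    this.phase = "recording"; // automatic microphone restart
  }
}
```

In the real client these transitions would be wired to AudioWorklet, WebSocket, and HTMLAudioElement events.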

Multi-Dimensional AI Scoring

Student performance reports run multiple OpenAI API calls in parallel to evaluate different dimensions of language proficiency: fluency (words per minute), grammar accuracy (1–10), comprehension (logical coherence of responses), idiomaticity (natural language use), vocabulary usage tracking, and CEFR level detection. Each metric uses a specialized prompt with calibrated temperature and token limits. Grammar corrections are visualized using a diff algorithm that renders inline strikethrough/addition formatting.
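The fan-out can be sketched as one `Promise.all` over per-dimension scorers. The dimension names come from the description above; the scorer signature and return shapes are assumptions:

```typescript
// Each dimension gets its own specialized prompt/call; all run concurrently.
// CEFR detection returns a level string, the others return numeric scores.
type Dimension = "fluency" | "grammar" | "comprehension" | "idiomaticity" | "cefr";

async function scoreSession(
  transcript: string,
  score: (dimension: Dimension, transcript: string) => Promise<number | string>,
): Promise<Record<Dimension, number | string>> {
  const dimensions: Dimension[] = ["fluency", "grammar", "comprehension", "idiomaticity", "cefr"];
  const results = await Promise.all(dimensions.map((d) => score(d, transcript)));
  const report = {} as Record<Dimension, number | string>;
  dimensions.forEach((d, i) => { report[d] = results[i]; });
  return report;
}
```

Running the calls in parallel keeps report latency close to the slowest single OpenAI call rather than the sum of all of them.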

Prompt Engineering for Pedagogy

The system prompt dynamically assembles scenario context, target vocabulary, and CEFR level constraints to keep AI responses pedagogically appropriate. Lower CEFR levels produce simpler sentence structures and vocabulary; higher levels allow more complex language. The AI is encouraged to naturally incorporate the teacher's assigned vocabulary words into conversation, creating organic practice opportunities rather than rote drilling.
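A minimal sketch of that assembly, assuming a simple assignment config shape — the field names and per-level constraint wording here are invented for illustration:

```typescript
// Teacher-authored assignment parameters that drive the system prompt.
interface AssignmentConfig {
  scenario: string;      // e.g. "ordering at a restaurant"
  cefrLevel: "A1" | "A2" | "B1" | "B2" | "C1" | "C2";
  vocabulary: string[];  // teacher-assigned target words
}

function buildSystemPrompt(cfg: AssignmentConfig): string {
  // Lower CEFR levels constrain sentence structure and vocabulary.
  const levelConstraint = ["A1", "A2"].includes(cfg.cefrLevel)
    ? "Use short sentences and very common words only."
    : "You may use complex sentence structures and idiomatic language.";

  return [
    `You are an English tutor role-playing this scenario: ${cfg.scenario}.`,
    `The student is at CEFR level ${cfg.cefrLevel}. ${levelConstraint}`,
    `Work these words naturally into the conversation: ${cfg.vocabulary.join(", ")}.`,
    "Do not drill the words; let them arise organically.",
  ].join("\n");
}
```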

Dual TTS Engine Support

The platform supports both Google Cloud Neural TTS and Unreal Speech, with multiple voice profiles for different speaker roles (male/female, bot/human). This provided fallback reliability and the ability to compare cost and quality tradeoffs between providers.
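The fallback idea can be sketched as trying engines in preference order — the engine interface here is an assumption, not the platform's actual abstraction:

```typescript
// Common interface over Google Cloud Neural TTS and Unreal Speech.
interface TtsEngine {
  name: string;
  synthesize(text: string, voice: string): Promise<ArrayBuffer>;
}

// Try each engine in order; fall back on failure (outage, quota, etc.).
async function synthesizeWithFallback(
  text: string,
  voice: string,
  engines: TtsEngine[], // preference order, e.g. [unrealSpeech, googleCloudTts]
): Promise<{ engine: string; audio: ArrayBuffer }> {
  let lastError: unknown;
  for (const engine of engines) {
    try {
      return { engine: engine.name, audio: await engine.synthesize(text, voice) };
    } catch (err) {
      lastError = err; // provider failed — try the next one
    }
  }
  throw lastError;
}
```

The same indirection also makes it easy to A/B the providers for cost and quality, as described above.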

WhatsApp as an AI Interface

In February 2023, we prototyped a fully WhatsApp-native version of TutorBox (see the whatsapp branch). The idea was to eliminate the web app entirely and meet students where they already were — inside WhatsApp, the dominant messaging platform in our target Latin American market. The integration used the WhatsApp Cloud API to receive text and voice messages via webhook, transcribe voice notes through Google Cloud STT, generate conversational replies with OpenAI, synthesize audio responses with Google Cloud TTS, and send them back as WhatsApp voice messages — all in a single request cycle. Conversation history was persisted in PostgreSQL to maintain context across messages. This was an early experiment in using a chat platform as the primary interface for AI interaction, predating the broader industry's move toward conversational AI inside messaging apps by roughly two years.
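The single request cycle can be sketched with the external calls injected as stand-ins. The payload shape below loosely follows the WhatsApp Cloud API but is simplified and not the branch's actual code:

```typescript
// A simplified inbound WhatsApp message (text message or voice note).
interface InboundMessage {
  from: string;             // student's WhatsApp number
  text?: string;            // present for text messages
  voiceAudio?: ArrayBuffer; // present for voice notes
}

// One webhook cycle: transcribe → respond → synthesize → send back as voice.
async function handleWebhook(
  msg: InboundMessage,
  deps: {
    transcribe: (audio: ArrayBuffer) => Promise<string>; // Google Cloud STT
    respond: (text: string) => Promise<string>;          // OpenAI, with history from PostgreSQL
    synthesize: (text: string) => Promise<ArrayBuffer>;  // Google Cloud TTS
    sendVoice: (to: string, audio: ArrayBuffer) => Promise<void>; // WhatsApp Cloud API
  },
): Promise<string> {
  // Voice notes are transcribed first; text messages pass straight through.
  const studentText = msg.voiceAudio ? await deps.transcribe(msg.voiceAudio) : msg.text ?? "";
  const reply = await deps.respond(studentText);
  await deps.sendVoice(msg.from, await deps.synthesize(reply));
  return reply;
}
```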

Pages

Route             Description
/                 Marketing landing page
/practice         Main conversation interface — real-time voice dialogue with the AI tutor
/assign           Teacher tool for creating assignments with scenario, CEFR level, and vocabulary
/student-report   Post-session analytics dashboard for students
/teacher-report   Detailed student performance report for teachers (includes full transcript with audio playback)

Project Structure

pages/           # Next.js pages and API routes
  api/           # Backend endpoints (OpenAI proxy, TTS proxy, corrections, email)
components/      # React components (chat log, transcriber, TTS, scenario selector)
logic-frontend/  # Client-side logic (prompts, TTS helpers, report calculations)
logic-backend/   # Server-side logic (database, analytics, CORS)
logic-shared/    # Shared types and data (scenarios, CEFR levels, utilities)
db/              # PostgreSQL migrations (Liquibase) and seed data

Status

This project is no longer actively developed. It was built to explore the intersection of AI, speech technology, and language education during the early wave of GPT-3.5 and neural TTS becoming accessible to individual developers. The early-stage prototype was marketed to and trialed by English language schools and academies in South America. The codebase represents roughly five months of work from ideation through a functional product with real users.
