Inspiration

We've all been there: you open Telegram and see seventeen 3-minute voice messages from a friend recapping their entire day. You want to stay connected, but who has time to listen to 51 minutes of audio while commuting, working, or just trying to live life? The modern communication paradox is real: voice messages are more personal than text, but they're also time vampires. We wanted to solve this without losing the emotional richness that makes voice messages special in the first place. Say Less was born from a simple question: "What if there were a fun, fast way to listen to these messages?"

What it does

Say Less is a Telegram bot that transforms long-winded voice messages into bite-sized soundbites and reacts with an emoji that captures the speaker's vibe.

You send a voice message (rambling about your day, ranting about traffic, celebrating a win—whatever), and Say Less:

  • Transcribes it using Google Speech Recognition (optimized for Singlish!)
  • Extracts the emotional beats using keyword analysis and emotion detection
  • Generates a soundbite by mapping keywords to sound effects (custom sounds for swear words, panic, and animal sounds)
  • Replies with the soundbite and slaps an emoji reaction on it that matches your mood (😊, 😤, 😢, etc.)

Instead of 3 minutes of rambling, you get a 10-second emotional highlight reel that tells the story through sound.
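As a rough illustration of the last step, picking a reaction from the detected mood can be as simple as a lookup table. The emotion labels and helper below are an illustrative sketch, not our exact code:

```python
# Hypothetical sketch: map a dominant detected emotion to a reaction emoji.
# The labels loosely follow NRCLex's affect categories.
EMOTION_EMOJI = {
    "joy": "😊",
    "anger": "😤",
    "sadness": "😢",
    "fear": "😱",
    "surprise": "😲",
}

def pick_reaction(emotion_scores: dict) -> str:
    """Return the emoji for the strongest emotion, defaulting to 👍."""
    if not emotion_scores:
        return "👍"
    top = max(emotion_scores, key=emotion_scores.get)
    return EMOTION_EMOJI.get(top, "👍")
```

In the real bot, the scores come from the NRCLex + TextBlob emotion engine described below.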

How we built it

The Tech Stack:

  • Python & Telebot: Powers the Telegram bot interface
  • Google Speech Recognition: Transcribes voice messages (with en-SG language support for Singlish)
  • FFmpeg & PyDub: Handles audio format conversion (.ogg ↔ .wav) and processing
  • NRCLex + TextBlob Emotion Engine: Dual-layer emotion detection system
  • YAKE Keyword Extraction: Identifies key moments and themes from rambling speech
  • Freesound API: Sources royalty-free sound effects for common keywords
  • Custom Sound Pools: Maps specific words (swear words, animals, emotions) to curated sound files
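Telegram delivers voice notes as .ogg (Opus) while the recognizer expects .wav, so the first step in the pipeline is a format hop. A minimal sketch of that conversion, assuming ffmpeg is on the PATH (the 16 kHz mono settings are illustrative defaults, not a requirement of our pipeline):

```python
import subprocess

def ogg_to_wav_cmd(src: str, dst: str) -> list:
    # 16 kHz mono PCM is a safe target format for speech recognizers.
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

def convert(src: str, dst: str) -> None:
    # Shells out to ffmpeg; raises if the conversion fails.
    subprocess.run(ogg_to_wav_cmd(src, dst), check=True)
```

The same round trip runs in reverse (wav back to ogg) when the finished soundbite is sent back to Telegram.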

Challenges we ran into

  1. Extracting Meaningful Keywords from Casual Speech: Voice transcriptions are messy. People ramble, repeat themselves, use filler words ("um," "like," "basically"), and don't speak in neat sentences. Our biggest challenge was building a pipeline that could filter out discourse markers and stopwords without losing the story, distinguish emotional beats from generic words, handle multi-word phrases, and prioritize action verbs and rare words that carry narrative weight. We went through many iterations of our keyword extraction algorithm. The YAKE algorithm helped, but we still needed custom logic to handle colloquial speech patterns.
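The filler-filtering layer on top of YAKE can be sketched as a small post-processing function. The filler list here is an illustrative subset, and the candidate format is YAKE's `(phrase, score)` pairs, where a lower score means a more relevant keyword:

```python
# Illustrative discourse markers; the real pipeline uses a longer list.
FILLERS = {"um", "uh", "like", "basically", "you know", "i mean"}

def filter_keywords(candidates, top=10):
    """Drop filler phrases from (phrase, score) pairs, e.g. the output of
    yake.KeywordExtractor(...).extract_keywords(text), where a lower
    score means a more relevant keyword. Keep the best `top` phrases."""
    kept = [(p, s) for p, s in candidates if p.lower() not in FILLERS]
    kept.sort(key=lambda ps: ps[1])  # most relevant first
    return [p for p, _ in kept[:top]]
```

The custom logic for colloquial speech (repeated phrases, Singlish particles) sits in the same post-processing stage.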

  2. Finding Clean, Accurate Sound Effects: The Freesound API has thousands of sounds, but finding the right one is surprisingly hard. Search results often returned low-quality or irrelevant sounds (searching "panic" gave us roller coaster screams), and we needed sounds that were instantly recognizable in 2 seconds or less. Audio quality also varied wildly: some clips were studio-grade, others were phone recordings, and license filtering was inconsistent across search results. Our solution was a two-tier search system: optimized search-term mapping (e.g., "stressed" → "deep sigh" instead of just "stressed") combined with a scoring algorithm that weighs downloads, ratings, and duration. We also built custom sound pools for common words so we could hand-pick high-quality effects.
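The two tiers can be sketched as a term-remapping table plus a ranking heuristic. This is a toy version with made-up weights, not our exact scoring formula; `downloads`, `avg_rating`, and `duration` are fields that Freesound search results expose:

```python
import math

# Illustrative remapping of emotional keywords to concrete search terms.
SEARCH_TERMS = {"stressed": "deep sigh", "panic": "alarm beep"}

def search_term(keyword: str) -> str:
    """Tier 1: swap a vague keyword for a term that finds recognizable sounds."""
    return SEARCH_TERMS.get(keyword, keyword)

def score_sound(downloads: int, avg_rating: float, duration: float) -> float:
    """Tier 2 (toy weights): popularity and rating push a result up;
    short, naturally punchy clips get a bonus, overlong ones a penalty."""
    score = math.log1p(downloads) + avg_rating  # rating is on a 0-5 scale
    if duration <= 1.5:
        score += 2.0   # punchy clip bonus
    elif duration > 5.0:
        score -= 3.0   # too long to feel like a highlight
    return score
```

Candidates from the API are scored with `score_sound` and the top hit is downloaded.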

  3. Balancing Sound Length vs Recognizability: Early soundbites had two failure modes:

     • Too short (0.5s clips): Sounds were unrecognizable—just random beeps
     • Too long (5s+ clips): The soundbite was basically as long as the original message

     We needed sounds long enough to be recognizable but short enough to feel like a "highlight reel." After extensive testing, we settled on:

     • 2-second clips (the sweet spot for most sound effects)
     • 120ms silence gaps (just enough to distinguish sounds without feeling choppy)
     • An 8-10 keyword limit (total soundbite ~20-25 seconds)

     We also added duration scoring to our search algorithm: sounds under 1.5 seconds get bonus points for being naturally punchy.
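Those constants pin down the soundbite length. A quick back-of-envelope sketch in pure Python (the constants are the ones above, hard-coded for illustration):

```python
CLIP_MS = 2000        # 2-second clips
GAP_MS = 120          # silence gap between clips
MAX_KEYWORDS = 10     # upper end of the 8-10 keyword limit

def soundbite_length_ms(n_keywords: int) -> int:
    """Total duration of a soundbite: n clips with (n - 1) gaps between them."""
    n = min(n_keywords, MAX_KEYWORDS)
    if n == 0:
        return 0
    return n * CLIP_MS + (n - 1) * GAP_MS
```

At the 10-keyword cap this works out to 21,080 ms, just over 21 seconds, which lands inside the ~20-25 second target; in the real pipeline the clips are concatenated with PyDub and exported via FFmpeg.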

Accomplishments that we're proud of

  1. Zero-friction user experience — No commands, no setup, no learning curve for users. Just add the bot to your chat and send voice messages like normal. It works exactly how users expect Telegram to work.
  2. Custom sound pools for personalization — We built a flexible sound mapping system where users can drop in their own MP3 files for specific keywords. Want all swear words to sound like a cartoon bleep? Done. Want your dog's name to trigger actual recordings of your dog barking? Easy. This makes each soundbite feel personal and unique.
  3. Singlish-aware transcription — Optimized for Singapore English (en-SG) so it actually understands local speech patterns.
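The drop-in sound pool behaviour from point 2 can be sketched as a simple directory lookup (the directory layout and names here are illustrative): if a keyword has a folder of hand-picked MP3s, pick one at random; otherwise return None so the Freesound search runs as a fallback.

```python
import random
from pathlib import Path

def pick_custom_sound(keyword: str, pools_dir: str = "sound_pools"):
    """Return a random hand-picked MP3 for this keyword, or None if no
    custom pool exists (meaning the Freesound fallback should run)."""
    pool = Path(pools_dir) / keyword.lower()
    files = sorted(pool.glob("*.mp3")) if pool.is_dir() else []
    return random.choice(files) if files else None
```

Adding a personal sound is then just dropping a file into `sound_pools/<keyword>/`, with no code changes.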

What we learned

  1. Audio processing is harder than it looks: FFmpeg codec compatibility and audio normalization are critical for quality output
  2. Sound design carries emotion faster than text or emojis
  3. Users want personalization—custom sound pools let them inject their own personality

What's next for Say Less

  1. Multi-language support (Mandarin, Malay, Tamil)
  2. Sound caching to cut response time in half
  3. Expanding Say Less to work on Telegram "TeleBubbles" as well
