An autonomous AI desk robot with real-time computer vision, human-like voice, face recognition,
skeleton tracking, document Q&A, live code review, home automation —
and a sarcastic personality. Built from electronic waste for ₹8,000.
B.Tech Final Year Project — Data Science (2023–2027)
About · Features · Architecture · Hardware · ML Models · Installation · Usage · Codex · Demo
J.I.N.X is a multi-modal AI robotic desk companion that sees, hears, speaks, judges, and even reviews your code — all built from recycled electronics, a spare smartphone, a dead ThinkPad, and low-cost microcontrollers.
It combines 10 machine learning models spanning computer vision, audio classification, natural language understanding, network security, and sensor fusion into a single desk-mounted platform with a cyberpunk aesthetic and an attitude problem.
"Born from a dead ThinkPad T61 that couldn't even turn on anymore. Its metal chassis became J.I.N.X's body. A spare phone that couldn't make calls became its eyes, ears, and voice. Total hardware cost: ₹8,000. This project proves that AI isn't about expensive hardware — it's about intelligence."
✧ Real-time face detection + recognition (green = safe / blue = unknown / red = threat)
✧ 33-keypoint full body skeleton + pose estimation overlay
✧ 21-keypoint hand gesture recognition — control everything without touching anything
✧ 468-point face mesh with dramatic scanning visual effect
✧ CNN-trained audio classification (gunshots, sirens, glass breaks, screams)
✧ Offline wake word + full voice command system
✧ Gemini 2.0 Flash conversations with context memory and sarcastic personality
✧ Microsoft Neural TTS — sounds like a real human, not a robot. Free. No API key.
✧ Document Q&A — ask questions about uploaded PDFs by voice or web panel
✧ Code review agent — auto-reviews changed files on save, flags bugs and style issues
✧ Roast mode — scans your face, generates a personalized AI roast, delivers it live
✧ Voice-activated music search and streaming via yt-dlp + mpv
✧ IoT home automation — LED strip + smart device control by voice and gesture
✧ WiFi network monitor — flags unknown devices, detects traffic anomalies
✧ Animated eyes on 2.4" TFT with 12 emotional states (neutral / angry / scanning / roast / boot...)
✧ Pan-tilt servo head — physically turns to follow detected faces
✧ Digital pupils that track face position, synced with servo movement
✧ VL53L0X ToF sensor (downward) — detects desk edges with millimeter accuracy
✧ Dual HC-SR04 ultrasonic + forward VL53L0X obstacle avoidance
✧ 7.4V 18650 2S2P battery pack with BMS protection and voltage monitoring
✧ At <15% battery: eyes go sleepy, LEDs pulse yellow, J.I.N.X says "I'm running on spite"
✧ WS2812B LED strip with 11 animated modes — reacts to mood, threats, music, battery
✧ DFPlayer Mini + 3W speaker for sound effects. Neural TTS plays through phone speaker
✧ Body made from ThinkPad T61 chassis, keyboard keys, RAM sticks, HDD platters
✧ Web control panel at http://LAPTOP_IP:5000 — live feed, modes, LEDs, movement, uploads
✧ Streamlit cyberpunk dashboard at port 8501 — camera, skeleton, network map, alerts
✧ Any device on the same WiFi can control J.I.N.X from a browser — no app needed
MODE 1: BUDDY (Default)
├── Friendly personality, responds to voice commands
├── Answers questions, plays music, controls lights
├── Eyes follow people, head tracks faces
├── Skeleton overlay shows your movements in real-time
└── Returns low-battery warning when needed
MODE 2: SENTINEL (Surveillance)
├── Active scanning — color-coded face + object detection
│ ├── 🟢 GREEN = Known + Safe
│ ├── 🔵 BLUE = Unknown (not in database)
│ └── 🔴 RED = Known + Flagged as Threat
├── Audio anomaly detection (glass break, screams, gunshots)
├── Network device monitoring — flags unknown WiFi devices
├── All events logged with timestamps + screenshots
└── LED strips react to threat level in real-time
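Sentinel's "all events logged" step (the BLACKBOX module) can be sketched with stdlib SQLite. The table schema and column names below are assumptions for illustration, not the real blackbox.py:

```python
import sqlite3
import time

def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the event log database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(ts REAL, source TEXT, level TEXT, detail TEXT)"
    )
    return db

def log_event(db: sqlite3.Connection, source: str, level: str, detail: str) -> None:
    """Record one timestamped event (e.g. a RED face detection)."""
    db.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (time.time(), source, level, detail),
    )
    db.commit()
```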
MODE 3: ROAST
├── Scans person's face → identifies from database
├── Generates personalized comedic roast via Gemini
├── Delivers roast through speaker in human voice
├── Eyes show smug expression, LEDs flash orange party mode
└── Adjustable intensity: light / medium / savage
MODE 4: AGENT
├── Document Q&A — ask questions about uploaded PDFs/books
├── Code review — watches your project folder, reviews on save
├── Image search — "what does a golden retriever look like?"
├── Research assistant — searches web for answers
└── Read document aloud — summarizes books by voice command
MODE 5: SLEEP
├── Eyes close, LEDs dim
├── Wake word still active
└── "I was in the middle of something."
┌──────────────────────────────────┐
│ LAPTOP (Main Server) │
│ │
│ 🧠 ML Models: │
│ ├── YOLOv5-nano (object detect) │
│ ├── MediaPipe (Face/Pose/Hands) │
│ ├── face_recognition (dlib) │
│ ├── Audio CNN (UrbanSound8K) │
│ ├── Vosk STT (offline) │
│ ├── edge-TTS (neural voice) │
│ ├── Gemini 2.0 Flash (LLM) │
│ ├── Network Anomaly (RF) │
│ └── Sensor Fusion (Hivemind) │
│ │
│ 🌐 Services: │
│ ├── Flask Web Control (:5000) │
│ ├── Streamlit Dashboard (:8501) │
│ ├── MQTT Broker (Mosquitto) │
│ └── SQLite Database │
└───────────────┬──────────────────┘
│ WiFi (Private Local Network)
┌─────────────────────┼──────────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌──────▼──────┐
│ J.I.N.X │ │ TABLET │ │ PHONE │
│ ROBOT │ │ NEXUS DASH │ │ (Control) │
│ │ │ │ │ │
│ ┌─────────┐ │ │ ─ Camera │ │ :5000 │
│ │ ESP32 │ │ │ ─ Skeleton │ └─────────────┘
│ │─Motors │ │ │ ─ Network │
│ │─Servos │ │ │ ─ Alerts │
│ │─TFT Eyes│ │ │ ─ Audio │
│ │─LEDs │ │ └─────────────┘
│ │─Sensors │ │
│ │─Speaker │ │
│ └─────────┘ │
│ ┌─────────┐ │
│ │Redmi 12 │ │ ← Neural TTS voice plays here
│ │─Camera │ │
│ │─Mic/Spk │ │
│ └─────────┘ │
└─────────────┘
| # | Component | Spec | Purpose |
|---|---|---|---|
| 1 | ESP32-WROOM-32 DevKit | WiFi+BT, 30-pin | Robot brain |
| 2 | 2.4" TFT ILI9341 | 240×320, SPI | Animated eyes |
| 3 | SG90 Servo ×2 | 180°, 1.8kg-cm | Head pan + tilt |
| 4 | L298N Motor Driver | Dual H-Bridge | DC motor control |
| 5 | HC-SR04 Ultrasonic ×2 | 2–400cm | Obstacle detection |
| 6 | VL53L0X ToF Sensor ×2 | 2m, ±1mm, I2C | Table edge + precise depth |
| 7 | WS2812B LED Strip | 30 LEDs, 5V | Mood reactive lighting |
| 8 | DFPlayer Mini + 3W Speaker | UART, MP3 | Sound effects |
| 9 | 18650 Battery ×4 | 3.7V 2600mAh | 2S2P = 7.4V ~5000mAh |
| 10 | 2S BMS Board | 7.4V 10A | Battery protection |
| 11 | Servo Pan/Tilt Platform | 2-axis (pan + tilt) bracket | Clean mount for the head servos |
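For the low-battery behaviour described in the features (sleepy eyes below 15%), pack voltage from the 2S BMS can be mapped to a rough percentage. A linear sketch — the 6.0 V empty / 8.4 V full endpoints are assumptions, and real Li-ion discharge curves are non-linear:

```python
def battery_percent(pack_voltage: float) -> float:
    """Linearly map 2S Li-ion pack voltage to 0-100 %.

    Assumes 6.0 V = empty (3.0 V/cell) and 8.4 V = full (4.2 V/cell).
    """
    pct = (pack_voltage - 6.0) / (8.4 - 6.0) * 100.0
    return max(0.0, min(100.0, pct))

def is_running_on_spite(pack_voltage: float) -> bool:
    """True below the 15 % threshold that triggers sleepy eyes + yellow LEDs."""
    return battery_percent(pack_voltage) < 15.0
```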
| Component | Source | Purpose |
|---|---|---|
| Metal parts, keyboard keys, RAM, HDD platters | Lenovo ThinkPad T61 | Body structure + decoration |
| Camera, mic, speaker, WiFi | Xiaomi Redmi Note 12 | Primary sensor array + TTS speaker |
| Tablet | UP Govt issued | Dashboard display |
Recycled components saved an estimated ₹15,000+ in equivalent hardware costs.
| # | Model | Task | Type | Dataset |
|---|---|---|---|---|
| 1 | YOLOv5-nano | Object Detection | Pre-trained | COCO (80 classes) |
| 2 | MediaPipe Face Mesh | 468-point Landmarks | Pre-trained | — |
| 3 | MediaPipe Pose | 33-point Skeleton | Pre-trained | — |
| 4 | MediaPipe Hands + Classifier | Gesture Recognition | Pre-trained + Custom | Google + Custom |
| 5 | dlib ResNet / face_recognition | Face Recognition (128-d) | Pre-trained + Custom | LFW + Your faces |
| 6 | Custom CNN (2D Conv) | Audio Classification | Trained from scratch | UrbanSound8K |
| 7 | Vosk / Google STT | Speech-to-Text | Pre-trained | — |
| 8 | Gemini 2.0 Flash | NLU + Conversation + Agent | API | — |
| 9 | Random Forest | Network Anomaly Detection | Trained | NSL-KDD |
| 10 | Weighted Ensemble | Sensor Fusion | Custom | — |
Input: 128×128 Mel Spectrogram
├── Conv2D(32, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
├── Conv2D(64, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
├── Conv2D(128, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
├── Conv2D(64, 3×3) → GlobalAveragePooling
├── Dense(256) → ReLU → Dropout(0.4)
├── Dense(128) → ReLU → Dropout(0.3)
└── Dense(10) → Softmax
Classes: air_conditioner, car_horn, children_playing, dog_bark,
drilling, engine_idling, gun_shot, jackhammer, siren, street_music
Dataset: UrbanSound8K (8,732 samples)
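The layer stack above can be written down in Keras as follows. Single-channel input and `same` padding are assumptions, since the summary doesn't specify either:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_audio_cnn(num_classes: int = 10) -> tf.keras.Model:
    """CNN over 128x128 mel spectrograms, mirroring the block diagram above."""
    return models.Sequential([
        layers.Input(shape=(128, 128, 1)),  # 1 channel assumed
        layers.Conv2D(32, 3, padding="same"),
        layers.BatchNormalization(), layers.ReLU(), layers.MaxPool2D(2),
        layers.Conv2D(64, 3, padding="same"),
        layers.BatchNormalization(), layers.ReLU(), layers.MaxPool2D(2),
        layers.Conv2D(128, 3, padding="same"),
        layers.BatchNormalization(), layers.ReLU(), layers.MaxPool2D(2),
        layers.Conv2D(64, 3, padding="same"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"), layers.Dropout(0.4),
        layers.Dense(128, activation="relu"), layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
```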
Visual Score (0–1) × 0.35 ← face_recognition threat level
Audio Score (0–1) × 0.30 ← CNN threat-class confidence
Network Score (0–1) × 0.20 ← anomaly model prediction
Proximity (0–1) × 0.15 ← ultrasonic + ToF distances
↓
doom_level (0–1)
> 0.70 → ALERT → LED + eyes + buzzer + log
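The weighted fusion above reduces to a few lines of Python. Function names here are illustrative; the real scoring lives in hivemind.py:

```python
# Weights and threshold taken from the fusion description above.
WEIGHTS = {"visual": 0.35, "audio": 0.30, "network": 0.20, "proximity": 0.15}
ALERT_THRESHOLD = 0.70

def doom_level(scores: dict) -> float:
    """Combine per-sensor scores (each clamped to 0-1) into one 0-1 value."""
    return sum(
        w * max(0.0, min(1.0, scores.get(name, 0.0)))
        for name, w in WEIGHTS.items()
    )

def should_alert(scores: dict) -> bool:
    """True when the fused score crosses the ALERT line."""
    return doom_level(scores) > ALERT_THRESHOLD
```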
| File | Codename | Purpose |
|---|---|---|
| genesis.py | GENESIS | Main startup — launches everything |
| dna.py | DNA | All configuration and settings |
| blackbox.py | BLACKBOX | SQLite event logging |
| psyche.py | PSYCHE | Personality, jokes, roast prompts |
| optic.py | OPTIC | Vision — camera, faces, pose, mesh, gestures |
| vocoder.py | VOCODER | Voice — STT, neural TTS, Gemini, commands, music |
| echo_hunter.py | ECHO HUNTER | Audio — CNN sound classification |
| ice_wall.py | ICE WALL | Network — device scan, anomaly detection |
| synapse.py | SYNAPSE | MQTT — all inter-module messaging |
| hivemind.py | HIVEMIND | Sensor fusion — doom level scoring |
| agent.py | AGENT | AI Agent — document Q&A + code review |
| nexus.py | NEXUS | Streamlit cyberpunk dashboard |
| web_control/app.py | NEXUS-WEB | Flask phone control panel |
Personal laptop (Linux recommended, Windows/Mac also work)
Python 3.10+ · Arduino IDE 2.x · Assembled J.I.N.X hardware
WiFi router · Redmi Note 12 with DroidCam · Mosquitto MQTT
cmake (for face_recognition/dlib) · mpv (for music playback)
git clone https://github.com/Sidvortex/J.I.N.X.git
cd J.I.N.X

# Arch / EndeavourOS
sudo pacman -S mosquitto cmake espeak-ng mpv yt-dlp portaudio python-pip
# Ubuntu / Debian
sudo apt install mosquitto cmake cmake-data espeak-ng mpv yt-dlp \
    portaudio19-dev python3-pip
pip install -r requirements.txt
sudo systemctl enable --now mosquitto

nano server/dna.py
# Set:
LAPTOP_IP = "your.laptop.ip"
PHONE_IP = "redmi.note.ip"
GEMINI_API_KEY = "get from aistudio.google.com"
FACE_LABELS = {"yourname": "safe"}

# Vosk offline STT (~40MB)
mkdir -p models/vosk-model
# Download: https://alphacephei.com/vosk/models → vosk-model-small-en-us-0.15
# Extract into models/vosk-model/
# YOLOv5 (auto-downloads on first run)
python -c "from ultralytics import YOLO; YOLO('yolov5n.pt')"

python scripts/register_face.py --name yourname --file photo.jpg --label safe
# or live:
python scripts/register_face.py --name yourname --live --label safe

1. Open Arduino IDE 2.x → arduino/jinx_esp32/jinx_esp32.ino
2. Install via Library Manager:
TFT_eSPI · Adafruit NeoPixel · PubSubClient
ArduinoJson · ESP32Servo · DFRobotDFPlayerMini · VL53L0X
3. Edit config.h → set WIFI_SSID, WIFI_PASS, MQTT_BROKER
4. Board: ESP32 Dev Module → Upload
1. Install DroidCam on Redmi Note 12
2. Connect to same WiFi → set static IP in router
3. Update PHONE_IP in server/dna.py
4. Open DroidCam → Start Server
5. Test: http://PHONE_IP:4747/video
python server/genesis.py # Normal startup
python server/genesis.py --sentinel # Start in Sentinel mode
python server/genesis.py --agent-mode # Document/code focus
python server/genesis.py --no-audio --no-network # Faster startup

✧ "Hey JINX, wake up" → System activation
✧ "Hey JINX, guard mode" → Sentinel surveillance
✧ "Hey JINX, roast [name]" → AI-generated personalized roast
✧ "Hey JINX, what is [topic]" → Gemini answers + shows image
✧ "Hey JINX, play [song/genre]" → Music search and playback
✧ "Hey JINX, lights [color]" → LED color change
✧ "Hey JINX, review my code" → Code review of watched folder
✧ "Hey JINX, read [document name]" → Summarizes uploaded document
✧ "Hey JINX, status" → System health report
✧ "Hey JINX, goodnight" → Sleep mode
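The wake-word + command pattern above can be sketched as a small regex dispatcher. The handler return values here are placeholders; the real handlers live in vocoder.py:

```python
import re

WAKE = "hey jinx"

# (pattern, handler) pairs — handlers are stubs for illustration only.
COMMANDS = [
    (r"wake up",      lambda m: "activate"),
    (r"guard mode",   lambda m: "sentinel"),
    (r"roast (\w+)",  lambda m: f"roast:{m.group(1)}"),
    (r"play (.+)",    lambda m: f"music:{m.group(1)}"),
    (r"lights (\w+)", lambda m: f"led:{m.group(1)}"),
    (r"goodnight",    lambda m: "sleep"),
]

def dispatch(utterance: str):
    """Return an action string for a recognized command, None if the wake
    word is missing, or 'unknown' for unrecognized commands."""
    text = utterance.lower().strip()
    if not text.startswith(WAKE):
        return None  # ignore speech not addressed to J.I.N.X
    text = text[len(WAKE):].lstrip(", ")
    for pattern, handler in COMMANDS:
        m = re.match(pattern, text)
        if m:
            return handler(m)
    return "unknown"
```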
| Model | Metric | Score |
|---|---|---|
| Face Recognition | Accuracy | ~95%+ |
| Face Recognition | False Acceptance Rate | <2% |
| Audio CNN | F1-Score | ~85%+ |
| Audio CNN | Accuracy | ~88%+ |
| Network Anomaly | ROC-AUC | ~92%+ |
| Voice Recognition | Word Error Rate | ~10–15% |
| Sensor Fusion | Detection Accuracy | ~90%+ |
| Table Edge Detection | Accuracy | ~99% (VL53L0X) |
✧ SLAM-based room mapping and path planning
✧ Raspberry Pi 4 integration — remove laptop dependency
✧ Robotic arm for object manipulation
✧ Emotion detection from facial expressions
✧ Multi-language voice (Hindi + English)
✧ Smart home ecosystem integration (Google Home, Alexa)
✧ Mobile app (React Native) for remote control
✧ Cloud dashboard for remote monitoring outside local network
✧ Multi-robot swarm communication
✧ Hexapod leg conversion (servo-based spider legs)
✧ Google MediaPipe team (vision models)
✧ Ultralytics (YOLOv5)
✧ Adam Geitgey (face_recognition library)
✧ Microsoft (edge-TTS Neural voices)
✧ Google Gemini AI
✧ Vosk / Alpha Cephei (offline STT)
✧ The dead ThinkPad T61 that gave its body for science
✧ The open-source community
📸 This is an unfinished project right now. As soon as we get funds, we'll design its proper body; we won't let it roam around the house nude like a 3-year-old!!
🎥 We hope you like the video on our channel and support us. Stay tuned; we'll keep delivering crazy projects like this with full craze on!!
B.Tech Data Science (2023–2027)
Built with dark magic acquired from the pitch black caves of West Bengal
@sidvortex
J.I.N.X doesn't just think. It judges.