A (definitely not PlayStation inspired) gesture-controlled console environment where you can play classic games!
A small Python & AI project that demonstrates gesture recognition using Google's MediaPipe Hands and OpenCV for real-time webcam capture!
*This is a student project built to further learn data manipulation with NumPy & Pandas, to practice and understand how artificial intelligence actually works with neural networks (MLPs & CNNs), and to explore how transfer learning lets us adapt models to better suit our final classes and needs.
The original model was trained in a Google Colab notebook (Trained Colab Model Attempt); however, this build uses Google's pre-built MediaPipe model, while also allowing you, the user, to record and save your very own dataset!*
Like the name suggests, this is meant to be a console that will, hopefully, host a plethora of games down the line. For now, you may try two playable demos: a gesture-controlled Snake game and a gesture-controlled Rock–Paper–Scissors game!
The optional small transfer‑learning pipeline (collect → train → use) mentioned above allows you to improve gesture accuracy by training a classifier on MediaPipe landmarks using data you can collect on your own!
- Real-time hand landmark detection using MediaPipe Hands
- Gesture-controlled Snake (index or thumb gestures) with smoothing and optional classifier support for better playing comfort
- Gesture-controlled Rock–Paper–Scissors with stable-hold confirmation, a confidence score, and result screens
- Optional data collection and training pipeline (collect_data.py → train_classifier.py)
- Easy to run in a Python virtual environment (Windows / macOS / Linux)
- Python 3.11 (recommended); use a separate venv per project
- Webcam for real-time detection
- Windows users: install Visual C++ Redistributable if builds fail
```powershell
# create venv (one-time)
py -3.11 -m venv .venv

# activate on Windows (PowerShell)
.\.venv\Scripts\Activate.ps1

# upgrade pip and install requirements
python -m pip install --upgrade pip
pip install -r requirements.txt
```

`requirements.txt` should include:

```
mediapipe==0.10.21
opencv-python
numpy
scikit-learn
pandas
joblib
```
If you have trouble installing `mediapipe` on your system, check your Python version (we recommend 3.11) and platform wheel availability.
Run the games:

```bash
python snake_gesture.py
python rock_paper_gesture.py
```

Collect your own gesture data:

```bash
python collect_data.py
# Keys: r=right, l=left, u=up, d=down, s=save, q=quit
```

Train the classifier:

```bash
python train_classifier.py
```
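For reference, the save step inside `collect_data.py` could look like the sketch below: one labeled row of flattened landmark values is appended per saved sample. The function name, CSV format, and file path here are illustrative assumptions; the project's actual storage format is not specified in this README.

```python
import csv

def save_sample(path, label, landmarks):
    """Append one labeled sample: the gesture label followed by 42
    flattened (x, y) landmark values, one CSV row per saved gesture."""
    row = [label] + [coord for point in landmarks for coord in point]
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)

# e.g. pressing 'r' in the capture loop could call:
# save_sample("gesture_data.csv", "right", hand_landmarks)
```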
This produces `models/gesture_model.joblib` and `models/label_encoder.joblib`. `snake_gesture.py` will automatically load `models/gesture_model.joblib` if present and use it; otherwise it falls back to the angle + thumb heuristics.
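The optional-model pattern described above can be sketched as follows (the helper name is hypothetical; the file path matches the README):

```python
from pathlib import Path
import joblib

def load_gesture_model(path="models/gesture_model.joblib"):
    """Return the trained classifier if the file exists, else None so
    the caller can fall back to the angle + thumb heuristics."""
    p = Path(path)
    return joblib.load(p) if p.exists() else None
```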
- The system uses MediaPipe Hands to extract 21 hand landmarks per detected hand (x,y normalized coordinates). These landmarks are used directly for heuristic rules (finger extended / folded) or flattened into a feature vector for training a small classifier (RandomForest by default).
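A minimal sketch of both uses of the landmarks described above: flattening 21 normalized (x, y) points into the 42-value feature vector, and a simple extended-finger heuristic. The tip-above-PIP rule shown here is illustrative and may differ from the project's exact rules.

```python
import numpy as np

# MediaPipe hand landmark indices for the index finger tip and PIP joint
INDEX_TIP, INDEX_PIP = 8, 6

def landmarks_to_features(landmarks):
    """Flatten 21 normalized (x, y) landmarks into a 42-value vector."""
    pts = np.asarray(landmarks, dtype=np.float32)
    assert pts.shape == (21, 2), "expected 21 (x, y) landmarks"
    return pts.flatten()

def index_finger_extended(landmarks):
    """Illustrative heuristic: image y grows downward, so a pointing-up
    index finger has its tip above (smaller y than) its PIP joint."""
    pts = np.asarray(landmarks, dtype=np.float32)
    return pts[INDEX_TIP, 1] < pts[INDEX_PIP, 1]
```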
- Gesture smoothing uses an exponential moving average (EMA) on angles plus short voting buffers to avoid flicker and accidental flips; this was mainly done to let players navigate more smoothly with their thumbs as well.
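The EMA-plus-voting idea can be sketched like this (class name, `alpha`, and window size are illustrative defaults, not the project's tuned values):

```python
from collections import Counter, deque

class GestureSmoother:
    """EMA on a continuous angle plus a short majority-vote buffer on
    discrete gesture labels, to suppress frame-to-frame flicker."""
    def __init__(self, alpha=0.3, window=5):
        self.alpha = alpha
        self.ema_angle = None
        self.votes = deque(maxlen=window)

    def smooth_angle(self, angle):
        # First sample initializes the EMA; afterwards blend new vs old.
        if self.ema_angle is None:
            self.ema_angle = angle
        else:
            self.ema_angle = self.alpha * angle + (1 - self.alpha) * self.ema_angle
        return self.ema_angle

    def vote(self, label):
        # Majority vote over the last `window` labels.
        self.votes.append(label)
        return Counter(self.votes).most_common(1)[0][0]
```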
- The RPS game implements a stable-hold confirmation (the player must hold the same pose for a short duration) plus a visible result and countdown flow, to prioritize the user experience.
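The stable-hold confirmation boils down to a small state machine like the one below (class name, hold duration, and the injectable clock are illustrative assumptions):

```python
import time

class StableHold:
    """Confirm a pose only after it has been held unchanged for
    `hold_s` seconds; any pose change restarts the timer."""
    def __init__(self, hold_s=1.0, clock=time.monotonic):
        self.hold_s = hold_s
        self.clock = clock
        self.current = None
        self.since = None

    def update(self, pose):
        now = self.clock()
        if pose != self.current:
            self.current, self.since = pose, now
            return None  # pose changed: restart the timer
        if now - self.since >= self.hold_s:
            return pose  # held long enough: confirmed
        return None
```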
- Pretrained model: Google's MediaPipe Hand Landmarker (pretrained hand landmark model). The project uses MediaPipe’s prebuilt models for landmark extraction.
- Custom classifier (optional): a RandomForest pipeline trained on flattened MediaPipe landmark X/Y coordinates (42 features), saved with `joblib`.
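A training sketch matching that description: 42 features in, a `LabelEncoder` plus `RandomForestClassifier`, both persisted with `joblib`. The data here is synthetic, the hyperparameters are illustrative, and the files are written to the current directory (the project saves them under `models/`).

```python
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for user-collected data: 42 landmark features/sample.
rng = np.random.default_rng(0)
X = rng.random((40, 42))
labels = ["down", "left", "right", "up"] * 10

encoder = LabelEncoder()
y = encoder.fit_transform(labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Persist both artifacts, matching the filenames the README mentions.
joblib.dump(clf, "gesture_model.joblib")
joblib.dump(encoder, "label_encoder.joblib")
```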
- Accuracy: original accuracy was measured on a Kaggle Rock Paper Scissors dataset:
- Accuracy on custom data will naturally vary per user.
- FPS / Latency: Depends on CPU and camera; MediaPipe Hands runs in real time on modern laptops (tens of FPS).
See the training models used in my Colab notebook:
Three training models were created:
- Model 1: 4,200,811 trainable parameters, 10 epochs at a learning rate of 0.001.
- Model 2: 2,417,547 trainable parameters, 15 epochs at a learning rate of 0.1.
- Model 3: 4,794,739 trainable parameters, 20 epochs at a learning rate of 0.001.
| Experiment | Train Batch Size | Test Batch Size | Parameters | Num Conv Layers | Padding Used | Learning Rate | Epochs | Final Train Acc | Final Val Acc |
|---|---|---|---|---|---|---|---|---|---|
| Model 1 | 64 | 16 | 4,200,811 | 2 | 0 | 0.001 | 10 | 98.40% | 98.26% |
| Model 2 | 128 | 32 | 2,417,547 | 3 | 1 | 0.1 | 15 | 34.28% | 34.28% |
| Model 3 | 64 | 16 | 4,794,739 | 3 | 1 | 0.01 | 20 | 98.97% | 99.32% |
- Python version compatibility: newer Python versions (3.12+) pose issues when using models, as the models and wheels are usually built against older versions of Python.
- Mediapipe / wheel compatibility: fixed by using Python 3.11 and the mediapipe 0.10.21 wheel on Windows.
- Gesture jitter & misclassification: solved with EMA smoothing, majority voting, and an optional transfer‑learning classifier trained on user‑collected landmarks.
- Thumb vs index pointing: resolved with a simple heuristic that compares normalized distances relative to palm size (thumb & index selection logic).
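The thumb-vs-index heuristic can be sketched as below: each tip's distance from the wrist is normalized by a palm-size reference, and the finger that sticks out more wins. The specific landmark pair used for palm size and the comparison rule are illustrative assumptions, not the project's exact logic.

```python
import numpy as np

WRIST, MIDDLE_MCP = 0, 9      # landmarks used as a palm-size reference
THUMB_TIP, INDEX_TIP = 4, 8   # candidate pointing fingertips

def pointing_finger(landmarks):
    """Pick thumb or index by comparing tip-to-wrist distances,
    normalized by palm size so the rule is scale-invariant."""
    pts = np.asarray(landmarks, dtype=np.float32)
    palm = np.linalg.norm(pts[MIDDLE_MCP] - pts[WRIST]) + 1e-6
    thumb = np.linalg.norm(pts[THUMB_TIP] - pts[WRIST]) / palm
    index = np.linalg.norm(pts[INDEX_TIP] - pts[WRIST]) / palm
    return "thumb" if thumb > index else "index"
```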
- Datasets used: both datasets were relatively small, which kept training quick but cost accuracy and produced high loss.
- This was especially true of the Rock Paper Scissors dataset, whose images show only a palm against a green screen, whereas a real webcam feed would rarely capture just a palm and a green screen.
- This made it simply better and more efficient to use the pre-trained MediaPipe model.
- Add better visuals, menus, sound effects, a high-score table, and other UI-related features.
- Add a small CNN or lightweight neural network trained on image crops (or landmark sequences) for even better robustness
- Perhaps export a web demo (WebRTC) using TensorFlow.js + MediaPipe in the browser.
- If the `mediapipe` import has a yellow squiggle in VS Code but the script runs, reselect the `.venv` interpreter and reload the window. Ensure the terminal is activated with `.\.venv\Scripts\Activate.ps1` on Windows.
- If the OpenCV camera doesn't open, try different camera indices (`cv2.VideoCapture(1)`), close other apps using the camera (Zoom/Discord), and check Windows camera privacy settings.
- Google's MediaPipe Hand Landmarker / Hands (used for landmark extraction and real-time tracking).
- Google Colab notebook I used to try different models: Colab Notebook
- Rock Paper Scissors Dataset — 2,188 images
- Finger Direction Detection Dataset — 132 images