| Tracking | Sweeping |
|---|---|
| ![]() | ![]() |
A runnable project adding perception capability to XLeRobot:
- Real-time pan/tilt camera head tracking (moving objects in FOV)
- Scene description and simple conversational interaction
- LeRobot compatible robot wrapper for easy migration to real hardware
```bash
cd /path/to/this-repo
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

```bash
python -m xlerobot_personality.main --dry-run --visualize
```

Type questions in the terminal while it runs (for example: `what do you see?`).
Type `quit` to stop.
Find your hardware:

```bash
lerobot-find-port
lerobot-find-cameras opencv
```

Then edit `real_head.example.yaml`:
- set `hardware.serial_port`
- set the real `pan_id` and `tilt_id`
- set `camera.device_index`
- set `hardware.robot_id`
- set `hardware.calibration_dir` to the folder containing `<robot_id>.json`
- config path fields may use `~`, `${HOME}`, or relative paths resolved from the config file location
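To illustrate, an edited `real_head.example.yaml` might look roughly like this; the exact key nesting (especially where `pan_id`/`tilt_id` live) is an assumption, so treat the shipped example file as authoritative:

```yaml
# Hypothetical fragment; the shipped real_head.example.yaml is authoritative.
hardware:
  serial_port: /dev/ttyACM0                # from lerobot-find-port
  pan_id: 1                                # real servo IDs (nesting is a guess)
  tilt_id: 2
  robot_id: my_head
  calibration_dir: ~/.xlerobot/calibration # must contain my_head.json
camera:
  device_index: 0                          # from lerobot-find-cameras opencv
```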
Important:
- this app reads and writes motor positions in normalized units, so the calibration file must exist before running
Generate the first calibration file with the interactive head utility:

```bash
cd /path/to/this-repo
source .venv/bin/activate
PYTHONPATH=src python -m xlerobot_personality.head_calibrate \
  --config configs/real_head.example.yaml
```

The utility will:
- connect to the Feetech bus on `hardware.serial_port`
- ask you to place the head in a neutral forward pose
- record the raw encoder center offsets
- ask you to sweep pan and tilt through their full safe ranges
- save `<hardware.robot_id>.json` in `hardware.calibration_dir`
There is also a helper launcher:

```bash
./mybash/calibrate_head.sh
```

Then run against the real head:

```bash
cd /path/to/this-repo
source .venv/bin/activate
PYTHONPATH=src python -m xlerobot_personality.main \
  --config configs/real_head.example.yaml \
  --web-preview \
  --no-visualize
```

The sample real-head config enables `scene.use_ollama: true` with `qwen2.5vl:3b`, so once `ollama serve` is up you should see a short scene paragraph at 1 Hz in both the terminal and the browser preview.
Or use the helper script:

```bash
./mybash/run_real_browser.sh
```

Current support assumes:
- pan/tilt servos are Feetech servos supported by LeRobot
- the camera is available through OpenCV (`camera.source: opencv`)
- you will either point to an existing head calibration JSON or generate one with the head calibration utility
- real camera auto-tracking uses `yolo_person` when `ultralytics` is installed, otherwise it falls back to `hog_person`
Then set either:
- `tracking.backend: auto` to prefer YOLO when available
- or `tracking.backend: yolo_person` to require YOLO explicitly

The default sample config uses:
- `tracking.backend: auto`
- `tracking.yolo_model: yolov8n.pt`

If `yolov8n.pt` is not already on disk, Ultralytics will try to download it on first run.
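The `auto` fallback described above can be sketched as a small helper; this is a hypothetical illustration of the selection logic, not the app's actual code:

```python
import importlib.util


def pick_backend(requested: str = "auto") -> str:
    """Illustrative sketch of the 'auto' backend choice: prefer
    yolo_person when ultralytics is importable, else hog_person."""
    if requested != "auto":
        # Explicit backends (motion, hog_person, yolo_person, ...) pass through.
        return requested
    if importlib.util.find_spec("ultralytics") is not None:
        return "yolo_person"
    return "hog_person"
```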
The scene loop can generate a short camera description every second with a local Ollama vision model.
Start Ollama and pull a vision checkpoint (plus an optional lightweight chat model for robot dialogue):
```bash
ollama pull qwen2.5vl:3b   # VLM
ollama pull qwen2.5:1.5b   # LLM
ollama serve
```

Make sure it is enabled in config:
```yaml
scene:
  summary_hz: 1.0
  use_ollama: true
  vlm_model: qwen2.5vl:3b
  use_brain: true
  brain_model: qwen2.5:1.5b
  ollama_url: http://127.0.0.1:11434
```

With that enabled, the `[scene]` log line and browser summary panel will show a fresh short paragraph roughly once per second without blocking the tracking loop.
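For reference, a caption request against Ollama's standard `/api/generate` endpoint looks like the sketch below. This is a minimal stand-alone illustration (the app's own client code may differ); the prompt text is an assumption:

```python
import base64
import json
import urllib.request


def build_caption_request(frame_jpeg: bytes,
                          model: str = "qwen2.5vl:3b",
                          url: str = "http://127.0.0.1:11434") -> urllib.request.Request:
    """Build (but do not send) an Ollama vision-caption request.

    Ollama's /api/generate accepts base64-encoded images alongside the prompt.
    """
    payload = {
        "model": model,
        "prompt": "Describe this camera frame in one short paragraph.",
        "images": [base64.b64encode(frame_jpeg).decode("ascii")],
        "stream": False,
        "keep_alive": "5m",  # keep the model warm between 1 Hz calls
    }
    return urllib.request.Request(
        f"{url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(...)` returns a JSON body whose `response` field holds the caption.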
When `use_brain: true`, terminal chat (`ask>`) is routed through the lightweight Ollama chat model and grounded with:
- cached scene description context from the scene loop (VLM/rule-based; no extra VLM call per user query)
- current person/target tracking status from the detector/tracker
- recent chat history (configurable, default 10 turns)
- user query
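The grounding sources listed above could be assembled into a prompt roughly like this; a hypothetical sketch of the assembly order, not the app's actual prompt template:

```python
from collections import deque


def build_brain_prompt(scene_summary: str, tracker_status: str,
                       history: list, query: str, max_turns: int = 10) -> str:
    """Fuse cached scene context, tracker status, recent chat turns,
    and the user query into one grounded prompt (illustrative only)."""
    # Keep only the most recent turns, mirroring scene.chat_history_max_turns.
    recent = list(deque(history, maxlen=max_turns))
    lines = [f"Scene: {scene_summary}", f"Tracking: {tracker_status}"]
    lines += [f"{speaker}: {text}" for speaker, text in recent]
    lines.append(f"User: {query}")
    return "\n".join(lines)
```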
If `runtime.speech.enabled: true`, each terminal answer is also spoken with a local Piper voice model and played through a local audio command such as `aplay`.
Download a Piper voice:

```bash
cd /path/to/this-repo
source .venv/bin/activate
python3 -m piper.download_voices en_US-lessac-medium
```

For headless servers or SSH sessions, use the browser preview:
```bash
# on the server
cd /path/to/this-repo
source .venv/bin/activate
PYTHONPATH=src python -m xlerobot_personality.main \
  --config configs/local_demo.yaml \
  --dry-run \
  --web-preview \
  --no-visualize
```

Forward the port from your local machine:

```bash
ssh -L 8765:127.0.0.1:8765 your_user@your_server
```

Then open:

`http://localhost:8765`
The page shows the annotated frame, state, pan/tilt values, scene summary, and a stop button.
By default, the app starts in auto head-control mode. Use the Resume Tracking button in the browser to toggle auto-tracking on and off.
`xlerobot_personality.xlerobot_head.XLERobotHead`
- LeRobot-style Robot API (`connect`, `get_observation`, `send_action`, `disconnect`)
- Pan/tilt action keys: `pan.pos`, `tilt.pos`
- Camera observation key: `head_cam`
- Works in `mock` mode (no hardware) and in `feetech+camera` mode (starter wiring)
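To show how the four-call API surface above fits together, here is a self-contained mock with the same shape; the clamp range and constructor are assumptions for illustration, not the real `XLERobotHead` implementation:

```python
class MockHead:
    """Toy stand-in with the LeRobot-style API surface described above."""

    def __init__(self):
        self.connected = False
        self.pose = {"pan.pos": 0.0, "tilt.pos": 0.0}

    def connect(self):
        self.connected = True

    def get_observation(self):
        # The real robot also returns a camera frame under "head_cam".
        return {**self.pose, "head_cam": None}

    def send_action(self, action):
        # Clamp to a safe range, echoing the controller's safety clamps.
        for key in ("pan.pos", "tilt.pos"):
            if key in action:
                self.pose[key] = max(-1.0, min(1.0, action[key]))
        return dict(self.pose)

    def disconnect(self):
        self.connected = False
```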
`xlerobot_personality.tracking_controller`
- Detector-backed target tracking from camera frames
- PID-based pan/tilt control with safety clamps and step limiting
- Personality FSM: `IDLE_SCAN`, `TRACKING`, `REACQUIRE`, `INTERACTING`
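A minimal sketch of how the four FSM states above could transition; the event names and exact transition rules are assumptions, not the controller's actual implementation:

```python
# Hypothetical transition table for the personality FSM.
TRANSITIONS = {
    ("IDLE_SCAN", "person_detected"): "TRACKING",
    ("TRACKING", "target_lost"): "REACQUIRE",
    ("REACQUIRE", "person_detected"): "TRACKING",
    ("REACQUIRE", "timeout"): "IDLE_SCAN",
    ("TRACKING", "user_spoke"): "INTERACTING",
    ("INTERACTING", "interaction_done"): "TRACKING",
}


def step(state: str, event: str) -> str:
    """Advance the FSM; unknown (state, event) pairs keep the current state."""
    return TRANSITIONS.get((state, event), state)
```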
`xlerobot_personality.scene_agent`
- Keeps short scene memory
- Produces periodic scene summaries
- Can describe camera frames at 1 Hz with a local Ollama vision model such as `qwen2.5vl:3b`
- Can run an Ollama chat "brain" that fuses the VLM scene summary, tracker context, and user query
- Replies in a cute, concise, witty style when the Ollama brain is enabled
- Optional OpenAI VLM hook if the `openai` dependency and API key are configured
`xlerobot_personality.orchestrator`
- Async loops for tracking, scene summarization, and dialog
Main fields:
- `hardware.use_mock`: if `true`, servo writes are simulated
- `camera.source`: `synthetic`, `opencv`, or `lerobot`
- `hardware.servo.manual_speed_deg_s`: browser/manual teleop speed
- `hardware.servo.command_smoothing_s`: low-pass smoothing on the commanded head position
- `hardware.servo.transition_smoothing_s`: extra smoothing duration used when switching between scan, track, and manual states
- `tracking.backend`: `auto`, `motion`, `hog_person`, `yolo_person`, or `yolo_pose_person`
- `tracking.frame_target_y_ratio`: where in the image you want the tracked target to sit; higher keeps the head aimed lower
- `tracking.tilt_tracking_sign`: auto-tracking tilt direction; for the current browser/manual convention, `-1.0` means "target higher in image -> tilt camera up"
- `tracking.detector_interval`: run the detector every N control frames
- `tracking.detector_confidence`: minimum detector confidence for accepting a person
- `tracking.acquire_confirm_frames`: require this many matching detections before starting a new track
- `tracking.target_selection_mode`: `sticky` to stay on the current person when possible, or `most_centered` to pick whoever is closest to the image center
- `tracking.tilt_error_gain`: scales how aggressively vertical tracking moves the head
- `tracking.person_target_y_ratio`: where inside a person box to aim vertically; higher looks lower on the person
- `tracking.person_min_full_body_aspect_ratio`: if a person box looks too short, infer the missing lower body before computing the vertical aim point
- `tracking.person_top_frame_ratio`: desired top-of-person position in the image for headroom control
- `tracking.person_top_framing_gain`: extra vertical correction from the top of the person box; helps the head tilt up when someone stands
- `tracking.person_closeup_height_ratio`: activates stronger headroom control once the person occupies this fraction of image height
- `tracking.person_closeup_top_gain`: extra multiplier for headroom control in close-up framing
- `tracking.yolo_model`: YOLO checkpoint name or path, for example `yolov8n.pt`
- `tracking.yolo_pose_model`: YOLO pose checkpoint name or path, for example `yolov8n-pose.pt`
- `tracking.yolo_device`: inference device, for example `cpu` or `cuda:0`
- `tracking`: PID, loop rate, scan behavior, FOV
- `tracking.start_tracking_enabled`: start in auto-tracking (`true`) or manual mode (`false`)
- `scene.summary_hz`: caption refresh rate; `1.0` means 1 Hz
- `scene.use_ollama`: enable local Ollama vision captions
- `scene.vlm_model`: vision-capable Ollama model, for example `qwen2.5vl:3b`
- `scene.ollama_url`: Ollama server URL, usually `http://127.0.0.1:11434`
- `scene.ollama_timeout_s`: per-request timeout for caption generation
- `scene.ollama_keep_alive`: keeps the local model warm between 1 Hz requests
- `scene.max_image_dim_px`: downsizes frames before sending them to the VLM to keep latency bounded
- `scene.use_brain`: enable the lightweight Ollama chat brain for interactive Q/A
- `scene.brain_model`: Ollama chat model for dialogue, for example `qwen2.5:1.5b`
- `scene.brain_temperature`: sampling temperature for dialogue style/creativity
- `scene.brain_max_tokens`: max generated tokens for each dialogue response
- `scene.include_chat_history`: include recent chat turns in the brain prompt
- `scene.chat_history_max_turns`: number of prior user/robot turns to include (default `10`)
- `scene.vlm_only_on_significant_change`: call the VLM only when the scene fingerprint changes enough
- `scene.vlm_change_threshold`: normalized frame-difference threshold (0-1)
- `scene.vlm_change_sample_dim_px`: thumbnail size used to measure frame change
- `scene.vlm_target_center_change_ratio`: target-movement threshold before a new VLM call
- `scene.vlm_target_area_change_ratio`: target-scale-change threshold before a new VLM call
- `scene.vlm_force_refresh_s`: force a refresh after this many seconds even if the scene is stable
- `scene`: memory window, interaction hold, optional OpenAI settings
- `runtime.web_preview`: enable browser preview
- `runtime.web_host` / `runtime.web_port`: bind address for the browser preview
- `runtime.speech.enabled`: speak each terminal answer with Piper
- `runtime.speech.backend`: `piper`
- `runtime.speech.model_path`: path to the Piper `.onnx` voice file
- `runtime.speech.config_path`: optional path to the Piper voice JSON; defaults to `<model_path>.json`
- `runtime.speech.audio_player`: `auto`, `aplay`, `paplay`, `ffplay`, or `afplay`
- `runtime.speech.speaker_id`: optional multi-speaker voice ID
- `runtime.speech.length_scale`: speech speed/length control
- `runtime.speech.noise_scale` / `runtime.speech.noise_w_scale`: optional Piper voice variation controls
- `runtime.speech.volume`: output volume passed to Piper synthesis
- `runtime.speech.use_cuda`: use CUDA for Piper inference if supported
- `runtime.speech.lead_in_ms`: prepend this many milliseconds of silence before playback to avoid clipped sentence starts
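The change-gating idea behind `scene.vlm_only_on_significant_change` and `scene.vlm_change_threshold` can be sketched as a simple normalized frame difference; this is an illustrative stand-in, not the app's actual fingerprint logic:

```python
def frame_changed(prev, curr, threshold=0.1):
    """Return True when the scene has changed enough to warrant a new VLM call.

    prev/curr: equally sized grayscale thumbnails as flat lists of 0-255 ints
    (e.g. downsampled to scene.vlm_change_sample_dim_px).
    """
    if prev is None:
        return True  # no baseline yet: always caption the first frame
    # Mean absolute pixel difference, normalized to 0-1.
    diff = sum(abs(a - b) for a, b in zip(prev, curr)) / (255.0 * len(curr))
    return diff >= threshold
```

A `scene.vlm_force_refresh_s` timer would sit alongside this check so captions still refresh even when the scene stays static.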


