This repository holds an example for Sireum Mr. Roboto (Roboto for short).
Roboto takes a demo automation script (e.g., `example.md`) and runs it using
`java.awt.Robot` to automate GUI interactions such as typing, keyboard
shortcuts, mouse clicks, image-based template matching, on-screen overlay
notifications, and text-to-speech.

A script can be written in either of two forms:

- a Slash (Slang universal shell) script that builds objects defined by the Script Slang types and outputs JSON consumed by the runner; or
- Markdown with YAML frontmatter specifying script attributes and OS-specific variables, where each heading (`#`) defines an action and its bullet points (`*`) specify commands (e.g., `example.md`).

To run a script:

    sireum roboto run <option>* <path> <arg>*

where `<path>` is either a `.md` (Markdown) or `.cmd` (Slang script) file.

By default, Roboto uses MaryTTS for speech synthesis (local, no API key needed):

    sireum roboto run example.md

To use Azure for speech synthesis, define the `AZURE_KEY` environment variable using one of your Azure account
text-to-speech service keys, then run:

    sireum roboto run --service azure example.md

To use AWS for speech synthesis, install the AWS CLI and configure it (`aws configure`), then run:

    sireum roboto run --service aws example.md

To record the script execution as an MP4 video with synchronized audio (system audio and TTS speech), pass `--record` with an output path. Recording requires FFmpeg.

    sireum roboto run --record output.mp4 example.md

To take a screenshot:

    sireum roboto capture <output.png>

This captures the full screen after a 5-second countdown.
The captured image can be cropped (e.g., in Preview on macOS) to create
template images for the `clickImage` and `waitForImage` commands.

A Markdown script begins with YAML frontmatter:

    ---
    name: "Script Name"
    defaultCharDelayMs: 50
    defaultActionDelayMs: 2000
    vars:
    - key: value
    varsMac:
    - key: macValue
    varsWin:
    - key: winValue
    varsLinux:
    - key: linuxValue
    varsLinuxArm:
    - key: linuxArmValue
    subst:
    - hamr: Hammer
    - sysml: Sis M L
    ---

| Key | Description | Default |
|---|---|---|
| `name` | Script name displayed at startup | `"Roboto"` |
| `defaultCharDelayMs` | Default delay (ms) between characters for `typeText` | `50` |
| `defaultActionDelayMs` | Default delay (ms) between actions | `2000` |
| `vars` | Base variables (all platforms) | |
| `varsMac` | macOS variable overrides | |
| `varsWin` | Windows variable overrides | |
| `varsLinux` | Linux (x86_64) variable overrides | |
| `varsLinuxArm` | Linux (AArch64) variable overrides | |
| `subst` | TTS pronunciation substitutions | |

Variables defined in the frontmatter are substituted in command text using
`$name$` syntax (dollar-sign delimited).
OS-specific variable sections (`varsMac`, `varsWin`, etc.) override the base `vars`
for the current platform.

Environment variables in variable values are expanded using `$ENV_VAR` syntax
(single `$` prefix). Use `$$` to escape a literal `$` (e.g., `$$SIREUM_HOME`
produces `$SIREUM_HOME` without expansion).
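
The substitution rules for `$name$`, `$ENV_VAR`, and `$$` can be illustrated with a small Python sketch. This is not Roboto's implementation: the function names (`merge_vars`, `expand_env`, `substitute`), the set of characters allowed in a name, and the choice to leave unknown names untouched are all assumptions made for illustration.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_]*"  # assumed shape of a variable name


def merge_vars(base, overrides):
    """OS-specific entries override base entries that share a key."""
    merged = dict(base)
    merged.update(overrides)
    return merged


def expand_env(value, env):
    """Expand $ENV_VAR inside a variable value; $$ yields a literal $."""
    def repl(match):
        if match.group(0) == "$$":
            return "$"
        # Assumption: unknown environment variables are left as written.
        return env.get(match.group(1), match.group(0))
    return re.sub(r"\$\$|\$(" + NAME + ")", repl, value)


def substitute(text, variables):
    """Replace $name$ occurrences in command text with variable values."""
    def repl(match):
        # Assumption: unknown names are left as written.
        return variables.get(match.group(1), match.group(0))
    return re.sub(r"\$(" + NAME + r")\$", repl, text)


# Hypothetical frontmatter values and environment, for illustration only.
base = {"editor": "code", "home": "$HOME/demo"}
mac = {"editor": "open -a TextEdit"}
env = {"HOME": "/Users/alice"}

variables = {k: expand_env(v, env) for k, v in merge_vars(base, mac).items()}
print(substitute("typeText: $editor$ $home$/example.md", variables))
# prints: typeText: open -a TextEdit /Users/alice/demo/example.md
```

Here the macOS section is merged over the base variables first, environment variables are expanded in the resulting values, and only then are the values substituted into command text.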

The `subst` section defines pronunciation overrides for text-to-speech.
When `$term$` syntax is used in `speak` command text, the term is replaced
with its `subst` value before being sent to the TTS engine. This is useful for
terms that TTS engines mispronounce (e.g., CamelCase identifiers, abbreviations,
domain-specific terms). For example, given

    subst:
    - hamr: Hammer
    - sysml: Sis M L
    - isolette: Eye-so-let

and the command

    * speak: Let me demonstrate $hamr$ code generation for the $isolette$ model.

the TTS engine receives: "Let me demonstrate Hammer code generation for the Eye-so-let model."

Each Markdown heading (`#`) defines an action. The heading text is the action
name. Bullet points (`*`) under a heading are the commands for that action.
HTML comments (`<!-- -->`) are ignored and can be used to comment out
sections.

Commands use the syntax `command[(options)]: arguments`:

| Command | Syntax | Description |
|---|---|---|
| `typeText` | `typeText[(delay)]: text` | Type text; `delay=0` pastes from clipboard (instant), `delay=-1` or omitted uses `defaultCharDelayMs` |
| `pressKey` | `pressKey[(mod,...)]: key` | Press a key with optional modifiers |
| `typeChar` | `typeChar[(mod,...)]: c` | Type a single character with optional modifiers |
| `wait` | `wait: ms` | Wait for the specified milliseconds |
| `notify` | `notify: message` | Show an on-screen overlay notification |
| `speak` | `speak[(async)]: text` | Speak text using TTS; `async` option for non-blocking playback |
| `waitForSpeech` | `waitForSpeech` | Wait for async speech to finish |
| `mouseMove` | `mouseMove: x, y` | Move mouse to coordinates |
| `mouseClick` | `mouseClick[(button)]: x, y` | Click at coordinates (`Left`/`Middle`/`Right`, default `Left`) |
| `mouseDoubleClick` | `mouseDoubleClick[(button)]: x, y` | Double-click at coordinates |
| `mouseDrag` | `mouseDrag: fromX, fromY, toX, toY` | Drag from one point to another |
| `clickImage` | `clickImage[(similarity, xOff, yOff)]: image.png` | Find image on screen and click it (path relative to `.md` file) |
| `waitForImage` | `waitForImage[(similarity, timeoutMs)]: image.png` | Wait for image to appear on screen |
| `clickText` | `clickText[(timeoutMs, xOff, yOff)]: text` | Find text on screen using OCR and click it |
| `waitForText` | `waitForText[(timeoutMs)]: text` | Wait for text to appear on screen using OCR |
| `screenCapture` | `screenCapture: output.png` | Capture the screen to a file |
| `hideCursor` | `hideCursor` | Hide the mouse cursor for the rest of the script: stops compositing the cursor overlay onto recorded frames; on macOS also runs a tiny Swift helper that calls `CGDisplayHideCursor` to truly hide the OS cursor system-wide |
| `showCursor` | `showCursor` | Re-enable the cursor overlay; on macOS lets the helper exit cleanly via `CGDisplayShowCursor` so the system cursor reappears |

Key names: `Enter`, `Escape` (or `Esc`), `Tab`, `Space`, `Backspace`, `Delete`,
`Up`, `Down`, `Left`, `Right`, `Home`, `End`, `PageUp`, `PageDown`,
`F1`–`F12`

Modifiers: `Ctrl`, `Shift`, `Alt`, `Meta` (or `Cmd`)

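
To make the `command[(options)]: arguments` shape concrete, the following Python sketch splits a bullet's text into its three parts. It is only an illustration under assumed rules (options are comma-separated, whitespace around parts is ignored, everything after the first top-level colon is the argument); the regular expression and the `parse_command` helper are hypothetical and not taken from Roboto.

```python
import re

# Assumed grammar: a name, an optional "(options)" group, an optional ": arguments" tail.
COMMAND_RE = re.compile(r"^\s*(\w+)\s*(?:\(([^)]*)\))?\s*(?::\s*(.*))?$")


def parse_command(line):
    """Split a command line into (name, options, arguments)."""
    match = COMMAND_RE.match(line)
    if match is None:
        raise ValueError(f"not a command: {line!r}")
    name, opts, args = match.groups()
    # Options are treated as a comma-separated list; an absent group gives [].
    options = [o.strip() for o in opts.split(",")] if opts else []
    return name, options, (args or "").strip()


print(parse_command("pressKey(Ctrl,Shift): F5"))
# prints: ('pressKey', ['Ctrl', 'Shift'], 'F5')
print(parse_command("waitForSpeech"))
# prints: ('waitForSpeech', [], '')
```

Interpreting the options (for example, a `typeText` delay versus `pressKey` modifiers) is left to each command, which is why the sketch returns them as plain strings.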
Image paths are resolved relative to the .md file's directory.
The similarity parameter (0.0–1.0, default 0.9) controls how closely the
screen region must match the template image.
xOff and yOff (default 0) offset the click position from the center of
the matched region.
Template images with transparent backgrounds (PNG alpha channel) are supported. Transparent pixels (alpha < 128) are ignored during matching, so only the foreground content needs to match. This allows the same template to work across different themes or background colors.

Workflow:
- Run `sireum roboto capture screenshot.png`
- Crop the target UI element from the screenshot (e.g., in Preview on macOS)
- Optionally, make the background transparent for theme-independent matching
- Save the cropped image next to the `.md` file
- Reference it: `clickImage(0.9): my-button.png`
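
The effect of ignoring transparent template pixels can be sketched in Python as below. The text does not state which similarity metric Roboto uses, so the mean per-channel difference here is an assumption, as are the function names `masked_similarity` and `click_point`, the integer rounding of the center, and the treatment of a fully transparent template.

```python
def masked_similarity(region, template, alpha_threshold=128):
    """Similarity in [0, 1] between two same-sized grids of RGBA tuples.

    Template pixels with alpha below alpha_threshold are skipped, so whatever
    lies behind the transparent parts of the template cannot lower the score.
    """
    total_diff = 0
    counted = 0
    for region_row, template_row in zip(region, template):
        for (r1, g1, b1, _), (r2, g2, b2, a2) in zip(region_row, template_row):
            if a2 < alpha_threshold:
                continue  # transparent template pixel: ignored
            total_diff += abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2)
            counted += 1
    if counted == 0:
        return 0.0  # assumption: a fully transparent template never matches
    return 1.0 - total_diff / (counted * 3 * 255)


def click_point(x, y, width, height, x_off=0, y_off=0):
    """Center of the matched region, shifted by the offsets."""
    return (x + width // 2 + x_off, y + height // 2 + y_off)


# One opaque red pixel followed by one fully transparent pixel.
template = [[(255, 0, 0, 255), (0, 0, 0, 0)]]
dark_theme = [[(255, 0, 0, 255), (10, 10, 10, 255)]]
light_theme = [[(255, 0, 0, 255), (240, 240, 240, 255)]]

print(masked_similarity(dark_theme, template))   # prints: 1.0
print(masked_similarity(light_theme, template))  # prints: 1.0
print(click_point(100, 50, 40, 20, x_off=10, y_off=-5))  # prints: (130, 55)
```

Both themes score identically because only the opaque foreground pixel is compared, which is what lets one template serve several background colors.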

`hideCursor` is useful for screen recordings of TUI demos where the mouse pointer would otherwise sit on top of the captured frames.
- `hideCursor` stops Roboto from compositing its cursor overlay onto the recorded frames (`drawCursor=false`).
- On macOS it additionally spawns a tiny Swift helper that calls `CGDisplayHideCursor(CGMainDisplayID())` and blocks on its stdin so the OS cursor itself is hidden system-wide for the duration. The helper balances the hide with `CGDisplayShowCursor` when its stdin closes (either via `showCursor` or the JVM shutdown hook on script exit), so the cursor is never left hidden on a crash.
- On non-macOS platforms the helper is skipped and Roboto falls back to parking the OS pointer at the right edge / vertical-middle of the primary screen, which avoids menu-bar / dock / hot-corner triggers.
- `showCursor` reverses both effects.

Typical use is a single `hideCursor` near the top of the script; a
matching `showCursor` is only needed if a later section actually requires
cursor visibility.
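
The "block on stdin, restore when it closes" pattern used by the macOS helper can be illustrated with a small Python stand-in. This is not the Swift helper: the child process below only prints `hidden` and `shown` where the real helper would call `CGDisplayHideCursor` and `CGDisplayShowCursor`, and the names `HELPER`, `start_helper`, and `stop_helper` are hypothetical.

```python
import subprocess
import sys

# Stand-in helper: announce "hidden", block until stdin reaches EOF, then announce "shown".
HELPER = (
    "import sys\n"
    "print('hidden', flush=True)\n"
    "sys.stdin.read()\n"  # returns only when the parent closes the pipe
    "print('shown', flush=True)\n"
)


def start_helper():
    """Spawn the helper with a pipe on stdin so the parent controls its lifetime."""
    return subprocess.Popen(
        [sys.executable, "-c", HELPER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def stop_helper(proc):
    """Close the helper's stdin, wait for it to exit, and return what it printed."""
    out, _ = proc.communicate()  # with no input, this closes stdin and waits
    return out


helper = start_helper()
print(stop_helper(helper).split())
# prints: ['hidden', 'shown']
```

Because the restore step is triggered by the pipe closing rather than by an explicit message, it also runs when the parent process goes away, since the operating system closes the pipe in that case.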