Introduction
WayDriver is a Rust library for headless GUI application testing on Wayland. It launches apps in isolated compositor sessions, interacts with them via AT-SPI accessibility APIs, and captures screenshots and WebM video via PipeWire.
The repo also contains waydriver-mcp, a standalone Model Context Protocol server binary built on top of the library that lets AI assistants drive GTK4 apps directly — see MCP Server.
Crates.io · API docs (docs.rs) · GitHub · License: Apache-2.0
Demo
The clip below is the full output of crates/waydriver-examples/examples/gnome_calculator.rs, runnable with cargo run -p waydriver-examples --example gnome_calculator. Read the source for the API surface in context — it covers a session lifecycle, AT-SPI button clicks, keyboard chord dispatch (Shift+9/Shift+0 for parens), a typed unit conversion, and per-step result verification via XPath locators. The recording is captured by waydriver itself via PipeWire.
How it works
Each test session creates an isolated environment with a headless compositor, input injection, and screen capture:
graph TD
subgraph Session["Per-session processes"]
dbus["dbus-daemon (private)"]
dbus --- mutter["Mutter --headless --wayland"]
mutter --- screencast["ScreenCast API (screenshots)"]
mutter --- remotedesktop["RemoteDesktop API (input)"]
dbus --- pipewire["PipeWire (frame capture)"]
dbus --- wireplumber["WirePlumber (PipeWire graph manager)"]
app["Your app (on Mutter's Wayland display)"]
app --- atspi["AT-SPI (accessibility tree, actions)"]
end
The library is backend-agnostic. Three traits define the interface:
- CompositorRuntime: lifecycle of a headless compositor (start, stop, expose Wayland display)
- InputBackend: keyboard and pointer injection
- CaptureBackend: screen capture (start/stop PipeWire streams, grab PNG frames)
Concrete implementations are separate crates. The trait-based design allows backends to be added as sibling crates without changing the core.
Backend support
| Feature | Mutter | KWin | Sway |
|---|---|---|---|
| Headless compositor | Yes | — | — |
| Keyboard input | Yes (RemoteDesktop) | — | — |
| Pointer input | Yes (RemoteDesktop) | — | — |
| Screenshots | Yes (ScreenCast + PipeWire) | — | — |
| Video recording (WebM/VP8) | Yes (ScreenCast + PipeWire) | — | — |
| AT-SPI (UI inspection, clicks) | Yes | — | — |
Currently only Mutter is implemented (waydriver-compositor-mutter, waydriver-input-mutter, waydriver-capture-mutter). Each compositor has its own APIs (Mutter uses org.gnome.Mutter.* D-Bus interfaces, KWin has org.kde.KWin.*, Sway uses wlroots Wayland protocols), so each would need its own set of backend crates.
Crate structure
| Crate | Purpose |
|---|---|
waydriver | Trait definitions, Session, AT-SPI client, keysym helpers, shared GStreamer capture helper |
waydriver-compositor-mutter | CompositorRuntime impl — manages Mutter, PipeWire, WirePlumber, private D-Bus |
waydriver-input-mutter | InputBackend impl — keyboard/pointer via Mutter RemoteDesktop |
waydriver-capture-mutter | CaptureBackend impl — screenshots via Mutter ScreenCast + PipeWire |
waydriver-mcp | Binary — MCP JSON-RPC server over stdio that exposes the library to AI assistants |
Getting Started
Requirements
All dependencies are provided by the Nix flake (nix develop). If not using Nix, you need the following system packages.
Build dependencies
| Debian/Ubuntu | Fedora | Arch |
|---|---|---|
pkg-config | pkg-config | pkg-config |
libglib2.0-dev | glib2-devel | glib2 |
libgstreamer1.0-dev | gstreamer1-devel | gstreamer |
libgstreamer-plugins-base1.0-dev | gstreamer1-plugins-base-devel | gst-plugins-base |
Runtime dependencies
| Debian/Ubuntu | Fedora | Arch |
|---|---|---|
mutter | mutter | mutter |
pipewire | pipewire | pipewire |
wireplumber | wireplumber | wireplumber |
gstreamer1.0-plugins-base | gstreamer1-plugins-base | gst-plugins-base |
gstreamer1.0-plugins-good | gstreamer1-plugins-good | gst-plugins-good |
gstreamer1.0-pipewire | gstreamer1-plugins-pipewire | gst-plugin-pipewire |
at-spi2-core | at-spi2-core | at-spi2-core |
dbus | dbus | dbus |
Quick install:
# Debian/Ubuntu
sudo apt install pkg-config libglib2.0-dev libgstreamer1.0-dev \
libgstreamer-plugins-base1.0-dev mutter pipewire wireplumber \
gstreamer1.0-plugins-base gstreamer1.0-plugins-good \
gstreamer1.0-pipewire at-spi2-core dbus
# Fedora
sudo dnf install pkg-config glib2-devel gstreamer1-devel \
gstreamer1-plugins-base-devel mutter pipewire wireplumber \
gstreamer1-plugins-base gstreamer1-plugins-good \
gstreamer1-plugins-pipewire at-spi2-core dbus
# Arch
sudo pacman -S pkg-config glib2 gstreamer gst-plugins-base \
gst-plugins-good gst-plugin-pipewire mutter pipewire \
wireplumber at-spi2-core dbus
Add WayDriver to your project
Add the core library plus the Mutter backend crates:
cargo add waydriver waydriver-compositor-mutter waydriver-input-mutter waydriver-capture-mutter
WayDriver’s API is async, so you’ll also want a Tokio runtime:
cargo add tokio --features full
Usage
use std::sync::Arc;
use waydriver::{Session, SessionConfig, CompositorRuntime};
use waydriver_compositor_mutter::MutterCompositor;
use waydriver_input_mutter::MutterInput;
use waydriver_capture_mutter::MutterCapture;

// The body awaits futures and uses `?`, so it needs an async entry point that
// returns a `Result`. Tokio is the runtime added above; `Box<dyn Error>` assumes
// waydriver's error types implement `std::error::Error`.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut compositor = MutterCompositor::new();
    compositor.start(None).await?;

    // `state()` is `Option`; immediately after a successful `start()` it is
    // always `Some`, and `expect` documents that invariant locally.
    let state = compositor.state().expect("state available after start");
    let input = MutterInput::new(state.clone());
    let capture = MutterCapture::new(state);

    let session = Arc::new(Session::start(
        Box::new(compositor),
        Box::new(input),
        Box::new(capture),
        SessionConfig {
            command: "your-gtk-app".into(),
            args: vec![],
            cwd: None,
            app_name: "your-gtk-app".into(),
            // Record the entire session to a WebM file. Set to `None` to skip.
            video_output: Some("/tmp/session.webm".into()),
            video_bitrate: None, // defaults to waydriver::capture::DEFAULT_VIDEO_BITRATE (2 Mbps)
            video_fps: None, // defaults to waydriver::capture::DEFAULT_VIDEO_FPS (15)
        },
    ).await?);

    // Take a screenshot (returns PNG bytes).
    let png = session.take_screenshot().await?;

    // Target widgets with XPath selectors over the AT-SPI tree. Actions
    // auto-wait for the element to be visible + enabled before firing.
    session.locate("//Button[@name='primary-button']").click().await?;
    session.locate("//Text[@name='search']").set_text("hello").await?;

    // Keyboard input with modifier chords.
    session.press_chord("Ctrl+Shift+S").await?;

    // Explicit waits when auto-wait isn't enough, e.g. an item appearing
    // after some async work.
    session.locate("//Label[@name='status']")
        .wait_for_text(|t| t == "ready")
        .await?;

    // Inspect the tree while debugging selectors.
    let xml = session.dump_tree().await?;
    println!("{xml}");

    // Tear the session down; `try_unwrap` succeeds because no other `Arc`
    // clones are alive at this point.
    Arc::try_unwrap(session).unwrap().kill().await?;
    Ok(())
}
Next: the Locator API reference covers the full action surface, and the MCP Server chapter shows how to drive apps from an AI assistant without writing Rust.
Locator API
Session::locate(xpath) returns a lazy Locator — each action re-snapshots
the AT-SPI tree and re-resolves the selector, so you don’t have to worry
about stale element handles. Common methods:
| Method | What it does |
|---|---|
click() / double_click() / right_click() | Invoke the AT-SPI Action interface (primary, secondary, tertiary actions) |
hover() / drag_to(target) / drag_to_coords(x, y) | Pointer-driven hover and drag — lands on real Wayland input events for repaint. drag_to_coords releases at raw screen coordinates, so the drop can land off-window (e.g. libadwaita tab drag-out) |
focus() / scroll_into_view() | Component::grab_focus and scroll_to/scroll_to_point |
set_text(s) / fill(s) | Direct EditableText write vs. focus-and-type fallback for widgets without EditableText (e.g. GtkTextView) |
select_option(by) | Pick a child of a Selection-interface container by label or index |
text() | Read via the Text interface |
count() / all() / inspect_all() | Multi-match: count, list of locators, full metadata in one snapshot |
name() / role() / attribute(k) / attributes() / bounds() | Accessible name, role, AT-SPI attributes, screen-relative bounds |
is_showing() / is_enabled() | State predicates |
wait_for_visible() / _hidden() / _enabled() / _count(n) / _text(pred) | Block until state or predicate holds |
wait_for(pred) / wait_until(pred) / wait_until_async(pred) | General-purpose predicate auto-waits |
with_timeout(d) | Per-call override of the auto-wait timeout |
nth(i) / first() / last() / parent() / locate(sub_xpath) | Compose sub-locators |
Single-target actions (click, focus, set_text, text, …) error with
AmbiguousSelector if the selector matches more than one element. Narrow
with .nth(i) or a more specific XPath.
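The lazy re-resolution and the single-target rule can be illustrated with a small, self-contained sketch. It is not waydriver's implementation: the `Node` snapshot type, the role-plus-name query (standing in for XPath), the `LazyLocator` struct, and the `NotFound` variant are all assumptions made for illustration. Only the `AmbiguousSelector` name comes from the text above.

```rust
/// Toy stand-in for one node of an accessibility-tree snapshot (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct Node {
    role: String,
    name: String,
}

/// Hypothetical error type; only `AmbiguousSelector` is named in the docs.
#[derive(Debug, PartialEq)]
enum ResolveError {
    NotFound,
    AmbiguousSelector { matches: usize },
}

/// A lazy locator stores only the query, never a handle to a node,
/// so there is nothing that can go stale between actions.
struct LazyLocator {
    role: String,
    name: String,
}

impl LazyLocator {
    /// Re-resolve against a fresh snapshot. Single-target actions
    /// require exactly one match.
    fn resolve_single<'a>(&self, snapshot: &'a [Node]) -> Result<&'a Node, ResolveError> {
        let mut hits = snapshot
            .iter()
            .filter(|n| n.role == self.role && n.name == self.name);
        match (hits.next(), hits.next()) {
            (None, _) => Err(ResolveError::NotFound),
            (Some(node), None) => Ok(node),
            // Two matches were already consumed; count the remainder.
            (Some(_), Some(_)) => Err(ResolveError::AmbiguousSelector {
                matches: 2 + hits.count(),
            }),
        }
    }
}
```

Calling `resolve_single` before every action, each time with a new snapshot, is what lets the same locator value survive UI changes between calls.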
MCP Server
waydriver-mcp is a standalone binary that exposes the library over the Model Context Protocol, letting AI assistants (Claude Desktop, Claude Code, etc.) drive GTK4 apps in isolated headless sessions. It speaks JSON-RPC over stdio and constructs the Mutter backends internally — clients only see the high-level tools below.
| Tool | Purpose |
|---|---|
start_session | Spawn a headless Mutter session and launch a command inside it (optional report_dir, resolution, scale, isolate_settings, gsettings, record_video, video_bitrate, capture_external_effects overrides per session) |
list_sessions | List active session ids, app names, and Wayland displays |
kill_session | Tear down a session and clean up all child processes |
set_setting | Change a GSettings key on the running session live — rewrites the isolated keyfile in place so the app re-applies it via its changed handler (cursor, fonts, color-scheme, …) without a restart |
dump_tree | Dump the AT-SPI accessibility tree as XML — each node carries a _ref you can target with query/click/etc. |
query | Evaluate an XPath over the tree; returns every match’s role, name, attributes, and states |
click / double_click / right_click | Invoke an element’s primary / secondary / tertiary AT-SPI Action. Auto-waits for visibility + enablement. |
hover | Move the pointer to an element’s center — drives a real Wayland motion event so hover-state UI repaints |
drag_to | Press, move across an element’s center, release — full Wayland drag gesture |
drag_to_coords | Like drag_to, but release at raw screen-absolute (x, y) — drop onto empty space or off the source window (libadwaita tab drag-out and other “drop onto nothing” DnD) |
focus | Give keyboard focus to an element via AT-SPI Component::grab_focus |
set_text | Replace an editable element’s contents via EditableText (fast, requires the interface) |
fill | Focus + clear + type — fallback for widgets without EditableText (e.g. GtkTextView/GtkEntry). Tries AT-SPI Component::grab_focus first; widgets whose bridge doesn’t expose Component (the documented GTK4 case) fall back to a pointer click at the widget’s centre to drive focus through the input layer, the same way a user would. Set assume_focused: true to skip the whole focus step when the target is already focused. Supports caret_nav/select_all clear modes. |
select_option | Pick an entry from a Selection-interface container (combo box, list, …) by label or by index |
read_text | Read an element’s text via the Text interface |
read_value | Read an element’s AT-SPI Value (current/min/max) — a scrolled view’s offset, or a slider/progress/spin value |
scroll | Scroll a located area by wheel detents along an axis (parks the pointer over it first); pair with read_value to confirm the offset moved |
type_text | Type a string into the currently focused element through the input backend |
press_key | Press a named key or chord (Return, Ctrl+A, Shift+Tab, Escape, …) |
move_pointer | Move the pointer by a relative offset in logical pixels |
pointer_click | Press and release a pointer button (defaults to left click) |
take_screenshot | Capture a PNG via the keepalive ScreenCast stream and return its path |
compare_element_to_baseline | Crop an element and diff it against a committed reference PNG (perceptual CIEDE2000) — returns a diff score (not a pass/fail verdict) and writes a red-highlighted diff image on mismatch |
get_captured_effects | Read the desktop notifications and portal open-URI requests the app emitted onto the session bus (mock D-Bus sinks). Requires capture_external_effects: true on start_session; effects have no AT-SPI projection, so this is the only way to assert on them |
launch_secondary_instance | Relaunch the app with extra args in the same session env — a single-instance GApplication forwards the command line to the running primary; observe the primary’s reaction via wait_for_stdout_line/query |
Selectors use XPath 1.0 against a snapshot of the AT-SPI tree serialized to XML, with role names normalized to PascalCase (e.g. push button → Button). Example XPaths: //Button[@name='OK'], //Text[@name='search'], //MenuItem[contains(@name, 'Mode')], (//Button)[last()].
Each session produces output under a configurable report directory. Screenshots are written as {report_dir}/{session_id}/{session_id}-{n}.png — each session gets its own subdirectory and n increments per take_screenshot call. The base report_dir defaults to /tmp/waydriver and can be overridden with the --report-dir <PATH> CLI flag or the WAYDRIVER_REPORT_DIR environment variable. Individual start_session calls may also pass a report_dir argument to override the server default for that session.
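As a rough illustration of the layout described above, the sketch below builds a screenshot path from its three parts and applies the per-session override. Both function names are hypothetical, and the sketch deliberately says nothing about how the CLI flag and the environment variable are ranked against each other, since the text does not specify that.

```rust
use std::path::{Path, PathBuf};

/// Builds `{report_dir}/{session_id}/{session_id}-{n}.png`.
/// Hypothetical helper, not part of the waydriver API.
fn screenshot_path(report_dir: &Path, session_id: &str, n: u32) -> PathBuf {
    report_dir
        .join(session_id)
        .join(format!("{session_id}-{n}.png"))
}

/// A per-session `report_dir` argument wins over the server-wide default.
fn effective_report_dir<'a>(server_default: &'a Path, per_session: Option<&'a Path>) -> &'a Path {
    per_session.unwrap_or(server_default)
}
```

Keeping the session id in both the directory and the file name means a screenshot stays identifiable even after it is copied out of its session folder.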
Alongside the screenshots, each session writes:
- {session_id}.webm: full-session VP8/WebM recording of the display at 15 fps, finalized with a seekhead on kill_session. On by default; disable per-server with --record-video false / WAYDRIVER_RECORD_VIDEO=false, or per-session with start_session’s record_video: false. Bitrate via --video-bitrate <bits/sec> / WAYDRIVER_VIDEO_BITRATE (default 2_000_000) or per-session video_bitrate.
- events.jsonl: append-only audit log of every session-scoped tool call (action, params, ok/err status, timestamp) at {report_dir}/{session_id}/events.jsonl.
- events.js: atomic rewrite of the same data as window.__events_update([...]) for consumption by the viewer.
- index.html: styled viewer (Tailwind via the Play CDN) that embeds the recording in a <video> tag when present. It reloads events.js every 2 s via a <script src> swap (which works over file://, unlike fetch) and renders append-only, so expanded <details> stay expanded across refreshes. Written once at session start.
start_session’s response includes a file:// URL to the session viewer — open it directly from the filesystem in any browser. No HTTP server, no ports, no network access required. Multiple waydriver-mcp instances (different Claude Code tabs / projects) can run side by side without conflict.
Why Docker?
waydriver-mcp needs ~8 system services at runtime (mutter, pipewire, wireplumber, dbus, AT-SPI, gstreamer). Installing these manually is fragile and distro-specific. Docker solves four problems:
- Security: the MCP server spawns arbitrary processes, interacts with them via D-Bus, and captures their screen. Running this on your host session gives it access to everything your user can do. Inside a container, it only sees what you explicitly mount, with no access to your files, browser sessions, or credentials. Add --network none to block network access entirely (the report viewer is purely static file://, so it works without any network).
- Zero-setup distribution: docker pull and you’re running, no system packages to install.
- D-Bus isolation: each container gets its own dbus-daemon, so apps with singleton D-Bus activation don’t interfere across concurrent test sessions.
- ABI compatibility: apps built inside the container are guaranteed to link against the same libraries the MCP runtime uses.
Running with Docker (recommended)
Prebuilt images are published to GitHub Container Registry for each release:
| Image | Purpose |
|---|---|
ghcr.io/bohdantkachenko/waydriver-mcp | Runtime — MCP server with all system deps |
ghcr.io/bohdantkachenko/waydriver-mcp-builder | Build env — Fedora 42 + Rust + gcc/g++ + meson + cmake + GTK4/GLib dev headers |
docker pull ghcr.io/bohdantkachenko/waydriver-mcp:latest
docker pull ghcr.io/bohdantkachenko/waydriver-mcp-builder:latest
Use the builder image to compile your app in a Fedora environment that matches the runtime. The resulting binary is ABI-compatible with the runtime image. See Testing your app below for language-specific build examples.
MCP client config (e.g. .mcp.json for Claude Code):
{
"mcpServers": {
"waydriver-mcp": {
"command": "sh",
"args": ["-c", "docker run --rm -i --network none -v \"$PWD:/workspace:ro\" -v /tmp/waydriver:/tmp/waydriver ghcr.io/bohdantkachenko/waydriver-mcp:latest"]
}
}
}
- $PWD:/workspace:ro mounts the project directory so the MCP can launch your app binaries from /workspace/.
- /tmp/waydriver:/tmp/waydriver makes session reports (screenshots, WebM recordings, events.jsonl, index.html) accessible on the host at /tmp/waydriver/. The mount uses the same path on both sides so the file:// URL that start_session returns is openable as-is on the host.
- --network none is safe to use for full isolation: the report viewer is pure static HTML + JS loaded from your local filesystem.
For NixOS users, also mount the Nix store so Nix-built binaries work inside the container:
{
"mcpServers": {
"waydriver-mcp": {
"command": "sh",
"args": ["-c", "docker run --rm -i --network none -v /nix/store:/nix/store:ro -v \"$PWD:/workspace:ro\" -v /tmp/waydriver:/tmp/waydriver ghcr.io/bohdantkachenko/waydriver-mcp:latest"]
}
}
}
Or build from source:
docker build -t waydriver-mcp .
Testing your app with waydriver-mcp
The MCP server is persistent — it stays up for the entire AI assistant session. You rebuild your app independently, and each start_session call picks up the latest binary from the volume. No MCP restart needed between iterations.
Rust apps — build with the builder image, volume-mount the binary:
docker run --rm -v "$PWD:/src:ro" -v "$PWD/build:/out" \
ghcr.io/bohdantkachenko/waydriver-mcp-builder:latest \
sh -c "cp -r /src /tmp/build && cd /tmp/build && cargo build --release && cp target/release/myapp /out/"
{
"mcpServers": {
"waydriver-mcp": {
"command": "docker",
"args": ["run", "--rm", "-i",
"-v", "/path/to/myapp/build:/workspace:ro",
"ghcr.io/bohdantkachenko/waydriver-mcp:latest"]
}
}
}
Then call start_session with command: "/workspace/myapp".
C/C++ apps — the builder image includes gcc, g++, meson, ninja-build, cmake, and GTK4/GLib dev headers:
docker run --rm -v "$PWD:/src:ro" -v "$PWD/build:/out" \
ghcr.io/bohdantkachenko/waydriver-mcp-builder:latest \
sh -c "cp -r /src /tmp/build && cd /tmp/build && meson setup _build && meson compile -C _build && cp _build/myapp /out/"
For extra deps (e.g. libadwaita-devel), extend the builder:
FROM ghcr.io/bohdantkachenko/waydriver-mcp-builder:latest
RUN dnf install -y libadwaita-devel
Node/Python apps — extend the runtime image to add the interpreter, use a named volume for deps:
FROM ghcr.io/bohdantkachenko/waydriver-mcp:latest
RUN dnf install -y nodejs && dnf clean all
Install deps into a named volume (re-run only when lockfile changes):
docker volume create myapp-nodemods
docker run --rm \
-v "$PWD/package.json:/app/package.json:ro" \
-v "$PWD/package-lock.json:/app/package-lock.json:ro" \
-v "myapp-nodemods:/app/node_modules" \
-w /app \
ghcr.io/bohdantkachenko/waydriver-mcp-builder:latest \
sh -c "dnf install -y nodejs npm && npm ci --omit=dev"
Mount source + deps — edit source freely, MCP picks up changes on next start_session:
"args": ["run", "--rm", "-i",
"-v", "/path/to/myapp/src:/app/src:ro",
"-v", "myapp-nodemods:/app/node_modules:ro",
"myapp-mcp:latest"]
NixOS users — mount /nix/store so Nix-built binaries just work:
"args": ["run", "--rm", "-i",
"-v", "/nix/store:/nix/store:ro",
"-v", "/path/to/myapp:/workspace:ro",
"ghcr.io/bohdantkachenko/waydriver-mcp:latest"]
Running with Nix
For local development without Docker, the Nix app wraps the binary with the required runtime env vars:
nix run .#mcp
Sessions are kept in an in-memory HashMap keyed by id, so multiple apps can run concurrently within one server process.
Visual locator — OCR + flood-fill region detection
Gated behind the visual Cargo feature on the waydriver crate. Adds
two coordinated abilities for finding widgets that the AT-SPI tree
doesn’t reveal:
- OCR-based text matching — locate a widget by its on-screen text when AT-SPI doesn’t surface it as an accessible.
- Region detection — once OCR finds the text, walk outward through the pixels to find the visually-distinct shape enclosing it (a button pill, row, card frame), so clicks land on the widget rather than its inner glyphs.
This doc describes both pipelines, how they compose, what they cost, and when each one is the right tool.
Why this exists
AT-SPI is the normal interaction path: enumerate the accessibility
tree, find a widget by name/role/state, call Action.do_action or
synthesize pointer events at its bounds. waydriver’s regular
Locator
does all that.
But real toolkits have gaps. Two we’ve hit and confirmed are genuinely upstream:
- libadwaita lazy realization: an AdwPreferencesGroup constructed with visible:false inside an AdwPreferencesPage and then flipped visible after present() never has its accessible subtree built. The same happens to a non-initial AdwPreferencesDialog page. The contained AdwButtonRow/AdwSwitchRow paints on screen but is absent from every AT-SPI surface. We exhaustively tried to force realization from the client, and none of the attempts work (confirmed live on mutter 49 / GTK4 4.20 / libadwaita 1.8):
  - parent traversal (GetChildren), a 0..ChildCount GetChildAtIndex(i) loop, and Cache.GetItems on the app bus: the widgets are simply never published;
  - a grid of Component.GetAccessibleAtPoint hit-tests over the dialog and every descendant (thousands of calls): no change;
  - synthetic compositor pointer-hover across the page: no change;
  - keyboard focus traversal (Tab through the dialog, which is how Orca surfaces them): no change.

  Libadwaita doesn’t register these accessibles, and there’s no AT-SPI or input path that makes it do so. The bug is genuinely upstream; the OCR visual locator below is the only working way to drive these widgets.
- AdwButtonRow has no accessible name: even when the row is in the tree, its title doesn’t surface as an AT-SPI name, so Locator::find_by_name returns zero matches.
We can’t fix these from the client side: D-Bus enumeration finds what the toolkit chose to publish. The pixels on screen, however, are real. The visual locator drives off those pixels.
It’s strictly opt-in. waydriver’s existing Locator::click etc.
never silently fall back to OCR — the cost (hundreds of ms) is too
high to hide, and silent fallback would mask real selector bugs. You
reach for Session::find_by_text only when you’ve established that
AT-SPI doesn’t see the widget.
The OCR pipeline
┌──────────────────────────────────────────────┐
│ Session::take_screenshot() │
│ PipeWire keepalive stream → PNG bytes │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ image::load_from_memory(...) │
│ PNG → DynamicImage │
└────────────────────┬─────────────────────────┘
│
optional .within(rect) │ crop to parent region
+ 32px context pad │ (Locator::find_by_text)
v
┌──────────────────────────────────────────────┐
│ ocrs::OcrEngine │
│ prepare_input → detect_words → │
│ find_text_lines → recognize_text │
│ (pure-Rust, ONNX via rten) │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ Filter words by `text` (Substring/Exact) │
│ Translate bboxes back to screen coords │
│ Return Vec<Rect> │
└──────────────────────────────────────────────┘
Engine lifecycle
The OcrEngine is loaded once per session into a shared
tokio::sync::OnceCell. The two .rten model files (text-detection
~2.5 MB, text-recognition ~10 MB) are looked up in this order:
1. Env-var override: WAYDRIVER_OCRS_DETECTION_MODEL and WAYDRIVER_OCRS_RECOGNITION_MODEL are both set.
2. XDG cache hit: $XDG_CACHE_HOME/waydriver/ocrs-models/ (or ~/.cache/...) has both files.
3. Auto-download: fetch from the ocrs project’s S3 bucket into the XDG cache. First call only; subsequent runs hit (2).
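The lookup order can be sketched as a pure function. This is an illustration under assumptions, not the crate's code: the `ModelSource` enum, the `resolve_models` name, the injected `exists` closure (used so the sketch needs no filesystem), and the two model file names are hypothetical.

```rust
use std::path::{Path, PathBuf};

/// Where the two model files will come from (hypothetical type).
#[derive(Debug, PartialEq)]
enum ModelSource {
    EnvOverride { detection: PathBuf, recognition: PathBuf },
    CacheHit { detection: PathBuf, recognition: PathBuf },
    Download { into: PathBuf },
}

/// Applies the three-step lookup order. `exists` is injected so the
/// decision logic can be exercised without touching the disk.
fn resolve_models(
    env_detection: Option<&str>,
    env_recognition: Option<&str>,
    cache_dir: &Path,
    exists: impl Fn(&Path) -> bool,
) -> ModelSource {
    // 1. The env-var override only applies when BOTH variables are set.
    if let (Some(d), Some(r)) = (env_detection, env_recognition) {
        return ModelSource::EnvOverride {
            detection: d.into(),
            recognition: r.into(),
        };
    }
    // 2. Cache hit requires both files. The file names are assumptions.
    let detection = cache_dir.join("text-detection.rten");
    let recognition = cache_dir.join("text-recognition.rten");
    if exists(&detection) && exists(&recognition) {
        return ModelSource::CacheHit { detection, recognition };
    }
    // 3. Otherwise the caller downloads into the cache directory.
    ModelSource::Download { into: cache_dir.to_path_buf() }
}
```

In CI, pre-populating the cache directory makes step 2 succeed on the first run, which avoids the download cost mentioned in this section.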
Set SessionConfig::prewarm_visual = true
to spawn the engine load as a background task during Session::start
so the first find_by_text call doesn’t pay the ~1–2 s model load.
On a fresh machine with no XDG cache, the first session also pays
~5–20 s of model download — pre-populate the cache in CI setup if
that matters.
Cropping to a parent (the Locator::find_by_text path)
Session::find_by_text(text) runs OCR over the full screen. That works but
is slow (~200–500 ms on a 1024×768 frame) and noisy — every word
visible on screen is a candidate, so disambiguation matters.
Locator::find_by_text(text) on an AT-SPI parent locator is the
faster, more accurate form:
let dialog = session.locate("//Dialog[@name='Preferences']");
let text = dialog.find_by_text("lazy-button").await?;
This crops the screenshot to the parent’s AT-SPI bounds (plus a 32 px padding ring) before it reaches ocrs:
- Speed. OCR runtime is roughly linear in image area; cropping to a typical dialog cuts a search from ~300 ms to ~50 ms.
- Accuracy. Less surrounding text means fewer false positives and less context that confuses the recognition head.
Why the 32 px context padding? Empirically, a tight crop strips the
visual context that ocrs’s recogniser uses to disambiguate ambiguous
glyphs. Without padding, small/low-contrast labels misread (we saw
lazy-button → lazv-button). The 32 px ring restores the context;
hits inside the ring but outside the original scope are filtered
back out after OCR so the caller sees only matches that genuinely
fall inside the requested region.
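The pad-then-filter step can be sketched with plain rectangle arithmetic. The `Rect` type and helper names here are illustrative only, and the rule used to decide whether a hit lies inside the original scope (its centre must fall inside) is an assumption; the text only says that ring-only hits are filtered out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

impl Rect {
    fn right(&self) -> i32 { self.x + self.width }
    fn bottom(&self) -> i32 { self.y + self.height }

    /// Grow by `pad` pixels on every side, clamped to the screen.
    fn padded(&self, pad: i32, screen_w: i32, screen_h: i32) -> Rect {
        let x0 = (self.x - pad).max(0);
        let y0 = (self.y - pad).max(0);
        let x1 = (self.right() + pad).min(screen_w);
        let y1 = (self.bottom() + pad).min(screen_h);
        Rect { x: x0, y: y0, width: x1 - x0, height: y1 - y0 }
    }

    fn center(&self) -> (i32, i32) {
        (self.x + self.width / 2, self.y + self.height / 2)
    }

    fn contains_point(&self, px: i32, py: i32) -> bool {
        px >= self.x && px < self.right() && py >= self.y && py < self.bottom()
    }
}

/// Translate crop-relative OCR hits back to screen coordinates and keep
/// only those whose centre lies inside the originally requested scope.
fn hits_in_scope(hits_in_crop: &[Rect], crop: Rect, scope: Rect) -> Vec<Rect> {
    hits_in_crop
        .iter()
        .map(|h| Rect { x: h.x + crop.x, y: h.y + crop.y, ..*h })
        .filter(|h| {
            let (cx, cy) = h.center();
            scope.contains_point(cx, cy)
        })
        .collect()
}
```

For a scope at (100, 100) sized 200×100 on a 1024×768 screen, `padded(32, 1024, 768)` yields a crop at (68, 68) sized 264×164; OCR sees the larger crop, while the caller only receives hits inside the 200×100 scope.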
MatchMode
- Substring (default): case-insensitive substring match. Tolerant of OCR noise (it’ll match "open" against "open-lazy-issue1-dialog").
- Exact: equality on the full joined line, normalized.
Both modes Unicode-normalize haystack and needle before comparing: NFKD decomposition + case-fold + combining-mark stripping. This makes matching insensitive to:
- Case: "Add Account" matches "add account".
- Diacritics: "café" matches "cafe", "naïve" matches "naive".
- Ligatures and compatibility codepoints: "ﬁle" (U+FB01) matches "file", "ﬂux" (U+FB02) matches "flux".
Exotic punctuation (e.g. the Unicode minus − U+2212 that
gnome-calculator uses in its history line) is not auto-mapped
to ASCII equivalents — match it explicitly when needed.
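The two match modes and the normalize-before-compare idea can be sketched with the standard library alone. Note the substitution: real NFKD decomposition needs a Unicode library, so the `fold` function below is a deliberately tiny stand-in that covers only a handful of codepoints (a few accented Latin letters, the two ligatures, and the combining-mark range). The `MatchMode` enum mirrors the names in the text; everything else is hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum MatchMode {
    Substring,
    Exact,
}

/// Simplified stand-in for NFKD + case-fold + combining-mark stripping.
/// It handles only the few codepoints listed here, NOT full Unicode.
fn fold(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '\u{FB01}' => out.push_str("fi"),       // fi ligature
            '\u{FB02}' => out.push_str("fl"),       // fl ligature
            '\u{E8}'..='\u{EB}' => out.push('e'),   // e with grave/acute/circumflex/diaeresis
            '\u{EC}'..='\u{EF}' => out.push('i'),   // i with grave/acute/circumflex/diaeresis
            '\u{0300}'..='\u{036F}' => {}           // drop combining marks
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}

/// Normalize both sides, then compare according to the mode.
fn matches(haystack: &str, needle: &str, mode: MatchMode) -> bool {
    let (h, n) = (fold(haystack), fold(needle));
    match mode {
        MatchMode::Substring => h.contains(&n),
        MatchMode::Exact => h == n,
    }
}
```

Because nothing in `fold` maps U+2212 to an ASCII hyphen, a needle written with `-` does not match a haystack containing the Unicode minus, which mirrors the caveat above.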
Block grouping with visual-boundary detection
OCR returns individual text lines; the grouper merges them into blocks bottom-up via several heuristics, applied in order:
- Geometric clustering: lines with small y-gap and overlapping x-ranges merge into one block (wrapped paragraph behaviour).
- Pixel-level boundary checks (when an image is available): even if the geometric tests pass, the merge is vetoed when the gap between two lines contains:
  - A background-colour change: sample an averaged window of pixels just below the upper line and one just above the lower line (window radius VisualTextTuning::background_sample_radius px, default 2 = 5×5); if their colours differ by more than VisualTextTuning::background_color_tolerance (default 24), the lines sit on different backgrounds. The averaged-window sampler smooths over single antialias-fringe pixels that would skew a single-pixel read.
  - A horizontal divider stripe: scan every row in the gap; a row where ≥ boundary_majority_threshold (default 0.8) of boundary_samples_per_axis (default 16) sampled pixels differ from both surrounding backgrounds is a horizontal rule.
  - A vertical divider stripe: scan every column in the x-overlap range; same majority + colour-distance test. Picks up split-pane rules that pass through the gap.
- Connectivity check (opt-in, connectivity_check_enabled = false by default): a bounded BFS in the gap. From the bg pixel just below the upper line, flood-fill at most max_connectivity_pixels (default 4096) pixels and check whether the flood reaches the bg pixel just above the lower line. If not, the lines are in visually-separated regions despite having the same background colour. This catches “two cards on the same fill, each boxed in by a thin border the divider check is too sparse to detect”.
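The connectivity check is a budgeted flood fill, which the sketch below illustrates on a toy image. It makes several assumptions that are not from the source: the image is a `Vec` of rows of RGB triples, neighbours are 4-connected, colour similarity uses squared RGB distance rather than the configurable metric, and the budget counts visited pixels. The function name is hypothetical.

```rust
use std::collections::{HashSet, VecDeque};

type Rgb = [u8; 3];

/// Squared Euclidean distance in raw RGB (a simplification).
fn rgb_dist_sq(a: Rgb, b: Rgb) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(&p, &q)| {
            let d = i32::from(p) - i32::from(q);
            (d * d) as u32
        })
        .sum()
}

/// Bounded 4-neighbour BFS over pixels whose colour stays within
/// `tolerance` of the start pixel. Returns true if `goal` is reached
/// before `max_pixels` pixels have been visited. Points are (x, y).
fn connected(
    img: &[Vec<Rgb>],
    start: (usize, usize),
    goal: (usize, usize),
    tolerance: u32,
    max_pixels: usize,
) -> bool {
    let h = img.len();
    let w = img.first().map_or(0, |row| row.len());
    if start.0 >= w || start.1 >= h {
        return false;
    }
    let bg = img[start.1][start.0];
    let mut seen = HashSet::from([start]);
    let mut queue = VecDeque::from([start]);
    while let Some((x, y)) = queue.pop_front() {
        if (x, y) == goal {
            return true;
        }
        // wrapping_sub turns 0 - 1 into usize::MAX, which the bounds check rejects.
        let neighbours = [(x.wrapping_sub(1), y), (x + 1, y), (x, y.wrapping_sub(1)), (x, y + 1)];
        for (nx, ny) in neighbours {
            if nx >= w || ny >= h {
                continue;
            }
            if seen.len() >= max_pixels {
                break; // budget exhausted: stop expanding
            }
            if rgb_dist_sq(img[ny][nx], bg) > tolerance * tolerance {
                continue; // different colour: treat as a wall
            }
            if seen.insert((nx, ny)) {
                queue.push_back((nx, ny));
            }
        }
    }
    false
}
```

The budget is what keeps the check cheap on large uniform backgrounds: once it runs out, the fill stops and the two lines are reported as not connected.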
All checks consult VisualTextTuning::color_distance (default
LabCie76, see below) when comparing pixels. The divider checks
toggle together via divider_detection_enabled (default true);
disable on themes where shadow rasters or anti-aliased streaks
would trip the heuristic.
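A minimal sketch of the horizontal divider test may help when tuning these knobs. The field names copy the tuning parameters mentioned above, but the `DividerTuning` struct, the function names, the toy image representation, and the evenly spaced sampling scheme are assumptions. Raw RGB Euclidean distance is used here for brevity, in place of the configurable colour metric.

```rust
use std::ops::Range;

type Rgb = [u8; 3];

/// Hypothetical container for the documented knobs and their defaults.
struct DividerTuning {
    background_color_tolerance: f64,
    boundary_samples_per_axis: usize,
    boundary_majority_threshold: f64,
}

impl Default for DividerTuning {
    fn default() -> Self {
        Self {
            background_color_tolerance: 24.0,
            boundary_samples_per_axis: 16,
            boundary_majority_threshold: 0.8,
        }
    }
}

/// Plain RGB Euclidean distance (stand-in for the configurable metric).
fn rgb_distance(a: Rgb, b: Rgb) -> f64 {
    a.iter()
        .zip(b.iter())
        .map(|(&p, &q)| {
            let d = f64::from(p) - f64::from(q);
            d * d
        })
        .sum::<f64>()
        .sqrt()
}

/// A row is a divider when a large enough share of evenly spaced
/// samples differs from BOTH surrounding backgrounds.
fn is_divider_row(img: &[Vec<Rgb>], y: usize, xs: Range<usize>, upper_bg: Rgb, lower_bg: Rgb, t: &DividerTuning) -> bool {
    let samples = t.boundary_samples_per_axis;
    let span = xs.end.saturating_sub(xs.start);
    if samples == 0 || span == 0 {
        return false;
    }
    let differing = (0..samples)
        .filter(|&i| {
            let px = img[y][xs.start + i * span / samples];
            rgb_distance(px, upper_bg) > t.background_color_tolerance
                && rgb_distance(px, lower_bg) > t.background_color_tolerance
        })
        .count();
    (differing as f64) >= t.boundary_majority_threshold * (samples as f64)
}

/// Scan every row of the gap between two text lines.
fn gap_has_horizontal_divider(img: &[Vec<Rgb>], mut gap_rows: Range<usize>, xs: Range<usize>, upper_bg: Rgb, lower_bg: Rgb, t: &DividerTuning) -> bool {
    gap_rows.any(|y| is_divider_row(img, y, xs.clone(), upper_bg, lower_bg, t))
}
```

With the defaults, at least 13 of the 16 samples must differ before a row counts as a rule, so a stripe covering only half the overlap range does not veto the merge.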
Perceptual colour distance
ColorDistance controls how the visual locator compares pixel
colours, both for region detection (flood-fill, seed pick, shape
classification) and the boundary checks:
- Rgb: raw RGB Euclidean squared distance. Cheap, not perceptual. Use to reproduce legacy thresholds tuned against raw RGB.
- LabCie76 (default): ΔE*76 in CIE Lab space. Roughly perceptual (“a ΔE of 6 is barely noticeable, 12 is clearly different”), cheap (one sRGB→Lab conversion).
- LabCie2000: ΔE*00, perceptual gold standard. ~5× slower than CIE76; only worth it when CIE76 misclassifies subtle hue shifts in practice.
The default background_color_tolerance: 24 scales sensibly across
modes — RGB ΔE 24 maps to Lab ΔE76 ~6, both “near-identical
backgrounds”. When retuning, re-tune for the mode you switched to.
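For readers who want to see what the CIE76 mode computes, here is a self-contained sketch of the standard sRGB to CIE Lab conversion (D65 white point) followed by the ΔE*76 Euclidean distance. The function and type names are illustrative and are not taken from waydriver; the constants are the usual published sRGB/D65 values.

```rust
#[derive(Debug, Clone, Copy)]
struct Lab {
    l: f64,
    a: f64,
    b: f64,
}

/// Undo the sRGB transfer curve for one 8-bit channel.
fn srgb_to_linear(c: u8) -> f64 {
    let c = f64::from(c) / 255.0;
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

/// sRGB -> linear RGB -> XYZ (D65) -> CIE Lab.
fn rgb_to_lab([r, g, b]: [u8; 3]) -> Lab {
    let (r, g, b) = (srgb_to_linear(r), srgb_to_linear(g), srgb_to_linear(b));
    let x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b;
    let y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b;
    let z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b;
    // Piecewise cube-root function from the CIE Lab definition.
    let f = |t: f64| {
        const DELTA: f64 = 6.0 / 29.0;
        if t > DELTA * DELTA * DELTA {
            t.cbrt()
        } else {
            t / (3.0 * DELTA * DELTA) + 4.0 / 29.0
        }
    };
    // Normalise by the D65 reference white before applying f.
    let (fx, fy, fz) = (f(x / 0.95047), f(y), f(z / 1.08883));
    Lab {
        l: 116.0 * fy - 16.0,
        a: 500.0 * (fx - fy),
        b: 200.0 * (fy - fz),
    }
}

/// ΔE*76: plain Euclidean distance in Lab space.
fn delta_e76(p: [u8; 3], q: [u8; 3]) -> f64 {
    let (p, q) = (rgb_to_lab(p), rgb_to_lab(q));
    ((p.l - q.l).powi(2) + (p.a - q.a).powi(2) + (p.b - q.b).powi(2)).sqrt()
}
```

Pure white against pure black gives a ΔE of about 100 (the full lightness range), which is a convenient sanity check when experimenting with tolerance values.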
Multi-word and multi-line matching
OCR returns text as a tree of TextLines, each containing
TextWords. The matcher joins words with spaces and substring-
matches against the joined string. Two layers of join:
- Per-line for MatchMode::Exact. A line’s words are joined with spaces; the needle must equal the whole joined line. Use Exact to distinguish "Add account" from "Add account and continue".
- Per-block for MatchMode::Substring. The grouper builds multi-line blocks from geometrically-close lines (see block grouping). For each block, the matcher tries every joiner-choice variant: at each line break, it can use " " or "" independently, giving 2^(N−1) variants for a block of N lines (capped at N = 5; above that, fall back to the single space-join). This handles:
  - Wrapped multi-word labels: "Click here to learn more" matches whether the words wrapped onto one row or three (the space-join variant covers this).
  - Hyphenated wraps: "needle" matches an OCR result of ["nee", "dle"] (the no-space variant joins to "needle").
  - Ligature splits across lines (rare but possible): the Unicode normalization pass handles ligatures inside a single line already; the variants extend the same idea across breaks.
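The joiner-choice enumeration can be sketched with a bitmask over the line breaks. The function names and the constant are hypothetical, and the sketch skips the Unicode normalization step for brevity, so it only illustrates the 2^(N−1) variant generation and the cap at five lines.

```rust
/// Blocks with more lines than this fall back to a single space-join.
const MAX_VARIANT_LINES: usize = 5;

/// Every way of joining `lines`, using either " " or "" at each break.
/// Bit i of `mask` decides the joiner at break i (0 = space, 1 = nothing).
fn join_variants(lines: &[&str]) -> Vec<String> {
    let n = lines.len();
    if n == 0 {
        return Vec::new();
    }
    if n > MAX_VARIANT_LINES {
        return vec![lines.join(" ")];
    }
    let breaks = n - 1;
    (0..(1u32 << breaks))
        .map(|mask| {
            let mut joined = String::from(lines[0]);
            for (i, line) in lines.iter().enumerate().skip(1) {
                if (mask >> (i - 1)) & 1 == 0 {
                    joined.push(' ');
                }
                joined.push_str(line);
            }
            joined
        })
        .collect()
}

/// Substring match against any variant of the block.
fn block_contains(lines: &[&str], needle: &str) -> bool {
    join_variants(lines).iter().any(|v| v.contains(needle))
}
```

With the cap at five lines, a block produces at most 16 variants, so the extra matching cost stays small next to the OCR pass itself.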
When a substring match spans multiple words — on the same line or across lines — the returned bbox is the union of the matched words’ bboxes. For a single-line match this is the tight rectangle around the matched text. For a multi-line match it’s the AABB of every involved word, which can include vertical gaps between the text rows; the centroid still lands inside the matched text block, which is what you want for clicking and region seeding.
Trade-off of cross-line substring: unrelated labels on
adjacent lines can spuriously match across the line break (a search
for "account Remove" would hit text that read
"Add account / Remove account"). In practice nobody writes
selectors that way, and the user opted in to OCR because AT-SPI
couldn’t help — they’re already using a fuzzy tool. Use Exact
when you need line-precise semantics.
Introspection
Both VisualLocator and RegionLocator implement Debug, so
tracing::debug!("{loc:?}") or dbg!(loc) shows what the locator
represents:
VisualLocator { kind: "text-label", text: "Add account",
match_mode: Substring, region: Some(Rect { ... }),
timeout: None }
RegionLocator { kind: "visual-region",
bbox: Rect { x: 192, y: 158, width: 640, height: 92 },
centroid: (512, 204) }
The kind field is a constant string that makes the role explicit
in logs — "text-label" for OCR text matches, "visual-region" for
flood-fill shapes — so dumps tell you what the locator means without
having to follow the type back to its constructor.
VisualLocator also exposes the constructed-with values via
getters:
- text(): the search query.
- region(): the parent scope, if any.
- match_mode(): current matching strategy.
What VisualLocator::click does today
Click the centre of the OCR word’s bbox. Works when the text
glyphs sit inside the gesture controller’s hit-rect — a centred label
inside an AdwButtonRow, for instance.
Doesn’t always work:
- Checkboxes / toggles whose label and click target are separate widgets.
- Widgets sized much larger than their text, where clicking on the glyphs hits the inner label’s selection gesture instead of the surrounding container’s activation gesture.
For those cases, the region pipeline below is the escape hatch.
The template-matching pipeline
For widgets that have no on-screen text (icon-only buttons, image
links, custom-drawn glyphs), OCR can’t help. The
ImageLocator
path takes a reference PNG captured against a known-good
screenshot of the same app, and finds where that patch sits in the
current screen via classical normalized cross-correlation (NCC).
let icon = std::fs::read("references/save_icon.png")?;
session
    .find_image(&icon)?
    .with_threshold(0.9)
    .click()
    .await?;
// Or scoped to an AT-SPI parent (faster, fewer false positives):
let toolbar = session.locate("//ToolBar[@name='Main']");
toolbar
    .find_image(&icon).await?
    .click()
    .await?;
Algorithm
- Decode the template PNG once at find_image time.
- On each terminal-method call (bounds, click, …), take a fresh screenshot, crop to the optional scope rect, convert both target and template to grayscale.
- imageproc::template_matching::match_template with method CrossCorrelationNormalized: slide the template, scoring each position by NCC (Σ(a·b) / sqrt(Σa² · Σb²), in [0, 1], peaks at 1.0 for a perfect match).
- Walk the score grid for all peaks above the threshold (default 0.85), sort best-first, apply non-maximum suppression so neighbouring peaks within min(template_w, template_h) / 2 px collapse to one hit.
- Translate hit positions back into screen coords.
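The scoring and suppression steps can be written out in plain Rust to make the formula concrete. This is a sketch under stated assumptions, not the library's code path (which calls imageproc): the `Gray` type, `ncc_at`, and `find_hits` are hypothetical, and the suppression test uses a per-axis distance, which is one possible reading of "within min(template_w, template_h) / 2 px".

```rust
/// Grayscale image as row-major f32 samples (illustrative stand-in,
/// not imageproc's image type).
struct Gray {
    width: usize,
    height: usize,
    data: Vec<f32>,
}

impl Gray {
    fn at(&self, x: usize, y: usize) -> f32 {
        self.data[y * self.width + x]
    }
}

/// NCC score of `tpl` placed with its top-left corner at (x, y):
/// sum(a*b) / sqrt(sum(a^2) * sum(b^2)).
fn ncc_at(img: &Gray, tpl: &Gray, x: usize, y: usize) -> f32 {
    let (mut dot, mut img_sq, mut tpl_sq) = (0.0f32, 0.0f32, 0.0f32);
    for ty in 0..tpl.height {
        for tx in 0..tpl.width {
            let a = img.at(x + tx, y + ty);
            let b = tpl.at(tx, ty);
            dot += a * b;
            img_sq += a * a;
            tpl_sq += b * b;
        }
    }
    let denom = (img_sq * tpl_sq).sqrt();
    if denom == 0.0 { 0.0 } else { dot / denom }
}

/// Score every position, keep peaks at or above `threshold`, sort best-first,
/// then drop any peak that sits within `radius` px of an accepted hit.
fn find_hits(img: &Gray, tpl: &Gray, threshold: f32) -> Vec<(usize, usize, f32)> {
    if tpl.width == 0 || tpl.height == 0 || tpl.width > img.width || tpl.height > img.height {
        return Vec::new();
    }
    let mut peaks = Vec::new();
    for y in 0..=(img.height - tpl.height) {
        for x in 0..=(img.width - tpl.width) {
            let score = ncc_at(img, tpl, x, y);
            if score >= threshold {
                peaks.push((x, y, score));
            }
        }
    }
    peaks.sort_by(|a, b| b.2.total_cmp(&a.2));
    let radius = tpl.width.min(tpl.height) / 2;
    let mut hits: Vec<(usize, usize, f32)> = Vec::new();
    for p in peaks {
        // Assumption: "near" means within `radius` on both axes.
        let near = hits
            .iter()
            .any(|h| h.0.abs_diff(p.0) <= radius && h.1.abs_diff(p.1) <= radius);
        if !near {
            hits.push(p);
        }
    }
    hits
}
```

Because plain NCC does not subtract the mean, flat bright areas can score highly against bright templates, which is one reason the threshold matters in busy layouts.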
Threshold tuning
- 0.95+: very strict. Use when the reference was captured on the same machine, same theme, same DPI as the test run. Rejects most false positives in busy layouts.
- 0.85 (default): tolerant of subpixel antialias differences and minor lighting shifts.
- <0.70: likely matches something, but in a busy screen will probably match the wrong thing. If a known-good reference scores below 0.7, recapture it.
When to use this vs. find_by_text
| You want to click… | Use |
|---|---|
| A button with text | find_by_text("Save") |
| An icon-only button (Save icon, hamburger, X) | find_image(&icon_png) |
| A widget AT-SPI surfaces | Locator with an XPath selector |
| Something that wraps over multiple lines | find_by_text("Click here to learn more") |
OCR is the right choice whenever you can read the on-screen text. Template matching is the escape hatch for visual-only widgets.
Known failure modes
- DPI / scale change. A 32×32 reference captured on a 1× display won’t match a 64×64 render on a 2× display. The basic matcher does no scale search; recapture per DPI, or build an image pyramid wrapper if a workload demonstrates the need.
- Theme swap. Light → dark mode = all references stale.
- Antialias / font hinting drift. Same widget on a different GPU / fontconfig stack can score below 0.85. Lower the threshold or recapture.
- Animation / hover / focus mid-capture. Ripple effects, focus rings, hover highlights all change the pixels. Capture references in a steady state.
- Multiple identical icons on screen. bounds() errors out on ambiguous matches; use within(rect) to disambiguate.
Cost
One NCC pass over the haystack ≈ O(W·H·w·h) work. For a 1920×1080
screenshot and a 64×64 template, ~8 billion ops naïvely; modern
machines do this in 10–50 ms. Cropping with within(rect) cuts
the haystack and is the single best speedup. The implementation
calls match_template (single-threaded); if a workload demands
it, swapping to match_template_parallel is a one-line change.
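The operation count is easy to check by hand. The helper below is a hypothetical back-of-envelope calculator (not part of waydriver) that counts one multiply-accumulate per template pixel per valid template position.

```rust
/// Multiply-accumulate count for one naive NCC pass: every valid template
/// position times every template pixel. Hypothetical helper for estimates.
fn ncc_ops(hay_w: u64, hay_h: u64, tpl_w: u64, tpl_h: u64) -> u64 {
    if tpl_w == 0 || tpl_h == 0 || tpl_w > hay_w || tpl_h > hay_h {
        return 0;
    }
    let positions = (hay_w - tpl_w + 1) * (hay_h - tpl_h + 1);
    positions * tpl_w * tpl_h
}
```

`ncc_ops(1920, 1080, 64, 64)` gives 7,735,578,624, in line with the "~8 billion" figure. Cropping to a 640×92 scope drops that to 68,538,368, more than a hundredfold reduction, which is why scoping is the first thing to try.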
The region detection pipeline
When clicking text glyphs doesn’t fire the surrounding widget’s activation, we want a different click target: the centroid of the visually-distinct shape that contains the text. That’s typically a button pill, a row’s rounded rectangle, or a card frame.
The algorithm is a BFS flood-fill from a seed pixel adjacent to the OCR text bbox. A “region” is a contiguous block of pixels whose RGB Euclidean distance to a seed sample is within tolerance — a button’s fill, a row’s background, a card’s surface. Each iteration finds one enclosing region; iterating outward builds a chain.
┌──────────────────────────────────────────────┐
│ Inputs │
│ parent_bounds (AT-SPI Rect, screen coords) │
│ inner_bbox (OCR text bbox, screen coords)│
│ full_png (Session::take_screenshot) │
│ tuning (SessionConfig::visual_ │
│ region_tuning) │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ Crop full_png to parent_bounds │
│ Translate inner_bbox into crop coords │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ pick_seed_outside(inner_bbox, image) │
│ Try right / left / below / above the │
│ inner bbox, +4 px offset. Sanity-check │
│ uniformity vs a neighbouring pixel so we │
│ don't seed on glyph antialiasing fringe. │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ flood_fill(image, seed, tolerance) │
│ BFS, Vec<bool> visited grid. │
│ Add 4-neighbour pixels where │
│ ‖rgb(neighbour) - rgb(seed)‖₂ ≤ tolerance│
│ Track bbox + centroid as we go. │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ region_0 = { bbox, centroid } │
│ Translate back to screen coords. │
│ Push into result list. │
└────────────────────┬─────────────────────────┘
│
v (find_regions / first_region only)
┌──────────────────────────────────────────────┐
│ Stop? │
│ • region == previous region (no growth) │
│ • region covers entire crop │
│ • iteration count ≥ tuning.max_regions │
│ • pixel_just_outside(region) has nowhere │
│ to go (region touches all image edges) │
└────────────────────┬─────────────────────────┘
│ otherwise
v
┌──────────────────────────────────────────────┐
│ seed = pixel_just_outside(region.bbox) │
│ Loop back to flood_fill. │
└──────────────────────────────────────────────┘
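The flood_fill box in the diagram can be sketched as a small self-contained function. This is an illustration only: the `Region` struct and the function signature are hypothetical, and the sketch compares plain squared RGB distance, whereas the library's metric is selectable through `color_distance`.

```rust
use std::collections::VecDeque;

/// Result of one flood: inclusive bbox corners, mean pixel position, size.
/// (Illustrative stand-in, not waydriver's RegionLocator.)
#[derive(Debug, PartialEq)]
struct Region {
    min: (usize, usize),
    max: (usize, usize),
    centroid: (f64, f64),
    pixels: usize,
}

/// Squared Euclidean RGB distance between two pixels.
fn dist_sq(a: [u8; 3], b: [u8; 3]) -> u32 {
    (0..3)
        .map(|i| {
            let d = a[i] as i32 - b[i] as i32;
            (d * d) as u32
        })
        .sum()
}

/// BFS flood-fill over a row-major RGB image, 4-connected, comparing every
/// candidate pixel against the seed colour.
fn flood_fill(
    img: &[[u8; 3]],
    width: usize,
    height: usize,
    seed: (usize, usize),
    tolerance: u8,
) -> Option<Region> {
    if width == 0 || img.len() != width * height || seed.0 >= width || seed.1 >= height {
        return None;
    }
    let seed_rgb = img[seed.1 * width + seed.0];
    let tol_sq = (tolerance as u32) * (tolerance as u32);
    let mut visited = vec![false; width * height];
    visited[seed.1 * width + seed.0] = true;
    let mut queue = VecDeque::from([seed]);
    let (mut min, mut max) = (seed, seed);
    let (mut sum_x, mut sum_y, mut pixels) = (0u64, 0u64, 0usize);
    while let Some((x, y)) = queue.pop_front() {
        // Track bbox and centroid sums as pixels are dequeued.
        min = (min.0.min(x), min.1.min(y));
        max = (max.0.max(x), max.1.max(y));
        sum_x += x as u64;
        sum_y += y as u64;
        pixels += 1;
        // wrapping_sub turns an underflow into a value the bounds check rejects.
        let neighbours = [
            (x.wrapping_sub(1), y),
            (x + 1, y),
            (x, y.wrapping_sub(1)),
            (x, y + 1),
        ];
        for (nx, ny) in neighbours {
            if nx < width && ny < height {
                let idx = ny * width + nx;
                if !visited[idx] && dist_sq(img[idx], seed_rgb) <= tol_sq {
                    visited[idx] = true;
                    queue.push_back((nx, ny));
                }
            }
        }
    }
    let centroid = (sum_x as f64 / pixels as f64, sum_y as f64 / pixels as f64);
    Some(Region { min, max, centroid, pixels })
}
```

Seeding anywhere inside the same flat-coloured block yields the same `Region`, because the visited set depends only on connectivity and the seed colour.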
Why a centroid, not a bbox centre
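For axis-aligned rectangles, the bbox centre and the geometric centroid coincide, and the same holds for other symmetric shapes such as pills (rounded rectangles) and circles. For asymmetric shapes (an L-shaped selection, a polygon icon) the bbox centre can land outside the actual region. The centroid is the mean of every pixel position in the visited set, so it follows the bulk of the shape; it is always inside a convex shape and usually inside a mildly concave one, which is where you want to click.
For a 60×30 pill flood-filled from inside, the centroid lands at the pill’s geometric centre. For a circle, same. For an L-shaped selection or a polygon icon, the centroid is pulled toward the bulk of the shape, so clicks usually land on the widget. Strongly concave shapes (a thin L, a ring) are the exception: there the mean position can fall in the gap.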
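A small worked case shows the difference on an L-shaped mask. The helper name is hypothetical and the integer truncation is an assumption made for the example; it is not a statement about how waydriver rounds.

```rust
/// Returns (centroid, bbox_centre) of the set pixels in a row-major mask,
/// both truncated to integer pixel coordinates. Illustrative helper only.
fn centroid_and_bbox_centre(
    mask: &[bool],
    width: usize,
) -> Option<((usize, usize), (usize, usize))> {
    if width == 0 {
        return None;
    }
    let (mut sum_x, mut sum_y, mut count) = (0usize, 0usize, 0usize);
    let (mut min_x, mut min_y) = (usize::MAX, usize::MAX);
    let (mut max_x, mut max_y) = (0usize, 0usize);
    for (i, &set) in mask.iter().enumerate() {
        if !set {
            continue;
        }
        let (x, y) = (i % width, i / width);
        sum_x += x;
        sum_y += y;
        count += 1;
        min_x = min_x.min(x);
        min_y = min_y.min(y);
        max_x = max_x.max(x);
        max_y = max_y.max(y);
    }
    if count == 0 {
        return None;
    }
    let centroid = (sum_x / count, sum_y / count);
    let bbox_centre = ((min_x + max_x) / 2, (min_y + max_y) / 2);
    Some((centroid, bbox_centre))
}
```

For a 10×10 L made of a 4-px-wide left bar and a 4-px-tall bottom bar, the bbox centre (4, 4) falls in the empty corner while the centroid (3, 5) sits on the left bar.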
Shape classification
Each RegionLocator carries a coarse Shape
value derived from the flood-fill’s pixel-count vs bbox-area ratio
combined with a 4-corner sample. The classifier picks one of:
- Rectangle: fill ratio ≥ 0.97 and all four bbox corners match the seed colour. Bare GTK button interiors, AdwButtonRow contents.
- Pill: fill ratio ≥ 0.82 with 0–1 bbox corners inside. The corner radius trims the bbox corners off the shape. Most GTK button pills and Adw row backgrounds land here.
- Ellipse: fill ratio in 0.65–0.83 with 0 bbox corners inside. Round avatar buttons, circular close icons.
- Irregular: anything else. Polygon icons, regions with holes, shapes whose ratio doesn’t fit a primitive. Don’t trust bounds().center_*() here; use centroid().
The classification is best-effort, intended for assertions and log readability, not as a contract. Borderline cases (e.g. a rectangle with one pixel of antialiased corner darkening) can flip between categories. If a test branches on shape, treat unexpected classifications as a soft signal rather than an absolute fail.
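The rules above translate into a short decision chain. The sketch hard-codes the documented default thresholds; the function name, the stand-in `Shape` enum, and the order in which overlapping ranges are tested (Rectangle, then Pill, then Ellipse) are assumptions made for illustration.

```rust
/// Stand-in for the coarse shape value (illustrative; waydriver's own
/// Shape enum may carry more variants or data).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Shape {
    Rectangle,
    Pill,
    Ellipse,
    Irregular,
}

/// Classify a flood result from its fill ratio (pixels / bbox area) and the
/// number of bbox corners that belong to the region.
/// Assumption: checks run in this order, so the 0.82..0.83 overlap resolves
/// to Pill.
fn classify(pixel_count: usize, bbox_w: usize, bbox_h: usize, corners_inside: u8) -> Shape {
    let area = bbox_w * bbox_h;
    if area == 0 {
        return Shape::Irregular;
    }
    let ratio = pixel_count as f64 / area as f64;
    if ratio >= 0.97 && corners_inside == 4 {
        Shape::Rectangle
    } else if ratio >= 0.82 && corners_inside <= 1 {
        Shape::Pill
    } else if (0.65..=0.83).contains(&ratio) && corners_inside == 0 {
        Shape::Ellipse
    } else {
        Shape::Irregular
    }
}
```

A filled circle covers about π/4 ≈ 0.785 of its bbox, which is why the ellipse band is centred there; a 60×30 pill with fully rounded ends covers about 0.89.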
The seed for the flood doesn’t have to be at the centre of the
target region — flood-fill is a BFS that recovers the same bbox /
centroid / classification regardless of starting point, as long as
the seed lands somewhere inside the region. pick_seed_outside
aims ~4 px outside the OCR text bbox specifically to leave the
glyphs (which the flood treats as a separate region) and land on
the surrounding fill.
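Seed picking as described (try right, left, below, above at a fixed offset, and reject candidates that are not uniform with a pixel 2 px further out) might look like the sketch below. The function name, the tuple-based bbox, and the choice to probe at the bbox's mid-height or mid-width are assumptions; only the candidate order, the offset, and the squared-distance uniformity test come from the text.

```rust
/// Squared Euclidean RGB distance between two pixels.
fn rgb_dist_sq(a: [u8; 3], b: [u8; 3]) -> u32 {
    (0..3)
        .map(|i| {
            let d = a[i] as i32 - b[i] as i32;
            (d * d) as u32
        })
        .sum()
}

/// Pick a seed just outside `text` = (x, y, w, h), trying right, left, below,
/// above in that order. A candidate is accepted when it and the pixel 2 px
/// further out differ by less than `uniformity_sq`.
/// (Hypothetical sketch; not waydriver's pick_seed_outside.)
fn pick_seed_outside_sketch(
    img: &[[u8; 3]],
    width: i32,
    height: i32,
    text: (i32, i32, i32, i32),
    offset: i32,
    uniformity_sq: u32,
) -> Option<(i32, i32)> {
    if width <= 0 || height <= 0 || img.len() != (width * height) as usize {
        return None;
    }
    let (x, y, w, h) = text;
    let (cx, cy) = (x + w / 2, y + h / 2);
    // (candidate seed, unit direction pointing away from the text)
    let candidates = [
        ((x + w - 1 + offset, cy), (1, 0)),
        ((x - offset, cy), (-1, 0)),
        ((cx, y + h - 1 + offset), (0, 1)),
        ((cx, y - offset), (0, -1)),
    ];
    let pixel = |px: i32, py: i32| -> Option<[u8; 3]> {
        if px >= 0 && py >= 0 && px < width && py < height {
            Some(img[(py * width + px) as usize])
        } else {
            None
        }
    };
    for ((sx, sy), (dx, dy)) in candidates {
        if let (Some(seed), Some(outer)) = (pixel(sx, sy), pixel(sx + 2 * dx, sy + 2 * dy)) {
            if rgb_dist_sq(seed, outer) < uniformity_sq {
                return Some((sx, sy));
            }
        }
    }
    None
}
```

If a thin border sits 2 px beyond the first candidate, the uniformity test fails and the picker falls through to the next side.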
Tuning (SessionConfig::visual_region_tuning)
Every threshold the region pipeline uses is exposed on
VisualRegionTuning:
- tolerance: u8 (default 24). Distance threshold for “same region”, interpreted under color_distance. Glyph antialiasing pixels typically jump 60+ (RGB); subtle gradients within a button surface stay under 20. Lower the number when flood over-grows into adjacent widgets; raise it when flood under-grows because of gradients.
- color_distance: ColorDistance (default LabCie76). Which colour-distance metric to use. See perceptual colour distance.
- max_regions: usize (default 16). Safety cap on the iteration chain. Realistic widget tree depth is 3–5; the cap protects against pathological banded images.
- seed_uniformity_threshold_sq: u32 (default 100). Squared RGB distance below which the seed-pick treats a candidate seed and its 2-px-out neighbour as “uniform”. Raise on noisy backgrounds.
- shape_rectangle_min_ratio: f64 (default 0.97), shape_pill_min_ratio: f64 (default 0.82), shape_ellipse_ratio_range: (f64, f64) (default (0.65, 0.83)). Fill-ratio thresholds for shape classification.
MAX_PIXELS_PER_REGION is implicit and equal to the cropped image’s
total pixel count — the flood can’t escape it.
Tuning (SessionConfig::visual_text_tuning)
Knobs on
VisualTextTuning:
- multiline_max_gap_factor: f32 (default 0.6). See block grouping.
- multiline_x_slack_px: i32 (default 4).
- background_color_tolerance: u8 (default 24). Threshold for the bg-colour change check.
- divider_detection_enabled: bool (default true).
- ocr_context_padding_px: i32 (default 32). Padding added on every side of a cropped element before running OCR; gives the recognition head visual context that disambiguates small/low-contrast glyphs.
- boundary_samples_per_axis: usize (default 16), boundary_majority_threshold: f32 (default 0.8). Divider-scan density and the majority threshold.
- background_sample_radius: u32 (default 2). Radius of the averaged window used when sampling the bg colour at each boundary check. 0 falls back to a single-pixel sample.
- color_distance: ColorDistance (default LabCie76).
- connectivity_check_enabled: bool (default false), max_connectivity_pixels: usize (default 4096). Opt-in bounded flood-fill check; see block grouping.
Tuning (SessionConfig::visual_click_tuning)
Knobs on
VisualClickTuning
control the headless-mutter cold-start pointer workaround applied
by VisualLocator::click and RegionLocator::click:
- cold_start_warmup_enabled: bool (default true). Set to false on real hardware where the cold-start race doesn’t apply to fall through to a single motion + button-press.
- cold_start_warmup_offset_px: f64 (default 4.0). Distance of the warmup motion from the target.
- cold_start_motion_settle: Duration (default 60 ms). Sleep after each motion call.
- cold_start_press_settle: Duration (default 50 ms). Sleep between button-down and button-up.
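To make the interaction of these four knobs concrete, the sketch below expands a click into an ordered list of steps. Everything here is hypothetical: the `Step` enum, the `ClickTuning` stand-in, the direction of the warmup offset (to the right of the target), and the assumption that the disabled path skips the settles entirely. Only the default values are taken from the list above.

```rust
use std::time::Duration;

/// One primitive pointer action in the expanded click (illustrative).
#[derive(Debug, PartialEq)]
enum Step {
    MoveTo(f64, f64),
    Sleep(Duration),
    ButtonDown,
    ButtonUp,
}

/// Stand-in mirroring the documented defaults; not waydriver's struct.
struct ClickTuning {
    warmup_enabled: bool,
    warmup_offset_px: f64,
    motion_settle: Duration,
    press_settle: Duration,
}

impl Default for ClickTuning {
    fn default() -> Self {
        ClickTuning {
            warmup_enabled: true,
            warmup_offset_px: 4.0,
            motion_settle: Duration::from_millis(60),
            press_settle: Duration::from_millis(50),
        }
    }
}

/// Expand a click at `target` into primitive steps.
fn click_plan(target: (f64, f64), tuning: &ClickTuning) -> Vec<Step> {
    if !tuning.warmup_enabled {
        // Assumption: the plain path is one motion plus press and release.
        return vec![Step::MoveTo(target.0, target.1), Step::ButtonDown, Step::ButtonUp];
    }
    vec![
        // Assumption: the warmup motion lands to the right of the target.
        Step::MoveTo(target.0 + tuning.warmup_offset_px, target.1),
        Step::Sleep(tuning.motion_settle),
        Step::MoveTo(target.0, target.1),
        Step::Sleep(tuning.motion_settle),
        Step::ButtonDown,
        Step::Sleep(tuning.press_settle),
        Step::ButtonUp,
    ]
}
```

Under these assumptions a default click spends 170 ms in settles (two motion settles plus one press settle), which is the budget to keep in mind when lengthening them for slow CI.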
Model file verification
The auto-downloaded ocrs .rten model files are checksummed
against constants embedded in crates/waydriver/src/visual/models.rs:
- Cached file at session start: hashed, refused on mismatch (deleted + re-downloaded).
- Fresh download: hashed before the *.partial → *.rten rename; a corrupted download never becomes a cache hit.
- Env-var overrides (WAYDRIVER_OCRS_DETECTION_MODEL, WAYDRIVER_OCRS_RECOGNITION_MODEL) bypass verification: the user has explicitly pointed us at a file they control.
If upstream ocrs publishes new model files, they will no longer match the
embedded constants and verification will refuse to load them. Capture the
new hashes with sha256sum and update DETECTION_SHA256 / RECOGNITION_SHA256;
or set the env-var override at runtime as an escape hatch.
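The "hash before rename" ordering is the important part of the download path, and it can be sketched with the standard library alone. This is not the code in models.rs: `install_verified` is a hypothetical name, the digest is injected as a closure so a real SHA-256 implementation can be plugged in (the standard library has none), and deleting the partial file on mismatch is an assumption.

```rust
use std::{fs, io, path::Path};

/// Promote `partial` to `dest` only if its digest matches `expected`.
/// Returns Ok(true) when the file was installed, Ok(false) when it was
/// rejected. `digest` is supplied by the caller; in real use it would be a
/// SHA-256 hex digest. (Hypothetical sketch.)
fn install_verified(
    partial: &Path,
    dest: &Path,
    expected: &str,
    digest: impl Fn(&[u8]) -> String,
) -> io::Result<bool> {
    let bytes = fs::read(partial)?;
    if digest(&bytes) != expected {
        // Assumption: a bad download is discarded rather than kept around.
        fs::remove_file(partial)?;
        return Ok(false);
    }
    // The rename happens only after the hash check, so a corrupted
    // download can never appear under the final cache name.
    fs::rename(partial, dest)?;
    Ok(true)
}
```

Because the final name only ever appears via the rename, a crash or a truncated transfer leaves at most a stray partial file behind.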
Locator::list_text and Locator::list_labelled_regions — enumeration
When you want to discover what’s on screen rather than search for a specific label, two enumeration methods produce a complete map of the text-bearing widgets inside a Locator’s scope:
let dialog = session.locate("//Dialog[@name='Preferences']");
// Every OCR'd line inside the dialog, line text + union bbox.
let hits = dialog.list_text().await?;
for h in &hits {
    println!("{:?} at {:?}", h.text, h.bounds);
}
// Each line paired with its enclosing visual region. One flood-fill
// per label; the screenshot is taken once and reused.
for (label, region) in dialog.list_labelled_regions().await? {
    println!("{} ({:?}) inside {:?} shape", label.text, label.bounds, region.shape());
}
list_text returns Vec<TextHit> where each TextHit has the
joined line text and the union bbox of all words in that line.
There’s no substring filter — for searches use
find_by_text. Cost is one OCR pass over the
locator’s bounds (~50–200 ms cropped, ~200–500 ms full-screen).
list_labelled_regions adds a flood-fill per hit on top, returning
Vec<(TextHit, RegionLocator)>. Use it for:
- Test discovery / scaffolding. Print the full set of clickable text-bearing things in a dialog and pick targets interactively.
- Visual regression. Compare label set + region shapes between runs.
- Dynamic selection. “Click the first row whose label starts with Show”: list_labelled_regions, then filter, then click.
The cost is list_text plus N × flood-fill (typically ~10–30 ms
each). For a dialog with 15 labels the flood-fills add ~150–450 ms on top of the OCR pass, roughly 200–650 ms total when cropped.
Session::region_at(x, y) — pixel-based entry point
The lowest level in the visual stack. Skips both OCR and the AT-SPI
parent lookup — just flood-fills from the supplied screen pixel and
returns the RegionLocator for whatever contiguous-colour shape
contains that pixel.
// I already know there's a clickable thing near here.
let region = session.region_at(512, 365).await?;
match region.shape() {
    Shape::Pill | Shape::Rectangle => region.click().await?,
    _ => return Err(anyhow!("expected a button-shaped widget at the cursor")),
}
Useful for:
- Coordinate-driven tests (you know the layout because you wrote the fixture).
- Visual debugging: “what’s at this pixel?” Dump region and read its bbox/shape/centroid.
- Bridge code that already has coordinates from another source (a previous screenshot, a layout assertion, a logged event).
The seed pixel doesn’t need to be at the centre of the region. Flood-fill is deterministic: any pixel inside the target region recovers the same bbox / centroid / shape. The only thing that varies with the seed is which region you get — a pixel on a text glyph returns the glyph’s bbox; a pixel on the button fill returns the button’s bbox.
The three Locator methods
All of them resolve self’s AT-SPI bounds, take a fresh screenshot,
and call into the region pipeline.
- Locator::find_regions(&self, inner: &VisualLocator): full sweep. Returns Vec<RegionLocator> in outermost-first order: index 0 is the outermost region inside self’s bounds; the last element is the tightest region around inner. The order matches the call-site mental model (start at the parent, walk inward).
- Locator::first_region(&self, inner): outermost only (find_regions[0]). Runs the full sweep but skips the intermediate Vec allocations.
- Locator::last_region(&self, inner): innermost only (find_regions[last]). One flood-fill, no chain walk. Cheap. This is usually what you want: the button pill adjacent to the text.
Plus the convenience on VisualLocator:
- VisualLocator::parent_region(): equivalent to parent.last_region(self), but doesn’t require the caller to remember the parent locator. Requires the VisualLocator to have a parent scope (constructed via Locator::find_by_text or Session::find_by_text(...).within(rect)).
RegionLocator action surface
Parallels VisualLocator’s shape, minus anything that would need
AT-SPI handles:
- bounds() -> Rect: axis-aligned bounding rect of the flood.
- centroid() -> (i32, i32): pixel-set centre, the click target.
- click(): pointer click at the centroid. Uses the same motion-warmup-then-press pattern as VisualLocator::click to side-step headless mutter’s cold-start pointer-routing race.
- hover(): pointer move only.
- screenshot(): PNG cropped to bounds().
There is deliberately no fill, set_text, focus, or any
is_<state> predicate. Those need AT-SPI handles; a region is just
a bbox + centroid.
How they compose
// AT-SPI sees the parent dialog but not the lazy button inside it.
let dialog = session.locate("//Dialog[@name='Preferences']");
// Find the on-screen text "lazy-button" inside that dialog.
let text = dialog.find_by_text("lazy-button").await?;
// Click the centroid of the pill surrounding the text. One flood-fill
// from a seed adjacent to the OCR bbox: fastest of the three region
// methods because it doesn't walk the enclosure chain.
dialog.last_region(&text).await?.click().await?;
Three orthogonal layers:
| Layer | Input | Output | Cost |
|---|---|---|---|
| AT-SPI Locator | XPath | accessible refs | ms |
| VisualLocator | text + optional parent scope | text bboxes | 50–500 ms (OCR) |
| RegionLocator | text bbox + parent screenshot | shape + centroid | ~10–30 ms (flood) |
Each layer is opt-in. You reach down only when the layer above doesn’t work for your widget.
Cost summary
| Operation | Typical latency |
|---|---|
| AT-SPI locator (session.locate) | <10 ms |
| Session start — model download (first run) | 5–20 s |
| Session start — model load (no prewarm) | 1–2 s on first OCR call |
| Session start — model load (prewarm) | parallel with session boot |
| Session::find_by_text (full screen) | 200–500 ms |
| Locator::find_by_text (cropped) | 50–200 ms |
| Locator::last_region | +10–30 ms over OCR |
| Locator::find_regions (full sweep) | +30–100 ms (depends on chain depth) |
These latencies assume an optimized build. rten inference dominates OCR
cost and is roughly 30× slower at the dev profile’s opt-level 0: measured
~5–8 s per full-frame pass with optimized dependencies vs ~50–200 s without,
on CPU-only hosts. Consumers running the visual feature under cargo test
must add a dependency-only override to the workspace root Cargo.toml
(Cargo ignores profile overrides declared anywhere else — a library can’t
ship this for you):
[profile.dev.package."*"]
opt-level = 3
(waydriver’s own workspace root already applies this to just the rten/ocrs
crates, so in-repo contributors and the e2e suite get optimized OCR in
dev/test builds without the broader "*" override. The init warning below
still fires for in-repo debug builds — an opt-level override does not clear
cfg(debug_assertions) — and is a known false-positive there.)
The engine loader logs a warning at init when it detects a debug build. Two
further cost levers are already built in: a scoped Locator::find_by_text crops
the frame to the parent’s bounds before inference (fewer pixels, fewer text
lines — only the unscoped Session::find_by_text pays for the full frame),
and the per-frame OCR cache means repeated lookups on an unchanged screen
reuse a single pass.
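The idea behind a per-frame cache can be shown with a single-entry memo keyed on the frame contents. How waydriver actually keys or stores its cache is not described here, so treat `FrameCache`, `get_or_run`, and the hash-of-bytes key as assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Single-entry cache: remembers the result for the most recent frame.
/// (Hypothetical sketch, keyed on a hash of the raw frame bytes.)
struct FrameCache<T> {
    entry: Option<(u64, T)>,
    passes: usize,
}

impl<T: Clone> FrameCache<T> {
    fn new() -> Self {
        FrameCache { entry: None, passes: 0 }
    }

    /// Return the cached result if `frame` is unchanged; otherwise run the
    /// expensive pass once and remember its result.
    fn get_or_run<F: FnOnce(&[u8]) -> T>(&mut self, frame: &[u8], run: F) -> T {
        let mut hasher = DefaultHasher::new();
        frame.hash(&mut hasher);
        let key = hasher.finish();
        if let Some((cached_key, value)) = &self.entry {
            if *cached_key == key {
                return value.clone();
            }
        }
        let value = run(frame);
        self.passes += 1;
        self.entry = Some((key, value.clone()));
        value
    }
}
```

Several lookups against the same screenshot then cost one OCR pass plus cheap clones; any repaint changes the key and triggers a fresh pass.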
When to use what
- Default path: Locator::click against an XPath. Use this unless the widget doesn’t surface in AT-SPI.
- Widget renders text and isn’t in AT-SPI: Locator::find_by_text on the nearest AT-SPI parent, then .click(). Works when the text glyphs are inside the gesture-controller’s hit-rect (most AdwButtonRows, GTK buttons with centred labels).
- Text-center click doesn’t fire activation: parent.last_region(&text).click(). Uses the centroid of the enclosing visual shape, which is more robust for widgets where the inner label widget eats the click.
- You want the surrounding card / panel, not the button: parent.first_region(&text).click() or walk find_regions and pick the layer you want.
- No AT-SPI parent at all: Session::find_by_text(text).click() works but pays full-screen OCR cost; prefer constraining via .within(rect) whenever you can derive a scope.
Failure modes (known)
- Sibling-coloured regions merge. If the button shares its fill colour with an adjacent widget, flood-fill spans both. Lower tolerance and re-test.
- Gradient fills stop the flood early. A button with a top-to-bottom gradient may have RGB deltas exceeding tolerance partway down. Raise tolerance (carefully: too high and the flood eats neighbouring regions).
- Thin antialiased borders ≤ 2 px can confuse pick_seed_outside if the 4-px offset lands inside the border. The seed picker validates uniformity against a neighbouring pixel and falls back to the next candidate, but pathological cases still exist. Construct the VisualLocator with a tighter .within(...) or supply an explicit Rect to side-step.
- OCR misreads on small / low-contrast text. ocrs’s recognition head is trained on document text; UI labels at 10–14 px in dark themes can read poorly. The 32 px context-padding ring helps (tunable via VisualTextTuning::ocr_context_padding_px); raising the fixture’s font size if you control it helps more.
- Pointer cold-start race. Headless mutter sometimes drops the first pointer event after a fresh session. VisualLocator::click and RegionLocator::click both warmup-motion-then-click to side-step it, but a test that triggers many rapid clicks can still hit the race on subsequent clicks. Add a 60 ms sleep between clicks if you see this, or tune VisualClickTuning (disable the warmup on real hardware, lengthen the settles on slow CI).
- Custom theme with shadow rasters between rows. The divider scan can mistake anti-aliased shadow gradients for a horizontal rule and refuse to merge wrapped paragraphs. Set VisualTextTuning::divider_detection_enabled = false to fall back to bg-colour-only boundary detection.
- Stale model cache from upstream rebuild. SHA-256 verification refuses to load model files that don’t match the embedded hashes. If ocrs publishes new models, either bump the constants in models.rs or set WAYDRIVER_OCRS_DETECTION_MODEL / WAYDRIVER_OCRS_RECOGNITION_MODEL to point at known-good files.
- Right-to-left scripts and non-LTR reading order. The block grouper and the per-line haystack are built on the assumption that words read left-to-right within a line and lines read top-to-bottom within a block. Hebrew, Arabic, or any RTL script will produce word bboxes in screen-left-to-right order but the joined haystack won’t reflect logical reading order; substring matches against a logical-order needle may miss. Vertical scripts (Japanese/Chinese in tategaki) are not supported. If you’re driving an RTL app, prefer AT-SPI selectors; the visual locator’s matching semantics aren’t right for that case.
Implementation map
| What | Where |
|---|---|
| Session::find_by_text (root entry) | crates/waydriver/src/session.rs |
| Locator::find_by_text (scoped entry) | crates/waydriver/src/locator.rs |
| VisualLocator + OCR pipeline | crates/waydriver/src/visual/mod.rs |
| Model resolution + auto-download | crates/waydriver/src/visual/models.rs |
| Engine lifecycle (OnceCell shared cache) | crates/waydriver/src/visual/engine.rs |
| Flood-fill, seed picking, RegionLocator | crates/waydriver/src/visual/region.rs |
| Locator::find_regions/first_region/last_region | crates/waydriver/src/locator.rs |
| SessionConfig::visual_region_tuning | crates/waydriver/src/session.rs |
| Cargo feature visual | crates/waydriver/Cargo.toml |
| E2E test exercising both pipelines | crates/waydriver-e2e/tests/e2e.rs — lazy_a11y_*_clickable_via_visual_locator |
Architecture Notes
Keepalive ScreenCast stream
In headless mode, Mutter only composites (and delivers Wayland frame callbacks) when a ScreenCast consumer is pulling frames. Without an active stream, GTK4 apps render their first frame but never repaint — the frame clock never ticks.
Session::start opens a persistent ScreenCast stream that stays alive for the session’s lifetime. This keeps Mutter compositing continuously so frame callbacks flow and GTK4 apps repaint normally.
Input: RemoteDesktop vs AT-SPI
Two input paths are available, with different trade-offs:
- RemoteDesktop keyboard/pointer (press_keysym, pointer_button): events go through the full Wayland input pipeline (Mutter -> Wayland protocol -> GDK -> GTK event loop). GTK4 processes them normally and repaints. Use this for interactions that need to produce visible changes.
- AT-SPI actions (Locator::click() / focus() / set_text()): directly invoke widget signal handlers through the accessibility tree, targeted by XPath. Accurate and precise, but they update GTK4’s internal model without triggering compositor redraws. Useful for reading the accessibility tree and programmatic activation, but screenshots taken after AT-SPI-only interactions may show stale frames.
App isolation
Apps are launched with GSETTINGS_BACKEND=keyfile and XDG_CONFIG_HOME pointing to the per-session runtime directory. This bypasses the host dconf daemon entirely, so each session starts with default app state and never reads or writes the user’s settings.
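In terms of process spawning, the isolation amounts to two environment variables on the child. The sketch below uses only `std::process::Command`; the function name and the example paths are hypothetical, and the real launcher sets additional variables (the Wayland display, for instance) that are not shown.

```rust
use std::path::Path;
use std::process::Command;

/// Build (but do not spawn) a command whose GSettings and config lookups
/// are confined to `runtime_dir`. Hypothetical helper for illustration.
fn isolated_command(program: &str, runtime_dir: &Path) -> Command {
    let mut cmd = Command::new(program);
    // keyfile backend: settings live in a file under XDG_CONFIG_HOME
    // instead of going through the host dconf daemon.
    cmd.env("GSETTINGS_BACKEND", "keyfile")
        .env("XDG_CONFIG_HOME", runtime_dir);
    cmd
}
```

Since the per-session directory starts empty, the app finds no stored settings and falls back to its schema defaults.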
Dual D-Bus
GTK4’s built-in AT-SPI backend only registers on the host session bus — it ignores custom DBUS_SESSION_BUS_ADDRESS. So each session uses two D-Bus connections:
- Host session bus: AT-SPI communication with the app
- Private D-Bus: Mutter’s ScreenCast and RemoteDesktop APIs (isolated from the host compositor)
graph LR
subgraph Host
host_dbus["Host session bus"]
end
subgraph Session["Per-session"]
private_dbus["Private D-Bus"]
mutter["Mutter"]
app["Your app"]
waydriver["WayDriver"]
end
waydriver -- "AT-SPI" --> host_dbus
app -- "AT-SPI register" --> host_dbus
waydriver -- "ScreenCast\nRemoteDesktop" --> private_dbus
mutter -- "org.gnome.Mutter.*" --> private_dbus
External-effect sinks
Some app behaviours leave the process entirely (a desktop notification, an “open this URL” portal request), so they have no AT-SPI projection to query. With capture_external_effects enabled, waydriver mocks the daemons that would receive those calls on the app’s session bus (org.freedesktop.Notifications and org.freedesktop.portal.Desktop’s OpenURI), records each call, and exposes them via get_captured_effects / Session::notifications() / open_uri_requests(). It’s opt-in because the sinks own well-known names: safe on the per-session/container bus, a no-op (with a warning) on a shared host bus that already runs a real daemon.
Clipboard / PRIMARY-selection readback is not available: Mutter 46.2 exposes no clipboard D-Bus interface and implements neither wlr-data-control nor ext-data-control-v1, so there’s no out-of-band way to read the selection. The working stopgap is to paste into the app (Ctrl+V / middle-click) and read the result back through the AT-SPI Text interface (read_text / Locator::text).
Screenshot and recording pipeline
graph LR
screencast["Mutter ScreenCast API"]
monitor["RecordMonitor\n(virtual monitor)"]
pw_keep["PipeWire stream\n(keepalive — screenshots)"]
pw_rec["PipeWire stream\n(dedicated — recording)"]
gst_shot["On-demand GStreamer pipeline\n(pngenc snapshot=true)"]
gst_rec["Long-lived GStreamer pipeline\n(vp8enc + webmmux)"]
png["PNG bytes"]
webm["WebM file"]
screencast --> monitor
monitor --> pw_keep --> gst_shot --> png
monitor --> pw_rec --> gst_rec --> webm
Screenshots and recording use separate ScreenCast streams. take_screenshot spins up a transient pngenc pipeline on the keepalive stream on each call; recording runs a single vp8enc ! webmmux ! filesink pipeline for the session’s lifetime on its own dedicated stream, flushed with EOS on Session::kill so the WebM is seekable. They must not share a node: mutter’s screencast node only emits frames on screen damage (framerate=0/1), and a continuous recorder consumer would starve a later-attaching screenshot consumer of the initial frame on a static app — so the recorder gets its own stream and the screenshot path stays the keepalive node’s first/triggering consumer. Both use the GStreamer Rust bindings (gstreamer + gstreamer-app crates) and only gst-plugins-good (no -bad/-ugly).
API Reference
The full, type-level API reference is generated by rustdoc and hosted on docs.rs. It always reflects the latest published release:
| Crate | Reference |
|---|---|
| waydriver: core traits, Session, locators, keysym helpers | docs.rs/waydriver |
| waydriver-compositor-mutter: CompositorRuntime impl | docs.rs/waydriver-compositor-mutter |
| waydriver-input-mutter: InputBackend impl | docs.rs/waydriver-input-mutter |
| waydriver-capture-mutter: CaptureBackend impl | docs.rs/waydriver-capture-mutter |
For the high-level tool surface exposed to AI assistants, see the MCP Server chapter rather than the rustdoc.
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.3.4 - 2026-06-16
Added
- (atspi) read cache-only row values via text_ref/value_ref/selected_text_ref
Other
- (atspi) measure event-cache behavior and specify the design (#11)
- restore the GitHub-native video URL in the README
- (atspi) parallelize the tree-walk snapshot (#11)
- add a download-link fallback to the README demo video
- serve the demo video from the repo instead of a dead GitHub URL
- slim README to a landing page, defer detail to waydriver.io
- add mdBook documentation site with GitHub Pages deploy
- (atspi) benchmark tree-walk cost on a large synthetic tree (#11)
0.3.3 - 2026-06-14
Added
- (locator) drag_to_coords for off-window drop endpoints
- (gaction) drive app./win. GActions via org.gtk.Actions (#33)
- external-effect sinks (notifications/portal) + single-instance CLI forwarding
- (atspi) activate cache-only accessibles by (bus, path) ref
- (visual) add perceptual baseline-compare primitive
- live GSettings writes + AT-SPI Value/scroll readback
- (visual) per-search OCR upscale via VisualLocator::with_upscale (#23)
- (session) expose key_down/key_up for held-modifier gestures
Fixed
- (session) claim external-effect sink names before launching the app
- (atspi) resolve LABELLED_BY names for cache-only rows
Other
- release v0.3.2
- add cloud-env (non-Nix) dev tooling for SessionStart and Fedora container
- limit rustdoc to the crate itself (--no-deps)
0.3.2 - 2026-06-14
Added
- (locator) drag_to_coords for off-window drop endpoints
- (gaction) drive app./win. GActions via org.gtk.Actions (#33)
- external-effect sinks (notifications/portal) + single-instance CLI forwarding
- (atspi) activate cache-only accessibles by (bus, path) ref
- (visual) add perceptual baseline-compare primitive
- live GSettings writes + AT-SPI Value/scroll readback
- (visual) per-search OCR upscale via VisualLocator::with_upscale (#23)
- (session) expose key_down/key_up for held-modifier gestures
Fixed
- (session) claim external-effect sink names before launching the app
- (atspi) resolve LABELLED_BY names for cache-only rows
Other
- add cloud-env (non-Nix) dev tooling for SessionStart and Fedora container
- limit rustdoc to the crate itself (--no-deps)
0.3.1 - 2026-06-13
Added
- (atspi) read lazily-realized widgets via focus_walk + cache
Fixed
- (pointer) translate window-relative AT-SPI bounds to screen space
- (session) isolate XDG state/data/cache dirs + verify reported bugs live
Other
- (visual) warn on debug-built OCR stack + document the ~30x cost
0.3.0 - 2026-06-12
Added
- [breaking] harden visual-OCR, locator, and key-chord paths
Fixed
- (capture) stop pipewire runtime-dir nesting overflow at the root
0.2.10 - 2026-06-08
Fixed
- (mcp) keep start_session from hanging on stalled setup
0.2.9 - 2026-06-06
Added
- (gsettings) per-session GSettings isolation via keyfile backend
- (scale) custom display scale (HiDPI) for sessions
Other
- apply cargo fmt
0.2.8 - 2026-06-05
Fixed
- (compositor-mutter) snapshot host runtime root to keep session dirs flat
0.2.7 - 2026-06-03
Fixed
- (capture) give the video recorder its own ScreenCast stream
0.2.6 - 2026-05-24
Added
- (mcp) expose visual locator tools (OCR, template match, stdout wait)
- (visual) opt-in visual locator stack — OCR, flood-fill regions, template matching
Other
- (visual) apply rustfmt to visual locator stack
0.2.5 - 2026-05-13
Added
- (locator) pointer-click fallback for widgets without AT-SPI Action
0.2.4 - 2026-05-12
Fixed
- (session) prime mutter keyboard focus to prevent first-keypress drop
0.2.3 - 2026-04-29
Fixed
- (compositor-mutter) scrub stale PIPEWIRE_REMOTE before spawning per-session pipewire stack
Other
- (readme) embed gnome-calculator demo video
0.2.2 - 2026-04-26
Added
- (locator) pointer-click fallback when fill target lacks Component::grab_focus
- (input) thread CancellationToken through InputBackend for prompt kill
- (locator) element-scoped pointer actions (hover, double_click, right_click, drag_to)
- (locator) Locator::select_option via AT-SPI Selection interface
- (input) Locator::scroll_into_view with AT-SPI + wheel fallbacks
- (input) Locator::fill(), absolute pointer motion, Session::type_text (WAY-5)
- (atspi) capture element bounds via Component::get_extents
Fixed
- (mcp) kill_session no longer blocks on in-flight tool auto-waits
Other
- release v0.2.1
- refresh AGENTS.md and README.md for current API surface
- workspace-wide audit pass tightening trait surfaces and error types
- (mcp) split tool handlers into per-concern modules
- (mcp) split monolithic main.rs into focused modules
- split e2e tests into waydriver-e2e crate, add configurable video_fps
- (error) preserve typed error sources on Atspi/Process/Screenshot
- (compositor-mutter) separate doc paragraph before stage rationale
0.2.1 - 2026-04-26
Added
- (locator) pointer-click fallback when fill target lacks Component::grab_focus
- (input) thread CancellationToken through InputBackend for prompt kill
- (locator) element-scoped pointer actions (hover, double_click, right_click, drag_to)
- (locator) Locator::select_option via AT-SPI Selection interface
- (locator) layered wait_for / wait_until / wait_until_async primitives
- (input) Locator::scroll_into_view with AT-SPI + wheel fallbacks
- (input) Locator::fill(), absolute pointer motion, Session::type_text (WAY-5)
- (atspi) capture element bounds via Component::get_extents
- (locator) add richer AT-SPI state predicates and matching waiters
Fixed
- (session) bound kill latency with AT-SPI method timeout and shutdown budget
- (mcp) kill_session no longer blocks on in-flight tool auto-waits
Other
- refresh AGENTS.md and README.md for current API surface
- workspace-wide audit pass tightening trait surfaces and error types
- (error) preserve typed error sources on Atspi/Process/Screenshot
- split e2e tests into waydriver-e2e crate, add configurable video_fps
- (mcp) split tool handlers into per-concern modules
- (mcp) split monolithic main.rs into focused modules
- (compositor-mutter) separate doc paragraph before stage rationale
0.2.0 - 2026-04-24
Added
- (fixture) GTK4/libadwaita e2e fixture with stdout event capture
- (input) keyboard chord support via key_down/key_up primitives
- (atspi) Locator::focus via Component::grab_focus
- (atspi) auto-wait and explicit wait_for_* on Locator
- (atspi) [breaking] XPath-based locator API over AT-SPI tree
- (capture) WebM video recording for sessions
- (mcp) configurable virtual-monitor resolution
- (mcp) per-session event log and static HTML viewer
- (mcp) configurable report dir with per-session screenshot counter
Other
- update README and AGENTS.md for Locator API
- (release) move CHANGELOG.md into waydriver crate with root symlink
- (release) consolidate per-crate changelogs into workspace CHANGELOG
- (mcp) drop flaky second-screenshot assertion in e2e
0.1.3 - 2026-04-17
Added
- add publishable builder image and document multi-language dev workflows
0.1.2 - 2026-04-16
Added
- (mcp) add Docker packaging and container-based e2e test
0.1.1 - 2026-04-16
Added
- add MCP server for AI-driven headless UI testing
Other
- add rustdoc comments to public API surface
- add per-distro dependency tables and install commands to README
Contributing Guide
The canonical guidance for working in this repository — development environment,
build/test commands, the architecture deep-dive, testing notes, the CI pipeline,
and commit-message conventions — lives in
AGENTS.md
at the repo root. It is written for both human contributors and AI coding
assistants. This page covers the one thing most contributors hit first: building
without Nix.
Developing without Nix
Contributors who don’t use Nix can build and test the workspace directly once the system packages are installed. Two repo helpers automate this:
- .claude/hooks/session-start.sh installs the build and runtime packages, ensures the rustfmt/clippy rustup components are installed, and warms the crate cache. It is gated on $CLAUDE_CODE_REMOTE, so it only runs in the Claude Code cloud env; on another machine, run the apt/dnf/pacman command for your distro instead.
- scripts/dev-container.sh drops you into a Fedora 42 shell (matching the Dockerfile/CI) with your working tree bind-mounted, for building waydriver-fixture-gtk and running the native e2e suite. Both need libadwaita ≥ 1.6, so they can’t build on Ubuntu 24.04 (which ships 1.5).
On a non-Nix host, build and test the rest of the workspace with
--exclude waydriver-fixture-gtk, and set GST_PLUGIN_PATH, XDG_DATA_DIRS, and
the at-spi2-core/libexec path yourself when running the raw binary (the
nix run .#mcp wrapper that injects these is Nix-only). See
AGENTS.md
for details.
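As a rough sketch of the non-Nix setup described above, the snippet below defines a helper that builds and tests the workspace minus the GTK fixture, and exports two of the environment variables with fallback values. The function name is hypothetical and not part of the repo, and the paths are assumptions (a Fedora-style GStreamer plugin directory and the XDG default data dirs), so check them against your distro. The at-spi2-core libexec path is left out because how it is passed to the binary is documented in AGENTS.md, not here.

```shell
# Hypothetical helper, not part of the repo: build and test everything
# except the GTK fixture crate, which needs libadwaita >= 1.6.
# cargo requires --workspace for --exclude to take effect.
build_and_test_without_fixture() {
    cargo build --workspace --exclude waydriver-fixture-gtk &&
        cargo test --workspace --exclude waydriver-fixture-gtk
}

# Assumed locations; values already set in the environment are kept.
# /usr/lib64/gstreamer-1.0 is the usual 64-bit Fedora plugin directory;
# Debian/Ubuntu and Arch use different paths.
export GST_PLUGIN_PATH="${GST_PLUGIN_PATH:-/usr/lib64/gstreamer-1.0}"

# Fallback taken from the XDG Base Directory default.
export XDG_DATA_DIRS="${XDG_DATA_DIRS:-/usr/local/share:/usr/share}"

# The at-spi2-core libexec path must also be supplied when running the
# raw binary; see AGENTS.md for how it is passed.
```

Sourcing this in a shell and then calling `build_and_test_without_fixture` from the repo root runs both cargo steps; nothing is built until the function is invoked.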