Introduction

WayDriver is a Rust library for headless GUI application testing on Wayland. It launches apps in isolated compositor sessions, interacts with them via AT-SPI accessibility APIs, and captures screenshots and WebM video via PipeWire.

The repo also contains waydriver-mcp, a standalone Model Context Protocol server binary built on top of the library that lets AI assistants drive GTK4 apps directly — see MCP Server.

Crates.io · API docs (docs.rs) · GitHub · License: Apache-2.0

Demo

The clip below is the full output of crates/waydriver-examples/examples/gnome_calculator.rs, runnable with cargo run -p waydriver-examples --example gnome_calculator. Read the source for the API surface in context — it covers a session lifecycle, AT-SPI button clicks, keyboard chord dispatch (Shift+9/Shift+0 for parens), a typed unit conversion, and per-step result verification via XPath locators. The recording is captured by waydriver itself via PipeWire.

How it works

Each test session creates an isolated environment with a headless compositor, input injection, and screen capture:

graph TD
    subgraph Session["Per-session processes"]
        dbus["dbus-daemon (private)"]
        dbus --- mutter["Mutter --headless --wayland"]
        mutter --- screencast["ScreenCast API (screenshots)"]
        mutter --- remotedesktop["RemoteDesktop API (input)"]
        dbus --- pipewire["PipeWire (frame capture)"]
        dbus --- wireplumber["WirePlumber (PipeWire graph manager)"]

        app["Your app (on Mutter's Wayland display)"]
        app --- atspi["AT-SPI (accessibility tree, actions)"]
    end

The library is backend-agnostic. Three traits define the interface:

  • CompositorRuntime — lifecycle of a headless compositor (start, stop, expose Wayland display)
  • InputBackend — keyboard and pointer injection
  • CaptureBackend — screen capture (start/stop PipeWire streams, grab PNG frames)

Concrete implementations are separate crates. The trait-based design allows backends to be added as sibling crates without changing the core.
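
To make the trait-based split concrete, here is a minimal sketch of how a core type can hold boxed trait objects so that backends live in sibling crates. The method names (`start`, `wayland_display`, `press_key`, `grab_png`), the synchronous signatures, and the `Fake*` types are all assumptions made for illustration; the real waydriver traits are async and their methods may differ.

```rust
// Hypothetical, simplified stand-ins for the three traits. The real waydriver
// traits are async and their methods differ; only the overall shape is shown.
trait CompositorRuntime {
    fn start(&mut self) -> Result<(), String>;
    fn wayland_display(&self) -> Option<String>;
}

trait InputBackend {
    fn press_key(&mut self, key: &str) -> Result<(), String>;
}

trait CaptureBackend {
    fn grab_png(&mut self) -> Result<Vec<u8>, String>;
}

// The core stores trait objects only, so it never names a concrete backend.
struct Session {
    compositor: Box<dyn CompositorRuntime>,
    input: Box<dyn InputBackend>,
    capture: Box<dyn CaptureBackend>,
}

impl Session {
    fn start(
        mut compositor: Box<dyn CompositorRuntime>,
        input: Box<dyn InputBackend>,
        capture: Box<dyn CaptureBackend>,
    ) -> Result<Session, String> {
        compositor.start()?;
        Ok(Session { compositor, input, capture })
    }
}

// Made-up backends, standing in for what a sibling crate would provide.
struct FakeCompositor {
    display: Option<String>,
}

impl CompositorRuntime for FakeCompositor {
    fn start(&mut self) -> Result<(), String> {
        self.display = Some("wayland-fake-0".to_string());
        Ok(())
    }
    fn wayland_display(&self) -> Option<String> {
        self.display.clone()
    }
}

struct FakeInput;

impl InputBackend for FakeInput {
    fn press_key(&mut self, key: &str) -> Result<(), String> {
        if key.is_empty() {
            Err("empty key name".to_string())
        } else {
            Ok(())
        }
    }
}

struct FakeCapture;

impl CaptureBackend for FakeCapture {
    fn grab_png(&mut self) -> Result<Vec<u8>, String> {
        // Only the 8-byte PNG signature, not a decodable image.
        Ok(vec![0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A])
    }
}
```

Because `Session` only sees `Box<dyn ...>`, a new backend in this sketch needs nothing more than an `impl` of the relevant trait.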

Backend support

| Feature | Mutter | KWin | Sway |
|---|---|---|---|
| Headless compositor | Yes | | |
| Keyboard input | Yes (RemoteDesktop) | | |
| Pointer input | Yes (RemoteDesktop) | | |
| Screenshots | Yes (ScreenCast + PipeWire) | | |
| Video recording (WebM/VP8) | Yes (ScreenCast + PipeWire) | | |
| AT-SPI (UI inspection, clicks) | Yes | | |

Currently only Mutter is implemented (waydriver-compositor-mutter, waydriver-input-mutter, waydriver-capture-mutter). Each compositor has its own APIs (Mutter uses org.gnome.Mutter.* D-Bus interfaces, KWin has org.kde.KWin.*, Sway uses wlroots Wayland protocols), so each would need its own set of backend crates.

Crate structure

| Crate | Purpose |
|---|---|
| waydriver | Trait definitions, Session, AT-SPI client, keysym helpers, shared GStreamer capture helper |
| waydriver-compositor-mutter | CompositorRuntime impl: manages Mutter, PipeWire, WirePlumber, private D-Bus |
| waydriver-input-mutter | InputBackend impl: keyboard/pointer via Mutter RemoteDesktop |
| waydriver-capture-mutter | CaptureBackend impl: screenshots via Mutter ScreenCast + PipeWire |
| waydriver-mcp | Binary: MCP JSON-RPC server over stdio that exposes the library to AI assistants |

Getting Started

Requirements

All dependencies are provided by the Nix flake (nix develop). If not using Nix, you need the following system packages.

Build dependencies

| Debian/Ubuntu | Fedora | Arch |
|---|---|---|
| pkg-config | pkg-config | pkg-config |
| libglib2.0-dev | glib2-devel | glib2 |
| libgstreamer1.0-dev | gstreamer1-devel | gstreamer |
| libgstreamer-plugins-base1.0-dev | gstreamer1-plugins-base-devel | gst-plugins-base |

Runtime dependencies

| Debian/Ubuntu | Fedora | Arch |
|---|---|---|
| mutter | mutter | mutter |
| pipewire | pipewire | pipewire |
| wireplumber | wireplumber | wireplumber |
| gstreamer1.0-plugins-base | gstreamer1-plugins-base | gst-plugins-base |
| gstreamer1.0-plugins-good | gstreamer1-plugins-good | gst-plugins-good |
| gstreamer1.0-pipewire | gstreamer1-plugins-pipewire | gst-plugin-pipewire |
| at-spi2-core | at-spi2-core | at-spi2-core |
| dbus | dbus | dbus |

Quick install:

# Debian/Ubuntu
sudo apt install pkg-config libglib2.0-dev libgstreamer1.0-dev \
  libgstreamer-plugins-base1.0-dev mutter pipewire wireplumber \
  gstreamer1.0-plugins-base gstreamer1.0-plugins-good \
  gstreamer1.0-pipewire at-spi2-core dbus

# Fedora
sudo dnf install pkg-config glib2-devel gstreamer1-devel \
  gstreamer1-plugins-base-devel mutter pipewire wireplumber \
  gstreamer1-plugins-base gstreamer1-plugins-good \
  gstreamer1-plugins-pipewire at-spi2-core dbus

# Arch
sudo pacman -S pkg-config glib2 gstreamer gst-plugins-base \
  gst-plugins-good gst-plugin-pipewire mutter pipewire \
  wireplumber at-spi2-core dbus

Add WayDriver to your project

Add the core library plus the Mutter backend crates:

cargo add waydriver waydriver-compositor-mutter waydriver-input-mutter waydriver-capture-mutter

WayDriver’s API is async, so you’ll also want a Tokio runtime:

cargo add tokio --features full

Usage

use std::sync::Arc;
use waydriver::{Session, SessionConfig, CompositorRuntime};
use waydriver_compositor_mutter::MutterCompositor;
use waydriver_input_mutter::MutterInput;
use waydriver_capture_mutter::MutterCapture;

// Assumption: the crates' error types convert into `Box<dyn Error>`.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut compositor = MutterCompositor::new();
    compositor.start(None).await?;
    // `state()` is `Option`; immediately after a successful `start()` it is
    // always `Some`; `expect` documents that invariant locally.
    let state = compositor.state().expect("state available after start");
    let input = MutterInput::new(state.clone());
    let capture = MutterCapture::new(state);

    let session = Arc::new(Session::start(
        Box::new(compositor),
        Box::new(input),
        Box::new(capture),
        SessionConfig {
            command: "your-gtk-app".into(),
            args: vec![],
            cwd: None,
            app_name: "your-gtk-app".into(),
            // Record the entire session to a WebM file. Set to `None` to skip.
            video_output: Some("/tmp/session.webm".into()),
            video_bitrate: None, // defaults to waydriver::capture::DEFAULT_VIDEO_BITRATE (2 Mbps)
            video_fps: None,     // defaults to waydriver::capture::DEFAULT_VIDEO_FPS (15)
        },
    ).await?);

    // Take a screenshot (returns PNG bytes).
    let png = session.take_screenshot().await?;

    // Target widgets with XPath selectors over the AT-SPI tree. Actions
    // auto-wait for the element to be visible + enabled before firing.
    session.locate("//Button[@name='primary-button']").click().await?;
    session.locate("//Text[@name='search']").set_text("hello").await?;

    // Keyboard input with modifier chords.
    session.press_chord("Ctrl+Shift+S").await?;

    // Explicit waits when auto-wait isn't enough, e.g. an item appearing
    // after some async work.
    session.locate("//Label[@name='status']")
        .wait_for_text(|t| t == "ready")
        .await?;

    // Inspect the tree while debugging selectors.
    let xml = session.dump_tree().await?;
    println!("{xml}");

    // Tear the session down; this needs sole ownership of the `Session`.
    Arc::try_unwrap(session).unwrap().kill().await?;
    Ok(())
}

Next: the Locator API reference covers the full action surface, and the MCP Server chapter shows how to drive apps from an AI assistant without writing Rust.

Locator API

Session::locate(xpath) returns a lazy Locator — each action re-snapshots the AT-SPI tree and re-resolves the selector, so you don’t have to worry about stale element handles. Common methods:

| Method | What it does |
|---|---|
| click() / double_click() / right_click() | Invoke the AT-SPI Action interface (primary, secondary, tertiary actions) |
| hover() / drag_to(target) / drag_to_coords(x, y) | Pointer-driven hover and drag; lands on real Wayland input events for repaint. drag_to_coords releases at raw screen coordinates, so the drop can land off-window (e.g. libadwaita tab drag-out) |
| focus() / scroll_into_view() | Component::grab_focus and scroll_to/scroll_to_point |
| set_text(s) / fill(s) | Direct EditableText write vs. focus-and-type fallback for widgets without EditableText (e.g. GtkTextView) |
| select_option(by) | Pick a child of a Selection-interface container by label or index |
| text() | Read via the Text interface |
| count() / all() / inspect_all() | Multi-match: count, list of locators, full metadata in one snapshot |
| name() / role() / attribute(k) / attributes() / bounds() | Accessible name, role, AT-SPI attributes, screen-relative bounds |
| is_showing() / is_enabled() | State predicates |
| wait_for_visible() / _hidden() / _enabled() / _count(n) / _text(pred) | Block until state or predicate holds |
| wait_for(pred) / wait_until(pred) / wait_until_async(pred) | General-purpose predicate auto-waits |
| with_timeout(d) | Per-call override of the auto-wait timeout |
| nth(i) / first() / last() / parent() / locate(sub_xpath) | Compose sub-locators |

Single-target actions (click, focus, set_text, text, …) error with AmbiguousSelector if the selector matches more than one element. Narrow with .nth(i) or a more specific XPath.
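
The auto-wait behaviour in the table above (re-resolve on every attempt until a state holds or a timeout expires) can be sketched as a plain polling loop. This is a synchronous, std-only illustration; `wait_until` here is a hypothetical free function and is not the library's async `Locator::wait_until`.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `probe` repeatedly until it returns `Some`, or fails once `timeout`
/// has elapsed. `probe` stands in for "re-snapshot the tree, re-resolve the
/// selector, check the predicate".
fn wait_until<T>(
    timeout: Duration,
    poll: Duration,
    mut probe: impl FnMut() -> Option<T>,
) -> Result<T, String> {
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(value) = probe() {
            return Ok(value);
        }
        if Instant::now() >= deadline {
            return Err(format!("timed out after {timeout:?}"));
        }
        sleep(poll);
    }
}
```

Since nothing is cached between attempts, a widget that is destroyed and recreated mid-wait is simply found again on the next poll, which is the point of a lazy locator.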

MCP Server

waydriver-mcp is a standalone binary that exposes the library over the Model Context Protocol, letting AI assistants (Claude Desktop, Claude Code, etc.) drive GTK4 apps in isolated headless sessions. It speaks JSON-RPC over stdio and constructs the Mutter backends internally — clients only see the high-level tools below.

| Tool | Purpose |
|---|---|
| start_session | Spawn a headless Mutter session and launch a command inside it (optional report_dir, resolution, scale, isolate_settings, gsettings, record_video, video_bitrate, capture_external_effects overrides per session) |
| list_sessions | List active session ids, app names, and Wayland displays |
| kill_session | Tear down a session and clean up all child processes |
| set_setting | Change a GSettings key on the running session live: rewrites the isolated keyfile in place so the app re-applies it via its changed handler (cursor, fonts, color-scheme, …) without a restart |
| dump_tree | Dump the AT-SPI accessibility tree as XML; each node carries a _ref you can target with query/click/etc. |
| query | Evaluate an XPath over the tree; returns every match’s role, name, attributes, and states |
| click / double_click / right_click | Invoke an element’s primary / secondary / tertiary AT-SPI Action. Auto-waits for visibility + enablement. |
| hover | Move the pointer to an element’s center; drives a real Wayland motion event so hover-state UI repaints |
| drag_to | Press, move across an element’s center, release: a full Wayland drag gesture |
| drag_to_coords | Like drag_to, but release at raw screen-absolute (x, y): drop onto empty space or off the source window (libadwaita tab drag-out and other “drop onto nothing” DnD) |
| focus | Give keyboard focus to an element via AT-SPI Component::grab_focus |
| set_text | Replace an editable element’s contents via EditableText (fast, requires the interface) |
| fill | Focus + clear + type: fallback for widgets without EditableText (e.g. GtkTextView/GtkEntry). Tries AT-SPI Component::grab_focus first; widgets whose bridge doesn’t expose Component (the documented GTK4 case) fall back to a pointer click at the widget’s centre to drive focus through the input layer, the same way a user would. Set assume_focused: true to skip the whole focus step when the target is already focused. Supports caret_nav/select_all clear modes. |
| select_option | Pick an entry from a Selection-interface container (combo box, list, …) by label or by index |
| read_text | Read an element’s text via the Text interface |
| read_value | Read an element’s AT-SPI Value (current/min/max): a scrolled view’s offset, or a slider/progress/spin value |
| scroll | Scroll a located area by wheel detents along an axis (parks the pointer over it first); pair with read_value to confirm the offset moved |
| type_text | Type a string into the currently focused element through the input backend |
| press_key | Press a named key or chord (Return, Ctrl+A, Shift+Tab, Escape, …) |
| move_pointer | Move the pointer by a relative offset in logical pixels |
| pointer_click | Press and release a pointer button (defaults to left click) |
| take_screenshot | Capture a PNG via the keepalive ScreenCast stream and return its path |
| compare_element_to_baseline | Crop an element and diff it against a committed reference PNG (perceptual CIEDE2000); returns a diff score (not a pass/fail verdict) and writes a red-highlighted diff image on mismatch |
| get_captured_effects | Read the desktop notifications and portal open-URI requests the app emitted onto the session bus (mock D-Bus sinks). Requires capture_external_effects: true on start_session; effects have no AT-SPI projection, so this is the only way to assert on them |
| launch_secondary_instance | Relaunch the app with extra args in the same session env; a single-instance GApplication forwards the command line to the running primary. Observe the primary’s reaction via wait_for_stdout_line/query |

Selectors use XPath 1.0 against a snapshot of the AT-SPI tree serialized to XML, with role names normalized to PascalCase (e.g. push button → Button). Example XPaths: //Button[@name='OK'], //Text[@name='search'], //MenuItem[contains(@name, 'Mode')], (//Button)[last()].

Each session produces output under a configurable report directory. Screenshots are written as {report_dir}/{session_id}/{session_id}-{n}.png — each session gets its own subdirectory and n increments per take_screenshot call. The base report_dir defaults to /tmp/waydriver and can be overridden with the --report-dir <PATH> CLI flag or the WAYDRIVER_REPORT_DIR environment variable. Individual start_session calls may also pass a report_dir argument to override the server default for that session.
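
As a quick illustration of the layout described above, the following sketch builds a screenshot path and picks the effective report directory. The function names are hypothetical, and `server_setting` is assumed to be whatever the server already resolved from `--report-dir` or `WAYDRIVER_REPORT_DIR`; how those two rank against each other is not covered here.

```rust
use std::path::{Path, PathBuf};

/// Builds `{report_dir}/{session_id}/{session_id}-{n}.png`.
fn screenshot_path(report_dir: &Path, session_id: &str, n: u32) -> PathBuf {
    report_dir
        .join(session_id)
        .join(format!("{session_id}-{n}.png"))
}

/// Picks the report directory: a per-session `report_dir` argument wins over
/// the server-level setting, which wins over the built-in default.
fn effective_report_dir(session_override: Option<&str>, server_setting: Option<&str>) -> PathBuf {
    PathBuf::from(
        session_override
            .or(server_setting)
            .unwrap_or("/tmp/waydriver"),
    )
}
```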

Alongside the screenshots, each session writes:

  • {session_id}.webm — full-session VP8/WebM recording of the display at 15 fps, finalized with a seekhead on kill_session. On by default; disable per-server with --record-video false / WAYDRIVER_RECORD_VIDEO=false, or per-session with start_session’s record_video: false. Bitrate via --video-bitrate <bits/sec> / WAYDRIVER_VIDEO_BITRATE (default 2_000_000) or per-session video_bitrate.
  • events.jsonl — append-only audit log of every session-scoped tool call (action, params, ok/err status, timestamp) at {report_dir}/{session_id}/events.jsonl.
  • events.js — atomic rewrite of the same data as window.__events_update([...]) for consumption by the viewer.
  • index.html: styled viewer (Tailwind via the Play CDN) that embeds the recording in a <video> tag when present. It reloads events.js every 2 s via a <script src> swap (which, unlike fetch, works over file://) and renders append-only, so expanded <details> elements stay expanded across refreshes. Written once at session start.

start_session’s response includes a file:// URL to the session viewer — open it directly from the filesystem in any browser. No HTTP server, no ports, no network access required. Multiple waydriver-mcp instances (different Claude Code tabs / projects) can run side by side without conflict.

Why Docker?

waydriver-mcp needs ~8 system services at runtime (mutter, pipewire, wireplumber, dbus, AT-SPI, gstreamer). Installing these manually is fragile and distro-specific. Docker solves four problems:

  • Security — the MCP server spawns arbitrary processes, interacts with them via D-Bus, and captures their screen. Running this on your host session gives it access to everything your user can do. Inside a container, it only sees what you explicitly mount — no access to your files, browser sessions, or credentials. Add --network none to block network access entirely (the report viewer is purely static file://, so it works without any network)
  • Zero-setup distribution: docker pull and you’re running, no system packages to install
  • D-Bus isolation — each container gets its own dbus-daemon, so apps with singleton D-Bus activation don’t interfere across concurrent test sessions
  • ABI compatibility — apps built inside the container are guaranteed to link against the same libraries the MCP runtime uses

Prebuilt images are published to GitHub Container Registry for each release:

| Image | Purpose |
|---|---|
| ghcr.io/bohdantkachenko/waydriver-mcp | Runtime: MCP server with all system deps |
| ghcr.io/bohdantkachenko/waydriver-mcp-builder | Build env: Fedora 42 + Rust + gcc/g++ + meson + cmake + GTK4/GLib dev headers |

docker pull ghcr.io/bohdantkachenko/waydriver-mcp:latest
docker pull ghcr.io/bohdantkachenko/waydriver-mcp-builder:latest

Use the builder image to compile your app in a Fedora environment that matches the runtime. The resulting binary is ABI-compatible with the runtime image. See Testing your app below for language-specific build examples.

MCP client config (e.g. .mcp.json for Claude Code):

{
  "mcpServers": {
    "waydriver-mcp": {
      "command": "sh",
      "args": ["-c", "docker run --rm -i --network none -v \"$PWD:/workspace:ro\" -v /tmp/waydriver:/tmp/waydriver ghcr.io/bohdantkachenko/waydriver-mcp:latest"]
    }
  }
}
  • $PWD:/workspace:ro — mounts the project directory so the MCP can launch your app binaries from /workspace/
  • /tmp/waydriver:/tmp/waydriver — makes session reports (screenshots, WebM recordings, events.jsonl, index.html) accessible on the host at /tmp/waydriver/. The mount uses the same path on both sides so the file:// URL that start_session returns is openable as-is on the host
  • --network none — safe to fully isolate: the report viewer is pure static HTML + JS loaded from your local filesystem

For NixOS users, also mount the Nix store so Nix-built binaries work inside the container:

{
  "mcpServers": {
    "waydriver-mcp": {
      "command": "sh",
      "args": ["-c", "docker run --rm -i --network none -v /nix/store:/nix/store:ro -v \"$PWD:/workspace:ro\" -v /tmp/waydriver:/tmp/waydriver ghcr.io/bohdantkachenko/waydriver-mcp:latest"]
    }
  }
}

Or build from source:

docker build -t waydriver-mcp .

Testing your app with waydriver-mcp

The MCP server is persistent — it stays up for the entire AI assistant session. You rebuild your app independently, and each start_session call picks up the latest binary from the volume. No MCP restart needed between iterations.

Rust apps — build with the builder image, volume-mount the binary:

docker run --rm -v "$PWD:/src:ro" -v "$PWD/build:/out" \
  ghcr.io/bohdantkachenko/waydriver-mcp-builder:latest \
  sh -c "cp -r /src /tmp/build && cd /tmp/build && cargo build --release && cp target/release/myapp /out/"

MCP client config:

{
  "mcpServers": {
    "waydriver-mcp": {
      "command": "docker",
      "args": ["run", "--rm", "-i",
        "-v", "/path/to/myapp/build:/workspace:ro",
        "ghcr.io/bohdantkachenko/waydriver-mcp:latest"]
    }
  }
}

Then call start_session with command: "/workspace/myapp".

C/C++ apps — the builder image includes gcc, g++, meson, ninja-build, cmake, and GTK4/GLib dev headers:

docker run --rm -v "$PWD:/src:ro" -v "$PWD/build:/out" \
  ghcr.io/bohdantkachenko/waydriver-mcp-builder:latest \
  sh -c "cp -r /src /tmp/build && cd /tmp/build && meson setup _build && meson compile -C _build && cp _build/myapp /out/"

For extra deps (e.g. libadwaita-devel), extend the builder:

FROM ghcr.io/bohdantkachenko/waydriver-mcp-builder:latest
RUN dnf install -y libadwaita-devel

Node/Python apps — extend the runtime image to add the interpreter, use a named volume for deps:

FROM ghcr.io/bohdantkachenko/waydriver-mcp:latest
RUN dnf install -y nodejs && dnf clean all

Install deps into a named volume (re-run only when lockfile changes):

docker volume create myapp-nodemods
docker run --rm \
  -v "$PWD/package.json:/app/package.json:ro" \
  -v "$PWD/package-lock.json:/app/package-lock.json:ro" \
  -v "myapp-nodemods:/app/node_modules" \
  -w /app \
  ghcr.io/bohdantkachenko/waydriver-mcp-builder:latest \
  sh -c "dnf install -y nodejs npm && npm ci --omit=dev"

Mount source + deps — edit source freely, MCP picks up changes on next start_session:

"args": ["run", "--rm", "-i",
  "-v", "/path/to/myapp/src:/app/src:ro",
  "-v", "myapp-nodemods:/app/node_modules:ro",
  "myapp-mcp:latest"]

NixOS users — mount /nix/store so Nix-built binaries just work:

"args": ["run", "--rm", "-i",
  "-v", "/nix/store:/nix/store:ro",
  "-v", "/path/to/myapp:/workspace:ro",
  "ghcr.io/bohdantkachenko/waydriver-mcp:latest"]

Running with Nix

For local development without Docker, the Nix app wraps the binary with the required runtime env vars:

nix run .#mcp

Sessions are kept in an in-memory HashMap keyed by id, so multiple apps can run concurrently within one server process.

Visual locator — OCR + flood-fill region detection

Gated behind the visual Cargo feature on the waydriver crate. Adds two coordinated abilities for finding widgets that the AT-SPI tree doesn’t reveal:

  1. OCR-based text matching — locate a widget by its on-screen text when AT-SPI doesn’t surface it as an accessible.
  2. Region detection — once OCR finds the text, walk outward through the pixels to find the visually-distinct shape enclosing it (a button pill, row, card frame), so clicks land on the widget rather than its inner glyphs.

This doc describes both pipelines, how they compose, what they cost, and when each one is the right tool.

Why this exists

AT-SPI is the normal interaction path: enumerate the accessibility tree, find a widget by name/role/state, call Action.do_action or synthesize pointer events at its bounds. waydriver’s regular Locator does all that.

But real toolkits have gaps. Two we’ve hit and confirmed are genuinely upstream:

  • libadwaita lazy realization: an AdwPreferencesGroup constructed with visible:false inside an AdwPreferencesPage and then flipped visible after present() never has its accessible subtree built. The same happens to a non-initial AdwPreferencesDialog page. The contained AdwButtonRow / AdwSwitchRow paints on screen but is absent from every AT-SPI surface. We tried every client-side way we could find to force realization, and none worked (confirmed live on mutter 49 / GTK4 4.20 / libadwaita 1.8):

    • parent traversal (GetChildren), a 0..ChildCount GetChildAtIndex(i) loop, and Cache.GetItems on the app bus — the widgets are simply never published;
    • a grid of Component.GetAccessibleAtPoint hit-tests over the dialog and every descendant (thousands of calls) — no change;
    • synthetic compositor pointer-hover across the page — no change;
    • keyboard focus traversal (Tab through the dialog — how Orca surfaces them) — no change.

    Libadwaita doesn’t register these accessibles, and no AT-SPI or input path makes it do so. The bug is genuinely upstream; the OCR visual locator below is the only working way to drive these widgets.

  • AdwButtonRow has no accessible name — even when the row is in the tree, its title doesn’t surface as an AT-SPI name, so Locator::find_by_name returns zero.

We can’t fix these from the client side: D-Bus enumeration finds what the toolkit chose to publish. The pixels on screen, however, are real. The visual locator drives off those pixels.

It’s strictly opt-in. waydriver’s existing Locator::click etc. never silently fall back to OCR — the cost (hundreds of ms) is too high to hide, and silent fallback would mask real selector bugs. You reach for Session::find_by_text only when you’ve established that AT-SPI doesn’t see the widget.

The OCR pipeline

                ┌──────────────────────────────────────────────┐
                │  Session::take_screenshot()                  │
                │   PipeWire keepalive stream → PNG bytes      │
                └────────────────────┬─────────────────────────┘
                                     │
                                     v
                ┌──────────────────────────────────────────────┐
                │  image::load_from_memory(...)                │
                │   PNG → DynamicImage                         │
                └────────────────────┬─────────────────────────┘
                                     │
              optional .within(rect) │ crop to parent region
                  + 32px context pad │   (Locator::find_by_text)
                                     v
                ┌──────────────────────────────────────────────┐
                │  ocrs::OcrEngine                             │
                │   prepare_input → detect_words →             │
                │   find_text_lines → recognize_text           │
                │   (pure-Rust, ONNX via rten)                 │
                └────────────────────┬─────────────────────────┘
                                     │
                                     v
                ┌──────────────────────────────────────────────┐
                │  Filter words by `text` (Substring/Exact)    │
                │  Translate bboxes back to screen coords      │
                │  Return Vec<Rect>                            │
                └──────────────────────────────────────────────┘

Engine lifecycle

The OcrEngine is loaded once per session into a shared tokio::sync::OnceCell. The two .rten model files (text-detection ~2.5 MB, text-recognition ~10 MB) are looked up in this order:

  1. Env-var override: WAYDRIVER_OCRS_DETECTION_MODEL and WAYDRIVER_OCRS_RECOGNITION_MODEL are both set.
  2. XDG cache hit: $XDG_CACHE_HOME/waydriver/ocrs-models/ (or ~/.cache/...) has both files.
  3. Auto-download — fetch from the ocrs project’s S3 bucket into the XDG cache. First call only; subsequent runs hit (2).

Set SessionConfig::prewarm_visual = true to spawn the engine load as a background task during Session::start so the first find_by_text call doesn’t pay the ~1–2 s model load. On a fresh machine with no XDG cache, the first session also pays ~5–20 s of model download — pre-populate the cache in CI setup if that matters.
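
The lookup order above can be sketched as a pure function. Everything named here is an assumption for illustration: the `ModelSource` enum, `resolve_models`, and the two file names are not taken from waydriver, and the environment values and the file-existence check are passed in as parameters so the logic stays testable.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, PartialEq)]
enum ModelSource {
    EnvOverride { detection: PathBuf, recognition: PathBuf },
    Cache { dir: PathBuf },
    Download { into: PathBuf },
}

// Hypothetical file names; the names waydriver actually uses may differ.
const DETECTION_FILE: &str = "text-detection.rten";
const RECOGNITION_FILE: &str = "text-recognition.rten";

/// `$XDG_CACHE_HOME/waydriver/ocrs-models`, falling back to `~/.cache/...`.
fn cache_dir(xdg_cache_home: Option<&str>, home: &str) -> PathBuf {
    let base = match xdg_cache_home {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => Path::new(home).join(".cache"),
    };
    base.join("waydriver").join("ocrs-models")
}

fn resolve_models(
    env_detection: Option<&str>,
    env_recognition: Option<&str>,
    xdg_cache_home: Option<&str>,
    home: &str,
    exists: impl Fn(&Path) -> bool,
) -> ModelSource {
    // 1. The override applies only when BOTH variables are set.
    if let (Some(d), Some(r)) = (env_detection, env_recognition) {
        return ModelSource::EnvOverride {
            detection: PathBuf::from(d),
            recognition: PathBuf::from(r),
        };
    }
    // 2. A cache hit requires both files; 3. otherwise download into the cache.
    let dir = cache_dir(xdg_cache_home, home);
    if exists(&dir.join(DETECTION_FILE)) && exists(&dir.join(RECOGNITION_FILE)) {
        ModelSource::Cache { dir }
    } else {
        ModelSource::Download { into: dir }
    }
}
```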

Cropping to a parent (the Locator::find_by_text path)

Session::find_by_text(text) runs OCR on the full screen. That works but is slow (~200–500 ms on a 1024×768 frame) and noisy: every word visible on screen is a candidate, so disambiguation matters.

Locator::find_by_text(text) on an AT-SPI parent locator is the faster, more accurate form:

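// Fragment: assumes `session` is a started `waydriver::Session` and that the
// enclosing function is async and returns a `Result`.
let dialog = session.locate("//Dialog[@name='Preferences']");
let text = dialog.find_by_text("lazy-button").await?;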

This crops the screenshot to the parent’s AT-SPI bounds (plus a 32 px padding ring) before it reaches ocrs:

  • Speed. OCR runtime is roughly linear in image area; cropping to a typical dialog cuts a search from ~300 ms to ~50 ms.
  • Accuracy. Less surrounding text means fewer false positives and less context that confuses the recognition head.

Why the 32 px context padding? Empirically, a tight crop strips the visual context that ocrs’s recogniser uses to disambiguate ambiguous glyphs. Without padding, small/low-contrast labels misread (we saw lazy-button misread as lazv-button). The 32 px ring restores the context; hits inside the ring but outside the original scope are filtered back out after OCR so the caller sees only matches that genuinely fall inside the requested region.
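
A minimal sketch of the pad-then-filter step, under stated assumptions: the `Rect` type and both function names are hypothetical, and a hit is treated as "inside the scope" when its centre falls inside the original bounds, which may not be the exact rule waydriver applies.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

impl Rect {
    fn right(&self) -> i32 {
        self.x + self.width
    }
    fn bottom(&self) -> i32 {
        self.y + self.height
    }
    fn contains_point(&self, px: i32, py: i32) -> bool {
        px >= self.x && px < self.right() && py >= self.y && py < self.bottom()
    }
}

/// Grows `scope` by `pad` pixels on every side, clamped to the screen.
fn padded_crop(scope: Rect, pad: i32, screen_w: i32, screen_h: i32) -> Rect {
    let x0 = (scope.x - pad).max(0);
    let y0 = (scope.y - pad).max(0);
    let x1 = (scope.right() + pad).min(screen_w);
    let y1 = (scope.bottom() + pad).min(screen_h);
    Rect { x: x0, y: y0, width: x1 - x0, height: y1 - y0 }
}

/// Translates crop-relative OCR hits back to screen coordinates and drops the
/// ones whose centre lies in the padding ring rather than in `scope`.
fn hits_in_scope(hits_in_crop: &[Rect], crop: Rect, scope: Rect) -> Vec<Rect> {
    hits_in_crop
        .iter()
        .map(|h| Rect { x: h.x + crop.x, y: h.y + crop.y, ..*h })
        .filter(|h| scope.contains_point(h.x + h.width / 2, h.y + h.height / 2))
        .collect()
}
```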

MatchMode

  • Substring (default) — case-insensitive substring match. Tolerant of OCR’s noise (it’ll match "open" against "open-lazy-issue1-dialog").
  • Exact — equality on the full joined line, normalized.

Both modes Unicode-normalize haystack and needle before comparing: NFKD decomposition + case-fold + combining-mark stripping. This makes matching insensitive to:

  • Case: "Add Account" matches "add account".
  • Diacritics: "café" matches "cafe", "naïve" matches "naive".
  • Ligatures and compatibility codepoints: "ﬁle" (U+FB01) matches "file", "ﬂux" (U+FB02) matches "flux".

Exotic punctuation (e.g. the Unicode minus U+2212 that gnome-calculator uses in its history line) is not auto-mapped to ASCII equivalents — match it explicitly when needed.

Block grouping with visual-boundary detection

OCR returns individual text lines. The grouper merges them into blocks bottom-up using several heuristics, applied in order:

  1. Geometric clustering: lines with small y-gap and overlapping x-ranges merge into one block (wrapped paragraph behaviour).
  2. Pixel-level boundary checks (when an image is available): even if the geometric tests pass, the merge is vetoed when the gap between two lines contains:
    • A background-colour change — sample an averaged window of pixels just below the upper line and one just above the lower line (window radius VisualTextTuning::background_sample_radius px, default 2 = 5×5); if their colours differ by more than VisualTextTuning::background_color_tolerance (default 24), the lines sit on different backgrounds. The averaged-window sampler smooths over single antialias-fringe pixels that would skew a single-pixel read.
    • A horizontal divider stripe — scan every row in the gap; a row where ≥ boundary_majority_threshold (default 0.8) of boundary_samples_per_axis (default 16) sampled pixels differ from both surrounding backgrounds is a horizontal rule.
    • A vertical divider stripe — scan every column in the x-overlap range; same majority + colour-distance test. Picks up split-pane rules that pass through the gap.
  3. Connectivity check (opt-in, connectivity_check_enabled = false by default): a bounded BFS in the gap. From the bg pixel just below the upper line, flood-fill at most max_connectivity_pixels (default 4096) pixels and check whether the flood reaches the bg pixel just above the lower line. If not, the lines are in visually-separated regions despite having the same background colour — catches “two cards on the same fill, each boxed in by a thin border the divider check is too sparse to detect”.

All checks consult VisualTextTuning::color_distance (default LabCie76, see below) when comparing pixels. The divider checks toggle together via divider_detection_enabled (default true); disable on themes where shadow rasters or anti-aliased streaks would trip the heuristic.
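
The connectivity check in step 3 is a bounded breadth-first flood fill. The sketch below shows the idea on a grayscale buffer; it is a simplification under stated assumptions: the real check compares colours through `ColorDistance` and works inside the gap between two lines, whereas this hypothetical `connected` function compares single `u8` values and searches the whole buffer.

```rust
use std::collections::VecDeque;

/// Returns true when `goal` can be reached from `start` by stepping through
/// 4-connected pixels whose value is within `tolerance` of the start pixel,
/// expanding at most `max_pixels` pixels before giving up.
fn connected(
    pixels: &[u8],
    width: usize,
    height: usize,
    start: (usize, usize),
    goal: (usize, usize),
    tolerance: u8,
    max_pixels: usize,
) -> bool {
    let idx = |x: usize, y: usize| y * width + x;
    let seed = pixels[idx(start.0, start.1)];
    let mut seen = vec![false; width * height];
    let mut queue = VecDeque::new();
    seen[idx(start.0, start.1)] = true;
    queue.push_back(start);
    let mut visited = 0usize;

    while let Some((x, y)) = queue.pop_front() {
        if (x, y) == goal {
            return true;
        }
        visited += 1;
        if visited >= max_pixels {
            break; // budget exhausted: treat the two points as separated
        }
        // wrapping_sub turns "left of column 0" into a huge value that the
        // bounds check below rejects.
        let neighbours = [
            (x.wrapping_sub(1), y),
            (x + 1, y),
            (x, y.wrapping_sub(1)),
            (x, y + 1),
        ];
        for (nx, ny) in neighbours {
            if nx >= width || ny >= height {
                continue;
            }
            let i = idx(nx, ny);
            if seen[i] || pixels[i].abs_diff(seed) > tolerance {
                continue;
            }
            seen[i] = true;
            queue.push_back((nx, ny));
        }
    }
    false
}
```

The pixel budget is what keeps the check cheap: a fill that wanders across a large uniform background stops after `max_pixels` pixels instead of scanning the whole frame.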

Perceptual colour distance

ColorDistance controls how the visual locator compares pixel colours, both for region detection (flood-fill, seed pick, shape classification) and the boundary checks:

  • Rgb — raw RGB Euclidean squared distance. Cheap, not perceptual. Use to reproduce legacy thresholds tuned against raw RGB.
  • LabCie76 (default) — ΔE*76 in CIE Lab space. Roughly perceptual (“a ΔE of 6 is barely noticeable, 12 is clearly different”), cheap (one sRGB→Lab conversion).
  • LabCie2000 — ΔE*00, perceptual gold standard. ~5× slower than CIE76; only worth it when CIE76 misclassifies subtle hue shifts in practice.

The default background_color_tolerance of 24 scales sensibly across modes: RGB ΔE 24 maps to roughly Lab ΔE76 6, and both mean “near-identical backgrounds”. If you switch modes and retune, tune the thresholds against the mode you switched to.
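
For reference, here is a sketch of the two cheaper metrics: squared Euclidean distance in raw RGB, and ΔE*76 after a standard sRGB to CIE Lab conversion (D65 white point). The function names are illustrative and not waydriver's; the conversion constants are the usual sRGB/D65 ones.

```rust
/// Squared Euclidean distance in raw 8-bit RGB.
fn rgb_distance_sq(a: [u8; 3], b: [u8; 3]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(&p, &q)| {
            let d = p as i32 - q as i32;
            (d * d) as u32
        })
        .sum()
}

/// Converts 8-bit sRGB to CIE Lab (D65 reference white).
fn srgb_to_lab(rgb: [u8; 3]) -> [f64; 3] {
    // Undo the sRGB transfer curve to get linear light.
    let lin = |c: u8| {
        let v = c as f64 / 255.0;
        if v <= 0.04045 { v / 12.92 } else { ((v + 0.055) / 1.055).powf(2.4) }
    };
    let (r, g, b) = (lin(rgb[0]), lin(rgb[1]), lin(rgb[2]));
    // Linear RGB to XYZ.
    let x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b;
    let y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b;
    let z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b;
    // XYZ to Lab, normalised by the D65 white point.
    let f = |t: f64| {
        if t > 216.0 / 24389.0 { t.cbrt() } else { (24389.0 / 27.0 * t + 16.0) / 116.0 }
    };
    let (fx, fy, fz) = (f(x / 0.95047), f(y), f(z / 1.08883));
    [116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)]
}

/// ΔE*76: plain Euclidean distance between two colours in Lab space.
fn delta_e76(a: [u8; 3], b: [u8; 3]) -> f64 {
    let (la, lb) = (srgb_to_lab(a), srgb_to_lab(b));
    la.iter()
        .zip(lb.iter())
        .map(|(p, q)| (p - q).powi(2))
        .sum::<f64>()
        .sqrt()
}
```

Note that the RGB metric here is a squared distance while ΔE*76 is not, so thresholds tuned for one are not numerically interchangeable with the other.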

Multi-word and multi-line matching

OCR returns text as a tree of TextLines, each containing TextWords. The matcher joins words with spaces and substring-matches against the joined string. There are two layers of join:

  • Per-line for MatchMode::Exact. A line’s words are joined with spaces; the needle must equal the whole joined line. Use Exact to distinguish "Add account" from "Add account and continue".
  • Per-block for MatchMode::Substring. The grouper builds multi-line blocks from geometrically-close lines (see block grouping). For each block, the matcher tries every joiner-choice variant: at each line break, it can use " " or "" independently, giving 2^(N−1) variants for a block of N lines (capped at N = 5; above that, fall back to the single space-join). This handles:
    • Wrapped multi-word labels: "Click here to learn more" matches whether the words wrapped onto one row or three (the space-join variant covers this).
    • Hyphenated wraps: "needle" matches an OCR result of ["nee", "dle"] (the no-space variant joins to "needle").
    • Ligature splits across lines (rare but possible) — the Unicode normalization pass handles ligatures inside a single line already; the variants extend the same idea across breaks.
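
The joiner-choice enumeration for a block can be sketched with a bitmask over the line breaks. This is an illustration only: `join_variants` and `block_contains` are hypothetical names, and plain lowercasing stands in for the NFKD normalization described earlier.

```rust
/// Blocks with more lines than this fall back to the single space-join.
const MAX_VARIANT_LINES: usize = 5;

/// Returns every way of joining `lines` where each line break is rendered as
/// either " " or "". A block of N lines yields 2^(N-1) variants.
fn join_variants(lines: &[&str]) -> Vec<String> {
    if lines.is_empty() {
        return Vec::new();
    }
    if lines.len() > MAX_VARIANT_LINES {
        return vec![lines.join(" ")];
    }
    let breaks = lines.len() - 1;
    (0..(1usize << breaks))
        .map(|mask| {
            let mut out = String::from(lines[0]);
            for (i, line) in lines[1..].iter().enumerate() {
                // Bit i clear: join break i with a space. Bit i set: no space.
                if (mask & (1 << i)) == 0 {
                    out.push(' ');
                }
                out.push_str(line);
            }
            out
        })
        .collect()
}

/// Substring match against any variant (simplified normalization).
fn block_contains(lines: &[&str], needle: &str) -> bool {
    let needle = needle.to_lowercase();
    join_variants(lines)
        .iter()
        .any(|variant| variant.to_lowercase().contains(&needle))
}
```

With the cap at five lines, the largest block produces 16 variants, so the enumeration stays cheap next to the OCR pass itself.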

When a substring match spans multiple words — on the same line or across lines — the returned bbox is the union of the matched words’ bboxes. For a single-line match this is the tight rectangle around the matched text. For a multi-line match it’s the AABB of every involved word, which can include vertical gaps between the text rows; the centroid still lands inside the matched text block, which is what you want for clicking and region seeding.
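
The bbox union is an ordinary axis-aligned bounding box over the matched words. A small sketch, with a hypothetical `Rect` type and helper names, and with the click point taken as the centre of the union:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

/// Smallest axis-aligned rectangle covering every word box, or `None` when
/// there are no words.
fn union_bbox(words: &[Rect]) -> Option<Rect> {
    let first = *words.first()?;
    let (mut x0, mut y0) = (first.x, first.y);
    let (mut x1, mut y1) = (first.x + first.width, first.y + first.height);
    for w in &words[1..] {
        x0 = x0.min(w.x);
        y0 = y0.min(w.y);
        x1 = x1.max(w.x + w.width);
        y1 = y1.max(w.y + w.height);
    }
    Some(Rect { x: x0, y: y0, width: x1 - x0, height: y1 - y0 })
}

/// Centre point of a rectangle, using integer division.
fn centre(r: Rect) -> (i32, i32) {
    (r.x + r.width / 2, r.y + r.height / 2)
}
```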

Trade-off of cross-line substring: unrelated labels on adjacent lines can spuriously match across the line break (a search for "account Remove" would hit text that read "Add account / Remove account"). In practice nobody writes selectors that way, and the user opted in to OCR because AT-SPI couldn’t help — they’re already using a fuzzy tool. Use Exact when you need line-precise semantics.

Introspection

Both VisualLocator and RegionLocator implement Debug, so tracing::debug!("{loc:?}") or dbg!(loc) shows what the locator represents:

VisualLocator { kind: "text-label", text: "Add account",
                match_mode: Substring, region: Some(Rect { ... }),
                timeout: None }

RegionLocator { kind: "visual-region",
                bbox: Rect { x: 192, y: 158, width: 640, height: 92 },
                centroid: (512, 204) }

The kind field is a constant string that makes the role explicit in logs — "text-label" for OCR text matches, "visual-region" for flood-fill shapes — so dumps tell you what the locator means without having to follow the type back to its constructor.

VisualLocator also exposes the constructed-with values via getters.

What VisualLocator::click does today

VisualLocator::click clicks the centre of the OCR word’s bbox. This works when the text glyphs sit inside the gesture controller’s hit-rect, for instance a centred label inside an AdwButtonRow.

It doesn’t always work:

  • Checkboxes / toggles whose label and click target are separate widgets.
  • Widgets sized much larger than their text, where clicking on the glyphs hits the inner label’s selection gesture instead of the surrounding container’s activation gesture.

For those cases, the region pipeline below is the escape hatch.

The template-matching pipeline

For widgets that have no on-screen text (icon-only buttons, image links, custom-drawn glyphs), OCR can’t help. The ImageLocator path takes a reference PNG captured against a known-good screenshot of the same app, and finds where that patch sits in the current screen via classical normalized cross-correlation (NCC).

let icon = std::fs::read("references/save_icon.png")?;
session
    .find_image(&icon)?
    .with_threshold(0.9)
    .click()
    .await?;

// Or scoped to an AT-SPI parent (faster, fewer false positives):
let toolbar = session.locate("//ToolBar[@name='Main']");
toolbar
    .find_image(&icon).await?
    .click()
    .await?;

Algorithm

  1. Decode the template PNG once at find_image time.
  2. On each terminal-method call (bounds, click, …), take a fresh screenshot, crop to the optional scope rect, convert both target and template to grayscale.
  3. imageproc::template_matching::match_template with method CrossCorrelationNormalized — slide the template, scoring each position by NCC (Σ(a·b) / sqrt(Σa² · Σb²), in [0, 1], peaks at 1.0 for a perfect match).
  4. Walk the score grid for all peaks above the threshold (default 0.85), sort best-first, apply non-maximum suppression so neighbouring peaks within min(template_w, template_h) / 2 px collapse to one hit.
  5. Translate hit positions back into screen coords.
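Steps 3 and 4 can be sketched without any image crate. The code below is an illustration of the NCC formula and a simple non-maximum suppression, not imageproc's or waydriver's implementation. The `Gray` type, the function names, and the choice of a per-axis (Chebyshev) distance for the suppression radius are assumptions.

```rust
/// Grayscale image, row-major, one f32 per pixel (illustrative type).
struct Gray {
    w: usize,
    h: usize,
    px: Vec<f32>,
}

impl Gray {
    fn at(&self, x: usize, y: usize) -> f32 {
        self.px[y * self.w + x]
    }
}

/// NCC of `tpl` placed with its top-left corner at (ox, oy):
/// sum(a*b) / sqrt(sum(a^2) * sum(b^2)).
fn ncc_at(img: &Gray, tpl: &Gray, ox: usize, oy: usize) -> f32 {
    let (mut ab, mut aa, mut bb) = (0.0f32, 0.0f32, 0.0f32);
    for y in 0..tpl.h {
        for x in 0..tpl.w {
            let a = img.at(ox + x, oy + y);
            let b = tpl.at(x, y);
            ab += a * b;
            aa += a * a;
            bb += b * b;
        }
    }
    let denom = (aa * bb).sqrt();
    if denom == 0.0 { 0.0 } else { ab / denom }
}

/// Slide the template, keep scores >= threshold, sort best-first, then
/// drop any hit within min(w, h) / 2 px (per axis) of a better hit.
fn find_matches(img: &Gray, tpl: &Gray, threshold: f32) -> Vec<(usize, usize, f32)> {
    if tpl.w == 0 || tpl.h == 0 || tpl.w > img.w || tpl.h > img.h {
        return Vec::new();
    }
    let mut hits = Vec::new();
    for oy in 0..=(img.h - tpl.h) {
        for ox in 0..=(img.w - tpl.w) {
            let score = ncc_at(img, tpl, ox, oy);
            if score >= threshold {
                hits.push((ox, oy, score));
            }
        }
    }
    hits.sort_by(|a, b| b.2.total_cmp(&a.2));
    let radius = tpl.w.min(tpl.h) / 2;
    let mut kept: Vec<(usize, usize, f32)> = Vec::new();
    for hit in hits {
        let near_better = kept
            .iter()
            .any(|k| k.0.abs_diff(hit.0) <= radius && k.1.abs_diff(hit.1) <= radius);
        if !near_better {
            kept.push(hit);
        }
    }
    kept
}
```

Because the kept list is built best-first, a weaker peak next to a strong one is always the one that gets dropped.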

Threshold tuning

  • 0.95+ — very strict. Use when the reference was captured on the same machine, same theme, same DPI as the test run. Rejects most false positives in busy layouts.
  • 0.85 (default) — tolerant of subpixel antialias differences and minor lighting shifts.
  • <0.70 — likely matches something, but in a busy screen will probably match the wrong thing. If a known-good reference scores below 0.7, recapture it.

When to use this vs. find_by_text

| You want to click… | Use |
| --- | --- |
| A button with text | find_by_text("Save") |
| An icon-only button (Save icon, hamburger, X) | find_image(&icon_png) |
| A widget AT-SPI surfaces | Locator with an XPath selector |
| Something that wraps over multiple lines | find_by_text("Click here to learn more") |

OCR is the right choice whenever you can read the on-screen text. Template matching is the escape hatch for visual-only widgets.

Known failure modes

  • DPI / scale change. A 32×32 reference captured on a 1× display won’t match a 64×64 render on a 2× display. The basic matcher does no scale search; recapture per DPI, or build an image pyramid wrapper if a workload demonstrates the need.
  • Theme swap. Light → dark mode = all references stale.
  • Antialias / font hinting drift. Same widget on a different GPU / fontconfig stack can score below 0.85. Lower the threshold or recapture.
  • Animation / hover / focus mid-capture. Ripple effects, focus rings, hover highlights all change the pixels. Capture references in a steady state.
  • Multiple identical icons on screen. bounds() errors out on ambiguous matches; use within(rect) to disambiguate.

Cost

One NCC pass over the haystack ≈ O(W·H·w·h) work. For a 1920×1080 screenshot and a 64×64 template, ~8 billion ops naïvely; modern machines do this in 10–50 ms. Cropping with within(rect) cuts the haystack and is the single best speedup. The implementation calls match_template (single-threaded); if a workload demands it, swapping to match_template_parallel is a one-line change.
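The operation count is easy to check by hand. The helper below (a hypothetical name, for illustration) counts one w·h window per valid template position, which is slightly less than W·H·w·h because the template cannot hang over the edge. The 640×92 scope in the usage note is just an example size.

```rust
/// Multiply-accumulate count for a naive NCC pass: one tpl_w * tpl_h
/// window for each position where the template fits inside the image.
fn ncc_ops(img_w: u64, img_h: u64, tpl_w: u64, tpl_h: u64) -> u64 {
    if tpl_w > img_w || tpl_h > img_h {
        return 0;
    }
    (img_w - tpl_w + 1) * (img_h - tpl_h + 1) * tpl_w * tpl_h
}
```

`ncc_ops(1920, 1080, 64, 64)` is 7,735,578,624, the "~8 billion" above. Cropping to a 640×92 scope gives 68,538,368, roughly 110 times less work, which is why scoping is the first lever to reach for.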

The region detection pipeline

When clicking text glyphs doesn’t fire the surrounding widget’s activation, we want a different click target: the centroid of the visually-distinct shape that contains the text. That’s typically a button pill, a row’s rounded rectangle, or a card frame.

The algorithm is a BFS flood-fill from a seed pixel adjacent to the OCR text bbox. A “region” is a contiguous block of pixels whose RGB Euclidean distance to a seed sample is within tolerance — a button’s fill, a row’s background, a card’s surface. Each iteration finds one enclosing region; iterating outward builds a chain.

                ┌──────────────────────────────────────────────┐
                │  Inputs                                      │
                │   parent_bounds (AT-SPI Rect, screen coords) │
                │   inner_bbox    (OCR text bbox, screen coords)│
                │   full_png      (Session::take_screenshot)   │
                │   tuning        (SessionConfig::visual_      │
                │                  region_tuning)              │
                └────────────────────┬─────────────────────────┘
                                     │
                                     v
                ┌──────────────────────────────────────────────┐
                │  Crop full_png to parent_bounds              │
                │  Translate inner_bbox into crop coords       │
                └────────────────────┬─────────────────────────┘
                                     │
                                     v
                ┌──────────────────────────────────────────────┐
                │  pick_seed_outside(inner_bbox, image)        │
                │   Try right / left / below / above the       │
                │   inner bbox, +4 px offset. Sanity-check     │
                │   uniformity vs a neighbouring pixel so we   │
                │   don't seed on glyph antialiasing fringe.   │
                └────────────────────┬─────────────────────────┘
                                     │
                                     v
                ┌──────────────────────────────────────────────┐
                │  flood_fill(image, seed, tolerance)          │
                │   BFS, Vec<bool> visited grid.               │
                │   Add 4-neighbour pixels where               │
                │     ‖rgb(neighbour) - rgb(seed)‖₂ ≤ tolerance│
                │   Track bbox + centroid as we go.            │
                └────────────────────┬─────────────────────────┘
                                     │
                                     v
                ┌──────────────────────────────────────────────┐
                │  region_0 = { bbox, centroid }               │
                │  Translate back to screen coords.            │
                │  Push into result list.                      │
                └────────────────────┬─────────────────────────┘
                                     │
                                     v (find_regions / first_region only)
                ┌──────────────────────────────────────────────┐
                │  Stop?                                       │
                │   • region == previous region (no growth)    │
                │   • region covers entire crop                │
                │   • iteration count ≥ tuning.max_regions     │
                │   • pixel_just_outside(region) has nowhere   │
                │     to go (region touches all image edges)   │
                └────────────────────┬─────────────────────────┘
                                     │ otherwise
                                     v
                ┌──────────────────────────────────────────────┐
                │  seed = pixel_just_outside(region.bbox)      │
                │  Loop back to flood_fill.                    │
                └──────────────────────────────────────────────┘
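The flood_fill box in the diagram can be sketched as follows. This is a minimal standalone version under stated assumptions: plain RGB Euclidean distance against the seed colour (as in the diagram, not the Lab modes), and illustrative `Image` / `Region` types that are not the library's own.

```rust
use std::collections::VecDeque;

/// RGB image, row-major (illustrative type).
struct Image {
    w: usize,
    h: usize,
    px: Vec<[u8; 3]>,
}

/// One flood result: inclusive bbox corners, pixel-mean centroid, pixel count.
#[derive(Debug, PartialEq)]
struct Region {
    min: (usize, usize),
    max: (usize, usize),
    centroid: (f64, f64),
    pixels: usize,
}

fn dist_sq(a: [u8; 3], b: [u8; 3]) -> u32 {
    (0..3)
        .map(|i| {
            let d = a[i] as i32 - b[i] as i32;
            (d * d) as u32
        })
        .sum()
}

/// BFS over 4-neighbours whose colour is within `tolerance` of the seed colour.
fn flood_fill(img: &Image, seed: (usize, usize), tolerance: u8) -> Option<Region> {
    if seed.0 >= img.w || seed.1 >= img.h {
        return None;
    }
    let seed_rgb = img.px[seed.1 * img.w + seed.0];
    let tol_sq = (tolerance as u32) * (tolerance as u32);
    let mut visited = vec![false; img.w * img.h];
    visited[seed.1 * img.w + seed.0] = true;
    let mut queue = VecDeque::from([seed]);
    let (mut min, mut max) = (seed, seed);
    let (mut sum_x, mut sum_y, mut count) = (0u64, 0u64, 0usize);
    while let Some((x, y)) = queue.pop_front() {
        // Track bbox and centroid sums as pixels are visited.
        min = (min.0.min(x), min.1.min(y));
        max = (max.0.max(x), max.1.max(y));
        sum_x += x as u64;
        sum_y += y as u64;
        count += 1;
        // wrapping_sub turns an underflow into an index the bounds check rejects.
        let neighbours = [
            (x.wrapping_sub(1), y),
            (x + 1, y),
            (x, y.wrapping_sub(1)),
            (x, y + 1),
        ];
        for (nx, ny) in neighbours {
            if nx >= img.w || ny >= img.h {
                continue;
            }
            let idx = ny * img.w + nx;
            if !visited[idx] && dist_sq(img.px[idx], seed_rgb) <= tol_sq {
                visited[idx] = true;
                queue.push_back((nx, ny));
            }
        }
    }
    let centroid = (sum_x as f64 / count as f64, sum_y as f64 / count as f64);
    Some(Region { min, max, centroid, pixels: count })
}
```

Seeding anywhere inside a uniformly filled button yields the same bbox, centroid, and pixel count; seeding on the background yields the background region instead.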

Why a centroid, not a bbox centre

For axis-aligned rectangles, the bbox centre and the geometric centroid coincide, and the same holds for symmetric shapes such as pills (rounded rectangles) and circles. For irregular shapes (polygon icons, L-shaped selections) the bbox centre can land outside the actual region. The centroid is the mean of every pixel position in the visited set, so it is pulled towards where the fill actually is. It is always inside a convex shape and usually inside a concave one, though a very thin L or a ring can still leave it in a gap.

For a 60×30 pill flood-filled from inside, the centroid lands at the pill’s geometric centre. For a circle, same. For an L-shaped selection or a polygon icon, the centroid typically lands inside the shape, so clicks land on the widget.
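A worked example makes the difference concrete. The sketch below builds one particular L-shape (a 20×20 square with its top-right 12×12 corner removed) and compares the two click targets; the helper names are hypothetical and the result holds for this shape, not for every concave outline.

```rust
/// Mean position of a pixel set, truncated to whole pixels.
fn centroid(pixels: &[(i32, i32)]) -> Option<(i32, i32)> {
    if pixels.is_empty() {
        return None;
    }
    let n = pixels.len() as i64;
    let sx: i64 = pixels.iter().map(|p| p.0 as i64).sum();
    let sy: i64 = pixels.iter().map(|p| p.1 as i64).sum();
    Some(((sx / n) as i32, (sy / n) as i32))
}

/// Centre of the set's axis-aligned bounding box (x + width / 2, y + height / 2).
fn bbox_centre(pixels: &[(i32, i32)]) -> Option<(i32, i32)> {
    let min_x = pixels.iter().map(|p| p.0).min()?;
    let max_x = pixels.iter().map(|p| p.0).max()?;
    let min_y = pixels.iter().map(|p| p.1).min()?;
    let max_y = pixels.iter().map(|p| p.1).max()?;
    let (w, h) = (max_x - min_x + 1, max_y - min_y + 1);
    Some((min_x + w / 2, min_y + h / 2))
}

/// A 20x20 square minus its top-right corner (x >= 8 and y < 12).
fn l_shape() -> Vec<(i32, i32)> {
    let mut px = Vec::new();
    for y in 0..20 {
        for x in 0..20 {
            if x < 8 || y >= 12 {
                px.push((x, y));
            }
        }
    }
    px
}
```

For this shape the bbox centre is (10, 10), which falls in the removed corner, while the centroid is (7, 11), which is a filled pixel.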

Shape classification

Each RegionLocator carries a coarse Shape value derived from the flood-fill’s pixel-count vs bbox-area ratio combined with a 4-corner sample. The classifier picks one of:

  • Rectangle — fill ratio ≥ 0.97 and all four bbox corners match the seed colour. Bare GTK button interiors, AdwButtonRow contents.
  • Pill — fill ratio ≥ 0.82 with 0–1 bbox corners inside. The corner radius trims the bbox corners off the shape. Most GTK button pills and Adw row backgrounds land here.
  • Ellipse — fill ratio in 0.65–0.83 with 0 bbox corners inside. Round avatar buttons, circular close icons.
  • Irregular — anything else. Polygon icons, regions with holes, shapes whose ratio doesn’t fit a primitive. Don’t trust bounds().center_*() here — use centroid().

The classification is best-effort, intended for assertions and log readability, not as a contract. Borderline cases (e.g. a rectangle with one pixel of antialiased corner darkening) can flip between categories. If a test branches on shape, treat unexpected classifications as a soft signal rather than an absolute fail.
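The rules above can be written as a small decision function. This is a sketch using the default thresholds from the list; the evaluation order (rectangle, then pill, then ellipse) is an assumption, and it matters because the pill and ellipse ratio ranges overlap between 0.82 and 0.83. The `classify` signature is hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Shape {
    Rectangle,
    Pill,
    Ellipse,
    Irregular,
}

/// Classify a flood result from its fill ratio (pixels / bbox area) and the
/// number of bbox corners (0 to 4) whose colour matches the seed.
fn classify(pixel_count: usize, bbox_w: usize, bbox_h: usize, corners_inside: u8) -> Shape {
    let area = bbox_w * bbox_h;
    if area == 0 {
        return Shape::Irregular;
    }
    let ratio = pixel_count as f64 / area as f64;
    if ratio >= 0.97 && corners_inside == 4 {
        Shape::Rectangle
    } else if ratio >= 0.82 && corners_inside <= 1 {
        Shape::Pill
    } else if (0.65..=0.83).contains(&ratio) && corners_inside == 0 {
        Shape::Ellipse
    } else {
        Shape::Irregular
    }
}
```

A filled circle has a fill ratio of about π/4 ≈ 0.785 with no corners inside, so it lands in Ellipse; a fully filled bbox with all four corners matching lands in Rectangle.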

The seed for the flood doesn’t have to be at the centre of the target region — flood-fill is a BFS that recovers the same bbox / centroid / classification regardless of starting point, as long as the seed lands somewhere inside the region. pick_seed_outside aims ~4 px outside the OCR text bbox specifically to leave the glyphs (which the flood treats as a separate region) and land on the surrounding fill.
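A seed picker along these lines could look like the sketch below. It follows the description in the pipeline diagram (try right, left, below, above at a fixed offset, and reject a candidate that differs too much from a pixel 2 px further out), but the exact candidate positions, types, and signature are assumptions, not the library's code.

```rust
/// RGB image, row-major (illustrative type).
struct Image {
    w: i32,
    h: i32,
    px: Vec<[u8; 3]>,
}

impl Image {
    fn get(&self, x: i32, y: i32) -> Option<[u8; 3]> {
        if x < 0 || y < 0 || x >= self.w || y >= self.h {
            return None;
        }
        Some(self.px[(y * self.w + x) as usize])
    }
}

#[derive(Clone, Copy)]
struct Rect {
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

fn dist_sq(a: [u8; 3], b: [u8; 3]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(&p, &q)| {
            let d = p as i32 - q as i32;
            (d * d) as u32
        })
        .sum()
}

/// Return the first candidate outside `bbox` whose colour is "uniform"
/// with the pixel 2 px further out (squared distance below `uniform_sq`).
fn pick_seed_outside(img: &Image, bbox: Rect, offset: i32, uniform_sq: u32) -> Option<(i32, i32)> {
    let cx = bbox.x + bbox.width / 2;
    let cy = bbox.y + bbox.height / 2;
    // (candidate position, outward direction)
    let candidates = [
        ((bbox.x + bbox.width - 1 + offset, cy), (1, 0)),  // right
        ((bbox.x - offset, cy), (-1, 0)),                  // left
        ((cx, bbox.y + bbox.height - 1 + offset), (0, 1)), // below
        ((cx, bbox.y - offset), (0, -1)),                  // above
    ];
    for ((x, y), (dx, dy)) in candidates {
        if let (Some(a), Some(b)) = (img.get(x, y), img.get(x + 2 * dx, y + 2 * dy)) {
            if dist_sq(a, b) < uniform_sq {
                return Some((x, y));
            }
        }
    }
    None
}
```

If the right-hand candidate sits next to a border or glyph fringe, the uniformity check rejects it and the picker moves on to the left-hand candidate.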

Tuning (SessionConfig::visual_region_tuning)

Every threshold the region pipeline uses is exposed on VisualRegionTuning:

  • tolerance: u8 (default 24) — distance threshold for “same region”, interpreted under color_distance. Glyph antialiasing pixels typically jump 60+ (RGB); subtle gradients within a button surface stay under 20. Lower the number when flood over-grows into adjacent widgets; raise it when flood under-grows because of gradients.
  • color_distance: ColorDistance (default LabCie76) — which colour-distance metric to use. See perceptual colour distance.
  • max_regions: usize (default 16) — safety cap on the iteration chain. Realistic widget tree depth is 3–5; the cap protects against pathological banded images.
  • seed_uniformity_threshold_sq: u32 (default 100) — squared RGB distance below which the seed-pick treats a candidate seed and its 2-px-out neighbour as “uniform”. Raise on noisy backgrounds.
  • shape_rectangle_min_ratio: f64 (default 0.97), shape_pill_min_ratio: f64 (default 0.82), shape_ellipse_ratio_range: (f64, f64) (default (0.65, 0.83)) — fill-ratio thresholds for shape classification.

MAX_PIXELS_PER_REGION is implicit and equal to the cropped image’s total pixel count — the flood can’t escape it.

Tuning (SessionConfig::visual_text_tuning)

Knobs on VisualTextTuning:

  • multiline_max_gap_factor: f32 (default 0.6) — see block grouping.
  • multiline_x_slack_px: i32 (default 4).
  • background_color_tolerance: u8 (default 24) — threshold for the bg-colour change check.
  • divider_detection_enabled: bool (default true).
  • ocr_context_padding_px: i32 (default 32) — padding added on every side of a cropped element before running OCR; gives the recognition head visual context that disambiguates small/low-contrast glyphs.
  • boundary_samples_per_axis: usize (default 16), boundary_majority_threshold: f32 (default 0.8) — divider-scan density and the majority threshold.
  • background_sample_radius: u32 (default 2) — radius of the averaged window used when sampling the bg colour at each boundary check. 0 falls back to a single-pixel sample.
  • color_distance: ColorDistance (default LabCie76).
  • connectivity_check_enabled: bool (default false), max_connectivity_pixels: usize (default 4096) — opt-in bounded flood-fill check; see block grouping.

Tuning (SessionConfig::visual_click_tuning)

Knobs on VisualClickTuning control the headless-mutter cold-start pointer workaround applied by VisualLocator::click and RegionLocator::click:

  • cold_start_warmup_enabled: bool (default true) — set to false on real hardware, where the cold-start race doesn’t apply, to fall through to a single motion + button-press.
  • cold_start_warmup_offset_px: f64 (default 4.0) — distance of the warmup motion from the target.
  • cold_start_motion_settle: Duration (default 60 ms) — sleep after each motion call.
  • cold_start_press_settle: Duration (default 50 ms) — sleep between button-down and button-up.
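The sequence these knobs control can be sketched against a mock pointer. Everything named here is hypothetical: the `Pointer` trait, the `ClickTuning` mirror struct, the direction of the warmup offset (to the right of the target), and the use of blocking sleeps in place of whatever async timer the library uses.

```rust
use std::time::Duration;

/// Hypothetical pointer abstraction standing in for the input backend.
trait Pointer {
    fn move_to(&mut self, x: f64, y: f64);
    fn button(&mut self, pressed: bool);
}

/// Local mirror of the four knobs listed above.
struct ClickTuning {
    warmup_enabled: bool,
    warmup_offset_px: f64,
    motion_settle: Duration,
    press_settle: Duration,
}

/// Warmup motion (optional), motion to the target, then press and release.
fn click_at<P: Pointer>(p: &mut P, x: f64, y: f64, t: &ClickTuning) {
    if t.warmup_enabled {
        // Assumed direction: nudge in from the right of the target.
        p.move_to(x + t.warmup_offset_px, y);
        std::thread::sleep(t.motion_settle);
    }
    p.move_to(x, y);
    std::thread::sleep(t.motion_settle);
    p.button(true);
    std::thread::sleep(t.press_settle);
    p.button(false);
}

#[derive(Debug, PartialEq)]
enum Event {
    Move(f64, f64),
    Button(bool),
}

/// Mock pointer that records what it was asked to do.
#[derive(Default)]
struct Recorder {
    events: Vec<Event>,
}

impl Pointer for Recorder {
    fn move_to(&mut self, x: f64, y: f64) {
        self.events.push(Event::Move(x, y));
    }
    fn button(&mut self, pressed: bool) {
        self.events.push(Event::Button(pressed));
    }
}
```

With the warmup enabled the recorder sees two motions before the press; with it disabled, a single motion followed by press and release.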

Model file verification

The auto-downloaded ocrs .rten model files are checksummed against constants embedded in crates/waydriver/src/visual/models.rs:

  • Cached file at session start: hashed, refused on mismatch (deleted + re-downloaded).
  • Fresh download: hashed before the *.partial → *.rten rename; a corrupted download never becomes a cache hit.
  • Env-var overrides (WAYDRIVER_OCRS_DETECTION_MODEL, WAYDRIVER_OCRS_RECOGNITION_MODEL) bypass verification — the user has explicitly pointed us at a file they control.

If upstream ocrs publishes new model files, they will no longer match the embedded constants and verification will refuse to load them. Capture the new hashes with sha256sum and update DETECTION_SHA256 / RECOGNITION_SHA256, or set the env-var override at runtime as an escape hatch.

Locator::list_text and Locator::list_labelled_regions — enumeration

When you want to discover what’s on screen rather than search for a specific label, two enumeration methods produce a complete map of the text-bearing widgets inside a Locator’s scope:

let dialog = session.locate("//Dialog[@name='Preferences']");

// Every OCR'd line inside the dialog, line text + union bbox.
let hits = dialog.list_text().await?;
for h in &hits {
    println!("{:?} at {:?}", h.text, h.bounds);
}

// Each line paired with its enclosing visual region. One flood-fill
// per label; the screenshot is taken once and reused.
for (label, region) in dialog.list_labelled_regions().await? {
    println!("{} ({:?}) inside {:?} shape", label.text, label.bounds, region.shape());
}

list_text returns Vec<TextHit> where each TextHit has the joined line text and the union bbox of all words in that line. There’s no substring filter — for searches use find_by_text. Cost is one OCR pass over the locator’s bounds (~50–200 ms cropped, ~200–500 ms full-screen).

list_labelled_regions adds a flood-fill per hit on top, returning Vec<(TextHit, RegionLocator)>. Use it for:

  • Test discovery / scaffolding. Print the full set of clickable text-bearing things in a dialog and pick targets interactively.
  • Visual regression. Compare label set + region shapes between runs.
  • Dynamic selection. “Click the first row whose label starts with Show” — list_labelled_regions then filter then click.

The cost is list_text plus N × flood-fill (typically ~10–30 ms each). For a dialog with 15 labels, the flood-fills add ~150–450 ms on top of the single OCR pass.

Session::region_at(x, y) — pixel-based entry point

The lowest level in the visual stack. Skips both OCR and the AT-SPI parent lookup — just flood-fills from the supplied screen pixel and returns the RegionLocator for whatever contiguous-colour shape contains that pixel.

// I already know there's a clickable thing near here.
let region = session.region_at(512, 365).await?;
match region.shape() {
    Shape::Pill | Shape::Rectangle => region.click().await?,
    _ => return Err(anyhow!("expected a button-shaped widget at the cursor")),
}

Useful for:

  • Coordinate-driven tests (you know the layout because you wrote the fixture).
  • Visual debugging: “what’s at this pixel?” — dump region and read its bbox/shape/centroid.
  • Bridge code that already has coordinates from another source (a previous screenshot, a layout assertion, a logged event).

The seed pixel doesn’t need to be at the centre of the region. Flood-fill is deterministic: any pixel inside the target region recovers the same bbox / centroid / shape. The only thing that varies with the seed is which region you get — a pixel on a text glyph returns the glyph’s bbox; a pixel on the button fill returns the button’s bbox.

The three Locator methods

All of them resolve self’s AT-SPI bounds, take a fresh screenshot, and call into the region pipeline.

  • Locator::find_regions(&self, inner: &VisualLocator) — full sweep. Returns Vec<RegionLocator> in outermost-first order: index 0 is the outermost region inside self’s bounds; the last element is the tightest region around inner. The order matches the call-site mental model (start at the parent, walk inward).
  • Locator::first_region(&self, inner) — outermost only (find_regions[0]). Runs the full sweep but skips the intermediate Vec allocations.
  • Locator::last_region(&self, inner) — innermost only (find_regions[last]). One flood-fill, no chain walk. Cheap. This is usually what you want — the button pill adjacent to the text.

Plus the convenience on VisualLocator:

  • VisualLocator::parent_region() — equivalent to parent.last_region(self), but doesn’t require the caller to remember the parent locator. Requires the VisualLocator to have a parent scope (constructed via Locator::find_by_text or Session::find_by_text(...).within(rect)).

RegionLocator action surface

Parallels VisualLocator’s shape, minus anything that would need AT-SPI handles:

  • bounds() -> Rect — axis-aligned bounding rect of the flood.
  • centroid() -> (i32, i32) — pixel-set centre, the click target.
  • click() — pointer click at the centroid. Uses the same motion-warmup-then-press pattern as VisualLocator::click to side-step headless mutter’s cold-start pointer-routing race.
  • hover() — pointer move only.
  • screenshot() — PNG cropped to bounds().

There is deliberately no fill, set_text, focus, or any is_<state> predicate. Those need AT-SPI handles; a region is just a bbox + centroid.

How they compose

// AT-SPI sees the parent dialog but not the lazy button inside it.
let dialog = session.locate("//Dialog[@name='Preferences']");

// Find the on-screen text "lazy-button" inside that dialog.
let text = dialog.find_by_text("lazy-button").await?;

// Click the centroid of the pill surrounding the text. One flood-fill
// from a seed adjacent to the OCR bbox — fastest of the three region
// methods because it doesn't walk the enclosure chain.
dialog.last_region(&text).await?.click().await?;

Three orthogonal layers:

| Layer | Input | Output | Cost |
| --- | --- | --- | --- |
| AT-SPI Locator | XPath | accessible refs | ms |
| VisualLocator | text + optional parent scope | text bboxes | 50–500 ms (OCR) |
| RegionLocator | text bbox + parent screenshot | shape + centroid | ~10–30 ms (flood) |

Each layer is opt-in. You reach down only when the layer above doesn’t work for your widget.

Cost summary

| Operation | Typical latency |
| --- | --- |
| AT-SPI locator (session.locate) | <10 ms |
| Session start, model download (first run) | 5–20 s |
| Session start, model load (no prewarm) | 1–2 s on first OCR call |
| Session start, model load (prewarm) | parallel with session boot |
| Session::find_by_text (full screen) | 200–500 ms |
| Locator::find_by_text (cropped) | 50–200 ms |
| Locator::last_region | +10–30 ms over OCR |
| Locator::find_regions (full sweep) | +30–100 ms (depends on chain depth) |

These latencies assume an optimized build. rten inference dominates OCR cost and is roughly 30× slower at the dev profile’s opt-level 0: measured ~5–8 s per full-frame pass with optimized dependencies vs ~50–200 s without, on CPU-only hosts. Consumers running the visual feature under cargo test must add a dependency-only override to the workspace root Cargo.toml (Cargo ignores profile overrides declared anywhere else — a library can’t ship this for you):

[profile.dev.package."*"]
opt-level = 3

(waydriver’s own workspace root already applies this to just the rten/ocrs crates, so in-repo contributors and the e2e suite get optimized OCR in dev/test builds without the broader "*" override. The init warning below still fires for in-repo debug builds — an opt-level override does not clear cfg(debug_assertions) — and is a known false-positive there.)

The engine loader logs a warning at init when it detects a debug build. Two further cost levers already built in: a scoped Locator::find_by_text crops the frame to the parent’s bounds before inference (fewer pixels, fewer text lines — only the unscoped Session::find_by_text pays for the full frame), and the per-frame OCR cache means repeated lookups on an unchanged screen reuse a single pass.

When to use what

  • Default path: Locator::click against an XPath. Use this unless the widget doesn’t surface in AT-SPI.
  • Widget renders text and isn’t in AT-SPI: Locator::find_by_text on the nearest AT-SPI parent, then .click(). Works when the text glyphs are inside the gesture-controller’s hit-rect (most AdwButtonRows, GTK buttons with centred labels).
  • Text-centre click doesn’t fire activation: parent.last_region(&text).click(). Uses the centroid of the enclosing visual shape, which is more robust for widgets where the inner label widget eats the click.
  • You want the surrounding card / panel, not the button: parent.first_region(&text).click() or walk find_regions and pick the layer you want.
  • No AT-SPI parent at all: Session::find_by_text(text).click() works but pays full-screen OCR cost; prefer constraining via .within(rect) whenever you can derive a scope.

Failure modes (known)

  • Sibling-coloured regions merge. If the button shares its fill colour with an adjacent widget, flood-fill spans both. Lower tolerance and re-test.
  • Gradient fills stop the flood early. A button with a top-to-bottom gradient may have RGB deltas exceeding tolerance partway down. Raise tolerance (carefully — too high and the flood eats neighbouring regions).
  • Thin antialiased borders ≤ 2 px can confuse pick_seed_outside if the 4-px offset lands inside the border. The seed picker validates uniformity against a neighbouring pixel and falls back to the next candidate, but pathological cases still exist. Construct the VisualLocator with a tighter .within(...) or supply an explicit Rect to side-step.
  • OCR misreads on small / low-contrast text. ocrs’s recognition head is trained on document text; UI labels at 10–14 px in dark themes can read poorly. The 32 px context-padding ring helps (tunable via VisualTextTuning::ocr_context_padding_px); raising the fixture’s font size if you control it helps more.
  • Pointer cold-start race. Headless mutter sometimes drops the first pointer event after a fresh session. VisualLocator::click and RegionLocator::click both warmup-motion-then-click to side-step it, but a test that triggers many rapid clicks can still hit the race on subsequent clicks. Add a 60 ms sleep between clicks if you see this — or tune VisualClickTuning (disable the warmup on real hardware, lengthen the settles on slow CI).
  • Custom theme with shadow rasters between rows. The divider scan can mistake anti-aliased shadow gradients for a horizontal rule and refuse to merge wrapped paragraphs. Set VisualTextTuning::divider_detection_enabled = false to fall back to bg-colour-only boundary detection.
  • Stale model cache from upstream rebuild. SHA-256 verification refuses to load model files that don’t match the embedded hashes. If ocrs publishes new models, either bump the constants in models.rs or set WAYDRIVER_OCRS_DETECTION_MODEL / WAYDRIVER_OCRS_RECOGNITION_MODEL to point at known-good files.
  • Right-to-left scripts and non-LTR reading order. The block grouper and the per-line haystack are built on the assumption that words read left-to-right within a line and lines read top-to-bottom within a block. Hebrew, Arabic, or any RTL script will produce word bboxes in screen-left-to-right order but the joined haystack won’t reflect logical reading order — substring matches against a logical-order needle may miss. Vertical scripts (Japanese/Chinese in tategaki) are not supported. If you’re driving an RTL app, prefer AT-SPI selectors; the visual locator’s matching semantics aren’t right for that case.

Implementation map

| What | Where |
| --- | --- |
| Session::find_by_text (root entry) | crates/waydriver/src/session.rs |
| Locator::find_by_text (scoped entry) | crates/waydriver/src/locator.rs |
| VisualLocator + OCR pipeline | crates/waydriver/src/visual/mod.rs |
| Model resolution + auto-download | crates/waydriver/src/visual/models.rs |
| Engine lifecycle (OnceCell shared cache) | crates/waydriver/src/visual/engine.rs |
| Flood-fill, seed picking, RegionLocator | crates/waydriver/src/visual/region.rs |
| Locator::find_regions/first_region/last_region | crates/waydriver/src/locator.rs |
| SessionConfig::visual_region_tuning | crates/waydriver/src/session.rs |
| Cargo feature visual | crates/waydriver/Cargo.toml |
| E2E test exercising both pipelines | crates/waydriver-e2e/tests/e2e.rs (lazy_a11y_*_clickable_via_visual_locator) |

Architecture Notes

Keepalive ScreenCast stream

In headless mode, Mutter only composites (and delivers Wayland frame callbacks) when a ScreenCast consumer is pulling frames. Without an active stream, GTK4 apps render their first frame but never repaint — the frame clock never ticks.

Session::start opens a persistent ScreenCast stream that stays alive for the session’s lifetime. This keeps Mutter compositing continuously so frame callbacks flow and GTK4 apps repaint normally.

Input: RemoteDesktop vs AT-SPI

Two input paths are available, with different trade-offs:

  • RemoteDesktop keyboard/pointer (press_keysym, pointer_button) — events go through the full Wayland input pipeline (Mutter -> Wayland protocol -> GDK -> GTK event loop). GTK4 processes them normally and repaints. Use this for interactions that need to produce visible changes.

  • AT-SPI actions (Locator::click() / focus() / set_text()) — directly invoke widget signal handlers through the accessibility tree, targeted by XPath. Accurate and precise, but they update GTK4’s internal model without triggering compositor redraws. Useful for reading the accessibility tree and programmatic activation, but screenshots taken after AT-SPI-only interactions may show stale frames.

App isolation

Apps are launched with GSETTINGS_BACKEND=keyfile and XDG_CONFIG_HOME pointing to the per-session runtime directory. This bypasses the host dconf daemon entirely, so each session starts with default app state and never reads or writes the user’s settings.
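With the standard library alone, the same environment can be set up as below. This is a sketch of the two variables named above only; the function names are hypothetical, and a real launch sets more than this (the Wayland display, for one), which is omitted here.

```rust
use std::ffi::OsStr;
use std::path::Path;
use std::process::Command;

/// Build (but do not spawn) an app command with per-session settings isolation.
fn isolated_command(program: &str, runtime_dir: &Path) -> Command {
    let mut cmd = Command::new(program);
    // The keyfile backend stores settings under XDG_CONFIG_HOME instead of dconf.
    cmd.env("GSETTINGS_BACKEND", "keyfile")
        .env("XDG_CONFIG_HOME", runtime_dir);
    cmd
}

/// Read back an explicitly set environment variable from a Command.
fn env_value(cmd: &Command, key: &str) -> Option<String> {
    cmd.get_envs()
        .find(|(k, _)| *k == OsStr::new(key))
        .and_then(|(_, v)| v.map(|v| v.to_string_lossy().into_owned()))
}
```

Because the config directory is per-session, two sessions of the same app cannot see each other's settings, and removing the runtime directory discards them.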

Dual D-Bus

GTK4’s built-in AT-SPI backend only registers on the host session bus — it ignores custom DBUS_SESSION_BUS_ADDRESS. So each session uses two D-Bus connections:

  • Host session bus: AT-SPI communication with the app
  • Private D-Bus: Mutter’s ScreenCast and RemoteDesktop APIs (isolated from the host compositor)

graph LR
    subgraph Host
        host_dbus["Host session bus"]
    end

    subgraph Session["Per-session"]
        private_dbus["Private D-Bus"]
        mutter["Mutter"]
        app["Your app"]
        waydriver["WayDriver"]
    end

    waydriver -- "AT-SPI" --> host_dbus
    app -- "AT-SPI register" --> host_dbus
    waydriver -- "ScreenCast\nRemoteDesktop" --> private_dbus
    mutter -- "org.gnome.Mutter.*" --> private_dbus

External-effect sinks

Some app behaviours leave the process entirely (a desktop notification, an “open this URL” portal request), so they have no AT-SPI projection to query. With capture_external_effects enabled, waydriver mocks the daemons that would receive those calls on the app’s session bus (org.freedesktop.Notifications and org.freedesktop.portal.Desktop’s OpenURI), records each call, and exposes them via get_captured_effects / Session::notifications() / open_uri_requests(). It’s opt-in because the sinks own well-known names: safe on the per-session/container bus, a no-op (with a warning) on a shared host bus that already runs a real daemon.

Clipboard / PRIMARY-selection readback is not available: Mutter 46.2 exposes no clipboard D-Bus interface and implements neither wlr-data-control nor ext-data-control-v1, so there’s no out-of-band way to read the selection. The working stopgap is to paste into the app (Ctrl+V / middle-click) and read the result back through the AT-SPI Text interface (read_text / Locator::text).

Screenshot and recording pipeline

graph LR
    screencast["Mutter ScreenCast API"]
    monitor["RecordMonitor\n(virtual monitor)"]
    pw_keep["PipeWire stream\n(keepalive — screenshots)"]
    pw_rec["PipeWire stream\n(dedicated — recording)"]
    gst_shot["On-demand GStreamer pipeline\n(pngenc snapshot=true)"]
    gst_rec["Long-lived GStreamer pipeline\n(vp8enc + webmmux)"]
    png["PNG bytes"]
    webm["WebM file"]

    screencast --> monitor
    monitor --> pw_keep --> gst_shot --> png
    monitor --> pw_rec --> gst_rec --> webm

Screenshots and recording use separate ScreenCast streams. take_screenshot spins up a transient pngenc pipeline on the keepalive stream on each call; recording runs a single vp8enc ! webmmux ! filesink pipeline for the session’s lifetime on its own dedicated stream, flushed with EOS on Session::kill so the WebM is seekable. They must not share a node: mutter’s screencast node only emits frames on screen damage (framerate=0/1), and a continuous recorder consumer would starve a later-attaching screenshot consumer of the initial frame on a static app — so the recorder gets its own stream and the screenshot path stays the keepalive node’s first/triggering consumer. Both use the GStreamer Rust bindings (gstreamer + gstreamer-app crates) and only gst-plugins-good (no -bad/-ugly).

API Reference

The full, type-level API reference is generated by rustdoc and hosted on docs.rs. It always reflects the latest published release:

| Crate | Reference |
| --- | --- |
| waydriver (core traits, Session, locators, keysym helpers) | docs.rs/waydriver |
| waydriver-compositor-mutter (CompositorRuntime impl) | docs.rs/waydriver-compositor-mutter |
| waydriver-input-mutter (InputBackend impl) | docs.rs/waydriver-input-mutter |
| waydriver-capture-mutter (CaptureBackend impl) | docs.rs/waydriver-capture-mutter |

For the high-level tool surface exposed to AI assistants, see the MCP Server chapter rather than the rustdoc.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.3.4 - 2026-06-16

Added

  • (atspi) read cache-only row values via text_ref/value_ref/selected_text_ref

Other

  • (atspi) measure event-cache behavior and specify the design (#11)
  • restore the GitHub-native video URL in the README
  • (atspi) parallelize the tree-walk snapshot (#11)
  • add a download-link fallback to the README demo video
  • serve the demo video from the repo instead of a dead GitHub URL
  • slim README to a landing page, defer detail to waydriver.io
  • add mdBook documentation site with GitHub Pages deploy
  • (atspi) benchmark tree-walk cost on a large synthetic tree (#11)

0.3.3 - 2026-06-14

Added

  • (locator) drag_to_coords for off-window drop endpoints
  • (gaction) drive app./win. GActions via org.gtk.Actions (#33)
  • external-effect sinks (notifications/portal) + single-instance CLI forwarding
  • (atspi) activate cache-only accessibles by (bus, path) ref
  • (visual) add perceptual baseline-compare primitive
  • live GSettings writes + AT-SPI Value/scroll readback
  • (visual) per-search OCR upscale via VisualLocator::with_upscale (#23)
  • (session) expose key_down/key_up for held-modifier gestures

Fixed

  • (session) claim external-effect sink names before launching the app
  • (atspi) resolve LABELLED_BY names for cache-only rows

Other

  • release v0.3.2
  • add cloud-env (non-Nix) dev tooling for SessionStart and Fedora container
  • limit rustdoc to the crate itself (--no-deps)

0.3.2 - 2026-06-14

Added

  • (locator) drag_to_coords for off-window drop endpoints
  • (gaction) drive app./win. GActions via org.gtk.Actions (#33)
  • external-effect sinks (notifications/portal) + single-instance CLI forwarding
  • (atspi) activate cache-only accessibles by (bus, path) ref
  • (visual) add perceptual baseline-compare primitive
  • live GSettings writes + AT-SPI Value/scroll readback
  • (visual) per-search OCR upscale via VisualLocator::with_upscale (#23)
  • (session) expose key_down/key_up for held-modifier gestures

Fixed

  • (session) claim external-effect sink names before launching the app
  • (atspi) resolve LABELLED_BY names for cache-only rows

Other

  • add cloud-env (non-Nix) dev tooling for SessionStart and Fedora container
  • limit rustdoc to the crate itself (--no-deps)

0.3.1 - 2026-06-13

Added

  • (atspi) read lazily-realized widgets via focus_walk + cache

Fixed

  • (pointer) translate window-relative AT-SPI bounds to screen space
  • (session) isolate XDG state/data/cache dirs + verify reported bugs live

Other

  • (visual) warn on debug-built OCR stack + document the ~30x cost

0.3.0 - 2026-06-12

Added

  • [breaking] harden visual-OCR, locator, and key-chord paths

Fixed

  • (capture) stop pipewire runtime-dir nesting overflow at the root

0.2.10 - 2026-06-08

Fixed

  • (mcp) keep start_session from hanging on stalled setup

0.2.9 - 2026-06-06

Added

  • (gsettings) per-session GSettings isolation via keyfile backend
  • (scale) custom display scale (HiDPI) for sessions

Other

  • apply cargo fmt

0.2.8 - 2026-06-05

Fixed

  • (compositor-mutter) snapshot host runtime root to keep session dirs flat

0.2.7 - 2026-06-03

Fixed

  • (capture) give the video recorder its own ScreenCast stream

0.2.6 - 2026-05-24

Added

  • (mcp) expose visual locator tools (OCR, template match, stdout wait)
  • (visual) opt-in visual locator stack — OCR, flood-fill regions, template matching

Other

  • (visual) apply rustfmt to visual locator stack

0.2.5 - 2026-05-13

Added

  • (locator) pointer-click fallback for widgets without AT-SPI Action

0.2.4 - 2026-05-12

Fixed

  • (session) prime mutter keyboard focus to prevent first-keypress drop

0.2.3 - 2026-04-29

Fixed

  • (compositor-mutter) scrub stale PIPEWIRE_REMOTE before spawning per-session pipewire stack

Other

  • (readme) embed gnome-calculator demo video

0.2.2 - 2026-04-26

Added

  • (locator) pointer-click fallback when fill target lacks Component::grab_focus
  • (input) thread CancellationToken through InputBackend for prompt kill
  • (locator) element-scoped pointer actions (hover, double_click, right_click, drag_to)
  • (locator) Locator::select_option via AT-SPI Selection interface
  • (input) Locator::scroll_into_view with AT-SPI + wheel fallbacks
  • (input) Locator::fill(), absolute pointer motion, Session::type_text (WAY-5)
  • (atspi) capture element bounds via Component::get_extents

Fixed

  • (mcp) kill_session no longer blocks on in-flight tool auto-waits

Other

  • release v0.2.1
  • refresh AGENTS.md and README.md for current API surface
  • workspace-wide audit pass tightening trait surfaces and error types
  • (mcp) split tool handlers into per-concern modules
  • (mcp) split monolithic main.rs into focused modules
  • split e2e tests into waydriver-e2e crate, add configurable video_fps
  • (error) preserve typed error sources on Atspi/Process/Screenshot
  • (compositor-mutter) separate doc paragraph before stage rationale

0.2.1 - 2026-04-26

Added

  • (locator) pointer-click fallback when fill target lacks Component::grab_focus
  • (input) thread CancellationToken through InputBackend for prompt kill
  • (locator) element-scoped pointer actions (hover, double_click, right_click, drag_to)
  • (locator) Locator::select_option via AT-SPI Selection interface
  • (locator) layered wait_for / wait_until / wait_until_async primitives
  • (input) Locator::scroll_into_view with AT-SPI + wheel fallbacks
  • (input) Locator::fill(), absolute pointer motion, Session::type_text (WAY-5)
  • (atspi) capture element bounds via Component::get_extents
  • (locator) add richer AT-SPI state predicates and matching waiters

Fixed

  • (session) bound kill latency with AT-SPI method timeout and shutdown budget
  • (mcp) kill_session no longer blocks on in-flight tool auto-waits

Other

  • refresh AGENTS.md and README.md for current API surface
  • workspace-wide audit pass tightening trait surfaces and error types
  • (error) preserve typed error sources on Atspi/Process/Screenshot
  • split e2e tests into waydriver-e2e crate, add configurable video_fps
  • (mcp) split tool handlers into per-concern modules
  • (mcp) split monolithic main.rs into focused modules
  • (compositor-mutter) separate doc paragraph before stage rationale

0.2.0 - 2026-04-24

Added

  • (fixture) GTK4/libadwaita e2e fixture with stdout event capture
  • (input) keyboard chord support via key_down/key_up primitives
  • (atspi) Locator::focus via Component::grab_focus
  • (atspi) auto-wait and explicit wait_for_* on Locator
  • (atspi) [breaking] XPath-based locator API over AT-SPI tree
  • (capture) WebM video recording for sessions
  • (mcp) configurable virtual-monitor resolution
  • (mcp) per-session event log and static HTML viewer
  • (mcp) configurable report dir with per-session screenshot counter

Other

  • update README and AGENTS.md for Locator API
  • (release) move CHANGELOG.md into waydriver crate with root symlink
  • (release) consolidate per-crate changelogs into workspace CHANGELOG
  • (mcp) drop flaky second-screenshot assertion in e2e

0.1.3 - 2026-04-17

Added

  • add publishable builder image and document multi-language dev workflows

0.1.2 - 2026-04-16

Added

  • (mcp) add Docker packaging and container-based e2e test

0.1.1 - 2026-04-16

Added

  • add MCP server for AI-driven headless UI testing

Other

  • add rustdoc comments to public API surface
  • add per-distro dependency tables and install commands to README

Contributing Guide

The canonical guidance for working in this repository — development environment, build/test commands, the architecture deep-dive, testing notes, the CI pipeline, and commit-message conventions — lives in AGENTS.md at the repo root. It is written for both human contributors and AI coding assistants. This page covers the one thing most contributors hit first: building without Nix.

Developing without Nix

Contributors who don’t use Nix can build and test the workspace directly once the system packages are installed. Two repo helpers automate this:

  • .claude/hooks/session-start.sh installs the build and runtime packages, ensures the rustfmt and clippy rustup components are present, and warms the crate cache. It is gated on $CLAUDE_CODE_REMOTE, so it only runs in the Claude Code cloud environment; on any other machine, run the apt/dnf/pacman command for your distro instead.
  • scripts/dev-container.sh drops you into a Fedora 42 shell (matching the Dockerfile/CI) with your working tree bind-mounted, for building waydriver-fixture-gtk and running the native e2e suite. Both need libadwaita ≥ 1.6, so neither can be built on Ubuntu 24.04 (which ships 1.5).

On a non-Nix host, build and test the rest of the workspace with --exclude waydriver-fixture-gtk, and set GST_PLUGIN_PATH, XDG_DATA_DIRS, and the at-spi2-core/libexec path yourself when running the raw binary (the nix run .#mcp wrapper that injects these is Nix-only). See AGENTS.md for details.