Visual locator — OCR + flood-fill region detection
Gated behind the visual Cargo feature on the waydriver crate. Adds
two coordinated abilities for finding widgets that the AT-SPI tree
doesn’t reveal:
- OCR-based text matching — locate a widget by its on-screen text when AT-SPI doesn’t surface it as an accessible.
- Region detection — once OCR finds the text, walk outward through the pixels to find the visually-distinct shape enclosing it (a button pill, row, card frame), so clicks land on the widget rather than its inner glyphs.
This doc describes both pipelines, how they compose, what they cost, and when each one is the right tool.
Why this exists
AT-SPI is the normal interaction path: enumerate the accessibility
tree, find a widget by name/role/state, call Action.do_action or
synthesize pointer events at its bounds. waydriver’s regular Locator
does all that.
But real toolkits have gaps. Two we’ve hit and confirmed are genuinely upstream:
- libadwaita lazy realization: an AdwPreferencesGroup constructed with visible: false inside an AdwPreferencesPage and then flipped visible after present() never has its accessible subtree built. The same happens to a non-initial AdwPreferencesDialog page. The contained AdwButtonRow / AdwSwitchRow paints on screen but is absent from every AT-SPI surface. We exhaustively tried to force realization from the client and none of the attempts worked (confirmed live on mutter 49 / GTK4 4.20 / libadwaita 1.8):
  - parent traversal (GetChildren), a 0..ChildCount GetChildAtIndex(i) loop, and Cache.GetItems on the app bus: the widgets are simply never published;
  - a grid of Component.GetAccessibleAtPoint hit-tests over the dialog and every descendant (thousands of calls): no change;
  - synthetic compositor pointer-hover across the page: no change;
  - keyboard focus traversal (Tab through the dialog, which is how Orca surfaces them): no change.

  Libadwaita doesn’t register these accessibles, and there’s no AT-SPI or input path that makes it do so. The bug is genuinely upstream; the OCR visual locator below is the only working way to drive these widgets.
- AdwButtonRow has no accessible name: even when the row is in the tree, its title doesn’t surface as an AT-SPI name, so Locator::find_by_name returns zero matches.
We can’t fix these from the client side: D-Bus enumeration finds what the toolkit chose to publish. The pixels on screen, however, are real. The visual locator drives off those pixels.
It’s strictly opt-in. waydriver’s existing Locator::click etc.
never silently fall back to OCR — the cost (hundreds of ms) is too
high to hide, and silent fallback would mask real selector bugs. You
reach for Session::find_by_text only when you’ve established that
AT-SPI doesn’t see the widget.
The OCR pipeline
┌──────────────────────────────────────────────┐
│ Session::take_screenshot() │
│ PipeWire keepalive stream → PNG bytes │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ image::load_from_memory(...) │
│ PNG → DynamicImage │
└────────────────────┬─────────────────────────┘
│
optional .within(rect) │ crop to parent region
+ 32px context pad │ (Locator::find_by_text)
v
┌──────────────────────────────────────────────┐
│ ocrs::OcrEngine │
│ prepare_input → detect_words → │
│ find_text_lines → recognize_text │
│ (pure-Rust, ONNX via rten) │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ Filter words by `text` (Substring/Exact) │
│ Translate bboxes back to screen coords │
│ Return Vec<Rect> │
└──────────────────────────────────────────────┘
Engine lifecycle
The OcrEngine is loaded once per session into a shared
tokio::sync::OnceCell. The two .rten model files (text-detection
~2.5 MB, text-recognition ~10 MB) are looked up in this order:
1. Env-var override: WAYDRIVER_OCRS_DETECTION_MODEL and WAYDRIVER_OCRS_RECOGNITION_MODEL both set.
2. XDG cache hit: $XDG_CACHE_HOME/waydriver/ocrs-models/ (or ~/.cache/...) has both files.
3. Auto-download: fetch from the ocrs project’s S3 bucket into the XDG cache. First call only; subsequent runs hit (2).
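The lookup order above can be sketched as a pure function. This is a minimal illustration, not waydriver's actual loader: the ModelSource enum, the resolve_models function, and the two file names are hypothetical, and the environment values and the file-existence check are passed in as parameters so the logic stays testable.

```rust
use std::path::{Path, PathBuf};

/// Where the two model files come from (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum ModelSource {
    EnvOverride { detection: PathBuf, recognition: PathBuf },
    Cache(PathBuf),
    Download(PathBuf),
}

// Assumed file names; the real ones are defined by the crate.
const DETECTION_FILE: &str = "text-detection.rten";
const RECOGNITION_FILE: &str = "text-recognition.rten";

/// $XDG_CACHE_HOME/waydriver/ocrs-models, falling back to ~/.cache/...
fn cache_dir(xdg_cache_home: Option<&str>, home: &str) -> PathBuf {
    let base = match xdg_cache_home {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => Path::new(home).join(".cache"),
    };
    base.join("waydriver").join("ocrs-models")
}

fn resolve_models(
    env_detection: Option<&str>,
    env_recognition: Option<&str>,
    xdg_cache_home: Option<&str>,
    home: &str,
    exists: impl Fn(&Path) -> bool,
) -> ModelSource {
    // 1. Env-var override: only honoured when BOTH variables are set.
    if let (Some(d), Some(r)) = (env_detection, env_recognition) {
        return ModelSource::EnvOverride {
            detection: PathBuf::from(d),
            recognition: PathBuf::from(r),
        };
    }
    // 2. XDG cache hit: both files must already be present.
    let dir = cache_dir(xdg_cache_home, home);
    if exists(&dir.join(DETECTION_FILE)) && exists(&dir.join(RECOGNITION_FILE)) {
        return ModelSource::Cache(dir);
    }
    // 3. Otherwise the caller downloads into the cache directory.
    ModelSource::Download(dir)
}
```

Setting only one of the two environment variables falls through to the cache lookup in this sketch, matching the "both set" wording above.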
Set SessionConfig::prewarm_visual = true
to spawn the engine load as a background task during Session::start
so the first find_by_text call doesn’t pay the ~1–2 s model load.
On a fresh machine with no XDG cache, the first session also pays
~5–20 s of model download — pre-populate the cache in CI setup if
that matters.
Cropping to a parent (the Locator::find_by_text path)
Session::find_by_text(text) OCRs the full screen. That works but
is slow (~200–500 ms on a 1024×768 frame) and noisy — every word
visible on screen is a candidate, so disambiguation matters.
Locator::find_by_text(text) on an AT-SPI parent locator is the
faster, more accurate form:
let dialog = session.locate("//Dialog[@name='Preferences']");
let text = dialog.find_by_text("lazy-button").await?;
This crops the screenshot to the parent’s AT-SPI bounds (plus a 32 px padding ring) before it reaches ocrs:
- Speed. OCR runtime is roughly linear in image area; cropping to a typical dialog cuts a search from ~300 ms to ~50 ms.
- Accuracy. Less surrounding text means fewer false positives and less context that confuses the recognition head.
Why the 32 px context padding? Empirically, a tight crop strips the
visual context that ocrs’s recogniser uses to disambiguate ambiguous
glyphs. Without padding, small/low-contrast labels misread (we saw
lazy-button → lazv-button). The 32 px ring restores the context;
hits inside the ring but outside the original scope are filtered
back out after OCR so the caller sees only matches that genuinely
fall inside the requested region.
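The pad-then-filter step can be sketched with plain rectangle arithmetic. This is an illustration under assumptions: the Rect type here is a local stand-in, and using the hit's centre point to decide whether it "falls inside the requested region" is a guess at the containment test, not a documented rule.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

impl Rect {
    /// Grow by `pad` on every side, clamped to a frame_w x frame_h screenshot.
    fn padded(self, pad: i32, frame_w: i32, frame_h: i32) -> Rect {
        let x0 = (self.x - pad).max(0);
        let y0 = (self.y - pad).max(0);
        let x1 = (self.x + self.width + pad).min(frame_w);
        let y1 = (self.y + self.height + pad).min(frame_h);
        Rect { x: x0, y: y0, width: x1 - x0, height: y1 - y0 }
    }

    fn contains(self, px: i32, py: i32) -> bool {
        px >= self.x && px < self.x + self.width && py >= self.y && py < self.y + self.height
    }

    fn center(self) -> (i32, i32) {
        (self.x + self.width / 2, self.y + self.height / 2)
    }
}

/// Translate crop-relative OCR hits back to screen coordinates, then drop
/// hits whose centre lies in the padding ring rather than in `scope`.
fn hits_in_scope(scope: Rect, crop: Rect, hits_in_crop: &[Rect]) -> Vec<Rect> {
    hits_in_crop
        .iter()
        .map(|h| Rect { x: h.x + crop.x, y: h.y + crop.y, ..*h })
        .filter(|h| {
            let (cx, cy) = h.center();
            scope.contains(cx, cy)
        })
        .collect()
}
```

The crop passed to OCR would be scope.padded(32, frame_w, frame_h); the filter then guarantees that the ring only contributes context, never results.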
MatchMode
- Substring (default): case-insensitive substring match. Tolerant of OCR’s noise (it’ll match "open" against "open-lazy-issue1-dialog").
- Exact: equality on the full joined line, normalized.
Both modes Unicode-normalize haystack and needle before comparing: NFKD decomposition + case-fold + combining-mark stripping. This makes matching insensitive to:
- Case: "Add Account" matches "add account".
- Diacritics: "café" matches "cafe", "naïve" matches "naive".
- Ligatures and compatibility codepoints: "ﬁle" (U+FB01) matches "file", "ﬂux" (U+FB02) matches "flux".
Exotic punctuation (e.g. the Unicode minus − U+2212 that
gnome-calculator uses in its history line) is not auto-mapped
to ASCII equivalents — match it explicitly when needed.
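The difference between the two modes can be sketched on a single OCR line. This is a deliberately simplified illustration: the normalize function below only lowercases, because NFKD decomposition and combining-mark stripping need a Unicode crate that this std-only sketch avoids. The line_matches helper is hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum MatchMode {
    Substring,
    Exact,
}

/// Simplified stand-in for the real normalisation: lowercase only.
/// (The documented pipeline also applies NFKD and strips combining marks.)
fn normalize(s: &str) -> String {
    s.to_lowercase()
}

/// Join a line's words with spaces and compare against the needle.
fn line_matches(words: &[&str], needle: &str, mode: MatchMode) -> bool {
    let haystack = normalize(&words.join(" "));
    let needle = normalize(needle);
    match mode {
        MatchMode::Substring => haystack.contains(&needle),
        MatchMode::Exact => haystack == needle,
    }
}
```

With this sketch, "Add account" matches the line "Add account and continue" under Substring but not under Exact, and a needle written with an ASCII hyphen does not match text containing U+2212.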
Block grouping with visual-boundary detection
The grouper builds blocks bottom-up from OCR’s text lines using several heuristics, applied in order:
- Geometric clustering: lines with small y-gap and overlapping x-ranges merge into one block (wrapped paragraph behaviour).
- Pixel-level boundary checks (when an image is available): even if the geometric tests pass, the merge is vetoed when the gap between two lines contains:
  - A background-colour change: sample an averaged window of pixels just below the upper line and one just above the lower line (window radius VisualTextTuning::background_sample_radius px, default 2 = 5×5); if their colours differ by more than VisualTextTuning::background_color_tolerance (default 24), the lines sit on different backgrounds. The averaged-window sampler smooths over single antialias-fringe pixels that would skew a single-pixel read.
  - A horizontal divider stripe: scan every row in the gap; a row where ≥ boundary_majority_threshold (default 0.8) of boundary_samples_per_axis (default 16) sampled pixels differ from both surrounding backgrounds is a horizontal rule.
  - A vertical divider stripe: scan every column in the x-overlap range; same majority + colour-distance test. Picks up split-pane rules that pass through the gap.
- Connectivity check (opt-in, connectivity_check_enabled = false by default): a bounded BFS in the gap. From the bg pixel just below the upper line, flood-fill at most max_connectivity_pixels (default 4096) pixels and check whether the flood reaches the bg pixel just above the lower line. If not, the lines are in visually-separated regions despite having the same background colour. This catches “two cards on the same fill, each boxed in by a thin border the divider check is too sparse to detect”.
All checks consult VisualTextTuning::color_distance (default
LabCie76, see below) when comparing pixels. The divider checks
toggle together via divider_detection_enabled (default true);
disable on themes where shadow rasters or anti-aliased streaks
would trip the heuristic.
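The horizontal divider test described above is a majority vote over sampled pixels. The sketch below shows one way to write it; is_divider_row is a hypothetical name, the sample positions are assumed to be evenly spaced, and raw RGB distance is used for brevity where the documented default metric is LabCie76.

```rust
type Rgb = [u8; 3];

/// Squared Euclidean distance in raw RGB (a simplification of the
/// configurable colour-distance metric).
fn dist_sq(a: Rgb, b: Rgb) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(&p, &q)| {
            let d = p as i32 - q as i32;
            (d * d) as u32
        })
        .sum()
}

/// True when at least `threshold` of `samples` evenly spaced pixels on
/// `row` differ from BOTH surrounding backgrounds by more than `tolerance`.
fn is_divider_row(
    row: &[Rgb],
    bg_upper: Rgb,
    bg_lower: Rgb,
    tolerance: u32,
    samples: usize,
    threshold: f32,
) -> bool {
    if row.is_empty() || samples == 0 {
        return false;
    }
    let tol_sq = tolerance * tolerance;
    let mut differing = 0usize;
    for i in 0..samples {
        // Evenly spaced sample positions; always < row.len().
        let x = (i * row.len()) / samples;
        let px = row[x];
        if dist_sq(px, bg_upper) > tol_sq && dist_sq(px, bg_lower) > tol_sq {
            differing += 1;
        }
    }
    differing as f32 / samples as f32 >= threshold
}
```

Requiring the pixel to differ from both backgrounds keeps a plain background-colour transition (already covered by the first check) from also being reported as a rule.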
Perceptual colour distance
ColorDistance controls how the visual locator compares pixel
colours, both for region detection (flood-fill, seed pick, shape
classification) and the boundary checks:
- Rgb: raw RGB Euclidean squared distance. Cheap, not perceptual. Use to reproduce legacy thresholds tuned against raw RGB.
- LabCie76 (default): ΔE*76 in CIE Lab space. Roughly perceptual (“a ΔE of 6 is barely noticeable, 12 is clearly different”), cheap (one sRGB→Lab conversion).
- LabCie2000: ΔE*00, perceptual gold standard. ~5× slower than CIE76; only worth it when CIE76 misclassifies subtle hue shifts in practice.
The default background_color_tolerance: 24 scales sensibly across
modes — RGB ΔE 24 maps to Lab ΔE76 ~6, both “near-identical
backgrounds”. When retuning, re-tune for the mode you switched to.
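For reference, the two cheaper metrics can be written out with the standard sRGB to CIE Lab conversion (D65 white point). This is a textbook sketch, not the crate's implementation; the function names are illustrative.

```rust
/// Undo the sRGB transfer curve for one 8-bit channel.
fn srgb_to_linear(c: u8) -> f64 {
    let c = c as f64 / 255.0;
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

/// The piecewise cube-root function used by the CIE Lab definition.
fn lab_f(t: f64) -> f64 {
    const DELTA: f64 = 6.0 / 29.0;
    if t > DELTA * DELTA * DELTA {
        t.cbrt()
    } else {
        t / (3.0 * DELTA * DELTA) + 4.0 / 29.0
    }
}

/// sRGB -> linear RGB -> XYZ (D65) -> Lab.
fn rgb_to_lab(rgb: [u8; 3]) -> [f64; 3] {
    let r = srgb_to_linear(rgb[0]);
    let g = srgb_to_linear(rgb[1]);
    let b = srgb_to_linear(rgb[2]);
    let x = 0.4124 * r + 0.3576 * g + 0.1805 * b;
    let y = 0.2126 * r + 0.7152 * g + 0.0722 * b;
    let z = 0.0193 * r + 0.1192 * g + 0.9505 * b;
    let (fx, fy, fz) = (lab_f(x / 0.95047), lab_f(y), lab_f(z / 1.08883));
    [116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)]
}

/// Delta E*76: Euclidean distance in Lab space.
fn delta_e76(a: [u8; 3], b: [u8; 3]) -> f64 {
    let (la, lb) = (rgb_to_lab(a), rgb_to_lab(b));
    la.iter().zip(lb.iter()).map(|(p, q)| (p - q) * (p - q)).sum::<f64>().sqrt()
}

/// Raw RGB squared Euclidean distance (the cheap, non-perceptual metric).
fn rgb_dist_sq(a: [u8; 3], b: [u8; 3]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(&p, &q)| {
            let d = p as i32 - q as i32;
            (d * d) as u32
        })
        .sum()
}
```

Black against white comes out at a Delta E*76 of about 100, which gives a sense of scale for tolerance values in the Lab modes.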
Multi-word and multi-line matching
OCR returns text as a tree of TextLines, each containing
TextWords. The matcher joins words with spaces and substring-
matches against the joined string. Two layers of join:
- Per-line for MatchMode::Exact. A line’s words are joined with spaces; the needle must equal the whole joined line. Use Exact to distinguish "Add account" from "Add account and continue".
- Per-block for MatchMode::Substring. The grouper builds multi-line blocks from geometrically-close lines (see block grouping). For each block, the matcher tries every joiner-choice variant: at each line break, it can use " " or "" independently, giving 2^(N−1) variants for a block of N lines (capped at N = 5; above that, fall back to the single space-join). This handles:
  - Wrapped multi-word labels: "Click here to learn more" matches whether the words wrapped onto one row or three (the space-join variant covers this).
  - Hyphenated wraps: "needle" matches an OCR result of ["nee", "dle"] (the no-space variant joins to "needle").
  - Ligature splits across lines (rare but possible): the Unicode normalization pass handles ligatures inside a single line already; the variants extend the same idea across breaks.
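The joiner-choice enumeration for a block can be sketched with a bitmask over the line breaks. The function name join_variants and the constant are illustrative; the cap of five lines and the space-join fallback follow the description above.

```rust
/// Largest block for which every joiner combination is tried.
const MAX_VARIANT_LINES: usize = 5;

/// All ways to join `lines`, choosing " " or "" at each line break.
/// Blocks longer than the cap fall back to the single space-join.
fn join_variants(lines: &[&str]) -> Vec<String> {
    if lines.is_empty() {
        return Vec::new();
    }
    if lines.len() > MAX_VARIANT_LINES {
        return vec![lines.join(" ")];
    }
    let breaks = lines.len() - 1;
    let mut out = Vec::with_capacity(1 << breaks);
    for mask in 0u32..(1 << breaks) {
        let mut joined = String::from(lines[0]);
        for (i, line) in lines[1..].iter().enumerate() {
            // Bit i set: join break i with a space. Clear: join with nothing.
            if mask & (1 << i) != 0 {
                joined.push(' ');
            }
            joined.push_str(line);
        }
        out.push(joined);
    }
    out
}
```

A two-line block ["nee", "dle"] yields "needle" and "nee dle"; a five-line block yields 16 variants, which is why the cap keeps the search cheap.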
When a substring match spans multiple words — on the same line or across lines — the returned bbox is the union of the matched words’ bboxes. For a single-line match this is the tight rectangle around the matched text. For a multi-line match it’s the AABB of every involved word, which can include vertical gaps between the text rows; the centroid still lands inside the matched text block, which is what you want for clicking and region seeding.
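The union bbox is a straightforward min/max fold. A small sketch with a local Rect stand-in (not the crate's type) shows how a two-line match produces a box that spans the gap between rows:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

/// Axis-aligned bounding box of all `rects`, or None for an empty slice.
fn union_bbox(rects: &[Rect]) -> Option<Rect> {
    let first = rects.first()?;
    let (mut x0, mut y0) = (first.x, first.y);
    let (mut x1, mut y1) = (first.x + first.width, first.y + first.height);
    for r in &rects[1..] {
        x0 = x0.min(r.x);
        y0 = y0.min(r.y);
        x1 = x1.max(r.x + r.width);
        y1 = y1.max(r.y + r.height);
    }
    Some(Rect { x: x0, y: y0, width: x1 - x0, height: y1 - y0 })
}
```

For words at (10, 10, 40×12) and (10, 30, 60×12) the union is (10, 10, 60×32): the 8 px gap between the rows is included, and the centre of that box still falls within the matched block.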
Trade-off of cross-line substring: unrelated labels on
adjacent lines can spuriously match across the line break (a search
for "account Remove" would hit text that read
"Add account / Remove account"). In practice nobody writes
selectors that way, and the user opted in to OCR because AT-SPI
couldn’t help — they’re already using a fuzzy tool. Use Exact
when you need line-precise semantics.
Introspection
Both VisualLocator and RegionLocator implement Debug, so
tracing::debug!("{loc:?}") or dbg!(loc) shows what the locator
represents:
VisualLocator { kind: "text-label", text: "Add account",
match_mode: Substring, region: Some(Rect { ... }),
timeout: None }
RegionLocator { kind: "visual-region",
bbox: Rect { x: 192, y: 158, width: 640, height: 92 },
centroid: (512, 204) }
The kind field is a constant string that makes the role explicit
in logs — "text-label" for OCR text matches, "visual-region" for
flood-fill shapes — so dumps tell you what the locator means without
having to follow the type back to its constructor.
VisualLocator also exposes the constructed-with values via
getters:
- text(): the search query.
- region(): the parent scope, if any.
- match_mode(): current matching strategy.
What VisualLocator::click does today
It clicks the centre of the OCR word’s bbox. This works when the text
glyphs sit inside the gesture controller’s hit-rect, for instance a centred label
inside an AdwButtonRow.
It doesn’t always work:
- Checkboxes / toggles whose label and click target are separate widgets.
- Widgets sized much larger than their text, where clicking on the glyphs hits the inner label’s selection gesture instead of the surrounding container’s activation gesture.
For those cases, the region pipeline below is the escape hatch.
The template-matching pipeline
For widgets that have no on-screen text (icon-only buttons, image
links, custom-drawn glyphs), OCR can’t help. The ImageLocator
path takes a reference PNG captured against a known-good
screenshot of the same app, and finds where that patch sits in the
current screen via classical normalized cross-correlation (NCC).
let icon = std::fs::read("references/save_icon.png")?;
session
    .find_image(&icon)?
    .with_threshold(0.9)
    .click()
    .await?;
// Or scoped to an AT-SPI parent (faster, fewer false positives):
let toolbar = session.locate("//ToolBar[@name='Main']");
toolbar
    .find_image(&icon).await?
    .click()
    .await?;
Algorithm
1. Decode the template PNG once at find_image time.
2. On each terminal-method call (bounds, click, …), take a fresh screenshot, crop to the optional scope rect, convert both target and template to grayscale.
3. imageproc::template_matching::match_template with method CrossCorrelationNormalized: slide the template, scoring each position by NCC (Σ(a·b) / sqrt(Σa² · Σb²), in [0, 1], peaks at 1.0 for a perfect match).
4. Walk the score grid for all peaks above the threshold (default 0.85), sort best-first, apply non-maximum suppression so neighbouring peaks within min(template_w, template_h) / 2 px collapse to one hit.
5. Translate hit positions back into screen coords.
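To make the scoring and suppression steps concrete, here is a naive std-only sketch of NCC matching over grayscale buffers. It is not the imageproc implementation: the Gray struct and find_matches are hypothetical, and measuring the suppression radius per axis (rather than as a Euclidean distance) is an assumption.

```rust
/// Row-major 8-bit grayscale image (illustrative stand-in type).
struct Gray {
    w: usize,
    h: usize,
    px: Vec<u8>,
}

/// NCC score of `tpl` placed at (ox, oy): sum(a*b) / sqrt(sum(a^2) * sum(b^2)).
fn ncc_at(img: &Gray, tpl: &Gray, ox: usize, oy: usize) -> f64 {
    let (mut dot, mut sum_a, mut sum_b) = (0.0f64, 0.0f64, 0.0f64);
    for ty in 0..tpl.h {
        for tx in 0..tpl.w {
            let a = img.px[(oy + ty) * img.w + ox + tx] as f64;
            let b = tpl.px[ty * tpl.w + tx] as f64;
            dot += a * b;
            sum_a += a * a;
            sum_b += b * b;
        }
    }
    let denom = (sum_a * sum_b).sqrt();
    if denom == 0.0 { 0.0 } else { dot / denom }
}

/// Positions scoring >= threshold, best first, after greedy
/// non-maximum suppression with radius min(w, h) / 2.
fn find_matches(img: &Gray, tpl: &Gray, threshold: f64) -> Vec<(usize, usize, f64)> {
    let mut hits = Vec::new();
    if tpl.w == 0 || tpl.h == 0 || tpl.w > img.w || tpl.h > img.h {
        return hits;
    }
    for oy in 0..=(img.h - tpl.h) {
        for ox in 0..=(img.w - tpl.w) {
            let score = ncc_at(img, tpl, ox, oy);
            if score >= threshold {
                hits.push((ox, oy, score));
            }
        }
    }
    hits.sort_by(|a, b| b.2.total_cmp(&a.2));
    let radius = tpl.w.min(tpl.h) / 2;
    let mut kept: Vec<(usize, usize, f64)> = Vec::new();
    for hit in hits {
        // Drop a hit if a better one was already kept within the radius.
        let near = kept
            .iter()
            .any(|k| k.0.abs_diff(hit.0) <= radius && k.1.abs_diff(hit.1) <= radius);
        if !near {
            kept.push(hit);
        }
    }
    kept
}
```

The returned coordinates are relative to the searched image; when the search ran on a cropped scope, adding the crop origin gives screen coordinates.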
Threshold tuning
- 0.95+: very strict. Use when the reference was captured on the same machine, same theme, same DPI as the test run. Rejects most false positives in busy layouts.
- 0.85 (default): tolerant of subpixel antialias differences and minor lighting shifts.
- <0.70: likely matches something, but in a busy screen will probably match the wrong thing. If a known-good reference scores below 0.7, recapture it.
When to use this vs. find_by_text
| You want to click… | Use |
|---|---|
| A button with text | find_by_text("Save") |
| An icon-only button (Save icon, hamburger, X) | find_image(&icon_png) |
| A widget AT-SPI surfaces | Locator with an XPath selector |
| Something that wraps over multiple lines | find_by_text("Click here to learn more") |
OCR is the right choice whenever you can read the on-screen text. Template matching is the escape hatch for visual-only widgets.
Known failure modes
- DPI / scale change. A 32×32 reference captured on a 1× display won’t match a 64×64 render on a 2× display. The basic matcher does no scale search; recapture per DPI, or build an image pyramid wrapper if a workload demonstrates the need.
- Theme swap. Light → dark mode = all references stale.
- Antialias / font hinting drift. Same widget on a different GPU / fontconfig stack can score below 0.85. Lower the threshold or recapture.
- Animation / hover / focus mid-capture. Ripple effects, focus rings, hover highlights all change the pixels. Capture references in a steady state.
- Multiple identical icons on screen. bounds() errors out on ambiguous matches; use within(rect) to disambiguate.
Cost
One NCC pass over the haystack ≈ O(W·H·w·h) work. For a 1920×1080
screenshot and a 64×64 template, ~8 billion ops naïvely; modern
machines do this in 10–50 ms. Cropping with within(rect) cuts
the haystack and is the single best speedup. The implementation
calls match_template (single-threaded); if a workload demands
it, swapping to match_template_parallel is a one-line change.
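The operation count is easy to check by hand. The sketch below computes both the W·H·w·h upper bound and the exact count for a naive sliding window, which visits (W−w+1)·(H−h+1) positions; the function names are illustrative.

```rust
/// Upper bound used in the text: W * H * w * h.
fn ncc_ops_upper_bound(img_w: u64, img_h: u64, tpl_w: u64, tpl_h: u64) -> u64 {
    img_w * img_h * tpl_w * tpl_h
}

/// Exact count for a naive sliding window: one multiply-accumulate per
/// template pixel at each valid top-left position.
fn ncc_ops(img_w: u64, img_h: u64, tpl_w: u64, tpl_h: u64) -> u64 {
    if tpl_w == 0 || tpl_h == 0 || tpl_w > img_w || tpl_h > img_h {
        return 0;
    }
    let positions = (img_w - tpl_w + 1) * (img_h - tpl_h + 1);
    positions * tpl_w * tpl_h
}
```

For 1920×1080 and a 64×64 template the bound is 8,493,465,600 and the exact count is 7,735,578,624. Cropping the haystack to 400×300 drops the exact count to 327,143,424, more than 20 times fewer.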
The region detection pipeline
When clicking text glyphs doesn’t fire the surrounding widget’s activation, we want a different click target: the centroid of the visually-distinct shape that contains the text. That’s typically a button pill, a row’s rounded rectangle, or a card frame.
The algorithm is a BFS flood-fill from a seed pixel adjacent to the OCR text bbox. A “region” is a contiguous block of pixels whose colour distance to a seed sample (measured with the configured ColorDistance metric) is within tolerance: a button’s fill, a row’s background, a card’s surface. Each iteration finds one enclosing region; iterating outward builds a chain.
┌──────────────────────────────────────────────┐
│ Inputs │
│ parent_bounds (AT-SPI Rect, screen coords) │
│ inner_bbox (OCR text bbox, screen coords)│
│ full_png (Session::take_screenshot) │
│ tuning (SessionConfig::visual_ │
│ region_tuning) │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ Crop full_png to parent_bounds │
│ Translate inner_bbox into crop coords │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ pick_seed_outside(inner_bbox, image) │
│ Try right / left / below / above the │
│ inner bbox, +4 px offset. Sanity-check │
│ uniformity vs a neighbouring pixel so we │
│ don't seed on glyph antialiasing fringe. │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ flood_fill(image, seed, tolerance) │
│ BFS, Vec<bool> visited grid. │
│ Add 4-neighbour pixels where │
│ ‖rgb(neighbour) - rgb(seed)‖₂ ≤ tolerance│
│ Track bbox + centroid as we go. │
└────────────────────┬─────────────────────────┘
│
v
┌──────────────────────────────────────────────┐
│ region_0 = { bbox, centroid } │
│ Translate back to screen coords. │
│ Push into result list. │
└────────────────────┬─────────────────────────┘
│
v (find_regions / first_region only)
┌──────────────────────────────────────────────┐
│ Stop? │
│ • region == previous region (no growth) │
│ • region covers entire crop │
│ • iteration count ≥ tuning.max_regions │
│ • pixel_just_outside(region) has nowhere │
│ to go (region touches all image edges) │
└────────────────────┬─────────────────────────┘
│ otherwise
v
┌──────────────────────────────────────────────┐
│ seed = pixel_just_outside(region.bbox) │
│ Loop back to flood_fill. │
└──────────────────────────────────────────────┘
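The flood_fill box in the diagram can be sketched in a few dozen lines. This is a minimal illustration using raw RGB distance as drawn in the diagram; the Image and Region types are local stand-ins, and the real pipeline's colour metric is configurable.

```rust
use std::collections::VecDeque;

type Rgb = [u8; 3];

struct Image {
    w: usize,
    h: usize,
    px: Vec<Rgb>, // row-major
}

#[derive(Debug, PartialEq)]
struct Region {
    min: (usize, usize),
    max: (usize, usize), // inclusive
    pixels: usize,
    centroid: (f64, f64), // mean of visited pixel positions
}

fn dist_sq(a: Rgb, b: Rgb) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(&p, &q)| {
            let d = p as i32 - q as i32;
            (d * d) as u32
        })
        .sum()
}

/// BFS over 4-neighbours whose colour is within `tolerance` of the seed,
/// tracking bbox and centroid as pixels are visited.
fn flood_fill(img: &Image, seed: (usize, usize), tolerance: u32) -> Option<Region> {
    if seed.0 >= img.w || seed.1 >= img.h {
        return None;
    }
    let seed_rgb = img.px[seed.1 * img.w + seed.0];
    let tol_sq = tolerance * tolerance;
    let mut visited = vec![false; img.w * img.h];
    let mut queue = VecDeque::new();
    visited[seed.1 * img.w + seed.0] = true;
    queue.push_back(seed);
    let (mut min, mut max) = (seed, seed);
    let (mut sum_x, mut sum_y, mut count) = (0u64, 0u64, 0u64);
    while let Some((x, y)) = queue.pop_front() {
        min = (min.0.min(x), min.1.min(y));
        max = (max.0.max(x), max.1.max(y));
        sum_x += x as u64;
        sum_y += y as u64;
        count += 1;
        // wrapping_sub turns "left of column 0" into usize::MAX, which the
        // bounds check below rejects.
        let neighbours = [(x.wrapping_sub(1), y), (x + 1, y), (x, y.wrapping_sub(1)), (x, y + 1)];
        for (nx, ny) in neighbours {
            if nx >= img.w || ny >= img.h {
                continue;
            }
            let idx = ny * img.w + nx;
            if visited[idx] || dist_sq(img.px[idx], seed_rgb) > tol_sq {
                continue;
            }
            visited[idx] = true;
            queue.push_back((nx, ny));
        }
    }
    Some(Region {
        min,
        max,
        pixels: count as usize,
        centroid: (sum_x as f64 / count as f64, sum_y as f64 / count as f64),
    })
}
```

Because every neighbour is compared against the seed colour rather than against the pixel it was reached from, the result on a flat fill does not depend on which pixel inside the fill was used as the seed.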
Why a centroid, not a bbox centre
For axis-aligned rectangles, the bbox centre and the geometric centroid coincide. For non-rectangular shapes (pills, circles, polygon icons) the two can diverge, and for asymmetric shapes the bbox centre can land outside the actual region. The centroid is the mean of every pixel position in the visited set; for convex shapes it always lies inside the shape, which is where you want to click.
For a 60×30 pill flood-filled from inside, the centroid lands at the pill’s geometric centre. For a circle, same. For an L-shaped selection or a polygon icon, the centroid follows the filled pixels more closely than the bbox centre does, although strongly concave shapes (a thin L, a ring) can still put it outside the fill.
Shape classification
Each RegionLocator carries a coarse Shape
value derived from the flood-fill’s pixel-count vs bbox-area ratio
combined with a 4-corner sample. The classifier picks one of:
- Rectangle: fill ratio ≥ 0.97 and all four bbox corners match the seed colour. Bare GTK button interiors, AdwButtonRow contents.
- Pill: fill ratio ≥ 0.82 with 0–1 bbox corners inside. The corner radius trims the bbox corners off the shape. Most GTK button pills and Adw row backgrounds land here.
- Ellipse: fill ratio in 0.65–0.83 with 0 bbox corners inside. Round avatar buttons, circular close icons.
- Irregular: anything else. Polygon icons, regions with holes, shapes whose ratio doesn’t fit a primitive. Don’t trust bounds().center_*() here; use centroid().
The classification is best-effort, intended for assertions and log readability, not as a contract. Borderline cases (e.g. a rectangle with one pixel of antialiased corner darkening) can flip between categories. If a test branches on shape, treat unexpected classifications as a soft signal rather than an absolute fail.
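The rules above translate into a short decision chain. This sketch is an assumption about ordering: it checks Rectangle, then Pill, then Ellipse, so a ratio in the 0.82–0.83 overlap with no corners inside comes out as Pill. The classify function and ShapeThresholds struct are illustrative names.

```rust
#[derive(Debug, PartialEq)]
enum Shape {
    Rectangle,
    Pill,
    Ellipse,
    Irregular,
}

struct ShapeThresholds {
    rectangle_min_ratio: f64,
    pill_min_ratio: f64,
    ellipse_ratio_range: (f64, f64),
}

impl Default for ShapeThresholds {
    fn default() -> Self {
        // Defaults as documented for the region tuning knobs.
        ShapeThresholds {
            rectangle_min_ratio: 0.97,
            pill_min_ratio: 0.82,
            ellipse_ratio_range: (0.65, 0.83),
        }
    }
}

/// `corners_inside` is how many of the four bbox corners belong to the flood.
fn classify(
    pixel_count: usize,
    bbox_w: usize,
    bbox_h: usize,
    corners_inside: u8,
    t: &ShapeThresholds,
) -> Shape {
    let area = bbox_w * bbox_h;
    if area == 0 {
        return Shape::Irregular;
    }
    let ratio = pixel_count as f64 / area as f64;
    let (lo, hi) = t.ellipse_ratio_range;
    if ratio >= t.rectangle_min_ratio && corners_inside == 4 {
        Shape::Rectangle
    } else if ratio >= t.pill_min_ratio && corners_inside <= 1 {
        Shape::Pill
    } else if ratio >= lo && ratio <= hi && corners_inside == 0 {
        Shape::Ellipse
    } else {
        Shape::Irregular
    }
}
```

As a sanity check on the thresholds: an ideal circle fills π/4 ≈ 0.785 of its bbox, and a 60×30 pill with fully rounded ends fills about 0.89, so both sit comfortably inside their bands.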
The seed for the flood doesn’t have to be at the centre of the
target region — flood-fill is a BFS that recovers the same bbox /
centroid / classification regardless of starting point, as long as
the seed lands somewhere inside the region. pick_seed_outside
aims ~4 px outside the OCR text bbox specifically to leave the
glyphs (which the flood treats as a separate region) and land on
the surrounding fill.
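The seed pick can be sketched as four candidate probes with a uniformity check. The candidate order (right, left, below, above), the 4 px offset, and the 2-px-out neighbour follow the descriptions in this document; the types, the choice of the bbox mid-line for each probe, and the function body are assumptions.

```rust
type Rgb = [u8; 3];

struct Image {
    w: usize,
    h: usize,
    px: Vec<Rgb>, // row-major
}

impl Image {
    /// Pixel at (x, y), or None when the coordinate is off-image.
    fn get(&self, (x, y): (i64, i64)) -> Option<Rgb> {
        if x < 0 || y < 0 || x >= self.w as i64 || y >= self.h as i64 {
            return None;
        }
        Some(self.px[y as usize * self.w + x as usize])
    }
}

#[derive(Clone, Copy)]
struct Rect {
    x: usize,
    y: usize,
    width: usize,
    height: usize,
}

fn dist_sq(a: Rgb, b: Rgb) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(&p, &q)| {
            let d = p as i32 - q as i32;
            (d * d) as u32
        })
        .sum()
}

const SEED_OFFSET: i64 = 4; // distance from the text bbox edge
const NEIGHBOUR_OFFSET: i64 = 2; // extra distance for the uniformity probe

/// First candidate whose colour is "uniform" with its outward neighbour.
fn pick_seed_outside(img: &Image, inner: Rect, uniformity_threshold_sq: u32) -> Option<(usize, usize)> {
    let (x0, y0) = (inner.x as i64, inner.y as i64);
    let (x1, y1) = (x0 + inner.width as i64 - 1, y0 + inner.height as i64 - 1);
    let (cx, cy) = ((x0 + x1) / 2, (y0 + y1) / 2);
    let (s, n) = (SEED_OFFSET, SEED_OFFSET + NEIGHBOUR_OFFSET);
    let candidates = [
        ((x1 + s, cy), (x1 + n, cy)), // right
        ((x0 - s, cy), (x0 - n, cy)), // left
        ((cx, y1 + s), (cx, y1 + n)), // below
        ((cx, y0 - s), (cx, y0 - n)), // above
    ];
    for (seed, neighbour) in candidates {
        if let (Some(a), Some(b)) = (img.get(seed), img.get(neighbour)) {
            // A large jump suggests a glyph fringe or border; try the next side.
            if dist_sq(a, b) < uniformity_threshold_sq {
                return Some((seed.0 as usize, seed.1 as usize));
            }
        }
    }
    None
}
```

A candidate that falls off the image, or whose outward neighbour differs too much, is skipped, which is how a thin border on one side pushes the seed to another side of the text.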
Tuning (SessionConfig::visual_region_tuning)
Every threshold the region pipeline uses is exposed on
VisualRegionTuning:
- tolerance: u8 (default 24). Distance threshold for “same region”, interpreted under color_distance. Glyph antialiasing pixels typically jump 60+ (RGB); subtle gradients within a button surface stay under 20. Lower the number when flood over-grows into adjacent widgets; raise it when flood under-grows because of gradients.
- color_distance: ColorDistance (default LabCie76). Which colour-distance metric to use. See perceptual colour distance.
- max_regions: usize (default 16). Safety cap on the iteration chain. Realistic widget tree depth is 3–5; the cap protects against pathological banded images.
- seed_uniformity_threshold_sq: u32 (default 100). Squared RGB distance below which the seed-pick treats a candidate seed and its 2-px-out neighbour as “uniform”. Raise on noisy backgrounds.
- shape_rectangle_min_ratio: f64 (default 0.97), shape_pill_min_ratio: f64 (default 0.82), shape_ellipse_ratio_range: (f64, f64) (default (0.65, 0.83)). Fill-ratio thresholds for shape classification.
MAX_PIXELS_PER_REGION is implicit and equal to the cropped image’s
total pixel count — the flood can’t escape it.
Tuning (SessionConfig::visual_text_tuning)
Knobs on VisualTextTuning:
- multiline_max_gap_factor: f32 (default 0.6). See block grouping.
- multiline_x_slack_px: i32 (default 4).
- background_color_tolerance: u8 (default 24). Threshold for the bg-colour change check.
- divider_detection_enabled: bool (default true).
- ocr_context_padding_px: i32 (default 32). Padding added on every side of a cropped element before running OCR; gives the recognition head visual context that disambiguates small/low-contrast glyphs.
- boundary_samples_per_axis: usize (default 16), boundary_majority_threshold: f32 (default 0.8). Divider-scan density and the majority threshold.
- background_sample_radius: u32 (default 2). Radius of the averaged window used when sampling the bg colour at each boundary check. 0 falls back to a single-pixel sample.
- color_distance: ColorDistance (default LabCie76).
- connectivity_check_enabled: bool (default false), max_connectivity_pixels: usize (default 4096). Opt-in bounded flood-fill check; see block grouping.
Tuning (SessionConfig::visual_click_tuning)
Knobs on VisualClickTuning
control the headless-mutter cold-start pointer workaround applied
by VisualLocator::click and RegionLocator::click:
- cold_start_warmup_enabled: bool (default true). Set to false on real hardware, where the cold-start race doesn’t apply, to fall through to a single motion + button-press.
- cold_start_warmup_offset_px: f64 (default 4.0). Distance of the warmup motion from the target.
- cold_start_motion_settle: Duration (default 60 ms). Sleep after each motion call.
- cold_start_press_settle: Duration (default 50 ms). Sleep between button-down and button-up.
Model file verification
The auto-downloaded ocrs .rten model files are checksummed
against constants embedded in crates/waydriver/src/visual/models.rs:
- Cached file at session start: hashed, refused on mismatch (deleted + re-downloaded).
- Fresh download: hashed before the *.partial → *.rten rename; a corrupted download never becomes a cache hit.
- Env-var overrides (WAYDRIVER_OCRS_DETECTION_MODEL, WAYDRIVER_OCRS_RECOGNITION_MODEL) bypass verification: the user has explicitly pointed us at a file they control.
If upstream ocrs publishes new model files, the embedded hashes will no
longer match and the cached files will be refused. Capture the new hashes with sha256sum
and update DETECTION_SHA256 / RECOGNITION_SHA256; or set the
env-var override at runtime as an escape hatch.
Locator::list_text and Locator::list_labelled_regions — enumeration
When you want to discover what’s on screen rather than search for a specific label, two enumeration methods produce a complete map of the text-bearing widgets inside a Locator’s scope:
let dialog = session.locate("//Dialog[@name='Preferences']");
// Every OCR'd line inside the dialog, line text + union bbox.
let hits = dialog.list_text().await?;
for h in &hits {
    println!("{:?} at {:?}", h.text, h.bounds);
}
// Each line paired with its enclosing visual region. One flood-fill
// per label; the screenshot is taken once and reused.
for (label, region) in dialog.list_labelled_regions().await? {
    println!("{} ({:?}) inside {:?} shape", label.text, label.bounds, region.shape());
}
list_text returns Vec<TextHit> where each TextHit has the
joined line text and the union bbox of all words in that line.
There’s no substring filter — for searches use
find_by_text. Cost is one OCR pass over the
locator’s bounds (~50–200 ms cropped, ~200–500 ms full-screen).
list_labelled_regions adds a flood-fill per hit on top, returning
Vec<(TextHit, RegionLocator)>. Use it for:
- Test discovery / scaffolding. Print the full set of clickable text-bearing things in a dialog and pick targets interactively.
- Visual regression. Compare label set + region shapes between runs.
- Dynamic selection. “Click the first row whose label starts with Show”: list_labelled_regions, then filter, then click.
The cost is list_text plus N × flood-fill (typically ~10–30 ms
each). A dialog with 15 labels takes ~150–500 ms total.
Session::region_at(x, y) — pixel-based entry point
The lowest level in the visual stack. Skips both OCR and the AT-SPI
parent lookup — just flood-fills from the supplied screen pixel and
returns the RegionLocator for whatever contiguous-colour shape
contains that pixel.
// I already know there's a clickable thing near here.
let region = session.region_at(512, 365).await?;
match region.shape() {
    Shape::Pill | Shape::Rectangle => region.click().await?,
    _ => return Err(anyhow!("expected a button-shaped widget at the cursor")),
}
Useful for:
- Coordinate-driven tests (you know the layout because you wrote the fixture).
- Visual debugging: “what’s at this pixel?” Dump region and read its bbox/shape/centroid.
- Bridge code that already has coordinates from another source (a previous screenshot, a layout assertion, a logged event).
The seed pixel doesn’t need to be at the centre of the region. Flood-fill is deterministic: any pixel inside the target region recovers the same bbox / centroid / shape. The only thing that varies with the seed is which region you get — a pixel on a text glyph returns the glyph’s bbox; a pixel on the button fill returns the button’s bbox.
The three Locator methods
All of them resolve self’s AT-SPI bounds, take a fresh screenshot,
and call into the region pipeline.
- Locator::find_regions(&self, inner: &VisualLocator): full sweep. Returns Vec<RegionLocator> in outermost-first order: index 0 is the outermost region inside self’s bounds; the last element is the tightest region around inner. The order matches the call-site mental model (start at the parent, walk inward).
- Locator::first_region(&self, inner): outermost only (find_regions[0]). Runs the full sweep but skips the intermediate Vec allocations.
- Locator::last_region(&self, inner): innermost only (find_regions[last]). One flood-fill, no chain walk. Cheap. This is usually what you want: the button pill adjacent to the text.
Plus the convenience on VisualLocator:
- VisualLocator::parent_region(): equivalent to parent.last_region(self), but doesn’t require the caller to remember the parent locator. Requires the VisualLocator to have a parent scope (constructed via Locator::find_by_text or Session::find_by_text(...).within(rect)).
RegionLocator action surface
Parallels VisualLocator’s shape, minus anything that would need
AT-SPI handles:
- bounds() -> Rect: axis-aligned bounding rect of the flood.
- centroid() -> (i32, i32): pixel-set centre, the click target.
- click(): pointer click at the centroid. Uses the same motion-warmup-then-press pattern as VisualLocator::click to side-step headless mutter’s cold-start pointer-routing race.
- hover(): pointer move only.
- screenshot(): PNG cropped to bounds().
There is deliberately no fill, set_text, focus, or any
is_<state> predicate. Those need AT-SPI handles; a region is just
a bbox + centroid.
How they compose
// AT-SPI sees the parent dialog but not the lazy button inside it.
let dialog = session.locate("//Dialog[@name='Preferences']");
// Find the on-screen text "lazy-button" inside that dialog.
let text = dialog.find_by_text("lazy-button").await?;
// Click the centroid of the pill surrounding the text. One flood-fill
// from a seed adjacent to the OCR bbox — fastest of the three region
// methods because it doesn't walk the enclosure chain.
dialog.last_region(&text).await?.click().await?;
Three orthogonal layers:
| Layer | Input | Output | Cost |
|---|---|---|---|
| AT-SPI Locator | XPath | accessible refs | ms |
| VisualLocator | text + optional parent scope | text bboxes | 50–500 ms (OCR) |
| RegionLocator | text bbox + parent screenshot | shape + centroid | ~10–30 ms (flood) |
Each layer is opt-in. You reach down only when the layer above doesn’t work for your widget.
Cost summary
| Operation | Typical latency |
|---|---|
| AT-SPI locator (session.locate) | <10 ms |
| Session start — model download (first run) | 5–20 s |
| Session start — model load (no prewarm) | 1–2 s on first OCR call |
| Session start — model load (prewarm) | parallel with session boot |
| Session::find_by_text (full screen) | 200–500 ms |
| Locator::find_by_text (cropped) | 50–200 ms |
| Locator::last_region | +10–30 ms over OCR |
| Locator::find_regions (full sweep) | +30–100 ms (depends on chain depth) |
These latencies assume an optimized build. rten inference dominates OCR
cost and is roughly 30× slower at the dev profile’s opt-level 0: measured
~5–8 s per full-frame pass with optimized dependencies vs ~50–200 s without,
on CPU-only hosts. Consumers running the visual feature under cargo test
must add a dependency-only override to the workspace root Cargo.toml
(Cargo ignores profile overrides declared anywhere else — a library can’t
ship this for you):
[profile.dev.package."*"]
opt-level = 3
(waydriver’s own workspace root already applies this to just the rten/ocrs
crates, so in-repo contributors and the e2e suite get optimized OCR in
dev/test builds without the broader "*" override. The init warning below
still fires for in-repo debug builds — an opt-level override does not clear
cfg(debug_assertions) — and is a known false-positive there.)
The engine loader logs a warning at init when it detects a debug build. Two
further cost levers already built in: a scoped Locator::find_by_text crops
the frame to the parent’s bounds before inference (fewer pixels, fewer text
lines — only the unscoped Session::find_by_text pays for the full frame),
and the per-frame OCR cache means repeated lookups on an unchanged screen
reuse a single pass.
When to use what
- Default path: Locator::click against an XPath. Use this unless the widget doesn’t surface in AT-SPI.
- Widget renders text and isn’t in AT-SPI: Locator::find_by_text on the nearest AT-SPI parent, then .click(). Works when the text glyphs are inside the gesture-controller’s hit-rect (most AdwButtonRows, GTK buttons with centred labels).
- Text-center click doesn’t fire activation: parent.last_region(&text).click(). Uses the centroid of the enclosing visual shape, which is more robust for widgets where the inner label widget eats the click.
- You want the surrounding card / panel, not the button: parent.first_region(&text).click() or walk find_regions and pick the layer you want.
- No AT-SPI parent at all: Session::find_by_text(text).click() works but pays full-screen OCR cost; prefer constraining via .within(rect) whenever you can derive a scope.
Failure modes (known)
- Sibling-coloured regions merge. If the button shares its fill colour with an adjacent widget, flood-fill spans both. Lower tolerance and re-test.
- Gradient fills stop the flood early. A button with a top-to-bottom gradient may have RGB deltas exceeding tolerance partway down. Raise tolerance (carefully: too high and the flood eats neighbouring regions).
- Thin antialiased borders ≤ 2 px can confuse pick_seed_outside if the 4-px offset lands inside the border. The seed picker validates uniformity against a neighbouring pixel and falls back to the next candidate, but pathological cases still exist. Construct the VisualLocator with a tighter .within(...) or supply an explicit Rect to side-step.
- OCR misreads on small / low-contrast text. ocrs’s recognition head is trained on document text; UI labels at 10–14 px in dark themes can read poorly. The 32 px context-padding ring helps (tunable via VisualTextTuning::ocr_context_padding_px); raising the fixture’s font size if you control it helps more.
- Pointer cold-start race. Headless mutter sometimes drops the first pointer event after a fresh session. VisualLocator::click and RegionLocator::click both warmup-motion-then-click to side-step it, but a test that triggers many rapid clicks can still hit the race on subsequent clicks. Add a 60 ms sleep between clicks if you see this, or tune VisualClickTuning (disable the warmup on real hardware, lengthen the settles on slow CI).
- Custom theme with shadow rasters between rows. The divider scan can mistake anti-aliased shadow gradients for a horizontal rule and refuse to merge wrapped paragraphs. Set VisualTextTuning::divider_detection_enabled = false to fall back to bg-colour-only boundary detection.
- Stale model cache from upstream rebuild. SHA-256 verification refuses to load model files that don’t match the embedded hashes. If ocrs publishes new models, either bump the constants in models.rs or set WAYDRIVER_OCRS_DETECTION_MODEL / WAYDRIVER_OCRS_RECOGNITION_MODEL to point at known-good files.
- Right-to-left scripts and non-LTR reading order. The block grouper and the per-line haystack are built on the assumption that words read left-to-right within a line and lines read top-to-bottom within a block. Hebrew, Arabic, or any RTL script will produce word bboxes in screen-left-to-right order but the joined haystack won’t reflect logical reading order, so substring matches against a logical-order needle may miss. Vertical scripts (Japanese/Chinese in tategaki) are not supported. If you’re driving an RTL app, prefer AT-SPI selectors; the visual locator’s matching semantics aren’t right for that case.
Implementation map
| What | Where |
|---|---|
| Session::find_by_text (root entry) | crates/waydriver/src/session.rs |
| Locator::find_by_text (scoped entry) | crates/waydriver/src/locator.rs |
| VisualLocator + OCR pipeline | crates/waydriver/src/visual/mod.rs |
| Model resolution + auto-download | crates/waydriver/src/visual/models.rs |
| Engine lifecycle (OnceCell shared cache) | crates/waydriver/src/visual/engine.rs |
| Flood-fill, seed picking, RegionLocator | crates/waydriver/src/visual/region.rs |
| Locator::find_regions/first_region/last_region | crates/waydriver/src/locator.rs |
| SessionConfig::visual_region_tuning | crates/waydriver/src/session.rs |
| Cargo feature visual | crates/waydriver/Cargo.toml |
| E2E test exercising both pipelines | crates/waydriver-e2e/tests/e2e.rs — lazy_a11y_*_clickable_via_visual_locator |