UX Engineering Guidelines
ReplayLab is a technical product, but the interface must not make users decode ReplayLab internals before they can make progress. These guidelines turn GOV.UK/GDS service-design principles into coding rules for ReplayLab engineers and coding agents.
Use this document before planning, while implementing, and before final validation for any user-facing UI, report, viewer, CLI output, or documentation workflow.
Source Principles
These guidelines are grounded in GOV.UK/GDS material:
- Government Design Principles: start with user needs, do less, make complex things simple, iterate, design for everyone, understand context, and build services rather than pages.
- Make the service simple to use: help users do the thing they need to do as simply as possible, so they succeed first time with minimal help.
- User needs: every visible piece of content should meet a valid need, and needs should be framed as a user, action, and reason.
- Writing for GOV.UK: use plain English, front-load useful words, avoid jargon, and explain technical terms only when they are necessary.
ReplayLab adapts these principles for developer tools: users may be technical, but they are still trying to finish a task. Expert users should not have to read provider JSON, hashes, storage paths, or internal IDs to understand what happened.
Non-Negotiable Standard
Every user-facing screen must answer the user's current task before it shows implementation evidence.
Use this order:
- What happened.
- What changed, failed, succeeded, or needs attention.
- What the user can do next.
- Supporting evidence.
- Debug evidence.
If raw implementation evidence appears before the product answer, the UI is not finished.
The Value Filter
Before adding any visible element, label, metric, table, chip, card, action, or paragraph, answer these questions:
- Does this help the user decide what happened?
- Does this help the user choose the next action?
- Does this help the user avoid a mistake?
- Does this increase trust in the result without forcing the user to decode internals?
- Would removing it make the task harder?
If the answer is no, remove it, hide it behind progressive disclosure, or move it to a debug section.
Signs Of Fluff
Treat these as suspicious in primary UI:
- A value is technically true but does not change the next action.
- A label mirrors a data model field instead of a user concept.
- A table exists because the backend has rows, not because the user needs rows.
- A paragraph explains around a poor layout instead of making the layout clear.
- A chip repeats context already obvious from the heading or hierarchy.
- A button exists because an endpoint exists, even when prerequisites are absent.
- A success, failure, or warning state appears without a next action.
Planning Checklist
Before implementing a user-facing change, write down the answers to these checks in the plan or working notes:
- User: who is using this surface?
- Task: what are they trying to do?
- Decision: what decision must the screen help them make?
- Minimum evidence: what is the smallest set of facts needed for that decision?
- Next action: what should they do after reading this?
- Debug boundary: what information is useful only for diagnosis?
- Acceptance screenshot: what must be obvious within five seconds?
Do not start from the database table, DTO, JSON path, provider protocol, or available endpoint. Start from the user task.
Product Language First
Primary UI must use product language. Implementation language is allowed only when the user explicitly enters a debug or advanced view.
| Implementation language | Product language |
|---|---|
| $.input[0].content | Changed user message |
| request_hash | Request changed |
| resource_name=openai.responses | OpenAI response call |
| cap_... | Captured run |
| replay_app_... | Regression replay |
| payload ref | Captured request or response body |
| raw JSON | Debug payload |
| boundary | Provider call, HTTP call, tool call, or captured step |
| diagnostic_guard | Diagnostic guard |
| clean_guard | Clean regression guard |
Use the implementation label only as secondary metadata:
- In collapsed debug sections.
- In copy-to-debug controls.
- In developer-only tables.
- In tooltips only when it clarifies a product label.
Do not make internal field names the heading of a card, section, or primary row.
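One way to enforce this rule in code is to centralize the implementation-to-product mapping so primary UI never renders raw field names directly. A minimal TypeScript sketch, assuming a web UI layer; the map and function names are illustrative, not an existing ReplayLab API:

```typescript
// Hypothetical helper: primary UI asks for a product label; raw names stay in debug views.
const PRODUCT_LABELS: Record<string, string> = {
  "request_hash": "Request changed",
  "resource_name=openai.responses": "OpenAI response call",
  "payload ref": "Captured request or response body",
  "diagnostic_guard": "Diagnostic guard",
  "clean_guard": "Clean regression guard",
};

export function productLabel(implementationName: string): string {
  const label = PRODUCT_LABELS[implementationName];
  if (!label) {
    // Unknown internals never become headings; fall back to a generic product term.
    return "Captured step";
  }
  return label;
}

// Usage: the heading shows the product term; the raw name moves to collapsed debug metadata.
// heading.textContent = productLabel("request_hash"); // "Request changed"
// debugRow.textContent = "request_hash";              // debug-only
```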
Comparisons Must Look Like Comparisons
If the user needs to compare two things, do not show a paragraph and a hash. Show the two things.
Required comparison layout:
- A human title: what changed.
- Two columns: expected/captured and replay/current/actual.
- Visible values with wrapping or expansion.
- Highlighted changed words, phrases, fields, or rows where possible.
- A short next-action sentence.
- Debug hashes and field paths below the comparison, not above it.
Replay Divergence Pattern
Bad:
$.input[0].content changed
Expected hash: f8b80...
Replay hash: f300...
The request hashes differ.
Good:
Changed user message sent to OpenAI
| Expected capture | Replay attempt |
|---|---|
| Checkout API latency is elevated... | Checkout/payment requests fail... |
What changed: the incident summary wording changed before provider call 3.
Next: fix the code/input drift or recapture the baseline if this change was intentional.
Debug details: $.input[0].content, request hashes...
If ReplayLab cannot compute a safe word-level or field-level diff, the UI should still show both values side by side and say that only a safe preview is available.
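A hedged sketch of a comparison view model that encodes this layout, in TypeScript; the type and field names are illustrative assumptions rather than an existing ReplayLab component:

```typescript
// Everything the divergence screen needs, in the order the user reads it.
interface ComparisonView {
  title: string;              // human title: what changed
  expected: string;           // captured value, shown in the left column
  actual: string;             // replay/current value, shown in the right column
  changedRanges?: Array<{ start: number; end: number }>; // highlighted spans, when a safe diff exists
  nextAction: string;         // one short sentence: what to do now
  debug: {                    // rendered below the comparison, collapsed by default
    fieldPath: string;
    expectedHash: string;
    actualHash: string;
  };
}

// When no safe word-level diff can be computed, omit changedRanges and still
// render both columns, with a note that only a safe preview is available.
```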
Progressive Disclosure
Primary UI should be compact and task-shaped. Debug evidence should be available without making it the main experience.
Use three layers:
- Primary answer: the outcome and next action.
- Supporting evidence: formatted input, output, changed fields, counts, and user-recognizable labels.
- Debug evidence: raw JSON, hashes, JSON paths, storage paths, provider protocol keys, internal IDs, request/response refs, and boundary rows.
Debug evidence must have a clear label such as "Debug details", "Raw provider payload", "Matcher rows", or "Internal artifact metadata".
Never require debug expansion to answer the main user question.
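The three layers can be made explicit in the view model so debug evidence cannot accidentally render first. A minimal TypeScript sketch; names are hypothetical:

```typescript
// Each screen section declares which disclosure layer it belongs to.
type DisclosureLayer = "primary" | "supporting" | "debug";

interface ScreenSection {
  layer: DisclosureLayer;
  heading: string;             // product language; "Debug details" etc. for debug sections
  collapsedByDefault: boolean; // must be true for every debug section
}

function validateSections(sections: ScreenSection[]): string[] {
  const problems: string[] = [];
  for (const s of sections) {
    if (s.layer === "debug" && !s.collapsedByDefault) {
      problems.push(`Debug section "${s.heading}" must be collapsed by default.`);
    }
  }
  if (sections.length > 0 && sections[0].layer !== "primary") {
    problems.push("The first visible section must be the primary answer.");
  }
  return problems;
}
```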
Failure Screen Contract
Every failure screen must answer these questions before raw tables:
- What failed?
- Where did it fail?
- What was expected?
- What happened instead?
- What changed, if known?
- What did ReplayLab verify before the failure?
- What evidence is missing or uncertain?
- What should the user do next?
Failure copy must be specific enough to act on. Avoid generic text such as "needs attention" without a concrete reason.
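One way to keep failure copy honest is to treat each question as a required field of the failure view model, so a screen cannot be assembled without answering them. A sketch in TypeScript, with hypothetical names:

```typescript
// A failure screen cannot render until every question has a concrete answer.
interface FailureAnswer {
  whatFailed: string;             // "Replay changed the user message sent to OpenAI"
  whereItFailed: string;          // "Provider call 3"
  expected: string;
  actualInstead: string;
  whatChanged?: string;           // omit only when genuinely unknown, and say so in the copy
  verifiedBeforeFailure: string;
  missingOrUncertainEvidence: string;
  nextAction: string;             // specific enough to act on; never "needs attention"
}
```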
Failure Screen Anti-Patterns
Do not ship failure screens where:
- The first useful information is below a table.
- The main evidence is a hash mismatch.
- The next action is hidden in a paragraph.
- Multiple warnings compete without priority.
- A successful partial result is buried under a global failure state.
- The UI says "review debug rows" before showing the product-level difference.
Trace And Evidence Surfaces
Trace screens must help users understand execution structure before protocol details.
Primary trace labels should describe:
- Project run.
- Workflow step.
- Agent.
- Graph node.
- LLM call.
- Execution tool call when ReplayLab captured an actual application tool boundary.
- Execution tool result when ReplayLab captured an actual application tool boundary.
- LLM requested tool when the row is provider-protocol evidence.
- Provider protocol tool result when the row is provider-protocol evidence.
- HTTP call.
- Final output.
Do not use raw provider event names, JSON paths, SDK object fields, or storage keys as primary trace labels.
Trace detail panes should separate:
- User and assistant content.
- Tool inputs and results.
- Provider metadata.
- ReplayLab metadata.
- Raw payloads.
Provider metadata and ReplayLab metadata are supporting evidence, not the main content.
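The distinction between execution tool evidence and provider-protocol evidence can be encoded in the trace row type so the UI cannot blur the two. A TypeScript sketch based on the list of primary labels above; the kind names are assumptions:

```typescript
// Primary trace labels as a closed set; raw provider event names never become labels.
type TraceRowKind =
  | "project-run"
  | "workflow-step"
  | "agent"
  | "graph-node"
  | "llm-call"
  | "execution-tool-call"        // ReplayLab captured a real application tool boundary
  | "execution-tool-result"
  | "llm-requested-tool"         // provider-protocol evidence only
  | "provider-protocol-tool-result"
  | "http-call"
  | "final-output";

interface TraceRow {
  kind: TraceRowKind;
  label: string; // product language, e.g. "LLM call", never a JSON path or storage key
}
```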
Upload Preview Surfaces
Upload preview must answer "what would leave my machine?" before it lists files.
Primary answer:
- Upload verdict.
- What would be uploaded.
- What stays local.
- Why anything is blocked or needs review.
- Whether upload has happened.
Secondary evidence:
- File roles.
- Relative paths.
- Sizes and hashes.
- Payload counts.
- Feature availability.
Debug-only evidence:
- Internal artifact IDs.
- Manifest details.
- Raw local-store paths.
- Queue internals.
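A hedged sketch of how the upload preview answer could be separated from its evidence layers, in TypeScript; the field names are illustrative, not an existing ReplayLab schema:

```typescript
// The verdict and its consequences come first; file rows are supporting evidence.
type UploadVerdict = "safe-to-upload" | "needs-review" | "blocked";

interface UploadPreview {
  verdict: UploadVerdict;
  wouldUpload: string[];     // product-level descriptions of what leaves the machine
  staysLocal: string[];
  blockReasons: string[];    // why anything is blocked or needs review
  alreadyUploaded: boolean;
  files: Array<{             // secondary evidence, rendered after the answer
    role: string;
    relativePath: string;
    sizeBytes: number;
  }>;
}
```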
Generated Guard Surfaces
Generated provider replay guard UI must make the user value clear before showing file details.
Primary answer:
- Provider replay guard: protects clean provider-boundary replay evidence.
- Diagnostic provider replay guard: preserves a known failure shape while it is being fixed.
- Safe workflow regression: unavailable until ReplayLab captures and controls execution tools and relevant I/O.
- CI state: generated, locally runnable, or CI wired.
- Next action: run locally, wire into CI, inspect failure, or recapture.
Secondary evidence:
- Generated path.
- Source replay.
- Guard mode.
- Last pytest result.
Debug-only evidence:
- Content hash.
- Internal action ID.
- Artifact metadata.
Do not present diagnostic guards as proof that behavior is correct.
Data Tables
Tables are useful when users need to scan, compare, sort, or inspect many similar items. Tables are poor primary UX for explaining one failure.
Before adding a table, ask:
- What comparison does the table enable?
- Which columns change the user's next action?
- Can low-value columns be hidden, collapsed, or moved to detail?
- Is there a summary card that should appear before the table?
- Does the table still fit in the viewport without horizontal hunting?
Table columns should be product concepts, not raw storage or protocol fields. If a raw column is needed, keep it in a debug table.
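Column choice can be made deliberate by declaring, per column, whether it changes the user's next action and whether it is debug-only. A minimal TypeScript sketch; names are hypothetical:

```typescript
// Every column must justify itself; debug-only columns never appear in the primary table.
interface ColumnSpec<Row> {
  heading: string;            // a product concept, not a storage or protocol field
  changesNextAction: boolean; // if false, consider hiding it or moving it to a detail view
  debugOnly: boolean;
  render: (row: Row) => string;
}

function primaryColumns<Row>(columns: ColumnSpec<Row>[]): ColumnSpec<Row>[] {
  return columns.filter((c) => !c.debugOnly && c.changesNextAction);
}
```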
Actions And Next Steps
Every primary screen should have one obvious next action.
Good actions:
- Run regression replay.
- Open changed provider call.
- Compare expected vs replay.
- Generate diagnostic provider replay guard.
- Preview upload safety.
- Open generated test.
- Fix run profile.
Bad actions:
- Copy a long command as the only primary path.
- Open a raw artifact path.
- Review debug evidence before the product answer is visible.
- Click an enabled action whose prerequisites are missing.
- Choose between several equal-looking buttons with no recommended path.
Disable unavailable actions with an exact reason. Do not show fake commands, placeholder paths, or copyable-looking examples as if they are recovered state.
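A sketch of an action descriptor that forces an exact reason onto every disabled control, in TypeScript; the names are assumptions:

```typescript
// An action is either available, or disabled with an exact reason the UI must show.
type ScreenAction =
  | { label: string; available: true; run: () => void }
  | { label: string; available: false; disabledReason: string };

const generateGuard: ScreenAction = {
  label: "Generate diagnostic provider replay guard",
  available: false,
  disabledReason: "No captured provider calls exist for this run yet.",
};

// Rendering rule: a disabled action shows disabledReason next to the control,
// and never renders a fake command or placeholder path in its place.
```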
Writing Rules
Use plain, task-oriented language.
- Front-load the useful words.
- Prefer active headings.
- Use short sentences.
- Use familiar product words consistently.
- Explain technical terms the first time they are necessary.
- Avoid filler: "simply", "just", "leverage", "utilize", "robust", "seamless", "comprehensive", "powerful".
- Avoid apologetic or vague UI text.
- Do not use FAQ-style dumps; put information where it is needed.
For technical users, plain language is still faster. Expert users should be able to scan and then choose to inspect details.
Visual Density And Minimalism
Minimalism does not mean empty screens. It means every visible element earns its space.
ReplayLab should feel dense, utilitarian, and scannable:
- Prefer compact summaries over large hero sections.
- Group by task, not by backend object type.
- Keep repeated chips quiet and secondary.
- Use whitespace to separate decisions, not to decorate.
- Avoid cards inside cards unless the nested card is an actual repeated item or modal.
- Avoid large blocks of explanatory prose when a better layout would make the explanation unnecessary.
If the screen feels visually busy, remove low-value evidence before adding new containers, colors, or explanations.
Accessibility And Inclusion
Accessible design is not optional.
- Use semantic HTML for headings, buttons, tables, lists, and forms.
- Make controls keyboard reachable.
- Give icon-only controls accessible labels.
- Do not rely on color alone for state.
- Keep contrast readable.
- Make text wrap instead of clipping.
- Keep responsive layouts usable at narrow widths.
- Test dense technical screens with zoom and reduced viewport space when the feature is layout-heavy.
Accessibility also means cognitive accessibility: users should not need to keep several hashes, IDs, or protocol rows in memory to understand the page.
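As a small illustration of the first points, a framework-agnostic TypeScript/DOM sketch; the control and label text are hypothetical:

```typescript
// Semantic, keyboard-reachable, labelled control: a real <button>, not a clickable <div>.
const copyDebug = document.createElement("button");
copyDebug.type = "button";                                  // native button is keyboard reachable by default
copyDebug.setAttribute("aria-label", "Copy debug details"); // icon-only control still has an accessible name
copyDebug.textContent = "⧉";

// State is conveyed by text as well as color.
const status = document.createElement("p");
status.textContent = "Replay diverged at provider call 3"; // readable without relying on red alone
```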
Screenshot Acceptance Test
For UI work, a screenshot is not decoration. It is acceptance evidence.
Before calling UI work complete, inspect the rendered screen and ask:
- Can a user say what happened in five seconds?
- Can a user identify the next action in five seconds?
- Is the most important comparison visible without opening debug evidence?
- Are internal field names hidden or clearly secondary?
- Are long values readable, wrapped, or expandable?
- Are warnings prioritized?
- Is there anything visible that does not help the current task?
- Would a screenshot alone convince a reviewer that the UX works?
If the answer is no, keep working or record the exact follow-up in PLAN.md.
ReplayLab Examples
Replay Divergence
User task:
- Decide whether a regression replay failed because code/input drifted or because the baseline should be recaptured.
Primary UI:
- "Replay changed the user message sent to OpenAI."
- Expected vs replay columns.
- Highlighted changed wording.
- "Fix drift" or "recapture baseline" next action.
Secondary UI:
- Provider call number.
- Matched calls before divergence.
- Later calls not replayed.
Debug UI:
- $.input[0].content.
- Request hashes.
- Matcher rows.
- Raw payload JSON.
Trace Payload Detail
User task:
- Understand what an LLM saw and produced.
Primary UI:
- System instructions.
- User message.
- Tool declarations.
- Assistant output.
- LLM requested tools and provider protocol tool results.
Secondary UI:
- Model.
- Usage.
- Duration.
- Payload status.
Debug UI:
- Raw provider JSON.
- Payload ref.
- Request hash.
Upload Preview
User task:
- Decide whether it is safe to upload evidence.
Primary UI:
- Verdict.
- What would leave the machine.
- What stays local.
- Blockers or review risks.
Secondary UI:
- File role.
- Size.
- Relative path.
- Hash.
Debug UI:
- Manifest internals.
- Queue IDs.
- Local-store implementation details.
Captured Run Browser
User task:
- Find the right captured run and act on it.
Primary UI:
- Project.
- Capture name.
- Time.
- Replayable evidence state.
- Latest replay or experiment state.
- Recommended action.
Secondary UI:
- Provider count.
- Payload refs.
- Providers.
Debug UI:
- Capsule ID.
- Raw artifact paths.
Implementation Checklist
When implementing user-facing UI:
- Start from the user task and next action.
- Name sections with product language.
- Move implementation labels to debug details.
- Use comparison components for comparisons.
- Keep raw tables secondary.
- Collapse advanced and debug evidence by default.
- Add disabled reasons for unavailable actions.
- Add tests for visible product labels and hidden debug labels (a sketch follows this list).
- Validate in the browser or rendered artifact.
- Compare the screenshot to the five-second checklist.
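A hedged sketch of such a test, assuming a Jest or Vitest runner with Testing Library DOM queries; the screen, labels, and setup are hypothetical:

```typescript
// Asserts that the product label renders and the implementation label does not,
// querying a divergence screen already rendered into document.body by test setup.
import { screen } from "@testing-library/dom";

test("divergence screen uses product language", () => {
  expect(screen.getByText("Changed user message sent to OpenAI")).toBeTruthy();
  expect(screen.queryByText("$.input[0].content")).toBeNull(); // debug-only, collapsed
  expect(screen.queryByText(/request_hash/)).toBeNull();
});
```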
Review Checklist
Use this before opening a PR, committing a UI change, or reporting completion:
- The primary screen answers the user task.
- The next action is obvious.
- Product labels come before implementation labels.
- Debug terms do not appear as primary headings.
- Comparisons are side by side where possible.
- Truncation does not hide the thing the user needs to compare.
- Tables are not the first explanation for a single problem.
- The failure state explains what happened and what to do next.
- Screenshot evidence supports the claim that the UI works.
- Any remaining UX weakness is fixed now or recorded in PLAN.md.
What Not To Do
Do not ship UI that:
- Uses a JSON path as the main heading.
- Uses a hash as the only visible difference.
- Explains a comparison without showing the compared values.
- Makes users open debug rows to understand the primary failure.
- Shows provider internals before the user-facing outcome.
- Uses raw JSON as the main trace experience.
- Presents every backend artifact as an equal product concept.
- Adds a card or chip because space is available, not because it helps.
- Leaves stale or confusing controls enabled.
- Reports completion without looking at the rendered result.