Tutorial: Generate A Pytest Regression
This tutorial focuses on the final step in the local loop: turning a replayable provider capsule into a pytest test. It works with any supported full-payload provider capsule.
Generated tests intentionally use replaylab replay.
That does not mean production capture must use replaylab run; the OpenAI and HTTP tutorials capture through startup SDK instrumentation and handle.capture(...), then use the replay command only for local regression execution.
Setup
Start with a provider capsule from either tutorial capture path: the OpenAI tutorial app or the HTTP tutorial app.
The capsule must contain succeeded provider boundaries with response payload refs. Metadata-only capsules can be inspected, but they cannot serve provider responses during replay. If you intentionally want a regression that preserves a recorded provider failure, use the failed-boundary workflow below.
Capture
Capture with full payloads through your normal app command. For the tutorial apps, that means:
uv run python tutorial_openai_app.py
# or
uv run python tutorial_http_app.py
If you are using the no-key dogfood path instead, replaylab run is still the right local wrapper; see Quickstart.
Capsule List
uv run replaylab capsule list --local-store-root .replaylab
Choose the provider capsule.
Same-process tutorial capsules include integrations such as openai, requests, auto_patch, and same_process.
Wrapper-driven dogfood capsules also include a wrapper capsule; in that case, choose the child provider capsule, not the wrapper capsule.
Inspect
uv run replaylab capsule inspect <child_capsule_id> --local-store-root .replaylab
Confirm:
Boundaries: 1
Payloads: 2
The exact counts can be larger for multi-call workflows.
Replay
--auto-patch-integrations auto is the low-boilerplate default. Use explicit labels when you want
the generated replay command to patch only a smaller provider set.
uv run replaylab replay <child_capsule_id> \
--local-store-root .replaylab \
--auto-patch-integrations auto \
--report-id replay_tutorial_regression \
-- python your_app.py
Compare
uv run replaylab report compare \
<child_capsule_id> \
.replaylab/replays/replay_tutorial_regression/report.json \
--local-store-root .replaylab
For a provider replay guard, generate the test only after comparison succeeds.
If comparison fails, use the local app's Generate diagnostic provider replay guard action to
preserve the observed failure shape while you debug it.
To compare two replay attempts, use report diff:
uv run replaylab report diff \
.replaylab/replays/replay_tutorial_regression_before/report.json \
.replaylab/replays/replay_tutorial_regression_after/report.json
This exits 0 when the candidate report is at least as clean as the baseline.
It exits 1 when the candidate adds blocked, mismatched, extra, missing, or payload-unavailable outcomes, or when the two reports are incompatible.
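The pass/fail rule above can be sketched as a pure function over the five problem-outcome counts. This is an illustrative re-implementation only, not ReplayLab's actual code; the field names are assumptions, and the incompatible-reports case is not modeled here.

```python
# Hypothetical sketch of the deterministic `report diff` exit-code rule.
# Real replaylab report schemas and field names may differ.

PROBLEM_KEYS = ("blocked", "mismatched", "extra", "missing", "payload_unavailable")

def diff_exit_code(baseline: dict, candidate: dict) -> int:
    """Return 0 when the candidate is at least as clean as the baseline,
    1 when the candidate adds any problem outcome."""
    for key in PROBLEM_KEYS:
        if candidate.get(key, 0) > baseline.get(key, 0):
            return 1
    return 0

before = {"blocked": 1, "mismatched": 0, "extra": 0, "missing": 0, "payload_unavailable": 0}
after_fixed = {"blocked": 0, "mismatched": 0, "extra": 0, "missing": 0, "payload_unavailable": 0}
after_worse = {"blocked": 1, "mismatched": 2, "extra": 0, "missing": 0, "payload_unavailable": 0}

print(diff_exit_code(before, after_fixed))  # cleaner candidate: exit 0
print(diff_exit_code(before, after_worse))  # added mismatches: exit 1
```

Note that a candidate with the *same* recorded problems as the baseline still passes; only newly added problem outcomes fail the gate.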
Optional BYOK AI assistance can explain the deterministic diff in plain language:
uv run replaylab report diff-explain \
.replaylab/replays/replay_tutorial_regression_before/report.json \
.replaylab/replays/replay_tutorial_regression_after/report.json \
--dry-run
Use the deterministic report diff exit code for pass/fail. Use diff-explain only as advisory
context when the changed outcomes need a faster human summary.
Open A Local Viewer
Before generating a regression, open a local viewer if you want a browser-readable view of the replay:
uv run replaylab report view report \
.replaylab/replays/replay_tutorial_regression/report.json \
--capsule <child_capsule_id> \
--local-store-root .replaylab \
--output replay-viewer.html
The React viewer highlights problem outcomes and includes a "What to do next" section, filters,
search, copyable command blocks, and the next inspection, compare, diff, and generate-test commands.
It shows payload availability booleans and request hashes, but never payload file contents.
Use replaylab report export-viewer report ... when you need to write the same file without opening
a browser.
The dependency-free static HTML fallback is:
uv run replaylab report export-html \
.replaylab/replays/replay_tutorial_regression/report.json \
--capsule <child_capsule_id> \
--local-store-root .replaylab \
--output replay-report.html
If you have a failed report before a fix and a candidate report after a fix, export a diff page too:
uv run replaylab report view diff \
.replaylab/replays/replay_tutorial_regression_before/report.json \
.replaylab/replays/replay_tutorial_regression_after/report.json \
--output replay-diff-viewer.html
The diff page groups improved outcomes, regressions, changed failures, and unchanged failures. Use it before updating a generated regression so you can see whether the candidate run truly got cleaner.
For non-opening React export, use replaylab report export-viewer diff. For a static fallback diff,
use replaylab report diff-html.
Generate-Test
uv run replaylab generate-test <child_capsule_id> \
--output tests/regression/test_replaylab_regression.py \
--fixture-root tests/fixtures/replaylab/capsules \
--app-root . \
--auto-patch-integrations auto \
-- python your_app.py
The generated test:
- copies the capsule fixture into tests/fixtures/replaylab/capsules
- runs your app command through replaylab replay
- asserts the replay report status
- asserts every expected boundary replayed
- fails if there are blocked, mismatched, extra, missing, or payload-unavailable results
- prints a secret-safe replay diagnostic summary when an assertion fails
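The assertion layer of the generated test has roughly this shape. This is an illustrative sketch only, not the literal file that replaylab generate-test emits, and the report field names here are assumptions.

```python
# Hypothetical sketch of the generated test's checks against the parsed
# replay report. The real generated pytest is produced by ReplayLab.

PROBLEM_KEYS = ("blocked", "mismatched", "extra", "missing", "payload_unavailable")

def assert_report_clean(report: dict) -> None:
    # replay must have succeeded overall
    assert report["status"] == "succeeded", report.get("status")
    # every expected boundary must actually have replayed
    assert report["replayed"] == report["expected"], (report["replayed"], report["expected"])
    # no problem outcomes of any kind
    for key in PROBLEM_KEYS:
        assert report.get(key, 0) == 0, f"{key}={report.get(key, 0)}"

assert_report_clean({
    "status": "succeeded",
    "expected": 1,
    "replayed": 1,
    "blocked": 0, "mismatched": 0, "extra": 0,
    "missing": 0, "payload_unavailable": 0,
})
print("clean report accepted")
```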
Run it:
uv run pytest tests/regression/test_replaylab_regression.py
What good looks like:
1 passed
Generate A Safe Workflow Regression
Safe workflow regression is stricter than a provider replay guard. Generate it from a regression replay report only after the safety preflight reports ready. The report must include:
- controlled provider replay
- provider-visible model tool evidence
- resolved implementation candidates
- explicit replaylab.control_tool(...) execution evidence
- an accepted saved HTTP effect policy
- HTTP enforcement evidence
- local-effect enforcement evidence
- accepted SQLite database policy plus database enforcement evidence when SQLite statements exist
- raw-socket network enforcement evidence
- queue/pubsub enforcement evidence
- unsupported HTTP client escape enforcement evidence
- completed local-container sandbox evidence
- no blocked or unsupported effects

Linked native/FFI or process-escape evidence keeps the report ineligible because those paths can bypass ReplayLab's current monkey-patched controls.
In the local app, open the eligible report and choose Generate safe workflow regression.
The generated pytest copies both fixtures:
- the source capsule under tests/fixtures/replaylab/capsules
- the reviewed project policy under tests/fixtures/replaylab/effect-policies
The CLI path is report-driven:
uv run replaylab generate-test <child_capsule_id> \
--source-report .replaylab/replays/replay_tutorial_regression/report.json \
--regression-mode safe_workflow_regression \
--output tests/regression/test_safe_workflow_replay.py \
--fixture-root tests/fixtures/replaylab/capsules \
--app-root . \
--auto-patch-integrations auto \
-- python your_app.py
The generated test installs the policy fixture into a temporary .replaylab store, runs replay with
HTTP effect policy mode enforce, local-effect control mode enforce, SQLite database-effect
control mode enforce, raw-socket network-effect control mode enforce, and queue/pubsub effect
control mode enforce. It also runs with unsupported HTTP client control mode enforce and
local-container sandbox mode enforce, then asserts readiness is still ready.
Sandboxed replay requires Docker and a suitable hardened image that already contains the app
dependency set and ReplayLab runtime because the V1 replay container runs with deny-all network,
read-only root filesystem, non-root user 65532:65532, split mounts, and --pull never.
Build and check the default local image before creating a safe workflow regression:
replaylab sandbox build-image --app-root .
replaylab sandbox doctor --app-root .
If setup is not ready, doctor reports sanitized checks for Docker CLI availability, Docker daemon
availability, local image presence, and the hardened no-network import smoke. Follow the reported
next action before rerunning sandboxed replay; common fixes are building the image, starting Docker,
fixing the recipe, or using a custom --sandbox-image.
For local package dependencies, add a bounded recipe and rebuild:
[tool.replaylab.sandbox]
image = "replaylab-sandbox-runtime:py3.13"
include_paths = ["packages/my_local_dependency"]
requirements_files = ["requirements.txt"]
or pass --recipe sandbox.toml to both build-image and doctor. Recipes can include
app-root-relative paths, requirement files, extra requirements, validated apt package names, and
known pip/uv index secrets through Docker BuildKit secret env names. They cannot include arbitrary
Dockerfile commands or expose secret values in ReplayLab output.
Set REPLAYLAB_SANDBOX_IMAGE when your generated regression should use a custom prebuilt project
image.
Safe workflow regression does not:
- mock unaccepted HTTP, database, raw-socket, queue, or unsupported HTTP client effects
- replay broker delivery
- execute workers
- replay unsupported HTTP client responses
- provide Daytona or managed hosted execution
- sandbox non-SQLite databases, unsupported queue/pubsub SDKs, native/FFI escapes, process-escape APIs, or other operating-system effects beyond local Docker containment
Add Generated Guards To CI
ReplayLab copies the generated pytest file under tests/regression and its source capsule fixture
under tests/fixtures/replaylab/capsules. The local app marks a generated guard as:
- CI ready: the generated test and fixture exist and a local pytest command can run it.
- CI wired: a GitHub Actions workflow appears to run the generated guard path, tests/regression, or the broader pytest suite.
- Needs attention: the generated test, fixture, guard mode, or last pytest result indicates the guard is not ready.
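The guard states above can be sketched as a small classifier. This is an illustrative simplification, not the local app's real logic; the boolean inputs are assumptions, and the guard-mode check is folded into a single "ready" flag here.

```python
# Hypothetical sketch of the guard-state rules. The local app's actual
# inputs and naming may differ.

def guard_state(test_exists: bool, fixture_exists: bool,
                local_pytest_ok: bool, workflow_runs_guard: bool) -> str:
    # anything missing or failing means the guard is not ready
    if not (test_exists and fixture_exists and local_pytest_ok):
        return "Needs attention"
    # a workflow that runs the guard path upgrades ready to wired
    if workflow_runs_guard:
        return "CI wired"
    return "CI ready"

print(guard_state(True, True, True, False))  # CI ready
print(guard_state(True, True, True, True))   # CI wired
print(guard_state(True, False, True, True))  # Needs attention
```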
Use the generated-regression history in replaylab app to copy the local pytest command and a
GitHub Actions snippet. A minimal workflow is:
name: ReplayLab generated guards
on:
  pull_request:
  push:
    branches: [main]
jobs:
  replaylab-generated-guards:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - name: Install dependencies
        run: uv sync --all-groups
      - name: Run ReplayLab generated guards
        run: uv run pytest tests/regression
Generate A Diagnostic Guard From A Problem Replay
The local app can generate report-driven diagnostic provider replay guards from failed or diverged
regression replays. Open replaylab app, select the captured run or problem replay, and choose
Generate diagnostic provider replay guard.
A diagnostic provider replay guard is different from a provider replay guard:
- provider replay guard: a passing pytest that protects a clean provider-boundary replay
- diagnostic provider replay guard: a passing pytest that intentionally preserves the same known failure shape
Diagnostic guards assert the replay report status, expected/attempted/replayed counts, blocked, mismatched, extra, missing, and payload-unavailable counts, and the per-problem outcome metadata and secret-safe hashes/messages where present. They are useful when you want the issue to remain reproducible while you fix it, but they are not proof that the behavior is correct.
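The key difference from a provider replay guard is that a diagnostic guard pins the recorded failure shape exactly rather than requiring a clean report. A minimal sketch, with assumed field names that may not match ReplayLab's real report schema:

```python
# Hypothetical sketch: a diagnostic guard asserts the report matches the
# *known* failure shape, problem counts included, rather than zero problems.

EXPECTED_SHAPE = {
    "status": "failed",
    "expected": 2, "attempted": 2, "replayed": 1,
    "blocked": 0, "mismatched": 1, "extra": 0,
    "missing": 0, "payload_unavailable": 0,
}

def assert_failure_shape_preserved(report: dict) -> None:
    for key, value in EXPECTED_SHAPE.items():
        assert report.get(key) == value, f"{key}: {report.get(key)!r} != {value!r}"

assert_failure_shape_preserved(dict(EXPECTED_SHAPE))
print("known failure shape preserved")
```

A passing run here means the issue is still reproducible in the same way, not that the behavior is correct.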
Unsupported diagnostic generation fails clearly when the source report is missing, the captured run cannot be resolved, there are no deterministic boundary results, or the provider shape is not yet supported.
Generate A Failed-Boundary Regression From A Capsule
Sometimes the behavior you need to preserve is a provider failure. For example, you may want to prove that a timeout, validation error, or provider exception still follows the same app path while you work on a fix.
Capture the failure with full payloads, then generate a failed regression explicitly:
uv run replaylab generate-test <failed_child_capsule_id> \
--output tests/regression/test_replaylab_provider_failure.py \
--fixture-root tests/fixtures/replaylab/capsules \
--app-root . \
--auto-patch-integrations auto \
--expected-replay-status failed \
-- python your_app.py
The generated failed-boundary test asserts:
- the replay command exits non-zero
- the replay report status is failed
- the expected boundary count is unchanged
- no blocked, mismatched, extra, missing, or payload-unavailable outcomes were introduced
- the report contains the recorded provider-failure message fragment
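The failed-boundary checks above can be sketched as a single validation step over the replay exit code and the parsed report. This is an illustrative sketch only; the report field names (including failure_message) are assumptions, not ReplayLab's documented schema.

```python
# Hypothetical sketch of the failed-boundary regression's assertions.

PROBLEM_KEYS = ("blocked", "mismatched", "extra", "missing", "payload_unavailable")

def assert_failed_boundary_report(report: dict, returncode: int,
                                  expected_boundaries: int, fragment: str) -> None:
    assert returncode != 0                          # replay command exits non-zero
    assert report["status"] == "failed"             # report records the failure
    assert report["expected"] == expected_boundaries  # boundary count unchanged
    for key in PROBLEM_KEYS:
        assert report.get(key, 0) == 0              # no new problem outcomes
    # recorded provider-failure message fragment must still be present
    assert fragment in report.get("failure_message", "")

assert_failed_boundary_report(
    {"status": "failed", "expected": 1,
     "failure_message": "provider timeout after 30s"},
    returncode=1, expected_boundaries=1, fragment="timeout",
)
print("failed-boundary shape verified")
```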
Without --expected-replay-status failed, ReplayLab rejects failed-boundary capsules by default. That prevents accidental generation of tests that lock in failures when you meant to assert a successful replay.
Current failed-boundary generation is intentionally narrow: the failed provider boundary must be the final captured provider call, and the capsule must include full request and error payload refs.