Tutorial: Generate A Pytest Regression

This tutorial focuses on the final step in the local loop: turning a replayable provider capsule into a pytest test. It works with any supported full-payload provider capsule.

Generated tests intentionally use replaylab replay. That does not mean production capture must use replaylab run; the OpenAI and HTTP tutorials capture through startup SDK instrumentation and handle.capture(...), then use the replay command only for local regression execution.
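
For orientation, the sketch below shows one way such startup instrumentation could be wired. Only the handle.capture(...) name comes from this page; the module import, the instrument() entry point, and the context-manager usage are illustrative assumptions, not ReplayLab's confirmed API.

import replaylab  # assumed module name, not confirmed by this page

# Assumed entry point: install provider instrumentation (openai, requests, ...)
# at process startup so provider calls are recorded with full payloads.
handle = replaylab.instrument()

def main() -> None:
    ...  # normal app logic; instrumented provider calls happen here

if __name__ == "__main__":
    # handle.capture(...) is named on this page; treating it as a context
    # manager that writes a capsule is an assumption.
    with handle.capture():
        main()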

Setup

Start with a provider capsule from either:

  • the OpenAI tutorial (tutorial_openai_app.py)
  • the HTTP tutorial (tutorial_http_app.py)

The capsule must contain succeeded provider boundaries with response payload refs. Metadata-only capsules can be inspected, but they cannot serve provider responses during replay. If you intentionally want a regression that preserves a recorded provider failure, use the failed-boundary workflow below.

Capture

Capture with full payloads through your normal app command. For the tutorial apps, that means:

uv run python tutorial_openai_app.py
# or
uv run python tutorial_http_app.py

If you are using the no-key dogfood path instead, replaylab run is still the right local wrapper; see Quickstart.

Capsule List

uv run replaylab capsule list --local-store-root .replaylab

Choose the provider capsule. Same-process tutorial capsules include integrations such as openai, requests, auto_patch, and same_process. Wrapper-driven dogfood runs also produce a wrapper capsule; in that case, choose the child provider capsule, not the wrapper capsule.

Inspect

uv run replaylab capsule inspect <child_capsule_id> --local-store-root .replaylab

Confirm:

Boundaries: 1
Payloads: 2

The exact counts can be larger for multi-call workflows.

Replay

--auto-patch-integrations auto is the low-boilerplate default. Use explicit labels when you want the generated replay command to patch only a smaller provider set.

uv run replaylab replay <child_capsule_id> \
  --local-store-root .replaylab \
  --auto-patch-integrations auto \
  --report-id replay_tutorial_regression \
  -- python your_app.py

Compare

uv run replaylab report compare \
  <child_capsule_id> \
  .replaylab/replays/replay_tutorial_regression/report.json \
  --local-store-root .replaylab

For a provider replay guard, generate the test only after comparison succeeds. If comparison fails, use the local app's Generate diagnostic provider replay guard action to preserve the observed failure shape while you debug it.

To compare two replay attempts, use report diff:

uv run replaylab report diff \
  .replaylab/replays/replay_tutorial_regression_before/report.json \
  .replaylab/replays/replay_tutorial_regression_after/report.json

This exits 0 when the candidate report is at least as clean as the baseline. It exits 1 when the candidate adds blocked, mismatched, extra, missing, or payload-unavailable outcomes, or when the two reports are incompatible.
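
To gate a scripted step on that contract from Python rather than the shell, here is a minimal sketch that relies only on the exit-code behavior described above:

import subprocess
import sys

# Run the deterministic diff; only the documented exit-code contract is
# assumed (0 = candidate at least as clean, 1 = regressions or incompatible).
result = subprocess.run([
    "uv", "run", "replaylab", "report", "diff",
    ".replaylab/replays/replay_tutorial_regression_before/report.json",
    ".replaylab/replays/replay_tutorial_regression_after/report.json",
])
sys.exit(result.returncode)  # propagate the deterministic pass/fail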

Optional BYOK AI assistance can explain the deterministic diff in plain language:

uv run replaylab report diff-explain \
  .replaylab/replays/replay_tutorial_regression_before/report.json \
  .replaylab/replays/replay_tutorial_regression_after/report.json \
  --dry-run

Use the deterministic report diff exit code for pass/fail. Use diff-explain only as advisory context when the changed outcomes need a faster human summary.

Open A Local Viewer

Before generating a regression, open a local viewer if you want a browser-readable view of the replay:

uv run replaylab report view report \
  .replaylab/replays/replay_tutorial_regression/report.json \
  --capsule <child_capsule_id> \
  --local-store-root .replaylab \
  --output replay-viewer.html

The React viewer highlights problem outcomes and includes a "What to do next" section, filters, search, copyable command blocks, and the next inspection, compare, diff, and generate-test commands. It shows payload availability booleans and request hashes, but never payload file contents. Use replaylab report export-viewer report ... when you need to write the same file without opening a browser.

The dependency-free static HTML fallback is:

uv run replaylab report export-html \
  .replaylab/replays/replay_tutorial_regression/report.json \
  --capsule <child_capsule_id> \
  --local-store-root .replaylab \
  --output replay-report.html

If you have a failed report before a fix and a candidate report after a fix, export a diff page too:

uv run replaylab report view diff \
  .replaylab/replays/replay_tutorial_regression_before/report.json \
  .replaylab/replays/replay_tutorial_regression_after/report.json \
  --output replay-diff-viewer.html

The diff page groups improved outcomes, regressions, changed failures, and unchanged failures. Use it before updating a generated regression so you can see whether the candidate run truly got cleaner.

To export the React diff viewer without opening a browser, use replaylab report export-viewer diff. For a static fallback diff page, use replaylab report diff-html.

Generate-Test

uv run replaylab generate-test <child_capsule_id> \
  --output tests/regression/test_replaylab_regression.py \
  --fixture-root tests/fixtures/replaylab/capsules \
  --app-root . \
  --auto-patch-integrations auto \
  -- python your_app.py

The generated test does the following (a sketch of its shape follows the list):

  • copies the capsule fixture into tests/fixtures/replaylab/capsules
  • runs your app command through replaylab replay
  • asserts the replay report status
  • asserts every expected boundary replayed
  • fails if there are blocked, mismatched, extra, missing, or payload-unavailable results
  • prints a secret-safe replay diagnostic summary when an assertion fails
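
Here is a hedged sketch of the shape the generated file can take. The JSON field names and status value are illustrative assumptions; the real generated test asserts against ReplayLab's actual report schema.

import json
import subprocess
from pathlib import Path

# Report path follows the documented .replaylab/replays/<report-id>/report.json
# pattern; the assertions below are an illustrative guess, not generated code.
REPORT = Path(".replaylab/replays/generated_regression/report.json")

def test_provider_replay_guard() -> None:
    # Run the app command through replaylab replay, as the generated test does.
    result = subprocess.run(
        ["uv", "run", "replaylab", "replay", "<child_capsule_id>",
         "--local-store-root", ".replaylab",
         "--auto-patch-integrations", "auto",
         "--report-id", "generated_regression",
         "--", "python", "your_app.py"],
        capture_output=True, text=True,
    )
    assert result.returncode == 0, result.stdout  # secret-safe summary on failure
    report = json.loads(REPORT.read_text())
    assert report["status"] == "succeeded"  # field name and value assumed
    # No problem outcomes allowed (keys assumed for illustration).
    for key in ("blocked", "mismatched", "extra", "missing", "payload_unavailable"):
        assert report.get(key, 0) == 0, key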

Run it:

uv run pytest tests/regression/test_replaylab_regression.py

What good looks like:

1 passed

Generate A Safe Workflow Regression

Safe workflow regression is stricter than a provider replay guard. Generate it from a regression replay report only after the safety preflight reports ready. The report must include:

  • controlled provider replay
  • provider-visible model tool evidence
  • resolved implementation candidates
  • explicit replaylab.control_tool(...) execution evidence
  • an accepted saved HTTP effect policy
  • HTTP enforcement evidence
  • local-effect enforcement evidence
  • accepted SQLite database policy plus database enforcement evidence when SQLite statements exist
  • raw-socket network enforcement evidence
  • queue/pubsub enforcement evidence
  • unsupported HTTP client escape enforcement evidence
  • completed local-container sandbox evidence
  • no blocked or unsupported effects

Linked native/FFI or process-escape evidence keeps the report ineligible because those paths can bypass ReplayLab's current monkey-patched controls.

In the local app, open the eligible report and choose Generate safe workflow regression. The generated pytest copies both fixtures:

  • the source capsule under tests/fixtures/replaylab/capsules
  • the reviewed project policy under tests/fixtures/replaylab/effect-policies

The CLI path is report-driven:

uv run replaylab generate-test <child_capsule_id> \
  --source-report .replaylab/replays/replay_tutorial_regression/report.json \
  --regression-mode safe_workflow_regression \
  --output tests/regression/test_safe_workflow_replay.py \
  --fixture-root tests/fixtures/replaylab/capsules \
  --app-root . \
  --auto-patch-integrations auto \
  -- python your_app.py

The generated test installs the policy fixture into a temporary .replaylab store, then runs replay with every control in enforce mode:

  • HTTP effect policy
  • local-effect control
  • SQLite database-effect control
  • raw-socket network-effect control
  • queue/pubsub effect control
  • unsupported HTTP client control
  • local-container sandbox

It then asserts readiness is still ready.

Sandboxed replay requires Docker and a suitable hardened image that already contains the app dependency set and the ReplayLab runtime, because the V1 replay container runs with deny-all network, a read-only root filesystem, non-root user 65532:65532, split mounts, and --pull never. Build and check the default local image before creating a safe workflow regression:

replaylab sandbox build-image --app-root .
replaylab sandbox doctor --app-root .

If setup is not ready, doctor reports sanitized checks for Docker CLI availability, Docker daemon availability, local image presence, and the hardened no-network import smoke. Follow the reported next action before rerunning sandboxed replay; common fixes are building the image, starting Docker, fixing the recipe, or using a custom --sandbox-image.

For local package dependencies, add a bounded recipe and rebuild:

[tool.replaylab.sandbox]
image = "replaylab-sandbox-runtime:py3.13"
include_paths = ["packages/my_local_dependency"]
requirements_files = ["requirements.txt"]

Alternatively, pass --recipe sandbox.toml to both build-image and doctor. Recipes can include app-root-relative paths, requirement files, extra requirements, validated apt package names, and known pip/uv index secrets referenced through Docker BuildKit secret env names. They cannot include arbitrary Dockerfile commands or expose secret values in ReplayLab output.

Set REPLAYLAB_SANDBOX_IMAGE when your generated regression should use a custom prebuilt project image.

Safe workflow regression does not:

  • mock unaccepted HTTP, database, raw-socket, queue, or unsupported HTTP client effects
  • replay broker delivery
  • execute workers
  • replay unsupported HTTP client responses
  • provide Daytona or managed hosted execution
  • sandbox non-SQLite databases, unsupported queue/pubsub SDKs, native/FFI escapes, process-escape APIs, or other operating-system effects beyond local Docker containment

Add Generated Guards To CI

ReplayLab copies the generated pytest file under tests/regression and its source capsule fixture under tests/fixtures/replaylab/capsules. The local app marks a generated guard as:

  • CI ready when the generated test and fixture exist and a local pytest command can run it.
  • CI wired when a GitHub Actions workflow appears to run the generated guard path, tests/regression, or the broader pytest suite.
  • Needs attention when the generated test, fixture, guard mode, or last pytest result indicates the guard is not ready.

Use the generated-regression history in replaylab app to copy the local pytest command and a GitHub Actions snippet. A minimal workflow is:

name: ReplayLab generated guards

on:
  pull_request:
  push:
    branches: [main]

jobs:
  replaylab-generated-guards:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - name: Install dependencies
        run: uv sync --all-groups
      - name: Run ReplayLab generated guards
        run: uv run pytest tests/regression

Generate A Diagnostic Guard From A Problem Replay

The local app can generate report-driven diagnostic provider replay guards from failed or diverged regression replays. Open replaylab app, select the captured run or problem replay, and choose Generate diagnostic provider replay guard.

A diagnostic provider replay guard is different from a provider replay guard:

  • provider replay guard: a passing pytest that protects a clean provider-boundary replay
  • diagnostic provider replay guard: a passing pytest that intentionally preserves the same known failure shape

Diagnostic guards assert the replay report status; the expected, attempted, and replayed counts; the blocked, mismatched, extra, missing, and payload-unavailable counts; and the per-problem outcome metadata plus secret-safe hashes and messages where present. They are useful when you want the issue to remain reproducible while you fix it, but they are not proof that the behavior is correct.

Diagnostic generation fails with a clear error when the source report is missing, the captured run cannot be resolved, there are no deterministic boundary results, or the provider shape is not yet supported.

Generate A Failed-Boundary Regression From A Capsule

Sometimes the behavior you need to preserve is a provider failure. For example, you may want to prove that a timeout, validation error, or provider exception still follows the same app path while you work on a fix.

Capture the failure with full payloads, then generate a failed regression explicitly:

uv run replaylab generate-test <failed_child_capsule_id> \
  --output tests/regression/test_replaylab_provider_failure.py \
  --fixture-root tests/fixtures/replaylab/capsules \
  --app-root . \
  --auto-patch-integrations auto \
  --expected-replay-status failed \
  -- python your_app.py

The generated failed-boundary test asserts the following (a sketch follows the list):

  • the replay command exits non-zero
  • the replay report status is failed
  • the expected boundary count is unchanged
  • no blocked, mismatched, extra, missing, or payload-unavailable outcomes were introduced
  • the report contains the recorded provider-failure message fragment
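
A hedged sketch of those assertions follows; the report field names and the failure-message fragment are illustrative stand-ins, not the generated code.

import json
import subprocess
from pathlib import Path

# Assumed report location, following the documented <report-id> pattern.
REPORT = Path(".replaylab/replays/failed_boundary_regression/report.json")

def test_provider_failure_is_preserved() -> None:
    result = subprocess.run(
        ["uv", "run", "replaylab", "replay", "<failed_child_capsule_id>",
         "--local-store-root", ".replaylab",
         "--report-id", "failed_boundary_regression",
         "--", "python", "your_app.py"],
    )
    assert result.returncode != 0  # the replay command exits non-zero
    report_text = REPORT.read_text()
    report = json.loads(report_text)
    assert report["status"] == "failed"  # field name assumed
    # "timed out" stands in for your capsule's recorded failure fragment.
    assert "timed out" in report_text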

Without --expected-replay-status failed, ReplayLab rejects failed-boundary capsules by default. That prevents accidentally generating tests that lock in failures when you meant to assert a successful replay.

Current failed-boundary generation is intentionally narrow: the failed provider boundary must be the final captured provider call, and the capsule must include full request and error payload refs.