
Scenario-Driven Validation

ReplayLab scenarios are runnable user workflows. They prove that the product loop works as a developer would use it: capture a run, inspect the artifact, replay without the live dependency, compare the report, generate a pytest provider replay guard, and run the generated test.

Scenarios are not a simulator or a new user-facing test DSL. They are maintainer checks that keep the docs, examples, and implementation honest.

List scenarios

Run:

python scripts/run_scenario.py list

You should see:

ReplayLab scenarios
no-key-dogfood: deterministic boundaries=2 providers=openai, requests
new-user-pypi-local: loopback boundaries=1 providers=requests
support-bot-demo-local: loopback boundaries=2 providers=openai, requests
http-loopback-pypi: loopback boundaries=3 providers=requests, httpx
http-edge-cases-local: loopback boundaries=5 providers=requests, httpx
auto-instrumentation-local: loopback boundaries=1 providers=requests
asgi-fastapi-local: loopback boundaries=1 providers=requests
asgi-lifecycle-local: loopback boundaries=1 providers=requests
job-worker-local: loopback boundaries=1 providers=requests
job-lifecycle-local: loopback boundaries=2 providers=requests
pydantic-ai-local: loopback boundaries=1 providers=openai
langgraph-local: loopback boundaries=2 providers=requests, openai
langgraph-external-local: loopback boundaries=2 providers=requests, openai
langchain-openai-local: loopback boundaries=1 providers=openai
anthropic-local: loopback boundaries=1 providers=anthropic
gemini-local: loopback boundaries=1 providers=gemini
openai-streaming-local: loopback boundaries=1 providers=openai
anthropic-streaming-local: loopback boundaries=1 providers=anthropic
gemini-streaming-local: loopback boundaries=1 providers=gemini
ai-diagnosis-loopback: loopback boundaries=1 providers=requests
html-report-local: deterministic boundaries=2 providers=openai, requests
html-report-failure-local: deterministic boundaries=2 providers=openai, requests
react-viewer-local: deterministic boundaries=2 providers=openai, requests
viewer-first-local: deterministic boundaries=2 providers=openai, requests
guided-local-workflow: deterministic boundaries=2 providers=openai, requests
local-app-shell: deterministic boundaries=2 providers=openai, requests
local-app-workflow: deterministic boundaries=2 providers=openai, requests
failure-story-local: deterministic boundaries=2 providers=openai, requests
tool-resolution-local: deterministic boundaries=3 providers=openai, execution_tool, requests
unsupported-effect-local: deterministic boundaries=7 providers=openai, execution_tool, requests, database_effects, sqlite3
raw-socket-effect-local: deterministic boundaries=4 providers=openai, execution_tool, network_effects, requests
queue-effect-local: deterministic boundaries=5 providers=openai, execution_tool, queue_effects, requests
unsupported-http-client-local: deterministic boundaries=5 providers=openai, execution_tool, requests, unsupported_http_clients
escape-scope-local: deterministic boundaries=3 providers=openai, execution_tool, requests
sandbox-adversarial-local: loopback boundaries=2 providers=openai, requests
sandbox-escape-local: deterministic boundaries=2 providers=openai, requests
sandbox-hardening-local: loopback boundaries=2 providers=openai, requests
sandbox-runtime-image-local: loopback boundaries=2 providers=openai, requests
pydantic-ai-external-paid: paid boundaries=1 providers=openai
openai-real-pypi: paid boundaries=1 providers=openai

This means the current validation matrix covers:

  • the no-key local MVP path;
  • a fresh PyPI first project;
  • a customer-style support-bot demo;
  • the real HTTP client matrix over loopback;
  • the local-source HTTP edge-case path;
  • ASGI/FastAPI middleware capture and the simplified auto-instrumentation path;
  • richer lifecycle checks for ASGI requests and worker jobs;
  • AI-assisted diagnosis through a loopback fake endpoint;
  • clean and failed static HTML report export, plus report-diff HTML export;
  • clean/failed/diff React viewer export and the viewer-first local workflow;
  • PydanticAI, LangGraph (local and external realism), and LangChain ChatOpenAI provider-level framework compatibility;
  • native Anthropic Messages and Gemini Google Gen AI provider paths;
  • event-preserving OpenAI/Anthropic/Gemini provider streams;
  • queue/pubsub enqueue/publish escape control and unsupported HTTP client escape control;
  • native/FFI and process-escape scope guards;
  • local Docker sandbox containment evidence;
  • the local app shell and workflow API;
  • paid OpenAI-backed realism paths.

Scenario tiers

Tier          | What it can use                                     | When to run it
deterministic | Fake providers, temp files, no secrets, no network  | Normal CI and every user-facing workflow change
loopback      | Real installed clients and local sockets            | Release checks and provider/client compatibility checks
paid          | Real external provider calls and local secrets      | Explicit maintainer validation before public-alpha confidence claims

Paid scenarios require --paid and must stay opt-in. They cap provider calls, redact secret-like output, use temporary workspaces, and prove replay/generated tests work without a live provider.

Run the no-key scenario

Run:

python scripts/run_scenario.py run no-key-dogfood

ReplayLab copies the deterministic dogfood app into a temporary workspace, captures one OpenAI-like boundary and one HTTP boundary, replaces the fake providers with failing providers, replays the same app command, compares the report, generates a pytest provider replay guard, and runs that generated test. The scenario also asserts that capsule inspect, replay, and report inspect print actionable Next steps hints for the next local command without exposing payload bodies or original app argv.

Expected ending:

ReplayLab scenario passed.
Scenario: no-key-dogfood
Tier: deterministic
Boundaries: 2
Payloads: 4
Providers: openai, requests

This proves the basic local loop works without an API key.

Run the fresh PyPI first-project scenario

Run:

python scripts/run_scenario.py run new-user-pypi-local --keep-workspace --package-version 0.1.0a4

ReplayLab creates a clean temporary virtual environment, installs replaylab==0.1.0a4, requests, and pytest from PyPI, writes a minimal app that calls replaylab.init(...) at startup and wraps its provider call in handle.capture(...), starts a loopback HTTP service for capture, then stops that service before replay.
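
The generated app's real source lives in the scenario script; a minimal sketch of the shape it writes might look like this, assuming replaylab.init(...) returns a capture handle and handle.capture(...) works as a context manager (the capture label and loopback URL are illustrative):

import requests
import replaylab

# Startup: initialize ReplayLab once for the app. The real scenario passes
# arguments here; they are elided because they are scenario-specific.
handle = replaylab.init()

def main() -> None:
    # Wrap the provider-backed work in a capture scope (label is illustrative).
    with handle.capture("fetch-ticket"):
        response = requests.get("http://127.0.0.1:8765/tickets/123", timeout=5)
        print(response.status_code)

if __name__ == "__main__":
    main()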

Expected ending:

ReplayLab scenario passed.
Scenario: new-user-pypi-local
Tier: loopback
Boundaries: 1
Providers: requests

This is the maintainer check for the first 15 minutes of a new user experience: install from PyPI, capture a normal provider call, inspect and replay it, compare the report, export the local viewer, generate a pytest provider replay guard, and run that guard without a live provider.

Run the customer support-bot demo scenario

Run:

python scripts/run_scenario.py run support-bot-demo-local --keep-workspace --package-version 0.1.0a4

ReplayLab creates a clean temporary virtual environment, installs replaylab==0.1.0a4, openai, requests, and pytest from PyPI, writes a tutorial-style support bot, starts fake OpenAI Responses and support-ticket loopback services for capture, then stops both services before replay.

Expected ending:

ReplayLab scenario passed.
Scenario: support-bot-demo-local
Tier: loopback
Boundaries: 2
Providers: openai, requests

This is the customer-demo check. It proves a realistic support-bot flow can capture OpenAI and HTTP boundaries, replay with providers stopped, export clean/failed/diff React viewers, compare reports, generate a pytest provider replay guard, and run that guard without provider credentials.

For future release-candidate validation before publishing, install ReplayLab from the local candidate wheelhouse while still resolving third-party dependencies from PyPI:

python scripts/release_rehearsal.py --channel github --version <next-version>
python scripts/run_scenario.py run support-bot-demo-local \
  --keep-workspace \
  --package-version <next-version> \
  --package-wheelhouse dist/public-alpha/v<next-version>
python scripts/run_scenario.py run pydantic-ai-local \
  --keep-workspace \
  --package-version <next-version> \
  --package-wheelhouse dist/public-alpha/v<next-version>
python scripts/run_scenario.py run local-app-workflow \
  --keep-workspace \
  --package-version <next-version> \
  --package-wheelhouse dist/public-alpha/v<next-version>

For maintainer demo rehearsal, use the kept workspace as an artifact store for the current checkout:

cd <kept-workspace>
uv run --project /path/to/ReplayLab replaylab app --local-store-root .replaylab

That verifies the current local app can start from the latest failed regression replay, the latest regression replay, or the latest captured run, recover the recorded command, default provider patching to the providers recorded in the artifact, and run the workflow without manually reconstructing capsule or report paths.

Run the real HTTP loopback scenario

Run:

python scripts/run_scenario.py run http-loopback-pypi --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs replaylab==0.1.0a4, requests, httpx, and pytest from PyPI, starts a loopback HTTP server for capture, records one requests call, one sync httpx call, and one async httpx call, then stops the server before replay.

Expected ending:

ReplayLab scenario passed.
Scenario: http-loopback-pypi
Tier: loopback
Boundaries: 3
Providers: requests, httpx

This proves replay is serving stored HTTP responses, because the server is no longer running.

Run the local HTTP edge-case scenario

Run:

python scripts/run_scenario.py run http-edge-cases-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus real requests, httpx, and pytest, starts a loopback HTTP server for capture, records query-pair params, safe header names, JSON request bodies, text request bodies, text content, bytes content, JSON responses, text responses, and bytes responses, then stops the server before replay.

Expected ending:

ReplayLab scenario passed.
Scenario: http-edge-cases-local
Tier: loopback
Boundaries: 5
Providers: requests, httpx

This proves the current checkout can replay realistic HTTP request identity and body shapes without calling the live loopback server.

Run the local auto-instrumentation scenario

Run:

python scripts/run_scenario.py run auto-instrumentation-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus FastAPI, Uvicorn, requests, and pytest, starts a deterministic loopback provider for capture, initializes the app with auto_patch_integrations="auto", instruments the app with replaylab.instrument_app(app, handle=handle), then stops the provider before replay.
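
As a rough sketch of that startup path (assuming auto_patch_integrations is an init(...) keyword as quoted above, and that instrument_app takes the ASGI app plus a handle; the route and loopback URL are illustrative):

from fastapi import FastAPI
import requests
import replaylab

# Low-boilerplate startup: ask ReplayLab to patch supported integrations itself.
handle = replaylab.init(auto_patch_integrations="auto")

app = FastAPI()

# One-line instrumentation instead of per-endpoint capture scopes.
replaylab.instrument_app(app, handle=handle)

@app.get("/tickets/{ticket_id}")
def get_ticket(ticket_id: str) -> dict:
    # Provider call made inside the instrumented request (URL is illustrative).
    response = requests.get(f"http://127.0.0.1:8765/tickets/{ticket_id}", timeout=5)
    return {"ticket": response.json()}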

Expected ending:

ReplayLab scenario passed.
Scenario: auto-instrumentation-local
Tier: loopback
Boundaries: 1
Providers: requests

This proves the low-boilerplate startup path can capture and replay a FastAPI request without manual endpoint capture scopes or direct middleware registration.

Run the local ASGI/FastAPI scenario

Run:

python scripts/run_scenario.py run asgi-fastapi-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus FastAPI, Uvicorn, requests, and pytest, starts a deterministic loopback provider for capture, serves one FastAPI request through ReplayLabASGIMiddleware, then stops the provider before replay.
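
A minimal sketch of that middleware registration (the import path for ReplayLabASGIMiddleware and its keyword arguments are assumptions; the route and loopback URL are illustrative):

from fastapi import FastAPI
import requests
import replaylab
from replaylab import ReplayLabASGIMiddleware  # import path is an assumption

handle = replaylab.init()
app = FastAPI()

# Request-scoped capture without endpoint-level handle.capture(...) boilerplate.
app.add_middleware(ReplayLabASGIMiddleware, handle=handle)

@app.get("/tickets/{ticket_id}")
def get_ticket(ticket_id: str) -> dict:
    response = requests.get(f"http://127.0.0.1:8765/tickets/{ticket_id}", timeout=5)
    return {"ticket": response.json()}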

Expected ending:

ReplayLab scenario passed.
Scenario: asgi-fastapi-local
Tier: loopback
Boundaries: 1
Providers: requests

This proves request-scoped ASGI middleware capture works without endpoint-level handle.capture(...) boilerplate, and that generated pytest can replay the same app command without the provider.

Run the local ASGI lifecycle scenario

Run:

python scripts/run_scenario.py run asgi-lifecycle-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus FastAPI, Uvicorn, requests, and pytest, starts a deterministic loopback provider for capture, then sends three requests through a FastAPI app:

  • GET /health, ignored by ReplayLab middleware.
  • GET /ready, provider-free and therefore not written by default.
  • GET /tickets/123, provider-backed and carrying request ID, authorization, and cookie headers.

Expected ending:

ReplayLab scenario passed.
Scenario: asgi-lifecycle-local
Tier: loopback
Boundaries: 1
Providers: requests

This proves request-scoped middleware stays low-intrusion in a more realistic app shape: only the provider-backed route writes a capsule, the capsule records safe ASGI facts such as method, path, route path, endpoint, status code, and configured request ID, and authorization/cookie values stay out of ReplayLab artifacts.

Run the local worker/job scenario

Run:

python scripts/run_scenario.py run job-worker-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus requests and pytest, starts a deterministic loopback provider for capture, runs one decorated worker job, then stops the provider before replay.

Expected ending:

ReplayLab scenario passed.
Scenario: job-worker-local
Tier: loopback
Boundaries: 1
Providers: requests

This proves job-scoped capture works without adding handle.capture(...) inside the job body, and that generated pytest can replay the same worker command without the provider.

Run the local job lifecycle scenario

Run:

python scripts/run_scenario.py run job-lifecycle-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus requests and pytest, starts a deterministic loopback provider for capture, then invokes:

  • one provider-free decorated job, which writes no capsule by default;
  • one sync provider job with ticket_id supplied positionally;
  • one async provider job with ticket_id supplied as a keyword.

Expected ending:

ReplayLab scenario passed.
Scenario: job-lifecycle-local
Tier: loopback
Boundaries: 2
Providers: requests

This proves capture_job(...) writes one capsule per provider-backed job invocation, extracts session IDs for positional and keyword calls, records safe job metadata, and keeps job arguments, keyword arguments, and return values out of adapter-owned metadata.
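
A rough sketch of a decorated provider-backed job, assuming capture_job(...) is available on the handle and accepts a hint for which argument carries the session ID (both the decorator location and the session_id_from parameter name are assumptions; the URL is illustrative):

import requests
import replaylab

handle = replaylab.init()

# Job-scoped capture: no handle.capture(...) inside the job body.
@handle.capture_job("resolve-ticket", session_id_from="ticket_id")
def resolve_ticket(ticket_id: str) -> dict:
    response = requests.get(f"http://127.0.0.1:8765/tickets/{ticket_id}", timeout=5)
    return {"ticket": response.json()}

if __name__ == "__main__":
    # Session IDs are extracted whether ticket_id is positional or a keyword.
    resolve_ticket("123")
    resolve_ticket(ticket_id="456")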

Run the PydanticAI compatibility scenario

Run:

python scripts/run_scenario.py run pydantic-ai-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus pydantic-ai-slim[openai], openai, and pytest, starts a deterministic OpenAI Responses-compatible loopback provider for capture, then invokes one PydanticAI Agent using OpenAIResponsesModel and OpenAIProvider(openai_client=...) inside handle.capture(...). The provider endpoint is stopped before replay, report comparison, React viewer export, generated pytest creation, and generated pytest execution.
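
A sketch of the captured PydanticAI call (the pydantic-ai import paths follow the current pydantic-ai-slim layout and may differ by version; the loopback URL, model name, and capture label are illustrative):

from openai import AsyncOpenAI
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.providers.openai import OpenAIProvider
import replaylab

handle = replaylab.init()

# Point the OpenAI client at the deterministic loopback Responses provider.
client = AsyncOpenAI(base_url="http://127.0.0.1:8765/v1", api_key="test-key")
model = OpenAIResponsesModel("gpt-4o-mini", provider=OpenAIProvider(openai_client=client))
agent = Agent(model)

with handle.capture("pydantic-ai-run"):
    result = agent.run_sync("Summarize ticket 123")
    print(result.output)  # .output on recent pydantic-ai releases; older releases use .data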

Expected ending:

ReplayLab scenario passed.
Scenario: pydantic-ai-local
Tier: loopback
Boundaries: 1
Providers: openai

This proves supported OpenAI Responses calls inside PydanticAI can be captured and replayed through generic provider instrumentation. It is a validated compatibility scenario, not a native PydanticAI trace adapter.

Run the LangGraph compatibility scenario

Run:

python scripts/run_scenario.py run langgraph-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus langgraph, requests, openai, and pytest, starts deterministic loopback providers for HTTP and OpenAI Responses capture, then runs one StateGraph through graph.invoke(...). One node uses normal requests.get(...); another node uses normal OpenAI().responses.create(...). Both providers are stopped before replay, report comparison, React viewer export, generated pytest creation, and generated pytest execution.
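
A sketch of the graph shape under test (state keys, node names, model name, and loopback URLs are illustrative; the ReplayLab init/capture wiring matches the earlier scenarios and is omitted here):

from typing import TypedDict

import requests
from langgraph.graph import END, START, StateGraph
from openai import OpenAI

class State(TypedDict, total=False):
    ticket: dict
    summary: str

def fetch_ticket(state: State) -> State:
    # Plain requests call inside a node; captured by generic HTTP instrumentation.
    response = requests.get("http://127.0.0.1:8765/tickets/123", timeout=5)
    return {"ticket": response.json()}

def summarize(state: State) -> State:
    # Plain OpenAI Responses call inside another node (loopback base_url).
    client = OpenAI(base_url="http://127.0.0.1:8900/v1", api_key="test-key")
    result = client.responses.create(model="gpt-4o-mini", input=str(state["ticket"]))
    return {"summary": result.output_text}

builder = StateGraph(State)
builder.add_node("fetch_ticket", fetch_ticket)
builder.add_node("summarize", summarize)
builder.add_edge(START, "fetch_ticket")
builder.add_edge("fetch_ticket", "summarize")
builder.add_edge("summarize", END)
graph = builder.compile()
print(graph.invoke({}))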

Expected ending:

ReplayLab scenario passed.
Scenario: langgraph-local
Tier: loopback
Boundaries: 2
Providers: requests, openai

This proves supported provider calls inside LangGraph nodes can be captured and replayed without a LangGraph-specific ReplayLab adapter. The scenario also validates the local app trace shape: project run, workflow step, inferred Graph node rows, API call, LLM call, and formatted payload detail. It does not validate LangChain ChatOpenAI, Chat Completions, streaming, or framework-native graph-edge replay semantics.

Run the external LangGraph realism scenario

Run:

python scripts/run_scenario.py run langgraph-external-local --keep-workspace

This scenario bootstraps or reuses the external langgraph-example project under /Users/damienbenveniste/Projects/agentic-management/replaylab-example-projects, copies it into an isolated workspace, runs loopback provider capture/replay/compare/generate-test/pytest, and then verifies the app-owned loop evidence from replaylab app.

Expected ending:

ReplayLab scenario passed.
Scenario: langgraph-external-local
Tier: loopback
Boundaries: 2
Providers: requests, openai

This proves the LangGraph path is realistic beyond generated in-repo scenario source and stays discoverable through captured-run product surfaces. The external path is forced into provider-backed loopback mode so it cannot silently pass through a provider-free fallback. Provider-free LangGraph runs are allowed as metadata captures, but they are not replayable evidence and the app disables replay/experiment/generated-guard actions for them.

Run the LangChain ChatOpenAI compatibility scenario

Run:

python scripts/run_scenario.py run langchain-openai-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus langchain-openai, openai, and pytest, starts a deterministic OpenAI Responses-compatible loopback provider for capture, then runs one ChatOpenAI(..., use_responses_api=True) call inside handle.capture(...). The provider endpoint is stopped before replay, report comparison, React viewer export, generated pytest creation, and generated pytest execution. The scenario also validates that replaylab app keeps this run discoverable as a captured run with an available replay action, attached replay evidence, attached provider replay guard evidence, and reachable upload preview.
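
A sketch of that single call (use_responses_api is the flag quoted above; the loopback URL, model name, and capture label are illustrative):

from langchain_openai import ChatOpenAI
import replaylab

handle = replaylab.init()

llm = ChatOpenAI(
    model="gpt-4o-mini",
    use_responses_api=True,                # route through the OpenAI Responses API
    base_url="http://127.0.0.1:8765/v1",   # deterministic loopback provider
    api_key="test-key",
)

with handle.capture("langchain-openai-run"):
    message = llm.invoke("Summarize ticket 123")
    print(message.content)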

Expected ending:

ReplayLab scenario passed.
Scenario: langchain-openai-local
Tier: loopback
Boundaries: 1
Providers: openai

This proves the LangChain ChatOpenAI Responses path works with ReplayLab's provider-level capture and replay loop. It does not validate LangChain paths that require Chat Completions semantics, streaming, or Anthropic adapters.

Run the Anthropic Messages compatibility scenario

Run:

python scripts/run_scenario.py run anthropic-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus the real anthropic SDK and pytest, starts a deterministic local Anthropic Messages-compatible loopback server, captures one Anthropic().messages.create(...) call, then stops the server before replay, report comparison, React viewer export, generated pytest creation, and generated pytest execution. The scenario also validates that replaylab app shows a captured run with an Anthropic LLM call, formatted request/response previews, attached regression replay evidence, attached generated guard evidence, and reachable upload preview.
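
A sketch of the captured Anthropic call (the loopback base URL, model name, and capture label are illustrative):

from anthropic import Anthropic
import replaylab

handle = replaylab.init()

# Point the SDK at the deterministic loopback Messages server.
client = Anthropic(base_url="http://127.0.0.1:8765", api_key="test-key")

with handle.capture("anthropic-run"):
    message = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=256,
        messages=[{"role": "user", "content": "Summarize ticket 123"}],
    )
    print(message.content[0].text)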

Expected ending:

ReplayLab scenario passed.
Scenario: anthropic-local
Tier: loopback
Boundaries: 1
Providers: anthropic

This proves native Anthropic Messages calls can be captured and regression-replayed without an Anthropic API key. It does not validate batches, files, Bedrock/Vertex clients, OpenAI-compatible Anthropic routing, or Anthropic framework adapters.

Run the Gemini Generate Content compatibility scenario

Run:

python scripts/run_scenario.py run gemini-local --keep-workspace

ReplayLab creates a clean temporary virtual environment, installs the current checkout plus the real google-genai SDK and pytest, starts a deterministic local Gemini generateContent loopback server, captures one Client().models.generate_content(...) call, then stops the server before replay, report comparison, React viewer export, generated pytest creation, and generated pytest execution. The scenario also validates that replaylab app shows a captured run with a Gemini LLM call, formatted request/response previews, attached regression replay evidence, attached generated guard evidence, and reachable upload preview.
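
A sketch of the captured Gemini call (routing the client at the loopback server through HttpOptions is an assumption about how the scenario configures it; the model name and capture label are illustrative):

from google import genai
from google.genai import types
import replaylab

handle = replaylab.init()

# Point the Google Gen AI client at the deterministic loopback server.
client = genai.Client(
    api_key="test-key",
    http_options=types.HttpOptions(base_url="http://127.0.0.1:8765"),
)

with handle.capture("gemini-run"):
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="Summarize ticket 123",
    )
    print(response.text)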

Expected ending:

ReplayLab scenario passed.
Scenario: gemini-local
Tier: loopback
Boundaries: 1
Providers: gemini

This proves native Google Gen AI Gemini generate_content calls can be captured and regression-replayed without a Gemini API key. It does not validate multimodal file/image/video flows, the Live API, or Vertex-specific product breadth.

Run the core provider streaming scenarios

Run:

python scripts/run_scenario.py run openai-streaming-local --keep-workspace
python scripts/run_scenario.py run anthropic-streaming-local --keep-workspace
python scripts/run_scenario.py run gemini-streaming-local --keep-workspace

Each scenario creates a clean temporary virtual environment, installs the current checkout plus the real provider SDK and pytest, starts a deterministic local streaming-compatible loopback server, captures one fully consumed provider stream, then stops the server before regression replay, comparison, React viewer export, generated pytest creation, and generated pytest execution. Each scenario also validates that replaylab app keeps the streaming captured run discoverable with a provider LLM call, streaming metadata, formatted final output, attached replay evidence, attached generated guard evidence, and reachable upload preview.

Expected endings:

ReplayLab scenario passed.
Scenario: openai-streaming-local
Tier: loopback
Boundaries: 1
Providers: openai

ReplayLab scenario passed.
Scenario: anthropic-streaming-local
Tier: loopback
Boundaries: 1
Providers: anthropic

ReplayLab scenario passed.
Scenario: gemini-streaming-local
Tier: loopback
Boundaries: 1
Providers: gemini

This proves event-preserving streaming support for OpenAI responses.create(..., stream=True), Anthropic messages.stream(...) / messages.create(..., stream=True), and Gemini generate_content_stream(...) without provider API keys. A stream becomes replayable evidence only after it is fully consumed; incomplete streams are recorded as failed/incomplete evidence.
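
A sketch of what "fully consumed" means for each provider stream, using the call shapes named above (loopback wiring, model names, and prompts are illustrative):

from anthropic import Anthropic
from google import genai
from openai import OpenAI

openai_client = OpenAI(base_url="http://127.0.0.1:8765/v1", api_key="test-key")
anthropic_client = Anthropic(base_url="http://127.0.0.1:8766", api_key="test-key")
gemini_client = genai.Client(api_key="test-key")  # loopback http_options omitted here

# OpenAI: iterate every event of the Responses stream.
for event in openai_client.responses.create(model="gpt-4o-mini", input="hi", stream=True):
    pass

# Anthropic: the stream helper is a context manager; drain its text iterator.
with anthropic_client.messages.stream(
    model="claude-3-5-haiku-latest",
    max_tokens=64,
    messages=[{"role": "user", "content": "hi"}],
) as stream:
    for _ in stream.text_stream:
        pass

# Gemini: generate_content_stream yields chunks until the stream ends.
for chunk in gemini_client.models.generate_content_stream(
    model="gemini-2.0-flash", contents="hi"
):
    pass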

Run the AI diagnosis loopback scenario

Run:

python scripts/run_scenario.py run ai-diagnosis-loopback --keep-workspace

ReplayLab creates a deterministic capsule and replay report, starts a local OpenAI Responses-compatible fake endpoint, runs replaylab report explain, runs replaylab report diff-explain, runs replaylab ai plan-instrumentation, verifies the outputs are secret-safe, and confirms no external AI provider was called.

Expected ending:

ReplayLab scenario passed.
Scenario: ai-diagnosis-loopback
Tier: loopback
Boundaries: 1
Providers: requests

This proves optional report explanation, report-diff explanation, and instrumentation planning can be validated without a paid provider call.

Run the static HTML report scenario

Run:

python scripts/run_scenario.py run html-report-local --keep-workspace

ReplayLab runs the deterministic no-key dogfood workflow, exports the replay report as a single local HTML file, and verifies the file contains the expected report sections without payload bodies or secret-looking strings.

Expected ending:

ReplayLab scenario passed.
Scenario: html-report-local
Tier: deterministic
Boundaries: 2
Providers: openai, requests

This proves users can get a browser-readable local replay summary without a hosted service.

Run the static HTML failure scenario

Run:

python scripts/run_scenario.py run html-report-failure-local --keep-workspace

ReplayLab runs the deterministic no-key dogfood workflow, intentionally replays a mismatched app command, exports a failed replay report as HTML, exports a baseline-vs-candidate diff as HTML, and verifies both files include problem diagnostics without payload bodies or secret-looking strings.

Expected ending:

ReplayLab scenario passed.
Scenario: html-report-failure-local
Tier: deterministic
Boundaries: 2
Providers: openai, requests

This proves failed replays can still be understood through the dependency-free static fallback.

Run the local React viewer scenario

Run:

python scripts/run_scenario.py run react-viewer-local --keep-workspace

ReplayLab runs the deterministic no-key dogfood workflow, exports a clean React viewer, intentionally replays a mismatched app command, exports a failed React viewer, exports a report-diff React viewer, and verifies the generated files include diagnostics without payload bodies or secret-looking strings. The scenario checks the viewer's diagnosis section, quick filters, grouped next commands, and diff summary labels so the maintainer can catch confusing UX while using the same files a user would open.

Expected ending:

ReplayLab scenario passed.
Scenario: react-viewer-local
Tier: deterministic
Boundaries: 2
Providers: openai, requests

This is the main local viewer UX scenario. It exists so the maintainer can open the same files a user would open and catch navigation, density, wording, and failure-diagnosis problems early.

Run the viewer-first local workflow scenario

Run:

python scripts/run_scenario.py run viewer-first-local --keep-workspace

ReplayLab runs the deterministic no-key dogfood workflow, uses replaylab report view report to write clean and failed viewer files without opening a browser, then uses replaylab report view diff to write the before/after diff viewer. The scenario verifies the viewer-first files include the diagnosis, "What to do next" section, copyable command labels, failure/diff diagnostics, and no payload bodies or secret-looking strings.

Expected ending:

ReplayLab scenario passed.
Scenario: viewer-first-local
Tier: deterministic
Boundaries: 2
Providers: openai, requests

This is the product-flow scenario for local viewing. It exists to catch whether the workflow still feels too command-heavy after the CLI has launched the browser-readable viewer.

Run the guided local workflow scenario

Run:

python scripts/run_scenario.py run guided-local-workflow --keep-workspace

ReplayLab runs the deterministic no-key dogfood workflow, then uses replaylab workflow local to replay the app, compare the report, write the local React viewer, generate a pytest provider replay guard, and run that generated test from one command. It then intentionally replays a mismatched app through the same guided command and verifies the failed viewer is still written.

Expected ending:

ReplayLab scenario passed.
Scenario: guided-local-workflow
Tier: deterministic
Boundaries: 2
Providers: openai, requests

This is the scenario for checking whether the post-capsule workflow is now less command-heavy for a developer. Lower-level replay, report, viewer, and generate-test commands remain available when a script needs individual steps.

Run the local app shell scenario

Run:

python scripts/run_scenario.py run local-app-shell --keep-workspace

ReplayLab runs the deterministic no-key dogfood workflow, starts replaylab app --no-open-browser against the generated .replaylab store, verifies the served page is the product workspace instead of an exported ReplayLab Viewer, then queries product API routes such as /api/workspace, /api/projects, /api/projects/{project_id}/artifacts, /api/captured-runs/{captured_run_id}, /api/reports/{replay_id}, and /api/capsules/{capsule_id}. Compatibility aliases such as /api/reports and /api/capsules remain available during migration.

The scenario verifies the local adapter discovers the provider capsule, replay report, product captured-run view, comparison, run profile state, latest captured-run/regression-replay shortcuts, and diagnosis data without returning payload body or secret-looking markers. It also verifies the app exposes start-point data for the latest failed regression replay, latest regression replay, and latest captured run when those artifacts exist.
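
A sketch of the kind of product API check the scenario performs, assuming the app is serving on a local port (the port is illustrative, and no response shape is assumed beyond JSON):

import requests

BASE = "http://127.0.0.1:8321"  # illustrative port for a running `replaylab app`

# Route paths come from the scenario description above.
for path in ("/api/workspace", "/api/projects"):
    response = requests.get(f"{BASE}{path}", timeout=5)
    response.raise_for_status()
    print(path, type(response.json()).__name__)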

Expected ending:

ReplayLab scenario passed.
Scenario: local-app-shell
Tier: deterministic
Boundaries: 2
Providers: openai, requests

This is the first scenario for the interactive local product surface. It exists to catch whether the app makes artifact discovery easier than manually assembling capsule paths, report paths, and viewer commands. External-example checks additionally verify that report artifact directory IDs, such as a user-chosen --report-id, can open the same detail view as the internal replay ID. The scenario also validates that captured-run detail includes the split-pane trace explorer, agent/LLM trace labels, redacted Input/Output payload previews, protocol assistant/tool events, a run-list side inspector on the captured-runs route, and collapsed raw replay/debug data without leaking secret-looking markers.

Run the local app workflow scenario

Run:

python scripts/run_scenario.py run local-app-workflow --keep-workspace

ReplayLab runs the deterministic no-key dogfood workflow, starts replaylab app --no-open-browser against the generated .replaylab store, extracts the local app action token from the served page, then calls the product action API through POST /api/actions/replay. The legacy workflow endpoints remain available for older callers, but the product app path no longer posts command_text or asks the frontend to choose provider integrations. The scenario verifies a clean app-driven replay, comparison, viewer export, pytest generation, and generated pytest run. It then rewrites the app into an intentional mismatch and verifies the app API still writes a failed report and failed viewer with secret-safe structured step results.

The scenario now also checks that run profiles are recovered from artifacts, start points carry stable product roles and avoid duplicate choices, provider labels come from the selected artifact, fresh captures can replay through /api/actions/replay, and older missing-profile captures show app-managed setup candidates instead of copy-pastable placeholder commands. It also verifies the original captured run exposes latest-first regression replay history with timestamps, durations, problem counts, and report IDs that can be opened through the report detail API.

When the workflow creates a live experiment, the scenario verifies the experiment trace is attached under the original captured run through product artifact views and is not exposed as a new primary captured-run baseline. It also validates that the captured-run product detail keeps the app-owned loop discoverable after refresh: regression replays, live experiments, generated provider replay guards and diagnostic provider replay guards, export evidence, and upload preview entrypoints remain attached to the baseline captured run.

Expected ending:

ReplayLab scenario passed.
Scenario: local-app-workflow
Tier: deterministic
Boundaries: 2
Providers: openai, requests

This is the scenario for checking whether the local app is now a useful product workflow surface, not just an artifact browser. It should catch regressions where the app makes users fall back to hand-built CLI chains.

Run the tool-resolution local scenario

Run:

python scripts/run_scenario.py run tool-resolution-local --keep-workspace

ReplayLab captures and replays a deterministic OpenAI Responses artifact with a provider-visible model tool and local Python source. The local lookup_customer callable is explicitly wrapped with replaylab.control_tool(...).

The scenario starts replaylab app and verifies the replay safety preflight exposes advisory implementation candidates, explicit execution-tool wrapper evidence, observed HTTP stack attribution, an advisory tool effect map, read-only effect policy proposal items, saved project effect policy review state after reload, opt-in HTTP effect policy enforcement evidence, opt-in local-effect control evidence, read-only safe workflow readiness, and safe workflow regression availability. It also verifies unsupported-effect scope detection is clear for the happy path.

It first proves observe mode still works, then proves HTTP enforce blocks without a saved policy, then saves an accepted project rule and proves HTTP plus local-effect enforce allows the observed lookup_customer -> GET api.example.test HTTP effect.

The final enforced branch runs with local-container sandbox mode. When the configured Docker image is available and contains the needed runtime dependencies, the readiness gate reaches ready and the scenario generates and runs a safe workflow regression pytest. When Docker or the image is unavailable, the same scenario verifies typed sandbox evidence blocks safe workflow generation instead of silently overclaiming readiness. It also runs a negative branch that attempts a file write under local-effect enforce and verifies the file is not created.
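
A rough sketch of the explicit wrapper mentioned above, assuming replaylab.control_tool(...) takes the callable and returns a wrapped callable (any additional parameters are assumptions; the host matches the policy rule described above):

import requests
import replaylab

def lookup_customer(customer_id: str) -> dict:
    # The tool's only supported effect: one HTTP GET against api.example.test.
    response = requests.get(f"https://api.example.test/customers/{customer_id}", timeout=5)
    return response.json()

# Explicitly mark the local callable as an execution tool so the replay safety
# preflight can attribute observed effects to this wrapper.
lookup_customer = replaylab.control_tool(lookup_customer)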

To prepare the default local sandbox image before running the positive branch, run:

python scripts/run_scenario.py run sandbox-runtime-image-local --keep-workspace

To use a custom project image instead, set REPLAYLAB_SCENARIO_SANDBOX_IMAGE and REPLAYLAB_SANDBOX_IMAGE before running tool-resolution-local.

Run the unsupported-effect local scenario

Run:

python scripts/run_scenario.py run unsupported-effect-local --keep-workspace

ReplayLab captures and replays the same deterministic model-tool workflow with linked SQLite usage in the tool source. The first branch does not enable database-effect control, so the safety preflight treats the linked SQLite evidence as unsupported scope, keeps readiness not_ready, refuses safe workflow generation, and still leaves provider replay guard generation available.

The second branch explicitly enables database_effects observation, saves reviewed HTTP and SQLite policy decisions, reruns with HTTP, local-effect, and database-effect enforcement, and verifies the SQLite statement shapes are allowed by exact saved policy. That enforced report reaches readiness ready, generates a safe workflow regression pytest, and runs the generated pytest successfully.

Run the raw-socket effect local scenario

Run:

python scripts/run_scenario.py run raw-socket-effect-local --keep-workspace

ReplayLab captures and replays a deterministic model-tool workflow where the wrapped lookup_customer callable attempts direct socket I/O before its supported HTTP call. The observe branch explicitly requests network_effects hooks, records secret-safe raw-socket evidence, keeps safe workflow readiness not_ready, refuses safe workflow generation, and still generates a provider replay guard.

The enforce branch reruns with network_effect_control_mode=enforce. ReplayLab installs raw-socket hooks automatically, blocks the app-origin socket send before network I/O, and exposes the block in the report safety preflight as network-effect control evidence. The scenario verifies this is a control block, not a provider replay mismatch or mocked network response.

Run the queue-effect local scenario

Run:

python scripts/run_scenario.py run queue-effect-local --keep-workspace

ReplayLab captures and replays a deterministic model-tool workflow where the wrapped lookup_customer callable attempts an RQ-style enqueue through a local stub module before its supported HTTP call. The observe branch explicitly requests queue_effects hooks, records secret-safe queue/pubsub evidence, keeps safe workflow readiness not_ready, refuses safe workflow generation, and still generates a provider replay guard.

The enforce branch reruns with queue_effect_control_mode=enforce. ReplayLab installs queue hooks automatically, blocks the app-origin enqueue before broker I/O, and exposes the block in the report safety preflight as queue/pubsub effect control evidence. The scenario verifies this is a control block, not a provider replay mismatch, queued-worker execution, broker replay, or mocked queue response.

Run the unsupported HTTP client local scenario

Run:

python scripts/run_scenario.py run unsupported-http-client-local --keep-workspace

ReplayLab captures and replays a deterministic model-tool workflow where the wrapped lookup_customer callable uses a local fake urllib3-style client before its supported HTTP call. The observe branch explicitly requests unsupported_http_clients hooks, records secret-safe unsupported HTTP client evidence, keeps safe workflow readiness not_ready, refuses safe workflow generation, and still generates a provider replay guard.

The enforce branch reruns with unsupported_http_client_control_mode=enforce. ReplayLab installs unsupported HTTP client hooks automatically, blocks the app-origin urllib3 request before network I/O, and exposes the block in the report safety preflight as unsupported HTTP client control evidence. The scenario verifies this is a control block, not a provider replay mismatch, response replay adapter, or mocked HTTP response.

Run the escape-scope local scenario

Run:

python scripts/run_scenario.py run escape-scope-local --keep-workspace

ReplayLab captures and replays two deterministic model-tool branches. One branch has linked native/FFI source evidence through ctypes; the other has linked process-escape source evidence through multiprocessing, ProcessPoolExecutor, os.fork, os.exec*, and pty.spawn calls. The escape code is not executed; the scenario proves the AST-only unsupported-effect scanner can still find linked scope evidence without importing user code.

Both branches keep safe workflow readiness not_ready, refuse safe workflow generation, and still generate provider replay guards. The local app report detail shows the blocking evidence in Unsupported effect scope. This is a scope guard, not native-call enforcement, process virtualization, mocking, or sandboxing.

Run the sandbox escape local scenario

Run:

python scripts/run_scenario.py run sandbox-escape-local --keep-workspace

ReplayLab captures a deterministic provider-replay artifact and then attempts a sandboxed regression replay. If Docker and the configured image can start, the replay runs inside a local container with deny-all network and copied-workspace filesystem isolation. If Docker or the image is unavailable, ReplayLab writes a failed report with typed sandbox evidence instead of pretending the workflow is eligible.

The scenario verifies the local app report detail exposes Sandbox containment, safe workflow readiness remains blocked unless containment completed, and provider replay guard generation remains available. This is local Docker containment evidence, not Daytona, managed hosted execution, VM/microVM isolation, or a replacement for HTTP/local/database/network/queue effect controls.

Run the sandbox runtime image scenario

Run:

python scripts/run_scenario.py run sandbox-runtime-image-local --keep-workspace

ReplayLab builds the default local image replaylab-sandbox-runtime:py3.13, runs replaylab sandbox doctor, captures a deterministic provider workflow, generates a provider replay guard, and runs sandboxed replay with the prepared image. The scenario validates completed sandbox evidence with deny-all network, non-root runtime user, read-only root filesystem, split mounts, and copied-workspace containment.

This scenario may pull the Docker base image and install Python dependencies while building the image. The replay child itself still runs with --network none and --pull never.

Run the sandbox hardening scenario

Run:

python scripts/run_scenario.py run sandbox-hardening-local --keep-workspace

ReplayLab builds a recipe-backed image replaylab-sandbox-hardening:py3.13 with a local path dependency declared through a sandbox recipe, runs replaylab sandbox doctor with the same hardened runtime flags, captures a deterministic provider workflow, generates a provider replay guard, and runs sandboxed replay with the prepared image. The scenario validates that the replay report records recipe hash evidence plus non-root, read-only-root, split-mount sandbox containment.

Run the sandbox adversarial scenario

Run:

python scripts/run_scenario.py run sandbox-adversarial-local --keep-workspace

ReplayLab builds and doctors the local sandbox image, captures a deterministic provider workflow, then runs hardened sandbox replay with bounded escape probes. The app checks that host-only environment markers are not inherited, the Docker socket is not visible, app/root filesystem writes fail, /tmp remains writable, direct socket I/O cannot leave the deny-network sandbox, and a small subprocess stays inside the container. The scenario also verifies absolute host-path command arguments and external symlinks are refused before launch, and that linked process-escape source evidence keeps safe workflow generation unavailable. This is developer validation evidence, not a VM or managed sandbox guarantee.

Run the failure-story scenario

Run:

python scripts/run_scenario.py run failure-story-local --keep-workspace

ReplayLab runs the deterministic no-key dogfood workflow, rewrites the app into an intentional mismatch, and then verifies the same failure story appears in four places: report inspect, the React viewer, the static HTML fallback, and the local app report-detail API. The scenario exists to catch table-first regressions where a failed replay is technically recorded but a human still cannot tell what diverged or what to do next.

Expected ending:

ReplayLab scenario passed.
Scenario: failure-story-local
Tier: deterministic
Boundaries: 2
Providers: openai, requests

The scenario checks for the first mismatched boundary, expected-vs-actual provider call evidence, request hashes, a first recommended action, and secret safety.

Run the paid external PydanticAI realism scenario

Put a real key in .env.local:

OPENAI_API_KEY=...

Then run:

python scripts/run_scenario.py run pydantic-ai-external-paid --paid --keep-workspace

ReplayLab copies the external pydantic-ai-example project into an isolated workspace, runs one real paid capture flow, enforces a provider-call cap (default 10), retries once on transient provider/network failures, replays with an invalid key, compares, generates pytest, runs pytest, and verifies app-owned loop discoverability.

The scenario never prints OPENAI_API_KEY.

Run the paid OpenAI scenario

Put a real key in .env.local:

OPENAI_API_KEY=...

Then run:

python scripts/run_scenario.py run openai-real-pypi --paid --keep-workspace

ReplayLab installs from PyPI, selects a low-cost Responses-capable model unless overridden, makes one real capture call, replays with an invalid API key, compares the report, generates pytest, and runs the generated test without a valid key.

The script never prints OPENAI_API_KEY.

Development rule

When a task changes a user-facing workflow, update or add a scenario in the same task. If no scenario applies, the final implementation summary must say why.