Scenario-Driven Validation
ReplayLab scenarios are runnable user workflows. They prove that the product loop works as a developer would use it: capture a run, inspect the artifact, replay without the live dependency, compare the report, generate a pytest provider replay guard, and run the generated test.
Scenarios are not a simulator or a new user-facing test DSL. They are maintainer checks that keep the docs, examples, and implementation honest.
List scenarios
Run:
python scripts/run_scenario.py list
You should see:
ReplayLab scenarios
no-key-dogfood: deterministic boundaries=2 providers=openai, requests
new-user-pypi-local: loopback boundaries=1 providers=requests
support-bot-demo-local: loopback boundaries=2 providers=openai, requests
http-loopback-pypi: loopback boundaries=3 providers=requests, httpx
http-edge-cases-local: loopback boundaries=5 providers=requests, httpx
auto-instrumentation-local: loopback boundaries=1 providers=requests
asgi-fastapi-local: loopback boundaries=1 providers=requests
asgi-lifecycle-local: loopback boundaries=1 providers=requests
job-worker-local: loopback boundaries=1 providers=requests
job-lifecycle-local: loopback boundaries=2 providers=requests
pydantic-ai-local: loopback boundaries=1 providers=openai
langgraph-local: loopback boundaries=2 providers=requests, openai
langgraph-external-local: loopback boundaries=2 providers=requests, openai
langchain-openai-local: loopback boundaries=1 providers=openai
anthropic-local: loopback boundaries=1 providers=anthropic
gemini-local: loopback boundaries=1 providers=gemini
openai-streaming-local: loopback boundaries=1 providers=openai
anthropic-streaming-local: loopback boundaries=1 providers=anthropic
gemini-streaming-local: loopback boundaries=1 providers=gemini
ai-diagnosis-loopback: loopback boundaries=1 providers=requests
html-report-local: deterministic boundaries=2 providers=openai, requests
html-report-failure-local: deterministic boundaries=2 providers=openai, requests
react-viewer-local: deterministic boundaries=2 providers=openai, requests
viewer-first-local: deterministic boundaries=2 providers=openai, requests
guided-local-workflow: deterministic boundaries=2 providers=openai, requests
local-app-shell: deterministic boundaries=2 providers=openai, requests
local-app-workflow: deterministic boundaries=2 providers=openai, requests
failure-story-local: deterministic boundaries=2 providers=openai, requests
tool-resolution-local: deterministic boundaries=3 providers=openai, execution_tool, requests
unsupported-effect-local: deterministic boundaries=7 providers=openai, execution_tool, requests, database_effects, sqlite3
raw-socket-effect-local: deterministic boundaries=4 providers=openai, execution_tool, network_effects, requests
queue-effect-local: deterministic boundaries=5 providers=openai, execution_tool, queue_effects, requests
unsupported-http-client-local: deterministic boundaries=5 providers=openai, execution_tool, requests, unsupported_http_clients
escape-scope-local: deterministic boundaries=3 providers=openai, execution_tool, requests
sandbox-adversarial-local: loopback boundaries=2 providers=openai, requests
sandbox-escape-local: deterministic boundaries=2 providers=openai, requests
sandbox-hardening-local: loopback boundaries=2 providers=openai, requests
sandbox-runtime-image-local: loopback boundaries=2 providers=openai, requests
pydantic-ai-external-paid: paid boundaries=1 providers=openai
openai-real-pypi: paid boundaries=1 providers=openai
This means the current validation matrix covers:
- the no-key local MVP path;
- a fresh PyPI first project;
- a customer-style support-bot demo;
- the real HTTP client matrix over loopback;
- the local-source HTTP edge-case path;
- ASGI/FastAPI middleware capture and the simplified auto-instrumentation path;
- richer lifecycle checks for ASGI requests and worker jobs;
- AI-assisted diagnosis through a loopback fake endpoint;
- clean and failed static HTML report export, plus report-diff HTML export;
- clean/failed/diff React viewer export and the viewer-first local workflow;
- PydanticAI, LangGraph (local and external realism), and LangChain ChatOpenAI provider-level framework compatibility;
- native Anthropic Messages and Gemini Google Gen AI provider paths;
- event-preserving OpenAI/Anthropic/Gemini provider streams;
- queue/pubsub enqueue/publish escape control and unsupported HTTP client escape control;
- native/FFI and process-escape scope guards;
- local Docker sandbox containment evidence;
- the local app shell and workflow API;
- paid OpenAI-backed realism paths.
Scenario tiers
| Tier | What it can use | When to run it |
|---|---|---|
| deterministic | Fake providers, temp files, no secrets, no network | Normal CI and every user-facing workflow change |
| loopback | Real installed clients and local sockets | Release checks and provider/client compatibility checks |
| paid | Real external provider calls and local secrets | Explicit maintainer validation before public-alpha confidence claims |
Paid scenarios require --paid and must stay opt-in. They cap provider calls, redact secret-like
output, use temporary workspaces, and prove replay/generated tests work without a live provider.
Run the no-key scenario
Run:
python scripts/run_scenario.py run no-key-dogfood
ReplayLab copies the deterministic dogfood app into a temporary workspace, captures one OpenAI-like
boundary and one HTTP boundary, replaces the fake providers with failing providers, replays the same
app command, compares the report, generates a pytest provider replay guard, and runs that generated test.
The scenario also asserts that capsule inspect, replay, and report inspect print actionable
"Next steps" hints pointing to the next local command, without exposing payload bodies or the original app argv.
Expected ending:
ReplayLab scenario passed.
Scenario: no-key-dogfood
Tier: deterministic
Boundaries: 2
Payloads: 4
Providers: openai, requests
This proves the basic local loop works without an API key.
Run the fresh PyPI first-project scenario
Run:
python scripts/run_scenario.py run new-user-pypi-local --keep-workspace --package-version 0.1.0a4
ReplayLab creates a clean temporary virtual environment, installs replaylab==0.1.0a4, requests,
and pytest from PyPI, writes a minimal app with startup replaylab.init(...) and
handle.capture(...), starts a loopback HTTP service for capture, then stops that service before
replay.
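For orientation, here is a minimal sketch of the kind of app this scenario writes, assuming replaylab.init(...) returns a handle whose capture(...) works as a context manager (both names come from this section); the init keyword, capture label, and loopback URL are illustrative:

```python
import requests

import replaylab

# init(...) kwargs are illustrative; this doc only confirms replaylab.init(...).
handle = replaylab.init(project="first-project")


def main() -> None:
    # handle.capture(...) scopes one captured run around the provider call.
    with handle.capture("lookup-ticket"):
        response = requests.get("http://127.0.0.1:8123/tickets/123")
        print(response.json())


if __name__ == "__main__":
    main()
```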
Expected ending:
ReplayLab scenario passed.
Scenario: new-user-pypi-local
Tier: loopback
Boundaries: 1
Providers: requests
This is the maintainer check for the first 15 minutes of a new user experience: install from PyPI, capture a normal provider call, inspect and replay it, compare the report, export the local viewer, generate a pytest provider replay guard, and run that guard without a live provider.
Run the customer support-bot demo scenario
Run:
python scripts/run_scenario.py run support-bot-demo-local --keep-workspace --package-version 0.1.0a4
ReplayLab creates a clean temporary virtual environment, installs replaylab==0.1.0a4, openai,
requests, and pytest from PyPI, writes a tutorial-style support bot, starts fake OpenAI
Responses and support-ticket loopback services for capture, then stops both services before replay.
Expected ending:
ReplayLab scenario passed.
Scenario: support-bot-demo-local
Tier: loopback
Boundaries: 2
Providers: openai, requests
This is the customer-demo check. It proves a realistic support-bot flow can capture OpenAI and HTTP boundaries, replay with providers stopped, export clean/failed/diff React viewers, compare reports, generate a pytest provider replay guard, and run that guard without provider credentials.
For future release-candidate validation before publishing, install ReplayLab from the local candidate wheelhouse while still resolving third-party dependencies from PyPI:
python scripts/release_rehearsal.py --channel github --version <next-version>
python scripts/run_scenario.py run support-bot-demo-local \
--keep-workspace \
--package-version <next-version> \
--package-wheelhouse dist/public-alpha/v<next-version>
python scripts/run_scenario.py run pydantic-ai-local \
--keep-workspace \
--package-version <next-version> \
--package-wheelhouse dist/public-alpha/v<next-version>
python scripts/run_scenario.py run local-app-workflow \
--keep-workspace \
--package-version <next-version> \
--package-wheelhouse dist/public-alpha/v<next-version>
For maintainer demo rehearsal, use the kept workspace as an artifact store for the current checkout:
cd <kept-workspace>
uv run --project /path/to/ReplayLab replaylab app --local-store-root .replaylab
That verifies the current local app can start from the latest failed regression replay, the latest regression replay, or the latest captured run; recover the recorded command; default provider patching from the artifact's recorded providers; and run the workflow without manually reconstructing capsule or report paths.
Run the real HTTP loopback scenario
Run:
python scripts/run_scenario.py run http-loopback-pypi --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs replaylab==0.1.0a4, requests,
httpx, and pytest from PyPI, starts a loopback HTTP server for capture, records one requests
call, one sync httpx call, and one async httpx call, then stops the server before replay.
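A sketch of the three recorded boundaries, assuming the same replaylab.init(...)/handle.capture(...) shape as in the earlier sketch; the loopback URL and label are illustrative:

```python
import asyncio

import httpx
import requests

import replaylab

handle = replaylab.init()

with handle.capture("http-matrix"):
    # Boundary 1: requests
    requests.get("http://127.0.0.1:8123/ping")

    # Boundary 2: sync httpx
    with httpx.Client() as client:
        client.get("http://127.0.0.1:8123/ping")

    # Boundary 3: async httpx
    async def fetch() -> None:
        async with httpx.AsyncClient() as aclient:
            await aclient.get("http://127.0.0.1:8123/ping")

    asyncio.run(fetch())
```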
Expected ending:
ReplayLab scenario passed.
Scenario: http-loopback-pypi
Tier: loopback
Boundaries: 3
Providers: requests, httpx
This proves replay is serving stored HTTP responses, because the server is no longer running.
Run the local HTTP edge-case scenario
Run:
python scripts/run_scenario.py run http-edge-cases-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus real
requests, httpx, and pytest, starts a loopback HTTP server for capture, records query-pair
params, safe header names, JSON request bodies, text request bodies, text content, bytes content,
JSON responses, text responses, and bytes responses, then stops the server before replay.
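A sketch of the recorded request shapes, shown with requests only (the scenario also exercises httpx); URLs and values are illustrative:

```python
import requests

import replaylab

handle = replaylab.init()
base = "http://127.0.0.1:8123"

with handle.capture("http-edge-cases"):
    requests.get(f"{base}/search", params=[("q", "a"), ("q", "b")])    # query pairs
    requests.get(f"{base}/items", headers={"X-Request-ID": "abc123"})  # safe header
    requests.post(f"{base}/items", json={"name": "widget"})            # JSON body
    requests.post(f"{base}/notes", data="plain text body")             # text body
    requests.post(f"{base}/blobs", data=b"\x00\x01binary")             # bytes body
```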
Expected ending:
ReplayLab scenario passed.
Scenario: http-edge-cases-local
Tier: loopback
Boundaries: 5
Providers: requests, httpx
This proves the current checkout can replay realistic HTTP request identity and body shapes without calling the live loopback server.
Run the local auto-instrumentation scenario
Run:
python scripts/run_scenario.py run auto-instrumentation-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus
FastAPI, Uvicorn, requests, and pytest, starts a deterministic loopback provider for capture,
initializes the app with auto_patch_integrations="auto", instruments the app with
replaylab.instrument_app(app, handle=handle), then stops the provider before replay.
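A sketch of that startup path, assuming auto_patch_integrations is a replaylab.init(...) keyword and instrument_app has the signature shown above; the route and provider URL are illustrative:

```python
import requests
from fastapi import FastAPI

import replaylab

handle = replaylab.init(auto_patch_integrations="auto")

app = FastAPI()
replaylab.instrument_app(app, handle=handle)


@app.get("/tickets/{ticket_id}")
def get_ticket(ticket_id: str) -> dict:
    # Provider calls inside the request are captured without a
    # per-endpoint handle.capture(...) scope.
    provider = requests.get(f"http://127.0.0.1:8123/tickets/{ticket_id}")
    return provider.json()
```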
Expected ending:
ReplayLab scenario passed.
Scenario: auto-instrumentation-local
Tier: loopback
Boundaries: 1
Providers: requests
This proves the low-boilerplate startup path can capture and replay a FastAPI request without manual endpoint capture scopes or direct middleware registration.
Run the local ASGI/FastAPI scenario
Run:
python scripts/run_scenario.py run asgi-fastapi-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus
FastAPI, Uvicorn, requests, and pytest, starts a deterministic loopback provider for capture,
serves one FastAPI request through ReplayLabASGIMiddleware, then stops the provider before replay.
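A sketch of direct middleware registration, assuming ReplayLabASGIMiddleware accepts the handle as a keyword; the import path, route, and provider URL are assumptions:

```python
import requests
from fastapi import FastAPI

import replaylab
from replaylab import ReplayLabASGIMiddleware  # import path assumed

handle = replaylab.init()

app = FastAPI()
app.add_middleware(ReplayLabASGIMiddleware, handle=handle)


@app.get("/tickets/{ticket_id}")
def get_ticket(ticket_id: str) -> dict:
    provider = requests.get(f"http://127.0.0.1:8123/tickets/{ticket_id}")
    return provider.json()
```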
Expected ending:
ReplayLab scenario passed.
Scenario: asgi-fastapi-local
Tier: loopback
Boundaries: 1
Providers: requests
This proves request-scoped ASGI middleware capture works without endpoint-level handle.capture(...)
boilerplate, and that generated pytest can replay the same app command without the provider.
Run the local ASGI lifecycle scenario
Run:
python scripts/run_scenario.py run asgi-lifecycle-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus
FastAPI, Uvicorn, requests, and pytest, starts a deterministic loopback provider for capture,
then sends three requests through a FastAPI app:
- GET /health, ignored by ReplayLab middleware.
- GET /ready, provider-free and therefore not written by default.
- GET /tickets/123, provider-backed and carrying request ID, authorization, and cookie headers.
Expected ending:
ReplayLab scenario passed.
Scenario: asgi-lifecycle-local
Tier: loopback
Boundaries: 1
Providers: requests
This proves request-scoped middleware stays low-intrusion in a more realistic app shape: only the provider-backed route writes a capsule, the capsule records safe ASGI facts such as method, path, route path, endpoint, status code, and configured request ID, and authorization/cookie values stay out of ReplayLab artifacts.
Run the local worker/job scenario
Run:
python scripts/run_scenario.py run job-worker-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus
requests and pytest, starts a deterministic loopback provider for capture, runs one decorated
worker job, then stops the provider before replay.
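A sketch of a decorated worker job, assuming the capture_job(...) decorator named in the next scenario lives on the handle; its real location and call shape may differ:

```python
import requests

import replaylab

handle = replaylab.init()


@handle.capture_job()  # decorator home and signature assumed
def process_ticket(ticket_id: str) -> dict:
    provider = requests.get(f"http://127.0.0.1:8123/tickets/{ticket_id}")
    return provider.json()


if __name__ == "__main__":
    process_ticket("123")
```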
Expected ending:
ReplayLab scenario passed.
Scenario: job-worker-local
Tier: loopback
Boundaries: 1
Providers: requests
This proves job-scoped capture works without adding handle.capture(...) inside the job body, and
that generated pytest can replay the same worker command without the provider.
Run the local job lifecycle scenario
Run:
python scripts/run_scenario.py run job-lifecycle-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus
requests and pytest, starts a deterministic loopback provider for capture, then invokes:
- one provider-free decorated job, which writes no capsule by default;
- one sync provider job with ticket_id supplied positionally;
- one async provider job with ticket_id supplied as a keyword.
Expected ending:
ReplayLab scenario passed.
Scenario: job-lifecycle-local
Tier: loopback
Boundaries: 2
Providers: requests
This proves capture_job(...) writes one capsule per provider-backed job invocation, extracts
session IDs for positional and keyword calls, records safe job metadata, and keeps job arguments,
keyword arguments, and return values out of adapter-owned metadata.
Run the PydanticAI compatibility scenario
Run:
python scripts/run_scenario.py run pydantic-ai-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus
pydantic-ai-slim[openai], openai, and pytest, starts a deterministic OpenAI
Responses-compatible loopback provider for capture, then invokes one PydanticAI Agent using
OpenAIResponsesModel and OpenAIProvider(openai_client=...) inside handle.capture(...).
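A sketch of that captured call, assuming the loopback base URL; the model name, prompt, and result attribute (.output on recent pydantic-ai releases) are illustrative:

```python
from openai import AsyncOpenAI
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIResponsesModel
from pydantic_ai.providers.openai import OpenAIProvider

import replaylab

handle = replaylab.init()

client = AsyncOpenAI(base_url="http://127.0.0.1:8123/v1", api_key="fake-key")
model = OpenAIResponsesModel(
    "gpt-4o-mini", provider=OpenAIProvider(openai_client=client)
)
agent = Agent(model)

with handle.capture("pydantic-ai-run"):
    result = agent.run_sync("Summarize ticket 123")
    print(result.output)
```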
The provider endpoint is stopped before replay, report comparison, React viewer export, generated
pytest creation, and generated pytest execution.
Expected ending:
ReplayLab scenario passed.
Scenario: pydantic-ai-local
Tier: loopback
Boundaries: 1
Providers: openai
This proves supported OpenAI Responses calls inside PydanticAI can be captured and replayed through generic provider instrumentation. It is a validated compatibility scenario, not a native PydanticAI trace adapter.
Run the LangGraph compatibility scenario
Run:
python scripts/run_scenario.py run langgraph-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus
langgraph, requests, openai, and pytest, starts deterministic loopback providers for HTTP
and OpenAI Responses capture, then runs one StateGraph through graph.invoke(...). One node uses
normal requests.get(...); another node uses normal OpenAI().responses.create(...).
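A sketch of that two-node graph, assuming loopback URLs for the two providers; the state shape, model name, and prompts are illustrative:

```python
from typing import TypedDict

import requests
from langgraph.graph import END, START, StateGraph
from openai import OpenAI

import replaylab

handle = replaylab.init()


class State(TypedDict, total=False):
    ticket: dict
    summary: str


def fetch_ticket(state: State) -> State:
    # HTTP boundary through plain requests
    return {"ticket": requests.get("http://127.0.0.1:8123/tickets/123").json()}


def summarize(state: State) -> State:
    # OpenAI Responses boundary through the normal client
    client = OpenAI(base_url="http://127.0.0.1:8124/v1", api_key="fake-key")
    response = client.responses.create(
        model="gpt-4o-mini", input=f"Summarize: {state['ticket']}"
    )
    return {"summary": response.output_text}


builder = StateGraph(State)
builder.add_node("fetch_ticket", fetch_ticket)
builder.add_node("summarize", summarize)
builder.add_edge(START, "fetch_ticket")
builder.add_edge("fetch_ticket", "summarize")
builder.add_edge("summarize", END)
graph = builder.compile()

with handle.capture("langgraph-run"):
    print(graph.invoke({}))
```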
Both providers are stopped before replay, report comparison, React viewer export, generated pytest
creation, and generated pytest execution.
Expected ending:
ReplayLab scenario passed.
Scenario: langgraph-local
Tier: loopback
Boundaries: 2
Providers: requests, openai
This proves supported provider calls inside LangGraph nodes can be captured and replayed without a
LangGraph-specific ReplayLab adapter. The scenario also validates the local app trace shape: project
run, workflow step, inferred Graph node rows, API call, LLM call, and formatted payload detail. It
does not validate LangChain ChatOpenAI, Chat Completions, streaming, or framework-native graph-edge
replay semantics.
Run the external LangGraph realism scenario
Run:
python scripts/run_scenario.py run langgraph-external-local --keep-workspace
This scenario bootstraps or reuses the external langgraph-example project under
/Users/damienbenveniste/Projects/agentic-management/replaylab-example-projects, copies it into an
isolated workspace, runs loopback provider capture/replay/compare/generate-test/pytest, and then
verifies the app-owned loop evidence from replaylab app.
Expected ending:
ReplayLab scenario passed.
Scenario: langgraph-external-local
Tier: loopback
Boundaries: 2
Providers: requests, openai
This proves the LangGraph path is realistic beyond generated in-repo scenario source and stays discoverable through captured-run product surfaces. The external path is forced into provider-backed loopback mode so it cannot silently pass through a provider-free fallback. Provider-free LangGraph runs are allowed as metadata captures, but they are not replayable evidence and the app disables replay/experiment/generated-guard actions for them.
Run the LangChain ChatOpenAI compatibility scenario
Run:
python scripts/run_scenario.py run langchain-openai-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus
langchain-openai, openai, and pytest, starts a deterministic OpenAI Responses-compatible
loopback provider for capture, then runs one ChatOpenAI(..., use_responses_api=True) call inside
handle.capture(...).
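A sketch of that call, assuming the loopback base URL; the model and prompt are illustrative:

```python
from langchain_openai import ChatOpenAI

import replaylab

handle = replaylab.init()

llm = ChatOpenAI(
    model="gpt-4o-mini",
    use_responses_api=True,
    base_url="http://127.0.0.1:8123/v1",
    api_key="fake-key",
)

with handle.capture("langchain-run"):
    message = llm.invoke("Summarize ticket 123")
    print(message.content)
```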
The provider endpoint is stopped before replay, report comparison, React viewer export, generated
pytest creation, and generated pytest execution.
The scenario also validates that replaylab app keeps this run discoverable as a captured run with
an available replay action, attached replay evidence, attached provider replay guard evidence, and
reachable upload preview.
Expected ending:
ReplayLab scenario passed.
Scenario: langchain-openai-local
Tier: loopback
Boundaries: 1
Providers: openai
This proves the LangChain ChatOpenAI Responses path works with ReplayLab's provider-level capture and replay loop. It does not validate LangChain paths that require Chat Completions semantics, streaming, or Anthropic adapters.
Run the Anthropic Messages compatibility scenario
Run:
python scripts/run_scenario.py run anthropic-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus the real
anthropic SDK and pytest, starts a deterministic local Anthropic Messages-compatible loopback
server, captures one Anthropic().messages.create(...) call, then stops the server before replay,
report comparison, React viewer export, generated pytest creation, and generated pytest execution.
The scenario also validates that replaylab app shows a captured run with an Anthropic LLM call,
formatted request/response previews, attached regression replay evidence, attached generated guard
evidence, and reachable upload preview.
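A sketch of that captured call, assuming the loopback base URL; the model, token cap, and prompt are illustrative:

```python
from anthropic import Anthropic

import replaylab

handle = replaylab.init()

client = Anthropic(base_url="http://127.0.0.1:8123", api_key="fake-key")

with handle.capture("anthropic-run"):
    message = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=256,
        messages=[{"role": "user", "content": "Summarize ticket 123"}],
    )
    print(message.content[0].text)
```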
Expected ending:
ReplayLab scenario passed.
Scenario: anthropic-local
Tier: loopback
Boundaries: 1
Providers: anthropic
This proves native Anthropic Messages calls can be captured and regression-replayed without an Anthropic API key. It does not validate batches, files, Bedrock/Vertex clients, OpenAI-compatible Anthropic routing, or Anthropic framework adapters.
Run the Gemini Generate Content compatibility scenario
Run:
python scripts/run_scenario.py run gemini-local --keep-workspace
ReplayLab creates a clean temporary virtual environment, installs the current checkout plus the real
google-genai SDK and pytest, starts a deterministic local Gemini generateContent loopback
server, captures one Client().models.generate_content(...) call, then stops the server before
replay, report comparison, React viewer export, generated pytest creation, and generated pytest
execution. The scenario also validates that replaylab app shows a captured run with a Gemini
LLM call, formatted request/response previews, attached regression replay evidence, attached
generated guard evidence, and reachable upload preview.
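A sketch of that captured call, assuming the loopback base URL; the model and prompt are illustrative:

```python
from google import genai
from google.genai.types import HttpOptions

import replaylab

handle = replaylab.init()

client = genai.Client(
    api_key="fake-key",
    http_options=HttpOptions(base_url="http://127.0.0.1:8123"),
)

with handle.capture("gemini-run"):
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="Summarize ticket 123",
    )
    print(response.text)
```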
Expected ending:
ReplayLab scenario passed.
Scenario: gemini-local
Tier: loopback
Boundaries: 1
Providers: gemini
This proves native Google Gen AI Gemini generate_content calls can be captured and
regression-replayed without a Gemini API key. It does not validate multimodal file/image/video
flows, the Live API, or Vertex-specific product breadth.
Run the core provider streaming scenarios
Run:
python scripts/run_scenario.py run openai-streaming-local --keep-workspace
python scripts/run_scenario.py run anthropic-streaming-local --keep-workspace
python scripts/run_scenario.py run gemini-streaming-local --keep-workspace
Each scenario creates a clean temporary virtual environment, installs the current checkout plus the
real provider SDK and pytest, starts a deterministic local streaming-compatible loopback server,
captures one fully consumed provider stream, then stops the server before regression replay,
comparison, React viewer export, generated pytest creation, and generated pytest execution. The
scenario also validates that replaylab app keeps the streaming captured run discoverable with a
provider LLM call, streaming metadata, formatted final output, attached replay evidence, attached
generated guard evidence, and reachable upload preview.
Expected endings:
ReplayLab scenario passed.
Scenario: openai-streaming-local
Tier: loopback
Boundaries: 1
Providers: openai
ReplayLab scenario passed.
Scenario: anthropic-streaming-local
Tier: loopback
Boundaries: 1
Providers: anthropic
ReplayLab scenario passed.
Scenario: gemini-streaming-local
Tier: loopback
Boundaries: 1
Providers: gemini
This proves event-preserving streaming support for OpenAI responses.create(..., stream=True),
Anthropic messages.stream(...) / messages.create(..., stream=True), and Gemini
generate_content_stream(...) without provider API keys. A stream becomes replayable evidence only
after it is fully consumed; incomplete streams are recorded as failed/incomplete evidence.
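A sketch of fully consuming each stream so it becomes replayable evidence; the loopback base URLs, ports, models, and prompts are illustrative:

```python
from anthropic import Anthropic
from google import genai
from google.genai.types import HttpOptions
from openai import OpenAI

openai_client = OpenAI(base_url="http://127.0.0.1:8123/v1", api_key="fake-key")
anthropic_client = Anthropic(base_url="http://127.0.0.1:8124", api_key="fake-key")
gemini_client = genai.Client(
    api_key="fake-key", http_options=HttpOptions(base_url="http://127.0.0.1:8125")
)

# OpenAI Responses streaming: iterate every event.
stream = openai_client.responses.create(
    model="gpt-4o-mini", input="Summarize ticket 123", stream=True
)
for event in stream:
    pass

# Anthropic Messages streaming: drain the text stream to completion.
with anthropic_client.messages.stream(
    model="claude-3-5-haiku-latest",
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize ticket 123"}],
) as anthropic_stream:
    for _text in anthropic_stream.text_stream:
        pass

# Gemini streaming: consume every chunk.
for chunk in gemini_client.models.generate_content_stream(
    model="gemini-2.0-flash", contents="Summarize ticket 123"
):
    pass
```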
Run the AI diagnosis loopback scenario
Run:
python scripts/run_scenario.py run ai-diagnosis-loopback --keep-workspace
ReplayLab creates a deterministic capsule and replay report, starts a local OpenAI
Responses-compatible fake endpoint, runs replaylab report explain, runs
replaylab report diff-explain, runs replaylab ai plan-instrumentation, verifies the outputs are
secret-safe, and confirms no external AI provider was called.
Expected ending:
ReplayLab scenario passed.
Scenario: ai-diagnosis-loopback
Tier: loopback
Boundaries: 1
Providers: requests
This proves optional report explanation, report-diff explanation, and instrumentation planning can be validated without a paid provider call.
Run the static HTML report scenario
Run:
python scripts/run_scenario.py run html-report-local --keep-workspace
ReplayLab runs the deterministic no-key dogfood workflow, exports the replay report as a single local HTML file, and verifies the file contains the expected report sections without payload bodies or secret-looking strings.
Expected ending:
ReplayLab scenario passed.
Scenario: html-report-local
Tier: deterministic
Boundaries: 2
Providers: openai, requests
This proves users can get a browser-readable local replay summary without a hosted service.
Run the static HTML failure scenario
Run:
python scripts/run_scenario.py run html-report-failure-local --keep-workspace
ReplayLab runs the deterministic no-key dogfood workflow, intentionally replays a mismatched app command, exports a failed replay report as HTML, exports a baseline-vs-candidate diff as HTML, and verifies both files include problem diagnostics without payload bodies or secret-looking strings.
Expected ending:
ReplayLab scenario passed.
Scenario: html-report-failure-local
Tier: deterministic
Boundaries: 2
Providers: openai, requests
This proves failed replays can still be understood through the dependency-free static fallback.
Run the local React viewer scenario
Run:
python scripts/run_scenario.py run react-viewer-local --keep-workspace
ReplayLab runs the deterministic no-key dogfood workflow, exports a clean React viewer, intentionally replays a mismatched app command, exports a failed React viewer, exports a report-diff React viewer, and verifies the generated files include diagnostics without payload bodies or secret-looking strings. The scenario checks the viewer's diagnosis section, quick filters, grouped next commands, and diff summary labels so the maintainer can catch confusing UX while using the same files a user would open.
Expected ending:
ReplayLab scenario passed.
Scenario: react-viewer-local
Tier: deterministic
Boundaries: 2
Providers: openai, requests
This is the main local viewer UX scenario. It exists so the maintainer can open the same files a user would open and catch navigation, density, wording, and failure-diagnosis problems early.
Run the viewer-first local workflow scenario
Run:
python scripts/run_scenario.py run viewer-first-local --keep-workspace
ReplayLab runs the deterministic no-key dogfood workflow, uses replaylab report view report to
write clean and failed viewer files without opening a browser, then uses replaylab report view diff
to write the before/after diff viewer. The scenario verifies the viewer-first files include the
diagnosis, "What to do next" section, copyable command labels, failure/diff diagnostics, and no
payload bodies or secret-looking strings.
Expected ending:
ReplayLab scenario passed.
Scenario: viewer-first-local
Tier: deterministic
Boundaries: 2
Providers: openai, requests
This is the product-flow scenario for local viewing. It exists to catch whether the workflow still feels too command-heavy after the CLI has launched the browser-readable viewer.
Run the guided local workflow scenario
Run:
python scripts/run_scenario.py run guided-local-workflow --keep-workspace
ReplayLab runs the deterministic no-key dogfood workflow, then uses
replaylab workflow local to replay the app, compare the report, write the local React viewer,
generate a pytest provider replay guard, and run that generated test from one command. It then intentionally
replays a mismatched app through the same guided command and verifies the failed viewer is still
written.
Expected ending:
ReplayLab scenario passed.
Scenario: guided-local-workflow
Tier: deterministic
Boundaries: 2
Providers: openai, requests
This is the scenario for checking whether the post-capsule workflow is now less command-heavy for a developer. Lower-level replay, report, viewer, and generate-test commands remain available when a script needs individual steps.
Run the local app shell scenario
Run:
python scripts/run_scenario.py run local-app-shell --keep-workspace
ReplayLab runs the deterministic no-key dogfood workflow, starts
replaylab app --no-open-browser against the generated .replaylab store, verifies the served
page is the product workspace instead of an exported ReplayLab Viewer, then queries product API
routes such as /api/workspace, /api/projects, /api/projects/{project_id}/artifacts,
/api/captured-runs/{captured_run_id}, /api/reports/{replay_id}, and
/api/capsules/{capsule_id}. Compatibility aliases such as /api/reports and /api/capsules
remain available during migration. The scenario verifies the local adapter discovers the provider
capsule, replay report, product captured-run view, comparison, run profile state,
latest captured-run/regression-replay shortcuts, and diagnosis data without returning payload body
or secret-looking markers. It also verifies the app exposes start-point data for the latest failed
regression replay, latest regression replay, and latest captured run when those artifacts exist.
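A sketch of probing those routes, assuming the app serves on a local port; the port and response shapes are assumptions:

```python
import requests

base = "http://127.0.0.1:8000"  # port assumed

workspace = requests.get(f"{base}/api/workspace").json()
projects = requests.get(f"{base}/api/projects").json()

# IDs come from the listing responses; the field name is assumed.
project_id = projects[0]["id"]
artifacts = requests.get(f"{base}/api/projects/{project_id}/artifacts").json()
```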
Expected ending:
ReplayLab scenario passed.
Scenario: local-app-shell
Tier: deterministic
Boundaries: 2
Providers: openai, requests
This is the first scenario for the interactive local product surface. It exists to catch whether
the app makes artifact discovery easier than manually assembling capsule paths, report paths, and
viewer commands. External-example checks additionally verify that report artifact directory IDs,
such as a user-chosen --report-id, can open the same detail view as the internal replay ID. The
scenario also validates that captured-run detail includes the split-pane trace explorer, agent/LLM
trace labels, redacted Input/Output payload previews, protocol assistant/tool events, a run-list
side inspector on the captured-runs route, and collapsed raw replay/debug data without leaking
secret-looking markers.
Run the local app workflow scenario
Run:
python scripts/run_scenario.py run local-app-workflow --keep-workspace
ReplayLab runs the deterministic no-key dogfood workflow, starts
replaylab app --no-open-browser against the generated .replaylab store, extracts the local app
action token from the served page, then calls the product action API through
POST /api/actions/replay. The legacy workflow endpoints remain available for older callers, but
the product app path no longer posts command_text or asks the frontend to choose provider
integrations. The scenario verifies a clean app-driven replay, comparison, viewer export, pytest
generation, and generated pytest run.
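A sketch of driving that action API, assuming the token travels in a request header; the header name, port, and body shape are assumptions, and the placeholder values stand for data extracted from the served page and artifact listings:

```python
import requests

base = "http://127.0.0.1:8000"   # port assumed
action_token = "..."             # extracted from the served page
captured_run_id = "..."          # from the artifact listings

response = requests.post(
    f"{base}/api/actions/replay",
    headers={"X-ReplayLab-Action-Token": action_token},  # header name assumed
    json={"captured_run_id": captured_run_id},           # body shape assumed
)
response.raise_for_status()
print(response.json())  # structured, secret-safe step results per the scenario
```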
It then rewrites the app into an intentional mismatch and verifies the app API still writes a failed
report and failed viewer with secret-safe structured step results. The scenario now also checks that
run profiles are recovered from artifacts, start points carry stable product roles and avoid
duplicate choices, provider labels come from the selected artifact, fresh captures can replay
through /api/actions/replay, and older missing-profile captures show app-managed setup candidates
instead of copy-pastable placeholder commands. It also verifies the original captured run exposes
latest-first regression replay history with timestamps, durations, problem counts, and report IDs
that can be opened through the report detail API. When the workflow creates a live experiment, the
scenario verifies the experiment trace is attached under the original captured run through product
artifact views and is not exposed as a new primary captured-run baseline. It also validates that
the captured-run product detail keeps the app-owned loop discoverable after refresh: regression
replays, live experiments, generated provider replay guards and diagnostic provider replay guards,
export evidence, and upload
preview entrypoints remain attached to the baseline captured run.
Expected ending:
ReplayLab scenario passed.
Scenario: local-app-workflow
Tier: deterministic
Boundaries: 2
Providers: openai, requests
This is the scenario for checking whether the local app is now a useful product workflow surface, not just an artifact browser. It should catch regressions where the app makes users fall back to hand-built CLI chains.
Run the tool-resolution local scenario
Run:
python scripts/run_scenario.py run tool-resolution-local --keep-workspace
ReplayLab captures and replays a deterministic OpenAI Responses artifact with a provider-visible
model tool and local Python source. The local lookup_customer callable is explicitly wrapped with
replaylab.control_tool(...). The scenario starts replaylab app and verifies the replay safety
preflight exposes:
- advisory implementation candidates and explicit execution-tool wrapper evidence;
- observed HTTP stack attribution and an advisory tool effect map;
- read-only effect policy proposal items and saved project effect policy review state after reload;
- opt-in HTTP effect policy enforcement evidence and opt-in local-effect control evidence;
- read-only safe workflow readiness and safe workflow regression availability.
It also verifies unsupported-effect scope detection is clear for the happy path. The scenario first
proves observe mode still works, then proves HTTP enforce blocks without a saved policy, then saves
an accepted project rule and proves HTTP plus local-effect enforce allows the observed
lookup_customer -> GET api.example.test HTTP effect. The final enforced branch runs with
local-container sandbox mode. When the configured Docker image is available and contains the needed
runtime dependencies, the readiness gate reaches ready and the scenario generates and runs a safe
workflow regression pytest. When Docker or the image is unavailable, the same scenario verifies
typed sandbox evidence blocks safe workflow generation instead of silently overclaiming readiness.
A negative branch attempts a file write under local-effect enforce and verifies the file is not
created.
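A sketch of the explicit execution-tool wrapper, assuming replaylab.control_tool(...) wraps a callable and returns it; the wrapper's real signature may differ:

```python
import requests

import replaylab


def lookup_customer(customer_id: str) -> dict:
    # Observed effect: lookup_customer -> GET api.example.test
    return requests.get(
        f"https://api.example.test/customers/{customer_id}"
    ).json()


lookup_customer = replaylab.control_tool(lookup_customer)
```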
To prepare the default local sandbox image before running the positive branch, run:
python scripts/run_scenario.py run sandbox-runtime-image-local --keep-workspace
To use a custom project image instead, set REPLAYLAB_SCENARIO_SANDBOX_IMAGE and
REPLAYLAB_SANDBOX_IMAGE before running tool-resolution-local.
Run the unsupported-effect local scenario
Run:
python scripts/run_scenario.py run unsupported-effect-local --keep-workspace
ReplayLab captures and replays the same deterministic model-tool workflow with linked SQLite usage
in the tool source. The first branch does not enable database-effect control, so the safety
preflight treats the linked SQLite evidence as unsupported scope, keeps readiness not_ready,
refuses safe workflow generation, and still leaves provider replay guard generation available.
The second branch explicitly enables database_effects observation, saves reviewed HTTP and SQLite
policy decisions, reruns with HTTP, local-effect, and database-effect enforcement, and verifies the
SQLite statement shapes are allowed by exact saved policy. That enforced report reaches readiness
ready, generates a safe workflow regression pytest, and runs the generated pytest successfully.
Run the raw-socket effect local scenario
Run:
python scripts/run_scenario.py run raw-socket-effect-local --keep-workspace
ReplayLab captures and replays a deterministic model-tool workflow where the wrapped
lookup_customer callable attempts direct socket I/O before its supported HTTP call. The observe
branch explicitly requests network_effects hooks, records secret-safe raw-socket evidence, keeps
safe workflow readiness not_ready, refuses safe workflow generation, and still generates a
provider replay guard.
The enforce branch reruns with network_effect_control_mode=enforce. ReplayLab installs raw-socket
hooks automatically, blocks the app-origin socket send before network I/O, and exposes the block in
the report safety preflight as network-effect control evidence. The scenario verifies this is a
control block, not a provider replay mismatch or mocked network response.
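A sketch of opting into raw-socket observation and enforcement, assuming network_effect_control_mode is a replaylab.init(...) keyword argument; where the scenario actually sets it is not specified in this doc:

```python
import replaylab

# Observe branch: request network_effects hooks and record evidence only.
handle = replaylab.init(network_effect_control_mode="observe")

# Enforce branch: block app-origin socket sends before network I/O.
handle = replaylab.init(network_effect_control_mode="enforce")
```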
Run the queue-effect local scenario
Run:
python scripts/run_scenario.py run queue-effect-local --keep-workspace
ReplayLab captures and replays a deterministic model-tool workflow where the wrapped
lookup_customer callable attempts an RQ-style enqueue through a local stub module before its
supported HTTP call. The observe branch explicitly requests queue_effects hooks, records
secret-safe queue/pubsub evidence, keeps safe workflow readiness not_ready, refuses safe workflow
generation, and still generates a provider replay guard.
The enforce branch reruns with queue_effect_control_mode=enforce. ReplayLab installs queue hooks
automatically, blocks the app-origin enqueue before broker I/O, and exposes the block in the report
safety preflight as queue/pubsub effect control evidence. The scenario verifies this is a control
block, not a provider replay mismatch, queued-worker execution, broker replay, or mocked queue
response.
Run the unsupported HTTP client local scenario
Run:
python scripts/run_scenario.py run unsupported-http-client-local --keep-workspace
ReplayLab captures and replays a deterministic model-tool workflow where the wrapped
lookup_customer callable uses a local fake urllib3-style client before its supported HTTP call.
The observe branch explicitly requests unsupported_http_clients hooks, records secret-safe
unsupported HTTP client evidence, keeps safe workflow readiness not_ready, refuses safe workflow
generation, and still generates a provider replay guard.
The enforce branch reruns with unsupported_http_client_control_mode=enforce. ReplayLab installs
unsupported HTTP client hooks automatically, blocks the app-origin urllib3 request before network
I/O, and exposes the block in the report safety preflight as unsupported HTTP client control
evidence. The scenario verifies this is a control block, not a provider replay mismatch, response
replay adapter, or mocked HTTP response.
Run the escape-scope local scenario
Run:
python scripts/run_scenario.py run escape-scope-local --keep-workspace
ReplayLab captures and replays two deterministic model-tool branches. One branch has linked
native/FFI source evidence through ctypes; the other has linked process-escape source evidence
through multiprocessing, ProcessPoolExecutor, os.fork, os.exec*, and pty.spawn calls. The
escape code is not executed; the scenario proves the AST-only unsupported-effect scanner can still
find linked scope evidence without importing user code.
Both branches keep safe workflow readiness not_ready, refuse safe workflow generation, and still
generate provider replay guards. The local app report detail shows the blocking evidence in
Unsupported effect scope. This is a scope guard, not native-call enforcement, process
virtualization, mocking, or sandboxing.
Run the sandbox escape local scenario
Run:
python scripts/run_scenario.py run sandbox-escape-local --keep-workspace
ReplayLab captures a deterministic provider-replay artifact and then attempts a sandboxed regression replay. If Docker and the configured image can start, the replay runs inside a local container with deny-all network and copied-workspace filesystem isolation. If Docker or the image is unavailable, ReplayLab writes a failed report with typed sandbox evidence instead of pretending the workflow is eligible.
The scenario verifies the local app report detail exposes Sandbox containment, safe workflow
readiness remains blocked unless containment completed, and provider replay guard generation remains
available. This is local Docker containment evidence, not Daytona, managed hosted execution,
VM/microVM isolation, or a replacement for HTTP/local/database/network/queue effect controls.
Run the sandbox runtime image scenario
Run:
python scripts/run_scenario.py run sandbox-runtime-image-local --keep-workspace
ReplayLab builds the default local image replaylab-sandbox-runtime:py3.13, runs
replaylab sandbox doctor, captures a deterministic provider workflow, generates a provider replay
guard, and runs sandboxed replay with the prepared image. The scenario validates completed
sandbox evidence with deny-all network, non-root runtime user, read-only root filesystem, split
mounts, and copied-workspace containment.
This scenario may pull the Docker base image and install Python dependencies while building the
image. The replay child itself still runs with --network none and --pull never.
Run the sandbox hardening scenario
Run:
python scripts/run_scenario.py run sandbox-hardening-local --keep-workspace
ReplayLab builds a recipe-backed image replaylab-sandbox-hardening:py3.13 with a local path
dependency declared through a sandbox recipe, runs replaylab sandbox doctor with the same hardened
runtime flags, captures a deterministic provider workflow, generates a provider replay guard, and
runs sandboxed replay with the prepared image. The scenario validates that the replay report records
recipe hash evidence plus non-root, read-only-root, split-mount sandbox containment.
Run the sandbox adversarial scenario
Run:
python scripts/run_scenario.py run sandbox-adversarial-local --keep-workspace
ReplayLab builds and doctors the local sandbox image, captures a deterministic provider workflow,
then runs hardened sandbox replay with bounded escape probes. The app checks that host-only
environment markers are not inherited, the Docker socket is not visible, app/root filesystem writes
fail, /tmp remains writable, direct socket I/O cannot leave the deny-network sandbox, and a small
subprocess stays inside the container. The scenario also verifies absolute host-path command
arguments and external symlinks are refused before launch, and that linked process-escape source
evidence keeps safe workflow generation unavailable. This is developer validation evidence, not a
VM or managed sandbox guarantee.
Run the failure-story scenario
Run:
python scripts/run_scenario.py run failure-story-local --keep-workspace
ReplayLab runs the deterministic no-key dogfood workflow, rewrites the app into an intentional
mismatch, and then verifies the same failure story appears in four places: report inspect, the
React viewer, the static HTML fallback, and the local app report-detail API. The scenario exists to
catch table-first regressions where a failed replay is technically recorded but a human still cannot
tell what diverged or what to do next.
Expected ending:
ReplayLab scenario passed.
Scenario: failure-story-local
Tier: deterministic
Boundaries: 2
Providers: openai, requests
The scenario checks for the first mismatched boundary, expected-vs-actual provider call evidence, request hashes, a first recommended action, and secret safety.
Run the paid external PydanticAI realism scenario
Put a real key in .env.local:
OPENAI_API_KEY=...
Then run:
python scripts/run_scenario.py run pydantic-ai-external-paid --paid --keep-workspace
ReplayLab copies the external pydantic-ai-example project into an isolated workspace, runs one
real paid capture flow, enforces a provider-call cap (default 10), retries once on transient
provider/network failures, replays with an invalid key, compares, generates pytest, runs pytest, and
verifies app-owned loop discoverability.
The scenario never prints OPENAI_API_KEY.
Run the paid OpenAI scenario
Put a real key in .env.local:
OPENAI_API_KEY=...
Then run:
python scripts/run_scenario.py run openai-real-pypi --paid --keep-workspace
ReplayLab installs from PyPI, selects a low-cost Responses-capable model unless overridden, makes one real capture call, replays with an invalid API key, compares the report, generates pytest, and runs the generated test without a valid key.
The scenario never prints OPENAI_API_KEY.
Development rule
When a task changes a user-facing workflow, update or add a scenario in the same task. If no scenario applies, the final implementation summary must say why.