
Testing

Philosophy

Test the deterministic orchestration, not the agent's reasoning.

The agent is non-deterministic: it might triage the same issue differently on two runs. Testing that would be flaky and expensive. What is deterministic: the routing logic that decides which stage runs next, the tool permissions that control what the agent can do, the git operations that commit and push code, and the config validation that catches bad YAML.

That's what the tests cover. If the routing function says "go to implementation" when the triage result is action_direct, that's provably correct. If the quality gate loops three times then routes to craft_draft_pr, that's provably correct. The agent's judgment is out of scope.
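To illustrate why this is provable rather than probabilistic, a routing function of this shape is a pure mapping from triage outcome to stage name. The function name and stage labels below are an illustrative sketch, not the project's actual API:

```python
# Hypothetical sketch of a triage router: a pure, deterministic
# mapping from triage outcome to the next pipeline stage.
ROUTES = {
    "action_direct": "implementation",
    "decline": "record_decision",
}

def route_after_triage(triage_result: str) -> str:
    try:
        return ROUTES[triage_result]
    except KeyError:
        # Unknown outcomes fail loudly instead of guessing a stage.
        raise ValueError(f"unknown triage result: {triage_result!r}")
```

Because the routing is a plain dict lookup, each branch is a one-line assertion with no agent call involved.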

Running tests

uv run pytest

This runs the full suite with coverage. The minimum coverage threshold is enforced by pytest-cov (see pyproject.toml for the current value). Coverage reports show missing lines so you know where gaps are.

Run a specific test file:

uv run pytest tests/test_resolve.py

Run a specific test:

uv run pytest tests/test_resolve.py::TestTriageRouting::test_action_direct_routes_to_implementation

What's tested

Pipeline routing -- every decision branch in every pipeline. If triage returns decline, does the pipeline go to record_decision? If quality tools fail three times, does it fall through to craft_draft_pr?
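The quality-gate branch can be expressed the same way. The retry limit and stage names here are assumptions taken from the description above, not the real implementation:

```python
MAX_QUALITY_ATTEMPTS = 3  # assumed retry limit, per the description above

def route_after_quality(passed: bool, attempt: int) -> str:
    """Hypothetical quality-gate router: retry until the limit,
    then fall through to a draft PR for human review."""
    if passed:
        return "finalize_pr"
    if attempt >= MAX_QUALITY_ATTEMPTS:
        return "craft_draft_pr"
    return "quality_fix"
```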

Stage execution -- run_stage() renders the right template, passes the right tools, handles session resume/fork correctly, merges output into context.
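A heavily simplified sketch of what such a stage runner might look like, with the renderer and agent injected so tests can substitute fakes (all names here are hypothetical, and the real run_stage is surely richer):

```python
def run_stage(name, template, context, agent, allowed_tools):
    """Hypothetical run_stage sketch: render the prompt from context,
    call the agent with only the permitted tools, and merge the
    stage's output back into a copy of the context."""
    prompt = template.format(**context)
    output = agent.run(prompt, tools=allowed_tools)
    new_context = dict(context)  # never mutate the caller's context
    new_context[name] = output
    return new_context
```

Injecting the agent makes every claim in the paragraph above assertable: the rendered prompt, the tool list passed through, and the merged context.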

Session resolution -- missing sessions fall back to fresh sessions with a warning. Resume passes the correct session ID. Fork sets the fork flag.
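A minimal sketch of that fallback behavior, assuming a resolver of roughly this shape (the dict keys and function name are illustrative):

```python
import logging

logger = logging.getLogger("sessions")

def resolve_session(session_id, known_sessions, fork=False):
    """Hypothetical resolver: resume known sessions, warn and start
    fresh when the session is missing, and carry the fork flag through."""
    if session_id in known_sessions:
        return {"mode": "resume", "session_id": session_id, "fork": fork}
    if session_id is not None:
        logger.warning("session %s not found; starting fresh", session_id)
    return {"mode": "fresh", "session_id": None, "fork": False}
```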

Config validation -- malformed YAML gets specific error messages naming the exact field that's wrong.
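The pattern under test looks roughly like this: collect messages that name the exact field at fault instead of raising a generic "invalid config" error. The field names are hypothetical:

```python
def validate_config(cfg: dict) -> list:
    """Hypothetical validator sketch: every error message names the
    offending field, so tests can assert on the exact wording."""
    errors = []
    if "pipeline" not in cfg:
        errors.append("missing required field: pipeline")
    stages = cfg.get("stages")
    if not isinstance(stages, list):
        errors.append(f"field 'stages' must be a list, got {type(stages).__name__}")
    return errors
```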

Git operations -- worktree creation, diff capture, commit, push. All tested against real git repos in temp directories.
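"Real git repos in temp directories" means a fixture along these lines (a sketch assuming `git` is on PATH; a real test would use pytest's tmp_path fixture instead of mkdtemp):

```python
import pathlib
import subprocess
import tempfile

def git(repo, *args):
    """Run a git command inside `repo` and return stripped stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

def make_repo_with_commit():
    """Create a throwaway real repository with one commit."""
    repo = pathlib.Path(tempfile.mkdtemp())
    git(repo, "init", "-q")
    git(repo, "config", "user.email", "test@example.com")
    git(repo, "config", "user.name", "Test")
    (repo / "file.txt").write_text("hello\n")
    git(repo, "add", "file.txt")
    git(repo, "commit", "-q", "-m", "initial")
    return repo
```

Exercising real git catches quoting and working-tree bugs that a mocked subprocess call would hide.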

GitHub API helpers -- auth token generation, PR creation, comment posting. Mocked at the HTTP level.
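"Mocked at the HTTP level" means the helper is real but the transport is faked, so no request leaves the machine. This sketch uses stdlib urllib and unittest.mock; the helper's name and signature are assumptions, not the project's API:

```python
import io
import json
import urllib.request
from unittest import mock

def post_comment(repo, issue, body, token):
    """Hypothetical GitHub helper: POST an issue comment."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues/{issue}/comments",
        data=json.dumps({"body": body}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def test_post_comment_is_mocked():
    """Patch the HTTP call and return a canned JSON response."""
    fake = mock.MagicMock()
    fake.__enter__.return_value = io.BytesIO(b'{"id": 1}')
    with mock.patch("urllib.request.urlopen", return_value=fake):
        assert post_comment("octo/repo", 7, "hi", "tok") == {"id": 1}
```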

Template rendering -- templates render without errors given valid context dicts.
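The property being asserted is "valid context in, no exception out." As a stdlib stand-in for whatever engine the project actually uses (the real renderer may be Jinja; string.Template keeps this sketch dependency-free):

```python
from string import Template

def render(template_text, context):
    """Stand-in renderer. `substitute` (not `safe_substitute`) is
    used deliberately so a missing context key raises, which is
    exactly the failure mode these tests exist to catch."""
    return Template(template_text).substitute(context)
```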

CLI parsing -- URL formats parse correctly. Missing args produce usage errors.
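A sketch of the URL-parsing half, assuming canonical GitHub issue URLs are the accepted format (the regex and error wording are illustrative):

```python
import re

ISSUE_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)/issues/(?P<number>\d+)$"
)

def parse_issue_url(url):
    """Hypothetical parser: accept canonical issue URLs, reject
    everything else with a usage-style error."""
    m = ISSUE_URL.match(url)
    if m is None:
        raise SystemExit(f"usage: expected a GitHub issue URL, got {url!r}")
    return m["owner"], m["repo"], int(m["number"])
```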

What's not tested

Agent output quality. The tests mock the agent provider and return canned responses. They verify that the pipeline does the right thing given a particular response, not that the agent produces good responses.
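A canned provider is usually a few lines. This is a generic sketch, not the project's actual test double:

```python
class FakeAgent:
    """Canned agent provider: returns scripted responses in order
    and records every prompt it was given, so tests can assert both
    what the pipeline asked and what it did with the answer."""

    def __init__(self, responses):
        self._responses = iter(responses)
        self.prompts = []

    def run(self, prompt):
        self.prompts.append(prompt)
        return next(self._responses)
```

Scripting the responses turns a non-deterministic dependency into a fixed input, so the pipeline's reaction to each response is fully repeatable.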

This is intentional. The pipeline is the safety net: if it works correctly, the worst an agent can do is produce bad code that a human reviews in a PR. It can't accidentally push to main, and it can't create issues whose output bypasses structured validation.